
# Customer Geographies

In this project, I compare spending volumes for London-based customers with those of customers based in the rest of the United Kingdom.

I will answer two questions:

1. Which UK cities are currently underserved?
2. Are customers primarily London-based?

## Minimum Viable Answer

1. Which UK cities are underserved? To answer this, I need to calculate total customer spend by city and find the cities with the lowest spend.

2. How does London compare to the rest of the UK? This can be answered from the per-city totals calculated for the first question.
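
As a rough sketch of the first calculation, assuming a table that already has one row per customer with a `city` label and a `total_spend` value, the per-city totals could be computed with a group-by. The `toy` DataFrame and its values are made up for illustration; the real `city` column is only derived later in this notebook.

```python
import pandas as pd

# Toy data for illustration only; not taken from addresses.csv.
toy = pd.DataFrame(
    {
        "city": ["LONDON", "LONDON", "BRISTOL", "LEEDS", "LEEDS", "BATH"],
        "total_spend": [5700, 4700, 5900, 3000, 2500, 1200],
    }
)

# Total spend per city, sorted ascending so the lowest-spend cities come first.
spend_by_city = toy.groupby("city")["total_spend"].sum().sort_values()

# The three cities with the lowest total spend are candidates for "underserved".
lowest = spend_by_city.head(3)
print(lowest)
```

Sorting ascending keeps the cities of interest at the top of the output, which is usually easier to scan than the full list.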

In [None]:
import pandas as pd
customers = pd.read_csv("/content/drive/MyDrive/Data Analysis Projects/Project 1/addresses.csv")
print(customers.shape)

(100000, 3)


In [None]:
customers.isnull().sum()

company_id       0
address        968
total_spend      0
dtype: int64

In [None]:
customers.dropna(subset=["address"], inplace=True)

In [None]:
customers["total_spend"].describe()

count    99032.000000
mean      4951.673197
std       1500.642398
min          0.000000
25%       3900.000000
50%       5000.000000
75%       6000.000000
max      11700.000000
Name: total_spend, dtype: float64

## Extract City Column from Addresses

In [None]:
for address in customers["address"].head():
  print(address, "\n")

APARTMENT 2,
52 BEDFORD ROAD,
LONDON,
ENGLAND,
SW4 7HJ 

107 SHERINGHAM AVENUE,
LONDON,
N14 4UJ 

43 SUNNINGDALE,
YATE,
BRISTOL,
ENGLAND,
BS37 4HZ 

HAWESWATER HOUSE,
LINGLEY MERE BUSINESS PARK,
LINGLEY GREEN AVENUE,
GREAT SANKEY, WARRINGTON,
WA5 3LP 

AMBERFIELD BARN HOUSE AMBER LANE,
CHART SUTTON,
MAIDSTONE,
ENGLAND,
ME17 3SF 



In [None]:
customers["address_clean"] = customers["address"].str.upper()

In [None]:
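# Loose match: any mention of LONDON (not displayed; a cell shows only its last expression)
len(customers[customers["address_clean"].str.contains("LONDON")])
# Stricter match: LONDON followed by a comma (this is the count shown below)
len(customers[customers["address_clean"].str.contains("LONDON,")])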

20831

In [None]:
customers["address_lines"] = (
    customers["address_clean"]
    .str.split(",\n")
    .apply(len)
)
customers["address_lines"].value_counts().sort_index()

address_lines
1        6
2       52
3     3284
4    35850
5    45931
6    13909
Name: count, dtype: int64

In [None]:
# All single-line addresses
print(customers.loc[customers["address_lines"] == 1, "address_clean"])
# A reproducible random sample of two-line addresses
print(
    customers.loc[customers["address_lines"] == 2, "address_clean"]
    .sample(5, random_state=42)
)

17789                      FALKIRK
31897                   HADDINGTON
61750          CREAG BHAITHEACHAIN
75330                     NEWMILNS
78045    REDCLOAK FARM, STONEHAVEN
90897     REFER TO PARENT REGISTRY
Name: address_clean, dtype: object
39443                                    FORFAR,\nANGUS
80846                        12 HOPE STREET,\nEDINBURGH
95979    BRANCH REGISTRATION,\nREFER TO PARENT REGISTRY
23563    BRANCH REGISTRATION,\nREFER TO PARENT REGISTRY
81155                             PO BOX 2230,\nGLASGOW
Name: address_clean, dtype: object


In [None]:
cities = pd.read_csv("/content/drive/MyDrive/Data Analysis Projects/Project 1/cities - cities.csv", header=None, names=["city"])
cities.head()

          city
0         City
1      England
2         Bath
3  Birmingham*
4    Bradford*


In [None]:
countries_to_remove = ["England", "Scotland", "Wales", "Northern Ireland"]

print(len(cities))
cities_to_remove = cities[cities["city"].isin(countries_to_remove)].index
cities = cities.drop(index=cities_to_remove)
print(len(cities))

cities["city"] = cities["city"].str.replace("*", "", regex=False)

cities["city"] = cities["city"].str.upper()
cities.head()

81
77


              city
0             CITY
2             BATH
3       BIRMINGHAM
4         BRADFORD
5  BRIGHTON & HOVE
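
The first row in the output above (`CITY`) appears to be the file's own header, kept as data because the file was read with `header=None`. A minimal sketch of how it could be filtered out together with the country sub-headings is shown below. The `cities_demo` frame is a hypothetical stand-in for the first rows of the cities file, and `not_cities` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the cities file as loaded above.
cities_demo = pd.DataFrame(
    {"city": ["City", "England", "Bath", "Birmingham*", "Bradford*"]}
)

# Labels that are not cities: the header text plus the country sub-headings.
not_cities = ["City", "England", "Scotland", "Wales", "Northern Ireland"]

# Drop the non-city rows, then strip the asterisks and upper-case the names.
cities_demo = cities_demo[~cities_demo["city"].isin(not_cities)].copy()
cities_demo["city"] = (
    cities_demo["city"].str.replace("*", "", regex=False).str.upper()
)
print(cities_demo["city"].tolist())
```

Applied to the real list, this would leave one fewer row than the count printed above, since the header row would no longer be included.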


### Create a City Column

In [None]:
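# Tag each customer whose address has a line starting with the city name followed by a comma.
for city in cities["city"].values:
    mask = customers["address_clean"].str.contains(f"\n{city},", regex=False)
    customers.loc[mask, "city"] = city

# Unmatched customers go into a catch-all bucket (assign back instead of using inplace on a column).
customers["city"] = customers["city"].fillna("OTHER")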

customers.head()

   company_id                                            address  total_spend                                      address_clean  address_lines     city
0           1  APARTMENT 2,\n52 BEDFORD ROAD,\nLONDON,\nENGLA...         5700  APARTMENT 2,\n52 BEDFORD ROAD,\nLONDON,\nENGLA...              5   LONDON
1           2           107 SHERINGHAM AVENUE,\nLONDON,\nN14 4UJ         4700           107 SHERINGHAM AVENUE,\nLONDON,\nN14 4UJ              3   LONDON
2           3  43 SUNNINGDALE,\nYATE,\nBRISTOL,\nENGLAND,\nBS...         5900  43 SUNNINGDALE,\nYATE,\nBRISTOL,\nENGLAND,\nBS...              5  BRISTOL
3           4  HAWESWATER HOUSE,\nLINGLEY MERE BUSINESS PARK,...         7200  HAWESWATER HOUSE,\nLINGLEY MERE BUSINESS PARK,...              5    OTHER
4           5  AMBERFIELD BARN HOUSE AMBER LANE,\nCHART SUTTO...         4600  AMBERFIELD BARN HOUSE AMBER LANE,\nCHART SUTTO...              5    OTHER
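
The matching rule above looks for a newline, the city name, and a trailing comma. The sketch below illustrates what that rule catches and what it misses, and shows one possible variant that also accepts a city on the last line of an address. The first two addresses come from the data shown earlier; the third is invented for illustration, and `match_city` is a hypothetical helper, not part of the original analysis.

```python
import re

import pandas as pd

addresses = pd.Series(
    [
        "107 SHERINGHAM AVENUE,\nLONDON,\nN14 4UJ",  # city on a middle line
        "12 HOPE STREET,\nEDINBURGH",                # city on the last line, no comma
        "1 LONDON ROAD,\nOXFORD,\nOX1 1AA",          # invented: street named after a city
    ]
)

# The rule used above: newline, city name, trailing comma (literal match).
original_rule = addresses.str.contains("\nEDINBURGH,", regex=False)


def match_city(addresses: pd.Series, city: str) -> pd.Series:
    """True where a line after the first starts with `city` and is then
    followed by a comma or the end of the address."""
    pattern = rf"\n{re.escape(city)}(?:,|$)"
    return addresses.str.contains(pattern, regex=True)


print(original_rule.tolist())
print(match_city(addresses, "EDINBURGH").tolist())
print(match_city(addresses, "LONDON").tolist())
```

Anchoring on the leading newline is what keeps a street such as "LONDON ROAD" from being counted as London. The end-of-string alternative is an assumption about how last-line cities might be handled, and it would change the counts reported below if applied.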


Now that I have a city column, I need to explore it to see which cities the customers are in and what proportion of customers I could not allocate to a city based on their address.
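
As a small sketch of the proportion check, assuming the unallocated customers carry the label `OTHER`, the share per city can be read from normalized value counts. The `city` Series here is a toy stand-in for `customers["city"]` with made-up values.

```python
import pandas as pd

# Toy stand-in for customers["city"]; values are illustrative only.
city = pd.Series(
    ["OTHER", "LONDON", "OTHER", "BRISTOL", "LONDON", "OTHER", "OTHER", "LEEDS"]
)

# Share of customers per city instead of raw counts.
city_share = city.value_counts(normalize=True)

# Share of customers that could not be allocated to a city.
unallocated_share = (city == "OTHER").mean()

print(city_share)
print(f"Unallocated: {unallocated_share:.1%}")
```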

In [None]:
customers["city"].value_counts().head(20)

city
OTHER            54457
LONDON           20762
MANCHESTER        1902
BIRMINGHAM        1866
GLASGOW           1273
BRISTOL           1150
LEEDS             1040
EDINBURGH         1038
LEICESTER          905
NOTTINGHAM         838
LIVERPOOL          838
CARDIFF            797
SHEFFIELD          706
COVENTRY           553
MILTON KEYNES      493
SOUTHAMPTON        477
NORWICH            449
BRADFORD           417
BELFAST            416
PRESTON            406
Name: count, dtype: int64

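About 55% of customers (54,457 of 99,032) fall into the “OTHER” category, which suggests that more than half of the customer base is located outside the listed cities. Some of these may be city addresses that the matching rule missed, such as a city on the last line with no trailing comma (for example, "12 HOPE STREET,\nEDINBURGH"). This is an important insight to communicate to the project stakeholders.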
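
With a city label per customer, the London question could be approached along the following lines. This is only a sketch on made-up data: the `toy` frame, the `region` column, and the `summary` table are names introduced here for illustration, and customers labelled `OTHER` are counted as rest of the UK, which is an assumption.

```python
import pandas as pd

# Toy data for illustration only; not taken from addresses.csv.
toy = pd.DataFrame(
    {
        "city": ["LONDON", "OTHER", "LONDON", "BRISTOL", "OTHER", "LEEDS"],
        "total_spend": [5700, 7200, 4700, 5900, 4600, 3000],
    }
)

# Keep the LONDON label and collapse every other value into one bucket.
toy["region"] = toy["city"].where(toy["city"] == "LONDON", "REST OF UK")

# Customer count, total spend, and mean spend per region.
summary = toy.groupby("region")["total_spend"].agg(
    customers="count", total_spend="sum", mean_spend="mean"
)

# Each region's share of overall spend.
summary["spend_share"] = summary["total_spend"] / summary["total_spend"].sum()
print(summary)
```

Reporting both the customer count and the spend share makes it possible to tell apart "most customers are in London" from "most spend comes from London", which are different claims.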