Using the attached dataset, please create a notebook (preferably Python but any tool is allowed) to explore the data in order to answer questions like:
+ **How would you make a geo expansion recommendation?**
+ **What new columns would you create?**
+ **How might clustering analysis help - how would you go about it?**
+ **However, use these questions just as a starting point, and feel free to also use your own creativity/inspiration**

In the interview process, we'll ask you to take us through your notebook, thought process and the libraries that you've used. A presentation/deck of any kind is not necessary.

In [None]:
import pandas as pd 
import seaborn as sns
import contextily as ctx

import matplotlib.pyplot as plt
import geopandas as gpd
import folium

In [None]:
#loading provided dataset
df = pd.read_excel('./US Census Dataset.xlsx')
df.head()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
#Checking geometry + viz
#creating geodataframe

Creating a GeoDataFrame the conventional way throws an error, which suggests that there are issues with the WKT in the `geom` column:

```python
gdf = gpd.GeoDataFrame(df, geometry=df['geom'])
```
Inspecting the data in QGIS suggests that some coordinate pairs contain extra numbers (or are missing one), which distorts the geometry:
<img src="./images/qgis_wkt_issue.png" alt="qgis_wkt_issue" width="600"/>
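
One way to test the "extra or missing number" hypothesis without QGIS is to count the numbers in every coordinate tuple of a WKT string. The sketch below is a rough check, not a WKT parser: the helper name `find_bad_coordinates` is hypothetical, and it assumes plain 2D geometries where each tuple should hold exactly two numbers.

```python
import re

# One signed decimal number, optionally in scientific notation.
NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"


def find_bad_coordinates(wkt_text):
    """Return the coordinate tuples that do not contain exactly two numbers."""
    # Drop geometry keywords such as POLYGON or MULTIPOLYGON.
    body = re.sub(r"\b[A-Za-z]+\b", " ", wkt_text)
    # Treat parentheses as separators so only comma-separated tuples remain.
    body = body.replace("(", ",").replace(")", ",")
    bad = []
    for chunk in body.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        if len(re.findall(NUMBER, chunk)) != 2:
            bad.append(chunk)
    return bad


good_wkt = "POLYGON ((0 0, 1 0, 1 1, 0 0))"
broken_wkt = "POLYGON ((0 0, 1 0 7, 1, 0 0))"
print(find_bad_coordinates(good_wkt))
print(find_bad_coordinates(broken_wkt))
```

Applied to the dataset, something like `df['geom'].map(find_bad_coordinates)` would list the suspicious tuples per row, which could then be compared against the rows that fail to parse.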

In [None]:
# Fixing geometry using shapely
from shapely import wkt

geom = []

for g in df['geom']:
    try:
        geom.append(wkt.loads(g))
    except Exception:
        # Malformed WKT: store None so the row can be dropped later
        geom.append(None)

df['geometry'] = geom

In [None]:
df.head()

In [None]:
# Checking for empty geometries
df["geometry"].isna().sum()

In [None]:
df.dropna(subset=['geometry'], inplace=True)

In [None]:
gdf = gpd.GeoDataFrame(df, geometry=df['geometry'])

In [None]:
#quick check that data looks OK
gdf.plot()

In [None]:
#CRS check
print(gdf.crs)

In [None]:
# This is US Census data covering the US, so either NAD83 or WGS84 could be used (the difference has no effect at this scale).
# Setting the CRS to WGS84 to avoid possible CRS conversions later if/when adding additional datasets

gdf.set_crs(epsg=4326, inplace=True)
gdf.head()

In [None]:
# Better visualisation with some background mapping

fig, ax = plt.subplots(figsize=(12, 8))

# Plot the data
gdf.plot(color='#ffcc00', ax=ax)

# Add basemap
ctx.add_basemap(ax, crs="EPSG:4326", source=ctx.providers.CartoDB.Voyager)

In [None]:
# Even better: interactive mapping

# OpenStreetMap tiles, centred on Austin, Texas
# (named m rather than map to avoid shadowing the built-in)
m = folium.Map(location=[30.266666, -97.733330], tiles="OpenStreetMap", zoom_start=10)

for _, r in gdf.iterrows():
    # Without simplifying each geometry first,
    # the map might not be displayed
    sim_geo = gpd.GeoSeries(r['geometry']).simplify(tolerance=0.001)
    geo_j = sim_geo.to_json()
    geo_j = folium.GeoJson(data=geo_j, style_function=lambda x: {'color': '#ffcc00'})
    geo_j.add_to(m)
m

In [None]:

# This chart (https://www.statista.com/statistics/310344/us-online-dating-app-site-usage-age/)
# suggests that the main users of dating apps are aged 18-44


### TO-DO 

Calculate the median age from the chart, using respondents who either have used or are currently using a dating app.

It could then be compared with the median age in the data.
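
A possible starting point for this TO-DO is the standard grouped-data median, which interpolates linearly inside the age bin that contains the 50th percentile. This is only a sketch: `binned_median` is a hypothetical helper, and the bins and weights below are made-up placeholders, not the Statista figures, which still need to be read off the chart.

```python
def binned_median(bins):
    """Estimate the median of grouped data by linear interpolation.

    bins: list of (lower_edge, upper_edge, weight) tuples sorted by lower_edge.
    """
    total = sum(weight for _, _, weight in bins)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    half = total / 2
    cumulative = 0.0
    for lower, upper, weight in bins:
        if weight > 0 and cumulative + weight >= half:
            # Interpolate within the bin that contains the midpoint.
            return lower + (half - cumulative) / weight * (upper - lower)
        cumulative += weight
    raise ValueError("no bin with positive weight found")


# Placeholder values only: (lower age, upper age, number or share of users).
placeholder_bins = [(18, 30, 40), (30, 50, 40), (50, 70, 20)]
print(binned_median(placeholder_bins))
```

The weights can be counts or percentages, since only their relative size matters. The result assumes ages are spread evenly within each bin, which is a simplification.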

In [None]:
# Simplistic way:
# extract only the target age group (18-44)
# and check where those people live
# Note: the variable name says 18_29, but the filter covers ages 18-44

gdf_18_29 = gdf.loc[(gdf['median_age'] >= 18) & (gdf['median_age'] < 45)]

In [None]:
# Subselecting above-average median income to keep those who could afford premium features
# .copy() so that columns can be added later without a SettingWithCopyWarning

gdf_18_29_high_inc = gdf_18_29.loc[gdf_18_29['median_income'] >= gdf_18_29['median_income'].mean()].copy()

In [None]:
print(len(gdf_18_29))
print(len(gdf_18_29_high_inc))
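
The age and income filters could also be wrapped in one helper so that the thresholds are easy to vary. The sketch below uses a small made-up table rather than the census data; `select_target_tracts` and the toy values are hypothetical, and only the column names `median_age` and `median_income` come from the dataset.

```python
import pandas as pd


def select_target_tracts(frame, min_age=18, max_age=45):
    """Keep rows in the age band whose median income is at or above the band's mean."""
    in_age_band = (frame["median_age"] >= min_age) & (frame["median_age"] < max_age)
    by_age = frame.loc[in_age_band]
    # The income threshold is the mean income of the age-filtered rows only.
    above_average = by_age["median_income"] >= by_age["median_income"].mean()
    return by_age.loc[above_average].copy()


# Toy values for illustration only.
toy = pd.DataFrame({
    "median_age": [17, 25, 30, 44, 45, 60],
    "median_income": [20000, 40000, 60000, 80000, 90000, 100000],
})
selected = select_target_tracts(toy)
print(selected)
```

Returning a copy keeps later column assignments from raising a `SettingWithCopyWarning`.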

In [None]:
gdf_18_29_high_inc['dominant_ethnic_group'] = gdf_18_29_high_inc[['black_pop', 'hispanic_pop', 'white_pop']].idxmax(axis=1)

In [None]:
# Minimising the dataset by dissolving boundaries that share the same dominant group
dominant_group = gdf_18_29_high_inc.dissolve(by='dominant_ethnic_group')

In [None]:
# Create the plot
fig, ax = plt.subplots(figsize=(10, 6))

# Plot the data, with a legend for the categorical column
dominant_group.reset_index().plot(column='dominant_ethnic_group', ax=ax, legend=True)

### TO-DO 

Calculate the percentage of each ethnic group

Map the highest percentage

Map the data
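
A minimal sketch of the percentage step, shown on made-up rows rather than the census data. It assumes the denominator is the sum of the three listed groups (the dataset may have a total population column that would be a better choice), and `add_group_shares` is a hypothetical helper name.

```python
import pandas as pd

GROUP_COLUMNS = ["black_pop", "hispanic_pop", "white_pop"]


def add_group_shares(frame, group_columns=GROUP_COLUMNS):
    """Add one percentage column per group plus the group with the highest share."""
    # Assumption: shares are relative to the sum of the listed groups only.
    totals = frame[group_columns].sum(axis=1)
    # Rows with no recorded population cannot have a share, so drop them.
    out = frame.loc[totals > 0].copy()
    shares = out[group_columns].mul(100).div(totals.loc[out.index], axis=0)
    shares.columns = [f"{name}_pct" for name in group_columns]
    out = out.join(shares)
    out["highest_share_group"] = shares.idxmax(axis=1)
    return out


# Toy values for illustration only.
toy_groups = pd.DataFrame({
    "black_pop": [10, 50, 0],
    "hispanic_pop": [30, 25, 0],
    "white_pop": [60, 25, 0],
})
with_shares = add_group_shares(toy_groups)
print(with_shares)
```

On a GeoDataFrame, the resulting `highest_share_group` column could be passed to `plot(column=..., legend=True)` to map the result.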