In [None]:
# Reporting version of Capstone project

In [None]:
# Code to import packages - (learn how to hide)
import descartes

import folium # map rendering library

import geopandas as gpd
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import json

import matplotlib.pyplot as plt


import numpy as np
from numpy.polynomial.polynomial import polyfit

import pandas as pd
from pandas.plotting import scatter_matrix

import requests # library to handle requests

from scipy.stats import chi2_contingency

import seaborn as sns

from shapely import wkt
from shapely.geometry import MultiPolygon, Polygon

from sklearn.cluster import KMeans # KMeans clustering 
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# Introduction

With the US national debt at an all-time high, managing funds more efficiently at the local level could also help maximize the impact of spending at the federal level (and nationwide). This study explores a dataset of cities within a specific area, described by various socioeconomic factors, by grouping the cities and creating benchmarks for each group. Using unsupervised machine learning models to cluster the cities should give us enough information to make informed decisions about how to apply supervised machine learning techniques to create potential benchmarks and (logistic) models that cities can use to measure the effectiveness of current resources and predict future success. Once the data has been grouped accordingly, we will grab location data from popular areas in each city to see whether there is an indirect relationship that we can identify (for future studies).

# Data

Given that Los Angeles County ranks #1 in the US for county population (~10 million, larger than 41 US states), the goal of this study is to use statistical analysis and machine learning techniques on this dataset to classify cities within LA County into clusters that help identify population averages, benchmarks and indicators of success for each group, based on a variety of socioeconomic factors (e.g., income, school enrollment, life expectancy, etc.). This will be helpful for city planning and future research purposes by building on the initial research (www.measureofamerica.org/los-angeles-county/). This framework will also be useful for classifying other Los Angeles datasets.

Leveraging data made available by the County of Los Angeles at (www.data.lacounty.gov/), we will be using 'A Portrait of Los Angeles County using the Human Development Index: GIS Data' at (www.data.lacounty.gov/Community/A-Portrait-of-Los-Angeles-County-using-the-Human-D/j7aj-mn8v). HD Index explanation: (https://ssrc-static.s3.amazonaws.com/moa/PoLA%20Methodological%20Note.pdf)

Once the cities have been grouped into clusters, we will grab popular locations for each city and group the location data by cluster for further analysis.

Original data:

In [None]:
LA_HPI_CSV='A_Portrait_of_Los_Angeles_County_using_the_Human_Development_Index__GIS_Data.csv' # csv filename
LA_HPI=pd.read_csv(LA_HPI_CSV) # Read in csv data into a pandas dataframe
LA_HPI.head() # Dataframe preview

Cleaning up the data by reformatting columns, dropping irrelevant columns, and converting coordinates into polygon objects for mapping:

In [None]:
LA_HPI.drop(columns=['GEO_TYPE','GEO_ID'],inplace=True) # Drop irrelevant columns
LA_HPI_columns=['Polygon','City','Human Development Index', 'Life Expectancy', 'No HS Diplomas', 'Bachelors Degrees', 'Graduate Degrees',
       'School Enrollment', 'Earnings', 'Health Index', 'Education Index', 'Income Index'] # Reformat column names
LA_HPI.columns=LA_HPI_columns # Replace column names
LA_HPI["Polygon"]=LA_HPI["Polygon"].apply(wkt.loads) # Create polygon object for graphing
LA_HPI.head() # Dataframe preview

The dataset contains 140 rows and 12 columns: one row for each city, with columns for factors that pertain to health, education, and living standards, along with name and geographic information.

After downloading and formatting the dataset into a pandas dataframe (to make it easy to manipulate, plot, map and analyze the data), we now create another dataframe that we can use for calculations by transforming our cleaned up 140x12 dataset into a 140x10 dataset by setting 'City' as the index and removing the 'Polygon' column.

In [None]:
LA_HPI_Table=LA_HPI # Create table dataframe
LA_HPI_Table=LA_HPI_Table.drop(columns='Polygon') # Drop Polygon column from table dataframe
LA_HPI_Table.set_index('City',inplace=True) # Set city names as index
LA_HPI_Table.head() # Dataframe preview

#### Now we take a look at how these different areas differ from city to city using maps:

In [None]:
# Get coordinates (latitude, longitude) for Los Angeles County
address='Los Angeles County, US'
geolocator = Nominatim(user_agent="CA_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# Converting 'Polygon' column from dataframe into geodataframe for plotting
LA_HPI_gdf=gpd.GeoDataFrame(LA_HPI,geometry='Polygon')
LA_HPI_gdf_json=LA_HPI_gdf.to_json() # Convert from geodataframe to json for choropleth map

# Create map of Los Angeles County using latitude and longitude values
map_LA_County = folium.Map(location=[latitude, longitude], zoom_start=9)

# Map features
LA_HPI_gdf_Points=folium.features.Choropleth(LA_HPI_gdf_json)
map_LA_County.add_child(LA_HPI_gdf_Points)

Map of Cities within LA County (above)

## Map of cities by category density

In [None]:
# For plotting features on map
style_function = lambda x: {'fillColor': '#ffffff', 
                            'color':'#000000', 
                            'fillOpacity': 0.1, 
                            'weight': 0.1}
highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}

In [None]:
Enrollment_Geo=['City','School Enrollment']

# Initialize the map:
map_LA_County = folium.Map([latitude, longitude], zoom_start=9)

choropleth=folium.Choropleth(
    geo_data=LA_HPI_gdf_json,
    name='choropleth',
    data=LA_HPI[Enrollment_Geo],
    columns=Enrollment_Geo,
    key_on='feature.properties.City',
    bins=9,
    fill_color='PuBu',
    fill_opacity=0.7,
    line_opacity=1.2,
    legend_name='School Enrollment (%)',
    highlight=True
).add_to(map_LA_County)
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['City'],labels=False)
)

choropleth=folium.features.GeoJson(
    LA_HPI_gdf_json,
    style_function=style_function, 
    control=False,
    highlight_function=highlight_function, 
    tooltip=folium.features.GeoJsonTooltip(
        fields=Enrollment_Geo,
        aliases=['City: ','School Enrollment in population %: '],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;") 
    )
)
map_LA_County.add_child(choropleth)

map_LA_County

Map of Cities within LA County by School Enrollment (above)

In [None]:
Graduate_Geo=['City','Graduate Degrees']

# Initialize the map:
map_LA_County = folium.Map([latitude, longitude], zoom_start=9)

choropleth=folium.Choropleth(
    geo_data=LA_HPI_gdf_json,
    name='choropleth',
    data=LA_HPI[Graduate_Geo],
    columns=Graduate_Geo,
    key_on='feature.properties.City',
    bins=9,
    fill_color='PuBu',
    fill_opacity=0.7,
    line_opacity=1.2,
    legend_name='Graduate Degrees (%)',
    highlight=True
).add_to(map_LA_County)
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['City'],labels=False)
)

choropleth=folium.features.GeoJson(
    LA_HPI_gdf_json,
    style_function=style_function, 
    control=False,
    highlight_function=highlight_function, 
    tooltip=folium.features.GeoJsonTooltip(
        fields=Graduate_Geo,
        aliases=['City: ','Graduate degrees in population %: '],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;") 
    )
)
map_LA_County.add_child(choropleth)

map_LA_County

Map of Cities within LA County by Graduate Degrees (above)

In [None]:
Earnings_Geo=['City','Earnings']

# Initialize the map:
map_LA_County = folium.Map([latitude, longitude], zoom_start=9)

choropleth=folium.Choropleth(
    geo_data=LA_HPI_gdf_json,
    name='choropleth',
    data=LA_HPI[Earnings_Geo],
    columns=Earnings_Geo,
    key_on='feature.properties.City',
    bins=9,
    fill_color='PuBu',
    fill_opacity=0.7,
    line_opacity=1.2,
    legend_name='Earnings ($)',
    highlight=True
).add_to(map_LA_County)
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['City'],labels=False)
)

choropleth=folium.features.GeoJson(
    LA_HPI_gdf_json,
    style_function=style_function, 
    control=False,
    highlight_function=highlight_function, 
    tooltip=folium.features.GeoJsonTooltip(
        fields=Earnings_Geo,
        aliases=['City: ','Earnings in population $: '],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;") 
    )
)
map_LA_County.add_child(choropleth)

map_LA_County

Map of Cities within LA County by Earnings (above)

In [None]:
No_HS_Geo=['City','No HS Diplomas']

# Initialize the map:
map_LA_County = folium.Map([latitude, longitude], zoom_start=9)

choropleth=folium.Choropleth(
    geo_data=LA_HPI_gdf_json,
    name='choropleth',
    data=LA_HPI[No_HS_Geo],
    columns=No_HS_Geo,
    key_on='feature.properties.City',
    bins=9,
    fill_color='PuBu',
    fill_opacity=0.7,
    line_opacity=1.2,
    legend_name='No HS Diplomas (%)',
    highlight=True
).add_to(map_LA_County)
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['City'],labels=False)
)

choropleth=folium.features.GeoJson(
    LA_HPI_gdf_json,
    style_function=style_function, 
    control=False,
    highlight_function=highlight_function, 
    tooltip=folium.features.GeoJsonTooltip(
        fields=No_HS_Geo,
        aliases=['City: ','No HS Diplomas in population %: '],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;") 
    )
)
map_LA_County.add_child(choropleth)

map_LA_County

Map of Cities within LA County by No HS Diplomas (above)

In [None]:
HDI_Geo=['City','Human Development Index']

# Initialize the map:
map_LA_County = folium.Map([latitude, longitude], zoom_start=9)

choropleth=folium.Choropleth(
    geo_data=LA_HPI_gdf_json,
    name='choropleth',
    data=LA_HPI[HDI_Geo],
    columns=HDI_Geo,
    key_on='feature.properties.City',
    bins=9,
    fill_color='PuBu',
    fill_opacity=0.7,
    line_opacity=1.2,
    legend_name='Human Development Index (1-10)',
    highlight=True
).add_to(map_LA_County)
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['City'],labels=False)
)

choropleth=folium.features.GeoJson(
    LA_HPI_gdf_json,
    style_function=style_function, 
    control=False,
    highlight_function=highlight_function, 
    tooltip=folium.features.GeoJsonTooltip(
        fields=HDI_Geo,
        aliases=['City: ','Human Development Index in population (1-10): '],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;") 
    )
)
map_LA_County.add_child(choropleth)

map_LA_County

Map of Cities within LA County by Human Development Index (above)

In [None]:
Bachelors_Geo=['City','Bachelors Degrees']

# Initialize the map:
map_LA_County = folium.Map([latitude, longitude], zoom_start=9)

choropleth=folium.Choropleth(
    geo_data=LA_HPI_gdf_json,
    name='choropleth',
    data=LA_HPI[Bachelors_Geo],
    columns=Bachelors_Geo,
    key_on='feature.properties.City',
    bins=9,
    fill_color='PuBu',
    fill_opacity=0.7,
    line_opacity=1.2,
    legend_name='Bachelors Degrees (%)',
    highlight=True
).add_to(map_LA_County)
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['City'],labels=False)
)

choropleth=folium.features.GeoJson(
    LA_HPI_gdf_json,
    style_function=style_function, 
    control=False,
    highlight_function=highlight_function, 
    tooltip=folium.features.GeoJsonTooltip(
        fields=Bachelors_Geo,
        aliases=['City: ','Bachelors Degrees in population (%): '],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;") 
    )
)
map_LA_County.add_child(choropleth)

map_LA_County

Map of Cities within LA County by Bachelors Degrees (above)

From looking at the different maps, we see a clear correlation between higher-performing cities and their proximity to the ocean, which is not surprising. But there also seems to be a line of high-performing cities that runs from the ocean through LA all the way up into the Angeles forest. It would be interesting to analyze the ages in these populations to see if this is predictive of general migration patterns as people progress through their careers.

After the data was cleaned up and formatted, we do a quick visual analysis to get a better understanding of the overall distribution of the different categories, using histograms:

In [None]:
LA_HPI_Table.hist(figsize=(25,25)) # Create histogram table
plt.show() # Plot histogram (remove pre-plot messages)

Looking at the histograms shows a couple of different features.
##### School Enrollment:
Compared to the other charts, this area seems to show the least disparity between cities. Since this does not carry over to bachelor's degrees and earnings, where the disparity is much greater, it could be worth investigating whether these communities are doing a poor job of educating their residents or a poor job of retaining them once they are educated and earning higher incomes.
##### Graduate Degrees vs Bachelors Degrees:
Graduate degrees seem to be a lot more concentrated than bachelor's degrees, which are distributed more widely around LA County.
##### Redundant Indexes:
The indexes do not appear to add information beyond their corresponding raw values (for example, Health Index and Life Expectancy), so we will drop them later to increase the predictive power of our clustering model.

# Methodology

Given our general understanding of how cities within LA County are performing in different areas, we now explore the strength of the relationships between the different variables by looking at their correlations, to help us determine what is important for the calculations that will classify the cities.

After our initial exploratory data analysis, we move on to the data cleaning phase, using machine learning to help determine which factors are relevant for building our dimensions and clusters, and for further analysis.

Since the goal of the study is to understand how the cities within Los Angeles County group together and differ, we will be using unsupervised machine learning methods in the form of PCA and k-means clustering -- to find out how many dimensions and clusters our data should be grouped into to give us the best results.

First, we start off by standardizing our data in order to get a better understanding of the relationships within the variables. Then we create a heatmap and scatterplots to explore the relationships.

In [None]:
LA_HPI_fit=preprocessing.StandardScaler().fit(LA_HPI_Table).transform(LA_HPI_Table) # Standardizing and transforming dataset
LA_HPI_fit=pd.DataFrame(LA_HPI_fit, columns=LA_HPI_Table.columns) # Converting into dataframe with the matching column names
LA_HPI_corr=LA_HPI_fit.corr() # Create correlation analysis object
sns.heatmap(LA_HPI_corr) # Map correlation analysis as heatmap

Correlation matrix of our dataset (above)
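
To make the standardization step concrete, the following is a minimal sketch on made-up numbers (the toy array and its column meanings are assumptions, not values from the LA County data). It applies the z-score transform that `StandardScaler` performs with its default settings and checks that the correlation matrix is unchanged by it, which is why a correlation heatmap looks the same for raw and standardized data.

```python
import numpy as np

# Toy data: 5 hypothetical cities x 2 features on very different scales
# (say, life expectancy in years and earnings in dollars).
X = np.array([
    [78.0, 25000.0],
    [80.5, 31000.0],
    [82.0, 42000.0],
    [84.5, 60000.0],
    [86.0, 75000.0],
])

# z-score each column: subtract the mean, divide by the population std (ddof=0),
# which is what sklearn's StandardScaler does with its default settings.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0).round(6))  # each column now has mean ~0
print(Z.std(axis=0).round(6))   # and standard deviation ~1

# Pearson correlation is invariant to shifting and positive rescaling,
# so the correlation matrix is the same before and after standardizing.
corr_raw = np.corrcoef(X, rowvar=False)
corr_std = np.corrcoef(Z, rowvar=False)
print(np.allclose(corr_raw, corr_std))
```

Standardizing matters for the later PCA and k-means steps, which are sensitive to scale, rather than for the correlations themselves.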

In [None]:
sns.pairplot(LA_HPI, diag_kind='hist',height=2.85) # Create scatterplot of all the variables correlations using seaborn ('height' replaces the deprecated 'size' argument)
plt.show() # Plot

Our initial look at the strengths of the different relationships in the correlation charts also shows clear redundancies between the indexes and their corresponding values (e.g., life expectancy and health index). So we perform a principal component analysis to make sure our dataset has enough predictive power in its first few components that we can get rid of redundant columns.

In [None]:
# Calculating eigenvectors and eigenvalues of the covariance matrix
mean_vec = np.mean(LA_HPI_fit, axis=0)
cov_mat = np.cov(LA_HPI_fit.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

In [None]:
eig_pairs = [ (np.abs(eig_vals[i]),eig_vecs[:,i]) for i in range(len(eig_vals))] # Create a list of (eigenvalue, eigenvector) tuples
eig_pairs.sort(key = lambda x: x[0], reverse= True) # Sort from high to low
# Calculation of Explained Variance from the eigenvalues
tot = sum(eig_vals)
var_exp = [(i/tot)*100 for i in sorted(eig_vals, reverse=True)] # Individual explained variance
cum_var_exp = np.cumsum(var_exp) # Cumulative explained variance

In [None]:
# PLOT OUT THE EXPLAINED VARIANCES SUPERIMPOSED 
plt.figure(figsize=(10, 5))
plt.bar(range(len(var_exp)), var_exp, alpha=0.3333, align='center', label='individual explained variance', color = 'g')
plt.step(range(len(cum_var_exp)), cum_var_exp, where='mid',label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.show()
print(cum_var_exp)

Here we see that 3 components can account for 94.62% of the variance in our dataset. So we remove the redundant columns (to give our data greater predictive power) and then re-analyze the relationships between the variables.
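
The explained-variance calculation can be checked on a small synthetic example. The sketch below uses randomly generated data (an assumption for illustration only) with one deliberately redundant column, and `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix and returns real eigenvalues. It is a swapped-in alternative to `np.linalg.eig`, not the call used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a standardized table: 100 rows, 4 columns,
# where the last column is nearly a copy of the first (a "redundant index").
base = rng.normal(size=(100, 3))
X = np.column_stack([base, base[:, 0] + 0.05 * rng.normal(size=100)])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the covariance matrix are the variances along each principal axis.
cov_mat = np.cov(X.T)
eig_vals = np.linalg.eigh(cov_mat)[0][::-1]  # eigh returns ascending order; reverse it

# Share of total variance per component, and the running total, in percent.
var_exp = eig_vals / eig_vals.sum() * 100
cum_var_exp = np.cumsum(var_exp)

print(var_exp.round(2))
print(cum_var_exp.round(2))
```

Because the fourth column carries almost no information of its own, the last component explains close to nothing and the cumulative total is already near 100% after three components.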

In [None]:
LA_HPI_fit_V2=LA_HPI_fit # Storing information onto new dataframe
LA_HPI_fit_V2=LA_HPI_fit_V2.drop(columns=['Human Development Index','Health Index','Education Index','Income Index']) # Dropping redundant columns
LA_HPI_corr_V2=LA_HPI_fit_V2.corr() # Build correlation object
sns.heatmap(LA_HPI_corr_V2) # Create heatmap of correlation object

Correlation matrix of refined dataset (above)

In [None]:
LA_HPI_V2=LA_HPI # Create dataframe for scatterplots
LA_HPI_V2=LA_HPI_V2.drop(columns=['Human Development Index','Health Index','Education Index','Income Index']) # Dropping redundant columns
sns.pairplot(LA_HPI_V2, diag_kind='hist',height=2.85) # Create scatterplot of all the variables correlations using seaborn ('height' replaces the deprecated 'size' argument)
plt.show() # Plot

Scatterplot matrix of refined dataset (above)

Now that we are happy with our dataset, we redo the principal component analysis to see how many dimensions we should split our data into, in order to get the most predictive power per dimension.

In [None]:
# Calculating eigenvectors and eigenvalues of the covariance matrix
mean_vec = np.mean(LA_HPI_fit_V2, axis=0)
cov_mat = np.cov(LA_HPI_fit_V2.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

# Create a list of (eigenvalue, eigenvector) tuples
eig_pairs = [ (np.abs(eig_vals[i]),eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort from high to low
eig_pairs.sort(key = lambda x: x[0], reverse= True)

# Calculation of Explained Variance from the eigenvalues
tot = sum(eig_vals)
var_exp = [(i/tot)*100 for i in sorted(eig_vals, reverse=True)] # Individual explained variance
cum_var_exp = np.cumsum(var_exp) # Cumulative explained variance

# PLOT OUT THE EXPLAINED VARIANCES SUPERIMPOSED 
plt.figure(figsize=(10, 5))
plt.bar(range(len(var_exp)), var_exp, alpha=0.3333, align='center', label='individual explained variance', color = 'g')
plt.step(range(len(cum_var_exp)), cum_var_exp, where='mid',label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.show()

In [None]:
cum_var_exp

In [None]:
pca=PCA()
pca.fit(LA_HPI_fit_V2)
pca.explained_variance_ratio_

With our consolidated correlation matrix, the top 3 principal components still account for 93.29% of the variability in the dataset, so we will move forward with this 140 x 6 table.

In [None]:
LA_HPI_V2.head()

Given that most of the variance in the LA County dataset can be explained by 3 principal components (from the analysis above), we use Principal Component Analysis (PCA) to reduce the number of features in our dataset to 3.
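
As a rough illustration of what this reduction does, the sketch below projects synthetic standardized data (not the LA County table) onto its top three eigenvectors with NumPy. The variable names are illustrative assumptions; up to the sign of each axis, this should correspond to what `PCA(n_components=3).fit_transform` computes for centered data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: 50 rows x 6 standardized features.
X = rng.normal(size=(50, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the covariance matrix (eigh: symmetric input, real output).
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X.T))

# Order the eigenvectors by decreasing eigenvalue and keep the first three.
order = np.argsort(eig_vals)[::-1]
top3 = eig_vecs[:, order[:3]]

# Project every row onto the three principal axes: (50 x 6) @ (6 x 3) -> (50 x 3).
scores = X @ top3

print(scores.shape)
# The variance of each score column equals the corresponding eigenvalue,
# and the score columns are uncorrelated with one another.
print(np.var(scores, axis=0, ddof=1).round(4))
print(eig_vals[order[:3]].round(4))
```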

In [None]:
pca3 = PCA(n_components=3) # PCA object for grouping dataset into three dimensions, by 3 components
x_3d = pca3.fit_transform(LA_HPI_fit_V2) # Fit to our dataset, then transform it based on the three dimensions

In [None]:
x_3d[:5,:] # Preview of our 3 dimensional dataset

In [None]:
df_pca3=pd.DataFrame(x_3d) # Dataframe from principal component analysis of 3
sns.pairplot(df_pca3) # Plot dataframe

In [None]:
plt.scatter(x_3d[:,0],x_3d[:,2], alpha=0.5)

After transforming our data into 3 dimensions (above), we now find the optimal k for clustering the data with k-means.

In [None]:
# For loop to collect 'sum of squared distances' for k-means clustering with k ranging from 1 to 14
Sum_of_squared_distances = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(LA_HPI_fit_V2)
    Sum_of_squared_distances.append(km.inertia_)

In [None]:
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

We see that the sum of squared distances stops dropping sharply around K = 3 (the 'elbow' of the curve), so we will use K = 3 for our K-means clustering, visualized on the PCA projection below.
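
For reference, here is a small, self-contained sketch of the elbow method on synthetic data with three well-separated groups (generated with `make_blobs`; the data, the `random_state` values, and `n_init=10` are assumptions chosen so the example is reproducible). It prints the inertia for each k and the fraction of inertia removed by each extra cluster, which is one way to read the elbow numerically rather than by eye.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 150 points around three well-separated centers.
X, _ = make_blobs(
    n_samples=150,
    centers=[[-10, 0], [0, 0], [10, 0]],
    cluster_std=1.0,
    random_state=0,
)

ks = list(range(1, 7))  # k = 1 .. 6
inertias = []
for k in ks:
    # Fixing random_state makes the run repeatable; n_init restarts guard
    # against a poor random initialization.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Fraction of the remaining inertia removed when going from k-1 to k clusters.
inertias = np.array(inertias)
relative_drop = 1 - inertias[1:] / inertias[:-1]

for k, drop in zip(ks[1:], relative_drop):
    print(f"k={k}: inertia={inertias[k - 1]:.1f}, drop vs k-1: {drop:.0%}")
```

On toy data like this, the relative drop should be large up to k = 3 and much smaller afterwards, which is the pattern the elbow plot is meant to reveal.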

In [None]:
kmeans=KMeans(n_clusters=3) #Set a 3 KMeans clustering

X_clustered=kmeans.fit_predict(LA_HPI_fit_V2) #Compute cluster centers and predict cluster indices

LABEL_COLOR_MAP = {0:'r', 1: 'g', 2: 'b'} #Define our own color map
label_color = [LABEL_COLOR_MAP[l] for l in X_clustered]

# Plot the scatter diagram
plt.figure(figsize = (10,10))
plt.scatter(x_3d[:,0],x_3d[:,2], c=label_color, alpha=0.5) 
plt.show()

3 Clusters formed from (3-Dimension) PCA data (above)

We also visualize how these groups cluster together across the different dimensions created by PCA, and later show how the clusters fall on a map.

In [None]:
# Create a temp dataframe from our PCA projection data "x_3d"
df=pd.DataFrame(x_3d)
df['X_cluster']=X_clustered
LA_HPI['Cluster']=X_clustered

In [None]:
X_clustered # Our array of clusters that were formed

Our array of clusters that were formed (above)

In [None]:
# Call Seaborn's pairplot to visualize our feature interactions based on clusters
sns.pairplot(df, hue='X_cluster', palette= 'Dark2', diag_kind='kde',height=1.85) # 'height' replaces the deprecated 'size' argument

Pairplot of our PCA data colored by the clusters that were formed using k-means (above)

# Results

Now that the clusters have been created, we add the cluster labels to our earlier graphs to get a better understanding of how LA County breaks down.

In [None]:
# Call Seaborn's pairplot to visualize our KMeans clusters on the original dataframe
sns.pairplot(LA_HPI, hue='Cluster', palette= 'Dark2', diag_kind='kde',height=1.85) # 'height' replaces the deprecated 'size' argument

Our original dataframe grouped by clusters (above)

In [None]:
LA_HPI_V2['Cluster']=X_clustered

In [None]:
# Call Seaborn's pairplot to visualize our KMeans clusters on the refined dataframe
sns.pairplot(LA_HPI_V2, hue='Cluster', palette= 'Dark2', diag_kind='kde',height=1.85) # 'height' replaces the deprecated 'size' argument

Our refined dataframe grouped by clusters (above)

Upon our initial research for how the factors correlated to each other, we discovered an interesting relationship between 'school enrollment', 'earnings' and 'bachelors degrees' that could warrant further analysis.

To help facilitate further research, we grabbed location data for the top 3 popular places in each city using Foursquare, segmented by cluster below.
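
A minimal sketch of how one such request could be assembled is shown below. It only builds the URL and does not contact the API; the credential strings are placeholders, the area name is made up, and the helper name `build_explore_url` is hypothetical. Using `urllib.parse.urlencode` ensures that spaces and commas in a city name are escaped.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(near, client_id, client_secret, version="20180605", limit=3):
    """Return the explore URL for one city, with query values URL-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "near": near,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

# Area names in the dataset may join several cities with " - ";
# each part is queried separately with ", CA" appended.
area = "Santa Monica - Venice"  # made-up example of a combined area name
urls = [
    build_explore_url(city + ", CA", "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
    for city in area.split(" - ")
]
for url in urls:
    print(url)
```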

In [None]:
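# Credentials and Parameters
import os # Read credentials from the environment instead of hard-coding them in the notebook
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID (environment variable name is an assumption)
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret (environment variable name is an assumption)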
VERSION = '20180605' # Foursquare API version
LIMIT = 3
INTENT='browse'

In [None]:
VENUE_List=[]
# Loop over each row (area) of the dataframe
for CITY, CLUSTER in zip(LA_HPI['City'], LA_HPI['Cluster']):
    # An area name may combine several cities separated by " - "
    for CITY_PART in CITY.split(" - "):
        NEAR=CITY_PART + ', CA'

        # Build the API request; passing params lets requests URL-encode the values
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {'client_id': CLIENT_ID,
                  'client_secret': CLIENT_SECRET,
                  'v': VERSION,
                  'near': NEAR,
                  'limit': LIMIT,
                  'intent': INTENT}

        # make the GET request
        results = requests.get(url, params=params).json()

        # Skip cities that the API could not resolve
        if results['meta']['code']!=200:
            continue

        LOCATION=results['response']['geocode']['where']
        # Iterate over the items actually returned (there may be fewer than LIMIT)
        for item in results['response']['groups'][0]['items'][:LIMIT]:
            # Save relevant fields from results
            NAME=item['venue']['name']
            CATEGORIES=item['venue']['categories']
            CATEGORY=CATEGORIES[0]['name'] if CATEGORIES else None
            AREA=CITY
            GROUP=CLUSTER+1 # Report clusters as 1-3 rather than 0-2
            VENUE_List.append((NAME,CATEGORY,LOCATION,AREA,GROUP))

In [None]:
## Select columns for dataframe to download results 
Venue_Columns=('Name','Category','City','Area','Group')
# Convert list to dataframe, add columns
df_VENUE_List=pd.DataFrame(VENUE_List,columns=Venue_Columns)
# Format the 'City' column by capitalizing it and removing the trailing ' Ca'
# (a regex is used because str.rstrip(' Ca') strips characters, not a suffix, and would also trim names ending in 'a')
df_VENUE_List['City']=df_VENUE_List['City'].str.title().str.replace(r'\s+Ca$','',regex=True)
# Save results into a csv
df_VENUE_List.to_csv('LA_County_Venue_List.csv')

Venue Location Dataframe
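
A note on trimming the trailing state abbreviation from names such as `Santa Monica Ca`: `str.rstrip` treats its argument as a set of characters to strip, not as a suffix, so it can remove more than intended. The sketch below, using made-up city strings, contrasts it with an anchored regular expression.

```python
import pandas as pd

# Made-up examples of title-cased location strings.
cities = pd.Series(["Santa Monica Ca", "Azusa Ca", "Long Beach Ca"])

# rstrip(' Ca') strips any trailing run of the characters ' ', 'C', and 'a',
# so names that end in "a" lose letters as well.
stripped = cities.str.rstrip(" Ca")

# An anchored regular expression removes only a literal trailing " Ca".
trimmed = cities.str.replace(r"\s+Ca$", "", regex=True)

print(stripped.tolist())
print(trimmed.tolist())
```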

In [None]:
# Read from csv
df_VENUE_List_File=pd.read_csv('LA_County_Venue_List.csv')
df_VENUE_List_File=df_VENUE_List_File.drop(columns='Unnamed: 0') # Drop the index column written by to_csv
df_VENUE_List_File

#### Cluster 1

In [None]:
LA_Cluster_Data_1=LA_HPI[LA_HPI['Cluster']==0].mean(numeric_only=True) # Average the numeric columns only (skips 'City' and 'Polygon')
df_VENUE_List_File.loc[df_VENUE_List_File['Group'] == 1]

#### Cluster 2

In [None]:
LA_Cluster_Data_2=LA_HPI[LA_HPI['Cluster']==1].mean(numeric_only=True) # Average the numeric columns only (skips 'City' and 'Polygon')
df_VENUE_List_File.loc[df_VENUE_List_File['Group'] == 2]

#### Cluster 3

In [None]:
LA_Cluster_Data_3=LA_HPI[LA_HPI['Cluster']==2].mean(numeric_only=True) # Average the numeric columns only (skips 'City' and 'Polygon')
df_VENUE_List_File.loc[df_VENUE_List_File['Group'] == 3]

#### Cluster Map

In [None]:
# Converting 'Polygon' column from dataframe into geodataframe for plotting
LA_HPI_gdf=gpd.GeoDataFrame(LA_HPI,geometry='Polygon')
LA_HPI_gdf_json=LA_HPI_gdf.to_json() # Convert from geodataframe to json for choropleth map

Cluster_Geo=['City','Cluster']

# Initialize the map:
map_LA_County = folium.Map([latitude, longitude], zoom_start=9)

choropleth=folium.Choropleth(
    geo_data=LA_HPI_gdf_json,
    name='choropleth',
    data=LA_HPI[Cluster_Geo],
    columns=Cluster_Geo,
    key_on='feature.properties.City',
    bins=4,
    fill_color='Set3',
    fill_opacity=0.7,
    line_opacity=1.2,
    legend_name='Cluster',
    highlight=True
).add_to(map_LA_County)
choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['City'],labels=False)
)

choropleth=folium.features.GeoJson(
    LA_HPI_gdf_json,
    style_function=style_function, 
    control=False,
    highlight_function=highlight_function, 
    tooltip=folium.features.GeoJsonTooltip(
        fields=['City','Cluster','Human Development Index', 'Life Expectancy', 'No HS Diplomas', 'Bachelors Degrees', 'Graduate Degrees',
       'School Enrollment', 'Earnings', 'Health Index', 'Education Index', 'Income Index'
               ],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;") 
    )
)
map_LA_County.add_child(choropleth)


map_LA_County

Looking at the map, we see a clear positive relationship between higher-performing cities and their proximity to the ocean. We also see that inner-city regions within Los Angeles and the San Fernando Valley are the worst performers, along with the Lancaster region. There also seem to be pockets of higher-performing cities in more mountainous areas.

In [None]:
LA_Clusters=[]
LA_Clusters=pd.concat([LA_Cluster_Data_1,LA_Cluster_Data_2,LA_Cluster_Data_3],axis=1)
LA_Clusters.sort_values(by='Human Development Index',axis=1,inplace=True)
LA_Clusters=LA_Clusters.transpose().rename(columns = {'X_cluster':'Cluster'})
LA_Clusters

From the breakdown of the averages for the different groups above, we see the least disparity in life expectancy and school enrollment, while the highest disparity is seen in no HS diplomas, graduate degrees and earnings.
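
Per-cluster averages of this kind can also be produced in a single step with `groupby`. The sketch below uses a small made-up dataframe (column names mirror the notebook, values are invented) and passes `numeric_only=True` so that text columns such as the city name are skipped. The disparity measure at the end is an illustrative assumption, not a statistic used elsewhere in this report.

```python
import pandas as pd

# Invented values for illustration; only the column names follow the notebook.
toy = pd.DataFrame({
    "City": ["A", "B", "C", "D", "E", "F"],
    "Life Expectancy": [80.0, 82.0, 84.0, 86.0, 79.0, 81.0],
    "Earnings": [25000, 27000, 60000, 64000, 30000, 32000],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

# One row per cluster, one column per numeric feature.
cluster_means = toy.groupby("Cluster").mean(numeric_only=True)

# Order clusters by a chosen column to make the comparison easier to read.
cluster_means = cluster_means.sort_values(by="Earnings")

# Spread between the highest and lowest cluster average, relative to the
# mean of the cluster averages, as a rough measure of disparity per feature.
disparity = (cluster_means.max() - cluster_means.min()) / cluster_means.mean()

print(cluster_means)
print(disparity.round(3))
```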

# Discussion

Now that we have grouped the cities within LA County into clusters and seen how they are plotted on a map, it is very interesting to see how the different clusters are distributed throughout the area. There seems to be an obvious association between the highest-performing cities and their proximity to the ocean, but we also see high-performing cities in mountain regions, which would be interesting to explore from an age perspective to see if this is representative of migration patterns within LA County. It's also worth noting how close the different city clusters are in their school enrollment levels, while there is a fair amount of discrepancy in other categories. This could also merit further explanation, for example by creating a logistic regression model, and by examining whether it is a result of the quality of education in various regions or of cities not retaining their citizens once they have become educated and involved in the workforce.

# Conclusion

From the results of our study, it seems there is a lot of useful information for further exploring education effectiveness and migration patterns within LA to see how they affect earnings and graduation rates. An important question to ask is whether higher-performing areas are offering better education and/or whether higher-earning individuals are moving to these areas once they've reached a certain level of income. While this dataset was limited to factors related to health, income and education -- we are fortunate that LA County has a great number of datasets available that can be evaluated under a similar model to help with other classification tasks. Once we understand the different clusters and where their greatest opportunities for improvement are, we can use these clusters to develop benchmarks and allocate resources where they will 'move the needle' the most.