**Library Imports:**

In [351]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR
from sklearn.cluster import DBSCAN, KMeans

**Data Imports:**

In [352]:
df = pd.read_csv('data/Socioeconomic Data/Socioeconomic determinants/socioeconomic determinant for state.csv')
df_2 = pd.read_csv('data/csse_covid_19_daily_reports_us/12-31-2020.csv')
unemp_covid_df = pd.read_csv('data/yun_data/output/unemployment_covid_2020.csv')
covid_df = pd.read_csv('data/yun_data/output/covid_monthly_2020.csv')
df3 = pd.read_csv('data/yun_data/output/unemp_rate_2020.csv')
df4 = pd.read_csv('data/yun_data/output/cov_unemp_summary.csv')
df5 = pd.read_csv('data/yun_data/output/covid_socio_2020.csv')

In [353]:
#set panda options to max display to view all columns:
pd.options.display.max_columns = None

## Modeling:

For modeling purposes, since we don't necessarily have an output that we are either classifying or predicting, we have elected to utilize unsupervised modeling through clustering. The advantages of using clustering will be to identify groups that exist within our data and try and distinguish any identifying features that supports our conclusions discovered through EDA. Clustering in genreal attempts to group observations (data) with the notion that clusters of the same group are more similar than clusters of a different group. Ideally we would like to create two or three clusters that will allow us to draw comparisons easier and filter out any trends that would influence our recommendations given to the Covid resource agency. For clustering methods we will explore KMeans and DBSCAN. 

### DBSCAN

DBSCAN pictures from: https://git.generalassemb.ly/DSIR-20201214-E/lesson-dbscan

<img src='./assets/Screen Shot 2021-02-20 at 6.25.46 PM.png'>

DBSCAN is somewhat limited when it comes to clustering observations that may not be overly distinct in separation. So for something as complex as states it will be challenging, but we should be able to at least create some degree of clusters that will provide us some intel on where to best allocate some of sources. Using our findings from our EDA let's create a feature dataframe for some of the determinants that we noticed to be features of possible concern. 

In [354]:
features=pd.DataFrame(df5[['Median household income', 'Poverty rate', 'hospital', 
                       'White Population', 'Africa-American Population',
                       'Hispanic population', 'Deaths',
                       'Case_Fatality_Ratio', 'deaths_per_population',
                      'confirmed_per_population', 'unemp_year_rate']])

In [355]:
#Since many of these metrics are measured on varying scales, it's important to standardized our data:
ss = StandardScaler()
X_scaled = ss.fit_transform(features)

Another challenge of DBSCAN is tuning the parameters. Through some trial and error we were able to estanlish our epsilon along with the min_samples so that it would give us 4 total clusters. It's important to note that DBSCAN will also determine a 'noise' cluster denoted as -1, so when analyzing the results, we should drop that column.

In [356]:
#instantiating and setting the model paramters:
dbscan = DBSCAN(eps=2.2, min_samples=3)

#fitting the model to our scaled data:
dbscan.fit(X_scaled)

#creating a cluster column with the labels that our model was able to create:
features['cluster'] = dbscan.labels_

Next we will want to groupby our clusters to examine the mean values across our two clusters. Hopefully we will be able to uncover some distinctions between the two clusters that will provide support for our final recommendations. Don't forget to drop the noise column! 

In [357]:
features.groupby('cluster').mean().T.drop(columns=-1)

cluster,0,1,2
Median household income,62974.71875,62865.25,83475.666667
Poverty rate,11.846875,12.3,9.533333
hospital,101.375,223.5,83.0
White Population,83.75625,78.45,75.833333
Africa-American Population,10.24375,15.3,12.733333
Hispanic population,7.778125,14.25,16.733333
Deaths,3497.59375,17154.0,12486.666667
Case_Fatality_Ratio,1.297728,2.109361,3.509479
deaths_per_population,0.000834,0.001244,0.001876
confirmed_per_population,0.065681,0.060246,0.05536


**Findings:** Judging by the clusters we have created we can see that there are a couple of notable differences among each pocket. Consistent with our EDA we notice that across each cluster, as CFR rises, so do the minority populations, while inversely the white population percentage decreases. It is also worth pointing out that there is a similar trend with deaths per capita, once again supporting what we saw with the correlational EDA analysis we performed earlier. While the clustering method is far from perfect, this should be enough to let us draw a conclusion that covid, particularly CFR and deaths per capita, disproportionately affect minority communities at a concerning rate. 

For exploratory sake let's take a look at the same features but with different parameters:

In [358]:
features_2 = features.copy()
ss = StandardScaler()
X_scaled_2 = ss.fit_transform(features_2)

In [359]:
#instantiating and setting the model paramters:
dbscan_2 = DBSCAN(eps=2.5, min_samples=2)

#fitting the model to our scaled data:
dbscan_2.fit(X_scaled_2)

#creating a cluster column with the labels that our model was able to create:
features_2['cluster'] = dbscan_2.labels_

In [360]:
features_2.groupby('cluster').mean().T.drop(columns=-1)

cluster,0,1,2,3
Median household income,62974.71875,83475.666667,62865.25,48432.5
Poverty rate,11.846875,9.533333,12.3,19.3
hospital,101.375,83.0,223.5,164.5
White Population,83.75625,75.833333,78.45,61.45
Africa-American Population,10.24375,12.733333,15.3,36.15
Hispanic population,7.778125,16.733333,14.25,4.2
Deaths,3497.59375,12486.666667,17154.0,6137.5
Case_Fatality_Ratio,1.297728,3.509479,2.109361,2.296607
deaths_per_population,0.000834,0.001876,0.001244,0.00161
confirmed_per_population,0.065681,0.05536,0.060246,0.070166


**Findings:** Even with this different clustering we can still see a similar trend in disproportions across racial demogrpahics. With higher CFR rates we see a trending of increasing minority populations whereas the white populations are still negatively correlated. Again, with a different clustering we are still seeing the trends that we have suspected throughout our EDA and our first modeling attempts.  