# Milestone 2 Assignment - Capstone Check-in

## Author - Naris Silpakit

### Capstone Project Instructions
Select a problem and data sets of particular interest and apply the analytics process to find and report on a solution.

Students will construct a simple dashboard to allow a non-technical user to explore their solution. The data should be read from a suitable persistent data store, such as an Internet URL or a SQL database.

The process followed by the students and the grading criteria include:
<ol style="list-style-type: lower-alpha;">
<li>Understand the business problem <span class="label" style="border-radius: 3px; background-color: darkcyan; color: white;">Milestone 1</span></li>
<li>Evaluate and explore the available data <span class="label" style="border-radius: 3px; background-color: darkcyan; color: white;">Milestone 1</span></li>
<li>Proper data preparation <span class="label" style="border-radius: 3px; background-color: darkcyan; color: white;">Milestone 1</span> <span class="label" style="border-radius: 3px; background-color: royalblue; color: white;">Milestone 2</span></li>
<li>Explore the data and understand relationships <span class="label" style="border-radius: 3px; background-color: darkcyan; color: white;">Milestone 1</span> <span class="label" style="border-radius: 3px; background-color: royalblue; color: white;">Milestone 2</span></li>
<li>Perform basic analytics and machine learning, within the scope of the course, on the data.  <span class="label" style="border-radius: 3px; background-color: royalblue; color: white;">Milestone 2</span> <span class="label" style="border-radius: 3px; background-color: slateblue; color: white;">Milestone 3</span> <BR/>For example, classification to predict which employees are most likely to leave the company.</li>
<li>Create a written and/or oral report on the results suitable for a non-technical audience. <span class="label" style="border-radius: 3px; background-color: slateblue; color: white;">Milestone 3</span></li>
</ol>



## Tasks
<img src="https://library.startlearninglabs.uw.edu/DATASCI420/img/Milestone2Sample.PNG" style="float: right; width: 400px;">
For this check-in, you are to:

1). Explicitly state the problem, list sources, and define the methodology: classification, regression, other

2). List data processing steps (pseudocode), including data source collection & preparation, feature engineering & selection, modeling, and performance evaluation.

3). Read in the previously generated data file of cleaned up data

4). Perform feature engineering and selection

5). Conduct some preliminary modeling 

6). Identify potential machine learning model(s) to improve performance


## Project Goal



The goal of this project is to classify whether a state will vote Republican or Democratic in a presidential election, based on economic, labor, and education-spending data about that state.

## Data Sources

- My team member Elizabeth gathered federal election data from https://uselectionatlas.org/RESULTS/
- Yulia gathered labor and economic data from the census and other federal data sources.
- I gathered data on education spending from the National Center for Education Statistics.

## Data Processing Steps

1. Load in three datasets
2. Merge on state and year
3. Check for any NAs and duplicates
4. Encode categorical data (label encoding)
5. Run Lasso, Ridge, and ElasticNet to inform feature selection
6. Split the data into train and test
7. Test a model on the selected features
8. Identify additional machine learning models to test with.

## Import Libraries and Functions

In [229]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.linear_model import ElasticNet
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.feature_selection import SelectFromModel
pd.set_option('display.max_columns', None)  

In [259]:
# Define Functions
def print_model_metrics(y_test, y_pred):
    '''
        Calculates and prints model metrics given target test values and predicted target values.
    '''
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)
    print('Accuracy: {}'.format(metrics.accuracy_score(y_test, y_pred)))
    print('AUC: {}'.format(metrics.auc(fpr, tpr)))
    print('Recall: {}'.format(metrics.recall_score(y_test, y_pred)))
    print('Precision: {}'.format(metrics.precision_score(y_test, y_pred)))
    print('F1: {}'.format(metrics.f1_score(y_test, y_pred)))
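
One caveat about the AUC printed by the helper above: it is computed from hard 0/1 predictions, so the ROC curve has a single operating point and the AUC reduces to balanced accuracy. The sketch below uses made-up labels and scores (not the project data) to show how the value can differ when scores such as `predict_proba(X)[:, 1]` are passed instead. All variable names here are illustrative.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and scores, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.6, 0.7, 0.9])  # stand-in for predicted probabilities
y_hard = (y_score >= 0.5).astype(int)                            # stand-in for model.predict output

# AUC from hard labels equals balanced accuracy (mean of recall on each class)
auc_from_labels = metrics.roc_auc_score(y_true, y_hard)
balanced_acc = metrics.balanced_accuracy_score(y_true, y_hard)

# AUC from continuous scores uses the full ranking of the observations
auc_from_scores = metrics.roc_auc_score(y_true, y_score)

print(auc_from_labels, balanced_acc, auc_from_scores)
```

If this distinction matters for the project, the helper could take an optional score argument; that change is a suggestion, not something the notebook already does.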

## Data Processing

In [231]:
# load election data
elections = pd.read_csv('presidential_election_data.csv')

In [232]:
print(elections.info())
elections.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 663 entries, 0 to 662
Data columns (total 25 columns):
Year           663 non-null int64
State          663 non-null object
EV_R           663 non-null int64
EV_D           663 non-null int64
Population     612 non-null float64
Total_VAP      663 non-null int64
Total_VAC      663 non-null int64
Total_REG      663 non-null int64
Total_Vote     663 non-null int64
Perc_VAP       663 non-null object
Perc_VAC       663 non-null object
Perc_REG       663 non-null object
D_placed       663 non-null int64
R_placed       663 non-null int64
O_placed       306 non-null float64
Margin         663 non-null int64
Perc_Margin    663 non-null float64
Vote_Perc_D    663 non-null float64
Vote_Perc_R    663 non-null float64
Vote_Perc_T    306 non-null float64
Vote_Perc_O    663 non-null float64
Vote_D         663 non-null int64
Vote_R         663 non-null int64
Vote_T         306 non-null float64
Vote_O         663 non-null int64
dtypes: float64(8), int64

Unnamed: 0,Year,State,EV_R,EV_D,Population,Total_VAP,Total_VAC,Total_REG,Total_Vote,Perc_VAP,Perc_VAC,Perc_REG,D_placed,R_placed,O_placed,Margin,Perc_Margin,Vote_Perc_D,Vote_Perc_R,Vote_Perc_T,Vote_Perc_O,Vote_D,Vote_R,Vote_T,Vote_O
0,2016,Alabama,9,0,4863300.0,0,0,3333058,2123372,-,-,63.7,2,1,3.0,588708,27.73,34.36,62.08,2.09,1.46,729547,1318255,44467.0,31103
1,2016,Alaska,3,0,741894.0,0,0,528560,318608,-,-,60.3,2,1,3.0,46933,14.73,36.55,51.28,5.88,6.29,116454,163387,18725.0,20042
2,2016,Arizona,11,0,6931071.0,0,0,4088036,2604657,-,-,63.7,2,1,3.0,91234,3.5,44.58,48.08,4.08,3.25,1161167,1252401,106327.0,84762
3,2016,Arkansas,6,0,2988248.0,0,0,1759982,1130635,-,-,64.2,2,1,3.0,304378,26.92,33.65,60.57,2.64,3.13,380494,684872,29829.0,35440
4,2016,California,0,55,39250017.0,0,0,19411771,14237893,-,-,73.3,1,2,3.0,4269978,29.99,61.48,31.49,3.36,3.66,8753792,4483814,478500.0,521787


In [233]:
elections.columns.values

array(['Year', 'State', 'EV_R', 'EV_D', 'Population', 'Total_VAP',
       'Total_VAC', 'Total_REG', 'Total_Vote', 'Perc_VAP', 'Perc_VAC',
       'Perc_REG', 'D_placed', 'R_placed', 'O_placed', 'Margin',
       'Perc_Margin', 'Vote_Perc_D', 'Vote_Perc_R', 'Vote_Perc_T',
       'Vote_Perc_O', 'Vote_D', 'Vote_R', 'Vote_T', 'Vote_O'],
      dtype=object)

### Election Data Preprocessing
- Create logical target variable from EV_R and EV_D (if EV_R is higher than EV_D, the Republican candidate won the state; otherwise the Democratic candidate won)
- Create variable that denotes whether the electoral vote outcome doesn't match the popular vote outcome
- Remove columns that may cause target leakage
- Set all variable names to lowercase
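
The mismatch flag from the second bullet is not created in the cells that follow. One way it could be derived, assuming the popular vote winner is decided by comparing `Vote_R` and `Vote_D`, is sketched here on toy rows with made-up numbers. The column name `ev_popular_mismatch` is hypothetical.

```python
import pandas as pd

# Toy rows with made-up numbers; column names follow the elections table
toy = pd.DataFrame({
    'State': ['A', 'B', 'C'],
    'EV_R': [9, 0, 4],
    'EV_D': [0, 55, 0],
    'Vote_R': [1300, 4400, 900],
    'Vote_D': [700, 8700, 1000],
})

# Who won the electoral votes vs. who won more popular votes
ev_winner_r = toy['EV_R'] > toy['EV_D']
popular_winner_r = toy['Vote_R'] > toy['Vote_D']

# True where the two outcomes disagree (hypothetical column name)
toy['ev_popular_mismatch'] = ev_winner_r != popular_winner_r
print(toy[['State', 'ev_popular_mismatch']])
```

Because the flag is built from the same outcome columns as the target, it would probably need to be excluded from the model features for the same leakage reason as the dropped vote columns.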

In [234]:
# create target variable - whether a state voted republican
elections['voted_R'] = elections['EV_R'] > elections['EV_D']

In [235]:
# create perc_registered column: share of registered voters who cast a vote
elections['perc_registered'] = elections['Total_Vote'] / elections['Total_REG']

In [236]:
# drop columns
cols_to_drop = ['EV_R', 'EV_D', 'Total_VAP', 'Total_VAC', 'Total_REG', 'Total_Vote', 'Perc_VAP', 'Perc_VAC', 'Perc_REG', 'D_placed', 
                'R_placed', 'O_placed', 'Margin', 'Perc_Margin', 'Vote_Perc_D', 'Vote_Perc_R', 
                'Vote_Perc_T', 'Vote_Perc_O', 'Vote_D', 'Vote_R', 'Vote_T', 'Vote_O']
elections = elections.drop(columns=cols_to_drop)

In [237]:
# rename columns to lowercase
# elections.columns = elections.columns.str.lower()
elections = elections.rename(columns=str.lower)
elections.head()

Unnamed: 0,year,state,population,voted_r,perc_registered
0,2016,Alabama,4863300.0,True,0.637064
1,2016,Alaska,741894.0,True,0.602785
2,2016,Arizona,6931071.0,True,0.637141
3,2016,Arkansas,2988248.0,True,0.642413
4,2016,California,39250017.0,False,0.733467


### Education Spending

In [238]:
edu_spending = pd.read_csv('../milestone_1/state_edu_spending.csv')
edu_spending.head()

Unnamed: 0,year,state,total_revenue,instruction_expense,property_expense,total_edu_expense,per_pupil_expense
0,2000,Alabama,6734880000.0,3592552000.0,224927800.0,6879115000.0,7314.254756
1,2000,Alaska,1895197000.0,923974400.0,72375200.0,1916757000.0,12751.12796
2,2000,Arizona,7670325000.0,3631073000.0,698406000.0,7788936000.0,7102.916696
3,2000,Arkansas,3805996000.0,2017782000.0,134297400.0,3663107000.0,7403.965385
4,2000,California,62828740000.0,33217650000.0,2263272000.0,61966320000.0,8563.135731


### Economic Data Preprocessing
- Set all variable names to lowercase
- Replace spaces in variable names with underscores
- Remove extra columns

In [239]:
economic = pd.read_csv('final_dataset_yulia.csv')

In [240]:
print(economic.info())
economic.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2184 entries, 0 to 2183
Data columns (total 18 columns):
Unnamed: 0                                         2184 non-null int64
State                                              2184 non-null object
Year                                               2184 non-null int64
labor force                                        2184 non-null int64
unemployment rate                                  2184 non-null float64
All Ages SAIPE Poverty Universe                    2142 non-null float64
All Ages in Poverty Count                          2142 non-null float64
All Ages in Poverty Percent                        2142 non-null float64
Under Age 18 SAIPE Poverty Universe                2142 non-null float64
Under Age 18 in Poverty Count                      2142 non-null float64
Under Age 18 in Poverty Percent                    2142 non-null float64
Ages 5 to 17 in Families SAIPE Poverty Universe    2142 non-null float64
Ages 5 to 17 in Families

Unnamed: 0.1,Unnamed: 0,State,Year,labor force,unemployment rate,All Ages SAIPE Poverty Universe,All Ages in Poverty Count,All Ages in Poverty Percent,Under Age 18 SAIPE Poverty Universe,Under Age 18 in Poverty Count,Under Age 18 in Poverty Percent,Ages 5 to 17 in Families SAIPE Poverty Universe,Ages 5 to 17 in Families in Poverty Count,Ages 5 to 17 in Families in Poverty Percent,Under Age 5 SAIPE Poverty Universe,Under Age 5 in Poverty Count,Under Age 5 in Poverty Percent,Median Household Income in Dollars
0,0,Alabama,1976,1501284,6.8,4524161.0,749749.5,17.2,1097007.0,267674.0,24.625,797580.5,178183.0,22.7,294729.5,82291.0,27.7,36131.0
1,1,Alabama,1977,1568504,7.3,4524161.0,749749.5,17.2,1097007.0,267674.0,24.625,797580.5,178183.0,22.7,294729.5,82291.0,27.7,36131.0
2,2,Alabama,1978,1621710,6.4,4524161.0,749749.5,17.2,1097007.0,267674.0,24.625,797580.5,178183.0,22.7,294729.5,82291.0,27.7,36131.0
3,3,Alabama,1979,1656358,7.2,4524161.0,749749.5,17.2,1097007.0,267674.0,24.625,797580.5,178183.0,22.7,294729.5,82291.0,27.7,36131.0
4,4,Alabama,1980,1669289,8.9,4524161.0,749749.5,17.2,1097007.0,267674.0,24.625,797580.5,178183.0,22.7,294729.5,82291.0,27.7,36131.0


In [241]:
economic = economic.drop(columns=['Unnamed: 0'])

In [242]:
economic = economic.rename(columns=str.lower)
economic = economic.rename(columns={col: col.replace(' ', '_') for col in economic.columns})
economic.columns.values

array(['state', 'year', 'labor_force', 'unemployment_rate',
       'all_ages_saipe_poverty_universe', 'all_ages_in_poverty_count',
       'all_ages_in_poverty_percent',
       'under_age_18_saipe_poverty_universe',
       'under_age_18_in_poverty_count', 'under_age_18_in_poverty_percent',
       'ages_5_to_17_in_families_saipe_poverty_universe',
       'ages_5_to_17_in_families_in_poverty_count',
       'ages_5_to_17_in_families_in_poverty_percent',
       'under_age_5_saipe_poverty_universe',
       'under_age_5_in_poverty_count', 'under_age_5_in_poverty_percent',
       'median_household_income_in_dollars'], dtype=object)

### Join Datasets

In [243]:
dataset = elections.merge(edu_spending, on=['state', 'year']).merge(economic, on=['state', 'year'])
dataset.head()

Unnamed: 0,year,state,population,voted_r,perc_registered,total_revenue,instruction_expense,property_expense,total_edu_expense,per_pupil_expense,labor_force,unemployment_rate,all_ages_saipe_poverty_universe,all_ages_in_poverty_count,all_ages_in_poverty_percent,under_age_18_saipe_poverty_universe,under_age_18_in_poverty_count,under_age_18_in_poverty_percent,ages_5_to_17_in_families_saipe_poverty_universe,ages_5_to_17_in_families_in_poverty_count,ages_5_to_17_in_families_in_poverty_percent,under_age_5_saipe_poverty_universe,under_age_5_in_poverty_count,under_age_5_in_poverty_percent,median_household_income_in_dollars
0,2016,Alabama,4863300.0,True,0.637064,7421546000.0,3865843000.0,100705900.0,7408654000.0,9236.418059,2173175,5.9,4741355.0,814197.0,17.2,1081979.0,267674.0,24.7,791471.0,185889.0,23.5,287177.0,78675.0,27.4,46309.0
1,2016,Alaska,741894.0,True,0.602785,2609913000.0,1322420000.0,52367450.0,2559971000.0,17509.975316,363047,6.9,723955.0,71916.0,9.9,183650.0,24897.0,13.6,130053.0,16061.0,12.3,52408.0,7919.0,15.1,76144.0
2,2016,Arizona,6931071.0,True,0.637141,9727226000.0,4557118000.0,373145000.0,9358784000.0,7613.006435,3225703,5.4,6771106.0,1107153.0,16.4,1601458.0,377445.0,23.6,1165956.0,263614.0,22.6,428317.0,106817.0,24.9,53481.0
3,2016,Arkansas,2988248.0,True,0.642413,5524230000.0,2725227000.0,225535300.0,5501220000.0,9845.568548,1342561,3.9,2898653.0,497388.0,17.2,691387.0,165724.0,24.0,503758.0,112376.0,22.3,184115.0,50341.0,27.3,44406.0
4,2016,California,39250017.0,False,0.733467,68793000000.0,36414520000.0,735442600.0,68488960000.0,11495.330166,19093658,5.5,38513333.0,5527621.0,14.4,8959115.0,1782764.0,19.9,6487993.0,1242780.0,19.2,2430975.0,502432.0,20.7,67715.0
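
The merges above use the default inner join, which silently drops any state-year pair that is missing from one of the tables. The sketch below, on made-up rows, shows how the `validate` and `indicator` parameters of `DataFrame.merge` could surface duplicate keys and unmatched rows before committing to the join. The frames `left` and `right` are illustrative stand-ins, not the project tables.

```python
import pandas as pd

# Made-up stand-ins for two of the tables being joined
left = pd.DataFrame({'state': ['Alabama', 'Alaska', 'Alabama'],
                     'year': [2016, 2016, 2012],
                     'voted_r': [True, True, True]})
right = pd.DataFrame({'state': ['Alabama', 'Alaska'],
                      'year': [2016, 2016],
                      'per_pupil_expense': [9236.4, 17510.0]})

# validate raises if (state, year) is duplicated on either side;
# indicator adds a _merge column recording where each row came from
checked = left.merge(right, on=['state', 'year'], how='left',
                     validate='one_to_one', indicator=True)

# Rows that an inner join would have dropped
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['state', 'year']])
```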


In [244]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 250 entries, 0 to 249
Data columns (total 25 columns):
year                                               250 non-null int64
state                                              250 non-null object
population                                         250 non-null float64
voted_r                                            250 non-null bool
perc_registered                                    250 non-null float64
total_revenue                                      250 non-null float64
instruction_expense                                250 non-null float64
property_expense                                   250 non-null float64
total_edu_expense                                  250 non-null float64
per_pupil_expense                                  250 non-null float64
labor_force                                        250 non-null int64
unemployment_rate                                  250 non-null float64
all_ages_saipe_poverty_universe        

In [245]:
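# replace infinite values with NaN, then fill missing values with each numeric column's median
dataset = dataset.replace([np.inf, -np.inf], np.nan)
# numeric_only=True skips the string column 'state', which recent pandas versions refuse to take a median of
dataset = dataset.fillna(dataset.median(numeric_only=True))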
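
Step 3 of the processing plan calls for checking NAs and duplicates. A minimal sketch of such a check, on toy rows rather than the merged dataset, might look like this. It assumes duplicates are defined by a repeated `(state, year)` key.

```python
import numpy as np
import pandas as pd

# Toy rows: one duplicated (state, year) key and one missing value
toy = pd.DataFrame({
    'state': ['Alabama', 'Alabama', 'Alaska', 'Alaska'],
    'year': [2012, 2012, 2012, 2016],
    'population': [4.8e6, 4.8e6, np.nan, 7.4e5],
})

# Count missing values per column
na_counts = toy.isna().sum()

# Flag every row whose (state, year) key appears more than once
dup_keys = toy.duplicated(subset=['state', 'year'], keep=False)

print(na_counts[na_counts > 0])
print(toy[dup_keys])
```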

## Feature Engineering and Selection

In [246]:
# encode state
le_state = preprocessing.LabelEncoder()
le_state.fit(dataset.state)
dataset['state'] = le_state.transform(dataset.state)

# encode voted_r
le_vote = preprocessing.LabelEncoder()
le_vote.fit(dataset.voted_r)
dataset['voted_r'] = le_vote.transform(dataset.voted_r)

dataset.head(3)

Unnamed: 0,year,state,population,voted_r,perc_registered,total_revenue,instruction_expense,property_expense,total_edu_expense,per_pupil_expense,labor_force,unemployment_rate,all_ages_saipe_poverty_universe,all_ages_in_poverty_count,all_ages_in_poverty_percent,under_age_18_saipe_poverty_universe,under_age_18_in_poverty_count,under_age_18_in_poverty_percent,ages_5_to_17_in_families_saipe_poverty_universe,ages_5_to_17_in_families_in_poverty_count,ages_5_to_17_in_families_in_poverty_percent,under_age_5_saipe_poverty_universe,under_age_5_in_poverty_count,under_age_5_in_poverty_percent,median_household_income_in_dollars
0,2016,0,4863300.0,1,0.637064,7421546000.0,3865843000.0,100705900.0,7408654000.0,9236.418059,2173175,5.9,4741355.0,814197.0,17.2,1081979.0,267674.0,24.7,791471.0,185889.0,23.5,287177.0,78675.0,27.4,46309.0
1,2016,1,741894.0,1,0.602785,2609913000.0,1322420000.0,52367450.0,2559971000.0,17509.975316,363047,6.9,723955.0,71916.0,9.9,183650.0,24897.0,13.6,130053.0,16061.0,12.3,52408.0,7919.0,15.1,76144.0
2,2016,2,6931071.0,1,0.637141,9727226000.0,4557118000.0,373145000.0,9358784000.0,7613.006435,3225703,5.4,6771106.0,1107153.0,16.4,1601458.0,377445.0,23.6,1165956.0,263614.0,22.6,428317.0,106817.0,24.9,53481.0


In [247]:
# separate dataset into target variable and features, train and test dataset
X_train, X_test, y_train, y_test = train_test_split(dataset.drop('voted_r', axis=1), 
                                                    dataset['voted_r'], test_size=0.2, random_state=42)
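
With only 250 rows, a random split can leave the two classes unevenly represented in the test set. Passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. This is a sketch on made-up data, offered as an option rather than something the split above requires.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 100 rows, 30% positive class (made-up data)
toy = pd.DataFrame({'feature': np.arange(100), 'voted_r': [1] * 30 + [0] * 70})

# stratify keeps the 30/70 class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop('voted_r', axis=1), toy['voted_r'],
    test_size=0.2, random_state=42, stratify=toy['voted_r'])

print(y_tr.mean(), y_te.mean())
```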

### Lasso

In [253]:
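# solver='liblinear' is set explicitly because scikit-learn's newer default solver (lbfgs) does not support the L1 penalty
clf = linear_model.LogisticRegression(C=1.0, penalty="l1", dual=False, solver="liblinear", random_state=42).fit(X_train, y_train)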
model =  SelectFromModel(clf, prefit=True)
names_l1 = model.get_support()
# X_new = model.transform(X_train)
# X_train.columns[names]
print(X_train.columns.values)
print(clf.coef_)
print('Selected columns: {}'.format(X_train.columns[names_l1].values))

['year' 'state' 'population' 'perc_registered' 'total_revenue'
 'instruction_expense' 'property_expense' 'total_edu_expense'
 'per_pupil_expense' 'labor_force' 'unemployment_rate'
 'all_ages_saipe_poverty_universe' 'all_ages_in_poverty_count'
 'all_ages_in_poverty_percent' 'under_age_18_saipe_poverty_universe'
 'under_age_18_in_poverty_count' 'under_age_18_in_poverty_percent'
 'ages_5_to_17_in_families_saipe_poverty_universe'
 'ages_5_to_17_in_families_in_poverty_count'
 'ages_5_to_17_in_families_in_poverty_percent'
 'under_age_5_saipe_poverty_universe' 'under_age_5_in_poverty_count'
 'under_age_5_in_poverty_percent' 'median_household_income_in_dollars']
[[ 6.66298724e-04 -5.65956929e-03  4.02659161e-06  0.00000000e+00
  -7.19497858e-11 -1.93119827e-10  4.64665166e-10 -7.01368662e-11
  -1.12198093e-04 -5.13171821e-06 -4.74024691e-01 -1.30529778e-06
  -2.35319068e-06  4.25213591e-01 -2.61120814e-07  7.57900136e-05
   0.00000000e+00 -6.16498900e-06 -9.13621302e-05 -2.32248870e-01
   1.51
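
The coefficients above span many orders of magnitude because the features are on very different scales (rates near 1 to 10, dollar totals near 1e9). An L1 penalty acts on raw coefficient size, so unscaled features are not penalized comparably. A hedged sketch of standardizing before the L1 fit follows. It uses synthetic data, and the feature names `rate`, `dollars`, and `noise` are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
n = 200
# Synthetic stand-ins: a rate-like feature, a dollar-scale feature, and pure noise
rate = rng.normal(6.0, 2.0, n)
dollars = rng.normal(5e9, 2e9, n)
noise = rng.normal(0.0, 1.0, n)
X = np.column_stack([rate, dollars, noise])

# The target depends on the first two features only
signal = (rate - 6.0) / 2.0 + (dollars - 5e9) / 2e9
y = (signal + rng.normal(0.0, 0.5, n) > 0).astype(int)

# Standardize, then fit the L1-penalized logistic regression
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', C=0.1, solver='liblinear', random_state=42))
pipe.fit(X, y)

coefs = pipe.named_steps['logisticregression'].coef_.ravel()
print(dict(zip(['rate', 'dollars', 'noise'], coefs.round(3))))
```

After scaling, coefficient magnitudes are comparable across features, which makes the selection less dependent on each column's units.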

### Ridge

In [264]:
clf = linear_model.LogisticRegression(C=1.0, penalty="l2", dual=False, random_state=42).fit(X_train, y_train)
model =  SelectFromModel(clf, prefit=True)
names_l2 = model.get_support()
# X_new = model.transform(X_train)
# X_train.columns[names]
print(X_train.columns.values)
print(clf.coef_)
print('Selected columns: {}'.format(X_train.columns[names_l2].values))

['year' 'state' 'population' 'perc_registered' 'total_revenue'
 'instruction_expense' 'property_expense' 'total_edu_expense'
 'per_pupil_expense' 'labor_force' 'unemployment_rate'
 'all_ages_saipe_poverty_universe' 'all_ages_in_poverty_count'
 'all_ages_in_poverty_percent' 'under_age_18_saipe_poverty_universe'
 'under_age_18_in_poverty_count' 'under_age_18_in_poverty_percent'
 'ages_5_to_17_in_families_saipe_poverty_universe'
 'ages_5_to_17_in_families_in_poverty_count'
 'ages_5_to_17_in_families_in_poverty_percent'
 'under_age_5_saipe_poverty_universe' 'under_age_5_in_poverty_count'
 'under_age_5_in_poverty_percent' 'median_household_income_in_dollars']
[[ 8.41436616e-10  1.09100137e-11  4.08857632e-07  2.70799780e-13
   4.45085384e-10 -1.40657140e-09  8.25392569e-10 -2.41027988e-10
   2.77771408e-09  1.28692832e-07  1.80985112e-12  4.01786519e-07
   1.50938834e-07  7.33809445e-12  1.50676684e-07  6.38344078e-08
   1.01417789e-11  9.76804217e-08  3.85921865e-08  9.06934190e-12
   5.12
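
For a model without an L1 penalty, `SelectFromModel` defaults to keeping features whose absolute coefficient is at least the mean absolute coefficient. The sketch below, on synthetic data, makes that rule explicit by passing `threshold='mean'` and reproducing the mask by hand.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
# Synthetic target driven mostly by the first two columns
y = (2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * X[:, 2]
     + rng.normal(scale=0.5, size=200) > 0).astype(int)

# L2-penalized logistic regression (the default penalty)
clf = LogisticRegression(C=1.0).fit(X, y)
selector = SelectFromModel(clf, prefit=True, threshold='mean')

# Same rule written out: keep features with |coef| >= mean(|coef|)
abs_coef = np.abs(clf.coef_).ravel()
manual_mask = abs_coef >= abs_coef.mean()

print(selector.get_support())
print(manual_mask)
```

Since the rule compares raw coefficient sizes, it is sensitive to feature scale in the same way the penalty is.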

### ElasticNet

In [255]:
# fits a linear classifier with SGD and an elastic-net penalty (with the default hinge loss this is a linear SVM, not logistic regression)
clf = linear_model.SGDClassifier(penalty='elasticnet', random_state=42).fit(X_train, y_train)
model =  SelectFromModel(clf, prefit=True)
names_elastic = model.get_support()
X_new = model.transform(X_train)
# X_train.columns[names]
print(X_train.columns.values)
print(clf.coef_)
print('Selected columns: {}'.format(X_train.columns[names_elastic].values))

['year' 'state' 'population' 'perc_registered' 'total_revenue'
 'instruction_expense' 'property_expense' 'total_edu_expense'
 'per_pupil_expense' 'labor_force' 'unemployment_rate'
 'all_ages_saipe_poverty_universe' 'all_ages_in_poverty_count'
 'all_ages_in_poverty_percent' 'under_age_18_saipe_poverty_universe'
 'under_age_18_in_poverty_count' 'under_age_18_in_poverty_percent'
 'ages_5_to_17_in_families_saipe_poverty_universe'
 'ages_5_to_17_in_families_in_poverty_count'
 'ages_5_to_17_in_families_in_poverty_percent'
 'under_age_5_saipe_poverty_universe' 'under_age_5_in_poverty_count'
 'under_age_5_in_poverty_percent' 'median_household_income_in_dollars']
[[ 1.96101681e+06  2.23817541e+04  1.70624024e+09  6.54497552e+02
   1.21711720e+11 -3.41915871e+11  1.90969629e+11  5.99105320e+10
   7.24485479e+06  7.58938326e+08  4.51997158e+03  1.68785716e+09
   3.58176793e+08  1.49397554e+04  5.00821237e+08  1.41727719e+08
   2.05275300e+04  3.44604153e+08  8.82745679e+07  1.84156425e+04
   1.50
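
Coefficients on the order of 1e9 to 1e11 suggest that SGD is struggling with the unscaled inputs. If an elastic-net logistic regression is what is wanted, one option is `LogisticRegression` with `penalty='elasticnet'` and the `saga` solver on standardized features. This is a sketch on synthetic data, assuming a scikit-learn version that accepts these arguments; the `l1_ratio=0.5` mix is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
scales = np.array([1.0, 1e3, 1e6, 1e9])
z = rng.normal(size=(200, 4))
X = z * scales  # synthetic features on very different scales
# The target depends on the first and last feature only
y = (z[:, 0] + z[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Standardize, then fit logistic regression with a mixed L1/L2 penalty
enet_logit = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5,
                       C=1.0, max_iter=5000, random_state=42))
enet_logit.fit(X, y)

train_acc = enet_logit.score(X, y)
proba = enet_logit.predict_proba(X)[:, 1]  # class-1 probabilities
print(train_acc)
```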



## Preliminary Data Model



### Test Logistic Model performance on all features

In [279]:
model = linear_model.LogisticRegression(random_state=42).fit(X_train, y_train)

In [280]:
# performance
y_pred = model.predict(X_test)

print_model_metrics(y_test, y_pred)

Accuracy: 0.6
AUC: 0.6000000000000001
Recall: 0.56
Precision: 0.6086956521739131
F1: 0.5833333333333334


### Test Logistic Model Performance on L1 Selected Features

In [271]:
# Train a logistic regression classifier
model_1 = linear_model.LogisticRegression(random_state=42).fit(X_train.loc[:, names_l1], y_train)


In [272]:
# Performance Metrics
y_pred_1 = model_1.predict(X_test.loc[:, names_l1])

print_model_metrics(y_test, y_pred_1)

Accuracy: 0.8
AUC: 0.7999999999999999
Recall: 0.92
Precision: 0.7419354838709677
F1: 0.8214285714285714


### Test Logistic Model Performance on L2 Selected Features

In [274]:
# Train a logistic regression classifier
model_2 = linear_model.LogisticRegression(random_state=42).fit(X_train.loc[:, names_l2], y_train)

In [275]:
# Performance Metrics
y_pred_2 = model_2.predict(X_test.loc[:, names_l2])

print_model_metrics(y_test, y_pred_2)

Accuracy: 0.68
AUC: 0.68
Recall: 0.6
Precision: 0.7142857142857143
F1: 0.6521739130434783


### Test Logistic Model Performance on Elasticnet Selected Features

In [276]:
# Train a logistic regression classifier
model_3 = linear_model.LogisticRegression(random_state=42).fit(X_train.loc[:, names_elastic], y_train)

In [277]:
# Performance Metrics
y_pred_3 = model_3.predict(X_test.loc[:, names_elastic])

print_model_metrics(y_test, y_pred_3)

Accuracy: 0.56
AUC: 0.5599999999999999
Recall: 0.44
Precision: 0.5789473684210527
F1: 0.5
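
With 250 rows and `test_size=0.2`, the test set holds 50 rows, so the accuracy differences above carry a fair amount of sampling noise. A rough sketch of the uncertainty, using the normal approximation for a binomial proportion (an assumption made here, not part of the original analysis), applied to the four reported accuracies:

```python
import math

n_test = 50  # 20% of the 250 merged rows
accuracies = {'all features': 0.60, 'L1': 0.80, 'L2': 0.68, 'elastic net': 0.56}

intervals = {}
for name, acc in accuracies.items():
    # Standard error of a proportion estimated from n_test observations
    se = math.sqrt(acc * (1 - acc) / n_test)
    intervals[name] = (acc - 1.96 * se, acc + 1.96 * se)
    print('{}: {:.2f} (approx. 95% interval {:.2f} to {:.2f})'.format(
        name, acc, intervals[name][0], intervals[name][1]))
```

The intervals overlap considerably, so cross-validation would give a steadier comparison than a single 50-row holdout.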


## Improved Machine Learning Model(s)

We should try several different modeling methods. I'd test the following models on the features chosen by each feature selection method:
- k-nearest neighbors classification
- decision tree classification with gini and entropy
- random forest classifier with gini and entropy
- support vector machine classifier
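
A minimal sketch of how three of these candidates could be compared with 5-fold cross-validation. It runs on synthetic data shaped like the merged dataset (250 rows, 24 features), not on the project data, and the hyperparameters are untuned scikit-learn defaults apart from the split criterion.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the merged dataset
X, y = make_classification(n_samples=250, n_features=24, n_informative=6,
                           random_state=42)

candidates = {
    'tree_gini': DecisionTreeClassifier(criterion='gini', random_state=42),
    'tree_entropy': DecisionTreeClassifier(criterion='entropy', random_state=42),
    'forest_gini': RandomForestClassifier(n_estimators=100, criterion='gini',
                                          random_state=42),
    # The SVM is scale-sensitive, so it is wrapped with a scaler
    'svm_rbf': make_pipeline(StandardScaler(), SVC()),
}

# Mean F1 across 5 folds for each candidate
scores = {name: cross_val_score(est, X, y, cv=5, scoring='f1').mean()
          for name, est in candidates.items()}
for name, score in scores.items():
    print('{}: {:.3f}'.format(name, score))
```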