# Data Exploration

__We will attempt to build a binary classifier for education deserts from census tract data, to reveal which features are most useful in determining whether a census tract is an education desert.__

---
## Binary Classifier
__In this section we will first perform PCA on the large dataset of census tract features, and afterwards use vanilla classification models to predict whether a census tract is an education desert or not__

In [4]:
# Library Imports
import json
import fiona 
import rtree
import shapely
import pandas as pd
import numpy as np
import subprocess
import os
import requests
from bs4 import BeautifulSoup
import seaborn as sns
sns.set(style="ticks")

# default dictionary (a dictionary with a default value if a key doesn't exist)
from collections import defaultdict

# To unzip file
import zipfile

# To have progress bar
from tqdm import tqdm

# plotting libraries
import matplotlib
import matplotlib.pyplot as plt
# Note: Matplotlib 3.6+ renames this style to 'seaborn-v0_8-paper'
plt.style.use('seaborn-paper')
%matplotlib inline

# Helper function to create a new folder
def mkdir(path):
    try: 
        os.makedirs(path)
    except OSError:
        if not os.path.isdir(path):
            raise
        else:
            print("(%s) already exists" % (path))

---
## Datasets

In [2]:
# Census tract data file (ACS 5-year estimates, 2012 - 2017)
ct_file_name = 'acs_5_year_estimates_census_tracts.csv'

# American university data (IPEDS)
au_file_name = 'IPEDS_data.xlsx'

# Directory of datasets
DATASETS_PATH = 'datasets/'

# JSON file for dictionary of census tracts 
# and the census tracts within 50 miles
ct_50_miles_file_name = 'ct_50_miles.json'

# CSV file containing the labels of
# each census tract being an education desert or not
education_deserts_file_name = 'education_deserts.csv'

### Census Tract Data

__Census tracts have a population of around 2,500 to 8,000 people__

In [104]:
# Let's take a look at the census tract data
census_tracts = pd.read_csv(DATASETS_PATH + ct_file_name, encoding='ISO-8859-1', low_memory=False, index_col='FIPS')
census_tracts.head()

Unnamed: 0_level_0,Geographic Identifier,Name of Area,Qualifying Name,State/U.S.-Abbreviation (USPS),Summary Level,Geographic Component,File Identification,Logical Record Number,US,Region,...,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: Not in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: Not in Labor Force,Households.1,Households with Housing Costs more than 30% of Income
FIPS,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Geo_FIPS,Geo_GEOID,Geo_NAME,Geo_QName,Geo_STUSAB,Geo_SUMLEV,Geo_GEOCOMP,Geo_FILEID,Geo_LOGRECNO,Geo_US,Geo_REGION,...,SE_T254_004,SE_T254_005,SE_T254_006,SE_T254_007,SE_T254_008,SE_T254_009,SE_T254_010,SE_T254_011,SE_T255_001,SE_T255_002
01001020100,14000US01001020100,"Census Tract 201, Autauga County, Alabama","Census Tract 201, Autauga County, Alabama",al,140,00,ACSSF,0001766,,,...,36,7,80,1360,880,845,35,480,754,144
01001020200,14000US01001020200,"Census Tract 202, Autauga County, Alabama","Census Tract 202, Autauga County, Alabama",al,140,00,ACSSF,0001767,,,...,59,0,204,1230,823,793,30,407,783,218
01001020300,14000US01001020300,"Census Tract 203, Autauga County, Alabama","Census Tract 203, Autauga County, Alabama",al,140,00,ACSSF,0001768,,,...,61,3,305,2291,1491,1421,70,800,1279,357
01001020400,14000US01001020400,"Census Tract 204, Autauga County, Alabama","Census Tract 204, Autauga County, Alabama",al,140,00,ACSSF,0001769,,,...,16,0,66,3241,1953,1833,120,1288,1749,361


__Let's look at what features are in this dataframe__

In [105]:
# The dataframe has 2161 features, so start by dropping the ones we will
# definitely not use. The leading columns look like geographic identifiers
# with no predictive value, so print the first 101 column names to inspect
# them and to spot any columns that hold categorical data.
for feat in census_tracts.columns[:101]:
    print(feat)

Geographic Identifier
Name of Area
Qualifying Name
State/U.S.-Abbreviation (USPS)
Summary Level
Geographic Component
File Identification
Logical Record Number
US
Region
Division
State (Census Code)
State (FIPS)
County
County Subdivision (FIPS)
Place (FIPS Code)
Place (State FIPS + Place FIPS)
Census Tract
Block Group
Consolidated City
American Indian Area/Alaska Native Area/Hawaiian Home Land (Census)
American Indian Area/Alaska Native Area/Hawaiian Home Land (FIPS)
American Indian Trust Land/Hawaiian Home Land Indicator
American Indian Tribal Subdivision (Census)
American Indian Tribal Subdivision (FIPS)
Alaska Native Regional Corporation (FIPS)
Metropolitan and Micropolitan Statistical Area
Combined Statistical Area
Metropolitan Division
Metropolitan Area Central City
Metropolitan/Micropolitan Indicator Flag
New England City and Town Combined Statistical Area
New England City and Town Area
New England City and Town Area Division
Urban Area
Urban Area Central Place
Current Congression

In [106]:
census_tracts['Public Use Microdata Area - 1% File'].head()

FIPS
Geo_FIPS       Geo_PUMA1
01001020100          NaN
01001020200          NaN
01001020300          NaN
01001020400          NaN
Name: Public Use Microdata Area - 1% File, dtype: object

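__Everything before `Total Population` looks like identifier metadata rather than a useful feature, so let's drop it, along with `Total Population.1`, `Total Population:` and `Total Population:.1`, which appear to repeat `Total Population`. The first row (just internal ID labels for each feature) is removed in the cell after this one__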

In [107]:
census_tracts = census_tracts.loc[:, 'Total Population':].drop(['Total Population.1', 'Total Population:', 'Total Population:.1'], axis=1)
census_tracts.head()

Unnamed: 0_level_0,Total Population,Population Density (Per Sq. Mile),Area (Land),Area Total:,Area Total: Area (Land),Area Total: Area (Water),Total Population: Male,Total Population: Female,Total Population: Male.1,Total Population: Male: Under 5 Years,...,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: Not in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: Not in Labor Force,Households.1,Households with Housing Costs more than 30% of Income
FIPS,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Geo_FIPS,SE_T001_001,SE_T002_002,SE_T002_003,SE_T003_001,SE_T003_002,SE_T003_003,SE_T004_002,SE_T004_003,SE_T005_002,SE_T005_003,...,SE_T254_004,SE_T254_005,SE_T254_006,SE_T254_007,SE_T254_008,SE_T254_009,SE_T254_010,SE_T254_011,SE_T255_001,SE_T255_002
01001020100,1845,487.1106,3.78764071493768,3.801661,3.787641,0.01402014,899,946,899,39,...,36,7,80,1360,880,845,35,480,754,144
01001020200,2172,1684.013,1.28977624606755,1.292033,1.289776,0.002257153,1167,1005,1167,48,...,59,0,204,1230,823,793,30,407,783,218
01001020300,3385,1638.934,2.06536632602159,2.068862,2.065366,0.003495769,1533,1852,1533,65,...,61,3,305,2291,1491,1421,70,800,1279,357
01001020400,4267,1731.473,2.46437628282448,2.470648,2.464376,0.006271844,2001,2266,2001,119,...,16,0,66,3241,1953,1833,120,1288,1749,361
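
The label-based slice used above can be tried on a small made-up frame. This is only a sketch: the toy values and the reduced set of column names are assumptions for illustration, not the real census data. It shows that a `.loc` label slice keeps its starting column, and that a suspected duplicate can be checked before it is dropped.

```python
import pandas as pd

# Toy frame (made-up values). The ".1" suffix mimics how pandas renames
# a repeated header when reading a CSV.
toy = pd.DataFrame({
    "Geographic Identifier": ["14000US01001020100", "14000US01001020200"],
    "Region": [None, None],
    "Total Population": [1845, 2172],
    "Total Population.1": [1845, 2172],
    "Area (Land)": [3.79, 1.29],
})

# Confirm the suspected duplicate really carries the same values.
is_duplicate = toy["Total Population"].equals(toy["Total Population.1"])

# Label slices with .loc include the starting label itself.
trimmed = toy.loc[:, "Total Population":]

# Drop the duplicated column by name.
trimmed = trimmed.drop(columns=["Total Population.1"])

print(is_duplicate)
print(list(trimmed.columns))
```

Checking equality first guards against dropping a column that only looks like a duplicate by name.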


In [108]:
census_tracts = census_tracts.drop('Geo_FIPS', axis=0)
census_tracts.head()

Unnamed: 0_level_0,Total Population,Population Density (Per Sq. Mile),Area (Land),Area Total:,Area Total: Area (Land),Area Total: Area (Water),Total Population: Male,Total Population: Female,Total Population: Male.1,Total Population: Male: Under 5 Years,...,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: Not in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: Not in Labor Force,Households.1,Households with Housing Costs more than 30% of Income
FIPS,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1001020100,1845,487.1106,3.78764071493768,3.801661,3.787641,0.01402014,899,946,899,39,...,36,7,80,1360,880,845,35,480,754,144
1001020200,2172,1684.013,1.28977624606755,1.292033,1.289776,0.002257153,1167,1005,1167,48,...,59,0,204,1230,823,793,30,407,783,218
1001020300,3385,1638.934,2.06536632602159,2.068862,2.065366,0.003495769,1533,1852,1533,65,...,61,3,305,2291,1491,1421,70,800,1279,357
1001020400,4267,1731.473,2.46437628282448,2.470648,2.464376,0.006271844,2001,2266,2001,119,...,16,0,66,3241,1953,1833,120,1288,1749,361
1001020500,9965,2264.419,4.4006864124467,4.419378,4.400686,0.01869198,5054,4911,5054,333,...,341,21,385,6551,4539,4446,93,2012,4194,1456


__Let's proceed to check if any columns contain missing data__

In [109]:
# Count the columns that contain at least one missing value
contain_nans = census_tracts.isnull().any().sum()
print(contain_nans)

308


__Let's find the columns with more than 25% missing data and remove them, since they would be unreliable features for us to use__

In [110]:
bad_col_inds = (census_tracts.isna().sum() / census_tracts.shape[0]) > 0.25
census_tracts = census_tracts[census_tracts.columns[~bad_col_inds]]
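
To see what the 25% threshold does, here is a minimal sketch on a made-up frame (the column names and values are hypothetical). `isna().mean()` gives the same per-column fraction as dividing the NaN count by the number of rows.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column.
toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0, 5.0],        # 1 of 5 missing
    "mostly_missing": [np.nan, np.nan, 3.0, np.nan, 5.0],  # 3 of 5 missing
    "complete": [1.0, 2.0, 3.0, 4.0, 5.0],                 # none missing
})

# Fraction of missing values in each column.
missing_fraction = toy.isna().mean()

# Keep only the columns at or below the 25% threshold.
keep = missing_fraction <= 0.25
filtered = toy.loc[:, keep]

print(missing_fraction)
print(list(filtered.columns))
```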

__We will impute the most frequent value for the columns with NaNs__

In [111]:
# Taking care of missing data
# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
# sklearn.impute.SimpleImputer is its replacement
from sklearn.impute import SimpleImputer

# Columns that still contain at least one NaN
nan_cols = census_tracts.columns[census_tracts.isna().any()]
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
census_tracts.loc[:, nan_cols] = imputer.fit_transform(census_tracts[nan_cols])
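
As a quick sanity check of most-frequent imputation, the sketch below fills NaNs in a small made-up frame. It assumes a scikit-learn version that ships `sklearn.impute.SimpleImputer` (0.20 or later); the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame: "a" and "b" each have one missing value, "c" has none.
toy = pd.DataFrame({
    "a": [1.0, 1.0, np.nan, 3.0],   # most frequent value is 1.0
    "b": [5.0, np.nan, 7.0, 7.0],   # most frequent value is 7.0
    "c": [0.0, 1.0, 2.0, 3.0],
})

# Only touch the columns that actually contain NaNs.
cols_with_nans = toy.columns[toy.isna().any()].tolist()

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
toy[cols_with_nans] = imputer.fit_transform(toy[cols_with_nans])

print(toy)
```

For heavily skewed count data the mode can be a poor stand-in, so a median strategy may be worth comparing on the real columns.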



### Dictionary of Census Tract GeoID to List of Census Tracts within 50 Miles

__Let's read in the JSON file to load the dictionary of which census tracts are within 50 miles__

In [112]:
# Reading JSON
with open(DATASETS_PATH + ct_50_miles_file_name, 'r') as f:
    ct_50_miles_dict = json.load(f)
    
# Peek at the first (geoID, neighbours) entry
print(next(iter(ct_50_miles_dict.items())))

('29001950900', ['29001950300', '29001951000', '29001950400', '29001950500', '29001950200', '29001950100'])
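
A small sketch of how this dictionary might be queried, using only the single entry printed above. The helper `tracts_within_50_miles` is a hypothetical name introduced for illustration, and the integer-keyed copy is an assumption about what could be convenient when working with integer geoIDs.

```python
# The one entry shown in the output above; the real file holds many more.
ct_50_miles_sample = {
    "29001950900": ["29001950300", "29001951000", "29001950400",
                    "29001950500", "29001950200", "29001950100"],
}

def tracts_within_50_miles(geoid, lookup):
    """Return the neighbouring tract IDs for geoid, or an empty list if unknown."""
    return lookup.get(geoid, [])

neighbours = tracts_within_50_miles("29001950900", ct_50_miles_sample)
missing = tracts_within_50_miles("00000000000", ct_50_miles_sample)

# JSON keys are always strings; build an integer-keyed copy for integer geoIDs.
int_keyed = {int(k): [int(v) for v in vs] for k, vs in ct_50_miles_sample.items()}

print(len(neighbours), missing)
```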


### DataFrame of Education desert labels for each census tract

__Let's read in the CSV File to load the pandas dataframe of education desert labels__

In [113]:
# Reading CSV
edu_deserts = pd.read_csv(DATASETS_PATH + education_deserts_file_name, index_col=0)
edu_deserts.head()

Unnamed: 0,Number of Accessible Universities,Education Desert
29001950900,1,0
29007950100,0,1
29009960100,0,1
29019001201,3,0
29021000600,1,0


In [114]:
edu_deserts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 73874 entries, 29001950900 to 35043010719
Data columns (total 2 columns):
Number of Accessible Universities    73874 non-null int64
Education Desert                     73874 non-null int64
dtypes: int64(2)
memory usage: 1.7 MB


In [115]:
census_tracts.info()

<class 'pandas.core.frame.DataFrame'>
Index: 74001 entries, 01001020100 to 72153750602
Columns: 2051 entries, Total Population to Households with Housing Costs more than 30% of Income
dtypes: object(2051)
memory usage: 1.1+ GB


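__Let's convert the `census_tracts` index to int64 so that it can be matched against the index of the education deserts dataframe, which is already int64. Note that this drops the leading zero from FIPS codes such as `01001020100`__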

In [116]:
census_tracts.index = census_tracts.index.astype('int64')
census_tracts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 74001 entries, 1001020100 to 72153750602
Columns: 2051 entries, Total Population to Households with Housing Costs more than 30% of Income
dtypes: object(2051)
memory usage: 1.1+ GB
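
The effect of the integer conversion on FIPS codes can be sketched on two of the IDs seen in this notebook. Restoring the string form with `zfill(11)` assumes the standard 11-digit tract GEOID (2-digit state, 3-digit county, 6-digit tract).

```python
import pandas as pd

# Two tract IDs as they appear in the CSV (strings, one with a leading zero).
fips = pd.Index(["01001020100", "29001950900"])

# Converting to int64 drops the leading zero of the Alabama tract.
as_int = fips.astype("int64")

# Zero-padding back to 11 characters recovers the original string form.
restored = as_int.astype(str).str.zfill(11)

print(list(as_int))
print(list(restored))
```

Keeping both indexes as zero-padded strings would be an alternative way to make the two dataframes comparable without losing the state prefix.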


__Let's do an inner join of the education desert labels and the census tract data, so that we only keep the census tracts that have a label saying whether it's an education desert or not__

In [117]:
merged_census_df = census_tracts.merge(right=edu_deserts, how="inner", left_on=census_tracts.index, right_on=edu_deserts.index)
merged_census_df.head()

Unnamed: 0,key_0,Total Population,Population Density (Per Sq. Mile),Area (Land),Area Total:,Area Total: Area (Land),Area Total: Area (Water),Total Population: Male,Total Population: Female,Total Population: Male.1,...,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: Not in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: Not in Labor Force,Households.1,Households with Housing Costs more than 30% of Income,Number of Accessible Universities,Education Desert
0,1001020100,1845,487.111,3.78764,3.80166,3.78764,0.0140201,899,946,899,...,80,1360,880,845,35,480,754,144,0,1
1,1001020200,2172,1684.01,1.28978,1.29203,1.28978,0.00225715,1167,1005,1167,...,204,1230,823,793,30,407,783,218,0,1
2,1001020300,3385,1638.93,2.06537,2.06886,2.06537,0.00349577,1533,1852,1533,...,305,2291,1491,1421,70,800,1279,357,0,1
3,1001020400,4267,1731.47,2.46438,2.47065,2.46438,0.00627184,2001,2266,2001,...,66,3241,1953,1833,120,1288,1749,361,0,1
4,1001020500,9965,2264.42,4.40069,4.41938,4.40069,0.018692,5054,4911,5054,...,385,6551,4539,4446,93,2012,4194,1456,0,1
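
For comparison, an index-on-index inner join can also be written with `left_index=True, right_index=True`, which keeps the shared index instead of producing a `key_0` column. The frames below are toy data with made-up labels, meant only as a sketch of that alternative.

```python
import pandas as pd

# Toy feature table with three tracts.
features = pd.DataFrame(
    {"Total Population": [1845, 2172, 3385]},
    index=pd.Index([1001020100, 1001020200, 1001020300], name="geoID"),
)

# Toy label table that is missing the middle tract.
labels = pd.DataFrame(
    {"Education Desert": [1, 0]},
    index=pd.Index([1001020100, 1001020300], name="geoID"),
)

# Inner join on the index: only tracts present in both frames survive.
merged = features.merge(labels, how="inner", left_index=True, right_index=True)

# Tracts that have features but no label, and are therefore dropped.
unlabelled = features.index.difference(labels.index)

print(merged)
print(list(unlabelled))
```

Listing the unmatched tracts is a cheap way to confirm how many rows the inner join discards.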


In [118]:
merged_census_df.index = merged_census_df.iloc[:, 0]
merged_census_df.index.name = 'geoID'
merged_census_df.drop('key_0', axis=1, inplace=True)
merged_census_df.head()

Unnamed: 0_level_0,Total Population,Population Density (Per Sq. Mile),Area (Land),Area Total:,Area Total: Area (Land),Area Total: Area (Water),Total Population: Male,Total Population: Female,Total Population: Male.1,Total Population: Male: Under 5 Years,...,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: Not in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: Not in Labor Force,Households.1,Households with Housing Costs more than 30% of Income,Number of Accessible Universities,Education Desert
geoID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1001020100,1845,487.111,3.78764,3.80166,3.78764,0.0140201,899,946,899,39,...,80,1360,880,845,35,480,754,144,0,1
1001020200,2172,1684.01,1.28978,1.29203,1.28978,0.00225715,1167,1005,1167,48,...,204,1230,823,793,30,407,783,218,0,1
1001020300,3385,1638.93,2.06537,2.06886,2.06537,0.00349577,1533,1852,1533,65,...,305,2291,1491,1421,70,800,1279,357,0,1
1001020400,4267,1731.47,2.46438,2.47065,2.46438,0.00627184,2001,2266,2001,119,...,66,3241,1953,1833,120,1288,1749,361,0,1
1001020500,9965,2264.42,4.40069,4.41938,4.40069,0.018692,5054,4911,5054,333,...,385,6551,4539,4446,93,2012,4194,1456,0,1


__OK cool, we've now removed the missing values and matched the observations (census tracts) to their respective labels. Next, let's standardize the data, because values like `Total Population` are in the thousands while the `Area (Land)` values in the rows shown are in single digits__

---
## Standardization

In [119]:
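# Allocating our X (independent vars) and y (dependent var) data
# NOTE: X still includes 'Number of Accessible Universities', the count that
# 'Education Desert' appears to be derived from, so it may leak the label
X = merged_census_df.iloc[:, :-1]
y = merged_census_df.iloc[:, -1]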
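
One thing worth checking at this step is label leakage. In the rows shown earlier, `Education Desert` is 1 exactly when `Number of Accessible Universities` is 0, so a positional slice that keeps the count column hands the classifier the answer. The sketch below selects features by name on made-up data and assumes that relationship holds, which is not confirmed for the full dataset.

```python
import pandas as pd

# Toy frame (made-up values) with the same two trailing columns as the merged data.
toy = pd.DataFrame({
    "Total Population": [1845, 2172, 3385, 4267],
    "Area (Land)": [3.79, 1.29, 2.07, 2.46],
    "Number of Accessible Universities": [0, 1, 0, 3],
    "Education Desert": [1, 0, 1, 0],
})

label_col = "Education Desert"
leaky_cols = ["Number of Accessible Universities"]  # assumed source of the label

# Does the count fully determine the label in this toy data?
label_from_count = (toy["Number of Accessible Universities"] == 0).astype(int)
is_leaky = bool((label_from_count == toy[label_col]).all())

# Select features by name, leaving out the label and the leaky column.
X_toy = toy.drop(columns=[label_col] + leaky_cols)
y_toy = toy[label_col]

print(is_leaky, list(X_toy.columns))
```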

In [122]:
# Import standard scaler module
from sklearn.preprocessing import StandardScaler

X_standard_scaler = StandardScaler()
X = pd.DataFrame(X_standard_scaler.fit_transform(X), index=X.index, columns=X.columns)

X.head()



Unnamed: 0_level_0,Total Population,Population Density (Per Sq. Mile),Area (Land),Area Total:,Area Total: Area (Land),Area Total: Area (Water),Total Population: Male,Total Population: Female,Total Population: Male.1,Total Population: Male: Under 5 Years,...,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: Not in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: Not in Labor Force,Households.1,Households with Housing Costs more than 30% of Income,Number of Accessible Universities
geoID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1001020100,-1.152046,-0.409075,-0.081913,-0.080591,-0.081913,-0.033275,-1.135353,-1.129804,-1.135353,-0.90472,...,-0.708636,-0.778694,-1.003444,-0.974553,-0.957812,-0.797438,-0.850361,-1.107306,-1.099164,-0.078863
1001020200,-1.004599,-0.309729,-0.086577,-0.08494,-0.086577,-0.033424,-0.895048,-1.07808,-0.895048,-0.823184,...,-0.829505,-0.250723,-1.085532,-1.024327,-1.005029,-0.862604,-0.981578,-1.07056,-0.848354,-0.078863
1001020300,-0.457645,-0.31347,-0.085129,-0.083594,-0.085129,-0.033408,-0.56687,-0.335536,-0.56687,-0.669172,...,-0.777704,0.179318,-0.415564,-0.441008,-0.434797,-0.341277,-0.275163,-0.442076,-0.377238,-0.078863
1001020400,-0.059943,-0.305789,-0.084384,-0.082897,-0.084384,-0.033373,-0.147233,0.027408,-0.147233,-0.179955,...,-0.829505,-0.838303,0.184312,-0.037576,-0.060696,0.310382,0.602014,0.153463,-0.363681,-0.078863
1001020500,2.509341,-0.261554,-0.080768,-0.07952,-0.080768,-0.033216,2.590273,2.346216,2.590273,1.758792,...,-0.466899,0.519944,2.27441,2.2206,2.31194,-0.041514,1.903399,3.251532,3.34763,-0.078863
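
What `StandardScaler` computes can be checked by hand on a small made-up frame: each column has its mean subtracted and is divided by its population standard deviation (`ddof=0`). The values below are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with columns on very different scales.
toy = pd.DataFrame({
    "Total Population": [1845.0, 2172.0, 3385.0, 4267.0, 9965.0],
    "Area (Land)": [3.79, 1.29, 2.07, 2.46, 4.40],
})

# Scale with scikit-learn, keeping the index and column names.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy), index=toy.index, columns=toy.columns
)

# The same z-score computed by hand with the population standard deviation.
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual))
```

If test performance should be measured honestly, fitting the scaler on the training split only and reusing it on the test split is the usual precaution.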


__Sweet, we've finally got a nice dataset to work with now, let's just jump in and perform PCA first to reduce the number of features we're working with__

---
## Principal Component Analysis

In [None]:
# Import sklearn decomposition module
from sklearn import decomposition
# n_components='mle' lets PCA estimate the number of components (Minka's MLE)
pca = decomposition.PCA(n_components='mle')

X = pca.fit_transform(X)

print("New feature set shape:", X.shape)
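
To get a feel for how many components PCA keeps, the sketch below uses a different selection rule from the cell above: instead of `'mle'`, it passes a variance fraction (`n_components=0.95`), which keeps the fewest components explaining at least 95% of the variance. The data is synthetic, built from three latent factors, so it only illustrates the mechanics.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# Keep the fewest components that explain at least 95% of the variance.
pca_demo = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_demo.fit_transform(X_demo)

# Cumulative share of variance explained by the retained components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print("Reduced shape:", X_reduced.shape)
print("Cumulative explained variance:", cumulative)
```

Inspecting `explained_variance_ratio_` on the real data would show whether the automatically chosen dimension is a meaningful reduction from the original feature count.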

---
## Vanilla Classifiers

__Let's just try out a few classifiers straight out of sklearn on the dataset__

__Train Test Split__

In [None]:
# Import train_test_split module from sklearn
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
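
If the education desert label turns out to be imbalanced, a stratified split keeps the class ratio the same in both parts. This is a sketch on synthetic data with an assumed 20% positive rate; the `stratify` argument is an addition and is not used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary label (20% positives).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = np.array([1] * 60 + [0] * 240)

# stratify=y_demo preserves the class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```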

__We will first try out the following models:__
1. Multi-Layer Perceptron
2. K-Nearest Neighbours
3. Support Vector Classifier
4. Gaussian Process Classifier
5. Decision Tree Classifier
6. Random Forest Classifier
7. AdaBoost Classifier
8. Gaussian Naive Bayes Classifier

In [None]:
# Import models from sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
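
A minimal sketch of how a handful of these models could be fitted and compared in one loop. It uses synthetic data from `make_classification` rather than the census features, default hyperparameters, and only four of the listed models to keep it quick. The Gaussian process classifier is left out because its training cost grows roughly cubically with the number of samples, which is likely to be impractical for tens of thousands of tracts.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the PCA-reduced features and binary labels.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# A few of the listed models, all with default hyperparameters.
models = {
    "K-Nearest Neighbours": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Fit each model and record its accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(f"{name}: {scores[name]:.3f}")
```

Accuracy alone can be misleading when one class dominates, so a confusion matrix or F1 score may be a useful companion metric on the real labels.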