# Data Exploration

__We will attempt to build a binary classifier for education deserts from census tract data, to reveal which features are most useful in determining whether a census tract is an education desert.__

---
## Binary Classifier
__In this section we first perform PCA on the large set of census tract features, then use vanilla classification models to predict whether a census tract is an education desert or not.__
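As a rough illustration of the PCA-then-classify workflow described above, here is a minimal sketch using scikit-learn. The feature matrix and labels are synthetic stand-ins, and the choices of logistic regression, a 95% explained-variance threshold, and a 75/25 split are assumptions made for this example rather than the settings used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census tract feature matrix (assumption:
# the real features and labels come from the datasets loaded later).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))

# Hypothetical binary label: 1 = education desert, 0 = not.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize, keep enough principal components to explain 95% of the
# variance, then fit a plain logistic regression on the components.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

n_components = model.named_steps["pca"].n_components_
accuracy = model.score(X_test, y_test)
print(f"components kept: {n_components}, test accuracy: {accuracy:.2f}")
```

Putting the scaler and PCA inside the pipeline means both are fit on the training split only, so the held-out score is not influenced by the test rows.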

In [4]:
# Library Imports
import json
import fiona 
import rtree
import shapely
import pandas as pd
import numpy as np
import subprocess
import os
import requests
from bs4 import BeautifulSoup
import seaborn as sns
sns.set(style="ticks")

# default dictionary (a dictionary with a default value if a key doesn't exist)
from collections import defaultdict

# To unzip file
import zipfile

# To have progress bar
from tqdm import tqdm

# plotting libraries
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('seaborn-paper')
%matplotlib inline

# Helper function to create a new folder
def mkdir(path):
    try: 
        os.makedirs(path)
    except OSError:
        if not os.path.isdir(path):
            raise
        else:
            print("(%s) already exists" % (path))

---
## Datasets

In [2]:
# Census tract data file (ACS 5-year estimates, 2012 - 2017)
ct_file_name = 'acs_5_year_estimates_census_tracts.csv'

# American university data (IPEDS)
au_file_name = 'IPEDS_data.xlsx'

# Directory of datasets
DATASETS_PATH = 'datasets/'

# JSON file for dictionary of census tracts 
# and the census tracts within 50 miles
ct_50_miles_file_name = 'ct_50_miles.json'

# CSV file containing the labels of
# each census tract being an education desert or not
education_deserts_file_name = 'education_deserts.csv'

### Census Tract Data

__Census tracts have a population of roughly 2,500 to 8,000 people.__

In [3]:
# Let's take a look at the census tract data
census_tracts = pd.read_csv(DATASETS_PATH + ct_file_name, encoding='ISO-8859-1', low_memory=False)
census_tracts.head()

Unnamed: 0,FIPS,Geographic Identifier,Name of Area,Qualifying Name,State/U.S.-Abbreviation (USPS),Summary Level,Geographic Component,File Identification,Logical Record Number,US,...,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months Below Poverty Level: Male: Not in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Employed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: in Labor Force: Unemployed,Civilian Population 16 Years and Over for Whom Poverty Status Is Determined: Income in the Past 12 Months At or Above Poverty Level: Male: Not in Labor Force,Households.1,Households with Housing Costs more than 30% of Income
0,Geo_FIPS,Geo_GEOID,Geo_NAME,Geo_QName,Geo_STUSAB,Geo_SUMLEV,Geo_GEOCOMP,Geo_FILEID,Geo_LOGRECNO,Geo_US,...,SE_T254_004,SE_T254_005,SE_T254_006,SE_T254_007,SE_T254_008,SE_T254_009,SE_T254_010,SE_T254_011,SE_T255_001,SE_T255_002
1,01001020100,14000US01001020100,"Census Tract 201, Autauga County, Alabama","Census Tract 201, Autauga County, Alabama",al,140,00,ACSSF,0001766,,...,36,7,80,1360,880,845,35,480,754,144
2,01001020200,14000US01001020200,"Census Tract 202, Autauga County, Alabama","Census Tract 202, Autauga County, Alabama",al,140,00,ACSSF,0001767,,...,59,0,204,1230,823,793,30,407,783,218
3,01001020300,14000US01001020300,"Census Tract 203, Autauga County, Alabama","Census Tract 203, Autauga County, Alabama",al,140,00,ACSSF,0001768,,...,61,3,305,2291,1491,1421,70,800,1279,357
4,01001020400,14000US01001020400,"Census Tract 204, Autauga County, Alabama","Census Tract 204, Autauga County, Alabama",al,140,00,ACSSF,0001769,,...,16,0,66,3241,1953,1833,120,1288,1749,361
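The preview above suggests that the first data row (index 0) holds variable codes such as `Geo_FIPS` and `SE_T255_001` rather than tract values. A minimal sketch of one way to handle that is shown below, assuming the real file follows the same two-row header layout. The three-column sample and the names `sample_csv`, `code_by_column`, and `tracts` are simplified, hypothetical stand-ins for this example.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV (assumption: same two-row header layout).
sample_csv = '''FIPS,Name of Area,Households
Geo_FIPS,Geo_NAME,SE_T255_001
01001020100,"Census Tract 201, Autauga County, Alabama",754
01001020200,"Census Tract 202, Autauga County, Alabama",783
'''

# Read everything as strings so FIPS codes keep their leading zeros.
df = pd.read_csv(io.StringIO(sample_csv), dtype=str)

# Row 0 holds the variable codes; keep them as a lookup, then drop the row.
code_by_column = df.iloc[0].to_dict()
tracts = df.iloc[1:].reset_index(drop=True)

# Convert the count column to a numeric dtype.
tracts["Households"] = pd.to_numeric(tracts["Households"])

print(code_by_column["Households"])
print(tracts[["FIPS", "Households"]])
```

Keeping the codes in a separate dictionary preserves the mapping from readable column names to variable codes while leaving only tract rows in the frame.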


### Dictionary of Census Tract GeoID to List of Census Tracts within 50 Miles

__Let's read in the JSON file, which maps each census tract GeoID to the list of census tracts within 50 miles of it.__

In [6]:
# Reading JSON
with open(DATASETS_PATH + ct_50_miles_file_name, 'r') as f:
    ct_50_miles_dict = json.load(f)

In [8]:
# Print the first (GeoID, nearby tracts) pair to inspect the structure
for item in ct_50_miles_dict.items():
    print(item)
    break

('29001950900', ['29001950300', '29001951000', '29001950400', '29001950500', '29001950200', '29001950100'])
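To make the structure of this dictionary concrete, the sketch below counts the tracts near each GeoID and checks whether the "within 50 miles" relation is recorded in both directions. It runs on a small stand-in dictionary: the first entry is copied from the output above, while the second entry and the name `ct_50_miles_sample` are hypothetical.

```python
# Small stand-in for ct_50_miles_dict; the first entry is copied from the
# printed output, the second is hypothetical.
ct_50_miles_sample = {
    "29001950900": ["29001950300", "29001951000", "29001950400",
                    "29001950500", "29001950200", "29001950100"],
    "29001950300": ["29001950900"],  # hypothetical entry
}

# Number of tracts within 50 miles of each tract.
neighbor_counts = {
    geoid: len(nearby) for geoid, nearby in ct_50_miles_sample.items()
}

# Pairs (a, b) where b is listed near a, b has its own entry,
# but a is not listed near b.
asymmetric = [
    (a, b)
    for a, nearby in ct_50_miles_sample.items()
    for b in nearby
    if b in ct_50_miles_sample and a not in ct_50_miles_sample[b]
]

print(neighbor_counts)
print(asymmetric)
```

The same two comprehensions should work on the full `ct_50_miles_dict`, since it uses the same key-to-list layout.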
