# **Lab 3: The Multi-Layer Perceptron**
### Authors: Will Lahners, Edward Powers, and Nino Castellano
________________________________________________________________

## **Describing the Data**

This Kaggle dataset contains US Census data taken from the DP03 and DP05 tables of the 2017 American Community Survey 5-year estimates. We use the *acs2017_census_tract_data.csv* file, which holds one row per census tract in the US, including DC and Puerto Rico. A tract ID, also known as a GEOID (Geographic Identifier), is a numeric code assigned to specific geographic areas by the Census Bureau and other state and federal agencies. These codes uniquely identify the administrative, legal, and statistical geographic entities for which the Census Bureau collects and tabulates data. Our classification task is:

- Predicting, for each tract ID, what the child poverty rate will be. 

We are converting this from regression to four levels of classification by quantizing the variable of interest. 

## **Load, Split, and Balance (1.5 points total)**

***[.5 points]** **(1)** Load the data into memory and save it to a pandas data frame. Do not normalize or one-hot encode any of the features until asked to do so later in the rubric. **(2)** Remove any observations that have missing data. **(3)** Encode any string data as integers for now. **(4)** You have the option of keeping the "county" variable or removing it. Be sure to discuss why you decided to keep/remove this variable.*

We decided to remove the "County" variable because our primary focus is on individual tracts (TractId). County adds little beyond that, since many tracts share the same county. The same could be said of "State", but State takes only 52 distinct values (the 50 states plus DC and Puerto Rico), whereas County takes far more. One-hot encoding County would therefore add many more columns and cost more computation, which is another reason we removed it and kept State.
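
To make the column-count argument concrete, here is a minimal sketch on a toy frame. The rows and the `toy` variable are made up for illustration and are not taken from the census file; the point is only that one-hot encoding adds one column per distinct value.

```python
import pandas as pd

# Toy stand-in for the census data; the rows are illustrative only.
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Texas", "Texas", "Texas"],
    "County": ["Autauga County", "Baldwin County", "Harris County",
               "Dallas County", "Travis County"],
})

# One-hot encoding creates one column per distinct value, so nunique()
# tells us how wide each encoded feature would become.
n_state_cols = toy["State"].nunique()
n_county_cols = toy["County"].nunique()

state_dummies = pd.get_dummies(toy["State"])
county_dummies = pd.get_dummies(toy["County"])

print("State columns: ", n_state_cols, state_dummies.shape[1])
print("County columns:", n_county_cols, county_dummies.shape[1])
```

Running `nunique()` on the real "State" and "County" columns before dropping "County" would give the actual widths for this dataset.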



In [48]:
import pandas as pd
import numpy as np

# (1) Load the data into a pandas DataFrame
data = pd.read_csv('./acs2017_census_tract_data.csv')

# (2) Remove observations that have missing data
data = data.dropna()

# (3) Encode any string data as integers for now (Credits to ChatGPT)
states = data['State'].unique()
state_to_int_mapping = {state: idx + 1 for idx, state in enumerate(states)} 
data['State'] = data['State'].map(state_to_int_mapping)

# (4) Removing "County" variable
data = data.drop(columns=['County']) 

pd.set_option('display.max_columns', None)
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
Index: 72718 entries, 0 to 74000
Data columns (total 36 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   TractId           72718 non-null  int64  
 1   State             72718 non-null  int64  
 2   TotalPop          72718 non-null  int64  
 3   Men               72718 non-null  int64  
 4   Women             72718 non-null  int64  
 5   Hispanic          72718 non-null  float64
 6   White             72718 non-null  float64
 7   Black             72718 non-null  float64
 8   Native            72718 non-null  float64
 9   Asian             72718 non-null  float64
 10  Pacific           72718 non-null  float64
 11  VotingAgeCitizen  72718 non-null  int64  
 12  Income            72718 non-null  float64
 13  IncomeErr         72718 non-null  float64
 14  IncomePerCap      72718 non-null  float64
 15  IncomePerCapErr   72718 non-null  float64
 16  Poverty           72718 non-null  float64
 17

Unnamed: 0,TractId,State,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Pacific,VotingAgeCitizen,Income,IncomeErr,IncomePerCap,IncomePerCapErr,Poverty,ChildPoverty,Professional,Service,Office,Construction,Production,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
0,1001020100,1,1845,899,946,2.4,86.3,5.2,0.0,1.2,0.0,1407,67826.0,14560.0,33018.0,6294.0,10.7,20.8,38.5,15.6,22.8,10.8,12.4,94.2,3.3,0.0,0.5,0.0,2.1,24.5,881,74.2,21.2,4.5,0.0,4.6
1,1001020200,1,2172,1167,1005,1.1,41.6,54.5,0.0,1.0,0.0,1652,41287.0,3819.0,18996.0,2453.0,22.4,35.8,30.5,24.9,22.9,6.3,15.4,90.5,9.1,0.0,0.0,0.5,0.0,22.2,852,75.9,15.0,9.0,0.0,3.4
2,1001020300,1,3385,1533,1852,8.0,61.4,26.5,0.6,0.7,0.4,2480,46806.0,9496.0,21236.0,2562.0,14.7,21.1,27.9,19.4,33.3,9.9,9.6,88.3,8.4,0.0,1.0,0.8,1.5,23.1,1482,73.3,21.1,4.8,0.7,4.7
3,1001020400,1,4267,2001,2266,9.6,80.3,7.1,0.5,0.2,0.0,3257,55895.0,4369.0,28068.0,3190.0,2.3,1.7,29.0,16.6,25.8,9.1,19.5,82.3,11.2,0.0,1.5,2.9,2.1,25.9,1849,75.8,19.7,4.5,0.0,6.1
4,1001020500,1,9965,5054,4911,0.9,77.5,16.4,0.0,3.1,0.0,7229,68143.0,14424.0,36905.0,10706.0,12.2,17.9,48.8,13.8,20.5,3.5,13.4,86.9,11.2,0.0,0.8,0.3,0.7,21.0,4787,71.4,24.1,4.5,0.0,2.3


*The next two requirements will need to be completed together as they might depend on one another:*

***[.5 points]** Balance the dataset so that about the same number of instances are within each class. Choose a method for balancing the dataset and explain your reasoning for selecting this method. One option is to choose quantization thresholds for the "ChildPoverty" variable that equally divide the data into four classes. Should balancing of the dataset be done for both the training and testing set? Explain.*

We followed the balancing method suggested in the instructions and quantized the "ChildPoverty" variable into four classes. Using the quartiles of "ChildPoverty" as thresholds places about the same number of tracts in each class, so no resampling is needed to balance the data. It also turns the poverty rate into discrete levels (low, medium, high, and very high) that are easy to interpret, which helps policymakers see how severe child poverty is in different areas and target interventions accordingly.
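
As a rough illustration of why quartile thresholds balance the classes, the sketch below applies `pd.qcut` to synthetic, right-skewed values. The gamma-distributed `rates` series is an assumption standing in for "ChildPoverty", not the real column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for a poverty-rate column (illustrative only).
rates = pd.Series(rng.gamma(shape=2.0, scale=10.0, size=1000))

# q=4 cuts at the quartiles; retbins=True also returns the threshold values.
classes, edges = pd.qcut(
    rates, q=4, labels=["Low", "Medium", "High", "Very High"], retbins=True
)

# Count how many samples landed in each class, in Low-to-Very-High order.
counts = classes.value_counts().sort_index()

print("Thresholds:", edges)
print(counts)
```

Even though the values are heavily skewed, the class counts come out nearly equal, because the thresholds are placed by rank rather than by value. Real data with many tied values (for example many tracts at exactly 0) may split less evenly.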

Regarding whether to balance both the training and testing sets, we believe only the training set should be balanced. Balancing the training set keeps the model from favoring the most common classes, while the testing set should reflect the true class distribution of real-world data so that the evaluation is unbiased. Balancing should therefore typically be applied to the training set only.
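
One way to apply this idea to quantization, sketched here under assumptions and not the approach used in this notebook, is to compute the quartile thresholds from the training portion only and reuse them on the testing portion. The synthetic `rates` series and the positional 80/20 split are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for a poverty-rate column (illustrative only).
rates = pd.Series(rng.gamma(shape=2.0, scale=10.0, size=1000))

# Hypothetical 80/20 split by position; the values are already in random order.
train_rates, test_rates = rates.iloc[:800], rates.iloc[800:]

# Quartile thresholds are computed from the training portion only.
inner_edges = train_rates.quantile([0.25, 0.5, 0.75]).to_numpy()

# Open-ended outer bins so unseen test values always fall into some class.
edges = np.concatenate(([-np.inf], inner_edges, [np.inf]))

# labels=False returns the integer bin index (0 = lowest class).
train_cls = pd.cut(train_rates, bins=edges, labels=False)
test_cls = pd.cut(test_rates, bins=edges, labels=False)

print(train_cls.value_counts().sort_index())
print(test_cls.value_counts().sort_index())
```

The training classes are balanced by construction, while the testing classes keep whatever distribution the held-out data happens to have.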

In [49]:
# Use quartile thresholds to split ChildPoverty into four balanced classes
data['ChildPovertyClass'] = pd.qcut(data['ChildPoverty'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])

# Drop the original ChildPoverty variable since we now have the 4 classes
data.drop(['ChildPoverty'], axis = 1, inplace = True)

# Transform the labels into discrete integers 1-4, assigned in order of
# first appearance in the data (so they do not follow the Low-to-Very-High order)
classes = data['ChildPovertyClass'].unique()
class_to_int_mapping = {i: idx + 1 for idx, i in enumerate(classes)} 
data['ChildPovertyClass'] = data['ChildPovertyClass'].map(class_to_int_mapping)

data.head()

Unnamed: 0,TractId,State,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Pacific,VotingAgeCitizen,Income,IncomeErr,IncomePerCap,IncomePerCapErr,Poverty,Professional,Service,Office,Construction,Production,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment,ChildPovertyClass
0,1001020100,1,1845,899,946,2.4,86.3,5.2,0.0,1.2,0.0,1407,67826.0,14560.0,33018.0,6294.0,10.7,38.5,15.6,22.8,10.8,12.4,94.2,3.3,0.0,0.5,0.0,2.1,24.5,881,74.2,21.2,4.5,0.0,4.6,1
1,1001020200,1,2172,1167,1005,1.1,41.6,54.5,0.0,1.0,0.0,1652,41287.0,3819.0,18996.0,2453.0,22.4,30.5,24.9,22.9,6.3,15.4,90.5,9.1,0.0,0.0,0.5,0.0,22.2,852,75.9,15.0,9.0,0.0,3.4,2
2,1001020300,1,3385,1533,1852,8.0,61.4,26.5,0.6,0.7,0.4,2480,46806.0,9496.0,21236.0,2562.0,14.7,27.9,19.4,33.3,9.9,9.6,88.3,8.4,0.0,1.0,0.8,1.5,23.1,1482,73.3,21.1,4.8,0.7,4.7,1
3,1001020400,1,4267,2001,2266,9.6,80.3,7.1,0.5,0.2,0.0,3257,55895.0,4369.0,28068.0,3190.0,2.3,29.0,16.6,25.8,9.1,19.5,82.3,11.2,0.0,1.5,2.9,2.1,25.9,1849,75.8,19.7,4.5,0.0,6.1,3
4,1001020500,1,9965,5054,4911,0.9,77.5,16.4,0.0,3.1,0.0,7229,68143.0,14424.0,36905.0,10706.0,12.2,48.8,13.8,20.5,3.5,13.4,86.9,11.2,0.0,0.8,0.3,0.7,21.0,4787,71.4,24.1,4.5,0.0,2.3,1
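
The cell above numbers the classes in the order they first appear in the data, which is why the tract with the lowest rate shown (1.7) has class 3 while a tract at 20.8 has class 1. If codes that follow the Low-to-Very-High order are preferred, a minimal sketch on toy values (the `rates` series is illustrative, not the real column) could look like this:

```python
import pandas as pd

# Toy child-poverty rates in percent (illustrative only).
rates = pd.Series([20.8, 35.8, 21.1, 1.7, 17.9, 50.2, 8.4, 12.6])

# labels=False makes qcut return the bin index directly: 0 is the lowest quartile.
ordinal = pd.qcut(rates, q=4, labels=False)

# Equivalent route that keeps the readable names and derives the codes from them.
named = pd.qcut(rates, q=4, labels=["Low", "Medium", "High", "Very High"])
codes = named.cat.codes

print(ordinal.tolist())
print(codes.tolist())
```

Both routes give codes 0 through 3 in which a larger code always means a higher poverty quartile. For a classifier that treats the target as purely categorical, the numbering does not change the predictions, but ordered codes make confusion matrices easier to read.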


***[.5 points]** Assume you are equally interested in the classification performance for each class in the dataset. Split the dataset into 80% for training and 20% for testing. There is no need to split the data multiple times for this lab.*



In [50]:
from sklearn.model_selection import train_test_split

# Separate the features from the target variable
X = data.drop(columns=['ChildPovertyClass']).to_numpy()
y = data['ChildPovertyClass'].to_numpy()

# Test and Train using 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

Here we split the data into training and testing sets using an 80/20 split. Because we are equally interested in the classification performance for each class, we stratified the split so that each class appears in the same proportion in both the training and testing data, reducing bias toward any one class.
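
A quick way to confirm that stratification preserved the class proportions is to count the labels on each side of the split. The sketch below uses synthetic arrays (`X_demo`, `y_demo` are hypothetical names) so that it runs on its own; the same counting could be applied to `y_train` and `y_test`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features and four equally sized classes (illustrative only).
X_demo = rng.normal(size=(1000, 3))
y_demo = np.repeat([0, 1, 2, 3], 250)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count the samples per class on each side of the split.
train_counts = np.bincount(y_tr, minlength=4)
test_counts = np.bincount(y_te, minlength=4)

print("Train proportions:", train_counts / len(y_tr))
print("Test proportions: ", test_counts / len(y_te))
```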

*Note: You will need to one hot encode the target, but do not one hot encode the categorical data until instructed to do so in the lab.* 
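
For reference, one-hot encoding an integer target can be sketched with NumPy alone. This is a minimal example that assumes labels 1 through 4 like those produced earlier; `y_demo` is a made-up array and this is not necessarily how the encoding is done later in the lab.

```python
import numpy as np

# Made-up integer labels in the range 1..4 (illustrative only).
y_demo = np.array([1, 2, 1, 3, 4])

# Sorted distinct labels; each label's position becomes its column index.
classes = np.unique(y_demo)
col_idx = np.searchsorted(classes, y_demo)

# Pick rows of the identity matrix: one row per sample, one column per class.
y_onehot = np.eye(len(classes), dtype=int)[col_idx]

print(y_onehot)
```

Each row contains a single 1 in the column of its class, so `y_onehot.argmax(axis=1)` recovers the column indices.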

## **Pre-processing and Initial Modeling (2.5 points total)**