# Processing the American Community Survey

__Matthew Lee__

# Introduction 

The American Community Survey dataset was collected by the US Census Bureau in 2013 to give researchers information about the US population. Laws and policies are developed to benefit the majority of people within the United States. Because developing a policy takes a great deal of time and money (especially if the policy is not well received), decisions need to be made with the utmost accuracy, based on both seen and unseen trends. The consolidation of this information helps policy makers make those decisions.


## Business Understanding 

### Data Info and Motivation

In this dataset, there are 283 attributes for over 1.6 million rows (people surveyed) in a single file, and about 6.4 million rows if all four files are considered. The first objective of this analysis is to understand how much financial success Americans are having in the United States and to determine whether more philanthropic policies should be geared towards them. Second, legal and illegal immigration is a major topic in current politics. Since only a limited amount of funding can be put toward policies, the second objective is to figure out whether lower-income residents or immigrants deserve priority in terms of policy. For example, should policies on the health care system or social security favor immigrants, citizens or not, who show successful integration into society? These are the questions this analysis is looking to answer.

### Financial Success of Immigrants 

Being born an American citizen allows a person to be directly integrated into society and to understand its social norms. Immigrants, on the other hand, have to integrate into society once they arrive in the United States, which inherently puts them at a slight disadvantage when competing for schools and job positions. If an immigrant is admitted to a good college but does not have the funds to attend, does that person deserve priority for FAFSA money over an American-born citizen? Success is a hard word to define, but the main analysis will use financial income as the determinant of success. The reasoning is that a high salary not only means the job is hard to obtain, but also that the person has a greater ability to influence the economy through his or her purchasing power. A higher GDP per citizen tends to result in a higher quality of life. In other words, a high circulation of money within an economy will boost quality of life.

Choosing whether to fund an immigrant or a US citizen for college on the basis of who will have the greatest impact on the US is a hard choice, but it can be examined through analysis of the data provided by the US Census Bureau. The ultimate goal is to figure out whether the US government should focus on creating policies that are geared more towards immigrants with high potential or towards US citizens.

## Purpose of data

Understanding the American populace is important because it allows policy makers to create meaningful laws that serve the US population best.

### Data Grade and Worth

In order to figure out the potential of an immigrant or US citizen based on anonymous surveys, the most important values will be immigration status and income, either after graduating college or while currently in the workforce. There are many values that can determine the success of an immigrant from one country in comparison to an immigrant from another country, and the variables that contribute to financial success will be examined. Considering the size of the dataset and its 283 attributes, there is bound to be a correlation. The best way to evaluate whether a learning algorithm works is to see how accurately it predicts the success of a person from the given inputs.

# Data Understanding
[10 points] Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. 

Here are the definitions to important metrics that will be used for analysis:

Nominal (no order): ID numbers, eye color, zip codes | distinctness: equal, not equal (one-hot encoded)

Ordinal: rankings, grades | greater than, less than (integer)

Continuous (floating point): temperatures, height | plus, minus (float)

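The three scales above translate into different column encodings. The following is a minimal sketch on made-up values: `CIT` and `MARHT` mirror attribute names from this dataset, while `TEMP_F` is a hypothetical column that only stands in for the temperature example.

```python
import pandas as pd

# Toy records; the values are invented for illustration.
toy = pd.DataFrame({
    "CIT": [1, 4, 5, 1],                  # nominal: citizenship code
    "MARHT": [1, 2, 1, 3],                # ordinal: number of times married
    "TEMP_F": [68.5, 71.0, 59.2, 80.1],   # continuous (hypothetical column)
})

# Nominal codes only support equality tests, so expand them into indicator columns.
encoded = pd.get_dummies(toy, columns=["CIT"], prefix="CIT")

# Ordinal values keep their order, so a plain integer column is enough.
encoded["MARHT"] = encoded["MARHT"].astype("int64")

# Continuous values support addition and subtraction and stay as floats.
encoded["TEMP_F"] = encoded["TEMP_F"].astype("float64")

print(encoded.columns.tolist())
```

One-hot encoding keeps a model from reading an order into codes such as 1, 4 and 5, which is the main reason to treat nominal attributes separately.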
## Attribute Description
__PINCP__ (ordinal) - Person's total income

__WAGP__ (ordinal) - Wage or salary in past 12 months

__SCHL__ (nominal) - Level of education

__ESR__ (nominal) - Employment status recode

__CIT__ (nominal) - Citizenship status

           1 .Born in the U.S.
           2 .Born in Puerto Rico, Guam, the U.S. Virgin Islands,
             .or the Northern Marianas
           3 .Born abroad of American parent(s)
           4 .U.S. citizen by naturalization
           5 .Not a citizen of the U.S.

__AGEP__ (ordinal) - Age of person

__ST__ (nominal) - State code

__MARHDP__ (nominal) - Divorced in past 12 months

__FMARHDP__ (nominal) - Divorced in past 12 months (allocation flag)

__MARHT__ (ordinal) - Number of times married

__WKL__ (nominal) - When last worked

__FOD1P__ (nominal) - Field of degree

__LANP__ (nominal) - Language spoken at home

__MIGPUMA__ - Migration PUMA

__MIGSP__ - Migration recode (state or foreign country); states need to be removed

__MSP__ (nominal) - Married, spouse present/spouse absent

__NAICSP__ (nominal) - Industry of the individual's job (NAICS code)

__NATIVITY__ (nominal) - Born in America or not

__OC__ (nominal) - Own child

__POVIP__ (ordinal) - Poverty to income ratio (housing)

__POWSP__ (nominal) - Place of work (state)

__RAC1P__ (nominal) - Recoded detailed race code

           1 .White alone
           2 .Black or African American alone
           3 .American Indian alone
           4 .Alaska Native alone
           5 .American Indian and Alaska Native tribes specified; or American
             .Indian or Alaska Native, not specified and no other races
           6 .Asian alone
           7 .Native Hawaiian and Other Pacific Islander alone
           8 .Some Other Race alone
           9 .Two or More Races

__RAC3P__ (nominal) - Recoded detailed race code including Europeans

__ENG__ (nominal) - Ability to speak English

           b .N/A (less than 5 years old/speaks only English)
           1 .Very well
           2 .Well
           3 .Not well
           4 .Not at all

# Data Preprocessing

Time for data extraction and refinement!

### Data Attribute Removal
There are quite a few attributes that are not going to be used, so they will be removed.

In [None]:
# Import libraries for data modification
import pandas as pd
import numpy as np

# Load in all the data
housingA = pd.read_csv("pums/ss13husa.csv")
#housingB = pd.read_csv("pums/ss13husb.csv")
individualsA = pd.read_csv("pums/ss13pusa.csv")
individualsB = pd.read_csv("pums/ss13pusb.csv")

# Combine separate data into one data frame
# (ignore_index avoids duplicate row labels from the two files)
individualdfs = [individualsA, individualsB]
individualdf = pd.concat(individualdfs, ignore_index=True)

# Delete data attributes that are not going to be used:
# the 80 replicate person weights pwgtp1 ... pwgtp80
delete_columns_individuals = ['pwgtp' + str(i) for i in range(1, 81)]

individualdf = individualdf.drop(columns=delete_columns_individuals)

### Quality of Data and Feature Discretization

The data attributes will be labeled as continuous, ordinal, or categorical. This will help identify invalid or missing data.

In [None]:
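# Feature groups
ordinal_features = ['PINCP', 'WAGP', 'MARHT', 'AGEP']
# Note: these are nominal codes according to the attribute descriptions;
# the list keeps the name continuous_features because later cells refer to it.
continuous_features = ['SCHL', 'ESR', 'CIT', 'WKL', 'MSP', 'NAICSP', 'POWSP', 'ST', 'RAC3P']
binary_features = ['FMARHDP', 'NATIVITY', 'OC']
# Every column not listed in a group above is treated as categorical
categorical_features = (individualdf.columns
                        .difference(continuous_features)
                        .difference(ordinal_features)
                        .difference(binary_features))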

### Missing Information for Continuous Features
Continuous feature columns missing information: POWSP (place of work) and NAICSP (job held by individual).
POWSP is NaN when the individual is not working anywhere. NaN will be imputed with -1 to signify that the individual is not working, so that the row is not ignored. NAICSP, like POWSP, is NaN when the individual is currently not employed, and NaN will likewise be imputed with -1 so that the row carries a value. ESR (employment status) also records whether the individual is employed, so it will be used to check whether the value for POWSP or NAICSP was entered incorrectly. Rows where ESR conflicts with POWSP or NAICSP will be excluded from the dataset to improve accuracy.


In [None]:
# Impute NaN with -1. Note that this fills NaN in every column,
# not only NAICSP and POWSP.
individualdf = individualdf.fillna(-1)

# TODO: check and remove rows where ESR conflicts with POWSP or NAICSP.
# Unfinished draft, kept commented out:
# conflict = (individualdf['ESR'] == 3) & ((individualdf['POWSP'] != -1) | (individualdf['NAICSP'] != -1))
# individualdf = individualdf[~conflict]

individualdf[continuous_features].describe()
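The conflict check described above can be sketched with boolean masks instead of a row-by-row loop. This is only an illustration on invented values, and the rule itself is an assumption: a row is treated as conflicting when ESR says the person is at work (codes 1 and 4) but no place of work is recorded, or the reverse.

```python
import numpy as np
import pandas as pd

# Toy data with invented values. ESR 1 = employed at work, 3 = unemployed,
# 6 = not in labor force (codes as listed in this notebook).
toy = pd.DataFrame({
    "ESR":   [1, 3, 6, 1, 6],
    "POWSP": [48.0, np.nan, np.nan, np.nan, 6.0],
})

# Impute only the place-of-work column, leaving other columns untouched.
toy["POWSP"] = toy["POWSP"].fillna(-1)

# Assumed rule: "at work" and "has a recorded workplace" should agree.
at_work = toy["ESR"].isin([1, 4])
has_workplace = toy["POWSP"] != -1
conflict = at_work != has_workplace

# Keep only the consistent rows.
clean = toy[~conflict].reset_index(drop=True)
print(clean)
```

A mask-based filter of this kind stays fast on millions of rows, where a Python loop over `iat` would be very slow.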

### Missing Information and Imputing for Ordinal Features
Ordinal features were examined, NaN is present in these attributes ......

In [None]:
individualdf[ordinal_features].describe()

### Imputing for Continuous Features

### Missing Information and Imputing for Categorical Features
In attributes ...... NaN is present, 

In [None]:
individualdf[categorical_features].describe()

### Imputing for Categorical Features

### Missing Information and Imputing for Binary Features
There were surprisingly no 

In [None]:
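# Setting the data attributes to the correct types
# NAICSP is skipped on the assumption that it holds alphanumeric industry
# codes, which cannot be cast to float
numeric_codes = [c for c in continuous_features if c != 'NAICSP']
individualdf[numeric_codes] = individualdf[numeric_codes].astype(np.float64)
individualdf[ordinal_features] = individualdf[ordinal_features].astype(np.int64)
individualdf[binary_features] = individualdf[binary_features].astype(np.int8)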

# Feature Discretization
Feature discretization is the process of converting or partitioning continuous attributes (features, variables) into discretized or nominal attributes or intervals.
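As a small illustration of the definition above, income can be partitioned into brackets with `pd.cut`. The bracket edges and labels here are hypothetical and chosen only for the example; they are not taken from the analysis.

```python
import pandas as pd

# Made-up incomes standing in for the PINCP column.
incomes = pd.Series([0, 12000, 35000, 58000, 140000], name="PINCP")

# Hypothetical bracket edges and labels, chosen only for illustration.
edges = [-float("inf"), 25000, 75000, float("inf")]
labels = ["low", "middle", "high"]

# Each value is assigned to the interval it falls in; by default an upper
# edge belongs to the lower bracket.
brackets = pd.cut(incomes, bins=edges, labels=labels)
print(brackets.tolist())
```

The result is an ordered categorical column, so the brackets can still be compared or sorted after discretization.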

# Inclination of Data

Most of the data attributes were chosen in order to extract individuals who recently came to America. Relying on a single attribute such as NATIVITY is not enough: individuals may choose not to reveal that they were born outside America for fear of being deported, regardless of anonymity. Instead, combining NATIVITY, MIGSP, and MIGPUMA will establish a dataset of individuals who came to America. All the other attributes contribute to developing a consensus on whether coming to America is worth it.

# Analysis of Nato

Nato created a dataset which values the economic strength of different countries. Information from data attributes ("""""""") will determine the poverty-to-income. Life in immigrants vs life in previous state. Since the data is anonymized, determining -- we will look at correlation. High immigration rate from specific countries.







# Data Visualization

In [None]:
# Setup Seaborn 
import seaborn as sns
sns.set_palette('muted')

# Setup plotly
import plotly
plotly.offline.init_notebook_mode() 

# Setup matplotlib
import matplotlib.pyplot as plt

# Embed figures into Notebook
%matplotlib inline

# Use GGPlot style for matplotlib
plt.style.use('ggplot')

## General Population

### Distribution of Income
Income is a very important attribute because it defines the financial stability and success of a person. 

In [None]:
print("Mean Income of all individuals: " + str(individualdf['PINCP'].mean()))
print("Median Income of all individuals: " + str(individualdf['PINCP'].median()))

individualdf['PINCP'].plot(kind='box', ylim=[-20, 120000], title = "Income")

This box plot shows the distribution of income in 2013. What stands out is that the US Census Bureau reports an income figure of 52,250 for 2013, yet the simple mean and median computed here fall below it by a margin of roughly 20,000. (The 52,250 figure is a median household income, while PINCP is a per-person income, so the two are not directly comparable.) Given the number of people in the survey, it looks as if mostly people from the low income bracket (47,248) took the survey. This is a formidable issue for the analysis because the sample appears biased against the middle and upper classes: upper-class incomes barely appear in the plot at all, although the y-axis is capped at 120,000.
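One reason a simple mean or median can differ from published figures is that survey rows carry weights. Here is a minimal sketch of a weighted mean and median on invented numbers, assuming the person weight column `PWGTP` is still present in the data frame (only the replicate weights were dropped earlier). The helper functions `weighted_mean` and `weighted_median` are hypothetical names written for this example.

```python
import numpy as np
import pandas as pd

def weighted_mean(values, weights):
    """Mean of `values` where each row counts `weights` times."""
    return float(np.average(values, weights=weights))

def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    v = np.asarray(values)
    w = np.asarray(weights)
    order = np.argsort(v)
    v, w = v[order], w[order]
    cumulative = np.cumsum(w)
    return float(v[np.searchsorted(cumulative, 0.5 * cumulative[-1])])

# Invented incomes and weights, for illustration only.
toy = pd.DataFrame({
    "PINCP": [10000, 30000, 50000, 200000],
    "PWGTP": [40, 30, 20, 10],
})

print("Unweighted mean:", toy["PINCP"].mean())
print("Weighted mean:  ", weighted_mean(toy["PINCP"], toy["PWGTP"]))
print("Weighted median:", weighted_median(toy["PINCP"], toy["PWGTP"]))
```

In this toy example the heavily weighted low incomes pull the weighted mean well below the unweighted one, which shows how much the choice of estimator can move the headline number.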

College degree income average

Income vs out of america

Income vs American

### Demographics

Some of the data attributes were chosen in order to extract individuals who recently came to America. Relying on a single attribute such as NATIVITY (born in US) is not enough: individuals may choose not to reveal that they were born outside America for fear of being deported, regardless of anonymity. Instead, combining NATIVITY, MIGSP, and MIGPUMA will establish a dataset of individuals who came to America. All the other attributes contribute to developing a consensus on whether coming to America is worth it.

In [None]:
individualdf.groupby("NATIVITY").OC.count().sort_values().plot(kind='barh', title = "Born in America")

There are about 2.75 million American-born citizens and 350,000 immigrants.

In [None]:
individualdf.groupby("OC").OC.count().sort_values().plot(kind='barh', title = "Own Child (OC)")

About 2.5 million have children, while a little over 500,000 do not.

In [None]:
individualdf.groupby("ESR").OC.count().sort_values().plot(kind='barh', title = "Employment Status")

           -1 .N/A (less than 16 years old)
           1 .Civilian employed, at work
           2 .Civilian employed, with a job but not at work
           3 .Unemployed
           4 .Armed forces, at work
           5 .Armed forces, with a job but not at work
           6 .Not in labor force

The data is skewed not because respondents come from a low income bracket, but because the number of people who are not in the labor force or are under 16 years old far exceeds the number of people in the workforce. People not in the labor force may include the elderly, stay-at-home moms, and people who are disabled.
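The effect described above can be checked by recomputing the income statistics for employed people only. This is a sketch on invented values, using the ESR codes listed in this notebook and treating codes 1, 2, 4 and 5 as "in work"; the -1 entries stand for the imputed N/A values.

```python
import pandas as pd

# Invented rows: one child (ESR -1), two workers, one unemployed person,
# and two people not in the labor force.
toy = pd.DataFrame({
    "ESR":   [-1, 1, 1, 3, 6, 6],
    "PINCP": [-1, 60000, 40000, 8000, 0, 12000],
})

# ESR codes 1, 2, 4 and 5 cover civilian and armed-forces workers.
in_work = toy["ESR"].isin([1, 2, 4, 5])

median_everyone = toy["PINCP"].median()
median_employed = toy.loc[in_work, "PINCP"].median()

print("Median income, everyone:", median_everyone)
print("Median income, employed:", median_employed)
```

Because the imputed -1 values and the zero incomes of non-workers sit at the bottom of the distribution, excluding them moves the median up sharply in this toy example.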

In [None]:
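ax = plt.axes()
# histplot replaces the deprecated distplot; stat='density' and kde=True
# keep the normalized histogram with a density curve
sns.histplot(individualdf['RAC3P'], stat='density', kde=True, ax=ax)

ax.set_title('Distribution of Race (General)')
plt.show()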

## Immigration Population

In [None]:
# Create dataset for immigrants
# NATIVITY is coded 1 = native, 2 = foreign born, so keep the rows coded 2
immigrantdf = individualdf[individualdf.NATIVITY == 2].copy()

Immigrants make up 

In [None]:
immigrantdf.groupby("OC").OC.count().sort_values().plot(kind='barh', title = "Own Child (OC)")

In [None]:
immigrantdf.groupby("ESR").OC.count().sort_values().plot(kind='barh', title = "Employment Status")

In [None]:
print("Mean Income of immigrants: " + str(immigrantdf['PINCP'].mean()))
print("Median Income of immigrants: " + str(immigrantdf['PINCP'].median()))

immigrantdf['PINCP'].plot(kind='box', ylim=[-20, 120000], title = "Income")

In [None]:
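ax = plt.axes()
# Plot the immigrant subset (the original cell plotted individualdf again);
# histplot replaces the deprecated distplot
sns.histplot(immigrantdf['RAC3P'], stat='density', kde=True, ax=ax)

ax.set_title('Distribution of Race (Immigrants)')
plt.show()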

Distribution of immigrants

# Dimensionality Reduction
implement dimensionality reduction, then visualize and interpret the results. 

Kernel PCA
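The notebook does not yet say which kernel or library will be used, so the following is only a sketch of the technique under stated assumptions: an RBF kernel with a hypothetical `gamma`, implemented with NumPy alone on synthetic data. The function name `rbf_kernel_pca` is made up for this example and is not part of the analysis.

```python
import numpy as np

def rbf_kernel_pca(X, n_components=2, gamma=1.0):
    """Project rows of X onto the top principal components in RBF feature space."""
    # Pairwise squared Euclidean distances between rows.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    K = np.exp(-gamma * np.maximum(sq_dists, 0.0))

    # Center the kernel matrix, which corresponds to centering in feature space.
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # eigh returns eigenvalues in ascending order, so reverse to get the largest first.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    eigvals = eigvals[::-1][:n_components]
    eigvecs = eigvecs[:, ::-1][:, :n_components]

    # The projection of training point i on component k is sqrt(lambda_k) * v_k[i].
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))

# Synthetic stand-in for a small sample of scaled numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Z = rbf_kernel_pca(X, n_components=2, gamma=0.5)
print(Z.shape)
```

The kernel matrix has one row and one column per observation, so this approach needs a subsample of the survey; building it for millions of rows would not fit in memory.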
