## Business Understanding 1

The dataset selected for this project was collected by the US Census Bureau and the Bureau of Labor Statistics over the course of 1994 and 1995 for the Current Population Survey (CPS). The main purpose of the CPS is to obtain current information on the status of the labor force in the United States. More specifically, this survey is conducted to enumerate the number of jobless and unemployed individuals as well as to get an idea about the social well-being of the citizens.

The Current Population Survey was developed in the late 1930s after the Great Depression; before then, there was no effective technique for classifying the labor force. This period of widespread unemployment created a great need for a reliable survey of the population. Several indirect surveying techniques had been used previously, but their results differed greatly. The first surveys began in the 1940s, and responsibility for conducting the survey has changed hands within the government since then. Currently, the survey portion is conducted by the US Census Bureau and the data is analyzed by the Bureau of Labor Statistics.

The CPS is conducted monthly by asking a series of questions about socioeconomic factors to roughly 60,000 probability-sampled households from all 50 states and the District of Columbia. Eligible respondents must be over the age of 15 and must not be in the Armed Forces or in an institution such as a prison, nursing home, or long-term health care facility. Typically, labor force questions are asked about eligible workers in the household, along with supplemental questions of particular interest to labor force analysts. The supplements vary greatly both in how often they are asked (annually, biannually, or one time only) and in their topics, which change monthly and include veteran status, child support, displaced workers, fertility, disability, and school enrollment, to name a few.

This data was obtained from the University of California Irvine Machine Learning Repository. A citation and a direct link to the dataset can be found [here](https://www2.1010data.com/documentationcenter/prod/Tutorials/MachineLearningExamples/CensusIncomeDataSet.html).

Due to the wide-ranging nature of this survey, outcomes derived from this data would be highly dependent on the questions at hand. For purposes of this study, the main goal is to classify two variables. The first is income as a binary response: above or below $50,000 annually. The second is derived from the level of education feature: whether the person is a college graduate or not.
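As a minimal sketch of how the two binary targets could be derived with pandas, consider the toy example below. The column names (`education`, `income`), the category labels, and the `GRADUATE_LEVELS` set are assumptions made for illustration; the exact strings in the census file may differ.

```python
import pandas as pd

# Toy rows; column names and category labels are assumptions for illustration.
df = pd.DataFrame({
    "education": [
        "High school graduate",
        "Bachelors degree(BA AB BS)",
        "Masters degree(MA MS MEng MEd MSW MBA)",
        "Some college but no degree",
    ],
    "income": ["- 50000.", "50000+.", "50000+.", "- 50000."],
})

# Hypothetical set of education levels that count as "college graduate".
GRADUATE_LEVELS = {
    "Bachelors degree(BA AB BS)",
    "Masters degree(MA MS MEng MEd MSW MBA)",
    "Doctorate degree(PhD EdD)",
    "Prof school degree (MD DDS DVM LLB JD)",
}

# Binary target 1: is the person a college graduate?
df["is_graduate"] = df["education"].str.strip().isin(GRADUATE_LEVELS)

# Binary target 2: is the income above $50,000?
df["income_over_50k"] = df["income"].str.strip().eq("50000+.")

print(df[["is_graduate", "income_over_50k"]])
```

Stripping whitespace before comparing guards against padded category strings, which are common in raw census extracts.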

This data was compiled because labor force analysts need statistical summaries on questions of interest pertaining to the United States workforce. The survey is a very useful tool that provides insight into the social and economic status of the US population. This project will not necessarily use the dataset for its original purpose but will instead pose two distinct classification problems. If useful knowledge has been mined from this data, an accurate classification algorithm will be produced for the selected variables. Ten-fold cross validation will be used to assess whether useful knowledge has been mined.

For these studies, association rules will be created. Association rule mining measures how likely items are to co-occur; it does not imply causality. The effectiveness of the rules produced by the Apriori algorithm will be assessed by the confidence of each rule. Rules validated this way can help graduate schools target students who might be interested in higher studies. Since causality is not needed here, association rules are sufficient.
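To make the evaluation measures concrete, here is a small sketch of how support, confidence, and lift can be computed for a single rule. The transactions and item names are invented for illustration, and the helper functions `support` and `rule_metrics` are hypothetical; they are not part of any library or of this project's pipeline.

```python
# Toy transactions: each set holds the attribute values of one person.
transactions = [
    {"married", "professional", "graduate"},
    {"married", "professional", "graduate"},
    {"married", "graduate"},
    {"married", "professional"},
    {"single", "professional"},
    {"single"},
    {"married"},
    {"single", "graduate"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    sup_rule = support(set(antecedent) | set(consequent), transactions)
    sup_antecedent = support(antecedent, transactions)
    sup_consequent = support(consequent, transactions)
    confidence = sup_rule / sup_antecedent  # estimate of P(consequent | antecedent)
    lift = confidence / sup_consequent      # values above 1 mean co-occurrence beyond chance
    return sup_rule, confidence, lift


sup, conf, lift = rule_metrics({"married", "professional"}, {"graduate"}, transactions)
print(f"support={sup:.3f} confidence={conf:.3f} lift={lift:.3f}")
```

Confidence alone can be misleading when the consequent is very common, which is why lift is reported alongside it.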

With the current pandemic, graduate schools are facing dropping enrollment for various reasons (e.g., concerns about contamination on campus, and layoffs and furloughs at many companies forcing working students to take a break). Graduate schools can target students who are more likely to pursue higher studies by offering them scholarships to encourage them to enroll.

## Data Understanding 1

The census dataset was obtained from the UCI Machine Learning Repository. To learn more about this dataset, please follow the hyperlink given in the section above.

Overall, this dataset includes about 40 attributes, all relating to census information from 1994 and 1995. We decided to include almost all attributes to help with our classification models in subsequent labs. Below is a high-level breakdown of each attribute:

|Attribute	       | Type of Data	 |     Description    |
|:-----------------|:---------------:|:-------------------|
|Age               | Integer         | -Age of each individual|
|Class of Worker   | Nominal         | -The type of employer the person works for, e.g. private, government, or self-employed |
|Industry Code	   | Integer         | -The code of the industry they work in, if applicable|
|Occupation Code   | Integer	     | -The code associated with their occupation|
|Education	       | Nominal	     | -The level of education the person received |
|Wage per hour	   | Integer         | -How much the person makes per hour |
|Marital status	   | Nominal         | -This is whether the person is married, single, or  divorced |
|Major Industry code| Nominal        | -The actual description of the industry, e.g. Construction, Finance, etc… |
|Major occupation code| Nominal	     | -Description of the role of the individual |
|Race 	              | Nominal	     | -Race of the individual e.g. Caucasian, Asian, Hispanic, etc… |
|Hispanic Origin	  | Nominal	     | -Whether the individual has a hispanic origin |
|Sex	              | Binary	     | -The gender of the individual, Male or Female |
|Member_of_labor union| Nominal	     | -This tells us if they are a member of a labor union|
|Reason for unemployment| Nominal	 | -The reason for unemployment if not employed |
|Full or part time employment stat|	 Nominal | -The working status of the individual |
|Capital gains	| Integer	         | -Their overall capital gains this year |
|Capital losses | Integer            | -The overall loss to the capital gains |
|Dividends from stocks|	Integer	     | -The return that was gained if the individual owns any stocks |
|Tax filer status |	Binary	         | -This is the current status for their tax filing |
|Region of previous residence | Nominal	| -The region the individual lived in before |
|State of previous residence  | Nominal	|     -The state the individual lived previously, if applicable |
|Detailed household and family stat | Nominal |    -Statistic of the individual in a household and the family as a whole|
|Detailed household summary in household |	Nominal	| -Overall summary that closely depicts the members of the household|
|Instance weight | Double |   -Number of people in the population that each record represents due to stratified sampling|
|Migration code-change in msa | Nominal	| -Information regarding whether the person moved to the state|
|Migration code-change in reg |  Nominal |	     -Originated from the region and whether they moved to the region or away |
|Migration code-move within reg | Nominal	|   -Same as above |
|Migration prev res in sunbelt |	Nominal	|  -Previously lived in the sunbelt  |
|Num persons worked for employer | Nominal	| -The number of previous employers the individual worked for |
|Family members under 18 | Nominal | -Family members under the age of 18 |
|Country of birth father | Nominal	| -The country of birth for the father of the individual |
|Country of birth mother | Nominal	| -The country of birth for the mother of the individual |
|Country of birth self 	 | Nominal	| -The country of birth for the individual |
|Citizenship 	         | Binary	| -Whether they are a citizen of the country or not |
|Total person income     | Binary   | -Income of the person for the year, above 50k or below 50k |
|Own business or self employed | 	Nominal |	 -If they own a business or are self-employed |
|Veterans benefits	| Ordinal  |  -The benefit tier for a veteran |
|Weeks worked in year |	Integer | -Number of weeks worked that given year for the person |

As seen above, our dataset incorporates a myriad of attributes, ranging from age to veteran information. We decided to keep most of these variables and remove only the features that will not benefit our ML models in the subsequent labs. Some attributes, such as veterans benefits, are ambiguous, and we do not have enough information to interpret their values.

Some attributes that we want to elaborate on are as follows:

Previous residence in the sunbelt:

This field indicates whether an individual previously lived in the Sun Belt, the region that stretches across the Southeast and Southwest of the United States (e.g., Southern California, Texas, Atlanta). More information can be found [here](https://www.census.gov/quickfacts/fact/map/US/PST045219).

Industry Code and Major Industry Code:
The industry code is the numerical value for each industry. It matches up with the "Major Industry Code", which tells us the actual name of the industry.

Occupation Code and Major Occupation Code:
These two attributes are related with each other. The occupation code gives us the numerical value for each occupation. This code matches up with the "Major Occupation Code" attribute which tells us the actual name of the occupation.


```python
# Importing the libraries we need for this analysis

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.style as style
import scipy.stats as stats
import seaborn as sns
%matplotlib inline
style.use('bmh')  # style for charts
```

Most of the missing values are characterized as "Not in Universe", which means that this information does not apply to that specific individual. For example, a person under the age of 18 would not have class_worker information. We do not want to impute these, since they are technically not missing values and are likely not mistakes.

We observed a large number of records with a value of 99999 for capital gains and stock dividends, as well as values of 9999 for wage_per_hour. We believe these values are capped rather than erroneous, so they will be left as is. We are not planning to use them in our association rule method.
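The two checks described above could be sketched roughly as follows. The toy data, the `capital_gains` column name, and the exact spelling of the "Not in universe" label are assumptions for illustration; `class_worker` and `wage_per_hour` follow the names used in the text.

```python
import pandas as pd

# Toy data standing in for the census file; values are invented.
df = pd.DataFrame({
    "age": [12, 35, 52, 41],
    "class_worker": ["Not in universe", "Private", "Self-employed", "Not in universe"],
    "capital_gains": [0, 99999, 1500, 0],
    "wage_per_hour": [0, 9999, 1200, 0],
})

NIU = "Not in universe"  # assumed label for "does not apply to this person"

# Share of records where class_worker does not apply.
niu_share = df["class_worker"].str.strip().eq(NIU).mean()

# Count records sitting exactly at the suspected cap of each numeric column.
caps = {"capital_gains": 99999, "wage_per_hour": 9999}
capped_counts = {col: int((df[col] == cap).sum()) for col, cap in caps.items()}

print(f"Share of '{NIU}' in class_worker: {niu_share:.2f}")
print("Records at the cap:", capped_counts)
```

Treating "Not in universe" as its own category, rather than as a missing value, keeps these records available for rule mining without imputation.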

## Modeling and Evaluation

#### Option B: Association Rule Mining

-  [10 Points] Train and adjust parameters

Detail to be added

-  [10 Points] Evaluate and Compare

Detail to be added

-  [10 Points] Visualize Results

```r
plot(is_max_grad)
plot(is_max_income)
```

The scatter plot above displays the support and confidence of the rules for income over $50k and for graduated. For graduated, we found two rules with about 3% support and one rule with over 4%; their confidence and lift were all comparable, with confidence around 75%. For income, the three maximal rules we found all had similar support of around 1.5% but ranged in confidence from 25% to 30%.

```r
plot(is_max_grad, method="graph")
plot(is_max_income, method="graph")
```

The plot above shows the features that combined to form these rules. For graduated, the factors that coincided with graduation were professional specialty occupation codes, married with spouse present, white, tax filers joint under 65, and native-born citizen. Each of these contributed to at least 2 of the 3 rules generated.

For income, the rules associated with income over $50k are: males with a bachelor's degree, or white people who worked in a professional specialty or were native-born citizens in executive admin/management.
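The rules discussed above are described as maximal. As a rough illustration, and assuming "maximal" means that no other rule in the set is built from a proper superset of the same items, the filter can be sketched as below. The function `maximal_itemsets` and the example itemsets are hypothetical and are not the code that produced `is_max_grad` or `is_max_income`.

```python
def maximal_itemsets(itemsets):
    """Keep only itemsets that are not a proper subset of another itemset."""
    sets = [frozenset(s) for s in itemsets]
    # `s < other` is True only when s is a proper subset of other.
    return [s for s in sets if not any(s < other for other in sets)]


# Invented itemsets, each standing for the items of one rule.
rules = [
    {"married", "graduate"},
    {"married", "professional", "graduate"},
    {"white", "graduate"},
]

maximal = maximal_itemsets(rules)
for itemset in maximal:
    print(sorted(itemset))
```

Filtering to maximal rules removes shorter rules that are already covered by a longer one, which keeps the plots readable.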

-  [20 Points] Summarize the Ramifications

Detail to be added

## Deployment


The below companies and parties might want to use our model to predict the following:

To predict income:
- IRS
- Marketing agencies
- Real Estate

To predict graduation:
- Job agency
- Political parties
- Graduate school recruiting


Graduate schools can use this model to target students who are more likely to pursue higher studies, based on other attributes that are associated with higher education.

The model can be deployed in a cloud environment for high availability and scaling. The pipeline was designed to process unseen data, which can then be fed to the model, so that anyone can easily apply the model to new data. Other data that could be collected includes real estate information (rent or own, type of home, price of home, etc.) and whether the individual wants to pursue higher education. This type of model can be updated monthly as additional data is collected. Updating more often than monthly is not necessary, because these types of attributes are not prone to frequent change.

## Exceptional Work

### Sources

UCI Citation for dataset:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[Direct Link to Census-Income Dataset](https://www2.1010data.com/documentationcenter/prod/Tutorials/MachineLearningExamples/CensusIncomeDataSet.html)

[History of the Current Population Survey](https://www.census.gov/programs-surveys/cps/about/history-of-the-cps.html)

[Methodology of the Current Population Survey](https://www.census.gov/programs-surveys/cps/technical-documentation/methodology.html)

[Current Population Survey: Supplemental Survey Topics](https://www.census.gov/programs-surveys/cps/about/supplemental-surveys.html)

[Link to the Census Gov Page for Sunbelt States](https://www.census.gov/quickfacts/fact/map/US/PST045219)