## **Preparing Data for Machine Learning**

### **Goal**:
You are the data scientist assigned to pre-process the data and prepare it for machine learning algorithms.

1. Perform data exploration to understand the data
2. Prepare the test and training sets.
3. Pre-process the data, including filling in all missing values (set missing values to the median) and any other steps you think are appropriate. Build a pipeline to perform the data transformation. (5 points)

In the next hands-on, we will use 14 of the 15 attributes described below as predictors to predict whether income is above or below \$50K/yr based on census data. `Income` will be the label.

The dataset is from [UCI](https://archive.ics.uci.edu/dataset/2/adult).

### Data:
An individual’s annual income results from various factors. Intuitively, it is influenced by the individual’s education level, age, gender, occupation, etc.

### Fields:
The dataset contains 15 columns

#### Target field: Income
- The income is divided into two classes: `<=50K` and `>50K`

#### Number of attributes: 14
- These are the demographics and other features that describe a person

We can explore the possibility of predicting income level based on the individual’s personal information.

- `age`: continuous.
- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- `fnlwgt`: continuous.
- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- `education-num`: continuous.
- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- `sex`: Female, Male.
- `capital-gain`: continuous.
- `capital-loss`: continuous.
- `hours-per-week`: continuous.
- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- `salary`: >50K,<=50K

Note: "?" is used to represent missing data in this dataset.

In [1]:
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Read in data. Consider that "?" is used in place of na
adults_fp = 'https://raw.githubusercontent.com/csbfx/advpy122-data/master/adult.csv'
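The comment above notes that "?" stands in for missing values. A minimal sketch of how that could be handled at load time, assuming `pd.read_csv` with `na_values="?"`; the three-row sample and its column names are hypothetical stand-ins so the snippet runs without network access:

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for adult.csv;
# the column names here are assumptions, not the real header.
sample_csv = io.StringIO(
    "age,workclass,occupation,income\n"
    "39,State-gov,Adm-clerical,<=50K\n"
    "54,?,?,>50K\n"
    "28,Private,Sales,<=50K\n"
)

# na_values="?" asks pandas to parse the "?" placeholder as NaN.
sample = pd.read_csv(sample_csv, na_values="?")

# Count the missing values in each column.
missing_counts = sample.isna().sum()
print(missing_counts)
```

For the real file, the same call would presumably take `adults_fp` in place of `sample_csv`.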


## Part 1: Data Exploration
Let's examine the data. What are some ways to look at the different attributes in the dataframe? What are some good ways to quickly get a feel for the type of data you are dealing with? Examples we have used in class include looking at relationships/correlations and histograms.
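One possible starting point for this kind of first look, sketched on a hypothetical six-row miniature of the dataset (the values are made up, and the column names are assumed to match the field list):

```python
import pandas as pd

# Hypothetical miniature of the census data, for illustration only.
df = pd.DataFrame({
    "age": [25, 38, 52, 46, 29, 61],
    "hours-per-week": [40, 50, 45, 60, 35, 20],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# Column dtypes and non-null counts.
df.info()

# Summary statistics for the numeric columns.
summary = df.describe()

# Share of each class in the label column.
class_share = df["income"].value_counts(normalize=True)

# Pairwise correlation between the numeric columns only.
numeric_corr = df.select_dtypes("number").corr()

print(summary, class_share, numeric_corr, sep="\n\n")
```

On the full dataframe, `df.hist()` and `sns.pairplot(df)` would add the histogram and relationship views mentioned above.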

In [None]:
# Your code here . . .

## Part 2: Prepare Training & Testing data sets
We next want to create a test set. Set aside 20% of the dataset. To reduce the risk of sampling bias, consider using stratified sampling. For example, consider `income`, where we create 5 strata ranging from \$0 to \$15,000 (take a look at the income data and consider how to bin it to reflect the income in USD).
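A minimal sketch of a stratified 80/20 split, assuming scikit-learn's `train_test_split` with its `stratify` argument and assuming the label is the two-class `income` column described under Fields. The 100-row dataframe is hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 25% of them in the ">50K" class.
df = pd.DataFrame({
    "age": range(100),
    "income": [">50K"] * 25 + ["<=50K"] * 75,
})

# stratify= keeps the class proportions the same in both subsets.
train_set, test_set = train_test_split(
    df, test_size=0.2, stratify=df["income"], random_state=42
)

# Compare the class shares of the two subsets.
train_share = train_set["income"].value_counts(normalize=True)
test_share = test_set["income"].value_counts(normalize=True)
print(train_share, test_share, sep="\n\n")
```

If the stratification column were continuous, it would first need to be binned into categories (for example with `pd.cut`) before being passed to `stratify`.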

In [None]:
# Your code here . . .

## Part 3: Pre-processing data

Once we have our training and testing datasets, clean up the dataframe by removing redundant attributes and features that would not be useful in machine learning models, such as unique identifiers.

In [None]:
# Your code here . . .
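The goal asks for missing values to be set to the median and for a pipeline that performs the transformation. A minimal sketch of one way to do that with scikit-learn follows. The tiny dataframe is hypothetical, and because a median only exists for numeric columns, filling categorical gaps with the most frequent value is an assumption of this sketch, not something the assignment specifies:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training features with some missing entries.
X_train = pd.DataFrame({
    "age": [25.0, np.nan, 52.0, 46.0],
    "hours-per-week": [40.0, 50.0, np.nan, 60.0],
    "workclass": ["Private", np.nan, "State-gov", "Private"],
})

numeric_cols = ["age", "hours-per-week"]
categorical_cols = ["workclass"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value
# (an assumption of this sketch), then one-hot encode.
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own pipeline.
preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", categorical_pipeline, categorical_cols),
])

X_prepared = preprocess.fit_transform(X_train)
print(X_prepared.shape)
```

Fitting on the training set only and then calling `preprocess.transform` on the test set keeps test-set statistics from leaking into the imputation and scaling.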

It is a good idea to further investigate your attributes by using a correlation matrix and a Seaborn pairplot to understand how different attributes are related.

**Feature engineering**-   
ML algorithms benefit from data prepared in meaningful ways. For example, it could be useful to look at the total number of rooms relative to each household. Similarly, the total number of bedrooms is not as meaningful as the number of bedrooms relative to the total number of rooms. Another interesting attribute could be the population per household.
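The ratio features in this example could be computed as follows. This is only a sketch: the column names (`total_rooms`, `total_bedrooms`, `population`, `households`) and the values are hypothetical and belong to a housing-style table, not to the census data used in this notebook:

```python
import pandas as pd

# Hypothetical housing-style table matching the example in the text.
housing = pd.DataFrame({
    "total_rooms": [880, 7099, 1467],
    "total_bedrooms": [129, 1106, 190],
    "population": [322, 2401, 496],
    "households": [126, 1138, 177],
})

# Ratios are often more informative than the raw totals.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

print(housing.round(3))
```

The same pattern (combining existing columns into a ratio or difference) can be tried on the census attributes, checking with a correlation matrix whether the new column relates to the label more strongly than its inputs.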

## Part 4: ML models
Select two ML models to predict income level. Explain why you selected these two models. Perform 10-fold cross-validation.
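A minimal sketch of 10-fold cross-validation for two candidate models. Because the label described under Fields has two classes, the sketch uses classifiers. Logistic regression and a random forest are illustrative picks, not a recommendation, and the data is synthetic (`make_classification`) so the snippet runs on its own:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Two illustrative candidates: a linear model and a tree ensemble.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# cv=10 gives ten accuracy scores per model, one for each fold.
cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f} std={cv_scores[name].std():.3f}")
```

Comparing the mean and the spread of the fold scores gives a fairer picture than a single train/validation split.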

In [None]:
# Your code here . . .

## Part 5: Fine-tune your model
Pick the best model from Part 4 and fine-tune it. Evaluate the fine-tuned model on the test set. How does the fine-tuned model compare to your original evaluation in Part 4?
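One common way to fine-tune is a grid search over hyperparameters with cross-validation, followed by a single evaluation on the held-out test set. A minimal sketch, assuming scikit-learn's `GridSearchCV`; the model, the grid values, and the synthetic data are all illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared data.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Small illustrative grid; real grids depend on the chosen model.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Best combination found and its mean cross-validated accuracy.
print(search.best_params_)
print(search.best_score_)

# The test set is touched only once, after tuning is finished.
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```

`search.best_score_` is the number to compare with the Part 4 cross-validation result, while `test_accuracy` estimates how the tuned model generalizes.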

In [None]:
# Your code here . . .

## Bonus: Evaluating the model using the Confusion Matrix and a Precision-Recall Curve
A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier. It can be used to evaluate the performance of a classification model by calculating performance metrics such as [accuracy, precision, recall, and F1-score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html). Here is [an article](https://medium.com/swlh/explaining-accuracy-precision-recall-and-f1-score-f29d370caaa8) that gives a good explanation of precision, recall, and F1-score.

#### **Accuracy**
   
Accuracy = $\frac{True\ Positives\ +\ True\ Negatives}{All\ Samples}$

**Precision (aka Positive Predictive Value)**

Precision = $\frac{True\ Positives}{True\ Positives\ +\ False\ Positives}$
= $\frac{True\ Positives}{Total\ Predicted\ Positives}$


**Recall (aka Sensitivity)**

Recall = $\frac{True\ Positives}{True\ Positives\ +\ False\ Negatives}$
= $\frac{True\ Positives}{Total\ Actual\ Positives}$

**F1-score (combining Precision and Recall)**

F1-score = $\frac{2\ \times\ (Precision\ \times\ Recall)}{Precision\ +\ Recall}$
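The four formulas can be checked on a small worked example. The labels and predictions below are made up for illustration; the counts come from scikit-learn's `confusion_matrix`, and the metrics are then computed by hand from those counts:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = positive class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Apply the formulas above directly to the counts.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here TP = 3, FP = 1, FN = 1, and TN = 5, so accuracy is 8/10 = 0.80, while precision, recall, and F1 are all 3/4 = 0.75. The scikit-learn functions `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` should return the same values.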

In [None]:
# Your code here . . .

#### **Precision-Recall Curve**
[Precision-Recall Curve documentation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#sphx-glr-auto-examples-model-selection-plot-precision-recall-py)
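A minimal sketch of computing the points of a precision-recall curve, assuming scikit-learn's `precision_recall_curve` and `average_precision_score`. The four labels and scores are hypothetical; in practice the scores would come from a fitted model's `predict_proba` or `decision_function`:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical true labels and predicted scores for the positive class.
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (precision, recall) pair per threshold, plus a final (1, 0) point.
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Average precision summarizes the curve as a single number.
ap = average_precision_score(y_true, y_scores)

print(precision, recall, thresholds, sep="\n")
print(f"average precision = {ap:.3f}")
```

Plotting `recall` on the x-axis against `precision` on the y-axis, for example with `plt.plot(recall, precision)`, draws the curve.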


In [None]:
# Your code here . . .
