# Lab 8: Define and Solve an ML Problem of Your Choosing

In [54]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports.  You will then inspect the data with your problem in mind and begin to formulate a  project plan. You will then implement the machine learning project plan. 

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [35]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(adultDataSet_filename)

df.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K
5,37.0,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40.0,United-States,<=50K
6,49.0,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16.0,Jamaica,<=50K
7,52.0,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,45.0,United-States,>50K
8,31.0,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50.0,United-States,>50K
9,42.0,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,5178,0,40.0,United-States,>50K


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. I chose the Census data set (censusData.csv).
2. I will be predicting whether a person's income is greater than 50K or not. The label is income_binary.
3. This is a supervised learning problem. It is a binary classification problem.
4. My features will be all numeric columns, along with workclass, sex_selfID, and occupation, which I will convert to one-hot encoded variables.
5. This is an important problem because a company may want to see how much its employees are being paid. The model can also help a company estimate how much it should pay new employees based on their skills and background information.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. Which machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [36]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K


In [37]:
df.shape

(32561, 15)

In [38]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex_selfID',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income_binary'],
      dtype='object')

In [39]:
df.dtypes

age               float64
workclass          object
fnlwgt              int64
education          object
education-num       int64
marital-status     object
occupation         object
relationship       object
race               object
sex_selfID         object
capital-gain        int64
capital-loss        int64
hours-per-week    float64
native-country     object
income_binary      object
dtype: object

In [40]:
df.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32399.0,32561.0,32561.0,32561.0,32561.0,32236.0
mean,38.589216,189778.4,10.080679,615.907773,87.30383,40.450428
std,13.647862,105550.0,2.57272,2420.191974,402.960219,12.353748
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,14084.0,4356.0,99.0


In [41]:
np.sum(df.isnull(), axis = 0)

age                162
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex_selfID           0
capital-gain         0
capital-loss         0
hours-per-week     325
native-country     583
income_binary        0
dtype: int64

In [42]:
df["income_binary"]

0        <=50K
1        <=50K
2        <=50K
3        <=50K
4        <=50K
         ...  
32556    <=50K
32557     >50K
32558    <=50K
32559    <=50K
32560     >50K
Name: income_binary, Length: 32561, dtype: object
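
Printing the label column shows its values but not how often each class occurs. A minimal sketch of a class-balance check follows; it uses a small made-up Series (`labels`, hypothetical values) in place of `df["income_binary"]`, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for df["income_binary"] (hypothetical values, not the real data).
labels = pd.Series(["<=50K", "<=50K", ">50K", "<=50K"], name="income_binary")

# Absolute number of examples per class.
class_counts = labels.value_counts()

# Share of each class. The largest share equals the accuracy of a model
# that always predicts the majority class, a useful baseline for later.
class_shares = labels.value_counts(normalize=True)
majority_baseline = class_shares.max()

print(class_counts)
print(class_shares)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Running the same two `value_counts()` calls on the real label column would show whether the classes are imbalanced and give a reference point for judging accuracy.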

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

Now that I have inspected my data set, I have a better idea of the data I am working with and the next steps I need to take.

First, I want to deal with the null values, since five columns contain missing entries. In the numeric columns (age and hours-per-week), I will replace the null values with the column mean. Next, I notice that there are many categorical variables in this data set. Because machine learning algorithms generally need numeric input, I will perform one-hot encoding on the features workclass, sex_selfID, and occupation. Most importantly, my label is a string with two possible values, "<=50K" or ">50K". I will convert it to a single binary column that the model can predict.

For my feature list, I am removing the education feature since it is directly correlated with education-num, which is the better variable to keep because it is numeric. Only numeric features will be used. I also performed winsorization on capital loss to cap outliers. The model I want to use is logistic regression, since it works well for binary classification problems and is easy to interpret.

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [43]:
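# Replace missing numeric values with the column mean.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
df["age"] = df["age"].fillna(df["age"].mean())
df["hours-per-week"] = df["hours-per-week"].fillna(df["hours-per-week"].mean())
print(df["age"].isnull().sum()) # check
print(df["hours-per-week"].isnull().sum()) # check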

0
0
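
The cell above computes each mean on the full DataFrame, before any train/test split. A possible variation, sketched here on made-up `train` and `test` frames (hypothetical data), is to learn the means from the training rows only with scikit-learn's `SimpleImputer` and reuse them on the test rows. This is not what the notebook does; it only illustrates the alternative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for a train/test split of the numeric columns.
train = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                      "hours-per-week": [40.0, 30.0, np.nan]})
test = pd.DataFrame({"age": [np.nan], "hours-per-week": [50.0]})

# Learn the column means from the training rows only.
imputer = SimpleImputer(strategy="mean")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Reuse the training means to fill gaps in the test rows.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(train_filled)
print(test_filled)
```

Fitting on the training rows keeps information from the test set out of the preprocessing step.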


In [44]:
cat_cols = df.select_dtypes(include = ["object"]).columns
df[cat_cols].nunique()

workclass          8
education         16
marital-status     7
occupation        14
relationship       6
race               5
sex_selfID         2
native-country    41
income_binary      2
dtype: int64

In [45]:
df_workclass = pd.get_dummies(df['workclass'], prefix='workclass_')

df = df.join(df_workclass)
df.drop(columns = 'workclass', inplace=True)

In [46]:
df_sex_selfID  = pd.get_dummies(df['sex_selfID'], prefix='sex_selfID_')
df = df.join(df_sex_selfID)
df.drop(columns = 'sex_selfID', inplace=True)

In [47]:
df_occupation  = pd.get_dummies(df['occupation'], prefix='occupation_')
df = df.join(df_occupation )
df.drop(columns = 'occupation', inplace=True)
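
The three encoding cells above repeat the same pattern. As a sketch, `pd.get_dummies` can encode several columns in one call through its `columns` argument; the toy frame below (`toy`, hypothetical values) stands in for `df`, and `prefix_sep="__"` is chosen only to mimic the double-underscore column names produced above.

```python
import pandas as pd

# Hypothetical toy frame; the None mimics a missing workclass value.
toy = pd.DataFrame({
    "age": [39.0, 50.0, 38.0],
    "workclass": ["State-gov", None, "Private"],
    "sex_selfID": ["Non-Female", "Non-Female", "Female"],
})

# Encode both columns at once; the original columns are dropped automatically.
encoded = pd.get_dummies(toy, columns=["workclass", "sex_selfID"], prefix_sep="__")

# By default a missing value becomes a row of all zeros in that column's group.
workclass_cols = encoded.filter(like="workclass__")
print(workclass_cols.iloc[1].sum())  # row 1 had a missing workclass

# dummy_na=True adds an explicit indicator column for missing values instead.
encoded_na = pd.get_dummies(toy, columns=["workclass", "sex_selfID"],
                            prefix_sep="__", dummy_na=True)
print(list(encoded_na.columns))
```

Whether to keep missing categories as all-zero rows or as their own indicator column is a modeling choice; the sketch only shows that both options exist.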

In [68]:
# Build a boolean label that is True when income is >50K and place it first.
# The guard makes the cell safe to re-run: once 'income_binary' has been
# dropped, running it again no longer raises KeyError: 'income_binary'.
if "income_binary" in df.columns:
    df.insert(0, "income_>50k", df["income_binary"] == ">50K")
    df = df.drop(columns="income_binary")

In [69]:
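# 90th percentile of capital-gain (not displayed, since it is not the last expression in the cell)
np.percentile(df['capital-gain'], 90)
# Winsorize: cap the lowest 1% and highest 1% of capital-gain values
df['capital-gain-win'] = stats.mstats.winsorize(df['capital-gain'], limits=[0.01, 0.01])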
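
To make the effect of winsorization concrete, here is a small sketch on a made-up array (`values`, hypothetical numbers). With ten values and limits of 10% on each side, the single smallest and single largest values are replaced by their nearest remaining neighbors.

```python
import numpy as np
from scipy.stats import mstats

# Hypothetical values: mostly zeros with one extreme outlier.
values = np.array([0, 0, 0, 0, 0, 0, 0, 0, 5, 1000])

# Cap the lowest 10% and highest 10% of values (one value on each side here).
clipped = mstats.winsorize(values, limits=[0.1, 0.1])

print(values.max())        # the outlier is still present in the input
print(int(clipped.max()))  # after winsorizing, the largest value is capped
```

Unlike dropping rows, winsorizing keeps every example and only pulls the extreme values inward.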

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [78]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score

In [79]:
df.dtypes

income_>50k                         bool
age                              float64
fnlwgt                             int64
education                         object
education-num                      int64
marital-status                    object
relationship                      object
race                              object
capital-gain                       int64
capital-loss                       int64
hours-per-week                   float64
native-country                    object
workclass__Federal-gov             uint8
workclass__Local-gov               uint8
workclass__Never-worked            uint8
workclass__Private                 uint8
workclass__Self-emp-inc            uint8
workclass__Self-emp-not-inc        uint8
workclass__State-gov               uint8
workclass__Without-pay             uint8
sex_selfID__Female                 uint8
sex_selfID__Non-Female             uint8
occupation__Adm-clerical           uint8
occupation__Armed-Forces           uint8
occupation__Craf

In [80]:
feature_list = list(df.select_dtypes(include = ["float64", "uint8", "int64"]).columns)
y = df['income_>50k'] #label
X = df[feature_list] #features

In [81]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .33)

In [82]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(21815, 33)
(10746, 33)
(21815,)
(10746,)
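
The split above is random, so metrics can shift slightly from run to run. A minimal sketch of a reproducible, stratified split follows; the arrays `X_toy` and `y_toy` and the seed `1234` are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 25% positive labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([True] * 25 + [False] * 75)

# random_state fixes the shuffle; stratify keeps the class ratio
# roughly equal in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=1234, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With a fixed `random_state`, re-running the cell reproduces the same rows in each set, which makes before-and-after comparisons of model changes easier to trust.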


In [83]:
model = LogisticRegression()

model.fit(X_train, y_train)

probability_predictions = model.predict_proba(X_test)

df_print = pd.DataFrame(probability_predictions, columns = ['Class: False', 'Class: True'])
print('Class Prediction Probabilities: \n' + df_print[0:5].to_string(index=False))


l_loss = log_loss(y_test,probability_predictions )
print('Log loss: ' + str(l_loss))

class_label_predictions = model.predict(X_test)

acc_score = accuracy_score(y_test, class_label_predictions )
print('Accuracy: ' + str(acc_score))

Class Prediction Probabilities: 
 Class: False  Class: True
     0.744356     0.255644
     0.803688     0.196312
     0.737192     0.262808
     0.808375     0.191625
     0.794997     0.205003
Log loss: 0.5156477749043865
Accuracy: 0.7969477014703146


After continuously tuning my features, this is my final logistic regression model. I obtained an accuracy of about 0.80 and a log loss of about 0.52 on the test set. These metrics indicate good performance for a logistic regression model.
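
One possible next step for the model selection task is to scale the features and tune the regularization strength `C` with cross-validation. The sketch below runs on synthetic data from `make_classification` (an assumption made to keep the example self-contained), and the grid of `C` values is illustrative rather than tuned for the census data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and label (not the census data).
X_toy, y_toy = make_classification(n_samples=500, n_features=8,
                                   weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=0, stratify=y_toy
)

# Scaling inside the pipeline means the scaler is re-fit on each CV training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Illustrative grid over the inverse regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_log_loss")
search.fit(X_tr, y_tr)

best_C = search.best_params_["logisticregression__C"]
test_accuracy = accuracy_score(y_te, search.predict(X_te))
test_log_loss = log_loss(y_te, search.predict_proba(X_te))

print("Best C:", best_C)
print("Test accuracy:", test_accuracy)
print("Test log loss:", test_log_loss)
```

Selecting `C` by cross-validation on the training set and reporting metrics once on the held-out test set keeps the final numbers an honest estimate of how the model generalizes.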