# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports.  You will then inspect the data with your problem in mind and begin to formulate a  project plan. You will then implement the machine learning project plan. 

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(adultDataSet_filename)

df.head(10)


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K
5,37.0,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40.0,United-States,<=50K
6,49.0,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16.0,Jamaica,<=50K
7,52.0,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,45.0,United-States,>50K
8,31.0,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50.0,United-States,>50K
9,42.0,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,5178,0,40.0,United-States,>50K


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. I'm choosing to use the Adult Dataset for my problem.
2. For this problem, I'm going to try to predict the label 'income_binary'. In other words, I'm going to predict whether an example makes above or below 50K.
3. I plan on using unsupervised learning as I found it really interesting. I plan on using clustering with 2 final groups to see if the groups accurately represent the labels. In other words, I'll be checking if the 2 groups of examples, those above 50K and those below, naturally form through clustering.
4. As of now, I will try to use all features of the data, even the categorical ones, which I will try to convert into usable numeric data.
5. By understanding what makes a person earn more, companies can see any hindrances people face in monetary success and whether or not there is a bias against certain people. By exposing these things that work against an individual, companies can create solutions to these problems such as giving higher quality education or providing aid.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [3]:
# Column types
df.dtypes

age               float64
workclass          object
fnlwgt              int64
education          object
education-num       int64
marital-status     object
occupation         object
relationship       object
race               object
sex_selfID         object
capital-gain        int64
capital-loss        int64
hours-per-week    float64
native-country     object
income_binary      object
dtype: object

In [4]:
# Columns with null values
df.isnull().sum()

age                162
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex_selfID           0
capital-gain         0
capital-loss         0
hours-per-week     325
native-country     583
income_binary        0
dtype: int64
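
The instructions above also suggest checking for class imbalance. A minimal sketch of that check follows. It uses a small made-up frame named `toy` (an assumption for illustration, not the census data); in the notebook the same call would be applied to `df['income_binary']`.

```python
import pandas as pd

# Toy stand-in for the label column; values are illustrative only.
toy = pd.DataFrame({"income_binary": ["<=50K"] * 6 + [">50K"] * 2})

# Share of each class; a heavily skewed split signals class imbalance.
class_share = toy["income_binary"].value_counts(normalize=True)
print(class_share)
```

A skewed split matters here because a clustering with 2 groups has no reason to produce clusters whose sizes match the class proportions.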

In [5]:
workclass_unique = df['workclass'].unique()
race_unique = df['race'].unique()
education_unique = df['education'].unique()
age_unique = df['age'].unique()
relationship_unique = df['relationship'].unique()
country_unique = df['native-country'].unique()
sex_selfID_unique = df['sex_selfID'].unique()
education_num_unique = df['education-num'].unique()
occupation_unique = df['occupation'].unique()
married_unique = df['marital-status'].unique()

workclass_unique

array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
       'Local-gov', nan, 'Self-emp-inc', 'Without-pay', 'Never-worked'],
      dtype=object)

In [6]:
label_dict = {"<=50K" : 1, ">50K" : 0}
sex_selfID_replacements = {"Female": 1, "Non-Female": 0}
marital_status_dict = {"Never-married" : 0,"Married-civ-spouse": 1, "Divorced" : 0,"Married-spouse-absent" : 1,"Separated" : 0,"Married-AF-spouse" : 1,"Widowed": 0}

df_copy = df.copy()

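# Assigning the mapped columns back avoids relying on in-place edits of a column selection
df_copy['income_binary'] = df_copy['income_binary'].map(label_dict)
df_copy['sex_selfID'] = df_copy['sex_selfID'].map(sex_selfID_replacements)

# Converting marital-status to either 1 or 0 (1 = Currently Married, 0 = Currently Single)
df_copy['marital-status'] = df_copy['marital-status'].map(marital_status_dict)

# Correlations between the numeric columns only
df_copy.select_dtypes(include='number').corr()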

Unnamed: 0,age,fnlwgt,education-num,marital-status,sex_selfID,capital-gain,capital-loss,hours-per-week,income_binary
age,1.0,-0.076267,0.036761,0.318189,-0.088614,0.124901,0.057545,0.067066,-0.233638
fnlwgt,-0.076267,1.0,-0.043195,-0.025517,-0.026858,-0.002234,-0.010252,-0.01813,0.009463
education-num,0.036761,-0.043195,1.0,0.078258,-0.01228,0.167089,0.079923,0.147256,-0.335154
marital-status,0.318189,-0.025517,0.078258,1.0,-0.421465,0.13073,0.07813,0.211277,-0.434944
sex_selfID,-0.088614,-0.026858,-0.01228,-0.421465,1.0,-0.072555,-0.045567,-0.229402,0.21598
capital-gain,0.124901,-0.002234,0.167089,0.13073,-0.072555,1.0,-0.055138,0.101594,-0.347555
capital-loss,0.057545,-0.010252,0.079923,0.07813,-0.045567,-0.055138,1.0,0.0545,-0.150526
hours-per-week,0.067066,-0.01813,0.147256,0.211277,-0.229402,0.101594,0.0545,1.0,-0.229523
income_binary,-0.233638,0.009463,-0.335154,-0.434944,0.21598,-0.347555,-0.150526,-0.229523,1.0
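
Because `label_dict` maps "<=50K" to 1 and ">50K" to 0, a negative correlation with `income_binary` means the feature tends to be higher for the above-50K group. The sketch below shows one way to rank features by the strength of that correlation. The frame `toy` and its values are made up for illustration; in the notebook the same steps would apply to `df_copy`.

```python
import pandas as pd

# Toy numeric frame standing in for df_copy; values are illustrative only.
toy = pd.DataFrame({
    "education-num": [9, 13, 14, 7, 13, 10],
    "hours-per-week": [40, 50, 45, 20, 60, 40],
    "income_binary": [1, 0, 0, 1, 0, 1],  # 1 = "<=50K", 0 = ">50K", as in label_dict
})

# Correlation of every feature with the label, dropping the label's self-correlation.
label_corr = toy.corr()["income_binary"].drop("income_binary")

# Reorder so the strongest correlation (by absolute value) comes first.
ranked = label_corr.reindex(label_corr.abs().sort_values(ascending=False).index)
print(ranked)
```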


## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

1. After looking at the features, I think I will remove (1) education because I can use the education-num column instead, (2) relationship because I will utilize the marital-status column, and (3) the occupation feature because I can use the workclass column. Lastly, I'll remove the 'native-country' feature as there are a lot of unique values which will make one-hot encoding difficult.
2. To prepare the data, I will replace null values in numeric features with that feature's mean. For categorical columns with null values, I'll remove those rows. Then, I can replace certain columns like marital-status and sex_selfID with binary values and limit outliers in others. Then, I can standardize the numeric, non-binary columns.
3. I plan on using a KMeans model with 2 clusters, each representing one of the 2 values of the label.
4. To train my model, I'll use 70% of my data. Then, once each training example has been assigned a cluster, I'll split the training data into 2 subsets, one containing the examples labeled "<=50K" and one containing the examples labeled ">50K". From there, I'll check whether the examples in each subset mostly share the same cluster value, which would indicate that the clustering lines up with the label, essentially predicting it.

To test the model, I'll predict clusters for the test data and use the same method to analyze the results.
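
The comparison described in the plan can be sketched as a row-normalized cross-tabulation of label against cluster. The data below is hypothetical (eight made-up examples), and the column names `Income` and `Cluster` are chosen to match the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical cluster assignments and true labels for eight examples.
toy = pd.DataFrame({
    "Income":  [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K"],
    "Cluster": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Rows: true label; columns: cluster id; values: share of that label in each cluster.
table = pd.crosstab(toy["Income"], toy["Cluster"], normalize="index")
print(table)
```

The entry in row ">50K", column 1 is the same quantity as summing the `Cluster` column for the above-50K subset and dividing by the subset size. If the clustering lined up with the label, one row would be concentrated in column 0 and the other in column 1.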

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [7]:
import scipy.stats as stats
from sklearn.preprocessing import StandardScaler

# Preparing the data
df_copy = df.copy()

# Replacing null values in age and hours-per-week with the column means
df_copy['age'] = df_copy['age'].fillna(df_copy['age'].mean())
df_copy['hours-per-week'] = df_copy['hours-per-week'].fillna(df_copy['hours-per-week'].mean())

# Removing remaining examples with null values
df_copy = df_copy.dropna()

# Creating our label and feature datasets
df_label = df_copy['income_binary']
df_features = df_copy.drop(['income_binary', 'native-country', 'education', 'relationship', 'occupation'], axis = 1)

# Converting the sex_selfID column to either 1 or 0 (1 = Female, 0 = Non-female)
sex_selfID_replacements = {"Female": 1, "Non-Female": 0}
df_features['sex_selfID'] = df_features['sex_selfID'].map(sex_selfID_replacements)

# Converting marital-status to either 1 or 0 (1 = Currently Married, 0 = Currently Single)
marital_status_dict = {"Never-married" : 0,"Married-civ-spouse": 1, "Divorced" : 0,"Married-spouse-absent" : 1,"Separated" : 0,"Married-AF-spouse" : 1,"Widowed": 0}
df_features['marital-status'] = df_features['marital-status'].map(marital_status_dict)

# Limiting outliers in education-num by winsorizing (clipping the lowest and highest 1% of values)
df_features['education-num'] = stats.mstats.winsorize(df_features['education-num'], limits=[0.01, 0.01])

# Scaling columns that aren't binary indicators
scaler = StandardScaler()
df_to_scale = df_features[['fnlwgt', 'hours-per-week', 'age', 'education-num', 'capital-gain', 'capital-loss']]
transformed_data = scaler.fit_transform(df_to_scale)
df_scaled = pd.DataFrame(transformed_data, columns = df_to_scale.columns, index = df_to_scale.index)

# Converting workclass and race to columns via one-hot encoding
features_to_encode = ['workclass', 'race']
encoded_features = pd.get_dummies(df_features[features_to_encode], columns = features_to_encode)

sex_selfID_col = df_features['sex_selfID']
marital_col = df_features['marital-status']

df_features = df_scaled.join(encoded_features)
df_features = df_features.join(sex_selfID_col)
df_features = df_features.join(marital_col)
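
The cell above fits the scaler on the full data set before the train/test split. One alternative, sketched below under assumptions, is to split first and fit the preprocessing on the training rows only, using scikit-learn's `ColumnTransformer` to scale the non-binary numeric columns, one-hot encode the categorical ones, and pass binary indicators through unchanged. The frame `toy` and its values are made up; this is not the notebook's actual pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for a few census columns; values are illustrative only.
toy = pd.DataFrame({
    "age": [25, 38, 52, 41, 29, 60, 33, 47],
    "hours-per-week": [40, 50, 45, 20, 60, 40, 35, 40],
    "workclass": ["Private", "State-gov", "Private", "Private",
                  "Local-gov", "Private", "State-gov", "Private"],
    "sex_selfID": [1, 0, 0, 1, 0, 1, 0, 1],  # already a 0/1 indicator
})

# Split first so the test rows never influence the fitted statistics.
train, test = train_test_split(toy, test_size=0.25, random_state=0)

prep = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["age", "hours-per-week"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ],
    remainder="passthrough",  # binary indicators are left as 0/1
    sparse_threshold=0,       # always return a dense array
)

X_train = prep.fit_transform(train)  # statistics learned from training rows only
X_test = prep.transform(test)        # the same statistics reused on the test rows
print(X_train.shape, X_test.shape)
```

The output columns follow the order of the `transformers` list, with the passed-through columns last, so the final column here is the untouched `sex_selfID` indicator.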

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [8]:
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_features, df_label, test_size = 0.3, random_state = 1234)

model = KMeans(n_clusters = 2)
model.fit(X_train)

df_features_final = X_train.copy()
df_features_final.insert(0, 'Cluster', model.labels_)
df_features_final.insert(0, 'Income', y_train)

In [9]:
# Fraction of examples above 50K that were assigned to cluster 1
above_50 = df_features_final.loc[df_features_final['Income'] == ">50K"]
above_percentage = above_50["Cluster"].sum() / above_50.shape[0]

# Fraction of examples at or below 50K that were assigned to cluster 1
below_50 = df_features_final.loc[df_features_final['Income'] == "<=50K"]
below_percentage = below_50["Cluster"].sum() / below_50.shape[0]

print("Training Validity")
print(above_percentage)
print(below_percentage)

Training Validity
0.8476007677543186
0.37546374897818025


In [10]:
clusters = model.predict(X_test)

df_test_final = X_test.copy()
df_test_final.insert(0, 'Cluster', clusters)
df_test_final.insert(0, 'Income', y_test)

# Fraction of examples above 50K that were assigned to cluster 1
above_50 = df_test_final.loc[df_test_final['Income'] == ">50K"]
above_percentage = above_50["Cluster"].sum() / above_50.shape[0]

# Fraction of examples at or below 50K that were assigned to cluster 1
below_50 = df_test_final.loc[df_test_final['Income'] == "<=50K"]
below_percentage = below_50["Cluster"].sum() / below_50.shape[0]

print("Testing Validity")
print(above_percentage)
print(below_percentage)

Testing Validity
0.8468233246301131
0.3778699451933047


For my model to be accurate, I expect the two printed values to be far apart from one another, meaning that one income group has few examples in Cluster 1 and the other has many. That would indicate the model drew a line between examples below and above 50K.

First Attempt:

My two values were nearly identical, both around 25%. I soon realized that I had forgotten to standardize, so the 'fnlwgt' feature, with its very large values, was dominating the distances that KMeans uses. To check whether this was the cause, I removed that feature temporarily, and the resulting numbers were farther apart than before.

Second Attempt:

I standardized my numeric features, and the results were a bit better, with values like 60% and 30%. However, this still was not very accurate. After looking through the data, I found that I had standardized the binary indicators too, which produced values like 0.44 and 0.6 in the sex_selfID column instead of 0 and 1.

Third Attempt:

I standardized only the numeric columns that were not binary indicators, which led to the results I have now: about 85% and 38%. Once the training data produced these values, I used the test data and predicted the cluster labels.

I used the same method to compute the percentages and found values very similar to those from the training data (again about 85% and 38%), which suggests the clustering is not overfitting the training data.
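
One caveat with the percentages above is that KMeans numbers its clusters arbitrarily, so "Cluster 1" may correspond to either income group from one run to the next. A possible way to summarize the match without depending on the numbering is sketched below, using scikit-learn's `adjusted_rand_score` and a best-of-both-mappings agreement rate. The arrays are hypothetical and are not the notebook's results.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels and cluster assignments for eight examples.
y_true = np.array([">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K"])
clusters = np.array([1, 1, 0, 0, 0, 1, 0, 0])

# Adjusted Rand index: unchanged if the cluster ids 0 and 1 are swapped.
ari = adjusted_rand_score(y_true, clusters)

# Agreement rate under the better of the two possible cluster-to-label mappings.
is_high = (y_true == ">50K").astype(int)
agreement = max((clusters == is_high).mean(), (clusters != is_high).mean())

print(ari, agreement)
```

In the notebook, `y_true` would be `y_test` and `clusters` would be the output of `model.predict(X_test)`.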