# Lab 8: Define and Solve an ML Problem of Your Choosing

In [2]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. You will then implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [3]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(WHRDataSet_filename)

df.head()

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
0,Afghanistan,2008,3.72359,7.16869,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.92969,-1.655084,1.774662,0.4766,,,
1,Afghanistan,2009,4.401778,7.33379,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,,,0.441906
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.13763,0.706766,0.618265,0.275324,0.299357,-1.99181,-1.617176,1.878622,0.394803,,,0.327318
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.78536,0.465942,,,0.336764
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.77562,0.710385,0.267919,0.43544,-1.842996,-1.404078,1.798283,0.475367,,,0.34454


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

I chose the World Happiness Report (WHR) data set. I will be predicting "Confidence in national government", which is my label, and examining which features have the biggest impact on it. This is a supervised learning problem, specifically a regression problem. My features are the remaining numeric columns in the data set (this list may change after I explore the data). This is an important problem because a model like this can help companies understand where to focus political efforts, or could aid governments in improving their social policy efforts.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.
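
Two of the inspection techniques listed above, checking how each column relates to the label and flagging outliers, can be sketched as follows. This is a minimal illustration on a made-up DataFrame: the names `toy`, `feature_a`, `feature_b` and `label` are hypothetical, and the 3-standard-deviation cutoff is one common convention rather than a rule from this course.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the column names are hypothetical.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
toy["label"] = 2.0 * toy["feature_a"] + rng.normal(scale=0.5, size=200)
toy.loc[0, "feature_b"] = 25.0  # plant one obvious outlier

# Correlation of every column with the label, strongest (by absolute value) first.
corr_with_label = (
    toy.corr()["label"]
    .drop("label")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_label)

# Flag values more than 3 standard deviations away from their column mean.
z_scores = (toy - toy.mean()) / toy.std()
outlier_mask = z_scores.abs() > 3
print(outlier_mask.sum())
```

On the real data, the same two steps would be applied to `df` with "Confidence in national government" as the label column.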

In [4]:
df.shape

(1562, 19)

In [5]:
df.dtypes

country                                                      object
year                                                          int64
Life Ladder                                                 float64
Log GDP per capita                                          float64
Social support                                              float64
Healthy life expectancy at birth                            float64
Freedom to make life choices                                float64
Generosity                                                  float64
Perceptions of corruption                                   float64
Positive affect                                             float64
Negative affect                                             float64
Confidence in national government                           float64
Democratic Quality                                          float64
Delivery Quality                                            float64
Standard deviation of ladder by country-year    

In [6]:
df.head()

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
0,Afghanistan,2008,3.72359,7.16869,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.92969,-1.655084,1.774662,0.4766,,,
1,Afghanistan,2009,4.401778,7.33379,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,,,0.441906
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.13763,0.706766,0.618265,0.275324,0.299357,-1.99181,-1.617176,1.878622,0.394803,,,0.327318
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.78536,0.465942,,,0.336764
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.77562,0.710385,0.267919,0.43544,-1.842996,-1.404078,1.798283,0.475367,,,0.34454


In [7]:
df.isnull().sum()

country                                                       0
year                                                          0
Life Ladder                                                   0
Log GDP per capita                                           27
Social support                                               13
Healthy life expectancy at birth                              9
Freedom to make life choices                                 29
Generosity                                                   80
Perceptions of corruption                                    90
Positive affect                                              18
Negative affect                                              12
Confidence in national government                           161
Democratic Quality                                          171
Delivery Quality                                            171
Standard deviation of ladder by country-year                  0
Standard deviation/Mean of ladder by cou

In [8]:
df = df.dropna()
df.isnull().sum()

country                                                     0
year                                                        0
Life Ladder                                                 0
Log GDP per capita                                          0
Social support                                              0
Healthy life expectancy at birth                            0
Freedom to make life choices                                0
Generosity                                                  0
Perceptions of corruption                                   0
Positive affect                                             0
Negative affect                                             0
Confidence in national government                           0
Democratic Quality                                          0
Delivery Quality                                            0
Standard deviation of ladder by country-year                0
Standard deviation/Mean of ladder by country-year           0
GINI ind

In [9]:
df.shape

(382, 19)
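
Dropping every row that has any missing value reduces the data from 1562 rows to 382. Part 3 also mentions replacing missing values with means as an option. The sketch below shows that alternative on a small made-up DataFrame; the column names are placeholders, not the WHR columns, and this is not the approach used in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the column names are placeholders.
toy = pd.DataFrame({
    "label": [0.6, np.nan, 0.3, 0.4],
    "feature_a": [1.0, 2.0, np.nan, 4.0],
    "feature_b": [10.0, np.nan, 30.0, 40.0],
})

# Rows with a missing label cannot be used for supervised training, so drop only those.
toy = toy.dropna(subset=["label"]).copy()

# Fill the remaining gaps in the feature columns with each column's mean.
feature_cols = ["feature_a", "feature_b"]
toy[feature_cols] = toy[feature_cols].fillna(toy[feature_cols].mean())

print(toy)
print(toy.shape)
```

If this were applied to a real modeling task, the means would normally be computed on the training split only, so that information from the test rows does not leak into training.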

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

I plan to use a Linear Regression model and to test different ensemble models to determine how the features relate to "Confidence in national government". To prepare my feature list, I will drop the non-numeric "country" column along with "year", and keep the features that are stored as floats. I have also removed all rows with null values from my DataFrame. I will use grid searches with cross-validation to find the hyperparameters that give the most accurate models. I also plan to use the Linear Regression model to examine the weight of each feature on the label.

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import StackingRegressor

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [11]:
features = df.drop(columns=["country", "year", "Confidence in national government"])
features

Unnamed: 0,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
14,5.510124,9.246649,0.784502,68.028885,0.601512,-0.174559,0.847675,0.606636,0.271393,-0.060784,-0.328862,1.921203,0.348668,0.290,0.303250,0.568153
33,6.424133,9.750825,0.918693,66.410309,0.636646,-0.129523,0.884742,0.863786,0.236901,0.023821,-0.570944,2.067742,0.321871,0.453,0.476067,0.368422
34,6.441067,9.836924,0.926799,66.552177,0.730258,-0.125792,0.854695,0.846136,0.210975,0.138446,-0.469284,2.107838,0.327250,0.445,0.476067,0.366742
35,6.775805,9.884781,0.889073,66.694588,0.815802,-0.174472,0.754646,0.840048,0.231855,0.251968,-0.442329,1.987599,0.293338,0.436,0.476067,0.347596
36,6.468387,9.863960,0.901776,66.836693,0.747498,-0.148023,0.816546,0.856516,0.272219,0.199125,-0.572653,2.098197,0.324377,0.425,0.476067,0.317217
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1521,5.295781,8.408774,0.786611,65.105911,0.831494,-0.004510,0.742637,0.685243,0.215798,-0.674288,-0.522663,1.422253,0.268563,0.393,0.364286,0.479870
1523,5.534570,8.499093,0.775009,65.410637,0.856053,-0.109530,0.814885,0.615128,0.221356,-0.576355,-0.504539,1.454641,0.262828,0.357,0.364286,0.438014
1535,3.967958,8.233983,0.638252,54.427925,0.663909,-0.172366,0.885429,0.610585,0.275674,-1.983291,-1.264091,2.438403,0.614523,0.367,0.357000,0.447447
1547,4.843164,8.196217,0.691483,52.730522,0.758654,-0.048977,0.871020,0.690034,0.381731,0.040718,-0.391482,3.080448,0.636040,0.571,0.527400,0.671201


In [12]:
X = features
y = df["Confidence in national government"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 123)

<b>Linear Regression Model</b>

In [13]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

In [14]:
prediction = lr_model.predict(X_test)

In [15]:
print('Model Summary:\n')

# Print intercept (alpha)
print('Intercept:')
print('alpha = ' , lr_model.intercept_)

# Print weights
print('\nWeights:')
for i, (w, name) in enumerate(zip(lr_model.coef_, features.columns), start=1):
    print('w_' + str(i),'= ', w, ' [weight of '+ name +']')

Model Summary:

Intercept:
alpha =  0.4893122939967756

Weights:
w_1 =  -0.025470181990024546  [weight of Life Ladder]
w_2 =  0.05944367098823414  [weight of Log GDP per capita]
w_3 =  -0.1767270060756251  [weight of Social support]
w_4 =  -0.0029855746041365164  [weight of Healthy life expectancy at birth]
w_5 =  0.5477240477063068  [weight of Freedom to make life choices]
w_6 =  0.018603850074056856  [weight of Generosity]
w_7 =  -0.49430800476044034  [weight of Perceptions of corruption]
w_8 =  0.21019722732396276  [weight of Positive affect]
w_9 =  -0.078409947735317  [weight of Negative affect]
w_10 =  -0.08263405808005304  [weight of Democratic Quality]
w_11 =  -0.06603874253090238  [weight of Delivery Quality]
w_12 =  -0.08967387158813811  [weight of Standard deviation of ladder by country-year]
w_13 =  0.11586457316686544  [weight of Standard deviation/Mean of ladder by country-year]
w_14 =  1.0350611190378385  [weight of GINI index (World Bank estimate)]
w_15 =  -1.04895421447
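
The weights above are in units of "change in the label per one unit of the feature", and the features are on very different scales (for example, life expectancy in years versus indices between 0 and 1). One way to make weights comparable is to standardize the features before fitting. The sketch below illustrates the idea on synthetic data; `small_scale` and `large_scale` are hypothetical features, and the notebook itself does not scale its inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales (names are hypothetical).
rng = np.random.default_rng(0)
small_scale = rng.normal(0.0, 0.1, size=300)    # e.g. an index with a narrow range
large_scale = rng.normal(60.0, 10.0, size=300)  # e.g. a quantity measured in years
X_demo = np.column_stack([small_scale, large_scale])
y_demo = 1.0 * small_scale + 0.05 * large_scale + rng.normal(0.0, 0.05, size=300)

# Raw fit: weights are in label units per one unit of each feature.
raw_model = LinearRegression().fit(X_demo, y_demo)

# Standardized fit: weights are in label units per one standard deviation of each feature.
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)
scaled_weights = scaled_model.named_steps["linearregression"].coef_

print("raw weights:         ", raw_model.coef_)
print("standardized weights:", scaled_weights)
```

In this toy setup the raw weights rank `small_scale` first, while the standardized weights rank `large_scale` first, which is why raw weight size alone is a weak guide to feature importance.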

In [16]:
lr_rmse = root_mean_squared_error(y_test, prediction)
lr_r2 = r2_score(y_test, prediction)

In [17]:
# Print root mean squared error (RMSE)
print('\nModel Performance\n\nRMSE = %.2f' % lr_rmse)

# The coefficient of determination: 1 is perfect prediction
print(' R^2 = %.2f' % lr_r2)


Model Performance

RMSE = 0.11
 R^2 = 0.64
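
To make the two metrics concrete, here is a small worked computation of RMSE and R^2 by hand, cross-checked against scikit-learn. The numbers are made up for illustration and are not values from this notebook. The cross-check uses `mean_squared_error` followed by a square root so that it also runs on scikit-learn versions that predate `root_mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small made-up example; these are not values from the notebook.
y_true = np.array([0.30, 0.45, 0.60, 0.75])
y_pred = np.array([0.35, 0.40, 0.65, 0.70])

# RMSE: square root of the mean squared residual.
residuals = y_true - y_pred
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Cross-check against scikit-learn.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)
print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

RMSE is in the same units as the label, so it should be read against the label's range, while R^2 is unitless and measures the share of label variance the predictions account for.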


<b>Decision Tree Model</b>

In [98]:
param_grid = {"max_depth": np.arange(1,32), "min_samples_leaf": np.arange(1,50)}

print('Running Grid Search...')
# 1. Create a DecisionTreeRegressor model object without supplying arguments. 
#    Save the model object to the variable 'dt_regressor'

dt_regressor = DecisionTreeRegressor()

# 2. Run a Grid Search with 3-fold cross-validation and assign the output to the object 'dt_grid'.

dt_grid = GridSearchCV(dt_regressor, param_grid, cv=3, scoring="neg_root_mean_squared_error")

# 3. Fit the model (use the 'dt_grid' variable) on the training data and assign the fitted model to the 
#    variable 'dt_grid_search'

dt_grid_search = dt_grid.fit(X_train, y_train)

print('Done')

Running Grid Search...
Done


In [102]:
rmse_DT = -1 * dt_grid_search.best_score_
print("[DT] RMSE for the best model is : {:.2f}".format(rmse_DT) )
dt_best_params = dt_grid_search.best_params_

dt_max = dt_grid_search.best_estimator_.max_depth
dt_min = dt_grid_search.best_estimator_.min_samples_leaf

dt_best_params

[DT] RMSE for the best model is : 0.13


{'max_depth': 7, 'min_samples_leaf': 11}

In [20]:
dt_model = DecisionTreeRegressor(max_depth=dt_max, min_samples_leaf=dt_min)
dt_model.fit(X_train, y_train)
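
The RMSE reported by the grid search is a cross-validated score computed on folds of the training data, so it is not directly comparable to the test-set RMSE reported for the other models. A minimal sketch of scoring the tuned tree on a held-out test set is shown below. It uses synthetic data from `make_regression` and a deliberately small grid; the names `X_demo`, `demo_grid` and `search` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the WHR features and label.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=123)

# A deliberately small grid so the sketch runs quickly.
demo_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    demo_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated score on the training folds (negated RMSE).
cv_rmse = -search.best_score_

# With the default refit=True, best_estimator_ has already been refit on all of X_tr,
# so it can be scored directly on the held-out test set.
test_pred = search.best_estimator_.predict(X_te)
test_rmse = np.sqrt(mean_squared_error(y_te, test_pred))
test_r2 = r2_score(y_te, test_pred)

print("CV RMSE:  ", cv_rmse)
print("Test RMSE:", test_rmse)
print("Test R^2: ", test_r2)
```

In this notebook, the equivalent step would be calling `dt_model.predict(X_test)` and passing the result to `root_mean_squared_error` and `r2_score`.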

<b>Stacking model </b>

In [21]:
estimators = [("DT", DecisionTreeRegressor(max_depth=dt_max, min_samples_leaf=dt_min)),
              ("LR", LinearRegression())]
stacking_model = StackingRegressor(estimators=estimators, passthrough=False)
stacking_model.fit(X_train, y_train)

# 1. Use the fitted stacking model to make predictions on the test data
stacking_pred = stacking_model.predict(X_test)

# 2. Compute the RMSE 
stack_rmse = root_mean_squared_error(y_test, stacking_pred)

# 3. Compute the R2 score
stack_r2 = r2_score(y_test, stacking_pred)

   
print('RMSE: {0}'.format(stack_rmse))
print('R2: {0}'.format(stack_r2))                       

RMSE: 0.10974417637652979
R2: 0.663238281105288


<b>Ensemble Modeling - Gradient Boosted Decision Tree </b>

In [27]:
gbdt_model = GradientBoostingRegressor(max_depth = dt_max)
gbdt_model.fit(X_train, y_train)

In [104]:
n_estimate = np.arange(1,500,10)

print("GBDT Grid Search...")
gb_param_grid = {"n_estimators": n_estimate}

gbdt_grid = GridSearchCV(gbdt_model, gb_param_grid, cv=3)

gbdt_grid_search = gbdt_grid.fit(X_train, y_train)
print("Complete")

GBDT Grid Search...
Complete


In [32]:
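# No 'scoring' argument was passed to GridSearchCV above, so best_score_ is the
# default regressor score (mean cross-validated R^2), not an RMSE.
r2_GBDT = gbdt_grid_search.best_score_
print("[GBDT] Mean cross-validated R^2 for the best model is : {:.2f}".format(r2_GBDT) )
gbdt_best_params = gbdt_grid_search.best_params_

gbdt_min = gbdt_grid_search.best_estimator_.n_estimators

gbdt_best_params

[GBDT] Mean cross-validated R^2 for the best model is : 0.47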


{'n_estimators': 291}

In [36]:
best_gbdt_model = GradientBoostingRegressor(n_estimators = gbdt_min, max_depth = dt_max)

## Final Analysis of Data

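With the Linear Regression model, the features with the largest positive weights on "Confidence in national government" were "GINI index (World Bank estimate)" (a weight of about 1.04), which measures income inequality, and "Freedom to make life choices" (about 0.55). These are regression weights rather than correlations, and because the features were not scaled, their magnitudes are not directly comparable across features.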

On the test set, the RMSE was 0.11 and the R^2 was 0.64, meaning the model accounts for about 64% of the variance in the label. There is room for improvement, but these are reasonable results for a first model.

After some trial and error, I found that one of the best-performing models was the stacking ensemble that combines the Decision Tree and Linear Regression models. Its test R^2 was 0.66 and its test RMSE was 0.11 (0.1097 before rounding), a small improvement over Linear Regression alone.
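
A comparison like this is easier to trust when every model is scored with the same metric on the same folds. The sketch below shows one way to do that with `cross_val_score`. It runs on synthetic data, and the hyperparameters are illustrative placeholders rather than the tuned values found by the grid searches above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the WHR training split.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Hyperparameters below are illustrative, not the tuned values from the grid searches.
candidates = {
    "LR": LinearRegression(),
    "DT": DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, random_state=0),
    "GBDT": GradientBoostingRegressor(n_estimators=100, random_state=0),
    "Stack": StackingRegressor(
        estimators=[
            ("DT", DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, random_state=0)),
            ("LR", LinearRegression()),
        ]
    ),
}

# Use the same folds and the same metric for every model so the numbers are comparable.
folds = KFold(n_splits=3, shuffle=True, random_state=0)
cv_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(
        model, X_demo, y_demo, cv=folds, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()

# Print the models from lowest to highest cross-validated RMSE.
for name, rmse in sorted(cv_rmse.items(), key=lambda item: item[1]):
    print(f"{name:>5}: CV RMSE = {rmse:.2f}")
```

The synthetic data here is linear by construction, so the ranking it produces says nothing about which model is best for the WHR data; only the comparison procedure carries over.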