# Implement Your Machine Learning Project Plan

In this lab you will implement the machine learning project plan you created in the assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Create features and a label, and prepare your data for your model.
3. Fit your model to the training data and evaluate your model. 
4. Show how you've improved upon your baseline model.

### Import Packages

Before you get started, import a few packages.

In [8]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import the additional packages that you will need for this task (only import packages that you have used in this course).

In [9]:
# YOUR CODE HERE
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score

## Part 1: Load the Data Set


You have chosen to work with one of three data sets. The data sets are located in a folder named "data." The file names of the three data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultDataFull.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`

<b>Task:</b> In the code cell below, use the same method you have been using to load your data using `pd.read_csv()` and save it to DataFrame `df`.

In [10]:
# YOUR CODE HERE
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
df = pd.read_csv(WHRDataSet_filename, header = 0)
df.dropna(inplace = True)
df.head()

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
14,Albania,2012,5.510124,9.246649,0.784502,68.028885,0.601512,-0.174559,0.847675,0.606636,0.271393,0.364894,-0.060784,-0.328862,1.921203,0.348668,0.29,0.30325,0.568153
33,Argentina,2009,6.424133,9.750825,0.918693,66.410309,0.636646,-0.129523,0.884742,0.863786,0.236901,0.273822,0.023821,-0.570944,2.067742,0.321871,0.453,0.476067,0.368422
34,Argentina,2010,6.441067,9.836924,0.926799,66.552177,0.730258,-0.125792,0.854695,0.846136,0.210975,0.351856,0.138446,-0.469284,2.107838,0.32725,0.445,0.476067,0.366742
35,Argentina,2011,6.775805,9.884781,0.889073,66.694588,0.815802,-0.174472,0.754646,0.840048,0.231855,0.607538,0.251968,-0.442329,1.987599,0.293338,0.436,0.476067,0.347596
36,Argentina,2012,6.468387,9.86396,0.901776,66.836693,0.747498,-0.148023,0.816546,0.856516,0.272219,0.418255,0.199125,-0.572653,2.098197,0.324377,0.425,0.476067,0.317217


## Part 2: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [12]:
# YOUR CODE HERE
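# Label: log GDP per capita
y = df['Log GDP per capita']
# Features. Note: .astype(int) truncates toward zero (e.g. 0.78 -> 0, 5.51 -> 5),
# so the fractional part of every feature value is discarded.
X = df[['Life Ladder', 'Freedom to make life choices', 'Social support', 'Confidence in national government']].astype(int)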
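
The cast to `int` in the cell above is worth checking against the data. Here is a minimal sketch of what `.astype(int)` does to these four columns, using the first row shown by `df.head()` (Albania, 2012). The small DataFrame `sample` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df: the Albania 2012 row from the df.head() output.
sample = pd.DataFrame({
    'Life Ladder': [5.510124],
    'Freedom to make life choices': [0.601512],
    'Social support': [0.784502],
    'Confidence in national government': [0.364894],
})

# astype(int) truncates toward zero, so any value in [0, 1) becomes 0.
truncated = sample.astype(int)
print(truncated.iloc[0].to_dict())

# Keeping the columns as floats preserves the fractional information.
as_float = sample.astype(float)
print(as_float.iloc[0].to_dict())
```

If the truncation is not intended, dropping `.astype(int)` would keep the features continuous. The metrics reported in this notebook were produced with the cast in place.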


In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1234)
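
As a quick illustration of what the split above does, the following sketch applies the same `test_size` and `random_state` to hypothetical toy arrays (`X_demo`, `y_demo`, not the WHR data). It prints the partition sizes and checks that a fixed seed reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same arguments as the notebook's split: 30% of rows held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1234
)
print(len(X_tr), len(X_te))

# Calling again with the same seed yields the same test rows.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1234
)
print(np.array_equal(X_te, X_te2))
```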


In [43]:
#Implement RF model
rf_model = RandomForestRegressor()
rf_model.fit(X_train, y_train)
y_rf_pred = rf_model.predict(X_test)


In [44]:
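# Check performance on the test set.
# RMSE is computed as the square root of MSE; the `squared` argument is deprecated in recent scikit-learn releases.
rf_rmse = np.sqrt(mean_squared_error(y_test, y_rf_pred))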
rf_r2 = r2_score(y_test, y_rf_pred)
print('Root Mean Squared Error: {0}'.format(rf_rmse))
print('R2: {0}'.format(rf_r2)) 

Root Mean Squared Error: 0.7088988299301084
R2: 0.3249225930288354
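
For reference, the two metrics printed above follow the standard definitions: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. Below is a minimal sketch in plain Python, using hypothetical toy values (`y_true_demo`, `y_pred_demo`) rather than the notebook's data.

```python
import math

# Hypothetical toy labels and predictions, for illustration only.
y_true_demo = [9.2, 9.8, 10.1, 8.7]
y_pred_demo = [9.0, 9.9, 9.6, 9.1]

n = len(y_true_demo)
mean_true = sum(y_true_demo) / n

# Residual sum of squares: squared errors of the model's predictions.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true_demo, y_pred_demo))
# Total sum of squares: squared errors of always predicting the mean label.
ss_tot = sum((t - mean_true) ** 2 for t in y_true_demo)

rmse = math.sqrt(ss_res / n)
r2 = 1 - ss_res / ss_tot
print(rmse, r2)
```

An R2 of 0 corresponds to a model that does no better than predicting the mean label, so an R2 of about 0.32 means roughly a third of the variance in the test labels is explained.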


In [45]:
param_grid = {'max_depth': [2, 4, 8, 16, 32], 'n_estimators': [20, 40, 60, 80, 100, 120]}

In [46]:
#Find best parameters
rf_model = RandomForestRegressor()
rf_model_grid = GridSearchCV(rf_model, param_grid, cv = 3, scoring = 'neg_root_mean_squared_error')
rf_model_grid_search = rf_model_grid.fit(X_train, y_train)
rf_model_best_params = rf_model_grid.best_params_
rf_model_best_params

{'max_depth': 16, 'n_estimators': 20}
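
The grid search above can be read more directly through the fitted search object. The sketch below runs the same kind of search on hypothetical synthetic data (`X_demo`, `y_demo`) with a deliberately small grid (`small_grid`), and shows how to read the winning parameters and the cross-validated RMSE. The data, grid values, and `random_state` are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic regression data, not the WHR data.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(60, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 0.1, size=60)

# A deliberately small grid so the sketch runs quickly.
small_grid = {'max_depth': [2, 4], 'n_estimators': [10, 20]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    small_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
)
search.fit(X_demo, y_demo)

# best_params_ holds the winning combination directly.
print(search.best_params_)
# The scorer is negated RMSE (higher is better), so flip the sign to read it as an RMSE.
print(-search.best_score_)
```

Fixing `random_state` on the estimator, as in this sketch, makes the selected parameters repeatable from run to run.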

In [47]:
#Implement RF model with best parameters
rf_model = RandomForestRegressor(max_depth = 16, n_estimators = 20)
rf_model.fit(X_train, y_train)
y_rf_pred = rf_model.predict(X_test)

In [48]:
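# Check performance of the tuned model on the test set.
# RMSE is computed as the square root of MSE; the `squared` argument is deprecated in recent scikit-learn releases.
rf_rmse = np.sqrt(mean_squared_error(y_test, y_rf_pred))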
rf_r2 = r2_score(y_test, y_rf_pred)
print('Root Mean Squared Error: {0}'.format(rf_rmse))
print('R2: {0}'.format(rf_r2)) 

Root Mean Squared Error: 0.7074437029265652
R2: 0.3276911550178654


For this lab, we chose the World Happiness Report data set and predicted log GDP per capita from four features: life ladder, social support, freedom to make life choices, and confidence in national government. The results for each model were:

* Linear regression: RMSE 0.67, R2 0.39
* Gradient boosting regression: RMSE 0.41, R2 0.44
* Random forest regression: RMSE 0.71, R2 0.33
* KNN: RMSE 0.70, R2 0.33

The gradient boosting model performed best, with the lowest RMSE and the highest R2.
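
The comparison in the summary above can be sketched in a few lines. The dictionary `results` below holds the RMSE and R2 values exactly as reported in the summary; the variable names are hypothetical.

```python
# RMSE and R2 values as reported in the summary above.
results = {
    'linear regression': {'rmse': 0.67, 'r2': 0.39},
    'gradient boosting': {'rmse': 0.41, 'r2': 0.44},
    'random forest': {'rmse': 0.71, 'r2': 0.33},
    'KNN': {'rmse': 0.70, 'r2': 0.33},
}

# Lower RMSE is better; higher R2 is better.
best_by_rmse = min(results, key=lambda name: results[name]['rmse'])
best_by_r2 = max(results, key=lambda name: results[name]['r2'])

# Print the models from lowest to highest RMSE.
for name, scores in sorted(results.items(), key=lambda item: item[1]['rmse']):
    print(f"{name:<18} RMSE={scores['rmse']:.2f}  R2={scores['r2']:.2f}")
print(best_by_rmse, best_by_r2)
```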