# Random Forest Regression

In this notebook, we will demonstrate how to build a random forest regression model to predict the birthweight of babies using the [US births 2014](https://www.openintro.org/data/index.php?data=births14) dataset.

`Note`: remember that the first step is always EDA. It is not performed in this notebook, but that does not mean it isn't needed. The EDA was excluded to keep the focus on the ML task.

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

In [2]:
#load the data
df = pd.read_csv('births14.csv')
df.head(10)

Unnamed: 0,fage,mage,mature,weeks,premie,visits,gained,weight,lowbirthweight,sex,habit,marital,whitemom
0,34.0,34,younger mom,37,full term,14.0,28.0,6.96,not low,male,nonsmoker,married,white
1,36.0,31,younger mom,41,full term,12.0,41.0,8.86,not low,female,nonsmoker,married,white
2,37.0,36,mature mom,37,full term,10.0,28.0,7.51,not low,female,nonsmoker,married,not white
3,,16,younger mom,38,full term,,29.0,6.19,not low,male,nonsmoker,not married,white
4,32.0,31,younger mom,36,premie,12.0,48.0,6.75,not low,female,nonsmoker,married,white
5,32.0,26,younger mom,39,full term,14.0,45.0,6.69,not low,female,nonsmoker,married,white
6,37.0,36,mature mom,36,premie,10.0,20.0,6.13,not low,female,nonsmoker,married,white
7,29.0,24,younger mom,40,full term,13.0,65.0,6.74,not low,male,nonsmoker,not married,white
8,30.0,32,younger mom,39,full term,15.0,25.0,8.94,not low,female,nonsmoker,married,white
9,29.0,26,younger mom,39,full term,11.0,22.0,9.12,not low,male,nonsmoker,not married,not white


In [3]:
#remove rows with a missing value in any column
df.dropna(inplace=True)


In [4]:
#one-hot encode the categorical features
df = pd.get_dummies(df, drop_first=True) #drop the first level of every categorical column (e.g. keep habit_smoker, drop habit_nonsmoker)
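
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical three-row sample (the `sample` frame is invented for illustration and is not part of the births data). Numeric columns pass through untouched, while every text column is replaced by indicator columns with its first level removed, which is where a name like `habit_smoker` comes from.

```python
import pandas as pd

# Hypothetical sample that mimics a few columns of the births data
sample = pd.DataFrame({
    'weeks': [37, 41, 38],
    'sex':   ['male', 'female', 'male'],
    'habit': ['nonsmoker', 'smoker', 'nonsmoker'],
})

# Each categorical column is expanded into indicator columns and its
# first level (in sorted order) is dropped: 'female' and 'nonsmoker' here
encoded = pd.get_dummies(sample, drop_first=True)

print(list(encoded.columns))
```

The dropped level becomes the baseline: a row with `habit_smoker` equal to 0 is a nonsmoker.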

In [5]:
X   = df[['fage', 'mage', 'weeks', 'visits', 'gained', 'habit_smoker']] #get the input features
y   = df['weight']              #get the target

X_train, X_test, y_train, y_test = train_test_split(X,              #the input features
                                                    y,              #the label
                                                    test_size=0.3,  #set aside 30% of the data as the test set
                                                    random_state=7 #reproduce the results
                                                   )

In [6]:
rf = RandomForestRegressor(random_state=7)
rf.fit(X_train, y_train)

RandomForestRegressor(random_state=7)

In [7]:
#predict the labels for the test set
y_pred   = rf.predict(X_test)

#print('The predicted birth weight is: {}'.format(y_pred))

In [8]:
# Evaluate the predictions
mse = mean_squared_error(y_test, y_pred)

print('The mse of the model is: {}'.format(mse))

The mse of the model is: 1.3092007265690377
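
The MSE is expressed in squared units of the target, which makes it hard to read directly. A small sketch of converting it to a root mean squared error (RMSE) follows; it reuses the value printed above and assumes, per the dataset description, that `weight` is recorded in pounds.

```python
import math

mse = 1.3092007265690377  # test-set MSE printed above

# The square root brings the error back to the units of the target
rmse = math.sqrt(mse)

print('The rmse of the model is: {:.3f}'.format(rmse))
```

Under that assumption, the predictions are off by roughly 1.14 pounds in the root-mean-square sense.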


The random forest regression has a lower test MSE than the decision tree model. This result comes from the default hyperparameters, so we should still tune them to identify the best model.
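
One possible way to do that tuning is a cross-validated grid search. The sketch below is illustrative only: it runs on synthetic data from `make_regression` so that it is self-contained (swap in the `X_train`, `X_test`, `y_train` and `y_test` built earlier to use the births data), and the grid values are assumptions rather than recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the births data (assumption: 6 numeric features)
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7
)

# Illustrative grid; the candidate values are assumptions
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5],
    'min_samples_leaf': [1, 5],
}

# 3-fold cross-validation on the training set, scored by negative MSE
search = GridSearchCV(
    RandomForestRegressor(random_state=7),
    param_grid,
    scoring='neg_mean_squared_error',
    cv=3,
)
search.fit(X_train, y_train)

# The search refits the best combination on the full training set,
# so it can be evaluated on the held-out test set directly
tuned_mse = mean_squared_error(y_test, search.predict(X_test))

print('Best parameters: {}'.format(search.best_params_))
print('The mse of the tuned model is: {}'.format(tuned_mse))
```

Keeping the test set out of the search means the final MSE remains an honest estimate of performance on unseen data.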