# Random Forest Implementation

This kernel implements a Random Forest using sklearn. Its score for this competition is low (0.270), but it is a useful example for anyone who wants to learn a little more about building random forests with sklearn.

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
from sklearn import metrics
from sklearn.tree import _tree
import graphviz 

pd.options.display.max_rows = 20 # don't display many rows

In [None]:
train_data = pd.read_csv('../input/testtraincsv/train.csv')
test = pd.read_csv('../input/testtraincsv/test.csv')

First, let's take a look at the data and its features.

In [None]:
train_data.describe().transpose()

The next step is to separate the feature matrix from the response.

In [None]:
X, y = train_data.drop('target', axis=1), train_data['target']

Now we are ready to apply Random Forest.

In [None]:
regr = RandomForestRegressor()
regr.fit(X, y)

To evaluate the model, we calculate the $R^2$ on the same data it was fitted on.

In [None]:
print(f"R^2 : {regr.score(X, y)}")
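For reference, the $R^2$ returned by `score` is $1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand and compares it with sklearn's result. It uses a synthetic dataset from `make_regression` as a stand-in for the competition data, so the names `X_demo`, `y_demo` and `model` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in for the competition data (assumption, not the real CSV).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"manual R^2  : {r2_manual:.4f}")
print(f"model.score : {model.score(X_demo, y_demo):.4f}")
print(f"r2_score    : {r2_score(y_demo, pred):.4f}")
```

All three numbers should agree, since `score` on a regressor and `r2_score` both compute this quantity.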

Next, we split X and y into train and test sets and calculate the $R^2$ for each one.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

regr = RandomForestRegressor()
regr.fit(X_train, y_train)
print(f"R^2 of train : {regr.score(X_train, y_train)}")
print(f"R^2 of test : {regr.score(X_test, y_test)}")

The difference between the train and test $R^2$ implies overfitting.
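A second way to look at the same gap, without touching the held-out set, is the out-of-bag (OOB) estimate that sklearn can compute when `oob_score=True`. The sketch below is only an illustration on synthetic data. The dataset and the names `X_tr`, `X_te`, `rf` are assumptions and are not part of the competition pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data (assumption).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=20.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# oob_score=True scores each sample using only the trees that did not
# see it in their bootstrap sample.
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

train_r2 = rf.score(X_tr, y_tr)
test_r2 = rf.score(X_te, y_te)
oob_r2 = rf.oob_score_

print(f"R^2 of train : {train_r2:.3f}")
print(f"R^2 of OOB   : {oob_r2:.3f}")
print(f"R^2 of test  : {test_r2:.3f}")
```

If the OOB and test values sit well below the train value, that points to the same overfitting the train/test split reveals.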

Now let's increase the number of trees (`n_estimators`) in the Random Forest to try to improve the test $R^2$.

In [None]:
regr = RandomForestRegressor(n_estimators=40)
regr.fit(X_train, y_train)
print(f"R^2 of train: {regr.score(X_train, y_train)}")
print(f"R^2 of test: {regr.score(X_test, y_test)}")

In [None]:
regr = RandomForestRegressor(n_estimators=100)
regr.fit(X_train, y_train)
print(f"R-squared of train: {regr.score(X_train, y_train)}")
print(f"R-squared of test: {regr.score(X_test, y_test)}")

As expected, the larger the number of estimators, the closer the train and test $R^2$ values get.
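To compare forest sizes on equal footing, it can help to fix `random_state` for both the split and the model, so that the number of trees is the only thing that changes between runs. The loop below is a minimal sketch of that idea on synthetic data. The dataset and the `results` dictionary are assumptions for illustration, and the numbers it prints say nothing about the competition data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data (assumption).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=20.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

results = {}
for n in (10, 40, 100):
    rf = RandomForestRegressor(n_estimators=n, random_state=0)
    rf.fit(X_tr, y_tr)
    # Store (train R^2, test R^2) for each forest size.
    results[n] = (rf.score(X_tr, y_tr), rf.score(X_te, y_te))

for n, (train_r2, test_r2) in results.items():
    gap = train_r2 - test_r2
    print(f"n_estimators={n:>3}  train={train_r2:.3f}  test={test_r2:.3f}  gap={gap:.3f}")
```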

In [None]:
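from sklearn.ensemble import GradientBoostingRegressor

# Note: the submission uses gradient boosting rather than the random forest above.
# Fit on the full training data, then predict the competition test set.
gbrt = GradientBoostingRegressor(n_estimators=100)
gbrt.fit(X, y)
y_pred = gbrt.predict(test)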

In [None]:
y_pred

In [None]:
index = test['id']
df = pd.DataFrame(y_pred, index=index)
df.columns = ['target']

In [None]:
df

In [None]:
df.to_csv("submit_me.csv")

As the results above show, the train and test $R^2$ do not converge. Convergence is our goal here, since it would indicate that we are not overfitting.
A potential improvement is feature engineering: find the features that influence our predictions the most and then apply Random Forest again.
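One possible starting point for finding influential features is the `feature_importances_` attribute of a fitted forest. The sketch below ranks features on a synthetic dataset. The column names (`f0`, `f1`, ...) and the choice to keep the top three are hypothetical, and impurity-based importances are only one of several ways to rank features.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the competition data (assumption).
X_arr, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=10.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_demo, y_demo)

# Impurity-based importances; they are normalized to sum to 1.
importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Refit using only the top-ranked features (3 is an arbitrary choice).
top = importances.index[:3].tolist()
rf_top = RandomForestRegressor(n_estimators=100, random_state=0)
rf_top.fit(X_demo[top], y_demo)
print(f"Selected features: {top}")
```

Whether the reduced model generalizes better would still need to be checked on a held-out split, as in the cells above.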