## Random Forest for PlaceToPlug

**The following kernel is an implementation of Random Forest using sklearn. Its score for this competition is low (0.270), but it is a good example for those who want to learn a little more about how to create random forests with sklearn.**

In [0]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
from sklearn import metrics
from sklearn.tree import _tree


In [0]:
import dateutil.parser
from google.colab import files

In [0]:
#!pip install graphviz
#!import graphviz
pd.options.display.max_rows = 20 # don't display many rows

## Predefined functions

### plugStat(status)-> value
Lookup function that returns the integer code assigned to each possible plug status (`-1` for an unknown status).




In [0]:
# Plug status as a query table
def plugStat(s):
  return {
      'AVAILABLE': 0,
      'FINISHING': 1,
      'CHARGING': 2,
      'PREPARING': 3,
      'UNAVAILABLE': 4
  }.get(s, -1)

### pntime(timestamp) -> value
Function that returns the number of minutes elapsed since a fixed time reference (the epoch value `1498898468` seconds, i.e. 2017-07-01 08:41:08 UTC).

In [0]:
def pntime(s):
  return ((dateutil.parser.parse(s).timestamp())-1498898468)/60
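As a quick check of how the two helpers behave, here is a minimal sketch. The ISO-style timestamp strings are an assumption for illustration, since the raw CSV format is not shown in this notebook; the explicit `+00:00` offset keeps the result independent of the local time zone.

```python
import dateutil.parser

def plugStat(s):
    # Same lookup table as above; unknown statuses map to -1
    return {'AVAILABLE': 0, 'FINISHING': 1, 'CHARGING': 2,
            'PREPARING': 3, 'UNAVAILABLE': 4}.get(s, -1)

def pntime(s):
    # Minutes elapsed since the reference epoch 1498898468 (seconds)
    return (dateutil.parser.parse(s).timestamp() - 1498898468) / 60

print(plugStat('CHARGING'))                   # 2
print(plugStat('BROKEN'))                     # -1 (hypothetical unknown status)
print(pntime('2017-07-01T08:41:08+00:00'))    # 0.0
print(pntime('2017-07-01T09:41:08+00:00'))    # 60.0
```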

## Load data from CSV ##

Upload files from source

In [5]:
uploaded = files.upload()     # To upload files to colab storage


Saving PuntsRecarregaV2_test.csv to PuntsRecarregaV2_test.csv
Saving PuntsRecarregaV2_train.csv to PuntsRecarregaV2_train.csv


In [0]:
#!mv P*.* datalab/
#!rm datalab/t*.*  # DO NOT UNCOMMENT UNLESS YOU WANT TO ERASE THE DATASET

In [7]:
!ls -lah datalab/  # TO SEE IF DATASET CSV'S FILES EXISTS

total 1.1M
drwxr-xr-x 1 root root 4.0K Jul 30 14:11 .
drwxr-xr-x 1 root root 4.0K Jul 30 14:11 ..
drwxr-xr-x 4 root root 4.0K Jul 26 16:56 .config
-rw-r--r-- 1 root root 109K Jul 30 14:11 PuntsRecarregaV2_test.csv
-rw-r--r-- 1 root root 974K Jul 30 14:11 PuntsRecarregaV2_train.csv


In [0]:
#!cat datalab/PuntsRecarregaV2_test.csv

In [0]:
#!cat datalab/PuntsRecarregaV2_train.csv

Load the data from the CSV files into `train_data` and `test`.

In [0]:
train_data = pd.read_csv('datalab/PuntsRecarregaV2_train.csv')  #    , parse_dates=['timestamp'])
#type(train_data.zoneId), type(train_data.stationId), type(train_data.serviceId), type(train_data.plugId), type(train_data.timestamp), type(train_data.status)

Accessing DataFrame cells by row and column (`iat` is position-based, `at` is label-based):

----

```
In [17]: dtfrm_var.iat[0,0]
In [18]: dtfrm_var.at[0,'status']

```
---
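A small self-contained illustration of the two accessors follows. The frame `df_demo` and its values are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical two-row frame, for illustration only
df_demo = pd.DataFrame({'timestamp': [0.5, 10.6],
                        'status': ['CHARGING', 'AVAILABLE']})

first_cell = df_demo.iat[0, 0]           # by integer position: row 0, column 0
first_status = df_demo.at[0, 'status']   # by label: index 0, column 'status'

print(first_cell, first_status)
```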


In [0]:
#train_data.at[0,'status'], plugStat(train_data.at[0,'status']) # To see how the status transformation works

In [0]:
#train_data.at[0,'timestamp'], pntime(train_data.at[0,'timestamp'])  # To see how the timestamp transformation works

In [20]:
train_data.head()

Unnamed: 0,zoneId,stationId,serviceId,plugId,timestamp,status
0,4,4,1,1,0.0166667,3
1,4,4,1,1,0.433333,2
2,4,1,1,1,10.6,1
3,4,4,1,1,10.8167,0
4,4,4,1,1,18.2,3


## Transform the timestamp and status fields with the helper functions `pntime` and `plugStat`


In [19]:
count = 0
#train_data['dytime']=train_data['timestamp'].values
for indexrow,row in train_data.iterrows():
  train_data.at[indexrow,'timestamp'] = pntime(train_data.at[indexrow,'timestamp'])
  train_data.at[indexrow,'status'] = plugStat(train_data.at[indexrow,'status'])
  count += 1
print('Transform: performed',count,'substitutions on training dataset.')
  

Transform: performed 25063 substitutions on training dataset.
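The row-by-row loop above can also be written with column-wise operations. The following is a sketch under assumptions: the sample rows and their timestamp format are invented, and `STATUS_CODES` is simply the dictionary from `plugStat` pulled out into a constant. It is an alternative, not the method used in this notebook.

```python
import dateutil.parser
import pandas as pd

STATUS_CODES = {'AVAILABLE': 0, 'FINISHING': 1, 'CHARGING': 2,
                'PREPARING': 3, 'UNAVAILABLE': 4}

def pntime(s):
    # Minutes elapsed since the reference epoch used in this notebook
    return (dateutil.parser.parse(s).timestamp() - 1498898468) / 60

# Hypothetical raw rows; the real CSV layout may differ
raw = pd.DataFrame({
    'timestamp': ['2017-07-01T08:41:08+00:00', '2017-07-01T08:51:08+00:00'],
    'status': ['PREPARING', 'SOMETHING_ELSE'],
})

# Series.map applies the function (or dict lookup) to the whole column at once
raw['timestamp'] = raw['timestamp'].map(pntime)
# Statuses missing from the dict become NaN, so fill them with -1 like plugStat does
raw['status'] = raw['status'].map(STATUS_CODES).fillna(-1).astype(int)

print(raw)
print(raw.dtypes)
```

Written this way, both columns end up with numeric dtypes. Assigning cell by cell into columns that were read as strings can leave them with the `object` dtype, which would be consistent with `timestamp` and `status` not appearing in the `describe()` output further down.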


In [0]:
#train_data['dytime']


In [0]:
#train_data['duration']

In [0]:
test = pd.read_csv('datalab/PuntsRecarregaV2_test.csv')
#type(test.zoneId), type(test.stationId), type(test.serviceId), type(test.plugId), type(test.timestamp), type(test.status)

In [24]:
test.head()

Unnamed: 0,zoneId,stationId,serviceId,plugId,timestamp,status
0,4,4,1,1,507613,3
1,4,4,1,1,507614,2
2,4,4,2,1,507615,3
3,4,4,2,1,507615,2
4,4,1,1,1,507644,1


### Transform the timestamp and status fields with the helper functions `pntime` and `plugStat`


Apply the same transformation to the test dataset.

In [23]:
count = 0
for indexrow,row in test.iterrows():
  test.at[indexrow,'timestamp'] = pntime(test.at[indexrow,'timestamp'])
  test.at[indexrow,'status'] = plugStat(test.at[indexrow,'status'])
  count += 1
print('Transform: performed',count,'substitutions on test dataset.')

Transform: performed 2785 substitutions on test dataset.


**First, let's take a look at the data and the different features.**

In [27]:
train_data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
zoneId,25063.0,3.855364,1.028341,1.0,4.0,4.0,4.0,5.0
stationId,25063.0,3.16686,1.514648,1.0,1.0,4.0,4.0,5.0
serviceId,25063.0,1.493995,0.499974,1.0,1.0,1.0,2.0,2.0
plugId,25063.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0


In [28]:
test.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
zoneId,2785.0,3.778097,0.844791,1.0,4.0,4.0,4.0,5.0
stationId,2785.0,3.118851,1.395505,1.0,1.0,4.0,4.0,5.0
serviceId,2785.0,1.497307,0.500083,1.0,1.0,1.0,2.0,2.0
plugId,2785.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0


**The next step is to separate the feature matrix from the response.**

In [0]:
X, y = train_data.drop('status', axis=1), train_data['status']  # For training dataset

In [0]:
#X    # Uncomment to see data of attributes without the label(prediction values)

In [0]:
#y    # Uncomment to see data of the label(prediction values)

**Now we are ready to apply Random Forest.**

In [0]:
regr = RandomForestRegressor()

In [33]:
regr.fit(X, y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

**To evaluate our model, we are calculating the $R^2$**

In [34]:
R2_Xy=regr.score(X, y)        # R2_Xy value of evaluation for train dataset
print(f"R^2 : {R2_Xy}")

R^2 : 0.7587221372963804
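For reference, the value returned by `score` for a regressor is the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below computes it by hand on made-up numbers and compares it with `sklearn.metrics.r2_score`; the arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_hat = np.array([0.5, 1.0, 2.0, 3.0, 3.5])

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))

# Always predicting the mean gives R^2 = 0; anything worse than that is negative
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))
print(r2_mean)
```

This is why a negative $R^2$ is possible: it means the predictions are worse than always predicting the mean of the targets.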


**Split X and y into train and test subsets and calculate the $R^2$ for each one.**

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [0]:
#regr = RandomForestRegressor()

In [36]:
regr.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [37]:
R2_XyT=regr.score(X_train, y_train)       # R2_XyT value of evaluation for train_train subdataset
print(f"R^2 : {R2_XyT}")

R^2 : 0.7367729037534378


In [41]:
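regr.fit(X_test, y_test)  # NOTE: this refits the model on the held-out split, so the scores below no longer measure generalization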

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [42]:
R2_Xyt=regr.score(X_test, y_test)       # R2_Xyt value of evaluation for train_test subdataset
print(f"R^2 : {R2_Xyt}")

R^2 : 0.7385331577083227


In [43]:
print(f"R^2 of train : {regr.score(X_train, y_train)}")
print(f"R^2 of test : {regr.score(X_test, y_test)}")

R^2 of train : -0.5382256143993589
R^2 of test : 0.7385331577083227


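**The large gap between the two $R^2$ values indicates overfitting: the model scores well only on the split it was last fitted on (here `X_test`, because of the `regr.fit(X_test, y_test)` call above) and poorly on the other split.**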
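For comparison, a hold-out evaluation normally fits on the training split only and just scores the held-out split. Here is a minimal sketch on synthetic data; the arrays `X_demo` and `y_demo`, the seed, and the hyperparameters are assumptions for illustration, not the competition data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem, for illustration only
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=300)

# random_state makes the split reproducible between runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)                 # fit on the training split only

r2_train = model.score(X_tr, y_tr)    # how well the model fits what it has seen
r2_test = model.score(X_te, y_te)     # held-out split: scored, never fitted

print(f"R^2 of train: {r2_train}")
print(f"R^2 of test : {r2_test}")
```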

------
Now let's increase the number of trees (`n_estimators`) in the Random Forest to try to improve the $R^2$.

----

Increase the number of trees to 40

In [0]:
regr = RandomForestRegressor(n_estimators=40)

In [45]:
regr.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Results after increasing the number of trees to 40


In [46]:
print(f"R^2 of train: {regr.score(X_train, y_train)}")
print(f"R^2 of test: {regr.score(X_test, y_test)}")

R^2 of train: 0.7799506131426688
R^2 of test: -0.4861209591498361


Increase the number of trees to 100

In [0]:
regr = RandomForestRegressor(n_estimators=100)

In [48]:
regr.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

Results after increasing the number of trees to 100

In [49]:
print(f"R-squared of train: {regr.score(X_train, y_train)}")
print(f"R-squared of test: {regr.score(X_test, y_test)}")

R-squared of train: 0.7890158371435133
R-squared of test: -0.47707462221862285


As expected, with more estimators the two $R^2$ values move closer, but only slightly (the test $R^2$ goes from -0.486 to -0.477), so a large gap remains.
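The three experiments above can be condensed into one loop over `n_estimators`. This is a sketch on synthetic data; `X_demo`, `y_demo` and the fixed seeds are assumptions, so the numbers it prints are not comparable to the competition results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

results = {}
for n in (10, 40, 100):
    # A fresh forest per setting, always fitted on the training split only
    forest = RandomForestRegressor(n_estimators=n, random_state=0)
    forest.fit(X_tr, y_tr)
    results[n] = (forest.score(X_tr, y_tr), forest.score(X_te, y_te))

for n, (r2_train, r2_test) in results.items():
    print(f"n_estimators={n:3d}  train R^2={r2_train:.3f}  test R^2={r2_test:.3f}")
```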

## Gradient Boosting applied ##
---
 `GradientBoostingRegressor` successively fits regression trees to the residuals of the previous stage. If the tree in stage i predicts a value larger than the target for a particular training example, the residual of stage i for that example is negative, so the regression tree at stage i+1 faces negative target values (the residuals from stage i). Because the boosting algorithm adds up all these trees to make the final prediction, the model can produce negative predictions even when all target values in the training set are positive, and this is reported to happen more often as the number of trees increases.
 
---
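The residual-fitting idea described above can be sketched by hand with plain decision trees. This is a simplified illustration for squared-error loss, not the internals of `GradientBoostingRegressor`; the data, the learning rate and the number of stages are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with non-negative targets, for illustration only
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.abs(np.sin(X_demo[:, 0]))

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())   # stage 0: constant prediction
mse_history = [np.mean((y_demo - prediction) ** 2)]

for stage in range(50):
    # Residuals are negative wherever the current prediction overshoots,
    # even though every target value is >= 0
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)                    # each tree learns the residuals
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
print(f"smallest residual at stage 0: {(y_demo - y_demo.mean()).min():.4f}")
```

Since the final prediction is a sum of trees trained on signed residuals, nothing constrains it to stay within the range of the original targets.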

In [0]:
gbrt=GradientBoostingRegressor(n_estimators=100)

In [51]:
gbrt.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False)

In [52]:
y_pred=gbrt.predict(X_test)
y_pred

array([1.66745355, 1.69700026, 1.69940024, ..., 1.85680686, 1.71666959,
       1.69209172])

In [0]:
index = test['timestamp']

In [48]:
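# NOTE: y_pred holds predictions for X_test (30% of the training data), while
# index comes from the competition test set (2785 rows); the length mismatch
# raises the ValueError shown below.
df = pd.DataFrame(y_pred, index=index)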
df

ValueError: ignored

In [0]:
df.columns = ['target']

In [0]:
df.to_csv("submit_me.csv")
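To build a submission whose length matches the index, the predictions have to come from the features of the test frame itself. Below is a minimal sketch under assumptions: `train_demo` and `test_demo` are synthetic stand-ins for `train_data` and `test`, and the column layout is simplified. Only the `target` column name and the `timestamp` index follow the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)

def make_frame(n):
    # Hypothetical stand-in for the competition data
    return pd.DataFrame({
        'stationId': rng.randint(1, 6, size=n),
        'timestamp': np.sort(rng.uniform(0, 1000, size=n)),
        'status': rng.randint(0, 5, size=n),
    })

train_demo, test_demo = make_frame(200), make_frame(30)
features = [c for c in train_demo.columns if c != 'status']

model = GradientBoostingRegressor(n_estimators=20, random_state=0)
model.fit(train_demo[features], train_demo['status'])

# One prediction per row of the test frame, so lengths match the index
pred = model.predict(test_demo[features])
submission = pd.DataFrame(
    {'target': pred},
    index=pd.Index(test_demo['timestamp'], name='timestamp'))

csv_text = submission.to_csv()   # pass a path such as "submit_me.csv" to write a file
print(csv_text.splitlines()[0])
```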

As the results above show, the $R^2$ values of the train and test splits do not converge. Convergence is the goal here, because it would indicate that the model is not overfitting.
A potential improvement is feature engineering: find the features that influence the predictions the most and then apply Random Forests again.
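One common starting point for finding influential features is the `feature_importances_` attribute of a fitted forest. The sketch below uses synthetic data in which only the hypothetical `signal` column drives the target; the column names and values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on 'signal' only, for illustration
rng = np.random.RandomState(0)
X_demo = pd.DataFrame({
    'signal': rng.uniform(0, 10, size=300),
    'noise': rng.uniform(0, 10, size=300),
})
y_demo = 3.0 * X_demo['signal'] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, summing to 1
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Features with very low importance are candidates for removal, and the ranking can suggest which columns are worth engineering further.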