This notebook walks through the modeling steps that follow once the EDA is complete. I always keep the EDA code separate from the modeling code.


## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime

from sklearn.model_selection import train_test_split
import math

import warnings
pd.options.mode.chained_assignment = None
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Importing the datasets

In [None]:
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

## Preprocessing

1. Save the target variable, then concatenate the train and test sets. For the columns to line up, first drop `casual` and `registered` from the training set (the `count` column is left out by the `iloc` slice). Call the new dataframe `data`.

In [None]:
y = train['count']
datetime_col = test['datetime']

train.drop(['casual', 'registered'],axis = 1, inplace = True)
data = pd.concat([train.iloc[:,:-1],test], axis = 0)

2. Using the `datetime` column, create additional columns: `date`, `hour`, `year`, `weekday`, `month`.


In [None]:
data["date"] = data.datetime.apply(lambda x : str(x).split()[0])
data["hour"] = data.datetime.apply(lambda x : (str(x).split()[1]).split(":")[0]).astype("int")
data["year"] = data.datetime.apply(lambda x : str(x).split()[0].split("-")[0])
data["weekday"] = data.date.apply(lambda dateString : datetime.strptime(dateString,"%Y-%m-%d").weekday())
data["month"] = data.date.apply(lambda dateString : datetime.strptime(dateString,"%Y-%m-%d").month)
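The same columns can also be derived by parsing `datetime` once and using the pandas `.dt` accessor. The sketch below is an alternative, not the code used in this notebook; the three-row `sample` dataframe is made up for illustration and assumes the `YYYY-MM-DD HH:MM:SS` format that the string splitting above relies on.

```python
import pandas as pd

# Hypothetical sample rows in the "YYYY-MM-DD HH:MM:SS" format
sample = pd.DataFrame({"datetime": ["2011-01-01 00:00:00",
                                    "2011-07-04 13:00:00",
                                    "2012-12-19 23:00:00"]})

# Parse the strings once and reuse the resulting datetime Series
ts = pd.to_datetime(sample["datetime"])

sample["date"] = ts.dt.strftime("%Y-%m-%d")
sample["hour"] = ts.dt.hour
sample["year"] = ts.dt.year.astype(str)  # kept as a string, as in the cell above
sample["weekday"] = ts.dt.weekday        # Monday = 0 ... Sunday = 6
sample["month"] = ts.dt.month

print(sample)
```

Parsing once avoids repeated `strptime` calls per row, which can matter on larger datasets.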

3. Coercing To Category Type

    - We must convert the categorical columns to the category type and one-hot encode them so that the model doesn't treat them as ordered numbers. The 1, 2, 3 and 4 in the `season` column are labels: season 4 shouldn't count as numerically larger than season 1. After encoding, each season gets its own indicator column, so no season outranks another.

In [None]:
categoricalFeatureNames = ["season", "holiday", "workingday", "weather", "weekday", "month", "year", "hour"]
numericalFeatureNames = ["temp","humidity","windspeed"]

for var in categoricalFeatureNames:
    data[var] = data[var].astype("category")

data = pd.get_dummies(data, columns = ["season","weather","weekday","month","year","hour"])
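To see what the encoding does, here is a minimal sketch on a made-up dataframe (`toy` and its values are hypothetical, not taken from the dataset). Each `season` label becomes its own indicator column, so the model no longer sees 4 as larger than 1.

```python
import pandas as pd

# Hypothetical toy data: season codes plus one numeric feature
toy = pd.DataFrame({"season": [1, 2, 3, 4, 1],
                    "temp": [9.8, 15.2, 28.7, 13.1, 8.2]})

toy["season"] = toy["season"].astype("category")

# One indicator column per season; the numeric column is left untouched
encoded = pd.get_dummies(toy, columns=["season"])

print(encoded.columns.tolist())
# Every row has exactly one active season indicator
print(encoded.filter(like="season_").sum(axis=1).tolist())
```

Note that `holiday` and `workingday` are coerced to category above but not passed to `get_dummies`; as binary flags they already behave like indicator columns.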

4. Drop unnecessary features

In [None]:
dropFeatures = ["datetime","atemp","windspeed","date"]
data.drop(dropFeatures,axis=1, inplace = True)

5. Separate the data back into train and test sets and take the log of the target variable with `np.log1p`, i.e. log(1 + y).

In [None]:
X = data.iloc[:len(train['count']),:]
test_df = data.iloc[len(train['count']):,:]

y = np.log1p(y)

del data
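A quick sketch of why `np.log1p` pairs with `np.expm1` rather than `np.exp`. The `counts` array is made up for illustration.

```python
import numpy as np

# Hypothetical target values, including a zero
counts = np.array([0.0, 1.0, 16.0, 977.0])

logged = np.log1p(counts)    # log(1 + y); defined at y = 0
restored = np.expm1(logged)  # exp(x) - 1; the exact inverse of log1p

print(np.allclose(restored, counts))
# Plain exp returns y + 1, so it is off by one on the original scale
print(np.exp(logged) - counts)
```

Using `log1p` instead of a plain log keeps zero values finite, since `log1p(0)` is 0.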

6. Split the training data into an 80-20 train/validation partition.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Fitting the Regression model

Here, I have decided to fit a Random Forest model. 

In [None]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()
rfr.fit(X = X_train, y = y_train)
y_pred = rfr.predict(X = X_test)

from sklearn.metrics import mean_absolute_error
#from sklearn.metrics import mean_squared_log_error

# Undo the log1p transform with expm1 so the MAE is in units of `count`
print("MAE is ", mean_absolute_error(np.expm1(y_test), np.expm1(y_pred)))
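The commented-out `mean_squared_log_error` import hints at a log-based error metric. Assuming the metric of interest is RMSLE (an assumption; the notebook does not name it), the sketch below computes it with NumPy alone. The `rmsle` helper and the sample arrays are hypothetical.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original (non-log) scale."""
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return np.sqrt(np.mean(log_diff ** 2))

# Hypothetical true and predicted values on the original scale
y_true_counts = np.array([10.0, 50.0, 200.0])
y_pred_counts = np.array([12.0, 45.0, 180.0])

score = rmsle(y_true_counts, y_pred_counts)

# Because the model is trained on log1p(y), the same number is simply
# the RMSE between the log-scale targets and log-scale predictions.
log_true = np.log1p(y_true_counts)
log_pred = np.log1p(y_pred_counts)
rmse_on_log_scale = np.sqrt(np.mean((log_pred - log_true) ** 2))

print(score, rmse_on_log_scale)
```

In this notebook that would mean the RMSE of `y_pred` against `y_test` (both already on the log scale) gives the RMSLE directly, with no back-transform needed.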

## Make predictions on the test set

Run the predict method on the test dataframe.

In [None]:
y_pred_submission = rfr.predict(test_df)

## Export to CSV

We must also apply the inverse transform, `np.expm1`, to the predicted values, because we trained on `np.log1p` of the target.

In [None]:
submission = pd.DataFrame({"datetime": datetime_col, "count": np.expm1(y_pred_submission)})
submission.to_csv('submissions/rfr_default_params.csv', index=False)

This notebook results in a score of 0.427, which ranks in the top 18%.

Hope this notebook was useful! 