# Bike Sharing Demand Step 3

---
### Analysis summary and modeling strategy
#### 1) Analysis summary
1. Transform target values
- before training : 'count' > log(count)
- after training : log(count) > 'count'
2. Add derived features
- 'year', 'month', 'day', 'hour', 'minute', 'second'
- 'weekday'
3. Remove features
- 'casual', 'registered'
- 'datetime'
- 'date', 'month'
- 'day', 'minute', 'second'
- 'windspeed'
4. Delete outliers
- Data whose 'weather' value is 4

#### 2) Modeling strategy
- Baseline model : LinearRegression
- Performance improvement : Ridge, Lasso, RandomForest
    + Feature engineering : Same apply for each model
    + Hyperparameter optimization : GridSearch
- Etc
    + Target value is log(count) not count
--- 

## 3. Baseline model
- When training ML model, features data type should be int/float
- Train
    + Finding the optimal regression coefficient when given the independent variables(features) and target values
- Predict
    + Estimating a target value when new independent variables are given to the trained model  (which has the optimal regression coefficients)  

### 3.1. Import data

In [1]:
# Import data
import pandas as pd

data_path = '../../Datasets/bike_sharing_demand/'

train = pd.read_csv(data_path + 'train.csv')
test = pd.read_csv(data_path + 'test.csv')
submission = pd.read_csv(data_path + 'sampleSubmission.csv')

### 3.2. Feature engineering

In [2]:
# Delete outliers of train data
train = train[train['weather'] != 4]

In [3]:
# Merge train data and test data
all_data = pd.concat([train, test], ignore_index=True)
all_data

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0000,3.0,13.0,16.0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0000,8.0,32.0,40.0
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0000,5.0,27.0,32.0
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0000,3.0,10.0,13.0
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0000,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
17373,2012-12-31 19:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,
17374,2012-12-31 20:00:00,1,0,1,2,10.66,12.880,60,11.0014,,,
17375,2012-12-31 21:00:00,1,0,1,1,10.66,12.880,60,11.0014,,,
17376,2012-12-31 22:00:00,1,0,1,1,10.66,13.635,56,8.9981,,,


In [4]:
# Add derived features
from datetime import datetime

all_data['date'] = all_data['datetime'].apply(lambda x: x.split()[0])
all_data['year'] = all_data['datetime'].apply(lambda x: x.split()[0].split('-')[0])
all_data['month'] = all_data['datetime'].apply(lambda x: x.split()[0].split('-')[1])
all_data['hour'] = all_data['datetime'].apply(lambda x: x.split()[0].split('-')[2])
all_data['weekday'] = all_data['date'].apply(lambda dateStr: datetime.strptime(dateStr, "%Y-%m-%d").weekday())

In [5]:
# Remove unnecessary features
drop_features = ['casual', 'registered', 'datetime', 'date', 'month', 'windspeed']

all_data = all_data.drop(drop_features, axis=1)

In [6]:
# Split data into train data and test data
X_train = all_data[~pd.isnull(all_data['count'])]
X_test = all_data[pd.isnull(all_data['count'])]

# Remove target values from train and test data
X_train = X_train.drop(['count'], axis=1)
X_test = X_test.drop(['count'], axis=1)

# Target values
y = train['count']

In [7]:
X_train.head()

Unnamed: 0,season,holiday,workingday,weather,temp,atemp,humidity,year,hour,weekday
0,1,0,0,1,9.84,14.395,81,2011,1,5
1,1,0,0,1,9.02,13.635,80,2011,1,5
2,1,0,0,1,9.02,13.635,80,2011,1,5
3,1,0,0,1,9.84,14.395,75,2011,1,5
4,1,0,0,1,9.84,14.395,75,2011,1,5


### 3.3. Make evaluation index calculation function

In [8]:
import numpy as np

def rmsle(y_true, y_pred, convertExp=True):
    # Exponential transformation
    if convertExp:
        y_true = np.exp(y_true)
        y_pred = np.exp(y_pred)
        
    # Log transformation and convert missing values to zero    
    log_true = np.nan_to_num(np.log(y_true+1))
    log_pred = np.nan_to_num(np.log(y_pred+1))
    
    # Calculate RMSLE
    output = np.sqrt(np.mean((log_true - log_pred)**2))
    return output

### 3.4. Train model

In [9]:
# Import model
from sklearn.linear_model import LinearRegression
linear_reg_model = LinearRegression()

# Log transformation of target values
log_y = np.log(y)

# Train model
linear_reg_model.fit(X_train, log_y)

LinearRegression()

### 3.5. Validate performance

In [10]:
# Predict with train data
preds = linear_reg_model.predict(X_train)

# RMSLE value of baseline model
print(f'선형 회귀의 RMSLE 값 : {rmsle(log_y, preds, True):.4f}')

선형 회귀의 RMSLE 값 : 1.2056


### 3.6. Submit

In [11]:
# Predict with test data
linearreg_preds = linear_reg_model.predict(X_test)

# Exponential transformation
submission['count'] = np.exp(linearreg_preds)

# Save submission file
submission.to_csv('submission.csv', index=False)

References
===
- [EDA reference](https://www.kaggle.com/viveksrinivasan/eda-ensemble-model-top-10-percentile)
- [Modeling reference](https://www.kaggle.com/viveksrinivasan/eda-ensemble-model-top-10-percentile)
- 머신러닝.딥러닝 문제해결 전략(신백균)