# Linear Regression

## 1. Reading the data 

### A. Sample data: Bike Sharing Demand
- from Kaggle ([Description](https://www.kaggle.com/c/bike-sharing-demand/data))

In [None]:
import pandas as pd

# Read the data
data = pd.read_csv('data/bikeshare.csv')

In [None]:
data.head()

## 2. Data preprocessing and exploration

### A. Extract year, month, and hour from datetime

In [None]:
# Extract year, month, and hour from the datetime column
datetime = pd.DatetimeIndex(data['datetime'])
data['year'] = datetime.year
data['month'] = datetime.month
data['hour'] = datetime.hour
data.head()

### B. Data exploration

In [None]:
# Number of points and variables
data.shape

In [None]:
# data type for each variable 
data.dtypes

In [None]:
# Target variable is 'count'.
# What happens?
data.count

In [None]:
# What happens?
data['count']

In [None]:
# "count" is a method, so it's best to name that column something else
data.rename(columns={'count':'total'}, inplace=True)
data.head()
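
The two lookups above differ because attribute access on a DataFrame resolves to an existing method before it considers a column of the same name. A minimal sketch on a made-up frame (the `toy` name and its values are illustrative assumptions, not rows from the bike-sharing data):

```python
import pandas as pd

# Hypothetical toy frame with a column that shares its name with a DataFrame method
toy = pd.DataFrame({'count': [16, 40, 32], 'temp': [9.84, 9.02, 9.02]})

# Attribute access resolves to the built-in DataFrame.count method, not the column
print(callable(toy.count))

# Bracket access always returns the column as a Series
print(type(toy['count']).__name__)

# After renaming, attribute access reaches the column as expected
toy = toy.rename(columns={'count': 'total'})
print(toy.total.tolist())
```

Bracket access is the safer habit whenever a column name might collide with a pandas attribute.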

In [None]:
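# correlation matrix (values range from -1 to 1)
# numeric_only=True skips non-numeric columns such as the raw 'datetime' strings
data.corr(numeric_only=True)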

In [None]:
# Correlations with the target variable 'total'
data.corr(numeric_only=True).total

### C. Data visualization

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (15,15)
plt.rcParams['font.size'] = 14
# Show histograms
data.hist()

In [None]:
# box plot of rentals, grouped by season
plt.rcParams['figure.figsize'] = (10,6)
plt.rcParams['font.size'] = 12
data.boxplot(column='total', by='season')

In [None]:
# box plot of rentals, grouped by month
plt.rcParams['figure.figsize'] = (10,6)
plt.rcParams['font.size'] = 12
data.boxplot(column='total', by='month')

In [None]:
# box plot of rentals, grouped by year
plt.rcParams['figure.figsize'] = (10,6)
plt.rcParams['font.size'] = 12
data.boxplot(column='total', by='year')

### D. Handling categorical variables
- For `season`, the values 1 through 4 each denote a specific season, so the variable is not truly numerical.
- In such cases, binary dummy variables should be created with 1-of-C coding.

In [None]:
season_dummies = pd.get_dummies(data.season, prefix='season')
season_dummies.sample(n = 5, random_state=1)

- With 4 categories, 4 binary dummy variables are created.
- In practice, however, 3 binary dummy variables are enough to describe every case.
- If variable selection is part of the modeling process, or interpretability is important, all 4 variables can be kept.
- To keep the number of variables from growing, one dummy variable can be dropped.
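
As a minimal sketch of why one dummy column is redundant, the toy example below (the `season` series is made up for illustration) builds both codings and recovers the dropped column from the remaining three. It uses the `drop_first` option of `pd.get_dummies` rather than dropping the column manually.

```python
import pandas as pd

# Hypothetical toy column with four season codes
season = pd.Series([1, 2, 3, 4, 1, 3], name='season')

# Full 1-of-C coding: one column per category
full = pd.get_dummies(season, prefix='season', dtype=int)

# Reduced coding: drop_first=True omits the first category (season_1)
reduced = pd.get_dummies(season, prefix='season', drop_first=True, dtype=int)

# The dropped column is implied by the others:
# season_1 = 1 - (season_2 + season_3 + season_4)
recovered = 1 - reduced.sum(axis=1)
print((recovered == full['season_1']).all())
```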

In [None]:
# drop the first column
season_dummies.drop(season_dummies.columns[0], axis=1, inplace=True)

# print 5 random rows
season_dummies.sample(n=5, random_state=1)

In [None]:
# concatenate the original DataFrame and the dummy DataFrame (axis=0 means rows, axis=1 means columns)
data = pd.concat([data, season_dummies], axis=1)

# print 5 random rows
data.sample(n=5, random_state=1)

### E. Creating derived variables
- **daytime**: a single binary feature (daytime=1 for hours 7 through 20, i.e. from 7am until just before 9pm, and daytime=0 otherwise)

In [None]:
data['daytime'] = ((data.hour > 6) & (data.hour < 21)).astype(int)

In [None]:
data.head(5)

In [None]:
# Check the correlation between 'total' and 'daytime'
data.corr(numeric_only=True).total

## 3. Building a linear regression model 

### A. Simple linear regression
- Choose some input variables by looking at the correlations.

In [None]:
# Show all variable names
list(data.columns.values)

In [None]:
data.corr(numeric_only=True)

In [None]:
data.corr(numeric_only=True).total

- Selected variables: temp, weather, humidity, season_2, season_3, season_4, daytime

In [None]:
selected = ['temp', 'weather', 'humidity', 'season_2', 'season_3', 'season_4', 'daytime']

In [None]:
# create X and Y
X = data[selected]
Y = data.total

# Instantiate the linear regression model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

# Fit the model
linreg.fit(X,Y)

# Print the coefficients
print(linreg.intercept_)
print(linreg.coef_)

### B. Normalize the data
- Rescale every variable so that its values fall between 0 and 1.
- Then examine the coefficient of each variable.
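
Min-max scaling maps each column to $[0, 1]$ via $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$. The sketch below applies that formula by hand to a small made-up matrix (`X_toy` is an illustrative assumption) and checks it against `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature matrix: two columns on very different scales
X_toy = np.array([[10.0, 0.2],
                  [20.0, 0.8],
                  [40.0, 0.5]])

# Min-max scaling by hand: (x - min) / (max - min), computed per column
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_toy)

print(np.allclose(manual, scaled))
```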

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

In [None]:
print(rescaledX)

In [None]:
linreg_2 = LinearRegression()
linreg_2.fit(rescaledX,Y)

print(linreg_2.intercept_)
print(linreg_2.coef_)

In [None]:
# Compare linreg and linreg_2 with respect to training RMSE
Y_pred = linreg.predict(X)
Y_pred_2 = linreg_2.predict(rescaledX)

import numpy as np
from sklearn import metrics

training_RMSE = np.sqrt(metrics.mean_squared_error(Y, Y_pred))
training_RMSE_2 = np.sqrt(metrics.mean_squared_error(Y, Y_pred_2))

In [None]:
print('Original scale - Training_RMSE: ', training_RMSE)
print('Normalized scale - Training_RMSE: ', training_RMSE_2)
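
The two RMSE values should match up to floating-point error: ordinary least squares with an intercept gives the same fitted values under any per-feature linear rescaling, and only the coefficients change. A minimal sketch on synthetic data (all names and numbers below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Hypothetical synthetic data: 200 rows, 3 features on different scales
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.01])
y_toy = X_toy @ np.array([2.0, -0.1, 300.0]) + rng.normal(size=200)

X_scaled = MinMaxScaler().fit_transform(X_toy)

# Fit one model on the raw features and one on the rescaled features
pred_raw = LinearRegression().fit(X_toy, y_toy).predict(X_toy)
pred_scaled = LinearRegression().fit(X_scaled, y_toy).predict(X_scaled)

# Predictions agree up to floating-point error; only the coefficients differ
print(np.allclose(pred_raw, pred_scaled))
```

Scaling therefore helps with comparing coefficient magnitudes, not with the fit quality of a plain linear regression.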

### Note: Performance metrics for regression models

Here are three common evaluation metrics for regression problems:

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

In [None]:
# example true and predicted response values
true = [10, 7, 5, 5]
pred = [8, 6, 5, 10]

In [None]:
print('MAE:', metrics.mean_absolute_error(true, pred))
print('MSE:', metrics.mean_squared_error(true, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(true, pred)))
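
To connect the formulas to the numbers, the same example can be worked by hand with NumPy. This is a sketch that mirrors the values above under different variable names (`y_true_toy`, `y_pred_toy` are illustrative):

```python
import numpy as np

# Same example values as above, under separate names
y_true_toy = np.array([10, 7, 5, 5])
y_pred_toy = np.array([8, 6, 5, 10])

errors = y_true_toy - y_pred_toy     # [2, 1, 0, -5]

mae = np.mean(np.abs(errors))        # (2 + 1 + 0 + 5) / 4 = 2.0
mse = np.mean(errors ** 2)           # (4 + 1 + 0 + 25) / 4 = 7.5
rmse = np.sqrt(mse)                  # sqrt(7.5), about 2.74

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
```

The single error of 5 contributes 25 of the 30 squared-error units, which is why MSE and RMSE penalize large errors more heavily than MAE.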

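### C. Splitting the data and evaluating the model
- Above, the linear regression model was trained without splitting the data into training and test sets.
- Here, we split the data into training and test sets and evaluate the **predictive performance** of the linear regression model.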

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Training set: 70%, Test set: 30%
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)

print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

In [None]:
linreg = LinearRegression()

# Fit the model using training set
linreg.fit(X_train, Y_train)

# Calculate predicted Y using X of test set
Y_pred = linreg.predict(X_test)

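# Calculate RMSE on the training set and on the test set for this model
training_RMSE = np.sqrt(metrics.mean_squared_error(Y_train, linreg.predict(X_train)))
test_RMSE = np.sqrt(metrics.mean_squared_error(Y_test, Y_pred))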

In [None]:
print('Training_RMSE: ', training_RMSE)
print('Test_RMSE: ', test_RMSE)

### D. Comparing models using RMSE on the test set

In [None]:
# define a function that accepts a DataFrame and a list of feature columns and returns the test RMSE
def train_test_rmse(d, feature_cols):
    X = d[feature_cols]
    Y = d.total
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)
    linreg = LinearRegression()
    linreg.fit(X_train, Y_train)
    Y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(Y_test, Y_pred))

In [None]:
# Compare different sets of features
print(train_test_rmse(data, ['temp', 'weather', 'humidity', 'season_2', 'season_3', 'season_4', 'daytime']))
print(train_test_rmse(data, ['temp', 'humidity', 'season_2', 'season_3', 'season_4', 'daytime']))
print(train_test_rmse(data, ['daytime']))

In [None]:
# How about this model?
print(train_test_rmse(data, ['casual', 'registered']))
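
In the Kaggle data, the rental count is the sum of `casual` and `registered`, so these two columns already contain the target and would not be known at prediction time. A minimal sketch with made-up numbers (the `leak` frame is hypothetical) of what a linear model does with such features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data where the target is exactly the sum of two columns
rng = np.random.default_rng(1)
leak = pd.DataFrame({'casual': rng.integers(0, 50, size=100),
                     'registered': rng.integers(0, 300, size=100)})
leak['total'] = leak['casual'] + leak['registered']

model = LinearRegression().fit(leak[['casual', 'registered']], leak['total'])
pred_leak = model.predict(leak[['casual', 'registered']])
rmse_leak = np.sqrt(np.mean((leak['total'] - pred_leak) ** 2))

# Coefficients are approximately [1, 1] and the error is approximately zero
print(model.coef_, rmse_leak)
```

A near-zero RMSE of this kind signals target leakage rather than a good model.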

### Adding more features
- What if we add the hour of day as a feature?

In [None]:
# hour as a categorical feature
hour_dummies = pd.get_dummies(data.hour, prefix='hour')
hour_dummies.drop(hour_dummies.columns[0], axis=1, inplace=True)
data = pd.concat([data, hour_dummies], axis=1)

In [None]:
data.columns.str.startswith('hour_')

In [None]:
data.columns[data.columns.str.startswith('hour_')]

In [None]:
hour_cols = list(data.columns[data.columns.str.startswith('hour_')])

In [None]:
print(train_test_rmse(data, hour_cols))