# Bike Demand Prediction Model
### Assignment Solution (IIT Madras – Kaatru)

## Goal
Develop a model that identifies which variables are significant predictors of demand for shared bikes.

## Key Improvements in this Version
- **Data Preprocessing**: Handling categorical variables using One-Hot Encoding.
- **Assumption Checking**: Checking Multicollinearity using VIF.
- **Validation**: Using Train-Test Split to evaluate performance on unseen data.
- **Residual Analysis**: Verifying the assumptions of Linear Regression.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')

## 1. Load and Inspect Data

In [None]:
df = pd.read_csv('day.csv')
df.head()

In [None]:
df.info()

## 2. Data Cleaning & Feature Engineering
We drop variables that are not useful for prediction or cause leakage:
- `instant`: Index column.
- `dteday`: Date is redundant as we have year, month, etc.
- `casual`, `registered`: Target leakage (cnt = casual + registered).

In [None]:
df_model = df.drop(['instant', 'dteday', 'casual', 'registered'], axis=1)

### Categorical Encoding
Variables such as `season`, `weathersit`, `mnth`, and `weekday` are categorical but stored as integer codes. We one-hot encode them so the model does not treat the codes as ordered quantities (e.g., it should not assume that season 4 is "greater than" season 1).

In [None]:
# Define categorical columns
cat_cols = ['season', 'weathersit', 'mnth', 'weekday']

# Create dummy variables (drop_first=True to avoid dummy variable trap)
df_model = pd.get_dummies(df_model, columns=cat_cols, drop_first=True, dtype=int)
df_model.head()
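
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from `day.csv`; only the column name `season` mirrors the dataset.

```python
import pandas as pd

# Toy data (hypothetical values); `season` is an integer code, as in day.csv
toy = pd.DataFrame({"season": [1, 2, 3, 4, 1], "cnt": [10, 20, 30, 40, 15]})

# drop_first=True removes the dummy for the first category (season 1),
# which then acts as the baseline the other seasons are compared against
encoded = pd.get_dummies(toy, columns=["season"], drop_first=True, dtype=int)

print(encoded.columns.tolist())  # ['cnt', 'season_2', 'season_3', 'season_4']
print(encoded)
```

Rows with `season == 1` have zeros in all three dummy columns, so the regression coefficient of each remaining dummy is read as a difference relative to season 1.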

## 3. Multicollinearity Check (VIF)
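High correlation among the independent variables inflates the standard errors of their coefficients, which makes the p-values and the interpretation of individual coefficients unreliable. We check for this using the Variance Inflation Factor (VIF).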

In [None]:
numeric_cols = ['temp', 'atemp', 'hum', 'windspeed']
X_numeric = df[numeric_cols]
X_numeric = sm.add_constant(X_numeric)

vif_data = pd.DataFrame()
vif_data["feature"] = X_numeric.columns
vif_data["VIF"] = [variance_inflation_factor(X_numeric.values, i) for i in range(len(X_numeric.columns))]
vif_data

**Observation**: `temp` and `atemp` have extremely high VIF (~63), indicating that they carry nearly the same information. We drop `atemp`.
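
For intuition, the VIF of feature $j$ is $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features plus an intercept. The sketch below computes this by hand with NumPy on synthetic data. The helper `manual_vif` and the simulated columns are illustrative assumptions, not part of the assignment data or of statsmodels.

```python
import numpy as np

def manual_vif(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()                # R^2 of the auxiliary regression
    return 1.0 / (1.0 - r2)

# Synthetic stand-ins (assumed data): `atemp` is nearly a copy of `temp`
rng = np.random.default_rng(0)
temp = rng.normal(size=500)
atemp = temp + rng.normal(scale=0.1, size=500)
hum = rng.normal(size=500)
X = np.column_stack([temp, atemp, hum])

vifs = [manual_vif(X, j) for j in range(X.shape[1])]
for name, v in zip(["temp", "atemp", "hum"], vifs):
    print(f"{name}: {v:.1f}")
```

The two near-duplicate columns get a large VIF, while the independent column stays close to 1. A VIF above roughly 5 to 10 is a common rule of thumb for problematic collinearity, though the threshold is a convention rather than a hard rule.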

In [None]:
df_model = df_model.drop(['atemp'], axis=1)

## 4. Train-Test Split
Splitting data into 70% Training and 30% Testing.

In [None]:
X = df_model.drop('cnt', axis=1)
y = df_model['cnt']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

print(f"Train size: {X_train.shape}")
print(f"Test size: {X_test.shape}")

## 5. Model Building (OLS Regression)

In [None]:
X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)

lr_model = sm.OLS(y_train, X_train_sm).fit()
print(lr_model.summary())

## 6. Residual Analysis
We validate two assumptions of linear regression using the training residuals:
1. **Normality of Residuals**: Distribution of error terms should be normal.
2. **Homoscedasticity**: No pattern in residuals vs fitted values.

In [None]:
y_train_pred = lr_model.predict(X_train_sm)
residuals = y_train - y_train_pred

plt.figure(figsize=(10,5))
sns.histplot(residuals, kde=True)
plt.title('Distribution of Residuals (Normality Check)')
plt.xlabel('Residuals')
plt.show()

In [None]:
plt.figure(figsize=(10,5))
plt.scatter(y_train_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residuals vs Fitted Values (Homoscedasticity Check)')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()
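
The plots can be complemented with rough numeric checks. The sketch below is one possible approach on synthetic residuals, not the notebook's own method: it computes the Jarque-Bera statistic by hand (large values suggest non-normal residuals) and the correlation between fitted values and squared residuals (a clearly positive value suggests the spread grows with the prediction). The helper `jarque_bera_stat` and all simulated arrays are assumptions made for illustration.

```python
import numpy as np

def jarque_bera_stat(resid):
    """Jarque-Bera statistic: n/6 * (skew^2 + (kurtosis - 3)^2 / 4)."""
    r = resid - resid.mean()
    n = r.size
    s2 = np.mean(r ** 2)
    skew = np.mean(r ** 3) / s2 ** 1.5
    kurt = np.mean(r ** 4) / s2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(1)
fitted = rng.uniform(1000, 8000, size=2000)            # hypothetical fitted values

good_resid = rng.normal(scale=500, size=2000)          # normal, constant spread
skewed_resid = rng.exponential(size=2000)              # clearly non-normal
hetero_resid = rng.normal(size=2000) * fitted * 0.1    # spread grows with fitted

jb_normal = jarque_bera_stat(good_resid)
jb_skewed = jarque_bera_stat(skewed_resid)

# Crude heteroscedasticity signal: do squared residuals track the fitted values?
corr_good = np.corrcoef(fitted, good_resid ** 2)[0, 1]
corr_hetero = np.corrcoef(fitted, hetero_resid ** 2)[0, 1]

print(f"JB (normal residuals): {jb_normal:.2f}")
print(f"JB (skewed residuals): {jb_skewed:.2f}")
print(f"corr(fitted, resid^2), constant spread: {corr_good:.3f}")
print(f"corr(fitted, resid^2), growing spread:  {corr_hetero:.3f}")
```

Applied to the model, the same functions would take `residuals` and `y_train_pred` as inputs. These numbers are a screening aid and do not replace the visual inspection or a formal test.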

## 7. Model Evaluation
Checking performance on the test data, which the model has not seen during fitting.

In [None]:
y_test_pred = lr_model.predict(X_test_sm)

r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

print(f"Train R-squared: {r2_train:.4f}")
print(f"Test R-squared: {r2_test:.4f}")
print(f"Test RMSE: {rmse_test:.4f}")
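
As a reminder of what these metrics measure, here is a small hand calculation of $R^2 = 1 - SS_{res}/SS_{tot}$ and RMSE with NumPy. The four target and prediction values are invented for the example.

```python
import numpy as np

# Hypothetical targets and predictions, chosen so the arithmetic is easy to follow
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # 0.25 + 0.25 + 0 + 1 = 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # mean is 6: 9 + 1 + 1 + 9 = 20

r2 = 1 - ss_res / ss_tot                         # 1 - 1.5 / 20 = 0.925
rmse = np.sqrt(ss_res / y_true.size)             # sqrt(1.5 / 4), about 0.6124

print(f"R-squared: {r2:.4f}")
print(f"RMSE: {rmse:.4f}")
```

RMSE is expressed in the same units as the target (bike rentals per day for `cnt`), so it is easier to interpret in practical terms than $R^2$. A test $R^2$ far below the train $R^2$ would point to overfitting.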

## 8. Final Outcome: Significant Variables
Listing variables that are statistically significant ($p < 0.05$).

In [None]:
# Exclude the intercept: 'const' is not a predictor variable
p_values = lr_model.pvalues.drop('const', errors='ignore')
sig_vars = p_values[p_values < 0.05].index.tolist()

print("Significant Variables (p < 0.05):")
for var in sig_vars:
    print(f"- {var}")