<a href="https://colab.research.google.com/github/najanikhatoon/Bike-Sharing-project/blob/main/Another_copy_of_Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Project Name - Bike Sharing Demand Prediction**

### **Project Type- Regression**

### **Contribution - Individual**
## **Name - Najani khatoon**



# **Problem Statement**


# BUSINESS PROBLEM OVERVIEW

To make urban travel more convenient, many cities now offer rental bikes. Because availability shortens waiting time, bikes must be accessible to the public at the right time. Keeping the city supplied with a stable number of rental bikes therefore becomes a major concern, and the key to a stable supply is predicting the number of bikes required in each hour.

Bike sharing systems rent out bikes through a citywide network of automated stations that handle membership, rental, and return. People can pick up a bike at one station and return it to the same station or a different one, either through a membership or on demand.

In this project we predict hourly bike sharing demand for the bike sharing program in Seoul, using historical usage patterns together with weather, time, and other data.

# **Motivation**

Several bike and scooter sharing services (e.g., Bird, Capital Bikeshare, Citi Bike) have started up recently, especially in metropolitan cities such as San Francisco, New York, Chicago and Los Angeles. One of the most important problems from a business point of view is predicting bike demand on any particular day. Excess bikes waste resources (both bike maintenance and the land or bike stands required for parking and security), while too few bikes lead to revenue loss (ranging from the short-term loss of immediate customers to a potential longer-term loss of the future customer base). An estimate of demand would therefore enable these companies to operate efficiently.

# **Import Modules**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
from sklearn.preprocessing import PowerTransformer
import warnings
warnings.filterwarnings("ignore")
# use a white grid with dashed grid lines for all plots
sns.set_style("whitegrid",{'grid.linestyle': '--'})

# **Loading the dataset**

In [None]:
#let's mount the google drive first
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd

# Assuming the file is in the same directory as the notebook:
# df = pd.read_csv('SeoulBikeData.csv', encoding='latin1')

# Or, if the file is in your Google Drive and you mounted it:
df = pd.read_csv('/content/drive/MyDrive/SeoulBikeData.csv', encoding='latin1')
# Replace 'MyDrive' with your Google Drive folder if needed.

# Display the first few rows to verify
df.head()

In [None]:
# Data shape
df.shape


In [None]:
# data dtype
df.info()

In [None]:
# Statistical info
df.describe(include='all').transpose()

# **Preprocessing the data**

In [None]:
# Checking null values of data
df.isna().sum()

In [None]:
# Checking duplicate
df.duplicated().sum()

In [None]:
# converting date column dtype object to date
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y') # Specifying the correct format

In [None]:
# extract day of week, month and year into three new columns
df['day_of_week'] = df['Date'].dt.day_name() # day name (Monday, Tuesday, ...) from the Date column
df["month"] = df['Date'].dt.month_name()   # month name from the Date column
df["year"] = df['Date'].dt.year.astype("object")     # year from the Date column, stored as object so it is treated as categorical

In [None]:
# drop the Date column
df.drop(columns=['Date'],inplace=True)

# **EDA**

# The Hour column is stored as a number, but it represents the hour of the day, so we treat it as a categorical feature.

In [None]:
# convert Hour column integer to Categorical
df['Hour']=df['Hour'].astype('object')

In [None]:
# Divide Data in categorical and numerical features
numeric_features= df.select_dtypes(exclude='object')
categorical_features=df.select_dtypes(include='object')

In [None]:
numeric_features.head()

In [None]:
categorical_features.head()

In [None]:
# checking categorical column value count
for i in categorical_features.columns:
  print("\n ")
  print('column name  : ', i)
  print(df[i].value_counts())

In [None]:
#plotting pairplot for more info
sns.pairplot(df, corner=True)

In [None]:
# checking outliers with seaborn boxplots, one subplot per numeric feature
plt.figure(figsize=(20,15))

for n, i in enumerate(numeric_features.columns, start=1):
  plt.subplot(3,3,n)
  sns.boxplot(df[i])
  plt.title(i)
plt.tight_layout()  # adjust spacing once, after all subplots are drawn

In [None]:
# point plots of Rented Bike Count by Hour, split by each of the other categorical features
for i in categorical_features.columns:
  if i == 'Hour':
    continue  # Hour is already on the x-axis
  plt.figure(figsize=(20,10))
  sns.pointplot(x=df["Hour"],y=df['Rented Bike Count'],hue=df[i])
  plt.title(f"Rented Bike Count by Hour for each value of {i}")
  plt.show()

## **Observation**

The point plots show the following for each column:

# Season

In the Seasons column, demand is lowest in winter.

# Holiday

In the Holiday column, demand is lower on holidays than on non-holidays, possibly because people use the bikes to commute to work.

# Functioning Day

In the Functioning Day column, there is no demand when the day is not a functioning day.

# Days of week

In the day_of_week column, weekdays and weekends follow different patterns: on weekends demand rises in the afternoon, while on weekdays demand is highest around office hours. This suggests reducing the column to a weekday/weekend flag.

# month

In the month column, demand is clearly low in December, January and February. These are cold months, which matches the low winter demand seen in the Seasons column.

# year

Demand was lower in 2017 than in 2018. This may be because the service was new in 2017 and not yet widely known, although the year value counts above are worth checking first: if the 2017 rows cover only winter months, the gap may simply reflect the seasonal effect.

# **Some more experiments for our categorical features**

In [None]:
# Collapse day_of_week into two values: Monday to Friday become 'Weekdays', Saturday and Sunday become 'Weekend'
df['week'] = df['day_of_week'].apply(lambda x:'Weekend'  if x=='Saturday' or  x== 'Sunday' else 'Weekdays')

In [None]:
# value counts of Week column
df.week.value_counts()

In [None]:
# Getting feel of week column with pointplot
plt.figure(figsize=(15,7))
sns.pointplot(x=df["Hour"],y=df['Rented Bike Count'],hue=df['week'])
plt.title("Rented Bike Count during weekday and weekend with respect of Hour")

# The pattern is now clear: on weekends demand is highest in the afternoon, while on weekdays demand is highest around office hours.

# Now we can drop the days of week column

In [None]:
# dropping the day_of_week column from df
df.drop(columns=['day_of_week'], inplace=True)
# rebuild the categorical subset so it loses day_of_week and picks up the new week column
categorical_features = df.select_dtypes(include='object')

## **Value counts as proportions**

In [None]:
for i in categorical_features.columns:
  print('feature name : ',i)
  print(df[i].value_counts(normalize=True))
  print('\n')

In [None]:
# creating pie charts for all categorical features
plt.figure(figsize=(20,15))
for n, i in enumerate(categorical_features.columns, start=1):
  plt.subplot(3,3,n)
  counts = df[i].value_counts()  # compute once and reuse for values and labels
  plt.pie(counts,labels = counts.index.tolist(),autopct='%.0f%%')
  plt.title(i)
plt.tight_layout()

## **Now it is time to explore the numerical features and extract some important information from them**

## **Pay a little attention to the skewness of our numerical features**

# In these plots we observe that some of our columns are right skewed and some are left skewed. We have to keep this in mind when we apply algorithms.

# Right skewed columns are

Rented Bike Count (also our dependent variable), Wind speed (m/s), Solar Radiation (MJ/m2), Rainfall(mm), Snowfall (cm)

# Left skewed columns are

Visibility (10m), Dew point temperature(°C)

## **Let's try something else to get information from our Numerical features**


In [None]:
# plot a histogram (with mean and median lines) and a density plot for each numeric feature
for i in numeric_features.columns:
  print('\n')
  print('='*70,i,'='*70)
  print('\n')
  feature=df[i]
  # one figure per feature with two panels side by side
  fig,axes = plt.subplots(1,2,figsize=(20,5))
  # left panel: histogram with the mean (magenta) and median (cyan) marked
  feature.hist(bins=50,ax=axes[0])
  axes[0].axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
  axes[0].axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
  # right panel: histogram with a kernel density estimate (histplot replaces the deprecated distplot)
  sns.histplot(feature,kde=True,stat='density',ax=axes[1])
  plt.show()


# In the distribution plots we observe that some of our columns are right skewed and some are left skewed. We have to keep this in mind when we apply algorithms.

# Right skewed columns are

Rented Bike Count (also our dependent variable), Wind speed (m/s), Solar Radiation (MJ/m2), Rainfall(mm), Snowfall (cm)

# Left skewed columns are

Visibility (10m), Dew point temperature(°C)

# The histograms confirm this: for the skewed features the mean and the median lie apart, with the mean pulled toward the long tail.
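
The visual reading of skewness can be backed by a number. The following is a minimal sketch on synthetic data (the samples and the helper name `sample_skewness` are assumptions for illustration, not part of the dataset): a positive value indicates a long right tail and a negative value a long left tail. pandas' own `skew()` applies a bias correction, so its values may differ slightly from this plain moment-based version.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (no bias correction)."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)   # second central moment (variance)
    m3 = np.mean(centered ** 3)   # third central moment
    return m3 / m2 ** 1.5

rng = np.random.default_rng(0)
right_skewed = rng.exponential(scale=2.0, size=5000)  # synthetic: long tail to the right
left_skewed = -right_skewed                           # mirrored: long tail to the left

skew_right = sample_skewness(right_skewed)
skew_left = sample_skewness(left_skewed)

# for a right-skewed sample the mean sits above the median, and the reverse for left skew
print("right skew:", skew_right, "| mean > median:", right_skewed.mean() > np.median(right_skewed))
print("left skew :", skew_left, "| mean < median:", left_skewed.mean() < np.median(left_skewed))
```

On the real data, `numeric_features.skew()` would give the corresponding per-column values.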



## **Let's look at how the numerical features relate to our dependent variable**

In [None]:
# Regression plots to see how each numeric feature relates to our dependent variable
n=1
plt.figure(figsize=(15,15))
for i in numeric_features.columns:
  if i == 'Rented Bike Count':
    continue  # skip the dependent variable itself
  plt.subplot(4,2,n)
  n+=1
  # pass the DataFrame through 'data' and the column names through 'x' and 'y'
  sns.regplot(data=df, x=i, y='Rented Bike Count', scatter_kws={"color": "cyan"}, line_kws={"color": "red"})
  plt.title(f'Dependent variable and {i}')
plt.tight_layout()

# These regression plots show that some of our features have a positive linear relation with the target variable and some a negative one.

## **Now let's look at the correlation of our dependent variable with the independent features**

In [None]:
# Correlation with Rented Bike Count, considering only numeric columns
df.corr(numeric_only=True)['Rented Bike Count']

## As in the regression plots, some features are positively correlated with the target and some negatively.

## **Let us view the correlation between all the numerical features as a heat map, which also reveals any multicollinearity.**

In [None]:
# using seaborn heatmap for plotting the correlation matrix
plt.figure(figsize=(10,8))
# correlation of numeric columns only; abs() shows strength regardless of sign
sns.heatmap(abs(df.corr(numeric_only=True)), cmap='coolwarm', annot=True)

# From this graph we can see that the Temperature(°C) and Dew point temperature(°C) columns are multicollinear.

In [None]:
#Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):

   # Calculating VIF
   vif = pd.DataFrame()
   vif["variables"] = X.columns
   vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

   return(vif)

In [None]:
calc_vif(df[[i for i in df.describe().columns if i not in ['Rented Bike Count','Dew point temperature(°C)']]])
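
To make the VIF numbers easier to read, here is a minimal sketch of what the statistic measures, written with NumPy on synthetic data. The column names and the helper `vif_with_intercept` are assumptions for illustration. For each column j, the others are used to predict it, and VIF_j = 1 / (1 - R_j^2), so a column that is almost a copy of another gets a large value. This sketch adds an intercept column, whereas `calc_vif` above passes the raw columns to statsmodels, so the two need not give identical numbers.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # intercept + remaining columns
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(42)
temperature = rng.normal(15, 10, size=1000)                 # synthetic values
dew_point = temperature - 5 + rng.normal(0, 1, size=1000)   # nearly a copy of temperature
wind_speed = rng.normal(2, 1, size=1000)                    # unrelated to the other two

vifs = vif_with_intercept(np.column_stack([temperature, dew_point, wind_speed]))
print(dict(zip(["temperature", "dew_point", "wind_speed"], vifs.round(2))))
```

The two nearly identical columns receive a high VIF while the unrelated column stays close to 1, which is the reason for leaving Dew point temperature(°C) out of the feature set.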

In [None]:
# Using Pandas get Dummies for Encoding categorical features
new_df=pd.get_dummies(df,drop_first=True,sparse=True)

In [None]:
new_df.head(2)
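
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy frame (the values are made up for illustration). One level of each categorical column is dropped and becomes the baseline, which avoids perfectly collinear dummy columns in the linear model.

```python
import pandas as pd

# toy data, not taken from the Seoul dataset
toy = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "Temperature": [-2.0, 12.5, 27.0, 14.0],
})

encoded = pd.get_dummies(toy, drop_first=True)
dummy_cols = [c for c in encoded.columns if c.startswith("Seasons_")]

# the alphabetically first level ("Autumn") has no column of its own:
# a row with all dummy columns equal to 0 stands for that baseline level
print(encoded.columns.tolist())
print(encoded[dummy_cols].iloc[3].tolist())
```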

## **We saw that our dependent variable is right skewed, so it needs to be transformed toward a normal distribution.**
## We try a few transformations to do this

In [None]:
fig,axes = plt.subplots(1,3,figsize=(20,5))
# log10, with a tiny offset so that zero counts do not produce -inf
sns.histplot(np.log10(new_df['Rented Bike Count']+0.0000001),kde=True,stat='density',ax=axes[0],color='red').set_title("log 10")
# square
sns.histplot((new_df['Rented Bike Count']**2),kde=True,stat='density',ax=axes[1],color='red').set_title("square")
# square root (histplot replaces the deprecated distplot)
sns.histplot(np.sqrt(new_df['Rented Bike Count']),kde=True,stat='density',ax=axes[2], color='green').set_title("Square root")

## The green plot (square root) is the closest to a normal distribution, so we apply the square root to our dependent variable
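
One practical difference between the candidate transformations is how they treat hours with zero rentals. The sketch below uses a handful of made-up counts (an assumption for illustration): the square root is defined at zero, while the logarithm needs an offset and maps zero to an extreme value. It also shows that squaring recovers the original counts, which is how predictions on the square-root scale can be mapped back.

```python
import numpy as np

counts = np.array([0, 4, 100, 900, 2500], dtype=float)  # toy hourly counts, including a zero

sqrt_counts = np.sqrt(counts)               # defined at zero
log_counts = np.log10(counts + 0.0000001)   # zero becomes about -7, far from the other values

print("sqrt :", sqrt_counts)
print("log10:", log_counts.round(2))

# squaring undoes the square root, so predictions can be mapped back to bike counts
recovered = sqrt_counts ** 2
print("recovered:", recovered)
```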



## Divide the data into the dependent variable and independent features

In [None]:
X = new_df.drop(columns=['Rented Bike Count','Dew point temperature(°C)'])
y = np.sqrt(new_df['Rented Bike Count'])

In [None]:
# Train test split our data
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.25,random_state=42)


# Getting a feel for X_train, X_test, y_train, y_test

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
y_train.head()

In [None]:
y_test.head()

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
# Lists that collect the metrics of every model passed to score_metrix
mean_absolut_error = []
mean_sq_error=[]
root_mean_sq_error=[]
training_score =[]
r2_list=[]
adj_r2_list=[]
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score


def score_metrix (model,X_train,X_test,Y_train,Y_test):

  '''
    Train the model, print MAE, MSE, RMSE, R2 and adjusted R2 on the test set,
    and append the scores to the lists defined above.
  '''
  #training the model
  model.fit(X_train,Y_train)

  # Training score (R2 on the training set)
  training  = model.score(X_train,Y_train)
  print("Training score  =", training)

  print('\n')

  # best parameters and best CV score exist only for search objects such as GridSearchCV
  print('*'*20, 'Best Parameters & Best Score', '*'*20)
  if hasattr(model,'best_params_'):
    print(f"The best parameters found out to be :{model.best_params_} \nwhere model best score is:  {model.best_score_} \n")
  else:
    print('None')

  #predicting the test set and evaluating the model
  print('\n')
  print('*'*20, 'Evaluation Matrix', '*'*20)
  Y_pred = model.predict(X_test)

  # the underlying estimator: the tuned one for a search object, else the model itself
  base = getattr(model,'best_estimator_',model)

  # linear models are scored after squaring, which undoes the square root applied to the target;
  # all other models are scored on the square-root scale, so the two groups are not directly comparable
  if isinstance(base,(LinearRegression,Lasso,Ridge)):
    y_true,y_hat = Y_test**2,Y_pred**2
  else:
    y_true,y_hat = Y_test,Y_pred

  #finding mean_absolute_error
  MAE  = mean_absolute_error(y_true,y_hat)
  print("MAE :" , MAE)

  #finding mean_squared_error
  MSE  = mean_squared_error(y_true,y_hat)
  print("MSE :" , MSE)

  #finding root mean squared error
  RMSE = np.sqrt(MSE)
  print("RMSE :" ,RMSE)

  #finding the r2 score
  r2 = r2_score(y_true,y_hat)
  print("R2 :" ,r2)

  #finding the adjusted r2 score, which penalises r2 for the number of features
  adj_r2=1-(1-r2)*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
  print("Adjusted R2 : ",adj_r2,'\n')

  #Top 10 feature importance graph, only for estimators that expose feature_importances_
  if hasattr(base,'feature_importances_'):
    # fall back to positional names when X_train is a NumPy array without column labels
    features = list(getattr(X_train,'columns',range(X_train.shape[1])))
    importances = base.feature_importances_
    indices = np.argsort(importances)[-10:]  # positions of the 10 largest importances
    plt.figure(figsize=(10,15))
    plt.title('Feature Importance')
    plt.barh(range(len(indices)), importances[indices], color='pink',edgecolor='red' ,align='center')
    plt.yticks(range(len(indices)), [features[i] for i in indices])
    plt.xlabel('Relative Importance')
    plt.show()

  # append the scores of this model to the shared lists
  mean_absolut_error.append(MAE)
  mean_sq_error.append(MSE)
  root_mean_sq_error.append(RMSE)
  training_score.append(training)
  r2_list.append(r2)
  adj_r2_list.append(adj_r2)


  # print the coefficients and intercept for plain linear regression only
  if isinstance(model,LinearRegression):
    print("*"*25, "coefficient", "*"*25)
    print(model.coef_)
    print('\n')
    print("*"*25, "Intercept", "*"*25)
    print('\n')
    print(model.intercept_)
  print('\n')

  print('*'*20, 'plotting the graph of Actual and predicted only with 80 observation', '*'*20)

  # line graph of the first 80 predicted and actual values (on the square-root scale) for readability
  plt.figure(figsize=(15,7))
  plt.plot(np.array(Y_pred)[:80])
  plt.plot(np.array(Y_test)[:80])
  plt.legend(["Predicted","Actual"])
  plt.show()
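
The adjusted R2 computed inside `score_metrix` follows the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1), with n test rows and p features. As a minimal standalone sketch (the helper name `adjusted_r2` and the numbers are assumptions for illustration), the same R2 is penalised more heavily when more features are used:

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# same R2, different number of features: more features means a larger penalty
few = adjusted_r2(0.90, n_samples=101, n_features=10)
many = adjusted_r2(0.90, n_samples=101, n_features=50)
print("10 features:", few)
print("50 features:", many)
```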

# Transforming X_train and X_test with the Yeo-Johnson transformation

In [None]:
from sklearn.preprocessing import PowerTransformer
yeo = PowerTransformer() # the default method is 'yeo-johnson'
X_train_trans = yeo.fit_transform(X_train) # fit on the training set only, then transform it
X_test_trans = yeo.transform(X_test) # transform the test set with the parameters learned from training
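
The transformer is fitted on the training set only so that no information from the test set leaks into preprocessing. The sketch below shows this on synthetic right-skewed data (the arrays and the `_toy` names are assumptions for illustration): the fitted lambdas come from the training data, and the test data is transformed with those same parameters.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X_train_toy = rng.exponential(scale=2.0, size=(500, 2))  # synthetic right-skewed features
X_test_toy = rng.exponential(scale=2.0, size=(100, 2))

yeo_toy = PowerTransformer(method="yeo-johnson")    # standardize=True by default
X_train_toy_t = yeo_toy.fit_transform(X_train_toy)  # lambdas, mean and scale are learned here
X_test_toy_t = yeo_toy.transform(X_test_toy)        # reuses the training parameters

print("fitted lambdas:", yeo_toy.lambdas_)
print("train mean:", X_train_toy_t.mean(axis=0).round(6))  # standardised to zero mean
print("test mean :", X_test_toy_t.mean(axis=0).round(6))   # close to zero, not exactly zero
```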

## **Linear Regression**

In [None]:
# importing linear models
from sklearn.linear_model import LinearRegression,Lasso,Ridge
Linear = LinearRegression()

In [None]:
# Fitting the linear regression model with our score_metrix function
score_metrix(Linear,X_train_trans,X_test_trans,y_train,y_test)

## **RandomForest Regression**

In [None]:
# Importing RandomForestRegressor from sklearn.ensemble
from sklearn.ensemble import RandomForestRegressor

In [None]:
param_grid = {'n_estimators':[100,150,200],
              'min_samples_leaf':[6,4,2],
              'max_depth' : [30,20,25],
              'min_samples_split': [30,25,20],
              'max_features':[1.0,'sqrt','log2']  # 1.0 (all features) replaces 'auto', which newer scikit-learn releases no longer accept
              }

In [None]:
# Using Grid SearchCV
Ranom_forest_Grid_search = GridSearchCV(RandomForestRegressor(),param_grid=param_grid,n_jobs=-1,cv=5)
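
Before fitting a grid search of this size it helps to know how many models it will train. A minimal sketch using scikit-learn's `ParameterGrid`; the grid mirrors the one above, with `1.0` written where all features are meant, and the names `grid`, `n_candidates` and `n_fits` are assumptions for illustration:

```python
from sklearn.model_selection import ParameterGrid

grid = {
    'n_estimators': [100, 150, 200],
    'min_samples_leaf': [6, 4, 2],
    'max_depth': [30, 20, 25],
    'min_samples_split': [30, 25, 20],
    'max_features': [1.0, 'sqrt', 'log2'],  # 1.0 = use all features
}

n_candidates = len(ParameterGrid(grid))  # every combination of the listed values
n_folds = 5                              # cv=5 in the grid search
n_fits = n_candidates * n_folds          # cross-validation fits, excluding the final refit

print(n_candidates, "candidates x", n_folds, "folds =", n_fits, "fits")
```

If the search takes too long, `RandomizedSearchCV`, which is already imported at the top, samples only a fixed number of candidates from the same grid.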

## **XGBRegressor**

In [None]:
#importing XGBoost Regressor
from xgboost import XGBRegressor

In [None]:
params = {
          'subsample': [0.5],#0.3,0.7],
          'n_jobs': [4], #2,6],  number of parallel threads
          'n_estimators': [1000],#range(200,1500,50),
          'min_child_weight': [2],#3,5],
          'max_depth': [4],#range(2,8,2),
          'learning_rate': [0.02],#0.04,0.06],
          'eval_metric': ['rmse'],
          'colsample_bytree': [0.7],#0.5,1.0],
          }

In [None]:
#creating xgb grid model; verbosity=0 silences training messages
xgb_grid_search= GridSearchCV(XGBRegressor(verbosity=0),param_grid=params,cv=5)
