## Dataset from Kaggle
Link: https://www.kaggle.com/c/bike-sharing-demand/data


### Explanation of the features in the dataset
 - datetime - hourly date + timestamp  
 - season -  1 = spring, 2 = summer, 3 = fall, 4 = winter 
 - holiday - whether the day is considered a holiday
 - workingday - whether the day is neither a weekend nor holiday
 - weather - categorical weather condition:
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
 - temp - temperature in Celsius
 - atemp - "feels like" temperature in Celsius
 - humidity - relative humidity
 - windspeed - wind speed
 - casual - number of non-registered user rentals initiated
 - registered - number of registered user rentals initiated
 - count - number of total rentals

## What is covariate shift
Dataset shift is the situation where the joint distribution of inputs and outputs differs between the training and test stages. Covariate shift is the special case in which the distribution of the input variables differs between the training and test datasets while the relationship between inputs and output stays the same. It is most common when the data were collected over different time intervals, which tends to change the data distribution.
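
Stated in symbols, as a sketch only: the notation is an assumption not used elsewhere in this notebook, with $x$ for the input features and $y$ for the output.

```latex
P_{\text{train}}(x) \neq P_{\text{test}}(x),
\qquad
P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x)
```

The drift check in this notebook can only probe the first condition, because the test set has no output column.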

In [1]:
# Importing python libraries needed
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Imports the dataset using pandas and converts it to a pandas dataframe.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [3]:
train.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [6]:
test.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed
0,2011-01-20 00:00:00,1,0,1,1,10.66,11.365,56,26.0027
1,2011-01-20 01:00:00,1,0,1,1,10.66,13.635,56,0.0
2,2011-01-20 02:00:00,1,0,1,1,10.66,13.635,56,0.0
3,2011-01-20 03:00:00,1,0,1,1,10.66,12.88,56,11.0014
4,2011-01-20 04:00:00,1,0,1,1,10.66,12.88,56,11.0014


Since the train set has three columns that the test set lacks (casual, registered and count), these three columns are dropped from the train set.
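
The extra columns can also be found programmatically instead of by eye. This is a minimal sketch on tiny stand-in frames: `train_demo`, `test_demo` and `extra_cols` are hypothetical names, and the values are only illustrative.

```python
import pandas as pd

# Tiny stand-in frames (illustrative values) that mimic the column layout above.
train_demo = pd.DataFrame(
    {"temp": [9.84], "humidity": [81], "casual": [3], "registered": [13], "count": [16]}
)
test_demo = pd.DataFrame({"temp": [10.66], "humidity": [56]})

# Columns present in train but missing from test, kept in train's column order.
extra_cols = [c for c in train_demo.columns if c not in test_demo.columns]
print(extra_cols)

# Dropping them leaves both frames with the same columns.
aligned = train_demo.drop(columns=extra_cols)
```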

In [8]:
train = train.drop(['casual', 'registered', 'count'], axis=1)

In [5]:
# check for null values
train.isnull().any().any()

False

Going through the two datasets, there are no null values and no invalid entries. Features such as windspeed and temp contain zero values, which are plausible values for this data.
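
The cell above only inspects `train`. A sketch of running the same null check, plus a count of zero values, on both frames at once could look like this. The stand-in frames and the names `frames`, `has_nulls` and `zero_counts` are assumptions for illustration, and a missing value is planted in `test_demo` on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frames; test_demo deliberately contains one NaN.
train_demo = pd.DataFrame({"temp": [9.84, 9.02], "windspeed": [0.0, 0.0]})
test_demo = pd.DataFrame({"temp": [10.66, np.nan], "windspeed": [26.0027, 0.0]})

frames = {"train": train_demo, "test": test_demo}

# True if the frame contains at least one null value.
has_nulls = {name: bool(df.isnull().any().any()) for name, df in frames.items()}

# Number of exact zeros per column (NaN does not compare equal to 0).
zero_counts = {name: (df == 0).sum().to_dict() for name, df in frames.items()}

print(has_nulls)
print(zero_counts)
```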

In [9]:
# Since the original target in this dataset was a continuous variable, create a new
# binary feature `target` that marks where each row comes from: 0 = train, 1 = test.
train['target'] = 0
test['target'] = 1

In [10]:
## Combining both dataframes
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
final = pd.concat([train, test])
y = final['target']
# Drop the target variable from the merged dataframe
final = final.drop(columns='target')

In [14]:
from sklearn.preprocessing import LabelEncoder

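# Label-encode every string (object) column as integer codes; here that is only datetime.
number = LabelEncoder()
for i in final.columns:
    if final[i].dtype == 'object':
        final[i] = number.fit_transform(final[i].astype('str'))
        # Cast back to object so the column keeps its original dtype.
        final[i] = final[i].astype('object')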

In [15]:
final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17379 entries, 0 to 6492
Data columns (total 9 columns):
datetime      17379 non-null object
season        17379 non-null int64
holiday       17379 non-null int64
workingday    17379 non-null int64
weather       17379 non-null int64
temp          17379 non-null float64
atemp         17379 non-null float64
humidity      17379 non-null int64
windspeed     17379 non-null float64
dtypes: float64(3), int64(5), object(1)
memory usage: 1.3+ MB


 - To decide whether a feature is drifting, a classifier is trained on that feature alone to tell train rows from test rows. If its AUC-ROC is greater than 0.8 the feature is considered drifting; otherwise it is not.
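
To see what the rule looks like when a feature really does drift, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the column names `stable` and `shifted`, the sample size, the size of the shift, and the default `LogisticRegression` settings, which differ from the L1-regularised model used in the next cell.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000

# 'stable' has the same distribution in both sets;
# the mean of 'shifted' moves by three standard deviations in the test set.
train_demo = pd.DataFrame({"stable": rng.normal(0, 1, n), "shifted": rng.normal(0, 1, n)})
test_demo = pd.DataFrame({"stable": rng.normal(0, 1, n), "shifted": rng.normal(3, 1, n)})

# Stack the two sets and label each row by origin: 0 = train, 1 = test.
combined = pd.concat([train_demo, test_demo], ignore_index=True)
origin = np.repeat([0, 1], n)

clf = LogisticRegression()
auc = {}
for col in combined.columns:
    # One single-feature classifier per column, scored by cross-validated AUC-ROC.
    scores = cross_val_score(clf, combined[[col]], origin, cv=2, scoring="roc_auc")
    auc[col] = scores.mean()

# Features whose origin can be predicted well are flagged as drifting.
drifting = [col for col, value in auc.items() if value > 0.8]
print(auc)
print(drifting)
```

Only the `shifted` column should be flagged here, while `stable` should score near 0.5, which is what a classifier gets when it cannot tell the two sets apart.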

In [22]:
'''
Computation of all drifting features in this dataset.
Modelling the data with a logistic regression model
'''
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Logistic regression classifier with L1 regularization.
clf_lr = LogisticRegression(penalty='l1', solver='liblinear')
drop_list = []

for i in final.columns:
    score = cross_val_score(clf_lr, pd.DataFrame(final[i]), y, cv=2, scoring='roc_auc')
    if np.mean(score) > 0.8:
        drop_list.append(i)
        print(i, np.mean(score))

- Since there is no output, none of the features has an AUC-ROC value > 0.8, which means none of the features is drifting.

In [23]:
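# Print the features whose AUC-ROC is below the threshold (the non-drifting ones).
# They are not added to drop_list, which is reserved for drifting features.
for i in final.columns:
    score = cross_val_score(clf_lr, pd.DataFrame(final[i]), y, cv=2, scoring='roc_auc')
    if np.mean(score) < 0.8:
        print(i, np.mean(score))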

datetime 0.5467809535084794
season 0.5033598476362562
holiday 0.49886883278433153
workingday 0.5000855219538987
weather 0.5061095845178567
temp 0.5128336053651894
atemp 0.4883625712137811
humidity 0.533073011989778
windspeed 0.5056615899861661


- Here all the features are printed since their AUC-ROC values are all below 0.8 (all are close to 0.5, i.e. chance level), meaning none of the columns is drifting.

### Conclusion
The analysis finds no drifting features in this dataset, so covariate shift should not degrade model accuracy. As far as the input distributions are concerned, a model trained on the train data can be expected to carry over to the test data, because the two sets are not distinguishable on any single feature.