# Problem Statement
A bike-sharing system is a service in which bikes are made available to individuals for shared use on a short-term basis, either for a price or for free. Many bike-share systems let people borrow a bike from a computer-controlled "dock": the user enters payment information and the system unlocks the bike. The bike can then be returned to another dock belonging to the same system.
<br><br>
A US bike-sharing provider, **BoomBikes**, has recently suffered considerable dips in revenue due to the ongoing Covid-19 pandemic. The company is finding it very difficult to sustain itself in the current market. It has therefore decided to come up with a mindful business plan to accelerate its revenue as soon as the ongoing lockdown ends and the economy returns to a healthy state.
<br><br>
To that end, BoomBikes wants to understand the demand for shared bikes once the nationwide Covid-19 quarantine ends. The aim is to be ready to cater to people's needs when the situation improves, to stand out from other service providers, and to make huge profits.


They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends, specifically in the American market. The company wants to know:

- Which variables are significant in predicting the demand for shared bikes.
- How well those variables describe the bike demand.

Based on various meteorological surveys and people's styles, the service provider has gathered a large dataset on daily bike demand across the American market, based on several factors.

## Business Goal
ent features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market. 

In [None]:
# Import the libraries needed to load the provided dataset and explore it graphically
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')  # suppress all warnings to keep the notebook output clean

In [None]:
# now that the libraries are imported, load the dataset so we can check it for nulls,
# categorical columns and other issues using regular pandas EDA techniques
boombike_ds = pd.read_csv('day.csv')

In [None]:
# let's check that the data loaded correctly
boombike_ds.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,01-01-2018,1,0,1,0,6,0,2,14.110847,18.18125,80.5833,10.749882,331,654,985
1,2,02-01-2018,1,0,1,0,0,0,2,14.902598,17.68695,69.6087,16.652113,131,670,801
2,3,03-01-2018,1,0,1,0,1,1,1,8.050924,9.47025,43.7273,16.636703,120,1229,1349
3,4,04-01-2018,1,0,1,0,2,1,1,8.2,10.6061,59.0435,10.739832,108,1454,1562
4,5,05-01-2018,1,0,1,0,3,1,1,9.305237,11.4635,43.6957,12.5223,82,1518,1600


In [None]:
boombike_ds.shape
# data has 730 rows and 16 columns

(730, 16)

In [None]:
boombike_ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 730 entries, 0 to 729
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     730 non-null    int64  
 1   dteday      730 non-null    object 
 2   season      730 non-null    int64  
 3   yr          730 non-null    int64  
 4   mnth        730 non-null    int64  
 5   holiday     730 non-null    int64  
 6   weekday     730 non-null    int64  
 7   workingday  730 non-null    int64  
 8   weathersit  730 non-null    int64  
 9   temp        730 non-null    float64
 10  atemp       730 non-null    float64
 11  hum         730 non-null    float64
 12  windspeed   730 non-null    float64
 13  casual      730 non-null    int64  
 14  registered  730 non-null    int64  
 15  cnt         730 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 91.4+ KB


In [None]:
# info() reports no null values in the dataset; let's confirm with a per-column null count
boombike_ds.isna().sum()
# there are no null values in the dataset, so no imputation or removal is required
# let's check whether there are any duplicate rows
temp_df = boombike_ds.copy()
temp_df.drop_duplicates(subset=None, inplace=True)

In [None]:
temp_df.shape
# the copy with duplicates dropped has the same shape as the original DataFrame,
# so there are no duplicate rows in the data

(730, 16)

> Findings :
> 1. Data has no null values
> 2. Data has no duplicates
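
Both findings can also be read off as single numbers instead of comparing shapes by eye. The following is a minimal sketch on a small toy frame with hypothetical values (not the BoomBikes data), assuming only pandas: it counts missing entries with `isna()` and repeated rows with `duplicated()`.

```python
import pandas as pd

# Toy frame (hypothetical values) with one missing entry and one repeated row
toy = pd.DataFrame({
    "season": [1, 1, 2, 2],
    "cnt": [985.0, 985.0, None, 1349.0],
})

null_counts = toy.isna().sum()          # missing values per column
total_nulls = int(null_counts.sum())    # missing values in the whole frame
dup_rows = int(toy.duplicated().sum())  # rows identical to an earlier row

print(null_counts)
print("total nulls:", total_nulls)
print("duplicate rows:", dup_rows)
```

On a clean dataset both totals would be zero, which matches the two findings above.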

In [None]:
# let's identify which columns are required and which can be derived or converted, based on our data dictionary
print(boombike_ds.head())
print(boombike_ds.nunique())


   instant      dteday  season  yr  mnth  holiday  weekday  workingday  \
0        1  01-01-2018       1   0     1        0        6           0   
1        2  02-01-2018       1   0     1        0        0           0   
2        3  03-01-2018       1   0     1        0        1           1   
3        4  04-01-2018       1   0     1        0        2           1   
4        5  05-01-2018       1   0     1        0        3           1   

   weathersit       temp     atemp      hum  windspeed  casual  registered  \
0           2  14.110847  18.18125  80.5833  10.749882     331         654   
1           2  14.902598  17.68695  69.6087  16.652113     131         670   
2           1   8.050924   9.47025  43.7273  16.636703     120        1229   
3           1   8.200000  10.60610  59.0435  10.739832     108        1454   
4           1   9.305237  11.46350  43.6957  12.522300      82        1518   

    cnt  
0   985  
1   801  
2  1349  
3  1562  
4  1600  
instant       730
dteday  

In [None]:
# unique is a method and has to be called; without the parentheses pandas
# prints the bound method (and the whole Series) instead of the distinct values
print(boombike_ds.season.unique())
print(boombike_ds.yr.unique())
print(boombike_ds.holiday.unique())
print(boombike_ds.weekday.unique())
print(boombike_ds.workingday.unique())
print(boombike_ds.weathersit.unique())

[1 2 3 4]
[0 1]
[0 1]
[6 0 1 2 3 4 5]
[0 1]
[2 1 3]
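
The distinct values of several coded columns can also be collected in one pass. This is a minimal sketch, assuming only pandas and using the first five rows from the `head()` output rather than the full file; the names `sample` and `distinct` are illustrative.

```python
import pandas as pd

# First five rows of three columns, copied from the head() output
sample = pd.DataFrame({
    "season": [1, 1, 1, 1, 1],
    "weekday": [6, 0, 1, 2, 3],
    "workingday": [0, 0, 1, 1, 1],
})

# Series.unique is a method: it must be called to return the values
distinct = {col: sample[col].unique().tolist() for col in sample.columns}

for col, values in distinct.items():
    print(f"{col}: {values}")
```

Values come back in order of first appearance, which is why `weekday` starts at 6 here.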


## What we could find
- 'instant' is just a row index, so we can drop it; it is unlikely to help in modelling
- data is present for the years 2018 and 2019 only, as given in the data dictionary
- columns like yr, holiday, weekday, workingday and weathersit don't have clear names, and their categories are encoded as numbers
- 'casual' and 'registered' can be dropped, since we need the total count of bikes booked, which is already present in cnt
- 'dteday' can be dropped as we already have the month and year
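
Two of these points are easy to verify before dropping anything. The sketch below is a hedged check on the five rows shown in the `head()` output, not on the full file. It assumes that `dteday` is in day-month-year format and that `yr` 0 stands for 2018; the names `sample`, `cnt_is_sum`, `month_matches` and `year_matches` are illustrative.

```python
import pandas as pd

# First five rows, copied from the head() output
sample = pd.DataFrame({
    "dteday": ["01-01-2018", "02-01-2018", "03-01-2018", "04-01-2018", "05-01-2018"],
    "yr": [0, 0, 0, 0, 0],
    "mnth": [1, 1, 1, 1, 1],
    "casual": [331, 131, 120, 108, 82],
    "registered": [654, 670, 1229, 1454, 1518],
    "cnt": [985, 801, 1349, 1562, 1600],
})

# cnt should equal casual + registered on every row
cnt_is_sum = bool((sample["casual"] + sample["registered"] == sample["cnt"]).all())

# Assumption: dteday is day-month-year and yr 0 corresponds to 2018
dates = pd.to_datetime(sample["dteday"], format="%d-%m-%Y")
month_matches = bool((dates.dt.month == sample["mnth"]).all())
year_matches = bool((dates.dt.year - 2018 == sample["yr"]).all())

print(cnt_is_sum, month_matches, year_matches)
```

Running the same three comparisons on the full DataFrame would confirm that no information is lost by dropping these columns.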

In [None]:
# let's rename some columns to clearer, conventional names as per the data dictionary
print(boombike_ds.columns)
boombike_ds.rename(columns={'yr' : 'year', 'mnth' : 'month', 'hum' : 'humidity',
                             'cnt' : 'count', 'atemp' : 'feeling temperature',
                               'temp' : 'temperature'}, inplace=True)

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'casual', 'registered', 'cnt'],
      dtype='object')
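
Two of the new names have a side effect worth knowing about. A minimal sketch on two rows from the `head()` output, assuming only pandas: a name containing a space, such as `feeling temperature`, can only be reached with bracket indexing, and `count` collides with the existing `DataFrame.count` method, so attribute access returns the method rather than the column.

```python
import pandas as pd

# Two rows copied from the head() output, with the original column names
sample = pd.DataFrame({
    "temp": [14.110847, 14.902598],
    "atemp": [18.18125, 17.68695],
    "cnt": [985, 801],
})

# rename without inplace returns a new frame and leaves `sample` unchanged
renamed = sample.rename(columns={
    "temp": "temperature",
    "atemp": "feeling temperature",
    "cnt": "count",
})

# A name with a space only works with bracket indexing
feels_like = renamed["feeling temperature"]

# renamed.count is the DataFrame.count method, so use brackets for the column
count_col = renamed["count"]

print(list(renamed.columns))
print(callable(renamed.count))
```

Bracket indexing works for every column name, so using it consistently avoids both pitfalls.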


In [None]:
boombike_ds.head()

Unnamed: 0,instant,dteday,season,year,month,holiday,weekday,workingday,weathersit,temperature,feeling temperature,humidity,windspeed,casual,registered,count
0,1,01-01-2018,1,0,1,0,6,0,2,14.110847,18.18125,80.5833,10.749882,331,654,985
1,2,02-01-2018,1,0,1,0,0,0,2,14.902598,17.68695,69.6087,16.652113,131,670,801
2,3,03-01-2018,1,0,1,0,1,1,1,8.050924,9.47025,43.7273,16.636703,120,1229,1349
3,4,04-01-2018,1,0,1,0,2,1,1,8.2,10.6061,59.0435,10.739832,108,1454,1562
4,5,05-01-2018,1,0,1,0,3,1,1,9.305237,11.4635,43.6957,12.5223,82,1518,1600


In [None]:
# dropping columns that don't seem to help the model
boombike_ds.drop(['instant','dteday','casual','registered'], axis=1, inplace=True)
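
An in-place drop raises a `KeyError` if the cell is run a second time, because the columns are already gone. One possible alternative, sketched here on two rows from the `head()` output, is to drop into a new frame and pass `errors="ignore"`. The names `sample`, `unused` and `model_df` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the head() output (after the rename of cnt to count)
sample = pd.DataFrame({
    "instant": [1, 2],
    "dteday": ["01-01-2018", "02-01-2018"],
    "casual": [331, 131],
    "registered": [654, 670],
    "count": [985, 801],
})

unused = ["instant", "dteday", "casual", "registered"]

# drop without inplace returns a new frame; `sample` keeps all its columns
model_df = sample.drop(columns=unused)

# errors="ignore" skips labels that are already gone, so repeating this is safe
model_df = model_df.drop(columns=unused, errors="ignore")

print(list(model_df.columns))
```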

In [None]:
boombike_ds.head()

Unnamed: 0,season,year,month,holiday,weekday,workingday,weathersit,temperature,feeling temperature,humidity,windspeed,count
0,1,0,1,0,6,0,2,14.110847,18.18125,80.5833,10.749882,985
1,1,0,1,0,0,0,2,14.902598,17.68695,69.6087,16.652113,801
2,1,0,1,0,1,1,1,8.050924,9.47025,43.7273,16.636703,1349
3,1,0,1,0,2,1,1,8.2,10.6061,59.0435,10.739832,1562
4,1,0,1,0,3,1,1,9.305237,11.4635,43.6957,12.5223,1600


In [30]:
# after dropping columns, let's map the coded columns to their category labels for a clearer picture
# map is called on each column (a Series) with a dictionary of code -> label

boombike_ds.season = boombike_ds.season.map({1: 'spring', 2: 'summer', 3: 'fall', 4: 'winter'})
boombike_ds.month = boombike_ds.month.map({1: 'jan', 2: 'feb', 3: 'mar', 4: 'apr', 5: 'may', 6: 'jun', 7: 'jul', 8: 'aug',
                                           9: 'sep', 10: 'oct', 11: 'nov', 12: 'dec'})
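
Code-to-label mapping works one column at a time: `map` with a dictionary is called on a Series. A minimal sketch, assuming only pandas and using the season dictionary from the cell above on hypothetical codes; it also shows that a code missing from the dictionary becomes `NaN`, which is worth checking after every mapping.

```python
import pandas as pd

season_labels = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}

# Hypothetical codes; 5 is deliberately absent from the dictionary
codes = pd.Series([1, 2, 3, 4, 5], name="season")

# Series.map looks up each value in the dictionary
labels = codes.map(season_labels)

# Codes without a label come back as NaN; list them explicitly
unmapped = codes[labels.isna()].tolist()

print(labels.tolist())
print("unmapped codes:", unmapped)
```

An empty `unmapped` list after mapping a real column indicates that every code had a label.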