#### Data wrangling is the process of converting and formatting data from its raw form into a usable format so that it can be analysed.
#### Without proper data wrangling, we cannot perform analysis or build models. Good data wrangling helps improve the accuracy of the model/algorithm.

## Problem Statement:
### Lyft, Inc. is a transportation network company based in San Francisco, California and operating in 640 cities in the United States and 9 cities in Canada. It develops, markets, and operates the Lyft mobile app, offering car rides, scooters, and a bicycle-sharing system. It is the second largest rideshare company in the United States, second only to Uber.
### Lyft’s bike-sharing service is also among the largest in the USA. Being able to anticipate demand is extremely important for planning the bicycles, stations, and personnel required to maintain them. This demand is sensitive to many factors such as season, humidity, rain, weekdays, holidays, and more. To enable this planning, Lyft needs to predict the demand accurately from these factors.

## Attribute Information:

#### date - date of the ride
#### season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
#### holiday - whether the day is considered a holiday
#### workingday - whether the day is neither a weekend nor holiday
#### weather - 1: Clear, Few clouds, Partly cloudy
      2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
      3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
      4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
#### temp - temperature in Celsius
#### atemp - "feels like" temperature in Celsius
#### humidity - relative humidity
#### windspeed - wind speed
#### casual - number of non-registered user rentals initiated
#### registered - number of registered user rentals initiated
#### count - number of total rentals

### You are a data scientist assigned the task of performing data wrangling on a set of datasets. These datasets may contain ambiguities. You have to identify these ambiguities and apply different data wrangling techniques to get the datasets ready for use.

In [5]:
###importing the packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
data1=pd.read_csv('dataset1.csv')
data1.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp
0,1,01-01-2011,1,0,1,0,False,6,1,0.24
1,2,01-01-2011,1,0,1,1,False,6,1,0.22
2,3,01-01-2011,1,0,1,2,False,6,1,0.22
3,4,01-01-2011,1,0,1,3,False,6,1,0.24
4,5,01-01-2011,1,0,1,4,False,6,1,0.24


In [7]:
data1.tail()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp
605,606,28-01-2011,1,0,1,11,False,5,3,0.18
606,607,28-01-2011,1,0,1,12,False,5,3,0.18
607,608,28-01-2011,1,0,1,13,False,5,3,0.18
608,609,28-01-2011,1,0,1,14,False,5,3,0.22
609,610,28-01-2011,1,0,1,15,False,5,2,0.2


In [8]:
data1.shape

(610, 10)

In [9]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 610 entries, 0 to 609
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     610 non-null    int64  
 1   dteday      610 non-null    object 
 2   season      610 non-null    int64  
 3   yr          610 non-null    int64  
 4   mnth        610 non-null    int64  
 5   hr          610 non-null    int64  
 6   holiday     610 non-null    bool   
 7   weekday     610 non-null    int64  
 8   weathersit  610 non-null    int64  
 9   temp        610 non-null    float64
dtypes: bool(1), float64(1), int64(7), object(1)
memory usage: 43.6+ KB


In [10]:
data1.isna().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
weathersit    0
temp          0
dtype: int64

In [11]:
data1.duplicated().sum()

0
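
The checks above (shape, missing values, duplicate rows) are repeated for every dataset in this notebook. A minimal sketch of how they could be bundled into one helper is shown below. The function name `quick_report` and the toy frame are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def quick_report(df: pd.DataFrame) -> dict:
    """Collect the basic quality checks used in this notebook in one place."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Toy frame standing in for a real dataset (values are illustrative only)
toy = pd.DataFrame({
    "instant": [1, 2, 2, 3],
    "temp": [0.24, 0.22, 0.22, None],
})

report = quick_report(toy)
print(report)
```

The same helper could then be called on `data1`, `data2` and `data3` instead of repeating the individual cells.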

In [12]:
data2=pd.read_excel('dataset2.xlsx')
data2.head()

Unnamed: 0.1,Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
0,0,1,0.2879,0.81,0.0,3,13,16
1,1,2,0.2727,0.8,0.0,8,32,40
2,2,3,0.2727,0.8,0.0,5,27,32
3,3,4,0.2879,0.75,0.0,3,10,13
4,4,5,0.2879,0.75,0.0,0,1,1


In [13]:
data2.tail()

Unnamed: 0.1,Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
605,605,606,0.2121,0.93,0.1045,0,30,30
606,606,607,0.2121,0.93,0.1045,1,28,29
607,607,608,0.2121,0.93,0.1045,0,31,31
608,608,609,0.2727,0.8,0.0,2,36,38
609,609,610,0.2576,0.86,0.0,1,40,41


In [14]:
data2=data2.drop(['Unnamed: 0'],axis=1)
data2.head()

Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
0,1,0.2879,0.81,0.0,3,13,16
1,2,0.2727,0.8,0.0,8,32,40
2,3,0.2727,0.8,0.0,5,27,32
3,4,0.2879,0.75,0.0,3,10,13
4,5,0.2879,0.75,0.0,0,1,1


In [15]:
data2.tail()

Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
605,606,0.2121,0.93,0.1045,0,30,30
606,607,0.2121,0.93,0.1045,1,28,29
607,608,0.2121,0.93,0.1045,0,31,31
608,609,0.2727,0.8,0.0,2,36,38
609,610,0.2576,0.86,0.0,1,40,41


In [16]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 610 entries, 0 to 609
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     610 non-null    int64  
 1   atemp       599 non-null    float64
 2   hum         610 non-null    float64
 3   windspeed   610 non-null    float64
 4   casual      610 non-null    int64  
 5   registered  610 non-null    int64  
 6   cnt         610 non-null    int64  
dtypes: float64(3), int64(4)
memory usage: 33.5 KB


In [17]:
data2.isna().sum()

instant        0
atemp         11
hum            0
windspeed      0
casual         0
registered     0
cnt            0
dtype: int64

In [18]:
###missing value percentage:
mvp=(data2.isna().sum()/data2.shape[0])*100
mvp

instant       0.000000
atemp         1.803279
hum           0.000000
windspeed     0.000000
casual        0.000000
registered    0.000000
cnt           0.000000
dtype: float64
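
The missing value percentage can also be written more compactly, because the mean of a boolean mask is the fraction of `True` values. The sketch below uses a small made-up frame, not the real dataset.

```python
import pandas as pd

# Toy frame: one of four atemp values is missing (illustrative values only)
toy = pd.DataFrame({
    "instant": [1, 2, 3, 4],
    "atemp": [0.2879, None, 0.2727, 0.2879],
})

# Long form, as used in the cell above
pct_long = (toy.isna().sum() / toy.shape[0]) * 100

# Short form: the mean of the boolean mask is the missing fraction
pct_short = toy.isna().mean() * 100

print(pct_short)
```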

In [19]:
# Alternative (not used here): drop the rows that contain missing values
#data2.dropna(inplace=True)

In [20]:
data2['atemp']=data2['atemp'].fillna(data2['atemp'].mean())
data2.isna().sum()

instant       0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64
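
Mean imputation is one option among several. Because the records here are ordered hourly readings, linear interpolation between neighbouring values might also be worth considering. The sketch below compares the two on a made-up series; it is an illustration under that assumption, not a recommendation for this dataset.

```python
import pandas as pd

# Toy ordered series with one gap (illustrative values only)
s = pd.Series([0.20, None, 0.40, 0.60])

# Option A: fill the gap with the mean of the observed values
mean_filled = s.fillna(s.mean())

# Option B: fill the gap linearly from its neighbours
interp_filled = s.interpolate()

print(pd.DataFrame({"original": s, "mean": mean_filled, "interpolated": interp_filled}))
```

With the mean, every gap receives the same value regardless of its position; interpolation follows the local trend instead.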

In [21]:
data2.duplicated().sum()

0

In [22]:
### A primary key is a column (or set of columns) whose values uniquely identify each record in the data

In [23]:
data2['instant'].duplicated().sum(axis=0)

0

In [24]:
data2['instant'].nunique()

610
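
The two checks above (no duplicated values, and as many unique values as rows) can be combined with a check for missing keys. A minimal sketch follows; the helper name `is_primary_key` is hypothetical.

```python
import pandas as pd

def is_primary_key(df: pd.DataFrame, column: str) -> bool:
    """Return True if `column` has no missing values and no repeated values."""
    col = df[column]
    return bool(col.notna().all() and col.is_unique)

# Toy frames (illustrative values only)
good = pd.DataFrame({"instant": [1, 2, 3]})
repeated = pd.DataFrame({"instant": [1, 2, 2]})
missing = pd.DataFrame({"instant": [1.0, None, 3.0]})

print(is_primary_key(good, "instant"))
print(is_primary_key(repeated, "instant"))
print(is_primary_key(missing, "instant"))
```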

In [25]:
combined_data=pd.merge(data1,data2,on='instant',how='inner')
combined_data.head(3)

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,01-01-2011,1,0,1,0,False,6,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,01-01-2011,1,0,1,1,False,6,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,01-01-2011,1,0,1,2,False,6,1,0.22,0.2727,0.8,0.0,5,27,32


In [26]:
combined_data.shape

(610, 16)
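
An inner merge silently drops keys that appear in only one table. As a hedged sketch, `pd.merge` can be asked to verify the one-to-one relationship (`validate`) and to label where each row came from (`indicator`). The frames below are toy data, not the real datasets.

```python
import pandas as pd

# Toy frames sharing the key column "instant" (illustrative values only)
left = pd.DataFrame({"instant": [1, 2, 3], "temp": [0.24, 0.22, 0.22]})
right = pd.DataFrame({"instant": [2, 3, 4], "cnt": [40, 32, 13]})

# An outer merge with indicator=True adds a "_merge" column that records
# whether each key was found in the left table, the right table, or both.
audit = pd.merge(left, right, on="instant", how="outer",
                 validate="one_to_one", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

# The inner merge keeps only keys present in both tables.
inner = pd.merge(left, right, on="instant", how="inner", validate="one_to_one")

print(unmatched)
print(inner)
```

In this notebook both tables hold the same 610 keys, so the inner merge loses nothing; the audit step matters when that is not known in advance.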

In [27]:
data3=pd.read_csv('dataset3.csv')
data3.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,620,29-01-2011,1,0,1,1,False,6,1,0.22,0.2273,0.64,0.194,0,20,20
1,621,29-01-2011,1,0,1,2,False,6,1,0.22,0.2273,0.64,0.1642,0,15,15
2,622,29-01-2011,1,0,1,3,False,6,1,0.2,0.2121,0.64,0.1343,3,5,8
3,623,29-01-2011,1,0,1,4,False,6,1,0.16,0.1818,0.69,0.1045,1,2,3
4,624,29-01-2011,1,0,1,6,False,6,1,0.16,0.1818,0.64,0.1343,0,2,2


In [28]:
data3.tail()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
385,615,28-01-2011,1,0,1,20,False,5,2,0.24,0.2273,0.7,0.194,1,61,62
386,616,28-01-2011,1,0,1,21,False,5,2,0.22,0.2273,0.75,0.1343,1,57,58
387,617,28-01-2011,1,0,1,22,False,5,1,0.24,0.2121,0.65,0.3582,0,26,26
388,618,28-01-2011,1,0,1,23,False,5,1,0.24,0.2273,0.6,0.2239,1,22,23
389,619,29-01-2011,1,0,1,0,False,6,1,0.22,0.197,0.64,0.3582,2,26,28


In [29]:
data3=data3.sort_values(by='instant')


In [30]:
data3.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
381,611,28-01-2011,1,0,1,16,False,5,1,0.22,0.2727,0.8,0.0,10,70,80
382,612,28-01-2011,1,0,1,17,False,5,1,0.24,0.2424,0.75,0.1343,2,147,149
383,613,28-01-2011,1,0,1,18,False,5,1,0.24,0.2273,0.75,0.194,2,107,109
384,614,28-01-2011,1,0,1,19,False,5,2,0.24,0.2424,0.75,0.1343,5,84,89
385,615,28-01-2011,1,0,1,20,False,5,2,0.24,0.2273,0.7,0.194,1,61,62


In [31]:
data3.tail()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
376,996,14-02-2011,1,0,2,3,False,1,1,0.34,0.3182,0.46,0.2239,1,1,2
377,997,14-02-2011,1,0,2,4,False,1,1,0.32,0.303,0.53,0.2836,0,2,2
378,998,14-02-2011,1,0,2,5,False,1,1,0.32,0.303,0.53,0.2836,0,3,3
379,999,14-02-2011,1,0,2,6,False,1,1,0.34,0.303,0.46,0.2985,1,25,26
380,1000,14-02-2011,1,0,2,7,False,1,1,0.34,0.303,0.46,0.2985,2,96,98


In [32]:
combined_data.shape

(610, 16)

In [33]:
data3.shape

(390, 16)

In [34]:
data3.columns

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual',
       'registered', 'cnt'],
      dtype='object')

In [35]:
combined_data.columns

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual',
       'registered', 'cnt'],
      dtype='object')

In [36]:
###checking the equality of the columns:
data3.columns.equals(combined_data.columns)

True

In [37]:
final=pd.concat([data3,combined_data],axis=0)

In [38]:
final.shape

(1000, 16)
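
Row-wise concatenation keeps each frame's original index labels, so the combined index can contain repeated labels. A minimal sketch, on toy data, of checking that the keys do not overlap before stacking and of rebuilding a clean index afterwards:

```python
import pandas as pd

# Toy frames with the same columns (illustrative values only)
a = pd.DataFrame({"instant": [3, 4], "cnt": [8, 3]})
b = pd.DataFrame({"instant": [1, 2], "cnt": [16, 40]})

# Keys shared by both frames would become duplicate records after stacking
overlap = set(a["instant"]) & set(b["instant"])

# Stack the rows, order by the key, and rebuild a clean 0..n-1 index
stacked = (
    pd.concat([a, b], axis=0)
    .sort_values(by="instant")
    .reset_index(drop=True)
)

print(overlap)
print(stacked)
```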

In [39]:
final.isna().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

In [40]:
final.duplicated().sum()

0

In [41]:
final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 381 to 609
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     1000 non-null   int64  
 1   dteday      1000 non-null   object 
 2   season      1000 non-null   int64  
 3   yr          1000 non-null   int64  
 4   mnth        1000 non-null   int64  
 5   hr          1000 non-null   int64  
 6   holiday     1000 non-null   bool   
 7   weekday     1000 non-null   int64  
 8   weathersit  1000 non-null   int64  
 9   temp        1000 non-null   float64
 10  atemp       1000 non-null   float64
 11  hum         1000 non-null   float64
 12  windspeed   1000 non-null   float64
 13  casual      1000 non-null   int64  
 14  registered  1000 non-null   int64  
 15  cnt         1000 non-null   int64  
dtypes: bool(1), float64(4), int64(10), object(1)
memory usage: 126.0+ KB


In [42]:
final.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
381,611,28-01-2011,1,0,1,16,False,5,1,0.22,0.2727,0.8,0.0,10,70,80
382,612,28-01-2011,1,0,1,17,False,5,1,0.24,0.2424,0.75,0.1343,2,147,149
383,613,28-01-2011,1,0,1,18,False,5,1,0.24,0.2273,0.75,0.194,2,107,109
384,614,28-01-2011,1,0,1,19,False,5,2,0.24,0.2424,0.75,0.1343,5,84,89
385,615,28-01-2011,1,0,1,20,False,5,2,0.24,0.2273,0.7,0.194,1,61,62


In [43]:
final.tail()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
605,606,28-01-2011,1,0,1,11,False,5,3,0.18,0.2121,0.93,0.1045,0,30,30
606,607,28-01-2011,1,0,1,12,False,5,3,0.18,0.2121,0.93,0.1045,1,28,29
607,608,28-01-2011,1,0,1,13,False,5,3,0.18,0.2121,0.93,0.1045,0,31,31
608,609,28-01-2011,1,0,1,14,False,5,3,0.22,0.2727,0.8,0.0,2,36,38
609,610,28-01-2011,1,0,1,15,False,5,2,0.2,0.2576,0.86,0.0,1,40,41


In [44]:
final=final.sort_values(by='instant')

In [45]:
final.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,01-01-2011,1,0,1,0,False,6,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,01-01-2011,1,0,1,1,False,6,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,01-01-2011,1,0,1,2,False,6,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,01-01-2011,1,0,1,3,False,6,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,01-01-2011,1,0,1,4,False,6,1,0.24,0.2879,0.75,0.0,0,1,1


In [46]:
final.tail()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
376,996,14-02-2011,1,0,2,3,False,1,1,0.34,0.3182,0.46,0.2239,1,1,2
377,997,14-02-2011,1,0,2,4,False,1,1,0.32,0.303,0.53,0.2836,0,2,2
378,998,14-02-2011,1,0,2,5,False,1,1,0.32,0.303,0.53,0.2836,0,3,3
379,999,14-02-2011,1,0,2,6,False,1,1,0.34,0.303,0.46,0.2985,1,25,26
380,1000,14-02-2011,1,0,2,7,False,1,1,0.34,0.303,0.46,0.2985,2,96,98


In [47]:
### Encoding: converting categorical data to its numeric counterpart

In [48]:
final['holiday'].unique()

array([False,  True])

In [49]:
final=pd.get_dummies(data=final,prefix='dummy',columns=['holiday'],dtype=int)
final.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,weekday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt,dummy_False,dummy_True
0,1,01-01-2011,1,0,1,0,6,1,0.24,0.2879,0.81,0.0,3,13,16,1,0
1,2,01-01-2011,1,0,1,1,6,1,0.22,0.2727,0.8,0.0,8,32,40,1,0
2,3,01-01-2011,1,0,1,2,6,1,0.22,0.2727,0.8,0.0,5,27,32,1,0
3,4,01-01-2011,1,0,1,3,6,1,0.24,0.2879,0.75,0.0,3,10,13,1,0
4,5,01-01-2011,1,0,1,4,6,1,0.24,0.2879,0.75,0.0,0,1,1,1,0


In [50]:
final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 0 to 380
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   instant      1000 non-null   int64  
 1   dteday       1000 non-null   object 
 2   season       1000 non-null   int64  
 3   yr           1000 non-null   int64  
 4   mnth         1000 non-null   int64  
 5   hr           1000 non-null   int64  
 6   weekday      1000 non-null   int64  
 7   weathersit   1000 non-null   int64  
 8   temp         1000 non-null   float64
 9   atemp        1000 non-null   float64
 10  hum          1000 non-null   float64
 11  windspeed    1000 non-null   float64
 12  casual       1000 non-null   int64  
 13  registered   1000 non-null   int64  
 14  cnt          1000 non-null   int64  
 15  dummy_False  1000 non-null   int32  
 16  dummy_True   1000 non-null   int32  
dtypes: float64(4), int32(2), int64(10), object(1)
memory usage: 132.8+ KB
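
For a two-valued column such as `holiday`, the two dummy columns carry the same information (one is always the complement of the other). The sketch below, on toy data, shows two common alternatives that keep a single column: casting the boolean to an integer, or passing `drop_first=True`. The prefix `holiday` is an illustrative choice, not the one used in the cell above.

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({"holiday": [False, True, False]})

# Full one-hot encoding: one column per category
one_hot = pd.get_dummies(toy, columns=["holiday"], prefix="holiday", dtype=int)

# Alternative 1: a boolean is already binary, so cast it directly
single = toy["holiday"].astype(int)

# Alternative 2: drop the first category to avoid the redundant column
reduced = pd.get_dummies(toy, columns=["holiday"], prefix="holiday",
                         drop_first=True, dtype=int)

print(one_hot)
print(reduced)
```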


In [51]:
### To convert date to the right datetime format:
final['dteday']=pd.to_datetime(final['dteday'],dayfirst=True)
final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 0 to 380
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   instant      1000 non-null   int64         
 1   dteday       1000 non-null   datetime64[ns]
 2   season       1000 non-null   int64         
 3   yr           1000 non-null   int64         
 4   mnth         1000 non-null   int64         
 5   hr           1000 non-null   int64         
 6   weekday      1000 non-null   int64         
 7   weathersit   1000 non-null   int64         
 8   temp         1000 non-null   float64       
 9   atemp        1000 non-null   float64       
 10  hum          1000 non-null   float64       
 11  windspeed    1000 non-null   float64       
 12  casual       1000 non-null   int64         
 13  registered   1000 non-null   int64         
 14  cnt          1000 non-null   int64         
 15  dummy_False  1000 non-null   int32         
 16  dummy_True   1000 non-null   int32         
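
The dates in this data are written day first (for example `28-01-2011`). As a hedged alternative to `dayfirst=True`, the format can be stated explicitly, so a value in any other layout raises an error instead of being guessed. The sketch uses a few sample strings taken from the outputs above.

```python
import pandas as pd

# Sample date strings in DD-MM-YYYY layout
raw = pd.Series(["01-01-2011", "28-01-2011", "14-02-2011"])

# An explicit format removes any ambiguity between day and month
parsed = pd.to_datetime(raw, format="%d-%m-%Y")

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```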

In [52]:
### Rename variables such as dteday, yr, mnth, hr, weathersit, hum and cnt to Date, year, Month, hour, Weather, Humidity and Count

In [53]:
final.rename({'yr': 'year', 'hr' : 'hour','dteday':'Date','mnth':'Month','hum':'Humidity','cnt':'Count','weathersit':'Weather'},axis=1,inplace=True)

In [54]:
final.head(2)

Unnamed: 0,instant,Date,season,year,Month,hour,weekday,Weather,temp,atemp,Humidity,windspeed,casual,registered,Count,dummy_False,dummy_True
0,1,2011-01-01,1,0,1,0,6,1,0.24,0.2879,0.81,0.0,3,13,16,1,0
1,2,2011-01-01,1,0,1,1,6,1,0.22,0.2727,0.8,0.0,8,32,40,1,0
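
By default `rename` ignores keys that do not match any column, so a typo in the mapping goes unnoticed. A minimal sketch, on a toy frame, of using `errors="raise"` to surface such mistakes:

```python
import pandas as pd

# Toy frame with a few of the original column names (illustrative values only)
toy = pd.DataFrame({"yr": [0], "hr": [1], "cnt": [16]})

mapping = {"yr": "year", "hr": "hour", "cnt": "Count"}

# errors="raise" makes rename fail if a key in the mapping is not a column
renamed = toy.rename(columns=mapping, errors="raise")

# A misspelled source name ("count" instead of "cnt") is now reported
try:
    toy.rename(columns={"count": "Count"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(list(renamed.columns), typo_caught)
```

Returning a new frame rather than using `inplace=True` also leaves the original untouched, which makes the cell safe to re-run.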


In [55]:
### Verify that the total demand (Count) equals the sum of registered and casual demand

In [56]:
###compare total
final['Total']=final['registered']+final['casual']
final['Count'].equals(final['Total'])

True

In [57]:
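## Checking the above without creating a new variable.
## Absolute differences are summed so that positive and negative mismatches cannot cancel out.
(final['registered']+final['casual']-final['Count']).abs().sum()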

0
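
A signed sum of differences can be zero even when individual rows disagree, because errors of opposite sign cancel. The sketch below uses deliberately wrong made-up numbers to show this, and counts mismatching rows as a stricter check.

```python
import pandas as pd

# Toy frame where both rows are wrong, by -1 and +1 (illustrative values only)
toy = pd.DataFrame({
    "casual": [3, 8],
    "registered": [13, 32],
    "Count": [17, 39],
})

diff = toy["casual"] + toy["registered"] - toy["Count"]

# The signed sum hides the problem because the two errors cancel
signed_sum = int(diff.sum())

# Counting the rows with a non-zero difference exposes it
mismatches = int((diff != 0).sum())

print(signed_sum, mismatches)
```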

In [58]:
final.columns

Index(['instant', 'Date', 'season', 'year', 'Month', 'hour', 'weekday',
       'Weather', 'temp', 'atemp', 'Humidity', 'windspeed', 'casual',
       'registered', 'Count', 'dummy_False', 'dummy_True', 'Total'],
      dtype='object')

In [59]:
### Segregate the numeric and categorical variables

In [60]:
numeric_variables=final.select_dtypes(include='number')
categorical_variables=final.select_dtypes(exclude='number')


In [61]:
numeric_variables.columns

Index(['instant', 'season', 'year', 'Month', 'hour', 'weekday', 'Weather',
       'temp', 'atemp', 'Humidity', 'windspeed', 'casual', 'registered',
       'Count', 'dummy_False', 'dummy_True', 'Total'],
      dtype='object')

In [62]:
categorical_variables.columns

Index(['Date'], dtype='object')
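
Selecting by dtype treats integer-coded categories such as `season` and `Weather` as numeric, because they are stored as `int64`. One possible way to handle this, sketched on toy data, is to cast such columns to the `category` dtype before splitting. Which columns count as categorical is an assumption made here for illustration.

```python
import pandas as pd

# Toy frame with integer-coded categories and one true numeric column
toy = pd.DataFrame({
    "season": [1, 1, 2],
    "Weather": [1, 3, 2],
    "temp": [0.24, 0.22, 0.30],
})

# Assumed choice: treat season and Weather as categorical codes
toy = toy.astype({"season": "category", "Weather": "category"})

numeric = toy.select_dtypes(include="number")
categorical = toy.select_dtypes(include="category")

print(list(numeric.columns))
print(list(categorical.columns))
```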