#### Data wrangling is the process of converting and formatting data from its raw form into a usable format so that it can be analysed.
#### Without proper data wrangling, we cannot perform reliable analysis or model building. Good data wrangling helps improve the accuracy of the model/algorithm.

## Problem Statement:
### Lyft, Inc. is a transportation network company based in San Francisco, California, operating in 640 cities in the United States and 9 cities in Canada. It develops, markets, and operates the Lyft mobile app, offering car rides, scooters, and a bicycle-sharing system. It is the second largest rideshare company in the world, second only to Uber.
### Lyft’s bike-sharing service is also among the largest in the USA. Being able to anticipate demand is extremely important for planning the bicycles, stations, and personnel required to maintain them. This demand is sensitive to many factors, such as season, humidity, rain, weekdays, holidays, and more. To enable this planning, Lyft needs to predict demand accurately from these factors.

## Attribute Information:

#### date - date of the ride
#### season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
#### holiday - whether the day is considered a holiday
#### workingday - whether the day is neither a weekend nor holiday
#### weather - 1: Clear, Few clouds, Partly cloudy
      2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
      3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
      4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
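#### temp - temperature in Celsius (the values in the data below, e.g. 0.24, appear to be scaled to the 0-1 range)
#### atemp - "feels like" temperature in Celsius (also appears to be scaled, e.g. 0.2879)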
#### humidity - relative humidity
#### windspeed - wind speed
#### casual - number of non-registered user rentals initiated
#### registered - number of registered user rentals initiated
#### count - number of total rentals

### You are a data scientist and have been assigned the task of performing data wrangling on a set of datasets. These datasets may contain ambiguities. You have to identify these ambiguities and apply different data wrangling techniques to get the datasets ready for use.

In [5]:
# Import the packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
data1=pd.read_csv('dataset1.csv')
data1.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp
0,1,01-01-2011,1,0,1,0,False,6,1,0.24
1,2,01-01-2011,1,0,1,1,False,6,1,0.22
2,3,01-01-2011,1,0,1,2,False,6,1,0.22
3,4,01-01-2011,1,0,1,3,False,6,1,0.24
4,5,01-01-2011,1,0,1,4,False,6,1,0.24


In [9]:
data1.tail()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,weathersit,temp
605,606,28-01-2011,1,0,1,11,False,5,3,0.18
606,607,28-01-2011,1,0,1,12,False,5,3,0.18
607,608,28-01-2011,1,0,1,13,False,5,3,0.18
608,609,28-01-2011,1,0,1,14,False,5,3,0.22
609,610,28-01-2011,1,0,1,15,False,5,2,0.2


In [11]:
data1.shape

(610, 10)

In [13]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 610 entries, 0 to 609
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     610 non-null    int64  
 1   dteday      610 non-null    object 
 2   season      610 non-null    int64  
 3   yr          610 non-null    int64  
 4   mnth        610 non-null    int64  
 5   hr          610 non-null    int64  
 6   holiday     610 non-null    bool   
 7   weekday     610 non-null    int64  
 8   weathersit  610 non-null    int64  
 9   temp        610 non-null    float64
dtypes: bool(1), float64(1), int64(7), object(1)
memory usage: 43.6+ KB
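
The `info()` output above lists `dteday` with dtype `object`, meaning the dates are stored as plain strings. A minimal sketch of converting them to real datetimes is shown below. It assumes the day-first `DD-MM-YYYY` layout seen in the `head()` and `tail()` outputs (for example `28-01-2011`); the `sample` frame and the `timestamp` column name are illustrative only and are not part of the original notebook.

```python
import pandas as pd

# Toy frame mimicking the day-first date strings and hour column of dataset1
sample = pd.DataFrame({"dteday": ["01-01-2011", "28-01-2011"], "hr": [0, 15]})

# An explicit format avoids day/month ambiguity for strings such as 01-02-2011
sample["dteday"] = pd.to_datetime(sample["dteday"], format="%d-%m-%Y")

# Optional: combine the date and the hour into a single timestamp
# ("timestamp" is a hypothetical column name)
sample["timestamp"] = sample["dteday"] + pd.to_timedelta(sample["hr"], unit="h")

print(sample.dtypes)
```

With a datetime column in place, attributes such as `sample["dteday"].dt.month` become available for cross-checking against the existing `mnth` and `weekday` columns.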


In [15]:
data1.isna().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
weathersit    0
temp          0
dtype: int64

In [19]:
data1.duplicated().sum()

0

In [21]:
data2=pd.read_excel('dataset2.xlsx')
data2.head()

Unnamed: 0.1,Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
0,0,1,0.2879,0.81,0.0,3,13,16
1,1,2,0.2727,0.8,0.0,8,32,40
2,2,3,0.2727,0.8,0.0,5,27,32
3,3,4,0.2879,0.75,0.0,3,10,13
4,4,5,0.2879,0.75,0.0,0,1,1


In [23]:
data2.tail()

Unnamed: 0.1,Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
605,605,606,0.2121,0.93,0.1045,0,30,30
606,606,607,0.2121,0.93,0.1045,1,28,29
607,607,608,0.2121,0.93,0.1045,0,31,31
608,608,609,0.2727,0.8,0.0,2,36,38
609,609,610,0.2576,0.86,0.0,1,40,41


In [27]:
data2=data2.drop(['Unnamed: 0'],axis=1)
data2.head()

Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
0,1,0.2879,0.81,0.0,3,13,16
1,2,0.2727,0.8,0.0,8,32,40
2,3,0.2727,0.8,0.0,5,27,32
3,4,0.2879,0.75,0.0,3,10,13
4,5,0.2879,0.75,0.0,0,1,1


In [29]:
data2.tail()

Unnamed: 0,instant,atemp,hum,windspeed,casual,registered,cnt
605,606,0.2121,0.93,0.1045,0,30,30
606,607,0.2121,0.93,0.1045,1,28,29
607,608,0.2121,0.93,0.1045,0,31,31
608,609,0.2727,0.8,0.0,2,36,38
609,610,0.2576,0.86,0.0,1,40,41
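
The leftover `Unnamed: 0` column dropped above is typically an index that was written out when the file was saved. As a sketch of a slightly more general approach, the snippet below removes every column whose name starts with `Unnamed`, so it does not depend on the exact column name. The `raw` frame is a made-up stand-in for `data2`, not the real file.

```python
import pandas as pd

# Toy stand-in for data2 with a leftover index column
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "instant": [1, 2, 3],
    "cnt": [16, 40, 32],
})

# Keep every column whose name does not start with "Unnamed"
cleaned = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]

print(cleaned.columns.tolist())
```

Filtering by name prefix is a judgment call: it is convenient when several files share the same artifact, but it would also remove a genuinely meaningful column that happened to be called `Unnamed...`.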


In [31]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 610 entries, 0 to 609
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     610 non-null    int64  
 1   atemp       599 non-null    float64
 2   hum         610 non-null    float64
 3   windspeed   610 non-null    float64
 4   casual      610 non-null    int64  
 5   registered  610 non-null    int64  
 6   cnt         610 non-null    int64  
dtypes: float64(3), int64(4)
memory usage: 33.5 KB


In [33]:
data2.isna().sum()

instant        0
atemp         11
hum            0
windspeed      0
casual         0
registered     0
cnt            0
dtype: int64

In [35]:
# Percentage of missing values per column
mvp = (data2.isna().sum() / data2.shape[0]) * 100
mvp

instant       0.000000
atemp         1.803279
hum           0.000000
windspeed     0.000000
casual        0.000000
registered    0.000000
cnt           0.000000
dtype: float64
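
The missing-value percentage above divides the count of missing entries by the number of rows. An equivalent shortcut is to take the mean of the boolean mask, since `True` counts as 1 and `False` as 0. The sketch below checks both forms on a small invented frame (`toy`), not on the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value out of four rows in "atemp"
toy = pd.DataFrame({
    "atemp": [0.2879, np.nan, 0.2727, 0.2879],
    "hum": [0.81, 0.80, 0.80, 0.75],
})

# Count-based form, as used in the notebook
pct_count = toy.isna().sum() / toy.shape[0] * 100

# Mean-based form: the mean of a boolean mask is the fraction of True values
pct_mean = toy.isna().mean() * 100

print(pct_mean)
```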

In [37]:
# Alternative (not used here): drop the rows with missing values instead of imputing them
# data2.dropna(inplace=True)

In [39]:
data2['atemp']=data2['atemp'].fillna(data2['atemp'].mean())
data2.isna().sum()

instant       0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64
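
Mean imputation, as used above, fills every gap with the same value. Because these records are hourly and ordered by `instant`, one alternative worth comparing is linear interpolation between neighbouring rows. The sketch below contrasts the two on an invented series; whether interpolation suits the real `atemp` column is an assumption that would need checking against the data.

```python
import numpy as np
import pandas as pd

# Invented, time-ordered readings with one gap
s = pd.Series([0.20, 0.22, np.nan, 0.30, 0.40], name="atemp")

# Mean imputation: the gap receives the average of all observed values
mean_filled = s.fillna(s.mean())

# Linear interpolation: the gap receives the midpoint of its two neighbours
interpolated = s.interpolate(method="linear")

print(pd.DataFrame({"mean_filled": mean_filled, "interpolated": interpolated}))
```

With only about 1.8% of `atemp` missing, either choice is likely to have a small effect, but interpolation keeps the filled values closer to the local trend.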

In [41]:
data2.duplicated().sum()

0

In [43]:
# A primary key is a column (or set of columns) that uniquely identifies each record in the data
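
Both datasets carry an `instant` column, which looks like a natural primary key. A minimal sketch of verifying that and using it to combine two tables follows. The frames `left` and `right` are toy stand-ins for `data1` and `data2`, and treating `instant` as the join key is an assumption based on the outputs shown above.

```python
import pandas as pd

# Toy stand-ins for the two datasets, sharing the candidate key "instant"
left = pd.DataFrame({"instant": [1, 2, 3], "temp": [0.24, 0.22, 0.22]})
right = pd.DataFrame({"instant": [1, 2, 3], "cnt": [16, 40, 32]})

# A primary key must be unique and free of missing values
for name, frame in {"left": left, "right": right}.items():
    key = frame["instant"]
    print(name, "unique:", key.is_unique, "missing:", key.isna().sum())

# validate="one_to_one" makes pandas raise a MergeError if the key repeats on either side
merged = left.merge(right, on="instant", how="inner", validate="one_to_one")

print(merged.shape)
```

Passing `validate` is a cheap safeguard: a duplicated key would otherwise silently multiply rows in the merged result.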