# Practical 1
#### [ ID: 17CE016 ]
---
#### Aim: Data Preprocessing using Pandas (Handling Missing Values, Data Wrangling, Dimension Reduction)

#### Theory:
##### What is Data Preprocessing?
Data preprocessing converts raw data into a clean data set. Data gathered from different sources arrives in a raw format that is not suitable for analysis.

A series of steps is therefore applied to turn the raw data into a smaller, clean data set before any iterative analysis is run. Together, these steps are known as data preprocessing.<br/> 
It includes:

- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction

##### What Is Data Wrangling?
Data wrangling is performed while building an interactive model. It converts raw data into a format that is convenient to consume.

The technique is also known as data munging. It typically follows these steps: extract the data from the different sources, sort it using a suitable algorithm, decompose it into a different structured format, and finally store it in another database.

---

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Working with Missing Data in Pandas
- Missing data occurs when no information is provided for one or more items, or for a whole unit.
- It is a very common problem in real-life data sets.
- In pandas, missing values are also referred to as NA (Not Available) values.
- Many data sets arrive with missing data, either because the value exists but was not collected or because it never existed.
- For example, in a survey some users may choose not to share their income and others may not share their address, so those fields end up missing.


#### In pandas, missing data is represented by two values:
- None: a Python singleton object that is often used for missing data in Python code.
- NaN: short for Not a Number, a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
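
The difference between the two markers can be seen in a minimal sketch. The Series names `numeric` and `labels` are made up for illustration and are not part of the practical's data set.

```python
import numpy as np
import pandas as pd

# None placed in a numeric Series is converted to NaN, so the dtype is float64.
numeric = pd.Series([1, None, 3])
print(numeric.dtype)
print(numeric.isnull().sum())

# In a Series of strings, None is still reported as missing.
labels = pd.Series(["Rain", None, "Sunny"])
print(labels.isnull().tolist())

# NaN never compares equal to itself, so test for it with pd.isna()/isnull(),
# not with ==.
print(np.nan == np.nan)
print(pd.isna(np.nan))
```

This is why the checks in the rest of the practical rely on `isnull()` rather than on equality comparisons.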

In [15]:
data=pd.read_csv("dataset/weather_data.csv")

In [16]:
data

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,,7.0,Sunny
2,2017-01-05,28.0,,Snow
3,2017-01-06,,7.0,
4,2017-01-07,32.0,,Rain
5,2017-01-08,31.0,2.0,Sunny
6,2017-01-09,,,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


---
### isnull() & notnull()
Used to check whether a data set contains null values. `isnull()` marks the missing entries as True, and `notnull()` is its inverse.

---

In [17]:
data.isnull().sum()

day            0
temperature    3
windspeed      3
event          2
dtype: int64
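
As a small illustration of the two methods side by side, the sketch below counts missing values with `isnull()` and uses `notnull()` as a row filter. The frame `toy` and its values are hypothetical stand-ins for the weather data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one gap in each column.
toy = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0],
    "event": ["Rain", "Sunny", None],
})

# isnull() returns a Boolean frame; summing it counts the gaps per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# notnull() is the inverse mask and is handy for keeping only observed rows.
rows_with_temperature = toy[toy["temperature"].notnull()]
print(rows_with_temperature)
```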

---
If more than 50% of the values in a column are null, you can consider removing that column, because it carries little useful information.
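
One possible way to apply this rule is sketched below. The frame `toy`, its column names, and the variable `threshold` are assumptions made for the example; the 50% cut-off is the rule of thumb stated above, not a pandas default.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "mostly_missing" is 75% null, "reading" is 25% null.
toy = pd.DataFrame({
    "reading": [1.0, 2.0, np.nan, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 7.0],
})

# The mean of a Boolean mask is the fraction of missing values per column.
null_fraction = toy.isnull().mean()

# Keep only the columns whose missing fraction does not exceed the threshold.
threshold = 0.5
trimmed = toy.loc[:, null_fraction <= threshold]
print(trimmed.columns.tolist())
```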

---

In [18]:
data.describe()

Unnamed: 0,temperature,windspeed
count,6.0,6.0
mean,32.833333,7.0
std,4.020779,3.224903
min,28.0,2.0
25%,31.25,6.25
50%,32.0,7.0
75%,33.5,7.75
max,40.0,12.0


In [19]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          9 non-null      object 
 1   temperature  6 non-null      float64
 2   windspeed    6 non-null      float64
 3   event        7 non-null      object 
dtypes: float64(2), object(2)
memory usage: 416.0+ bytes


---
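### data1=data
Here `data1` is only another name for the same object, so any in-place change made to `data` also appears in `data1`.

### data1=data.copy() 
Here `data1` is an independent copy, so later changes to `data` do not affect it.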
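
The distinction can be checked with a short sketch. The names `original`, `alias`, and `snapshot` are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({"temperature": [32.0, 28.0]})

alias = original            # a second name for the same object
snapshot = original.copy()  # an independent copy (deep by default)

# Change one value of the original in place.
original.loc[0, "temperature"] = 0.0

print(alias is original)               # the alias is the very same object
print(alias.loc[0, "temperature"])     # so it shows the changed value
print(snapshot.loc[0, "temperature"])  # the copy still holds the old value
```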

---

In [20]:
data1=data.copy()

---
### Use of fillna()
Replaces null values with a value you specify.

---

In [21]:
data.fillna(0)

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,0.0,7.0,Sunny
2,2017-01-05,28.0,0.0,Snow
3,2017-01-06,0.0,7.0,0
4,2017-01-07,32.0,0.0,Rain
5,2017-01-08,31.0,2.0,Sunny
6,2017-01-09,0.0,0.0,0
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny
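
Filling every column with 0 also writes 0 into the text column `event`, as the output above shows. A possible alternative, sketched here on a hypothetical frame, is to pass a dict so that each column gets its own fill value. The placeholder `"Unknown"` is an assumption chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one gap per column.
toy = pd.DataFrame({
    "temperature": [32.0, np.nan],
    "event": ["Rain", None],
})

# A dict maps each column name to the value used for its missing entries.
filled = toy.fillna({"temperature": 0.0, "event": "Unknown"})
print(filled)
```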


In [22]:
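data["temperature"] = data["temperature"].ffill()  # carry the last valid value forward
data["windspeed"] = data["windspeed"].bfill()  # pull the next valid value backward
print(data)
data = data1.copy()  # restore the original data with its missing values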

          day  temperature  windspeed   event
0  2017-01-01         32.0        6.0    Rain
1  2017-01-04         32.0        7.0   Sunny
2  2017-01-05         28.0        7.0    Snow
3  2017-01-06         28.0        7.0     NaN
4  2017-01-07         32.0        2.0    Rain
5  2017-01-08         31.0        2.0   Sunny
6  2017-01-09         31.0        8.0     NaN
7  2017-01-10         34.0        8.0  Cloudy
8  2017-01-11         40.0       12.0   Sunny


In [23]:
data.event.value_counts()

Sunny     3
Rain      2
Cloudy    1
Snow      1
Name: event, dtype: int64

### Fill null value with mean(), median(), mode()

In [24]:

data["event"] = data["event"].fillna(data["event"].mode()[0])  # most frequent category
data["temperature"] = data["temperature"].fillna(data["temperature"].median())
data["windspeed"] = data["windspeed"].fillna(data["windspeed"].median())
print(data)

          day  temperature  windspeed   event
0  2017-01-01         32.0        6.0    Rain
1  2017-01-04         32.0        7.0   Sunny
2  2017-01-05         28.0        7.0    Snow
3  2017-01-06         32.0        7.0   Sunny
4  2017-01-07         32.0        7.0    Rain
5  2017-01-08         31.0        2.0   Sunny
6  2017-01-09         32.0        7.0   Sunny
7  2017-01-10         34.0        8.0  Cloudy
8  2017-01-11         40.0       12.0   Sunny
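
The cell above uses the median and the mode. To show how the mean would differ, here is a small sketch on a standalone Series that holds the six observed temperatures from the weather data plus one gap. The name `temps` is made up for the example.

```python
import numpy as np
import pandas as pd

# The six observed temperatures from the weather data, followed by one gap.
temps = pd.Series([32.0, 28.0, 32.0, 31.0, 34.0, 40.0, np.nan])

# mean() and median() both skip NaN when computing the statistic.
filled_mean = temps.fillna(temps.mean())
filled_median = temps.fillna(temps.median())

print(round(filled_mean.iloc[-1], 2))  # mean of the observed values
print(filled_median.iloc[-1])          # median of the observed values
```

The single high reading of 40 pulls the mean above the median, which is one reason the median is often preferred when a column contains outliers.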


---
### interpolate()
Instead of hard-coding a replacement, `interpolate()` estimates each missing value from its neighbouring values.      
Several methods are available, such as linear, nearest, and quadratic. The nearest and quadratic methods rely on SciPy being installed.

---

In [25]:
data1.interpolate(method='linear',limit_direction='forward')

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,30.0,7.0,Sunny
2,2017-01-05,28.0,7.0,Snow
3,2017-01-06,30.0,7.0,
4,2017-01-07,32.0,4.5,Rain
5,2017-01-08,31.0,2.0,Sunny
6,2017-01-09,32.5,5.0,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


In [26]:
data1.interpolate(method='nearest',limit_direction='forward')

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,32.0,7.0,Sunny
2,2017-01-05,28.0,7.0,Snow
3,2017-01-06,28.0,7.0,
4,2017-01-07,32.0,7.0,Rain
5,2017-01-08,31.0,2.0,Sunny
6,2017-01-09,31.0,2.0,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


### Use of dropna()

In [27]:
data1.dropna()

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
5,2017-01-08,31.0,2.0,Sunny
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


---
### dropna(how = 'all') 
  Drops only the rows in which every value is null. In this data set no row qualifies, because `day` is never null.
### dropna(how='any') 
  Drops the rows that contain at least one null value
  
---

In [28]:
data1.dropna(how = 'all')

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
1,2017-01-04,,7.0,Sunny
2,2017-01-05,28.0,,Snow
3,2017-01-06,,7.0,
4,2017-01-07,32.0,,Rain
5,2017-01-08,31.0,2.0,Sunny
6,2017-01-09,,,
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny


In [29]:
data1.dropna(how = 'any')

Unnamed: 0,day,temperature,windspeed,event
0,2017-01-01,32.0,6.0,Rain
5,2017-01-08,31.0,2.0,Sunny
7,2017-01-10,34.0,8.0,Cloudy
8,2017-01-11,40.0,12.0,Sunny
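
Because `day` is always present, `how='all'` removes nothing from this data set. A possible refinement, sketched on a hypothetical frame, is to restrict the check to the measurement columns with `subset`, or to require a minimum number of non-null values with `thresh`. The names `toy` and `measurements` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 has no measurements at all, row 0 has one.
toy = pd.DataFrame({
    "day": ["2017-01-06", "2017-01-09", "2017-01-10"],
    "temperature": [np.nan, np.nan, 34.0],
    "windspeed": [7.0, np.nan, 8.0],
})

measurements = ["temperature", "windspeed"]

# Drop a row only when every measurement column is null.
all_missing_dropped = toy.dropna(how="all", subset=measurements)
print(all_missing_dropped.index.tolist())

# Keep only the rows that have at least 3 non-null values.
complete_rows = toy.dropna(thresh=3)
print(complete_rows.index.tolist())
```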


In [30]:
df=pd.read_csv("dataset/weather_data_2.csv")

In [31]:
df

Unnamed: 0,day,temperature,windspeed,event
0,01-01-2017,32,6,Rain
1,01-02-2017,-99999,7,Sunny
2,01-03-2017,28,-99999,Snow
3,01-04-2017,-99999,7,No Event
4,01-05-2017,32,-99999,Rain
5,01-06-2017,31,2,Sunny
6,01-06-2017,34,5,No Event


---
### Use of replace()
Syntax: `series.replace(to_replace, value)`

---

Sometimes the data contains placeholder values that stand in for missing data, such as -99999 in the numeric columns and 'No Event' in the categorical column. <br>
These have to be replaced with null values so that pandas treats them as missing.

---

In [32]:
df["temperature"] = df["temperature"].replace(-99999, np.nan)
df["windspeed"] = df["windspeed"].replace(-99999, np.nan)
df["event"] = df["event"].replace("No Event", np.nan)
print(df)

          day  temperature  windspeed  event
0  01-01-2017         32.0        6.0   Rain
1  01-02-2017          NaN        7.0  Sunny
2  01-03-2017         28.0        NaN   Snow
3  01-04-2017          NaN        7.0    NaN
4  01-05-2017         32.0        NaN   Rain
5  01-06-2017         31.0        2.0  Sunny
6  01-06-2017         34.0        5.0    NaN
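
The same clean-up can be written in one call, or done while the file is being read. The sketch below assumes a small in-memory CSV (`csv_text`, a made-up excerpt in the same layout) so that it runs without the data set file.

```python
import io

import numpy as np
import pandas as pd

# Made-up excerpt in the same layout as weather_data_2.csv.
csv_text = """day,temperature,windspeed,event
01-01-2017,32,6,Rain
01-02-2017,-99999,7,Sunny
01-04-2017,-99999,7,No Event
"""

# Option 1: one replace() call with a per-column mapping of placeholders.
raw = pd.read_csv(io.StringIO(csv_text))
cleaned = raw.replace({
    "temperature": {-99999: np.nan},
    "windspeed": {-99999: np.nan},
    "event": {"No Event": np.nan},
})
print(cleaned)

# Option 2: declare the placeholders as missing at read time with na_values.
at_read = pd.read_csv(
    io.StringIO(csv_text),
    na_values={
        "temperature": [-99999],
        "windspeed": [-99999],
        "event": ["No Event"],
    },
)
print(at_read.isnull().sum())
```

Declaring the placeholders per column keeps a legitimate value in one column from being wiped out because it matches a placeholder used in another.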
