**Roll No.:**

**Name:**

**Class:**

**Batch:**

**Date of Experiment:**

<h1> Exp 6: Apply data pre-processing tasks: to normalize the data, to group the data
using data binning and to turn categorical values into numeric values for a given dataset. Demonstrate the same using Pandas and NumPy in Python.</h1>

<h2> Data Pre-processing </h2>

> - The process of converting or mapping data from the initial "raw" form into another format, to make it ready for further analysis.
> - It is also known as Data Cleaning and Data Wrangling.

<h2> Objectives: </h2>


> 1. Normalize the Data (centering/scaling)
> 2. Data Binning
> 3. Turn Categorical values into Numeric values

<h2> Important Shortcut Keys </h2>

> - A -> **Insert** a cell **above**
> - B -> **Insert** a cell **below**
> - D D -> **Delete** the cell
> - M -> Convert the cell to **Markdown**
> - Y -> Convert the cell to **code**
> - Z -> **Undo** the cell deletion

<h2> 1. Reading the dataset modified by Exp-5 (csv output of Exp-5) </h2>

<h3> 1.1 Import Libraries </h3>

In [None]:
# Import the libraries pandas, NumPy and matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

<h3> 1.2 Import Data </h3> 

First, we assign the file path of the dataset modified in Exp-5 to the variable "filename".

Copy the file path of the CSV file and paste it inside the single quotes.

In [None]:
filename = 'expt5output'

In [None]:
df = pd.read_csv(filename)

Check the column names, then use the method <b>head()</b> to display the first five rows of the dataframe.

In [None]:
df.columns.values

In [None]:
# To see what the data set looks like, we'll use the head() method.
df.head()

<h2> 2. Data Normalization in Python </h2>

<p>Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
</p>

<center>
    <img src = "fig5aa.png">
</center>

<b>Example</b>

<p>To demonstrate normalization, let's say we want to scale the columns "length", "width" and "height".</p>
<p><b>Target:</b> We would like to normalize those variables so their values range from 0 to 1.</p>
<p><b>Approach:</b> Replace each original value with (original value)/(maximum value).</p>

<b> Few Methods of normalizing data </b> 

1. **Simple feature scaling:** $x_{new} = \frac{x_{old}}{x_{max}}$

2. **Min-Max:** $x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}}$

3. **Z-score:** $x_{new} = \frac{x_{old} - \mu}{\sigma}$ where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.
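
The three formulas above can be sketched on a small, made-up Series. The values and the names `x`, `simple`, `min_max` and `z_score` are illustrative assumptions and are not taken from the Exp-5 dataset. Note that pandas' `std()` uses the sample standard deviation by default; the text does not say whether the sample or population $\sigma$ is intended.

```python
import pandas as pd

# Hypothetical toy values, not taken from the Exp-5 dataset
x = pd.Series([150.0, 160.0, 170.0, 180.0, 200.0], name="length")

# 1. Simple feature scaling: divide by the maximum
simple = x / x.max()

# 2. Min-max: shift by the minimum, then divide by the range
min_max = (x - x.min()) / (x.max() - x.min())

# 3. Z-score: subtract the mean, then divide by the standard deviation
#    (pandas' std() defaults to the sample standard deviation, ddof=1)
z_score = (x - x.mean()) / x.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```

Simple feature scaling and min-max both give a largest value of 1, but only min-max is guaranteed to map the smallest value to 0.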

<h3> 2.1 Simple feature scaling </h3>

In [None]:
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question #1: </h1>

<b>According to the example above, normalize the column "height".</b>

</div>


<details><summary>Click here for the solution</summary>

```python
df['height'] = df['height']/df['height'].max() 

# show the scaled columns
df[["length","width","height"]].head()


```

</details>


Here we can see we've normalized "length", "width" and "height" in the range of \[0,1].


<h2> 3. Data Binning</h2>

- **Binning:** Grouping of **values** into **bins** for grouped analysis.
    - Example: we can bin "age" into [0, 5], [6, 10], [11, 15] and so on.


- Converts **numeric** into **categorical** variables.


- Group a **set of numerical values** into  a **set of bins**.
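
As a minimal sketch of the "age" example above, `pd.cut` can assign each value to a labelled bin. The ages, the bin edges and the label strings below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical ages, used only to illustrate binning
ages = pd.Series([2, 5, 7, 10, 13, 15])

# Bin edges: each value falls into (0, 5], (5, 10] or (10, 15];
# include_lowest=True also admits the left-most edge (0) into the first bin
edges = [0, 5, 10, 15]
labels = ["0-5", "6-10", "11-15"]

age_group = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)
print(age_group.tolist())
# ['0-5', '0-5', '6-10', '6-10', '11-15', '11-15']
```

The result is a categorical Series, so a numeric column has been turned into a categorical one.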

<b>Example: </b>

<p>In our dataset, "horsepower" is a real-valued variable ranging from 48 to 288 and it has 59 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and low horsepower (3 types)? Can we rearrange them into three "bins" to simplify analysis? </p>

<p>We will use the pandas method 'cut' to segment the 'horsepower' column into 3 bins.</p>

Convert data to correct format:


In [None]:
df["horsepower"]=df["horsepower"].astype(int, copy=True)

Let's plot the histogram of horsepower to see what the distribution of horsepower looks like.


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# plot the distribution of horsepower
plt.hist(df["horsepower"])

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")

<p>We would like 3 bins of equal width, so we use NumPy's <code>linspace(start_value, end_value, numbers_generated)</code> function.</p>
<p>Since we want to include the minimum value of horsepower, we set start_value = min(df["horsepower"]).</p>
<p>Since we want to include the maximum value of horsepower, we set end_value = max(df["horsepower"]).</p>
<p>Since we are building 3 bins of equal width, there should be 4 dividers, so numbers_generated = 4.</p>


We build the array of bin edges, evenly spaced from the minimum value to the maximum value. These edges determine where one bin ends and the next begins.
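
As a quick check of the arithmetic, the sketch below uses the horsepower range quoted earlier (48 to 288) as literal numbers instead of reading the dataset. The variable names `edges` and `widths` are illustrative.

```python
import numpy as np

# 48 and 288 are the horsepower minimum and maximum quoted in the text
edges = np.linspace(48, 288, 4)   # 4 dividers -> 3 bins
widths = np.diff(edges)           # width of each bin

print(edges)   # [ 48. 128. 208. 288.]
print(widths)  # [80. 80. 80.]
```

Each bin is (288 - 48) / 3 = 80 horsepower wide.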


In [None]:
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
bins

We set the group names:


In [None]:
group_names = ['Low', 'Medium', 'High']

We apply the function "cut" to determine which bin each value of `df['horsepower']` belongs to.


In [None]:
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )
df[['horsepower','horsepower-binned']].head(20)

Let's see the number of vehicles in each bin:


In [None]:
df["horsepower-binned"].value_counts()

Let's plot the distribution of each bin:


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# value_counts() sorts by count, so reindex to line each count up with its label
counts = df["horsepower-binned"].value_counts().reindex(group_names)
plt.bar(group_names, counts)

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")

<p>
    Look carefully at the binned dataframe displayed earlier. Its last column gives the bin for "horsepower" as one of 3 categories ("Low", "Medium" and "High"). 
</p>
<p>
    We successfully reduced 59 unique values to 3 bins!
</p>


<h3>Bins Visualization</h3>
Normally, a histogram is used to visualize the distribution of bins we created above. 


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt


# draw histogram of attribute "horsepower" with bins = 3
plt.hist(df["horsepower"], bins=3)

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")

The plot above shows the binning result for the attribute "horsepower".
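
A histogram with `bins = 3` uses the same equal-width edges as `np.linspace(min, max, 4)`. The sketch below checks this with `np.histogram` on made-up horsepower values; the array `hp` is hypothetical. One caveat: `pd.cut` closes each interval on the right by default, while `np.histogram` closes it on the left, so a value lying exactly on an inner edge can be counted in different bins by the two.

```python
import numpy as np

# Hypothetical horsepower values, not taken from the dataset
hp = np.array([48, 70, 100, 150, 210, 288])

# np.histogram with bins=3 splits [min, max] into 3 equal-width bins
counts, edges = np.histogram(hp, bins=3)

print(edges)   # [ 48. 128. 208. 288.]
print(counts)  # [3 1 2]

# Same edges as the linspace-based construction
same_edges = np.allclose(edges, np.linspace(hp.min(), hp.max(), 4))
print(same_edges)  # True
```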


<h2> 4. Turning Categorical values into Numeric values</h2>

<b>What is an indicator variable?</b>
<p>
    An indicator variable (or dummy variable) is a numerical variable used to label categories. Indicator variables are called 'dummies' because the numbers themselves don't have inherent meaning. 
</p>

<b>Why do we use indicator variables?</b>

<p>
    We use indicator variables so we can use categorical variables for regression analysis in the later modules.
</p>
<b>Example</b>
<p>
    We see the column "fuel-type" has two unique values: "gas" or "diesel". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert "fuel-type" to indicator variables.
</p>

<p>
    We will use pandas' method 'get_dummies' to assign numerical values to different categories of fuel type. 
</p>
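
A minimal sketch of what `get_dummies` produces, using a made-up "fuel-type" Series in place of the dataset column. The values and the name `indicators` are illustrative assumptions. `dtype=int` is passed so the output is 0/1; newer pandas versions otherwise return True/False.

```python
import pandas as pd

# Hypothetical fuel-type values, not taken from the dataset
fuel = pd.Series(["gas", "diesel", "gas", "gas"], name="fuel-type")

# One column per category; a 1 marks the category each row belongs to
indicators = pd.get_dummies(fuel, dtype=int)
print(indicators)
```

Each row has exactly one 1, in the column of its original category.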


In [None]:
df.columns

Get the indicator variables and assign them to the data frame "dummy_variable\_1":


In [None]:
# dtype=int gives 0/1 columns (newer pandas versions default to True/False)
dummy_variable_1 = pd.get_dummies(df["fuel-type"], dtype=int)
dummy_variable_1.head()

Change the column names for clarity:


In [None]:
dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
dummy_variable_1.head()

The data frame "dummy_variable\_1" now represents 'gas' and 'diesel' as separate columns of 0s and 1s.


In [None]:
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)

In [None]:
df.head()

The last two columns are now the indicator variable representation of the fuel-type variable. They're all 0s and 1s now.
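
The get, rename, concat and drop steps above can also be done in one call by passing `columns`, `prefix` and `prefix_sep` to `get_dummies`. This is an alternative sketch rather than the method used in this experiment, and the small dataframe `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical two-column dataframe standing in for the real dataset
toy = pd.DataFrame({
    "horsepower": [111, 154, 102],
    "fuel-type": ["gas", "diesel", "gas"],
})

# Encode "fuel-type" in place: the original column is dropped and
# prefixed indicator columns are appended at the end
encoded = pd.get_dummies(
    toy, columns=["fuel-type"], prefix="fuel-type", prefix_sep="-", dtype=int
)
print(encoded.columns.tolist())
# ['horsepower', 'fuel-type-diesel', 'fuel-type-gas']
```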


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #2: </h1>

<b>Similar to before, create an indicator variable for the column "aspiration"</b>

</div>


<details><summary>Click here for the solution</summary>

```python
# get indicator variables of aspiration and assign them to data frame "dummy_variable_2"
# dtype=int gives 0/1 columns (newer pandas versions default to True/False)
dummy_variable_2 = pd.get_dummies(df['aspiration'], dtype=int)

# change column names for clarity
dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)

# show first 5 instances of data frame "dummy_variable_2"
dummy_variable_2.head()


```

</details>


 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  #3: </h1>

<b>Merge the new dataframe to the original dataframe, then drop the column 'aspiration'.</b>

</div>


<details><summary>Click here for the solution</summary>

```python
# merge the new dataframe to the original dataframe
df = pd.concat([df, dummy_variable_2], axis=1)

# drop original column "aspiration" from "df"
df.drop('aspiration', axis = 1, inplace=True)


```

</details>


Save the new csv:


In [None]:
df.to_csv('expt6output', index=False)

**Conclusion:**