# **Data Wrangling :**

* Data wrangling is also called data pre-processing
* Definition :
<code>The process of converting or mapping data from its initial "raw" form into another format in order to prepare the data for further analysis</code>

* Identify and handle missing values
<ul>
<li>First, we will have to identify and handle the missing values</li>
<li>A "missing value" condition occurs whenever a data entry is left empty</li>
</ul>

* Data formatting
<ul>
<li>Data from different sources may be in various formats, in different units or in various conventions.</li>
<li>We have some methods in Python pandas that can standardize the values into the same format, or unit, or convention.</li>
</ul>

* Data normalization (Centering/Scaling)
<ul>
<li>Different columns of numerical data may have very different ranges, and direct comparison is often not meaningful.</li>
<li>Normalization is a way to bring all data into a similar range, for more useful comparison.</li>
<li> Specifically, we should focus on centering and scaling</li>
</ul>

* Data binning
<ul>
<li>Binning creates bigger categories from a set of numerical values.</li>
<li>It is particularly useful for comparison between groups of data.</li>
</ul>

* Turning categorical values into numeric values
<ul>
<li>We have to turn the categorical values in the dataset to numerical values</li>
</ul>

<hr>

* In pandas DataFrames,
<ul>
<li><b>each row is a sample,</b></li>
<li><b>each column is a feature (pandas series).</b></li>
</ul>
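
The row/column convention above can be checked on a small DataFrame. This is only a sketch; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two samples (rows), two features (columns).
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16500]})

# shape is (number of rows, number of columns) = (samples, features).
n_samples, n_features = df.shape

# Selecting a single column returns a pandas Series.
price_column = df["price"]
print(type(price_column))

# Selecting a single row by position returns that sample's values.
first_sample = df.iloc[0]
print(first_sample)
```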

<hr>

# **Dealing with Missing values :**

### What is a missing value?

* Missing values occur when no data is stored for a variable (feature) in an observation
* A missing value may be represented as "?", "N/A", 0, or just a blank cell
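
A minimal sketch of how such placeholders might be detected, assuming the raw data marks missing entries with "?" (the toy columns below are invented for illustration). Placeholders like 0 need more care, because 0 can also be a legitimate value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" marks a missing entry.
df = pd.DataFrame({
    "make": ["audi", "bmw", "?", "volvo"],
    "price": ["13950", "?", "16500", "?"],
})

# Convert the placeholder to NaN so pandas recognizes it as missing.
df = df.replace("?", np.nan)

# Count the missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)
```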

### How to deal with missing data?

* **Drop the missing values**
<ul>
<li>drop the variable</li>
<li>drop the data entry</li>
</ul>

* **Rules for handling missing values :**
<ul>
<li>When you drop data, you can either drop the <b>whole variable or</b> just the <b>single data entry</b> with the missing value.</li>
<li>If you <b>don’t have a lot of observations</b> with missing data, <b>dropping the particular entry</b> is usually best.</li>
<li>If you’re removing data, choose the option that has the <b>least amount of impact</b> on the dataset.</li>
</ul>

<hr>

* **Replace the missing values**
<ul>
<li>Replacing the data is better, since <b>no data is wasted</b></li>
<li>However, it is <b>less accurate</b> since we need to replace missing data with a guess of what the data should be.</li>
</ul>

* **Techniques to replace the missing data**
<ul>
<li> Generally, we replace the missing data with the <b>average</b> of all the entries of the column in which it is present</li>
<li>But, <b>if values cannot be averaged</b>, i.e., in case of categorical variables, which are not numeric values, <b>replacing with the most common value (mode) is better</b></li>
</ul>
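
The two replacement strategies above can be sketched as follows. The use of `fillna()` is a choice made for this example (the notes themselves only mention `replace`), and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "horsepower": [100.0, np.nan, 140.0],
    "num_doors": ["four", "four", np.nan],
})

# Numeric column: replace NaN with the mean of the non-missing entries.
mean_hp = df["horsepower"].mean()
df["horsepower"] = df["horsepower"].fillna(mean_hp)

# Categorical column: replace NaN with the most common value (mode).
most_common = df["num_doors"].mode()[0]
df["num_doors"] = df["num_doors"].fillna(most_common)

print(df)
```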

<hr>


* **Leave it as missing data**
<ul>
<li>In some cases, you may simply want to leave the missing data
as missing data.</li>
<li>For one reason or another, it may be useful to keep that observation, even if some features are missing.</li>
</ul>

* To drop missing data in pandas, we can use the **dropna()** method
* If you want to **delete the entire row** in which the NaN value is present, you have to specify **axis=0** as the argument to this function.
* If you want to **delete the entire column** in which the NaN value is present, you have to specify **axis=1** as the argument to this function.

* Pandas has a built-in method called **replace()**, which lets us replace a particular value with a new value, either in a single column or across the whole DataFrame <br>
<code>df.replace(missing_value, new_value)</code>
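
As a rough illustration of `dropna()` and `replace()` on a toy DataFrame (the data and variable names are assumptions made for this sketch):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing price.
df = pd.DataFrame({
    "price": [13950.0, np.nan, 16500.0],
    "mileage": [21.0, 19.0, 24.0],
})

# axis=0 drops every row that contains a NaN.
rows_kept = df.dropna(axis=0)

# axis=1 drops every column that contains a NaN.
cols_kept = df.dropna(axis=1)

# replace() swaps NaN for a chosen value, here the column mean.
mean_price = df["price"].mean()
filled_price = df["price"].replace(np.nan, mean_price)

print(rows_kept)
print(cols_kept)
print(filled_price)
```

Both `dropna()` and `replace()` return new objects by default, so `df` itself is left unchanged here.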

# **Data Formatting :**

<b><code>Bringing data into a common standard of expression allows users to make meaningful comparisons</code></b>

<br>

* Data is usually collected from different places and stored in different formats
* **Non-formatted data :**
<ul>
<li>confusing</li>
<li>hard to aggregate</li>
<li>hard to compare</li>
</ul>

* **Formatted data :**
<ul>
<li>more clear</li>
<li>easy to aggregate</li>
<li>easy to compare</li>
</ul>

### **Incorrect datatypes :**

* Sometimes, the wrong datatype is assigned to a feature
* For example, in our used car price dataset, the datatype of the **"price"** feature is object, but it is supposed to be an integer/float datatype
* It is important to explore the datatypes of the columns in the dataset and convert them to the correct datatypes; otherwise, models developed later on may behave strangely

<hr>

* To check the datatype of each column (note that <code>dtypes</code> is an attribute, not a method) <br>
<b><code>df.dtypes</code></b>

* To convert a column from one datatype to another <br>
<b><code>df.astype()</code></b> <br>
<code>Ex : df['price'] = df['price'].astype('int')</code>
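
A small sketch of inspecting and fixing a datatype, assuming the prices were read in as strings (the values are invented). Any missing-value placeholders such as "?" would need to be handled before the conversion, since they cannot be cast to integers.

```python
import pandas as pd

# Hypothetical toy data: prices stored as strings.
df = pd.DataFrame({"price": ["13950", "16500", "17450"]})

# dtypes lists the datatype of every column.
print(df.dtypes)

# Convert the column to integers and check the result.
df["price"] = df["price"].astype("int")
print(df.dtypes)

total_price = df["price"].sum()
```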

# **Data Normalization :**

* Normalization is an important technique in data preprocessing
* Some columns may have values only within a particular range, for example 50-100, 1000-1500, etc.
* We may want to normalize these variables so that the range of the values is consistent.
* This normalization can make some statistical analyses easier down the road.

* By making the ranges consistent between variables, normalization enables a fairer comparison between the different features.
* It also helps the features have a comparable impact on a model, and it is important for computational reasons.

* Several approaches for normalization : 
<ul>

<li>
<b>Simple feature scaling : </b> <br>
x<sub>new</sub> = x<sub>old</sub> / x<sub>max</sub>
</li>

<li>
<b>Min-Max : </b> <br>
x<sub>new</sub> = (x<sub>old</sub> - x<sub>min</sub>) / (x<sub>max</sub> - x<sub>min</sub>)
</li>

<li>
<b>Z-Score : </b> <br>
x<sub>new</sub> = (x<sub>old</sub> - average) / std
</li>

</ul>
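
The three formulas above can be sketched in pandas as follows. The "length" column and its values are hypothetical, and note that `Series.std()` uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical toy column.
df = pd.DataFrame({"length": [150.0, 175.0, 200.0]})
col = df["length"]

# Simple feature scaling: divide by the maximum value.
df["length_simple"] = col / col.max()

# Min-Max: shift by the minimum, then divide by the range.
df["length_minmax"] = (col - col.min()) / (col.max() - col.min())

# Z-Score: subtract the average, then divide by the standard deviation.
df["length_zscore"] = (col - col.mean()) / col.std()

print(df)
```

Simple feature scaling and Min-Max give values with a maximum of 1, while Z-Score values are centered on 0.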

# **Binning :**

* Binning : grouping a set of numeric values into a set of "bins"
* It converts numeric variables into categorical variables

<hr>

* For example, you can bin <br>
<code>age into [0-5], [6-10], [11-15] and so on</code>

<hr>

* Sometimes, binning can improve accuracy of the predictive models
* Sometimes we use data binning to group a set of numerical values into a
smaller number of bins to have a better understanding of the data distribution
* In pandas, bins can be created as follows (assuming <code>numpy</code> and <code>pandas</code> have been imported) : <br>
<code>bins = numpy.linspace(min(df['price']), max(df['price']), 4) <br>
group_names = ['Low', 'Medium', 'High'] <br>
df['price-binned'] = pandas.cut(df['price'], bins, labels=group_names, include_lowest=True)
</code>
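
A self-contained version of the snippet above, run on invented prices, might look like this. Four equally spaced edges from `linspace` give three bins, and `value_counts()` is used here only as a quick way to see how many entries fall in each bin.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices.
df = pd.DataFrame({"price": [5000, 9000, 15000, 22000, 30000, 45000]})

# 4 equally spaced edges between the minimum and maximum price -> 3 bins.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum price inside the first bin.
df["price-binned"] = pd.cut(
    df["price"], bins, labels=group_names, include_lowest=True
)

# Number of entries per bin.
counts = df["price-binned"].value_counts()
print(df)
print(counts)
```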

<hr>

### **Visualizing Binned data :**
&emsp;&emsp;&emsp;You can use histograms to visualize the distribution of the data after it has been divided into bins

# **Turning categorical variables into quantitative variables :**

**Problem :** Most statistical models cannot take objects/strings as input for model training
<br>
**Solution :**
* Add dummy variables for each unique category
* Assign 0 or 1 in each category
* This method is called **One-Hot Encoding**

<hr>

* The **get_dummies()** function automatically generates one indicator column for each category of the variable, holding 1 where the row belongs to that category and 0 otherwise. <br>
<code>pandas.get_dummies(df['ColumnName'])</code>
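
A minimal one-hot encoding sketch on a hypothetical "fuel" column. Passing `dtype=int` is a choice made here so that the output shows 0/1, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy data with one categorical feature.
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One indicator column per unique category, with values 0 or 1.
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the dummy columns to the original DataFrame.
df = pd.concat([df, dummies], axis=1)
print(df)
```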

# **Check out "DA_Lab_2.ipynb" for the code**