### Pre-processing Data in Python

* The process of converting or mapping data from its initial "raw" form into another format to prepare it for further analysis
* Also known as:
    * Data Cleaning, Data Wrangling

* Access the values of a column using the column name
    * `df["symboling"]`
* Assign new values to a column
    * `df["symboling"] = df["symboling"] + 1`
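The two operations above can be tried on a small stand-in frame. This is a minimal sketch: the `toy` frame and its values are made up for illustration and are not part of the automobile dataset.

```python
import pandas as pd

# Toy frame standing in for the automobile data (values are made up)
toy = pd.DataFrame({"symboling": [3, 1, 2]})

# Read a column by name; the result is a pandas Series
symboling = toy["symboling"]

# Assign to the same name to overwrite the column with new values
toy["symboling"] = toy["symboling"] + 1

print(toy["symboling"].tolist())
```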

In [28]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

df = pd.read_csv(url, header = None)
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "wheel-base",
           "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower",
           "peak-rpm", "city-mpg", "highway-mpg", "price"]
df.columns = headers
df.head(5)

| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |


#### Missing values
* Missing values occur when no data value is stored for a variable (feature) in an observation.
* Could be represented as "?", "N/A", 0 or just a blank cell.

#### How to deal with missing data
* Check with the data collection source
* Drop the missing values
    * drop the variable (the entire column)
    * drop the data entry (the entire row)
* Replace the missing values
    * replace it with an average (of similar data points)
    * replace it by frequency (the most common value, useful for categorical variables)
    * replace it based on other functions
* Leave it as missing data
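Whichever option is chosen, the missing entries first have to be recognisable as missing. A minimal sketch, assuming the placeholder is `"?"` as in the table shown earlier: the three rows, the shortened column list, and the names `raw` and `toy` are made up for illustration. The `na_values` parameter of `pd.read_csv` turns the placeholder into `NaN` while parsing, after which the missing entries can be counted per column.

```python
import io
import pandas as pd

# Three made-up rows in the same comma-separated, header-less layout
raw = io.StringIO("3,?,alfa-romero\n2,164,audi\n1,?,audi\n")

toy = pd.read_csv(
    raw,
    header=None,
    names=["symboling", "normalized-losses", "make"],
    na_values="?",  # treat "?" as a missing value while parsing
)

# Count the missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Counting first makes it easier to decide between dropping and replacing: a column with only a few gaps is a candidate for replacement, while a row missing the target variable is usually dropped.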
    

#### How to drop missing values in Python
* Use `dataframe.dropna()`
    * `axis=0` drops the entire row
    * `axis=1` drops the entire column
* `inplace=True` writes the result back to the data frame instead of returning a new one

`df.dropna(subset=["price"], axis=0, inplace=True)` is equivalent to `df = df.dropna(subset=["price"], axis=0)`

In [29]:
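# dropna() only removes NaN/None entries; rows whose price is the "?" placeholder are filtered out separately before the type conversion
df.dropna(subset=["price"], axis=0, inplace=True)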
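The effect of the `axis` argument is easiest to see on a tiny frame. A minimal sketch with made-up values (`toy`, `rows_kept`, and `cols_kept` are illustrative names), where one price is already `NaN`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "make": ["audi", "bmw", "mazda"],
    "price": [13950.0, np.nan, 6795.0],  # made-up values, one missing
})

# axis=0: drop every row that has a missing price
rows_kept = toy.dropna(subset=["price"], axis=0)

# axis=1: drop every column that contains a missing value
cols_kept = toy.dropna(axis=1)

print(rows_kept)
print(cols_kept)
```

Neither call uses `inplace=True`, so `toy` itself still has all three rows and both columns.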

#### How to replace missing values in Python
* Use `dataframe.replace(missing_value, new_value)`


In [30]:
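import numpy as np
# Convert non-numeric values (such as "?") to NaN
df["normalized-losses"] = pd.to_numeric(df["normalized-losses"], errors="coerce")
# Calculate the mean of the column (NaN values are skipped)
mean = df["normalized-losses"].mean()
# Replace NaN with the mean and store the result back in the column
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)
df["normalized-losses"]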

0      122.0
1      122.0
2      122.0
3      164.0
4      164.0
       ...  
200     95.0
201     95.0
202     95.0
203     95.0
204     95.0
Name: normalized-losses, Length: 205, dtype: float64
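The mean only makes sense for numeric columns. For a categorical column, the "replace it by frequency" option listed earlier can be sketched as follows. The four values and the `toy` frame are made up, and `mode()` is used here as one way to find the most frequent value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"num-of-doors": ["four", "two", np.nan, "four"]})  # made-up values

# mode() returns the most frequent value(s) and skips NaN by default
most_common = toy["num-of-doors"].mode()[0]

# Replace the missing entry with the most frequent category
toy["num-of-doors"] = toy["num-of-doors"].replace(np.nan, most_common)

print(toy["num-of-doors"].tolist())
```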

#### Data Formatting
* Data is usually collected from different places and stored in different formats
* Bringing data into a common standard of expression allows users to make meaningful comparisons
    

In [31]:
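# Convert miles per gallon to litres per 100 km (L/100km = 235 / mpg)
df["city-mpg"] = 235/df["city-mpg"]
# Rename the column to match the new unit; the key must be the existing name "city-mpg"
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)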
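One pitfall worth knowing when renaming: by default `rename` silently ignores keys that match no existing column, so a misspelled name leaves the frame unchanged without raising an error. A minimal sketch on a made-up one-column frame (`toy`, `unchanged`, and `renamed` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({"city-mpg": [21, 19, 24]})  # made-up values

# Convert miles per gallon to litres per 100 km
toy["city-mpg"] = 235 / toy["city-mpg"]

# A key that matches no column is ignored without an error
unchanged = toy.rename(columns={"city_mpg": "city-L/100km"})

# The key must match the existing column name exactly
renamed = toy.rename(columns={"city-mpg": "city-L/100km"})

print(list(unchanged.columns), list(renamed.columns))
```

Checking `df.columns` after a rename is a cheap way to confirm the new name actually took effect.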

#### Incorrect data types
* Sometimes the wrong data type is assigned to a feature; for example, `price` is read as `object` because its missing entries are stored as "?"

#### Data types in Python and Pandas
* There are many data types in Pandas
* `object`: "A", "Hello"
* `int64`: 1, 2, 3
* `float64`: 1.2422, 6.3432


##### Correcting data types
To identify data types:
* Use the `dataframe.dtypes` attribute (no parentheses) to identify the data type of each column

To convert data types:
* Use `dataframe.astype()` to convert a data type

In [35]:
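# Remove rows whose price is the "?" placeholder; .copy() avoids a SettingWithCopyWarning on the assignment that follows
df = df[df['price'] != '?'].copy()
# Example: Convert data type to integer in column "price"
df["price"] = df["price"].astype("int")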
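Identifying and converting types can be sketched end to end on a small frame. The values and the names `toy` and `priced` are made up for illustration; the `"?"` placeholder follows the convention of the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "make": ["audi", "bmw", "mazda"],
    "price": ["13950", "?", "6795"],  # numbers stored as text, one placeholder
})

# dtypes is an attribute: it lists the type of every column
print(toy.dtypes)

# Keep only rows with a real price, then convert the text to integers
priced = toy[toy["price"] != "?"].copy()
priced["price"] = priced["price"].astype("int64")

print(priced.dtypes)
```

Calling `astype("int64")` before removing the `"?"` row would raise a `ValueError`, which is why the filter comes first.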

#### Data Normalization
* Brings the values of features with different ranges onto a comparable scale.

#### Methods of normalizing data
* Several approaches for normalization:
    * Simple Feature Scaling
        $$x_{\text{new}} = \frac{x_{\text{old}}}{x_{\text{max}}}$$

    * Min-Max
        $$x_{\text{new}} = \frac{x_{\text{old}} - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}$$

    * Z-score
        $$x_{\text{new}} = \frac{x_{\text{old}} - \mu}{\sigma}$$
        * $\mu$: the mean (average) of the feature
        * $\sigma$: the standard deviation of the feature

#### With Pandas
* Simple feature scaling
    * Divide each value by the maximum value of the feature, using the pandas `max()` method
    * `df["length"] = df["length"]/df["length"].max()`
* Min-max
    * `df["length"] = (df["length"]-df["length"].min())/(df["length"].max()-df["length"].min())`
* Z-score
    * `df["length"] = (df["length"]-df["length"].mean())/df["length"].std()`
    * `std()` returns the standard deviation of the feature
    * `mean()` returns the average of the feature

In [None]:
df["length"] = df["length"]/df["length"].max()
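To compare the three methods side by side, the sketch below applies each one to the same made-up series (`lengths` and its values are illustrative). Note that pandas `std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

lengths = pd.Series([150.0, 175.0, 200.0])  # made-up values

# Simple feature scaling: divide by the maximum, so the largest value becomes 1
simple = lengths / lengths.max()

# Min-max: the smallest value becomes 0 and the largest becomes 1
min_max = (lengths - lengths.min()) / (lengths.max() - lengths.min())

# Z-score: the result has mean 0 and standard deviation 1
z_score = (lengths - lengths.mean()) / lengths.std()

print(simple.tolist())
print(min_max.tolist())
print(z_score.tolist())
```

Simple scaling and min-max stay within fixed ranges for these values, while z-scores are centred on 0 and can be negative.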

#### Binning
* Binning is a data preprocessing method that groups values into bins
* Converts numeric variables into categorical variables
* Groups a set of numerical values into a set of "bins"

In [37]:
# Four equally spaced edges between the minimum and maximum price define three equal-width bins
bins = np.linspace(min(df["price"]), max(df["price"]), 4)
group_names = ["Low", "Medium", "High"]
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
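The same binning steps can be followed on a handful of values. A minimal sketch with made-up prices (`prices` and `binned` are illustrative names): four edges from `np.linspace` produce three equal-width bins, and `include_lowest=True` makes the first bin include the minimum value.

```python
import numpy as np
import pandas as pd

prices = pd.Series([5000, 12000, 20000, 35000, 45000])  # made-up values

# Four equally spaced edges -> three equal-width bins
bins = np.linspace(prices.min(), prices.max(), 4)
group_names = ["Low", "Medium", "High"]

# Assign each price to the bin whose interval contains it
binned = pd.cut(prices, bins, labels=group_names, include_lowest=True)

print(bins)
print(binned.value_counts())
```

Without `include_lowest=True`, the minimum price would fall outside the first interval and be labelled `NaN`.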

#### How to turn categorical variables into quantitative variables in Python
Problem:
* Most statistical models cannot take objects/strings as input

##### Dummy variables in Python pandas
* Use the `pandas.get_dummies()` method.
* Converts categorical variables to dummy variables (0 or 1)

Example:
* `pd.get_dummies(df["fuel-type"])`
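A minimal sketch of dummy encoding on a made-up column (`toy`, `dummies`, and `encoded` are illustrative names). Recent pandas versions return boolean columns by default, so `dtype=int` is passed here to get the 0/1 values described above.

```python
import pandas as pd

toy = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"]})  # made-up values

# One new column per category, holding 1 where the row has that category
dummies = pd.get_dummies(toy["fuel-type"], dtype=int)

# Attach the dummy columns to the original frame
encoded = pd.concat([toy, dummies], axis=1)

print(encoded)
```

Each row has exactly one 1 across the dummy columns, because every car has exactly one fuel type.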