##  Content

* #### **Identify and handle missing values**;
* #### **Data formatting**;
* #### **Data normalization (centering/scaling)**;
* #### **Data binning**;
* #### **Turning categorical values into numerical variables**.

In [3]:
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

df = pd.read_csv(url, header=None)  # header=None because the file has no header row
headers = ["symbolic", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg","price"]
df.columns = headers

In [4]:
#Simple Dataframe Operations

df.head(2)

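|  | symbolic | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |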


In [5]:
# Accessing a column of the dataframe

df["symbolic"] = df["symbolic"] + 1
df.head(2)

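|  | symbolic | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 4 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |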


### Dealing with missing values

**Drop the missing values**;
* Drop the variable (the whole column)
* Drop the data entry (the whole row)

**Replace the missing values**;

* Replace it with an average (of similar data points)
* Replace it by frequency (the most common value)
* Replace it based on other functions

**Leave it as missing data**.
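
The options listed above can be sketched roughly as follows. This is a minimal illustration on a hypothetical toy frame (the name `toy` and its values are made up for the example and do not come from the autos dataset); it uses `dropna`, `fillna`, `mean` and `mode` from pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the autos dataset.
toy = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 20.0],
    "doors": ["four", "two", np.nan, "four"],
})

# 1. Drop: remove every row that has a missing price.
dropped = toy.dropna(subset=["price"])

# 2. Replace with an average: fill missing prices with the column mean.
price_mean = toy["price"].mean()  # NaN values are skipped by default
filled_mean = toy["price"].fillna(price_mean)

# 3. Replace by frequency: fill missing categories with the most common value.
most_common = toy["doors"].mode()[0]
filled_mode = toy["doors"].fillna(most_common)

print(dropped)
print(filled_mean)
print(filled_mode)
```

Averages suit numeric columns, while the most frequent value is the usual choice for categorical ones; which option fits best depends on how much data is missing and on what the column is used for.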

In [6]:
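# The dropna function drops missing values such as NaN. With "axis=0" it drops the rows
# that contain missing values, while "axis=1" drops the columns.

# Since we want to predict the price of the cars, we have
# to remove the cars that don't have a listed price.

# subset defines which columns are checked for missing values;
# inplace=True modifies the dataframe itself instead of returning a new one.
# Note: this file marks missing values with "?", which pandas does not treat as NaN,
# so no rows are actually dropped here (the output below still has 205 rows).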
df.dropna(subset = ["price"], axis=0, inplace = True)

df["price"]


0      13495
1      16500
2      16500
3      13950
4      17450
       ...  
200    16845
201    19045
202    21485
203    22470
204    22625
Name: price, Length: 205, dtype: object
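
The result above still has 205 rows and dtype `object`, which suggests the `"?"` placeholders were never recognised as missing. One possible fix, sketched here on a hypothetical three-row sample instead of the real file (the sample text and the names `raw`, `cars`, `cars_raw` and `cars_clean` are assumptions for illustration), is to convert `"?"` to NaN either while loading or afterwards.

```python
import io
import pandas as pd

# Hypothetical sample in the style of the autos file: "?" marks a missing value.
raw = "alfa-romero,13495\naudi,?\nbmw,16430\n"

# Option A: tell read_csv to treat "?" as missing while loading.
cars = pd.read_csv(io.StringIO(raw), header=None,
                   names=["make", "price"], na_values="?")

# Option B: load as-is, then convert the text column to numbers.
# errors="coerce" turns anything unparseable (here "?") into NaN.
cars_raw = pd.read_csv(io.StringIO(raw), header=None, names=["make", "price"])
cars_raw["price"] = pd.to_numeric(cars_raw["price"], errors="coerce")

# With real NaN values in place, dropna removes the row without a price.
cars_clean = cars.dropna(subset=["price"], axis=0)
print(cars_clean)
```

Option A keeps the cleanup in one place; option B is handy when the data is already loaded, though `errors="coerce"` will also silently turn any other non-numeric text into NaN.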

In [14]:
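import numpy as np
# Replacing missing values with newly calculated values

# For example, we want to replace the missing values of the
# variable "normalized-losses" with the mean of the variable.
# Note that the two lines below do not achieve this yet: [3:4] selects only
# row 3, so `mean` is not the column mean, and replace() called with a single
# argument treats it as the value to search for, so the "?" entries stay as they are.

mean = df["normalized-losses"][3:4].mean()

df["normalized-losses"].replace(mean)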

0        ?
1        ?
2        ?
3      164
4      164
      ... 
200     95
201     95
202     95
203     95
204     95
Name: normalized-losses, Length: 205, dtype: object
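
In the output above the `"?"` entries are still present. A working version of the mean replacement might look like the sketch below. It runs on a hypothetical five-value sample rather than the real column (the values and the names `losses`, `losses_num`, `losses_mean` and `losses_filled` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical sample mimicking "normalized-losses":
# numbers stored as text, with "?" standing in for missing values.
losses = pd.Series(["?", "164", "?", "95", "101"], name="normalized-losses")

# Convert to numbers; anything that cannot be parsed (here "?") becomes NaN.
losses_num = pd.to_numeric(losses, errors="coerce")

# Mean over the whole column (NaN is ignored), not over a slice.
losses_mean = losses_num.mean()

# fillna returns a new Series, so the result has to be assigned.
losses_filled = losses_num.fillna(losses_mean)
print(losses_filled)
```

On the real dataframe the same three steps would be applied to `df["normalized-losses"]`, assigning the result back to that column.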

###  Wrapping up

In [16]:
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"),
                            pd.NaT]})

df.head()

|  | name | toy | born |
| --- | --- | --- | --- |
| 0 | Alfred | NaN | NaT |
| 1 | Batman | Batmobile | 1940-04-25 |
| 2 | Catwoman | Bullwhip | NaT |


In [28]:
df.dropna(subset = ["toy"], axis = 0) # drop the rows where "toy" is missing (NaN or NaT)



|  | name | toy | born |
| --- | --- | --- | --- |
| 1 | Batman | Batmobile | 1940-04-25 |
| 2 | Catwoman | Bullwhip | NaT |


In [27]:
df.dropna(axis = 1) # drop all the columns with NaN or NaT

|  | name |
| --- | --- |
| 0 | Alfred |
| 1 | Batman |
| 2 | Catwoman |


In [26]:
df.dropna(subset = ["born"], axis = 0)

|  | name | toy | born |
| --- | --- | --- | --- |
| 1 | Batman | Batmobile | 1940-04-25 |


In [29]:
# The calls above don't use inplace, so the dataframe wasn't modified:

<bound method NDFrame.head of        name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT>
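
As a minimal sketch of why the dataframe is unchanged, assuming a hypothetical two-column frame named `heroes` (a cut-down copy of the example above): without `inplace=True`, `dropna` returns a new DataFrame, and the original only changes if that result is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical cut-down version of the example frame.
heroes = pd.DataFrame({"name": ["Alfred", "Batman", "Catwoman"],
                       "toy": [np.nan, "Batmobile", "Bullwhip"]})

# Without inplace, dropna returns a new DataFrame and leaves the original alone.
result = heroes.dropna(subset=["toy"])
rows_before = len(heroes)   # the original still has all of its rows
rows_result = len(result)   # the returned frame has the row removed

# Assigning the result back has the same effect as inplace=True.
heroes = heroes.dropna(subset=["toy"])
print(rows_before, rows_result, len(heroes))
```

Assigning the result back is often preferred over `inplace=True` because it makes the modification explicit and allows method chaining.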

In [36]:
# Modifying it:
# subset specifies which columns the function checks for missing values

df.dropna(subset=['toy', 'born'], axis = 0, inplace = True)
df

|  | name | toy | born |
| --- | --- | --- | --- |
| 1 | Batman | Batmobile | 1940-04-25 |


In [44]:
df = pd.DataFrame({"Price": [10, np.nan, 10, 40],
      "Length":[0.1, 0.3, np.nan, 0.1],
      "Sales": [100, 120, 200, np.nan]
     })

df

|  | Price | Length | Sales |
| --- | --- | --- | --- |
| 0 | 10.0 | 0.1 | 100.0 |
| 1 | NaN | 0.3 | 120.0 |
| 2 | 10.0 | NaN | 200.0 |
| 3 | 40.0 | 0.1 | NaN |


In [51]:
mean = df["Price"].mean()
# Assign the result back rather than calling replace(..., inplace=True) on a column selection
df["Price"] = df["Price"].replace(np.nan, mean)
df

|  | Price | Length | Sales |
| --- | --- | --- | --- |
| 0 | 10.0 | 0.1 | 100.0 |
| 1 | 20.0 | 0.3 | 120.0 |
| 2 | 10.0 | NaN | 200.0 |
| 3 | 40.0 | 0.1 | NaN |
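
The `Length` and `Sales` columns above still contain missing values. One way to fill every numeric column with its own mean in a single step is sketched below; the frame is rebuilt under the hypothetical name `sales` so the example is self-contained.

```python
import numpy as np
import pandas as pd

# Same values as the example above, rebuilt under a hypothetical name.
sales = pd.DataFrame({"Price": [10, np.nan, 10, 40],
                      "Length": [0.1, 0.3, np.nan, 0.1],
                      "Sales": [100, 120, 200, np.nan]})

# DataFrame.mean() returns one mean per column (NaN is ignored);
# fillna then matches each mean to its column by name.
col_means = sales.mean()
sales_filled = sales.fillna(col_means)
print(sales_filled)
```

This only makes sense when every column is numeric and a mean is a reasonable stand-in; categorical columns would need a different fill value, such as the most frequent one.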
