# Data Wrangling
The purpose of data wrangling is to transform data from its initial format into a format that is better suited for analysis. This process includes cleaning, structuring, and enriching raw data so that it is ready for analysis.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

To demonstrate how data wrangling is done, I will use an automobile dataset hosted on IBM Cloud Object Storage.

In [2]:
file_source = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"

The next step is to create a Python list of headers, which will be used to name each column in the dataset.

In [3]:
file_headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

Then I use the `read_csv` function to load the data, setting its `names` parameter to the *file_headers* variable.

In [4]:
df = pd.read_csv(file_source, names=file_headers)

In [5]:
df.head() # Display the first five rows of the dataset

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
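
To make the effect of `names` concrete, here is a minimal sketch on a tiny in-memory CSV rather than the real file. The sample rows and the three-column header list are made up for illustration. The point is only that, when `names` is passed, pandas keeps the first line as data instead of treating it as a header.

```python
import io

import pandas as pd

# Hypothetical two-row sample with no header line (not the real auto.csv).
sample_csv = "3,?,alfa-romero\n2,164,audi\n"
sample_headers = ["symboling", "normalized-losses", "make"]

# Passing names= assigns column labels and keeps the first line as data.
sample_df = pd.read_csv(io.StringIO(sample_csv), names=sample_headers)

print(sample_df.shape)          # two rows, three columns
print(list(sample_df.columns))
```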


## Identify Missing Values

What should we do first once the data has been loaded? We start by examining the dataset, for example by looking at the first five or ten rows. One of the first things to check for is missing values, because they prevent us from performing a sound analysis. So the first step is to **identify missing values**. The preview of our dataset shows three cells that contain a question mark instead of a value; these indicate missing values that we need to deal with.
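
A quick way to confirm what the preview suggests is to count the placeholder directly. The sketch below uses a small made-up frame, not the real dataset, and assumes the placeholder is exactly the string "?".

```python
import pandas as pd

# Toy frame that mimics the "?" placeholder seen in the preview.
toy = pd.DataFrame({
    "normalized-losses": ["?", "?", "164"],
    "make": ["alfa-romero", "alfa-romero", "audi"],
    "price": ["13495", "?", "13950"],
})

# Compare every cell with "?" and count the matches per column.
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)
```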

### 1. Convert "?" to `NaN`
For convenience and performance reasons, as explained in the __[pandas documentation](https://pandas.pydata.org/pandas-docs/version/2.0.1/user_guide/missing_data.html)__, we will convert "?" to `NaN`.

In [6]:
df.replace("?", np.nan, inplace=True)
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


I used `inplace=True` because I want to modify the original dataframe, and I used the pandas `replace` method to make the substitution. As the preview shows, each "?" has been replaced by `NaN`.
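
An alternative to replacing after loading, shown here only as a sketch, is to tell `read_csv` which strings mean "missing" through its `na_values` parameter. The inline CSV and its column names are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical headerless sample; "?" marks a missing value.
sample_csv = "3,?,alfa-romero\n2,164,audi\n"
sample_headers = ["symboling", "normalized-losses", "make"]

# na_values="?" converts the placeholder to NaN while parsing,
# so the column is read as numeric instead of as text.
sample_df = pd.read_csv(
    io.StringIO(sample_csv), names=sample_headers, na_values="?"
)

print(sample_df["normalized-losses"].isnull().sum())
print(sample_df["normalized-losses"].dtype)
```

With this approach the later `astype(float)` calls would not be needed for that column, at the cost of deciding up front which strings count as missing.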

### 2. Evaluating Missing Data
There are two useful methods for evaluating missing data: `isnull()` and `notnull()`. Each returns boolean values indicating whether data is missing. The two methods are opposites of each other: `isnull()` produces `True` for missing data and `False` for non-missing data, and `notnull()` does the reverse.

In [7]:
missing_data1 = df.isnull()
missing_data1.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [8]:
missing_data2 = df.notnull()
missing_data2.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,True,False,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
1,True,False,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
2,True,False,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True


As both previews show, the two methods produce opposite results.
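
The relationship between the two masks can be checked directly. This is a small sketch on a made-up frame, not the automobile data:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column.
toy = pd.DataFrame({
    "bore": [3.47, np.nan, 3.19],
    "make": ["audi", "bmw", None],
})

is_missing = toy.isnull()
is_present = toy.notnull()

# Each mask is the element-wise negation of the other.
print((is_missing == ~is_present).all().all())
```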

### 3. Count missing values in each column
After evaluating the missing data, we want to count how many missing values exist in each column of our dataset. We can do that with a Python `for` loop. I will use `missing_data1` in this case.

In [9]:
for column in missing_data1.columns:
    print(missing_data1[column].value_counts())
    print("")

symboling
False    205
Name: count, dtype: int64

normalized-losses
False    164
True      41
Name: count, dtype: int64

make
False    205
Name: count, dtype: int64

fuel-type
False    205
Name: count, dtype: int64

aspiration
False    205
Name: count, dtype: int64

num-of-doors
False    203
True       2
Name: count, dtype: int64

body-style
False    205
Name: count, dtype: int64

drive-wheels
False    205
Name: count, dtype: int64

engine-location
False    205
Name: count, dtype: int64

wheel-base
False    205
Name: count, dtype: int64

length
False    205
Name: count, dtype: int64

width
False    205
Name: count, dtype: int64

height
False    205
Name: count, dtype: int64

curb-weight
False    205
Name: count, dtype: int64

engine-type
False    205
Name: count, dtype: int64

num-of-cylinders
False    205
Name: count, dtype: int64

engine-size
False    205
Name: count, dtype: int64

fuel-system
False    205
Name: count, dtype: int64

bore
False    201
True       4
Name: count, dtype: 

Based on the summary above, each column has 205 rows, and seven columns contain missing values:
1. *normalized-losses*: 41 missing values
2. *num-of-doors*: 2 missing values
3. *bore*: 4 missing values
4. *stroke*: 4 missing values
5. *horsepower*: 2 missing values
6. *peak-rpm*: 2 missing values
7. *price*: 4 missing values
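
The same counts can be obtained more compactly by summing the boolean mask, because `True` counts as 1. The sketch below uses a made-up frame to keep it self-contained:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bore": [3.47, np.nan, 3.19, np.nan],
    "num-of-doors": ["two", "four", np.nan, "four"],
    "make": ["audi", "bmw", "audi", "volvo"],
})

# True counts as 1, so summing the boolean mask counts missing cells.
missing_counts = toy.isnull().sum()

# Keep only the columns that actually have missing values.
print(missing_counts[missing_counts > 0])
```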

### 4. Dealing with Missing Data
After identifying and evaluating the missing data, what comes next? How do we deal with it?

There are two common ways of dealing with missing data. First, we can **drop** it. Second, we can **replace** it with a new value.
1. Drop data: drop the whole row or the whole column.
2. Replace data: replace it with the mean, with the most frequent value, or with a value derived from another function.
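
As a rough illustration of these two options, here is a sketch on a made-up frame. The column values are invented, and filling with the mean is only one of the possible replacement choices.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "bore": [3.0, np.nan, 4.0],
    "price": [13495.0, 16500.0, np.nan],
})

# Option 1a: drop every row that has at least one missing value.
rows_dropped = toy.dropna(axis=0)

# Option 1b: drop every column that has at least one missing value.
cols_dropped = toy.dropna(axis=1)

# Option 2: replace missing values, here with each numeric column's mean.
filled = toy.fillna(toy.mean(numeric_only=True))

print(len(rows_dropped))
print(list(cols_dropped.columns))
print(filled["bore"].tolist())
```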

In this case, I will do the following:

Replace with the mean in the following columns:
- normalized-losses
- bore
- stroke
- horsepower
- peak-rpm

Replace with the most frequent value (mode):
- num-of-doors. Rationale: sedans have four doors, and because four doors is the most frequent value, it is the most likely one for the missing entries.

Drop the whole row:
- price. Rationale: this is our target variable (y), and we cannot use rows with a missing target in supervised learning.

I'll start dealing with the missing data by calculating the mean of each column.

In [17]:
# Calculate mean for "normalized-losses"
avg_norm_loss = df["normalized-losses"].astype(float).mean()
print("Average normalized-losses:", avg_norm_loss)

Average normalized-losses: 122.0


In [20]:
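# Replace NaN with the mean value in the normalized-losses column.
# Assigning the result back avoids calling inplace=True on a column selection,
# which recent pandas versions warn about.
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_norm_loss)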

In [21]:
# Calculate mean value for bore column
avg_bore = df["bore"].astype(float).mean()
print("Average bore:", avg_bore)

Average bore: 3.3297512437810943


In [30]:
# Replace NaN with mean bore value in bore column
df["bore"] = df["bore"].replace(np.nan, avg_bore)
# Optional check that the column no longer contains missing values:
# df["bore"].isnull().value_counts()

In [31]:
# Calculate the mean value for "stroke" column
avg_stroke = df["stroke"].astype(float).mean()
print("Average stroke:", avg_stroke)

Average stroke: 3.255422885572139


In [34]:
# Replace NaN with mean value in stroke column
df["stroke"] = df["stroke"].replace(np.nan, avg_stroke)

In [35]:
# Calculate mean value for "horsepower" column
avg_horsepower = df["horsepower"].astype(float).mean()
print("Average horsepower:", avg_horsepower)

Average horsepower: 104.25615763546799


In [36]:
# Replace NaN with mean value in horsepower column
df["horsepower"] = df["horsepower"].replace(np.nan, avg_horsepower)

In [37]:
# Calculate mean value for peak-rpm column
avg_peak_rpm = df["peak-rpm"].astype(float).mean()
print("Average peak-rpm:", avg_peak_rpm)

Average peak-rpm: 5125.369458128079


In [38]:
# Replace NaN with mean value in peak-rpm column
df["peak-rpm"] = df["peak-rpm"].replace(np.nan, avg_peak_rpm)
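
Since the same two steps repeat for every column, they could be wrapped in a small function. `fill_with_mean` below is a hypothetical helper of my own, not a pandas function, and it is shown on a made-up frame. Unlike the cells above, it also stores the column as `float`, which is an extra step I am assuming is wanted.

```python
import numpy as np
import pandas as pd


def fill_with_mean(frame, columns):
    """Hypothetical helper: cast each column to float, then fill NaN with its mean.

    Note that it modifies the frame that is passed in.
    """
    for column in columns:
        numeric = frame[column].astype(float)
        frame[column] = numeric.fillna(numeric.mean())
    return frame


# Values stored as text, as they are after loading a file that contained "?".
toy = pd.DataFrame({
    "bore": ["3.0", np.nan, "4.0"],
    "peak-rpm": ["5000", "5500", np.nan],
})

toy = fill_with_mean(toy, ["bore", "peak-rpm"])
print(toy)
print(toy.dtypes)
```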

In [40]:
df["num-of-doors"].value_counts() # Use value_counts() to see which values are present in this column

num-of-doors
four    114
two      89
Name: count, dtype: int64

In [42]:
df["num-of-doors"] = df["num-of-doors"].replace(np.nan, "four")
df["num-of-doors"].value_counts() # checking changes

num-of-doors
four    116
two      89
Name: count, dtype: int64
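
Instead of typing "four" by hand, the most frequent value can be looked up programmatically. This is a sketch on a made-up series; `idxmax()` returns the label with the highest count.

```python
import numpy as np
import pandas as pd

doors = pd.Series(["four", "two", "four", np.nan, "four"], name="num-of-doors")

# value_counts() ignores NaN by default; idxmax() returns the most frequent label.
most_common = doors.value_counts().idxmax()

doors_filled = doors.fillna(most_common)
print(most_common)
print(doors_filled.isnull().sum())
```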

In [43]:
# drop all rows that do not have price data
df.dropna(subset=["price"], axis=0, inplace=True)
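
Dropping rows leaves gaps in the index. If a continuous index is wanted, which is my assumption here rather than a required step, it can be rebuilt with `reset_index`. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "price": [13950.0, np.nan, 12940.0],
})

# Drop rows with no price, then rebuild a gap-free index starting at 0.
toy = toy.dropna(subset=["price"]).reset_index(drop=True)
print(toy.index.tolist())
```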