# Data Wrangling
The purpose of data wrangling is to transform data from its initial format into a format that is better suited for analysis. This process includes cleaning, structuring, and enriching raw data so that it is ready for analysis.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

To demonstrate how data wrangling is done, I will use an automobile dataset hosted on IBM Cloud Object Storage.

In [None]:
file_source = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"

The next step is to create a Python list of headers. The headers will be used to name each column in the dataset.

In [None]:
file_headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

Then, I use the `read_csv` function to load the data and set its `names` parameter to the *file_headers* variable.

In [None]:
df = pd.read_csv(file_source, names=file_headers)

In [None]:
df.head() # Display the first five rows of the dataset
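To see what the `names` parameter contributes in the `read_csv` call, here is a minimal sketch on a tiny in-memory CSV. The two rows and the shortened header list (`raw_text`, `sample_headers`) are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Two made-up rows standing in for a CSV file that has no header line.
raw_text = "3,?,alfa-romero\n1,164,audi\n"
sample_headers = ["symboling", "normalized-losses", "make"]

# Without `names`, pandas treats the first data row as the header.
without_names = pd.read_csv(io.StringIO(raw_text))

# With `names`, every row is kept as data and the columns get our labels.
with_names = pd.read_csv(io.StringIO(raw_text), names=sample_headers)

print(without_names.shape)  # (1, 3): the first row was consumed as the header
print(with_names.shape)     # (2, 3): both rows are kept as data
```

Passing `names` is therefore what keeps the first record of a headerless file from being lost.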

## Identify Missing Values

What do we do first after the data has been loaded? We examine the dataset by looking at the first five or ten rows. One of the first things to check is missing values. This matters because missing values prevent us from doing a sound analysis. So, the first thing to do is **identify missing values**. The preview of our dataset shows three cells with question marks instead of values, which indicates missing data that we need to deal with.

### 1. Replace "?" with `NaN`
For convenience and performance reasons, as stated __[here](https://pandas.pydata.org/pandas-docs/version/2.0.1/user_guide/missing_data.html)__, we will replace "?" with `NaN`.

In [None]:
df.replace("?", np.nan, inplace=True)
df.head()

I used the Pandas `replace` method with `inplace=True` because I want to change the original dataframe. As we can see in the preview, each "?" has been replaced by `NaN`.
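As an alternative to replacing "?" after loading, `read_csv` can treat a placeholder as missing while it parses the file, through its `na_values` parameter. The sketch below is not the approach used in this notebook; it runs on a made-up two-row sample (`raw_text`, `sample_headers`) only to show the effect.

```python
import io
import pandas as pd

# Made-up sample rows; "?" marks a missing value, as in the real dataset.
raw_text = "3,?,alfa-romero\n1,164,audi\n"
sample_headers = ["symboling", "normalized-losses", "make"]

# na_values tells the parser to read "?" as NaN while loading.
sample_df = pd.read_csv(
    io.StringIO(raw_text), names=sample_headers, na_values="?"
)

print(sample_df["normalized-losses"].isnull().sum())  # 1
print(sample_df["normalized-losses"].dtype)           # float64
```

A side effect worth noting is that the column is parsed as numeric right away, because the parser never sees the "?" string as data.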

### 2. Evaluating Missing Data
There are two useful methods for evaluating missing data: `isnull()` and `notnull()`. Each returns boolean values indicating whether data is missing. The two methods are opposites of each other: `isnull()` produces `True` for missing data and `False` for non-missing data, and `notnull()` does the reverse.

In [None]:
missing_data1 = df.isnull()
missing_data1.head()

In [None]:
missing_data2 = df.notnull()
missing_data2.head()

As we can see from both previews, the two methods produce opposite results.
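The relationship between the two masks can also be checked directly. This is a small sketch on a made-up two-row frame (`sample_df` is hypothetical, not the automobile data): negating one mask element-wise should give the other.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column.
sample_df = pd.DataFrame({"bore": [3.47, np.nan], "make": ["audi", None]})

is_missing = sample_df.isnull()   # True where a value is missing
is_present = sample_df.notnull()  # True where a value is present

# Element-wise negation (~) of one mask reproduces the other.
print(is_missing.equals(~is_present))  # True
```

Because the masks are exact complements, only one of them is needed for the steps that follow.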

### 3. Count missing values in each column
After evaluating the missing data, we want to count how many missing values exist in each column of our dataset. We can achieve that with a Python `for` loop. I will use `missing_data1` in this case.

In [None]:
# For each column, count how many entries are missing (True) and present (False)
for column in missing_data1.columns:
    print(missing_data1[column].value_counts())
    print()

Based on the summary above, each column has 205 rows. The important finding is that 7 columns contain missing values:
1. *normalized-losses*: 41 missing values
2. *num-of-doors*: 2 missing values
3. *bore*: 4 missing values
4. *stroke*: 4 missing values
5. *horsepower*: 2 missing values
6. *peak-rpm*: 2 missing values
7. *price*: 4 missing values
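A more compact way to get per-column counts like these is to sum the boolean mask, since `True` counts as 1. The sketch below uses a made-up three-row frame (`sample_df` and its values are hypothetical) rather than the automobile dataset, so the numbers differ from the list above.

```python
import numpy as np
import pandas as pd

# Made-up frame: one missing bore, two missing prices, no missing make.
sample_df = pd.DataFrame({
    "bore": [3.47, np.nan, 3.19],
    "price": [13495.0, np.nan, np.nan],
    "make": ["alfa-romero", "audi", "bmw"],
})

# Summing the boolean mask counts the True (missing) entries per column.
missing_counts = sample_df.isnull().sum()

# Keep only the columns that actually have missing values.
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

Applied to `df`, the same two lines should list only the columns that contain missing values, without printing the `False` counts for every column.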