# 1. Missing Data

## Using a statistical value

**Fill in a column's missing values with a statistical value such as the mean, median, or mode.**

In [1]:
import numpy as np
import pandas as pd

pd_series = pd.Series([5, 10, np.nan, 15, 20, np.nan, 25, 50, np.nan])
print('Average of non-missing values: {0}'.format(pd_series.mean()))
pd_series = pd_series.fillna(pd_series.mean())
print(pd_series)

Average of non-missing values: 20.833333333333332
0     5.000000
1    10.000000
2    20.833333
3    15.000000
4    20.000000
5    20.833333
6    25.000000
7    50.000000
8    20.833333
dtype: float64
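
The example above fills with the mean. The median and mode work the same way; the following is a minimal sketch, where the `colors` series is a made-up example that is not part of the original data:

```python
import numpy as np
import pandas as pd

# Numeric column: the median is less sensitive to extreme values (like 50) than the mean.
numeric = pd.Series([5, 10, np.nan, 15, 20, np.nan, 25, 50, np.nan])
median_filled = numeric.fillna(numeric.median())
print(median_filled.tolist())

# Hypothetical text column: fill gaps with the most frequent value.
# Series.mode() returns a Series (ties are possible), so take its first entry.
colors = pd.Series(['red', 'blue', None, 'red', None])
mode_filled = colors.fillna(colors.mode().iloc[0])
print(mode_filled.tolist())
```

The mode is usually the more natural choice for non-numeric columns, where a mean or median is not defined.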


## Tracking and dropping missing data

**Pandas also makes it easy to drop missing values using the dropna() method. Calling isnull() afterwards confirms that no missing values remain:**

In [3]:
import numpy as np
import pandas as pd

pd_series = pd.Series([5, 10, np.nan, 15, 20, np.nan, 25, 50, np.nan])
# Drop rows with missing data
pd_series = pd_series.dropna()
print(pd_series)
print(pd_series.isnull())

0     5.0
1    10.0
3    15.0
4    20.0
6    25.0
7    50.0
dtype: float64
0    False
1    False
3    False
4    False
6    False
7    False
dtype: bool
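
To track missing values before dropping anything, it can help to count them per column first. A small sketch on a hypothetical two-column data frame (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each column.
df = pd.DataFrame({
    'age': [25, np.nan, 40, 31],
    'income': [50000, 62000, np.nan, 48000],
})

# Count missing values per column before deciding what to drop.
missing_counts = df.isnull().sum()
print(missing_counts)

# Drop every row that contains at least one missing value.
complete_rows = df.dropna()

# Drop rows only when 'age' is missing; gaps in other columns are kept.
age_known = df.dropna(subset=['age'])
print(len(complete_rows), len(age_known))
```

Restricting the drop with `subset` keeps more rows when only some columns matter for the analysis.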


# 2. Outliers



Another part of data cleaning is dealing with outliers. First off, how do you define an outlier? This can require domain knowledge as well as other information, but a simple way to start is to look at box plots:

![image.png](attachment:image.png)
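
Box plots conventionally draw their whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and show anything further out as individual points. The sketch below applies that same rule numerically, assuming the 1.5 × IQR convention is an acceptable first cut for your data:

```python
import pandas as pd

values = pd.Series([5, 10, 15, 20, 25, 50])

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles, as in a conventional box plot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as candidate outliers.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence)
print(outliers.tolist())
```

Whether a flagged value is really an error or just an unusual observation still depends on domain knowledge.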

# 3. Scaling

## Introduction to scaling

The scale of your features matters for many machine learning algorithms. Having income values that range from 100 to 100,000 and ages that range from 0 to 100 can cause issues because of the large difference in scale between these two columns. To deal with this, it is standard to rescale the data. There are many ways to do this, but the two most common are:

* **Standard scaling**
* **Min/Max scaling**

### Standard scaling

Standard scaling subtracts the mean and divides by the standard deviation. This centers the feature on zero with unit variance.

In [8]:
from sklearn.preprocessing import StandardScaler
import numpy as np

data = [[-1, 2],
        [-0.5, 6],
        [0, 10],
        [1, 18]]

print("Before Standard scaling")
print(np.mean(data, 0))
print(np.std(data, 0))

# Initialize a StandardScaler
standard = StandardScaler()
# Fit and transform the data with the StandardScaler
standard_data = standard.fit_transform(data)

print()
print("After Standard scaling")
print(np.mean(standard_data, 0))
print(np.std(standard_data, 0))

Before Standard scaling
[-0.125  9.   ]
[0.73950997 5.91607978]

After Standard scaling
[0. 0.]
[1. 1.]
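
As a sanity check, the same transformation can be written by hand with NumPy. This sketch uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses:

```python
import numpy as np

data = np.array([[-1, 2],
                 [-0.5, 6],
                 [0, 10],
                 [1, 18]], dtype=float)

# Per-column mean and population standard deviation (ddof=0 is NumPy's default).
mean = data.mean(axis=0)
std = data.std(axis=0)

# Standard scaling: subtract the mean, then divide by the standard deviation.
manual = (data - mean) / std

print(manual.mean(axis=0))  # approximately [0. 0.]
print(manual.std(axis=0))   # approximately [1. 1.]
```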


### Min/Max scaling

Min/Max scaling rescales each feature to a fixed range, which is [0, 1] by default. Let's look at the same example but instead use **MinMaxScaler()** from **sklearn**.

In [11]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Create matrix of data
data = [[-1, 2], 
        [-0.5, 6], 
        [0, 10], 
        [1, 18]]

# Initialize MinMaxScaler
min_max = MinMaxScaler()
# Fit and transform the data
min_max_data = min_max.fit_transform(data)

print(min_max_data)
print(np.min(min_max_data, 0))
print(np.max(min_max_data, 0))
print(np.mean(min_max_data, 0))
print(np.std(min_max_data, 0))

[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
[0. 0.]
[1. 1.]
[0.4375 0.4375]
[0.36975499 0.36975499]
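
The scaled matrix above can be reproduced by hand. A minimal NumPy sketch, assuming the default target range of 0 to 1:

```python
import numpy as np

data = np.array([[-1, 2],
                 [-0.5, 6],
                 [0, 10],
                 [1, 18]], dtype=float)

# Per-column minimum and maximum.
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Min/Max scaling: shift so the minimum is 0, then divide by the range.
manual = (data - col_min) / (col_max - col_min)
print(manual)
```

Unlike standard scaling, the result is bounded, but a single extreme value can squeeze all the other values into a narrow band.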


# 4. Categorical Data

## Introduction to categorical data

Sometimes you get [categorical data](https://en.wikipedia.org/wiki/Categorical_variable), which are variables with a limited and usually fixed number of values, for example male and female. Machine learning algorithms need numbers to work, so how do you deal with these? We will discuss two ways:

* Label encoding
* One-hot encoding, a.k.a. dummy variables

## Label encoding

**Label encoding** works by converting each unique value to a number. For example, if we have two categories male and female, pandas assigns the codes in sorted (alphabetical) order:

* female as 0
* male as 1

You can get the integer values by adding **.cat.codes** to the end of your categorical series.

You can get the string values by adding **.cat.categories** to the end of your categorical series.

In [19]:
import pandas as pd

non_categorical_series = pd.Series(['male', 'female', 'male', 'female'])

# Convert the text series to a categorical series
categorical_series = non_categorical_series.astype('category')

# Print the numeric codes for each value
print(categorical_series.cat.codes)

# Print the category names
print(categorical_series.cat.categories)


0    1
1    0
2    1
3    0
dtype: int8
Index(['female', 'male'], dtype='object')
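
By default pandas sorts the categories, which is why `female` receives code 0 in the output above. If a particular mapping is needed, one option is to pass the category order explicitly. This is a sketch of that approach, not the only way to do it:

```python
import pandas as pd

genders = pd.Series(['male', 'female', 'male', 'female'])

# Fix the category order explicitly so that 'male' gets code 0.
ordered = pd.Series(pd.Categorical(genders, categories=['male', 'female']))
codes = ordered.cat.codes.tolist()
print(codes)  # [0, 1, 0, 1]

# Build a lookup table to translate codes back into labels.
code_to_label = dict(enumerate(ordered.cat.categories))
print(code_to_label)  # {0: 'male', 1: 'female'}
```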


## One-hot encoding 

**One-hot encoding** is similar but creates a new column for each category and fills it with a 1 for each row with that value and zero otherwise.

In [20]:
import pandas as pd

# Create series with male and female values
non_categorical_series = pd.Series(['male', 'female', 'male', 'female'])
# Create dummy or one-hot encoded variables
# (dtype=int gives 0/1 columns; recent pandas versions default to True/False)
print(pd.get_dummies(non_categorical_series, dtype=int))

   female  male
0       0     1
1       1     0
2       0     1
3       1     0
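
To see how one-hot encoding relates to label encoding, the same table can be built from the integer codes: each code selects one row of an identity matrix. A small illustrative sketch:

```python
import numpy as np
import pandas as pd

series = pd.Series(['male', 'female', 'male', 'female']).astype('category')

# Integer label codes: female -> 0, male -> 1.
codes = series.cat.codes.to_numpy()

# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(len(series.cat.categories), dtype=int)[codes]

# Attach the category names as column labels.
one_hot_df = pd.DataFrame(one_hot, columns=series.cat.categories)
print(one_hot_df)
```

In practice `pd.get_dummies` is the more convenient route; the manual version is only meant to show what it produces.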


# 5. Exercise: Cleaning Auto MPG Dataset

We will then create an additional function **outlier_detection** that takes a data frame **df** and returns a data frame with two rows:

* The 90<sup>th</sup> percentile for every numeric column
* The 10<sup>th</sup> percentile for every numeric column

In [31]:
import pandas as pd

def read_csv():
    # Define the column names as a list
    names = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year", "origin", "car_name"]
    # Read the whitespace-delimited file from the local data directory using the defined column names
    df = pd.read_csv("./data/auto-mpg.data", header=None, names=names, sep=r"\s+")
    return df

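# Compute the 90th and 10th percentiles to use as outlier cut-offs
def outlier_detection(df):
    # numeric_only=True skips text columns such as car_name
    df = df.quantile([.90, .10], numeric_only=True)
    return df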

print(outlier_detection(read_csv()))

       mpg  cylinders  displacement  weight  acceleration  model_year  origin
0.9  34.33        8.0         350.0  4275.2          19.0        81.0     3.0
0.1  14.00        4.0          90.0  1988.5          12.0        71.0     1.0
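
The percentiles can then serve as cut-offs for actually removing rows. The sketch below uses a small made-up `mpg` column rather than the real data set, and treating the 10th and 90th percentiles as outlier bounds is an assumption that may be too aggressive for real data:

```python
import pandas as pd

# Hypothetical values, not taken from the Auto MPG file.
cars = pd.DataFrame({'mpg': [10, 14, 18, 22, 26, 30, 34, 46]})

# 10th and 90th percentiles of the column.
low, high = cars['mpg'].quantile([0.10, 0.90])

# Keep only rows whose value lies between the two percentiles (inclusive).
trimmed = cars[cars['mpg'].between(low, high)]
print(low, high)
print(trimmed['mpg'].tolist())
```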
