# <font color="blue">Lesson 6 - Feature Engineering and Selection</font>
# One Hot Encoding
For this lesson, we'll import the Automobile Dataset from the UCI Machine Learning Repository.  

## Import Data

In [None]:
import pandas as pd
import numpy as np

# add headers to the data
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values='?' )
df.head()
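To see what `na_values` does without downloading the file, here is a minimal sketch on a two-row inline sample. The sample rows and the shortened column list are made up for illustration and are not copied from the dataset.

```python
import io
import pandas as pd

# Hypothetical sample in the same style as the raw file:
# no header row, and "?" standing in for missing values
raw = "3,?,alfa-romero,gas\n1,164,audi,?\n"

sample = pd.read_csv(io.StringIO(raw), header=None,
                     names=["symboling", "normalized_losses", "make", "fuel_type"],
                     na_values="?")

# Each "?" is now NaN, so normalized_losses is parsed as a float column
print(sample.isnull().sum())
print(sample.dtypes)
```

Without `na_values="?"`, the `normalized_losses` column would be read as text because of the stray "?" entries.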

## Pre-Process Data
Before we can begin working with this dataset, let's explore it and do some parsing. 

In [None]:
# check what datatypes the dataframe contains and make sure they are correct
df.dtypes

In [None]:
# check for nulls
df[df.isnull().any(axis=1)].head()
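A per-column count of missing values is often easier to scan than the rows themselves. The sketch below shows one way to get it, using a small toy frame whose values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the automobile data (values are illustrative only)
toy = pd.DataFrame({
    "make": ["audi", "bmw", "dodge"],
    "num_doors": ["four", np.nan, "two"],
    "price": [13950.0, np.nan, np.nan],
})

# Count missing values per column, then keep only the columns that have any
null_counts = toy.isnull().sum()
print(null_counts[null_counts > 0])
```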

In [None]:
# use the pandas fillna function to replace every NaN with 0

df = df.fillna(0)
df.head()


## Consider this
What did pandas fill the NaN values with? Why? What could be a better fill?

See <a href="https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.fillna.html">pandas fillna documentation</a>
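As one possible answer to the question above, the sketch below fills a numeric column with its median and a string column with its most frequent value. It runs on toy data and is only an illustration, not the lesson's prescribed approach; whether the median or the mode is a sensible fill depends on the column.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column (values are illustrative only)
toy = pd.DataFrame({
    "num_doors": ["four", np.nan, "two", "four"],
    "price": [10000.0, np.nan, 20000.0, 30000.0],
})

filled = toy.copy()

# Numeric column: fill with the column median
filled["price"] = filled["price"].fillna(filled["price"].median())

# String column: fill with the most frequent value (the mode)
filled["num_doors"] = filled["num_doors"].fillna(filled["num_doors"].mode()[0])

print(filled)
```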

### Convert categorical strings to numeric
Two columns in this dataset spell out numbers as strings instead of using the numbers themselves: num_doors and num_cylinders. 


In [None]:
df['num_doors'].head()

In [None]:
df['num_cylinders'].head()

In pandas, we can use a dictionary to map old values to new values, which works well for converting these strings into actual numbers. 

In [None]:
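# here's how it works on the num_doors column
new_door_counts = {"four":4, "two":2}
# scope the mapping to num_doors so "four" and "two" in other columns are left alone
df.replace({"num_doors": new_door_counts}, inplace=True)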
df["num_doors"].head()

In [None]:
type(new_door_counts)
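Keep in mind that `DataFrame.replace` with a flat dictionary applies the mapping to every column, so a value such as "four" is replaced wherever it appears. If only one column should change, `Series.map` is one alternative. The sketch below uses a toy frame with made-up values.

```python
import pandas as pd

# Toy frame where "four" and "two" appear in both columns
toy = pd.DataFrame({
    "num_doors": ["four", "two", "four"],
    "num_cylinders": ["four", "six", "two"],
})

door_map = {"four": 4, "two": 2}

# Series.map applies the dictionary to this one column only
toy["num_doors"] = toy["num_doors"].map(door_map)

print(toy)
```

Unlike `replace`, `map` turns any value missing from the dictionary into NaN, so the dictionary needs to cover every value in the column.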

## Your Turn

In [None]:
# here are the values you will need to map for the num_cylinders column
df["num_cylinders"].unique()

Following the code we used to map num_doors, create a mapping dictionary and replace the values for the num_cylinders column. 

In [None]:
# string to number dictionary 
new_num_cyls = {???}

# pandas replace on df
df.replace(???, inplace=True)

In [None]:
df["num_cylinders"].head()

## One Hot Encoding with Pandas
Now that we've cleaned our dataset, we can use the pandas get_dummies function to one-hot encode the categorical variables, turning each category into its own indicator column. 

First we'll use the pandas select_dtypes() function to pull out the categorical columns. pandas has traditionally stored string columns with the object dtype, so selecting that dtype gives us the columns to encode. 

In [None]:
# store object column names in a list
obj_cols = df.select_dtypes(include=["object"]).columns
obj_cols
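The same function can also split a frame into numeric and non-numeric columns by using `include` and `exclude` with the generic `"number"` selector. A small sketch on toy data (the values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "make": ["audi", "bmw"],
    "fuel_type": ["gas", "diesel"],
    "price": [13950.0, 16430.0],
    "num_doors": [4, 2],
})

# Everything that is not numeric (here: the text columns)
text_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()

# Every integer or float column
num_cols = toy.select_dtypes(include=["number"]).columns.tolist()

print(text_cols)
print(num_cols)
```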

Now we can use pandas get_dummies function to one hot encode. The syntax is: 

`pd.get_dummies(dataframe, columns=[list of object cols])`

In [None]:
# encode your dataframe
pd.get_dummies(df, columns=???)

We can see from the dataframe above that there are now a lot of redundant columns in our dataframe. Add the argument "drop_first=True" to the get_dummies function you used above and see what happens. 

In [None]:
# encode your data frame again, but add the following argument
drop_first=True
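For comparison, here is a minimal sketch of both calls on a toy frame with a single categorical column. The data is made up, and `dtype=int` is passed only so the indicators print as 0/1 regardless of pandas version.

```python
import pandas as pd

toy = pd.DataFrame({
    "fuel_type": ["gas", "diesel", "gas"],
    "price": [13950.0, 16430.0, 17450.0],
})

# One indicator column per category
full = pd.get_dummies(toy, columns=["fuel_type"], dtype=int)

# drop_first=True drops the first category of each encoded column
reduced = pd.get_dummies(toy, columns=["fuel_type"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With two categories, one indicator column is enough: a 0 in `fuel_type_gas` already implies diesel, which is why the dropped column is redundant.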

## One Hot Encoding with Sklearn
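Sklearn also provides encoders for categorical data. Some people prefer pandas because it's more straightforward, but you can decide on your own. Note that the LabelEncoder used below assigns each category a single integer code rather than creating one indicator column per category; sklearn's OneHotEncoder is the class that produces one-hot output. 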

In [None]:
from sklearn.preprocessing import LabelEncoder

# instantiate encoder
le = LabelEncoder()

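# fit and transform the object columns
# use df.apply() to apply le.fit_transform to each object column
# (casting to str avoids errors on columns that mix strings with the 0 fill value)
obj_cols = df.select_dtypes(include=["object"]).columns
df2 = df.copy()
df2[obj_cols] = df[obj_cols].astype(str).apply(le.fit_transform)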

In [None]:
df2.head()
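To make the difference between the two sklearn encoders concrete, the sketch below runs `LabelEncoder` and `OneHotEncoder` on the same toy column. The data is made up for illustration, and `.toarray()` is used because `OneHotEncoder` returns a sparse matrix by default.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

toy = pd.DataFrame({"body_style": ["sedan", "wagon", "hatchback", "sedan"]})

# LabelEncoder: a single integer code per category (classes are sorted)
codes = LabelEncoder().fit_transform(toy["body_style"])

# OneHotEncoder: one 0/1 column per category; expects 2-D input
ohe = OneHotEncoder()
onehot = ohe.fit_transform(toy[["body_style"]]).toarray()

print(codes)
print(ohe.categories_[0])
print(onehot)
```

Integer codes imply an ordering (hatchback < sedan < wagon) that many models will treat as meaningful, which is the usual reason to prefer one-hot output for unordered categories.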

### Congratulations
So which method do you prefer for one hot encoding?