# Data Preparation for Machine Learning

**Data preparation** is a vital step in the machine learning pipeline. Just as visualization is necessary to understand the relationships in data, proper preparation or **data munging** is required to ensure machine learning models work optimally. 

The process of data preparation is highly interactive and iterative. A typical process includes at least the following steps:
1. **Visualization** of the dataset to understand the relationships and identify possible problems with the data.
2. **Data cleaning and transformation** to address the problems identified. In many cases, step 1 is then repeated to verify that the cleaning and transformation had the desired effect.
 

In this session you will learn how to:
- Recode character strings to eliminate characters that will not be processed correctly.
- Find and treat missing values.
- Set the correct data type of each column.
- Transform categorical features to create categories with more cases and coding likely to be useful in predicting the label. 
- Apply transformations to numeric features and the label to improve the distribution properties. 
- Locate and treat duplicate cases. 


## An example

As a first example, you will prepare the automotive dataset. Careful preparation of this dataset, or any dataset, is required before attempting to train any machine learning model. This dataset has a number of problems which must be addressed. Further, some feature engineering will be applied.

### Load the dataset

As a first step you must load the dataset. 

Execute the code in the cell below to load the packages required  to run this notebook. 

In [69]:
import pandas as pd
import numpy as np

%matplotlib inline

Execute the code in the cell below to load the dataset and print the first few rows of the data frame.

In [7]:
auto_prices = pd.read_table('Automobile price data _Raw_.txt', delimiter=',')
auto_prices.head(5)



Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [18]:
auto_prices.columns

Index(['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration',
       'num_of_doors', 'body_style', 'drive_wheels', 'engine_location',
       'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
       'num_of_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke',
       'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg',
       'highway_mpg', 'price'],
      dtype='object')

In [20]:
auto_prices.shape

(205, 26)

In [21]:
auto_prices.count() 

symboling            205
normalized_losses    205
make                 205
fuel_type            205
aspiration           205
num_of_doors         205
body_style           205
drive_wheels         205
engine_location      205
wheel_base           205
length               205
width                205
height               205
curb_weight          205
engine_type          205
num_of_cylinders     205
engine_size          205
fuel_system          205
bore                 205
stroke               205
compression_ratio    205
horsepower           205
peak_rpm             205
city_mpg             205
highway_mpg          205
price                205
dtype: int64

In [22]:
auto_prices.memory_usage()

Index                  80
symboling            1640
normalized_losses    1640
make                 1640
fuel_type            1640
aspiration           1640
num_of_doors         1640
body_style           1640
drive_wheels         1640
engine_location      1640
wheel_base           1640
length               1640
width                1640
height               1640
curb_weight          1640
engine_type          1640
num_of_cylinders     1640
engine_size          1640
fuel_system          1640
bore                 1640
stroke               1640
compression_ratio    1640
horsepower           1640
peak_rpm             1640
city_mpg             1640
highway_mpg          1640
price                1640
dtype: int64

In [67]:
auto_prices['num_of_cylinders'].value_counts()

four      159
six        24
five       11
eight       5
two         4
twelve      1
three       1
Name: num_of_cylinders, dtype: int64

In [25]:
# Select only the columns with a numeric data type
auto_prices.select_dtypes(include = ['number'] ).head(5)

Unnamed: 0,symboling,wheel_base,length,width,height,curb_weight,engine_size,compression_ratio,city_mpg,highway_mpg
0,3,88.6,168.8,64.1,48.8,2548,130,9.0,21,27
1,3,88.6,168.8,64.1,48.8,2548,130,9.0,21,27
2,1,94.5,171.2,65.5,52.4,2823,152,9.0,19,26
3,2,99.8,176.6,66.2,54.3,2337,109,10.0,24,30
4,2,99.4,176.6,66.4,54.3,2824,136,8.0,18,22


In [26]:
#Subsetting the Data
auto_prices.loc[1:10,['peak_rpm', 'highway_mpg']]

Unnamed: 0,peak_rpm,highway_mpg
1,5000,27
2,5000,26
3,5500,30
4,5500,22
5,5500,25
6,5500,25
7,5500,25
8,5500,20
9,5500,22
10,5800,29


In [29]:
#New Column:
auto_prices['New_Col']=np.absolute(auto_prices.curb_weight)

In [30]:
auto_prices.New_Col

0      2548
1      2548
2      2823
3      2337
4      2824
5      2507
6      2844
7      2954
8      3086
9      3053
10     2395
11     2395
12     2710
13     2765
14     3055
15     3230
16     3380
17     3505
18     1488
19     1874
20     1909
21     1876
22     1876
23     2128
24     1967
25     1989
26     1989
27     2191
28     2535
29     2811
       ... 
175    2414
176    2414
177    2458
178    2976
179    3016
180    3131
181    3151
182    2261
183    2209
184    2264
185    2212
186    2275
187    2319
188    2300
189    2254
190    2221
191    2661
192    2579
193    2563
194    2912
195    3034
196    2935
197    3042
198    3045
199    3157
200    2952
201    3049
202    3012
203    3217
204    3062
Name: New_Col, Length: 205, dtype: int64

You will now perform some data preparation steps. 

### Recode names

Notice that several of the column names in the raw file contain the '-' character. Python handles strings containing '-' without difficulty, but a name such as `curb-weight` cannot be used for attribute-style access (`auto_prices.curb-weight`), because Python parses the '-' as the subtraction operator and reads the name as two separate names. Similar problems can occur with names or values containing other special characters, including ',', '*', '/', '|', '>', '<', '@', and '!'. If such characters appear in column names or values, they should be replaced with another character.

Execute the code in the cell below to replace the '-' characters by '_':

In [9]:
auto_prices.columns=auto_prices.columns.str.replace('-','_')

In [None]:
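# Alternative using a list comprehension (avoid naming the loop variable `str`, which shadows the built-in):
# auto_prices.columns = [name.replace('-', '_') for name in auto_prices.columns]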
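
The effect of recoding names can be seen on a small frame. This is a minimal sketch, assuming a toy data frame `df` with two hyphenated column names that mimic the raw headers; it is not the full automotive dataset.

```python
import pandas as pd

# Toy frame with hyphenated column names (illustrative values only).
df = pd.DataFrame({"curb-weight": [2548, 2823], "peak-rpm": [5000, 5500]})

# Bracket access works with any column name, including hyphenated ones.
weights_before = df["curb-weight"].tolist()

# Replace '-' with '_' so the names are valid Python identifiers.
# regex=False treats '-' as a literal character rather than a pattern.
df.columns = df.columns.str.replace("-", "_", regex=False)

# Attribute-style access is now possible as well.
weights_after = df.curb_weight.tolist()
print(list(df.columns))
```

Bracket access (`df["curb-weight"]`) works either way, so the recoding mainly matters for attribute access and for tools that parse column names as expressions.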

### Dropping Variables

In [32]:
auto_prices.drop("New_Col",axis = 1).head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [33]:
auto_prices.columns

Index(['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration',
       'num_of_doors', 'body_style', 'drive_wheels', 'engine_location',
       'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
       'num_of_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke',
       'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg',
       'highway_mpg', 'price', 'New_Col'],
      dtype='object')

In [38]:
auto_prices.columns.difference(['New_Col']) 

Index(['aspiration', 'body_style', 'bore', 'city_mpg', 'compression_ratio',
       'curb_weight', 'drive_wheels', 'engine_location', 'engine_size',
       'engine_type', 'fuel_system', 'fuel_type', 'height', 'highway_mpg',
       'horsepower', 'length', 'make', 'normalized_losses', 'num_of_cylinders',
       'num_of_doors', 'peak_rpm', 'price', 'stroke', 'symboling',
       'wheel_base', 'width'],
      dtype='object')

### Renaming Columns (single or multiple)

In [40]:
# Rename 'aspiration' to 'Aspiration_' and 'price' to 'Car_Price'
auto_prices.rename(columns={'aspiration':'Aspiration_', 'price':'Car_Price'}).head(10)

Unnamed: 0,symboling,normalized_losses,make,fuel_type,Aspiration_,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,Car_Price,New_Col
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.0,111,5000,21,27,13495,2548
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.0,111,5000,21,27,16500,2548
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,mpfi,2.68,3.47,9.0,154,5000,19,26,16500,2823
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,mpfi,3.19,3.4,10.0,102,5500,24,30,13950,2337
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,mpfi,3.19,3.4,8.0,115,5500,18,22,17450,2824
5,2,?,audi,gas,std,two,sedan,fwd,front,99.8,...,mpfi,3.19,3.4,8.5,110,5500,19,25,15250,2507
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,mpfi,3.19,3.4,8.5,110,5500,19,25,17710,2844
7,1,?,audi,gas,std,four,wagon,fwd,front,105.8,...,mpfi,3.19,3.4,8.5,110,5500,19,25,18920,2954
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,mpfi,3.13,3.4,8.3,140,5500,17,20,23875,3086
9,0,?,audi,gas,turbo,two,hatchback,4wd,front,99.5,...,mpfi,3.13,3.4,7.0,160,5500,16,22,?,3053


### Sorting Data (single, multiple columns) in ascending and descending

In [41]:
## Sorting the data
auto_prices.sort_values(by='peak_rpm', ascending=False).head(10)

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price,New_Col
131,2,?,renault,gas,std,two,hatchback,fwd,front,96.1,...,mpfi,3.46,3.90,8.7,?,?,23,31,9895,2460
130,0,?,renault,gas,std,four,wagon,fwd,front,96.1,...,mpfi,3.46,3.90,8.7,?,?,23,31,9295,2579
165,1,168,toyota,gas,std,two,sedan,rwd,front,94.5,...,mpfi,3.24,3.08,9.4,112,6600,26,29,9298,2265
166,1,168,toyota,gas,std,two,hatchback,rwd,front,94.5,...,mpfi,3.24,3.08,9.4,112,6600,26,29,9538,2300
33,1,101,honda,gas,std,two,hatchback,fwd,front,93.7,...,1bbl,2.91,3.41,9.2,76,6000,30,34,6529,1940
58,3,150,mazda,gas,std,two,hatchback,rwd,front,95.3,...,mpfi,?,?,9.4,135,6000,16,23,15645,2500
57,3,150,mazda,gas,std,two,hatchback,rwd,front,95.3,...,4bbl,?,?,9.4,101,6000,17,23,13645,2385
56,3,150,mazda,gas,std,two,hatchback,rwd,front,95.3,...,4bbl,?,?,9.4,101,6000,17,23,11845,2380
55,3,150,mazda,gas,std,two,hatchback,rwd,front,95.3,...,4bbl,?,?,9.4,101,6000,17,23,10945,2380
36,0,78,honda,gas,std,four,wagon,fwd,front,96.5,...,1bbl,2.92,3.41,9.2,76,6000,30,34,7295,2024


In [43]:
auto_prices.sort_values(by = ["bore","stroke"]).head(5)

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price,New_Col
134,3,150,saab,gas,std,two,hatchback,fwd,front,99.1,...,mpfi,2.54,2.07,9.3,110,5250,21,28,15040,2707
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,mpfi,2.68,3.47,9.0,154,5000,19,26,16500,2823
18,2,121,chevrolet,gas,std,two,hatchback,fwd,front,88.4,...,2bbl,2.91,3.03,9.5,48,5100,47,53,5151,1488
32,1,101,honda,gas,std,two,hatchback,fwd,front,93.7,...,1bbl,2.91,3.07,10.1,60,5500,38,42,5399,1837
30,2,137,honda,gas,std,two,hatchback,fwd,front,86.6,...,1bbl,2.91,3.41,9.6,58,4800,49,54,6479,1713


### Type Conversions (Convert Data Types of Columns)

Several columns in this dataset, such as `bore`, `horsepower`, and `price`, do not have the correct type because their missing values are coded as '?'. This is a common situation, as the methods used to automatically determine data type when loading files can fail when missing values are present.

The `astype` method converts a column to a specified type. The code in the cell below converts the `wheel_base` column to strings. Execute this code and observe the resulting type.

In [44]:
auto_prices['wheel_base'].astype('str')

0       88.6
1       88.6
2       94.5
3       99.8
4       99.4
5       99.8
6      105.8
7      105.8
8      105.8
9       99.5
10     101.2
11     101.2
12     101.2
13     101.2
14     103.5
15     103.5
16     103.5
17     110.0
18      88.4
19      94.5
20      94.5
21      93.7
22      93.7
23      93.7
24      93.7
25      93.7
26      93.7
27      93.7
28     103.3
29      95.9
       ...  
175    102.4
176    102.4
177    102.4
178    102.9
179    102.9
180    104.5
181    104.5
182     97.3
183     97.3
184     97.3
185     97.3
186     97.3
187     97.3
188     97.3
189     94.5
190     94.5
191    100.4
192    100.4
193    100.4
194    104.3
195    104.3
196    104.3
197    104.3
198    104.3
199    104.3
200    109.1
201    109.1
202    109.1
203    109.1
204    109.1
Name: wheel_base, Length: 205, dtype: object
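
Converting string columns that contain a missing-value marker to numeric can be sketched as follows. This is a hedged illustration on a toy frame, not the notebook's original cell: the frame `toy` and its values are assumptions, and `pd.to_numeric` with `errors="coerce"` is one of several possible approaches.

```python
import pandas as pd

# Toy columns in which '?' marks a missing value (illustrative values only).
toy = pd.DataFrame({"horsepower": ["111", "?", "154"],
                    "peak_rpm": ["5000", "5500", "?"]})

# Iterate over the columns that should be numeric.
for column in ["horsepower", "peak_rpm"]:
    # errors="coerce" turns anything that cannot be parsed (here '?') into NaN.
    toy[column] = pd.to_numeric(toy[column], errors="coerce")

print(toy.dtypes)
```

Because NaN is a floating-point value, the converted columns become `float64` even though the valid entries are whole numbers.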

### Resetting Index
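`set_index` makes an existing column the index of the data frame, and `reset_index` moves the current index back into an ordinary column and restores the default integer index.

A related method, `reindex`, creates a data frame with the data _conformed_ to a new index: the data is _rearranged_ to obey the new index, and missing values are introduced wherever the data was not present.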



In [47]:
auto_prices.set_index("make").head(5)

make,symboling,normalized_losses,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,...,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price,New_Col
alfa-romero,3,?,gas,std,two,convertible,rwd,front,88.6,168.8,...,mpfi,3.47,2.68,9.0,111,5000,21,27,13495,2548
alfa-romero,3,?,gas,std,two,convertible,rwd,front,88.6,168.8,...,mpfi,3.47,2.68,9.0,111,5000,21,27,16500,2548
alfa-romero,1,?,gas,std,two,hatchback,rwd,front,94.5,171.2,...,mpfi,2.68,3.47,9.0,154,5000,19,26,16500,2823
audi,2,164,gas,std,four,sedan,fwd,front,99.8,176.6,...,mpfi,3.19,3.4,10.0,102,5500,24,30,13950,2337
audi,2,164,gas,std,four,sedan,4wd,front,99.4,176.6,...,mpfi,3.19,3.4,8.0,115,5500,18,22,17450,2824


In [49]:
auto_prices.reset_index().head(4)  # the old index becomes a new 'index' column

Unnamed: 0,index,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,...,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price,New_Col
0,0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,...,mpfi,3.47,2.68,9.0,111,5000,21,27,13495,2548
1,1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,...,mpfi,3.47,2.68,9.0,111,5000,21,27,16500,2548
2,2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,...,mpfi,2.68,3.47,9.0,154,5000,19,26,16500,2823
3,3,2,164,audi,gas,std,four,sedan,fwd,front,...,mpfi,3.19,3.4,10.0,102,5500,24,30,13950,2337
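
The difference between `set_index`, `reset_index`, and `reindex` is easier to see on a small frame. The sketch below assumes a toy frame `toy` with illustrative values; the label `"volvo"` is deliberately absent from the data to show how `reindex` introduces missing values.

```python
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({"make": ["audi", "bmw", "saab"],
                    "price": [13950, 16430, 15040]})

# set_index: the 'make' column becomes the row index.
by_make = toy.set_index("make")

# reset_index: the index moves back into an ordinary column.
restored = by_make.reset_index()

# reindex: conform the data to a new index. Rows are rearranged to match,
# and 'volvo', which has no data, is filled with NaN.
conformed = by_make.reindex(["saab", "audi", "volvo"])
print(conformed)
```

None of these methods modifies the frame in place by default; each returns a new data frame.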


### Handling Duplicates
* `df.duplicated()` Returns boolean Series denoting duplicate rows, optionally only considering certain columns
* `df.drop_duplicates()` Returns DataFrame with duplicate rows removed, optionally only considering certain columns

In [55]:
auto_prices.loc[auto_prices.duplicated()].head(5)

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price,New_Col


In [57]:
auto_prices.drop_duplicates()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price,New_Col
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.00,111,5000,21,27,13495,2548
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.00,111,5000,21,27,16500,2548
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,mpfi,2.68,3.47,9.00,154,5000,19,26,16500,2823
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,mpfi,3.19,3.40,10.00,102,5500,24,30,13950,2337
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,mpfi,3.19,3.40,8.00,115,5500,18,22,17450,2824
5,2,?,audi,gas,std,two,sedan,fwd,front,99.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,15250,2507
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,17710,2844
7,1,?,audi,gas,std,four,wagon,fwd,front,105.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,18920,2954
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,mpfi,3.13,3.40,8.30,140,5500,17,20,23875,3086
9,0,?,audi,gas,turbo,two,hatchback,4wd,front,99.5,...,mpfi,3.13,3.40,7.00,160,5500,16,22,?,3053


In [58]:
auto_prices.drop_duplicates(keep='last')

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price,New_Col
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.00,111,5000,21,27,13495,2548
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.00,111,5000,21,27,16500,2548
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,mpfi,2.68,3.47,9.00,154,5000,19,26,16500,2823
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,mpfi,3.19,3.40,10.00,102,5500,24,30,13950,2337
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,mpfi,3.19,3.40,8.00,115,5500,18,22,17450,2824
5,2,?,audi,gas,std,two,sedan,fwd,front,99.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,15250,2507
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,17710,2844
7,1,?,audi,gas,std,four,wagon,fwd,front,105.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,18920,2954
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,mpfi,3.13,3.40,8.30,140,5500,17,20,23875,3086
9,0,?,audi,gas,turbo,two,hatchback,4wd,front,99.5,...,mpfi,3.13,3.40,7.00,160,5500,16,22,?,3053


In [59]:
#To find the number of duplicated rows
auto_prices.duplicated().value_counts()

False    205
dtype: int64
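
Since the automotive data contains no fully duplicated rows, the behavior of these methods is easier to see on a toy frame. The following is a minimal sketch with assumed, illustrative values; it shows the `subset` and `keep` parameters of `duplicated` and `drop_duplicates`.

```python
import pandas as pd

# Toy frame: row 1 repeats row 0 exactly (illustrative values only).
toy = pd.DataFrame({"make": ["audi", "audi", "audi", "bmw"],
                    "body_style": ["sedan", "sedan", "wagon", "sedan"],
                    "price": [13950, 13950, 18920, 16430]})

# Rows that repeat an earlier row in every column.
full_dupes = toy.duplicated()

# Rows that repeat an earlier row when only 'make' is considered.
make_dupes = toy.duplicated(subset=["make"])

# Keep the first or the last row of each group of duplicates.
deduped_first = toy.drop_duplicates(subset=["make"], keep="first")
deduped_last = toy.drop_duplicates(subset=["make"], keep="last")

print(full_dupes.tolist())
print(make_dupes.tolist())
```

Which columns define a duplicate is a modeling decision: two cars of the same make are not duplicate cases, so `subset` should be chosen with the meaning of a case in mind.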



### Finding and treating missing values

**Missing values** are a common problem in datasets. Failure to deal with missing values before training a machine learning model will lead to biased training at best, and in many cases to outright failure. Most estimators in the Python scikit-learn package will not process arrays with missing values.

There are two problems that must be dealt with when treating missing values:
1. First you must find the missing values. This can be difficult as there is no standard way missing values are coded. Some common possibilities for missing values are:
  - Coded by some particular character string, or numeric value like -999. 
  - A NULL value or numeric missing value such as a NaN. 
2. You must determine how to treat the missing values:
  - Remove features with substantial numbers of missing values. In many cases, such features are likely to have little information value. 
  - Remove rows with missing values. If there are only a few rows with missing values it might be easier and more certain to simply remove them. 
  - Impute values. Imputation can be done with simple algorithms such as replacing the missing values with the mean or median value. There are also more complex statistical methods, such as the expectation maximization (EM) algorithm.
  - Use nearest neighbor values. Alternatives for nearest neighbor values include averaging, forward filling, or backward filling.
  
Carefully observe the first few cases from the data frame and notice that missing values are coded with a '?' character. Execute the code in the cell below to identify the columns with missing values.

In [None]:
# Use the built-in `object`; the `np.object` alias was removed from recent NumPy releases
(auto_prices.astype(object) == '?').any()

In [61]:
auto_prices.isnull().sum()

symboling            0
normalized_losses    0
make                 0
fuel_type            0
aspiration           0
num_of_doors         0
body_style           0
drive_wheels         0
engine_location      0
wheel_base           0
length               0
width                0
height               0
curb_weight          0
engine_type          0
num_of_cylinders     0
engine_size          0
fuel_system          0
bore                 0
stroke               0
compression_ratio    0
horsepower           0
peak_rpm             0
city_mpg             0
highway_mpg          0
price                0
New_Col              0
dtype: int64
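
`isnull()` reports zero missing values here because pandas treats the '?' marker as an ordinary string. One way to make the missing values visible is to recode the marker as NaN first. The sketch below assumes a toy frame `toy` with illustrative values.

```python
import numpy as np
import pandas as pd

# Toy frame in which '?' marks a missing value (illustrative values only).
toy = pd.DataFrame({"bore": ["3.47", "?", "3.19"],
                    "price": ["13495", "16500", "?"],
                    "make": ["alfa-romero", "mazda", "audi"]})

# Before recoding, pandas sees no missing values at all.
missing_before = toy.isnull().sum().sum()

# Recode the '?' marker as NaN so that pandas recognizes it as missing.
recoded = toy.replace("?", np.nan)
missing_after = recoded.isnull().sum()
print(missing_after)
```

The same recoding can be requested at load time with the `na_values` parameter of `pd.read_csv`, for example `na_values='?'`. Methods such as `fillna` and `dropna` only act on values that pandas recognizes as missing, so they have no effect on the '?' entries until this step has been done.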

In [62]:
# Replace missing values with 0
auto_prices.fillna(0)

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price,New_Col
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.00,111,5000,21,27,13495,2548
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.00,111,5000,21,27,16500,2548
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,mpfi,2.68,3.47,9.00,154,5000,19,26,16500,2823
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,mpfi,3.19,3.40,10.00,102,5500,24,30,13950,2337
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,mpfi,3.19,3.40,8.00,115,5500,18,22,17450,2824
5,2,?,audi,gas,std,two,sedan,fwd,front,99.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,15250,2507
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,17710,2844
7,1,?,audi,gas,std,four,wagon,fwd,front,105.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,18920,2954
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,mpfi,3.13,3.40,8.30,140,5500,17,20,23875,3086
9,0,?,audi,gas,turbo,two,hatchback,4wd,front,99.5,...,mpfi,3.13,3.40,7.00,160,5500,16,22,?,3053


In [63]:
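# Fill missing values in numeric columns with the column median
# (numeric_only=True skips string columns, which have no median)
auto_prices.fillna(auto_prices.median(numeric_only=True))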

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price,New_Col
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.00,111,5000,21,27,13495,2548
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.00,111,5000,21,27,16500,2548
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,mpfi,2.68,3.47,9.00,154,5000,19,26,16500,2823
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,mpfi,3.19,3.40,10.00,102,5500,24,30,13950,2337
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,mpfi,3.19,3.40,8.00,115,5500,18,22,17450,2824
5,2,?,audi,gas,std,two,sedan,fwd,front,99.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,15250,2507
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,17710,2844
7,1,?,audi,gas,std,four,wagon,fwd,front,105.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,18920,2954
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,mpfi,3.13,3.40,8.30,140,5500,17,20,23875,3086
9,0,?,audi,gas,turbo,two,hatchback,4wd,front,99.5,...,mpfi,3.13,3.40,7.00,160,5500,16,22,?,3053


In [64]:
# Drop the rows (observations) that contain missing values
auto_prices.dropna()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price,New_Col
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.00,111,5000,21,27,13495,2548
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.00,111,5000,21,27,16500,2548
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,mpfi,2.68,3.47,9.00,154,5000,19,26,16500,2823
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,mpfi,3.19,3.40,10.00,102,5500,24,30,13950,2337
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,mpfi,3.19,3.40,8.00,115,5500,18,22,17450,2824
5,2,?,audi,gas,std,two,sedan,fwd,front,99.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,15250,2507
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,17710,2844
7,1,?,audi,gas,std,four,wagon,fwd,front,105.8,...,mpfi,3.19,3.40,8.50,110,5500,19,25,18920,2954
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,mpfi,3.13,3.40,8.30,140,5500,17,20,23875,3086
9,0,?,audi,gas,turbo,two,hatchback,4wd,front,99.5,...,mpfi,3.13,3.40,7.00,160,5500,16,22,?,3053
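
The treatment options listed earlier can be compared side by side on a small numeric frame whose missing values are already coded as NaN. This is a sketch under that assumption; the frame `toy` and its values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy numeric frame with NaN-coded missing values (illustrative values only).
toy = pd.DataFrame({"bore": [3.0, np.nan, 3.4, 3.2],
                    "stroke": [2.7, 3.1, np.nan, 3.5]})

# Remove every row that contains at least one missing value.
dropped = toy.dropna()

# Impute each column's missing values with that column's median.
median_filled = toy.fillna(toy.median(numeric_only=True))

# Forward fill: copy the previous valid value down.
forward_filled = toy.ffill()

# Backward fill: copy the next valid value up.
backward_filled = toy.bfill()

print(median_filled)
```

Forward and backward filling only make sense when neighboring rows are related, for example in ordered or time-indexed data.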


### Handling Outliers

In [None]:
# Handling outliers, method 1: clip at fixed bounds
# (clip_upper/clip_lower were removed from pandas; use clip instead)
auto_prices['wheel_base'].clip(upper=120)
auto_prices['wheel_base'].clip(lower=0)

In [None]:
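# Handling outliers, method 2: clip at the 5th and 95th percentiles
# (clip_upper/clip_lower were removed from pandas; use clip instead)
auto_prices['wheel_base'].head(10).clip(upper=auto_prices['wheel_base'].quantile(0.95))
auto_prices['wheel_base'].head(10).clip(lower=auto_prices['wheel_base'].quantile(0.05))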
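
Quantile-based clipping can be sketched on a toy series with one extreme value. The series `values` is an assumption for illustration, and the 5th and 95th percentiles are one common but arbitrary choice of bounds.

```python
import pandas as pd

# Toy series with one obvious outlier (illustrative values only).
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Compute the bounds once from the data.
lower = values.quantile(0.05)
upper = values.quantile(0.95)

# Apply both bounds in a single call; values outside are set to the bound.
clipped = values.clip(lower=lower, upper=upper)
print(clipped.tolist())
```

Clipping keeps every row but changes extreme values, so the result should be assigned back to the column (or a new column) if the treated values are to be used later.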

### Handling Categorical variables for analysis - Create Dummies for a Categorical Variable

In [65]:
pd.get_dummies(auto_prices['make'], prefix="D").head(10)

Unnamed: 0,D_alfa-romero,D_audi,D_bmw,D_chevrolet,D_dodge,D_honda,D_isuzu,D_jaguar,D_mazda,D_mercedes-benz,...,D_nissan,D_peugot,D_plymouth,D_porsche,D_renault,D_saab,D_subaru,D_toyota,D_volkswagen,D_volvo
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
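
Dummy coding is easier to follow on a column with only two categories. The sketch below assumes a toy series `fuel`. The `dtype=int` argument requests 0/1 columns explicitly, because newer pandas versions return boolean columns by default.

```python
import pandas as pd

# Toy categorical column (illustrative values only).
fuel = pd.Series(["gas", "diesel", "gas"], name="fuel_type")

# One indicator column per category, named with the given prefix.
dummies = pd.get_dummies(fuel, prefix="D", dtype=int)
print(dummies)
```

Each row has exactly one indicator set to 1, so the indicator columns always sum to 1 across a row.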


### Feature engineering and transforming variables

In most cases, machine learning is not done with the raw features. Features are transformed, or combined to form new features in forms which are more predictive. This process is known as **feature engineering**. In many cases, good feature engineering is more important than the details of the machine learning model used. It is often the case that good features can make even poor machine learning models work well, whereas, given poor features even the best machine learning model will produce poor results. As the famous saying goes, "garbage in, garbage out".

Some common approaches to feature engineering include:
- **Aggregating categories** of categorical variables to reduce the number of categories. Categorical features or labels with too many unique categories will limit the predictive power of a machine learning model. Aggregating categories can improve this situation, sometimes greatly. However, one must be careful. It only makes sense to aggregate categories that are similar in the domain of the problem. Thus, domain expertise must be applied.
- **Transforming numeric variables** to improve their distribution properties and make them covary more strongly with other variables. This process can be applied not only to features, but also to labels for regression problems. Some common transformations include **logarithmic** and **power** transformations, including squares and square roots.
- **Computing new features** from two or more existing features. These new features are often referred to as **interaction terms**. An interaction occurs when the behavior of, say, the product of the values of two features is significantly more predictive than the two features by themselves. Consider the probability of purchase for a luxury men's shoe. This probability depends on the interaction of the buyer being a man and the buyer being wealthy. As another example, consider the number of expected riders on a bus route. This value will depend on the interaction between the time of day and whether it is a holiday.
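
Two of these approaches, a logarithmic transformation and an interaction term, can be sketched briefly. The names `prices`, `bus`, `is_holiday`, and `is_rush_hour` are hypothetical, and the values are illustrative rather than taken from a real dataset.

```python
import numpy as np
import pandas as pd

# Toy right-skewed variable (illustrative values only).
prices = pd.Series([5151.0, 13495.0, 45400.0])

# A log transformation compresses the long right tail.
log_prices = np.log(prices)

# Toy indicator features for the bus-route example (hypothetical names).
bus = pd.DataFrame({"is_holiday": [0, 1, 1],
                    "is_rush_hour": [1, 0, 1]})

# Interaction term: the product is 1 only when both indicators are 1.
bus["holiday_rush"] = bus["is_holiday"] * bus["is_rush_hour"]

print(log_prices.round(2).tolist())
print(bus)
```

Whether a transformation helps should be checked by visualizing the variable before and after, in line with the iterative process described at the start.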

#### Aggregating categorical variables

When a dataset contains categorical variables, these need to be investigated to ensure that each category has sufficient samples. It is commonly the case that some categories have very few samples, or that there are so many similar categories that the distinctions between them are meaningless.

As a specific case, consider the number of cylinders in the cars. The frequency table produced earlier by `auto_prices['num_of_cylinders'].value_counts()` shows that four-cylinder cars dominate (159 of 205 cases), while the 'twelve' and 'three' categories contain a single case each.
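
Aggregating the cylinder categories could be sketched as follows. The series is rebuilt from the category counts printed earlier in this notebook so that the example is self-contained. The grouping in `cylinder_groups` is a hypothetical choice made for illustration; a real grouping should be justified with domain knowledge.

```python
import pandas as pd

# Category counts as printed by value_counts() earlier in the notebook.
counts = {"four": 159, "six": 24, "five": 11, "eight": 5,
          "two": 4, "twelve": 1, "three": 1}

# Rebuild a column with one entry per car.
cylinders = pd.Series(
    [name for name, n in counts.items() for _ in range(n)],
    name="num_of_cylinders",
)

# Hypothetical grouping of sparse categories into broader ones.
cylinder_groups = {
    "two": "four_or_fewer", "three": "four_or_fewer", "four": "four_or_fewer",
    "five": "five_six", "six": "five_six",
    "eight": "eight_twelve", "twelve": "eight_twelve",
}

# map() replaces each value with its group; unmapped values would become NaN.
aggregated = cylinders.map(cylinder_groups)
print(aggregated.value_counts())
```

With this grouping the seven original categories collapse to three, with 164, 35, and 6 cases respectively, so no category is left with a single case.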

## Summary

Good data preparation is the key to good machine learning performance. Data preparation, or data munging, is an interactive and iterative process. Continue to visualize the results as you test ideas. Expect to try many approaches, reject the ones that do not help, and keep the ones that do. In summary: test a lot of ideas, fail fast, and keep what works. The reward is that well-prepared data can improve the performance of almost any machine learning algorithm.