## Data Preprocessing

#### First, we begin by importing the data and analysing it.

In [1]:
import pandas as pd

data = pd.read_csv('raw_data.csv')

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
R&D Spend          47 non-null float64
Administration     49 non-null float64
Marketing Spend    49 non-null float64
State              50 non-null object
Profit             50 non-null float64
dtypes: float64(4), object(1)
memory usage: 2.0+ KB


We observe that there are 50 entries in total. Of these, 3 values are NaN in 'R&D Spend', 1 in 'Administration', and 1 in 'Marketing Spend'.
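
The same counts can be read off directly rather than by subtracting from 50. The sketch below is a minimal illustration on a hypothetical toy frame (the `toy` values and its reduced column set are assumptions, not the contents of `raw_data.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for raw_data.csv
toy = pd.DataFrame({
    "R&D Spend": [165349.2, np.nan, 153441.51, np.nan],
    "Administration": [136897.8, 151377.59, np.nan, 118671.85],
    "State": ["New York", "California", "Florida", "New York"],
})

# isna() flags missing cells; sum() counts the flags per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real data, `data.isna().sum()` should give the per-column NaN counts in one call.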

In [3]:
data.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


The first 3 features are numerical, while 'State' is a categorical feature. 'Profit' is the output we want to predict.

### Now, we're about to perform 4 tasks: 
#### 1. Fill Missing Values
#### 2. One-Hot encode 'State'
#### 3. Normalize
#### 4. Split training and testing data

Before we proceed, let's split the data frame into two arrays, X and y, where X is the matrix of features (every column except the last) and y is the array of output values (the last column, 'Profit').

In [6]:
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values

### 1. Fill Missing values
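#### We'll use scikit-learn's SimpleImputer class for this task

In [10]:
import numpy as np
from sklearn.impute import SimpleImputer

# The older Imputer class was removed in scikit-learn 0.22; SimpleImputer works column-wise
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Replace each NaN in the three numerical columns with that column's mean
X[:, :-1] = imputer.fit_transform(X[:, :-1])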
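
To make the effect of mean imputation concrete, here is a small sketch on a hypothetical 3x2 array (the `toy` values are made up). It assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer` (0.20 or later):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix with one missing value in each column
toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# Each NaN is replaced by the mean of the observed values in its own column
filled = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(toy)
print(filled)
```

The missing entry in the first column becomes the mean of 1 and 3, and the one in the second column becomes the mean of 10 and 20.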

### 2. One-Hot encoding

#### We'll first label-encode 'State', and then one-hot encode it

In [16]:
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
X[:, -1] = labelencoder.fit_transform(X[:, -1])

In [17]:
X[:,-1]

array([2, 0, 1, 2, 1, 2, 0, 1, 2, 0, 1, 0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 2,
       1, 1, 2, 0, 1, 2, 1, 2, 1, 2, 0, 1, 0, 2, 1, 0, 2, 0, 0, 1, 0, 2,
       0, 2, 1, 0, 2, 0], dtype=object)
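
To see which integer stands for which state, the fitted encoder's `classes_` attribute can be inspected. A minimal sketch on a hypothetical list of four state names (the variable names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the 'State' column
states = ["New York", "California", "Florida", "New York"]

encoder = LabelEncoder()
codes = encoder.fit_transform(states)

# classes_ is sorted, so a label's position in it is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(codes)
print(mapping)
```

Because the classes are sorted alphabetically, California maps to 0, Florida to 1 and New York to 2, which matches the first rows of the output above.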

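But label encoding alone isn't enough. The integers 0, 1 and 2 imply an order and a distance between the states (for example, that New York = 2 is "greater than" California = 0) that does not exist, and a model that treats the column as numeric will learn from that spurious order.

Let's one-hot encode it instead, so that each state gets its own 0/1 column.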

In [19]:
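from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer encodes column 3 ('State') and passes the other columns through
transformer = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough', sparse_threshold=0)
X = transformer.fit_transform(X).astype(float)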

In [20]:
X

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 1.65349200e+05,
        1.36897800e+05, 4.71784100e+05],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.62597700e+05,
        1.51377590e+05, 4.43898530e+05],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 1.53441510e+05,
        1.01145550e+05, 4.07934540e+05],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 1.44372410e+05,
        1.18671850e+05, 3.83199620e+05],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 1.42107340e+05,
        9.13917700e+04, 3.66168420e+05],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 1.31876900e+05,
        9.98147100e+04, 3.62861360e+05],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.34615460e+05,
        1.47198870e+05, 1.27716820e+05],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 7.56549500e+04,
        1.45530060e+05, 3.23876680e+05],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 1.20542520e+05,
        1.48718950e+05, 

Now, instead of 'State', we have 3 new binary features, one per state, in the first three columns. The three numerical features follow them.
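
A compact way to check what the one-hot columns mean is to run the encoder on a few state names directly. This is a sketch on hypothetical input, not the notebook's own cell:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of the 'State' column, shaped as a single-column matrix
states = np.array([["New York"], ["California"], ["Florida"], ["New York"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(states).toarray()

# categories_[0] gives the state that each output column represents
print(encoder.categories_[0])
print(encoded)
```

Each row contains exactly one 1, in the column of that row's state, so no ordering between states is implied.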

### 3. Normalization
#### For this, we'll use scikit-learn's StandardScaler class, which rescales each column to zero mean and unit variance

In [23]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X[:,3:] = scaler.fit_transform(X[:,3:])

In [24]:
X[:,3:]

array([[ 2.12598736e+00,  5.58751776e-01,  2.18727067e+00],
       [ 2.06076964e+00,  1.08085969e+00,  1.94943591e+00],
       [ 1.84374409e+00, -7.30392097e-01,  1.64270066e+00],
       [ 1.62878280e+00, -9.84340453e-02,  1.43173764e+00],
       [ 1.57509474e+00, -1.08209096e+00,  1.28647930e+00],
       [ 1.33260666e+00, -7.78379126e-01,  1.25827354e+00],
       [ 1.39751767e+00,  9.30184659e-01, -7.47263636e-01],
       [-3.44918296e-16,  8.70011210e-01,  9.25774961e-01],
       [ 1.06395233e+00,  9.84995246e-01,  8.21181061e-01],
       [ 1.13013854e+00, -4.58747094e-01,  7.64619847e-01],
       [ 6.22386078e-01, -3.89698764e-01,  1.17948773e-01],
       [ 5.92968301e-01, -1.06770971e+00,  2.93505368e-01],
       [ 4.31595990e-01,  2.13412047e-01,  2.94314681e-01],
       [ 3.87239884e-01,  5.08172560e-01,  3.18413158e-01],
       [ 1.04974784e+00,  1.26727186e+00,  3.51232491e-01],
       [ 9.21288487e-01,  4.38132172e-02,  3.96123024e-01],
       [ 5.58945346e-02,  7.05996503e-03

Now, each numerical column has zero mean and unit variance, so all three are on a comparable scale. The one-hot columns are left as 0/1.
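
The effect of the scaler can be verified numerically. A minimal sketch on a hypothetical two-column array whose columns differ in scale by a factor of 100:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns on very different scales
values = np.array([
    [100.0, 1.0],
    [200.0, 2.0],
    [300.0, 3.0],
    [400.0, 4.0],
])

# Each column is transformed as (x - column mean) / column standard deviation
scaled = StandardScaler().fit_transform(values)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, the two columns are identical, because they differed only by a constant factor.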

### 4. Finally, train-test split
#### For this, we'll use scikit-learn's train_test_split function

In [25]:
from sklearn.model_selection import train_test_split

xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2, random_state = 0)
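
With 50 rows and `test_size = 0.2`, the split should leave 40 rows for training and 10 for testing. The sketch below checks this on hypothetical placeholder arrays of the same shape as X and y (the values themselves are meaningless):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: 50 rows, 6 feature columns, 50 targets
features = np.arange(300, dtype=float).reshape(50, 6)
targets = np.arange(50, dtype=float)

x_train, x_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
```

One design choice worth noting: this walkthrough scales the features before splitting. A common alternative is to fit the scaler on the training rows only and reuse it on the test rows, so that test-set statistics do not influence training.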

#### Now, our data is clean and ready to be fed into a machine learning algorithm.