## Data Preprocessing

#### First, we import the data and inspect it.

In [1]:
import pandas as pd

data = pd.read_csv('raw_data.csv')

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
R&D Spend          47 non-null float64
Administration     49 non-null float64
Marketing Spend    49 non-null float64
State              50 non-null object
Profit             50 non-null float64
dtypes: float64(4), object(1)
memory usage: 2.0+ KB


We observe that there are 50 entries in total, with 3 NaNs in 'R&D Spend', 1 NaN in 'Administration', and 1 NaN in 'Marketing Spend'.
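The missing-value counts can also be read off directly instead of being derived from the `info()` output. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not the contents of `raw_data.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for raw_data.csv (values are made up).
toy = pd.DataFrame({
    "R&D Spend": [165349.2, np.nan, 153441.51, np.nan],
    "Administration": [136897.8, 151377.59, np.nan, 118671.85],
    "State": ["New York", "California", "Florida", "New York"],
})

# isna() marks the missing cells; sum() counts them column by column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` call on `data` should give the per-column counts listed above.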

In [3]:
data.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


While the first 3 features are numerical, 'State' is a categorical feature.

### Now, we're about to perform 4 tasks: 
#### 1. Fill Missing Values
#### 2. One-Hot encode 'State'
#### 3. Normalize
#### 4. Split training and testing data

Before we proceed, let's split the data frame into two arrays, X and y, where X is the matrix of features and y is the array of output values ('Profit').

In [4]:
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values
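As a quick illustration of what this slicing produces, here is a minimal sketch on a made-up three-row frame (`toy`, `X_demo` and `y_demo` are hypothetical names). It also shows that mixing numeric and string columns gives X the `object` dtype seen in later outputs:

```python
import pandas as pd

# Made-up rows: the last column plays the role of 'Profit'.
toy = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [10.0, 20.0, 30.0],
})

X_demo = toy.iloc[:, :-1].values  # every column except the last
y_demo = toy.iloc[:, -1].values   # only the last column

print(X_demo.shape, X_demo.dtype)
print(y_demo.shape, y_demo.dtype)
```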

### 1. Fill Missing values
#### We'll be using scikit-learn's Imputer class for this task

In [5]:
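# Imputer was removed in scikit-learn 0.22 (replaced by sklearn.impute.SimpleImputer)
from sklearn.preprocessing import Imputer
# Replace each NaN with its column mean (every column except 'State')
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
X[:,:-1] = imputer.fit_transform(X[:,:-1])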
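On a current scikit-learn release, the same mean imputation could be written with `SimpleImputer`. This is a minimal sketch on made-up numbers (`X_num` is a hypothetical array, not the notebook's data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric columns with gaps (illustrative only).
X_num = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# SimpleImputer always works column by column, so it takes no axis argument.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X_num)
print(X_filled)
```

Each NaN is replaced by the mean of the observed values in its own column.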

In [None]:
# data = data.dropna()
# data = data.reset_index()

### 2. One-Hot encoding

#### We'll first Label encode State, and then One-hot encode it

In [6]:
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
X[:, -1] = labelencoder.fit_transform(X[:, -1])

In [8]:
X[:,-1]

array([2, 0, 1, 2, 1, 2, 0, 1, 2, 0, 1, 0, 1, 0, 1, 2, 0, 2, 1, 2, 0, 2,
       1, 1, 2, 0, 1, 2, 1, 2, 1, 2, 0, 1, 0, 2, 1, 0, 2, 0, 0, 1, 0, 2,
       0, 2, 1, 0, 2, 0], dtype=object)

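But label encoding alone isn't enough. The integer codes imply an order (California = 0 < Florida = 1 < New York = 2) that doesn't exist between states, and a model may treat that order as meaningful.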
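To see where the integer codes come from, here is a small sketch with `LabelEncoder` on a made-up list of states (`states` and `codes` are hypothetical names). The classes are sorted alphabetically and numbered from 0, so the codes reflect spelling and nothing else:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of the 'State' column.
states = ["New York", "California", "Florida", "New York"]

encoder = LabelEncoder()
codes = encoder.fit_transform(states)

print(list(encoder.classes_))  # classes in alphabetical order
print(list(codes))             # position of each state in that order
```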

Let's one-hot encode it

In [9]:
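from sklearn.preprocessing import OneHotEncoder
# categorical_features was removed in scikit-learn 0.22 (replaced by sklearn.compose.ColumnTransformer)
onehotencoder = OneHotEncoder(categorical_features=[-1])
# The encoded columns are placed first, followed by the remaining features
X = onehotencoder.fit_transform(X).toarray()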

In [None]:
X

Now, instead of 'State', we have 3 new features, one for each state.
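On a current scikit-learn release, one way to get the same layout is to wrap `OneHotEncoder` in a `ColumnTransformer`. The sketch below assumes a made-up frame (`toy`) and is not the notebook's original cell. `OneHotEncoder` accepts strings directly, so no label-encoding step is needed here:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows in the notebook's column layout (values are illustrative only).
toy = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0],
    "State": ["New York", "California", "Florida"],
})

# One-hot encode 'State' and pass the other columns through untouched.
encoder = ColumnTransformer(
    transformers=[("state", OneHotEncoder(), ["State"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = encoder.fit_transform(toy)
print(X_encoded)
```

The three state columns come first (in alphabetical order of the state names), followed by the passed-through numeric column.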

### 3. Normalization
#### For this, we'll use scikit-learn's StandardScaler class

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X[:,3:] = scaler.fit_transform(X[:,3:])

In [None]:
X[:,3:]

Now each numeric feature has zero mean and unit variance, so the features are on a comparable scale.
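`StandardScaler` replaces each value x with (x - mean) / std, computed per column. A minimal check on made-up numbers (`X_num` is a hypothetical array) confirms the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales.
X_num = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)

# After scaling, every column has mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```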

### 4. Finally, train-test split
#### For this, we'll use scikit-learn's train_test_split function

In [None]:
from sklearn.model_selection import train_test_split

xTrain, xTest, yTrain, yTest = train_test_split(X, y, test_size = 0.2, random_state = 0)
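With 50 rows and `test_size = 0.2`, the split keeps 40 rows for training and 10 for testing. The sketch below checks this on placeholder arrays of the same length (`X_demo` and `y_demo` are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 50 rows; row i holds [2i, 2i + 1] and its target is i.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

Rows are shuffled before splitting, but each feature row stays paired with its own target value.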

#### Now, our data is clean and ready to be fed into a Machine Learning algorithm.