# Data Preprocessing

Data preprocessing is the process of preparing raw data for analysis. It is typically the first step in building an ML model and covers:

1. Dealing with missing data.
2. Dealing with categorical data.
3. Splitting the dataset into training and testing sets.
4. Rescaling data.
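
The four steps above can also be chained into one preprocessing object. The following is only a minimal sketch under stated assumptions: the rows in `toy_df` are made up (they merely reuse the column names of `Data.csv`), the names `numeric` and `preprocess` are hypothetical, and the split is done first so that the imputer, encoder and scaler learn from training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that reuse the column names of Data.csv (assumption, not the real data)
toy_df = pd.DataFrame({
    "Acquisition_Mode": ["Direct", "Email", "Tele", "Email", "Tele", "Direct"],
    "Age": [54.0, 38.0, np.nan, 52.0, 35.0, 47.0],
    "Income": [25000.0, 47995.0, 54014.0, np.nan, 61014.0, 58011.0],
    "Subscribe": ["No", "Yes", "Yes", "No", "Yes", "Yes"],
})
features = toy_df.drop(columns="Subscribe")
target = (toy_df["Subscribe"] == "Yes").astype(int)

# Step 3: split first, so steps 1, 2 and 4 are fitted on training rows only
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.33, random_state=0
)

# Steps 1 and 4 for the numeric columns: mean imputation, then standardisation
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Step 2 for the categorical column: one-hot encoding
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["Age", "Income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Acquisition_Mode"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_ready = preprocess.fit_transform(f_train)  # learn statistics from training rows
test_ready = preprocess.transform(f_test)        # reuse them on the test rows
print(train_ready.shape, test_ready.shape)
```

The cells below walk through the same steps one at a time.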


### Dealing with missing data

In [1]:
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv("Data.csv")
df.head()

# The goal is to predict whether a user will subscribe, so Subscribe is the dependent variable.
# Split the data into independent variables (X) and the dependent variable (y).

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# Handle missing values with scikit-learn
from sklearn.impute import SimpleImputer

# SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')   # Replace NaN with the column mean
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

X



array([['Direct', 54.0, 25000.0],
       ['Email', 38.0, 47995.0],
       ['Tele', 37.0, 54014.0],
       ['Email', 52.0, 61014.0],
       ['Tele', 35.0, 59673.166666666664],
       ['Direct', 47.166666666666664, 58011.0],
       ['Email', 35.0, 52001.0],
       ['Direct', 45.0, 79013.0],
       ['Tele', 65.0, 83011.0],
       ['Direct', 48.0, 67007.0],
       ['Tele', 48.0, 59673.166666666664],
       ['Direct', 46.0, 57995.0],
       ['Email', 47.166666666666664, 52014.0],
       ['Direct', 63.0, 79003.0]], dtype=object)
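
To see what mean imputation does in isolation, here is a small sketch on a made-up array. It assumes a recent scikit-learn, where the imputer class is `SimpleImputer` in `sklearn.impute`; the array `toy` and its values are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up Age/Income columns with one gap in each column
toy = np.array([[30.0, 1000.0],
                [40.0, np.nan],
                [np.nan, 3000.0]])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_filled = toy_imputer.fit_transform(toy)

# The means are computed from the observed values only:
# Age -> (30 + 40) / 2 = 35, Income -> (1000 + 3000) / 2 = 2000
print(toy_imputer.statistics_)
print(toy_filled)
```

Each gap is filled with the mean of its own column, which is why the repeated values `47.1666...` and `59673.1666...` appear in the output above.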

### Handling categorical features
### Categorical variables: Subscribe and Acquisition_Mode (non-numerical data)

In [2]:
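from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# On independent variables

# One-hot encode column 0 (Acquisition_Mode) and pass Age and Income through unchanged.
# OneHotEncoder accepts string categories directly, so no LabelEncoder step is needed here.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
X = np.array(ct.fit_transform(X), dtype=float)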

# On dependent variable

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
y


array([0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1])
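
The two encodings can be compared on a handful of made-up values. This is only a sketch: `modes` and `answers` are illustrative inputs, not rows taken from `Data.csv`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# One-hot encoding: one 0/1 column per category, suited to input features
modes = np.array([["Direct"], ["Email"], ["Tele"], ["Email"]])
mode_encoder = OneHotEncoder()
dummies = mode_encoder.fit_transform(modes).toarray()
print(mode_encoder.categories_)  # categories are sorted: Direct, Email, Tele
print(dummies)

# Label encoding: a single integer per class, suited to the target
answers = ["No", "Yes", "Yes", "No"]
labels = LabelEncoder().fit_transform(answers)
print(labels)
```

One-hot encoding avoids implying an order between Direct, Email and Tele, whereas a plain integer code is sufficient for a two-class target such as Subscribe.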

### Splitting data into training and test sets

### train : 80%  (0.8)
### test : 20%   (0.2)

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
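
With 14 rows, `test_size = 0.2` cannot give an exact 80/20 split, so scikit-learn rounds the test set up to 3 rows and leaves 11 for training. The sketch below checks this on placeholder arrays (`X_demo` and `y_demo` are made up and only match the row count of the dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 14 rows, the same row count as the dataset above
X_demo = np.arange(28).reshape(14, 2)
y_demo = np.arange(14) % 2

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # 11 training rows and 3 test rows

# A fixed random_state makes the split reproducible
X_tr_again, X_te_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(np.array_equal(X_te, X_te_again))
```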

In [4]:
df.head()

| | Acquisition_Mode | Age | Income | Subscribe |
|---|---|---|---|---|
| 0 | Direct | 54.0 | 25000.0 | No |
| 1 | Email | 38.0 | 47995.0 | Yes |
| 2 | Tele | 37.0 | 54014.0 | Yes |
| 3 | Email | 52.0 | 61014.0 | No |
| 4 | Tele | 35.0 | NaN | Yes |


### Feature Scaling
### Standardise the features when the independent variables differ greatly in range

In [5]:
from sklearn.preprocessing import StandardScaler
SS_X = StandardScaler()
X_train = SS_X.fit_transform(X_train) # Scaling is applied only to the independent variables
X_test = SS_X.transform(X_test)       # Reuse the training mean and standard deviation; do not refit on the test set
X_train

array([[ 0.91287093, -0.61237244, -0.47140452, -0.25768236, -0.01789735],
       [-1.09544512, -0.61237244,  2.12132034, -1.57719374, -0.29838984],
       [ 0.91287093, -0.61237244, -0.47140452,  2.23472803,  1.46228007],
       [ 0.91287093, -0.61237244, -0.47140452,  0.03554239,  0.61706833],
       [-1.09544512,  1.63299316, -0.47140452, -1.43058136, -0.72247533],
       [ 0.91287093, -0.61237244, -0.47140452, -0.40429473,  1.46298465],
       [-1.09544512, -0.61237244,  2.12132034,  0.03554239,  0.10034258],
       [-1.09544512,  1.63299316, -0.47140452,  0.6219919 ,  0.19481474],
       [ 0.91287093, -0.61237244, -0.47140452,  0.91521665, -2.34265239],
       [ 0.91287093, -0.61237244, -0.47140452, -0.08663459, -0.01677002],
       [-1.09544512,  1.63299316, -0.47140452, -0.08663459, -0.43930544]])
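
Standardisation replaces each value x with (x - mean) / std, where the mean and standard deviation are computed per column. A minimal sketch on made-up numbers, assuming the scaler is fitted on the training rows and then reused for the test rows (`train_demo` and `test_demo` are illustrative arrays):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows with two columns on very different scales
train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_demo = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)  # learns mean and std from the training rows
test_scaled = scaler.transform(test_demo)        # applies the same mean and std

# After scaling, every training column has mean 0 and standard deviation 1
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))

# The test rows are scaled with the training statistics, not their own
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)
print(np.allclose(test_scaled, manual))
```

Reusing the training statistics keeps the test set on the same scale as the data the model was trained on.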