# Preprocessing the Data
### In our daily life, we deal with lots of data, but this data is in raw form. Before it can be used as input to machine learning algorithms, it must be converted into a meaningful form. This conversion is called data preprocessing.

### Data preprocessing steps
### Follow these steps to preprocess the data in Python −

# Step 1 :
### Importing the useful packages − In Python, this is the first step in converting the data into a suitable format, i.e., preprocessing. It can be done as follows −

In [6]:
import numpy as np
from sklearn import preprocessing

#### Here we have used the following two packages −

#### NumPy − NumPy is a general-purpose array-processing package designed to manipulate large multi-dimensional arrays efficiently while remaining fast for small arrays.

#### sklearn.preprocessing − This package provides many common utility functions and transformer classes that change raw feature vectors into a representation more suitable for machine learning algorithms.

# Step 2 :
#### Defining sample data − After importing the packages, we define some sample data to which the preprocessing techniques can be applied. We will use the following sample data −

In [7]:
Input_data = np.array([[2.1, -1.9, 5.5],
                      [-1.5, 2.4, 3.5],
                      [0.5, -7.9, 5.6],
                      [5.9, 2.3, -5.8]])

# Step 3 :
#### Applying a preprocessing technique − In this step, we apply one or more of the preprocessing techniques described in the next section to the sample data.

# Techniques for Data Preprocessing
### The techniques for data preprocessing are described below −

# 1. Rescale Data

### When your data comprises attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes so that they all have the same scale.

### The attributes are often rescaled into the range between 0 and 1. This is useful for optimization algorithms used at the core of machine learning algorithms, such as gradient descent. It is also useful for algorithms that weight inputs, such as regression and neural networks, and for algorithms that use distance measures, such as K-Nearest Neighbors.

### You can rescale your data in scikit-learn with the MinMaxScaler class.


In [8]:
# Python code to Rescale data (between 0 and 1)
import pandas
import scipy
import numpy
from sklearn.preprocessing import MinMaxScaler
#url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
url = "C:\\Users\\Administrator\\Desktop\\Data\\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
 
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
 
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[0.353 0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.254 0.183]
 [0.059 0.447 0.541 0.232 0.111 0.419 0.038 0.   ]
 [0.    0.688 0.328 0.354 0.199 0.642 0.944 0.2  ]]
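The rescaling performed by MinMaxScaler can also be written out by hand. The following is a minimal sketch, assuming the small sample array from Step 2 (redefined here under the hypothetical name `sample` so the block runs on its own) rather than the diabetes file. It subtracts each column's minimum and divides by the column's range, then compares the result with scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same values as the sample data from Step 2, redefined so the block is self-contained
sample = np.array([[2.1, -1.9, 5.5],
                   [-1.5, 2.4, 3.5],
                   [0.5, -7.9, 5.6],
                   [5.9, 2.3, -5.8]])

# Manual min-max rescaling, computed column by column
col_min = sample.min(axis=0)
col_max = sample.max(axis=0)
manual_scaled = (sample - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
sklearn_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(sample)

# Both approaches should agree up to floating-point rounding
print(np.allclose(manual_scaled, sklearn_scaled))
```

After rescaling, the smallest value in every column is 0 and the largest is 1, regardless of the original units.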


# 2. Standardize Data

### Standardization transforms attributes that have a Gaussian distribution with differing means and standard deviations into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
### It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis.
### You can standardize data using scikit-learn with the StandardScaler class.


In [9]:
# Python code to Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
#url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
url = "C:\\Users\\Administrator\\Desktop\\Data\\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
 
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
 
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
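Standardization amounts to subtracting each column's mean and dividing by its standard deviation. The sketch below illustrates this on the small sample array from Step 2 (redefined as the hypothetical name `sample`), not on the diabetes data, and checks the manual result against StandardScaler. It assumes the population standard deviation, which is the default of NumPy's `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same values as the sample data from Step 2, redefined so the block is self-contained
sample = np.array([[2.1, -1.9, 5.5],
                   [-1.5, 2.4, 3.5],
                   [0.5, -7.9, 5.6],
                   [5.9, 2.3, -5.8]])

# Manual standardization: z = (x - mean) / std, per column
manual_std = (sample - sample.mean(axis=0)) / sample.std(axis=0)

# The same transformation through scikit-learn
sklearn_std = StandardScaler().fit_transform(sample)

# Both approaches should agree up to floating-point rounding
print(np.allclose(manual_std, sklearn_std))
```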


# 3. Normalization

### Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).
### This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values, such as neural networks, and algorithms that use distance measures, such as K-Nearest Neighbors.
### You can normalize data in Python with scikit-learn using the Normalizer class or the preprocessing.normalize function.
### Normalization modifies the feature vectors so that they can be measured on a common scale. The following are two types of normalization that can be used in machine learning −

## L1 Normalization
#### It is also referred to as Least Absolute Deviations. This kind of normalization modifies the values so that the absolute values in each row sum to 1. It can be applied to the input data with the following Python code −

In [12]:
# Normalize data
data_normalized_l1 = preprocessing.normalize(Input_data, norm = 'l1')
print("\nL1 normalized data:\n", data_normalized_l1)


L1 normalized data:
 [[ 0.221 -0.2    0.579]
 [-0.203  0.324  0.473]
 [ 0.036 -0.564  0.4  ]
 [ 0.421  0.164 -0.414]]


## L2 Normalization

### It is also referred to as least squares. This kind of normalization modifies the values so that the squares of the values in each row sum to 1. It can be applied to the input data with the following Python code −

In [14]:
# Normalize data
data_normalized_l2 = preprocessing.normalize(Input_data, norm = 'l2')
print("\nL2 normalized data:\n", data_normalized_l2)


L2 normalized data:
 [[ 0.339 -0.307  0.889]
 [-0.333  0.533  0.778]
 [ 0.052 -0.815  0.578]
 [ 0.687  0.268 -0.675]]


In [15]:
# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer
import pandas
import numpy
#url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
url = "C:\\Users\\Administrator\\Desktop\\Data\\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])

[[0.034 0.828 0.403 0.196 0.    0.188 0.004 0.28 ]
 [0.008 0.716 0.556 0.244 0.    0.224 0.003 0.261]
 [0.04  0.924 0.323 0.    0.    0.118 0.003 0.162]
 [0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]
 [0.    0.596 0.174 0.152 0.731 0.188 0.01  0.144]]
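The L1 and L2 definitions can be checked directly on the normalized output. This is a small sketch, assuming the sample array from Step 2 (redefined here as the hypothetical name `sample`). It sums the absolute values of each L1-normalized row and the squares of each L2-normalized row.

```python
import numpy as np
from sklearn import preprocessing

# Same values as the sample data from Step 2, redefined so the block is self-contained
sample = np.array([[2.1, -1.9, 5.5],
                   [-1.5, 2.4, 3.5],
                   [0.5, -7.9, 5.6],
                   [5.9, 2.3, -5.8]])

l1_rows = preprocessing.normalize(sample, norm='l1')
l2_rows = preprocessing.normalize(sample, norm='l2')

# Per-row sum of absolute values (L1) and per-row sum of squares (L2)
l1_sums = np.abs(l1_rows).sum(axis=1)
l2_sums = (l2_rows ** 2).sum(axis=1)

print("Row sums of |x| after L1 normalization:", l1_sums)
print("Row sums of x^2 after L2 normalization:", l2_sums)
```

Every entry in both printed arrays should be 1 up to floating-point rounding, because normalization works row by row rather than column by column.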


# 4. Binarization
You can transform your data using a binary threshold. All values above the threshold are marked 1 and all values equal to or below it are marked 0.

### This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering, when you want to add new features that indicate something meaningful.

### You can create new binary attributes in Python using scikit-learn with the Binarizer class.
### Binarization is the preprocessing technique to use when numerical values need to be converted into Boolean values. The following code binarizes the sample input data with a threshold of 0.5: all values above 0.5 are converted to 1, and all values equal to or below 0.5 are converted to 0 −

In [17]:
data_binarized = preprocessing.Binarizer(threshold = 0.5).transform(Input_data)
print("\nBinarized data:\n", data_binarized)


Binarized data:
 [[1. 0. 1.]
 [0. 1. 1.]
 [0. 0. 1.]
 [1. 1. 0.]]


In [16]:
# Python code for binarization
from sklearn.preprocessing import Binarizer
import pandas
import numpy
#url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
url = "C:\\Users\\Administrator\\Desktop\\Data\\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
 
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
 
# summarize transformed data
numpy.set_printoptions(precision=3)
print(binaryX[0:5,:])

[[1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1. 1. 1.]]
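The thresholding rule itself is a single strict comparison. As a minimal sketch, assuming the sample array from Step 2 (redefined as the hypothetical name `sample`), the block below reproduces the Binarizer result with plain NumPy.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Same values as the sample data from Step 2, redefined so the block is self-contained
sample = np.array([[2.1, -1.9, 5.5],
                   [-1.5, 2.4, 3.5],
                   [0.5, -7.9, 5.6],
                   [5.9, 2.3, -5.8]])

threshold = 0.5

# scikit-learn: values strictly greater than the threshold become 1, the rest 0
sklearn_binary = Binarizer(threshold=threshold).transform(sample)

# Equivalent NumPy expression using a strict "greater than" comparison
manual_binary = (sample > threshold).astype(float)

print(np.array_equal(sklearn_binary, manual_binary))
```

Note that the value 0.5 in the third row is exactly equal to the threshold and is therefore mapped to 0, not 1.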


# 5. Mean Removal
### Mean removal is another very common preprocessing technique in machine learning. It subtracts the mean from each feature so that every feature is centered on zero, which removes bias from the features in the feature vector. It is usually combined with scaling each feature to unit standard deviation. The following code displays the mean and standard deviation of each column of the sample data before any mean removal is applied −

In [19]:
print("Mean = ", Input_data.mean(axis = 0))
print("Std deviation = ", Input_data.std(axis = 0))

Mean =  [ 1.75  -1.275  2.2  ]
Std deviation =  [2.714 4.2   4.694]
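The values above describe the data before mean removal. One way to carry out the removal itself is scikit-learn's `preprocessing.scale` function, sketched below under the assumption that the sample array from Step 2 is the input (redefined here as the hypothetical name `sample`). This is an illustrative choice, not necessarily the only way to perform the step.

```python
import numpy as np
from sklearn import preprocessing

# Same values as the sample data from Step 2, redefined so the block is self-contained
sample = np.array([[2.1, -1.9, 5.5],
                   [-1.5, 2.4, 3.5],
                   [0.5, -7.9, 5.6],
                   [5.9, 2.3, -5.8]])

# Center each column on zero and scale it to unit standard deviation
data_scaled = preprocessing.scale(sample)

# The column means should now be (numerically) zero and the deviations one
print("Mean =", data_scaled.mean(axis=0))
print("Std deviation =", data_scaled.std(axis=0))
```

The printed means are typically tiny nonzero numbers rather than exact zeros, which is ordinary floating-point rounding.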


# 6. Labeling the Data
We already know that machine learning algorithms need data in a certain format. Another important requirement is that the data must be labelled properly before it is passed to those algorithms. In classification, for example, the data carries many labels, which may be words, numbers, and so on. The machine learning functions in sklearn expect numeric labels, so labels in any other form must be converted to numbers. This process of transforming word labels into numerical form is called label encoding.

### Label encoding steps
### Follow these steps for encoding the data labels in Python −

## Step 1 − Importing the useful packages
If we are using Python, this is the first step in converting the data into the required format, i.e., preprocessing. It can be done as follows −

In [21]:
import numpy as np
from sklearn import preprocessing

## Step 2 − Defining sample labels
After importing the packages, we need to define some sample labels so that we can create and train the label encoder. We will now define the following sample labels −

In [22]:
# Sample input labels
input_labels = ['red','black','red','green','black','yellow','white']

## Step 3 − Creating and training the label encoder object

In this step, we need to create the label encoder and train it. The following Python code will help in doing this −

In [23]:
# Creating the label encoder
encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

LabelEncoder()

## Step 4 − Checking the encoder by encoding a randomly ordered list
This step checks the encoder by encoding a randomly ordered list of labels. The following Python code does this, first printing the labels and then their encoded values −

In [24]:
# encoding a set of labels
test_labels = ['green','red','black']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)


Labels = ['green', 'red', 'black']


In [25]:
print("Encoded values =", list(encoded_values))

Encoded values = [1, 2, 0]


## Step 5 − Checking the encoder by decoding a random set of numbers
This step checks the encoder by decoding a random set of numbers. The following Python code does this, first printing the encoded values and then the decoded labels −

In [26]:
# decoding a set of values
encoded_values = [3,0,4,1]
decoded_list = encoder.inverse_transform(encoded_values)
print("\nEncoded values =", encoded_values)


Encoded values = [3, 0, 4, 1]




In [27]:
print("\nDecoded labels =", list(decoded_list))


Decoded labels = ['white', 'black', 'yellow', 'green']
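The encoded numbers follow from the sorted order of the distinct labels, which the fitted encoder exposes through its `classes_` attribute. The sketch below rebuilds the encoder from Step 3 so that it runs on its own. The dictionary name `label_to_code` and the unseen label `'purple'` are illustrative assumptions.

```python
from sklearn import preprocessing

# Same sample labels as in Step 2
input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']

encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

# classes_ holds the distinct labels in sorted order; the position is the code
label_to_code = {str(label): code for code, label in enumerate(encoder.classes_)}
print(label_to_code)

# A label that was not present during fit cannot be encoded
try:
    encoder.transform(['purple'])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print("Unseen label rejected:", unseen_rejected)
```

Because the codes depend on sorted order, `'black'` maps to 0 and `'yellow'` to 4, which matches the encoded and decoded values shown in Steps 4 and 5.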


# Dealing with Missing Data

In [29]:
import pandas as pd
from io import StringIO

# Small CSV sample with two missing cells: column C in row 1 and column D in row 2
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))
df

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


### For a larger DataFrame, it can be tedious to look for missing values manually. In that case, we can use the isnull method to return a DataFrame of Boolean values that indicate whether a cell contains a value (False) or is missing (True). Using the sum method, we can then count the missing values per column as follows:


In [30]:
 df.isnull().sum() 

A    0
B    0
C    1
D    1
dtype: int64

In [31]:
df.values 

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])

### Eliminating samples or features with missing values
### One of the easiest ways to deal with missing data is to remove the corresponding features (columns) or samples (rows) from the dataset entirely. Rows with missing values can be dropped with the dropna method:


In [32]:
df.dropna()

     A    B    C    D
0  1.0  2.0  3.0  4.0


#### Similarly, we can drop columns that have at least one NaN in any row by setting the axis argument to 1:


In [33]:
df.dropna(axis=1)

      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0


#### The dropna method supports several additional parameters that can come in handy:


In [34]:
#  only drop rows where all columns are NaN 
df.dropna(how='all') 

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


In [35]:
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4) 

     A    B    C    D
0  1.0  2.0  3.0  4.0


In [36]:
# only drop rows where NaN appear in specific columns (here: 'C') 
df.dropna(subset=['C'])

      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN


## Imputing missing values 

Often, removing samples or dropping entire feature columns is not feasible, because we might lose too much valuable data. In this case, we can use different interpolation techniques to estimate the missing values from the other training samples in our dataset. One of the most common interpolation techniques is mean imputation, where we replace each missing value with the mean of the entire feature column. A convenient way to achieve this is the SimpleImputer class from scikit-learn, which replaces the Imputer class that was removed from sklearn.preprocessing in scikit-learn 0.22, as shown in the following code:

In [38]:
import numpy as np
from sklearn.impute import SimpleImputer
# SimpleImputer always imputes column by column, so the old axis=0 argument is not needed
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
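The same mean imputation can be done without leaving pandas. The following is a minimal sketch, assuming the same four-column sample as above, rebuilt here so the block is self-contained. It fills each missing cell with its column mean using `fillna`, and the name `df_filled` is illustrative.

```python
import pandas as pd
from io import StringIO

# Same small CSV sample as above, with two missing cells
csv_data = "A,B,C,D\n1.0,2.0,3.0,4.0\n5.0,6.0,,8.0\n10.0,11.0,12.0,\n"
df = pd.read_csv(StringIO(csv_data))

# df.mean() skips NaN by default, so each column mean uses only the observed values
column_means = df.mean()

# Replace every NaN with the mean of its own column
df_filled = df.fillna(column_means)
print(df_filled)
```

This keeps the column labels and index, whereas the scikit-learn imputer returns a plain NumPy array. The scikit-learn version is the more convenient choice when the same fitted means must later be applied to new data.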