
# Data Preprocessing Tools

## Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Importing the dataset

First identify the features (independent variables) and the dependent variable.
The model predicts the value of the dependent variable from the features.


`iloc` selects rows and columns by integer position:

- `iloc[:]` / `iloc[:, :]` : all rows and all columns
- `iloc[0]` : first row, all columns
- `iloc[1:4]` : rows 1 to 3, all columns
- `iloc[:, :2]` : all rows, columns at index 0 and 1
- `iloc[:, :-1]` : all rows, all columns except the last one
- `iloc[:, :-2]` : all rows, all columns except the last two
- `iloc[:, -1]` : all rows, only the last column

Slice notation:

- `:` : a range (the end index is excluded)
- `-1` : the last column
- `2:-1` : from index 2 up to, but excluding, the last column
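The slicing rules above can be tried on a small frame. This is a minimal sketch: the DataFrame `df` and its values are hypothetical stand-ins, not the contents of Data.csv.

```python
import pandas as pd

# Hypothetical toy frame used only to illustrate iloc (not the real Data.csv).
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44, 27, 30, 38],
    "Salary": [72000, 48000, 54000, 61000],
    "Purchased": ["No", "Yes", "No", "No"],
})

all_but_last = df.iloc[:, :-1]   # all rows, every column except the last
last_only = df.iloc[:, -1]       # all rows, only the last column (a Series)
rows_1_to_3 = df.iloc[1:4]       # rows at positions 1, 2 and 3, all columns
first_two_cols = df.iloc[:, :2]  # all rows, columns at positions 0 and 1
middle = df.iloc[:, 2:-1]        # from column 2 up to, but excluding, the last

print(all_but_last.columns.tolist())    # ['Country', 'Age', 'Salary']
print(last_only.tolist())               # ['No', 'Yes', 'No', 'No']
print(rows_1_to_3.shape)                # (3, 4)
print(first_two_cols.columns.tolist())  # ['Country', 'Age']
print(middle.columns.tolist())          # ['Salary']
```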


In [2]:
# Read the dataset from file using pandas and store it in a variable
dataset = pd.read_csv("Data.csv")

# Select values by position: X holds every column except the last (the features),
# Y holds only the last column (the dependent variable)
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

In [3]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [4]:
print(Y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

1. Delete the rows with missing data. This is acceptable when the dataset is large and only a small fraction of rows is affected.
2. Replace the missing values with a value chosen by some strategy
(mean / median / most frequent value / a specified constant).

scikit-learn -> SimpleImputer -> fills in (replaces) missing data.
Import the SimpleImputer class from the sklearn.impute module in scikit-learn, a popular machine learning library in Python.

fit method: computes the mean of each column, ignoring the missing values.
transform method: replaces each missing value with the mean of its column.
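As a minimal sketch of the fit/transform split, the snippet below runs SimpleImputer on a small numeric array. The array `data` is illustrative and not taken from Data.csv.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative (age, salary) rows with one missing value in each column.
data = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(data)                 # learns one mean per column, ignoring NaN
filled = imputer.transform(data)  # replaces each NaN with its column mean

print(imputer.statistics_)  # the learned column means
print(filled)
```

The learned means are stored in `imputer.statistics_`, so the same values can later be reused on new data by calling `transform` alone.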

In [5]:
from sklearn.impute import SimpleImputer

# Instance of the SimpleImputer class
# SimpleImputer(which values to replace, which strategy to replace them with)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Connect the imputer to our feature matrix.
# With strategy='mean', fit accepts only numeric columns, not text or categorical data,
# so we pass columns 1 and 2 and leave out the country column (index 0)
imputer.fit(X[:, 1:3])

# transform takes the same columns as fit
# and returns them with the missing values replaced
X[:, 1:3] = imputer.transform(X[:, 1:3])





In [6]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encoding categorical data

### Encoding the Independent Variable

The Country column (France, Spain, Germany) is categorical data.

ColumnTransformer: lets you apply different preprocessing steps to different columns of your dataset, which is useful when the data is a mix of numerical and categorical columns.
It lets you define a pipeline that applies:

1. Imputation to numerical columns
2. Encoding to categorical columns
3. Scaling to selected columns ...all in one go
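The three steps listed above can be combined in a single ColumnTransformer. This is a sketch under stated assumptions: the toy frame `df`, its column names, and the use of `Pipeline` to chain imputation and scaling are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame; column names are assumptions for illustration.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# Numeric columns: fill missing values with the mean, then standardize.
numeric_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(), ["Country"]),        # encode the categorical column
    ("num", numeric_steps, ["Age", "Salary"]),    # impute and scale the numeric ones
])

result = preprocess.fit_transform(df)
# The result can be a sparse matrix depending on the data; densify if so.
if hasattr(result, "toarray"):
    result = result.toarray()
result = np.asarray(result, dtype=float)

print(result.shape)  # 3 one-hot columns + 2 numeric columns
print(result)
```

In a real workflow the transformer would be fitted on the training set only and then applied to the test set with `transform`.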

StandardScaler: standardizes numerical features by removing the mean and scaling to unit variance.

OneHotEncoder: converts categorical variables into a numeric format that can be provided to ML algorithms.
One-hot encoding creates one binary vector per category. The categories are ordered alphabetically, so here:
- France: [1,0,0]
- Germany: [0,1,0]
- Spain: [0,0,1]
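A minimal sketch of OneHotEncoder on its own, using an illustrative `countries` array. The column assigned to each country follows the sorted order stored in `categories_`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative single-column input; OneHotEncoder expects a 2D array.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(encoded)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```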


In [7]:
# Import the ColumnTransformer class from sklearn.compose and OneHotEncoder from sklearn.preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Create an instance of the ColumnTransformer class.
# One-hot encoding replaces the categorical column with binary columns,
# one per distinct value; here there are three (France, Germany, Spain).
# transformers=[(name, transformer object, indices of the columns to transform)]
# remainder='passthrough' keeps the remaining columns unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# Apply it to the feature matrix X.
# Later machine learning steps expect X as a NumPy array,
# so we wrap the output of fit_transform in np.array
X = np.array(ct.fit_transform(X))

# In the output, each country is now represented by three binary columns at the start of the row



In [8]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

LabelEncoder converts categorical labels (strings) into integers. It is intended for the dependent variable; integer codes on input features are appropriate only if the categories have an ordinal relationship.
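A minimal sketch of how LabelEncoder assigns the integers, on an illustrative `labels` array. The classes are sorted, so 'No' becomes 0 and 'Yes' becomes 1, and `inverse_transform` recovers the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative target labels (not the notebook's Y).
labels = np.array(["No", "Yes", "No", "Yes", "Yes"])

le = LabelEncoder()
encoded = le.fit_transform(labels)       # learns the classes and encodes them
decoded = le.inverse_transform(encoded)  # maps the integers back to strings

print(le.classes_)  # ['No' 'Yes']
print(encoded)      # [0 1 0 1 1]
print(decoded)      # ['No' 'Yes' 'No' 'Yes' 'Yes']
```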

In [9]:
from sklearn.preprocessing import LabelEncoder

# Create an instance of LabelEncoder (only 2 values here: 'No' -> 0, 'Yes' -> 1)
le = LabelEncoder()

# Fit the encoder on Y and encode it.
# fit_transform already returns a NumPy array, so no conversion is needed
Y = le.fit_transform(Y)


In [10]:
print(Y)

[0 1 0 0 1 1 0 1 0 1]


Should we apply feature scaling before or after splitting the dataset into the training set and test set?

After. The test set should behave like brand-new data used to evaluate the model's accuracy.
Scaling before the split lets the test set influence the mean and standard deviation, which is information leakage: it is like knowing production customer data beforehand, so the evaluation results might not be accurate.

## Splitting the dataset into the Training set and Test set

In [10]:
from sklearn.model_selection import train_test_split
# X_train: feature matrix of the training set
# X_test: feature matrix of the test set
# Y_train: dependent variable vector of the training set
# Y_test: dependent variable vector of the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)


In [11]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [12]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [13]:
print(Y_train)

[0 1 0 0 1 1 0 1]


In [14]:
print(Y_test)

[0 1]
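A small sketch of what `test_size` and `random_state` do, using made-up arrays `X_demo` and `y_demo`. With 10 rows and `test_size=0.2`, 2 rows go to the test set, and a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features; row i holds the values (2i, 2i + 1).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
# Same arguments, same random_state: the split is identical.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

print(X_tr.shape, X_te.shape)        # (8, 2) (2, 2)
print(np.array_equal(X_te, X_te2))   # True
# Features and labels stay aligned row by row after shuffling.
print(np.array_equal(X_tr[:, 0], 2 * y_tr))  # True
```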


## Feature Scaling

Why? So that features with large values do not dominate the others; when that happens, the dominated features can be effectively ignored by some machine learning models.

1. Normalization: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$; the output values lie in the range [0, 1].
- applied when the features are normally distributed.
2. Standardization: $x' = \frac{x - \text{mean}(x)}{\text{std}(x)}$; most output values lie roughly in the range [-3, 3].
- works all the time.

The purpose is to put all the values on the same scale, within a certain range.


Do we have to apply feature scaling to the dummy variables in the features (i.e. the values generated by one-hot encoding)?
No. Standardization aims to bring all the values into the same range, roughly [-3, 3].
Dummy variables are 0 or 1, so they already lie within [-3, 3]; scaling them would not make much difference and would only lose their 0/1 interpretation.
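The two formulas can be checked by hand against scikit-learn. This is a sketch on an illustrative `ages` column; `MinMaxScaler` is scikit-learn's normalization class and is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative single feature column.
ages = np.array([[27.0], [30.0], [38.0], [44.0], [50.0]])

# Normalization: (x - min) / (max - min), values end up in [0, 1].
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization: (x - mean) / std, using the population standard deviation
# (ddof=0), which is what StandardScaler uses.
standardized = (ages - ages.mean()) / ages.std()

print(np.allclose(normalized, MinMaxScaler().fit_transform(ages)))      # True
print(np.allclose(standardized, StandardScaler().fit_transform(ages)))  # True
print(normalized.ravel())
print(standardized.ravel())
```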


In [15]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# fit computes the mean and standard deviation of each column;
# transform applies the standardization formula to each value
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# Reuse the training-set mean and standard deviation on the test set;
# fitting again on the test set would leak information from it
X_test[:, 3:] = sc.transform(X_test[:, 3:])




In [16]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [17]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
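To illustrate the information-leakage point made earlier, the sketch below scales a test set in two ways: with the mean and standard deviation learned from the training set, and with a scaler refitted on the test set itself. The `train` and `test` arrays are illustrative, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative (age, salary) rows.
train = np.array([[27.0, 48000.0], [38.0, 61000.0], [44.0, 72000.0], [50.0, 83000.0]])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # mean and std learned from the training rows only
test_scaled = sc.transform(test)        # the same mean and std reused on the test rows

# Refitting on a two-row test set maps every column to exactly -1 and 1,
# whatever the original values were.
refit = StandardScaler().fit_transform(test)

print(test_scaled)
print(refit)  # [[-1. -1.]
              #  [ 1.  1.]]
```

With `transform` the test rows keep their position relative to the training distribution, while the refitted version discards it.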
