<a href="https://colab.research.google.com/github/mislam3/Machine-Learning-Data-Science-w-Python-R-w.through/blob/master/Copy_of_data_preprocessing_tools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing Tools

## Importing the libraries

In [15]:
import numpy as np                # NumPy: work with arrays
import matplotlib.pyplot as plt   # Matplotlib: plot charts and graphs
import pandas as pd               # pandas: import datasets and build the matrix of features and the dependent-variable vector


## Importing the dataset

In [16]:
# Load the CSV file into a pandas DataFrame
dataset = pd.read_csv('Data.csv')

# Create two separate entities: the matrix of features and the dependent-variable vector.
# Features (independent variables) are the columns used to predict the dependent variable;
# the dependent variable is the last column (here, Purchased: Yes/No).
# Typical layout in data science and ML: features in the first columns, dependent variable in the last column.

# X -> matrix of features (first three columns), y -> dependent variable (last column)
# iloc selects by integer position in [rows, columns] form; ':' selects all rows.
# Indexing starts at 0, and a slice includes its lower bound but excludes its upper bound,
# so ':-1' selects every column except the last, while '-1' on its own selects only the last column.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values



In [17]:
print(X)
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
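
The slicing used to build `X` and `y` can be checked on a small frame. This is a minimal sketch: the three-row DataFrame and its column names (`Country`, `Age`, `Salary`, `Purchased`) are assumptions that mirror the layout printed above, not the contents of `Data.csv`.

```python
import pandas as pd

# Hypothetical three-row frame mirroring the layout of the dataset above
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = df.iloc[:, :-1].values   # all rows, every column except the last
target = df.iloc[:, -1].values      # all rows, last column only

print(features.shape)  # two-dimensional: rows x feature columns
print(target.shape)    # one-dimensional: one label per row
print(features.dtype)  # object, because strings and floats are mixed
```

Because the country column holds strings, `features` comes back as an object array, which is why the numeric columns need explicit handling in the next steps.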


## Taking care of missing data

In [18]:
# Missing data can cause errors when training ML models, so it has to be handled first.
# Option 1: ignore the affected observations by deleting them. This works on a large dataset where
# the missing rows are a tiny fraction (e.g. 1%) and removing them barely changes the learning quality.
# Option 2: when a lot of data is missing, replace each missing value with the mean, median, or
# most frequent value (e.g. for categories) of its column. Here, that is the average of the column (e.g. salaries).
# scikit-learn provides tools for this kind of data preprocessing.

from sklearn.impute import SimpleImputer

# np.nan marks the missing values; strategy='mean' replaces them with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit computes the mean of each selected column; transform fills the missing entries with those means.
# The imputer expects numerical columns only, so exclude the string column and select columns 1 and 2.
# A slice excludes its upper bound, hence 1:3.
imputer.fit(X[:, 1:3])

# transform returns the filled-in columns; assign them back to update the matrix of features X
X[:, 1:3] = imputer.transform(X[:, 1:3])

print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
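
To see where the filled-in values 38.78 and 63777.78 come from, the sketch below compares the imputer's learned statistics with a direct NaN-aware mean. The array `numeric` is copied by hand from the age and salary columns printed earlier; it is an illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and salary columns copied from the printed X, including the two missing entries
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

# statistics_ holds the per-column fill values learned by fit
print(imputer.statistics_)
# The same values computed directly, ignoring NaN entries
print(np.nanmean(numeric, axis=0))
```

Both lines should print the same pair of column means, which are the values that replaced the `nan` entries in rows 5 and 7.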


## Encoding categorical data

### Encoding the Independent Variable

In [19]:
# If the countries were encoded as plain numbers, an ML model might read meaning into those values,
# such as an order, which does not exist here and could be detrimental.
# Such an interpretation can lead to spurious correlations between the features and the outcome to be predicted.
# One-hot encoding avoids this: the country column becomes three binary columns, one per country
# (France, Germany, Spain), encoded as 1 0 0, 0 1 0, and 0 0 1, with no numerical order between them.
# Finally, the dependent variable Purchased (No/Yes) is replaced with 0/1, which suits a binary outcome.

# One-hot encoding for countries with scikit-learn

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Create a ColumnTransformer object.
# transformers: a list of (name, transformer, columns) tuples; [0] applies the encoder to the first column only.
# remainder='passthrough' keeps the columns that are not one-hot encoded (age and salary) in the matrix of features.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# fit_transform one-hot encodes the country column and returns the new matrix of features,
# with the three encoded columns first, so the result is assigned back to X.
# Future ML models expect X to be a NumPy array, so the result is wrapped in np.array.
X = np.array(ct.fit_transform(X))

print(X)

# Encoded as: France -> 1 0 0, Germany -> 0 1 0, Spain -> 0 0 1. One-hot encoding prevents any numerical ordering of the categories.

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [22]:
# LabelEncoder converts the Purchased labels No/Yes to 0/1

from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object. It takes no constructor arguments; y, a single vector, is passed directly to fit_transform
le = LabelEncoder()

y = le.fit_transform(y)  # encodes 'No' as 0 and 'Yes' as 1; expects a 1-D array

print(y)

# Integer codes also suit ordinal categories such as low/mid/high -> 0/1/2, where a higher code meaningfully carries more weight.
# One-hot encoding is the better choice when all categories carry equal weight and have no order.
# Note that LabelEncoder assigns codes in sorted order, so an explicit order needs an encoder that accepts one (e.g. OrdinalEncoder).

[0 1 0 0 1 1 0 1 0 1]
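
The mapping learned by `LabelEncoder` can be inspected and reversed, and an ordered category can be encoded with an explicit order. This is a sketch under assumptions: the `labels` and `levels` arrays are hypothetical, and `OrdinalEncoder` is swapped in for the ordinal case because `LabelEncoder` always uses sorted order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

labels = np.array(["No", "Yes", "No", "Yes"])
le = LabelEncoder()
codes = le.fit_transform(labels)
print(le.classes_)                  # position in this array is the integer code
print(le.inverse_transform(codes))  # maps the codes back to the original labels

# Hypothetical ordinal feature: sorted order would give high=0, low=1, mid=2,
# so the intended order is passed explicitly instead
levels = np.array([["low"], ["high"], ["mid"]])
oe = OrdinalEncoder(categories=[["low", "mid", "high"]])
level_codes = oe.fit_transform(levels)
print(level_codes.ravel())          # low -> 0, mid -> 1, high -> 2
```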


## Splitting the dataset into the Training set and Test set
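
A minimal sketch of a train/test split, assuming scikit-learn's `train_test_split`. The stand-in arrays `X_demo` and `y_demo`, the 80/20 ratio, and `random_state=1` are all assumptions chosen for illustration; none of them come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 10 observations, the same count as the dataset above
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# test_size=0.2 and random_state=1 are assumed values, not taken from the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)  # 8 training rows and 2 test rows
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs produce the same split.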

## Feature Scaling
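
A minimal sketch of feature scaling, assuming standardization with scikit-learn's `StandardScaler`. The `train` and `test` arrays are stand-ins for the age and salary columns of a hypothetical split; the choice of standardization over other scaling methods is also an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in age and salary columns for a training and a test split
train = np.array([[44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0]])
test = np.array([[35.0, 58000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from the training rows
test_scaled = scaler.transform(test)        # reuse the training statistics on the test rows

print(train_scaled.mean(axis=0))  # approximately 0 per column
print(train_scaled.std(axis=0))   # approximately 1 per column
```

The scaler is fitted on the training rows only, so no information from the test rows leaks into the statistics used for scaling.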