# Importing The Essential Libraries

Libraries are sets of tools that perform specific tasks. For example, to compute a statistical mean, we import a library that provides a mean function, pass that function some data, and it returns the mean of the data.
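As a minimal sketch of that idea, the snippet below imports NumPy and calls its `np.mean` function. The `ages` list is made-up sample data used only for illustration.

```python
import numpy as np

# Hypothetical sample values, used only for illustration
ages = [44, 27, 30, 38, 40]

# np.mean is the library function: we pass it data and it returns the mean
mean_age = np.mean(ages)
print(mean_age)  # 35.8
```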

Python has many libraries, but three are most commonly used in data science:
* NumPy - Short for Numerical Python. This library contains mathematical tools.
* Matplotlib - This library helps us visualize data.
* Pandas - This library helps us manage data sets.

In [1]:
# Importing The Essential Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Importing Data Set

Whenever you want to import a data set, you need to set the working directory to the folder that contains it. The easiest way to do this is to save the Python file in the same folder as the data set.

### NOTE - Python Indexes Start with 0

In [2]:
# Importing data Set
# We will use the pandas library to import the data set.
dataset = pd.read_csv('Data.csv')
# Now we need to separate the Independent and Dependent variables into their respective data sets
X = pd.DataFrame(dataset.iloc[:,:-1].values)
y = pd.DataFrame(dataset.iloc[:,-1].values)
y.head()

Unnamed: 0,0
0,No
1,Yes
2,No
3,No
4,Yes


# Object Oriented Concepts

A class is the model of something we want to build. For example, if we make a house construction plan that gathers the instructions on how to build a house, then this construction plan is the class.

An object is an instance of the class. So if we take that same example of the house construction plan, then an object is simply a house. A house (the object) that was built by following the instructions of the construction plan (the class).
And therefore there can be many objects of the same class, because we can build many houses from the construction plan.

A method is a tool we can use on the object to complete a specific action. So in this same example, a tool can be to open the main door of the house if a guest is coming. A method can also be seen as a function that is applied onto the object, takes some inputs (that were defined in the class) and returns some output.
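The house analogy can be sketched in Python as follows. The `House` class, its attributes and the `open_main_door` method are hypothetical names chosen to mirror the example above, not part of any library.

```python
class House:
    """The class: a construction plan describing what every house has."""

    def __init__(self, owner):
        self.owner = owner
        self.door_open = False

    def open_main_door(self, guest):
        """A method: an action performed on one particular house."""
        self.door_open = True
        return f"{self.owner}'s door is open for {guest}"


# Two objects (houses) built from the same class (plan)
house_a = House("Alice")
house_b = House("Bob")

message = house_a.open_main_door("Carol")
print(message)            # Alice's door is open for Carol
print(house_b.door_open)  # False, because each object keeps its own state
```

The classes used later in this notebook, such as `LabelEncoder` and `StandardScaler`, follow the same pattern: we create an object from the class and then call methods like `fit_transform` on it.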

# Handling Missing Data

One of the problems with real-world data is that it often contains many missing data points. There are several ways to handle missing data.

* The first and easiest way is to remove the observations that contain missing values. This can be harmful, because along with the missing values we also lose the other, potentially important, information in those observations.
* The other common practice is to impute the missing values. For example, we can use the mean for continuous numerical data and the mode for categorical data.
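Both options can be sketched with plain pandas. The small `df` below is hypothetical toy data, not the notebook's `Data.csv`, and this is only one of several ways to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
df = pd.DataFrame({
    "Country": ["France", "Spain", None, "Spain"],
    "Age": [44.0, 27.0, 30.0, np.nan],
})

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()

# Option 2: impute, using the mean for numerical and the mode for categorical data
imputed = df.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].mean())
imputed["Country"] = imputed["Country"].fillna(imputed["Country"].mode()[0])

print(len(dropped))  # 2 of the 4 rows survive
print(imputed)
```

Dropping loses half of this toy data set, while imputing keeps every row at the cost of introducing estimated values.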

In [3]:
# Taking care of missing data
# We will use a scikit-learn class that imputes missing values in the data set.
# SimpleImputer (scikit-learn 0.20+) replaces each missing value with its column mean.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X.iloc[:,1:3])
X.iloc[:,1:3] = imputer.transform(X.iloc[:,1:3])
X

Unnamed: 0,0,1,2
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.8
5,France,35.0,58000.0
6,Spain,38.7778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


# Handling Categorical Data

Machine learning models are mathematical algorithms that need numbers as input, so they cannot work with categorical data directly. This is the main reason we encode categorical data as numbers before applying a model.

In [5]:
# Encoding Categorical Variables
# scikit-learn provides a class, LabelEncoder, that encodes categories as integers
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X.iloc[:,0] = labelencoder_X.fit_transform(X.iloc[:,0])
labelencoder_Y = LabelEncoder()
# LabelEncoder expects a 1-D array, so flatten the single-column DataFrame first
y = labelencoder_Y.fit_transform(y.values.ravel())
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)

Looking at the encoded country column, we see that the categories France, Germany and Spain are replaced by 0, 1 and 2 respectively. An algorithm will treat these as numbers with mathematical meaning, where 1 is greater than 0 and 2 is greater than 1. By encoding with integers we artificially impose an order on the categories, and that order will influence the model when it is used in an ML algorithm. To overcome this we use dummy variables: each category is pivoted into its own column, with a 1 marking the presence of that category in a row and a 0 marking its absence.
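The pivoting idea can be sketched with pandas' `get_dummies`, which is a different tool from the scikit-learn encoder used in this notebook. The `countries` series is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample of the country column
countries = pd.Series(["France", "Spain", "Germany", "Spain"], name="Country")

# One column per category; 1 marks presence and 0 marks absence
dummies = pd.get_dummies(countries, dtype=int)
print(dummies)
```

Each row contains exactly one 1, so no category is ranked above another.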

In [6]:
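# To create dummy variables we will use the OneHotEncoder class from scikit-learn,
# applied to the first column only through ColumnTransformer (scikit-learn 0.20+)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
onehotencoder = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder = 'passthrough')
X = np.asarray(onehotencoder.fit_transform(X), dtype = float)
X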

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 5.00000000e+01,
        8.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04]])

# Splitting Data Set into Train and Test

In any machine learning project we need to split the available data set into a training set and a test set.
We split the data so that some of it is held back to validate the model against. This helps us estimate the accuracy of the model on data it has not seen, detect whether it is overfitting the training data, and tune it to perform better.

In [10]:
# Splitting the data set into train and test.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04]])

# Feature Scaling

Consider our data set, which has two numerical fields, Age and Salary. These two variables are not on the same scale: Age is in the tens and Salary is in the tens of thousands, and this causes problems for some machine learning models. This is because many machine learning models are based on the concept of Euclidean distance. When the Euclidean distance is calculated, the scale of the variables has a major impact, and since Salary is on a larger scale it carries far more weight.

Imagine we are calculating the Euclidean distance between the points (25, 25000) and (50, 50000):

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

$$d = \sqrt{(50 - 25)^2 + (50000 - 25000)^2}$$

$$d = \sqrt{625 + 625000000}$$

$$d \approx 25000.01$$

We can clearly see that Salary dominates the distance calculation. This is the main reason to perform feature scaling: it puts the variables on the same scale, which removes the outsized effect of the larger-scale variable on the model.
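A quick NumPy check of the distance between the same two points illustrates how little Age contributes. The variable names are chosen only for this sketch.

```python
import numpy as np

# The two points from the example, as (Age, Salary)
p1 = np.array([25.0, 25000.0])
p2 = np.array([50.0, 50000.0])

# Euclidean distance using both variables
distance = np.linalg.norm(p2 - p1)

# Distance if Age were ignored entirely
salary_only = abs(p2[1] - p1[1])

print(distance)     # roughly 25000.01
print(salary_only)  # 25000.0
```

The two numbers are almost identical, so a difference of 25 years in Age is practically invisible to the distance.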

There are several ways to scale your data; below are two commonly used methods.

![title](Feature_Scaling.PNG)
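The figure is not reproduced here. Assuming the two methods it shows are standardisation and min-max normalisation (this is an assumption, not confirmed by the text), they could be sketched by hand as follows, using hypothetical salary values.

```python
import numpy as np

# Hypothetical salary values, used only for illustration
salary = np.array([48000.0, 54000.0, 61000.0, 72000.0, 83000.0])

# Standardisation: subtract the mean, divide by the standard deviation
standardised = (salary - salary.mean()) / salary.std()

# Min-max normalisation: rescale to the range 0 to 1
normalised = (salary - salary.min()) / (salary.max() - salary.min())

print(standardised)
print(normalised)
```

Standardised values have mean 0 and standard deviation 1, while normalised values always lie between 0 and 1.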

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Standardization rescales each variable to mean 0 and standard deviation 1, so most values fall within a few units of 0

The next question that may arise is: should we standardize the dummy variables?

It's not bad to do, necessarily, but it's not a good habit to get into.  Standardising variables when it's not necessary to do so leaves interpretation issues, and can lead to sloppy thinking.

Also, remember that standardisation needs to be applied in the same way to all data sets that are used for a given built model.  Unnecessary standardisation leads to more metadata.  And if someone else uses your model or code without reading carefully, or if you somehow forget...  well, that's not a good thing.  That's true for any variable, true.  But why add unnecessary complications?

The next question that comes to mind is: is feature scaling required for ML algorithms that do not use Euclidean distance?

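Often, yes. Models trained with gradient-based optimization usually converge much faster on scaled data than on unscaled data, even when they are not based on Euclidean distance. Tree-based models are a common exception, since they split on one variable at a time.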

The next question would be: should you apply feature scaling to the dependent variable?

In most cases no, for example when the output is categorical, or numerical with a small range. But if the output is numerical and its range is large, we may need to apply feature scaling to the dependent variable as well.

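Another question would be: why do we fit and transform on the training data but only transform on the test data?

The scaler must learn its parameters (such as the mean and standard deviation) from the training data alone, so that no information from the test set leaks into the model. The test data must then be scaled with those same parameters. For details, see:

https://sebastianraschka.com/faq/docs/scale-training-test.html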
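A hand-rolled NumPy sketch of the idea is shown below, with hypothetical values. Computing the mean and standard deviation plays the role of `fit`, and applying them plays the role of `transform`.

```python
import numpy as np

# Hypothetical single-feature data
train = np.array([20.0, 30.0, 40.0])
test = np.array([50.0])

# "fit": learn the parameters from the training data only
train_mean = train.mean()
train_std = train.std()

# "transform": apply the same parameters to both sets
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std

print(train_mean)   # 30.0
print(test_scaled)  # roughly 2.45
```

The test value is scaled relative to the training distribution, which is how unseen data would be treated once the model is in use.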