# Machine Learning 1

# Some Concepts

## ML Tasks

In [None]:
Type of tasks
  - Classification
  - Regression
  - Structured annotation
  - Clustering
  - Transcription

In [None]:
Challenges
  - Quality of data 
  - Time-consuming tasks: data acquisition, feature extraction and retrieval all take a lot of time.
  - Lack of specialists: ML is still a young field, so experienced practitioners are hard to find.
  - No clear objective for formulating business problems
  - Overfitting and underfitting
  - Curse of dimensionality: too many features per data point can be a real hindrance.
  - Difficulty in deployment: the complexity of an ML model can make it quite difficult to deploy in real life.

In [None]:
Applications
  - Emotion analysis
  - Sentiment analysis
  - Error detection and prevention
  - Weather forecasting and prediction
  - Stock market analysis and forecasting
  - Speech synthesis
  - Speech recognition
  - Customer segmentation
  - Object recognition
  - Fraud detection
  - Fraud prevention
  - Recommendation of products to customer in online shopping

## Stats 

In [2]:
import pandas as pd
import statsmodels.formula.api as sm
df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
result = sm.ols(formula="A ~ B + C", data=df).fit()
print(result.params)

Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64


In [3]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Wed, 10 Jun 2020   Prob (F-statistic):              0.421
Time:                        17:47:02   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.4



## Type of Learning

##### Supervised Learning

In [None]:
- The majority of practical machine learning uses supervised learning.
- Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.
- Y = f(X)
- It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.

In [None]:
Supervised learning problems can be further grouped into regression and classification problems.
  - Classification: A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease”.
  - Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.

Some common types of problems built on top of classification and regression include recommendation and time series prediction respectively.

Some popular examples of supervised machine learning algorithms are:
  - Linear regression for regression problems.
  - Random forest for classification and regression problems.
  - Support vector machines for classification problems.

##### Unsupervised Learning

In [None]:
Unsupervised learning is where you only have input data (X) and no corresponding output variables.
The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.

These are called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. 
Algorithms are left to their own devices to discover and present the interesting structure in the data.

Unsupervised learning problems can be further grouped into clustering and association problems.
  - Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
  - Association:  An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.

Some popular examples of unsupervised learning algorithms are:
  - k-means for clustering problems.
  - Apriori algorithm for association rule learning problems.

##### Semi Supervised

In [None]:
Problems where you have a large amount of input data (X) and only SOME of the data is labeled (Y) are called semi-supervised learning problems.

These problems sit in between both supervised and unsupervised learning.
  - A good example is a photo archive where only some of the images are labeled, (e.g. dog, cat, person) and the majority are unlabeled.
  - Many real world machine learning problems fall into this area.
  - This is because labeling data can be expensive or time-consuming, as it may require access to domain experts, whereas unlabeled data is cheap and easy to collect and store.

You can use unsupervised learning techniques to discover and learn the structure in the input variables.
You can also use supervised learning techniques to make best guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data and use the model to make predictions on new unseen data.
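The second approach (often called pseudo-labeling) can be sketched as follows. The data is synthetic, the split into labeled / unlabeled / unseen parts is arbitrary, and logistic regression is only one possible choice of supervised model; all of these are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data; only the first 20 labels are treated as known.
X_all, y_all = make_blobs(n_samples=300, centers=[[-3, -3], [3, 3]],
                          cluster_std=1.0, random_state=0)
X_labeled, y_labeled = X_all[:20], y_all[:20]
X_unlabeled = X_all[20:200]
X_new, y_new = X_all[200:], y_all[200:]   # held out, used only to check the result

# Step 1: supervised model on the small labeled set.
clf = LogisticRegression().fit(X_labeled, y_labeled)

# Step 2: best-guess predictions (pseudo-labels) for the unlabeled data.
pseudo_labels = clf.predict(X_unlabeled)

# Step 3: feed labeled + pseudo-labeled data back in as training data.
X_combined = np.vstack([X_labeled, X_unlabeled])
y_combined = np.concatenate([y_labeled, pseudo_labels])
clf = LogisticRegression().fit(X_combined, y_combined)

# Step 4: predict on new, unseen data.
print("accuracy on unseen data:", clf.score(X_new, y_new))
```

In practice only confident pseudo-labels are usually kept, because wrong guesses fed back into training can reinforce themselves.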

#### Machine Learning vs Deep Learning

In [None]:
Deep learning is machine learning.
  - More specifically, deep learning is considered an evolution of machine learning. 
  - It uses a programmable neural network that enables machines to make accurate decisions without help from humans.

However, its capabilities are different.
  - While basic machine learning models do become progressively better at whatever their function is, they still need some guidance. 
  - If an AI algorithm returns an inaccurate prediction, then an engineer has to step in and make adjustments. 
  - With a deep learning model, an algorithm can determine on its own if a prediction is accurate or not through its own neural network.
    
A deep learning model is designed to continually analyze data with a logic structure similar to how a human would draw conclusions. 
  - To achieve this, deep learning applications use a layered structure of algorithms called an artificial neural network. 
  - The design of an artificial neural network is inspired by the biological neural network of the human brain, leading to a process of learning that’s far more capable than that of standard machine learning models.

It’s a tricky prospect to ensure that a deep learning model doesn’t draw incorrect conclusions—like other examples of AI, it requires lots of training to get the learning processes correct. 
But when it works as it’s intended to, functional deep learning is often received as a scientific marvel that many consider being the backbone of true artificial intelligence.

# Simple Regression

#### Importing the tools

In [6]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns  # the plotting cells below refer to seaborn as sns
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline

#### The basics in Scikit-learn

In [None]:

# Declare the X and y (df is assumed to hold a housing data set with these columns)
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms','Avg. Area Number of Bedrooms','Area Population']]
y = df['Price']

# Prepare the test / train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Get the sets size
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)

#printing the output and coefficients
coeff_df = pd.DataFrame(linreg.coef_,X.columns,columns=['Coefficient']) 
coeff_df

#### Visualisation of output

In [None]:
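# Plotting the predictions vs the test set (linreg is the model fitted above)
y_pred = linreg.predict(X_test)  
plt.scatter(y_test, y_pred)

# Plotting the errors (histplot replaces the deprecated distplot, seaborn >= 0.11)
sns.histplot(y_test - y_pred, bins=50, kde=True)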


#### Metrics

In [None]:
Mean Absolute Error (MAE) is the mean of the absolute values of the errors.
Mean Squared Error (MSE) is the mean of the squared errors.
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors.

Comparing these metrics:

MAE is the easiest to understand because it’s the average error.
MSE is more popular than MAE because MSE “punishes” larger errors, which tends to be useful in the real world.
RMSE is even more popular than MSE because RMSE is interpretable in the “y” units.
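To make the three definitions concrete, here is a small worked example in plain NumPy. The four actual and predicted values are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted values.
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true_demo - y_pred_demo      # [ 1,  0, -2, -1]

mae = np.mean(np.abs(errors))           # (1 + 0 + 2 + 1) / 4
mse = np.mean(errors ** 2)              # (1 + 0 + 4 + 1) / 4
rmse = np.sqrt(mse)                     # back in the units of y

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
```

The single error of 2 contributes half of the MAE but two thirds of the MSE, which is the "punishing larger errors" effect described above.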

In [None]:
# to get the metrics

from sklearn import metrics

# predictions on the test set (linreg is the model fitted above)
y_pred = linreg.predict(X_test)

print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

# Different way, just for illustration
from sklearn.metrics import mean_squared_error
MSE = mean_squared_error(y_test, y_pred)
print(MSE)



#### Predictions

In [None]:
from sklearn import linear_model

# Let's say that the model inputs are (df is assumed to contain Weight, Volume and CO2 columns)
X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)

# Predict the CO2 emission of a car where the weight is 2300 kg and the volume is 1300 cm3:
predictedCO2 = regr.predict([[2300, 1300]])

print(predictedCO2)


#### OLS Regression

In [None]:
https://docs.w3cub.com/statsmodels/generated/statsmodels.regression.linear_model.ols.fit_regularized/

In [None]:
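import statsmodels.api as sm  # array interface; rebinds sm (statsmodels.formula.api above)

# sm.OLS does not add an intercept on its own, so add a constant column
est = sm.OLS(y, sm.add_constant(X))
est = est.fit()
est.summary()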

#### Plotting Errors

In [None]:
# provided that y_test and y_pred have been computed (example below)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# y_pred = linreg.predict(X_test)

import seaborn as sns

# histplot replaces the deprecated distplot (seaborn >= 0.11)
sns.histplot(y_test - y_pred, bins=50, kde=True)

# Cross Validation

In [None]:
We always need to validate the stability of a machine learning model, i.e. to get some assurance that:
  - the model has captured most of the patterns in the data correctly
  - the model is not picking up too much of the noise
  - the model is low on both bias and variance.

In [None]:
Validation
  - the process of deciding whether the numerical results quantifying hypothesized relationships between variables are acceptable as descriptions of the data.

Residuals
  - evaluation of residuals = an error estimate for the model, made after training
  - it is a numerical estimate of the difference between predicted and original responses, also called the training error.
  - However, this only gives us an idea of how well our model does on the data used to train it.
  - It is possible that the model is underfitting or overfitting the data.

Cross Validation:
  - Purpose: get an indication of how well the learner will generalize to an independent / unseen data set
  - How: discussed below

In [None]:
Model Bias / Variance

Bias
  - In an ideal scenario, the errors measured on the validation sets should sum to zero.
  - To estimate the model's bias, we take the average of all the errors.
  - The lower the average value, the better the model.

Variance
  - Similarly, to estimate the model's variance, we take the standard deviation of all the errors.
  - A low standard deviation suggests that our model does not vary a lot with different subsets of training data.

We should focus on achieving a balance between bias and variance. 
  - This can be done by reducing the variance and controlling bias to an extent.
  - This will result in a better predictive model.
  - This trade-off usually leads to building less complex predictive models as well. 
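The "average of the errors" and "standard deviation of the errors" described above can be sketched with scikit-learn's `cross_val_score`. The synthetic data, the 5 folds and the choice of mean absolute error as the error measure are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only).
X_demo, y_demo = make_regression(n_samples=200, n_features=3,
                                 noise=10.0, random_state=0)

# One error value per validation fold; scikit-learn reports the negated MAE.
neg_mae = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=5, scoring="neg_mean_absolute_error")
fold_errors = -neg_mae

print("errors per fold:", fold_errors)
print("average error (bias estimate):", fold_errors.mean())
print("spread of errors (variance estimate):", fold_errors.std())
```

A small spread means the model behaves similarly whichever subset it is trained on; a large spread suggests it is sensitive to the particular training data.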

## Hold Out Method

In [None]:
Simple
  - Remove a part of the training data and use it to get predictions from the model trained on the rest of the data.
  - The error estimate then tells us how our model is doing on unseen data (the validation set).

However
  - it suffers from high variance, since it is not certain which data points will end up in the validation set, and the result can differ from one split to another

## K-Fold Cross Validation

In [None]:
The Problem
  - As there is never enough data to train your model, removing a part of it for validation poses a problem of underfitting. 
  - By reducing the training data, we risk losing important patterns/ trends in data set, which in turn increases error induced by bias. 

The Solution
  - What we require is a method that provides ample data for training the model and also leaves ample data for validation. 
  - K-Fold cross validation does exactly that.

K Fold cross validation
  - the data is divided into k subsets. 
  - the holdout method is repeated k times, such that each time:
      - one of the k subsets is used as the test set / validation set
      - the other k-1 subsets are put together to form a training set. 
  - The error estimation is averaged over all k trials to get total effectiveness of our model. 
  - As can be seen, every data point gets to be in a validation set exactly once, and gets to be in a training set k-1 times. 
  - This significantly reduces
      - bias as we are using most of the data for fitting
      - variance as most of the data is also being used in validation set. 
  - Interchanging the training and test sets also adds to the effectiveness of this method. 
  - As a general rule backed by empirical evidence, K = 5 or 10 is usually preferred, but nothing is fixed and K can take any value.

In [None]:
Example of 5 Fold Cross Validation

Validation  XXXXXXXXXX  XXXXXXXXXX  XXXXXXXXXX  XXXXXXXXXX
XXXXXXXXXX  Validation  XXXXXXXXXX  XXXXXXXXXX  XXXXXXXXXX 
XXXXXXXXXX  XXXXXXXXXX  Validation  XXXXXXXXXX  XXXXXXXXXX 
XXXXXXXXXX  XXXXXXXXXX  XXXXXXXXXX  Validation  XXXXXXXXXX 
XXXXXXXXXX  XXXXXXXXXX  XXXXXXXXXX  XXXXXXXXXX  Validation

## Stratified K-Fold Cross Validation

In [None]:
In some cases, there may be a large imbalance in the response variables. 
  - For example, in a dataset of house prices, there might be a large number of houses with a high price.
  - Or in the case of classification, there might be several times more negative samples than positive samples.

For such problems, a slight variation of the K-Fold cross validation technique is used:
  - each fold contains approximately the same percentage of samples of each target class as the complete set
  - in the case of regression problems, the mean response value is approximately equal in all the folds.

This variation is also known as Stratified K Fold.

## Leave-P-Out Cross Validation (exhaustive method)

In [None]:
Exhaustive methods compute all possible ways the data can be split into training and test sets.

In [None]:
Leave P-Out
  - Leaves p data points out of training data
  - Meaning: 
      - if there are n data points in the original sample then, n-p samples are used to train the model and p points are used as the validation set. 
      - This is repeated for all combinations in which original sample can be separated this way
      - Then the error is averaged for all trials, to give overall effectiveness.

This method is exhaustive in the sense that:
  - it needs to train and validate the model for all possible combinations
  - for moderately large p, it can become computationally infeasible.
    
A particular case of this method is when p = 1. 
  - This is known as leave-one-out cross validation.
  - This case is generally preferred over the general one because it does not suffer from the same intensive computation.
  - The number of possible combinations equals the number of data points in the original sample, n.
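The number of models that have to be trained can be checked directly. This is a small sketch using scikit-learn's `LeavePOut` and `LeaveOneOut` splitters on a made-up sample of 10 points; the sample sizes are arbitrary.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X_demo = np.arange(10).reshape(-1, 1)   # n = 10 illustrative data points

lpo = LeavePOut(p=3)
loo = LeaveOneOut()

# Leave-p-out trains one model per combination: C(n, p) of them.
print(lpo.get_n_splits(X_demo), comb(10, 3))
# Leave-one-out only needs n models.
print(loo.get_n_splits(X_demo))

# Growth of C(n, p) for a modest sample of n = 100.
for p in (1, 2, 5, 10):
    print(p, comb(100, p))
```

Already at n = 100 and p = 5 there are tens of millions of splits, which is why leave-p-out is rarely used beyond very small p.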

## Python Implementation

In [None]:
# K-Fold Cross Validation

# Implementing K-Fold Cross Validation
# (X and y are assumed to be pandas objects, hence .iloc)

from sklearn.model_selection import KFold 
kf = KFold(n_splits=5, shuffle=True, random_state=1)

linreg = LinearRegression()

scores = []

# kf.split yields one (train, test) pair of index arrays per fold
for train_index, test_index in kf.split(X):
    X_train = X.iloc[train_index]
    X_test = X.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test = y.iloc[test_index]
    model = linreg.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))   # R-squared on the held-out fold

print('scores from each iteration', scores)
print('average k-fold score', np.mean(scores))

In [None]:
# Stratified k-fold cross validation

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, random_state=None)

# X is the feature set and y is the (categorical) target, both as NumPy arrays;
# with pandas objects use X.iloc[train_index] etc.
for train_index, val_index in skf.split(X, y): 
    print("Train:", train_index, "Validation:", val_index) 
    X_train, X_test = X[train_index], X[val_index] 
    y_train, y_test = y[train_index], y[val_index]



In [None]:
# k-fold cross validation with repetition (for when the train set does not adequately represent the entire population and stratified k-fold is not good enough)
# In repeated cross-validation, the cross-validation procedure is repeated n times, yielding n random partitions of the original sample

from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)

# X is the feature set and y is the target, both as NumPy arrays
for train_index, val_index in rkf.split(X):
    print("Train:", train_index, "Validation:", val_index)
    X_train, X_test = X[train_index], X[val_index]
    y_train, y_test = y[train_index], y[val_index]



# Interpretation of Outputs

## Regression models

In [None]:
R-squared        - coefficient of determination: how well the regression line approximates the real data points
Adj. R-squared   - same as above, adjusted for the number of observations and the degrees of freedom of the residuals
F-Stat           - measure of how significant the fit is: mean square of the model / mean square of the residuals
Prob(F-stat)     - probability of getting an F-stat at least this large under the null hypothesis that the variables are unrelated
Log-Likelihood   - value of the log of the likelihood function of the fitted model
AIC              - Akaike Information Criterion: penalises the log-likelihood for the number of parameters k (2k - 2 log L)
BIC              - Bayesian Information Criterion: same idea as AIC, but the penalty per parameter is ln(n) instead of 2 (k ln(n) - 2 log L), so it is heavier once n >= 8
Df Residuals     - degrees of freedom of the residuals: number of observations - number of parameters (including the constant)
Df Model         - number of parameters in the model (not including the constant)

In [None]:
Coef             - estimated value of the coefficient
Std err          - standard error of the coefficient estimate
t                - t-stat: coef / std err (how statistically significant the coefficient is)
P>|t|            - p-value: probability of a t-stat at least this extreme if the null hypothesis (coeff = 0) is true. If < 0.05: the term is statistically significant at the 5% level
95% Conf.Int     - lower and upper value of the 95% confidence interval

In [None]:
Omnibus          - D'Agostino test: a combined statistical test for the presence of skewness and kurtosis
Prob(Omnibus)    - same as above, turned into a probability
Skew             - measure of the symmetry of the data around the mean
Kurtosis         - measure of the shape of the distribution (weight of the tails)
Durbin-Watson    - test for autocorrelation (important in time series)
Jarque-Bera      - a different test of skewness and kurtosis
Prob (JB)        - same as above, turned into a probability
Cond.No          - test for multicollinearity (parameters are related to each other)

In [None]:
Log Likelihood
  - raw log-likelihoods are hard to compare between models (better to use AIC or BIC)
  - The likelihood is the joint probability density of the observed data, given a set of parameter estimates.
  - It is calculated by 
    - taking a set of parameter estimates
    - calculating the probability density of each observation under those estimates
    - multiplying the probability densities for all the observations together 
    - >> this follows from probability theory, in that P(A and B) = P(A)P(B) if A and B are independent
  - In practice, what this means for linear regression:
    - you take a set of parameter estimates (beta, sd)
    - plug them into the normal pdf
    - calculate the density for each observation y at that set of parameter estimates
    - multiply them all together. 
  - Typically, we choose to work with the log-likelihood because 
    - sums are easier to work with than products (log(a*b) = log(a) + log(b)), and summing logs avoids the numerical underflow of multiplying many small densities. 

Log likelihood is used for almost everything. 
  - It is the basic quantity that we use to find parameter estimates (Maximum Likelihood Estimates) for a huge suite of models. 
  - For simple linear regression, these estimates turn out to be the same as those for least squares, but for more complicated models least squares may not work.
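The recipe for linear regression can be sketched in plain NumPy, reusing the toy data from the statsmodels OLS example near the top of these notes. The variable names are illustrative, and the error variance is the maximum-likelihood estimate (residual sum of squares divided by n), which is an assumption about how the reported figure is defined. Under that assumption the result should land close to the Log-Likelihood of -18.178 shown in that summary.

```python
import numpy as np

# Same toy data as the statsmodels OLS example near the top of these notes.
A = np.array([10, 20, 30, 40, 50], dtype=float)
B = np.array([20, 30, 10, 40, 50], dtype=float)
C = np.array([32, 234, 23, 23, 42523], dtype=float)

# Least-squares estimates of (intercept, B, C).
design = np.column_stack([np.ones_like(A), B, C])
beta, *_ = np.linalg.lstsq(design, A, rcond=None)
residuals = A - design @ beta

# Maximum-likelihood estimate of the error variance (divides by n, not n - k).
n = len(A)
sigma2 = np.sum(residuals ** 2) / n

# Log of the normal pdf for each observation, then summed.
log_densities = -0.5 * np.log(2 * np.pi * sigma2) - residuals ** 2 / (2 * sigma2)
log_likelihood = log_densities.sum()

# Same number via the product of densities; fine for 5 points, underflows for many.
product_of_densities = np.prod(np.exp(log_densities))

print(log_likelihood)
print(np.log(product_of_densities))
```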

AIC
  - A lower AIC suggests a "better" model, but it is only a relative measure of model fit
  - It is used for model selection (only), i.e. it lets you compare different models estimated on the same dataset
  - A lower value indicates a more parsimonious model, relative to a model fit with a higher AIC.
  - Model selection with the AIC is asymptotically equivalent to leave-one-out cross validation, so on large samples the two tend to choose the same model
  - Don't compare too many models with the AIC (as with p-values), because the lowest AIC does not mean that the model is the most appropriate one
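As a quick arithmetic check, AIC and BIC can be recomputed from the log-likelihood using the standard definitions AIC = 2k - 2 log L and BIC = k ln(n) - 2 log L. The inputs below are read from the OLS summary earlier in these notes (log-likelihood -18.178, 5 observations, 3 parameters including the intercept); because the log-likelihood is rounded, the results only match the reported 42.36 and 41.19 approximately.

```python
import math

# Values read from the OLS summary earlier in these notes.
log_likelihood = -18.178
n_obs = 5
k_params = 3          # intercept + B + C

aic = 2 * k_params - 2 * log_likelihood
bic = k_params * math.log(n_obs) - 2 * log_likelihood

print("AIC:", aic)
print("BIC:", bic)
```

Here BIC comes out below AIC because ln(5) is smaller than 2; with 8 or more observations the BIC penalty per parameter is the larger one.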


In [None]:
Omnibus          
  - We want something close to zero, which indicates normality of the residuals
        
Prob(Omnibus)    
  - p-value of the statistical test that the residuals are normally distributed
  - we want something close to 1

In [None]:
Skew
  - closer to 0 means a more symmetric residual distribution
  - If skewness is less than −1 or greater than +1, the distribution is highly skewed.
  - If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
  - If skewness is between −½ and +½, the distribution is approximately symmetric.
    
Kurtosis
  - a uniform distribution has a kurtosis of 1.8 (excess -1.2); the lowest possible value is 1, for a discrete distribution with 2 equally likely outcomes
  - a normal distribution has a kurtosis of 3 (excess 0)
  - a logistic distribution has a kurtosis of 4.2 (excess 1.2)
  - heavy-tailed distributions go much higher: a Student's t distribution with few degrees of freedom has very large (even infinite) kurtosis
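The reference values above can be looked up from SciPy's distribution objects. This is a small sketch; the Student's t example with 5 degrees of freedom is an arbitrary choice for illustration.

```python
from scipy import stats

# dist.stats(moments="k") returns the excess (Fisher) kurtosis; add 3 for the plain value.
for name, dist in [("uniform", stats.uniform),
                   ("normal", stats.norm),
                   ("logistic", stats.logistic)]:
    excess = float(dist.stats(moments="k"))
    print(f"{name:9s} kurtosis {excess + 3:.1f} (excess {excess:+.1f})")

# Student's t with 5 degrees of freedom: excess kurtosis 6 / (5 - 4) = 6.
t_excess = float(stats.t.stats(5, moments="k"))
print("t(df=5) excess kurtosis", t_excess)
```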

In [None]:
Durbin-Watson
  - Value between zero and 4.0
  - we hope to get a value between 1 and 2, ideally 2
  - A value of 2.0 means there is no autocorrelation detected in the sample. 
  - values from zero to just below 2.0 indicate positive autocorrelation
  - values from just above 2.0 to 4.0 indicate negative autocorrelation.
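The statistic itself is simple to compute: the sum of squared differences between successive residuals, divided by the sum of squared residuals. The helper name `durbin_watson_stat` and the residual values below are made up for illustration; statsmodels provides an equivalent function, `statsmodels.stats.stattools.durbin_watson`.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Residuals that flip sign at every step: negative autocorrelation, statistic above 2.
dw_alternating = durbin_watson_stat([1, -1, 1, -1])      # 12 / 4 = 3.0
# Residuals that drift slowly: positive autocorrelation, statistic near 0.
dw_drifting = durbin_watson_stat([1.0, 1.1, 1.2, 1.1])

print(dw_alternating)
print(dw_drifting)
```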

In [None]:
Condition Number 
  - This test measures the sensitivity of a function's output to changes in its input.
  - When we have multicollinearity, we can expect much larger fluctuations in the estimates in response to small changes in the data
  - We want a relatively small number (something below 30)
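The effect of multicollinearity on the condition number can be sketched with `np.linalg.cond`, which returns the ratio of the largest to the smallest singular value of a matrix. The two synthetic design matrices below are assumptions for illustration: one with unrelated columns, one where a column is almost a copy of another.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                      # unrelated to x1
x3 = x1 + rng.normal(scale=1e-3, size=200)     # almost a copy of x1

independent = np.column_stack([np.ones(200), x1, x2])
collinear = np.column_stack([np.ones(200), x1, x3])

# 2-norm condition number: largest singular value / smallest singular value.
cond_independent = np.linalg.cond(independent)
cond_collinear = np.linalg.cond(collinear)

print(cond_independent)
print(cond_collinear)
```

The first value stays small, while the second is far above the rule-of-thumb threshold of 30.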


## Confusion Matrix

In [None]:
Confusion matrix 
  - matrix (table) that can be used to measure the performance of a machine learning algorithm, usually a supervised learning one

By convention here
  - row = instances of an actual class 
  - column = instances of a predicted class


In [None]:
# 2 Class looks like this
-----------------------------------------------------------------
                    Predicted Negative      Predicted Positive
    
Actual Negative       True Negative            False Positive

Actual Positive       False Negative           True Positive
-----------------------------------------------------------------

Accuracy = (TN + TP) / (Total)
Precision = TP / (FP + TP)
Recall = TP / (FN + TP)
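These formulas can be applied to a small example with scikit-learn's `confusion_matrix`, which follows the same row = actual, column = predicted convention. The label lists are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels: 1 = positive, 0 = negative.
y_actual    = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0]
y_predicted = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0]

# Rows are actual classes, columns are predicted classes,
# so flattening the 2x2 matrix gives TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (fp + tp)

print(tn, fp, fn, tp)
print(accuracy, precision)
```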

In [None]:
Multi Class Case

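Accuracy
  - fraction of all instances that were predicted correctly: the sum of the diagonal of the matrix divided by the sum of all its elements.

Precision 
  - fraction of cases where the algorithm correctly predicted class i out of all instances where the algorithm predicted i (correctly and incorrectly). 

Recall
  - fraction of cases where the algorithm correctly predicted class i out of all instances that actually belong to class i.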



#### Implementation in python - Example 1 - numpy

In [4]:
import numpy as np

cm = np.array(
[[5825,    1,   49,   23,    7,   46,   30,   12,   21,   26],
 [   1, 6654,   48,   25,   10,   32,   19,   62,  111,   10],
 [   2,   20, 5561,   69,   13,   10,    2,   45,   18,    2],
 [   6,   26,   99, 5786,    5,  111,    1,   41,  110,   79],
 [   4,   10,   43,    6, 5533,   32,   11,   53,   34,   79],
 [   3,    1,    2,   56,    0, 4954,   23,    0,   12,    5],
 [  31,    4,   42,   22,   45,  103, 5806,    3,   34,    3],
 [   0,    4,   30,   29,    5,    6,    0, 5817,    2,   28],
 [  35,    6,   63,   58,    8,   59,   26,   13, 5394,   24],
 [  16,   16,   21,   57,  216,   68,    0,  219,  115, 5693]])

In [5]:
def precision(label, confusion_matrix):
    col = confusion_matrix[:, label]
    return confusion_matrix[label, label] / col.sum()
    
def recall(label, confusion_matrix):
    row = confusion_matrix[label, :]
    return confusion_matrix[label, label] / row.sum()

def precision_macro_average(confusion_matrix):
    rows, columns = confusion_matrix.shape
    sum_of_precisions = 0
    for label in range(rows):
        sum_of_precisions += precision(label, confusion_matrix)
    return sum_of_precisions / rows

def recall_macro_average(confusion_matrix):
    rows, columns = confusion_matrix.shape
    sum_of_recalls = 0
    for label in range(columns):
        sum_of_recalls += recall(label, confusion_matrix)
    return sum_of_recalls / columns

In [6]:
print("label precision recall")
for label in range(10):
    print(f"{label:5d} {precision(label, cm):9.3f} {recall(label, cm):6.3f}")

label precision recall
    0     0.983  0.964
    1     0.987  0.954
    2     0.933  0.968
    3     0.944  0.924
    4     0.947  0.953
    5     0.914  0.980
    6     0.981  0.953
    7     0.928  0.982
    8     0.922  0.949
    9     0.957  0.887


In [7]:
print("precision total:", precision_macro_average(cm))
print("recall total:", recall_macro_average(cm))

precision total: 0.9496885564052286
recall total: 0.9514531547877969


In [8]:
def accuracy(confusion_matrix):
    diagonal_sum = confusion_matrix.trace()
    sum_of_all_elements = confusion_matrix.sum()
    return diagonal_sum / sum_of_all_elements 

#### Implementation in python - Example 2 - simple pandas

In [10]:
import pandas as pd

data = {'y_Actual':    [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
        'y_Predicted': [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0]}

df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
print (df)

    y_Actual  y_Predicted
0          1            1
1          0            1
2          0            0
3          1            1
4          0            0
5          1            1
6          0            1
7          0            0
8          1            1
9          0            0
10         1            0
11         0            0


In [11]:
# in pandas, we create the confusion matrix using pd.crosstab

confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'], rownames=['Actual'], colnames=['Predicted'])
print (confusion_matrix)

Predicted  0  1
Actual         
0          5  2
1          1  4


#### Implementation in Python - Example 3 - Scikit-Learn

In [17]:
from sklearn.metrics import confusion_matrix

y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]

confusion_matrix(y_actu, y_pred)

array([[3, 0, 0],
       [0, 1, 2],
       [2, 1, 3]], dtype=int64)

## Arrange Data

### Reduce dimensions

### Normalise, Standardise data

In [None]:
Normalisation:
  - Each feature x is rescaled so that 0 ≤ x' ≤ 1, using the formula: x' = (x - min(x)) / (max(x) - min(x))
  - Normalization makes training less sensitive to the scale of features, so we can better solve for coefficients.      
  - After normalisation, features are now more consistent with each other, which will allow us to evaluate the output of our future models better.
  - Normalization makes the data better conditioned for convergence.
  - Normalizing will ensure that a convergence problem does not have a massive variance, making optimization feasible.

However:
  - When data is proportional, normalizing might not provide correct estimators. 
  - The same applies when the relative scale of your features matters and you want to keep it in your dataset.
  - You need to think about your data, and understand if the transformations you’re applying are in line with the outcomes you’re searching for.                                                                                                           
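The min-max formula can be worked through by hand in NumPy. The small array below is made up for illustration; scikit-learn's `MinMaxScaler` applies the same column-wise formula.

```python
import numpy as np

# Two illustrative features on very different scales.
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [4.0, 1000.0]])

# (x - min(x)) / (max(x) - min(x)), applied column by column.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_minmax = (X_demo - col_min) / (col_max - col_min)

print(X_minmax)
```

Each column now runs from 0 to 1, and the middle row keeps its relative position within each feature (1/3 and 1/4 of the way along).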

In [11]:
from sklearn.datasets import load_iris
from sklearn import preprocessing

# load the iris dataset
iris = load_iris()

# separate the data from the target attributes
X = iris['data']
y = iris['target']

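# normalize the data attributes: rescale each feature to the [0, 1] range
# (preprocessing.normalize would instead scale each sample/row to unit norm)
normalized_X = preprocessing.MinMaxScaler().fit_transform(X)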

In [None]:
Normalisation vs Standardisation
  - Keep in mind that there is some debate about whether it is better to have the input values centred around 0 (standardization) rather than between 0 and 1.
  - So doing your research is important as well, so that you understand what type of data your model needs.

In [None]:
Standardization
  - Here each feature x is rescaled so that μ = 0 and 𝛔 = 1, using the formula: (x_i - μ) / 𝛔 
  - good for comparing features that have a large difference in units or scales
  - good for running models (logistic regression, SVMs, perceptrons, neural networks etc.), as the estimated weights will update at similar rates rather than at different rates during training.
  - Standardizing tends to make the training process well behaved because the numerical condition of the optimization problem is improved (for example, PCA needs the features to be centered around the mean).

However
  - if you do standardize your data, be warned that you might be discarding some information.
  - If that information is not needed, the process can be helpful; otherwise it will impede your results.
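The standardization formula can likewise be checked by hand. The array is made up for illustration; the population standard deviation (NumPy's default, `ddof=0`) is used here, which is also what `preprocessing.scale` uses.

```python
import numpy as np

# Two illustrative features on very different scales.
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [4.0, 1000.0]])

mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)        # population standard deviation (ddof=0)
X_standardized = (X_demo - mu) / sigma

print(X_standardized)
print(X_standardized.mean(axis=0))   # approximately 0 for every feature
print(X_standardized.std(axis=0))    # approximately 1 for every feature
```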

In [10]:
from sklearn.datasets import load_iris
from sklearn import preprocessing

# load the Iris dataset
iris = load_iris()

# separate the data from the target attributes
X = iris['data']
y = iris['target']

# standardize the data attributes
standardized_X = preprocessing.scale(X)

### Binning

In [None]:
A 3rd option is binning
  - Consider the latitude feature, which holds the geographic position of the area in question
  - We're going to make new columns for each latitude range, and encode each value in our dataset with a 0 or 1 to indicate whether it is within that latitude range.

In [None]:
# Create range for your new columns: (32, 33), (33, 34), ..., (43, 44)
# (df is assumed to have a numeric "latitude" column)
lat_range = list(zip(range(32, 44), range(33, 45)))
new_df = pd.DataFrame()

# Iterate and create new columns, with the 0 and 1 encoding
for r in lat_range:
    new_df["latitude_%d_to_%d" % r] = df["latitude"].apply(
        lambda l: 1.0 if r[0] <= l < r[1] else 0.0)
new_df

In [None]:
Now that we have binned values, we have a binary value for each latitude range in California. 
With this additional approach, you have another way to clean your data and get it ready for modelling.

### Note: Iris dataset import methods

In [14]:
# Seaborn Method
import seaborn as sns
iris_sns = sns.load_dataset('iris')
iris_sns.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [18]:
# Pandas
import pandas as pd
iris_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris_df.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [21]:
# Scikit-learn
from sklearn.datasets import load_iris
iris_scikit = load_iris()
# This will produce arrays of data and target

### Note: Import Iris, standardise, return a df

In [36]:
# Pandas
import pandas as pd
from sklearn import preprocessing

iris_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

# separate the data from the target attributes
X = iris_df[['sepal_length','sepal_width','petal_length','petal_width']]
y = iris_df['species']

# Get column names first (provided data was organised this way)
names = ['sepal_length','sepal_width','petal_length','petal_width']   # it could be : iris_df.columns  but we have non numeric values in the last column

# standardize the data attributes
standardized_X = preprocessing.scale(X)
standardized_X = pd.DataFrame(standardized_X, columns=names)
standardized_X = pd.concat(( standardized_X,y),axis=1)
standardized_X.head()

   sepal_length  sepal_width  petal_length  petal_width species
0     -0.900681     1.019004     -1.340227    -1.315444  setosa
1     -1.143017    -0.131979     -1.340227    -1.315444  setosa
2     -1.385353     0.328414     -1.397064    -1.315444  setosa
3     -1.506521     0.098217     -1.283389    -1.315444  setosa
4     -1.021849     1.249201     -1.340227    -1.315444  setosa


## Multiple Linear Regression

In [None]:
Almost all the real-world problems that you are going to encounter will have more than two variables. 
Linear regression involving multiple explanatory variables is called “multiple linear regression” (sometimes loosely called multivariate linear regression). 
The steps to perform multiple linear regression are almost the same as those for simple linear regression. 

The difference lies in the evaluation. 
You can use it to find out which factor has the highest impact on the predicted output and how different variables relate to each other.
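One way to read "highest impact" off a fitted model is to compare coefficients after putting all features on the same scale. The sketch below uses synthetic data in which the true effects (5, 1 and 0) are chosen by construction; the feature names and noise level are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: x0 has a strong effect, x1 a weak one, x2 none at all.
rng = np.random.default_rng(0)
n_samples = 500
X_demo = rng.normal(size=(n_samples, 3))
y_demo = (5.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1]
          + rng.normal(scale=0.5, size=n_samples))

# Standardize first so the coefficients are comparable across features.
X_scaled = StandardScaler().fit_transform(X_demo)
model = LinearRegression().fit(X_scaled, y_demo)

for name, coef in zip(["x0", "x1", "x2"], model.coef_):
    print(name, round(float(coef), 2))
```

Without the scaling step, a coefficient's size also reflects the units of its feature, so raw coefficients should not be compared directly.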