## Regression Analysis: First Machine Learning Algorithm!

### Machine learning
Machine learning is an application of artificial intelligence (AI) that gives systems the ability to learn and improve from experience automatically, without being explicitly programmed.

<img style="float: left;" src = "./img/ml_definition.png" width="600" height="600">

<img style="float: left;" src = "./img/traditionalVsml.png" width="600" height="600">

### Types of Machine Learning

<img style="float: left;" src = "./img/types-ml.png" width="700" height="600">

<br>
<br>

<img style="float: left;" src = "./img/ml-ex.png" width="800" height="700">

__Why use linear regression?__

1. Easy to use
2. Easy to interpret
3. Basis for many methods
4. Runs fast
5. Most people have heard about it :-) 

### Libraries in Python for Linear Regression

The two most popular ones are

1. `scikit-learn`
2. `statsmodels`

We highly recommend learning `scikit-learn`, since it is also the general-purpose machine learning package in Python.

### Linear regression 

Let's use `scikit-learn` for this example.

Linear regression is of the form:

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

- $y$ is the value we want to predict (the dependent variable, also called the response variable)
- $\beta_0$ is the intercept (the predicted value of $y$ when every $x_i$ is 0)
- $\beta_1$ is the coefficient for $x_1$ (the first feature, also called an independent variable)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature, also called an independent variable)

The $\beta$ are called *model coefficients*
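
To make the equation concrete, here is a minimal sketch that computes predictions by hand with NumPy. The coefficient values and the two sample rows are made up purely for illustration; they are not taken from the mall customer data.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0 = 2.0                    # intercept
betas = np.array([0.5, -1.0])   # beta_1 and beta_2

# Two observations, each with two features (x_1, x_2)
X_demo = np.array([[4.0, 1.0],
                   [10.0, 3.0]])

# y = beta_0 + beta_1*x_1 + beta_2*x_2, computed for every row at once
y_demo = beta_0 + X_demo @ betas
print(y_demo)
```

For the first row this is 2 + 0.5\*4 - 1\*1 = 3, and for the second row 2 + 0.5\*10 - 1\*3 = 4. A fitted `LinearRegression` performs the same computation in `predict`, using its learned `intercept_` and `coef_`.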

The model coefficients are estimated from the data during training. (In machine learning parlance, the weights are learned by the algorithm.) The objective is to minimize the sum of squared errors, which is known as the least squares method.
<br>

**Least Squares Method**: Choose the weights so that the sum of the squared errors (the differences between the actual and the predicted values) over all observations is as small as possible. [Wiki](https://en.wikipedia.org/wiki/Least_squares)
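
The least squares idea can be sketched on synthetic data. The example below is an illustration under assumptions: the data is randomly generated (true intercept 1, true coefficients 2 and -3), and the names `X_ls`, `y_ls`, `A` and `weights` are hypothetical. It solves the least squares problem directly with `np.linalg.lstsq` and compares the result with what `LinearRegression` learns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 1 + 2*x_1 - 3*x_2 + a little noise
rng = np.random.default_rng(0)
X_ls = rng.normal(size=(50, 2))
y_ls = 1.0 + 2.0 * X_ls[:, 0] - 3.0 * X_ls[:, 1] + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so the intercept is estimated like any other weight
A = np.column_stack([np.ones(len(X_ls)), X_ls])

# Weights that minimize the sum of squared errors ||A @ w - y||^2
weights, *_ = np.linalg.lstsq(A, y_ls, rcond=None)

# scikit-learn solves the same problem
model = LinearRegression().fit(X_ls, y_ls)

print("lstsq weights:   ", weights)
print("sklearn weights: ", model.intercept_, model.coef_)
```

Both approaches should print nearly identical numbers, close to the true values 1, 2 and -3.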

<img style="float: left;" src = "./img/lin_reg.jpg" width="600" height="600">

<h2> Model Building & Testing Methodology </h2>
<img src="./img/train_test.png" alt="Train & Test Methodology" width="700" height="600">
<br>
<br>
<br>

### Must-read book:
Interpretable Machine Learning by Christoph Molnar
https://christophm.github.io/interpretable-ml-book/intro.html

In [None]:
# Step1: Import packages
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(color_codes = True)
%matplotlib inline

In [None]:
# Step2:  Load our data
df = pd.read_csv('./data/Mall_Customers.csv')
df.rename(columns={'CustomerID':'id','Spending Score (1-100)':'score','Annual Income (k$)':'income'},inplace=True)
df.head() # Visualize first 5 rows of data

In [None]:
# Step3: Feature Engineering - transform variables into inputs suitable for a machine learning algorithm
# Transform the categorical variable Gender using one-hot encoding
gender_onhot = pd.get_dummies(df['Gender'])
gender_onhot.head()
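
As a small illustration of what one-hot encoding produces, the sketch below applies `pd.get_dummies` to a made-up four-row frame (the `toy` data is hypothetical, not the mall data). It also shows the `drop_first=True` option, which keeps one column fewer; this is a common alternative when the two indicator columns are redundant, and is not what the cell above does.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})

# One indicator column per category; exactly one of them is set in each row
onehot = pd.get_dummies(toy['Gender'])
print(onehot)

# Drop the first category: a single 'Male' column carries the same information
onehot_drop = pd.get_dummies(toy['Gender'], drop_first=True)
print(onehot_drop)
```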

In [None]:
# Create input dataset aka X
X = pd.merge(df[['Age','income']], gender_onhot, left_index=True, right_index=True)
X.head()

In [None]:
sns.pairplot(X[['Age','income']])
print("Correlation between variables.........")
X.iloc[:,:4].corr()

In [None]:
# Create target variable
Y = df['score']
Y.head()

In [None]:
# Step3 (continued): Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.20, random_state=42)
print('Shape of Training Xs:{}'.format(X_train.shape))
print('Shape of Test Xs:{}'.format(X_test.shape))

In [None]:
# Step4: Build the linear regression model
learner = LinearRegression()  # initialize the linear regression model

learner.fit(X_train, y_train)  # train the model on the training set
y_predicted = learner.predict(X_test)  # predict scores for the test set
score = learner.score(X_test, y_test)  # R^2 of the model on the test set


### Interpretation

__Score__: R^2 (pronounced "R squared"), also called the __coefficient of determination__ of the prediction.

__Range of Score values__: At most 1. 1 -> best case scenario, where every predicted value is the same as the actual value. 0 -> the model does no better than always predicting the mean of the actual Y values. Negative values are possible on test data, when the model does worse than predicting that mean.

__Formula for Score__: R^2 = (1 - u/v), where u is the residual sum of squares `((y_true - y_pred) ** 2).sum()` and v is the total sum of squares `((y_true - y_true.mean()) ** 2).sum()`
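
The formula can be checked by hand on a few made-up numbers. In this sketch the arrays `y_true`, `y_good` and `y_bad` and the helper `r2_by_hand` are hypothetical names introduced for illustration; the result is compared with `r2_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_good = np.array([2.5, 5.5, 6.5, 9.5])  # close to the actual values
y_bad = np.array([9.0, 7.0, 5.0, 3.0])   # worse than predicting the mean

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - u/v, following the formula above."""
    u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1 - u / v

print(r2_by_hand(y_true, y_good), r2_score(y_true, y_good))
print(r2_by_hand(y_true, y_bad), r2_score(y_true, y_bad))
```

Here v is 20. The good predictions give u = 1, so R^2 is 0.95; the bad predictions give u = 80, so R^2 is -3, which shows that the score can fall below 0.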

In [None]:
print(score)
print(y_predicted)

In [None]:
# Step5: Check Accuracy of Model
df_new = pd.DataFrame({"true_score":y_test,"predicted_score":y_predicted})
df_new

In [None]:
# Step6: Diagnostic analysis

from sklearn.metrics import mean_squared_error, r2_score
print("Intercept is at: %.2f"%(learner.intercept_))
# The coefficients
print('Coefficients: \n', learner.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_predicted))
# R^2 (coefficient of determination): 1 is perfect prediction
print('R^2 score: %.4f' % r2_score(y_test, y_predicted))