# Machine Learning 1

# Some Concepts

## ML Tasks

In [None]:
Types of tasks
  - Classification
  - Regression
  - Structured annotation
  - Clustering
  - Transcription

In [None]:
Challenges
  - Quality of data 
  - Time-consuming work: data acquisition, feature extraction and retrieval in particular take a lot of time.
  - Lack of specialists: ML is still a young field, so experienced practitioners are hard to find.
  - No clear objective when formulating business problems
  - Overfitting and underfitting
  - Curse of dimensionality: data points with too many features make it harder for a model to learn and generalize.
  - Difficulty in deployment: complex ML models can be hard to deploy in production.

In [None]:
Applications
  - Emotion analysis
  - Sentiment analysis
  - Error detection and prevention
  - Weather forecasting and prediction
  - Stock market analysis and forecasting
  - Speech synthesis
  - Speech recognition
  - Customer segmentation
  - Object recognition
  - Fraud detection
  - Fraud prevention
  - Recommendation of products to customer in online shopping

## Stats 

In [4]:
import pandas as pd
import statsmodels.formula.api as sm
df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
result = sm.ols(formula="A ~ B + C", data=df).fit()
print(result.params)

Intercept    14.952480
B             0.401182
C             0.000352
dtype: float64


In [5]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                      A   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.158
Method:                 Least Squares   F-statistic:                     1.375
Date:                Sun, 07 Jun 2020   Prob (F-statistic):              0.421
Time:                        18:23:25   Log-Likelihood:                -18.178
No. Observations:                   5   AIC:                             42.36
Df Residuals:                       2   BIC:                             41.19
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     14.9525     17.764      0.842      0.4



## Type of Learning

##### Supervised Learning

In [None]:
- The majority of practical machine learning uses supervised learning.
- Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.
- Y = f(X)
- It is called supervised learning because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process.

In [None]:
Supervised learning problems can be further grouped into regression and classification problems.
  - Classification: A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease”.
  - Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.

Some common types of problems built on top of classification and regression include recommendation and time series prediction respectively.

Some popular examples of supervised machine learning algorithms are:
  - Linear regression for regression problems.
  - Random forest for classification and regression problems.
  - Support vector machines for classification problems.
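
The two problem types above can be illustrated with a minimal sketch. The data here is synthetic and made up purely for illustration (it is not a dataset from these notes), and the model settings are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Regression: the output is a real value (assumed rule: y = 3x + 2 plus noise).
X_reg = rng.uniform(0, 10, size=(200, 1))
y_reg = 3.0 * X_reg[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)
reg = LinearRegression().fit(X_reg, y_reg)
print("learned slope and intercept:", reg.coef_[0], reg.intercept_)

# Classification: the output is a category (assumed rule: class 1 when x0 + x1 > 0).
X_clf = rng.normal(size=(200, 2))
y_clf = (X_clf[:, 0] + X_clf[:, 1] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_clf, y_clf)
print("predicted class for [2, 2]:", clf.predict([[2.0, 2.0]])[0])
```

In both cases the algorithm only sees pairs of inputs and known outputs and has to recover the mapping Y = f(X) from them.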

##### Unsupervised Learning

In [None]:
Unsupervised learning is where you only have input data (X) and no corresponding output variables.
The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.

This is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher.
Algorithms are left to their own devices to discover and present the interesting structure in the data.

Unsupervised learning problems can be further grouped into clustering and association problems.
  - Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
  - Association:  An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.

Some popular examples of unsupervised learning algorithms are:
  - k-means for clustering problems.
  - Apriori algorithm for association rule learning problems.
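
A minimal clustering sketch with k-means, assuming two synthetic groups of points that are invented for illustration. No labels are passed to the algorithm; it has to find the grouping on its own. (Association rule learning with Apriori is not part of scikit-learn, so it is not shown here.)

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated synthetic groups (an assumption made for this example).
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Only X is given: there is no y anywhere in this example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

print("cluster centers:")
print(kmeans.cluster_centers_)
print("cluster of the first and last point:", labels[0], labels[-1])
```

Which group ends up as cluster 0 or cluster 1 is arbitrary; only the grouping itself is meaningful.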

##### Semi Supervised

In [None]:
Problems where you have a large amount of input data (X) and only SOME of the data is labeled (Y) are called semi-supervised learning problems.

These problems sit in between both supervised and unsupervised learning.
  - A good example is a photo archive where only some of the images are labeled (e.g. dog, cat, person) and the majority are unlabeled.
  - Many real world machine learning problems fall into this area.
  - This is because labeling data can be expensive or time-consuming, as it may require access to domain experts, whereas unlabeled data is cheap and easy to collect and store.

You can use unsupervised learning techniques to discover and learn the structure in the input variables.
You can also use supervised learning techniques to make best guess predictions for the unlabeled data, feed that data back into the supervised learning algorithm as training data and use the model to make predictions on new unseen data.
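
The "best guess" loop described above is often called pseudo-labeling. The following is a rough sketch under stated assumptions: the data is synthetic, only 20 of 200 points are treated as labeled, and logistic regression is an arbitrary choice of supervised learner.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic two-class data (assumption for illustration).
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Pretend that only 10 points per class carry a label.
labeled_idx = np.concatenate([np.arange(10), np.arange(100, 110)])
unlabeled_idx = np.setdiff1d(np.arange(200), labeled_idx)

# Step 1: train on the small labeled subset.
model = LogisticRegression().fit(X[labeled_idx], y[labeled_idx])

# Step 2: make best-guess predictions (pseudo-labels) for the unlabeled data.
pseudo_labels = model.predict(X[unlabeled_idx])

# Step 3: feed labeled + pseudo-labeled data back in as training data.
X_all = np.vstack([X[labeled_idx], X[unlabeled_idx]])
y_all = np.concatenate([y[labeled_idx], pseudo_labels])
model = LogisticRegression().fit(X_all, y_all)

# Step 4: predict on new, unseen data.
new_predictions = model.predict([[3.0, 3.0], [-3.0, -3.0]])
print(new_predictions)
```

In practice only confident pseudo-labels are usually kept, since wrong guesses fed back into training can reinforce the model's own mistakes.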

#### Machine Learning vs Deep Learning

In [None]:
Deep learning is a subset of machine learning.
  - More specifically, deep learning is often described as an evolution of machine learning.
  - It uses multi-layer neural networks that learn useful features directly from raw data, which reduces the need for manual feature engineering.

However, its capabilities are different.
  - Classical machine learning models do become progressively better with more data, but they usually depend on features designed by a human.
  - If such a model returns inaccurate predictions, an engineer typically has to step in and adjust the features or the model.
  - A deep learning model adjusts its internal representations during training by comparing its predictions with the known targets; it still needs labeled data or another training signal to do so.
    
A deep learning model is designed to analyze data with a layered logic structure loosely similar to how a human would draw conclusions.
  - To achieve this, deep learning applications use a layered structure of algorithms called an artificial neural network.
  - The design of an artificial neural network is inspired by the biological neural network of the human brain. On tasks such as image and speech recognition, this can lead to models that are considerably more capable than standard machine learning models.

It is tricky to ensure that a deep learning model does not draw incorrect conclusions: like other examples of AI, it requires lots of training to get the learning process right.
But when it works as intended, deep learning is often regarded as a scientific marvel that many consider the backbone of true artificial intelligence.

# Simple Regression

#### Importing the tools

In [6]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline

#### The basics in Scikit-learn

In [None]:

# Declare the X and y
# (df is assumed to hold the housing dataset with the columns below, loaded beforehand)
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population']]
y = df['Price']

# Prepare the test / train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Get the sets size
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)

#printing the output and coefficients
coeff_df = pd.DataFrame(linreg.coef_,X.columns,columns=['Coefficient']) 
coeff_df

#### Visualisation of output

In [None]:
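# Plotting the predictions vs the test set
import seaborn as sns

y_pred = linreg.predict(X_test)  # linreg is the model fitted above
plt.scatter(y_test, y_pred)

# Plotting the distribution of the errors (residuals) in a separate figure
# histplot replaces the deprecated distplot (seaborn >= 0.11)
plt.figure()
sns.histplot(y_test - y_pred, bins=50, kde=True)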


#### Metrics

In [None]:
Mean Absolute Error (MAE) is the mean of the absolute values of the errors.
Mean Squared Error (MSE) is the mean of the squared errors.
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors.

Comparing these metrics:

MAE is the easiest to understand because it is the average absolute error.
MSE is often preferred over MAE because it “punishes” larger errors more heavily, which tends to be useful in the real world.
RMSE is often preferred over MSE because it is expressed in the same units as “y”.
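
To make the three definitions concrete, here is a small worked example computed by hand with NumPy. The four true and predicted values are made up for illustration.

```python
import numpy as np

# Made-up values for illustration.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_hat            # [ 1.,  0., -2., -1.]

mae = np.mean(np.abs(errors))      # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)         # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                # sqrt(1.5), roughly 1.22

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

Note how the single error of size 2 contributes 4 of the 6 units in the squared sum: this is the sense in which MSE and RMSE weight large errors more heavily than MAE does.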

In [None]:
# to get the metrics

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, y_pred)) 
print('MSE:', metrics.mean_squared_error(y_test, y_pred)) 
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred))) 

#### Predictions

In [None]:
# Let's say that the model inputs are
# (df is assumed here to be a cars dataset with 'Weight', 'Volume' and 'CO2' columns)
from sklearn import linear_model

X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X, y)

# Predict the CO2 emission of a car where the weight is 2300 kg and the volume is 1300 cm³:
predictedCO2 = regr.predict([[2300, 1300]])

print(predictedCO2)


#### OLS Regression

In [None]:
https://docs.w3cub.com/statsmodels/generated/statsmodels.regression.linear_model.ols.fit_regularized/

In [None]:
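import statsmodels.api as sm  # the array-based OLS class lives in statsmodels.api

# Note: sm.OLS fits no intercept unless X has a constant column, e.g. sm.add_constant(X)
est = sm.OLS(y, X)
est = est.fit()
est.summary()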

# Model Validation and Interpretation

In [None]:
There is always a need to validate the stability of a machine learning model. You need some assurance that:
  - the model has captured most of the patterns in the data correctly
  - the model is not picking up too much of the noise
  - the model is low on both bias and variance.

In [None]:
Validation
  - the process of deciding whether the numerical results quantifying hypothesized relationships between variables are acceptable as descriptions of the data.

Residuals
  - Evaluating residuals means estimating the model's error after training.
  - The numerical estimate of the difference between predicted and original responses on the training data is called the training error.
  - However, this only gives us an idea of how well the model does on the data used to train it.
  - It is possible that the model is underfitting or overfitting the data.
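
The limitation of the training error can be seen in a short sketch. It assumes synthetic noisy data and uses an unrestricted decision tree, chosen here only because it overfits easily; neither comes from the notes above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: a sine curve plus noise (assumption for illustration).
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A tree with no depth limit can memorise the training data, noise included.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, tree.predict(X_train))
val_mse = mean_squared_error(y_val, tree.predict(X_val))

print("training MSE:  ", train_mse)
print("validation MSE:", val_mse)
```

A training error close to zero next to a clearly larger validation error is the typical signature of overfitting, which the training error alone cannot reveal.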

Cross Validation:
  - Purpose: get an indication of how well the learner will generalize to an independent / unseen data set
  - How: discussed below

## Hold Out Method

In [None]:
Simple
  - Remove a part of the training data and use it to get predictions from the model trained on the rest of the data.
  - The error estimate then tells us how the model is doing on unseen data, i.e. the validation set.

However
  - It suffers from high variance, since it is not certain which data points will end up in the validation set, and the result can differ considerably between splits.
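
The variance of the hold-out estimate can be observed by repeating the split with different random seeds. This is a minimal sketch on a small synthetic data set; the data, the 25% validation share and the five seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Small, noisy synthetic data set (assumption for illustration).
X = rng.uniform(0, 10, size=(40, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 3.0, size=40)

scores = []
for seed in range(5):
    # Same data, same model; only the random hold-out split changes.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    scores.append(mean_squared_error(y_val, model.predict(X_val)))

print("validation MSE per split:", scores)
print("spread (max - min):", max(scores) - min(scores))
```

The spread between the five estimates shows how much a single hold-out score depends on which points happened to land in the validation set.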

## K-Fold Cross Validation

In [None]:
The Problem
  - Because there is rarely enough data to train a model, removing part of it for validation increases the risk of underfitting.
  - By reducing the training data, we risk losing important patterns / trends in the data set, which in turn increases the error induced by bias.

The Solution
  - What we require is a method that provides ample data for training the model and also leaves ample data for validation. 
  - K-Fold cross validation does exactly that.

K Fold cross validation
  - The data is divided into k subsets.
  - The holdout method is repeated k times, such that each time:
      - one of the k subsets is used as the test set / validation set
      - the other k-1 subsets are put together to form a training set.
  - The error estimate is averaged over all k trials to get the overall effectiveness of the model.
  - Every data point is in a validation set exactly once and in a training set k-1 times.
  - Compared with a single hold-out split, this reduces
      - bias, as most of the data is used for fitting
      - variance, as every data point is eventually used for validation.
  - Rotating which subset serves as the test set also adds to the effectiveness of this method.
  - As a rule of thumb backed by empirical evidence, K = 5 or 10 is generally preferred, but nothing is fixed and K can take any value between 2 and the number of data points.
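
The procedure can be sketched with scikit-learn's `KFold` and `cross_val_score`. The data is synthetic and the choice of K = 5 and linear regression is only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration).
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Check that every data point lands in a validation fold exactly once.
val_counts = np.zeros(len(X), dtype=int)
for train_idx, val_idx in kf.split(X):
    val_counts[val_idx] += 1
print("each point validated once:", bool(np.all(val_counts == 1)))

# scikit-learn reports the negated MSE so that higher is always better.
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error"
)
mean_mse = -scores.mean()
print("MSE per fold:", -scores)
print("mean MSE over the 5 folds:", mean_mse)
```

The averaged score is the cross-validation estimate; the per-fold scores give a sense of how stable that estimate is.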

## Stratified K-Fold Cross Validation

In [None]:
In some cases, there may be a large imbalance in the response variable.
  - For example, in a dataset of house prices, there might be a disproportionately large number of high-priced houses.
  - Or, in classification, there might be several times more negative samples than positive samples.

For such problems, a slight variation in the K-Fold cross validation technique is made:
  - Each fold contains approximately the same percentage of samples of each target class as the complete set
  - In regression problems, the mean response value is approximately equal in all the folds.

This variation is also known as Stratified K Fold.
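
For the classification case, a minimal sketch with scikit-learn's `StratifiedKFold`, assuming a made-up label vector with 90 negatives and 10 positives:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels invented for illustration: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # the feature values do not affect how the folds are built

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count the positives that end up in each validation fold.
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(positives_per_fold)  # [2, 2, 2, 2, 2]
```

Each fold of 20 samples keeps the 10% positive rate of the full set, whereas a plain K-Fold split could by chance produce a fold with very few or no positives.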

The validation techniques explained above are also referred to as non-exhaustive cross validation methods.
They do not compute all ways of splitting the original sample; you only have to decide how many subsets to make.
They are approximations of the exhaustive methods explained below, which compute all possible ways the data can be split into training and test sets.

## Leave-P-Out Cross Validation (exhaustive method)

In [None]:
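Exhaustive methods compute all possible ways the data can be split into training and test sets.
Leave-p-out uses p data points as the validation set and the remaining points as the training set, and repeats this for every possible choice of p points.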
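
A small sketch with scikit-learn's `LeavePOut` shows why exhaustive methods get expensive quickly. The six-point data set and p = 2 are arbitrary choices for illustration.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

# Six toy data points (assumption for illustration).
X = np.arange(6).reshape(-1, 1)

lpo = LeavePOut(p=2)
splits = list(lpo.split(X))

# Every possible pair of points is used as a validation set once.
print("number of splits:", len(splits))       # 15
print("n choose p:      ", comb(len(X), 2))   # 15
print("first split, validation indices:", splits[0][1])
```

The number of splits is "n choose p", which grows very fast: with 100 data points and p = 2 there are already 4950 models to train.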

## Confusion Table

## Interpretation of the Outputs

## Multiple Linear Regression

In [None]:
Almost all real-world problems that you are going to encounter involve more than two variables.
Linear regression with several explanatory variables is called “multiple linear regression” (it is sometimes loosely called multivariate linear regression, although strictly that term refers to models with several response variables).
The steps to perform multiple linear regression are almost the same as for simple linear regression.

The difference lies in the evaluation.
You can use it to find out which factor has the highest impact on the predicted output and how different variables relate to each other.
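
One common heuristic for comparing the impact of factors is to standardize the features before fitting, so that the coefficients are on a comparable scale. The sketch below assumes synthetic data with two features on very different scales; it is one possible approach, not the only way to judge feature importance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (assumption for illustration).
x1 = rng.normal(0, 1, size=500)      # standard deviation about 1
x2 = rng.normal(0, 100, size=500)    # standard deviation about 100
y = 5.0 * x1 + 0.5 * x2 + rng.normal(0, 1, size=500)
X = np.column_stack([x1, x2])

# Raw coefficients: change in y per one unit of each feature.
raw = LinearRegression().fit(X, y)

# Standardized coefficients: change in y per one standard deviation of each feature.
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)

print("raw coefficients:         ", raw.coef_)
print("standardized coefficients:", scaled.coef_)
```

The raw coefficient of the second feature is the smaller one, yet after standardizing it is the larger one, because that feature varies over a much wider range. Raw coefficients on their own are therefore not a reliable ranking of impact when features use different units.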