#  Introduction to Machine Learning + Regression

# <font color="blue">Table of Contents</font>

## Introduction to Machine Learning
1. What is Machine Learning
2. Supervised Learning and Unsupervised Learning
3. Semi-supervised Learning and Reinforcement Learning

## Regression
_**Use Case: House Price Prediction**_
1. Simple Linear Regression<br>
2. Multiple Linear Regression<br>
3. Regularization:
    * Ridge Regression
    * Lasso Regression
    * Elastic-Net Regression
    
## Model Evaluation
1. What is Overfitting/Underfitting a Model
2. Train/Test Split
3. Cross Validation
4. Model Evaluation Metrics for Regression problems

## Project: SuperStore Sales Prediction
* Project description

# <font color="red">I. Introduction to Machine Learning </font>

## 1.1 What is Machine Learning
**Machine learning (ML)** is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. 

Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.

Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.

### <font color="green">Q: In your own words, explain what machine learning is and how you apply it in your industry.</font>

## 1.2 Supervised Learning and Unsupervised Learning
There are two main types of tasks within the field of machine learning: Supervised and Unsupervised.

**Supervised machine learning** builds a model that makes predictions based on evidence in the presence of uncertainty. A supervised learning algorithm takes a known set of input data and known responses to the data (output) and trains a model to generate reasonable predictions for the response to new data. Use supervised learning if you have known data for the output you are trying to predict.

**Unsupervised learning** finds hidden patterns or intrinsic structures in data. It is used to draw inferences from datasets consisting of input data without labeled responses.
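
To make the distinction concrete, here is a minimal sketch on synthetic data (the data, the seed, and the choice of `LinearRegression` and `KMeans` are illustrative assumptions, not part of the house price use case). The supervised model is given both inputs and known responses; the unsupervised model only sees inputs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Supervised: inputs X come with known responses y.
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=50)
reg = LinearRegression().fit(X, y)
pred = reg.predict(np.array([[4.0]]))
print("Predicted response for x=4:", pred[0])

# Unsupervised: only inputs, no responses; the algorithm looks for structure.
points = np.vstack([
    rng.normal(0, 0.5, size=(25, 2)),  # synthetic group around (0, 0)
    rng.normal(5, 0.5, size=(25, 2)),  # synthetic group around (5, 5)
])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster sizes:", np.bincount(km.labels_))
```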

![image.png](attachment:image.png)

### <font color="green">Q: What are the differences between supervised learning and unsupervised learning?</font>

## 1.3 Semi-supervised Learning and Reinforcement Learning

### Semi-supervised Learning
A basic drawback of supervised learning is that the dataset has to be hand-labeled, typically by a machine learning engineer or a data scientist. This is very costly, especially when dealing with large volumes of data. Semi-supervised learning was introduced to reduce this cost.

**Semi-supervised learning** combines supervised and unsupervised learning and uses both labeled and unlabeled data for training. In a typical scenario, the algorithm uses a small amount of labeled data together with a large amount of unlabeled data. A basic procedure is to first cluster similar data points using an unsupervised learning algorithm, and then use the existing labeled data to label the remaining unlabeled points.
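
The cluster-then-label procedure described above can be sketched as follows. This is a simplified illustration on synthetic, well-separated data; the use of `KMeans`, the majority-vote rule, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated synthetic groups; the true classes are 0 and 1.
X = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),
    rng.normal(5, 0.5, size=(50, 2)),
])
y_true = np.array([0] * 50 + [1] * 50)

# Pretend only six points were hand-labeled; -1 marks "unlabeled".
y_partial = np.full(100, -1)
labeled_idx = [0, 1, 2, 50, 51, 52]
y_partial[labeled_idx] = y_true[labeled_idx]

# Step 1: cluster all points, labeled or not.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: give every point in a cluster the majority label
# among that cluster's hand-labeled members.
y_filled = y_partial.copy()
for c in np.unique(clusters):
    in_cluster = clusters == c
    known = y_partial[in_cluster & (y_partial != -1)]
    if known.size > 0:
        y_filled[in_cluster] = np.bincount(known).argmax()

print("Agreement with true labels:", (y_filled == y_true).mean())
```

On real data the clusters rarely line up this cleanly with the classes, so the propagated labels should be treated as noisy.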

### Reinforcement Learning
**Reinforcement learning (RL)** is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

Unlike supervised learning, the agent is not told which action is correct; it discovers through trial and error which actions result in greater rewards. Three major components can be identified in reinforcement learning: the agent, the environment, and the actions. The agent is the learner or decision-maker, the environment includes everything that the agent interacts with, and the actions are what the agent can do.
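
As a small illustration of the agent, environment, and actions, the sketch below uses an epsilon-greedy strategy on a three-action problem. The strategy, the reward values, and the names `step`, `estimates`, and `epsilon` are assumptions chosen for this example; they are one simple way to learn by trial and error, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Environment: three actions with different average rewards (hidden from the agent).
true_means = np.array([0.2, 0.5, 0.8])

def step(action):
    """Return a noisy reward for the chosen action."""
    return rng.normal(true_means[action], 0.1)

# Agent: keeps a running estimate of each action's average reward.
estimates = np.zeros(3)
counts = np.zeros(3, dtype=int)
epsilon = 0.1  # probability of exploring a random action

for _ in range(2000):
    if rng.random() < epsilon:
        action = int(rng.integers(3))       # explore
    else:
        action = int(np.argmax(estimates))  # exploit the best estimate so far
    reward = step(action)
    counts[action] += 1
    # Incremental update of the running mean for this action.
    estimates[action] += (reward - estimates[action]) / counts[action]

print("Estimated rewards:", estimates.round(2))
print("Most-chosen action:", int(np.argmax(counts)))
```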

### <font color="green">Q: Name a business application for semi-supervised learning and/or reinforcement learning.</font>

# <font color="red">II. Regression </font>

## <font color="green">_Use Case - House Price Prediction<br>Data Setup_</font>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option("display.max_columns",100)
%matplotlib inline

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
#import Data


In [None]:
# Box plots of the main numeric columns, one per panel in a 3x3 grid
columns = ['price', 'bedrooms', 'bathrooms', 'grade', 'sqft_living',
           'sqft_living15', 'sqft_lot', 'sqft_lot15', 'condition']

fig, axes = plt.subplots(3, 3, figsize=(6, 10))
for column, ax in zip(columns, axes.ravel()):
    df.boxplot(column=column, ax=ax)

plt.suptitle('')
plt.tight_layout()

In [None]:
#Filtering Data
#1 Removing Outliers


In [None]:
#Create Dummies columns for zipcode


In [None]:
x_zipcode.head(25)

In [None]:
#Modify columns and additional filter


In [None]:
# Define x and y


In [None]:
#Train and test split
from sklearn.model_selection import train_test_split



## 2.1 Simple Linear Regression
**Simple linear regression** is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables:

* One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
* The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

The least squares regression line is the line that makes the vertical distances from the data points to the line as small as possible. It is called “least squares” because the line of best fit is the one that minimizes the sum of the squared errors (residuals).
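
The least squares idea can be checked on a small synthetic example (the data and the true slope and intercept below are assumptions for illustration, not the house price data). `scipy.stats.linregress` returns the fitted slope and intercept, and any other line gives a larger sum of squared errors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x, plus noise.
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.size)

result = stats.linregress(x, y)
print("slope:", result.slope, "intercept:", result.intercept)

# Sum of squared errors for the fitted line.
y_hat = result.intercept + result.slope * x
sse = np.sum((y - y_hat) ** 2)

# A slightly different line has a larger sum of squared errors.
y_other = result.intercept + (result.slope + 0.1) * x
sse_other = np.sum((y - y_other) ** 2)
print("SSE fitted:", sse, "SSE perturbed:", sse_other)
```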

### 2.1.1 Fit the model

In [None]:
from scipy import stats


### 2.1.2 Create predictions

### 2.1.3 Assess the model

In [None]:
from sklearn.metrics import r2_score, mean_squared_error

### R-squared
**R-squared** is a statistical measure of how close the data are to the fitted regression line. It is also known as the “coefficient of determination”, or the coefficient of multiple determination for multiple regression.

It is the percentage of the response variable variation that is explained by a linear model. In general, the higher the R-squared, the better the model fits your data.

### RMSE
**Root Mean Square Error (RMSE)** is the square root of the mean of the squared residuals. It indicates the absolute fit of the model to the data: how close the observed data points are to the model’s predicted values. Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit.

RMSE can be interpreted as the standard deviation of the residuals (the unexplained variation), and it has the useful property of being in the same units as the response variable. Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion for fit if the main purpose of the model is prediction.
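
Both measures can be computed by hand from the residuals. The sketch below uses four made-up observations and predictions (illustrative values only) and the standard definitions: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual.

```python
import numpy as np

# Made-up observed values and model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

residuals = y_true - y_pred

# R-squared: share of the variation in y explained by the model.
ss_res = np.sum(residuals ** 2)                  # 0.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 20.0
r2 = 1 - ss_res / ss_tot

# RMSE: typical size of a residual, in the units of y.
rmse = np.sqrt(np.mean(residuals ** 2))

print("R-squared:", r2)
print("RMSE:", rmse)
```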

<font color="green">**Q: Why did simple linear regression not work in this case?**</font>

## 2.2 Multiple Linear Regression
**Multiple linear regression** is the most common form of linear regression analysis. As a predictive analysis, multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables. The independent variables can be continuous or categorical (dummy coded as appropriate).
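
A minimal sketch of multiple linear regression with one dummy-coded categorical variable is shown below. The data frame, its column names (`sqft`, `bedrooms`, `zone`), and the coefficients used to generate `price` are all hypothetical and only loosely echo the house price use case.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Hypothetical housing-style data with two numeric and one categorical predictor.
data = pd.DataFrame({
    "sqft": rng.uniform(500, 3000, size=n),
    "bedrooms": rng.integers(1, 6, size=n),
    "zone": rng.choice(["A", "B", "C"], size=n),
})
zone_effect = data["zone"].map({"A": 0.0, "B": 20000.0, "C": 50000.0})
data["price"] = (100.0 * data["sqft"] + 5000.0 * data["bedrooms"]
                 + zone_effect + rng.normal(0, 5000, size=n))

# Dummy code the categorical column; drop_first avoids a redundant column.
X = pd.get_dummies(data[["sqft", "bedrooms", "zone"]],
                   columns=["zone"], drop_first=True, dtype=float)
y = data["price"]

model = LinearRegression().fit(X, y)
coefs = dict(zip(X.columns, model.coef_))
for name, value in coefs.items():
    print(f"{name:>10}: {value:,.1f}")
```

Each dummy coefficient is read relative to the dropped category (here zone A).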

### 2.2.1 Fit the model

### 2.2.2 Create predictions

### 2.2.3 Assess the model

## 2.3 Regularization
**Regularization** is a technique that constrains the coefficient estimates of a regression model. If there is noise in the training data, the estimated coefficients may not generalize well to future data. Regularization addresses this by shrinking the learned estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.

To produce a model that generalizes better on complex data, we can add a penalty term to the ordinary least squares (OLS) objective. The penalty biases the coefficients towards smaller values. The common choices are L1 regularization (**Lasso regression**), L2 regularization (**Ridge regression**), and **Elastic-Net regression**, which combines the L1 and L2 penalties.
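
The effect of each penalty on the coefficients can be seen in a small comparison. This is a sketch on synthetic data in which only the first two of five features matter; the `alpha` and `l1_ratio` values are arbitrary assumptions and would normally be tuned.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic data: only the first two of five features affect y.
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=100)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=10.0),                          # L2 penalty
    "Lasso": Lasso(alpha=0.5),                           # L1 penalty
    "ElasticNet": ElasticNet(alpha=0.5, l1_ratio=0.5),   # mix of L1 and L2
}

coefs = {}
for name, model in models.items():
    coefs[name] = model.fit(X, y).coef_
    print(f"{name:>10}: {np.round(coefs[name], 2)}")
```

Ridge shrinks all coefficients without setting them exactly to zero, while the L1 penalty in Lasso can remove irrelevant features entirely.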

### 2.3.1 Ridge Regression

In [None]:
from sklearn.linear_model import Ridge


### 2.3.2 Lasso Regression

In [None]:
from sklearn.linear_model import Lasso


### 2.3.3 Elastic-Net Regression

In [None]:
from sklearn.linear_model import ElasticNet


# <font color="red">III. Model Evaluation </font>

## 3.1 What is Overfitting / Underfitting?
In statistics and machine learning we usually split our data into two subsets, training data and testing data, and fit our model on the training data in order to make predictions on the test data. When we do that, one of two things might happen: we overfit our model or we underfit our model.

* **Overfitting** means that the model has been trained “too well” and now fits the training dataset too closely. This usually happens when the model is too complex (e.g. too many features/variables compared to the number of observations). Such a model will be very accurate on the training data but will probably be inaccurate on unseen or new data.
* **Underfitting** means that the model does not fit the training data and therefore misses the trends in the data. It also means the model cannot be generalized to new data. This is usually the result of a very simple model (not enough predictors/independent variables). It can also happen when, for example, we fit a linear model (like linear regression) to data that is not linear.
![image.png](attachment:image.png)
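
One way to see both effects is to fit polynomials of increasing degree to the same noisy data and compare training and test error. The sketch below is an illustration on synthetic data (a sine curve plus noise); the degrees 1, 3, and 10 are arbitrary choices standing in for "too simple", "about right", and "too flexible".

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic non-linear data.
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=40)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

train_mse, test_mse = {}, {}
for degree in (1, 3, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:>2}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

Training error can only go down as the degree grows; a gap between a low training error and a higher test error is the usual sign of overfitting, while a high error on both suggests underfitting.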

## 3.2 Train/Test Split
The data we use is usually split into training data and test data. The training set contains a known output and the model learns on this data in order to be generalized to other data later on. We have the test dataset (or subset) in order to test our model’s prediction on this subset.

We use the `train_test_split` function from the scikit-learn library.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5,2)), range(5)

print("Values in X: ", X)
print("Values in y: ", list(y))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print("Values in X_train: ", X_train)
print("Values in X_test: ", X_test)

## 3.3 Cross Validation
Cross validation is very similar to a train/test split, but it is applied to more subsets. We split our data into k subsets and train on k-1 of them, holding out the remaining subset for testing. We can repeat this so that each subset is held out once.

In **K-Folds Cross Validation** we split our data into k different subsets (or folds). We use k-1 subsets to train our model and leave the last subset (or the last fold) as test data. We repeat this for each fold, average the model's score across the folds, and then finalize our model. After that we test it against the held-out test set.
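
The fit-and-score loop over the folds can be done in one call with `cross_val_score`. The example below is a sketch on synthetic data; the choice of five folds, the `LinearRegression` model, and R-squared as the score are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression data with three features.
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.3, size=100)

# Five folds: each fold is used once as test data, the other four for training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R-squared per fold:", scores.round(3))
print("Mean R-squared:", scores.mean().round(3))
```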

In [None]:
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2) 
kf.get_n_splits(X)

print(kf)

In [None]:
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

## 3.4 Model Evaluation Metrics for Regression problems
As we discussed earlier, we can use the following metrics for evaluating regression problems:
* R-squared, Mean Absolute Error, Mean Squared Error, Root Mean Squared Error.
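
All four metrics are available through `sklearn.metrics`, with RMSE obtained as the square root of the mean squared error. The values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up observed values and predictions.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # back in the units of y

print("R-squared:", r2)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

MAE and RMSE are both in the units of the response; RMSE penalizes large errors more heavily because the errors are squared before averaging.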

# <font color="red">IV. Project: SuperStore Sales Prediction </font>

Use the given data in "sales.csv" to predict the SuperStore's sales.