# <div align="center">  Artificial Intelligence Concordia Workshop 1: <br /> </div>
## <div align="center"> Basics of Machine Learning, Linear Regression and Logistic Regression</div>

<div align="center">
  <img src="https://github.com/olibel270/Workshop1JupyterNote/blob/master/images/AICLogo.jpg?raw=1" style="width: 300px" /></div>

---------------------------------------------------------------------

<div align="right">
  Follow us on:<br />
[Facebook](https://www.facebook.com/AISConU/)<br />
[Our Website](https://www.aisconcordia.com)<br />
  </div>
 


# Introduction

# Machine Learning as a subset of AI

# Python


Short description <br />
Tools available:
* Many libraries: Scikit-Learn, NumPy, PyTorch
* Highly optimized implementations that make use of multiple cores and GPUs
* Large community of users, so information is easy to find

https://realpython.com/python-first-steps/

The Google Colab Environment

In [0]:
# Import Python Packages
import warnings
import sys
import os

# Relevant Libraries for ML
import csv
import scipy
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

# Show Packages versions
print('Installed packages and their versions:')
print('csv: {}'.format(csv.__version__))
print('scipy: {}'.format(scipy.__version__))
print('matplotlib: {}'.format(matplotlib.__version__))
print('numpy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('seaborn: {}'.format(sns.__version__))
print('sklearn: {}'.format(sklearn.__version__))

# Setting some Parameters
sns.set(color_codes=True)
%matplotlib inline
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points



In [0]:
# Importing metrics for evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from scipy import stats
from scipy.stats import norm, skew #for some statistics

# General Framework of Machine Learning

<a id="8"></a> <br>
### 4-1 Inputs
* train.csv - the training set

<a id="9"></a> <br>

### 4-2 Outputs
* sale prices for every record in test.csv

* Define Problem 
* Specify Inputs & Outputs
* Exploratory data analysis
* Data Collection
* Data Preprocessing
* Data Cleaning
* Visualization
* Model Design, Training, and Offline Evaluation
* Model Deployment, Online Evaluation, and Monitoring
* Model Maintenance, Diagnosis, and Retraining

 <img src="http://s9.picofile.com/file/8338227634/workflow.png" />
 
Regression Vs. Classification


## 2-1 How to solve Problem?
**Data Science has so many techniques and procedures that can confuse anyone.**

**Step 1**: Translate your business problem statement into technical one

**Step 2**: Decide on the supervised learning technique

Classification Vs. regression

**Step 3**: Literature survey

**Step 4**: Data cleaning

Missing values

Different techniques can be used to impute missing values

Duplicate records

Incorrect values
Check values more than 3 standard deviations from the mean

**Step 5**: Feature engineering

Removing redundant features
Metrics like AIC and BIC can be used to identify redundant features. There are built-in packages that perform operations like forward selection and backward selection to remove them.
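
As a rough illustration of forward selection, the sketch below uses scikit-learn's `SequentialFeatureSelector`. Note that it ranks features by cross-validated score rather than by AIC or BIC, and that the synthetic data and the choice of keeping two features are assumptions made only for this example.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): only the first two of
# five features actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Greedy forward selection: start with no features and repeatedly add the
# one that most improves the cross-validated score.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
selector.fit(X, y)
selected = np.flatnonzero(selector.get_support())
print("selected feature indices:", selected)
```

Passing `direction="backward"` instead starts from all features and removes them one at a time.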

Transforming a feature
A feature might have a non-linear relationship with the output column. While complex models can capture this with enough data, simple models might not. I usually try to visualize different functions of each column, such as log, inverse, quadratic and cubic, and choose the transformation whose distribution looks closest to a normal curve.
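
Besides looking at plots, one way to compare candidate transformations is to measure how symmetric each result is. The sketch below is a minimal example on a synthetic right-skewed feature; the helper `skewness` and the list of candidates are hypothetical and only meant to show the idea.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the third standardized moment."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Synthetic right-skewed feature (log-normal), standing in for a real column.
rng = np.random.default_rng(0)
feature = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

candidates = {
    "identity": feature,
    "log": np.log(feature),
    "sqrt": np.sqrt(feature),
    "inverse": 1.0 / feature,
}
for name, transformed in candidates.items():
    print(f"{name:>8}: skewness = {skewness(transformed):.2f}")

# Choose the transformation whose skewness is closest to 0.
best = min(candidates, key=lambda name: abs(skewness(candidates[name])))
print("closest to symmetric:", best)
```

A skewness near 0 only says the distribution is roughly symmetric, so it is still worth checking the histogram of the chosen transformation.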

**Step 6**: Data modification

Scaling
Skew
Up-sample
Down-sample

**Step 7**: Modelling

Start with simple models, then more complex ones

Know the assumptions of each model

**Step 8**: Model comparison
Cross-validation reports the average performance of a model over several train/validation splits, which helps detect over-fitting.
Randomize (shuffle) the data before cross-validation.

A good technique to compare performance of different models is ROC curves. ROC curves help you visualize performance of different models across different thresholds. While ROC curves give a holistic sense of model performance, based on the business decision, you must choose the performance metric like Accuracy, True Positive Rate, False Positive Rate, F1-Score etc.
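
A minimal sketch of such a comparison, assuming scikit-learn is available: two classifiers are scored with shuffled 5-fold cross-validation, using the area under the ROC curve as the metric. The synthetic dataset and the two models are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# shuffle=True randomizes the row order before the folds are cut.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}
mean_auc = {}
for name, model in models.items():
    # Area under the ROC curve, computed on each held-out fold.
    scores = cross_val_score(model, X, y, cv=folds, scoring="roc_auc")
    mean_auc[name] = scores.mean()
    print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Swapping the `scoring` argument (for example to `"accuracy"` or `"f1"`) changes the metric without changing the rest of the comparison.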

**Step 9**: Error analysis


**Step 10**: Improving your best model

Once I have the best model, I usually plot training vs. testing accuracy (or whichever metric is appropriate) against the number of parameters. It is usually also easy to plot training and testing accuracy against the number of data points. This plot tells you whether your model is over-fitting or under-fitting. The article "Detecting over-fitting vs under-fitting" explains this concept clearly.

Understanding whether your model is over-fitting or under-fitting tells you how to proceed with the next steps. If the model is over-fitting, you might consider collecting more data. If the model is under-fitting, you might consider making the model more complex (e.g. adding higher-order terms to a linear or logistic regression).
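
The plot of score against number of data points is often called a learning curve. The sketch below, which assumes scikit-learn's `learning_curve` helper and a made-up quadratic dataset, prints the numbers that such a plot would show for a deliberately too-simple model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic data (an assumption for illustration): a quadratic target.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=300)

# A straight line cannot follow a quadratic target, so this model under-fits.
sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 5)
)
for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={n:3d}  train R^2={tr:.2f}  validation R^2={te:.2f}")
```

Low scores on both the training and validation sets suggest under-fitting, while a high training score with a much lower validation score suggests over-fitting.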

**Step 11**: Deploying the model


**Step 12**: Adding feedback

Models are built on historical data. Feed new data back into the model so that it captures current trends or changes.




# Problem Definition


We will use the House Prices dataset. It contains information about houses and their sale prices, and the target value is:

* SalePrice

# Data


  ## Datasets

https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research

https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/

https://www.kaggle.com/datasets


### The Housing Dataset
  

<img src="https://kaggle2.blob.core.windows.net/competitions/kaggle/5407/media/housesbanner.png"></img>





#### Variables


The variables are :
* SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* MSZoning: The general zoning classification
* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* Street: Type of road access
* Alley: Type of alley access
* LotShape: General shape of property
* LandContour: Flatness of the property
* Utilities: Type of utilities available
* LotConfig: Lot configuration
* LandSlope: Slope of property
* Neighborhood: Physical locations within Ames city limits
* Condition1: Proximity to main road or railroad
* Condition2: Proximity to main road or railroad (if a second is present)
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* YearRemodAdd: Remodel date
* RoofStyle: Type of roof
* RoofMatl: Roof material
* Exterior1st: Exterior covering on house
* Exterior2nd: Exterior covering on house (if more than one material)
* MasVnrType: Masonry veneer type
* MasVnrArea: Masonry veneer area in square feet
* ExterQual: Exterior material quality
* ExterCond: Present condition of the material on the exterior
* Foundation: Type of foundation
* BsmtQual: Height of the basement
* BsmtCond: General condition of the basement
* BsmtExposure: Walkout or garden level basement walls
* BsmtFinType1: Quality of basement finished area
* BsmtFinSF1: Type 1 finished square feet
* BsmtFinType2: Quality of second finished area (if present)
* BsmtFinSF2: Type 2 finished square feet
* BsmtUnfSF: Unfinished square feet of basement area
* TotalBsmtSF: Total square feet of basement area
* Heating: Type of heating
* HeatingQC: Heating quality and condition
* CentralAir: Central air conditioning
* Electrical: Electrical system
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* LowQualFinSF: Low quality finished square feet (all floors)
* GrLivArea: Above grade (ground) living area square feet
* BsmtFullBath: Basement full bathrooms
* BsmtHalfBath: Basement half bathrooms
* FullBath: Full bathrooms above grade
* HalfBath: Half baths above grade
* BedroomAbvGr: Number of bedrooms above basement level
* KitchenAbvGr: Number of kitchens
* KitchenQual: Kitchen quality
* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
* Functional: Home functionality rating
* Fireplaces: Number of fireplaces
* FireplaceQu: Fireplace quality
* GarageType: Garage location
* GarageYrBlt: Year garage was built
* GarageFinish: Interior finish of the garage
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* GarageQual: Garage quality
* GarageCond: Garage condition
* PavedDrive: Paved driveway
* WoodDeckSF: Wood deck area in square feet
* OpenPorchSF: Open porch area in square feet
* EnclosedPorch: Enclosed porch area in square feet
* 3SsnPorch: Three season porch area in square feet
* ScreenPorch: Screen porch area in square feet
* PoolArea: Pool area in square feet
* PoolQC: Pool quality
* Fence: Fence quality
* MiscFeature: Miscellaneous feature not covered in other categories
* MiscVal: Dollar value of miscellaneous feature
* MoSold: Month Sold
* YrSold: Year Sold
* SaleType: Type of sale
* SaleCondition: Condition of sale

## Importing the dataset

In [0]:
if not os.path.exists('Workshop1JupyterNote'):
  ! git clone https://github.com/olibel270/Workshop1JupyterNote.git --quiet

%cd Workshop1JupyterNote
!git checkout master --quiet

from subprocess import check_output

print("Checking whether the dataset was imported properly:")
print(check_output(["ls", "./dataset"]).decode("utf8")) #check the files available in the directory

## Challenges

1- Attributes are numeric and categorical so you have to figure out how to load and handle data.

2- It is a Regression problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.

## Representing Data

NumPy (reading the rows with the csv module)

In [0]:
training_set = open('dataset/train.csv')
csv_training_set = csv.DictReader(training_set)

Pandas

In [0]:
train = pd.read_csv('dataset/train.csv')
test= pd.read_csv('dataset/test.csv')

## Exploring the Dataset

Using Pandas and its documentation, can you find out:

1- Dimensions of the dataset.

2- Peek at the data itself.

3- Statistical summary of all attributes.

4- Breakdown of the data by the class variable.[7]

Pandas Cheat Sheet:
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf

In [0]:
# shape
print(train.shape)

In [0]:
# shape
print(test.shape)

In [0]:
print(train.info())

In [0]:
train.head(5) 

In [0]:
train.tail() 

In [0]:
train.sample(5) 

In [0]:
train.describe() 

In [0]:
train.isnull().sum()

In [0]:
train.groupby('SaleType').count()

In [0]:
train.columns

In [0]:
train[train['SalePrice']>700000]

## Visualizing Data


### Scatter plot

Purpose: a scatter plot helps identify the type of relationship (if any) between two quantitative variables.




In [0]:
columns = ['SalePrice','OverallQual']
# `size` was renamed to `height` in newer seaborn releases
sns.FacetGrid(train[columns], height=5).map(plt.scatter, "OverallQual", "SalePrice")
plt.show()

In [0]:
columns = ['SalePrice','1stFlrSF']
# `size` was renamed to `height` in newer seaborn releases
sns.FacetGrid(train[columns], height=5).map(plt.scatter, "1stFlrSF", "SalePrice")
plt.show()

### Multivariate Plots

Now we can look at the interactions between the variables.

First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.

In [0]:
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
pd.plotting.scatter_matrix(train[columns],figsize=(12,12))
plt.show()

### Heatmap

In [0]:
# Adding total sqfootage feature 
train['TotalSF'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']

In [0]:
plt.figure(figsize=(7,4)) 
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd', 'TotalSF']
sns.heatmap(train[columns].corr(),annot=True)
plt.show()

### Histogram

We can also create a **histogram** of each input variable to get an idea of the distribution.



In [0]:
# histograms
train.hist(figsize=(15,20))
plt.show()


We can also replace the histograms shown on the diagonal of the scatter matrix with kernel density estimates (KDE) by passing `diagonal='kde'` to `scatter_matrix`.

### Scraping Data (TO CLEAN AND COMPLETE)

 Literature survey

  
  Step 4: Data cleaning

If you speak with anyone who has spent some time in data science, they will always say that most of their time is spent on cleaning the data. Real world data is always messy. Here are a few common discrepancies in most data-sets and some techniques of how to clean them:

Missing values: Missing values are values that are blank in the dataset. This can happen for various reasons, such as the value being unknown, unrecorded or confidential. Since the reason for a value being missing is not clear, it is hard to guess the value.

You could try different techniques to impute missing values, starting with simple methods like the column mean or median and moving on to more complex methods like using machine learning models to estimate missing values.

Duplicate records: The challenge with duplicate records is identifying that a record is a duplicate. Duplicate records often occur when merging data from multiple sources. They can also occur due to human error. To identify duplicates, you could round numeric values to a certain number of decimal places; for text values, fuzzy matching could be a good start. Identifying duplicates could help the data engineering team improve data collection to prevent such errors.

Incorrect values: Incorrect values are mostly due to human error. For example, if there is a field called age and the value is 500, it is clearly wrong. Having domain knowledge of the data will help identify such values. A good technique for numerical columns is to manually check values more than 3 standard deviations from the mean.
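
The 3-standard-deviation check can be sketched in a few lines of NumPy. The `ages` array below is hypothetical data built for this example (plausible ages plus one data-entry error), not a column of the housing dataset.

```python
import numpy as np

# Hypothetical "age" column: plausible values plus one data-entry error.
ages = np.array(list(range(20, 70)) + [500], dtype=float)

# Distance of each value from the mean, in standard deviations.
z_scores = (ages - ages.mean()) / ages.std()

# Flag values for manual review rather than deleting them automatically.
suspicious = ages[np.abs(z_scores) > 3]
print("values to review:", suspicious)
```

On very small samples a single extreme value inflates the standard deviation so much that it may not cross the threshold, so the check works best on columns with many rows.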

* Integerizing
Label Encoding vs One Hot
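
To make the difference concrete, here is a small pandas sketch on a toy column. The values are made up for illustration, and `cat.codes` is used as one simple way to label-encode.

```python
import pandas as pd

# Toy categorical column (values are made up for illustration).
df = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave"]})

# Label encoding: one integer per category, kept in a single column.
label_encoded = df["Street"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["Street"], prefix="Street")

print(label_encoded.tolist())
print(one_hot.columns.tolist())
```

Label encoding keeps the data compact but implies an order between categories, which suits ordinal features. One-hot encoding avoids that implied order at the cost of one extra column per category.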

#### 6-1-2 Select numerical features and categorical features

In [0]:
numeric_features=train.select_dtypes(include=[np.number])
# np.object was removed from recent NumPy releases; the string 'object' selects the same columns
categorical_features=train.select_dtypes(include=['object'])

#### 6-1-3 Target Value Analysis
As you know, **SalePrice** is the target value we want to predict, so let's take a look at it.

In [0]:

train['SalePrice'].describe()

Flexibly plot a univariate distribution of observations.



In [0]:
sns.distplot(train['SalePrice']);


In [0]:
#skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())

#### 6-3 Data Preprocessing
**Data preprocessing** refers to the transformations applied to our data before feeding it to the algorithm.
 
Data preprocessing converts raw data into a clean dataset. Data gathered from different sources arrives in a raw format that is not suitable for analysis.
There are many data preprocessing steps; here are some of them:
* Removing the Id column
* Sampling (without replacement)
* Making part of the dataset unbalanced and balancing it (with undersampling and SMOTE)
* Introducing missing values and treating them (replacing them with average values)
* Noise filtering
* Data discretization
* Normalization and standardization
* PCA analysis
* Feature selection (filter, embedded, wrapper)

#### 6-3-1 removing ID

In [0]:
# Save Id and drop it
train_ID=train['Id']
test_ID=test['Id']
train.drop('Id',axis=1,inplace=True)
test.drop('Id',axis=1,inplace=True)

## 6-3-4 Feature selection
Let's first concatenate the train and test data in the same DataFrame.

In [0]:
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))

#### 6-4 Data Cleaning
When dealing with real-world data, dirty data is the norm rather than the exception. We continuously need to predict correct values, impute missing ones, and find links between various data artefacts such as schemas and records. We need to stop treating data cleaning as a piecemeal exercise (resolving different types of errors in isolation), and instead leverage all signals and resources (such as constraints, available statistics, and dictionaries) to accurately predict corrective actions.

The primary goal of data cleaning is to detect and remove errors and anomalies to increase the value of data in analytics and decision making. While it has been the focus of many researchers for several years, individual problems have been addressed separately. These include missing value imputation, outliers detection, transformations, integrity constraints violations detection and repair, consistent query answering, deduplication, and many other related problems such as profiling and constraints mining.[8]

In [0]:
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data

In [0]:
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)

#### 6-4-1 Imputing missing values

We impute them by proceeding sequentially through the features with missing values.

PoolQC : the data description says NA means "No Pool". That makes sense, given the huge ratio of missing values (over 99%) and that the majority of houses have no pool at all.

In [0]:
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")

MiscFeature : data description says NA means "no misc feature"

In [0]:
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")

Alley : data description says NA means "no alley access"

In [0]:
all_data["Alley"] = all_data["Alley"].fillna("None")

Fence : data description says NA means "no fence"

In [0]:
all_data["Fence"] = all_data["Fence"].fillna("None")

LotFrontage : Since the street frontage of a house is most likely similar to that of other houses in its neighborhood, we can fill in missing values with the median LotFrontage of the neighborhood.

In [0]:
#Group by neighborhood and fill in missing value by the median LotFrontage of all the neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

GarageType, GarageFinish, GarageQual and GarageCond : Replacing missing data with None

In [0]:
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')

GarageYrBlt, GarageArea and GarageCars : Replacing missing data with 0 (Since No garage = no cars in such garage.)

In [0]:
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)

BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath and BsmtHalfBath : missing values are likely zero for having no basement

In [0]:
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)

BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2 : For all these categorical basement-related features, NaN means that there is no basement.

In [0]:
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')

MasVnrArea and MasVnrType : NA most likely means no masonry veneer for these houses. We can fill 0 for the area and None for the type. 

In [0]:
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)

MSZoning (The general zoning classification) : 'RL' is by far the most common value. So we can fill in missing values with 'RL'

In [0]:
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])

Functional : data description says NA means typical

In [0]:
all_data["Functional"] = all_data["Functional"].fillna("Typ")

Electrical : It has one NA value. Since this feature has mostly 'SBrkr', we can set that for the missing value.

In [0]:
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])

KitchenQual: Only one NA value, and same as Electrical, we set 'TA' (which is the most frequent) for the missing value in KitchenQual.

In [0]:
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])

Exterior1st and Exterior2nd : Again, both Exterior1st and Exterior2nd have only one missing value. We will just substitute in the most common string.

In [0]:

all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])

SaleType : Fill in again with most frequent which is "WD"

In [0]:
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])

MSSubClass : NA most likely means no building class. We can replace missing values with None.

In [0]:
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")

FireplaceQu : data description says NA means "no fireplace"

In [0]:
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")

Are there any remaining missing values?

In [0]:
#Check remaining missing values if any 
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head()

#### 6-4-2 More feature engineering

Transforming some numerical variables that are really categorical

In [0]:
#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)


#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)


#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

Label Encoding some categorical variables that may contain information in their ordering set

In [0]:
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(all_data[c].values)) 
    all_data[c] = lbl.transform(list(all_data[c].values))

# shape        
print('Shape all_data: {}'.format(all_data.shape))
print(all_data)
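
One caveat worth illustrating: `LabelEncoder` numbers the classes in sorted (alphabetical) order, which does not necessarily match the quality order of a rating scale. The sketch below contrasts it with an explicit mapping. The dictionary `quality_order` is a hypothetical helper that assumes the scale Po < Fa < TA < Gd < Ex, with "None" as the lowest level.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

quality = pd.Series(["TA", "Gd", "Ex", "Fa", "Po", "None"])

# LabelEncoder numbers the classes in sorted (alphabetical) order.
alphabetical = LabelEncoder().fit(quality)
print(list(alphabetical.classes_))

# Explicit mapping that follows the quality scale from worst to best
# (assumed order, used here only for illustration).
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
ordinal = quality.map(quality_order)
print(ordinal.tolist())
```

Tree-based models are largely insensitive to this difference, but for linear models an encoding that follows the real order may be easier to fit.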

Adding one more important feature

Since area related features are very important to determine house prices, we add one more feature which is the total area of basement, first and second floor areas of each house
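
No cell in this section computes that feature on `all_data`, so here is a minimal sketch of how it could be done. It assumes the same three columns that were summed earlier for `train`; the helper name `add_total_sf` and the toy rows are hypothetical.

```python
import pandas as pd

def add_total_sf(df):
    """Return a copy of df with TotalSF = basement + first floor + second floor."""
    out = df.copy()
    out["TotalSF"] = out["TotalBsmtSF"] + out["1stFlrSF"] + out["2ndFlrSF"]
    return out

# Toy rows with made-up areas, standing in for all_data.
toy = pd.DataFrame(
    {"TotalBsmtSF": [800, 0], "1stFlrSF": [900, 1200], "2ndFlrSF": [700, 0]}
)
print(add_total_sf(toy)["TotalSF"].tolist())
```

In the notebook this would be applied as `all_data = add_total_sf(all_data)`, after the basement columns have been imputed so that the sum contains no missing values.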

#### 6-5 Skewed features

In [0]:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)

Box Cox Transformation of (highly) skewed features

We use the scipy function boxcox1p, which computes the Box-Cox transformation of $1 + x$.
Note that setting $\lambda = 0$ is equivalent to the log1p transform used for the target variable.
See this page for more details on the Box Cox Transformation, as well as the scipy function's page.
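
The relationship between `boxcox1p` and `log1p` can be checked numerically. The sketch below uses a small made-up array and compares the scipy function against the Box-Cox formula written out by hand.

```python
import numpy as np
from scipy.special import boxcox1p

x = np.array([0.0, 1.0, 10.0, 100.0])

# lambda = 0 reduces to the natural log of 1 + x.
matches_log1p = np.allclose(boxcox1p(x, 0.0), np.log1p(x))
print(matches_log1p)

# For lambda != 0 the transform is ((1 + x)**lambda - 1) / lambda.
lam = 0.15
manual = ((1.0 + x) ** lam - 1.0) / lam
matches_formula = np.allclose(boxcox1p(x, lam), manual)
print(matches_formula)
```

Smaller values of lambda compress large inputs more strongly, which is why the transform reduces right skew.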

In [0]:
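# Filter rows on the 'Skew' column; masking the whole DataFrame would keep every row (as NaN)
skewness = skewness[abs(skewness['Skew']) > 0.75]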
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)
    
#all_data[skewed_features] = np.log1p(all_data[skewed_features])

Getting dummy categorical features

In [0]:
all_data = pd.get_dummies(all_data)
print(all_data.shape)

Getting the new train and test sets.

In [0]:
train = all_data[:ntrain]
test = all_data[ntrain:]

# Linear Regression : One Variable

## Picking the target variable


In [0]:
total_sf = []
sale_price = []

# Read the CSV row by row; the with-block closes the file when done
with open('dataset/train.csv', newline='') as training_set:
  csv_training_set = csv.DictReader(training_set)
  for row in csv_training_set:
    # Total area = basement + first floor + second floor
    temp_total_sf = 0
    temp_total_sf += float(row['TotalBsmtSF'])
    temp_total_sf += float(row['1stFlrSF'])
    temp_total_sf += float(row['2ndFlrSF'])
    total_sf.append(temp_total_sf)
    sale_price.append(row['SalePrice'])

total_sf = np.array(total_sf, dtype='float64')
sale_price = np.array(sale_price, dtype='float64')
print('total square footage array:', total_sf)
print('sale price array:', sale_price)

In [0]:
fig, ax = plt.subplots()
ax.scatter(x = total_sf, y = sale_price)
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('Total SF', fontsize=13)
plt.show()

## 6-3-2 Noise filtering

In [0]:
train = pd.read_csv('dataset/train.csv')
test= pd.read_csv('dataset/test.csv')

In [0]:
# Adding total sqfootage feature 
train['TotalSF'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']

In [0]:

#Deleting outliers
train = train.drop(train[(train['TotalSF']>7000) & (train['SalePrice']<300000)].index)

#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['TotalSF'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('TotalSF', fontsize=13)
plt.show()

## The Cost Function

In [0]:
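def computeCost(X,theta,y):
    # X: (n_features + 1, m) matrix with one column per training example
    # theta: (n_features + 1,) parameter vector, y: (m,) target values
    predicted = np.dot(theta, X)
    # half the mean squared error between predictions and targets
    cost = np.sum(np.square(predicted - y))
    cost /= (2 * len(y))
    return cost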
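
Written out, `computeCost` evaluates the squared-error cost below. The symbols are notation introduced here rather than names from the code: $m$ is the number of training examples, $x^{(i)}$ is the $i$-th column of `X` (including the leading 1), $y^{(i)}$ is the matching sale price and $\theta$ is the parameter vector.

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^{\top} x^{(i)} - y^{(i)} \right)^{2}
```

The factor $\frac{1}{2}$ does not change where the minimum is; it only simplifies the derivative used by gradient descent.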


## Gradient Descent



### The Algorithm

In [0]:
def gradient_descent(X, y, theta, alpha, num_iters):
    # Performs gradient descent to learn theta
    # updates theta by taking 'num_iters' gradient steps with learning rate 'alpha'

    # initialize some useful values
    m = len(y)  # number of training examples
    cost_history = np.zeros(num_iters)

    for i in range(num_iters):
        # gradient of the cost: sum over examples of (prediction error * feature value)
        gradient = np.sum((np.dot(theta, X) - y)*X, axis=1)

        theta = theta - (alpha / m) * gradient

        cost_history[i] = computeCost(X, theta, y)
        
    print('cost_history:', '\n', cost_history)
    # change in cost between consecutive iterations
    step = np.diff(cost_history)
    print(step)
    print('final theta:', theta)
    
    # plot the size of each change to see whether the cost is levelling off
    iteration_idx = np.arange(num_iters - 1)
    step_sizes = np.abs(step)
    print(iteration_idx)
    print('steps:', step_sizes)
    plt.plot(iteration_idx, step_sizes)
    plt.show()
    
    return theta

### Implementation

In [0]:
#Gradient Descent
#Some gradient descent settings
iterations = 10
alpha = 0.0000002

theta = np.zeros(2)
print('theta array:', theta)


# initial cost
# first row of X = ones
num_training_ex = len(sale_price)
print(num_training_ex)
X = np.vstack((np.ones(num_training_ex),total_sf))
print('design matrix X (row of ones, then total square footage):\n', X)

initial_cost = computeCost(X, theta, sale_price)
print('initial cost:', initial_cost)


# run gradient descent
theta = gradient_descent(X, sale_price, theta, alpha, iterations)
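
With few iterations and a very small learning rate, gradient descent may stop far from the minimum. As a reference point, the least-squares solution can be computed directly. The sketch below is a self-contained example on made-up data that uses the same layout as the notebook (a row of ones followed by one row per feature); it is not the notebook's own result.

```python
import numpy as np

# Toy data in the notebook's layout: X has one row of ones and one row per feature.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x                      # exact line: intercept 1, slope 2
X = np.vstack((np.ones(len(x)), x))    # shape (2, m)

# lstsq expects one row per example, hence the transpose.
theta_exact, *_ = np.linalg.lstsq(X.T, y, rcond=None)
print("least-squares theta:", theta_exact)
```

Comparing the `theta` returned by gradient descent with such a reference shows how far the iterations still are from convergence.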


### Plotting Performance

In [0]:
from sklearn import linear_model
import matplotlib.pyplot as plt

model = linear_model.LinearRegression()

train = train[train['LotArea'] < 30000]
test = test[test['LotArea'] < 30000]

X_train = train['LotArea']
X_train = X_train.values.reshape(-1, 1)

Y_train = train['SalePrice']
Y_train = Y_train.values.reshape(-1, 1)

X_test = test['LotArea']
X_test = X_test.values.reshape(-1, 1)

model.fit(X_train, Y_train)
pred = model.predict(X_test)

plt.scatter(X_train, Y_train, color='red')
plt.plot(X_test, pred, color='blue')

## Normalizing the sale price
SalePrice is the variable we need to predict. So let's do some analysis on this variable first.

In [0]:
sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
# raw string so that \m and \s are not treated as escape sequences
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

The target variable is right skewed. Since (linear) models tend to work better with normally distributed data, we transform this variable to make it closer to a normal distribution.


Log-transformation of the target variable

In [0]:
#We use the numpy function log1p which applies log(1+x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])

#Check the new distribution 
sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
# raw string so that \m and \s are not treated as escape sequences
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
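
One consequence of this transformation is that a model trained on the transformed column predicts log prices, so predictions need the inverse transform before they can be read as dollar amounts. A minimal sketch with made-up prices:

```python
import numpy as np

prices = np.array([100000.0, 250000.0, 755000.0])  # made-up sale prices

log_prices = np.log1p(prices)      # the scale the model is trained on
recovered = np.expm1(log_prices)   # inverse transform: exp(x) - 1

round_trip_ok = np.allclose(recovered, prices)
print(round_trip_ok)
```

`np.expm1` is the exact inverse of `np.log1p`, so applying it to a model's predictions returns them to the original price scale.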

# Linear Regression : Multiple Variables

In [0]:
feature_list = [
  'OverallQual',
  'YearBuilt',
  'TotalBsmtSF',
  '1stFlrSF',
  '2ndFlrSF',
  'FullBath'
]

nb_features = len(feature_list)
feature_rows = []
sale_price = []

# Read the CSV row by row; the with-block closes the file when done
with open('dataset/train.csv', newline='') as training_set:
    csv_training_set = csv.DictReader(training_set)
    for row in csv_training_set:
        feature_column = []
        for feature in feature_list:
            #if row[feature] == 'NA':
            #    row[feature] = 0
            feature_column.append(row[feature])
        feature_rows.append(feature_column)
        sale_price.append(row['SalePrice'])

# transpose so that each row is a feature and each column a training example
feature_matrix = np.array(feature_rows, dtype='float64').T
num_training_ex = len(sale_price)

sale_price = np.array(sale_price, dtype='float64')
print('Features Matrix:', '\n', feature_matrix)
print('Sale Price Array:', '\n', sale_price)

In [0]:
print('Feature Scaling!')

# Mean-normalize each feature (each row of feature_matrix): (x - mean) / (max - min)
for i, row in enumerate(feature_matrix):
    spread = np.max(row) - np.min(row)
    feature_matrix[i] = (row - np.mean(row)) / spread

#print('Feature Matrix After Scaling:', '\n', feature_matrix)

# Optional: scale the target the same way (left disabled);
# the mean must be taken over sale_price, not over a feature row
# spread = np.max(sale_price) - np.min(sale_price)
# sale_price = (sale_price - np.mean(sale_price)) / spread
print(sale_price)




In [0]:
#Gradient Descent
#Some gradient descent settings
iterations = 5000
alpha = 1.5

theta = np.zeros(nb_features+1)
print('theta array:', theta)


# initial cost
# first row of X = ones
num_training_ex = len(sale_price) 
X = np.vstack((np.ones(num_training_ex),feature_matrix))
print('design matrix X (row of ones, then one row per feature):\n', X)

current_cost = computeCost(X, theta, sale_price)
print('initial cost:', current_cost)


# run gradient descent
theta = gradient_descent(X, sale_price, theta, alpha, iterations)
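
`computeCost` and `gradient_descent` are called above but not defined in this cell. As a reference point, here is a minimal sketch of what such helpers could look like, assuming the layout used above (X has one row of ones plus one row per feature, and one column per training example) and the usual squared-error cost J = 1/(2m) * sum((theta^T x - y)^2). The names `compute_cost_sketch` and `gradient_descent_sketch`, the toy data and the step size are illustrative assumptions, not the workshop's implementation.

```python
import numpy as np

def compute_cost_sketch(X, theta, y):
    """Squared-error cost J = 1/(2m) * sum((theta^T x - y)^2); X is (n+1, m)."""
    m = y.size
    residual = theta @ X - y
    return residual @ residual / (2 * m)

def gradient_descent_sketch(X, y, theta, alpha, iterations):
    """Batch gradient descent on the cost above."""
    m = y.size
    for _ in range(iterations):
        residual = theta @ X - y      # shape (m,)
        gradient = X @ residual / m   # shape (n+1,)
        theta = theta - alpha * gradient
    return theta

# Toy data: y = 3 + 2 * x, with the single feature already in a small range.
x_toy = np.linspace(-0.5, 0.5, 20)
y_toy = 3.0 + 2.0 * x_toy
X_toy = np.vstack((np.ones(x_toy.size), x_toy))

theta_toy = gradient_descent_sketch(X_toy, y_toy, np.zeros(2),
                                    alpha=0.5, iterations=2000)
print(theta_toy)
print(compute_cost_sketch(X_toy, theta_toy, y_toy))
```

If the cost grows instead of shrinking on real data, the step size `alpha` is probably too large for the scale of the features.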

In [0]:
features = ['LotArea', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd']

X_train = train[features]
X_test = test[features]

Y_train = train['SalePrice']
Y_train = Y_train.values.reshape(-1, 1)

model.fit(X_train, Y_train)
pred = model.predict(X_test)

print(pred)

# Polynomial Regression



Overfitting

# Logistic Regression

# More Regression Algorithms (Extra Material)

In [0]:
!pip install lightgbm
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

### 7-1 Defining a cross validation strategy

We use the `cross_val_score` function of **Sklearn**. However, this function has no shuffle option, so we add one line of code to shuffle the dataset prior to cross-validation.

In [0]:

# Validation function
n_folds = 5

def rmsle_cv(model):
    # Pass the KFold object itself as cv; get_n_splits() only returns the
    # number of folds, which would discard the shuffle setting.
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv=kf))
    return rmse
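
Note that `KFold(...).get_n_splits(...)` returns the number of folds as a plain integer, whereas passing the `KFold` object itself as `cv` keeps the shuffle settings. The sketch below illustrates the difference; the synthetic data and the `LinearRegression` model are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# get_n_splits returns an int, not a splitter.
n_splits = kf.get_n_splits(X_demo)
print(type(n_splits), n_splits)

# Passing the KFold object keeps shuffle=True and random_state=42.
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=kf)
rmse_demo = np.sqrt(-neg_mse)
print(rmse_demo.mean(), rmse_demo.std())
```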

### 7-2 Models

#### LASSO Regression 
In statistics and machine learning, lasso (least absolute shrinkage and selection operator)  is a **regression analysis** method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces.  Lasso was originally formulated for least squares models and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression and best subset selection and the connections between lasso coefficient estimates and so-called soft thresholding. It also reveals that (like standard linear regression) the coefficient estimates need not be unique if covariates are collinear.

This model may be very sensitive to outliers, so we need to make it more robust to them. For that we use sklearn's `RobustScaler()` in a pipeline.

In [0]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
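
To see what `RobustScaler` does, here is a small sketch on a made-up single feature with one extreme outlier. With its default settings the scaler centers on the median and divides by the interquartile range, so the outlier barely affects how the other values are scaled (the outlier itself is still present after scaling).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up single feature with one extreme outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

scaler = RobustScaler()
scaled_values = scaler.fit_transform(values)

print(scaler.center_)         # median of the column
print(scaler.scale_)          # interquartile range of the column
print(scaled_values.ravel())
```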

In [0]:
from scipy.stats import skew

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)

#### Elastic Net Regression 
The elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods.
As with the lasso, the pipeline uses `RobustScaler()` to make it robust to outliers.

In [0]:
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
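
For reference, the objective minimized by scikit-learn's `ElasticNet` can be written as follows, where $\alpha$ is `alpha`, $\rho$ is `l1_ratio`, $n$ is the number of samples and $w$ is the coefficient vector. With `l1_ratio=.9` the penalty is dominated by the L1 term, so the model behaves much like the lasso.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 \;+\; \alpha \rho \lVert w \rVert_1 \;+\; \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$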

#### Kernel Ridge Regression 
Kernel ridge regression (KRR)  combines Ridge Regression (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space.

In [0]:
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
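
The kernel trick can be made concrete with a short sketch: in dual form, kernel ridge regression solves `(K + alpha * I) a = y` and predicts with `K(X_new, X) @ a`. The example below compares that manual solution with `KernelRidge` on synthetic data. The data, the helper name `poly_kernel` and the explicit `gamma=1.0` are assumptions made for the illustration (the cell above leaves `gamma` at its default).

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(30, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.05, size=30)
X_new = rng.normal(size=(5, 3))

lam, degree, coef0, gamma = 0.6, 2, 2.5, 1.0

def poly_kernel(A, B):
    """Polynomial kernel (gamma * <a, b> + coef0) ** degree."""
    return (gamma * (A @ B.T) + coef0) ** degree

# Dual solution: (K + lam * I) a = y, then predict with K(X_new, X) @ a.
K = poly_kernel(X_demo, X_demo)
dual_coef = np.linalg.solve(K + lam * np.eye(len(X_demo)), y_demo)
manual_pred = poly_kernel(X_new, X_demo) @ dual_coef

krr_demo = KernelRidge(alpha=lam, kernel='polynomial', degree=degree,
                       coef0=coef0, gamma=gamma)
sklearn_pred = krr_demo.fit(X_demo, y_demo).predict(X_new)

print(np.allclose(manual_pred, sklearn_pred))
```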

#### Gradient Boosting Regression
We use the Huber loss, which makes the model more robust to outliers.

In [0]:
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)
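
The Huber loss is quadratic for small residuals and linear for large ones, which is why a few extreme sale prices pull the fit less than they would under squared error. A minimal NumPy sketch follows; the threshold `delta=1.0` and the residuals are made-up values for illustration.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * abs_r ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([0.5, 3.0, -10.0])
huber_values = huber_loss(residuals)
squared_values = 0.5 * residuals ** 2

print(huber_values)    # grows linearly for the large residuals
print(squared_values)  # grows quadratically
```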

#### XGBoost

In [0]:
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, verbosity=0,
                             random_state=7, n_jobs=-1)

#### LightGBM

In [0]:
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

### Base models scores

Let's see how these base models perform on the data by evaluating the cross-validated RMSLE.

In [0]:
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [0]:
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [0]:
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [0]:
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [0]:
score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [0]:
score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))

### Stacking models

#### Simplest stacking approach: averaging base models
We begin with this simple approach of averaging base models. We build a new class to extend scikit-learn with our model and also to leverage encapsulation and code reuse (inheritance).

**Averaged base models class**

In [0]:

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1) 
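
The core of the `predict` method is a column stack followed by a row-wise mean. The sketch below shows that step in isolation; the synthetic data and the two models (`Ridge` and `Lasso` with arbitrary settings) are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
X_new = rng.normal(size=(5, 3))

fitted_models = [Ridge(alpha=1.0).fit(X_demo, y_demo),
                 Lasso(alpha=0.01).fit(X_demo, y_demo)]

# One column per model, one row per sample.
predictions = np.column_stack([m.predict(X_new) for m in fitted_models])

# Averaging along axis=1 gives one blended prediction per sample.
averaged = predictions.mean(axis=1)
print(predictions.shape, averaged.shape)
```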


**Averaged base models score**

We average four models here: ENet, GBoost, KRR and lasso. Of course, we could easily add more models to the mix.

In [0]:
averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))

score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

In [0]:
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
    # Predict with all base models on the test data and use the averaged predictions as
    # meta-features for the final prediction, which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)
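
The key idea in `fit` is the out-of-fold prediction matrix: each row is predicted by a model that did not see that row during training, so the meta-model learns from honest predictions. A minimal sketch of that step on its own, with synthetic data and two arbitrary base models (`Ridge` and `Lasso`) as assumptions:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo @ np.array([1.0, 0.0, -1.5, 2.0]) + rng.normal(scale=0.1, size=60)

base_models_demo = [Ridge(alpha=1.0), Lasso(alpha=0.01)]
kfold = KFold(n_splits=5, shuffle=True, random_state=156)

# One column per base model; NaN marks entries that have not been filled yet.
oof = np.full((X_demo.shape[0], len(base_models_demo)), np.nan)
for col, model in enumerate(base_models_demo):
    for train_idx, holdout_idx in kfold.split(X_demo):
        fitted = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
        oof[holdout_idx, col] = fitted.predict(X_demo[holdout_idx])

# The meta-model is trained on predictions for rows the base models did not see.
meta_model_demo = Ridge(alpha=1.0).fit(oof, y_demo)
print(np.isnan(oof).any())
print(meta_model_demo.coef_)
```

Indexing with `X[train_index]` selects rows only for NumPy arrays, which is why `train.values` rather than a DataFrame is passed to the stacked model.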

In [0]:
stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
                                                 meta_model = lasso)

score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))

In [0]:
def rmsle(y, y_pred):
    # y and y_pred are expected on the log1p scale, so this RMSE is the RMSLE of the raw prices
    return np.sqrt(mean_squared_error(y, y_pred))
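
The helper computes a plain RMSE, which equals the RMSLE of the raw prices only because the target is on the `log1p` scale (predictions are mapped back with `np.expm1` further down). A quick check with made-up prices:

```python
import numpy as np

prices = np.array([100000.0, 200000.0, 300000.0])     # made-up true prices
predicted = np.array([110000.0, 190000.0, 330000.0])  # made-up predictions

# RMSLE written out directly on the raw prices.
rmsle_direct = np.sqrt(np.mean((np.log1p(predicted) - np.log1p(prices)) ** 2))

# RMSE on values that were log1p-transformed beforehand.
log_true, log_pred = np.log1p(prices), np.log1p(predicted)
rmse_on_logs = np.sqrt(np.mean((log_true - log_pred) ** 2))

print(rmsle_direct, rmse_on_logs)
```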

In [0]:
#StackedRegressor
#Final Training and Prediction
stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
stacked_pred = np.expm1(stacked_averaged_models.predict(test.values))
print(rmsle(y_train, stacked_train_pred))

In [0]:
#XGBoost
model_xgb.fit(train, y_train)
xgb_train_pred = model_xgb.predict(train)
xgb_pred = np.expm1(model_xgb.predict(test))
print(rmsle(y_train, xgb_train_pred))

In [0]:
#lightGBM
model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)
lgb_pred = np.expm1(model_lgb.predict(test.values))
print(rmsle(y_train, lgb_train_pred))

In [0]:
# RMSE on the entire training data when averaging

print('RMSLE score on train data:')
print(rmsle(y_train,stacked_train_pred*0.70 +
               xgb_train_pred*0.15 + lgb_train_pred*0.15 ))

### Ensemble prediction

In [0]:
ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15

In [0]:
sub = pd.DataFrame()
sub['Id'] = test_ID
sub['SalePrice'] = ensemble
sub.to_csv('submission.csv',index=False)


### Box Cox Transformation of (highly) skewed features

We use the scipy function `boxcox1p`, which computes the Box-Cox transformation of $1 + x$.
Note that setting $\lambda = 0$ is equivalent to `log1p` used above for the target variable.
See this page for more details on Box Cox Transformation as well as the scipy function's page

In [0]:
# Filter on the 'Skew' column; masking the whole DataFrame keeps every row (as NaN)
skewness = skewness[abs(skewness['Skew']) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)
    
#all_data[skewed_features] = np.log1p(all_data[skewed_features])
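
As a sanity check on the transform, the sketch below compares `boxcox1p` with its formula, `((1 + x)**lam - 1) / lam` for a non-zero `lam`, and with `log1p` for `lam = 0`. The input values are made up for illustration.

```python
import numpy as np
from scipy.special import boxcox1p

x_demo = np.array([0.0, 1.0, 10.0, 100.0])
lam_demo = 0.15

# For lam != 0: boxcox1p(x, lam) = ((1 + x)**lam - 1) / lam
manual = ((1.0 + x_demo) ** lam_demo - 1.0) / lam_demo
matches_formula = np.allclose(boxcox1p(x_demo, lam_demo), manual)

# For lam == 0 the transform reduces to log(1 + x).
matches_log1p = np.allclose(boxcox1p(x_demo, 0), np.log1p(x_demo))

print(matches_formula, matches_log1p)
```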

-----------------
## Conclusion

## 9- References
* [1] [https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-for-house-prices](https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-for-house-prices)
* [2] [https://skymind.ai/wiki/machine-learning-workflow](https://skymind.ai/wiki/machine-learning-workflow)
* [3] [Problem-define](https://machinelearningmastery.com/machine-learning-in-python-step-by-step/)
* [4] [Sklearn](http://scikit-learn.org/)
* [5] [machine-learning-in-python-step-by-step](https://machinelearningmastery.com/machine-learning-in-python-step-by-step/)
* [6] [Data Cleaning](http://wp.sigmod.org/?p=2288)
* [7] [kaggle kernel](https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard)

