# Supervised Learning - Putting it all together


I want to know the outcome

I have historical data with outcomes

Algorithms are trained with historical data (find the line(s) of best fit)

Algorithms predict using the Test data and we then compare the predictions with the actual outcomes


### Get The Data

Regression Test Set (Housing Data Set)

https://www.kaggle.com/datasets/vedavyasv/usa-housing

Classification Test Set (Titanic Data Set)

https://www.kaggle.com/datasets/brendan45774/test-file


### Algorithms

The following algorithms can be used for Regression and Classification

LinearRegression / LogisticRegression
- Data is assumed to have a Linear Relationship

K Nearest Neighbor
- Compares a new observation to the k closest training observations and predicts from their outcomes (average for Regression, majority vote for Classification)

Decision Tree and Random Forest
- Creates a tree structure from the data and predicts based on the paths created

Support Vector
- Creates hyperplane(s) to separate the data
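
As a rough illustration of the nearest-neighbour idea, the sketch below predicts a value by averaging the outcomes of the k closest training points. The toy arrays and the `knn_predict` helper are hypothetical and only show the mechanics; the notebook itself relies on scikit-learn's `KNeighborsRegressor`.

```python
import numpy as np

# Hypothetical training data: one attribute, one numeric outcome
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0])

def knn_predict(x_new, k):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Regression: average the outcomes of those neighbours
    return y_train[nearest].mean()

prediction = knn_predict(np.array([2.2]), k=2)
print(prediction)
```

Here the two closest points to 2.2 are 2.0 and 3.0, so the prediction is the average of their outcomes. A classifier would take a majority vote instead of an average.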

### Regression

Find a Quantitative Numerical Outcome

###### Evaluation

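Mean Absolute Error (MAE)
- Average absolute difference between the predictions and the actual test results

Mean Squared Error (MSE)
- Average of the squared differences between the predictions and the actual test results; squaring penalizes large errors (outliers) more heavily

Root Mean Squared Error (RMSE)
- Square root of the MSE, which brings the error back to the units of the Target

Score
- R² (coefficient of determination) for scikit-learn regressors: how much of the variation in the Target is explained by the Attributes (1.0 is a perfect fit)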
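
The four regression metrics can be computed by hand to see how they relate. This is a minimal sketch on made-up numbers (the `y_true` and `y_pred` arrays are assumptions for illustration), using only NumPy.

```python
import numpy as np

# Hypothetical actual outcomes and model predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # back in the units of the target

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('MAE', mae)
print('MSE', mse)
print('RMSE', rmse)
print('R2', r2)
```

The single error of 2 contributes 4 to the squared sum, which is why the MSE reacts more strongly to large misses than the MAE does.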

### Classification

Find a Categorical Outcome

###### Evaluation

Confusion Matrix
- Table of True Positive, False Positive, True Negative and False Negative counts, comparing the predictions with the test outcomes

Classification Report
- Overall Accuracy of the model
- Precision, Recall and F1 Score for each class (Positive and Negative)
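
To make the link between the confusion matrix and the classification report concrete, this sketch recomputes precision, recall, F1 and accuracy from the 2x2 matrix `[[68 11] [38 9]]` that appears in the classification output further down in this notebook. It assumes scikit-learn's layout, where rows are the actual classes and columns are the predicted classes.

```python
import numpy as np

# Confusion matrix copied from the notebook output (rows: actual, columns: predicted)
cm = np.array([[68, 11],
               [38,  9]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

precision = float(tp / (tp + fp))   # of the predicted positives, how many were right
recall = float(tp / (tp + fn))      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = float((tp + tn) / cm.sum())

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

Rounded to two decimals, these match the row for class `1` (0.45, 0.19, 0.27) and the accuracy (0.61) in the report printed alongside that matrix.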


### Preparing Data

Split the DataFrame into 2 major groups
- X -> Attributes that we want to predict from
- y -> Outcome

###### Train Test Split

Split the 2 groups (X and y) into 4 groups

X_train
- Attributes that we want to use to train our model on

y_train
- Outcomes that we want to use to train our model on

X_test
- Attributes held back from training, used to generate predictions

y_test
- Outcomes that we want to use to compare against the predictions
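
A minimal sketch of both splits on a toy DataFrame follows. The column names and values are made up for illustration; only `train_test_split` and its `test_size` and `random_state` arguments come from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two attribute columns and one outcome column
toy = pd.DataFrame({
    'rooms': [3, 4, 2, 5, 6, 3, 4, 5, 2, 6],
    'age':   [10, 5, 30, 2, 1, 15, 8, 3, 40, 4],
    'price': [200, 300, 120, 400, 480, 210, 290, 390, 100, 470],
})

# First split: attributes (X) and outcome (y)
X = toy[['rooms', 'age']]
y = toy['price']

# Second split: 70% of the rows for training, 30% held back for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

The rows are shuffled before splitting, and `X_train` and `y_train` keep the same index so attributes and outcomes stay aligned. Fixing `random_state` makes the split repeatable.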


### Training

Instantiate the model
- `model = Algorithm(parameters if any)`

Fit the data
- `model.fit(X_train, y_train)`

Predict
- `predictions = model.predict(X_test)`
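
The three steps above can be sketched with a concrete algorithm. The data here is hypothetical and follows `y = 2x + 1` exactly, so the fitted line should reproduce it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data lying exactly on y = 2x + 1
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])
X_test = np.array([[4.0], [5.0]])

# Instantiate the model
model = LinearRegression()

# Fit the data
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print(predictions)
```

Every scikit-learn estimator used in this notebook follows the same instantiate, fit, predict pattern, which is what makes it possible to loop over several algorithms later on.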


### Tuning

Not required but recommended

###### Standard Scaler

Method to scale all the attributes in the dataframe (not the target) to a common scale: each attribute is shifted to a mean of 0 and scaled to a standard deviation of 1
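
A short sketch of what the scaling does, on made-up attributes with very different ranges. The arrays are assumptions for illustration; `StandardScaler` is the class imported later in the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical attributes on very different scales
X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training data only, then scale it
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics on the test data (no refitting)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The scaler is fitted on the training data only so that no information from the test set leaks into the model.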

###### Pipeline

Chain the preprocessing steps (such as scaling) and the algorithm into a single object, so the same transformations are applied every time the model is fit or used to predict
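
As a minimal sketch, a two-step pipeline on hypothetical data looks like this. The step names `'sc'` and `'reg'` mirror the ones used later in the notebook; a parameter of a step is addressed as `<step name>__<parameter>`.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data: two attributes on different scales, one outcome
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0],
              [4.0, 4000.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

# Each step is a (name, estimator) pair; the last step is the model
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('reg', KNeighborsRegressor())
])

# Address a step's parameter as '<step name>__<parameter>'
pipe.set_params(reg__n_neighbors=1)

# fit() scales X and then trains the regressor; predict() applies the same scaling first
pipe.fit(X, y)
prediction = pipe.predict(np.array([[2.0, 2000.0]]))
print(prediction)
```

The double-underscore naming is the reason the parameter grids in this notebook use keys such as `reg__n_neighbors` and `cls__max_depth`.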

###### GridSearchCV

Method to iterate through combinations of tuning parameters for algorithms, scoring each combination with cross-validation and keeping the best one
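
A small sketch of a grid search over one parameter follows. The synthetic data and the candidate values `[1, 3, 5]` are assumptions for illustration, and no particular winner is implied.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical synthetic data: a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, size=60)

pipe = Pipeline([
    ('sc', StandardScaler()),
    ('reg', KNeighborsRegressor())
])

# One candidate per value; each is scored with 5-fold cross-validation
grid = GridSearchCV(pipe, param_grid={'reg__n_neighbors': [1, 3, 5]}, cv=5)
grid.fit(X, y)

print(grid.best_params_)   # the winning combination
print(grid.best_score_)    # its mean cross-validated score
```

After fitting, `grid` refits the best combination on all the data passed to `fit`, so `grid.predict` and `grid.score` use that best model directly.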

In [18]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.svm import SVR, SVC

from sklearn.metrics import mean_absolute_error, mean_squared_error, confusion_matrix, classification_report

### Regression

In [19]:
df = pd.read_csv('USA_Housing.csv')

In [20]:
df.head()

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386


In [21]:
# Train Test split


X = df[df.columns[:-2]]
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)


In [25]:
# What we want to build is an iterative loop that applies a
# StandardScaler and a GridSearchCV (for tuning parameters) to each Regression algorithm

regression_list = [LinearRegression, KNeighborsRegressor, DecisionTreeRegressor, RandomForestRegressor, SVR]


regression_dict = {
    'Linear':{
        'algorithm': LinearRegression,
        'params': {}
    },
    'KNeighbors':{
        'algorithm': KNeighborsRegressor,
        'params': {'reg__n_neighbors':range(1,20)}
    },
    'Decision Tree':{
        'algorithm': DecisionTreeRegressor,
        'params': {'reg__max_depth':[4,5]}
    },
    'RandomForest':{
        'algorithm': RandomForestRegressor,
        'params': {'reg__max_depth':[4,5], 'reg__n_estimators':[100, 200]}
    },
    'Support Vector':{
        'algorithm': SVR,
        'params': {'reg__C':[1,1000, 100000],'reg__epsilon':[0.1, 0.001, 0.0001]}
    },
    
}


# Fit and evaluate each regression algorithm in turn
for name, algorithm in regression_dict.items():

    print('\n', name.title())

    # Pipeline: scale the attributes, then fit the regressor (step name 'reg')
    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('reg', algorithm['algorithm']())
    ], verbose=True)

    # Parameter grid for this algorithm; the 'reg__' prefix targets the regressor step
    param_grid = algorithm['params']

    # GridSearchCV
    grid = GridSearchCV(pipe, param_grid=param_grid, verbose=3)
    grid.fit(X_train, y_train)

    # Predict with the best parameter combination found
    predictions = grid.predict(X_test)

    # Evaluation
    mse = mean_squared_error(y_test, predictions)
    print('MAE', mean_absolute_error(y_test, predictions))
    print('MSE', mse)
    print('RMSE', np.sqrt(mse))
    print('Score', grid.score(X_test, y_test))
    
    


 Linear
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.0s
[CV 1/5] END ..................................., score=0.918 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.0s
[CV 2/5] END ..................................., score=0.911 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.0s
[CV 3/5] END ..................................., score=0.919 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.0s
[CV 4/5] END ..................................., score=0.922 total time=   0.0s
[Pipeline] ................ (step 1 of 2

[CV 1/5] END ................reg__n_neighbors=7;, score=0.871 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.0s
[CV 2/5] END ................reg__n_neighbors=7;, score=0.858 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.0s
[CV 3/5] END ................reg__n_neighbors=7;, score=0.873 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.0s
[CV 4/5] END ................reg__n_neighbors=7;, score=0.868 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.0s
[CV 5/5] END ................reg__n_neighbors=7;, score=0.879 total time=   0.0s
[Pipeline] ................

[CV 1/5] END ...............reg__n_neighbors=15;, score=0.878 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.0s
[CV 2/5] END ...............reg__n_neighbors=15;, score=0.861 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.0s
[CV 3/5] END ...............reg__n_neighbors=15;, score=0.873 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.0s
[CV 4/5] END ...............reg__n_neighbors=15;, score=0.871 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.0s
[CV 5/5] END ...............reg__n_neighbors=15;, score=0.887 total time=   0.0s
[Pipeline] ................

[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.4s
[CV 1/5] END reg__max_depth=4, reg__n_estimators=100;, score=0.711 total time=   0.4s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.4s
[CV 2/5] END reg__max_depth=4, reg__n_estimators=100;, score=0.659 total time=   0.4s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.3s
[CV 3/5] END reg__max_depth=4, reg__n_estimators=100;, score=0.693 total time=   0.4s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.4s
[CV 4/5] END reg__max_depth=4, reg__n_estimators=100;, score=0.699 total time=   0.4s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.3s
[CV 5/5] END reg_

[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.4s
[CV 1/5] END .....reg__C=1000, reg__epsilon=0.1;, score=0.499 total time=   0.6s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.4s
[CV 2/5] END .....reg__C=1000, reg__epsilon=0.1;, score=0.491 total time=   0.6s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.4s
[CV 3/5] END .....reg__C=1000, reg__epsilon=0.1;, score=0.463 total time=   0.6s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.5s
[CV 4/5] END .....reg__C=1000, reg__epsilon=0.1;, score=0.462 total time=   0.7s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing reg, total=   0.4s
[CV 5/5] END .....reg__C=1000, reg__e

### Classification

In [26]:
df = pd.read_csv('Titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [27]:
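def fill_in_age(columns):
    # Fill a missing Age with a fixed value chosen by passenger class
    # (access by label; positional access such as columns[0] is deprecated in recent pandas)
    age = columns['Age']
    pclass = columns['Pclass']

    if np.isnan(age):
        if pclass == 1:
            return 41
        elif pclass == 2:
            return 29
        else:
            return 24
    else:
        return age

df['Age'] = df[['Age', 'Pclass']].apply(fill_in_age, axis=1)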

# Split Embarked and Pclass into dummy (0/1) columns, dropping the first category of each

emb = pd.get_dummies(df['Embarked'], drop_first=True)
pclass = pd.get_dummies(df['Pclass'], drop_first=True)
pclass = pclass.rename({2:'P2', 3:'P3'}, axis=1)

df = pd.concat([df, emb, pclass], axis=1)

# Drop the unnecessary columns
df = df.drop(['PassengerId', 'Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)

df = df.dropna()

df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 417 entries, 0 to 417
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  417 non-null    int64  
 1   Age       417 non-null    float64
 2   SibSp     417 non-null    int64  
 3   Parch     417 non-null    int64  
 4   Fare      417 non-null    float64
 5   Q         417 non-null    uint8  
 6   S         417 non-null    uint8  
 7   P2        417 non-null    uint8  
 8   P3        417 non-null    uint8  
dtypes: float64(2), int64(3), uint8(4)
memory usage: 21.2 KB


In [28]:
df.head()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,Q,S,P2,P3
0,0,34.5,0,0,7.8292,1,0,0,1
1,1,47.0,1,0,7.0,0,1,0,1
2,0,62.0,0,0,9.6875,1,0,1,0
3,0,27.0,0,0,8.6625,0,1,0,1
4,1,22.0,1,1,12.2875,0,1,0,1


In [29]:
# Train test split

X = df[df.columns[1:]]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [32]:
# Let's set up our dictionary of algorithms and params, then the pipeline with StandardScaler and GridSearchCV



# Dictionary
classification_dict = {
    'Linear':{
        'algorithm': LogisticRegression,
        'params': {}
    },
    'KNeighbors':{
        'algorithm': KNeighborsClassifier,
        'params': {'cls__n_neighbors':range(1,20)}
    },
    'Decision Tree':{
        'algorithm': DecisionTreeClassifier,
        'params': {'cls__max_depth':[4,5]}
    },
    'RandomForest':{
        'algorithm': RandomForestClassifier,
        'params': {'cls__max_depth':[4,5], 'cls__n_estimators':[100, 200]}
    },
    'Support Vector':{
        'algorithm': SVC,
        'params': {'cls__C':[1,1000, 100000],'cls__gamma':['auto', 'scale']}
    },
}


# Fit and evaluate each classification algorithm in turn
for name, algorithm in classification_dict.items():

    print('\n', name.title())

    # Pipeline: scale the attributes, then fit the classifier (step name 'cls')
    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('cls', algorithm['algorithm']())
    ], verbose=True)

    # Parameter grid for this algorithm; the 'cls__' prefix targets the classifier step
    param_grid = algorithm['params']

    # GridSearchCV
    grid = GridSearchCV(pipe, param_grid=param_grid, verbose=3)
    grid.fit(X_train, y_train)
    predictions = grid.predict(X_test)

    # Evaluation
    print('Error', np.mean(y_test != predictions))
    print(confusion_matrix(y_test, predictions))
    print(classification_report(y_test, predictions))




 Linear
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 1/5] END ..................................., score=0.593 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 2/5] END ..................................., score=0.655 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 3/5] END ..................................., score=0.603 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 4/5] END ..................................., score=0.638 total time=   0.0s
[Pipeline] ................ (step 1 of 2

[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 2/5] END ...............cls__n_neighbors=10;, score=0.655 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 3/5] END ...............cls__n_neighbors=10;, score=0.655 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 4/5] END ...............cls__n_neighbors=10;, score=0.603 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 5/5] END ...............cls__n_neighbors=10;, score=0.603 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 1/5] END ...............cls__n_ne

[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
Error 0.3888888888888889
[[68 11]
 [38  9]]
              precision    recall  f1-score   support

           0       0.64      0.86      0.74        79
           1       0.45      0.19      0.27        47

    accuracy                           0.61       126
   macro avg       0.55      0.53      0.50       126
weighted avg       0.57      0.61      0.56       126


 Decision Tree
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 1/5] END ..................cls__max_depth=4;, score=0.695 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 2/5] END ..................cls__max_depth=

[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 3/5] END .........cls__C=1, cls__gamma=auto;, score=0.707 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 4/5] END .........cls__C=1, cls__gamma=auto;, score=0.586 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 5/5] END .........cls__C=1, cls__gamma=auto;, score=0.569 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 1/5] END ........cls__C=1, cls__gamma=scale;, score=0.695 total time=   0.0s
[Pipeline] ................ (step 1 of 2) Processing sc, total=   0.0s
[Pipeline] ............... (step 2 of 2) Processing cls, total=   0.0s
[CV 2/5] END ........cls__C=1, cls__g