<a href="https://colab.research.google.com/github/mdkamrulhasan/machine_learning_concepts/blob/master/notebooks/supervised/Classification_breast_cancer_LR_kNN_DTree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### What will we cover today?

We are going to learn how to train and test a classification model on a real dataset. More specifically, we are going to use the sklearn implementations of the following three classifiers for a binary classification task (***breast cancer diagnosis prediction***).

1.   **Logistic Regression**
2.   **kNN Classifier**
3.   **Decision Tree Classifier**








[Breast Cancer Wisconsin (Diagnostic) Dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html)

**Summary of the dataset:**

*   **Features:** 30 numeric measurements of cell nuclei, e.g. *mean radius, mean texture, ..., worst symmetry, worst fractal dimension*
*   **Target (y)**: The diagnosis, encoded as 0 (malignant) or 1 (benign).



**Note:** Each of the ten underlying measurements (radius, texture, perimeter, and so on) appears three times: as a mean, a standard error, and a "worst" value (for more details, read the data description provided through the link above).



---


---





## Loading necessary Python packages

In [2]:
# dataset-related package(s)
from sklearn import datasets

# breast cancer data comes as a part of the sklearn package
from sklearn.datasets import load_breast_cancer

# data processing packages
import pandas as pd

# Logistic Regression modeling package(s) (sklearn)
from sklearn.linear_model import LogisticRegression


# model evaluation related packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# visualization
import plotly.express as px
import plotly.graph_objects as go

## Loading data and some preprocessing

In [3]:
# Load the dataset (a Bunch object; as_frame=True returns pandas objects)
df = datasets.load_breast_cancer(as_frame=True)
df.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [4]:
# Separating features and labels dataframes
features_df, labels_df = df.data, df.target

In [5]:
# Looking at a sample of features
features_df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [6]:
features_df.shape

(569, 30)

In [7]:
labels_df.head()

| | target |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
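The first five labels are all 0, which says little about how the two classes are distributed. The sketch below counts the samples per class and maps each integer label to its name. The variable names (`data`, `class_names`, `class_counts`) are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_breast_cancer

# Load the same dataset as pandas objects
data = load_breast_cancer(as_frame=True)

# target_names is ordered by label: index 0 -> 'malignant', index 1 -> 'benign'
class_names = list(data.target_names)

# Number of samples per integer label, sorted by label
class_counts = data.target.value_counts().sort_index()

for label, count in class_counts.items():
    print(f"label {label} ({class_names[label]}): {count} samples")
```

Knowing the class balance helps when interpreting accuracy later: a model that always predicts the majority class already reaches that class's share of the data.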


## Modeling

Extracting features and labels as NumPy arrays

In [8]:
X, y = features_df.values, labels_df.values

Splitting the data into training and test sets

In [9]:
# test data amount (in terms of proportion)
TEST_PROP = 0.5
# Random number seed; important for experiment reproducibility
RANDOM_SEED = 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_PROP, random_state=RANDOM_SEED)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((284, 30), (284,), (285, 30), (285,))
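A plain random split does not guarantee that both halves contain the two classes in the same proportion. As a possible variation (not what the notebook does), `train_test_split` accepts a `stratify` argument that preserves the class ratio. The sketch below reuses the same proportion and seed as above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y asks train_test_split to keep the class ratio equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# Fraction of label 1 (benign) in each split
train_pos_rate = y_train.mean()
test_pos_rate = y_test.mean()
print(f"benign fraction: train={train_pos_rate:.3f}, test={test_pos_rate:.3f}")
```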

Model Instantiation

In [10]:
# Create a logistic regression object
regr = LogisticRegression()

Model Training

In [11]:
# Train the model using the training set
regr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
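The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of the scaling route follows, assuming a `StandardScaler` placed in front of the classifier in a pipeline. The pipeline name `scaled_lr` and the value `max_iter=1000` are illustrative choices, not settings from the notebook.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Standardize each feature to zero mean and unit variance, then fit the classifier
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Record any warnings raised during fitting so they can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scaled_lr.fit(X_train, y_train)

convergence_warnings = [
    w for w in caught if issubclass(w.category, ConvergenceWarning)]
print("convergence warnings:", len(convergence_warnings))
print("test accuracy: %.2f" % scaled_lr.score(X_test, y_test))
```

Because the scaler is fitted inside the pipeline, its statistics come from the training data only and are then reused on the test data.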


In [12]:
regr.coef_, regr.intercept_

(array([[ 1.40267157,  0.08121742,  0.15056015,  0.0018719 , -0.04721641,
         -0.24636873, -0.36458528, -0.16534054, -0.1299629 , -0.01014455,
          0.05442662,  0.33245345,  0.24640496, -0.07970239, -0.00406198,
         -0.04457261, -0.07568322, -0.02501147, -0.03022065, -0.00323834,
          1.49153868, -0.23087379, -0.29614219, -0.02608488, -0.08123089,
         -0.69159632, -0.9851376 , -0.33170737, -0.30937954, -0.06613801]]),
 array([0.32308448]))
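The coefficients and intercept define a linear score for each sample, and logistic regression passes that score through the sigmoid function to obtain the probability of label 1. The sketch below checks this relationship against `predict_proba`. It fits on the full dataset with a larger `max_iter` purely for illustration, so its coefficients will differ from the ones printed above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# max_iter is raised here (an assumption, not the notebook's setting)
model = LogisticRegression(max_iter=10000).fit(X, y)

# Linear score z = X @ w + b, one value per sample
z = X @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid maps the score to P(label == 1)
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = model.predict_proba(X)[:, 1]

# A positive score corresponds to predicting label 1
labels_manual = (z > 0).astype(int)

print("probabilities match:", np.allclose(p_manual, p_sklearn))
print("labels match:", np.array_equal(labels_manual, model.predict(X)))
```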

Making Predictions and evaluation (on the training data)

- Just checking how well the model fits the training data.

In [21]:
# Making predictions (training dataset)
y_pred = regr.predict(X_train)
# Estimating the accuracy
acc_train = accuracy_score(y_train, y_pred)
print("accuracy score (training data): %.2f" % acc_train)

accuracy score (training data): 0.97


Making Predictions and evaluation (on the test data)

- This is the more interesting metric, because it is measured on data the model has not seen during training.

In [22]:
# Making predictions (test dataset)
y_pred = regr.predict(X_test)
# Estimating the accuracy
acc_test = accuracy_score(y_test, y_pred)
print("accuracy_score (test data): %.2f" % acc_test)

accuracy_score (test data): 0.95
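Accuracy is a single number and hides which kind of mistake the model makes. One way to look closer is a confusion matrix, sketched below. The classifier is refitted here with a larger `max_iter` (an illustrative setting), so its counts may differ slightly from the model trained above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true labels, columns are predicted labels (0 = malignant, 1 = benign)
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Accuracy is the share of predictions that fall on the diagonal
acc_from_cm = cm.trace() / cm.sum()
print("accuracy: %.2f" % acc_from_cm)
```

The off-diagonal entries separate malignant cases predicted as benign from benign cases predicted as malignant, two errors that carry very different costs in a diagnostic setting.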


In [31]:
lreg_results = pd.DataFrame({
  'model': ['lr'],
  'train_acc': [round(acc_train, 2)],
  'test_acc': [round(acc_test, 2)]
})

Note(s):


*   We see that the training accuracy (0.97) is higher than the test accuracy (0.95).
*   Can you explain why the test error is higher than the training error?
*   How can we ensure that our model performs similarly at test time?
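One way to probe these questions is to repeat the experiment with several random splits and see how much the train-test gap moves. The sketch below is only one possible starting point. The seed range, the scaling step, and `max_iter=1000` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

gaps = []
for seed in range(5):
    # A different seed gives a different random 50/50 split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    gaps.append(train_acc - test_acc)
    print(f"seed={seed}: train={train_acc:.3f}, test={test_acc:.3f}")

print("mean train-test gap: %.3f" % np.mean(gaps))
```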




## kNN Classifier

In [32]:
from sklearn.neighbors import KNeighborsClassifier

N_NEAREST_NEIGHBOR_PARAM = 1

# Model instantiation and training
reg_knn = KNeighborsClassifier(n_neighbors=N_NEAREST_NEIGHBOR_PARAM)
reg_knn.fit(X_train, y_train)

# Prediction and accuracy estimation (training data)
y_pred = reg_knn.predict(X_train)
acc_train = accuracy_score(y_train, y_pred)
print("accuracy score (training data): %.2f" % acc_train)

# Prediction and accuracy estimation (test data)
y_pred = reg_knn.predict(X_test)
acc_test = accuracy_score(y_test, y_pred)
print("accuracy_score (test data): %.2f" % acc_test)

# Storing results in a dataframe
knn_results = pd.DataFrame({
  'model': ['knn'],
  'train_acc': [round(acc_train, 2)],
  'test_acc': [round(acc_test, 2)]
})

accuracy score (training data): 1.00
accuracy_score (test data): 0.93
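With a single neighbor, every training point is its own nearest neighbor, which explains the perfect training accuracy. The sketch below repeats the experiment for a few values of `n_neighbors` to show how the two accuracies change. The list of values is an illustrative choice, not a tuned one.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Maps k -> (training accuracy, test accuracy)
knn_scores = {}
for k in (1, 3, 5, 11, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    knn_scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print("k=%2d  train=%.2f  test=%.2f" % (k, *knn_scores[k]))
```

Picking the best k from the test accuracies would leak test information into model selection. A separate validation set or cross-validation is the usual way around that.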


## Decision Tree Classifier

In [33]:
from sklearn.tree import DecisionTreeClassifier

# Model instantiation and training
# (a classifier, not a regressor, since the labels are discrete classes)
reg_dtree = DecisionTreeClassifier()
reg_dtree.fit(X_train, y_train)

# Prediction and accuracy estimation (training data)
y_pred = reg_dtree.predict(X_train)
acc_train = accuracy_score(y_train, y_pred)
print("accuracy score (training data): %.2f" % acc_train)

# Prediction and accuracy estimation (test data)
y_pred = reg_dtree.predict(X_test)
acc_test = accuracy_score(y_test, y_pred)
print("accuracy_score (test data): %.2f" % acc_test)

# Storing results in a dataframe
dtree_results = pd.DataFrame({
  'model': ['dtree'],
  'train_acc': [round(acc_train, 2)],
  'test_acc': [round(acc_test, 2)]
})

accuracy score (training data): 1.00
accuracy_score (test data): 0.91
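A tree grown without limits can keep splitting until it memorizes the training set, which matches the perfect training accuracy above. The sketch below compares an unrestricted tree with one capped at `max_depth=3`. The cap and the fixed `random_state` are illustrative assumptions, not settings from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Unrestricted tree versus a depth-limited one
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(
    max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("max_depth=3", shallow_tree)]:
    print("%-12s depth=%d  train=%.2f  test=%.2f" % (
        name, tree.get_depth(),
        tree.score(X_train, y_train), tree.score(X_test, y_test)))
```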


## Comparing model performances

In [34]:
results = pd.concat([lreg_results, knn_results, dtree_results], axis=0)

fig = go.Figure([
    go.Bar(x=results.model, y=results.train_acc, name='Training accuracy'),
    go.Bar(x=results.model, y=results.test_acc, name='Test accuracy')
])
fig.update_layout(
    title="Model comparison",
    yaxis_title="accuracy",
    legend=dict(x=0.05, y=0.95)
)
fig.show()

## Questions for you



*   What differences have you noticed among the performances of these three models?
*   Which is your preferred model, and why?
*   What steps would you take to make a fair comparison among these models (model classes)?
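As one possible starting point for the last question, the sketch below evaluates all three model classes on the same cross-validation folds instead of a single split. The fold count, the scaling step for the distance- and gradient-based models, and the hyperparameters are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "lr": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "dtree": DecisionTreeClassifier(random_state=0),
}

# A fixed seed makes every model see exactly the same five folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_means[name] = scores.mean()
    print("%-6s mean=%.3f  std=%.3f" % (name, scores.mean(), scores.std()))
```

Reporting the spread across folds alongside the mean shows whether a difference between two models is larger than the variation caused by the choice of split.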



