<a href="https://colab.research.google.com/github/job-moses/MachineLearning/blob/main/CapstoneProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PART 1. COMPARING 5 PREDICTIVE MODELS ON THE BOSTON HOUSE PRICING DATASET

## About The Dataset

The Boston Housing dataset contains 13 features (independent variables) that describe various aspects of housing in different neighborhoods of Boston. The meaning of each feature is outlined below:

1. **CRIM: Per capita crime rate by town**
   - This represents the crime rate in the town. Higher values indicate higher crime rates.

2. **ZN: Proportion of residential land zoned for lots over 25,000 sq. ft.**
   - This feature represents the proportion of large residential lots. It provides information about the zoning regulations related to the size of residential land.

3. **INDUS: Proportion of non-retail business acres per town**
   - INDUS represents the proportion of non-retail business areas. It provides insights into the industrial nature of the town.

4. **CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)**
   - This is a binary indicator that tells whether a tract is located along the Charles River (1) or not (0).

5. **NOX: Nitric oxides concentration (parts per 10 million)**
   - NOX represents the concentration of nitric oxides in the air. It is a measure of air pollution.

6. **RM: Average number of rooms per dwelling**
   - RM represents the average number of rooms in a dwelling. It gives an indication of the size of houses in the town.

7. **AGE: Proportion of owner-occupied units built prior to 1940**
   - AGE represents the proportion of owner-occupied units that were built before 1940. It provides information about the age of the housing stock.

8. **DIS: Weighted distances to five Boston employment centers**
   - DIS represents the weighted distances from the residential areas to five employment centers. It gives an idea of the accessibility to employment.

9. **RAD: Index of accessibility to radial highways**
   - RAD represents an index that measures the accessibility to radial highways. Higher values indicate better accessibility.

10. **TAX: Full-value property tax rate per $10,000**
    - TAX represents the property tax rate. It provides information about the tax burden on properties.

11. **PTRATIO: Pupil-teacher ratio by town**
    - PTRATIO represents the ratio of students to teachers in schools. It is often treated as a rough proxy for the quality of education.

12. **B: 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents by town**
    - B is a transformed measure of the proportion of Black residents in the town, not the raw proportion itself.

13. **LSTAT: Percentage of lower status of the population**
    - LSTAT represents the percentage of the population with lower socioeconomic status. It is an indicator of the social and economic status of the population.

Each of these features offers a different perspective on the characteristics of the neighborhoods in Boston and can be used to predict the median value of owner-occupied homes (the target variable, named "MEDV" in this dataset and often referred to as "PRICE").
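The `B` column is easy to misread because it stores a transformed value rather than a raw proportion. As a small illustration, the formula from item 12 can be evaluated directly; the helper name `b_feature` is hypothetical and not part of the dataset or any library.

```python
def b_feature(bk: float) -> float:
    """Evaluate 1000 * (Bk - 0.63)^2 for a proportion Bk between 0 and 1."""
    return 1000 * (bk - 0.63) ** 2


# Evaluate the transform at the two ends of the range and at its minimum.
for bk in (0.0, 0.63, 1.0):
    print(f"Bk = {bk:.2f} -> B = {b_feature(bk):.2f}")
# Bk = 0.00 -> B = 396.90
# Bk = 0.63 -> B = 0.00
# Bk = 1.00 -> B = 136.90
```

Because the transform is a parabola, two different proportions can map to the same `B` value, and the largest value the formula can produce on this range is 396.9 (at Bk = 0).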

## Evaluation Metric
Predicting house prices is a regression problem, so I have decided to use the
__R2 metric__.
The R2 (or R-squared) metric indicates the goodness of fit of a set of predictions to the actual values. In statistical literature this measure is called the coefficient of determination.
A value of 1 is a perfect fit, and 0 corresponds to a model that always predicts the mean of the actual values. On held-out data the score can also be negative when the predictions are worse than that baseline.
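To make the definition concrete, here is a minimal sketch that computes R2 by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample values are illustrative assumptions; scikit-learn's `r2_score`, used later in this notebook, computes the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [24.0, 21.6, 34.7, 33.4, 36.2]               # illustrative target values
perfect = list(y_true)                                 # predictions equal to the targets
mean_only = [sum(y_true) / len(y_true)] * len(y_true)  # always predict the mean
bad = [40.0, 10.0, 20.0, 45.0, 15.0]                   # made-up poor predictions

print(r_squared(y_true, perfect))    # 1.0
print(r_squared(y_true, mean_only))  # 0.0
print(r_squared(y_true, bad) < 0)    # True
```

The last case shows why R2 is not strictly bounded below by 0: predictions that miss by more than the mean-only baseline give a negative score.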

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.metrics import r2_score
#linear models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor

In [None]:
# silence Python warnings to keep the notebook output tidy
import warnings
warnings.filterwarnings("ignore")

In [None]:
# read the boston dataset
df = pd.read_csv('boston_house_prices.csv',skiprows=1)

In [None]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [None]:
df.tail()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.12,76.7,2.2875,1,273,21.0,396.9,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.9,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0
505,0.04741,0.0,11.93,0,0.573,6.03,80.8,2.505,1,273,21.0,396.9,7.88,11.9


In [None]:
df.sample(5)

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
384,20.0849,0.0,18.1,0,0.7,4.368,91.2,1.4395,24,666,20.2,285.83,30.63,8.8
55,0.01311,90.0,1.22,0,0.403,7.249,21.9,8.6966,5,226,17.9,395.93,4.81,35.4
397,7.67202,0.0,18.1,0,0.693,5.747,98.9,1.6334,24,666,20.2,393.1,19.92,8.5
471,4.03841,0.0,18.1,0,0.532,6.229,90.7,3.0993,24,666,20.2,395.33,12.87,19.6
418,73.5341,0.0,18.1,0,0.679,5.957,100.0,1.8026,24,666,20.2,16.45,20.62,8.8


In [None]:
df.shape

(506, 14)

In [None]:
print(f"The dataset consists of {df.shape[0]} rows and {df.shape[1]} columns")

The dataset consists of 506 rows and 14 columns


In [None]:
df.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'MEDV'],
      dtype='object')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB


In [None]:
df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [None]:
df.duplicated().any()

False

In [None]:
df.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


__Split data into input and output__

In [None]:
X = df.drop('MEDV', axis =1)
y = df['MEDV']

### 1. Data Splitting – Split the data into training and testing datasets

In [None]:
train_X,test_X, train_y, test_y = train_test_split(X,y, test_size = 0.2, random_state = 0)

### 2. Model building and Evaluation

In [None]:
models = [
    ('LinearRegression',LinearRegression()),
    ('Ridge',Ridge()),
    ('Lasso',Lasso()),
    ('ElasticNet',ElasticNet()),
    ('RandomForestRegressor',RandomForestRegressor()),

]
models_cv_results = {}
model_scores = {}
scoring = 'r2'
kfold = KFold(n_splits = 10, shuffle = True, random_state = 0)
for model_name, model in models:
    model_cv_result = cross_val_score(model, train_X, train_y, cv = kfold, scoring = scoring)
    models_cv_results[model_name] = np.mean(model_cv_result)

    #fit the model on training set
    model.fit(train_X, train_y)

    # predict output using testset
    pred = model.predict(test_X)

    # score the model performance
    model_score = r2_score(test_y,pred)
    model_scores[model_name] = model_score

# print cross validation result on training set
print(f"Training CV result using {scoring}  Metrics\n")
for name, model_r2_score in models_cv_results.items():
    print(f"{name} : {model_r2_score: .4f} ")
#print r2_score on test set
print(f"\n {scoring} result on testset \n")
for name, model_score in model_scores.items():
    print(f"{name} : {model_score: .4f}")

Best_model = max(model_scores, key = model_scores.get)
print(f"\n Best Model is {Best_model} with r2_score of {model_scores[Best_model]}")

Training CV result using r2  Metrics

LinearRegression :  0.7370 
Ridge :  0.7357 
Lasso :  0.6850 
ElasticNet :  0.6989 
RandomForestRegressor :  0.8703 

 r2 result on testset 

LinearRegression :  0.5892
Ridge :  0.5796
Lasso :  0.4879
ElasticNet :  0.5006
RandomForestRegressor :  0.7806

 Best Model is RandomForestRegressor with r2_score of 0.7806067008287291
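The cross-validation scores above are means over 10 folds. The following is a minimal, library-free sketch of how k-fold partitioning works: shuffle the row indices once, cut them into k nearly equal validation folds, and train on the rest each time. The function `k_fold_indices` is hypothetical, its shuffle order will not match scikit-learn's `KFold`, and the 404-row size assumes the 80/20 split of 506 rows leaves 404 training rows.

```python
import random


def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Assumed training-set size: 506 rows minus a 102-row test split.
folds = list(k_fold_indices(n_samples=404, n_splits=10))
print([len(val) for _, val in folds])
# [41, 41, 41, 41, 40, 40, 40, 40, 40, 40]
```

Each model is fitted 10 times, once per fold, and the reported CV score is the mean of the 10 validation scores. The test set plays no part in this step.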


Comparing the models on __standardized input variables__

In [None]:
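scalar = StandardScaler()
# fit the scaler on the training set only, then reuse its statistics on the test set
X_train = scalar.fit_transform(train_X)
X_test = scalar.transform(test_X)
# note: the scores recorded below were obtained with the test set scaled by its own statistics (fit_transform)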
models = [
    ('LinearRegression',LinearRegression()),
    ('Ridge',Ridge()),
    ('Lasso',Lasso()),
    ('ElasticNet',ElasticNet()),
    ('RandomForestRegressor',RandomForestRegressor()),

]
models_cv_results = {}
model_scores = {}
scoring = 'r2'
kfold = KFold(n_splits = 10, shuffle = True, random_state = 0)
for model_name, model in models:
    model_cv_result = cross_val_score(model, X_train, train_y, cv = kfold, scoring = scoring)
    models_cv_results[model_name] = np.mean(model_cv_result)

    #fit the model on training set
    model.fit(X_train, train_y)

    # predict output using testset
    pred = model.predict(X_test)

    # score the model performance
    model_score = r2_score(test_y,pred)
    model_scores[model_name] = model_score

# print cross validation result on training set
print(f"Training CV result using {scoring}  Metrics\n")
for name, model_r2_score in models_cv_results.items():
    print(f"{name} : {model_r2_score: .4f} ")
#print r2_score on test set
print(f"\n {scoring} result on testset \n")
for name, model_score in model_scores.items():
    print(f"{name} : {model_score: .4f}")

Best_model = max(model_scores, key = model_scores.get)
print(f"\n Best Model is {Best_model} with r2_score of {model_scores[Best_model]}")

Training CV result using r2  Metrics

LinearRegression :  0.7370 
Ridge :  0.7373 
Lasso :  0.6821 
ElasticNet :  0.6732 
RandomForestRegressor :  0.8741 

 r2 result on testset 

LinearRegression :  0.5687
Ridge :  0.5679
Lasso :  0.5012
ElasticNet :  0.4688
RandomForestRegressor :  0.7421

 Best Model is RandomForestRegressor with r2_score of 0.7421314097848315
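How the test set is scaled affects these numbers. The sketch below, using only the standard library, contrasts scaling test values with statistics learned from the training data against refitting the statistics on the test data. The helper names and the sample values are illustrative assumptions; the z-score formula with the population standard deviation is, to my understanding, what `StandardScaler` applies per feature.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Return the (mean, population std) learned from one feature column."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in column]


train_rm = [6.575, 6.421, 7.185, 6.998]  # illustrative training values
test_rm = [7.147, 6.593]                 # illustrative test values

mean, std = fit_scaler(train_rm)
consistent = transform(test_rm, mean, std)        # test scaled with training statistics
refit = transform(test_rm, *fit_scaler(test_rm))  # test scaled with its own statistics

print(consistent)
print(refit)  # approximately [1.0, -1.0] for any two distinct values
```

With the refit version, the test values land on a different scale from the one the model was trained on, which can shift test scores even for models that are otherwise insensitive to feature scaling.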


## Conclusion

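Five models were compared for predicting house prices. RandomForestRegressor performed best, with a test-set R2 of about 0.78 on the unscaled inputs (0.74 on the standardized inputs).
Standardizing the inputs lowered the test-set R2 for four of the five models; Lasso was the exception, improving from 0.4879 to 0.5012. Two caveats apply: the standardized test scores above were recorded with the test set scaled by its own mean and standard deviation rather than the training set's, and RandomForestRegressor was run without a fixed `random_state`, so its scores vary between runs.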

## Part 2. Hyperparameter Tuning with GridSearchCV for a Classification Problem

The Pima Indians dataset is a good dataset for developing a predictive model that classifies whether a patient has diabetes or not; it is a binary classification problem. For this problem I use accuracy as the scoring metric, that is, the fraction of patients whose diabetes status the model predicts correctly.
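As a quick illustration of the metric, accuracy is the number of correct predictions divided by the number of samples. The labels below are made up for the example, and the function name `accuracy` is a hypothetical stand-in for scikit-learn's `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1, 0, 1, 0, 1, 0, 0, 0]      # made-up labels (1 = diabetic)
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]      # made-up model predictions
always_negative = [0] * len(y_true)    # baseline that never predicts diabetes

print(accuracy(y_true, y_pred))           # 0.75
print(accuracy(y_true, always_negative))  # 0.625
```

The baseline row is a reminder that accuracy should be read against the class balance: when one class dominates, a model that ignores the inputs can still score well.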

In [None]:
# load dataset from file directory
df2 = pd.read_csv('diabetes.csv')

# check the first five rows
df2.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
# importing libraries
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# split data into input and output
X = df2.drop('Outcome', axis = 1)
y = df2['Outcome']

# standardize input features
scalar = StandardScaler()
rescale_X = scalar.fit_transform(X)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(rescale_X, y, test_size=0.2, random_state=42)

# classification Models
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree Classifier': DecisionTreeClassifier(),
    'Random Forest Classifier': RandomForestClassifier(),
    'Support Vector Classifier': SVC(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
}

# define hyperparameter for the models

param_grids = [
    # Logistic Regression
    {'C': [0.1, 1, 10], 'solver': ['liblinear', 'saga']},

    # Decision Tree Classifier
    {'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10]},

    # Random Forest Classifier
    {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20, 30]},

    # Support Vector Classifier
    {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},

    # K-Nearest Neighbors
    {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']},

]

# Perform GridSearchCV for each model

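best_models = {}  # best fitted estimator per model, used for prediction
best_params = {}  # best hyperparameters per model
model_accuracy = {}  # test-set accuracy per model

# pair each model with its grid by position; zip leaves param_grids intact so the cell can be rerun
for (model_name, model), param_grid in zip(models.items(), param_grids):
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')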
    grid_search.fit(X_train, y_train)
    best_models[model_name] = grid_search.best_estimator_
    best_params[model_name] = grid_search.best_params_

# Evaluate the best models on the test set
for model_name, best_model in best_models.items():
    predictions = best_model.predict(X_test)
    #score the model
    accuracy = accuracy_score(y_test, predictions)
    model_accuracy[model_name]= accuracy

# save a summary of each model in a list
model_summary = []
for model_name, best_estimator in best_models.items():
    # best_estimator is the fitted best estimator; its repr appears in the last column
    model_summary.append((model_name, model_accuracy[model_name], best_estimator))

# save summary list in a dataframe
pd.DataFrame(model_summary, columns = ['Model Name', 'Best Accuracy Score', 'Best Parameter'])

Unnamed: 0,Model Name,Best Accuracy Score,Best Parameter
0,Logistic Regression,0.753247,"LogisticRegression(C=10, solver='liblinear')"
1,Decision Tree Classifier,0.75974,"DecisionTreeClassifier(max_depth=30, min_sampl..."
2,Random Forest Classifier,0.753247,"(DecisionTreeClassifier(max_depth=10, max_feat..."
3,Support Vector Classifier,0.727273,SVC(C=1)
4,K-Nearest Neighbors,0.681818,KNeighborsClassifier(n_neighbors=7)
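To see how much work each grid search does, the sketch below expands one of the grids above into its individual candidates using only the standard library. The helper `expand_grid` is hypothetical and only mirrors the idea of an exhaustive grid; it is not how `GridSearchCV` is implemented internally.

```python
from itertools import product


def expand_grid(grid):
    """Return every parameter combination in a {name: [values]} grid."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


# The Logistic Regression grid from the cell above.
logreg_grid = {'C': [0.1, 1, 10], 'solver': ['liblinear', 'saga']}
candidates = expand_grid(logreg_grid)
cv_folds = 5

print(len(candidates))             # 6
print(len(candidates) * cv_folds)  # 30 cross-validation fits, before the final refit
for candidate in candidates:
    print(candidate)
```

Grid size multiplies quickly: the Random Forest grid above has 3 x 4 = 12 candidates, so with 5 folds it needs 60 fits of a comparatively expensive model.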
