# Project #5 - Multi-model
Data file: https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/Framingham_4000.csv


## Project #5 Requirements
* Load and examine data
* Prepare data for model training
  * Perform the necessary steps that we learned during the semester
* Train 3 separate models
  * From the various Classification algorithms that we learned during the semester, train 3 different Classification algorithms
* Print accuracy of each model

In [1]:
from datetime import datetime
print(f'Run time: {datetime.now().strftime("%D %T")}')

Run time: 12/08/23 11:41:19


### Import libraries

In [2]:
import pandas as pd
# Add all other necessary imports below
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

### Load data

The goal is to predict, from the Framingham_4000.csv dataset, whether a patient has a 10-year risk of coronary heart disease (CHD).  
The dataset contains:
* over 4,000 records
* 15 features (independent variables)
* the target variable (dependent variable) is 'TenYearCHD'
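
The counts listed above can be checked directly once the file is loaded. The following is a minimal sketch, not part of the assignment template: the helper name `summarize_dataset` is an illustrative assumption, and the toy frame uses made-up values so the block runs without network access.

```python
import pandas as pd

def summarize_dataset(frame: pd.DataFrame, target_col: str = "TenYearCHD") -> dict:
    """Return row count, feature count, and the share of positive targets."""
    return {
        "n_rows": len(frame),
        "n_features": frame.shape[1] - 1,  # every column except the target
        "positive_rate": float(frame[target_col].mean()),  # valid for a 0/1 target
    }

# Toy stand-in for the real data (hypothetical values)
toy = pd.DataFrame({
    "age": [37, 54, 50, 52],
    "sysBP": [104.0, 168.0, 132.0, 119.5],
    "TenYearCHD": [0, 0, 0, 1],
})
summary = summarize_dataset(toy)
print(summary)
```

On the real data, `summarize_dataset(df)` would be called after `df` is loaded. The positive rate `p` matters when reading accuracy, because a model that always predicts the majority class already scores `max(p, 1 - p)`.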

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/vjavaly/Baruch-CIS-STA-3920/main/data/Framingham_4000.csv")

### Examine data

In [4]:
pd.set_option('display.max_rows', None)

In [5]:
print(df.head())
# info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()

   male  age  education  currentSmoker  cigsPerDay  BPMeds  prevalentStroke  \
0     0   37        4.0              0         0.0     0.0                0   
1     0   54        3.0              0         0.0     0.0                0   
2     0   50        2.0              0         0.0     1.0                0   
3     0   52        3.0              0         0.0     0.0                0   
4     1   45        3.0              1        30.0     0.0                0   

   prevalentHyp  diabetes  totChol  sysBP  diaBP    BMI  heartRate  glucose  \
0             0         0    169.0  104.0   66.0  20.84       70.0     72.0   
1             1         0    227.0  168.0   94.0  22.70       75.0     70.0   
2             1         0    241.0  132.0   85.0  23.81       55.0     84.0   
3             0         0    325.0  119.5   86.0  24.56       64.0      NaN   
4             1         0    233.0  147.0  101.0  24.32       75.0     99.0   

   TenYearCHD  
0           0  
1           0  
2 

In [6]:
print(df.isnull().sum())

male                 0
age                  0
education            0
currentSmoker        0
cigsPerDay           0
BPMeds               0
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             47
sysBP                0
diaBP                0
BMI                 18
heartRate            0
glucose            372
TenYearCHD           0
dtype: int64


### Prepare data for model training

In [7]:
df = df.dropna()
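
`dropna()` discards every row that has at least one missing value; the null counts above are concentrated in `glucose`, `totChol`, and `BMI`. One alternative, shown here only as a hedged sketch, is to fill the gaps with each column's median using scikit-learn's `SimpleImputer`. The toy frame and its values are made up for illustration, and whether imputation suits this project is a judgment call the notebook does not address.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps (hypothetical values, not rows from the real file)
toy = pd.DataFrame({
    "totChol": [169.0, np.nan, 241.0, 325.0],
    "glucose": [72.0, 70.0, 84.0, np.nan],
})

# Replace each missing entry with the median of its column
imputer = SimpleImputer(strategy="median")
toy_filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Total number of missing values remaining after imputation
print(toy_filled.isnull().sum().sum())
```

In a full workflow the imputer would be fit on the training rows only, so that test rows do not influence the fill values.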

In [8]:
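# Separate the predictors from the target column
features = df.drop('TenYearCHD', axis=1)
target = df['TenYearCHD']
# Standardize each feature to zero mean and unit variance.
# Note: the scaler is fit on all rows, before the train/test split,
# so test-set statistics influence the scaling.
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)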

In [9]:
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, random_state=42)
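
The cells above fit `StandardScaler` on all rows before splitting, so the test rows contribute to the scaling statistics. A common way to avoid this is to split first and place the scaler inside a pipeline, so that each cross-validation fold scales using only its own training portion. The sketch below illustrates the idea on synthetic data from `make_classification`; the variable names and the synthetic data are assumptions, and this is an alternative, not the notebook's original method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target (not the Framingham data)
X_demo, y_demo = make_classification(n_samples=300, n_features=15, random_state=42)

# Split the raw, unscaled data first; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is refit inside every CV fold, so held-out rows never influence it
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(
    pipe,
    {"logisticregression__C": [0.01, 0.1, 1, 10]},  # step name is the lowercased class name
    cv=5,
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)

print("Best parameters:", grid.best_params_)
print("Held-out accuracy:", grid.score(X_te, y_te))
```

The same pattern would apply to the real data by splitting `features` and `target` directly instead of `features_scaled`.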

In [10]:
# def train_evaluate_model(model, X_train, y_train, X_test, y_test):
#     model.fit(X_train, y_train)
#     predictions = model.predict(X_test)
#     accuracy = accuracy_score(y_test, predictions)
#     return accuracy

### Display first 100 rows of final dataframe used for model training

In [11]:
final_df_for_training = pd.DataFrame(features_scaled, columns=features.columns)
final_df_for_training['TenYearCHD'] = target.values
print("First 100 Rows of Final Dataframe for Model Training:")
print(final_df_for_training.head(100))

First 100 Rows of Final Dataframe for Model Training:
        male       age  education  currentSmoker  cigsPerDay    BPMeds  \
0  -0.893368 -1.469930   1.979642      -0.973460   -0.754345 -0.178187   
1  -0.893368  0.517078   1.001616      -0.973460   -0.754345 -0.178187   
2  -0.893368  0.049546   0.023590      -0.973460   -0.754345  5.612085   
3   1.119360 -0.534868   1.001616       1.027264    1.762853 -0.178187   
4  -0.893368 -0.534868   1.001616       1.027264   -0.083092 -0.178187   
5  -0.893368 -0.885516   0.023590       1.027264    0.084721 -0.178187   
6  -0.893368  0.750843   0.023590      -0.973460   -0.754345 -0.178187   
7  -0.893368  0.633960  -0.954436      -0.973460   -0.754345 -0.178187   
8   1.119360 -0.067336   1.001616      -0.973460   -0.754345 -0.178187   
9  -0.893368 -0.651750   0.023590       1.027264    0.084721 -0.178187   
10 -0.893368  0.400195   0.023590      -0.973460   -0.754345 -0.178187   
11 -0.893368 -0.301102   0.023590      -0.973460   -0.7543

### Train Classification model 1

In [12]:
# Tune the inverse regularization strength C with 5-fold cross-validation
lr_params = {'C': [0.001, 0.01, 0.1, 1, 10]}
lr_grid = GridSearchCV(LogisticRegression(max_iter=1000), lr_params, cv=5, scoring='accuracy')
lr_grid.fit(X_train, y_train)

### Evaluate Classification model 1 performance

In [13]:
print("Best parameters for Logistic Regression:", lr_grid.best_params_)
print("Cross-validated accuracy for Logistic Regression:", lr_grid.best_score_)

Best parameters for Logistic Regression: {'C': 0.1}
Cross-validated accuracy for Logistic Regression: 0.8537261698440208


### Train Classification model 2

In [14]:
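# Tune forest size and tree depth; random_state fixes the seed so reruns give the same result
rf_params = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20, 30]}
rf_grid = GridSearchCV(RandomForestClassifier(random_state=42), rf_params, cv=5, scoring='accuracy')
rf_grid.fit(X_train, y_train)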

### Evaluate Classification model 2 performance

In [15]:
print("Best parameters for Random Forest:", rf_grid.best_params_)
print("Cross-validated accuracy for Random Forest:", rf_grid.best_score_)

Best parameters for Random Forest: {'max_depth': None, 'n_estimators': 200}
Cross-validated accuracy for Random Forest: 0.8478336221837088


### Train Classification model 3

In [16]:
# Tune C and the kernel width gamma (SVC uses the RBF kernel by default)
svc_params = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]}
svc_grid = GridSearchCV(SVC(), svc_params, cv=5, scoring='accuracy')
svc_grid.fit(X_train, y_train)

### Evaluate Classification model 3 performance

In [17]:
print("Best parameters for SVC:", svc_grid.best_params_)
print("Cross-validated accuracy for SVC:", svc_grid.best_score_)

Best parameters for SVC: {'C': 0.1, 'gamma': 0.001}
Cross-validated accuracy for SVC: 0.8492201039861353
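
The scores printed above are cross-validated accuracies computed on the training split; `X_test` and `y_test` are never used. A held-out check could look like the sketch below. The helper name `holdout_accuracies`, the synthetic data, and the majority-class `DummyClassifier` baseline are illustrative additions rather than part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def holdout_accuracies(models: dict, X_test, y_test) -> dict:
    """Map each model name to its accuracy on the held-out test set."""
    return {
        name: accuracy_score(y_test, model.predict(X_test))
        for name, model in models.items()
    }

# Synthetic, imbalanced stand-in (roughly 85% negatives); not the Framingham data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=15, weights=[0.85], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

fitted = {
    # Always predicts the most frequent training class
    "Majority-class baseline": DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr),
    "Logistic Regression": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
}
results = holdout_accuracies(fitted, X_te, y_te)
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

In the notebook, the fitted searches could be passed directly, for example `holdout_accuracies({"Logistic Regression": lr_grid, "Random Forest": rf_grid, "SVC": svc_grid}, X_test, y_test)`, since `GridSearchCV` refits the best estimator by default and exposes `predict`. Comparing against the majority-class baseline shows how much of each accuracy comes from class imbalance alone.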
