# Modeling

---



**Objective:**

Select and apply various modeling techniques, and calibrate the models to optimize their performance.

**Tasks:**

1. Select modeling techniques.
2. Generate test design.
3. Build the model.
4. Assess the model.

We'll start by selecting three commonly used modeling techniques for binary classification:

1. Logistic Regression
2. Random Forest Classifier
3. Gradient Boosted Trees (e.g., XGBoost)

In [1]:
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
from google.colab import files

uploaded = files.upload()

Saving heart.csv to heart.csv


In [2]:
import pandas as pd
data = pd.read_csv('heart.csv')

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Splitting the data into features and target variable
X = data.drop(columns=['output'])
y = data['output']

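# Feature scaling (standardization to zero mean and unit variance)
# Note: the scaler is fit on all rows before the split, so test-set statistics leak into training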
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Splitting the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape

((242, 13), (61, 13))
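
One caveat about the cell above: `StandardScaler` is fit on all 303 rows before the split, so the test rows contribute to the mean and variance used for scaling. A minimal sketch of a leak-free variant follows. It assumes scikit-learn's `Pipeline` and a synthetic stand-in dataset from `make_classification` (not the heart data); the names `X_demo`, `y_demo`, `pipe`, and the `stratify` argument are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for heart.csv (assumption): 303 rows, 13 features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# Split first, so the test rows never influence the scaler;
# stratify keeps the class balance similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when transforming the test rows
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape, round(test_accuracy, 3))
```

Wrapping the scaler and the classifier in one object also means any later cross-validation refits the scaler inside each fold, which keeps the folds independent.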

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train the Logistic Regression model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

# Predictions on the training set
y_train_pred = log_reg.predict(X_train)

# Evaluate the model's performance on the training data
accuracy = accuracy_score(y_train, y_train_pred)
class_report = classification_report(y_train, y_train_pred)

accuracy, class_report


(0.8636363636363636,
 '              precision    recall  f1-score   support\n\n           0       0.89      0.80      0.84       109\n           1       0.85      0.92      0.88       133\n\n    accuracy                           0.86       242\n   macro avg       0.87      0.86      0.86       242\nweighted avg       0.87      0.86      0.86       242\n')

The Logistic Regression model's performance on the training data is as follows:

* Accuracy: Approximately 86.36%
* Precision (Class 0): 0.89
* Recall (Class 0): 0.80
* F1-Score (Class 0): 0.84
* Precision (Class 1): 0.85
* Recall (Class 1): 0.92
* F1-Score (Class 1): 0.88

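On the training data the model scores reasonably well, with precision, recall, and F1-scores between 0.80 and 0.92 for both classes. These are training-set figures, so they do not yet show how the model generalizes to the held-out test set.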
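
To make the reported metrics concrete, the sketch below recomputes precision, recall, and F1 for class 1 by hand on a tiny made-up label vector (not the heart data) and compares the result with scikit-learn's `precision_recall_fscore_support`. The arrays `y_true_demo` and `y_pred_demo` are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels and predictions, chosen only for illustration
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

# Counts for the positive class (label 1)
tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))  # true positives
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))  # false positives
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The same three numbers from scikit-learn
sk_p, sk_r, sk_f, _ = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, pos_label=1, average="binary"
)
print(tp, fp, fn)
```

The per-class rows of `classification_report` apply the same formulas, treating each class in turn as the positive one.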

Next, we'll train a Random Forest Classifier and a Gradient Boosted Trees model (using XGBoost) and evaluate their performances on the training data.

In [6]:
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

# Initialize and train the Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
y_train_rf = rf_model.predict(X_train)
rf_accuracy = accuracy_score(y_train, y_train_rf)

# Initialize and train the XGBoost model
xgb_model = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
xgb_model.fit(X_train, y_train)
y_train_xgb = xgb_model.predict(X_train)
xgb_accuracy = accuracy_score(y_train, y_train_xgb)

rf_accuracy, xgb_accuracy


(1.0, 1.0)
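
A training accuracy of 1.0 for both ensembles suggests the trees have memorized the training rows, so these scores say little about generalization. A fairer comparison can be sketched as follows, under two stated assumptions: a synthetic stand-in dataset replaces the heart data, and scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost to keep the example free of extra dependencies. The dictionary names `candidates` and `results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the heart data (assumption), with some label noise
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in candidates.items():
    # 5-fold cross-validation on the training split only
    cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy")
    model.fit(X_tr, y_tr)
    results[name] = {
        "train": model.score(X_tr, y_tr),   # accuracy on rows the model has seen
        "cv_mean": cv_scores.mean(),        # average accuracy on held-out folds
        "test": model.score(X_te, y_te),    # accuracy on the untouched test split
    }

for name, scores in results.items():
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

A large gap between `train` and `cv_mean` or `test` is the usual sign of overfitting; the cross-validated score is the more defensible basis for choosing between models.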

In [8]:
# Write the loaded (unscaled) DataFrame to /prep.csv without the index column
data.to_csv('/prep.csv', encoding='utf-8', index=False)