# 5 Modeling - "Of Genomes And Genetics"

#### Objective: Apply statistical and machine learning models to identify and validate the correlation between genetic markers and health outcomes.

## Imports

In [1]:
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV

## Load The Data

In [2]:
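X_train = np.load('data/X_train.npy')
X_test = np.load('data/X_test.npy')
# Labels are Python strings stored in an object array, so loading them needs allow_pickle=True.
# Only enable this for files from a trusted source.
y_train = np.load('data/y_train.npy', allow_pickle=True)
y_test = np.load('data/y_test.npy', allow_pickle=True)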

In [3]:
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
print("Data type of y_train:", type(y_train[0]))

Shape of X_train: (17666, 33)
Shape of X_test: (4417, 33)
Shape of y_train: (17666,)
Shape of y_test: (4417,)
Data type of y_train: <class 'str'>


## Train a Baseline Model

In [4]:
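# Baseline: random forest with default hyperparameters; random_state fixes the seed for reproducibility.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)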

In [5]:
y_pred = model.predict(X_test)
print("Model Performance Metrics:")
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Model Performance Metrics:
                                              precision    recall  f1-score   support

 Mitochondrial genetic inheritance disorders       0.56      0.87      0.68      2485
Multifactorial genetic inheritance disorders       0.29      0.00      0.01       433
            Single-gene inheritance diseases       0.33      0.12      0.18      1499

                                    accuracy                           0.53      4417
                                   macro avg       0.39      0.33      0.29      4417
                                weighted avg       0.46      0.53      0.45      4417

Accuracy: 0.5331673081276884
Confusion Matrix:
 [[2167    3  315]
 [ 372    2   59]
 [1311    2  186]]
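
The per-class recall and overall accuracy in the report can be recomputed directly from the confusion matrix, which makes the class imbalance easier to see. The sketch below hardcodes the matrix printed above (rows are true classes, columns are predicted classes, in the same order as the report). The variable names and shortened labels are illustrative and not part of the notebook.

```python
# Confusion matrix copied from the output above.
# Rows = true class, columns = predicted class.
labels = ["Mitochondrial", "Multifactorial", "Single-gene"]
cm = [
    [2167, 3, 315],
    [372, 2, 59],
    [1311, 2, 186],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

# Recall for class i = correctly predicted i / all samples whose true class is i (row sum).
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

# Accuracy of a trivial model that always predicts the most common true class.
majority_baseline = max(sum(row) for row in cm) / total

for name, r in zip(labels, recall):
    print(f"{name:15s} recall = {r:.3f}")
print(f"accuracy          = {accuracy:.3f}")
print(f"majority baseline = {majority_baseline:.3f}")
```

On these numbers the forest's accuracy (about 0.533) sits below the share of the majority class in the test set (about 0.563), so it does not beat always predicting the mitochondrial class. Accuracy alone is therefore a weak target for this dataset.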


In [6]:
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Cross-validated Accuracy: {scores.mean():.2f} ± {scores.std():.2f}")

Cross-validated Accuracy: 0.53 ± 0.00
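
Because the classes are imbalanced, cross-validated accuracy can look stable while the minority classes are almost never predicted. One possible variation, shown here as a sketch rather than as the notebook's method, is to use stratified folds, class weighting, and balanced accuracy as the score. The data below is synthetic (`X_demo`, `y_demo` from `make_classification` are assumptions standing in for `X_train` and `y_train`), so the printed value says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X_train / y_train (assumption: three imbalanced classes).
X_demo, y_demo = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=5,
    n_classes=3,
    weights=[0.56, 0.10, 0.34],
    random_state=42,
)

# class_weight="balanced" weights each class inversely to its frequency.
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Balanced accuracy is the mean of per-class recall, so a model that
# ignores the minority classes no longer scores well.
balanced_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="balanced_accuracy")
print(f"Balanced accuracy: {balanced_scores.mean():.2f} ± {balanced_scores.std():.2f}")
```

With three classes, a model that always predicts one class gets a balanced accuracy of about 0.33, which gives a clearer floor to compare against than raw accuracy.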


## Improving Accuracy - Hyperparameter Tuning Using Grid Search

In [7]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

In [8]:
grid_search = GridSearchCV(
    estimator=model, param_grid=param_grid, cv=5,
    n_jobs=-1, verbose=2, scoring='accuracy',
)
start_time = time.time()
grid_search.fit(X_train, y_train)
end_time = time.time()
print(f"Grid search took {end_time - start_time:.2f} seconds.")
print("Best parameters:", grid_search.best_params_)
print("Best cross-validated accuracy:", grid_search.best_score_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Grid search took 257.50 seconds.
Best parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 300}
Best cross-validated accuracy: 0.5579644490278498
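
The candidate and fit counts reported by the grid search follow from the Cartesian product of the parameter lists. The sketch below enumerates the same grid with `itertools.product` to show where 108 and 540 come from; `cv_folds`, `candidates`, `n_candidates`, and `n_fits` are illustrative names.

```python
from itertools import product

# Same grid as in the notebook cell above.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Adding one more value to any list multiplies the count, so the grid grows quickly. With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training set, which is not included in the 540.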
