# TCP Classification - Model Training and Evaluation

This notebook assumes the CapstoneProject_eda notebook has already been run to generate the cleaned and labeled dataset (`data/cleaned_data.csv`).

## 1. Data Loading and Prep

### Load and Split Data

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load dataset
data = pd.read_csv('data/cleaned_data.csv')

# Split into features (X) and target (y)
X = data.drop('label', axis=1)
y = data['label']

# Train-test split (60/40), stratified so both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

### Data Scaling

In [2]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
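
Fitting the scaler once on the whole training split means that, in any later cross-validation on `X_train`, each validation fold has already contributed to the scaling statistics. One way to avoid this is to put the scaler inside a pipeline so it is refit on every fold. The following is a minimal sketch of that idea, not the notebook's actual workflow: `X_demo`, `y_demo`, and `scaled_lr` are hypothetical names, and the data is a synthetic stand-in generated with `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features and labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=42
)

# The scaler is part of the estimator, so each CV fold fits it
# only on that fold's training portion.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=250, class_weight='balanced', random_state=42),
)

demo_scores = cross_val_score(scaled_lr, X_demo, y_demo, cv=3, scoring='f1')
print(f'Mean F1 across folds: {demo_scores.mean():.3f}')
```

With a pipeline like this, the unscaled features can be passed directly to `cross_val_score` or `GridSearchCV`.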

## 2. Modeling

### Logistic Regression

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

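logistic_model = LogisticRegression(random_state=42, max_iter=250, solver='saga', n_jobs=-1, class_weight='balanced')
# Cross-validate on the scaled training split: saga converges slowly on unscaled
# features, and the held-out test set should stay out of model selection
logistic_cv_scores = cross_val_score(logistic_model, X_train, y_train, cv=3, scoring='f1')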

### Random Forest

In [9]:
from sklearn.ensemble import RandomForestClassifier

# Initialize random forest
random_forest_model = RandomForestClassifier(random_state=42, n_estimators=50, n_jobs=-1)

# Cross-validate on the training split only, keeping the test set held out
rf_cv_scores = cross_val_score(random_forest_model, X_train, y_train, cv=5, scoring='f1')

### SVM

In [14]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

svm_model = LinearSVC(random_state=42, max_iter=1000)

# Cross-validate on the scaled training split; LinearSVC is sensitive to feature scale
svm_cv_scores = cross_val_score(svm_model, X_train, y_train, cv=3, scoring='f1', n_jobs=-1)

### Results

In [None]:
print(f'Logistic Regression Mean F1-Score: {logistic_cv_scores.mean()}')
print(f'Random Forest Mean F1-Score: {rf_cv_scores.mean()}')
print(f'SVM Mean F1-Score: {svm_cv_scores.mean()}')

## 3. Hyperparameter Tuning

### Logistic Regression Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid for Logistic Regression
param_grid_lr = {
    'solver': ['newton-cg', 'lbfgs', 'liblinear'],
    'C': [0.1, 1, 10],
    'max_iter': [100, 200, 300]
}

# Initialize Grid Search
grid_search_lr = GridSearchCV(LogisticRegression(random_state=42), param_grid_lr, cv=5, scoring='f1', n_jobs=-1)
grid_search_lr.fit(X_train, y_train)

# Best parameters and score
print(f"Best parameters for Logistic Regression: {grid_search_lr.best_params_}")
print(f"Best F1-Score for Logistic Regression: {grid_search_lr.best_score_}")

### Random Forest Grid Search

In [None]:
# Define parameter grid for Random Forest
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize Grid Search
grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5, scoring='f1', n_jobs=-1)
grid_search_rf.fit(X_train, y_train)

# Best parameters and score
print(f"Best parameters for Random Forest: {grid_search_rf.best_params_}")
print(f"Best F1-Score for Random Forest: {grid_search_rf.best_score_}")

### SVM Grid Search

In [None]:
from sklearn.svm import SVC

# Define parameter grid for SVM
param_grid_svm = {
    'kernel': ['linear', 'rbf', 'poly'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto']
}

# Initialize Grid Search
grid_search_svm = GridSearchCV(SVC(random_state=42), param_grid_svm, cv=5, scoring='f1', n_jobs=-1)
grid_search_svm.fit(X_train, y_train)

# Best parameters and score
print(f"Best parameters for SVM: {grid_search_svm.best_params_}")
print(f"Best F1-Score for SVM: {grid_search_svm.best_score_}")

## 4. Interpretation of Model's Results

### Interpretation of Logistic Regression Results
The Logistic Regression model was tuned with Grid Search using 5-fold cross-validation on the training split, with the F1-score (the harmonic mean of precision and recall) as the selection metric. The best parameters found were [Best parameters here]. The mean cross-validated F1-score achieved with these parameters was [Best F1-Score here]. An F1-score close to 1 would indicate that the model achieves both high precision and high recall on the positive class; a low score means at least one of the two is poor.

### Interpretation of Random Forest Results
The Random Forest model was tuned with Grid Search using 5-fold cross-validation on the training split, with the F1-score as the selection metric. The best parameters found were [Best parameters here]. The mean cross-validated F1-score achieved with these parameters was [Best F1-Score here]. Whether the Random Forest handles the class imbalance well should be judged from this score and from its precision and recall on the held-out test set, since this grid does not apply class weighting.

### Interpretation of SVM Results
The SVM model was tuned with Grid Search using 5-fold cross-validation on the training split, with the F1-score as the selection metric. The best parameters found were [Best parameters here]. The mean cross-validated F1-score achieved with these parameters was [Best F1-Score here]. Comparing this score with the Logistic Regression and Random Forest results shows whether a kernel SVM with tuned hyperparameters gives a better balance between precision and recall on this data.
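
The values left as placeholders above can be read directly from a fitted `GridSearchCV` object. The following is a minimal sketch of how that might look, assuming a small synthetic dataset and a deliberately tiny grid; `demo_search`, `demo_grid`, `X_tr`, and `X_te` are hypothetical names, not objects from this notebook. It also scores the refit best model on a held-out split, which the cross-validated score alone does not cover.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, stratify=y_demo, random_state=42
)

# Tiny illustrative grid so the sketch runs quickly.
demo_grid = {'n_estimators': [10, 20], 'max_depth': [None, 5]}
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=42), demo_grid, cv=3, scoring='f1'
)
demo_search.fit(X_tr, y_tr)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(summary)

# GridSearchCV refits the best model on all of X_tr by default,
# so it can be evaluated on the held-out split directly.
y_te_pred = demo_search.predict(X_te)
test_f1 = f1_score(y_te, y_te_pred)
print(f'Held-out F1: {test_f1:.3f}')
print(classification_report(y_te, y_te_pred))
```

The `std_test_score` column is worth reporting alongside the mean, since a small gap between two configurations may be within fold-to-fold variation.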

## 5. Explanation and Rationale for the Evaluation Metric

### Evaluation Metric: F1-Score
The F1-score was chosen as the evaluation metric for this classification task. The F1-score is the harmonic mean of precision and recall, providing a single metric that penalizes both false positives and false negatives. This is particularly important for imbalanced datasets where one class is much more frequent than the other. Unlike accuracy, F1 does not reward true negatives, so when the positive class is the rarer one a model cannot score well by always predicting the majority class. Note that `scoring='f1'` reports the F1-score of the positive class only, so it measures how well that class is identified rather than performance on both classes.
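
As a small worked example of the definition above, the sketch below computes precision, recall, and F1 on ten hypothetical labels (not data from this project) and contrasts accuracy with F1 for a classifier that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 8 negatives (0) and 2 positives (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Predictions with one false positive and one missed positive.
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 1 / 2
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 1 / 2
f1_manual = 2 * precision * recall / (precision + recall)
print(f'precision={precision}, recall={recall}, F1={f1_manual}')

# A classifier that always predicts the majority class (0).
y_majority = [0] * len(y_true)
acc_majority = accuracy_score(y_true, y_majority)            # 8 / 10
f1_majority = f1_score(y_true, y_majority, zero_division=0)  # no true positives
print(f'majority-class accuracy={acc_majority}, F1={f1_majority}')
```

The majority-class predictor reaches 80% accuracy while its F1 is zero, which is the behavior that motivates using F1 here.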