---

# Machine Learning Model Starter-Kit

In this notebook, we'll explore two approaches to machine learning with scikit-learn:

1. A single logistic regression model with an adjustable decision threshold
2. A multiple-model comparison that uses one shared evaluation function
---



## **1. Logistic Regression**
### (You can also adjust the `threshold`)

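Logistic regression is a statistical method for modelling a binary outcome from one or more predictor variables. The fitted model outputs a probability for the positive class, and a sample is labelled positive when that probability exceeds the `threshold` (0.7 in the code below, rather than the 0.5 cut-off that `model.predict` applies by default).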
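
As a rough illustration of how the model turns a linear score into a probability, the sketch below fits `LogisticRegression` on synthetic data and checks that `predict_proba` agrees with the sigmoid of `decision_function`. The synthetic dataset from `make_classification` and the `*_demo` names are assumptions made only for this example; they are not part of the starter kit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# Linear score (weights times features, plus intercept) for each sample
scores = demo_model.decision_function(X_demo)

# The sigmoid maps each score to a probability of the positive class
sigmoid_probs = 1.0 / (1.0 + np.exp(-scores))
model_probs = demo_model.predict_proba(X_demo)[:, 1]

# True when both routes give the same probabilities
print(np.allclose(sigmoid_probs, model_probs))
```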

### You need to adjust these to your data:
```
X = df[["your_features_go_here"]]
y = df["your_label_goes_here"]
```

---

In [None]:
# Import necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load your data (assuming it's in a CSV file).
# Uncomment the next line and point it at your file; otherwise `df` is undefined below.
# df = pd.read_csv('your_data_file.csv')

# Define features and labels
X = df[["your_features_go_here"]]
y = df["your_label_goes_here"]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define your model
model = LogisticRegression()

# Fit the model with training data
model.fit(X_train, y_train)

# Get the probabilities of the positive class
probabilities = model.predict_proba(X_test)[:, 1]

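# Apply the threshold: label a sample positive (1) only when its
# predicted probability exceeds it (raise or lower this to suit your data)
threshold = 0.7

y_pred = [1 if prob > threshold else 0 for prob in probabilities]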

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"Accuracy of Logistic Regression with threshold {threshold}: {accuracy:.2f}")
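
---

To get a feel for what adjusting the `threshold` does, the sketch below sweeps a few values and reports how many samples are predicted positive, along with precision and recall. It runs on synthetic data from `make_classification`, and the three threshold values are arbitrary choices for illustration. Raising the threshold can never increase the number of predicted positives; how precision and recall move depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_model = LogisticRegression().fit(X_tr, y_tr)
demo_probs = demo_model.predict_proba(X_te)[:, 1]

positives_by_threshold = {}
for t in (0.3, 0.5, 0.7):
    # Vectorised equivalent of the list comprehension used above
    preds = (demo_probs > t).astype(int)
    positives_by_threshold[t] = int(preds.sum())

    # zero_division=0 avoids a warning if a threshold yields no positives
    precision = precision_score(y_te, preds, zero_division=0)
    recall = recall_score(y_te, preds, zero_division=0)
    print(f"threshold={t}: positives={positives_by_threshold[t]}, "
          f"precision={precision:.2f}, recall={recall:.2f}")
```

Accuracy alone can hide this trade-off, so it may be worth checking precision and recall whenever the threshold is moved away from 0.5.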


---

## **2. Multiple Models**

### Why Use a Single Function for Multiple Models?

- **Consistency**: Ensures every model is trained and evaluated in the same manner.
- **Efficiency**: Reduces repetitive code and makes the codebase cleaner.
- **Flexibility**: Easily add or remove models or change evaluation metrics.
- **Error Reduction**: Minimizes chances of errors from repetitive code.

### **You need to adjust these to your data:**
```
X = df[["your_features_go_here"]]
y = df["your_label_goes_here"]
```

```
models = {

    "Decision Tree": DecisionTreeClassifier(),

    "Random Forest": RandomForestClassifier(),

    "Gradient Boosting": GradientBoostingClassifier(),

    }
```
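
The code cell below imports `LabelEncoder` and calls `recall_score`, `precision_score` and `f1_score` with their defaults, which treat the label `1` as the positive class. If your label column holds strings, one option is to encode it first. The sketch below uses a hypothetical `y_raw` series standing in for `df["your_label_goes_here"]`; the values `"yes"` and `"no"` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, standing in for df["your_label_goes_here"]
y_raw = pd.Series(["no", "yes", "yes", "no", "yes"])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_raw)

# Classes are sorted alphabetically, so "no" becomes 0 and "yes" becomes 1
print(encoder.classes_.tolist())  # ['no', 'yes']
print(y_encoded.tolist())         # [0, 1, 1, 0, 1]
```

Because the encoder sorts class names, check `encoder.classes_` to confirm that the class you care about actually ends up as `1`.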

---


In [None]:
# Import necessary libraries
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Define features and labels (`df` must already be loaded, e.g. with pd.read_csv)
X = df[["your_features_go_here"]]
y = df["your_label_goes_here"]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define models and experiment with hyperparameters (HP).
# random_state is set on every model so that repeated runs are comparable.
models = {
    # Decision Trees:
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Decision Tree HP": DecisionTreeClassifier(max_depth=5, min_samples_split=10, random_state=42),

    # Random Forests:
    "Random Forest": RandomForestClassifier(random_state=42),
    "Random Forest HP": RandomForestClassifier(max_depth=5, min_samples_split=10, random_state=42),

    # Gradient Boosting:
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Gradient Boosting HP": GradientBoostingClassifier(max_depth=5, min_samples_split=10, random_state=42),
}


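def evaluate_model(model, X_train, y_train, X_test, y_test, X, y):
    """Fit `model` on the training split and return rounded metrics.

    Returns (test_acc, train_acc, overall_acc, test_recall, test_precision, test_f1).
    Recall, precision and F1 use scikit-learn's binary defaults (positive label = 1).
    """

    # Train the model
    model.fit(X_train, y_train.values.ravel())

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    y_overall_pred = model.predict(X)

    # Calculate performance metrics (scikit-learn's convention is y_true first, then y_pred)
    train_acc = round(accuracy_score(y_train, y_train_pred), 4)
    test_acc = round(accuracy_score(y_test, y_test_pred), 4)
    # "Overall" covers the full dataset, including the rows the model was trained on
    overall_acc = round(accuracy_score(y, y_overall_pred), 4)
    test_recall = round(recall_score(y_test, y_test_pred), 4)
    test_precision = round(precision_score(y_test, y_test_pred), 4)
    test_f1 = round(f1_score(y_test, y_test_pred), 4)

    return test_acc, train_acc, overall_acc, test_recall, test_precision, test_f1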

# Evaluate each model and store results
results = []
for name, model in models.items():
    test_acc, train_acc, overall_acc, test_recall, test_precision, test_f1 = evaluate_model(model, X_train, y_train, X_test, y_test, X, y)
    results.append([name, test_acc, train_acc, overall_acc, test_recall, test_precision, test_f1])

# Convert results to a DataFrame and display every row
# (head() would show only the first five of the six models)
results_df = pd.DataFrame(
    results,
    columns=['Model', 'Test Accuracy', 'Train Accuracy', 'Overall Accuracy',
             'Test Recall', 'Test Precision', 'Test F1 Score'],
)
results_df
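
---

One way to read a table like the one above is to compare train and test accuracy: a large gap often points to overfitting. The sketch below computes that gap for two decision trees on synthetic data. The `Gap` column, the `demo_models` dictionary and the dataset from `make_classification` are assumptions made for illustration and are not part of the starter kit's output.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Decision Tree HP": DecisionTreeClassifier(max_depth=5, min_samples_split=10, random_state=42),
}

rows = []
for name, clf in demo_models.items():
    clf.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, clf.predict(X_tr))
    test_acc = accuracy_score(y_te, clf.predict(X_te))
    # Gap = how much better the model does on data it has already seen
    rows.append([name, train_acc, test_acc, train_acc - test_acc])

gap_df = pd.DataFrame(rows, columns=["Model", "Train Accuracy", "Test Accuracy", "Gap"])
print(gap_df.sort_values("Gap", ascending=False).round(4))
```

An unconstrained tree can usually memorise its training split, which is one reason the depth-limited "HP" variants are worth comparing against the defaults.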
