##### Scenario 1: Fraud Detection in a Bank #####

You are a Junior Data Scientist at a financial institution. The bank is seeing an increase in credit card fraud, and fraudulent transactions must be detected in real time. Traditional machine learning models, such as logistic regression or a single decision tree, struggle to detect fraud because:

- Fraud cases are rare (high class imbalance).
- Fraud patterns evolve over time, making it difficult for static models to generalize.
- A single model may not catch all fraudulent cases, leading to false negatives.

To address this, your manager assigns you to build a predictive model using supervised learning. Given the challenges of detecting fraud, you decide to use an Adaptive Boosting (AdaBoost) model.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# Load dataset (Assuming a CSV file with fraud data)
df = pd.read_csv("transactions.csv")

# Define features (X) and target variable (y)
X = df.drop(columns=['Fraud'])  # Excluding the target variable
y = df['Fraud']  # 1 = Fraudulent, 0 = Legitimate

In [None]:
# Split data into training (80%) and testing (20%) sets
# stratify=y keeps the rare fraud class at the same proportion in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize base estimator (Weak Learner - Decision Tree with max depth 1)
base_model = DecisionTreeClassifier(max_depth=1)

In [None]:
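# Initialize and train AdaBoost model on top of the decision stump
# (scikit-learn >= 1.2 names this argument `estimator`; older releases use `base_estimator`)
# n_estimators=50 is scikit-learn's default and is only a starting point for tuning
ada_model = AdaBoostClassifier(estimator=base_model, n_estimators=50, random_state=42)
ada_model.fit(X_train, y_train)

# Make predictions
y_pred = ada_model.predict(X_test)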

In [None]:
# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Display classification report
print(classification_report(y_test, y_pred))
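
With fraud this rare, accuracy alone can look strong even for a model that catches nothing. The sketch below uses made-up labels (1,000 transactions, 20 of them fraudulent; both numbers are assumptions for illustration, not taken from the bank's data) to show why recall on the fraud class is worth reading alongside accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1,000 transactions, 20 of them fraudulent (2%)
y_true = np.array([1] * 20 + [0] * 980)

# A useless "model" that labels every transaction as legitimate
y_always_legit = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_legit)
baseline_recall = recall_score(y_true, y_always_legit)

print(f"Accuracy: {baseline_accuracy:.2f}")    # 0.98
print(f"Fraud recall: {baseline_recall:.2f}")  # 0.00
```

The same `recall_score` call can be applied to `y_test` and `y_pred` from the notebook; the per-class recall in the classification report carries the same information.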

In [None]:
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Legitimate', 'Fraud'], yticklabels=['Legitimate', 'Fraud'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - AdaBoost for Fraud Detection')
plt.show()

**Why AdaBoost is Appropriate Here**

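- Improves Weak Learners: decision stumps are weak learners on their own; AdaBoost combines many of them, with each round focusing on the mistakes of the previous rounds
- Helps with Class Imbalance: fraudulent transactions are rare and easy to misclassify; AdaBoost increases the weight of misclassified samples each round, so missed fraud cases get more attention (it does not rebalance the classes by itself)
- Better Accuracy: boosting can capture subtle patterns that a single decision tree would miss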
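
The re-weighting idea behind AdaBoost can be illustrated with a single boosting round done by hand. This is a simplified sketch of discrete AdaBoost on a tiny made-up dataset (`X_toy` and `y_toy` are hypothetical), not a reproduction of scikit-learn's internals.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): one feature, ten samples, not separable by one split
X_toy = np.arange(1, 11).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1, 0, 1])

# Round 1 starts with equal weight on every sample
weights = np.full(len(y_toy), 1 / len(y_toy))

# Fit a decision stump using the current sample weights
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_toy, y_toy, sample_weight=weights)
missed = stump.predict(X_toy) != y_toy

# Weighted error of the stump and its vote strength (alpha)
error = np.sum(weights * missed) / np.sum(weights)
alpha = np.log((1 - error) / error)

# Up-weight the missed samples, then renormalize so the weights sum to 1
new_weights = weights * np.exp(alpha * missed)
new_weights /= new_weights.sum()

print("Missed samples:", np.flatnonzero(missed))
print("Updated weights:", np.round(new_weights, 4))
```

After the update, the samples the stump got wrong carry more weight than the ones it got right, so the next stump is pushed toward fixing them.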

##### Scenario 2: Predicting Customer Churn Using Gradient Boosting #####

You are a Junior Data Scientist at a telecom company. The company wants to predict customer churn, that is, whether a customer will cancel their subscription. A standard decision tree is not accurate enough, and the company needs a model that minimizes churn prediction errors so it can target at-risk customers.

To address the challenges in predicting churn, which is often imbalanced because most customers stay, you decide to use Gradient Boosting.

In [None]:
# Import necessary libraries
from sklearn.ensemble import GradientBoostingClassifier

# Load dataset (Assuming a CSV file with customer data)
df = pd.read_csv("customer_churn.csv")

In [None]:
# Define features (X) and target variable (y)
X = df.drop(columns=['Churn'])  # Excluding the target variable
y = df['Churn']  # 1 = Churn, 0 = Stay

# Split data into training (80%) and testing (20%) sets
# stratify=y keeps the churn rate the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


In [None]:
# Initialize and train Gradient Boosting model
gb_model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3, random_state=42)
gb_model.fit(X_train, y_train)

# Make predictions
y_pred = gb_model.predict(X_test)
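
`GradientBoostingClassifier` has no `class_weight` parameter, so one possible way to account for imbalance is to pass per-sample weights to `fit`. The sketch below assumes synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins for the churn dataset, with roughly 10% positives) and scikit-learn's `compute_sample_weight` helper; whether weighting actually helps should be checked on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, imbalanced stand-in for the churn data (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# "balanced" gives each class the same total weight,
# so every minority-class sample counts for more
sample_weight = compute_sample_weight(class_weight="balanced", y=y_demo)

# Train with the per-sample weights
weighted_model = GradientBoostingClassifier(n_estimators=50, random_state=42)
weighted_model.fit(X_demo, y_demo, sample_weight=sample_weight)

print("Weight per majority sample:", sample_weight[y_demo == 0][0])
print("Weight per minority sample:", sample_weight[y_demo == 1][0])
```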

In [None]:
# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Display classification report
print(classification_report(y_test, y_pred))

In [None]:
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Stayed', 'Churned'], yticklabels=['Stayed', 'Churned'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Gradient Boosting for Customer Churn')
plt.show()

**Why Gradient Boosting is Appropriate Here**

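- Minimizes Errors Effectively: each new tree is fit to the errors (the negative gradient of the loss) left by the trees before it, so the ensemble reduces the loss step by step
- Captures Complex Customer Behavior: recognizes patterns such as long-term engagement, billing issues, and usage drop-offs, including interactions between them
- Scalable: handles moderately large tabular datasets well; for very large datasets, scikit-learn's histogram-based `HistGradientBoostingClassifier` is typically faster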
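
How gradient boosting learns from earlier errors can be shown with a hand-rolled loop. This is a minimal sketch on synthetic data, assuming log loss and plain gradient steps; scikit-learn's implementation additionally optimizes the leaf values, so this is an illustration of the idea and not the library's exact algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.tree import DecisionTreeRegressor

# Synthetic binary classification data (hypothetical stand-in for churn data)
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Start from a constant prediction: the log-odds of the positive class
p0 = y_syn.mean()
raw_score = np.full(len(y_syn), np.log(p0 / (1 - p0)))
initial_loss = log_loss(y_syn, sigmoid(raw_score))

learning_rate = 0.5
for _ in range(20):
    # Residuals are the negative gradient of log loss w.r.t. the raw score
    residuals = y_syn - sigmoid(raw_score)
    # Fit a small regression tree to the residuals
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X_syn, residuals)
    # Move the raw score a small step in the direction the tree suggests
    raw_score += learning_rate * tree.predict(X_syn)

final_loss = log_loss(y_syn, sigmoid(raw_score))
print(f"Training log loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

Each round shrinks the training loss a little; `learning_rate` and the number of rounds trade off against each other, which is why the notebook pairs a small `learning_rate=0.05` with `n_estimators=200`.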