<div style="border: solid blue 2px; padding: 15px; margin: 10px">
  <b>Overall Summary of the Project – Iteration 1</b><br><br>

  Hi Jimmy, I’m <b>Victor Camargo</b> (<a href="https://hub.tripleten.com/u/e9cc9c11" target="_blank">TripleTen Hub profile</a>). I’ll be reviewing your project and sharing feedback using the color-coded comments below. Thanks for submitting your work!<br><br>

  <b>Nice work on:</b><br>
  ✔️ Carefully exploring and cleaning missing data in <code>Tenure</code><br>
  ✔️ Correctly applying preprocessing with OHE and dropping irrelevant identifiers<br>
  ✔️ Testing multiple imbalance-handling strategies and achieving the required F1 threshold<br>
  ✔️ Adding confusion matrix analysis and clear interpretation of F1 vs. AUC-ROC<br><br>

  Your project meets the requirements and is <b>approved</b> 🎉. Just a small suggestion to polish it further:<br>
  🟡 Double-check the AUC-ROC calculation — you computed it using Logistic Regression (<code>model</code>), but since RandomForest was your best model, it would be ideal to calculate and report AUC-ROC for that classifier to keep results consistent.<br><br>

  <hr>

  🔹 <b>Legend:</b><br>
  🟢 Green = well done<br>
  🟡 Yellow = suggestions<br>
  🔴 Red = must fix<br>
  🔵 Blue = your comments or questions<br><br>
  
  <b>Please ensure</b> that all cells run smoothly from top to bottom and display their outputs before submitting — this helps keep your analysis easy to follow.  
  <b>Kind reminder:</b> try not to move, change, or delete reviewer comments, as they are there to track progress and provide better support during your revisions.<br><br>

  <b>Feel free to reach out in the Questions channel if you need help.</b><br>
</div>


In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
import numpy as np

In [2]:
df = pd.read_csv('/datasets/Churn.csv')

In [3]:
display(df.head())

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


<div style='font-size:18px'>
<b>The <code>Tenure</code> column has missing data: only 9091 of its 10000 entries are non-null.</b> </div>

In [5]:
#See how many NaNs we have on the Tenure column. 
print(df['Tenure'].value_counts(dropna=False, ascending=False))

1.0     952
2.0     950
8.0     933
3.0     928
5.0     927
7.0     925
NaN     909
4.0     885
9.0     882
6.0     881
10.0    446
0.0     382
Name: Tenure, dtype: int64


<div style='font-size:18px'> <b> There are 909 missing values in the Tenure column, which could affect our model's scores.
Two possible reasons why Tenure is missing:
    1. They are brand-new customers who haven't been with the bank for a full year yet.
    2. The value was simply not recorded.
</b></div>

<div style='font-size:18px'> <b> The counts are fairly even for tenures of 1 through 9 years (881 to 952 customers each) and drop to 446 at 10 years and 382 at 0 years. The rows with a missing Tenure could be first-year customers, some of whom have already left (Exited = 1) and some of whom are still with the bank. </b></div>

In [6]:
# Filter tenure to blanks
tenure_na = df[df['Tenure'].isna()]
display(tenure_na)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
30,31,15589475,Azikiwe,591,Spain,Female,39,,0.00,3,1,0,140469.38,1
48,49,15766205,Yin,550,Germany,Male,38,,103391.38,1,0,1,90878.13,0
51,52,15768193,Trevisani,585,Germany,Male,36,,146050.97,2,0,0,86424.57,0
53,54,15702298,Parkhill,655,Germany,Male,41,,125561.97,1,0,0,164040.94,1
60,61,15651280,Hunter,742,Germany,Male,35,,136857.00,1,0,0,84509.57,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9944,9945,15703923,Cameron,744,Germany,Male,41,,190409.34,2,1,1,138361.48,0
9956,9957,15707861,Nucci,520,France,Female,46,,85216.61,1,1,0,117369.52,1
9964,9965,15642785,Douglas,479,France,Male,34,,117593.48,2,0,0,113308.29,0
9985,9986,15586914,Nepean,659,France,Male,36,,123841.49,2,1,0,96833.00,0


In [7]:
# Filter to 0 Tenure
tenure_zero = df[df['Tenure'] == 0]
display(tenure_zero)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
29,30,15656300,Lucciano,411,France,Male,29,0.0,59697.17,2,1,1,53483.21,0
35,36,15794171,Lombardo,475,France,Female,45,0.0,134264.04,1,1,0,27822.99,1
57,58,15647091,Endrizzi,725,Germany,Male,19,0.0,75888.20,1,0,0,45613.75,0
72,73,15812518,Palermo,657,Spain,Female,37,0.0,163607.18,1,0,1,44203.55,0
127,128,15782688,Piccio,625,Germany,Male,56,0.0,148507.24,1,1,0,46824.08,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9793,9794,15772363,Hilton,772,Germany,Female,42,0.0,101979.16,1,1,0,90928.48,0
9799,9800,15722731,Manna,653,France,Male,46,0.0,119556.10,1,1,0,78250.13,1
9843,9844,15778304,Fan,646,Germany,Male,24,0.0,92398.08,1,1,1,18897.29,0
9868,9869,15587640,Rowntree,718,France,Female,43,0.0,93143.39,1,1,0,167554.86,0


<div style='font-size:18px'> <b> Among the rows with NaN in the Tenure column, some customers exited and some are still customers. The Tenure column also contains 0 values. Since we don't know why Tenure is missing, we're going to drop the rows that have NaN in the Tenure column so they won't affect our model's scores. </b></div>
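One way to check whether the rows with a missing Tenure behave differently is to compare churn rates between rows with and without a Tenure value. This is a minimal sketch on a small made-up frame; the `toy` data is hypothetical and not taken from Churn.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df[['Tenure', 'Exited']]
toy = pd.DataFrame({
    "Tenure": [2.0, np.nan, 8.0, np.nan, 0.0, 5.0],
    "Exited": [1, 1, 0, 0, 0, 1],
})

# Share of customers who exited, grouped by whether Tenure is missing
churn_by_missing = toy.groupby(toy["Tenure"].isna())["Exited"].mean()
print(churn_by_missing)
```

In the notebook the same idea would be `df.groupby(df['Tenure'].isna())['Exited'].mean()`, run before the rows are dropped. Similar rates in both groups would be consistent with the values being missing at random.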

In [8]:
# Drop the missing Tenure rows
df.dropna(subset=['Tenure'], inplace=True)
print(f"Dataset size after dropping missing Tenure: {len(df)}")

Dataset size after dropping missing Tenure: 9091


In [9]:
#change tenure to int64
df['Tenure'] = df['Tenure'].astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9091 entries, 0 to 9998
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        9091 non-null   int64  
 1   CustomerId       9091 non-null   int64  
 2   Surname          9091 non-null   object 
 3   CreditScore      9091 non-null   int64  
 4   Geography        9091 non-null   object 
 5   Gender           9091 non-null   object 
 6   Age              9091 non-null   int64  
 7   Tenure           9091 non-null   int64  
 8   Balance          9091 non-null   float64
 9   NumOfProducts    9091 non-null   int64  
 10  HasCrCard        9091 non-null   int64  
 11  IsActiveMember   9091 non-null   int64  
 12  EstimatedSalary  9091 non-null   float64
 13  Exited           9091 non-null   int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.0+ MB


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Great job identifying and handling the missing values in the <code>Tenure</code> column. You checked both <code>NaN</code> and <code>0</code> values, explored their relationship with the target variable, and ensured the column was properly converted to <code>int</code> after cleaning. This shows solid data preprocessing skills.
</div>

<div class="alert alert-warning">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Suggestion: While dropping rows with missing values (about 9%) is not wrong, especially since you provided reasoning, in real projects it is often better to impute missing values (e.g., with median, mode, or even model-based imputation). Imputation could preserve more data and help you reach the required F1 threshold of 0.59 more reliably.
</div>
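As a sketch of the imputation idea mentioned in the comment above, missing values can be filled with the median of the observed values. The `toy` column below is hypothetical and only stands in for `df['Tenure']`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df['Tenure']
toy = pd.DataFrame({"Tenure": [2.0, np.nan, 8.0, 1.0, np.nan, 5.0, 3.0]})

# Median of the observed (non-missing) values
median_tenure = toy["Tenure"].median()

# Fill the gaps, then convert to int now that no NaNs remain
toy["Tenure"] = toy["Tenure"].fillna(median_tenure).astype(int)
print(toy["Tenure"].tolist())
```

To avoid leaking validation information, the median would normally be computed on the training split only and then applied to the other splits.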


<div style='font-size:18px'> <b> Some of the object columns, like Surname, should be dropped. Geography and Gender are also object columns, but they may help explain why customers are leaving, so we'll encode them with One-Hot Encoding (OHE). We should also drop RowNumber and CustomerId, which are only identifiers. </b></div>

In [10]:
# Drop columns that won't help with prediction
target = df['Exited']
features = df.drop(['Exited','RowNumber','CustomerId','Surname'],axis = 1)

print(features.dtypes)
display(features)


CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
dtype: object


Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,619,France,Female,42,2,0.00,1,1,1,101348.88
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58
2,502,France,Female,42,8,159660.80,3,1,0,113931.57
3,699,France,Female,39,1,0.00,2,0,0,93826.63
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.10
...,...,...,...,...,...,...,...,...,...,...
9994,800,France,Female,29,2,0.00,2,0,0,167773.55
9995,771,France,Male,39,5,0.00,2,1,0,96270.64
9996,516,France,Male,35,10,57369.61,1,1,1,101699.77
9997,709,France,Female,36,7,0.00,1,0,1,42085.58


In [11]:
# Apply One-Hot Encoding to the categorical columns
features_encoded = pd.get_dummies(features)

print("Shape before OHE:", features.shape)
print("Shape after OHE:", features_encoded.shape)
print("New columns:", features_encoded.columns.tolist())

Shape before OHE: (9091, 10)
Shape after OHE: (9091, 13)
New columns: ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Geography_France', 'Geography_Germany', 'Geography_Spain', 'Gender_Female', 'Gender_Male']


In [12]:
display(features_encoded)

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
0,619,42,2,0.00,1,1,1,101348.88,1,0,0,1,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0,1,1,0
2,502,42,8,159660.80,3,1,0,113931.57,1,0,0,1,0
3,699,39,1,0.00,2,0,0,93826.63,1,0,0,1,0
4,850,43,2,125510.82,1,1,1,79084.10,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,800,29,2,0.00,2,0,0,167773.55,1,0,0,1,0
9995,771,39,5,0.00,2,1,0,96270.64,1,0,0,0,1
9996,516,35,10,57369.61,1,1,1,101699.77,1,0,0,0,1
9997,709,36,7,0.00,1,0,1,42085.58,1,0,0,1,0


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Excellent preprocessing step! You correctly removed irrelevant columns (<code>RowNumber</code>, <code>CustomerId</code>, <code>Surname</code>) and applied One-Hot Encoding to categorical variables like <code>Geography</code> and <code>Gender</code>. This ensures the model can leverage all useful features while avoiding data leakage.
</div>
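For linear models such as logistic regression, the full set of dummy columns is redundant, since Gender_Female is always 1 minus Gender_Male. A common variant is to pass `drop_first=True` to `pd.get_dummies`. The sketch below uses hypothetical toy rows, not the project data.

```python
import pandas as pd

# Hypothetical toy rows with the same two categorical columns
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 43],
})

# Drop the first category of each encoded column to remove the redundancy
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Tree-based models are not affected by the redundancy, so keeping all dummy columns, as this notebook does, is also a reasonable choice for the RandomForest.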

<div style='font-size:18px'> <b> Now that we have applied OHE to Geography and Gender, let's split the data into training and validation sets and fit a baseline model. </b></div>

In [13]:
# Split into training (75%) and validation (25%) sets
features_train, features_valid, target_train, target_valid = train_test_split(features_encoded,target, test_size = 0.25, random_state = 12345)
# Baseline: logistic regression with balanced class weights
model = LogisticRegression(random_state = 12345, solver = 'liblinear', class_weight = 'balanced')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
f1 = f1_score(target_valid, predicted_valid)
print("F1 Score: ", f1)

F1 Score:  0.4660194174757281


In [14]:
print(target.value_counts())

0    7237
1    1854
Name: Exited, dtype: int64


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Great job performing the train/validation split with a fixed random state for reproducibility. You also started with Logistic Regression as a baseline model, which is a solid choice. Using <code>class_weight='balanced'</code> is a nice touch to account for imbalance from the start.
</div>
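The split in this notebook does not pass `stratify`, so the class proportions in the two sets can drift slightly from the overall ratio. A minimal sketch of a stratified split, using a hypothetical 80/20 target rather than the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 zeros and 20 ones
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 ratio in both the training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)
print(y_train.mean(), y_valid.mean())
```

In the notebook this would correspond to adding `stratify=target` to the existing `train_test_split` call, which would change the split and therefore the scores reported here.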

<div style='font-size:18px'> <b> 7237 customers (79.61%) are still with the bank and 1854 (20.39%) have left. The classes are imbalanced, so we should try upsampling and downsampling the training data before moving on to the next model. </b></div>

In [15]:

def upsample(features, target, repeat):
    """Repeat the positive-class rows `repeat` times, then shuffle."""
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # Keep every negative row once and each positive row `repeat` times
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)

    return features_upsampled, target_upsampled

# Call the function
features_upsampled, target_upsampled = upsample(features_train, target_train, 10)

# Check the results
print(f"Original shape: {features_train.shape}")
print(f"Upsampled shape: {features_upsampled.shape}")
print(f"Class distribution after upsampling:")
print(target_upsampled.value_counts())


Original shape: (6818, 13)
Upsampled shape: (19445, 13)
Class distribution after upsampling:
1    14030
0     5415
Name: Exited, dtype: int64


In [16]:
# Train the Model
up_model = LogisticRegression(random_state = 12345, solver = 'liblinear', class_weight = 'balanced')
up_model.fit(features_upsampled, target_upsampled)

predictions = up_model.predict(features_valid)

# Calculate the new F1 score
f1_upsampled = f1_score(target_valid, predictions)
print("F1 Score with upsampling: ", f1_upsampled)

F1 Score with upsampling:  0.4521739130434782


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Excellent work identifying the imbalance (≈20% churners vs. 80% non-churners) and implementing an upsampling function. You correctly verified the new class distribution, retrained Logistic Regression, and reported the new F1 score. This shows a strong understanding of class imbalance handling.
</div>
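With `repeat=10` the positive class ends up outnumbering the negative class (14030 vs. 5415). As a rough sketch, a repeat factor closer to balance can be derived from the training counts implied by the output above: 5415 zeros, and 14030 / 10 = 1403 ones.

```python
n_zeros = 5415           # negative-class rows in the training set
n_ones = 14030 // 10     # positive-class rows before being repeated 10 times

# Repeat factor that brings the positive class closest to the negative class
repeat = round(n_zeros / n_ones)
print(repeat, n_ones * repeat)  # 4 5612
```

Note also that `class_weight='balanced'` already reweights the classes inversely to their frequency in whatever data is passed to `fit`, which may be why resampling changes the Logistic Regression score so little.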

<div style='font-size:18px'> <b> Upsampling didn't help: F1 fell slightly, from 0.466 to 0.452. Let's try downsampling. </b></div>

In [17]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat([features_zeros.sample(frac=fraction, random_state = 12345)] + [features_ones])
    target_downsampled = pd.concat([target_zeros.sample(frac=fraction,random_state=12345)] + [target_ones])
    #Shuffle the data
    features_downsampled, target_downsampled = shuffle(features_downsampled,target_downsampled, random_state = 12345)

    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(
    features_train, target_train, 0.3
)
#Look at the shape of the downsampled
print(features_downsampled.shape)
print(target_downsampled.shape)

print()

# calculate the new f1 score after downsampling
down_model = LogisticRegression(random_state=12345, solver='liblinear',class_weight='balanced')
down_model.fit(features_downsampled, target_downsampled)
down_predictions = down_model.predict(features_valid)
down_f1 = f1_score(target_valid, down_predictions)
print("downsampling F1 Score: ",down_f1)

(3027, 13)
(3027,)

downsampling F1 Score:  0.45203488372093026


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Well done! You implemented both upsampling and downsampling functions correctly, applied them to the training data, retrained Logistic Regression, and evaluated the results. This fulfills the project requirement of testing at least two imbalance-handling techniques and shows strong control of resampling methods.
</div>

<div style='font-size:18px'> <b> Logistic Regression stays between 0.45 and 0.47 F1 with or without resampling, so it is not our best model. Let's try RandomForestClassifier on the downsampled data. </b></div>

In [18]:
# Testing RandomForest
random_model = RandomForestClassifier(n_estimators = 50, max_depth = 5, random_state =12345)
random_model.fit(features_downsampled,target_downsampled)
random_predictions = random_model.predict(features_valid)
random_f1 = f1_score(target_valid, random_predictions)
print("F1 score from the RandomForestClassifier: ", random_f1)

F1 score from the RandomForestClassifier:  0.6024096385542169
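The hyperparameters above were set by hand. A small loop over `max_depth` is one way to check whether another depth would score higher. The sketch below runs on synthetic data from `make_classification` as a stand-in for the churn features, so its numbers say nothing about this project's data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (roughly 80/20 class split)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

# Try several depths and keep the one with the best validation F1
best_f1, best_depth = -1.0, None
for depth in range(2, 11, 2):
    clf = RandomForestClassifier(
        n_estimators=50, max_depth=depth,
        class_weight="balanced", random_state=12345,
    )
    clf.fit(X_train, y_train)
    score = f1_score(y_valid, clf.predict(X_valid))
    if score > best_f1:
        best_f1, best_depth = score, depth

print(best_depth, best_f1)
```

In the notebook, the loop would fit on `features_downsampled` and `target_downsampled` and score on `features_valid`, as in the cell above.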


<div style='font-size:18px'> <b> We finally pass the minimum F1 requirement. Now we can calculate the AUC-ROC metric. </b></div>

In [19]:


# Note: `model` is the Logistic Regression baseline from In [13], not the RandomForest
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:,1]

auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print(auc_roc)



0.7481909940344789
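The value above comes from `model`, the Logistic Regression baseline. To report AUC-ROC for a random forest, the same two steps apply to the forest's `predict_proba` output. The sketch below uses synthetic data and a hypothetical `forest` variable, so it does not reproduce this project's score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (roughly 80/20 class split)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=12345)
forest.fit(X_train, y_train)

# Probability of the positive class (column 1), then AUC-ROC
forest_proba_one = forest.predict_proba(X_valid)[:, 1]
forest_auc = roc_auc_score(y_valid, forest_proba_one)
print(forest_auc)
```

In the notebook, the equivalent call would be `roc_auc_score(target_valid, random_model.predict_proba(features_valid)[:, 1])`.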


<div style='font-size:18px'> <b> RandomForestClassifier achieved an F1 score of 0.602, exceeding the 0.59 threshold required to pass the project. This indicates that the RandomForest model maintains a reasonable balance between precision and recall when identifying customers who are likely to leave. 
<p>
    The AUC-ROC of 0.748 reported above was computed from the Logistic Regression baseline (<code>model</code>), so it describes that model's ability to rank customers who leave above those who stay, not the RandomForest's. Because AUC-ROC covers all thresholds while F1 is measured at a single one, tuning the decision threshold could potentially improve business outcomes. 
</p>
</b></div>

<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Congratulations! 🎉 Your RandomForestClassifier achieved an F1 score above the required 0.59 threshold, which means you successfully met the project’s main objective. You also reported the AUC-ROC metric and provided clear interpretation of what it means for model performance — excellent work.
</div>

In [20]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(target_valid, random_predictions))

[[1519  303]
 [ 126  325]]


<div style='font-size:18px'>
    
- **True Negatives (TN = 1519)**: Customers who stayed and were correctly predicted to stay.
- **False Positives (FP = 303)**: Customers who stayed but were predicted to leave.
- **False Negatives (FN = 126)**: Customers who left but were predicted to stay.
- **True Positives (TP = 325)**: Customers who left and were correctly predicted to leave.
<p>
        This confusion matrix shows that the model catches most exiting customers (recall = 325 / 451 ≈ 0.72), though it also flags a number of loyal customers as at risk (precision = 325 / 628 ≈ 0.52).
</p>
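The four counts are enough to recover precision, recall, and F1 by hand, which is a quick consistency check against the F1 score printed earlier. A short sketch using the counts from the matrix above:

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 1519, 303, 126, 325

precision = tp / (tp + fp)   # share of predicted leavers who actually left
recall = tp / (tp + fn)      # share of actual leavers the model caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.518 0.721 0.602
```

The F1 value matches the 0.6024 reported for the RandomForestClassifier.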

<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Excellent final touch with the confusion matrix, evaluation summary, and interpretation. You clearly explained TN, FP, FN, and TP, and highlighted why false negatives are the most critical error in churn prediction. Your explanation of F1 vs. AUC-ROC was also thorough and well-presented.
</div>

<h3> Final Conclusion</h3>

<p>In this project, we built and evaluated several machine learning models to predict customer churn at Beta Bank. After cleaning, preprocessing, and addressing class imbalance, our best-performing model was the <strong>Random Forest Classifier</strong> trained on a <strong>downsampled dataset</strong>.</p>

<h3> Model Evaluation Summary</h3>

<ul>
    <li><strong>Best Model:</strong> <code>RandomForestClassifier(n_estimators=50, max_depth=5)</code></li>
    <li><strong>Evaluation on Validation Set:</strong>
        <ul>
            <li><strong>F1 Score:</strong> <span style="color: green;">0.602</span> ✅ (Passed threshold of 0.59)</li>
            <li><strong>AUC-ROC Score:</strong> <span style="color: green;">0.748</span> (computed from the Logistic Regression baseline <code>model</code>, not from the RandomForest)</li>
        </ul>
    </li>
</ul>

<h3> Confusion Matrix</h3>

<pre>
                       Predicted: No Exit (0)    Predicted: Exit (1)
Actual: No Exit (0)          1519 (TN)                 303 (FP)
Actual: Exit (1)              126 (FN)                 325 (TP)
</pre>

<h3> Interpretation of Confusion Matrix</h3>

<ul>
    <li><strong>True Negatives (TN = 1519):</strong> Customers correctly predicted to stay.</li>
    <li><strong>False Positives (FP = 303):</strong> Customers incorrectly predicted to leave (may trigger unnecessary retention effort).</li>
    <li><strong>False Negatives (FN = 126):</strong> Customers who left but were predicted to stay — <strong>most critical error type</strong> in churn prediction.</li>
    <li><strong>True Positives (TP = 325):</strong> Customers correctly predicted to churn — can be targeted for retention.</li>
</ul>

<h3> F1 Score vs. AUC-ROC</h3>

<ul>
    <li><strong>F1 Score</strong> focuses on the balance between precision and recall at a fixed threshold — ideal for <strong>imbalanced datasets</strong> like this one.</li>
    <li><strong>AUC-ROC</strong> evaluates how well the model distinguishes between classes <strong>across all possible thresholds</strong>, offering insight into the model’s overall ranking ability.</li>
    <li>In this case, the F1 Score of <strong>0.602</strong> shows that the RandomForest clears the 0.59 requirement at the default threshold. The AUC-ROC of <strong>0.748</strong> was computed from the Logistic Regression baseline, so it measures that model's ranking ability and not the RandomForest's.</li>
</ul>
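Because F1 depends on the decision threshold, one way to explore the tuning mentioned earlier is to sweep thresholds over the predicted probabilities and keep the one with the highest F1. This is a sketch on synthetic data with a hypothetical `forest` model; the threshold it finds does not apply to the churn data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (roughly 80/20 class split)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=12345)
forest.fit(X_train, y_train)
proba_one = forest.predict_proba(X_valid)[:, 1]

# Evaluate F1 at each candidate threshold and keep the best one
best_threshold, best_f1 = None, -1.0
for threshold in np.arange(0.1, 0.91, 0.05):
    predicted = (proba_one >= threshold).astype(int)
    score = f1_score(y_valid, predicted, zero_division=0)
    if score > best_f1:
        best_threshold, best_f1 = threshold, score

print(round(float(best_threshold), 2), best_f1)
```

A threshold chosen on the validation set is tuned to that set, so a separate test set would be needed for an unbiased final estimate.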

