#Project 7 - In this project we look at data from a mobile carrier whose clients were on an old plan and have since switched to one of the company's two new plans, Smart or Ultra. Based on each customer's behavior in the data, the goal is to recommend one of the new plans to that customer. I will train random forest, decision tree, and logistic regression models to see which one best predicts the plan a customer should use.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

#Load the dataset
df = pd.read_csv('/datasets/users_behavior.csv')



print(df.head(5))
print(df.describe())

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246     0.461100
min       0.000000     0.000000     0.000000      0.000000     0.000000
25%      40.000000   274.575000     9.000000  12491.902500     0.000000
50%      62.000000   430.600000    30.000000  16943.235000     0.000000
75%      82.000000   571.927500    57.000000  21424.700000     1.000000
max     244.000000  1632.060000   224.000000  49745.730000     1.000000


In [3]:
#Split data into features and target
features = df.drop(columns=['is_ultra'])  # Drop the target column
target = df['is_ultra']  # Binary target (1 for Ultra, 0 for Smart)


In [4]:
#Split into training, validation, and test sets (60%-20%-20%)
X_train, X_temp, y_train, y_temp = train_test_split(features, target, test_size=0.4, random_state=54321)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=54321)

#I started by importing the Python functions I expected to need and loading the data. I then split the dataset into a features dataframe and a target series. From there I used train_test_split to separate the data into a 60% training set and a 40% holdout set, and then split the holdout in half to create a validation set and a test set (20% of the data each).
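
#As a quick sanity check on the 60/20/20 split described above, the sketch below repeats the same two-step train_test_split on a synthetic stand-in with the same number of rows (3214). The demo dataframe and the Xd_/yd_ names are made up for illustration and are not part of the project code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (hypothetical values, same row count)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "calls": rng.integers(0, 245, size=3214),
    "is_ultra": rng.integers(0, 2, size=3214),
})
demo_features = demo.drop(columns=["is_ultra"])
demo_target = demo["is_ultra"]

# Same two-step split as in the notebook: 60% train, then the other 40% halved
Xd_train, Xd_temp, yd_train, yd_temp = train_test_split(
    demo_features, demo_target, test_size=0.4, random_state=54321
)
Xd_valid, Xd_test, yd_valid, yd_test = train_test_split(
    Xd_temp, yd_temp, test_size=0.5, random_state=54321
)

# Row counts and shares of each subset
split_sizes = {"train": len(Xd_train), "valid": len(Xd_valid), "test": len(Xd_test)}
split_shares = {name: size / len(demo) for name, size in split_sizes.items()}
print(split_sizes)
print(split_shares)
```

#The test subset here has 643 rows, which matches the total support shown in the classification report later in the notebook.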

In [6]:
#Train multiple models and tune hyperparameters

# Logistic Regression Model (max_iter raised from the default of 100 to 1000)
log_model = LogisticRegression(random_state=54321, max_iter=1000)
log_model.fit(X_train, y_train)
log_preds = log_model.predict(X_valid)
log_acc = accuracy_score(y_valid, log_preds)


# Decision Tree grid search over max_depth and min_samples_split, scored on the validation set
max_depth_values = [3, 5, 10, None]
min_samples_split_values = [2, 5, 10]
best_score = 0
best_params = {}

for max_depth in max_depth_values:
    for min_samples_split in min_samples_split_values:
        model = DecisionTreeClassifier(max_depth=max_depth, min_samples_split=min_samples_split, random_state=12345)
        model.fit(X_train, y_train)
        predicted_valid = model.predict(X_valid)
        score = accuracy_score(y_valid, predicted_valid)
        if score > best_score:
            best_score = score
            best_params = {"max_depth": max_depth, "min_samples_split": min_samples_split}

print("Best Hyperparameters:", best_params)
print("Best Accuracy Score:", best_score)

# Decision Tree Model
tree_model = DecisionTreeClassifier(max_depth=4, random_state=54321)
tree_model.fit(X_train, y_train)
tree_preds = tree_model.predict(X_valid)
tree_acc = accuracy_score(y_valid, tree_preds)

tree_model_2 = DecisionTreeClassifier(max_depth=3, random_state=12345)
tree_model_2.fit(X_train, y_train)
tree_preds_2 = tree_model_2.predict(X_valid)
tree_acc_2 = accuracy_score(y_valid, tree_preds_2)

# Random Forest Model
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=54321)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_valid)
rf_acc = accuracy_score(y_valid, rf_preds)

rf_model_2 = RandomForestClassifier(n_estimators=5, max_depth=4, random_state=12345)
rf_model_2.fit(X_train, y_train)
rf_preds_2 = rf_model_2.predict(X_valid)
rf_acc_2 = accuracy_score(y_valid, rf_preds_2)

Best Hyperparameters: {'max_depth': 10, 'min_samples_split': 2}
Best Accuracy Score: 0.7791601866251944


In [7]:
#Compare model performance on the validation set
print(f"Logistic Regression Accuracy: {log_acc:.4f}")
print(f"Decision Tree Accuracy (grid search): {best_score:.4f}")
print(f"Decision Tree Accuracy: {tree_acc:.4f}")
print(f"Decision Tree Accuracy(2): {tree_acc_2:.4f}")
print(f"Random Forest Accuracy: {rf_acc:.4f}")
print(f"Random Forest Accuracy(2): {rf_acc_2:.4f}")

#Choose the model with the highest validation accuracy (the first random forest)
best_model = rf_model

Logistic Regression Accuracy: 0.6750
Decision Tree Accuracy (grid search): 0.7792
Decision Tree Accuracy: 0.7449
Decision Tree Accuracy(2): 0.7652
Random Forest Accuracy: 0.7978
Random Forest Accuracy(2): 0.7760


#From there I trained a logistic regression model, several decision tree models (including a small grid search over max_depth and min_samples_split), and two random forest models. I scored every model on the validation set and chose the one with the highest accuracy as the best model. The first random forest model had the best validation accuracy at 79.8%.
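
#To put these accuracy scores in context, it helps to know what a model that always predicts the more common plan would score. The sketch below is an illustration, not part of the original analysis: it rebuilds the class counts from the describe() output (3214 rows, is_ultra mean of 0.306472) and scores scikit-learn's DummyClassifier on them. It uses the whole dataset rather than the validation set, so the validation baseline may differ slightly.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Class counts rebuilt from df.describe(): count = 3214, mean(is_ultra) = 0.306472
n_rows = 3214
n_ultra = round(0.306472 * n_rows)
labels = np.array([1] * n_ultra + [0] * (n_rows - n_ultra))

# DummyClassifier ignores feature values, so a placeholder column is enough
placeholder_features = np.zeros((n_rows, 1))

# Always predict the most frequent class (Smart, label 0)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(placeholder_features, labels)
baseline_acc = accuracy_score(labels, baseline.predict(placeholder_features))
print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
```

#A model only adds value to the extent that it beats this baseline. By that measure the random forest and decision tree scores above are clear improvements, while the logistic regression score of 0.6750 is roughly at baseline level.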

In [None]:
# Test the best model
test_preds = best_model.predict(X_test)
test_acc = accuracy_score(y_test, test_preds)
print(f"Test Accuracy: {test_acc:.4f}")

Test Accuracy: 0.8336


#I then used the best model to make predictions on the test set and printed its test accuracy.

In [None]:
# Display classification report
print("\nClassification Report:\n", classification_report(y_test, test_preds))


Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.94      0.89       466
           1       0.77      0.56      0.65       177

    accuracy                           0.83       643
   macro avg       0.81      0.75      0.77       643
weighted avg       0.83      0.83      0.82       643



#Last, I printed a classification report for the test set. The first random forest model had the best accuracy on the validation set, so we used it to make predictions on the test set. Its test accuracy of 0.83 came back even higher than its validation accuracy of 0.80. The classification report gives the precision, recall, and F1 scores for each class on the test data. The model predicts class 0 (Smart) well, with an F1 score of 0.89, but it is weaker on class 1 (Ultra), where recall is 0.56 and F1 is 0.65, so it misses a little under half of the Ultra customers.
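
#The F1 score is the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R). As a small check, the sketch below recomputes F1 for both classes from the rounded precision and recall values in the report above. The helper name f1_from is made up for this illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the classification report above
reported = {"Smart (0)": (0.85, 0.94), "Ultra (1)": (0.77, 0.56)}

f1_scores = {name: f1_from(p, r) for name, (p, r) in reported.items()}
for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.2f}")
```

#Because F1 is a harmonic mean, the low recall on the Ultra class pulls its F1 down to 0.65 even though its precision is 0.77.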