# Finding the Best Model for Plan Recommendation #

## Introduction: ##
Our goal is to find a model that recommends the most suitable plan to Megaline subscribers, which means comparing several types of machine learning algorithms. We need to build a binary classification model that predicts which of two mobile plans ('Smart' or 'Ultra') a user is likely to choose. The input features are 'calls', 'minutes', 'messages' and 'mb_used'. The model should assign each user to one of the two plans with an accuracy of 75% or higher on unseen data. We'll compare a decision tree model, a random forest model and a logistic regression model.

The work plan will be implemented as follows: 
- Data Overview 
- EDA (determining if missing/duplicate values are present, checking datatypes and class imbalance)
- Preprocessing (splitting data, rectifying possible imbalance)
- Model training and testing 
- Final conclusions and best steps for Megaline going forward.

In [4]:
#importing all the necessary libraries 
import pandas as pd 
from sklearn import set_config
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.dummy import DummyClassifier

In [5]:
#creating a main dataframe to store info
df_megaline = pd.read_csv('/Users/micha/Downloads/users_behavior.csv')

In [6]:
#printing the dataframe 
df_megaline.head()

|   | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [7]:
#getting the general info
df_megaline.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [8]:
#general description of data
df_megaline.describe()

|   | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [9]:
#finding the amount of 'ultra' plan users
df_megaline[df_megaline['is_ultra'] == 1].info()

<class 'pandas.core.frame.DataFrame'>
Index: 985 entries, 3 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     985 non-null    float64
 1   minutes   985 non-null    float64
 2   messages  985 non-null    float64
 3   mb_used   985 non-null    float64
 4   is_ultra  985 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 46.2 KB


In [10]:
#ultra plan users % 
(985 / 3214) * 100

30.647168637212197

In [11]:
df_megaline[df_megaline['is_ultra'] == 0].info()

<class 'pandas.core.frame.DataFrame'>
Index: 2229 entries, 0 to 3212
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     2229 non-null   float64
 1   minutes   2229 non-null   float64
 2   messages  2229 non-null   float64
 3   mb_used   2229 non-null   float64
 4   is_ultra  2229 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 104.5 KB


In [12]:
#smart plan users %
(2229 / 3214) * 100

69.35283136278781
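
The two percentages above can also be obtained in one step with `value_counts(normalize=True)`, which avoids copying row counts by hand. The sketch below is a minimal illustration: it does not load the CSV, and the `is_ultra` series is a stand-in (an assumption) rebuilt from the counts reported by `info()`.

```python
import pandas as pd

# Stand-in for df_megaline['is_ultra'] (assumption): rebuilt from the
# counts reported above, 985 'Ultra' rows and 2229 'Smart' rows.
is_ultra = pd.Series([1] * 985 + [0] * 2229, name="is_ultra")

# normalize=True returns each class's share of rows instead of raw counts.
shares = is_ultra.value_counts(normalize=True)

print((shares * 100).round(2))
```

On the real dataframe the equivalent call would be `df_megaline['is_ultra'].value_counts(normalize=True)`.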

In [13]:
#creating features and target variables for model development
features = df_megaline.drop(['is_ultra'], axis=1)
target = df_megaline['is_ultra']

In [14]:
#splitting the data into training, validation and testing sets
features_temp, features_test, target_temp, target_test = train_test_split(features, target, test_size=0.2, random_state=246, stratify=target)
features_train, features_valid, target_train, target_valid = train_test_split(features_temp, target_temp, test_size=0.25, random_state=246, stratify=target_temp)

### NOTES ON DATA SPLIT: ###
The dataset is imbalanced: only around 30% of all customers are enrolled in the 'Ultra' plan offered by Megaline. Stratified sampling was used to preserve the class distribution across splits, which is especially important in imbalanced classification problems for reliable model evaluation.

**This code splits the data into:** 
- 'features_train', 'target_train' = 60% of the data
- 'features_valid', 'target_valid' = 20% of the data
- 'features_test', 'target_test' = 20% of the data
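
The 60/20/20 proportions follow from chaining the two `test_size` values: 20% is held out first, and 25% of the remaining 80% is another 20% of the whole. The arithmetic can be sketched as below, assuming the held-out row count is rounded up at each split (an assumption that is consistent with the shapes printed in this notebook).

```python
import math

n_samples = 3214  # row count reported by df_megaline.info()

# First split: test_size=0.2 holds out the test set.
n_test = math.ceil(0.2 * n_samples)
n_temp = n_samples - n_test

# Second split: test_size=0.25 of the remainder becomes the validation set.
n_valid = math.ceil(0.25 * n_temp)
n_train = n_temp - n_valid

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_samples:.1%})")
```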

In [15]:
#printing the shape of each dataframe and getting the size
print(f"Training set features shape: {features_train.shape}")
print(f"Training set target shape: {target_train.shape}")
print(f"Validation set features shape: {features_valid.shape}")
print(f"Validation set target shape: {target_valid.shape}")
print(f"Test set features shape: {features_test.shape}")
print(f"Test set target shape: {target_test.shape}")

Training set features shape: (1928, 4)
Training set target shape: (1928,)
Validation set features shape: (643, 4)
Validation set target shape: (643,)
Test set features shape: (643, 4)
Test set target shape: (643,)


## Model Training: ##

In [None]:
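#custom function for model evaluation: fits the model, then scores it on each split
def evaluate_model(model, features_train, target_train, features_valid, target_valid, features_test, target_test):
    """Fit `model` on the training set and return predictions, class-1 probabilities and accuracy/F1/ROC-AUC for the train, valid and test splits."""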
    model.fit(features_train, target_train)

    preds = {
        'train': model.predict(features_train),
        'valid': model.predict(features_valid),
        'test': model.predict(features_test)
    }
    probs = {
        'train': model.predict_proba(features_train)[:, 1],
        'valid': model.predict_proba(features_valid)[:, 1],
        'test': model.predict_proba(features_test)[:, 1]
    }
    
    metrics = {}
    for split, y_true in zip(['train', 'valid', 'test'], [target_train, target_valid, target_test]):
        y_preds = preds[split]
        y_probs = probs[split]

        metrics[split] = {
            'accuracy': accuracy_score(y_true, y_preds),
            'f1': f1_score(y_true, y_preds),
            'roc_auc': roc_auc_score(y_true, y_probs)
        }
    results = {
        'predictions': preds,
        'probabilities': probs,
        'metrics': metrics
    }
    
    return results

## Decision Tree Model: ##

In [35]:
#training a decision tree model; max_depth was tuned for accuracy using the training and validation sets
dt_model = DecisionTreeClassifier(
    max_depth=8,
    random_state=246,
    class_weight='balanced'
)

results = evaluate_model(
    dt_model,
    features_train, target_train,
    features_valid, target_valid,
    features_test, target_test
)
print(f"Train Metrics: {results['metrics']['train']} \nValid Metrics: {results['metrics']['valid']} \nTest Metrics: {results['metrics']['test']}")

Train Metrics: {'accuracy': 0.8449170124481328, 'f1': 0.7397737162750218, 'roc_auc': 0.878135887730062} 
Valid Metrics: {'accuracy': 0.7573872472783826, 'f1': 0.5894736842105263, 'roc_auc': 0.7257972729962896} 
Test Metrics: {'accuracy': 0.7589424572317263, 'f1': 0.5888594164456233, 'roc_auc': 0.7383908857071317}
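
The cell above fixes `max_depth=8`. A minimal sketch of how candidate depths could be compared on a validation set is shown below. It is not the notebook's actual tuning procedure: the data is synthetic (generated with `make_classification` as a stand-in with 4 features and roughly 30% positives), and the depth range 1 to 10 is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 features, roughly 30% positives.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=246,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=246, stratify=y
)

# Fit one tree per candidate depth and record its validation accuracy.
depth_scores = {}
for depth in range(1, 11):
    model = DecisionTreeClassifier(
        max_depth=depth, random_state=246, class_weight="balanced"
    )
    model.fit(X_train, y_train)
    depth_scores[depth] = accuracy_score(y_valid, model.predict(X_valid))

# Pick the depth with the highest validation accuracy.
best_depth = max(depth_scores, key=depth_scores.get)
print(f"Best depth: {best_depth}, validation accuracy: {depth_scores[best_depth]:.3f}")
```

Only the validation set is used to choose the depth here, so the test set stays untouched until the final comparison.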


## Random Forest Model: ##

In [36]:
rf_model = RandomForestClassifier(
    n_estimators=300,
    max_depth=6,
    class_weight='balanced',
    random_state=246
)
results = evaluate_model(
    rf_model,
    features_train, target_train,
    features_valid, target_valid,
    features_test, target_test
)
print(f"Train Metrics: {results['metrics']['train']} \nValid Metrics: {results['metrics']['valid']} \nTest Metrics: {results['metrics']['test']}")

Train Metrics: {'accuracy': 0.8475103734439834, 'f1': 0.7267657992565055, 'roc_auc': 0.8651639463556438} 
Valid Metrics: {'accuracy': 0.7916018662519441, 'f1': 0.6235955056179775, 'roc_auc': 0.819705902437914} 
Test Metrics: {'accuracy': 0.8040435458786936, 'f1': 0.6460674157303371, 'roc_auc': 0.809104049532221}


## Logistic Regression Model: ##

In [None]:
log_reg_model = LogisticRegression(random_state=246, solver='liblinear', class_weight='balanced')
results = evaluate_model(
    log_reg_model, 
    features_train, target_train, 
    features_valid, target_valid,
    features_test, target_test
)
print(f"Train Metrics: {results['metrics']['train']} \nValid Metrics: {results['metrics']['valid']} \nTest Metrics: {results['metrics']['test']}")


Train Metrics: {'accuracy': 0.6488589211618258, 'f1': 0.5208775654635527, 'roc_auc': 0.6757356356314551} 
Valid Metrics: {'accuracy': 0.6283048211508554, 'f1': 0.5031185031185031, 'roc_auc': 0.6666135530718627} 
Test Metrics: {'accuracy': 0.614307931570762, 'f1': 0.4723404255319149, 'roc_auc': 0.6414946165577837}


In [None]:
#testing against a dummy classifier 
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(features_train, target_train)
dummy_predictions = dummy.predict(features_valid)
dummy_accuracy = accuracy_score(target_valid, dummy_predictions)
print(f"Dummy Accuracy (validation): {dummy_accuracy}")

Dummy Accuracy (validation): 0.6936236391912908


### Initial Model Training Conclusions: ###

After training and testing the various models we can draw a few conclusions: 
- Both tree-based models show signs of overfitting, while LogisticRegression does not: ***F1 drops by roughly 0.10 to 0.15 between the training set and the validation/test sets, with a smaller drop in accuracy.***
- The DecisionTree and RandomForest models both met the company's accuracy threshold of 75% on the validation and test sets; LogisticRegression did not (roughly 0.61 to 0.65 across splits).
- RandomForest was the most well-rounded classifier of the 3 models tested, with a final test accuracy of around 0.80, a final ROC-AUC of around 0.81 (when scoring customers, the model usually ranks an 'Ultra' customer above a 'Smart' customer) and a final F1 of around 0.65 (it strikes a reasonable balance between catching 'Ultra' customers and avoiding false alarms, which matters in an imbalanced dataset).
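
To make the three metrics concrete, the sketch below computes accuracy and F1 from a confusion matrix and ROC-AUC as the share of (positive, negative) pairs that are ranked correctly. The confusion-matrix counts and the scores are hypothetical toy values chosen for illustration, not results from this notebook, and the helper names are assumptions.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and F1 for the positive class from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted positives that are right
    recall = tp / (tp + fn)     # share of actual positives that are caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


def roc_auc_by_pairs(y_true, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical confusion matrix: 100 users, 30 of them on 'Ultra'.
acc, f1 = accuracy_and_f1(tp=18, fp=8, fn=12, tn=62)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")

# Hypothetical scores: one 'Ultra' user is ranked below one 'Smart' user.
auc = roc_auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"roc_auc={auc:.2f}")
```

In the toy confusion matrix the many correct 'Smart' predictions lift accuracy, while F1 only looks at the 'Ultra' class, which is why F1 can sit well below accuracy on imbalanced data.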

### Dummy Classifier Conclusions: ###
The results of the dummy classifier are what we would expect given the use of "strategy='most_frequent'", which always predicts the most frequent class in the training data (here, the 'Smart' plan). That gives it a validation accuracy of around 69%. Our Decision Tree and Random Forest models outperform this baseline by roughly 6 and 10 percentage points respectively on the validation set. The Logistic Regression model does not: its accuracy (roughly 0.61 to 0.65 across splits) falls below the baseline.
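
A majority-class baseline scores exactly the share of evaluation rows that belong to the most frequent training class. The reported 0.6936 is consistent with 446 of the 643 validation rows being 'Smart' users. A minimal pure-Python sketch of the idea follows; the helper name and the toy labels are assumptions for illustration, not the notebook's data.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_class)
    return hits / len(eval_labels)


# Toy labels (assumption): 0 = 'Smart', 1 = 'Ultra', split roughly 70/30.
train = [0] * 7 + [1] * 3
valid = [0] * 14 + [1] * 6

baseline = majority_baseline_accuracy(train, valid)
print(f"Baseline accuracy: {baseline}")
```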

# Overall Conclusion: #
In the final analysis, and given what Megaline hopes to achieve, I would recommend the Random Forest model. It performed best overall across all 3 metrics, particularly on accuracy (meeting the company's requirement of a score >= 0.75) and ROC-AUC, which complements accuracy by measuring how well the model ranks 'Ultra' users above 'Smart' users. Of the models evaluated it is the best performing and easy to use, and it should help the company make better plan recommendations to potential customers, which could in turn improve profitability.