# Mobile Carrier Megaline Model

## Introduction 

Many Megaline subscribers are still on legacy plans, and the company wants to move them to one of its newer plans: Smart or Ultra.

The objective is to use behavior data from subscribers who have already migrated to the new plans to build a classification model that recommends the more suitable plan, Smart or Ultra, for each subscriber.

The data was already preprocessed in a prior project on Statistical Data Analysis, so this project proceeds directly to model development.

The goal is to develop the model with the highest possible accuracy. Megaline has set a minimum accuracy threshold of 75% for this project.

### Imports

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [6]:
try:
    df = pd.read_csv("../data/users_behavior.csv")
except FileNotFoundError as e:
    print(f"The following error ({e}) occurred while trying to load the dataset")
else:
    print("The dataset was loaded successfully!")

The dataset was loaded successfully!


## Data Observations

In [7]:
df.head(20) # Look at data

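|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |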


In [8]:
df.info() # Look at data information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


**Observations**

Every column has an appropriate data type, and there are no missing values: each of the five columns has 3,214 non-null entries. The `is_ultra` column is the classification target, where 1 denotes the Ultra plan and 0 denotes the Smart plan.

## Model Preparations

Because only one dataset is available, the source data is split 3:1:1: 60% for training and 20% each for validation and testing.

In [9]:
def split_data_3_1_1(dataset, test_s=0.40, rnd_state=42):
    '''Splits dataset into train, validation and test sets (3:1:1 with the default test_s), prints the split sizes and returns the three datasets in that order'''
    
    # Hold out test_s (40%) of the passed-in dataset; the remaining 60% is used for training
    df_train, df_holdout = train_test_split(dataset, test_size=test_s, random_state=rnd_state)

    # Split the held-out 40% in half (test and validation) to obtain the 3:1:1 ratio
    df_test, df_valid = train_test_split(df_holdout, test_size=0.5, random_state=rnd_state)
    
    # Printing confirmation of data split
    sum_of_datasets = len(df_train) + len(df_valid) + len(df_test)
    if len(dataset) == sum_of_datasets:
        print(f"Data split ratio is 3:1:1, where data is allocated as:\nTraining = 60% (n={len(df_train)})\nValidation = 20% (n={len(df_valid)})\nTesting = 20% (n={len(df_test)})")
    
    return df_train, df_valid, df_test

In [10]:
# Unpacking the values by calling function

df_train, df_valid, df_test = split_data_3_1_1(df, 0.40, 24681)

Data split ratio is 3:1:1, where data is allocated as:
Training = 60% (n=1928)
Validation = 20% (n=643)
Testing = 20% (n=643)
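
The subset sizes printed above can be checked by hand. The sketch below assumes that `train_test_split` rounds a fractional test size up to a whole number of rows and gives the remainder to the other side. The helper `split_sizes` is hypothetical and not part of the notebook.

```python
import math

def split_sizes(n_rows, holdout_fraction=0.40):
    """Return (train, validation, test) row counts for a 3:1:1 split.

    Assumption: at each step the held-out part is rounded up to whole
    rows and the remaining rows go to the other side.
    """
    n_holdout = math.ceil(n_rows * holdout_fraction)  # 40% held out
    n_train = n_rows - n_holdout                      # remaining 60%
    n_valid = math.ceil(n_holdout * 0.5)              # half of the held-out rows
    n_test = n_holdout - n_valid                      # the other half
    return n_train, n_valid, n_test

print(split_sizes(3214))  # (1928, 643, 643)
```

With 3,214 rows, 40% is 1,285.6 rows, which rounds up to 1,286. That leaves 1,928 rows for training and 643 each for validation and testing.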


In [11]:
def prepare_data(dataset, drop_cols, target_col):
    features = dataset.drop(drop_cols, axis=1)
    target = dataset[target_col]
    
    return features, target

In [12]:
# Determining the train, validation, and test features & target

train_features, train_target = prepare_data(df_train, ['is_ultra'], 'is_ultra')
valid_features, valid_target = prepare_data(df_valid, ['is_ultra'], 'is_ultra')
test_features, test_target = prepare_data(df_test, ['is_ultra'], 'is_ultra')

## Creating and Training Models

### Training a Tree Model at Various Depths

In [13]:
best_model_tree = None
best_result_tree = 0
best_depth_tree = 0

for depth in range(1,6):
    model = DecisionTreeClassifier(random_state=24681, max_depth=depth)
    model.fit(train_features, train_target)
    valid_predictions = model.predict(valid_features)
    result = accuracy_score(valid_target, valid_predictions) 
    
    if result > best_result_tree:
        best_model_tree = model
        best_result_tree = result
        best_depth_tree = depth

print(f"Accuracy of the best model: {best_result_tree}\nDepth of the best model: {best_depth_tree}")

Accuracy of the best model: 0.8009331259720062
Depth of the best model: 3


### Training a Random Forest at Various Estimators

In [14]:
best_model_forest = None
best_result_forest = 0
best_est_forest = 0

for est in range(1, 20):
    model = RandomForestClassifier(random_state=24681, n_estimators=est)
    model.fit(train_features, train_target)
    valid_predictions = model.predict(valid_features)
    result = accuracy_score(valid_target, valid_predictions) 
    if result > best_result_forest:
        best_model_forest = model
        best_result_forest = result
        best_est_forest = est
        
print(f"Accuracy of the best model: {best_result_forest}\nNumber of estimators of the best model: {best_est_forest}")

Accuracy of the best model: 0.7993779160186625
Number of estimators of the best model: 19


### Training a Logistic Regression Model using 'liblinear'

In [15]:
def run_log_regression(rnd_state=42, solv='liblinear'):
    model = LogisticRegression(random_state=rnd_state, solver=solv)
    model.fit(train_features, train_target) 
    valid_predictions = model.predict(valid_features)
    result = accuracy_score(valid_target, valid_predictions)
    print(f"Accuracy of the model: {result}")
    
run_log_regression(24681)

Accuracy of the model: 0.7325038880248833


**Observations**

The training data was utilized to train three distinct models:
1. `DecisionTreeClassifier`
2. `RandomForestClassifier`
3. `LogisticRegression`

Each model was then scored on the validation data. The Decision Tree with a maximum depth of 3 was the most accurate (80.09%). The Random Forest, searched over 1 to 19 estimators, followed closely (79.94%). Logistic Regression reached only 73.25%, below Megaline's 75% threshold.
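
As a quick check of this comparison, the sketch below ranks the three validation accuracies copied from the cell outputs and flags any model under Megaline's 75% threshold. The dictionary `validation_accuracy` and the constant `THRESHOLD` are illustrative names, not objects from the notebook.

```python
# Validation accuracies copied from the cell outputs above
validation_accuracy = {
    "DecisionTreeClassifier": 0.8009331259720062,
    "RandomForestClassifier": 0.7993779160186625,
    "LogisticRegression": 0.7325038880248833,
}
THRESHOLD = 0.75  # Megaline's minimum accuracy

# Rank models from most to least accurate
ranking = sorted(validation_accuracy, key=validation_accuracy.get, reverse=True)
best_model_name = ranking[0]

# Models that miss the threshold on the validation set
below_threshold = [name for name in ranking if validation_accuracy[name] < THRESHOLD]

for name in ranking:
    print(f"{name}: {validation_accuracy[name]:.2%}")
print("Best:", best_model_name)
print("Below threshold:", below_threshold)
```

The gap between the two tree-based models is about 0.16 percentage points, which corresponds to a single validation row out of 643, so the ranking between them should be read with some caution.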

## Final Model

Based on the validation results, the `DecisionTreeClassifier` with a maximum depth of 3 is the best model for this data.

In [16]:
# Creating and training the final model

final_model = DecisionTreeClassifier(random_state=24681, max_depth=3)
final_model.fit(train_features, train_target)

# Making predictions using testing data
test_predictions = final_model.predict(test_features)

# Checking accuracy using test target and test predictions
result = accuracy_score(test_target, test_predictions)

print(f"The accuracy score is: {result}")

The accuracy score is: 0.8133748055987559


In [17]:
def error_count(answers, predictions):
    '''Returns the number of positions where the prediction differs from the true answer'''
    return int(np.sum(np.asarray(answers) != np.asarray(predictions)))

answers = test_target.values

# Count number of errors
print('Errors:', error_count(answers, test_predictions))

Errors: 120


**Observations**

The final model is a `DecisionTreeClassifier` with a maximum depth of 3, trained on the training dataset. Its accuracy was measured on the test dataset, which was not used during training or validation. The final model scored 81.34% accuracy, misclassifying 120 of the 643 test subscribers.
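
The accuracy score and the error count describe the same result, which a short calculation confirms. This sketch uses only the figures reported in the outputs (643 test rows, 120 errors); the variable names are illustrative.

```python
n_test = 643    # rows in the test set
n_errors = 120  # misclassified rows reported by error_count

n_correct = n_test - n_errors  # 523 correct predictions
accuracy = n_correct / n_test  # fraction of correct predictions

print(f"Correct: {n_correct}")
print(f"Accuracy: {accuracy:.4f}")  # 0.8134
```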

## Conclusion

The `DecisionTreeClassifier` model met the project goal. Trained on the training dataset and evaluated on a previously untouched test dataset, it achieved an accuracy of 81.34%, above the 75% threshold set by Megaline. It misclassified 120 of the 643 test subscribers, and those errors are a natural starting point for further refinement. The model gives Megaline a working basis for recommending the Smart or Ultra plan to subscribers who are still on legacy plans.