# Project description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra. 

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.  

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.  

## Data description
Every observation in the dataset contains monthly behavior information about one user. The information given is as follows: 

calls — number of calls,
minutes — total call duration in minutes,
messages — number of text messages,
mb_used — Internet traffic used in MB,
is_ultra — plan for the current month (Ultra - 1, Smart - 0).

In [1]:
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error

In [2]:
# Define the paths
local_path = 'users_behavior.csv'
server_path = '/datasets/users_behavior.csv'

# Try to load the data from the local path first, if that fails, try the server path
try:
    df = pd.read_csv(local_path)
except FileNotFoundError:
    try:
        df = pd.read_csv(server_path)
    except FileNotFoundError:
        print("File not found in both local and server paths. Please check the file locations.")

# If the file is loaded successfully, display the first few rows
if 'df' in locals():
    print(df.head())

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [3]:
df.head()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64
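
Since the later sections look at per-class metrics, it can help to check how the two plans are distributed before training. The following is a minimal sketch, assuming a small hand-built DataFrame (`sample_df`, a hypothetical name) that reuses only the five `is_ultra` values printed by `df.head()`. On the real data the same call would be made on `df['is_ultra']`.

```python
import pandas as pd

# Hypothetical stand-in: the is_ultra values of the five rows shown by df.head().
sample_df = pd.DataFrame({"is_ultra": [0, 0, 0, 1, 0]})

# Share of each plan: label 0 is Smart, label 1 is Ultra.
class_share = sample_df["is_ultra"].value_counts(normalize=True)
print(class_share)
```

Five rows are far too few to say anything about the full dataset; the point is only the shape of the check.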

In [6]:
features_df = df.drop(columns='is_ultra')

In [7]:
target_df = df['is_ultra']

## Measuring Accuracy

Split the data into training and test sets. As a rule of thumb, around 70-80% of the data goes to training and the remaining 20-30% to testing. Here the test share is 20%.

In [8]:
features_train, features_test, target_train, target_test = train_test_split(features_df, target_df, test_size=0.2, 
                                                                            random_state=12345)
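
If hyperparameters are tuned later, a separate validation set keeps the test set untouched until the final check. The following is a sketch of a 60/20/20 split, assuming a synthetic stand-in dataset from `make_classification` (not the Megaline data). The names `X`, `y`, `X_rest` and `X_valid` are illustrative, and `stratify` is used so that each part keeps similar class proportions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for features_df / target_df (assumption: 4 features, ~30% positives).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3], random_state=12345)

# First hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=12345)

# Then take 25% of the remainder (20% of the total) as the validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=12345)

print(len(X_train), len(X_valid), len(X_test))
```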

## Training a RandomForest Model

In [9]:
model = RandomForestClassifier(n_estimators = 100, max_depth = 10, random_state=12345)

model.fit(features_train, target_train)
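
The imports at the top include `DecisionTreeClassifier` and `LogisticRegression`, which suggests comparing several model types before settling on one. The following is a sketch of such a comparison, assuming synthetic stand-in data and a validation split. The hyperparameter values and the names `candidates` and `valid_scores` are illustrative, and the scores it prints say nothing about the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split into training and validation parts.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345)

# Candidate models; the settings are examples, not tuned values.
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=12345),
    "random_forest": RandomForestClassifier(n_estimators=100, max_depth=10,
                                            random_state=12345),
    "logistic_regression": LogisticRegression(max_iter=1000, random_state=12345),
}

# Fit each candidate on the training part and score it on the validation part.
valid_scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    valid_scores[name] = accuracy_score(y_valid, candidate.predict(X_valid))

best_name = max(valid_scores, key=valid_scores.get)
print(valid_scores, best_name)
```

Only the model chosen this way would then be scored once on the test set.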

## Predictions and Accuracy

In [10]:
target_pred = model.predict(features_test)
accuracy = accuracy_score(target_test, target_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.7962674961119751
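
To judge whether this accuracy is meaningful, it can be compared with a constant model that always predicts the majority class. The following is a quick sanity check using the test-set class counts from the support column of the classification report further down (447 Smart, 196 Ultra). The variable names are illustrative.

```python
# Test-set class counts, taken from the "support" column of the classification report.
n_smart, n_ultra = 447, 196

# A constant model that always predicts the majority class (Smart)
# is right for every Smart user and wrong for every Ultra user.
baseline_accuracy = max(n_smart, n_ultra) / (n_smart + n_ultra)

model_accuracy = 0.7962674961119751  # value printed by the cell above
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Improvement over baseline: {model_accuracy - baseline_accuracy:.3f}")
```

The random forest is about 10 percentage points above this baseline of roughly 0.695.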


## Delving deeper into the model's performance

I am going to examine the confusion matrix and the classification report. These tools give more insight into how well the model is performing, particularly its ability to predict each class (Ultra and Smart).

Confusion Matrix: This shows the counts of correct and incorrect predictions broken down by class, so it reveals how many observations of each class are misclassified.

Classification Report: This report includes precision, recall and F1 score for each class, which show whether the model performs equally well on both classes. Its support column gives the number of test observations per class, which shows whether the classes are imbalanced.

In [11]:
from sklearn.metrics import confusion_matrix, classification_report

# Generating the confusion matrix
conf_matrix = confusion_matrix(target_test, target_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Generating the classification report
class_report = classification_report(target_test, target_pred)
print("Classification Report:")
print(class_report)

Confusion Matrix:
[[411  36]
 [ 95 101]]
Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.92      0.86       447
           1       0.74      0.52      0.61       196

    accuracy                           0.80       643
   macro avg       0.77      0.72      0.73       643
weighted avg       0.79      0.80      0.78       643
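
The per-class figures in the report can be reproduced by hand from the confusion matrix, which makes their meaning concrete. The following is a minimal sketch in plain Python; `per_class_metrics` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
# Confusion matrix printed above: rows are actual classes, columns are predicted classes.
conf_matrix = [[411, 36],
               [95, 101]]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for class index `cls` of a square confusion matrix."""
    true_pos = matrix[cls][cls]
    predicted_as_cls = sum(row[cls] for row in matrix)  # column sum
    actual_cls = sum(matrix[cls])                       # row sum
    precision = true_pos / predicted_as_cls
    recall = true_pos / actual_cls
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for cls, plan in enumerate(["Smart", "Ultra"]):
    precision, recall, f1 = per_class_metrics(conf_matrix, cls)
    print(f"{plan}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Accuracy is the diagonal (correct predictions) divided by all predictions.
accuracy = (conf_matrix[0][0] + conf_matrix[1][1]) / sum(map(sum, conf_matrix))
print(f"accuracy={accuracy:.4f}")
```

For example, Ultra recall is 101 / (95 + 101), which is about 0.52, and Ultra precision is 101 / (36 + 101), which is about 0.74.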



# Conclusions

Confusion Matrix Breakdown:

The model correctly predicted 411 users as Smart plan users.
The model incorrectly predicted 36 users as Ultra plan users when they were actually Smart plan users.

The model correctly predicted 101 users as Ultra plan users.
The model incorrectly predicted 95 users as Smart plan users when they were actually Ultra plan users.

    -Smart Plan

Precision for Smart Plan (class 0): 81% of the predictions made for the Smart Plan were correct.
Recall for Smart Plan (class 0): 92% of the actual Smart Plan users were correctly identified.
F1 Score for Smart Plan (class 0): 0.86; this metric is the harmonic mean of precision and recall.

    -Ultra Plan
    
Precision for Ultra Plan (class 1): 74% of the predictions made for the Ultra Plan were correct.
Recall for Ultra Plan (class 1): 52% of the actual Ultra Plan users were correctly identified.
F1 Score for Ultra Plan (class 1): 0.61; this metric is the harmonic mean of precision and recall.

Overall:

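The model reaches an accuracy of about 0.80 on the test set, above the project threshold of 0.75. It performs very well at identifying Smart Plan users, with high precision and recall. For Ultra Plan users, however, the model struggles, specifically in terms of recall: with a recall of 0.52, nearly half of the actual Ultra users are predicted as Smart.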
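
One common way to trade precision for recall on the weaker class is to lower the decision threshold applied to the predicted probabilities. The notebook does not do this; the following is only a sketch of the idea, assuming synthetic stand-in data from `make_classification` and the hypothetical names `demo_model` and `recall_by_threshold`. Its numbers do not describe the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: ~30% positives).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3], random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=12345)

demo_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=12345)
demo_model.fit(X_train, y_train)

# Probability of the positive class (column 1 corresponds to demo_model.classes_[1]).
proba_positive = demo_model.predict_proba(X_test)[:, 1]

# Lowering the threshold labels more users as positive, so recall cannot decrease.
recall_by_threshold = {}
for threshold in (0.5, 0.4, 0.3):
    pred = (proba_positive >= threshold).astype(int)
    recall_by_threshold[threshold] = recall_score(y_test, pred)
    precision = precision_score(y_test, pred, zero_division=0)
    print(f"threshold={threshold}: "
          f"recall={recall_by_threshold[threshold]:.2f} precision={precision:.2f}")
```

A threshold should be chosen on a validation set rather than the test set, and the gain in recall usually comes with lower precision.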