# Mobile Carrier 

# Project description
Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Phone Plan Analysis). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.

Goal:
Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.


# Data description
Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:

* calls — number of calls,
* minutes — total call duration in minutes,
* messages — number of text messages,
* mb_used — Internet traffic used in MB,
* is_ultra — plan for the current month (Ultra - 1, Smart - 0).

In [1]:
import pandas as pd
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [2]:
# try the local copy first, then fall back to the platform path
try:
    df = pd.read_csv('users_behavior.csv')
except FileNotFoundError:
    df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
#looking into the first 5 rows of our dataset
df.head()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


There are no missing values. The calls and messages columns hold whole numbers stored as floats, so they can be converted to integers.

In [5]:
# change calls and messages from floats to ints
df['calls'] = df['calls'].astype(int)
df['messages'] = df['messages'].astype(int)

Since the data was already cleaned during our earlier EDA, we can move on to model building.

### Splitting the data


We will split the data into training, validation, and test sets.
First, 20% of the data is held out as the test set. The remaining 80% is then split again, 80% for training and 20% for validation, which gives roughly 64% train, 16% validation, and 20% test.

In [6]:
features = df.drop(columns=['is_ultra'])
target =  df['is_ultra']

# split into train (80%) and test (20%)
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345)

# split the training part again into train (80%) and validation (20%)
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train, target_train, test_size=0.2, random_state=12345)

# check the sizes of the training, validation, and test sets
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)

(2056, 4)
(2056,)
(515, 4)
(515,)
(643, 4)
(643,)
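
The shapes above follow from applying a 20% hold-out twice. As a rough check, the arithmetic can be sketched in plain Python, assuming the held-out share is rounded up to a whole row at each step (which is consistent with the printed shapes). `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_rows, test_size=0.2, valid_size=0.2):
    """Return (train, valid, test) row counts for two successive hold-out splits.

    Assumption: the held-out share is rounded up to a whole row at each step.
    """
    n_test = math.ceil(n_rows * test_size)
    n_remaining = n_rows - n_test
    n_valid = math.ceil(n_remaining * valid_size)
    n_train = n_remaining - n_valid
    return n_train, n_valid, n_test

train, valid, test = split_sizes(3214)
print(train, valid, test)
# shares of the full dataset: roughly 64% / 16% / 20%
print(round(train / 3214, 2), round(valid / 3214, 2), round(test / 3214, 2))
```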


### Models and Hyperparameters 

#### Decision tree model

In [7]:
# decision tree with default hyperparameters
dt = DecisionTreeClassifier()
dt.fit(features_train, target_train)
dt_predictions_valid = dt.predict(features_valid)  # predictions on the validation set
print(accuracy_score(target_valid, dt_predictions_valid))

0.6990291262135923


In [8]:
# decision tree with max_depth tuned on the validation set
highest_score = 0
for depth in range(1, 11):
    dt = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    dt.fit(features_train, target_train)
    dt_predictions_valid1 = dt.predict(features_valid)  # predictions on the validation set
    dt_acc = accuracy_score(target_valid, dt_predictions_valid1)
    print("max_depth =", depth, ": ", dt_acc)

    # remember the depth with the best validation accuracy so far
    if dt_acc > highest_score:
        highest_score = dt_acc
        highest_depth = depth
print('The max depth: ', highest_depth, ' The accuracy Score', highest_score)

max_depth = 1 :  0.7223300970873786
max_depth = 2 :  0.7475728155339806
max_depth = 3 :  0.7553398058252427
max_depth = 4 :  0.7533980582524272
max_depth = 5 :  0.7572815533980582
max_depth = 6 :  0.7611650485436893
max_depth = 7 :  0.7650485436893204
max_depth = 8 :  0.7631067961165049
max_depth = 9 :  0.7533980582524272
max_depth = 10 :  0.7592233009708738
The max depth:  7  The accuracy Score 0.7650485436893204


The decision tree reaches its best validation accuracy, about 77%, at max_depth = 7.
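
Once the validation scores are collected, picking the best setting does not need a running maximum. A small sketch, using the accuracies printed above (rounded to four decimals) in a hypothetical dictionary named `depth_scores`:

```python
# Validation accuracy per max_depth, copied from the output above (rounded).
depth_scores = {
    1: 0.7223, 2: 0.7476, 3: 0.7553, 4: 0.7534, 5: 0.7573,
    6: 0.7612, 7: 0.7650, 8: 0.7631, 9: 0.7534, 10: 0.7592,
}

# max() with key=... returns the depth whose score is largest.
best_depth = max(depth_scores, key=depth_scores.get)
print(best_depth, depth_scores[best_depth])
```

The same pattern works for any single hyperparameter, such as n_estimators for a random forest.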

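#### Logistic Regression
Logistic regression is a classification model. It predicts a probability between 0 and 1 that an observation belongs to the positive class (here, Ultra) and assigns the class by comparing that probability with a threshold.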
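
To make the idea concrete, here is a minimal sketch of how a fitted logistic regression turns features into a probability and then into a class. The weights and intercept below are made-up numbers for illustration only, not values learned from this dataset.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_ultra(features, weights, intercept):
    """Probability of class 1 (Ultra) for one subscriber: sigmoid of a weighted sum."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for (calls, minutes, messages, mb_used).
weights = [0.01, 0.001, 0.005, 0.00005]
intercept = -3.0

row = [40, 311.9, 83, 19915.42]  # first row of the dataset shown earlier
p = predict_proba_ultra(row, weights, intercept)
label = 1 if p >= 0.5 else 0     # 0.5 is the usual default cut-off
print(round(p, 3), label)
```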

In [9]:
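#logistic regression with default hyperparameters
lr = LogisticRegression()
lr.fit(features_train, target_train)
# predict with lr itself; the cell originally called dt.predict, which is
# how the decision tree's score ended up in the output recorded below
lr_predictions_valid = lr.predict(features_valid)
print("accuracy score", accuracy_score(target_valid, lr_predictions_valid))

accuracy score 0.7592233009708738


In [10]:
#logistic regression with the liblinear solver
log_reg1 = LogisticRegression(random_state=12345, solver='liblinear')
log_reg1.fit(features_train, target_train)
lr_predictions_valid1 = log_reg1.predict(features_valid)  # predictions on the validation set
accuracy_score(target_valid, lr_predictions_valid1)

0.7592233009708738

The recorded score is 76% in both cells. That value is identical to the depth-10 decision tree from In [8] (0.7592233009708738) because both cells, as originally run, predicted with dt, the last tree fitted in that loop. Logistic regression's own validation accuracy has to be read from a re-run with lr.predict and log_reg1.predict.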

#### Random Forest

In [11]:
#random forest with default hyperparameters
rf_model2 = RandomForestClassifier(random_state=12345)
rf_model2.fit(features_train, target_train)
rf_predictions_valid2 = rf_model2.predict(features_valid)
accuracy_score(target_valid, rf_predictions_valid2)

0.7864077669902912

In [12]:
#find the best number of trees (n_estimators) on the validation set
highest_score = 0.0
for estimators in range(1, 25):
    rf_model = RandomForestClassifier(random_state=12345, n_estimators=estimators)
    rf_model.fit(features_train, target_train)
    rf_predictions_valid = rf_model.predict(features_valid)
    rf_acc = accuracy_score(target_valid, rf_predictions_valid)
    print("n_estimators =", estimators, ": ", rf_acc)
    if rf_acc > highest_score:
        highest_score = rf_acc
        best_n_estimators = estimators
print('The n_estimators: ', best_n_estimators, ' The accuracy Score', highest_score)


n_estimators = 1 :  0.7223300970873786
n_estimators = 2 :  0.7398058252427184
n_estimators = 3 :  0.7436893203883496
n_estimators = 4 :  0.7533980582524272
n_estimators = 5 :  0.7398058252427184
n_estimators = 6 :  0.7553398058252427
n_estimators = 7 :  0.7553398058252427
n_estimators = 8 :  0.7669902912621359
n_estimators = 9 :  0.7708737864077669
n_estimators = 10 :  0.7650485436893204
n_estimators = 11 :  0.7708737864077669
n_estimators = 12 :  0.7669902912621359
n_estimators = 13 :  0.7728155339805826
n_estimators = 14 :  0.7669902912621359
n_estimators = 15 :  0.7689320388349514
n_estimators = 16 :  0.7689320388349514
n_estimators = 17 :  0.7728155339805826
n_estimators = 18 :  0.7631067961165049
n_estimators = 19 :  0.7728155339805826
n_estimators = 20 :  0.7728155339805826
n_estimators = 21 :  0.7708737864077669
n_estimators = 22 :  0.7708737864077669
n_estimators = 23 :  0.7689320388349514
n_estimators = 24 :  0.7766990291262136
The n_estimators:  24  The accuracy Score 0.77669

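The random forest reaches a validation accuracy of about 78% with n_estimators = 24 (0.7767) and about 79% with the default settings (0.7864). Both are higher than the scores recorded for logistic regression and the decision tree.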

### Model Testing


In [13]:
#test the chosen model: random forest with n_estimators=24
rf1 = RandomForestClassifier(random_state=12345, n_estimators=24)
rf1.fit(features_train, target_train)
rf_test_predictions = rf1.predict(features_test)
print('Test accuracy score:', accuracy_score(target_test, rf_test_predictions))

Test accuracy score: 0.7822706065318819


Testing the random forest classifier on the held-out test set gives an accuracy of about 78%. In other words, the model assigned the correct plan to about 78% of the subscribers in the test set, which meets the project threshold of 0.75.
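
The test set has 643 rows, so the 78% figure carries some sampling uncertainty. As a rough illustration (a normal-approximation interval, which is a simplification and not part of the original analysis), the range can be sketched as:

```python
import math

n_test = 643                      # rows in the test set
accuracy = 0.7822706065318819     # test accuracy reported above

# Standard error of a proportion, then an approximate 95% interval.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
lower = accuracy - 1.96 * std_err
upper = accuracy + 1.96 * std_err
print(round(lower, 3), round(upper, 3))
```

The lower end of this interval sits close to the 0.75 threshold, so the margin over the requirement is modest.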

### Sanity Check

#### Dummy Classifier 

In [22]:
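# strategy='stratified' draws labels at random from the training class distribution;
# no random_state is set, so the score changes slightly between runs
dummy_classifier_model = DummyClassifier(strategy='stratified')
dummy_classifier_model.fit(features_train, target_train)
dummy_prediction = dummy_classifier_model.predict(features_valid)
# compare the dummy predictions with the true validation labels
# (score() expects features and true labels, not labels and predictions)
print("accuracy on validation set: ", accuracy_score(target_valid, dummy_prediction))


mean accuracy on train set:  0.5825242718446602


The dummy classifier gives a measure of baseline performance. In the run recorded above, made with the original score(target_valid, dummy_predicition) call, it scored about 58%. A useful model has to do clearly better than that, and the random forest's 78% to 79% does.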
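
For comparison, two simple baselines can be worked out by hand from the validation class counts (348 Smart and 167 Ultra, as the support column of the classification report later in the notebook shows). This sketch assumes the training set has roughly the same class mix as the validation set.

```python
n_smart, n_ultra = 348, 167          # validation class counts
n_total = n_smart + n_ultra

# Always predicting the majority class (Smart).
majority_accuracy = n_smart / n_total

# Guessing at random with the class proportions (what strategy='stratified' does):
# a guess is right when guess and truth are both Smart or both Ultra.
p_ultra = n_ultra / n_total
stratified_expected = (1 - p_ultra) ** 2 + p_ultra ** 2

print(round(majority_accuracy, 3), round(stratified_expected, 3))
```

Always predicting Smart is a stronger baseline than stratified guessing, and the random forest still clears it.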

Below we define a function that shows how well the algorithm works. It prints the confusion matrix, precision, recall, F1-score, accuracy score, balanced accuracy score, and ROC AUC score.

In [24]:
# fit a random forest and print the confusion matrix and classification metrics on the validation set
def check(features_train, target_train, features_valid, target_valid):
    model = RandomForestClassifier(random_state=12345)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid) 
    print('Confusion Matrix')
    print(confusion_matrix(target_valid, predictions_valid))
    print('Recall: ', recall_score(target_valid, predictions_valid))
    print('Precision: ', precision_score(target_valid, predictions_valid))
    print('F1-score: ', f1_score(target_valid, predictions_valid))
    print('Accuracy Score: ', accuracy_score(target_valid, predictions_valid))
    print('Balanced Accuracy Score: ', balanced_accuracy_score(target_valid, predictions_valid))
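    # ROC AUC is computed here from hard 0/1 predictions, so for this binary task it
    # equals balanced accuracy; scores from model.predict_proba(...)[:, 1] would be more informative
    print('ROC Score: ', roc_auc_score(target_valid, predictions_valid))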
    print('Classification report')
    print(classification_report(target_valid, predictions_valid))

In [25]:
# display sanity check
check(features_train, target_train, features_valid, target_valid)

Confusion Matrix
[[322  26]
 [ 84  83]]
Recall:  0.49700598802395207
Precision:  0.7614678899082569
F1-score:  0.6014492753623188
Accuracy Score:  0.7864077669902912
Balanced Accuracy Score:  0.7111466721728956
ROC Score:  0.7111466721728956
Classification report
              precision    recall  f1-score   support

           0       0.79      0.93      0.85       348
           1       0.76      0.50      0.60       167

    accuracy                           0.79       515
   macro avg       0.78      0.71      0.73       515
weighted avg       0.78      0.79      0.77       515



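The precision of 0.76 means that about 76% of the subscribers the model assigns to Ultra are actually on Ultra. Recall is weaker: at 0.50, the model finds only about half of the real Ultra subscribers, which pulls the F1-score down to 0.60. Overall accuracy on the validation set is still 79%, largely because the model is right about Smart, the larger class, 93% of the time.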
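
Each of these metrics can be recomputed by hand from the four cells of the confusion matrix, which makes it easier to see where the weak recall comes from. A short sketch using the counts printed above:

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions.
tn, fp = 322, 26   # true Smart: predicted Smart / predicted Ultra
fn, tp = 84, 83    # true Ultra: predicted Smart / predicted Ultra

recall = tp / (tp + fn)                    # share of real Ultra users found
precision = tp / (tp + fp)                 # share of Ultra predictions that are right
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
specificity = tn / (tn + fp)               # recall for the Smart class
balanced_accuracy = (recall + specificity) / 2

print(recall, precision, f1, accuracy, balanced_accuracy)
```

The 84 Ultra subscribers predicted as Smart are what hold recall at about 0.50.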

### Conclusion

In this classification project, we developed a model that analyzes subscribers' behavior and recommends one of Megaline's newer plans. Of the models we tried, the random forest gave the best accuracy: about 79% on the validation set and 78% on the test set, compared with a baseline of 58% from the dummy classifier. This can help Megaline recommend its newer plans to customers more reliably, which may in turn improve the company's turnover.