# The Right Phone Plan for Megaline Users

As analysts at Megaline, our job is to analyze behavior data on subscribers who have already switched to the new plans.

##  Introduction
This report analyzes behavior data for users who have switched from legacy plans to one of the newer ones.

###  Goal:
This report focuses on developing a model that analyzes subscribers' behavior and recommends one of Megaline's newer plans: Smart or Ultra.
The final model should have the highest possible accuracy, and the accuracy threshold is set at 0.75.
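For reference, accuracy is the share of predictions that match the true plan. The sketch below spells this out in plain Python. The names `accuracy` and `meets_threshold`, the sample labels, and the choice to treat a score equal to the threshold as passing are assumptions made for illustration; the notebook itself relies on scikit-learn's `accuracy_score`.

```python
# Illustrative sketch: accuracy as the fraction of matching labels.
# Function names and sample data are assumptions made for this example.

ACCURACY_THRESHOLD = 0.75  # threshold stated in the project goal


def accuracy(y_true, y_pred):
    """Return the share of positions where the prediction equals the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def meets_threshold(score, threshold=ACCURACY_THRESHOLD):
    """Assumption: a score equal to the threshold counts as passing."""
    return score >= threshold


# Made-up labels: 1 = Ultra, 0 = Smart
true_plans = [0, 0, 1, 0, 1, 0, 0, 1]
predicted_plans = [0, 0, 1, 0, 0, 0, 1, 1]

score = accuracy(true_plans, predicted_plans)  # 6 of 8 labels match
print(score, meets_threshold(score))
```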


### Stages:
This project will consist of the following stages:

1. Introduction
2. General Information
3. Split Dataset
    1. Training
    2. Validation
    3. Test 
4. Model Testing
    1. Decision Tree
    2. Random Forest
    3. Logistic Regression
5. Quality Check
6. Sanity Check
7. Conclusion
         

## General Information

Import all libraries and modules.

In [35]:
# import

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier

Let's open the file and study the dataset.

In [23]:
#import dataset

users = pd.read_csv('/datasets/users_behavior.csv')

#first five rows of the dataset
print(users.head())

#general info about the dataset
users.info()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


Our dataset contains the following columns:
- `calls`: number of calls
- `minutes`: total call duration in minutes
- `messages`: number of text messages
- `mb_used`: Internet traffic used in MB
- `is_ultra`: plan for the current month (Ultra = 1, Smart = 0)

Our goal is to develop a model that picks which of the newer plans (Smart or Ultra) best fits a customer based on their behavior. Since no separate test dataset is provided, we will split our dataset in a 3:1:1 ratio into a **training dataset**, a **validation dataset**, and a **test dataset**. Our features will be `calls`, `minutes`, `messages`, and `mb_used`; the target will be `is_ultra`.

## Split Dataset

In [24]:
#target and feature creation

target = users['is_ultra']
features = users.drop(['is_ultra'],axis = 1)

First we create our target and features, which makes the splitting easier.

In [25]:
#split the dataset

features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size = 0.4, random_state = 12345)
features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size = 0.5, random_state = 12345)

print(features_train.shape,
     features_valid.shape,
     features_test.shape)

(1928, 4) (643, 4) (643, 4)


Here we can see that the dataset has been split 3:1:1 (1928, 643, and 643 rows). Each feature set has 4 columns because `is_ultra` was separated out as the target.
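The row counts can be checked by hand. The sketch below reproduces the arithmetic of the two-step split under one assumption: the held-out part is rounded up and the remainder goes to training. This is consistent with the shapes printed above. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, test_fraction):
    """Return (train_rows, held_out_rows) for a fractional hold-out split.

    Assumption: the held-out part is rounded up and the remainder
    goes to the training part.
    """
    held_out_rows = math.ceil(n_rows * test_fraction)
    return n_rows - held_out_rows, held_out_rows


total_rows = 3214  # rows reported by users.info()

# First split: 60% training, 40% temporary hold-out
train_rows, temp_rows = split_sizes(total_rows, 0.4)

# Second split: the hold-out is halved into validation and test
valid_rows, test_rows = split_sizes(temp_rows, 0.5)

print(train_rows, valid_rows, test_rows)  # 1928 643 643
```

Holding out 40% and then halving it gives 60% / 20% / 20%, which is the 3:1:1 ratio.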

## Model Testing

We will test three models: Decision Tree, Random Forest, and Logistic Regression. All three are classifiers, which fits our task because the target is categorical (0 or 1).

### Decision Tree Classifier

We will tune the `max_depth` hyperparameter to find the value that gives the best accuracy on the validation set. We keep `random_state = 12345`, the same value used for the dataset split, so the results are reproducible.

In [31]:
#create a loop to test max_depth

best_depth = 0
accuracy = 0.75  # start at the 0.75 threshold: only depths that beat it are recorded
for depth in range (1,11):
    tree_model = DecisionTreeClassifier(random_state = 12345, max_depth = depth)
    tree_model.fit(features_train, target_train)
    tree_prediction = tree_model.predict(features_valid)
    tree_accuracy = accuracy_score(target_valid, tree_prediction)
    # keep the depth with the highest validation accuracy so far
    if tree_accuracy > accuracy:
        accuracy = tree_accuracy
        best_depth = depth

print('Max depth of', best_depth, ',', 'Accuracy = ', accuracy)

Max depth of 3 , Accuracy =  0.7853810264385692


On the validation set, the Decision Tree accuracy is in the 75-78% range. Of all the depths tested (1 through 10), **max depth = 3** gives the best accuracy at 78.5%, so we will use this setting.
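The selection rule in the loop has two details worth spelling out: the comparison is strict, so a tie keeps the earlier (shallower) depth, and the search starts from 0.75, so a run where no depth beats the threshold reports no winner. The sketch below isolates that rule. The helper `pick_best` and the scores are hypothetical, and it returns `None` where the notebook would leave `best_depth` at 0.

```python
def pick_best(scores, floor=0.75):
    """Return (best_key, best_score) using a strict '>' comparison.

    `scores` maps a hyperparameter value to its validation accuracy.
    `floor` is the starting score to beat; if nothing beats it, the
    result is (None, floor).
    """
    best_key, best_score = None, floor
    for key, score in scores.items():
        if score > best_score:  # strict: a tie keeps the earlier key
            best_key, best_score = key, score
    return best_key, best_score


# Hypothetical validation accuracies per max_depth (not the real run)
made_up_scores = {1: 0.74, 2: 0.77, 3: 0.79, 4: 0.79, 5: 0.78}

print(pick_best(made_up_scores))      # (3, 0.79): depth 3 wins the tie with depth 4
print(pick_best({1: 0.70, 2: 0.74}))  # (None, 0.75): nothing beats the floor
```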

### Random Forest Classifier

We will tune two hyperparameters, `max_depth` and `n_estimators`, to find the combination that gives the best accuracy on the validation set. As before, we keep `random_state = 12345`.

In [27]:
#create a loop to test max_depth and n_estimators
best_est = 0
best_depth = 0
accuracy = 0.75
for depth in range (1,11):
    for est in range (10, 101, 10):
        forest_model = RandomForestClassifier(random_state = 12345, max_depth = depth, n_estimators = est)
        forest_model.fit(features_train, target_train)
        forest_prediction = forest_model.predict(features_valid)
        forest_accuracy = accuracy_score(target_valid, forest_prediction)
        if forest_accuracy > accuracy:
            accuracy = forest_accuracy
            best_est = est
            best_depth = depth
print('Max depth of', best_depth, 'and n_estimators of', best_est, 'Accuracy = ', accuracy)

Max depth of 8 and n_estimators of 40 Accuracy =  0.8087091757387247


The results suggest that a **max_depth of 8 and n_estimators of 40** give 80.87% accuracy on the validation set, which is higher than the best accuracy the Decision Tree achieved (78.5%).
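To get a feel for the cost of this search, the sketch below counts the grid points covered by the two nested loops and the total number of trees they build. It uses only the two ranges from the cell above; the variable names are illustrative.

```python
from itertools import product

depths = range(1, 11)            # max_depth values 1..10
estimators = range(10, 101, 10)  # n_estimators values 10, 20, ..., 100

# Every (depth, est) pair visited by the nested loops
grid = list(product(depths, estimators))

# Each grid point is one fitted forest; each forest holds `est` trees
forests_fitted = len(grid)
trees_fitted = sum(est for _, est in grid)

print(forests_fitted, trees_fitted)  # 100 5500
```

By comparison, the Decision Tree search above fits 10 single trees.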

### Logistic Regression

Logistic Regression needs little tuning here: we use the default solver and keep the same `random_state`.

In [33]:
#Logistic Regression

logistic_model = LogisticRegression(random_state = 12345)
logistic_model.fit(features_train, target_train)
logistic_prediction = logistic_model.predict(features_valid)
print('Accuracy for Logistic Regression is:', accuracy_score(target_valid, logistic_prediction))

Accuracy for Logistic Regression is: 0.7107309486780715


The accuracy of logistic regression is 71.1%, which is lower than both the Random Forest and the Decision Tree, and below the 0.75 threshold.

**Best Model**

Of the three models tested, logistic regression is the fastest but has the lowest accuracy. The decision tree ranks second in both accuracy and speed, while the Random Forest is the most accurate but also the most time-consuming. Since our goal is the highest accuracy, we will use **RandomForest** with the hyperparameters found above (`max_depth = 8`, `n_estimators = 40`).
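One way to measure the speed side of this comparison is to time each training call. The sketch below shows a small timing helper built on `time.perf_counter`. The helper `timed` and the stand-in workload `slow_sum` are hypothetical; in the notebook, the wrapped call could be something like `model.fit(features_train, target_train)`.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload used only to demonstrate the helper
def slow_sum(n):
    return sum(range(n))


result, seconds = timed(slow_sum, 100_000)
print(result, f"{seconds:.4f} s")
```

Timings vary between runs and machines, so repeating the measurement a few times gives a steadier picture.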

## Quality Check

In [34]:
#Quality check with test_dataset

model = RandomForestClassifier(random_state = 12345, n_estimators = 40, max_depth = 8)
model.fit(features_train, target_train)
prediction = model.predict(features_test)

print('The accuracy for our model using test dataset is:', accuracy_score(target_test, prediction))

The accuracy for our model using test dataset is: 0.7962674961119751


Our RandomForest model reaches 79.6% accuracy on the test set, which is above our accuracy threshold of 75%. Therefore, the model passes the quality check.
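It can help to read the test accuracy as a number of customers. The sketch below converts the printed accuracy back into a count of correct predictions on the 643-row test set and compares it with the count implied by the 0.75 threshold. The helper name `correct_predictions` is hypothetical.

```python
def correct_predictions(accuracy, n_rows):
    """Convert an accuracy back into a count of correct predictions."""
    return round(accuracy * n_rows)


test_rows = 643  # size of the test set printed after the split
forest_test_accuracy = 0.7962674961119751  # printed by the cell above

hits = correct_predictions(forest_test_accuracy, test_rows)
needed = 0.75 * test_rows  # correct rows implied by the 0.75 threshold

print(hits, test_rows - hits, needed)  # 512 131 482.25
```

The model classifies 512 of 643 test customers correctly, about 30 more than the threshold requires.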

## Sanity Check 

We will perform a sanity check to make sure our model performs better than a trivial baseline. For this we will use `DummyClassifier`, which makes predictions without looking at the features.

In [37]:
#Sanity check with a dummy model on the test dataset

dummy_model = DummyClassifier(random_state = 12345)
dummy_model.fit(features_train, target_train)
dummy_prediction = dummy_model.predict(features_test)

print('The accuracy for dummy model using test dataset is:', accuracy_score(target_test, dummy_prediction))

The accuracy for dummy model using test dataset is: 0.6842923794712286


The dummy model has an accuracy of 68.4% on the test set, which is lower than our model's 79.6%. This indicates that our model has passed the sanity check.
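One common trivial baseline is to always predict the most frequent class in the training data. Whether `DummyClassifier` behaves this way depends on its `strategy` setting, which the notebook leaves at the default, so the sketch below illustrates the idea and does not describe the cell above. The function name and the labels are made up for the example.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


# Made-up labels: 1 = Ultra, 0 = Smart
train_labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
test_labels = [0, 1, 0, 0, 1]

print(majority_baseline_accuracy(train_labels, test_labels))  # (0, 0.6)
```

With imbalanced classes, such a baseline can score well above 50%, so a useful model has to beat the baseline and not just a coin flip.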

## Conclusion

We have developed a model that predicts which of the new plans best fits a customer based on their phone usage behavior. We first split our data in a 3:1:1 ratio into training, validation, and test sets. We then tested three models and determined that RandomForest with n_estimators = 40 and max_depth = 8 returned the highest validation accuracy (80.9%). On the test set this model reached 79.6% accuracy, which is above the 75% threshold and above the dummy model's 68.4%, so it passes both the quality check and the sanity check.