# Project: Creating a Model to Pick the Right Mobile Plan

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
In this project, we analyzed behavior data for subscribers who have already switched to the new plans. For this classification task, we trained two models, a decision tree and a random forest, and selected the one with the highest accuracy score on the test data set. The minimum acceptable accuracy was 0.75.

# Libraries used for Analysis 

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Description of users' behavior data

The users' behavior data was loaded into the variable __data__, which contains 5 columns and 3214 rows, each row representing one observation for a user. The columns are described below:

- `calls` — number of calls,
- `minutes` — total call duration in minutes,
- `messages` — number of text messages,
- `mb_used` — Internet traffic used in MB,
- `is_ultra` — plan for the current month (Ultra - 1, Smart - 0).

In [2]:
data= pd.read_csv('/datasets/users_behavior.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


# Model 1: Decision tree

For the decision tree model, we split the data into three data sets: training, validation, and testing, in a 3:1:1 ratio, respectively. 

In [3]:
# Split the data into training, validation, and test sets in a 3:1:1 ratio

train_size = 0.6

x = data.drop(columns=['is_ultra'])
y = data['is_ultra']

# First, split off the training set (60%) from the remaining data (40%)
x_train, x_rem, y_train, y_rem = train_test_split(x, y, train_size=train_size, random_state=12345)

# The validation and test sets should be equal (20% each of the overall data),
# so test_size=0.5 assigns half of the remaining data to the test set

test_size = 0.5
x_valid, x_test, y_valid, y_test = train_test_split(x_rem, y_rem, test_size=test_size, random_state=12345)


print("training data set features shape:", x_train.shape)
print("training data set target shape:", y_train.shape)
print("validation data set features shape:", x_valid.shape)
print("validation data set target shape:", y_valid.shape)
print("testing data set features shape:", x_test.shape)
print("testing data set target shape:", y_test.shape)

training data set features shape: (1928, 4)
training data set target shape: (1928,)
validation data set features shape: (643, 4)
validation data set target shape: (643,)
testing data set features shape: (643, 4)
testing data set target shape: (643,)
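
The two-step 3:1:1 split can also be wrapped in a small helper. The sketch below is a variation, not the split used in this notebook: it adds `stratify` so each subset keeps the same class proportions, and it runs on synthetic arrays with the same row count as the real data. The function name `split_three_way` and the demo arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_three_way(features, target, random_state=12345):
    """Return 60% train, 20% validation, 20% test, stratified by target."""
    # Step 1: keep 60% for training, set aside the remaining 40%
    x_train, x_rem, y_train, y_rem = train_test_split(
        features, target, train_size=0.6, stratify=target, random_state=random_state
    )
    # Step 2: halve the remainder into validation and test sets
    x_valid, x_test, y_valid, y_test = train_test_split(
        x_rem, y_rem, test_size=0.5, stratify=y_rem, random_state=random_state
    )
    return x_train, x_valid, x_test, y_train, y_valid, y_test


# Synthetic stand-in for the real data (assumed values, same number of rows)
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(3214, 4))
demo_y = rng.integers(0, 2, size=3214)

parts = split_three_way(demo_x, demo_y)
sizes = [len(part) for part in parts[:3]]
print(sizes)
```

With 3214 rows this yields the same subset sizes as the shapes printed above.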

For the decision tree model, we looped over candidate values of the __max_depth__ hyperparameter and kept the one with the highest accuracy on the validation data set. The best __max_depth__ value was 3, with a validation accuracy of 78.54%. After applying the model trained with this depth to the test data set, we obtained an accuracy score of 77.91%.

In [4]:
#creating a loop to see which depth has highest accuracy 
best_score = 0 
best_depth = 0

for depth in range(1,120):
    model= DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(x_train, y_train)
    score= model.score(x_valid, y_valid)
    if score > best_score:
        best_score= score
        best_depth = depth

print(f' depth= {best_depth} with accuracy {best_score}')

 depth= 3 with accuracy 0.7853810264385692
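
To see how the choice of depth behaves, it can help to record the training and validation accuracy for every depth rather than only the best one. The following is a minimal sketch on synthetic data from `make_classification`, not the Megaline data; the sample size and the depth range of 1 to 20 are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Megaline dataset)
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
x_tr, x_va, y_tr, y_va = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=12345
)

# Store (train accuracy, validation accuracy) for each depth
scores = {}
for depth in range(1, 21):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(x_tr, y_tr)
    scores[depth] = (tree.score(x_tr, y_tr), tree.score(x_va, y_va))

# max() returns the first maximum, so ties go to the shallowest tree
best_depth = max(scores, key=lambda d: scores[d][1])

for depth, (train_acc, valid_acc) in scores.items():
    marker = " <- best" if depth == best_depth else ""
    print(f"depth={depth:2d} train={train_acc:.3f} valid={valid_acc:.3f}{marker}")
```

A training accuracy that keeps rising while the validation accuracy levels off or falls is the usual sign that deeper trees are overfitting.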


In [5]:
model= DecisionTreeClassifier(random_state=12345, max_depth=3)
model.fit(x_train, y_train)

train_predictions = model.predict(x_train)
test_predictions = model.predict(x_test)

print("Accuracy Score for Train Data: ", accuracy_score(y_train, train_predictions))
print("Accuracy Score for Test Data:", accuracy_score(y_test, test_predictions))

Accuracy Score for Train Data:  0.8075726141078838
Accuracy Score for Test Data: 0.7791601866251944


## Model 1: sanity check

For the sanity check, we compared the model against choosing a plan at random. If each plan is picked with a probability of 0.5, the expected accuracy of random selection is 0.50, regardless of how the plans are distributed in the data. Our decision tree model passed the sanity check on the testing data set because its accuracy score is higher than the accuracy of selecting a plan randomly.

In [6]:
# Expected accuracy when each plan is guessed with probability 0.5
sanity_accuracy_random = (y_test.value_counts(normalize=True) * 0.5).sum()
model_accuracy = accuracy_score(y_test, test_predictions)

if model_accuracy > sanity_accuracy_random:
    print('sanity check passed!')
else:
    print('sanity check failed!')

sanity check passed!
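
A stricter baseline than random guessing is to always predict the most frequent plan. The sketch below compares the two baselines with scikit-learn's `DummyClassifier`. It is an illustration only: the 70/30 class split is a hypothetical example, since the class balance of the real data is not shown in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced target: 70% class 0, 30% class 1 (not the real balance)
y_demo = np.array([0] * 700 + [1] * 300)
x_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

# Baseline 1: always predict the most frequent class
majority = DummyClassifier(strategy="most_frequent").fit(x_demo, y_demo)
# Baseline 2: predict each class with equal probability
uniform = DummyClassifier(strategy="uniform", random_state=12345).fit(x_demo, y_demo)

majority_acc = majority.score(x_demo, y_demo)
uniform_acc = uniform.score(x_demo, y_demo)
print(f"majority: {majority_acc:.3f}, uniform: {uniform_acc:.3f}")
```

In the notebook itself, the majority-class accuracy on the test set could be read from `y_test.value_counts(normalize=True).max()` and compared with the model's score in the same way.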


# Model 2: Random Forest 

For the random forest model, we reused the same training, validation, and testing data sets (3:1:1 ratio). The loop below was used to find the optimal value for the hyperparameter __n_estimators__ on the validation data set. We found that this optimal value was 23, with a validation accuracy of 79.47%, and it produced an accuracy score of 78.07% when predicting our testing data set.

In [7]:
best_score = 0
best_est = 0
for est in range (1,100):
    model= RandomForestClassifier(random_state=12345, n_estimators=est)
    model.fit(x_train, y_train)
    score= model.score(x_valid,y_valid)
    if score > best_score:
        best_score = score 
        best_est = est
print(f'Best estimator for model is {best_est} with an accuracy score of {best_score}')

Best estimator for model is 23 with an accuracy score of 0.7947122861586314


In [8]:
model= RandomForestClassifier(random_state=12345, n_estimators=23)
model.fit(x_train, y_train)

train_predictions = model.predict(x_train)
test_predictions = model.predict(x_test)

print("Accuracy Score for Train Data: ", accuracy_score(y_train, train_predictions))
print("Accuracy Score for Test Data:", accuracy_score(y_test, test_predictions))

Accuracy Score for Train Data:  0.9937759336099585
Accuracy Score for Test Data: 0.7807153965785381
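
The gap between the training accuracy (99.38%) and the test accuracy (78.07%) suggests that the forest fits the training set much more closely than unseen data. One common way to explore this is to tune `max_depth` together with `n_estimators`. The sketch below is not part of the original analysis: it runs on synthetic data from `make_classification`, and the grid ranges are arbitrary assumptions.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Megaline dataset)
x_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
x_tr, x_va, y_tr, y_va = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=12345
)

best = {"score": 0.0, "n_estimators": None, "max_depth": None}

# Try every combination of forest size and tree depth on the validation set
for est, depth in product(range(10, 51, 10), range(2, 11, 2)):
    forest = RandomForestClassifier(
        random_state=12345, n_estimators=est, max_depth=depth
    )
    forest.fit(x_tr, y_tr)
    score = forest.score(x_va, y_va)
    if score > best["score"]:
        best = {"score": score, "n_estimators": est, "max_depth": depth}

print(best)
```

Limiting the depth usually lowers the training accuracy, so the comparison that matters is whether the validation accuracy holds or improves.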


## Model 2: sanity check

For the sanity check, we compared the model against choosing a plan at random. If each plan is picked with a probability of 0.5, the expected accuracy of random selection is 0.50, regardless of how the plans are distributed in the data. Our random forest model passed the sanity check on the testing data set because its accuracy score is higher than the accuracy of selecting a plan randomly.

In [9]:
# Expected accuracy when each plan is guessed with probability 0.5
sanity_accuracy_random = (y_test.value_counts(normalize=True) * 0.5).sum()
model_accuracy = accuracy_score(y_test, test_predictions)

if model_accuracy > sanity_accuracy_random:
    print('sanity check passed!')
else:
    print('sanity check failed!')

sanity check passed!


# General Conclusion

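The model that we selected was the random forest with an __n_estimators__ hyperparameter value of 23. On our test data set, the model produced a 78.07% accuracy score, compared to 77.91% for our decision tree model. Both models exceed the 0.75 accuracy threshold, and the difference between them amounts to a single test observation (502 versus 501 correct predictions out of 643).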
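
Once a model is chosen, recommending a plan comes down to calling `predict` on a subscriber's monthly usage. The sketch below shows the shape of that call under stated assumptions: the training data is synthetic, the labelling rule is invented so the demo model has something to learn, and the subscriber values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

columns = ["calls", "minutes", "messages", "mb_used"]

# Synthetic training data (an assumption, not the Megaline dataset)
rng = np.random.default_rng(12345)
demo_x = pd.DataFrame(rng.uniform(0, 100, size=(200, 4)), columns=columns)
# Invented labelling rule, used only to give the demo model a pattern
demo_y = (demo_x["mb_used"] > 50).astype(int)

model = RandomForestClassifier(random_state=12345, n_estimators=23)
model.fit(demo_x, demo_y)

# Hypothetical subscriber; column order must match the training features
new_subscriber = pd.DataFrame(
    [{"calls": 40, "minutes": 60, "messages": 20, "mb_used": 90}]
)

plan_names = {0: "Smart", 1: "Ultra"}  # same encoding as is_ultra
prediction = int(model.predict(new_subscriber[columns])[0])
print(plan_names[prediction])
```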