**Project description:**

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. We want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

We will start by loading the data and installing the libraries.

In [1]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from joblib import dump

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


We will split the source data into a training set (60%), a validation set (20%), and a test set (20%).

In [4]:
# set aside 20% of train and test data for evaluation
df_train, df_test = train_test_split(df, test_size=0.2, random_state=12345)

# Use the same function above for the validation set
df_train, df_valid = train_test_split(df, test_size=0.25, random_state=12345) # 0.25 x 0.8 = 0.2

We will define features and targets.

In [5]:
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

Let's investigate the quality of different models by changing hyperparameters.

Decision tree:

In [6]:
for depth in range(1, 6):
        model = DecisionTreeClassifier(random_state=12345, max_depth=depth)

        model.fit(features_train, target_train)

        predictions_valid = model.predict(features_valid)

        print("max_depth =", depth, ": ", end='')
        print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.75
max_depth = 2 : 0.7835820895522388
max_depth = 3 : 0.7885572139303483
max_depth = 4 : 0.7810945273631841
max_depth = 5 : 0.7810945273631841


Regarding the decision tree, the highest accuracy appears to be obtained at depth 3 (0.788).

Random forest classifier:

In [7]:
best_score = 0
best_est = 0
best_depth = 0
for est in range(1, 10):
    for depth in range (1, 11):
        
        model = RandomForestClassifier(random_state=54321, n_estimators=est) 
        model.fit(features_train, target_train) 
        score = model.score(features_valid, target_valid) 
        if score > best_score:
            
        
            best_score = score
            best_est = est


print("Accuracy of the best model on the validation set:", best_score, "n_estimators:", best_est, "best_depth:", depth)
final_model = RandomForestClassifier(random_state=54321, n_estimators=9) # change n_estimators to get best model
final_model.fit(features_train, target_train)

Accuracy of the best model on the validation set: 0.7885572139303483 n_estimators: 4 best_depth: 10


RandomForestClassifier(n_estimators=9, random_state=54321)

Logistic regression:

In [8]:
model = LogisticRegression(random_state=12345,
                           solver='lbfgs', 
                           max_iter=1000)
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)
model.score(features_valid, target_valid)
accuracy_score(target_valid, predictions_valid)

0.7039800995024875

**Here are the results of the experiments to find the best model and hyperparameters:**

Decision Tree - found to have the best level of accuracy at depth 3 with 78.85% accuracy.

Random Forest - found to have the best level of accuracy with 4 trees at a depth of 10 is 78.85%.

Logistic regression - its best level of accuracy is 70%.

I recommend continuing with the decision tree model as it has the same percentage of accuracy as the random forest and is much faster than it.

Let's check the quality of the model we chose with the test set.

In [16]:
#We will write a function that will tell us errors in the test set.
def error_count(answers, predictions):
    errors=0
    for answer, prediction in zip(answers, predictions):
          if answer != prediction:
            errors += 1
    return errors

In [18]:
model1 = DecisionTreeClassifier(random_state=12345, max_depth=3)
model1.fit(features_train, target_train)
test_predictions = model1.predict(features_test)
print('Errors:', error_count(target_test, test_predictions))
print('Accuracy:',accuracy_score(target_test, test_predictions))

Errors: 139
Accuracy: 0.7838258164852255


Our model is also of high quality in testing with the test.

We will now perform a sanity test on the model.

In [19]:
#let's create a dummy model for our sanity test.
fake_predictions = pd.Series(0, index=df_test.index)
accuracy_sanity = accuracy_score(target_test, fake_predictions)
print('Errors:', error_count(target_test, fake_predictions))
print('Accuracy:',accuracy_score(target_test, fake_predictions))

Errors: 196
Accuracy: 0.6951788491446346


Our model provides a more accurate execution than a dummy model - a sign that the model has passed the sanity test.

Steps and final conclusion:

Our task was to find the most suitable model for our customer classification in the cellular company, with our accuracy threshold of 0.75.

Steps and findings:
In the first stage, we loaded the data set, and we divided it into three sets: the training set, the evaluation set and the test set, and each of them is divided into targets and features.

2.We searched among the three models mentioned above for the models with a percentage of accuracy higher than 0.75, and within them we also tested a variety of hypermeters in order to achieve the most accuracy. Random Forest found to have the best level of accuracy with 4 trees at a depth of 10 is 78.85% just like a decision tree at depth 3- we decided to go for a decision tree because it is faster.

