# Project_07 - Classification (Mobile)

**Project description**
<br>Mobile carrier Megaline has found out that many of their subscribers use legacy plans.
<br>They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
<br>You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. <br>Since you’ve already performed the data preprocessing step, you can move straight to creating the model.
<br>Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

**Project goal**
<br>Analyze the dataset, preprocess it, and divide it for classification modeling.
<br>Apply different models to see which one has the highest accuracy in predicting a user's plan category.

**Project Workflow**
<br>Import the dataset and perform EDA.
<br>Split the source data into a training set, a validation set, and a test set.
<br>Investigate the quality of different models by changing hyperparameters.
<br>Check the quality of the model using the test set.

**Data description**
<br>Every observation in the dataset contains monthly behavior information about one user.
<br>The information given is as follows:
<br>calls — number of calls,
<br>minutes — total call duration in minutes,
<br>messages — number of text messages,
<br>mb_used — Internet traffic used in MB,
<br>is_ultra — plan for the current month (Ultra - 1, Smart - 0).

In [5]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [6]:
df = pd.read_csv('datasets/project_02_dataset.csv')
display(df.head())
df.info()

,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [7]:
#df.to_csv('~/work/project_datasets/project_02_dataset.csv', index=False, header=list(df.columns))

In [8]:
display(df.duplicated().sum())

0

The datatypes look correct and there are no duplicated rows.

In [9]:
train, test = train_test_split(df, test_size=0.4, random_state=12345)
valid, test = train_test_split(test, test_size=0.5, random_state=12345)
print(train.shape, valid.shape, test.shape)

(1928, 5) (643, 5) (643, 5)


The data is split into three parts, with the largest share reserved for training the model.
<br>Train: 60%, Valid: 20%, Test: 20%
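The 60/20/20 proportions follow from chaining the two calls: the first holds out 40% of the rows, and the second splits that holdout in half. A minimal sketch of the arithmetic, assuming a synthetic frame (`demo` and its `row_id` column are made up for illustration; only the row count of 3214 comes from the `df.info()` output):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: same row count, dummy content.
demo = pd.DataFrame({"row_id": range(3214)})

# Step 1: hold out 40% of the rows.
demo_train, demo_holdout = train_test_split(demo, test_size=0.4, random_state=12345)
# Step 2: split the holdout in half, giving 20% validation and 20% test.
demo_valid, demo_test = train_test_split(demo_holdout, test_size=0.5, random_state=12345)

sizes = {"train": len(demo_train), "valid": len(demo_valid), "test": len(demo_test)}
shares = {name: round(n / len(demo), 2) for name, n in sizes.items()}
print(sizes)
print(shares)
```

The row counts should match the shapes printed by the cell above, since the split sizes depend only on the number of rows.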

In [10]:
features_train = train.drop(['is_ultra'], axis=1)
target_train = train['is_ultra']

features_valid = valid.drop(['is_ultra'], axis=1)
target_valid = valid['is_ultra']

features_test = test.drop(['is_ultra'], axis=1)
target_test = test['is_ultra']

Each of the three sets is separated into its features and its target (is_ultra) for modeling in the next sections.
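The same slicing is repeated three times, so it could be wrapped in a small helper. The sketch below is one possible way to do that; `split_features_target` is a hypothetical name, not something the notebook defines, and the two sample rows are copied from the `head()` output.

```python
import pandas as pd

def split_features_target(frame, target_col="is_ultra"):
    """Return (features, target) for one split of the data."""
    features = frame.drop(columns=[target_col])
    target = frame[target_col]
    return features, target

# Two rows copied from the head() output, used only to exercise the helper.
sample = pd.DataFrame({
    "calls": [40.0, 85.0],
    "minutes": [311.9, 516.75],
    "messages": [83.0, 56.0],
    "mb_used": [19915.42, 22696.96],
    "is_ultra": [0, 0],
})

features_sample, target_sample = split_features_target(sample)
print(list(features_sample.columns))
print(target_sample.tolist())
```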

In [11]:
final_model = None
best_result = 0
best_depth = None
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_valid)
    if result > best_result:
        best_result = result
        best_depth = depth
        final_model = model
print('Best Validation Accuracy:', best_result, '|', 'Best Depth:', best_depth)

# final_model is already fitted on the training set, so no refit is needed
predictions_test = final_model.predict(features_test)
final_accuracy = accuracy_score(target_test, predictions_test)
print('Final Accuracy:', final_accuracy)

Best Validation Accuracy: 0.7853810264385692 | Best Depth: 3
Final Accuracy: 0.7791601866251944


The first model is a decision tree, with the maximum depth (1 to 10) as the hyperparameter to tune.
<br>Each candidate was trained on the training set and scored on the validation set by comparing its predictions with the target.
<br>The depth that yielded the highest validation accuracy (3) was then evaluated once on the test set.
<br>The final test accuracy was ~0.779, which is higher than the required threshold of 0.75.
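One way to see why a shallow tree can win on validation is to record training and validation accuracy side by side for every depth. The sketch below does this on synthetic data from `make_classification` (a stand-in, not the Megaline data), so its numbers say nothing about the results reported here; the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Megaline dataset).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

rows = []
for depth in range(1, 11):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    rows.append({
        "max_depth": depth,
        "train_acc": accuracy_score(y_train, tree.predict(X_train)),
        "valid_acc": accuracy_score(y_valid, tree.predict(X_valid)),
    })

scores = pd.DataFrame(rows)
# idxmax returns the first maximum, like the strict `>` check in the notebook loop.
best_row = scores.loc[scores["valid_acc"].idxmax()]
print(scores.round(3).to_string(index=False))
print("chosen depth:", int(best_row["max_depth"]))
```

A growing gap between `train_acc` and `valid_acc` at larger depths is the usual sign of overfitting, which is why the selection is made on the validation column only.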

In [12]:
final_model = None
best_result = 0
best_estimator = None
best_depth = None
for estimator in range(10, 51, 10):
    for depth in range(1, 11):
        model = RandomForestClassifier(random_state=12345, n_estimators=estimator, max_depth=depth)
        model.fit(features_train, target_train)
        predictions_valid = model.predict(features_valid)
        result = accuracy_score(target_valid, predictions_valid)
        if result > best_result:
            best_result = result
            best_estimator = estimator
            best_depth = depth
            final_model = model
print('Best Validation Accuracy:', best_result, '|', 'Best Estimators:', best_estimator, '|', 'Best Depth:', best_depth)

# final_model is already fitted on the training set, so no refit is needed
predictions_test = final_model.predict(features_test)
final_accuracy = accuracy_score(target_test, predictions_test)
print('Final Accuracy:', final_accuracy)

Best Validation Accuracy: 0.8087091757387247 | Best Estimators: 40 | Best Depth: 8
Final Accuracy: 0.7962674961119751


The second model is a random forest, with the number of estimators (trees, 10 to 50 in steps of 10) and the maximum depth (1 to 10) as the hyperparameters to tune.
<br>Each candidate was trained on the training set and scored on the validation set by comparing its predictions with the target.
<br>The combination that yielded the highest validation accuracy (40 estimators, depth 8) was then evaluated once on the test set.
<br>The final test accuracy was ~0.796, which is higher than the required threshold of 0.75.
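The nested loops over estimators and depth amount to a grid search scored on a fixed validation set. As a sketch of an equivalent formulation, scikit-learn's `ParameterGrid` can enumerate the combinations; the data here is synthetic and the grid is deliberately smaller than the notebook's, both being assumptions made to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in data (NOT the Megaline dataset).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Reduced grid for illustration; the notebook uses 5 x 10 combinations.
grid = ParameterGrid({"n_estimators": [10, 20, 30], "max_depth": [2, 4, 6]})

best_params, best_score = None, -1.0
for params in grid:
    forest = RandomForestClassifier(random_state=12345, **params)
    forest.fit(X_train, y_train)
    score = accuracy_score(y_valid, forest.predict(X_valid))
    if score > best_score:
        best_params, best_score = params, score

print(len(grid), "combinations tried")
print(best_params, round(best_score, 3))
```

Keeping the hyperparameters in one dictionary makes it easier to add a third one later without another level of loop nesting.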

In [13]:
final_model = None
best_result = 0
best_solver = None
solver_list = ['liblinear', 'lbfgs']
for solver in solver_list:
    model = LogisticRegression(random_state=12345, solver=solver)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_valid)
    if result > best_result:
        best_result = result
        best_solver = solver
        final_model = model
print('Best Validation Accuracy:', best_result, '|', 'Best Solver:', best_solver)

# final_model is already fitted on the training set, so no refit is needed
predictions_test = final_model.predict(features_test)
final_accuracy = accuracy_score(target_test, predictions_test)
print('Final Accuracy:', final_accuracy)

Best Validation Accuracy: 0.7542768273716952 | Best Solver: liblinear
Final Accuracy: 0.7262830482115086


The third model is a logistic regression, with the solver (liblinear or lbfgs) as the hyperparameter to tune.
<br>Each candidate was trained on the training set and scored on the validation set by comparing its predictions with the target.
<br>The solver that yielded the highest validation accuracy (liblinear) was then evaluated once on the test set.
<br>The final test accuracy was ~0.726, which is lower than the required threshold of 0.75.
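The features in this dataset sit on very different scales (for example, messages in the tens and mb_used in the tens of thousands), and logistic regression solvers can be sensitive to that. The notebook does not scale its features. The sketch below is only an illustration of how a `StandardScaler` could be placed in front of the model with a pipeline. It uses made-up data, and whether scaling would change the accuracy on the Megaline data is not tested here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12345)
n = 500
# Made-up features on very different scales (NOT the Megaline dataset).
small = rng.normal(50, 15, size=n)
large = rng.normal(17000, 7000, size=n)
noise = rng.normal(0, 1, size=n)
X = np.column_stack([small, large])
# Synthetic label: positive when the standardized features plus noise exceed 0.
y = ((small - 50) / 15 + (large - 17000) / 7000 + noise > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# The scaler is fitted on the training data only, inside the pipeline.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver="lbfgs"),
)
scaled_logreg.fit(X_train, y_train)
valid_acc = accuracy_score(y_valid, scaled_logreg.predict(X_valid))
print(round(valid_acc, 3))
```

Wrapping both steps in one pipeline means the validation and test sets are transformed with the training set's statistics, which avoids leaking information from the held-out data.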

## Conclusions

After testing three classification models (decision tree, random forest, and logistic regression),
<br>I concluded that the random forest achieved the highest test accuracy (~0.796) in predicting a user's plan category.
<br>This outcome wasn't surprising, as random forest is an ensemble of trees, which tends to be more accurate than a single tree at the cost of more computation.
<br>The decision tree (~0.779) also cleared the 0.75 threshold, while the logistic regression (~0.726) did not.
<br>The results support the use of the random forest model as an effective tool for predicting a user's plan category.
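The comparison can be summarized in a few lines. This sketch hard-codes the test accuracies printed by the cells above (rounded to three decimals) and checks each against the 0.75 threshold; the dictionary keys are illustrative labels.

```python
THRESHOLD = 0.75  # accuracy threshold from the project description

# Test accuracies copied from the notebook outputs, rounded to 3 decimals.
test_accuracy = {
    "decision_tree": 0.779,
    "random_forest": 0.796,
    "logistic_regression": 0.726,
}

# Print the models from best to worst with a pass/fail verdict.
for name, acc in sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True):
    verdict = "passes" if acc >= THRESHOLD else "fails"
    print(f"{name:<20} {acc:.3f} {verdict}")

best_name = max(test_accuracy, key=test_accuracy.get)
passing = [name for name, acc in test_accuracy.items() if acc >= THRESHOLD]
print("best:", best_name)
```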