# Using Decision Tree, Random Forest, and Logistic Regression for classification tasks with telecom data

**Introduction:**

In this project I use scikit-learn to help Megaline recommend the right plan to its customers. There are two plans: Smart and Ultra. The dataset contains the usage behavior of customers who have already chosen one of the two plans, and my task is to find a model that fits this data well enough to recommend a plan to other customers.

In [1]:
#Import Packages needed for analysis of this data
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier


In [2]:
#read the dataset into a variable and show the format for the dataset
df = pd.read_csv('users_behavior.csv')
display(df.head())

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [3]:
#check basic information about the data to make sure that it's ready for the Machine Learning task
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The data needs to be split into training, validation, and test sets. I will use a 3:1:1 ratio (60% training, 20% validation, 20% test).

In [4]:
#Split the data into training, validation, and test sets
#first hold out 20% of the full data as the validation set
df_train, df_valid = train_test_split(df, test_size=0.2, random_state=8420, stratify=df['is_ultra'])
#then hold out 25% of the remaining 80% (20% of the full data) as the test set
df_train, df_test = train_test_split(df_train, test_size=0.25, random_state=8420, stratify=df_train['is_ultra'])

print(f'Training Data: {df_train.shape}')
print(f'Validation Data: {df_valid.shape}')
print(f'Test Data: {df_test.shape}')

print(f'The training set shape is {len(df_train)} for features')
print(f'The Validation set shape is {len(df_valid)} for features')
print(f'The Test set shape is {len(df_test)} for features')

Training Data: (1928, 5)
Validation Data: (643, 5)
Test Data: (643, 5)
The training set shape is 1928 for features
The Validation set shape is 643 for features
The Test set shape is 643 for features
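
The 3:1:1 proportions follow from chaining two splits: taking 20% first leaves 80%, and 25% of that remainder is again 20% of the original. A minimal sketch of that arithmetic, using a synthetic DataFrame with the same number of rows (the helper name `three_way_split` and the synthetic columns are illustrative assumptions, not part of the project code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, target_col, valid_frac=0.2, test_frac=0.2, seed=0):
    """Hypothetical helper: stratified train/valid/test split."""
    # First carve the validation share out of the full data.
    rest, valid = train_test_split(
        df, test_size=valid_frac, random_state=seed, stratify=df[target_col]
    )
    # The test share has to be rescaled relative to what is left.
    rel_test = test_frac / (1 - valid_frac)
    train, test = train_test_split(
        rest, test_size=rel_test, random_state=seed, stratify=rest[target_col]
    )
    return train, valid, test


# Synthetic stand-in with the same number of rows as the dataset above.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x": rng.normal(size=3214),
    "is_ultra": rng.integers(0, 2, size=3214),
})
train, valid, test = three_way_split(demo, "is_ultra")
sizes = (len(train), len(valid), len(test))
print(sizes)
```

Passing the fractions relative to the full dataset keeps the intent readable; the helper does the rescaling for the second split.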


Now that the data is split into three groups, I will create the variables for the features and the target. Since the aim is to find a model that recommends which plan a customer should use, I set the target to the 'is_ultra' column; the other columns make up the features. Because the target is binary, this is a classification task, not a regression task.

In [5]:
#define the target and features for this task in each of the 3 datasets
#the full-dataset versions are used only for the dummy baseline later on
target = df['is_ultra']
features = df.drop(['is_ultra'], axis=1)

test_target = df_test['is_ultra']
test_features = df_test.drop(['is_ultra'], axis=1)

valid_target = df_valid['is_ultra']
valid_features = df_valid.drop(['is_ultra'], axis=1)

train_target = df_train['is_ultra']
train_features = df_train.drop(['is_ultra'], axis=1)



In [6]:
#use the decision tree classifier to create a first model
DTCmodel = DecisionTreeClassifier(random_state=8420, max_depth=4)
DTCmodel.fit(train_features, train_target)

#predict on the training set to check how well the model fits the data it was trained on
train_predictions = DTCmodel.predict(train_features)

#make a function that counts the errors
def error_count(answers, predictions):
    return (answers != predictions).sum()


print('Training Errors:', error_count(train_target, train_predictions))

print(f'Training Accuracy: {accuracy_score(train_target, train_predictions)*100:.2f}%')

Training Errors: 381
Training Accuracy: 80.24%
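
The two printed metrics are two views of the same count: accuracy is one minus the error count divided by the number of rows. A quick check with the figures printed above:

```python
# Figures taken from the output above.
n_train = 1928
n_errors = 381

# Accuracy is the share of rows that were predicted correctly.
accuracy = 1 - n_errors / n_train
print(f"{accuracy * 100:.2f}%")
```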


In [7]:
print('Depth Comparison in the Validation Set:')
for depth in range(1,11):
    DTC_V_model = DecisionTreeClassifier(random_state=8420, max_depth=depth)
    DTC_V_model.fit(train_features, train_target)
    predictions_valid = DTC_V_model.predict(valid_features)
    
    print("max_depth =", depth, ": ", end='')
    print(f'{accuracy_score(valid_target, predictions_valid)*100:.2f}%')

Depth Comparison in the Validation Set:
max_depth = 1 : 74.65%
max_depth = 2 : 77.14%
max_depth = 3 : 78.85%
max_depth = 4 : 78.54%
max_depth = 5 : 77.92%
max_depth = 6 : 77.60%
max_depth = 7 : 77.45%
max_depth = 8 : 79.32%
max_depth = 9 : 79.00%
max_depth = 10 : 78.54%
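
Rather than reading the best depth off the printed list, the selection can be done in code. A small sketch using the validation accuracies printed above (the dictionary is typed in by hand here; in the notebook the scores would be collected inside the loop):

```python
# Validation accuracy (%) per max_depth, copied from the output above.
valid_scores = {
    1: 74.65, 2: 77.14, 3: 78.85, 4: 78.54, 5: 77.92,
    6: 77.60, 7: 77.45, 8: 79.32, 9: 79.00, 10: 78.54,
}

# Pick the depth with the highest validation accuracy.
best_depth = max(valid_scores, key=valid_scores.get)
print(best_depth, valid_scores[best_depth])
```

Because the spread between depths 3 and 10 is small, a shallower tree may still be a reasonable choice if a simpler model is preferred.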


In [8]:
#Sanity check: compare the model against a dummy classifier that always predicts the majority class

# Train a DummyClassifier to predict the majority class (fit and scored on the full dataset)
dummy_model = DummyClassifier(strategy="most_frequent")
dummy_model.fit(features, target)
baseline_accuracy = dummy_model.score(features, target)
# Score the depth-4 tree from In [6]; DTC_V_model would be the last model from the depth loop (max_depth=10)
model_accuracy = DTCmodel.score(valid_features, valid_target)

print(f"Dummy Baseline Accuracy: {baseline_accuracy*100:.2f}%")
print(f"Model Accuracy: {model_accuracy*100:.2f}%")

Dummy Baseline Accuracy: 69.35%
Model Accuracy: 78.54%
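
With strategy="most_frequent", the dummy classifier's accuracy is simply the share of the largest class, which makes it a floor that any useful model should clear. A minimal sketch on made-up labels (the 693/307 class counts are an assumption chosen for illustration, not the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 693 zeros and 307 ones (illustrative only).
y = np.array([0] * 693 + [1] * 307)
# The dummy model ignores the features, so a placeholder column is enough.
X = np.zeros((len(y), 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline = dummy.score(X, y)

# The same number computed directly from the class counts.
majority_share = np.bincount(y).max() / len(y)
print(baseline, majority_share)
```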


**Decision Tree Findings**
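This model performed reasonably well on this data. Always predicting the most frequent class gives a baseline accuracy of 69.35%, whereas the depth-4 tree reached 78.54% on the validation set and 80.24% on the training set. In the depth comparison, the highest validation accuracy was at max_depth=8 (79.32%), followed by max_depth=9 (79.00%) and max_depth=3 (78.85%); depth 4 scored 78.54%, so all depths from 3 to 10 fall within about two percentage points of each other.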

Next, I fit a RandomForestClassifier to the same data.

In [9]:
# Find the best n_estimators using the training and validation sets
best_score = 0
best_est = 0
for est in range(1, 21):
    RFmodel = RandomForestClassifier(random_state=8420, n_estimators=est, min_samples_leaf=10, min_samples_split=10)
    RFmodel.fit(train_features, train_target)
    score = RFmodel.score(valid_features, valid_target)
    if score > best_score:
        best_score = score
        best_est = est

print(f"Best n_estimators based on validation set: {best_est}")
print(f"Validation Accuracy of the best model: {best_score*100:.2f}%")

# Create a final model using the best n_estimators, training on both training and validation sets
# Concatenate train and validation sets for final training (pd.concat keeps the column names)
combined_features = pd.concat([train_features, valid_features], axis=0)
combined_target = pd.concat([train_target, valid_target], axis=0)

# Note: min_samples_leaf and min_samples_split are not passed here, so this model uses the defaults
final_modelRF = RandomForestClassifier(random_state=8420, n_estimators=best_est)
final_modelRF.fit(combined_features, combined_target)

# Evaluate the final model on the test set
test_accuracy = final_modelRF.score(test_features, test_target)
print(f"Accuracy of the final model on the test set: {test_accuracy * 100:.2f}%")


Best n_estimators based on validation set: 9
Validation Accuracy of the best model: 81.49%
Accuracy of the final model on the test set: 79.63%




In [10]:
#Sanity Check: RFmodel is the last model from the loop in In [9] (n_estimators=20), refit on the training set
RFmodel.fit(train_features, train_target)

#Evaluate accuracy on both training and validation sets
train_accuracy = RFmodel.score(train_features, train_target)
val_accuracy = RFmodel.score(valid_features, valid_target)

print(f"Training Accuracy: {train_accuracy*100:.2f}%")
print(f"Validation Accuracy: {val_accuracy*100:.2f}%")

Training Accuracy: 83.82%
Validation Accuracy: 81.18%


**Random Forest Findings**
The search over 1 to 20 trees found that 9 trees gave the best validation accuracy (81.49%). I set two hyperparameters (min_samples_leaf and min_samples_split) to 10 to reduce overfitting; before adjusting them, training accuracy was nearly 98%. The final model, refit on the combined training and validation sets with 9 trees, scored 79.63% on the test set. This final model was created without min_samples_leaf and min_samples_split, so it is not regularized the same way as the model selected in the search. The sanity check refits RFmodel, which at that point is the last model from the loop (20 trees), and gives a training accuracy of 83.82% and a validation accuracy of 81.18%. These numbers are close enough to one another that the regularized model no longer appears to be overfit.
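
One way to keep the searched model and the final model consistent is to hold the shared hyperparameters in a single dictionary and reuse it for the refit. The sketch below does this on synthetic data from make_classification; the names `rf_params`, `X_train`, and so on are placeholders, and the scores it produces say nothing about the Megaline data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 4 numeric features, binary target).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])
y = pd.Series(y, name="target")
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Shared settings, reused for both the search and the final refit.
rf_params = {"random_state": 0, "min_samples_leaf": 10, "min_samples_split": 10}

best_est, best_score = 0, 0.0
for est in range(1, 21):
    model = RandomForestClassifier(n_estimators=est, **rf_params)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    if score > best_score:
        best_est, best_score = est, score

# Refit on train + validation with the same settings; pd.concat keeps column names.
final_model = RandomForestClassifier(n_estimators=best_est, **rf_params)
final_model.fit(pd.concat([X_train, X_valid]), pd.concat([y_train, y_valid]))
test_accuracy = final_model.score(X_test, y_test)
print(best_est, round(test_accuracy, 3))
```

Keeping the settings in one place means the model evaluated on the test set is the same kind of model that was selected on the validation set.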

Now I'm going to run the data through the LogisticRegression model to see how it performs in this task

In [11]:
#Use the LogisticRegression tool to create a model to predict the best plan for each user
LRmodel = LogisticRegression(random_state=8420, solver='liblinear')
LRmodel.fit(train_features, train_target)
score_train = LRmodel.score(train_features, train_target)
score_valid = LRmodel.score(valid_features, valid_target)

print(f"Accuracy of the logistic regression model on the training set: {score_train*100:.2f}%")
print(f"Accuracy of the logistic regression model on the validation set: {score_valid*100:.2f}%")

#Caution: this model is fit on the test set and scored on the same data,
#so the accuracy it prints is an in-sample score, not a held-out test accuracy
final_model_LR = LogisticRegression(random_state=8420, solver='liblinear')
final_model_LR.fit(test_features, test_target)
print(f"Accuracy of the model on the test set: {final_model_LR.score(test_features, test_target)*100:.2f}%")

Accuracy of the logistic regression model on the training set: 70.33%
Accuracy of the logistic regression model on the validation set: 70.61%
Accuracy of the model on the test set: 74.96%


In [12]:
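#Compare against the majority-class baseline again
dummy_model = DummyClassifier(strategy="most_frequent")
dummy_model.fit(features, target)
baseline_accuracy = dummy_model.score(features, target)
#final_model_LR was fit on the test set in In [11]; here it is scored on the validation set
LRmodel_accuracy = final_model_LR.score(valid_features, valid_target)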

print(f"Dummy Baseline Accuracy: {baseline_accuracy*100:.2f}%")
print(f"Model Accuracy: {LRmodel_accuracy*100:.2f}%")

Dummy Baseline Accuracy: 69.35%
Model Accuracy: 75.43%


**Logistic Regression Findings**
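Logistic regression produced the lowest accuracy of the three models: 70.33% on the training set and 70.61% on the validation set. The 74.96% reported for the test set comes from a model that was fit on the test set itself, so it is an in-sample score rather than a held-out estimate; likewise, the 75.43% in the baseline comparison is that test-fitted model scored on the validation set. Compared with the 69.35% majority-class baseline from the DummyClassifier, the model trained on the training set (70.61% on validation) is barely better than always predicting the most frequent plan.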
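
For a held-out test accuracy, the model has to be fit on data that excludes the test rows and then scored on the test rows. A sketch of that pattern on synthetic data (the data and variable names are assumptions; no accuracy figure for the Megaline data is implied):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 4 numeric features, binary target).
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit only on the training rows ...
model = LogisticRegression(solver="liblinear", random_state=0)
model.fit(X_train, y_train)

# ... and score on rows the model has never seen.
held_out_accuracy = model.score(X_test, y_test)
print(f"{held_out_accuracy * 100:.2f}%")
```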

**Conclusion:**

I applied the DecisionTreeClassifier, RandomForestClassifier, and LogisticRegression models to this dataset. The RandomForestClassifier produced the highest accuracy scores (81.49% on validation and 79.63% on test), but without limits on leaf and split size it overfit the training data heavily, and even the regularized forest shows a larger gap between training and validation accuracy than the decision tree. The DecisionTreeClassifier comes close in validation accuracy (78.54% at depth 4) with a smaller gap between training and validation, although it was not scored on the test set in this notebook. LogisticRegression barely beat the majority-class baseline. So, I would recommend that Megaline use the DecisionTree model to make plan recommendations to customers based on their usage of the service.
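
The overfitting comparison can be made concrete by computing the gap between training and validation accuracy for each model, using the figures printed earlier in this notebook (the random forest figures are those of the sanity-check model):

```python
# (training accuracy %, validation accuracy %) copied from the outputs above.
scores = {
    "DecisionTree (max_depth=4)": (80.24, 78.54),
    "RandomForest (sanity check)": (83.82, 81.18),
    "LogisticRegression": (70.33, 70.61),
}

# A larger positive gap suggests the model fits the training data
# more closely than it fits unseen data.
gaps = {name: round(train - valid, 2) for name, (train, valid) in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} points")
```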