In this project, I develop a model for a mobile carrier that analyzes subscribers' behavior and recommends one of the company's newer plans in place of the legacy plan many customers are still on. I check the data for errors, train several models, and compare their accuracy.

In [4]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [5]:
df = pd.read_csv("/datasets/users_behavior.csv")

In [6]:
df.head()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [7]:
df.duplicated().sum()

0

In [8]:
df.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [9]:
df.shape

(3214, 5)

Declare feature and target variables.

In [10]:
features = df.drop(columns=['is_ultra'])
target = df['is_ultra']

Split the data into training (60%), validation (20%), and test (20%) sets.

In [11]:
# First split: training set vs temp (which will later be split into validation and test)
df_train, df_temp = train_test_split(df, test_size=0.4, random_state=54321)
# (60% training, 40% temporary)

# Second split: validation vs test from the temp set
df_valid, df_test = train_test_split(df_temp, test_size=0.5, random_state=54321)
# (each 20% of total data)
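
As a quick check on the proportions, the same two-step split can be run on a stand-in array of row indices. This is only a minimal sketch: `rows` is a placeholder for the 3,214-row dataframe, and only the resulting sizes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder for the dataframe: one integer per row (3214 rows, as in df.shape)
rows = np.arange(3214)

# First split: 60% training, 40% temporary
train, temp = train_test_split(rows, test_size=0.4, random_state=54321)
# Second split: halve the temporary set into validation and test
valid, test = train_test_split(temp, test_size=0.5, random_state=54321)

print(len(train), len(valid), len(test))
```

With 3,214 rows this gives 1,928 training rows and 643 rows each for validation and test.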

Create feature and target variables for the training and validation sets.

In [12]:

features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']



Since the target variable is either 0 or 1, this is a binary classification task. Because the model's predictions will drive business decisions, I prioritize accuracy and start with the RandomForestClassifier model. I also compare it with the DecisionTreeClassifier and LogisticRegression models.

In [13]:
best_score = 0
best_est = 0
for est in range(1, 11): # choose hyperparameter range
    model = RandomForestClassifier(random_state=54321, n_estimators=est) # set number of trees
    model.fit(features_train, target_train) # train model on training set
    score = model.score(features_valid, target_valid) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_est = est # save number of estimators corresponding to best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

final_model = RandomForestClassifier(random_state=54321, n_estimators=best_est) # refit with the best n_estimators found above
final_model.fit(features_train, target_train)


Accuracy of the best model on the validation set (n_estimators = 10): 0.7698289269051322


RandomForestClassifier(n_estimators=10, random_state=54321)
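
The search above can be wrapped in a small helper so that the final model is always built from the winning setting rather than a hand-typed value. The following is a sketch under stated assumptions: the data comes from `make_classification` as a synthetic stand-in for the carrier dataset, and `best_n_estimators` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the carrier dataset)
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)

def best_n_estimators(candidates):
    """Return (n_estimators, validation accuracy) for the best candidate."""
    best_est, best_score = None, -1.0
    for est in candidates:
        model = RandomForestClassifier(random_state=0, n_estimators=est)
        model.fit(X_train, y_train)
        score = model.score(X_valid, y_valid)
        if score > best_score:  # strict '>' keeps the earliest candidate on ties
            best_est, best_score = est, score
    return best_est, best_score

best_est, best_score = best_n_estimators(range(1, 11))

# Build the final model from the winning value instead of typing it by hand
final_model = RandomForestClassifier(random_state=0, n_estimators=best_est)
final_model.fit(X_train, y_train)
print(best_est, best_score)
```

Because the random seed and training data are the same, the refitted model reproduces the validation score found during the search.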

Now let's evaluate the model on the test portion of the data. First, let's count the prediction errors.

In [14]:
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']
target_test = target_test.reset_index(drop=True)

test_predictions = model.predict(features_test) # model is the last forest fitted in the loop above (n_estimators=10)

def error_count(answers, predictions):
    errors = 0
    for i in range(len(answers)):
        if answers[i] != predictions[i]:
            errors += 1
    return errors
            
print('Errors:', error_count(target_test, test_predictions))


Errors: 129


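Now we can get the accuracy, which is the number of correct answers divided by the total number of answers, or equivalently 1 minus the error rate. With 129 errors in 643 test rows, that is 514/643, or about 0.799.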

In [15]:
def accuracy(answers, predictions):
    count = 0
    for i in range(len(answers)):
        if answers[i] == predictions[i]:
            count += 1
    return count/len(predictions)

print('Accuracy:', accuracy(target_test, test_predictions))


Accuracy: 0.7993779160186625
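
The relationship between the error count and accuracy can be sketched in a few lines. This is an illustrative variant, not the notebook's own code: it pairs answers and predictions with `zip`, so it does not depend on the index having been reset, and the toy labels are made up.

```python
def error_count(answers, predictions):
    """Number of positions where the prediction differs from the answer."""
    return sum(a != p for a, p in zip(answers, predictions))

def accuracy(answers, predictions):
    """Share of correct predictions: 1 minus the error rate."""
    return 1 - error_count(answers, predictions) / len(answers)

# Toy labels, made up for illustration
answers = [1, 0, 1, 1, 0]
predictions = [1, 1, 1, 0, 0]

print(error_count(answers, predictions))  # 2 mismatches
print(accuracy(answers, predictions))     # 1 - 2/5
```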


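Now let's perform a sanity check by comparing accuracy on the full dataset, which includes the rows the model was trained on, with accuracy on the held-out test set.

In [16]:
from sklearn.metrics import accuracy_score

full_predictions = model.predict(features) # features holds every row of df, not only the training split
test_predictions = model.predict(features_test)

full_accuracy = accuracy_score(target, full_predictions) # new name avoids overwriting the accuracy() function above
test_accuracy = accuracy_score(target_test, test_predictions)

print('Accuracy')
print('Full dataset:', full_accuracy)
print('Test set:', test_accuracy)


Accuracy
Full dataset: 0.9016801493466086
Test set: 0.7993779160186625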
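
One common form of sanity check, which this notebook does not run, is to compare the model against a constant baseline that always predicts the majority class. Below is a minimal sketch using scikit-learn's `DummyClassifier`; the labels are made up for illustration, and `features_demo`, `target_demo`, and `baseline` are hypothetical names.

```python
from sklearn.dummy import DummyClassifier

# Made-up labels for illustration: 70 rows of class 0, 30 rows of class 1
target_demo = [0] * 70 + [1] * 30
features_demo = [[0]] * 100  # feature values are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features_demo, target_demo)

# Accuracy of always predicting the majority class
baseline_accuracy = baseline.score(features_demo, target_demo)
print(baseline_accuracy)
```

A trained model passes this kind of check when its test accuracy is clearly above the baseline's.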


Next, I train a DecisionTreeClassifier model, tuning max_depth from 1 to 5 on the validation set.

In [17]:
from sklearn.tree import DecisionTreeClassifier
best_model = None
best_result = 0
for depth in range(1,6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_valid)
    if result > best_result:
        best_model = model
        best_result = result

print("Accuracy of the best model:", best_result)


Accuracy of the best model: 0.7651632970451011


Below I will use a LogisticRegression model.

In [18]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
score_train = model.score(features_train, target_train)
score_valid = model.score(features_valid, target_valid)
print("Accuracy of the logistic regression model on the training set:", score_train)
print("Accuracy of the logistic regression model on the validation set:", score_valid)



Accuracy of the logistic regression model on the training set: 0.7131742738589212
Accuracy of the logistic regression model on the validation set: 0.6780715396578538


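Validation accuracy is above the 0.75 threshold for the RandomForestClassifier (0.770) and DecisionTreeClassifier (0.765) models but not for LogisticRegression (0.678). The RandomForestClassifier also scores 0.799 on the test set, so it is the model I would deploy.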
