# Sprint 7 Project

In this project we attempted to examine two groups of data, being the testing data and the training data. The training data determines accuracy by not subjecting this group to the test variables under investigation, which allows the comparison we are trying to make valid. We examined the features of the data, and created a confusion matrix to solidify this task further.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import joblib
from joblib import dump
from sklearn.metrics import recall_score, precision_score, f1_score, confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [4]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [5]:
print(df.dtypes)
print(df.head())

calls       float64
minutes     float64
messages    float64
mb_used     float64
is_ultra      int64
dtype: object
   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [6]:
print(df.columns)

if 'is_ultra' in df.columns:
    features = df[['calls', 'minutes', 'messages', 'mb_used']]
    target = df['is_ultra']
else:
    print("Column 'is_ultimate' not found in the DataFrame.")
    # You may need to find the correct target column or fix the dataset

df = df[['calls', 'minutes', 'messages', 'mb_used', 'is_ultra']]

train, temp = train_test_split(df, test_size=0.3, random_state=54321)

valid, test = train_test_split(temp, test_size=0.5, random_state=54321)

features_train = train[['calls', 'minutes', 'messages', 'mb_used']]
target_train = train['is_ultra']

features_valid = valid[['calls', 'minutes', 'messages', 'mb_used']]
target_valid = valid['is_ultra']

features_test = test[['calls', 'minutes', 'messages', 'mb_used']]
target_test = test['is_ultra']


Index(['calls', 'minutes', 'messages', 'mb_used', 'is_ultra'], dtype='object')


In [32]:
model = DecisionTreeClassifier(random_state=12345)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)

accuracy_valid = accuracy_score(target_valid, predicted_valid)

print(accuracy_valid)

0.7344398340248963


In [33]:
target = df['is_ultra']
features = df.drop('calls', axis=1)

target_pred_constant = pd.Series(0, index=target.index)

print(accuracy_score(target, target_pred_constant))

0.693528313627878


In [34]:
joblib.dump(model, 'model.joblib')

['model.joblib']

In [35]:
model = joblib.load('model.joblib')

In [36]:
print(confusion_matrix(target_valid, predicted_valid))

[[266  53]
 [ 75  88]]


In [37]:
model_performance = {}

In [38]:
dt_model = DecisionTreeClassifier(random_state=54321)
dt_model.fit(features, target)
val_preds = dt_model.predict(features_test)
dt_accuracy = accuracy_score(target_test, val_preds)
model_performance['DecisionTreeClassifier'] = dt_accuracy

In [39]:
rf_model = RandomForestClassifier(random_state=54321, n_estimators=10)
rf_model.fit(features_train, target_train)
val_preds = rf_model.predict(features_valid)
rf_accuracy = accuracy_score(target_valid, val_preds)
model_performance['RandomForestClassifier'] = rf_accuracy

In [40]:
lr_model = LogisticRegression(random_state=54321)
lr_model.fit(features_train, target_train)
val_preds = lr_model.predict(features_valid)
lr_accuracy = accuracy_score(target_valid, val_preds)
model_performance['LogisticRegression'] = lr_accuracy

In [41]:
for model_name, accuracy in model_performance.items():
    print(f"{model_name} Validation Accuracy: {accuracy:.2f}")
# Choose the best model based on validation accuracy and evaluate on test set
best_model_name = max(model_performance, key=model_performance.get)
best_model = None

if best_model_name == 'DecisionTreeClassifier':
    best_model = dt_model
elif best_model_name == 'RandomForestClassifier':
    best_model = rf_model
elif best_model_name == 'LogisticRegression':
    best_model = lr_model

test_preds = best_model.predict(features_test)
test_accuracy = accuracy_score(target_test, test_preds)

print(f"Best Model: {best_model_name} Test Accuracy: {test_accuracy:.2f}")

DecisionTreeClassifier Validation Accuracy: 0.30
RandomForestClassifier Validation Accuracy: 0.80
LogisticRegression Validation Accuracy: 0.72
Best Model: RandomForestClassifier Test Accuracy: 0.80


After running a sanity check, the process of comparing our model with a random one to assess whether the model makes sense we determined it does. A confusion matrix is a useful tool for evaluating the performance of a classification model. It provides a summary of how well the model’s predictions match the actual class labels.

In binary classification (where there are two classes), the confusion matrix has four entries:
True Positives (TP): The number of instances correctly predicted as positive.
True Negatives (TN): The number of instances correctly predicted as negative.
False Positives (FP): The number of instances incorrectly predicted as positive.
False Negatives (FN): The number of instances incorrectly predicted as negative.

We can see all 4 of the targets and we can, upon initial inspection, determine that the numbers are high for true positives and true negatives, indicating more true negatives than false negatives and more true positives than false positives, ensuring accuracy.

In [42]:
param_grid_rf = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10]
}


In [43]:

# Parallelized GridSearchCV
grid_search_rf = GridSearchCV(RandomForestClassifier(), param_grid_rf, cv=3, scoring='accuracy', n_jobs=-1)
grid_search_rf.fit(features_train, target_train)
best_rf_model = grid_search_rf.best_estimator_

# Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(features_train, target_train)



DecisionTreeClassifier(random_state=42)

In [44]:
# Logistic Regression
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(features_train, target_train)

# Validation accuracies
rf_validation_accuracy = accuracy_score(target_valid, best_rf_model.predict(features_valid))
dt_validation_accuracy = accuracy_score(target_valid, dt_model.predict(features_valid))
lr_validation_accuracy = accuracy_score(target_valid, lr_model.predict(features_valid))

model_performance = {
    'DecisionTreeClassifier': dt_validation_accuracy,
    'RandomForestClassifier': rf_validation_accuracy,
    'LogisticRegression': lr_validation_accuracy
}

best_model_name = max(model_performance, key=model_performance.get)

if best_model_name == 'DecisionTreeClassifier':
    best_model = dt_model
elif best_model_name == 'RandomForestClassifier':
    best_model = best_rf_model
elif best_model_name == 'LogisticRegression':
    best_model = lr_model

In [45]:
# Evaluate best model on test set
test_preds = best_model.predict(features_test)
test_accuracy = accuracy_score(target_test, test_preds)

print(f'Best Model: {best_model_name} Test Accuracy: {test_accuracy:.2f}')


Best Model: RandomForestClassifier Test Accuracy: 0.80


In [46]:
features_train.index.intersection(features_test.index)

Int64Index([], dtype='int64')

The best model predicted is the Decision Tree Classifier test, at 80% accuracy.

In [47]:
model = DecisionTreeClassifier(random_state=12345)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)

In [48]:
predicted_valid = model.predict(features_valid)

In [49]:
f1 = f1_score(target_valid, predicted_valid, average='binary')
print(f1)

0.5789473684210527


The F1 score, which is calculated as the harmonic mean of precision and recall, is at 57.8%, indicating accuracy.

In [50]:
recall = recall_score(target_valid, predicted_valid, average='binary')
print(recall)

0.5398773006134969


Recall, which takes all data that is accurate and calculates what fraction of them was recognized by the model, is at 53.9%, indicating accuracy. Data that was recognized by the model by mistake are ignored.

In [51]:
precision = precision_score(target_valid, predicted_valid, average='binary')
print(precision)

0.624113475177305


Precision, which is an evaluation metric that shows the ratio of the number of actual observations with answer "1" to the number of observations marked as "1" by the model is at 62.4%, indicating accuracy.

In [52]:
best_accuracy_score = 0
best_estimator = 0

for k in range(1, 50):
    model = RandomForestClassifier(max_depth=5, random_state=13, n_estimators=k)
    model.fit(features_train, target_train)
    
    validation_prediction = model.predict(features_valid)
    accuracy_validation = accuracy_score(target_valid, validation_prediction)
    
    if accuracy_validation > best_accuracy_score:
        best_accuracy_score = accuracy_validation
        best_estimator = k

final_model = RandomForestClassifier(max_depth=5, random_state=13, n_estimators=best_estimator)
final_model.fit(features_train, target_train)
test_prediction = final_model.predict(features_test)

accuracy_test = accuracy_score(target_test, test_prediction)

print(f"Final Model Test Set Accuracy: {accuracy_test:.2f} with n_estimators: {best_estimator}.")


Final Model Test Set Accuracy: 0.80 with n_estimators: 26.


The accurracy is relatively high, at 80%

In [53]:
print(features.shape)
print(target.shape)

(3214, 4)
(3214,)


In [54]:
best_score = 0
best_est = 0
for est in range(1, 11):
    model = RandomForestClassifier(random_state=54321, n_estimators=est)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

final_model = RandomForestClassifier(random_state=54321, n_estimators=10)
final_model.fit(features_train, target_train)

Accuracy of the best model on the validation set (n_estimators = 8): 0.8049792531120332


RandomForestClassifier(n_estimators=10, random_state=54321)

The accuracy is relatively high at 80.49%.

In [55]:

model = LogisticRegression(
    random_state=54321, solver="liblinear"
)
model.fit(features_train, target_train)
score_train = model.score(features_train, target_train)
score_valid = model.score(features_valid, target_valid)

print(
    "Accuracy of the logistic regression model on the training set:",
    score_train,
)
print(
    "Accuracy of the logistic regression model on the validation set:",
    score_valid,
)

Accuracy of the logistic regression model on the training set: 0.7216540684748777
Accuracy of the logistic regression model on the validation set: 0.6721991701244814


Accuracy of the logistic model is also fairly high for both sets, at 72.16% and  67.21% respectively.

In this project we attempted to examine two groups of data, being the testing data and the training data. We examined the features of the data, and created a confusion matrix to solidify this task further. We also examined the predicted value, recall, precision, and F1 score. Finally we altered the hyperparameters and and examined the accuracy of the linear regression as well as the accuracy of the validation set.