# Project 8: Megaline Machine Learning Algorithms

The purpose of this project is to build a model that analyzes the behaviour of clients of Megaline (a telecommunications company) and recommends a data plan (Smart or Ultra) for each user.

I'll work with different models to find the best solution with the following structure:

- Import the libraries.
- Load the data.
- Verify the integrity of the data.
- Clean the data.
- Analyze the data.
- Create a model(s).
- Train the model(s).
- Find the best result.

## Importing libraries
Importing necessary libraries

In [55]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

Loading the dataset

In [56]:
df = pd.read_csv('D:/Tripleten/datasets/users_behavior.csv')

In [57]:
df.info()
# calls and messages are counts, so store them as nullable integers
df['calls'] = df['calls'].astype('Int64')
df['messages'] = df['messages'].astype('Int64')
df.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


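| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 1404 | 32 | 187.52 | 47 | 19373.91 | 0 |
| 775 | 119 | 808.13 | 14 | 20728.45 | 0 |
| 96 | 29 | 267.55 | 29 | 16996.83 | 0 |
| 697 | 64 | 466.4 | 33 | 14327.24 | 0 |
| 1954 | 4 | 46.98 | 1 | 0.0 | 0 |
| 2519 | 57 | 454.05 | 13 | 17220.68 | 0 |
| 462 | 85 | 540.2 | 57 | 17836.84 | 1 |
| 686 | 79 | 562.99 | 19 | 25508.19 | 1 |
| 2932 | 95 | 636.52 | 6 | 15190.74 | 0 |
| 1191 | 77 | 542.2 | 8 | 18583.04 | 0 |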


In [58]:
df.describe()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


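`df.info()` shows:
- The column names are consistent and need no changes.
- `calls` and `messages` are stored as `float64`; they were converted to the nullable integer type `Int64`.
- The data has no null values.

The output of `df.sample(10)` suggests the data is consistent.

`df.describe()` shows:
- `calls` ranges from 0 to 244, with a mean of about 63 and a median of 62.
- Every usage column has a minimum of 0, so some users made no calls, sent no messages, or used no data.
- The mean of `is_ultra` is 0.306, so about 31% of the users are on the Ultra plan and the classes are imbalanced.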
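The integrity checks described above can be collected in one small helper. This is only a sketch: the function name `integrity_report` and the toy rows are hypothetical, and the real notebook would pass the loaded `df` instead.

```python
import pandas as pd

# Toy stand-in for users_behavior.csv; the values are illustrative, not real data.
df = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0],
    "minutes": [311.9, 516.75, 467.66],
    "messages": [83.0, 56.0, 86.0],
    "mb_used": [19915.42, 22696.96, 21060.45],
    "is_ultra": [0, 0, 0],
})

def integrity_report(frame, count_cols, target_col="is_ultra"):
    """Collect basic integrity facts before any dtype conversion (hypothetical helper)."""
    return {
        "nulls": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        # True when every value in the count columns is a whole number
        "counts_are_whole": bool((frame[count_cols] % 1 == 0).all().all()),
        # True when the target only takes the values 0 and 1
        "binary_target": bool(frame[target_col].isin([0, 1]).all()),
    }

report = integrity_report(df, ["calls", "messages"])

# Cast to nullable integers only when no fractional part would be lost.
if report["counts_are_whole"]:
    df = df.astype({"calls": "Int64", "messages": "Int64"})

print(report)
print(df.dtypes)
```

Checking that the count columns hold whole numbers before casting avoids silently relying on the conversion to fail if a fractional value is present.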



## Preparing the training dataset 

The data will be split into training, validation, and test sets, since no separate independent dataset is available.

In [59]:
features = df.drop('is_ultra', axis=1)
target = df['is_ultra']

# Splitting the data into training (60%) and temporary data (40%)
X_train, X_temp, y_train, y_temp = train_test_split(features, target, train_size=0.6, random_state=54321)

# Further splitting the temporary data into validation (20%) and test (20%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, train_size=0.5, random_state=54321)
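As a sanity check on the 60/20/20 split, the sketch below reproduces the two-step split on synthetic data with the same number of rows (3214) and prints the resulting sizes. The `stratify` argument is an addition that the notebook does not use; it is shown as one option for keeping the share of Ultra users similar across the three sets. The synthetic features and the 30% positive rate are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset (3214).
rng = np.random.default_rng(0)
X = rng.normal(size=(3214, 4))
y = (rng.random(3214) < 0.3).astype(int)  # roughly 30% positives (assumption)

# Step 1: 60% training, 40% temporary. stratify is NOT in the original notebook.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, train_size=0.6, random_state=54321, stratify=y
)
# Step 2: split the temporary 40% in half: 20% validation, 20% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, train_size=0.5, random_state=54321, stratify=y_temp
)

sizes = (len(X_train), len(X_val), len(X_test))
shares = [round(float(part.mean()), 3) for part in (y_train, y_val, y_test)]
print(sizes)   # (1928, 643, 643)
print(shares)  # positive-class share in each set
```

The test set size of 643 agrees with the totals of the confusion matrices reported later in the notebook.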


## Implementing machine learning classification algorithms

In this project we will compare the accuracy of three different classification algorithms, and also report precision, recall, and F1 score on the test set.

- Decision Tree
- Random Forest
- Logistic Regression

## Decision Tree Classifier

First we will iterate over the depth of the Decision Tree Classifier to find the depth with the best validation score, and then use that depth with our test dataset. This step needs both the training dataset and the validation dataset.

In [60]:
best_score = 0
best_depth = 0 

for depth in range(1,200): 
    model = DecisionTreeClassifier(max_depth=depth, random_state=54321)
    model.fit(X_train,y_train)
    predictions = model.predict(X_val)
    val_score = accuracy_score(y_val, predictions)
    # print(val_score)
    if val_score > best_score:
        best_score = val_score
        best_depth = depth

print(f'Best score {best_score} and best depth {best_depth}')

Best score 0.7822706065318819 and best depth 10
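One detail about the depth search: a decision tree stops growing once its leaves are pure, so values of `max_depth` beyond the depth the tree reaches on its own produce the same model. The sketch below illustrates this on synthetic data (the data and variable names are assumptions, not part of the notebook) using `get_depth()`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for illustration.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# An unconstrained tree grows until it cannot split any further.
full_tree = DecisionTreeClassifier(random_state=54321).fit(X, y)
natural_depth = full_tree.get_depth()

# A max_depth far above the natural depth does not change the fitted tree.
deeper = DecisionTreeClassifier(
    max_depth=natural_depth + 50, random_state=54321
).fit(X, y)
same_predictions = bool((deeper.predict(X) == full_tree.predict(X)).all())

print(natural_depth, deeper.get_depth(), same_predictions)
```

In practice this means the upper bound of the search range can be set near the natural depth of the unconstrained tree rather than at a large fixed number.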


Now it's time to evaluate the selected model on our test dataset.

In [61]:
# best_depth comes from the search above (10 in the recorded run)
model = DecisionTreeClassifier(max_depth=best_depth, random_state=54321)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)

decision_tree_df = pd.DataFrame(data=[['Decision Tree', accuracy, precision,recall,f1]] , columns=['Algoritm','Accuracy','Precision','Recall','F1 Score'])
print('Confusion Matrix')
print(conf_matrix)
decision_tree_df

Confusion Matrix
[[426  40]
 [ 85  92]]


| | Algoritm | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 0 | Decision Tree | 0.805599 | 0.69697 | 0.519774 | 0.595469 |
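To make the relationship between the confusion matrix and the four metrics explicit, the sketch below recomputes them by hand from the Decision Tree matrix reported above. It relies on scikit-learn's layout for binary labels, where rows are true classes and columns are predicted classes, so `ravel()` yields TN, FP, FN, TP.

```python
import numpy as np

# Decision Tree confusion matrix from the test set (rows: true, columns: predicted).
conf_matrix = np.array([[426, 40],
                        [85, 92]])
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)                 # share of predicted Ultra that are truly Ultra
recall = tp / (tp + fn)                    # share of true Ultra users that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
# 0.805599 0.69697 0.519774 0.595469
```

The hand-computed values match the table, which confirms that accuracy alone hides the fact that only about half of the Ultra users are identified.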


## Random Forest Classifier

In [62]:

best_depth = 0
best_score = 0
best_est = 0

# Grid over the number of trees (1, 11, ..., 91) and the tree depth (1 to 19)
for est in range(1, 100, 10):
    for depth in range(1, 20):
        rfc = RandomForestClassifier(n_estimators=est, max_depth=depth, random_state=54321)
        rfc.fit(X_train, y_train)
        predictions = rfc.predict(X_val)
        val_score = accuracy_score(y_val, predictions)
        # print(val_score)

        if val_score > best_score:
            best_depth = depth
            best_score = val_score
            best_est = est

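# Report best_score (the maximum over the grid); val_score only holds the score of the last combination tried
print(f'The best score is {best_score}, with n_estimators of {best_est} and max_depth of {best_depth}')


The best score is 0.7822706065318819, with a n_estimators of 11 and best_deep of 8

Note: the score in the recorded output above is `val_score` from the last combination tried (n_estimators=91, max_depth=19), not the best validation score. The selected hyperparameters (n_estimators=11, max_depth=8) come from the tracked best values.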
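The nested loops can also be written as a small helper that returns the best score together with its parameters, which keeps the reported score and the reported hyperparameters tied to the same result. This is a sketch on synthetic data: `search_forest` is a hypothetical name, the grid is much smaller than the one in the notebook, and `ParameterGrid` is a substitute for the explicit loops.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic data used only for illustration.
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.75, random_state=54321
)

def search_forest(grid, X_train, y_train, X_val, y_val):
    """Return (best_score, best_params, all_results) over the grid (hypothetical helper)."""
    results = []
    for params in ParameterGrid(grid):
        model = RandomForestClassifier(random_state=54321, **params)
        model.fit(X_train, y_train)
        score = accuracy_score(y_val, model.predict(X_val))
        results.append((score, params))
    # max() keeps the first combination that reaches the highest score
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_score, best_params, results

grid = {"n_estimators": [1, 11, 21], "max_depth": [2, 4, 8]}
best_score, best_params, results = search_forest(grid, X_train, y_train, X_val, y_val)
print(f"Best validation accuracy {best_score:.4f} with {best_params}")
```

Returning the score and the parameters as one pair makes it harder to print a value that belongs to a different iteration.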


In [63]:
rfc = RandomForestClassifier(n_estimators=best_est, max_depth=best_depth, random_state=54321)
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)

random_forest_df = pd.DataFrame(data=[['Random Forest', accuracy,precision,recall,f1]] , columns=['Algoritm','Accuracy','Precision','Recall','F1 Score'])
print('Confusion Matrix')
print(conf_matrix)
random_forest_df





Confusion Matrix
[[438  28]
 [ 77 100]]


| | Algoritm | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 0 | Random Forest | 0.836703 | 0.78125 | 0.564972 | 0.655738 |


## Logistic Regression

In [67]:
lr = LogisticRegression(random_state=54321, solver='liblinear')
lr.fit(X_train,y_train)

# Evaluate on the validation dataset
validation_predictions = lr.predict(X_val)
validation_score = accuracy_score(y_val, validation_predictions)
print(f'Validation accuracy: {validation_score}')

# Evaluate on the test dataset
test_predictions = lr.predict(X_test)

test_accuracy = accuracy_score(y_test, test_predictions)
precision = precision_score(y_test, test_predictions)
recall = recall_score(y_test, test_predictions)
f1 = f1_score(y_test, test_predictions)
conf_matrix = confusion_matrix(y_test, test_predictions)

logistic_reg_df = pd.DataFrame(data=[['Logistic Regression', test_accuracy, precision, recall, f1]], columns=['Algoritm', 'Accuracy', 'Precision', 'Recall', 'F1 Score'])
print('Confusion Matrix')
print(conf_matrix)
logistic_reg_df


Validation accuracy: 0.6780715396578538
Confusion Matrix
[[462   4]
 [163  14]]


| | Algoritm | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.74028 | 0.777778 | 0.079096 | 0.14359 |
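The recall of 0.079 means the logistic model labels almost every user as Smart. Two common things to try in that situation are scaling the features, since the columns sit on very different scales (`mb_used` is in the tens of thousands while `messages` is in the tens), and weighting the classes. The sketch below shows both on synthetic data. Neither step is part of the original notebook, and whether they help on the Megaline data is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (about 30% positives) used only for illustration.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
X[:, 3] = X[:, 3] * 10_000  # mimic one column on a much larger scale

X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.75, random_state=54321
)

# Assumption: scaling plus class_weight='balanced' as candidate remedies.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=54321, solver="liblinear", class_weight="balanced"),
)
scaled_lr.fit(X_train, y_train)

val_predictions = scaled_lr.predict(X_val)
val_accuracy = accuracy_score(y_val, val_predictions)
val_recall = recall_score(y_val, val_predictions)
print(f"Validation accuracy {val_accuracy:.3f}, recall {val_recall:.3f}")
```

Putting the scaler inside the pipeline ensures it is fitted on the training data only and then applied unchanged to the validation and test sets.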


In [68]:
result_df = pd.concat([decision_tree_df, random_forest_df, logistic_reg_df], ignore_index=True)
result_df

| | Algoritm | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 0 | Decision Tree | 0.805599 | 0.69697 | 0.519774 | 0.595469 |
| 1 | Random Forest | 0.836703 | 0.78125 | 0.564972 | 0.655738 |
| 2 | Logistic Regression | 0.74028 | 0.777778 | 0.079096 | 0.14359 |
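A useful reference point for these accuracies is a model that always predicts the majority class. The sketch below rebuilds the test labels from the row sums of the confusion matrices (466 Smart, 177 Ultra) and scores a `DummyClassifier` on them. The training labels are hypothetical; the only assumption that matters is that Smart (0) is the majority class, which is consistent with the mean of `is_ultra` being 0.306.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Test labels rebuilt from the confusion-matrix row sums: 466 Smart (0), 177 Ultra (1).
y_test = np.array([0] * 466 + [1] * 177)
X_test = np.zeros((len(y_test), 1))  # the dummy model ignores the features

# Hypothetical training labels; only "0 is the majority class" is assumed.
y_train = np.array([0] * 7 + [1] * 3)
X_train = np.zeros((len(y_train), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)
print(round(baseline_accuracy, 4))  # 0.7247
```

Against this baseline of about 0.725, the Logistic Regression accuracy of 0.740 is only a small gain, while the Decision Tree and Random Forest improve on it more clearly.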


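## Conclusion

On the test set, the Random Forest (n_estimators=11, max_depth=8) gave the best results of the three models: accuracy 0.837, precision 0.781, recall 0.565, and F1 score 0.656. The Decision Tree (max_depth=10) followed with accuracy 0.806 and F1 score 0.595. Logistic Regression reached accuracy 0.740 but a recall of only 0.079, meaning it identified 14 of the 177 Ultra users in the test set. Based on these results, the Random Forest is the preferred model for recommending a plan.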