# Mobile carrier Megaline

This project helps the mobile carrier Megaline optimize its customer plans. Many customers still use legacy plans, so the goal is to build a model that analyzes customer behavior data and recommends one of Megaline's newer plans: Smart or Ultra.

# Objectives

Data Exploration: Open and examine the dataset located at 'datasets/users_behavior.csv'. Download the dataset for further analysis.

Data Splitting: Divide the source data into three distinct sets: a training set, a validation set, and a test set. These sets will serve as the foundation for model development and evaluation.

Model Investigation: Explore the quality and performance of various machine learning models by systematically altering hyperparameters. Document the results of this experimentation, providing concise descriptions of the key findings.

Model Assessment: Evaluate the performance of the selected model using the test set to ensure its predictive accuracy and effectiveness.

Additional Task (Sanity Check): Recognize that this dataset presents a higher level of complexity than previous data. Conduct a comprehensive analysis to assess the model's sanity and robustness, addressing any challenges and intricacies encountered.

# Data Description

* calls: number of calls,
* minutes: total call duration in minutes,
* messages: number of text messages,
* mb_used: Internet traffic used in MB,
* is_ultra: plan for the current month (Ultra = 1, Smart = 0).

# Download the data and Prepare it for Machine Learning

In [2]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [3]:
# Load the dataset; fall back to a local copy if the platform path is missing
try:
    users = pd.read_csv('/datasets/users_behavior.csv')
except FileNotFoundError:
    users = pd.read_csv('users_behavior.csv')

In [4]:
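# Inspect column types and non-null counts.
# describe() is not the last expression in the cell, so only info() output is shown.
users.describe()
users.info()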

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The `info()` output shows that there are no missing values: every column has 3214 non-null entries.

In [5]:
# Check for duplicate rows
print(users.duplicated().sum())

0


There are no duplicate rows in the dataset.

# Data Splitting

In [6]:
# Split the data 60/20/20 into training, validation, and test sets
user_train, user_valid_and_test = train_test_split(users, test_size=0.4, random_state=12345)
user_valid, user_test = train_test_split(user_valid_and_test, test_size=0.5, random_state=12345)

features_train = user_train.drop(['is_ultra'], axis=1)
target_train = user_train['is_ultra']

features_valid = user_valid.drop(['is_ultra'], axis=1)
target_valid = user_valid['is_ultra']

features_test = user_test.drop(['is_ultra'], axis=1)
target_test = user_test['is_ultra']
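
As a quick check on the split, the expected set sizes can be worked out by hand. The sketch below assumes that `train_test_split` rounds the test share up to a whole row and gives the remainder to the first returned set; the helper `split_sizes` is a hypothetical name introduced here for illustration, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (first_part, second_part) row counts for a fractional test_size.

    Assumption: the second part is rounded up and the first part gets the rest.
    """
    n_second = math.ceil(n_rows * test_size)
    return n_rows - n_second, n_second

# 3214 rows, as reported by users.info()
n_train, n_rest = split_sizes(3214, 0.4)
n_valid, n_test = split_sizes(n_rest, 0.5)

print(n_train, n_valid, n_test)  # 1928 643 643
```

Under this assumption the three sets hold roughly 60%, 20%, and 20% of the rows. Comparing these numbers with `len(features_train)`, `len(features_valid)`, and `len(features_test)` would confirm the split.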

# Tuning Model

## Decision Tree

In [7]:
# Define the depth range
depth_range = range(1, 11)
# Initialize a list to collect the results
decision_tree_results = []
# Loop through the depths and train a model for each
for depth in depth_range:
    model_dt=DecisionTreeClassifier(max_depth=depth,random_state=12345)
    model_dt.fit(features_train,target_train)
    # calculate accuracy
    valid_accuracy = model_dt.score(features_valid, target_valid)
    
    # Append results to the list
    decision_tree_results.append({'Depth': depth, 'Validation Accuracy': valid_accuracy})

# Convert results to a DataFrame for analysis
decision_tree_df = pd.DataFrame(decision_tree_results)

# Display the results
print(decision_tree_df)

   Depth  Validation Accuracy
0      1             0.754277
1      2             0.782271
2      3             0.785381
3      4             0.779160
4      5             0.779160
5      6             0.783826
6      7             0.782271
7      8             0.779160
8      9             0.782271
9     10             0.774495


As the depth of the decision tree increases from 1 to 3, validation accuracy rises from 0.754 to 0.785. This suggests that the deeper trees capture patterns in the data that a single split cannot.

From depth 3 to depth 6, validation accuracy stays around 0.78 without meaningful improvement. The model appears to reach a plateau, where additional complexity brings no further gain.

Beyond depth 6, accuracy fluctuates between 0.774 and 0.782 and never exceeds the depth-3 peak of 0.785, ending at its lowest value at depth 10. This could be a sign of overfitting, which occurs when a model becomes complex enough to fit noise in the training data rather than the underlying patterns.
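
The best depth can also be read off programmatically rather than by eye. This is a minimal sketch that uses the validation accuracies printed in the table above; the dictionary name `valid_accuracy_by_depth` is introduced here for illustration only.

```python
# Validation accuracies copied from the printed results (depth -> accuracy)
valid_accuracy_by_depth = {
    1: 0.754277, 2: 0.782271, 3: 0.785381, 4: 0.779160, 5: 0.779160,
    6: 0.783826, 7: 0.782271, 8: 0.779160, 9: 0.782271, 10: 0.774495,
}

# Depth with the highest validation accuracy
best_depth = max(valid_accuracy_by_depth, key=valid_accuracy_by_depth.get)
best_accuracy = valid_accuracy_by_depth[best_depth]

# Spread of accuracies once the tree has at least two levels
deeper = [acc for depth, acc in valid_accuracy_by_depth.items() if depth >= 2]
spread = max(deeper) - min(deeper)

print(f"best depth: {best_depth} (accuracy {best_accuracy:.4f})")  # best depth: 3 (accuracy 0.7854)
print(f"spread for depths 2-10: {spread:.4f}")  # spread for depths 2-10: 0.0109
```

A spread of about one percentage point across depths 2 to 10 is small on a validation set of this size, so the choice of depth within that range should not be over-interpreted.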


## Random Forest

In [11]:
# Define a range of estimator values (number of trees)
estimator_range = range(10, 101, 10)

# Initialize a list to store accuracy scores for each estimator value
accuracy_scores = []

# Iterate through the estimator values
for n_estimators in estimator_range:
    # Create a Random Forest model with the current number of estimators
    random_forest_model = RandomForestClassifier(n_estimators=n_estimators, random_state=12345)
    
    # Train the model on the training data
    random_forest_model.fit(features_train, target_train)
    
    # Make predictions on the validation set
    predictions_valid = random_forest_model.predict(features_valid)
    
    # Calculate the accuracy of the model on the validation set
    accuracy_valid = accuracy_score(target_valid, predictions_valid)
    
    # Append the accuracy score to the list
    accuracy_scores.append(accuracy_valid)
    
    # Print the validation accuracy for the current number of estimators
    print(f"Number of Estimators: {n_estimators}, Validation Accuracy: {accuracy_valid:.2}")


Number of Estimators: 10, Validation Accuracy: 0.79
Number of Estimators: 20, Validation Accuracy: 0.79
Number of Estimators: 30, Validation Accuracy: 0.78
Number of Estimators: 40, Validation Accuracy: 0.78
Number of Estimators: 50, Validation Accuracy: 0.79
Number of Estimators: 60, Validation Accuracy: 0.79
Number of Estimators: 70, Validation Accuracy: 0.78
Number of Estimators: 80, Validation Accuracy: 0.78
Number of Estimators: 90, Validation Accuracy: 0.78
Number of Estimators: 100, Validation Accuracy: 0.79


As n_estimators increases from 10 to 100, the validation accuracy stays between 0.78 and 0.79 and shows no upward trend. Based on this observation, the accuracy on the validation set does not improve beyond n_estimators of 10.
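
One caveat when reading the output of the loop: the format specifier `.2` keeps only two significant digits, so different accuracies can print as the same number. The sketch below illustrates this with two hypothetical accuracy values (they are not results from this notebook) and shows that `.4f` keeps them distinguishable.

```python
# Hypothetical accuracies, chosen only to illustrate the rounding effect
accuracy_a = 0.7853
accuracy_b = 0.7947

# '.2' keeps two significant digits; '.4f' keeps four decimal places
two_digits = [f"{acc:.2}" for acc in (accuracy_a, accuracy_b)]
four_decimals = [f"{acc:.4f}" for acc in (accuracy_a, accuracy_b)]

print(two_digits)     # ['0.79', '0.79']
print(four_decimals)  # ['0.7853', '0.7947']
```

Printing with more decimals, or collecting the scores in a DataFrame as in the decision tree section, would make small differences between forest sizes visible.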

# Testing Model

Based on the validation results in the previous section, we select a Random Forest with n_estimators of 10 as the final model and evaluate it on the test set.

In [12]:
model = RandomForestClassifier(n_estimators=10, random_state=12345)
model.fit(features_train, target_train)

print('Accuracy of the best model on the train set:', model.score(features_train, target_train))
print('Accuracy of the best model on the validation set:', model.score(features_valid, target_valid))
print('Accuracy of the best model on the test set:', model.score(features_test, target_test))

Accuracy of the best model on the train set: 0.9797717842323651
Accuracy of the best model on the validation set: 0.7651632970451011
Accuracy of the best model on the test set: 0.776049766718507


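The model reaches about 77.6% accuracy on the test set, so this project has met the threshold. The much higher accuracy on the training set (about 98%) shows that the forest overfits the training data, although the validation (76.5%) and test results are close to each other.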
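
For the sanity check listed in the objectives, a common approach is to compare the model with a trivial baseline that always predicts the most frequent plan. The following is a minimal sketch of that idea; `majority_baseline_accuracy` is a hypothetical helper and the labels are toy data, since the class balance of `is_ultra` is not shown above.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical toy labels: 7 Smart (0) and 3 Ultra (1)
toy_target = [0] * 7 + [1] * 3
baseline = majority_baseline_accuracy(toy_target)

print(baseline)  # 0.7
```

In the notebook, the same function could be applied to `list(target_test)`. The model passes this check only if its test accuracy (0.776) is clearly above the resulting baseline.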