# Project : Introduction to Machine Learning
---

# Project Description

The mobile company **Megaline** is not satisfied to see that many of its customers use legacy plans. They want to develop a model that can analyze customer behavior and recommend one of Megaline’s new plans: **Smart** or **Ultra**.

For this project, we have access to the behavior data of subscribers who have already switched to new plans. For this sorting **task**, we must create a **model** that chooses the right plan. Since you’ve already taken the step of processing the data, you can jump right into creating the model.

We will develop a model as accurately as possible. In this project, the **accuracy** threshold is **0.75**. We’ll use the dataset to check accuracy.

# Project Instructions
---

For this **project** we will be working with the following **points**:

1. Open and browse the data file.
2. Segment the source data into a training, validation and test set.
3. Investigate the quality of different models by changing the hyperparameters and briefly describe the study findings.
4. Check the quality of the model using the test set.
5. Additional task: perform a sanity test on the model.

## Data Loading and General Survey of Information.
---

Load Libraries

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

Uploading Files

In [2]:
clientes = pd.read_csv('/content/users_behavior.csv')

Cantidad de características y objetivo.

In [3]:
clientes.shape

(3214, 5)

Data type in columns

In [4]:
clientes.dtypes

calls       float64
minutes     float64
messages    float64
mb_used     float64
is_ultra      int64
dtype: object

The first 5 columns

In [5]:
clientes.head(5)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


General information

In [6]:
clientes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


**OBSERVATIONS**

**Description of Data**

Each observation in the dataset contains monthly behavior information about the user. The information given is as follows:

- `сalls` - number of calls,
- `minutes` - total call time in minutes,
- `messages` - number of text messages,
- `mb_used` - Internet traffic used in MB,
- `is_ultra` - plan for the current month (Ultra - 1, Smart - 0)

Our data are in order and have no problems to solve, now in the next section we will start with the process of development of our trained model.

## Data Segmentation
---

In this section, we will **split the matrix** into three parts to order it as follows:

- **Training** set (60%)
- **Validation** set (20%)
- **Test** set (20%)

It will be done this way since there is no test dataset and we must create it prior to the training of the source dataset. To make our segmentation, first let’s define our features and our goal as follows:

In [7]:
features = clientes.drop(['is_ultra'], axis=1)
target   = clientes['is_ultra']                

Get training set (80%) and test (20%)

In [8]:
X_rest, X_test, y_rest, y_test = train_test_split(
    features, 
    target, 
    test_size = 0.20, 
    random_state=12345)

Checking data split

In [9]:
X_rest.shape, y_rest.shape, X_test.shape, y_test.shape

((2571, 4), (2571,), (643, 4), (643,))

Now we get the validation set (20%) using 80% of the previous resulting set to get 60%.

In [10]:
X_train, X_val, y_train, y_val = train_test_split(
    X_rest,
    y_rest, 
    test_size=0.25, 
    random_state=12345)

Checking data split.

In [11]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

((1928, 4), (1928,), (643, 4), (643,))

In [12]:
print(
    'Training Dataset (60%). :', X_train.shape, y_train.shape,'\n',
    'Test Dataset (20%)      :', X_test.shape , y_test.shape,'\n',
    'Validation Dataset (20%):', X_val.shape  , y_val.shape
)

Training Dataset (60%). : (1928, 4) (1928,) 
 Test Dataset (20%)      : (643, 4) (643,) 
 Validation Dataset (20%): (643, 4) (643,)


Comprobación del Split.

In [13]:
print('Original Dataset:', clientes.shape[0])
print('Total Splits.   :', X_train.shape[0] + X_test.shape[0] + X_val.shape[0])

Original Dataset: 3214
Total Splits.   : 3214


## Investigate the Quality of Different Models.
---

Since we have our datasets ready, we will now choose the best model to train for this task. First, we must consider that our task is *Classification,* so we must focus on it.

For sorting tasks, we have 3 models that are:

- decision tree.
- logistic regression.
- random forest.

For our case, we will choose the logistic regression model because its execution speed is faster and because it has an intermediate precision that usually does not have over adjustment and can work for us. In addition, this model contributes to the application of binary events, that is, events where one or the other option occurs, so we go to our model.

### Logistic Regression Model

In [14]:
# Instantiate model
rl_model = LogisticRegression(random_state=12345)
# Training model
rl_model.fit(X_train, y_train)

Accuracy Model

In [15]:
rl_score_test = rl_model.score(X_test, y_test)
print('LR Accuracy:', rl_score_test)

LR Accuracy: 0.7589424572317263


**Comments**

> The data show us that our model has a high accuracy just as we had predicted although the training model is little more accurate than the validation model. It is probably the best model we can choose but we will try another model to be sure.

### Random Forest

The model we select now will be a random forest because it consists of a high level of accuracy, the problem is that the speed of execution is slow, the more trees there will be slower this model.

Let’s find the Best Model.

In [16]:
best_score = 0
best_est   = 0

for est in range(1,31):
    ba_model = RandomForestClassifier(random_state=12345, n_estimators=est) # No of trees.
    ba_model.fit(X_train, y_train)                                          # Training model
    ba_model_score = ba_model.score(X_test, y_test)
    if ba_model_score > best_score:
        best_score += 1   # Best accuracy
        best_est   += 1   # Mejores estimaciones correspondientes a la mejor puntuación de exactitud.

print('Accuracy model on test dataset (n_estimators = {}):{}'.format(best_est, best_score))

Accuracy model on test dataset (n_estimators = 1):1


In [17]:
print('Best model with n_estimartors is:', ba_model_score)

Best model with n_estimartors is: 0.7853810264385692


**Comments**

> The accuracy for this model improves with this technique, but we still have another model to test, so we will investigate it.

### Decision Tree.

Among its features are having a low precision, although it has the highest speed of execution. Let’s look into it and see what results it can provide.

In [18]:
ad_model = DecisionTreeClassifier(random_state=12345) # Select model
ad_model.fit(X_train, y_train)                        # Train  model

Checking score

In [19]:
ad_model.score(X_test, y_test)

0.7309486780715396

Let’s find the best model by modifying the depth.

In [20]:
best_model = None
best_result = 0

for depth in range(1, 5):
    ad_model = DecisionTreeClassifier(random_state=12345, max_depth = depth) # creamos modelo con la profundidad proporcionada
    ad_model.fit(X_train, y_train)               # entrena el modelo 
    predictions = ad_model.predict(X_test)       # obtén las predicciones del modelo
    result = accuracy_score(y_test, predictions) # calcula la exactitud
    if result > best_result:
        best_model = ad_model
        best_result = result
        
print("Accuracy for the model:", best_result)

Accuracy for the model: 0.7869362363919129


**Comments**

> In a list we make adjusting the depth our model tells us that the best validation model has 78% for accuracy.

## Check the Quality

---

In the previous point we found out which of all models met to have the threshold of 0.75, now in this section we will test our logistic regression model to know its operation with a dataset that our model does not know, in this case the validation.

Define predictions

In [21]:
## Just remember the name of model is r1_model
X_predict = rl_model.predict(X_val)

Now, let's check out the model score

In [22]:
LR_result = accuracy_score(y_val, X_predict)
print('Accuracy of the model:', LR_result)

Accuracy of the model: 0.7262830482115086


## Test of Sanity.

---

For the test of sanity we will use the model **Dummy**, this is a base model that will serve as a reference to evaluate the accuracy of the other models. If any of our models have a score below the model **Dummy** it means that there is some error in our models.

In [23]:
## Dummy Model
d_model = DummyClassifier()
d_model.fit(X_train, y_train)
print(d_model)

## Get Predictions
d_model_preds = d_model.predict(X_test)

# Score model
dum_acc = accuracy_score(y_test, d_model_preds)
print("Final Accuracy of Dummy Model:", dum_acc)

DummyClassifier()
Final Accuracy of Dummy Model: 0.6951788491446346


Finally, we will create a table to compare the metrics of the models with which we work in this project.

In [24]:
results = pd.DataFrame({'Logistic Regression' : [0.7589],
                        'Random Forest': [0.7853],
                        'Decision Tree': [0.7309],
                        'Dummy Model'  : [0.69]
                        }) 
results.T.sort_values(by=0, ascending=False)

Unnamed: 0,0
Random Forest,0.7853
Logistic Regression,0.7589
Decision Tree,0.7309
Dummy Model,0.69


# Conclusions

> After working on each of our models, we can see that the best precision score is for the model **Random Forest,** with a final score of **0.78**. The worst model in our ranking is the **Decision Tree** which is only 3 points above the base model **Dummy**. None of our models is worse than the Dummy model, that means they all work properly.