# Project Sprint 7: Introduction to Machine Learning

Mobile operator Megaline is unhappy that many of its customers are using legacy plans. They want to develop a model that can analyze customer behavior and recommend one of Megaline's latest plans: Smart or Ultra.

In this project I compare several classifiers and select the one with the highest accuracy on a validation set.

## Importing required libraries

In [2]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

## Reading CSV file

In [3]:
df = pd.read_csv('moved_users_behavior.csv')

In [4]:
# Show the first 10 rows
df.head(10)

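| Unnamed: 0 | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |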


**Data description:**

Each observation in the dataset contains monthly behavioral information about a user. The information given is as follows:

* **calls:** number of calls
* **minutes:** total call duration in minutes
* **messages:** number of text messages
* **mb_used:** Internet traffic used, in MB
* **is_ultra:** plan for the current month (Ultra = 1, Smart = 0)
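
The preview above also shows an `Unnamed: 0` column that is not part of this description. It looks like a row index that was written into the CSV, and since the features are later built by dropping only `is_ultra`, it would reach the models as a feature. Below is a minimal sketch of reading such a file with the index column consumed. It uses three rows from the preview as an in-memory stand-in for the real file; that the column is only an index is an assumption, and `sample_csv` and `sample_df` are illustrative names.

```python
import io

import pandas as pd

# Three rows copied from the preview, standing in for the real CSV file.
# The empty first header field is what pandas would display as "Unnamed: 0".
sample_csv = io.StringIO(
    ",calls,minutes,messages,mb_used,is_ultra\n"
    "0,40.0,311.9,83.0,19915.42,0\n"
    "1,85.0,516.75,56.0,22696.96,0\n"
    "2,77.0,467.66,86.0,21060.45,0\n"
)

# index_col=0 tells pandas to use the first column as the row index
# instead of loading it as a data column.
sample_df = pd.read_csv(sample_csv, index_col=0)

print(sample_df.columns.tolist())
```

With the index consumed this way, dropping `is_ultra` would leave only the four behavioral columns as features.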

## Splitting the source data

**Determining the features and target of the model**

In [6]:
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

**Splitting the dataframe into training (80%), validation (10%) and test (10%) sets**

In [7]:
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.20, random_state=12345)


In [8]:
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid, target_valid, test_size=0.50, random_state=12345)
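
The two calls above first hold out 20% of the rows and then halve that portion, which gives an 80/10/10 split. Below is a minimal sketch of the same two-step pattern on synthetic data, so the resulting sizes are easy to check. The `stratify` argument is an optional addition that is not in the original cells; it keeps the class proportions similar across the three sets. All variable names here are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 4 features, binary target.
rng = np.random.default_rng(12345)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

# Step 1: hold out 20% of the rows.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.20, random_state=12345, stratify=y)

# Step 2: split the held-out 20% in half: 10% validation, 10% test.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=12345, stratify=y_hold)

print(len(X_train), len(X_valid), len(X_test))  # 800 100 100
```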

## Investigating the quality of different models

### Decision tree

In [10]:
best_model = None
best_result = 0
best_depth = 0
for depth in range(1, 6):
    model_tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model_tree.fit(features_train, target_train)
    predictions_valid = model_tree.predict(features_valid)
    result = accuracy_score(target_valid, predictions_valid)
    # Keep the model, score and depth of the best tree seen so far
    if result > best_result:
        best_model = model_tree
        best_result = result
        best_depth = depth

# Report the depth of the best tree, not the last depth tried by the loop
print("The best accuracy of the model:", best_result, ", max_depth =", best_depth)

The best accuracy of the model: 0.794392523364486 , max_depth = 5
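
For reference, the accuracy reported here is the fraction of validation rows whose predicted plan matches the actual plan. The sketch below writes that calculation out by hand. The `accuracy` helper and the label lists are hypothetical and serve only as an illustration; the notebook itself uses `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels and predictions, for illustration only.
true_labels = [0, 0, 1, 1, 0]
predicted = [0, 1, 1, 1, 0]

print(accuracy(true_labels, predicted))  # 0.8
```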


### Random forest

In [11]:
best_score = 0
best_est = 0
best_depth = 0
for est in range(10, 51, 10):
    for depth in range(1, 11):
        model_forest = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        model_forest.fit(features_train, target_train)
        predictions_valid = model_forest.predict(features_valid)
        score = accuracy_score(target_valid, predictions_valid)
        if score > best_score:
            best_score = score
            best_est = est
            best_depth = depth

print("The best accuracy of the model in the validation set(n_estimators = {}): {}, max_depth = {}".format(best_est, best_score, best_depth))

The best accuracy of the model in the validation set(n_estimators = 10): 0.8037383177570093, max_depth = 9
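
The nested loop above is a manual grid search scored on a single validation set. As a possible alternative, and a different technique from the one used here, scikit-learn's `GridSearchCV` runs the same kind of search with cross-validation. The sketch below uses synthetic data from `make_classification` and a deliberately small grid; the data, the grid values and `cv=3` are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training features and target.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345)

# A small illustrative grid; the notebook's loop covers a larger one.
param_grid = {"n_estimators": [10, 20], "max_depth": [3, 6]}

search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation instead of one fixed validation set
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Cross-validation averages the score over several folds, so the chosen hyperparameters depend less on which rows happened to land in one validation set.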


### Logistic regression

In [12]:
model_logistic = LogisticRegression(random_state=54321, solver='liblinear') 
model_logistic.fit(features_train, target_train) 
predictions_valid = model_logistic.predict(features_valid)
score = accuracy_score(target_valid, predictions_valid) 


print("Accuracy of the logistic regression model in the validation set:", score)

Accuracy of the logistic regression model in the validation set: 0.67601246105919
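
The features in this dataset sit on very different scales (`mb_used` is in the tens of thousands, `messages` in the tens), and regularized logistic regression can be sensitive to that. One thing that may be worth trying is standardizing the features first. The sketch below shows the pattern on synthetic data with one artificially inflated column. It is an assumption that this helps here, and the sketch does not show what the accuracy on the Megaline data would be.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one column is inflated to mimic a feature
# measured in much larger units than the others.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345)
X[:, 3] *= 10000

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.20, random_state=12345)

# The pipeline fits the scaler on the training data only and reuses
# the same scaling when predicting on the validation data.
scaled_logistic = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=54321, solver="liblinear"),
)
scaled_logistic.fit(X_train, y_train)

valid_accuracy = scaled_logistic.score(X_valid, y_valid)
print("Validation accuracy with scaling:", valid_accuracy)
```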


### Results

The random forest achieved the best validation accuracy: 0.8037 with 10 trees and a maximum depth of 9, ahead of the decision tree (0.7944) and logistic regression (0.6760).
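
A useful sanity check for accuracy figures like these is the score of a model that always predicts the most common plan. The sketch below computes that baseline from the ten rows shown in the data preview only, so the resulting number illustrates the calculation and is not the baseline for the full dataset. The variable names are illustrative.

```python
from collections import Counter

# is_ultra values of the ten rows shown in the data preview.
preview_target = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

# A constant model predicts the most frequent class for every row,
# so its accuracy equals that class's share of the rows.
majority_class, majority_count = Counter(preview_target).most_common(1)[0]
baseline_accuracy = majority_count / len(preview_target)

print(majority_class, baseline_accuracy)  # 0 0.7
```

Running the same calculation on the full `target` column would give the figure that the models' accuracies should be compared against.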

## Checking the quality of the trained model using the test set

In [13]:
model = RandomForestClassifier(random_state=12345, n_estimators=20, max_depth=9)
model.fit(features_train, target_train)
train_predictions = model.predict(features_train)
test_predictions = model.predict(features_test)

print('Accuracy:')
print('Training set:', accuracy_score(target_train, train_predictions) )
print('Test set:', accuracy_score(target_test, test_predictions) )

Accuracy:
Training set: 0.8751458576429405
Test set: 0.8074534161490683


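Accuracy drops from about 0.875 on the training set to about 0.807 on the test set. This gap of roughly 7 percentage points suggests that the model overfits the training data to some degree. Note that this cell uses `n_estimators=20`, whereas the validation search above selected 10 trees.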

## Real test

In [14]:
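# Refit the same configuration on the full dataset (training, validation and test rows)
model = RandomForestClassifier(random_state=12345, n_estimators=20, max_depth=9)
model.fit(features, target)
predictions = model.predict(features)  # predictions for every row; not used below
test_predictions = model.predict(features_test)

print('Accuracy:')
# Note: train_predictions still comes from the previous cell's model,
# so this line repeats the earlier training accuracy.
print('Training set:', accuracy_score(target_train, train_predictions) )
# Note: features_test is part of the data this model was fitted on,
# so this score is not a held-out estimate.
print('Test set:', accuracy_score(target_test, test_predictions) )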

Accuracy:
Training set: 0.8751458576429405
Test set: 0.8633540372670807
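
The higher test accuracy in this cell is what one would expect when the test rows are part of the data the model was fitted on. The sketch below reproduces the two situations on synthetic data: fitting on the training rows only, and fitting on all rows including the test rows. The data, the noise level (`flip_y`) and the helper `score_on_test` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    flip_y=0.2, random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=12345)


def score_on_test(fit_X, fit_y):
    """Fit a forest on the given rows and score it on the test rows."""
    model = RandomForestClassifier(
        random_state=12345, n_estimators=20, max_depth=9)
    model.fit(fit_X, fit_y)
    return accuracy_score(y_test, model.predict(X_test))


held_out = score_on_test(X_train, y_train)  # test rows unseen during fitting
leaky = score_on_test(X, y)                 # test rows included in fitting

print("Held-out test accuracy:", held_out)
print("Test accuracy after fitting on all rows:", leaky)
```

Only the first number estimates how the model behaves on new customers. A model refit on all rows can still be the one to deploy, but its quality should be quoted from the held-out evaluation.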
