## Project Description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

### Step 1. Look at the data

In [17]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [3]:
data = pd.read_csv('/datasets/users_behavior.csv')

In [4]:
data.head()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


### Step 2. Split data

In [7]:
def split_data(data):
    """Split data into train, validation, and test sets (about 64/16/20)."""
    features, target = data.drop(columns=['is_ultra']), data['is_ultra']
    
    # Hold out 20% of all rows as the test set
    features_train, features_test, target_train, target_test = \
        train_test_split(features, target, test_size=0.2)
    
    # Hold out 20% of the remaining rows as the validation set
    features_train, features_valid, target_train, target_valid = \
        train_test_split(features_train, target_train, test_size=0.2)
    
    return features_train, features_valid, features_test, \
           target_train, target_valid, target_test

In [8]:
features_train, features_valid, features_test, target_train, target_valid, target_test = split_data(data)

In [12]:
# Check that features and targets have the same number of rows in each split
assert features_train.shape[0] == target_train.shape[0]
assert features_valid.shape[0] == target_valid.shape[0]
assert features_test.shape[0] == target_test.shape[0]

In [14]:
# How many samples do we have for training?
print('There are {} objects for training'.format(features_train.shape[0]))

There are 2056 objects for training
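
The split above is unseeded and unstratified, so the exact rows in each set change from run to run. Below is a minimal sketch of a reproducible variant, assuming that a fixed seed and stratification on the target are acceptable here. The `split_data_stratified` name, the seed value, and the synthetic data are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # assumed seed, chosen only for illustration


def split_data_stratified(features, target, seed=SEED):
    """Split into train/valid/test (about 64/16/20), keeping class shares."""
    # Hold out 20% of all rows for the test set.
    f_rest, f_test, t_rest, t_test = train_test_split(
        features, target, test_size=0.2, random_state=seed, stratify=target)
    # Hold out 20% of the remaining rows for validation.
    f_train, f_valid, t_train, t_valid = train_test_split(
        f_rest, t_rest, test_size=0.2, random_state=seed, stratify=t_rest)
    return f_train, f_valid, f_test, t_train, t_valid, t_test


# Synthetic stand-in with the same number of rows as the dataset (3214)
# and a minority class of roughly 30%; the values themselves are made up.
rng = np.random.default_rng(SEED)
demo_features = rng.normal(size=(3214, 4))
demo_target = (rng.random(3214) < 0.3).astype(int)

splits = split_data_stratified(demo_features, demo_target)
f_train, f_valid, f_test, t_train, t_valid, t_test = splits
print(len(f_train), len(f_valid), len(f_test))
# Share of class 1 in each split; stratification keeps these close.
print(t_train.mean(), t_valid.mean(), t_test.mean())
```

With 3214 rows this produces the same set sizes as the original split (2056, 515, and 643 rows), while keeping the share of each class nearly identical across the three sets.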


### Step 3. Model training

Now let's train several models on the training set and check their performance on the validation set.

Then we choose the best model among them and check it on the test set to estimate its real performance.

In [32]:
"""Now we have to choose models that we will train
   They will be:
       1. LogisticRegression
       2. DecisionTreeClassifier (max_depth from 1 to 10)
       3. RandomForestClassifier (n_estimators from 10 to 100 in steps of 10, max_depth from 1 to 10)
"""
RANDOM_STATE = 42

def choose_best_model(features_train, target_train, features_valid, target_valid):
    # First, create the candidate models
    models = list()
    models.append(get_logistic_regression_model())
    models.append(get_decision_tree_model())
    models.append(get_random_forest_model())
    # Train each model (the grid searches cross-validate on the training set)
    for model in models:
        train_model(model, features_train, target_train)
    # For grid searches, keep the best estimator that was found
    best_models = list()
    for model in models:
        best_models.append(get_best_model(model))
    
    # Pick the candidate with the highest accuracy on the validation set
    return max(best_models, key=lambda model: accuracy_score(target_valid, model.predict(features_valid)))
    
    
def get_logistic_regression_model():
    return LogisticRegression(random_state=RANDOM_STATE)


def get_decision_tree_model():
    decision_tree = DecisionTreeClassifier(random_state=RANDOM_STATE)
    decision_tree_params = {'max_depth': range(1, 11)}
    return GridSearchCV(decision_tree, decision_tree_params)


def get_random_forest_model():
    random_forest = RandomForestClassifier(random_state=RANDOM_STATE)
    random_forest_params = {'n_estimators': range(10, 101, 10), 'max_depth': range(1, 11)}
    return GridSearchCV(random_forest, random_forest_params)


def train_model(model, features_train, target_train):
    model.fit(features_train, target_train)
    

def get_best_model(model):
    if not hasattr(model, 'best_estimator_'):
        return model
    return model.best_estimator_

In [33]:
best_model = choose_best_model(features_train, target_train, features_valid, target_valid)



In [34]:
best_model

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=10, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=70,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
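
Only the refitted best estimator is kept above. The fitted `GridSearchCV` object also records which parameters won and their mean cross-validated score. Below is a small sketch on synthetic data. The data from `make_classification` and the reduced grid are assumptions made to keep the example quick, so the numbers it prints do not correspond to the Megaline results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four behavior features (not the real data).
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {'max_depth': range(1, 6)},  # smaller grid than in the notebook
    scoring='accuracy',
    cv=5,
)
search.fit(X, y)

# Winning hyperparameters and their mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))
# best_estimator_ is refitted on all of X, y (refit=True by default).
print(search.best_estimator_.get_depth())
```

Logging `best_params_` and `best_score_` for each search would show how close the candidates were before they are compared on the validation set.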

Our best model is a random forest with 70 estimators (trees) and a maximum depth of 10. Let's check its accuracy on the training, validation, and test sets.

### Step 4. Check the model quality using the test set

In [36]:
def get_accuracy_on_sets(best_model, features, target):
    features_train = features['train']
    predictions_train = best_model.predict(features_train)
    target_train = target['train']
    train_accuracy = accuracy_score(target_train, predictions_train)
    
    features_valid = features['valid']
    predictions_valid = best_model.predict(features_valid)
    target_valid = target['valid']
    valid_accuracy = accuracy_score(target_valid, predictions_valid)
    
    features_test = features['test']
    predictions_test = best_model.predict(features_test)
    target_test = target['test']
    test_accuracy = accuracy_score(target_test, predictions_test)
    
    return train_accuracy, valid_accuracy, test_accuracy

In [37]:
features_dict = {
    'train': features_train,
    'valid': features_valid,
    'test': features_test
}

target_dict = {
    'train': target_train,
    'valid': target_valid,
    'test': target_test
}

train_accuracy, valid_accuracy, test_accuracy = get_accuracy_on_sets(best_model, features_dict, target_dict)

In [39]:
print('Train accuracy: {}\nValid accuracy: {}\nTest Accuracy:{}'
      .format(train_accuracy, valid_accuracy, test_accuracy))

Train accuracy: 0.8959143968871596
Valid accuracy: 0.8058252427184466
Test Accuracy:0.776049766718507


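We got approximately 0.78 accuracy on the test set, which is above the required threshold of 0.75. The gap between the training accuracy (about 0.90) and the test accuracy suggests some overfitting. We might get better performance with more features.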

### Step 5. Model sanity check

A sanity check verifies that the model performs better than random classification, and better than a constant classifier that always predicts the majority class.

Our test accuracy is about 0.78, which is better than the 0.5 expected from uniformly random guessing between two classes. But let's look at the class balance.

In [43]:
smart_target = (target_test == 0)
ultra_target = (target_test == 1)

print('The number of smart clients: {}'.format(smart_target.sum()))
print('The number of ultra clients: {}'.format(ultra_target.sum()))

The number of smart clients: 430
The number of ultra clients: 213


So the number of Smart clients is about twice the number of Ultra clients. Now suppose that our classifier predicts 0 (Smart) for all clients. Let's see its accuracy.

In [45]:
print('The accuracy of "smart" classifier: {}'.format(smart_target.sum() / target_test.shape[0]))

The accuracy of "smart" classifier: 0.6687402799377916


The accuracy of this constant classifier (about 0.67) is lower than that of our RandomForestClassifier (about 0.78). This means our model is better than both random guessing and a classifier that always predicts the majority class.
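
The baselines used in this check can be reproduced from the class counts alone. The sketch below takes the counts printed above (430 Smart, 213 Ultra) and computes the expected accuracy of three trivial strategies. The `baseline_accuracies` helper is a hypothetical name introduced for illustration.

```python
def baseline_accuracies(n_smart, n_ultra):
    """Expected accuracy of trivial classifiers for a two-class target."""
    total = n_smart + n_ultra
    p_smart = n_smart / total
    p_ultra = n_ultra / total
    return {
        # Always predict the larger class.
        'majority': max(p_smart, p_ultra),
        # Guess each class with probability 0.5, ignoring the data.
        'uniform_random': 0.5 * p_smart + 0.5 * p_ultra,
        # Guess each class with its observed frequency.
        'stratified_random': p_smart ** 2 + p_ultra ** 2,
    }


# Class counts of the test set, taken from the output above.
baselines = baseline_accuracies(n_smart=430, n_ultra=213)
for name, value in baselines.items():
    print('{}: {:.3f}'.format(name, value))
```

All three values fall below the model's test accuracy of about 0.78. The majority-class baseline (about 0.67) is the strongest of them, which makes it the one the model has to beat.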

### Conclusion

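We got a classifier with an accuracy of approximately 0.78 on the test set, above the required threshold of 0.75. This means we can expect to suggest the right plan in about 78% of cases, assuming legacy clients behave similarly to the subscribers in this dataset.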

We also performed a sanity check: the model is better than both random guessing and a majority-class classifier.