# Part 1 - Import and Inspect

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('users_behavior.csv')

In [3]:
df

,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The data has 3,214 rows and no missing values, with four numeric features (calls, minutes, messages, mb_used) and a binary target, is_ultra. We can therefore split it directly into training, validation, and test sets. To do this, we first hold out 40% of the rows, leaving 60% for training, and then split the held-out portion in half. This puts 60% of the total data in training, 20% in validation, and 20% in testing.

# Part 2 - Split Data

In [5]:
df_train, df_split = train_test_split(df, test_size=0.40, random_state=47)

In [6]:
df_test, df_valid = train_test_split(df_split, test_size=0.50, random_state=47)

In [7]:
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

In [8]:
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

In [9]:
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

This gives us the training, validation, and test sets outlined above. Next, we'll compare three candidate models: Decision Tree, Random Forest, and Logistic Regression.
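
As a quick sanity check on the proportions, the sketch below repeats the same two-step split on a placeholder frame and prints the resulting sizes. The placeholder DataFrame and its `row_id` column are assumptions made so the example runs without `users_behavior.csv`; only the row count (3214) and the split arguments come from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame (assumption): same row count as users_behavior.csv,
# but with a single dummy column instead of the real features.
df = pd.DataFrame({"row_id": range(3214)})

# Step 1: hold out 40% of the rows; the remaining 60% is the training set.
df_train, df_split = train_test_split(df, test_size=0.40, random_state=47)

# Step 2: split the held-out 40% in half into test and validation sets.
df_test, df_valid = train_test_split(df_split, test_size=0.50, random_state=47)

sizes = {"train": len(df_train), "valid": len(df_valid), "test": len(df_test)}
for name, n in sizes.items():
    print(f"{name}: {n} rows ({n / len(df):.1%})")
```

The three sets are disjoint and together cover every row, so the sizes should add up to the original row count.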

# Part 3 - Hyperparameter Tuning

In [10]:
for depth in range(1, 6):
    model = DecisionTreeClassifier(max_depth=depth, random_state=47)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    print("max_depth=", depth, ":", score)

max_depth= 1 : 0.7325038880248833
max_depth= 2 : 0.7698289269051322
max_depth= 3 : 0.7947122861586314
max_depth= 4 : 0.7962674961119751
max_depth= 5 : 0.7931570762052877


Starting with the Decision Tree, a max_depth of 4 gives the best validation accuracy (about 0.796), narrowly ahead of depths 3 and 5.
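
Rather than reading the best depth off the printed list, the loop can keep track of it. The following is a minimal sketch of that idea, not the notebook's own code: it runs on synthetic data from `make_classification` (an assumption, used so the example is self-contained), and the names `best_depth` and `best_score` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): four numeric features, binary target.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=47
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=47
)

best_depth, best_score = None, -1.0
for depth in range(1, 6):
    model = DecisionTreeClassifier(max_depth=depth, random_state=47)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)  # accuracy on the validation set
    if score > best_score:  # strict ">" keeps the shallowest depth on ties
        best_depth, best_score = depth, score

print("best max_depth:", best_depth, "validation accuracy:", round(best_score, 4))
```

Keeping the shallowest depth on ties is a design choice: when two depths score the same, the simpler tree is usually the safer pick.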

In [11]:
for estims in range(10, 51, 10):
    model = RandomForestClassifier(random_state=47, n_estimators=estims)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    print("n_estimators=", estims, ":", score)

n_estimators= 10 : 0.7776049766718507
n_estimators= 20 : 0.7822706065318819
n_estimators= 30 : 0.7776049766718507
n_estimators= 40 : 0.7916018662519441
n_estimators= 50 : 0.7853810264385692


For Random Forest, 40 estimators works best, with a validation accuracy of about 0.792. This is still slightly below the best Decision Tree (about 0.796), although the gap is under half a percentage point.

In [12]:
model = LogisticRegression(random_state=47)
model.fit(features_train, target_train)
score = model.score(features_valid, target_valid)
print(score)

0.7107309486780715


With default settings, Logistic Regression reaches about 0.711 on the validation set, below both the Decision Tree and the Random Forest.
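
The three-way comparison above can also be collected in one place. This sketch assumes synthetic data from `make_classification` in place of the real dataset, and it sets `max_iter=1000` for Logistic Regression to avoid convergence warnings, which the notebook does not do. The dictionary names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): four numeric features, binary target.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=47
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=47
)

# Candidate models, using the hyperparameters selected in the text above.
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=47),
    "random_forest": RandomForestClassifier(n_estimators=40, random_state=47),
    # max_iter=1000 is an assumption, not part of the original notebook.
    "logistic_regression": LogisticRegression(max_iter=1000, random_state=47),
}

# Fit each model on the training set and score it on the validation set.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_valid, y_valid)

best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print("best on validation:", best_name)
```

On the synthetic data the ranking may differ from the one reported above; the point is only the comparison pattern.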

# Part 4 - Test Model

In [13]:
model = DecisionTreeClassifier(max_depth=4, random_state=47)

In [14]:
model.fit(features_train, target_train)

DecisionTreeClassifier(max_depth=4, random_state=47)

In [15]:
train_predictions = model.predict(features_train)
train_accuracy = accuracy_score(target_train, train_predictions)
print(train_accuracy)

0.808091286307054


In [16]:
valid_predictions = model.predict(features_valid)
valid_accuracy = accuracy_score(target_valid, valid_predictions)
print(valid_accuracy)

0.7962674961119751


In [17]:
print("Training set:", train_accuracy)
print("Validation set:", valid_accuracy)

Training set: 0.808091286307054
Validation set: 0.7962674961119751


In [18]:
model.score(features_test, target_test)

0.7729393468118196

Here we checked our training, validation, and test sets. Accuracy is highest on the training set (about 0.808), followed by validation (about 0.796) and test (about 0.773). Our final test score is therefore about 77% accuracy. This means that after training on our initial data, the model still performs reasonably on "new data" (i.e., the test set in this case).
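
The three accuracy checks can be wrapped in a small helper so they are reported side by side. This is a sketch under stated assumptions: `report_accuracy` is a hypothetical helper, not part of scikit-learn or the original notebook, and the data is synthetic so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def report_accuracy(model, splits):
    """Return {split_name: accuracy} for an already fitted model.

    `splits` maps a name to a (features, target) pair. Hypothetical helper.
    """
    return {
        name: accuracy_score(target, model.predict(features))
        for name, (features, target) in splits.items()
    }


# Synthetic stand-in data (assumption): four numeric features, binary target.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=47
)

# Same 60/20/20 scheme as in Part 2.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.40, random_state=47
)
X_test, X_valid, y_test, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=47
)

model = DecisionTreeClassifier(max_depth=4, random_state=47)
model.fit(X_train, y_train)

results = report_accuracy(
    model,
    {
        "train": (X_train, y_train),
        "valid": (X_valid, y_valid),
        "test": (X_test, y_test),
    },
)
for name, acc in results.items():
    print(f"{name}: {acc:.4f}")
```

A large drop from training to test accuracy would suggest overfitting; a small one, like the drop reported above, suggests the model generalizes reasonably well.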