## Classifying data plans

Several classification models are compared in order to recommend the right phone plan (Smart or Ultra) from subscriber behavior, with the highest accuracy possible.

### Step 1. Open and look through the data file. 

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [None]:
df.head()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [None]:
df.describe()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


### Conclusion

`/datasets/users_behavior.csv` was opened and examined for general information.

There are 5 columns in the file, each with 3,214 non-null entries. The columns are described as follows:

`calls`: Number of calls

`minutes`: Total call duration in minutes

`messages`: Number of text messages

`mb_used`: Internet traffic used in MB

`is_ultra`: Plan for the current month (Ultra - 1, Smart - 0)

### Step 2. Split the source data into a training set, a validation set, and a test set.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
target = df['is_ultra']
features = df.drop('is_ultra', axis=1)

In [None]:
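# First split: 60% training, 40% held out (shuffled with a fixed seed).
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.4, random_state=12345)
# Second split: halve the held-out rows into validation and test sets; they are already shuffled.
features_valid, features_test, target_valid, target_test = train_test_split(features_valid, target_valid, test_size=0.5, shuffle=False)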

### Conclusion

`train_test_split()`, imported from `sklearn.model_selection`, splits a dataset into two subsets.

The target is `is_ultra`, since the goal is to determine which plan suits a subscriber.

The source data is split twice using `train_test_split()` into a 3:1:1 ratio: a training dataset (60%), a validation dataset (20%), and a test dataset (20%).
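The row counts behind the 3:1:1 split can be checked by hand. The sketch below is plain arithmetic and not part of the original notebook. It assumes that a fractional `test_size` is rounded up to a whole number of rows, and the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out part is rounded up to a whole row,
    and the remaining rows form the other part.
    """
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out


n_total = 3214  # rows reported by df.info()

# First split: 60% training, 40% held out.
n_train, n_holdout = split_sizes(n_total, 0.4)
# Second split: the held-out rows are halved into validation and test.
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_valid, n_test)  # 1928 643 643
```

The test-set size agrees with the `value_counts()` output in Step 5 (447 + 196 = 643).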

### Step 3. Investigate the quality of different models by changing hyperparameters

#### Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

for depth in range(1, 20):
    dt_model = DecisionTreeClassifier(random_state=99, max_depth=depth)

    # Train the model on the training set.
    dt_model.fit(features_train, target_train)

    # Predict on the validation set and report accuracy as a percentage.
    predictions_valid = dt_model.predict(features_valid)
    print("max_depth =", depth, ":", accuracy_score(target_valid, predictions_valid)*100.0)

max_depth = 1 : 74.80559875583204
max_depth = 2 : 78.38258164852256
max_depth = 3 : 78.69362363919129
max_depth = 4 : 78.38258164852256
max_depth = 5 : 77.76049766718506
max_depth = 6 : 78.38258164852256
max_depth = 7 : 80.09331259720062
max_depth = 8 : 78.22706065318819
max_depth = 9 : 78.53810264385692
max_depth = 10 : 78.22706065318819
max_depth = 11 : 77.13841368584758
max_depth = 12 : 77.44945567651634
max_depth = 13 : 76.82737169517885
max_depth = 14 : 75.73872472783826
max_depth = 15 : 75.11664074650078
max_depth = 16 : 73.56143079315707
max_depth = 17 : 74.18351477449455
max_depth = 18 : 73.40590979782272
max_depth = 19 : 72.78382581648522


`7` is the best value found for the `max_depth` hyperparameter in the Decision Tree Model, with an `accuracy_score` of `80.09%`.
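Reading the best value off a printed list is easy to get wrong. As a minimal sketch of how the selection could be automated, the snippet below picks the top score from a dictionary. `pick_best` is a hypothetical helper, and the scores are a subset of the validation accuracies printed above, rounded to two decimals.

```python
def pick_best(scores):
    """Return the (parameter, score) pair with the highest score.

    If several parameters tie, the first one in the dictionary wins.
    """
    return max(scores.items(), key=lambda item: item[1])


# Validation accuracy (%) by max_depth, copied from the output above (rounded).
dt_valid_scores = {1: 74.81, 3: 78.69, 5: 77.76, 7: 80.09, 9: 78.54, 11: 77.14}

best_depth, best_score = pick_best(dt_valid_scores)
print(best_depth, best_score)  # 7 80.09
```

In the notebook itself, the same idea would mean collecting each `accuracy_score` into a dictionary inside the loop instead of only printing it.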

#### Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

for num in range(1, 20):
    rf_model = RandomForestClassifier(random_state=99, n_estimators=num)

    rf_model.fit(features_train, target_train)

    predictions_valid = rf_model.predict(features_valid)

    print("n_estimators =", num, ":", accuracy_score(target_valid, predictions_valid)*100.0)

n_estimators = 1 : 72.00622083981337
n_estimators = 2 : 76.98289269051321
n_estimators = 3 : 75.89424572317263
n_estimators = 4 : 78.0715396578538
n_estimators = 5 : 76.36080870917574
n_estimators = 6 : 78.22706065318819
n_estimators = 7 : 76.51632970451011
n_estimators = 8 : 77.76049766718506
n_estimators = 9 : 77.44945567651634
n_estimators = 10 : 77.76049766718506
n_estimators = 11 : 77.60497667185071
n_estimators = 12 : 77.76049766718506
n_estimators = 13 : 77.29393468118197
n_estimators = 14 : 77.44945567651634
n_estimators = 15 : 77.13841368584758
n_estimators = 16 : 77.44945567651634
n_estimators = 17 : 77.29393468118197
n_estimators = 18 : 77.44945567651634
n_estimators = 19 : 77.76049766718506


`6` is the best value found for the `n_estimators` hyperparameter in the Random Forest Model, with an `accuracy_score` of `78.23%`.
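Differences of a fraction of a percentage point are small on a validation set of this size. As a rough worked example (not in the original notebook), the accuracies can be converted back into counts of correct predictions. This assumes the validation set holds 643 rows, which is what the 60/20/20 split of 3,214 rows yields; the function name is illustrative.

```python
N_VALID = 643  # assumed validation-set size (20% of 3,214 rows, rounded)


def correct_predictions(accuracy_percent, n_rows=N_VALID):
    """Convert an accuracy percentage into a count of correct predictions."""
    return round(accuracy_percent / 100 * n_rows)


best = correct_predictions(78.22706065318819)      # n_estimators = 6
runner_up = correct_predictions(78.0715396578538)  # n_estimators = 4

print(best, runner_up, best - runner_up)  # 503 502 1
```

Under these assumptions the best forest beats the runner-up by a single validation row, so the ranking among nearby `n_estimators` values should be read with caution.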

#### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

for solver in ['liblinear', 'lbfgs', 'newton-cg']:
    lr_model = LogisticRegression(random_state=99, solver=solver)
    lr_model.fit(features_train, target_train)
    
    print("solver =", solver, ":", lr_model.score(features_valid, target_valid)*100.0)

solver = liblinear : 76.049766718507
solver = lbfgs : 70.13996889580093
solver = newton-cg : 75.73872472783826


`liblinear` is the best value found for the `solver` hyperparameter in the Logistic Regression Model, with an `accuracy_score` of `76.05%`.
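The gap between `lbfgs` and the other solvers may come from the optimizer stopping before convergence on unscaled features (`mb_used` is in the tens of thousands while `messages` is in the tens). The notebook does not check this. One way to test the idea is to standardize the features in a pipeline, as in the minimal sketch below. It runs on synthetic stand-in data rather than `users_behavior.csv`, and the data-generating rule and the `max_iter` value are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): two features on very different scales.
n = 400
messages = rng.uniform(0, 200, size=n)
mb_used = rng.uniform(0, 50_000, size=n)
X = np.column_stack([messages, mb_used])
y = (mb_used > 25_000).astype(int)  # made-up labelling rule for illustration

# Standardize the features, then fit; max_iter is raised to give lbfgs
# more room to converge.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000, random_state=99),
)
scaled_lr.fit(X[:300], y[:300])

valid_accuracy = scaled_lr.score(X[300:], y[300:])
print(f"validation accuracy: {valid_accuracy:.3f}")
```

Applied to the real data, the pipeline would be fitted on `features_train` and scored on `features_valid`, which keeps the scaler from seeing validation rows.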

### Conclusion

Three models were compared: Decision Tree, Random Forest, and Logistic Regression. On the validation set, the Decision Tree achieved the highest `accuracy_score`, followed by Random Forest and then Logistic Regression.

`7` is the best value found for the `max_depth` hyperparameter in the **Decision Tree** model for the validation set, with an `accuracy_score` of `80.09%`.

`6` is the best value found for the `n_estimators` hyperparameter in the **Random Forest** model for the validation set, with an `accuracy_score` of `78.23%`.

`liblinear` is the best value found for the `solver` hyperparameter in the **Logistic Regression** model for the validation set, with an `accuracy_score` of `76.05%`.

### Step 4. Check the quality of the model using the test set.

#### Decision Tree Model

In [None]:
dt_model = DecisionTreeClassifier(random_state=99, max_depth=7)
dt_model.fit(pd.concat([features_train, features_valid]), pd.concat([target_train, target_valid]))
predictions_test = dt_model.predict(features_test)
print("max_depth = 7", ":", accuracy_score(target_test, predictions_test)*100.0)

max_depth = 7 : 77.76049766718506


#### Random Forest Model

In [None]:
rf_model = RandomForestClassifier(random_state=99, n_estimators=6)
rf_model.fit(pd.concat([features_train, features_valid]), pd.concat([target_train, target_valid]))
predictions_test = rf_model.predict(features_test)
print("n_estimators = 6", ":", accuracy_score(target_test, predictions_test)*100.0)

n_estimators = 6 : 79.00466562986003


#### Logistic Regression

In [None]:
lr_model = LogisticRegression(random_state=99, solver='liblinear')
lr_model.fit(pd.concat([features_train, features_valid]), pd.concat([target_train, target_valid]))
    
print("solver = liblinear", ":", lr_model.score(features_test, target_test)*100.0)

solver = liblinear : 73.25038880248833


### Conclusion

For this step, each model was retrained on the combined training and validation sets to give it more data to learn from. The best hyperparameter value found in Step 3 was used for each model.

Of the three models, the **Random Forest** model displayed the highest `accuracy_score` of `79.00%` on the test set with the `n_estimators` hyperparameter set to `6`.

The **Decision Tree** model came in second with an `accuracy_score` of `77.76%` on the test set with the `max_depth` hyperparameter set to `7`.

Lastly, the **Logistic Regression** model had an `accuracy_score` of `73.25%` on the test set with the `solver` hyperparameter set to `liblinear`.
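Putting the validation and test scores side by side shows how each model's result shifted between the two sets. The sketch below only tabulates the rounded figures reported in Steps 3 and 4; the dictionary layout and variable names are illustrative choices.

```python
# (validation %, test %) per model, rounded figures from Steps 3 and 4.
results = {
    "Decision Tree": (80.09, 77.76),
    "Random Forest": (78.23, 79.00),
    "Logistic Regression": (76.05, 73.25),
}

for name, (valid_acc, test_acc) in results.items():
    change = test_acc - valid_acc
    print(f"{name:<20} valid={valid_acc:.2f}  test={test_acc:.2f}  change={change:+.2f}")

# Model with the highest test accuracy.
best_on_test = max(results, key=lambda name: results[name][1])
print("best on test:", best_on_test)
```

Because the test-set models were retrained on the combined training and validation data, the two columns are not strictly like-for-like.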

### Step 5. Sanity check the model.

In [None]:
target_test.value_counts()

0    447
1    196
Name: is_ultra, dtype: int64

In [None]:
target_test.mean() # fraction of 1's in target_test

0.3048211508553655

In [None]:
1 - target_test.mean() # fraction of 0's in target_test

0.6951788491446345

### Conclusion

The class distribution of `target_test` shows that 30.48% of the rows have the value `1` and 69.52% have the value `0`.

Therefore, a model that always predicts `0` would be correct 69.52% of the time. All three models tested in Step 4 beat this baseline, with the Random Forest model scoring highest at 79.00%, which indicates an improvement over a "dumb" constant model.
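The baseline figure follows directly from the class counts. A minimal sketch of the arithmetic, using the `value_counts()` output above (the variable names are illustrative):

```python
# Class counts in target_test, from the value_counts() output.
counts = {0: 447, 1: 196}

n_test = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a constant model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / n_test

print(majority_class, n_test, round(baseline_accuracy * 100, 2))  # 0 643 69.52
```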