# Megaline Model Analysis:
- Megaline is looking for a model that will analyze subscribers' behavior and recommend one of the new plans being offered: Smart or Ultra.
- To achieve this goal, we will first make sure our data is clean and free of any issues.
- Secondly, we will split the dataset into 3 sets: training, validation, and test.
- With these newly formed data sets we will be able to train and test different models.
- Accuracy for our models needs to be a minimum of 0.75.
- The following models will be tested: Decision Tree Classifier, Logistic Regression, and Random Forest.

## Data Discovery/Cleaning:

In [42]:
import pandas as pd
users = pd.read_csv('users_behavior.csv')
users.info()
users.head(5)
#users.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [43]:
users['calls'] = users['calls'].astype('int64')
users['messages'] = users['messages'].astype('int64')
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   int64  
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int64  
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 125.7 KB



- Our data looks to be intact, with no missing values to fill in.
- Since calls and messages can only be counted in whole numbers, we changed the dtype of both columns to 'int64'.
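
The integer cast above is only lossless if every value in those columns is already a whole number. A minimal sketch of that check, using the five rows shown in the `head` output (the helper name `all_whole` is illustrative and not part of the notebook):

```python
# Sample values copied from the users.head(5) output above.
calls = [40.0, 85.0, 77.0, 106.0, 66.0]
messages = [83.0, 56.0, 86.0, 81.0, 1.0]
minutes = [311.9, 516.75, 467.66, 745.53, 418.74]


def all_whole(values):
    """Return True if every float in `values` has no fractional part."""
    return all(float(v).is_integer() for v in values)


# Hypothetical helper: columns that pass can be cast to int without losing data.
for name, column in [("calls", calls), ("messages", messages), ("minutes", minutes)]:
    print(f"{name}: safe to cast to int -> {all_whole(column)}")
```

On the full DataFrame, an equivalent check would be something like `(users['calls'] % 1 == 0).all()` before calling `astype('int64')`.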

## Data Processing:

In [44]:
from sklearn.model_selection import train_test_split


users_train, users_temp = train_test_split(users, test_size=0.4, random_state=54321)
users_valid, users_test = train_test_split(users_temp, test_size=0.5, random_state=54321)

features_train = users_train.drop('is_ultra', axis=1)
target_train = users_train['is_ultra']
features_valid = users_valid.drop('is_ultra', axis=1)
target_valid = users_valid['is_ultra']
features_test = users_test.drop('is_ultra', axis=1)
target_test = users_test['is_ultra']
print(features_train.shape, target_train.shape)
print(features_valid.shape, target_valid.shape)
print(features_test.shape, target_test.shape)
print()
print(features_train.head(5))

(1928, 4) (1928,)
(643, 4) (643,)
(643, 4) (643,)

      calls  minutes  messages   mb_used
389     118   908.61        54  29678.60
1066     90   653.62         0  15697.77
36       76   543.18        43  31845.11
1903    134   940.77        56   2921.57
1373    121   769.36         0  42437.52



- Our data has now been properly split into 3 sets: train (60%), validation (20%), and test (20%).
- Feature and target variables have also been created for each set to make model testing more efficient.
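
The 60/20/20 proportions follow from the two `test_size` values: the first call holds out 40% of the rows, and the second splits that held-out portion in half. A rough sketch of the arithmetic, assuming the held-out count is rounded up to a whole row (which matches the shapes printed above; `split_sizes` is an illustrative helper, not a scikit-learn function):

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out count is rounded up and the remainder is kept.
    """
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out


n_total = 3214  # row count reported by users.info()
n_train, n_temp = split_sizes(n_total, 0.4)  # first split: 60% train, 40% held out
n_valid, n_test = split_sizes(n_temp, 0.5)   # second split: halve the held-out rows

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%})")
```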

In [45]:
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier


## Decision Tree Classifier:

In [46]:
# Evaluate Decision Tree with depths from 1 to 5

for depth in range(1,6):
    model = DecisionTreeClassifier(max_depth=depth, random_state=54321)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    accuracy = accuracy_score(target_valid, predictions)
    print(f'Max Depth: {depth}, Accuracy: {accuracy:.4f}')

# Choose the best model and evaluate on test set

model = DecisionTreeClassifier(max_depth=5, random_state=54321)
model.fit(features_train, target_train)
score_test = accuracy_score(target_test, model.predict(features_test))
print(f'Test Set Accuracy: {score_test:.4f}')

# Evaluate model on entire dataset

model_full = DecisionTreeClassifier(max_depth=5, random_state=54321)
model_full.fit(users.drop('is_ultra', axis=1), users['is_ultra'])
score_full = accuracy_score(users['is_ultra'], model_full.predict(users.drop('is_ultra', axis=1)))
print(f'Full Dataset Accuracy: {score_full:.4f}')

Max Depth: 1, Accuracy: 0.7216
Max Depth: 2, Accuracy: 0.7418
Max Depth: 3, Accuracy: 0.7652
Max Depth: 4, Accuracy: 0.7449
Max Depth: 5, Accuracy: 0.7652
Test Set Accuracy: 0.8072
Full Dataset Accuracy: 0.8171



- For our Decision Tree model we decided to keep the max depth hyperparameter between 1 and 5.
- This decision was made in an effort to find a model that wouldn't overfit the data.
- Depths of 3 and 5 tied for the best validation accuracy at 0.7652; we selected a max depth of 5.
- Test set accuracy was roughly 4 percentage points higher at 0.8072.
- As a sanity check we refit the best model on the entire dataset and scored it on that same data; the result was an accuracy of 0.8171.
- The Decision Tree, with a max depth of 5, fits the data with accuracy above our 0.75 threshold while staying fairly consistent across data sets.
- Due to the consistency of this model, we have confidence that overfitting and underfitting are not an issue.
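
The depth choice can also be made programmatically from the validation scores rather than hard-coded. A small sketch using the accuracies printed above: depths 3 and 5 both reach 0.7652, so some tie-breaking rule is needed. Preferring the shallower tree, as shown here, is only one possible rule and is an assumption for illustration; this analysis went with depth 5.

```python
# Validation accuracies copied from the output above, keyed by max_depth.
valid_scores = {1: 0.7216, 2: 0.7418, 3: 0.7652, 4: 0.7449, 5: 0.7652}

best_score = max(valid_scores.values())
# All depths that reach the best score (there can be ties).
tied_depths = [d for d, s in valid_scores.items() if s == best_score]

# Assumed tie-break: pick the shallowest (simplest) tree among the ties.
shallowest = min(tied_depths)
deepest = max(tied_depths)  # the depth used in this analysis

print(f"Best validation accuracy: {best_score:.4f}")
print(f"Depths tied for best: {tied_depths}")
print(f"Shallowest tied depth: {shallowest}, deepest tied depth: {deepest}")
```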

## Logistic Regression:

In [47]:
# Evaluate logistic regression

model = LogisticRegression(solver='liblinear', random_state=54321)
model.fit(features_train, target_train)
score_train = model.score(features_train, target_train)
score_valid = model.score(features_valid, target_valid)
print(f'Logistic Regression - Training Accuracy: {score_train:.4f}, Validation Accuracy: {score_valid:.4f}')


# Evaluate on test set

score_test = model.score(features_test, target_test)
print(f'Logistic Regression - Test Accuracy: {score_test:.4f}')

#Evaluate model on entire dataset

model_full = LogisticRegression(solver='liblinear', random_state=54321)
model_full.fit(users.drop('is_ultra', axis=1), users['is_ultra'])
score_full = model_full.score(users.drop('is_ultra', axis=1), users['is_ultra'])
print(f'Logistic Regression - Full Dataset Accuracy: {score_full:.4f}')


Logistic Regression - Training Accuracy: 0.7132, Validation Accuracy: 0.6781
Logistic Regression - Test Accuracy: 0.7403
Logistic Regression - Full Dataset Accuracy: 0.7427



- For the Logistic Regression model we decided to use the liblinear solver.
- Accuracy for each dataset, including our sanity check on the overall data, came in below our accuracy threshold of 0.75.
- The model's accuracy is fairly consistent across the various data sets, ranging from 0.6781 on validation to 0.7427 on the full dataset.
- Because the accuracy threshold was not met, we cannot recommend this model.
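
The scores in this section come from `model.score`, whereas the Decision Tree section used `accuracy_score`; for a classifier both report the same quantity, the fraction of predictions that match the true labels. A minimal pure-Python sketch of that calculation, using made-up labels rather than the Megaline data (`accuracy` is an illustrative helper):

```python
def accuracy(targets, predictions):
    """Fraction of predictions that match the true labels."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    correct = sum(t == p for t, p in zip(targets, predictions))
    return correct / len(targets)


# Made-up labels for illustration only (1 = Ultra, 0 = Smart).
toy_targets = [0, 1, 1, 0]
toy_predictions = [0, 1, 0, 0]

THRESHOLD = 0.75  # minimum accuracy required in this analysis
toy_score = accuracy(toy_targets, toy_predictions)
print(f"Accuracy: {toy_score:.4f}, meets threshold: {toy_score >= THRESHOLD}")
```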

## Random Forest:

In [48]:
# Evaluate Random Forest with n_estimators from 1 to 10

best_score = 0
best_est = 0
best_model = None

for est in range(1,11):
    model = RandomForestClassifier(n_estimators=est, random_state=54321)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est
        best_model = model  # keep the fitted model with the best validation score
print(f'Best n_estimators: {best_est}, Validation Accuracy: {best_score:.4f}')

# Evaluate the best model (not simply the last one fitted) on the test set
score_test = best_model.score(features_test, target_test)
print(f'Random Forest - Test Accuracy: {score_test:.4f}')

# Evaluate on entire dataset
model_full = RandomForestClassifier(n_estimators=best_est, random_state=54321)
model_full.fit(users.drop('is_ultra', axis=1), users['is_ultra'])
score_full = model_full.score(users.drop('is_ultra', axis=1), users['is_ultra'])
print(f'Random Forest - Full Dataset Accuracy: {score_full:.4f}')

Best n_estimators: 10, Validation Accuracy: 0.7698
Random Forest - Test Accuracy: 0.7994
Random Forest - Full Dataset Accuracy: 0.9779



- For our Random Forest model we ran a loop over n_estimators values from 1 to 10.
- The best n_estimators result was 10.
- Accuracy on the validation and test data sets was fairly consistent (0.7698 and 0.7994), with both scores above our threshold.
- However, our sanity check returned an accuracy score of 0.9779.
- Because accuracy on the overall data set is so much higher than on held-out data, we are concerned that the model is overfitted.
- This model will not be recommended for further use.
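
The overfitting concern can be made concrete by comparing each model's score on the full dataset (which the refit model was trained on) with its score on held-out validation data. A short sketch using only the accuracies reported in this notebook:

```python
# (validation accuracy, full-dataset accuracy) copied from the outputs above.
reported = {
    "Decision Tree (max_depth=5)": (0.7652, 0.8171),
    "Logistic Regression": (0.6781, 0.7427),
    "Random Forest (n_estimators=10)": (0.7698, 0.9779),
}

# The full-dataset model is scored on rows it was trained on, so a large gap
# over the validation score suggests memorization rather than generalization.
gaps = {name: full - valid for name, (valid, full) in reported.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: gap = {gap:.4f}")
```

The Random Forest gap is several times larger than the other two, which is the pattern behind the recommendation above.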

## Conclusion:
- After our analysis and testing of 3 separate models, we can now recommend the use of our Decision Tree with the hyperparameter max depth = 5.
- The model's accuracy on the validation, test, and full datasets scored above our accuracy threshold.
- Furthermore, the model's consistency gives confidence that the model is neither overfitted nor underfitted to the data.
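
As a rough check on how much weight the test score can bear, the sampling uncertainty of an accuracy measured on 643 rows can be estimated with a normal approximation to the binomial. This is a back-of-the-envelope sketch and not part of the original analysis; `accuracy_interval` is an illustrative helper.

```python
import math


def accuracy_interval(acc, n_samples, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_samples rows.

    Uses the normal approximation: acc +/- z * sqrt(acc * (1 - acc) / n).
    """
    std_err = math.sqrt(acc * (1 - acc) / n_samples)
    return acc - z * std_err, acc + z * std_err


THRESHOLD = 0.75
# Decision Tree test accuracy and test-set size reported above.
low, high = accuracy_interval(0.8072, 643)
print(f"Test accuracy 0.8072, approx. 95% interval: ({low:.4f}, {high:.4f})")
print(f"Lower bound above threshold: {low > THRESHOLD}")
```

Under this approximation the lower bound of the interval stays above the 0.75 threshold.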