# Megaline Mobile Carrier Plan Recommendation

In this analysis, we take a dataset from the Megaline mobile carrier, examine how current subscribers behave on their current plans, and build a model that can recommend one of the company's newer plans: Smart or Ultra. In the dataframe, subscribers on Ultra are marked 1 in the is_ultra column, and subscribers on Smart are marked 0. The accuracy threshold that a model must reach is 0.75. We will check the final accuracy using the test set.

### Importing All Necessary Libraries for this Project

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
df

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


### Splitting the Dataframe into Training set, Validation set, and Test Set

In the first line below, I split the dataset into a training set and a second set named "df_temp", which temporarily holds the validation and test data. In the second line, I split "df_temp" in half to get the validation set and the test set, which gives a 60/20/20 split overall. Finally, I used the len() function to check the length of each set and confirm that the validation and test sets are the same size.

In [5]:
df_train, df_temp = train_test_split(df, test_size=0.4, random_state=13)

df_valid, df_test = train_test_split(df_temp, test_size=0.5, random_state=13)

In [6]:
print(len(df_train))
print(len(df_valid))
print(len(df_test))


1928
643
643
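
The sizes above follow from how the two fractions compound: 40% of the rows are held out first, and half of those become the test set. A minimal sketch of that arithmetic, using a plain list of 3214 row numbers as a stand-in for the dataframe (the stand-in and the variable names are assumptions for illustration, not part of the notebook):

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 3214 rows of the dataframe (illustrative only).
rows = list(range(3214))

# Hold out 40% of the rows, then split the held-out part in half.
train, temp = train_test_split(rows, test_size=0.4, random_state=13)
valid, test = train_test_split(temp, test_size=0.5, random_state=13)

# Absolute sizes and their share of the full dataset.
sizes = {"train": len(train), "valid": len(valid), "test": len(test)}
shares = {name: round(n / len(rows), 2) for name, n in sizes.items()}

print(sizes)
print(shares)
```

The three parts do not overlap, so no row seen during training is reused for validation or testing.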


### Features and Target for Training set, Validation set, and Test Set

In [7]:
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test  = df_test['is_ultra']

### Quality of Different Models: Decision Tree Classifier and Random Forest Classifier

#### Decision Tree Classifier

In [8]:
model = DecisionTreeClassifier(random_state=13, max_depth=5)
model.fit(features_train, target_train)  
valid_predictions_Decision = model.predict(features_valid)  
accuracy = accuracy_score(target_valid, valid_predictions_Decision)

print(f"Validation accuracy: {accuracy:.2f}")


Validation accuracy: 0.79


The validation accuracy of the Decision Tree Classifier is 0.79. That is above our threshold of 0.75, so this is a good model. I got this score after tuning the max_depth hyperparameter to 5 levels. If max_depth goes above five, the decision tree starts overfitting and the validation accuracy starts to drop.
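
One way to see where overfitting sets in is to sweep max_depth and compare training accuracy with validation accuracy at each depth. The sketch below does this on synthetic data from make_classification, which is an assumption standing in for the Megaline dataset, so the depth it picks will not necessarily be 5; the names depth_scores and best_depth are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Megaline dataset).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=13)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=13)

# Record (training accuracy, validation accuracy) for each depth.
depth_scores = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(random_state=13, max_depth=depth)
    tree.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    valid_acc = accuracy_score(y_valid, tree.predict(X_valid))
    depth_scores[depth] = (train_acc, valid_acc)
    print(f"max_depth={depth}: train={train_acc:.2f}, valid={valid_acc:.2f}")

# Pick the depth with the highest validation accuracy.
best_depth = max(depth_scores, key=lambda d: depth_scores[d][1])
print(f"Best max_depth on validation: {best_depth}")
```

A training score that keeps rising while the validation score flattens or falls is the usual sign that the tree has started to memorize the training rows.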

#### Random Forest Classifier

In [9]:
best_accuracy = 0
best_n_estimators = 0

for k in range(1, 50):
    model_forest = RandomForestClassifier(max_depth=5, random_state=13, n_estimators=k)
    model_forest.fit(features_train, target_train) 
    valid_predictions = model_forest.predict(features_valid) 
    accuracy = accuracy_score(target_valid, valid_predictions) 

    if accuracy > best_accuracy: 
        best_accuracy = accuracy 
        best_n_estimators = k  

print(f"Validation accuracy: {best_accuracy:.2f} with the best number of n_estimators: {best_n_estimators}")

Validation accuracy: 0.81 with the best number of n_estimators: 16


Using the Random Forest Classifier on the training and validation sets, the validation accuracy came out to 0.81, a value that exceeds the threshold. The highest score came with max_depth set to 5, which is the maximum number of levels between the root and a leaf node. I experimented with higher and lower values of max_depth, and the validation accuracy was lower in both directions. I also used a for loop over range(1, 50), which tries 49 values of n_estimators (the number of trees in the forest) from 1 to 49; the best value was 16. Both models scored about the same on the validation set (0.79 for the decision tree and 0.81 for the random forest), and the random forest, the slightly higher of the two, is the one I take to the test set.
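
The manual experiments with max_depth and the loop over n_estimators can be combined into a single search. The sketch below is one possible way to do that, not the procedure used in this notebook: it runs on synthetic data from make_classification (an assumption standing in for the Megaline dataset), and the candidate values and the names candidates, best_params, and best_acc are hypothetical.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Megaline dataset).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=13)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=13)

# Every (max_depth, n_estimators) pair to try: 5 depths x 4 forest sizes.
candidates = list(product(range(3, 8), (10, 20, 30, 40)))

best_params, best_acc = None, 0.0
for depth, n_trees in candidates:
    forest = RandomForestClassifier(max_depth=depth, n_estimators=n_trees,
                                    random_state=13)
    forest.fit(X_train, y_train)
    acc = accuracy_score(y_valid, forest.predict(X_valid))
    # Strict ">" keeps the first pair found when several pairs tie.
    if acc > best_acc:
        best_params, best_acc = (depth, n_trees), acc

print(f"Tried {len(candidates)} combinations")
print(f"Best (max_depth, n_estimators): {best_params}, validation accuracy: {best_acc:.2f}")

# For comparison, range(1, 50) yields the values 1 through 49.
print(len(range(1, 50)))
```

Because the comparison uses a strict ">", ties go to the earlier candidate, which is also why the loop in the notebook reports the smallest n_estimators among equally good values.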

### Test Set on Random Forest Classifier

In [None]:
best_accuracy_score = 0
best_estimator = 0

for k in range(1, 50):
    model = RandomForestClassifier(max_depth=5, random_state=13, n_estimators=k)
    model.fit(features_train, target_train)
    validation_prediction = model.predict(features_valid)
    accuracy_validation = accuracy_score(target_valid, validation_prediction)
    
    if accuracy_validation > best_accuracy_score:
        best_accuracy_score = accuracy_validation
        best_estimator = k

final_model = RandomForestClassifier(max_depth=5, random_state=13, n_estimators=best_estimator)
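# Refit on training + validation rows; pd.concat replaces .append(), which was removed in pandas 2.0
final_model.fit(pd.concat([features_train, features_valid]), pd.concat([target_train, target_valid]))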
test_prediction = final_model.predict(features_test)  
accuracy_test = accuracy_score(target_test, test_prediction)

print(f"Final Model Test Set Accuracy: {accuracy_test:.2f} with n_estimators: {best_estimator}.")

After choosing n_estimators on the validation set and evaluating the final Random Forest Classifier on the test set, the test accuracy was slightly lower than the validation accuracy. I got an accuracy of 0.79 with 16 n_estimators, compared with 0.81 on validation. The difference is small, and the score is still over the 0.75 threshold.
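
The final model is refit on the training and validation rows together. In pandas 2.0 and later, the append method on DataFrame and Series is no longer available, and pd.concat is the usual way to stack rows. A minimal sketch with tiny hypothetical frames (the names features_a, features_b, and so on are made up for illustration; the values are three rows from the table shown earlier):

```python
import pandas as pd

# Tiny hypothetical frames standing in for the training and validation sets.
features_a = pd.DataFrame({"calls": [40.0, 85.0], "minutes": [311.90, 516.75]},
                          index=[0, 1])
target_a = pd.Series([0, 0], index=[0, 1], name="is_ultra")
features_b = pd.DataFrame({"calls": [106.0], "minutes": [745.53]}, index=[3])
target_b = pd.Series([1], index=[3], name="is_ultra")

# Stack the rows; the original index labels are kept, so the features
# and the target stay aligned row by row.
features_all = pd.concat([features_a, features_b])
target_all = pd.concat([target_a, target_b])

print(features_all)
print(target_all)
```

Concatenating the features and the target in the same order keeps each label attached to the right row.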

### Conclusion

In this machine learning project, I used the Decision Tree Classifier and the Random Forest Classifier. After trying both, I concluded that the best model for this data is the Random Forest Classifier. The parameters I set in that model are max_depth, n_estimators, and random_state. Its accuracy on the test set is 0.79. Using this model, you can analyze a subscriber's behavior and recommend one of the plans, either Ultra or Smart, with an accuracy of 79%.
