# Modeling and Prediction 
After preprocessing, we load the data, split it into training and test sets, train several models, and compare their predictions on the test set.

## Importing necessary libraries

In [1]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import json
from sklearn.neural_network import MLPClassifier


## Load data 
In this part, we load the data and extract the features **`X`** and the target **`y`** (the `churn` column).

In [2]:
data = pd.read_csv('preprocessed_data.csv')
X = data.drop(columns=['churn'])
y = data['churn']
X.shape, y.shape

((4250, 64), (4250,))

## Split the data into training and test sets

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
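
If the churn classes turn out to be imbalanced (not checked in this notebook), a plain random split can leave the test set with a different churn rate from the training set. The following is a minimal sketch, on synthetic labels rather than the notebook's data, of how the `stratify` argument of `train_test_split` keeps the class proportions equal in both splits. The 15% positive rate, the array shapes, and the `*_demo` names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data: 1000 rows, 15% positives (assumed).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([1] * 150 + [0] * 850)

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

Whether stratification changes the results reported below has not been tested here.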

## Models 
In this section, we try three different models and report their performance:
<ol>
    <li> xgboost </li> 
    <li> SVM </li> 
    <li> Neural Networks </li>
</ol>


In [None]:
# provide a dictionary to save results in 
Results = {}

### xgboost
**XGBoost** ("**Extreme Gradient Boosting**") is a popular machine learning framework. It is an implementation of gradient boosted decision trees designed for speed and performance. The code below trains an XGBoost model for each combination in a small hyperparameter grid and evaluates each one on the test set. Here is a brief explanation of each part of the code:

1. The training and test sets are first converted into `DMatrix` format, which is the internal data structure used by XGBoost. This is done using the `xgb.DMatrix` function, which takes the features (`X_train`, `X_test`) and labels (`y_train`, `y_test`) as input.
2. The number of boosting rounds is set to 10 using the `num_round` variable, so each model is built from 10 rounds of boosting.
3. The search space is defined in the `param` dictionary, where each key maps to a list of candidate values. The maximum depth of the trees (`max_depth`) and the learning rate (`eta`) each have three candidates; the objective function (`objective`), the number of threads (`nthread`), and the evaluation metric (`eval_metric`) each have a single value and therefore stay fixed.
4. A grid of hyperparameters is created using the `ParameterGrid` class from the `sklearn.model_selection` module. This generates every combination of the candidate values in `param`, here 3 × 3 = 9 combinations.
5. A model is trained for each combination using the `xgb.train` function, which takes the hyperparameters, the training data, and the number of rounds as input.
6. Predictions are made on the test set using the `predict` method of the trained model. With the `binary:logistic` objective these are probabilities, so they are rounded to 0 or 1 using a list comprehension.
7. The accuracy of the model is evaluated using the `accuracy_score` function from the `sklearn.metrics` module. This compares the predicted labels with the true labels in the test set.
8. After the loop, the best accuracy and its hyperparameters are printed, and the hyperparameters are saved to `xgb_best_params.json`.
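
To make step 4 concrete, here is a small sketch of what `ParameterGrid` produces for the search space used in this section. It does not train anything; it only expands the dictionary into individual parameter sets. The names `demo_param`, `demo_grid`, and `combos` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the notebook's `param` dictionary.
demo_param = {
    'max_depth': [3, 5, 7],
    'eta': [0.1, 0.3, 0.5],
    'objective': ['binary:logistic'],
    'nthread': [4],
    'eval_metric': ['auc'],
}

# ParameterGrid expands the lists into one dict per combination.
demo_grid = ParameterGrid(demo_param)
combos = list(demo_grid)

print(len(combos))  # 3 * 3 * 1 * 1 * 1 = 9
print(combos[0])    # one complete parameter dict, ready for xgb.train
```

Each element of `combos` is a plain dictionary with one value per key, which is why it can be passed directly to `xgb.train` in the loop.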


In [14]:
from sklearn.model_selection import ParameterGrid
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Convert the data into DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

num_round = 10

# Set the parameters for the XGBoost model
param = {
    'max_depth': [3, 5, 7],
    'eta': [0.1, 0.3, 0.5],
    'objective': ['binary:logistic'],
    'nthread': [4],
    'eval_metric': ['auc']
}

# Create a grid of hyperparameters
param_grid = ParameterGrid(param)

# Create a dictionary for storing the results
xgb_results = {}

# Train the model for each combination of hyperparameters
for params in param_grid:
    bst = xgb.train(params, dtrain, num_round)
    
    # Make predictions on the test set (probabilities of the positive class)
    y_pred = bst.predict(dtest)
    # Round each probability to a 0/1 label
    y_pred = [round(p) for p in y_pred]
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    
    # Store the result, keyed by accuracy (combinations that tie on
    # accuracy overwrite each other, so only the last one is kept)
    xgb_results[accuracy] = params

# Find the result with the best accuracy
best_accuracy = max(xgb_results.keys())
best_params = xgb_results[best_accuracy]

print(f'Best accuracy: {best_accuracy:.2f} with params: {best_params}')
with open('xgb_best_params.json','w') as file:
    json.dump(best_params,file)
    

Best accuracy: 0.97 with params: {'eta': 0.3, 'eval_metric': 'auc', 'max_depth': 5, 'nthread': 4, 'objective': 'binary:logistic'}
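
Because `xgb_results` is keyed by accuracy, two hyperparameter combinations that reach exactly the same accuracy overwrite each other, and only the last one survives. The sketch below shows the effect and one possible alternative, a list of `(accuracy, params)` pairs. The accuracy values and the names `runs`, `by_accuracy`, and `tied_best` are invented for illustration and are not results from this notebook.

```python
# Hypothetical (accuracy, params) pairs; the values are made up for illustration.
runs = [
    (0.95, {'max_depth': 3, 'eta': 0.1}),
    (0.97, {'max_depth': 5, 'eta': 0.3}),
    (0.97, {'max_depth': 7, 'eta': 0.5}),  # ties with the previous run
]

# Keyed by accuracy: the second 0.97 entry replaces the first.
by_accuracy = {}
for acc, params in runs:
    by_accuracy[acc] = params

# Kept as a list: every run is retained, ties included.
best_acc = max(acc for acc, _ in runs)
tied_best = [params for acc, params in runs if acc == best_acc]

print(len(by_accuracy))   # 2 entries survive out of 3 runs
print(by_accuracy[0.97])  # only the last tied combination
print(len(tied_best))     # both tied combinations are still available
```

Keeping all runs also makes it possible to apply a tie-breaking rule afterwards, for example preferring the shallower trees.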


### SVM
**SVM** stands for Support Vector Machine. It is a type of supervised machine learning algorithm that can be used for classification or regression tasks. SVMs work by finding the hyperplane that best separates the data into different classes. The hyperplane is chosen to maximize the margin, which is the distance between the hyperplane and the closest data points from each class. These closest data points are called support vectors, hence the name Support Vector Machine.

The code below trains an SVM model for each combination in a hyperparameter grid and evaluates each one on the test set. Here is a brief explanation of each part of the code:

1. The `SVC` class from the `sklearn.svm` module, imported in the first cell, implements the SVM algorithm for classification.
2. The `GridSearchCV` class from the `sklearn.model_selection` module is also imported in the first cell, but it is not used in the code.
3. The search space is defined in the `param` dictionary. It covers the regularization parameter (`C`), the kernel function (`kernel`), the degree of the polynomial kernel function (`degree`), and the kernel coefficient (`gamma`). Note that `degree` only affects the `poly` kernel and `gamma` has no effect on the `linear` kernel, so some combinations in the grid train identical models.
4. A grid of hyperparameters is created using the `ParameterGrid` class from the `sklearn.model_selection` module. This generates every combination of the candidate values in `param`.
5. A model is trained for each combination using the `fit` method of the `SVC` class, which takes the training data and labels as input.
6. Predictions are made on the test set using the `predict` method of the trained model.
7. The accuracy of the model is evaluated using the `accuracy_score` function from the `sklearn.metrics` module. This compares the predicted labels with the true labels in the test set.
8. After the loop, the best accuracy and its hyperparameters are printed, and the hyperparameters are saved to `svm_best_params.json`.
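
As an alternative to the manual loop, the imported but unused `GridSearchCV` could run the same kind of search with cross-validation on the training set, so that the test set is only used for the final score. The following is a hedged sketch on synthetic data from `make_classification`, not on the churn dataset. The grid is written as a list of dictionaries so that `degree` is only paired with the `poly` kernel; the grid values, fold count, and `*_demo` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the churn dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# One dict per kernel, so parameters a kernel ignores are not searched.
svm_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']},
    {'kernel': ['poly'], 'C': [0.1, 1, 10], 'degree': [2, 3], 'gamma': ['scale']},
]

# 3-fold cross-validation on the training set selects the hyperparameters.
search = GridSearchCV(SVC(), svm_grid, scoring='accuracy', cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"CV accuracy: {search.best_score_:.2f}")
# The held-out test set is touched only once, after selection.
print(f"Test accuracy: {search.score(X_te, y_te):.2f}")
```

The design choice here is that model selection never sees the test labels, which makes the final test accuracy a less optimistic estimate than picking the best test-set score from the loop.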


In [15]:


# Set the parameters for the SVM model
param = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'degree': [2, 3, 4],
    'gamma': ['scale', 'auto']
}

# Create a grid of hyperparameters
param_grid = ParameterGrid(param)

# create a dictionary to store best results
svm_results = {}

# Train the model for each combination of hyperparameters
for params in param_grid:
    clf = SVC(**params)
    clf.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred = clf.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    # Store the result in the dictionary
    svm_results[accuracy] = params
    
# Find the result with the best accuracy
best_accuracy = max(svm_results.keys())
best_params = svm_results[best_accuracy]

# print the accuracy and save the params for the best result
print(f'Best accuracy: {best_accuracy:.2f} with params: {best_params}')
with open('svm_best_params.json','w') as file:
    json.dump(best_params,file)

Best accuracy: 0.91 with params: {'C': 10, 'degree': 4, 'gamma': 'scale', 'kernel': 'rbf'}


### Neural Networks
A neural network is a type of machine learning algorithm that is inspired by the structure and function of the human brain. It consists of layers of interconnected nodes, where each node represents a neuron and the connections between nodes represent synapses. Neural networks can be used for a wide range of tasks, including classification, regression, and clustering.

The code below trains a multi-layer perceptron (MLP) neural network for each combination in a hyperparameter grid and evaluates each one on the test set. Here is a brief explanation of each part of the code:

1. The `MLPClassifier` class from the `sklearn.neural_network` module, imported in the first cell, implements an MLP neural network for classification.
2. The `ParameterGrid` class from the `sklearn.model_selection` module is also imported in the first cell.
3. The search space is defined in the `param` dictionary. It covers the number of neurons in each hidden layer (`hidden_layer_sizes`), the activation function for the neurons (`activation`), the solver for weight optimization (`solver`), the L2 regularization parameter (`alpha`), and the learning rate schedule for weight updates (`learning_rate`, which scikit-learn only uses with the `sgd` solver). The maximum number of iterations (`max_iter`) is fixed at 5000.
4. A grid of hyperparameters is created using the `ParameterGrid` class. This generates every combination of the candidate values in `param`.
5. A model is trained for each combination using the `fit` method of the `MLPClassifier` class, which takes the training data and labels as input.
6. Predictions are made on the test set using the `predict` method of the trained model.
7. The accuracy of the model is evaluated using the `accuracy_score` function from the `sklearn.metrics` module. This compares the predicted labels with the true labels in the test set.
8. After the loop, the best accuracy and its hyperparameters are printed, and the hyperparameters are saved to `nn_best_params.json`.
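
The grid in this section is large, and part of it is redundant because `learning_rate` is only used by the `sgd` solver. The sketch below counts the combinations in the full grid and in a reduced grid that varies `learning_rate` only for `sgd`. It uses the same candidate values as the notebook; the names `common`, `full_grid`, and `reduced_grid` are introduced here for illustration, and no model is trained.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values shared by every solver (same values as the notebook).
common = {
    'hidden_layer_sizes': [(10,), (50,), (100,)],
    'activation': ['identity', 'logistic', 'tanh', 'relu'],
    'alpha': [0.0001, 0.001, 0.01],
    'max_iter': [5000],
}

# Full grid: every solver is crossed with every learning-rate schedule.
full_grid = ParameterGrid({
    **common,
    'solver': ['lbfgs', 'sgd', 'adam'],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
})

# Reduced grid: the schedule is varied only for 'sgd'; the other solvers
# fall back to MLPClassifier's default learning_rate.
reduced_grid = ParameterGrid([
    {**common, 'solver': ['sgd'],
     'learning_rate': ['constant', 'invscaling', 'adaptive']},
    {**common, 'solver': ['lbfgs', 'adam']},
])

print(len(full_grid))     # 3 * 4 * 3 * 3 * 3 = 324
print(len(reduced_grid))  # 108 (sgd) + 72 (lbfgs, adam) = 180
```

Passing a list of dictionaries to `ParameterGrid` concatenates the individual grids, so the reduced version trains 144 fewer models while covering the same distinct configurations.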


In [4]:
# Set the parameters for the neural network
param = {
    'hidden_layer_sizes': [(10,), (50,), (100,)],
    'activation': ['identity', 'logistic', 'tanh', 'relu'],
    'solver': ['lbfgs', 'sgd', 'adam'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'max_iter': [5000]
}


# Create a grid of hyperparameters
param_grid = ParameterGrid(param)

# Create a dictionary to store results 
nn_results = {}

# Train the model for each combination of hyperparameters
for params in param_grid:
    clf = MLPClassifier(**params)
    clf.fit(X_train, y_train)
    
    # Make predictions on the test set
    y_pred = clf.predict(X_test)
    
    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    nn_results[accuracy] = params



# Find the result with the best accuracy
best_accuracy = max(nn_results.keys())
best_params = nn_results[best_accuracy]

# print the accuracy and save the params for the best result
print(f'Best accuracy: {best_accuracy:.2f} with params: {best_params}')
with open('nn_best_params.json','w') as file:
    json.dump(best_params,file)

Best accuracy: 0.93 with params: {'activation': 'tanh', 'alpha': 0.01, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'max_iter': 5000, 'solver': 'adam'}


