<a href="https://colab.research.google.com/github/pallavrouth/MarketingAnalytics/blob/main/Predictive_modeling_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Core Constructs in Statistical Learning

## Recap: Prediction vs Inference

Prediction involves using a statistical model to make forecasts or estimates about future or unseen data points. It is typically used when the goal is to make informed guesses or projections about what might happen. It is generally used to answer questions such as -

1. Can we predict a response given some predictors?
2. How can we accurately make a prediction of a response given some predictors?

Inference involves drawing conclusions about a population or a process based on a sample of data. It is used to understand the associations between variables, typically by testing hypotheses about those associations. It is used to answer questions such as -

1. Which predictors are associated with the response?
2. What is the nature of the relationship between the response and each predictor?

## Prediction Accuracy

Prediction is a core goal of machine learning models. In simple words, the task for most **supervised ML** models is to generate outputs given a set of inputs. For example, a bank wants to know whether a customer will stay with the firm given his or her transaction history. One can build a machine learning model that predicts a customer's staying behavior (the output) from his or her transaction history (the inputs). The analyst needs to know whether the prediction is accurate, because the bank may decide to take (costly) measures to keep the customer.

Therefore, a fundamental task in the machine learning pipeline is to evaluate the accuracy of the predictions of ML models. Prediction accuracy is a measure of **how well** a model predicts **new or unseen data**. The better the accuracy, the more confidence the analyst can place in the model. Better accuracy also justifies choosing the model over the alternatives.





## Unseen data? The ML workflow

Notice that measuring the accuracy of predictions requires unseen data. This raises an important question - how do I test the accuracy of a model today if the unseen data will only arrive tomorrow? This question brings us to the ML workflow - a series of steps that guides analysts from model building to implementation.

Steps -

1. **Feature engineering** - Manipulate the datasets to create variables (features) that improve your model’s prediction accuracy. Create the same features in both the training set and the testing set.

2. **Split the data** - **Randomly** divide the records in the dataset into a **training** set and a **testing** set. The basic idea is to **pretend** that the training set is the data the analyst has available today and the test set is the data the analyst will have tomorrow. That is, the test data stands in for future, unseen data. This makes accuracy testing possible.

3. **Picking a suitable model** - Next, the analyst chooses an ML model that is suited to the task. The first decision is **classification versus regression**.

4. **Model building and assessment** - The analyst builds the model on the training set and then uses the same model to create predictions on the test set. The analyst can then compare these predictions to the actual output in the test set.

In [1]:
import pandas as pd
insurance_data = (
    pd.read_csv('https://raw.githubusercontent.com/pallavrouth/MarketingAnalytics/main/datasets/insurance.csv')
      .drop(columns = ['index'])
      .dropna(subset = ['age','region'])
)

insurance_data.shape

(1332, 10)

In [2]:
insurance_data.head()

Unnamed: 0,PatientID,age,gender,bmi,bloodpressure,diabetic,children,smoker,region,claim
0,1,39.0,male,23.2,91,Yes,0,No,southeast,1121.87
1,2,24.0,male,30.1,87,No,0,No,southeast,1131.51
7,8,19.0,male,41.1,100,No,0,No,northwest,1146.8
8,9,20.0,male,43.0,86,No,0,No,northwest,1149.4
9,10,30.0,male,53.1,97,No,0,No,northwest,1163.46


In [3]:
target = insurance_data.loc[:,'claim']
features = insurance_data.loc[:,['age','gender','bmi','bloodpressure','diabetic','children','smoker','region']]

In [4]:
target

0        1121.87
1        1131.51
7        1146.80
8        1149.40
9        1163.46
          ...   
1335    55135.40
1336    58571.07
1337    60021.40
1338    62592.87
1339    63770.43
Name: claim, Length: 1332, dtype: float64

In [5]:
features

Unnamed: 0,age,gender,bmi,bloodpressure,diabetic,children,smoker,region
0,39.0,male,23.2,91,Yes,0,No,southeast
1,24.0,male,30.1,87,No,0,No,southeast
7,19.0,male,41.1,100,No,0,No,northwest
8,20.0,male,43.0,86,No,0,No,northwest
9,30.0,male,53.1,97,No,0,No,northwest
...,...,...,...,...,...,...,...,...
1335,44.0,female,35.5,88,Yes,0,Yes,northwest
1336,59.0,female,38.1,120,No,1,Yes,northeast
1337,30.0,male,34.5,91,Yes,3,Yes,northwest
1338,37.0,male,30.4,106,No,0,Yes,southeast


In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

scaler = StandardScaler()
encoder = OneHotEncoder(sparse_output = False)

num_feats = ['age', 'bmi', 'bloodpressure']
cat_feats = ['gender','diabetic','children', 'smoker', 'region']

final_pipe = ColumnTransformer([
   ('num', scaler, num_feats),
   ('cat', encoder, cat_feats)
])

In [7]:
features_processed = final_pipe.fit_transform(features)
features_processed.shape

(1332, 19)

In [8]:
from sklearn.model_selection import train_test_split
features_train, features_test, target_train, target_test = train_test_split(features_processed, target, test_size = 0.3, random_state = 42)

In [9]:
print(features_train.shape)
print(features_test.shape)

(932, 19)
(400, 19)


## Accuracy interpretability tradeoff

Some ML models, such as deep neural networks or ensemble methods like random forests, are often capable of achieving high levels of prediction accuracy. They can capture intricate patterns in the data and make accurate predictions. These models are often referred to as **"black boxes"** because their internal workings are not easily understandable.

Other models, like linear regression or decision trees, are generally more interpretable. It's easier to **understand how they make predictions**, as the relationships between input features and the target variable are explicit. Interpretable models are important in scenarios where understanding the reasoning behind predictions is crucial, such as in medical diagnosis or finance.


## Accuracy and Error

In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data. That is, we need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation. One way to do that is by calculating the **mean squared error** (MSE). For each observation, the error is the difference between the predicted output and the actual output. The MSE is the average of the squared errors, so a smaller MSE means the predictions are closer to the actual outputs. We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE.
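As a small worked example, the MSE can be computed step by step and checked against scikit-learn's `mean_squared_error`. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values, for illustration only
actual = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 195.0, 270.0])

errors = predicted - actual          # one error per observation: 10, -10, -5, 20
squared_errors = errors ** 2         # squaring removes the sign and penalizes large misses
mse_by_hand = squared_errors.mean()  # (100 + 100 + 25 + 400) / 4

print(mse_by_hand)                            # 156.25
print(mean_squared_error(actual, predicted))  # same value from scikit-learn
```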





## Bias variance tradeoff

Closely related to accuracy and error are two further concepts: bias and variance.

1. **Bias:** Bias refers to the error introduced by approximating a real-world problem, which may be complex, by a simplified model. A high bias model is overly simplistic and does not capture the underlying patterns in the data. This leads to **underfitting**, where the model is not flexible enough to represent the data accurately. In this case, the model consistently makes systematic errors, and it has a poor performance on both the training and test data.

2. **Variance:** Variance refers to the error introduced by a model that is too complex and captures noise in the training data. A high-variance model is highly flexible and may **overfit** the training data, capturing random fluctuations and noise rather than the underlying patterns. Such a model may perform very well on the training data but poorly on new, unseen data.

The tradeoff can be summarized as follows:

1. **High Bias, Low Variance:** Models with high bias and low variance are simple and tend to underfit the data. They have a systematic error that is consistent across different datasets.

2. **Low Bias, High Variance:** Models with low bias and high variance are complex and tend to overfit the data. They are very flexible and can adapt to the noise in the training data, leading to poor generalization.

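Ideally we want low variance and low bias. In practice, making a model more flexible to lower its bias tends to raise its variance, and vice versa, which is why this is a tradeoff.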
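A minimal sketch of the tradeoff, assuming a made-up data-generating process (a sine curve plus noise) and polynomial regression as the model family. A low degree gives a rigid model and a high degree gives a very flexible one. Training MSE can only go down as the degree grows, while test MSE typically falls and then rises again.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def simulate(n):
    # Assumed data-generating process: a sine curve plus Gaussian noise
    x = rng.uniform(-1, 1, size = (n, 1))
    y = np.sin(3 * x[:, 0]) + rng.normal(0, 0.3, size = n)
    return x, y

x_train, y_train = simulate(30)
x_test, y_test = simulate(500)

train_mse, test_mse = {}, {}
for degree in [1, 5, 12]:
    # Higher degree = more flexible model
    poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly_model.fit(x_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, poly_model.predict(x_train))
    test_mse[degree] = mean_squared_error(y_test, poly_model.predict(x_test))
    print(f"degree = {degree}: train MSE = {train_mse[degree]:.3f}, test MSE = {test_mse[degree]:.3f}")
```

The degree-1 model is the high-bias case (it cannot follow the curve), and the degree-12 model is the high-variance case (it has nearly half as many parameters as there are training points).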

## Addressing bias-variance tradeoff

We can use different tactics at different stages of the ML workflow to address the bias-variance tradeoff.

1. At the model selection phase -
  1. **Model complexity** - Model complexity is a crucial concept in machine learning that revolves around finding the optimal level of model intricacy or simplicity to achieve the best performance and generalization. It's all about striking the right balance between a model that's too simple (high bias) and one that's too complex (high variance).

    1. High Bias (Underfitting): A model with high bias is overly simplistic and often fails to capture the underlying patterns in the data. It's like trying to fit a straight line through a complex, nonlinear dataset. The model is too rigid and doesn't adapt well to the intricacies of the data.

    2. High Variance (Overfitting): On the other extreme, a model with high variance is overly complex and tends to fit the noise in the data. It's like trying to fit a high-degree polynomial through a dataset with some random fluctuations. The model is too flexible and adapts too closely to the training data, losing its ability to generalize to new, unseen data.

    **Practical Implications:**

    1. Occam's Razor: Model complexity should adhere to the principle of Occam's Razor, which suggests that the simplest explanation (or model) that fits the data is often the best.
    2. Generalization: A well-balanced model complexity is essential for generalization. A model that generalizes well is one that can make accurate predictions on new, unseen data.
    3. Real-World Analogy: Think of model complexity as a lens you use to view the data. Too simple, and it's blurry; too complex, and it magnifies noise. The ideal lens finds a balance between sharpness and clarity.
  
  The following is a chart that displays models from low to high complexity.

  2. **Models with regularization** - Regularization is a technique that prevents overfitting and improves the generalization of a model. It does so by adding a penalty or constraint to the model's optimization process during training. This penalty discourages extreme values for the parameters or coefficients. In essence, it discourages the model from becoming too complex.

2. At the model building and evaluation phase -
  1. **Feature selection** - Removing irrelevant or noisy features can reduce variance, while ensuring you have relevant features can reduce bias.
  2. **Hyperparameter tuning** - Hyperparameters are parameters that are not learned from the data during the training of a machine learning model but are set prior to training. They play a critical role in controlling the behavior of the model and are often tuned to optimize the model's performance.
  Imagine you're driving a car, and the car's performance depends on several settings that you can adjust. These settings include the steering wheel's sensitivity, the pedal's responsiveness, the suspension's stiffness, and the engine's power. You can think of these settings as hyperparameters for your car.
  3. **Cross validation** - Use resampling techniques like k-fold cross-validation to assess how well your model generalizes to unseen data. This helps you identify whether your model is overfitting or underfitting.

# Supervised ML

There are two types of models - regression versus classification. In regression problems, the goal is to predict a continuous or numerical outcome. This outcome can take any real-number value within a certain range, making it a quantitative prediction. In classification problems, the goal is to assign input data points to predefined categories or classes. The outcome is a categorical variable, and predictions are made by assigning each data point to one of these classes.

## Regression Models

1. **Customer Lifetime Value (CLV) Prediction:** Predicting the future value of a customer over their entire relationship with a company. This can help businesses identify high-value customers, optimize marketing spend, and tailor their strategies to retain and acquire such customers.
2. **Sales Forecasting:** Predicting future sales or revenue based on historical sales data, marketing campaigns, seasonality, and other relevant factors. Accurate sales forecasts help in inventory management, resource allocation, and budget planning.
3. **Market Response Modeling:** Modeling the impact of marketing campaigns (e.g., advertising, promotions, email campaigns) on sales or customer acquisition. Understanding which marketing activities are most effective can help allocate resources more efficiently.


#### Linear Regression

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(features_train, target_train)

predicted_claim = model.predict(features_test)

In [11]:
(
    pd.DataFrame({
        'actual_claim' : list(target_test),
        'predicted_claim' : list(predicted_claim)
    })
).head(n = 10)

Unnamed: 0,actual_claim,predicted_claim
0,43254.42,34355.124208
1,20167.34,22103.976681
2,39836.52,41067.843342
3,4435.09,10160.688588
4,3659.35,6980.696663
5,8310.84,9707.04914
6,3732.63,2477.74245
7,4234.93,6650.206314
8,12890.06,14407.232439
9,9964.06,13531.543861


In [12]:
mse = mean_squared_error(target_test, predicted_claim)
mse

39571168.892895035

## Regression Models with Regularization

#### Lasso Regression

Lasso and Ridge regression are two popular regularization techniques used to improve linear regression models by addressing overfitting and multicollinearity.

1. **Feature Selection:** Lasso stands for "Least Absolute Shrinkage and Selection Operator." The key idea behind Lasso is to automatically select a subset of the most important features from your dataset while simultaneously reducing the magnitude of the coefficients of less important features.

2. **Penalizing Coefficients:** Lasso adds a penalty term to the linear regression objective function. This penalty is proportional to the absolute values of the coefficients. In other words, it encourages some of the coefficients to become exactly zero.

3. **Shrinking Coefficients:** As a result of the penalty term, Lasso shrinks the coefficients of less important features toward zero. This effectively removes some features from the model, making it more interpretable and potentially reducing overfitting.

In [13]:
from sklearn.linear_model import Lasso

alpha = 0.5
lasso_model = Lasso(alpha = alpha)
lasso_model.fit(features_train, target_train)

**Feature Selection**

In [19]:
coefficients = lasso_model.coef_
# The model was fit on the transformed matrix (19 columns after one-hot encoding),
# so the names must come from the fitted transformer, not from features.columns (8 names)
feature_names = final_pipe.get_feature_names_out()

# Pair each transformed column with the absolute value of its coefficient
feature_coefficients = sorted(zip(feature_names, abs(coefficients)), key = lambda x: x[1], reverse = True)
for feature, importance in feature_coefficients:
    print(f"Feature: {feature}, Importance: {importance}")


**Hyper parameter tuning**

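In Lasso, alpha is the weight on the penalty term: it controls how strongly the coefficients are shrunk. An alpha of 0 gives ordinary linear regression, and larger values push more coefficients to exactly zero.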

In [None]:
predicted_claim = lasso_model.predict(features_test)
mse = mean_squared_error(target_test, predicted_claim)
mse

39569906.157286175

In [None]:
alpha_values = [0.001, 0.01, 0.1, 1, 10]

best_alpha = None
best_mse = float('inf')

for alpha in alpha_values:
    model = Lasso(alpha = alpha)
    model.fit(features_train, target_train)

    predicted_claim = model.predict(features_test)
    mse = mean_squared_error(target_test, predicted_claim)

    print(f"Alpha = {alpha}: MSE = {mse}")
    if mse < best_mse:
        best_alpha = alpha
        best_mse = mse

print(f"Best Alpha: {best_alpha}")
print(f"Best MSE: {best_mse}")

Alpha = 0.001: MSE = 39571166.253885135
Alpha = 0.01: MSE = 39571142.65468069
Alpha = 0.1: MSE = 39570908.764939114
Alpha = 1: MSE = 39568718.500300884
Alpha = 10: MSE = 39562093.90058656
Best Alpha: 10
Best MSE: 39562093.90058656
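The loop above picks alpha by comparing MSE on the test set, so the reported "best" test MSE is no longer an estimate on truly unseen data. One alternative, sketched here on synthetic data (an assumption, since the block is meant to be self-contained), is to choose alpha by cross-validation on the training rows with `GridSearchCV` and touch the test set only once at the end.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the processed features (assumption)
X, y = make_regression(n_samples = 300, n_features = 10, n_informative = 4,
                       noise = 10.0, random_state = 42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size = 0.3, random_state = 42)

# 5-fold cross-validation on the training rows chooses alpha
search = GridSearchCV(
    Lasso(),
    param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10]},
    scoring = 'neg_mean_squared_error',
    cv = 5,
)
search.fit(X_tr, y_tr)

# The test set is used once, after alpha has been fixed
final_test_mse = mean_squared_error(y_te, search.predict(X_te))
print("Best alpha:", search.best_params_['alpha'])
print("Test MSE:", final_test_mse)
```

`GridSearchCV` refits the best model on all training rows by default, which is why `search.predict` can be called directly.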


#### Ridge Regression

Ridge regression is similar to Lasso regression: both are regularization techniques used in linear regression to address common issues like overfitting and multicollinearity. The key difference is how they penalize coefficients. Lasso uses an "L1" penalty (the sum of the absolute values of the coefficients), which can set coefficients exactly to zero, while ridge uses an "L2" penalty (the sum of the squared coefficients), which shrinks coefficients toward zero without eliminating them.

In [None]:
from sklearn.linear_model import Ridge

alpha_values = [0.001, 0.01, 0.1, 1, 10]

best_alpha = None
best_mse = float('inf')

for alpha in alpha_values:
    model = Ridge(alpha = alpha)
    model.fit(features_train, target_train)

    predicted_claim = model.predict(features_test)
    mse = mean_squared_error(target_test, predicted_claim)

    print(f"Alpha = {alpha}: MSE = {mse}")
    if mse < best_mse:
        best_alpha = alpha
        best_mse = mse

print(f"Best Alpha: {best_alpha}")
print(f"Best MSE: {best_mse}")

Alpha = 0.001: MSE = 39571182.99987377
Alpha = 0.01: MSE = 39571310.0588268
Alpha = 0.1: MSE = 39572590.13114069
Alpha = 1: MSE = 39586306.204397365
Alpha = 10: MSE = 39795744.151811816
Best Alpha: 0.001
Best MSE: 39571182.99987377
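The practical difference between the two penalties is easiest to see in the fitted coefficients. The sketch below uses synthetic data in which only the first three of ten features matter (an assumption made for illustration) and fits both models with the same alpha.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only the first 3 truly affect the target (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size = (200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(0, 1.0, size = 200)

lasso_fit = Lasso(alpha = 0.5).fit(X, y)   # L1 penalty on the coefficients
ridge_fit = Ridge(alpha = 0.5).fit(X, y)   # L2 penalty on the coefficients

print("Lasso:", np.round(lasso_fit.coef_, 2))
print("Ridge:", np.round(ridge_fit.coef_, 2))
print("Coefficients set exactly to zero by Lasso:", int(np.sum(lasso_fit.coef_ == 0)))
print("Coefficients set exactly to zero by Ridge:", int(np.sum(ridge_fit.coef_ == 0)))
```

Lasso drops irrelevant features by zeroing their coefficients, while Ridge keeps every feature with a small but nonzero coefficient.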


#### KNN Regression

K-Nearest Neighbors (KNN) regression is an algorithm that uses the concept of proximity to make predictions.

1. **Data Points as Neighbors:** In KNN regression, your dataset consists of data points, each with multiple features (independent variables) and a target value (the value you want to predict). The "neighbors" in KNN are other data points in your dataset.

2. **K and the Neighborhood:** The "K" in KNN represents the number of nearest neighbors you'll consider when making a prediction. For each data point, KNN finds the K nearest neighbors based on a distance metric, often Euclidean distance. These neighbors are the data points with the most similar feature values to the data point you're trying to predict.

3. **Proximity-Based Prediction:** To make a prediction for a new data point, KNN averages (for regression) or takes a majority vote (for classification) of the target values of its K nearest neighbors. In the case of regression, it calculates the mean (average) of the target values of these neighbors.
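The three steps above can be written out by hand for a single new point and compared with `KNeighborsRegressor`. The data and the names `X_known`, `y_known`, and `x_new` are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_known = rng.normal(size = (50, 2))                          # made-up features
y_known = X_known[:, 0] * 2 + rng.normal(0, 0.1, size = 50)   # made-up targets
x_new = np.array([0.25, -0.5])                                # point to predict
k = 5

# 1. Euclidean distance from the new point to every known point
distances = np.sqrt(((X_known - x_new) ** 2).sum(axis = 1))
# 2. Indices of the k closest points
nearest = np.argsort(distances)[:k]
# 3. Prediction = mean target value of those k neighbors
manual_prediction = y_known[nearest].mean()

# Same prediction from scikit-learn (defaults: Euclidean distance, uniform weights)
knn = KNeighborsRegressor(n_neighbors = k).fit(X_known, y_known)
sklearn_prediction = knn.predict(x_new.reshape(1, -1))[0]

print(manual_prediction, sklearn_prediction)
```

Because the prediction depends on distances, features on very different scales would dominate the neighbor search, which is one reason the numeric features are standardized earlier in the notebook.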

In [None]:
from sklearn.neighbors import KNeighborsRegressor

k_values = [1, 3, 5, 7, 9]
best_k = None
best_mse = float('inf')

for k in k_values:
    model = KNeighborsRegressor(n_neighbors = k)
    model.fit(features_train, target_train)

    predicted_claim = model.predict(features_test)
    mse = mean_squared_error(target_test, predicted_claim)

    print(f"k = {k}: MSE = {mse}")
    if mse < best_mse:
        best_k = k
        best_mse = mse

print(f"Best k: {best_k}")
print(f"Best MSE: {best_mse}")

k = 1: MSE = 79213741.58862902
k = 3: MSE = 66188856.72200756
k = 5: MSE = 53197633.02411645
k = 7: MSE = 49033842.68498274
k = 9: MSE = 49931386.88734692
Best k: 7
Best MSE: 49033842.68498274


## Classification Models

1. **Churn Prediction:** Identifying customers who are likely to churn (stop using a product or service) versus those who are likely to stay loyal. This allows companies to focus retention efforts on high-risk customers.

2. **Lead Scoring:** Classifying leads or prospects into categories like "hot leads," "warm leads," or "cold leads" based on their likelihood to convert into paying customers. This helps sales and marketing teams prioritize their efforts.

3. **Spam Detection:** Classifying incoming emails, comments, or social media posts as either spam or legitimate content. This ensures that spam content is filtered out, providing a better user experience.

4. **Sentiment Analysis:** Determining the sentiment (positive, negative, or neutral) of customer reviews, social media mentions, or other text data. This information helps gauge public perception and brand sentiment.

### Logistic Regression

In [None]:
churn_data = (
    pd.read_csv('https://raw.githubusercontent.com/pallavrouth/MarketingAnalytics/main/datasets/churn.csv')
      .drop(columns = ['RowNumber'])
)

churn_data.head()

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [None]:
churn_data.columns

Index(['CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age',
       'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember',
       'EstimatedSalary', 'Exited'],
      dtype='object')

In [None]:
target = churn_data.loc[:,'Exited']
features = churn_data.loc[:,['CreditScore', 'Geography', 'Gender', 'Age',
                             'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
                             'IsActiveMember', 'EstimatedSalary']]

scaler = StandardScaler()
encoder = OneHotEncoder(sparse_output = False)

num_feats = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
cat_feats = ['Geography', 'Gender','HasCrCard','IsActiveMember']

final_pipe = ColumnTransformer([
   ('num', scaler, num_feats),
   ('cat', encoder, cat_feats)
])

features_processed = final_pipe.fit_transform(features)
features_train, features_test, target_train, target_test = train_test_split(features_processed, target, test_size = 0.3, random_state = 42)

In [None]:
features_train.shape

(7000, 15)

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(features_train, target_train)

predicted_probability = model.predict_proba(features_test)

In [None]:
predicted_probability

array([[0.75174782, 0.24825218],
       [0.9200272 , 0.0799728 ],
       [0.68134544, 0.31865456],
       ...,
       [0.95564147, 0.04435853],
       [0.96179649, 0.03820351],
       [0.77938373, 0.22061627]])

In [None]:
predicted_label = model.predict(features_test)

In [None]:
predicted_label

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
(
    pd.DataFrame({
        'actual_exited' : list(target_test),
        'predicted_exited' : list(predicted_label)
    })
).head(n = 20)

## Validation for Classification Models

A confusion matrix is a table used in machine learning and statistics to evaluate the performance of a classification algorithm. It is a valuable tool for assessing how well a model's predictions align with the actual class labels in a dataset. In a binary classification problem, where there are two possible classes (e.g., "positive" and "negative"), a confusion matrix typically consists of four values:

1. **True Positives (TP):** The number of instances that were correctly predicted as positive (correctly classified as the positive class).
2. **True Negatives (TN):** The number of instances that were correctly predicted as negative (correctly classified as the negative class).
3. **False Positives (FP):** The number of instances that were incorrectly predicted as positive when they were actually negative (a type I error).
4. **False Negatives (FN):** The number of instances that were incorrectly predicted as negative when they were actually positive (a type II error).

A confusion matrix helps you understand the following concepts:

1. **Accuracy:** (TP + TN) / (TP + TN + FP + FN), which measures the overall correctness of the classification.
2. **Precision:** TP / (TP + FP), which quantifies the ability of the model to avoid false positive errors.
3. **Recall** (Sensitivity or True Positive Rate): TP / (TP + FN), which measures the ability of the model to correctly identify positive instances.
4. **F1 Score**: 2 × (Precision × Recall) / (Precision + Recall), the harmonic mean of precision and recall, which balances false positives and false negatives.

### ROC Curve

The Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) are tools used to assess the performance of binary classification models, especially in scenarios where you want to evaluate how well a model can distinguish between two classes.

- The ROC curve is a graphical representation of the performance of a binary classification model as its discrimination threshold is varied.

- It is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.

- The ROC curve shows how the model's sensitivity and specificity change as the decision threshold is adjusted. A perfect model would result in an ROC curve that rises from (0,0) straight up to (0,1) and then runs across to (1,1), hugging the top-left corner.

AUC (Area Under the ROC Curve):

- The AUC is a scalar value that quantifies the overall performance of a binary classification model using the ROC curve.
- It represents the area under the ROC curve and varies between 0 and 1.
- A model with an AUC of 0.5 performs no better than random chance, as its ROC curve is just a diagonal line from (0,0) to (1,1).
- A model with an AUC greater than 0.5 indicates that it has some level of discriminative power, with higher AUC values indicating better performance. An AUC of 1 would mean a perfect model that can perfectly distinguish between the two classes.
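A small sketch with made-up labels and scores shows how the ROC points and the AUC behave. The first set of scores ranks one positive below a negative. The second ranks every positive above every negative, so its curve passes through the point (FPR = 0, TPR = 1).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and scores for illustration
y_true = np.array([0, 0, 1, 1])
imperfect_scores = np.array([0.1, 0.4, 0.35, 0.8])  # one positive ranked below a negative
perfect_scores = np.array([0.1, 0.2, 0.8, 0.9])     # every positive ranked above every negative

for name, scores in [("imperfect", imperfect_scores), ("perfect", perfect_scores)]:
    # drop_intermediate = False keeps one ROC point per distinct threshold
    fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate = False)
    print(name, "ROC points (FPR, TPR):", list(zip(fpr.tolist(), tpr.tolist())))
    print(name, "AUC:", roc_auc_score(y_true, scores))
```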

In [None]:
from sklearn.metrics import confusion_matrix

confusion_mat = confusion_matrix(target_test, predicted_label)
pd.DataFrame(confusion_mat, columns=["Predicted 0", "Predicted 1"], index=["Actual 0", "Actual 1"])

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,2318,98
Actual 1,468,116


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

accuracy = accuracy_score(target_test, predicted_label)
precision = precision_score(target_test, predicted_label)
recall = recall_score(target_test, predicted_label)
f1 = f1_score(target_test, predicted_label)
roc_auc = roc_auc_score(target_test, model.predict_proba(features_test)[:, 1])


print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC AUC:", roc_auc)

Accuracy: 0.8113333333333334
Precision: 0.5420560747663551
Recall: 0.19863013698630136
F1 Score: 0.2907268170426065
ROC AUC: 0.7734098589313254
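The first four metrics can be reproduced by hand from the confusion matrix shown above, which is a useful check that the formulas are being read correctly. The counts below are copied from that matrix. The variable names are illustrative.

```python
# Counts copied from the logistic regression confusion matrix above
tn, fp = 2318, 98    # row "Actual 0"
fn, tp = 468, 116    # row "Actual 1"

accuracy_by_hand = (tp + tn) / (tp + tn + fp + fn)
precision_by_hand = tp / (tp + fp)
recall_by_hand = tp / (tp + fn)
f1_by_hand = 2 * precision_by_hand * recall_by_hand / (precision_by_hand + recall_by_hand)

print("Accuracy:", accuracy_by_hand)
print("Precision:", precision_by_hand)
print("Recall:", recall_by_hand)
print("F1 Score:", f1_by_hand)
```

The accuracy of about 0.81 looks good, but the recall of about 0.20 shows that the model misses roughly four out of five customers who actually left, which matters if the goal is to target retention efforts.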


### Support Vector Machines

SVMs build on the Maximal Margin Classifier, which uses a hyperplane to distinguish between two (or more) classes.

In a two-dimensional space (2D), a hyperplane is essentially a straight line that separates two classes of data points. In a three-dimensional space (3D), it becomes a flat plane. In higher-dimensional spaces, it's a higher-dimensional flat surface. The critical point is that a hyperplane is a decision boundary that separates data points belonging to different classes.

The key idea behind the Maximal Margin Classifier is to find the hyperplane that maximizes the margin between the classes. The margin is the distance between the hyperplane and the nearest data points from each class.

The data points that are closest to the hyperplane are called "support vectors." These are the critical data points that define the margin. The distance between the support vectors and the hyperplane should be maximized.
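To make the hyperplane, margin, and support vectors concrete, here is a minimal sketch on six made-up 2D points that are linearly separable. It assumes a linear kernel and a very large `C`, which approximates the maximal margin classifier. For a linear SVM the distance from the hyperplane to the nearest points is 1 / ||w||, where w is the coefficient vector.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy data (made up for illustration)
X_toy = np.array([[0, 0], [1, 1], [0, 1],    # class 0
                  [3, 3], [4, 4], [4, 3]])   # class 1
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard-margin (maximal margin) classifier
svm_toy = SVC(kernel = 'linear', C = 1e6).fit(X_toy, y_toy)

w = svm_toy.coef_[0]               # normal vector of the separating hyperplane
margin = 1 / np.linalg.norm(w)     # distance from the hyperplane to the nearest points

print("Support vectors:\n", svm_toy.support_vectors_)
print("Margin:", margin)
```

Only the points (1, 1) and (3, 3) sit on the margin here. Moving any of the other four points slightly would leave the fitted hyperplane unchanged.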

In [None]:
from sklearn.svm import SVC

model = SVC(C = 1, kernel = 'rbf', probability = True)
model.fit(features_train, target_train)

predicted_label = model.predict(features_test)
confusion_mat = confusion_matrix(target_test, predicted_label)
print(pd.DataFrame(confusion_mat, columns=["Predicted 0", "Predicted 1"], index=["Actual 0", "Actual 1"]))

accuracy = accuracy_score(target_test, predicted_label)
precision = precision_score(target_test, predicted_label)
recall = recall_score(target_test, predicted_label)
f1 = f1_score(target_test, predicted_label)
roc_auc = roc_auc_score(target_test, model.predict_proba(features_test)[:, 1])

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC AUC:", roc_auc)

          Predicted 0  Predicted 1
Actual 0         2368           48
Actual 1          359          225
Accuracy: 0.8643333333333333
Precision: 0.8241758241758241
Recall: 0.3852739726027397
F1 Score: 0.5250875145857643
ROC AUC: 0.8289875430917173


In [None]:
best_model = None
best_performance = 0

C_values = [0.1, 1, 10]
kernel_values = ['linear', 'rbf', 'poly']

for C in C_values:
    for kernel in kernel_values:
        model = SVC(C = C, kernel = kernel, probability=True)
        model.fit(features_train, target_train)
        predicted_label = model.predict(features_test)

        accuracy = accuracy_score(target_test, predicted_label)
        if accuracy > best_performance:
            best_performance = accuracy
            best_model = model

print("Best Model:")
print(best_model)
print("Best Performance (Accuracy):", best_performance)

Best Model:
SVC(C=10, kernel='poly', probability=True)
Best Performance (Accuracy): 0.869


# Resampling Methods

Resampling methods are an indispensable tool in modern statistics. In simple words, these methods involve **repeatedly drawing** samples from the same data and refitting a particular model of interest on each sample. Resampling methods are particularly important for model testing and evaluation. One such method is K fold Cross Validation.


## K fold Cross Validation

It involves dividing the dataset into K subsets (folds), training and testing the model K times, each time using a different fold as the test set and the remaining folds as the training set. Here's a step-by-step description of how K-fold CV is done:

1. **Data Splitting:** Start with a dataset containing your features (input data) and target variable (output data). The first step is to divide this dataset into K roughly equal-sized subsets, or "folds." The choice of K is determined by you; common values are 5 or 10.

2. **Training and Testing:** Perform K iterations, where each iteration represents one "fold." In each iteration, one of the K folds is used as the test set, while the other K-1 folds are used as the training set.

3. **Model Training:** Train your machine learning model on the training set for the current iteration. This includes selecting the algorithm, specifying hyperparameters, and fitting the model to the training data.

4. **Model Testing:** Use the trained model to make predictions on the test set for the current iteration. These predictions are used to evaluate the model's performance on unseen data.

5. **Performance Metric:** Calculate a performance metric (e.g., accuracy, precision, recall, F1 score, or any relevant metric for your problem) based on the model's predictions and the true values in the test set. This metric assesses how well the model is doing in the current iteration.

K fold Cross Validation can be better than a simple train test split. In K-fold CV each fold serves as the test set once. By averaging the results from multiple iterations, K-fold CV provides a more stable and less variable estimate of a model's performance compared to a single train-test split. This helps you obtain a more reliable assessment of how well the model is likely to perform on new, unseen data.

Additionally, K fold CV can be useful when the dataset is small and there aren't enough samples to set aside a separate test set.
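The five steps above can be written as an explicit loop. This sketch uses synthetic classification data and a logistic regression model (both are assumptions chosen to keep the example quick). The scaler is placed inside a pipeline so that it is re-fit on the training folds in every iteration. The last line shows that `cross_val_score` performs the same loop in one call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption, not the churn dataset)
X, y = make_classification(n_samples = 300, n_features = 8, random_state = 42)
kf = KFold(n_splits = 5, shuffle = True, random_state = 42)

fold_accuracy = []
for train_idx, test_idx in kf.split(X):
    # Scaling lives inside the pipeline, so it is re-fit on each training fold
    clf = make_pipeline(StandardScaler(), LogisticRegression())
    clf.fit(X[train_idx], y[train_idx])                        # model training
    fold_accuracy.append(clf.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold

print("Accuracy per fold:", np.round(fold_accuracy, 3))
print("Mean accuracy:", np.mean(fold_accuracy))

# cross_val_score runs the same loop in one call
shortcut = cross_val_score(make_pipeline(StandardScaler(), LogisticRegression()), X, y, cv = kf)
print("cross_val_score:", np.round(shortcut, 3))
```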



In [136]:
import numpy as np
from sklearn.model_selection import cross_val_score, KFold

C_values = [0.1, 1, 10]
kernel_values = ['linear', 'rbf', 'poly']

best_model = None
best_mean_accuracy = 0

kf = KFold(n_splits = 3, shuffle = True, random_state = 42)

for C in C_values:
    for kernel in kernel_values:
        model = SVC(C = C, kernel = kernel, probability = True)
        accuracy_scores = cross_val_score(model, features_processed, target, cv = kf, scoring = 'accuracy')
        mean_accuracy = np.mean(accuracy_scores)

        if mean_accuracy > best_mean_accuracy:
            best_mean_accuracy = mean_accuracy
            best_model = model

# Print the best model's hyperparameters and mean accuracy
print("Best Model:")
print(best_model)
print("Best Mean Accuracy:", best_mean_accuracy)

Best Model:
SVC(C=10, kernel='poly', probability=True)
Best Mean Accuracy: 0.8614993472952573


# Tree based Models

## Decision Tree

## Random Forest

## Bagging and Boosting

## Extreme Gradient Boosting