# ImbalancedLearningRegression (0.0.1): Usage
---
## SMOGNBoost
Amit Shanbhoug, 8677407 \
Adapted from Nick Kunz's SMOGN package: https://github.com/nickkunz/smogn/blob/master/examples/smogn_example_1_beg.ipynb


## Installation

First, we install ImbalancedLearningRegression. It can be installed either from the official PyPI distribution or from the GitHub repository; the developer version from GitHub is used here to get the latest release. This notebook was run on a Python 3.10.7 kernel, and other versions, older or newer, may cause issues.

In [None]:
%%capture
## suppress install output

## install pypi release
# !pip install ImbalancedLearningRegression

## install developer version
# !pip install git+https://github.com/paobranco/ImbalancedLearningRegression.git

## Dependencies
Next, we load the required dependencies. We import `ImbalancedLearningRegression` to apply the Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (SMOGN) and its boosting variant, SMOGNBoost. In addition, we use `pandas` for data handling, `scikit-learn` for the decision tree regressor, and `seaborn` to visualize our results.

In [1]:
## load dependencies
import numpy as np
import pandas as pd
import sklearn
import math

from sklearn import tree
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import seaborn
import matplotlib.pyplot as plt
from ImbalancedLearningRegression.smogn import smogn
from ImbalancedLearningRegression.smogn_boost import smogn_boost
#from ImbalancedLearningRegression import *
import ImbalancedLearningRegression as iblr

## Data Description
The dataset is sourced from Kaggle, where it has a usability score of 10.0 (source: https://www.kaggle.com/datasets/arashnic/imbalanced-data-practice). The goal is to predict whether an individual would be interested in vehicle insurance, as denoted by the `response` column. The target variable is highly imbalanced, with 319594 observations of class 0 and 62531 of class 1.

## Loading the Data
Below, we load our data. We name the training set `data` and the test set `test_data`. The cell reads `housing.csv` for training and `housingTest.csv` for testing from the package repository.


In [6]:
#ds = pd.read_csv("https://raw.githubusercontent.com/paobranco/ImbalancedLearningRegression/SMOGNBoost/data/housing.csv")
#print(len(ds))

data = pd.read_csv("https://raw.githubusercontent.com/paobranco/ImbalancedLearningRegression/SMOGNBoost/data/housing.csv")
#data = "https://raw.githubusercontent.com/paobranco/ImbalancedLearningRegression/SMOGNBoost/data/College.csv"
#test_data = "https://raw.githubusercontent.com/paobranco/ImbalancedLearningRegression/SMOGNBoost/data/CollegeTest.csv"
test_data = pd.read_csv("https://raw.githubusercontent.com/paobranco/ImbalancedLearningRegression/SMOGNBoost/data/housingTest.csv")

## Introduction to SMOGNBoost
Here we cover the focus of this example. We call the `smogn_boost` function from this package (`ImbalancedLearningRegression.smogn_boost`) and pass the following arguments:

* `data`: the training data set
* `test_data`: the test data set
* `y`: a string that specifies the response variable by header name
* `TotalIterations`: a positive integer that specifies the total number of iterations
* `pert`: the perturbation (noise) percentage
* `replace`: a boolean that controls sampling with replacement
* `k`: a positive integer, the number of neighbors used for over-sampling
* `error_threshold`: a positive number that specifies the error threshold (0.2 in the call below)
* `rel_thres`: the user-defined relevance threshold
* `samp_method`: the sampling method, either `"balance"` or `"extreme"`
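
Because `y` is passed as a column name, it can be worth confirming that the name is present in both data sets before calling `smogn_boost`. The helper below is a minimal sketch: `missing_target` is a hypothetical name that is not part of the package, and the two small frames are made-up stand-ins for `data` and `test_data`.

```python
import pandas as pd

def missing_target(frames, y):
    """Return the labels of the frames that lack column `y` (hypothetical helper)."""
    return [label for label, df in frames.items() if y not in df.columns]

# toy stand-ins for the training and test sets (made-up values)
train = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
test = pd.DataFrame({"LotArea": [11250, 9550]})

# lists every frame in which the target column is absent
print(missing_target({"data": train, "test_data": test}, y="SalePrice"))
```

In this toy case only the test frame lacks the column, so it is the only label returned. An empty list would mean that both frames can be split on `y`.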

In [7]:
## conduct smogn
## (the result gets its own name so that the imported smogn function is not overwritten)
data_smogn = smogn(data, y="SalePrice")

dist_matrix: 100%|##########| 213/213 [00:49<00:00,  4.32it/s]
synth_matrix: 100%|##########| 213/213 [00:09<00:00, 23.13it/s]
r_index: 100%|##########| 83/83 [00:01<00:00, 43.69it/s]


In [9]:
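## conduct smogn_boost
## (the result gets its own name so that the imported smogn_boost function is not overwritten)
boost_result = smogn_boost(data, test_data, y = "SalePrice", TotalIterations = 1, pert = 0.02, replace = False, k = 5, error_threshold = 0.2, rel_thres = 0.5, samp_method = "balance")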

0.000693000693000693
[0.000693 0.000693 0.000693 ... 0.000693 0.000693 0.000693]
{'method': 'auto', 'num_pts': 3, 'ctrl_pts': [34900, 0, 0, 163000.0, 0, 0, 340000.0, 1, 0]}

  if y in data.columns.values is False:

KeyError: 'SalePrice'

The remaining cells walk through the steps that `smogn_boost` performs. First, we split the training and test data into features (X) and target values (Y).

In [None]:
# name of the target column, as passed to smogn_boost above
y = "SalePrice"

# data and test_data were already read with pd.read_csv above, so we copy them
# and split the test data into features (X) and target value (Y)
df_testData = test_data.copy()
X_test = df_testData.drop(y, axis = 1)
Y_test = df_testData[y]

# split the training data into features (X) and target value (Y)
df_data = data.copy()
X_data = df_data.drop(y, axis = 1)
Y_data = df_data[y]

# set for clarity: y_train is the name of the target variable, not its data
y_train = y

Below, we set the initial iteration and initialize empty containers for the result, the beta values, and the decision tree predictions on `X_test`. These store values from each iteration and are used to calculate the result after the final iteration (the total number of iterations is specified by the user). We also initialize the weight distribution uniformly, with weight 1/m for each of the m training rows.

In [None]:
# parameters, matching the smogn_boost call above
TotalIterations = 1
pert = 0.02
replace = False
k = 5
error_threshold = 0.2
rel_thres = 0.5

# set an initial iteration
iteration = 1

# containers for the result, the beta values, and the decision tree predictions on X_test
# (lists, because one entry is appended per iteration)
result = None
beta = []
dt_test_predictions = []

# Dt(i): uniform distribution with weight 1/m, where m is the number of training rows
weights = 1 / len(data)
dt_distribution = np.full(len(data), weights)
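
As a quick check of the initial distribution, the sketch below builds uniform weights for an assumed training set of m = 1443 rows. The row count is an assumption, chosen because 1/1443 matches the value 0.000693000693 printed in the output above; the actual length of `data` is not shown in this notebook.

```python
import numpy as np

m = 1443  # assumed number of training rows, not read from the data

# Dt(i) = 1/m for every row
dt_distribution = np.full(m, 1 / m)

# first weight, and a check that the weights form a distribution
print(dt_distribution[0])
print(np.isclose(dt_distribution.sum(), 1.0))
```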

We call phi control and keep the control points it returns, since they are used when we apply SMOGN to oversample the imbalanced training data.

In [None]:
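# assumed module path for phi_ctrl_pts, mirroring the smogn and smogn_boost imports above
from ImbalancedLearningRegression.phi_ctrl_pts import phi_ctrl_pts

# calling phi control on the sorted target values (assumption: as in Kunz's SMOGN
# package, the function expects the values themselves, not the column name)
pc = phi_ctrl_pts(y = Y_data.sort_values(), method = "auto", xtrm_type = "both", coeff = 1.5, ctrl_pts = None)

# keep only the control points from the returned dictionary
rel_ctrl_pts_rg = pc["ctrl_pts"]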
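
The control points printed in the output earlier come as one flat list. Assuming, as in Kunz's SMOGN package, that every three consecutive numbers form one control point (target value, relevance, slope), the list can be reshaped for easier reading. This is only a sketch of that interpretation.

```python
import numpy as np

# flat list copied from the output shown earlier
ctrl_pts = [34900, 0, 0, 163000.0, 0, 0, 340000.0, 1, 0]

# assumed layout: one row per control point, columns (target value, relevance, slope)
points = np.array(ctrl_pts).reshape(-1, 3)

for value, relevance, slope in points:
    print(f"SalePrice = {value:>9.0f}  relevance = {relevance:.0f}  slope = {slope:.0f}")
```

Under this reading, only the highest control point has relevance 1, so high sale prices would be the region treated as rare and relevant.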

Below, we begin the main loop, which runs for the total number of iterations specified by the user.

In each iteration, we obtain an oversampled data set by applying SMOGN to the training data, split it into features and a target variable, fit a decision tree regressor, and update the weight distribution from the model's errors.

In [None]:
# loop until the number of iterations requested by the user is reached
while iteration <= TotalIterations:

    # oversample the original training data with SMOGN, reusing the control points computed above
    dt_over_sampled = smogn(data = data, y = y_train, k = k, pert = pert, replace = replace, rel_thres = rel_thres, rel_method = "manual", rel_ctrl_pts_rg = rel_ctrl_pts_rg)

    # split the oversampled data into features and target for training
    x_oversampled = dt_over_sampled.drop(y_train, axis = 1)
    y_oversampled = dt_over_sampled[y_train]

    # fit a decision tree regressor on the oversampled data
    dt_model = tree.DecisionTreeRegressor()
    dt_model = dt_model.fit(x_oversampled, y_oversampled)

    # predict the target for the original training data
    dt_data_predictions = dt_model.predict(X_data)

    # predict the target for the test data and store the predictions of this iteration
    dt_test_predictions.append(dt_model.predict(X_test))

    # model error: absolute relative difference between true and predicted training targets
    model_error = np.abs((Y_data.to_numpy() - dt_data_predictions) / Y_data.to_numpy())

    # epsilon_t: sum of the weights of all rows whose error exceeds the threshold
    epsilon_t = dt_distribution[model_error > error_threshold].sum()

    # beta is the update parameter of the weights, based on the model error rate
    beta.append(pow(epsilon_t, 2))

    # update the distribution: shrink the weights of rows predicted within the threshold
    dt_distribution[model_error <= error_threshold] *= beta[-1]

    # normalize the distribution (max norm, as in the original call)
    dt_distribution = preprocessing.normalize(dt_distribution.reshape(1, -1), norm = "max").ravel()

    # iteration count
    iteration += 1
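
To make the weight update concrete, here is a minimal sketch of a single round on made-up numbers. It follows the steps of the loop above (relative error, `epsilon_t`, beta as the square of `epsilon_t`, down-weighting of well-predicted rows). One difference is an assumption on our part: the sketch rescales the weights so that they sum to 1, rather than using the `preprocessing.normalize` call from the cell above.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])   # made-up targets
y_pred = np.array([110.0, 150.0, 310.0, 200.0])   # made-up predictions
error_threshold = 0.2

# uniform starting distribution
weights = np.full(len(y_true), 1 / len(y_true))

# absolute relative error of each prediction
model_error = np.abs((y_true - y_pred) / y_true)

# epsilon_t: total weight of the rows whose error exceeds the threshold
poorly_predicted = model_error > error_threshold
epsilon_t = weights[poorly_predicted].sum()

# beta: update factor derived from the error rate
beta_t = epsilon_t ** 2

# shrink the weights of well-predicted rows, leave the others unchanged
weights[~poorly_predicted] *= beta_t

# assumed normalization: rescale so that the weights sum to 1
weights = weights / weights.sum()

print(epsilon_t, beta_t)
print(weights)
```

After the update, the rows with large errors carry more weight than the rows that were predicted within the threshold.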

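Below, we calculate the result outside the while loop. The result is a weighted average of the test predictions from each iteration, which we accumulate as a numerator and a denominator. To simplify the calculation, we use log(b) instead of log(1/b): since log(1/b) = -log(b), the sign cancels between the numerator and the denominator.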

In [None]:
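# calculate result
numer = 0
denom = 0

# weight the test predictions of each iteration by log(beta)
for b, i in zip(beta, dt_test_predictions):
    numer += math.log(b) * i
    denom += math.log(b)

# final prediction for each test row (a notebook cell cannot use return)
result = numer / denom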
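
The sketch below, on made-up beta values and predictions, checks that weighting by `log(b)` and by `log(1/b)` gives the same weighted average. The helper name `combine` is ours and is not part of the package.

```python
import math
import numpy as np

def combine(betas, predictions, weight):
    """Weighted average of per-iteration predictions (illustrative helper)."""
    numer = sum(weight(b) * p for b, p in zip(betas, predictions))
    denom = sum(weight(b) for b in betas)
    return numer / denom

betas = [0.25, 0.04]                       # made-up beta values, one per iteration
predictions = [np.array([100.0, 200.0]),   # made-up test predictions, iteration 1
               np.array([120.0, 180.0])]   # made-up test predictions, iteration 2

with_log_b = combine(betas, predictions, lambda b: math.log(b))
with_log_inv_b = combine(betas, predictions, lambda b: math.log(1 / b))

print(with_log_b)
print(np.allclose(with_log_b, with_log_inv_b))
```

Note that a beta of exactly 0 or 1 would break this step, since `log(0)` is undefined and `log(1)` contributes zero weight.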

## Results
After conducting SMOGNBoost, we briefly examine the results.

In [None]:
## dimensions - original data 
data.shape

## Conclusion
TO DO

## References

Branco, P., Torgo, L., Ribeiro, R. (2017). SMOGN: A Pre-Processing Approach for Imbalanced Regression. Proceedings of Machine Learning Research, 74:36-50. http://proceedings.mlr.press/v74/branco17a/branco17a.pdf.

Torgo, L., Ribeiro, R. P., Pfahringer, B., & Branco, P. (2013, September). Smote for regression. In Portuguese conference on artificial intelligence (pp. 378-389). Springer, Berlin, Heidelberg. https://researchcommons.waikato.ac.nz/bitstream/handle/10289/8518/smoteR.pdf?sequence=23

Kunz, N. (2019). SMOGN: Synthetic Minority Over-Sampling for Regression with Gaussian Noise (Version 0.1.0). Python Package Index.
https://pypi.org/project/smogn. 

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.
http://www-bcf.usc.edu/~gareth/ISL/data.html.

