# AI vs Machine Learning vs Deep Learning

- AI is the broad concept of making machines mimic human intelligence through a set of algorithms. The field focuses on three skills: learning, reasoning, and self-correction to obtain maximum efficiency.
- Machine Learning is a subset of AI that uses algorithms that learn from data to make predictions.
- Deep Learning is a subset of machine learning that uses artificial neural networks to process and analyze information.

# Machine Learning Types

1. Supervised Learning: Trains models on labeled data to predict or classify new, unseen data. It is generally categorized into two main types:
   - Classification, where the goal is to predict discrete labels or categories.
   - Regression, where the aim is to predict continuous numerical values.
2. Unsupervised Learning: Finds patterns or groups in unlabeled data, like clustering.
3. Semi-supervised Learning: This approach combines a small amount of labeled data with a large amount of unlabeled data. It’s useful when labeling data is expensive or time-consuming.
4. Reinforcement Learning: Learns through trial and error to maximize rewards, ideal for decision-making tasks.

# Linear Regression

- It assumes that there is a linear relationship between the input and output, meaning the output changes at a constant rate as the input changes. This relationship is represented by a straight line.
- The best-fit line is the straight line that most accurately represents the relationship between the independent variable (input) and the dependent variable (output). The goal of the best-fit line is to minimize the difference between the actual data points and the values predicted by the model.
- Multiple linear regression generalizes the case of one predictor to several predictors (more than one independent variable).
- The formula for the best-fit line is Y = b0 + b1X for one predictor, or Y = b0 + b1X1 + b2X2 + b3X3 + ... for several predictors.

## Estimation of Mean Response

- The fitted regression line can be used to estimate the mean value of y for a given value of x.
- b1 is the slope.
- b1 = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
- The slope tells us how much y changes when x increases by 1 unit.
- b0 is the intercept, the estimated value of y when x = 0: b0 = mean(y) − b1 · mean(x)
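
As a worked example of the slope and intercept formulas, here is a minimal sketch in plain Python. The data values and the names `fit_line` and `estimate_at_6` are made up for illustration; the intercept is computed as mean(y) minus b1 times mean(x), the standard least-squares result.

```python
# Hypothetical toy data, made up for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: mean(y) - b1 * mean(x)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = fit_line(xs, ys)
estimate_at_6 = b0 + b1 * 6.0  # estimated mean response at x = 6
print(b0, b1, estimate_at_6)
```

For this data the slope comes out at about 0.6 and the intercept at about 2.2, so the estimated mean response at x = 6 is about 5.8.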

# K-Nearest Neighbor

- It works by finding the "k" closest data points (neighbors) to a given input and makes a prediction based on the majority class among them.
- It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the entire dataset and performs computations only at the time of classification.
- It uses distance metrics, such as Euclidean distance and Manhattan distance, to identify the nearest neighbors.
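
The steps above can be sketched in a few lines of plain Python. This is an illustration only, not how a library such as scikit-learn implements it; the function names and the toy points are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Classify `query` by majority vote among its k nearest training points."""
    # Lazy learning: there is no fitting step, all work happens here.
    order = sorted(
        range(len(train_points)),
        key=lambda i: distance(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two small clusters.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

pred_euclid = knn_predict(points, labels, (2.0, 2.0), k=3)
pred_manhattan = knn_predict(points, labels, (6.0, 7.0), k=3, distance=manhattan)
print(pred_euclid, pred_manhattan)
```

The `distance` parameter makes it easy to swap metrics; the choice of k and of the metric are both tuning decisions that depend on the data.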

# Feature Scaling

## What is Feature Scaling?

- In data preprocessing, we transform the data so that the model can process it without problems. Feature scaling is one such step: it puts the features on comparable scales.
- Feature scaling brings the features in the dataset onto a common scale, often a fixed range.

## Why Feature Scaling?

- Real-life datasets have many features with widely different ranges of values. Many machine learning algorithms that use Euclidean distance to measure similarity will give too little weight to features with small values, even when those features are actually important.
- KNN doesn’t know which feature is important. It only knows which numbers are bigger. Bigger numbers dominate the distance unless you scale.
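
To make the point about bigger numbers concrete, here is a small sketch with made-up applicants described by income and age. The feature ranges used for scaling are assumptions chosen for the example, not values from any dataset in this notebook.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def min_max(value, lo, hi):
    return (value - lo) / (hi - lo)


# Hypothetical applicants: (income in dollars, age in years).
query = (50_000.0, 25.0)
a = (50_500.0, 60.0)  # similar income, very different age
b = (52_000.0, 26.0)  # similar age, slightly different income

# Raw distances: the income column dominates because its numbers are larger.
raw_a = euclidean(query, a)
raw_b = euclidean(query, b)

# Assumed feature ranges, for illustration only.
INCOME_RANGE = (20_000.0, 120_000.0)
AGE_RANGE = (18.0, 80.0)


def scale(p):
    return (min_max(p[0], *INCOME_RANGE), min_max(p[1], *AGE_RANGE))


# Scaled distances: both features now contribute on a comparable scale.
scaled_a = euclidean(scale(query), scale(a))
scaled_b = euclidean(scale(query), scale(b))

print(raw_a < raw_b)        # a looks closer before scaling
print(scaled_b < scaled_a)  # b is closer after scaling
```

Before scaling, the 35-year age gap barely registers next to a few hundred dollars of income difference; after scaling, the nearest neighbor changes.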

## Feature Scaling Techniques

- Standardization:
  - Standard Scaler
- Normalization:
  - Min Max Scaling
  - Mean Normalization
  - Max Absolute Scaling
  - Robust Scaling

# Standardization

- Standardization is a scaling method where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero, and the resultant distribution has a unit standard deviation.
- Scaling is done by mean and standard deviation.
- Use Case: Models that assume or exploit normality and variance structure, like linear/logistic regression, SVMs, and PCA.
- Scale: not bounded
- Less affected by outliers than min-max scaling
- It is commonly used when the data is Gaussian (normally distributed)
- It is also known as z-score normalization

## Standard Scaler

- Calculates the z-value of each data point and replaces the data point with it.
- Formula: xNew = (x - xMean) / std
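
A minimal sketch of the formula on a single made-up column, using only the standard library. It uses the population standard deviation, which is what scikit-learn's `StandardScaler` uses; the function name `standardize` is an assumption for this example.

```python
import statistics


def standardize(values):
    """Return the z-score of every value: (x - mean) / std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


# Hypothetical column with mean 5 and standard deviation 2.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(values)
print(z)
```

After the transformation the column has mean 0 and standard deviation 1, but the values themselves are not confined to any fixed range.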

# Normalization

- Normalization is a data preprocessing technique used to adjust the values of features in a dataset to a common scale.
- Scaling is done by the highest and the lowest values.
- Use Case: Distance‑based or gradient‑based models that benefit from bounded inputs, such as k‑NN and many neural networks
- Scale: most commonly a fixed range such as [0, 1]
- Affected by outliers
- Often used when the data distribution is unknown or not Gaussian.
- It is also known as Scaling Normalization

## Min Max Scaler

- In min-max scaling, you subtract the minimum value of the feature from every value and then divide by the range of the feature (maximum − minimum). The scaled training values then lie between 0 and 1.
- Formula: xNew = (x - xMin) / (xMax - xMin)
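
The sketch below applies the formula to a made-up column and then reuses the same minimum and maximum on a new value. The helper names `fit_min_max` and `min_max_transform` are assumptions for this example; they loosely mirror the fit/transform split of scikit-learn's `MinMaxScaler`.

```python
def fit_min_max(train_values):
    """Learn the minimum and maximum from the training data."""
    return min(train_values), max(train_values)


def min_max_transform(values, lo, hi):
    """Apply xNew = (x - xMin) / (xMax - xMin) with a learned min and max."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical training column.
train = [10.0, 20.0, 30.0, 50.0]
lo, hi = fit_min_max(train)

scaled_train = min_max_transform(train, lo, hi)
scaled_new = min_max_transform([70.0], lo, hi)

print(scaled_train)  # training values land in [0, 1]
print(scaled_new)    # a new value above the training max lands above 1
```

The second call shows why the [0, 1] guarantee only holds for the data the minimum and maximum were computed from.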

## Mean Normalization

- Instead of subtracting the min() value as in min-max scaling, we subtract the average() value.
- Formula: xNew = (x - xMean) / (xMax - xMin)
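
A minimal sketch of mean normalization on a made-up column; the function name `mean_normalize` is an assumption for this example.

```python
import statistics


def mean_normalize(values):
    """Apply xNew = (x - xMean) / (xMax - xMin) to every value."""
    mean = statistics.fmean(values)
    value_range = max(values) - min(values)
    return [(v - mean) / value_range for v in values]


# Hypothetical column.
values = [10.0, 20.0, 30.0, 40.0]
normalized = mean_normalize(values)
print(normalized)
```

The result is centered at 0, and the distance between the smallest and largest scaled value is exactly 1.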

## Absolute Maximum Scaler

- Find the absolute maximum value of the feature in the dataset, then divide all the values in the column by that maximum value.
- If we do this for all the numerical columns, then all their values will lie between -1 and 1.
- Formula: xNew = x / max(|x|)
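
A minimal sketch of absolute maximum scaling on a made-up column that contains negative values and a zero; the function name `max_abs_scale` is an assumption for this example.

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value in the column."""
    max_abs = max(abs(v) for v in values)
    return [v / max_abs for v in values]


# Hypothetical column with a negative value and a zero.
values = [-4.0, 2.0, 8.0, 0.0]
scaled = max_abs_scale(values)
print(scaled)
```

Because nothing is subtracted, zeros stay zero, which is why this scaler suits sparse data.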

## Robust Scaler

- In this method, you subtract the median from every data point and then divide by the interquartile range (IQR).
- Formula: xNew = (x - xMedian) / IQR
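
A minimal sketch of robust scaling on a made-up column. The quartiles come from `statistics.quantiles` with the `inclusive` method, which interpolates linearly between data points; treat the exact quartile convention as an assumption, since libraries differ in the details.

```python
import statistics


def robust_scale(values):
    """Apply xNew = (x - median) / IQR to every value."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]


# Hypothetical column: median 6, first quartile 4, third quartile 8.
values = [2.0, 4.0, 6.0, 8.0, 10.0]
robust = robust_scale(values)
print(robust)
```

The median maps to 0, and the first and third quartiles map to -0.5 and 0.5.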

# When to Use and When to Avoid

## Standard Scaler

- Use when the data is roughly normal and has no heavy outliers.
- Avoid when the data has strong outliers or the distribution is extremely skewed.

## Min Max Scaler

- Use when you need a 0-1 range, or when features originally have very different scales and you want to keep relative distances.
- Avoid when outliers are present, or when you need robustness to future unseen values that may fall outside the training min/max.

## Mean Normalization

- Use when you want features centered at 0 but scaled by their range instead of their variance.
- Avoid when outliers are present.

## Max Absolute Scaler

- Use when the data is already centered at or near 0 and may be sparse.
- Avoid when features are not centered or contain large outliers.

## Robust Scaler

- Use when you want features on a comparable scale for distance/gradient‑based models but do not want to trim outliers.
- Avoid with very small sample sizes, where robust statistics like the IQR may be unstable, or when the data is already well‑behaved and close to Gaussian.
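
The outlier advice in this section can be checked on a tiny made-up column with one extreme value. This sketch compares min-max scaling with robust scaling; the helper names are assumptions, and the quartiles use the standard library's `inclusive` method.

```python
import statistics


def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values):
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Hypothetical column with one extreme outlier (100.0).
values = [1.0, 2.0, 3.0, 4.0, 100.0]

mm = min_max_scale(values)
rb = robust_scale(values)

# Spread of the four ordinary values after each scaling.
mm_spread = mm[3] - mm[0]  # the outlier squeezes them into a narrow band
rb_spread = rb[3] - rb[0]  # the median and IQR are not moved by the outlier

print(mm_spread, rb_spread)
```

With min-max scaling the four ordinary values end up within about 0.03 of each other, while robust scaling keeps them 1.5 apart and leaves the outlier far from the rest.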

# Data Resampling

- A widely adopted technique for dealing with highly unbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and / or adding more examples from the minority class (over-sampling).

## What Are Imbalanced Datasets?

- Imbalanced datasets are those where there is a severe skew in the class distribution, such as 1:100 or 1:1000 examples in the minority class to the majority class.
- This bias in the training dataset can influence many machine learning algorithms, leading some to ignore the minority class entirely. This is a problem as it is typically the minority class on which predictions are most important.

## Resampling Techniques

### Random Under-Sampling

- Aims to balance class distribution by randomly eliminating majority class examples. This is done until the majority and minority class instances are balanced out.
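
A minimal standard-library sketch of the idea, balancing every class down to the size of the smallest one. The function name and the toy data are assumptions for this example; imbalanced-learn's `RandomUnderSampler` offers more options, such as a target ratio.

```python
import random
from collections import Counter


def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample without replacement, so no row appears twice.
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical data: 10 majority rows (label 0) and 2 minority rows (label 1).
rows = list(range(12))
labels = [0] * 10 + [1] * 2

under_rows, under_labels = random_under_sample(rows, labels)
counts = Counter(under_labels)
print(counts)
```

Eight of the ten majority rows are discarded here, which is exactly the information loss listed under the disadvantages.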

#### Advantages 

- It can help improve run time and storage problems by reducing the number of training data samples when the training data set is huge.

#### Disadvantages

- It can discard potentially useful information that could be important for building rule classifiers. The sample chosen by random under-sampling may be biased and may not accurately represent the population, which can lead to inaccurate results on the actual test dataset.

### Random Over-Sampling

- Increases the number of instances in the minority class by randomly replicating them to present a higher representation of the minority class in the sample.
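
A minimal standard-library sketch of the idea, replicating minority rows until every class matches the largest one. The function name and the toy data are assumptions for this example; imbalanced-learn's `RandomOverSampler` also supports partial ratios.

```python
import random
from collections import Counter


def random_over_sample(rows, labels, seed=0):
    """Randomly replicate rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Keep every original row, then add random copies to reach the target.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical data: 10 majority rows (label 0) and 2 minority rows (label 1).
rows = list(range(12))
labels = [0] * 10 + [1] * 2

over_rows, over_labels = random_over_sample(rows, labels)
counts = Counter(over_labels)
print(counts)
```

Every added row is an exact copy of one of the two minority rows, which is why a model can over-fit to them.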

#### Advantages

- Unlike under-sampling, this method leads to no information loss. It often outperforms under-sampling.

#### Disadvantages

- It increases the likelihood of over-fitting since it replicates the minority class events.

### Synthetic Minority Over-sampling Technique (SMOTE)

- This technique is used to avoid the over-fitting that occurs when exact replicas of minority instances are added to the main dataset. A subset of the minority class is taken, and new synthetic instances are created by interpolating between each selected instance and its nearest minority-class neighbors. These synthetic instances are then added to the original dataset.
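
The interpolation step can be sketched as follows. This is a simplified illustration of the idea, not imbalanced-learn's implementation; the function name `smote_sketch`, its parameters, and the toy points are assumptions for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random position on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class points with two numeric features.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Each synthetic point lies on a line segment between two real minority points, so it is new data rather than a copy. The distance computation is also why every feature must be numeric.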

#### Advantages

- Mitigates the over-fitting caused by random over-sampling, because synthetic examples are generated instead of replicating existing instances. There is also no loss of useful information.

#### Disadvantages

- While generating synthetic examples, SMOTE does not take into consideration neighboring examples from other classes. This can increase the overlap between classes and introduce additional noise.

In [80]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Data Preprocessing

In [81]:
loanData = pd.read_csv('Loan_Default.csv')
loanData.drop(['ID', 'year'], axis=1, inplace=True)

In [82]:
categoricalFeatures = loanData.select_dtypes(include=['object']).columns.to_list()

In [83]:
OrdinalFeatures = ['age']
NominalFeatures = categoricalFeatures.copy()
NominalFeatures.remove('age')

In [84]:
encoder = OrdinalEncoder()
loanData[OrdinalFeatures] = encoder.fit_transform(loanData[OrdinalFeatures])

In [85]:
for c in NominalFeatures:
    
    freq = loanData[c].value_counts(normalize=True)
    loanData[c+'_freq'] = loanData[c].map(freq)

loanData.drop(NominalFeatures, axis=1, inplace=True)

In [86]:
high_missing_cols = ['rate_of_interest', 'Interest_rate_spread', 'Upfront_charges', 'property_value', 'LTV', 'dtir1']
Cols_to_be_imputed = ['term', 'income', 'age', 'loan_limit_freq', 'approv_in_adv_freq', 'loan_purpose_freq', 'Neg_ammortization_freq', 'submission_of_application_freq']
loanData =  loanData.drop(high_missing_cols, axis=1)

In [87]:
for c in Cols_to_be_imputed:
    # Assign the result back to the column: pandas warns that calling
    # fillna(..., inplace=True) on a column selection will stop working in pandas 3.0.
    loanData[c] = loanData[c].fillna(loanData[c].mean())


# Feature Scaling

## Standard Scaler

In [88]:
sclr = StandardScaler()
loanData1 = pd.DataFrame(sclr.fit_transform(loanData), columns=loanData.columns)

loanData1

Unnamed: 0,loan_amount,term,income,Credit_Score,age,Status,loan_limit_freq,Gender_freq,approv_in_adv_freq,loan_type_freq,...,lump_sum_payment_freq,construction_type_freq,occupancy_type_freq,Secured_by_freq,total_units_freq,credit_type_freq,co-applicant_credit_type_freq,submission_of_application_freq,Region_freq,Security_Type_freq
0,-1.166980,0.425737,-0.829008,0.502357,-1.474340,1.748627,0.274622,-0.086254,0.432241,0.559687,...,0.152617,0.0149,0.275199,0.0149,0.122273,-0.032010,0.999233,0.741826,-0.099392,0.0149
1,-0.677607,0.425737,-0.314189,-1.275413,0.500233,1.748627,0.274622,0.773193,0.432241,-1.722629,...,-6.552344,0.0149,0.275199,0.0149,0.122273,-2.826994,-1.000767,0.741826,0.556135,0.0149
2,0.409890,0.425737,0.400838,1.158234,-0.816149,-0.571877,0.274622,0.773193,-2.327742,0.559687,...,0.152617,0.0149,0.275199,0.0149,0.122273,-0.032010,0.999233,0.741826,-0.099392,0.0149
3,0.681764,0.425737,0.782185,-0.973365,-0.157958,-0.571877,0.274622,0.773193,0.432241,0.559687,...,0.152617,0.0149,0.275199,0.0149,0.122273,-0.032010,0.999233,-1.349842,0.556135,0.0149
4,1.986759,0.425737,0.553377,-0.843916,-1.474340,-0.571877,0.274622,0.599543,-2.327742,0.559687,...,0.152617,0.0149,0.275199,0.0149,0.122273,0.245329,-1.000767,-1.349842,0.556135,0.0149
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148665,0.573014,-2.656410,0.143428,-0.352008,0.500233,-0.571877,0.274622,-0.086254,0.432241,0.559687,...,0.152617,0.0149,0.275199,0.0149,0.122273,0.701940,-1.000767,0.741826,-0.099392,0.0149
148666,1.388636,0.425737,0.029024,-1.128704,-1.474340,-0.571877,0.274622,0.773193,0.432241,0.559687,...,0.152617,0.0149,-3.595669,0.0149,-8.223628,0.701940,0.999233,-1.349842,-0.099392,0.0149
148667,0.627389,-2.656410,-0.009111,0.019080,-0.157958,-0.571877,0.274622,0.773193,0.432241,0.559687,...,0.152617,0.0149,0.275199,0.0149,0.122273,0.701940,-1.000767,-1.349842,0.556135,0.0149
148668,-0.731981,-2.656410,0.029024,0.321128,0.500233,-0.571877,0.274622,-1.992000,0.432241,0.559687,...,0.152617,0.0149,0.275199,0.0149,0.122273,-0.032010,-1.000767,0.741826,0.556135,0.0149


## Min Max Scaler

In [89]:
sclr = MinMaxScaler()
loanData2 = pd.DataFrame(sclr.fit_transform(loanData), columns=loanData.columns)

loanData2

Unnamed: 0,loan_amount,term,income,Credit_Score,age,Status,loan_limit_freq,Gender_freq,approv_in_adv_freq,loan_type_freq,...,lump_sum_payment_freq,construction_type_freq,occupancy_type_freq,Secured_by_freq,total_units_freq,credit_type_freq,co-applicant_credit_type_freq,submission_of_application_freq,Region_freq,Security_Type_freq
0,0.028090,1.000000,0.003007,0.6450,0.000000,1.0,1.0,0.689191,1.0,1.000000,...,1.0,1.0,1.000000,1.0,1.0,0.792019,1.0,1.0,0.854314,1.0
1,0.053371,1.000000,0.008607,0.1300,0.500000,1.0,1.0,1.000000,1.0,0.061226,...,0.0,1.0,1.000000,1.0,1.0,0.000000,0.0,1.0,1.000000,1.0
2,0.109551,1.000000,0.016385,0.8350,0.166667,0.0,1.0,1.000000,0.0,1.000000,...,1.0,1.0,1.000000,1.0,1.0,0.792019,1.0,1.0,0.854314,1.0
3,0.123596,1.000000,0.020533,0.2175,0.333333,0.0,1.0,1.000000,1.0,1.000000,...,1.0,1.0,1.000000,1.0,1.0,0.792019,1.0,0.0,1.000000,1.0
4,0.191011,1.000000,0.018044,0.2550,0.000000,0.0,1.0,0.937202,0.0,1.000000,...,1.0,1.0,1.000000,1.0,1.0,0.870609,0.0,0.0,1.000000,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148665,0.117978,0.318182,0.013585,0.3975,0.500000,0.0,1.0,0.689191,1.0,1.000000,...,1.0,1.0,1.000000,1.0,1.0,1.000000,0.0,1.0,0.854314,1.0
148666,0.160112,1.000000,0.012341,0.1725,0.000000,0.0,1.0,1.000000,1.0,1.000000,...,1.0,1.0,0.031176,1.0,0.0,1.000000,1.0,0.0,0.854314,1.0
148667,0.120787,0.318182,0.011926,0.5050,0.333333,0.0,1.0,1.000000,1.0,1.000000,...,1.0,1.0,1.000000,1.0,1.0,1.000000,0.0,0.0,1.000000,1.0
148668,0.050562,0.318182,0.012341,0.5925,0.500000,0.0,1.0,0.000000,1.0,1.000000,...,1.0,1.0,1.000000,1.0,1.0,0.792019,0.0,1.0,1.000000,1.0


## Max Absolute Scaler

In [90]:
sclr = MaxAbsScaler()
loanData3 = pd.DataFrame(sclr.fit_transform(loanData), columns=loanData.columns)

loanData3

Unnamed: 0,loan_amount,term,income,Credit_Score,age,Status,loan_limit_freq,Gender_freq,approv_in_adv_freq,loan_type_freq,...,lump_sum_payment_freq,construction_type_freq,occupancy_type_freq,Secured_by_freq,total_units_freq,credit_type_freq,co-applicant_credit_type_freq,submission_of_application_freq,Region_freq,Security_Type_freq
0,0.032574,1.000000,0.003007,0.842222,0.000000,1.0,1.0,0.889317,1.000000,1.000000,...,1.000000,1.0,1.000000,1.0,1.000000,0.858095,1.000000,1.000000,0.856722,1.0
1,0.057738,1.000000,0.008607,0.613333,0.500000,1.0,1.0,1.000000,1.000000,0.183454,...,0.023292,1.0,1.000000,1.0,1.000000,0.317702,0.998468,1.000000,1.000000,1.0
2,0.113659,1.000000,0.016385,0.926667,0.166667,0.0,1.0,1.000000,0.185691,1.000000,...,1.000000,1.0,1.000000,1.0,1.000000,0.858095,1.000000,1.000000,0.856722,1.0
3,0.127639,1.000000,0.020533,0.652222,0.333333,0.0,1.0,1.000000,1.000000,1.000000,...,1.000000,1.0,1.000000,1.0,1.000000,0.858095,1.000000,0.549565,1.000000,1.0
4,0.194743,1.000000,0.018044,0.668889,0.000000,0.0,1.0,0.977637,0.185691,1.000000,...,1.000000,1.0,1.000000,1.0,1.000000,0.911717,0.998468,0.549565,1.000000,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148665,0.122047,0.500000,0.013585,0.732222,0.500000,0.0,1.0,0.889317,1.000000,1.000000,...,1.000000,1.0,1.000000,1.0,1.000000,1.000000,0.998468,1.000000,0.856722,1.0
148666,0.163987,1.000000,0.012341,0.632222,0.000000,0.0,1.0,1.000000,1.000000,1.000000,...,1.000000,1.0,0.053111,1.0,0.002185,1.000000,1.000000,0.549565,0.856722,1.0
148667,0.124843,0.500000,0.011926,0.780000,0.333333,0.0,1.0,1.000000,1.000000,1.000000,...,1.000000,1.0,1.000000,1.0,1.000000,1.000000,0.998468,0.549565,1.000000,1.0
148668,0.054942,0.500000,0.012341,0.818889,0.500000,0.0,1.0,0.643886,1.000000,1.000000,...,1.000000,1.0,1.000000,1.0,1.000000,0.858095,0.998468,1.000000,1.000000,1.0


## Robust Scaler

In [91]:
sclr = RobustScaler()
loanData4 = pd.DataFrame(sclr.fit_transform(loanData), columns=loanData.columns)

loanData4

Unnamed: 0,loan_amount,term,income,Credit_Score,age,Status,loan_limit_freq,Gender_freq,approv_in_adv_freq,loan_type_freq,...,lump_sum_payment_freq,construction_type_freq,occupancy_type_freq,Secured_by_freq,total_units_freq,credit_type_freq,co-applicant_credit_type_freq,submission_of_application_freq,Region_freq,Security_Type_freq
0,-0.750000,0.0,-0.959459,0.293532,-1.0,1.0,0.0,-0.797952,0.00000,0.000000,...,0.000000,0.0,0.000000,0.0,0.000000,-0.377872,0.0,0.0,-1.0,0.0
1,-0.375000,0.0,-0.229730,-0.731343,0.5,1.0,0.0,0.202048,0.00000,-0.621585,...,-0.954476,0.0,0.000000,0.0,0.000000,-4.186009,-1.0,0.0,0.0,0.0
2,0.458333,0.0,0.783784,0.671642,-0.5,0.0,0.0,0.202048,-0.68678,0.000000,...,0.000000,0.0,0.000000,0.0,0.000000,-0.377872,0.0,0.0,-1.0,0.0
3,0.666667,0.0,1.324324,-0.557214,0.0,0.0,0.0,0.202048,0.00000,0.000000,...,0.000000,0.0,0.000000,0.0,0.000000,-0.377872,0.0,-1.0,0.0,0.0
4,1.666667,0.0,1.000000,-0.482587,-1.0,0.0,0.0,0.000000,-0.68678,0.000000,...,0.000000,0.0,0.000000,0.0,0.000000,0.000000,-1.0,-1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148665,0.583333,-180.0,0.418919,-0.199005,0.5,0.0,0.0,-0.797952,0.00000,0.000000,...,0.000000,0.0,0.000000,0.0,0.000000,0.622128,-1.0,0.0,-1.0,0.0
148666,1.208333,0.0,0.256757,-0.646766,-1.0,0.0,0.0,0.202048,0.00000,0.000000,...,0.000000,0.0,-0.880211,0.0,-0.983117,0.622128,0.0,-1.0,-1.0,0.0
148667,0.625000,-180.0,0.202703,0.014925,0.0,0.0,0.0,0.202048,0.00000,0.000000,...,0.000000,0.0,0.000000,0.0,0.000000,0.622128,-1.0,-1.0,0.0,0.0
148668,-0.416667,-180.0,0.256757,0.189055,0.5,0.0,0.0,-3.015362,0.00000,0.000000,...,0.000000,0.0,0.000000,0.0,0.000000,-0.377872,-1.0,0.0,0.0,0.0


## Mean Normalization

In [92]:
for c in loanData.columns:
    colMean = loanData[c].mean()
    colMax = loanData[c].max()
    colMin = loanData[c].min()

    loanData[c] = (loanData[c] - colMean)/(colMax - colMin)

loanData

Unnamed: 0,loan_amount,term,income,Credit_Score,age,Status,loan_limit_freq,Gender_freq,approv_in_adv_freq,loan_type_freq,...,lump_sum_payment_freq,construction_type_freq,occupancy_type_freq,Secured_by_freq,total_units_freq,credit_type_freq,co-applicant_credit_type_freq,submission_of_application_freq,Region_freq,Security_Type_freq
0,-0.060286,0.094180,-0.009017,0.145527,-0.373331,0.753555,0.068659,-0.031193,0.15661,0.230213,...,0.022762,0.000222,0.068879,0.000222,0.014651,-0.009071,0.499617,0.354658,-0.022089,0.000222
1,-0.035005,0.094180,-0.003418,-0.369473,0.126669,0.753555,0.068659,0.279616,0.15661,-0.708560,...,-0.977238,0.000222,0.068879,0.000222,0.014651,-0.801090,-0.500383,0.354658,0.123597,0.000222
2,0.021175,0.094180,0.004360,0.335527,-0.206665,-0.246445,0.068659,0.279616,-0.84339,0.230213,...,0.022762,0.000222,0.068879,0.000222,0.014651,-0.009071,0.499617,0.354658,-0.022089,0.000222
3,0.035220,0.094180,0.008508,-0.281973,-0.039998,-0.246445,0.068659,0.279616,0.15661,0.230213,...,0.022762,0.000222,0.068879,0.000222,0.014651,-0.009071,0.499617,-0.645342,0.123597,0.000222
4,0.102635,0.094180,0.006019,-0.244473,-0.373331,-0.246445,0.068659,0.216818,-0.84339,0.230213,...,0.022762,0.000222,0.068879,0.000222,0.014651,0.069519,-0.500383,-0.645342,0.123597,0.000222
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148665,0.029602,-0.587639,0.001560,-0.101973,0.126669,-0.246445,0.068659,-0.031193,0.15661,0.230213,...,0.022762,0.000222,0.068879,0.000222,0.014651,0.198910,-0.500383,0.354658,-0.022089,0.000222
148666,0.071737,0.094180,0.000316,-0.326973,-0.373331,-0.246445,0.068659,0.279616,0.15661,0.230213,...,0.022762,0.000222,-0.899946,0.000222,-0.985349,0.198910,0.499617,-0.645342,-0.022089,0.000222
148667,0.032411,-0.587639,-0.000099,0.005527,-0.039998,-0.246445,0.068659,0.279616,0.15661,0.230213,...,0.022762,0.000222,0.068879,0.000222,0.014651,0.198910,-0.500383,-0.645342,0.123597,0.000222
148668,-0.037814,-0.587639,0.000316,0.093027,0.126669,-0.246445,0.068659,-0.720384,0.15661,0.230213,...,0.022762,0.000222,0.068879,0.000222,0.014651,-0.009071,-0.500383,0.354658,0.123597,0.000222


# Data Resampling

In [93]:
InsuranceData = pd.read_csv('Insurance.csv')

In [94]:
InsuranceData['Response'].value_counts() # 1 is the minority

Response
0    319553
1     62601
Name: count, dtype: int64

In [95]:
x = InsuranceData.drop('Response', axis=1) # rest of data without responses
y = InsuranceData['Response'] # only responses

## Random Over Sampling

In [96]:
overSample = RandomOverSampler(sampling_strategy=0.5) # minority becomes half of majority
xOver, yOver = overSample.fit_resample(x, y)
InsuranceData1 = pd.DataFrame(xOver) # resampled features
InsuranceData1['Responses'] = yOver # resampled target

InsuranceData1['Responses'].value_counts()

Responses
0    319553
1    159776
Name: count, dtype: int64

## Random Under Sampling

In [97]:
underSample = RandomUnderSampler(sampling_strategy=0.5) # majority is cut until minority is half of it
xUnder, yUnder = underSample.fit_resample(x, y)
InsuranceData2 = pd.DataFrame(xUnder) # resampled features
InsuranceData2['Responses'] = yUnder # resampled target

InsuranceData2['Responses'].value_counts()

Responses
0    125202
1     62601
Name: count, dtype: int64

## SMOTE

Unlike random over-sampling, which duplicates rows, and random under-sampling, which drops them, SMOTE finds nearest neighbors, computes distances, and creates new synthetic points. These calculations need numbers, not strings or objects, so we need to encode the categorical variables first.

In [98]:
categoricalFeatures = InsuranceData.select_dtypes(include=['object']).columns.to_list()

In [99]:
OrdinalFeatures = ['Vehicle_Age']
NominalFeatures = categoricalFeatures.copy()
NominalFeatures.remove('Vehicle_Age')

In [100]:
encoder = OrdinalEncoder()
InsuranceData[OrdinalFeatures] = encoder.fit_transform(InsuranceData[OrdinalFeatures])

In [101]:
for c in NominalFeatures:
    
    freq = InsuranceData[c].value_counts()
    InsuranceData[c+'_freq'] = InsuranceData[c].map(freq)

InsuranceData.drop(NominalFeatures, axis=1, inplace=True)

In [102]:
x = InsuranceData.drop('Response', axis=1)
y = InsuranceData['Response']

In [103]:
smote = SMOTE(sampling_strategy='minority') # resample only the minority class, up to the majority count
xSMOTE, ySMOTE = smote.fit_resample(x, y)
InsuranceData3 = pd.DataFrame(xSMOTE) # resampled features
InsuranceData3['Responses'] = ySMOTE # resampled target

InsuranceData3['Responses'].value_counts()

Responses
0    319553
1    319553
Name: count, dtype: int64