# Data Preparation - WOE (Weight of Evidence) Basics

Weight of Evidence (WOE) quantifies the strength of the relationship between a category of an independent variable (predictor) and a binary target variable (response). For each category, it is the natural logarithm of the ratio between the category's share of all positive cases and its share of all negative cases.
It measures how strongly membership in the category shifts the odds toward the positive (1) or negative (0) class of the target variable.


*   If WOE > 0, the category holds a larger share of the positive cases than of the negative cases, so the positive event (good outcome) is more likely in it than in the sample overall.

*   If WOE < 0, the category holds a larger share of the negative cases than of the positive cases, so the negative event (bad outcome) is more likely in it than in the sample overall.

*   If WOE = 0, the category has the same mix of positive and negative cases as the whole sample, so it has no discriminatory power between the two events.
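
Written out, a common form of the definition is given below. The symbols are notation introduced here for illustration: $n^{+}_i$ and $n^{-}_i$ are the numbers of positive and negative cases in category $i$, and $N^{+}$ and $N^{-}$ are the totals across all categories.

$$
\mathrm{WOE}_i = \ln\!\left(\frac{n^{+}_i / N^{+}}{n^{-}_i / N^{-}}\right)
$$

Some credit-scoring references put the non-event share in the numerator instead, which flips the sign of every value. The form above matches the sign convention of the bullets: positive WOE means the positive class is over-represented.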


## Import Libraries

In [1]:
import pandas as pd
import numpy as np

## Example 1: Simple Calculation

In [2]:
data = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
                     'Target': [1, 0, 1, 1, 0, 1]})

In [3]:
# Count rows per category: overall, positives only, and negatives only
category_counts = data['Category'].value_counts()
category_counts_pos = data[data['Target'] == 1]['Category'].value_counts()
category_counts_neg = data[data['Target'] == 0]['Category'].value_counts()

print('Category_counts_pos: \n{}'.format(category_counts_pos))
print('\nCategory_counts_neg: \n{}'.format(category_counts_neg))

Category_counts_pos: 
A    2
B    2
Name: Category, dtype: int64

Category_counts_neg: 
B    1
A    1
Name: Category, dtype: int64


In [4]:
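# Calculate WOE: ln(share of all positives / share of all negatives)
total_pos = (data['Target'] == 1).sum()
total_neg = (data['Target'] == 0).sum()
woe_A = np.log((category_counts_pos['A'] / total_pos) / (category_counts_neg['A'] / total_neg))
woe_B = np.log((category_counts_pos['B'] / total_pos) / (category_counts_neg['B'] / total_neg))

print(f'WOE for Category A: {woe_A:.2f}')
print(f'WOE for Category B: {woe_B:.2f}')

WOE for Category A: 0.00
WOE for Category B: 0.00

Both categories contain two positives and one negative, the same 2:1 mix as the full sample, so neither carries any evidence about the target.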
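
WOE is commonly defined as the log of a category's share of all positives over its share of all negatives. As a minimal sketch under that definition, the per-category calculation can be done for every category at once with `pd.crosstab`. The `woe_table` name and its column names are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
                     'Target': [1, 0, 1, 1, 0, 1]})

# Rows are categories, columns are the target classes 0 and 1
counts = pd.crosstab(data['Category'], data['Target'])

# Each category's share of all positives and of all negatives
dist_pos = counts[1] / counts[1].sum()
dist_neg = counts[0] / counts[0].sum()

# Collect the intermediate quantities next to the final WOE (illustrative layout)
woe_table = pd.DataFrame({
    'n_pos': counts[1],
    'n_neg': counts[0],
    'dist_pos': dist_pos,
    'dist_neg': dist_neg,
    'woe': np.log(dist_pos / dist_neg),
})
print(woe_table)
```

Keeping the counts and shares next to the WOE column makes it easy to check each value by hand.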


## Example 2: Calculation with Binning

In [5]:
data = pd.DataFrame({'Age': [25, 30, 35, 40, 45, 50, 55, 60],
                     'Target': [1, 0, 1, 0, 1, 0, 0, 1]})

In [6]:
# Create age bins
bins = [0, 35, 45, 55, np.inf]
labels = ['<35', '35-45', '45-55', '55+']
data['Age_Bin'] = pd.cut(data['Age'], bins=bins, labels=labels)
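
`pd.cut` closes each interval on the right by default, so with these edges an age of exactly 35 lands in the bin labelled `<35`, and 45 lands in `35-45`. The sketch below compares the default with `right=False`; the `ages` series and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45, 50, 55, 60], name='Age')
bins = [0, 35, 45, 55, np.inf]

# Default (right=True): intervals are (0, 35], (35, 45], (45, 55], (55, inf]
right_closed = pd.cut(ages, bins=bins)

# right=False: intervals are [0, 35), [35, 45), [45, 55), [55, inf)
left_closed = pd.cut(ages, bins=bins, right=False)

comparison = pd.DataFrame({'Age': ages,
                           'right_closed': right_closed,
                           'left_closed': left_closed})
print(comparison)
```

Which side to close is a modelling choice; what matters is that the labels describe the intervals actually used.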

In [7]:
# Calculate WOE for each age bin
def calculate_woe(df, col, target_col):
    # Totals over the whole sample: WOE compares each bin's share of these
    total_pos = (df[target_col] == 1).sum()
    total_neg = (df[target_col] == 0).sum()
    category_counts = df[col].value_counts()
    category_counts_pos = df[df[target_col] == 1][col].value_counts()
    category_counts_neg = df[df[target_col] == 0][col].value_counts()
    woe_values = {}
    for category in category_counts.index:
        # A bin with no positives gives -inf; a bin with no negatives gives inf
        woe = np.log((category_counts_pos.get(category, 0) / total_pos) /
                     (category_counts_neg.get(category, 0) / total_neg))
        woe_values[category] = woe
    return woe_values

In [8]:
woe_age = calculate_woe(data, 'Age_Bin', 'Target')

print("WOE values for Age Bins:")

for category, woe in woe_age.items():
    print(f'{category}: {woe:.2f}')

WOE values for Age Bins:
<35: 0.69
35-45: 0.00
45-55: -inf
55+: inf


NumPy emits a `RuntimeWarning` for each of the last two bins: `45-55` has no positive cases, so the logarithm of zero gives `-inf`, and `55+` has no negative cases, so the division by zero gives `inf`.
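
The infinite values come from bins that contain no positives or no negatives. One common workaround is additive smoothing: add a small constant to every count before taking shares. It is sketched here as an assumption rather than as part of the original method. The helper name `calculate_woe_smoothed` and the default `alpha=0.5` are illustrative choices, and the sketch uses the share-of-totals definition of WOE.

```python
import numpy as np
import pandas as pd

def calculate_woe_smoothed(df, col, target_col, alpha=0.5):
    """WOE per category with `alpha` added to every count (hypothetical helper)."""
    values = df[col].astype(object)            # plain labels instead of a categorical
    is_pos = df[target_col] == 1
    categories = pd.unique(values.dropna())    # observed categories, in order of appearance
    pos = values[is_pos].value_counts().reindex(categories, fill_value=0)
    neg = values[~is_pos].value_counts().reindex(categories, fill_value=0)
    k = len(categories)
    # Smoothed shares still sum to 1 across categories
    dist_pos = (pos + alpha) / (pos.sum() + alpha * k)
    dist_neg = (neg + alpha) / (neg.sum() + alpha * k)
    return np.log(dist_pos / dist_neg)

data = pd.DataFrame({'Age': [25, 30, 35, 40, 45, 50, 55, 60],
                     'Target': [1, 0, 1, 0, 1, 0, 0, 1]})
data['Age_Bin'] = pd.cut(data['Age'], bins=[0, 35, 45, 55, np.inf],
                         labels=['<35', '35-45', '45-55', '55+'])

woe_smoothed = calculate_woe_smoothed(data, 'Age_Bin', 'Target')
print(woe_smoothed.round(2))
```

Smoothing trades the infinities for a small bias in every bin. Merging a sparse bin with a neighbour is another option.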


## Example 3: Calculation with missing values

In [9]:
data = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'B', 'A', 'C', np.nan],
    'Target': [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
})

In [10]:
# Replace missing values with a placeholder (e.g., 'Missing')
data['Category'] = data['Category'].fillna('Missing')
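
The placeholder matters because `value_counts` skips `NaN` by default, so rows with a missing category would otherwise drop out of the per-category counts. A small sketch of the difference, using a shortened illustrative series:

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'A', np.nan], name='Category')

# By default NaN is excluded from the counts
default_counts = s.value_counts()

# dropna=False keeps NaN as its own row
with_nan_counts = s.value_counts(dropna=False)

# Filling NaN with a label turns it into an ordinary category
filled_counts = s.fillna('Missing').value_counts()

print(default_counts, with_nan_counts, filled_counts, sep='\n\n')
```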

In [11]:
# Calculate WOE for each category including 'Missing'
def calculate_woe(df, col, target_col):
    # Totals over the whole sample: WOE compares each category's share of these
    total_pos = (df[target_col] == 1).sum()
    total_neg = (df[target_col] == 0).sum()
    category_counts = df[col].value_counts()
    category_counts_pos = df[df[target_col] == 1][col].value_counts()
    category_counts_neg = df[df[target_col] == 0][col].value_counts()
    woe_values = {}
    for category in category_counts.index:
        # A category with no negatives gives inf; one with no positives gives -inf
        woe = np.log((category_counts_pos.get(category, 0) / total_pos) /
                     (category_counts_neg.get(category, 0) / total_neg))
        woe_values[category] = woe
    return woe_values

In [12]:
woe_category = calculate_woe(data, 'Category', 'Target')

print("WOE values for Categories:")
for category, woe in woe_category.items():
    print(f'{category}: {woe:.2f}')

WOE values for Categories:
A: 0.69
B: -1.10
C: -0.41
Missing: inf


NumPy emits a `RuntimeWarning` here: `Missing` has no negative cases, so the division by zero gives `inf`.
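
Once WOE values are available, a typical data-preparation step is to replace each category label with its WOE so that a model receives a numeric feature. A minimal sketch under the share-of-totals definition follows. The `Category_WOE` column name is an assumption, and the `Missing` label is written in directly to keep the block self-contained.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'C', 'B', 'A', 'C', 'Missing'],
    'Target': [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
})

# Per-category counts of each target class
counts = pd.crosstab(data['Category'], data['Target'])
dist_pos = counts[1] / counts[1].sum()
dist_neg = counts[0] / counts[0].sum()

# np.log(0) would warn for a category with no positives; silence that here
with np.errstate(divide='ignore'):
    woe = np.log(dist_pos / dist_neg)

# Replace each label with the WOE of its category
data['Category_WOE'] = data['Category'].map(woe)
print(data)
```

A category with an infinite WOE, such as `Missing` here, carries that infinity into the encoded column and would need handling before modelling.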
