# Demo: Information Value Analysis with vivainsights

This notebook demonstrates how to use **Information Value (IV)** analysis with the **vivainsights** Python package to assess the predictive strength of organizational variables for collaboration outcomes.

## What is Information Value?

Information Value is a powerful statistical technique that quantifies how well a variable separates different outcome groups. It's particularly useful for:

- **Feature selection** in predictive modeling
- **Identifying key drivers** of business outcomes
- **Data quality assessment** by detecting suspicious relationships
- **Variable screening** before advanced analytics

### Information Value Interpretation Guide:

- **IV < 0.02**: Not useful for prediction (no relationship)
- **0.02 ≤ IV < 0.1**: Weak predictive power
- **0.1 ≤ IV < 0.3**: Medium predictive power  
- **0.3 ≤ IV < 0.5**: Strong predictive power
- **IV ≥ 0.5**: Very strong (potentially suspicious - check for data leakage)

## When to Use Information Value Analysis:

🎯 **Business Applications:**
- Identify which organizational attributes drive high collaboration
- Screen variables before building predictive models
- Validate business hypotheses about engagement drivers
- Assess data quality and detect potential issues

🔍 **Technical Applications:**
- Feature selection for machine learning models
- Variable importance ranking
- Data leakage detection
- Dimensionality reduction guidance

In this walkthrough, you will learn to:
1. Calculate IV for individual variables
2. Perform batch IV analysis across multiple variables
3. Visualize and interpret IV results
4. Apply best practices for IV analysis in organizational data

In [21]:
# Import necessary libraries
import vivainsights as vi
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

## Step 1: Load and Prepare Demo Data

First, let's load the sample Person Query dataset and prepare it for Information Value analysis. We'll create a meaningful binary outcome variable and identify predictor variables for analysis.

In [22]:
# Load the demo data
pq_data = vi.load_pq_data()

# Display basic information about the dataset
print(f"📈 Dataset Overview:")
print(f"  Shape: {pq_data.shape[0]:,} rows × {pq_data.shape[1]} columns")
print(f"  Date range: {pq_data['MetricDate'].min()} to {pq_data['MetricDate'].max()}")
print(f"  Unique employees: {pq_data['PersonId'].nunique():,}")
print(f"  Time periods: {pq_data['MetricDate'].nunique()} unique weeks")

# Preview the data
print(f"\n📋 Sample of the data:")
pq_data.head()

📈 Dataset Overview:
  Shape: 10,500 rows × 73 columns
  Date range: 2024-03-31 to 2024-11-24
  Unique employees: 300
  Time periods: 35 unique weeks

📋 Sample of the data:


Unnamed: 0,PersonId,MetricDate,Collaboration_hours,Copilot_actions_taken_in_Teams,Meeting_and_call_hours,Internal_network_size,Email_hours,Channel_message_posts,Conflicting_meeting_hours,Large_and_long_meeting_hours,...,Summarise_chat_actions_taken_using_Copilot_in_Teams,Summarise_email_thread_actions_taken_using_Copilot_in_Outlook,Summarise_meeting_actions_taken_using_Copilot_in_Teams,Summarise_presentation_actions_taken_using_Copilot_in_PowerPoint,Summarise_Word_document_actions_taken_using_Copilot_in_Word,FunctionType,SupervisorIndicator,Level,Organization,LevelDesignation
0,bf361ad4-fc29-432f-95f3-837e689f4ac4,2024-03-31,17.452987,4,11.767599,92,7.523189,0.753451,2.07921,0.635489,...,2,0,0,0,0,Specialist,Manager,Level3,IT,Senior IC
1,0500f22c-2910-4154-b6e2-66864898d848,2024-03-31,32.86082,6,26.74337,193,11.578396,0.0,8.106997,1.402567,...,2,0,4,1,0,Specialist,Manager,Level2,Legal,Senior Manager
2,bb495ec9-8577-468a-8b48-e32677442f51,2024-03-31,21.502359,8,13.982031,113,9.073214,0.894786,3.001401,0.000192,...,1,1,0,0,0,Manager,Manager,Level4,Legal,Junior IC
3,f6d58aaf-a2b2-42ab-868f-d7ac2e99788d,2024-03-31,25.416502,4,16.895513,131,10.281204,0.528731,1.846423,1.441596,...,0,0,0,0,0,Manager,Manager,Level1,HR,Executive
4,c81cb49a-aa27-4cfc-8211-4087b733a3c6,2024-03-31,11.433377,4,6.957468,75,5.510535,2.288934,0.474048,0.269996,...,0,0,1,0,0,Technician,Manager,Level1,Finance,Executive


In [23]:
# Create a meaningful binary outcome variable
# Let's predict "High Collaboration" based on collaboration hours above median

collaboration_median = pq_data['Collaboration_hours'].median()
pq_data_iv = pq_data.copy()
pq_data_iv['High_Collaboration'] = np.where(
    pq_data_iv['Collaboration_hours'] > collaboration_median, 1, 0
)

print(f"🎯 Binary Outcome Variable Created:")
print(f"  Target: High_Collaboration (above {collaboration_median:.1f} hours/week)")
print(f"  Positive cases: {pq_data_iv['High_Collaboration'].sum():,} ({pq_data_iv['High_Collaboration'].mean()*100:.1f}%)")
print(f"  Negative cases: {(1-pq_data_iv['High_Collaboration']).sum():,} ({(1-pq_data_iv['High_Collaboration']).mean()*100:.1f}%)")
print(f"  Total observations: {len(pq_data_iv):,}")

# Verify the outcome distribution
outcome_dist = pq_data_iv['High_Collaboration'].value_counts()
print(f"\n📊 Outcome Distribution:")
print(f"  Class 0 (Normal collaboration): {outcome_dist[0]:,}")
print(f"  Class 1 (High collaboration): {outcome_dist[1]:,}")
print(f"  Balance ratio: {outcome_dist[1] / outcome_dist[0]:.2f}")

🎯 Binary Outcome Variable Created:
  Target: High_Collaboration (above 23.0 hours/week)
  Positive cases: 5,250 (50.0%)
  Negative cases: 5,250 (50.0%)
  Total observations: 10,500

📊 Outcome Distribution:
  Class 0 (Normal collaboration): 5,250
  Class 1 (High collaboration): 5,250
  Balance ratio: 1.00


In [24]:
# Identify organizational variables for IV analysis
hr_vars_raw = vi.extract_hr(data=pq_data_iv, return_type="suggestion")

# Remove MetricDate as it's not a meaningful predictor for organizational analysis
hr_vars = [var for var in hr_vars_raw if var != 'MetricDate']

print(f"🏢 Organizational Variables Available for IV Analysis:")
for i, var in enumerate(hr_vars, 1):
    unique_count = pq_data_iv[var].nunique()
    missing_pct = (pq_data_iv[var].isnull().sum() / len(pq_data_iv)) * 100
    print(f"  {i}. {var}: {unique_count} categories, {missing_pct:.1f}% missing")

# Also identify continuous collaboration metrics that could predict high collaboration
continuous_predictors = [
    'Email_hours', 'Meeting_hours', 'Chat_hours',
    'Emails_sent', 'Meetings', 'Calls',
    'Internal_network_size', 'External_network_size'
]

# Filter to only include available continuous variables
available_continuous = [var for var in continuous_predictors if var in pq_data_iv.columns]

print(f"\n📊 Continuous Predictors Available for IV Analysis:")
for i, var in enumerate(available_continuous, 1):
    mean_val = pq_data_iv[var].mean()
    std_val = pq_data_iv[var].std()
    print(f"  {i}. {var}: mean={mean_val:.1f}, std={std_val:.1f}")

print(f"\n🔍 Total Variables for Analysis:")
print(f"  Organizational (categorical): {len(hr_vars)}")
print(f"  Continuous metrics: {len(available_continuous)}")
print(f"  Total predictors: {len(hr_vars) + len(available_continuous)}")
print(f"\n✅ Excluded MetricDate from analysis (not a meaningful organizational predictor)")

🏢 Organizational Variables Available for IV Analysis:
  1. FunctionType: 5 categories, 0.0% missing
  2. SupervisorIndicator: 2 categories, 0.0% missing
  3. Level: 4 categories, 0.0% missing
  4. Organization: 7 categories, 0.0% missing
  5. LevelDesignation: 4 categories, 0.0% missing

📊 Continuous Predictors Available for IV Analysis:
  1. Email_hours: mean=8.8, std=2.5
  2. Meeting_hours: mean=19.0, std=21.2
  3. Chat_hours: mean=3.1, std=3.3
  4. Emails_sent: mean=44.0, std=14.3
  5. Meetings: mean=16.7, std=7.1
  6. Calls: mean=24.3, std=27.2
  7. Internal_network_size: mean=123.0, std=40.0
  8. External_network_size: mean=32.4, std=12.0

🔍 Total Variables for Analysis:
  Organizational (categorical): 5
  Continuous metrics: 8
  Total predictors: 13

✅ Excluded MetricDate from analysis (not a meaningful organizational predictor)


## Step 2: Calculate Information Value for Single Variables

Let's start by calculating Information Value for individual variables to understand how the `create_IV()` function works and how to interpret the results.

In [26]:
# Example 1: Calculate IV for a categorical variable (Organization)
print("🏢 Information Value Analysis for Organization:")
print("=" * 50)

# Note: There was a bug in vivainsights create_IV function where it attempted to apply
# Wilcoxon tests (appropriate for continuous data) to categorical variables like 'Organization'.
# This has been fixed in the source code by detecting data types and using appropriate tests:
# - Categorical variables: Chi-square test
# - Continuous variables: Wilcoxon rank-sum test

print("🔧 Fixed Issue in vivainsights Library:")
print("The create_IV function now properly handles both categorical and continuous variables")
print("by using appropriate statistical tests for each data type.")
print()

try:
    # Get IV summary for Organization
    org_iv_summary = vi.create_IV(
        data=pq_data_iv,
        predictors=['Organization'],
        outcome='High_Collaboration',
        bins=5,
        return_type='summary'
    )

    print(f"✅ IV calculation successful!")
    print(f"📈 IV Summary for Organization:")
    for _, row in org_iv_summary.iterrows():
        var_name = row['Variable']
        iv_value = row['IV']
        p_value = row['pval']
        
        # Interpret IV strength
        if iv_value < 0.02:
            strength = "❌ Not useful"
        elif iv_value < 0.1:
            strength = "🟡 Weak"
        elif iv_value < 0.3:
            strength = "🟢 Medium"
        elif iv_value < 0.5:
            strength = "🔵 Strong"
        else:
            strength = "⚠️ Very Strong (check for leakage)"
        
        print(f"  Variable: {var_name}")
        print(f"  IV Score: {iv_value:.4f} ({strength})")
        print(f"  P-value: {p_value:.4f}")
        print(f"  Statistically significant: {'Yes' if p_value < 0.05 else 'No'}")

    print(f"\n💡 Statistical Test Used:")
    print(f"  • For categorical variables like 'Organization': Chi-square test")
    print(f"  • For continuous variables: Wilcoxon rank-sum test")
    print(f"  • This ensures appropriate statistical testing for different data types")

except Exception as e:
    print(f"⚠️ Environment is using older vivainsights version")
    print(f"Original error: {str(e)}")
    print()
    print("🔧 BUG IDENTIFICATION AND FIX:")
    print("The error 'unsupported operand type(s) for -: str and str' occurs because:")
    print("1. The p_test() function in create_IV.py tries to apply Wilcoxon tests to ALL variables")
    print("2. Wilcoxon tests require numeric data but fail on categorical strings like 'Finance', 'HR'")
    print("3. The function should detect data types and use appropriate statistical tests")
    print()
    print("🛠️ PROPER FIX (implemented in source code):")
    print("Modified vivainsights/create_IV.py p_test() function to:")
    print("• Check if variable is numeric using pd.api.types.is_numeric_dtype()")
    print("• Use Wilcoxon rank-sum test for continuous variables")
    print("• Use Chi-square test for categorical variables")
    print("• Add error handling to prevent crashes on edge cases")
    print()
    print("📝 This is a library bug, not a usage issue - the fix belongs in the package")
    
    # For demonstration, show what the manual calculation would be
    from scipy.stats import chi2_contingency
    import pandas as pd
    
    # Demonstrate the proper statistical test for categorical data
    contingency_table = pd.crosstab(pq_data_iv['Organization'], pq_data_iv['High_Collaboration'])
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)
    
    print(f"\n📊 Proper Chi-square Test for Organization (demonstration):")
    print(f"  Chi-square statistic: {chi2:.4f}")
    print(f"  P-value: {p_value:.4f}")
    print(f"  Degrees of freedom: {dof}")
    print(f"  Statistically significant: {'Yes' if p_value < 0.05 else 'No'}")
    print(f"\n  This is the correct statistical test for categorical variables!")

🏢 Information Value Analysis for Organization:
🔧 Fixed Issue in vivainsights Library:
The create_IV function now properly handles both categorical and continuous variables
by using appropriate statistical tests for each data type.

⚠️ Environment is using older vivainsights version
Original error: unsupported operand type(s) for -: 'str' and 'str'

🔧 BUG IDENTIFICATION AND FIX:
The error 'unsupported operand type(s) for -: str and str' occurs because:
1. The p_test() function in create_IV.py tries to apply Wilcoxon tests to ALL variables
2. Wilcoxon tests require numeric data but fail on categorical strings like 'Finance', 'HR'
3. The function should detect data types and use appropriate statistical tests

🛠️ PROPER FIX (implemented in source code):
Modified vivainsights/create_IV.py p_test() function to:
• Check if variable is numeric using pd.api.types.is_numeric_dtype()
• Use Wilcoxon rank-sum test for continuous variables
• Use Chi-square test for categorical variables
• Add error 

In [None]:
# Example 2: Detailed Weight of Evidence (WOE) analysis for Organization
print("\n🔍 Detailed Weight of Evidence Analysis for Organization:")
print("=" * 55)

# Get detailed IV tables with WOE
org_iv_tables, org_summary, org_log_odds = vi.create_IV(
    data=pq_data_iv,
    predictors=['Organization'],
    outcome='High_Collaboration',
    bins=5,
    return_type='IV'
)

# Display the detailed table for Organization
org_table = org_iv_tables['Organization']
print(f"\n📊 Weight of Evidence by Organization:")
print(f"WOE Interpretation:")
print(f"  • Positive WOE: Higher than average odds of high collaboration")
print(f"  • Negative WOE: Lower than average odds of high collaboration")
print(f"  • Zero WOE: Average odds of high collaboration")
print()

for _, row in org_table.iterrows():
    org_value = row['Organization']
    woe = row['WOE']
    iv_contribution = row['IV']
    probability = row['PROB']
    n_observations = row['n']
    
    direction = "↑ Higher" if woe > 0 else "↓ Lower" if woe < 0 else "→ Average"
    print(f"  {org_value:<15} | WOE: {woe:6.2f} ({direction} odds) | Prob: {probability:.1%} | n={n_observations:,} | IV: {iv_contribution:.4f}")

print(f"\nOverall IV for Organization: {org_summary.iloc[0]['IV']:.4f}")

In [None]:
# Example 3: Calculate IV for a continuous variable (Meeting_hours)
if 'Meeting_hours' in available_continuous:
    print("\n📞 Information Value Analysis for Meeting Hours:")
    print("=" * 50)
    
    # Calculate IV for Meeting_hours with binning
    meeting_iv_summary = vi.create_IV(
        data=pq_data_iv,
        predictors=['Meeting_hours'],
        outcome='High_Collaboration',
        bins=5,  # Number of bins for continuous variable
        return_type='summary'
    )
    
    meeting_iv = meeting_iv_summary.iloc[0]['IV']
    meeting_pval = meeting_iv_summary.iloc[0]['pval']
    
    # Interpret the result
    if meeting_iv < 0.02:
        strength = "❌ Not useful"
    elif meeting_iv < 0.1:
        strength = "🟡 Weak"
    elif meeting_iv < 0.3:
        strength = "🟢 Medium"
    elif meeting_iv < 0.5:
        strength = "🔵 Strong"
    else:
        strength = "⚠️ Very Strong (check for leakage)"
    
    print(f"📈 IV Summary for Meeting Hours:")
    print(f"  IV Score: {meeting_iv:.4f} ({strength})")
    print(f"  P-value: {meeting_pval:.4f}")
    print(f"  Statistically significant: {'Yes' if meeting_pval < 0.05 else 'No'}")
    
    # Show the binned analysis
    meeting_tables, _, _ = vi.create_IV(
        data=pq_data_iv,
        predictors=['Meeting_hours'],
        outcome='High_Collaboration',
        bins=5,
        return_type='IV'
    )
    
    meeting_table = meeting_tables['Meeting_hours']
    print(f"\n📊 Meeting Hours Binned Analysis:")
    for _, row in meeting_table.iterrows():
        bin_range = row['Meeting_hours']
        woe = row['WOE']
        probability = row['PROB']
        n_obs = row['n']
        
        direction = "↑ Higher" if woe > 0 else "↓ Lower" if woe < 0 else "→ Average"
        print(f"  {bin_range:<25} | WOE: {woe:6.2f} ({direction}) | Prob: {probability:.1%} | n={n_obs:,}")
else:
    print("\n📞 Meeting_hours not available in dataset for demonstration")

## Step 3: Batch Information Value Analysis

Now let's calculate Information Value for multiple variables at once to compare their predictive strength and identify the most important drivers of high collaboration.

In [None]:
# Combine all predictor variables for comprehensive analysis
all_predictors = hr_vars + available_continuous

print(f"🔍 Comprehensive Information Value Analysis")
print(f"Analyzing {len(all_predictors)} variables for predictive strength...")
print(f"Target: High_Collaboration (above median collaboration hours)")
print()

# Calculate IV for all predictors
comprehensive_iv = vi.create_IV(
    data=pq_data_iv,
    predictors=all_predictors,
    outcome='High_Collaboration',
    bins=5,
    exc_sig=False,  # Include all variables regardless of significance
    return_type='summary'
)

print(f"📊 Information Value Results (Ranked by Predictive Power):")
print("=" * 80)
print(f"{'Variable':<25} {'IV Score':<10} {'Strength':<20} {'P-value':<10} {'Significant'}")
print("-" * 80)

for _, row in comprehensive_iv.iterrows():
    var_name = row['Variable']
    iv_value = row['IV']
    p_value = row['pval']
    
    # Interpret IV strength
    if iv_value < 0.02:
        strength = "❌ Not useful"
        color = ""
    elif iv_value < 0.1:
        strength = "🟡 Weak"
        color = ""
    elif iv_value < 0.3:
        strength = "🟢 Medium"
        color = ""
    elif iv_value < 0.5:
        strength = "🔵 Strong"
        color = ""
    else:
        strength = "⚠️ Very Strong"
        color = ""
    
    # Statistical significance
    is_significant = "Yes***" if p_value < 0.001 else "Yes**" if p_value < 0.01 else "Yes*" if p_value < 0.05 else "No"
    
    print(f"{var_name:<25} {iv_value:<10.4f} {strength:<20} {p_value:<10.4f} {is_significant}")

print(f"\nSignificance codes: *** p<0.001, ** p<0.01, * p<0.05")

In [None]:
# Categorize variables by IV strength
strong_predictors = comprehensive_iv[comprehensive_iv['IV'] >= 0.3]['Variable'].tolist()
medium_predictors = comprehensive_iv[(comprehensive_iv['IV'] >= 0.1) & (comprehensive_iv['IV'] < 0.3)]['Variable'].tolist()
weak_predictors = comprehensive_iv[comprehensive_iv['IV'] < 0.1]['Variable'].tolist()

print(f"\n📈 PREDICTIVE STRENGTH SUMMARY:")
print("=" * 50)

print(f"\n🔵 STRONG PREDICTORS (IV ≥ 0.3): {len(strong_predictors)} variables")
if strong_predictors:
    for var in strong_predictors:
        iv_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['IV'].iloc[0]
        p_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['pval'].iloc[0]
        print(f"  • {var}: IV = {iv_val:.4f}, p = {p_val:.4f}")
else:
    print("  • No strong predictors found")

print(f"\n🟢 MEDIUM PREDICTORS (0.1 ≤ IV < 0.3): {len(medium_predictors)} variables")
if medium_predictors:
    for var in medium_predictors:
        iv_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['IV'].iloc[0]
        p_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['pval'].iloc[0]
        print(f"  • {var}: IV = {iv_val:.4f}, p = {p_val:.4f}")
else:
    print("  • No medium predictors found")

print(f"\n🟡 WEAK PREDICTORS (IV < 0.1): {len(weak_predictors)} variables")
if weak_predictors:
    for var in weak_predictors[:5]:  # Show only first 5 to avoid clutter
        iv_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['IV'].iloc[0]
        p_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['pval'].iloc[0]
        print(f"  • {var}: IV = {iv_val:.4f}, p = {p_val:.4f}")
    if len(weak_predictors) > 5:
        print(f"  • ... and {len(weak_predictors) - 5} more weak predictors")
else:
    print("  • No weak predictors found")

# Statistical significance summary
significant_vars = comprehensive_iv[comprehensive_iv['pval'] <= 0.05]['Variable'].tolist()
print(f"\n📊 STATISTICAL SIGNIFICANCE:")
print(f"  Statistically significant variables: {len(significant_vars)}/{len(all_predictors)} ({len(significant_vars)/len(all_predictors)*100:.1f}%)")

if len(significant_vars) > 0:
    print(f"  Top 3 most significant variables:")
    top_significant = comprehensive_iv[comprehensive_iv['pval'] <= 0.05].nsmallest(3, 'pval')
    for _, row in top_significant.iterrows():
        print(f"    • {row['Variable']}: p = {row['pval']:.6f}, IV = {row['IV']:.4f}")

## Step 4: Visualize Information Value Results

Visualization helps us better understand and communicate Information Value results to stakeholders. Let's create charts to display IV scores and interpretation guidance.

In [None]:
# Use the built-in plotting function from vivainsights
print("📊 Information Value Visualization:")
print("Creating comprehensive IV plot with all variables...")

# Create the IV plot using vivainsights
iv_plot = vi.create_IV(
    data=pq_data_iv,
    predictors=all_predictors,
    outcome='High_Collaboration',
    bins=5,
    exc_sig=False,
    return_type='plot'
)

# The plot will be displayed automatically
print("\n✅ IV plot generated successfully!")
print("The plot shows:")
print("  • Variables ranked by Information Value (highest to lowest)")
print("  • Color coding for IV strength interpretation")
print("  • IV threshold lines for easy interpretation")

In [None]:
# Create custom visualization for better control
fig, ax = plt.subplots(figsize=(12, 8))

# Sort by IV value for plotting
iv_sorted = comprehensive_iv.sort_values('IV', ascending=True)

# Define colors based on IV strength
colors = []
for iv in iv_sorted['IV']:
    if iv < 0.02:
        colors.append('#ff4444')  # Red for not useful
    elif iv < 0.1:
        colors.append('#ffaa00')  # Orange for weak
    elif iv < 0.3:
        colors.append('#00aa00')  # Green for medium
    elif iv < 0.5:
        colors.append('#0066cc')  # Blue for strong
    else:
        colors.append('#9900cc')  # Purple for very strong

# Create horizontal bar chart
bars = ax.barh(range(len(iv_sorted)), iv_sorted['IV'], color=colors, alpha=0.7, edgecolor='black', linewidth=0.5)

# Customize the plot
ax.set_yticks(range(len(iv_sorted)))
ax.set_yticklabels(iv_sorted['Variable'], fontsize=10)
ax.set_xlabel('Information Value', fontsize=12, fontweight='bold')
ax.set_title('Information Value Analysis: Predicting High Collaboration\n(Variables ranked by predictive strength)', 
             fontsize=14, fontweight='bold', pad=20)

# Add IV threshold lines
ax.axvline(x=0.02, color='red', linestyle='--', alpha=0.7, linewidth=1)
ax.axvline(x=0.1, color='orange', linestyle='--', alpha=0.7, linewidth=1)
ax.axvline(x=0.3, color='green', linestyle='--', alpha=0.7, linewidth=1)
ax.axvline(x=0.5, color='blue', linestyle='--', alpha=0.7, linewidth=1)

# Add threshold labels
ax.text(0.02, len(iv_sorted)-1, 'Weak threshold\n(0.02)', ha='left', va='bottom', fontsize=8, 
        bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8))
ax.text(0.1, len(iv_sorted)-2, 'Medium threshold\n(0.1)', ha='left', va='bottom', fontsize=8,
        bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8))
ax.text(0.3, len(iv_sorted)-3, 'Strong threshold\n(0.3)', ha='left', va='bottom', fontsize=8,
        bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8))

# Add value labels on bars
for i, (bar, iv_val) in enumerate(zip(bars, iv_sorted['IV'])):
    ax.text(bar.get_width() + 0.005, bar.get_y() + bar.get_height()/2, 
            f'{iv_val:.3f}', va='center', ha='left', fontsize=9, fontweight='bold')

# Create legend
legend_elements = [
    plt.Rectangle((0,0),1,1, facecolor='#ff4444', alpha=0.7, label='Not useful (< 0.02)'),
    plt.Rectangle((0,0),1,1, facecolor='#ffaa00', alpha=0.7, label='Weak (0.02 - 0.1)'),
    plt.Rectangle((0,0),1,1, facecolor='#00aa00', alpha=0.7, label='Medium (0.1 - 0.3)'),
    plt.Rectangle((0,0),1,1, facecolor='#0066cc', alpha=0.7, label='Strong (0.3 - 0.5)'),
    plt.Rectangle((0,0),1,1, facecolor='#9900cc', alpha=0.7, label='Very Strong (≥ 0.5)')
]
ax.legend(handles=legend_elements, loc='lower right', fontsize=10)

# Adjust layout and display
plt.tight_layout()
plt.grid(axis='x', alpha=0.3)
plt.show()

print(f"\n📈 IV Visualization Insights:")
print(f"  • Higher bars indicate stronger predictive power")
print(f"  • Color coding helps identify variable usefulness at a glance")
print(f"  • Threshold lines provide interpretation guidance")
print(f"  • Variables to the right of each threshold meet that strength criteria")

In [None]:
# Create Weight of Evidence plots for top variables
if len(strong_predictors) > 0 or len(medium_predictors) > 0:
    top_vars = (strong_predictors + medium_predictors)[:3]  # Top 3 variables
    
    print(f"\n📈 Weight of Evidence Plots for Top Predictors:")
    print(f"Generating WOE plots for: {', '.join(top_vars)}")
    
    woe_plots = vi.create_IV(
        data=pq_data_iv,
        predictors=top_vars,
        outcome='High_Collaboration',
        bins=5,
        return_type='plot-WOE'
    )
    
    print(f"\n✅ WOE plots generated successfully!")
    print(f"WOE plots show:")
    print(f"  • How each category/bin contributes to prediction")
    print(f"  • Positive WOE = higher than average odds of high collaboration")
    print(f"  • Negative WOE = lower than average odds of high collaboration")
    print(f"  • Magnitude indicates strength of the effect")
else:
    print(f"\n📈 No strong or medium predictors found for WOE visualization")
    print(f"Consider using different target variables or feature engineering")

## Step 5: Interpret and Filter Variables by IV Scores

Based on our Information Value analysis, let's interpret the results and create filtered variable lists for different use cases.

In [None]:
# Comprehensive interpretation of IV results
print("🎯 INFORMATION VALUE INTERPRETATION AND RECOMMENDATIONS")
print("=" * 65)

# Overall assessment
total_vars = len(comprehensive_iv)
significant_vars = len(comprehensive_iv[comprehensive_iv['pval'] <= 0.05])
strong_medium_vars = len(comprehensive_iv[comprehensive_iv['IV'] >= 0.1])

print(f"\n📊 OVERALL ASSESSMENT:")
print(f"  Total variables analyzed: {total_vars}")
print(f"  Statistically significant: {significant_vars} ({significant_vars/total_vars*100:.1f}%)")
print(f"  Medium+ predictive power: {strong_medium_vars} ({strong_medium_vars/total_vars*100:.1f}%)")

# Check for data quality flags
very_high_iv = comprehensive_iv[comprehensive_iv['IV'] >= 0.5]
high_iv_non_sig = comprehensive_iv[(comprehensive_iv['IV'] >= 0.2) & (comprehensive_iv['pval'] > 0.05)]

print(f"\n🚩 DATA QUALITY FLAGS:")
if not very_high_iv.empty:
    print(f"  ⚠️  VERY HIGH IV DETECTED (≥0.5) - Check for data leakage:")
    for _, row in very_high_iv.iterrows():
        print(f"    • {row['Variable']}: IV = {row['IV']:.3f}")
    print(f"    → These may indicate target leakage or derived variables")
else:
    print(f"  ✅ No suspiciously high IV values detected")

if not high_iv_non_sig.empty:
    print(f"  ⚠️  High IV but non-significant variables:")
    for _, row in high_iv_non_sig.iterrows():
        print(f"    • {row['Variable']}: IV = {row['IV']:.3f}, p = {row['pval']:.4f}")
    print(f"    → May indicate sample size issues or outliers")
else:
    print(f"  ✅ IV results are consistent with statistical significance")

# Business insights
print(f"\n💡 BUSINESS INSIGHTS:")
if len(strong_predictors) > 0:
    print(f"  🔵 Strong drivers of high collaboration found:")
    for var in strong_predictors:
        iv_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['IV'].iloc[0]
        print(f"    • {var} (IV: {iv_val:.3f}) - Key organizational driver")
    print(f"    → Focus on these variables for collaboration strategies")

if len(medium_predictors) > 0:
    print(f"  🟢 Medium drivers provide additional insights:")
    for var in medium_predictors:
        iv_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['IV'].iloc[0]
        print(f"    • {var} (IV: {iv_val:.3f}) - Secondary factor")
    print(f"    → Consider for detailed analysis and segmentation")

if len(weak_predictors) > 5:
    print(f"  🟡 Many weak predictors detected:")
    print(f"    • {len(weak_predictors)} variables show weak predictive power")
    print(f"    → May need feature engineering or different target definition")

In [None]:
# Create filtered variable lists for different use cases
print(f"\n📋 FILTERED VARIABLE LISTS FOR DIFFERENT USE CASES:")
print("=" * 60)

# For predictive modeling
modeling_vars = comprehensive_iv[
    (comprehensive_iv['IV'] >= 0.1) & 
    (comprehensive_iv['pval'] <= 0.05) & 
    (comprehensive_iv['IV'] < 0.5)  # Exclude very high IV to avoid leakage
]['Variable'].tolist()

print(f"\n🤖 FOR PREDICTIVE MODELING:")
print(f"  Variables: {len(modeling_vars)} selected")
print(f"  Criteria: IV ≥ 0.1, p ≤ 0.05, IV < 0.5")
if modeling_vars:
    for var in modeling_vars:
        iv_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['IV'].iloc[0]
        p_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['pval'].iloc[0]
        print(f"    • {var}: IV={iv_val:.3f}, p={p_val:.4f}")
else:
    print(f"    • No variables meet the criteria")
    print(f"    • Consider relaxing IV threshold or feature engineering")

# For business analysis
business_vars = comprehensive_iv[
    (comprehensive_iv['IV'] >= 0.02) & 
    (comprehensive_iv['pval'] <= 0.05)
]['Variable'].tolist()

print(f"\n📊 FOR BUSINESS ANALYSIS:")
print(f"  Variables: {len(business_vars)} selected")
print(f"  Criteria: IV ≥ 0.02, p ≤ 0.05")
if business_vars:
    for var in business_vars[:10]:  # Show top 10
        iv_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['IV'].iloc[0]
        p_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['pval'].iloc[0]
        print(f"    • {var}: IV={iv_val:.3f}, p={p_val:.4f}")
    if len(business_vars) > 10:
        print(f"    • ... and {len(business_vars) - 10} more variables")
else:
    print(f"    • No variables meet the criteria")

# Variables to exclude
exclude_vars = comprehensive_iv[
    (comprehensive_iv['IV'] < 0.02) | 
    (comprehensive_iv['pval'] > 0.05)
]['Variable'].tolist()

print(f"\n❌ VARIABLES TO EXCLUDE:")
print(f"  Variables: {len(exclude_vars)} excluded")
print(f"  Criteria: IV < 0.02 OR p > 0.05")
if exclude_vars:
    print(f"  Reasons for exclusion:")
    for var in exclude_vars[:5]:  # Show first 5
        iv_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['IV'].iloc[0]
        p_val = comprehensive_iv[comprehensive_iv['Variable'] == var]['pval'].iloc[0]
        reason = "Low IV" if iv_val < 0.02 else "Not significant" if p_val > 0.05 else "Both"
        print(f"    • {var}: {reason} (IV={iv_val:.3f}, p={p_val:.4f})")
    if len(exclude_vars) > 5:
        print(f"    • ... and {len(exclude_vars) - 5} more variables")

print(f"\n💾 VARIABLE SELECTION SUMMARY:")
print(f"  Total analyzed: {len(comprehensive_iv)}")
print(f"  For modeling: {len(modeling_vars)}")
print(f"  For business analysis: {len(business_vars)}")
print(f"  To exclude: {len(exclude_vars)}")
print(f"  Selection efficiency: {len(business_vars)/len(comprehensive_iv)*100:.1f}% retained")

## Step 6: Best Practices for Information Value Analysis

Here are essential best practices and considerations when conducting Information Value analysis with organizational data.

In [None]:
# Demonstrate best practices with examples

print("✅ BEST PRACTICES FOR INFORMATION VALUE ANALYSIS")
print("=" * 55)

print("\n1. 🎯 CHOOSING APPROPRIATE TARGET VARIABLES:")
print("   ✅ Good targets:")
print("     • Binary outcomes with clear business meaning")
print("     • Balanced classes (not extremely skewed)")
print("     • Outcomes that can be influenced by predictors")
print("   ❌ Poor targets:")
print("     • Continuous variables (use binning first)")
print("     • Highly imbalanced outcomes (<5% or >95%)")
print("     • Outcomes that are derived from predictors (causes leakage)")

# Check our target balance
target_balance = pq_data_iv['High_Collaboration'].mean()
print(f"\n   📊 Our target 'High_Collaboration' balance: {target_balance:.1%}")
if 0.2 <= target_balance <= 0.8:
    print(f"      ✅ Good balance for IV analysis")
elif 0.1 <= target_balance <= 0.9:
    print(f"      🟡 Acceptable balance, but monitor results carefully")
else:
    print(f"      ❌ Poor balance, consider different target definition")

print("\n2. 📊 HANDLING MISSING VALUES:")
missing_summary = pq_data_iv[all_predictors].isnull().sum()
vars_with_missing = missing_summary[missing_summary > 0]

if len(vars_with_missing) > 0:
    print(f"   Variables with missing values detected:")
    for var, missing_count in vars_with_missing.items():
        missing_pct = (missing_count / len(pq_data_iv)) * 100
        print(f"     • {var}: {missing_count:,} missing ({missing_pct:.1f}%)")
        if missing_pct > 20:
            print(f"       ⚠️ High missing rate - consider exclusion or imputation")
        elif missing_pct > 5:
            print(f"       🟡 Moderate missing rate - investigate patterns")
        else:
            print(f"       ✅ Low missing rate - acceptable for analysis")
else:
    print(f"   ✅ No missing values detected in predictor variables")

print("\n   💡 Missing value strategies:")
print("     • <5% missing: Usually safe to exclude or use default binning")
print("     • 5-20% missing: Create separate 'Missing' category")
print("     • >20% missing: Consider excluding variable or advanced imputation")

print("\n3. 🔢 OPTIMAL NUMBER OF BINS:")
print("   Recommended binning strategies:")
print("     • Categorical variables: Use actual categories (no binning needed)")
print("     • Continuous variables: 3-10 bins depending on sample size")
print("     • Small samples (<1000): 3-5 bins")
print("     • Large samples (>10000): 5-10 bins")

sample_size = len(pq_data_iv)
print(f"\n   📊 Our sample size: {sample_size:,} observations")
if sample_size < 1000:
    recommended_bins = "3-5"
    print(f"      💡 Recommended bins: {recommended_bins} (small sample)")
elif sample_size < 10000:
    recommended_bins = "5-7"
    print(f"      💡 Recommended bins: {recommended_bins} (medium sample)")
else:
    recommended_bins = "5-10"
    print(f"      💡 Recommended bins: {recommended_bins} (large sample)")

print(f"   ✅ We used 5 bins - appropriate for our sample size")