# Week 7 Workshop: Bivariate & Multivariate Analysis

## Water Consumption Dataset - Deep Dive into Relationships

**Student Name:** _____________________

**Date:** _____________________

---

### Workshop Objectives

1. Build a complete correlation matrix analysis
2. Test whether relationships change by municipality
3. Generate 5 actionable insights from bivariate analysis

---

## Setup

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visualizations
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully!")

In [None]:
# Load the dataset
# UPDATE THE PATH if your file is in a different location
df = pd.read_csv('HISTORICO_CONSUMO.csv')

# Display basic information
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

In [None]:
# Preview the data
df.head()

In [None]:
# Check data types and missing values
df.info()

---

# Part 1: Complete Correlation Matrix Analysis (45 minutes)

---

## Task 1.1: Calculate and Visualize Correlations

In [None]:
# Select numeric columns for correlation analysis
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric columns ({len(numeric_cols)}): {numeric_cols}")

In [None]:
# TODO: Calculate the correlation matrix
# YOUR CODE HERE
corr_matrix = ___

# Display the correlation matrix (rounded for readability)
corr_matrix.round(3)
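
As an illustration of what a correlation matrix contains, the sketch below runs `DataFrame.corr()` on a small toy frame (hypothetical values, not taken from `HISTORICO_CONSUMO.csv`). It compares the default Pearson method with Spearman's rank correlation, which can differ when a relationship is monotonic but not linear.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the workshop dataset)
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y_linear": [2, 4, 6, 8, 10],     # exactly 2 * x
    "y_inverse": [10, 8, 6, 4, 2],    # exactly 12 - 2 * x
    "y_curved": [1, 8, 27, 64, 125],  # x ** 3: monotonic but not linear
})

pearson = toy.corr()                    # default method is "pearson"
spearman = toy.corr(method="spearman")  # rank-based alternative

print("Pearson:")
print(pearson.round(3))
print("\nSpearman:")
print(spearman.round(3))
```

On this toy data the curved column has a Pearson r below 1 against `x` but a Spearman r of 1, because the ranks line up perfectly even though the points do not fall on a straight line.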

In [None]:
# TODO: Create a professional heatmap visualization
# Requirements:
# - Use RdYlGn or coolwarm color scheme
# - Annotate each cell with its correlation value
# - Add appropriate title
# - Make sure values are readable

plt.figure(figsize=(12, 10))

# YOUR CODE HERE
sns.heatmap(
    ___,
    annot=___,
    cmap=___,
    center=0,
    vmin=-1,
    vmax=1,
    fmt='.2f',
    square=True,
    linewidths=0.5,
    cbar_kws={'shrink': 0.8}
)

plt.title('Correlation Heatmap - Water Consumption Data', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
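
A correlation matrix is symmetric, so half of the heatmap repeats the other half. One optional way to improve readability is to hide the diagonal and upper triangle with a boolean mask. The sketch below builds such a mask on a toy matrix (hypothetical values); the names `toy_corr`, `mask`, and `visible` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy symmetric correlation matrix (hypothetical values)
toy_corr = pd.DataFrame(
    [[1.0, 0.2, -0.9],
     [0.2, 1.0, 0.5],
     [-0.9, 0.5, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# True marks cells to hide: the diagonal and everything above it
mask = np.triu(np.ones(toy_corr.shape, dtype=bool))

# Preview which values remain visible (hidden cells become NaN)
visible = toy_corr.mask(mask)
print(visible)
```

The same array can be passed to `sns.heatmap` through its `mask` parameter, so only the lower triangle is drawn.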

## Task 1.2: Identify Key Correlations

In [None]:
# Function to extract all unique correlations
def get_all_correlations(corr_matrix):
    """
    Extract all unique correlation pairs from correlation matrix.
    Returns a DataFrame sorted by absolute correlation.
    """
    correlations = []
    
    # TODO: Loop through the upper triangle of the correlation matrix
    # and extract variable pairs with their correlations
    # YOUR CODE HERE
    
    for i in range(len(corr_matrix.columns)):
        for j in range(i + 1, len(corr_matrix.columns)):
            var1 = corr_matrix.columns[i]
            var2 = corr_matrix.columns[j]
            r = corr_matrix.iloc[i, j]
            correlations.append({
                'Variable 1': var1,
                'Variable 2': var2,
                'Correlation': r,
                'Abs Correlation': abs(r)
            })
    
    # Create DataFrame and sort
    corr_df = pd.DataFrame(correlations)
    corr_df = corr_df.sort_values('Abs Correlation', ascending=False)
    
    return corr_df

# Get all correlations
all_correlations = get_all_correlations(corr_matrix)
all_correlations.head(10)
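
The nested loop above can also be written without explicit loops. The following is a minimal sketch using `np.triu_indices`, assuming the input is a square correlation matrix; the function name `upper_triangle_pairs` and the toy matrix are hypothetical.

```python
import numpy as np
import pandas as pd

def upper_triangle_pairs(corr_matrix):
    """Return one row per unique variable pair, sorted by |r| (descending)."""
    names = corr_matrix.columns.to_numpy()
    # k=1 skips the diagonal, so self-correlations (r = 1) are excluded
    row_idx, col_idx = np.triu_indices(len(names), k=1)
    pairs = pd.DataFrame({
        "Variable 1": names[row_idx],
        "Variable 2": names[col_idx],
        "Correlation": corr_matrix.to_numpy()[row_idx, col_idx],
    })
    pairs["Abs Correlation"] = pairs["Correlation"].abs()
    return pairs.sort_values("Abs Correlation", ascending=False).reset_index(drop=True)

# Toy symmetric correlation matrix (hypothetical values)
toy_corr = pd.DataFrame(
    [[1.0, 0.2, -0.9],
     [0.2, 1.0, 0.5],
     [-0.9, 0.5, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

pairs = upper_triangle_pairs(toy_corr)
print(pairs)
```

For a matrix with k variables this yields k(k-1)/2 rows, the same pairs the loop version produces.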

In [None]:
# TODO: Add interpretation column
def interpret_correlation(r):
    """
    Interpret correlation value.
    Returns string with direction and strength.
    """
    # YOUR CODE HERE
    # Hint: Check if positive/negative and check magnitude
    
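    # A constant column produces NaN, which has no direction or strength
    if pd.isna(r):
        return "Undefined"

    direction = "Positive" if r > 0 else "Negative"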
    
    abs_r = abs(r)
    if abs_r >= 0.7:
        strength = "Strong"
    elif abs_r >= 0.3:
        strength = "Moderate"
    else:
        strength = "Weak"
    
    return f"{strength} {direction}"

# Apply interpretation
all_correlations['Interpretation'] = all_correlations['Correlation'].apply(interpret_correlation)

# Display top 5 correlations
print("Top 5 Correlations:")
print("=" * 70)
all_correlations[['Variable 1', 'Variable 2', 'Correlation', 'Interpretation']].head(5)

### Top 5 Correlations Summary Table

Fill in this table based on your results:

| Rank | Variable 1 | Variable 2 | r | Interpretation |
|------|------------|------------|---|----------------|
| 1 | | | | |
| 2 | | | | |
| 3 | | | | |
| 4 | | | | |
| 5 | | | | |

## Task 1.3: Scatter Plot Analysis

In [None]:
# Get top 3 correlations for scatter plots
top_3 = all_correlations.head(3)

# TODO: Create scatter plots for top 3 correlations
# Include regression line and title with r value

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i, (_, row) in enumerate(top_3.iterrows()):
    # Access by column label; itertuples() renames columns with spaces to _1, _2, ...
    var1 = row['Variable 1']
    var2 = row['Variable 2']
    r = row['Correlation']
    
    # YOUR CODE HERE: Create scatter plot with regression line
    sns.regplot(
        x=___,
        y=___,
        data=df,
        ax=axes[i],
        scatter_kws={'alpha': 0.3},
        line_kws={'color': 'red', 'linewidth': 2}
    )
    
    axes[i].set_title(f'{var1} vs {var2}\nr = {r:.3f}', fontsize=12, fontweight='bold')
    axes[i].set_xlabel(var1)
    axes[i].set_ylabel(var2)

plt.tight_layout()
plt.show()

### Scatter Plot Observations

For each of the top 3 correlations, document your observations:

**Correlation 1:**
- Linear pattern? Yes/No
- Outliers present? Yes/No
- Notes:

**Correlation 2:**
- Linear pattern? Yes/No
- Outliers present? Yes/No
- Notes:

**Correlation 3:**
- Linear pattern? Yes/No
- Outliers present? Yes/No
- Notes:
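
The outlier question matters because a single extreme point can dominate Pearson's r. The sketch below uses synthetic, independently generated data (not the workshop dataset) to show how one added point changes Pearson's r, and compares it with Spearman's rank correlation, which is usually less affected.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=50))
y = pd.Series(rng.normal(size=50))  # generated independently of x

r_clean = x.corr(y)

# Append one extreme point far away from the rest of the cloud
x_out = pd.concat([x, pd.Series([30.0])], ignore_index=True)
y_out = pd.concat([y, pd.Series([30.0])], ignore_index=True)

r_pearson = x_out.corr(y_out)
r_spearman = x_out.corr(y_out, method="spearman")

print(f"Pearson without outlier:  {r_clean:.3f}")
print(f"Pearson with outlier:     {r_pearson:.3f}")
print(f"Spearman with outlier:    {r_spearman:.3f}")
```

If the two coefficients disagree strongly on real data, that is a hint to inspect the scatter plot for outliers or a non-linear pattern before interpreting r.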

---

# Part 2: Test Relationships by Municipality (45 minutes)

---

## Task 2.1: Select Key Variables

In [None]:
# Check available grouping columns
print("Potential grouping columns:")
for col in ['MUNICIPIO', 'DEPARTAMENTO', 'USO', 'ESTRATO']:
    if col in df.columns:
        print(f"  {col}: {df[col].nunique()} unique values")

In [None]:
# TODO: Choose variables for analysis
# Select two numeric variables from your top correlations
# And one categorical variable for grouping

x_var = ___  # e.g., 'CONSUMO_FACTURADO'
y_var = ___  # e.g., 'NUMERO_SUSCRIPTORES'
group_var = ___  # e.g., 'DEPARTAMENTO' or 'MUNICIPIO'

print(f"Analyzing: {x_var} vs {y_var}")
print(f"Grouped by: {group_var}")

## Task 2.2: Calculate Correlations by Group

In [None]:
# TODO: Calculate correlation for each group
# Create a summary table with group name, sample size, and correlation

def calculate_group_correlations(df, x_var, y_var, group_var, min_samples=10):
    """
    Calculate correlation between x_var and y_var for each group.
    Only includes groups with at least min_samples observations.
    """
    results = []
    
    # YOUR CODE HERE
    for group in df[group_var].dropna().unique():
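        # Keep only rows where both variables are present so n matches the data used for r
        subset = df[df[group_var] == group].dropna(subset=[x_var, y_var])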
        n = len(subset)
        
        if n >= min_samples:
            r = subset[x_var].corr(subset[y_var])
            results.append({
                group_var: group,
                'Sample Size (n)': n,
                'Correlation (r)': r
            })
    
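    # Fix the columns up front so an empty result still sorts without a KeyError
    columns = [group_var, 'Sample Size (n)', 'Correlation (r)']
    return pd.DataFrame(results, columns=columns).sort_values('Correlation (r)', ascending=False)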

# Calculate group correlations
group_corr = calculate_group_correlations(df, x_var, y_var, group_var)
group_corr
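
The same per-group summary can be built with `groupby`. The following is a minimal sketch on toy data (hypothetical column names `region`, `x`, `y`); the function name `group_correlations_groupby` is illustrative. It drops rows with missing values first, so the reported n is the number of complete pairs.

```python
import numpy as np
import pandas as pd

def group_correlations_groupby(data, x_var, y_var, group_var, min_samples=10):
    """Per-group Pearson r via groupby; n counts rows where both values are present."""
    complete = data.dropna(subset=[x_var, y_var, group_var])
    rows = []
    for name, subset in complete.groupby(group_var):
        if len(subset) >= min_samples:
            rows.append({
                group_var: name,
                "Sample Size (n)": len(subset),
                "Correlation (r)": subset[x_var].corr(subset[y_var]),
            })
    columns = [group_var, "Sample Size (n)", "Correlation (r)"]
    result = pd.DataFrame(rows, columns=columns)
    return result.sort_values("Correlation (r)", ascending=False).reset_index(drop=True)

# Toy data: group A rises, group B falls, group C is too small to include
toy = pd.DataFrame({
    "region": ["A"] * 4 + ["B"] * 4 + ["C"] * 2,
    "x": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
    "y": [2, 4, 6, 8, 8, 6, 4, 2, 5, 5],
})

result = group_correlations_groupby(toy, "x", "y", "region", min_samples=3)
print(result)
```

Choosing `min_samples` is a judgment call: correlations computed from very few points are unstable, so small groups are better excluded than over-interpreted.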

In [None]:
# Summary statistics of group correlations
print(f"\nCorrelation Statistics by {group_var}:")
print("=" * 50)
print(f"Number of groups analyzed: {len(group_corr)}")
print(f"Mean correlation: {group_corr['Correlation (r)'].mean():.3f}")
print(f"Std deviation: {group_corr['Correlation (r)'].std():.3f}")
print(f"Min correlation: {group_corr['Correlation (r)'].min():.3f}")
print(f"Max correlation: {group_corr['Correlation (r)'].max():.3f}")

## Task 2.3: Visualize Differences

In [None]:
# Get top groups by sample size for visualization
top_groups = group_corr.nlargest(5, 'Sample Size (n)')[group_var].tolist()
print(f"Top 5 groups by sample size: {top_groups}")

In [None]:
# TODO: Create scatter plot colored by group
# Filter to only include top groups for clarity

df_filtered = df[df[group_var].isin(top_groups)]

plt.figure(figsize=(12, 8))

# YOUR CODE HERE
sns.scatterplot(
    x=___,
    y=___,
    hue=___,
    data=df_filtered,
    alpha=0.6
)

plt.title(f'{x_var} vs {y_var} by {group_var}\n(Top 5 groups by sample size)')
plt.xlabel(x_var)
plt.ylabel(y_var)
plt.legend(title=group_var, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# Bar chart of correlations by group
plt.figure(figsize=(12, 6))

# Sort by correlation value
sorted_corr = group_corr.sort_values('Correlation (r)')

# Color bars based on positive/negative
colors = ['green' if r > 0 else 'red' for r in sorted_corr['Correlation (r)']]

plt.barh(sorted_corr[group_var], sorted_corr['Correlation (r)'], color=colors, alpha=0.7)
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.xlabel('Correlation (r)')
plt.ylabel(group_var)
plt.title(f'Correlation between {x_var} and {y_var} by {group_var}')
plt.tight_layout()
plt.show()

### Does the relationship change significantly by group?

Answer the following questions:

1. **Range of correlations:** What is the range (min to max) of correlations across groups?

   _Your answer:_

2. **Consistency:** Do most groups show the same direction (all positive or all negative)?

   _Your answer:_

3. **Notable differences:** Which group(s) show notably different patterns?

   _Your answer:_
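
One common way to judge whether two group correlations differ by more than sampling noise is Fisher's z-transformation. The sketch below is a rough guide only: it assumes the two groups are independent samples, that each |r| is below 1, and that each n is greater than 3. The function name `compare_correlations` and the example numbers are hypothetical.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two independent Pearson correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher z-transform of each r
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    # Two-sided p-value from the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Hypothetical example: r = 0.80 in one group, r = 0.50 in another, 103 rows each
z_stat, p_value = compare_correlations(0.80, 103, 0.50, 103)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the two groups is unlikely to be due to chance alone; with many groups, keep in mind that running many pairwise tests inflates the number of false positives.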

## Task 2.4: Simpson's Paradox Check

In [None]:
# TODO: Compare overall correlation vs subgroup correlations

# Calculate overall correlation
overall_r = df[x_var].corr(df[y_var])
print(f"Overall correlation (all data): r = {overall_r:.3f}")

# Compare to subgroup correlations
print("\nSubgroup correlations:")
print(group_corr[['Correlation (r)', group_var, 'Sample Size (n)']].to_string())

# Check for Simpson's Paradox
print("\n" + "=" * 50)
print("Simpson's Paradox Check:")
print("=" * 50)

positive_overall = overall_r > 0
# Drop undefined (NaN) correlations so they are not miscounted as negative
valid_r = group_corr['Correlation (r)'].dropna()
n_positive = int((valid_r > 0).sum())
n_negative = int((valid_r < 0).sum())

print(f"\nOverall direction: {'Positive' if positive_overall else 'Negative'}")
print(f"Subgroups with positive correlation: {n_positive}")
print(f"Subgroups with negative correlation: {n_negative}")

### Simpson's Paradox Analysis

Is there evidence of Simpson's Paradox in your data?

Simpson's Paradox occurs when a trend appears in different groups of data but disappears or reverses when these groups are combined.
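
To make the definition concrete, the sketch below builds a small synthetic dataset (hypothetical values) in which the trend is negative inside every group but positive once the groups are pooled.

```python
import numpy as np
import pandas as pd

# Synthetic data: within each group y falls as x rises,
# but group B sits higher on both axes than group A.
toy = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4,
    "x": [1, 2, 3, 4, 6, 7, 8, 9],
    "y": [4, 3, 2, 1, 9, 8, 7, 6],
})

overall_r = toy["x"].corr(toy["y"])
within_r = {name: g["x"].corr(g["y"]) for name, g in toy.groupby("group")}

# A reversal means every subgroup has the opposite sign to the pooled data
reversal = all(np.sign(r) == -np.sign(overall_r) for r in within_r.values())

print(f"Overall r: {overall_r:.3f}")
for name, r in within_r.items():
    print(f"Group {name} r: {r:.3f}")
print(f"Sign reversal in every subgroup: {reversal}")
```

Real data rarely reverses this cleanly. A mix of signs across subgroups, or subgroup correlations much weaker than the overall one, is also worth noting in your analysis.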

**Your analysis:**

_Write your analysis here..._

---

# Part 3: Generate 5 Actionable Insights (30 minutes)

---

## Insight Template

For each insight, document:

1. **Finding:** What did you discover? (Be specific with numbers)
2. **So What?:** Why does this matter?
3. **Now What?:** What action could be taken?
4. **Caution:** Is this correlation or causation? Confounding variables?
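
For the caution step, one way to probe a suspected confounder is a partial correlation: the correlation between two variables after the linear effect of a third has been removed from both. The sketch below uses synthetic data in which a hypothetical confounder `z` drives both `x` and `y`; the function name `partial_correlation` is illustrative, and the approach only accounts for linear effects.

```python
import numpy as np

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z from both."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    design = np.column_stack([np.ones_like(z), z])  # intercept + confounder
    # Residuals of x and y after a least-squares regression on z
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

rng = np.random.default_rng(42)
z = rng.normal(size=500)            # hypothetical confounder
x = z + 0.5 * rng.normal(size=500)  # driven by z plus independent noise
y = z + 0.5 * rng.normal(size=500)  # driven by z plus independent noise

raw_r = np.corrcoef(x, y)[0, 1]
partial_r = partial_correlation(x, y, z)

print(f"Raw correlation between x and y: {raw_r:.3f}")
print(f"Partial correlation given z:     {partial_r:.3f}")
```

Here `x` and `y` have no direct link, yet their raw correlation is high because both follow `z`. A partial correlation near zero is consistent with confounding, although it cannot prove causation on its own.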

### Insight #1: [Title]

---

**Finding:** 

_Describe your finding with specific numbers (e.g., "There is a strong positive correlation (r = 0.85) between...")_

**So What?:** 

_Explain why this matters for the water utility company_

**Now What?:** 

_Suggest a specific action based on this finding_

**Caution:** 

_Identify potential confounding variables or limitations_

---

### Insight #2: [Title]

---

**Finding:** 


**So What?:** 


**Now What?:** 


**Caution:** 


---

### Insight #3: [Title]

---

**Finding:** 


**So What?:** 


**Now What?:** 


**Caution:** 


---

### Insight #4: [Title]

---

**Finding:** 


**So What?:** 


**Now What?:** 


**Caution:** 


---

### Insight #5: [Title]

---

**Finding:** 


**So What?:** 


**Now What?:** 


**Caution:** 


---


# Summary

---

## Key Findings

Summarize your most important discoveries:

1. **Strongest correlations found:**
   - 
   - 
   - 

2. **How relationships vary by group:**
   - 

3. **Most actionable insight:**
   - 

## Reflection

Answer these questions:

1. **What surprised you most in this analysis?**

   _Your answer:_

2. **Which correlation is most likely to be mistaken for causation?**

   _Your answer:_

3. **What additional data would help confirm your findings?**

   _Your answer:_

---

## Checklist Before Submission

- [ ] All code cells executed without errors
- [ ] Correlation heatmap created and labeled
- [ ] Top 5 correlations identified and interpreted
- [ ] Scatter plots for top 3 correlations
- [ ] Group correlation analysis completed
- [ ] Simpson's Paradox check performed
- [ ] 5 actionable insights documented
- [ ] All markdown cells completed

---

*End of Workshop*