# Week 7 Exercise: Bivariate & Multivariate Analysis

## Water Consumption Dataset - Finding Relationships

**Time:** 30 minutes

**Objective:** Apply bivariate analysis techniques to discover relationships between variables in the water consumption dataset.

### What You Will Do:
1. Calculate correlations between numeric variables
2. Create a scatter plot for the strongest correlation
3. Compare consumption across categories using boxplots
4. Create a correlation heatmap
5. Find and explain the top 3 correlations

---

## Setup: Load Libraries and Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Load the water consumption dataset
# NOTE: Update the path if your file is in a different location
df = pd.read_csv('HISTORICO_CONSUMO.csv')

# Quick overview
print(f"Dataset shape: {df.shape}")
print(f"\nColumn names:\n{df.columns.tolist()}")

In [None]:
# Preview the data
df.head()

In [None]:
# Check data types
df.dtypes

---

## Task 1: Calculate Correlations Between Numeric Variables (5 minutes)

Create a correlation matrix for all numeric columns in the dataset.

**Instructions:**
1. Select only numeric columns
2. Calculate the correlation matrix using `.corr()`
3. Display the matrix
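
As a minimal sketch of what `.corr()` returns, the toy frame below uses hypothetical columns `a`, `b`, and `c` that are invented for illustration and do not come from the water dataset. One pair is perfectly linear and one is perfectly inverse. By default pandas computes Pearson's r; `method='spearman'` is a rank-based alternative.

```python
import pandas as pd

# Toy data, invented for illustration only (not from HISTORICO_CONSUMO.csv)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # b = 2 * a, so a and b rise together
    "c": [5, 4, 3, 2, 1],    # c falls by 1 whenever a rises by 1
})

# Pearson correlation is the default method
toy_corr = toy.corr()
print(toy_corr.round(2))
```

The result is a square, symmetric table with 1.0 on the diagonal, because every variable correlates perfectly with itself.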

In [None]:
# Task 1: Calculate correlation matrix

# Step 1: Select numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric columns: {numeric_cols}")

# Step 2: Calculate correlation matrix
# YOUR CODE HERE
corr_matrix = ___

# Step 3: Display the correlation matrix
corr_matrix

In [None]:
# Round to 2 decimal places for readability
corr_matrix.round(2)

---

## Task 2: Create a Scatter Plot for the Strongest Correlation (5 minutes)

Find the pair of variables with the strongest correlation (excluding the diagonal) and create a scatter plot.

**Instructions:**
1. Identify the strongest correlation (largest absolute value off the diagonal)
2. Create a scatter plot with a regression line
3. Add proper labels and title
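
One possible way to locate the strongest pair, sketched on a hypothetical 3x3 correlation matrix (the values and the names `a`, `b`, `c` are invented). It keeps only the cells above the diagonal so that each pair appears once, ranks them by absolute value, and then reports the signed r.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix, invented for illustration
toy_corr = pd.DataFrame(
    [[1.0, 0.2, -0.9],
     [0.2, 1.0, 0.4],
     [-0.9, 0.4, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# True only strictly above the diagonal, so each pair is counted once
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)

# Series indexed by (row, column); cells outside the upper triangle are NaN or dropped
pairs = toy_corr.where(upper).stack()

# Rank by absolute value (idxmax skips NaN), but keep the sign for reporting
best_pair = pairs.abs().idxmax()
best_r = pairs.loc[best_pair]
print(best_pair, best_r)
```

Ranking on the absolute value matters here: the strongest relationship in this toy matrix is negative, and a plain `max()` would miss it.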

In [None]:
# Task 2: Find strongest correlation

# Mask the diagonal (each variable correlates perfectly with itself).
# DataFrame.mask returns a new object, so corr_matrix is left untouched.
corr_no_diag = corr_matrix.mask(np.eye(len(corr_matrix), dtype=bool))

# Find the maximum absolute correlation
# YOUR CODE HERE: Identify which pair has the strongest correlation
# Hint: Use .abs().max() and .abs().idxmax()

# Flatten to a Series indexed by (row, column); idxmax() skips NaN cells
abs_pairs = corr_no_diag.abs().stack()
var1, var2 = abs_pairs.idxmax()

# Keep the signed r for reporting; use abs(max_corr) to judge strength
max_corr = corr_matrix.loc[var1, var2]
print(f"Strongest correlation value: {max_corr:.3f}")
print(f"Variables: {var1} and {var2}")

In [None]:
# Create scatter plot with regression line
plt.figure(figsize=(10, 6))

# YOUR CODE HERE: Create scatter plot using seaborn's regplot
# Hint: sns.regplot(x=var1, y=var2, data=df)
sns.regplot(x=___, y=___, data=df, scatter_kws={'alpha': 0.5})

plt.xlabel(var1)
plt.ylabel(var2)
plt.title(f'Scatter Plot: {var1} vs {var2}\n(r = {max_corr:.3f})')
plt.tight_layout()
plt.show()

---

## Task 3: Compare Consumption Across Categories (5 minutes)

Create a boxplot to compare water consumption across different categories.

**Instructions:**
1. Choose a categorical variable (e.g., ESTRATO, USO, DEPARTAMENTO)
2. Create a boxplot showing consumption distribution by category
3. Interpret the results
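
A column such as ESTRATO may load as integers, in which case a dtype-based search for categorical columns will not list it. The sketch below uses invented data with hypothetical `stratum` and `consumption` columns to show one way to convert such a column and summarize each group. The median is the line a boxplot draws inside each box.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "stratum": [1, 1, 2, 2, 3, 3],
    "consumption": [10.0, 12.0, 15.0, 17.0, 30.0, 50.0],
})

# Treat the integer codes as labels rather than quantities
toy["stratum"] = toy["stratum"].astype("category")

# Per-group summary: the numbers behind each box in a boxplot
summary = (
    toy.groupby("stratum", observed=True)["consumption"]
    .agg(["median", "min", "max"])
)
print(summary)
```

Comparing medians rather than means keeps a few very large consumers from dominating the comparison.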

In [None]:
# Task 3: Boxplot comparing consumption by category

# First, check available categorical columns
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print(f"Categorical columns: {cat_cols}")

# Also check columns that might be categorical but stored as numeric
# (like ESTRATO which has values 1-6)
print(f"\nUnique values in potential categorical columns:")
for col in ['ESTRATO', 'USO']:
    if col in df.columns:
        print(f"  {col}: {df[col].nunique()} unique values")

In [None]:
# Create boxplot
plt.figure(figsize=(12, 6))

# YOUR CODE HERE: Create boxplot
# Choose your categorical variable and the consumption column
# Example: sns.boxplot(x='ESTRATO', y='CONSUMO_FACTURADO', data=df)

category_col = ___  # Choose: 'ESTRATO', 'USO', or another categorical column
consumption_col = 'CONSUMO_FACTURADO'  # Or your consumption column name

sns.boxplot(x=category_col, y=consumption_col, data=df)

plt.xlabel(category_col)
plt.ylabel('Water Consumption (m3)')
plt.title(f'Water Consumption by {category_col}')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Summarize consumption by category (mean, median, and row count)
consumption_by_cat = df.groupby(category_col)[consumption_col].agg(['mean', 'median', 'count'])
consumption_by_cat.sort_values('mean', ascending=False)

**Your Interpretation:** (Write 1-2 sentences about what you observe)

_Write your interpretation here..._

---

## Task 4: Create a Correlation Heatmap (5 minutes)

Visualize all correlations at once using a heatmap.

**Instructions:**
1. Use seaborn's heatmap function
2. Add annotations showing correlation values
3. Use an appropriate color scheme
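
Because a correlation matrix is symmetric, half of the heatmap repeats the other half. One optional refinement, sketched here on a hypothetical matrix with invented values, is to build a boolean mask that hides the diagonal and the upper triangle. The preview uses pandas only; the same array can be passed to seaborn through the `mask` parameter of `sns.heatmap`.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix, invented for illustration
toy_corr = pd.DataFrame(
    [[1.0, 0.2, -0.9],
     [0.2, 1.0, 0.4],
     [-0.9, 0.4, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# True marks cells to hide: the diagonal and everything above it
hide = np.triu(np.ones(toy_corr.shape, dtype=bool))

# Preview of the cells that would remain visible in the heatmap
visible = toy_corr.mask(hide)
print(visible)
```

With three variables this leaves three cells, one per unique pair.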

In [None]:
# Task 4: Create correlation heatmap

plt.figure(figsize=(12, 10))

# YOUR CODE HERE: Create heatmap
# Hint: sns.heatmap(corr_matrix, annot=True, cmap='RdYlGn', center=0)

sns.heatmap(
    corr_matrix,
    annot=___,          # Show numbers in cells (True/False)
    cmap=___,           # Color scheme: 'RdYlGn', 'coolwarm', 'RdBu'
    center=0,           # Center the colormap at 0
    vmin=-1, vmax=1,    # Set scale limits
    fmt='.2f',          # Format: 2 decimal places
    square=True,        # Make cells square
    linewidths=0.5      # Add lines between cells
)

plt.title('Correlation Heatmap - Water Consumption Data', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

---

## Task 5: Find Top 3 Correlations and Explain Each (10 minutes)

Identify the three strongest correlations (excluding diagonal) and provide explanations.

**For each correlation, answer:**
1. What is the r value?
2. Is it positive or negative?
3. Does it make logical sense?
4. Could there be a confounding variable?
5. What is the business implication?
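
To make the confounding question concrete, here is a small simulation on synthetic data (not the water dataset). A hidden variable `z` drives both `x` and `y`, which never influence each other. The sketch compares the raw correlation with the correlation that remains after a straight-line fit on `z` is removed from both variables. The helper `residuals` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: z drives both x and y; x and y never affect each other
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)

raw_r = np.corrcoef(x, y)[0, 1]


def residuals(target, control):
    """Return the part of `target` not explained by a linear fit on `control`."""
    slope, intercept = np.polyfit(control, target, 1)
    return target - (slope * control + intercept)


# Partial correlation: correlate what is left once z is removed from both
partial_r = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(f"raw r = {raw_r:.2f}, r after controlling for z = {partial_r:.2f}")
```

In a real dataset the confounder is often unmeasured, so this check is possible only when a plausible candidate column is available.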

In [None]:
# Task 5: Find top 3 correlations

# Get all correlations as a list (excluding duplicates and diagonal)
def get_top_correlations(corr_matrix, n=3):
    """
    Extract the top n correlations by absolute value.
    Returns list of tuples: (var1, var2, correlation)
    """
    correlations = []
    
    for i in range(len(corr_matrix.columns)):
        for j in range(i + 1, len(corr_matrix.columns)):  # Upper triangle only
            var1 = corr_matrix.columns[i]
            var2 = corr_matrix.columns[j]
            corr_value = corr_matrix.iloc[i, j]
            # Skip undefined correlations (e.g. a constant column yields NaN),
            # which would otherwise make the sort below unreliable
            if pd.notna(corr_value):
                correlations.append((var1, var2, corr_value))
    
    # Sort by absolute correlation value, strongest first
    correlations.sort(key=lambda x: abs(x[2]), reverse=True)
    
    return correlations[:n]

# Get top 3
top_3 = get_top_correlations(corr_matrix, n=3)

print("=" * 60)
print("TOP 3 CORRELATIONS")
print("=" * 60)

for i, (var1, var2, r) in enumerate(top_3, 1):
    direction = "Positive" if r > 0 else "Negative"
    strength = "Strong" if abs(r) > 0.7 else "Moderate" if abs(r) > 0.3 else "Weak"
    
    print(f"\n{i}. {var1} vs {var2}")
    print(f"   r = {r:.3f}")
    print(f"   Direction: {direction}")
    print(f"   Strength: {strength}")

In [None]:
# Create scatter plots for top 3 correlations
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for i, (var1, var2, r) in enumerate(top_3):
    sns.regplot(
        x=var1, y=var2, data=df,
        ax=axes[i],
        scatter_kws={'alpha': 0.3},
        line_kws={'color': 'red'}
    )
    axes[i].set_title(f'{var1} vs {var2}\nr = {r:.3f}')
    axes[i].set_xlabel(var1)
    axes[i].set_ylabel(var2)

plt.tight_layout()
plt.show()

### Your Explanations for Top 3 Correlations

**IMPORTANT: Remember that correlation does NOT equal causation!**

---

#### Correlation 1: [Variable 1] vs [Variable 2]

- **r value:** 
- **Direction:** Positive / Negative
- **Does it make logical sense?** 
- **Possible confounding variable?** 
- **Business implication:** 

---

#### Correlation 2: [Variable 1] vs [Variable 2]

- **r value:** 
- **Direction:** Positive / Negative
- **Does it make logical sense?** 
- **Possible confounding variable?** 
- **Business implication:** 

---

#### Correlation 3: [Variable 1] vs [Variable 2]

- **r value:** 
- **Direction:** Positive / Negative
- **Does it make logical sense?** 
- **Possible confounding variable?** 
- **Business implication:** 

---

## Bonus: Multivariate Analysis (If Time Permits)

Check if the relationship between two variables changes across categories.
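
A tiny invented example shows why this check matters. In the toy data below (hypothetical columns `group`, `x`, `y`), the pooled correlation is positive even though the relationship inside each group is perfectly negative.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4,
    "x": [1, 2, 3, 4, 6, 7, 8, 9],
    "y": [4, 3, 2, 1, 9, 8, 7, 6],
})

# Correlation when the groups are ignored
pooled_r = toy["x"].corr(toy["y"])

# Correlation computed separately inside each group
by_group = {
    name: part["x"].corr(part["y"])
    for name, part in toy.groupby("group")
}

print(f"pooled r = {pooled_r:.2f}")
for name, r in by_group.items():
    print(f"group {name}: r = {r:.2f}")
```

Group B sits higher on both axes than group A, which is what produces the positive pooled value.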

In [None]:
# Bonus: Does correlation change by category?

# Choose your variables
x_var = 'NUMERO_SUSCRIPTORES'  # Replace with your variable
y_var = 'CONSUMO_FACTURADO'    # Replace with your variable
group_var = 'ESTRATO'          # Replace with your categorical variable

# Calculate correlation for each group
print(f"Correlation between {x_var} and {y_var} by {group_var}:")
print("=" * 50)

for group in sorted(df[group_var].dropna().unique()):
    subset = df[df[group_var] == group]
    if len(subset) > 10:  # Need enough data points
        r = subset[x_var].corr(subset[y_var])
        print(f"{group_var} = {group}: r = {r:.3f} (n = {len(subset)})")

In [None]:
# Visualize with color by category
plt.figure(figsize=(10, 6))

sns.scatterplot(
    x=x_var, y=y_var,
    hue=group_var,
    data=df,
    alpha=0.6
)

plt.title(f'{x_var} vs {y_var} by {group_var}')
plt.xlabel(x_var)
plt.ylabel(y_var)
plt.legend(title=group_var, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

---

## Summary

### Key Takeaways from This Exercise:

1. **Correlation (Pearson's r) measures the strength and direction of a linear relationship**, on a scale from -1 to +1
2. **Positive correlation:** Variables move in the same direction
3. **Negative correlation:** Variables move in opposite directions
4. **Correlation does NOT equal causation** - always look for confounding variables!
5. **Relationships can differ by group** - always check multivariate patterns
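
One caveat worth checking yourself: Pearson's r only captures straight-line association. In the invented example below, `y` is fully determined by `x`, yet the correlation comes out at essentially zero because the relationship is a symmetric curve.

```python
import pandas as pd

# Toy data, invented for illustration: y depends entirely on x, but not linearly
toy = pd.DataFrame({"x": [-3, -2, -1, 0, 1, 2, 3]})
toy["y"] = toy["x"] ** 2

r = toy["x"].corr(toy["y"])
print(f"r = {r:.2f}")
```

A scatter plot reveals this kind of pattern immediately, which is a good reason to plot a pair before trusting its r value.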

### Questions to Think About:

- Which correlations were surprising?
- Which correlations make business sense?
- What hidden variables might explain some correlations?

---

*End of Exercise*