# Week 4 Lab: Exploratory Data Analysis

**Estimated Time:** 30-60 minutes  
**Objective:** Perform comprehensive exploratory data analysis on Philippine demographic data.

In this lab, you will:
- Calculate and interpret descriptive statistics
- Analyze data distributions
- Discover correlations between variables
- Identify outliers and patterns

---

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set styling
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)

print("✓ Libraries imported successfully!")

---
## Part 1: Descriptive Statistics

**Background:** Let's analyze Philippine regional demographic data.

### Exercise 1.1: Load and Explore Data

In [None]:
# Philippine regional data
data = {
    'Region': ['NCR', 'CAR', 'Region I', 'Region II', 'Region III', 'Region IV-A', 'Region IV-B', 
               'Region V', 'Region VI', 'Region VII', 'Region VIII', 'Region IX', 'Region X',
               'Region XI', 'Region XII', 'BARMM'],
    'Population': [13484462, 1797660, 5301139, 3685744, 12422172, 16195042, 3228558, 
                   6080937, 7954723, 8081988, 4547150, 3875576, 5022768, 5602882, 4962605, 4404000],
    'Poverty_Rate': [4.9, 24.6, 15.3, 11.2, 7.8, 6.5, 14.2, 20.4, 18.3, 23.8, 33.4, 28.1, 32.4, 18.8, 35.4, 44.0],
    'Literacy_Rate': [99.2, 94.3, 97.8, 96.5, 98.7, 98.9, 97.1, 96.2, 96.8, 96.4, 94.9, 93.8, 95.2, 96.7, 93.5, 89.2],
    'Unemployment_Rate': [7.8, 5.2, 4.3, 4.8, 5.6, 6.2, 4.9, 5.8, 6.4, 7.2, 6.1, 7.8, 7.5, 6.9, 7.1, 9.2]
}

df = pd.DataFrame(data)
print("Dataset Overview:")
print(df.head(10))

# TODO: Display basic information about the dataset
# Your code here (use .info())

# TODO: Display shape of the dataset
print(f"\nDataset shape: {None}")  # Your code here
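
A minimal sketch of the two inspection calls, run on a small made-up table rather than the lab's regional data. The `toy` DataFrame and its column values are assumptions for illustration only.

```python
import pandas as pd

# Toy data, made up for illustration only (not the lab's regional dataset)
toy = pd.DataFrame({
    "City": ["A", "B", "C", "D"],
    "Population": [120_000, 85_000, 230_000, 64_000],
    "Literacy_Rate": [97.1, 95.4, 98.2, 93.0],
})

# .info() prints column names, non-null counts, and dtypes (it returns None)
toy.info()

# .shape is an attribute, not a method: a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"\nToy dataset shape: {toy.shape}")
```

Because `.info()` prints its report directly, there is no need to wrap it in `print()`.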

### Exercise 1.2: Calculate Descriptive Statistics

In [None]:
# TODO: Display summary statistics for all numerical columns
# Your code here (use .describe())

# TODO: Calculate specific statistics for Population
pop_mean = None      # Your code here
pop_median = None    # Your code here
pop_std = None       # Your code here
pop_min = None       # Your code here
pop_max = None       # Your code here

print("\nPopulation Statistics:")
print(f"Mean: {pop_mean:,.0f}")
print(f"Median: {pop_median:,.0f}")
print(f"Std Dev: {pop_std:,.0f}")
print(f"Range: {pop_min:,.0f} - {pop_max:,.0f}")

# TODO: Calculate the range (max - min) for Poverty_Rate
poverty_range = None  # Your code here
print(f"\nPoverty Rate Range: {poverty_range:.1f}%")
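
The statistics above can be sketched on a small made-up Series. The `scores` values are hypothetical and chosen so that one large value pulls the mean away from the median.

```python
import pandas as pd

# Made-up values for illustration only
scores = pd.Series([10.0, 12.0, 14.0, 16.0, 48.0], name="Score")

score_mean = scores.mean()      # arithmetic mean
score_median = scores.median()  # middle value, less affected by the 48.0
score_std = scores.std()        # sample standard deviation (ddof=1 by default)
score_range = scores.max() - scores.min()

# .describe() reports count, mean, std, min, quartiles, and max in one call
print(scores.describe())
print(f"Mean: {score_mean:.1f}, Median: {score_median:.1f}")
print(f"Std Dev: {score_std:.2f}, Range: {score_range:.1f}")
```

Here the mean (20.0) sits well above the median (14.0), which is a quick sign of right skew. Note that pandas `.std()` divides by n - 1, whereas NumPy's `np.std()` divides by n unless `ddof=1` is passed.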

---
## Part 2: Distribution Analysis

### Exercise 2.1: Visualize Distributions

In [None]:
# TODO: Create a histogram of Poverty_Rate
plt.figure(figsize=(10, 6))
# Your code here (use plt.hist() or df['Poverty_Rate'].hist())
plt.xlabel('Poverty Rate (%)')
plt.ylabel('Frequency')
plt.title('Distribution of Poverty Rates Across Philippine Regions')
plt.show()

# TODO: Create a box plot of Literacy_Rate
plt.figure(figsize=(10, 6))
# Your code here (use plt.boxplot() or sns.boxplot())
plt.ylabel('Literacy Rate (%)')
plt.title('Distribution of Literacy Rates Across Philippine Regions')
plt.show()
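
A possible way to draw both plot types, shown side by side on made-up values. The `rates` Series and the choice of 5 bins are assumptions for illustration; the lab's own figures use separate `plt.figure()` calls.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only
rates = pd.Series([5, 7, 8, 11, 14, 15, 18, 19, 20, 24, 28, 33], name="Rate")

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(12, 5))

# Histogram: counts how many values fall into each of 5 equal-width bins
counts, bin_edges, _ = ax_hist.hist(rates, bins=5, edgecolor="black")
ax_hist.set_xlabel("Rate (%)")
ax_hist.set_ylabel("Frequency")
ax_hist.set_title("Histogram")

# Box plot: the box spans Q1 to Q3 and the line inside marks the median
ax_box.boxplot(rates)
ax_box.set_ylabel("Rate (%)")
ax_box.set_title("Box plot")

plt.tight_layout()
plt.show()
```

With only a handful of observations, the histogram's shape depends heavily on the number of bins, so it is worth trying a few values before interpreting it.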

### Exercise 2.2: Identify Outliers

In [None]:
# TODO: Calculate Q1, Q3, and IQR for Population
Q1 = None  # Your code here (use .quantile(0.25))
Q3 = None  # Your code here (use .quantile(0.75))
IQR = None  # Your code here (Q3 - Q1)

print("Population IQR Analysis:")
print(f"Q1 (25th percentile): {Q1:,.0f}")
print(f"Q3 (75th percentile): {Q3:,.0f}")
print(f"IQR: {IQR:,.0f}")

# TODO: Calculate outlier boundaries
lower_bound = None  # Your code here (Q1 - 1.5 * IQR)
upper_bound = None  # Your code here (Q3 + 1.5 * IQR)

print("\nOutlier Boundaries:")
print(f"Lower: {lower_bound:,.0f}")
print(f"Upper: {upper_bound:,.0f}")

# TODO: Identify outliers
outliers = None  # Your code here (filter df where Population < lower_bound or > upper_bound)

print("\nRegions with outlier populations:")
print(outliers[['Region', 'Population']])
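
The IQR rule can be sketched as follows on a small made-up table. The `toy` DataFrame is hypothetical, and the quartiles use pandas' default linear interpolation.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Item": list("ABCDEFGH"),
    "Value": [10, 12, 13, 15, 16, 18, 20, 60],
})

q1 = toy["Value"].quantile(0.25)
q3 = toy["Value"].quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Each comparison needs its own parentheses because | binds more tightly
mask = (toy["Value"] < lower) | (toy["Value"] > upper)
flagged = toy[mask]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower}, {upper}]")
print(flagged)
```

In this toy table only item H (60) lies above the upper bound of 27.125. A negative lower bound is common for strictly positive data and simply means no low outliers are possible.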

---
## Part 3: Correlation Analysis

### Exercise 3.1: Calculate Correlations

In [None]:
# TODO: Calculate correlation matrix for numerical columns
correlation_matrix = None  # Your code here (use .corr(numeric_only=True); 'Region' is text and makes plain .corr() fail in pandas 2.x)

print("Correlation Matrix:")
print(correlation_matrix)

# TODO: Find correlation between Poverty_Rate and Literacy_Rate
poverty_literacy_corr = None  # Your code here

print(f"\nCorrelation between Poverty Rate and Literacy Rate: {poverty_literacy_corr:.3f}")
if poverty_literacy_corr < -0.5:
    print("Strong negative correlation: Higher literacy associated with lower poverty")
elif poverty_literacy_corr > 0.5:
    print("Strong positive correlation")
else:
    print("Weak or moderate correlation")
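
A minimal sketch of computing a correlation matrix and reading one coefficient from it, on made-up data. The `toy` DataFrame is hypothetical; `select_dtypes` is used here as one way to drop text columns before calling `.corr()`.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Student": ["s1", "s2", "s3", "s4", "s5"],
    "Hours_Studied": [1, 2, 3, 4, 5],
    "Exam_Score": [52, 58, 65, 70, 78],
    "Absences": [9, 7, 6, 4, 1],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()
print(corr.round(3))

# Read a single coefficient from the matrix by row and column label
r = corr.loc["Hours_Studied", "Absences"]

# The same number, computed directly between two Series
r_direct = toy["Hours_Studied"].corr(toy["Absences"])
print(f"\nr = {r:.3f}")
```

The matrix is symmetric with ones on the diagonal, so `corr.loc[a, b]` and `corr.loc[b, a]` give the same value.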

### Exercise 3.2: Visualize Correlations

In [None]:
# TODO: Create a heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
# Your code here (use sns.heatmap())
# Hint: add parameters annot=True, cmap='coolwarm', center=0
plt.title('Correlation Heatmap: Philippine Regional Indicators')
plt.show()

# TODO: Create scatter plot: Poverty Rate vs Literacy Rate
plt.figure(figsize=(10, 6))
# Your code here (use plt.scatter() or sns.scatterplot())
plt.xlabel('Literacy Rate (%)')
plt.ylabel('Poverty Rate (%)')
plt.title('Relationship between Literacy and Poverty Rates')
plt.show()
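
One way the heatmap and scatter plot might look, using made-up data. The `toy` DataFrame is hypothetical; `fmt`, `vmin`, and `vmax` are optional extras beyond the hint above, added so the color scale always covers the full -1 to 1 range.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for illustration only
toy = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5],
    "Exam_Score": [52, 58, 65, 70, 78],
    "Absences": [9, 7, 6, 4, 1],
})
corr = toy.corr()

# Heatmap: annot=True writes each coefficient in its cell,
# center=0 puts the neutral color at zero correlation
plt.figure(figsize=(6, 5))
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                 center=0, vmin=-1, vmax=1)
ax.set_title("Toy correlation heatmap")
plt.show()

# Scatter plot of one pair of variables
plt.figure(figsize=(6, 4))
plt.scatter(toy["Hours_Studied"], toy["Exam_Score"])
plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Toy scatter plot")
plt.show()
```

The scatter plot is a useful check on the heatmap: a single coefficient can hide curvature or a few influential points that are obvious once plotted.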

---
## Part 4: Comparative Analysis

### Exercise 4.1: Group Comparisons

In [None]:
# TODO: Create a category column based on Population
# Large: > 10M, Medium: 5M to 10M (inclusive), Small: < 5M
def categorize_region(pop):
    # Your code here
    pass

df['Size_Category'] = None  # Your code here (use .apply())

print("Region Size Categories:")
print(df[['Region', 'Population', 'Size_Category']])

# TODO: Calculate average poverty rate by size category
avg_poverty_by_size = None  # Your code here (use .groupby())

print("\nAverage Poverty Rate by Region Size:")
print(avg_poverty_by_size)
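
The apply-then-group pattern can be sketched on a made-up table. The `toy` DataFrame, the `size_label` function, and its thresholds (20 and 10 staff) are all hypothetical and unrelated to the lab's population cutoffs.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Shop": ["A", "B", "C", "D", "E", "F"],
    "Staff": [3, 12, 25, 8, 40, 15],
    "Rating": [4.0, 3.5, 4.5, 3.0, 5.0, 4.5],
})

def size_label(staff):
    """Map a staff count to a size label (hypothetical thresholds)."""
    if staff > 20:
        return "Large"
    elif staff >= 10:
        return "Medium"
    return "Small"

# .apply() calls the function once per value and returns a new Series
toy["Size"] = toy["Staff"].apply(size_label)

# Group rows by label, then average one column within each group
avg_rating = toy.groupby("Size")["Rating"].mean()
print(avg_rating)
```

Checking the conditions from largest to smallest keeps each branch simple, since every later branch can assume the earlier ones failed.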

### Exercise 4.2: Top/Bottom Analysis

In [None]:
# TODO: Find top 5 regions by population
top_5_population = None  # Your code here (use .nlargest())

print("Top 5 Most Populated Regions:")
print(top_5_population[['Region', 'Population']])

# TODO: Find bottom 3 regions by literacy rate
bottom_3_literacy = None  # Your code here (use .nsmallest())

print("\nBottom 3 Regions by Literacy Rate:")
print(bottom_3_literacy[['Region', 'Literacy_Rate']])

# TODO: Find regions with poverty rate > 30%
high_poverty = None  # Your code here (filter df)

print("\nRegions with High Poverty (>30%):")
print(high_poverty[['Region', 'Poverty_Rate']])
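
A short sketch of the three selection techniques on a made-up table. The `toy` DataFrame and the 4.0 rating threshold are assumptions for illustration only.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Shop": ["A", "B", "C", "D", "E", "F"],
    "Staff": [3, 12, 25, 8, 40, 15],
    "Rating": [4.0, 3.5, 4.5, 3.0, 5.0, 4.5],
})

# nlargest / nsmallest take (n, column) and return whole rows, already sorted
top_2_staff = toy.nlargest(2, "Staff")
bottom_2_rating = toy.nsmallest(2, "Rating")

# Boolean filter: keep rows where the condition is True
high_rating = toy[toy["Rating"] > 4.0]

print(top_2_staff[["Shop", "Staff"]])
print(bottom_2_rating[["Shop", "Rating"]])
print(high_rating[["Shop", "Rating"]])
```

Unlike `nlargest`, a boolean filter keeps the rows in their original order, so add `.sort_values()` if a ranked listing is wanted.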

---
## Reflection Questions

1. **Distribution Insights:** What does the distribution of poverty rates tell us about regional inequality in the Philippines?

2. **Correlation Interpretation:** You found a correlation between literacy rate and poverty rate. Does this mean one causes the other? Why or why not?

3. **Outlier Significance:** Are outliers always bad? How might very populous regions such as NCR and Region IV-A be meaningful in policy planning, whether or not the IQR rule in Exercise 2.2 flags them?

### Your Answer to Question 1:

[Your answer here]

### Your Answer to Question 2:

[Your answer here]

### Your Answer to Question 3:

[Your answer here]

---

## 🎯 Congratulations!

You've completed Week 4 Lab on Exploratory Data Analysis.

**Key Skills Practiced:**
- Calculating descriptive statistics (mean, median, std, quartiles)
- Analyzing distributions with histograms and box plots
- Identifying outliers using IQR method
- Computing and interpreting correlations
- Creating comparative analyses

**Remember:** Check the **solution notebook** if you need help!

**Next Steps:**
- Practice EDA on other Philippine datasets
- Learn about advanced visualization techniques
- Move on to Week 5 lab (Data Visualization Techniques)

---

*CMSC 178DA - Data Analytics | University of the Philippines Cebu*