# Example Data Analysis

**Author**: Template User
**Created**: 2025-11-15
**Purpose**: Demonstrate a basic data analysis workflow using the `analysis` package

## Setup

Import libraries and configure display options.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from analysis.data import clean_data
from analysis.viz import plot_distribution, plot_correlation_matrix

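# Configure display: show all columns, then set plot style and palette
# (the 'seaborn-v0_8-*' style names require matplotlib >= 3.6)
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')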

%matplotlib inline

## Create Sample Data

Generate synthetic data for demonstration.

In [None]:
# Generate sample dataset
np.random.seed(42)
n_samples = 1000

df = pd.DataFrame({
    'age': np.random.normal(35, 10, n_samples),
    'income': np.random.lognormal(10, 0.5, n_samples),
    'score': np.random.beta(5, 2, n_samples) * 100,
    'category': np.random.choice(['A', 'B', 'C'], n_samples)
})

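# Introduce 50 missing ages and append 20 duplicated rows
# (ignore_index avoids repeated index labels after the concat)
df.loc[df.sample(50).index, 'age'] = np.nan
df = pd.concat([df, df.sample(20)], ignore_index=True)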

print(f"Dataset shape: {df.shape}")
df.head()
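
Before cleaning, it can help to confirm that the injected problems are actually present. The sketch below rebuilds a smaller frame in the same way (the sizes, the `demo` name, and the use of `np.random.default_rng` are assumptions made for brevity, so the values will not match the cell above) and counts missing ages and duplicated rows. Because the duplicated rows are sampled after the nulls are introduced, some duplicates may themselves carry a missing age, so the null count can exceed the number originally injected.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the notebook's dataset (sizes are illustrative)
rng = np.random.default_rng(42)
n_rows = 200
demo = pd.DataFrame({
    'age': rng.normal(35, 10, n_rows),
    'score': rng.beta(5, 2, n_rows) * 100,
    'category': rng.choice(['A', 'B', 'C'], n_rows),
})

# Inject 10 missing ages, then append 5 duplicated rows
demo.loc[demo.sample(10, random_state=0).index, 'age'] = np.nan
demo = pd.concat([demo, demo.sample(5, random_state=1)], ignore_index=True)

# Count what a cleaning step would need to handle
null_count = int(demo['age'].isna().sum())
duplicate_count = int(demo.duplicated().sum())

print(f"Missing ages: {null_count}")
print(f"Duplicated rows: {duplicate_count}")
```

The same two expressions, `df['age'].isna().sum()` and `df.duplicated().sum()`, can be run on the notebook's `df` directly.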

## Data Cleaning

Use `clean_data` from the analysis package to clean the data, then compare row counts before and after to see how many rows were removed.

In [None]:
# Clean data using our utility function
df_clean = clean_data(df)

print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {df_clean.shape}")
print(f"Removed {df.shape[0] - df_clean.shape[0]} rows")
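
The implementation of `clean_data` is not shown in this notebook. As a rough illustration of what such a step might do with the nulls and duplicates introduced earlier, here is a minimal sketch. The function name `clean_data_sketch` and its behavior (drop exact duplicate rows, then drop rows with any missing value) are assumptions, not the analysis package's actual logic.

```python
import numpy as np
import pandas as pd


def clean_data_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for analysis.data.clean_data (an assumption)."""
    # Remove exact duplicate rows, keeping the first occurrence
    deduped = frame.drop_duplicates()
    # Remove rows that still contain a missing value, then renumber the index
    return deduped.dropna().reset_index(drop=True)


# Tiny example: the last row duplicates the first, the second row has a missing age
raw = pd.DataFrame({
    'age': [30.0, np.nan, 41.0, 30.0],
    'score': [70.0, 55.0, 88.0, 70.0],
    'category': ['A', 'B', 'C', 'A'],
})
cleaned = clean_data_sketch(raw)

print(f"Original shape: {raw.shape}")
print(f"Cleaned shape: {cleaned.shape}")
```

If the real function imputes missing values instead of dropping them, the row counts reported above would differ from what this sketch suggests.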

## Exploratory Data Analysis

Examine distributions and correlations.

In [None]:
# Summary statistics
df_clean.describe()

In [None]:
# Plot age distribution using our viz utility
fig = plot_distribution(df_clean['age'], title='Age Distribution')
plt.show()

In [None]:
# Plot score distribution
fig = plot_distribution(df_clean['score'], title='Score Distribution')
plt.show()

In [None]:
# Correlation matrix (numeric columns only)
numeric_df = df_clean.select_dtypes(include=[np.number])
fig = plot_correlation_matrix(numeric_df)
plt.show()
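
A heatmap is easier to interpret alongside the underlying numbers. The following sketch computes a Pearson correlation matrix with `DataFrame.corr` on independently generated columns and reports the largest absolute off-diagonal value. The `sample` frame and its seed are assumptions for illustration, and how `plot_correlation_matrix` computes its values internally is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Independently generated columns, mirroring the notebook's distributions
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    'age': rng.normal(35, 10, 500),
    'income': rng.lognormal(10, 0.5, 500),
    'score': rng.beta(5, 2, 500) * 100,
})

# Pairwise Pearson correlations between numeric columns
corr = sample.corr(method='pearson')

# Mask the diagonal (always 1.0) and find the strongest remaining correlation
off_diagonal = corr.where(~np.eye(len(corr), dtype=bool))
max_abs_corr = float(off_diagonal.abs().max().max())

print(corr.round(3))
print(f"Largest absolute off-diagonal correlation: {max_abs_corr:.3f}")
```

On the notebook's data, `numeric_df.corr()` gives the equivalent table.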

## Analysis by Category

Compare metrics across categories.

In [None]:
# Group by category
category_summary = df_clean.groupby('category').agg({
    'age': ['mean', 'std'],
    'income': ['mean', 'median'],
    'score': ['mean', 'std']
}).round(2)

category_summary
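
Passing a dict of lists to `agg` produces two-level (MultiIndex) column labels, which can be awkward to reference later. As an alternative, pandas named aggregation yields flat column names. This is a sketch on a toy frame (the `toy` data and the output column names are illustrative assumptions):

```python
import pandas as pd

# Toy data standing in for df_clean
toy = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'age': [30.0, 40.0, 25.0, 35.0],
    'income': [20000.0, 30000.0, 15000.0, 45000.0],
})

# Named aggregation: output_column=(input_column, aggregation)
flat_summary = toy.groupby('category').agg(
    age_mean=('age', 'mean'),
    age_std=('age', 'std'),
    income_median=('income', 'median'),
).round(2)

print(flat_summary)
```

With flat names, a single statistic can be selected as `flat_summary['age_mean']` rather than through a tuple key.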

In [None]:
# Visualize category distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for idx, col in enumerate(['age', 'income', 'score']):
    sns.boxplot(data=df_clean, x='category', y=col, ax=axes[idx])
    axes[idx].set_title(f'{col.title()} by Category')

plt.tight_layout()
plt.show()

## Conclusions

Key findings from the analysis:

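1. The dataset contains three categories (A, B, C) with similar sample sizes, as expected from uniform random assignment
2. Age is approximately normal, centered around 35 with a standard deviation of about 10
3. Income is right-skewed, consistent with its lognormal generation
4. Score is left-skewed and bounded between 0 and 100, consistent with a scaled Beta(5, 2) distribution
5. Correlations between the numeric variables are weak, as expected because each column was generated independently; weak correlation alone would not establish independence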
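
The findings about category sizes and distribution shape can be checked with two short summaries: `value_counts` for the category balance and `skew` for the direction of asymmetry. The sketch below regenerates comparable data (the `check` frame and its seed are assumptions, so the exact numbers will differ from the notebook's `df_clean`).

```python
import numpy as np
import pandas as pd

# Regenerate data with the same distributions as the notebook
rng = np.random.default_rng(42)
n = 1000
check = pd.DataFrame({
    'income': rng.lognormal(10, 0.5, n),
    'score': rng.beta(5, 2, n) * 100,
    'category': rng.choice(['A', 'B', 'C'], n),
})

# Rows per category, ordered by label
category_counts = check['category'].value_counts().sort_index()

# Sample skewness: positive means a long right tail, negative a long left tail
skewness = check[['income', 'score']].skew()

print(category_counts)
print(skewness.round(2))
```

A positive value for income and a negative value for score would match the right-skewed and left-skewed shapes described above.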

**Next Steps**:
- Investigate outliers in the income distribution
- Test for statistical significance of category differences
- Build predictive models for score based on age and income
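
One possible way to approach the significance-testing step, without adding dependencies beyond NumPy, is a permutation test on the between-group sum of squares. This is a sketch under stated assumptions: the helper names `between_group_ss` and `permutation_p_value` are hypothetical, the data is synthetic, and a standard one-way ANOVA or Kruskal-Wallis test from a statistics library would be a reasonable alternative.

```python
import numpy as np


def between_group_ss(values, labels):
    """Sum over groups of n_g * (group mean - grand mean)^2."""
    grand_mean = values.mean()
    total = 0.0
    for label in np.unique(labels):
        group = values[labels == label]
        total += len(group) * (group.mean() - grand_mean) ** 2
    return total


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Share of label shuffles whose statistic is at least the observed one."""
    rng = np.random.default_rng(seed)
    observed = between_group_ss(values, labels)
    exceed = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(labels)
        if between_group_ss(values, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive
    return (exceed + 1) / (n_permutations + 1)


# Synthetic scores with categories assigned independently of the values
rng = np.random.default_rng(1)
labels = rng.choice(np.array(['A', 'B', 'C']), 300)
scores = rng.normal(70, 10, 300)
p_independent = permutation_p_value(scores, labels)

# Same scores, but category C shifted upward by 20 points
shifted = scores + np.where(labels == 'C', 20.0, 0.0)
p_shifted = permutation_p_value(shifted, labels)

print(f"p-value, independent labels: {p_independent:.3f}")
print(f"p-value, shifted category C: {p_shifted:.4f}")
```

On the notebook's data this could be called as `permutation_p_value(df_clean['score'].to_numpy(), df_clean['category'].to_numpy())`, assuming `df_clean` has no missing scores.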