# Problem 1: Descriptive Statistics - CEO Compensation Analysis

This notebook analyzes CEO compensation data from 1999, examining various statistical measures and relationships between variables.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

In [2]:
df = pd.read_excel('../data/ceo.xls')


print("Dataset shape:", df.shape)
df.head()

Dataset shape: (447, 8)


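| | salary | totcomp | tenure | age | sales | profits | assets | Unnamed: 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 3030 | 8138 | 7 | 61 | 161315.0 | 2956.0 | 257389.0 | NaN |
| 1 | 6050 | 14530 | 0 | 51 | 144416.0 | 22071.0 | 237545.0 | NaN |
| 2 | 3571 | 7433 | 11 | 63 | 139208.0 | 4430.0 | 49271.0 | NaN |
| 3 | 3300 | 13464 | 6 | 60 | 100697.0 | 6370.0 | 92630.0 | NaN |
| 4 | 10000 | 68285 | 18 | 63 | 100469.0 | 9296.0 | 355935.0 | NaN |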


In [3]:
print("Dataset information:")
df.info()
print("\nBasic statistics:")
df.describe()

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 447 entries, 0 to 446
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   salary      447 non-null    int64  
 1   totcomp     447 non-null    int64  
 2   tenure      447 non-null    int64  
 3   age         447 non-null    int64  
 4   sales       447 non-null    float64
 5   profits     447 non-null    float64
 6   assets      447 non-null    float64
 7   Unnamed: 7  2 non-null      object 
dtypes: float64(3), int64(4), object(1)
memory usage: 28.1+ KB

Basic statistics:


| | salary | totcomp | tenure | age | sales | profits | assets |
|---|---|---|---|---|---|---|---|
| count | 447.0 | 447.0 | 447.0 | 447.0 | 447.0 | 447.0 | 447.0 |
| mean | 2027.516779 | 8340.058166 | 7.834452 | 56.469799 | 11557.780984 | 700.46085 | 27054.290828 |
| std | 1722.566389 | 31571.803005 | 8.246721 | 6.806641 | 16168.368902 | 1542.538013 | 64659.043191 |
| min | 100.0 | 100.0 | 0.0 | 34.0 | 2896.4 | -2669.0 | 717.8 |
| 25% | 1084.0 | 1575.5 | 2.0 | 52.0 | 4184.15 | 108.45 | 3856.95 |
| 50% | 1600.0 | 2951.0 | 5.0 | 57.0 | 6704.0 | 333.1 | 7810.8 |
| 75% | 2347.5 | 6043.0 | 10.0 | 61.0 | 12976.8 | 738.0 | 21105.55 |
| max | 15250.0 | 589101.0 | 60.0 | 84.0 | 161315.0 | 22071.0 | 668641.0 |


## 1(a) Location Measures for Total Compensation

In [4]:
totcomp = df['totcomp'].dropna()

mean_val = totcomp.mean()
trimmed_mean = stats.trim_mean(totcomp, 0.05)
median_val = totcomp.median()
q1 = totcomp.quantile(0.25)
q3 = totcomp.quantile(0.75)
q05 = totcomp.quantile(0.05)
q95 = totcomp.quantile(0.95)

print("LOCATION MEASURES FOR TOTAL COMPENSATION\n" + "="*50)
print(f"Mean: ${mean_val:,.2f}")
print(f"5%-Trimmed Mean: ${trimmed_mean:,.2f}")
print(f"Median (Q2): ${median_val:,.2f}")
print(f"Lower Quartile (Q1): ${q1:,.2f}")
print(f"Upper Quartile (Q3): ${q3:,.2f}")
print(f"Lower 5% Quantile: ${q05:,.2f}")
print(f"Upper 95% Quantile: ${q95:,.2f}")

LOCATION MEASURES FOR TOTAL COMPENSATION
==================================================
Mean: $8,340.06
5%-Trimmed Mean: $4,637.68
Median (Q2): $2,951.00
Lower Quartile (Q1): $1,575.50
Upper Quartile (Q3): $6,043.00
Lower 5% Quantile: $783.70
Upper 95% Quantile: $24,563.30


### Economic Interpretation:

**Mean**: The mean (8,340, i.e. about $8.3 million, since `totcomp` is recorded in $1000) is what each CEO would receive if total compensation were split equally across the sample. It is sensitive to extreme values: a few very large packages pull it far above what most CEOs earn.

**5%-Trimmed Mean**: Dropping the lowest and highest 5% of observations before averaging gives a more robust figure (4,638). It is far below the plain mean, which shows how strongly the top packages drive the mean, yet it is still above the median because the remaining data are right-skewed.

**Median**: Half of the CEOs earn less than this value (2,951) and half earn more. For a skewed distribution it is the most representative location measure, describing the compensation of the "middle" CEO.

**Lower Quartile (Q1)**: 25% of CEOs earn less than this amount, representing the threshold for the lower-paid quarter of executives.

**Upper Quartile (Q3)**: 75% of CEOs earn below this value; only the top 25% earn more, indicating the entry point for high-earner CEOs.

**5% Quantile**: Only 5% of CEOs earn below this threshold, representing near-minimum compensation levels.

**95% Quantile**: Only 5% of CEOs earn above this value, indicating exceptionally high compensation packages.
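
The contrast between the three averages can be reproduced on a small made-up sample. This is only a sketch: the `pay` values and the helper `trimmed_mean` are hypothetical, and the helper assumes the convention of cutting `floor(proportion * n)` observations from each tail, which is how `scipy.stats.trim_mean` is understood to behave.

```python
import math
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping floor(proportion * n) observations from each tail."""
    ordered = sorted(values)
    k = math.floor(proportion * len(ordered))  # observations cut per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical pay figures in $1000, with one extreme package at the top
pay = [900, 1200, 1500, 1800, 2100, 2400, 2900, 3500, 4800, 60000]

plain_mean = statistics.mean(pay)      # pulled up by the 60000 package
robust_mean = trimmed_mean(pay, 0.10)  # drops 900 and 60000 before averaging
middle = statistics.median(pay)        # unaffected by how large the top value is

print(f"mean={plain_mean}, trimmed mean={robust_mean}, median={middle}")
```

Replacing 60000 with an even larger value changes only the plain mean, which is the sense in which the other two measures are robust.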

## 1(b) Empirical Cumulative Distribution Function

In [None]:
sorted_totcomp = np.sort(totcomp)
ecdf_values = np.arange(1, len(sorted_totcomp) + 1) / len(sorted_totcomp)

plt.figure(figsize=(10, 6))
plt.step(sorted_totcomp, ecdf_values, where='post', linewidth=2)  # the ECDF is a step function
plt.xlabel('Total Compensation ($1000)', fontsize=12)
plt.ylabel('Cumulative Probability', fontsize=12)
plt.title('Empirical Cumulative Distribution Function - Total Compensation', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
f_inv_01 = totcomp.quantile(0.1)
f_inv_09 = totcomp.quantile(0.9)

f_2000 = (totcomp <= 2000).mean()
one_minus_f_4000 = (totcomp > 4000).mean()

print("ECDF QUANTITIES\n" + "="*50)
print(f"F̂⁻¹(0.1): ${f_inv_01:,.2f}")
print(f"F̂⁻¹(0.9): ${f_inv_09:,.2f}")
print(f"\nF̂(2000): {f_2000:.4f} ({f_2000*100:.2f}%)")
print(f"1 - F̂(4000): {one_minus_f_4000:.4f} ({one_minus_f_4000*100:.2f}%)")

### Economic Interpretation:

**F̂⁻¹(0.1)**: The 10th percentile shows that 10% of CEOs earn below this compensation level. This represents the lower-income threshold for CEO positions.

**F̂⁻¹(0.9)**: The 90th percentile indicates that 90% of CEOs earn less than this amount. CEOs earning above this are in the top 10% of compensations.

**F̂(2000)**: The proportion of CEOs earning $2 million or less (`totcomp` is in $1000). It shows how much of the sample is concentrated at the lower end of the compensation scale.

**1 - F̂(4000)**: The proportion of CEOs earning more than $4 million, indicating the prevalence of high-compensation packages in the sample.
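
As a minimal sketch of the four quantities, the ECDF and its inverse can be written out directly. The sample `pay` and the helper names are hypothetical. The inverse here returns an actual observation, whereas `Series.quantile` interpolates linearly by default, so the two can differ slightly.

```python
import math


def ecdf(sample, x):
    """Share of observations less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def ecdf_inverse(sample, p):
    """Smallest observation whose ECDF value is at least p."""
    ordered = sorted(sample)
    index = max(math.ceil(p * len(ordered)) - 1, 0)
    return ordered[index]


# Hypothetical pay figures in $1000
pay = [900, 1200, 1500, 1800, 2100, 2400, 2900, 3500, 4800, 60000]

share_at_most_2000 = ecdf(pay, 2000)                            # F(2000)
share_above_4000 = sum(1 for v in pay if v > 4000) / len(pay)   # 1 - F(4000)
tenth_percentile = ecdf_inverse(pay, 0.1)                       # F^-1(0.1)
ninetieth_percentile = ecdf_inverse(pay, 0.9)                   # F^-1(0.9)

print(share_at_most_2000, share_above_4000)
print(tenth_percentile, ninetieth_percentile)
```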

## 1(c) Histogram and Box Plot

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].hist(totcomp, bins='auto', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Total Compensation ($1000)', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].set_title('Histogram of Total Compensation', fontsize=12)
axes[0].grid(True, alpha=0.3)

axes[1].boxplot(totcomp, vert=True)
axes[1].set_ylabel('Total Compensation ($1000)', fontsize=11)
axes[1].set_title('Box Plot of Total Compensation', fontsize=12)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Interpretation:

**Histogram Area**: In a density histogram (`density=True`), each rectangle's area equals the proportion of observations in that bin, so the areas sum to 1. The histogram here plots counts instead: each bar's height is the number of observations in the bin, and because all bins have the same width, the areas are still proportional to those counts.

**Box Plot Components**:
- Bottom whisker: Smallest observation that lies within 1.5×IQR below Q1
- Box bottom: Q1 (25th percentile)
- Line in box: Median (Q2, 50th percentile)
- Box top: Q3 (75th percentile)
- Top whisker: Largest observation that lies within 1.5×IQR above Q3
- Points beyond whiskers: Plotted individually as outliers
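
The box plot components can be computed by hand, which makes the whisker rule concrete. This is a sketch on a hypothetical sample; it assumes the default 1.5×IQR rule and linearly interpolated quartiles (`method="inclusive"` in the `statistics` module).

```python
import statistics

# Hypothetical pay figures in $1000
pay = [900, 1200, 1500, 1800, 2100, 2400, 2900, 3500, 4800, 60000]

q1, median, q3 = statistics.quantiles(pay, n=4, method="inclusive")
iqr = q3 - q1

# Fences mark how far a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
inside = [v for v in pay if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Everything outside the fences is drawn as an individual point
outliers = [v for v in pay if v < lower_fence or v > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Note that the top whisker ends at 4800, an actual observation, not at the fence value itself.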

## 1(d) Distribution Analysis and Symmetry

In [None]:
skewness = totcomp.skew()
kurtosis = totcomp.kurtosis()  # pandas reports excess kurtosis (normal distribution = 0)
iqr = q3 - q1
std_dev = totcomp.std()

print("DISTRIBUTION CHARACTERISTICS\n" + "="*50)
print(f"Skewness: {skewness:.4f}")
print(f"Excess kurtosis: {kurtosis:.4f}")
print(f"IQR: ${iqr:,.2f}")
print(f"Standard Deviation: ${std_dev:,.2f}")
print(f"\nMean vs Median: ${mean_val:,.2f} vs ${median_val:,.2f}")
print(f"Difference: ${mean_val - median_val:,.2f}")

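if skewness > 1:
    print("\nThe distribution is HIGHLY RIGHT-SKEWED")
elif skewness > 0.5:
    print("\nThe distribution is MODERATELY RIGHT-SKEWED")
elif skewness < -1:
    print("\nThe distribution is HIGHLY LEFT-SKEWED")
elif skewness < -0.5:
    print("\nThe distribution is MODERATELY LEFT-SKEWED")
else:
    print("\nThe distribution is APPROXIMATELY SYMMETRIC")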

### Conclusions:

The distribution shows strong right skewness, indicating:
1. Most CEOs earn relatively moderate compensations
2. A small number receive exceptionally high packages (outliers)
3. The mean is pulled upward by these extreme values

**Appropriateness of Location Measures**:
- **Mean**: Less appropriate due to extreme sensitivity to outliers
- **Median**: Most appropriate - robust to outliers, better represents typical CEO
- **Trimmed mean**: Good compromise - discards only the extreme tails, so it is more robust than the mean while using more of the data than the median

The high positive skewness suggests we should primarily rely on median and quantiles rather than mean for understanding typical compensation.
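
For reference, sample skewness can be computed from the second and third central moments. The sketch below uses hypothetical data and a hypothetical helper `sample_skewness`. The bias-adjusted variant is assumed to be the one `Series.skew()` reports, since pandas documents its result as unbiased.

```python
import math


def sample_skewness(values, adjusted=True):
    """Moment-based skewness g1, optionally with the small-sample adjustment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5
    if not adjusted:
        return g1
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]  # one large value stretches the right tail

skew_symmetric = sample_skewness(symmetric)
skew_right = sample_skewness(right_skewed)

print(f"symmetric sample: {skew_symmetric:.4f}")
print(f"right-skewed sample: {skew_right:.4f}")
```

A single large observation is enough to push the statistic well above 1, which is the threshold the code above uses for "highly right-skewed".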

## 1(e) Histogram Bandwidth Selection

In [None]:
print("Histogram Bandwidth Selection Methods:\n" + "="*50)
print("\nMatplotlib passes bins='auto' to NumPy, which takes the smaller bin width")
print("(i.e. the larger bin count) of Sturges' rule and the Freedman-Diaconis rule:")
print("\n1. Sturges' Rule: bins = ceil(log2(n) + 1)")
print("   Simple but may not work well for non-normal distributions")
print("\n2. Freedman-Diaconis: bin_width = 2 * IQR / n^(1/3)")
print("   More robust to outliers, adapts to data spread")

n = len(totcomp)
sturges_bins = int(np.ceil(np.log2(n) + 1))
fd_bin_width = 2 * iqr / (n ** (1/3))
fd_bins = int(np.ceil((totcomp.max() - totcomp.min()) / fd_bin_width))

print(f"\nFor our data (n={n}):")
print(f"Sturges suggests: {sturges_bins} bins")
print(f"Freedman-Diaconis suggests: {fd_bins} bins")

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

axes[0].hist(totcomp, bins=5, edgecolor='black', alpha=0.7)
axes[0].set_title('Too Rough (5 bins)', fontsize=12)
axes[0].set_xlabel('Total Compensation ($1000)')
axes[0].set_ylabel('Frequency')
axes[0].grid(True, alpha=0.3)

axes[1].hist(totcomp, bins='auto', edgecolor='black', alpha=0.7)
axes[1].set_title('Optimal (auto)', fontsize=12)
axes[1].set_xlabel('Total Compensation ($1000)')
axes[1].grid(True, alpha=0.3)

axes[2].hist(totcomp, bins=100, edgecolor='black', alpha=0.7)
axes[2].set_title('Too Detailed (100 bins)', fontsize=12)
axes[2].set_xlabel('Total Compensation ($1000)')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nLearnings:")
print("- TOO ROUGH: Loses important distribution features, over-smooths data")
print("- OPTIMAL: Balances detail and clarity, reveals true distribution shape")
print("- TOO DETAILED: Shows noise, makes interpretation difficult, irregular bars")
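
As a rough check of the two rules, the bin counts can be computed from the summary statistics reported earlier (n = 447, Q1 = 1575.5, Q3 = 6043, min = 100, max = 589101). The helper names are hypothetical, and taking the larger count is an assumption about how `bins='auto'` combines the two rules, based on its documented "minimum bin width" behavior.

```python
import math


def sturges_bins(n):
    """Sturges' rule: ceil(log2(n) + 1) bins."""
    return math.ceil(math.log2(n) + 1)


def freedman_diaconis_bins(n, iqr, data_range):
    """Freedman-Diaconis: bin width 2 * IQR / n^(1/3), converted to a bin count."""
    width = 2 * iqr / n ** (1 / 3)
    return math.ceil(data_range / width)


# Summary statistics of totcomp taken from the describe() output
n = 447
iqr = 6043.0 - 1575.5
data_range = 589101.0 - 100.0

sturges = sturges_bins(n)
fd = freedman_diaconis_bins(n, iqr, data_range)
auto_like = max(sturges, fd)  # smaller bin width means more bins

print(f"Sturges: {sturges} bins, Freedman-Diaconis: {fd} bins, auto-like: {auto_like}")
```

Because the range is dominated by a single extreme package while the IQR is small, the Freedman-Diaconis count is far larger than the Sturges count for this variable.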

## 1(f) Log Transformation

In [None]:
log_totcomp = np.log(totcomp)

log_mean = log_totcomp.mean()
log_median = log_totcomp.median()
log_skewness = log_totcomp.skew()

print("LOG-TRANSFORMED STATISTICS\n" + "="*50)
print(f"Original Mean: ${mean_val:,.2f}")
print(f"Original Median: ${median_val:,.2f}")
print(f"Original Skewness: {skewness:.4f}")
print(f"\nLog Mean: {log_mean:.4f}")
print(f"Log Median: {log_median:.4f}")
print(f"Log Skewness: {log_skewness:.4f}")
print(f"\nReduction in |skewness|: {abs(skewness) - abs(log_skewness):.4f}")

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].hist(totcomp, bins='auto', edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Original: Histogram', fontsize=12)
axes[0, 0].set_xlabel('Total Compensation ($1000)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].grid(True, alpha=0.3)

axes[0, 1].boxplot(totcomp, vert=True)
axes[0, 1].set_title('Original: Box Plot', fontsize=12)
axes[0, 1].set_ylabel('Total Compensation ($1000)')
axes[0, 1].grid(True, alpha=0.3)

axes[1, 0].hist(log_totcomp, bins='auto', edgecolor='black', alpha=0.7, color='green')
axes[1, 0].set_title('Log-Transformed: Histogram', fontsize=12)
axes[1, 0].set_xlabel('ln(Total Compensation)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].grid(True, alpha=0.3)

axes[1, 1].boxplot(log_totcomp, vert=True)
axes[1, 1].set_title('Log-Transformed: Box Plot', fontsize=12)
axes[1, 1].set_ylabel('ln(Total Compensation)')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Economic Interpretation of Log Transformation:

**Why logarithms work**:
- Converts multiplicative relationships to additive ones
- Compresses the scale, reducing impact of extreme values
- Makes percentage changes interpretable as linear differences

**Key findings**:
1. The log-transformed distribution is much more symmetric
2. Mean and median become closer, indicating better balance
3. Outliers are less extreme on the log scale
4. The distribution approximates normality better

**Economic meaning**:
- Original scale: Dollar differences matter (e.g., $1M to $2M)
- Log scale: Proportional/percentage differences matter (e.g., doubling compensation)
- Log transformation suggests CEO compensation may follow a log-normal distribution, common in income data
- This implies that differences in compensation between CEOs are described more consistently in relative (percentage) terms than in absolute dollar terms
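
A short sketch of the "multiplicative becomes additive" point, using hypothetical pay values that double at each step. The dollar gaps grow, the log gaps are constant, and exponentiating the mean of the logs gives the geometric mean.

```python
import math

# Hypothetical pay figures in $1000; each value is double the previous one
pay = [1000, 2000, 4000, 8000]
logs = [math.log(v) for v in pay]

# Consecutive differences on each scale
dollar_steps = [b - a for a, b in zip(pay, pay[1:])]
log_steps = [b - a for a, b in zip(logs, logs[1:])]

arithmetic_mean = sum(pay) / len(pay)
geometric_mean = math.exp(sum(logs) / len(logs))  # back-transformed log mean

print("dollar steps:", dollar_steps)
print("log steps:", [round(s, 4) for s in log_steps])
print(f"arithmetic mean={arithmetic_mean:.1f}, geometric mean={geometric_mean:.1f}")
```

The back-transformed log mean sits below the arithmetic mean, so on the log scale the largest values carry less weight.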

## 2(a) Correlation Analysis - Pearson Correlation

In [None]:
numeric_cols = ['salary', 'totcomp', 'tenure', 'age', 'sales', 'profits', 'assets']
df_numeric = df[numeric_cols].dropna()

pearson_corr = df_numeric.corr(method='pearson')

plt.figure(figsize=(10, 8))
sns.heatmap(pearson_corr, annot=True, fmt='.3f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Pearson Correlation Heatmap', fontsize=14, pad=20)
plt.tight_layout()
plt.show()

print("\nSTRONGEST CORRELATIONS WITH TOTCOMP:")
print("="*50)
totcomp_corrs = pearson_corr['totcomp'].drop('totcomp').sort_values(ascending=False)
for var, corr in totcomp_corrs.items():
    strength = "Very Strong" if abs(corr) > 0.8 else "Strong" if abs(corr) > 0.6 else "Moderate" if abs(corr) > 0.4 else "Weak"
    print(f"{var:10s}: {corr:6.3f}  ({strength})")

### Discussion:

The Pearson correlation measures linear relationships between variables. Key observations:
- **Strong positive correlations** suggest variables move together linearly
- **Weak correlations** may indicate non-linear relationships or no relationship
- Company size metrics (sales, assets) often correlate with CEO compensation
- Personal characteristics (age, tenure) may show different patterns
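
To make the definition concrete, Pearson's coefficient can be computed as the covariance divided by the product of the standard deviations. The data and the helper `pearson_r` below are hypothetical. The second call adds one extreme pair to show how a single observation can dominate the coefficient, which matters for a variable with outliers as large as those in `totcomp`.

```python
import statistics


def pearson_r(x, y):
    """Sample covariance divided by the product of sample standard deviations."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))


# Hypothetical data with a moderate linear association
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 4, 2, 6, 5]
r_without = pearson_r(x, y)

# Same data plus one extreme pair far from the rest
r_with = pearson_r(x + [20], y + [40])

print(f"without extreme pair: {r_without:.3f}")
print(f"with extreme pair:    {r_with:.3f}")
```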

## 2(b) Scatter Plots and Spearman Correlation

In [None]:
scatter_vars = ['salary', 'tenure', 'age', 'sales']
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, var in enumerate(scatter_vars):
    axes[idx].scatter(df_numeric[var], df_numeric['totcomp'], alpha=0.5)
    axes[idx].set_xlabel(var, fontsize=11)
    axes[idx].set_ylabel('totcomp', fontsize=11)
    axes[idx].set_title(f'totcomp vs {var}', fontsize=12)
    axes[idx].grid(True, alpha=0.3)
    
    z = np.polyfit(df_numeric[var], df_numeric['totcomp'], 1)
    p = np.poly1d(z)
    axes[idx].plot(df_numeric[var], p(df_numeric[var]), "r--", alpha=0.8, linewidth=2)

plt.tight_layout()
plt.show()

In [None]:
spearman_corr = df_numeric.corr(method='spearman')

plt.figure(figsize=(10, 8))
sns.heatmap(spearman_corr, annot=True, fmt='.3f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Spearman Correlation Heatmap', fontsize=14, pad=20)
plt.tight_layout()
plt.show()

print("\nCOMPARISON: PEARSON vs SPEARMAN\n" + "="*60)
comparison_df = pd.DataFrame({
    'Variable': totcomp_corrs.index,
    'Pearson': totcomp_corrs.values,
    'Spearman': spearman_corr['totcomp'].drop('totcomp').loc[totcomp_corrs.index].values
})
comparison_df['Difference'] = comparison_df['Spearman'] - comparison_df['Pearson']
print(comparison_df.to_string(index=False))

### Analysis:

**Appropriateness of Pearson correlation**:
- Check scatter plots for linearity
- If relationships are non-linear or have outliers, Pearson may not capture true association

**Spearman vs Pearson**:
- **Spearman** measures monotonic relationships (consistent direction)
- **Pearson** measures linear relationships (straight line)
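- Large differences between the two indicate either a monotonic but non-linear relationship or a Pearson value driven by a few extreme observations
- Spearman is more robust to outliers because it depends only on ranks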

## 2(c) Rank Analysis and Correlation Concepts

In [None]:
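target_value = 6737
# Rank in ascending order: number of CEOs earning at most the target value
rank = (df_numeric['totcomp'] <= target_value).sum()
# CEOs earning strictly less than the target value
n_below = (df_numeric['totcomp'] < target_value).sum()
percentile = (rank / len(df_numeric)) * 100

print(f"RANK ANALYSIS FOR TOTCOMP = {target_value}\n" + "="*50)
print(f"Rank: {rank} out of {len(df_numeric)}")
print(f"Percentile: {percentile:.2f}%")
print(f"\nThis CEO earns more than {n_below} other CEOs ({n_below / len(df_numeric) * 100:.1f}% of the sample)")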

### Conceptual Difference Between Correlations:

**Pearson Correlation**:
- Measures **linear dependence**: how well data fits a straight line
- Uses actual values of variables
- Sensitive to outliers and scale
- Appropriate when relationship is linear
- Formula based on covariance and standard deviations

**Spearman Correlation**:
- Measures **monotonic dependence**: consistent increase/decrease direction
- Uses ranks (positions) instead of actual values
- Robust to outliers and unchanged by strictly increasing transformations (e.g. taking logs)
- Captures any consistent trend, not just linear
- Essentially Pearson correlation applied to ranks

**Example**:
- Linear relationship: y = 2x → Both correlations = 1
- Exponential relationship: y = e^x → Spearman = 1, Pearson < 1
- U-shaped relationship: y = x² with x symmetric around 0 → Both correlations ≈ 0
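
The three examples can be checked numerically. In this sketch Spearman's coefficient is computed as Pearson's coefficient on average ranks. The helper names and the small grids of x values are hypothetical choices for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation from sums of squared deviations."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


grid = [0, 1, 2, 3, 4]
centered = [-2, -1, 0, 1, 2]  # symmetric around 0 for the U-shaped case

cases = {
    "linear y = 2x": (grid, [2 * v for v in grid]),
    "exponential y = e^x": (grid, [math.exp(v) for v in grid]),
    "U-shaped y = x^2": (centered, [v ** 2 for v in centered]),
}

results = {name: (pearson(x, y), spearman(x, y)) for name, (x, y) in cases.items()}
for name, (p, s) in results.items():
    print(f"{name:22s} Pearson={p:6.3f}  Spearman={s:6.3f}")
```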

## 2(d) Age Subgroup Analysis

In [None]:
younger = df_numeric[df_numeric['age'] < 50]['totcomp']
older = df_numeric[df_numeric['age'] >= 50]['totcomp']

print(f"SUBGROUP ANALYSIS\n" + "="*60)
print(f"Younger than 50: n = {len(younger)}")
print(f"50 or older: n = {len(older)}")

print(f"\nLOCATION MEASURES:")
print(f"{'Measure':<20} {'Age < 50':>15} {'Age >= 50':>15}")
print("-" * 60)
print(f"{'Mean':<20} ${younger.mean():>14,.2f} ${older.mean():>14,.2f}")
print(f"{'Median':<20} ${younger.median():>14,.2f} ${older.median():>14,.2f}")
print(f"{'Q1':<20} ${younger.quantile(0.25):>14,.2f} ${older.quantile(0.25):>14,.2f}")
print(f"{'Q3':<20} ${younger.quantile(0.75):>14,.2f} ${older.quantile(0.75):>14,.2f}")

print(f"\nDISPERSION MEASURES:")
print(f"{'Measure':<20} {'Age < 50':>15} {'Age >= 50':>15}")
print("-" * 60)
print(f"{'Std Dev':<20} ${younger.std():>14,.2f} ${older.std():>14,.2f}")
print(f"{'IQR':<20} ${(younger.quantile(0.75)-younger.quantile(0.25)):>14,.2f} ${(older.quantile(0.75)-older.quantile(0.25)):>14,.2f}")
print(f"{'CV':<20} {(younger.std()/younger.mean()):>15.4f} {(older.std()/older.mean()):>15.4f}")
print(f"{'Range':<20} ${(younger.max()-younger.min()):>14,.2f} ${(older.max()-older.min()):>14,.2f}")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

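# Use the same bin edges for both groups so the bars are directly comparable
shared_bins = np.histogram_bin_edges(df_numeric['totcomp'], bins=20)
axes[0].hist(younger, bins=shared_bins, alpha=0.6, label='Age < 50', edgecolor='black')
axes[0].hist(older, bins=shared_bins, alpha=0.6, label='Age >= 50', edgecolor='black')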
axes[0].set_xlabel('Total Compensation ($1000)', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].set_title('Overlapping Histograms by Age Group', fontsize=12)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

sorted_younger = np.sort(younger)
sorted_older = np.sort(older)
ecdf_younger = np.arange(1, len(sorted_younger) + 1) / len(sorted_younger)
ecdf_older = np.arange(1, len(sorted_older) + 1) / len(sorted_older)

axes[1].plot(sorted_younger, ecdf_younger, label='Age < 50', linewidth=2)
axes[1].plot(sorted_older, ecdf_older, label='Age >= 50', linewidth=2)
axes[1].set_xlabel('Total Compensation ($1000)', fontsize=11)
axes[1].set_ylabel('Cumulative Probability', fontsize=11)
axes[1].set_title('Empirical CDFs by Age Group', fontsize=12)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Discussion:

**Location measures** tell us about central tendency:
- Do older CEOs earn more on average?
- What's the typical compensation in each group?

**Dispersion measures** tell us about variability:
- **Standard Deviation**: Absolute variability in dollars
- **IQR**: Spread of middle 50%, robust to outliers
- **Coefficient of Variation**: Relative variability (std/mean), allows comparison across groups
- **Range**: Full spread, sensitive to extremes

**Key learnings**:
- Compare not just averages but also spread
- Higher dispersion suggests more inequality within group
- ECDFs show entire distribution, revealing differences beyond summary statistics
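
The difference between absolute and relative dispersion shows up clearly on two hypothetical groups whose values differ only by a factor of ten. This is a sketch; `dispersion_summary` is an illustrative helper, and quartiles are linearly interpolated (`method="inclusive"`).

```python
import statistics


def dispersion_summary(values):
    """Standard deviation, IQR, coefficient of variation and range of a sample."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    std = statistics.stdev(values)
    return {
        "std": std,
        "iqr": q3 - q1,
        "cv": std / statistics.mean(values),
        "range": max(values) - min(values),
    }


# Hypothetical groups: the second is the first multiplied by 10
group_small = [10, 20, 30, 40, 50]
group_large = [100, 200, 300, 400, 500]

small = dispersion_summary(group_small)
large = dispersion_summary(group_large)

for name in ("std", "iqr", "cv", "range"):
    print(f"{name:6s} {small[name]:10.3f} {large[name]:10.3f}")
```

Standard deviation, IQR and range all scale with the level of the data, while the coefficient of variation is identical for both groups. That is why it is the measure to use when the groups have very different means.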

## 3. Contingency Table Analysis

In [None]:
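df_clean = df[['salary', 'age']].dropna().copy()  # explicit copy avoids SettingWithCopyWarning

# Salary groups (in $1000): S1 <= 3000, 3000 < S2 <= 5000, S3 > 5000
df_clean['S'] = pd.cut(df_clean['salary'], 
                        bins=[-np.inf, 3000, 5000, np.inf],
                        labels=['S1', 'S2', 'S3'])

# Age groups: A1 = younger than 50, A2 = 50 or older (right=False gives bins [a, b))
df_clean['A'] = pd.cut(df_clean['age'],
                       bins=[-np.inf, 50, np.inf],
                       labels=['A1', 'A2'],
                       right=False)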

contingency_absolute = pd.crosstab(df_clean['A'], df_clean['S'], margins=True)
contingency_relative = pd.crosstab(df_clean['A'], df_clean['S'], normalize=True, margins=True)

print("CONTINGENCY TABLE - ABSOLUTE FREQUENCIES\n" + "="*60)
print(contingency_absolute)
print("\n\nCONTINGENCY TABLE - RELATIVE FREQUENCIES\n" + "="*60)
print(contingency_relative.round(4))

## 3(b) Interpretation of Table Values

In [None]:
n12 = contingency_absolute.loc['A1', 'S2']
h12 = contingency_relative.loc['A1', 'S2']
n1_dot = contingency_absolute.loc['A1', 'All']
h1_dot = contingency_relative.loc['A1', 'All']

print("INTERPRETATION OF TABLE VALUES\n" + "="*60)
print(f"\nn₁₂ = {n12}")
print(f"Interpretation: {n12} CEOs are younger than 50 AND have a salary above $3,000K and up to $5,000K")

print(f"\nh₁₂ = {h12:.4f}")
print(f"Interpretation: {h12*100:.2f}% of all CEOs are younger than 50 AND have a salary above $3,000K and up to $5,000K")

print(f"\nn₁• = {n1_dot}")
print(f"Interpretation: Total number of CEOs younger than 50 (marginal frequency)")

print(f"\nh₁• = {h1_dot:.4f}")
print(f"Interpretation: {h1_dot*100:.2f}% of all CEOs are younger than 50 (marginal proportion)")
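
The notation can be illustrated without pandas by counting pairs directly. The observations below are hypothetical; the labels `A1`, `A2`, `S1` to `S3` follow the grouping used above.

```python
from collections import Counter

# Hypothetical (age group, salary group) pairs, one per CEO
observations = (
    [("A1", "S1")] * 3
    + [("A1", "S2")] * 1
    + [("A2", "S1")] * 2
    + [("A2", "S2")] * 3
    + [("A2", "S3")] * 1
)

n = len(observations)
joint = Counter(observations)                         # absolute joint frequencies n_ij
row_totals = Counter(age for age, _ in observations)  # marginal frequencies n_i.

n12 = joint[("A1", "S2")]  # CEOs in age group A1 AND salary group S2
h12 = n12 / n              # same cell as a share of the whole sample
n1_dot = row_totals["A1"]  # all CEOs in age group A1, whatever their salary group
h1_dot = n1_dot / n        # share of the sample in age group A1

print(f"n12={n12}, h12={h12}, n1.={n1_dot}, h1.={h1_dot}")
```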

## 3(c) Dependence Measure

In [None]:
contingency_no_margins = pd.crosstab(df_clean['A'], df_clean['S'])

chi2, p_value, dof, expected = stats.chi2_contingency(contingency_no_margins)

n_total = contingency_no_margins.sum().sum()
cramers_v = np.sqrt(chi2 / (n_total * (min(contingency_no_margins.shape) - 1)))

print("DEPENDENCE ANALYSIS\n" + "="*60)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"Cramér's V: {cramers_v:.4f}")

print(f"\nINTERPRETATION:")
if p_value < 0.05:
    print("✓ Variables are statistically dependent (p < 0.05)")
else:
    print("✗ No significant dependence detected (p >= 0.05)")

if cramers_v < 0.1:
    strength = "negligible"
elif cramers_v < 0.3:
    strength = "weak"
elif cramers_v < 0.5:
    strength = "moderate"
else:
    strength = "strong"

print(f"Cramér's V indicates {strength} association between age and salary groups")

### What Can We Infer?

**About Co-movement**:
- Cramér's V measures strength of association (0 = independent, 1 = perfect)
- Cannot determine direction (positive/negative) from chi-square test
- Shows whether knowing age helps predict salary category

**About Opposite Directions**:
- Chi-square test does NOT measure direction of relationship
- Need to examine conditional probabilities or odds ratios for direction
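- Chi-square and Cramér's V treat the categories as nominal and ignore their ordering, so "opposite direction" has no meaning for them; since both groupings here are ordered, direction can still be read from the conditional distributions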

**Better approach for direction**:
- Compare P(high salary | older) vs P(high salary | younger)
- Examine the standardized residuals from chi-square test
- Use conditional probability tables
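
The chi-square statistic, Cramér's V and the conditional distributions can all be derived by hand from a table of counts. The 2×3 table below is hypothetical (rows are age groups, columns are salary groups), and the computation is a plain sketch of the textbook formulas with no continuity correction.

```python
import math

# Hypothetical counts: rows = age groups (A1, A2), columns = salary groups (S1, S2, S3)
table = [
    [30, 10, 10],
    [20, 20, 10],
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Chi-square: squared gap between observed and expected counts under independence
chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected

# Cramér's V rescales chi-square to the interval [0, 1]
smaller_dim = min(len(table), len(table[0]))
cramers_v = math.sqrt(chi2 / (n * (smaller_dim - 1)))

# Row-wise conditional distributions: P(salary group | age group)
conditional = [[cell / row_totals[i] for cell in row] for i, row in enumerate(table)]

print(f"chi2={chi2:.4f}, Cramér's V={cramers_v:.4f}")
for label, probs in zip(("A1", "A2"), conditional):
    print(label, [round(p, 2) for p in probs])
```

Cramér's V only says that the two rows differ. The conditional rows show how: in this made-up table the first age group is more concentrated in the lowest salary group.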

In [None]:
conditional_probs = pd.crosstab(df_clean['A'], df_clean['S'], normalize='index')

print("\nCONDITIONAL PROBABILITIES (Row Percentages)\n" + "="*60)
print("P(Salary Category | Age Group):\n")
print(conditional_probs.round(4))

print("\nINTERPRETATION:")
print("Each row shows the distribution of salary categories within that age group.")
print("This helps us understand: Do older CEOs have different salary distributions?")

## Summary and Conclusions

This analysis revealed several key insights about CEO compensation:

1. **Distribution**: Highly right-skewed with significant outliers
2. **Central Tendency**: Median more appropriate than mean due to skewness
3. **Transformation**: Log transformation reduces the skewness and brings the distribution closer to symmetry
4. **Correlations**: Company size metrics strongly correlate with compensation
5. **Age Effects**: Differences exist between age groups in both location and spread
6. **Categorical Relationships**: Salary and age categories show measurable association