# Frequency Statistics
- Frequency tables, Histograms, Binning, Contingency tables
- Real examples: Categorical data analysis, Cross-tabulation

In [1]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd
print('Frequency statistics module loaded')

Frequency statistics module loaded


## Frequency Tables

**Concept**: Count occurrences of values
**Use**: Categorical and discrete data
**Tools**: np.unique, pd.Series.value_counts

In [2]:
# Frequency table
np.random.seed(42)
data = np.random.choice(['A', 'B', 'C', 'D'], size=100, p=[0.4, 0.3, 0.2, 0.1])

print('Frequency Table\n')

values, counts = np.unique(data, return_counts=True)
for val, count in zip(values, counts):
    freq = count / len(data) * 100
    print(f'  {val}: {count:3d} ({freq:5.1f}%)')

Frequency Table

  A:  46 ( 46.0%)
  B:  24 ( 24.0%)
  C:  21 ( 21.0%)
  D:   9 (  9.0%)
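The same table can be built with pandas. The block below is a minimal sketch that regenerates the sample with the same seed and uses `Series.value_counts`; the names `series` and `freq_table` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Regenerate the sample from the cell above (same seed, same probabilities)
np.random.seed(42)
data = np.random.choice(['A', 'B', 'C', 'D'], size=100, p=[0.4, 0.3, 0.2, 0.1])

series = pd.Series(data, name='category')

# Absolute counts, sorted by category label rather than by frequency
counts = series.value_counts().sort_index()

# normalize=True returns proportions; multiply by 100 for percentages
percents = series.value_counts(normalize=True).sort_index() * 100

freq_table = pd.DataFrame({'count': counts, 'percent': percents.round(1)})
print(freq_table)
```

By default `value_counts` sorts by descending count, so `sort_index()` is only needed when alphabetical order is preferred.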


## Binning Continuous Data

**Purpose**: Convert continuous values into discrete categories
**Methods**:
- Fixed width bins
- Fixed count bins (quantiles)
- Custom bins

In [3]:
# Histogram binning
np.random.seed(42)
scores = np.random.normal(75, 15, 200)

print('Test Score Distribution\n')

# Create bins
bins = [0, 60, 70, 80, 90, 100]
labels = ['F', 'D', 'C', 'B', 'A']

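# Bin the data: digitize returns i when bins[i-1] <= score < bins[i].
# The top edge (100) is dropped, so scores of 100 or more also land in 'A'.
bin_indices = np.digitize(scores, bins[:-1])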

print('Grade distribution:')
for i, label in enumerate(labels, 1):
    count = (bin_indices == i).sum()
    pct = count / len(scores) * 100
    print(f'  {label} ({bins[i-1]}-{bins[i]}): {count:3d} ({pct:5.1f}%)')

Test Score Distribution

Grade distribution:
  F (0-60):  32 ( 16.0%)
  D (60-70):  41 ( 20.5%)
  C (70-80):  62 ( 31.0%)
  B (80-90):  42 ( 21.0%)
  A (90-100):  23 ( 11.5%)
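The three binning methods listed above (custom, fixed width, fixed count) can also be sketched with `pd.cut` and `pd.qcut`. This is one possible approach, not the notebook's own; the open-ended edges `-inf` and `inf` are an assumption made here so that out-of-range scores are still graded.

```python
import numpy as np
import pandas as pd

# Same simulated scores as the cell above
np.random.seed(42)
scores = np.random.normal(75, 15, 200)

# Custom bins: right=False gives left-closed intervals such as [60, 70),
# matching np.digitize. Infinite outer edges are an assumption of this sketch.
edges = [-np.inf, 60, 70, 80, 90, np.inf]
grades = pd.cut(scores, bins=edges, labels=['F', 'D', 'C', 'B', 'A'], right=False)
grade_counts = pd.Series(grades).value_counts().sort_index()

# Fixed-width bins: 5 equal-width intervals spanning the observed range
width_counts = pd.Series(pd.cut(scores, bins=5)).value_counts().sort_index()

# Fixed-count bins: quartiles, each holding about the same number of scores
quartile_counts = pd.Series(pd.qcut(scores, q=4)).value_counts().sort_index()

print(grade_counts, width_counts, quartile_counts, sep='\n\n')
```

Fixed-width bins keep the interval size constant and let the counts vary; quantile bins do the opposite.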


## Contingency Tables

**Definition**: Cross-tabulation of two categorical variables
**Use**: Chi-square test, association analysis

In [4]:
# Contingency table
np.random.seed(42)
n = 200

# Generate data
gender = np.random.choice(['Male', 'Female'], n)
product = np.random.choice(['A', 'B', 'C'], n)

print('Contingency Table: Gender × Product\n')

# Create contingency table
df = pd.DataFrame({'Gender': gender, 'Product': product})
contingency = pd.crosstab(df['Gender'], df['Product'], margins=True)

print(contingency)
print('\nRow percentages:')
row_pct = pd.crosstab(df['Gender'], df['Product'], normalize='index') * 100
print(row_pct.round(1))

Contingency Table: Gender × Product

Product   A   B   C  All
Gender                  
Female   32  37  31  100
Male     41  24  35  100
All      73  61  66  200

Row percentages:
Product     A     B     C
Gender                   
Female   32.0  37.0  31.0
Male     41.0  24.0  35.0


## Chi-Square Test of Independence

**Test**: Are two categorical variables independent?
**H₀**: Variables are independent
**H₁**: Variables are associated

In [5]:
# Chi-square test
observed = pd.crosstab(df['Gender'], df['Product']).values

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print('Chi-Square Test of Independence\n')
print(f'Chi-square statistic: {chi2:.4f}')
print(f'p-value: {p_value:.4f}')
print(f'Degrees of freedom: {dof}')
print('\nConclusion:')
if p_value < 0.05:
    print('  Reject H₀: Variables are associated (p < 0.05)')
else:
    print('  Fail to reject H₀: No evidence of association')

print('\nExpected frequencies:')
print(expected.round(1))

Chi-Square Test of Independence

Chi-square statistic: 4.1225
p-value: 0.1273
Degrees of freedom: 2

Conclusion:
  Fail to reject H₀: No evidence of association

Expected frequencies:
[[36.5 30.5 33. ]
 [36.5 30.5 33. ]]
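To show where these numbers come from, the sketch below recomputes the expected frequencies and the statistic by hand from the row and column totals. The observed counts are copied from the Gender × Product table above; the names `chi2_manual` and `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the Gender x Product table above
observed = np.array([[32, 37, 31],
                     [41, 24, 35]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
n = observed.sum()

# Under independence: expected[i, j] = row_total[i] * col_total[j] / n
expected = row_totals * col_totals / n

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value is the upper-tail probability of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, dof)

print(f'chi2 = {chi2_manual:.4f}, dof = {dof}, p = {p_manual:.4f}')
# chi2 = 4.1225, dof = 2, p = 0.1273
```

`stats.chi2_contingency` applies Yates' continuity correction only when the table has one degree of freedom (2 × 2), so for this 2 × 3 table the manual result matches it directly.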


## Real Example: Customer Survey Analysis

**Data**: Age group × Product satisfaction
**Task**: Test for association

In [6]:
# Survey data
print('Customer Satisfaction Survey\n')

# Contingency table
data = np.array([
    [25, 35, 15],  # 18-30: Dissatisfied, Neutral, Satisfied
    [30, 50, 40],  # 31-50
    [15, 25, 45]   # 51+
])

age_groups = ['18-30', '31-50', '51+']
satisfaction = ['Dissatisfied', 'Neutral', 'Satisfied']

print('Observed frequencies:')
df_survey = pd.DataFrame(data, index=age_groups, columns=satisfaction)
print(df_survey)
print()

# Row percentages
print('Row percentages:')
row_pct = df_survey.div(df_survey.sum(axis=1), axis=0) * 100
print(row_pct.round(1))
print()

# Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(data)

print('Statistical Test:')
print(f'  Chi-square: {chi2:.3f}')
print(f'  p-value: {p_value:.4f}')
print(f'  df: {dof}')
if p_value < 0.05:
    print('\n  Significant association between age and satisfaction!')
else:
    print('\n  No significant association detected')

Customer Satisfaction Survey

Observed frequencies:
       Dissatisfied  Neutral  Satisfied
18-30            25       35         15
31-50            30       50         40
51+              15       25         45

Row percentages:
       Dissatisfied  Neutral  Satisfied
18-30          33.3     46.7       20.0
31-50          25.0     41.7       33.3
51+            17.6     29.4       52.9

Statistical Test:
  Chi-square: 19.683
  p-value: 0.0006
  df: 4

  Significant association between age and satisfaction!
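A significant test does not say which cells drive the association. One common follow-up, sketched here as an optional addition rather than part of the original analysis, is to look at Pearson residuals, (observed − expected) / √expected, whose squares sum to the chi-square statistic. The name `resid_df` is illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Survey counts copied from the table above
observed = np.array([
    [25, 35, 15],  # 18-30
    [30, 50, 40],  # 31-50
    [15, 25, 45],  # 51+
])
age_groups = ['18-30', '31-50', '51+']
satisfaction = ['Dissatisfied', 'Neutral', 'Satisfied']

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Pearson residuals: positive means more observations than independence predicts
residuals = (observed - expected) / np.sqrt(expected)
resid_df = pd.DataFrame(residuals, index=age_groups, columns=satisfaction)

print(resid_df.round(2))
```

Cells with large absolute residuals contribute most to the statistic; here the largest positive residual is in the 51+ / Satisfied cell, consistent with the row percentages.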


## Cramer's V (Effect Size)

**Measure**: Strength of association, from 0 (none) to 1 (perfect)
**Interpretation** (rule-of-thumb thresholds):
- 0.1: Weak
- 0.3: Moderate
- 0.5: Strong

In [7]:
# Cramer's V
n_total = data.sum()
min_dim = min(data.shape[0], data.shape[1]) - 1
cramers_v = np.sqrt(chi2 / (n_total * min_dim))

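print('Effect Size:\n')
print(f"  Cramer's V: {cramers_v:.3f}")
# Labels follow the thresholds listed above (0.1 weak, 0.3 moderate, 0.5 strong)
if cramers_v < 0.1:
    print('  Negligible association')
elif cramers_v < 0.3:
    print('  Weak association')
elif cramers_v < 0.5:
    print('  Moderate association')
else:
    print('  Strong association')

Effect Size:

  Cramer's V: 0.187
  Weak association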
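The calculation can be wrapped in a small reusable function. `cramers_v` below is a hypothetical helper written for this notebook, not a library function; it passes `correction=False` (an assumption of this sketch) so that 2 × 2 tables use the uncorrected statistic, and it applies no bias correction.

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """Cramer's V for a 2-D table of counts (hypothetical helper, no bias correction)."""
    table = np.asarray(table)
    # Uncorrected Pearson chi-square; correction only matters for 2 x 2 tables
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    min_dim = min(table.shape) - 1   # smaller of (rows - 1, cols - 1)
    return np.sqrt(chi2 / (n * min_dim))

# Survey counts copied from the table above
survey = np.array([[25, 35, 15],
                   [30, 50, 40],
                   [15, 25, 45]])
print(f"Cramer's V: {cramers_v(survey):.3f}")
# Cramer's V: 0.187
```

Dividing by `min_dim` is what keeps V between 0 and 1 regardless of table shape: a table with all counts on the diagonal reaches exactly 1.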


## Summary

### Frequency Tables:
```python
values, counts = np.unique(data, return_counts=True)
freq_table = pd.Series(data).value_counts()
```

### Binning:
```python
# Fixed bins
bins = [0, 25, 50, 75, 100]
bin_indices = np.digitize(data, bins)

# Histogram
counts, edges = np.histogram(data, bins=10)
```

### Contingency Tables:
```python
# Create table
contingency = pd.crosstab(var1, var2)

# Chi-square test
chi2, p, dof, expected = stats.chi2_contingency(table)

# Cramer's V (n = total count, min_dim = min(rows, cols) - 1)
V = np.sqrt(chi2 / (n * min_dim))
```

### Applications:
- Categorical data analysis
- Survey analysis
- A/B testing
- Market research
- Quality control