# 06. Basic Statistics and GroupBy

In this notebook, we'll learn how to perform basic statistical operations and group data, which is useful for analyzing character distributions and variant relationships.


In [None]:
import pandas as pd


## Basic Aggregation

Simple statistical operations on columns:


In [None]:
# Load variant data
df_variants = pd.read_csv('../cjkvi-variants/cjkvi-simplified.txt',
                          sep=',',
                          comment='#',
                          names=['variant', 'type', 'target'],
                          encoding='utf-8')

# Count total rows (len() includes rows with missing values)
print(f"Total variant entries: {len(df_variants)}")

# Count unique values
print(f"Unique variants: {df_variants['variant'].nunique()}")
print(f"Unique targets: {df_variants['target'].nunique()}")

# Count non-null values (.count() skips missing values)
print(f"Non-null variants: {df_variants['variant'].count()}")
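
The difference between row counts, non-null counts, and unique counts is easiest to see on a tiny frame with one missing value. The sketch below uses made-up rows (the `toy` frame is hypothetical, not taken from the cjkvi files):

```python
import pandas as pd

# Hypothetical toy data; the None simulates a missing entry
toy = pd.DataFrame({
    'variant': ['国', '国', '学', None],
    'target':  ['國', '囯', '學', '體'],
})

total_rows = len(toy)               # every row, including the one with a missing variant
non_null = toy['variant'].count()   # skips missing values
unique = toy['variant'].nunique()   # distinct non-missing values

print(total_rows, non_null, unique)
```

If the three numbers differ on real data, that usually points to missing values or repeated entries worth inspecting.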


## Counting by Categories: .value_counts()

One of the most useful operations is counting how many times each value appears:


In [None]:
# Count the entries of each variant type
variant_type_counts = df_variants['type'].value_counts()
print("Variant types and their counts:")
print(variant_type_counts)

# Get percentages
print("\nPercentages:")
print(df_variants['type'].value_counts(normalize=True) * 100)
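
By default `.value_counts()` leaves missing values out, which can make the percentages look larger than expected. A minimal sketch on made-up labels (the `types` series is hypothetical) shows the `dropna` and `normalize` options side by side:

```python
import pandas as pd

# Hypothetical type labels for illustration only
types = pd.Series(['simplified', 'simplified', 'traditional', None])

counts = types.value_counts()                      # missing values excluded by default
counts_with_na = types.value_counts(dropna=False)  # missing values get their own row
shares = types.value_counts(normalize=True)        # fractions of non-missing values

print(counts)
print(counts_with_na)
print(shares.round(3))
```

Note that `normalize=True` divides by the number of non-missing values unless `dropna=False` is also passed.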


## GroupBy Operations

GroupBy lets you split data into groups and perform operations on each group:


In [None]:
# Group by variant type and count entries in each group
grouped_by_type = df_variants.groupby('type')
print("Groups created:")
print(grouped_by_type.groups.keys())

# Count size of each group
print("\nSize of each group:")
print(grouped_by_type.size())
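
To see what "splitting into groups" means in practice, you can loop over a GroupBy object: each step yields the group key and the sub-DataFrame for that key. The rows below are made up for illustration (the `toy` frame and its type labels are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not taken from the cjkvi files
toy = pd.DataFrame({
    'variant': ['国', '学', '國', '體'],
    'type':    ['simplified', 'simplified', 'traditional', 'traditional'],
    'target':  ['國', '學', '国', '体'],
})

# Iterating yields (key, sub-DataFrame) pairs, one per group
group_sizes = {}
for type_name, group in toy.groupby('type'):
    group_sizes[type_name] = len(group)
    print(type_name, list(group['variant']))

# Pull out a single group by its key
simplified_rows = toy.groupby('type').get_group('simplified')
print(simplified_rows)
```

Looping like this is mainly useful for inspection; for actual statistics, the built-in aggregations are usually shorter and faster.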


### GroupBy with Aggregation

After grouping, select a column and apply an aggregation; the result has one row per group:


In [None]:
# Count unique targets for each variant type
unique_targets_per_type = df_variants.groupby('type')['target'].nunique()
print("Unique targets per variant type:")
print(unique_targets_per_type)

# Get first variant for each type (example of aggregation)
first_variant_per_type = df_variants.groupby('type')['variant'].first()
print("\nFirst variant in each type:")
print(first_variant_per_type)
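
Several aggregations can also be computed in one call with `.agg()` and named aggregation (available in pandas 0.25 and later). This is a sketch on made-up rows; the `toy` frame and the output column names `entries`, `unique_targets`, and `first_variant` are chosen here for illustration:

```python
import pandas as pd

# Hypothetical toy data, not taken from the cjkvi files
toy = pd.DataFrame({
    'variant': ['发', '发', '国', '發', '髮'],
    'type':    ['simplified', 'simplified', 'simplified',
                'traditional', 'traditional'],
    'target':  ['發', '髮', '國', '发', '发'],
})

# Each keyword is output_column=(input_column, aggregation)
summary = toy.groupby('type').agg(
    entries=('variant', 'count'),          # non-null entries per type
    unique_targets=('target', 'nunique'),  # distinct targets per type
    first_variant=('variant', 'first'),    # first variant seen per type
)
print(summary)
```

The result is a DataFrame indexed by `type`, which is often easier to read than several separate Series.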


## Practical Exercises with CJK Data

### Exercise 1: Count Characters by Variant Type


In [None]:
# Count simplified vs traditional characters
is_simp_or_trad = df_variants['type'].isin(['cjkvi/simplified', 'cjkvi/traditional'])
simp_trad_counts = df_variants.loc[is_simp_or_trad, 'type'].value_counts()
print("Simplified vs Traditional counts:")
print(simp_trad_counts)


### Exercise 2: Find Most Common Variant Relationships


In [None]:
# Load joyo variants
df_joyo = pd.read_csv('../cjkvi-variants/joyo-variants.txt',
                     sep=',',
                     comment='#',
                     names=['character', 'type', 'variant'],
                     encoding='utf-8')

# Find characters with the most variants
variant_counts = df_joyo.groupby('character').size().sort_values(ascending=False)
print("Characters with most variants (top 10):")
print(variant_counts.head(10))
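
When you group by a single column and only need the counts, `.value_counts()` on that column gives the same numbers, already sorted in descending order. A small sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: one row per (character, variant) pair
toy = pd.DataFrame({
    'character': ['辺', '辺', '辺', '沢', '沢', '亜'],
    'variant':   ['邊', '邉', '边', '澤', '泽', '亞'],
})

by_groupby = toy.groupby('character').size().sort_values(ascending=False)
by_value_counts = toy['character'].value_counts()

print(by_groupby)
print(by_value_counts)
```

`groupby(...).size()` becomes the better choice once you group by more than one column or want other aggregations alongside the count.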


### Exercise 3: Count Characters by IDS Structure Type


In [None]:
# Load IDS data
# Note: Some characters have multiple IDS decompositions (stored in extra columns), so we use usecols to read only the first 3 columns
df_ids = pd.read_csv('../cjkvi-ids-unicode/rawdata/cjkvi-ids/ids.txt',
                     sep='\t',
                     skiprows=2,  # Skip copyright header
                     usecols=[0, 1, 2],  # Only read first 3 columns
                     names=['unicode', 'character', 'ids'],
                     encoding='utf-8',
                     nrows=10000)  # Read only the first 10,000 rows (a larger sample gives better statistics)

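# Extract structure type (first character of IDS; for a character with no
# decomposition this may be the character itself rather than a structure symbol)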
df_ids['structure'] = df_ids['ids'].str[0]

# Count characters by structure type
structure_counts = df_ids['structure'].value_counts()
print("Characters by IDS structure type:")
print(structure_counts)
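
Taking the first character works when the IDS starts with an Ideographic Description Character (IDC) such as ⿰ or ⿱, but an undecomposed character would be counted under its own glyph. One possible refinement, sketched here on made-up rows, is to label anything that does not start with an IDC as `'atomic'`. The `toy_ids` frame, the `IDC_CHARS` helper, and the `'atomic'` label are assumptions for illustration; the set covers the twelve original IDCs (U+2FF0 to U+2FFB), and newer Unicode versions define a few more.

```python
import pandas as pd

# The twelve original Ideographic Description Characters, U+2FF0..U+2FFB
IDC_CHARS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

# Hypothetical toy rows in the same shape as the IDS data
toy_ids = pd.DataFrame({
    'character': ['好', '花', '国', '一'],
    'ids':       ['⿰女子', '⿱艹化', '⿴囗玉', '一'],
})

first_char = toy_ids['ids'].str[0]
# Keep the first character only where it is an IDC; otherwise use 'atomic'
toy_ids['structure'] = first_char.where(first_char.isin(IDC_CHARS), 'atomic')

print(toy_ids['structure'].value_counts())
```

On the real file, it is worth checking a few rows first, since the exact IDS formatting may differ from these toy strings.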


## Summary

| Operation | Method | Example |
|-----------|--------|---------|
| Count rows | `len()` | `len(df)` |
| Count non-null values | `.count()` | `df['col'].count()` |
| Count unique | `.nunique()` | `df['col'].nunique()` |
| Value counts | `.value_counts()` | `df['type'].value_counts()` |
| Group by | `.groupby('col')` | `df.groupby('type')` |
| Group size | `.groupby().size()` | `df.groupby('type').size()` |
| Group count | `.groupby().count()` | `df.groupby('type')['var'].count()` |
| Group unique | `.groupby().nunique()` | `df.groupby('type')['var'].nunique()` |

## What's Next?

In the final notebook, we'll put everything together:
- Complete data processing workflow
- Building a character lookup table from scratch
- Combining filtering, merging, and statistics

## Try It Yourself

1. Count characters by different variant types
2. Find which characters have the most variants
3. Group IDS data by structure type and analyze
4. Experiment with different GroupBy aggregations
