# 05. Statistics and Data Quality

OK Marina, I bet you're tired. You know what Daniel likes when he's tired? Maths! Pandas is made for statistics, so let's use it for its actual purpose!

Statistics isn't just about numbers - it's about understanding your data. How many things are missing? What's the distribution? What patterns can we find? 

Let's load up the table we built in the last notebook and start poking around:


In [3]:
import pandas as pd

# Load the table we built in the last notebook
df = pd.read_csv('../marina_tables/ids_with_radicals.csv',
                 encoding='utf-8')

print(f"Loaded {len(df):,} characters")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
print(df.head())


Loaded 99,016 characters

Columns: ['character', 'components', 'radical_string', 'all_components']

First few rows:
  character components radical_string all_components
0         丁          丁            高土氵              丁
1         成        ⿵戊丁           土氵山广             戊丁
2         丂          丂            口扌艹              丂
3         七          七            刀木木              七
4         三       ⿳一一一         一一一亻忄風            一一一


## First Things First: What's Missing?

Before we do anything fancy, let's see what's missing and broken. Yes, I know, there is obviously something missing and profoundly broken in your life if you have turned to Japanology, but I'm not talking about you, silly! I mean the tables we just spent all this time compiling. Let's check:

In [4]:
# Check for missing values in each column
print("Missing values per column:")
print(df.isna().sum())

print(f"\nTotal rows: {len(df):,}")
print(f"Rows with missing data: {df.isna().any(axis=1).sum():,}")
print(f"Rows with complete data: {df.notna().all(axis=1).sum():,}")

# Check for empty strings (which are different from NaN!)
print(f"\nEmpty radical_string: {(df['radical_string'] == '').sum():,}")
print(f"Empty all_components: {(df['all_components'] == '').sum():,}")


Missing values per column:
character             0
components            0
radical_string    11567
all_components        0
dtype: int64

Total rows: 99,016
Rows with missing data: 11,567
Rows with complete data: 87,449

Empty radical_string: 0
Empty all_components: 0
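
Why do we get NaN and zero empty strings? Here is a minimal sketch of what is probably going on, using a tiny made-up table (`toy` and its contents are invented for this example, not taken from our real data). It saves a table containing an empty string to CSV and reads it back, once with the default settings and once with `keep_default_na=False`:

```python
import io
import pandas as pd

# Toy table, invented for illustration: one empty string, no NaN
toy = pd.DataFrame({'character': ['丁', '成'],
                    'radical_string': ['', '戈']})

# to_csv() with no path returns the CSV as a string
csv_text = toy.to_csv(index=False)

# Default read: empty fields come back as NaN
reloaded = pd.read_csv(io.StringIO(csv_text))
empty_after = (reloaded['radical_string'] == '').sum()
nan_after = reloaded['radical_string'].isna().sum()
print(f"Default read: {empty_after} empty strings, {nan_after} NaN")

# keep_default_na=False: empty fields stay empty strings
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
empty_kept = (kept['radical_string'] == '').sum()
print(f"keep_default_na=False: {empty_kept} empty strings")
```

So an empty string and a NaN are easy to confuse after a trip through CSV, which is why it pays to check for both.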


## Your Turn: Explore Missing Data

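OK, so we have some missing `radical_string` values (NaN, not empty strings). Most likely that's `pd.read_csv` at work: it reads empty fields back as NaN by default, so any radical string that was empty when we saved the table comes back as NaN. That's totally normal: some characters might not have any radicals in our list. Or, more likely, we fucked something up.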

**Your mission:** 

1. **Calculate the percentage**: Find out what percentage of characters have a missing `radical_string` (NaN).
   - Hint: You can count NaN values with `df['radical_string'].isna().sum()`
   - To get a percentage, divide by the total number of rows (`len(df)`) and multiply by 100
   - Use f-strings to format it nicely: `f"{percentage:.1f}%"`

2. **Find interesting cases**: Show me a few examples of characters that have missing radical strings but DO have components.
   - Hint: You'll need to filter for rows where `radical_string` is NaN AND `all_components` is not empty
   - Use `.isna()` to check for NaN, and `!= ''` to check for non-empty strings
   - Use `.head(10)` to show just a few examples

Go ahead, try it in the cell below! The comments will guide you step by step.
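
If the hints feel abstract, here is the same pattern on a toy table first. This is only a sketch: the `toy` table and its `score` and `label` columns are made up for the example. Note the parentheses around each condition when combining them with `&`:

```python
import pandas as pd

# Toy table, invented for this example
toy = pd.DataFrame({
    'character': ['A', 'B', 'C', 'D'],
    'score': [1.0, None, None, 4.0],   # None becomes NaN
    'label': ['x', 'y', '', 'z'],      # one empty string
})

# Percentage of rows with a missing score
missing = toy['score'].isna().sum()
percentage = missing / len(toy) * 100
print(f"Missing score: {missing} of {len(toy)} ({percentage:.1f}%)")

# Rows where score is NaN AND label is not empty
both = toy[(toy['score'].isna()) & (toy['label'] != '')]
print(both)
```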


In [None]:
# Your code here!
# Step 1: Count how many characters have missing radical_string (NaN)
# Hint: Use .isna() to check for NaN values, then .sum() to count them
# empty_radical_count = ...

# Step 2: Get the total number of characters
# Hint: Use len(df) to get the total number of rows
# total_count = ...

# Step 3: Calculate the percentage
# Hint: Divide empty_radical_count by total_count, then multiply by 100
# percentage_empty = ...

# Step 4: Print the results
# Hint: Use f-strings with :, for thousands separator and :.1f for one decimal place
# print(f"Characters with missing radical_string: {empty_radical_count:,} ({percentage_empty:.1f}%)")
# print(f"Characters with radical_string: {total_count - empty_radical_count:,} ({100 - percentage_empty:.1f}%)")

# Step 5: Find characters with missing radical_string but WITH components
# Hint: Filter where radical_string is NaN (.isna()) AND all_components is not empty (!= '')
# empty_radical_but_has_components = df[
#     (df['radical_string'].isna()) & 
#     (df['all_components'] != '')
# ]

# Step 6: Print how many you found and show examples
# print(f"\nCharacters with missing radical_string but WITH components: {len(empty_radical_but_has_components):,}")
# print("\nExamples:")
# print(empty_radical_but_has_components[['character', 'components', 'radical_string', 'all_components']].head(10))


## Radical String Lengths: How Many Radicals Per Character?

Let's see how many radicals we extracted per character. Some characters might have 1 radical, some might have 5. Let's see the distribution:

**New methods we'll use:**
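- `.str.len()` - Gets the length of each string in a column. Since `radical_string` contains strings, we use `.str` to access string methods, then `.len()` to get the length of each string. Careful: a missing value (NaN) has no length, so `.str.len()` leaves it as NaN instead of turning it into 0.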
- `.value_counts()` - Counts how many times each unique value appears. Super useful for seeing distributions!
- `.sort_index()` - Sorts a Series by its index (the values). We'll use this to see the distribution in order (0, 1, 2, 3...).
- `.nlargest(n, 'column')` - Finds the n rows with the largest values in a column. Perfect for finding "top 10" or "most extreme" cases.

**Reminder:** When selecting multiple columns from a DataFrame, use **double brackets** `[['col1', 'col2']]` to get a DataFrame back. Single brackets `['col']` give you a Series.
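
Before running these on 99,016 rows, here is a small sketch of all four methods on a toy table (the `toy` table and its `word` column are invented for illustration). Watch what happens to the missing value:

```python
import pandas as pd

# Toy table, invented for this example; the last word is missing
toy = pd.DataFrame({'word': ['a', 'bbb', 'cc', 'dd', None]})

# .str.len(): length of each string, the missing value stays NaN
toy['length'] = toy['word'].str.len()
print(toy)

# .value_counts() counts each length (NaN is skipped by default),
# .sort_index() puts the lengths in order
dist = toy['length'].value_counts().sort_index()
print(dist)

# .nlargest(): the 2 rows with the longest words,
# double brackets to pick the columns to show
top2 = toy.nlargest(2, 'length')[['word', 'length']]
print(top2)
```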


In [None]:
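# Add a column for radical_string length
# Missing values (NaN) mean "no radicals found", so fill them with ''
# first: otherwise .str.len() returns NaN and nothing ever counts as 0
df['radical_count'] = df['radical_string'].fillna('').str.len()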

# Count how many characters have each number of radicals
radical_count_distribution = df['radical_count'].value_counts().sort_index()

print("Distribution of radical counts:")
print(radical_count_distribution.head(20))

print(f"\nCharacters with 0 radicals: {(df['radical_count'] == 0).sum():,}")
print(f"Characters with 1 radical: {(df['radical_count'] == 1).sum():,}")
print(f"Characters with 2+ radicals: {(df['radical_count'] >= 2).sum():,}")

# Find characters with the most radicals
print(f"\nCharacters with the most radicals:")
most_radicals = df.nlargest(10, 'radical_count')[['character', 'components', 'radical_string', 'radical_count']]
print(most_radicals)


## Your Turn: Component Counts

Now it's your turn! Can you do the same thing for `all_components`? 

**Your mission:**

1. **Create a column**: Add a column called `component_count` that counts the length of each `all_components` string.
   - Hint: Use the same method as before: `df['all_components'].str.len()`

2. **Show the distribution**: Count how many characters have each number of components.
   - Hint: Use `.value_counts()` on the new column, then `.sort_index()` to see it in order
   - Use `.head(20)` to show the first 20 values

3. **Count specific ranges**: Print how many characters have 0, 1, and 2+ components.
   - Hint: Use boolean conditions like `(df['component_count'] == 0).sum()`
   - Use `>= 2` for "2 or more"

4. **Find extremes**: Find the characters with the most components.
   - Hint: Use `.nlargest(10, 'component_count')` to get the top 10
   - Remember to use **double brackets** `[['col1', 'col2']]` when selecting multiple columns!

You've already seen all these methods in the previous cell - now it's your turn to use them!


In [None]:
# Your code here!
# Step 1: Create component_count column
# Hint: Use .str.len() on the 'all_components' column
# df['component_count'] = ...

# Step 2: Get the distribution of component counts
# Hint: Use .value_counts() then .sort_index()
# component_distribution = ...
# print("Distribution of component counts:")
# print(component_distribution.head(20))

# Step 3: Count characters with 0, 1, and 2+ components
# Hint: Use boolean conditions with .sum() and f-strings with :, for formatting
# print(f"\nCharacters with 0 components: ...")
# print(f"Characters with 1 component: ...")
# print(f"Characters with 2+ components: ...")

# Step 4: Find characters with the most components
# Hint: Use .nlargest(10, 'component_count') and remember double brackets for multiple columns!
# print(f"\nCharacters with the most components:")
# most_components = df.nlargest(10, 'component_count')[['character', 'components', 'all_components', 'component_count']]
# print(most_components)


## GroupBy: The Magic of Grouping

OK, so `.value_counts()` is cool, but what if you want to do more complex things? That's where `.groupby()` comes in. It lets you split your data into groups and then do operations on each group.

**How it works:**
- `df.groupby('column')` - Groups the DataFrame by unique values in that column
- `.size()` - Counts how many rows are in each group
- `.mean()` - Calculates the average of a column for each group
- You can also use `.sum()`, `.min()`, `.max()`, `.count()`, etc.
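
Here is a minimal sketch of the split-then-summarise idea on a toy table (the `team` and `points` columns are made up for this example):

```python
import pandas as pd

# Toy table, invented for this example
toy = pd.DataFrame({
    'team': ['red', 'red', 'blue', 'blue', 'blue'],
    'points': [10, 20, 1, 2, 3],
})

grouped = toy.groupby('team')

# How many rows in each group?
sizes = grouped.size()
print(sizes)

# Average points per group
means = grouped['points'].mean()
print(means)
```

The result of each call is a Series indexed by the group key, so `sizes['blue']` and `means['red']` pull out single groups.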

Let's group by radical count and see what we can learn:


In [None]:
# Group by radical_count and see how many characters in each group
grouped_by_radical_count = df.groupby('radical_count')

print("Number of characters grouped by radical count:")
print(grouped_by_radical_count.size().head(15))

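# What's the average component count for each radical count?
# (This needs the component_count column from the exercise above,
# so finish that one first or you'll get a KeyError!)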
print("\nAverage component count for each radical count:")
avg_components_by_radicals = grouped_by_radical_count['component_count'].mean()
print(avg_components_by_radicals.head(15))


## Your Turn: GroupBy Practice

Now it's your turn! Try grouping by `component_count` and see what patterns you can find.

**Your mission:**

1. **Group by component_count**: Create a groupby object grouped by `component_count`
   - Hint: Use `df.groupby('component_count')`

2. **Count characters in each group**: See how many characters have each number of components
   - Hint: Use `.size()` on your groupby object
   - Use `.head(15)` to show the first 15 groups

3. **Find average radical count**: For each component count, what's the average number of radicals?
   - Hint: Use `grouped['radical_count'].mean()` to get the average radical count for each component count group
   - This will show you if characters with more components tend to have more radicals

4. **Bonus challenge**: Can you find the minimum and maximum radical counts for each component count group?
   - Hint: Use `.min()` and `.max()` instead of `.mean()`

Go ahead and try it!


In [None]:
# Your code here!
# Step 1: Group by component_count
# Hint: Use df.groupby('component_count')
# grouped_by_component = ...

# Step 2: Count characters in each group
# Hint: Use .size() on the groupby object
# print("Number of characters grouped by component count:")
# print(grouped_by_component.size().head(15))

# Step 3: Find average radical count for each component count
# Hint: Use grouped_by_component['radical_count'].mean()
# print("\nAverage radical count for each component count:")
# avg_radicals_by_components = ...
# print(avg_radicals_by_components.head(15))

# Step 4 (Bonus): Find min and max radical counts
# print("\nMinimum radical count for each component count:")
# min_radicals = ...
# print(min_radicals.head(15))
# 
# print("\nMaximum radical count for each component count:")
# max_radicals = ...
# print(max_radicals.head(15))


## Most Common Radicals

Which radicals appear most often? Let's find out by splitting every `radical_string` value into its individual radicals and counting them:


In [None]:
# Extract all individual radicals from radical_string
# We'll split each radical_string into individual characters and count them
all_radicals = []
for radical_string in df['radical_string']:
    if pd.notna(radical_string) and radical_string != '':
        # Each character in the string is a radical
        for radical in radical_string:
            all_radicals.append(radical)

# Count how many times each radical appears
radical_counts = pd.Series(all_radicals).value_counts()

print("Most common radicals (top 20):")
print(radical_counts.head(20))

print(f"\nTotal unique radicals found: {len(radical_counts)}")
print(f"Total radical occurrences: {len(all_radicals):,}")
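
The nested loop above is easy to read, but it is not the only way. As an alternative sketch (not a replacement for the cell above), Python's `collections.Counter` can do the counting in one line. The `toy` Series here is an invented stand-in for `df['radical_string']`:

```python
from collections import Counter
import pandas as pd

# Toy stand-in for df['radical_string'], invented for this example
toy = pd.Series(['ab', 'a', None, 'abc'])

# .dropna() skips missing values; iterating a string yields its characters
counts = Counter(ch for s in toy.dropna() for ch in s)

# Turn the Counter into a Series, most frequent first
toy_counts = pd.Series(counts).sort_values(ascending=False)
print(toy_counts)

total = sum(counts.values())
print(f"Unique: {len(counts)}, total occurrences: {total}")
```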


## Your Turn: Most Common Components

Now you try! Can you find the most common components (not just radicals) from the `all_components` column? 

Hint: You'll need to do something similar to what we just did - extract all individual characters from `all_components` and count them.


In [None]:
# Extract all individual components from all_components
all_components_list = []
for components_str in df['all_components']:
    if pd.notna(components_str) and components_str != '':
        # Each character in the string is a component
        for component in components_str:
            all_components_list.append(component)

# Count how many times each component appears
component_counts = pd.Series(all_components_list).value_counts()

print("Most common components (top 30):")
print(component_counts.head(30))

print(f"\nTotal unique components found: {len(component_counts)}")
print(f"Total component occurrences: {len(all_components_list):,}")

# Compare: how many of the top components also show up as radicals?
# We use radical_counts from the "Most Common Radicals" cell as a stand-in
# for the radical list: it holds every radical seen in radical_string
print("\nTop 10 components and whether they're radicals:")
top_10_components = component_counts.head(10)
for component, count in top_10_components.items():
    is_radical = component in radical_counts.index
    print(f"  {component}: appears {count:,} times, radical: {is_radical}")


## Data Quality Check: Radical vs Component Count

Let's see if there's a relationship between how many radicals we found and how many total components a character has:


In [None]:
# Group by radical_count and see statistics about component_count
radical_component_stats = df.groupby('radical_count')['component_count'].agg(['mean', 'min', 'max', 'count'])

print("Component count statistics grouped by radical count:")
print(radical_component_stats.head(15))

# Characters where radical_count > component_count (shouldn't happen!)
weird_cases = df[df['radical_count'] > df['component_count']]
print(f"\n⚠️  Characters where radical_count > component_count: {len(weird_cases)}")
if len(weird_cases) > 0:
    print("This shouldn't happen! Let's see:")
    print(weird_cases[['character', 'components', 'radical_string', 'all_components', 'radical_count', 'component_count']].head())
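
The `.agg([...])` call in the cell above is new: it computes several statistics per group in one go and returns a DataFrame with one column per statistic. A small sketch on a toy table (the `team` and `points` columns are made up for this example):

```python
import pandas as pd

# Toy table, invented for this example
toy = pd.DataFrame({
    'team': ['red', 'red', 'blue', 'blue', 'blue'],
    'points': [10, 20, 1, 2, 3],
})

# One row per team, one column per statistic
stats = toy.groupby('team')['points'].agg(['mean', 'min', 'max', 'count'])
print(stats)

# Pick out a single value with .loc[row, column]
print(stats.loc['blue', 'max'])
```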


## Your Turn: Find Patterns

OK Marina, here's a challenge for you. Can you:

1. Find characters where `radical_count` is 0 but `component_count` is greater than 0 (characters with components but no radicals extracted)
2. Group by `component_count` and find the average `radical_count` for each group
3. Find the character(s) with the biggest difference between `component_count` and `radical_count`

Go wild! Explore your data!


In [None]:
# 1. Characters with components but no radicals
no_radicals_but_has_components = df[
    (df['radical_count'] == 0) & 
    (df['component_count'] > 0)
]

print(f"Characters with components but no radicals: {len(no_radicals_but_has_components):,}")
print("\nExamples:")
print(no_radicals_but_has_components[['character', 'components', 'radical_string', 'all_components']].head(10))

# 2. Average radical_count grouped by component_count
print("\n" + "="*60)
print("Average radical_count grouped by component_count:")
avg_radicals_by_components = df.groupby('component_count')['radical_count'].mean()
print(avg_radicals_by_components.head(15))

# 3. Biggest difference between component_count and radical_count
df['radical_component_diff'] = df['component_count'] - df['radical_count']
biggest_diff = df.nlargest(10, 'radical_component_diff')

print("\n" + "="*60)
print("Characters with biggest difference (component_count - radical_count):")
print(biggest_diff[['character', 'components', 'radical_string', 'all_components', 
                    'radical_count', 'component_count', 'radical_component_diff']])


## Summary: What We Learned

Statistics isn't just about numbers - it's about understanding your data. Today we learned:

1. **Missing Data**: How to check for missing values and empty strings
2. **Value Counts**: Using `.value_counts()` to see distributions
3. **GroupBy**: Using `.groupby()` to split data and analyze groups
4. **Data Quality**: Finding weird cases and inconsistencies
5. **Exploration**: Poking around to find patterns

## Key Methods

| What You Want | How to Do It | Example |
|---------------|--------------|---------|
| Count missing | `.isna().sum()` | `df['col'].isna().sum()` |
| Count values | `.value_counts()` | `df['radical_count'].value_counts()` |
| Group data | `.groupby('col')` | `df.groupby('radical_count')` |
| Group size | `.groupby().size()` | `df.groupby('radical_count').size()` |
| Group average | `.groupby().mean()` | `df.groupby('radical_count')['component_count'].mean()` |
| Find largest | `.nlargest(n, 'col')` | `df.nlargest(10, 'radical_count')` |

## What's Next?

You've now learned:
- ✅ Loading and cleaning data
- ✅ Filtering
- ✅ Merging datasets
- ✅ Statistics and data quality

You're a big girl now. You're ready to stop playing with Jupyter and start doing things for real in an IDE like PyCharm.
