In [None]:
# Install required packages
%pip install --upgrade --quiet pandas
%pip install --upgrade --quiet natural_pdf
%pip install --upgrade --quiet tqdm


In [None]:
# Download and extract data files
import os
import urllib.request
import zipfile

url = 'https://github.com/jsoma/2025-ds-dojo/raw/main/docs/02-pandas/02-estats-pandas-data.zip'
print(f'Downloading data from {url}...')
urllib.request.urlretrieve(url, '02-estats-pandas-data.zip')

print('Extracting 02-estats-pandas-data.zip...')
with zipfile.ZipFile('02-estats-pandas-data.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

os.remove('02-estats-pandas-data.zip')
print('✓ Data files extracted!')

Check the data types. What's wrong with the **Total population** columns?

It's because the values are written with thousands separators, like `2,924,000`, so pandas reads them as text instead of numbers. How can we correct that? There are two ways, but the best approach is to fix it while you're reading in the data.
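Here is a rough sketch of both approaches, using a tiny inline CSV as a stand-in for the real file. The `prefecture` column, the row labels, and every value other than `2,924,000` are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for the real CSV: numbers are quoted and use thousands separators.
# The 'prefecture' column and the row values are illustrative assumptions.
csv_text = '''prefecture,A1302_Total population (15-64)[person]
Alpha,"2,924,000"
Beta,"700,000"
'''

col = 'A1302_Total population (15-64)[person]'

# Approach 1 (preferred): tell pandas about the separator while reading.
df = pd.read_csv(io.StringIO(csv_text), thousands=',')
print(df.dtypes)

# Approach 2: read the column as text, then strip the commas and convert.
df_raw = pd.read_csv(io.StringIO(csv_text))
df_raw[col] = df_raw[col].str.replace(',', '').astype(int)
print(df_raw.dtypes)
```

The first approach fixes every numeric column in one go, while the second has to be repeated for each column.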

The long column name of `A1302_Total population (15-64)[person]` is awkward and difficult to type, so let's rename the columns to be:

- `pop_15-64`
- `pop_65-over`
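A minimal sketch of the rename, assuming the 65+ column follows the same naming pattern as the 15-64 one. The `A1303_...` name below is a guess, so check `df.columns` for the exact spelling in your file.

```python
import pandas as pd

# Toy dataframe. The second column name is an assumption modeled on the
# first one; the values are made up.
df = pd.DataFrame({
    'A1302_Total population (15-64)[person]': [3000],
    'A1303_Total population (65 and over)[person]': [1000],
})

# Map each old name to its new, shorter name.
df = df.rename(columns={
    'A1302_Total population (15-64)[person]': 'pop_15-64',
    'A1303_Total population (65 and over)[person]': 'pop_65-over',
})
print(df.columns.tolist())
```

Because the new names contain a hyphen, you'll need bracket notation (`df['pop_15-64']`) rather than dot notation to access them.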

Which prefecture has the largest 65+ population?

Let's only look at **2019 data**. Filter your dataset for 2019, and confirm it has 47 rows.
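One way to filter might look like the sketch below. The `year` and `prefecture` column names are hypothetical, and the rows are toy data.

```python
import pandas as pd

# Toy data; 'year' and 'prefecture' are hypothetical column names.
df = pd.DataFrame({
    'year': [2018, 2019, 2019],
    'prefecture': ['Alpha', 'Alpha', 'Beta'],
    'pop_65-over': [100, 110, 90],
})

# Keep only the 2019 rows. .copy() gives an independent dataframe, which
# avoids warnings when you add columns to it later.
df_2019 = df[df['year'] == 2019].copy()

# With the real data this should print 47, one row per prefecture.
print(len(df_2019))
```

If the year column in your file was read as text, compare against a string instead of a number.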

Now find the prefecture with the largest 65+ population.
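Two possible ways to do this, sketched on toy data (the `prefecture` column name and the values are assumptions):

```python
import pandas as pd

# Toy data; 'prefecture' is a hypothetical column name.
df_2019 = pd.DataFrame({
    'prefecture': ['Alpha', 'Beta', 'Gamma'],
    'pop_65-over': [110, 300, 90],
})

# Option 1: sort from largest to smallest and look at the top rows.
top = df_2019.sort_values('pop_65-over', ascending=False).head(3)
print(top)

# Option 2: pull out just the single row with the largest value.
largest = df_2019.loc[df_2019['pop_65-over'].idxmax()]
print(largest['prefecture'])
```

Sorting is usually more informative, since you also see how close the runners-up are.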

Tokyo is just *big*, though, so we should probably adjust this to be **based on percentage**.

Create a new column called `total` that is the total population.

Create two new columns for the percentage of each prefecture's population that is 15-64 and that is 65 and over. Name them `pct_15-64` and `pct_65-over`.

> It's fine to keep the percentage as 0.0-1.0, but you can make it 0-100 if you'd really like!
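A small sketch of the `total` and percentage columns, assuming the total is simply the sum of the two renamed columns. If your file has other age groups, the real total would need to include them too. The data here is made up.

```python
import pandas as pd

# Toy data with the renamed columns.
df_2019 = pd.DataFrame({
    'prefecture': ['Alpha', 'Beta'],
    'pop_15-64': [600, 300],
    'pop_65-over': [400, 100],
})

# Assumption: total = the two age groups we have, added together.
df_2019['total'] = df_2019['pop_15-64'] + df_2019['pop_65-over']

# Share of each group, kept as a 0.0-1.0 fraction.
df_2019['pct_15-64'] = df_2019['pop_15-64'] / df_2019['total']
df_2019['pct_65-over'] = df_2019['pop_65-over'] / df_2019['total']
print(df_2019)
```

Multiply by 100 if you'd rather have 0-100 values.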

Which prefecture has the highest percentage of 65+ population?

Save the dataframe as `population-with-totals.csv`. **Check the file after you save it to make sure it doesn't have a weird empty column.**
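The "weird empty column" is usually the dataframe's index being written out as an unnamed first column. A sketch of saving without it and checking the result, using a toy dataframe:

```python
import pandas as pd

# Toy dataframe standing in for your real one.
df = pd.DataFrame({'prefecture': ['Alpha', 'Beta'], 'total': [1000, 400]})

# index=False stops pandas from writing the row index as an extra column.
df.to_csv('population-with-totals.csv', index=False)

# Read the file back in to confirm there's no 'Unnamed: 0' column.
check = pd.read_csv('population-with-totals.csv')
print(check.columns.tolist())
```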

## Bonus: Prefecture names in kana

Right now the content is **only in romaji**. It would be nice to have it in kana instead! I've created `prefecture_names.csv` with a mapping between the two.

Try to read it in: there's an issue you'll need to solve before it will work! You probably want to call it `df_pref` so it doesn't overwrite your previous dataframe.

Merge your dataframes together, saving it as `merged`. Then graph again with the `name_jp` column.
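The merge might look something like this sketch. Only `name_jp` comes from the exercise; the `prefecture` and `name_en` column names are hypothetical, and the numbers are toy values, so swap in whatever your two dataframes actually use.

```python
import pandas as pd

# Toy versions of the two dataframes. 'prefecture' and 'name_en' are
# hypothetical column names, and the percentages are made up.
df = pd.DataFrame({
    'prefecture': ['Tokyo', 'Osaka'],
    'pct_65-over': [0.25, 0.5],
})
df_pref = pd.DataFrame({
    'name_en': ['Tokyo', 'Osaka'],
    'name_jp': ['とうきょう', 'おおさか'],
})

# how='left' keeps every row of df even if a name fails to match.
merged = df.merge(df_pref, left_on='prefecture', right_on='name_en', how='left')

# Rows that didn't match will have a missing name_jp, so count them.
print(merged['name_jp'].isna().sum())
```

If that count isn't zero, the spellings in the two files probably differ and need cleaning before the merge.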

The plot might not work because it uses non-Latin characters! You'll need to find a font that works.

Maybe you'll need to list fonts?
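One way to list the fonts matplotlib knows about is sketched below. The font name at the end is only an example and may not be installed on your machine, so pick one from your own list that supports Japanese.

```python
import matplotlib
from matplotlib import font_manager

# Names of every font matplotlib has found on this machine.
font_names = sorted({f.name for f in font_manager.fontManager.ttflist})
print(font_names)

# Example only: replace with a Japanese-capable font from the list above.
matplotlib.rcParams['font.family'] = 'Noto Sans CJK JP'
```

After setting the font, re-run your plot and the labels should render instead of showing empty boxes.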

## Bonus: Interactive visualization

Read in the file again. In one cell, be sure to:

- Take care of the thousands, making sure the population is read in as numbers
- Rename the columns as we did above
- Create the `total` column
- Create the percentage columns

But this time we will NOT filter down to a single year: keep every year in the data.

Confirm you have 2,256 rows.
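As a rough sketch, the whole cell could be written as one chain. The inline CSV stands in for the real file, and the `year` column, the `prefecture` column, and the `A1303_...` column name are assumptions.

```python
import io
import pandas as pd

# Stand-in for the real file. 'year', 'prefecture', and the A1303 column
# name are assumptions; the values are made up.
csv_text = '''year,prefecture,A1302_Total population (15-64)[person],A1303_Total population (65 and over)[person]
2018,Alpha,"3,000","1,000"
2019,Alpha,"2,000","2,000"
'''

df = (
    pd.read_csv(io.StringIO(csv_text), thousands=',')
    .rename(columns={
        'A1302_Total population (15-64)[person]': 'pop_15-64',
        'A1303_Total population (65 and over)[person]': 'pop_65-over',
    })
    .assign(total=lambda d: d['pop_15-64'] + d['pop_65-over'])
)
df['pct_15-64'] = df['pop_15-64'] / df['total']
df['pct_65-over'] = df['pop_65-over'] / df['total']

# With the real file, this is where you'd confirm the 2,256 rows.
print(len(df))
```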

Use ChatGPT to create a graphic to show you each prefecture's population over time. **On Thursday we will look at what the best approach to visualizing this data might be.**