In [None]:
import pandas as pd
import os
import glob

# How Big is Canada, Really?

Canada is the second largest country in the world by area, but most of it is virtually uninhabited. In 2016, 66% of the Canadian population lived within 100km of the US border [source](https://www150.statcan.gc.ca/n1/daily-quotidien/170208/dq170208a-eng.htm).
I've been curious for a long while now: how big is Canada really, if we only count the area that passes a certain population density?

A naive approach would be to go to every country's own statistics or census website and wrestle with their chosen data access and formatting patterns.
A slightly smarter approach is to let someone else do it for us, and use their results!
The [Socioeconomic Data and Applications Center (SEDAC)](https://sedac.ciesin.columbia.edu/data/collection/gpw-v4) has done this since 1995.

I downloaded their Population Density dataset v4.11, which contains ASCII and TIFF files of pixel-level data (at resolutions as fine as 30 arc-seconds).
Unfortunately, I couldn't think of a convenient way of figuring out which pixels belong to each country.
A much easier approach is to use the Administrative Unit Center Points with Population Estimates dataset, which gives the population and area of each administrative unit in a given country, making the density easy as pie to compute.
I'm not sure how much resolution is lost by solving it this way, since a hypothetically gigantic admin unit could have everyone concentrated in a single square kilometer, and we'd never know.
Something to look out for when we start wrangling the data.
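To make that worry concrete, here is a toy illustration. The two units and all of their numbers are made up for this example (they are not from the SEDAC data), and the threshold is arbitrary. It only shows how a unit-level average can hide a dense pocket.

```python
import pandas as pd

# Hypothetical numbers for illustration only; not taken from the dataset.
# "huge_unit": 100,000 people who (we imagine) all live in one 1 km^2 town.
# "small_unit": the same population spread over a small area.
units = pd.DataFrame({
    "unit": ["huge_unit", "small_unit"],
    "area_km2": [500_000.0, 100.0],
    "population": [100_000, 100_000],
})

# The unit-level average is the only density the admin-unit data can give us.
units["density"] = units["population"] / units["area_km2"]

threshold = 10.0  # people per km^2, an arbitrary cut-off for this example
dense_area = units.loc[units["density"] >= threshold, "area_km2"].sum()

print(units)
print(f"Area counted as populated: {dense_area} km^2")
```

The huge unit averages out to a fraction of a person per square kilometer, so it is dropped entirely, and the imagined town inside it never shows up in the total.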

## Step 1: Download Data

I went over to the link above (or [here](https://sedac.ciesin.columbia.edu/data/collection/gpw-v4), same link), created an account, selected Global/Regional as the Geography and Comma Separated Value as the file format, and then had to make a decision: either

1. Tick the Global box, which gives a single CSV for the whole world minus the US, plus four separate CSVs for the US, or
2. Tick the rest of the boxes, which splits the rest-of-the-world CSV into one file per continent (again with four separate CSVs for the US)

I figure this hinges on how much RAM these files would occupy, and whether I wanted to parallelize processing.

The single global CSV is 3.1GB, while the CSV files for subsections of the US are up to 2.7GB.
The files for continents are all under 1.5GB. According to [Jeff](https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas), it takes about double the size of the file in RAM to open it up; since I won't be doing much more than removing unnecessary columns, it's relatively affordable.
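As a back-of-the-envelope check, the rule of thumb can be applied to the file sizes quoted above. This is only a sketch: `estimate_ram_gb` is a hypothetical helper, and the factor of 2 is the rough figure from the linked answer, not a measurement.

```python
def estimate_ram_gb(file_size_gb, overhead_factor=2.0):
    """Rough RAM needed to load a CSV: file size times a rule-of-thumb factor."""
    return file_size_gb * overhead_factor

# File sizes in GB as quoted in the text (upper bounds for the US and continents)
file_sizes_gb = {"global": 3.1, "largest_us": 2.7, "largest_continent": 1.5}

estimates = {name: estimate_ram_gb(size) for name, size in file_sizes_gb.items()}
for name, ram in estimates.items():
    print(f"{name}: roughly {ram:.1f} GB of RAM")
```

Once a file is actually loaded, `df.memory_usage(deep=True).sum()` reports the real footprint, which is a better guide than the estimate.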

Since I'm lazy and have my fingers crossed this'll run relatively quickly given how simple what I'm doing will be, I won't bother parallelizing either.

So Global download only it is! I also downloaded the documentation, since it describes the column titles used below.

## Step 2: Load, Trim, Permute, and Save

The next step is pretty straightforward: load each CSV, drop all columns but the country, density, and area (I'll keep population too for futzing around later), and save the result to a single CSV for (probably) faster loading if I ever want to open these again.

In [None]:
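curr_dir = os.getcwd()
files = glob.glob(curr_dir + '/data/unprocessed/*.csv')

# Columns to keep: country, land area, population, and density
keep_cols = ['COUNTRYNM', 'LAND_A_KM', 'UN_2020_E', 'UN_2020_DS']

# Parse only the columns we need (saves RAM), and concatenate once at the
# end instead of re-copying the growing frame on every loop iteration
frames = [pd.read_csv(file, usecols=keep_cols)[keep_cols] for file in files]
combined_csv = pd.concat(frames, ignore_index=True)

combined_csv.to_csv(curr_dir + '/data/processed/data.csv', index=False) # creative naming, I know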
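If RAM ever did become a problem, one option (not what I did here) would be to read each file in chunks. The sketch below uses a tiny in-memory stand-in for one of the downloaded files; `EXTRA_COL`, the sample rows, and the chunk size are all made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for a downloaded CSV; EXTRA_COL and the rows are invented
sample_csv = io.StringIO(
    "COUNTRYNM,EXTRA_COL,LAND_A_KM,UN_2020_E,UN_2020_DS\n"
    "Canada,x,10.0,50.0,5.0\n"
    "Canada,y,20.0,40.0,2.0\n"
    "Mexico,z,5.0,100.0,20.0\n"
)
keep_cols = ["COUNTRYNM", "LAND_A_KM", "UN_2020_E", "UN_2020_DS"]

# With chunksize set, read_csv returns an iterator of DataFrames, so only
# one small piece of the file is being parsed at any moment
reader = pd.read_csv(sample_csv, usecols=keep_cols, chunksize=2)
trimmed = pd.concat([chunk for chunk in reader], ignore_index=True)

print(trimmed)
```

On the real files a chunk size in the hundreds of thousands of rows would be more sensible; 2 is used here only so the toy input splits into more than one chunk.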

That trims our CSV file down to a much more manageable 700MB.
Let's load it back up and make sure everything is in the right ballpark.

In [None]:
processed_dir = curr_dir + '/data/processed/data.csv'
world = pd.read_csv(processed_dir)

canada = world.loc[world['COUNTRYNM'] == 'Canada']
canada[['LAND_A_KM', 'UN_2020_E']].sum()

Population of 37 million and land area of 9 million km²? Seems about right.
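Another quick check is whether the density column agrees with population divided by area. The sketch below runs on made-up rows that reuse the column names; I'm assuming here that `UN_2020_DS` is simply `UN_2020_E / LAND_A_KM`, which the documentation should confirm or refute.

```python
import pandas as pd

# Made-up rows using the same column names as the trimmed CSV
toy = pd.DataFrame({
    "COUNTRYNM": ["Canada", "Canada", "Mexico"],
    "LAND_A_KM": [10.0, 20.0, 5.0],
    "UN_2020_E": [50.0, 40.0, 100.0],
    "UN_2020_DS": [5.0, 2.0, 20.0],
})

# Per-country totals and the overall density they imply
totals = toy.groupby("COUNTRYNM")[["LAND_A_KM", "UN_2020_E"]].sum()
totals["density"] = totals["UN_2020_E"] / totals["LAND_A_KM"]

# Assumption: the density column should equal population / area per unit
recomputed = toy["UN_2020_E"] / toy["LAND_A_KM"]
max_gap = (recomputed - toy["UN_2020_DS"]).abs().max()

print(totals)
print(f"Largest gap between recomputed and reported density: {max_gap}")
```

On the real data, units with zero land area would produce infinities or NaNs in `recomputed`, so those would need to be filtered out first.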

## Step 3: How Dense is Dense Enough?

What density is reasonable to consider an administrative unit sufficiently populated to garner interest?
There are a few approaches we can take:

- Trim all units with a density below the global average
- Trim all units with a density below the average density in North American farmland