## Introduction

This project is still in DRAFT form! 

This project reads Canadian population data from a table on a Wikipedia page. It shows how to read in the data, clean it up so it can be plotted, and then plot it.

The flow in the program follows the standard IPO programming model of Input-Process-Output. 

We look at two data cases:

- population by province, sorted from low population to high
- population density by province, sorted from low to high

#TODO Update this comment once the project is finished.

## Import libraries

This code snippet imports two Python libraries for working with data. The first line is a comment that starts with `#` and explains the purpose of the next line. The second line uses the `import` keyword to load the pandas library, which is a popular tool for data manipulation and analysis. It also uses the `as` keyword to give pandas a shorter name, `pd`, which is used later to access its features.

The third line is another comment that describes the next line. The fourth line imports `matplotlib.pyplot`, the plotting interface of the Matplotlib data visualization library. It also gives the module a shorter name, `plt`, which is used to create plots and charts.

In [2]:
# Import pandas for data manipulation and analysis
import pandas as pd

# Import matplotlib for data visualization
import matplotlib.pyplot as plt


## Input: Get the Data

The code below downloads the population data of Canada by province and territory from a Wikipedia page and displays it. Ignoring the comment lines, here is what each statement does:

- The first statement assigns a string to a variable named `url`. This string is the web address of the Wikipedia page that has the data we want.
- The second statement uses the `pd.read_html` function from the pandas library to read all the tables from the web page and store them in a list named `tables`. A list is an ordered collection of items that can be accessed by their position.
- The third statement assigns the first item of the list, `tables[0]`, to a variable named `df`. This item is a pandas DataFrame, which is a data structure that holds tabular data in rows and columns. The first item of the list is the table we want because it has the population data of Canada by province and territory.
- The fourth statement uses the `display` function to show the DataFrame `df` as a formatted table. This function is available in notebooks and interactive IPython shells.
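
Picking the table by position works as long as the page layout stays the same. As a minimal sketch of a more defensive alternative, the snippet below searches a list of DataFrames for the one that contains the expected column labels. The helper names `column_labels` and `find_table` are hypothetical, and the two small DataFrames are stand-ins for what `pd.read_html` would return, so nothing is downloaded here.

```python
import pandas as pd


def column_labels(table):
    """Collect every column label of a DataFrame, including all levels of a MultiIndex."""
    labels = set()
    for col in table.columns:
        # MultiIndex columns are tuples; plain columns are single labels
        parts = col if isinstance(col, tuple) else (col,)
        labels.update(str(part) for part in parts)
    return labels


def find_table(tables, required_columns):
    """Return the first DataFrame that contains all of the required column labels."""
    for table in tables:
        if set(required_columns) <= column_labels(table):
            return table
    raise ValueError(f"No table contains all of: {required_columns}")


# Stand-in for the list returned by pd.read_html (assumption: two unrelated tables)
tables = [
    pd.DataFrame({"Year": [2021], "Note": ["census"]}),
    pd.DataFrame({"Land area (km2)": [908699.33], "Population density (per km2)": [15.2]}),
]

found = find_table(tables, ["Land area (km2)", "Population density (per km2)"])
print(found)
```

If the page layout changes and no table matches, this raises an error instead of silently loading the wrong table.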

In [13]:
# Define the URL of the Wikipedia page that contains the population data of Canada by province and territory
url = "https://en.wikipedia.org/wiki/Population_of_Canada_by_province_and_territory"

# Use pandas to read all the tables from the URL and store them in a list
tables = pd.read_html(url)

# Select the first table from the list, which is the one we want
df = tables[0]

# Display the dataframe
display(df)

Unnamed: 0_level_0,Population,Name[1],"Population, 2021 Census","Population, 2021 Census","Growth, 2016–21",Land area (km2),Population density (per km2),House of Commons seats,House of Commons seats,Senate seats,Senate seats
Unnamed: 0_level_1,Population,Name[1],Total,Proportion,"Growth, 2016–21",Land area (km2),Population density (per km2),Total,Proportion,Total,Proportion
0,1,Ontario,14223942,38.45%,5.8%,908699.33,15.2,121,35.8%,24,22.86%
1,2,Quebec,8501833,22.98%,4.1%,1356625.27,6.5,78,23.1%,24,22.86%
2,3,British Columbia,5000879,13.52%,7.6%,922503.01,5.4,42,12.4%,6,5.71%
3,4,Alberta,4262635,11.52%,4.8%,640330.46,6.7,34,10.1%,6,5.71%
4,5,Manitoba,1342153,3.63%,5.8%,552370.99,2.3,14,4.1%,6,5.71%
5,6,Saskatchewan,1132505,3.06%,3.4%,588243.54,2.0,14,4.1%,6,5.71%
6,7,Nova Scotia,969383,2.62%,5.0%,52942.27,18.4,11,3.3%,10,9.52%
7,8,New Brunswick,775610,2.09%,3.8%,71388.81,10.9,10,3.0%,10,9.52%
8,9,Newfoundland and Labrador,510550,1.38%,-1.8%,370514.08,1.4,7,2.1%,6,5.71%
9,10,Prince Edward Island,154331,0.42%,8.0%,5686.03,27.2,4,1.2%,4,3.81%


## Process: Clean the Data

If you look at the output above you'll see that there are two header rows, which makes plotting more difficult. Technically the columns form a "MultiIndex", as seen below:

In [14]:
display(df.columns)  # the default is a MultiIndex, which is messy to work with

# Next step: create a single-level index by merging the MultiIndex levels.
# The merged names are messy too, but we'll rename the columns later.

MultiIndex([(                  'Population',                   'Population'),
            (                     'Name[1]',                      'Name[1]'),
            (     'Population, 2021 Census',                        'Total'),
            (     'Population, 2021 Census',                   'Proportion'),
            (             'Growth, 2016–21',              'Growth, 2016–21'),
            (             'Land area (km2)',              'Land area (km2)'),
            ('Population density (per km2)', 'Population density (per km2)'),
            (      'House of Commons seats',                        'Total'),
            (      'House of Commons seats',                   'Proportion'),
            (                'Senate seats',                        'Total'),
            (                'Senate seats',                   'Proportion')],
           )

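Let's eliminate the MultiIndex by joining the two header levels of each column with an underscore. This isn't ideal, because columns whose two levels are identical end up with doubled names, but it can be cleaned up later by renaming the columns.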

In [15]:
df.columns = df.columns.map('_'.join)  

display(df.head())

Unnamed: 0,Population_Population,Name[1]_Name[1],"Population, 2021 Census_Total","Population, 2021 Census_Proportion","Growth, 2016–21_Growth, 2016–21",Land area (km2)_Land area (km2),Population density (per km2)_Population density (per km2),House of Commons seats_Total,House of Commons seats_Proportion,Senate seats_Total,Senate seats_Proportion
0,1,Ontario,14223942,38.45%,5.8%,908699.33,15.2,121,35.8%,24,22.86%
1,2,Quebec,8501833,22.98%,4.1%,1356625.27,6.5,78,23.1%,24,22.86%
2,3,British Columbia,5000879,13.52%,7.6%,922503.01,5.4,42,12.4%,6,5.71%
3,4,Alberta,4262635,11.52%,4.8%,640330.46,6.7,34,10.1%,6,5.71%
4,5,Manitoba,1342153,3.63%,5.8%,552370.99,2.3,14,4.1%,6,5.71%
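
Joining both levels doubles any name that appears on both levels, such as `Name[1]_Name[1]`. As a minimal sketch of one way to avoid that, the snippet below joins the levels only when they differ. The helper name `flatten` is hypothetical, and the small `demo` DataFrame is a stand-in that reuses three of the column pairs shown above.

```python
import pandas as pd

# Stand-in DataFrame with a two-level column header
columns = pd.MultiIndex.from_tuples([
    ("Name[1]", "Name[1]"),
    ("Population, 2021 Census", "Total"),
    ("Population, 2021 Census", "Proportion"),
])
demo = pd.DataFrame([["Ontario", 14223942, "38.45%"]], columns=columns)


def flatten(levels):
    """Join the two header levels with an underscore unless they are identical."""
    top, bottom = levels
    return top if top == bottom else f"{top}_{bottom}"


# Iterating over a MultiIndex yields one tuple per column
demo.columns = [flatten(col) for col in demo.columns]
flat_names = list(demo.columns)
print(flat_names)
```

The columns still need renaming afterwards, but the rename mapping would have shorter keys.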


Let's now rename the column headers to clean them up.

You'll notice I use underscores ("_") between the words. It's not necessary, but it makes some of the programming we'll be doing later on easier.

In [16]:
df = df.rename(columns={
    'Population_Population': 'Population_Rank',
    'Name[1]_Name[1]': 'Name',
    'Population, 2021 Census_Total': 'Population_2021',
    'Population, 2021 Census_Proportion': 'Population_Proportion',
    'Growth, 2016–21_Growth, 2016–21': 'Growth_2016_21',
    'Land area (km2)_Land area (km2)': 'Land_area_km2',
    'Population density (per km2)_Population density (per km2)': 'Population_density_per_km2',
    'House of Commons seats_Total': 'Commons_house_seats',
    'House of Commons seats_Proportion': 'Commons_seats_Proportion',
    'Senate seats_Total': 'Senate_seats',
    'Senate seats_Proportion': 'Senate_seats_Proportion'
})

display(df)

Unnamed: 0,Population_Rank,Name,Population_2021,Population_Proportion,Growth_2016_21,Land_area_km2,Population_density_per_km2,Commons_house_seats,Commons_seats_Proportion,Senate_seats,Senate_seats_Proportion
0,1,Ontario,14223942,38.45%,5.8%,908699.33,15.2,121,35.8%,24,22.86%
1,2,Quebec,8501833,22.98%,4.1%,1356625.27,6.5,78,23.1%,24,22.86%
2,3,British Columbia,5000879,13.52%,7.6%,922503.01,5.4,42,12.4%,6,5.71%
3,4,Alberta,4262635,11.52%,4.8%,640330.46,6.7,34,10.1%,6,5.71%
4,5,Manitoba,1342153,3.63%,5.8%,552370.99,2.3,14,4.1%,6,5.71%
5,6,Saskatchewan,1132505,3.06%,3.4%,588243.54,2.0,14,4.1%,6,5.71%
6,7,Nova Scotia,969383,2.62%,5.0%,52942.27,18.4,11,3.3%,10,9.52%
7,8,New Brunswick,775610,2.09%,3.8%,71388.81,10.9,10,3.0%,10,9.52%
8,9,Newfoundland and Labrador,510550,1.38%,-1.8%,370514.08,1.4,7,2.1%,6,5.71%
9,10,Prince Edward Island,154331,0.42%,8.0%,5686.03,27.2,4,1.2%,4,3.81%
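
By default, `rename` silently skips any key in the mapping that does not match a column, so a small change on the Wikipedia page (for example a different footnote number in `Name[1]`) could leave a column unrenamed without any warning. A minimal sketch of a safeguard, using the `errors="raise"` option of `DataFrame.rename` on a small stand-in DataFrame:

```python
import pandas as pd

# Stand-in DataFrame with two of the flattened column names
demo = pd.DataFrame({"Name[1]_Name[1]": ["Ontario"], "Senate seats_Total": [24]})

# errors="raise" makes rename fail loudly if a key is not an existing column
renamed = demo.rename(
    columns={"Name[1]_Name[1]": "Name", "Senate seats_Total": "Senate_seats"},
    errors="raise",
)
print(list(renamed.columns))

# A key that does not exist (hypothetical footnote change) now raises a KeyError
try:
    demo.rename(columns={"Name[2]_Name[2]": "Name"}, errors="raise")
    missing_key_detected = False
except KeyError:
    missing_key_detected = True
print(missing_key_detected)
```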


The table on the page also contains a total row named "Canada" (it does not appear in the rows printed above). It is not a province or territory, so we need to delete it!

We can find that row by looking for the item in the Name column (df.Name) in the dataframe (called df) that equals "Canada":

In [22]:
df[df.Name == "Canada"]

Unnamed: 0,Population_Rank,Name,Population_2021,Population_Proportion,Growth_2016_21,Land_area_km2,Population_density_per_km2,Commons_house_seats,Commons_seats_Proportion,Senate_seats,Senate_seats_Proportion
13,Total,Canada,36991981,100%,5.2%,8965588.85,4.2,338,100%,105,100%


The `drop` method removes rows by their index label, so we need the index of that row. Let's modify the above accordingly:

In [23]:
df[df.Name == "Canada"].index

Index([13], dtype='int64')

That will return the index number we need to delete. Putting it all together we get:

In [24]:
# Delete the row with the Name "Canada" and modify the original DataFrame
df.drop(df[df.Name == "Canada"].index, inplace=True)

display(df)

Unnamed: 0,Population_Rank,Name,Population_2021,Population_Proportion,Growth_2016_21,Land_area_km2,Population_density_per_km2,Commons_house_seats,Commons_seats_Proportion,Senate_seats,Senate_seats_Proportion
0,1,Ontario,14223942,38.45%,5.8%,908699.33,15.2,121,35.8%,24,22.86%
1,2,Quebec,8501833,22.98%,4.1%,1356625.27,6.5,78,23.1%,24,22.86%
2,3,British Columbia,5000879,13.52%,7.6%,922503.01,5.4,42,12.4%,6,5.71%
3,4,Alberta,4262635,11.52%,4.8%,640330.46,6.7,34,10.1%,6,5.71%
4,5,Manitoba,1342153,3.63%,5.8%,552370.99,2.3,14,4.1%,6,5.71%
5,6,Saskatchewan,1132505,3.06%,3.4%,588243.54,2.0,14,4.1%,6,5.71%
6,7,Nova Scotia,969383,2.62%,5.0%,52942.27,18.4,11,3.3%,10,9.52%
7,8,New Brunswick,775610,2.09%,3.8%,71388.81,10.9,10,3.0%,10,9.52%
8,9,Newfoundland and Labrador,510550,1.38%,-1.8%,370514.08,1.4,7,2.1%,6,5.71%
9,10,Prince Edward Island,154331,0.42%,8.0%,5686.03,27.2,4,1.2%,4,3.81%
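
Looking up the index and calling `drop` is one option. Another is to keep every row whose name is not "Canada" by filtering with a boolean condition. A minimal sketch on a small stand-in DataFrame (the values are copied from the table above):

```python
import pandas as pd

# Stand-in DataFrame that includes the national total row
demo = pd.DataFrame({
    "Name": ["Ontario", "Quebec", "Canada"],
    "Population_2021": [14223942, 8501833, 36991981],
})

# Keep only the rows where Name is not "Canada", then renumber the index
provinces = demo[demo["Name"] != "Canada"].reset_index(drop=True)
print(provinces)
```

Unlike `drop(..., inplace=True)`, this leaves the original DataFrame untouched and returns a new one.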


In [9]:
#TODO Delete the columns we don't need

# df = df.drop(14) # The last row is the total for Canada

# Sort the data by 2021 population, largest first
df = df.sort_values(by="Population_2021", ascending=False)
df.head()

Unnamed: 0,Population_Rank,Name,Population_2021,Population_Proportion,Growth_2016_21,Land_area_km2,Population_density_per_km2,Commons_house_seats,Commons_seats_Proportion,Senate_seats,Senate_seats_Proportion
13,Total,Canada,36991981,100%,5.2%,8965588.85,4.2,338,100%,105,100%
0,1,Ontario,14223942,38.45%,5.8%,908699.33,15.2,121,35.8%,24,22.86%
1,2,Quebec,8501833,22.98%,4.1%,1356625.27,6.5,78,23.1%,24,22.86%
2,3,British Columbia,5000879,13.52%,7.6%,922503.01,5.4,42,12.4%,6,5.71%
3,4,Alberta,4262635,11.52%,4.8%,640330.46,6.7,34,10.1%,6,5.71%
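
The introduction describes both data cases as sorted from low to high, while the cell above sorts from high to low. A minimal sketch of the ascending sort for each case, on a small stand-in DataFrame with values copied from the table above:

```python
import pandas as pd

# Stand-in DataFrame with four provinces
demo = pd.DataFrame({
    "Name": ["Ontario", "Nova Scotia", "Prince Edward Island", "Quebec"],
    "Population_2021": [14223942, 969383, 154331, 8501833],
    "Population_density_per_km2": [15.2, 18.4, 27.2, 6.5],
})

# sort_values sorts ascending (low to high) by default
by_population = demo.sort_values(by="Population_2021")
by_density = demo.sort_values(by="Population_density_per_km2")

print(list(by_population["Name"]))
print(list(by_density["Name"]))
```

Keeping two separately sorted DataFrames means each chart can use its own order.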


## Output: Plot the Population for Each Province

In [None]:
# Plot the bar chart for population
plt.figure(figsize=(10,6)) # Set the figure size
plt.bar(df["Name"], df["Population_2021"], color="green") # Plot the bars
plt.xticks(rotation=90) # Rotate the x-axis labels
plt.xlabel("Province or territory") # Set the x-axis label
plt.ylabel("Population (2021)") # Set the y-axis label
plt.title("Population of Canada by province and territory") # Set the title
plt.show() # Show the plot

In [None]:
# Plot the bar chart for density
plt.figure(figsize=(10,6)) # Set the figure size
plt.bar(df["Name"], df["Population_density_per_km2"], color="blue") # Plot the bars
plt.xticks(rotation=90) # Rotate the x-axis labels
plt.xlabel("Province or territory") # Set the x-axis label
plt.ylabel("Population density (per km2)") # Set the y-axis label
plt.title("Population density of Canada by province and territory") # Set the title
plt.show() # Show the plot
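
Long province names are easier to read on a horizontal bar chart, which also avoids rotating the labels. The following is a minimal sketch of that variant for the density case, not a replacement for the cells above. It assumes a small stand-in DataFrame with values copied from the table and sorts it from low to high before plotting.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in DataFrame with three provinces
demo = pd.DataFrame({
    "Name": ["Prince Edward Island", "Ontario", "Nova Scotia"],
    "Population_density_per_km2": [27.2, 15.2, 18.4],
})

# Sort from low to high; barh draws the first row at the bottom of the chart
ordered = demo.sort_values(by="Population_density_per_km2")

fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(ordered["Name"], ordered["Population_density_per_km2"], color="blue")
ax.set_xlabel("Population density (per km2)")
ax.set_ylabel("Province or territory")
ax.set_title("Population density of Canada by province and territory")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. In a plain script, call `plt.show()` afterwards.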