# Canada Immigration Trend Study

**What this notebook shows**
- Data wrangling with pandas and exploratory visualization with Seaborn

**Data**
- `Canada.xlsx`, read from the notebook's working directory (sheets `Canada by Citizenship` and `Canada by Citizenship (2)`).

In [None]:
# Project: Canada Immigration Trend Study
# Authors: Manish Mogan & Ritesh Penumatsa
# Context: Personal research log exploring Canadian immigration flows
# Created: September 21, 2025
# Last Updated: September 25, 2025


In [None]:
%pip install openpyxl

In [None]:
# import libraries
import numpy as np
import pandas as pd
import openpyxl

In [None]:
# read data file 'Canada.xlsx' and create a data frame
df = pd.read_excel ('Canada.xlsx', sheet_name = 'Canada by Citizenship (2)')

In [None]:
# get the size of the dataframe (rows, cols)
df.shape

In [None]:
# get the head of the dataframe
df.head()

In [None]:
# get the tail of the dataframe
df.tail()

In [None]:
# get the information on the dataframe
df.info (verbose = False)

In [None]:
# get a description of the dataframe
df.describe()

In [None]:
# get a list of column headers
df.columns

In [None]:
# get a list of indices
df.index

In [None]:
# drop unnecessary columns
# in pandas, rows are axis = 0 and columns are axis = 1
df.drop (['Type', 'Coverage', 'AREA', 'REG', 'DEV', 'DevName'], axis = 1, inplace = True)

In [None]:
# check the deletion of unnecessary columns
df.head()

In [None]:
# rename column names
df.rename (columns = {'OdName':'Country', 'AreaName':'Continent', 'RegName': 'Region'}, inplace = True)

In [None]:
# check if columns were renamed
df.head()

In [None]:
# add a column at the end giving the total number of immigrants for each country
df['Total'] = df.sum (axis = 1, numeric_only = True)

In [None]:
# check if column was added
df.head()

In [None]:
# change the index to be the name of the country
df.set_index ('Country', inplace = True)

In [None]:
# check if the index was changed
df.head()

In [None]:
# get a slice of the data
df.loc ['Costa Rica']

In [None]:
# get data for only certain years
df.loc ['Greece', [1981, 1988, 1994, 1999]]

In [None]:
# convert column names into strings
df.columns = list (map (str, df.columns))

In [None]:
# create a condition
cond = (df['Continent'] == 'Asia')
print (cond)

In [None]:
# create a compound condition using Boolean operators: ~ (not), & (and), | (or)
cond = (df['Continent'] == 'Asia') & (df['Region'] == 'Southern Asia')
# apply the condition to keep only the matching rows
print (df[cond])
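
The three Boolean operators named in the cell above can be tried on a small toy frame. This is only a sketch: the rows below are made up for illustration, and only the column names mirror the notebook's.

```python
import pandas as pd

# Toy frame with illustrative rows; not taken from Canada.xlsx.
toy = pd.DataFrame({
    "Country": ["India", "Japan", "Egypt", "Peru"],
    "Continent": ["Asia", "Asia", "Africa", "Latin America and the Caribbean"],
    "Region": ["Southern Asia", "Eastern Asia", "Northern Africa", "South America"],
})

is_asia = toy["Continent"] == "Asia"
is_southern_asia = toy["Region"] == "Southern Asia"

# & (and): both masks must be True for a row to be kept
asia_and_south = toy[is_asia & is_southern_asia]["Country"].tolist()

# | (or): a row is kept if either mask is True
asia_or_africa = toy[is_asia | (toy["Continent"] == "Africa")]["Country"].tolist()

# ~ (not): invert a mask
not_asia = toy[~is_asia]["Country"].tolist()

print(asia_and_south, asia_or_africa, not_asia)
```

Each comparison needs its own parentheses when combined inline, because `&` and `|` bind more tightly than `==` in Python.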

**Reference Notes**
Some helper links I lean on when wrangling this dataset:
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
* https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html


In [None]:
DATA_PATH = r"Canada.xlsx"

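def load_canada_xlsx(path=DATA_PATH):
    """Load the 'Canada by Citizenship' sheet and return (dataframe, list of year columns)."""
    # skip the first 20 rows of the sheet and the last 2 footer rows
    df = pd.read_excel(path, sheet_name="Canada by Citizenship", skiprows=range(20), skipfooter=2)
    df.drop(columns=["Type", "Coverage"], inplace=True, errors="ignore")
    df.rename(columns={"OdName": "Country", "AreaName": "Continent", "RegName": "Region"}, inplace=True)
    # drop any aggregate "Total" row so it is not counted as a country
    df = df[df["Country"] != "Total"].copy()
    # keep only the year columns that actually exist in the sheet
    years = [y for y in range(1980, 2014) if y in df.columns]
    # coerce non-numeric cells to 0 so the yearly counts are plain integers
    for y in years:
        df[y] = pd.to_numeric(df[y], errors="coerce").fillna(0).astype(int)
    df["Total"] = df[years].sum(axis=1)
    return df, years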

df, YEARS = load_canada_xlsx()
df.head()


**Continental Leaders Snapshot**
Which country tops total immigration within each continent (Africa, Asia, Europe, Latin America and the Caribbean, Northern America, Oceania)?

Continental leaders based on cumulative arrivals (1980–2013):

- Africa — Egypt: 72,745
- Asia — India: 691,904
- Europe — United Kingdom of Great Britain and Northern Ireland: 551,500
- Latin America & the Caribbean — Jamaica: 106,431
- Northern America — United States of America: 241,122
- Oceania — Australia: 23,829


In [None]:
continents_to_check = ["Africa", "Asia", "Europe", "Latin America and the Caribbean", "Northern America", "Oceania"]
rows = []
for cont in continents_to_check:
    sub = df[df["Continent"] == cont]
    if len(sub) == 0:
        rows.append((cont, None, 0))
    else:
        idx = sub["Total"].idxmax()
        row = sub.loc[idx]
        rows.append((cont, row["Country"], int(row["Total"])))
q6a_df = pd.DataFrame(rows, columns=["Continent", "Country", "Total Immigration (1980-2013)"])
q6a_df
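
The same "leader per continent" lookup can probably be written without an explicit loop by combining `groupby` with `idxmax`. The sketch below uses a made-up toy frame (countries `A` to `D` and their totals are hypothetical), so it illustrates the technique rather than reproducing the figures above.

```python
import pandas as pd

# Hypothetical totals for illustration only.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Continent": ["Asia", "Asia", "Europe", "Europe"],
    "Total": [10, 30, 25, 5],
})

# idxmax returns, for each continent, the row label of the largest Total
leader_idx = toy.groupby("Continent")["Total"].idxmax()

# use those labels to pull the full rows for each leader
leaders = toy.loc[leader_idx, ["Continent", "Country", "Total"]].reset_index(drop=True)
print(leaders)
```

Unlike the loop, this version only reports continents that actually appear in the data, so an empty continent would be silently absent instead of showing up with a total of 0.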

**Contextual Notes for the Leaders Above**
A quick narrative on why each country dominated within its continent during 1980–2013.

Egypt (Africa) → job opportunities, education, shifting politics, the 2011 Egyptian Revolution; family sponsorship stayed strong.

India (Asia) → tech boom brought jobs for engineers, doctors, students; numbers rose fast from the late ’80s onward; family links kept adding more every year.

UK (Europe) → Commonwealth ties and the shared use of English made it easier; steady flow of workers, students, and family migration all through the period.

Jamaica (Latin America and the Caribbean) → family chain migration rooted in the 1960s; steady inflow for reunification and work; numbers stayed consistent across decades.

USA (Northern America) → work opportunities, education, and family; the American Dream; NAFTA in the ’90s boosted cross-border jobs; constant two-way traffic.

Australia (Oceania) → English-speaking skilled workers and students; school exchanges and family ties; not large numbers but reliable every year.

**Volatility Scan by Country**
Measure the range (max − min) of yearly arrivals for every country to see who experienced the wildest swings.

In [None]:
ranges = (df[YEARS].max(axis=1) - df[YEARS].min(axis=1)).astype(int)
q7a_df = pd.DataFrame({"Country": df["Country"].values, "Range (1980-2013)": ranges.values}).sort_values("Country").reset_index(drop=True)
q7a_df
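
As a quick check of the range calculation (max − min per row), here is a worked toy example. The two countries and their yearly counts are invented, and `years` stands in for the notebook's `YEARS`.

```python
import pandas as pd

years = [1980, 1981, 1982]  # stand-in for YEARS

# Invented yearly counts for two hypothetical countries.
toy = pd.DataFrame({
    "Country": ["A", "B"],
    1980: [5, 100],
    1981: [9, 100],
    1982: [2, 140],
})

# axis=1 computes the max and min across the year columns of each row
spread = toy[years].max(axis=1) - toy[years].min(axis=1)
print(dict(zip(toy["Country"], spread)))
```

Country A swings between 2 and 9, country B between 100 and 140, so the range depends only on the two extreme years and ignores everything in between.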

**Boxplot Diagnostic**
Visualize every country (134 boxplots) to compare distributions from 1980–2013. Melting the data into long format (one row per country and year) makes it easy to step through the countries in batches of 30.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


df_melted = df.melt(
    id_vars=["Country"],
    value_vars=YEARS,
    var_name="Year",
    value_name="Immigration"
)

countries_sorted = df["Country"].sort_values().unique()

for i in range(0, len(countries_sorted), 30):
    subset = countries_sorted[i:i+30]
    plot_data = df_melted[df_melted["Country"].isin(subset)]

    plt.figure(figsize=(12, 8))
    ax = sns.boxplot(
        data=plot_data,
        x="Immigration", y="Country",
        showfliers=False
    )

    ax.set_xlim(left=0)

    plt.title(f"Immigration Per Country (1980–2013): Countries {i+1}–{i+len(subset)}")
    plt.tight_layout()
    plt.show()
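
To see what the `melt` step does before plotting, it may help to run it on a tiny wide table. The values below are invented; the column names follow the cell above.

```python
import pandas as pd

# Wide format: one row per country, one column per year (invented values).
wide = pd.DataFrame({"Country": ["A", "B"], 1980: [1, 3], 1981: [2, 4]})

# Long format: one row per (country, year) pair
long = wide.melt(
    id_vars=["Country"],
    value_vars=[1980, 1981],
    var_name="Year",
    value_name="Immigration",
)
print(long)
```

Two countries and two years give four rows, which is the shape Seaborn expects when a single column (`Immigration`) is plotted against a grouping column (`Country`).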


**Caribbean Decade Totals**
Compare immigration sums for each Caribbean nation across the Eighties (1980–1989) and Nineties (1990–1999).

In [None]:
carib = df[df["Region"] == "Caribbean"].copy()
eighties_years = list(range(1980, 1990))
nineties_years = list(range(1990, 2000))

carib["Eighties"] = carib[eighties_years].sum(axis=1).astype(int)
carib["Nineties"] = carib[nineties_years].sum(axis=1).astype(int)
q8_df = carib[["Country", "Eighties", "Nineties"]].sort_values("Country").reset_index(drop=True)
q8_df


**Nordic Spread Tracker**
Denmark vs. Norway vs. Sweden — capture the yearly gap between the highest and lowest inflow.

In [None]:
scand = df[df["Country"].isin(["Denmark", "Norway", "Sweden"])].set_index("Country")
rows = []
for y in YEARS:
    vals = scand[y]
    rows.append((y, int(vals.max() - vals.min())))
q9_df = pd.DataFrame(rows, columns=["Year", "Immigration Range (1980-2013)"])
q9_df

**African Extremes by Year**
For every year, flag the African country with the highest arrivals and the one with the lowest to understand dispersion across the continent.

In [None]:
africa = df[df["Continent"] == "Africa"]
rows = []
for y in YEARS:
    col = africa[["Country", y]].copy()
    max_idx = col[y].idxmax()
    min_idx = col[y].idxmin()
    rows.append((y, africa.loc[max_idx, "Country"], int(africa.loc[max_idx, y]),
                    africa.loc[min_idx, "Country"], int(africa.loc[min_idx, y])))
q10_df = pd.DataFrame(rows, columns=["Year", "Max Country", "Max Immigration", "Min Country", "Min Immigration"])
q10_df
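
The per-year loop above could also be expressed with column-wise `idxmax` and `idxmin` once the country name is the index. The following is a sketch on invented numbers, not a replacement for the cell above; `years` again stands in for `YEARS`.

```python
import pandas as pd

years = [1980, 1981]  # stand-in for YEARS

# Invented counts for three hypothetical countries, indexed by name.
toy = pd.DataFrame(
    {"Country": ["A", "B", "C"], 1980: [5, 1, 9], 1981: [0, 7, 3]}
).set_index("Country")

# With the default axis=0, each call works down a year column
# and returns either the index label (idxmax/idxmin) or the value (max/min).
summary = pd.DataFrame({
    "Max Country": toy[years].idxmax(),
    "Max Immigration": toy[years].max(),
    "Min Country": toy[years].idxmin(),
    "Min Immigration": toy[years].min(),
})
summary.index.name = "Year"
print(summary)
```

When several countries tie for the minimum in a year (for example, several zeros), `idxmin` reports only the first one it encounters, so the "Min Country" column names one of possibly many countries.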