# Canada Immigration Fundamentals

**What this notebook shows**
- End-to-end exploratory analysis (loading, cleaning, EDA)
- Clear visual storytelling and interpretation

**Data**
- `Canada.xlsx`, sheet `Canada by Citizenship (2)`, expected in the working directory; see in-notebook references.

In [None]:
# Project: Canada Immigration Fundamentals
# Authors: Manish Mogan & Ritesh Penumatsa
# Context: Personal reference notebook for initial EDA on CIC landing data
# Created: September 17, 2025
# Last Updated: September 17, 2025


In [None]:
%pip install openpyxl

In [None]:
# import libraries
import numpy as np
import pandas as pd
import openpyxl

In [None]:
# read data file 'Canada.xlsx' and create a data frame
df = pd.read_excel ('Canada.xlsx', sheet_name = 'Canada by Citizenship (2)')

In [None]:
# get the size of the dataframe (rows, cols)
df.shape

In [None]:
# get the head of the dataframe
df.head()

In [None]:
# get the tail of the dataframe
df.tail()

In [None]:
# get the information on the dataframe
df.info (verbose = False)

In [None]:
# get a description of the dataframe
df.describe()

In [None]:
# get a list of column headers
df.columns

In [None]:
# get the row index
df.index

In [None]:
# drop unnecessary columns
# in pandas, axis=0 refers to rows and axis=1 to columns
df.drop (['Type', 'Coverage', 'AREA', 'REG', 'DEV', 'DevName'], axis = 1, inplace = True)

In [None]:
# check the deletion of unnecessary columns
df.head()

In [None]:
# rename column names
df.rename (columns = {'OdName':'Country', 'AreaName':'Continent', 'RegName': 'Region'}, inplace = True)

In [None]:
# check if columns were renamed
df.head()
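
The drop and rename steps above can also be written without `inplace=True`. The following is a minimal sketch on a toy frame: the values and the `Country A`/`Country B` names are made up, and only the column names mirror the notebook.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the notebook.
toy = pd.DataFrame({
    'Type': ['Immigrants', 'Immigrants'],
    'OdName': ['Country A', 'Country B'],
    'AreaName': ['Asia', 'Europe'],
    'RegName': ['Southern Asia', 'Northern Europe'],
    1980: [10, 20],
})

# drop() and rename() return new frames by default, so the calls can be
# chained and the original `toy` is left unchanged.
cleaned = (
    toy.drop(columns=['Type'])
       .rename(columns={'OdName': 'Country',
                        'AreaName': 'Continent',
                        'RegName': 'Region'})
)
print(cleaned.columns.tolist())
```

Because `toy` is untouched, the cell can be re-run safely, whereas an in-place `drop` raises a `KeyError` on the second run once the columns are gone.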

In [None]:
# add a column at the end giving the total number of immigrants for each country
df['Total'] = df.sum (axis = 1, numeric_only = True)

In [None]:
# check if column was added
df.head()
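
To see what `numeric_only=True` does in the row sum, here is a small sketch on a toy frame with made-up values. It also shows an alternative that names the year columns explicitly, which is an assumption about how one might guard against re-running the cell, not something the notebook does.

```python
import pandas as pd

# Toy frame with made-up values: two text columns and two year columns.
toy = pd.DataFrame({
    'Country': ['Country A', 'Country B'],
    'Continent': ['Asia', 'Europe'],
    1980: [10, 20],
    1981: [30, 40],
})

# numeric_only=True skips the text columns, so only 1980 and 1981 are added.
toy['Total'] = toy.sum(axis=1, numeric_only=True)

# Alternative: sum an explicit list of year columns. This stays correct even
# after 'Total' exists, because 'Total' is never part of the selection.
year_cols = [1980, 1981]
explicit_total = toy[year_cols].sum(axis=1)
print(toy)
```

Running the `numeric_only=True` line a second time would fold the existing `Total` column into the new sum, which is why the explicit column list can be the safer choice.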

In [None]:
# change the index to be the name of the country
df.set_index ('Country', inplace = True)

In [None]:
# check if the index was changed
df.head()

In [None]:
# get the row for a single country
df.loc ['Costa Rica']

In [None]:
# get data for only certain years
df.loc ['Greece', [1981, 1988, 1994, 1999]]

In [None]:
# convert column names into strings
df.columns = list (map (str, df.columns))
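
Converting the column labels to strings changes how the year columns must be addressed: the integer lookup used earlier for Greece no longer matches. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame with made-up values and integer year labels.
toy = pd.DataFrame({1981: [5, 7], 1988: [6, 8]},
                   index=['Country A', 'Country B'])

# Integer labels work while the columns are still ints.
before = toy.loc['Country A', [1981, 1988]]

# Same conversion as in the notebook.
toy.columns = list(map(str, toy.columns))

# String labels are now required.
after = toy.loc['Country A', ['1981', '1988']]

# The old integer labels no longer exist, so the lookup raises KeyError.
try:
    toy.loc['Country A', [1981, 1988]]
    int_lookup_works = True
except KeyError:
    int_lookup_works = False
print(int_lookup_works)
```

This is why the cells after this point build the year list with `str(year)`.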

In [None]:
# create a condition
cond = (df['Continent'] == 'Asia')
print (cond)

In [None]:
# filter rows with a compound condition using Boolean operators: ~ (not), & (and), | (or)
southern_asia = df[(df['Continent'] == 'Asia') & (df['Region'] == 'Southern Asia')]
print (southern_asia)
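
The three Boolean operators can be tried side by side on a toy frame. The values and country names below are made up; the point is only how `&`, `|`, and `~` combine masks, and that each comparison needs its own parentheses.

```python
import pandas as pd

# Toy frame with made-up rows.
toy = pd.DataFrame({
    'Continent': ['Asia', 'Asia', 'Europe'],
    'Region': ['Southern Asia', 'Eastern Asia', 'Northern Europe'],
}, index=['Country A', 'Country B', 'Country C'])

# Each comparison yields a Boolean Series (a mask) aligned with the rows.
is_asia = toy['Continent'] == 'Asia'
is_southern = toy['Region'] == 'Southern Asia'

both = toy[is_asia & is_southern]                       # and
either = toy[is_asia | (toy['Continent'] == 'Europe')]  # or
not_asia = toy[~is_asia]                                # not
print(both.index.tolist(), not_asia.index.tolist())
```

Naming the masks separately keeps the difference between a condition (a Boolean Series) and a filtered dataframe visible.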

**Q. 0 (0 points)**
Some useful functions to make your life easier in this assignment:
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
* https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html


**Q. 1 (20 points)**  
Add a row at the bottom of the dataframe that gives the total immigration for each year from 1980 to 2013.

In [None]:
# years -> list of year column labels as strings: '1980', '1981', ..., '2013'
years = [str(year) for year in range(1980, 2014)]

# drop any existing 'Total' row so re-running this cell does not double count
df = df.drop(index='Total', errors='ignore')

# total_per_year -> Series indexed by year, summing immigrants across all countries
total_per_year = df[years].sum()

# grand_total -> sum across all years, giving the overall grand total
grand_total = total_per_year.sum()

# total_row -> yearly totals as a Series; name='Total' becomes the row label
total_row = pd.Series(total_per_year, name='Total')

# mark Continent and Region as 'All' to show the row totals across all continents and regions
total_row['Continent'] = 'All'
total_row['Region'] = 'All'
total_row['Total'] = grand_total

# append the totals row to the dataframe
df = pd.concat([df, total_row.to_frame().T])

# the transposed row has object dtype, so restore numeric dtypes on the count columns
df[years + ['Total']] = df[years + ['Total']].apply(pd.to_numeric)

# show the last few rows for confirmation
df.tail()
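
A Series that mixes strings and numbers has object dtype, and transposing it into a row carries that dtype into every column after `pd.concat`. One alternative, sketched here on a toy frame with made-up values, is to build the totals row as a one-row DataFrame so the count columns stay numeric. The names `toy`, `total_row`, and `with_total` are illustrative only.

```python
import pandas as pd

# Toy frame with made-up values; year labels are strings as in the notebook.
toy = pd.DataFrame(
    {'Continent': ['Asia', 'Europe'], '1980': [10, 20], '1981': [30, 40]},
    index=['Country A', 'Country B'],
)
years = ['1980', '1981']
toy['Total'] = toy[years].sum(axis=1)

# Column sums for the year columns and the Total column.
totals = toy[years + ['Total']].sum()

# One-row DataFrame: each column gets its own dtype, so counts stay integers.
total_row = pd.DataFrame(
    [{'Continent': 'All', **totals.to_dict()}],
    index=['Total'],
)

with_total = pd.concat([toy, total_row])
print(with_total.dtypes)
```

Keeping the count columns numeric means later calls such as `.max()` or `.std()` operate on integer data rather than on Python objects.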

**Q. 2 (20 points)**  
For each year, find the maximum number of immigrants and the country that had the maximum number of immigrants. Create a new dataframe having three columns - year, country, and number of immigrants. For a given year, if there are two or more countries having the same maximum number of immigrants, then choose that country that comes first in alphabetical order.

In [None]:
# drop the "Total" row if it exists (we only want real countries)
df_no_total = df.drop(index='Total', errors='ignore')

# build the list of year labels as before: '1980' ... '2013'
# the year columns in df are labeled with strings, not ints
years = [str(year) for year in range(1980, 2014)]

# empty list to store results
records = []

# loop over each year
for year in years:
    # largest immigrant count for that year (a single value)
    max_val = df_no_total[year].max()

    # countries that reached that maximum (there may be ties)
    countries_with_max = df_no_total[df_no_total[year] == max_val].index.tolist()
    # on a tie, sort alphabetically and take the first country
    country = sorted(countries_with_max)[0]

    # record the year, country name, and immigrant count
    # cast to int to store plain Python ints rather than NumPy scalars
    records.append({'year': int(year), 'country': country, 'number_of_immigrants': int(max_val)})

# turn the list of dicts into a dataframe
max_df = pd.DataFrame(records)

#display
max_df
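
The loop above can also be expressed with `idxmax`, which returns the first row label holding the maximum. Sorting the index alphabetically first makes that "first" row the alphabetically earliest country on a tie. This is a sketch on a toy frame with made-up values and invented country names, not the notebook's own solution.

```python
import pandas as pd

# Toy frame with made-up values; 'Beta' and 'Alpha' tie in 1980.
toy = pd.DataFrame(
    {'1980': [5, 9, 9], '1981': [7, 3, 1]},
    index=['Zeta', 'Beta', 'Alpha'],
)
years = ['1980', '1981']

# Alphabetical row order, so idxmax resolves ties alphabetically.
ordered = toy.sort_index()

max_df_sketch = pd.DataFrame({
    'year': [int(y) for y in years],
    'country': ordered[years].idxmax().tolist(),
    'number_of_immigrants': ordered[years].max().tolist(),
})
print(max_df_sketch)
```

The result has the same three columns as `max_df`, with one row per year.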

**Q. 3 (20 points)**   
For each year find the total, mean, standard deviation, maximum, minimum, and range (max - min). Create a new dataframe having columns - year, total, mean, standard deviation, maximum, minimum, and range.

In [None]:
# drop the 'Total' row so the statistics cover only real countries
df_no_total = df.drop(index='Total', errors='ignore')

# make years list
years = [str(year) for year in range(1980, 2014)]

# empty list for statistics for each year
stats_records = []

# loop over each year in the dataset
for year in years:
    # one year of immigration counts across all countries
    s = df_no_total[year]

    total = s.sum()          # total immigrants across all countries for that year
    mean = s.mean()          # mean immigrants per country for that year
    std = s.std()            # sample standard deviation (pandas default, ddof=1)
    maximum = s.max()        # largest single-country count for that year
    minimum = s.min()        # smallest single-country count for that year
    rng = maximum - minimum  # range = max - min across countries for that year
    stats_records.append({'year': int(year), 'total': int(total), 'mean': mean, 'std': std, 'max': int(maximum), 'min': int(minimum), 'range': int(rng)})

# turn the list of dicts into a dataframe and display it
stats_df = pd.DataFrame(stats_records)
stats_df
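
The same per-year statistics can be computed in one call with `DataFrame.agg`. The sketch below uses a toy frame with made-up values; the column names it produces (`sum`, `mean`, `std`, `max`, `min`) come from the aggregation names and differ slightly from those in `stats_df`.

```python
import pandas as pd

# Toy frame with made-up values: three countries, two years.
toy = pd.DataFrame(
    {'1980': [1, 2, 3], '1981': [10, 20, 60]},
    index=['Country A', 'Country B', 'Country C'],
)

# agg() returns one row per statistic; transpose to get one row per year.
stats = toy.agg(['sum', 'mean', 'std', 'max', 'min']).T

# The range is derived from two of the aggregated columns.
stats['range'] = stats['max'] - stats['min']

# Move the year labels from the index into a 'year' column.
stats = stats.rename_axis('year').reset_index()
print(stats)
```

Note that `std` here is the sample standard deviation (`ddof=1`), the pandas default, which matches `s.std()` in the loop.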

**Q. 4 (20 points)**   
For the Scandinavian countries - Denmark, Norway, and Sweden - print the name of the country and the total immigration for each of these countries.

In [None]:
# list of scandinavian countries to look for
scandinavian_countries = ['Denmark', 'Norway', 'Sweden']

# loop through the Scandinavian countries
for country in scandinavian_countries:
    # look up the 'Total' column for that country's row using loc
    total_immigration = df.loc[country, 'Total']
    # print the country name and its total immigration
    print(country, int(total_immigration))
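
`loc` also accepts a list of row labels, so the three lookups can be done in a single selection. A minimal sketch, using a toy frame whose totals are made up and do not come from the dataset:

```python
import pandas as pd

# Toy frame with made-up totals; 'Country X' stands in for the other rows.
toy = pd.DataFrame(
    {'Total': [100, 200, 300, 400]},
    index=['Denmark', 'Norway', 'Sweden', 'Country X'],
)

# A list of row labels plus one column label returns a Series.
scandinavia = toy.loc[['Denmark', 'Norway', 'Sweden'], 'Total']

# items() yields (label, value) pairs in the order requested.
for country, total in scandinavia.items():
    print(country, int(total))
```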

**Q. 5 (20 points)**    
Sum the immigration for all the years from the following continents - Africa, Asia, Europe, Latin America and the Caribbean, Northern America, and Oceania. Print the name of the continent and the total immigration.

In [None]:
# list of continents to sum
continents_to_sum = ['Africa', 'Asia', 'Europe', 'Latin America and the Caribbean', 'Northern America', 'Oceania']

# loop through each continent
for continent in continents_to_sum:
    # keep only the countries in the current continent
    # (the 'Total' row has Continent == 'All', so it is never selected)
    subset = df[df['Continent'] == continent]

    # sum the country-level totals for that continent
    total_continent = subset['Total'].sum()
    # print the continent name and its total immigration
    print(continent, int(total_continent))
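
The per-continent sums can also be obtained with `groupby`, one of the functions linked under Q. 0. The sketch below runs on a toy frame with made-up values and drops the `Total` row first so it cannot form a group of its own.

```python
import pandas as pd

# Toy frame with made-up values, including a 'Total' row like the one from Q. 1.
toy = pd.DataFrame(
    {'Continent': ['Asia', 'Asia', 'Europe', 'All'],
     'Total': [10, 20, 5, 35]},
    index=['Country A', 'Country B', 'Country C', 'Total'],
)

# Remove the summary row, then sum the Total column within each continent.
by_continent = (
    toy.drop(index='Total', errors='ignore')
       .groupby('Continent')['Total']
       .sum()
)

for continent, total in by_continent.items():
    print(continent, int(total))
```

`groupby` sorts the group keys alphabetically by default, so the continents print in alphabetical order.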