<h1> Final Project 1 Jupyter Notebook: India, Pakistan, and Bangladesh </h1>
<h2> Group 1: Dorothy Thomas </h2>
<br/>
<p>Authors: Rishi Boddu, Kita Hu, Sage Tulabing, Leanna Baltonado</p>
<br/>
<p>In this Jupyter Notebook, we'll be introducing population functions, population pyramid functions and more in order to quantify the 1994 Rwandan Genocide using data analysis. We primarily utilize the WBData Population dataset in order to compile data containing population for each age, gender, year, and country. We cite the information from WB Data. https://wbdata.readthedocs.io. We primarily aim to analyze the 1994 Rwandan Genocide using graphs and visualizations.</p>

In [2]:
# installations and importing data

%pip install plotly
%pip install wbdata
%pip install eep153_tools
%pip install python_gnupg
%pip install -U gspread_pandas

import wbdata
import pandas as pd
import numpy as np

Note: you may need to restart the kernel to use updated packages.
Collecting wbdata
  Using cached wbdata-1.1.0-py3-none-any.whl.metadata (2.1 kB)
Collecting appdirs<2,>=1.4 (from wbdata)
  Using cached appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting cachetools<6,>=5.3.2 (from wbdata)
  Using cached cachetools-5.5.2-py3-none-any.whl.metadata (5.4 kB)
Collecting dateparser<2,>=1.2.0 (from wbdata)
  Downloading dateparser-1.3.0-py3-none-any.whl.metadata (30 kB)
Collecting shelved-cache<0.4,>=0.3.1 (from wbdata)
  Using cached shelved_cache-0.3.1-py3-none-any.whl.metadata (4.7 kB)
Collecting tabulate<1,>=0.8.5 (from wbdata)
  Using cached tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Collecting tzlocal>=0.2 (from dateparser<2,>=1.2.0->wbdata)
  Using cached tzlocal-5.3.1-py3-none-any.whl.metadata (7.6 kB)
Using cached wbdata-1.1.0-py3-none-any.whl (18 kB)
Using cached appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)
Using cached cachetools-5.5.2-py3-none-any.whl (10 kB)
Down

Key '908928392666670524' not in persistent cache.
Key '6816556363565469494' not in persistent cache.
Key '1368085445840460484' not in persistent cache.
Key '9137008510512675864' not in persistent cache.
Key '-9071094013739866322' not in persistent cache.
Key '6359577487516839820' not in persistent cache.
Key '9038855796116952968' not in persistent cache.
Key '8527968594482825886' not in persistent cache.
Key '-2709414708725071635' not in persistent cache.
Key '-4467262090490648840' not in persistent cache.
Key '7830179600893910772' not in persistent cache.
Key '-7351732370817082538' not in persistent cache.
Key '-7188400888950285268' not in persistent cache.
Key '7568051928805496522' not in persistent cache.
Key '-8199413836303941946' not in persistent cache.
Key '-2247774701848335460' not in persistent cache.
Key '431503606751777688' not in persistent cache.
Key '5065931009304700304' not in persistent cache.
Key '-3095456167335194868' not in persistent cache.
Key '4113904603166459227'

In [3]:
wbdata.get_countries(query="India")[0]['id']

'IND'

In [4]:
wbdata.get_countries(query="Pakistan")[0]['id']

'BMN'

In [5]:
wbdata.get_countries(query="Bangladesh")[0]['id']

'BGD'

In [6]:
#wbdata.get_indicators()
SOURCE = 40 # "Population estimates and projections

indicators = wbdata.get_indicators(source=SOURCE)
indicators



id                 name
-----------------  -------------------------------------------------------------------
SH.DTH.0509        Number of deaths ages 5-9 years
SH.DTH.0514        Number of deaths ages 5-14 years
SH.DTH.1014        Number of deaths ages 10-14 years
SH.DTH.1019        Number of deaths ages 10-19 years
SH.DTH.1519        Number of deaths ages 15-19 years
SH.DTH.2024        Number of deaths ages 20-24 years
SH.DTH.IMRT        Number of infant deaths
SH.DTH.IMRT.FE     Number of infant deaths, female
SH.DTH.IMRT.MA     Number of infant deaths, male
SH.DTH.MORT        Number of under-five deaths
SH.DTH.MORT.FE     Number of under-five deaths, female
SH.DTH.MORT.MA     Number of under-five deaths, male
SH.DTH.NMRT        Number of neonatal deaths
SH.DYN.0509        Probability of dying among children ages 5-9 years (per 1,000)
SH.DYN.0514        Probability of dying at age 5-14 years (per 1,000 children age 5)
SH.DYN.1014        Probability of dying among adolescents ages 1

<h1> [A] Population Dataframe</h1>

In [20]:
countries = wbdata.get_countries()
country_dict = {}

# building a dictionary so that users can just search by country name rather than the ID (e.g. "India" vs "IND")

for country in countries:
    country_code = country['id']
    country_name = country['name']
    country_dict[country_name] = country_code

# formatting the ages so that they dont have to be two digit strings (e.g 05 vs 5)

def int_to_str(num):
    if 0 <= num < 10:
        return f"0{num}"
    else:
        return str(num)

In [61]:
# function to display dataframe 
def population_dataframe(area, year_range, age_range):
   
    """
    area (str): e.g. "India"
    year_range (tuple): (start_year, end_year), use (year, year) if just want singular year
    age_range (tuple): (low_age, high_age)

    """

    # find country with the exact name entered
    country_id = next(
        (c["id"] for c in countries if c["name"] == area),
        None
    )
    
    low, high = age_range
    ages = range(low, high + 1)

    indicators = {}
    for age in ages:
        age2 = f"{age:02d}"
        indicators[f"SP.POP.AG{age2}.MA.IN"] = f"male_{age2}"
        indicators[f"SP.POP.AG{age2}.FE.IN"] = f"female_{age2}"

    df = wbdata.get_dataframe(
        indicators,
        country=country_id,
        parse_dates=True
    )

    # Filter by year range
    start, end = year_range
    df = df[(df.index.year >= start) & (df.index.year <= end)]

    # Convert datetime index to Year column
    df = df.copy()
    df.insert(0, "Year", df.index.year)
    df.reset_index(drop=True, inplace=True)

    # Rename columns to readable English
    rename_map = {}
    for col in df.columns:
        if col.startswith("male_"):
            age = int(col.split("_")[1])
            rename_map[col] = f"Male age {age}"
        elif col.startswith("female_"):
            age = int(col.split("_")[1])
            rename_map[col] = f"Female age {age}"

    df.rename(columns=rename_map, inplace=True)

    return df

In [63]:
# test code of dataframe function
df = population_dataframe( "India", (2000, 2010), (1, 3))
df

Unnamed: 0,Year,Male age 1,Female age 1,Male age 2,Female age 2,Male age 3,Female age 3
0,2010,13495545.0,12298938.0,13415841.5,12215921.0,13456025.5,12244401.0
1,2009,13472670.5,12282255.0,13484389.5,12278052.5,13601170.5,12368019.5
2,2008,13548239.5,12352226.5,13634321.0,12407041.5,13782866.0,12523308.0
3,2007,13705075.0,12488959.0,13820419.0,12567362.0,13918835.0,12647810.0
4,2006,13894405.5,12653374.0,13957326.5,12693488.0,14008123.0,12742689.0
5,2005,14036289.5,12785347.5,14049018.5,12791509.5,14090686.0,12813192.0
6,2004,14133068.5,12889388.5,14134137.5,12865285.5,14053390.5,12752942.5
7,2003,14222743.0,12968494.5,14098552.5,12807316.5,13882771.5,12589598.0
8,2002,14193578.5,12917390.5,13931333.0,12647920.0,13743147.5,12466298.5
9,2001,14029296.5,12761370.5,13792863.5,12526431.5,13663218.5,12399016.5


<h1>[A] Population Statistics</h1>

In [51]:
def population_statistics(area, year_range, age_range, sex):
    """
    area (str): e.g. "India", "World"
    year_range (tuple): (start_year, end_year) — use (year, year) for a single year
    age_range (tuple): (low_age, high_age)
    sex (str): "Male", "Female", or "All"

    Answers queries of the form:
    In [year] how many [people/males/females] aged [low] to [high]
    were living in [the world/region/country]?

    Returns
    -------
    pandas.Series
        Indexed by year with population counts
    """

    # Get base population dataframe
    df = population_dataframe(area, year_range, age_range)

    # Aggregate age groups
    df["male_total"] = df.filter(like="male_").sum(axis=1)
    df["female_total"] = df.filter(like="female_").sum(axis=1)
    df["population_total"] = df["male_total"] + df["female_total"]

    # Keep only aggregated columns
    agg = df[["male_total", "female_total", "population_total"]].copy()

    # Format index
    agg.index = agg.index.year
    agg.index.name = "year"

    # Normalize sex input
    sex = sex.capitalize()
    if sex not in {"Male", "Female", "All"}:
        raise ValueError("sex must be 'Male', 'Female', or 'All'")

    # Select output series
    if sex == "Female":
        result = agg["female_total"]
    elif sex == "Male":
        result = agg["male_total"]
    else:
        result = agg["population_total"]

    # Compute total across all years
    total_value = result.sum()
    result.loc["Total"] = total_value

    # Build title
    start, end = year_range
    low, high = age_range
    year_text = f"{start}" if start == end else f"{start}–{end}"

    title = (
        f"Population in {area} during {year_text}, "
        f"ages {low}–{high} ({sex})"
    )

    # Attach title for display
    result.name = title

    return result

In [52]:
# test code of statistics function
india_pop = population_statistics('India', (2000, 2010), (0, 3), 'All')
india_pop

year
2010     1.523889e+08
2009     1.530605e+08
2008     1.541782e+08
2007     1.557785e+08
2006     1.574501e+08
2005     1.589519e+08
2004     1.598060e+08
2003     1.597774e+08
2002     1.590998e+08
2001     1.579490e+08
2000     1.567327e+08
Total    1.725173e+09
Name: Population in India during 2000–2010, ages 0–3 (All), dtype: float64

<h1> [A] Unit tests </h1>

In [57]:
def test_func():
    

SyntaxError: incomplete input (2463406973.py, line 2)

In [58]:
# running the unit tests
test_func()

NameError: name 'test_func' is not defined

<h1> [B] Population Pyramids </h1>