# Final Project for Data and Network Visualization
# Mukhamejan Assan
# March 31, 2023

## Data and Motivation

The idea for this project came while I was searching the Internet for a "good" visualization. I came across a population pyramid chart on PopulationPyramid.net, but I did not like having to click on each year separately to see its data. The chart also showed only the percentage share of each age group, which I found limiting. So I decided to build a better version of the chart myself.

I scraped the data for 1950 to 2100 from the same website and used Plotly to create an interactive population pyramid for Kazakhstan. A slider lets the user move through the years, and horizontal bars show the absolute number of people in each age group. Males and females are drawn on opposite sides of the axis, which makes the two distributions easy to compare.

I hope this visualization will be useful for anyone interested in exploring the demographic trends of Kazakhstan.

### Code for data scraping

In [1]:
import requests
import pandas as pd
import io

country_code = 398  # Kazakhstan; change this to retrieve data for another country
df_list = []  # one DataFrame per year

# Loop through years from 1950 to 2100 (inclusive)
for year in range(1950, 2101):
    # Send a GET request to the CSV URL for the current year
    url = f"https://www.populationpyramid.net/api/pp/{country_code}/{year}/?csv=true"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on a failed download instead of parsing an error page
    csv_data = response.content.decode()

    # Parse the CSV text and add the year as the first column
    year_df = pd.read_csv(io.StringIO(csv_data))
    year_df.insert(0, "Year", year)
    df_list.append(year_df)

# Concatenate the per-year DataFrames into a single DataFrame with a fresh index
df = pd.concat(df_list, ignore_index=True)


### Code for the first visualization

In [5]:
import plotly.graph_objs as go
import pandas as pd
from plotly.offline import plot

# Load data
data = df

# Define function to filter data by year
def get_data_by_year(data, year):
    return data[data['Year'] == year]

# Create initial plot
year = 1950
data_year = get_data_by_year(data, year)
male_data = data_year[['Age', 'M']]
female_data = data_year[['Age', 'F']]
age_groups = list(data_year['Age'])

male_trace = go.Bar(
    y=age_groups,
    x=male_data['M'],
    orientation='h',
    name='Male',
    marker=dict(color='crimson')
)

female_trace = go.Bar(
    y=list(age_groups),
    x=[-x for x in female_data['F']],
    orientation='h',
    name='Female',
    marker=dict(color='seagreen')
)

fig = go.Figure(data=[male_trace, female_trace])

# Create slider for years
years = list(data['Year'].unique())
steps = []
for step_year in years:
    step_data = get_data_by_year(data, step_year)
    step_ages = list(step_data['Age'])
    steps.append({
        "label": str(step_year),
        "method": "update",
        "args": [
            # First dict: per-trace data updates, in trace order (male, female)
            {"y": [step_ages, step_ages],
             "x": [list(step_data['M']), [-x for x in step_data['F']]]},
            # Second dict: layout updates
            {"title.text": "Population Pyramid for Kazakhstan"},
        ],
    })

slider = dict(
    active=0,
    currentvalue={"prefix": "Year: "},
    pad={"t": 50},
    steps=steps
)

# Set the layout of the figure with slider and axis ranges
fig.update_layout(sliders=[slider], title='Population Pyramid for Kazakhstan',
                  xaxis=dict(range=[-1200000, 1200000],
                             tickvals=[-1000000 ,-800000, -600000, -400000, -200000, 0, 200000, 400000, 600000, 800000, 1000000],
                             ticktext=["1M", "800K", "600K", "400K", "200K", "0", "200K", "400K", "600K", "800K", "1M"],
                             title='Number'),
                  yaxis=dict(title='Age'),
                  barmode='overlay', bargap=0.1)

# Show the figure
plot(fig, filename='population_pyramid.html')


'population_pyramid.html'
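
The tick positions and labels for the x-axis are written out by hand in the layout above. As a minimal sketch, they could instead be generated from a maximum value and a step. The helper names `pyramid_ticks` and `format_count` are hypothetical: they are not part of the original notebook or of Plotly.

```python
def format_count(value):
    """Format a non-negative count compactly, for example 200000 as '200K'."""
    if value >= 1_000_000:
        return f"{value / 1_000_000:g}M"
    if value >= 1_000:
        return f"{value / 1_000:g}K"
    return f"{value:g}"


def pyramid_ticks(max_value, step):
    """Return (tickvals, ticktext) for a symmetric pyramid axis.

    Tick positions run from -max_value to +max_value, while the labels
    show absolute values so that neither side of the pyramid reads as negative.
    """
    if max_value <= 0 or step <= 0:
        raise ValueError("max_value and step must be positive")
    count = int(max_value // step)
    tickvals = [i * step for i in range(-count, count + 1)]
    ticktext = [format_count(abs(v)) for v in tickvals]
    return tickvals, ticktext


# Same spacing as the hand-written axis: ticks every 200,000 up to 1,000,000
tickvals, ticktext = pyramid_ticks(1_000_000, 200_000)
```

The two lists can be passed as `tickvals` and `ticktext` in the `xaxis` dictionary given to `fig.update_layout`.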

### Data for the second visualization

In [3]:
# Load data into a pandas DataFrame
data = df

# Filter out the total rows for each year
data = data[data['Age'] != 'Total']

# Group the data by year and age
grouped_data = data.groupby(['Year', 'Age']).sum()

# Calculate the total population for each year
year_totals = grouped_data.groupby('Year').sum()

# Calculate each age group's percentage share of the male (M_p) and female (F_p) totals for that year
grouped_data['M_p'] = grouped_data['M'] / year_totals['M'] * 100
grouped_data['F_p'] = grouped_data['F'] / year_totals['F'] * 100

# Reset the index of the DataFrame
grouped_data = grouped_data.reset_index()

# Work on an explicit copy so the column assignment below does not act on a slice
# (the "Total" rows were already filtered out above)
df = grouped_data.copy()

# Recover the age order within each year
# (any Age label not listed in age_order becomes NaN in the categorical column)
age_order = ['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85-89', '90-94', '95-99']
df['Age'] = pd.Categorical(df['Age'], categories=age_order, ordered=True)  # order the Age column
df = df.sort_values(['Year', 'Age'])  # sort again by Year and Age
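
The share calculation can be checked on a small example. The sketch below uses made-up numbers and takes a slightly different route from the cell above: `groupby(...).transform("sum")` repeats each year's total on every row of that year, so the division lines up row by row. The names `toy` and `share_sums` are only for illustration.

```python
import pandas as pd

# Made-up counts for two years and three age groups, for illustration only
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "Age": ["0-4", "5-9", "10-14"] * 2,
    "M": [50, 30, 20, 40, 40, 20],
    "F": [45, 35, 20, 30, 50, 20],
})

for sex in ["M", "F"]:
    # Yearly total for this sex, broadcast back to the original rows
    yearly_total = toy.groupby("Year")[sex].transform("sum")
    toy[f"{sex}_p"] = toy[sex] / yearly_total * 100

# Within each year, the shares of each sex should add up to 100
share_sums = toy.groupby("Year")[["M_p", "F_p"]].sum()
```

If the sums in `share_sums` are not close to 100 for every year, rows such as a "Total" line are probably still included in the data.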



### Code for the second visualization

In [7]:
import plotly.graph_objs as go
import pandas as pd
from plotly.offline import plot

# Load data
data = df

def get_data_by_year(data, year):
    return data[data['Year'] == year]

# Create initial plot
year = 1950
data_year = get_data_by_year(data, year)
age_groups = list(data_year['Age'])

male_trace = go.Bar(
    y=age_groups,
    x=list(data_year['M_p']),
    orientation='h',
    name='Male',
    marker=dict(color='crimson')
)

female_trace = go.Bar(
    y=age_groups,
    x=[-x for x in data_year['F_p']],  # negated so the female bars extend to the left
    orientation='h',
    name='Female',
    marker=dict(color='seagreen')
)

fig = go.Figure(data=[male_trace, female_trace])

# Create slider for years
years = list(data['Year'].unique())
steps = []
for step_year in years:
    step_data = get_data_by_year(data, step_year)
    step_ages = list(step_data['Age'])
    steps.append({
        "label": str(step_year),
        "method": "update",
        "args": [
            # First dict: per-trace data updates, in trace order (male, female)
            {"y": [step_ages, step_ages],
             "x": [list(step_data['M_p']), [-x for x in step_data['F_p']]]},
            # Second dict: layout updates
            {"title.text": "Population Pyramid for Kazakhstan"},
        ],
    })

slider = dict(
    active=0,
    currentvalue={"prefix": "Year: "},
    pad={"t": 50},
    steps=steps
)

# Set the layout of the figure with slider and axis ranges
# (tick labels show absolute percentages on both sides of zero)
fig.update_layout(sliders=[slider], title='Population Pyramid for Kazakhstan',
                  xaxis=dict(range=[-17, 17],
                             tickvals=[-15, -10, -5, 0, 5, 10, 15],
                             ticktext=['15%', '10%', '5%', '0%', '5%', '10%', '15%'],
                             title='Percent'),
                  yaxis=dict(title='Age'),
                  barmode='overlay', bargap=0.1)

# Show the figure (this reuses the first chart's filename, so it overwrites that file)
plot(fig, filename='population_pyramid.html')


'population_pyramid.html'
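
Both charts build their slider steps the same way; only the value columns differ (`M`/`F` for counts, `M_p`/`F_p` for shares). As a sketch, the step construction could live in one function. `build_slider_steps` is a hypothetical helper, not part of the original notebook, and it assumes the DataFrame has `Year` and `Age` columns as above. It relies on Plotly's `update` method taking the per-trace data updates first and the layout updates second.

```python
import pandas as pd


def build_slider_steps(data, male_col, female_col, title):
    """Build one Plotly slider step per year for a two-trace pyramid.

    Grouping by year once avoids filtering the full DataFrame
    several times for every step.
    """
    steps = []
    for year, group in data.groupby("Year", sort=True):
        ages = group["Age"].astype(str).tolist()
        steps.append({
            "label": str(year),
            "method": "update",
            "args": [
                # Data updates, in trace order (male, female); female values are negated
                {"y": [ages, ages],
                 "x": [group[male_col].tolist(), (-group[female_col]).tolist()]},
                # Layout updates
                {"title.text": title},
            ],
        })
    return steps


# Made-up numbers, for illustration only
toy = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001],
    "Age": ["0-4", "5-9", "0-4", "5-9"],
    "M": [10, 20, 30, 40],
    "F": [1, 2, 3, 4],
})
steps = build_slider_steps(toy, "M", "F", "Population Pyramid (toy data)")
```

The returned list can be used as the `steps` entry of the slider dictionary, calling the helper with `"M"` and `"F"` for the first chart and with `"M_p"` and `"F_p"` for the second.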

## Conclusion / Comments

I am happy to report that I addressed the issues I had with the population pyramid on the website. The slider makes every year reachable from a single chart, which brings in the time dimension without a separate click per year. I also rebuilt the original share-based view, so the pyramid is now available both in absolute numbers and in percentage shares.

Overall, I am pleased with the changes I made and believe that they have improved upon the existing visualizations. Through this process, I have gained valuable experience in data manipulation and visualization techniques, and I look forward to continuing to develop my skills in these areas in the future.

## References

Website:
PopulationPyramid.net. Retrieved March 30, 2023, from https://www.populationpyramid.net/kazakhstan/2021/

Forum:
Plotly Community. (2018, June 22). Slider change chart data and chart title. Retrieved March 30, 2023, from https://community.plotly.com/t/slider-change-chart-data-and-chart-title/18510/3

Plotly. (n.d.). Sliders. Retrieved March 30, 2023, from https://plotly.com/python/sliders/