Title: Population Statistics

Author: Philip Cullen

Task:

Part 1:

Write a Jupyter notebook that analyses the differences between the sexes by age in Ireland:

- Weighted mean age (by sex)

- The difference between the sexes by age

Part 2:

In the same notebook, make a variable that stores an age (say 35)

- Write the code that would group the people within 5 years of that age together, into one age group

- Calculate the population difference between the sexes in that age group.

Part 3:

- Write the code that would work out which region in Ireland has the biggest population difference between the sexes in that age group
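
To make the Part 1 terms concrete, here is a minimal sketch of a weighted mean age and a per-age difference between the sexes. It runs on a tiny made-up table, not the CSO data, and the names `toy`, `weighted_mean_age`, `weighted_means` and `by_age` are hypothetical helpers chosen for illustration. The column names `Sex`, `Age` and `Population` are assumed to match the cleaned DataFrame built later in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only (NOT the CSO figures).
toy = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Age": [0, 1, 2, 0, 1, 2],
    "Population": [10, 20, 30, 30, 20, 10],
})


def weighted_mean_age(group: pd.DataFrame) -> float:
    """Mean age, with each age weighted by the number of people at that age."""
    return float(np.average(group["Age"], weights=group["Population"]))


# One weighted mean per sex.
weighted_means = {sex: weighted_mean_age(g) for sex, g in toy.groupby("Sex")}

# Population difference between the sexes at each age (Male minus Female).
by_age = toy.pivot_table(index="Age", columns="Sex", values="Population", aggfunc="sum")
by_age["Difference"] = by_age["Male"] - by_age["Female"]

print(weighted_means)
print(by_age)
```

The weighting matters because each row is a head count for one age, not one person: a plain `mean()` of the `Age` column would treat every age as equally common.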

In [7]:
# First I need to get the data that I'll be using, which in this case is the file cso-populationbyage.csv

import pandas as pd

url = "https://raw.githubusercontent.com/andrewbeattycourseware/PFDA-courseware/refs/heads/main/code/data/cso-populationbyage.csv"
df = pd.read_csv(url)
# Creates a pandas DataFrame from the data in the .csv file.

print(df.shape)
print(df.columns)
df.head()
# Checking the data to make sure I have the correct information

(3264, 7)
Index(['Statistic Label', 'CensusYear', 'Sex', 'Single Year of Age',
       'Administrative Counties', 'UNIT', 'VALUE'],
      dtype='object')


Unnamed: 0,Statistic Label,CensusYear,Sex,Single Year of Age,Administrative Counties,UNIT,VALUE
0,Population,2022,Both sexes,All ages,Ireland,Number,5149139
1,Population,2022,Both sexes,All ages,Carlow County Council,Number,61968
2,Population,2022,Both sexes,All ages,Dublin City Council,Number,592713
3,Population,2022,Both sexes,All ages,Dún Laoghaire Rathdown County Council,Number,233860
4,Population,2022,Both sexes,All ages,Fingal County Council,Number,330506


Now that the data has been obtained, I need to clean and prepare it for analysis.

This ensures I keep only the data I need.

In [None]:
# First I'm going to drop any columns I don't require.
# In this case I don't need the Statistic Label, CensusYear or UNIT columns.
# df.drop(columns=drop_col_list, inplace=True) removes those columns from the DataFrame.
# errors='ignore' stops pandas raising a KeyError if any of those columns don't exist,
# so the cell can be re-run safely.
drop_col_list = ["Statistic Label", "CensusYear", "UNIT"]
df.drop(columns=drop_col_list, inplace=True, errors='ignore')


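df = df[df["Single Year of Age"] != "All ages"].copy()
# This removes every row where the Single Year of Age column says "All ages".
# .copy() makes df an independent DataFrame, so the column assignments below
# can't trigger a SettingWithCopyWarning.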

df["Single Year of Age"] = df["Single Year of Age"].str.replace('Under 1 year', '0')
# Changes the text “Under 1 year” to “0”, so babies under 1 year old are represented as age 0.
# This has to happen before the next step, which would otherwise turn “Under 1 year” into “1”.
df["Single Year of Age"] = df["Single Year of Age"].str.replace(r'\D', '', regex=True)
# This removes every non-digit character from the age column, leaving only the number.
# \D matches any character that isn’t a digit.

df["Single Year of Age"] = df["Single Year of Age"].astype('int64')
# Converts the cleaned text values (strings of digits) into real integers.

df.rename(columns={
    "Single Year of Age": "Age",
    "VALUE": "Population",
    "Administrative Counties": "Region"
}, inplace=True)

# --- STANDARDIZE SEX NAMES ---
df["Sex"] = df["Sex"].str.strip().str.capitalize()

# --- CHECK CLEANED DATA ---
print("✅ Cleaned Data Sample:")
print(df.head(10))
print("\nColumns:", df.columns.tolist())
print("\nUnique Sex values:", df["Sex"].unique())
print("\nAge range:", df["Age"].min(), "-", df["Age"].max())
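
The age-cleaning steps above can be checked on a small hand-made Series before trusting them on the full dataset. This is only a sketch: of the labels below, only "Under 1 year" is confirmed by this notebook, and the others are assumed examples of what the CSO file might contain. The names `labels` and `cleaned` are hypothetical.

```python
import pandas as pd

# Hypothetical age labels; only "Under 1 year" is confirmed by the notebook.
labels = pd.Series(["Under 1 year", "1 year", "35 years", "100 years and over"])

cleaned = (
    labels
    # Map the special label to "0" first; stripping non-digits first
    # would turn "Under 1 year" into "1" and merge it with age 1.
    .str.replace("Under 1 year", "0", regex=False)
    # Drop every non-digit character, leaving only the number.
    .str.replace(r"\D", "", regex=True)
    # Convert the digit strings to integers.
    .astype("int64")
)

print(cleaned.tolist())  # [0, 1, 35, 100]
```

Under this approach an open-ended label such as "100 years and over" collapses to the single value 100, which is worth keeping in mind when reading the top of the age range.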