## 25-26: 4369 -- PROGRAMMING FOR DATA ANALYTICS  
## Topic 05: Population Statistics  
   
# Assignment 5 Population  
Author:  Niall Naughton  
Date:  25/10/2025  

***
## <font color="crimson">Part 1 70%</font>
Write a Jupyter notebook that analyses the differences between the sexes by age in Ireland.

* Weighted mean age (by sex)
* The difference between the sexes by age

This part does not need to look at the regions.

i.e. you can take the notebook I used in the lectures and substitute the sexes for the regions.

Data extracted from cso.ie website.  
Census->Census2022->Summary Results  

![alt text](resources/cso_link.png)   

At the base of the page, select "API Data Query" to extract the URL  
* Format = CSV(1.0)  
* RESTful URL  

https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/FY006A/CSV/1.0/en   

Use this URL to extract the CSV data.

In [76]:
import pandas as pd

url = "https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/FY006A/CSV/1.0/en"
df = pd.read_csv(url)
df.tail(3)

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),CensusYear,C02199V02655,Sex,C02076V03371,Single Year of Age,C03789V04537,Administrative Counties,UNIT,VALUE
9789,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-149d-13a3-e055-000000000001,Cavan County Council,Number,12
9790,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-14a4-13a3-e055-000000000001,Donegal County Council,Number,31
9791,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-1495-13a3-e055-000000000001,Monaghan County Council,Number,7


## Clean Population Dataset 1 : Format Columns  
We are only interested in the columns "Sex", "Administrative Counties", "Single Year of Age" and "VALUE".  
* Remove all other columns  
* We are not analyzing county-specific data in Part 1, but the "Ireland" filter is applied later (after the pivot) so that the regional rows remain available for Part 3  
* We are analyzing age-specific data, so filter out records with "All ages"  
* We are analyzing sex-specific data, so filter out records with "Both sexes"  

* Column names could do with simplifying/shortening  
* In the Age column, replace 'Under 1 year' with 0  
* In the Age column, remove all non-digit characters and convert to int64  
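
As a quick check of the age-cleaning rules listed above, here is a minimal sketch on a toy Series. The labels are illustrative stand-ins for the CSO values, and the names `ages` and `cleaned` are not part of the notebook itself.

```python
import pandas as pd

# Toy labels in the style of the "Single Year of Age" column (illustrative only)
ages = pd.Series(["Under 1 year", "1 year", "2 years", "100 years and over"])

# "Under 1 year" must be replaced first: stripping non-digits alone would turn it into 1
cleaned = (
    ages.str.replace("Under 1 year", "0", regex=False)
        .str.replace(r"\D", "", regex=True)   # drop every non-digit character
        .astype("int64")
)

print(cleaned.tolist())  # [0, 1, 2, 100]
```

Note that "100 years and over" collapses to 100, so the top age band is treated as exactly 100 in any later calculation.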

In [77]:

# Remove unwanted columns
drop_col_list = ['STATISTIC', 'Statistic Label','TLIST(A1)','CensusYear','C02199V02655','C02076V03371','C03789V04537','UNIT']
df.drop(columns=drop_col_list, inplace=True)

# The "Ireland" filter is deferred until after the pivot, because Part 3 needs the regional rows
#df = df[df["Administrative Counties"] == "Ireland"]
# Filter "Single Year of Age" by removing "All ages"
df = df[df["Single Year of Age"] != "All ages"]
# Filter "Sex" by removing "Both sexes"
df = df[df["Sex"] != "Both sexes"]

#Tidy up Column Names by renaming them
df.rename(columns={'Administrative Counties': 'Region', 'Single Year of Age': 'Age'}, inplace=True)

#Format Age Column (replace 'Under 1 year' with 0, and convert to int)
df['Age'] = df['Age'].str.replace('Under 1 year', '0')
df['Age'] = df['Age'].str.replace('\\D', '', regex=True)
# Convert columns to int64
df['Age']=df['Age'].astype('int64')
df['VALUE']=df['VALUE'].astype('int64')

#review formatted dataframe
print(df)

         Sex  Age                                 Region  VALUE
3296    Male    0                                Ireland  29610
3297    Male    0                  Carlow County Council    346
3298    Male    0                    Dublin City Council   3188
3299    Male    0  Dún Laoghaire Rathdown County Council   1269
3300    Male    0                  Fingal County Council   2059
...      ...  ...                                    ...    ...
9787  Female  100               Roscommon County Council      7
9788  Female  100                   Sligo County Council      9
9789  Female  100                   Cavan County Council     12
9790  Female  100                 Donegal County Council     31
9791  Female  100                Monaghan County Council      7

[6464 rows x 4 columns]


## Clean Population Dataset 2 : Pivot Dataframe  
For a given age and region, the male and female populations are in two separate rows.  
These two data points should be "pivoted" into columns on a single row.  

After the pivot, 'Age' and 'Region' are no longer regular columns: they have been moved into the index (row labels).  
We need 'Age' and 'Region' back as normal columns, using the **reset_index()** method.  
Furthermore, after pivoting, pandas keeps the name of the pivoted column ('Sex') as the name of the columns axis.  
We need to remove this misleading name.
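
The effect of these three steps is easier to see on a tiny frame. The sketch below uses made-up numbers and the hypothetical names `long_df` and `wide`. It only illustrates the mechanics.

```python
import pandas as pd

# Toy long-format data: one row per (Age, Region, Sex); the values are made up
long_df = pd.DataFrame({
    "Age":    [0, 0, 1, 1],
    "Region": ["Ireland"] * 4,
    "Sex":    ["Male", "Female", "Male", "Female"],
    "VALUE":  [10, 9, 12, 11],
})

wide = long_df.pivot(index=["Age", "Region"], columns="Sex", values="VALUE")
print(wide.columns.name)   # Sex  (label left on the columns axis by the pivot)

wide = wide.reset_index()  # Age and Region become ordinary columns again
wide.columns.name = None   # drop the leftover 'Sex' label

print(list(wide.columns))  # ['Age', 'Region', 'Female', 'Male']
```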

In [78]:
#pivot 'Male', 'Female' data into columns
pivot_df = df.pivot(index=['Age', 'Region'], columns='Sex', values='VALUE')
#We need Age and Region back as normal columns
pivot_df = pivot_df.reset_index()
#remove the leftover 'Sex' label from the columns axis (the row index has no name after reset_index)
pivot_df.columns.name = None

#review pivoted dataframe
print (pivot_df)
pivot_df.info()

      Age                           Region  Female  Male
0       0            Carlow County Council     353   346
1       0             Cavan County Council     501   505
2       0             Clare County Council     691   686
3       0                Cork City Council    1124  1159
4       0              Cork County Council    2055  2135
...   ...                              ...     ...   ...
3227  100         Tipperary County Council      19     4
3228  100  Waterford City & County Council      15     4
3229  100         Westmeath County Council       9     3
3230  100           Wexford County Council      14     2
3231  100           Wicklow County Council      15     3

[3232 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3232 entries, 0 to 3231
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Age     3232 non-null   int64 
 1   Region  3232 non-null   object
 2   Female  3232 non-null   int64 
 3   

## Generate Weighted Mean Age (By Sex)
The formula for a "weighted" mean is:  
sum(age * population at age) / sum(population at age)

$\overline{\mathrm{age}}_w = \frac{\sum_{i=1}^{n} \mathrm{age}_i \cdot \mathrm{population}_i}{\sum_{i=1}^{n} \mathrm{population}_i}$


This calculation does not need to look at the specific regions.
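
As a sanity check on the formula, the sketch below computes the weighted mean by hand on a three-row toy frame and compares it with `numpy.average`, which accepts a `weights` argument. The frame `toy` and its numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: 10 people aged 0, 20 aged 1, 30 aged 2 (made-up numbers)
toy = pd.DataFrame({"Age": [0, 1, 2], "Male": [10, 20, 30]})

# Formula written out: (0*10 + 1*20 + 2*30) / (10 + 20 + 30) = 80 / 60
manual = (toy["Age"] * toy["Male"]).sum() / toy["Male"].sum()

# Same calculation using numpy's weighted average
via_numpy = np.average(toy["Age"], weights=toy["Male"])

print(round(manual, 4), round(via_numpy, 4))  # 1.3333 1.3333
```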

In [79]:
#get population data only for Ireland.
df_pivot_ire = pivot_df[pivot_df['Region'] == 'Ireland']

#Weighted mean is sum(age*population at age) / sum (populations at age)
weighted_mean_male = (df_pivot_ire['Age'] * df_pivot_ire['Male']).sum() / df_pivot_ire['Male'].sum()
weighted_mean_female = (df_pivot_ire['Age'] * df_pivot_ire['Female']).sum() / df_pivot_ire['Female'].sum()

print(f"Weighted Mean Male Age : {weighted_mean_male}")
print(f"Weighted Mean Female Age : {weighted_mean_female}")

Weighted Mean Male Age : 37.7394477371039
Weighted Mean Female Age : 38.9397958987787


## The difference between the sexes by age
Let's look at the population difference between the sexes by age for the whole of Ireland.

In [80]:
#Use absolute difference value; assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_pivot_ire = df_pivot_ire.assign(Difference=(df_pivot_ire['Male'] - df_pivot_ire['Female']).abs())
print(df_pivot_ire[['Age', 'Male', 'Female', 'Difference']])

      Age   Male  Female  Difference
11      0  29610   28186        1424
43      1  28875   27545        1330
75      2  30236   28974        1262
107     3  31001   29483        1518
139     4  31686   29819        1867
...   ...    ...     ...         ...
3083   96    327     956         629
3115   97    217     732         515
3147   98    130     492         362
3179   99    105     336         231
3211  100    154     584         430

[101 rows x 4 columns]


***
## <font color="crimson">Part 2 20%</font>
In the same notebook, make a variable that stores an age (say 35).

Write the code that would group the people within 5 years of that age together into one age group.

Calculate the population difference between the sexes in that age group.




In [60]:
age_group = 25
age_group_limit1 = age_group - 5
age_group_limit2 = age_group + 5

#again only look at Ireland data (df_pivot_ire)
df_grouped = df_pivot_ire[(df_pivot_ire['Age'] >= age_group_limit1) & (df_pivot_ire['Age'] <= age_group_limit2)]
agegroup_male_population = df_grouped['Male'].sum()
agegroup_female_population = df_grouped['Female'].sum()
agegroup_population_difference = abs(agegroup_male_population - agegroup_female_population)

#print(df_grouped)

print(f"Total Irish Population for Male Age group ({age_group_limit1} - {age_group_limit2}) : {agegroup_male_population}")
print(f"Total Irish Population for Female Age group ({age_group_limit1} - {age_group_limit2}) : {agegroup_female_population}")

print(f"Total Irish Population difference between the sexes for Age Group ({age_group_limit1} - {age_group_limit2}) : {agegroup_population_difference}")


Total Irish Population for Male Age group (20 - 30) : 333702
Total Irish Population for Female Age group (20 - 30) : 332948
Total Irish Population difference between the sexes for Age Group (20 - 30) : 754
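
The age-window filter used above can also be written with `Series.between`, which is inclusive at both ends by default. The sketch below is an alternative under that assumption, using a made-up stand-in frame (`toy`) and the hypothetical names `centre_age` and `window`.

```python
import pandas as pd

# Hypothetical stand-in for the Ireland-only pivoted frame; populations are made up
toy = pd.DataFrame({
    "Age":    range(28, 43),   # ages 28..42
    "Male":   [100] * 15,
    "Female": [98] * 15,
})

centre_age = 35
window = 5

# True for ages 30..40 inclusive (11 single-year ages)
in_group = toy["Age"].between(centre_age - window, centre_age + window)

group_totals = toy.loc[in_group, ["Male", "Female"]].sum()
group_difference = abs(group_totals["Male"] - group_totals["Female"])

print(in_group.sum(), group_difference)  # 11 22
```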


***
## <font color="crimson">Part 3 10%</font>
In the same notebook.

Write the code that would work out which region in Ireland has the biggest population difference between the sexes in that age group

In [None]:
max_population_diff = 0
max_population_diff_region = ""

age_group = 30
age_group_limit1 = age_group - 5
age_group_limit2 = age_group + 5

#Iterate through each distinct Region in pivot_df ... but not Ireland this time
for region in pivot_df['Region'].unique():
    if region != 'Ireland': #ignore total Ireland data
        df_pivot_region = pivot_df[pivot_df['Region'] == region]
        df_region_grouped = df_pivot_region[(df_pivot_region['Age'] >= age_group_limit1) & (df_pivot_region['Age'] <= age_group_limit2)]
        region_male_population = df_region_grouped['Male'].sum()
        region_female_population = df_region_grouped['Female'].sum()
        region_population_difference = abs(region_male_population - region_female_population)
        if (region_population_difference > max_population_diff):
            max_population_diff = region_population_difference
            max_population_diff_region = region

print(f"Region with Maximum population difference between the Sexes for Age Group ({age_group_limit1} - {age_group_limit2}) is {max_population_diff_region} with difference of {max_population_diff} ")

Region with Maximum population difference between the Sexes for Age Group (25 - 35) is South Dublin County Council with difference of 1814 
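
The same search can be expressed without an explicit loop by grouping on the region. This is a minimal sketch on made-up data, not the notebook's method. The frame `toy` and the region names "Region A" and "Region B" are hypothetical.

```python
import pandas as pd

# Toy stand-in for pivot_df; region names and populations are made up
toy = pd.DataFrame({
    "Age":    [30, 31, 30, 31, 30, 31],
    "Region": ["Ireland", "Ireland", "Region A", "Region A", "Region B", "Region B"],
    "Female": [50, 52, 20, 22, 30, 30],
    "Male":   [48, 47, 21, 20, 27, 27],
})

age, window = 30, 5

# Keep rows inside the age window and drop the national total
mask = toy["Age"].between(age - window, age + window) & (toy["Region"] != "Ireland")

# Sum each sex per region, then take the absolute difference
totals = toy[mask].groupby("Region")[["Male", "Female"]].sum()
difference = (totals["Male"] - totals["Female"]).abs()

top_region = difference.idxmax()   # region label with the largest difference
top_value = difference.max()

print(top_region, top_value)  # Region B 6
```

With the real `pivot_df`, `groupby` visits each region once, whereas a loop over the raw 'Region' column would revisit the same region for every age row.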
