
# Python Project - Samuele Ceol 
A case study on the correlation between the labour market and suicide rates in Europe

In [7]:
import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd
import seaborn as sns 

# OECD Data - Description

The OECD data comes in three distinct files: one for males, one for females and one aggregate (M/F).

The dataset contains only data for European countries.

No breakdown by age group is provided.

The LOCATION column identifies each country with a unique three-letter code.

Suicide rates are expressed as the number of suicides per 100,000 people.

TODO - Flag codes?

In [154]:
suicide_tot = pd.read_csv('./source/OECD_suicides_total.csv')
suicide_male = pd.read_csv('./source/OECD_suicides_male.csv')
suicide_female = pd.read_csv('./source/OECD_suicides_female.csv')

In [124]:
suicide_tot.shape

(1064, 8)

In [125]:
suicide_tot.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1064 entries, 0 to 1063
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   LOCATION    1064 non-null   object 
 1   INDICATOR   1064 non-null   object 
 2   SUBJECT     1064 non-null   object 
 3   MEASURE     1064 non-null   object 
 4   FREQUENCY   1064 non-null   object 
 5   TIME        1064 non-null   int64  
 6   Value       1064 non-null   float64
 7   Flag Codes  47 non-null     object 
dtypes: float64(1), int64(1), object(6)
memory usage: 66.6+ KB


In [126]:
suicide_tot.head()

,LOCATION,INDICATOR,SUBJECT,MEASURE,FREQUENCY,TIME,Value,Flag Codes
0,AUT,SUICIDE,TOT,100000PER,A,1960,24.2,
1,AUT,SUICIDE,TOT,100000PER,A,1961,23.2,
2,AUT,SUICIDE,TOT,100000PER,A,1962,23.8,
3,AUT,SUICIDE,TOT,100000PER,A,1963,23.0,
4,AUT,SUICIDE,TOT,100000PER,A,1964,24.1,


# OECD Data - Required actions

We would like to join these three dataframes into a single one.

We start by adding a column that identifies which category (male/female/total) each dataframe refers to.


In [155]:
suicide_tot['Sex']     = 'total'
suicide_male['Sex']    = 'male'
suicide_female['Sex']  = 'female'

We then proceed to concatenate the three dataframes.

In [156]:
OECD_suicide = pd.concat([suicide_tot, suicide_male, suicide_female])
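
Note that `pd.concat` keeps each frame's original row index by default, so the combined frame contains repeated index labels. The sketch below reproduces the tagging and concatenation on small toy frames (the Austria values are copied from the outputs in this notebook; the `toy_*` names are illustrative only) and uses `ignore_index=True` as one optional way to obtain a fresh index.

```python
import pandas as pd

# Toy stand-ins for the three OECD files (values taken from the notebook outputs)
toy_tot = pd.DataFrame({'LOCATION': ['AUT', 'AUT'], 'TIME': [1960, 1961], 'Value': [24.2, 23.2]})
toy_male = pd.DataFrame({'LOCATION': ['AUT', 'AUT'], 'TIME': [1960, 1961], 'Value': [36.6, 35.5]})
toy_female = pd.DataFrame({'LOCATION': ['AUT', 'AUT'], 'TIME': [1960, 1961], 'Value': [14.7, 13.7]})

# Tag each frame with its category, then stack them
frames = {'total': toy_tot, 'male': toy_male, 'female': toy_female}
tagged = [df.assign(Sex=label) for label, df in frames.items()]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1 three times
combined = pd.concat(tagged, ignore_index=True)
print(combined)
```

Whether a fresh index is needed depends on the later steps; the subsequent `groupby` works either way.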

We can discard the INDICATOR, SUBJECT, MEASURE and FREQUENCY columns, since they are only descriptive and not needed for our analysis. The cell below also drops the Flag Codes column.

In [157]:
OECD_suicide = OECD_suicide.drop(['INDICATOR', 'SUBJECT', 'MEASURE', 'FREQUENCY', 'Flag Codes'], axis=1)

To join the OECD dataset with the one provided by the World Health Organization, we want to replace the three-letter identifiers with the full country names.
To do that, we use the pycountry library.

In [158]:
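import pycountry

# Map each country code to its full name:
# three-letter codes use the alpha-3 lookup, anything else is treated as alpha-2.
OECD_suicide['LOCATION'] = OECD_suicide['LOCATION'].apply(
    lambda x: pycountry.countries.get(alpha_3=x).name 
    if len(x) == 3 
    else pycountry.countries.get(alpha_2=x).name
)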

We can finally rename the remaining columns and see what the dataframe looks like.

TODO - Maintain the groupby even though we don't have duplicates

In [159]:
OECD_suicide.head()

,LOCATION,TIME,Value,Sex
0,Austria,1960,24.2,total
1,Austria,1961,23.2,total
2,Austria,1962,23.8,total
3,Austria,1963,23.0,total
4,Austria,1964,24.1,total


In [164]:
OECD_suicide.columns = ['country', 'year', 'suicides_no', 'sex']
OECD_suicide.groupby(['country', 'year', 'sex']).sum().head()

country,year,sex,suicides_no
Austria,1960,female,14.7
Austria,1960,male,36.6
Austria,1960,total,24.2
Austria,1961,female,13.7
Austria,1961,male,35.5
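
Regarding the TODO above, one way to confirm that the grouping changes nothing is to check for repeated (country, year, sex) keys before aggregating. This is a small sketch on a toy frame built from the values shown in the output; the names `toy`, `keys` and `has_duplicates` are illustrative.

```python
import pandas as pd

# Toy frame with the renamed columns (values from the output above)
toy = pd.DataFrame({
    'country': ['Austria', 'Austria', 'Austria'],
    'year': [1960, 1960, 1960],
    'sex': ['female', 'male', 'total'],
    'suicides_no': [14.7, 36.6, 24.2],
})

keys = ['country', 'year', 'sex']

# True if any (country, year, sex) combination appears more than once
has_duplicates = toy.duplicated(subset=keys).any()

# Without duplicates, grouping and summing yields exactly one row per input row
grouped = toy.groupby(keys).sum()
print(has_duplicates, len(grouped) == len(toy))
```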


# WHO Data - Description

The World Health Organization data comes in a single file containing figures for males and females, further divided by age group.

Suicide values are stored as absolute counts, unlike the OECD dataframe, where they are expressed per 100,000 people.

The dataframe also contains the population of each country, year, sex and age group.


In [95]:
WHO_suicide = pd.read_csv('./source/WHO_suicides_aggregate.csv')

In [90]:
WHO_suicide.shape

(43776, 6)

In [91]:
WHO_suicide.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43776 entries, 0 to 43775
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   country      43776 non-null  object 
 1   year         43776 non-null  int64  
 2   sex          43776 non-null  object 
 3   age          43776 non-null  object 
 4   suicides_no  41520 non-null  float64
 5   population   38316 non-null  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 2.0+ MB


In [92]:
WHO_suicide.head()

,country,year,sex,age,suicides_no,population
0,Albania,1985,female,15-24 years,,277900.0
1,Albania,1985,female,25-34 years,,246800.0
2,Albania,1985,female,35-54 years,,267500.0
3,Albania,1985,female,5-14 years,,298300.0
4,Albania,1985,female,55-74 years,,138700.0


# WHO Data - Required actions

As a starting point, we want to filter out the countries that are not in Europe.

We can also drop the rows in which the suicide numbers are missing.

In [96]:
WHO_suicide = WHO_suicide[WHO_suicide['suicides_no'].notna()]
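
The Europe filter mentioned above is not shown in a cell. One possible sketch follows, assuming a hand-written set of European country names: the `european_countries` set is a hypothetical and deliberately incomplete placeholder, and the toy frame uses illustrative values rather than the WHO file.

```python
import pandas as pd

# Hypothetical, incomplete list of European countries (an assumption for this sketch)
european_countries = {'Albania', 'Austria'}

# Toy frame standing in for the WHO data (illustrative values)
toy_who = pd.DataFrame({
    'country': ['Albania', 'Austria', 'Brazil'],
    'suicides_no': [25.0, float('nan'), 10.0],
})

# Keep only European countries with a recorded suicide count
in_europe = toy_who['country'].isin(european_countries)
has_count = toy_who['suicides_no'].notna()
filtered = toy_who[in_europe & has_count]
print(filtered)
```

Combining both conditions in a single boolean mask avoids creating an intermediate dataframe.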

For our analysis, we do not consider the differences between age groups.

For this reason, we can drop the 'age' column and sum the population and suicide numbers across age groups.


In [97]:
WHO_suicide = WHO_suicide.drop('age', axis=1).groupby(['country', 'year', 'sex']).sum()

Since the dataframe only distinguishes between males and females, we want to add a new 'total' category containing, for each country and year, the sum of male and female suicides and the sum of the male and female populations.

In [98]:
WHO_suicide.head()

country,year,sex,suicides_no,population
Albania,1987,female,25.0,1316900.0
Albania,1987,male,48.0,1392700.0
Albania,1988,female,22.0,1343600.0
Albania,1988,male,41.0,1420700.0
Albania,1989,female,15.0,1363300.0
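
The 'total' category described above could be built as in the following sketch. It uses a toy frame with the same (country, year, sex) index layout and the Albania 1987 values from the output; the names `toy`, `totals` and `with_totals` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame indexed like WHO_suicide after the groupby (values from the output above)
toy = pd.DataFrame({
    'country': ['Albania', 'Albania'],
    'year': [1987, 1987],
    'sex': ['female', 'male'],
    'suicides_no': [25.0, 48.0],
    'population': [1316900.0, 1392700.0],
}).set_index(['country', 'year', 'sex'])

# Sum males and females for each (country, year)
totals = toy.groupby(level=['country', 'year']).sum()

# Label the summed rows as 'total' and restore the three-level index
totals['sex'] = 'total'
totals = totals.set_index('sex', append=True)

# Append the totals to the per-sex rows
with_totals = pd.concat([toy, totals]).sort_index()
print(with_totals)
```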


We can now convert the suicide counts to suicides per 100,000 people and then drop the population column.
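
A minimal sketch of this conversion, on a toy frame with the Albania 1987 values shown earlier: the rate is the count divided by the population, multiplied by 100,000. Overwriting `suicides_no` with the rate, so that the column name matches the OECD dataframe, is an assumption of this sketch.

```python
import pandas as pd

# Toy frame indexed like WHO_suicide (values from the output above)
toy = pd.DataFrame(
    {
        'suicides_no': [25.0, 48.0],
        'population': [1316900.0, 1392700.0],
    },
    index=pd.MultiIndex.from_tuples(
        [('Albania', 1987, 'female'), ('Albania', 1987, 'male')],
        names=['country', 'year', 'sex'],
    ),
)

# Rate per 100,000 people = count / population * 100,000
toy['suicides_no'] = toy['suicides_no'] / toy['population'] * 100_000

# The population column is no longer needed once the rate is computed
toy = toy.drop('population', axis=1)
print(toy)
```

Rows whose population was missing in the source need care here: the earlier grouped sum skips missing values, so a group with no recorded population ends up with a population of 0 and the division yields an infinite rate.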