# Exploratory Data Analysis

This notebook explores the provided data sets.


## Overall Goals of this Analysis

We ultimately want to understand:

G1. The state of labour market inactivity. Possible lenses include cross-country comparisons and breakdowns by demographic.

G2. Where Nesta can focus its efforts, for example by identifying the key cross-sectional attributes that influence economic inactivity.


## 0. Load the data provided

We have four files, with the first three pertaining to Labour Force Survey (LFS) data:
1. `variable_names.csv` - A list of the variables included in the LFS quarterly and monthly data sets
2. `lfs_monthly_variables.csv` - Main labour market indicators collected through the UK labour force survey including data relating to international comparisons, and reasons for inactivity across demographic groups.
3. `lfs_quarterly_variables.csv` - Same as above. I assume this is the same data just aggregated to quarterly frequency. Verify this, and if so, use the monthly data for better granularity.
4. `qual_survey_responses.csv` - Independent survey responses from 100 participants across the UK to the question “What do you think the government should be doing to address economic inactivity (unemployment)”


### Plan
First, we will analyse the `variable_names` to understand the available fields and identify which may be useful in understanding the state of labour market inactivity. 
We will group these into:

- (A) Fields that can help us understand international comparisons, i.e. the UK's employment rate relative to other countries;
- (B) Fields that can give cross-sectional breakdowns of inactivity by demographic within the UK, e.g. sex, age, region, educational attainment, ethnicity, disability.

Secondly, analyse the shortlisted fields in `lfs_monthly_variables` and note interesting findings relevant to the goals of this notebook.

Thirdly, confirm that `lfs_quarterly_variables` is simply an aggregated view of the monthly data. If so, we can ignore this.

Lastly, the qualitative survey data in `qual_survey_responses` is less likely to be relevant for the current task, but we will explore the data to see if there are any immediate takeaways.
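As a rough illustration of the (A)/(B) grouping described above, the sketch below tags field titles by keyword. The keyword lists and the `classify_field` helper are hypothetical and are not used elsewhere in this notebook; the first example title is illustrative, and the lists would need tuning against the real `variable_names.csv`.

```python
# Hypothetical keyword lists; tune them after inspecting the real titles.
INTERNATIONAL_KEYWORDS = ("international comparison", "standardised ilo", "oecd", "eurostat")
DEMOGRAPHIC_KEYWORDS = ("aged", "male", "female", "men", "women",
                        "north", "south", "london", "wales", "scotland")


def classify_field(title: str) -> str:
    """Return 'A' (international), 'B' (UK demographic) or 'other' for a field title."""
    lowered = title.lower()
    # Check international keywords first so country-level series never fall into (B)
    if any(keyword in lowered for keyword in INTERNATIONAL_KEYWORDS):
        return "A"
    if any(keyword in lowered for keyword in DEMOGRAPHIC_KEYWORDS):
        return "B"
    return "other"


example_titles = [
    "STANDARDISED ILO UNEMPLOYMENT RATES SEASONALLY ADJUSTED GERMANY - EUROSTAT",
    "LFS: Economic inactivity rate: UK: All: Aged 16-64: %: SA",
    "UK Job Vacancies",
]
groups = {title: classify_field(title) for title in example_titles}
```

Keyword matching like this is only a first pass; the shortlist in section 1 was still built by manual inspection.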

In [1]:
from pathlib import Path

from matplotlib import pyplot as plt
import pandas as pd
import plotly
import plotly.express as px


pd.options.plotting.backend = 'plotly'
pd.options.display.max_rows = 100


DATA_DIR = Path.cwd().parent / 'data'

## 1. Variable Names

Identify relevant fields of the LFS monthly data for further analysis.


### Glossary of relevant terms
- LFS: Labour Force Survey
- AWE: Average Weekly Earnings
- ASHE: Annual Survey of Hours and Earnings
- ILO: International Labour Organization

In [2]:
var_df = pd.read_csv(DATA_DIR / 'variable_names.csv', encoding="ISO-8859-1")
var_df

Unnamed: 0,Title
0,AWE: Whole Economy Real Terms Year on Year Sin...
1,AWE: Whole Economy Real Terms Year on Year thr...
2,AWE: Whole Economy Real Terms Level (£): Seaso...
3,AWE: Whole Economy Real Terms Year on Year Sin...
4,AWE: Whole Economy Real Terms Year on Year Thr...
...,...
1825,STANDARDISED ILO UNEMPLOYMENT RATES SEASONALLY...
1826,STANDARDISED ILO UNEMPLOYMENT RATES SEASONALLY...
1827,STANDARDISED ILO UNEMPLOYMENT RATES SEASONALLY...
1828,STANDARDISED ILO UNEMPLOYMENT RATES SEASONALLY...


There are a lot of fields (1,830). Many have similar names, which suggests they will be useful for comparing countries or demographics.

Group these together by stemming the field names to the first few words.

In [3]:
def preprocess_field_name(field_name: str) -> str:
    """Some of the field names have naming inconsistencies, so preprocess these for consistency"""
    return (
        field_name
        # Some fields are sentence case, others are upper case. Cast all to lower case
        .lower()  
        # Punctuation is inconsistent so remove
        .replace(':', '')
        .replace('.', '')
        .replace('"', '')
        # Abbreviations are inconsistent
        .replace("economically", "econ")
        .replace("economic", "econ")
        .replace("inactivity", "inact")
        .replace("inactive", "inact")
        .replace("education", "educ")
    )

In [4]:
var_df['Title'].apply(lambda s: preprocess_field_name(s))

0       awe whole economy real terms year on year sing...
1       awe whole economy real terms year on year thre...
2       awe whole economy real terms level (£) seasona...
3       awe whole economy real terms year on year sing...
4       awe whole economy real terms year on year thre...
                              ...                        
1825    standardised ilo unemployment rates seasonally...
1826    standardised ilo unemployment rates seasonally...
1827    standardised ilo unemployment rates seasonally...
1828    standardised ilo unemployment rates seasonally...
1829    standardised ilo unemployment rates seasonally...
Name: Title, Length: 1830, dtype: object

In [5]:
# Stem to the first N words
N_WORDS = 3
stemmed_title = (var_df['Title']
                 .apply(lambda s: preprocess_field_name(s))
                 .str.split(' ')
                 .str[:N_WORDS]
                 .str.join(' '))

# Alternatively, stem by character count
# stemmed_title = var_df['Title'].str[:20]

In [6]:
px.bar(stemmed_title.value_counts().sort_index())

In [7]:
var_df[var_df['Title'].apply(lambda s: preprocess_field_name(s)).str.contains("lfs econ inact")].head(100)

Unnamed: 0,Title
71,LFS: Econ. inactive: Aged 16-17: Not in full-t...
72,LFS: Econ. inactive: Aged 18-24: Not in full-t...
73,LFS: Econ. inactive: Aged 16-24: Not in full-t...
74,LFS: Econ. inactive: Aged 16-17: Not in full-t...
75,LFS: Econ. inactive: Aged 18-24: Not in full-t...
76,LFS: Econ. inactive: Aged 16-24: Not in full-t...
77,LFS: Econ. inactive: Aged 16-17: Not in full-t...
78,LFS: Econ. inactive: Aged 18-24: Not in full-t...
79,LFS: Econ. inactive: Aged 16-24: Not in full-t...
80,LFS: Econ. inactive: Aged 16-17: In full-time ...


In [8]:
pd.options.display.max_colwidth = 100

In [9]:
# Preprocess the titles once, then combine the substring filters
title_clean = var_df['Title'].apply(preprocess_field_name)
var_df[title_clean.str.contains("econ inact rate")
       & title_clean.str.contains("sa")
       & title_clean.str.contains("16")
       & title_clean.str.contains("all")
       ]

Unnamed: 0,Title
128,LFS: Economic inactivity rate: Aged 16-24: UK: All: %: SA
131,LFS: Econ. inactivity rate: Aged 16-17: Not in full-time educ.: UK: All: %: SA
133,LFS: Econ. inactivity rate: Aged 16-24: Not in full-time educ.: UK: All: %: SA
140,LFS: Econ. inactivity rate: Aged 16-17: In full-time educ.: UK: All: %: SA
142,LFS: Econ. inactivity rate: Aged 16-24: In full-time educ.: UK: All: %: SA
294,LFS: Economic Inactivity Rate Annual Change: UK: All: Aged 16-64 (pp): SA
300,LFS: Economic Inactivity rate quarterly change: UK: All: Aged 16-64 (pp): SA
807,LFS: Economic inactivity rate: UK: All: Aged 16-64: %: SA
866,LFS: Economic inactivity rate: Aged 16-64: GB: All: %: SA
868,LFS: Economic inactivity rate: North East: Aged 16-64: All: %: SA


From the above plot and manual checking of the field names, potentially useful avenues to research further are:

- `AWE: Whole Economy Real Terms Level (£): Seasonally Adjusted Total Pay`
- `STANDARDISED ILO UNEMPLOYMENT RATES SEASONALLY ADJUSTED`
    - Split by countries. Compare G7 countries.
    - Alternatively `International Comparison Employment Rates`
- `LFS: Econ. inactivity rate` and `LFS: Economic inactivity rate` and `LFS: Economic Inactivity` and `LFS: Econ inactive`
    - Split by demographic (region/age/sex), lots of fields.
- `LFS: Employment rate`
    - Split by sex and age
- `LFS: Usual weekly hrs of work` and `LFS: Usual weekly hours of work`
- `LFS: Econ. inactivity reasons`
    - Retired, discouraged, long-term sick, etc
    - Split by total and female
- `LFS: Econ. inactivity wants a job` and `LFS: Econ. inactivity does not want a job`
    - reasons for inactivity, e.g. Looking after family
- `16-17 year old population`, `16-17 year old total in FTE` and `16-17 year old total not in FTE`
    - Same for 18-24
    - Focus on young people if deep-dive needed
- `LFS: Employment` and `LFS: Economic Activity`
- `Employment rates by country of birth` and `Employment rates by nationality`
- `UK Job Vacancies`

## 2. LFS Monthly


Load and pre-process the data

In [10]:
lfs_df = pd.read_csv(
    DATA_DIR / 'lfs_monthly_variables.csv',
    header=[0],
    skiprows=[1,2,3],  # The CDID, PreUnit and Unit headers aren't needed 
    encoding="ISO-8859-1",  # Some of the PreUnit values seem to cause decoding issues with the default utf-8 encoding
    parse_dates=['Title'],
    date_format="%Y %b"  # Parse dates from "2024 JAN" format
)
lfs_df = lfs_df.rename(columns={'Title': 'Date'})
lfs_df = lfs_df.set_index("Date")
lfs_df.columns = [preprocess_field_name(k) for k in lfs_df.columns]

lfs_df

Unnamed: 0_level_0,awe whole economy real terms year on year single month growth (%) seasonally adjusted regular pay,awe whole economy real terms year on year three month growth (%) seasonally adjusted regular pay,awe whole economy real terms level (£) seasonally adjusted regular pay,awe whole economy real terms year on year single month growth (%) seasonally adjusted total pay,awe whole economy real terms year on year three month growth (%) seasonally adjusted total pay,awe whole economy real terms level (£) seasonally adjusted total pay,employment rate canada (oecd) seasonally adjusted,employment rate japan (oecd) seasonally adjusted,employment rate united states (oecd) seasonally adjusted,"standardised ilo unemployment rates, seasonally adjusted, romania - eurostat",...,standardised ilo unemployment rates seasonally adjusted luxembourg - eurostat,standardised ilo unemployment rates seasonally adjusted netherlands - eurostat,standardised ilo unemployment rates seasonally adjusted austria - eurostat,standardised ilo unemployment rates seasonally adjusted portugal - eurostat,standardised ilo unemployment rates seasonally adjusted finland - eurostat,standardised ilo unemployment rates seasonally adjusted sweden - eurostat,standardised ilo unemployment rates seasonally adjusted united kingdom eurostat,standardised ilo unemployment rates seasonally adjusted united states,standardised ilo unemployment rates seasonally adjusted japan- eurostat,standardised ilo unemployment rates seasonally adjusted canada-oecd
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1992-03-01,,,,,,,,,,,...,,,,,,,,,,10.9
1992-04-01,,,,,,,,,,,...,,,,,,,,,,10.7
1992-05-01,,,,,,,,,,,...,,,,,,,,,,10.9
1992-06-01,,,,,,,,,,,...,,,,,,,,,,11.4
1992-07-01,,,,,,,,,,,...,,,,,,,,,,11.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-11-01,2.0,1.5,482.0,1.4,1.5,513.0,,,,,...,,,,,,,,,,
2023-12-01,1.9,1.8,481.0,1.6,1.4,513.0,,,,,...,,,,,,,,,,
2024-01-01,1.7,1.9,481.0,1.4,1.5,513.0,,,,,...,,,,,,,,,,
2024-02-01,2.0,1.9,482.0,2.0,1.7,516.0,,,,,...,,,,,,,,,,


### 2.1. Pay 

In [11]:
lfs_df['awe whole economy real terms level (£) seasonally adjusted total pay'].plot()

In [12]:
lfs_df['awe whole economy real terms level (£) seasonally adjusted regular pay'].plot()

### 2.2. Unemployment rates compared across countries

In [13]:
unemployment_countries_columns = [k for k in lfs_df.columns if 'ilo unemployment rates' in k]
unemployment_countries_columns

['standardised ilo unemployment rates, seasonally adjusted, romania - eurostat',
 'standardised ilo unemployment rates, seasonally adjusted, bulgaria - eurostat',
 'standardised ilo unemployment rates - total eu',
 'standardised ilo unemployment rates seasonally adjusted cyprus eurostat',
 'standardised ilo unemployment rates seasonally adjusted czech republic eurostat',
 'standardised ilo unemployment rates seasonally adjusted estonia eurostat',
 'standardised ilo unemployment rates seasonally adjusted hungary eurostat',
 'standardised ilo unemployment rates seasonally adjusted latvia eurostat',
 'standardised ilo unemployment rates seasonally adjusted lithuania eurostat',
 'standardised ilo unemployment rates seasonally adjusted malta eurostat',
 'standardised ilo unemployment rates seasonally adjusted poland eurostat',
 'standardised ilo unemployment rates seasonally adjusted slovak republic eurostat',
 'standardised ilo unemployment rates seasonally adjusted slovenia eurostat',
 's

The `international comparison employment rates` fields do not include the UK, so the `ilo unemployment rates` fields are preferred for cross-country comparison.

In [14]:
lfs_df[unemployment_countries_columns].dropna(how='all')

Unnamed: 0_level_0,"standardised ilo unemployment rates, seasonally adjusted, romania - eurostat","standardised ilo unemployment rates, seasonally adjusted, bulgaria - eurostat",standardised ilo unemployment rates - total eu,standardised ilo unemployment rates seasonally adjusted cyprus eurostat,standardised ilo unemployment rates seasonally adjusted czech republic eurostat,standardised ilo unemployment rates seasonally adjusted estonia eurostat,standardised ilo unemployment rates seasonally adjusted hungary eurostat,standardised ilo unemployment rates seasonally adjusted latvia eurostat,standardised ilo unemployment rates seasonally adjusted lithuania eurostat,standardised ilo unemployment rates seasonally adjusted malta eurostat,...,standardised ilo unemployment rates seasonally adjusted luxembourg - eurostat,standardised ilo unemployment rates seasonally adjusted netherlands - eurostat,standardised ilo unemployment rates seasonally adjusted austria - eurostat,standardised ilo unemployment rates seasonally adjusted portugal - eurostat,standardised ilo unemployment rates seasonally adjusted finland - eurostat,standardised ilo unemployment rates seasonally adjusted sweden - eurostat,standardised ilo unemployment rates seasonally adjusted united kingdom eurostat,standardised ilo unemployment rates seasonally adjusted united states,standardised ilo unemployment rates seasonally adjusted japan- eurostat,standardised ilo unemployment rates seasonally adjusted canada-oecd
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1992-03-01,,,,,,,,,,,...,,,,,,,,,,10.9
1992-04-01,,,,,,,,,,,...,,,,,,,,,,10.7
1992-05-01,,,,,,,,,,,...,,,,,,,,,,10.9
1992-06-01,,,,,,,,,,,...,,,,,,,,,,11.4
1992-07-01,,,,,,,,,,,...,,,,,,,,,,11.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-07-01,5.4,6.1,7.4,6.9,2.7,7.8,4.6,9.0,9.0,4.1,...,7.2,4.5,5.6,7.9,8.6,9.1,4.3,10.2,2.9,10.9
2020-08-01,5.3,6.2,7.5,7.4,2.8,8.0,4.4,8.8,9.6,4.1,...,6.8,4.6,5.4,8.1,8.5,9.1,,8.4,3.0,10.2
2020-09-01,5.2,6.2,7.5,8.0,2.8,,,8.4,9.8,4.0,...,6.7,4.4,5.5,7.7,8.4,9.0,,7.9,3.0,9.0
2020-10-01,5.3,5.7,7.6,10.5,2.9,,,8.0,10.4,3.9,...,6.5,4.3,5.4,7.5,8.3,8.6,,6.9,3.1,8.9


In [15]:
lfs_df[unemployment_countries_columns].plot()

In [16]:
# G7 members, plus the EU aggregate ('total eu') as a benchmark
g7_countries = {'canada', 'france', 'germany', 'italy', 'japan', 'united kingdom', 'united states', 'total eu'}
unemployment_g7_cols = [col for col in unemployment_countries_columns if any(country in col for country in g7_countries)]
unemployment_g7_cols

['standardised ilo unemployment rates - total eu',
 'standardised ilo unemployment rates seasonally adjusted germany - eurostat',
 'standardised ilo unemployment rates seasonally adjusted france - eurostat',
 'standardised ilo unemployment rates seasonally adjusted italy - eurostat',
 'standardised ilo unemployment rates seasonally adjusted united kingdom eurostat',
 'standardised ilo unemployment rates seasonally adjusted united states',
 'standardised ilo unemployment rates seasonally adjusted japan- eurostat',
 'standardised ilo unemployment rates seasonally adjusted canada-oecd']

In [17]:
lfs_df[unemployment_g7_cols].plot()

### 2.3. Economic inactivity rate

In [18]:
economic_inactivity_columns = [k for k in lfs_df.columns if 'lfs econ inact' in k]
economic_inactivity_columns

['lfs econ inact aged 16-17 not in full-time educ uk all 000s sa',
 'lfs econ inact aged 18-24 not in full-time educ uk all 000s sa',
 'lfs econ inact aged 16-24 not in full-time educ uk all 000s sa',
 'lfs econ inact aged 16-17 not in full-time educ uk male 000s sa',
 'lfs econ inact aged 18-24 not in full-time educ uk male 000s sa',
 'lfs econ inact aged 16-24 not in full-time educ uk male 000s sa',
 'lfs econ inact aged 16-17 not in full-time educ uk female 000s sa',
 'lfs econ inact aged 18-24 not in full-time educ uk female 000s sa',
 'lfs econ inact aged 16-24 not in full-time educ uk female 000s sa',
 'lfs econ inact aged 16-17 in full-time educ uk all 000s sa',
 'lfs econ inact aged 18-24 in full-time educ uk all 000s sa',
 'lfs econ inact aged 16-24 in full-time educ uk all 000s sa',
 'lfs econ inact aged 16-17 in full-time educ uk male 000s sa',
 'lfs econ inact aged 18-24 in full-time educ uk male 000s sa',
 'lfs econ inact aged 16-24 in full-time educ uk male 000s sa',
 'lf

In [19]:
lfs_df['lfs econ inact rate uk people aged 16 and over % nsa'].plot()

In [20]:
lfs_df['lfs econ inact rate uk people aged 16 and over % nsa'].resample('1YE').mean().plot()

### 2.4. Breakdown of inactivity by demographic dimension
#### 2.4.1. Sex

In [21]:
[k for k in economic_inactivity_columns if 'female' in k]

['lfs econ inact aged 16-17 not in full-time educ uk female 000s sa',
 'lfs econ inact aged 18-24 not in full-time educ uk female 000s sa',
 'lfs econ inact aged 16-24 not in full-time educ uk female 000s sa',
 'lfs econ inact aged 16-17 in full-time educ uk female 000s sa',
 'lfs econ inact aged 18-24 in full-time educ uk female 000s sa',
 'lfs econ inact aged 16-24 in full-time educ uk female 000s sa',
 'lfs econ inact rate aged 16-24 uk female % sa',
 'lfs econ inact rate aged 16-17 not in full-time educ uk female % sa',
 'lfs econ inact rate aged 18-24 not in full-time educ uk female % sa',
 'lfs econ inact rate aged 16-24 not in full-time educ uk female % sa',
 'lfs econ inact rate aged 16-17 in full-time educ uk female % sa',
 'lfs econ inact rate aged 18-24 in full-time educ uk female % sa',
 'lfs econ inact rate aged 16-24 in full-time educ uk female % sa',
 'lfs econ inact uk female aged 50-64 thousands sa',
 'lfs econ inact uk female aged 16-64 thousands sa',
 'lfs econ inact

In [22]:
cols = ['lfs econ inact rate uk female aged 16-64 % sa',
        'lfs econ inact rate uk female all aged 16 and over % sa',
        'lfs econ inact rate uk women aged 16-64 % nsa',
        'lfs econ inact reasons total uk female%']

In [23]:
lfs_df[cols].plot()

In [25]:
cols = ['lfs econ inact rate uk men aged 16-64 % nsa',
        'lfs econ inact rate uk male all aged 16 and over % sa',
        'lfs econ inact reasons total uk male%']

gender_cols = [
        'lfs econ inact rate uk women aged 16-64 % nsa',
        'lfs econ inact rate uk men aged 16-64 % nsa',
        'lfs econ inact rate uk women aged 16 and over % nsa',
        'lfs econ inact rate uk men aged 16 and over % nsa',
        'lfs econ inact rate uk female aged 16-64 % sa',
        'lfs econ inact rate uk male all aged 16 and over % sa',
        'lfs econ inact rate uk female all aged 16 and over % sa',
        'lfs econ inact rate uk male aged 16-64 % sa'
]

In [26]:
# Note: 'men' is a substring of 'women' and 'male' of 'female', so this matches both sexes
[k for k in economic_inactivity_columns if ('lfs econ inact rate uk' in k) and ('men' in k or 'male' in k)]

['lfs econ inact rate uk women aged 16-64 % nsa',
 'lfs econ inact rate uk men aged 16-64 % nsa',
 'lfs econ inact rate uk women aged 16 and over % nsa',
 'lfs econ inact rate uk men aged 16 and over % nsa',
 'lfs econ inact rate uk female aged 16-64 % sa',
 'lfs econ inact rate uk female aged 50-64 % sa',
 'lfs econ inact rate uk male aged 16-17 % sa',
 'lfs econ inact rate uk female aged 16-17 % sa',
 'lfs econ inact rate uk male aged 18-24 % sa',
 'lfs econ inact rate uk female aged 18-24 % sa',
 'lfs econ inact rate uk male aged 25-34 % sa',
 'lfs econ inact rate uk female aged 25-34 % sa',
 'lfs econ inact rate uk male aged 35-49 % sa',
 'lfs econ inact rate uk female aged 35-49 % sa',
 'lfs econ inact rate uk male aged 50-64 % sa',
 'lfs econ inact rate uk male aged 65+ % sa',
 'lfs econ inact rate uk male all aged 16 and over % sa',
 'lfs econ inact rate uk female all aged 16 and over % sa',
 'lfs econ inact rate uk male aged 16-64 % sa']
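The list above contains both sexes because the substring test `'men' in k` is also true for `'women'`, and `'male' in k` for `'female'`. That is convenient here, but if only the male series were wanted, a whole-word match would be needed. A minimal sketch using the standard `re` module follows (the `MALE_PATTERN` name is mine, and the four column names are a subset of those listed above):

```python
import re

# Matches "men" or "male" only as whole words, so "women" and "female" are excluded
MALE_PATTERN = re.compile(r"\b(men|male)\b")

columns = [
    'lfs econ inact rate uk women aged 16-64 % nsa',
    'lfs econ inact rate uk men aged 16-64 % nsa',
    'lfs econ inact rate uk female aged 16-64 % sa',
    'lfs econ inact rate uk male aged 16-64 % sa',
]

# Plain substring test: picks up all four columns
substring_matches = [c for c in columns if 'men' in c or 'male' in c]

# Whole-word test: picks up only the two male columns
whole_word_matches = [c for c in columns if MALE_PATTERN.search(c)]
```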

In [27]:
lfs_df[gender_cols].plot()

Two takeaways:
1. Economic inactivity among men has risen since COVID-19.
2. Economic inactivity is higher for women than for men. The pre-COVID progress in closing the gender gap has stalled.
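These takeaways were read off the plot. One way to put numbers on them is to compute the female-minus-male gap and compare averages before and after the pandemic. The sketch below uses invented toy values, and the `COVID_START` cut-off is my assumption; in the notebook the two columns would come from `lfs_df[gender_cols]`.

```python
import pandas as pd

# Assumed cut-off for "since COVID"; adjust as needed
COVID_START = pd.Timestamp("2020-03-01")

# Toy inactivity rates (%), invented purely for illustration
toy = pd.DataFrame(
    {
        "female": [25.0, 24.8, 24.6, 25.4, 25.5, 25.6],
        "male": [16.0, 16.0, 16.2, 17.0, 17.1, 17.2],
    },
    index=pd.to_datetime(
        ["2019-01-01", "2019-07-01", "2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01"]
    ),
)

# Gender gap in percentage points: positive means women are more inactive than men
toy["gap_pp"] = toy["female"] - toy["male"]

# Average each series before and after the cut-off, then take the difference
pre = toy[toy.index < COVID_START].mean()
post = toy[toy.index >= COVID_START].mean()
summary = pd.DataFrame({"pre_covid": pre, "post_covid": post})
summary["change"] = summary["post_covid"] - summary["pre_covid"]
```

A positive `change` for `male` would support the first takeaway, and a `gap_pp` that stops shrinking after the cut-off would support the second.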

#### 2.4.2. Age

In [28]:
[k for k in economic_inactivity_columns if ('lfs econ inact rate uk' in k) and ('18' in k)]

['lfs econ inact rate uk all aged 18-24 % sa',
 'lfs econ inact rate uk male aged 18-24 % sa',
 'lfs econ inact rate uk female aged 18-24 % sa']

In [29]:
[k for k in economic_inactivity_columns if ('lfs econ inact rate uk all' in k)]

['lfs econ inact rate uk all aged 16-64 % sa',
 'lfs econ inact rate uk all aged 50-64 % sa',
 'lfs econ inact rate uk all aged 16-17 % sa',
 'lfs econ inact rate uk all aged 18-24 % sa',
 'lfs econ inact rate uk all aged 25-34 % sa',
 'lfs econ inact rate uk all aged 35-49 % sa',
 'lfs econ inact rate uk all all aged 16 and over % sa']

In [30]:
age_cols = [
 'lfs econ inact rate uk all aged 16-64 % sa',
 'lfs econ inact rate uk all aged 50-64 % sa',
 'lfs econ inact rate uk all aged 16-17 % sa',
 'lfs econ inact rate uk all aged 18-24 % sa',
 'lfs econ inact rate uk all aged 25-34 % sa',
 'lfs econ inact rate uk all aged 35-49 % sa',
 'lfs econ inact rate uk all all aged 16 and over % sa'
 ]

In [31]:
lfs_df[age_cols].plot()

In [32]:
lfs_df[[
    'lfs econ inact rate aged 16-17 not in full-time educ uk all % sa',
    'lfs econ inact rate aged 16-17 in full-time educ uk all % sa',
    ]].plot()

In [33]:
[k for k in economic_inactivity_columns if ('lfs econ inact rate' in k) and ('16' in k)]

['lfs econ inact rate aged 16-24 uk all % sa',
 'lfs econ inact rate aged 16-24 uk male % sa',
 'lfs econ inact rate aged 16-24 uk female % sa',
 'lfs econ inact rate aged 16-17 not in full-time educ uk all % sa',
 'lfs econ inact rate aged 16-24 not in full-time educ uk all % sa',
 'lfs econ inact rate aged 16-17 not in full-time educ uk male % sa',
 'lfs econ inact rate aged 16-24 not in full-time educ uk male % sa',
 'lfs econ inact rate aged 16-17 not in full-time educ uk female % sa',
 'lfs econ inact rate aged 16-24 not in full-time educ uk female % sa',
 'lfs econ inact rate aged 16-17 in full-time educ uk all % sa',
 'lfs econ inact rate aged 16-24 in full-time educ uk all % sa',
 'lfs econ inact rate aged 16-17 in full-time educ uk male % sa',
 'lfs econ inact rate aged 16-24 in full-time educ uk male % sa',
 'lfs econ inact rate aged 16-17 in full-time educ uk female % sa',
 'lfs econ inact rate aged 16-24 in full-time educ uk female % sa',
 'lfs econ inact rate annual change

Major takeaway:
- The 16-17 age group is the worst affected.
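"Worst" could mean either the highest level or the largest rise, so it may help to report both per age band. A minimal sketch with invented toy values follows; in the notebook the input would be `lfs_df[age_cols].dropna()`.

```python
import pandas as pd

# Toy inactivity rates (%) by age band; the numbers are invented for illustration
toy = pd.DataFrame(
    {
        "aged 16-17": [60.0, 65.0, 70.0],
        "aged 18-24": [28.0, 29.0, 30.0],
        "aged 25-34": [14.0, 14.0, 13.5],
    },
    index=pd.to_datetime(["2004-01-01", "2014-01-01", "2024-01-01"]),
)

# Latest level and change (in percentage points) between the first and last rows
summary = pd.DataFrame(
    {
        "latest": toy.iloc[-1],
        "change_pp": toy.iloc[-1] - toy.iloc[0],
    }
).sort_values("change_pp", ascending=False)
```

Sorting by `change_pp` ranks the bands by how much they have deteriorated; sorting by `latest` would rank them by current level instead.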

#### 2.4.3. Region

In [34]:
[k for k in economic_inactivity_columns if ('lfs econ inact rate' in k) and ('16-64 all % sa' in k)]

['lfs econ inact rate north east aged 16-64 all % sa',
 'lfs econ inact rate north west aged 16-64 all % sa',
 'lfs econ inact rate yorks & the humber aged 16-64 all % sa',
 'lfs econ inact rate east midlands aged 16-64 all % sa',
 'lfs econ inact rate west midlands aged 16-64 all % sa',
 'lfs econ inact rate east aged 16-64 all % sa',
 'lfs econ inact rate london aged 16-64 all % sa',
 'lfs econ inact rate south east (gor) aged 16-64 all % sa',
 'lfs econ inact rate south west aged 16-64 all % sa',
 'lfs econ inact rate england aged 16-64 all % sa',
 'lfs econ inact rate wales aged 16-64 all % sa',
 'lfs econ inact rate scotland aged 16-64 all % sa']

In [35]:
region_cols = [
#  'lfs econ inact rate london aged 16-64 female % sa',
#  'lfs econ inact rate london aged 16-64 male % sa',
 
 'lfs econ inact rate north east aged 16-64 all % sa',
 'lfs econ inact rate north west aged 16-64 all % sa',
 'lfs econ inact rate yorks & the humber aged 16-64 all % sa',
 'lfs econ inact rate east midlands aged 16-64 all % sa',
 'lfs econ inact rate west midlands aged 16-64 all % sa',
 'lfs econ inact rate east aged 16-64 all % sa',
 'lfs econ inact rate london aged 16-64 all % sa',
 'lfs econ inact rate south east (gor) aged 16-64 all % sa',
 'lfs econ inact rate south west aged 16-64 all % sa',
 'lfs econ inact rate england aged 16-64 all % sa',
 'lfs econ inact rate wales aged 16-64 all % sa',
 'lfs econ inact rate scotland aged 16-64 all % sa'

]

In [36]:
lfs_df[region_cols].plot()

In [37]:
[k for k in lfs_df.columns if ('ireland' in k)]

['northern ireland - working age inact levels 000s sa men',
 'northern ireland - working age inact rates % sa men',
 'workforce jobs sa  northern ireland (thousands)',
 'lfs econ activity rate northern ireland aged 16-64 all % sa',
 'lfs employment rate northern ireland aged 16-64 all % sa',
 'lfs employment rate northern ireland aged 16-64 female % sa',
 'northern ireland - 16-64 inact levels 000s sa people',
 'northern ireland - 16-64 inact levels 000s sa women',
 'northern ireland - 16-64 inact rates % sa people',
 'northern ireland - 16-64 inact rates % sa women',
 'lfs econ active northern ireland all thousands sa',
 'lfs population aged 16 and over northern ireland all thousands nsa',
 'international comparison employment rates ireland',
 'lfs econ active northern ireland male thousands sa',
 'lfs econ active northern ireland female thousands sa',
 'lfs ilo unemployed northern ireland all thousands sa',
 'lfs ilo unemployment rate northern ireland all % sa',
 'lfs ilo unemployed 

In [38]:
econ_act_cols = [k for k in lfs_df.columns if ('lfs econ activity rate' in k and 'all' in k)]
econ_act_cols

['lfs econ activity rate uk all aged 16-64 (%) sa',
 'lfs econ activity rate uk all aged 50-64 % sa',
 'lfs econ activity rate north east aged 16-64 all % sa',
 'lfs econ activity rate north west (gor) aged 16-64 all % sa',
 'lfs econ activity rate yorks and humber aged 16-64 all % sa',
 'lfs econ activity rate east midlands aged 16-64 all % sa',
 'lfs econ activity rate west midlands aged 16-64 all % sa',
 'lfs econ activity rate east of england aged 16-64 all % sa',
 'lfs econ activity rate london aged 16-64 all % sa',
 'lfs econ activity rate south east (gor) aged 16-64 all % sa',
 'lfs econ activity rate south west aged 16-64 all % sa',
 'lfs econ activity rate england aged 16-64 all % sa',
 'lfs econ activity rate wales aged 16-64 all % sa',
 'lfs econ activity rate scotland aged 16-64 all % sa',
 'lfs econ activity rate great britain aged 16-64 all % sa',
 'lfs econ activity rate northern ireland aged 16-64 all % sa',
 'lfs econ activity rate uk all all aged 16 and over % sa',
 '

In [39]:
lfs_df[econ_act_cols].plot()

In [40]:
region_act_cols = [k for k in econ_act_cols if ('aged 16-64 all % sa' in k)]
region_act_cols


['lfs econ activity rate north east aged 16-64 all % sa',
 'lfs econ activity rate north west (gor) aged 16-64 all % sa',
 'lfs econ activity rate yorks and humber aged 16-64 all % sa',
 'lfs econ activity rate east midlands aged 16-64 all % sa',
 'lfs econ activity rate west midlands aged 16-64 all % sa',
 'lfs econ activity rate east of england aged 16-64 all % sa',
 'lfs econ activity rate london aged 16-64 all % sa',
 'lfs econ activity rate south east (gor) aged 16-64 all % sa',
 'lfs econ activity rate south west aged 16-64 all % sa',
 'lfs econ activity rate england aged 16-64 all % sa',
 'lfs econ activity rate wales aged 16-64 all % sa',
 'lfs econ activity rate scotland aged 16-64 all % sa',
 'lfs econ activity rate great britain aged 16-64 all % sa',
 'lfs econ activity rate northern ireland aged 16-64 all % sa']

In [41]:
(100 - lfs_df[region_act_cols]).plot()

Takeaways:
- Geographic divide: the south of England has the lowest economic inactivity. The further from the south you go, the higher inactivity is.
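To make the regional ranking explicit rather than reading it off the chart, the latest activity rates can be converted to inactivity rates and sorted. The sketch below uses invented toy values for four regions; in the notebook the input could be something like `lfs_df[region_act_cols].dropna(how='all').iloc[-1]`.

```python
import pandas as pd

# Toy activity rates (%) for a single date; values are invented for illustration
latest_activity = pd.Series(
    {
        "north east": 74.0,
        "london": 79.0,
        "south east": 81.0,
        "northern ireland": 72.5,
    }
)

# Inactivity rate is the complement of the activity rate for the same population
latest_inactivity = (100 - latest_activity).sort_values(ascending=False)
```

This mirrors the `100 - lfs_df[region_act_cols]` transformation used in the plot, which is needed because Northern Ireland only appears in the activity-rate columns.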

#### 2.4.4. Reason

In [42]:
economic_inactivity_columns

['lfs econ inact aged 16-17 not in full-time educ uk all 000s sa',
 'lfs econ inact aged 18-24 not in full-time educ uk all 000s sa',
 'lfs econ inact aged 16-24 not in full-time educ uk all 000s sa',
 'lfs econ inact aged 16-17 not in full-time educ uk male 000s sa',
 'lfs econ inact aged 18-24 not in full-time educ uk male 000s sa',
 'lfs econ inact aged 16-24 not in full-time educ uk male 000s sa',
 'lfs econ inact aged 16-17 not in full-time educ uk female 000s sa',
 'lfs econ inact aged 18-24 not in full-time educ uk female 000s sa',
 'lfs econ inact aged 16-24 not in full-time educ uk female 000s sa',
 'lfs econ inact aged 16-17 in full-time educ uk all 000s sa',
 'lfs econ inact aged 18-24 in full-time educ uk all 000s sa',
 'lfs econ inact aged 16-24 in full-time educ uk all 000s sa',
 'lfs econ inact aged 16-17 in full-time educ uk male 000s sa',
 'lfs econ inact aged 18-24 in full-time educ uk male 000s sa',
 'lfs econ inact aged 16-24 in full-time educ uk male 000s sa',
 'lf

In [43]:
lfs_df[[
    'lfs econ inact reasons does not want a job uk 16-64%',
    'lfs econ inact reasons wants a job uk 16-64%',
]].plot()

In [44]:
reason_cols = [k for k in economic_inactivity_columns if ('a job' in k) and ('%' in k) and ('male' not in k) and ('total' not in k)]
reason_cols

['lfs econ inact reasons does not want a job uk 16-64%',
 'lfs econ inact reasons wants a job uk 16-64%',
 'lfs econ inact wants a job student uk 16-64%',
 'lfs econ inact wants a job looking after family/home uk 16-64%',
 'lfs econ inact wants a job temp sick uk 16-64%',
 'lfs econ inact wants a job long-term sick uk 16-64%',
 'lfs econ inact wants a job discouraged workers uk 16-64%',
 'lfs econ inact wants a job other uk 16-64%',
 'lfs econ inact does not want a job student uk 16-64%',
 'lfs econ inact does not want a job temp sick uk 16-64%',
 'lfs econ inact does not want a job long-term sick uk 16-64%',
 'lfs econ inact does not want a job retired uk 16-64%',
 'lfs econ inact does not want a job other uk 16-64%']

In [45]:
lfs_df[reason_cols].plot()

Key takeaways:
1. The vast majority of those who are inactive do NOT want a job (80%).
2. Long-term sickness is the biggest issue, and it has been getting worse since the pandemic.
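Both takeaways can be checked numerically by separating the two headline columns from the detailed reasons, then looking at the largest reason and the change over time. The sketch below uses shortened column names and invented toy values; in the notebook the input would be `lfs_df[reason_cols]`.

```python
import pandas as pd

# Toy shares of the inactive population (%); values are invented for illustration
toy = pd.DataFrame(
    {
        "does not want a job": [78.0, 80.0],
        "wants a job": [22.0, 20.0],
        "does not want a job long-term sick": [18.0, 23.0],
        "does not want a job student": [21.0, 22.0],
        "does not want a job retired": [13.0, 11.0],
    },
    index=pd.to_datetime(["2019-12-01", "2023-12-01"]),
)

# The two headline columns are totals, so exclude them when ranking reasons
TOTAL_COLS = ["does not want a job", "wants a job"]
detail = toy.drop(columns=TOTAL_COLS)

# Largest single reason at the latest date
top_reason = detail.iloc[-1].idxmax()

# Change in percentage points between the first and last rows
change_pp = (detail.iloc[-1] - detail.iloc[0]).sort_values(ascending=False)
```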

#### 2.4.5. Extension to GDP

The ultimate mission is to grow the economy, so a proxy for the size of the economy is useful. I have downloaded monthly GDP data from https://www.ons.gov.uk/economy/grossdomesticproductgdp/datasets/monthlygdpandmainsectorstofourdecimalplaces

Another approach to picking out important fields is to train a random forest model on the data and use the feature importances to surface any pertinent fields that may have been missed when examining the fields "by eye". I haven't done this here, but it could be an interesting extension for a more thorough analysis.

In [47]:
gdp_df = pd.read_csv(DATA_DIR / 'UK_GDP_monthly.csv', parse_dates=['Month'], date_format="%Y%b")
gdp_df = gdp_df.rename(columns={'Month': 'Date'})
gdp_df = gdp_df.set_index('Date')
gdp_df

Unnamed: 0_level_0,Monthly GDP (A-T),Agriculture (A),Construction (F),Production (B-E),Services (G-T)
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1997-01-01,63.3398,51.9196,88.6822,81.8493,58.9698
1997-02-01,63.9959,52.5957,89.3841,82.0903,59.7182
1997-03-01,64.0355,53.2952,90.3395,80.8967,59.9700
1997-04-01,64.6273,54.6361,92.6826,82.9387,60.1678
1997-05-01,64.1371,55.2798,93.1397,82.5730,59.5863
...,...,...,...,...,...
2024-01-01,102.4029,86.1374,104.7962,93.9243,103.8100
2024-02-01,102.6546,86.0256,103.1278,94.8269,104.1036
2024-03-01,103.0951,86.1430,103.3000,94.9909,104.6154
2024-04-01,103.1284,86.2512,102.2113,94.0890,104.8967


In [48]:
gdp_df.plot()

In [49]:
gdp_df.pct_change().plot()
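
To relate GDP to a labour market series, the two need a shared monthly index. The sketch below is one possible way to line them up, assuming both are indexed by month-start dates. `gdp_vs_series` is a hypothetical helper and the toy numbers are placeholders, not ONS figures. It uses 12-month growth rather than month-on-month change, which tends to be noisy.

```python
import pandas as pd


def gdp_vs_series(gdp: pd.Series, other: pd.Series, periods: int = 12) -> pd.DataFrame:
    """Align GDP growth over `periods` months with another monthly series."""
    growth = gdp.pct_change(periods=periods) * 100  # growth in percent
    # Keep only the dates present in both series.
    aligned = pd.concat({"gdp_growth_pct": growth, "other": other}, axis=1, join="inner")
    # The first `periods` rows of growth are NaN by construction.
    return aligned.dropna()


# Toy data for illustration only.
gdp_dates = pd.date_range("2020-01-01", periods=24, freq="MS")
toy_gdp = pd.Series(range(100, 124), index=gdp_dates, dtype=float)
other_dates = pd.date_range("2020-07-01", periods=18, freq="MS")
toy_other = pd.Series(range(18), index=other_dates, dtype=float)

aligned = gdp_vs_series(toy_gdp, toy_other)
print(aligned.head())
print(aligned["gdp_growth_pct"].corr(aligned["other"]))
```

With the notebook's frames this might be called as `gdp_vs_series(gdp_df['Monthly GDP (A-T)'], lfs_df[some_column])`, where `some_column` stands for whichever LFS field is of interest and `lfs_df` is assumed to have a datetime index.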

## 3. LFS Quarterly

Verifying that the quarterly data contains the same columns as the monthly.

In [53]:
lfs_quarterly_df = pd.read_csv(
    DATA_DIR / 'lfs_quarterly_variables.csv',
    header=[0],
    skiprows=[1,2,3],  # The CDID, PreUnit and Unit headers aren't needed 
    encoding="ISO-8859-1",  # Some of the PreUnit values seem to cause decoding issues with the default utf-8 encoding
    # parse_dates=['Title'],
    # date_format="%Y %b"  # Parse dates from "2024 JAN" format
)
lfs_quarterly_df = lfs_quarterly_df.rename(columns={'Title': 'Date'})
# lfs_quarterly_df['Date'] = pd.to_datetime(lfs_quarterly_df['Date'])
lfs_quarterly_df = lfs_quarterly_df.set_index("Date")
lfs_quarterly_df.columns = [preprocess_field_name(k) for k in lfs_quarterly_df.columns]

lfs_quarterly_df

Date,awe whole economy real terms year on year single month growth (%) seasonally adjusted regular pay,awe whole economy real terms year on year three month growth (%) seasonally adjusted regular pay,awe whole economy real terms level (£) seasonally adjusted regular pay,awe whole economy real terms year on year single month growth (%) seasonally adjusted total pay,awe whole economy real terms year on year three month growth (%) seasonally adjusted total pay,awe whole economy real terms level (£) seasonally adjusted total pay,employment rate canada (oecd) seasonally adjusted,employment rate japan (oecd) seasonally adjusted,employment rate united states (oecd) seasonally adjusted,"standardised ilo unemployment rates, seasonally adjusted, romania - eurostat",...,standardised ilo unemployment rates seasonally adjusted luxembourg - eurostat,standardised ilo unemployment rates seasonally adjusted netherlands - eurostat,standardised ilo unemployment rates seasonally adjusted austria - eurostat,standardised ilo unemployment rates seasonally adjusted portugal - eurostat,standardised ilo unemployment rates seasonally adjusted finland - eurostat,standardised ilo unemployment rates seasonally adjusted sweden - eurostat,standardised ilo unemployment rates seasonally adjusted united kingdom eurostat,standardised ilo unemployment rates seasonally adjusted united states,standardised ilo unemployment rates seasonally adjusted japan- eurostat,standardised ilo unemployment rates seasonally adjusted canada-oecd
1994 Q1,,,,,,,,69.3,71.6,,...,3.2,6.8,,7.3,17.0,9.7,9.8,6.6,2.9,11.0
1994 Q2,,,,,,,,69.5,71.8,,...,3.3,7.0,,7.6,17.1,9.4,9.5,6.2,2.8,10.6
1994 Q3,,,,,,,,69.3,72.0,,...,3.1,7.3,,7.7,16.5,9.2,9.2,6.0,3.0,10.1
1994 Q4,,,,,,,,69.2,72.6,,...,3.0,7.7,,7.8,15.7,9.0,8.8,5.6,2.9,9.8
1995 Q1,,,,,,,67.7,69.3,72.7,,...,2.9,8.1,3.9,8.0,15.1,8.8,8.7,5.5,3.0,9.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023 Q1,,,473.0,,,507.0,,,,,...,,,,,,,,,,
2023 Q2,,,476.0,,,515.0,,,,,...,,,,,,,,,,
2023 Q3,,,480.0,,,515.0,,,,,...,,,,,,,,,,
2023 Q4,,,480.0,,,513.0,,,,,...,,,,,,,,,,


In [54]:
len(lfs_quarterly_df.columns)

1830

In [55]:
len(lfs_df.columns)

1830

In [56]:
all(lfs_df.columns == lfs_quarterly_df.columns)

True

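The quarterly data contains the same 1,830 columns as the monthly data, so we set the quarterly file aside and work with the monthly data. Note that this check compares column names only, not the values themselves.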
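
To go one step further and test whether the quarterly values are an aggregation of the monthly ones, the values can be compared as well. This is a minimal sketch under stated assumptions: the monthly frame has a `DatetimeIndex`, the quarterly index uses labels like `1994 Q1`, and a quarterly value is taken to be the mean of the months in that calendar quarter. The actual aggregation rule is not confirmed by the source and may differ. `compare_quarterly_to_monthly` is a hypothetical helper and the toy data is made up.

```python
import pandas as pd


def compare_quarterly_to_monthly(
    monthly: pd.DataFrame, quarterly: pd.DataFrame, tol: float = 0.05
) -> pd.Series:
    """Per column, the share of overlapping values that agree within `tol`."""
    q = quarterly.copy()
    # Turn labels such as "1994 Q1" into quarterly periods ("1994Q1").
    q.index = pd.PeriodIndex(q.index.str.replace(" ", "", regex=False), freq="Q")
    # Assumed aggregation rule: mean of the months in each calendar quarter.
    m = monthly.groupby(monthly.index.to_period("Q")).mean()

    idx = m.index.intersection(q.index)
    cols = m.columns.intersection(q.columns)
    diff = (m.loc[idx, cols] - q.loc[idx, cols]).abs()

    both_present = diff.notna()
    matched = (diff <= tol) & both_present
    return matched.sum() / both_present.sum()


# Toy data for illustration only.
monthly_toy = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], "y": [10.0, 10.0, 10.0, 20.0, 20.0, 20.0]},
    index=pd.date_range("2023-01-01", periods=6, freq="MS"),
)
quarterly_toy = pd.DataFrame(
    {"x": [2.0, 5.0], "y": [10.0, 99.0]},
    index=["2023 Q1", "2023 Q2"],
)
result = compare_quarterly_to_monthly(monthly_toy, quarterly_toy)
print(result)
```

Columns with a match share well below 1 would suggest the quarterly file is not a simple mean of the monthly values for that field.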

## 4. Survey responses

Independent survey responses from 100 participants across the UK to the question “what do you think the government should be doing to address economic inactivity (unemployment)”.

Single response survey collected 15 July 2024.

In [57]:
survey_df = pd.read_csv(DATA_DIR / 'qual_survey_responses.csv')
survey_df

Unnamed: 0,Participant ID,Employment Status,UK Region,Response
0,P001,Unemployed,Scotland,The government should focus on providing more comprehensive job training programs. They need to ...
1,P002,Employed,North West,I believe the government should create more incentives for businesses to hire and train unemploy...
2,P003,Self-employed,London,"The government needs to address the root causes of unemployment, such as lack of education and s..."
3,P004,Unemployed,Wales,The government should increase unemployment benefits to provide better support during job search...
4,P005,Employed,South East,I think the government is already doing too much. People need to take more responsibility for th...
...,...,...,...,...
105,P106,Employed,East of England,I think the government should focus on promoting employee ownership and profit-sharing schemes. ...
106,P107,Unemployed,West Midlands,The government needs to improve support for people with criminal records seeking employment. The...
107,P108,Self-employed,North East,"The government should focus on supporting the night-time economy, which can be a significant sou..."
108,P109,Unemployed,Scotland,The government should provide more support for people looking to start cooperatives or worker-ow...


In [58]:
survey_df['Response']

0      The government should focus on providing more comprehensive job training programs. They need to ...
1      I believe the government should create more incentives for businesses to hire and train unemploy...
2      The government needs to address the root causes of unemployment, such as lack of education and s...
3      The government should increase unemployment benefits to provide better support during job search...
4      I think the government is already doing too much. People need to take more responsibility for th...
                                                      ...                                                 
105    I think the government should focus on promoting employee ownership and profit-sharing schemes. ...
106    The government needs to improve support for people with criminal records seeking employment. The...
107    The government should focus on supporting the night-time economy, which can be a significant sou...
108    The government should provide more support for people looking to start cooperatives or worker-ow...

There is not much quantitative analysis to be done here without going deep into NLP, which is overkill for such a small sample and a short task.

Maybe make a pretty wordcloud out of these responses?
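
A word cloud is essentially a picture of word frequencies, so a lighter first step is to count the most common words directly. The sketch below uses only the standard library. `top_words` is a hypothetical helper, the stop-word list is an ad hoc assumption rather than a standard one, and the two sample strings are shortened from the responses shown above.

```python
import re
from collections import Counter

# Ad hoc stop words (an assumption); extend as needed.
STOP_WORDS = {
    "the", "should", "to", "on", "more", "a", "and", "of", "for",
    "i", "they", "is", "in", "that", "be", "government",
}


def top_words(responses, n=10, stop_words=STOP_WORDS):
    """Return the `n` most frequent non-stop words across all responses."""
    counts = Counter()
    for text in responses:
        # Lower-case and keep alphabetic tokens (apostrophes allowed).
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts.most_common(n)


sample = [
    "The government should focus on providing more comprehensive job training programs.",
    "The government should increase unemployment benefits to provide better support during job search",
]
print(top_words(sample, n=3))
```

On the real data this could be called as `top_words(survey_df['Response'])`, or applied separately to each `Employment Status` group to compare themes.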