# Pre-Processing Labor Force Survey Data

The [OpenDP documentation](https://docs.opendp.org) uses data derived from the [EU Labor Force Survey](https://ec.europa.eu/eurostat/web/microdata/public-microdata/labour-force-survey):

> Public microdata, also referred to as public use files, for the EU Labour force survey (LFS) were created to enable interested parties to become familiar with microdata.
>
> At the same time, the privacy of respondents had to be protected. The structure of public microdata is the same as that of research microdata available in scientific use files.
>
> Public microdata enable researchers and trainers to develop programmes using the same formats and variable names as for the actual LFS scientific use files. The files have been designed so that programmes and procedures created with public microdata will also work with scientific use files.

Code developed against a public microdata set like this one can also be run on the scientific use files. We believe differential privacy would be a good tool for ensuring that statistics derived from scientific use files do not inadvertently reveal personal information.

To reduce the download size for the tutorial, we've preprocessed the data by selecting a single country (France), concatenating its quarterly files, dropping unused columns, and sampling a subset of the rows into a single CSV. The code we'll present in the tutorials could be run on the original public microdata or, for that matter, on the full private scientific use files.

In [2]:
![ -e FR_PUF_LFS.zip ] || wget https://ec.europa.eu/eurostat/cache/website/microdata/public-microdata-lfs/FR_PUF_LFS.zip
!unzip -q -o FR_PUF_LFS.zip -d FR_PUF_LFS
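
For environments without `wget` and `unzip`, the same download-if-missing and extract steps can be sketched with the Python standard library. This is a minimal sketch, not part of the original notebook: `fetch_and_extract` is a hypothetical helper name, and the URL is copied from the shell cell above.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL copied from the shell cell above.
LFS_URL = (
    "https://ec.europa.eu/eurostat/cache/website/microdata/"
    "public-microdata-lfs/FR_PUF_LFS.zip"
)


def fetch_and_extract(url, zip_path, out_dir):
    """Download url to zip_path unless the file already exists, then extract it.

    Hypothetical helper for illustration only.
    Returns the sorted names of all files found under out_dir.
    """
    zip_path = Path(zip_path)
    out_dir = Path(out_dir)
    if not zip_path.exists():
        # Same guard as `[ -e FR_PUF_LFS.zip ] || wget ...` in the shell cell.
        urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        # extractall overwrites existing files, so re-running is safe.
        archive.extractall(out_dir)
    return sorted(p.name for p in out_dir.rglob("*") if p.is_file())
```

Calling `fetch_and_extract(LFS_URL, "FR_PUF_LFS.zip", "FR_PUF_LFS")` would mirror the two shell commands; the call is left out of the block so that importing it does not trigger a download.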

In [9]:
import pandas as pd
from pathlib import Path

dfs = []
for csv_path in Path('FR_PUF_LFS').glob('*Q*.csv'):
    dfs.append(pd.read_csv(csv_path))

df = pd.concat(dfs, ignore_index=True)

# Drop empty columns and rows:
df = df.dropna(axis=1, how='all')
df = df.dropna(axis=0, how='all')

# Convert quarter to integer:
df['QUARTER_N'] = df['QUARTER'].apply(lambda quarter: int(quarter.replace('Q', '')))

# TODO: Select a subset of columns
df

Unnamed: 0,COEFF,QUARTER,REFYEAR,REFWEEK,INTWEEK,COUNTRY,DEGURBA,HHINST,INTWAVE,INTQUEST,...,ISCOPR1D,DURUNE,EDUC4WN,HATLEV1D,STARTIME,LEAVCLAS,NACE1D,NACE2J1D,NACEPR1D,HHTYPE
0,0.962105,Q1,2004,7,9.0,FR,0.0,9.0,5,2,...,,9.0,0.0,H,96.0,,,9.0,9.0,1
1,0.794941,Q1,2004,5,6.0,FR,0.0,9.0,3,2,...,,9.0,1.0,M,336.0,,,9.0,9.0,1
2,0.674642,Q1,2004,7,8.0,FR,0.0,9.0,2,2,...,,9.0,9.0,9,999.0,,9.0,9.0,9.0,1
3,0.320986,Q1,2004,11,12.0,FR,3.0,9.0,6,2,...,,9.0,0.0,L,372.0,,,9.0,9.0,1
4,0.464221,Q1,2004,11,12.0,FR,0.0,9.0,6,2,...,,9.0,9.0,9,999.0,,9.0,9.0,9.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4159827,0.654399,Q1,2008,6,8.0,FR,0.0,9.0,1,2,...,,9.0,0.0,M,72.0,,F,9,9,1
4159828,1.392808,Q1,2008,2,3.0,FR,0.0,9.0,5,2,...,,9.0,0.0,H,96.0,,H,9,9,1
4159829,0.355306,Q1,2008,1,2.0,FR,0.0,9.0,4,2,...,,9.0,0.0,H,11.0,,G,9,9,1
4159830,0.854188,Q1,2008,10,11.0,FR,0.0,9.0,5,2,...,,9.0,0.0,L,300.0,,H,9,9,1
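
The notebook leaves column selection as a TODO. As a rough illustration of what that step might look like, the sketch below keeps a few columns from a toy frame and converts the quarter with a vectorized string operation. The toy values are made up, and the list in `KEEP_COLUMNS` is a hypothetical choice, not the set of columns used for the tutorial data.

```python
import pandas as pd

# Toy stand-in for the concatenated LFS frame; the values are made up.
toy = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2],
        "COEFF": [0.96, 0.79, 0.67],
        "QUARTER": ["Q1", "Q2", "Q4"],
        "REFYEAR": [2004, 2004, 2008],
        "HHTYPE": [1, 1, 1],
    }
)

# Hypothetical selection of columns to keep (an assumption for illustration).
KEEP_COLUMNS = ["COEFF", "QUARTER", "REFYEAR"]

# Copy so that adding a column does not touch the original frame.
subset = toy[KEEP_COLUMNS].copy()

# Vectorized equivalent of the lambda-based quarter conversion.
subset["QUARTER_N"] = (
    subset["QUARTER"].str.replace("Q", "", regex=False).astype(int)
)
```

Selecting by an explicit list of names also drops the `Unnamed: 0` column, which looks like a saved index rather than a survey variable.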


In [11]:
sample = df.sample(50_000, random_state=1)
sample.to_csv('sample_FR_LFS.csv', index=False)

In [14]:
!zip 2-sample_FR_LFS.csv.zip sample_FR_LFS.csv

updating: sample_FR_LFS.csv (deflated 90%)
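
Before publishing an archive like this, it can be worth checking that the zipped CSV reads back unchanged. The sketch below does this on a toy frame with made-up values; the file names are hypothetical, and it uses Python's `zipfile` module in place of the `zip` command so that it runs anywhere.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the 50,000-row sample; the values are made up.
toy = pd.DataFrame({"COEFF": [0.5, 1.25], "QUARTER": ["Q1", "Q3"]})

# Hypothetical file names in a scratch directory.
work_dir = Path(tempfile.mkdtemp())
csv_path = work_dir / "toy_sample.csv"
zip_path = work_dir / "toy_sample.csv.zip"

# Write the CSV without the index, as in the sampling cell.
toy.to_csv(csv_path, index=False)

# Compress the CSV into a single-file zip archive.
with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.write(csv_path, arcname=csv_path.name)

# pandas infers zip compression from the file extension
# when the archive holds exactly one file.
restored = pd.read_csv(zip_path)
```

Because the index was not written, `restored` has the same columns as `toy`, and the tutorials can load the `.zip` file directly with `pd.read_csv`.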
