# Pre-Processing Labor Force Survey Data

The [OpenDP documentation](https://docs.opendp.org) uses data derived from the [EU Labor Force Survey](https://ec.europa.eu/eurostat/web/microdata/public-microdata/labour-force-survey):

> Public microdata, also referred to as public use files, for the EU Labour force survey (LFS) were created to enable interested parties to become familiar with microdata.
>
> At the same time, the privacy of respondents had to be protected. The structure of public microdata is the same as that of research microdata available in scientific use files.
>
> Public microdata enable researchers and trainers to develop programmes using the same formats and variable names as for the actual LFS scientific use files. The files have been designed so that programmes and procedures created with public microdata will also work with scientific use files.

Code developed against a public microdata set like this one can also be run on the scientific use files, and we believe differential privacy would be a good tool for ensuring that statistics derived from scientific use files do not inadvertently reveal personal information.

To reduce the download size for the tutorials, we preprocessed the data by selecting a single country (France), dropping unused columns, and sampling a random subset of the rows. The code presented in the tutorials could be run on the original public microdata or, for that matter, on the full private scientific use files.

In [1]:
import os

import pandas as pd

dfs = []
# Sort the listing so the row order (and the seeded sample taken later) is reproducible
for filename in sorted(os.listdir('FR_PUF_LFS')):
    # Keep only CSV files whose names do not contain "_Y"
    if filename.endswith('.csv') and "_Y" not in filename:
        file_path = os.path.join('FR_PUF_LFS', filename)
        temp = pd.read_csv(file_path)
        dfs.append(temp)

df = pd.concat(dfs, ignore_index=True)

# Drop columns, then rows, that contain only NaN values
df = df.dropna(axis=1, how='all')
df = df.dropna(axis=0, how='all')
# Keep only the last character of QUARTER, converted to an integer
df['QUARTER'] = df['QUARTER'].apply(lambda x: int(x[-1]))
df


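Note: this cell expects the public microdata files in a local directory named `FR_PUF_LFS`. If that directory is missing, the cell fails with `FileNotFoundError: [Errno 2] No such file or directory: 'FR_PUF_LFS'`.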

In [None]:
df.to_csv('all_FR_LFS.csv', index=False)
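
The loading and cleaning steps above can be tried without downloading the microdata. The sketch below wraps them in a hypothetical helper, `load_quarterly_csvs`, and runs it on three tiny synthetic CSV files in a temporary directory. The file names, the `AGE` and `EMPTY` columns, and the `2004Q1`-style quarter labels are invented for illustration; the real files may be laid out differently.

```python
import os
import tempfile

import pandas as pd


def load_quarterly_csvs(data_dir):
    """Concatenate every CSV in data_dir whose name does not contain "_Y"."""
    frames = [
        pd.read_csv(os.path.join(data_dir, name))
        for name in sorted(os.listdir(data_dir))
        if name.endswith(".csv") and "_Y" not in name
    ]
    combined = pd.concat(frames, ignore_index=True)
    # Drop columns, then rows, that contain only missing values.
    combined = combined.dropna(axis=1, how="all").dropna(axis=0, how="all")
    # Keep the trailing digit of the label, e.g. "2004Q1" -> 1 (assumed format).
    combined["QUARTER"] = combined["QUARTER"].apply(lambda q: int(str(q)[-1]))
    return combined


with tempfile.TemporaryDirectory() as tmp:
    # Synthetic stand-ins for the real files; all names and values are made up.
    pd.DataFrame(
        {"QUARTER": ["2004Q1", "2004Q1"], "AGE": [30, 45], "EMPTY": [None, None]}
    ).to_csv(os.path.join(tmp, "FR_2004_Q1.csv"), index=False)
    pd.DataFrame(
        {"QUARTER": ["2004Q2"], "AGE": [52], "EMPTY": [None]}
    ).to_csv(os.path.join(tmp, "FR_2004_Q2.csv"), index=False)
    # This file name contains "_Y", so the helper skips it.
    pd.DataFrame(
        {"QUARTER": ["2004"], "AGE": [99], "EMPTY": [None]}
    ).to_csv(os.path.join(tmp, "FR_2004_Y.csv"), index=False)

    toy = load_quarterly_csvs(tmp)

print(toy)
```

In this toy run the all-empty `EMPTY` column is dropped and the `_Y` file never enters the result, which mirrors what the notebook cell does to the real data.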

## Create Sample Dataset

In [None]:
sample = df.sample(50_000, random_state=1)
sample.to_csv('sample_FR_LFS.csv', index=False)
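
Fixing `random_state` makes the sample repeatable, provided the input rows arrive in the same order on every run. The following minimal sketch illustrates this on a synthetic frame. The `row_id` column and the sizes are hypothetical and are not taken from the survey data.

```python
import pandas as pd

# Toy stand-in for the full dataset; the column name is hypothetical.
toy = pd.DataFrame({"row_id": range(1_000)})

# sample() raises ValueError if n exceeds the row count (without replacement),
# so cap n at the number of available rows.
n = min(50, len(toy))

first = toy.sample(n, random_state=1)
second = toy.sample(n, random_state=1)

# Same seed and same input row order: both draws select the same rows.
print(first.equals(second))
```

If the input order changes between runs, for example because files are concatenated in a different order, the same seed can select different records.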