# Pre-Processing Labor Force Survey Data

In this tutorial, we'll pre-process the dataset used throughout the series of Polars & Context tutorials.

The examples in this series use the Labor Force Survey microdata released by Eurostat. The data are organized by country, year, and quarter. In this tutorial, we use the data from France for all years.

The structure of the public microdata is the same as that of the research microdata available in scientific use files. The files are designed so that code written for the public microdata will also work with the private data.

The data are protected using traditional statistical disclosure control methods such as global recoding, local suppression and addition of noise. 
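
To make these three techniques concrete, here is a minimal sketch on a made-up table. The column names (`age`, `region`, `income`), the bin edges, the rarity threshold, and the noise scale are all illustrative assumptions; this is not Eurostat's actual procedure, only a toy version of each idea.

```python
import numpy as np
import pandas as pd

# Toy data: every value here is invented for illustration.
toy = pd.DataFrame({
    "age": [23, 37, 41, 58, 64, 19],
    "region": ["A", "A", "B", "A", "C", "A"],
    "income": [1800.0, 2500.0, 3100.0, 2900.0, 2200.0, 1500.0],
})

# Global recoding: replace exact ages with coarse bands for every record.
toy["age_band"] = pd.cut(
    toy["age"], bins=[0, 30, 50, 120], labels=["<=30", "31-50", "51+"]
)

# Local suppression: blank out individual values that occur too rarely
# (here, regions seen fewer than 2 times become missing).
counts = toy["region"].map(toy["region"].value_counts())
toy["region_protected"] = toy["region"].where(counts >= 2)

# Noise addition: perturb a numeric variable with random noise.
rng = np.random.default_rng(0)
toy["income_noisy"] = toy["income"] + rng.normal(0, 100, size=len(toy))

protected = toy[["age_band", "region_protected", "income_noisy"]]
print(protected)
```

In the protected table, exact ages are replaced by bands, the two regions that appear only once are missing, and incomes no longer match the original values exactly.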

Learn more about the dataset and download it [here](https://ec.europa.eu/eurostat/web/microdata/public-microdata/labour-force-survey)! The [user guide](https://ec.europa.eu/eurostat/documents/1978984/6037342/EULFS-Database-UserGuide.pdf) may also be helpful to learn more about the different variables. 


In [1]:
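import os

import pandas as pd

DATA_DIR = 'FR_PUF_LFS'

# Read every CSV file except those with "_Y" in the name
# (assumed to be the yearly files, leaving only the quarterly ones).
# Sorting the file names keeps the row order the same across runs.
dfs = []
for filename in sorted(os.listdir(DATA_DIR)):
    if filename.endswith('.csv') and "_Y" not in filename:
        file_path = os.path.join(DATA_DIR, filename)
        dfs.append(pd.read_csv(file_path))

df = pd.concat(dfs, ignore_index=True)

# Drop all columns and rows that contain only NaN values
df = df.dropna(axis=1, how='all')
df = df.dropna(axis=0, how='all')

# Convert QUARTER to an integer taken from its last character
df['QUARTER'] = df['QUARTER'].apply(lambda x: int(x[-1]))
df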


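
The loading cell above assumes that a `FR_PUF_LFS` directory containing the downloaded CSV files sits in the current working directory; if it does not, `os.listdir` raises `FileNotFoundError`. Here is a minimal sketch of a more defensive loader. The helper name `load_quarterly_files` is hypothetical, and the `_Y` filter is simply copied from the cell above.

```python
from pathlib import Path

import pandas as pd


def load_quarterly_files(data_dir):
    """Concatenate the CSV files in data_dir, skipping "_Y" files (hypothetical helper)."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(
            f"Expected the downloaded files in '{data_dir}'; "
            "download and unzip them there first."
        )
    # Same filter as the loading cell: skip files with "_Y" in the name.
    paths = sorted(p for p in data_dir.glob("*.csv") if "_Y" not in p.name)
    if not paths:
        raise ValueError(f"No matching CSV files found in '{data_dir}'.")
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
```

With the data in place, `df = load_quarterly_files('FR_PUF_LFS')` would stand in for the loop above; the `dropna` and `QUARTER` steps stay the same.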

In [None]:
df.to_csv('all_FR_LFS.csv', index=False)

## Create Sample Dataset

In [None]:
sample = df.sample(50_000, random_state=1)
sample.to_csv('sample_FR_LFS.csv', index=False)
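
Note that `DataFrame.sample` raises a `ValueError` when asked for more rows than the table has (with the default `replace=False`), and that fixing `random_state` makes the draw repeatable. The sketch below wraps both points in a hypothetical helper, `draw_sample`, and uses a small made-up frame in place of the real data.

```python
import pandas as pd


def draw_sample(df, n=50_000, seed=1):
    """Return a reproducible random sample of at most n rows (hypothetical helper)."""
    # Cap n at the table size so that small inputs do not raise ValueError.
    n = min(n, len(df))
    return df.sample(n, random_state=seed)


# Made-up stand-in for the real table.
demo = pd.DataFrame({"QUARTER": [1, 2, 3, 4] * 25, "VALUE": range(100)})

small = draw_sample(demo, n=10)
again = draw_sample(demo, n=10)
capped = draw_sample(demo, n=50_000)

print(small.equals(again))  # the same seed selects the same rows
print(len(capped))          # capped at the 100 rows available
```

On the full dataset the cap should not come into play; it mainly helps when the same code is run on a single quarter or on a test extract.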