# EconML

We downloaded the **Enquête emploi en continu 2015** (continuous employment survey) database from the [INSEE website](https://www.insee.fr/fr/statistiques/2388681). The database is distributed in dBASE format (`.dbf`, 80MB), compressed into a ZIP file (`.zip`, 9MB).

The `.zip` file contains three dBASE files:
   - `eec15.dbf` (77MB)
   - `varlist.dbf` (33KB)
   - `varmod.dbf` (33KB)

Because the dBASE format is inconvenient to work with, we import the contents of these three files into a single SQLite database using the [`dbf2sqlite`](https://github.com/olemb/dbfread/blob/master/examples/dbf2sqlite) script, with the following command:

    ./dbf2sqlite --encoding=cp850 -o ee-insee-2015.sqlite eec15.dbf varlist.dbf varmod.dbf
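
After the import, it can be worth checking that the three tables actually landed in the SQLite file. The following is a minimal sketch of such a check, not part of the original workflow: the helper name `table_row_counts` is hypothetical, and the demonstration runs on a toy in-memory database rather than on `ee-insee-2015.sqlite`.

```python
import sqlite3


def table_row_counts(con):
    """Return {table_name: row_count} for every table visible on `con`."""
    names = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    # Table names cannot be bound as parameters, so they are quoted as identifiers.
    return {name: con.execute('SELECT count(*) FROM "{}"'.format(name)).fetchone()[0]
            for name in names}


# Toy stand-in for the real database (contents are made up for illustration)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE eec15 (sexe INTEGER)")
demo.execute("CREATE TABLE varlist (variable TEXT, libelle TEXT)")
demo.execute("CREATE TABLE varmod (variable TEXT, modalite TEXT, libelle TEXT)")
demo.executemany("INSERT INTO eec15 VALUES (?)", [(1,), (2,)])
import_summary = table_row_counts(demo)
demo.close()
```

On the real file, the same call with `sqlite3.connect("ee-insee-2015.sqlite")` should list `eec15`, `varlist` and `varmod`.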

## Setup

### Libraries & Settings

In [1]:
import os              # General OS commands
import pandas as pd    # Python Data Analysis Library
import sqlite3         # SQLite3 Database Driver

In [2]:
# Never truncate columns, display all the data
from IPython.display import display, HTML
# None disables truncation (pandas >= 1.0; older releases used -1 instead)
pd.set_option('display.max_colwidth', None)

# Temporary files to delete at the end of the experiment
temp_files = []

### Database

The resulting SQLite database weighs about 89MB, which is too large to commit to our GitHub repository, so we store it compressed and decompress it on the fly before each experiment.

In [3]:
from zipfile import ZipFile
with ZipFile("data/ee-insee-2015-sqlite.zip") as zip_file:
    zip_file.extractall("data/")

eedb = "data/ee-insee-2015.sqlite"
temp_files.append(eedb)
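
An alternative to tracking temporary files by hand is to extract the database into a scratch directory that is removed automatically. The sketch below illustrates the idea under stated assumptions: `extracted_member` is a hypothetical helper, and the demonstration uses a throwaway archive with placeholder contents instead of `data/ee-insee-2015-sqlite.zip`.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from zipfile import ZipFile


@contextmanager
def extracted_member(zip_path, member):
    """Extract `member` from `zip_path`, yield its path, then delete it."""
    scratch = tempfile.mkdtemp()
    try:
        with ZipFile(zip_path) as zip_file:
            path = zip_file.extract(member, scratch)
        yield path
    finally:
        # Runs even if the body of the `with` block raises
        shutil.rmtree(scratch, ignore_errors=True)


# Demonstration on a throwaway archive (placeholder contents, not real data)
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "demo.zip")
with ZipFile(demo_zip, "w") as zf:
    zf.writestr("demo.sqlite", b"placeholder")

with extracted_member(demo_zip, "demo.sqlite") as path:
    existed_inside = os.path.exists(path)
existed_after = os.path.exists(path)
shutil.rmtree(demo_dir)
```

The trade-off is that every query has to run inside the `with` block, which suits a script better than a notebook whose cells are executed one at a time.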

## First look

### Variables
We first look at the list of variables to find out what kind of data we have access to. The variables (i.e. the columns of the main table `eec15`) and their labels are stored in the `varlist` table.

In [4]:
with sqlite3.connect(eedb) as con:
    query = "SELECT * FROM varlist"
    df = pd.read_sql_query(query, con)

df.columns = ["Variable", "Libellé"]
HTML(df.to_html(index=False))

| Variable | Libellé |
| --- | --- |
| AAC | Exercice d'une activité professionnelle régulière antérieure, pour les inactifs, chômeurs et personnes ayant une activité temporaire ou d'appoint autre qu'un emploi informel |
| ACTEU | Statut d'activité au sens du Bureau International du Travail (BIT) selon l'interprétation communautaire |
| ACTEU6 | Statut d'activité au sens du Bureau International du Travail (BIT) selon l'interprétation communautaire (6 postes) |
| ACTIF | Actif au sens du BIT |
| ACTOP | Actif occupé au sens du Bureau International du Travail (BIT) |
| AGE3 | Âge au dernier jour de la semaine de référence (3 postes, premier type de regroupement) |
| AGE5 | Âge au dernier jour de la semaine de référence (5 postes) |
| AIDFAM | Aide familial ou conjoint collaborateur |
| ANCCHOM | Ancienneté de chômage en 8 postes |
| ANCENTR4 | Ancienneté dans l'entreprise ou dans la fonction publique (4 postes) |


### Survey Data
We also look at the actual survey data to get an idea of how much data we are dealing with, and what this data looks like.

In [5]:
with sqlite3.connect(eedb) as con:
    cur = con.cursor()
    
    # Total number of rows/observations
    cur.execute("SELECT count(*) FROM eec15")
    total = cur.fetchone()[0]
    
    # Number of men
    cur.execute("SELECT count(*) FROM eec15 WHERE sexe=1")
    men = cur.fetchone()[0]
    
    # Number of women
    cur.execute("SELECT count(*) FROM eec15 WHERE sexe=2")
    women = cur.fetchone()[0]

print("Total: {}".format(total))
print("Men: {}".format(men))
print("Women: {}".format(women))

Total: 431678
Men: 203516
Women: 228162
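
The men and women counts add up to the total (203516 + 228162 = 431678), so no observation has a missing `sexe`. The same figures can be obtained with a single `GROUP BY` query instead of three separate ones. Here is a minimal sketch on a toy in-memory table; the five rows are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE eec15 (sexe INTEGER)")
con.executemany("INSERT INTO eec15 VALUES (?)", [(1,), (2,), (2,), (1,), (2,)])

# One pass over the table: each result row is a (sexe, count) pair
counts = dict(con.execute("SELECT sexe, count(*) FROM eec15 GROUP BY sexe"))
con.close()

total = sum(counts.values())
print("Total: {}".format(total))
print("Men: {}".format(counts.get(1, 0)))
print("Women: {}".format(counts.get(2, 0)))
```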


In [6]:
with sqlite3.connect(eedb) as con:
    query = "SELECT * FROM eec15 LIMIT 10"
    df = pd.read_sql_query(query, con)

HTML(df.to_html(index=False))

annee,trim,catau2010r,metrodom,typmen7,age3,age5,coured,enfred,nfrred,sexe,acteu,acteu6,actif,actop,aidfam,ancchom,ancinact,contact,creaccp,dem,dispoc,gardeb,halor,inscont,mra,mrb,mrbbis,mrc,mrd,mrdbis,mre,mrec,mrf,mrg,mrgbis,mrh,mri,mrj,mrk,mrl,mrm,mrn,mro,mrpassa,mrpassb,mrpassc,mrs,nondic,nrec,nreca,nrecb,occref,officc,offre,pastra,pastrb,pastrf,percrev,rabs,raisnrec,raisnsou,raispas,sou,soua,soub,souc,sousempl,stche,temp,traref,typcont,typcontb,chpub,cse,cser,csp,cstot,cstotr,fonctc,nafg004un,nafg010un,nafg017un,nafg021un,nafg038un,nafg088un,pub3fp,qprc,stc,contra,rdet,stat2,statoep,statut,statutr,titc,cstmn,cstplc,dispplc,duhab,gardea,hhc6,horaic,raison,raistp,stmn,stplc,tppred,txtppred,ancentr4,sitant,aac,csa,nafant,nafantg004,dip11,cstotprm,identm,extrian16,empnbh,hrec,hhce,hplusa,jourtr,nbtote
2015,2,1,1,4,30,30,1,1,1,2,1,1,1,1,,,,,,0,,,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,2,,,,,,,,,,2,2.0,2.0,,,,,1,,,1.0,56.0,5.0,56.0,56,5,7.0,EV,GI,IZ,I,IZ,56.0,4.0,4.0,3.0,1.0,,2.0,35.0,35.0,5.0,,,,,6.0,,4.0,1.0,,,2.0,2.0,1.0,,3.0,2.0,,,,,71,53,1,127.083178,37.0,,37.0,,5.0,
2015,2,1,1,4,15,15,1,1,1,1,2,3,1,2,2.0,5.0,6.0,,,1,1.0,,2,,1.0,1.0,2.0,1.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,,,,,,2.0,2.0,1.0,,,,2.0,,1.0,1.0,,1,1,2.0,2.0,2.0,2.0,,,,,,1,,,1.0,,1.0,,2,2.0,,,53.0,5.0,,53,5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,53.0,84.0,EV,42,53,1,127.083178,,,,,,
2015,2,1,1,1,15,15,2,2,1,1,1,1,1,1,,,,,13.0,0,,,2,,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,2,2,,1.0,,1.0,,1.0,9.0,,,1,1.0,2.0,,,,,2,,,1.0,54.0,5.0,54.0,54,5,4.0,EV,GI,GZ,G,GZ,47.0,4.0,4.0,3.0,1.0,,2.0,35.0,35.0,5.0,,,,,6.0,,4.0,1.0,,,2.0,2.0,1.0,,2.0,1.0,,,,,10,54,2,133.254104,,,36.0,,5.0,
2015,3,1,1,1,15,15,2,2,1,1,1,1,1,1,,,,,13.0,0,,,2,,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,2,2,,1.0,,1.0,,1.0,9.0,,,1,1.0,2.0,,,,,2,,,1.0,54.0,5.0,54.0,54,5,4.0,EV,GI,GZ,G,GZ,47.0,4.0,4.0,3.0,1.0,,2.0,35.0,35.0,5.0,,,1.0,1.0,6.0,,4.0,1.0,,,,1.0,1.0,,2.0,1.0,,,,,10,54,3,141.969987,,,36.0,39.0,5.0,
2015,4,1,1,1,15,15,2,2,1,1,1,1,1,1,,,,,11.0,0,,,2,,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,2,2,,,,,,,9.0,,,1,1.0,2.0,,,,,1,,,1.0,54.0,5.0,54.0,54,5,4.0,EV,GI,GZ,G,GZ,47.0,4.0,4.0,3.0,1.0,,2.0,35.0,35.0,5.0,,,4.0,1.0,6.0,,4.0,1.0,,,,1.0,1.0,,2.0,1.0,,,,,10,54,4,136.181683,36.0,,36.0,40.0,5.0,
2015,1,1,1,1,30,30,2,2,1,2,1,1,1,1,,,,,,0,,,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,2,,,,,,,,,,2,2.0,2.0,,,,,1,,,1.0,43.0,4.0,43.0,43,4,,EV,OQ,OQ,Q,QB,87.0,4.0,9.0,3.0,1.0,,2.0,35.0,35.0,5.0,,,,,7.0,,5.0,1.0,,,2.0,2.0,1.0,,2.0,1.0,,,,,10,43,5,139.40894,40.0,,45.0,,5.0,
2015,4,1,1,1,30,30,2,2,1,2,1,1,1,1,,,,,,0,,,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2,2,,,,,,,,,,2,2.0,2.0,,,,,1,,,1.0,43.0,4.0,43.0,43,4,,EV,OQ,OQ,Q,QB,87.0,4.0,9.0,3.0,1.0,,2.0,35.0,35.0,5.0,,,,,7.0,,5.0,1.0,,,2.0,2.0,1.0,,2.0,1.0,,,,,10,43,6,138.495584,45.0,,45.0,,5.0,
2015,2,1,1,4,30,30,1,1,2,1,3,6,2,2,2.0,,6.0,,,0,2.0,,1,,,,,,,,,2.0,,,,,,,,,,,,,,,,8.0,,,,1,2,,2.0,1.0,1.0,,,,,,2,2.0,2.0,,,,,2,,,,,,,78,7,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,64.0,49.0,EV,71,78,7,192.891889,,,,,,
2015,2,1,1,4,15,15,1,1,2,2,3,6,2,2,2.0,,6.0,,,0,,4.0,2,,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,1,2,,2.0,2.0,2.0,,,,2.0,,2,,,2.0,,,,2,,,,,,,77,7,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,56.0,88.0,EV,71,78,7,192.891889,,,,,,
2015,3,1,1,4,30,30,1,1,2,1,2,3,1,2,2.0,5.0,6.0,,,1,1.0,,2,,2.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,,,,,,2.0,1.0,,,,,2.0,,1.0,1.0,,1,1,2.0,2.0,2.0,2.0,,,,,,1,,,1.0,,1.0,,2,2.0,,,64.0,6.0,,64,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,64.0,49.0,EV,71,64,8,212.328424,,,,,,
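
Many cells in these first rows are empty, so it helps to know how sparsely each column is filled. The sketch below counts the non-missing values per column in a single query. It assumes that empty cells are stored as SQL `NULL` (not as empty strings); `non_null_counts` is a hypothetical helper, and the demonstration table is a toy with two of the real column names.

```python
import sqlite3


def non_null_counts(con, table):
    """Return {column: number of non-NULL values} for every column of `table`."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
    columns = [row[1] for row in con.execute('PRAGMA table_info("{}")'.format(table))]
    # count(column) skips NULLs, unlike count(*)
    select = ", ".join('count("{}")'.format(column) for column in columns)
    row = con.execute('SELECT {} FROM "{}"'.format(select, table)).fetchone()
    return dict(zip(columns, row))


# Toy table (made-up rows) mimicking a column that is filled only for some people
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE eec15 (acteu INTEGER, ancchom REAL)")
demo.executemany("INSERT INTO eec15 VALUES (?, ?)",
                 [(1, None), (2, 5.0), (3, None)])
filled = non_null_counts(demo, "eec15")
demo.close()
```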


## Variable description
As our objective is to model employment/unemployment, we take a closer look at the variables available to us. In particular, we identify variables of interest that we believe would be appropriate explanatory variables (regressors) in a regression model.

We then summarise these variables through (1) **descriptive statistics** and (2) **graphical representations**, which give us better insight into how they are distributed.

### Dependent variable/Regressand
The objective of our model is to identify the relationship between **employment status** and a set of explanatory variables that we will choose from the database. As such, the dependent variable (regressand) in our model is a binary variable, **employed/unemployed**.

The database provides the following candidate variables, so we need to determine which of them is closest to a simple employed/unemployed binary variable:
- **ACTEU**: *Statut d'activité au sens du Bureau International du Travail (BIT) selon l'interprétation communautaire*
- **ACTEU6**: *Statut d'activité au sens du Bureau International du Travail (BIT) selon l'interprétation communautaire (6 postes)*
- **ACTIF**: *Actif au sens du BIT*
- **ACTOP**: *Actif occupé au sens du BIT*

The possible values for these variables are contained in the `varmod` table and are summarised in the following table.

In [7]:
with sqlite3.connect(eedb) as con:
    query = """
        SELECT *
        FROM varmod
        WHERE variable IN ('ACTEU', 'ACTEU6', 'ACTIF', 'ACTOP')
    """
    df = pd.read_sql_query(query, con)

df.columns = ["Variable", "Modalité", "Libellé"]
HTML(df.to_html(index=False))

| Variable | Modalité | Libellé |
| --- | --- | --- |
| ACTEU | | Sans objet (ACTEU non renseigné, individus de 15 ans et plus nécessairement non pondérés) |
| ACTEU | 1.0 | Actif occupé |
| ACTEU | 2.0 | Chômeur |
| ACTEU | 3.0 | Inactif |
| ACTEU6 | | Sans objet (ACTEU non renseigné, individus de 15 ans et plus nécessairement non pondérés) |
| ACTEU6 | 1.0 | Actif occupé |
| ACTEU6 | 3.0 | Chômeur PSERE (Population sans Emploi à la Recherche d'un Emploi) |
| ACTEU6 | 4.0 | Autre chômeur BIT |
| ACTEU6 | 5.0 | Etudiant, élève, stagiaire en formation (inactifs) |
| ACTEU6 | 6.0 | Autres inactifs (dont retraités) |
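
One possible way to obtain a binary variable from these modalities is to recode `ACTEU`: 1 (*Actif occupé*) becomes employed, 2 (*Chômeur*) becomes unemployed, and 3 (*Inactif*) or a missing value is left out. This is only a sketch of one candidate coding, not a choice made in this notebook. It assumes that `acteu` is stored as a number, and it runs on a toy in-memory table with made-up rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE eec15 (acteu INTEGER)")
con.executemany("INSERT INTO eec15 VALUES (?)", [(1,), (2,), (3,), (None,)])

# CASE without ELSE yields NULL, so inactive and missing rows get no value
query = """
    SELECT acteu,
           CASE acteu WHEN 1 THEN 1 WHEN 2 THEN 0 END AS employed
    FROM eec15
    ORDER BY rowid
"""
rows = con.execute(query).fetchall()
con.close()
```

Rows where `employed` is `NULL` would then have to be dropped before fitting a model, which restricts the sample to the active population.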


## Cleanup

Now that the experiment has concluded, we delete all the "temporary" files.

In [8]:
for temp in temp_files:
    os.remove(temp)