# EconML

Welcome to our Jupyter Notebook.

## Database
We downloaded the ***Enquête Emploi 2015 - Fichier détail*** database from the [INSEE website](https://www.insee.fr/fr/statistiques/2388681). The database is in dBASE format (`.dbf`, about 80MB) and is distributed as a ZIP file (`.zip`, about 9MB).

The `.zip` file contains three dBASE files:
   - `eec15.dbf` (77MB)
   - `varlist.dbf` (33KB)
   - `varmod.dbf` (33KB)

Because the dBASE format is inconvenient to work with, we import the contents of these three files into a single SQLite database using the [`dbf2sqlite`](https://github.com/olemb/dbfread/blob/master/examples/dbf2sqlite) script, with the following command:

    ./dbf2sqlite --encoding=cp850 -o ee-insee-2015.sqlite eec15.dbf varlist.dbf varmod.dbf

## Setup

The resulting SQLite database is about 88MB, which is too large for GitHub (where our repository is hosted). We therefore keep a compressed copy in the repository and decompress it on the fly before each experiment.

In [None]:
from zipfile import ZipFile
with ZipFile("data/ee-insee-2015-sqlite.zip") as zip_file:
    zip_file.extractall("data/")
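
For completeness, the archive extracted above could be produced with the same `zipfile` module. This is only a minimal sketch: the helper name `compress_database` and the throwaway demo file are assumptions made for illustration, not the procedure we actually used.

```python
import os
import tempfile
from zipfile import ZipFile, ZIP_DEFLATED


def compress_database(db_path, zip_path):
    """Write db_path into a new ZIP archive at zip_path (DEFLATE compression)."""
    with ZipFile(zip_path, "w", compression=ZIP_DEFLATED) as zip_file:
        # arcname drops the directory part, so extractall("data/") later
        # recreates the file directly under data/.
        zip_file.write(db_path, arcname=os.path.basename(db_path))
    return zip_path


# Demonstration on a throwaway file standing in for the real database.
work_dir = tempfile.mkdtemp()
demo_db = os.path.join(work_dir, "demo.sqlite")
with open(demo_db, "wb") as f:
    f.write(b"\0" * 100000)  # highly compressible dummy content

demo_zip = compress_database(demo_db, os.path.join(work_dir, "demo-sqlite.zip"))
with ZipFile(demo_zip) as zip_file:
    archived_names = zip_file.namelist()
print(archived_names)
```

For the real file, the call would presumably be `compress_database("data/ee-insee-2015.sqlite", "data/ee-insee-2015-sqlite.zip")`.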

## First look

### Variables
We first look at the list of variables to find out what kind of data we have access to. The variables (i.e. the columns of the main table `eec15`) and their labels are stored in the `varlist` table.

In [None]:
import pandas as pd
import sqlite3

from IPython.display import display, HTML
pd.set_option('display.max_colwidth', None)  # do not truncate long labels

with sqlite3.connect("data/ee-insee-2015.sqlite") as con:
    query = "SELECT * FROM varlist"
    df = pd.read_sql_query(query, con)

df.columns = ["Variable", "Libellé"]
HTML(df.to_html(index=False))
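
Before querying a table, it can help to check which tables and columns the SQLite file actually contains. The sketch below reads this from `sqlite_master` and `PRAGMA table_info`. It runs against an in-memory database whose tables and column names are invented for the demonstration, and the helper name `describe_tables` is hypothetical; the real schema produced by `dbf2sqlite` may differ.

```python
import sqlite3
from contextlib import closing


def describe_tables(con):
    """Return {table name: [column names]} for every table in the database."""
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        # Table names come from sqlite_master here, not from user input.
        columns = con.execute('PRAGMA table_info("{}")'.format(table)).fetchall()
        schema[table] = [col[1] for col in columns]
    return schema


# Toy database: these column names are assumptions, not the real INSEE schema.
with closing(sqlite3.connect(":memory:")) as con:
    con.execute("CREATE TABLE varlist (variable TEXT, libelle TEXT)")
    con.execute("CREATE TABLE eec15 (sexe INTEGER, age INTEGER)")
    schema = describe_tables(con)

print(schema)
```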

### Survey Data
We also look at the actual survey data to get an idea of how much data we are dealing with, and what this data looks like.

In [None]:
import pandas as pd
import sqlite3

from IPython.display import display, HTML
pd.set_option('display.max_colwidth', None)

with sqlite3.connect("data/ee-insee-2015.sqlite") as con:
    cur = con.cursor()
    
    # Total number of rows/observations
    cur.execute("SELECT count(*) FROM eec15")
    total = cur.fetchone()[0]
    
    # Number of men
    cur.execute("SELECT count(*) FROM eec15 WHERE sexe=1")
    men = cur.fetchone()[0]
    
    # Number of women
    cur.execute("SELECT count(*) FROM eec15 WHERE sexe=2")
    women = cur.fetchone()[0]

# print() with a single argument behaves the same on Python 2 and Python 3
print("Total: {}".format(total))
print("    Men: {}".format(men))
print("    Women: {}".format(women))
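
The three counts can also be obtained in one pass with a `GROUP BY` query. The following is a sketch under stated assumptions: it uses a toy in-memory `eec15` table with made-up rows, a hypothetical helper `count_by_sex`, and it assumes, as in the cell above, that `sexe` is coded 1 for men and 2 for women.

```python
import sqlite3
from contextlib import closing


def count_by_sex(con):
    """Return {sexe code: number of rows} using a single GROUP BY query."""
    rows = con.execute(
        "SELECT sexe, count(*) FROM eec15 GROUP BY sexe ORDER BY sexe").fetchall()
    return dict(rows)


# Toy data only: five invented observations, not real survey records.
with closing(sqlite3.connect(":memory:")) as con:
    con.execute("CREATE TABLE eec15 (sexe INTEGER)")
    con.executemany("INSERT INTO eec15 VALUES (?)", [(1,), (2,), (2,), (1,), (2,)])
    counts = count_by_sex(con)

print("Total: {}".format(sum(counts.values())))
print("    Men: {}".format(counts.get(1, 0)))
print("    Women: {}".format(counts.get(2, 0)))
```

Unlike the three separate queries, this version would also reveal any unexpected codes in `sexe`, since every distinct value gets its own entry in `counts`.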

In [None]:
import pandas as pd
import sqlite3

from IPython.display import display, HTML
pd.set_option('display.max_colwidth', None)

with sqlite3.connect("data/ee-insee-2015.sqlite") as con:
    query = "SELECT * FROM eec15 LIMIT 10"
    df = pd.read_sql_query(query, con)

HTML(df.to_html(index=False))

## Cleanup

Now that the experiment has concluded, we delete the decompressed database, which is only a temporary file.

In [None]:
import os
os.remove("data/ee-insee-2015.sqlite")
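
One caveat: `with sqlite3.connect(...) as con:` commits or rolls back the transaction but does not close the connection, and on some platforms (Windows in particular) a file that is still open cannot be deleted. A more defensive variant, sketched below on a throwaway database, wraps the connection in `contextlib.closing` and only removes the file if it exists. The helper name `remove_if_exists` is hypothetical.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def remove_if_exists(path):
    """Delete path if present; return True if a file was actually removed."""
    if os.path.exists(path):
        os.remove(path)
        return True
    return False


# Throwaway database standing in for data/ee-insee-2015.sqlite.
work_dir = tempfile.mkdtemp()
demo_db = os.path.join(work_dir, "demo.sqlite")

# closing() guarantees con.close() on exit; "with con:" alone would not.
with closing(sqlite3.connect(demo_db)) as con:
    con.execute("CREATE TABLE t (x INTEGER)")
    con.commit()

removed_first = remove_if_exists(demo_db)   # file exists, so it is deleted
removed_second = remove_if_exists(demo_db)  # already gone, nothing to do
print(removed_first, removed_second)
```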