# Data collection walkthrough 

This notebook is a step-by-step guide to the data construction pipeline, `scripts/collect_data.py`.

The script runs the following steps in order:
1. `build.population.run()`
2. `build.outcomes.run()`
3. `build.panel.run()`
4. `build.female_foreign.run()`


## Setup

Import paths from configuration (see README.md for a configuration guide).

In [None]:
!python scripts/validate_setup.py # check that paths are correctly defined and data is available
from depo_paper.config import PATHS
PATHS # see that paths are indeed correctly defined

## 1) Build convict and spouse sample 

Step: `build/population.py`

### What happens
- Load deportation orders and keep first order per person.
- Load residency permits and keep earliest permit per person.
- Load conviction histories.
- Define control group (convicted, not deported, sentenced to prison).
- Match spouses/partners and keep each partner's first exposure.
    - Spouses/partners are identified from the population records (BEF) for the year before the year of conviction.
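
The "keep first per person" rule used for deportation orders and residency permits can be sketched as follows. This is a minimal illustration on toy records, not the code in `build/population.py`; the field names `person_id` and `order_date` and the helper `keep_first_per_person` are hypothetical.

```python
from datetime import date

# Hypothetical toy records; field names are illustrative, not the real schema.
orders = [
    {"person_id": 1, "order_date": date(2005, 3, 1)},
    {"person_id": 1, "order_date": date(2003, 6, 15)},
    {"person_id": 2, "order_date": date(2010, 1, 20)},
]


def keep_first_per_person(rows, date_key):
    """Return one row per person_id: the row with the earliest date."""
    first = {}
    for row in rows:
        pid = row["person_id"]
        if pid not in first or row[date_key] < first[pid][date_key]:
            first[pid] = row
    # Sort by person_id only to make the output order deterministic.
    return sorted(first.values(), key=lambda r: r["person_id"])


first_orders = keep_first_per_person(orders, "order_date")
```

The same helper would apply to permits by passing a different date field.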

### Key assumptions
- Conviction rows with the same date are collapsed into one row (sentence lengths are summed).
- The control group is restricted to individuals whose first qualifying conviction is in 2000 or later.
- Danish citizens are removed (`statsb == 5100`).
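
As an illustration of the first and third assumptions, the sketch below drops Danish citizens and then collapses same-day conviction rows by summing sentence lengths. The column names other than `statsb`, the order of the two operations, and the placement of `statsb` on the conviction rows are assumptions made for this example only.

```python
from collections import defaultdict
from datetime import date

DANISH_CITIZENSHIP_CODE = 5100  # value of `statsb` given in the text

# Hypothetical toy rows; `person_id`, `conviction_date` and `sentence_days`
# are illustrative names, not the real schema.
convictions = [
    {"person_id": 1, "conviction_date": date(2004, 5, 2), "sentence_days": 30, "statsb": 5120},
    {"person_id": 1, "conviction_date": date(2004, 5, 2), "sentence_days": 60, "statsb": 5120},
    {"person_id": 2, "conviction_date": date(2006, 9, 9), "sentence_days": 90, "statsb": 5100},
]


def drop_danish_citizens(rows):
    """Remove rows whose citizenship code marks a Danish citizen."""
    return [r for r in rows if r["statsb"] != DANISH_CITIZENSHIP_CODE]


def collapse_same_day(rows):
    """Collapse rows sharing (person_id, conviction_date), summing sentence lengths."""
    totals = defaultdict(int)
    for r in rows:
        totals[(r["person_id"], r["conviction_date"])] += r["sentence_days"]
    return [
        {"person_id": pid, "conviction_date": day, "sentence_days": days}
        for (pid, day), days in sorted(totals.items())
    ]


collapsed = collapse_same_day(drop_danish_citizens(convictions))
```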

In [None]:
from depo_paper.build.population import run as build_population
build_population()

## 2) Build spouse outcomes 

Step: `build/outcomes.py`

### What happens
- Builds monthly labor outcomes (`wages`, `fulltime`, `transfers`).
- Builds monthly crime outcomes (`convicted`, `charged`, `incarcerated`).
- Builds net assets from year prior to year of conviction.
- Derives legal grounds of residency.

### Key assumptions
- `wages`, `fulltime`, and `transfers` are clipped to [0, p99], where p99 is the 99th percentile.
- Missing assets are set to 0 and then clipped to [p1, p99] (1st and 99th percentiles).
- Some imputation is done on missing release dates in the incarceration data, e.g.:
    - Individuals with missing release dates for recent incarcerations are assumed to still be incarcerated.
- Residency grounds imputed based on citizenship and residency permit records in OPHG.
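
The clipping rules can be sketched with the standard library alone. This is an illustration under assumptions: percentile definitions differ between libraries, the percentiles here are computed on the zero-filled data, and the helpers `percentile` and `clip` are hypothetical rather than functions from `build/outcomes.py`.

```python
import statistics


def percentile(values, p):
    """p-th percentile (1..99) using statistics.quantiles with the inclusive method."""
    return statistics.quantiles(values, n=100, method="inclusive")[p - 1]


def clip(values, lower, upper):
    """Bound every value to the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Wages: clip to [0, p99].
wages = [0, 100, 200, 300, 10_000]
wages_clipped = clip(wages, 0, percentile(wages, 99))

# Assets: set missing values to 0 first, then clip to [p1, p99].
assets = [None, -5_000, 0, 2_000, 1_000_000]
assets_filled = [0 if a is None else a for a in assets]
assets_clipped = clip(
    assets_filled,
    percentile(assets_filled, 1),
    percentile(assets_filled, 99),
)
```

Filling before clipping matters: the zeros enter the percentile calculation, so the bounds depend on how many values were missing.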

In [None]:
from depo_paper.build.outcomes import run as build_outcomes
build_outcomes()

## 3) Build event-time panel 

Step: `build/panel.py`

### What happens
- Expands spouse data to monthly event-time window (`-18` to `+18`).
- Constructs in-country status from movement records.
- Merges on monthly labor market outcomes and crime outcomes.
- Removes control cases where spouses were convicted on the same criminal case.
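
The expansion to a monthly event-time window can be sketched as below: one row per spouse and month, from 18 months before the event to 18 months after. The names `expand_event_window`, `add_months`, `spouse_id`, and `event_time` are hypothetical, and dates are represented as first-of-month values for simplicity.

```python
from datetime import date


def add_months(d, k):
    """Shift a first-of-month date by k months (k may be negative)."""
    total = d.year * 12 + (d.month - 1) + k
    return date(total // 12, total % 12 + 1, 1)


def expand_event_window(spouse_id, event_month, lo=-18, hi=18):
    """Build one row per month of event time in [lo, hi], inclusive."""
    return [
        {"spouse_id": spouse_id, "event_time": t, "month": add_months(event_month, t)}
        for t in range(lo, hi + 1)
    ]


panel = expand_event_window(7, date(2010, 6, 1))
```

With both endpoints included, the window has 37 rows per spouse.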

### Key assumptions
- Individuals are assumed to be in-country unless migration data indicate otherwise.
- For in-country months in 2008-2021, missing labor market outcomes are set to 0.
- Employment is defined as `wages > 0` when wages are observed.
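
The zero-fill and employment rules might look like the sketch below. It assumes that employment is derived after the zero-fill, so a filled month counts as observed with zero wages; the source does not state this ordering. The function `fill_and_flag` and the keys `year`, `in_country`, and `employed` are hypothetical.

```python
def fill_and_flag(row):
    """Apply the zero-fill rule, then derive an employment flag.

    `row` is a dict with 'year' (int), 'in_country' (bool) and
    'wages' (float or None when unobserved).
    """
    wages = row["wages"]
    # Missing wages become 0 only for in-country months in 2008-2021.
    if wages is None and row["in_country"] and 2008 <= row["year"] <= 2021:
        wages = 0.0
    # Employment stays undefined (None) when wages remain unobserved.
    employed = None if wages is None else wages > 0
    return {**row, "wages": wages, "employed": employed}


filled = fill_and_flag({"year": 2010, "in_country": True, "wages": None})
outside = fill_and_flag({"year": 2005, "in_country": True, "wages": None})
working = fill_and_flag({"year": 2015, "in_country": True, "wages": 1500.0})
```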

In [None]:
from depo_paper.build.panel import run as build_panel
build_panel()

## 4) Build female foreign-born comparison sample 

Step: `build/female_foreign.py`

### Key assumptions
- Missing wages/transfers/assets are set to 0.
- Employment is `wages > 0`.
- Labor market outcomes and assets (wages/fulltime/transfers/assets) are clipped as in section 2.
- Residency grounds are imputed as in section 2.

In [None]:
from depo_paper.build.female_foreign import run as build_female_foreign
build_female_foreign()