# family-resources-survey

This is a small Python package that aims to make processing data from the UK Family Resources Survey (FRS) as easy as possible. This documentation outlines basic usage and gives some examples. The package converts the survey's TAB files to Pandas DataFrames; by default it returns MicroDataFrames, a modification of the DataFrame class which handles survey weighting behind the scenes and provides useful population-related functions.

## Installation

Install the package from the git repository, using pip:
```console
pip install git+https://github.com/nikhilwoodruff/family-resources-survey
```

## Initialising with Microdata

The Family Resources Survey is classified as *safeguarded* on the UK Data Service (more restrictive than *open*, but less restrictive than *controlled*). This means you need an account with the UK Data Service to request the microdata. Once you have one, find the desired release of the FRS and download the TAB microdata (only TAB files are currently supported). Then run:
```console
frs-data save --path [PATH_TO_ZIP_FILE] --year [YEAR] --zipped
```
This saves the microdata inside the package and runs any necessary preprocessing. The package indexes the tables by person, benunit and household, and tries to parse the codebook automatically (this works with the 2018 release; other releases have not been tested). The ```--zipped``` flag specifies that the path is a zipped folder that should be extracted. If you have already extracted it into a folder, leave the flag out and point ```--path``` at that folder instead, making sure it has the same structure as the zipped file. Then, from a Python environment, it's trivial to import the microdata in the right format:

```python
from family_resources_survey import FRS

frs = FRS(2018)
```

This ```frs``` object exposes every table as a property (each is loaded only when you first access it), plus a ```description``` dictionary if the codebook parsing was successful. Tables that provide weights explicitly (that is, 'adult', 'benunit' and 'househol') are stored as MicroDataFrames; the others are returned as normal DataFrames.
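To illustrate why weighting matters, here is a minimal sketch in plain pandas and NumPy. It does not use the MicroDataFrame API; the table, its column names and its values are invented for the example, and it only shows the kind of calculation that weighting "behind the scenes" saves you from writing by hand.

```python
import numpy as np
import pandas as pd

# Toy household table; column names and values are made up for illustration.
households = pd.DataFrame(
    {
        "income": [20_000.0, 35_000.0, 50_000.0],
        "weight": [1_000.0, 2_000.0, 500.0],  # households each row represents
    }
)

# Unweighted mean: treats every sampled household equally.
sample_mean = households["income"].mean()

# Weighted mean: scales each household by the number it represents.
population_mean = np.average(households["income"], weights=households["weight"])

# Weighted total: an estimate of aggregate income across the population.
population_total = (households["income"] * households["weight"]).sum()

print(sample_mean, population_mean, population_total)
```

The unweighted and weighted means differ because the middle household stands in for more of the population than the other two.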

## Relational Tools

The FRS is a relational database: one person might have zero or more jobs, and these are linked using primary and foreign keys. This package has two top-level tools for quickly performing operations that span multiple tables: ```consolidate``` and ```join```.

### Consolidate

Often a table contains multiple entries for the same person: for example, a person can have more than one job. This can cause problems when you need one row per person, so the ```consolidate``` function takes the table, the maximum number of occurrences to distinguish (```max_count```), a column name in which to write the number of occurrences for each person (```count_label```), and optionally a flag (```count_all```) that sets ```max_count``` to the maximum found in the data. For example, suppose the job table contains one variable called 'PAY' and that a person has at most 4 jobs. We can perform a few different operations:

```python
from family_resources_survey import consolidate

consolidate(table=frs.job, max_count=1, count_label="num_jobs")    # 1
consolidate(table=frs.job, max_count=2, count_label="num_jobs")    # 2
consolidate(table=frs.job, count_label="num_jobs", count_all=True) # 3
```

1. Returns a dataframe with one row per person and the column "PAY", which is the PAY summed over all of that person's jobs.
2. Returns a dataframe with one row per person and the columns "PAY_1" and "PAY_other", which are the PAY from the first job and the PAY summed over the other jobs, respectively.
3. Returns a dataframe with one row per person and the columns "PAY_1", "PAY_2", "PAY_3" and "PAY_4", which are the PAY from each job, respectively. Equivalent to setting ```max_count``` to 4, but without needing to work out that maximum beforehand.
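The first two cases can be mimicked in plain pandas on a toy table. This is a sketch of the idea only, not the package's implementation: the ```person_id``` key name is hypothetical, and filling a missing "PAY_other" with zero is an assumption.

```python
import pandas as pd

# Toy job table: person 1 has three jobs, person 2 has one.
# "person_id" is a hypothetical key name, not necessarily the FRS one.
jobs = pd.DataFrame(
    {"person_id": [1, 1, 1, 2], "PAY": [100.0, 50.0, 25.0, 80.0]}
)

by_person = jobs.groupby("person_id")

# Number of jobs per person (the role played by count_label="num_jobs").
num_jobs = by_person.size().rename("num_jobs")

# Case 1 (max_count=1): a single PAY column summed over all jobs.
case_1 = pd.concat([num_jobs, by_person["PAY"].sum()], axis=1)

# Case 2 (max_count=2): the first job kept separately, the rest summed.
position = by_person.cumcount()  # 0 for each person's first row, then 1, 2, ...
pay_1 = jobs[position == 0].set_index("person_id")["PAY"].rename("PAY_1")
pay_other = (
    jobs[position >= 1]
    .groupby("person_id")["PAY"]
    .sum()
    .reindex(num_jobs.index, fill_value=0.0)  # assumption: no other jobs means 0
    .rename("PAY_other")
)
case_2 = pd.concat([num_jobs, pay_1, pay_other], axis=1)

print(case_1)
print(case_2)
```

Either way the result has one row per person, which is what makes it safe to join onto a person-level table.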

### Join

```join``` is simpler - it just recursively performs a left join on all the dataframes passed into it.

```python
from family_resources_survey import join

join(frs.adult, consolidate(frs.job), consolidate(frs.mortgage))
```

Here, we've consolidated JOB and MORTGAGE by summing up their amounts per person, then joined JOB onto ADULT, and finally joined MORTGAGE onto the result of that.
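The same chaining can be sketched in plain pandas with ```functools.reduce```. This is not the package's code: the toy tables, the ```person_id``` key and the ```left_join_all``` helper are all hypothetical, and the real function may choose its join keys differently.

```python
from functools import reduce

import pandas as pd

# Toy tables sharing a hypothetical "person_id" key.
adult = pd.DataFrame({"person_id": [1, 2, 3], "AGE": [34, 51, 27]})
job = pd.DataFrame({"person_id": [1, 2], "PAY": [175.0, 80.0]})
mortgage = pd.DataFrame({"person_id": [2], "MORTGAGE": [900.0]})


def left_join_all(*tables, on="person_id"):
    """Left-join each table onto the running result, in the order given."""
    return reduce(
        lambda left, right: left.merge(right, on=on, how="left"), tables
    )


joined = left_join_all(adult, job, mortgage)
print(joined)
```

Because every step is a left join, each row of the first table is kept, and people with no matching job or mortgage row get missing values in those columns.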

## Contact

Feel free to get in touch with any questions. This package is purely a data processing tool: no adjustments are applied to the microdata.