# Reading the Climate variables

In the following we load the climate variables computed in `./weather/climate_vars.ipynb`. We index all rows by their `nuts` location and the timestamp of the data point. This allows us to merge them with the SWB data later on.

To select the climate tables the data should be merged from, the NUTS level has to be provided.

In [19]:
# relevant imports
import pandas as pd
import numpy as np
from collections import defaultdict
from sqlalchemy import create_engine, inspect

Reading the climate data:

In [20]:
c_path = "weather/prod/climate.db"
c_engine = create_engine("sqlite:///"+c_path, echo=False)

In [21]:
# function for reading only the required data
def load_climate_data(con, nuts_lvl):
    # list the tables in the db and keep those whose name starts with the NUTS level
    tables = [t for t in inspect(con).get_table_names() if t.startswith(str(nuts_lvl))]
    # read every matching table and remember which table each row came from
    frames = []
    for table in tables:
        tmp = pd.read_sql_table(table, con)
        tmp["table_name"] = table
        frames.append(tmp)
    # concatenate once at the end; return an empty dataframe if nothing matched
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

In [22]:
nuts_lvl = 1
df_climate = load_climate_data(c_engine, nuts_lvl=nuts_lvl) 

# Reading the SOEP Household data

Because the individual questionnaire file is quite large, we merge it chunkwise with the weather and household data. We therefore read the household (hh) files first. All the required household data is spread over two data sets, which are merged in the following:

In [23]:
hbrutto_path = "./data/SOEP-CORE.v37eu_CSV/CSV/hbrutto.csv"
hl_path = "./data/SOEP-CORE.v37eu_CSV/CSV/hl.csv"

`hbrutto` contains metadata about the interviews, such as "day of interview", "location", etc. The `hl` file contains the interview responses of all waves in long format.

The columns of interest within the `hl` and `hbrutto` files, together with their meaning, are listed below:

In [24]:
hl_var = {
    "syear":"year", # survey year -> prim. key
    "hid":"hid", # hh id -> prim. key
    # hh control variables:
    "hlc0005_h":"hh_income", # monthly net household income [harmonized]
    "hlc0043":"hh_children", # number of children
    "hlf0001_h":"homeownership", # homeownership
    "hlk0056":"interview_type", # type of interview
}

hbrutto_var = {
    "syear":"year", # survey year -> prim. key
    "hid":"hid", # hh id -> prim. key
    "bula_h":"sloc" # location -> bundesland
}

The two files are now read into their respective dataframes.

In [25]:
df_hl = pd.read_csv(hl_path, usecols=hl_var.keys())

In [26]:
df_hbrutto = pd.read_csv(hbrutto_path, usecols=hbrutto_var.keys())
# change location variable to be able to merge with weather data
df_hbrutto.rename({"bula_h":"sloc"}, axis=1, inplace=True)

The two dataframes can now be merged on their primary key. According to the documentation of the data set, the primary key consists of `hid` and `syear`.

In [27]:
df_hh = df_hl.merge(df_hbrutto, how="inner", on=["hid", "syear"])
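
Since `hid` and `syear` are expected to identify a household-year uniquely in both files, the merge itself can be asked to verify this. The following is a minimal sketch on toy data (all values are invented for illustration, and the `toy_*` names are hypothetical). It uses the `validate` argument of `DataFrame.merge`, which raises a `MergeError` when the keys are not unique on both sides.

```python
import pandas as pd

# toy stand-ins for df_hl and df_hbrutto (values are invented)
toy_hl = pd.DataFrame({
    "hid": [1, 1, 2],
    "syear": [2018, 2019, 2018],
    "hlc0005_h": [2500, 2600, 3100],
})
toy_hbrutto = pd.DataFrame({
    "hid": [1, 1, 2],
    "syear": [2018, 2019, 2018],
    "sloc": [9, 9, 11],
})

# validate="one_to_one" checks that (hid, syear) is unique in both frames
toy_hh = toy_hl.merge(toy_hbrutto, how="inner", on=["hid", "syear"],
                      validate="one_to_one")

# a duplicated key on one side is reported instead of silently multiplying rows
toy_dup = pd.concat([toy_hbrutto, toy_hbrutto.iloc[[0]]], ignore_index=True)
try:
    toy_hl.merge(toy_dup, how="inner", on=["hid", "syear"], validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

Whether the real files satisfy this uniqueness assumption would have to be checked against the data; the sketch only shows how such a check could be wired in.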

# Reading the SOEP Individual Data

In [28]:
# clear db
!echo > ./prod/data.db

In [29]:
engine = create_engine("sqlite:///prod/data.db", echo=False)

Now we read the individual data. This dataset includes the target variable *SWB* as well as many other control variables. The datasets that contain the information used in this analysis are `ppathl` (tracking file), `pequiv` (additional person-level covariates) and `pl` (data).

In [30]:
ppathl_path = "./data/SOEP-CORE.v37eu_CSV/CSV/ppathl.csv"
pequiv_path = "./data/SOEP-CORE.v37eu_CSV/CSV/pequiv.csv"
pl_path = "./data/SOEP-CORE.v37eu_CSV/CSV/pl.csv"

In [31]:
ppathl_var = {
    "pid":"pid", # person id -> prim. key
    "syear":"year", #survey year -> prim. key
    # relevant covariates
    "sex":"sex", # gender [1] male [2] female
    "gebjahr":"birth_year", # year of birth
    "partner":"relationship", # [0] no partner, [1] spouse, [2] partner,
                              # [3] probably spouse, [4] probably partner
                              # NOTE: join 1&3 and 2&4
}

pequiv_var = {
    "pid":"pid", # person id -> prim. key
    "syear":"year", #survey year -> prim. key
    "d11109":"education_years", #years of education
    "m11124":"disability_status", #disability status
    "e11103":"labor_participation" #labor participation
}

pl_var = {
    # ids
    "pid":"pid", # person id -> prim. key
    "syear":"year", #survey year -> prim. key
    "hid":"hid", # hh id -> prim. key
    # target variable
    "plh0182":"swb", # current life satisfaction [0-10]
    # relevant covariates
    "ptagin":"day", #day of interview
    "pmonin":"month", #month of interview
    "plh0171":"satisfaction_health", # satisfaction with health [0-10] (0=bad, 10=good)
    "plb0021":"unemployed", # [2] No [1] Yes
    "plh0173":"satisfaction_work", # [0-10] not satisfied <-> very satisfied, NOTE: many nan
    "plh0174":"satisfaction_work_hh", # same as above (NOTE: maybe take max of both?)
    "plh0175":"satisfaction_income" # satisfaction with household income
}

The `ppathl` file contains the tracking data of a person. This includes, for instance, the year of birth (from which the age is derived) or the partnership status.
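
The note in `ppathl_var` suggests joining the "probably" categories of `partner` with their definite counterparts. A minimal sketch of such a recoding on invented values could look as follows. It assumes the codes [0] to [4] listed in the comment; the name `partner_recode` and the toy dataframe are hypothetical and not part of the analysis.

```python
import pandas as pd

# hypothetical mapping: fold "probably spouse" (3) into spouse (1)
# and "probably partner" (4) into partner (2)
partner_recode = {3: 1, 4: 2}

# toy stand-in for df_ppathl (values are invented)
toy = pd.DataFrame({"pid": [101, 102, 103, 104, 105],
                    "partner": [0, 1, 2, 3, 4]})

# codes that are not in the mapping are left unchanged
toy["partner"] = toy["partner"].replace(partner_recode)
```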

In [32]:
df_ppathl = pd.read_csv(ppathl_path, usecols=ppathl_var.keys())

In [33]:
df_pequiv = pd.read_csv(pequiv_path, usecols=pequiv_var.keys())

In the following we merge all the dataframes loaded into memory.

In [34]:
chunk = pd.read_csv(pl_path, usecols=pl_var.keys())

## MERGE WITH OTHER DATASETS
# merge with tracking data
chunk = chunk.merge(df_ppathl, on=["syear", "pid"], how="inner")
# merge with pequiv (TODO what is this table)
chunk = chunk.merge(df_pequiv, on=["syear", "pid"], how="inner")
# merge with household
chunk = chunk.merge(df_hh, on=["syear", "hid"], how="inner")

## CALCULATE RELEVANT VARIABLES
# age:
chunk["age"] = chunk["syear"] - chunk["gebjahr"]
# time stamp:
chunk.rename({'syear':"year", 'pmonin':"month", 'ptagin':"day"}, axis=1, inplace=True)
chunk["time"] = pd.to_datetime(chunk[['year', 'month', 'day']], errors='coerce')
# drop columns that are no longer needed:
chunk.drop(['year', 'month', 'day'], axis=1, inplace=True)
# delete invalid time stamps as they cannot be merged with climate data:
chunk = chunk[chunk['time'].notna()]

## MERGE WITH CLIMATE DF
final = pd.merge(chunk, df_climate, on=["time", 'sloc'], how='inner')

## SAVE TO DATABASE
final.to_csv(f"./prod/{nuts_lvl}_data.csv")
final.to_sql(f"{nuts_lvl}_data", con = engine, if_exists='replace')

722380
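
The cell above reads `pl.csv` in one go. If memory becomes a constraint, the same steps could be applied chunk by chunk. The following is a sketch only: it uses a small in-memory CSV with invented values in place of `pl.csv`, `process_chunk` is a hypothetical helper, and the chunk size is arbitrary. It also illustrates how `errors="coerce"` turns impossible dates into missing timestamps, which are then dropped.

```python
import io
import pandas as pd

# tiny stand-in for pl.csv (values invented); rows 2 and 3 have invalid dates
toy_csv = io.StringIO(
    "pid,syear,ptagin,pmonin\n"
    "1,2018,15,3\n"
    "2,2018,31,2\n"
    "3,2019,-1,5\n"
    "4,2019,7,11\n"
)

def process_chunk(chunk):
    """Build the interview timestamp and drop rows where it is invalid."""
    parts = chunk.rename(columns={"syear": "year", "pmonin": "month", "ptagin": "day"})
    chunk = chunk.assign(
        time=pd.to_datetime(parts[["year", "month", "day"]], errors="coerce")
    )
    return chunk[chunk["time"].notna()]

# chunksize makes read_csv return an iterator of dataframes
pieces = [process_chunk(c) for c in pd.read_csv(toy_csv, chunksize=2)]
result = pd.concat(pieces, ignore_index=True)
```

In the real pipeline each processed chunk would additionally be merged with the tracking, household and climate dataframes before being appended to the output, for instance with `to_sql(..., if_exists="append")`.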