### First Regression (Turnover intention ~ unfair treatment x neg. reciprocity)

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import math

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

### Read in SOEP data:
- vp: 2005 data; main variables of interest: questions on negative reciprocity
- wp: 2006 data; main variables of interest: question on perceived recognition for work
- xp: 2007 data; main variables of interest: turnover intentions, controls

In [4]:
# define path: insert the path where the SOEP data is stored on your computer here
from pathlib import Path
data_folder = Path("C:/Users/max-admin/Desktop/Masterstudium/WiSe_22_23/Research_Module/SOEP-Data/Stata/raw")
# define relevant subsets of SOEP-data
file_names = ['vp', 'wp', 'xp']

file_paths = [data_folder / f"{file_name}.dta" for file_name in file_names]
# some controls are in gen data
file_paths_2 = [data_folder / f"{file_name}gen.dta" for file_name in file_names]

In [22]:
# read in 2005 data for the reciprocity measures
data05 = pd.read_stata(file_paths[0], columns=["pid","hid", "syear","vp12602", "vp12603", "vp12605"]).set_index(['pid', 'hid'])
df_05 = data05.rename(columns={ 'vp12602': 'take_revenge', 'vp12603': 'similar_problems', 'vp12605': 'insult_back'})
df_05.head(3)

pid,hid,syear,take_revenge,similar_problems,insult_back
201,27,2005,[1] Trifft ueberhaupt nicht zu,[5] Skala 1-7,[1] Trifft ueberhaupt nicht zu
203,60313,2005,[2] Skala 1-7,[3] Skala 1-7,[2] Skala 1-7
602,60,2005,[5] Skala 1-7,[4] Skala 1-7,[3] Skala 1-7


In [21]:
# read in 2006 data
# still includes all unfair-treatment items
data06 = pd.read_stata(file_paths[1], columns=["pid", "hid", "syear", "wp43b03", "wp43b05", "wp43b07", 'wp43b02', 'wp43b04', 'wp43b06']).set_index(['pid', 'hid'])
df_06 = data06.rename(columns={"wp43b03": "recog_effort", "wp43b05": "recog_personal", "wp43b07": "recog_pay", 'wp43b02': 'felt_recog_sup', 'wp43b04': 'felt_recog_effort', 'wp43b06': 'felt_recog_pay'})
df_06.head(3)

pid,hid,syear,recog_effort,recog_personal,recog_pay,felt_recog_sup,felt_recog_effort,felt_recog_pay
201,27,2006,[-2] trifft nicht zu,[-2] trifft nicht zu,[-2] trifft nicht zu,[-2] trifft nicht zu,[-2] trifft nicht zu,[-2] trifft nicht zu
203,60313,2006,[1] Ja,[2] Nein,[2] Nein,[-2] trifft nicht zu,[-2] trifft nicht zu,[2] Maessig
602,60,2006,[1] Ja,[1] Ja,[1] Ja,[2] Maessig,[-2] trifft nicht zu,[-2] trifft nicht zu


In [23]:
# read in 2007 data
# for outcome and all controls
data3 = pd.read_stata(file_paths[2], columns=["pid", "hid", "syear", "xp8601", "xp0102", "xp2701", "xp7302"]).set_index(['pid', 'hid'])
df_07 = data3.rename(columns={'xp8601': 'school_degree', 'xp2701': 'turnover_intention', 'xp7302': 'wage_lastm'})
df_07.head(3)

pid,hid,syear,school_degree,xp0102,turnover_intention,wage_lastm
201,27,2007,[-2] trifft nicht zu,[-2] trifft nicht zu,[-2] trifft nicht zu,-2
203,60313,2007,[-2] trifft nicht zu,[6] 6 Zufrieden: Skala 0-Niedrig bis 10-Hoch,[0] 0% wahrscheinlich,1200
602,60,2007,[-2] trifft nicht zu,[-1] keine Angabe,[90] 90% wahrscheinlich,200
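
The labelled categories in the outputs above (for example `[5] Skala 1-7` or `[90] 90% wahrscheinlich`) have to be turned into numbers before any regression. A minimal sketch of one way to do this follows; it is not necessarily the recoding used elsewhere in this project. The helper `label_to_code` and the toy frame are hypothetical, and treating all negative codes (such as `[-1] keine Angabe` and `[-2] trifft nicht zu`) as missing is an assumption.

```python
import pandas as pd


def label_to_code(series: pd.Series) -> pd.Series:
    """Extract the numeric code from labels such as '[5] Skala 1-7'.

    Assumption: negative codes mark non-response and are set to NaN.
    """
    codes = series.astype(str).str.extract(r"^\[(-?\d+)\]", expand=False)
    codes = pd.to_numeric(codes, errors="coerce")
    return codes.where(codes >= 0)


# toy data that mimics the printed labels (not real SOEP records)
toy = pd.DataFrame({
    "take_revenge": ["[1] Trifft ueberhaupt nicht zu", "[2] Skala 1-7", "[5] Skala 1-7"],
    "turnover_intention": ["[-2] trifft nicht zu", "[0] 0% wahrscheinlich", "[90] 90% wahrscheinlich"],
})

recoded = toy.apply(label_to_code)
print(recoded)
```

An alternative that may be simpler is to pass `convert_categoricals=False` to `pd.read_stata`, which returns the underlying numeric codes directly; negative codes would still need to be set to missing afterwards.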


## Additional controls still needed:
- Gender
- Age
- Age^2 / 100 to control for a non-linear relationship with age
- Work experience?
- Years of education: xbilzeit in xpgen (subset for 2007 with generated variables), ~20,000 valid answers
- Industry sector: Nace07 in xpgen (to control for wage relative to the industry median; we will see whether that is feasible with the data), ~11,000 valid answers
- Company size: betr in xpgen, ~11,000 valid answers
- Tenure: xerwzeit in xpgen, ~10,000 valid answers

- Job change in the past 12 months: we also need a variable for that
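
As a rough illustration of the age controls listed above, the sketch below builds age and age squared divided by 100 from a birth year. The column name `birth_year` and the toy values are hypothetical; the actual SOEP variable and the file it comes from still need to be identified.

```python
import pandas as pd

# toy frame; 'birth_year' is a placeholder name, not a confirmed SOEP variable
controls = pd.DataFrame({
    "pid": [201, 203, 602],
    "birth_year": [1950, 1975, 1982],
})

survey_year = 2007  # wave in which the outcome is measured

controls["age"] = survey_year - controls["birth_year"]
# dividing by 100 keeps the coefficient on the squared term at a readable scale
controls["age_sq_100"] = controls["age"] ** 2 / 100
print(controls)
```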



## When everything is added:

- merge the dataframes on pid
- recode categorical variables as in Maxie's data management notebook
- construct a control variable for the wage (relative to industry sector, Mincer residuals; not needed for the first regression)
- drop NaN values: it will be interesting to see how many observations remain. About 5,000 would be good; otherwise we have to look for other control variables in the SOEP Companion

Then conduct the first analysis, dropping people who changed jobs in the last 12 months.
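
The planned steps could look roughly like the sketch below. It runs on randomly generated toy data, so the estimates mean nothing. The column names `neg_reciprocity`, `unfair` and `changed_job` are placeholders for variables that still have to be constructed, and the formula leaves out all controls.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
pids = np.arange(n)

# toy stand-ins for the recoded 2005, 2006 and 2007 frames
toy_05 = pd.DataFrame({"pid": pids, "neg_reciprocity": rng.integers(1, 8, n).astype(float)})
toy_06 = pd.DataFrame({"pid": pids, "unfair": rng.integers(0, 2, n)})
toy_07 = pd.DataFrame({
    "pid": pids,
    "turnover_intention": rng.integers(0, 11, n) * 10.0,
    "changed_job": rng.integers(0, 2, n),
})
toy_05.loc[:4, "neg_reciprocity"] = np.nan  # a few missing answers

# 1) merge the three waves on the person identifier
merged = toy_05.merge(toy_06, on="pid", how="inner").merge(toy_07, on="pid", how="inner")

# 2) drop recent job changers, then rows with missing values
sample = merged[merged["changed_job"] == 0].dropna()

# 3) OLS with main effects and the interaction term (a * b expands to a + b + a:b)
model = smf.ols("turnover_intention ~ unfair * neg_reciprocity", data=sample).fit()
print(len(merged), len(sample))
print(model.params)
```

Printing the row counts before and after filtering shows how many observations the job-change restriction and the missing values cost.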

