# 03b Calculating intercoder reliability
Loading the annotated samples of two researchers and calculating Krippendorff's alpha.
Some variables depend on others (see the codebooks for more information), which has to be taken into account when calculating Krippendorff's alpha: a dependent variable is only compared on units where both coders coded the parent variable as 1.
- SPECIFIC, EFFECTIVE, QUESTION and LANGUAGE are dependent on VALID.
- All remaining variables are dependent on SPECIFIC.
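
The dependency rule above can be sketched on toy data. This is only an illustration: the two small frames and the helper name `both_coded_parent` are made up for this example and are not part of the project.

```python
import pandas as pd

nan = float("nan")

# Toy codes from two hypothetical coders; NaN marks "not coded".
coder_a = pd.DataFrame({"VALID": [1, 1, 0, 1], "SPECIFIC": [1.0, 0.0, nan, 1.0]})
coder_b = pd.DataFrame({"VALID": [1, 0, 0, 1], "SPECIFIC": [1.0, nan, nan, 0.0]})


def both_coded_parent(df1, df2, parent):
    """Boolean mask of rows where both coders set the parent variable to 1."""
    return (df1[parent] == 1) & (df2[parent] == 1)


# SPECIFIC is only compared on rows that both coders marked as VALID.
mask = both_coded_parent(coder_a, coder_b, "VALID")
pairs = pd.concat(
    [coder_a.loc[mask, "SPECIFIC"], coder_b.loc[mask, "SPECIFIC"]],
    axis=1,
    keys=["A", "B"],
)
print(pairs)
```

Only rows 0 and 3 survive the filter here, because those are the only rows both toy coders marked as valid.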

In [1]:
import pandas as pd
from datetime import datetime
import krippendorff
import numpy as np
date = datetime.now().strftime('%d%m%Y')

In [2]:
# setting paths
PATH = '/Users/marieke/SearchingForBias'

In [3]:
sample1 = pd.read_excel(PATH+"/data/immigration/manual_coding/ICB_sample_25022021_MVH.xlsx", engine='openpyxl')
sample2 = pd.read_excel(PATH+"/data/immigration/manual_coding/ICB_sample_25022021_DT.xlsx", engine='openpyxl')

In [4]:
print(sample1.shape, sample2.shape)

(274, 20) (274, 20)


In [5]:
#sample1.head()

In [6]:
#sample2.head()

#### Calculation

In [7]:
def ICB_calculation(df1, df2, varname, conditional=None):
    """Return {varname: Krippendorff's alpha (nominal)} for two coders.

    If `conditional` is given, only rows where both coders coded that
    variable as 1 are compared.
    """
    tmpdf = pd.concat([df1[varname], df2[varname]], axis=1)
    if conditional:
        print('Conditional on...', conditional)
        # rows where both coders set the parent variable to 1
        index_conditional=pd.concat([df1[conditional], df2[conditional]], axis=1, keys=["MVH", "DT"]).query("MVH==1 & DT==1").index.to_list()
        tmpdf = tmpdf[tmpdf.index.isin(index_conditional)]
    # reliability matrix: one row per coder, one column per unit
    matrix = np.array([tmpdf.iloc[:,0], tmpdf.iloc[:,1]])
    alpha = krippendorff.alpha(reliability_data=matrix, level_of_measurement='nominal')
    print(f'The intercoderreliability for {varname} is {alpha}')
    alpha = round(alpha, 3)
    return {varname:alpha}
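
To make clearer what the statistic measures, here is a minimal pure-Python sketch of Krippendorff's alpha for the special case of two coders and nominal data. It is not the implementation used by the `krippendorff` package; the function name `nominal_alpha` and the toy codes are assumptions made for illustration. Units with a missing value from either coder are skipped, since a unit needs two values to form a pair.

```python
import math
from collections import Counter
from itertools import permutations


def nominal_alpha(coder_a, coder_b):
    """Krippendorff's alpha for two coders and nominal categories."""

    def missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    # Coincidence counts: each fully coded unit contributes (a, b) and (b, a).
    coincidences = Counter()
    for a, b in zip(coder_a, coder_b):
        if missing(a) or missing(b):
            continue
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1

    # Marginal totals per category and the total number of pairable values.
    totals = Counter()
    for (category, _), count in coincidences.items():
        totals[category] += count
    n = sum(totals.values())

    observed = sum(count for (c, k), count in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c, k in permutations(totals, 2))
    if expected == 0:
        return float("nan")  # only one category in use: alpha is undefined
    return 1 - (n - 1) * observed / expected


# One disagreement in four units.
print(round(nominal_alpha([0, 0, 1, 1], [0, 1, 1, 1]), 3))  # 0.533
```

Alpha compares the observed disagreement with the disagreement expected from the category totals alone, so the same number of mismatches weighs more heavily when a category is rarely used.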

In [8]:
icr1 = ICB_calculation(sample1, sample2, 'VALID')

The intercoderreliability for VALID is 0.7318627450980393


In [9]:
varnames1 = ['SPECIFIC', 'EFFECTIVE', 'QUESTION', 'LANGUAGE']

In [10]:
icr2={}
for varname in varnames1:
    icr2.update(ICB_calculation(sample1, sample2, varname, conditional='VALID'))

Conditional on... VALID
The intercoderreliability for SPECIFIC is 0.74877916440586
Conditional on... VALID
The intercoderreliability for EFFECTIVE is 0.7264081027667985
Conditional on... VALID
The intercoderreliability for QUESTION is 0.9208504556012149
Conditional on... VALID
The intercoderreliability for LANGUAGE is 0.9501294607677586


In [11]:
varnames2=['DEBATE', 'PROBLEMS', 'ADMISSION',
       'HOUSING', 'INTEGRATION', 'CRIME', 'RACISM', 'ECONOMY',
       'CULTURE_RELIGION', 'CAUSES', 'POLITICS', 'STATISTICS', 'NEWS']

In [12]:
icr3 = {}
for varname in varnames2:
    icr3.update(ICB_calculation(sample1, sample2, varname, conditional='SPECIFIC'))

Conditional on... SPECIFIC
The intercoderreliability for DEBATE is 0.7926988265971316
Conditional on... SPECIFIC
The intercoderreliability for PROBLEMS is 0.7342089552238806
Conditional on... SPECIFIC
The intercoderreliability for ADMISSION is 0.8414529914529915
Conditional on... SPECIFIC
The intercoderreliability for HOUSING is 0.8350377945753669
Conditional on... SPECIFIC
The intercoderreliability for INTEGRATION is 0.7904800322710771
Conditional on... SPECIFIC
The intercoderreliability for CRIME is 0.8282407407407407
Conditional on... SPECIFIC
The intercoderreliability for RACISM is 1.0
Conditional on... SPECIFIC
The intercoderreliability for ECONOMY is 0.4789325842696629
Conditional on... SPECIFIC
The intercoderreliability for CULTURE_RELIGION is 1.0
Conditional on... SPECIFIC
The intercoderreliability for CAUSES is 0.8072727272727273
Conditional on... SPECIFIC
The intercoderreliability for POLITICS is 0.8835530445699937
Conditional on... SPECIFIC
The intercoderreliability for STAT

In [13]:
result = pd.DataFrame({**icr1, **icr2, **icr3}, index=["Krippendorff Alpha"]).T
result.to_latex(PATH+"/report/tables/IRC_table.tex")



In [14]:
result

Unnamed: 0,Krippendorff Alpha
VALID,0.732
SPECIFIC,0.749
EFFECTIVE,0.726
QUESTION,0.921
LANGUAGE,0.95
DEBATE,0.793
PROBLEMS,0.734
ADMISSION,0.841
HOUSING,0.835
INTEGRATION,0.79


The relatively low alpha for ECONOMY can be explained by disagreement about whether search queries referring to financial support should be coded as economy. This was fixed by adding a new category, FINANCIAL SUPPORT (see below).

#### Inspect disagreement

In [15]:
sample1.columns = [c+"_MVH" for c in sample1.columns]
sample2.columns = [c+"_DT" for c in sample2.columns]

In [16]:
samples = pd.concat([sample1, sample2],axis=1)
samples = samples.reindex(sorted(samples.columns), axis=1)

In [17]:
samples

Unnamed: 0,ADMISSION_DT,ADMISSION_MVH,CAUSES_DT,CAUSES_MVH,CRIME_DT,CRIME_MVH,CULTURE_RELIGION_DT,CULTURE_RELIGION_MVH,Comments_DT,Comments_MVH,...,RACISM_DT,RACISM_MVH,SPECIFIC_DT,SPECIFIC_MVH,STATISTICS_DT,STATISTICS_MVH,VALID_DT,VALID_MVH,search query_DT,search query_MVH
0,,,,,,,,,,,...,,,0.0,0.0,,,1,1,buitenlanders,buitenlanders
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,...,0.0,0.0,1.0,1.0,1.0,1.0,1,1,hoeveel immigranten zijn er in nederland,hoeveel immigranten zijn er in nederland
2,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,,,...,0.0,0.0,1.0,1.0,1.0,1.0,1,1,hoeveel asielzoekers worden geweigerd in neder...,hoeveel asielzoekers worden geweigerd in neder...
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,...,0.0,0.0,1.0,1.0,0.0,0.0,1,1,inburgeringscursus eisen,inburgeringscursus eisen
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,spelfout,,...,0.0,0.0,1.0,1.0,0.0,0.0,1,1,overneid migratie,overneid migratie
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
269,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,...,0.0,0.0,1.0,1.0,0.0,0.0,1,1,banen van migranten,banen van migranten
270,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,...,0.0,0.0,1.0,1.0,0.0,0.0,1,1,gevaren veel emigraties nederland,gevaren veel emigraties nederland
271,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,...,0.0,0.0,1.0,1.0,0.0,0.0,1,1,statushouders intergratie,statushouders intergratie
272,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,...,0.0,0.0,1.0,1.0,0.0,0.0,1,1,noodhulp,noodhulp


In [18]:
varstotal=["VALID"]+varnames1+varnames2

In [19]:
disagreement = {}

In [20]:
filterdisagreements = samples.query('VALID_MVH!=VALID_DT').index.to_list()
disagree = samples.loc[filterdisagreements,]
disagree = disagree[['search query_DT', 'Comments_DT', 'Comments_MVH', 'VALID_MVH', 'VALID_DT']]
disagreement.update({'VALID':disagree})

In [21]:
for varname in varnames1:
    cols = [c for c in samples.columns if c.startswith(varname)]
    index_conditional=samples.query("VALID_MVH==1 & VALID_DT==1").index.to_list()
    df = samples[samples.index.isin(index_conditional)]
    filterdisagreements = df.query(f'{cols[0]}!={cols[1]}').index.to_list()
    disagree = df.loc[filterdisagreements,]
    disagree = disagree[['search query_DT', 'Comments_DT', 'Comments_MVH']+cols]
    disagreement.update({varname:disagree})

In [22]:
for varname in varnames2:
    cols = [c for c in samples.columns if c.startswith(varname)]
    index_conditional=samples.query("SPECIFIC_MVH==1 & SPECIFIC_DT==1").index.to_list()
    df = samples[samples.index.isin(index_conditional)]
    filterdisagreements = df.query(f'{cols[0]}!={cols[1]}').index.to_list()
    disagree = df.loc[filterdisagreements,]
    disagree = disagree[['search query_DT', 'Comments_DT', 'Comments_MVH']+cols]
    disagreement.update({varname:disagree})
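
One caveat with filtering on `!=`: in pandas a missing value never equals another missing value, so a row that both coders left empty would also be flagged as a disagreement. Whether this affects the sheets written here depends on the data. The sketch below uses made-up toy values and hypothetical variable names to show a stricter filter that ignores rows both coders left uncoded.

```python
import pandas as pd

nan = float("nan")

# Toy codes for one variable; NaN marks "not coded".
toy = pd.DataFrame({
    "ECONOMY_DT": [1.0, 0.0, nan, 1.0, nan],
    "ECONOMY_MVH": [1.0, 1.0, nan, nan, nan],
})

# Plain inequality: NaN != NaN evaluates to True.
naive = toy["ECONOMY_DT"] != toy["ECONOMY_MVH"]

# Stricter variant: drop rows where both coders left the cell empty.
both_missing = toy["ECONOMY_DT"].isna() & toy["ECONOMY_MVH"].isna()
strict = naive & ~both_missing

print(toy[strict])
```

Rows where only one coder left the cell empty are still kept, because that is a real difference between the two codings.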

In [23]:
disagree.columns

Index(['search query_DT', 'Comments_DT', 'Comments_MVH', 'NEWS_DT',
       'NEWS_MVH'],
      dtype='object')

In [24]:
writer = pd.ExcelWriter(PATH+'/data/immigration/manual_coding/ICR_disagreement.xlsx', engine='openpyxl') 

In [25]:
for varname, df in disagreement.items():
    df.to_excel(writer, sheet_name=varname)

In [26]:
# ExcelWriter.save() was removed in pandas 2.0; close() writes and closes the file
writer.close()

In [27]:
samples.loc[samples.ECONOMY_DT==1][['search query_DT', 'Comments_MVH', 'Comments_DT', 'ECONOMY_MVH', 'ECONOMY_DT']]

Unnamed: 0,search query_DT,Comments_MVH,Comments_DT,ECONOMY_MVH,ECONOMY_DT
43,financeel plaatje immigranten,Financial stuff now belongs nowhere.,,0.0,1.0
124,bijstand,"No category for ""financial support""",,0.0,1.0
158,goedkope arbeiders,,,1.0,1.0
174,financiele steun imigranten in ned,"No category for ""financial support""",,0.0,1.0
177,opleiding assielzoekers,,,1.0,1.0
186,opleidingsniveau immigranten,,,1.0,1.0
207,ww,,,,1.0
210,kosten immigratie,"No category for ""kosten""",,0.0,1.0
241,uitkeringen,"No category for ""financial support""",,0.0,1.0
254,kosten emigranten,"No category for ""kosten""",,0.0,1.0


In [28]:
samples.loc[samples.ECONOMY_MVH==1][['search query_DT', 'Comments_MVH', 'Comments_DT', 'ECONOMY_MVH', 'ECONOMY_DT']]

Unnamed: 0,search query_DT,Comments_MVH,Comments_DT,ECONOMY_MVH,ECONOMY_DT
26,percentage of english speakers looking for joo...,,,1.0,0.0
158,goedkope arbeiders,,,1.0,1.0
177,opleiding assielzoekers,,,1.0,1.0
186,opleidingsniveau immigranten,,,1.0,1.0
269,banen van migranten,,,1.0,1.0
