ANLP 2020/2021 final project    
Friederike Schreiber, Peng Chen, Anton Rabe

This script evaluates the human study.  

The basic structure of the study:
The study was split into two parts. The first part asked participants to rate the naturalness of a given sentence.  
The second part asked whether a sentence was more likely written by a human or by a machine.  

In each part, four conditions were tested: 
    
    Questions 1-10 Two Layer Neural Network
    Questions 11-20 Markov Chain Model
    Questions 21-30 Real Song Text
    Questions 31-40 Random Text
    


Results of the evaluation:

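The Spearman correlation shows a positive, statistically significant link between a verse being perceived as natural and being judged as likely written by a human (correlation of about 0.55 for real song text and about 0.61 for random text).  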

ANOVA and the Tukey test show that the four test conditions differ significantly from each other in both halves of the study. The only exception is the pair of the two-layer network and the Markov model: there is no significant difference between these two approaches.  
There is also no significant difference between the two parts. 

ANOVA and the Tukey test show no significant difference between the results of participants with little previous exposure to machine-generated text and those of participants who were familiar with machine-generated text. 

In [1]:
#Data cleaning

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [3]:
data=pd.read_csv("../resources/study_results.csv",sep=';')

In [4]:
#filter for only completed studies
compl=data[data['Teilnahmestatus']=="teilgenommen und beendet"]

#drop all columns that contain no values at all (e.g. text-only items)
compl=compl.dropna(axis=1, how='all')

In [5]:
#drop all the meta data
compl=compl.drop(['_Antwort-ID', 'Resume-Code',"Start","Datum und Zeit","Teilnahmestatus"], axis=1)

#drop all questions about the participants
onlyanswers=compl.drop(['1. How old are you?',"Beginner (1) - Native Language (7)","I dont listen to rap at all (1) - I hear a lot of rap music (7)","No prior experience with machine generated text (1) - Frequent interaction with machine generated text (7)"],axis=1)

In [6]:
#The scale was flipped for the second half of the study (mainly to avoid bias). To make sure participants noticed
#the change, an extra test question was added. Four participants did not answer this test question correctly and were excluded.

onlyanswers=onlyanswers[onlyanswers['7']!=1.0]


In [7]:
#Each part started with test questions that were meant to familiarise the participants with the setup and
#to make sure that they understood the task.
#Answers below (or, for the random text, above) a certain value would indicate that the participant had difficulties with the task.
#Some participants answered one of the four test questions incorrectly, but their remaining data
#gives no indication that they had problems with the task overall.
#For this reason they were kept in the evaluation.

#Extract the test questions
testquestion = onlyanswers[["Completely Unatural (1) - Completely Natural (7)", "Completely Unatural (1) - Completely Natural (7).1","Written by a human (1) - Written by a machine (7)","Written by a human (1) - Written by a machine (7).1"]].copy()


testquestion=testquestion.rename(columns={"Completely Unatural (1) - Completely Natural (7)": "RealNat", "Completely Unatural (1) - Completely Natural (7).1": "RandomNat","Written by a human (1) - Written by a machine (7)":"RealComp","Written by a human (1) - Written by a machine (7).1":"RandomComp"})
print("RealNat")
print(testquestion[testquestion.RealNat <3.0])
print("RandomNat")
print(testquestion[testquestion.RandomNat>5.0])
print("RealComp")
print(testquestion[testquestion.RealComp >5.0])
print("RandomComp")
print(testquestion[testquestion.RandomComp<3.0])


RealNat
    RealNat  RandomNat  RealComp  RandomComp
15        2          3         2           6
RandomNat
    RealNat  RandomNat  RealComp  RandomComp
20        5          7         5           3
RealComp
   RealNat  RandomNat  RealComp  RandomComp
8        3          1         6           7
RandomComp
   RealNat  RandomNat  RealComp  RandomComp
2        4          2         3           2


In [8]:
#Getting different dataframes

In [9]:
def frame_natural(inputframe):
    #Naturalness:

    #get all column names that belong to the natural part
    nat_col = [col for col in inputframe.columns if 'Natural' in col]

    #make a new dataframe
    natural= inputframe[nat_col].copy()

    #the order in which the naturalness questions were shown (given as original question numbers)
    natorder=[30,23,7,9,25,26,12,19,8,20,3,33,28,14,36,24,37,5,11,39,35,10,2,32,38,6,13,17,40,15,34,4,29,16,18,31,21,22,1,27]

    #rename columns
    natural.columns = natural.columns[:2].tolist() + natorder

    #drop the first two columns; they contain the test questions
    natural=natural.drop(["Completely Unatural (1) - Completely Natural (7)","Completely Unatural (1) - Completely Natural (7).1"],axis=1)

    #sort the questions in their original order
    natural = natural.reindex(sorted(natural.columns), axis=1)
    return natural
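
The reordering step in `frame_natural` can be illustrated in isolation. This is only a sketch in plain Python with made-up answers: `shown_order` and `restore_order` are hypothetical names, and the idea is simply that the i-th shown column carries the original question number `shown_order[i]`, so sorting by that number restores the original question order.

```python
# Hypothetical mini example: three questions shown in shuffled order.
shown_order = [3, 1, 2]           # i-th shown column holds original question shown_order[i]
answers_as_shown = [
    [5, 6, 2],                    # participant A, in the order the questions were shown
    [4, 7, 1],                    # participant B
]

def restore_order(row, order):
    """Return the answers of one participant sorted by original question number."""
    return [value for _, value in sorted(zip(order, row))]

answers_original = [restore_order(row, shown_order) for row in answers_as_shown]
print(answers_original)
```

In the notebook the same effect is achieved by renaming the columns to their question numbers and reindexing on the sorted column labels.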

In [10]:
natural=frame_natural(onlyanswers)

In [11]:
natural.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,31,32,33,34,35,36,37,38,39,40
0,6,5.0,2,3,4,2,3,3,3,4,...,1,2,1,1,1,2.0,2,2,2,1
1,7,3.0,5,6,5,4,2,4,5,3,...,1,2,2,1,1,2.0,1,1,2,1
2,3,5.0,3,3,5,4,3,4,2,4,...,3,4,3,4,3,4.0,3,3,4,3
4,5,6.0,5,4,6,5,2,5,5,6,...,6,2,2,2,2,5.0,2,2,5,2
5,5,5.0,5,3,3,4,4,4,3,5,...,4,4,5,5,4,4.0,3,4,4,5


In [12]:
def frame_comparison(inputframe):
    #Comparison:

    #get all column names that belong to the comparison part
    comp_col = [col for col in inputframe.columns if 'human' in col]

    #make a new dataframe
    compare= inputframe[comp_col].copy()

    #the order in which the comparison questions were shown (given as original question numbers)
    comporder=[9,18,39,28,27,7,31,10,32,20,8,40,33,1,11,25,2,38,34,24,19,26,30,35,29,36,22,23,3,21,17,13,12,6,15,14,4,16,5,37]

    #rename columns
    compare.columns = compare.columns[:2].tolist() + comporder

    #drop the first two columns; they contain the test questions
    compare=compare.drop(["Written by a human (1) - Written by a machine (7)","Written by a human (1) - Written by a machine (7).1"],axis=1)

    #sort the questions in their original order
    compare = compare.reindex(sorted(compare.columns), axis=1)
    
    #In the second part of the study the scale was flipped to avoid bias. To compare the two parts, the values are flipped back here,
    #so that 7 means written by a human and 1 means written by a machine.
    compare=compare.replace([1.0, 2.0, 3.0, 4.0,5.0,6.0,7.0], [7.0, 6.0, 5.0, 4.0,3.0,2.0,1.0])
    
    return compare
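
The value-by-value replacement in `frame_comparison` mirrors each rating on the 1 to 7 scale. A minimal sketch of the same mapping in plain Python, assuming integer-valued ratings; `reverse_likert` is a hypothetical helper and not part of the notebook:

```python
def reverse_likert(value, low=1, high=7):
    """Mirror a rating on a low..high scale (1 <-> 7, 2 <-> 6, ..., 4 stays 4)."""
    return low + high - value

# The lookup table used in the notebook, written out as a dict for comparison.
lookup = dict(zip([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
                  [7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]))

ratings = [1.0, 2.0, 4.0, 7.0]
reversed_ratings = [reverse_likert(r) for r in ratings]
print(reversed_ratings)
```

Both forms give the same result for every value on the scale; the arithmetic form just makes the symmetry around the midpoint 4 explicit.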

In [13]:
compare=frame_comparison(onlyanswers)

In [14]:
compare.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,31,32,33,34,35,36,37,38,39,40
0,6.0,3.0,3.0,3.0,6.0,3.0,2.0,2.0,4.0,3.0,...,1.0,3.0,1.0,1.0,2.0,2.0,3.0,2.0,2.0,2.0
1,6.0,2.0,3.0,4.0,5.0,3.0,2.0,5.0,2.0,3.0,...,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
2,5.0,3.0,5.0,6.0,5.0,6.0,5.0,6.0,6.0,4.0,...,6.0,6.0,5.0,5.0,5.0,5.0,6.0,5.0,6.0,5.0
4,6.0,6.0,6.0,5.0,2.0,6.0,6.0,6.0,5.0,6.0,...,6.0,6.0,5.0,2.0,2.0,6.0,5.0,2.0,3.0,5.0
5,4.0,4.0,5.0,4.0,5.0,3.0,4.0,3.0,5.0,,...,5.0,5.0,4.0,5.0,,4.0,5.0,5.0,3.0,3.0


In [15]:

print("There are",natural.isna().sum().sum(),"questions with NAN in the natural set")
print("There are",compare.isna().sum().sum(),"questions with NAN in the comparison set")

There are 4 questions with NAN in the natural set
There are 6 questions with NAN in the comparison set


In [16]:
#Some participants skipped questions. For easier evaluation, each missing answer is replaced by the mean of the
#other participants' answers to that question (int() truncates this mean to a whole number).
natural=natural.apply(lambda x: x.fillna(int(x.mean())),axis=0)
compare=compare.apply(lambda x: x.fillna(int(x.mean())),axis=0)
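
The imputation above fills each gap with `int(x.mean())`, and `int()` truncates rather than rounds. The sketch below reproduces the idea in plain Python on made-up values to show the difference; `fill_missing` and its `to_int` parameter are hypothetical names used only for this illustration.

```python
import math

def fill_missing(column, to_int=int):
    """Replace NaN entries with the (integer-converted) mean of the answered values."""
    answered = [v for v in column if not math.isnan(v)]
    fill = to_int(sum(answered) / len(answered))
    return [fill if math.isnan(v) else v for v in column]

column = [5.0, 6.0, float("nan"), 6.0]         # mean of the answered values is 17/3
truncated = fill_missing(column)                # int() cuts off the fractional part
rounded = fill_missing(column, to_int=round)    # round() goes to the nearest integer
print(truncated, rounded)
```

With only a handful of missing answers the choice between truncating and rounding should have little effect on the group means, but it is worth knowing which one is applied.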

In [17]:
def split_models(inputframe):
    
    tl=inputframe.iloc[:, : 10]
    mc=inputframe.iloc[:, 10 :20]
    real=inputframe.iloc[:, 20:30]
    rand=inputframe.iloc[:, 30:40]
    allframe=[tl,mc,real,rand]
    return tl,mc,real,rand,allframe

In [18]:
tlnat,mcnat,realnat,randnat,allnat=split_models(natural)
tlcom,mccom,realcom,randcom,allcom=split_models(compare)


In [19]:
tlall=pd.concat([tlnat, tlcom], axis=1)
mcall=pd.concat([mcnat, mccom], axis=1)
realall=pd.concat([realnat, realcom], axis=1)
randall=pd.concat([randnat, randcom], axis=1)

allsorted=[tlall,mcall,realall,randall]

In [20]:
mcall.columns=list(tlall.columns)
realall.columns=list(tlall.columns)
randall.columns=list(tlall.columns)

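#DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames row-wise in the same way
condcombined=pd.concat([tlall,mcall,realall,randall])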


In [21]:
#Getting mean and std

In [22]:
def mean_and_std(dataframe):
    mean=round((dataframe.mean().mean()),2)
    std=round((dataframe.stack().std()),2)
    return mean,std
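
`mean_and_std` takes the mean of the per-question means but the standard deviation over all individual answers. As a sketch with made-up answers (plain Python, sample standard deviation as in the pandas default), the two ways of averaging agree when every question has the same number of answers:

```python
from statistics import mean, stdev

# Hypothetical answers of three participants to two questions.
answers = {"q1": [6, 7, 3], "q2": [5, 3, 5]}

column_means = [mean(col) for col in answers.values()]
mean_of_means = mean(column_means)              # analogue of dataframe.mean().mean()

pooled = [v for col in answers.values() for v in col]
grand_mean = mean(pooled)                       # mean over all individual answers
spread = stdev(pooled)                          # analogue of dataframe.stack().std()

print(round(mean_of_means, 2), round(grand_mean, 2), round(spread, 2))
```

Because missing answers were filled in beforehand, all questions have equally many answers in the notebook, so the reported mean is also the mean over all individual answers.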

In [23]:
print("Mean and Std for Naturalness:")
for frame in allnat:
    print(mean_and_std(frame))

print()
print("Mean and Std for Comparison:")
for frame in allcom:
    print(mean_and_std(frame))
    
print()
print("Mean and Std for both:")
for frame in allsorted:
    print(mean_and_std(frame))

Mean and Std for Naturalness:
(3.64, 1.59)
(3.58, 1.48)
(5.3, 1.46)
(2.51, 1.45)

Mean and Std for Comparison:
(3.8, 1.73)
(3.91, 1.6)
(5.55, 1.58)
(2.79, 1.7)

Mean and Std for both:
(3.71, 1.66)
(3.75, 1.55)
(5.43, 1.53)
(2.65, 1.59)


In [24]:
#Getting ANOVA and Tukey 

In [25]:
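#Build a frame with the four test conditions and, for each question, the mean over all participants
#Condition: Naturalness
#NOTE: the labels are assigned in the order tl, mc, rand, real, while the questions are ordered tl, mc, real, rand.
#In the output, "randna" therefore labels questions 21-30 (real song text) and "realna" questions 31-40 (random text).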
index=range(1,41)
namena=pd.Series(["tlna","mcna","randna","realna"])
namena=namena.repeat(10)
namena.index = index

meanna=natural.mean()

combnat=pd.concat([namena,meanna ], axis=1,ignore_index=True)

combnat.columns = ['Method', 'Metascore']

In [26]:
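#Condition: Comparison
#NOTE: as for naturalness, the labels follow the order tl, mc, rand, real, while the questions are ordered tl, mc, real, rand.
#In the output, "randcom" therefore labels questions 21-30 (real song text) and "realcom" questions 31-40 (random text).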
index=range(1,41)
namecom=pd.Series(["tlcom","mccom","randcom","realcom"])
namecom=namecom.repeat(10)
namecom.index = index

meancom=compare.mean()

combcom=pd.concat([namecom,meancom], axis=1,ignore_index=True)
combcom.columns=["Method","Metascore"]

In [27]:
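#Condition: Both combined
#NOTE: condcombined has one row per participant and condition (blocks of 20 rows in the order tl, mc, real, rand),
#so meanall holds per-participant means over both parts. The labels are assigned in blocks of 10 rows
#and therefore do not line up with the conditions or with the two parts of the study.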
index=range(1,81)
nameall=pd.Series(["tlnat","tlcom","mcnat","mccom","randnat","randcom","realnat","realcom"])
nameall=nameall.repeat(10)
nameall.index = index

meanall=condcombined.mean(axis=1)
meanall.index=index
comball=pd.concat([nameall,meanall ], axis=1,ignore_index=True)

comball.columns = ['Method', 'Metascore']

In [28]:
#Make models for oneway ANOVA test
lm=ols("Metascore~Method",data=combnat).fit()
table=sm.stats.anova_lm(lm)
print(table)

lm=ols("Metascore~Method",data=combcom).fit()
table=sm.stats.anova_lm(lm)
print(table)

lm=ols("Metascore~Method",data=comball).fit()
table=sm.stats.anova_lm(lm)
print(table)

            df     sum_sq    mean_sq          F        PR(>F)
Method     3.0  39.957688  13.319229  68.092768  6.551074e-15
Residual  36.0   7.041750   0.195604        NaN           NaN
            df    sum_sq    mean_sq          F        PR(>F)
Method     3.0  39.04025  13.013417  66.277569  9.878368e-15
Residual  36.0   7.06850   0.196347        NaN           NaN
            df     sum_sq    mean_sq         F        PR(>F)
Method     7.0  79.346719  11.335246  11.39869  1.224757e-09
Residual  72.0  71.599250   0.994434       NaN           NaN
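
In the tables above, F is the ratio of the two `mean_sq` entries (for the first table 13.319229 / 0.195604 ≈ 68.09), and the degrees of freedom follow from 4 groups and 40 question means (3 and 36). The following sketch recomputes a one-way ANOVA F statistic by hand in plain Python on made-up groups; `one_way_anova_f` is a hypothetical helper, not the statsmodels routine used in the notebook, and it does not compute the p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # made-up data, not study results
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```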


In [29]:
#The ANOVA results indicate that there are statistically significant differences between some of the test groups.
#Use the Tukey test to see which groups differ from each other

In [30]:

tukeynat = pairwise_tukeyhsd(endog=combnat['Metascore'],
                          groups=combnat['Method'],
                          alpha=0.05)

tukeycom = pairwise_tukeyhsd(endog=combcom['Metascore'],
                          groups=combcom['Method'],
                          alpha=0.05)
tukeyall = pairwise_tukeyhsd(endog=comball['Metascore'],
                          groups=comball['Method'],
                          alpha=0.05)

In [31]:
print(tukeynat)
print()
print(tukeycom)
print()
print(tukeyall)

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower   upper  reject
---------------------------------------------------
  mcna randna     1.72 0.001  1.1873  2.2527   True
  mcna realna   -1.075 0.001 -1.6077 -0.5423   True
  mcna   tlna     0.05   0.9 -0.4827  0.5827  False
randna realna   -2.795 0.001 -3.3277 -2.2623   True
randna   tlna    -1.67 0.001 -2.2027 -1.1373   True
realna   tlna    1.125 0.001  0.5923  1.6577   True
---------------------------------------------------

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
 group1  group2 meandiff p-adj  lower   upper  reject
-----------------------------------------------------
  mccom randcom     1.64 0.001  1.1063  2.1737   True
  mccom realcom   -1.115 0.001 -1.6487 -0.5813   True
  mccom   tlcom   -0.115   0.9 -0.6487  0.4187  False
randcom realcom   -2.755 0.001 -3.2887 -2.2213   True
randcom   tlcom   -1.755 0.001 -2.2887 -1.2213   True
realcom   tlcom      1.0 0.001  0.4663  1.5337 

In [32]:
#Spearman correlation between naturalness and written by a human

In [33]:
from scipy import stats
realcorr=stats.spearmanr(realnat.stack(),realcom.stack())
print(realcorr)
randcorr=stats.spearmanr(randnat.stack(),randcom.stack())
print(randcorr)

#The Spearman correlation results suggest a statistically significant positive correlation between a sentence being perceived as natural
#and being perceived as written by a human.

SpearmanrResult(correlation=0.5484845330623838, pvalue=4.191372566150168e-17)
SpearmanrResult(correlation=0.6074746058123445, pvalue=1.4691614334082227e-21)
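
Spearman's correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below spells this out in plain Python on made-up ratings; `average_ranks`, `pearson` and `spearman` are hypothetical helpers for illustration, while the notebook itself relies on `scipy.stats.spearmanr`, which also reports the p-value.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1            # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

naturalness = [6, 5, 2, 3, 4]             # made-up ratings, not study data
human_like = [7, 5, 1, 2, 6]
rho = spearman(naturalness, human_like)
print(round(rho, 3))
```

Likert ratings contain many ties, which is why the average-rank treatment matters for data like this.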


In [34]:
#Comparing results for different machine generated text knowledge backgrounds:

In [35]:
#The participants were asked whether they had previous experience with machine-generated text.
#The following part investigates whether people who claimed to have previous experience answered differently from those who did not.

In [36]:
#Split the participants:
noknow=compl[compl['No prior experience with machine generated text (1) - Frequent interaction with machine generated text (7)']<4.0]
moreknow=compl[compl['No prior experience with machine generated text (1) - Frequent interaction with machine generated text (7)']>4.0]
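
The split above uses strict inequalities on the 7-point scale, so participants who answered exactly 4 fall into neither group. A small sketch with made-up self-ratings (not study data) shows the effect:

```python
# Hypothetical self-ratings of six participants on the 1-7 experience scale.
scores = [1, 4, 6, 3, 7, 4]

low = [s for s in scores if s < 4]        # analogue of the "noknow" group
high = [s for s in scores if s > 4]       # analogue of the "moreknow" group
neutral = [s for s in scores if s == 4]   # left out of both groups

print(low, high, neutral)
```

Leaving out the midpoint keeps the two groups clearly separated, at the cost of a smaller sample in the comparison.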

In [37]:
#Get the cleaned up dataframe
natural_noknow=frame_natural(noknow)
compare_noknow=frame_comparison(noknow)

#Combine the two halves to make structuring easier
noknowall=pd.concat([natural_noknow, compare_noknow], axis=1)

natural_moreknow=frame_natural(moreknow)
compare_moreknow=frame_comparison(moreknow)

moreknowall=pd.concat([natural_moreknow, compare_moreknow], axis=1)

print("There are",noknowall.isna().sum().sum(),"questions with NAN in the no machine knowledge set")
print("There are",moreknowall.isna().sum().sum(),"questions with NAN in the more machine knowledge set")

#Some participants skipped questions. For easier evaluation, each missing answer is replaced by the mean of the
#other participants' answers to that question (int() truncates this mean to a whole number).
noknowall=noknowall.apply(lambda x: x.fillna(int(x.mean())),axis=0)
moreknowall=moreknowall.apply(lambda x: x.fillna(int(x.mean())),axis=0)

There are 6 questions with NAN in the no machine knowledge set
There are 0 questions with NAN in the more machine knowledge set


In [38]:
#Split the dataframe into the four test conditions but combine the two parts
def split_large_frame(inputframe):
    tl=pd.concat([inputframe.iloc[:, : 10], inputframe.iloc[:, 40:50]], axis=1)
    mc=pd.concat([inputframe.iloc[:, 10:20], inputframe.iloc[:, 50:60]], axis=1)
    real=pd.concat([inputframe.iloc[:, 20:30], inputframe.iloc[:,60:70]], axis=1)
    rand=pd.concat([inputframe.iloc[:, 30:40], inputframe.iloc[:, 70:80]], axis=1)
    allframe=[tl,mc,real,rand]
    return tl,mc,real,rand,allframe

In [39]:
#Get the dataframe
tlno,mcno,realno,randno,noknowsort=split_large_frame(noknowall)
tlmore,mcmore,realmore,randmore,moreknowsort=split_large_frame(moreknowall)

In [40]:
for frame in noknowsort:
    print(mean_and_std(frame))

print()
for frame in moreknowsort:
    print(mean_and_std(frame))

(3.8, 1.75)
(3.62, 1.5)
(5.12, 1.55)
(2.94, 1.59)

(3.93, 1.55)
(4.22, 1.47)
(5.52, 1.56)
(3.01, 1.71)


In [41]:
#Append the frames into one dataframe
mcno.columns=list(tlno.columns)
mcmore.columns=list(tlno.columns)
realno.columns=list(tlno.columns)
randno.columns=list(tlno.columns)
realmore.columns=list(tlno.columns)
randmore.columns=list(tlno.columns)

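#DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames row-wise in the same way
combframe=pd.concat([tlno,tlmore,mcno,mcmore,realno,realmore,randno,randmore])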

In [42]:
#Make a dataframe with the mean score of each participant, labelled by knowledge group and test condition
namelist=["notl","moretl","nomc","moremc","noreal","morereal","norand","morerand"]
index=range(1,69)
name=pd.Series(namelist)
name=name.repeat([8,9,8,9,8,9,8,9])
name.index = index

mean=combframe.mean(axis=1)
mean.index=index


comb=pd.concat([name,mean ], axis=1,ignore_index=True)

comb.columns = ['Method', 'Metascore']


In [43]:
#Calculate ANOVA
lm=ols("Metascore~Method",data=comb).fit()
table=sm.stats.anova_lm(lm)
print(table)



            df     sum_sq   mean_sq         F    PR(>F)
Method     7.0  49.144651  7.020664  7.462714  0.000002
Residual  60.0  56.445937  0.940766       NaN       NaN


In [44]:
#Calculate Tukey
tukey = pairwise_tukeyhsd(endog=comb['Metascore'],
                          groups=comb['Method'],
                          alpha=0.05)

In [45]:
print(tukey)

  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1   group2  meandiff p-adj   lower   upper  reject
--------------------------------------------------------
  moremc morerand  -1.2111 0.1593 -2.6469  0.2247  False
  moremc morereal   1.4278 0.0523  -0.008  2.8635  False
  moremc   moretl  -0.1111    0.9 -1.5469  1.3247  False
  moremc     nomc  -0.3833    0.9 -1.8633  1.0966  False
  moremc   norand  -0.9896 0.4295 -2.4695  0.4904  False
  moremc   noreal   0.9917 0.4267 -0.4883  2.4716  False
  moremc     notl  -0.3458    0.9 -1.8258  1.1341  False
morerand morereal   2.6389  0.001  1.2031  4.0747   True
morerand   moretl      1.1 0.2575 -0.3358  2.5358  False
morerand     nomc   0.8278  0.632 -0.6522  2.3077  False
morerand   norand   0.2215    0.9 -1.2584  1.7015  False
morerand   noreal   2.2028  0.001  0.7228  3.6827   True
morerand     notl   0.8653  0.586 -0.6147  2.3452  False
morereal   moretl  -1.5389 0.0274 -2.9747 -0.1031   True
morereal     nomc  -1.8111 0.00