# Sample generator for embedded questions
This notebook is for creating the random sample .txt files for the experiment.

In [1]:
import pandas as pd
import numpy as np

 # Contents
 1. [Constrain the data set to stimuli set](#Constrain-the-dataset-to-stimuli-set)
 2. [Figuring out the distribution of factors per list](#Figuring-out-the-distribution-of-factors-per-list)
 3. [Figure out how to collapse the matrix verb columns](#Figure-out-how-to-collapse-the-matrix-verb-columns)
 4. [Add in the paraphrases](#Add-in-the-paraphrases)
 5. [Split EntireSentence on Question](#Split-EntireSentence-on-Question)
 6. [Controls](#Controls)
 7. [Balancing factors](#Balancing-factors)
     1. [Modal Balancing](#Modal-Balancing)
     2. [Wh Balancing](#Wh-Balancing)
         1. [Who](#Who)
         2. [What](#What)
         3. [Where](#Where)
         4. [When](#When)
         5. [How](#How)
         6. [Why](#Why)
 8. [Generating random samples](#Generating-random-samples)
     1. [First Iteration](#First-Iteration)
     2. [Second Iteration](#Second-Iteration)
     3. [Third Iteration](#Third-Iteration)
     4. [Fourth Iteration](#Fourth-Iteration)
     5. [Fifth Iteration](#Fifth-Iteration)
     6. [Sixth Iteration](#Sixth-Iteration)
     7. [Final Set](#Final-Set)
 9. [Pilot Samples](#Pilot-Samples)

In [2]:
# import the database file produced by the TGrep2 search
df = pd.read_csv("../results/swbd.tab", sep='\t', engine='python')

In [3]:
# This makes the display show more info
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

In [4]:
df.pivot_table(index=['QuestionType'], values="Question", aggfunc=len).groupby(["QuestionType"]).Question.transform(lambda x: x/len(df)).reset_index()

Unnamed: 0,QuestionType,Question
0,adjunct,0.075203
1,cleft,0.064712
2,embadjunct,0.236886
3,embedded,0.1608
4,fragment,0.067654
5,relative,0.13315
6,root,0.124424
7,subject,0.137072
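
The same kind of proportion table can probably be produced more directly with `value_counts(normalize=True)`. The following is a minimal sketch on a hypothetical four-row frame (`toy` is a stand-in, not the real `swbd.tab` data):

```python
import pandas as pd

# Hypothetical stand-in for the TGrep2 table; only QuestionType matters here.
toy = pd.DataFrame({
    "QuestionType": ["embedded", "root", "embedded", "subject"],
    "Question": ["q1", "q2", "q3", "q4"],
})

# Share of rows per question type, sorted by label.
proportions = toy["QuestionType"].value_counts(normalize=True).sort_index()
print(proportions)
```

Unlike the `pivot_table` version, this counts rows per label directly, so no aggregation column such as `Question` is needed.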


# Constrain the dataset to stimuli set
for experimental mock-up

First we have to remove the questions that we don't want to include. An item is kept only if it meets all of these criteria:
1. embedded questions only
2. no degree questions
3. no identity questions
4. generally only monomorphemic wh-phrases
5. only who-, what-, where-, when-, how-, and why-questions

In [43]:
critical = df[(df['QuestionType'] == 'embedded') # only embedded questions
              &
              (df['DegreeQ'] == 'no' ) # no degree questions
              &
              (df['IdentityQ'] == "no") # no identity questions
              &
              (df['WhPhaseType'] == "monomorphemic") # only monomorphemic wh-phrases
              &
              (df['Wh'].isin(['how','How','where','Where','who','Who','what','What','why','Why','when','When']))] # just these wh-words
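
As a sketch of an alternative way to write this kind of filter: lowercasing `Wh` before the membership test avoids listing both capitalizations, and `.copy()` makes the result an independent frame. The frame `toy` and the set `wh_words` below are hypothetical, and only three of the filter columns are reproduced.

```python
import pandas as pd

# Hypothetical rows; the real table has many more columns.
toy = pd.DataFrame({
    "QuestionType": ["embedded", "embedded", "root", "embedded"],
    "DegreeQ": ["no", "no", "no", "yes"],
    "Wh": ["How", "which", "who", "what"],
})
wh_words = {"how", "where", "who", "what", "why", "when"}

mask = (
    (toy["QuestionType"] == "embedded")
    & (toy["DegreeQ"] == "no")
    & toy["Wh"].str.lower().isin(wh_words)  # case-insensitive match
)
# .copy() gives an independent frame, so later column assignments
# do not trigger SettingWithCopyWarning.
subset = toy[mask].copy()
print(subset)
```

Note that the lowercase test would also accept spellings such as `HOW`, which the explicit list in the cell above does not.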

In [11]:
len(critical)

1073

### Figuring out how many lists

1073/30 ≈ 35.77

35 lists x 30 = 1050

1073-1050 = 23

35 lists of 30, 1 list of 23
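
The arithmetic above can be reproduced with `divmod`. This is a small sketch; the variable names are illustrative and not part of the notebook.

```python
n_items = 1073   # len(critical)
list_size = 30   # items per full list

# Number of full lists and the size of the leftover list.
full_lists, remainder = divmod(n_items, list_size)
n_lists = full_lists + (1 if remainder else 0)
print(full_lists, remainder, n_lists)  # 35 23 36
```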


### Number of participants
36 lists x 30 participants per list = 1080 participants

# Figure out how to collapse the matrix verb columns - TBD

In [113]:
critical = critical.assign(
    Matrix=critical.MatrixPredVerb.astype(str) + ' '
    + critical.MatrixPredOther.astype(str) + ' '
    + critical.MatrixPredParticle.astype(str)
)

In [114]:
df['ColumnA'] = df[df.columns[1:]].apply(
    lambda x: ','.join(x.dropna().astype(str)),
    axis=1)

In [47]:
def verb_label(row):
    # Return the first non-missing matrix predicate of one row.
    # Meant to be applied row-wise, i.e. with DataFrame.apply(..., axis=1).
    for col in ("MatrixPredVerb", "MatrixPredOther", "MatrixPredParticle"):
        if pd.notna(row[col]):
            return row[col]
    return np.nan

In [49]:
critical["Matrix"] == ""

8        False
28       False
58       False
65       False
66       False
69       False
70       False
77       False
111      False
114      False
143      False
167      False
180      False
186      False
192      False
222      False
233      False
252      False
260      False
263      False
267      False
273      False
281      False
290      False
295      False
298      False
325      False
327      False
334      False
336      False
373      False
376      False
377      False
379      False
392      False
397      False
407      False
420      False
429      False
445      False
448      False
449      False
451      False
452      False
462      False
463      False
464      False
476      False
485      False
491      False
520      False
523      False
530      False
535      False
545      False
564      False
572      False
577      False
581      False
583      False
594      False
602      False
659      False
660      False
661      False
663      False
689      False
...

In [53]:
critical["Matrix"] = critical["MatrixPredVerb"]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  critical["Matrix"] = critical["MatrixPredVerb"]


In [52]:
critical.apply(lambda x: verb_label(x))

KeyError: 'MatrixPredVerb'
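
The `KeyError` is most likely due to `DataFrame.apply` defaulting to `axis=0`: the function then receives whole columns, which have no `'MatrixPredVerb'` label. One possible way to collapse the three columns without a row-wise function is to back-fill across columns. The sketch below uses a hypothetical three-row frame and assumes the intended priority is verb, then other, then particle.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with exactly one filled predicate column each.
toy = pd.DataFrame({
    "MatrixPredVerb": ["know", np.nan, np.nan],
    "MatrixPredOther": [np.nan, "sure", np.nan],
    "MatrixPredParticle": [np.nan, np.nan, "out"],
})
cols = ["MatrixPredVerb", "MatrixPredOther", "MatrixPredParticle"]

# Back-fill across columns, then take the first column: each row ends up
# with its first non-missing value in the order given by `cols`.
toy["Matrix"] = toy[cols].bfill(axis=1).iloc[:, 0]
print(toy["Matrix"].tolist())
```

Rows in which all three columns are missing stay `NaN`, which is easier to test for than the string `"nan"` produced by `astype(str)`.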

In [21]:
critical.columns

Index(['Item_ID', 'Sentence', 'HaveNeedTo', 'Finite', 'ModalPresent',
       'QuestionType', 'DegreeQ', 'SubjectAuxInv', 'WhAll', 'MatrixNegPresent',
       'EmbeddedNegPresent', 'SbarNomPresent', 'QuantifiedSubject',
       'QuantifiedPredicate', 'Wh', 'MatrixNegation', 'InvertedAuxVerb',
       'MatrixPredAux', 'MatrixPredVerb', 'MatrixPredOther',
       'MatrixPredParticle', 'MatrixPred2', 'Modal', 'EmbeddedNegation',
       'Verb1', 'Verb2', 'Verb3', 'DeterminerSubject', 'DeterminerNonSubject',
       'FullWhPhrase', 'JustMatrixClause', 'DeterminerSubjPresent',
       'DeterminerNonSubjPresent', 'WhNode', 'WhParse', 'Question',
       'SentenceParse', 'WhPhaseType', 'IdentityQ'],
      dtype='object')

# Combine contexts with constrained db

In [44]:
# read in df with contexts
cntxts = pd.read_csv("swbd_contexts.csv")

In [45]:
cntxts = cntxts.drop(columns="FollowingContext")

In [46]:
# get the item IDs from critical
crit_index = critical.Item_ID

### Merge back in Wh and ModalPresent columns

In [139]:
critical.groupby(['Wh','Finite'])['Wh'].count()

Wh     Finite
how    no         92
       yes       156
what   no         25
       yes       365
when   yes        21
where  no          5
       yes       122
who    no          2
       yes        53
why    yes        67
Name: Wh, dtype: int64

In [140]:
df_WhMod = critical[["Item_ID","Wh","ModalPresent","Finite","Question"]].rename(columns={"Item_ID": "TGrepID"})

In [141]:
# Subset to the items kept by the filter in the previous section.
#
# If the database file already contains the contexts, this step
# is not necessary.
df_valid = cntxts[cntxts["TGrepID"].isin(set(crit_index))]

In [185]:
# Merge
df_valid = df_valid.merge(df_WhMod, how = 'inner', indicator=False)

# Merge ModalPresent and Finiteness into one Modal Column

In [186]:
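modP = df_valid[df_valid.ModalPresent == "yes"].TGrepID
modNP = df_valid[df_valid.ModalPresent == "no"].TGrepID  # assumed definition; used in the next cell
fin = df_valid[df_valid.Finite == "yes"].TGrepID
nonfin = df_valid[df_valid.Finite == "no"].TGrepID  # assumed definition; used in the next cell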

In [187]:
mods = pd.concat([modP,nonfin])
nonmods = pd.concat([modNP,fin])

In [188]:
df_valid["Modal"] = ""

In [189]:
df_valid_mods = df_valid[df_valid["TGrepID"].isin(set(mods))].assign(Modal = "yes")
df_valid_nomods = df_valid[~df_valid["TGrepID"].isin(set(mods))].assign(Modal = "no")

In [190]:
df_valid = pd.concat([df_valid_mods,df_valid_nomods])

In [191]:
len(df_valid)

1073
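
The cells above appear to label an item as `Modal = "yes"` when it has a modal or is non-finite. Under that assumption, the same column could be built in one step with `numpy.where`, which also keeps the original row order. The frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical items covering all four ModalPresent/Finite combinations.
toy = pd.DataFrame({
    "TGrepID": ["a", "b", "c", "d"],
    "ModalPresent": ["yes", "no", "no", "yes"],
    "Finite": ["yes", "yes", "no", "no"],
})

# Assumed rule: an item counts as modal if it contains a modal
# or if the embedded clause is non-finite.
is_modal = (toy["ModalPresent"] == "yes") | (toy["Finite"] == "no")
toy["Modal"] = np.where(is_modal, "yes", "no")
print(toy)
```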

# Split EntireSentence on Question
This is necessary because only the question, and not the matrix clause, should be bolded in the experimental file.

In [199]:
# split EntireSentence
df_valid["Matrix"] = df_valid.apply(lambda x: x['EntireSentence'].replace(x['Question'],"").strip(),axis=1)

In [200]:
# split that last punctuation off, to be added back on in .js script
df_valid["punct"] = df_valid["Matrix"].apply(lambda x: x[-1:])

In [201]:
def split(word):
    return [char for char in word] 

In [202]:
# remove that final punct from the Matrix column (only the last character;
# str.replace would also blank out earlier occurrences of the same character)
df_valid["Matrix"] = df_valid["Matrix"].apply(lambda x: x[:-1] + ' ')
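
The split can be illustrated on a single sentence. The helper `split_matrix` below is hypothetical and differs from the cells above in two labeled ways: it removes only the first occurrence of the question, and it strips trailing whitespace instead of leaving a space.

```python
def split_matrix(entire_sentence, question):
    """Return (matrix, punct): the sentence without the question, and its final character."""
    # Remove only the first occurrence of the question text.
    rest = entire_sentence.replace(question, "", 1).strip()
    # Slicing is safe on empty strings, unlike rest[-1].
    return rest[:-1].rstrip(), rest[-1:]

matrix, punct = split_matrix("I wonder who is coming.", "who is coming")
print(repr(matrix), repr(punct))

# A sentence that consists only of the question leaves nothing behind.
empty_matrix, empty_punct = split_matrix("who came", "who came")
```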

# Add in the paraphrases
This should take as input the entire constrained dataframe from the above section, and then generate the paraphrases.

For who-questions: "Who is a person..." / "Who is some person..." / "Who is every person..." / "Who is the person..."

In [203]:
who = df_valid[df_valid["Wh"] == "who"]
where = df_valid[df_valid["Wh"] == "where"]
how = df_valid[df_valid["Wh"] == "how"]
when = df_valid[df_valid["Wh"] == "when"]
why = df_valid[df_valid["Wh"] == "why"]
what = df_valid[df_valid["Wh"] == "what"]

In [204]:
who["AResponse"] = "...who is a person..."
# who["SomeResponse"] = "Who is some person..."
who["AllResponse"] = "...who is every person..."
who["TheResponse"] = "...who is the person..."


where["AResponse"] = "...what is a place..."
# where["SomeResponse"] = "What is some place..."
where["AllResponse"] = "...what is every place..."
where["TheResponse"] = "...what is the place..."


how["AResponse"] = "...what is a way..."
# how["SomeResponse"] = "What is some way..."
how["AllResponse"] = "...what is every way..."
how["TheResponse"] = "...what is the way..."

when["AResponse"] = "...what is a time..."
# when["SomeResponse"] = "What is some time..."
when["AllResponse"] = "...what is every time..."
when["TheResponse"] = "...what is the time..."


why["AResponse"] = "...what is a reason..."
# why["SomeResponse"] = "What is some reason..."
why["AllResponse"] = "...what is every reason..."
why["TheResponse"] = "...what is the reason..."


what["AResponse"] = "...what is a thing..."
# what["SomeResponse"] = "What is some thing..."
what["AllResponse"] = "...what is every thing..."
what["TheResponse"] = "...what is the thing..."

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  who["AResponse"] = "...who is a person..."
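
The warning arises because `who`, `where`, and the other frames are slices of `df_valid`. One way to avoid it, sketched here under the assumption that the paraphrase depends only on the wh-word, is to work on a copy and derive the three columns from a lookup table. `NOUNS` and `add_paraphrases` are hypothetical names.

```python
import pandas as pd

# Noun used in the paraphrase for each wh-word (taken from the cell above).
NOUNS = {"who": "person", "where": "place", "how": "way",
         "when": "time", "why": "reason", "what": "thing"}

def add_paraphrases(frame):
    """Return a copy of `frame` with AResponse, AllResponse and TheResponse columns."""
    out = frame.copy()  # work on a copy, so no SettingWithCopyWarning
    noun = out["Wh"].map(NOUNS)
    # "who" questions keep "who"; all other wh-words are paraphrased with "what".
    pronoun = out["Wh"].map(lambda w: "who" if w == "who" else "what")
    for column, det in [("AResponse", "a"), ("AllResponse", "every"), ("TheResponse", "the")]:
        out[column] = "..." + pronoun + " is " + det + " " + noun + "..."
    return out

toy = pd.DataFrame({"Wh": ["who", "why"]})
result = add_paraphrases(toy)
print(result)
```

With a helper like this, the per-wh frames would not need to be split apart and concatenated again just to add the paraphrases.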

In [205]:
df_final = pd.concat([who,where,how,why,when,what])

In [22]:
len(df_final)

1073

# Controls

In [96]:
controls = pd.read_csv("../../experiments/clean_corpus/controls.csv")

In [206]:
# Add columns to make merging datasets easier
controls["Wh"] = "none"
controls["ModalPresent"] = "no"
controls["Question"] = controls["EntireSentence"]
controls["punct"] = ""
controls["Matrix"] = ""
controls["Modal"] = ""

In [207]:
controls = controls[["TGrepID","EntireSentence","PreceedingContext","Matrix","Question","Wh","Modal","punct","AResponse","AllResponse","TheResponse"]]

# Balancing factors

In [52]:
len(df_final)

1073

In [53]:
1073 - 30*35

23

In [210]:
# 35 lists of 30, 1 list of 23
30*35 + 23

1073

In [215]:
289 - (35*8)

9

In [216]:
35*8 + 1*9

289

## Modal Balancing
- Lists 1-35: 8
- List 36: 9

In [208]:
df_final.groupby(["Modal"])["Modal"].count()

Modal
no     784
yes    289
Name: Modal, dtype: int64

Modals

In [209]:
df_final.pivot_table(index=['Modal'], values="Question", aggfunc=len).groupby(["Modal"]).Question.transform(lambda x: x/len(df_final)).reset_index()

Unnamed: 0,Modal,Question
0,no,0.730662
1,yes,0.269338


In [207]:
df_final.pivot_table(index=['Wh'], values="Question", aggfunc=len).groupby(["Wh"]).Question.transform(lambda x: x/len(df_final)).reset_index()

Unnamed: 0,Wh,Question
0,how,0.269338
1,what,0.416589
2,when,0.025163
3,where,0.158434
4,who,0.053122
5,why,0.077353
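
To see how these proportions translate into items per list, they can be multiplied by the list size of 30. This is only a sketch of the arithmetic; the rounding to whole-number quotas is a manual decision that the code does not make.

```python
list_size = 30
# Proportions copied from the table above.
wh_share = {"how": 0.269338, "what": 0.416589, "when": 0.025163,
            "where": 0.158434, "who": 0.053122, "why": 0.077353}

# Expected number of items of each wh-word in a list of 30.
expected = {wh: round(share * list_size, 2) for wh, share in wh_share.items()}
print(expected)
```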


In [270]:
df_final = pd.concat([who,where,how,why,when,what])

In [254]:
len(df_final)

1073

# Generating random samples

## First Iteration

Lists 1-21 (21 lists of 30 items); per-list quota, then total across the 21 lists:
     1 when ....  21
     2 who .....  42
     2 why .....  42
     5 where ... 105
     8 how ..... 168
    12 what .... 252

Modals = 5

In [238]:
df_final.groupby(["Wh","Modal"])["Wh"].count()

Wh     Modal
how    no       156
       yes      133
what   no       365
       yes       82
when   no        21
       yes        6
where  no       122
       yes       48
who    no        53
       yes        4
why    no        67
       yes       16
Name: Wh, dtype: int64

In [224]:
df_final.groupby(["Modal"])["Modal"].count()

Modal
no     403
yes     70
Name: Modal, dtype: int64

In [352]:
df_final = pd.concat([who,where,how,why,when,what])

In [353]:
len(df_final)

1073

In [355]:
for n in range(1,22):
    mod_sample = df_final[df_final["Modal"] == "yes"].sample(8)

    i = len(mod_sample[mod_sample["Wh"] == "why"])
    j = len(mod_sample[mod_sample["Wh"] == "when"])
    k = len(mod_sample[mod_sample["Wh"] == "what"])
    l = len(mod_sample[mod_sample["Wh"] == "how"])
    m = len(mod_sample[mod_sample["Wh"] == "who"])
    o = len(mod_sample[mod_sample["Wh"] == "where"])
    df_final = df_final.drop(mod_sample.index)

    # Each list needs one "when" item; draw a non-modal one only if the
    # modal sample did not already include a "when" question.
    when_sample = pd.DataFrame(columns=df_final.columns)
    if j >= 1:
        print(f"when: {j}, {j}")
    else:
        ws = df_final[(df_final["Wh"]=="when") & (df_final["Modal"] == "no")].sample(1)
        when_sample = pd.concat([when_sample,ws])
        print(f"loop2 {len(when_sample)}, {j}")
    df_final = df_final.drop(when_sample.index)

    who_sample = df_final[
        (df_final["Wh"] == "who") &
        (df_final["Modal"] == "no")
    ].sample(2-m)
    print(f"who: {len(who_sample)}, {m}")
    df_final = df_final.drop(who_sample.index)  

#     who_sample = pd.DataFrame(columns=df_final.columns)
#     if m >= 2:
#         print(f"who: {j}, {j}")
#         who_sample
#     else:
#         ws = df_final[(df_final["Wh"]=="who") & (df_final["Modal"] == "no")].sample(2)
#         who_sample = pd.concat([who_sample,ws])
#         print(f"loop2 {len(who_sample)}, {j}")
#         who_sample
#     df_final = df_final.drop(who_sample.index)
    
#     why_sample = pd.DataFrame(columns=df_final.columns)
#     if j >= 1:
#         print(f"why: {j}, {j}")
#         why_sample
#     else:
#         ws = df_final[(df_final["Wh"]=="why") & (df_final["Modal"] == "no")].sample(1)
#         why_sample = pd.concat([why_sample,ws])
#         print(f"loop2 {len(why_sample)}, {j}")
#         why_sample
#     df_final = df_final.drop(why_sample.index)  
    
    why_sample = df_final[
        (df_final["Wh"] == "why") &
        (df_final["Modal"] == "no")
    ].sample(2-i)
    print(f"why: {len(why_sample)}, {i}")
    df_final = df_final.drop(why_sample.index)
    
    where_sample = df_final[
        (df_final["Wh"] == "where") &
        (df_final["Modal"] == "no")
    ].sample(5-o)
    print(f"where: {len(where_sample)}, {o}")
    df_final = df_final.drop(where_sample.index)

    how_sample = df_final[
        (df_final["Wh"] == "how") &
        (df_final["Modal"] == "no")
    ].sample(8-l)
    print(f"how: {len(how_sample)}, {l}")
    df_final = df_final.drop(how_sample.index)    
    
    what_sample = df_final[
        (df_final["Wh"] == "what") &
        (df_final["Modal"] == "no")
    ].sample(12-k)
    print(f"what: {len(what_sample)}, {k}")
    df_final = df_final.drop(what_sample.index)
    
    total = pd.concat([mod_sample,why_sample,when_sample,what_sample,how_sample,who_sample,where_sample,controls])
    print(f"total #{n} = {len(total)}")
    # save to file
    filename = f"../../experiments/clean_corpus/04_experiment/corpus_{n}.txt"
    total.to_csv(filename,header=True,sep="\t",index=False)

loop2 1, 0
who: 2, 0
why: 1, 1
where: 1, 4
how: 5, 3
what: 12, 0
total #1 = 36
loop2 1, 0
who: 2, 0
why: 2, 0
where: 4, 1
how: 3, 5
what: 10, 2
total #2 = 36
loop2 1, 0
who: 2, 0
why: 1, 1
where: 4, 1
how: 3, 5
what: 11, 1
total #3 = 36
loop2 1, 0
who: 2, 0
why: 2, 0
where: 4, 1
how: 1, 7
what: 12, 0
total #4 = 36
loop2 1, 0
who: 1, 1
why: 2, 0
where: 4, 1
how: 3, 5
what: 11, 1
total #5 = 36
loop2 1, 0
who: 2, 0
why: 1, 1
where: 3, 2
how: 5, 3
what: 10, 2
total #6 = 36
loop2 1, 0
who: 2, 0
why: 2, 0
where: 5, 0
how: 3, 5
what: 9, 3
total #7 = 36
when: 1, 1
who: 2, 0
why: 2, 0
where: 4, 1
how: 4, 4
what: 10, 2
total #8 = 36
loop2 1, 0
who: 2, 0
why: 1, 1
where: 3, 2
how: 4, 4
what: 11, 1
total #9 = 36
when: 1, 1
who: 2, 0
why: 1, 1
where: 5, 0
how: 6, 2
what: 8, 4
total #10 = 36
loop2 1, 0
who: 2, 0
why: 2, 0
where: 4, 1
how: 7, 1
what: 6, 6
total #11 = 36
loop2 1, 0
who: 2, 0
why: 2, 0
where: 5, 0
how: 6, 2
what: 6, 6
total #12 = 36
loop2 1, 0
who: 2, 0
why: 1, 1
where: 5, 0
how: 3, 5


In [356]:
1073 - 21*30

443

In [357]:
len(df_final)

443

In [348]:
df_final.groupby(["Wh","ModalPresent"])["Wh"].count()

Wh     ModalPresent
how    no              105
       yes              16
what   no              167
       yes              28
when   no                3
       yes               3
where  no               44
       yes              21
who    no               15
why    no               33
       yes               8
Name: Wh, dtype: int64

In [350]:
df_final.groupby(["ModalPresent"])["ModalPresent"].count()

ModalPresent
no     367
yes     76
Name: ModalPresent, dtype: int64

## Second Iteration

Modal 22-36 =  4

Lists 22-27 (6 lists of 30 items); per-list quota, then total across the 6 lists:
     1 when ....  6
     1 who .....  6
     2 why ..... 12
     5 where ... 30
     8 how ..... 48
    13 what .... 78

In [358]:
for n in range(22,28):
    mod_sample = df_final[df_final["Modal"] == "yes"].sample(8)

    i = len(mod_sample[mod_sample["Wh"] == "why"])
    j = len(mod_sample[mod_sample["Wh"] == "when"])
    k = len(mod_sample[mod_sample["Wh"] == "what"])
    l = len(mod_sample[mod_sample["Wh"] == "how"])
    m = len(mod_sample[mod_sample["Wh"] == "who"])
    o = len(mod_sample[mod_sample["Wh"] == "where"])
    df_final = df_final.drop(mod_sample.index)

    when_sample = pd.DataFrame(columns=df_final.columns)
    if j < 1:
        # no "when" item among the modals, so draw one from the remaining pool
        ws = df_final[(df_final["Wh"]=="when")].sample(1)
        when_sample = pd.concat([when_sample,ws])
    df_final = df_final.drop(when_sample.index)
    
    who_sample = df_final[
        (df_final["Wh"] == "who") &
        (df_final["Modal"] == "no")
    ].sample(1-m)
    df_final = df_final.drop(who_sample.index)    
    
    why_sample = df_final[
        (df_final["Wh"] == "why") &
        (df_final["Modal"] == "no")
    ].sample(2-i)
    df_final = df_final.drop(why_sample.index)
    
    where_sample = df_final[
        (df_final["Wh"] == "where") &
        (df_final["Modal"] == "no")
    ].sample(5-o)
    df_final = df_final.drop(where_sample.index)

    how_sample = df_final[
        (df_final["Wh"] == "how") &
        (df_final["Modal"] == "no")
    ].sample(8-l)
    df_final = df_final.drop(how_sample.index)    
    
    what_sample = df_final[
        (df_final["Wh"] == "what") &
        (df_final["Modal"] == "no")
    ].sample(13-k)
    df_final = df_final.drop(what_sample.index)
    
    total = pd.concat([mod_sample,when_sample,why_sample,what_sample,how_sample,who_sample,where_sample,controls])

    # save to file
    filename = f"../../experiments/clean_corpus/04_experiment/corpus_{n}.txt"
    total.to_csv(filename,header=True,sep="\t",index=False)

In [360]:
443 - 6*30

263

In [359]:
len(df_final)

263

In [361]:
df_final.groupby(["ModalPresent"])["ModalPresent"].count()

ModalPresent
no     227
yes     36
Name: ModalPresent, dtype: int64

In [120]:
df_final.groupby(["Wh"])["Wh"].count()

Wh
how       73
what     117
where     35
who        9
why       29
Name: Wh, dtype: int64

## Third Iteration

Lists 28-30 (3 lists of 30 items); per-list quota, then total across the 3 lists:
     0 when ....  0
     1 who .....  3
     3 why .....  9
     5 where ... 15
     9 how ..... 27
    12 what .... 36

In [362]:
for n in range(28,31):
    mod_sample = df_final[df_final["Modal"] == "yes"].sample(8)

    i = len(mod_sample[mod_sample["Wh"] == "why"])
    k = len(mod_sample[mod_sample["Wh"] == "what"])
    l = len(mod_sample[mod_sample["Wh"] == "how"])
    m = len(mod_sample[mod_sample["Wh"] == "who"])
    o = len(mod_sample[mod_sample["Wh"] == "where"])
    df_final = df_final.drop(mod_sample.index)

    who_sample = df_final[
        (df_final["Wh"] == "who") &
        (df_final["Modal"] == "no")
    ].sample(1-m)
    df_final = df_final.drop(who_sample.index)    
    
    why_sample = df_final[
        (df_final["Wh"] == "why") &
        (df_final["Modal"] == "no")
    ].sample(3-i)
    df_final = df_final.drop(why_sample.index)
    
    where_sample = df_final[
        (df_final["Wh"] == "where") &
        (df_final["Modal"] == "no")
    ].sample(5-o)
    df_final = df_final.drop(where_sample.index)

    how_sample = df_final[
        (df_final["Wh"] == "how") &
        (df_final["Modal"] == "no")
    ].sample(9-l)
    df_final = df_final.drop(how_sample.index)    
    
    what_sample = df_final[
        (df_final["Wh"] == "what") &
        (df_final["Modal"] == "no")
    ].sample(12-k)
    df_final = df_final.drop(what_sample.index)
    
    total = pd.concat([mod_sample,why_sample,what_sample,how_sample,who_sample,where_sample,controls])

    # save to file
    filename = f"../../experiments/clean_corpus/04_experiment/corpus_{n}.txt"
    total.to_csv(filename,header=True,sep="\t",index=False)

In [363]:
len(df_final)

173

In [364]:
263 - 30*3

173

In [365]:
df_final.groupby(["Wh","ModalPresent"])["Wh"].count()

Wh     ModalPresent
how    no              39
       yes              7
what   no              75
       yes              6
where  no              13
       yes              7
who    no               6
why    no              17
       yes              3
Name: Wh, dtype: int64

## Fourth Iteration

Lists 31-32 (2 lists of 30 items); per-list quota, then total across the 2 lists:
     0 when ....  0
     1 who .....  2
     3 why .....  6
     4 where ...  8
     9 how ..... 18
    13 what .... 26

In [366]:
for n in range(31,33):
    mod_sample = df_final[df_final["Modal"] == "yes"].sample(8)

    i = len(mod_sample[mod_sample["Wh"] == "why"])
    k = len(mod_sample[mod_sample["Wh"] == "what"])
    l = len(mod_sample[mod_sample["Wh"] == "how"])
    m = len(mod_sample[mod_sample["Wh"] == "who"])
    o = len(mod_sample[mod_sample["Wh"] == "where"])
    df_final = df_final.drop(mod_sample.index)

    who_sample = df_final[
        (df_final["Wh"] == "who") &
        (df_final["Modal"] == "no")
    ].sample(1-m)
    df_final = df_final.drop(who_sample.index)    
    
    why_sample = df_final[
        (df_final["Wh"] == "why") &
        (df_final["Modal"] == "no")
    ].sample(3-i)
    df_final = df_final.drop(why_sample.index)
    
    where_sample = df_final[
        (df_final["Wh"] == "where") &
        (df_final["Modal"] == "no")
    ].sample(4-o)
    df_final = df_final.drop(where_sample.index)

    how_sample = df_final[
        (df_final["Wh"] == "how") &
        (df_final["Modal"] == "no")
    ].sample(9-l)
    df_final = df_final.drop(how_sample.index)    
    
    what_sample = df_final[
        (df_final["Wh"] == "what") &
        (df_final["Modal"] == "no")
    ].sample(13-k)
    df_final = df_final.drop(what_sample.index)
    
    total = pd.concat([mod_sample,why_sample,what_sample,how_sample,who_sample,where_sample,controls])

    # save to file
    # same output directory as the other lists (was 03_experiment)
    filename = f"../../experiments/clean_corpus/04_experiment/corpus_{n}.txt"
    total.to_csv(filename,header=True,sep="\t",index=False)

In [367]:
len(df_final)

113

In [368]:
173 - 30*2

113

In [369]:
df_final.groupby(["Wh"])["Wh"].count()

Wh
how      28
what     55
where    12
who       4
why      14
Name: Wh, dtype: int64

## Fifth Iteration

List 33 (1 list of 30 items); per-list quota, then total:
     0 when ....  0
     1 who .....  1
     3 why .....  3
     5 where ...  5
     7 how .....  7
    14 what .... 14

In [370]:
for n in range(33,34):
    mod_sample = df_final[df_final["Modal"] == "yes"].sample(8)

    i = len(mod_sample[mod_sample["Wh"] == "why"])
    k = len(mod_sample[mod_sample["Wh"] == "what"])
    l = len(mod_sample[mod_sample["Wh"] == "how"])
    m = len(mod_sample[mod_sample["Wh"] == "who"])
    o = len(mod_sample[mod_sample["Wh"] == "where"])
    df_final = df_final.drop(mod_sample.index)

    who_sample = df_final[
        (df_final["Wh"] == "who") &
        (df_final["Modal"] == "no")
    ].sample(1-m)
    df_final = df_final.drop(who_sample.index)    
    
    why_sample = df_final[
        (df_final["Wh"] == "why") &
        (df_final["Modal"] == "no")
    ].sample(3-i)
    df_final = df_final.drop(why_sample.index)
    
    where_sample = df_final[
        (df_final["Wh"] == "where") &
        (df_final["Modal"] == "no")
    ].sample(5-o)
    df_final = df_final.drop(where_sample.index)

    how_sample = df_final[
        (df_final["Wh"] == "how") &
        (df_final["Modal"] == "no")
    ].sample(7-l)
    df_final = df_final.drop(how_sample.index)    
    
    what_sample = df_final[
        (df_final["Wh"] == "what") &
        (df_final["Modal"] == "no")
    ].sample(14-k)
    df_final = df_final.drop(what_sample.index)
    
    total = pd.concat([mod_sample,why_sample,what_sample,how_sample,who_sample,where_sample,controls])

    # save to file
    filename = f"../../experiments/clean_corpus/04_experiment/corpus_{n}.txt"
    total.to_csv(filename,header=True,sep="\t",index=False)

In [371]:
len(df_final)

83

In [373]:
113 - 1*30

83

## Sixth Iteration

Lists 34-35 (2 lists of 30 items); per-list quota, then total across the 2 lists:
     0 when ....  0
     1 who .....  2
     3 why .....  6
     3 where ...  6
     7 how ..... 14
    16 what .... 32

In [374]:
for n in range(34,36):
    mod_sample = df_final[df_final["Modal"] == "yes"].sample(8)

    i = len(mod_sample[mod_sample["Wh"] == "why"])
    k = len(mod_sample[mod_sample["Wh"] == "what"])
    l = len(mod_sample[mod_sample["Wh"] == "how"])
    m = len(mod_sample[mod_sample["Wh"] == "who"])
    o = len(mod_sample[mod_sample["Wh"] == "where"])
    df_final = df_final.drop(mod_sample.index)

    who_sample = df_final[
        (df_final["Wh"] == "who") &
        (df_final["Modal"] == "no")
    ].sample(1-m)
    df_final = df_final.drop(who_sample.index)    
    
    why_sample = df_final[
        (df_final["Wh"] == "why") &
        (df_final["Modal"] == "no")
    ].sample(3-i)
    df_final = df_final.drop(why_sample.index)
    
    
#     where_sample = pd.DataFrame(columns=df_final.columns)
#     if j >= 1:
#         print(f"when: {j}, {j}")
#         # make an empty dataframe
#         WS2 = df_final[(df_final["Wh"] == "why") & (df_final["Modal"] == "no")].sample(3)
#         where_sample = pd.concat([where_sample,WS2])
#     else:
#         where_sample = pd.concat([where_sample,df_final[df_final["Modal"] == "no"].sample(3)])
# #         print(f"loop2 {len(when_sample)}, {j}")
#         where_sample
#     df_final = df_final.drop(when_sample.index)
    
    where_sample = df_final[
        (df_final["Wh"] == "where") &
        (df_final["Modal"] == "no")
    ].sample(3-o)
    df_final = df_final.drop(where_sample.index)

    how_sample = df_final[
        (df_final["Wh"] == "how") &
        (df_final["Modal"] == "no")
    ].sample(7-l)
    df_final = df_final.drop(how_sample.index)    
    
    what_sample = df_final[
        (df_final["Wh"] == "what") &
        (df_final["Modal"] == "no")
    ].sample(16-k)
    df_final = df_final.drop(what_sample.index)
    
    total = pd.concat([mod_sample,why_sample,what_sample,how_sample,who_sample,where_sample,controls])

    # save to file
    filename = f"../../experiments/clean_corpus/04_experiment/corpus_{n}.txt"
    total.to_csv(filename,header=True,sep="\t",index=False)

In [375]:
len(df_final)

23

In [376]:
83 - 2*30

23

## Final Set

List 36 (the remaining 23 items):
    0 when
    1 who
    5 why
    1 where
    7 how
    9 what

In [377]:
last = pd.concat([df_final,controls])

In [378]:
last.to_csv("../../experiments/clean_corpus/04_experiment/corpus_36.txt",header=True,sep="\t",index=False)
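
The running subtractions above confirm the list sizes. A complementary check is that every item ends up in exactly one list. The sketch below works on plain ID lists; `check_partition` and the example IDs are hypothetical, and in practice the IDs would come from the `TGrepID` column of each saved list.

```python
def check_partition(lists, all_ids):
    """Return True if every ID in `all_ids` occurs in exactly one of `lists`."""
    seen = [item for one_list in lists for item in one_list]
    no_duplicates = len(seen) == len(set(seen))
    complete = set(seen) == set(all_ids)
    return no_duplicates and complete

ids = ["a", "b", "c", "d", "e"]
ok = check_partition([["a", "b"], ["c", "d"], ["e"]], ids)         # valid split
overlap = check_partition([["a", "b"], ["b", "c"], ["d", "e"]], ids)  # "b" twice
missing = check_partition([["a", "b"], ["c"]], ids)                # "d", "e" absent
print(ok, overlap, missing)
```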

# Pilot Samples

In [40]:
df_final.pivot_table(index=['Wh'], values="EntireSentence", aggfunc=len).groupby(["Wh"]).EntireSentence.transform(lambda x: x/len(df_final)).reset_index()

Unnamed: 0,Wh,EntireSentence
0,how,0.256433
1,what,0.411713
2,when,0.023957
3,where,0.150843
4,which,0.032831
5,who,0.050577
6,why,0.073647


In [164]:
eq_pilot = df_final.sample(10,random_state=666)

In [166]:
eqp = pd.concat([eq_pilot,controls])

In [167]:
eqp.to_csv("../../experiments/clean_corpus/04_experiment/pilot.txt",header=True,sep="\t",index=False)

In [125]:
mod_sample = df_final[df_final["ModalPresent"] == "yes"].sample(1)

i = len(mod_sample[mod_sample["Wh"] == "why"])
j = len(mod_sample[mod_sample["Wh"] == "when"])
k = len(mod_sample[mod_sample["Wh"] == "what"])
l = len(mod_sample[mod_sample["Wh"] == "how"])
m = len(mod_sample[mod_sample["Wh"] == "who"])
n = len(mod_sample[mod_sample["Wh"] == "where"])
w = len(mod_sample[mod_sample["Wh"] == "which"])  # separate name, so the "where" count in n is not overwritten
df_final = df_final.drop(mod_sample.index)

why_sample = df_final[
    (df_final["Wh"] == "why") &
    (df_final["ModalPresent"] == "no")
                     ].sample(1-i)
df_final = df_final.drop(why_sample.index)

when_sample = df_final[
    (df_final["Wh"] == "when") &
    (df_final["ModalPresent"] == "no")].sample(1-j)
df_final = df_final.drop(when_sample.index)

what_sample = df_final[
    (df_final["Wh"] == "what") &
    (df_final["ModalPresent"] == "no")].sample(5-k)
df_final = df_final.drop(what_sample.index)

how_sample = df_final[
    (df_final["Wh"] == "how") &
    (df_final["ModalPresent"] == "no")
                     ].sample(1-l)
df_final = df_final.drop(how_sample.index)

who_sample = df_final[
    (df_final["Wh"] == "who") &
    (df_final["ModalPresent"] == "no")].sample(1-m)
df_final = df_final.drop(who_sample.index)

where_sample = df_final[
    (df_final["Wh"] == "where") &
    (df_final["ModalPresent"] == "no")].sample(1-n)
df_final = df_final.drop(where_sample.index)


In [128]:
total = pd.concat([mod_sample,why_sample,when_sample,what_sample,how_sample,who_sample,where_sample,controls])

# save to file


In [129]:
total.to_csv("../../experiments/clean_corpus/04_experiment/pilot.txt",header=True,sep="\t",index=False)