# Corpus Processing
This notebook is for miscellaneous processing of the swbd.tab database file.

In [1]:
import pandas as pd
import numpy as np

## Contents
1. [Create separate file with contexts](#Create-separate-file-with-contexts)
2. [Create separate file without contexts](#Create-separate-file-without-contexts)
3. [Creating the files for the experiment](#Creating-the-files-for-the-experiment)
4. [Automate Paraphrase Generator](#Automate-Paraphrase-Generator)


In [6]:
# import the database file from the TGrep2 searching
df = pd.read_csv("results/swbd.tab", sep='\t', engine='python')

# import the template file just to take a look at it
# df_template = pd.read_csv("~/Downloads/korpus3_0.txt", names=['TGrepID','EntireSentence','context','BestResponse'], skiprows=1, sep='\t')

In [386]:
# New run
df.groupby("QuestionType")["QuestionType"].count()

QuestionType
adjunct         764
embadjunct     2309
embedded       3018
exclamation      31
fragment        126
relative       1378
root           1839
subject         734
Name: QuestionType, dtype: int64

In [387]:
# Old run for comparison
df.groupby("QuestionType")["QuestionType"].count()

QuestionType
adjunct         764
embadjunct     2309
embedded       3018
exclamation      31
fragment        126
relative       1378
root           1839
subject         734
Name: QuestionType, dtype: int64

In [96]:
# This makes the display show more info
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

In [97]:
df_template.iloc[0]

TGrepID

df.head()

In [277]:
df.loc[df["Item_ID"] == "331:188"].Sentence

13    but, uh, i, i, i think that, you know, we always, uh, i mean, i've, i've had a lot of good experiences with, uh, with many many people especially where they've had, uh, extended family *t*-1.
Name: Sentence, dtype: object

In [350]:
# Embedded questions with matrix negation?
len(df.loc[
    (df["QuestionType"] == "embedded")
    &
    (df["MatrixNegPresent"] == "yes")
#     &
#     (df.Sentence.str.contains("does not|don't|doesn't"))
])

828

In [347]:
# all good, these are not spurious misses

missing = df.loc[
    (df["QuestionType"] == "embedded")
    &
    (df["EmbeddedNegPresent"] != "yes")
    &
    (df["MatrixNegPresent"] != "yes")
    &
    (df.Sentence.str.contains("does not|don't|doesn't"))
]

In [349]:
len(missing)

17
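The cross-check above can be reproduced on a toy frame. This is only a sketch: the rows are invented, and `na=False` is an addition not in the original cell, there so that a missing `Sentence` counts as "no match" instead of putting NaN into the boolean mask.

```python
import pandas as pd

# Toy stand-in for df; the rows are invented for illustration only.
toy = pd.DataFrame({
    "QuestionType": ["embedded", "embedded", "root", "embedded", "embedded"],
    "MatrixNegPresent": ["yes", "no", "no", "no", "no"],
    "EmbeddedNegPresent": ["no", "no", "no", "yes", "no"],
    "Sentence": [
        "i don't know who he is",
        "he doesn't care, but i wonder where it went",
        "where don't you go",
        "i wonder who doesn't like it",
        None,  # missing sentence
    ],
})

# Surface negation strings that the annotation columns might have missed.
neg_pattern = r"does not|don't|doesn't"

unflagged = toy.loc[
    (toy["QuestionType"] == "embedded")
    & (toy["EmbeddedNegPresent"] != "yes")
    & (toy["MatrixNegPresent"] != "yes")
    & toy["Sentence"].str.contains(neg_pattern, na=False)
]
print(unflagged.index.tolist())
```

Only the second row survives here: it is embedded, neither negation column is "yes", and the sentence still contains a negation string, which is the kind of row that has to be checked by hand.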

## Create separate file with contexts

In [175]:
contexts_df = df[["Item_ID","Sentence", "Prev-Context","FollowingContext"]].rename(columns={"Item_ID":"TGrepID","Sentence":"EntireSentence","Prev-Context":"PreceedingContext"})

In [99]:
contexts_df.head(2)

Unnamed: 0,TGrepID,EntireSentence,PreceedingContext,FollowingContext
0,3:43,"uh, first, um, i need *-1 to know, uh, how do you feel *t*-2 about, uh, about * sending, uh, an elderly, uh, family member to a nursing home?",###<none>###<none>###<none>###<none>###<none>###<none>###<none>###<none>###speakera1.###okay.,"###speakerb2.###well, of course, it's, you know, it's one of the last few things in the world 0 you'd ever want *-1 to do *t*-2, you know. unless it's just, you know, really, you know, and, uh, for their, uh, you know, for their own good.###speakera3.###yes.###yeah.###speakerb4.###i'd be very very careful and, uh, you know, *-1 checking them out.###uh, our,###* had *-1 t-, place my mother in a nursing home.###she had a rather massive stroke about, uh, about uh, eight months ago i guess."
1,17:77,"and, uh, we were, i was fortunate in that i was personally acquainted with the, uh, people who, uh, *t*-1 ran the nursing home in our little hometown.","###yes.###yeah.###speakerb4.###i'd be very very careful and, uh, you know, *-1 checking them out.###uh, our,###* had *-1 t-, place my mother in a nursing home.###she had a rather massive stroke about, uh, about uh, eight months ago i guess.###speakera5.###uh-huh.###speakerb6.","###speakera7.###yeah.###speakerb8.###so, i was very comfortable, you know, in *-1 doing it when it got to the point that we had *-2 to do it *t*-3 *t*-4.###but there's,###well, i had an occasion for my mother-in-law who *t*-1 had fell and needed * to be, you know, could not take care of herself anymore, was confined *-2 to a nursing home for a while###that was really not a very good experience.###uh, it had *-1 to be done *-2 in a hurry.###i mean, we didn't have, you know, like six months *-1 to check all of these places out.###and it was really not, not very good, uh, deal."


In [173]:
contexts_df.to_csv("swbd_contexts.csv", header=True, index=False)

In [178]:
contexts_df[contexts_df["TGrepID"]=="1112:41"]

Unnamed: 0,TGrepID,EntireSentence,PreceedingContext,FollowingContext
65,1112:41,"well, i mean, i, i wonder how people have sex *t*-1, and things like that,","### and which *t*-1 isn't of, always the case in all cultures,###and it wasn't until i was thinking about it just now that i realized 0 that's actually something that *t*-1's culturally relative.###speakerb90.### tha-, that is true.###i haven't thought about that### and, and that it *exp*-2's fascinating * to, to think a lot of someone who *t*-1 doesn't know how *-3 to say private *t*-4.###speakera91.###uh-huh,###uh-huh.###and that's really, um,","###speakerb92.###speakera93.###i mean, they a-,,###you go to india###and it's obvious, you know, the results of sex are quite obvious,, as the population goes up an extra hundred million every few years.###um, but i, i just don't quite, um,###there's hope,###i actually###for al-, of the time 0 i've spent *t*-1 there, i still don't quite understand how certain things that i assume 0 *t*-2 and require privacy and require not just that you be alone but actually that you have a sense of privacy *t*-3.###speakerb94."


How question that's really a why question...

## Create separate file without contexts

In [168]:
nocontx_df = df.drop(columns=["Prev-Context","FollowingContext"]).reset_index()

In [172]:
nocontx_df[nocontx_df["Item_ID"]=="1112:41"]

Unnamed: 0,index,Item_ID,Sentence,HaveNeedTo,Finite,ModalPresent,QuestionType,DegreeQ,SubjectAuxInv,WhAll,...,DeterminerSubject,DeterminerNonSubject,FullWhPhrase,DeterminerSubjPresent,DeterminerNonSubjPresent,WhNode,WhParse,Question,SentenceParse,WhPhaseType
65,65,1112:41,"well, i mean, i, i wonder how people have sex *t*-1, and things like that,",no,yes,no,embedded,no,,,...,,things like that,,no,no,,(WRB how),"how people have sex *t*-1, and things like that","(TOP (S (INTJ (UH well)) (PRN (, ,) (S (NP-SBJ (PRP I)) (VP (VBP mean))) (, ,)) (EDITED (RM (-DFL- \[)) (NP-SBJ (PRP I)) (, ,) (IP (-DFL- \+))) (NP-SBJ (PRP I)) (RS (-DFL- \])) (VP (VBP wonder) (SBAR (WHADVP-1 (WRB how)) (S (NP-SBJ (NNS people)) (VP (VP (VBP have) (NP (NN sex)) (ADVP (-NONE- *T*-1))) (, ,) (CC and) (NP-ETC (NP (NNS things)) (PP (IN like) (NP (DT that)))))))) (, ,) (-DFL- E_S)))",monomorphemic


In [169]:
nocontx_df.to_csv("swbd_nocntxt.csv", header=True, index=False)

## Looking around at the data

In [129]:
df_crit = nocontx_df[nocontx_df.QuestionType.isin(["root","embedded"])]

In [151]:
df_crit.Sentence.iloc[500]

"it's funny because i'm, i was in the process of * filling one out when i decided 0 i would make this phone call *t*-2,"

Hard to just insert a "some" or an "all" consistently in the same place for each question...
Unless it was done by hand. But even then, the two quantifiers wouldn't be in the same place.

- \* what all day was that on?
- \* what some day was that on?

In [152]:
df_crit.Sentence.iloc[600]

"i can't remember what they said 0 his name was *t*-1."

In [153]:
# df_crit.sample(n=1).Sentence

3908    strangely enough, uh, their mother and i both smoked when they were growing up *t*-1.
Name: Sentence, dtype: object

In [155]:
df_crit.loc[3908]

index                                                                                                                                                                                                                                                                                                                                             3908
Item_ID                                                                                                                                                                                                                                                                                                                                       68051:34
Sentence                                                                                                                                                                                                                                                         strangely enough, uh, their mother and i both smoked when

In [156]:
df_crit.sample(n=1).Sentence

6913    and which is the cat *t*-2?
Name: Sentence, dtype: object

### Here's a good one:

all [things going on] vs. some [things going on]

In [158]:
# df_crit.sample(n=1).Sentence

5758    and we can't see her often enough *-1 to really know what *t*-2's going on.
Name: Sentence, dtype: object

In [357]:
# df_crit.loc[5758]

In [160]:
# df_crit.sample(n=1).Sentence

4976    and that's, that sometimes means 0 you can't do gourmet, because depending on who your guests are *t*-1.
Name: Sentence, dtype: object

all of the guests vs. some of the guests

In [161]:
df_crit_emb = df_crit[df_crit.QuestionType == "embedded"]

In [351]:
# df_crit_emb.Sentence

In [352]:
# df_crit.sample(n=1).Sentence
# this one is nonsensical
df.loc[df["Item_ID"] == "162357:108"].Sentence

9148    i don't know what kind of, i don't know what g m corporate, kind of hit the, i don't know what kind of hit they take *t*-1 on it,
Name: Sentence, dtype: object

In [83]:
# df_crit.sample(n=1).Sentence
df_crit.loc[7272].Sentence

"but, uh, you know, that's, i guess 0 that's one of the things 0 you got *-1 to put up with *t*-2 when you don't have a dress code *t*-3."

In [84]:
df_crit.sample(n=1).Sentence

2800    you couldn't be inconspicuous, when you walked into a store stuff like that *t*-1,
Name: Sentence, dtype: object

In [85]:
df_crit.sample(n=1).Sentence

3551    if we, if we keep *-1 putting that stuff into the air and, and, you know, if we keep *-2 creating the problem and not doing anything about it that it's really going *-3 to be a problem for, um, just the, the earth, you know, what the earth is receiving *t*-4 back, you know, because how can you tell where it's going *-5 to come down at *t*-6 *t*-7.
Name: Sentence, dtype: object

In [86]:
df_crit.loc[3551]

index

In [141]:
df_crit_nowhen = df_crit[df_crit["Wh"] != "when"]

In [91]:
len(df_crit_nowhen)/len(df_crit)

0.8099683782562868

In [122]:
df_crit_nowhen.loc[148].Sentence

"uh, one of its great slogans is 0 if you're not serving the customer, you better be serving someone who *t*-1 is *?*."

In [123]:
df_crit_nowhen.loc[148]

index                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       148
Item_ID                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 2316:69
Sentence

In [110]:
# df_crit_nowhen.loc[148].Sentence

"uh, one of its great slogans is 0 if you're not serving the customer, you better be serving someone who *t*-1 is *?*."

In [93]:
df_crit_nowhen.loc[148].SentenceParse
# this should be a relative clause

"(TOP (S (INTJ (UH Uh)) (, ,) (NP-SBJ (NP (CD one)) (PP (IN of) (NP (PRP$ its) (JJ great) (NNS slogans)))) (VP (VBZ is) (SBAR-PRD (-NONE- 0) (S (SBAR-ADV (IN if) (S (NP-SBJ (PRP you)) (VP (VBP 're) (RB not) (VP (VBG serving) (NP (DT the) (NN customer)))))) (, ,) (NP-SBJ (PRP you)) (ADVP (RBR better)) (VP (VB be) (VP (VBG serving) (NP (NP (NN someone)) (SBAR (WHNP-1 (WP who)) (S (NP-SBJ (-NONE- *T*-1)) (VP (VBZ is) (NP-PRD (-NONE- *?*))))))))))) (. .) (-DFL- E_S)))"

In [90]:
df_crit_nowhen.sample(n=1).Sentence

148    uh, one of its great slogans is 0 if you're not serving the customer, you better be serving someone who *t*-1 is *?*.
Name: Sentence, dtype: object

In [143]:
df.loc[df["Item_ID"]=="2316:69"]

Unnamed: 0,Item_ID,Sentence,HaveNeedTo,Finite,ModalPresent,QuestionType,DegreeQ,SubjectAuxInv,WhAll,QuantifiedSubject,...,FullWhPhrase,DeterminerSubjPresent,DeterminerNonSubjPresent,WhNode,WhParse,Question,SentenceParse,WhPhaseType,Prev-Context,FollowingContext
148,2316:69,"uh, one of its great slogans is 0 if you're not serving the customer, you better be serving someone who *t*-1 is *?*.",no,yes,no,subject,no,,,,...,who,no,no,,(WP who),who *t*-1 is *?*,"(TOP (S (INTJ (UH Uh)) (, ,) (NP-SBJ (NP (CD one)) (PP (IN of) (NP (PRP$ its) (JJ great) (NNS slogans)))) (VP (VBZ is) (SBAR-PRD (-NONE- 0) (S (SBAR-ADV (IN if) (S (NP-SBJ (PRP you)) (VP (VBP 're) (RB not) (VP (VBG serving) (NP (DT the) (NN customer)))))) (, ,) (NP-SBJ (PRP you)) (ADVP (RBR better)) (VP (VB be) (VP (VBG serving) (NP (NP (NN someone)) (SBAR (WHNP-1 (WP who)) (S (NP-SBJ (-NONE- *T*-1)) (VP (VBZ is) (NP-PRD (-NONE- *?*))))))))))) (. .) (-DFL- E_S)))",monomorphemic,"###i do.###speakera21.###speakerb22.### we, uh, we have these classes 0 we attend *t*-1, uh, management classes### and, and they give you books### and, and the last book, uh, matter of fact 0 i read *t*-1 was, at america's service by carl albrecht.###it talks about, uh, who the customer is *t*-1 and * being customer oriented, uh, which *t*-2 falls in line with the t i culture here at texas instruments.###speakera23.###yeah.###speakerb24.","###speakera25.###uh-huh.###speakerb26.###uh, so that's all in self improvement * to stay focused on who the customer is *t*-1###and as you probably well know, all of us are our own customer.###you're my customer, i'm your customer, sort of thing.###speakera27.###right.###speakerb28.###um, every now and then i'm loaned *-3 a tape 0 i can stick *t*-1 in the, uh, in the car cassette set on the way home * to make the drive more enjoyable, talking about, uh, better outlooks on things and the philosophy of, of pat hagerty and these kind of, uh, mind stimulating philosophy type. which *t*-2 all, you know, betters yourself."


# Creating the files for the experiment

## Constrain dataset
for experimental mock-up

First we have to restrict the dataset to the questions that we want to include:
1. only root (non-embedded) questions
2. no degree questions
3. no identity questions
4. generally only monomorphemic wh-phrases
5. only who-, where-, and how-questions

In [231]:
df.WhPhaseType.values

array(['monomorphemic', 'monomorphemic', 'monomorphemic', ...,
       'monomorphemic', 'monomorphemic', 'monomorphemic'], dtype=object)

In [7]:
df_root = df[df.QuestionType=="root"]
df_emb = df[df.QuestionType=="embedded"]

In [8]:
critical = df[(df['QuestionType'] == 'root') # only root questions
              & 
              (df['DegreeQ'] == 'no' ) # no degree questions
              &
              (df['IdentityQ'] == "no") # no identity questions
              &
              (df['Wh'].isin(['how','How','where','Where','who','Who'])) # just these wh-words
              &
              (df['WhPhaseType'] == 'monomorphemic') # monomorphemic wh-phrases only (might catch anything not excluded by DegreeQ)
             ]
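A possible variant of the filter above, shown on a toy frame with invented rows: lowercasing `Wh` before the membership test avoids listing each capitalization separately. The column names come from the cell above; `toy`, `wh_words` and `critical_toy` are illustrative names only.

```python
import pandas as pd

# Toy stand-in for df; the rows are invented for illustration only.
toy = pd.DataFrame({
    "QuestionType": ["root", "root", "embedded", "root", "root"],
    "DegreeQ": ["no", "no", "no", "yes", "no"],
    "IdentityQ": ["no", "no", "no", "no", "yes"],
    "Wh": ["How", "who", "where", "how", "Where"],
    "WhPhaseType": ["monomorphemic"] * 5,
})

wh_words = {"how", "where", "who"}

mask = (
    (toy["QuestionType"] == "root")               # only root questions
    & (toy["DegreeQ"] == "no")                    # no degree questions
    & (toy["IdentityQ"] == "no")                  # no identity questions
    & toy["Wh"].str.lower().isin(wh_words)        # case-insensitive wh-word match
    & (toy["WhPhaseType"] == "monomorphemic")     # monomorphemic wh-phrases only
)
critical_toy = toy[mask]
print(critical_toy.index.tolist())
```

Rows 2, 3 and 4 are dropped for being embedded, a degree question, and an identity question respectively.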

In [12]:
# Percentage of the root questions that survive the filter
len(critical)/len(df_root)*100

19.488074461896453

## Automate Paraphrase Generator
This should take the entire constrained dataframe from the section above as input and then generate the paraphrases.

For who-questions: "Who is a person...?" / "Who is some person...?" / "Who is every person...?" / "Who is the person...?"

In [13]:
# read in df with contexts
cntxts = pd.read_csv("swbd_contexts.csv")

In [14]:
cntxts = cntxts.drop(columns="FollowingContext")

In [15]:
cntxts.head(1)

Unnamed: 0,TGrepID,EntireSentence,PreceedingContext
0,3:43,"uh, first, um, i need *-1 to know, uh, how do ...",###<none>###<none>###<none>###<none>###<none>#...


In [16]:
# get the item IDs from critical
crit_index = critical.Item_ID

335

In [60]:
# subset to just the items that were filtered in the previous section

# otherwise, if using the database file with contexts directly in there, then this step
# is not necessary
df_valid = cntxts[cntxts["TGrepID"].isin(set(crit_index))]
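If the annotation columns are also needed next to the contexts, a merge on the ID columns is one alternative to the `isin` subset. The sketch below uses toy frames; the two sentences are taken from the data, but the context strings and the `critical_toy`/`cntxts_toy` names are placeholders. `validate="one_to_one"` is an optional safety check that raises if an ID is duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for `critical` and `cntxts`; context strings are placeholders.
critical_toy = pd.DataFrame({
    "Item_ID": ["1721:4", "4134:4"],
    "Wh": ["how", "how"],
})
cntxts_toy = pd.DataFrame({
    "TGrepID": ["3:43", "1721:4", "4134:4"],
    "EntireSentence": [
        "placeholder sentence",
        "how do you feel *t*-1 about rap music?",
        "how did you hear,",
    ],
    "PreceedingContext": ["ctx-a", "ctx-b", "ctx-c"],
})

# Inner merge keeps only the IDs present in both frames.
merged = critical_toy.merge(
    cntxts_toy,
    left_on="Item_ID",
    right_on="TGrepID",
    how="inner",
    validate="one_to_one",
)
print(merged[["TGrepID", "Wh", "EntireSentence"]])
```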

In [63]:
who = df_valid[df_valid["EntireSentence"].str.contains("Who|who")]
where = df_valid[df_valid["EntireSentence"].str.contains("Where|where")]
how = df_valid[df_valid["EntireSentence"].str.contains("How|how")]

In [64]:
df_valid.head()

Unnamed: 0,TGrepID,EntireSentence,PreceedingContext
95,1721:4,how do you feel *t*-1 about rap music?,"###there's such a wide selection,### i think 0..."
189,2936:24,"oh, rea-, where *ich*-2 did you go to school i...","###speakerb58.###so, that, uh, you, you became..."
246,4134:4,"how did you hear,","###do you work for t i?###speakera90.###no,###..."
247,4140:4,how did i hear about it *t*-1?,###really?###speakera92.###yes.###speakerb93.#...
256,4270:11,"so, and how do you feel about it *t*-1?.","###yeah,###so. i'm, i'm an avid aerobics, uh, ..."


In [65]:
who["AResponse"] = "Who is a person?"
who["SomeResponse"] = "Who is some person?"
who["AllResponse"] = "Who is every person?"
who["TheResponse"] = "Who is the person?"


where["AResponse"] = "Where is a place?"
where["SomeResponse"] = "Where is some place?"
where["AllResponse"] = "Where is every place?"
where["TheResponse"] = "Where is the place?"


how["AResponse"] = "What is a way?"
how["SomeResponse"] = "What is some way?"
how["AllResponse"] = "What is every way?"
how["TheResponse"] = "What is the way?"

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  who["AResponse"] = "Who is a person?"
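The warning above appears because `who`, `where` and `how` are boolean-mask slices of `df_valid`, so pandas cannot tell whether the new columns are meant for the slice or for the parent frame. One common way to avoid it, sketched here on a toy frame with two sentences from the data, is to take an explicit `.copy()` of each slice before adding columns. The `_toy` names are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_valid.
df_valid_toy = pd.DataFrame({
    "TGrepID": ["54662:4", "174870:4"],
    "EntireSentence": [
        "who's your favorite actress or actor *t*-1?",
        "where do you work *t*-1?",
    ],
})

# .copy() gives an independent frame, so the assignment below is unambiguous.
who_toy = df_valid_toy[df_valid_toy["EntireSentence"].str.contains("Who|who")].copy()
who_toy["AResponse"] = "Who is a person?"

print(who_toy)
```

The parent frame is left untouched: `df_valid_toy` does not gain an `AResponse` column.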

In [67]:
df_final = pd.concat([who,where,how])

There should be a way to automate that whole thing in one definition, but I'm failing to do that correctly.

In [51]:
# def paramap(data):
#     if data["EntireSentence"].str.contains("who|Who"):
#         data["AResponse"] == "Who is a person?"
#         data["SomeResponse"] == "Who is some person?"
#         data["AllResponse"] == "Who is every person?"
#         data["TheResponse"] == "Who is the person?"
    #         return
    #     if df["EntireSentence"].str.contains(["where|Where"]):
    #         return "places"
    #     if df["EntireSentence"].str.contains(["how|How"]):
    #         return "ways"

In [49]:
# df_para = df_valid.apply(paramap)

KeyError: 'EntireSentence'
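The `KeyError` is what `DataFrame.apply` produces with its default `axis=0`: the function receives each column as a Series, so `data["EntireSentence"]` is looked up among the row labels. The attempt above also uses `==` (comparison) where `=` (assignment) is intended, and calls `.str.contains` with a list. Below is one possible sketch of a single-function version, not a tested replacement for the cells above: it applies a function to the `EntireSentence` column and returns the four paraphrase columns. The `TEMPLATES` table and the `paraphrases` name are made up for illustration; the template strings follow the assignments above.

```python
import re
import pandas as pd

# wh-word -> (question word, noun) used in the paraphrase templates
TEMPLATES = {
    "who": ("Who", "person"),
    "where": ("Where", "place"),
    "how": ("What", "way"),
}
COLUMNS = ["AResponse", "SomeResponse", "AllResponse", "TheResponse"]


def paraphrases(sentence):
    """Return the four paraphrases for the first wh-word found in the sentence."""
    match = re.search(r"\b(who|where|how)\b", sentence, flags=re.IGNORECASE)
    if match is None:
        # No matching wh-word: leave the paraphrase columns empty.
        return pd.Series({col: None for col in COLUMNS})
    question_word, noun = TEMPLATES[match.group(1).lower()]
    return pd.Series({
        "AResponse": f"{question_word} is a {noun}?",
        "SomeResponse": f"{question_word} is some {noun}?",
        "AllResponse": f"{question_word} is every {noun}?",
        "TheResponse": f"{question_word} is the {noun}?",
    })


# Toy stand-in for df_valid, using three sentences from the sample.
toy = pd.DataFrame({
    "EntireSentence": [
        "who would they ask *t*-1?",
        "where do you go *t*-1?",
        "well how do you dress for work *t*-1?",
    ]
})

# Series.apply with a function that returns a Series yields a DataFrame.
df_para = pd.concat([toy, toy["EntireSentence"].apply(paraphrases)], axis=1)
print(df_para[["EntireSentence", "AResponse"]])
```

One difference from the three-slice approach: a sentence containing more than one of these wh-words gets a single row here, keyed to the first match, whereas the `str.contains` slices put it in several groups. Matching on the `Wh` column instead of the sentence text would sidestep that, provided the column is carried along with the contexts.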

## Create randomly sampled files

In [246]:
d = df_final.sample(30, random_state=321)

Save to a tab-separated file

In [260]:
d.to_csv("testfile.txt",header=True,sep="\t",index=False)

In [261]:
d

Unnamed: 0,TGrepID,EntireSentence,PreceedingContext,BestResponse,AResponse,SomeResponse,AllResponse,TheResponse
7769,137049:4,where's that *t*-1?,"###so you have probably a lot of green, and lot of real pretty things.### our, our climate's real dry###speakerb60.###where is that *t*-1?###speakera61.###and you have *-1 to kind of get up into the mountains *-2 to get much of the greenery.### most of, most, most of the land's pretty brown.###you don't have as much of the greenery like you have *?*.###speakerb62.###oh,",,What is a way?,What are some of the ways?,What are all of the ways?,What is the way?
9997,174870:4,where do you work *t*-1?,"###and i went through a, a pretty, i don't know, i went through a, standard drug testing thing before i, i was brought *-1 on.###speakerb4.###uh-huh.###speakera5.###i think 0 that's pretty standard.###at least at honeywell it is *?*.###uh, i think 0 it's important * to insure the quality and, and, uh, i don't know, almost the goodness of character. you know that kind of thing.###speakerb6.###yeah.###speakera7.",,What is a way?,What are some of the ways?,What are all of the ways?,What is the way?
3118,54662:4,who's your favorite actress or actor *t*-1?,"###you see it for, uh, you know a couple of hours or an hour###and it really,###there's a lot in it when you look at the scenery and the cars, and all the different stuff like that *t*-1, you know.###speakera80.###oh, yeah,###oh.###speakerb81.###speakera82.###yeah.###speakerb83.",,What is a way?,What are some of the ways?,What are all of the ways?,What is the way?
1311,22805:4,"how do you feel about them *t*-1, i mean, since you've kind of been close to that.","###and i,###speakera1.###all right.###well, on this subject, i really hadn't had *-1 to deal with *-2 putting someone in there yet,###but my mother's always been administrator of a nursing home###speakerb2.###uh-huh.###speakera3.###so i've always been involved, you know, in one.###speakerb4.",,What is a way?,What are some of the ways?,What are all of the ways?,What is the way?
10007,175042:4,how do you feel about public schools *t*-1?,"### they don't, they don't really do that.###so.###speakerb66.###wow.###is it *exp*-1 formal policy that they said 0 they might test?###speakera67.###yeah.###speakera1.###okay,###i'm here.",,What is a way?,What are some of the ways?,What are all of the ways?,What is the way?
8348,149860:4,where do you go *t*-1?,"### i, i agree totally.###speakera53.###speakerb54.###um, i mean, this, this, it just seems so, you know, so ridiculous that it was allowed *-1 to happen.###um, i, i'm in college right now,###and.###speakera55.###so am *t*-1 i.###speakerb56.###oh really,",,What is a way?,What are some of the ways?,What are all of the ways?,What is the way?
6892,119975:9,"well, how's it going *t*-1?","###and then you iron that on the t shirt,###and you paint around it.###speakera9.###oh, i see.###speakerb10.###so, it's real fun.###i started *-1 doing it as a, um, just something fun 0 * to do *t*-2,###and now i'm selling them###and pretty,###speakera11.",,What is a way?,What are some of the ways?,What are all of the ways?,What is the way?
9953,174186:4,how you doing *t*-1?,"###but, uh, boy, i, i feel for you and your husband.###speakerb117.###oh, why is that *t*-1?###speakera118.###well, my, my, uh, two of my, uh,###speakera1.###i'm nevin from sunnyvale, california.###speakerb2.###hi,###this is jim bliss from minneapolis, minnesota.",,What is a way?,What are some of the ways?,What are all of the ways?,What is the way?
4421,77454:4,who would they ask *t*-1?,"###speakerb77.###right.###speakera78.###and it was, uh,###you just felt, gee, if i weren't here, how,###speakerb79.###uh-huh###speakera80.### or, or if my husband,###speakerb81.",,What is a way?,What are some of the ways?,What are all of the ways?,What is the way?
6230,107918:7,well how do you dress for work *t*-1?,"###why don't you start *t*-1###so you probably have a job 0 you need *-1 to get to *t*-2 pretty soon.###speakerb2.###i'm already on my job###speakera3.###oh, you are *?*,###speakerb4.###so, you, you reached me at my job.###speakera5.###oh, wonderful,",,What is a way?,What are some of the ways?,What are all of the ways?,What is the way?


# scratch

In [264]:
df[df["Item_ID"] == "331:188"].SentenceParse

13    (TOP (S (CC But) (, ,) (INTJ (UH uh)) (, ,) (EDITED (RM (-DFL- \[)) (EDITED (RM (-DFL- \[)) (NP-SBJ (PRP I)) (, ,) (IP (-DFL- \+))) (NP-SBJ (PRP I)) (, ,) (RS (-DFL- \])) (IP (-DFL- \+))) (NP-SBJ (PRP I)) (RS (-DFL- \])) (VP (VBP think) (SBAR (IN that) (PRN (, ,) (S (NP-SBJ (PRP you)) (VP (VBP know))) (, ,)) (S (EDITED (RM (-DFL- \[)) (S-UNF (NP-SBJ (PRP we)) (ADVP-TMP (RB always))) (, ,) (IP (-DFL- \+)) (RS (-DFL- \]))) (INTJ (UH uh)) (PRN (, ,) (S (NP-SBJ (PRP I)) (VP (VBP mean))) (, ,)) (EDITED (RM (-DFL- \[)) (S (NP-SBJ (PRP I)) (VP-UNF (VBP 've))) (, ,) (IP (-DFL- \+))) (NP-SBJ (PRP I)) (VP (VBP 've) (RS (-DFL- \])) (VP (VBN had) (NP (NP (DT a) (NN lot)) (PP (IN of) (NP (NP (JJ good) (NNS experiences)) (EDITED (RM (-DFL- \[)) (PP-UNF (IN with)) (, ,) (IP (-DFL- \+))) (INTJ (UH uh)) (, ,) (PP (IN with) (RS (-DFL- \])) (EDITED (RM (-DFL- \[)) (NP-UNF (JJ many)) (IP (-DFL- \+))) (NP (NP (JJ many) (RS (-DFL- \])) (NNS people)) (SBAR (ADVP (RB especially)) (WHADVP-1 (WRB where)) 

In [391]:
df_re = df[df.QuestionType.isin(["root","embedded"])]

In [405]:
df_re_iq = df_re[df_re["IdentityQ"] == "yes"]

In [400]:
df[df["Item_ID"]=="45968:11"]

Unnamed: 0,Item_ID,Sentence,HaveNeedTo,Finite,ModalPresent,QuestionType,DegreeQ,SubjectAuxInv,WhAll,MatrixNegPresent,...,DeterminerNonSubject,FullWhPhrase,DeterminerSubjPresent,DeterminerNonSubjPresent,WhNode,WhParse,Question,SentenceParse,WhPhaseType,IdentityQ
2675,45968:11,you remember when archie manning was a quarterback *t*-1?,no,yes,no,root,no,,,,...,a quarterback,,no,no,,(WRB when),when archie manning was a quarterback *t*-1,(TOP (SQ (NP-SBJ (PRP you)) (VP (VBP remember) (SBAR-TMP (WHADVP-1 (WRB when)) (S (NP-SBJ (NNP Archie) (NNP Manning)) (VP (VBD was) (NP-PRD (DT a) (NN quarterback)) (ADVP-TMP (-NONE- *T*-1)))))) (. ?) (-DFL- E_S))),monomorphemic,yes


In [410]:
# df_re_iq.loc[3096]
df[df["Item_ID"] =="54168:7"]

Unnamed: 0,Item_ID,Sentence,HaveNeedTo,Finite,ModalPresent,QuestionType,DegreeQ,SubjectAuxInv,WhAll,MatrixNegPresent,...,DeterminerNonSubject,FullWhPhrase,DeterminerSubjPresent,DeterminerNonSubjPresent,WhNode,WhParse,Question,SentenceParse,WhPhaseType,IdentityQ
3096,54168:7,"well how did they feel about the, uh, the united states interven-, intervening with patriot missiles *t*-1,",no,yes,no,root,no,yes,,,...,the,,no,no,,(WRB how),"well how did they feel about the, uh, the united states interven-, intervening with patriot missiles *t*-1,","(TOP (SBARQ (INTJ (UH Well)) (WHADVP-1 (WRB how)) (SQ (VBD did) (NP-SBJ (PRP they)) (VP (VB feel) (PP (IN about) (EDITED (RM (-DFL- \[)) (NP-UNF (DT the)) (, ,) (IP (-DFL- \+))) (INTJ (UH uh)) (, ,) (S-NOM (NP-SBJ (DT the) (RS (-DFL- \])) (NNP United) (NNP States)) (EDITED (RM (-DFL- \[)) (VP-UNF (VBG interven-)) (, ,) (IP (-DFL- \+))) (VP (VBG intervening) (RS (-DFL- \])) (PP (IN with) (NP (NNP Patriot) (NNS missiles)))))) (ADVP (-NONE- *T*-1)))) (, ,) (-DFL- E_S)))",monomorphemic,yes


In [393]:
len(df_re_iq)

816

In [394]:
df_re_iq[df_re_iq["QuestionType"] == "root"].Sentence

33                                                                                                                                                                                                                why do we end this thing *t*-1?
107                                                                                                                                                                                        but you know, whatever *t*-1 became of peter frampton.
223                                                                                                                                                                                                                              what is it *t*-1
234                                                                                                                                                                                                                           where's that *t*-1.
254                             

In [402]:
df.loc[df["Item_ID"] == "1872:16"]

Unnamed: 0,Item_ID,Sentence,HaveNeedTo,Finite,ModalPresent,QuestionType,DegreeQ,SubjectAuxInv,WhAll,MatrixNegPresent,...,DeterminerNonSubject,FullWhPhrase,DeterminerSubjPresent,DeterminerNonSubjPresent,WhNode,WhParse,Question,SentenceParse,WhPhaseType,IdentityQ
107,1872:16,"but you know, whatever *t*-1 became of peter frampton.",no,yes,no,root,no,yes,,,...,,,no,no,,(WDT whatever),"but you know, whatever *t*-1 became of peter frampton.","(TOP (SBARQ (CC but) (PRN (S (NP-SBJ (PRP you)) (VP (VBP know))) (, ,)) (WHNP-1 (WDT whatever)) (SQ (NP-SBJ (-NONE- *T*-1)) (VP (VBD became) (PP (IN of) (NP (NNP Peter) (NNP Frampton))))) (. .) (-DFL- E_S)))",monomorphemic,yes


In [407]:
df_re_iq.loc[33]

Item_ID                                                                                                                                                                  844:4
Sentence                                                                                                                                       why do we end this thing *t*-1?
HaveNeedTo                                                                                                                                                                  no
Finite                                                                                                                                                                     yes
ModalPresent                                                                                                                                                                no
QuestionType                                                                                                                 

In [406]:
df_re_iq.Sentence

25                                                                                                                                                                                                                                                                                                                                             do you know who he is *t*-1?
30                                                                                                                                                                                                                                                                                                                                      i don't know who that guy is *t*-1.
33                                                                                                                                                                                                                                                                              