# Data Prep
This notebook extracts, preprocesses, and splits examples into training and test sets.

## Collect Data
We extract sample sentences from the first half of the US English portion of the GloWbE corpus. After preprocessing, we search for instances where "there", "their", or "theyre" appears once.

### Data Processing

In [87]:
import re

text = []
with open('Euphemism Proj/glowbe_corpus/text_US_0-4.txt', encoding='utf8') as f:
    for line in f:
        for p in re.split(r'<p>', line):
            p = p.strip()
            text.append(p)

In [88]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)

# create a dataframe for the texts
df = pd.DataFrame(text)
df.columns = ['text']
print("There are {} rows in df".format(len(df)))
df[['text']]

There are 3711273 rows in df


Unnamed: 0,text
0,##100040
1,"Arts and cultural industries rake in an annual $1.2 billion and generate 19,500 jobs in Bernalillo County . UNM alumni in the sector are helping the university pave a smoother path for future arts managers and entrepreneurs"
2,##100060
3,disabled or is not : About CDC.gov . <h> Cancer and Men
4,"Every year , cancer claims the lives of nearly 300,000 men in America . Men can reduce their risk for several of the most common kinds of cancer . <h> Lifestyle Changes <h> Tobacco Use and Secondhand Smoke"
...,...
3711268,And I know from my experience with the boys that someday is coming fast ... inconceivably I will miss these nights of bad dreams and IN-adequate quality sleep .
3711269,Its hard to imagine if you have n't parented through these years but trust me on this ... you will miss it all ... and it will gone before you know it .
3711270,"So I 'll have an extra cup of coffee and my eyes will be a little MORE puffy and I 'll struggle to stay awake in my meetings today but I will also thank the heavens above AGAIN that I have this amazing , wonderful privilege to be the ( exhausted ) mom of little ones again"
3711271,##99914


In [89]:
# lowercase all characters
df['clean_text'] = df['text'].fillna('').apply(lambda x: x.lower())

# keep only first sentence from each example to keep it simple
def only_keep_first_sentence(x):
    try:
        text = re.search(r'^.*?[.?!]', x).group(0)
    except:
        text = '' # if no such sentence boundary can be find, discard this example by replacing it with ''
    return text
df['clean_text'] = df['clean_text'].apply(lambda x: only_keep_first_sentence(x))

# Remove HTML tags
df['clean_text'] = df['clean_text'].str.replace('(<.*?>|&lt;.*?&gt;)', '')

# Remove digits
df['clean_text'] = df['clean_text'].str.replace('\d+', '')

# Remove other special characters 
# df['clean_text'] = df['clean_text'].str.replace("( [^a-zA-Z ]+|#)", "") # (most are punctuation marks with spaces)
df['clean_text'] = df['clean_text'].str.replace("[^a-zA-Z\s]", "")

# get rid of multiple white spaces + strip
df['clean_text'] = df['clean_text'].apply(lambda x: ' '.join(x.split())) 

# Remove empty rows
df['clean_text'].astype(bool) 
df= df[df['clean_text'].str.strip().astype(bool)]

  df['clean_text'] = df['clean_text'].str.replace('(<.*?>|&lt;.*?&gt;)', '')
  df['clean_text'] = df['clean_text'].str.replace('\d+', '')
  df['clean_text'] = df['clean_text'].str.replace("[^a-zA-Z\s]", "")


In [90]:
df.rename(columns={'text': 'orig_text'}, inplace=True)
# df

Unnamed: 0,orig_text,clean_text
1,"Arts and cultural industries rake in an annual $1.2 billion and generate 19,500 jobs in Bernalillo County . UNM alumni in the sector are helping the university pave a smoother path for future arts managers and entrepreneurs",arts and cultural industries rake in an annual
3,disabled or is not : About CDC.gov . <h> Cancer and Men,disabled or is not about cdc
4,"Every year , cancer claims the lives of nearly 300,000 men in America . Men can reduce their risk for several of the most common kinds of cancer . <h> Lifestyle Changes <h> Tobacco Use and Secondhand Smoke",every year cancer claims the lives of nearly men in america
5,"More men in the United States die from lung cancer than any other kind of cancer , and cigarette smoking causes most cases . Smoking also causes cancers of the esophagus , larynx ( voice box ) , mouth , throat , kidney , bladder , pancreas , stomach , and acute myeloid leukemia . Nonsmokers who are exposed to secondhand smoke at home or work increase their lung cancer risk by 20% -- 30% . Concentrations of many cancer-causing and toxic chemicals are higher in secondhand smoke than in the smoke inhaled by smokers .",more men in the united states die from lung cancer than any other kind of cancer and cigarette smoking causes most cases
6,"One of the things you can do to lower your risk of cancer is @ @ @ @ @ @ @ @ @ @ smoke . <h> Obesity , Overweight , and Lack of Physical Activity",one of the things you can do to lower your risk of cancer is smoke
...,...,...
3711266,"This morning I 'm feeling slightly nauseous from the lack of adequate quality sleep ( OK , I just snorted as I wrote that , "" adequate quality sleep "" , hysterical ) .",this morning i m feeling slightly nauseous from the lack of adequate quality sleep ok i just snorted as i wrote that adequate quality sleep hysterical
3711267,"But somehow even through my sleep deprived nausea ( and this is the weird thing that is being a mom ) I ca n't help but look over at my girls , snuggled together on my bedroom floor and feel a crushing disappointment that @ @ @ @ @ @ @ @ @ @ they will be too old for this .",but somehow even through my sleep deprived nausea and this is the weird thing that is being a mom i ca nt help but look over at my girls snuggled together on my bedroom floor and feel a crushing disappointment that they will be too old for this
3711268,And I know from my experience with the boys that someday is coming fast ... inconceivably I will miss these nights of bad dreams and IN-adequate quality sleep .,and i know from my experience with the boys that someday is coming fast
3711269,Its hard to imagine if you have n't parented through these years but trust me on this ... you will miss it all ... and it will gone before you know it .,its hard to imagine if you have nt parented through these years but trust me on this


In [85]:
df['char_count'] = df['clean_text'].apply(len)
df['word_count'] = df['clean_text'].apply(lambda x: len(x.split()))
# df

Unnamed: 0,orig_text,clean_text,char_count,word_count
1,"Arts and cultural industries rake in an annual $1.2 billion and generate 19,500 jobs in Bernalillo County . UNM alumni in the sector are helping the university pave a smoother path for future arts managers and entrepreneurs",arts and cultural industries rake in an annual,46,8
3,disabled or is not : About CDC.gov . <h> Cancer and Men,disabled or is not about cdc,28,6
4,"Every year , cancer claims the lives of nearly 300,000 men in America . Men can reduce their risk for several of the most common kinds of cancer . <h> Lifestyle Changes <h> Tobacco Use and Secondhand Smoke",every year cancer claims the lives of nearly men in america,59,11
5,"More men in the United States die from lung cancer than any other kind of cancer , and cigarette smoking causes most cases . Smoking also causes cancers of the esophagus , larynx ( voice box ) , mouth , throat , kidney , bladder , pancreas , stomach , and acute myeloid leukemia . Nonsmokers who are exposed to secondhand smoke at home or work increase their lung cancer risk by 20% -- 30% . Concentrations of many cancer-causing and toxic chemicals are higher in secondhand smoke than in the smoke inhaled by smokers .",more men in the united states die from lung cancer than any other kind of cancer and cigarette smoking causes most cases,120,22
6,"One of the things you can do to lower your risk of cancer is @ @ @ @ @ @ @ @ @ @ smoke . <h> Obesity , Overweight , and Lack of Physical Activity",one of the things you can do to lower your risk of cancer is smoke,66,15
...,...,...,...,...
3711266,"This morning I 'm feeling slightly nauseous from the lack of adequate quality sleep ( OK , I just snorted as I wrote that , "" adequate quality sleep "" , hysterical ) .",this morning im feeling slightly nauseous from the lack of adequate quality sleep ok i just snorted as i wrote that adequate quality sleep hysterical,149,25
3711267,"But somehow even through my sleep deprived nausea ( and this is the weird thing that is being a mom ) I ca n't help but look over at my girls , snuggled together on my bedroom floor and feel a crushing disappointment that @ @ @ @ @ @ @ @ @ @ they will be too old for this .",but somehow even through my sleep deprived nausea and this is the weird thing that is being a mom i ca nt help but look over at my girls snuggled together on my bedroom floor and feel a crushing disappointment that they will be too old for this,244,48
3711268,And I know from my experience with the boys that someday is coming fast ... inconceivably I will miss these nights of bad dreams and IN-adequate quality sleep .,and i know from my experience with the boys that someday is coming fast,71,14
3711269,Its hard to imagine if you have n't parented through these years but trust me on this ... you will miss it all ... and it will gone before you know it .,its hard to imagine if you have nt parented through these years but trust me on this,84,17


### Searching for the keywords "their" "there" and "theyre"

In [93]:
from tqdm import tqdm # module for keeping track of progress

no_matches = [] # stores indices of rows to be deleted
df['keyword'] = "" # column for indicating what PET is in this example
their_examples, there_examples, theyre_examples = [], [], []

# for each row, try to find and note down the PET, otherwise mark it for deletion
for i, row in tqdm(df.iterrows(), position=0, leave=True):
    text = df.loc[i, 'clean_text'] # locate the text
    found_keyword = False # if this is still False after the PET search, mark this row for deletion
    their_result = bool(re.search(r'their', text))
    there_result = bool(re.search(r'there', text))
    theyre_result = bool(re.search(r'they ?re', text))
    if (their_result ^ there_result ^ theyre_result): # if only one of these cases is true (we don't want both 'their' and 'there' in an example, for instance)
        if (their_result): 
            df.loc[i, 'keyword'] = 'their'
            their_examples.append(i)
        elif (there_result): 
            df.loc[i, 'keyword'] = 'there'
            there_examples.append(i)
        elif (theyre_result):
            df.loc[i, 'keyword'] = 'theyre'
            theyre_examples.append(i)
            
print(len(their_examples))
print(len(there_examples))
print(len(theyre_examples))
#     for term in ['their', 'there', 'they\'re']: 
#         if term in text:
#             found_PET = True
#             df.loc[i, 'PET'] = PET
        
#     if (found_PET == False):
#         no_matches.append(i)
        
# result_df = df.drop(no_matches) # delete rows
# display(result_df)

2985654it [01:56, 25684.71it/s]

115701
139502
11039





In [None]:
# some more preprocessing: combining "they re" into "theyre", to be consistent with the other keywords
for i, row in theyre_df.iterrows():
    text = theyre_df.loc[i, 'clean_text'] 
    theyre_df.loc[i, 'clean_text'] = text.replace('they re', 'theyre')
    
theyre_df

In [98]:
# save the matched examples
matches_df = pd.concat([df.loc[their_examples], df.loc[there_examples], df.loc[theyre_examples]])
matches_df.to_csv('their_there_theyre.csv')

## Form Training and Test Sets

In [239]:
import pandas as pd

matches_df = pd.read_csv("their_there_theyre_data.csv", index_col=0)

# GloWbE sometimes provides duplicate examples, so let's take care of that now
dup = matches_df.duplicated(subset=['clean_text'], keep=False)
print(len(dup[dup].index))
matches_df = matches_df.drop(dup[dup].index)
matches_df['keyword'].value_counts()

14448


keyword
there     131202
their     110098
theyre     10494
Name: count, dtype: int64

In [240]:
# sample 10k examples for each keyword
their_df = matches_df.loc[matches_df['keyword'] == 'their'].sample(n=10000)
there_df = matches_df.loc[matches_df['keyword'] == 'there'].sample(n=10000)
theyre_df = matches_df.loc[matches_df['keyword'] == 'theyre'].sample(n=10000)

# initialize column that will contain correct keyword (0 = their, 1 = there, 2 = theyre)
their_df['label'] = 0
there_df['label'] = 1
theyre_df['label'] = 2

In [241]:
# randomly sample half of each dataset to be corrupted (i.e., the keyword will be replaced with 1 of the other 2 keywords to generate "ungrammatical" examples)
their_corrupt_sample = their_df.sample(frac=0.5)
their_regular_sample = their_df.drop(their_corrupt_sample.index)
there_corrupt_sample = there_df.sample(frac=0.5)
there_regular_sample = there_df.drop(there_corrupt_sample.index)
theyre_corrupt_sample = theyre_df.sample(frac=0.5)
theyre_regular_sample = theyre_df.drop(theyre_corrupt_sample.index)

# initialize column that will indicate whether this example is ungrammatical (ie whether it has been corrupted)
their_corrupt_sample['ungrammatical'] = 1
their_regular_sample['ungrammatical'] = 0
there_corrupt_sample['ungrammatical'] = 1
there_regular_sample['ungrammatical'] = 0
theyre_corrupt_sample['ungrammatical'] = 1
theyre_regular_sample['ungrammatical'] = 0

# for each of the corrupted datasets, randomly select half to be corrupted with one keyword; the other half, the remaining keyword
their_there_corrupt_sample = their_corrupt_sample.sample(frac=0.5) # this means the examples with "their" that'll be corrupted with "there"
their_theyre_corrupt_sample = their_corrupt_sample.drop(their_there_corrupt_sample.index) 
their_there_corrupt_sample['clean_text'] = their_there_corrupt_sample['clean_text'].apply(lambda x: x.replace('their', 'there'))
their_theyre_corrupt_sample['clean_text'] = their_theyre_corrupt_sample['clean_text'].apply(lambda x: x.replace('their', 'theyre'))

there_their_corrupt_sample = there_corrupt_sample.sample(frac=0.5) # this means the examples with "their" that'll be corrupted with "there"
there_theyre_corrupt_sample = there_corrupt_sample.drop(there_their_corrupt_sample.index) 
there_their_corrupt_sample['clean_text'] = there_their_corrupt_sample['clean_text'].apply(lambda x: x.replace('there', 'their'))
there_theyre_corrupt_sample['clean_text'] = there_theyre_corrupt_sample['clean_text'].apply(lambda x: x.replace('there', 'theyre'))

theyre_their_corrupt_sample = theyre_corrupt_sample.sample(frac=0.5) # this means the examples with "their" that'll be corrupted with "there"
theyre_there_corrupt_sample = theyre_corrupt_sample.drop(theyre_their_corrupt_sample.index) 
theyre_their_corrupt_sample['clean_text'] = theyre_their_corrupt_sample['clean_text'].apply(lambda x: x.replace('theyre', 'their'))
theyre_there_corrupt_sample['clean_text'] = theyre_there_corrupt_sample['clean_text'].apply(lambda x: x.replace('theyre', 'there'))

In [242]:
final_df = pd.concat([their_regular_sample, there_regular_sample, theyre_regular_sample, 
           their_there_corrupt_sample, their_theyre_corrupt_sample, 
           there_their_corrupt_sample, there_theyre_corrupt_sample,
           theyre_their_corrupt_sample, theyre_there_corrupt_sample])

final_df[['clean_text', 'keyword', 'label', 'ungrammatical']]

Unnamed: 0,clean_text,keyword,label,ungrammatical
818651,thanks joe for talking about what might off to...,their,0,0
410581,unfortunately societysometimes fails to recogn...,their,0,0
1467069,we push browsers to their limits these days an...,their,0,0
2318513,like most families the pga tour and the pga of...,their,0,0
200065,my friend was suggesting that we need more stu...,their,0,0
...,...,...,...,...
2779480,i ca nt believe they re going to strike this down,theyre,2,1
2055532,by the way they re actually not closing comple...,theyre,2,1
340218,the units are in some ways a tacit acknowledge...,theyre,2,1
1473756,when they reached its shores the tide was out,theyre,2,1


In [243]:
# checks
their_regular = final_df[(final_df['label'] == 0) & (final_df['ungrammatical'] == 0)]
their_corrupt = final_df[(final_df['label'] == 0) & (final_df['ungrammatical'] == 1)]
there_regular = final_df[(final_df['label'] == 1) & (final_df['ungrammatical'] == 0)]
there_corrupt = final_df[(final_df['label'] == 1) & (final_df['ungrammatical'] == 1)]
theyre_regular = final_df[(final_df['label'] == 2) & (final_df['ungrammatical'] == 0)]
theyre_corrupt = final_df[(final_df['label'] == 2) & (final_df['ungrammatical'] == 1)]

print(len(their_regular), len(their_corrupt), len(there_regular), len(there_corrupt), len(theyre_regular), len(theyre_corrupt))

5000 5000 5000 5000 5000 5000


In [245]:
# form train test splits with balanced examples per category
def split(df, split_index):
    return df[0:split_index], df[split_index:len(df)]

SPLIT_INDEX = int(0.9*5000)
train_their_regular, test_their_regular = split(their_regular, SPLIT_INDEX)
train_their_corrupt, test_their_corrupt = split(their_corrupt, SPLIT_INDEX)
train_there_regular, test_there_regular = split(there_regular, SPLIT_INDEX)
train_there_corrupt, test_there_corrupt = split(there_corrupt, SPLIT_INDEX)
train_theyre_regular, test_theyre_regular = split(theyre_regular, SPLIT_INDEX)
train_theyre_corrupt, test_theyre_corrupt = split(theyre_corrupt, SPLIT_INDEX)

train_df = pd.concat([train_their_regular, train_their_corrupt, train_there_regular, train_there_corrupt, train_theyre_regular, train_theyre_corrupt])
test_df = pd.concat([test_their_regular, test_their_corrupt, test_there_regular, test_there_corrupt, test_theyre_regular, test_theyre_corrupt])
print(len(train_df), len(test_df))

27000 3000


In [246]:
folder = 'Vantage_Labs_Project_Files'
train_df.to_csv(f'{folder}/their_there_theyre_train.csv')
test_df.to_csv(f'{folder}/their_there_theyre_test.csv')