# Stimulus selection for two-word reversible phrases

Here we use the [Google Books N-gram dataset](https://books.google.com/ngrams/info) to select two-word phrases suitable as stimuli in the experiment. The lists of 2-grams and 1-grams are generated by [code made publicly available by GitHub user `orgtre`](https://github.com/orgtre/google-books-ngram-frequency). Each phrase must meet the following criteria:

 - Both words are $N$ to $K$ letters long
 - The first word is a determiner or adjective
 - The second word is a noun
 - The phrase is commonly used

A list of the 45,000 most common words in the dataset, each with its part-of-speech (POS) tag, will be used to select common words, and a list of the 100,000 most common bigrams will provide the candidate phrases. We will sub-select from these bigrams according to the above criteria.

We begin by defining the minimum and maximum word lengths ($N$ and $K$) in `CHARACTER_BOUNDS`. We then load the CSV files and separate the words from their part-of-speech tags for the 1-grams. We also split each bigram into its two component words.

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from IPython.display import display

CHARACTER_BOUNDS = (3, 5)

# Load the 45k most common words and 100k most common 2-grams
datapath = Path("../word_ngrams/").resolve()
n2grams = pd.read_csv(datapath / "2grams_english_1a_no_pos.csv").convert_dtypes()
wordcats = pd.read_csv(datapath / "1grams_english_1b_with_pos.csv").convert_dtypes()

# Split each tagged 1-gram ("word_POS") into word and POS, then isolate
# the nouns, determiners, and adjectives in the top 45k
wordcats["word"] = wordcats["ngram"].str.split("_").str[0]
wordcats["POS"] = wordcats["ngram"].str.split("_").str[1]
wordcats.drop(columns=["ngram"], inplace=True)
wordcats = wordcats.reindex(columns=["word", "POS", "freq"])
nouns = wordcats[wordcats["POS"] == "NOUN"].reset_index(drop=True)
determiners = wordcats[wordcats["POS"] == "DET"].reset_index(drop=True)
adjectives = wordcats[wordcats["POS"] == "ADJ"].reset_index(drop=True)

# Split the 2-grams into the component words
n2grams["word1"] = n2grams["ngram"].str.split(" ").str[0]
n2grams["word2"] = n2grams["ngram"].str.split(" ").str[1]
n2grams.drop(columns=["ngram"], inplace=True)
n2grams = n2grams.reindex(columns=["word1", "word2", "freq"]).convert_dtypes()

In [2]:
display(nouns)
display(n2grams)

Unnamed: 0,word,POS,freq
0,time,NOUN,344137821
1,way,NOUN,200669012
2,people,NOUN,199633836
3,life,NOUN,160230516
4,man,NOUN,149700258
...,...,...,...
27443,Gallimard,NOUN,207112
27444,lichen,NOUN,207107
27445,hack,NOUN,207101
27446,CULTURE,NOUN,207071


Unnamed: 0,word1,word2,freq
0,of,the,1741529442
1,in,the,1043688187
2,to,the,686570630
3,on,the,446536709
4,and,the,446190715
...,...,...,...
99995,cell,adhesion,187605
99996,and,defence,187605
99997,the,testator,187603
99998,suppressed,by,187590
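
As a small illustration of the tag-splitting step, the sketch below applies the same idea to a hand-made frame. The toy rows and frequencies are invented for illustration and are not taken from the dataset; splitting once with `expand=True` is one possible alternative to calling `str.split` twice.

```python
import pandas as pd

# Toy stand-in for the 1-gram file: each ngram is "word_POS" (values are made up)
toy = pd.DataFrame({"ngram": ["time_NOUN", "the_DET", "new_ADJ"], "freq": [30, 20, 10]})

# Split once at the first underscore and reuse the result for both columns
parts = toy["ngram"].str.split("_", n=1, expand=True)
toy["word"] = parts[0]
toy["POS"] = parts[1]
toy = toy.drop(columns=["ngram"]).reindex(columns=["word", "POS", "freq"])
print(toy)
```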


Next we will isolate the bigrams which are composed of any first word followed by a noun in the top 45,000 between $N$ and $K$ letters long:

In [5]:
# Isolate the 2-grams which are composed of any word and then a top-45k noun, and N-K characters
noun_2grams = n2grams[
    n2grams["word2"].isin(nouns["word"]) & n2grams["word2"].str.len().between(*CHARACTER_BOUNDS)
].copy()
display(noun_2grams)

Unnamed: 0,word1,word2,freq,w2_noun_rank
0,of,the,1741529442,20329
1,in,the,1043688187,20329
2,to,the,686570630,20329
3,on,the,446536709,20329
4,and,the,446190715,20329
...,...,...,...,...
99973,Introduction,This,187641,2321
99975,her,round,187635,2907
99977,the,RNA,187634,3854
99979,computer,games,187629,1281
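
To make the filter easier to follow, here is a minimal sketch of the same two conditions on invented toy data (the words below are placeholders, not rows from the n-gram files). Note that `Series.between` is inclusive at both ends by default, so words of exactly $N$ or $K$ letters are kept.

```python
import pandas as pd

CHARACTER_BOUNDS = (3, 5)  # same bounds as defined earlier in the notebook

# Invented toy data, not taken from the n-gram files
toy_nouns = pd.Series(["time", "world", "government", "way"])
toy_bigrams = pd.DataFrame(
    {
        "word1": ["the", "the", "of", "the"],
        "word2": ["time", "government", "way", "quickly"],
    }
)

is_noun = toy_bigrams["word2"].isin(toy_nouns)  # second word is a known noun
in_bounds = toy_bigrams["word2"].str.len().between(*CHARACTER_BOUNDS)  # 3 to 5 letters
kept = toy_bigrams[is_noun & in_bounds]
print(kept)
```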


Next we create word-indexed lookup tables of within-category rank for nouns, adjectives, and determiners, i.e. key-value mappings in which the key is the word and the value is its rank within that category. The noun ranks are then attached to the second word of each bigram.

In [7]:
idxnouns = nouns.set_index("word")
idxadj = adjectives.set_index("word")
idxdet = determiners.set_index("word")
idxnouns["rank"] = np.arange(len(idxnouns))
idxadj["rank"] = np.arange(len(idxadj))
idxdet["rank"] = np.arange(len(idxdet))
noun_2grams["w2_noun_rank"] = noun_2grams["word2"].map(idxnouns["rank"])
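
The lookup pattern used here can be sketched on a toy noun list as follows. The words and frequencies are invented, and the rank is simply the row position, which assumes the list is already sorted by descending frequency.

```python
import numpy as np
import pandas as pd

# Invented toy noun list, assumed to be sorted by descending frequency
toy_nouns = pd.DataFrame({"word": ["time", "way", "life"], "freq": [30, 20, 10]})
rank = toy_nouns.set_index("word").assign(rank=np.arange(len(toy_nouns)))["rank"]

# Series.map looks each word up in the index of `rank`; unknown words become NaN
toy_word2 = pd.Series(["way", "time", "door"])
mapped = toy_word2.map(rank)
print(mapped)
```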

Next we select the phrases whose first word is an adjective or determiner between $N$ and $K$ characters long, sort them by frequency, and save them to the `phrases_candidates.csv` file:

In [7]:
# Isolate the word2 noun 2-grams that start with a determiner or adjective N-K char long.
# drop_duplicates keeps only the first-listed POS tag for each word.
singlecat = wordcats.drop_duplicates(subset=["word"]).copy()
noun_2grams["word1_POS"] = noun_2grams["word1"].map(singlecat.set_index("word")["POS"])
detadj_2grams = (
    noun_2grams[
        noun_2grams["word1_POS"].isin(["DET", "ADJ"])
        & noun_2grams["word1"].isin(singlecat["word"])
        & noun_2grams["word1"].str.len().between(*CHARACTER_BOUNDS)
    ]
    .copy()
    .sort_values("freq", ascending=False)
)


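# Add in the ranks for the first word within its category.
# Note: the ranks are attached to noun_2grams, so they are not part of
# detadj_2grams or the CSV saved below.
def safe_adjdet_rank(row):
    word = row["word1"]
    if word in idxadj.index:
        return idxadj.at[word, "rank"]
    if word in idxdet.index:
        return idxdet.at[word, "rank"]
    # word1 is neither an adjective nor a determiner in the top-45k list
    return pd.NA


noun_2grams["w1_adjdet_rank"] = noun_2grams.apply(safe_adjdet_rank, axis=1)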

display(detadj_2grams)
detadj_2grams.to_csv(datapath / "phrases_candidates.csv", index=False)

Unnamed: 0,word1,word2,freq,word1_POS
58,all,the,81315406,DET
95,the,world,61563770,DET
109,the,time,56081273,DET
129,the,way,49869616,DET
168,the,two,43138107,DET
...,...,...,...,...
99915,open,water,187739,ADJ
99946,right,path,187686,ADJ
99954,both,male,187670,DET
99977,the,RNA,187634,DET
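
The row-wise rank lookup could also be written without `apply`. The sketch below is one possible vectorized variant on invented rank tables (`adj_rank` and `det_rank` are hypothetical stand-ins for `idxadj["rank"]` and `idxdet["rank"]`): it tries the adjective table first and falls back to the determiner table.

```python
import pandas as pd

# Invented rank tables keyed by word (hypothetical values)
adj_rank = pd.Series({"new": 0, "open": 1})
det_rank = pd.Series({"the": 0, "all": 1})

toy = pd.DataFrame({"word1": ["the", "open", "of"]})

# Adjective rank where available, otherwise determiner rank, otherwise NaN
toy["w1_adjdet_rank"] = toy["word1"].map(adj_rank).fillna(toy["word1"].map(det_rank))
print(toy)
```

Words found in neither table are left as missing values rather than raising an error.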


Finally, we want the phrases to also be _not reversible_, so that the `word2, word1` order can serve as the non-phrase condition.

This isn't trivial, so as a rough proxy (we will hand-select the final stimuli anyway) we drop any candidate whose reversed form also appears among the candidates. Note that the code below compares the candidates only with each other, not with the full top 100k bigrams.

In [8]:
# Create a copy of the data in all-lowercase form, and another reversed copy of that lowercase form
lower = detadj_2grams.copy()
lower["word1"], lower["word2"] = lower["word1"].str.lower(), lower["word2"].str.lower()
rev_lower = lower.copy().rename(columns={"word1": "word2", "word2": "word1"})

# Set the index of both dataframes to the 2-gram words for comparison
rev_lower.set_index(["word1", "word2"], inplace=True)
lower.set_index(["word1", "word2"], inplace=True)
nonreversible = lower.index.difference(rev_lower.index)
final_candidates = lower.loc[nonreversible].sort_values("freq", ascending=False).reset_index()
final_candidates["rev_word1"], final_candidates["rev_word2"] = (
    final_candidates["word2"],
    final_candidates["word1"],
)

pd.set_option("display.max_rows", 100)
display(f"The top 100 most used candidates out of {len(lower)}:")
display(final_candidates.head(100))
final_candidates.to_csv(datapath / "phrases_final_candidates.csv", index=False)


'The top 100 most used candidates out of 4117:'

Unnamed: 0,word1,word2,freq,word1_POS,rev_word1,rev_word2
0,the,world,61563770,DET,world,the
1,the,time,56081273,DET,time,the
2,the,way,49869616,DET,way,the
3,the,two,43138107,DET,two,the
4,the,end,42578601,DET,end,the
5,which,the,39832216,DET,the,which
6,the,door,37231705,DET,door,the
7,the,case,33094367,DET,case,the
8,the,one,28806519,DET,one,the
9,the,new,28214285,DET,new,the
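
The cell above compares the candidates only with their own reversed forms. A stricter variant would test each reversed candidate against the full bigram list; a minimal sketch on invented toy data might look like this (`all_bigrams` and `candidates` are hypothetical stand-ins for `n2grams` and `detadj_2grams`).

```python
import pandas as pd

# Invented toy data standing in for the full bigram list and the candidates
all_bigrams = pd.DataFrame(
    {"word1": ["the", "world", "the", "Door"], "word2": ["world", "the", "door", "open"]}
)
candidates = pd.DataFrame({"word1": ["the", "the"], "word2": ["world", "door"]})

# Lowercased (word1, word2) pairs present anywhere in the full list
seen = pd.MultiIndex.from_arrays(
    [all_bigrams["word1"].str.lower(), all_bigrams["word2"].str.lower()]
)
# Reversed candidate pairs, i.e. (word2, word1)
reversed_pairs = pd.MultiIndex.from_arrays(
    [candidates["word2"].str.lower(), candidates["word1"].str.lower()]
)

# Keep a candidate only if its reversed form never occurs in the full list
keep = ~reversed_pairs.isin(seen)
nonreversible = candidates[keep]
print(nonreversible)
```

In this toy example "the world" is dropped because "world the" occurs in the full list, while "the door" is kept.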
