## Smithsonian Trinomial Search in Trigrams from Documents Stored in Constellate's Datasets
This Jupyter Notebook searches for Smithsonian Trinomial instances in datasets built using Constellate's dataset builder and exports each Smithsonian Trinomial, together with the JSTOR stable URL of the article containing it, as a .csv file.

**How this notebook functions:** Starting from a dataset built with Constellate's dataset builder (https://constellate.org/), this notebook compiles a DataFrame of the dataset's trigrams and passes it through six filtering steps based on regular expressions. The first step keeps every trigram that could be a Smithsonian Trinomial; it also lets through some Munsell numbers and other false matches. The second step removes the remaining Munsell numbers, and the last four remove false matches caused by common writing conventions.

**What you need to start using this notebook:** A dataset file built using Constellate (https://constellate.org/), a Jupyter Lab environment, and Python. The dataset file should be in the same folder as this notebook so that it can be read while the notebook runs.


In [None]:
import numpy as np
import pandas as pd
import csv
import time
import re
from io import StringIO

#Regular expression for detecting the Smithsonian Trinomial format: a
#state number from 1 to 50, a two-letter county code, and a site number.
#Some Munsell numbers also fit this format and will be picked up
tri_or_munsell_re = r'[(]?(0[1-9]|[1-4][0-9]|50|[1-9])[ ][A-Z][a-zA-Z][ ][1-9]\d{0,6}'
munsell_re = r'(0|10|5)[ ](yr|Yr|YR|gy|Gy|GY|bg|Bg|BG|pb|Pb|PB|rp|Rp|RP)'

#For each pair below, the first pattern finds the word or unit and the
#second lists the leading numbers for which a match is kept. The trailing
#[ ] stops a short number such as 3 from also matching 30 through 39
to_regex = r'(to|To|TO)'
num_to_regex = r'^[(]?(9|14|15|21|29|32|42)[ ]'
by_regex = r'(by|By|BY)'
num_by_regex = r'^[(]?(8|9|10|20|23|40|41|46)[ ]'

km_regex = r'(km|Km|KM)'
num_km_regex = r'^[(]?(25|41)[ ]'
cm_regex = r'(cm|CM|Cm|-cm)'
num_cm_regex = r'^[(]?(9|10|14|16|25|34|41)[ ]'
mm_regex = r'(mm|mm2|MM|Mm|mm.)'
num_mm_regex = r'^[(]?(14|15|41)[ ]'

in_regex = r'(in|In|IN)'
num_in_regex = r'^[(]?(3|8|20|36|42)[ ]'
of_regex = r'(of|Of|OF)'
num_of_regex = r'^[(]?(34|45)[ ]'
at_regex = r'(at|At|AT)'
num_at_regex = r'^[(]?(9|14|22|23|25|28|33|34|41)[ ]'
or_regex = r'(or|Or|OR)'
num_or_regex = r'^[(]?(5|8|12|16|23|26|31|38|41|44)[ ]'

import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')
pd.set_option('mode.chained_assignment', None)

## Test Case for Regular Expressions and Trinomial Search Algorithm
The next cells contain tests for the regular expressions used to search for Trinomials in the trigram data taken from Constellate's database. The test DataFrame (test_df) contains sample data based on what may be found in the trigram data, and the six cells below the test data show the step-by-step refinement of test_df using regular expressions.

**tri_or_munsell_re:** This regular expression is used as a filter to go through the compiled trigrams from the Constellate dataset and pull those which meet the formatting standards for Smithsonian Trinomials.
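As a minimal sketch of this first pass, the block below applies a pattern to a handful of sample trigrams with `Series.str.match`, which only matches at the start of each string. The pattern `trinomial_sketch` is a simplified variant written for illustration (non-capturing group, state numbers 1 to 50); it is an assumption and not necessarily identical to the pattern defined in the first cell.

```python
import pandas as pd

# Illustrative variant of tri_or_munsell_re (an assumption, not the notebook's exact pattern):
# optional "(", state number 1-50, two-letter county code, site number.
trinomial_sketch = r'[(]?(?:0[1-9]|[1-4][0-9]|50|[1-9])[ ][A-Z][a-zA-Z][ ][1-9]\d{0,6}'

trigrams = pd.Series(
    ["(13 AA 1242)", "13 Aa 1242", "99 AA 1242", "5 YR 7/4", "site 13 AA"]
)

# str.match anchors at the start of the string, so "site 13 AA" is rejected
is_candidate = trigrams.str.match(trinomial_sketch)
print(dict(zip(trigrams, is_candidate)))
```

Note that the Munsell value "5 YR 7/4" passes this step, which is why a second filter is needed.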

**munsell_re:** This regular expression is used as a filter to remove any remaining Munsell numbers within the dataframe built from Constellate. 
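The Munsell filter can be sketched on a few candidate rows as follows. The pattern `munsell_sketch` copies the alternatives of `munsell_re` but uses non-capturing groups, and the sample rows are made up for illustration.

```python
import pandas as pd

# Same alternatives as munsell_re, written with non-capturing groups
munsell_sketch = r'(?:0|10|5)[ ](?:yr|Yr|YR|gy|Gy|GY|bg|Bg|BG|pb|Pb|PB|rp|Rp|RP)'

candidates = pd.DataFrame(
    {"text": ["13 AA 1242", "5 YR 7/4", "(10 YR 4/6)", "12 FA 12"]}
)

# Flag rows that look like Munsell hues, then keep everything else
is_munsell = candidates["text"].str.contains(munsell_sketch)
kept = candidates[~is_munsell]
print(kept["text"].tolist())
```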

**to_regex and num_to_regex:** Used together these two regular expressions work to filter any errors in the dataset due to written expressions like "10-to-20 inches" that match possible formatting for Smithsonian Trinomials. 

**by_regex and num_by_regex:** Used together these two regular expressions work to filter any errors in the dataset due to written expressions like "5-by-20 inches" that match possible formatting for Smithsonian Trinomials. 

**km_regex and num_km_regex, cm_regex and num_cm_regex, mm_regex and num_mm_regex:** These six regular expressions filter errors in the dataset due to common writing conventions for measurements.

**in_regex and num_in_regex, of_regex and num_of_regex, at_regex and num_at_regex, or_regex and num_or_regex:** These eight regular expressions filter errors in the dataset due to other short words ("in", "of", "at", "or") that are common among the trigrams of the articles.
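All of the paired expressions above follow the same logic: a row is dropped when it contains the word or unit, unless its leading number appears in the paired list. A minimal sketch of that logic is shown below. The helper `drop_word_errors` and the `pairs` table are hypothetical names introduced for illustration; the number lists are copied from the first cell, and each is followed by a space so that a short number cannot match the start of a longer one.

```python
import pandas as pd

def drop_word_errors(frame, word_pattern, keep_number_pattern):
    """Drop rows containing the word unless the leading number is listed."""
    has_word = frame["text"].str.contains(word_pattern)
    number_listed = frame["text"].str.contains(keep_number_pattern)
    return frame[~has_word | number_listed]

# Hypothetical table of (word pattern, numbers to keep)
pairs = [
    (r'(?:to|To|TO)', r'^[(]?(?:9|14|15|21|29|32|42)[ ]'),
    (r'(?:km|Km|KM)', r'^[(]?(?:25|41)[ ]'),
]

candidates = pd.DataFrame(
    {"text": ["18 To 25", "42 To 10", "16 KM 2", "13 AA 1242"]}
)

# Apply each pair in turn, as the notebook's remover functions do
for word_pattern, keep_number_pattern in pairs:
    candidates = drop_word_errors(candidates, word_pattern, keep_number_pattern)

print(candidates["text"].tolist())
```

Here "18 To 25" and "16 KM 2" are dropped because 18 and 16 are not in the corresponding lists, while "42 To 10" is kept.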

In [None]:
test_data = """text,expected_contains
(13 AA 1242),true
(13 Aa 1242),true
13 AA 1242,true
BB AA 1242,false
99 AA 1242,false
125 AA 1234,false
12 FA 12,true
02 Fa 012,false
02 FA 12,true
(02 fa 12),false

2.5 YR 6/3,false
5 YR 7/4,true
7.5 YR 5/8,false
10 YR 4/6,true
(2.5 YR 6/3),false
(5 YR 7/4),true
(7.5 YR 5/8),false
(10 YR 4/6),true
2.5 Y 7/2,false
10 Y 7/2,false

15 to 20,false
10 to 30,false
9 to 10,false
18 To 25,true
25 to 30,false
42 To 10,true
16 TO 28,true
7 To 30,true
10 TO 12,true
30 to 12,false

15 by 20,false
10 by 30,false
9 by 10,false
18 by 25,false
25 By 30,true
42 BY 10,true
16 By 28,true
7 BY 30,true
10 By 12,true
30 By 12,true

41 km 2,false
13 km 2,false
44 km 2,false
14 km 2,false
4 Km 2,true
34 Km 2,true
16 KM 2,true
27 Km 2,true
5 KM 2,true
36 Km 2,true

41 cm 123,false
20 cm 3,false
4 cm 2,false
14 cm 43,false
4 cm 12,false
16 CM 123,true
13 Cm 234,true
44 CM 23,true
9 Cm 43,true
4 CM 12,true

41 mm 123,false
13 mm 234,false
2 mm 2,false
14 mm 413,false
4 Mm 12,true
41 MM 123,true
13 Mm 234,true
44 MM 23,true
15 Mm 43,true
4 MM 12,true

41 in 123,false
13 in 234,false
2 in 2,false
14 in 413,false
3 In 12,true
8 IN 123,true
20 In 234,true
36 IN 23,true
42 In 43,true
4 IN 12,true

41 of 123,false
13 of 234,false
2 of 2,false
14 of 413,false
34 Of 12,true
45 OF 123,true
34 Of 234,true
45 OF 23,true
45 Of 43,true
4 OF 12,true

41 at 123,false
13 at 234,false
2 at 2,false
14 at 423,false
28 At 12,true
9 AT 123,true
41 At 234,true
34 AT 23,true
14 At 43,true
4 AT 12,true

41 or 123,false
13 or 234,false
2 or 2,false
14 or 433,false
3 Or 12,true
8 OR 123,true
20 Or 234,true
36 OR 23,true
42 Or 43,true
4 OR 12,true
"""

s = StringIO(test_data)
test_df = pd.read_csv(s)
print(test_df)


In [None]:
#The test below shows that the regular expression defined earlier can be used
#to refine the dataframe of mixed data down to only the rows that match
#Smithsonian Trinomial formatting, including some Munsell numbers that match
#this format

def tri_or_munsell_re_test(test_df):
    test_df['tested_contains'] = test_df['text'].str.match(tri_or_munsell_re)
    assert test_df[(test_df['tested_contains']!= test_df['expected_contains'])].empty
    
tri_or_munsell_re_test(test_df)

test_df = test_df.loc[test_df['tested_contains']== True]
print(test_df)

In [None]:
#the function munsell_remover removes the leftover Munsell numbers from
#the dataframe after the first pass with the regular expression
#tri_or_munsell_re; this function uses an additional regular
#expression, munsell_re

def munsell_remover(dataframe_in):
    munsell_filter = dataframe_in['text'].str.contains(munsell_re)
    return dataframe_in[~munsell_filter]

test_data_trinomials = munsell_remover(test_df)
print(test_data_trinomials)


In [None]:
#the function to_error_remover removes possible errors caused by
#written ranges (ex: "5-to-10 feet") through a 2 step process: first
#it flags every row whose text contains "to", then it clears the flag
#for rows whose leading number matches num_to_regex, so those rows are
#kept as possible trinomials with the county abbreviation TO or To

def to_error_remover(dataframe_in):
    dataframe_in['to'] = dataframe_in['text'].str.contains(to_regex)
    dataframe_in.loc[dataframe_in['text'].str.contains(num_to_regex), 'to'] = False
    dataframe_in = dataframe_in.loc[dataframe_in['to'] == False]
    return dataframe_in
    
    
test_data_trinomials_to = to_error_remover(test_data_trinomials)
print(test_data_trinomials_to)


In [None]:
#the function by_error_remover removes possible errors caused by
#written measurements (ex: "5-by-2 inches") through a 2 step process:
#first it flags every row whose text contains "by", then it clears the
#flag for rows whose leading number matches num_by_regex, so those rows
#are kept as possible trinomials with the county abbreviation BY or By

def by_error_remover(dataframe_in):
    dataframe_in['by'] = dataframe_in['text'].str.contains(by_regex)
    dataframe_in.loc[dataframe_in['text'].str.contains(num_by_regex), 'by'] = False
    dataframe_in = dataframe_in.loc[dataframe_in['by'] == False]
    return dataframe_in
    
    
test_data_trinomials_by = by_error_remover(test_data_trinomials_to)
print(test_data_trinomials_by)


In [None]:
#the function measurement_error_remover removes possible errors caused
#by written measurements (ex: "5 km 2") through a 2 step process
#repeated for "km", "cm", and "mm": first it flags every row whose text
#contains the unit, then it clears the flag for rows whose leading
#number matches the paired num_ regex, so those rows are kept

def measurement_error_remover(dataframe_in):
    dataframe_in['km'] = dataframe_in['text'].str.contains(km_regex)
    dataframe_in.loc[dataframe_in['text'].str.contains(num_km_regex), 'km'] = False
    dataframe_in = dataframe_in.loc[dataframe_in['km'] == False]
    
    dataframe_in['cm'] = dataframe_in['text'].str.contains(cm_regex)
    dataframe_in.loc[dataframe_in['text'].str.contains(num_cm_regex), 'cm'] = False
    dataframe_in = dataframe_in.loc[dataframe_in['cm'] == False]
    
    dataframe_in['mm'] = dataframe_in['text'].str.contains(mm_regex)
    dataframe_in.loc[dataframe_in['text'].str.contains(num_mm_regex), 'mm'] = False
    dataframe_in = dataframe_in.loc[dataframe_in['mm'] == False]

    
    return dataframe_in
    
    
test_data_trinomials_measurement = measurement_error_remover(test_data_trinomials_by)
print(test_data_trinomials_measurement)

In [None]:
#the function writing_error_remover removes possible errors caused by
#writing conventions (ex: "5 or 6") through a 2 step process repeated
#for "in", "of", "at", and "or": first it flags every row whose text
#contains the word, then it clears the flag for rows whose leading
#number matches the paired num_ regex, so those rows are kept

def writing_error_remover(dataframe_in):
    dataframe_in['in'] = dataframe_in['text'].str.contains(in_regex)
    dataframe_in.loc[dataframe_in['text'].str.contains(num_in_regex), 'in'] = False
    dataframe_in = dataframe_in.loc[dataframe_in['in'] == False]
    
    dataframe_in['of'] = dataframe_in['text'].str.contains(of_regex)
    dataframe_in.loc[dataframe_in['text'].str.contains(num_of_regex), 'of'] = False
    dataframe_in = dataframe_in.loc[dataframe_in['of'] == False]
    
    dataframe_in['at'] = dataframe_in['text'].str.contains(at_regex)
    dataframe_in.loc[dataframe_in['text'].str.contains(num_at_regex), 'at'] = False
    dataframe_in = dataframe_in.loc[dataframe_in['at'] == False]

    dataframe_in['or'] = dataframe_in['text'].str.contains(or_regex)
    dataframe_in.loc[dataframe_in['text'].str.contains(num_or_regex), 'or'] = False
    dataframe_in = dataframe_in.loc[dataframe_in['or'] == False]
    
    return dataframe_in
    
    
test_data_trinomials_writing = writing_error_remover(test_data_trinomials_measurement)
print(test_data_trinomials_writing)

## Extracting and Refining Possible Smithsonian Trinomials from the Constellate Dataset
**Using this section of the notebook for your own dataset:** To use this notebook for your own dataset, first go to Constellate's website and build your dataset (https://constellate.org/builder/?start=1900&end=2022). Download the full metadata and n-grams, extract the download, and replace the placeholder "FileName.jsonl" with the name of your extracted file. An **important** note about this file: it **must** be in the same folder as this Jupyter Notebook for the notebook to function as intended.

After replacing the file name, run each cell in order. The second cell pulls all the trigrams from the Constellate dataset. Depending on the size of your dataset, this may take a significant amount of time; to limit it, keep datasets under 3,000 items. The last six cells run the methods shown above in the test section to identify trigrams that match known formats for Smithsonian Trinomials.

Each of these six cells exports a csv after its cleaning method has been applied. The names of these files can be modified for your specific dataset outputs.
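As a minimal sketch of what the trigram extraction cell does, the block below expands a per-document dictionary of trigram counts into one row per trigram while keeping the document id. It assumes, as the notebook's code does, that each `trigramCount` entry is a dictionary mapping a trigram to its count; the ids and counts are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the DataFrame read from the .jsonl file (hypothetical values)
docs = pd.DataFrame({
    "id": ["doc-a", "doc-b"],
    "trigramCount": [{"13 AA 1242": 2, "site 13 AA": 1}, {"5 YR 7/4": 3}],
})

frames = []
for doc_id, counts in zip(docs["id"], docs["trigramCount"]):
    # One row per trigram, with its count and the id of the source document
    frame = pd.DataFrame(list(counts.items()), columns=["text", "count"])
    frame["id"] = doc_id
    frames.append(frame)

# Building a list and concatenating once avoids growing a DataFrame row by row
trigrams = pd.concat(frames, ignore_index=True)
print(trigrams)
```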

In [None]:
##the below code opens the file for the Constellate dataset into the 
##program and copies the file to a pandas dataframe to be cleaned and
##refined using the methods above in the test section. The first
##parameter is the file name and should be replaced with the file name
##of your own dataset from Constellate
with open('FileName.jsonl', 'r') as path:
    #create dataframe (df) from data within jsonl file from constellate
    df = pd.read_json(path, lines=True) 

#keeps only the documents that have trigram counts and renumbers the rows
df = df[df['trigramCount'].notna()].reset_index(drop=True)
print (df)
    

In [None]:
frames = []

start_time = time.time() 
#Separates the trigrams into individual rows keeping the url information for
#the article with each one
for trigram_counts, doc_id in zip(df['trigramCount'], df['id']): 
    tempdf = pd.DataFrame.from_dict(trigram_counts, orient='index')
    tempdf.index.name='text' 
    tempdf.reset_index(inplace=True)

    tempdf = tempdf.rename(columns={0: 'count'})
    tempdf = tempdf.assign(id = doc_id)
    
    #collects the new rows formed from the trigram count row in the
    #original dataframe
    frames.append(tempdf)

#combines the rows of every document into the new dataframe (newdf)
if frames:
    newdf = pd.concat(frames, ignore_index=True)
else:
    newdf = pd.DataFrame(columns=['text', 'count', 'id'])
    
end_time = time.time() #time check for entire process
total_time = end_time - start_time
print (total_time)
print (newdf)
    

In [None]:
#These three lines below use the tri_or_munsell regular expression
#to filter the table of trigrams compiled from the journal data
#from constellate
tri_or_munsell_index = newdf['text'].str.match(tri_or_munsell_re)
tri_or_munsell_df = newdf[tri_or_munsell_index].copy()
tri_or_munsell_df.reset_index(inplace=True)

#After this step, any row that does not match Trinomial or Munsell formatting
#has been removed; below are shown the first ten rows of the tri_or_munsell_df
print(tri_or_munsell_df.head(10))

#exports the possible trinomials or munsells to a csv
tri_or_munsell_df[['text', 'count', 'id']].to_csv(
    'Step1PossibleTrinomialsFrom.csv', index = False)

In [None]:
#the below lines of code use the function munsell_remover defined in
#the tests section to remove any remaining munsells from the data set
trinomial_df = munsell_remover(tri_or_munsell_df)
trinomial_df.reset_index(inplace=True)

trinomial_df[['text', 'count', 'id']].to_csv(
                'Step2TrinomialsAfterMunsellRemovedFrom.csv', index = False)

In [None]:
##the below lines of code use the function to_error_remover defined
##in the tests section to remove any errors resulting from the use 
##of "-to-" within typical writing conventions
trinomial_df_refined_to = to_error_remover(trinomial_df)


trinomial_df_refined_to[['text','count','id']].to_csv(
                'Step3TrinomialsWithoutPossibleToErrorsFrom.csv')

In [None]:
##the below lines of code use the function by_error_remover defined
##in the tests section to remove any errors resulting from the use 
##of "-by-" within typical writing conventions
trinomial_df_refined_by = by_error_remover(trinomial_df_refined_to)

trinomial_df_refined_by[['text','count','id']].to_csv(
                'Step4TrinomialsWithoutPossibleByAndToErrorsFrom.csv')

In [None]:
##the below lines of code use the function measurement_error_remover defined
##in the tests section to remove any errors resulting from the use 
##of measurements within typical writing conventions
trinomial_df_refined_measurement = measurement_error_remover(trinomial_df_refined_by)

trinomial_df_refined_measurement[['text','count','id']].to_csv(
                'Step5TrinomialsWithoutMeasurementErrorsFrom.csv')

In [None]:
##the below lines of code use the function writing_error_remover defined
##in the tests section to remove any errors resulting from the use 
##of common words ("in", "of", "at", "or") within typical writing conventions
trinomial_df_refined_writing = writing_error_remover(trinomial_df_refined_measurement)

trinomial_df_refined_writing[['text','count','id']].to_csv(
                'Step6TrinomialsWithoutWritingErrorsFrom.csv')

## End of File

At this point, there should be six new .csv files in the folder that holds this notebook and your dataset, one for each step. The code blocks between "Extracting and Refining Possible Smithsonian Trinomials from the Constellate Dataset" and this "End of File" can be repeated for any other .jsonl files from Constellate that you may have. **You will want to change the file names for each step of output to avoid overwriting any previous files.**
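One possible way to avoid overwriting earlier outputs is to derive each output name from the dataset file name. The sketch below is only a suggestion: the helper `output_names` is hypothetical, and the step labels are copied from the export cells above.

```python
from pathlib import Path

# Step labels taken from the export cells in this notebook
STEP_LABELS = [
    "Step1PossibleTrinomialsFrom",
    "Step2TrinomialsAfterMunsellRemovedFrom",
    "Step3TrinomialsWithoutPossibleToErrorsFrom",
    "Step4TrinomialsWithoutPossibleByAndToErrorsFrom",
    "Step5TrinomialsWithoutMeasurementErrorsFrom",
    "Step6TrinomialsWithoutWritingErrorsFrom",
]

def output_names(dataset_file):
    """Return one csv name per step, ending with the dataset's file stem."""
    stem = Path(dataset_file).stem
    return [f"{label}{stem}.csv" for label in STEP_LABELS]

names = output_names("FileName.jsonl")
print(names[0])
```

Each name can then be passed to the matching `to_csv` call in place of the fixed file name.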