## Trinomial Search For Documents Stored in Constellate's Datasets
This Jupyter notebook searches for Smithsonian trinomials in datasets built with Constellate's dataset builder.

**How this notebook functions:** Using a dataset built with Constellate's dataset builder (https://constellate.org/) and a JSON file of state and county abbreviations, this notebook lets the user search the unigrams of every article in the dataset for Smithsonian trinomials and export a .csv file containing each trinomial and the JSTOR stable URL of the article it was found in.

**What you need to start using this notebook:** a dataset file built using Constellate (https://constellate.org/), a JSON file with state and county abbreviation information, a JupyterLab environment, and Python.
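
As a rough illustration of what the search looks for: a Smithsonian trinomial is generally written as a state number, a county abbreviation, and a site number, optionally separated by `-` or `/` (the `12VG`, `12-VG`, and `12/VG` prefixes are the test cases used later in this notebook). The sketch below is not part of the original notebook. The pattern, the `parse_trinomial` name, and the assumption that county codes are two or three letters long are illustrative only.

```python
import re

# Assumed layout: 1-2 digit state number, 2-3 letter county code, site number,
# optionally separated by "-" or "/". The lookbehind stops a match from
# starting in the middle of a longer word or number.
TRINOMIAL_RE = re.compile(
    r"(?<![A-Za-z0-9])(\d{1,2})[-/]?([A-Za-z]{2,3})[-/]?(\d+)"
)

def parse_trinomial(token):
    """Return (state, county, site) if the token contains a trinomial-like string, else None."""
    match = TRINOMIAL_RE.search(token)
    if match is None:
        return None
    state, county, site = match.groups()
    return state, county.upper(), site

print(parse_trinomial("12VG123"))       # ('12', 'VG', '123')
print(parse_trinomial("(12-Vg-123),"))  # ('12', 'VG', '123')
print(parse_trinomial("archaeology"))   # None
```

A pattern this loose will also match ordinary tokens such as `3rd5`, which is why checking the state number and county code against the abbreviation JSON file is still needed.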


In [None]:
import numpy as np
import pandas as pd
import csv
import time


In [None]:
with open('3ce88441-1eb6-c10e-970d-7d071ad33299.jsonl', 'r') as f:
    df = pd.read_json(f, lines=True) #create dataframe (df) from the jsonl file exported from Constellate
print (df)
    

In [None]:
df.columns

In [None]:
df.id

In [None]:
frames = [] #one long-format dataframe per article

start_time = time.time() 
for x in range(len(df)): #separates the unigrams into individual rows, keeping the url information for the article with each one; range(len(df)) includes the last article
    tempdf = pd.DataFrame.from_dict(df.at[x,'unigramCount'], orient='index')
    tempdf.index.name='text' 
    tempdf.reset_index(inplace=True)

    tempdf = tempdf.rename(columns={0: 'count'})
    tempdf = tempdf.assign(id = df.at[x,'id'])
    
    frames.append(tempdf) #collects the rows formed from this article's unigram counts

#DataFrame.append was removed in pandas 2.0; concatenating once is also faster than appending inside the loop
newdf = pd.concat(frames, ignore_index=True)

end_time = time.time() #time check for entire process
total_time = end_time - start_time
print (total_time)
print (newdf)
    

In [None]:
print (len(df.unigramCount))

In [None]:
#newdf.to_csv('DataDump.csv') 

In [None]:
contains_12VG = newdf.loc[newdf['text'].str.contains("12VG",case = False)] #test case for the standard Smithsonian trinomial format
contains_12VG

In [None]:
contains_12VG_alt1 = newdf.loc[newdf['text'].str.contains("12-VG", case = False)] #test case for Smithsonian trinomial format option 2 (hyphen-separated)
contains_12VG_alt1

In [None]:
contains_12VG_alt2 = newdf.loc[newdf['text'].str.contains("12/VG", case = False)] #test case for Smithsonian trinomial format option 3 (slash-separated)
contains_12VG_alt2
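
The three test cells above differ only in the separator between the state number and the county code, so they can be folded into a single pattern. A minimal sketch, not part of the original notebook: the `PREFIX_RE` name and the sample tokens are made up for illustration.

```python
import re

# "[-/]?" matches an optional hyphen or slash, so one pattern covers
# "12VG", "12-VG", and "12/VG"; IGNORECASE mirrors case=False above.
PREFIX_RE = re.compile(r"12[-/]?VG", re.IGNORECASE)

sample_tokens = ["12VG45", "12-vg-45", "12/VG/45", "12VA45", "village"]  # made-up examples
hits = [token for token in sample_tokens if PREFIX_RE.search(token)]
print(hits)  # ['12VG45', '12-vg-45', '12/VG/45']
```

Because `str.contains` treats its pattern as a regular expression by default, the same pattern string should also work directly on the dataframe, as in `newdf['text'].str.contains(r"12[-/]?VG", case=False)`.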

In [None]:
def contains_num (s):   #function to determine if a string contains a digit in any location
    return any(i.isdigit() for i in s)

In [None]:
newdf['contains_number'] = newdf['text'].map(contains_num) #applies contains_num to each string in the 'text' column; True where the string contains at least one digit
print (newdf)
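
The notebook stops before the search and export steps described at the top. One way those steps might look is sketched below, using only the standard library so it can be read on its own. Everything here is an assumption rather than the author's method: the layout of the abbreviation data (a dict mapping a state number to a list of county codes), the `find_trinomials` and `write_matches` names, the sample rows and URLs, and the regular expression. In the notebook itself the `(text, id)` pairs could come from `zip(newdf['text'], newdf['id'])`.

```python
import csv
import io
import re

# Assumed pattern: state number, county code, site number, optional "-" or "/" separators.
TRINOMIAL_RE = re.compile(r"(?<![A-Za-z0-9])(\d{1,2})[-/]?([A-Za-z]{2,3})[-/]?(\d+)")

def find_trinomials(rows, counties_by_state):
    """Yield (trinomial, article_id) for tokens whose state and county codes are known."""
    for text, article_id in rows:
        match = TRINOMIAL_RE.search(text)
        if match is None:
            continue
        state, county, site = match.groups()
        # Keep the match only if the county code is listed for that state number.
        if county.upper() in counties_by_state.get(str(int(state)), ()):
            yield f"{int(state)}{county.upper()}{site}", article_id

def write_matches(matches, handle):
    """Write the matches as CSV with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["trinomial", "stable_url"])
    writer.writerows(matches)

# Hypothetical inputs for illustration only.
counties_by_state = {"12": ["VG", "MA"]}
rows = [
    ("12VG123", "http://www.jstor.org/stable/example-1"),
    ("12-vg-7", "http://www.jstor.org/stable/example-2"),
    ("3rd5", "http://www.jstor.org/stable/example-3"),
]

buffer = io.StringIO()  # stands in for open('trinomials.csv', 'w', newline='')
write_matches(find_trinomials(rows, counties_by_state), buffer)
print(buffer.getvalue())
```

The sketch normalizes every hit to the compact form (`12-vg-7` becomes `12VG7`) so that the three formatting variants collapse into one value in the exported file. The token `3rd5` is dropped because no county code `RD` is listed for state number 3 in the sample data.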