## Trinomial Search For Documents Stored in Constellate's Datasets
This Jupyter Notebook contains programming to search for trinomial instances in datasets built using Constellate's dataset builder. 

**How this notebook functions:** Using a dataset built with Constellate's dataset builder (https://constellate.org/) and json file with state and county abbreviations, this notebook allows the user to search the unigrams from every article within a dataset for Smithsonian Trinomials and export a .csv file with the Smithsonian Trinomial and the JSTOR stable url data. 

**What you need to start using this notebook:** Dataset file built using Constellate (https://constellate.org/), JSON file with State and County Abbreviation information, Jupyter Lab environment and Python


In [1]:
import numpy as np
import pandas as pd
import csv
import time


In [2]:
with open('3ce88441-1eb6-c10e-970d-7d071ad33299.jsonl', 'r') as path:
    df = pd.read_json(path, lines=True) #create dataframe (df) from data within jsonl file from constellate
print (df)
    

    datePublished  docType                                    id  \
0      1991-01-01  article  http://www.jstor.org/stable/20708313   
1      1988-01-01  article  http://www.jstor.org/stable/20708269   
2      1991-01-01  article  http://www.jstor.org/stable/20708320   
3      1992-10-01  article  http://www.jstor.org/stable/20708331   
4      1991-01-01  article  http://www.jstor.org/stable/20708318   
..            ...      ...                                   ...   
577    2012-10-01  article  http://www.jstor.org/stable/24571142   
578    1976-01-01  article  http://www.jstor.org/stable/20707782   
579    2012-04-01  article  http://www.jstor.org/stable/24571266   
580    2011-10-01  article  http://www.jstor.org/stable/24571042   
581    2008-10-01  article  http://www.jstor.org/stable/41220753   

                                            identifier  \
0    [{'name': 'issn', 'value': '01461109'}, {'name...   
1    [{'name': 'issn', 'value': '01461109'}, {'name...   
2    [{'n

In [3]:
df.columns

Index(['datePublished', 'docType', 'id', 'identifier', 'isPartOf',
       'issueNumber', 'language', 'outputFormat', 'pageCount', 'provider',
       'publicationYear', 'publisher', 'sourceCategory', 'title', 'url',
       'volumeNumber', 'wordCount', 'unigramCount', 'bigramCount',
       'trigramCount', 'abstract', 'creator', 'pageEnd', 'pageStart',
       'pagination', 'tdmCategory'],
      dtype='object')

In [4]:
df.id

0      http://www.jstor.org/stable/20708313
1      http://www.jstor.org/stable/20708269
2      http://www.jstor.org/stable/20708320
3      http://www.jstor.org/stable/20708331
4      http://www.jstor.org/stable/20708318
                       ...                 
577    http://www.jstor.org/stable/24571142
578    http://www.jstor.org/stable/20707782
579    http://www.jstor.org/stable/24571266
580    http://www.jstor.org/stable/24571042
581    http://www.jstor.org/stable/41220753
Name: id, Length: 582, dtype: object

In [5]:
newdf = pd.DataFrame()

start_time = time.time() 
for x in range(0,(len(df.unigramCount)-1)): #Separates the unigrams into individual rows keeping the url information for the article with each one
    tempdf = pd.DataFrame.from_dict(df.at[x,'unigramCount'], orient='index')
    tempdf.index.name='text' 
    tempdf.reset_index(inplace=True)

    tempdf = tempdf.rename(columns={0: 'count'})
    tempdf = tempdf.assign(id = df.at[x,'id'])
    
    newdf = newdf.append(tempdf) #appends new dataframe (newdf) with the new rows formed from the unigram count row in original dataframe

end_time = time.time() #time check for entire process
total_time = end_time - start_time
print (total_time)
print (newdf)
    

30.610790967941284
               text  count                                    id
0           HURLEY,      1  http://www.jstor.org/stable/20708313
1               Map      1  http://www.jstor.org/stable/20708313
2              text      1  http://www.jstor.org/stable/20708313
3     illustrations      1  http://www.jstor.org/stable/20708313
4          Madison,      1  http://www.jstor.org/stable/20708313
...             ...    ...                                   ...
3166   Recognition,      2  http://www.jstor.org/stable/24571042
3167           Veit      2  http://www.jstor.org/stable/24571042
3168        peoples      6  http://www.jstor.org/stable/24571042
3169    petitioning      2  http://www.jstor.org/stable/24571042
3170         assump      2  http://www.jstor.org/stable/24571042

[1124713 rows x 3 columns]


In [6]:
print (len(df.unigramCount))

582


In [7]:
#newdf.to_csv('DataDump.csv') 

In [25]:
contains_12VG = newdf.loc[newdf['text'].str.contains("12VG",case = False)] #test case for normal smithsonian trinomial formatting
contains_12VG

Unnamed: 0,text,count,id,contains_number
252,12Vgl,1,http://www.jstor.org/stable/20708306,True
259,(12Vgl),1,http://www.jstor.org/stable/20707949,True


In [10]:
contains_12VG_alt1 = newdf.loc[newdf['text'].str.contains("12-VG", case = False)] #test case for smithsonian trinomial formatting option 2
contains_12VG_alt1

Unnamed: 0,text,count,id


In [11]:
contains_12VG_alt2 = newdf.loc[newdf['text'].str.contains("12/VG", case = False)] #test case for smithsonian trinomial formatting option 3
contains_12VG_alt2

Unnamed: 0,text,count,id


In [28]:
def contains_num (s):   #functtion to determine if a string contains a digit in any location
    return any(i.isdigit() for i in s)

In [29]:
newdf['contains_number'] = np.where(newdf.text , True, False) #currently non functioning method for identifying whether the string in the 'text' column contains a number
print (newdf)

               text  count                                    id  \
0           HURLEY,      1  http://www.jstor.org/stable/20708313   
1               Map      1  http://www.jstor.org/stable/20708313   
2              text      1  http://www.jstor.org/stable/20708313   
3     illustrations      1  http://www.jstor.org/stable/20708313   
4          Madison,      1  http://www.jstor.org/stable/20708313   
...             ...    ...                                   ...   
3166   Recognition,      2  http://www.jstor.org/stable/24571042   
3167           Veit      2  http://www.jstor.org/stable/24571042   
3168        peoples      6  http://www.jstor.org/stable/24571042   
3169    petitioning      2  http://www.jstor.org/stable/24571042   
3170         assump      2  http://www.jstor.org/stable/24571042   

      contains_number  
0                True  
1                True  
2                True  
3                True  
4                True  
...               ...  
3166           