organisciak/BookwormPython

BookwormPython

A library for connecting to a remote Bookworm instance through Python.

There are two main classes to know about:

  • BWQuery takes the Bookworm server URL and wraps Bookworm's JSON query format (described in the API docs). You can run a query with BWQuery.run().
  • BWResults is an object holding the Bookworm results, with functions that allow display of the results as csv, json, or Pandas DataFrame.

There is also a set_options class, which sets the database and endpoint globally.

BWQuery object

To start:

import bwypy

Initialize from JSON

jsonq = '''{
   "database": "hathipd",
   "method": "return_json", 
   "search_limits": {
       "date_year": {"$gt": 1790, "$lt": 1923 }
   },
   "counttype": ["TextCount"],
   "groups": ["date_year"]
   }'''
bw = bwypy.BWQuery(json=jsonq, endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py')
bw.json
{'counttype': ['TextCount'],
 'database': 'hathipd',
 'groups': ['date_year'],
 'method': 'return_json',
 'search_limits': {'date_year': {'$gt': 1790, '$lt': 1923}}}
bw.groups
['date_year']
bw.search_limits
{'date_year': {'$gt': 1790, '$lt': 1923}}
bw.database
'hathipd'
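
Because the query is plain JSON, it can also be assembled as a Python dict and serialized with the standard json module before being passed to BWQuery. The following is a minimal sketch; build_query is a hypothetical helper for illustration, not part of bwypy, and the field names are the ones used in the example above.

```python
import json


def build_query(database, groups, search_limits=None, counttype=("TextCount",)):
    """Assemble a Bookworm-style query dict and serialize it to a JSON string.

    Hypothetical helper for illustration; not part of bwypy.
    """
    query = {
        "database": database,
        "method": "return_json",
        "search_limits": search_limits or {},
        "counttype": list(counttype),
        "groups": list(groups),
    }
    return json.dumps(query, indent=2)


jsonq = build_query(
    database="hathipd",
    groups=["date_year"],
    search_limits={"date_year": {"$gt": 1790, "$lt": 1923}},
)
print(jsonq)
```

The resulting string can be passed as the json argument in the same way as the hand-written query.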

Run a query

Query results are returned as a BWResults object.

bw.groups = ['page_count_bin', 'is_gov_doc']
bw_results = bw.run()
bw_results.json()
{'L - Between 350 and 550': {'': [563222], 'No': [30973]},
 'M - Between 150 and 350': {'': [549374], 'No': [30020]},
 'S - Less than 150': {'': [466445], 'No': [25737]},
 'XL - Greater than 550': {'': [529501], 'No': [28435]},
 'unknown': {'': [1325704], 'No': [73659]}}
bw_results.dataframe()
                                    TextCount
page_count_bin           is_gov_doc
XL - Greater than 550                  529501
                         No             28435
unknown                               1325704
                         No             73659
L - Between 350 and 550                563222
                         No             30973
M - Between 150 and 350                549374
                         No             30020
S - Less than 150                      466445
                         No             25737
print(bw_results.csv())
page_count_bin,is_gov_doc,TextCount
XL - Greater than 550,,529501
XL - Greater than 550,No,28435
unknown,,1325704
unknown,No,73659
L - Between 350 and 550,,563222
L - Between 350 and 550,No,30973
M - Between 150 and 350,,549374
M - Between 150 and 350,No,30020
S - Less than 150,,466445
S - Less than 150,No,25737
bw_results.tolist()
[{'TextCount': 529501,
  'is_gov_doc': '',
  'page_count_bin': 'XL - Greater than 550'},
 {'TextCount': 28435,
  'is_gov_doc': 'No',
  'page_count_bin': 'XL - Greater than 550'},
 {'TextCount': 1325704, 'is_gov_doc': '', 'page_count_bin': 'unknown'},
 {'TextCount': 73659, 'is_gov_doc': 'No', 'page_count_bin': 'unknown'},
 {'TextCount': 563222,
  'is_gov_doc': '',
  'page_count_bin': 'L - Between 350 and 550'},
 {'TextCount': 30973,
  'is_gov_doc': 'No',
  'page_count_bin': 'L - Between 350 and 550'},
 {'TextCount': 549374,
  'is_gov_doc': '',
  'page_count_bin': 'M - Between 150 and 350'},
 {'TextCount': 30020,
  'is_gov_doc': 'No',
  'page_count_bin': 'M - Between 150 and 350'},
 {'TextCount': 466445,
  'is_gov_doc': '',
  'page_count_bin': 'S - Less than 150'},
 {'TextCount': 25737,
  'is_gov_doc': 'No',
  'page_count_bin': 'S - Less than 150'}]
bw_results.tuples()
[('XL - Greater than 550', '', 529501),
 ('XL - Greater than 550', 'No', 28435),
 ('unknown', '', 1325704),
 ('unknown', 'No', 73659),
 ('L - Between 350 and 550', '', 563222),
 ('L - Between 350 and 550', 'No', 30973),
 ('M - Between 150 and 350', '', 549374),
 ('M - Between 150 and 350', 'No', 30020),
 ('S - Less than 150', '', 466445),
 ('S - Less than 150', 'No', 25737)]
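
The nested JSON and the flat tuples carry the same information: each level of nesting corresponds to one entry in groups, and each leaf is a list with one count per counttype. The sketch below shows that relationship with a small recursive function. It is an illustration only, not bwypy's own implementation, and flatten is a hypothetical name.

```python
def flatten(nested, prefix=()):
    """Yield one tuple per leaf: the group keys along the path, then the counts."""
    for key, value in nested.items():
        if isinstance(value, dict):
            # One more grouping level: recurse with the key added to the path.
            yield from flatten(value, prefix + (key,))
        else:
            # Leaves are lists of counts, one entry per counttype.
            yield prefix + (key,) + tuple(value)


sample = {
    "S - Less than 150": {"": [466445], "No": [25737]},
    "unknown": {"": [1325704], "No": [73659]},
}
rows = list(flatten(sample))
print(rows)
```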

Initialize a blank BWQuery

Rather than passing in an already constructed JSON query, you can use BWQuery to build one from scratch.

An endpoint and database are required, at minimum.

newq = bwypy.BWQuery()
---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

NameError: No endpoint. Provide to BWQuery on initialization or set globally.
newq = bwypy.BWQuery(database='hathipd', endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py')
newq.json
{'compare_limits': [],
 'counttype': ['TextCount', 'WordCount'],
 'database': 'hathipd',
 'groups': [],
 'method': 'return_json',
 'search_limits': {},
 'words_collation': 'Case_Sensitive'}
newq.run().dataframe()
   TextCount     WordCount
0    4552862  7.328341e+11
newq.groups
[]
newq.groups = ['foo']
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)    

KeyError: 'The following groups are not supported in this BW: foo'
newq.groups = ['publication_country']
newq.run().dataframe()
                                    TextCount     WordCount
publication_country
No place, unknown, or undetermined       8144  1.137803e+09
United Kingdom Misc. Islands                2  4.984400e+04
Australia                                 799  2.968196e+08
United States                         1962339  3.212052e+11
Wales                                      41  1.241756e+07
England                                 10656  1.831402e+09
unknown                               1937740  2.954869e+11
Latvia                                     64  1.987412e+07
Northern Ireland                           10  2.987728e+06
Scotland                                  863  1.483882e+08
Soviet Socialist Republic                 152  1.947023e+07
United Kingdom                         536001  9.987234e+10
Canada                                  78663  9.747256e+09
Russian S.F.S.R.                         3427  8.390290e+08
South Australia                            29  4.146067e+06
Victoria                                   50  1.064348e+07
Estonia                                    25  4.242192e+06
New South Wales                             5  5.819920e+05
Georgian S.S.R.                             0  0.000000e+00
Ukraine                                    59  7.314677e+06
Soviet Union                            13784  2.186202e+09
Tasmania                                    1  8.426800e+04
Lithuania                                   8  9.289520e+05
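
The KeyError raised for the 'foo' group comes from checking the requested groups against the fields the database knows about. A minimal sketch of that kind of check follows, assuming the known fields are available as a set. check_groups is a hypothetical function, not bwypy's API, and the field names are taken from the examples in this README.

```python
def check_groups(groups, known_fields):
    """Return the groups unchanged, or raise KeyError naming the unknown ones."""
    unsupported = [g for g in groups if g not in known_fields]
    if unsupported:
        raise KeyError(
            "The following groups are not supported in this BW: "
            + ", ".join(unsupported)
        )
    return list(groups)


known = {"date_year", "publication_country", "is_gov_doc"}
print(check_groups(["publication_country"], known))

try:
    check_groups(["foo"], known)
except KeyError as err:
    print("Rejected:", err)
```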

Global settings

Since you are unlikely to switch databases or endpoints often, these settings can be set globally with set_options:

bwypy.set_options(endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py',
                        database='global')

bwypy.BWQuery(verify_fields=False).database
'global'

Or in a with block:

with bwypy.set_options(endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py', database='with_block'):
    bw = bwypy.BWQuery(verify_fields=False)
bw.database
'with_block'
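
An object that works both as a plain call and in a with block can be built by applying the options in the constructor and restoring the previous values on exit. The sketch below illustrates that pattern with a module-level dict. It is not bwypy's actual implementation, and OPTIONS is a hypothetical name.

```python
# Hypothetical module-level settings store, for illustration only.
OPTIONS = {"endpoint": None, "database": None}


class set_options:
    """Apply options immediately; restore the previous values on `with` exit."""

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(OPTIONS)
        if unknown:
            raise KeyError("Unknown options: " + ", ".join(sorted(unknown)))
        # Remember the old values so a with block can undo the change.
        self._old = {key: OPTIONS[key] for key in kwargs}
        OPTIONS.update(kwargs)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        OPTIONS.update(self._old)
        return False  # do not swallow exceptions


set_options(database="global")            # plain call: the change persists
with set_options(database="with_block"):  # with block: the change is temporary
    inside = OPTIONS["database"]
after = OPTIONS["database"]
print(inside, after)
```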

The priority for variables is:

  • set with an init argument
  • set within the query json (for database)
  • set within a with block with set_options
  • set globally with set_options
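
Assuming the list above is ordered from highest to lowest priority, the lookup amounts to taking the first value that is set. A small sketch of that rule follows; resolve_setting and its parameter names are hypothetical and only mirror the four levels listed.

```python
def resolve_setting(init_arg=None, json_value=None, with_block=None, global_value=None):
    """Return the first value that is set, checked in priority order."""
    for candidate in (init_arg, json_value, with_block, global_value):
        if candidate is not None:
            return candidate
    raise NameError("No value provided at any level.")


# The query JSON wins over a global setting...
print(resolve_setting(json_value="hathipd", global_value="global"))
# ...but an explicit init argument wins over the query JSON.
print(resolve_setting(init_arg="explicit", json_value="hathipd"))
```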

More BWQuery functions

fields() parses the server's getAvailableFields response. It is used internally on initialization when field verification is on (the verify_fields argument used elsewhere in this README):

bw = bwypy.BWQuery(json=jsonq, endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py')
bw.fields()
    anchor  dbname               description  name                 tablename                  type
0   bookid  lc_classes                        lc_classes           lc_classesLookup           character
1   bookid  lc_subclasses                     lc_subclasses        lc_subclassesLookup        character
2   bookid  fiction_nonfiction                fiction_nonfiction   fiction_nonfictionLookup   character
3   bookid  genres                            genres               genresLookup               character
4   bookid  languages                         languages            languagesLookup            character
5   bookid  format                            format               formatLookup               character
6   bookid  is_gov_doc                        is_gov_doc           is_gov_docLookup           character
7   bookid  page_count_bin                    page_count_bin       page_count_binLookup       character
8   bookid  word_count_bin                    word_count_bin       word_count_binLookup       character
9   bookid  publication_country               publication_country  publication_countryLookup  character
10  bookid  publication_state                 publication_state    publication_stateLookup    character
11  bookid  publication_place                 publication_place    publication_placeLookup    character
12  bookid  date_year                         date_year            fastcat                    integer

field_values returns all possible values for a field:

bw.field_values(field='lc_classes')
['unknown',
 'Language and Literature',
 'General and Old World History',
 'Social Sciences',
 'Science',
 'Philosophy, Psychology, and Religion',
 'Law',
 'Technology',
 'General Works',
 'History of the United States and British, Dutch, French, and Latin America',
 'Political Science',
 'Agriculture',
 'History of America',
 'Education',
 'Bibliography, Library Science, and General Information Resources',
 'Medicine',
 'Fine Arts',
 'Geography, Anthropology, and Recreation',
 'Music',
 'Auxiliary Sciences of History',
 'Military Science',
 'Naval Science']
bw.field_values(field='is_gov_doc')
['', 'No']

Testing validation

If BWQuery was initialized without turning off verify_fields, or if the fields method was run at any point, it will check queries against the known fields for that database.

In most cases, setting an unsupported field raises an error immediately:

bw = bwypy.BWQuery(json=jsonq, endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py')
bw.search_limits = { 'fake_field': 'whatever_value'}
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

KeyError: 'The following search_limit fields are not supported in this BW: fake_field'

Some ways of setting values bypass validation, such as assigning to a key of search_limits in place. In those cases, the next time validation runs and fails, the query is reverted to an earlier version.

bw.search_limits['date_year_wrong'] = 1
print("Uh oh, we got a bad field set! -- ", bw.search_limits)
try:
    bw._validate()
except Exception:
    print("But it reverted after a failure! -- " , bw.search_limits)
Uh oh, we got a bad field set! --  {'date_year': {'$lt': 1923, '$gt': 1790}, 'date_year_wrong': 1}
But it reverted after a failure! --  {'date_year': {'$lt': 1923, '$gt': 1790}}
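
The revert behaviour shown above can be pictured as a snapshot-and-rollback pattern: keep a copy of the last state that passed validation, and restore it when a later check fails. The class below is a hypothetical stand-in written for illustration, not bwypy's implementation.

```python
import copy


class CheckedLimits:
    """Hypothetical stand-in for a query object; not bwypy's implementation."""

    def __init__(self, known_fields, search_limits):
        self.known_fields = set(known_fields)
        self.search_limits = search_limits
        self._last_good = copy.deepcopy(search_limits)

    def validate(self):
        bad = [f for f in self.search_limits if f not in self.known_fields]
        if bad:
            # Roll back to the last state that passed validation, then report.
            self.search_limits = copy.deepcopy(self._last_good)
            raise KeyError("Unsupported search_limit fields: " + ", ".join(bad))
        # The current state is valid: remember it for future rollbacks.
        self._last_good = copy.deepcopy(self.search_limits)


q = CheckedLimits({"date_year"}, {"date_year": {"$gt": 1790, "$lt": 1923}})
q.search_limits["date_year_wrong"] = 1  # in-place mutation, no check runs here
try:
    q.validate()
except KeyError:
    pass
print(q.search_limits)
```

The deep copies matter here: a shallow reference to the same dict would be mutated along with the live query, leaving nothing to roll back to.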

Turning off validation

Checking allowable fields means an extra call to the database. If you know the schema already, just turn off verify_fields.

%%time
bwypy.BWQuery(json=jsonq, endpoint='https://bookworm.htrc.illinois.edu/cgi-bin/dbbindings.py', verify_fields=False)
Wall time: 0 ns

<bwypy.core.BWQuery at 0x1fb45630358>
