# Lab 2 - Fun with Data Structures

We will pull in an unfamiliar dataset, explore it, and organize subsets of the data into Python data structures.

We will use the crossref.org API to query academic research publications.
https://www.crossref.org/documentation/retrieve-metadata/rest-api/a-non-technical-introduction-to-our-api/

Use your last name in the query so that your results differ from those of the rest of the class.


In [12]:
import requests

name='jones'  # use your last name for the query
rows='100'    # only request the first 100 results

url='https://api.crossref.org/works?filter=has-license:true,has-full-text:true&query.author='+name+'&rows='+rows

response=requests.get(url)  # use the requests module to get the results
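
String concatenation works for a simple name, but it does not escape characters such as spaces. As a minimal sketch, the same URL could be assembled with the standard library's `urllib.parse.urlencode`. The helper name `build_url` is an assumption for illustration and is not part of the lab.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.crossref.org/works'

def build_url(name, rows=100):
    """Build the Crossref query URL used above (hypothetical helper)."""
    params = {
        'filter': 'has-license:true,has-full-text:true',
        'query.author': name,
        'rows': rows,
    }
    # safe=':,' keeps the filter separators readable instead of percent-encoding them
    return BASE_URL + '?' + urlencode(params, safe=':,')

print(build_url('jones'))
```

`requests.get` also accepts a `params` dictionary and performs the encoding itself, which may be a simpler option inside the notebook.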

Verify that the request was successful. An HTTP status code of 200 means success.

In [13]:
response

<Response [200]>
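
The check can also be made in code rather than by eye. The sketch below uses only the standard library to turn a numeric status code into its standard phrase and a success flag. `describe_status` is a hypothetical helper; in the notebook, the code itself would come from `response.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the status text and whether the code is in the 2xx success range."""
    phrase = HTTPStatus(code).phrase
    return f"{code} {phrase}", 200 <= code < 300

print(describe_status(200))
print(describe_status(404))
```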

Investigate the type of the object returned by the request. It is a requests Response object.

In [14]:
type(response)

requests.models.Response

Because the response body is JSON, we can use the Response object's json() method to convert it to a Python dictionary.

In [15]:
resp = response.json()
type(resp)

dict
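
To see what that conversion does without making a request, the sketch below parses a small JSON string with the standard `json` module. The string is a hand-written stand-in shaped like the top of the Crossref response, not real API output.

```python
import json

# Hand-written sample text; JSON objects become dicts and JSON arrays become lists
sample_text = '{"status": "ok", "message": {"total-results": 212400, "items": []}}'
sample = json.loads(sample_text)

print(type(sample))
print(type(sample['message']))
print(type(sample['message']['items']))
print(type(sample['message']['total-results']))
```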

Look at the contents of the data.  

In [16]:
resp

{'status': 'ok',
 'message-type': 'work-list',
 'message-version': '1.0.0',
 'message': {'facets': {},
  'total-results': 212400,
  'items': [{'indexed': {'date-parts': [[2022, 4, 5]],
     'date-time': '2022-04-05T06:33:52Z',
     'timestamp': 1649140432808},
    'reference-count': 12,
    'publisher': 'Elsevier BV',
    'issue': '31',
    'license': [{'start': {'date-parts': [[1964, 1, 1]],
       'date-time': '1964-01-01T00:00:00Z',
       'timestamp': -189388800000},
      'content-version': 'tdm',
      'delay-in-days': 0,
      'URL': 'https://www.elsevier.com/tdm/userlicense/1.0/'}],
    'content-domain': {'domain': [], 'crossmark-restriction': False},
    'short-container-title': ['Tetrahedron Letters'],
    'published-print': {'date-parts': [[1964, 1]]},
    'DOI': '10.1016/s0040-4039(01)89465-x',
    'type': 'journal-article',
    'created': {'date-parts': [[2002, 7, 25]],
     'date-time': '2002-07-25T11:48:29Z',
     'timestamp': 1027597709000},
    'page': '2117-2123',
   

To better understand the contents, print each of the keys.

In [17]:
for key in resp:
    print(key)

status
message-type
message-version
message


The message key looks like it contains the most data.  Let's explore that more.

In [18]:
for key in resp['message']:
    print(key)

facets
total-results
items
items-per-page
query


The items key contains a lot of information, but what type is it?
It is a list, and each element in the list is a dictionary.

In [19]:
print(type(resp['message']['items']))
print(type(resp['message']['items'][0]))

<class 'list'>
<class 'dict'>


We'll print the keys of the first item dictionary to see what we want to explore next.

In [20]:
keys=[]
for key in resp['message']['items'][0]:
    keys.append(key)
    print(key)

indexed
reference-count
publisher
issue
license
content-domain
short-container-title
published-print
DOI
type
created
page
source
is-referenced-by-count
title
prefix
volume
author
member
reference
container-title
language
link
deposited
score
resource
issued
references-count
journal-issue
alternative-id
URL
ISSN
issn-type
subject
published


Let's look more closely at the first item and print the data type and value of each of its keys.

In [21]:
print('Level: 0 is a list with '+str(len(resp['message']['items']))+' items.')
print('Level: 1 is a dict with '+str(len(keys))+' keys.')

item=resp['message']['items'][0]  # the first item in the list

for index,key in enumerate(item):
    value=item[key]
    print("key "+str(index)+": "+str(key))
    print("type: "+str(type(value)))
    if isinstance(value,list):
        print("values:")
        # separate counter so the outer index is not overwritten
        for subindex,element in enumerate(value):
            print("    value "+str(subindex)+": "+str(element))
    elif isinstance(value,dict):
        print("values:")
        for subindex,subkey in enumerate(value):
            print("    subkey "+str(subindex)+": "+str(subkey))
            print("    value "+str(subindex)+": "+str(value[subkey]))
    else:
        print("value: "+str(value))
    print()


Level: 0 is a list with 100 items.
Level: 1 is a dict with 35 keys.
key 0: indexed
type: <class 'dict'>
values:
    subkey 0: date-parts
    value 0: [[2022, 4, 5]]
    subkey 1: date-time
    value 1: 2022-04-05T06:33:52Z
    subkey 2: timestamp
    value 2: 1649140432808

key 1: reference-count
type: <class 'int'>
value: 12

key 2: publisher
type: <class 'str'>
value: Elsevier BV

key 3: issue
type: <class 'str'>
value: 31

key 4: license
type: <class 'list'>
values:
    value 0: {'start': {'date-parts': [[1964, 1, 1]], 'date-time': '1964-01-01T00:00:00Z', 'timestamp': -189388800000}, 'content-version': 'tdm', 'delay-in-days': 0, 'URL': 'https://www.elsevier.com/tdm/userlicense/1.0/'}

key 5: content-domain
type: <class 'dict'>
values:
    subkey 0: domain
    value 0: []
    subkey 1: crossmark-restriction
    value 1: False

key 6: short-container-title
type: <class 'list'>
values:
    value 0: Tetrahedron Letters

key 7: published-print
type: <class 'dict'>
values:
    subkey 0: date-
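
The loop above looks only one level below each key. Because lists and dictionaries can nest to any depth, a recursive function is one way to visit everything. The sketch below is an illustration, not part of the lab: `walk` is a hypothetical helper, and `sample_item` is a cut-down item that reuses a few values from the output above.

```python
def walk(value, path='item'):
    """Yield (path, type name) for a value and everything nested inside it."""
    yield path, type(value).__name__
    if isinstance(value, dict):
        for key, child in value.items():
            yield from walk(child, f"{path}[{key!r}]")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from walk(child, f"{path}[{index}]")

sample_item = {
    'publisher': 'Elsevier BV',
    'reference-count': 12,
    'indexed': {'date-parts': [[2022, 4, 5]]},
}

results = list(walk(sample_item))
for path, type_name in results:
    print(path, type_name)
```

Each printed path is written the way it would be typed to reach that value, which makes it easy to copy into the next cell.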

Create a list containing the first listed author of each item.

In [22]:
authors=[]
for item in resp['message']['items']:
    authors.append(item['author'][0])  # keep only the first author of each item

authors

[{'given': 'Joan', 'family': 'Jones', 'sequence': 'first', 'affiliation': []},
 {'given': 'W.', 'family': 'Jones', 'sequence': 'first', 'affiliation': []},
 {'given': 'J.M.', 'family': 'Jones', 'sequence': 'first', 'affiliation': []},
 {'given': 'Stephen E.',
  'family': 'Jones',
  'sequence': 'first',
  'affiliation': []},
 {'given': 'BUSH', 'family': 'JONES', 'sequence': 'first', 'affiliation': []},
 {'given': 'R.M.', 'family': 'Jones', 'sequence': 'first', 'affiliation': []},
 {'given': 'J.', 'family': 'Jones', 'sequence': 'first', 'affiliation': []},
 {'given': 'M.S.', 'family': 'Jones', 'sequence': 'first', 'affiliation': []},
 {'given': 'M.', 'family': 'JONES', 'sequence': 'first', 'affiliation': []},
 {'given': 'J.P.', 'family': 'Jones', 'sequence': 'first', 'affiliation': []},
 {'given': 'N.A.', 'family': 'Jones', 'sequence': 'first', 'affiliation': []},
 {'given': 'M.B.', 'family': 'Jones', 'sequence': 'first', 'affiliation': []},
 {'given': 'M.', 'family': 'JONES', 'sequence'

The variable authors is now a list of dictionaries.  We can verify that by checking the types.

In [168]:
print(type(authors))
print(type(authors[0]))

<class 'list'>
<class 'dict'>
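
The `authors` list holds only one author per item. A paper can list several authors, and indexing `['author']` directly would raise a `KeyError` for an item that has no `author` key. The sketch below collects every author and skips items without one; `sample_items` is made-up data in the same shape as the response.

```python
# Made-up items for illustration, shaped like resp['message']['items']
sample_items = [
    {'author': [{'given': 'Joan', 'family': 'Jones', 'sequence': 'first'},
                {'given': 'A.', 'family': 'Smith', 'sequence': 'additional'}]},
    {'title': ['No author listed']},
    {'author': [{'family': 'JONES', 'sequence': 'first'}]},
]

all_authors = []
for item in sample_items:
    # .get returns an empty list when the 'author' key is missing
    all_authors.extend(item.get('author', []))

# Join given and family names, leaving out whichever part is absent
full_names = [' '.join(filter(None, [a.get('given'), a.get('family')]))
              for a in all_authors]
print(full_names)
```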


In session 4 we will learn more about pandas, which helps create and work with DataFrame objects. Here is an example of how we could store all of the authors in a DataFrame.

In [170]:
import pandas as pd
df_author=pd.DataFrame(authors)  # authors is already a list of dicts
df_author

| | given | family | sequence | affiliation | ORCID | authenticated-orcid |
|---|---|---|---|---|---|---|
| 0 | Andrew | McGee | first | [] | | |
| 1 | Andrew | McGee | first | [{'name': 'Department of EconomicsUniversity o... | | |
| 2 | Glenn | McGee | first | [] | | |
| 3 | R. G. | McGee | first | [] | | |
| 4 | Andrew | McGee | first | [] | | |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | Vann | McGee | first | [] | | |
| 96 | Victor | McGee | first | [{'name': 'Dartmouth College'}] | | |
| 97 | J.Michael | McGee | first | [] | | |
| 98 | Sears | McGee | first | [] | | |
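
In the table above, the `affiliation` column holds lists of dictionaries, which are awkward to read or filter. One option, sketched below in plain Python, is to flatten each list into a single string before building the DataFrame. The helper `flatten_affiliation` is an assumption for illustration; the two sample rows reuse values shown above.

```python
def flatten_affiliation(author):
    """Return a copy of an author dict with 'affiliation' joined into one string."""
    names = [aff.get('name', '') for aff in author.get('affiliation', [])]
    flat = dict(author)  # shallow copy so the original dict is left unchanged
    flat['affiliation'] = '; '.join(names)
    return flat

sample_authors = [
    {'given': 'Vann', 'family': 'McGee', 'sequence': 'first', 'affiliation': []},
    {'given': 'Victor', 'family': 'McGee', 'sequence': 'first',
     'affiliation': [{'name': 'Dartmouth College'}]},
]

flat_authors = [flatten_affiliation(a) for a in sample_authors]
print(flat_authors)
```

The resulting list of dictionaries could be passed to `pd.DataFrame` in the same way as `authors`.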
