# Parsing Citations 

With [AnyStyle.io](http://anystyle.io) and Crossref's [search API](http://labs.crossref.org/resolving-citations-we-dont-need-no-stinkin-parser/)

In [81]:
import pandas as pd

In [136]:
citations = pd.read_csv("cites.csv")
citations

Unnamed: 0,syllabus,note,cite
0,berkeley-info202,,"Glushko, Robert J. (Editor). The Discipline of..."
1,berkeley-info202,,"Kent, William. Data and Reality (3rd Edition) ..."
2,berkeley-info202,,"Bush, Vannevar (1945). As We May Think. The At..."
3,berkeley-info202,,"Borges, Jorge Luis. “The library of Babel (Lin..."
4,berkeley-info202,,"Hearst, Marti. Search User Interfaces, 2009."
5,berkeley-info202,,"Morville, Peter and Rosenfeld, Louis. Informat..."
6,berkeley-info202,,"NPR. For a More Ordered Life, Organize Like a ..."
7,berkeley-info202,,"Gardiner, Bryan. “How an Army of Sensors Helps..."
8,berkeley-info202,,"Smith, Abby. “Authenticity in perspective.” Au..."
9,berkeley-info202,,"Doctorow, Cory. Metacrap http://www.well.com/~..."


In [86]:
citations.iloc[0]

syllabus                                     berkeley-info202
note                                                      NaN
cite        Glushko, Robert J. (Editor). The Discipline of...
Name: 0, dtype: object

In [87]:
len(citations)

1822

OK, we have a pile of citations, but they aren't in shape. When we look at the cites, they are a collection of 1,822 strings, which isn't very useful for data analysis. We need to *get them into shape*; that is, we need to break them up into their component parts.

In [91]:
citations.iloc[0:5]

Unnamed: 0,syllabus,note,cite
0,berkeley-info202,,"Glushko, Robert J. (Editor). The Discipline of..."
1,berkeley-info202,,"Kent, William. Data and Reality (3rd Edition) ..."
2,berkeley-info202,,"Bush, Vannevar (1945). As We May Think. The At..."
3,berkeley-info202,,"Borges, Jorge Luis. “The library of Babel (Lin..."
4,berkeley-info202,,"Hearst, Marti. Search User Interfaces, 2009."


Note that these citations are not always formatted the same way. For example, let's look at some from a different part of the pile.

In [90]:
citations.iloc[390:395]

Unnamed: 0,syllabus,note,cite
390,indiana-z501,article,"Dempsey, Lorcan, Malpas, Constance, and Lavoie..."
391,indiana-z501,article,"Levine‐Clark, Michael. “Access to Everything: ..."
392,indiana-z501,article,"Downey, Kay, Zhang, Yin, Urbano, Cristobal, an..."
393,indiana-z501,article,"Cassell, K. A., & Hiremath, U. (2013). Introdu..."
394,indiana-z501,article,"Janes, J. (2003). Reference, digital and other..."


Parsing citations is a whole area of research, to be discussed another time. I am going to use [AnyStyle.io](http://anystyle.io) to try to parse these citations because it has a nicely designed API.

In [131]:
import requests
import os
import json
import numpy as np
import time

In [56]:
# get the API key for AnyStyle.io from a text file in this directory
with open('anystyle_key.txt','r') as f:
    api_key = f.read().strip()  # drop any trailing newline from the file

In [135]:
# I want to figure out which cite is causing the error,
# so send the cites one at a time and print any that come back with a 400

for cite in citations['cite']:
    payload = {"format": "json",
               "access_token": api_key,
               "references": cite}
    headers = {"Content-Type": "application/json;charset=UTF-8"}

    r = requests.post("http://anystyle.io/parse/references",
                      headers=headers,
                      data=json.dumps(payload))
    if r.status_code == 400:
        print(cite)


nan
nan
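
The two `nan` lines printed above suggest that rows with a missing `cite` value are what trigger the 400 responses. Here is a minimal sketch, assuming that is the cause, of filtering those rows out with `dropna` before sending anything to the parser. The `sample` frame is a made-up stand-in for the real `citations` table.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real `citations` frame
sample = pd.DataFrame({
    "syllabus": ["berkeley-info202", "berkeley-info202", "indiana-z501"],
    "note": [np.nan, np.nan, "article"],
    "cite": ["Hearst, Marti. Search User Interfaces, 2009.",
             np.nan,
             "Placeholder, A. (2003). An illustrative citation."],
})

# Keep only the rows that actually have a citation string
clean = sample.dropna(subset=["cite"]).reset_index(drop=True)
print(len(sample), "rows before,", len(clean), "rows after")
```

Running the same call on `citations` would leave only rows whose `cite` is an actual string.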


In [137]:
parsed_cites = []

for segment in np.array_split(citations,5):
    print("Segment Length: ",len(segment))
    cite_pile = list(segment['cite'])

    
    payload = {"format": "json",
               "access_token": api_key,
               "references": cite_pile}
    headers = {"Content-Type": "application/json;charset=UTF-8"}
    print("Payload Build, requesting")
    
    r = requests.post("http://anystyle.io/parse/references",
                  headers=headers,
                  data=json.dumps(payload))
    print("Got response", r)
    parsed_cites.append(r.json())


Segment Length:  364
Payload Build, requesting
Got response <Response [200]>
Segment Length:  364
Payload Build, requesting
Got response <Response [200]>
Segment Length:  364
Payload Build, requesting
Got response <Response [200]>
Segment Length:  364
Payload Build, requesting
Got response <Response [200]>
Segment Length:  363
Payload Build, requesting
Got response <Response [200]>
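
All five requests came back with a 200 here, but the loop would happily call `r.json()` on an error response too. Below is a rough sketch of a more defensive version. The URL, headers, and payload are copied from the cell above; the function name `parse_in_batches`, the `post` parameter, and the `pause` between requests are my own assumptions, not part of the AnyStyle API. The fake response class only exists so the sketch runs without a network connection or an API key.

```python
import json
import time

def parse_in_batches(batches, post, api_key, pause=1.0):
    """Send each batch of citation strings and return one flat list of results."""
    url = "http://anystyle.io/parse/references"
    headers = {"Content-Type": "application/json;charset=UTF-8"}
    results = []
    for batch in batches:
        payload = {"format": "json",
                   "access_token": api_key,
                   "references": list(batch)}
        r = post(url, headers=headers, data=json.dumps(payload))
        r.raise_for_status()       # stop on an HTTP error instead of parsing an error body
        results.extend(r.json())   # flatten as we go
        time.sleep(pause)          # be polite to the service between requests
    return results

# Offline stand-ins (hypothetical) so the sketch can be run without the real service
class _FakeResponse:
    def __init__(self, items):
        self._items = items
    def raise_for_status(self):
        pass
    def json(self):
        return self._items

def _fake_post(url, headers=None, data=None):
    refs = json.loads(data)["references"]
    return _FakeResponse([{"title": ref} for ref in refs])

demo = parse_in_batches([["a", "b"], ["c"]], _fake_post, api_key="KEY", pause=0)
print(len(demo))
```

In real use, `post` would be `requests.post` and `batches` would be something like `[seg['cite'] for seg in np.array_split(citations, 5)]`.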


In [138]:
print(len(parsed_cites))

5


In [142]:
parsed_cites_master = [cite for cites in parsed_cites for cite in cites]

In [143]:
parsed_cites_master[0:10]

[{'date': '2014',
  'editor': 'Glushko, Robert J.',
  'language': 'en',
  'publisher': "O'Reilly Media",
  'title': 'The Discipline of Organizing (http://shop.oreilly.com/product/0636920034629.do)',
  'type': 'book'},
 {'author': 'Kent, William',
  'date': '2012',
  'edition': '3rd Edition) (http://books.google.com/books?id=7z57tgAACAAJ',
  'language': 'et',
  'publisher': 'Technics Publications',
  'title': 'Data and Reality',
  'type': 'book'},
 {'author': 'Bush, Vannevar',
  'date': '1945-07',
  'journal': 'The Atlantic Magazine',
  'language': 'en',
  'title': 'As We May Think',
  'type': 'article'},
 {'author': 'Borges, Jorge Luis',
  'date': '1998',
  'language': 'en',
  'publisher': 'Collected Fictions',
  'title': 'The library of Babel (Links to an external site.).',
  'type': 'book'},
 {'author': 'Hearst, Marti',
  'date': '2009',
  'language': 'en',
  'title': 'Search User Interfaces',
  'type': 'misc'},
 {'author': 'Morville, Peter and Rosenfeld, Louis',
  'date': '2006',
  

Sweet!

In [144]:
with open("parsed_cites.json",'w') as f:
    json.dump(parsed_cites_master, f, indent=4)

In [145]:
df_citations = pd.DataFrame(parsed_cites_master)
df_citations

Unnamed: 0,accessed,author,authority,booktitle,citation_number,date,edition,editor,genre,isbn,...,unmatched-editor,unmatched-genre,unmatched-journal,unmatched-pages,unmatched-publisher,unmatched-unknown,unmatched-url,unmatched-volume,url,volume
0,,,,,,2014,,"Glushko, Robert J.",,,...,,,,,,,,,,
1,,"Kent, William",,,,2012,3rd Edition) (http://books.google.com/books?id...,,,,...,,,,,,,,,,
2,,"Bush, Vannevar",,,,1945-07,,,,,...,,,,,,,,,,
3,,"Borges, Jorge Luis",,,,1998,,,,,...,,,,,,,,,,
4,,"Hearst, Marti",,,,2009,,,,,...,,,,,,,,,,
5,,"Morville, Peter and Rosenfeld, Louis",,,,2006,,,,,...,,,,,,,,,,
6,,N.P.R.,,,,2014,,,,,...,,,,,,,,,,
7,,"Gardiner, Bryan",,,,2013,,,,,...,,,,,,,,,,
8,,"Smith, Abby",,,,2000,,,,,...,,,,,,,,,,
9,,"Doctorow, Cory",,,,,,,,,...,,,,,,,,,http://www.well.com/,
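
The table is mostly empty cells, so it helps to check how many cites the parser actually filled in for each field. A small sketch of that check, using three records abridged from the parser output shown earlier rather than the full `df_citations`:

```python
import pandas as pd

# Three records abridged from the parsed output shown earlier
records = [
    {"date": "2014", "editor": "Glushko, Robert J.", "language": "en",
     "publisher": "O'Reilly Media", "type": "book"},
    {"author": "Bush, Vannevar", "date": "1945-07",
     "journal": "The Atlantic Magazine", "language": "en",
     "title": "As We May Think", "type": "article"},
    {"author": "Hearst, Marti", "date": "2009", "language": "en",
     "title": "Search User Interfaces", "type": "misc"},
]
df = pd.DataFrame(records)

# Number of non-missing values per column, most complete first
coverage = df.notna().sum().sort_values(ascending=False)
print(coverage)

# How many cites the parser assigned to each type
type_counts = df["type"].value_counts()
print(type_counts)
```

The same two calls on `df_citations` would show which fields are reliable enough to analyze and which are nearly always empty.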


In [146]:
df_citations.to_csv("parsed_cites.csv")
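
One thing to watch: by default `to_csv` also writes the row index as an unnamed first column, and reading the file back with `read_csv` turns that into a column called `Unnamed: 0`. A quick sketch of the difference, using an in-memory buffer and a tiny made-up frame instead of the real file:

```python
import io
import pandas as pd

df = pd.DataFrame({"author": ["Kent, William", "Hearst, Marti"],
                   "date": ["2012", "2009"]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

If the index carries no meaning, passing `index=False` when saving `df_citations` keeps the round trip clean; otherwise `pd.read_csv(..., index_col=0)` restores it on the way back in.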