# Date Extraction
In this notebook I explore methods of finding the approximate publication dates of items in our [catalog files](https://drive.google.com/drive/folders/1hiZ_IeMibF7RWLOJOOuAoziaF7ynXw2a).

In [None]:
%%capture
!pip3 install scholarly;

import pandas as pd
import os
import re
import doctest
import random
from scholarly import scholarly
from datetime import datetime
from google.colab import drive
drive.mount('/content/drive')

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())


## Catalog File Imports

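The `getDataFrame` function below is provided for convenience: it loads a Google Sheets file into a `pandas.DataFrame`. I use it to concatenate the five catalog files into a single DataFrame, `df`.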

In [None]:
catalog_titles = [
    'catalog.20200407.' + suffix
    for suffix in ('aa', 'ab', 'ac', 'ad', 'ae')
]
catalog_titles

['catalog.20200407.aa',
 'catalog.20200407.ab',
 'catalog.20200407.ac',
 'catalog.20200407.ad',
 'catalog.20200407.ae']

In [None]:
def getDataFrame(title, worksheet=0, has_headers=True):
  """Returns a pandas.DataFrame representation of the
  (WORKSHEET)th worksheet of the Google Sheets (GSHEET)
  file that has title TITLE.

  TITLE - the title of the desired spreadsheet
  WORKSHEET - the index of the desired worksheet within
      the spreadsheet
  HAS_HEADERS - set to False if the spreadsheet does not
      have a header row at the top.
  
  It is not necessary to specify the path or the GSHEET
  file extension. Note that this creates undefined
  behavior when your google drive has multiple spreadsheets
  with the same name (i.e., you do not know which one
  will be opened).
  """
  # For details on how to handle GSHEET files, see
  # https://gspread.readthedocs.io/en/latest/api.html
  contents = gc.open(title).get_worksheet(worksheet).get_all_values()
  if has_headers:
    return pd.DataFrame.from_records(
        data=contents[1:],
        columns=contents[0]
    )
  return pd.DataFrame.from_records(contents)
df = pd.concat(
    (getDataFrame(title) for title in catalog_titles),
    ignore_index=True
)
df.head()

| | ID | md5 | Size | mime-type | Created Date | Modified Date | Folder | Name |
|---|---|---|---|---|---|---|---|---|
| 0 | 0B9Ibqa26YXiReTZaVm9OWmNxSlk | 8.0 | 1.0 | application/vnd.google-apps.folder | | | ane.pdf.share | |
| 1 | 1SM5HksnY6AMPpTdSU5ww9-HPiQ99XfgSE7WrLhGlBDA | | | application/vnd.google-apps.spreadsheet | 2016-12-16T21:40:27.784Z | 2018-03-06T17:00:11.204Z | ane.pdf.share | Copy of PDFtp Catalog |
| 2 | 0B9Ibqa26YXiRV0N4VjF4X1dFWGs | 26.0 | 2.0 | application/vnd.google-apps.folder | 2015-10-28T02:49:34.608Z | 2018-04-25T18:57:32.604Z | ane.pdf.share/By Author (or editor) | |
| 3 | 11QJyX5cD2fOKFOv3IZreja87dZGKSh-hIdqCIkuWd6k | | | application/vnd.google-apps.spreadsheet | 2011-06-14T22:12:39.027Z | 2018-06-11T08:30:28.072Z | ane.pdf.share/By Author (or editor) | PDFtp Files: Dirty |
| 4 | 1FmwBRqFmXyG--6akcrQGBE6yVmKOl8L3PXTtzT8_jew | | | application/vnd.google-apps.spreadsheet | 2011-06-15T00:58:25.597Z | 2019-07-05T14:36:16.263Z | ane.pdf.share/By Author (or editor) | PDFtp WishList & Catalog |


## Characterization of the Data Available

In [None]:
int('0123')

123

In [None]:
def get_naturals(s):
  """Return a list of all natural numbers in the string S.

  Substrings are identified as natural numbers iff they are
  contiguous sequences of decimal digits that are not adjacent
  to other digits and that do not have leading zeros.

  >>> get_naturals('12345 abcd 12345')
  [12345, 12345]
  >>> get_naturals('3.14159')
  [3, 14159]
  >>> get_naturals('-26.4 + 0 = -26.4')
  [26, 4, 0, 26, 4]
  >>> get_naturals('012 64, 1923')
  [64, 1923]
  >>> get_naturals('00, 0--12') # If the number is just one zero, it's not a leading zero.
  [0, 12]
  """
  parts = re.split(r'[^0-9]', s)
  ret = list()
  for part in parts:
    if part and not (part[0] == '0' and part[1:]):
      ret.append(int(part))
  return ret
def get_year(s):
  """Return the number in S that seems most likely to be a year.

  Numbers are likely to be years if they are recent, but not later
  than the current year. (For reference, these doctests were
  written in 2021.)
  If no number is likely to be a year, then None is returned.

  >>> get_year('2019-2048.2021.apple.runningMan')
  2021
  >>> get_year('2019 20211') # 20211 is much later than the current year
  2019
  >>> get_year('01273, abc') # leading zero, so no number is recognized
  >>> get_year('Wolf, B, & Arnold, J. (1944). Calcium Content in ...')
  1944
  """
  ret = -1
  for n in get_naturals(s):
    if ret < n <= datetime.now().year:
      ret = n
  return ret if ret >= 0 else None

The functions defined above do pass their doctests. (Set verbose to True for details.)

In [None]:
verbose = False
doctest.run_docstring_examples(get_naturals, globals(), verbose=verbose)
doctest.run_docstring_examples(get_year, globals(), verbose=verbose)
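
`get_year` accepts any natural number up to the current year, so a name such as `13.pdf` yields 13 unless a minimum year is enforced elsewhere. A stricter variant, sketched below, only considers four-digit numbers in a plausible range. The name `get_four_digit_year` and the default bounds are assumptions made for illustration; they are not part of the pipeline used in the rest of this notebook.

```python
import re
from datetime import datetime

def get_four_digit_year(s, min_year=1500, max_year=None):
    """Return the latest plausible four-digit year in S, or None.

    Hypothetical helper: MIN_YEAR and MAX_YEAR are illustrative defaults.
    """
    if max_year is None:
        max_year = datetime.now().year
    # Match exactly four digits that are not adjacent to other digits.
    candidates = [int(m) for m in re.findall(r'(?<!\d)\d{4}(?!\d)', s)]
    plausible = [y for y in candidates if min_year <= y <= max_year]
    return max(plausible) if plausible else None

print(get_four_digit_year('Michel_1997e_Or66_lamastu.pdf'))  # 1997
print(get_four_digit_year('13.pdf'))  # None
```

Requiring exactly four digits also skips long digit runs such as ISBNs, at the cost of missing any resource dated before the year 1000.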

In [None]:
print('Regrettably, only {:.3f} of the bibliographic resources in our\n'
      'catalog seem to include a year in their file names.'.format(
          sum(
          1 if get_year(name) and get_year(name) > 1900 else 0
          for name in df.Name
      ) / len(df.Name)
))

Regrettably, only 0.288 of the bibliographic resources in our
catalog seem to include a year in their file names.


In [None]:
print('Here are some examples of file names in the catalog:')
list(random.sample(list(df.Name), 10))

Here are some examples of file names in the catalog:


['._Oidd.png',
 '',
 'Proust-ap-c-NTM.pdf',
 'Icon\n',
 'asset-v1_spbu+PSYLING+spring_2017+type@asset+block@1_1_Когнитивная_наука_как_конвергетное_знание.pdf',
 'arslantashivorysphinx3.jpeg',
 '',
 'P8150266.JPG',
 'Icon\n',
 '1963_book04.pdf']

## Examples: Use of Scholarly Library

Google Scholar quite deliberately does not provide an API, and I am importing and using a package called Scholarly that may not be fully compliant with Google's terms of service. There is some concise and informative discussion regarding such software [here](https://academia.stackexchange.com/questions/34970/how-to-get-permission-from-google-to-use-google-scholar-data-if-needed). If we have ethical concerns about it, we could try to query some other database.

Using this package (which appears to be a reasonably simple module built on top of BeautifulSoup), it is possible to retrieve quite a bit of metadata about an article.

In [None]:
search_query = scholarly.search_pubs('Abdul-Raof, Hussein - On the stylistic variation in the Quranic genre')
scholarly.pprint(next(search_query))

{'author_id': [''],
 'bib': {'abstract': 'Stylistic variation is one of the intriguing linguistic '
                     'problems of Quranic discourse. It deals with sentences '
                     'that are structurally similar, yet they are '
                     'stylistically and semantically dissimilar. Stylistic '
                     'variation echoes language behaviour and mirrors the '
                     'stylistic patterns produced by linguistic strategies and '
                     'ad hoc linguistic tools. Stylistic variation is '
                     "Qur'ān-specific and is context and co-text sensitive. In "
                     'other words, context and co-text are the linguistic '
                     'habitat of stylistic variation. Stylistic variation is '
                     'directly influenced by the',
         'author': ['H Abdul-Raof'],
         'pub_year': '2007',
         'title': 'On the stylistic variation in the quranic genre',
         'venue': 'Journa

Year information is available. Here, I print the publication years of the first 12 search results for the keyword "wool."

In [None]:
search_query = scholarly.search_pubs('wool')
for i, result in enumerate(search_query):
  print(result['bib']['pub_year'] + ' -> ' + result['bib']['title'])
  if i > 10:
    break

2011 -> Bio-based polymers and composites
2008 -> Self-healing materials: a review
1995 -> Identification and expression cloning of a leptin receptor, OB-R
1996 -> Extraribosomal functions of ribosomal proteins
1981 -> A theory crack healing in polymers
2004 -> A quantitative study of firewall configuration errors
1983 -> A theory of healing at a polymer-polymer interface
2004 -> Natural fiber composites with plant oil-based resin
2002 -> Wool: Science and technology
2000 -> Composites from natural fibers and soy oil resins
2001 -> Development and application of triglyceride‐based polymers and composites
1979 -> The structure and function of eukaryotic ribosomes


One challenge with this dataset is that words are sometimes run together without any spaces, hyphens, or underscores to distinguish them. Reconstructing the original words is a nontrivial task, but apparently Google can do it.

In [None]:
search_query = scholarly.search_single_pub('hallo 1978 assyrianhistoriographyrevisited')
scholarly.pprint(search_query)

{'author_id': [''],
 'bib': {'author': ['WW Hallo'],
         'pub_year': '1978',
         'title': 'Assyrian Historiography Revisited',
         'venue': 'NA'},
 'citedby_url': '/scholar?cites=11771693565646888008&as_sdt=2005&sciodt=0,5&hl=en',
 'filled': False,
 'gsrank': 1,
 'num_citations': 12,
 'source': 'PUBLICATION_SEARCH_SNIPPET',
 'url_add_sclib': '/citations?hl=en&xsrf=&continue=/scholar%3Fq%3Dhallo%2B1978%2Bassyrianhistoriographyrevisited%26hl%3Den%26as_sdt%3D0,5&citilm=1&json=&update_op=library_add&info=SFwiye10XaMJ&ei=RWxfYPHTJZ-R6rQPps2BmAE',
 'url_related_articles': '/scholar?q=related:SFwiye10XaMJ:scholar.google.com/&scioq=hallo+1978+assyrianhistoriographyrevisited&hl=en&as_sdt=0,5',
 'url_scholarbib': '/scholar?q=info:SFwiye10XaMJ:scholar.google.com/&output=cite&scirp=0&hl=en'}
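
To give a sense of what reconstructing run-together words involves, here is a minimal sketch of dictionary-based segmentation using dynamic programming. The tiny `VOCAB` set is made up for this example; a real vocabulary would have to come from a word list or corpus. This is only an illustration of the problem, not a description of how Google Scholar handles such queries.

```python
def segment(text, vocab):
    """Split TEXT into words from VOCAB, or return None if impossible.

    best[i] holds a segmentation of text[:i] that uses the fewest words
    found so far, or None if text[:i] cannot be segmented.
    """
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]

# Hypothetical vocabulary, for illustration only.
VOCAB = {'assyrian', 'historiography', 'revisited', 'a', 'history', 'story'}
print(segment('assyrianhistoriographyrevisited', VOCAB))
# ['assyrian', 'historiography', 'revisited']
```

Preferring the segmentation with the fewest words is one simple tie-breaking rule; a frequency-weighted score would be a natural alternative.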


## Data Cleaning

Here, I write some basic methods that we can use to clean up our file names before searching for them on Google Scholar.

File extensions seem like a logical thing to remove. However, `os.path.splitext` does not do quite as well as we would like: it treats everything after the last period in a name as the extension. Below, I print a random sample of 100 of the strings that it identifies as file extensions.

In [None]:
def print_compact(strs, max_width=100):
  nchars = 0
  len_pad = 4
  for i, s in enumerate(strs):
    width = len(s) + len_pad
    if len(s) > max_width:
      print("\n    [A string was too long to print.]\n")
      continue
    if nchars + width > max_width:
      print()
      nchars = 0
    print('"' + s + '"', end=', ')
    nchars += len(s) + len_pad
extensions = set(os.path.splitext(f)[1] for f in df.Name)
# random.sample requires a sequence; sets are rejected since Python 3.11.
print_compact(random.sample(sorted(extensions), 100))

". Harrak, Assyria and Hanigalbat", ".INF", ".IV-", ".exp", ". CDOG 8, 2008 125-136", 
".textClipping", ". Zippalanda", ".C", ". -T", ". AnalogLife_DigitalImage_Abstracts", 
". Список литературы", ".133", ". RGTC 5", ".155", ".proclus", ".homer", ".sheinman", ".ini", 
".rpm", ".vergilius", ".mds", ".01", ". 1-2", ".EX_", ".xlsx", ".palama", ".org", ".UNS", 
". Ah Rit fur das Konigspaar", ".87", ".au", ".jbf", ".sit", ".svg", ".19a", 
". Kumarbi, Teogonia, Papiro de Derveni_1989", ".09d", ".56", ".csv", ".bibikhin-com", 
". - Relevant of the Diyala Sequence to South Mesopotamian Sites (Iraq 29)", 
". Cammarosano_Il decreto antico-ittita di Pimpira_2006", 
    [A string was too long to print.]

".muravieva-comm", ".voilquin-com", 
".JPE", ".kant", ".DLL", ".Formation des noms personnels féminins", ".met", ".losski-comm", 
". Veenhof on the Occasion of his Sixty-Fifth Birthday", 
". Die Syntax des ah subst Genitivs (PARTLY)", ". Religion_1997", ".LUGAL", ".04", ".DIZ", 
". Die Apologie H
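
The odd "extensions" above follow directly from how `os.path.splitext` works: it splits at the last period, whatever comes after it. The short sketch below shows this on a few names. The second name is made up for illustration; the others are taken from this notebook.

```python
import os

examples = [
    'veenhof2010_ch.pdf',
    'Smith, J. A title with spaces',  # hypothetical name, no real extension
    '._Oidd.png',
    'Icon\n',
]
for name in examples:
    root, ext = os.path.splitext(name)
    print(repr(name), '->', repr(ext))
# 'veenhof2010_ch.pdf' -> '.pdf'
# 'Smith, J. A title with spaces' -> '. A title with spaces'
# '._Oidd.png' -> '.png'
# 'Icon\n' -> ''
```

This is why filtering by length and frequency, as done in the next cell, is a reasonable way to separate real extensions from fragments of titles.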

In [None]:
def get_extensions(filenames, min_frequency=7, max_len=5):
  """Return the set of common extensions used in the sequence
  FILENAMES. Extensions are returned without the leading period.

  Extensions are only recognized if they appear at least
  MIN_FREQUENCY times and have at most MAX_LEN characters.
  """
  extensions = dict()
  for f in filenames:
    assert isinstance(f, str)
    ext = os.path.splitext(f)[1][1:]
    if ext and len(ext) <= max_len:
      extensions[ext] = extensions.get(ext, 0) + 1
  return set(
      key for key in extensions
      if extensions[key] >= min_frequency
  )
len(get_extensions(df.Name))

108

In [None]:
print_compact(get_extensions(df.Name))

"", ".pdf", ".LSF", ".au", ".kant", ". 2", ".psd", ".opus", ".2", ".cda", ".jbf", ".wav", ".css", 
".dr", ".bmp", ".ai", ".", ".rtf", ".frf", ".m4a", ".tab", ".gexf", ".e 2", ".docx", ".MYI", 
".rar", ".xls", ".vulg", ".tmp", ".wbk", ".TXT", ".WPG", ".idx", ".C", ".1", ".wma", ".mht", 
".frm", ".pptx", ".info", ".gif", ".csv", ".aux", ".MYD", ".png", ".jpeg", ".wpd", ".ACE", ".jpe", 
".JPG", ".zip", ".eml", ".doc", ".DOC", ".xlsx", ".EXE", ".ppt", ".HW3", ".JPE", ".e 1", ".js", 
".gdoc", ".txt", ".djvu", ".GIF", ".HW", ".ini", ".TTF", ".mp3", ".3", ".TIF", ".htm", ".tif", 
".djv", ".php", ".RAR", ".HWT", ".db", ".ttf", ".LIX", ".DAT", ".jpg", ".html", ".john", ".djbz", 
".lnk", ".yml", ".log", ".cdr", ".e", ".epub", ".hdr", ".xml", ".PDF", ".exe", ".dat", ".tiff", 
".swf", ".BMP", ". 1", ".pdx", ".fp7", ".eps", ".DB", ".chm", 

In [None]:
def get_words(s):
  """Returns all words and contiguous numbers in S.
  
  >>> get_words('Юрчак Алексей. Это было навсегда, пока не кончилось.')
  ['Юрчак', 'Алексей', 'Это', 'было', 'навсегда', 'пока', 'не', 'кончилось']
  >>> get_words('veenhof2010_ch.pdf')
  ['veenhof', '2010', 'ch', 'pdf']
  >>> get_words('Michel_1997e_Or66_lamastu.pdf')
  ['Michel', '1997', 'e', 'Or', '66', 'lamastu', 'pdf']
  >>> get_words('._transcript_KBo_IV_6_obv.htm')
  ['transcript', 'KBo', 'IV', '6', 'obv', 'htm']
  >>> get_words('macdonald_the-homeric-epics-and-the-gospel-of-mark-0300080123.pdf')
  ['macdonald', 'the', 'homeric', 'epics', 'and', 'the', 'gospel', 'of', 'mark', '0300080123', 'pdf']
  >>> get_words('Albenda, Lions Assyrian Reliefs, JANES 6, 1974b.pdf')
  ['Albenda', 'Lions', 'Assyrian', 'Reliefs', 'JANES', '6', '1974', 'b', 'pdf']
  >>> get_words('marti2009 un m%e9decin malade jmc 13.pdf')
  ['marti', '2009', 'un', 'm', 'e', '9', 'decin', 'malade', 'jmc', '13', 'pdf']
  """
  return [match.group(0) for match in re.finditer(r'([^\d\W_]+)|(\d+)', s)]

Again, I verify that the function passes its doctests:

In [None]:
doctest.run_docstring_examples(get_words, globals(), verbose=False)

In [None]:
def get_search_query(s, file_extensions):
  """Constructs a search query from the file name S."""
  words = get_words(s)
  if words and words[-1] in file_extensions:
    words = words[:-1]
  return ' '.join(words)
extensions = get_extensions(df.Name)
get_search_query('Veenhof_2006_Two new sources from Kanesh.PDF', extensions)

'Veenhof 2006 Two new sources from Kanesh'

## Current Best Working Technique

In [None]:
class YearFinder:
  """Encapsulates logic to get the years from a set of bibliographic
  records.
  """
  def __init__(
      self,
      file_names=None,
      min_year=-float('inf'),
      google_search=False
      ):
    """Provide file names so that the YearFinder can have general
    information about what the dataset is like.

    GOOGLE_SEARCH - whether or not you would like to query Google
        Scholar to help determine publication years.
    MIN_YEAR - the earliest year that you would like YearFinder
        to recognize
    """
    self.file_names = file_names
    self.extensions = (
        get_extensions(file_names) if file_names is not None
        else []
    )
    self.google_search = google_search
    self.min_year = min_year
  def year(self, file_name):
    """Returns a possible publication year for the bibliographic
    resource associated with the file name. Returns None if no
    year could be found.
    """
    file_year = get_year(file_name)
    if file_year and file_year >= self.min_year:
      return file_year
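    if self.google_search:
      response = dict()
      try:
        response = scholarly.search_single_pub(
            get_search_query(file_name, self.extensions)
        )
      except IndexError:
        # Presumably raised when the search returns no results.
        pass
      except Exception:
        # Catch Exception rather than using a bare except, so that
        # KeyboardInterrupt and SystemExit still propagate.
        print('Could not access Google Scholar.')
      try:
        response_year = get_year(response['bib']['pub_year'])
        if response_year:
          return response_year
      except KeyError:
        pass
    return None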

Here is a quick ad-hoc test.

In [None]:
yf = YearFinder(file_names=df.Name, min_year=1900, google_search=True)

In [None]:
for item in random.sample(list(df.Name), 20):
  pad = max(0, 40-len(item))
  print(item + ' ' * pad + ' -> ' + str(yf.year(item)))

Ogvl.png                                 -> 2020
Bourdieu (1998) - Practical Reason-On the Theory of Action.pdf -> 1998
Avraham Ronen_2006_Comments on Recent Publications_Neolithic Revolution by Peltenburg_Wasse.pdf -> 2006
OrNS 62, 1993.pdf                        -> 1993
hurritisch ein altorientalische Sprache in einem neuem Licht.pdf -> 1988
._Iraq 31, Gurney, List Copper Objects.pdf -> 1969
._133.DOC                                -> 2016
IMAGE3923.JPG                            -> 2020
40024317.pdf                             -> 2008
S0291.jpg                                -> 2019
PJ1553.A1_1908_cop3_126.JPG              -> 1908
black2000_sumerianadjectives.pdf         -> 2000
._Oapx.png                               -> 1999
0521807409.Cambridge.University.Press.The.Dynamics.of.Coastal.Models.Feb.2008.pdf -> 2008
BAM165iii-iv.jpg                         -> 1995
Okil.png                                 -> 2018
Орлов Традиция Еноха Метатрона.docx      -> 2018
kt s_t 92.txt          

## Evaluation

Here, I define a function for collecting years assigned by a human, so that they can be compared with the years found by `YearFinder`.

In [None]:
def get_labels(texts, quit_ask=None):
  """Returns a two-column DataFrame of texts and labels
  assigned to those texts by a human.
  """
  labels = []
  print('Please associate labels with the following file names.\n'
        'If you do not know the correct label, simply press Enter.')
  for i, s in enumerate(texts):
    labels.append(input('Current file name: ' + s + '\nLabel? '))
    if (
            quit_ask and i % quit_ask == quit_ask - 1
            and 'y' in input('Would you like to quit? (y/n) ')
        ):
      break
  return pd.DataFrame(data={
      'texts': texts[:len(labels)],
      'labels': labels
  })
labelled_file_names = get_labels(
    random.sample(list(df.Name), 100),
    quit_ask=10
)

Please associate labels with the following file names.
If you do not know the correct label, simply press Enter.
Current file name: richards_paul-and-first-century-letter-writing-secretaries-composition-and-collection-0830827889.pdf
Label? 2004
Current file name: Geller_1999_Freud_and_Mesopotamian_Magic_AMD1.pdf
Label? 1999
Current file name: Benz 2000 Die Neolithisierung im Vorderen Orient.pdf
Label? 2000
Current file name: Ritner 2000 - Innovations and Adaptations in Ancient Egyptian Medicine (JNES 59).pdf
Label? 2000
Current file name: T08K02Y10.htm
Label? 
Current file name: 500.GIF
Label? 
Current file name: Ojfa.png
Label? 
Current file name: 
Label? 
Current file name: Tel-Dan_Athas_JSS-51-2006.pdf
Label? 2006
Current file name: ._Ohla.png
Label? 
Would you like to quit? (y/n) n
Current file name: 441.jpg
Label? 
Current file name: 13.pdf
Label? 
Current file name: 20150806-DSC09833.JPG
Label? 2015
Current file name: CCT 6, 18a.txt
Label? 
Current file name: pl.103.JPG
Label? 
C

In [None]:
labelled_file_names['pred'] = [
    yf.year(text) for text in labelled_file_names.texts
]
labelled_file_names.head()

Could not access Google Scholar.
[The line above was printed 22 times.]


| | texts | labels | pred |
|---|---|---|---|
| 0 | richards_paul-and-first-century-letter-writing... | 2004.0 | |
| 1 | Geller_1999_Freud_and_Mesopotamian_Magic_AMD1.pdf | 1999.0 | 1999.0 |
| 2 | Benz 2000 Die Neolithisierung im Vorderen Orie... | 2000.0 | 2000.0 |
| 3 | Ritner 2000 - Innovations and Adaptations in A... | 2000.0 | 2000.0 |
| 4 | T08K02Y10.htm | | |
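
The "Could not access Google Scholar." output above shows that many lookups failed outright. One possible cause is that requests were being throttled or blocked, although that is an assumption I have not verified. If the failures are temporary, retrying with exponentially growing delays might help. Below is a minimal sketch of that idea. `call_with_backoff` and `flaky_lookup` are hypothetical names, and the lookup is a stand-in that does not contact any server.

```python
import time

def call_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call FN(), retrying with exponentially growing delays on failure.

    Returns FN's result, or None if every attempt raised an exception.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt < retries - 1:
                # Wait 1x, 2x, 4x, ... the base delay before retrying.
                sleep(base_delay * 2 ** attempt)
    return None

# Stand-in for a flaky lookup: fails twice, then succeeds.
attempts = []
def flaky_lookup():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError('temporarily unavailable')
    return {'bib': {'pub_year': '1978'}}

# Record the delays instead of actually sleeping.
delays = []
result = call_with_backoff(flaky_lookup, sleep=delays.append)
print(result['bib']['pub_year'], delays)  # 1978 [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. In the notebook, `fn` could wrap the call made inside `YearFinder.year`.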


In [None]:
count_labelled = sum(1 if label else 0 for label in labelled_file_names.labels)
print('Regrettably, I was only able to associate years with\n{}/{} ({:.3f})'
      ' of the randomly sampled file names.'.format(
          count_labelled,
          len(labelled_file_names.index),
          count_labelled / len(labelled_file_names.index)
))

Regrettably, I was only able to associate years with
13/30 (0.433) of the randomly sampled file names.


In [None]:
print('Using the methods defined above, it was possible to\n'
      'retrieve {} possible years, of which {} agreed with\n'
      'my labels.'.format(
          # Count non-missing predictions. NaN is truthy, so a plain
          # truthiness test would count every row.
          int(labelled_file_names.pred.notna().sum()),
          sum(
              1 if (labelled_file_names.labels[i]
                  and (int(labelled_file_names.labels[i])
                     == labelled_file_names.pred[i])) else 0
              for i in labelled_file_names.index
          )
      ))

Using the methods defined above, it was possible to
retrieve 30 possible years, of which 7 agreed with
my labels.
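
The two counts above can be summarized as standard rates. The sketch below computes coverage (the share of names that received a prediction), precision (the share of predictions that match a label) and recall (the share of labelled names that were predicted correctly). The function name `score` and the sample data are made up for illustration; the numbers are not results from this notebook.

```python
import math

def score(labels, preds):
    """Return (coverage, precision, recall) for year predictions.

    LABELS is a list of strings ('' when the human could not tell);
    PREDS is a list of numbers, with None or NaN for "no prediction".
    """
    def missing(p):
        return p is None or (isinstance(p, float) and math.isnan(p))

    predicted = sum(1 for p in preds if not missing(p))
    labelled = sum(1 for l in labels if l)
    correct = sum(
        1 for l, p in zip(labels, preds)
        if l and not missing(p) and int(l) == int(p)
    )
    coverage = predicted / len(preds) if len(preds) else 0.0
    precision = correct / predicted if predicted else 0.0
    recall = correct / labelled if labelled else 0.0
    return coverage, precision, recall

# Made-up example data, not the notebook's results.
labels = ['2004', '1999', '', '2000']
preds = [None, 1999.0, 2016.0, float('nan')]
print(score(labels, preds))  # (0.5, 0.5, 0.3333333333333333)
```

Note that a prediction for an unlabelled name counts against precision here. That is a deliberately strict choice, since such names usually have no recoverable year at all.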
