1. Introduction
===============

In [2]:
import matplotlib.pyplot as plt
import os
import pandas as pd
import re
import seaborn as sns
try:
    from urllib2 import Request, urlopen
except ImportError:
    from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
%matplotlib inline

2. Scraping
===========

In [3]:
def get_file(url, filename=None):
    if filename is None:
        slash = url.rindex("/")
        filename = url[slash+1:]
    if os.path.isfile(filename):
        with open(filename, 'rb') as file:
            content = file.read()
    else:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        content = urlopen(req).read()
        with open(filename, 'wb') as file:
            file.write(content)
    return content

3. Cleaning structured data
===========================

In [None]:
get_file("https://github.com/JabRef/abbrv.jabref.org/raw/master/journals/journal_abbreviations_webofscience.txt")
abbs = pd.read_csv("journal_abbreviations_webofscience.txt", sep='=',
                       comment='#', names=["Full Journal Title", "Abbreviation"])
abbs = abbs.applymap(lambda x: x.upper().strip())

In [None]:
get_file("https://www.researchgate.net/file.PostFileLoader.html?id=558730995e9d9735688b4631&assetKey=AS%3A273803718922244%401442291301717",
         "2014_SCI_IF.xlsx")
ifs = pd.read_excel("2014_SCI_IF.xlsx", skiprows=2, index_col=0)

In [None]:
ifs.tail()

In [None]:
skip = len(ifs) - ifs["Journal Impact Factor"].last_valid_index() + 1
# Recent pandas versions use skipfooter/usecols instead of skip_footer/parse_cols.
ifs = pd.read_excel("2014_SCI_IF.xlsx", skiprows=2, index_col=0,
                        skipfooter=skip, usecols="A,B,E")

In [None]:
ifs["Full Journal Title"] = ifs["Full Journal Title"].apply(lambda x: x.upper())

4. Cleaning and filtering of semi-structured data
=================================================

In [None]:
def get_search_result(author):
    filename = author + ".html"
    url = "https://arxiv.org/find/all/1/au:+" + author
    url += "/0/1/0/all/0/1?per_page=400"
    page = get_file(url, filename)
    return BeautifulSoup(page, "html.parser")

Let us study the results:

In [None]:
lewenstein = get_search_result("Lewenstein_M")

Skipping the boring header part, the source code of the search result looks like this:

```html
<h3>Showing results 1 through 389 (of 389 total) for 
<a href="/find/all/1/au:+Lewenstein_M/0/1/0/all/0/1?skip=0&amp;query_id=ff0631708b5d0dd5">au:Lewenstein_M</a></h3>
<dl>
<dt>1.  <span class="list-identifier"><a href="/abs/1703.09814" title="Abstract">arXiv:1703.09814</a> [<a href="/pdf/1703.09814" title="Download PDF">pdf</a>, <a href="/ps/1703.09814" title="Download PostScript">ps</a>, <a href="/format/1703.09814" title="Other formats">other</a>]</span></dt>
<dd>
<div class="meta">
<div class="list-title mathjax">
<span class="descriptor">Title:</span> Efficient Determination of Ground States of Infinite Quantum Lattice  Models in Three Dimensions
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span> 
<a href="/find/cond-mat/1/au:+Ran_S/0/1/0/all/0/1">Shi-Ju Ran</a>, 
<a href="/find/cond-mat/1/au:+Piga_A/0/1/0/all/0/1">Angelo Piga</a>, 
<a href="/find/cond-mat/1/au:+Peng_C/0/1/0/all/0/1">Cheng Peng</a>, 
<a href="/find/cond-mat/1/au:+Su_G/0/1/0/all/0/1">Gang Su</a>, 
<a href="/find/cond-mat/1/au:+Lewenstein_M/0/1/0/all/0/1">Maciej Lewenstein</a>
</div>
<div class="list-comments">
<span class="descriptor">Comments:</span> 11 pages, 9 figures
</div>
<div class="list-subjects">
<span class="descriptor">Subjects:</span> <span class="primary-subject">Strongly Correlated Electrons (cond-mat.str-el)</span>; Computational Physics (physics.comp-ph)

</div>
</div>
</dd>
```
This is the entire first result. It might look intimidating, but as long as you know that in HTML an element starts with `<whatever>` and ends with `</whatever>`, you will find regular and hierarchical patterns. If you stare hard enough, you will see that the `<dd>` tag contains all the information we want. It does not actually matter what `<dd>` is: we are not writing a browser, we are scraping data.
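
To make the "regular and hierarchical patterns" concrete, here is a minimal sketch that walks a trimmed copy of the markup above. It deliberately uses Python's built-in `html.parser` instead of BeautifulSoup so that it runs without any extra installs. The class name `TitleCollector` and the shortened `snippet` are illustrative assumptions, not part of the notebook.

```python
from html.parser import HTMLParser

# Trimmed copy of the first search result shown above (illustrative).
snippet = """
<dl>
<dt>1.  <span class="list-identifier"><a href="/abs/1703.09814" title="Abstract">arXiv:1703.09814</a></span></dt>
<dd>
<div class="meta">
<div class="list-title mathjax">
<span class="descriptor">Title:</span> Efficient Determination of Ground States of Infinite Quantum Lattice  Models in Three Dimensions
</div>
</div>
</dd>
</dl>
"""


class TitleCollector(HTMLParser):
    """Collect the text of every <div> whose class list contains 'list-title'."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div>s deep we are inside a title div
        self.chunks = []    # text pieces of the title currently being read
        self.titles = []    # finished titles, whitespace-normalized

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "list-title" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.titles.append(" ".join("".join(self.chunks).split()))
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


collector = TitleCollector()
collector.feed(snippet)
print(collector.titles)
```

The collected text still starts with the `Title: ` descriptor, which is seven characters long; that is why the title-extraction cell in this section slices with `[7:]`.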

As a quick sanity check, we can easily extract the titles, and verify that they match the number of search results:

In [None]:
titles = []
for dd in lewenstein.find_all("dd"):
    div = dd.find("div", class_="list-title mathjax")
    titles.append(div.get_text().strip()[7:])
len(titles)

So far so good. The next problem is that not all of these papers belong to Maciej Lewenstein: some impostors share the abbreviated name M. Lewenstein. They are easy to detect if they use the non-abbreviated name. Let us run through the page again, noting which subjects the impostors publish in. For this, we introduce another auxiliary function that extracts the short name of the subject. We also note the primary subject whenever the abbreviated form of the name appears.

In [None]:
def drop_punctuation(string):
    result = string.replace(".", " ")
    return " ".join(result.split())

In [None]:
def extract_subject(long_subject):
    start = long_subject.index("(")
    return long_subject[start+1:-1]

true_lewenstein = ["Maciej Lewenstein", "M Lewenstein"]
impostors = set()
primary_subjects = set()
for dd in lewenstein.find_all("dd"):
    div = dd.find("div", class_="list-authors")
    subject = extract_subject(dd.find("span", class_ = "primary-subject").text)
    names = [drop_punctuation(a.text) for a in div.find_all("a")]
    for name in names:
        if re.search("M.* Lewenstein", name):
            if name not in true_lewenstein:
                impostors.add(name + " " + subject)
            elif "Maciej" not in name:
                primary_subjects.add(subject)
print(impostors)
print(primary_subjects)
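
The classification rule inside the loop can be tried on a few hand-made author strings. This is only a sketch: `classify` is a hypothetical helper that mirrors the branches of the loop, and "Marcus Lewenstein" is an invented name used for illustration, not one taken from the search results.

```python
import re

true_lewenstein = ["Maciej Lewenstein", "M Lewenstein"]


def drop_punctuation(string):
    # Same helper as in the notebook: dots become spaces, whitespace collapses.
    result = string.replace(".", " ")
    return " ".join(result.split())


def classify(raw_name):
    """Return 'other', 'impostor', 'full' or 'abbreviated' for one author."""
    name = drop_punctuation(raw_name)
    if not re.search("M.* Lewenstein", name):
        return "other"          # not a Lewenstein at all
    if name not in true_lewenstein:
        return "impostor"       # a different M... Lewenstein
    return "full" if "Maciej" in name else "abbreviated"


# Sample inputs; the third name is invented for illustration only.
for raw in ["Maciej Lewenstein", "M. Lewenstein", "Marcus Lewenstein", "Shi-Ju Ran"]:
    print(raw, "->", classify(raw))
```

Only the "abbreviated" case is ambiguous, which is why the loop records the primary subject for exactly those entries.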

So it is only one person, and we can be reasonably confident that Maciej Lewenstein is unlikely to publish in these subjects. The other good news is that all the short forms of the name belong to physics papers, and not computer science. Armed with this knowledge, we can filter out the correct manuscripts. We need to filter one more thing: we are only interested in papers for which the journal reference is given. Further digging in the HTML code lets us find the correct tag. While we are putting together the correct records, we also normalize his name. We define yet another set of auxiliary functions. We zip the main loop's iterator with the `<dt>` tag, because only this one contains the arXiv ID, from which the year can be extracted.

In [None]:
def extract_journal(journal):
    start = journal.index(" ")
    raw = journal[start+1:-1]
    m = re.search(r"\d", raw)  # raw string avoids an invalid-escape warning
    return drop_punctuation(raw[:m.start()]).strip().upper()


def extract_title(title):
    start = title.index(" ")
    return title[start+1:-1]


def extract_id_and_year(arXiv):
    start = arXiv.index(":")
    if "/" in arXiv:
        year_index = arXiv.index("/")
    else:
        year_index = start
    year = arXiv[year_index+1:year_index+3]
    if year[0] == "9":
        year = int("19" + year)
    else:
        year = int("20" + year)
    return arXiv[start+1:], year

papers = []
for dd, dt in zip(lewenstein.find_all("dd"), lewenstein.find_all("dt")):
    id_, year = extract_id_and_year(dt.find("a", attrs={"title": "Abstract"}).text)
    div = dd.find("div", class_="list-authors")
    subject = extract_subject(dd.find("span", class_ = "primary-subject").text)
    journal = dd.find("div", class_ = "list-journal-ref")
    if journal:
        names = [drop_punctuation(a.text) for a in div.find_all("a")]
        for i, name in enumerate(names):
            if re.search("M.* Lewenstein", name):
                if name not in true_lewenstein:
                    break
                else:
                    names[i] = "Maciej Lewenstein"
        else:
            papers.append([id_, extract_title(dd.find("div", class_ = "list-title mathjax").text),
                           names, subject, year, extract_journal(journal.text)])
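
The year logic is easiest to check on concrete identifiers. The sketch below restates `extract_id_and_year` in a slightly more compact form so that it runs on its own, then applies it to the new-style ID from the listing above and to two old-style IDs. The old-style strings are assumed examples of the `archive/YYMMNNN` format, chosen only for illustration.

```python
def extract_id_and_year(arXiv):
    # Same logic as the notebook cell, repeated so the sketch is standalone.
    start = arXiv.index(":")
    # Old-style IDs carry the date after the "/", new-style ones after the ":".
    year_index = arXiv.index("/") if "/" in arXiv else start
    year = arXiv[year_index+1:year_index+3]
    # Two-digit years starting with 9 are read as the 1990s, everything else as 20xx.
    century = "19" if year[0] == "9" else "20"
    return arXiv[start+1:], int(century + year)


examples = [
    "arXiv:1703.09814",        # new style, taken from the listing above
    "arXiv:quant-ph/9701001",  # old style, assumed example
    "arXiv:cond-mat/0305123",  # old style, assumed example
]
for example in examples:
    print(extract_id_and_year(example))
```

The century rule is a heuristic: it works because arXiv IDs start in the 1990s, and it would need revisiting for two-digit years beyond 89.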
            

We would be almost done if journal names were all entered the same way. Of course they were not. Let us try to standardize them:

In [None]:
for i, paper in enumerate(papers):
    journal = paper[-1]
    long_name = abbs[abbs["Abbreviation"] == journal]
    if len(long_name) > 0:
        papers[i][-1] = long_name["Full Journal Title"].values[0]

There will still be some rotten apples:

In [None]:
def find_rotten_apples(paper_list):
    rotten_apples = []
    for paper in paper_list:
        match = ifs[ifs["Full Journal Title"] == paper[-1]]
        if len(match) == 0:
            rotten_apples.append(paper[-1])
    return sorted(rotten_apples)
rotten_apples = find_rotten_apples(papers)
rotten_apples

Now you start to feel the pain of being a data scientist. The sloppiness of manual data entry is unbounded. Your duty is to clean up this mess. A quick fix is to tinker with the `drop_punctuation` function to handle the awkward encoding of JRC. Then go through the creation of the `papers` array and the standardization again.

In [None]:
def drop_punctuation(string):
    result = string.replace(".", " ")
    result = result.replace(",", " ")
    result = result.replace("(", " ")
    result = result.replace(": ", "-")
    return " ".join(result.split())
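
A quick way to see what the extended function changes is to feed it a few journal strings. The inputs below are assumed examples of the kind of variation that shows up in journal references; they are not entries copied from the scraped data.

```python
def drop_punctuation(string):
    # Extended version from the cell above, repeated so the sketch is standalone.
    result = string.replace(".", " ")
    result = result.replace(",", " ")
    result = result.replace("(", " ")
    result = result.replace(": ", "-")
    return " ".join(result.split())


# Assumed sample inputs, for illustration only.
samples = ["Phys. Rev. A", "J. Phys. B: At. Mol. Opt. Phys.", "New J. Phys., "]
for sample in samples:
    # extract_journal also upper-cases the result, so we do the same here.
    print(repr(drop_punctuation(sample).upper()))
```

Note that the order of the replacements matters: the colon rule only fires on `": "`, so it has to see the string before the final whitespace collapse.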

**Exercise**. Cut the rotten apples list in half by defining a replacement dictionary and doing another round of standardization.

In [None]:
len(rotten_apples)

It is the same drill with our other contender, except that the short version of his name is uniquely his. On the other hand, that single nasty accent introduces N+1 spelling variants, which we should standardize.

In [None]:
acin = get_search_result("Acin_A")
for dd, dt in zip(acin.find_all("dd"), acin.find_all("dt")):
    id_, year = extract_id_and_year(dt.find("a", attrs={"title": "Abstract"}).text)
    div = dd.find("div", class_="list-authors")
    subject = extract_subject(dd.find("span", class_ = "primary-subject").text)
    journal = dd.find("div", class_ = "list-journal-ref")
    if journal:
        names = [drop_punctuation(a.text) for a in div.find_all("a")]
        journal = extract_journal(journal.text)
        long_name = abbs[abbs["Abbreviation"] == journal]
        if len(long_name) > 0:
            journal = long_name["Full Journal Title"].values[0]
        papers.append([id_, extract_title(dd.find("div", class_ = "list-title mathjax").text),
                       names, subject, year, journal])
for paper in papers:
    names = paper[2]
    for i, name in enumerate(names):
        if re.search("A.* Ac.?n", name):
            names[i] = "Antonio Acín"
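
The pattern `A.* Ac.?n` relies on `.?` to absorb whichever character stands in for the accented letter. A small sketch, using assumed spelling variants rather than names taken from the scraped page:

```python
import re

pattern = re.compile("A.* Ac.?n")

# Assumed spelling variants plus one unrelated author, for illustration only.
variants = ["Antonio Acín", "Antonio Acin", "A Acin", "A Acín", "Shi-Ju Ran"]

# Every variant that matches is mapped to one canonical spelling.
normalized = ["Antonio Acín" if pattern.search(v) else v for v in variants]
print(normalized)
```

The pattern is intentionally loose, which is fine for a search page that only lists one author's papers; on a broader data set it would need tightening.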

In [None]:
rotten_apples = find_rotten_apples(papers)
rotten_apples

In [None]:
db = pd.merge(pd.DataFrame(papers, columns=["arXiv", "Title", "Authors", "Primary Subject", "Year", "Full Journal Title"]),
                           ifs, how="inner", on=["Full Journal Title"])

Finally, we drop duplicate entries.

In [None]:
db = db.drop_duplicates(subset="arXiv")

5. Visual analysis
==================

Since we only focus on two authors, we can add an additional column that identifies which of them wrote each paper. We also care about the co-authored papers.

In [None]:
def identify_key_authors(authors):
    if "Maciej Lewenstein" in authors and "Antonio Acín" in authors:
        return "AAML"
    elif "Maciej Lewenstein" in authors:
        return "ML"
    else:
        return "AA"

db["Group"] = db["Authors"].apply(identify_key_authors)

Let's start plotting distributions:

In [None]:
groups = ["AA", "ML", "AAML"]
fig, ax = plt.subplots(ncols=1)
for group in groups:
    data = db[db["Group"] == group]["Journal Impact Factor"]
    sns.distplot(data, kde=False, label=group)
ax.legend()
ax.set_yscale("log")
plt.show()

The logarithmic scale makes the raw number of papers appear more balanced, which is fair given the difference in age between the two authors. A single Nature paper makes a great outlier:

In [None]:
db[db["Journal Impact Factor"] == db["Journal Impact Factor"].max()]

Actually, Toni has another Nature paper, but that one is not on arXiv yet. Not all of Maciej's papers are on arXiv either, especially not the old ones.

We can do the same plots with subjects:

In [None]:
subjects = db["Primary Subject"].drop_duplicates()
fig, ax = plt.subplots(ncols=1)
for subject in subjects:
    data = db[db["Primary Subject"] == subject]["Journal Impact Factor"]
    sns.distplot(data, kde=False, label=subject)
ax.legend()
ax.set_yscale("log")
plt.show()

You are safe with quant-ph and quantum gases, but steer clear of atomic physics. It is amusing to restrict the histogram to Professor Acín's subset:

In [None]:
fig, ax = plt.subplots(ncols=1)
for subject in subjects:
    data = db[(db["Primary Subject"] == subject) & (db["Group"] != "ML")]
    if len(data) > 1:
        sns.distplot(data["Journal Impact Factor"], kde=False, label=subject)
ax.legend()
ax.set_yscale("log")
plt.show()

His topics are somewhat predictable.

Let's add one more column to indicate the number of authors:

In [None]:
db["#Authors"] = db["Authors"].apply(len)
sns.stripplot(x="#Authors", y="Journal Impact Factor", data=db)

How about the length of the title? Or the number of words in it?

In [None]:
db["Length of Title"] = db["Title"].apply(lambda x: len(x))
db["Number of Words in Title"] = db["Title"].apply(lambda x: len(x.split()))
fig, axes = plt.subplots(ncols=2, figsize=(12, 5))
sns.stripplot(x="Length of Title", y="Journal Impact Factor", data=db, ax=axes[0])
sns.stripplot(x="Number of Words in Title", y="Journal Impact Factor", data=db, ax=axes[1])
plt.show()

Do our authors maintain IF over time?

In [None]:
fig, axes = plt.subplots(nrows=2, figsize=(10, 5))
data = db[db["Group"] != "ML"]
sns.stripplot(x="Year", y="Journal Impact Factor", data=data, ax=axes[0])
axes[0].set_title("AA")
data = db[db["Group"] != "AA"].sort_values(by="Year")
sns.stripplot(x="Year", y="Journal Impact Factor", data=data, ax=axes[1])
axes[1].set_title("ML")
plt.tight_layout()
plt.show()

If anything, the IF improved over time. Perhaps joining ICFO had something to do with it.

**Completely absurd exercise**. Find out if there is any correlation between the non-date part of the arXiv ID and the Impact Factor. For IDs that contain a "/", it is the last three digits. For newer IDs, it is everything after "."
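
As a starting point, the rule for splitting off the non-date part can be sketched as follows. `non_date_part` is a hypothetical helper name, and the sketch assumes IDs in the form stored in the `arXiv` column, that is, without the `arXiv:` prefix and without a version suffix. The old-style ID is an assumed example.

```python
def non_date_part(arxiv_id):
    """Return the non-date part of an arXiv ID as a string."""
    if "/" in arxiv_id:
        # Old style (archive/YYMMNNN): keep the last three digits.
        return arxiv_id[-3:]
    # New style (YYMM.NNNNN): keep everything after the dot.
    return arxiv_id.split(".", 1)[1]


print(non_date_part("1703.09814"))        # ID from the listing above
print(non_date_part("quant-ph/9701001"))  # assumed old-style example
```

The result is kept as a string to preserve leading zeros; converting it with `int` is left for the correlation step.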

**Homework**. Extend the analysis to the citation numbers of individual papers. Scrape the data from Google Scholar.