### ***Insights into academia***

* How has open-access publishing grown over the years? 
  * Does it improve the quality of research? How well does open access translate to the quantity of research? 
  * Can we devise a normalized research quality metric based on the information?
  * To what extent does open access benefit the community, in particular the developing nations? 

Library Genesis (LibGen) is a database of articles and books on various topics, initiated by scientists and researchers in an attempt to make federally funded research open. It allows free access to content that is otherwise paywalled or not digitized elsewhere. 

> *Alexandra Elbakyan*, the founder of the sister site Sci-Hub, was listed by *Nature* in 2016 among the **top ten people that mattered in science**. The US ranks *second* in terms of web traffic on Sci-Hub and related websites. 
>
>According to Wikipedia, Elbakyan's Sci-Hub is widely used in both developed and developing countries, serving over 200,000 requests per day as of February 2016. 
> * How has the inception of Sci-Hub affected the quality and quantity of research in developing nations? Do more authors go on to write high-quality papers in their respective fields? 
> * Has the number of papers cited per paper increased (with access to a larger number of papers and therefore more information)? 
> * Does their sphere of influence increase? Do they get more post-doctoral offers abroad? (This can be partially answered by looking at the collaboration graph of a researcher)
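
The collaboration graph mentioned above can be sketched from nothing more than the `;`-delimited author strings stored in each record. The snippet below is a minimal illustration, not the analysis itself: the function names and the sample records are hypothetical, and it assumes that identical name strings refer to the same person, which real data will not guarantee.

```python
from collections import Counter
from itertools import combinations

def coauthor_edges(author_fields):
    """Count how often each pair of authors appears on the same paper."""
    edges = Counter()
    for field in author_fields:
        # Split on ';', trim whitespace, drop empty entries and duplicates.
        names = sorted({name.strip() for name in field.split(";") if name.strip()})
        # Every unordered pair of co-authors contributes one joint paper.
        edges.update(combinations(names, 2))
    return edges

def collaborator_counts(edges):
    """Number of distinct collaborators per author (the node degree)."""
    counts = Counter()
    for first, second in edges:
        counts[first] += 1
        counts[second] += 1
    return counts

# Hypothetical records in the ';'-delimited format of the Author column.
sample = [
    "A. Author; B. Author; C. Author",
    "A. Author; B. Author",
    "D. Author",
]
edges = coauthor_edges(sample)
degrees = collaborator_counts(edges)
print(edges.most_common(1))
print(degrees["A. Author"])
```

The weighted pairs could then be loaded into a graph library such as NetworkX (for example with `Graph.add_weighted_edges_from`) to study how a researcher's neighbourhood grows over time.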

In this work, we shall restrict our attention to academic papers. Our study shall be based on the ***database records*** of academic papers, ***not*** the actual academic papers themselves. The database, publicly available at http://booksdescr.org/dbdumps/scimag, contains over 70 million records of academic papers. Each record includes the paper title, author list, DOI, etc., which in turn can be used to query citation information, abstracts, etc. from Google Scholar, PubMed, ResearchGate, etc. The important point is: ***the database is relatively unanalyzed, and LibGen is perhaps the biggest such record of academic papers!*** The database is incredibly information-rich and can be used to reveal insights into publishing. 

* How does the paper quality of a researcher change over time in various fields? For example, it is widely believed in Mathematics that ground-breaking contributions come from researchers below 30-35 years of age. How does that vary in other fields, by country of origin, by institution, etc.? The intention of this study shall be to provide answers backed by analysis of actual data.

> Important Note:
>* Our study shall be based on the ***database records*** of academic papers, ***not*** the actual academic papers themselves. LibGen has been the subject of various formal academic studies. A few of them are listed below:
>  * Karaganis, Joe. *Shadow libraries: access to knowledge in global higher education*. MIT Press, 2018.
>  * Cabanac, Guillaume. *Bibliogifts in LibGen? A study of a text‐sharing platform driven by biblioleaks and crowdsourcing.* Journal of the Association for Information Science and Technology 67, no. 4 (2016): 874-884.   

Data presentation: 
* The intention is to build an interactive, insightful tool on academic publishing (see for example Hans Rosling's https://www.gapminder.org/tools on socio-economic growth across the globe over the years).

Tools:
* Python: Apache Spark's ML on large datasets, MySQL, NetworkX, etc.; JavaScript: D3.js

In [1]:
import numpy as np
import itertools as it
from collections import defaultdict
import pandas as pd
# None disables column truncation; the older value -1 is rejected by recent pandas versions
pd.set_option('display.max_colwidth', None)

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

import pyspark
import MySQLdb
import requests

In [2]:
# Connect to a MySQL import of LibGen's publicly available database dump
# (no host is given, so MySQLdb connects to the local server)
db=MySQLdb.connect(user="jaiswal0", passwd="dInc#2019", db="8k")

In [3]:
# We shall analyze a subset (~10.8 million records) of the entire database (>70 million).
num_entries = pd.read_sql('SELECT COUNT(*) FROM scimag', con=db)
print(num_entries)

   COUNT(*)
0  10795980


In [4]:
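# List the columns of the scimag table.
# DATABASE() returns the schema we are connected to, so the filter cannot drift from the db name used in connect().
cols = pd.read_sql('''SELECT column_name FROM information_schema.columns WHERE table_name='scimag' AND table_schema=DATABASE() ''', con=db)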
print(cols['COLUMN_NAME'].to_string(index=False).replace("\n", ""))

 AbstractURL Attribute1  Attribute2  Attribute3  Attribute4  Attribute5  Attribute6  Author      Day         DOI         DOI2        Filesize    First_page  ID          ISBN        ISSNE       ISSNP       Issue       Journal     JOURNALID   Last_page   MD5         Month       PII         PMC         PubmedID    TimeAdded   Title       visible     Volume      Year       


In [5]:
# Find entries with a PubmedID
art_pubmed = pd.read_sql('''
    SELECT Author, PubmedID, Year, Attribute1 FROM scimag 
    where PubmedID>0 
''', con=db)
print(art_pubmed[:10])

                                                                                                                                                                                                                      Author  \
0  Katherine M. Hegmann; Aimee S. Spikes; Avi Orr-Urtreger; Lisa G. Shaffer                                                                                                                                                    
1  Vijay Tonk; Nancy R. Schneider; Mauricio R. Delgado; Jen-i Mao; Roger A. Schultz                                                                                                                                            
2  Selma Siegel Witchel; Peter A. Lee; Massimo Trucco                                                                                                                                                                          
3  Orit Reish; Susan A. Berry; Gordon Dewald; Richard A. King                                           

In [6]:
# Number of entries with a pubmedID
len(art_pubmed)

1616069

### Nearly 1.6 million entries have a *PubmedID*. With these IDs, we can query abstract information from the NIH database
*The abstracts can be used to build bi-grams, and potentially an ML model for writing paper abstracts!*
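
As a rough illustration of the bi-gram idea, the sketch below counts adjacent word pairs across a list of abstracts using only the standard library. The function name, the simple lowercase alphanumeric tokenizer, and the two sample sentences are assumptions made for the example; a real pipeline would likely need better tokenization and stop-word handling.

```python
import re
from collections import Counter

def bigram_counts(abstracts):
    """Count adjacent word pairs (bi-grams) over a list of abstract strings."""
    counts = Counter()
    for text in abstracts:
        # Naive tokenizer: lowercase runs of letters and digits.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Pair each token with its successor.
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Hypothetical sample sentences standing in for fetched abstracts.
sample_abstracts = [
    "Chromosome analysis revealed an abnormal chromosome.",
    "Chromosome analysis on this child was normal.",
]
counts = bigram_counts(sample_abstracts)
print(counts.most_common(1))  # [(('chromosome', 'analysis'), 2)]
```

Because the counts are kept per pair, the same structure can feed a simple next-word model: for a given word, the most frequent second element among its pairs is the most likely continuation.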

In [7]:
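url = 'https://api.ncbi.nlm.nih.gov/lit/ctxp/v1/pubmed/?format=medline&id={id}'
headers = {'User-Agent': 'Hydra/1.3.15'}

def load_medline(info):
    """Parse a MEDLINE-formatted record and return its abstract field(s).

    A field starts with a tag of up to four characters followed by '-'
    (for example 'AB  - ...'); untagged lines continue the previous field.
    """
    infodict = defaultdict(list)
    tag, value = None, ''
    for line in info.split("\n"):
        if not line:
            continue
        # The length check avoids an IndexError on lines shorter than five characters.
        if len(line) > 4 and line[4] == "-":
            # A new field starts: store the field collected so far.
            if tag is not None:
                infodict[tag].append(value)
            tag, value = line[:4].strip(), line[5:]
        else:
            # Continuation line: append it to the current field.
            value += ' ' + line.strip() + ' '
    # Store the final field, which no later tag would otherwise flush.
    if tag is not None:
        infodict[tag].append(value)
    # None when the record has no abstract (AB) field.
    return {"Abstract": infodict.get("AB")}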

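def fetch_pubmed(pubmedID):
    """Fetch one PubMed record in MEDLINE format; return None on a non-200 response."""
    api_url = url.format(id=pubmedID)
    print(api_url)
    # A timeout keeps a stalled request from blocking the loop indefinitely.
    response = requests.get(api_url, headers=headers, timeout=30)
    if response.status_code == 200:
        return load_medline(response.content.decode('utf-8'))
    return None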

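# Fetch the abstract for the first three entries and store it in Attribute1
for entry in range(3): 
    pubmedID = art_pubmed.at[entry, 'PubmedID']
    info = fetch_pubmed(pubmedID)
    # Skip failed requests and records without an abstract.
    if info and info["Abstract"]:
        print(info["Abstract"], art_pubmed.at[entry, 'Year'])
        # .at writes into the DataFrame itself; chained indexing may write to a copy.
        # The abstract is a list of field values, so join it into one string.
        art_pubmed.at[entry, 'Attribute1'] = ' '.join(info["Abstract"])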


https://api.ncbi.nlm.nih.gov/lit/ctxp/v1/pubmed/?format=medline&id=8741910
[' A genetics evaluation was requested for a 6-week-old infant with multiple congenital  malformations including mild craniofacial anomalies, truncal hypotonia, hypospadias,  and a ventriculoseptal defect. Blood obtained for chromosome analysis revealed an  abnormal chromosome 4. Paternal chromosome analysis showed a 46,XY, inv ins  (3;4)(p21.32;q25q21.2), inv(4)(p15.3q21.2) karyotype. Therefore, the proband\'s  chromosome 4 was the unbalanced product of this insertional translocation from the  father resulting in partial monosomy 4q. Additionally, the derivative 4 had a  pericentric inversion which was also seen in the father\'s chromosome 4. During  genetic counseling, the proband\'s 2-year-old brother was evaluated. He was not felt  to be abnormal in appearance, but was described as having impulsive behavior.  Chromosome analysis on this child revealed 46,XY,der(3)inv  ins(3;4)(p21.32;q25q21.2)pat. This karyo
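
The loop above fetches only three records. Extending it to all entries with a PubmedID would need some pacing and a way to resume after an interruption. The sketch below shows one possible shape for that: `fetch_many`, the `pause_s` default, and the stand-in `fake_fetch` are all hypothetical, the second sample ID is made up for illustration, and the pause length is an assumption rather than a documented limit of the API.

```python
import time

def fetch_many(ids, fetch, pause_s=0.5, cache=None):
    """Call fetch(id) once per distinct ID, pausing between requests."""
    cache = {} if cache is None else cache
    for record_id in ids:
        if record_id in cache:
            continue  # already fetched, e.g. in an earlier interrupted run
        cache[record_id] = fetch(record_id)
        # Pause so that requests are spread out over time.
        time.sleep(pause_s)
    return cache

# Stand-in for a real fetcher such as fetch_pubmed; it only records its calls.
calls = []
def fake_fetch(record_id):
    calls.append(record_id)
    return {"Abstract": ["abstract of %s" % record_id]}

# The repeated ID is served from the cache instead of being fetched again.
results = fetch_many([8741910, 8741911, 8741910], fake_fetch, pause_s=0)
print(len(results), len(calls))
```

Passing the same `cache` dictionary to a later call lets a long run pick up where it stopped, since IDs already present are skipped.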

In [8]:
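# Find the names of all the authors "individually" (we split on the delimiter ';').
# The numbers table runs from 1 to 10, so at most the first 10 authors of each paper are returned.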
authors = pd.read_sql('''
    SELECT scimag.ID,
      SUBSTRING_INDEX(SUBSTRING_INDEX(scimag.Author, ';', numbers.n), ';', -1) Author
FROM
  (SELECT 1 n UNION ALL SELECT 2
   UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
  UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8
  UNION ALL SELECT 9 UNION ALL SELECT 10) numbers INNER JOIN scimag
  ON CHAR_LENGTH(scimag.Author)
     -CHAR_LENGTH(REPLACE(scimag.Author, ';', ''))>=numbers.n-1
    ORDER BY
  ID, n
''', con=db)
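
To make the query above easier to follow, here is a small pure-Python emulation of its logic. `substring_index` mirrors the behaviour of MySQL's `SUBSTRING_INDEX`, and `split_authors` reproduces the join condition (an author position `n` is kept when the string has at least `n-1` delimiters). The helper names and the `max_authors` parameter are introduced only for this illustration.

```python
def substring_index(text, delim, count):
    """Emulate MySQL SUBSTRING_INDEX(text, delim, count)."""
    parts = text.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter.
        return delim.join(parts[:count])
    if count < 0:
        # Everything to the right of the |count|-th delimiter from the end.
        return delim.join(parts[count:])
    return ""

def nth_author(author_field, n):
    """Inner call keeps the first n authors; outer call keeps the last of those."""
    return substring_index(substring_index(author_field, ";", n), ";", -1)

def split_authors(author_field, max_authors=10):
    """One entry per author position 1..max_authors that exists in the field."""
    n_delims = author_field.count(";")
    return [nth_author(author_field, n)
            for n in range(1, max_authors + 1)
            if n_delims >= n - 1]

print(split_authors("Selma Siegel Witchel; Peter A. Lee; Massimo Trucco"))
```

Note that the pieces after the first keep the space that follows each `;`, so a `TRIM` in SQL or a `strip()` afterwards may be worth considering before names are compared.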

In [9]:
len(authors)

30867195

In [10]:
authors['Author'][0]

'Yasuhiro Itagaki'

### ***What are the most common author names in academia?***

In [None]:
# We further split every author's name at whitespace, giving one name part per row.
# explode() avoids the quadratic cost of concatenating millions of lists with sum().
name_parts = authors['Author'].str.split().explode()
# Keep parts longer than 3 characters to skip initials and short particles.
name_parts[name_parts.str.len() > 3].value_counts()[:10].plot(kind='bar')