# Journal Analysis of the LOC-DB Project


This notebook analyses the journals covered by the LOC-DB project. We concentrate on the social sciences journals licensed by the Mannheim University Library and on articles published in 2011.

## Requirements and Etiquette

You need Jupyter and Python installed, and within Python you need the `crossrefapi` package, which can be installed for example like this:

In [14]:
!pip install crossrefapi -q

In [1]:
from crossref.restful import Works, Journals, Etiquette
my_etiquette = Etiquette('LOC-DB', '', 'https://github.com/locdb/locdb-journal-analysis/blob/master/locdb-journals.ipynb', 'Philipp Zumstein (Mannheim University Library)')
str(my_etiquette)

'LOC-DB/ (https://github.com/locdb/locdb-journal-analysis/blob/master/locdb-journals.ipynb; mailto:Philipp Zumstein (Mannheim University Library)) BasedOn: CrossrefAPI/1.2.0'

## Journal List

We have 101 journals from social sciences licensed in 2011 with their ISSNs. We save them here into a list:

In [2]:
issnList =  ['0342-300X', '0033-5177', '0340-0425', '0023-2653', '0049-089x', '1550-3585', '0335-5322', '0065-2601', '0342-2275', '0092-6566', '0340-613x', '1869-8980', '0097-9740', '1864-9335', '0514-2776', '0038-6073', '0003-1224', '0018-7267', '0038-0164', '0027-3171', '0037-783X', '0035-2969', '0033-362X', '0038-0261', '0037-7732', '0022-2445', '0022-1031', '0038-0296', '0197-6664', '0894-3257', '0067-5830', '0146-1672', '0165-4896', '0278-016X', '0002-9602', '0007-1315', '0020-7152', '0011-3204', '0018-7259', '0037-8046', '0019-8676', '0022-3506', '0163-786X', '0378-8733', '0539-0184', '0038-0385', '0038-0407', '0066-6505', '0171-5860', '0933-9361', '0735-2751', '0266-7215', '0276-5624', '0749-5978', '0174-0202', '0048-8046', '0343-4109', '0001-6993', '0197-3533', '0360-0572', '0162-895x', '0002-7642', '1231-1413', '0046-2772', '0022-250x', '0340-1804', '0021-8308', '0094-3061', '0049-1241', '0048-3931', '1536-867X', '0730-8884', '0948-423X', '1749-5679', '0340-918X', '1043-4631', '1469-5405', '0891-2432', '0950-0170', '1477-996X', '0141-9889', '0098-7921', '0304-2421', '0011-3921', '0032-3292', '0044-118x', '0863-1808', '0263-2764', '0002-7162', '0038-609x', '0037-7791', '0012-155X', '0958-9287', '0951-6328', '1438-5627', '1864-3361', '1360-7804', '1435-9871', '0959-6801', '1468-0181', '0019-8692']

In [3]:
# Activate only for testing:
#issnList =  ['0342-300X', '0033-5177']

In [3]:
numberOfJournals = len(issnList)
print(numberOfJournals)

101


## Number of Articles Published in 2011

In [4]:
inCrossref = {}
noDoisList = []

In [5]:
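k = 0    # number of journals with DOIs issued in 2011
sum = 0  # total number of DOIs issued in 2011 (note: this shadows the built-in sum())
for issn in issnList:
    # Look up the journal record; a falsy result is treated as "not in Crossref"
    journal = Journals(etiquette=my_etiquette).journal(issn)
    if journal:
        inCrossref[issn] = True
        if (journal['breakdowns'] and journal['breakdowns']['dois-by-issued-year']):
            # dois-by-issued-year is a list of [year, count] pairs
            dois = journal['breakdowns']['dois-by-issued-year']
            for doi in dois:
                if doi[0]==2011:
                    k += 1
                    sum += doi[1]
                    noDoisList.append(doi[1])
    else:
        inCrossref[issn] = False
# Average number of DOIs per journal in 2011
doisPerJournal = sum/(k * 1.0)
print(k, sum, doisPerJournal)
print("Max", max(noDoisList))
print("Min", min(noDoisList))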

# result: (86, 5737, 66.70930232558139)
# Max 511
# Min 13

86 5737 66.70930232558139
Max 511
Min 13


Thus, we found 86 of the 101 journals in Crossref with DOIs issued in 2011. Together they registered 5,737 DOIs for 2011, which gives an average of **67 DOIs per journal**. Assuming that every journal article gets a DOI, we can take this as an upper bound for the number of articles a journal publishes in a year. It is only an upper bound, because some journals register more DOIs than they publish articles.
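The lookup in the loop above can be tried offline with a small helper. This is only a sketch: the function name `dois_in_year` and the sample record are hypothetical, and only the `breakdowns` / `dois-by-issued-year` layout (a list of `[year, count]` pairs) is taken from the loop above.

```python
def dois_in_year(journal, year):
    """Return the number of DOIs issued in `year`, or None if there is no entry."""
    breakdowns = (journal or {}).get('breakdowns') or {}
    for issued_year, count in breakdowns.get('dois-by-issued-year') or []:
        if issued_year == year:
            return count
    return None


# Hypothetical record, shaped like the journal metadata used above.
sample_journal = {
    'breakdowns': {'dois-by-issued-year': [[2010, 40], [2011, 52], [2012, 47]]}
}

# A missing journal (None) contributes nothing to the average.
lookups = [dois_in_year(sample_journal, 2011), dois_in_year(None, 2011)]
counts = [count for count in lookups if count is not None]
average = sum(counts) / len(counts)
print(len(counts), sum(counts), average)
```

Returning `None` instead of `0` keeps journals without a 2011 entry out of the average, which mirrors how the loop above only increments `k` when a 2011 entry exists.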

## References in one Journal Article

Next, we count the number of references per journal article. This takes quite some time for the whole list of ISSNs. Therefore, we skip all journals that are not in Crossref, using the variable `inCrossref` created above. In the same loop we also collect some indicators of data quality.

In [7]:
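k = 0                  # articles with references published in 2011
z = 0                  # journals contributing at least one such article
sum = 0                # total number of references (note: this shadows the built-in sum())
onlyUnstructured = 0   # references consisting only of 'key' and 'unstructured'
hardToResolve = 0      # other references without DOI, title or first page
closedReferences = 0   # articles with a reference count but no open 'reference' list
occurrences = {}       # how often each field label occurs across all references
examplesUnstructured = []
examplesHard = []
examplesClosedReferences = []
publisherClosedReferences = {}  # closed-reference articles per publisher
noReferencesList = []  # reference count of every article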
for issn in issnList:
    if (inCrossref[issn]):
        works = Journals(etiquette=my_etiquette).works(issn).filter(has_references="true").filter(from_pub_date=2011).filter(until_pub_date=2011)
        for article in works:
            nref = article['reference-count']
            k += 1
            sum += nref
            noReferencesList.append(nref)
            if ('reference' in article):
                for reference in article['reference']:
                    for label in reference.keys():
                        if (label in occurrences):
                            occurrences[label] += 1
                        else:
                            occurrences[label] = 1
                    if ('key' in reference and 'unstructured' in reference and len(reference)==2):
                        if (onlyUnstructured<10):
                            examplesUnstructured.append("First examples of a reference with only unstructured text: " + str(reference) + " from the article " + article['DOI'])
                        onlyUnstructured += 1
                    else:
                        if ('DOI' not in reference and 'article-title' not in reference and  'volume-title' not in reference and 'first-page' not in reference):
                            if (hardToResolve<10):
                                examplesHard.append("First examples of a hard case but not with only unstructured text: " + str(reference) + " from the article " + article['DOI'])
                            hardToResolve += 1
            else:
                if (closedReferences<10):
                    examplesClosedReferences.append(article['DOI'])
                closedReferences += 1
                if (article['publisher'] in publisherClosedReferences):
                    publisherClosedReferences[article['publisher']] += 1
                else:
                    publisherClosedReferences[article['publisher']] = 1
        if (works.count()>0): z += 1

referencePerArticle = sum/(k * 1.0)
print(k, z, sum, referencePerArticle)
print("Max" , max(noReferencesList))
print("Min", min(noReferencesList))

# result: (3835, 80, 169361, 44.1619295958279)
# Max 272
# Min 1

3835 80 169361 44.1619295958279
Max 272
Min 1


Crossref has references for k=3,835 journal articles (coming from z=80 different journals), summing up to sum=169,361 single references, i.e. **44 references on average per article**. Journals that are not in Crossref, or that have no references deposited there, are left out of this average: their articles also have reference lists, which are simply missing from Crossref, so counting them as zero would bias the average downwards.

## Estimation of the Number of Overall References

We can now combine these numbers to estimate the total number of references in all articles that appeared in 2011 in any of the 101 journals:

In [8]:
numberOfJournals*doisPerJournal*referencePerArticle

297547.1627816015

Thus, we can estimate that there are about **298,000 references** in 2011 for all our journals in social sciences.
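The arithmetic behind this estimate can be reproduced from the printed counts alone, without calling the API. The snippet below is a minimal sketch; the snake_case variable names are chosen here for illustration, while the numbers are the ones printed in the cells above.

```python
import math

number_of_journals = 101         # length of issnList
journals_with_2011_dois = 86     # k from the DOI loop
dois_2011 = 5737                 # DOIs issued in 2011 by those journals
articles_with_references = 3835  # k from the reference loop
references_total = 169361        # references counted for those articles

dois_per_journal = dois_2011 / journals_with_2011_dois
references_per_article = references_total / articles_with_references

# journals x (articles per journal) x (references per article)
estimate = number_of_journals * dois_per_journal * references_per_article
print(round(estimate))
```

The estimate extrapolates both averages to all 101 journals, including those for which Crossref has no DOIs or no references.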

## Closed References in Crossref

An entry in Crossref may match the filter `has_references` even though no references are openly shared in its `reference` field. We counted these cases above and saved some examples:

In [10]:
print(closedReferences)
print(publisherClosedReferences)
examplesClosedReferences

1317
{u'Max Planck Institute for Demographic Research': 58, u'Elsevier BV': 603, u'Hogrefe Publishing Group': 34, u'Nomos Verlag': 27, u'University of Chicago Press': 187, u'Guilford Publications': 43, u'Society for Applied Anthropology': 39, u'Oxford University Press (OUP)': 226, u'Annual Reviews': 47, u'Duncker & Humblot GmbH': 53}


[u'10.1007/s11578-011-0123-0',
 u'10.1007/s11578-011-0124-z',
 u'10.1007/s11578-011-0128-8',
 u'10.1007/s11578-011-0126-x',
 u'10.1007/s11578-011-0129-7',
 u'10.1007/s11578-011-0130-1',
 u'10.1007/s11578-011-0132-z',
 u'10.1007/s11578-011-0133-y',
 u'10.1007/s11578-011-0134-x',
 u'10.1007/s11578-011-0135-9']
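To see which publishers account for most of the closed references, the dictionary printed above can be sorted by count. This is a small sketch that copies the printed values into a literal; the variable names are chosen here for illustration.

```python
# Values copied from the publisherClosedReferences output above.
publisher_closed = {
    'Max Planck Institute for Demographic Research': 58,
    'Elsevier BV': 603,
    'Hogrefe Publishing Group': 34,
    'Nomos Verlag': 27,
    'University of Chicago Press': 187,
    'Guilford Publications': 43,
    'Society for Applied Anthropology': 39,
    'Oxford University Press (OUP)': 226,
    'Annual Reviews': 47,
    'Duncker & Humblot GmbH': 53,
}

total_closed = sum(publisher_closed.values())

# Sort publishers by their number of closed-reference articles, largest first.
ranking = sorted(publisher_closed.items(), key=lambda item: item[1], reverse=True)
for publisher, count in ranking:
    print(f"{publisher}: {count} ({100 * count / total_closed:.0f}%)")
```

The per-publisher counts add up to the value of `closedReferences` printed above, which is a quick consistency check on the two counters.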

## Data Quality

In the loop above we also measured several indicators of data quality.

In [11]:
print(onlyUnstructured)
print(hardToResolve)
occurrences

7302
1370


{u'DOI': 59726,
 u'ISSN': 1,
 u'article-title': 21199,
 u'author': 73124,
 u'doi-asserted-by': 59726,
 u'edition': 763,
 u'first-page': 40145,
 u'issn-type': 1,
 u'issue': 11751,
 u'journal-title': 35229,
 u'key': 111490,
 u'series-title': 27,
 u'unstructured': 20310,
 u'volume': 34069,
 u'volume-title': 38623,
 u'year': 73205}

Out of these 169,361 references, 7,302 (4%) have only unstructured information, and another 1,370 have only sparse data (no DOI, no title, no first page), which makes the identification of these publications really hard.
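The two categories counted in the loop above can also be expressed as a small stand-alone classifier. This is a sketch of the same conditions; the function name `classify_reference` and the three labels are hypothetical, and the last sample reference is invented for illustration.

```python
def classify_reference(reference):
    """Classify a reference dict using the same conditions as the loop above."""
    # Only the citation key and a free-text string are present.
    if set(reference) == {'key', 'unstructured'}:
        return 'only-unstructured'
    # None of the fields that help to identify the cited work is present.
    identifying = ('DOI', 'article-title', 'volume-title', 'first-page')
    if not any(field in reference for field in identifying):
        return 'hard'
    return 'ok'


samples = [
    # Two references taken from the example outputs in this notebook.
    {'unstructured': 'Robert, P.: Measuring income in public opinion survey data.',
     'key': '9636_CR12'},
    {'journal-title': 'Social Networks', 'key': 'bibr34-0003122410396196',
     'author': 'Faris Robert'},
    # Hypothetical reference with a DOI.
    {'key': 'ref-1', 'DOI': '10.0000/example', 'year': '2009'},
]

for reference in samples:
    print(classify_reference(reference), reference['key'])
```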

The most frequent structured fields of interest are:
* 73,205 (43%) have a year
* 73,124 (43%) have an author (possibly multiple authors)
* 59,822 (35%) have a title (article-title or volume-title)
* 59,726 (35%) have a DOI
* 40,145 (24%) have a first-page
* 35,229 (21%) have a journal-title
* 34,069 (20%) have a volume (number)
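The shares in this list can be recomputed from the `occurrences` output. The sketch below copies the relevant counts into a literal; the variable names are chosen here for illustration, and adding the two title counts assumes that no reference carries both `article-title` and `volume-title`.

```python
total_references = 169361

# Counts copied from the occurrences output above.
occurrences = {
    'year': 73205,
    'author': 73124,
    'DOI': 59726,
    'first-page': 40145,
    'journal-title': 35229,
    'volume': 34069,
    'article-title': 21199,
    'volume-title': 38623,
}

# Assumption: a reference has at most one of the two title fields.
title_count = occurrences['article-title'] + occurrences['volume-title']

fields = dict(occurrences, title=title_count)
shares = {label: round(100 * count / total_references) for label, count in fields.items()}

for label in ('year', 'author', 'title', 'DOI', 'first-page', 'journal-title', 'volume'):
    print(f"{label}: {fields[label]} ({shares[label]}%)")
```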

Here are some examples of references with only unstructured data and other references which are categorized as hard cases:

In [12]:
for e in examplesUnstructured:
    print(e)
    print("\n")

First examples of a reference with only unstructured text: {u'unstructured': u'Robert, P.: Measuring income in public opinion survey data. Paper presented at the ISA RC33 conference on social science methodology in the new millennium, Cologne, 2000', u'key': u'9636_CR12'} from the article 10.1007/s11135-011-9636-5


First examples of a reference with only unstructured text: {u'unstructured': u'Treiman, D.J.: Industrialization and social stratification. In: Laumann, E.O. (ed.) Social Stratification: Research and Theory for the 1970s, pp. 207\u2013234. Bobbs-Merrill, Indianapolis (1970)', u'key': u'9636_CR19'} from the article 10.1007/s11135-011-9636-5


First examples of a reference with only unstructured text: {u'unstructured': u'World Bank: Growth, poverty, and inequality. Eastern Europe and the Former Soviet Union. The International Bank for Reconstruction and Development/The World Bank, Washington, DC, (2005)', u'key': u'9636_CR20'} from the article 10.1007/s11135-011-9636-5


First

In [13]:
for e in examplesHard:
    print(e)
    print("\n")

First examples of a hard case but not with only unstructured text: {u'journal-title': u'Journal of Statistical Software', u'key': u'bibr53-0003122411407748', u'author': u'Sekhon Jasjeet S.'} from the article 10.1177/0003122411407748


First examples of a hard case but not with only unstructured text: {u'journal-title': u'Social Networks', u'key': u'bibr34-0003122410396196', u'author': u'Faris Robert'} from the article 10.1177/0003122410396196


First examples of a hard case but not with only unstructured text: {u'year': u'1992', u'journal-title': u'Streetwise: Race, Class, and Change in an Urban Community', u'key': u'bibr1-0003122411409705', u'author': u'Anderson Elijah'} from the article 10.1177/0003122411409705


First examples of a hard case but not with only unstructured text: {u'year': u'2010', u'journal-title': u'New York Times', u'key': u'bibr4-0003122411398443', u'author': u'Archibald Randal C.'} from the article 10.1177/0003122411398443


First examples of a hard case but not 

## Estimation of References in Books
Estimating the references in books is trickier, as we do not have a list of identifiers to start from. As an approximation, we can query all resources in Crossref that were published in 2011 and have a certain Crossref resource type.
The list of resource types in Crossref can be found here: http://api.crossref.org/types .
Problem: This is neither specific to the UniMA purchases nor domain-specific. Question: Can we do something with the property category-name?

In [3]:
# for monographs
works = Works(etiquette=my_etiquette).filter(has_references="true").filter(from_pub_date=2011).filter(until_pub_date=2011).filter(type="monograph")
noReferencesList = []
totalK = 0
k = 0
sum = 0
for work in works:
    nref = work['reference-count']
    k += 1
    sum += nref
    noReferencesList.append(nref)
totalK += k
print("Number of monographs returned: ", k)    
print("Max", max(noReferencesList))
print("Min", min(noReferencesList))
print("Sum of references: ", sum)
print("Average: ", sum/k)
# Figures apparently kept from an earlier run; they do not match the output below:
# Number of monographs returned:  3848
# Max 166
# Min 1
# Sum of references:  170141
# Average:  44.21543659043659

Number of monographs returned:  13
Max 166
Min 1
Sum of references:  780
Average:  60.0


In [4]:
# for books
works = Works(etiquette=my_etiquette).filter(has_references="true").filter(from_pub_date=2011).filter(until_pub_date=2011).filter(type="book")
noReferencesListBooks = []
k = 0
sum = 0
for work in works:
    nref = work['reference-count']
    k += 1
    sum += nref
    noReferencesListBooks.append(nref)
totalK +=k
print("Number of books returned: ", k)    
print("Max", max(noReferencesListBooks))
print("Min", min(noReferencesListBooks))
print("Sum of references: ", sum)
print("Average: ", sum/k)
# Figures apparently kept from an earlier run; they do not match the output below:
# Number of books returned:  3862
# Max 54
# Min 2
# Sum of references:  170546
# Average:  44.160020714655616

Number of books returned:  14
Max 54
Min 2
Sum of references:  405
Average:  28.928571428571427


We can also combine these two estimates:

In [5]:
print("Total monographs and books in 2011 with references", totalK)
totalBooksList = noReferencesList + noReferencesListBooks
print("Min References", min(totalBooksList))
print("Max References", max(totalBooksList))
# The built-in sum() cannot be called here, because the name `sum` was reused as a variable above


Total monographs and books in 2011 with references 27
Min References 1
Max References 166
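Even though the name `sum` was rebound to a number in the cells above, the built-in function is still reachable through the standard `builtins` module. The sketch below simulates that situation; the two short lists are hypothetical stand-ins for `noReferencesList` and `noReferencesListBooks`, whose full contents are not shown in this notebook.

```python
import builtins

# Simulate the notebook state: `sum` now holds a number, not the built-in function.
sum = 405

# Hypothetical stand-ins for noReferencesList and noReferencesListBooks.
monograph_refs = [166, 1, 60]
book_refs = [54, 2, 30]
combined = monograph_refs + book_refs

# builtins.sum is unaffected by the rebinding of the name `sum`.
total_refs = builtins.sum(combined)
print("Sum of references:", total_refs)
print("Average:", total_refs / len(combined))

# Combined average from the totals printed above: 780 + 405 references in 13 + 14 works.
combined_average = (780 + 405) / (13 + 14)
print("Combined average from printed totals:", combined_average)
```

Alternatively, `del sum` in the notebook removes the shadowing variable so that the built-in becomes visible again.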


## Acknowledgement

This Jupyter notebook is heavily influenced by the nice demo [crossref-api-demo.ipynb](https://github.com/CrossRef/rest-api-doc/blob/master/demos/crossref-api-demo.ipynb) by Geoffrey Bilder. It is based on the Python library [crossrefapi](https://github.com/fabiobatalha/crossrefapi) by Fabio Batalha, and all of this builds on the great [Crossref API](https://github.com/CrossRef/rest-api-doc).