## Examples using the CORE API

Before you proceed with the practical examples, you first need to authenticate your Google account by running the code below. When the code stops running, it produces a long URL and displays a text box. Click on the URL and you will be directed to another page, which contains an ID string. Copy this ID, come back to this page and paste it into the text box. Then press "Enter" or "Return" on your keyboard.

Here is a screenshot showing the long URL and the code as I have pasted it into the text area. 

![google_authentication](https://lh3.google.com/u/0/d/1JBA2JvXdd19P5GAimUPFf9JByZCCuaX1=w1439-h780-iv2)

You will need to repeat this procedure every time you return to this page after being away for some time. 

In the block below, place your cursor at `[ ]` and click on the "Play" button. 

In [0]:
import pandas


#
# Replace the assignment below with your file ID
# to download a different file.
#
# A file ID looks like: 1uBtlaggVyWshwcyP6kEI-y_W3P8D26sz
file_id = 'target_file_id'

import io
from io import StringIO
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth
from googleapiclient.discovery import build

auth.authenticate_user()
drive_service = build('drive', 'v3')

def load_from_gdrive(file_id):
  request = drive_service.files().get_media(fileId=file_id)
  downloaded = io.BytesIO()
  downloader = MediaIoBaseDownload(downloaded, request)
  done = False
  while done is False:
    # _ is a placeholder for a progress object that we ignore.
    # (Our file is small, so we skip reporting progress.)
    _, done = downloader.next_chunk()

  # Only rewind and read once every chunk has been downloaded.
  downloaded.seek(0)
  return downloaded.read()

## What is CORE
### [CORE](https://core.ac.uk/) aggregates the world’s open access research papers. It offers seamless access to millions of open access research papers, enriches the collected data for text mining and provides unique services to the research community.


### CORE's collection is offered via the following services:

### CORE API
provides an access point for those who want to develop applications making use of our large collection of Open Access content. 

![the CORE API](https://github.com/oacore/cambridge-vs-oxford-repo-showdown/raw/master/resources/api.png?classes="float-left"  "The CORE API") 
 
### CORE Dataset
enables you to download all the aggregated content from Open Access journals and repositories. Use the dataset for analysis and application of computationally intensive processes.

![the CORE Dataset](https://github.com/oacore/cambridge-vs-oxford-repo-showdown/raw/master/resources/dataset.png "The CORE Dataset") 
 
### CORE Publisher connector
is software that provides seamless access to Gold and Hybrid Gold Open Access articles aggregated from non-standard systems of major publishers. Data is exposed using the ResourceSync protocol.

!["The CORE Publisher connector"](https://github.com/oacore/cambridge-vs-oxford-repo-showdown/raw/master/resources/pubconnector.png "The CORE Publisher connector") 

### [NEW] CORE Fast Sync 
provides fast incremental synchronisation for all of CORE’s data, so you can always keep your data in sync with CORE.
![the CORE Fast Sync](https://github.com/oacore/cambridge-vs-oxford-repo-showdown/raw/master/resources/corefastsync.png "The CORE Fast Sync") 


The following two sections: 

- *"How to download research outputs using the CORE API and where to start"* and 
- *"How to download all the articles from a single repository through a CORE API call"*

explain how to use the CORE API. However, you can solve all the practical exercises given below without registering for and using a CORE API key.  

### How to download research outputs using the CORE API and where to start:
First, you need to register for an API key at https://core.ac.uk/api-keys/register. You will need this key further down. 


At https://core.ac.uk/docs there is a Swagger page, which is an interactive page that allows you to run the CORE API directly in a browser window.

*Note:* The SDKs are mentioned here for reference only. To understand and apply them you will need a basic understanding of one of the programming languages listed below. 

CORE also provides several [Software Development Kits](https://en.wikipedia.org/wiki/Software_development_kit) (SDKs) to help you get started with its API. 
At the moment the following SDKs are available: 
- A [Java](https://en.wikipedia.org/wiki/Java_(programming_language)) client available at https://github.com/oacore/oacore4j.
- An [R](https://en.wikipedia.org/wiki/R_(programming_language)) client developed in collaboration with rOpenSci, available at https://github.com/ropensci/rcoreoa.
- A [Python](https://en.wikipedia.org/wiki/Python_(programming_language)) client available at https://github.com/oacore/pyoacore, which we will use in this example. Python lets us present the examples in Python notebooks, which are interactive demonstrations that everyone can test, practise and run. 

You can experiment with more examples in our GitHub repository https://github.com/oacore.



Now we need to install the libraries required to use the CORE API. Because this can be challenging for those with limited or no technical skills, the code shown below is for demonstration purposes only and is not executable here. If you want to find out more about how to download these libraries, please refer to the GitHub pages of the Java, R and Python clients listed above.  

This is the Java code: 

```java
OACoreService coreService = new OACoreService(readApiKey()); 
HashMap params = new HashMap<>(); 
params.put(ArticlesService.CITATIONS, Boolean.TRUE); 
Call request = coreService.getArticlesService().getArticleById(43, params); 
ArticleResponse article = request.execute().body();
```

This is the R Language code: 

```R
library("rcoreoa")
core_articles(id = 21132995)
```

This is the Python code:

```python 
from pyoacore.apis.articles_api import ArticlesApi
api = ArticlesApi()
result = api.articles_get_core_id_get(77398041)
```

For the purposes of this demonstration we have downloaded some ready-made data for you to use.

### How to download all the articles from a single repository through a CORE API call:
The following is a sample CORE API URL. 

`
https://core.ac.uk/api-v2/articles/search/repositories.id:REPO_ID?apiKey=APIKEY&page=PAGE&pageSize=SIZE
`

To download CORE articles, you need to replace the following placeholders in the URL:  
 - `REPO_ID`: replace this with the ID of the repository that you are interested in. A list of repositories and their CORE IDs is available at https://core.ac.uk/repositories 
 - `APIKEY`: replace this with your own API key (if you haven't got one yet, use the registration link given earlier on this page).  
 - `PAGE`: replace this with 0 (zero) and you will land on the first page. 
 - `SIZE`: replace this with the number of articles you want for each call. We recommend that you do not use a number larger than 20. If you are trying it for the first time, we recommend that you use 1.  
 
Here is an example of what a final URL looks like. I chose the Open University repository, Open Research Online, which has the CORE ID 86, and added my API key (well, sort of: this is private information that you should not share in public, so I removed some letters and numbers here and there). I want to land on the first page of the results and view ten results per page. 
![API_KEY](https://lh3.google.com/u/0/d/1zKfPnY7GopeF41ojq0EEi_1kmM3zbmUP=w2560-h1282-iv1)
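
As a minimal sketch of the substitution described above, the snippet below assembles the request URL in Python. The helper name `build_core_url` and the placeholder key `YOUR_API_KEY` are made up for illustration; the repository ID 86 and the paging values are the ones used in the example above.

```python
from urllib.parse import urlencode

# Base of the sample search URL shown above.
CORE_SEARCH_BASE = "https://core.ac.uk/api-v2/articles/search"


def build_core_url(repo_id, api_key, page=0, page_size=10):
    """Hypothetical helper: fill REPO_ID, APIKEY, PAGE and SIZE into the sample URL."""
    query = urlencode({"apiKey": api_key, "page": page, "pageSize": page_size})
    return f"{CORE_SEARCH_BASE}/repositories.id:{repo_id}?{query}"


# Repository 86 is Open Research Online; replace the placeholder with your own key.
url = build_core_url(86, "YOUR_API_KEY", page=0, page_size=10)
print(url)
```

Pasting the printed URL into a browser, with a real key in place of the placeholder, is one way to try the call by hand.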

 
 
 


To assist you with this example we have already downloaded a collection of articles from the University of Cambridge and University of Oxford repositories. We will use a library that is capable of dealing with large amounts of textual data to run a few examples in this notebook. 

#### How to create word clouds from the University of Cambridge publications

In the following block we use a sample of the CORE data from the University of Cambridge and the University of Oxford. You will be asked to use the latter in a practical exercise at the end of this example. (Tip: you will need to execute all the code boxes below in order of appearance.)

In [0]:
cambridge = pandas.read_csv(StringIO((load_from_gdrive('1SzDiPNFA8jO_6al8udW6Gl-ogJ1VVuwg').decode("utf-8") )),sep='\t', names=('ID', "Title", "Published date", "DOI"))
oxford = pandas.read_csv(StringIO((load_from_gdrive('1LYdq4p19R6afC4ufFf7xWDfcqfzm9pS1').decode("utf-8") )),sep='\t', names=('ID', "Title", "Published date", "DOI"))

print("Oxford total records: %d Cambridge total records: %d" % (oxford.Title.count(), cambridge.Title.count()))

The output above shows the number of records in each sample: 4298 for the University of Cambridge and 6132 for the University of Oxford. 

---



In this example we will create a word cloud from the titles of the publications we downloaded. Before we do that, we need to install a tool that will help us build the word cloud, using the following code. 

In [0]:
!pip install wordcloud 

Now that we have installed the word cloud tool, we can proceed and ask for the University of Cambridge publications. To do that, we insert a new line of code and type in `cambridge`.

In [0]:
cambridge 

As you can see, the sample we are using contains a table with the CORE ID, title, date and DOI.

**Note:** Pay attention to the "Published date" field. What do you notice? How is the data formatted? 
As you can see, the dates in this column appear in many different formats. This is problematic because it introduces a lot of noise, both in CORE's harvesting process and in text mining practices.  

We now need to normalise the date field and try to give the same format to all dates. To accomplish this, we need to run the following code, where we are modifying all the dates to be represented only by the year. This will help us with the filtering demonstrated in the next blocks.

In [0]:
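from dateutil.parser import parse
from datetime import datetime

def date_conv(x):
    # Reduce a free-form date value to its year.
    try:
        # fuzzy=True skips text that is not part of a date;
        # default supplies any fields missing from the string.
        return parse(str(x), fuzzy=True, default=datetime(1970, 1, 1)).year
    except Exception:
        # Fall back to the first four characters when parsing fails.
        return str(x)[:4]

cambridge["year"] = cambridge["Published date"].apply(date_conv)
oxford["year"] = oxford["Published date"].apply(date_conv)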
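
To see what this normalisation does, here is a small sketch that applies the same parsing approach to a few date strings. The sample strings and the helper name `to_year` are invented for illustration and are not taken from the CORE data.

```python
from datetime import datetime

from dateutil.parser import parse


def to_year(value):
    """Illustrative helper: reduce a free-form date string to its year."""
    try:
        return parse(str(value), fuzzy=True, default=datetime(1970, 1, 1)).year
    except Exception:
        # Same fallback as above: keep the first four characters as text.
        return str(value)[:4]


# Made-up examples of the kinds of variation a date column can contain.
samples = ["2016-05-01T00:00:00Z", "May 2017", "2015"]
for sample in samples:
    print(sample, "->", to_year(sample))
```

When a value cannot be parsed, the fallback returns a string rather than a number, which is why a later step converts the `year` column with `pandas.to_numeric`.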

Let's try to build a word cloud with the University of Cambridge publications. Since the University of Cambridge repository, Apollo, is very big and has a large corpus of information we will choose the outputs with a full text that were published from 2016 onwards. 

In [0]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import random
stopwords = ["review","based","model","effect","analysis","research","study","using","a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also","although","always","am","among", "amongst", "amoungst", "amount",  "an", "and", "another", "any","anyhow","anyone","anything","anyway", "anywhere", "are", "around", "as",  "at", "back","be","became", "because","become","becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom","but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven","else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own","part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", 
"should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the"]

# Convert the year to a number; values that cannot be converted become NaN.
cambridge["year_float"] = pandas.to_numeric(cambridge['year'], errors='coerce')
cambridge1617 = cambridge[cambridge.year_float >= 2016]
# Join all the selected titles into one long string.
cambridgetitles = cambridge1617.Title.str.cat(sep=" ")
wordcloud = WordCloud(
                          stopwords=stopwords,
                          background_color='white',
                          width=1200,
                          height=1000
                         ).generate(cambridgetitles)

# Colour function: hue 0 at full saturation with a random lightness,
# so despite its name it produces shades of red rather than grey.
def grey_color_func(word, font_size, position, orientation, random_state=None,
                    **kwargs):
    return "hsl(0, 100%%, %d%%)" % random.randint(10, 70)

# store default colored image
default_colors = wordcloud.to_array()

fig, ax = plt.subplots(figsize=(15, 10))
plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=3),
           interpolation="bilinear")
plt.axis('off')

plt.show()

**Practical exercise:** Can you spot in the code above where we are indicating that we want the publications from 2016 onwards? Can you change this to 2014? (*tip:* the publication date is marked with green font.)

##### Congratulations! You have done it! 
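
If you are curious how a word cloud decides which words to draw largest: roughly speaking, it counts how often each word appears once the stopwords are removed. The sketch below does that counting by hand with Python's standard library. The three titles and the short stopword list are made up for illustration, and the `wordcloud` library does additional processing that is not shown here.

```python
import re
from collections import Counter

# Made-up titles standing in for the publication titles.
titles = [
    "Deep learning for protein structure",
    "Learning from protein data",
    "A study of deep networks",
]
# A tiny illustrative stopword list; the real one above is much longer.
example_stopwords = {"a", "for", "from", "of", "study"}

# Lower-case the text, split it into words and drop the stopwords.
words = re.findall(r"[a-z]+", " ".join(titles).lower())
counts = Counter(word for word in words if word not in example_stopwords)

# The most frequent words are the ones a word cloud would draw largest.
print(counts.most_common(3))
```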

**Practical exercise:** Try this yourself! Apart from the University of Cambridge data, we have also imported the University of Oxford data for you. How can you build a word cloud from it? 
- Try inserting a new line further down to call `oxford` and get the table with the publication data. Do not forget to push "Play".
- Add a new line and copy-paste the code to normalise the dates. 
- Add a new line and copy-paste the code to create the word cloud. Notice that you need to change some parts of the code, substituting the same word in every case.

Need more help? Click [here](https://drive.google.com/file/d/1DNAbTwYfLw1NfFbEujm5qHyxbuHSpd3E/view?usp=sharing) to see the solution to this exercise. 
