# Natural Language Processing of the bioRxiv
## on Synthetic Biology

**Author:** 'Felipe Millacura'
    
**Date:** '30th March 2021'

## General aims

* Extract web content (*Web Scraping*) from the open preprint server [bioRxiv](https://www.biorxiv.org/)
* Analyse manuscript titles and abstracts using `nltk`
* Generate word clouds for visual representation


## Web Scraping

While *web scraping* can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. 

Scraping a webpage involves two steps: `fetching` and `extracting`. **Fetching** is the downloading of a page (which a browser does whenever a user views it). Once the page has been fetched, **extraction** can take place: the content may be parsed, searched, reformatted, copied into a spreadsheet, and so on.

The scraping software behaves like a human visitor, extracting the data that is visible on the webpage. Automating this process allows us to extract large volumes of data quickly and accurately.

## Python libraries

There are many Python libraries for web scraping (*Source: [Python Web Scrapping - Kite, 2020](https://www.youtube.com/watch?v=zucvHSQsKHA)*)

 <img src="./images/scraping.png" alt="Drawing" style="width: 600px;"/> 

Here we will use **[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)**

We will start by installing the two libraries we need: `requests`, which allows us to load URLs, and `bs4`, which includes `BeautifulSoup`

In [None]:
!pip install requests 
!pip install bs4

In [3]:
import requests 
from bs4 import BeautifulSoup

The first thing we must do is **load the HTML code** from the URL of the page of interest. Here we will retrieve the **latest publications** from [bioRxiv](https://www.biorxiv.org/)

Note that the search query is spelled out in the URL, so we could modify it directly. Here, however, we will go straight to the **Synthetic Biology** collection

```https://www.biorxiv.org/collection/synthetic-biology```
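
Because the collection name is part of the URL, switching to another collection only means changing the last path segment. A minimal sketch, assuming other collections follow the same lowercase, hyphen-separated pattern as `synthetic-biology` (the helper `collection_url` is hypothetical and not part of any library):

```python
BASE_URL = 'https://www.biorxiv.org/collection/'

def collection_url(name):
    """Build a collection URL from a human-readable collection name."""
    # Assumption: slugs are lowercase words joined by hyphens
    slug = '-'.join(name.lower().split())
    return BASE_URL + slug

print(collection_url('Synthetic Biology'))
```

It is worth opening the resulting URL in a browser first, since not every name is guaranteed to map to an existing collection.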

In this case, we store the URL in a string variable

In [4]:
url  = 'https://www.biorxiv.org/collection/synthetic-biology'
print(url)

https://www.biorxiv.org/collection/synthetic-biology


Once we have our link, we can request the `HTML` code of the page by using [`requests.get()`](https://requests.readthedocs.io/en/master/user/quickstart/)

In [5]:
%%time
html_page = requests.get(url) 

Wall time: 1.21 s
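
A request can hang or come back with an error page, so it can help to wrap the call. The sketch below is one possible helper, not part of the original notebook: the name `fetch_html` and the 10-second default are assumptions. It relies on the `timeout` argument of `requests.get` and on `Response.raise_for_status()`, and accepts an optional `get` callable so that it can be tried without network access.

```python
def fetch_html(url, get=None, timeout=10):
    """Download a page and return its HTML text, failing loudly on HTTP errors."""
    if get is None:
        # Imported here so the helper can be exercised with a stand-in `get`
        import requests
        get = requests.get
    response = get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()           # raises on 4xx and 5xx responses
    return response.text

# Usage (needs network access):
# html_text = fetch_html('https://www.biorxiv.org/collection/synthetic-biology')
```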


In [6]:
html_page.text

'<!DOCTYPE html>\n<html lang="en" dir="ltr" \n  xmlns="http://www.w3.org/1999/xhtml"\n  xmlns:mml="http://www.w3.org/1998/Math/MathML">\n  <head prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book#" >\n    <!--[if IE]><![endif]-->\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<link rel="dns-prefetch" href="//stats.g.doubleclick.net" />\n<link rel="dns-prefetch" href="//d33xdlntwy0kbs.cloudfront.net" />\n<link rel="dns-prefetch" href="//www.google-analytics.com" />\n<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=3, minimum-scale=1, user-scalable=yes" />\n<link rel="shortcut icon" href="https://www.biorxiv.org/sites/default/files/images/favicon.ico" type="image/vnd.microsoft.icon" />\n<meta name="description" content="bioRxiv - the preprint server for biology, operated by Cold Spring Harbor Laboratory, a research and educational institution" />\n<meta name="generator" content="Drupal 7 (

Once the HTML plain text is loaded, we must start a `Parser` from `BeautifulSoup`. A parser is a **syntactic analyser**.

In this case BeautifulSoup allows us to work with different syntaxes, such as `HTML` or `XML`

In [7]:
soup = BeautifulSoup(html_page.text, 'html.parser')

Now we are able to extract the information we need. The most direct way to do it is from a web browser:

1. Open the web page to analyse
2. Right click on the item of interest <img src = "./images/biorxiv_1.png" alt = "Drawing" style = "width: 600px;" />
3. Click on inspect <img src = "./images/biorxiv_2.png" alt = "Drawing" style = "width: 600px;" />
4. Check the HTML tags to access the content directly in our scraper

In this example we will extract the manuscript titles, so we must select the links tagged as ``<a>`` whose class name is ``highwire-cite-linked-title``

In [32]:
titles = soup.find_all('a', attrs = {'class': 'highwire-cite-linked-title'})

The function ``find_all()`` returns a list-like ``ResultSet`` whose elements are of type ``bs4.element.Tag``, one for each match of our query.

In [33]:
for title in titles:
    print(title.text.strip())
    break

A synthetic receptor platform enables rapid and portable monitoring of liver dysfunction via engineered bacteria.


And there you have it!
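
The same query can be tried offline on a small HTML string. This is a toy sketch: the class name comes from the page above, while the titles and `href` values are placeholders rather than real preprints.

```python
from bs4 import BeautifulSoup

# Toy HTML that mimics the title links on the collection page
sample_html = """
<ul>
  <li><a class="highwire-cite-linked-title" href="/content/example-1">  First example title </a></li>
  <li><a class="other-link" href="/about">About</a></li>
  <li><a class="highwire-cite-linked-title" href="/content/example-2">Second example title</a></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_titles = sample_soup.find_all('a', attrs={'class': 'highwire-cite-linked-title'})

# Each match is a Tag: .text gives the visible text, ['href'] the link target
extracted = [(tag.text.strip(), tag['href']) for tag in sample_titles]
print(extracted)
```

Only the two links carrying the requested class are returned; the `other-link` anchor is ignored.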

### Exercise: Obtaining the paper links and information

In this case we are going to read the HTML container in a slightly more general way. Looking at the page source, each result is a list item tagged ``<li>`` whose class is one of ``first odd``, ``even``, ``odd`` or ``last even``

In [89]:
results = soup.find_all('li', attrs={'class': ['first odd', 'even', 'odd', 'last even']})

In [93]:
results

[<li class="first odd"><div class="highwire-article-citation highwire-citation-type-highwire-article tooltip-enable" data-apath="/biorxiv/early/2021/03/29/2021.03.24.436753.atom" data-hw-abstract-tooltip-instance="highwire_abstract_tooltip" data-node-nid="1869724" data-pisa="biorxiv;2021.03.24.436753v1" data-pisa-master="biorxiv;2021.03.24.436753" data-seqnum="1869724" data-url="/highwire/article_citation_preview/1869724" id="node1869724" title="A synthetic receptor platform enables rapid and portable monitoring of liver dysfunction via engineered bacteria."><div class="highwire-cite highwire-cite-highwire-article highwire-citation-biorxiv-article-pap-list-overline clearfix">
 <span class="highwire-cite-title">
 <a class="highwire-cite-linked-title" data-hide-link-title="0" data-icon-position="" href="/content/10.1101/2021.03.24.436753v1"><span class="highwire-cite-title">A synthetic receptor platform enables rapid and portable monitoring of liver dysfunction via engineered bacteria.</

In [124]:
# Find the title span in every result, skipping list items that have none
spans = (res.find('span', attrs={'class': 'highwire-cite-title'}) for res in results)
s = (span for span in spans if span is not None)

for y in s:
    print(y)

<span class="highwire-cite-title">
<a class="highwire-cite-linked-title" data-hide-link-title="0" data-icon-position="" href="/content/10.1101/2021.03.24.436753v1"><span class="highwire-cite-title">A synthetic receptor platform enables rapid and portable monitoring of liver dysfunction via engineered bacteria.</span></a> </span>
<span class="highwire-cite-title">
<a class="highwire-cite-linked-title" data-hide-link-title="0" data-icon-position="" href="/content/10.1101/2021.03.28.437402v1"><span class="highwire-cite-title">Combining evolutionary and assay-labelled data for protein fitness prediction</span></a> </span>
<span class="highwire-cite-title">
<a class="highwire-cite-linked-title" data-hide-link-title="0" data-icon-position="" href="/content/10.1101/2021.02.19.432011v2"><span class="highwire-cite-title">Bulk-assembly of monodisperse coacervates and giant unilamellar vesicles with programmable hierarchical complexity</span></a> </span>
<span class="highwire-cite-title">
<a clas

We create three empty lists to save the information

In [91]:
titles_list  = [] # to save titles
authors_list = [] # to save authors
pdf_link_list = [] # to save links


Now we iterate over `results`, going tag by tag until we reach the content we need

In [131]:
for res in results:
    # NOTE: every generator below loops over all of `results` again, so each pass of
    # this outer loop appends the titles and authors of the whole page once more
    title = (r.find('span', attrs={'class': 'highwire-cite-title'}) for r in results
             if r.find('span', attrs={'class': 'highwire-cite-title'}) is not None)  # access the title tags
    link_content = (r.find('span', attrs={'class': 'highwire-cite-title'}) for r in results
                    if r.find('span', attrs={'class': 'highwire-cite-title'}) is not None)  # title tags again, for the links
    link_content = (link.find('a') for link in link_content)  # then the <a> tag within each title span
    links = ['https://www.biorxiv.org' + link.attrs.get('href') for link in link_content]  # extract the href attribute
    authors = (r.find('span', attrs={'class': 'highwire-citation-authors'}) for r in results
               if r.find('span', attrs={'class': 'highwire-citation-authors'}) is not None)  # access the authors tags

    # Saves the paper titles
    for tit in title:
        titles_list.append(tit.text.strip())

    # Saves the authors
    for aut in authors:
        authors_list.append(aut.text.strip())

    # links[0] is the link to the first paper on the page (its article page, not a PDF file)
    pdf_link_list.append(links[0])
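
The loop above scans all of `results` again on every pass, so the lists end up with repeated entries and different lengths. A single-pass alternative could look like the sketch below. It runs on toy markup that reuses the class names from the page source; the titles, authors and `href` values are placeholders, and `extract_papers` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

sample_html = """
<ul>
  <li class="first odd">
    <span class="highwire-cite-title"><a class="highwire-cite-linked-title" href="/content/example-1"><span class="highwire-cite-title">Example title one</span></a></span>
    <span class="highwire-citation-authors">A. Author, B. Author</span>
  </li>
  <li class="even">no paper in this item</li>
  <li class="last even">
    <span class="highwire-cite-title"><a class="highwire-cite-linked-title" href="/content/example-2"><span class="highwire-cite-title">Example title two</span></a></span>
    <span class="highwire-citation-authors">C. Author</span>
  </li>
</ul>
"""

def extract_papers(soup, base_url='https://www.biorxiv.org'):
    """Return parallel lists of titles, authors and links, one entry per paper."""
    titles, authors, links = [], [], []
    for item in soup.find_all('li'):
        title_tag = item.find('span', attrs={'class': 'highwire-cite-title'})
        if title_tag is None:
            continue  # this list item is not a paper entry
        link_tag = title_tag.find('a')
        authors_tag = item.find('span', attrs={'class': 'highwire-citation-authors'})
        titles.append(title_tag.text.strip())
        authors.append(authors_tag.text.strip() if authors_tag is not None else '')
        links.append(base_url + link_tag['href'] if link_tag is not None else '')
    return titles, authors, links

toy_titles, toy_authors, toy_links = extract_papers(BeautifulSoup(sample_html, 'html.parser'))
print(toy_titles, toy_authors, toy_links)
```

Because each paper is visited exactly once, the three lists always have the same length and can be placed side by side in a table.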

In [133]:
authors_list

['Hung-Ju Chang, Ana Zuniga, Ismael Conejero, Peter L Voyvodic, Jerome Gracy, Elena Fajardo-Ruiz, Martin Cohen-Gonsaud, Guillaume Cambray, Georges-Philippe Pageaux, Magdalena Meszaros, Lucy Meunier, Jerome Bonnet',
 'Chloe Hsu, Hunter Nisonoff, Clara Fannjiang, Jennifer Listgarten',
 'Qingchuan Li, Qingchun Song, Jing Wei, Yang Cao, Xinyu Cui, Dairong Chen, Ho Cheung Shum',
 'Nadine Bongaerts, Zainab Edoo, Ayan A. Abukar, Xiaohu Song, Sebastián Sosa Carrillo, Ariel B. Lindner, Edwin H. Wintermute',
 'Daniel Stukenberg, Tobias Hensel, Josef Hoff, Benjamin Daniel, René Inckemann, Jamie N. Tedeschi, Franziska Nousch, Georg Fritz',
 'Santiago C. Lopez, Kate D. Crawford, Santi Bhattarai-Kline, Seth L. Shipman',
 'Mohamad H. Abedi, Michael S. Yao, David R. Mittelstein, Avinoam Bar-Zion, Margaret Swift, Audrey Lee-Gosselin, Mikhail G. Shapiro',
 'Sahil B. Shah, Alexis M. Hill, Claus O. Wilke, Adam J. Hockenberry',
 'Maurice Filo, Mustafa Khammash',
 'Markéta Vlková, Bhargava Reddy Morampalli,

Now that the titles are saved, we can create a pandas `DataFrame` from the results

In [21]:
import pandas as pd

In [135]:
df_biorxiv = pd.DataFrame() # We create an empty DataFrame
df_biorxiv['title'] = titles_list # Creates the title column
#df_biorxiv['authors'] = authors_list # Creates the authors column
#df_biorxiv['link'] = pdf_link_list  # Creates the link column

In [145]:
df_biorxiv.head(1)

Unnamed: 0,title
0,A synthetic receptor platform enables rapid an...
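
Adding the authors and link columns only works when all three lists have the same length; pandas raises a `ValueError` otherwise. A small sketch with placeholder records (not scraped data) that builds the whole frame in one step:

```python
import pandas as pd

# Placeholder records standing in for the scraped lists
sample_titles = ['Example title one', 'Example title two']
sample_authors = ['A. Author, B. Author', 'C. Author']
sample_links = ['https://www.biorxiv.org/content/example-1',
                'https://www.biorxiv.org/content/example-2']

# All columns must describe the same number of papers
assert len(sample_titles) == len(sample_authors) == len(sample_links)

df_sample = pd.DataFrame({
    'title': sample_titles,
    'authors': sample_authors,
    'link': sample_links,
})
print(df_sample.shape)
```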


Now that our data is inside a data frame, we can use the `unnest_tokens` function from `tidytext` to transform it.

The `unnest_tokens` function is probably the most important function in `tidytext`. It takes the text stored in a column of dtype `object` (that is, strings) and splits it into *tokens*. For now, the tokens will be words, but it's also possible to specify that our tokens are sentences or characters. We'll see the options for different tokens later.

The `unnest_tokens` function takes three mandatory arguments. The first is the data frame that contains our text data. The second is the name of the new column that will hold the tokens; here we call it `word` because the tokens are words. The third is the name of the column that contains the text we are going to tokenise, in our case `title`.
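
To see what this kind of tokenisation does, here is a rough pandas-only sketch. It is not how `tidytext` is implemented: `tokenize_column` is a hypothetical helper, and the word pattern `\w+` is an assumption about what counts as a token.

```python
import pandas as pd

def tokenize_column(df, output, input_col):
    """Split a text column into one lowercase word per row, keeping the index."""
    tokens = df[input_col].str.lower().str.findall(r"\w+")  # list of words per row
    return tokens.explode().rename(output).to_frame()       # one row per word

toy_df = pd.DataFrame({'title': ['Engineered bacteria', 'A synthetic receptor platform']})
toy_words = tokenize_column(toy_df, 'word', 'title')
print(toy_words)
```

`explode` repeats the original index for every word, which is what lets each token be traced back to the title it came from.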


In [144]:
from tidytext import unnest_tokens

In [146]:
words_df = unnest_tokens(df_biorxiv, 'word', 'title')

words_df

Unnamed: 0,word
0,a
0,synthetic
0,receptor
0,platform
0,enables
...,...
359,and
359,transchanges
359,in
359,genetic


You'll notice that the index has been preserved. You can check that all the words that appeared in the first title have index 0, all the words that appeared in the second title have index 1, and so on. This is really useful when we have extra information about the text in our original data that we want to preserve through tokenisation.

Now that we have a tidy data frame, it's easy to manipulate using `pandas`. To start, let's put our words in alphabetical order.


In [147]:
words_df.sort_values('word')

Unnamed: 0,word
0,a
244,a
240,a
238,a
234,a
...,...
197,without
77,without
287,without
37,without


A really common task is finding out how often each word appears in each title. We can do this with `groupby` and `size`.


In [148]:
words_df.groupby(['word', words_df.index]).size().reset_index(name='counts')

Unnamed: 0,word,level_1,counts
0,a,0,1
1,a,4,1
2,a,8,1
3,a,10,1
4,a,14,1
...,...,...,...
4171,without,317,1
4172,without,327,1
4173,without,337,1
4174,without,347,1


Or you may want to count the words across all titles. Again, this is done with `groupby`

In [149]:
words_df.groupby('word').size().reset_index(name='counts')

Unnamed: 0,word,counts
0,a,108
1,acoustic,36
2,across,36
3,activity,36
4,an,36
...,...,...
90,vesicles,36
91,via,36
92,vibrio,36
93,with,72


Next, let's find out how often each word appears and arrange the result from the most common word to the least common.
Since this is such a common pattern (not just in text mining, but in many types of analysis), `pandas` provides a shortcut: the function `value_counts` groups by the column or columns given, counts the rows in each group, and sorts the counts in descending order.


In [150]:
words_df.value_counts('word').reset_index(name='counts')

Unnamed: 0,word,counts
0,of,288
1,and,144
2,a,108
3,for,108
4,synthetic,108
...,...,...
90,controllers,36
91,control,36
92,complexity,36
93,complex,36
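
A tally like this can be cross-checked without pandas using `collections.Counter` from the standard library. The word list below is a toy example, not the scraped data.

```python
from collections import Counter

toy_words = ['of', 'synthetic', 'of', 'dna', 'synthetic', 'of']
counts = Counter(toy_words)

# most_common() orders the words from the most to the least frequent
print(counts.most_common(2))
```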


The most common words are not very interesting! They are words common to all English texts.

These common English words are known as **stop words**. The [`stop_words`](https://pypi.org/project/stop-words/) library provides lists of stop words in different languages. This means we can remove the stop words from our data by using `merge`, a `lambda` function or `isin`


In [151]:
from stop_words import get_stop_words

In [152]:
#as a list

stop_words = get_stop_words('en')

#as a DataFrame

df_stop_words = pd.DataFrame({
    'stop_words' : get_stop_words('en')
})
df_stop_words

Unnamed: 0,stop_words
0,a
1,about
2,above
3,after
4,again
...,...
169,you've
170,your
171,yours
172,yourself


Here we use `isin` and negate the mask with `~` to keep only the words that are not in `df_stop_words`:

In [153]:
words_df[~words_df['word'].isin(df_stop_words['stop_words'])]

Unnamed: 0,word
0,synthetic
0,receptor
0,platform
0,enables
0,rapid
...,...
359,robust
359,cis
359,transchanges
359,genetic
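
Besides `isin`, stop words can also be filtered with a `lambda` or with `merge`. The sketch below compares the three approaches on a toy data frame; the words and the shortened stop-word list are made up for illustration.

```python
import pandas as pd

toy_words = pd.DataFrame({'word': ['a', 'synthetic', 'receptor', 'of', 'bacteria']})
toy_stops = pd.DataFrame({'stop_words': ['a', 'of', 'the']})

# 1. isin: build a boolean mask and negate it with ~
via_isin = toy_words[~toy_words['word'].isin(toy_stops['stop_words'])]

# 2. lambda: test each word against a set of stop words
stop_set = set(toy_stops['stop_words'])
via_lambda = toy_words[toy_words['word'].apply(lambda w: w not in stop_set)]

# 3. merge: left join with an indicator column, then keep rows found only on the left
merged = toy_words.merge(toy_stops, how='left', left_on='word',
                         right_on='stop_words', indicator=True)
via_merge = merged.loc[merged['_merge'] == 'left_only', ['word']]

print(via_isin['word'].tolist())
```

All three keep the same words. Note that `merge` returns a fresh index, so `isin` is the simpler choice when the original index must be preserved.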


In [154]:
df_words_ns = words_df[~words_df['word'].isin(df_stop_words['stop_words'])]

In [155]:
df_words_ns.value_counts('word').reset_index(name='counts')

Unnamed: 0,word,counts
0,synthetic,108
1,dna,72
2,engineered,72
3,acoustic,36
4,marburg,36
...,...,...
80,controllers,36
81,control,36
82,complexity,36
83,complex,36


## Generating Wordclouds

We are going to use the [`wordcloud`](https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html) package

In [None]:
!pip install git+https://github.com/amueller/word_cloud

In [None]:
!pip install Pillow

Now we can import the libraries and functions we will use

In [156]:
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
from wordcloud import WordCloud, STOPWORDS



This library also provides its own set of stop words, as we saw last week

In [157]:
stopwords = set(STOPWORDS)
#stopwords.add("said") #Added specifically for this book

In [158]:
def plot_cloud(wordcloud):
    # Set figure size
    plt.figure(figsize=(40, 30))
    # Display image
    plt.imshow(wordcloud) 
    # No axis details
    plt.axis("off");

In [4]:
titles_joined =  ' '.join(titles_list)

In [161]:
# Generate word cloud
wordcloud = WordCloud(width = 80, 
                      height = 40, 
                      max_words=200, 
                      random_state=1, 
                      background_color='white', 
                      colormap='Pastel1', 
                      collocations=False, 
                      stopwords = STOPWORDS).generate(' '.join(titles_list))

OSError: invalid face handle

In [None]:
plot_cloud(wordcloud)

In [None]:
# Save image
wordcloud.to_file("wordcloud.png")

In [None]:
wc = WordCloud(background_color="white", max_words=200, stopwords=stopwords,
               contour_width=3, contour_color='steelblue', collocations=False)
wc.generate(titles_joined)  # titles_joined holds all the scraped titles as one string

In [None]:
plt.figure(figsize=(10,5))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

We can create a personalised WordCloud by using any image

In [163]:
!wget --quiet https://www.pinclipart.com/picdir/middle/355-3551258_bacteria-png-download-png-image-with-transparent-background.png

Let's transform each pixel into a `np.array` value

In [165]:
bacteria_mask = np.array(Image.open('355-3551258_bacteria-png-download-png-image-with-transparent-background.png'))
np.unique(bacteria_mask)

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 18
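
A mask is simply a NumPy array of pixel values. According to the `wordcloud` documentation, pure white entries (255) are treated as masked out and words are drawn on the remaining pixels. The sketch below builds a synthetic circular mask to make that convention visible; `circular_mask` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import numpy as np

def circular_mask(size=200):
    """Return a square uint8 array: 0 inside a centred circle, 255 (white) outside."""
    y, x = np.ogrid[:size, :size]
    centre = (size - 1) / 2
    radius = size / 2
    outside = (x - centre) ** 2 + (y - centre) ** 2 > radius ** 2
    return (255 * outside).astype(np.uint8)

toy_mask = circular_mask(200)
print(np.unique(toy_mask))

# Usage idea: WordCloud(mask=toy_mask) would draw words only inside the circle
```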

Here is the image we downloaded

In [1]:
fig = plt.figure(figsize=(10,5))
plt.imshow(bacteria_mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis('off')
plt.show()

In [2]:
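import re

def getFrequencyDictForText(sentence):
    """Count how often each word occurs, skipping a few very common English words."""
    fullTermsDict = {}
    # Words to ignore; fullmatch is used so that only whole words are skipped
    # (re.match would also skip any word that merely starts with one of them)
    skip = re.compile(r"a|the|an|to|in|for|of|or|by|with|is|on|that|be")

    # making dict for counting frequencies
    for text in sentence.split():
        word = text.lower()  # count 'DNA' and 'dna' as the same word
        if skip.fullmatch(word):
            continue
        fullTermsDict[word] = fullTermsDict.get(word, 0) + 1
    return fullTermsDict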
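
When filtering words with a regular-expression alternation, the choice between `re.match` and `re.fullmatch` matters. The toy comparison below (pattern and word list are illustrative) shows that `re.match` also discards words that merely start with a stop word.

```python
import re

pattern = r"a|the|an|to|in|for|of"
words = ['a', 'acoustic', 'the', 'therapy', 'dna']

# re.match only anchors at the start of the string, so prefixes also match
kept_with_match = [w for w in words if not re.match(pattern, w)]
# re.fullmatch requires the whole word to equal one of the alternatives
kept_with_fullmatch = [w for w in words if not re.fullmatch(pattern, w)]

print(kept_with_match)
print(kept_with_fullmatch)
```

With `re.match`, both `acoustic` and `therapy` are lost; with `re.fullmatch` only the actual stop words are removed.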

In [None]:
getFrequencyDictForText(titles_joined)

In [None]:
text_dict = getFrequencyDictForText(titles_joined)
text_dict
wc.generate_from_frequencies(text_dict)

In [None]:
wc = WordCloud(background_color="white", max_words=2000, mask=bacteria_mask,
               stopwords=stopwords)
wc.generate_from_frequencies(text_dict)

In [None]:
plt.figure(figsize=(20,10))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

We can use any image for this actually

In [None]:
!wget https://img.pokemondb.net/artwork/large/pikachu.jpg

In [None]:
image = np.array(Image.open("pikachu.jpg"))
plt.imshow(image)

In [None]:
from wordcloud import ImageColorGenerator
image_colors = ImageColorGenerator(image, default_color=(0,0,0))
wc = WordCloud(background_color="white", max_words=2000, mask=image,
               stopwords=stopwords)
wc.generate_from_frequencies(text_dict)

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(20,15))
axes[0].imshow(wc, interpolation="bilinear")
axes[1].imshow(wc.recolor(color_func=image_colors), interpolation="bilinear")
axes[2].imshow(image, cmap=plt.cm.gray, interpolation="bilinear")
for ax in axes:
    ax.set_axis_off()
plt.show()