In this notebook I prepare the datasets consisting of:

1.   scientific abstracts,
2.   news.

First, let's install and import the required packages:

In [None]:
!pip install requests
!pip install beautifulsoup4
!pip install lxml

In [10]:
import numpy as np
import pandas as pd
import requests
import bs4
import lxml.etree as xml
import urllib.request as ulb
from http.cookiejar import CookieJar

# 1. Abstracts

Because its abstracts are openly accessible, I decided to scrape them from the https://arxiv.org/ website. I searched for abstracts with the keyword "neural networks". I use the *requests* library to fetch the search results page and *BeautifulSoup* to parse the HTML.

In [77]:
URL = "https://arxiv.org/search/?searchtype=abstract&query=neural+networks&abstracts=show&size=200&order=-announced_date_first"
requests.get(URL)

<Response [200]>
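
The query string in the URL above can also be assembled from named parameters, which makes it easier to change the keyword or the page size. This is only a minimal sketch: `build_search_url` and `ARXIV_SEARCH` are hypothetical names introduced here, and the parameter names are simply copied from the URL used in this notebook.

```python
from urllib.parse import urlencode

# Base address of the arXiv search page, taken from the URL used above.
ARXIV_SEARCH = "https://arxiv.org/search/"


def build_search_url(query, size=200):
    """Assemble an abstract-search URL (hypothetical helper, not part of any library)."""
    params = {
        "searchtype": "abstract",
        "query": query,
        "abstracts": "show",
        "size": size,
        "order": "-announced_date_first",
    }
    # urlencode joins the pairs with "&" and encodes spaces as "+".
    return ARXIV_SEARCH + "?" + urlencode(params)


url = build_search_url("neural networks")
print(url)
```

The resulting string can be passed to `requests.get` exactly like `URL` above.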

In [78]:
web_page = bs4.BeautifulSoup(requests.get(URL).text, "lxml")
web_page.head.title.text

'Search | arXiv e-print repository'

In [79]:
sub_web_page = web_page.find_all(attrs={"class": 'abstract-full has-text-grey-dark mathjax'})
print(sub_web_page[20].text.strip()[:100] + '...')
print('...' + sub_web_page[20].text.strip()[-100:])

We present a novel deep neural network (DNN) architecture for compressing an image when a correlated...
... outperforms previous work on stereo image compression with decoder side information.
        △ Less


Each abstract ends with the text of the "△ Less" toggle, so I crop the last 15 characters of each abstract and remove the "$" math delimiters:

In [80]:
# Replace each tag with its cleaned text: drop the trailing "△ Less" toggle and the "$" signs.
for i, ab in enumerate(sub_web_page):
  sub_web_page[i] = ab.text.strip()[:-15].replace('$', '')
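
The fixed offset of 15 characters appears to correspond to the newline, the eight spaces of indentation, and the "△ Less" marker visible in the output above. A slightly more defensive variant, sketched below, removes the marker only when it is actually present. `clean_abstract` is a hypothetical helper name, and the sample string is made up for illustration.

```python
def clean_abstract(raw, marker="△ Less"):
    """Strip whitespace, a trailing toggle marker, and '$' math delimiters (hypothetical helper)."""
    text = raw.strip()
    # Remove the marker only if the abstract really ends with it.
    if text.endswith(marker):
        text = text[: -len(marker)].rstrip()
    return text.replace("$", "")


# Made-up sample mimicking the layout of a scraped abstract.
sample = "We study $n$-layer networks.\n        △ Less\n"
print(clean_abstract(sample))
```

With this helper, an abstract that lacks the marker is left intact instead of losing its last 15 characters.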

Let's see what a few example abstracts look like:

In [81]:
sub_web_page[:5]

['Deep neural networks have reached very high accuracy on object detection but their success hinges on large amounts of labeled data. To reduce the dependency on labels, various active-learning strategies have been proposed, typically based on the confidence of the detector. However, these methods are biased towards best-performing classes and can lead to acquired datasets that are not good representatives of the data in the testing set. In this work, we propose a unified framework for active learning, that considers both the uncertainty and the robustness of the detector, ensuring that the network performs accurately in all classes. Furthermore, our method is able to pseudo-label the very confident predictions, suppressing a potential distribution drift while further boosting the performance of the model. Experiments show that our method comprehensively outperforms a wide range of active-learning methods on PASCAL VOC07+12 and MS-COCO, having up to a 7.7% relative improvement, or up t

And finally, I write the data (200 abstracts) to a text file:

In [82]:
# Opening in 'w' mode truncates the file; the context manager closes it automatically.
with open('data_abstracts.txt', 'w', encoding='utf-8') as outfile:
  for a in sub_web_page:
    outfile.write(a + ' ')

# 2. News

For the news, I again chose a single website: https://www.nytimes.com/. I first scrape the article links from the main page of The New York Times using *BeautifulSoup*, then use *urllib.request* and *BeautifulSoup* to extract the article texts.

In [45]:
def get_article(url):
    # Use a cookie-aware opener so cookies set by the site are kept across redirects.
    cookiejar = CookieJar()
    opener = ulb.build_opener(ulb.HTTPCookieProcessor(cookiejar))
    webpage = opener.open(url).read().decode('utf8')

    # Parse the page once and reuse the tree for the title and the paragraphs.
    soup = bs4.BeautifulSoup(webpage, "lxml")
    title = soup.title.text
    txt = ' '.join(p.text for p in soup.find_all("p"))

    return title, txt


I keep only the links that point to pages on www.nytimes.com (not ads, the help center, etc.) and skip an article when its title repeats the title of the previous one, which removes consecutive duplicates:

In [71]:
html_page = ulb.urlopen("https://www.nytimes.com/")
data = []
tits = []
soup = bs4.BeautifulSoup(html_page, "lxml")
t = 'Random title'  # title of the previously fetched article
for link in soup.find_all('a'):
    href = link.get('href')
    # Skip anchors without an href and links that leave www.nytimes.com.
    if href and href.startswith('https://www.nytimes.com'):
        try:
            title, text = get_article(href)
            if title != t:
                data.append(text)
                tits.append(title)
            t = title
        except Exception:
            print('Oops')

oops
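
The link selection can also be separated from the downloading step. The sketch below is an alternative to the title comparison used above, not the same method: it removes duplicate URLs before any article is fetched. `filter_article_links` is a hypothetical helper, and the example hrefs are made up.

```python
def filter_article_links(hrefs, prefix="https://www.nytimes.com"):
    """Keep unique hrefs that start with `prefix`, preserving order (hypothetical helper)."""
    seen = set()
    kept = []
    for href in hrefs:
        # Anchors without an href yield None; skip those and off-site links.
        if not href or not href.startswith(prefix):
            continue
        # Skip URLs that were already collected.
        if href in seen:
            continue
        seen.add(href)
        kept.append(href)
    return kept


# Made-up hrefs standing in for the values of link.get('href').
example = [
    None,
    "https://www.nytimes.com/a",
    "/relative",
    "https://www.nytimes.com/a",
    "https://help.nytimes.com/x",
    "https://www.nytimes.com/b",
]
print(filter_article_links(example))
```

Filtering by URL avoids downloading the same page twice, but two different URLs can still lead to the same article, so the title check remains useful.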


An example title and the beginning of its text look like this:

In [75]:
print(tits[12])
print(data[12][:200])

Live Updates: Harris to Visit Southern Border Amid Republican Criticism - The New York Times
Vice President Kamala Harris will travel to El Paso on Friday with Homeland Security Secretary Alejandro N. Mayorkas. President Biden is set to deliver remarks on gun violence as homicides rise in som


Finally, I write the data (a total of 55 articles) to a text file:

In [76]:
# Opening in 'w' mode truncates the file; the context manager closes it automatically.
with open('data_news.txt', 'w', encoding='utf-8') as outfile:
  for a in data:
    outfile.write(a + ' ')