# Scraping Websites

### Cheat sheet for CSS selectors
```
- head selects the element with the head tag
- .red selects all elements with the ‘red’ class
- #nav selects the element with the ‘nav’ id
- div.row selects all elements with the div tag and the ‘row’ class
```

A fuller CSS selector cheat sheet: https://gist.github.com/magicznyleszek/809a69dd05e1d5f12d01
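
To see the four selector forms from the cheat sheet in action, here is a minimal sketch that runs them against a small HTML snippet with Beautiful Soup's `select`. The markup, ids, and class names are made up purely for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only to illustrate the selectors in the cheat sheet.
html = """
<html>
  <head><title>Demo</title></head>
  <body>
    <ul id="nav"><li class="red">Home</li></ul>
    <div class="row red">First row</div>
    <div class="col">A column</div>
    <p class="row">Not a div</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("head")        # tag selector
by_class = soup.select(".red")      # class selector
by_id = soup.select("#nav")         # id selector
by_both = soup.select("div.row")    # tag and class combined

print(len(by_tag), len(by_class), len(by_id), len(by_both))
print([el.get_text() for el in by_both])
```

Note that `div.row` skips the `<p class="row">` element because the tag does not match, while `.red` matches both the list item and the first `div`.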

In [None]:
%pip install requests beautifulsoup4



In [None]:
import requests
from bs4 import BeautifulSoup

def clean(s):
    return " ".join(s.split())
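
`clean` relies on `str.split()` with no arguments, which splits on any run of whitespace (spaces, tabs, newlines) and discards empty strings. Joining the pieces with a single space then yields one tidy line. A quick check on a made-up string:

```python
def clean(s):
    # split() with no arguments splits on any whitespace run and drops empties
    return " ".join(s.split())

sample = "  With my heart\n\tupon   my sleeve \r\n"
print(repr(clean(sample)))
```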

In [None]:
# Fetch a listing page and collect its links; change the URL and selector below to suit your site
URL = 'https://www.poetryfoundation.org/poems/browse#page=1&sort_by=recently_added&filter_poetry_children=1'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
# Parse a list of links from a page
links = soup.select('.c-hdgSans a')
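
One caveat about this URL: everything after `#` is a fragment, which stays on the client and is not sent in the HTTP request. The `page`, `sort_by`, and `filter_poetry_children` settings are therefore not seen by the server when the page is fetched with `requests`. The site presumably applies them in the browser with JavaScript (an assumption, not something verified here), so the links returned may differ from what a browser shows. The standard library makes the split easy to inspect:

```python
from urllib.parse import urlsplit

URL = 'https://www.poetryfoundation.org/poems/browse#page=1&sort_by=recently_added&filter_poetry_children=1'
parts = urlsplit(URL)

print(parts.path)      # the path the server sees
print(parts.query)     # empty: there is no "?" section in this URL
print(parts.fragment)  # everything after "#", which stays on the client
```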

In [None]:
# Checking the links parsed above
links[:2]

[<a href="https://www.poetryfoundation.org/poetrymagazine/poems/29195/thesis-56d212c726456">?????</a>,
 <a href="https://www.poetryfoundation.org/poetrymagazine/poems/38528/">?</a>]

In [None]:
# Now we have to figure out how to parse a single web page for content
# Change the URL to a sample page and figure out the selector to find the content
page = requests.get('https://www.lyrics.com/lyric/36275842/Imagine+Dragons/Heart+Upon+My+Sleeve')
soup = BeautifulSoup(page.content, 'html.parser')

# Grab the lyrics from a lyrics web page
clean(soup.select_one('#lyric-body-text').text)

"With my heart upon my sleeve My head down low, I still feel broken Down upon my knees With my head down low and I still feel broken Where are you? Where are you? Oh, now that I need you most and My heart upon my sleeve, broken down, whoa I guess I'm just down on my luck a bit, shakin' me out of it I guess I'm just down on my luck a bit, shakin' me out of it With my heart upon my sleeve My head down low, I still feel broken Down upon my knees With my head down low and I still feel broken Where are you? Where are you? Oh, now that I need you most and My heart upon my sleeve, broken down, down, down, down Now, I can't go a single day without thinking of the words I'd say And I can't do a single thing without thinking of you, thinking of you Now I'm just left with the pieces to put back together (Together, together, together) (Forever, forever, forever) With my heart upon my sleeve My head down low, I still feel broken Down upon my knees With my head down low and I still feel broken Where

In [None]:
# Loop over all links and concatenate the page text into the text variable
text = ''
for link in links:
    # Only consider links that have an href attribute
    if link.has_attr('href'):
        page = requests.get(link['href'])
        soup = BeautifulSoup(page.content, 'html.parser')
        # Change this selector to the one you found above
        body = soup.select_one('article')
        # Only use the text if the selector matched something
        if body is not None:
            text += clean(body.get_text(' '))
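
The loop assumes every `href` is an absolute URL. Many sites use relative links such as `/poems/123`, which `requests.get` cannot fetch directly. Below is a minimal sketch of normalizing them with `urllib.parse.urljoin` and skipping duplicates. The helper name `absolute_urls` and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin

def absolute_urls(base_url, hrefs):
    """Resolve hrefs against base_url, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base_url, href)  # absolute hrefs pass through unchanged
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result

# Hypothetical hrefs, standing in for [link['href'] for link in links]
base = 'https://www.poetryfoundation.org/poems/browse'
hrefs = [
    '/poems/1/example',
    'https://www.poetryfoundation.org/poems/2/other',
    '/poems/1/example',
]
print(absolute_urls(base, hrefs))
```

The resulting list could then be iterated in place of `links`, passing each URL straight to `requests.get`.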

In [None]:
# Write everything except the last 1,000 characters to the training file
open("train.txt", "w").write(text[:-1000])

10072

In [None]:
# Write the last 1,000 characters to the validation file
open("valid.txt", "w").write(text[-1000:])

1000
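
The two `write` calls split the corpus by position: everything except the final 1,000 characters goes to `train.txt`, and the final 1,000 characters go to `valid.txt`. The numbers printed are the character counts returned by `write`. A small sketch with made-up strings shows that the two slices cover the text without overlap, and that a corpus of 1,000 characters or fewer would leave the training part empty. The helper name `split_tail` is hypothetical.

```python
def split_tail(text, n_valid=1000):
    # Hypothetical helper mirroring text[:-1000] and text[-1000:]
    return text[:-n_valid], text[-n_valid:]

corpus = "x" * 11072                 # made-up corpus, 10072 + 1000 characters
train, valid = split_tail(corpus)
print(len(train), len(valid))

short_train, short_valid = split_tail("y" * 500)
print(len(short_train), len(short_valid))   # the training part is empty
```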

In [None]:
!pip install -q tqdm boto3 requests regex sacremoses transformers importlib_metadata "datasets>=1.1.3" "sentencepiece!=0.1.92" protobuf

ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/tmp/pip-req-tracker-rxpgf5e0/ee20ff7baddba63fd3048ee75891160bd1777f24549f682f9a92dce4'


In [None]:
!git clone -q https://github.com/huggingface/transformers.git

In [None]:
%run transformers/examples/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --train_file train.txt \
    --validation_file valid.txt \
    --do_train \
    --do_eval \
    --output_dir out

In [None]:
from transformers import pipeline
generator = pipeline('text-generation', model='/content/out/checkpoint-1000')

In [None]:
generator("Love is", max_length=30, num_return_sequences=5)