## Website scraping

We strongly recommend consulting your local legal department about compliance before scraping content from any website or webpage, and scraping only with permission.

There is no one-size-fits-all web scraping solution. When applying the following steps to other websites or webpages, modify them accordingly.

---

## Learning Objectives
The goal of this lab is to obtain raw text data via web scraping.

Running Megatron-LM's default workflow to train a GPT model requires data first. The output of this notebook is the raw text data used in the subsequent Lab1 tasks.

This notebook covers the following steps:

    1. Install the necessary Python libraries and download two Python scripts used for website crawling.
    2. Crawl links from a seed URL and write them to a text file.
    3. Remove non-compliant links from the text file to ensure legal compliance.
    4. Fetch the webpage for each approved URL and save it as an HTML file.
    5. Parse the HTML files, extract the raw text, and write it to disk.
    6. Move `extractedNVblogs.txt` to the correct folder under **dataset**.

This notebook is not intended to cover crawling other websites or webpages.


1. Install the Python libraries and download the two Python scripts used for website crawling.

In [None]:
# install python libraries
!pip install beautifulsoup4
!pip install html5lib
!pip install PyPDF2
!pip install selenium
!pip install Scrapy
!pip install requests bs4 colorama requests-html

In [None]:
# download 2 python scripts which will be used for website crawling
!wget https://raw.githubusercontent.com/x4nth055/pythoncode-tutorials/master/web-scraping/link-extractor/link_extractor.py
!wget https://raw.githubusercontent.com/x4nth055/pythoncode-tutorials/master/web-scraping/link-extractor/link_extractor_js.py

2. Crawl links from a seed URL and write them to a text file named `NVdevblog_urls.txt`.

In [None]:
# extract links from a seed URL and write them to a raw text file
!python link_extractor_js.py https://blogs.nvidia.com/blog/category/deep-learning/ -m 2
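
To make the idea behind step 2 concrete, here is a minimal sketch of collecting same-domain links from one page of HTML. It is not the implementation of the downloaded `link_extractor_js.py` script; the names `LinkCollector` and `extract_internal_links` and the sample HTML are hypothetical. It uses only the Python standard library, so it runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_internal_links(html, base_url):
    """Return the sorted, absolute, same-domain links found in html."""
    collector = LinkCollector()
    collector.feed(html)
    domain = urlparse(base_url).netloc
    links = set()
    for href in collector.hrefs:
        # resolve relative links and drop any #fragment
        absolute = urljoin(base_url, href).split("#")[0]
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == domain:
            links.add(absolute)
    return sorted(links)


# Made-up HTML for illustration only
sample_html = """
<a href="/blog/post-1/">Post 1</a>
<a href="https://blogs.nvidia.com/blog/post-2/#comments">Post 2</a>
<a href="https://example.com/elsewhere">External</a>
<a href="mailto:someone@example.com">Mail</a>
"""
seed = "https://blogs.nvidia.com/blog/category/deep-learning/"
links = extract_internal_links(sample_html, seed)
print("\n".join(links))
```

A crawler would repeat this on each newly found link until it reaches a page limit; only the single-page step is shown here.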

3. Remove non-compliant links from the text file to ensure legal compliance.

    Normally, you should check with your legal department and remove each non-compliant link.

    For this exercise, a pre-filtered `NVdevblog_urls.txt` is provided for your convenience.

In [None]:
from bs4 import BeautifulSoup
import urllib.request as request
import re
import scrapy
import random
import os, sys
# create a folder to hold the scraped HTML pages
os.makedirs('./htmls/', exist_ok=True)
# read NVdevblog_urls.txt and display a random sample URL
with open('NVdevblog_urls.txt', 'r') as f:
    lines = f.readlines()
url = random.choice(lines).strip()
url
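
The filtering in step 3 is ultimately a legal decision, but a mechanical pre-filter can reduce the list before review. The sketch below is one possible approach, not the method used to produce the provided `NVdevblog_urls.txt`. It keeps only URLs on an approved domain that a given `robots.txt` allows. The function `filter_urls`, the example rules, and the candidate URLs are all hypothetical, and passing a `robots.txt` check is not a substitute for legal approval.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def filter_urls(urls, allowed_domains, robots_txt, user_agent="*"):
    """Keep URLs on an approved domain that the robots.txt rules allow."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    kept = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines
        if urlparse(url).netloc not in allowed_domains:
            continue  # not on an approved domain
        if not rules.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt
        kept.append(url)
    return kept


# Hypothetical rules and URLs, for illustration only
example_robots = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "https://blogs.nvidia.com/blog/some-post/\n",
    "https://blogs.nvidia.com/private/draft/\n",
    "https://example.com/blog/other/\n",
]
approved = filter_urls(candidates, {"blogs.nvidia.com"}, example_robots)
print(approved)
```

In practice the rules text would come from the target site's own `robots.txt`, and the surviving list would still go to legal review.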

4. Fetch the webpage for each approved URL and save it as an `XXX.html` file.

In [None]:
!python -c "import scrapy"
!bash fetchURLs_and_write2html.sh
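
The contents of `fetchURLs_and_write2html.sh` are not shown in this notebook. As a rough illustration of what step 4 does, the sketch below downloads each URL and saves it as `./htmls/response_<index>.html`, a naming pattern assumed from this step's expected output. The functions `fetch_with_urllib` and `save_pages` are hypothetical and are not taken from the script.

```python
import os
import urllib.request


def fetch_with_urllib(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_pages(urls, out_dir="./htmls", fetch=fetch_with_urllib):
    """Fetch each URL and write it to out_dir/response_<index>.html."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for index, url in enumerate(urls):
        try:
            body = fetch(url)
        except OSError as err:  # URLError and socket timeouts are OSError subclasses
            print(f"skipping {url}: {err}")
            continue
        path = os.path.join(out_dir, f"response_{index}.html")
        with open(path, "wb") as f:
            f.write(body)
        written.append(path)
    return written
```

Calling `save_pages(approved_urls)` with the stripped lines of the URL file would populate `./htmls/`. The `fetch` parameter exists so that the download step can be replaced, for example with a stub during testing.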

Below is an example of the expected output:

    ./htmls/response_64.html
    ./htmls/response_65.html
    ./htmls/response_66.html
    ./htmls/response_67.html
    ./htmls/response_68.html
    ./htmls/response_69.html
    ./htmls/response_70.html

5. Parse the HTML files and extract the raw text data, which is written to `extractedNVblogs.txt`.

In [None]:

from bs4 import BeautifulSoup
import html5lib  # parser backend passed to BeautifulSoup below
import os

# read one HTML file and append the text of its <p> tags to f_out
def convert2txt(html_f, f_out):
    with open(html_f, "r", encoding="utf-8") as file:
        html_doc = file.read()
    # name the parser explicitly so the result does not depend on what else is installed
    soup = BeautifulSoup(html_doc, "html5lib")
    for node in soup.find_all('p'):
        # skip empty or whitespace-only paragraphs
        if node.text.strip():
            f_out.write(node.text)
    # end this file's text with a newline
    f_out.write('\n')

html_dir = './htmls/'
# mode 'a' appends, so rerunning this cell adds to an existing output file
with open('extractedNVblogs.txt', 'a', encoding='utf-8') as f_out:
    for html in os.listdir(html_dir):
        convert2txt(os.path.join(html_dir, html), f_out)
print("finished processing the HTML files and converting them to a raw text file")
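
The loop above writes each paragraph's text back to back, so the last word of one paragraph can fuse with the first word of the next. If that matters for your data, one option is to join paragraphs with a space. The sketch below shows the idea using the standard library's `html.parser` as a swapped-in technique rather than BeautifulSoup; `ParagraphText` and `html_to_line` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._chunks = []
        self.paragraphs = []

    def flush(self):
        # collapse runs of whitespace and keep non-empty paragraphs only
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # also handles a previous <p> that was never closed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def html_to_line(html):
    """Return the text of all <p> elements, separated by single spaces."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser.flush()
    return " ".join(parser.paragraphs)


sample = "<p>First  paragraph.</p><p></p><p>Second <b>one</b>.</p><div>nav</div>"
line = html_to_line(sample)
print(line)
```

Whether to insert a space, a newline, or nothing between paragraphs depends on how the downstream preprocessing splits documents and sentences.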

6. Move `extractedNVblogs.txt` to the correct folder under the **dataset** folder. This file is used in subsequent notebooks in Lab1.

In [12]:
!mv extractedNVblogs.txt ../../../../dataset/EN/

**Note:** Run the cell below to free up disk space.

In [14]:
!rm -fr htmls*
!rm link_extractor.py
!rm link_extractor_js.py
!rm blogs.nvidia.com_*

Verify that `extractedNVblogs.txt` was successfully moved to the correct folder.

In [None]:
!head -1 ../../../../dataset/EN/extractedNVblogs.txt

Below is an example of the expected output:

        The NVIDIA NGC team is hosting a webinar with live Q&A to dive into this Jupyter notebook available from the NGC catalog. Learn how to use these resources to kickstart your AI journey. Register now: NVIDIA NGC Jupyter Notebook Day: Medical Imaging Segmentation.Image segmentation partitions a digital image into multiple segments by changing the representation into something more meaningful and easier to analyze. In the field of medical imaging, image segmentation can be used to help identify organs and anomalies, measure them, classify them, and even uncover diagnostic information. It does this by using data gathered from x-rays, magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and other formats.To achieve state-of-the-art models that deliver the desired accuracy and performance for a use case, you must set up the right environment, train with the ideal hyperparameters, and optimize it to achieve the desired accuracy. All of this can be time-consuming. Data scientists and developers need the right set of tools to quickly overcome tedious tasks.

--- 

## Links and Resources
Don't forget to check out additional web scraping documentation, such as [Selenium](https://www.selenium.dev/selenium/docs/api/py/index.html), [Scrapy](https://docs.scrapy.org/en/latest/) and [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/).


-----
## <p style="text-align:center;border:3px; padding: 1em"> <a href=../../../../Start_Here.ipynb>HOME</a>&nbsp; &nbsp; &nbsp; <a href=../../../Lab1-2_EstimateComputeDaysNeeded.ipynb>NEXT</a></p>

--- 

## Licensing

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).