
# BeautifulSoup4 Scraper Notes

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import urllib.parse
import pandas as pd
import csv
import numpy as np

## Get Soup

In [2]:
def get_soup(url):
    page = requests.get(url) # fetches the page; the response holds the status code and the HTML
    soup = BeautifulSoup(page.text, 'html.parser') # parses the HTML into a searchable tree
    return soup

## Get Links (Simple Spider)
BS4 has two search methods, `find()` and `find_all()`. The first returns the first element that matches the search condition (or `None` if nothing matches); the latter returns a list of all matches. In HTML, the `<a>` tag defines a hyperlink. Here we want to collect every URL that starts with the base URL, so that we gather all the documentation about immigration.
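
As a minimal sketch of the difference between the two methods, the snippet below parses a small hand-written HTML string. The markup and its paths are made up for illustration and are not taken from the USCIS site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate the two search methods
html = """
<main>
  <a href="/working-in-the-united-states/a">First</a>
  <a href="/working-in-the-united-states/b">Second</a>
  <a name="no-href">Anchor without href</a>
</main>
"""
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.find('a')                # first matching Tag, or None if there is none
all_links = soup.find_all('a')             # list-like ResultSet with every matching Tag
with_href = soup.find_all('a', href=True)  # only the <a> tags that carry an href attribute

hrefs = [a['href'] for a in with_href]
print(first_link.get_text(), len(all_links), hrefs)
```

Filtering with `href=True` is what lets the spider index `link["href"]` later without a `KeyError`.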


The URL is stored in the `href` attribute. It is worth noting that HTML's `<base>` tag can specify a base URL for the whole document: any relative URL in an `href` or `src` attribute is then resolved against it. We are also going to parse the URLs extracted from the hyperlinks:



```
>>> url ='https://cat.example/list;meow?breed=siberian#pawsize'
>>> urllib.parse.urlparse(url)
ParseResult(scheme='https', netloc='cat.example', path='/list', params='meow', query='breed=siberian', fragment='pawsize')
```
```
>>> url ='https://cat.example/list;meow?breed=siberian#pawsize'
>>> parsed_url = urllib.parse.urlparse(url)
>>> parsed_url.fragment
'pawsize'
```


Note that `urllib.parse.urljoin()` only applies the base URL when the second URL is relative, that is, when it is missing the scheme and host. Most `href` values are relative, but some are already absolute; in that case `urljoin()` returns them unchanged. Compare the next two examples:

When both URLs are absolute and their hosts differ, the second URL keeps its own host:
```
>>> urllib.parse.urljoin('http://BASE_URL1/%7Eguido/Python.html', 'http://BASE_URL2/FAQ.html')
'http://BASE_URL2/FAQ.html'
```
When the second URL is relative, it takes the scheme, host and directory from the first one:
```
>>> urllib.parse.urljoin('http://BASE_URL1/%7Eguido/Python.html', 'FAQ.html')
'http://BASE_URL1/%7Eguido/FAQ.html'
```
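
The same rule covers the other forms an `href` can take. The sketch below resolves a handful of them against one page URL; `www.example.com` and the paths are placeholder values chosen for illustration.

```python
from urllib.parse import urljoin

base = 'https://www.example.com/a/b/page.html'  # hypothetical page URL

hrefs = [
    'other.html',                 # relative to the current directory
    '/root.html',                 # relative to the site root
    '../up.html',                 # one directory up
    '//cdn.example.com/x.js',     # scheme-relative: keeps only the scheme
    'https://other.example/',     # already absolute: returned unchanged
    '#top',                       # fragment only: same page plus the fragment
]
resolved = {href: urljoin(base, href) for href in hrefs}

for href, url in resolved.items():
    print(f'{href!r:28} -> {url}')
```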

In [3]:
def get_links(soup, base_url):
    links = []
    for link in soup.find_all('a', href=True):
        url = link["href"] # get url from href attribute
        # Resolve relative links
        url = urllib.parse.urljoin(base_url, url) #joins relative link to base_url
        # Avoid repeating links just because they have a fragment
        fragment = urllib.parse.urlparse(url).fragment
        url = url.replace(('#'+fragment),'')
        if url.startswith(base_url) and url not in links:
            links.append(url)
    return links
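
The link clean-up can be tried without any network access. The sketch below is an alternative, not the function above: it takes a plain list of `href` strings, strips fragments with `urllib.parse.urldefrag()`, and tracks seen URLs in a set. The helper name `normalize_links` and the example URLs are assumptions made for illustration.

```python
from urllib.parse import urldefrag, urljoin

def normalize_links(hrefs, base_url):
    """Resolve hrefs against base_url, drop fragments, keep first-seen order."""
    seen = set()
    links = []
    for href in hrefs:
        # urldefrag returns a (url, fragment) pair
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url.startswith(base_url) and url not in seen:
            seen.add(url)
            links.append(url)
    return links

base_url = 'https://www.example.com/docs'  # hypothetical base URL
hrefs = [
    '/docs/intro',
    '/docs/intro#setup',            # same page, different fragment
    'https://other.example/docs',   # different host
    '/about',                       # same host, outside the base path
    '/docs/api#top',
]
links = normalize_links(hrefs, base_url)
print(links)
```

The set makes the membership test constant time, which matters once a page lists hundreds of links.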

## Get Page Title
It is useful to record each page's title alongside the content we collect.

In [4]:
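def get_title(soup):
    h1 = soup.find('h1')
    # Pages without an <h1> get an empty title instead of raising AttributeError
    title = h1.get_text().strip() if h1 else ''
    return title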

## Get Text
What text do we extract?
- The main header of the page: `h1`
- The text inside the `main` element, taken from:
    - The subheaders: `h2`...`h6`
    - The paragraphs: `p`
    - The accordion headers (class `accordion__header`) and their paragraphs
        ```
        >>> soup = get_soup('https://www.uscis.gov/working-in-the-united-states/'
        ...                 'temporary-workers/e-1-treaty-traders')
        >>> acc_headers = soup.find_all('div', class_='accordion__header cke-active')
        >>> for i in acc_headers:
        ...     print(i['class'], i.get_text()) # note that the class value is a list of two strings
        ['accordion__header', 'cke-active'] Who May File for Change of Status to E-1 Classification
        ['accordion__header', 'cke-active'] How to Obtain E-1 Classification if Outside the United States
        ...
        ```
    - The ordered lists and unordered lists: ol and ul
    - The tables: `table`. We extract the text of each cell and prepend its column header to it.
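
To make the table step concrete, here is a minimal sketch of the header-prepending idea on plain Python lists, with no HTML parsing involved. The helper name `rows_to_text` and the sample values are made up for illustration.

```python
def rows_to_text(headers, rows):
    """Flatten a table into text, one line per row, each cell prefixed by its column header."""
    lines = []
    for row in rows:
        # zip stops at the shorter sequence, so extra cells without a header are dropped here
        cells = [f'{header} {cell}' for header, cell in zip(headers, row)]
        lines.append(' '.join(cells))
    return '\n'.join(lines)

headers = ['Item', 'Note']                       # hypothetical column headers
rows = [['alpha', 'first note'], ['beta', 'second note']]
table_text = rows_to_text(headers, rows)
print(table_text)
```

Repeating the header in every cell keeps each line meaningful on its own once the table structure is gone.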

In [5]:
def get_table_text(table):
    # Column headers, stripped so they join cleanly with the cell text
    table_headers = [header.get_text().strip() for header in table.find_all('th')]
    lines = []

    for row in table.find_all('tr'):
        row_cells = [cell.get_text().strip() for cell in row.find_all('td')]
        if not row_cells:
            continue # header-only rows have no <td> cells
        parts = []
        for i, cell_text in enumerate(row_cells):
            # Prepend the column header when one exists for this position
            header = table_headers[i] if i < len(table_headers) else ''
            parts.append(' '.join([header, cell_text]).strip())
        lines.append(' '.join(parts).replace('\n', ''))

    return '\n'.join(lines)

def get_row(url, soup):
    title = get_title(soup)
    text = ''
    main = soup.find('main')

    for element in main.find_all():
        if element.name in ['ul', 'p', 'ol', 'h2', 'h3', 'h4', 'h5', 'h6']:
            text = '\n'.join([text, element.get_text()])
        elif element.name == 'table':
            # Checked before the class test so that tables with a class attribute are not skipped
            table_text = get_table_text(element)
            text = '\n'.join([text, table_text])
        elif 'accordion__header' in element.get('class', []):
            text = '\n'.join([text, element.get_text()])

    row = [title, url, text]
    return row

### DataFrame rows by Section
This function stores the scraped text of each page section by section: each `h2` section goes into its own row. It does not split sections into subsections.
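
The grouping logic can be sketched without BeautifulSoup by walking a flat list of `(tag, text)` pairs in document order. The helper name `group_by_h2` and the sample content are assumptions for illustration. As in the notebook function, text that appears before the first `h2` is folded into the first section.

```python
def group_by_h2(page_title, elements):
    """Split (tag, text) pairs into (section_title, section_text) tuples at each h2."""
    sections = []
    section_title, section_lines = '', []
    for tag, text in elements:
        if tag == 'h2':
            if section_title:
                # A new h2 closes the section that was open
                sections.append((section_title, '\n'.join(section_lines)))
                section_lines = []
            section_title = text
        else:
            section_lines.append(text)
    # The last section, or the whole page when it has no h2 at all
    sections.append((section_title or page_title, '\n'.join(section_lines)))
    return sections

elements = [
    ('p', 'Intro'),
    ('h2', 'Eligibility'),
    ('p', 'Rule one'),
    ('h2', 'How to Apply'),
    ('ul', 'Step A'),
]
sections = group_by_h2('Example Page', elements)
print(sections)
```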

In [49]:
def get_rows(url, soup):
    title = get_title(soup)
    rows = [] # rows for the DataFrame
    main = soup.find('main')
    section_text = ''
    section_title = ''

    for element in main.find_all():
        if element.name == 'h2':
            if section_title != '':
                # Save the finished section as one row with all of its scraped text
                rows.append([url, title, section_title, section_text])
                section_text = ''
            section_title = element.get_text()
        elif element.name in ['ul', 'p', 'ol']:
            section_text = '\n'.join([section_text, element.get_text()])
        elif element.name == 'table':
            # Checked before the class test so that tables with a class attribute are not skipped
            table_text = get_table_text(element)
            section_text = '\n'.join([section_text, table_text])
        elif 'accordion__header' in element.get('class', []):
            section_text = '\n'.join([section_text, element.get_text()])

    # Append the only section (titled after the page) or the last section
    if section_title == '':
        section_title = title
    rows.append([url, title, section_title, section_text])
    return rows

## Scraper
The scraper visits every link that starts with the base URL and extracts the useful content from each page. It collects the rows in a DataFrame, which we save to a CSV file afterwards.

When you open a file you usually use `with open(...)`, which automatically closes the file once the block ends. The most commonly used arguments of `open()` are the file name, the mode, and the encoding (which has a platform-dependent default). `r` selects reading mode and `w` selects writing mode. Two other modes are worth mentioning:
*   `w+`: Opens a file for reading and writing. It creates the file if it does not exist; if it exists, it erases its contents. The file pointer starts at the beginning.
*   `r+`: Opens an existing file for reading and writing without erasing it. The file pointer starts at the beginning of the file.
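
A small sketch of how the read-and-write modes behave, using a throwaway file in a temporary directory (the file name is arbitrary). Note that Python spells the non-truncating read/write mode `r+`; a combined string such as `rw+` is rejected with a `ValueError`.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

# 'w+' creates (or truncates) the file and lets us read back what we wrote
with open(path, 'w+', encoding='utf-8') as f:
    f.write('first line\n')
    f.seek(0)                      # move the pointer back to the start before reading
    after_w_plus = f.read()

# 'r+' needs an existing file and keeps its contents; the pointer starts at 0
with open(path, 'r+', encoding='utf-8') as f:
    f.write('FIRST')               # overwrites the first five characters in place

with open(path, encoding='utf-8') as f:
    after_r_plus = f.read()

# Reopening with 'w+' erases whatever was there
with open(path, 'w+', encoding='utf-8') as f:
    after_truncate = f.read()

# Mixing 'r' and 'w' in one mode string is invalid
try:
    open(path, 'rw+')
    mixed_mode_ok = True
except ValueError:
    mixed_mode_ok = False

print(repr(after_w_plus), repr(after_r_plus), repr(after_truncate), mixed_mode_ok)
```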

In [None]:
def scraper(base_url, output_file):
    # NOTE: output_file is not used here; the DataFrame is saved to CSV in a later cell
    visited = set()
    to_visit = [base_url]
    data = []

    i = 0
    while to_visit:
        url = to_visit.pop() # removes and returns the last element of the list
        # Skip URLs that have already been scraped
        if url in visited:
            continue
        i += 1
        if i % 5 == 0:
            print(f'{i} pages scraped')
        visited.add(url)
        soup = get_soup(url)
        # One row per section: [url, title, section, text]
        rows = get_rows(url, soup)
        data.extend(rows)
        # Queue the links that have not been visited yet
        links = get_links(soup, base_url)
        to_visit.extend(link for link in links if link not in visited)

    print(f'{i} Pages scraped in total')
    columns = ['url', 'title', 'section', 'text']
    df = pd.DataFrame(data, columns=columns)
    return df
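
The traversal itself can be checked offline. In the sketch below a dictionary stands in for the website: each key is a page and its value is the list of links found on it. The `crawl` helper and the `graph` contents are hypothetical; only the `visited` / `to_visit` bookkeeping mirrors the scraper.

```python
def crawl(start, graph):
    """Visit every page reachable from start and return the pages in visit order."""
    visited = set()
    to_visit = [start]
    order = []
    while to_visit:
        url = to_visit.pop()        # LIFO: the most recently queued link goes first
        if url in visited:
            continue                # the same URL can be queued more than once
        visited.add(url)
        order.append(url)
        links = graph.get(url, [])  # stands in for fetching the page and extracting links
        to_visit.extend(link for link in links if link not in visited)
    return order

# Hypothetical link graph with a cycle ('/a' links back to '/')
graph = {
    '/': ['/a', '/b'],
    '/a': ['/b', '/'],
    '/b': ['/c', '/a'],
    '/c': [],
}
order = crawl('/', graph)
print(order)
```

Because `pop()` takes from the end of the list, the crawl is depth-first; `to_visit.pop(0)` or a `collections.deque` would make it breadth-first instead.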

In [6]:
base_url = "https://www.uscis.gov/working-in-the-united-states"
output_file = "text.csv"
soup = get_soup(base_url)
links = get_links(soup, base_url)
df = scraper(base_url, output_file)

Check for duplicate data in the text column:

In [51]:
print(f"{sum(df.duplicated(subset='text'))} duplicates")

0 duplicates


## Save CSV
We also want to make sure we can save the DataFrame into a CSV, and then load it back without breaking the shape (row x column).
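
The concern is that the scraped text contains commas, quotes and line breaks, all of which are special in CSV. As a sketch of why the round trip is safe, the snippet below uses the standard-library `csv` module and an in-memory buffer rather than pandas and a file; the rows are made-up sample data.

```python
import csv
import io

columns = ['url', 'title', 'section', 'text']
rows = [
    ['https://www.example.com/a', 'Page A', 'Intro', '\nFirst line, with a comma\nSecond line'],
    ['https://www.example.com/b', 'Page "B"', 'Details', '\nText with "quotes"'],
]

# newline='' keeps line endings untouched, as the csv module expects
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(columns)
writer.writerows(rows)

# Read everything back and compare with what was written
buffer.seek(0)
loaded = list(csv.reader(buffer))
header, body = loaded[0], loaded[1:]
print(len(body), 'rows x', len(header), 'columns')
```

Fields that contain special characters are quoted on write, so each one comes back as a single cell.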

In [52]:
df.to_csv('scraped_sections.csv', index = False)
df1 = pd.read_csv('scraped_sections.csv')
df2 = pd.read_csv('scraped_data.csv')
df1

| | url | title | section | text |
|---|---|---|---|---|
| 0 | https://www.uscis.gov/working-in-the-united-st... | Working in the United States | Topics | \nMany noncitizens want to come to the United ... |
| 1 | https://www.uscis.gov/working-in-the-united-st... | Petition Process Overview | Form I-129, Petition for Nonimmigrant Worker | \nIf you would like to come to the United Stat... |
| 2 | https://www.uscis.gov/working-in-the-united-st... | Petition Process Overview | Form I-140, Immigrant Petition for Alien Workers | \nBelow is the list of Form I-140, Immigrant P... |
| 3 | https://www.uscis.gov/working-in-the-united-st... | Petition Process Overview | Form I-360, Petition for Amerasian, Widow(er),... | \nBelow is the list of Form I-360, Petition fo... |
| 4 | https://www.uscis.gov/working-in-the-united-st... | Petition Process Overview | Form I-526, Immigrant Petition by Alien Investor | \nBelow is the list of Form I-526, Immigrant P... |
| ... | ... | ... | ... | ... |
| 194 | https://www.uscis.gov/working-in-the-united-st... | EB-5 What's New | EB-5 What's New | \nThis page provides the latest information on... |
| 195 | https://www.uscis.gov/working-in-the-united-st... | EB-5 Regional Center Compliance Reviews | Compliance Review Team Tasks | \nThis page in Simplified Chinese. (PDF, 95.3... |
| 196 | https://www.uscis.gov/working-in-the-united-st... | EB-5 Regional Center Compliance Reviews | Preparing for a Compliance Review | \nBefore the site assessment, regional centers... |
| 197 | https://www.uscis.gov/working-in-the-united-st... | EB-5 Regional Center Compliance Reviews | After Completing the Review | \nThe review team will document the results in... |
