
# BeautifulSoup4 Scraper Notes

In [None]:
!pip install beautifulsoup4



In [1]:
import requests
from bs4 import BeautifulSoup
import re
import urllib.parse
import pandas as pd
import csv

## Get Soup

In [2]:
def get_soup(url):
    page = requests.get(url) # fetches the page; returns a Response object (HTML is in page.text)
    soup = BeautifulSoup(page.text, 'html.parser') # Parsed HTML code
    return soup

## Get Links (Simple Spider)
BS4 has two search methods, `find()` and `find_all()`. The first returns the first element that matches the search condition (or `None` if nothing matches); the latter returns a list of all matches. In HTML the `<a>` tag defines a hyperlink. Here we want to collect every URL that starts with the base URL, so that we gather all the documentation about immigration.
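A minimal sketch of the difference between the two methods, run on a small made-up HTML snippet. The markup and the link targets are illustrative assumptions, not taken from the USCIS site:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration
html = """
<html><body>
  <a href="/visas">Visas</a>
  <a href="/green-card">Green Card</a>
  <p>No link here</p>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.find('a')      # first matching tag, or None if nothing matches
all_links = soup.find_all('a')   # list-like ResultSet with every matching tag

print(first_link['href'])
print([a['href'] for a in all_links])
print(soup.find('table'))        # no <table> in the snippet, so this is None
```

Because `find()` can return `None`, it is usually worth checking the result before reading attributes from it.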


The URL is stored in the `href` attribute. It is worth noting that HTML has a `<base>` tag that sets the base URL for every relative URL in the page. Any `href` or `src` attribute holding a relative URL is resolved against the URL given in the `<base>` tag; without a `<base>` tag, it is resolved against the URL of the page itself. We are also going to parse the URLs extracted from the hyperlinks:
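As a rough sketch of how a `<base>` tag could be honored when resolving links, assuming a hypothetical page (both `page_url` and the markup below are placeholders):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical page: the URLs below are placeholders for illustration
page_url = 'https://site.example/docs/index.html'
html = """
<html><head><base href="https://cdn.example/assets/"></head>
<body><a href="guide.html">Guide</a></body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

# Use the <base> href when present, otherwise fall back to the page's own URL
base_tag = soup.find('base', href=True)
base = urljoin(page_url, base_tag['href']) if base_tag else page_url

resolved = [urljoin(base, a['href']) for a in soup.find_all('a', href=True)]
print(resolved)
```

The `get_links` function in these notes does not look at `<base>`; it always resolves against the `base_url` it is given.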



```
>>> url ='https://cat.example/list;meow?breed=siberian#pawsize'
>>> urllib.parse.urlparse(url)
ParseResult(scheme='https', netloc='cat.example', path='/list', params='meow', query='breed=siberian', fragment='pawsize')
```
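The parsed fields can also be used to decide whether a link stays on the same site. A small sketch, where `same_host` is a hypothetical helper and not part of `urllib`:

```python
from urllib.parse import urlparse

def same_host(url, base_url):
    """Return True when both URLs share the same scheme and host (netloc)."""
    a, b = urlparse(url), urlparse(base_url)
    return (a.scheme, a.netloc) == (b.scheme, b.netloc)

parts = urlparse('https://cat.example/list;meow?breed=siberian#pawsize')
print(parts.netloc, parts.path, parts.query)

print(same_host('https://cat.example/adopt', 'https://cat.example/list'))
print(same_host('https://dog.example/adopt', 'https://cat.example/list'))
```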



The URL parser joins the base URL only when the second URL is relative, that is, when it has no scheme of its own. Some URLs are already absolute and carry their own scheme and host. In that case `urllib.parse.urljoin()` returns the absolute URL unchanged instead of joining it to `base_url`. Check the next two examples:

In [3]:
urllib.parse.urljoin('http://BASE_URL1/%7Eguido/Python.html', 'http://BASE_URL2/FAQ.html')

'http://BASE_URL2/FAQ.html'

In [4]:
urllib.parse.urljoin('http://BASE_URL1/%7Eguido/Python.html', 'FAQ.html')

'http://BASE_URL1/%7Eguido/FAQ.html'
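A few more cases may help, since links on real pages come in several relative forms. The base URL below is a made-up placeholder:

```python
from urllib.parse import urljoin

base = 'https://site.example/docs/guide/intro.html'  # hypothetical base URL

sibling = urljoin(base, 'setup.html')         # replaces the last path segment
parent = urljoin(base, '../faq.html')         # goes one directory up
root = urljoin(base, '/contact')              # root-relative: keeps scheme and host only
other_host = urljoin(base, '//cdn.example/x') # scheme-relative: keeps the scheme only
fragment = urljoin(base, '#top')              # fragment only: stays on the same page

for url in (sibling, parent, root, other_host, fragment):
    print(url)
```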

In [5]:
def get_links(soup, base_url):
    links = []
    for link in soup.find_all('a', href=True):
        url = link["href"] # get url from href attribute
        # Resolve relative links
        url = urllib.parse.urljoin(base_url, url) #joins relative link to base_url
        if url.startswith(base_url) and url not in links:
            links.append(url)
    return links
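
One possible refinement, shown only as a sketch: links that differ just by their `#fragment` point to the same page, so dropping the fragment before deduplicating avoids fetching a page twice. `normalize_links` is a hypothetical helper and the URLs are placeholders:

```python
from urllib.parse import urljoin, urldefrag

def normalize_links(hrefs, base_url):
    """Resolve hrefs against base_url, drop fragments, keep first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        # urldefrag splits 'page#section' into ('page', 'section')
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url.startswith(base_url) and url not in seen:
            seen.add(url)
            result.append(url)
    return result

base = 'https://site.example/work'  # hypothetical base URL
hrefs = ['/work/visas', '/work/visas#fees', 'https://other.example/', '/work/visas']
print(normalize_links(hrefs, base))
```

Using a set for the membership check keeps the lookup fast even when a page has many links.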

In [6]:
base_url = "https://www.uscis.gov/working-in-the-united-states"
output_file = "text.csv"
soup = get_soup(base_url)
links = get_links(soup, base_url)

## Get Page Title
It could be useful to get the title of each page whose contents we are going to collect.

In [7]:
def get_title(soup):
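    h1 = soup.find('h1')
    # Some pages may have no <h1>; fall back to an empty title instead of raising
    title = h1.get_text().strip() if h1 else ''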
    return title

## Write Text
We are going to extract the content of each page, ignoring non-text elements. We do this by targeting the paragraph tags `<p>`, along with bullet lists and accordion headers. The function is given a soup (a parsed HTML document) and returns a row holding the page title and the collected text.
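
The idea of keeping only paragraph text can be sketched on a small inline document. The HTML below is invented for illustration and is much simpler than a real USCIS page:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment used only for illustration
html = """
<main>
  <h1>Sample Page</h1>
  <p>First paragraph.</p>
  <img src="photo.png" alt="ignored">
  <p>Second <b>paragraph</b>.</p>
</main>
"""
soup = BeautifulSoup(html, 'html.parser')

# get_text() flattens nested tags such as <b> into plain text
paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
text = '\n'.join(paragraphs)
print(text)
```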

In [13]:
soup = get_soup('https://www.uscis.gov/working-in-the-united-states/temporary-workers/e-1-treaty-traders')
acc_headers = soup.find_all('div',class_='accordion__header cke-active')
for i in acc_headers:
    print(i['class'], i.get_text()) # get_text() returns the text inside the tag

['accordion__header', 'cke-active'] Who May File for Change of Status to E-1 Classification
['accordion__header', 'cke-active'] How to Obtain E-1 Classification if Outside the United States
['accordion__header', 'cke-active'] General Qualifications of a Treaty Trader
['accordion__header', 'cke-active'] General Qualifications of the Employee of a Treaty Trader
['accordion__header', 'cke-active'] Period of Stay
...

In [11]:
def get_row(soup):
    title = get_title(soup)
    text = ''
    main = soup.find('main')

    for element in main.find_all():
        if element.name in ('ul', 'p'):
            # keep the text of paragraphs and bullet lists
            text += element.get_text() + '\n'
        elif element.has_attr('class'):
            classes = element['class']
            # accordion headers carry a second class, e.g. 'cke-active'
            if len(classes) > 1 and classes[0] == 'accordion__header':
                text += element.get_text() + '\n'

    row = {'title': title, 'text': text}
    return row

## Scraper
The scraper visits every URL found under the base URL and extracts the useful content from each page. It writes the text to an output file in CSV format.
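
The visiting logic can be tried without any network access. The sketch below swaps real pages for a hypothetical in-memory site map; `crawl`, `get_neighbors`, and `max_pages` are illustrative names, not part of the scraper in these notes:

```python
def crawl(start, get_neighbors, max_pages=50):
    """Visit every page reachable from start once, up to max_pages."""
    visited = set()
    to_visit = [start]
    order = []
    while to_visit and len(order) < max_pages:
        url = to_visit.pop()          # last in, first out (depth-first order)
        if url in visited:
            continue                  # already handled, skip it
        visited.add(url)
        order.append(url)
        # queue only links that have not been visited yet
        to_visit.extend(n for n in get_neighbors(url) if n not in visited)
    return order

# Hypothetical site map standing in for real pages and their links
site = {
    '/': ['/a', '/b'],
    '/a': ['/', '/c'],
    '/b': ['/c'],
    '/c': [],
}
print(crawl('/', lambda url: site.get(url, [])))
```

Because `pop()` takes the last element, the crawl goes depth-first; using `collections.deque` with `popleft()` would give breadth-first order instead.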

When you open a file you usually use `with open()`. The `with` statement automatically closes the file once you are done reading or writing. `open()` takes three main arguments: the file name, the mode, and the encoding (optional; a platform default is used when it is omitted). You are usually either reading or writing a file: `r` selects reading mode and `w` selects writing mode. Two more modes are worth mentioning:
*   `w+`: Opens a file in read and write mode. It creates the file if it does not exist; if it exists, it erases the contents of the file and the file pointer starts at the beginning.
*   `r+`: Opens a file in read and write mode without erasing it. The file must already exist, and the file pointer starts at the beginning of the file.
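
A short sketch of how these modes behave, using a throwaway file in a temporary directory (the file name is arbitrary):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')  # throwaway file

with open(path, 'w', encoding='utf-8') as f:    # create or truncate, write only
    f.write('hello')

with open(path, 'r+', encoding='utf-8') as f:   # read and write, file must exist
    before = f.read()          # existing contents are still there
    f.seek(0)
    f.write('J')               # overwrites the first character in place

with open(path, 'w+', encoding='utf-8') as f:   # truncates first, then read and write
    after_truncate = f.read()  # empty: the previous contents were erased
    f.write('new')
    f.seek(0)
    final = f.read()

print(before, repr(after_truncate), final)
print(f.closed)                # the with block closed the file
```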

In [None]:
def scraper(base_url, output_file):
    visited = set()
    to_visit = [base_url]
    columns = ['title', 'text']
    with open(output_file, 'w', newline = '', encoding = 'utf-8') as f: # w for write mode
        writer = csv.DictWriter(f, fieldnames = columns, dialect = 'unix')
        writer.writeheader()
        i=0
        while to_visit:
            print(i)
            i+=1
            # get url from to_visit
            url = to_visit.pop() # removes and returns last element of the list
            # confirm it is not in visited if it is skip to next iteration using continue
            if url in visited:
                continue
            # add to visited
            visited.add(url)
            # get soup
            soup = get_soup(url)
            # get page title and text from soup and create a new row which has dict format
            row = get_row(soup)
            # write new row in the csv
            writer.writerow(row)
            print(row)
            # get links from soup
            links = get_links(soup, base_url)
            # append links to to_visit list if they are not in the visited set
            to_visit.extend(link for link in links if link not in visited)
            if i >= 50: # safety cap: stop after 50 iterations
                break
scraper(base_url, output_file)

In [None]:
def scraper(base_url, output_file):
    visited = set()
    to_visit = [base_url]
    columns = ['title', 'text']
    rows = [] # one dict per page; DataFrame.append() was removed in pandas 2.0

    i=0
    while to_visit:
        print(i)
        i+=1
        # get url from to_visit
        url = to_visit.pop() # removes and returns last element of the list
        # confirm it is not in visited if it is skip to next iteration using continue
        if url in visited:
            continue
        # add to visited
        visited.add(url)
        # get soup
        soup = get_soup(url)
        # get page title and text from soup and create a new row which has dict format
        row = get_row(soup)
        # collect the new row
        rows.append(row)
        # get links from soup
        links = get_links(soup, base_url)
        # append links to to_visit list if they are not in the visited set
        to_visit.extend(link for link in links if link not in visited)

    # build the DataFrame once and write it to the requested output file
    df = pd.DataFrame(rows, columns=columns)
    df.to_csv(output_file, encoding='utf-8')
    return df
df = scraper(base_url, output_file)

In [31]:
df.to_csv('text.csv', encoding='utf-8', sep=',') # sep sets the column delimiter (',' is the default)

# FastAPI intro
FastAPI is a web framework for building APIs with Python. To run a FastAPI application we need a server program such as Uvicorn, which is an ASGI server. Uvicorn lets us start the server ourselves from the command line, so the application can run on our own machine or on a remote server.