<a href="https://colab.research.google.com/github/know2001/ask_divya/blob/dani-in_progress/dani_scraper_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BeautifulSoup4 Scraper Notes

In [None]:
!pip install beautifulsoup4



In [36]:
import requests
from bs4 import BeautifulSoup
import re
import urllib.parse
import pandas as pd
import csv
import numpy as np

## Get Soup

In [2]:
def get_soup(url):
    page = requests.get(url) # fetches the page; returns a Response object
    soup = BeautifulSoup(page.text, 'html.parser') # parses the HTML text of the response
    return soup

## Get Links (Simple Spider)
BS4 has two search methods, `find()` and `find_all()`. The first returns the first element that meets the search condition; the latter returns a list of all matching elements. In HTML, the `<a>` tag defines a hyperlink. Here we want to collect every URL that starts with the base URL, so that we gather all the documentation about immigration.
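
The difference between the two search methods can be seen on a small inline document. This is only a sketch: the HTML snippet below is hypothetical and is not taken from the site being scraped.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = """
<main>
  <a href="/working-in-the-united-states/temporary-workers">Temporary Workers</a>
  <a href="/working-in-the-united-states/permanent-workers">Permanent Workers</a>
  <a name="top">Anchor without href</a>
</main>
"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                # first matching tag, or None if nothing matches
all_links = soup.find_all("a")             # every matching tag, in document order
with_href = soup.find_all("a", href=True)  # only <a> tags that carry an href attribute

print(first_link["href"])
print(len(all_links), len(with_href))
print([a["href"] for a in with_href])
```

Filtering with `href=True` avoids a `KeyError` when an `<a>` tag has no `href` attribute.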


The URL is stored in the `href` attribute. It is worth noting that HTML's `<base>` tag sets the base URL against which the relative URLs in a document (such as those in `href` or `src` attributes) are resolved; when a page has no `<base>` tag, relative URLs are resolved against the page's own URL. We are also going to parse the URLs extracted from the hyperlinks with `urllib.parse.urlparse()`:



```
>>> url ='https://cat.example/list;meow?breed=siberian#pawsize'
>>> urllib.parse.urlparse(url)
ParseResult(scheme='https', netloc='cat.example', path='/list', params='meow', query='breed=siberian', fragment='pawsize')
```



In [92]:
url ='https://cat.example/list;meow?breed=siberian#pawsize'
parsed_url = urllib.parse.urlparse(url)

In [93]:
parsed_url.fragment

'pawsize'

`urllib.parse.urljoin()` only applies the base URL when the second URL is relative, that is, when it has no scheme and host of its own. Usually the `href` URLs are relative, but some are already absolute; in that case `urllib.parse.urljoin()` returns them unchanged. Check the next two examples:

When both URLs are absolute and their base URLs differ, the second URL keeps its own base URL:
```
>>> urllib.parse.urljoin('http://BASE_URL1/%7Eguido/Python.html', 'http://BASE_URL2/FAQ.html')
'http://BASE_URL2/FAQ.html'
```
When the second URL is relative, it acquires the base URL from the first:
```
>>> urllib.parse.urljoin('http://BASE_URL1/%7Eguido/Python.html', 'FAQ.html')
'http://BASE_URL1/%7Eguido/FAQ.html'
```

In [3]:
def get_links(soup, base_url):
    links = []
    for link in soup.find_all('a', href=True):
        url = link["href"] # get url from href attribute
        # Resolve relative links; absolute links are returned unchanged
        url = urllib.parse.urljoin(base_url, url)
        # Drop the fragment so the same page is not collected twice
        url = urllib.parse.urldefrag(url).url
        # Keep only new links that sit under the base url
        if url.startswith(base_url) and url not in links:
            links.append(url)
    return links
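
To check the link normalization in isolation, without fetching any page, here is a small sketch that uses only `urllib.parse`. The helper name `normalize_link`, the base URL, and the `href` values are hypothetical and chosen for illustration.

```python
from urllib.parse import urldefrag, urljoin

def normalize_link(base_url, href):
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url

base_url = "https://www.example.com/docs"  # hypothetical base URL
hrefs = [
    "/docs/visas#overview",                # relative, with a fragment
    "/docs/visas#fees",                    # same page, different fragment
    "https://www.example.com/docs/forms",  # already absolute
    "https://other.example.org/page",      # different site
    "/about",                              # same site, outside the base URL
]

links = []
for href in hrefs:
    url = normalize_link(base_url, href)
    # Same filter as in get_links: stay under the base URL, skip repeats.
    if url.startswith(base_url) and url not in links:
        links.append(url)

print(links)
```

The two `/docs/visas` links collapse into one entry once their fragments are removed, and the last two links are dropped because they do not start with the base URL.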

In [28]:
base_url = "https://www.uscis.gov/working-in-the-united-states"
output_file = "text.csv"
soup = get_soup(base_url)
links = get_links(soup, base_url)

## Get Page Title
It is useful to store the title of each page alongside the content we collect.

In [25]:
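def get_title(soup):
    h1 = soup.find('h1') # first <h1> on the page, or None
    # Fall back to an empty title when the page has no <h1>
    title = h1.get_text().strip() if h1 is not None else ''
    return title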

## Get Text
We are going to extract the content of each page, ignoring non-text elements. We do this by targeting the paragraph tags `<p>`, the list tags `<ul>`, and the accordion headers inside the `<main>` element. The function is given a soup (the parsed HTML) and returns a row with the page title, the URL, and the extracted text.

In [13]:
soup = get_soup('https://www.uscis.gov/working-in-the-united-states/temporary-workers/e-1-treaty-traders')
acc_headers = soup.find_all('div',class_='accordion__header cke-active')
for i in acc_headers:
    print(i['class'], i.get_text()) # call get_text() to print the text, not the bound method

['accordion__header', 'cke-active'] Who May File for Change of Status to E-1 Classification
['accordion__header', 'cke-active'] How to Obtain E-1 Classification if Outside the United States
['accordion__header', 'cke-active'] General Qualifications of a Treaty Trader
['accordion__header', 'cke-active'] General Qualifications of the Employee of a Treaty Trader
['accordion__header', 'cke-active'] Period of Stay
...

In [26]:
def get_row(url, soup):
    title = get_title(soup)
    text = ''
    main = soup.find('main')

    # Walk every tag inside <main> in document order
    for element in main.find_all():
        if element.name in ('ul', 'p'):
            # Paragraphs and bullet lists carry the body text
            text = '\n'.join([text, element.get_text()])
        elif element.has_attr('class'):
            # Accordion headers have several classes, the first being 'accordion__header'
            classes = element['class']
            if len(classes) > 1 and classes[0] == 'accordion__header':
                text = '\n'.join([text, element.get_text()])

    row = [title, url, text]
    return row

## Scraper
The scraper visits every link found under the base URL and extracts the useful content from each page. It collects one row per page and returns the rows as a DataFrame, which we then save to a CSV file.

When you open a file you usually use `with open(...)`; the `with` statement automatically closes the file once you are done reading or writing. `open()` takes the file name, the mode, and optionally the encoding (when omitted, a platform-dependent default is used). You are usually either reading or writing: `r` selects reading mode and `w` selects writing mode. It is worth mentioning two more modes:
*   w+: Opens a file in read and write mode. It creates the file if it does not exist; if it exists, it erases the contents of the file. The file pointer starts at the beginning.
*   r+: Opens an existing file in read and write mode without erasing it. The file pointer starts at the beginning of the file.
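
The behaviour of these modes can be checked with a short sketch that works inside a temporary directory. The file name `demo.txt` is hypothetical. Note that Python rejects a mode string that combines `r` and `w`; the read-and-write mode that keeps the existing contents is spelled `r+`.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # 'w' creates the file, or truncates it if it already exists.
    with open(path, "w", encoding="utf-8") as f:
        f.write("first line\n")

    # 'r+' reads and writes an existing file without erasing it.
    with open(path, "r+", encoding="utf-8") as f:
        f.read()                  # the pointer is now at the end of the file
        f.write("second line\n")  # so this write lands after the old contents
        f.seek(0)
        after_r_plus = f.read()

    # 'w+' also reads and writes, but it erases the old contents first.
    with open(path, "w+", encoding="utf-8") as f:
        f.write("replaced\n")
        f.seek(0)                 # move back to the start before reading
        after_w_plus = f.read()

    # The with statement has already closed the file at this point.
    closed_after_with = f.closed

    # Combining 'r' and 'w' in one mode string is not allowed.
    try:
        open(path, "rw+")
        invalid_mode_rejected = False
    except ValueError:
        invalid_mode_rejected = True

print(repr(after_r_plus))
print(repr(after_w_plus))
print(closed_after_with, invalid_mode_rejected)
```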

In [29]:
def scraper(base_url, output_file):
    # output_file is currently unused; the rows are returned as a DataFrame
    visited = set()
    to_visit = [base_url]
    data = []

    i = 0
    while to_visit:
        # get url from to_visit
        url = to_visit.pop() # removes and returns last element of the list
        # skip urls that were already visited
        if url in visited:
            continue
        i += 1
        if i % 5 == 0:
            print(f'{i} pages scraped')

        # add to visited
        visited.add(url)
        # get soup
        soup = get_soup(url)
        # get page title and text from soup as a [title, url, text] row
        row = get_row(url, soup)
        # store the new row
        data.append(row)
        # get links from soup
        links = get_links(soup, base_url)
        # add links to to_visit if they are not in the visited set
        to_visit.extend(link for link in links if link not in visited)

    print(f'{i} Pages scraped in total')
    columns = ['title', 'url', 'text']
    df = pd.DataFrame(data, columns=columns)
    return df

df = scraper(base_url, output_file)

5 pages scraped
10 pages scraped
15 pages scraped
20 pages scraped
25 pages scraped
30 pages scraped
35 pages scraped
40 pages scraped
45 pages scraped
50 pages scraped
55 pages scraped
60 pages scraped
65 pages scraped
70 pages scraped
75 pages scraped
80 pages scraped
85 pages scraped
89 Pages scraped in total
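
The traversal logic of `scraper` can be tried without network access on a toy link graph. In this sketch the `site` dictionary and the `crawl` function are hypothetical stand-ins for the real pages and for the `get_soup` / `get_links` calls; only the `visited` / `to_visit` bookkeeping mirrors the code above.

```python
# Hypothetical site map: each page maps to the links found on it.
site = {
    "/work": ["/work/temporary", "/work/permanent", "/work"],
    "/work/temporary": ["/work", "/work/temporary/e-1"],
    "/work/permanent": ["/work/temporary"],
    "/work/temporary/e-1": [],
}

def crawl(start, get_links):
    visited = set()
    order = []          # pages in the order they were processed
    to_visit = [start]  # used as a stack, so the crawl is depth-first
    while to_visit:
        url = to_visit.pop()
        if url in visited:
            continue    # the same url can be queued more than once
        visited.add(url)
        order.append(url)
        to_visit.extend(link for link in get_links(url) if link not in visited)
    return order

order = crawl("/work", lambda url: site.get(url, []))
print(order)
```

Because `pop()` takes the last element, the most recently discovered link is visited first. The `visited` check is what stops the loop on sites whose pages link back to each other.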


We want to make sure we can save the DataFrame into a CSV, and then load it back without breaking the shape (row x column).

In [32]:
df.to_csv('data.csv', index = False)
df1 = pd.read_csv('data.csv')
df1

Unnamed: 0,title,url,text
0,Working in the United States,https://www.uscis.gov/working-in-the-united-st...,\nMany noncitizens want to come to the United ...
1,Petition Process Overview,https://www.uscis.gov/working-in-the-united-st...,\nIf you would like to come to the United Stat...
2,Report Labor Abuses,https://www.uscis.gov/working-in-the-united-st...,\nWe are committed to helping protect the righ...
3,Options for Nonimmigrant Workers Following Ter...,https://www.uscis.gov/working-in-the-united-st...,"\nWhen nonimmigrant workers are laid off, they..."
4,Employment Authorization in Compelling Circums...,https://www.uscis.gov/working-in-the-united-st...,\nThis temporary employment authorization may ...
...,...,...,...
84,Questions and Answers: EB-5 Immigrant Investor...,https://www.uscis.gov/working-in-the-united-st...,\nQuestions and Answers: Visa Availability App...
85,Questions and Answers: EB-5 Further Deployment,https://www.uscis.gov/working-in-the-united-st...,"\nA1. For now, Form I-924A does not separately..."
86,EB-5 Questions and Answers: EB-5 Reform and In...,https://www.uscis.gov/working-in-the-united-st...,\nEntities seeking to be designated as a regio...
87,EB-5 What's New,https://www.uscis.gov/working-in-the-united-st...,\nThis page provides the latest information on...
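
The reason the shape could break is that the `text` column contains newlines, commas, and quotes. As a minimal sketch of why the round trip is still safe, the same check can be done with the standard `csv` module on an in-memory buffer. The rows below are hypothetical and only follow the `[title, url, text]` layout.

```python
import csv
import io

# Hypothetical rows in the same [title, url, text] layout.
rows = [
    ["Page A", "https://www.example.com/a", "\nFirst paragraph, with a comma\nSecond paragraph"],
    ["Page B", "https://www.example.com/b", '\nText with "quotes" inside'],
]

# newline="" lets the csv module control line endings itself.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["title", "url", "text"])
writer.writerows(rows)  # fields with newlines, commas, or quotes get quoted

buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)
loaded = list(reader)

print(header)
print(len(loaded), [len(row) for row in loaded])
print(loaded == rows)
```

Each field that contains a special character is wrapped in quotes on writing, so a multi-line text still comes back as a single cell.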


Check for duplicates:

In [131]:
print(f"{sum(df1.duplicated(subset='text'))} duplicates")

0 duplicates


## Data Cleaning

In [37]:
def remove_unwanted_text(df, unwanted_text):
    # Remove unwanted_text only when it appears at the start of a page's text
    df['text'] = df['text'].apply(lambda x: x.replace(unwanted_text, '', 1) if x.startswith(unwanted_text) else x)
    return df
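
The prefix-removal rule can be tested on plain strings before applying it to the DataFrame. In this sketch the helper name `strip_prefix` and the banner text are hypothetical.

```python
def strip_prefix(text, unwanted_text):
    """Remove unwanted_text only when it appears at the very start of text."""
    if text.startswith(unwanted_text):
        return text[len(unwanted_text):]
    return text

unwanted = "\nALERT: "  # hypothetical boilerplate prefix
texts = [
    "\nALERT: Fees change soon.\nALERT: appears again later.",
    "\nNo banner on this page.",
]

cleaned = [strip_prefix(t, unwanted) for t in texts]
print(cleaned)
```

Only the leading occurrence is removed; the second `ALERT:` in the first text is left untouched, and a text without the prefix is returned unchanged.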


In [80]:
def get_pdf_file_names(df):
    pdf_matches = []
    # Raw strings keep the backslashes intact. Note that .* is greedy, so one
    # match can span several PDF markers that sit on the same line.
    pattern_list = [r"\(PDF[,].*KB\)", r"\(PDF[,].*MB\)", r"\(PDF\)"]

    for text in df['text']:
        # Keep searching until none of the patterns matches the remaining text
        matched_search = True
        while matched_search:
            matched_search = False
            for pattern in pattern_list:
                result = re.search(pattern, text)
                if result is not None:
                    matched_search = True
                    pdf_matches.append(result[0])
                    # Remove the match so the next search finds a new one
                    text = text.replace(result[0], '')

    return set(pdf_matches)

pdf_file_names = get_pdf_file_names(df1)
print(len(pdf_file_names), pdf_file_names)

48 {'(PDF, 1.3 MB)', '(PDF)', '(PDF, 10.99 MB)', '(PDF, 1015.37 KB)', '(PDF, 399.14 KB)', '(PDF, 253.39 KB)\xa0 | 한국어 (PDF, 322.1 KB)\xa0 | Русский (PDF, 280.01 KB)', '(PDF, 268.06 KB)', '(PDF, 225.48 KB)', '(PDF, 228.14 KB)', '(PDF, 160.67 KB)', '(PDF, 270.61 KB)', '(PDF, 250.08 KB)\xa0 | 한국어 (PDF, 428.48 KB)\xa0 | Русский (PDF, 361.57 KB)', '(PDF, 379.71 KB)', '(PDF, 790.07 KB)', '(PDF, 347.4 KB)', '(PDF, 221.65 KB)', '(PDF, 70.28 KB)', '(PDF, 5.73 MB)', '(PDF, 324.11 KB)', '(PDF, 476.13 KB)', '(PDF, 272.13 KB)', '(PDF, 357.48 KB) \xa0| 한국어 (PDF, 596.67 KB) \xa0| Русский (PDF, 540.72 KB)', '(PDF, 400.77 KB)', '(PDF, 156 KB)', '(PDF, 95.33 KB)', '(PDF, 1.12 MB)', '(PDF, 641.66 KB)', '(PDF, 125.43 KB)', '(PDF, 314.97 KB)', '(PDF, 367.63 KB)', '(PDF, 315.88 KB)', '(PDF, 238.48 KB)', '(PDF, 416.01 KB)', '(PDF, 1.29 MB)', '(PDF, 198.94 KB)', '(PDF, 249.12 KB)', '(PDF, 151.74 KB)', '(PDF, 123.38 KB)', '(PDF, 276.55 KB)', '(PDF, 262.77 KB)', '(PDF, 1.13 MB)', '(PDF, 618.16 KB)', '(PDF, 766.
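
Some of the matches above contain several PDF markers at once, because `.*` is greedy and runs up to the last `KB)` on the line. One possible alternative, shown here only as a sketch, is a single pattern whose middle part cannot cross a closing parenthesis. The name `PDF_MARKER` and the sample string are hypothetical; the sample is modeled on the multi-language entries in the output.

```python
import re

# Matches "(PDF)" and "(PDF, <size> KB|MB)". [^)]* cannot run past the
# closing parenthesis, so each marker is matched separately.
PDF_MARKER = re.compile(r"\(PDF(?:,[^)]*(?:KB|MB))?\)")

# Hypothetical sample text.
sample = (
    "Guide (PDF, 253.39 KB)\xa0 | 한국어 (PDF, 322.1 KB)\xa0 | "
    "Русский (PDF, 280.01 KB) and form (PDF)"
)

# The greedy pattern from get_pdf_file_names swallows three markers at once.
greedy = re.search(r"\(PDF[,].*KB\)", sample)[0]

# The bounded pattern returns each marker on its own.
markers = PDF_MARKER.findall(sample)

print(greedy)
print(markers)
```

With the bounded pattern, `PDF_MARKER.sub('', text)` would remove every marker in one pass, so the repeated search loop is not needed.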

In [58]:
pdf_matches

[]