### Remix your own scraper

This notebook combines code snippets for scrapers that I use again and again. Feel free to copy, paste, delete, or comment out any of the code to suit your own analysis.

In [None]:
# —————— libraries that need to be installed, which you can do via pip ———————

import pdfplumber # to scrape pdfs, documentation: https://github.com/jsvine/pdfplumber
import requests # to open up live links, documentation: https://docs.python-requests.org/en/latest/
from bs4 import BeautifulSoup # to parse HTML, documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
import spacy # to do natural language processing like word counts, documentation
import pandas as pd # to use pandas to process data


# —————— libraries built into Python ———————

import json # to read json formatted data
import csv # to write and read csv
import time # to build in wait time for loops
import glob # to access file paths


### Using `glob` to get file paths from a folder

In [None]:
paths = glob.glob("../data/folder_name/*.pdf") # *.pdf can be replaced with other file extensions, like *.html, or left blank
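
As a rough illustration of how the pattern affects what `glob` returns, here is a minimal sketch. It creates a temporary folder with made-up file names (`a.pdf`, `b.pdf`, `notes.txt`, `archive/c.pdf`) purely for demonstration; none of these exist in the original project.

```python
import glob
import os
import tempfile

# Build a throwaway folder so the sketch runs anywhere (hypothetical file names)
with tempfile.TemporaryDirectory() as folder:
    for name in ["b.pdf", "a.pdf", "notes.txt"]:
        open(os.path.join(folder, name), "w").close()
    os.makedirs(os.path.join(folder, "archive"))
    open(os.path.join(folder, "archive", "c.pdf"), "w").close()

    # top-level PDFs only; sorted() because glob returns paths in arbitrary order
    pdf_paths = sorted(glob.glob(os.path.join(folder, "*.pdf")))

    # "**" together with recursive=True also searches subfolders
    all_pdf_paths = sorted(
        glob.glob(os.path.join(folder, "**", "*.pdf"), recursive=True)
    )

    pdf_names = [os.path.basename(path) for path in pdf_paths]
    all_pdf_names = [os.path.basename(path) for path in all_pdf_paths]

print(pdf_names)
print(all_pdf_names)
```

Sorting the paths is optional, but it keeps the order of your scraped rows stable between runs.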

### Using `requests` to open a live URL
- use `requests.get()` to get a website's response, for both an API feed and a live link

#### For an API feed
- For APIs: build your API URL and store it in a variable
    - assemble the URL as one long string using concatenation
    - insert your API key between the quotation marks, or store it in a separate file that is not committed to GitHub and read it this way: `open('../data/api-key.txt').read().strip()`
    - build the base URL according to the API's documentation
    - request the URL and parse the response as JSON

In [None]:
api_key = "" # insert api key into quotation marks or store separately as a file
api_url = 'https://www.googleapis.com/youtube/v3/search?key='+api_key+'&part=snippet&channelId=UCJFp8uSYCjXOMnkUyb3CQ3Q' # the sample API here is the YouTube Data API
print(api_url)

In [None]:
api_response = requests.get(api_url).text
items = json.loads(api_response)
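
String concatenation works, but it does not escape special characters in parameter values. As one possible alternative, the sketch below builds the same kind of URL with the standard library's `urlencode`. The function name `build_search_url` and the placeholder key `YOUR_KEY` are assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_search_url(api_key, channel_id):
    """Return a YouTube search URL for one channel (hypothetical helper)."""
    base_url = "https://www.googleapis.com/youtube/v3/search"
    params = {"key": api_key, "part": "snippet", "channelId": channel_id}
    # urlencode joins the pairs with "&" and escapes unsafe characters
    return base_url + "?" + urlencode(params)


api_url = build_search_url("YOUR_KEY", "UCJFp8uSYCjXOMnkUyb3CQ3Q")
print(api_url)

# a value with a space and an ampersand is escaped rather than breaking the URL
escaped_query = urlencode({"q": "a b&c"})
print(escaped_query)
```

`requests.get()` also accepts a `params=` dictionary and does the same encoding for you, so you can pass the base URL and the dictionary separately if you prefer.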

#### For a live link
- use `requests.get()` to open URL 
- use `.content` to get its HTML

In [None]:
url = "https://en.wikipedia.org/wiki/Category:Women_computer_scientists"
page = requests.get(url)
page_content = page.content

#### For polite scrapers (optional)
- add information about yourself so the server's maintainers can contact you
- add time between each request in any loop so you don't overload the server


In [None]:
# Your identification
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "from": "Your name example@domain.com",
}
page_urls = ["https://url1.com", "https://url2.com"] # placeholder URLs; requests needs the scheme (https://)

for url in page_urls:
    # adds time between requests so as not to overload the server
    time.sleep(2)
    # `headers=headers` sends your identification, so the website's maintainers can see who scraped
    page = requests.get(url, headers=headers)
    page_content = page.content
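
Note that a loop like this keeps only the last page unless each result is stored somewhere. The following is a minimal sketch of one way to keep every page, keyed by URL. The names `fetch_politely` and `fake_fetch` are hypothetical, and the stand-in fetch function replaces the real network call only so the example runs offline.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2):
    """Call fetch(url) for each URL, pausing between calls, and keep every result."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # wait between requests, but not before the first one
            time.sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages


# Stand-in for a real request so the sketch runs without a network connection
def fake_fetch(url):
    return "<html>" + url + "</html>"


pages = fetch_politely(
    ["https://url1.com", "https://url2.com"], fake_fetch, delay_seconds=0
)
print(pages["https://url2.com"])
```

In a real scraper you would pass a function that wraps `requests.get(url, headers=headers).content` in place of `fake_fetch` and leave the delay at a couple of seconds.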

### Using `pdfplumber` to extract information from pdfs

In [None]:
with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])
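
`chars[0]` returns the details of a single character, so for most scraping tasks you will probably want the text of whole pages instead. Below is a small sketch that uses pdfplumber's `extract_text()` on every page. The helper name `extract_pdf_text` is an assumption, and the import sits inside the function only so that defining the helper does not require pdfplumber to be installed.

```python
def extract_pdf_text(path):
    """Return the text of every page in a PDF as one string (hypothetical helper)."""
    import pdfplumber  # third-party library, installed via pip

    page_texts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # pages without a text layer may yield None, so fall back to ""
            page_texts.append(page.extract_text() or "")
    return "\n".join(page_texts)
```

You could call it as `extract_pdf_text("path/to/file.pdf")`, or loop over the `paths` list collected with `glob`.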

### Using `BeautifulSoup` to parse an `html` file

- open the file
- parse it through beautiful soup
- find the container with your content 
- find each item and cycle through it
- store the data in a dictionary

In [None]:
rows = [] # collects one dictionary per scraped item

with open("sample_page.html") as page:
    soup = BeautifulSoup(page, "html.parser")
    # find the container, then every item inside it
    contents = soup.find("div", class_="name_of_class")
    entries = contents.find_all("div", class_="name_of_class2") # named `entries` so the built-in `list` is not overwritten

    for item in entries:
        content_item = item.find("div", class_="name_of_class3").get_text()
        timeaccessed = item.find("div", class_="name_of_class4").get_text()

        row = { "content_item": content_item,
                "timeaccessed": timeaccessed
              }
        rows.append(row)
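
One thing to watch for: `find()` returns `None` when an element is missing, and calling `.get_text()` on `None` raises an error. The sketch below shows one way to guard against that, using a made-up HTML snippet. The class names `feed`, `entry`, `title` and `time` and the helper `text_or_none` are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the second entry deliberately has no "time" div
html = """
<div class="feed">
  <div class="entry"><div class="title">First</div><div class="time">09:00</div></div>
  <div class="entry"><div class="title">Second</div></div>
</div>
"""


def text_or_none(parent, class_name):
    """Return the stripped text of the first matching div, or None if it is missing."""
    tag = parent.find("div", class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(html, "html.parser")
rows = []
for entry in soup.find_all("div", class_="entry"):
    rows.append({
        "content_item": text_or_none(entry, "title"),
        "timeaccessed": text_or_none(entry, "time"),
    })
print(rows)
```

Missing values end up as `None` in the dictionary, which `csv.DictWriter` writes as an empty cell.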

### Using `csv` to write scraped information to file
- create and open the file with a `with open()` statement
- create fieldnames if you've stored the data as dictionaries
- write header
- write rows

In [None]:
# make a new csv into which we will write all the rows
with open("../output/csv_name.csv", "w", newline="") as csvfile: # newline="" stops the csv module from adding blank lines on Windows
    # these are the header names (must correspond with names you gave it while scraping, see above):
    fieldnames = ["content_item", "timeaccessed"]
    # this creates your csv
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    # this writes in the first row, which are the headers
    writer.writeheader()

    # this loops through your rows (the list you set at the beginning and have updated throughout)
    for row in rows:
        # this takes each row and writes it into your csv
        writer.writerow(row)
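
To check that the rows survive the trip to CSV and back, a quick round trip can help. The sketch below writes to an in-memory buffer instead of a file so it runs anywhere. The sample rows are invented, and `writerows()` is used as a shorthand for the loop.

```python
import csv
import io

# Invented sample rows; the second value contains a comma on purpose
rows = [
    {"content_item": "First", "timeaccessed": "09:00"},
    {"content_item": "Second, with a comma", "timeaccessed": "10:30"},
]

# An in-memory text buffer stands in for the csv file
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["content_item", "timeaccessed"])
writer.writeheader()
writer.writerows(rows)  # writes every dictionary in one call

# Rewind and read the data back as dictionaries
buffer.seek(0)
rows_read_back = list(csv.DictReader(buffer))
print(rows_read_back == rows)
```

The comma inside the second value is quoted automatically by the writer, so it does not split into an extra column when read back.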

## Word analysis (bonus!)

### Using `spacy` to do a simple word count

Spacy is a library that can assist you in doing linguistic analyses. 

To install and use the English-language version of spacy, run these commands in your virtual environment:
`pip3 install spacy`
`python3 -m spacy download en_core_web_sm`

In this example, you will be importing the `text.txt` file in our `data` folder.

- open document with text
- turn it into a `spacy` document/corpus
- process tokens

In [None]:
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_sm')

# opens the text file, reads it into a string, and closes the file again
with open("../data/text.txt", "r") as text_file:
    text = text_file.read()
len(text) # this returns the number of characters, including spaces

In [None]:
doc = nlp(text)
len(doc) # this returns the number of tokens

In [None]:
rows = []
for token in doc:
    rows.append(token.text)
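
Once the token texts are in a list, the counting itself can be done with `collections.Counter` from the standard library. This is only a sketch: the function name `count_words` and the sample tokens are made up, and in the notebook you would pass in the `rows` list built from `doc`.

```python
from collections import Counter


def count_words(tokens):
    """Count alphabetic tokens case-insensitively (hypothetical helper)."""
    # skip punctuation and numbers, and treat "The" and "the" as the same word
    words = [token.lower() for token in tokens if token.isalpha()]
    return Counter(words)


sample_tokens = ["The", "cat", "saw", "the", "other", "cat", "."]
word_counts = count_words(sample_tokens)
print(word_counts.most_common(2))
```

spaCy tokens also expose attributes such as `token.is_alpha` and `token.is_stop`, which you could use to filter while looping over `doc` instead of filtering the strings afterwards.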