In this assignment we'll scrape the 2020 Democratic and Republican national conventions, thanks to transcripts published by Rev, a transcription company.

In [1]:
import requests               # To get the pages
from bs4 import BeautifulSoup # and to process them

from time import sleep      # Allowing us to pause between pulls
from random import random   # And allowing that pause to be random

import textwrap             # Useful for our wrapped output

For our purposes, we'll work only with the visible text, so let's bring in the `tag_visible` function that filters for it. We'll also want the function that builds a nice filename from a URL. You'll need to fill in that filename function, but I've given you the code in an exercise.

In [2]:
from bs4.element import Comment

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def generate_filename_from_url(url) :
    
    if not url :
        return None
    
    # drop the http or https
    name = url.replace("https","").replace("http","")

    # Replace useless characters with UNDERSCORE
    name = name.replace("://","").replace(".","_").replace("/","_")
    
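    # remove last underscore (a trailing one if the URL ended in "/")
    last_underscore_spot = name.rfind("_")
    
    # rfind returns -1 when there is no underscore; leave the name alone then
    if last_underscore_spot != -1 :
        name = name[:last_underscore_spot] + name[(last_underscore_spot+1):]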

    # tack on .txt
    name = name + ".txt"
    
    return name
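To see what "visible text" means without pulling a live page, here is a minimal sketch that uses only the standard library's `html.parser` instead of BeautifulSoup. The class `VisibleTextParser`, the helper `visible_text`, and the sample HTML are all hypothetical names made up for illustration; this is a stand-in for the idea behind `tag_visible`, not the code the assignment uses. It leaves `meta` out of the hidden set because that tag has no closing tag and carries no text.

```python
from html.parser import HTMLParser

# Tags whose text is never shown to a reader (mirrors the list in tag_visible)
HIDDEN_TAGS = {"style", "script", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Hypothetical stdlib-only stand-in for BeautifulSoup + tag_visible."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0   # greater than 0 while inside a hidden tag
        self.pieces = []        # visible text fragments, in page order

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every hidden tag
        if self.hidden_depth == 0 and data.strip():
            self.pieces.append(data.strip())

    # handle_comment keeps its default (do nothing), so comments are dropped


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.pieces)


sample = """<html><head><title>Night 1</title><style>p {color: red}</style></head>
<body><!-- hidden note --><p>Good evening.</p><script>var x = 1;</script>
<p>Thank you.</p></body></html>"""

print(visible_text(sample))
```

Only the two paragraph texts survive; the title, style rules, script, and comment are filtered out, which is the same behavior we want from `tag_visible` on the real pages.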

In [3]:
convention_pages = dict()

convention_pages["democrats"] = """
https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-night-1-transcript
https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-2020-night-2-transcript
https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-night-3-transcript
https://www.rev.com/blog/transcripts/2020-democratic-national-convention-dnc-night-4-transcript
""".split()

convention_pages["republicans"] = """
https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-1-transcript
https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-2-transcript
https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-3-transcript
https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-4-transcript
""".split()


In [4]:
convention_pages

{'democrats': ['https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-night-1-transcript',
  'https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-2020-night-2-transcript',
  'https://www.rev.com/blog/transcripts/democratic-national-convention-dnc-night-3-transcript',
  'https://www.rev.com/blog/transcripts/2020-democratic-national-convention-dnc-night-4-transcript'],
 'republicans': ['https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-1-transcript',
  'https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-2-transcript',
  'https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-3-transcript',
  'https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-4-transcript']}

In [5]:
convention_pages["republicans"]

['https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-1-transcript',
 'https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-2-transcript',
 'https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-3-transcript',
 'https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-4-transcript']

Some questions to answer as part of the assignment: 

1. What kind of object is `convention_pages`? 
1. What kind of object is `convention_pages["republicans"]`? 

Now your answers: 

1. `convention_pages` is a dictionary of lists. 
1. `convention_pages["republicans"]` is a list within the dictionary.
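One way to check answers like these is to ask Python directly. The sketch below uses a small stand-in dictionary named `pages` with made-up `example.com` URLs (both are assumptions for illustration) that has the same shape as `convention_pages`.

```python
# Hypothetical stand-in with the same shape as convention_pages
pages = {
    "democrats": ["https://example.com/dnc-night-1", "https://example.com/dnc-night-2"],
    "republicans": ["https://example.com/rnc-night-1"],
}

print(type(pages).__name__)                    # the outer object
print(type(pages["republicans"]).__name__)     # the value stored under one key
print(type(pages["republicans"][0]).__name__)  # a single entry in that value

# Iterating over .items() gives each key with its list
for party, links in pages.items():
    print(party, len(links))
```

Running the same `type(...)` calls on `convention_pages` in the notebook should confirm the answers above.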


Let's go through these pages and scrape all visible text. We'll store each text in its own
file, where the file name is built from the URL by `generate_filename_from_url`.

In [6]:
for party in convention_pages : # Using party as a key, pull transcripts
    for link in convention_pages[party] : 
        # Using the function defined above, create a file name
        output_file_name = generate_filename_from_url(link)
        
        # pull the page; skip this link if the request itself fails
        try:
            r = requests.get(link)
        except requests.exceptions.RequestException :
            continue
        
        # process the page only if the status code is 200 (successful pull)
        if r.status_code != 200 :
            continue
        
        soup = BeautifulSoup(r.text, 'html.parser')
        texts = soup.find_all(string=True)
        visible_texts = filter(tag_visible, texts)
        
        # write out the page to a file with the appropriate name
        with open(output_file_name,'w',encoding = "UTF-8") as outfile :
            outfile.write(" ".join(t.strip() for t in visible_texts))
            
    # Pause for a bit after each party's pages
    wait_time = 5 + random()*10
    print(f"Waiting for {wait_time:.02f} seconds.")
        
    sleep(wait_time)
        

Waiting for 13.69 seconds.
Waiting for 10.80 seconds.
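The wait time in the loop is `5 + random()*10`, so it always falls between 5 and 15 seconds. A minimal sketch of that arithmetic follows; the helper name `random_wait` and its parameters are hypothetical, the seed is there only to make the sketch repeatable, and there is no `sleep` call so it runs instantly.

```python
from random import random, seed


def random_wait(minimum=5.0, spread=10.0):
    """Hypothetical helper: a wait drawn uniformly from [minimum, minimum + spread)."""
    # random() returns a float in [0.0, 1.0), so the result is in [5.0, 15.0) by default
    return minimum + random() * spread


seed(0)  # only so the sketch is repeatable
waits = [random_wait() for _ in range(5)]

for w in waits:
    print(f"Waiting for {w:.02f} seconds.")
```

Randomizing the pause keeps the requests from arriving at a fixed rhythm, while the 5-second floor guarantees a minimum gap.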


When I open each text file on my computer, each one appears to contain the proper title and corresponding transcript. The political party and night of the convention are visible in the name of the file.
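The file names above are built from the whole URL. If a shorter name based only on the last part of the URL is preferred, one possible sketch uses `urllib.parse` from the standard library. The helper `filename_from_last_segment` is a hypothetical alternative, not the function from the exercise.

```python
from urllib.parse import urlparse


def filename_from_last_segment(url):
    """Hypothetical helper: name a file after the last path segment of a URL."""
    if not url:
        return None
    path = urlparse(url).path.rstrip("/")   # drop any trailing slash
    last_segment = path.rsplit("/", 1)[-1]   # text after the final "/"
    if not last_segment:
        return None                          # URL had no path to name the file after
    return last_segment + ".txt"


url = "https://www.rev.com/blog/transcripts/2020-republican-national-convention-rnc-night-1-transcript"
print(filename_from_last_segment(url))
# prints 2020-republican-national-convention-rnc-night-1-transcript.txt
```

The party and the night still appear in the name, and the shared `www.rev.com/blog/transcripts` prefix is dropped.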

--- 

### A Helpful Function

When you have to write out a long string, it's nice to wrap that text. The library `textwrap` makes that easy. The code below generates a long string and writes out the output in wrapped form. 

In [7]:
from random import choices, seed
from string import ascii_letters

In [8]:
# Generate a long string with some spaces. 

string_length = 50000
chars_to_sample = ascii_letters + " "*8 # Get some spaces in there

seed(20200916)

text = "".join(choices(chars_to_sample,k=string_length))

First we'll just write out the text. You'll notice it's just one long line. 

In [9]:
with open("text.txt",'w') as outfile :
    outfile.write(text)

The library `textwrap` will let us make a nice, wrapped output.

In [10]:
wrapped_text = textwrap.wrap(text)

with open("text_wrapped.txt",'w') as outfile :
    for piece in wrapped_text :
        outfile.write(piece + "\n")
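`textwrap.wrap` breaks lines at 70 characters unless told otherwise. As a small sketch of the other options, the code below passes an explicit `width` and also uses `textwrap.fill`, which returns a single string with the newlines already in place. The `sample` string is made up for illustration.

```python
import textwrap

sample = "the quick brown fox jumps over the lazy dog " * 10

# wrap returns a list of lines, each at most `width` characters long
lines = textwrap.wrap(sample, width=40)
print(len(lines), max(len(line) for line in lines))

# fill returns the same lines joined with newlines, ready to write in one call
wrapped = textwrap.fill(sample, width=40)
print(wrapped)
```

With `fill`, the write loop above collapses to a single `outfile.write(wrapped + "\n")`.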