# Two Examples of Scraping and Chat Bot Creation
This notebook details the scraping and creation of two chat bots. Please be kind to the websites used here and do not burden them by running the scraping cells repeatedly, especially the cells that request many pages in a loop. The Prager University scraping is slightly more complex, while the Frankenstein scraping is straightforward. In both cases, making a chat bot is easy with markovify, and while the bots lack the complexity of certain recurrent neural network bots, they end up producing relatively coherent and entertaining sentences.
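One way to avoid re-downloading pages when cells are re-run is to cache each response on disk. The sketch below is only an illustration and is not part of the original notebook: the `cached_fetch` helper, the `page_cache` directory name, and the `fetch` argument are all assumptions made for this example.

```python
import hashlib
import time
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="page_cache", delay=1.0):
    """Return page bytes for `url`, downloading only if no cached copy exists.

    `fetch` is any callable that takes a URL and returns bytes (hypothetical
    helper argument; it is not something requests or bs4 provides).
    """
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it can be used as a safe file name
    file_name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    file_path = cache_path / file_name
    if file_path.exists():
        return file_path.read_bytes()  # Cache hit: no request is sent
    content = fetch(url)  # Cache miss: download the page once
    file_path.write_bytes(content)
    time.sleep(delay)  # Pause after a real request to avoid hammering the site
    return content
```

With the imports used in this notebook, a call could look like `cached_fetch(url, lambda u: rq.get(u).content)`, so that a second run of the same cell reads from disk instead of contacting the website.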

In [1]:
from bs4 import BeautifulSoup
import requests as rq
import time
import pandas as pd
from datetime import datetime
import markovify as mk

## Prager University Chat Bot
This will be a chat bot created from the transcripts of all of the available Prager University videos. This will take two steps:
1. Scraping of Prager University's Website
2. Building Markov Chain Chat Bot

### Scraping Prager University's Website
This stage consists of two steps:
1. Gather URLs for each video
2. Scrape relevant data for each video

#### Gather URLs for each video

In [2]:
vidSelPage = rq.get("https://www.prageru.com/5-minute-videos") # Getting homepage data
vidSelSoup = BeautifulSoup(vidSelPage.content, 'html.parser') # Parsing homepage data
vidThumb = vidSelSoup.find_all('div', class_='video-thumbnail') # Isolating data with video thumbnail links
vidThumb[0:4]

[<div class="video-thumbnail">
 <a href="/videos/where-are-you-martin-luther-king"><img height="212" src="https://www.prageru.com/sites/default/files/styles/16x9_small/public/courses/image/rileyjason_martinlutherkingjr_thumbnail_1280x720.png?itok=qB0xzr0R" width="380"/></a> </div>,
 <div class="video-thumbnail">
 <a href="/videos/what-does-diversity-have-do-science"><img height="212" src="https://www.prageru.com/sites/default/files/styles/16x9_small/public/courses/image/thumbnail_macdonaldheather_whatdoesdiversity_1280x720.png?itok=TAnF4zK4" width="380"/></a> </div>,
 <div class="video-thumbnail">
 <a href="/videos/wwi-war-changed-everything"><img height="212" src="https://www.prageru.com/sites/default/files/styles/16x9_small/public/courses/image/robertsandrew_wwi_website.png?itok=9wFpxFoi" width="380"/></a> </div>,
 <div class="video-thumbnail">
 <a href="/videos/how-reformation-shaped-your-world"><img height="212" src="https://www.prageru.com/sites/default/files/styles/16x9_small/pub

In [3]:
vidLinks = [i.select('a')[0].attrs['href'] for i in vidThumb] # Grabbing just the video thumbnail links
vidLinks[0:4]

['/videos/where-are-you-martin-luther-king',
 '/videos/what-does-diversity-have-do-science',
 '/videos/wwi-war-changed-everything',
 '/videos/how-reformation-shaped-your-world']

In [4]:
vidBase = 'https://www.prageru.com' # Base URL that each relative video link is appended to
vidUrls = [vidBase+i for i in vidLinks] # Combining base and extension
vidUrls[0:4]

['https://www.prageru.com/videos/where-are-you-martin-luther-king',
 'https://www.prageru.com/videos/what-does-diversity-have-do-science',
 'https://www.prageru.com/videos/wwi-war-changed-everything',
 'https://www.prageru.com/videos/how-reformation-shaped-your-world']
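String concatenation works here because every link starts with a slash. As an alternative sketch (not what the cell above does), the standard library's `urljoin` builds the same URLs and also copes with a trailing slash on the base. The names `vid_base`, `sample_links`, and `sample_urls` are made up for this example.

```python
from urllib.parse import urljoin

vid_base = "https://www.prageru.com"  # Same base URL as in the cell above
sample_links = [
    "/videos/where-are-you-martin-luther-king",
    "/videos/wwi-war-changed-everything",
]

# urljoin resolves each relative link against the base URL
sample_urls = [urljoin(vid_base, link) for link in sample_links]
print(sample_urls)
```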

Now we have the URL for each video and can go to the next stage of scraping.
#### Scrape relevant data for each video
For making a chat bot I will only need the transcripts, but I'll pull together some accompanying data in case I want to look at other things at a later date. So I will grab the title, date, presenter, views, and transcript for each video and compile them into a pandas DataFrame.

In [5]:
title = [] # Making empty list to store titles
date = [] # Making empty list to store dates
presenter = [] # Making empty list to store presenters
views = [] # Making empty list to store views
transcripts = [] # Making empty list to store video transcripts
for i in vidUrls: # looping over each URL
    a = rq.get(i) # Downloading video web page
    b = BeautifulSoup(a.content, 'html.parser') # Parsing video web page
    title.append(b.find('h1', class_='video-title h4').get_text()) # Gathering title
    date.append(' '.join(b.find('div', class_='date').get_text().split())) # Gathering date
    presenter.append(' '.join(b.find('div', class_='presenter').get_text().split())) # Gathering presenter
    views.append(' '.join(b.find('div', class_='view-count').get_text().split())) # Gathering views
    c = b.find('div', class_="transcript reveal-single-target") # Gathering transcript paragraphs
    tmpList = [] # Creating empty list to store each paragraph
    for j in c.select('p'): # Iterating over each paragraph
        d = j.get_text() # Storing text in variable d
        e = ' '.join(d.split()) # Cleaning up extra spaces
        tmpList.append(e) # Compiling each paragraph into list
    transcripts.append(tmpList) # Storing transcripts as list of lists
    time.sleep(1) # Waiting a second to be respectful of not overloading website
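The `' '.join(text.split())` idiom used throughout the loop above collapses every run of spaces, tabs, and line breaks into a single space and trims both ends. A minimal stand-alone illustration, with a hypothetical helper name `clean_text`:

```python
def clean_text(raw):
    """Collapse runs of spaces, tabs, and line breaks into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empties
    return " ".join(raw.split())


print(clean_text("  Jan 14,\n   2019  "))  # prints: Jan 14, 2019
```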

In [6]:
transcript = [' '.join(i) for i in transcripts] # Collapsing lists of lists into paragraph per video

In [7]:
pragerU = pd.DataFrame({ # Compiling lists into pandas DataFrame
    'title': title,
    'date': date,
    'presenter': presenter,
    'views': views,
    'transcript': transcript
})
pragerU.head()

Unnamed: 0,title,date,presenter,views,transcript
0,"Where Are You, Martin Luther King?","Jan 14, 2019",Jason Riley,"1,406,116 Views",It’s been 50 years since Dr. Martin Luther Kin...
1,What Does Diversity Have to Do with Science?,"Jan 7, 2019",Heather Mac Donald,"2,708,398 Views",The promoters of identity politics—the idea th...
2,WWI: The War That Changed Everything,"Dec 31, 2018",Andrew Roberts,"3,154,425 Views","As an historian, I’m often asked if I could st..."
3,How the Reformation Shaped Your World,"Dec 24, 2018",Stephen Cornils,"2,197,363 Views","Five hundred years ago, on October 31, 1517, a..."
4,Why You Should Be a Nationalist,"Dec 17, 2018",Yoram Hazony,"2,803,954 Views",Britain votes to leave the European Union. The...


In [8]:
pragerU['date'] = [datetime.strptime(i, '%b %d, %Y') for i in pragerU['date']] # Putting date into datetime format
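For reference, here is how that format string reads a single scraped date. `%b` is the abbreviated month name, `%d` the day of the month, and `%Y` the four-digit year. Note that `%b` depends on the locale, so this sketch assumes an English locale.

```python
from datetime import datetime

# Parse one date string in the same format as the scraped 'date' column
parsed = datetime.strptime("Jan 14, 2019", "%b %d, %Y")
print(parsed)  # prints: 2019-01-14 00:00:00
```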

In [9]:
# Getting views into numeric (float) format
for x, i in enumerate(pragerU['views']): # Iterating over every observation in views, with its position
    a = i.replace(' Views', '') # Removing the ' Views' text
    if 'M' in a: # Handling views in the millions
        b = float(a.split('M')[0])
        c = b*1000000
    elif 'K' in a: # Handling views in the thousands
        b = float(a.split('K')[0])
        c = b*1000
    else: # Handling views without alphabetic notation
        c = float(''.join(a.split(',')))
    pragerU.loc[x, 'views'] = c # Assigning cleaned observation over current; .loc avoids chained indexing


In [10]:
pragerU.head()

Unnamed: 0,title,date,presenter,views,transcript
0,"Where Are You, Martin Luther King?",2019-01-14,Jason Riley,1406120.0,It’s been 50 years since Dr. Martin Luther Kin...
1,What Does Diversity Have to Do with Science?,2019-01-07,Heather Mac Donald,2708400.0,The promoters of identity politics—the idea th...
2,WWI: The War That Changed Everything,2018-12-31,Andrew Roberts,3154420.0,"As an historian, I’m often asked if I could st..."
3,How the Reformation Shaped Your World,2018-12-24,Stephen Cornils,2197360.0,"Five hundred years ago, on October 31, 1517, a..."
4,Why You Should Be a Nationalist,2018-12-17,Yoram Hazony,2803950.0,Britain votes to leave the European Union. The...
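The view-count cleaning can also be expressed as a small stand-alone function, which makes it easy to check on a few strings before touching the DataFrame. This is a sketch rather than the notebook's own code; `parse_views` is a hypothetical name, and it assumes any `K` or `M` suffix is the last character once ' Views' is removed.

```python
def parse_views(raw):
    """Convert strings such as '1,406,116 Views' or '2.5M Views' to a float."""
    # Drop the label and the thousands separators
    text = raw.replace(" Views", "").replace(",", "").strip()
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = text[-1:]  # Last character, or '' for an empty string
    if suffix in multipliers:
        return float(text[:-1]) * multipliers[suffix]
    return float(text)


print(parse_views("1,406,116 Views"))  # prints: 1406116.0
```

Such a function could then be applied to the whole column in one step with `pragerU['views'].apply(parse_views)`.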


### Making Markov Chain Chat Bot
This part (as with most machine learning) is actually the easy bit. It consists of two short stages:
1. Generating chat bot
2. Generating sentences from chat bot

#### Generating chat bot
Here I chose a state size (the number of preceding words considered when deciding on the next word) of 4, which appears to be the best balance between sentence coherency and the freedom to create new sentences that are not identical to those in the corpus. I think the best value will depend mostly on the size and variability of your corpus if you're doing this on other data.
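To make the idea of a state size concrete, here is a toy word-level Markov chain in plain Python. It is a simplified sketch and not markovify's implementation (markovify also tracks sentence beginnings and endings, for instance). The names `build_chain`, `generate`, and `toy_text` are made up for this example.

```python
import random
from collections import defaultdict


def build_chain(text, state_size=2):
    """Map each run of `state_size` words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - state_size):
        state = tuple(words[i:i + state_size])
        chain[state].append(words[i + state_size])
    return chain


def generate(chain, start, max_words=20, seed=None):
    """Walk the chain from `start`, picking a random recorded next word each step."""
    rng = random.Random(seed)
    state = tuple(start)
    output = list(state)
    while len(output) < max_words and state in chain:
        next_word = rng.choice(chain[state])
        output.append(next_word)
        state = state[1:] + (next_word,)  # Slide the window forward by one word
    return " ".join(output)


toy_text = "the cat sat on the mat and the cat ran to the mat"
chain = build_chain(toy_text, state_size=2)
print(chain[("the", "cat")])  # prints: ['sat', 'ran']
print(generate(chain, ("the", "cat"), max_words=8, seed=0))
```

A larger state size leaves fewer possible next words per state, so output stays closer to the source text; a smaller one gives more variety at the cost of coherence.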

In [11]:
pUbot = mk.Text(pragerU.transcript, state_size=4)

#### Generating sentences from chat bot
Here I generate 20 short sentences of no more than 100 characters each, but could have chosen anything really. Feel free to play around with a few different methods here. A different set of sentences is generated each time the cell is run. `None` is returned when markovify fails, within its default number of tries, to build a sentence that fits the character limit and is sufficiently different from the sentences in the corpus.

In [12]:
for i in range(20):
    print(pUbot.make_short_sentence(100))

And by the end of the war.
None
None
None
Notice equality is not part of the American way of life.
None
In other words, we are either created in the image of God and therefore infinitely valuable.
The printing press allowed, for the first time in the history of public health.
None
None
That is why figuring out how to make good people is the hardest part of any employer’s job.
None
You come off as one of the great chapters in the history of public health.
One of the most misunderstood clauses in the United States who, like me, have special needs.
Rather than let the free market heal itself.
One reason, I've discovered, is that many people don’t know what progressivity is.
In the autumn of 1863—at the height of the Civil War, the slave population had grown to 4 million.
In fact, a strong case can be made for both sides of the argument.
Because many people don't want to believe that the citizens of the city could still kill the child.
I’m Paul Kengor, Professor of Political Science at the
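If the `None` lines are unwanted, one option is to keep calling the generator until enough real sentences have been collected. The helper below is a sketch with a hypothetical name (`collect_sentences`); it accepts any zero-argument callable, so it does not depend on markovify itself.

```python
def collect_sentences(make_sentence, wanted=20, max_calls=200):
    """Call `make_sentence` until `wanted` non-None results are gathered."""
    sentences = []
    for _ in range(max_calls):  # Upper bound so the loop always terminates
        candidate = make_sentence()
        if candidate is not None:  # Skip failed attempts
            sentences.append(candidate)
        if len(sentences) == wanted:
            break
    return sentences
```

With the bot above this could be called as `collect_sentences(lambda: pUbot.make_short_sentence(100))`. Passing a larger `tries` keyword, as in `pUbot.make_short_sentence(100, tries=100)`, should also reduce the number of `None` results.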

In [13]:
pragerU.to_csv("pragerU.csv") # Saving dataset so I won't have to scrape to access later

## Frankenstein Chat Bot
This bot creation also required two steps:
1. Scraping Frankenstein Text
2. Making Markov Chain Chat Bot

As you will see, the scraping process for Frankenstein was far simpler, because it was all on one page and already in an amenable format.
### Scraping Frankenstein Text

In [14]:
# Accessing and parsing frankenstein text from url below
frankSoup = BeautifulSoup(rq.get('https://www.gutenberg.org/files/84/84-h/84-h.htm').content, 'html.parser')

In [15]:
# Isolating all text located within paragraph tags
frankList = frankSoup.find_all('p')
frankList[25:28]

[<p>
 Well, these are useless complaints; I shall certainly find no friend on the
 wide ocean, nor even here in Archangel, among merchants and seamen. Yet
 some feelings, unallied to the dross of human nature, beat even in these
 rugged bosoms. My lieutenant, for instance, is a man of wonderful courage
 and enterprise; he is madly desirous of glory, or rather, to word my phrase
 more characteristically, of advancement in his profession. He is an
 Englishman, and in the midst of national and professional prejudices,
 unsoftened by cultivation, retains some of the noblest endowments of
 humanity. I first became acquainted with him on board a whale vessel;
 finding that he was unemployed in this city, I easily engaged him to assist
 in my enterprise.
 </p>, <p>
 The master is a person of an excellent disposition and is remarkable in the
 ship for his gentleness and the mildness of his discipline. This
 circumstance, added to his well-known integrity and dauntless courage, made
 me very de

In [16]:
# Cleaning up text
quoteMarks = ['"', '“', '”'] # Straight and curly double quotation marks to strip
frankText = [] # Making empty list for storing text
for i in frankList: # Iterating over each paragraph tag
    text = i.get_text() # Extracting the paragraph text once
    if len(text) > 100: # Ignoring any paragraph of 100 characters or fewer
        text = text.replace('\r\n', ' ') # Replacing line breaks with spaces
        for q in quoteMarks: # Removing each form of quotation mark (no backslash, so the literal character matches)
            text = text.replace(q, '')
        frankText.append(text)

### Making Markov Chain Chat Bot
Once again, this stage consists of two steps:
1. Generating chat bot
2. Generating sentences from chat bot

#### Generating chat bot

In [17]:
frankBot = mk.Text(frankText, state_size=4)

#### Generating sentences from chat bot

In [18]:
for i in range(20):
    print(frankBot.make_short_sentence(100))

The forms of the beloved dead flit before me, and I began to ascend the mountain that overhangs it.
None
The image of Clerval was for ever before my eyes, and cried out in agony,
None
None
None
When I look back, it seems to me as if nothing would or could ever be known.
None
It advanced from behind the mountains of Jura, and the bright summit of Mont Blanc.
Shut in, however, by ice, it was impossible to return to my retreat during that day.
I resolved, therefore, that if my immediate union with my Elizabeth was one of horror and dismay.
The servant instantly showed it to one of the best houses in the town.
None
My rage is unspeakable when I reflect that you are the cause of its excess.
None
I thought of Switzerland; it was far different from mine in every other respect.
The image of Clerval was for ever before my eyes, and cried out in agony,
None
None
After passing some months in London, we received a letter from you in your own handwriting.


Quotation marks in the source text can be pesky to clean, but any that slip through don't really impact the reading of the sentences. Feel free to play around with some of the arguments in the sentence generation. There are a variety of methods that can be called besides make_short_sentence. Check the markovify GitHub repository, available at https://github.com/jsvine/markovify.

If any of the websites used for scraping would like this notebook to be removed, please contact me via GitHub at https://github.com/molmatt.