# Web crawler - almost

One approach to building a web crawler is to start with a starter page and extract all links from that page and put them in a queue. Then:

* go to each link in the queue
* scape all links and add to the queue

This could go on forever, or until you decide to stop.

In the next cell we see an initial approach to a crawler for our project which simply scrapes links off our starter page. The code uses the Python library BeautifulSoup as well as the Python **requests** module. You may need to install BeautifulSoup.

In [1]:
from bs4 import BeautifulSoup
import requests

In [2]:
starter_url = "https://en.wikipedia.org/wiki/Vince_Gilligan"

r = requests.get(starter_url)

data = r.text
soup = BeautifulSoup(data)

counter = 0
# write urls to a file
with open('urls.txt', 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        f.write(str(link.get('href')) + '\n\n')
        if counter > 20:
            break
        counter += 1

# end of program
print("end of crawler")

None
#mw-head
#searchInput
/wiki/Vince_Gill
/wiki/File:Vince_Gilligan_by_Gage_Skidmore_3.jpg
/wiki/San_Diego_Comic-Con
/wiki/Richmond,_Virginia
/wiki/Tisch_School_of_the_Arts
#cite_note-chesterfield-1
/wiki/AMC_(TV_channel)
/wiki/Breaking_Bad
/wiki/Spin-off_(media)
/wiki/Better_Call_Saul
/wiki/The_X-Files
/wiki/The_Lone_Gunmen_(TV_series)
/wiki/Primetime_Emmy_Awards
/wiki/Writers_Guild_of_America_Awards
/wiki/Critics%27_Choice_Television_Awards
/wiki/Producers_Guild_of_America_Awards
/wiki/Directors_Guild_of_America_Award
/wiki/British_Academy_of_Film_and_Television_Arts
/wiki/Hancock_(film)
end of crawler


Looking at those links, there are a lot that we don't want. The following rewrites the code, narrowing down what is saved.

In [3]:
#starter_url = "https://www.google.com/search?q=vince+gilligan&rlz=1C5CHFA_enUS584US584&oq=vince+&aqs=chrome.0.69i59j69i60j69i65j69i57j69i61j69i60.1220j0j7&sourceid=chrome&ie=UTF-8"

#r = requests.get(starter_url)

#data = r.text
#soup = BeautifulSoup(data)


# write urls to a file
with open('urls.txt', 'w') as f:
    for link in soup.find_all('a'):
        link_str = str(link.get('href'))
        print(link_str)
        if 'Gilligan' in link_str or 'gilligan' in link_str:
            if link_str.startswith('/url?q='):
                link_str = link_str[7:]
                print('MOD:', link_str)
            if '&' in link_str:
                i = link_str.find('&')
                link_str = link_str[:i]
            if link_str.startswith('http') and 'google' not in link_str:
                f.write(link_str + '\n')

# end of program
print("end of crawler")

None
#mw-head
#searchInput
/wiki/Vince_Gill
/wiki/File:Vince_Gilligan_by_Gage_Skidmore_3.jpg
/wiki/San_Diego_Comic-Con
/wiki/Richmond,_Virginia
/wiki/Tisch_School_of_the_Arts
#cite_note-chesterfield-1
/wiki/AMC_(TV_channel)
/wiki/Breaking_Bad
/wiki/Spin-off_(media)
/wiki/Better_Call_Saul
/wiki/The_X-Files
/wiki/The_Lone_Gunmen_(TV_series)
/wiki/Primetime_Emmy_Awards
/wiki/Writers_Guild_of_America_Awards
/wiki/Critics%27_Choice_Television_Awards
/wiki/Producers_Guild_of_America_Awards
/wiki/Directors_Guild_of_America_Award
/wiki/British_Academy_of_Film_and_Television_Arts
/wiki/Hancock_(film)
/wiki/El_Camino:_A_Breaking_Bad_Movie
#Early_life
#Education
#Career
#The_X-Files_and_The_Lone_Gunmen
#Breaking_Bad_and_Better_Call_Saul
#Other_work
#Personal_life
#Filmography
#Film
#Television
#Production_staff
#Writer
#Acting
#Awards_and_nominations
#References
#External_links
/w/index.php?title=Vince_Gilligan&action=edit&section=1
/wiki/Richmond,_Virginia
/wiki/Claims_adjuster
#cite_note-nytime

Let's look at the urls that actually got saved.

In [5]:
with open('urls.txt', 'r') as f:
    urls = f.read().splitlines()
for u in urls:
    print(u)

http://www.amctv.com/shows/breaking-bad/crew/vince-gilligan
http://www.mahalo.com/vince-gilligan
http://www.huffingtonpost.com/2012/07/17/vince-gilligan-breaking-bad_n_1679038.html
https://www.thewrap.com/breaking-bad-creator-vince-gilligan-staying-at-sony-tv-with-new-three-year-deal/
https://www.theverge.com/2018/11/7/18072070/breaking-bad-spinoff-film-vince-gilligan-greenbriar
http://nofilmschool.com/2013/10/vince-gilligan-breaking-bad-20th-austin-film-festival
http://artsbeat.blogs.nytimes.com/2014/03/12/breaking-bad-creator-vince-gilligans-next-project-an-appearance-on-community/?_php=true
https://www.nytimes.com/2013/09/26/business/media/breaking-bad-creator-gilligan-in-deal-for-cbs-show-battle-creek.html
http://www.hollywoodreporter.com/live-feed/cbs-cancels-vince-gilligans-battle-794524
https://commons.wikimedia.org/wiki/Category:Vince_Gilligan
https://web.archive.org/web/20081219095120/http://www.amctv.com/originals/breakingbad/cast/vgilligan
https://interviews.televisionacadem

Now to make a true crawler out of the code above, we would need to implement the queue of urls to keep crawling until some stopping criterion is reached. 

For your project code, crawl beyond the first page until you get 15 *relevant* urls.

## crawling and scraping

You can also use BeautifulSoup to get text from the urls in your url list. Let's try that on one url.

In [4]:
from bs4 import BeautifulSoup
import urllib.request
import re

my_url = "http://deadline.com/2017/08/better-call-saul-vince-gilligan-emmys-interview-news-1202151940/"

# function to determine if an element is visible
def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element.encode('utf-8'))):
        return False
    return True

html = urllib.request.urlopen(my_url)
soup = BeautifulSoup(html)
data = soup.findAll(text=True)
result = filter(visible, data)
temp_list = list(result)      # list from filter
temp_str = ' '.join(temp_list)
temp_str

'\n <![endif] \n \n \n \n  Add to home screen for iOS  \n \n \n \n \n \n  Tile icons for Windows  \n \n \n \n  Favicons  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n  Smart Banner start  \n \n  Smart Banner end  \n   \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n  Jetpack Open Graph Tags  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n  End Jetpack Open Graph Tags  \n \n \n \n \n   \n \n \n \n \n \n  Swiftype Meta Tags Start  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n  Swiftype Meta Tags End  \n \n \n \n \n   \n \n \n \n \n \n \n \n  Hotjar Tracking Code for https://pmcdeadline2.wordpress.com  \n \n  pmc-tags-head Venatus   Venatus Ad Manager  \n \n  / Venatus Ad Manager  \n  end pmc-tags-head Venatus   pmc-tags-head permutive  \n STAR

So that's some messy text. You will want to write functions to clean up that text by removing newlines and tabs, perhaps use regex to remove wiki-style links, 

## finding important terms

A simple way to find important words is to tokenize the text, remove stop words, punctuation and other nonimportant matter, and then do frequency counts to get the top terms. We will learn more sophisticated techniques later. 

You might print out the top 40 terms and then manually remove some that are not important, using your domain knowledge. Here are the 15 terms I kept. Notice that I put related terms in parenthesis so that these are my 15 topics my bot should be able to talk about. 

* personal (girlfriend, born, school, father, mother)
* series
* breaking_bad
* better_call_saul
* xfiles
* write
* produce
* direct
* tv (television)
* film (movie)

### building a knowledge base

Assuming that you saved the cleaned documents you can loop through those and extract sentences using nltk's sentence tokenizer.

In [5]:
from nltk.tokenize import sent_tokenize
text = "This is a sentence. This is another sentence. Here is one more."
sents = sent_tokenize(text)
for sent in sents:
    print(sent)

This is a sentence.
This is another sentence.
Here is one more.


## sample dialog

Before leaving Part 1 you should type up some possible dialogs that you want your chatbot to be able to do. These dialogs can be short interchanges that vary depending on user input. Be prepared to share a couple of possible dialogs with the class. Here are a couple:


Initial dialog 1

Hi, I'm Fabian, a fan-bot of Vince Gilligan. I love to talk about Vince Gilligan! What is your name?

	I'm Karen.
    
Hi Karen. Do you like Vince Gilligan?

	Yes, I am a fan of Breaking Bad and Better Call Saul.
    
Me too! What is your favorite Vince Gilligan show?

	Oh, I don't know. It's hard to pick a favorite.
    
If you had to choose between Breaking Bad and Better Call Saul, which would it be?

	I guess Better Call Saul, but maybe because I just watched it recently.
    
Better Call Saul is a good choice. Did you know 
|random fact|?

    Yes, I knew that.

Initial dialog 2

Hi, I'm Fabian, a fan-bot of Vince Gilligan. I love to talk about Vince Gilligan! What is your name?

	I'm Karen.
    
Hi Karen. Do you like Vince Gilligan?

	Not really.
    
That's too bad! Have you seen Breaking Bad?

	Yes, I thought it was too violent.
    
So you didn't like it because it was too violent.

	That's right.
    
What about some of his other shows, like X-Files or Better Call Saul?

    I didn't know Vince Gilligan was invoved with the X-Files.