## Webscraping

Use BeautifulSoup to get quotes, authors, and tags from [Quotes to Scrape](http://quotes.toscrape.com/).

First, go to the site and inspect the page. Look at which links it contains and how the site is structured.

In [None]:
# import the necessary libraries
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import pymongo

1. Get the first author and the href for the author's page as a tuple from the [homepage](http://quotes.toscrape.com/)

In [None]:
# Make a get request to retrieve the page
html_page = requests.get('http://quotes.toscrape.com/') 
# Pass the page contents to beautiful soup for parsing
soup = BeautifulSoup(html_page.content, 'html.parser')

# Your code here


In [None]:
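""" SOLUTION: data for one author """
# The first <small> tag on the page holds the first author's name
author = soup.find('small')
# The link to the author's page is the first sibling tag after <small>
author_href = author.find_next_siblings()[0].get('href')
(author.text, author_href)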
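
The href returned above is relative to the site root, so it cannot be passed to `requests.get` as it stands. A minimal sketch of turning it into a full URL with the standard library's `urljoin`; the helper name `absolute_author_url` and the sample href are assumptions made for illustration:

```python
from urllib.parse import urljoin

BASE_URL = 'http://quotes.toscrape.com/'

def absolute_author_url(href, base_url=BASE_URL):
    """Resolve a relative author href against the site's base URL."""
    return urljoin(base_url, href)

# Sample value in the shape of the hrefs scraped above (assumed for illustration)
example_href = '/author/Albert-Einstein'
full_url = absolute_author_url(example_href)
print(full_url)
```

The resulting URL can then be requested and parsed in the same way as the homepage.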

2. Write a function to get **all** the authors and href links for the authors from the [homepage](http://quotes.toscrape.com/)


In [None]:
def authors(url):
    '''
    input: url
    
    return: a dictionary of authors and their urls
            {'author_1':'url_of_author_1', 'author_2':'url_of_author_2' ...}
    '''
    pass

In [None]:
""" SOLUTION: data for all the authors on a page """

def authors(url):
    # Make a get request to retrieve the page
    html_page = requests.get(url) 
    # Pass the page contents to beautiful soup for parsing
    soup = BeautifulSoup(html_page.content, 'html.parser')
    authors = soup.find_all('small')
    author_dictionary = {}
    for author in authors:
        author_dictionary[author.text] = author.find_next_siblings()[0].get('href')
    return author_dictionary

In [None]:
# run this cell to test the function
print(authors('http://quotes.toscrape.com/'))
print(authors('http://quotes.toscrape.com/page/3'))
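
Because the result is a dictionary keyed by author name, an author who appears several times on a page shows up only once. The sketch below illustrates this on a small inline HTML fragment, so it runs without a network request. The fragment is a simplified, made-up stand-in for the site's markup, and `authors_from_html` is a hypothetical variant of `authors` that takes HTML text instead of a URL and selects on the `author` class rather than on every `<small>` tag:

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup modeled on the quote blocks of the site
SAMPLE_HTML = """
<div class="quote">
  <span class="text">Quote one</span>
  <span>by <small class="author">Author A</small>
  <a href="/author/Author-A">(about)</a></span>
</div>
<div class="quote">
  <span class="text">Quote two</span>
  <span>by <small class="author">Author B</small>
  <a href="/author/Author-B">(about)</a></span>
</div>
<div class="quote">
  <span class="text">Quote three</span>
  <span>by <small class="author">Author A</small>
  <a href="/author/Author-A">(about)</a></span>
</div>
"""

def authors_from_html(html):
    """Map each author name to the href of the link that follows it."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {}
    for tag in soup.find_all('small', class_='author'):
        # The next <a> sibling of the author tag links to the author's page
        link = tag.find_next_sibling('a')
        if link is not None:
            result[tag.get_text(strip=True)] = link.get('href')
    return result

sample_authors = authors_from_html(SAMPLE_HTML)
print(sample_authors)
```

Three quote blocks go in, but only two entries come out, since "Author A" appears twice.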

3. Get the first author on each of the first 5 pages of quotes. You can get to the next page with the next button at the bottom of the homepage.


In [None]:
# Your code here


In [None]:
""" SOLUTION: first author on each of the first 5 pages """

for i in range(1, 6):
    # Each page of quotes lives at /page/<number>/
    html_page = requests.get(f'http://quotes.toscrape.com/page/{i}/')
    soup = BeautifulSoup(html_page.content, 'html.parser')
    # The first <small> tag on the page holds the first author's name
    author = soup.find('small')
    print(author.text)
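
An alternative to building the page URLs by hand is to follow the next button itself. The sketch below assumes the button is an `<a>` inside an `<li class="next">` element, which is worth confirming by inspecting the page. The function name `next_page_url` and the inline pager fragments are made up for illustration:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(html, current_url):
    """Return the absolute URL behind the next button, or None if there is none."""
    soup = BeautifulSoup(html, 'html.parser')
    next_item = soup.find('li', class_='next')
    if next_item is None or next_item.a is None:
        return None
    return urljoin(current_url, next_item.a.get('href'))

# Made-up pager fragments standing in for real page content
SAMPLE_PAGER = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
LAST_PAGE_PAGER = '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>'

next_url = next_page_url(SAMPLE_PAGER, 'http://quotes.toscrape.com/')
last_url = next_page_url(LAST_PAGE_PAGER, 'http://quotes.toscrape.com/page/10/')
print(next_url, last_url)
```

A scraping loop built on this would keep requesting pages until the function returns `None`, so it does not need to know the number of pages in advance.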
    
    


4. Write a function to get all of the quotes from a page.

In [None]:
def get_some_quotes(url):
    '''
    input: url of the page to scrape
    
    return: a list of dictionaries of quotes with their attributes
            [{'quote':'quote_1_text', 'author':'name_of_author_1'}, 
            {'quote':'quote_2_text', 'author':'name_of_author_2'}, ...]
    '''
    pass

In [None]:
""" SOLUTION: get_some_quotes """

def get_some_quotes(url):
    # Make a get request to retrieve the page
    html_page = requests.get(url)
    # Pass the page contents to beautiful soup for parsing
    soup = BeautifulSoup(html_page.content, 'html.parser')

    list_quotes = []
    # Each quote on the page sits in an element with class "quote"
    for quote_block in soup.find_all(class_="quote"):
        quote = {}
        quote['quote'] = quote_block.find(class_="text").text
        quote['author'] = quote_block.find(class_="author").text
        # Append only once the dictionary is complete
        list_quotes.append(quote)
    return list_quotes
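
The same per-quote loop can also collect tags. The sketch below assumes each tag is an `<a class="tag">` element inside the quote block, which should be checked against the live page. It runs on a made-up inline fragment rather than the site, and `parse_quote` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup modeled on a single quote block
SAMPLE_QUOTE = """
<div class="quote">
  <span class="text">Sample quote text</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
  </div>
</div>
"""

def parse_quote(quote_div):
    """Build one quote dictionary, including a list of tag names."""
    return {
        'quote': quote_div.find(class_='text').get_text(strip=True),
        'author': quote_div.find(class_='author').get_text(strip=True),
        'quote_tags': [a.get_text(strip=True)
                       for a in quote_div.find_all('a', class_='tag')],
    }

sample_soup = BeautifulSoup(SAMPLE_QUOTE, 'html.parser')
parsed = [parse_quote(div) for div in sample_soup.find_all('div', class_='quote')]
print(parsed)
```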

In [None]:
for_mongo = get_some_quotes('http://quotes.toscrape.com/' )

## NoSQL 

Now start a MongoDB server from the terminal with `mongod`, so that you can **create**, **update**, and **read** documents in the database.

Create and connect to a mongo database.

In [None]:
myclient = pymongo.MongoClient("mongodb://127.0.0.1:27017/")
mydb = myclient['quote_database']

In [None]:
mycollection = mydb['quote_collection']

1. Add the quotes from `get_some_quotes` for the [homepage](http://quotes.toscrape.com/) or use the json file `quotes.json`

In [None]:
results = mycollection.insert_many(for_mongo)

In [None]:
results.inserted_ids
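
One thing to watch for when re-running the insert cell: `insert_many` adds an `_id` field to each dictionary it is given, so inserting the same list a second time raises a duplicate-key error. A minimal sketch of one workaround, using a hypothetical helper `without_ids` (not part of pymongo) and made-up sample data in place of the scraped quotes:

```python
def without_ids(documents):
    """Return shallow copies of the documents with any '_id' key removed."""
    return [{key: value for key, value in doc.items() if key != '_id'}
            for doc in documents]

# Made-up sample data standing in for the quote list after a first insert
already_inserted = [
    {'_id': 1, 'quote': 'Sample quote', 'author': 'Sample Author'},
    {'quote': 'Another quote', 'author': 'Another Author'},
]
clean_docs = without_ids(already_inserted)
print(clean_docs)
```

With this helper, `mycollection.insert_many(without_ids(for_mongo))` can be run repeatedly without that error, though each run adds another copy of the quotes to the collection.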

2. Query the database for all the quotes by `'Albert Einstein'`

In [None]:
query_1 = mycollection.find({})
for x in query_1:
    pass

In [None]:
""" SOLUTION: data for Albert Einstein quotes """

query_1 = mycollection.find({'author':'Albert Einstein'})
for x in query_1:
    print(x)

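3. Update the first quote with its tags.

In [None]:
# The first quote on the page is the first document inserted above,
# so select it by the first id returned from insert_many
update_record = {'_id': results.inserted_ids[0]}
first_quote_tags = {'$set':{'quote_tags': ['change', 'deep-thoughts', 'thinking', 'world']}}

mycollection.update_one(update_record, first_quote_tags)


In [None]:
# Read the updated document back to confirm the tags were added
query_2 = mycollection.find(update_record)
for item in query_2:
    print(item)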