### Setting up Chrome Driver for Selenium

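Check which version of Chrome is installed by opening this address in the browser:

chrome://version/

Then download the ChromeDriver release that matches that version:

https://chromedriver.chromium.org/downloads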

### Inspecting a website using Developer Tools

Developer tools help you understand the structure of a website. All modern browsers come with developer tools installed. In this section, you’ll see how to work with the developer tools in Chrome. The process is very similar in other modern browsers.


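Open the example site in Chrome:

http://books.toscrape.com/index.html

Then open the developer tools with one of these keyboard shortcuts:

* Mac: Cmd+Alt+I
* Windows/Linux: Ctrl+Shift+I
* All: F12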


* HTML: markup language
* CSS: styling language
* JS: programming language

### Follow the rules for scrapers and bots
Most sites have a `robots.txt` file at the root of their domain. This is where the website owner explicitly states what bots are allowed to do on the site. Go to example.com/robots.txt (replacing example.com with the site's domain) and you should find a plain text file.


* https://www.facebook.com/robots.txt
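
Python's standard library can read these rules for you. The sketch below is only an illustration: the rules in `sample_rules` and the `my-scraper` user agent are hypothetical, and they are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed from a string so the example runs offline.
sample_rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# Ask whether a given user agent may fetch a given URL.
allowed_home = parser.can_fetch("my-scraper", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("my-scraper")

print(allowed_home)
print(allowed_private)
print(delay)
```

To check a real site instead, you could call `parser.set_url(...)` with the address of its `robots.txt` and then `parser.read()` before asking `can_fetch`.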

In [None]:
import requests

URL = "http://books.toscrape.com/index.html"
page = requests.get(URL)

#print(page.text)


1. Import the `requests` library.
2. Specify the URL you want to scrape.
3. Send an HTTP request to the specified URL and save the server's response in a response object called `page`.

`page.text` is the better choice for textual responses, such as an HTML or XML document, while `page.content` is the better choice for binary file types, such as an image or PDF file.
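
The difference is essentially raw bytes versus a decoded string. The sketch below uses plain Python bytes rather than a real response, so it needs no network access; `raw` is a hypothetical stand-in for `page.content`, and the price in it is made up.

```python
# Hypothetical stand-in for page.content: the raw bytes a server might send.
raw = "<p class='price_color'>£10.00</p>".encode("utf-8")

# Decoding with the right encoding recovers the text, as page.text would.
good_text = raw.decode("utf-8")

# Decoding with the wrong encoding garbles non-ASCII characters such as £.
bad_text = raw.decode("iso-8859-1")

print(type(raw).__name__)
print(good_text)
print(bad_text)
```

With `requests`, setting `page.encoding = "utf-8"` before reading `page.text` is one way to override an encoding that was guessed wrongly.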

In [None]:
# import the web client used to download the page
# urllib is part of the Python standard library, so there is nothing to install
from urllib.request import urlopen

url = 'http://books.toscrape.com/index.html'

In [None]:
# open the page; the with block closes the connection automatically
with urlopen(url) as client:
    # read the raw HTML (as bytes)
    page_html = client.read()

BeautifulSoup is a tool for HTML parsing but we will need a web client to grab something from the internet. 

**html.parser** - BeautifulSoup(markup, "html.parser")

* Advantages: Decent speed, Lenient (as of Python 2.7.3 and 3.2.2)

* Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

**lxml** - BeautifulSoup(markup, "lxml")

* Advantages: Very fast, Lenient

* Disadvantages: External C dependency

**html5lib** - BeautifulSoup(markup, "html5lib")

* Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5

* Disadvantages: Very slow, External Python dependency

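Each book and its information are contained in an `<li>` tag.

The `findAll()` function (spelled `find_all()` in current BeautifulSoup releases) looks for every `<li>` tag whose class is `col-xs-6 col-sm-4 col-md-3 col-lg-3` and stores the matches in the variable `bookshelf`. If the element were identified by an ID rather than a class, you would filter on `id` instead.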

In [None]:
from bs4 import BeautifulSoup as BS

# call BeautifulSoup for parsing
page_soup = BS(page_html, "html.parser")
# grabs all the products under list tag
bookshelf = page_soup.findAll("li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})


`bookshelf` now holds every `<li>` element with that class, and each of those elements is one book. You can see in the picture above that each `<li>` tag with class `col-xs-6 col-sm-4 col-md-3 col-lg-3` represents one book.
Now that we have all our books, we need to decide which information to extract from each one: the title and the price.

The title of each book is stored in the `title` attribute of the `<a>` tag inside the `<h3>` tag.

The price is in a `<p>` tag with class `price_color`, so we use `findAll()` to locate it.
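
To see these selectors at work on a small input, here is a minimal sketch. The markup in `sample_html` is a hypothetical, simplified imitation of one book entry (the title and price are made up), not a copy of the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for a single book (not copied from the live site).
sample_html = """
<ol>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <h3><a href="#" title="An Example Book Title">An Example Book ...</a></h3>
    <p class="price_color">£10.00</p>
    <p class="instock availability">
      In stock
    </p>
  </li>
</ol>
"""

soup = BeautifulSoup(sample_html, "html.parser")
book = soup.find("li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

# The full title is read from the title attribute of the <a> inside <h3>;
# the visible link text may be shortened.
title = book.h3.a["title"]
price = book.find("p", {"class": "price_color"}).text.strip()
availability = book.find("p", {"class": "instock availability"}).text.strip()

print(title, price, availability)
```

`find()` returns the first match only, which is enough here because the sample contains a single book; `findAll()` returns every match as a list.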


In [None]:
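import csv

# create a CSV file of all products
filename = "Books.csv"

# newline="" lets the csv module control line endings; utf-8 keeps the £ sign intact
with open(filename, "w", newline="", encoding="utf-8") as f:
    # csv.writer quotes fields, so titles that contain commas stay in one column
    writer = csv.writer(f)
    writer.writerow(["Book title", "Price", "Availability"])

    for book in bookshelf:
        # the full title is stored in the title attribute of the <a> inside <h3>
        book_title = book.h3.a["title"]

        # the price is the text of the first <p class="price_color">
        price = book.findAll("p", {"class": "price_color"})[0].text.strip()

        # availability is the text of the first <p class="instock availability">
        book_availability = book.findAll("p", {"class": "instock availability"})[0].text.strip()

        print("Title of book: " + book_title)
        print("Price of book: " + price)
        print("Book availability: " + book_availability)

        writer.writerow([book_title, price, book_availability])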
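
Book titles can contain commas, which would split one title across several columns if rows were joined by hand with `","`. The sketch below shows how the standard `csv` module handles this; the rows are hypothetical, and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Hypothetical rows; the first title contains a comma on purpose.
rows = [
    ("Example Title, With a Comma", "£10.00", "In stock"),
    ("Another Example Title", "£12.50", "In stock"),
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["Book title", "Price", "Availability"])
writer.writerows(rows)

# Reading the text back yields exactly three fields per row.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed[1])
```

The writer wraps the first title in double quotes, so a spreadsheet or `csv.reader` treats it as a single field.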


In [None]:
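from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://books.toscrape.com/'
driver = webdriver.Chrome()
driver.get(url)
print(driver.title)


def get_books_info():
    # the <ol> element that holds every book on the current page
    # (the find_element_by_* helpers were removed in Selenium 4, so use By locators)
    container = driver.find_element(By.XPATH, '/html/body/div/div/div/div/section/div[2]/ol')

    print(type(container))

    # print the text of every link; image links have no text, so skip them
    titles = container.find_elements(By.TAG_NAME, 'a')
    for title in titles:
        if title.text:
            print(title.text)

    prices = container.find_elements(By.CLASS_NAME, 'price_color')
    for price in prices:
        print(price.text)

    # move on to the next page of results
    next_page = driver.find_element(By.LINK_TEXT, 'next')
    next_page.click()


# scrape the first five pages, and always close the browser afterwards
try:
    for x in range(5):
        get_books_info()
finally:
    driver.quit()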


In [None]:
#!pip install spacytextblob
#!pip install textblob
from textblob import TextBlob
from bs4 import BeautifulSoup
import requests
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
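def scrape_lyrics(url):
    """Download a lyrics page and return its lyrics as a single line of text."""
    page = requests.get(url)
    html = BeautifulSoup(page.text, "html.parser")
    # the lyrics sit inside <pre class="lyric-body">
    lyrics = html.find("pre", class_="lyric-body").get_text()
    print(url)
    return lyrics.replace("\n", " ")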


links = ['https://www.lyrics.com/lyric/36863481/Justin+Bieber/Yummy',
     'https://www.lyrics.com/lyric/35362456/Ed+Sheeran/Castle+on+the+Hill',
     'https://www.lyrics.com/lyric/35342586/Taylor+Swift/22',
     'https://www.lyrics.com/lyric/36147543/Kygo/Happy+Now',
     'https://www.lyrics.com/sublyric/58125/Lauv/Superhero',
     'https://www.lyrics.com/lyric/30514737/Fix+You',
     'https://www.lyrics.com/lyric/32981724/One+Direction/Perfect',
     'https://www.lyrics.com/lyric/36489666/Bahari/Crashing',
     'https://www.lyrics.com/lyric/33787626/ROZES/Matches',
     'https://www.lyrics.com/lyric/36341880/Maroon+5/She+Will+Be+Loved',
     'https://www.lyrics.com/lyric/25306933/Queen/Dont+Stop+Me+Now',
     'https://www.lyrics.com/lyric/31781320/Eric+Clapton/Tears+In+Heaven']


lyrics = [scrape_lyrics(link) for link in links]

artists = ['JustinBieber', 'EdSheeran', 'TaylorSwift', 'Kygo', 'Lauv', 'Coldplay', 'OneDirection','Bahari','Rozes','Maroon5', 'Queen', 'EricClapton']

# Fun fact: Queen's "Don't Stop Me Now" is apparently the happiest song, and Eric Clapton's "Tears in Heaven" is supposedly a sad one.
# https://www.indy100.com/article/dont-stop-me-now-is-the-happiest-song-in-the-world-according-to-a-neuroscientist-7318321
# But more than the lyrics affects the sentiment of a song, e.g. its tempo.

df = pd.DataFrame({'Lyrics':lyrics}, index=artists)

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity
df['polarity'] = df['Lyrics'].apply(pol)
df['subjectivity'] = df['Lyrics'].apply(sub)

plt.rcParams['figure.figsize'] = [10, 8]

for artist in df.index:
    x = df.polarity.loc[artist]
    y = df.subjectivity.loc[artist]
    plt.scatter(x, y, color='green')
    plt.text(x+.001, y+.001, artist, fontsize=10)

# the axis limits only need to be set once, not on every loop iteration
plt.xlim(-.7, .7)
plt.ylim(0, 1)

plt.title('Sentiment Analysis', fontsize=20)
plt.xlabel('<-- Negative ---------- Positive -->', fontsize=15)
plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)

plt.show()