### Capstone: Predicting Quotes

#### The goal of this project is to use both supervised and unsupervised machine learning models to explore a collection of quotes by famous people of various backgrounds and professions and address two problem statements.

- **Problem Statement 1:  Given a collection of quotes, can the author of a quote be identified from its content and style?**

- **Problem Statement 2:  Given a collection of quotes, can a set of common topics be identified?**

#### Dataset
The source for the quote dataset will be the Successories web collection of most popular quotes:
https://www.successories.com/iquote/authors/most

Quotes will be scraped for all authors from A to Z.  Attributes of interest will be:

- author
- quote
- ~~categories~~ _Dropped: the site's categories are somewhat random and not helpful._

#### Scraping Plan

The author quote site is fairly deeply nested, but the data only needs to be scraped once because the content is relatively static. I will find the link for each letter A-Z, then the links for each letter's subcategories, and then scrape the quotes for each author in each subcategory. An individual author's quotes may span several pages.  Note that several authors in the list have only a small number of quotes, so a lower bound will be set for Problem Statement 1: an author must have at least #? quotes to be included.
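
The lower bound described above can be sketched as a simple count-and-filter step. This is a minimal illustration, not the notebook's code: the function name `filter_by_min_quotes`, the toy records, and the threshold of 2 are all assumptions, since the actual cutoff is still undecided.

```python
from collections import Counter

def filter_by_min_quotes(records, min_quotes):
    """Keep only records whose author has at least `min_quotes` quotes."""
    counts = Counter(rec['author'] for rec in records)
    return [rec for rec in records if counts[rec['author']] >= min_quotes]

# Toy records in the same {'author', 'quote'} shape the scraper produces.
# Authors and quotes here are placeholders, not scraped data.
records = [
    {'author': 'Author A', 'quote': 'first'},
    {'author': 'Author A', 'quote': 'second'},
    {'author': 'Author B', 'quote': 'only one'},
]

# min_quotes=2 is an arbitrary illustrative threshold.
kept = filter_by_min_quotes(records, min_quotes=2)
print(sorted({rec['author'] for rec in kept}))
```

With the toy data, only the author with two quotes survives the filter.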


#### Modeling

Problem Statement 1 is a supervised learning problem: multiclass classification in which each author is a class.  Models used will most likely be the MultinomialNB (Naive Bayes) classifier, RandomForestClassifier and LogisticRegression.
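
To make the "each author is a class" framing concrete, here is a dependency-free sketch of the idea behind a multinomial Naive Bayes classifier. It is not the scikit-learn `MultinomialNB` model the project plans to use, and the whitespace tokenizer, the function names, and the toy quotes and author labels are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Deliberately crude: lowercase and split on whitespace.
    return text.lower().split()

def train_nb(quotes, authors):
    """Collect per-author word counts, author frequencies and the vocabulary."""
    word_counts = defaultdict(Counter)
    author_counts = Counter(authors)
    vocab = set()
    for quote, author in zip(quotes, authors):
        tokens = tokenize(quote)
        word_counts[author].update(tokens)
        vocab.update(tokens)
    return word_counts, author_counts, vocab

def predict_nb(model, quote, alpha=1.0):
    """Return the author with the highest log posterior (Laplace smoothing)."""
    word_counts, author_counts, vocab = model
    n_docs = sum(author_counts.values())
    best_author, best_score = None, -math.inf
    for author, n in author_counts.items():
        total = sum(word_counts[author].values())
        score = math.log(n / n_docs)  # log prior for this author
        for tok in tokenize(quote):
            if tok not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[author][tok] + alpha)
                              / (total + alpha * len(vocab)))
        if score > best_score:
            best_author, best_score = author, score
    return best_author

# Invented toy corpus: two pretend authors with two quotes each.
quotes = [
    "the ball flew over the fence",
    "hit the ball hard",
    "buildings shape the city",
    "design the building with light",
]
authors = ["Player", "Player", "Architect", "Architect"]

model = train_nb(quotes, authors)
print(predict_nb(model, "the ball"))
print(predict_nb(model, "design with light"))
```

Real quotes would need better tokenization and a held-out test set. The sketch only shows how word frequencies per author turn into a class prediction.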

Problem Statement 2 is an unsupervised learning problem that involves topic modeling. An LDA (Latent Dirichlet Allocation) model will be used to explore it.


In [5]:
import requests
import pandas as pd
import numpy as np
import time
from bs4 import BeautifulSoup
import glob
import os

### Gathering Data: Web Scraping

The site's HTML is poorly structured.  The content div holds several groups of links for different categories, and the 'a' tags carry no classes, so there is no easy way to distinguish or extract the href URLs of interest. Extraction therefore relies on the link text being a single uppercase letter.

``` html
<a href="https://www.successories.com/iquote/authors/ka/kb">K</a>
```
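
The extraction rule (keep a link only when its text is a single uppercase letter) can be tried in isolation. The sketch below uses the standard library's `html.parser` rather than the BeautifulSoup calls used in the scraper, so it runs without extra packages. The class name `LetterLinkParser` and the second link in the sample markup are made up for illustration.

```python
from html.parser import HTMLParser
import string

class LetterLinkParser(HTMLParser):
    """Collect href values of <a> tags whose text is one uppercase letter."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            text = ''.join(self._text).strip()
            if len(text) == 1 and text in string.ascii_uppercase:
                self.links[text] = self._href
            self._href = None

# First link is the example from above; the second is a made-up
# non-letter link that should be ignored.
html = '''
<div class="quotedb_content">
  <a href="https://www.successories.com/iquote/authors/ka/kb">K</a>
  <a href="https://example.com/other">Other category</a>
</div>
'''

parser = LetterLinkParser()
parser.feed(html)
print(parser.links)
```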


In [9]:
class QuoteWebScraper(object):
    
    def __init__(self):
        pass
        
    def get_author_quotes(self, letters):
        count = 0
        benchmark = 50
        self.auth_urls_ = []
        self.author_quotes_ = []
        self.subindexes_ = []
        alphabet_index = self.get_index_urls()
        for letter in letters:
            if letter.upper() in alphabet_index:
                url = alphabet_index[letter.upper()]
                self.subindexes_.extend(self.get_subindex_urls(url))
                
        for ilnk_ in self.subindexes_:
            self.auth_urls_.extend(self.get_urls_for_authors(ilnk_))
        print('Begin quote extraction...')
        for auth_url in self.auth_urls_:
            count = count + 1
            self.author_quotes_.extend(self.get_all_author_quotes(auth_url))
            if (count % benchmark) == 0:
                print(count, "authors processed... ")
                
        print("Total authors processes=", count)
        return self.author_quotes_
        
    def get_alphabet(self, case='upper'):
        self.alphabet = []
        if case.lower() == 'upper':
            range_ = range(65, 91)    # code points for 'A'..'Z'
        elif case.lower() == 'lower':
            range_ = range(97, 123)   # code points for 'a'..'z'
        else:
            raise ValueError("case must be 'upper' or 'lower'")

        for letter in range_:
            self.alphabet.append(chr(letter))
        return self.alphabet
    
    def get_index_urls(self):
        index_urls = {}
        alphabet = self.get_alphabet() 
        url = 'https://www.successories.com/iquote/authors/most'
        soup = self.create_beautifulSoup(url)
        content_element = soup.find('div', {'class': 'quotedb_content'})
        links = content_element.find_all('a')
        for link in links:
            if link.text in alphabet:
                index_urls[link.text] = link['href']
        return index_urls
    
    def get_subindex_urls(self, index_url):
        subindex_urls = []
        soup = self.create_beautifulSoup(index_url)
        content_element = soup.find('div', {'class': 'quotedb_content'})
        subindex = content_element.find('p').find_all('a')
        # TODO list comprehension
        for i in subindex:
            link = i['href'].lower()
            subindex_urls.append(link)
        return subindex_urls
    
    def get_urls_for_authors(self, subindex_url):
        self.author_urls = []
        a_tags = []
        soup = self.create_beautifulSoup(subindex_url)
        results_div = soup.find('div', {'class': 'quotedb_navresults'})
        author_divs = results_div.find_all('div', {'class': 'quotedb_navlist'})
        for div in author_divs:
            a_tags.extend(div.find_all('a'))
        self.author_urls = [a['href'] for a in a_tags]
        return self.author_urls
    
    def get_all_author_quotes(self, auth_url):
        all_author_quotes = []
        # Follow 'next' pager links until the last page is reached.
        while auth_url:
            soup = self.create_beautifulSoup(auth_url)
            all_author_quotes.extend(self.__get_page_quotes(soup))
            pages = soup.find('ul', {'class': 'pager'})
            if pages is not None:
                next_ = pages.find('a', attrs={'class':'pager-link', 'rel':'next'})
                if next_ is not None:
                    auth_url = next_['href']
                else:
                    auth_url = None
            else:
                auth_url = None
        return all_author_quotes
    
    def __get_page_quotes(self, soup):
        self.quotes = []
        author_name = soup.find('div', {'class': 'quotedb_quotelist'}).find('h1').find('a').text.replace(' Quotes', '')
        quote_divs = soup.find_all('div', {'class': 'quote'})
        for div in quote_divs:
            self.quotes.append({'author':author_name.strip(), 'quote':div.find('a').text.replace('"', '').strip()})
        return self.quotes
  
    def create_beautifulSoup(self, url):
        html = self.scrape_url(url)
        return BeautifulSoup(html, 'lxml')

    def scrape_url(self, url, req_delay_sec=.500):
        '''Method will GET the submitted url and return its content. 
        The request is delayed by .5 seconds by default.  
        To turn off any delay submit a value of 0. 
        Returns an empty string if the url is empty or the request fails.'''
        if url:
            if req_delay_sec > 0:
                time.sleep(req_delay_sec)
            resp = requests.get(url, headers = 
                                {'User-agent': 'pyeduquotereader:v0.0 (by /u/jmkds)',
                                 'Cache-Control': 'no-cache'})
            if resp.status_code == 200:
                return resp.content
            else:
                print('Unable to get data due to failed request.  Status code: ', resp.status_code)
                print('Error details: ', resp.text)
        # Empty url or failed request: return empty markup rather than None,
        # which BeautifulSoup cannot parse.
        return ''

q_scraper = QuoteWebScraper()
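
The page-following loop in `get_all_author_quotes` can be exercised without touching the network by injecting the page fetcher. The sketch below is an illustration under assumptions, not part of the scraper: `collect_pages`, the `fetch_page` callback contract, and the `fake_site` dictionary are hypothetical, and the `seen` set is an extra guard against circular "next" links that the scraper itself does not have.

```python
def collect_pages(start_url, fetch_page):
    """Follow 'next' links from start_url, gathering items from every page.

    fetch_page(url) must return (items, next_url), with next_url set to
    None on the last page.
    """
    items, url, seen = [], start_url, set()
    while url and url not in seen:   # seen-set guards against link cycles
        seen.add(url)
        page_items, url = fetch_page(url)
        items.extend(page_items)
    return items

# Hypothetical three-page author, stored as url -> (quotes, next url).
fake_site = {
    'page1': (['q1', 'q2'], 'page2'),
    'page2': (['q3'], 'page3'),
    'page3': (['q4'], None),
}

all_quotes = collect_pages('page1', fake_site.__getitem__)
print(all_quotes)
```

Passing the fetcher in as an argument keeps the loop logic testable, while the real scraper would supply a function that downloads and parses each page.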

#### Run scraper for each letter

In [11]:
# a, b, c, d, e, f, g, ...
letters = 'iq'
quotes_ = q_scraper.get_author_quotes(list(letters))

# !! Write data to file
write_date_to_file(quotes_, letters + '_quotes.csv')
#pd.DataFrame(quotes_)

Begin quote extraction...
50 authors processed... 
Total authors processes= 68


In [10]:
def write_date_to_file(data, filename, directory='../quote_data/'):
    '''Write data to a csv file.  Note: the directory is created if it does not exist.'''
    df_data = pd.DataFrame(data)
    if not os.path.exists(directory):
        os.makedirs(directory)
        print(directory, ' created!')
    df_data.to_csv(os.path.join(directory, filename), encoding='utf-8', index=False)

In [8]:
def create_dataset_from_csv(directory_):
    path = r'../' + directory_
    print(path + '/*.csv')
    allFiles = glob.glob(path + '/*.csv')
    list_ = []
    for file_ in allFiles:
        print(file_)
        df = pd.read_csv(file_,index_col=None, header=0)
        list_.append(df)
    # Concatenate once, after every file has been read.
    df_all = pd.concat(list_)
    return df_all

df_all = create_dataset_from_csv("quote_data")

../quote_data/*.csv
../quote_data/a_quotes.csv
../quote_data/b_quotes.csv
../quote_data/c_quotes.csv
../quote_data/d_quotes.csv
../quote_data/e_quotes.csv
../quote_data/f_quotes.csv
../quote_data/g_quotes.csv
../quote_data/h_quotes.csv
../quote_data/iq_quotes.csv
../quote_data/j_quotes.csv
../quote_data/k_quotes.csv
../quote_data/l_quotes.csv
../quote_data/m_quotes.csv
../quote_data/n_quotes.csv
../quote_data/o_quotes.csv
../quote_data/p_quotes.csv
../quote_data/r_quotes.csv
../quote_data/s_quotes.csv
../quote_data/t_quotes.csv
../quote_data/uv_quotes.csv
../quote_data/w_quotes.csv
../quote_data/xyz_quotes.csv


In [203]:
df_all.head(10)

Unnamed: 0,author,quote
0,Alvar Aalto,Modern architecture does not mean the use of i...
1,Alvar Aalto,Building art is a synthesis of life in materia...
2,Alvar Aalto,We should concentrate our work not only to a s...
3,Hank Aaron,I'm here to support the commissioner and tough...
4,Hank Aaron,That's going to be left up to the commissioner...
5,Hank Aaron,That's going to be left up to the commissioner...
6,Hank Aaron,I have always felt that although someone may d...
7,Hank Aaron,I think it's very much a distraction to the ba...
8,Hank Aaron,Discover Greatness: An Illustrated History of ...
9,Hank Aaron,"It took me seventeen years to get 3,000 hits i..."
