# Capstone Project: The Persuasive Power of Words

*by Nee Bimin*

## Notebook 4: Model Testing

In this notebook, we will webscrape from https://highspark.co/famous-persuasive-speeches/ for 40 famous persuasive speeches to be tested using the models developed.

## Content

- [Pre-processing](#Preprocessing)
    * [Tokenizing and Lemmatizing](#Tokenizing-and-Lemmatizing)
- [Train/Test Split](#Train/Test-Split)
- [Grid Search CV](#Grid-Search-CV)
    * [Baseline Accuracy](#Baseline-Accuracy)
    * [Count Vectorizer](#Count-Vectorizer)
    * [Tfidf Vectorizer](#Tfidf-Vectorizer)
- [Optimising Tfidf Multinomial Naive Bayes](#Optimising-Tfidf-Multinomial-Naive_Bayes)
- [Optimising Tfidf Logistic Regression](#Optimising-Tfidf-Logistic-Regression)
- [Conclusion-and-Recommendations](#Conclusion-and-Recommendations)

In [3]:
# Import relevant packages
import bs4 as bs # Beautiful Soup to parse html
import soupsieve as sv # Soup Sieve to parse using CSS selector
import pandas as pd # Pandas to make use of dataframe
import codecs # To read in HTML file
import re # To make use of regex to process dirty html

## Parsing Data
The html has been downloaded from the website given above. Beautiful Soup will be used to parse the html.

In [5]:
# Read in Raw.html and parse into bs4 format
raw = codecs.open("../data/Raw.html", "r", "utf-8").read()
print(raw)
soup = bs.BeautifulSoup(raw,'lxml')

<!DOCTYPE html>
<html lang="en-US">

<head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <link rel="pingback" href="https://highspark.co/xmlrpc.php" />

    <script type="text/javascript">
        document.documentElement.className = 'js';
    </script>

    <script>
        var et_site_url = 'https://highspark.co';
        var et_post_id = '11744';

        function et_core_page_resource_fallback(a, b) {
            "undefined" === typeof b && (b = a.sheet.cssRules && 0 === a.sheet.cssRules.length);
            b && (a.onerror = null, a.onload = null, a.href ? a.href = et_site_url + "/?et_core_page_resource=" + a.id +
                et_post_id : a.src && (a.src = et_site_url + "/?et_core_page_resource=" + a.id + et_post_id))
        }
    </script>
    <title>40 Most Famous Speeches In History | HighSpark</title>
    <style id="et-divi-userfonts">
        @font-face {
            font-family: "Brandon Text Bold";

In [6]:
# To get main div where all the quotes are
body = soup.select(".et_pb_module.et_pb_post_content.et_pb_post_content_0_tb_body")
# To find all the titles of the quotes
h2 = body[0].find_all("h2")
# To find all the quotes
blockquotes = body[0].find_all("blockquote")

In [7]:
# Function to remove non-text characters from the quote
def get_text_from_blockquote(blockquote):
    Quote = ""
    for p in blockquote.find_all("p"):
        Quote += p.get_text().strip().replace("\r\n"," ")
    return re.sub("\s{2,}", " ", Quote)

In [8]:
articles = pd.DataFrame()
# Putting the title and articles into a dataframe
articles['title'] = h2
articles['article'] = blockquotes

# Clean up titles and articles
articles['title'] = articles['title'].apply(lambda x: re.sub("\d+.","",x.get_text())) # Use regex to remove index that comes with the title
articles['article'] = articles['article'].apply(get_text_from_blockquote)

In [9]:
articles.head()

Unnamed: 0,title,article
0,I have a dream by MLK,"“I have a dream that one day down in Alabama, ..."
1,Tilbury Speech by Queen Elizabeth I,"“My loving people,We have been persuaded by so..."
2,"Woodrow Wilson, address to Congress (April",“The world must be made safe for democracy. It...
3,Ain’t I A Woman by Sojourner Truth,“That man over there says that women need to b...
4,The Gettsyburg Address by Abraham Lincoln,"“Fondly do we hope, fervently do we pray, that..."


## Preprocessing

In [None]:
def lemmastop(word):
    # Instantiate Lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    #remove if words from stoplist or words with http or '/' in it
    if word in stopwords.words('english') or 'http' in word or '/' in word:
        word = ''
        
    # Lemmatize word then remove any non word characters not catched in previous steps
    p_word = re.sub('\W+', '',lemmatizer.lemmatize(word))
    
    # returns processed words
    return p_word

def clean_data(raw_string):
    # The input is raw unprocessed text), and 
    # the output is preprocessed text)
    # Instantiate Tokenizer. 
    tokenizer = RegexpTokenizer(r'\w+\'?\w+(?=\W)') # Regex matches words and words with apostrophe in between
    
    # Tokenize raw string
    tokens = tokenizer.tokenize(raw_string.lower())  
    
    # call function to remove stop list words and lemmatize words
    processed_tokens = map(lemmastop, tokens)
    
    # Joins only tokens with words and returns processed string
    return ' '.join(token for token in processed_tokens if token != '')

ted_model['transcript'] = ted_model['transcript'].apply(clean_data)