<h1>Document Similarity using LSI</h1>

<h4>In this assignment we’re going to practice document similarity. Here’s
what you need to do:</h4>
<ol>
<li>From Wikipedia’s List of writers page (https://en.wikipedia.org/wiki/Lists_of_writers), pick five lists of
writers (e.g., List of detective fiction authors). You can pick any five
you like but make sure that the list has at least 30 writers listed
<li>Collect the urls of all the writers on those five pages and place them in a list
<li>Add summarization of the pages
<li>Grab the content of each writer in the list and place them in a list (of documents)
<li>Build an LSI model using this data. This is your <b>"reference" data set</b>
<li>Now grab another list of writers from wikipedia and create a new list of documents using the detail from each writers page. This is your <b>"writer" data set</b>
<li>For each writer in the new list, find the writer in the <b>reference data set</b> that is the least similar with at least 0.6 similarity.
<li>Print a table that contains each writer from the <b>writer data set</b> and the least similar writer with at least 0.6 similarity from the <b>reference data set</b>
<li>Perform sentiment analysis
</ol>
<h4>Use the code below to build your solution

<p><span style="color:blue">get_writers</span>: A function that, given a "list of writers" url, returns a list containing the names of the writers and the urls for their wikipedia pages
<p><span style="color:blue">non_writer_finder</span> tries its best to remove links that are not writer links from the page (not perfect, but good enough!)

In [33]:
def get_writers(url):
    from bs4 import BeautifulSoup
    import requests

    page_soup = BeautifulSoup(requests.get(url).content, "lxml")
    li_tags = page_soup.find_all("li")
    all_writers = list()
    for tag in li_tags:
        if tag.get("id"):
            continue

        try:
            tag.find("sup", class_="reference")
            link = tag.find("a").get("href")
            name = tag.find("a").get_text()
            if "/wiki/" in link and non_writer_finder(link):
                all_writers.append((name, "https://en.wikipedia.org" + link))
        except:
            pass
    return all_writers

def non_writer_finder(link):
    non_writer_words = ['Category','Template','Portal','List','File','Template','Special','Main','Help','User','https']
    for word in non_writer_words:
        if word in link:
            return False
    return True

<h4>testing the function</h4>
<li>Note that Wikipedia does not have a standard for its page design so this code may not work with every list

In [34]:
url = "https://en.wikipedia.org/wiki/List_of_detective_fiction_authors"
get_writers(url)

[('Mario Acevedo', 'https://en.wikipedia.org/wiki/Mario_Acevedo_(author)'),
 ('Douglas Adams', 'https://en.wikipedia.org/wiki/Douglas_Adams'),
 ('Humayun Ahmed', 'https://en.wikipedia.org/wiki/Humayun_Ahmed'),
 ('Margery Allingham', 'https://en.wikipedia.org/wiki/Margery_Allingham'),
 ('Rudolfo Anaya', 'https://en.wikipedia.org/wiki/Rudolfo_Anaya'),
 ('Gosho Aoyama', 'https://en.wikipedia.org/wiki/Gosho_Aoyama'),
 ('Frank Arnau', 'https://en.wikipedia.org/wiki/Frank_Arnau'),
 ('Taku Ashibe', 'https://en.wikipedia.org/wiki/Taku_Ashibe'),
 ('Ace Atkins', 'https://en.wikipedia.org/wiki/Ace_Atkins'),
 ('Kate Atkinson', 'https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)'),
 ('Yukito Ayatsuji', 'https://en.wikipedia.org/wiki/Yukito_Ayatsuji'),
 ('Sharadindu Bandyopadhyay',
  'https://en.wikipedia.org/wiki/Sharadindu_Bandyopadhyay'),
 ('Nevada Barr', 'https://en.wikipedia.org/wiki/Nevada_Barr'),
 ('Earle Basinsky', 'https://en.wikipedia.org/wiki/Earle_Basinsky'),
 ('M. C. Beaton', 'https:/

<h4>get_writer_text(url): returns the page text of the wikipedia page associated with a writer</h4>
<li>Since we're not sure if this will always work, we use a try ... except to catch exceptions
<li>If it doesn't work, the function returns None
<li>We will need to delete this (writer, url) pair from our writers list

In [35]:
def get_writer_text(url):
    from bs4 import BeautifulSoup
    import requests

    all_text = ""
    try:
        page_soup = BeautifulSoup(requests.get(url).content, "lxml")
        for p_tag in page_soup.find_all("p"):
            all_text += p_tag.get_text()
    except:
        return None
    return all_text

<h4>testing get_writer_text</h4>

In [36]:
url = "https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)"
get_writer_text(url)

'\nKate Atkinson MBE (born 20 December 1951) is an English writer of novels, plays and short stories.[1]  She has written historical novels, detective novels and family novels, incorporating postmodern and magical realist elements into the plots. Her debut, Behind the Scenes at the Museum, won the Whitbread Book Award, the precursor to the Costa Book Award, in 1995. The novels Life After Life and A God in Ruins won the Costa Book Award for novel in 2013 and 2015. She is also known for the Jackson Brodie series of detective novels, which has been adapted into the BBC One series, Case Histories.[2][3]\nThe daughter of a shopkeeper, Atkinson was born in York, the setting for several of her books.[4] She was an only child and often had to finds ways to amuse herself. She describes herself as an anxious child, something she believes had to do with being illegitimate. Her parents lived together but were not married, because her mother could not divorce her first husband. At the time, that wa

<p><span style="color:blue">get_all_writers</span>: A function that, given a list of genres, returns a list containing the names of the writers and the urls for their wikipedia pages associated with that list of genres
<p>The function should return a list of (name,url) pairs for all the writers in the list of genres
<p>You need to:
<ol>
<li>iterate through the list of genres
<li>initialize a list "all_writers"
<li>construct a url for the list of writers (I've done these first three steps for you)
<li>call get_writers for that url
<li>extend all_writers by what get_writers returns

In [37]:
def get_all_writers(genre_list):
    all_writers = list()
    # Your code here
    base_url = "https://en.wikipedia.org/wiki/List_of_"
    for genre in genre_list:
        url = base_url + genre
        all_writers.extend(get_writers(url))
    return all_writers

<h4>Example of how to use get_all_writers</h4>

In [38]:
genre_list = ["detective_fiction_authors", "romantic_novelists", "Western_fiction_authors", "Odia-language_authors", "List_of_Tamil-language_writers"] # can change it to your list of 5 genres
# genre_list = ["detective_fiction_authors", "romantic_novelists"] # can change it to your list of 5 genres

all_writers = get_all_writers(genre_list)
all_writers

[('Mario Acevedo', 'https://en.wikipedia.org/wiki/Mario_Acevedo_(author)'),
 ('Douglas Adams', 'https://en.wikipedia.org/wiki/Douglas_Adams'),
 ('Humayun Ahmed', 'https://en.wikipedia.org/wiki/Humayun_Ahmed'),
 ('Margery Allingham', 'https://en.wikipedia.org/wiki/Margery_Allingham'),
 ('Rudolfo Anaya', 'https://en.wikipedia.org/wiki/Rudolfo_Anaya'),
 ('Gosho Aoyama', 'https://en.wikipedia.org/wiki/Gosho_Aoyama'),
 ('Frank Arnau', 'https://en.wikipedia.org/wiki/Frank_Arnau'),
 ('Taku Ashibe', 'https://en.wikipedia.org/wiki/Taku_Ashibe'),
 ('Ace Atkins', 'https://en.wikipedia.org/wiki/Ace_Atkins'),
 ('Kate Atkinson', 'https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)'),
 ('Yukito Ayatsuji', 'https://en.wikipedia.org/wiki/Yukito_Ayatsuji'),
 ('Sharadindu Bandyopadhyay',
  'https://en.wikipedia.org/wiki/Sharadindu_Bandyopadhyay'),
 ('Nevada Barr', 'https://en.wikipedia.org/wiki/Nevada_Barr'),
 ('Earle Basinsky', 'https://en.wikipedia.org/wiki/Earle_Basinsky'),
 ('M. C. Beaton', 'https:/

<p><span style="color:blue">get_all_writer_docs</span>: A function that, given the list of (writer,url) pairs, returns two lists, a list of writers and a parallel (same size) list of documents. 

<p>You need to:

<ol>
<li>initialize the two lists

<li>iterate through the all_writers list
<li>extract the name and the url of the writer
<li>get the text using predefined function
<li>if the function returns None, ignore it and move to the next writer
<li>otherwise, append the name ot the writer_names list and the text to the writer_texts list
<li>return writer_names and writer_texts


In [39]:
def get_all_writer_docs(all_writers):
    writer_names = list()
    writer_texts = list()
    for writer in all_writers:
        #Your code here
        writer_name, writer_url = writer
        writer_text = get_writer_text(writer_url)
        if writer_text is not None:
            writer_names.append(writer_name)
            writer_texts.append(writer_text)
        else:
            continue
    return writer_names,writer_texts

<h4>Example of how to use get_all_writer_docs</h4>

In [40]:
# may take some minutes to run the code depend on the number of writers on your list
# print(get_all_writer_docs(all_writers))

reference_names, reference_docs = get_all_writer_docs(all_writers)
print(len(reference_names), len(reference_docs))

1275 1275


<h3>Repeat the process on summarized text</h3>

<h4>Text summarization</h4>
Create a function <span style="color:blue">summarize_text()</span> which summarizes text with the most relevant sentences.
<br>The function should summarize the given text by extracting the top N sentences containing the highest frequency of important words.
    
Parameters:
<li>text (str): The full text content of a writer's Wikipedia page.
<li>num_sentences (int): Number of sentences to include in the summary (default is 3).

In [41]:
import warnings

warnings.filterwarnings("ignore")
import nltk

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")  # Download wordnet
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from collections import OrderedDict
import pprint


# Ensure necessary resources are available
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hayoon\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Hayoon\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Hayoon\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hayoon\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Hayoon\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [42]:
def summarize_text(text, num_sentences=3):
    # Tokenize text into sentences
    sentences = sent_tokenize(text)
    # your code here
    summary_sentences = []
    candidate_sentences = {}
    candidate_sentence_counts = {}
    striptext = text.replace('\n\n', ' ')
    striptext = striptext.replace('\n', ' ')
    words = word_tokenize(striptext)

    # construct a list of words
    words = word_tokenize(striptext)
    lowercase_words = [word.lower() for word in words if word not in stopwords.words() and word.isalpha()]
    word_frequencies = FreqDist(lowercase_words)
    most_frequent_words = FreqDist(lowercase_words).most_common(20)
    for sentence in sentences:
        candidate_sentences[sentence] = sentence.lower()

    for long, short in candidate_sentences.items():
        count = 0
        for freq_word, frequency_score in most_frequent_words:
            if freq_word in short:
                count += frequency_score
                candidate_sentence_counts[long] = count
    sorted_sentences = OrderedDict(sorted(candidate_sentence_counts.items(), key = lambda x:x[1], reverse=True)[:num_sentences])
    summary = '\n'.join(sorted_sentences)

    return summary

<h4>Testing summarize_text</h4>

In [43]:
# Sample URL from your writer data in Assignment 6
sample_url = "https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)"

nltk.download("punkt_tab")
# Fetch the full text for the writer
writer_text = get_writer_text(sample_url)

# Apply the summarization function to the fetched text
if writer_text:
    summary = summarize_text(writer_text, num_sentences=25)
    print("Summary of the writer's page:")
    print(summary)
else:
    print("Failed to fetch writer text.")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Hayoon\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Summary of the writer's page:
[5]
In 2004, Case Histories, a novel centered around the private investigator Jackson Brodie, was published; he was Atkinson's first male protagonist.
The novels Life After Life and A God in Ruins won the Costa Book Award for novel in 2013 and 2015.
After a number of books about World War II, Atkinson wanted to write about a different theme.
Atkinson's next novel A God in Ruins (2015) follows the life of Ursula's brother Teddy Todd who is a pilot in the Royal Air Force during the war, but is more realistic than Life After Life.
She is also known for the Jackson Brodie series of detective novels, which has been adapted into the BBC One series, Case Histories.
Life After Life received the Costa Book Award for Novel in 2013, and was adapted for television in 2022 .
Atkinson is fascinated by the role of chance in life, and this is a recurring theme in her stories.
[11][12]
She followed up the Brodie-series with three novels set during World War II.
Atkinson's 

Now create a function that get the summaries for each writer

In [44]:
def get_all_writer_docs_summary(all_writers):
    """
    Fetches, summarizes, and returns texts for each writer in the list.

    Parameters:
    - all_writers (list): A list of tuples containing writer names and URLs.

    Returns:
    - writer_names (list): Names of the writers.
    - writer_summaries (list): Summarized text content for each writer.
    """
    writer_names = []
    writer_summaries = []

    for writer in all_writers:
        #your code here
        writer_name, url = writer
        writer_names.append(writer_name)
        text = get_writer_text(url)
        summary = summarize_text(text)
        writer_summaries.append(summary)
    return writer_names, writer_summaries


test get all writers function with summarized texts

In [45]:
# May take a couple of minutes depend on the length of your writer list.
# print(get_all_writer_docs_summary(all_writers))

In [46]:
reference_names_2, reference_docs_2 = get_all_writer_docs_summary(all_writers)
print(len(reference_names_2), len(reference_docs_2))

1275 1275


<h3>Set up the LSI model</h3>
<li>reference_docs is the list of documents
<li>construct texts, dictionary, and corpus (see class iPython notebook)
<li>construct an LSI model. Use 5 topics initially but you should play around with this number

In [1]:
# Code for LSI model for texts goes here
from gensim.similarities.docsim import Similarity
from gensim import corpora, models, similarities
from gensim.parsing.preprocessing import STOPWORDS

texts = [[word for word in document.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for document in reference_docs]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lsi_text = models.LsiModel(corpus, id2word=dictionary, num_topics=5)

NameError: name 'reference_docs' is not defined

In [None]:
# Code for LSI model for summaries goes here

texts_summary = [[word for word in document.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for document in reference_docs_2]
dictionary_summary = corpora.Dictionary(texts_summary)
corpus_summary = [dictionary.doc2bow(text) for text in texts_summary]

lsi_summary = models.LsiModel(corpus_summary, id2word=dictionary_summary, num_topics=5)

<h3>Construct the writer data set with texts and summaries</h3>
<h4>Example</h4>

In [None]:
# for whole text
writer_genre_list = ["Western_fiction_authors"]
all_writers = get_all_writers(writer_genre_list)
writer_names, writer_docs = get_all_writer_docs(all_writers)
writer_names, writer_docs

(['Film',
  'Television',
  'Literature',
  'Visual arts',
  'Dime novels',
  'Comics',
  'Wild West shows',
  'Acid Western',
  'Australian Western',
  'Contemporary Western',
  'Dacoit Western',
  'Epic Western',
  'Fantasy Western',
  'Florida Western',
  'Gothic Western',
  'Horror Western',
  'Northern',
  'Ostern',
  'Revisionist Western',
  'Science fiction Western',
  'Singing cowboy',
  'Space Western',
  'Spaghetti Western',
  'Weird Western',
  'Western romance',
  'Zapata Western',
  'Golden Boot Awards',
  'Old West',
  'Cowboy culture',
  'Cowboy',
  'Gunfighter',
  'Outlaw',
  'Quick draw',
  'Saloon',
  'Manifest destiny',
  'Edward Abbey',
  'Andy Adams',
  'William Lacey Amy',
  'Rudolfo Anaya',
  'Todhunter Ballard',
  'S. Omar Barker',
  'Rex Beach',
  'James Warner Bellah',
  'Don Bendell',
  'Tom W. Blackburn',
  'James Carlos Blake',
  'William Blinn',
  'Stephen Bly',
  'Frank Bonham',
  'Allan R. Bosworth',
  'Peter Bowen',
  'B.M. Bower',
  'Leigh Brackett',
 

In [None]:
# for summaries
writer_genre_list = ["Western_fiction_authors"]
all_writers = get_all_writers(writer_genre_list)
writer_names_2, writer_docs_summaries = get_all_writer_docs_summary(all_writers)
writer_names_2, writer_docs_summaries

(['Film',
  'Television',
  'Literature',
  'Visual arts',
  'Dime novels',
  'Comics',
  'Wild West shows',
  'Acid Western',
  'Australian Western',
  'Contemporary Western',
  'Dacoit Western',
  'Epic Western',
  'Fantasy Western',
  'Florida Western',
  'Gothic Western',
  'Horror Western',
  'Northern',
  'Ostern',
  'Revisionist Western',
  'Science fiction Western',
  'Singing cowboy',
  'Space Western',
  'Spaghetti Western',
  'Weird Western',
  'Western romance',
  'Zapata Western',
  'Golden Boot Awards',
  'Old West',
  'Cowboy culture',
  'Cowboy',
  'Gunfighter',
  'Outlaw',
  'Quick draw',
  'Saloon',
  'Manifest destiny',
  'Edward Abbey',
  'Andy Adams',
  'William Lacey Amy',
  'Rudolfo Anaya',
  'Todhunter Ballard',
  'S. Omar Barker',
  'Rex Beach',
  'James Warner Bellah',
  'Don Bendell',
  'Tom W. Blackburn',
  'James Carlos Blake',
  'William Blinn',
  'Stephen Bly',
  'Frank Bonham',
  'Allan R. Bosworth',
  'Peter Bowen',
  'B.M. Bower',
  'Leigh Brackett',
 

<h4>find the least similar writers with at least 0.6 similarity for each new writer from our reference data set (both for whole and summarized texts)</h4>
<li>Write code to print table_data after the for loop ends

In [None]:
warnings.filterwarnings("ignore")
table_data = list()
for i, doc in enumerate(writer_docs):
    #Your similarity code here. Use the in-class notebook as a reference
    vec_bow_text = dictionary.doc2bow(doc.lower().split())
    vec_lsi_text = lsi_text[vec_bow_text]
    index = similarities.MatrixSimilarity(lsi_text[corpus])
    sims_text = index[vec_lsi_text]
    sims_text = sorted(enumerate(sims_text), key=lambda item: -item[1])

# Write code to print table_data after the for loop ends
    for sim in sims_text:
        if sim[1] > 0.6:
            table_data.append([writer_names[i], reference_names[sim[0]]])
            break

In [None]:
# Summaries: 
warnings.filterwarnings("ignore")
table_data_summary = list()
for i, doc in enumerate(writer_docs_summaries):
    # Your similarity code here. Use summary data for this part.
    vec_bow_summary = dictionary_summary.doc2bow(doc.lower().split())
    vec_lsi_summary = lsi_summary[vec_bow_summary]
    index = similarities.MatrixSimilarity(lsi_summary[corpus_summary])
    sims_summary = index[vec_lsi_summary]
    sims_summary = sorted(enumerate(sims_summary), key=lambda item: -item[1])
# Write code to print table_data after the for loop ends
    for sim in sims_summary:
        if sim[1] > 0.6:
            table_data_summary.append([writer_names_2[i], reference_names_2[sim[0]]])
            break

In [None]:
# print table for the texts

print(table_data)

[['Film', 'Ann Roth'], ['Television', 'Jimmy McGovern'], ['Literature', 'Martin Cruz Smith'], ['Visual arts', 'Patricia Robertson'], ['Dime novels', 'Karen Marie Moning'], ['Comics', 'Carol Higgins Clark'], ['Wild West shows', 'Kelly Jamison'], ['Acid Western', 'Heather Graham'], ['Australian Western', 'Ann Roth'], ['Contemporary Western', 'Jill Munroe'], ['Dacoit Western', 'Nancy Kelly'], ['Epic Western', 'Nancy Kelly'], ['Fantasy Western', 'Nancy Kelly'], ['Florida Western', 'Premendra Mitra'], ['Gothic Western', 'Daphne de Jong'], ['Horror Western', 'Amanda Kramer'], ['Northern', 'Jacqueline Frank'], ['Ostern', 'Heather Graham'], ['Revisionist Western', 'Nancy Kelly'], ['Science fiction Western', 'Émile Gaboriau'], ['Singing cowboy', 'Jill Munroe'], ['Space Western', 'Hemendra Kumar Roy'], ['Spaghetti Western', 'Ann Roth'], ['Weird Western', 'Leigh Brackett'], ['Western romance', 'Ginn Hale'], ['Zapata Western', 'Ann Roth'], ['Golden Boot Awards', 'Jill Munroe'], ['Old West', "Marga

In [None]:
# print table for the summaries

print(table_data_summary)

[['Film', 'Heather Graham'], ['Television', 'Mary McBride'], ['Literature', 'Jimmy McGovern'], ['Visual arts', 'Amanda Kramer'], ['Dime novels', 'Elizabeth Lowell'], ['Comics', 'Wendy Warren'], ['Wild West shows', 'Patricia Robertson'], ['Acid Western', 'Heather Graham'], ['Australian Western', 'Kazuhiro Kiuchi'], ['Contemporary Western', 'Hemendra Kumar Roy'], ['Dacoit Western', 'Stephanie James'], ['Epic Western', 'Heather Graham'], ['Fantasy Western', 'Heather Graham'], ['Florida Western', 'Kōtarō Isaka'], ['Gothic Western', 'Jessica Hart'], ['Horror Western', 'Jill Munroe'], ['Northern', 'Chloe Lang'], ['Ostern', 'Heather Graham'], ['Revisionist Western', 'Dorothy Phillips'], ['Science fiction Western', 'Émile Gaboriau'], ['Singing cowboy', 'Nancy Kelly'], ['Space Western', 'Arimasa Osawa'], ['Spaghetti Western', 'Nancy Kelly'], ['Weird Western', 'Arimasa Osawa'], ['Western romance', 'James Lee Burke'], ['Zapata Western', 'Nancy Kelly'], ['Golden Boot Awards', 'Hemendra Kumar Roy']

In [None]:
# the two tables should be identycal:
n_differents = 0
for i in range(len(table_data)):
    if (
        table_data_summary[i][0] != table_data[i][0]
        and table_data_summary[i][1] != table_data[i][1]
    ):
        n_differents += 1
n_differents

0

# Some simple sentiment analysis

In this part we are gonna run some simple sentiment analysis to find out which writer has the most positive description.

Define a function simple_sentiment_analysis(writer_names, writer_docs) that takes as inputs the list of writers and their corresponding descriptions.
The expected output is a list, each element of this list should be a list with the writer name, the percentage of positive words in their description and the percentage of negative words in their description.

In [None]:
# Example output
"""
[('William Blinn', 0.81, 0.54),
 ('Stephen Bly', 0.75, 0.94),
 ('Frank Bonham', 3.73, 0.62)
 ...]
"""

"\n[('William Blinn', 0.81, 0.54),\n ('Stephen Bly', 0.75, 0.94),\n ('Frank Bonham', 3.73, 0.62)\n ...]\n"

To ensure results can be compared please use the following function to define your list of positive and negative words:

In [None]:
def get_pos_neg_words():
    def get_words(url):
        import requests

        words = requests.get(url).content.decode("latin-1")
        word_list = words.split("\n")
        index = 0
        while index < len(word_list):
            word = word_list[index]
            if ";" in word or not word:
                word_list.pop(index)
            else:
                index += 1
        return word_list

    # Get lists of positive and negative words
    p_url = "http://ptrckprry.com/course/ssd/data/positive-words.txt"
    n_url = "http://ptrckprry.com/course/ssd/data/negative-words.txt"
    positive_words = get_words(p_url)
    negative_words = get_words(n_url)
    return positive_words, negative_words

In [None]:
def simple_sentiment_analysis(writer_names, writer_docs):
    ### YOUR CODE HERE ###
    positive_words, negative_words = get_pos_neg_words()
    from nltk import word_tokenize
    results = list()
    for i in range(len(writer_docs)):
        cpos = cneg = lpos = lneg = 0
        for word in word_tokenize(writer_docs[i]):
            if word in positive_words:
                cpos+=1
            if word in negative_words:
                cneg+=1
        results.append((writer_names[i],cpos/len(word_tokenize(writer_docs[i])),cneg/len(word_tokenize(writer_docs[i]))))
    return results

In [None]:
simple_sentiment_analysis(writer_names, writer_docs)