# ADS 509 Sentiment Assignment

This notebook holds the Sentiment Assignment for Module 6 in ADS 509, Applied Text Mining. Work through this notebook, writing code and answering questions where required.

In a previous assignment you put together Twitter data and lyrics data on two artists. In this assignment we apply sentiment analysis to those data sets. If, for some reason, you did not complete that previous assignment, data to use for this assignment can be found in the assignment materials section of Blackboard.


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it.

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link.

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell.

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. _Make sure to answer every question marked with a `Q:` for full credit._


In [1]:
import os
import re
import emoji
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
from string import punctuation

from nltk.corpus import stopwords

sw = stopwords.words("english")

In [2]:
# Add any additional import statements you need here
from pathlib import Path

In [3]:
# change `data_location` to the location of the folder on your machine.
data_location = Path("../datasets")

# These subfolders should still work if you correctly stored the
# data from the Module 1 assignment
twitter_folder = "twitter/"
lyrics_folder = "lyrics/"

positive_words_file = "positive-words.txt"
negative_words_file = "negative-words.txt"
tidy_text_file = "tidytext_sentiments.txt"

## Data Input

Now read in each of the corpora. For the lyrics data, it may be convenient to store the entire contents of the file to make it easier to inspect the titles individually, as you'll do in the last part of the assignment. In the solution, I stored the lyrics data in a dictionary with two dimensions of keys: artist and song. The value was the file contents. A Pandas data frame would work equally well.

For the Twitter data, we only need the description field for this assignment. Feel free to read all the descriptions into a data structure of your choice. In the solution, I stored the descriptions as a dictionary of lists, with the key being the artist.
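
As a rough illustration of the two layouts described above, the sketch below builds a two-level dictionary for lyrics and a dictionary of lists for descriptions. The sample records, the artist label `artist_b`, and the variable names are placeholders invented for this example; in the notebook the values would come from the files on disk.

```python
from collections import defaultdict

# Hypothetical sample records standing in for the files read from disk.
lyric_records = [
    ("robyn", "Song One", "first placeholder lyric text"),
    ("robyn", "Song Two", "second placeholder lyric text"),
    ("artist_b", "Song Three", "third placeholder lyric text"),
]
description_records = [
    ("robyn", "music lover"),
    ("robyn", "coffee and concerts"),
    ("artist_b", "proud parent"),
]

# Two levels of keys: lyrics_by_artist[artist][song_title] -> file contents.
lyrics_by_artist = defaultdict(dict)
for artist, title, text in lyric_records:
    lyrics_by_artist[artist][title] = text

# One list of description strings per artist.
descriptions_by_artist = defaultdict(list)
for artist, description in description_records:
    descriptions_by_artist[artist].append(description)

print(sorted(lyrics_by_artist["robyn"]))
print(len(descriptions_by_artist["robyn"]))
```

Keying on artist first keeps per-artist comparisons simple, and the song title key makes it easy to look up a single song when inspecting the extremes later.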


In [4]:
# Read in the lyrics data

# create path to lyrics
lyrics_path = os.path.join(data_location, lyrics_folder)

# get list of artists in lyrics folder
artists = os.listdir(lyrics_path)

# initialize rows list for dataframe
rows = []

for artist in artists:
    # create path to song lyrics
    song_lyrics_path = os.path.join(lyrics_path, artist)

    # iterate through all song files in the directory
    for song_file in os.listdir(song_lyrics_path):
        # create path to song file
        file_path = os.path.join(song_lyrics_path, song_file)

        # read every line of the song file
        with open(file_path, "r", encoding="utf-8") as f:
            lines = f.readlines()

        # skip empty files, which have no title line to parse
        if not lines:
            continue

        # create regex to capture title between double quotes
        match = re.match(r'^"(.*)"$', lines[0].strip())
        if match:
            song_title = match.group(1)
        else:
            # fallback to first line as title
            song_title = lines[0].strip()

        # save rest of lines to lyrics
        lyrics = "".join(lines[1:]).strip()

        rows.append({"artist": artist, "song_title": song_title, "lyrics": lyrics})

# create dataframe from rows var
df_lyrics = pd.DataFrame(rows)

In [5]:
# Read in the twitter data
twitter_path = os.path.join(data_location, twitter_folder)

twitter_files = os.listdir(twitter_path)

rows = []

# Regex101 was used to create the regex
# ChatGPT was used to help create the syntax to make the regex useable with
# Python
LINE_RE = re.compile(
    r"^(?P<screen_name>[^\t]*)\t"
    r"(?P<name>[^\t]*)\t"
    r"(?P<id>\d+)\t"
    r"(?P<location>[^\t]*)\t"
    r"(?P<followers_count>\d+)\t"
    r"(?P<friends_count>\d+)\t"
    r"(?P<description>.*)$"
)


for file in twitter_files:
    # keep only the follower data files, which contain the description column
    if file.endswith("_data.txt"):
        # save robynkonichiwa as robyn
        artist = file.split("_")[0].replace("konichiwa", "")
        file_path = os.path.join(twitter_path, file)

        with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
            # skip header row by assigning to `_`
            _ = f.readline()
            # iterate through remaining lines
            for line in f:
                line = line.rstrip("\n")
                m = LINE_RE.match(line)
                if m:
                    desc = m.group("description")
                    # only append if desc is not blank
                    if desc:
                        rows.append({"artist": artist, "description": desc})


df_twitter = pd.DataFrame(rows)

In [None]:
def check_sentiment_value(word, current_value, new_value, verbose=False):
    """
    Resolve a word's sentiment score when encountering potential conflicts.

    Given an existing sentiment score (`current_value`) and a proposed new
    score (`new_value`) for the same word, this function applies the
    following rules:

    1) If the current value is already 0 (previously marked conflicting),
       keep it as 0.
    2) If the current value differs from the new value (e.g., +1 vs -1),
       mark as conflicting by setting the score to 0.
    3) Otherwise (values match), keep the new value.

    When `verbose=True`, a human-readable message describing the decision
    is printed.

    Parameters
    ----------
    word : str
        The token/term being evaluated.
    current_value : int
        The word's existing sentiment score in the dictionary. Expected
        values are typically {-1, 0, 1}, where 0 denotes a conflict found
        earlier.
    new_value : int
        The proposed sentiment score for the word (usually -1 or 1).
    verbose : bool, optional
        If True, prints a message describing the resolution. Defaults to False.

    Returns
    -------
    int
        The resolved sentiment score for the word:
        - 0 if a conflict is detected or already present,
        - `new_value` if it matches `current_value`,
        - `current_value` if it is already 0.

    Notes
    -----
    - Use this helper during dictionary construction to ensure that once a
      word is marked as conflicting (0), it remains 0 on subsequent passes.
    - ChatGPT assisted in creating this docstring.

    Examples
    --------
    >>> check_sentiment_value("happy", 1, 1)
    1
    >>> check_sentiment_value("sick", 1, -1)
    0
    >>> check_sentiment_value("wicked", 0, 1)
    0
    """
    message = ""
    score_value = None
    if current_value == 0:
        message = "Conflicting sentiment value already determined!"
        score_value = current_value
    elif current_value != new_value:
        message = f"Conflict: {current_value} vs {new_value}."
        score_value = 0
    else:
        message = "Exists but no conflict in sentiment value."
        score_value = new_value

    if verbose:
        message_0 = f"{word}: "
        message_value = f"Value set to {score_value}."
        print(message_0 + message + message_value)

    return score_value


def create_scoring_dict(dictionary_name, text_file_of_words, base_value, verbose=False):
    """
    Populate a sentiment scoring dictionary from a plain-text word list.

    Reads a text file (one word per line) and inserts each cleaned, lowercased
    word into `dictionary_name` with the given `base_value` (e.g., +1 for a
    positive lexicon, -1 for a negative lexicon). Lines starting with ';'
    are treated as comments and skipped. If a word already exists in the
    dictionary, the final score is resolved via `check_sentiment_value`
    to handle conflicts (e.g., a word appearing in both positive and negative lists);
    once a word has been marked conflicting (score 0), it remains 0.

    Parameters
    ----------
    dictionary_name : dict[str, int]
        The mapping of words to sentiment scores. Modified in place.
        Typical scores are {-1, 0, 1}.
    text_file_of_words : str
        Path to a newline-delimited word list. Each non-empty, non-comment
        line should contain a single token.
    base_value : int
        The base sentiment score to assign to each word from this file
        (commonly +1 for positive or -1 for negative).
    verbose : bool, optional
        If True, passes through to `check_sentiment_value` to print
        per-word resolution messages. Default is False.

    Returns
    -------
    None
        This function modifies `dictionary_name` in place and returns None.

    Notes
    -----
    - Words are cleaned with `strip().lower()` before insertion.
    - Comment lines (prefixed with ';') and blank lines are ignored.
    - Conflict resolution is delegated to `check_sentiment_value`.
    - ChatGPT assisted in creating this docstring.

    Examples
    --------
    >>> word_dict = {}
    >>> create_scoring_dict(word_dict, "negative-words.txt", base_value=-1)
    >>> create_scoring_dict(word_dict, "positive-words.txt", base_value=+1)
    >>> word_dict.get("terrible"), word_dict.get("great")
    (-1, 1)
    """
    score_value = base_value
    with open(text_file_of_words, "r", encoding="utf-8") as f:
        for line in f:
            word = line.strip().lower()
            if word and not word.startswith(";"):
                word_score = score_value
                if word in dictionary_name:
                    word_score = check_sentiment_value(
                        word, dictionary_name[word], score_value, verbose
                    )
                dictionary_name[word] = word_score

    return None


def tidytext_scoring(
    dictionary_name, text_file_of_words="tidytext_sentiments.txt", verbose=False
):
    """
    Augment a sentiment dictionary using a Tidytext-style lexicon file.

    Reads a whitespace-delimited file whose first line is a header and whose
    rows contain at least two columns: the token (column 0) and its sentiment
    label (column 1). The label "negative" is mapped to -1; any other label
    is mapped to +1. Each token is lowercased before lookup/insertion. If a
    token already exists in `dictionary_name`, conflicts are resolved via
    `check_sentiment_value` (e.g., positive vs. negative → set to 0).

    Parameters
    ----------
    dictionary_name : dict[str, int]
        Mapping from token to sentiment score. Modified in place. Typical
        values are {-1, 0, 1}, where 0 indicates a previously detected conflict.
    text_file_of_words : str, optional
        Path to the Tidytext-style sentiment file. Must contain a header line
        followed by rows with at least "word" and "sentiment" columns.
        Defaults to "tidytext_sentiments.txt".
    verbose : bool, optional
        If True, passes through to `check_sentiment_value` to print per-token
        resolution messages. Defaults to False.

    Returns
    -------
    None
        The function updates `dictionary_name` in place and returns None.

    Notes
    -----
    - Assumes whitespace-delimited columns and a single header line (skipped).
    - Tokens are normalized with `lower()`.
    - Conflict handling is delegated to `check_sentiment_value`.
    - ChatGPT assisted in creating this docstring.

    Examples
    --------
    >>> d = {"great": 1}
    >>> tidytext_scoring(d, "tidytext_sentiments.txt")
    >>> d.get("awful"), d.get("great")
    (-1, 1)
    """
    with open(text_file_of_words, "r", encoding="utf-8") as f:
        next(f)  # skip header
        for line in f:
            words = line.split()
            # skip blank or malformed rows that lack a sentiment column
            if len(words) < 2:
                continue
            if words[1] == "negative":
                score_value = -1
            else:
                score_value = 1
            word = words[0].lower()
            if word in dictionary_name:
                score_value = check_sentiment_value(
                    word, dictionary_name[word], score_value, verbose
                )
            dictionary_name[word] = score_value

    return None


def clean_sentiment_dictionary(dictionary_name, conflicting_words_list):
    """
    Remove ambiguous (conflicting) entries from a sentiment dictionary.

    This function scans `dictionary_name` for words whose sentiment score is
    `0` (used to mark conflicts, e.g., words that appeared in both positive
    and negative lists). It removes those words from the dictionary and
    appends each removed word to `conflicting_words_list` for auditing or
    later review.

    Parameters
    ----------
    dictionary_name : dict[str, int]
        Mapping of words to sentiment scores. This dictionary is modified
        in place; any key with value `0` is deleted.
    conflicting_words_list : list[str]
        A list that will be extended with the words removed due to conflict.

    Returns
    -------
    None
        The function performs in-place modifications and prints a short
        summary of how many words were removed.

    Notes
    -----
    - Iterates over a snapshot of keys (`list(dictionary_name.keys())`) to
      safely delete items during traversal.
    - Use this after building your lexicon to ensure that ambiguous words
      don't contribute noise to sentiment scoring.
    - ChatGPT assisted in creating this docstring.

    Examples
    --------
    >>> d = {"happy": 1, "sad": -1, "sick": 0}
    >>> removed = []
    >>> clean_sentiment_dictionary(d, removed)
    Total words removed from: 1
    >>> d
    {'happy': 1, 'sad': -1}
    >>> removed
    ['sick']
    """
    for key in list(dictionary_name.keys()):
        if dictionary_name[key] == 0:
            conflicting_words_list.append(key)
            del dictionary_name[key]
    print(f"Total words removed from: {len(conflicting_words_list)}")

In [7]:
# Read in the positive and negative words and the
# tidytext sentiment. Store these so that the positive
# words are associated with a score of +1 and negative words
# are associated with a score of -1. You can use a dataframe or a
# dictionary for this.
pos_score = 1
neg_score = -1
word_dict = {}
conflict_words = []
set_verbose = False

create_scoring_dict(word_dict, negative_words_file, neg_score, verbose=set_verbose)
create_scoring_dict(word_dict, positive_words_file, pos_score, verbose=set_verbose)
tidytext_scoring(word_dict, tidy_text_file, verbose=set_verbose)

In [8]:
clean_sentiment_dictionary(word_dict, conflict_words)

Total words removed from: 131


## Sentiment Analysis on Songs

In this section, score the sentiment of every song for both artists in your data set. Calculate each score manually using the combined lexicons provided in this repository.

After you have calculated these sentiments, answer the questions at the end of this section.
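
One possible way to do the manual calculation is sketched below: tokenize each song, sum the lexicon score of every token, then average the song scores per artist. The tiny `toy_lexicon`, the sample `songs`, and the helper `score_text` are assumptions made for illustration; the real lexicon is the combined dictionary built earlier in the notebook.

```python
import re
from collections import defaultdict

# Hypothetical miniature lexicon standing in for the combined word dictionary.
toy_lexicon = {"love": 1, "happy": 1, "sad": -1, "cry": -1}


def score_text(text, lexicon):
    """Sum the lexicon scores of every token in `text` (unknown tokens add 0)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


# Hypothetical (artist, song_title, lyrics) rows.
songs = [
    ("artist_a", "Song One", "Love love happy day"),
    ("artist_a", "Song Two", "Sad cry"),
    ("artist_b", "Song Three", "Happy but sad love"),
]

# Collect one score per song, grouped by artist.
scores_by_artist = defaultdict(list)
for artist, title, lyrics in songs:
    scores_by_artist[artist].append(score_text(lyrics, toy_lexicon))

# Average sentiment per song for each artist.
average_sentiment = {
    artist: sum(scores) / len(scores)
    for artist, scores in scores_by_artist.items()
}
print(average_sentiment)
```

This sketch uses a raw sum, so longer songs can reach larger absolute scores. Dividing each song score by its token count is a reasonable alternative if song length varies widely.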


In [None]:
# your code here

### Questions

Q: Overall, which artist has the higher average sentiment per song?

A: <!-- Your answer here -->

---

Q: For your first artist, what are the three songs that have the highest and lowest sentiments? Print the lyrics of those songs to the screen. What do you think is driving the sentiment score?

A: <!-- Your answer here -->

---

Q: For your second artist, what are the three songs that have the highest and lowest sentiments? Print the lyrics of those songs to the screen. What do you think is driving the sentiment score?

A: <!-- Your answer here -->

---

Q: Plot the distributions of the sentiment scores for both artists. You can use `seaborn` to plot densities or plot histograms in matplotlib.


## Sentiment Analysis on Twitter Descriptions

In this section, define two sets of emojis you designate as positive and negative. Make sure to have at least 10 emojis per set. You can learn about the most popular emojis on Twitter at [the emojitracker](https://emojitracker.com/).

Associate your positive emojis with a score of +1, negative with -1. Score the average sentiment of your two artists based on the Twitter descriptions of their followers. The average sentiment can just be the total score divided by number of followers. You do not need to calculate sentiment on non-emoji content for this section.
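
The scoring described above could look roughly like the following sketch. The emoji sets, the sample descriptions, and the helper `score_description` are illustrative assumptions, not a required choice. The sketch also assumes every emoji is a single code point, so a plain character scan is enough; emojis built from several code points would need the `emoji` package or similar handling.

```python
from collections import Counter

# Hypothetical emoji sets (10 each), all single code point characters.
positive_emojis = {"😀", "😂", "😍", "😊", "🥰", "😎", "🎉", "💖", "🌈", "🔥"}
negative_emojis = {"😢", "😭", "😡", "💔", "😞", "😠", "😩", "👎", "😒", "😱"}

# Map each emoji to +1 or -1.
emoji_scores = {e: 1 for e in positive_emojis}
emoji_scores.update({e: -1 for e in negative_emojis})


def score_description(description):
    """Sum the emoji scores found in one description; other characters add 0."""
    return sum(emoji_scores.get(ch, 0) for ch in description)


# Hypothetical follower descriptions keyed by artist.
descriptions = {
    "artist_a": ["music 😍🎉", "tired 😩", "no emoji here"],
    "artist_b": ["💔 again", "😂"],
}

# Average sentiment: total score divided by number of followers.
average_sentiment = {
    artist: sum(score_description(d) for d in descs) / len(descs)
    for artist, descs in descriptions.items()
}

# Count how often each scored emoji appears per artist.
emoji_counts = {
    artist: Counter(ch for d in descs for ch in d if ch in emoji_scores)
    for artist, descs in descriptions.items()
}
print(average_sentiment)
```

Filtering `emoji_counts[artist].most_common()` by membership in `positive_emojis` or `negative_emojis` gives the most popular emoji of each polarity.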


In [None]:
# your code here

Q: What is the average sentiment of your two artists?

A: <!-- Your answer here -->

---

Q: Which positive emoji is the most popular for each artist? Which negative emoji?

A: <!-- Your answer here -->
