# ADS 509 Sentiment Assignment

This notebook holds the Sentiment Assignment for Module 6 in ADS 509, Applied Text Mining. Work through this notebook, writing code and answering questions where required. 

In a previous assignment you put together Twitter data and lyrics data on two artists. In this assignment we apply sentiment analysis to those data sets. If, for some reason, you did not complete that previous assignment, data to use for this assignment can be found in the assignment materials section of Blackboard. 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy to read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


In [16]:
import os
import re
import emoji
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
from string import punctuation

from nltk.corpus import stopwords

sw = stopwords.words("english")

In [17]:
# Add any additional import statements you need here




In [18]:
# change `data_location` to the location of the folder on your machine.
data_location = "/Users/kyledalope/Documents/GitHub/ads509-tm-sentiment"

# These subfolders should still work if you correctly stored the 
# data from the Module 1 assignment
twitter_folder = "/Users/kyledalope/Downloads/M1 Results/twitter"
lyrics_folder = "/Users/kyledalope/Downloads/M1 Results/lyrics"

positive_words_file = "positive-words.txt"
negative_words_file = "negative-words.txt"
tidy_text_file = "tidytext_sentiments.txt"

## Data Input

Now read in each of the corpora. For the lyrics data, it may be convenient to store the entire contents of the file to make it easier to inspect the titles individually, as you'll do in the last part of the assignment. In the solution, I stored the lyrics data in a dictionary with two dimensions of keys: artist and song. The value was the file contents. A Pandas data frame would work equally well. 

For the Twitter data, we only need the description field for this assignment. Feel free to read all the descriptions into a data structure. In the solution, I stored the descriptions as a dictionary of lists, with the key being the artist. 




In [19]:
# Read in the lyrics data

lyrics = {}
# Loop over the artist subfolders
for artist_sub in os.listdir(lyrics_folder):
    artist_path = os.path.join(lyrics_folder, artist_sub)
    if os.path.isdir(artist_path):
        lyrics[artist_sub] = {}
        
        # Loop over the song files in each artist subfolder
        for song in os.listdir(artist_path):
            song_path = os.path.join(artist_path, song)
            if os.path.isfile(song_path):
                with open(song_path, 'r') as file:
                    lyrics_content = file.read()
                    lyrics[artist_sub][song] = lyrics_content

# Create a dataframe from the lyrics dictionary
lyrics_data = pd.DataFrame(
    [(artist, song, text) for artist, songs in lyrics.items() for song, text in songs.items()],
    columns=['artist', 'song', 'lyrics_content'],
)

In [20]:
# read in the twitter data

twitter_descriptions = {}

for file_name in os.listdir(twitter_folder):
    if file_name.endswith(".txt"):
        artist = os.path.splitext(file_name)[0]
        file_path = os.path.join(twitter_folder, file_name)
        with open(file_path, 'r') as file:
            descriptions = [line.strip() for line in file]
        twitter_descriptions[artist] = descriptions

In [21]:
# Read in the positive and negative words and the
# tidytext sentiment. Store these so that the positive
# words are associated with a score of +1 and negative words
# are associated with a score of -1. You can use a dataframe or a 
# dictionary for this.

def read_word_list(path):
    """Read one word per line, skipping ';' comment lines and blank lines."""
    words = []
    with open(path, "r") as file:
        for line in file:
            word = line.strip()
            if word and not word.startswith(";"):
                words.append(word)
    return words

# Read positive words
positive_words_path = os.path.join(data_location, positive_words_file)
positive_words = read_word_list(positive_words_path)

# Read negative words
negative_words_path = os.path.join(data_location, negative_words_file)
negative_words = read_word_list(negative_words_path)

# Create dictionaries with scores
positive_dict = {word: 1 for word in positive_words}
negative_dict = {word: -1 for word in negative_words}

# Read tidy text sentiments
tidy_text_path = os.path.join(data_location, tidy_text_file)
tidy_text_sentiments = pd.read_csv(tidy_text_path, sep="\t")
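
The three sources loaded above can be merged into a single lookup table. The following is a minimal sketch, not the course solution. `build_combined_lexicon` is a hypothetical helper, and it assumes the tidytext rows arrive as `(word, sentiment)` pairs whose sentiment is the string `"positive"` or `"negative"`.

```python
def build_combined_lexicon(positive_words, negative_words, tidy_rows=()):
    """Merge word lists into one {word: score} dictionary.

    tidy_rows is an iterable of (word, sentiment) pairs; the sentiment
    labels "positive" and "negative" are an assumption about the file.
    The first score seen for a word is kept, so the positive list takes
    precedence over the negative list, and both over tidy_rows.
    """
    lexicon = {}
    for word in positive_words:
        lexicon.setdefault(word, 1)
    for word in negative_words:
        lexicon.setdefault(word, -1)
    for word, sentiment in tidy_rows:
        if sentiment == "positive":
            lexicon.setdefault(word, 1)
        elif sentiment == "negative":
            lexicon.setdefault(word, -1)
    return lexicon


# Toy input: "good" appears twice, and the first score (+1) is kept.
combined = build_combined_lexicon(
    ["good", "love"],
    ["bad"],
    [("happy", "positive"), ("awful", "negative"), ("good", "negative")],
)
```

With the notebook's variables, and assuming the tidytext data frame has columns named `word` and `sentiment`, the third argument could be `tidy_text_sentiments[["word", "sentiment"]].itertuples(index=False)`.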


## Sentiment Analysis on Songs

In this section, score the sentiment for all the songs for both artists in your data set. Score the sentiment by manually calculating the sentiment using the combined lexicons provided in this repository. 

After you have calculated these sentiments, answer the questions at the end of this section.
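
One way to score a text manually is to lowercase it, strip the punctuation around each token, and sum the lexicon scores. The following is a minimal sketch, assuming a `{word: score}` dictionary with lowercase entries. `score_text` and the toy lexicon are hypothetical and are not the assignment solution.

```python
from string import punctuation


def score_text(text, lexicon):
    """Sum lexicon scores over lowercased, punctuation-stripped tokens."""
    score = 0
    for token in text.lower().split():
        # Remove leading and trailing punctuation; inner apostrophes stay.
        token = token.strip(punctuation)
        score += lexicon.get(token, 0)
    return score


toy_lexicon = {"love": 1, "happy": 1, "lie": -1}
toy_score = score_text('"Love" is happy, but cheap lips lie.', toy_lexicon)
```

Without the normalization step, tokens such as `"Love"` or `lie.` would not match their lexicon entries and would contribute nothing to the score.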


In [22]:
lyrics_data #view format of lyrics data frame

| | artist | song | lyrics_content |
|---|---|---|---|
| 0 | robyn | robyn_includemeout.txt | `"Include Me Out"\n\n\n\nIt is really very simp...` |
| 1 | robyn | robyn_electric.txt | `"Electric"\n\n\n\nElectric...\n\nIt's electric...` |
| 2 | robyn | robyn_beach2k20.txt | `"Beach 2K20"\n\n\n\n(So you wanna go out?\nHow...` |
| 3 | robyn | robyn_lovekills.txt | `"Love Kills"\n\n\n\nIf you're looking for love...` |
| 4 | robyn | robyn_timemachine.txt | `"Time Machine"\n\n\n\nHey, what did I do?\nCan...` |
| ... | ... | ... | ... |
| 415 | cher | cher_takeitfromtheboys.txt | `"Take It From The Boys"\n\n\n\nSo scared I nev...` |
| 416 | cher | cher_dreambaby.txt | `"Dream Baby"\n\n\n\nI found the boy for me\nHe...` |
| 417 | cher | cher_pleasedonttellme.txt | `"Please Don't Tell Me"\n\n\n\nYa shook the ove...` |
| 418 | cher | cher_ihopeyoufindit.txt | `"I Hope You Find It"\n\n\n\nThese clouds aren'...` |


In [23]:
# Compute a sentiment score for each song and collect the scores in a list

songs_sentiment_scores = []
for lyrics_content in lyrics_data['lyrics_content']:
    sentiment_score = 0
    words = lyrics_content.split()  # Split the lyrics content into words
    for word in words:
        if word in positive_dict:
            sentiment_score += positive_dict[word]
        elif word in negative_dict:
            sentiment_score += negative_dict[word]
    songs_sentiment_scores.append(sentiment_score)

# Add sentiment scores to the lyrics data dataframe
lyrics_data['SentimentScore'] = songs_sentiment_scores

In [24]:
lyrics_data['SentimentScore']

0      -8
1      -9
2      20
3     -19
4      -3
       ..
415    10
416    11
417    -1
418    -1
419     2
Name: SentimentScore, Length: 420, dtype: int64

In [25]:
# average sentiment per song for each artist
avg_sent_per_song = lyrics_data.groupby('artist')['SentimentScore'].mean()

avg_sent_per_song #view avg sentiment score for cher and robyn

artist
cher     2.177215
robyn    0.875000
Name: SentimentScore, dtype: float64

In [35]:
# first artist, three songs with highest and lowest sentiments

first_cher = lyrics_data[lyrics_data['artist'] == 'cher']

high_sent_songs = first_cher.nlargest(3, 'SentimentScore') #top 3 highest sentiment songs

low_sent_songs = first_cher.nsmallest(3, 'SentimentScore') #3 lowest sentiment songs

print("Three highest sentiment songs for Cher:", high_sent_songs)
print()
print("Three lowest sentiment songs for Cher:", low_sent_songs)

Three highest sentiment songs for Cher:     artist                           song  \
207   cher            cher_perfection.txt   
250   cher  cher_loveandunderstanding.txt   
119   cher                cher_mylove.txt   

                                        lyrics_content  SentimentScore  
207  "Perfection"\n\n\n\nHush little Baby, gotta be...              41  
250  "Love And Understanding"\n\n\n\nHere, here in ...              36  
119  "My Love"\n\n\n\nWhen I go away\nI know my hea...              34  

Three lowest sentiment songs for Cher:     artist                              song  \
275   cher  cher_iwalkonguildedsplinters.txt   
262   cher               cher_outrageous.txt   
111   cher                    cher_julie.txt   

                                        lyrics_content  SentimentScore  
275  "I Walk On Guilded Splinters"\n\n\n\nSome peop...             -25  
262  "Outrageous"\n\n\n\nOutrageous, outrageous\n(T...             -20  
111  "Julie"\n\n\n\nCheap lips lie 

### Questions

Q: Overall, which artist has the higher average sentiment per song? 

A: Overall, Cher had the higher average sentiment per song.

---

Q: For your first artist, what are the three songs that have the highest and lowest sentiments? Print the lyrics of those songs to the screen. What do you think is driving the sentiment score? 

A: <!-- Your answer here -->

---

Q: For your second artist, what are the three songs that have the highest and lowest sentiments? Print the lyrics of those songs to the screen. What do you think is driving the sentiment score? 

A: <!-- Your answer here -->

---

Q: Plot the distributions of the sentiment scores for both artists. You can use `seaborn` to plot densities or plot histograms in matplotlib.
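
A minimal sketch of one way to draw the two distributions with matplotlib histograms. `scores_by_artist` and `plot_score_histograms` are hypothetical helpers, and the bin count and transparency are arbitrary choices.

```python
from collections import defaultdict


def scores_by_artist(rows):
    """Group (artist, score) pairs into {artist: [scores]}."""
    grouped = defaultdict(list)
    for artist, score in rows:
        grouped[artist].append(score)
    return dict(grouped)


def plot_score_histograms(grouped, bins=20):
    """Overlay one semi-transparent, density-scaled histogram per artist."""
    # Imported here so the grouping helper runs without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for artist, scores in grouped.items():
        ax.hist(scores, bins=bins, alpha=0.5, density=True, label=artist)
    ax.set_xlabel("Sentiment score")
    ax.set_ylabel("Density")
    ax.legend()
    return ax


# Toy data standing in for the per-song scores.
toy_grouped = scores_by_artist([("cher", 3), ("robyn", -1), ("cher", -2)])
```

With the notebook's data frame, the call could look like `plot_score_histograms(scores_by_artist(zip(lyrics_data["artist"], lyrics_data["SentimentScore"])))`. Scaling each histogram to a density keeps the two artists comparable even when they have different numbers of songs.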




## Sentiment Analysis on Twitter Descriptions

In this section, define two sets of emojis you designate as positive and negative. Make sure to have at least 10 emojis per set. You can learn about the most popular emojis on Twitter at [the emojitracker](https://emojitracker.com/). 

Associate your positive emojis with a score of +1, negative with -1. Score the average sentiment of your two artists based on the Twitter descriptions of their followers. The average sentiment can just be the total score divided by number of followers. You do not need to calculate sentiment on non-emoji content for this section.
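
The scoring rule above can be sketched as follows. This is an illustration under stated assumptions, not the course solution: the two emoji sets are arbitrary examples, `emoji_sentiment` is a hypothetical helper, and each description is assumed to belong to exactly one follower.

```python
from collections import Counter

# Illustrative choices only; any two sets of at least 10 emojis would do.
# Every emoji here is a single code point, so per-character checks work.
POSITIVE_EMOJIS = set("😀😂😍😊🥰😎🎉💕👍✨")
NEGATIVE_EMOJIS = set("😢😭😡😠💔😞👎😱🤬😒")


def emoji_sentiment(descriptions):
    """Return (average score per description, Counter of scored emojis)."""
    total = 0
    counts = Counter()
    for description in descriptions:
        for char in description:
            if char in POSITIVE_EMOJIS:
                total += 1
                counts[char] += 1
            elif char in NEGATIVE_EMOJIS:
                total -= 1
                counts[char] += 1
    # Total score divided by the number of followers (descriptions).
    average = total / len(descriptions) if descriptions else 0.0
    return average, counts


toy_average, toy_counts = emoji_sentiment(
    ["music lover 😍🎉", "rough day 😢", "no emoji here"]
)
```

The returned counter can also answer the popularity question: filter `counts.most_common()` by membership in `POSITIVE_EMOJIS` or `NEGATIVE_EMOJIS` and take the first match. Emojis made of several code points, such as those with skin-tone modifiers, would need a tokenizer rather than a per-character check.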

In [None]:
# your code here

Q: What is the average sentiment of your two artists? 

A: <!-- Your answer here --> 

---

Q: Which positive emoji is the most popular for each artist? Which negative emoji? 

A: <!-- Your answer here --> 

