# ADS 509 Sentiment Assignment

This notebook holds the Sentiment Assignment for Module 6 in ADS 509, Applied Text Mining. Work through this notebook, writing code and answering questions where required. 

In a previous assignment you put together Twitter data and lyrics data on two artists. In this assignment we apply sentiment analysis to those data sets. If, for some reason, you did not complete that previous assignment, data to use for this assignment can be found in the assignment materials section of Blackboard. 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


In [1]:
import os
import re
import emoji
import pandas as pd
import numpy as np

from collections import Counter, defaultdict
from string import punctuation

from nltk.corpus import stopwords

sw = stopwords.words("english")

In [2]:
# Add any additional import statements you need here

import seaborn as sns
import emoji
import regex
from collections import Counter


In [3]:
# change `data_location` to the location of the folder on your machine.
data_location = "/users/chandler/dropbox/teaching/repos/ads-tm-api-scrape/"

# These subfolders should still work if you correctly stored the 
# data from the Module 1 assignment
twitter_folder = "twitter/"
lyrics_folder = "lyrics/"

positive_words_file = "positive-words.txt"
negative_words_file = "negative-words.txt"
tidy_text_file = "tidytext_sentiments.txt"

## Data Input

Now read in each of the corpora. For the lyrics data, it may be convenient to store the entire contents of the file to make it easier to inspect the titles individually, as you'll do in the last part of the assignment. In the solution, I stored the lyrics data in a dictionary with two dimensions of keys: artist and song. The value was the file contents. A Pandas data frame would work equally well. 

For the Twitter data, we only need the description field for this assignment. Feel free all the descriptions read it into a data structure. In the solution, I stored the descriptions as a dictionary of lists, with the key being the artist. 




In [4]:
# Read in the lyrics data
# Feel free to use the below cells as an example or read in the data in a way you prefer

data_location = "C:/Users/elfek/datamining/mod6/" # change to your location if it is not in the same directory as your notebook
artist1 = 'cher'
artist2 = 'robyn'
artist_files = {artist1:'cher_followers_data.txt',
            artist2:'robynkonichiwa_followers_data.txt'}

#reading the file
lyrics1 = []  
os.chdir(r"C:\Users\elfek\datamining\lyrics")
for filename in os.listdir(artist1):
    with open(os.path.join(artist1, filename), 'r', encoding="utf-8") as f:
        songLyrics = f.read()
        lyrics1.append(songLyrics.split('\n', 1)[1:])
        song = str(songLyrics.split('\n', 1)[1:])
        songTitle = songLyrics.split('\n', 1)[0]
        # if (re.search("tjaffs", song)):
        #     print(songTitle)

lyrics2 = []        
os.chdir(r"C:\Users\elfek\datamining\lyrics")
for filename in os.listdir(artist2):
    with open(os.path.join(artist2, filename), 'r', encoding="utf-8") as f:
        songLyrics = f.read()
        song = str(songLyrics.split('\n', 1)[1:])
        lyrics2.append(songLyrics.split('\n', 1)[1:]) #extend
        songTitle = songLyrics.split('\n', 1)[0]
        # if (re.search("transistors", song)):
        #     print(songTitle)
        

lyrics_data =  pd.DataFrame(lyrics1)
lyrics_data['artist'] = artist1
lyrics_data_2 =  pd.DataFrame(lyrics2)
lyrics_data_2['artist'] = artist2

lyrics_data = pd.concat([
    lyrics_data,lyrics_data_2])
    
del(lyrics_data_2)

lyrics_data.rename(columns={0: "lyrics"}, inplace = True)


In [5]:
# Read in the twitter data
twitter_data = pd.read_csv(data_location + twitter_folder + artist_files[artist1],
                        sep="\t",
                        quoting=3)

twitter_data['artist'] = artist1
twitter_data_2 = pd.read_csv(data_location + twitter_folder + artist_files[artist2],
                            sep="\t",
                            quoting=3)
twitter_data_2['artist'] = artist2

twitter_data = pd.concat([
twitter_data,twitter_data_2])

del(twitter_data_2)

In [6]:
# Read in the positive and negative words and the
# tidytext sentiment. Store these so that the positive
# words are associated with a score of +1 and negative words
# are associated with a score of -1. You can use a dataframe or a 
# dictionary for this.

f=data_location + tidy_text_file
df = pd.read_csv(f,sep='\t')
df.drop(columns = 'lexicon', inplace = True)
df.replace('positive', 1, inplace=True)
df.replace('negative', -1, inplace=True)
df.sample()

Unnamed: 0,word,sentiment
11963,unfunded,-1


In [7]:
negative = df.loc[df['sentiment'] == -1].drop( columns = 'sentiment')['word'].values.tolist()
positive = df.loc[df['sentiment'] == 1].drop( columns = 'sentiment')['word'].values.tolist()


#negative = df.loc[df['sentiment'] == -1].drop( columns = 'sentiment').values.tolist()
#positive = df.loc[df['sentiment'] == 1].drop( columns = 'sentiment').values.tolist()
# positive

## Sentiment Analysis on Songs

In this section, score the sentiment for all the songs for both artists in your data set. Score the sentiment by manually calculating the sentiment using the combined lexicons provided in this repository. 

After you have calculated these sentiments, answer the questions at the end of this section.


In [8]:
# your code here
lyrics_data
lyrics_data['score'] = pd.Series(dtype='int')
temp_score= []
i = 0
for index, row in lyrics_data.iterrows():
    score = 0
    for word in row[0].split():
        #print(word)
        if word in negative:
           # print("The word is not in the list!")
            score = score - 1
        elif word in positive:
           # print("The word is in the list!")
            score = score + 1
    #print(score)
    temp_score.append(score)
lyrics_data['score'] =temp_score
lyrics_data.sample()


Unnamed: 0,lyrics,artist,score
113,"\n\n\nThese clouds aren't going nowhere, darli...",cher,6


In [9]:
cher_avg_score = lyrics_data.loc[lyrics_data['artist'] == 'cher']['score'].mean()
robyn_avg_score = lyrics_data.loc[lyrics_data['artist'] == 'robyn']['score'].mean()
cher_min_score = lyrics_data.loc[lyrics_data['artist'] == 'cher']['score'].min()
robyn_min_score = lyrics_data.loc[lyrics_data['artist'] == 'robyn']['score'].min()
cher_max_score = lyrics_data.loc[lyrics_data['artist'] == 'cher']['score'].max()
robyn_max_score = lyrics_data.loc[lyrics_data['artist'] == 'robyn']['score'].max()
cher_avg_score, robyn_avg_score

(4.079113924050633, 5.663461538461538)

In [10]:
cher_min_score, robyn_min_score, cher_max_score, robyn_max_score

(-42, -94, 50, 85)

In [11]:
lyrics_data.loc[lyrics_data['score'] == cher_min_score], lyrics_data.loc[lyrics_data['score'] == cher_max_score]

(                                               lyrics artist  score
 16  \n\n\nBang bang you shot me down\nBang bang I ...   cher    -42,
                                                 lyrics artist  score
 155  \n\n\nHere, here in this world\nWhere do we go...   cher     50)

In [12]:
lyrics_data.loc[lyrics_data['score'] == robyn_min_score], lyrics_data.loc[lyrics_data['score'] == robyn_max_score] 

(                                               lyrics artist  score
 26  \n\n\nMy drinking is killing me\nMy drinking i...  robyn    -94
 27  \n\n\nMy drinking is killing me\nMy drinking i...  robyn    -94,
                                                lyrics artist  score
 64  \n\n\nFree\nLove is free, baby\nFree\nLove is ...  robyn     85)

### Questions

Q: Overall, which artist has the higher average sentiment per song? 

A: Robyn has avg higher sentiment per song.

---

Q: For your first artist, what songs have the highest and lowest sentiments? Print those songs to the screen.

A: For Cher, "Bang bang you shot me down" has the lowest sentiments and "Here, here in this world" has the highest sentiments.

---

Q: For your second artist, what songs have the highest and lowest sentiments? Print those songs to the screen.

A: For Robyn, "My Drinking is killing me" has the lowest sentiments and "Love is free, baby" has the highest sentiments.

---

Q: Plot the distributions of the sentiment scores for both artists. You can use `seaborn` to plot densities or plot histograms in matplotlib.




## Sentiment Analysis on Twitter Descriptions

In this section, define two sets of emojis you designate as positive and negative. Make sure to have at least 10 emojis per set. You can learn about the most popular emojis on Twitter at [the emojitracker](https://emojitracker.com/). 

Associate your positive emojis with a score of +1, negative with -1. Score the average sentiment of your two artists based on the Twitter descriptions of their followers. The average sentiment can just be the total score divided by number of followers. 

In [13]:
twitter_data.drop(columns = ['screen_name', 'name', 'id', 'location','followers_count', 'friends_count' ], inplace = True)


In [14]:
twitter_data.sample(10)

Unnamed: 0,description,artist
342163,olha a bio biuô pirulitô,cher
3277761,,cher
3502551,,cher
3620447,,cher
2627230,🍼🍼,cher
3589372,"Bookworm, nerd, and all around loser at life",cher
1166010,,cher
412995,,cher
1633800,Entrepreneur || Founder @ Seers Projects (Pty)...,cher
3086165,,cher


In [15]:
positive_emoticons = []
temp_positive_emoticons = [":kissing_heart:", ":smile:", ":blush:", ":grin:", ":kissing_closed_eyes:", ":heart_eyes:", "😊", ":smiling_imp:", ":imp:", ":yum:"]
for emoticon in temp_positive_emoticons:
    positive_emoticons.append(emoji.emojize(emoticon,language= 'alias'))
# print(positive_emoticons)
negative_emoticons = []
temp_negative_emoticons = [":angry:", ":sob:", ":cry:", ":fearful:", ":tired_face:", ":disappointed:", ":confounded:", ":frowning:", ":grimacing:", ":worried:"]
for emoticon in temp_negative_emoticons:
    negative_emoticons.append(emoji.emojize(emoticon, language = 'alias'))
# print(negative_emoticons)
twitter_data = twitter_data.dropna()
twitter_data['description'].sample(10)

251641     If today is the first day of the rest of my li...
75763      Liberal ° Políticamente incorrecto 🙃 Máster en...
454241                                    WAKANDA FOREVER...
2355164                                            fuck cops
2498457    hello I'm Acacia Clark...i'm real! :) I became...
3753117                                  The Voice of Reason
3896706    Be who you are and say what you feel because t...
2301218                              we’re all battling fear
266359                    Don't Blame Me, I Voted for Trump!
2006963    INTJ ... Aries ... Quins ... just getting on w...
Name: description, dtype: object

In [44]:
# your code here

def extract_emojis(s):
  return ' '.join(c for c in s if c in (positive_emoticons + negative_emoticons))


# def extract_emojis(s):
#   return ' '.join(c for c in s if c in emoji.UNICODE_EMOJI['en'])

def split_count(text):

    emoji_list = []
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI['en'] for char in word):
            emoji_list.append(word)
        emoji_list.append()
    
    return emoji_list

# It's handy to have a full set of emojis
all_language_emojis = set()

for country in emoji.UNICODE_EMOJI : 
    for em in emoji.UNICODE_EMOJI[country] : 
        all_language_emojis.add(em)
        
def is_emoji(s):
    return(s in all_language_emojis)
    
def contains_emoji(s):
    s = str(s)
    emojis = [ch for ch in s if is_emoji(ch)]
    return(len(emojis) > 0)

#print(emoji.emojize(":zipper-mouth_face:"))

temp_score= []
temp_emoji= []

for row in twitter_data['description']:
    score = 0
    #print(row)
    emo = extract_emojis(row)
    temp_emoji.append(emo)
    # print(emo)
    if emo != '':
        for item in emo:
            # print(item)
            if item in negative_emoticons:
                # print(score)
                score = score - 1
            elif item in positive_emoticons:
                # print(score)
                score = score + 1
            #print(score)
            #print(emoji.demojize(item))
        temp_score.append(score)
    else:
        temp_score.append(0)
        
twitter_data['score'] = temp_score
twitter_data['sentiment_emojis'] = temp_emoji
twitter_data.sample()

Unnamed: 0,description,artist,score,sentiment_emojis
2014720,Leaning in. Digital marketing strategist. Writ...,cher,0,


In [45]:
twitter_data.loc[twitter_data['score'] != 0 ].sample(10)

Unnamed: 0,description,artist,score,sentiment_emojis
1030303,*RESIST* I positively DESPISE chump and all of...,cher,1,😊
1274992,DJ👣Atheist ⚛️ Model 🌍Public 🇺🇸Figure ☔Seattle ...,cher,1,😈
171065,Entrepreneur in 🏴󠁧󠁢󠁳󠁣󠁴󠁿 the prints are availab...,cher,1,😘
1326040,❤️😍Gleek✌️💗 🎶🎤Arianator Boy🎧💖 😎Psicólogo 👀,cher,1,😍
2351272,"i dont get mad, I get even ✌🏻️😊",cher,1,😊
1095302,"Ayee yall😍 I'm a very sweet person, I'm fun an...",cher,2,😍 😍
14464,"Cashtag for @cashapp $dsb1p Spicy humor, fun, ...",robyn,1,😋
1130150,"Liberal, (over) thinker, traveler who loves te...",cher,1,😊
1230270,😊am crushing on maself❤TYLGFML,cher,1,😊
3895199,"Protección civil, gestión integral del riesgo,...",cher,1,😁


In [46]:
twitter_data.loc[twitter_data['artist'] == 'cher']['score'].max(), twitter_data.loc[twitter_data['artist'] == 'cher']['score'].min(),twitter_data.loc[twitter_data['artist'] == 'robyn']['score'].max(), twitter_data.loc[twitter_data['artist'] == 'robyn']['score'].min()

(37, -16, 19, -6)

In [47]:
twitter_data.loc[twitter_data['artist'] == 'cher']['score'].mean(),twitter_data.loc[twitter_data['artist'] == 'cher']['score'].mean()

(0.012164563692917857, 0.012164563692917857)

Q: What is the average sentiment of your two artists? 

A: The average sentiment is zero for both artists.

---

Q: Which positive emoji is the most popular for each artist? Which negative emoji? 

A: Most popular positive emoji for cher is 😊, most negative for cher is 😬. Most popular positive emoji for robyn is 😍, most negative for robyn is 😭



In [48]:
twitter_data.loc[twitter_data['score'] == 37]['sentiment_emojis']


758219    😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 ...
Name: sentiment_emojis, dtype: object

In [54]:
twitter_data.loc[twitter_data['score'] == -16]['sentiment_emojis']

1293326    😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬 😬
Name: sentiment_emojis, dtype: object

In [50]:
twitter_data.loc[twitter_data['score'] == 19]['sentiment_emojis']


1153184    😍 😍 😍 😍 😍 😍 😍 😍 😍 😍 😍 😍 😍 😍 😍 😍 😍 😍 😍
32139      😍 😍 😍 😍 😍 😍 😍 😍 😍 😍 😊 😊 😊 😊 😊 😊 😊 😊 😊
Name: sentiment_emojis, dtype: object

In [51]:
twitter_data.loc[twitter_data['score'] == -6]['sentiment_emojis']

488262     😭 😭 😭 😭 😭 😭
692493     😭 😭 😭 😭 😭 😭
1103078    😭 😭 😭 😭 😭 😭
1699870    😬 😬 😬 😬 😬 😬
3208022    😭 😭 😞 😞 😭 😭
7414       😭 😭 😭 😭 😭 😭
Name: sentiment_emojis, dtype: object