[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/109x8CyqEGEOz0rclgTMgX8a3YRuOiAyG)

# Tweet sentiment analysis

## Description and requirements:

In this notebook, you will learn how to perform sentiment analysis on tweets extracted from Twitter in Python.

A requirement for this code to work is to have the English version of the Hedonometer word list, which can be downloaded from [Hedonometer](https://hedonometer.org/). This dataset contains a list of words, each of which is assigned a "happiness score". The score ranges from 1 to 9, with 1 being very negative and 9 being very positive. For example, the word `coronavirus` received a score of 1.34, while the word `laughter` received 8.5.

The method we will use here to measure the sentiment of a tweet is to add up the happiness scores of the words in the tweet that appear in the Hedonometer list and divide the total by the number of those matched words.
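
As a worked illustration of this averaging, here is a minimal sketch that uses the two example scores quoted above. The `toy_scores` dictionary and the `average_happiness` helper are assumptions made for this example only; the real word list is loaded later in the notebook.

```python
# Toy word list using the two example scores quoted above (illustrative only)
toy_scores = {"coronavirus": 1.34, "laughter": 8.5}


def average_happiness(sentence, scores):
    """Average the scores of the words in `sentence` that appear in `scores`."""
    matched = [scores[word] for word in sentence.split(" ") if word in scores]
    # Return 0 when no word of the sentence is in the word list
    return sum(matched) / len(matched) if matched else 0


example_score = average_happiness("laughter beats coronavirus", toy_scores)
print(round(example_score, 2))  # (1.34 + 8.5) / 2 = 4.92
```

Note that the word `beats` is not in the toy list, so it does not affect the average.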


### Import modules

In [29]:
import ast
import os
import re

import numpy as np
import pandas as pd
from tqdm import tqdm
import plotly.express as px
%matplotlib notebook

tqdm.pandas()

path_to_data = './data'
hedonometer_path = os.path.join(path_to_data, 'Hedonometer_translated.csv')



In [2]:
import plotly
plotly.__version__

'4.14.3'

### Define initial functions
We will define a few functions that will be used for the sentiment analysis.

#### sentiment
This function will determine the sentiment of a tweet.
It will take as an input:
- `sentence` which is the tweet to analyze.

It then goes over all the words in the sentence and computes the average happiness score of those that appear in the Hedonometer dataset.

In [3]:
def sentiment(sentence):
    senti = 0
    total = 0
    words = sentence.split(" ")
    for word in words:
        if word in scores.keys():
            senti += scores[word]
            total += 1
    try:
        return senti / total
    except ZeroDivisionError:
        return 0

#### standard_deviation
This function will determine the average standard deviation of the ratings of the words in a tweet.
In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.

It will take as an input:
- `sentence` which is the sentence (Tweet) to analyze

It then goes over all the words in the sentence and computes the average of the standard deviations of those that appear in the Hedonometer dataset.

In [4]:
def standard_deviation(sentence):
    standard_deviation = 0
    total = 0
    words = sentence.split(" ")
    for word in words:
        if word in sds.keys():
            standard_deviation += sds[word]
            total += 1
    try:
        return standard_deviation / total
    except ZeroDivisionError:
        return 0
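
To make the notion of spread concrete, here is a small sketch with NumPy. The two rating lists are invented for illustration and are not taken from the Hedonometer data; we assume the `Standard Deviation of Ratings` column holds a value of this kind for each word.

```python
import numpy as np

# Hypothetical ratings given to two words by four annotators each
consistent_ratings = [5, 5, 5, 5]   # everyone agrees
divisive_ratings = [1, 9, 1, 9]     # strong disagreement

# np.std computes the population standard deviation by default (ddof=0)
low_spread = np.std(consistent_ratings)
high_spread = np.std(divisive_ratings)
print(low_spread, high_spread)  # 0.0 4.0
```

Both lists have the same mean (5), but only the second one has values spread far from it.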

#### grab_hits
This function will collect all the words from the tweet that appear in the Hedonometer list.

It will take as an input:
- `sentence` which is the sentence (Tweet) to analyze

It then goes over all words in the sentence, tries to match them with the hedonometer list, and returns the ones that were found.

In [5]:
def grab_hits(sentence):
    hits = []
    words = sentence.split(" ")
    for word in words:
        if word in sds.keys():
            hits.append(word)
    return hits

Now, we would like to do this for each of the tweets in our DataFrame. To do this, we will use the `apply` method on the `clean_text` column.
This method will run the functions `sentiment`, `standard_deviation`, and `grab_hits` on each of the tweets.
The result of each run is returned as a Pandas Series with one value per tweet.

In [6]:
# Run sentiment on each tweet
def return_sentiment(df):
    return pd.Series(df["clean_text"].apply(lambda x: sentiment(x)))

# Run standard_deviation on each tweet
def return_standard_deviation(df):
    return pd.Series(df["clean_text"].apply(lambda x: standard_deviation(x)))

# Run grab_hits on each tweet
def return_hits(df):
    return pd.Series(df["clean_text"].apply(lambda x: grab_hits(x)))
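
As a minimal illustration of how `apply` maps a function over a column, consider the sketch below. The `toy_df` DataFrame and the `count_words` helper are hypothetical and only serve this example.

```python
import pandas as pd


def count_words(sentence):
    """Count the space-separated words in a sentence."""
    return len(sentence.split(" "))


toy_df = pd.DataFrame({"clean_text": ["good morning", "what a lovely day"]})
# apply calls count_words once per row and returns a Series aligned with toy_df
toy_df["n_words"] = toy_df["clean_text"].apply(count_words)
print(toy_df["n_words"].tolist())  # [2, 4]
```

Since `apply` accepts any callable, a function can be passed directly without wrapping it in a `lambda`.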

### Define cleaning functions

In order to clean our tweets from URLs, prefixes and more, we will define the following functions:

#### remove_pattern
This function will remove a given pattern from the text using a regular expression.

It will take as an input:
- `input_txt` The input text to clean
- `pattern` The pattern to remove

It will then return the cleaned text.

In [7]:
def remove_pattern(input_txt, pattern):
    # Substitute the pattern directly. Re-using the matched strings as new
    # patterns would misbehave on regex metacharacters and capturing groups.
    return re.sub(pattern, '', input_txt)
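
The following sketch shows pattern removal with `re.sub` on a made-up string, and why capturing groups need care when combined with `re.findall`. The sample text is an assumption for illustration.

```python
import re

text = "hello @alice and @bob, see #python"

# Remove every @mention in a single pass
no_mentions = re.sub(r"@\w*", "", text)
print(no_mentions)  # hello  and , see #python

# With a capturing group, findall returns only the captured part,
# so the leading '#' is not included in the result
hashtag_words = re.findall(r"#(\w+)", text)
print(hashtag_words)  # ['python']
```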

#### remove_prefix
This function will remove a prefix from a given text.

It will receive as input:
- `text` The text to remove the prefix from
- `prefix` The prefix to remove

It will then return the cleaned text.

In [8]:
def remove_prefix(text, prefix):
    if text.startswith(prefix):
        text = text.replace(prefix, "", 1)
    return text

#### clean_and_sentiment
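This function will clean the tweets in the df using the above functions and then score them.
This function will receive:
- `df` the DataFrame containing all the tweets

It will create a new column called `clean_text` in the df that will contain the cleaned tweets, along with the columns `hedonometer`, `hedo_sd` and `hits`. Tweets that end up empty or receive no score are dropped, and the resulting df is returned.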

In [9]:
def clean_and_sentiment(df):
    # Remove all URLs from the tweet (raw string, explicit regex mode)
    df['clean_text'] = df['text'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
    # Remove all \n from text
    df['clean_text'] = np.vectorize(remove_pattern)(df['clean_text'], "\n")
    # Remove all @ tags
    df['clean_text'] = np.vectorize(remove_pattern)(df['clean_text'], r"@[\w]*")
    # Remove the RT prefix (what is left of "RT @user: " once the tag is gone)
    df['clean_text'] = df['clean_text'].apply(lambda x: remove_prefix(x, "RT : "))
    # Remove all # tags
    df['clean_text'] = np.vectorize(remove_pattern)(df['clean_text'], r"#(\w+)")
    # Remove any remaining # characters
    df['clean_text'] = df['clean_text'].str.replace('#', '', regex=False)
    # Make the string lower case
    df['clean_text'] = df['clean_text'].str.lower()
    # Remove all the punctuation
    df['clean_text'] = df['clean_text'].str.replace(r'[^\w\s]', '', regex=True)
    # If some tweets were left empty then remove them (copy to avoid SettingWithCopyWarning)
    df = df[df['clean_text'] != ''].copy()
    # Run the hedonometer on the tweets
    df["hedonometer"] = return_sentiment(df)
    # Clear all tweets with no score
    df = df[df["hedonometer"] != 0].copy()
    # Calculate the standard deviation for each tweet's sentiment
    df["hedo_sd"] = return_standard_deviation(df)
    # Pull the words from each tweet that determine sentiment into a new column
    df["hits"] = return_hits(df)
    return df
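
To see the effect of the cleaning steps on a single tweet, here is a minimal sketch that applies the same sequence to one string with plain `re` calls. The `clean_tweet` helper and the sample tweet are hypothetical; this is not the notebook's DataFrame pipeline, only an illustration of the order of operations.

```python
import re


def clean_tweet(text):
    """Apply the cleaning steps described above to a single string."""
    text = re.sub(r"http\S+|www\.\S+", "", text, flags=re.IGNORECASE)  # URLs
    text = text.replace("\n", "")                                      # newlines
    text = re.sub(r"@\w*", "", text)                                   # @ tags
    if text.startswith("RT : "):                                       # RT prefix
        text = text[len("RT : "):]
    text = re.sub(r"#\w+", "", text)                                   # hashtags
    text = text.lower()                                                # lower case
    return re.sub(r"[^\w\s]", "", text)                                # punctuation


example = "RT @user: Great news! Read more at https://example.com #happy"
cleaned = clean_tweet(example)
print(cleaned.split())  # ['great', 'news', 'read', 'more', 'at']
```

The order matters: the `RT : ` prefix only appears once the `@user` tag has been removed, and punctuation is stripped last so that `@`, `#` and `:` are still available to the earlier steps.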

### Load Hedonometer data

Finally, we can load our Hedonometer and start calculating the sentiments of the tweets.
First, we will load the Hedonometer scores and store them in a dictionary. Each entry will be in the form of {key: value}, where `key` stands for a word from the list and `value` will be the sentiment score given to it.

In [10]:
file = pd.read_csv(hedonometer_path).sort_values(by='Happiness Score')
scores = file[["translate_text", "Happiness Score"]]
scores.columns = ["word", "score"]
scores = dict(zip(scores.word, scores.score))
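
The `dict(zip(...))` idiom used above can be sketched on a tiny, made-up table. The `toy_file` DataFrame below is an assumption for illustration and only reuses the column names and the two example scores mentioned earlier.

```python
import pandas as pd

toy_file = pd.DataFrame({
    "translate_text": ["laughter", "coronavirus"],
    "Happiness Score": [8.5, 1.34],
})

# zip pairs each word with its score; dict turns the pairs into a lookup table
toy_lookup = dict(zip(toy_file["translate_text"], toy_file["Happiness Score"]))
print(toy_lookup["laughter"])     # 8.5
print(toy_lookup.get("unknown"))  # None
```

If a word appears more than once in the list, the last occurrence wins, since later pairs overwrite earlier keys.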

In [11]:
file.head()

Unnamed: 0.1,Unnamed: 0,Rank,Word,Word in English,Happiness Score,Standard Deviation of Ratings,translate_text
0,10168,10168,suicide,suicide,1.3,0.84,התאבדות
1,10169,10169,terrorist,terrorist,1.3,0.91,מחבל
2,10171,10171,coronavirus,coronavirus,1.34,0.66,קורונה
3,10167,10167,rape,rape,1.44,0.79,אונס
4,10165,10165,murder,murder,1.48,1.01,רצח


In [12]:
file.tail()

Unnamed: 0.1,Unnamed: 0,Rank,Word,Word in English,Happiness Score,Standard Deviation of Ratings,translate_text
10182,4,4,laughed,laughed,8.26,1.16,צחק
10183,3,3,happy,happy,8.3,0.99,שמח
10184,2,2,love,love,8.42,1.11,אהבה
10185,1,1,happiness,happiness,8.44,0.97,אושר
10186,0,0,laughter,laughter,8.5,0.93,צחוק


Second, we will load the standard deviations of the scores.
Using the same file, we again create a dictionary where each entry is in the form of {key: value}, where `key` stands for a word from the list and `value` is the standard deviation of its ratings.

In [13]:
sds = file[["translate_text", "Standard Deviation of Ratings"]]
sds.columns = ["word", "deviation"]
sds = dict(zip(sds.word, sds.deviation))

### Example on tweets of Israeli candidates

Now, let's run an example on the tweets we downloaded in our [previous post](https://www.linkedin.com/posts/boris-sobol_a-small-analysis-i-performed-on-twitter-to-activity-6780033712374579200-iixv).

First, we will load our data and take a look at it.

In [14]:
# Load the collected tweets
df = pd.read_csv('candidates_06_04_2021.csv')
# Load the number of tweets per candidate
lens = pd.read_csv('candidates_final_counts_06_04_2021.csv')

In [15]:
df['tag'] = ''
tag_col = df.columns.get_loc('tag')
previous = 0
for _, row in lens.iterrows():
    n_tweets = int(row['total_tweets'])
    # Label the next block of rows positionally with the candidate's name;
    # .iloc avoids the chained assignment that triggers SettingWithCopyWarning
    df.iloc[previous:previous + n_tweets, tag_col] = row['name']
    previous += n_tweets
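
An alternative way to label consecutive blocks of rows is `np.repeat`, sketched below on toy data. The `tweets` and `counts` frames are hypothetical stand-ins for `df` and `lens`, and the approach assumes the counts add up exactly to the number of rows.

```python
import numpy as np
import pandas as pd

tweets = pd.DataFrame({"text": ["a", "b", "c", "d", "e"]})
counts = pd.DataFrame({"name": ["X", "Y"], "total_tweets": [2, 3]})

# Repeat each name as many times as that candidate has tweets
tags = np.repeat(counts["name"].to_numpy(), counts["total_tweets"].to_numpy())
# Guard the assumption that the counts cover every row exactly once
assert len(tags) == len(tweets)
tweets["tag"] = tags
print(tweets["tag"].tolist())  # ['X', 'X', 'Y', 'Y', 'Y']
```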


In [16]:
df.shape

(40038, 36)

In [17]:
# Print the first 5 items
df['text'].head()

0    @bardugojacob אני מאזין לך ולדליה איציק , חרמו...
1    @WalaGoanmi @AviEdelson1 @tommybor כי יאיר לפי...
2    יודע משהו שאנחנו לא? על אף שלא קיבל את מלאכת ה...
3    @Orisimo1984 @aviadglickman אולי הם חשבו שעל י...
4    @ronitlev12 נתניהו לא ראש הממשלה שלי!!!\n\nזה ...
Name: text, dtype: object

In [18]:
# print a single user from the data
print(df.iloc[0]['user'])

{'id': 1280990552201232387, 'id_str': '1280990552201232387', 'name': 'Yorik Cohen Zedek', 'screen_name': 'CohenYorik', 'location': '', 'description': '', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 11, 'friends_count': 223, 'listed_count': 0, 'created_at': 'Wed Jul 08 22:22:17 +0000 2020', 'favourites_count': 1487, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 686, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': None, 'profile_background_image_url_https': None, 'profile_background_tile': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1314635318818156544/UMS555vJ_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1314635318818156544/UMS555vJ_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/12809

In [19]:
# Make a dict from the text string
df['user'] = df['user'].map(ast.literal_eval)
# Extract the username of each user from the user column
df['name'] = df['user'].apply(lambda user: user['name'])
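
The `user` column is stored as the text representation of a Python dict, which is why `ast.literal_eval` is needed before its fields can be accessed. A minimal sketch on a shortened, made-up user string:

```python
import ast

# Shortened, invented example of how a user is stored in the CSV (as text)
raw_user = "{'name': 'Example User', 'followers_count': 11, 'url': None, 'verified': False}"

# literal_eval safely parses Python literals (dicts, lists, numbers, None, ...)
# without executing arbitrary code, unlike eval
user = ast.literal_eval(raw_user)
print(user["name"])             # Example User
print(user["followers_count"])  # 11
```

`json.loads` would not work on such a string, because it uses single quotes and Python's `None` and `False` rather than JSON's `null` and `false`.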

In our dataset (see the output below):
- A total of 40038 tweets.
- A total of 8342 unique users.


In [20]:
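# Note: `created_at` is still stored as strings here, so min()/max() compare
# the values alphabetically rather than chronologically
print(f"The oldest tweet is from: {df['created_at'].min()}.")
print(f"The most recent tweet is from: {df['created_at'].max()}.")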
print(f"We have a total of {len(df)} tweets.")
print(f"We have a total of {len(df['name'].unique())} unique users.")

The oldest tweet is from: Fri Apr 02 00:00:32 +0000 2021.
The most recent tweet is from: Wed Mar 31 23:57:38 +0000 2021.
We have a total of 40038 tweets.
We have a total of 8342 unique users.
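
The "oldest" date printed above is later than the "most recent" one because the timestamps are compared as strings. A minimal sketch of parsing them first, using the two values from the output; the explicit `format` string is our assumption about the Twitter timestamp layout:

```python
import pandas as pd

raw = pd.Series([
    "Fri Apr 02 00:00:32 +0000 2021",
    "Wed Mar 31 23:57:38 +0000 2021",
])

# String comparison: "Fri..." sorts before "Wed...", regardless of the date
print(raw.min())  # Fri Apr 02 00:00:32 +0000 2021

# Parse into timezone-aware timestamps, then compare chronologically
parsed = pd.to_datetime(raw, format="%a %b %d %H:%M:%S %z %Y")
print(parsed.min())  # 2021-03-31 23:57:38+00:00
print(parsed.max())  # 2021-04-02 00:00:32+00:00
```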


We then clean our tweets and calculate the sentiment and standard deviation of each tweet. As mentioned earlier, the sentiment of a tweet is measured by adding up the happiness scores of its words that appear in the Hedonometer list and dividing the total by the number of those matched words.

In [21]:
# Clean the tweets and calculate their sentiment
df = clean_and_sentiment(df)
print(df['clean_text'].head())

0     אני מאזין לך ולדליה איציק  חרמות חרמות בושה ו...
1       כי יאיר לפיד זה ממש שמאל אז זהו שלא זה הכי ...
2    יודע משהו שאנחנו לא על אף שלא קיבל את מלאכת הה...
3      אולי הם חשבו שעל ידי זה הם יקבלואצל יאיר לפי...
4     נתניהו לא ראש הממשלה שליזה אומר שאני יכול להפ...
Name: clean_text, dtype: object


### Define functions to plot results:

We define a function to aggregate our results before plotting them.

#### get_mean_based_on_frequency
This function will calculate the mean sentiment of a DataFrame at a given frequency.

This function will receive:
- `df` A DataFrame of the tweets
- `freq` The resampling frequency (e.g. `'D'` for daily or `'3H'` for every three hours). For more information, see [time-date-components](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components)

It returns the mean `hedonometer` score of the tweets in each time bin of that frequency (e.g. day, week, month, year), indexed by `created_at`.

In [22]:
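def get_mean_based_on_frequency(df, freq='D'):
    # Parse the timestamps on a copy so the caller's DataFrame is not modified
    df = df.assign(created_at=pd.to_datetime(df['created_at']))
    df = df.set_index("created_at")
    df = df[['hedonometer']]
    # Average the scores within each time bin of the given frequency
    return df.resample(freq).mean()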
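
To illustrate what resampling does, here is a small sketch on three invented tweets. The `toy` DataFrame is hypothetical; we use the lowercase `'3h'` alias because recent pandas versions deprecate the uppercase `'H'`.

```python
import pandas as pd

toy = pd.DataFrame({
    "created_at": ["2021-04-01 00:00", "2021-04-01 01:00", "2021-04-01 04:00"],
    "hedonometer": [4.0, 6.0, 8.0],
})
toy["created_at"] = pd.to_datetime(toy["created_at"])

# Group the tweets into 3-hour bins and average the scores in each bin
binned = toy.set_index("created_at")[["hedonometer"]].resample("3h").mean()
print(binned["hedonometer"].tolist())  # [5.0, 8.0]
```

The first two tweets fall into the 00:00 to 03:00 bin and are averaged to 5.0, while the third tweet is alone in the 03:00 to 06:00 bin.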

### Produce and plot final results

Now, we can finally plot our results.
We will plot the mean sentiment over all tweets as well as the mean sentiment for each candidate.

In order to achieve that, we will need to do some work.
We will only take the relevant information:
- `tag`
- `created_at`
- `hedonometer` score
- `clean_text`

In [23]:
# Keep only the candidate tag, the time the tweet was created, the hedonometer score and the clean text
sent_df = df[["tag", "created_at", "hedonometer", 'clean_text']]

# Split into one DataFrame per candidate, ordered by number of tweets (largest first)
split_df = sorted([pd.DataFrame(y) for x, y in sent_df.groupby('tag', as_index=False)], key=len,
                  reverse=True)
print(f"There are {len(split_df)} different candidates\n"
      f"The candidate with the most tweets is: {split_df[0].iloc[0]['tag']} with {len(split_df[0])} tweets\n"
      f"The candidate with the fewest tweets is: {split_df[-1].iloc[0]['tag']} with {len(split_df[-1])} tweets")

There are 14 different candidates
The candidate with the most tweets is: מנסור עבאס with 7850 tweets
The candidate with the fewest tweets is: ירון זליכה with 83 tweets


In [24]:
# Get the names of the candidates
user_names = [sdf.iloc[0]['tag'] for sdf in split_df]

In [25]:
# Get the 3-hourly mean of each candidate
split_df = [get_mean_based_on_frequency(sf, freq='3H') for sf in split_df]
# Combine the candidates, each in its own column
split_sentiment = pd.concat(split_df, axis=1)
# Rename the columns
split_sentiment.columns = user_names
# Get the overall 3-hourly mean of all tweets
user_sentiment = get_mean_based_on_frequency(sent_df, freq='3H')
user_sentiment.columns = ['Sentiment']

  


In [30]:
fig = px.line(user_sentiment, title='Mean sentiment of 3 hourly tweets of all candidates',
        labels={
            'created_at': 'Date',
            'value': 'Mean 3 hourly sentiment (scale 1-9)'
        })
fig.write_html('_'.join('Mean sentiment of 3 hourly tweets of all candidates.html'.split(' ')))
fig.show()

In [31]:
fig = px.line(split_sentiment, title="Mean sentiment of 3 hourly tweets of the candidates from Israel",
        labels={
            'created_at': 'Date',
            'value': 'Mean 3 hourly sentiment (scale 1-9)'
        },
        color_discrete_sequence=px.colors.qualitative.Bold)
fig.write_html('_'.join('Mean sentiment of 3 hourly tweets of the candidates from Israel.html'.split(' ')))
fig.show()

In [28]:
# Save tweets with sentiments
df.to_csv('candidates_06_04_2021.csv', index=False)