# **Introduction to text analysis in Python. Day 4 Part 2**

## *Dr Kirils Makarovs*

## *k.makarovs@exeter.ac.uk*

## *University of Exeter Q-Step Centre*

---


# **Welcome to Day 4 Part 2!**

## **Today, we are going to look at:**

+ *Bag-of-Words* model and `CountVectorizer`
+ Lexicon-based sentiment analysis

---



## **Preparatory steps first**

In [None]:
# Importing some of the required libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# This will ensure that all rows of the dataframe will be shown 

pd.set_option('display.max_rows', None)


In [None]:
# Uploading ted.csv and ted_clean.csv into the current Google Colab session

from google.colab import files

uploaded = files.upload()


In [None]:
# Getting both datasets

ted = pd.read_csv('ted.csv')

ted_clean = pd.read_csv('ted_clean.csv')


# **2. Lexicon-based sentiment analysis**

<figure>
<img src="https://d3caycb064h6u1.cloudfront.net/wp-content/uploads/2021/06/sentimentanalysishotelgeneric-2048x803-1.jpg" width="700">
</figure>

[Image source](https://expressanalytics.com/blog/social-media-sentiment-analysis/)


`TextBlob`'s `.sentiment` property returns two values: *polarity* and *subjectivity*

The **polarity** score is a value within the range **[-1.0, 1.0]**
+ -1 - negative sentiment
+ 0 - neutral sentiment
+ +1 - positive sentiment

The **subjectivity** is a float within the range **[0.0, 1.0]**, where 0.0 is *very objective* and 1.0 is *very subjective*

*Subjective sentences* express the author's feelings, views, beliefs, or opinions

*Objective sentences* communicate facts rather than beliefs

**Lexicon-based** sentiment analysis relies on a pre-defined set of categorized words, against which the lemmas of our document are compared

Each word in the pre-defined dictionary is classified as negative, neutral, or positive, together with the intensity of its sentiment
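To make the lexicon idea concrete, here is a deliberately simplified sketch. The mini-lexicon `TOY_LEXICON` and the function `toy_polarity` are hypothetical and invented for illustration; this is not `TextBlob`'s actual lexicon or scoring algorithm, which also handles things like negation and intensifiers.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1, 1] (values invented for illustration)
TOY_LEXICON = {
    "great": 0.8,
    "good": 0.7,
    "boring": -0.6,
    "terrible": -1.0,
}

def toy_polarity(text):
    """Average the polarity of every word found in the lexicon (0.0 if none match)."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[word] for word in words if word in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

# One positive and one negative word partly cancel each other out
print(toy_polarity("A great talk with a boring ending"))

# No lexicon words at all, so the text is treated as neutral
print(toy_polarity("The talk lasted eighteen minutes"))
```

Words that are missing from the lexicon simply do not contribute, which is why the quality of a lexicon-based score depends heavily on how well the dictionary covers the vocabulary of the corpus.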

`TextBlob` offers two analyzers that can be used:

+ *PatternAnalyzer* - the default analyzer, built on the pattern library

+ *NaiveBayesAnalyzer* - an NLTK classifier trained on a movie reviews corpus

You can read [this](https://www.cs.rit.edu/usr/local/pub/GraduateProjects/2165/fjk9481/Report.pdf) report to get a better understanding of how `TextBlob`'s sentiment analysis works 

In [None]:
from textblob import TextBlob # importing TextBlob


## **Running sentiment analysis on a single TED talk**

In [None]:
# Getting a single TED talk. We will use the raw (unpreprocessed) text for now

single_talk = ted['transcript'][0]

single_talk


In [None]:
# Making it a TextBlob object

single_talk_blob = TextBlob(single_talk)

single_talk_blob

type(single_talk_blob) # textblob.blob.TextBlob


In [None]:
# Getting the polarity and subjectivity of the TED talk

single_talk_blob.sentiment # Sentiment(polarity=0.2574747337056497, subjectivity=0.47191676855799003)

# Polarity separately

single_talk_blob.polarity # 0.2574747337056497

# Subjectivity separately

single_talk_blob.subjectivity # 0.47191676855799003


In [None]:
# Which words contributed to the polarity and subjectivity scores?

single_talk_blob.sentiment_assessments


## **Running sentiment analysis on a corpus of raw and preprocessed TED talks**

In [None]:
# Let's define a function that returns the polarity and subjectivity scores for a text entry

def get_polarity_subjectivity(text):

  # Making text a TextBlob object
  text_blob = TextBlob(text)

  # Getting its polarity, rounded to 2 decimals
  polarity = round(text_blob.polarity, 2)

  # Getting its subjectivity, rounded to 2 decimals
  subjectivity = round(text_blob.subjectivity, 2)

  # Returning both scores as a list
  return [polarity, subjectivity]


In [None]:
# Running the function defined above on the corpus of raw TED talks

ted_scores = ted['transcript'].apply(get_polarity_subjectivity)

ted_scores.head(10)


In [None]:
# Running the same function but on a corpus of preprocessed TED talks

ted_clean_scores = ted_clean['transcript'].apply(get_polarity_subjectivity)

ted_clean_scores.head(10)


In [None]:
# Now we have two pandas Series, where each row contains a list of two values

type(ted_scores) # pandas.core.series.Series
type(ted_clean_scores) # pandas.core.series.Series

ted_scores[0] # [0.26, 0.47]
ted_clean_scores[0] # [0.32, 0.52]


In [None]:
# Now let's convert it all into a dataframe with 4 columns:

# ted_raw_pol - polarity score of unpreprocessed TED talks
# ted_raw_sub - subjectivity score of unpreprocessed TED talks
# ted_clean_pol - polarity score of preprocessed TED talks
# ted_clean_sub - subjectivity score of preprocessed TED talks

scores_raw = pd.DataFrame(ted_scores.tolist(), columns = ['ted_raw_pol', 'ted_raw_sub'])

scores_clean = pd.DataFrame(ted_clean_scores.tolist(), columns = ['ted_clean_pol', 'ted_clean_sub'])

df_scores = pd.concat([scores_raw, scores_clean], axis = 1) # concatenate by columns

df_scores.head(10)
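An alternative to building the dataframe from a Series of lists is to let the scoring function return a `pd.Series`, in which case `.apply()` expands the result into columns directly. In the sketch below, `fake_scores` is a hypothetical stand-in for `get_polarity_subjectivity` that returns made-up numbers, so the example runs without `TextBlob` or the TED data.

```python
import pandas as pd

def fake_scores(text):
    # Hypothetical stand-in for a real scoring function:
    # the two values are made up and are NOT sentiment scores
    n_words = len(text.split())
    return pd.Series({'pol': n_words / 10, 'sub': len(text) / 100})

toy_texts = pd.Series(['one two', 'three four five'])

# Because the function returns a Series, apply() yields a DataFrame with one column per key
toy_scores = toy_texts.apply(fake_scores).add_prefix('ted_raw_')

print(toy_scores)
```

Both approaches produce the same kind of table. The list-based version used above is usually faster on large corpora, while this one saves the separate conversion step.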


## **Are the scores on the preprocessed talks any different from the scores on the raw talks?**

In [None]:
# Obtaining the correlation matrix between the variables

round(df_scores.corr(), 2)


In [None]:
# Getting the mean and standard deviation of the polarity and subjectivity scores

round(df_scores.agg(['mean', 'std']).transpose(), 2)

# The range of polarity scores is from -1 to 1
# The range of subjectivity scores is from 0 to 1


In [None]:
# Drawing several regression plots in one figure

# The style has to be set before the figure is created in order to take effect
sns.set_style('whitegrid')

figure, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2,
                                                ncols = 2,
                                                figsize=(20, 15))

sns.regplot(data = df_scores, x = 'ted_raw_pol', y = 'ted_raw_sub', ax = ax1)
sns.regplot(data = df_scores, x = 'ted_clean_pol', y = 'ted_clean_sub', ax = ax2)

sns.regplot(data = df_scores, x = 'ted_raw_pol', y = 'ted_clean_pol', ax = ax3)
sns.regplot(data = df_scores, x = 'ted_raw_sub', y = 'ted_clean_sub', ax = ax4)


ax1.set_title('The relationship between \nRAW polarity and RAW subjectivity', fontsize = 15)
ax2.set_title('The relationship between \nCLEAN polarity and CLEAN subjectivity', fontsize = 15)
ax3.set_title('The relationship between \nRAW polarity and CLEAN polarity', fontsize = 15)
ax4.set_title('The relationship between \nRAW subjectivity and CLEAN subjectivity', fontsize = 15)

plt.show()


## **Identifying the most positive and subjective TED talks**

In [None]:
# We first need to concatenate the columns with scores to the dataframe with TED talks

ted_upd = pd.concat([ted, df_scores], axis = 1).drop('url', axis = 1)

ted_upd.head(10)


In [None]:
# Which TED talk has the most positive sentiment?

max_score = ted_upd['ted_clean_pol'].max()

max_score_id = ted_upd['ted_clean_pol'].idxmax()

print(f'The most positive TED talk has the polarity value of {max_score}')
print('\nHere it is:\n')

# idxmax() returns an index label, so we select by label and by column name

ted_upd.loc[max_score_id, 'transcript']


In [None]:
# Which TED talk is the most subjective one?

max_score = ted_upd['ted_clean_sub'].max()

max_score_id = ted_upd['ted_clean_sub'].idxmax()

print(f'The most subjective TED talk has the subjectivity value of {max_score}')
print('\nHere it is:\n')

# idxmax() returns an index label, so we select by label and by column name

ted_upd.loc[max_score_id, 'transcript']
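To look at several top-scoring talks rather than just one, `DataFrame.nlargest()` is a convenient option. The sketch below uses a small toy dataframe with made-up scores. Only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy data with made-up scores; the column names mirror ted_upd
toy = pd.DataFrame({
    'transcript': ['talk A', 'talk B', 'talk C', 'talk D'],
    'ted_clean_pol': [0.10, 0.45, 0.30, 0.20],
})

# The three rows with the highest polarity, sorted from highest to lowest
top3 = toy.nlargest(3, 'ted_clean_pol')

print(top3)
```

The same call on `ted_upd`, with `'ted_clean_sub'` as the column, would list the most subjective talks instead.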


# **That's the end of Day 4 Part 2!**