# **Introduction to text analysis in Python. Day 4 Part 1**

## *Dr Kirils Makarovs*

## *k.makarovs@exeter.ac.uk*

## *University of Exeter Q-Step Centre*

---


# **Welcome to Day 4 Part 1!**

## **Today, we are going to look at:**

+ *Bag-of-Words* model and `CountVectorizer`
+ Lexicon-based sentiment analysis

---



## **Preparatory steps first**

In [None]:
# Importing some of the required libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt


In [None]:
# This will ensure that all rows of the dataframe will be shown 

pd.set_option('display.max_rows', None)


In [None]:
# Uploading the dataset containing TED talks into the current Google Colab session

from google.colab import files

uploaded = files.upload()


In [None]:
# Getting the dataset

df = pd.read_csv('ted.csv')


In [None]:
# Let's take a single TED talk transcript and preprocess it!

single_talk = df['transcript'][0]

single_talk


In [None]:
# We will use the spaCy library for this and other tasks

import spacy

nlp = spacy.load('en_core_web_sm') # load the small English model (the 'en' shortcut was removed in spaCy 3)


In [None]:
# Defining a preprocessing function

def preprocess(string):

  # making text lowercase
  string_low = string.lower()

  # processing lowercase text through spacy's English module
  doc = nlp(string_low)

  # obtaining lemmas of the tokens that are not stop words and consist of letters only
  # (this drops punctuation, numbers, and whitespace tokens)
  lemmas = [e.lemma_ for e in doc if not e.is_stop and e.text.isalpha()]

  # glue lemmas back into a string
  lemmas_to_string = ' '.join(lemmas)

  # returning the string of lemmas
  return lemmas_to_string


In [None]:
# Preprocessing the single_talk object

single_talk_prep = preprocess(single_talk)

single_talk_prep

type(single_talk_prep) # str


In [None]:
# Converting this string object into a list,
# as CountVectorizer requires a list or other iterable (e.g. pandas Series) as an input

single_talk_prep_list = [single_talk_prep]

type(single_talk_prep_list) # list


# **1. *Bag-of-Words* model and `CountVectorizer`**

`CountVectorizer` is a scikit-learn tool that transforms text into a **Bag-of-Words** model

A **Bag-of-Words** model is essentially a matrix that looks like this:

<figure>
<img src="https://miro.medium.com/max/880/1*hLvya7MXjsSc3NS2SoLMEg.png" width="600">
</figure>

[Image source](https://medium.com/swlh/spam-filtering-using-bag-of-words-aac778e1ee0b)

In this matrix, each row is a **document**, and each column is a **token**

The values in the matrix cells show the frequency of each token in a document

A **Bag-of-Words** matrix can be used:

+ as an input to machine learning models
+ as a handy tool to count the occurrences of tokens per document

*Remember: the better text is preprocessed, the more accurate the model is going to be!*
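
To make the matrix concrete, here is a minimal hand-rolled sketch of what a Bag-of-Words table contains, using only the Python standard library. The two toy documents and all variable names are made up for illustration; `CountVectorizer` performs the same kind of counting, but with its own tokenisation rules.

```python
from collections import Counter

# Two toy "documents" (hypothetical, already lowercased and cleaned)
docs = ['cat sit mat', 'dog chase cat cat']

# Count the tokens of each document separately
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is the sorted set of all tokens seen in any document
vocab = sorted(set(token for c in counts for token in c))

# One row per document, one column per token; absent tokens count as 0
bow_rows = [[c[token] for token in vocab] for c in counts]

print(vocab)
for row in bow_rows:
    print(row)
```

Each inner list is one row of the matrix, and its positions line up with the tokens in `vocab`.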




In [None]:
# Import the CountVectorizer from the sklearn library

from sklearn.feature_extraction.text import CountVectorizer


In [None]:
# Instantiate the vectorizer

# Check out this page for all the parameters that you can modify within CountVectorizer():
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

count_vector = CountVectorizer()

# Fit vectorizer onto the preprocessed TED talk and transform it into a Bag-of-Words

bow_model = count_vector.fit_transform(single_talk_prep_list)


In [None]:
# CountVectorizer returns an object called a sparse matrix

# A sparse matrix is essentially a matrix that contains a lot of zeros,
# and keeping it in a separate type of object allows Python to handle it more efficiently

type(bow_model) # scipy.sparse.csr.csr_matrix
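
The idea of a sparse matrix can be sketched on a tiny example. The values below are made up and are not taken from the TED data; the sketch assumes `scipy` and `numpy` are available, as they are when scikit-learn is installed.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small, mostly-zero matrix (toy values for illustration only)
dense = np.array([[0, 0, 3, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 2]])

sparse = csr_matrix(dense)

# Only the non-zero cells are stored explicitly
print(sparse.nnz)    # number of stored non-zero values
print(sparse.shape)  # the logical shape is unchanged

# .toarray() converts back to an ordinary dense NumPy array
restored = sparse.toarray()
print(restored)
```

With thousands of lemmas and most counts equal to zero, storing only the non-zero cells saves a lot of memory.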


In [None]:
bow_model.shape # 1 row, 316 columns

# This means that 1 document has been broken down into 316 unique elements (lemmas)


In [None]:
# Show all lemmas in the vocabulary

count_vector.vocabulary_ # this is a dictionary object

# let's keep lemmas as a separate array

lemmas = count_vector.get_feature_names_out()

len(lemmas) # 316 lemmas


In [None]:
# You can convert the sparse Bag-of-Words matrix into a regular (dense) NumPy array by using the .toarray() method

frequencies = bow_model.toarray()

frequencies


In [None]:
# Recall that in the Bag-of-Words matrix, each row is a document, and each column is a lemma

# Let's recreate this matrix!

bow_df = pd.DataFrame(frequencies, # the values in the dataframe are taken from the frequencies object
                      columns = lemmas, # column names are taken from the lemmas object
                      index = ['First TED talk']) # you can additionally give a name to an index if you wish

bow_df


In [None]:
# Finally, let's see what are the most common words in this document!

bow_df.transpose().sort_values('First TED talk', ascending = False).head(15)



# The most frequent words are 'illusion', 'go', 'thing', and 'sort'!


### **Now running the same thing on a subset of the first 50 TED talks!**

(it might take quite a while if you run this on the entire dataframe of 500 talks)

In [None]:
# Additionally preprocessing all talks to save as a separate object

ted_clean = df['transcript'].apply(preprocess)

ted_clean


In [None]:
# Saving it as a separate .csv file

ted_clean.to_csv('ted_clean.csv', index = False)


In [None]:
# Preprocessing first 50 talks

ted_clean = df['transcript'][0:50].apply(preprocess)

ted_clean


In [None]:
# Instantiate the vectorizer

count_vector = CountVectorizer()

# Fit vectorizer onto the preprocessed TED talks and transform them into a Bag-of-Words
# Note that CountVectorizer() accepts pandas Series with strings, so there is no need to transform it into a list

bow_model = count_vector.fit_transform(ted_clean)




In [None]:
# CountVectorizer returns an object called a sparse matrix

# A sparse matrix is essentially a matrix that contains a lot of zeros,
# and keeping it in a separate type of object allows Python to handle it more efficiently

type(bow_model) # scipy.sparse.csr.csr_matrix


In [None]:
bow_model.shape # 50 rows, 6446 columns

# This means that 50 documents have been broken down into 6446 unique elements (lemmas)


In [None]:
# Show all lemmas in the vocabulary

count_vector.vocabulary_ # this is a dictionary object

# let's keep lemmas as a separate array

lemmas = count_vector.get_feature_names_out()

len(lemmas) # 6446 lemmas


In [None]:
# You can convert the sparse Bag-of-Words matrix into a 2-D NumPy array (one row per document) by using the .toarray() method

frequencies = bow_model.toarray()

frequencies


In [None]:
# Recall that in the Bag-of-Words matrix, each row is a document, and each column is a lemma

# Let's recreate this matrix!

bow_df = pd.DataFrame(frequencies, # the values in the dataframe are taken from the frequencies object
                      columns = lemmas) # column names are taken from the lemmas object

bow_df.head(10)

# Each row is a document, and each column is a unique lemma!


In [None]:
# Adding row names for clarity

row_names = []

for e in np.arange(1, 51):
  
  row = 'Talk #' + str(e) # Talk #1, Talk #2, Talk #3, ... , Talk #50

  row_names.append(row)

bow_df.index = row_names

bow_df.head(10)


In [None]:
# Finally, let's see what are the most common words across all these documents!

# Since there is more than one row (document) in this matrix,
# we need to manually calculate the frequency of each lemma across all documents

freq_total = bow_df.sum(axis = 0) # sum down the rows: one total per column, i.e. per lemma (axis = 1 would give one total per document)

len(freq_total) # 6446, so we got a frequency value for each lemma, this is exactly what we wanted!
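
The difference between the two `axis` values is easy to mix up, so here is a minimal sketch on a toy table. The numbers and the `toy_df` name are made up for illustration.

```python
import pandas as pd

# Toy Bag-of-Words table: 2 documents (rows) x 3 lemmas (columns)
toy_df = pd.DataFrame([[1, 0, 2],
                       [3, 1, 0]],
                      columns=['go', 'illusion', 'thing'],
                      index=['Talk #1', 'Talk #2'])

# axis = 0 collapses the rows: one total per lemma (column)
per_lemma = toy_df.sum(axis=0)

# axis = 1 collapses the columns: one total per document (row)
per_talk = toy_df.sum(axis=1)

print(per_lemma)
print(per_talk)
```

`per_lemma` has one entry per column, which is the shape needed for ranking lemmas by overall frequency.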


In [None]:
# Finally, we can order this array
# and see what are the 50 most common words (lemmas) across all these TED talks!

freq_total.sort_values(ascending = False).head(50)


In [None]:
# Visualizing the distribution of lemmas? Yes!

plt.figure(figsize = (14, 9)) # set figure size

# To understand what is going on here, break down this code bit by bit,
# run it, and see what you get as an output of each step
freq_total.sort_values(ascending = False).head(25).iloc[::-1].plot(kind = 'barh',
                                                                   color = 'green')

plt.title('The frequency of 25 most popular lemmas\n in 50 TED talks', fontsize = 25)
plt.xlabel('Frequency', fontsize = 20)
plt.ylabel('Lemma', fontsize = 20) 

plt.xticks(ticks = np.arange(0, 400, 25), fontsize = 15) # tweak x axis ticks
plt.yticks(fontsize = 15) # tweak y axis ticks

plt.show()
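
One way to break the chained call down is to run it step by step on a toy Series. The totals below are made up and are not the real TED frequencies; the `step` names are only for illustration.

```python
import pandas as pd

# Toy lemma totals (made-up numbers)
toy_totals = pd.Series({'go': 40, 'thing': 25, 'illusion': 60, 'sort': 10})

step1 = toy_totals.sort_values(ascending=False)  # largest total first
step2 = step1.head(3)                            # keep the top 3 lemmas
step3 = step2.iloc[::-1]                         # reverse their order

# A horizontal bar chart draws the first entry at the bottom, so reversing
# puts the most frequent lemma at the top of the chart
print(step1)
print(step2)
print(step3)
```

Calling `step3.plot(kind = 'barh')` would then draw the same kind of chart as in the cell above, only with three bars.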


## **Exercise**

Remember that newspaper article that we inspected in the first class? 

It was entitled *Overconfident of spotting fake news? If so, you may be more likely to fall victim*.

Please: 
+ *preprocess it*
+ *obtain the Bag-of-Words model*
+ *see what are the most common words (lemmas) that are used in it*

You can find the original article [here](https://www.theguardian.com/media/2021/may/31/confident-spotting-fake-news-if-so-more-likely-fall-victim)







In [None]:
# Newspaper article (no preprocessing or text cleaning has been done)

article = ('Are you a purveyor of fake news? People who are most confident about their ability to discern between fact and fiction are also the most likely to fall victim to misinformation, a US study suggests. Although Americans believe the confusion caused by false news is all-pervasive, relatively few indicate having seen or shared it, something the researchers suggested shows that many may not only have a hard time identifying false news but are not aware of their own deficiencies at doing so. Nine out of 10 participants surveyed indicated they were above average in their ability to discern false and legitimate news headlines. About a fifth of respondents rated themselves 50 or more percentiles higher than their score warranted, the analysis of a nationally representative study of data collected during and after the 2018 US midterm elections found. In the survey, 8,285 Americans were asked to evaluate the accuracy of a series of Facebook headlines, and then rate their own abilities in discerning false news content relative to others. When researchers looked at data measuring respondents’ online behaviour, those with inflated perceptions of their abilities more frequently visited websites linked to the spread of false or misleading news. The overconfident participants were also less able to distinguish between true and false claims about current events and reported higher willingness to share false content, especially when it aligned with their political predispositions, the authors found. “No matter what domain, people on average are overconfident … but over 70% of people displaying overconfidence is just such a huge number,” said the lead author, Ben Lyons, an assistant professor of communication at the University of Utah. '
           'Although the study does not prove that overconfidence directly causes engagement with false news, the mismatch between a person’s perceived ability to spot misinformation and their actual competence could play a crucial role in the spread of false information, the authors wrote in the study published in the Proceedings of the National Academy of Sciences of the United States of America. It also suggests that those who are humble – people who tend to engage in self-monitoring, reflective behaviours and put more thought into the sites they visit and content they share – are likely to be less susceptible to misinformation, said Lyons. Factors such as gender also played a key role in the likelihood of overconfidence and, in turn, vulnerability to false news, suggested Lyons. “Male respondents [in the study] displayed more overconfidence – and this is a consistent finding in overconfidence literature – men are always more confident than women, which is always not so surprising.” He added: “Overconfidence is truly universal. I would be shocked if we didn’t find this in every country we looked at … although we might not see this extreme level of overconfidence, just based on cultural differences.”')

article


# **That's the end of Day 4 Part 1!**