# **Introduction to text analysis in Python. Day 1**

## *Dr Kirils Makarovs*

## *k.makarovs@exeter.ac.uk*

## *University of Exeter Q-Step Centre*

---


# **Welcome to Day 1!**

## **Today, we are going to look at:**

+ Python workflow via Google Colab and Jupyter Notebooks
+ Why learn text analysis
+ Basic string manipulation
+ How to create word clouds

---



# **1. Python workflow via Google Colab and Jupyter Notebooks**

<figure>
<left>
<img src=https://miro.medium.com/max/502/1*sXs3TvhjvXcVCTldKnwMpA.png  width="400">
</figure>

## **What are Google Colab and Jupyter Notebooks?**

The *Jupyter Notebook* is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

Basically, running Python in Jupyter Notebook allows you to combine *text*, *code*, and *code output* in a single notebook that can then be saved as a PDF or HTML document.

We will run our Jupyter Notebooks via *Google Colaboratory (Colab)*, which allows you to write and execute Python code in your browser.

Let's watch a short video to see how it works:


In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo('inN8seMm7UI', width = 800, height = 450)


You can find more information about Jupyter Notebooks [here](https://jupyter.org/), [here](https://www.dataquest.io/blog/jupyter-notebook-tutorial/), and [here](https://www.youtube.com/watch?v=2eCHD6f_phE).

Also, take a look at [this](https://colab.research.google.com/?utm_source=scs-index#) example notebook to get a sense of what you can do with it!

## **How to combine text and code in one workflow?**

By using Text Cells and Code Cells!

*Code cells* are where the code is written and executed.

*Text cells* are used to describe the code and its output, and their appearance can be formatted in a variety of ways.

Before diving into coding, let us briefly look at how one can format text in Jupyter Notebooks.





## **Text cells in Jupyter Notebooks**

In *text cells* you can create:

# h1 Heading
## h2 Heading
### h3 Heading
#### h4 Heading

## Emphasis

**This is bold text**

__This is bold text__

*This is italic text*

_This is italic text_

~~Strikethrough~~

## Lists

Unordered

+ Create a list by starting a line with `+`, `-`, or `*`
+ Sub-lists are made by indenting 2 spaces:
 + Marker character change forces new list start:
    + Ac tristique libero volutpat at
    + Facilisis in pretium nisl aliquet
    + Nulla volutpat aliquam velit
+ Very easy!

Ordered

1. Lorem ipsum dolor sit amet
2. Consectetur adipiscing elit
3. Integer molestie lorem at massa

## Tables

| Option | Description |
| ------ | ----------- |
| data   | path to data files to supply the data that will be passed into templates. |
| engine | engine to be used for processing templates. Handlebars is the default. |
| ext    | extension to be used for dest files. |

Check out [this page](https://markdown-it.github.io/) for more!


## **Helpful small tips**

One way to make your life easier is by using keyboard shortcuts when navigating through the notebook!

Here are the most common ones:

| Command | Windows | Mac
| ------- | -------- | ---
| Run entire cell | ctrl + enter | ctrl + enter
| Run entire cell and move to the next one | shift + enter | shift + enter
| Run single line in a cell | ctrl + shift + enter | ctrl + shift + enter
| Insert code cell above | ctrl + m + a | ctrl + m + a
| Insert code cell below | ctrl + m + b | ctrl + m + b
| Switch from code to text cell | ctrl + m + m | ctrl + m + m
| Switch from text to code cell | ctrl + m + y | ctrl + m + y
| Move cell up | ctrl + m + k | ctrl + m + k
| Move cell down  | ctrl + m + j | ctrl + m + j
| Delete cell  | ctrl + m + d | ctrl + m + d

In addition, let me clarify the distinction between running an entire *cell of code* and a *single line of code*.

It is good practice to structure your code so that each cell contains a chunk of code devoted to one particular task.

However, keep in mind that, unless you use print statements explicitly, a code cell displays only one piece of output: the value of the last expression in the cell.

See example below:


In [None]:
# As you can see, even though I asked to show both x and y objects,
# if I simply run a code cell, only y will be produced

x = 'Hello'

x

y = 'World'

y


In [None]:
# You can overcome this by using print statements, but it's not very handy in notebooks

x = 'Hello'

print(x)

y = 'World'

print(y)


In [None]:
# Ultimately, if you have two tasks with separate pieces of output that you want to be produced, use two different code cells

x = 'Hello'

x


In [None]:
y = 'World'

y


In [None]:
# However, please also note that you can highlight a single line in a code cell and run it
# by using ctrl + shift + enter shortcut

4 ** 4 # highlight this line and run it via ctrl + shift + enter

4 ** 6 # then highlight this line and run it in the same way

# You see that in this way you can sequentially get more than one output from a single code cell

# This is not used very often when you write a proper research notebook; however, the notebooks for
# this course are structured with this in mind to save space and make them less voluminous


---

# **2. Why learn text analysis?**

+ Text is one of the most abundant sources of information that human civilization has ever produced
+ It comes in different shapes and sizes:
    + *offline sources*: song lyrics, poems, parliamentary proceedings, novels
    + *online sources*: tweets, YouTube comments, blog posts, product reviews
+ Text reflects sociocultural norms and their development over time
+ Text encodes sociodemographic characteristics of its author, e.g. age, gender, origin
+ Text is an integral part of a variety of research fields ranging from sociology and marketing to law, political science, and digital humanities

**In short: text mining can help you in answering those research questions that cannot be handled by analyzing 'pure' numeric data that come from surveys, experiments, statistical records, etc.!**

Let's look at some examples:

### **Study #1. Leo Tolstoy's War and Peace**

<a href="https://ibb.co/4sBh7t3"><img src="https://i.ibb.co/s5L82Rz/pic1.png" alt="pic1" border="0" width="1200"/></a>

### **Study #2. Machine Translation: Mining Text for Social Theory**

<a href="https://ibb.co/k8QYMTn"><img src="https://i.ibb.co/9TsRG0X/pic2.png" alt="pic2" border="0" width="1200"/></a>


### **Study #3. Text-mining the Signals of Climate Change Doubt**

<a href="https://ibb.co/zVyG714"><img src="https://i.ibb.co/Qv2rCZK/pic3.png" alt="pic3" border="0" width="1200"/></a>

<a href="https://ibb.co/MCjbL6p"><img src="https://i.ibb.co/Vqhd8mD/pic4.png" alt="pic4" border="0" width="1200"/></a>


---

# **3. Basic string manipulation**

<figure>
<left>
<img src=https://www.101computing.net/wp/wp-content/uploads/string-manipulation-1.png  width="450">
</figure>

**[Image source](https://www.101computing.net/string-manipulation-2/)**

In [None]:
# Let's define a string object

first_string = "Hello world"

first_string # calling an object

first_string = 'Hello world' # you can use either single or double quotation marks, as long as the opening and closing marks match


In [None]:
# Some helpful commands

type(first_string) # type of object (str)

len(first_string) # number of characters in a string i.e. letters + punctuation + symbols + whitespaces


In [None]:
# Objects can be converted into strings via str() command, e.g.:

x = 4 ** 4 # 256

type(x) # int - integer

x_str = str(x) # converting integer to string

x_str # '256'

type(x_str) # str - string

x + 5 # you get 261 as integers allow for mathematical operations

# x_str + 5 # error message: can only concatenate str (not "int") to str

x_str + str(5) # '2565' - the example of concatenation
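
The cell above converts numbers with `str()` before concatenating them. As a hedged aside (f-strings are not otherwise covered in this notebook), the sketch below shows an alternative that many people find easier to read: an f-string converts the value inside the braces to text for you.

```python
# A minimal sketch: two ways of combining a number with text
x = 4 ** 4                          # 256, an integer

with_str = 'x equals ' + str(x)     # explicit conversion, then concatenation
with_fstring = f'x equals {x}'      # f-string: {x} is converted to text automatically

print(with_str)
print(with_fstring)
print(f'x + 5 equals {x + 5}')      # expressions inside the braces are evaluated first
```

Both approaches give the same string; which one to use is largely a matter of taste.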


## **Accessing string characters**

You can access the characters of a string in much the same way as you access the elements of a list.

<figure>
<left>
<img src=https://static.javatpoint.com/python/images/lists-indexing-and-splitting.png width="400">
</figure>

**[Image source](https://www.analyticsvidhya.com/blog/2021/06/15-functions-you-should-know-to-master-lists-in-python/)**



In [None]:
# Let's define another string

nlp = 'Natural language processing is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language,\
in particular how to program computers to process and analyze large amounts of natural language data.'

# Note the '\' at the end of the first line of the string: '...and human language,\'
# A backslash at the end of a line tells Python that the string continues on the next line
# The backslash is not part of the string: the two lines are joined directly (no space or line break is added)

nlp

len(nlp) # 274 characters in total
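
The cell above splits a long string across two lines of code. The sketch below, which uses a shortened piece of text purely for illustration, compares that approach with two related ones: adjacent string literals inside parentheses, and an actual line break stored in the string via `\n`.

```python
# Backslash at the end of the line: the string simply continues on the next line
with_backslash = 'computers and human language, \
in particular'

# Adjacent string literals inside parentheses are glued together by Python
with_parentheses = ('computers and human language, '
                    'in particular')

# \n is different: it stores a real line break inside the string
with_newline = 'computers and human language,\nin particular'

print(with_backslash == with_parentheses)   # both spellings produce the same string
print(with_newline.splitlines())            # the \n version contains two lines
```

The parenthesized form is often preferred because indentation on the continuation line does not end up inside the string.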


In [None]:
nlp[0] # accessing the first character (note that the count starts from 0!)

nlp[-1] # accessing the last character

nlp[1:10] # accessing the characters from 2nd to 10th (note that when slicing the last element of a slice is not included!)

nlp[50:] # accessing all characters from the 51st one onwards (the 51st one is included)

nlp[-15:] # accessing last 15 characters

nlp[::-1] # reverse the string
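
A few more slicing patterns are sketched below on a short, made-up example word. It also illustrates one difference worth knowing: a slice that runs past the end of the string is silently cut short, whereas a single index past the end raises an error.

```python
word = 'processing'

every_second = word[::2]    # every second character, starting from the first
first_three = word[:3]      # first three characters
last_three = word[-3:]      # last three characters
past_the_end = word[5:100]  # the slice just stops at the end of the string

print(every_second, first_three, last_three, past_the_end)

try:
    word[100]               # a single index past the end is an error, unlike a slice
except IndexError as error:
    print('IndexError:', error)
```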


## **Python methods to work with strings**

Here is a list of the most commonly used string methods in Python:

| Method | Explanation
| ------- | --------
| `.lower()` | make lowercase
| `.upper()` | make uppercase
| `.capitalize()` | capitalize (first letter uppercase)
| `.split()` | split a string into a list (each word becomes an element)
| `.splitlines()` | split a string into a list (each line becomes an element)
| `.strip()` | remove whitespace at the beginning and at the end of the string
| `.lstrip()` | remove whitespace on the left of the string
| `.rstrip()` | remove whitespace on the right of the string
| `.index()` | return the index of the first occurrence of a substring (error if not found)
| `.count()` | return the number of elements with the specified value
| `.find()` | find the first occurrence of the specified value
| `.replace()` | replace a specified phrase with another specified phrase
| `.join()` | join elements into a string


Let's look at some examples of using these methods!

In [None]:
string = 'The University of Exeter is a public research university in Exeter, Devon, South West England, United Kingdom.'

string.lower() # the university of exeter is a public research university in exeter, devon, south west england, united kingdom.

string.upper() # THE UNIVERSITY OF EXETER IS A PUBLIC RESEARCH UNIVERSITY IN EXETER, DEVON, SOUTH WEST ENGLAND, UNITED KINGDOM.

string.capitalize() # The university of exeter is a public research university in exeter, devon, south west england, united kingdom.


In [None]:
string.split() # you can define a separator by passing the respective argument e.g. sep = "/". The default separator is whitespace

string.splitlines() # the string contains a single line so the entire string becomes the only element of the list

###############

# Here is an example where splitting by line makes more sense
# Notice \n - it creates a line break in a string

string_lines = 'First line\nSecond line\nThird line'

print(string_lines) # The effect of \n is only visible when a string is printed out via print() command

string_lines.splitlines() # Split by \n
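
To make the separator argument more concrete, here is a small sketch with a made-up `record` string. It shows `sep` and `maxsplit`, and the difference between splitting on whitespace in general (no argument) and splitting on a single space.

```python
record = 'Exeter;Devon;South West England;United Kingdom'

all_parts = record.split(';')                    # split on an explicit separator
first_split = record.split(sep=';', maxsplit=1)  # stop after the first split

# With no argument, any run of whitespace (spaces, tabs, line breaks) is one separator
by_whitespace = 'a   b \n c'.split()

# With an explicit ' ' separator, every single space is a split point,
# so repeated spaces produce empty strings
by_single_space = 'a   b'.split(' ')

print(all_parts)
print(first_split)
print(by_whitespace)
print(by_single_space)
```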


In [None]:
string_spaces = '         The University of Exeter is a public research university in Exeter, Devon, South West England, United Kingdom.       '

string_spaces.strip() # you can specify the characters that are to be removed (see next code cell)

# By default it removes spaces at the beginning and at the end of the string


In [None]:
string_symbols = '££££$$%^^^^_The University of Exeter is a public research university in Exeter, Devon, South West England, United Kingdom.rrrrkkkkk'

string_symbols.lstrip('£$%^_') # remove unwanted characters to the left of the string

string_symbols.rstrip('rk') # remove unwanted characters to the right of the string

# You can combine .lstrip() and .rstrip() (and other methods) into one flow

string_symbols.rstrip('rk').lstrip('£$%^_')
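
One detail of the cell above is easy to miss: the argument of `.strip()`, `.lstrip()` and `.rstrip()` is treated as a *set of characters*, not as a word. The sketch below uses made-up strings to illustrate this, and contrasts it with `.removesuffix()`, which removes one exact substring (this method assumes Python 3.9 or newer).

```python
title = 'rrkThe Exeter campuskkr'

# Every leading or trailing 'r' or 'k' is removed, in any order,
# until a character outside the set is reached
stripped = title.strip('rk')

# Characters in the middle of the string are never touched
dashes = '--a-b--'.strip('-')

# rstrip('.txt') removes trailing '.', 't' and 'x' characters, so it also eats the final 't' of 'draft'
with_rstrip = 'draft.txt'.rstrip('.txt')

# removesuffix() removes the exact ending '.txt' once, and nothing else
with_removesuffix = 'draft.txt'.removesuffix('.txt')

print(stripped)
print(dashes)
print(with_rstrip, with_removesuffix)
```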


In [None]:
# Where is the first occurrence of the word 'Exeter'?

string # make sure our main string has not been modified

string.index('Exeter') # 18, i.e. the word starts at the 19th character

string[18] # 19th character is 'E'

string[18:24] # 'Exeter'

# What about the word 'United'?

string.index('United') # 95, i.e. the word starts at the 96th character

string[95:101] # 'United'


In [None]:
# How many times does the word 'Exeter' occur in the string?

string.count('Exeter') # 2

# How many letters 'r' are there in the string?

string.count('r') # 6

# How many whitespaces?

string.count(' ') # 16


In [None]:
# The .find() method is almost the same as the .index() method,
# the only difference is that the .index() method returns an error if the value is not found, whereas .find() method returns -1

string.find('United') # 95, i.e. the word starts at the 96th character

# string.index('q') # returns error if not found

string.find('q') # returns -1 if not found
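
In practice, the difference between the two methods shows up in how you handle the "not found" case. The sketch below uses a shortened example sentence and a search term chosen for illustration: check for `-1` after `.find()`, or wrap `.index()` in `try`/`except`.

```python
sentence = 'The University of Exeter is a public research university.'

position = sentence.find('Exeter')
if position != -1:                       # -1 is how .find() signals "not found"
    print('Found at index', position)

try:
    sentence.index('Plymouth')           # .index() raises an error instead of returning -1
except ValueError:
    print('Plymouth does not occur in the sentence')

# If you only need a yes/no answer, the `in` operator is the simplest check
contains_exeter = 'Exeter' in sentence
print(contains_exeter)
```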


In [None]:
# Replace University with UnIvErSiTy

string.replace('University', 'UnIvErSiTy') # 'old value', 'new value'

# Replace all whitespaces with _

string.replace(' ', '_')
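
Two further points about `.replace()` are sketched below on a short made-up string: an optional third argument limits how many occurrences are replaced, and, like all string methods, `.replace()` returns a new string and leaves the original untouched unless you assign the result back.

```python
text = 'Exeter, Devon, Exeter'

replaced_all = text.replace('Exeter', 'EXETER')        # every occurrence
replaced_first = text.replace('Exeter', 'EXETER', 1)   # only the first occurrence

print(replaced_all)
print(replaced_first)
print(text)                        # the original string is unchanged

text = text.replace(', ', ' / ')   # assign the result back to keep the change
print(text)
```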


In [None]:
# Say we have a list, in which each element is a word

string_list = string.split()

string_list

# Use .join() to recreate a string out of a list of words

' '.join(string_list) # ' ' denotes that whitespaces will be put in between words

'_'.join(string_list) # you can use other symbols e.g. _
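
`.join()` only works on strings, so a list of numbers has to be converted first. The sketch below (with made-up example values) shows this, along with a common trick: `.split()` followed by `.join()` collapses repeated whitespace into single spaces.

```python
years = [2018, 2021, 2022]

# ', '.join(years) would raise a TypeError because the elements are integers
joined = ', '.join(str(year) for year in years)
print(joined)

# split() with no argument drops all runs of whitespace; join() puts single spaces back
untidy = 'too    many   spaces'
tidy = ' '.join(untidy.split())
print(tidy)
```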


## **Exercise 1**

Alright, time to practice!

*Please use the methods that we discussed above to create the `nlp_clean` string out of the `nlp_messy` one.*

*`nlp_clean`* should look like this:

*Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.*

Note that you can either combine multiple methods in one code flow, e.g. `nlp_messy.rstrip().lstrip().replace()`, or create a new object after each step, e.g. `nlp_up = nlp_messy.rstrip()`.







In [None]:
# Here is a messy string:

nlp_messy = '  £$pppppp   Natural languAge PROCESSING {NLP} is a subfield of linguistics, COMPuter science, and artificial_intelligence_concerned_WITH_THE interactions \
between computers **and** human *****language*****, iN particular how to program Computers to process and analyze LARGE amOUnts of natural LaNgUaGe data.....   ())))()()()()()()#######  '

nlp_messy


In [None]:
# Please create nlp_clean below:

nlp_clean = nlp_messy.lower()\
                     .lstrip(' £$p')\
                     .rstrip(' ()#')\
                     .replace('{nlp}', '(NLP)')\
                     .replace('_', ' ')\
                     .replace('*', '')\
                     .rstrip('.')\
                     .replace('natural language processing', 'Natural Language Processing')\
                     + '.' # concatenate '.' to the end of the string

# In the code above, \ allows for splitting code into multiple lines. It is essentially the same code as this:

# nlp_clean = nlp_messy.lower().lstrip(' £$p').rstrip(' ()#').replace('{nlp}', '(NLP)').replace('_', ' ').replace('*', '').rstrip('.').replace('natural language processing', 'Natural Language Processing') + '.'

nlp_clean


Once you have cleaned up the string, please use `nlp_clean` to answer the following questions:

+ *how many characters in total are there in the string?*
+ *how many whitespaces?*
+ *where is the first occurrence of the word 'computer'?*
+ *how many words are there in the string?*
+ *what are the last 14 characters of the string?*
+ *how many times does the word 'language' appear in the string? (mind the letter case)*

In [None]:
len(nlp_clean) # total number of characters

nlp_clean.count(' ') # total number of whitespaces

nlp_clean.index('computer') # first occurrence of the word 'computer'

len(nlp_clean.split()) # total number of words

nlp_clean[-14:] # last 14 characters of the string

nlp_clean.lower().count('language') # frequency of the word 'language'


---

# **4. How to create a word cloud?**

<figure>
<left>
<img src=https://static.commonlounge.com/fp/600w/FxEgN5woHmXOJOLtm7oGGenV81520493685_kc width="600">
</figure>

**[Image source](https://www.commonlounge.com/discussion/317a12109a634fc1aa44150ea806bbf3)**

A **word cloud** (also known as a **text cloud**) is a handy tool to visualize text and get a quick sense of its contents.

The **word cloud** is created in such a way that the size of the word in the word cloud represents its frequency in a document.
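
To make the idea of word frequency concrete, here is a simplified sketch that counts words using only the string methods from the previous section and `collections.Counter`. The example sentence and the mini stopword list are made up for illustration, and the `wordcloud` library's own text processing is more elaborate than this.

```python
from collections import Counter

text = 'The water cycle moves water around the Earth, and the cycle never stops.'

# Hypothetical mini stopword list, for illustration only
stopwords = {'the', 'and', 'a', 'of'}

# Split into words, remove surrounding punctuation, and lowercase everything
words = [word.strip('.,').lower() for word in text.split()]

# Drop the stopwords
words = [word for word in words if word not in stopwords]

frequencies = Counter(words)
print(frequencies.most_common(2))   # the two most frequent words and their counts
```

In a word cloud, the words with the highest counts would be drawn in the largest font.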

And it's very easy to create one in Python! Let's do this.

We will use [this article](https://www.theguardian.com/media/2021/may/31/confident-spotting-fake-news-if-so-more-likely-fall-victim) from The Guardian as an example.

In [None]:
# We need to import the necessary libraries first

from wordcloud import WordCloud, STOPWORDS

import matplotlib.pyplot as plt # data visualization library


In [None]:
# Newspaper article (no preprocessing or text cleaning has been done)

article = ('Are you a purveyor of fake news? People who are most confident about their ability to discern between fact and fiction are also the most likely to fall victim to misinformation, a US study suggests. Although Americans believe the confusion caused by false news is all-pervasive, relatively few indicate having seen or shared it, something the researchers suggested shows that many may not only have a hard time identifying false news but are not aware of their own deficiencies at doing so. Nine out of 10 participants surveyed indicated they were above average in their ability to discern false and legitimate news headlines. About a fifth of respondents rated themselves 50 or more percentiles higher than their score warranted, the analysis of a nationally representative study of data collected during and after the 2018 US midterm elections found. In the survey, 8,285 Americans were asked to evaluate the accuracy of a series of Facebook headlines, and then rate their own abilities in discerning false news content relative to others. When researchers looked at data measuring respondents’ online behaviour, those with inflated perceptions of their abilities more frequently visited websites linked to the spread of false or misleading news. The overconfident participants were also less able to distinguish between true and false claims about current events and reported higher willingness to share false content, especially when it aligned with their political predispositions, the authors found. “No matter what domain, people on average are overconfident … but over 70% of people displaying overconfidence is just such a huge number,” said the lead author, Ben Lyons, an assistant professor of communication at the University of Utah. '
           'Although the study does not prove that overconfidence directly causes engagement with false news, the mismatch between a person’s perceived ability to spot misinformation and their actual competence could play a crucial role in the spread of false information, the authors wrote in the study published in the Proceedings of the National Academy of Sciences of the United States of America. It also suggests that those who are humble – people who tend to engage in self-monitoring, reflective behaviours and put more thought into the sites they visit and content they share – are likely to be less susceptible to misinformation, said Lyons. Factors such as gender also played a key role in the likelihood of overconfidence and, in turn, vulnerability to false news, suggested Lyons. “Male respondents [in the study] displayed more overconfidence – and this is a consistent finding in overconfidence literature – men are always more confident than women, which is always not so surprising.” He added: “Overconfidence is truly universal. I would be shocked if we didn’t find this in every country we looked at … although we might not see this extreme level of overconfidence, just based on cultural differences.”')

article


In [None]:
# Define a wordcloud 
wordcloud = WordCloud(background_color = 'white',
                      width = 2000, # width of canvas
                      height = 1000, # height of canvas
                      stopwords = STOPWORDS, # the built-in STOPWORDS list is used
                      collocations = False, # whether to include collocations (bigrams) of two words
                      normalize_plurals = True, # e.g. 'day' and 'days' will be counted as one
                      random_state = 1, # seed to get exactly same wordcloud every time you rerun script
                      colormap = 'seismic') # set the colormap

# Generate a wordcloud on a string object
wordcloud.generate(article)

plt.figure(figsize = (15, 10)) # set figure size
plt.axis("off") # turn off axes details

# Display image
plt.imshow(wordcloud)

# Get rid of the object description above the wordcloud
plt.show()


More detailed information about the `WordCloud` class is available [here](https://amueller.github.io/word_cloud/index.html).

In [None]:
# Take a look at the list of WordCloud stopwords i.e. those that are excluded from the analysis
STOPWORDS

# If you want to add some specific words to the list of stopwords, you can use the .update() method:
# STOPWORDS.update(['word1', 'word2'])


## **Exercise 2**

Now you can try to create a wordcloud yourself!

Let's use [this](https://www.theguardian.com/environment/2022/feb/24/climate-change-is-intensifying-earths-water-cycle-at-twice-the-predicted-rate-research-shows) The Guardian article entitled *Climate change is intensifying Earth’s water cycle at twice the predicted rate, research shows*.





In [None]:
# Newspaper article (no preprocessing or text cleaning has been done)

article_2 = ('Rising global temperatures have shifted at least twice the amount of freshwater from warm regions towards the Earth’s poles than previously thought as the water cycle intensifies, according to new analysis. Climate change has intensified the global water cycle by up to 7.4% – compared with previous modelling estimates of 2% to 4%, research published in the journal Nature suggests. The water cycle describes the movement of water on Earth – it evaporates, rises into the atmosphere, cools and condenses into rain or snow and falls again to the surface. “When we learn about the water cycle, traditionally we think of it as some unchanging process which is constantly filling and refilling our dams, our lakes, and our water sources,” the study’s lead author, Dr Taimoor Sohail of the University of New South Wales, said. But scientists have long known that rising global temperatures are intensifying the global water cycle, with dry subtropical regions likely to get drier as freshwater moves towards wet regions. Last August, the Intergovernmental Panel on Climate Change’s sixth assessment report concluded that climate change will cause long-term changes to the water cycle, resulting in stronger and more frequent droughts and extreme rainfall events. Sohail said the volume of extra freshwater that had already been pushed to the poles as a result of an intensifying water cycle was far greater than previous climate models suggest. “Those dire predictions that were laid out in the IPCC will potentially be even more intense,” he said. The scientists estimate the volume of extra freshwater that shifted from warmer regions between 1970 and 2014 is between 46,000 and 77,000 cubic kms. “We’re seeing higher water cycle intensification than we were expecting, and that means we need to move even more quickly towards a path of net zero emissions.” The team used ocean salinity as a proxy for rainfall in their research. '
             '“The ocean is actually more salty in some places and less salty in other places,” Sohail said. “Where rain falls on the ocean, it tends to dilute the water so it becomes less saline … Where there is net evaporation, you end up getting salt left behind.” The researchers had to account for the mixing of water due to ocean currents. “We developed a new method that basically tracks … how the ocean is moving around with reference to this freshening or salinification,” Sohail said. “It’s kind of like a rain gauge that’s in constant motion.” Dr Richard Matear, a chief research scientist in the CSIRO Climate Science Centre, who was not involved in the research, said the study suggested existing climate modelling has underestimated the potential impacts of climate change on the water cycle. “There’s been a dramatic uplift in our ability to monitor the ocean,” he said. “Observational datasets [like those used in the study] are really ripe for revisiting how global warming is changing the climate system, and the implications it might have on important things like the hydrological cycle.”')

article_2


In [None]:
# Create a wordcloud here:

# Define a wordcloud 
wordcloud = WordCloud(background_color = 'white',
                      width = 2000, # width of canvas
                      height = 1000, # height of canvas
                      stopwords = STOPWORDS, # the built-in STOPWORDS list is used
                      collocations = False, # whether to include collocations (bigrams) of two words
                      normalize_plurals = True, # e.g. 'day' and 'days' will be counted as one
                      random_state = 1, # seed to get exactly same wordcloud every time you rerun script
                      colormap = 'seismic') # set the colormap

# Generate a wordcloud on a string object
wordcloud.generate(article_2)

plt.figure(figsize = (15, 10)) # set figure size
plt.axis("off") # turn off axes details

# Display image
plt.imshow(wordcloud)

# Get rid of the object description above the wordcloud
plt.show()


# **That's the end of Day 1!**