# "Working with Text Data"

**Author:** 'Felipe Millacura'
    
**Date:** '28th February 2021'

### Learning Objectives


* Know how to convert from text data to tokenised text data using `unnest_tokens`.
* Understand how text data can be cleaned.

## Introduction

We've worked with small amounts of text data before; anything that is an `object` variable in our dataset counts as text data. But how do you work with a large chunk of text like a set of articles, or a book?

In this lesson we'll learn how you can turn text into a format that is useful, in particular how to turn text into a data frame. We'll do this using the `tidytext` package.


You will also learn about `TF-IDF scores` and `n-grams`, and how to find the sentiment of words and of texts as a whole. Sentiment is the feeling expressed in the text; this can be as simple as positive/negative, or include emotions like fear, disgust and joy.

Sentiment analysis is not an exact science! The results you get may not make sense in context, so you should always treat any sentiment analysis with some scepticism.



## Un-nesting tokens

### The `unnest_tokens` function

To start we need to install the `tidytext` package. This package helps us transform text into *tidy* data; that is data where each variable is a column and each observation a row.


In [1]:
!pip install tidytext



This will also install the `nltk` package. However, you will need to download additional resources to use tidytext, using the code below.

In [2]:
import nltk

In [3]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fmill\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Now we are ready to go!

In [4]:
import pandas as pd

Here's an example of very simple text data: just a three-element list. Don't worry, we'll be working with more complicated, realistic data soon!


In [35]:
phrases = ["here is some text",
           "again more text",
           "text is text"]

We need this text to be in a `DataFrame` before we can start using `tidytext`. In this case each element of the list becomes a row, and the row index (0, 1, 2) acts as its ID.


In [36]:
example_text = pd.DataFrame({
     'phrase' : phrases
    
})

example_text

Unnamed: 0,phrase
0,here is some text
1,again more text
2,text is text


Now that our data is inside a data frame we can use the `unnest_tokens` function from `tidytext` to transform this data.

The `unnest_tokens` function is probably the most important function in tidytext. It takes data from an `object` variable (aka `string`) and splits it into *tokens*. For now, the tokens will be words, but it's also possible to specify that our tokens are sentences, or characters. We'll see the options for different tokens later.

The `unnest_tokens` function takes three mandatory arguments. The first is the data frame that contains our text data; here that is `example_text`. The second is the name of the new column that will contain our tokens; in our case we've called it `word` because the tokens are words. The third is the name of the column that contains the text data that we are going to tokenise, in our case `phrase`.


In [37]:
from tidytext import unnest_tokens

In [38]:
words_df = unnest_tokens(example_text, 'word', 'phrase')

words_df

Unnamed: 0,word
0,here
0,is
0,some
0,text
1,again
1,more
1,text
2,text
2,is
2,text


You'll notice that the ID (the index of the data frame) has been preserved. You can go and check that all the words that appeared in the first phrase have id 0, all the words that appeared in the second phrase have id 1, etc. This is really useful when we have extra information about the text in our original data that we want to preserve through tokenisation.
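To get a feel for what tokenising into words involves, here is a rough pandas-only sketch that produces the same shape of result for our simple example. It is an approximation for illustration, not how `tidytext` is implemented: it only lower-cases and splits on whitespace, and the name `words_sketch` is our own.

```python
import pandas as pd

example_text = pd.DataFrame({
    "phrase": ["here is some text", "again more text", "text is text"]
})

# Turn each phrase into a list of words, then give every word its own row.
# explode() repeats the original index, so each word keeps the ID of its phrase.
words_sketch = (
    example_text
    .assign(word=example_text["phrase"].str.lower().str.split())
    .explode("word")
    .drop(columns="phrase")
)

print(words_sketch)
```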

Now that we have a tidy data frame, it's easy to manipulate using `pandas`. To start, let's put our words in alphabetical order.


In [39]:
words_df.sort_values('word')

Unnamed: 0,word
1,again
0,here
0,is
2,is
1,more
0,some
0,text
1,text
2,text
2,text


A really common task that you'll want to perform is finding out how often each word appears in each phrase. We can do this with a `groupby` and `size`.


In [40]:
words_df.groupby(['word', words_df.index]).size().reset_index(name='counts')

Unnamed: 0,word,level_1,counts
0,again,1,1
1,here,0,1
2,is,0,1
3,is,2,1
4,more,1,1
5,some,0,1
6,text,0,1
7,text,1,1
8,text,2,2


Or you may want to count the words across all phrases. Again, this is done with `groupby`:

In [41]:
words_df.groupby('word').size().reset_index(name='counts')

Unnamed: 0,word,counts
0,again,1
1,here,1
2,is,2
3,more,1
4,some,1
5,text,4


With only a small amount of code we can see that the most common word in our phrases was 'text' and it appears 4 times.
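If you want to sanity-check counts like these without pandas, the same tallies can be sketched with `collections.Counter` from the standard library. This is only a cross-check on our tiny example (the names `per_phrase` and `overall` are our own), not a replacement for the data frame approach.

```python
from collections import Counter

phrases = ["here is some text", "again more text", "text is text"]

# One Counter per phrase gives the per-phrase word counts
per_phrase = [Counter(phrase.split()) for phrase in phrases]

# Adding the Counters together gives the counts across all phrases
overall = sum(per_phrase, Counter())

print(overall.most_common(1))  # [('text', 4)]
```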


### Capitals and punctuation

Here's an example of another small text dataset. Before you run the code below, have a think about the data frame you will produce.


In [42]:
phrases = ["Here is some text.",
           "Again, more text!",
           "TEXT is text?"]

In [43]:
example_text = pd.DataFrame({
    'phrase' : phrases,
    'id'     : range(1,4)
})

In [44]:
example_text

Unnamed: 0,phrase,id
0,Here is some text.,1
1,"Again, more text!",2
2,TEXT is text?,3


In [45]:
unnest_tokens(example_text, 'word', 'phrase')

Unnamed: 0,id,word
0,1,here
0,1,is
0,1,some
0,1,text
1,2,again
1,2,more
1,2,text
2,3,text
2,3,is
2,3,text


Is this what you were expecting? Probably not. By default, `tidytext` converts all text to lower case before tokenising into words. It also ignores punctuation.

This is *normally* a good thing: most times when you are analysing text, you don't care about the difference between "Text", "TEXT," and "text". You just want to know how often the word text appears in the data in any context.

However, if you do care about different capitalisations you can set `to_lower` to `False` inside `unnest_tokens`. You cannot currently choose to not strip punctuation.


In [46]:
words_df = unnest_tokens(example_text, 'word', 'phrase',  to_lower= False)

words_df

Unnamed: 0,id,word
0,1,Here
0,1,is
0,1,some
0,1,text
1,2,Again
1,2,more
1,2,text
2,3,TEXT
2,3,is
2,3,text
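To make the effect of lower-casing and punctuation stripping concrete, here is a minimal sketch using only the standard library. The helper `simple_tokens` is hypothetical and much cruder than a real tokeniser; it just shows the two steps side by side.

```python
import re

phrases = ["Here is some text.", "Again, more text!", "TEXT is text?"]

def simple_tokens(text, to_lower=True):
    """Hypothetical helper: split text into word tokens, dropping punctuation."""
    if to_lower:
        text = text.lower()
    # \w+ matches runs of letters, digits and underscores, so punctuation is skipped
    return re.findall(r"\w+", text)

lowered = [simple_tokens(p) for p in phrases]
cased = [simple_tokens(p, to_lower=False) for p in phrases]

print(lowered[2])  # ['text', 'is', 'text']
print(cased[2])    # ['TEXT', 'is', 'text']
```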


Again, let's find out how often each word appears and arrange from the most common word to the least common.


In [52]:
words_df.groupby('word').size().sort_values(ascending=False).reset_index(name='counts')

Unnamed: 0,word,counts
0,text,3
1,is,2
2,Again,1
3,Here,1
4,TEXT,1
5,more,1
6,some,1


Since this is such a common pattern (not just in text mining, but in many types of analysis) `pandas` provides a short-cut. The function `value_counts` will group by the variable or variables given, and summarise by `count`.


In [54]:
words_df.value_counts('word').reset_index(name='counts')

Unnamed: 0,word,counts
0,text,3
1,is,2
2,Again,1
3,Here,1
4,TEXT,1
5,more,1
6,some,1


You can even set `sort = False` to stop the final arranging step too.

In [56]:
words_df.value_counts('word', sort=False).reset_index(name='counts')

Unnamed: 0,word,counts
0,Again,1
1,Here,1
2,TEXT,1
3,is,2
4,more,1
5,some,1
6,text,3


In the rest of these notes we will be using the shortcut version. 

In [20]:
lines = [
  "Whose woods these are I think I know.",
  "His house is in the village though;", 
  "He will not see me stopping here",
  "To watch his woods fill up with snow."
        ]

Using the four lines in `lines` above:

1. Create a data frame that has two variables: one with each word, the second with the line number of the word.
2. Use this data frame to find all the words that appear more than once in the four lines.


## Removing stop words

As promised, let's look at a more realistic text dataset. For this we'll use the text for all seven `Harry Potter` books. 


In [21]:
philosophers_stone = pd.read_csv("data/HPBook1.txt",  sep="@")
chamber_of_secrets = pd.read_csv("data/HPBook2.txt",  sep="@")
prisoner_of_azkaban = pd.read_csv("data/HPBook3.txt",  sep="@")
goblet_of_fire = pd.read_csv("data/HPBook4.txt",  sep="@")
order_of_the_phoenix = pd.read_csv("data/HPBook5.txt",  sep="@")
half_blood_prince = pd.read_csv("data/HPBook6.txt",  sep="@")
deathly_hallows = pd.read_csv("data/HPBook7.txt",  sep="@")

Each book is stored as its own `DataFrame`. The `Text` column is an `object` variable in which each element contains all the text of one chapter.


In [22]:
philosophers_stone

Unnamed: 0,Text,Chapter,Book
0,"THE BOY WHO LIVED Mr. and Mrs. Dursley, of nu...",1,1
1,THE VANISHING GLASS Nearly ten years had pass...,2,1
2,THE LETTERS FROM NO ONE The escape of the Bra...,3,1
3,THE KEEPER OF THE KEYS BOOM. They knocked aga...,4,1
4,DIAGON ALLEY Harry woke early the next mornin...,5,1
5,THE JOURNEY FROM PLATFORM NINE AND THREE-QUART...,6,1
6,THE SORTING HAT The door swung open at once. ...,7,1
7,"THE POTIONS MASTER There, look.\ \""Where?\"" ...",8,1
8,THE MIDNIGHT DUEL Harry had never believed he...,9,1
9,HALLOWEEN Malfoy couldn't believe his eyes wh...,10,1


We can now use `unnest_tokens` to find the most common words in "The Philosopher's Stone".

What do you notice about the most common words?


In [57]:
df_ps = unnest_tokens(philosophers_stone, 'Word', 'Text')

df_ps

Unnamed: 0,Chapter,Book,Word
0,1,1,the
0,1,1,boy
0,1,1,who
0,1,1,lived
0,1,1,mr
...,...,...,...
16,17,1,dudley
16,17,1,this
16,17,1,summer
16,17,1,the


In [58]:
df_ps.value_counts('Word')

Word
the            3627
and            1918
to             1856
a              1688
he             1528
               ... 
scrabbling        1
cooked            1
forgetmenot       1
scraped           1
smarmy            1
Length: 6034, dtype: int64

The most common words are not very interesting! They are words common to all English texts.

These common English words are known as **stop words**. The [`stop_words`](https://pypi.org/project/stop-words/) library provides lists of stop words in different languages. This means we can remove the stop words from our data by using either `merge`, `lambda` or `isin`.


In [25]:
!pip install stop-words



In [26]:
from stop_words import get_stop_words

In [27]:
get_stop_words('es')

['a',
 'al',
 'algo',
 'algunas',
 'algunos',
 'ante',
 'antes',
 'como',
 'con',
 'contra',
 'cual',
 'cuando',
 'de',
 'del',
 'desde',
 'donde',
 'durante',
 'e',
 'el',
 'ella',
 'ellas',
 'ellos',
 'en',
 'entre',
 'era',
 'erais',
 'eran',
 'eras',
 'eres',
 'es',
 'esa',
 'esas',
 'ese',
 'eso',
 'esos',
 'esta',
 'estaba',
 'estabais',
 'estaban',
 'estabas',
 'estad',
 'estada',
 'estadas',
 'estado',
 'estados',
 'estamos',
 'estando',
 'estar',
 'estaremos',
 'estará',
 'estarán',
 'estarás',
 'estaré',
 'estaréis',
 'estaría',
 'estaríais',
 'estaríamos',
 'estarían',
 'estarías',
 'estas',
 'este',
 'estemos',
 'esto',
 'estos',
 'estoy',
 'estuve',
 'estuviera',
 'estuvierais',
 'estuvieran',
 'estuvieras',
 'estuvieron',
 'estuviese',
 'estuvieseis',
 'estuviesen',
 'estuvieses',
 'estuvimos',
 'estuviste',
 'estuvisteis',
 'estuviéramos',
 'estuviésemos',
 'estuvo',
 'está',
 'estábamos',
 'estáis',
 'están',
 'estás',
 'esté',
 'estéis',
 'estén',
 'estés',
 'fue',
 'f

In [28]:
#as a list

stop_words = get_stop_words('en')

#as a DataFrame

df_stop_words = pd.DataFrame({
    'stop_words' : get_stop_words('en')
})
df_stop_words

Unnamed: 0,stop_words
0,a
1,about
2,above
3,after
4,again
...,...
169,you've
170,your
171,yours
172,yourself


We can `merge` both `DataFrame`s

In [29]:
df_ps.merge(df_stop_words, how='left', left_on='Word', right_on='stop_words')

Unnamed: 0,Chapter,Book,Word,stop_words
0,1,1,the,the
1,1,1,boy,
2,1,1,who,who
3,1,1,lived,
4,1,1,mr,
...,...,...,...,...
77570,17,1,dudley,
77571,17,1,this,this
77572,17,1,summer,
77573,17,1,the,the


and keep only the rows where `stop_words` is `NaN`, i.e. the words that did not match any stop word:

In [63]:
merged = df_ps.merge(df_stop_words, how='left', left_on='Word', right_on='stop_words')

merged[merged['stop_words'].isna()].drop('stop_words', axis=1)

Unnamed: 0,Chapter,Book,Word
1,1,1,boy
3,1,1,lived
4,1,1,mr
6,1,1,mrs
7,1,1,dursley
...,...,...,...
77566,17,1,lot
77568,17,1,fun
77570,17,1,dudley
77572,17,1,summer


In [61]:
merged['stop_words'].isna()

0        False
1         True
2        False
3         True
4         True
         ...  
77570     True
77571    False
77572     True
77573    False
77574     True
Name: stop_words, Length: 77575, dtype: bool
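An alternative to checking for `NaN` is to ask `merge` to record where each row came from with `indicator=True`, which adds a `_merge` column. The sketch below uses a tiny made-up word list and stop list rather than the book data, so the frames `words` and `stops` are assumptions for illustration.

```python
import pandas as pd

words = pd.DataFrame({"Word": ["the", "boy", "who", "lived"]})
stops = pd.DataFrame({"stop_words": ["the", "who"]})  # tiny made-up stop list

# indicator=True adds a `_merge` column: "both" if the word matched a stop word,
# "left_only" if it was only found in the left-hand data frame
merged_flagged = words.merge(
    stops, how="left", left_on="Word", right_on="stop_words", indicator=True
)

# Keep the rows that did not match a stop word
kept = merged_flagged.loc[merged_flagged["_merge"] == "left_only", ["Word"]]

print(kept)
```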

Or use a `lambda` function with a `list` comprehension

In [64]:
df_ps[['Word']].apply(lambda x: [word for word in x if word not in stop_words])

Unnamed: 0,Word
0,boy
1,lived
2,mr
3,mrs
4,dursley
...,...
42740,lot
42741,fun
42742,dudley
42743,summer
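When filtering with a comprehension like this on a long text, it may be worth converting the stop-word list to a `set` first, since membership tests on a set do not have to scan the whole list. Here is a minimal sketch on made-up data (the names `stop_set` and `kept_words` are our own):

```python
words = ["the", "boy", "who", "lived"]
stop_words = ["the", "who"]  # tiny made-up stop list

# A set gives fast membership tests compared with scanning a list each time
stop_set = set(stop_words)

kept_words = [word for word in words if word not in stop_set]

print(kept_words)  # ['boy', 'lived']
```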


You can also use `isin` and negate the mask to find values not in `df_stop_words`:

In [69]:
df_ps[~df_ps['Word'].isin(df_stop_words['stop_words'])]

Unnamed: 0,Chapter,Book,Word
0,1,1,boy
0,1,1,lived
0,1,1,mr
0,1,1,mrs
0,1,1,dursley
...,...,...,...
16,17,1,lot
16,17,1,fun
16,17,1,dudley
16,17,1,summer


**Task - 5 minutes**

Find the most common words, not including stop words, in the book "Chamber of Secrets".


### Recap 

* Which function do you use to convert from text into words?

**Solution:**  `unnest_tokens`


* Which function do you use to count the number of words and arrange in order?

**Solution:** 
`value_counts(..., sort=True)`


* How do you remove stopwords from a data frame?

**Solution:** 
using `merge`, `lambda` or `isin`



## Regular Expressions

### Learning Outcomes

* Know what a regular expression is
* Be familiar with regular expression syntax
* Use regular expressions to match patterns in text

We've already seen some of the ways in which we can work with strings, including how we can include some special characters in our reports using Unicode. Now we're going to look at a way in which we can pick specific strings out of a collection based on the characters and symbols they contain.

## What are regular expressions?

A **regular expression** (usually abbreviated to a **regex**) is, fundamentally, just another type of string. We can use them to specify search terms or to validate data by making sure it fits a given pattern. We can also find and replace text in a more generic way than we can by using simple string methods, for example in order to redact email addresses or phone numbers.

Regex is used in data analysis for a number of purposes, e.g. checking whether data is valid (does it follow the pattern of a valid email address, phone number or postcode?) and extracting parts of variables (you may only want the first initial of a surname, or the street name from a full address). More generally it is used in text analysis, which we will come to in module 3.

## Matching patterns in text

Let's take a look at some of the things we can do with regex. First up, we'll declare some strings to work with:


In [70]:
single_string = 'string a'
strings = ['string a', 'string b', 'string c', 'string d', 'string e']



These strings all have something in common, but also have their differences. This is something you'll encounter fairly often when working with datasets, both text-based and numerical. Email addresses are a great example: they all contain the "@" symbol and all end in a similar way, e.g. ".com", but the other parts could be wildly different. Regex will enable us to find any email address hidden away in a dense block of text.

## Matching Single Characters

The easiest pattern we could try to match is a single alphanumeric character, and Python's built-in `re` module has functions to help us do it.



In [71]:
import re 

First we need to define the pattern we want to look for. To start with let's just look for the letter "a". Our pattern must be a string and is case-sensitive:

In [72]:
pattern = 'a'


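To look for a pattern in a string we can use re's `search()` function. We pass it two arguments: the pattern to look for, then the string to check. It returns a `Match` object if the pattern is found and `None` otherwise, so the result can be used as a truth value. The related `findall()` function returns a list of every match in the string. Here we use a list comprehension to apply each function to every element of `strings`: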


In [73]:
[re.search(pattern, x) for x in strings]

[<re.Match object; span=(7, 8), match='a'>, None, None, None, None]

In [74]:
[re.findall(pattern, x) for x in strings]

[['a'], [], [], [], []]
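Since `re.search()` gives back either a `Match` object or `None`, a plain True/False answer takes one extra step. The sketch below shows one way to do that, along with how to read the position and text out of a match (the variable names are our own):

```python
import re

strings = ['string a', 'string b', 'string c', 'string d', 'string e']
pattern = 'a'

# Compare against None to turn each result into a boolean
has_match = [re.search(pattern, s) is not None for s in strings]

# A Match object knows where the match was found and what text matched
first_match = re.search(pattern, strings[0])
span = first_match.span()
text = first_match.group()

print(has_match)  # [True, False, False, False, False]
print(span, text)  # (7, 8) a
```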

If we simply want to check if a pattern is present in a string we can use pandas [`str.contains()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) function to return a boolean value.

In [76]:
merged['Word'].str.contains(pattern).sum()

25549

If we want to test whether a string starts or ends with a given element, we can use [`startswith`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.startswith.html) or [`endswith`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.endswith.html). Note that these two methods treat the pattern as a literal string, not as a regular expression.

In [78]:
merged['Word'].str.startswith(pattern).sum()

7622

In [80]:
merged['Word'].str.endswith(pattern).sum()

1901
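Regular expressions have their own way of pinning a pattern to the start or end of a string: the anchors `^` (start) and `$` (end). As a small sketch on a made-up list of words, using only the `re` module:

```python
import re

words = ["again", "harry", "aurora", "text"]  # made-up examples

# ^a : an "a" at the very start of the string
starts_with_a = [bool(re.search(r"^a", w)) for w in words]

# a$ : an "a" at the very end of the string
ends_with_a = [bool(re.search(r"a$", w)) for w in words]

print(starts_with_a)  # [True, False, True, False]
print(ends_with_a)    # [False, False, True, False]
```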

If we want to find the strings that do not match, we can simply use `~` to invert the boolean result

In [83]:
~merged['Word'].str.endswith(pattern)

0        True
1        True
2        True
3        True
4        True
         ... 
77570    True
77571    True
77572    True
77573    True
77574    True
Name: Word, Length: 77575, dtype: bool

## Matching multiple characters

We don't need to limit ourselves to one character, we can check for many at once. The easiest use case is to look for a specific substring:


In [84]:
pattern = 'rry'

In [85]:
merged['Word'].str.contains(pattern).sum()

1418

But, instead of looking for a substring, we can also look for a selection of characters. We do this by enclosing the characters we're looking for in square brackets (`[]`):


In [87]:
pattern = '[abc]'

In [88]:
merged['Word'].str.contains(pattern).sum()

32337

Note that there is no space between the characters, and that we still need to enclose the brackets in quotes. Here we're looking to match any string including the letters "a" **or** "b" **or** "c" (not the pattern "abc"). But what if we wanted to check for a bigger range of letters? Say every lower-case letter from 'a' to 'z': do we need to manually type them all out? No! We can use a handy expression which checks for a range of characters:


In [89]:
pattern = '[a-z]'

In [91]:
merged['Word'].str.contains(pattern).sum()

77551

What happens if we change it to match capital letters?

In [92]:
pattern = '[A-Z]'

In [93]:
merged['Word'].str.contains(pattern)

0        False
1        False
2        False
3        False
4        False
         ...  
77570    False
77571    False
77572    False
77573    False
77574    False
Name: Word, Length: 77575, dtype: bool

Nothing matches any more. We need to take care to use the correct casing when matching letters. If we wanted to match the whole alphabet we would combine the two cases using `[a-zA-Z]`. Or another option is you could cast your data to lower or upper case and then use the resulting case for regex matching.
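A third option, if you would rather not change the data or the pattern, is the `re.IGNORECASE` flag. The sketch below compares the three approaches on a few made-up strings using the `re` module directly:

```python
import re

words = ["text", "TEXT", "Text", "1234"]  # made-up examples

# Lower-case range only: "TEXT" and "1234" contain no lower-case letters
lower_only = [bool(re.search("[a-z]", w)) for w in words]

# Both ranges in one character class
both_cases = [bool(re.search("[a-zA-Z]", w)) for w in words]

# Lower-case range, but told to ignore case
ignore_case = [bool(re.search("[a-z]", w, flags=re.IGNORECASE)) for w in words]

print(lower_only)   # [True, False, True, False]
print(both_cases)   # [True, True, True, False]
print(ignore_case)  # [True, True, True, False]
```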

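We can even be specific about how many occurrences of a character we're looking for by following the character(s) with a number in braces (`{}`). For example, `i{3}` matches the letter "i" three times consecutively. The quantifier has to sit outside any square brackets, though: the pattern in the cell below, `[i{3}]`, is a character class that matches any single one of the characters "i", "{", "3" or "}", which is why it matches so many words: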


In [94]:
pattern = '[i{3}]'

In [96]:
merged['Word'].str.contains(pattern).sum()

19714
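The position of the braces makes a big difference. As a quick sketch on a made-up list of strings, compare a quantifier placed inside square brackets with one placed after the character:

```python
import re

words = ["skiing", "iii", "hiii", "is", "3", "text"]  # made-up examples

# Inside brackets: a character class, matching any ONE of i, {, 3 or }
inside = [w for w in words if re.search("[i{3}]", w)]

# Outside brackets: a quantifier, matching "i" exactly three times in a row
outside = [w for w in words if re.search("i{3}", w)]

print(inside)   # ['skiing', 'iii', 'hiii', 'is', '3']
print(outside)  # ['iii', 'hiii']
```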

## Extracting matching substrings

We've seen how we can find out if a string contains a regular expression or not, but how can we use what we find? We might only be interested in certain parts of our data and want to pull them out, or we might want to hide personal information in our dataset before we publish it. We can use regex for both of these.



In [97]:
email_data = pd.DataFrame({
     'data': ["This string has an_address@email.com in it", 
            "This one has user.name@company.co.uk", 
            "Now we've got other_person_123@server.net and my.name@runningoutofideas.com",
             "@emailprovider.com"
             ]
    
})

email_data
 

Unnamed: 0,data
0,This string has an_address@email.com in it
1,This one has user.name@company.co.uk
2,Now we've got other_person_123@server.net and ...
3,@emailprovider.com


### Extracting parts of a string

We can use `str.extract` to extract capture groups (the parts of the pattern in parentheses) from the strings in the given series object.

In [98]:
email_data['data'].str.extract(pat = '([i{3}])')

Unnamed: 0,0
0,i
1,i
2,3
3,i


Well, we've found something at least. Because the braces are inside the square brackets, our regex has matched the first character from the set "i", "{", "3", "}" that it found in each string (`str.extract()` will pull out the first matching expression it finds and then stop, ignoring any other potential matches), which doesn't do us much good. We can be more specific though, and say that we want a lower-case letter followed by an "@" symbol:


In [99]:
email_data['data'].str.extract(pat ='([a-z]@)')

Unnamed: 0,0
0,s@
1,e@
2,e@
3,


That's a bit better, but not much. We're now matching the symbol and the letter preceding it, but still only one letter. We need to capture all of the letters, but we don't know exactly how many there will be.

If we're not sure about how many occurrences of a character there will be, we can use something called the **Kleene Star**. By adding a `*` after the character or group we want to match, we say "match this expression if there are any number of occurrences of these characters".


In [100]:
email_data['data'].str.extract(pat ='([a-z]*@emailprovider.com)')

Unnamed: 0,0
0,
1,
2,
3,@emailprovider.com


Our expression has matched this string, but if we were checking for valid email addresses we would have wanted this to fail (since there's nothing before the "@"). This is the downside of the Kleene star -- it matches **any** number of occurrences, including none! If we want to make sure we have **at least one**, we use the `+` symbol:


In [101]:
email_data['data'].str.extract(pat ='([a-z]+@emailprovider.com)')

Unnamed: 0,0
0,
1,
2,
3,


Much better.
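The difference between `*` and `+` can be checked on a single string with the `re` module. In this sketch we also escape the dot as `\.` so that it only matches a literal full stop; an unescaped `.` matches any character, which happens to work here but is looser than intended.

```python
import re

text = "@emailprovider.com"

# * allows zero or more letters before the "@", so this matches
star_match = re.search(r"[a-z]*@emailprovider\.com", text)

# + requires at least one letter before the "@", so this does not match
plus_match = re.search(r"[a-z]+@emailprovider\.com", text)

print(star_match.group())  # @emailprovider.com
print(plus_match)          # None
```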

Let's incorporate what we've just seen into our own expression:

In [102]:
email_data['data'].str.extract(pat ='([a-z]+@)')

Unnamed: 0,0
0,address@
1,name@
2,name@
3,


More progress! We're not just looking at letters before the "@" though, we've also got an address featuring numbers. We have to think about what comes after "@" too.


In [103]:
email_data['data'].str.extract(pat ='([a-z0-9]+@[a-z]+)')

Unnamed: 0,0
0,address@email
1,name@company
2,123@server
3,


Almost there now. The final step is to include the punctuation marks which are vital to defining email addresses: "_" and ".". We can include them in our expression just like any other character:


In [104]:
email_data['data'].str.extract(pat ='([a-z0-9._]+@[a-z.]+)')

Unnamed: 0,0
0,an_address@email.com
1,user.name@company.co.uk
2,other_person_123@server.net
3,


Success! We've pulled the email address out of each of our strings!

Well, almost success... The third string had two email addresses in it, and we've only matched one. As we found earlier, `str.extract()` will pull out the first matching expression it finds and then stop, ignoring any other potential matches. It has a partner in `str.extractall()` which will find everything, though:


In [105]:
email_data['data'].str.extractall(pat ='([a-z0-9._]+@[a-z.]+)')

Unnamed: 0_level_0,Unnamed: 1_level_0,0
Unnamed: 0_level_1,match,Unnamed: 2_level_1
0,0,an_address@email.com
1,0,user.name@company.co.uk
2,0,other_person_123@server.net
2,1,my.name@runningoutofideas.com
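The same first-match versus all-matches distinction exists in the `re` module, where `re.search()` stops at the first match and `re.findall()` collects them all. A small sketch on the third string from our data (the name `email_pattern` is our own):

```python
import re

email_pattern = r"[a-z0-9._]+@[a-z.]+"
text = "Now we've got other_person_123@server.net and my.name@runningoutofideas.com"

# search() stops after the first match
first = re.search(email_pattern, text).group()

# findall() returns every non-overlapping match as a list
all_matches = re.findall(email_pattern, text)

print(first)
print(all_matches)
```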


### Replacing Parts of a String

We've looked at a way of finding any email addresses in a string, regardless of how long they are or where they sit in the string. That gives us a way to extract information from our datasets and analyse it. We may want to publish that dataset though, which means we probably don't want email addresses (or names, or phone numbers) left exposed in it. Instead of simply pulling the information out, we can replace it with something else.

Let's say we want to replace all the email addresses in our strings above with "ANONYMOUS". We can do it using the function `str.replace()`. It works in almost the same way as `str.extract()`, but this time takes another argument representing the string we want to use as a replacement. We also pass `regex=True` so that the pattern is treated as a regular expression rather than a literal string:


In [107]:
email_data['data'].str.replace('([a-z0-9._]+@[a-z.]+)', 'ANONYMOUS', regex=True)

0          This string has ANONYMOUS in it
1                   This one has ANONYMOUS
2    Now we've got ANONYMOUS and ANONYMOUS
3                       @emailprovider.com
Name: data, dtype: object

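Unlike `str.extract()`, pandas' `str.replace()` replaces every match by default (its `n` parameter defaults to `-1`, meaning "all"), so both email addresses in the 3rd element have been replaced. Pass `n=1` if you only want to replace the first match.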
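The standard library equivalent is `re.sub()`, which also replaces every match unless told otherwise through its `count` argument. A minimal sketch on the string with two addresses:

```python
import re

email_pattern = r"[a-z0-9._]+@[a-z.]+"
text = "Now we've got other_person_123@server.net and my.name@runningoutofideas.com"

# By default every match is replaced
all_replaced = re.sub(email_pattern, "ANONYMOUS", text)

# count=1 limits the replacement to the first match only
first_only = re.sub(email_pattern, "ANONYMOUS", text, count=1)

print(all_replaced)  # Now we've got ANONYMOUS and ANONYMOUS
print(first_only)    # Now we've got ANONYMOUS and my.name@runningoutofideas.com
```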
