# Parsing text with NLTK 📚

This notebook introduces [NLTK](https://www.nltk.org/), which is one of Python's key libraries for natural language processing (NLP).

NLP is broadly about using computers to manipulate the type of text that humans use for everyday communication. This can be very simple or very complicated. This notebook focuses on the simpler end of the spectrum, on which some other methods build.

NLTK was created in 2001 at the University of Pennsylvania and is widely embraced in academia. There are other libraries, such as [spaCy](https://spacy.io/), which are also helpful and commonly used. One advantage of NLTK is that it provides lots of customization.

This notebook focuses mainly on how to parse text to create bags of words. In other words, this notebook teaches how to go from raw text (a string) to a list of tokens (e.g., words) that can be used for different purposes.

Let's get started! 💪🏼

## Importing libraries

In [None]:
import pandas as pd # To work with data frames
import re # To work with regular expressions
import nltk
nltk.download('punkt')
nltk.download('punkt_tab') # Recent NLTK releases look for 'punkt_tab' instead of 'punkt'
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
nltk.download('wordnet')

## Reading data

This notebook will be working with posts from Reddit's API. The posts are the "hottest" posts in four subreddits: Meditation, Mindfulness, Headspace, and Buddhism. (You can see [how the posts were collected in this URL](https://github.com/emiliolehoucq/mindfulness/blob/main/data_collection_reddit_4_17_2024.ipynb).)

The data are hosted on GitHub. We can use the URL to the raw data directly in `pd.read_csv`.

In [None]:
df_reddit = pd.read_csv('https://raw.githubusercontent.com/emiliolehoucq/mindfulness/main/data_raw_reddit_4_17_2024.csv', index_col=0)

## Taking a quick look at the data 👀

Before going any further into NLTK and NLP, let's get to know the data that we will be working with.

In [None]:
df_reddit.shape

3513 rows and 14 columns...

In [None]:
df_reddit.columns

You can find what each column means in [PRAW's documentation](https://praw.readthedocs.io/en/stable/code_overview/models/submission.html).

In [None]:
df_reddit.dtypes

Aha, we seem to have some text! Let's see what it looks like.

In [None]:
df_reddit.head()

We'll be working mostly with `selftext`, the text of the post, and `title`, the title of the post. Here's one example of `selftext`:

In [None]:
IDX = 194
post = df_reddit['selftext'][IDX]
post

And one example of `title`:

In [None]:
title = df_reddit['title'][IDX]
title

`selftext` has some missing values (try `df_reddit['selftext'].isnull().mean()`), so we're dropping them:

In [None]:
df_reddit.dropna(subset=['selftext'], inplace=True)

Okay, ready to parse text with NLTK!

## Processing raw text for computing

We have a fair amount of raw text (strings). 😱 We'll call each of the posts and titles a *document*. How can we compute with those documents? We need a numerical representation.

In this notebook, we'll focus on the "*bag of words*" model, basically turning documents into lists of words and counting how many times each word appears in the document.
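To make the idea concrete before bringing in NLTK, here is a minimal sketch of a bag of words built with Python's standard library. The sentence is made up for illustration and is not taken from the Reddit data.

```python
from collections import Counter

# A toy document (made up for illustration, not from the Reddit data)
document = "breathe in breathe out and breathe again"

# Split on whitespace to get a list of words, then count each word
bag_of_words = Counter(document.split())

print(bag_of_words.most_common(1))  # [('breathe', 3)]
```

The order of the words is lost; only the counts remain. That is what makes it a "bag".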

How can we turn documents into lists of words? *Tokenization*! Tokenization consists of splitting a string into linguistic units. When we tokenize a string, we produce a list of *tokens*, or sequences of characters that we want to treat as a group. For now, we're going to focus on words.

For example, `"This is a sentence"` would turn into `["This", "is", "a", "sentence"]`. Easy, no? 🤔 Let's try splitting on white spaces:

In [None]:
post.split()

That seems to work pretty well. However, there are some problems, including:
- Some tokens include punctuation or parentheses
- 'I’ve', 'consistency-sake.', etc. are actually two words

We could improve our approach using regular expressions (ways to describe patterns in text). Python provides a character class for word characters (`\w`). With the `re.ASCII` flag, it is equivalent to `[a-zA-Z0-9_]`; by default, for strings, it also matches letters and digits from other alphabets. There's also a complement to this class (`\W`). We could use that complement to split strings on anything other than word characters (i.e., anything other than letters, digits, and underscores):

In [None]:
re.split(r'\W+', post)

That works pretty well in this case! However, we get empty strings at the end. Of course, we could solve this problem and continue improving our regular expression to account for various cases. However, if you've ever worked with text and regular expressions, you know that gets tricky very quickly... 😵‍💫
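Here is a small sketch of where the empty strings come from, using a toy string rather than a post. One possible workaround (an alternative, not what the rest of this notebook uses) is to match the word characters directly with `re.findall` instead of splitting on everything else.

```python
import re

text = "Breathe in, breathe out."  # toy string, not from the data

# Splitting on runs of non-word characters leaves an empty string
# wherever the text starts or ends with a separator
split_tokens = re.split(r'\W+', text)
print(split_tokens)  # ['Breathe', 'in', 'breathe', 'out', '']

# Matching runs of word characters directly avoids the empty strings
found_tokens = re.findall(r'\w+', text)
print(found_tokens)  # ['Breathe', 'in', 'breathe', 'out']
```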

Fortunately, NLTK provides a tokenizer that you can use ([several tokenizers](https://www.nltk.org/api/nltk.tokenize.html), actually): 😅

In [None]:
nltk.word_tokenize(post)

As you can see, it's still not perfect (e.g., we get 'consistency-sake' again!). 😬

Still, `word_tokenize` is a more sophisticated tokenizer than most of us could probably write. It's NLTK's recommended word tokenizer, which builds on [`TreebankWordTokenizer`](https://www.nltk.org/api/nltk.tokenize.TreebankWordTokenizer.html#nltk.tokenize.TreebankWordTokenizer) (which uses regular expressions) and [`PunktSentenceTokenizer`](https://www.nltk.org/api/nltk.tokenize.PunktSentenceTokenizer.html#nltk.tokenize.PunktSentenceTokenizer) (a sentence tokenizer that uses an unsupervised learning algorithm to build a model and then uses the model to find sentence boundaries). (It's because `word_tokenize` uses the Punkt sentence tokenization models that we needed to download them above: `nltk.download('punkt')`.)

Tokenization turns out to be pretty difficult. 😥 Also, there's no solution that fits all cases. What counts as a token could depend on your application. To check the quality of your tokenization, it's useful to have some text that has been manually tokenized to compare the output of the tokenizer with what you want. For this notebook, let's move on...
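As a rough sketch of what such a check could look like, the snippet below compares a tokenizer's output with a hand-tokenized version of the same text. The sentence, the "expected" tokens, and the helper `token_overlap` are all hypothetical and only meant to illustrate the comparison.

```python
from collections import Counter

def token_overlap(predicted, expected):
    """Return (precision, recall) of predicted tokens against expected tokens."""
    # Multiset intersection: a token counts as many times as it appears in both lists
    shared = sum((Counter(predicted) & Counter(expected)).values())
    precision = shared / len(predicted)
    recall = shared / len(expected)
    return precision, recall

text = "I sit daily, for consistency-sake."  # toy sentence
# The tokens we would like to get (a manual, hypothetical choice)
expected = ["I", "sit", "daily", "for", "consistency", "sake"]
# The tokens that splitting on whitespace gives us
predicted = text.split()

precision, recall = token_overlap(predicted, expected)
print(precision, recall)
```

Here only "I", "sit", and "for" match, because splitting on whitespace keeps the punctuation attached to "daily," and "consistency-sake.". You could pass the output of any other tokenizer as `predicted` to compare approaches.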

## Exercise 1

Select a different post than the one we've been working with, apply the different methods for tokenizing covered above, and compare the results.

Wasn't that fun?! 🤩

Now, let's tokenize every post and store the result in a new column in the data frame:

In [None]:
df_reddit['selftext_tokenized'] = df_reddit['selftext'].apply(nltk.word_tokenize)

- `df_reddit['selftext_tokenized'] = ` creates a new column
- `df_reddit['selftext'].apply(nltk.word_tokenize)` applies `nltk.word_tokenize` to every cell of `df_reddit['selftext']`

Did we get what we expected?

In [None]:
df_reddit[['selftext_tokenized']].head()

Seems about right! Let's take a look at the post that we've been working with:

In [None]:
df_reddit['selftext_tokenized'][IDX]

## Exercise 2 🫵🏼

Tokenize every title (`title`) and store the result in a new column in the data frame.

Great! Now we have a list of words. What sort of cool insights can we get from that?

## Frequency distributions

A frequency distribution is a collection of items along with their frequency counts. A frequency distribution can help identify the words that are most informative of a document or a collection of documents.

Do users talk about different things in the subreddits about meditation vs. mindfulness vs. Headspace vs. Buddhism? Let's find out!

First, let's combine the list of words for each of the subreddits:

In [None]:
meditation = df_reddit[df_reddit['subreddit'] == 'Meditation']['selftext_tokenized'].sum()
mindfulness = df_reddit[df_reddit['subreddit'] == 'Mindfulness']['selftext_tokenized'].sum()
headspace = df_reddit[df_reddit['subreddit'] == 'Headspace']['selftext_tokenized'].sum()
buddhism = df_reddit[df_reddit['subreddit'] == 'Buddhism']['selftext_tokenized'].sum()

As you can see, we get big lists of words (exciting!):

In [None]:
print(f'Type of meditation: {type(meditation)}')
print(f'Length of meditation: {len(meditation)}')
print()
print(f'Type of mindfulness: {type(mindfulness)}')
print(f'Length of mindfulness: {len(mindfulness)}')
print()
print(f'Type of headspace: {type(headspace)}')
print(f'Length of headspace: {len(headspace)}')
print()
print(f'Type of buddhism: {type(buddhism)}')
print(f'Length of buddhism: {len(buddhism)}')
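A side note on `.sum()`: on a column of lists, it adds the lists together with `+`, which concatenates them. That is convenient, but repeated list addition can get slow on large collections. The sketch below uses toy lists (not the data frame) to show that `itertools.chain.from_iterable` gives the same result; treating it as a faster alternative is an assumption that you may want to time on your own data.

```python
from itertools import chain

# Toy stand-in for a column of tokenized posts
token_lists = [["breathe", "in"], ["breathe", "out"], []]

# Adding lists concatenates them (this mirrors what .sum() does on lists)
summed = sum(token_lists, [])

# chain.from_iterable walks through every list once
chained = list(chain.from_iterable(token_lists))

print(summed == chained)  # True
```

With the data frame, this would look like `list(chain.from_iterable(df_reddit[df_reddit['subreddit'] == 'Meditation']['selftext_tokenized']))`.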

Now, let's create our frequency distributions:

In [None]:
freq_dist_meditation = nltk.FreqDist(meditation)
freq_dist_mindfulness = nltk.FreqDist(mindfulness)
freq_dist_headspace = nltk.FreqDist(headspace)
freq_dist_buddhism = nltk.FreqDist(buddhism)

As you can see, we use the class [`FreqDist`](https://www.nltk.org/api/nltk.probability.FreqDist.html). Now, let's see the most common words for each of the four subreddits... 🫣

In [None]:
print(f'Most common words in meditation: {freq_dist_meditation.most_common(10)}')
print()
print(f'Most common words in mindfulness: {freq_dist_mindfulness.most_common(10)}')
print()
print(f'Most common words in headspace: {freq_dist_headspace.most_common(10)}')
print()
print(f'Most common words in buddhism: {freq_dist_buddhism.most_common(10)}')

Wow, that was UNDERWHELMING! 😭 The most common words are not informative at all! Sometimes they're not even words. Okay, let's fix that...

Before, though, your turn to create frequency distributions:

## Exercise 3

Create frequency distributions for the titles for each subreddit.

## Tokenization 2.0: removing stopwords, removing punctuation, and lowercasing 🤞🏽

Our approach to tokenization so far has several problems, particularly:
- The most common words are not very informative. They are often so-called *stopwords*, or high-frequency words that we often filter out before further processing since they're not very distinctive of a document or collection of documents. You may also want to add common words for your application (e.g., "meditation" in the case of this notebook).
- We're including punctuation in the bag of words, which is often not useful. We don't care if the posts in the Meditation subreddit use "." more or less often than the posts in the "Mindfulness" subreddit.
- The words are capitalized differently. So, we're counting "Mindfulness" and "mindfulness" as two different words.

Fortunately, these problems are not too hard to solve. Let's do it! We'll start by creating a function that we can apply to each row:

In [None]:
# Add custom words to the list of stop words
# Remember code above:
# nltk.download('stopwords')
# stop_words = nltk.corpus.stopwords.words('english')
stop_words = stop_words + [
    # You can optionally add words here
]

# In this case, we also want to remove URLs from the text before parsing it
# Taken from: https://www.geeksforgeeks.org/remove-urls-from-string-in-python/
def remove_urls(text, replacement_text=""):
    # Define a regex pattern to match URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+') # https?:// protocol (optional s), \S+ one or more non-white space characters, | or, www\.\S+ URLs starting with www.
    # Use the sub() method to replace URLs with the specified replacement text
    text_without_urls = url_pattern.sub(replacement_text, text)
    return text_without_urls

# Define custom function for tokenization
def my_tokenizer(post):
  """
  Function to tokenize a post.
  Input: post.
  Output: list of tokens.
  Dependencies: NLTK
  """
  # Convert post to string and remove URLs from it
  post = remove_urls(str(post))
  # Tokenize post
  tokens = nltk.word_tokenize(post)
  # Clean tokens:
  # - Keep only alphanumeric tokens (letters and digits; this drops punctuation)
  # - Remove stopwords
  # - Lowercase
  tokens = [token.lower() for token in tokens if token.isalnum() and token.lower() not in stop_words]
  return tokens
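To see each cleaning step in isolation, here is a minimal sketch on a toy sentence. It uses a tiny, hypothetical stopword set (not NLTK's list) and a whitespace split as a stand-in for `nltk.word_tokenize`, so that it runs without any downloads.

```python
import re

# Hypothetical tiny stopword set for illustration, NOT NLTK's list
toy_stop_words = {"the", "is", "a", "at"}

text = "The Mindfulness guide at https://example.com is GREAT !"

# Remove URLs with the same pattern used in remove_urls
text = re.sub(r'https?://\S+|www\.\S+', '', text)

# Whitespace split stands in for nltk.word_tokenize here
tokens = text.split()

# Keep alphanumeric tokens, drop stopwords, and lowercase what remains
cleaned = [t.lower() for t in tokens if t.isalnum() and t.lower() not in toy_stop_words]

print(cleaned)  # ['mindfulness', 'guide', 'great']
```

Note that `isalnum()` is strict: a token such as 'consistency-sake' contains a hyphen, so it is dropped entirely rather than split.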

Let's try our function with the post we've already seen a number of times:

In [None]:
my_tokenizer(post)

Seems to be working well! Let's tokenize posts and create the frequency distributions again:

In [None]:
# Tokenize posts using our custom function
df_reddit['selftext_tokenized'] = df_reddit['selftext'].apply(my_tokenizer)

# Create a list of words for each subreddit
meditation = df_reddit[df_reddit['subreddit'] == 'Meditation']['selftext_tokenized'].sum()
mindfulness = df_reddit[df_reddit['subreddit'] == 'Mindfulness']['selftext_tokenized'].sum()
headspace = df_reddit[df_reddit['subreddit'] == 'Headspace']['selftext_tokenized'].sum()
buddhism = df_reddit[df_reddit['subreddit'] == 'Buddhism']['selftext_tokenized'].sum()

# Create a frequency distribution for each subreddit
freq_dist_meditation = nltk.FreqDist(meditation)
freq_dist_mindfulness = nltk.FreqDist(mindfulness)
freq_dist_headspace = nltk.FreqDist(headspace)
freq_dist_buddhism = nltk.FreqDist(buddhism)

In this case, let's visualize the results instead of printing a list of tuples: 📉

In [None]:
NUM_RESULTS = 25
freq_dist_meditation.plot(NUM_RESULTS)
freq_dist_mindfulness.plot(NUM_RESULTS)
freq_dist_headspace.plot(NUM_RESULTS)
freq_dist_buddhism.plot(NUM_RESULTS)

We get some interesting results, but nothing all that exciting. Let's try a couple of other methods!

## Tokenization 3.0: stemming, lemmatizing, and n-grams 🤞🏽🤞🏽

Our approach to tokenization so far has two problems that we can solve:
- We're treating words that convey similar information as different words (e.g., "meditation" and "meditate" or "mindfulness" and "mindful"). Fortunately, we can address this problem with stemming and lemmatization. Stemming consists of keeping only the stem of a word, removing "the end" of the word (e.g., "manufactur" instead of "manufacturing"). Lemmatizing is similar to stemming, but it makes sure that the resulting word exists in the dictionary. Since lemmatizing involves further computation, it is slower than stemming. Fortunately, NLTK includes off-the-shelf stemmers and lemmatizers! However, as is more generally the case with tokenizing, you should select the stemmer/lemmatizer that best fits your needs.
- We're focusing on single words (or *unigrams*). However, sometimes the meaning of words changes when they appear together with other words (e.g., "White House" vs "White" and "House" or "Thich Nhat Hanh" vs "Thich", "Nhat", and "Hanh"). To address this problem, we can tokenize posts into *n-grams*, particularly *bigrams* and *trigrams*, or sequences of n, 2, and 3 consecutive words, respectively. This increases the number of unique tokens, but retains more information.
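To illustrate what n-grams look like, here is a minimal sketch in plain Python on a toy list of tokens. The helper `ngrams` is hypothetical and written only for this illustration; `nltk.bigrams` and `nltk.trigrams` produce the same kind of tuples.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (tuples of n consecutive tokens)."""
    # Zip the list with copies of itself shifted by 1, 2, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["thich", "nhat", "hanh", "teaches", "mindfulness"]  # toy tokens

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[0])   # ('thich', 'nhat')
print(trigrams[0])  # ('thich', 'nhat', 'hanh')
```

A list of 5 tokens gives 4 bigrams and 3 trigrams, and each n-gram tends to repeat less often than its individual words.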

Let's try it out! Let's improve the function we had created for tokenization:

In [None]:
# Define new custom function for tokenization
def my_tokenizer_improved(post):
  """
  Improved function to tokenize a post.
  Input: post.
  Output: tuple with lists of tokens.
  Dependencies: NLTK
  """
  # Convert post to string and remove URLs from it
  post = remove_urls(str(post))
  # Tokenize post
  tokens = nltk.word_tokenize(post)
  # Create the stemmer and lemmatizer once, instead of once per token
  stemmer = nltk.PorterStemmer()
  lemmatizer = nltk.WordNetLemmatizer() # nltk.download('wordnet') is necessary for this
  # Create variables to store unigrams, stems, and lemmas
  unigrams = []
  stems = []
  lemmas = []
  # Clean tokens, stem, and lemmatize
  for token in tokens:
    # Lowercase
    token = token.lower()
    # Keep only alphanumeric and remove stopwords
    if token.isalnum() and token not in stop_words:
      # Store unigram
      unigrams.append(token)
      # Stemming
      stems.append(stemmer.stem(token))
      # Lemmatizing
      lemmas.append(lemmatizer.lemmatize(token))
  # Create lists of bigrams
  bigrams = list(nltk.bigrams(unigrams))
  bigrams_stems = list(nltk.bigrams(stems))
  bigrams_lemmas = list(nltk.bigrams(lemmas))
  # Create lists of trigrams
  trigrams = list(nltk.trigrams(unigrams))
  trigrams_stems = list(nltk.trigrams(stems))
  trigrams_lemmas = list(nltk.trigrams(lemmas))
  # Return tuple with everything
  return unigrams, stems, lemmas, bigrams, bigrams_stems, bigrams_lemmas, trigrams, trigrams_stems, trigrams_lemmas

## Exercise 4

By now you're somewhat familiar with the workflow that we're using to tokenize text, create frequency distributions, and visualize them. Shortly we'll apply `my_tokenizer_improved` to the posts. Before that, though, try doing it yourself for `title`. You can start by comparing what you get with unigrams, stems, and lemmas. No worries if you can't do it. Trying it may help you identify questions that we can answer throughout the rest of the notebook.

Let's apply `my_tokenizer_improved` to the posts:

In [None]:
# Columns to store results
df_reddit['selftext_unigrams'] = None
df_reddit['selftext_stems'] = None
df_reddit['selftext_lemmas'] = None
df_reddit['selftext_bigrams'] = None
df_reddit['selftext_bigrams_stems'] = None
df_reddit['selftext_bigrams_lemmas'] = None
df_reddit['selftext_trigrams'] = None
df_reddit['selftext_trigrams_stems'] = None
df_reddit['selftext_trigrams_lemmas'] = None

# Iterating over rows of the data frame storing results
for i, row in df_reddit.iterrows():
  # Tokenize post
  unigrams, stems, lemmas, bigrams, bigrams_stems, bigrams_lemmas, trigrams, trigrams_stems, trigrams_lemmas = my_tokenizer_improved(str(row['selftext']))
  # Store results
  df_reddit.at[i, 'selftext_unigrams'] = unigrams
  df_reddit.at[i, 'selftext_stems'] = stems
  df_reddit.at[i, 'selftext_lemmas'] = lemmas
  df_reddit.at[i, 'selftext_bigrams'] = bigrams
  df_reddit.at[i, 'selftext_bigrams_stems'] = bigrams_stems
  df_reddit.at[i, 'selftext_bigrams_lemmas'] = bigrams_lemmas
  df_reddit.at[i, 'selftext_trigrams'] = trigrams
  df_reddit.at[i, 'selftext_trigrams_stems'] = trigrams_stems
  df_reddit.at[i, 'selftext_trigrams_lemmas'] = trigrams_lemmas

Let's quickly see what we get:

In [None]:
df_reddit.head()

Great, let's now create the lists for each subreddit and the frequency distributions. Let's start with stems:

In [None]:
# Create lists for each subreddit
meditation_stems = df_reddit[df_reddit['subreddit'] == 'Meditation']['selftext_stems'].sum()
mindfulness_stems = df_reddit[df_reddit['subreddit'] == 'Mindfulness']['selftext_stems'].sum()
headspace_stems = df_reddit[df_reddit['subreddit'] == 'Headspace']['selftext_stems'].sum()
buddhism_stems = df_reddit[df_reddit['subreddit'] == 'Buddhism']['selftext_stems'].sum()

# Create frequency distributions for each subreddit
freq_dist_meditation_stems = nltk.FreqDist(meditation_stems)
freq_dist_mindfulness_stems = nltk.FreqDist(mindfulness_stems)
freq_dist_headspace_stems = nltk.FreqDist(headspace_stems)
freq_dist_buddhism_stems = nltk.FreqDist(buddhism_stems)

# Visualizing results
freq_dist_meditation_stems.plot(NUM_RESULTS)
freq_dist_mindfulness_stems.plot(NUM_RESULTS)
freq_dist_headspace_stems.plot(NUM_RESULTS)
freq_dist_buddhism_stems.plot(NUM_RESULTS)

Not particularly useful in this case 😔. Let's do lemmas now:

In [None]:
# Create lists for each subreddit
meditation_lemmas = df_reddit[df_reddit['subreddit'] == 'Meditation']['selftext_lemmas'].sum()
mindfulness_lemmas = df_reddit[df_reddit['subreddit'] == 'Mindfulness']['selftext_lemmas'].sum()
headspace_lemmas = df_reddit[df_reddit['subreddit'] == 'Headspace']['selftext_lemmas'].sum()
buddhism_lemmas = df_reddit[df_reddit['subreddit'] == 'Buddhism']['selftext_lemmas'].sum()

# Create frequency distributions for each subreddit
freq_dist_meditation_lemmas = nltk.FreqDist(meditation_lemmas)
freq_dist_mindfulness_lemmas = nltk.FreqDist(mindfulness_lemmas)
freq_dist_headspace_lemmas = nltk.FreqDist(headspace_lemmas)
freq_dist_buddhism_lemmas = nltk.FreqDist(buddhism_lemmas)

# Visualizing results
freq_dist_meditation_lemmas.plot(NUM_RESULTS)
freq_dist_mindfulness_lemmas.plot(NUM_RESULTS)
freq_dist_headspace_lemmas.plot(NUM_RESULTS)
freq_dist_buddhism_lemmas.plot(NUM_RESULTS)

Again, not particularly informative 😫. Let's do bigrams now:

In [None]:
# Create lists for each subreddit
meditation_bigrams = df_reddit[df_reddit['subreddit'] == 'Meditation']['selftext_bigrams'].sum()
mindfulness_bigrams = df_reddit[df_reddit['subreddit'] == 'Mindfulness']['selftext_bigrams'].sum()
headspace_bigrams = df_reddit[df_reddit['subreddit'] == 'Headspace']['selftext_bigrams'].sum()
buddhism_bigrams = df_reddit[df_reddit['subreddit'] == 'Buddhism']['selftext_bigrams'].sum()

# Create frequency distributions for each subreddit
freq_dist_meditation_bigrams = nltk.FreqDist(meditation_bigrams)
freq_dist_mindfulness_bigrams = nltk.FreqDist(mindfulness_bigrams)
freq_dist_headspace_bigrams = nltk.FreqDist(headspace_bigrams)
freq_dist_buddhism_bigrams = nltk.FreqDist(buddhism_bigrams)

# Visualizing results
freq_dist_meditation_bigrams.plot(NUM_RESULTS)
freq_dist_mindfulness_bigrams.plot(NUM_RESULTS)
freq_dist_headspace_bigrams.plot(NUM_RESULTS)
freq_dist_buddhism_bigrams.plot(NUM_RESULTS)

Now we're getting more juice out of it! 🧃 Please notice how the frequency is much lower, though.

Let's go straight to trigrams for the sake of time:

In [None]:
# Create lists for each subreddit
meditation_trigrams = df_reddit[df_reddit['subreddit'] == 'Meditation']['selftext_trigrams'].sum()
mindfulness_trigrams = df_reddit[df_reddit['subreddit'] == 'Mindfulness']['selftext_trigrams'].sum()
headspace_trigrams = df_reddit[df_reddit['subreddit'] == 'Headspace']['selftext_trigrams'].sum()
buddhism_trigrams = df_reddit[df_reddit['subreddit'] == 'Buddhism']['selftext_trigrams'].sum()

# Create frequency distributions for each subreddit
freq_dist_meditation_trigrams = nltk.FreqDist(meditation_trigrams)
freq_dist_mindfulness_trigrams = nltk.FreqDist(mindfulness_trigrams)
freq_dist_headspace_trigrams = nltk.FreqDist(headspace_trigrams)
freq_dist_buddhism_trigrams = nltk.FreqDist(buddhism_trigrams)

# Visualizing results
freq_dist_meditation_trigrams.plot(NUM_RESULTS)
freq_dist_mindfulness_trigrams.plot(NUM_RESULTS)
freq_dist_headspace_trigrams.plot(NUM_RESULTS)
freq_dist_buddhism_trigrams.plot(NUM_RESULTS)

This one is also pretty interesting! 🥳

Again, please note that the frequency gets lower.

This is as far as we'll get in this notebook... but please note that we could keep trying other approaches to make frequency distributions more informative. For example, we could look only at the long words that occur relatively frequently (which could be more characteristic and informative), or focus on bigrams that occur more often than we would expect based on the frequency of the individual words ([collocations, which you can find using NLTK](https://www.nltk.org/api/nltk.collocations.html?highlight=collocation#module-nltk.collocations)).
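To give a sense of what "more often than we would expect" means, here is a hand-rolled sketch of pointwise mutual information (PMI), one common way to score bigrams. The tokens are made up and the helper `pmi` is hypothetical. NLTK's collocations module provides its own finders and measures, so treat this only as an illustration of the idea.

```python
import math
from collections import Counter

# Toy tokens, made up for illustration
tokens = ["white", "house", "said", "white", "house", "plans",
          "white", "paper", "said", "plans"]

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_unigrams = len(tokens)
n_bigrams = len(tokens) - 1

def pmi(w1, w2):
    """PMI of a bigram that occurs at least once in the tokens."""
    p_pair = bigram_counts[(w1, w2)] / n_bigrams
    p_w1 = unigram_counts[w1] / n_unigrams
    p_w2 = unigram_counts[w2] / n_unigrams
    # Positive when the pair occurs more often than if the words were independent
    return math.log2(p_pair / (p_w1 * p_w2))

print(pmi("white", "house") > pmi("said", "white"))  # True
```

In this toy example, "white house" scores higher than "said white" because the two words appear together more often relative to how common each word is on its own.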

## Bonus: Exercise 5

Haven't had enough yet? Try creating frequency distributions for only the long words in either `selftext` or `title` (or both!).

## Conclusion 🎬

This notebook provided an introduction to parsing text with NLTK. We discussed the bag of words model, tokenization, and frequency distributions, focusing on how to implement them with NLTK. We covered various approaches to tokenization, including aspects such as removing stopwords and punctuation, lowercasing, stemming and lemmatizing, and using bigrams and trigrams.

Now that you know how to parse text with NLTK, the sky is the limit! Besides calculating frequency distributions, you can use lists of tokens with other methods such as topic modeling (Latent Dirichlet Allocation) and (lexicon-based) sentiment analysis.

## Resources to continue learning 🔖

This workshop draws from other materials that you can consult as well to continue learning:

- Tutorial: [Natural Language Processing with Python's NLTK Package](https://realpython.com/nltk-nlp-python/)
- Book (available online): [Natural Language Processing with Python](https://search.library.northwestern.edu/permalink/01NWU_INST/h04e76/alma9962172594202441)
- Book (available in the library): [Text as Data](https://search.library.northwestern.edu/permalink/01NWU_INST/h04e76/alma9982095900202441)

## Answers to the exercises

### Exercise 1

In [None]:
post_exercise = df_reddit['selftext'][900]

print(post_exercise.split())
print()
print(re.split(r'\W+', post_exercise))
print()
print(nltk.word_tokenize(post_exercise))

### Exercise 2

In [None]:
df_reddit['title_tokenized'] = df_reddit['title'].apply(nltk.word_tokenize)
df_reddit[['title_tokenized']].head()

### Exercise 3

In [None]:
# Create lists of words for each subreddit
meditation_title = df_reddit[df_reddit['subreddit'] == 'Meditation']['title_tokenized'].sum()
mindfulness_title = df_reddit[df_reddit['subreddit'] == 'Mindfulness']['title_tokenized'].sum()
headspace_title = df_reddit[df_reddit['subreddit'] == 'Headspace']['title_tokenized'].sum()
buddhism_title = df_reddit[df_reddit['subreddit'] == 'Buddhism']['title_tokenized'].sum()

# Create frequency distributions for each subreddit
freq_dist_meditation_title = nltk.FreqDist(meditation_title)
freq_dist_mindfulness_title = nltk.FreqDist(mindfulness_title)
freq_dist_headspace_title = nltk.FreqDist(headspace_title)
freq_dist_buddhism_title = nltk.FreqDist(buddhism_title)

# Print most common words
print(f'Most common words in meditation: {freq_dist_meditation_title.most_common(10)}')
print()
print(f'Most common words in mindfulness: {freq_dist_mindfulness_title.most_common(10)}')
print()
print(f'Most common words in headspace: {freq_dist_headspace_title.most_common(10)}')
print()
print(f'Most common words in buddhism: {freq_dist_buddhism_title.most_common(10)}')

### Exercise 4

The answer was provided above.

### Exercise 5

In [None]:
# Let's focus on the unigrams we created

# Length threshold
LEN = 10

# Get only long words
meditation_long = [token for token in meditation if len(token) > LEN]
mindfulness_long = [token for token in mindfulness if len(token) > LEN]
headspace_long = [token for token in headspace if len(token) > LEN]
buddhism_long = [token for token in buddhism if len(token) > LEN]

# Create a frequency distribution for each subreddit
freq_dist_meditation_long = nltk.FreqDist(meditation_long)
freq_dist_mindfulness_long = nltk.FreqDist(mindfulness_long)
freq_dist_headspace_long = nltk.FreqDist(headspace_long)
freq_dist_buddhism_long = nltk.FreqDist(buddhism_long)

# Visualizing results
freq_dist_meditation_long.plot(NUM_RESULTS)
freq_dist_mindfulness_long.plot(NUM_RESULTS)
freq_dist_headspace_long.plot(NUM_RESULTS)
freq_dist_buddhism_long.plot(NUM_RESULTS)