Assignment 1 - NLTK and corpus functions

---

Subcorpus for Recipe (Takaya Shirai)

0. Preparation

In [113]:
## Importing libraries
import nltk
from nltk.probability import FreqDist
from nltk.corpus import PlaintextCorpusReader
from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer

In [114]:
## Creating the recipe corpus from the data directory
corpus_root = "./data"
reviews = PlaintextCorpusReader(corpus_root, '.*')
recipe_corpus = reviews.words('recipe.txt')

---

1. The length

In [115]:
## Calculate the length of the original recipe corpus
recipe_corpus_len = len(recipe_corpus)
print(f"Original corpus length: {recipe_corpus_len}")

## Create a corpus containing only alphabetic words and convert them to lowercase
recipe_alpha_corpus = [word.lower() for word in recipe_corpus if word.isalpha()]
recipe_alpha_corpus_len = len(recipe_alpha_corpus)
print(f"Alphabet-only and lower-cased corpus length: {recipe_alpha_corpus_len}")

Original corpus length: 8747
Alphabet-only and lower-cased corpus length: 7093


---

2. The lexical diversity

In [117]:
## Function for calculating the lexical diversity of a text
def lexical_diversity(text):
    # Lower-case once, then divide distinct types by total tokens
    words = [w.lower() for w in text]
    return len(set(words)) / len(words)

In [118]:
## Print the lexical diversities
print(f"lexical diversity for the recipe corpus: {lexical_diversity(recipe_corpus)}")
print(f"lexical diversity for the alphabet only recipe corpus: {lexical_diversity(recipe_alpha_corpus)}")

lexical diversity for the recipe corpus: 0.18029038527495142
lexical diversity for the alphabet only recipe corpus: 0.210630198787537
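
To make the ratio concrete, here is a minimal sketch on a hypothetical toy token list (`toy_tokens` is made up for illustration and is not taken from recipe.txt). Lexical diversity is the number of distinct lower-cased types divided by the total number of tokens, so dropping punctuation changes both the numerator and the denominator. The last lines work backwards from the two values printed above to recover the implied number of distinct types.

```python
def lexical_diversity(text):
    """Distinct lower-cased types divided by total tokens."""
    words = [w.lower() for w in text]
    return len(set(words)) / len(words)

# Hypothetical toy tokens, for illustration only (not from recipe.txt)
toy_tokens = ["Mix", "the", "flour", "and", "the", "sugar", ",", "then", "mix", "again", "."]
toy_alpha_tokens = [t.lower() for t in toy_tokens if t.isalpha()]

toy_diversity = lexical_diversity(toy_tokens)              # 9 types / 11 tokens
toy_alpha_diversity = lexical_diversity(toy_alpha_tokens)  # 7 types / 9 tokens
print(toy_diversity, toy_alpha_diversity)

# Back-calculation from the reported results: types = diversity * tokens
implied_types = round(0.18029038527495142 * 8747)
implied_alpha_types = round(0.210630198787537 * 7093)
print(implied_types, implied_alpha_types)
```

On the recipe corpus this works out to 1577 distinct types among 8747 tokens, and 1494 distinct types among 7093 alphabetic tokens.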


---

3. Top 10 most frequent words and their counts

In [119]:
## Calculate the frequency distributions for the original and alphabet-only recipe corpora
freq_dist = FreqDist(recipe_corpus)
freq_dist_alpha = FreqDist(recipe_alpha_corpus)

## Store the 10 most frequent words from each frequency distribution
most_freq_words = freq_dist.most_common(10)
most_freq_words_alpha = freq_dist_alpha.most_common(10)

In [120]:
## Print the 10 most frequent words
print("10 most frequent words in the original recipe corpus:")
for word, count in most_freq_words:
    print(f"word: '{word}', count: {count}")

print()

print("10 most frequent words in the alphabet-only recipe corpus:")
for word, count in most_freq_words_alpha:
    print(f"word: '{word}', count: {count}")

10 most frequent words in the original recipe corpus:
word: '.', count: 379
word: ',', count: 345
word: 'the', count: 334
word: 'and', count: 199
word: 'a', count: 172
word: 'to', count: 157
word: 'in', count: 114
word: 'of', count: 112
word: '’', count: 107
word: 'it', count: 105

10 most frequent words in the alphabet-only recipe corpus:
word: 'the', count: 351
word: 'and', count: 206
word: 'a', count: 176
word: 'to', count: 166
word: 'in', count: 125
word: 'it', count: 117
word: 'of', count: 112
word: 'for', count: 92
word: 'with', count: 89
word: 'is', count: 76
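
The counts for the same word differ between the two lists (for example 'the': 334 vs. 351). The most likely reason is that the original corpus is counted case-sensitively, while the alphabet-only corpus is lower-cased first, so 'The' and 'the' are merged there. A minimal sketch of the effect, using `collections.Counter` (which `FreqDist` subclasses) on hypothetical toy tokens so that it runs without NLTK:

```python
from collections import Counter

# Hypothetical toy tokens, for illustration only (not from recipe.txt)
toy_tokens = ["The", "sauce", ",", "the", "pan", "and", "the", "Sauce", "."]

# Case-sensitive count, punctuation included (like freq_dist)
case_sensitive = Counter(toy_tokens)

# Lower-cased, alphabetic-only count (like freq_dist_alpha)
lowered_alpha = Counter(t.lower() for t in toy_tokens if t.isalpha())

print(case_sensitive["the"], case_sensitive["The"])
print(lowered_alpha.most_common(2))
```

Here 'the' is counted twice and 'The' once in the case-sensitive version, but three times after lower-casing.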


---

4. Words that are at least 10 characters long and their counts

In [121]:
## Store the words that are at least 10 characters long, together with their counts
long_words_with_counts = [(word, count) for word, count in freq_dist.items() if len(word) >= 10]
long_alpha_words_with_counts = [(word, count) for word, count in freq_dist_alpha.items() if len(word) >= 10]

## Print the words that are at least 10 characters long and their counts
print("Words that are at least 10 characters long in the original recipe corpus:")
for word, count in long_words_with_counts:
    print(f"word: '{word}', count: {count}")

print()

print("Words that are at least 10 characters long in the alphabet-only recipe corpus:")
for word, count in long_alpha_words_with_counts:
    print(f"word: '{word}', count: {count}")

Words that are at least 10 characters long in the original recipe corpus:
word: 'cranberries', count: 3
word: 'pistachios', count: 3
word: 'Ingredients', count: 9
word: 'tablespoons', count: 12
word: 'tablespoon', count: 4
word: 'ingredients', count: 18
word: 'Directions', count: 7
word: 'vegetables', count: 6
word: 'overlapping', count: 1
word: 'Alternatives', count: 2
word: 'breadcrumbs', count: 7
word: 'separately', count: 1
word: 'beforehand', count: 1
word: 'breadcrumb', count: 1
word: 'Worcestershire', count: 2
word: 'Substitute', count: 3
word: 'wholegrain', count: 1
word: 'portobello', count: 1
word: 'substitute', count: 3
word: 'throughout', count: 3
word: 'reasonably', count: 1
word: 'traditional', count: 2
word: 'Bourguignon', count: 1
word: 'cauliflower', count: 1
word: 'Vietnamese', count: 1
word: 'caramelise', count: 1
word: 'lemongrass', count: 2
word: 'incredible', count: 1
word: 'difference', count: 2
word: 'reputation', count: 1
word: 'overcooked', count: 2
word: 'fla

---

5. The longest sentence (type the sentence and give the number of words)

In [122]:
## Retrieve the sentences from the recipe corpus
recipe_sentences = reviews.sents('recipe.txt')

## Find the longest sentence (the first one wins if several share the maximum length)
longest_sentence = max(recipe_sentences, key=len, default=[])

## Join the words of the longest sentence into a single string for printing
joined_longest_sentence = ' '.join(longest_sentence)

## Print the longest sentence along with the word count
print(f"longest sentence:\n{joined_longest_sentence}")
print()
print(f"number of words: {len(longest_sentence)}")

longest sentence:
The texture of tenderised slow cooking cuts of beef is not quite the same as steak cuts , but it is still soft and tender , many of them have excellent beefy flavour ( like short rib ) and I would not hesitate to use any of them if that ’ s all I had !

number of words: 56
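
Note that 56 is a token count: `sents()` returns punctuation marks and the two halves of "that’s" as separate tokens. A minimal sketch, assuming the printed sentence above is split on spaces to recover the tokens, that separates alphabetic tokens from the rest:

```python
# The longest sentence as printed above, tokens separated by single spaces
longest_sentence_text = (
    "The texture of tenderised slow cooking cuts of beef is not quite the same "
    "as steak cuts , but it is still soft and tender , many of them have "
    "excellent beefy flavour ( like short rib ) and I would not hesitate to use "
    "any of them if that ’ s all I had !"
)

tokens = longest_sentence_text.split()
alpha_tokens = [t for t in tokens if t.isalpha()]
non_alpha_tokens = [t for t in tokens if not t.isalpha()]

print(f"all tokens: {len(tokens)}")
print(f"alphabetic tokens: {len(alpha_tokens)}")
print(f"non-alphabetic tokens: {non_alpha_tokens}")
```

By this count the sentence has 50 alphabetic tokens (one of which is the stray "s" from "that’s") and 6 punctuation tokens.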


---

6. A stemmed version of the longest sentence

In [112]:
## Initialize the stemmers
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()

## Stem the words of the longest sentence using both stemmers
port_stemmed_sentence = [porter_stemmer.stem(word) for word in longest_sentence]
lanc_stemmed_sentence = [lancaster_stemmer.stem(word) for word in longest_sentence]

## Join the stemmed words into single strings for printing
joined_port_stemmed_sentence = ' '.join(port_stemmed_sentence)
joined_lanc_stemmed_sentence = ' '.join(lanc_stemmed_sentence)

## Print both stemmed versions of the longest sentence
print(f"porter stemmed longest sentence:\n{joined_port_stemmed_sentence}")
print()
print(f"lancaster stemmed longest sentence:\n{joined_lanc_stemmed_sentence}")

porter stemmed longest sentence:
the textur of tenderis slow cook cut of beef is not quit the same as steak cut , but it is still soft and tender , mani of them have excel beefi flavour ( like short rib ) and i would not hesit to use ani of them if that ’ s all i had !

lancaster stemmed longest sentence:
the text of tend slow cook cut of beef is not quit the sam as steak cut , but it is stil soft and tend , many of them hav excel beefy flavo ( lik short rib ) and i would not hesit to us any of them if that ’ s al i had !


---

7. Overall (not for each subcorpus): A reflection (1 paragraph or so): What do the most frequent words, the longest words, and longest sentence tell you about each of the 3 genres? How do you interpret the lexical diversity?

---