# <span style="color:red"> Lecture 24 - Application of Text Data: Trigram Language Model  </span>

<font size = "4">

- Today's lecture will briefly discuss an application of text data: predictive language models.

- This will just scratch the surface, and generative AI like ChatGPT uses much more sophisticated models.

- We will focus on the coding aspect, omitting almost all the probabilistic details. Check out DATASCI 340 or [this free textbook](https://web.stanford.edu/~jurafsky/slp3/) for more information.

- We'll estimate a very simple probability distribution from a trigram (3-gram) language model, a specific case of the N-gram language model.

$\qquad$ <img src="files_lec24/trigram.png" alt="drawing" width="500"/>


<font size = "2">

(quote taken from Chapter 3 of the textbook linked above)



##### Import necessary libraries:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

##### Load in a dataset of hotel reviews and replace some awkwardly named columns

In [None]:
hotels = pd.read_csv("files_lec24/Datafiniti_Hotel_Reviews.csv")

hotels = hotels.rename(columns = {"reviews.text": "review", "reviews.rating":"rating"})

display(hotels)

<font size = "4">

- First, we will cover another aspect of regular expressions that we haven't seen yet.

- If we want to search the reviews for the word "this", we can pass the word itself as the pattern; no special regular expression syntax is needed. The same goes for the word "that".

In [None]:
this = hotels["review"].str.findall("this")
display(this)

that = hotels["review"].str.findall("that")
display(that)

<font size = "4">

- But what if we want to search reviews for **either** one of the words?

- Both words start with "th". Then we need to look for either "is" or "at".

- We can imagine searching one character at a time:

    - Look for "t"
    - Look for "h"
    - Look for either "i" or "a"
    - Look for either "s" or "t"

- We can put brackets around the character choices like we do below:

In [None]:
this_that = hotels["review"].str.findall("th[ia][st]")

this_that

<font size = "4">

- We also might want to ignore capitalization and find words starting with either "T" or "t".

In [None]:
this_that = hotels["review"].str.findall("[Tt]h[ia][st]")
this_that

<font size = "4">

**Full disclosure:** The code we have above would also find the words "thit" and "thas". If we're confident there are no misspellings, we don't have to worry. Of course, it's not a good idea to be so trusting in general.
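
The caveat above can be checked on a made-up sentence. This is only a sketch: it uses Python's built-in `re` module instead of pandas, the sample text is hypothetical, and the alternation pattern `th(?:is|at)` is one possible stricter alternative that we have not covered in this course.

```python
import re

# Hypothetical sample text with a deliberate misspelling ("thit")
text = "This hotel was nicer than that one, but thit typo slipped in."

# Character classes: each bracket matches ONE character from the set,
# so any combination of [ia] followed by [st] is accepted
loose = re.findall(r"[Tt]h[ia][st]", text)

# Alternation (assumed alternative): the ending must be exactly "is" or "at"
strict = re.findall(r"[Tt]h(?:is|at)", text)

print(loose)
print(strict)
```

The bracket version picks up "thit", while the alternation version only keeps "This" and "that".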

## <span style="color:red"> I. Find all individual words </span>

<font size = "4">

- We will search for **all** words of any kind. This will convert each review into a list of individual words.

- Therefore, we should search for runs of one or more consecutive characters drawn from A thru Z, a thru z, and the apostrophe (to cover words like "don't").

- We will extract each individual word (a.k.a. token) from the "review" column.



In [None]:
hotels["tokens"] = hotels["review"].str.findall("[A-Za-z']+")

display(hotels[["review", "tokens"]])
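
To see what this pattern keeps and what it throws away, here is a small sketch on a single made-up review. It uses `re.findall` from Python's built-in `re` module, which does for one string what `.str.findall` does for each row; the review text is hypothetical.

```python
import re

# Hypothetical review containing digits, punctuation and a hyphen
review = "We didn't love Room 204, but the staff were first-rate!"

# Same pattern as above: runs of letters and apostrophes
tokens = re.findall(r"[A-Za-z']+", review)

print(tokens)
```

Notice that the number and the punctuation disappear, and that the hyphenated word "first-rate" is split into two tokens.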

<font size = "4">

- At this point, we will convert all letters to lowercase. We do not want to treat the words "we" and "We" differently.

In [None]:
def lower_list(lst):
    # Input "lst" will correspond to the list in each entry of "tokens" column

    out = []

    # If there is a missing value (NaN) in the "review" column,
    # there will be a NaN in the "tokens" column too.
    # So this if statement checks for NaN's
    if not isinstance(lst, list):
        return out

    # Loop over list, convert strings to lowercase and append to output list
    for w in lst:
        out.append(w.lower())
    return out

# Use .apply() on the "tokens" column
hotels["tokens"] = hotels["tokens"].apply(lower_list)
hotels["tokens"]
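
As a quick check of the NaN guard, here is a sketch of a more compact variant. The name `lower_list_compact` is made up for this illustration; it is meant to behave the same as `lower_list` above, using a list comprehension.

```python
def lower_list_compact(lst):
    # Hypothetical compact variant of lower_list
    # Anything that is not a list (e.g. NaN) becomes an empty list
    if not isinstance(lst, list):
        return []
    return [w.lower() for w in lst]

normal_case = lower_list_compact(["We", "LOVED", "it"])
missing_case = lower_list_compact(float("nan"))

print(normal_case)
print(missing_case)
```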


## <span style="color:red"> II. Construct trigrams </span>

<font size = "4">

- We now want to construct "trigrams" from the list of words, which are simply collections of three consecutive words.

- So for the sentence "this was a very bad hotel", we would collect the trigrams:
    - (this, was, a)
    - (was, a, very)
    - (a, very, bad)
    - (very, bad, hotel)

- We will collect each trigram in a **tuple**, not a list. A tuple is very similar to a list, and is defined using "()" instead of "[]".

- The reason for using tuples instead of lists is technical: tuples can be used as the Index of a DataFrame/Series, while lists cannot. We will need to do this later in the notebook.

- Here is an example for the first review, where we collect a **list of tuples** (list of trigrams)

In [None]:
words = hotels["tokens"].iloc[0]
print(words)
N = len(words)

trigram_list = []
for i in range(N-2):
    trigram = (words[i], words[i+1], words[i+2]) # "()" means this is a tuple 
    trigram_list.append(trigram)

trigram_list
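
The technical difference between tuples and lists mentioned above comes down to hashing: a tuple of strings is hashable, while a list is not. The sketch below illustrates this with Python's built-in `Counter`; the variable names are made up for the example.

```python
from collections import Counter

as_tuple = ("this", "was", "a")
as_list = ["this", "was", "a"]

# Tuples are hashable, so they can be used as keys when counting
counts = Counter([as_tuple, as_tuple, ("was", "a", "very")])
print(counts[as_tuple])

# Lists are not hashable, so hash() raises a TypeError
try:
    hash(as_list)
    list_is_hashable = True
except TypeError:
    list_is_hashable = False
print(list_is_hashable)
```

Counting methods such as ``.value_counts()`` rely on hashing, which is why we store each trigram as a tuple.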

<font size = "4">

- Now we use ``.apply()`` to perform this for every review.

- I've provided different versions of the "get_trigrams" function. They all do exactly the same thing.

In [None]:
def get_trigrams(words):
    N = len(words)
    trigram_list = []
    for i in range(N-2):
        trigram = (words[i], words[i+1], words[i+2])
        trigram_list.append(trigram)
    return trigram_list

#### 3 other equivalent versions of this function below

def get_trigrams2(words):
    trigram_list = []
    for i in range(len(words) - 2):
        three_words = words[i:i+3]   # using slicing with ":"
        trigram = tuple(three_words) # convert to tuple
        trigram_list.append(trigram) 
    return trigram_list

def get_trigrams3(words):
    trigram_list = []
    for i in range(len(words) - 2):
        trigram_list.append(tuple(words[i:i+3]))
    return trigram_list

def get_trigrams4(words):
    return [tuple(words[i:i+3]) for i in range(len(words) - 2)] # list comprehension



hotels["trigrams"] = hotels["tokens"].apply(get_trigrams)
hotels["trigrams"]
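
As a sanity check, the sketch below runs the list comprehension version on the example sentence from earlier ("this was a very bad hotel"), and also on a made-up two-word review to show what happens when there are fewer than three words.

```python
def get_trigrams(words):
    # Same logic as get_trigrams4 above
    return [tuple(words[i:i+3]) for i in range(len(words) - 2)]

sentence = "this was a very bad hotel".split()
sentence_trigrams = get_trigrams(sentence)
for t in sentence_trigrams:
    print(t)

# Hypothetical very short review: no trigram can be formed
short_trigrams = get_trigrams(["great", "stay"])
print(short_trigrams)
```

A review with fewer than three words produces an empty list, because `range()` of a non-positive number is empty.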

## <span style="color:red"> III. Unravel trigrams into a single Pandas Series </span>

<font size = "4">

- The "trigrams" column has 10,000 rows, and each row holds a list of trigrams.

- Rows corresponding to shorter reviews will have fewer trigrams.

- We now use the ``.explode()`` method to "unravel" this column into a single Pandas Series containing **all** the trigrams.

In [None]:
all_trigrams = hotels["trigrams"].explode()
display(all_trigrams)
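
Here is a small sketch of what ``.explode()`` does, using a made-up Series of three "reviews". The data and variable names are hypothetical.

```python
import pandas as pd

toy = pd.Series([
    [("a", "b", "c"), ("b", "c", "d")],  # review with two trigrams
    [],                                   # review too short for any trigram
    [("x", "y", "z")],                    # review with one trigram
])

exploded = toy.explode()

print(exploded)
print(len(exploded))
```

Each trigram gets its own row, and the index label of the original review is repeated. An empty list becomes a single NaN row, which ``.value_counts()`` ignores by default.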

## <span style="color:red"> IV. Compute probability distribution of trigrams </span>

<font size = "4">

- We can use the ``.value_counts`` method to see the most common trigrams appearing in all the reviews.

In [None]:
all_trigrams.value_counts()

<font size = "4">

- However, language models are **probabilistic** in nature. 

- Instead of value counts, we would like a probability distribution.

- Instead of counting how many times a trigram appears, we should calculate its **proportion** of appearances.

- This can be done by passing ``normalize = True`` into the ``.value_counts`` method.

In [None]:
tri_probs = all_trigrams.value_counts(normalize = True)
tri_probs
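
To see what ``normalize = True`` is doing, the sketch below compares it with dividing the counts by their total on a tiny made-up Series of trigrams.

```python
import pandas as pd

# Hypothetical trigrams; the first one appears twice
toy_trigrams = pd.Series([
    ("the", "room", "was"),
    ("the", "room", "was"),
    ("the", "room", "is"),
    ("room", "was", "clean"),
])

counts = toy_trigrams.value_counts()
probs = toy_trigrams.value_counts(normalize=True)

# Same result by hand: each count divided by the total count
manual = counts / counts.sum()

print(probs)
print(probs.sum())
```

The proportions add up to 1, which is what we need for a probability distribution over all trigrams.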

<font size = "4">

Let's make a barplot of the probabilities of the top 15 trigrams appearing in the reviews:

In [None]:
tri_probs.head(15).plot(kind='bar', figsize=(12,4))
plt.title("Top 15 Trigrams")
plt.ylabel("Probability")
plt.show()

## <span style="color:red"> V. Predict next word </span>

<font size = "4">

- Suppose we want to continue the phrase "We thought the room ...".

- Since we are using the trigram language model, we need to base our decision on the last two words: (the, room, ????)

- What is the most probable next word? 

- **Note:** Since "trigrams" is the index column, we cannot use the ``.apply`` method. However, there is a very similar ``.index.map()`` method.

In [None]:
prefix = ("the", "room") # predict next word based on these two

def check_for_prefix(trigram):
    # True or False: first two elements of trigram are "the" and "room"
    return trigram[:2] == prefix

bool_prefix = tri_probs.index.map(check_for_prefix)
candidate_trigrams = tri_probs[bool_prefix]

display(candidate_trigrams.sort_values(ascending = False))

<font size = "4">

**Note:** Sometimes I write more verbose Python code than necessary, just to make it more readable to students. The following cell does the same thing more compactly:

In [None]:
candidates = tri_probs[tri_probs.index.map(lambda x: x[:2] == ('the', 'room'))]

candidates.sort_values(ascending = False)
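
The whole prediction step can be followed by hand on a tiny made-up example. This is a sketch under a few assumptions: the trigrams are hypothetical, and the mask is built with a list comprehension over the index instead of ``.index.map()``.

```python
import numpy as np
import pandas as pd

# Hypothetical trigrams standing in for all_trigrams
toy_trigrams = pd.Series([
    ("the", "room", "was"),
    ("the", "room", "was"),
    ("the", "room", "is"),
    ("room", "was", "clean"),
])
toy_probs = toy_trigrams.value_counts(normalize=True)

prefix = ("the", "room")

# Boolean mask: True where the first two words match the prefix
mask = np.array([trigram[:2] == prefix for trigram in toy_probs.index])
toy_candidates = toy_probs[mask]

# The predicted word is the third word of the most probable candidate
best_trigram = toy_candidates.idxmax()
predicted_word = best_trigram[2]
print(predicted_word)
```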

<font size = "4">

- **Q:** Do the entries of ``candidates`` represent a true probability distribution? Remember that language models are probabilistic.

- How can I check?

- If it's not, how can I make it a probability distribution?

In [None]:
# Check if it is a probability distribution.
# If it isn't, make the necessary change

# ????
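
One possible way to approach this question is sketched below on made-up numbers (this is not necessarily the intended solution). The candidate probabilities are hypothetical, and the index holds only the third word for brevity. The idea is that a probability distribution must sum to 1, and dividing by the total gives the conditional probability of each next word given the prefix.

```python
import pandas as pd

# Hypothetical probabilities of trigrams that start with ("the", "room")
toy_candidates = pd.Series([0.50, 0.25], index=["was", "is"])

# Check: a probability distribution must sum to 1
total = toy_candidates.sum()
print(total)

# Fix: divide by the total so the candidates sum to 1
conditional = toy_candidates / total
print(conditional)
print(conditional.sum())
```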


## <span style="color:red"> VI. Practice quiz </span>

We will use the dataset of the top 1000 movies according to imdb.com (as of 2020).

In [None]:
df = pd.read_csv("files_lec24/imdb_top_1000.csv")

# Q1 

<font size = "4">

- The "Genre" column describes the category or categories of each movie in the dataset. 

- Each movie can have one or more genres. For example, "The Shawshank Redemption" is classified as "Drama", while "The Dark Knight" is classified as "Action, Crime, Drama".

- One of the genres is "Animation". Using a **single Pandas method**, replace all appearances of the string "Animation" with the new string "Animated".

- For example, the 23rd ranked movie "Sen to Chihiro no kamikakushi" is originally classified as "Animation, Adventure, Family". After you perform the replacement correctly, it will be classified as "Animated, Adventure, Family". 

In [None]:
# your code here



# Q2

- Create a new DataFrame called ``df_music`` that only contains movies that are classified as "Music".

- "Music" does not have to be the only genre for each movie. For example, your DataFrame should include the 34th ranked movie "Whiplash", which is classified as "Drama, Music"

- **Using Python commands**, print the number of movies that have the "Music" classification as part of their genre.

In [None]:
# Your code here


# Q3

- The "Overview" column contains a text description of each movie.

- Define a Pandas Series containing all appearances of the two words "scent" and "sent". It should **only** include cases where a match is found.

- To search for the words, use ``.str.findall`` **one time**.

- To remove the rows where no matches are found, use ``.str.len`` **one time**.

In [None]:
# your answer here



# Q4

- The "Overview" column contains a text description of each movie.

- Define a Pandas Series containing all appearances of the two words "England" and "English". It should **only** include cases where a match is found.

- To search for the words, use ``.str.findall`` **one time**.

- To remove the rows where no matches are found, use ``.str.len`` **one time**.

In [None]:
# your answer here




# Q5

- Using ``.str.findall`` and ``.apply``, convert all strings from the "Genre" column into **tuples** of individual words.

- For example, the top 3 ranked movies (The Shawshank Redemption, The Godfather, The Dark Knight) are classified as "Drama", "Crime, Drama", "Action, Crime, Drama", respectively. After converting to tuples, they would read:

    - ("Drama",)
    - ("Crime", "Drama")
    - ("Action", "Crime", "Drama")

- Each string in the tuple should be **lowercase only**

- Save this Pandas Series of tuples as a new column to the DataFrame ``df`` with the name "tokens".

- **Warning:** There are two genres with a dash: "Sci-Fi" and "Film-Noir". Make sure that the row corresponding to "Inception" has the tuple ("action", "adventure", "sci-fi") instead of the tuple ("action", "adventure", "sci", "fi")

In [None]:
# your code here



# Q6

- After Q5, the column df["tokens"] is a Pandas Series of length 1,000. Each row has a single tuple containing one or more strings.

- Create a new Pandas Series called ``genre_appearances`` where each row consists of a **single string**, one for each individual string appearing in df["tokens"].

- To be clear, the first 3 rows of df["tokens"] will be:

    |  | tokens |
    | --- | --- |
    | 0 | ("drama",) |
    | 1 | ("crime", "drama") |
    | 2 | ("action", "crime", "drama") |
    | ⋮ | ⋮ |

    while the first 6 rows of ``genre_appearances`` will be:

    |  | tokens |
    | --- | --- |
    | 0 | drama |
    | 1 | crime |
    | 1 | drama |
    | 2 | action |
    | 2 | crime |
    | 2 | drama |
    | ⋮ | ⋮ |

</br>

- Create a Pandas Series called ``genre_probs`` containing the probabilities/proportions of each genre appearing in the dataset.

- Generate a bar plot showing the probabilities/proportions of **all** genres appearing in the dataset. Add a title and label the y-axis.


In [None]:
# your code here

