# CMT309 – Data Science Portfolio (Spring)


In [9]:
import numpy as np
import string
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import math
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# The code block below downloads commentary.txt directly into your drive folder. Please run it as-is and do not comment it out.
from urllib import request
module_url = [f"https://drive.google.com/uc?export=view&id=18y6hLv2bqAyJsIXwVCty58lF0u7yimVq"]
name = ['commentary.txt']
for i in range(len(name)):
    with request.urlopen(module_url[i]) as f, open(name[i],'w') as outf:
        a = f.read()
        outf.write(a.decode('ISO-8859-1'))
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
import nltk
from tqdm import tqdm
tqdm.pandas()
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Q5) Text Analysis (25 marks)

In this question, we will interrogate the football commentary dataset.

In [None]:
df = pd.read_csv('commentary.txt', sep='\t')
df.head()

## Q5.1 - Preprocessing (2 marks)

You must implement a method for obtaining tokenized, PoS-tagged, and PoS-tagged-and-lemmatized versions of the `Commentary` column. You must use only `nltk` libraries. You must create 3 new columns, `Tokenized`, `PoS_tagged` and `PoS_lemmatized`, in that order:

1.- New `Tokenized` column, by lower casing and tokenizing the `Commentary` column.

2.- New `PoS_tagged` column, by pos_tagging the `Tokenized` column.

3.- New `PoS_lemmatized` column, by lemmatizing only the words in the `PoS_tagged` column. Lemmatizing after tagging ensures that the tagging function sees the original text.

An example of the returned data frame is given below, showing the first three rows of each column:

```python
>>> print(df['Tokenized'][:3])
0    [plenty, of, chances, in, this, game, but, nei...
1    [that, 's, it, !, the, referee, blows, the, fi...
2    [ball, possession, :, tottenham, :, 44, %, ,, ...
Name: Tokenized, dtype: object

>>> print(df['PoS_tagged'][:3])
0    [(plenty, NN), (of, IN), (chances, NNS), (in, ...
1    [(that, DT), ('s, VBZ), (it, PRP), (!, .), (th...
2    [(ball, DT), (possession, NN), (:, :), (totten...
Name: PoS_tagged, dtype: object

>>> print(df['PoS_lemmatized'][:3])
0    [(plenty, NN), (of, IN), (chance, NNS), (in, I...
1    [(that, DT), ('s, VBZ), (it, PRP), (!, .), (th...
2    [(ball, DT), (possession, NN), (:, :), (totten...
Name: PoS_lemmatized, dtype: object
```
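One detail behind step 3 is that a lemmatizer generally needs to know the word class, while `pos_tag` returns Penn Treebank tags. The sketch below shows one possible way to bridge the two while keeping the original tag in the output. The helper names `penn_to_wordnet`, `lemmatize_tagged` and `toy_lemmatize` are hypothetical, and the toy lemmatizer is only a stand-in so the snippet runs without NLTK data; in the notebook one would presumably pass `WordNetLemmatizer().lemmatize` instead.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun as fallback)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for every other tag


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize only the word of each (word, tag) pair; the tag is kept."""
    return [(lemmatize(word, penn_to_wordnet(tag)), tag)
            for word, tag in tagged_tokens]


# Hypothetical stand-in with the same (word, pos) call shape as
# WordNetLemmatizer().lemmatize; it only strips a plural "s" from nouns.
def toy_lemmatize(word, pos):
    if pos == "n" and word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


tagged = [("plenty", "NN"), ("of", "IN"), ("chances", "NNS"), ("in", "IN")]
lemmatized = lemmatize_tagged(tagged, toy_lemmatize)
print(lemmatized)
# [('plenty', 'NN'), ('of', 'IN'), ('chance', 'NNS'), ('in', 'IN')]
```

Passing the lemmatizer in as a callable keeps the mapping logic separate from NLTK, which makes it easy to check on a handful of hand-tagged tokens before applying it to the whole column.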

In [None]:
# Q5.1 - Your code here

## Q5.2 - Basic search engine (10 marks)

In this question, we implement a basic search engine in a function called `retrieve_similar_commentaries(df, query, k)`, which takes as input the following arguments:

- `df`: the previously enriched (tokenized, PoS-tagged, etc.) commentary dataframe.
- `query`: a string, which will be the query used to retrieve similar commentaries.
- `k`: an integer denoting the number of top commentaries to return (by similarity).

Our function must perform the following steps:

1 - Tokenize and lemmatize the input query.

2 - For each commentary in the df, compute how similar it is to the query as the number of shared tokens between query and commentary.

3 - We will prioritize noun matches, so the similarity score receives an additional +1 for each matching token in the commentary that is a noun (i.e., its part of speech starts with `N`). This means that, for example, if your query has 2 tokens, the maximum similarity a commentary can have is 4: 2 for 2 overlapping tokens, and 2 for both tokens being nouns.

4 - The function must return a list of tuples of the form `[(commentary1, sim), (commentary2, sim) ... (commentaryk, sim)]`, where commentaries are ranked by `sim` value in descending order.

An example test case is given below:

```python
>>> result = retrieve_similar_commentaries(df, "Manchester United ball", 3)
>>> for idx,r in enumerate(result):
>>>   print(idx,r)

0 ('Manchester United is in control of the ball.', 5)
1 ('Manchester United is in control of the ball.', 5)
2 ('Jadon Sancho from Manchester United crosses the ball, but it goes out for a goal kick.', 5)
```
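The scoring rule in steps 2 and 3 can be illustrated on hand-tagged data. This is a minimal sketch, not a reference solution: the helper names `similarity` and `rank` are hypothetical, the tags below are written by hand rather than produced by `pos_tag`, and it assumes each shared token is counted once, using the tag of its first occurrence in the commentary.

```python
def similarity(query_lemmas, tagged_commentary):
    """Count shared tokens, adding 1 more for each shared token tagged N*."""
    query_set = set(query_lemmas)
    seen = set()
    score = 0
    for word, tag in tagged_commentary:
        if word in query_set and word not in seen:
            seen.add(word)          # count each shared token only once
            score += 1              # +1 for the overlap itself
            if tag.startswith("N"):
                score += 1          # +1 bonus because the match is a noun
    return score


def rank(rows, query_lemmas, k):
    """Return the top-k (text, score) pairs, highest score first."""
    scored = [(text, similarity(query_lemmas, tagged)) for text, tagged in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-written (text, tagged lemmas) rows, for illustration only.
rows = [
    ("Foul by Harry Kane.",
     [("foul", "NN"), ("by", "IN"), ("harry", "NN"), ("kane", "NN"), (".", ".")]),
    ("Manchester United is in control of the ball.",
     [("manchester", "NN"), ("united", "VBD"), ("is", "VBZ"), ("in", "IN"),
      ("control", "NN"), ("of", "IN"), ("the", "DT"), ("ball", "NN"), (".", ".")]),
    ("The ball goes out of play.",
     [("the", "DT"), ("ball", "NN"), ("go", "VBZ"), ("out", "IN"),
      ("of", "IN"), ("play", "NN"), (".", ".")]),
]

query = ["manchester", "united", "ball"]
result = rank(rows, query, 2)
for idx, r in enumerate(result):
    print(idx, r)
```

With these hand-assigned tags, the second row scores 5: three shared tokens, two of which are nouns. Because `sorted` is stable, commentaries with equal scores keep their original order.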

In [None]:
def retrieve_similar_commentaries(df, query, k):
    # your code here
    pass

## Q5.3 - PMI (13 marks)

In this question, you implement and apply the pointwise mutual information (PMI) metric, a word association metric introduced in 1992, to the football commentaries. The purpose of PMI is to extract, from free text, pairs of words or phrases that tend to co-occur more often than expected by chance. For example, PMI(`new`, `york`) would give a higher score than PMI(`new`, `car`) because the chance of finding `new` and `york` together in text is higher than `new` and `car`, despite `new` being a more frequent word than `york`.

The formula for PMI (where `x` and `y` are two words) is:

$\mathrm{PMI}(x,y) = \log\left(\frac{p(x,y)}{p(x)\,p(y)}\right)$

Watch this video to understand how to estimate these probabilities: https://www.youtube.com/watch?v=swDoFpuHpzQ.
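As a small worked illustration of the formula, the sketch below estimates the probabilities from raw counts. The numbers are invented, the function name `pmi` is hypothetical, and two choices are assumptions rather than requirements of the task: probabilities are relative frequencies over a single shared total, and the logarithm is base 2.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI from raw counts; returns 0.0 when x and y never co-occur."""
    if count_xy == 0:
        return 0.0
    p_xy = count_xy / total   # estimated joint probability p(x, y)
    p_x = count_x / total     # estimated marginal probability p(x)
    p_y = count_y / total     # estimated marginal probability p(y)
    return math.log2(p_xy / (p_x * p_y))


# Invented counts: x occurs 50 times, y 20 times, together 10 times,
# out of 1000 observations.
score = pmi(10, 50, 20, 1000)
print(round(score, 3))  # 3.322, i.e. log2(0.01 / (0.05 * 0.02)) = log2(10)
```

A ratio of 10 means the pair is seen ten times more often than it would be if the two items were independent; a pair that co-occurs exactly as often as chance predicts gets a PMI of 0.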

Detailed instructions:

You will implement the following logic:

- **Phrase Extraction**: The first step is to extract noun phrases (NPs) and verb phrases (VPs) from the lemmatized data. To do this, you'll need to write a function that goes through each entry and groups words into noun phrases or verb phrases based on their part-of-speech tags. We will reward cases where NPs and VPs go beyond single word matching.

- **Phrase Counting**: Once you have extracted the NPs and VPs, you'll need to count how many times each phrase occurs in the dataset. You'll have to write a function that iterates through the NPs and VPs and keeps track of the counts in dictionaries.

- **Total Counts**: The next step is to compute the total count of all NPs and VPs. This is simply the sum of all the counts in the dictionaries you created in the previous step.

- **Identifying Top Phrases**: To reduce computational complexity, we only want to compute PMI for the top occurring NPs and VPs. So, you will need to write a function that sorts the phrases by their counts and selects the top 100 phrases.

- **Creating the PMI Matrix**: Finally, you'll create a PMI matrix using the top NPs and VPs, their counts, and the total counts of NPs and VPs. This matrix will be a pandas DataFrame, which will have rows corresponding to the top VPs, columns corresponding to the top NPs, and each cell will contain the PMI value between the corresponding NP and VP. This part of your solution will return 0 when there is no co-occurrence between an NP and a VP, and apply smoothing only to the final PMI value (refer to the video).
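The first four steps above can be sketched on a few hand-tagged rows. This is only one simple heuristic, stated as an assumption: a noun phrase is taken to be a maximal run of consecutive `N*` tags and a verb phrase a maximal run of consecutive `V*` tags. The function name `extract_phrases` and the sample rows are hypothetical.

```python
from collections import Counter
from itertools import groupby


def extract_phrases(tagged_tokens):
    """Group consecutive noun tags into NPs and consecutive verb tags into VPs."""
    nps, vps = [], []
    # groupby only merges neighbours, so each group is one maximal run of
    # tokens whose tags share the same first letter.
    for kind, group in groupby(tagged_tokens, key=lambda pair: pair[1][:1]):
        phrase = " ".join(word for word, _ in group)
        if kind == "N":
            nps.append(phrase)
        elif kind == "V":
            vps.append(phrase)
    return nps, vps


# Hand-tagged rows standing in for the PoS_lemmatized column.
rows = [
    [("joao", "NN"), ("cancelo", "NN"), ("cut", "VBZ"), ("inside", "RB")],
    [("joao", "NN"), ("cancelo", "NN"), ("arrives", "VBZ"), ("late", "RB")],
    [("referee", "NN"), ("blow", "VBZ"), ("the", "DT"), ("whistle", "NN")],
]

np_counts, vp_counts = Counter(), Counter()
for row in rows:
    nps, vps = extract_phrases(row)
    np_counts.update(nps)   # phrase counting
    vp_counts.update(vps)

total_nps = sum(np_counts.values())   # total counts
total_vps = sum(vp_counts.values())
top_nps = [phrase for phrase, _ in np_counts.most_common(2)]  # top phrases

print(np_counts)
print(total_nps, total_vps, top_nps)
```

Grouping by the first letter of the tag is what lets `joao cancelo` survive as a two-word phrase; a richer grammar (for example, allowing adjectives inside an NP) would be a natural extension.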

You must implement all the functionality in a function called `compute_pmi_dataframe(df)` that takes as input the enriched `df` you created in `Q5.1`. You are encouraged to implement additional functions to break down your code and make it clear how you are separating different functionalities.
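One such helper could be the matrix-building step. The sketch below builds the table as a nested dictionary with verb phrases as rows and noun phrases as columns. Everything about the estimation is an assumption made for illustration: `cooc` is a hypothetical dictionary of (NP, VP) co-occurrence counts, the joint probability is the pair count over the sum of all pair counts, the marginals use the NP and VP totals, the logarithm is base 2, and no smoothing is applied.

```python
import math


def pmi_table(cooc, np_counts, vp_counts):
    """Nested dict {vp: {np: pmi}}; cells without co-occurrence are 0.0."""
    total_nps = sum(np_counts.values())
    total_vps = sum(vp_counts.values())
    total_pairs = sum(cooc.values())
    table = {}
    for vp, vp_count in vp_counts.items():
        row = {}
        for noun_phrase, np_count in np_counts.items():
            joint = cooc.get((noun_phrase, vp), 0)
            if joint == 0:
                row[noun_phrase] = 0.0          # no co-occurrence observed
            else:
                p_xy = joint / total_pairs      # assumed joint estimate
                p_x = np_count / total_nps      # assumed NP marginal
                p_y = vp_count / total_vps      # assumed VP marginal
                row[noun_phrase] = math.log2(p_xy / (p_x * p_y))
        table[vp] = row
    return table


# Invented counts for illustration.
np_counts = {"joao cancelo": 2, "referee": 2}
vp_counts = {"cut": 2, "blow": 2}
cooc = {("joao cancelo", "cut"): 2, ("referee", "blow"): 2}

table = pmi_table(cooc, np_counts, vp_counts)
print(table["cut"])  # {'joao cancelo': 1.0, 'referee': 0.0}
```

A nested dictionary in this shape can be turned into a DataFrame with `pd.DataFrame.from_dict(table, orient="index")`, which places the outer keys (the VPs) on the rows and the inner keys (the NPs) on the columns.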

An example test case is given below:

```python
>>> def top_k_vps(pmi_matrix, np, k):
>>>    # Check if the NP exists in the matrix
>>>    if np in pmi_matrix.T.index:
>>>        top_vps = pmi_matrix.T.loc[np].nlargest(k)
>>>        return top_vps.index.tolist()
>>>    else:
>>>        print(f"Noun phrase '{np}' not found in PMI matrix.")
>>>        return []
>>> top_k_vps(pmidf, 'joao cancelo', 3)

['arrives', 'cut', 'benefit']
```

In [None]:
def compute_pmi_dataframe(df):
    # 1 - PHRASE EXTRACTION
    # your code here

    # 2 - PHRASE COUNTING
    # your code here

    # 3 - TOTAL COUNTS
    # your code here

    # 4 - FIND TOP PHRASES
    # your code here

    # 5 - CREATE PMI MATRIX
    # your code here
    return pmi_matrix

pmidf = compute_pmi_dataframe(df)

In [13]:
# you can test your resulting matrix
def top_k_vps(pmi_matrix, np, k):
    # Check if the NP exists in the matrix
    if np in pmi_matrix.T.index:
        top_vps = pmi_matrix.T.loc[np].nlargest(k)
        return top_vps.index.tolist()
    else:
        print(f"Noun phrase '{np}' not found in PMI matrix.")
        return []
top_k_vps(pmidf, 'joao cancelo', 3)

['arrives', 'cut', 'benefit']