<b>Discussion Points</b>
<ul>
<li>Is one engine better than another? </li>
<li>How did the text extractors compare? </li>
<li>How did they all compare to the different methods in the Jupyter Notebook? </li>
</ul>
<p>
Think in terms of:
<ul>
<li>Providing useful terms (however you choose to define them)</li>
<li>Catching all the terms that you think it should (precision vs recall)</li>
</ul>
<p>
In addition, note how the engines perform on:
<ul>
<li><b>Stemming</b>: coalescing plurals and possessives into a single word, such as "emails" => "email," etc.</li>
<li><b>Synonyms / Alias:</b> Extracting the same entity, regardless of how the entity is named ("Bernie Sanders" vs. "Bernie" vs. "Sanders") </li>
<li>How did the methods in your notebook perform when you included the preprocessing step?</li>
</ul>
</p>
<p>
Also, note how the online term extraction engines "count" their terms.
<ul>
<li>TerMine provides two indices: Term Strength and one other. </li>
<li>FiveFilters gives you the actual term count. </li>
<li>Other term extraction engines provide different outputs. (For example, you could check out the Python NLTK, or Natural Language ToolKit, capability.) </li>
<li>You can run the Freq_Dist block in your Notebook to get NLTK based counts.
Some of the phrase extractors offer counts too.</li>
</ul>
</p>
<p>
Consider and comment.</p>


 <b>Objective:</b> Build a ground-truth term/entity list from your own 10-document movie set, then compare it against online term/entity extractors and programmatic methods (Jupyter/NLTK + other extractors). <br>Reflect on how preprocessing choices affect results, and discuss which methods worked best and why.

In [None]:
!pip install openai
!pip install dandelion-eu
!pip install tenacity
!pip install ipynb
!pip3 install rake_nltk




In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import ngrams, FreqDist
from nltk.corpus import stopwords
from rake_nltk import Rake
import re
import unicodedata
import pandas as pd
import os
from google.colab import drive, userdata
import json
from collections import Counter
from openai import OpenAI
from tenacity import retry, wait_exponential_jitter, stop_after_attempt
from pydantic import BaseModel, Field
from typing import List, Optional, Dict, Any
from dandelion import DataTXT
import ipynb
import requests


nltk.download('punkt')
## required as I was getting an error when attempting to flatten the reviews to ascii
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
## indicator variable that will pull from APIs if True
## will import from saved json files if False
dev = False

In [None]:
## load working area
drive.mount('/content/drive', force_remount=True)

# environment references
destination_folder= userdata.get('destination_folder')
data_folder= userdata.get('data_folder')



Mounted at /content/drive


In [None]:
def get_cleaned_text(text):
  """simple data cleaning - remove the special characters and convert to lower case"""
  cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
  cleaned_text = cleaned_text.lower()
  cleaned_text = cleaned_text.rstrip()
  return cleaned_text

In [None]:
def get_matched_term_list(list_a, list_b):
  """return terms from list_a that equal, contain, or are contained in a term from list_b"""
  matched_terms = []
  for i in list_a:
    for j in list_b:
      ## the substring checks also cover the exact-match case (i == j)
      if (i in j or j in i) and i not in matched_terms:
        matched_terms.append(i)
  return matched_terms

In [None]:
def check_terms_in_df(df, term_list, alias_terms = None):
  """count how often each term appears (substring match) in each normalized review"""
  results = {}
  for f in df['FileName']:
    ## column label from the file name, e.g. "LCK_DOC3_TAKEN.txt" -> "DOC3"
    reference = f[4:9].replace("_", "")
    ## look up the review text once per file rather than once per term
    text = df['MovieReview_normalized'].loc[df['FileName']==f].to_list()[0]
    ## str.count returns 0 when the term is absent
    results[reference] = {t: text.count(t) for t in term_list}

  df_results = pd.DataFrame.from_dict(results)
  df_results['TotalTerms'] = df_results.sum(axis=1, numeric_only=True)

  df_results.sort_values(by='TotalTerms', ascending = False, inplace=True)
  df_results = df_results.reset_index()
  df_results.rename(columns={'index':'Term'}, inplace=True)
  df_results = df_results.fillna(0)
  ## None (the default) or an empty dict skips the alias column
  if alias_terms:
    df_results['alias_terms'] = df_results['Term'].apply(lambda x: alias_terms.get(x, ''))
  ## keep only the terms that occur at least once across the reviews
  df_output = df_results.loc[df_results['TotalTerms']>=1]
  return df_output

<b>Step 1: Manual Term Extraction</b>
Using EACH of the documents that you selected previously (your own 10 documents) - manually perform term extraction and entity extraction.

In [None]:
## load manual terms from text file
manual_terms_source = os.path.join(destination_folder, "LCK_TAKEN_MANUAL_Terms.txt")
with open(manual_terms_source, 'r') as file:
  delimiter = ","
  content = file.read()
  manual_terms = content.split(delimiter)

cleaned_terms = []
for m in manual_terms:
  cleaned_terms.append(get_cleaned_text(m))


In [None]:
## load movie reviews into a dataframe
movie_reviews = os.path.join(data_folder, "LCK_MovieReview.csv")
df = pd.read_csv(movie_reviews)

#apply a cleansing function (normalize text case, remove special characters)
df['MovieReview_normalized'] = df['MovieReview'].apply(get_cleaned_text)
print(df.head())

              FileName                                        MovieReview  \
0   LCK_DOC3_TAKEN.txt  "Taken," which tells the story of how Liam Nee...   
1   LCK_DOC9_TAKEN.txt  The conundrum posed by "Taken" is as old as ci...   
2  LCK_DOC10_TAKEN.txt  The coolest thing in Taken lasts about three s...   
3   LCK_DOC7_TAKEN.txt  If CIA agents in general were as skilled as Br...   
4   LCK_DOC6_TAKEN.txt  Without ever taking his shirt off, Liam Neeson...   

                              MovieReview_normalized  
0  taken which tells the story of how liam neeson...  
1  the conundrum posed by taken is as old as cine...  
2  the coolest thing in taken lasts about three s...  
3  if cia agents in general were as skilled as br...  
4  without ever taking his shirt off liam neeson ...  


<h2> Manual (Ground Truth) Term count in "Taken" movie reviews</h2>

In [None]:
## create the combined table that counts the term frequency within the movie reviews
df_manual = check_terms_in_df(df, cleaned_terms)
df_manual['Ground_Truth'] = df_manual['Term'].apply(lambda x: 1 if x in cleaned_terms else 0)
df_manual = df_manual[['Term','Ground_Truth','DOC1','DOC2','DOC3','DOC4','DOC5','DOC6','DOC7','DOC8','DOC9','DOC10','TotalTerms']]
## display the top 8 records
df_manual.head(8)

Unnamed: 0,Term,Ground_Truth,DOC1,DOC2,DOC3,DOC4,DOC5,DOC6,DOC7,DOC8,DOC9,DOC10,TotalTerms
0,la,1,8,11,11,10,9,14,14,3,12,15,107
1,us,1,12,9,10,4,7,4,15,7,6,25,99
2,cia,1,4,5,3,0,1,2,6,0,1,3,25
3,paris,1,2,5,1,1,1,4,2,1,3,5,25
4,kim,1,0,3,0,1,0,5,7,0,3,5,24
5,liam neeson,1,1,2,1,1,1,1,3,2,1,1,14
6,albanian,1,0,1,1,1,1,3,1,1,1,1,11
7,luc besson,1,1,1,1,1,0,1,1,1,1,2,10


In [None]:
print(f'The number of manually extracted terms from movie reviews is {len(cleaned_terms)} distinct terms')

The number of manually extracted terms from movie reviews is 102 distinct terms


<blockquote>
<b>Observation from the frequency count for manually extracted terms:</b> the top results ("la" and "us") are abbreviations for place names (Los Angeles and United States), but they are also fragments of many longer words. Because the count uses substring matching, this explains their high observed frequency.
</blockquote>
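
The inflated counts for "la" and "us" follow from `text.count(t)`, which also counts a term when it appears inside a longer word. A minimal sketch of whole-word counting, assuming text that has already been normalized (lowercase, punctuation removed). `count_whole_word` is a hypothetical helper, not part of the notebook, and the sample sentence is made up:

```python
import re

def count_whole_word(term, text):
    """Count occurrences of term that start and end on a word boundary."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text))

# Made-up sample, shaped like the output of get_cleaned_text
sample = "la is a short name for los angeles and a flat plan for us was useless"

for term in ("la", "us"):
    # Compare the substring count with the whole-word count
    print(term, "substring:", sample.count(term), "whole word:", count_whole_word(term, sample))
```

Swapping a check like this into the counting step would likely shrink the totals for short terms while leaving multi-word terms such as "liam neeson" largely unchanged.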


<b>Step 2: Run Two Different Automated Term Extraction Engines. </b>
<ul>
<li>Pick two online term extractors (e.g., FiveFilters, Dandelion, wordcount.com) and run each on all 10 documents. For this work, I am using online term extractors that have a published application programming interface (API).
<ul>
<li>Option 1 - ChatGPT with specific instructions</li>
<li>Option 2 - Dandelion</li>
</ul>
</li>
<li>Extend your table to add their metrics/counts per document (for at least the top ~20 terms from your ground truth; more if you can).</li>
</ul>


<h2>Automated Extraction Engine 1: Generative Pre-trained Transformer (GPT)</h2>


In [None]:
## documented user instructions for the gpt input
## code reference:
SYSTEM_PROMPT = (
    "You are a precise term extractor. Return ONLY structured JSON per the schema. "
    "No explanations."
)

USER_INSTRUCTIONS = """\
From the text, extract up to {num_terms} high-value domain terms.

For each term include:
- term (lemma, ≤3 words)
- category (entity|noun_phrase|acronym|process|system|metric|place|noun)
- confidence (0–1)
- frequency count (integer)

Rules:
- Be conservative: if unsure, omit.
- Merge obvious variants; keep acronyms with expansions.
- Exclude generic terms unless qualified (e.g., 'data pipeline' ok, 'data' not).
- Prefer domain-specific multiword noun phrases.
Text:
\"\"\"{text}\"\"\"
"""
MODEL_NAME = "gpt-4o"

In [None]:
#@retry(wait=wait_exponential_jitter(initial=1, max=20), stop=stop_after_attempt(6))

def call_gpt_extractor(client , chunk_text: str, num_terms: int = 40) -> Dict[str, Any]:
    """
    Call the Responses API from GPT model.
    Requires an OpenAI API key.
    """
    resp = client.responses.create(
        model=MODEL_NAME,
        input=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_INSTRUCTIONS.format(num_terms=num_terms, text=chunk_text)}
        ],
        temperature=0.1,
    )
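    ## the slice assumes the reply is wrapped in a Markdown json code fence and strips it before parsing
    parsed_terms = json.loads(resp.output[0].content[0].text[8:-4])
    return parsed_terms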

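def call_dandelion_extractor(api_key, chunk_text: str, num_terms: int = 40):
    """
    Call the Dandelion entity extraction endpoint (nex).
    Despite its name, api_key is expected to be an initialized DataTXT client.
    """
    resp = api_key.nex(text = chunk_text, top_entities = num_terms, min_length = 3)
    return resp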

def chunk_text(text: str, max_words=1000, overlap_words=100):
    """
    Split long text into overlapping chunks by words.
    """
    N = len(text.split())
    words = text.split()
    i = 0
    while i < N:
        j = min(i + max_words, N)
        chunk = " ".join(words[i:j])
        yield chunk
        i = j - overlap_words if j < N else j

def get_extracted_terms(df, api_key, api_option = 'gpt'):
  """standardized function to call and output the extracted terms from the movie reviews"""
  MAX_WORDS = 1000      ## words per chunk
  OVERLAP = 10          ## word overlap between chunks (so terms that span a chunk boundary are not missed)
  TOP_K_PER_CHUNK = 15  ## limit results to the top N terms per chunk
  all_rows = {}         ## results dictionary keyed by cleaned term
  all_terms = []        ## consolidated list of extracted terms
  for d in range(10):
      for i, chunk in enumerate(chunk_text(df["MovieReview_normalized"][d], max_words=MAX_WORDS, overlap_words=OVERLAP)):
        #print loop output to see where we are
        print(d, i, df['FileName'][d])
        ## gpt options
        if api_option == 'gpt':
          out = call_gpt_extractor(api_key, chunk, num_terms=TOP_K_PER_CHUNK)
        ## dandelion options
        elif api_option == 'dandelion':
          output = call_dandelion_extractor(api_key, chunk_text = chunk, num_terms = TOP_K_PER_CHUNK)
          out = output['annotations']

        for t in out:
          lbl = t['label'] if 'label' in t else t['term']
          extracted_term = get_cleaned_text(lbl)
          if extracted_term not in all_terms and len(extracted_term)>=3:
            all_terms.append(extracted_term)
            if 'label' not in t:
              all_rows[extracted_term] = {**t}
            else:
              all_rows[extracted_term] = {'label':extracted_term, 'confidence':t['confidence']}

            ## first time this term is seen: start its count at 1
            all_rows[extracted_term]['identified_count'] = 1

          ## term seen before: increment (elif avoids counting the first sighting twice)
          elif extracted_term in all_terms:
            all_rows[extracted_term]['identified_count'] += 1

  print("---------COMPLETE -----------")
  return (all_terms, all_rows)
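
The fixed slice in `call_gpt_extractor` only works when the model wraps its reply in a code fence of exactly the expected shape. A more forgiving sketch, assuming the reply is either bare JSON or JSON inside a Markdown code fence. `parse_model_json` is a hypothetical helper, and the sample replies are made up:

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

def parse_model_json(raw):
    """Parse JSON from a reply that may or may not be wrapped in a code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE
    match = re.search(pattern, raw, flags=re.DOTALL)
    # Fall back to the raw text when no fence is found
    payload = match.group(1) if match else raw.strip()
    return json.loads(payload)

# Made-up replies: one fenced, one bare
fenced = FENCE + 'json\n[{"term": "Liam Neeson", "confidence": 1}]\n' + FENCE
bare = '[{"term": "Bryan Mills", "confidence": 1}]'

print(parse_model_json(fenced)[0]["term"])
print(parse_model_json(bare)[0]["term"])
```

A reply that is not valid JSON would still raise `json.JSONDecodeError`, which is where the commented-out `@retry` decorator could be useful.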

In [None]:
## generate the terms from GPT api call
movie_gpt_terms = os.path.join(destination_folder, "LCK_GPT_Extracted_Terms.json")
if dev == False:
  with open(movie_gpt_terms, 'r') as f:
        all_gpt_rows = json.load(f)

elif dev == True:
  ## reads the api key from google secrets
  client = OpenAI(api_key=userdata.get('gpt') )
  ## calls the data extractor function
  all_gpt_terms, all_gpt_rows = get_extracted_terms(df, client, api_option = 'gpt')
  ## save results
  with open(movie_gpt_terms, 'w') as f:
        json.dump(all_gpt_rows, f, indent=4)


In [None]:
all_gpt_rows

{'liam neeson': {'term': 'Liam Neeson',
  'category': 'entity',
  'confidence': 1,
  'frequency_count': 2,
  'identified_count': 9},
 'bryan mills': {'term': 'Bryan Mills',
  'category': 'entity',
  'confidence': 1,
  'frequency_count': 2,
  'identified_count': 8},
 'cia spook': {'term': 'CIA spook',
  'category': 'noun_phrase',
  'confidence': 0.9,
  'frequency_count': 1,
  'identified_count': 2},
 'los angeles mansion': {'term': 'Los Angeles mansion',
  'category': 'place',
  'confidence': 0.9,
  'frequency_count': 1,
  'identified_count': 2},
 'industrialist husband': {'term': 'industrialist husband',
  'category': 'noun_phrase',
  'confidence': 0.8,
  'frequency_count': 1,
  'identified_count': 2},
 'obama administration': {'term': 'Obama administration',
  'category': 'entity',
  'confidence': 0.9,
  'frequency_count': 1,
  'identified_count': 2},
 'french producer': {'term': 'French producer',
  'category': 'noun_phrase',
  'confidence': 0.8,
  'frequency_count': 1,
  'identified

In [None]:
print(f'The total number of unique terms extracted using GPT {MODEL_NAME} for the {df.shape[0]} movie reviews is {len(all_gpt_rows.keys())}')
print(list(all_gpt_rows.keys())[:10])

The total number of unique terms extracted using GPT gpt-4o for the 10 movie reviews is 106
['liam neeson', 'bryan mills', 'cia spook', 'los angeles mansion', 'industrialist husband', 'obama administration', 'french producer', 'charles de gaulle international', 'sex traffickers', 'oskar schindler']


In [None]:
gpt_manual_matched_terms = get_matched_term_list(list(all_gpt_rows.keys()), cleaned_terms)

print(f'The number of matched terms between the manual method and GPT {MODEL_NAME} for the {df.shape[0]} movie reviews is {len(gpt_manual_matched_terms)}.')

The number of matched terms between the manual method and GPT gpt-4o for the 10 movie reviews is 57.


<h2>Automated Extraction Engine 2: Dandelion</h2>

In [None]:
movie_dandelion_terms = os.path.join(destination_folder, "LCK_Dandelion_Extracted_Terms.json")
if dev == False:
  with open(movie_dandelion_terms, 'r') as f:
        all_dandelion_rows = json.load(f)
elif dev == True:
  term_extractor_dandelion = "https://dandelion.eu/"
  datatxt = DataTXT(token = userdata.get('dandelion_api'))
  all_dandelion_terms, all_dandelion_rows = get_extracted_terms(df, datatxt, api_option = 'dandelion')
  ## save results
  with open(movie_dandelion_terms, 'w') as f:
        json.dump(all_dandelion_rows, f, indent=4)

In [None]:
print(f'The total number of unique terms extracted using dandelion_api for the {df.shape[0]} movie reviews is {len(all_dandelion_rows.keys())}')
print(list(all_dandelion_rows.keys())[:10])

The total number of unique terms extracted using dandelion_api for the 10 movie reviews is 273
['liam neeson', 'gasket', 'france', 'albanian', 'labrador retriever', 'cia', 'black pudding', 'family', 'los angeles', 'industrialist']


In [None]:
dandelion_manual_matched_terms = get_matched_term_list(list(all_dandelion_rows.keys()), cleaned_terms)
print(f'The number of matched terms between the manual method and dandelion_api for the {df.shape[0]} movie reviews is {len(dandelion_manual_matched_terms)}.')

The number of matched terms between the manual method and dandelion_api for the 10 movie reviews is 87.
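
These match counts come from `get_matched_term_list`, which treats any substring overlap as a match, so they are best read as an upper bound on agreement. To frame the comparison as precision versus recall, here is a minimal sketch using exact matching on toy lists. `precision_recall` is a hypothetical helper, and the terms below are illustrative, not the notebook's actual results:

```python
def precision_recall(extracted, ground_truth):
    """Exact-match precision and recall of an extracted term list against a ground truth."""
    extracted_set, truth_set = set(extracted), set(ground_truth)
    true_positives = extracted_set & truth_set
    # Precision: share of extracted terms that are in the ground truth
    precision = len(true_positives) / len(extracted_set) if extracted_set else 0.0
    # Recall: share of ground-truth terms that were extracted
    recall = len(true_positives) / len(truth_set) if truth_set else 0.0
    return precision, recall

# Toy lists for illustration only
toy_truth = ["liam neeson", "cia", "paris", "kim"]
toy_extracted = ["liam neeson", "cia", "gasket", "black pudding", "paris"]

precision, recall = precision_recall(toy_extracted, toy_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

An extractor that returns many terms tends to gain recall at the cost of precision, which is one way to read the difference in list sizes between the two engines.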


<blockquote>
<b>Initial observations of the automated term extractors (GPT & Dandelion)</b> for 10 reviews of the movie Taken:<br>
<ul>
<li>The total number of <b>Dandelion</b> extracted terms is 273</li>
<li>The total number of <b>GPT gpt-4o</b> extracted terms is 106</li>
</ul>
<p>
One of the key differentiators between the GPT and Dandelion API results was the set of instructions given to GPT.
</p>
<p><i>
Rules for the GPT prompt response:
<ul>
<li>Be conservative; if unsure, omit.</li>
<li>Merge obvious variants; keep acronyms with expansions.</li>
</ul></i>
</p>

</blockquote>

In [None]:
matched_terms_automated = get_matched_term_list(list(all_gpt_rows.keys()), list(all_dandelion_rows.keys()))
print(f'The number of matched terms between the two methods is {len(matched_terms_automated)}.')
print(f'A sample of the  matched terms between the two sources are: {matched_terms_automated[:10]}')

The number of matched terms between the two methods is 73.
A sample of the  matched terms between the two sources are: ['liam neeson', 'cia spook', 'los angeles mansion', 'industrialist husband', 'french producer', 'charles de gaulle international', 'oskar schindler', 'eiffel tower', 'jason bourne', 'charles bronson simplicity']


In [None]:
merged_term_list = list(all_gpt_rows.keys()) + list(all_dandelion_rows.keys())+ cleaned_terms
## remove duplicates
merged_term_list = list(set(merged_term_list))
print(f'The number of terms between the two automated and manual methods: {len(merged_term_list)}.')

The number of terms between the two automated and manual methods: 390.


In [None]:
df_automated_gpt_check = check_terms_in_df(df, merged_term_list)
df_automated_gpt_check['Ground_Truth'] = df_automated_gpt_check['Term'].apply(lambda x: 1 if x in cleaned_terms else 0)
df_automated_gpt_check['GPT'] = df_automated_gpt_check['Term'].apply(lambda x: 1 if x in list(all_gpt_rows.keys()) else 0)
df_automated_gpt_check['Dandelion'] = df_automated_gpt_check['Term'].apply(lambda x: 1 if x in list(all_dandelion_rows.keys()) else 0)
df_automated_gpt_check = df_automated_gpt_check[['Term','Ground_Truth','GPT','Dandelion','DOC1','DOC2','DOC3','DOC4','DOC5','DOC6','DOC7','DOC8','DOC9','DOC10','TotalTerms']]
df_automated_gpt_check.head(15)

Unnamed: 0,Term,Ground_Truth,GPT,Dandelion,DOC1,DOC2,DOC3,DOC4,DOC5,DOC6,DOC7,DOC8,DOC9,DOC10,TotalTerms
0,la,1,0,0,8,11,11,10,9,14,14,3,12,15,107
1,us,1,0,0,12,9,10,4,7,4,15,7,6,25,99
2,taken,0,1,1,11,3,2,1,4,1,3,15,3,6,49
3,ive,0,0,1,3,5,3,3,3,3,6,5,5,6,42
4,action,0,0,1,4,2,1,2,3,4,4,12,0,5,37
5,film,0,0,1,3,1,1,0,3,0,1,6,1,9,25
6,cia,1,1,1,4,5,3,0,1,2,6,0,1,3,25
7,paris,1,1,1,2,5,1,1,1,4,2,1,3,5,25
8,kim,1,0,0,0,3,0,1,0,5,7,0,3,5,24
9,other,0,0,1,1,0,0,3,0,2,3,1,2,5,17


<h2><b>Step 3: Programmatic extraction (Notebook)</b></h2>
<ul>
<li>Run the provided term_extraction_exploration.ipynb (NLTK tokenizer + optional phrase extractors).</li>
<li>Apply the extracted term results to each of the 10 documents</li>
<li>Record comparable counts/metrics and compare to your ground truth and online tools</li>
</ul>

<h2>Programmatic extraction: NLTK Tokenizer</h2>

Using RAKE (the rake_nltk package, which builds on NLTK's tokenizers and stop-word list), return key phrases extracted from the reviews

In [None]:
## code reference from discussion 2 term extraction code sample
## reference: Jennifer Sleeman

# Get frequency distribution
def get_freq_dist(terms):
    return FreqDist(terms)

# RAKE keyword extractor
def run_rake(in_text):
    ## phrases of at most 3 words; repeated phrases are reported once
    r = Rake(max_length=3, include_repeated_phrases = False)
    r.extract_keywords_from_text(in_text)
    ## list of (score, phrase) tuples, highest score first
    return r.get_ranked_phrases_with_scores()

Format the text in the dataframe.
<ol>
<li>Split each review on periods to generate a list of sentences.</li>
<li>Apply the text cleaning function (lowercase, remove punctuation)</li>
</ol>

In [None]:
## format dataframe for nltk functions
df['Review_Sentences'] = df['MovieReview'].str.split(r'.', expand=False)
df['Review_Sentences'] = df['Review_Sentences'].apply(lambda x: [get_cleaned_text(item) for item in x])
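
Splitting on every period also breaks decimals and abbreviations apart. A rough alternative, assuming sentences end in `.`, `!` or `?` followed by whitespace. `split_sentences` is a hypothetical helper and the review text is made up; NLTK's `sent_tokenize` is a more thorough option:

```python
import re

def split_sentences(text):
    """Split after ., ! or ? only when whitespace follows, so decimals stay intact."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Made-up review text
review = "Taken earns 3.5 stars. Neeson is relentless! Is it art?"

print(review.split("."))        # naive split cuts "3.5" in two
print(split_sentences(review))  # keeps "3.5 stars." together
```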

In [None]:
## from discussion 2 term extraction code sample
## reference: Jennifer Sleeman
extracted_terms = []
for i in range(10):
  sentences = df['Review_Sentences'][i]
  all_terms=[]
  for sentence in sentences:
    all_terms = all_terms + run_rake(sentence)
  #get the frequency distribution across the terms
  fd=get_freq_dist(all_terms).most_common(15)

  for t in fd:
    if t[0][1] not in extracted_terms:
      extracted_terms.append(t[0][1])
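
`run_rake` returns `(score, phrase)` tuples, so the frequency distribution above counts score/phrase pairs rather than phrases. If the goal is to rank phrases by how often they recur across sentences, one option is to count the phrase text only. A sketch with made-up RAKE-style output, using `collections.Counter` (which `FreqDist` subclasses):

```python
from collections import Counter

# Hypothetical (score, phrase) tuples, shaped like run_rake output for three sentences
rake_output = [
    (4.0, "liam neeson"), (1.0, "taken"),
    (3.5, "liam neeson"), (1.0, "paris"),
    (1.0, "taken"), (4.0, "sex traffickers"),
]

# Count on the phrase alone so differing scores do not split the tally
phrase_counts = Counter(phrase for _score, phrase in rake_output)
print(phrase_counts.most_common(2))
```

Counting this way lets the same phrase with different scores in different sentences accumulate into one entry.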

In [None]:
print(f'For nltk term extraction from the 10 movie reviews, {len(extracted_terms)} terms occurred most frequently in the review sentences')

For nltk term extraction from the 10 movie reviews, 131 terms occurred most frequently in the review sentences


In [None]:
matched_terms_nltk = get_matched_term_list(extracted_terms, merged_term_list)
matched_terms_manual = get_matched_term_list(cleaned_terms, extracted_terms)
print(f'The number of matched terms between the previous methods and nltk methods: {len(matched_terms_nltk)}.')
print(f'The number of matched terms between the manual methods and nltk methods: {len(matched_terms_manual)}.')
print(f'A sample of the matched terms between the methods are: {matched_terms_nltk[:6]}')

The number of matched terms between the previous methods and nltk methods: 63.
The number of matched terms between the manual methods and nltk methods: 27.
A sample of the matched terms between the methods are: ['neeson', 'taken', 'liam neeson blows', 'kills 75 albanians', 'deeply insane', 'gasket']


In [None]:
merged_term_list = merged_term_list + extracted_terms
## remove duplicates
merged_term_list = list(set(merged_term_list))

## generate dataframe frequency count in reviews
df_nltk_check = check_terms_in_df(df, merged_term_list)
df_nltk_check['Ground_Truth'] = df_nltk_check['Term'].apply(lambda x: 1 if x in cleaned_terms else 0)
df_nltk_check['GPT'] = df_nltk_check['Term'].apply(lambda x: 1 if x in list(all_gpt_rows.keys()) else 0)
df_nltk_check['Dandelion'] = df_nltk_check['Term'].apply(lambda x: 1 if x in list(all_dandelion_rows.keys()) else 0)
df_nltk_check['NLTK'] = df_nltk_check['Term'].apply(lambda x: 1 if x in extracted_terms else 0)
df_nltk_check = df_nltk_check[['Term','Ground_Truth','GPT','Dandelion','NLTK','DOC1','DOC2','DOC3','DOC4','DOC5','DOC6','DOC7','DOC8','DOC9','DOC10','TotalTerms']]
df_nltk_check.head(12)

Unnamed: 0,Term,Ground_Truth,GPT,Dandelion,NLTK,DOC1,DOC2,DOC3,DOC4,DOC5,DOC6,DOC7,DOC8,DOC9,DOC10,TotalTerms
0,la,1,0,0,0,8,11,11,10,9,14,14,3,12,15,107
1,us,1,0,0,0,12,9,10,4,7,4,15,7,6,25,99
2,one,0,0,0,1,8,8,2,7,2,8,15,3,6,9,68
3,taken,0,1,1,1,11,3,2,1,4,1,3,15,3,6,49
4,end,0,0,0,1,4,6,5,2,2,2,3,2,5,12,43
5,ive,0,0,1,0,3,5,3,3,3,3,6,5,5,6,42
6,bryan,0,0,0,1,0,7,7,4,1,6,1,3,1,10,40
7,man,0,0,0,1,5,4,1,1,1,3,2,3,3,15,38
8,action,0,0,1,1,4,2,1,2,3,4,4,12,0,5,37
9,neeson,0,0,0,1,2,2,5,5,4,3,3,5,5,3,37


<h2>Programmatic extraction: Google API</h2>

Using the Google Cloud Natural Language API (analyzeEntities endpoint), return entities that have been extracted from the reviews

In [None]:
## using Google NLP for entity extraction
## (assign only API_KEY here so data_folder is not overwritten)
API_KEY = userdata.get('google_api')

def get_google_nlp_entities(extract_text, api_key):
  ## use the api_key argument rather than the global
  API_ENDPOINT = f"https://language.googleapis.com/v1/documents:analyzeEntities?key={api_key}"
  # Request content
  request_body = {"document": {"content": extract_text, "type": "PLAIN_TEXT"}, "encodingType": "UTF8" }
  output = {}
  try:
    response = requests.post(API_ENDPOINT, json=request_body)
    response.raise_for_status()  # Raise an exception for HTTP errors
    ## loop through the response items to capture the top results
    for e in response.json()['entities']:
      if e['salience'] >=0.01:
        if e['name'] not in output:
          output[e['name']] ={'type':e['type'],
                              'salience': round(e['salience'],3),
                              ## number of mentions of the entity in the text
                              'count': len(e['mentions'])}
        else:
          output[e['name']]['count'] += len(e['mentions'])
  except requests.exceptions.RequestException as e:
    print(f"Error making API request: {e}")
    if hasattr(e, 'response') and e.response is not None:
      print(f"Response content: {e.response.text}")
  ## return whatever was collected (an empty dict on failure) so callers can always iterate
  return output
##NOTES
##Salience score [0.0, 1.0]
#Scores closer to 0.0: suggest the entity is less important or central to the document.
#Scores closer to 1.0: indicate the entity is highly important or central to the document.
#Score can be valuable for Information retrieval, Summarization, Content analysis
## reference: https://www.601media.com/google-salience-score-what-is-it/

In [None]:
movie_google_extract_terms = os.path.join(destination_folder, "LCK_google_nlp_Extracted_Terms.json")
if dev == False:
  with open(movie_google_extract_terms, 'r') as f:
        google_extract_terms = json.load(f)
elif dev == True:
  google_extract_terms = {}
  for r in df['FileName'].unique():
    review_text = df['MovieReview_normalized'].loc[df['FileName']==r].to_list()[0]
    google_extract_terms[r] = get_google_nlp_entities(review_text, API_KEY)
  ## save results
  with open(movie_google_extract_terms, 'w') as f:
        json.dump(google_extract_terms, f, indent=4)

In [None]:
google_results = []
for r in google_extract_terms:
  for t in list(google_extract_terms[r].keys()):
    if t not in google_results:
      google_results.append(t)
print(f'For Google NLP term extraction from the 10 movie reviews, {len(google_results)} unique terms met the salience threshold')

For Google NLP term extraction from the 10 movie reviews, 152 unique terms met the salience threshold


In [None]:
matched_terms_google = get_matched_term_list(google_results, merged_term_list)
matched_terms_manual = get_matched_term_list(cleaned_terms, google_results)
print(f'The number of matched terms between the previous methods and Google NLP Term Extraction: {len(matched_terms_google)}.')
print(f'The number of matched terms between the manual methods and Google NLP Term Extraction: {len(matched_terms_manual)}.')
print(f'A sample of the matched terms between the methods are: {matched_terms_google[:6]}')


The number of matched terms between the previous methods and Google NLP Term Extraction: 118.
The number of matched terms between the manual methods and Google NLP Term Extraction: 37.
A sample of the matched terms between the methods are: ['neeson', 'daughter', 'thing', 'spook', 'story', 'eagerness']


In [None]:
merged_term_list = merged_term_list + google_results
## remove duplicates
merged_term_list = list(set(merged_term_list))

## generate dataframe frequency count in reviews
df_google_check = check_terms_in_df(df, merged_term_list)
df_google_check['Ground_Truth'] = df_google_check['Term'].apply(lambda x: 1 if x in cleaned_terms else 0)
df_google_check['GPT'] = df_google_check['Term'].apply(lambda x: 1 if x in list(all_gpt_rows.keys()) else 0)
df_google_check['Dandelion'] = df_google_check['Term'].apply(lambda x: 1 if x in list(all_dandelion_rows.keys()) else 0)
df_google_check['NLTK'] = df_google_check['Term'].apply(lambda x: 1 if x in extracted_terms else 0)
df_google_check['Google'] = df_google_check['Term'].apply(lambda x: 1 if x in google_results else 0)
df_google_check = df_google_check[['Term','Ground_Truth','GPT','Dandelion','NLTK','Google','DOC1','DOC2','DOC3','DOC4','DOC5','DOC6','DOC7','DOC8','DOC9','DOC10','TotalTerms']]
df_google_check.head(12)

Unnamed: 0,Term,Ground_Truth,GPT,Dandelion,NLTK,Google,DOC1,DOC2,DOC3,DOC4,DOC5,DOC6,DOC7,DOC8,DOC9,DOC10,TotalTerms
0,la,1,0,0,0,0,8,11,11,10,9,14,14,3,12,15,107
1,us,1,0,0,0,0,12,9,10,4,7,4,15,7,6,25,99
2,no,0,0,0,0,1,8,11,8,2,7,7,8,11,10,21,93
3,one,0,0,0,1,1,8,8,2,7,2,8,15,3,6,9,68
4,taken,0,1,1,1,0,11,3,2,1,4,1,3,15,3,6,49
5,all,0,0,0,0,1,6,6,4,2,2,3,4,6,3,8,44
6,end,0,0,0,1,0,4,6,5,2,2,2,3,2,5,12,43
7,ive,0,0,1,0,0,3,5,3,3,3,3,6,5,5,6,42
8,bryan,0,0,0,1,1,0,7,7,4,1,6,1,3,1,10,40
9,man,0,0,0,1,1,5,4,1,1,1,3,2,3,3,15,38


<h2>Step 4: Preprocessing Exploration</h2>
<ul>
<li>Experiment with lowercasing, stop-word removal, stemming, and combinations.</li>
<li>Note how these choices change extracted terms (better/worse, noise introduced/removed).</li>
</ul>

The text normalization thus far has focused on converting the strings to lower case and removing special characters.
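
Because `get_cleaned_text` deletes apostrophes outright, a possessive such as "neeson's" becomes "neesons" and "I've" becomes "ive". A crude sketch of handling possessives and simple plurals before stripping characters. `normalize_token` is a hypothetical helper and only a toy stand-in for a real stemmer such as NLTK's `PorterStemmer`:

```python
import re

def normalize_token(token):
    """Lowercase, drop a trailing possessive, strip symbols, then trim a simple plural 's'."""
    token = token.lower()
    token = re.sub(r"['’]s$", "", token)     # "neeson's" -> "neeson"
    token = re.sub(r"[^a-z0-9]", "", token)  # remove any remaining special characters
    # Naive plural rule: trim a final "s" on longer words, but leave "ss" endings alone
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token

for word in ["emails", "Neeson's", "Albanians", "boss", "Paris"]:
    print(word, "->", normalize_token(word))
```

The last example shows the cost of a blunt rule: "Paris" loses its final letter, so proper nouns may need to be protected from stemming.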

In [None]:
# Create a list of stop words from nltk
stop_words = list(stopwords.words("english"))

In [None]:
list_with_char = [x for x in stop_words if x.find("'") != -1]
for x in list_with_char:
  if get_cleaned_text(x) not in stop_words:
    stop_words.append(get_cleaned_text(x))
stop_words[:5]

['a', 'about', 'above', 'after', 'again']

In [None]:
def remove_special_chars_and_digits(in_text):
    # Replace runs of digits and non-word characters with a single space
    text = re.sub(r"(\d|\W)+", " ", in_text)
    return text
# Remove stop words
def remove_stop_words(in_text, stop_words):
    word_tokens = word_tokenize(in_text)
    updated_sentence = [w for w in word_tokens if w not in stop_words]
    filtered_sentence = " ".join(updated_sentence)
    return filtered_sentence


In [None]:
df['preprocessed_text_count'] = df['Review_Sentences'].apply(lambda x: sum(map(len, x)))
## process text in sentences
df['processed_sentence'] = df['Review_Sentences'].apply(lambda x: [remove_special_chars_and_digits(item) for item in x])
df['processed_sentence'] = df['processed_sentence'].apply(lambda x: [remove_stop_words(item, stop_words) for item in x])
df['processed_text_count'] = df['processed_sentence'].apply(lambda x: sum(map(len, x)))
## display results
df.head(2)

Unnamed: 0,FileName,MovieReview,MovieReview_normalized,Review_Sentences,preprocessed_text_count,processed_sentence,processed_text_count
0,LCK_DOC3_TAKEN.txt,"""Taken,"" which tells the story of how Liam Nee...",taken which tells the story of how liam neeson...,[taken which tells the story of how liam neeso...,3277,[taken tells story liam neeson blows gasket fl...,2268
1,LCK_DOC9_TAKEN.txt,"The conundrum posed by ""Taken"" is as old as ci...",the conundrum posed by taken is as old as cine...,[the conundrum posed by taken is as old as cin...,3840,"[conundrum posed taken old cinema, stars degra...",2520


In [None]:
## from discussion 2 term extraction code sample
## reference: Jennifer Sleeman
extracted_terms2 = []
for i in range(10):
  sentences2 = df['processed_sentence'][i]
  all_terms2=[]
  for sentence in sentences2:
    all_terms2 = all_terms2 + run_rake(sentence)
  #get the frequency distribution across the terms
  fd2=get_freq_dist(all_terms2).most_common(15)

  for t in fd2:
    if len(t[0][1])>2 and t[0][1] not in extracted_terms2:
      extracted_terms2.append(t[0][1])

In [None]:
print(f'For nltk term extraction from the 10 movie reviews, {len(extracted_terms2)} terms occurred most frequently in the pre-processed review sentences')
print(f'Text pre-processing reduced the total number of terms extracted using NLTK by {len(extracted_terms) - len(extracted_terms2)} terms.')
print(f'Terms extracted with pre-processing (stop words and numeric digits removed): {len(extracted_terms2)}')
print(f'Terms extracted without pre-processing (lower case, special characters removed): {len(extracted_terms)}')

For nltk term extraction from the 10 movie reviews, 32 terms occurred most frequently in the pre-processed review sentences
Text pre-processing reduced the total number of terms extracted using NLTK by 99 terms.
Terms extracted with pre-processing (stop words and numeric digits removed): 32
Terms extracted without pre-processing (lower case, special characters removed): 131


In [None]:
matched_terms_nltk2 = get_matched_term_list(extracted_terms, extracted_terms2)
matched_terms_manual = get_matched_term_list(cleaned_terms, extracted_terms2)
print(f'Matched terms between the two NLTK runs (without vs. with pre-processing): {len(matched_terms_nltk2)}')
print(f'Matched terms between the manual list and the pre-processed NLTK run: {len(matched_terms_manual)}')
print(f'A sample of the terms matched between the two NLTK runs: {matched_terms_nltk2[:6]}')

Matched terms between the two NLTK runs (without vs. with pre-processing): 19
Matched terms between the manual list and the pre-processed NLTK run: 3
A sample of the terms matched between the two NLTK runs: ['taken', 'kills 75 albanians', 'old', 'end', 'look', 'skilled']
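
The overlap figures above depend on how get_matched_term_list compares the two lists, which is not shown in this section. A minimal sketch, assuming an exact string match after lowercasing and trimming; match_terms and jaccard are hypothetical helpers, not the notebook's own functions, and the lists are illustrative:

```python
def match_terms(reference_terms, candidate_terms):
    """Return candidate terms that also appear in the reference list (exact match)."""
    reference = {t.strip().lower() for t in reference_terms}
    matched = []
    for term in candidate_terms:
        key = term.strip().lower()
        if key in reference and key not in matched:
            matched.append(key)
    return matched


def jaccard(terms_a, terms_b):
    """Share of distinct terms that the two lists have in common."""
    a = {t.strip().lower() for t in terms_a}
    b = {t.strip().lower() for t in terms_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative lists only; these are not the notebook's actual term lists.
without_prep = ["taken", "the end", "old", "look"]
with_prep = ["Taken", "end", "old", "skilled"]

print(match_terms(without_prep, with_prep))
print(round(jaccard(without_prep, with_prep), 2))
```

Exact matching treats 'the end' and 'end' as different terms, so some of the drop in matches after stop-word removal may come from the matching rule rather than from the extraction itself.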


In [None]:
merged_term_list = merged_term_list + extracted_terms2
## remove duplicates
merged_term_list = list(set(merged_term_list))
print(f'The total number of unique terms across the various methods: {len(merged_term_list)}.')

The total number of unique terms across the various methods: 629.


<h2>Step 5: Normalize aliases / synonyms </h2>
<ul>
<li>Identify cases where different terms refer to the same thing (e.g. 'paris', 'arrondisment', 'charles de gaulle international', 'eiffel tower').</li>
</ul>
Code reference for looking up synonyms and antonyms with NLTK:<br>
Subramanian, Dhilip. 2019. "Synonyms and Antonyms in Python
Text Mining — Extracting Synonyms and Antonyms." Medium.com
https://medium.com/data-science/synonyms-and-antonyms-in-python-a865a5e14ce8
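
WordNet lookup works on dictionary words, so character names and multi-word phrases from the reviews may need a hand-built map instead. A minimal sketch, assuming a plain dictionary from each variant to its canonical term; the alias_map entries and the normalize_terms helper are illustrative, not the contents of the notebook's alias file:

```python
# Hypothetical alias map: variant term -> canonical term.
alias_map = {
    "bryan": "bryan mills",
    "mills": "bryan mills",
    "action film": "action movie",
    "action thriller": "action movie",
}


def normalize_terms(terms, alias_map):
    """Replace each term with its canonical alias and drop duplicates, keeping order."""
    seen = set()
    normalized = []
    for term in terms:
        canonical = alias_map.get(term, term)  # unmapped terms stay as they are
        if canonical not in seen:
            seen.add(canonical)
            normalized.append(canonical)
    return normalized


terms = ["bryan", "bryan mills", "mills", "action film", "paris"]
print(normalize_terms(terms, alias_map))
```

The difference in length between the input and output lists is the reduction that aliasing achieves.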

In [None]:
# Collect WordNet synonyms (lemma names) for a term.
# Requires the WordNet corpus: nltk.download('wordnet')
# Note: multi-word lemma names use underscores instead of spaces.
from nltk.corpus import wordnet

def get_syn(term):
  synonyms = []
  for syn in wordnet.synsets(term):
    for lm in syn.lemmas():
      if lm.name() not in synonyms:
        synonyms.append(lm.name())  # add each lemma name once
  return synonyms

alias = {}
for t in merged_term_list:
  # if t is already a synonym of an existing key, it belongs to that group
  for a in alias:
    if t in alias[a]:
      break
  else:
    # no existing group contains t, so start a new one
    alias[t] = get_syn(t)

In [None]:
import json
import os

# Load the alias file; syn[0] is used below as the term -> alias dictionary
movie_syn_terms = os.path.join(destination_folder, "LCK_Alias_Terms.json")
with open(movie_syn_terms, 'r') as f:
  syn = json.load(f)


Generate the summary of aliased terms within each document

In [None]:
## generate a dataframe of term frequency counts per review, with alias terms
df_alias_summary = check_terms_in_df(df, merged_term_list, alias_terms = syn[0])
#df_alias_summary.sort_values(by='alias_terms', ascending = False, inplace=True)

In [None]:
## display the first 30 rows
df_alias_summary.head(30)

Unnamed: 0,Term,DOC3,DOC9,DOC10,DOC7,DOC6,DOC5,DOC8,DOC1,DOC4,DOC2,TotalTerms,alias_terms
0,la,11,12,15,14,14,9,3,8,10,11,107,los angeles
1,us,10,6,25,15,4,7,7,12,4,9,99,america
2,no,8,10,21,8,7,7,11,8,2,11,93,no
3,one,2,6,9,15,8,2,3,8,7,8,68,one
4,taken,2,3,6,3,1,4,15,11,1,3,49,taken
5,all,4,3,8,4,3,2,6,6,2,6,44,all
6,end,5,5,12,3,2,2,2,4,2,6,43,end
7,ive,3,5,6,6,3,3,5,3,3,5,42,ive
8,bryan,7,1,10,1,6,1,3,0,4,7,40,bryan mills
9,man,1,3,15,2,3,1,3,5,1,4,38,man
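
Very short terms such as 'la', 'us', and 'no' lead the table. How check_terms_in_df counts occurrences is not shown here; if it counts substrings, short terms would also match inside longer words. A minimal sketch of whole-word counting for comparison, with count_term as a hypothetical helper and a made-up sentence:

```python
import re


def count_term(term, text):
    """Count whole-word occurrences of a term in lowercased text."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


# Illustrative sentence, not taken from the reviews.
text = "He flies to Paris alone; the plan is to last until the family is safe in LA."

print(text.lower().count("la"))   # substring count, also matches inside 'plan' and 'last'
print(count_term("la", text))     # whole-word count
```

Comparing the two counts for a handful of short terms is a quick way to check whether the totals in the table are inflated.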


In [None]:
num_alias_terms = len(df_alias_summary['alias_terms'].unique())
print(f'The total number of alias terms is {num_alias_terms}, a reduction of {len(merged_term_list)-num_alias_terms} from the original term list.')

The total number of alias terms is 422, a reduction of 207 from the original term list.


In [None]:
# Group the original terms under each alias and count them
value_counts = {}

for k, v in syn[0].items():
  if v not in value_counts:
    value_counts[v] = {'original_terms': [k], 'count': 1}
  else:
    value_counts[v]['original_terms'].append(k)
    value_counts[v]['count'] += 1

# Keep only the aliases that cover at least 5 original terms
top_alias_terms = {}
for k, v in value_counts.items():
  if v['count'] >= 5:
    top_alias_terms[k] = v

top_alias_terms


{'action movie': {'original_terms': ['action choreography',
   'action film',
   'action movie',
   'action thriller',
   'action thrillers',
   'actionthriller'],
  'count': 6},
 'criminals': {'original_terms': ['albanian nasties',
   'albanian ring',
   'albanian thugs',
   'criminals',
   'posse'],
  'count': 5},
 'sex traffickers': {'original_terms': ['albanian sex traffickers',
   'albanian white slavers',
   'nasty whiteslave traders',
   'sex traffickers',
   'sex trafficking',
   'sexual slavery',
   'trafficking in women',
   'white slave ring',
   'white slavers',
   'whiteslave traders'],
  'count': 10},
 'jason bourne': {'original_terms': ['bourne',
   'bourne thriller',
   'bournelike',
   'bournelike manhunt',
   'jason bourne',
   'jason bournelike skills',
   'jason ourneesque'],
  'count': 7},
 'bryan mills': {'original_terms': ['bryan',
   'bryan hours find',
   'bryan may',
   'bryan mills',
   'mills',
   'taken shows mills'],
  'count': 6},
 'spy': {'original_terms

In [None]:
df_google_check['extracted_terms_count'] = df_google_check[['Ground_Truth','GPT','Dandelion','NLTK','Google']].sum(axis=1)
df_google_check.sort_values(by='extracted_terms_count', ascending = False, inplace=True)
top_terms_df = df_google_check.loc[df_google_check['extracted_terms_count']>=2].copy()
top_terms_df['alias'] = top_terms_df['Term'].map(syn[0])
top_terms = list(top_terms_df['alias'].unique())


print(f'The number of top terms shared across term extraction methods is: {len(top_terms)}.')
print(f'A sample of the top terms across all the methods: {top_terms[:10]}')

The number of top terms shared across term extraction methods is: 105.
A sample of the top terms across all the methods: ['paris', 'liam neeson', 'pierre morel', 'famke janssen', 'maggie grace', 'cia', 'steven spielberg', 'osama bin laden', 'luc besson', 'bryan mills']


In [None]:
extraction_engines = ['Ground_Truth','GPT','Dandelion','NLTK','Google']
results = []
for e in extraction_engines:
  results.append([ int(top_terms_df[e].sum()), e])

In [None]:
sorted(results, reverse=True)

[[75, 'Dandelion'],
 [64, 'Ground_Truth'],
 [55, 'Google'],
 [45, 'GPT'],
 [38, 'NLTK']]
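
The totals above show how many of the top terms each engine returned, which is not the same as how well an engine agrees with the ground truth. A minimal sketch of precision and recall against the Ground_Truth column, assuming each row holds 0/1 indicators per engine; precision_recall is a hypothetical helper and the rows are made up for illustration:

```python
def precision_recall(rows, engine, truth="Ground_Truth"):
    """Precision and recall of one engine's 0/1 indicators against the truth column."""
    true_pos = sum(1 for r in rows if r[engine] and r[truth])
    predicted = sum(1 for r in rows if r[engine])
    actual = sum(1 for r in rows if r[truth])
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Made-up indicator rows; the real values would come from df_google_check.
rows = [
    {"Term": "paris", "Ground_Truth": 1, "NLTK": 1},
    {"Term": "liam neeson", "Ground_Truth": 1, "NLTK": 0},
    {"Term": "end", "Ground_Truth": 0, "NLTK": 1},
    {"Term": "cia", "Ground_Truth": 1, "NLTK": 1},
]

precision, recall = precision_recall(rows, "NLTK")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

An engine can return many terms and still score low on precision if most of them are not in the ground truth, so the two numbers are worth reading together.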

In [None]:
movie_terms = os.path.join(destination_folder, "LCK_Extracted_Terms.json")
with open(movie_terms, "w", encoding="utf-8") as file:
  json.dump(top_terms, file, indent=4)

print("List successfully saved to file")

List successfully saved to file
