## The Third In-Class Exercise (due at 11:59 PM on 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

# Your answer here (no code for this question; write your answer to the question above in as much detail as possible):

Text classification and text mining are crucial for converting unstructured text data into structured information that can be analyzed for decision-making. An intriguing task in this field is known as "Intent Detection," which revolves around automatically discerning the fundamental objectives or intentions behind a given text. This capability holds immense value in diverse applications like customer support, marketing, and chatbots. In this assignment, I will explore the features that prove beneficial when constructing a machine learning model for intent detection.



**Word2Vec Features:**

Word embeddings like Word2Vec capture semantic connections among words by representing them in a continuous vector space. These features aid the model in comprehending word context and meaning within text, facilitating the recognition of similarities and associations between different phrases or keywords, which in turn helps classify text into specific intents for intent detection.
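As a rough illustration of how embedding features can feed a classifier, the sketch below averages per-word vectors into one fixed-length sentence vector. The three-dimensional vectors in `toy_vectors` and the helper `sentence_vector` are made up for this example; a real system would load vectors from a trained Word2Vec model.

```python
# Toy, hand-made 3-d "embeddings"; these values are hypothetical, not from a trained model.
toy_vectors = {
    "buy":      [0.9, 0.1, 0.0],
    "purchase": [0.8, 0.2, 0.0],
    "refund":   [0.1, 0.9, 0.0],
    "laptop":   [0.5, 0.0, 0.5],
}

def sentence_vector(text, vectors, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        # No known word: fall back to a zero vector of the same length
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]

features = sentence_vector("Buy laptop", toy_vectors)
print(features)
```

Averaging discards word order, which is one reason the other feature types in this answer are still useful alongside embeddings.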

**Linguistic Feature Extraction:**

Linguistic features encompass various elements of language, including parts of speech, syntactic structures, and grammatical patterns. These features offer valuable insights into the syntactical and grammatical characteristics of the text. For instance, the identification of verbs, nouns, or specific grammatical structures can play a role in determining the intention conveyed by a sentence.

**Dependency Parse:**

Dependency parsing involves the analysis of a sentence's grammatical structure by identifying word relationships, such as subject-verb-object dependencies. The inclusion of dependency parse features assists the model in grasping the underlying structure and semantics of a sentence, facilitating intent classification. For example, the model can deduce an intent related to purchasing when it recognizes the relationship between "buy" and "product" in a sentence.

**TF-IDF (Term Frequency-Inverse Document Frequency):**

TF-IDF weights a word by how often it appears in a document (term frequency) and down-weights it by how many documents in the corpus contain it (inverse document frequency). It helps pinpoint keywords or phrases that strongly indicate specific intents: words with high TF-IDF scores are frequent in one text but rare elsewhere, so they are likely to be pivotal in determining that text's intent. For instance, within a customer support context, words like "refund" or "complaint" with high TF-IDF scores may strongly suggest an intent related to issue resolution.
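To make the scoring concrete, here is a minimal sketch that computes TF-IDF by hand using one common variant, tf × log(N / df). The three support messages in `docs` are invented for the example, and libraries typically apply additional smoothing, so their numbers may differ.

```python
import math
from collections import Counter

# Hypothetical customer-support messages, made up for this example.
docs = [
    "i want a refund for my order",
    "where is my order for a laptop",
    "i want to change my password",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return {word: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

scores = tfidf(tokenized[0])
# Words of the first message, sorted from most to least distinctive
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy corpus "refund" occurs in only one message, so it ranks first, while "my" occurs in every message and scores zero.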

**Entity Features:**

Entities refer to specific named objects or concepts, such as product names, locations, or dates. The recognition and extraction of entities from text play a critical role in intent detection. For instance, when a text mentions "booking a flight to New York on December 1st," the identification of entities like "New York" and "December 1st" is instrumental in classifying the intent as travel planning.
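The flight-booking example can be sketched with plain dictionary and pattern lookup. This is a simplification and not trained named-entity recognition: `KNOWN_LOCATIONS`, `DATE_PATTERN`, and `extract_entities` are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical gazetteer of place names; a real system would use a trained NER model.
KNOWN_LOCATIONS = ["New York", "London", "Chennai"]

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Matches dates written like "December 1st" or "March 22"
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}(?:st|nd|rd|th)?\b")

def extract_entities(text):
    """Return (entity_text, label) pairs found by dictionary and pattern lookup."""
    entities = [(loc, "LOCATION") for loc in KNOWN_LOCATIONS if loc in text]
    entities += [(m.group(0), "DATE") for m in DATE_PATTERN.finditer(text)]
    return entities

found = extract_entities("booking a flight to New York on December 1st")
print(found)
```

The presence of a location and a date could then be encoded as binary or count features for the intent classifier.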

In summary, constructing a machine learning model for intent detection requires a combination of diverse features to effectively capture the intricacies of human language and context. Word2Vec features provide semantic understanding, linguistic features offer insights into grammar, dependency parse reveals sentence structure, TF-IDF identifies key terms, and entity recognition pinpoints specific objects or concepts. By integrating these features, the model can reliably categorize text into meaningful intents, thereby enabling businesses to streamline processes and enhance customer service efficiency.

Question 2 (10 points): Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [3]:
!pip install nltk



In [7]:
!pip install WordCloud



In [8]:
import nltk
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests as rq
from wordcloud import WordCloud
from nltk.corpus import stopwords

# Define the URL to collect sample text data from IMDb movie reviews
url = 'https://www.imdb.com/title/tt7019842/reviews'

# Create a DataFrame with a column to store review text
df = pd.DataFrame(columns=['Review'])

# Send an HTTP request to the URL and parse the HTML content
req = rq.get(url).text
soup = bs(req, 'html.parser')

# Collect review text
review_text = soup.find_all('div', attrs={'class' : 'text show-more__control'})

# Extract the text of every review into a list
review_list = [tag.get_text() for tag in review_text]

# Assign the list of reviews to the DataFrame
df['Review'] = review_list

# Calculate the word count for each review
df['word_Count'] = df['Review'].apply(lambda x: len(str(x).split(" ")))

# Display the DataFrame with review text and word count
df[['Review', 'word_Count']].head()


| | Review | word_Count |
|---|---|---|
| 0 | A simple love story told in an engaging way.Th... | 94 |
| 1 | Being a huge Sethupathi-Trisha fan, undoubtedl... | 109 |
| 2 | 96 is one of the rare movies where after enter... | 56 |
| 3 | 96 is a elegant movie that is delicately craft... | 59 |
| 4 | '96 Digged out my heart and put into a past wo... | 27 |


In [9]:
df['char_Count'] = df['Review'].str.len() ## this also includes spaces
df[['Review','char_Count']].head()

| | Review | char_Count |
|---|---|---|
| 0 | A simple love story told in an engaging way.Th... | 511 |
| 1 | Being a huge Sethupathi-Trisha fan, undoubtedl... | 661 |
| 2 | 96 is one of the rare movies where after enter... | 330 |
| 3 | 96 is a elegant movie that is delicately craft... | 325 |
| 4 | '96 Digged out my heart and put into a past wo... | 143 |


In [10]:
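import nltk
from nltk.corpus import stopwords

# Download the English stop-word list (a no-op if it is already present)
nltk.download('stopwords')
stop = stopwords.words('english')

# Count the stop words in each review (case-sensitive match on whitespace tokens)
df['stopwords'] = df['Review'].apply(lambda text: len([w for w in text.split() if w in stop]))
df[['Review','stopwords']].head()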

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mounicatamalampudi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | Review | stopwords |
|---|---|---|
| 0 | A simple love story told in an engaging way.Th... | 39 |
| 1 | Being a huge Sethupathi-Trisha fan, undoubtedl... | 45 |
| 2 | 96 is one of the rare movies where after enter... | 28 |
| 3 | 96 is a elegant movie that is delicately craft... | 26 |
| 4 | '96 Digged out my heart and put into a past wo... | 12 |


In [11]:
# Count the hashtags (whitespace tokens starting with '#') in each review
df['Hashtags'] = df['Review'].apply(lambda text: len([w for w in text.split() if w.startswith('#')]))
df[['Review','Hashtags']].head()

| | Review | Hashtags |
|---|---|---|
| 0 | A simple love story told in an engaging way.Th... | 0 |
| 1 | Being a huge Sethupathi-Trisha fan, undoubtedl... | 0 |
| 2 | 96 is one of the rare movies where after enter... | 0 |
| 3 | 96 is a elegant movie that is delicately craft... | 0 |
| 4 | '96 Digged out my heart and put into a past wo... | 0 |


In [12]:
df['Numerics'] = df['Review'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
df[['Review','Numerics']].head()

| | Review | Numerics |
|---|---|---|
| 0 | A simple love story told in an engaging way.Th... | 0 |
| 1 | Being a huge Sethupathi-Trisha fan, undoubtedl... | 1 |
| 2 | 96 is one of the rare movies where after enter... | 1 |
| 3 | 96 is a elegant movie that is delicately craft... | 1 |
| 4 | '96 Digged out my heart and put into a past wo... | 0 |


In [13]:
df['Upper'] = df['Review'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
df[['Review','Upper']].head()

| | Review | Upper |
|---|---|---|
| 0 | A simple love story told in an engaging way.Th... | 2 |
| 1 | Being a huge Sethupathi-Trisha fan, undoubtedl... | 2 |
| 2 | 96 is one of the rare movies where after enter... | 0 |
| 3 | 96 is a elegant movie that is delicately craft... | 0 |
| 4 | '96 Digged out my heart and put into a past wo... | 0 |


Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [14]:
# Your code here (please add comments in the code):
# Term frequency (tf): count each space-separated token in the second review (row 1)
text_classify = (df['Review'][1:2]).apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis = 0).reset_index()
text_classify.columns = ['words', 'tf']
text_classify





| | words | tf |
|---|---|---|
| 0 | a | 7 |
| 1 | the | 6 |
| 2 | for | 3 |
| 3 | that | 3 |
| 4 | in | 2 |
| ... | ... | ... |
| 81 | come | 1 |
| 82 | nostalgia | 1 |
| 83 | laughter, | 1 |
| 84 | moments | 1 |


In [15]:
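import numpy as np

# Inverse document frequency (idf): log(N / number of reviews that contain the token as a substring).
# regex=False makes str.contains treat the token literally instead of as a regular expression.
for i, word in enumerate(text_classify['words']):
    text_classify.loc[i, 'idf'] = np.log(df.shape[0] / len(df[df['Review'].str.contains(word, regex=False)]))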


In [16]:
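# TF-IDF is the product (not the sum) of term frequency and inverse document frequency
text_classify['tfidf'] = text_classify['tf'] * text_classify['idf']
text_classify

| | words | tf | idf | tfidf |
|---|---|---|---|---|
| 0 | a | 7 | 0.000000 | 0.000000 |
| 1 | the | 6 | 0.223144 | 1.338861 |
| 2 | for | 3 | 0.385662 | 1.156987 |
| 3 | that | 3 | 0.916291 | 2.748872 |
| 4 | in | 2 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... |
| 81 | come | 1 | 1.832581 | 1.832581 |
| 82 | nostalgia | 1 | 3.218876 | 3.218876 |
| 83 | laughter, | 1 | 3.218876 | 3.218876 |
| 84 | moments | 1 | 2.120264 | 2.120264 |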
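Question 3 asks for the features in descending order of importance. Below is a minimal sketch of that ranking step, taking TF-IDF as the conventional product tf × idf. The `rows` list is copied by hand from the rows visible in the table above and is not the full 85-word table, so the ordering only covers those nine words.

```python
# (word, tf, idf) copied by hand from the rows shown in the table above
rows = [
    ("a", 7, 0.000000),
    ("the", 6, 0.223144),
    ("for", 3, 0.385662),
    ("that", 3, 0.916291),
    ("in", 2, 0.000000),
    ("come", 1, 1.832581),
    ("nostalgia", 1, 3.218876),
    ("laughter,", 1, 3.218876),
    ("moments", 1, 2.120264),
]

# Score each word with tf * idf, then sort from highest to lowest score
scored = [(word, tf * idf) for word, tf, idf in rows]
ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)

for word, score in ranked:
    print(f"{word:<10} {score:.6f}")
```

On the full DataFrame, the equivalent step would be sorting by the score column, for example `text_classify.sort_values('tfidf', ascending=False)`.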


Question 4 (10 points): Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.
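The ranking step itself does not depend on BERT, so it can be sketched with small hand-made vectors. Everything below is illustrative: `query_vec` and `doc_vecs` are hypothetical 3-dimensional stand-ins for real embeddings (which have 768 dimensions for `bert-base-uncased`), and `cosine` is a helper written for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors standing in for the query and document embeddings
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],   # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],   # partly aligned with the query
    "doc_c": [0.0, 1.0, 0.0],   # orthogonal to the query
}

# Score every document, then sort from most to least similar
scores = {name: cosine(query_vec, vec) for name, vec in doc_vecs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because cosine similarity depends only on direction, a long and a short review can score equally if their embeddings point the same way.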

In [19]:
!pip install transformers



In [20]:
# Your code here (please add comments in the code):

import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Sample text data: the reviews collected for Question 2
texts = df['Review'].tolist()

# Query to match against the reviews
query = "The cinematography in '96' beautifully captures the essence of nostalgia..."

# Tokenize and encode the query and texts (inputs longer than the model's limit are truncated)
query_tokens = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
text_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Encode the query and texts using BERT, mean-pooling the token vectors into one embedding per text.
# Note: this plain mean also averages over the padding positions of shorter reviews.
with torch.no_grad():
    query_embeddings = model(**query_tokens).last_hidden_state.mean(dim=1)
    text_embeddings = model(**text_tokens).last_hidden_state.mean(dim=1)

# Calculate cosine similarity between the query and every review
cosine_similarities = cosine_similarity(query_embeddings, text_embeddings)

# Rank documents by similarity in descending order
ranking = np.argsort(cosine_similarities[0])[::-1]

# Print the ranked documents
print("Query:", query)
for i, idx in enumerate(ranking):
    print(f"Rank {i + 1}: {texts[idx]} (Similarity: {cosine_similarities[0][idx]:.4f})")





Query: The cinematography in '96' beautifully captures the essence of nostalgia...
Rank 1: #96 squeezed my heart out.. heavy heart beats , cheek muscles were twitching coz I was trying to control myself from tearing up and somehow managed to not let them fall😅 It's a heart touching experience. Lots of 80s 90s memories refreshed. The screenplay is slow which I didn't find as a negative because that's how the director let's then situations and characters breathe in and make us feel how real this is. It's the beautiful memories and conversations between 2 ppl over a night. It takes guts to make a movie like this without comprising the original script by adding any unwanted commercial elements. I am in awe of Director Prem and his beautiful script. Loved his creation.Kaadhaley kaadhaley bgm with 96 title card brought goosebumps. Sounds of birds chirping , water flowing- Moments were lifted to another level by Govind Vasantha s music and Chinmayi s voice. If possible watch it without any di