# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [1]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:

Word Frequency : Word frequency is used for sentiment analysis of the given text. A higher frequency means positive sentiment, whereas low frequency represents negative sentiment.
Sentiment Lexicons : Sentiment Lexicons is nothing but identifying the positive or negative sentiment by using words like good, bad, happy, sad, etc from the given text.
Analysis of Readability : Based on the complexity of the language we identify the sentiment analysis of the text.
Analysis of Content : Analyzing the given text based on the topic, language used, writing style helps to identiy the sentiment anlysis.
Emotions Detection : From the given text Identifying the emotions like happiness , anger, surprise, etc by using facial expressions and body language.
'''

'\nPlease write you answer here:\n\nWord Frequency : Word frequency is used for sentiment analysis of the given text. A higher frequency means positive sentiment, whereas low frequency represents negative sentiment.\nSentiment Lexicons : Sentiment Lexicons is nothing but identifying the positive or negative sentiment by using words like good, bad, happy, sad, etc from the given text.\nAnalysis of Readability : Based on the complexity of the language we identify the sentiment analysis of the text.\nAnalysis of Content : Analyzing the given text based on the topic, language used, writing style helps to identiy the sentiment anlysis.\nEmotions Detection : From the given text Identifying the emotions like happiness , anger, surprise, etc by using facial expressions and body language. \n'

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [2]:
# You code here (Please add comments in the code):
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import pandas as pd
import time

def get_response(request_url):
    headers =({'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36', 'Accept-Language': 'en-US, en;q=0.5'})
    return requests.get(request_url, headers = headers)

def get_elements_by_attributes(source_data, tag, attributes):
    scrapped_data = BeautifulSoup(source_data, "html.parser")
    return scrapped_data.find_all(tag, attributes)

# Extract reviews and ratings from IMDB
movie_review_url = "https://www.imdb.com/title/tt15433956/reviews?ref_=tt_urv"
reviews = []
ratings = []
while(len(reviews) < 100):
    response_data = get_response(movie_review_url).text
    elements = get_elements_by_attributes(response_data, "div", {"class" : "text show-more__control"})
    ratings_elements = get_elements_by_attributes(response_data, "span", {"class" : "rating-other-user-rating"})
    pagination_key = get_elements_by_attributes(response_data, 'div',{'class' : 'load-more-data'})[0]["data-key"]
    reviews.extend([i.text for i, j in zip(elements, range(1, len(elements) + 1))])
    ratings.extend([1 if(int(i.text.replace("\n", "").split("/")[0]) > 5) else 0 for i, j in zip(ratings_elements, range(1, len(ratings_elements) + 1))])
    print(f"Collected - {len(reviews)}")
    movie_review_url = "https://www.imdb.com/title/tt13751694/reviews/_ajax?&paginationKey="+pagination_key
print(reviews)
print(ratings)

Collected - 25
Collected - 50
Collected - 75
Collected - 100
["The film boasts a good storyline with a well-executed Indian superhero flair. Character development is commendable, particularly with the endearing Hanu Man, though the Super villain character could have been more captivating.Teja Sajja shines in his role, delivering a fantastic performance.Getup Seenu's look and performance is fantastic. Satya is fun.Vara Lakshmi portrays her character decently, although her articulation disorder may be distracting; considering this, a dubbing artist might enhance the overall experience.Vinay Rai as the antagonist is visually fitting, and his performance is nothing short of terrific.Amrita Iyer, resembling a side character from Telugu TV serials, gives a lackluster performance, suggesting she may not be best suited for the big screen.The supporting cast performs well, and the film benefits from appealing locations. While the cinematography is pleasing, sets and costumes, while good, lack a

In [3]:
! pip install textstat



In [4]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import opinion_lexicon
from textblob import TextBlob
from collections import Counter
import textstat

# NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')
nltk.download('opinion_lexicon')

def extract_features(review):

    # extraction of feature 1 - Word Count
    word_count = len(word_tokenize(review))

    # extraction of feature 2 - Sentiment Lexicon
    stop_words = set(stopwords.words('english'))
    words = [word.lower() for word in word_tokenize(review) if word.isalpha() and word.lower() not in stop_words]
    positive_words = set(opinion_lexicon.positive())
    negative_words = set(opinion_lexicon.negative())
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)
    sentiment_score = positive_count - negative_count

    # extraction of feature 3 - Content Analysis
    tone_words = [word for word in words if word in opinion_lexicon.words()]
    content_score = len(tone_words) / word_count if word_count > 0 else 0

    # extraction of feature 4 - Emotion Detection
    sid = SentimentIntensityAnalyzer()
    emotions = sid.polarity_scores(review)
    dominant_emotion = max(emotions, key=emotions.get)

    # extraction of feature 5 - Readability Analysis
    flesch_score = textstat.flesch_reading_ease(review)

    # print(dominant_emotion)
    return {
        'wrd_cnt': word_count,
        'sntmnt_scr': sentiment_score,
        'cntnt_scr': content_score,
        'domnt_emotn': 0 if(dominant_emotion == "neu") else 1,
        'flesch_scr': flesch_score
    }

def analyze_review(review):
    features = extract_features(review)

    # Rule-based sentiment analysis
    if features['sntmnt_scr'] >= 0:
        sentiment = '1' # 1 for Good or Neutral
    elif features['sntmnt_scr'] < 0:
        sentiment = '0' # 0 for Bad

    # Flesch reading ease score to determine sentiment
    if features['flesch_scr'] > 30:
        readability_sentiment = '1' # 1 if the score is > 30 (Good / Neutral)
    elif features['flesch_scr'] < 30: # 0 if the score is < 30 (Bad)
        readability_sentiment = '0'

    print("Sentiment lexicon:", sentiment)
    print("Readability analysis:", readability_sentiment)
    print("Dominant Emotion:", features['domnt_emotn'])
    print("Content Score:", features['cntnt_scr'])
    return features

word_count = []
sentiment_score = []
content_score = []
dominant_emotion = []
flesh_score = []
for review in reviews:
  features = analyze_review(review)
  word_count.append(features['wrd_cnt'])
  sentiment_score.append(features['sntmnt_scr'])
  content_score.append(features['cntnt_scr'])
  dominant_emotion.append(features['domnt_emotn'])
  flesh_score.append(features['flesch_scr'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package opinion_lexicon to /root/nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!


Sentiment lexicon: 1
Readability analysis: 1
Dominant Emotion: 0
Content Score: 0.115
Sentiment lexicon: 1
Readability analysis: 1
Dominant Emotion: 1
Content Score: 0.07534246575342465
Sentiment lexicon: 1
Readability analysis: 1
Dominant Emotion: 1
Content Score: 0.06267029972752043
Sentiment lexicon: 1
Readability analysis: 1
Dominant Emotion: 1
Content Score: 0.05263157894736842
Sentiment lexicon: 1
Readability analysis: 1
Dominant Emotion: 1
Content Score: 0.08620689655172414
Sentiment lexicon: 1
Readability analysis: 1
Dominant Emotion: 0
Content Score: 0.05982905982905983
Sentiment lexicon: 1
Readability analysis: 1
Dominant Emotion: 1
Content Score: 0.11836734693877551
Sentiment lexicon: 1
Readability analysis: 1
Dominant Emotion: 1
Content Score: 0.12149532710280374
Sentiment lexicon: 1
Readability analysis: 1
Dominant Emotion: 0
Content Score: 0.040268456375838924
Sentiment lexicon: 1
Readability analysis: 1
Dominant Emotion: 1
Content Score: 0.059602649006622516
Sentiment le

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [5]:
# You code here (Please add comments in the code):
from sklearn.feature_selection import chi2
import numpy as np

# Create a single numpy array with all the features
features = np.abs(np.hstack(
    (np.array(word_count).reshape(-1, 1), np.array(sentiment_score).reshape(-1, 1), np.array(content_score).reshape(-1, 1), np.array(dominant_emotion).reshape(-1, 1), np.array(flesh_score).reshape(-1, 1))
    ))

print(features)
# Chi-Square test for selecting the features by providing the ratings as a target attribute
chi_scores, p_values = chi2(features, np.array(ratings))

# Rank features based on Chi-Square scores
feature_names = ['Word Count', 'Sentiment Score', 'Content Score', 'Dominant Emotion', 'Flesh_score']
feature_scores = sorted(zip(feature_names, chi_scores), key=lambda x: x[1], reverse=True)

# Print ranked features
print("Ranked Features based on Chi-Square Score:")
for feature, score in feature_scores:
    print(f"{feature}: {score}")

[[2.00000000e+02 1.10000000e+01 1.15000000e-01 0.00000000e+00
  3.35100000e+01]
 [1.46000000e+02 9.00000000e+00 7.53424658e-02 1.00000000e+00
  7.96000000e+01]
 [3.67000000e+02 1.30000000e+01 6.26702997e-02 1.00000000e+00
  6.43400000e+01]
 [2.09000000e+02 3.00000000e+00 5.26315789e-02 1.00000000e+00
  5.34400000e+01]
 [1.16000000e+02 4.00000000e+00 8.62068966e-02 1.00000000e+00
  4.34900000e+01]
 [1.17000000e+02 1.00000000e+00 5.98290598e-02 0.00000000e+00
  8.59900000e+01]
 [2.45000000e+02 2.10000000e+01 1.18367347e-01 1.00000000e+00
  7.00900000e+01]
 [1.07000000e+02 1.10000000e+01 1.21495327e-01 1.00000000e+00
  6.73500000e+01]
 [1.49000000e+02 0.00000000e+00 4.02684564e-02 0.00000000e+00
  6.49100000e+01]
 [1.51000000e+02 7.00000000e+00 5.96026490e-02 1.00000000e+00
  4.06900000e+01]
 [1.58000000e+02 4.00000000e+00 1.13924051e-01 1.00000000e+00
  4.24500000e+01]
 [1.94000000e+02 8.00000000e+00 5.15463918e-02 1.00000000e+00
  6.21700000e+01]
 [2.08000000e+02 1.70000000e+01 9.134615

## Question 4 (10 points):
Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant docments. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similary with descending order.

In [6]:
# You code here (Please add comments in the code):
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Assigned reviews to Text_data
text_data = reviews

# Sample Query
query = "It's really great how you blended mythology and superhero genre loved every bit in the movie. Thank you for giving us such a wonderful movie."

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize and encode the query
query_tokens = tokenizer.encode(query, add_special_tokens=True, truncation=True, max_length=512, padding='max_length', return_tensors='pt')

# Encode the query using BERT model
with torch.no_grad():
    query_embedding = model(query_tokens)[0][:, 0, :].squeeze(0)

# List to store similarity scores
similarity_scores = []

# Process each text data
for text in text_data:

    text_tokens = tokenizer.encode(text, add_special_tokens=True, truncation=True, max_length=512, padding='max_length', return_tensors='pt')

    with torch.no_grad():
        text_embedding = model(text_tokens)[0][:, 0, :].squeeze(0)

    similarity = cosine_similarity(query_embedding.reshape(1, -1), text_embedding.reshape(1, -1))[0][0]

    similarity_scores.append((text, similarity))

similarity_scores.sort(key=lambda x: x[1], reverse=True)

for rank, (text, similarity) in enumerate(similarity_scores, 1):
    print(f"Rank {rank}: Similarity Score: {similarity:.4f}\n{text}\n")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Rank 1: Similarity Score: 0.9506
I was so excited about this movie and I went to watch it todat and God what a worthless movie it was!What the hell was that movie all about? Absolutely no storyline, extremely violent and that too without any reason. As far as acting is concerned, it was good, but then again what is acting worth if the storyline is absolutely nonsense!It is humble request that please stop making such nonsense! People do not pay 300-400 rupees to watch something like this, I honestly regret wasting my money on this. Moreover, it felt like it was made in opposition to feminism, seems like some really felt agonized by the idea of it! Not recommended by me at all.

Rank 2: Similarity Score: 0.9459
This is pure Cinema ,Screenplay ,BGM everything is top notch
Prashant Varma what a direction man you have raised the bar. It's really great how you blended mythology and superhero genre loved every bit in the movie. Thank you for giving us such a wonderful movie.Tejja Sajja brothe

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [7]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:

I learnt about the feature selection and extraction terminologies practically.
This exercise consists of all the stages of NLP like pre-processing, feature extraction and selection.
'''

'\nPlease write you answer here:\n\nI learnt about the feature selection and extraction terminologies practically.\nThis exercise consists of all the stages of NLP like pre-processing, feature extraction and selection.\n'