
### The third in-class exercise (due at 11:59 PM on 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

## **Sentiment Analysis of Movie Reviews**

In this task, we want to classify movie reviews as either "positive" or "negative" based on the sentiment expressed in the text.

### **1) Bag of Words (BoW) Features:**

**Description:** BoW represents each review by the frequency of every word it contains, treating each vocabulary word as a separate feature and ignoring word order.

**Use:** It captures the presence of specific words that signal positive or negative emotion. Words such as "excellent" or "awful", for example, can have a significant impact on the predicted sentiment.
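As a minimal sketch of the idea (not the code used later in this notebook), the counts can be built by hand with the standard library. The two toy documents and the helper name `bow_vector` are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical, already-preprocessed toy documents (lowercased, stopwords removed)
docs = [
    "movie absolutely fantastic loved",
    "acting terrible plot terrible",
]

# The vocabulary is the sorted set of all words; each word becomes one column
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocabulary):
    """Return the word counts of `doc`, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

bow_matrix = [bow_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Note that "terrible" gets a count of 2 in the second row, which is the kind of signal a sentiment classifier can pick up.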

### **2) TF-IDF (Term Frequency-Inverse Document Frequency) Features:**

**Description:** TF-IDF weights each word by how often it occurs in a document, scaled down by how many documents in the corpus contain it.

**Use:** It helps identify terms that are frequent in a given review but relatively uncommon across the corpus. Such distinctive words may carry more sentiment information than words that appear in almost every review.
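The weighting can be sketched by hand as follows. This assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which to my understanding matches the scikit-learn defaults; the toy corpus and the helper name `tfidf_vector` are hypothetical.

```python
import math

# Hypothetical, already-tokenized toy corpus
docs = [
    ["movie", "fantastic", "loved"],
    ["movie", "terrible", "plot"],
    ["plot", "mediocre"],
]

n_docs = len(docs)
vocabulary = sorted({word for doc in docs for word in doc})

# Document frequency: in how many documents does each word occur?
df = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed inverse document frequency (assumed formula, see the lead-in)
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

def tfidf_vector(doc):
    """Term frequency times idf, scaled so the vector has unit L2 norm."""
    raw = [doc.count(word) * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

tfidf_matrix = [tfidf_vector(doc) for doc in docs]

for word in vocabulary:
    print(f"{word}: df={df[word]}, idf={idf[word]:.4f}")
```

In this sketch "fantastic" occurs in one document and "movie" in two, so "fantastic" receives the larger weight within the first review.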



Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample text data
sample_data = [
    "This movie is absolutely fantastic! I loved every minute of it.",
    "The acting was terrible, and the plot made no sense.",
    "I'm not sure how to feel about this film. It had its moments, but overall, it was mediocre.",
]

# Download the tokenizer model and stopword list, then load the English stopwords
nltk.download("punkt")
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
    # Tokenize text
    tokens = word_tokenize(text.lower())

    # Remove punctuation and stopwords
    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

    return " ".join(filtered_tokens)

# Preprocess the sample data
preprocessed_data = [preprocess_text(text) for text in sample_data]

# Bag of Words (BoW) Features
count_vectorizer = CountVectorizer()
bow_features = count_vectorizer.fit_transform(preprocessed_data)

print("Bag of Words (BoW) Features:")
print(bow_features.toarray())
print("\n")

# TF-IDF Features
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(preprocessed_data)

print("TF-IDF Features:")
print(tfidf_features.toarray())
print("\n")


Bag of Words (BoW) Features:
[[1 0 1 1 0 0 1 0 0 1 0 1 0 0 0 0 0]
 [0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1]
 [0 0 0 0 1 1 0 0 1 0 1 0 1 0 0 1 0]]


TF-IDF Features:
[[0.40824829 0.         0.40824829 0.40824829 0.         0.
  0.40824829 0.         0.         0.40824829 0.         0.40824829
  0.         0.         0.         0.         0.        ]
 [0.         0.4472136  0.         0.         0.         0.
  0.         0.4472136  0.         0.         0.         0.
  0.         0.4472136  0.4472136  0.         0.4472136 ]
 [0.         0.         0.         0.         0.40824829 0.40824829
  0.         0.         0.40824829 0.         0.40824829 0.
  0.40824829 0.         0.         0.40824829 0.        ]]




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
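The repeated values 0.40824829 and 0.4472136 in the TF-IDF output can be checked by hand. In the BoW output above, every word occurs once and in only one review, so all words share the same idf. If each row is L2-normalized (assumed here to be the vectorizer's default), a row with k nonzero entries gets the weight 1/sqrt(k) in each of them. The helper name `expected_weight` is made up for this sketch.

```python
import math

def expected_weight(k):
    """Weight of each entry in a unit-norm row with k equal nonzero entries."""
    return 1 / math.sqrt(k)

# Number of nonzero entries per row, read off the BoW output above
nonzero_per_review = [6, 5, 6]

for i, k in enumerate(nonzero_per_review, start=1):
    print(f"Review {i}: {expected_weight(k):.8f}")
```

With such a tiny corpus TF-IDF therefore carries no more information than BoW; the two representations only start to differ when words repeat within or across documents.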


Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, and rank the features based on their importance in descending order.

In [9]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import chi2

# Sample text data
sample_data = [
    "This movie is absolutely fantastic! I loved every minute of it.",
    "The acting was terrible, and the plot made no sense.",
    "I'm not sure how to feel about this film. It had its moments, but overall, it was mediocre.",
]

# Sample labels for sentiment (0: Negative, 1: Positive);
# the mixed third review is labeled positive here
labels = [1, 0, 1]

# Tokenization and stopword removal:
# preprocessed_data is reused from the Question 2 cell above

# Bag of Words (BoW) Features
count_vectorizer = CountVectorizer()
bow_features = count_vectorizer.fit_transform(preprocessed_data)

# TF-IDF Features
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(preprocessed_data)

# Feature selection using the chi-squared test
def rank_features(features, labels, feature_names):
    """Print features sorted by chi-squared score, highest first."""
    chi2_scores, _ = chi2(features, labels)
    sorted_indices = np.argsort(chi2_scores)[::-1]

    # Print feature names and their chi-squared scores in descending order
    for i in sorted_indices:
        print(f"Feature: {feature_names[i]}, Chi-Squared Score: {chi2_scores[i]}")

# Rank features for BoW
print("Ranking of BoW Features:")
rank_features(bow_features, labels, count_vectorizer.get_feature_names_out())
print("\n")

# Rank features for TF-IDF (using the TF-IDF vectorizer's own vocabulary)
print("Ranking of TF-IDF Features:")
rank_features(tfidf_features, labels, tfidf_vectorizer.get_feature_names_out())
print("\n")


Ranking of BoW Features:
Feature: terrible, Chi-Squared Score: 2.0000000000000004
Feature: sense, Chi-Squared Score: 2.0000000000000004
Feature: plot, Chi-Squared Score: 2.0000000000000004
Feature: acting, Chi-Squared Score: 2.0000000000000004
Feature: made, Chi-Squared Score: 2.0000000000000004
Feature: loved, Chi-Squared Score: 0.5
Feature: every, Chi-Squared Score: 0.5
Feature: fantastic, Chi-Squared Score: 0.5
Feature: feel, Chi-Squared Score: 0.5
Feature: film, Chi-Squared Score: 0.5
Feature: mediocre, Chi-Squared Score: 0.5
Feature: sure, Chi-Squared Score: 0.5
Feature: minute, Chi-Squared Score: 0.5
Feature: moments, Chi-Squared Score: 0.5
Feature: movie, Chi-Squared Score: 0.5
Feature: overall, Chi-Squared Score: 0.5
Feature: absolutely, Chi-Squared Score: 0.5


Ranking of TF-IDF Features:
Feature: terrible, Chi-Squared Score: 0.894427190999916
Feature: sense, Chi-Squared Score: 0.894427190999916
Feature: plot, Chi-Squared Score: 0.894427190999916
Feature: acting, Chi-Squared S
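
The BoW scores of 2.0 and 0.5 above can be reproduced by hand. The sketch below assumes the usual construction for this test: for each class, compare the observed sum of the feature with the sum expected if the feature were spread across classes in proportion to class size. The function name `chi2_score` is hypothetical, and the two columns are read off the BoW matrix from Question 2.

```python
def chi2_score(feature_column, labels):
    """Chi-squared statistic between one non-negative feature and class labels."""
    total = sum(feature_column)
    n = len(labels)
    score = 0.0
    for c in sorted(set(labels)):
        # Observed: how much of the feature falls into class c
        observed = sum(v for v, y in zip(feature_column, labels) if y == c)
        # Expected: the feature total split in proportion to class size
        expected = total * labels.count(c) / n
        score += (observed - expected) ** 2 / expected
    return score

labels = [1, 0, 1]        # same labels as in the cell above
terrible = [0, 1, 0]      # occurs only in the negative review
fantastic = [1, 0, 0]     # occurs only in one positive review

print("terrible:", chi2_score(terrible, labels))
print("fantastic:", chi2_score(fantastic, labels))
```

Words from the single negative review score higher simply because the negative class is the smaller one; with only three documents these scores should be read as an illustration rather than as statistically meaningful evidence.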

Question 4 (10 points): Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [2]:
pip install transformers



In [3]:
pip install transformers torch scipy




In [8]:
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

# Sample text data
sample_data = [
    "This movie is absolutely fantastic! I loved every minute of it.",
    "The acting was terrible, and the plot made no sense.",
    "I'm not sure how to feel about this film. It had its moments, but overall, it was mediocre.",
]

# Define a query
query = "I enjoyed watching this movie. It was great!"

# Load BERT tokenizer and model (you can choose a different BERT variant if needed)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Encode the query and text data into BERT embeddings
query_tokens = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
text_tokens = tokenizer(sample_data, padding=True, truncation=True, return_tensors="pt")

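# Get BERT embeddings by mean-pooling the last hidden state over all token positions
# (for the padded batch of texts, this mean also includes padding positions)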
with torch.no_grad():
    query_embeddings = model(**query_tokens).last_hidden_state.mean(dim=1).numpy()
    text_embeddings = model(**text_tokens).last_hidden_state.mean(dim=1).numpy()

# Calculate cosine similarity between the query and each text
cosine_similarities = cosine_similarity(query_embeddings, text_embeddings)

# Rank the similarity scores in descending order
sorted_indices = np.argsort(cosine_similarities[0])[::-1]

# Print the ranked similarity scores and corresponding text data
for index in sorted_indices:
    print(f"Similarity Score: {cosine_similarities[0][index]}")
    print(f"Text: {sample_data[index]}\n")


Similarity Score: 0.8284611701965332
Text: This movie is absolutely fantastic! I loved every minute of it.

Similarity Score: 0.7313329577445984
Text: I'm not sure how to feel about this film. It had its moments, but overall, it was mediocre.

Similarity Score: 0.5957400798797607
Text: The acting was terrible, and the plot made no sense.
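
One possible refinement, sketched here on made-up numbers rather than real BERT outputs, is to exclude padding positions from the mean by using the tokenizer's attention mask. The cell above averages over every position, so shorter texts in the padded batch are also averaged over their padding tokens. The helper names `masked_mean` and `cosine` and the 2-dimensional vectors are assumptions for illustration only.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 2-dimensional "token embeddings"; the last position is padding
token_vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(token_vectors, attention_mask)
plain = masked_mean(token_vectors, [1, 1, 1])  # plain mean over all positions

query = [1.0, 0.0]
print("masked mean:", cosine(pooled, query))
print("plain mean: ", cosine(plain, query))
```

In this toy case the padding vector pulls the plain mean away from the query, while the masked mean ignores it. How much this matters for the three reviews above would have to be checked by rerunning the notebook.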

