<a href="https://colab.research.google.com/github/gtakhil95/Akhil_INFO5731_Fall2024/blob/main/Gundampalli_Akhil_Exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided *.ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [1]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Sentiment Analysis of Customer Reviews
Problem Definition:
Based on the text content of the customer reviews, classify them as either positive, negative, or neutral.
Feature Types and Their Usefulness:
1.	Bag of Words (BoW):
Description: The text is transformed into a matrix in which each row represents a document (here, a review) and each column represents a term from the vocabulary. Each cell holds the frequency of that term in that document.
Why Useful: BoW captures which words occur in a text and how often. In sentiment analysis, positive words like "good" and "excellent" or negative words like "bad" and "terrible" can be used to classify sentiment.

2.	TF-IDF (Term Frequency - Inverse Document Frequency):
Description: This feature estimates how important a word is to a document relative to a collection of documents. TF-IDF weights each word by how often it appears in a document compared with how common it is across all documents.
Why Useful: Some words, like "the" or "is", are used a lot but carry little meaning. TF-IDF lowers the weight of these words and highlights the more distinctive terms that matter for sentiment recognition.

3.	Sentiment Lexicons (Polarity Features):
Description: These are pre-compiled lists of words (positive, negative, neutral, etc.) with known sentiment values. Tokens in the text can be matched against the lexicon, and the resulting sentiment scores can be used as features.
Why Useful: Lexicons like SentiWordNet or AFINN let us assign scores to words directly from prior sentiment knowledge. A review is more likely to be positive if it contains many words from the positive list.

4.	Part of Speech (POS) Tags:
Description: Each word in a sentence is tagged with its part of speech, such as verb, adjective, or adverb.
Why Useful: Adjectives and adverbs are particularly significant in sentiment analysis because they frequently convey feelings and opinions. Verbs such as "love" and "hate" can also signal sentiment. Including POS tags as features helps the model concentrate on the words most relevant to sentiment.

5.	N-grams (Bigrams, Trigrams):
Description: These are sequences of consecutive words. Bigrams (two words) and trigrams (three words) capture meaning and context that a single word cannot.
Why Useful: Important multi-word phrases like "not good" or "very bad" are missed by single-word features like BoW, but N-grams can capture them. Sequences and phrases give a richer picture of sentiment, particularly when negation is present (e.g., "not bad").

6.	Negation Handling (Negation Features):
Description: A dedicated feature that tracks negation words (such as "not" or "never") and how they affect other words. This can be done by flagging the words that follow a negation or by inverting their sentiment.
Why Useful: Negation (e.g., "not bad" vs. "bad") can completely change the meaning of a sentence. Handling it properly leads to more accurate sentiment classification.

7.	Word Embeddings (e.g., Word2Vec, GloVe):
Description: Word embeddings represent words as vectors in a continuous vector space, preserving semantic content and relationships between words. Similar words lie close together in that space.
Why Useful: Embeddings help the model understand word meaning in context. For example, even if "great" and "excellent" rarely appear together in the text, they are treated similarly because their vector representations are close.

Conclusion:
When combined, these features can successfully capture many written elements that affect sentiment. Sentiment lexicons and part-of-speech tags concentrate on emotion and opinion words, whereas Bag of Words and TF-IDF deal with word frequency. Word embeddings aid in the model's comprehension of word meaning, while n-grams and negation handling supply context.

'''

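The TF-IDF weighting described in the answer above can be sketched by hand. This is a minimal illustration, not the notebook's method: it assumes plain whitespace tokenization, the smoothed IDF form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization (the default documented for scikit-learn's `TfidfVectorizer`), and a hypothetical three-document corpus `toy_docs`.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term frequency times smoothed inverse document frequency
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

# Hypothetical toy corpus, not the notebook's review data
toy_docs = ["the product is great", "the product is terrible", "the delivery is slow"]
tfidf_weights = tfidf(toy_docs)
for doc, w in zip(toy_docs, tfidf_weights):
    print(doc, {t: round(x, 3) for t, x in w.items()})
```

In the first toy document, "great" appears in only one document, so it receives a higher weight than "product" (two documents), which in turn outweighs "the" (all three).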

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [3]:
# Your code here (Please add comments in the code):
!pip install nltk scikit-learn gensim



In [8]:
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk import pos_tag
from sklearn.preprocessing import LabelEncoder
from nltk.util import ngrams
from gensim.models import Word2Vec
from collections import defaultdict

nltk.download('punkt')
nltk.download('opinion_lexicon')
nltk.download('averaged_perceptron_tagger')

# Sample text data
reviews = [
    "The product is fantastic, I absolutely love it!",
    "I hate the quality of this item. It's terrible!",
    "It's okay, not great but not bad either.",
    "This is the worst purchase I've ever made.",
    "I am so happy with this product!"
]

# 1. Bag of Words (BoW) Feature Extraction
# Recent scikit-learn versions expect stop_words as a list, so the frozenset is converted
vectorizer_bow = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
bow_features = vectorizer_bow.fit_transform(reviews)
print("Bag of Words Features:\n", pd.DataFrame(bow_features.toarray(), columns=vectorizer_bow.get_feature_names_out()))

# 2. TF-IDF Feature Extraction
# Same stop-word list as above, converted to a list
tfidf_vectorizer = TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
tfidf_features = tfidf_vectorizer.fit_transform(reviews)
print("\nTF-IDF Features:\n", pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out()))

# 3. Sentiment Lexicon (Polarity Features)
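# Load the lexicon once into sets so membership tests do not re-read the corpus per token
positive_lexicon = set(opinion_lexicon.positive())
negative_lexicon = set(opinion_lexicon.negative())

def sentiment_lexicon_features(text):
    tokens = word_tokenize(text.lower())
    pos_count = sum(1 for word in tokens if word in positive_lexicon)
    neg_count = sum(1 for word in tokens if word in negative_lexicon)
    return pos_count, neg_count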

lexicon_features = [sentiment_lexicon_features(review) for review in reviews]
lexicon_df = pd.DataFrame(lexicon_features, columns=["Positive_Words", "Negative_Words"])
print("\nSentiment Lexicon Features:\n", lexicon_df)

# 4. Part of Speech (POS) Tagging
def pos_tag_features(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    pos_count = defaultdict(int)
    for _, tag in pos_tags:
        pos_count[tag] += 1
    return pos_count

pos_features = [pos_tag_features(review) for review in reviews]
pos_df = pd.DataFrame(pos_features).fillna(0).astype(int)
print("\nPart of Speech (POS) Features:\n", pos_df)

# 5. N-grams (Bigrams, Trigrams)
def ngram_features(text, n):
    tokens = word_tokenize(text.lower())
    return list(ngrams(tokens, n))

# Bigram counts with the stop-word list passed as a list.
# Note: the English stop list includes "not", so negated phrases such as "not great" are dropped here.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words=list(ENGLISH_STOP_WORDS))
bigram_features = bigram_vectorizer.fit_transform(reviews)
print("\nBigrams:\n", pd.DataFrame(bigram_features.toarray(), columns=bigram_vectorizer.get_feature_names_out()))

# 6. Negation Handling
negation_words = {"not", "no", "never", "none", "nobody"}

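def negation_features(text):
    """Prefix the token that follows a negation word with NEG_."""
    tokens = word_tokenize(text.lower())
    negate_next = False
    neg_tokens = []
    for word in tokens:
        if word in negation_words:
            # Keep the negation word itself and mark the next token
            neg_tokens.append(word)
            negate_next = True
        elif negate_next:
            neg_tokens.append("NEG_" + word)
            negate_next = False
        else:
            neg_tokens.append(word)
    return " ".join(neg_tokens)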

negated_reviews = [negation_features(review) for review in reviews]
print("\nNegation Handled Text:\n", negated_reviews)

# 7. Word Embeddings (Word2Vec)
# Tokenizing each review into a list of words
tokenized_reviews = [word_tokenize(review.lower()) for review in reviews]

# Training a Word2Vec model
w2v_model = Word2Vec(sentences=tokenized_reviews, vector_size=50, window=5, min_count=1, workers=4)

# Function to extract word embedding features (average vector of words in a review)
def word2vec_features(review):
    tokens = word_tokenize(review.lower())
    word_vectors = [w2v_model.wv[word] for word in tokens if word in w2v_model.wv]
    if word_vectors:
        return np.mean(word_vectors, axis=0)
    else:
        return np.zeros(w2v_model.vector_size)  # zero vector if no token is in the vocabulary

word2vec_features_list = [word2vec_features(review) for review in reviews]
print("\nWord2Vec Embedding Features:\n", pd.DataFrame(word2vec_features_list))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package opinion_lexicon to /root/nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Bag of Words Features:
    absolutely  bad  fantastic  great  happy  hate  item  love  okay  product  \
0           1    0          1      0      0     0     0     1     0        1   
1           0    0          0      0      0     1     1     0     0        0   
2           0    1          0      1      0     0     0     0     1        0   
3           0    0          0      0      0     0     0     0     0        0   
4           0    0          0      0      1     0     0     0     0        1   

   purchase  quality  terrible  ve  worst  
0         0        0         0   0      0  
1         0        1         1   0      0  
2         0        0         0   0      0  
3         1        0         0   1      1  
4         0        0         0   0      0  

TF-IDF Features:
    absolutely      bad  fantastic    great     happy  hate  item      love  \
0    0.523358  0.00000   0.523358  0.00000  0.000000   0.0   0.0  0.523358   
1    0.000000  0.00000   0.000000  0.00000  0.000000   0
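The bigram features ranked in Question 3 include pairs such as "great bad" and "okay great", which shows that "not" was removed by the stop-word list before the bigrams were formed. The sketch below shows one way to keep negated phrases. It is an illustration under stated assumptions: `NEGATION_SAFE_STOP_WORDS` is a small hypothetical stop list that leaves negation words in place, and tokenization is a simple regular expression rather than `word_tokenize`.

```python
import re

# Hypothetical, deliberately small stop list that does NOT contain negation words
NEGATION_SAFE_STOP_WORDS = {"the", "is", "it", "this", "a", "an", "but", "either", "s"}

def bigrams_keeping_negation(text):
    """Return word bigrams after removing stop words but keeping 'not', 'never', etc."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in NEGATION_SAFE_STOP_WORDS]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

neutral_bigrams = bigrams_keeping_negation("It's okay, not great but not bad either.")
print(neutral_bigrams)
# ['okay not', 'not great', 'great not', 'not bad']
```

With scikit-learn, a comparable effect could likely be obtained by passing a custom list to `stop_words` that omits the negation words, at the cost of a slightly larger vocabulary.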

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [9]:
# Your code here (Please add comments in the code):
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Sample labels corresponding to the reviews (positive, negative, neutral)
labels = ['positive', 'negative', 'neutral', 'negative', 'positive']

# Encode labels to numerical values for feature selection
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# 1. Combining all features for Chi-Square test
# Bag of Words features (BoW)
bow_features_array = bow_features.toarray()

# TF-IDF features
tfidf_features_array = tfidf_features.toarray()

# Sentiment lexicon features (positive, negative word counts)
lexicon_features_array = np.array(lexicon_features)

# Part of Speech (POS) features
pos_features_array = pos_df.values

# Bigram features
bigram_features_array = bigram_features.toarray()

# Concatenate all feature arrays into one large feature matrix
combined_features = np.hstack([
    bow_features_array,
    tfidf_features_array,
    lexicon_features_array,
    pos_features_array,
    bigram_features_array
])

# 2. Apply Chi-Square feature selection
chi2_selector = SelectKBest(chi2, k='all')  # We want to rank all features
chi2_selector.fit(combined_features, y)

# Get Chi-Square scores and p-values
chi2_scores = chi2_selector.scores_
p_values = chi2_selector.pvalues_

# 3. Rank features based on Chi-Square scores
feature_names = np.concatenate([
    vectorizer_bow.get_feature_names_out(),
    tfidf_vectorizer.get_feature_names_out(),
    ["positive_word_count", "negative_word_count"],
    pos_df.columns,
    bigram_vectorizer.get_feature_names_out()
])

# Create a DataFrame for easy ranking and display
feature_rankings = pd.DataFrame({
    'Feature': feature_names,
    'Chi2 Score': chi2_scores,
    'P-value': p_values
})

# Sort features by Chi-Square scores in descending order
ranked_features = feature_rankings.sort_values(by='Chi2 Score', ascending=False)

# Display the top-ranked features
print("Top-ranked features based on Chi-Square test:\n", ranked_features.head(10))


Top-ranked features based on Chi-Square test:
                 Feature  Chi2 Score   P-value
1                   bad    4.000000  0.135335
51           okay great    4.000000  0.135335
47            great bad    4.000000  0.135335
8                  okay    4.000000  0.135335
42                   CC    4.000000  0.135335
3                 great    4.000000  0.135335
35                   JJ    3.583333  0.166682
38                   RB    3.583333  0.166682
9               product    3.000000  0.223130
31  negative_word_count    2.875000  0.237521
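To make the scores in the table above easier to interpret, the chi-square statistic for a single count feature can be computed by hand. This is a sketch of the standard observed-versus-expected formulation, with the per-class expected count taken as the class proportion times the feature total. The vectors `bad_counts` and `product_counts` are copied from the Bag of Words output in Question 2; the function name `chi2_score` is hypothetical.

```python
def chi2_score(feature_counts, labels):
    """Chi-square statistic of one non-negative count feature against class labels."""
    n_docs = len(labels)
    total = sum(feature_counts)  # must be > 0, otherwise every expected count is zero
    score = 0.0
    for cls in sorted(set(labels)):
        # Observed: how often the feature occurs in documents of this class
        observed = sum(c for c, y in zip(feature_counts, labels) if y == cls)
        # Expected: the feature total spread across classes by class proportion
        expected = (labels.count(cls) / n_docs) * total
        score += (observed - expected) ** 2 / expected
    return score

review_labels = ["positive", "negative", "neutral", "negative", "positive"]
bad_counts = [0, 0, 1, 0, 0]      # "bad" occurs only in the neutral review
product_counts = [1, 0, 0, 0, 1]  # "product" occurs in both positive reviews

print(round(chi2_score(bad_counts, review_labels), 4))      # 4.0
print(round(chi2_score(product_counts, review_labels), 4))  # 3.0
```

Both values match the table: a term that appears only in the single neutral review scores highest because the neutral class is the rarest, so its observed count deviates most from the expected one. With five documents, the p-values stay well above common significance thresholds, so the ranking is indicative rather than conclusive.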


## Question 4 (10 points):
Write Python code to rank the texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [12]:
# Your code here (Please add comments in the code):
!pip install transformers torch scikit-learn


In [13]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load pre-trained BERT model and tokenizer from Huggingface
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample text data (same as before)
reviews = [
    "The product is fantastic, I absolutely love it!",
    "I hate the quality of this item. It's terrible!",
    "It's okay, not great but not bad either.",
    "This is the worst purchase I've ever made.",
    "I am so happy with this product!"
]

# Query: the sentence we want to match with the most relevant documents
query = "I like the quality of this product"

# Function to get the BERT embeddings for a given text
def get_bert_embedding(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=128)
    with torch.no_grad():  # We don't need gradients, just the embeddings
        outputs = model(**inputs)
    # The last hidden state is the embedding for each token; we take the mean of all token embeddings
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.numpy()

# Get BERT embeddings for the query and all documents
query_embedding = get_bert_embedding(query, tokenizer, model)

# List to store the document embeddings
document_embeddings = [get_bert_embedding(review, tokenizer, model) for review in reviews]

# Calculate cosine similarity between query and each document
similarities = [cosine_similarity(query_embedding, doc_embedding)[0][0] for doc_embedding in document_embeddings]

# Rank the documents by similarity (in descending order)
ranked_indices = np.argsort(similarities)[::-1]

# Display the ranked documents along with their similarity scores
print("Query:", query)
print("\nRanked documents based on similarity:")

for rank, idx in enumerate(ranked_indices):
    print(f"Rank {rank + 1}: Document: \"{reviews[idx]}\" | Similarity: {similarities[idx]:.4f}")



Query: I like the quality of this product

Ranked documents based on similarity:
Rank 1: Document: "I am so happy with this product!" | Similarity: 0.7787
Rank 2: Document: "The product is fantastic, I absolutely love it!" | Similarity: 0.7604
Rank 3: Document: "I hate the quality of this item. It's terrible!" | Similarity: 0.7492
Rank 4: Document: "This is the worst purchase I've ever made." | Similarity: 0.6618
Rank 5: Document: "It's okay, not great but not bad either." | Similarity: 0.5848
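The ranking step above reduces to computing a cosine similarity per document and sorting. The sketch below shows that step in isolation, using hypothetical three-dimensional toy vectors in place of the 768-dimensional BERT embeddings; the names `toy_query` and `toy_doc_vectors` are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings standing in for mean-pooled BERT vectors
toy_query = [1.0, 0.0, 1.0]
toy_doc_vectors = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],  # partial overlap
    "doc_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

toy_scores = {name: cosine(toy_query, vec) for name, vec in toy_doc_vectors.items()}
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
for rank, name in enumerate(toy_ranking, start=1):
    print(f"Rank {rank}: {name} | Similarity: {toy_scores[name]:.4f}")
```

Cosine similarity depends only on direction, not magnitude. In the BERT output above, the negative review about quality scores 0.7492, close to the two positive reviews, which suggests that mean-pooled embeddings reflect topical overlap with the query at least as much as sentiment polarity.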


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [2]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''

This task gave me a valuable opportunity to learn about text feature extraction, feature ranking, and similarity measurement, especially with BERT. Working with Bag of Words (BoW), TF-IDF, sentiment lexicon features, and Word2Vec helped me understand how text can be represented for machine learning. During feature extraction, it became clear that both shallow features (such as word frequency) and deep features (such as word embeddings) are needed to capture the semantic meaning of a text.
The two main challenges were managing and normalizing different feature types (dense and sparse) and dealing with computational limits, particularly with large models like BERT. Using BERT for text similarity showed that contextual language models capture rich contextual information, which is useful for matching queries to documents by meaning rather than by shared words.
This exercise is highly relevant to information systems and natural language processing (NLP), since text mining, sentiment analysis, and document similarity are widely used in business intelligence applications, search systems, and customer feedback analysis. Working with BERT illustrates the growing role of deep learning in building recommendation and information retrieval systems.
Finally, Chi-Square and other feature selection methods help make models more interpretable and efficient, especially with large volumes of text in enterprise applications. This exercise improved my understanding of NLP techniques and their role in practical information systems solutions.

'''