
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''


Sentiment analysis is an interesting task where we aim to categorize text as "positive," "negative," or "neutral."
It’s widely used in things like reviews, social media comments, or feedback to gauge how people feel.

Here are five types of features that can really help in building a sentiment analysis model:

Word Frequencies: The number of times specific words appear in the text can strongly hint at the sentiment.
For instance, words like "great," "awesome," or "fantastic" often signal a positive sentiment, while "awful," "poor,"
or "terrible" lean towards negative.

TF-IDF (Term Frequency-Inverse Document Frequency): This technique helps highlight important words in a document
that might be rare across the whole dataset. It gives more weight to unique words in a specific
text while reducing the impact of common words like "the."

N-grams: These are groups of words that appear together, like "very nice." Instead of looking at single words,
we focus on pairs or triples of words (bigrams or trigrams), which give more context and can capture sentiment
more effectively.

Sentiment Scores: Pre-built tools (like VADER or AFINN) can assign a sentiment score to individual words.
Adding up these scores can provide an overall sentiment score for the whole text, making it easier to classify.

POS Tags (Part-of-Speech Tags): These tags tell us the grammatical role of each word (like whether it’s a noun, verb,
or adjective). Since adjectives and adverbs often carry strong emotions, identifying them can help with sentiment
analysis.

Each of these features brings something unique to the table. Word frequencies tell us how often certain words pop up,
while TF-IDF ensures we focus on important terms. N-grams capture context, sentiment scores offer numerical insight,
and POS tags help us understand the structure and emotional weight of the text. Together, these features help paint
a full picture of the sentiment behind the words.



'''
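
The TF-IDF weighting described in the answer above can be sketched by hand. This is a minimal illustration, assuming the textbook definition tf × ln(N / df) and a made-up three-review corpus; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not the assignment data)
docs = [
    "the food was great",
    "the service was awful",
    "the dessert was great great",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return {term: tf * idf} for one document (textbook, unsmoothed variant)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }

weights = [tfidf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, "->", {term: round(value, 3) for term, value in w.items()})
```

In this toy corpus "the" and "was" occur in every review, so their weight is zero, while "awful" and "service" occur in a single review and receive the highest weights in that review.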

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [4]:
!pip install nltk scikit-learn pandas
import pandas as pd
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk import pos_tag


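# Download the NLTK resources used below.
# Note: newer NLTK releases (3.9 and later) may look for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead; download those if a LookupError appears.
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('averaged_perceptron_tagger')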


sample_texts = [
    "The pasta was delicious and full of flavor.",
    "I had a bad experience with the steak; it was overcooked.",
    "The dessert was amazing, and I would recommend it to anyone!",
    "This place has average service, but the food is great.",
    "I really loved the ambiance of the restaurant."
]

count_vectorizer = CountVectorizer()
word_count_matrix = count_vectorizer.fit_transform(sample_texts)  # Count words
word_count_df = pd.DataFrame(word_count_matrix.toarray(), columns=count_vectorizer.get_feature_names_out())


# Calculate TF-IDF scores
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(sample_texts)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Create bigrams (pairs of adjacent tokens) and count them per document
bigrams = [list(ngrams(word_tokenize(text.lower()), 2)) for text in sample_texts]
bigram_counts = [pd.Series(bigram).value_counts() for bigram in bigrams]
bigram_df = pd.DataFrame(bigram_counts).fillna(0)

# Get VADER sentiment scores (neg, neu, pos, compound) for each text
sentiment_analyzer = SentimentIntensityAnalyzer()
sentiment_results = [sentiment_analyzer.polarity_scores(text) for text in sample_texts]
sentiment_df = pd.DataFrame(sentiment_results)

# Get part-of-speech tags and count how often each tag occurs per document
# (a word -> tag dict would silently drop repeated words such as "the")
pos_tags_list = [pos_tag(word_tokenize(text)) for text in sample_texts]
pos_df = pd.DataFrame(
    [pd.Series([tag for _, tag in tags]).value_counts().to_dict() for tags in pos_tags_list]
).fillna(0).astype(int)


print("Word Frequencies:")
print(word_count_df, "\n")

print("TF-IDF Scores:")
print(tfidf_df, "\n")

print("Bigrams:")
print(bigram_df, "\n")

print("Sentiment Scores:")
print(sentiment_df, "\n")

print("POS Tags:")
print(pos_df, "\n")





Word Frequencies:
   amazing  ambiance  and  anyone  average  bad  but  delicious  dessert  \
0        0         0    1       0        0    0    0          1        0   
1        0         0    0       0        0    1    0          0        0   
2        1         0    1       1        0    0    0          0        1   
3        0         0    0       0        1    0    1          0        0   
4        0         1    0       0        0    0    0          0        0   

   experience  ...  recommend  restaurant  service  steak  the  this  to  was  \
0           0  ...          0           0        0      0    1     0   0    1   
1           1  ...          0           0        0      1    1     0   0    1   
2           0  ...          1           0        0      0    1     0   1    1   
3           0  ...          0           0        1      0    1     1   0    0   
4           0  ...          0           1        0      0    2     0   0    0   

   with  would  
0     0      0  
1   



## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, and rank the features by importance in descending order.

In [9]:
# Import necessary libraries
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
import numpy as np

# Download necessary NLTK data files
nltk.download('punkt')
nltk.download('vader_lexicon')

# Sample text data and their sentiment labels
texts = [
    "The food was absolutely wonderful! I loved it.",
    "This is the worst meal I've ever had.",
    "The service was average, nothing special.",
    "What a delicious dessert! I will definitely return.",
    "The ambiance was nice, but the food was bad."
]
labels = [1, 0, 0, 1, 0]  # 1 for positive, 0 for not positive (negative or neutral)

# Convert text data into numeric format using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(texts)

# Score each TF-IDF feature against the labels with the Chi-Squared statistic
chi2_values, p_values = chi2(X, labels)

# Create a DataFrame for features and their Chi-Squared scores
feature_names = tfidf_vectorizer.get_feature_names_out()
chi2_df = pd.DataFrame({'Feature': feature_names, 'Chi2 Score': chi2_values})

# Rank features by Chi-Squared score in descending order
chi2_df = chi2_df.sort_values(by='Chi2 Score', ascending=False)

# Display the ranked features
print("Ranked Features based on Chi-Squared Score:")
print(chi2_df.reset_index(drop=True))


Ranked Features based on Chi-Squared Score:
       Feature  Chi2 Score
0   absolutely    0.644494
1    wonderful    0.644494
2           it    0.644494
3        loved    0.644494
4         will    0.612372
5   definitely    0.612372
6    delicious    0.612372
7      dessert    0.612372
8         what    0.612372
9       return    0.612372
10     nothing    0.305377
11     average    0.305377
12     special    0.305377
13     service    0.305377
14          ve    0.246451
15        this    0.246451
16        meal    0.246451
17          is    0.246451
18         had    0.246451
19        ever    0.246451
20       worst    0.246451
21        nice    0.240023
22    ambiance    0.240023
23         but    0.240023
24         bad    0.240023
25         the    0.154982
26         was    0.079079
27        food    0.055113
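
To see why the ranking above looks the way it does, the chi-squared score for a single feature can be reproduced by hand. The sketch below follows my understanding of how scikit-learn's `chi2` treats non-negative features: it compares the feature mass observed in each class with the mass expected from the class proportions. The function name `chi2_score` and the toy feature columns are hypothetical.

```python
from collections import defaultdict

def chi2_score(feature_values, labels):
    """Chi-squared statistic of one non-negative feature against class labels."""
    n = len(labels)
    total = sum(feature_values)          # feature mass over all documents
    if total == 0:
        return 0.0                       # feature never occurs, nothing to score
    observed = defaultdict(float)        # feature mass per class
    class_count = defaultdict(int)       # number of documents per class
    for value, label in zip(feature_values, labels):
        observed[label] += value
        class_count[label] += 1
    score = 0.0
    for label, count in class_count.items():
        expected = total * count / n     # mass expected if feature and class were independent
        score += (observed[label] - expected) ** 2 / expected
    return score

labels = [1, 0, 0, 1, 0]                 # same label layout as in the cell above

# Hypothetical feature columns (one weight per document)
only_in_positive = [0.43, 0.0, 0.0, 0.0, 0.0]
only_in_negative = [0.0, 0.37, 0.0, 0.0, 0.0]
spread_evenly = [0.2, 0.2, 0.2, 0.2, 0.2]

for name, column in [
    ("only_in_positive", only_in_positive),
    ("only_in_negative", only_in_negative),
    ("spread_evenly", spread_evenly),
]:
    print(name, round(chi2_score(column, labels), 4))
```

With two of five documents labeled positive, a term that occurs only in one positive document scores 1.5 times its weight, while a term that occurs only in one negative document scores about 0.67 times its weight. This is consistent with words from the two positive reviews topping the table above, and with a term spread evenly across all documents scoring close to zero.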




## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [10]:
# Your code here (Please add comments in the code):

!pip install transformers torch scikit-learn

import torch
from transformers import BertTokenizer, BertModel
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Sample text data (documents)
texts = [
    "The food was absolutely wonderful! I loved it.",
    "This is the worst meal I've ever had.",
    "The service was average, nothing special.",
    "What a delicious dessert! I will definitely return.",
    "The ambiance was nice, but the food was bad."
]

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to encode texts using BERT
def encode_texts(texts):
    # Tokenize and convert texts to tensor format
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt', max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)  # Get BERT outputs
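    # Mean pooling over token embeddings. Note: with padding=True this average also
    # includes [PAD] positions; weighting by inputs['attention_mask'] would exclude them.
    return outputs.last_hidden_state.mean(dim=1)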

# Encode the documents
document_embeddings = encode_texts(texts)

# Define and encode a query
query = "I had a fantastic dining experience!"
query_embedding = encode_texts([query])  # Encode the query

# Calculate cosine similarity between query and documents
similarities = cosine_similarity(query_embedding, document_embeddings).flatten()

# Create a DataFrame for texts and similarity scores
results_df = pd.DataFrame({
    'Text': texts,
    'Similarity': similarities
})

# Rank documents by similarity in descending order
results_df = results_df.sort_values(by='Similarity', ascending=False).reset_index(drop=True)

# Display the ranked results
print("Ranked Documents based on Similarity to Query:")
print(results_df)







Ranked Documents based on Similarity to Query:
                                                Text  Similarity
0     The food was absolutely wonderful! I loved it.    0.846537
1  What a delicious dessert! I will definitely re...    0.738802
2              This is the worst meal I've ever had.    0.629565
3       The ambiance was nice, but the food was bad.    0.613991
4          The service was average, nothing special.    0.586754
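
The two steps behind this ranking, pooling token vectors into one vector per text and comparing vectors by cosine similarity, can be illustrated without BERT. This is a toy sketch with made-up 2-dimensional vectors. `masked_mean` is a hypothetical helper that skips padding positions, which is a variation on the plain `.mean(dim=1)` used in the cell above, not what that cell computes.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1 (real tokens)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "token embeddings"; the last row of doc_a stands in for a [PAD] position
doc_a_tokens = [[1.0, 0.0], [1.0, 1.0], [0.0, 5.0]]
doc_a_mask = [1, 1, 0]
doc_b_tokens = [[0.0, 1.0], [0.0, 3.0]]
doc_b_mask = [1, 1]
query_vec = [1.0, 0.2]

doc_vectors = {
    "doc_a": masked_mean(doc_a_tokens, doc_a_mask),
    "doc_b": masked_mean(doc_b_tokens, doc_b_mask),
}

# Rank documents by similarity to the query, highest first
ranking = sorted(doc_vectors, key=lambda name: cosine(query_vec, doc_vectors[name]), reverse=True)
for name in ranking:
    print(name, round(cosine(query_vec, doc_vectors[name]), 3))
```

Because the padding row of `doc_a` is masked out, its pooled vector stays close to the query direction. Averaging all three rows instead would pull the vector toward the padding row and lower its similarity.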


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:


Working on extracting features from text data was an enlightening journey. I learned a lot about key techniques
in Natural Language Processing (NLP):

TF-IDF Vectorization: Understanding how to quantify word importance based on frequency across documents was
eye-opening and foundational for text analysis.

Feature Selection: Using Chi-Squared to score features showed me how to filter out less significant
words, which can improve model performance.

BERT Embeddings: Learning to use BERT for encoding text highlighted the power of deep learning in understanding context,
which is vital for tasks like semantic similarity.


I faced a few challenges along the way:
BERT Outputs: Figuring out how to extract and average embeddings from BERT was a bit confusing at first, requiring some
extra reading and understanding of the model’s structure.

Cosine Similarity: Implementing cosine similarity to rank documents was tricky. I had to be careful with input formatting
and output interpretation.

Library Issues: Managing library installations and ensuring compatibility among transformers and torch was sometimes
frustrating.

Relevance to Your Field of Study:
This exercise is super relevant to NLP. Feature extraction and selection are crucial steps in preparing text data for
various tasks, like sentiment analysis and topic modeling. Knowing how to effectively represent text helps in creating
more accurate models.

With deep learning becoming a big part of NLP, getting familiar with embedding techniques like BERT is essential for
tackling more complex applications in the future. Overall, this exercise has been both rewarding and informative,
setting a solid foundation for my exploration of NLP.

'''