
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
An interesting text classification task is to classify customer reviews of a product as positive or negative. This task helps businesses understand how their customers feel about their products and services, so they can improve their offerings.

We can build a machine learning model for this task using the following features:

Bag of Words (BoW): This feature records how often each word occurs in the text, regardless of word order. It is useful because it captures the frequency of words that indicate positive or negative sentiment.

N-grams: These are contiguous sequences of n words. They capture local context that single words miss; for example, the bigram "not good" carries the opposite sentiment of "good".

POS tagging: This feature assigns each word a part of speech, such as noun, verb, or adjective. POS tags help identify sentiment-bearing words, which are often adjectives and adverbs.

Sentiment lexicons: These are predefined lists of positive and negative words or phrases. Lexicon-based features, such as counts of positive and negative words, give the model a direct sentiment signal.

Punctuation and capitalization: These convey the intensity of the emotion in the text. For example, a review written in all caps or ending in several exclamation marks usually signals stronger feeling.

In summary, combining BoW, n-grams, POS tags, sentiment lexicons, and punctuation and capitalization features should give a more accurate and robust model than any single feature type alone.
'''

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [None]:
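import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the NLTK resources used below (skipped if already present).
# Newer NLTK releases may also need "punkt_tab" and "averaged_perceptron_tagger_eng".
for resource in ("punkt", "stopwords", "averaged_perceptron_tagger", "vader_lexicon"):
    nltk.download(resource, quiet=True)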

# Sample text data
text_data = ["I really enjoyed the movie, it was great!",
             "The food was terrible, I would not recommend it to anyone.",
             "The customer service was excellent, very friendly staff.",
             "The product was not what I expected, I was disappointed."]

# Build the stop-word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words("english"))

def clean_tokens(text):
    # Lowercase BEFORE the stop-word check so that "I" and "The" are filtered too.
    # Note: the NLTK stop-word list contains "not", so negation is dropped here.
    tokens = (word.lower() for word in word_tokenize(text))
    return [word for word in tokens if word.isalpha() and word not in STOP_WORDS]

# Define functions for feature extraction
def bag_of_words(text):
    return dict(nltk.FreqDist(clean_tokens(text)))

def ngrams(text, n=2):
    return list(nltk.ngrams(clean_tokens(text), n))

def pos_tagging(text):
    words = word_tokenize(text)
    return nltk.pos_tag(words)

def sentiment(text):
    sia = SentimentIntensityAnalyzer()
    return sia.polarity_scores(text)

def punctuation(text):
    num_exclamations = text.count('!')
    num_question_marks = text.count('?')
    num_punctuations = num_exclamations + num_question_marks
    return {'exclamations': num_exclamations, 'question_marks': num_question_marks, 'total_punctuations': num_punctuations}

# Example usage
for text in text_data:
    print("Text:", text)
    print("Bag of Words:", bag_of_words(text))
    print("Bigrams:", ngrams(text, n=2))
    print("POS Tagging:", pos_tagging(text))
    print("Sentiment:", sentiment(text))
    print("Punctuation:", punctuation(text))
    print("")


Text: I really enjoyed the movie, it was great!
Bag of Words: {'really': 1, 'enjoyed': 1, 'movie': 1, 'great': 1}
Bigrams: [('really', 'enjoyed'), ('enjoyed', 'movie'), ('movie', 'great')]
POS Tagging: [('I', 'PRP'), ('really', 'RB'), ('enjoyed', 'VBD'), ('the', 'DT'), ('movie', 'NN'), (',', ','), ('it', 'PRP'), ('was', 'VBD'), ('great', 'JJ'), ('!', '.')]
Sentiment: {'neg': 0.0, 'neu': 0.385, 'pos': 0.615, 'compound': 0.8395}
Punctuation: {'exclamations': 1, 'question_marks': 0, 'total_punctuations': 1}

Text: The food was terrible, I would not recommend it to anyone.
Bag of Words: {'food': 1, 'terrible': 1, 'would': 1, 'recommend': 1, 'anyone': 1}
Bigrams: [('food', 'terrible'), ('terrible', 'would'), ('would', 'recommend'), ('recommend', 'anyone')]
POS Tagging: [('The', 'DT'), ('food', 'NN'), ('was', 'VBD'), ('terrible', 'JJ'), (',', ','), ('I', 'PRP'), ('would', 'MD'), ('not', 'RB'), ('recommend', 'VB'), ('it',
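
Question 1 lists capitalization as a feature, but the functions above only count `!` and `?`. The following is a minimal sketch of how capitalization could be measured with the standard library alone. The function name `capitalization_features` and the two feature definitions (all-caps word count, share of uppercase letters) are illustrative assumptions, not part of the original solution.

```python
def capitalization_features(text):
    # Words that contain at least one letter (whitespace tokenization is a simplification)
    words = [w for w in text.split() if any(c.isalpha() for c in w)]
    # Count all-caps words with more than one letter, so the pronoun "I" is ignored
    all_caps = [w for w in words if w.isupper() and sum(c.isalpha() for c in w) > 1]
    # Share of letters that are uppercase, as a rough intensity signal
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {"all_caps_words": len(all_caps), "uppercase_ratio": round(upper_ratio, 3)}

example = capitalization_features("This is GREAT, I LOVE it!")
print(example)
```

These two numbers could be merged into the same feature dictionary that `punctuation` returns, so that both intensity signals are available to the model.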

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [None]:
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize  # NLTK tokenizer, used to pre-split the text before vectorizing

# Sample text data
text_data = ["This is a positive review.",
             "This is a negative review.",
             "I enjoyed this product.",
             "I didn't like this product."]

# Convert the text data into feature vectors using bag of words.
# Each text is first tokenized with NLTK and re-joined with spaces, then vectorized.
corpus = [' '.join(word_tokenize(text)) for text in text_data]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()
y = np.array([1, 0, 1, 0])  # Positive reviews are labeled as 1, negative reviews are labeled as 0

# Compute the MI scores for each feature
mi_scores = mutual_info_classif(X, y)

# Create a dictionary of feature name and MI score pairs
feature_scores = dict(zip(vectorizer.get_feature_names_out(), mi_scores))

# Rank the features based on their importance in the descending order
ranked_features = sorted(feature_scores, key=feature_scores.get, reverse=True)

# Print the top 5 features with the highest MI scores
print("Top 5 features with the highest MI scores:")
for feature in ranked_features[:5]:
    print(feature, feature_scores[feature])


Top 5 features with the highest MI scores:
negative 0.45833333333333315
did 0.0
enjoyed 0.0
is 0.0
like 0.0
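
The scores above come from `mutual_info_classif` with its default `discrete_features='auto'`, which treats a dense array as continuous and falls back to a nearest-neighbor estimate. With only four documents that estimate is unstable, which may be why a single feature receives a nonzero score. As a cross-check, here is a minimal sketch that computes mutual information directly from counts for binary word-presence features. The simplified tokenizer `tokens`, the helper `mutual_information`, and the presence/absence encoding are assumptions made for illustration; they are not the method used in the cell above.

```python
import re
from collections import Counter
from math import log

text_data = ["This is a positive review.",
             "This is a negative review.",
             "I enjoyed this product.",
             "I didn't like this product."]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (same labels as above)

def tokens(text):
    # Simplified tokenizer (assumption): the set of lowercase alphabetic runs
    return set(re.findall(r"[a-z]+", text.lower()))

def mutual_information(xs, ys):
    # Discrete MI in nats: sum over observed (x, y) of p(x,y) * log(p(x,y) / (p(x) * p(y)))
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())

doc_tokens = [tokens(t) for t in text_data]
vocabulary = sorted(set().union(*doc_tokens))

# Score each word by the MI between its presence (0/1) and the class label
scores = {word: mutual_information([int(word in doc) for doc in doc_tokens], labels)
          for word in vocabulary}

# Rank features by MI in descending order
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word}: {score:.4f}")
```

Under this encoding, every word that occurs in exactly one review gets the same score (about 0.216 nats), while words spread evenly over both classes, such as `this` and `is`, score 0. On such a small sample the ranking mostly reflects which words happen to be unique to one review.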


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarity in descending order.

In [None]:
pip install transformers




In [None]:
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Example text data
text_data = ["This product is amazing and works really well.",
             "I was disappointed with this product, it didn't work as expected.",
             "I'm very happy with my purchase, the product exceeded my expectations.",
             "This product is terrible, it doesn't work at all."]

# BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def embed(text):
    """Return the mean-pooled last-layer BERT embedding of a single text."""
    # Tokenize and encode the text (with [CLS] and [SEP]) as a batch of one
    input_ids = torch.tensor(tokenizer.encode(text, add_special_tokens=True)).unsqueeze(0)
    # Run the model without tracking gradients and average over the token dimension
    with torch.no_grad():
        last_hidden_states = model(input_ids)[0]
    return torch.mean(last_hidden_states, dim=1).squeeze().numpy()

# Query
query = "I'm very happy with my purchase, the product works perfectly."

# Embed the query and each text with the same function
query_vector = embed(query)
text_vectors = [embed(text) for text in text_data]

# Compute the cosine similarity between the query vector and each text vector
similarity_scores = cosine_similarity([query_vector], text_vectors)

# Rank the text data by similarity score, highest first
ranked_text_data = [text_data[i] for i in similarity_scores.argsort()[0][::-1]]

# Print the ranked text data
print("Ranked text data based on similarity to the query:")
for text in ranked_text_data:
    print(text)


Ranked text data based on similarity to the query:
I'm very happy with my purchase, the product exceeded my expectations.
I was disappointed with this product, it didn't work as expected.
This product is amazing and works really well.
This product is terrible, it doesn't work at all.
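
The ranking above lists the texts without their scores. To make the ranking step concrete, the following sketch computes cosine similarity, the dot product of two vectors divided by the product of their lengths, in plain Python and sorts the results in descending order. The three-dimensional vectors and the names `toy_docs`, `toy_query`, and `rank_by_similarity` are made up for illustration; real `bert-base-uncased` embeddings have 768 dimensions.

```python
from math import sqrt

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    # Score every document against the query, then sort with the highest score first
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for sentence embeddings (hypothetical values)
toy_docs = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}
toy_query = [1.0, 0.0, 0.5]

ranking = rank_by_similarity(toy_query, toy_docs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Printing the score next to each text makes it easier to see how close the candidates are; mean-pooled BERT vectors of short sentences on the same topic often have very similar cosine scores, so small differences in the ranking should be read with caution.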


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

Learning Experience:
Overall, the experience of working on extracting features from text data was enlightening. I found the process to be highly beneficial in understanding how to transform raw text into a format suitable for machine learning models. One key concept that stood out to me was the importance of preprocessing techniques such as tokenization, stemming, and stop-word removal. These techniques helped in cleaning and standardizing the text data, making it easier to extract meaningful features.

Challenges Encountered:
While working on the exercises, I encountered a few challenges. One of the main difficulties was selecting the most appropriate features for the task at hand. With the abundance of available techniques for feature extraction, it was sometimes challenging to determine which features would be most informative for the model. Additionally, fine-tuning parameters for certain feature extraction methods, such as TF-IDF vectorization, required experimentation and careful consideration.

Relevance to Your Field of Study:
This exercise is highly relevant to the field of NLP. Feature extraction is a fundamental step in natural language processing tasks such as text classification, sentiment analysis, and information retrieval. By understanding how to extract relevant features from text data, NLP practitioners can build more effective models for a wide range of applications. Additionally, the techniques learned in this exercise can be applied to real-world datasets, allowing for the development of practical solutions to natural language processing problems. Overall, this exercise provided valuable insights into the feature extraction process and its importance in NLP.





'''
