
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid creating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:

# Task: sentiment classification of reviews (positive, negative, or neutral).

# Text Features:-

# 1.Bag-of-Words (BoW):-
#     -> Represents the frequency of each word in the review.
#     -> Helpful because the words a review uses (e.g. "great", "worst") are direct cues to its sentiment.

# 2.Term Frequency-Inverse Document Frequency (TF-IDF):-
#     -> An extension of BoW that weights each word by how rare it is across the dataset.
#     -> Down-weights words that appear in most reviews (e.g. "the"), so distinctive words carry more weight.

# 3.Sentiment-bearing Words:
#     -> Extracts words with strong sentiment, such as "love", "hate", "amazing", etc.
#     -> Captures the emotional tone of the review.



# Structural Features:-

# 1.Length of the Review:-
#     -> Represents the number of words or characters in the review.
#     -> Longer reviews might be more informative or more negative.

# 2.Number of Sentences:-
#     -> Represents the number of sentences in the review.
#     -> Reviews with more sentences might be more detailed or more positive.




# Semantic Features

# 1.Named Entities:
#     -> Extracts named entities such as movie titles, actor names, or director names.
#     -> Captures the context of the review.

# 2.Part-of-Speech (POS) Tags:
#     -> Extracts POS tags for each word, such as noun, verb, adjective, etc.
#     -> Captures the grammatical structure of the review.
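
The two structural features listed above (review length and number of sentences) can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own method: the function name `structural_features` and the extra `n_exclamations` count are assumptions added here, and the sentence splitter is deliberately naive (it would miscount abbreviations such as "e.g.").

```python
import re

def structural_features(text):
    """Return simple length-based features for one review (hypothetical helper)."""
    # Words: runs of letters, keeping apostrophes so "It's" counts as one word
    words = re.findall(r"[A-Za-z']+", text)
    # Sentences: split on runs of '.', '!' or '?' and drop empty pieces
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_sentences": len(sentences),
        "n_exclamations": text.count("!"),
    }

reviews = [
    "I absolutely love this product! It's fantastic and works great.",
    "Terrible experience. Will not buy again!!!",
]

for review in reviews:
    print(structural_features(review))
```

Each dictionary can be appended as extra numeric columns next to the BoW or TF-IDF matrix.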

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [1]:


!pip install pandas scikit-learn nltk

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk import pos_tag, word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')

data = {
    'reviews': [
        "I absolutely love this product! It's fantastic and works great.",
        "This is the worst purchase I have ever made. I'm very disappointed.",
        "It's okay, not what I expected but it does the job.",
        "Amazing quality! Highly recommend to everyone.",
        "Terrible experience. Will not buy again!!!"
    ]
}

df = pd.DataFrame(data)

def extract_bow(df):
    # Raw word counts: one column per vocabulary word, one row per review
    vectorizer = CountVectorizer()
    bow_features = vectorizer.fit_transform(df['reviews']).toarray()
    return pd.DataFrame(bow_features, columns=vectorizer.get_feature_names_out())

def extract_tfidf(df):
    # Word counts re-weighted so that words common to many reviews count less
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_features = tfidf_vectorizer.fit_transform(df['reviews']).toarray()
    return pd.DataFrame(tfidf_features, columns=tfidf_vectorizer.get_feature_names_out())

def extract_pos_tags(df):
    # List of (token, POS tag) pairs for each review
    pos_tags = df['reviews'].apply(lambda x: pos_tag(word_tokenize(x)))
    return pos_tags

def extract_sentiment_scores(df):
    # VADER lexicon scores: neg, neu, pos and compound for each review
    sia = SentimentIntensityAnalyzer()
    sentiment_scores = df['reviews'].apply(lambda x: sia.polarity_scores(x))
    return pd.DataFrame(sentiment_scores.tolist())

def extract_ngrams(df):
    # Bigram counts (pairs of adjacent words)
    vectorizer = CountVectorizer(ngram_range=(2, 2))
    ngram_features = vectorizer.fit_transform(df['reviews']).toarray()
    return pd.DataFrame(ngram_features, columns=vectorizer.get_feature_names_out())

bow_features = extract_bow(df)
tfidf_features = extract_tfidf(df)
pos_tags = extract_pos_tags(df)
sentiment_scores = extract_sentiment_scores(df)
ngram_features = extract_ngrams(df)

print("Bag of Words Features:\n", bow_features)
print("\nTF-IDF Features:\n", tfidf_features)
print("\nPart-of-Speech Tags:\n", pos_tags)
print("\nSentiment Scores:\n", sentiment_scores)
print("\nN-grams Features:\n", ngram_features)




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


Bag of Words Features:
    absolutely  again  amazing  and  but  buy  disappointed  does  ever  \
0           1      0        0    1    0    0             0     0     0   
1           0      0        0    0    0    0             1     0     1   
2           0      0        0    0    1    0             0     1     0   
3           0      0        1    0    0    0             0     0     0   
4           0      1        0    0    0    1             0     0     0   

   everyone  ...  recommend  terrible  the  this  to  very  what  will  works  \
0         0  ...          0         0    0     1   0     0     0     0      1   
1         0  ...          0         0    1     1   0     1     0     0      0   
2         0  ...          0         0    1     0   0     0     1     0      0   
3         1  ...          1         0    0     0   1     0     0     0      0   
4         0  ...          0         1    0     0   0     0     0     1      0   

   worst  
0      0  
1      1  
2      0  
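
To make the difference between the BoW counts above and TF-IDF weights concrete, the following sketch computes TF-IDF by hand. It assumes the smoothed formula documented as scikit-learn's default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row; the `tokenize` and `tfidf` helpers are hypothetical names, and the tokeniser only imitates the default pattern (lowercased words of two or more characters).

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep word tokens of at least two characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf(docs):
    """docs: list of token lists. Returns (vocab, idf, rows) with L2-normalised rows."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # Document frequency: number of documents containing each term
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for d in docs:
        tf = Counter(d)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, idf, rows

reviews = [
    "I absolutely love this product! It's fantastic and works great.",
    "This is the worst purchase I have ever made. I'm very disappointed.",
    "It's okay, not what I expected but it does the job.",
    "Amazing quality! Highly recommend to everyone.",
    "Terrible experience. Will not buy again!!!",
]

docs = [tokenize(r) for r in reviews]
vocab, idf, rows = tfidf(docs)

# A word found in one review gets a larger idf than one found in two reviews
print("idf['love'] =", round(idf["love"], 4))
print("idf['this'] =", round(idf["this"], 4))
```

The word "love" occurs in one review and "this" in two, so "love" receives the larger weight even though both have a raw count of 1 in the first review.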


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [2]:


!pip install pandas scikit-learn nltk

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import mutual_info_classif, chi2
from nltk import pos_tag, word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
from sklearn.preprocessing import LabelEncoder

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')

data = {
    'reviews': [
        "I absolutely love this product! It's fantastic and works great.",
        "This is the worst purchase I have ever made. I'm very disappointed.",
        "It's okay, not what I expected but it does the job.",
        "Amazing quality! Highly recommend to everyone.",
        "Terrible experience. Will not buy again!!!"
    ],
    'label': ['positive', 'negative', 'neutral', 'positive', 'negative']
}

df = pd.DataFrame(data)

label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['label'])

def extract_bow(df):
    vectorizer = CountVectorizer()
    bow_features = vectorizer.fit_transform(df['reviews']).toarray()
    return pd.DataFrame(bow_features, columns=vectorizer.get_feature_names_out()), vectorizer

bow_features, bow_vectorizer = extract_bow(df)

def extract_tfidf(df):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_features = tfidf_vectorizer.fit_transform(df['reviews']).toarray()
    return pd.DataFrame(tfidf_features, columns=tfidf_vectorizer.get_feature_names_out()), tfidf_vectorizer

tfidf_features, tfidf_vectorizer = extract_tfidf(df)

def extract_pos_tags(df):
    pos_tags = df['reviews'].apply(lambda x: pos_tag(word_tokenize(x)))
    return pos_tags

pos_tags = extract_pos_tags(df)

def extract_sentiment_scores(df):
    sia = SentimentIntensityAnalyzer()
    sentiment_scores = df['reviews'].apply(lambda x: sia.polarity_scores(x))
    return pd.DataFrame(sentiment_scores.tolist())

sentiment_scores = extract_sentiment_scores(df)

def extract_ngrams(df):
    vectorizer = CountVectorizer(ngram_range=(2, 2))
    ngram_features = vectorizer.fit_transform(df['reviews']).toarray()
    return pd.DataFrame(ngram_features, columns=vectorizer.get_feature_names_out())

ngram_features = extract_ngrams(df)

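# Information gain (mutual information) between each BoW feature and the label.
# Note: with a dense array and the default discrete_features='auto', scikit-learn
# treats the counts as continuous and uses a randomized nearest-neighbour estimate,
# so scores can vary between runs unless random_state is set.
ig_scores = mutual_info_classif(bow_features, df['label'])
bow_ranked_features = sorted(zip(bow_vectorizer.get_feature_names_out(), ig_scores), key=lambda x: x[1], reverse=True)

# Chi-square statistic between each (non-negative) TF-IDF feature and the label.
chi2_scores, _ = chi2(tfidf_features, df['label'])
tfidf_ranked_features = sorted(zip(tfidf_vectorizer.get_feature_names_out(), chi2_scores), key=lambda x: x[1], reverse=True)

# Pearson correlation between each VADER score and the encoded label.
# LabelEncoder assigns codes alphabetically (negative=0, neutral=1, positive=2),
# which happens to be an ordinal scale here; features are ranked by |correlation|.
corr_scores = [np.corrcoef(sentiment_scores[col], df['label'])[0][1] for col in sentiment_scores.columns]
sentiment_ranked_features = sorted(zip(sentiment_scores.columns, corr_scores), key=lambda x: abs(x[1]), reverse=True)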

print("Bag of Words Ranked Features (IG):")
for feature, score in bow_ranked_features:
    print(f"{feature}: {score:.4f}")

print("\nTF-IDF Ranked Features (CHI):")
for feature, score in tfidf_ranked_features:
    print(f"{feature}: {score:.4f}")

print("\nSentiment Scores Ranked Features (Correlation):")
for feature, score in sentiment_ranked_features:
    print(f"{feature}: {score:.4f}")






Bag of Words Ranked Features (IG):
amazing: 0.4583
disappointed: 0.4583
the: 0.4583
does: 0.3333
okay: 0.3333
what: 0.3333
everyone: 0.2083
experience: 0.2083
is: 0.2083
it: 0.2083
purchase: 0.2083
quality: 0.2083
works: 0.2083
great: 0.0833
absolutely: 0.0000
again: 0.0000
and: 0.0000
but: 0.0000
buy: 0.0000
ever: 0.0000
expected: 0.0000
fantastic: 0.0000
have: 0.0000
highly: 0.0000
job: 0.0000
love: 0.0000
made: 0.0000
not: 0.0000
product: 0.0000
recommend: 0.0000
terrible: 0.0000
this: 0.0000
to: 0.0000
very: 0.0000
will: 0.0000
worst: 0.0000

TF-IDF Ranked Features (CHI):
but: 1.2709
does: 1.2709
expected: 1.2709
job: 1.2709
okay: 1.2709
what: 1.2709
it: 1.1125
again: 0.6310
buy: 0.6310
experience: 0.6310
terrible: 0.6310
will: 0.6310
amazing: 0.6124
everyone: 0.6124
highly: 0.6124
quality: 0.6124
recommend: 0.6124
to: 0.6124
absolutely: 0.5206
and: 0.5206
fantastic: 0.5206
great: 0.5206
love: 0.5206
product: 0.5206
works: 0.5206
disappointed: 0.4918
ever: 0.4918
have: 0.4918
is: 0
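
For intuition about the information gain ranking, here is a small sketch that computes IG exactly for binary word-presence features, as H(labels) minus H(labels | word present). It is an illustration under stated assumptions, not a reproduction of the cell above: `mutual_info_classif` used a nearest-neighbour estimate on raw counts, so the numbers will not match the printed scores. The helper names `entropy` and `information_gain` are hypothetical.

```python
import math
import re
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(present, labels):
    """IG of a binary feature: H(labels) - H(labels | feature)."""
    n = len(labels)
    conditional = 0.0
    for value in (True, False):
        subset = [y for x, y in zip(present, labels) if x == value]
        if subset:
            conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

reviews = [
    "I absolutely love this product! It's fantastic and works great.",
    "This is the worst purchase I have ever made. I'm very disappointed.",
    "It's okay, not what I expected but it does the job.",
    "Amazing quality! Highly recommend to everyone.",
    "Terrible experience. Will not buy again!!!",
]
labels = ["positive", "negative", "neutral", "positive", "negative"]

# Binary presence of each word (lowercased, at least two characters) per review
token_sets = [set(re.findall(r"\b\w\w+\b", r.lower())) for r in reviews]
vocab = sorted(set().union(*token_sets))

ranked = sorted(
    ((w, information_gain([w in s for s in token_sets], labels)) for w in vocab),
    key=lambda pair: pair[1],
    reverse=True,
)

for word, score in ranked[:5]:
    print(f"{word}: {score:.4f}")
```

With only five reviews, many words tie, which is a reminder that any ranking on such a small sample is illustrative rather than reliable.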


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [5]:


!pip install scikit-learn nltk

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

data = {
    'reviews': [
        "I absolutely love this product! It's fantastic and works great.",
        "This is the worst purchase I have ever made. I'm very disappointed.",
        "It's okay, not what I expected but it does the job.",
        "Amazing quality! Highly recommend to everyone.",
        "Terrible experience. Will not buy again!!!"
    ]
}

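query = "This product is amazing and works perfectly."

# NOTE: the question asks for BERT embeddings; this cell represents the query and
# the reviews with TF-IDF vectors instead, so the scores reflect word overlap only.
vectorizer = TfidfVectorizer()
documents = [query] + data['reviews']
tfidf_matrix = vectorizer.fit_transform(documents)

# Cosine similarity between the query (row 0) and every review (rows 1..n)
similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:]).flatten()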

ranked_indices = np.argsort(similarities)[::-1]

print("Ranking of documents based on similarity to query:")
for i in ranked_indices:
    print(f"Similarity: {similarities[i]:.4f} - Text: {data['reviews'][i]}")


Ranking of documents based on similarity to query:
Similarity: 0.4238 - Text: I absolutely love this product! It's fantastic and works great.
Similarity: 0.1762 - Text: This is the worst purchase I have ever made. I'm very disappointed.
Similarity: 0.1283 - Text: Amazing quality! Highly recommend to everyone.
Similarity: 0.0000 - Text: Terrible experience. Will not buy again!!!
Similarity: 0.0000 - Text: It's okay, not what I expected but it does the job.
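
The cell above ranks with TF-IDF vectors, whereas the question asks for BERT representations. The sketch below shows one way the same ranking could be done with BERT, under several assumptions: the `torch` and `transformers` packages are installed, the `bert-base-uncased` checkpoint is used, and sentence vectors are obtained by mean-pooling the last hidden states over non-padding tokens (one common choice, not the only one). The helper names `cosine`, `rank_by_cosine` and `embed_with_bert` are hypothetical. Only the toy vectors at the bottom are run here; they are placeholders, not real BERT outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_cosine(query_vec, doc_vecs):
    """Return (document index, similarity) pairs, most similar first."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]

def embed_with_bert(texts, model_name="bert-base-uncased"):
    """Mean-pooled BERT sentence vectors (needs torch and transformers)."""
    # Imported lazily so the ranking helpers work without these packages
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Average token vectors, ignoring padding positions via the attention mask
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return (summed / mask.sum(dim=1)).tolist()

# Toy 2-d vectors standing in for embeddings (illustrative placeholders only)
toy_query = [1.0, 0.0]
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for index, score in rank_by_cosine(toy_query, toy_docs):
    print(f"doc {index}: {score:.4f}")
```

To apply it to the notebook's data, one would call `vecs = embed_with_bert([query] + data['reviews'])` and then `rank_by_cosine(vecs[0], vecs[1:])`; unlike TF-IDF, this can score reviews that share no words with the query.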


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:

# Learning Experience:-

# 1.Feature Extraction Techniques:-
#     -> Gained a deeper understanding of various methods such as Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and sentiment analysis using lexicons. Each method offers unique advantages and drawbacks, and recognizing these helps in choosing the appropriate technique for effective text classification.

# 2.Dimensionality Reduction and Feature Selection:-
#     -> Learned about feature selection techniques like Information Gain (IG) and Chi-Square tests, which are crucial for enhancing model performance by identifying relevant features. This knowledge is essential for creating models that generalize well to new data.

# 3.Cosine Similarity for Text Ranking:-
#     -> Acquired skills in calculating cosine similarity between text vectors, a fundamental technique for measuring similarity in high-dimensional spaces. This is valuable for document retrieval and recommendation systems.

# 4.Practical Application of Libraries:-
#     -> Gained hands-on experience with scikit-learn and NLTK, providing practical exposure to industry-standard tools for working with text data.




# Challenges Encountered

# 1.Understanding BERT and Deep Learning Models:-
#     -> Initially found it challenging to grasp how BERT operates, particularly its embeddings and the significance of the pooled output (CLS token). Transitioning from traditional methods like TF-IDF to deep learning models required a shift in perspective on text representation.

# 2.Library Dependencies:-
#     -> Encountered difficulties setting up the environment and ensuring that all necessary libraries were installed correctly. Compatibility issues with specific versions of dependencies, especially with the transformers library, were a concern.

# 3.Performance Considerations:-
#     -> Realized that computational efficiency becomes crucial with larger datasets or more complex models. Learning how to optimize code for performance while maintaining accuracy is an ongoing process.





# Relevance to Your Field of Study

# 1.Sentiment Analysis:-
#     -> The skills acquired in sentiment classification help businesses understand customer feedback and improve products and services.

# 2.Information Retrieval:-
#     -> Ranking documents based on relevance to a query is essential for search engines and recommendation systems.

# 3.Text Classification:-
#     -> Effective feature extraction directly impacts the performance of machine learning models used for categorizing text into predefined classes.

