<a href="https://colab.research.google.com/github/madhan444-s/Madhan_INFO5731_Spring2024/blob/main/Dadi_Madhan_Exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, grant sharing rights from the top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question, write down your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
Sentiment analysis of social media posts is an interesting text classification task. The goal is to determine the sentiment expressed in a text: positive, negative, or neutral. Analyzing sentiment on social media can reveal consumer mood, industry trends, and brand perception. The following five feature types could be helpful for building a sentiment analysis machine learning model:

1. Bag-of-Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF):
    Raw word counts or TF-IDF scores for every word in the text.
BoW counts how often each word occurs in a document, while TF-IDF also down-weights words that occur in many documents. Words with high counts or TF-IDF scores can be strong sentiment cues.

2. N-grams:
    Sequences of adjacent words (bigrams or trigrams).
N-grams capture local context and the relationships between words. Phrases such as "very good" or "not happy" affect sentiment differently than their individual words do.

3. Part-of-Speech (POS) Tags:
    The distribution of verbs, adjectives, and other POS tags in the text.
Different POS tags can influence sentiment differently. Adjectives and verbs, for example, often carry more sentiment than nouns.

4. Emotion Lexicons:
    The presence or count of words linked to specific emotions (such as happiness, sorrow, or anger).
Emotion lexicons help capture the emotional tone of the text. Certain emotive words can be strong indicators of the feeling conveyed.

5. Sentiment Lexicons:
    The presence or count of words that a polarity lexicon marks as positive or negative (such as "great" or "terrible").
Sentiment lexicons capture polarity directly rather than specific emotions. The balance of positive and negative words gives a simple signal of the text's overall sentiment.

Collectively, these attributes give a rich representation of the textual material, allowing the machine learning model to understand patterns and produce accurate sentiment predictions. Using a combination of these feature categories can improve the model's capacity to capture subtle expressions in social media text.

'''
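The n-gram point above can be illustrated with a minimal sketch. It uses only the standard library, and the regex tokenizer and the example sentence are assumptions made for illustration, not part of the exercise data. It shows that the bigram ("not", "happy") keeps the negation that the unigram "happy" loses.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a simplified, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(text, n=2):
    """Count the n-grams (tuples of n adjacent tokens) in one document."""
    tokens = tokenize(text)
    # zip over n shifted copies of the token list yields every run of n adjacent tokens
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Hypothetical example sentence
sentence = "I am not happy with this phone"
unigrams = ngram_counts(sentence, n=1)
bigrams = ngram_counts(sentence, n=2)

# The unigram alone looks positive; the bigram preserves the negation
print(unigrams[("happy",)])
print(bigrams[("not", "happy")])
```

A classifier that only sees unigram counts would treat this sentence like one containing "happy" on its own, which is why bigram features are worth adding for sentiment tasks.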

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [24]:
# Your code here (Please add comments in the code):
# !pip install nltk
import nltk
from nltk import word_tokenize, pos_tag
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

# Download the NLTK resources needed for tokenization and part-of-speech tagging
# (newer NLTK releases may also require 'punkt_tab' and 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text data
sample_data = [
    "I love this product. It's amazing!",
    "The service was terrible. I'm very disappointed.",
    "The weather today is neither good nor bad.",
    "Not sure why anyone would buy this. It's a waste of money.",
    "This movie is great. I enjoyed every moment of it."
]

# Function to extract Bag-of-Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF)
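def extract_bow_tfidf(texts, method='bow'):
    if method == 'bow':
        vectorizer = CountVectorizer()
    elif method == 'tfidf':
        vectorizer = TfidfVectorizer()
    else:
        # Fail early instead of hitting an undefined `vectorizer` below
        raise ValueError("method must be 'bow' or 'tfidf'")

    # Learn the vocabulary and build the document-term matrix
    features = vectorizer.fit_transform(texts)
    feature_names = vectorizer.get_feature_names_out()

    # One row per document, one column per vocabulary term
    return pd.DataFrame(features.toarray(), columns=feature_names)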

# Function to extract N-grams
def extract_ngrams(texts, n=2):
    ngrams_list = []

    for text in texts:
        words = word_tokenize(text)
        ngrams_list.extend(ngrams(words, n))

    return ngrams_list

# Function to extract Part-of-Speech (POS) Tags
def extract_pos_tags(texts):
    pos_tags_list = []

    for text in texts:
        words = word_tokenize(text)
        pos_tags = pos_tag(words)
        pos_tags_list.extend(pos_tags)

    return pos_tags_list

# Function to extract Emotion Lexicons
def extract_emotion_lexicons(texts):
    # Assume a simple list of positive and negative emotion words
    positive_emotion_words = ["love", "amazing", "joy", "good", "great"]
    negative_emotion_words = ["terrible", "disappointed", "bad", "waste"]

    emotion_features = []

    for text in texts:
        words = word_tokenize(text.lower())  # lowercase so "Love" also matches "love"
        positive_count = sum(1 for word in words if word in positive_emotion_words)
        negative_count = sum(1 for word in words if word in negative_emotion_words)
        emotion_features.append({'Positive_Count': positive_count, 'Negative_Count': negative_count})

    return pd.DataFrame(emotion_features)

# Function to extract Sentiment Lexicons
def extract_sentiment_lexicons(texts):
    # Assume a simple list of positive and negative sentiment words
    positive_sentiment_words = ["love", "amazing", "joy", "good", "great"]
    negative_sentiment_words = ["terrible", "disappointed", "bad", "waste"]

    sentiment_features = []

    for text in texts:
        words = word_tokenize(text.lower())  # lowercase so "Great" also matches "great"
        positive_count = sum(1 for word in words if word in positive_sentiment_words)
        negative_count = sum(1 for word in words if word in negative_sentiment_words)
        sentiment_features.append({'Positive_Count': positive_count, 'Negative_Count': negative_count})

    return pd.DataFrame(sentiment_features)

if __name__ == "__main__":
    # Extracting features
    bow_features = extract_bow_tfidf(sample_data, method='bow')
    tfidf_features = extract_bow_tfidf(sample_data, method='tfidf')
    ngrams_features = extract_ngrams(sample_data, n=2)
    pos_tags_features = extract_pos_tags(sample_data)
    emotion_lexicons_features = extract_emotion_lexicons(sample_data)
    sentiment_lexicons_features = extract_sentiment_lexicons(sample_data)

    # Displaying extracted features
    print("Bag-of-Words Features:")
    print(bow_features)

    print("\nTF-IDF Features:")
    print(tfidf_features)

    print("\nN-grams Features:")
    print(ngrams_features)

    print("\nPart-of-Speech (POS) Tags Features:")
    print(pos_tags_features)

    print("\nEmotion Lexicons Features:")
    print(emotion_lexicons_features)

    print("\nSentiment Lexicons Features:")
    print(sentiment_lexicons_features)




Bag-of-Words Features:
   amazing  anyone  bad  buy  disappointed  enjoyed  every  good  great  is  \
0        1       0    0    0             0        0      0     0      0   0   
1        0       0    0    0             1        0      0     0      0   0   
2        0       0    1    0             0        0      0     1      0   1   
3        0       1    0    1             0        0      0     0      0   0   
4        0       0    0    0             0        1      1     0      1   1   

   ...  terrible  the  this  today  very  was  waste  weather  why  would  
0  ...         0    0     1      0     0    0      0        0    0      0  
1  ...         1    1     0      0     1    1      0        0    0      0  
2  ...         0    1     0      1     0    0      0        1    0      0  
3  ...         0    0     1      0     0    0      1        0    1      1  
4  ...         0    0     1      0     0    0      0        0    0      0  

[5 rows x 32 columns]

TF-IDF Features:
    a

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
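Lexicon matching is sensitive to casing and punctuation, so it helps to normalize tokens before counting. The following is a minimal sketch of per-document lexicon features that lowercases the text and adds a simple polarity score. The word lists are small toy lexicons, and the regex tokenizer, the `lexicon_features` helper, and the `polarity` column are assumptions introduced here for illustration.

```python
import re

# Toy lexicons (assumed for illustration; a real project would load a published lexicon)
POSITIVE_WORDS = {"love", "amazing", "joy", "good", "great"}
NEGATIVE_WORDS = {"terrible", "disappointed", "bad", "waste"}

def lexicon_features(text):
    """Return positive/negative word counts and their difference for one document."""
    # Lowercase first so that "LOVE" and "love" are counted as the same word
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    return {"positive_count": pos, "negative_count": neg, "polarity": pos - neg}

# Hypothetical documents (the first one is upper-cased on purpose)
docs = [
    "I LOVE this product. It's amazing!",
    "The service was terrible. I'm very disappointed.",
]
rows = [lexicon_features(d) for d in docs]
for doc, row in zip(docs, rows):
    print(row, "<-", doc)
```

Keeping one dictionary per document means the rows stay aligned with the BoW and TF-IDF tables, so the columns can be concatenated into a single feature matrix.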


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [23]:
# Your code here (Please add comments in the code):

'''
Here are the most important feature selection methods from the paper, ranked in descending order:
1. Document Frequency (DF): It is the number of documents in the corpus that contain the term.
2. TF-IDF: This method takes into account both term frequency (TF) and inverse document frequency (IDF) to measure the importance of a term.
3. Mutual Information (MI): It measures the mutual dependency between two variables.
4. Information Gain (IG): It is used to measure the dependence between features and class labels.
5. χ2 (CHI): It measures the lack of independence between a term and a category.
6. Correlation Coefficient (CC): It is a variant of the CHI measure.
Others: The paper also mentions other methods like Term ReLatedness (TRL), CMFS, Distinguishing Feature Selector (DFS), SpreadFx, Bi-Normal Separation (BNS), Maximum Discrimination (MD), Linear Measure (LM), Posterior Inclusion Probability (PIP), IGFSS, Subspace Sample (SS), Weight-based Sampling (WS), Uniform Sampling (US), and Best Terms (BT).

'''


## Question 4 (10 points):
Write Python code to rank the texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [21]:
# Install the necessary library
!pip install sentence_transformers

# Import libraries
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Sample text data
sample_data = [
    "I love this product. It's amazing!",
    "The service was terrible. I'm very disappointed.",
    "The weather today is neither good nor bad.",
    "Not sure why anyone would buy this. It's a waste of money.",
    "This movie is great. I enjoyed every moment of it."
]

# Sample query
query = "I want to purchase a good product. What are the best options?"

# Load a pre-trained BERT-based model (you can choose a different model if needed)
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

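# Encode the query and sample data (encode() returns NumPy arrays by default,
# which scikit-learn's cosine_similarity accepts directly, even when the model runs on a GPU)
query_embedding = model.encode(query)
sample_data_embeddings = model.encode(sample_data)

# Reshape the 1D query vector into a 2D array of shape (1, embedding_dim)
query_embedding = query_embedding.reshape(1, -1)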

# Calculate cosine similarity between the query and each text in the data
cosine_similarities = cosine_similarity(query_embedding, sample_data_embeddings).flatten()

# Create a DataFrame to display the results
result_df = pd.DataFrame({'Text': sample_data, 'Cosine_Similarity': cosine_similarities})

# Rank the results based on cosine similarity in descending order
result_df = result_df.sort_values(by='Cosine_Similarity', ascending=False)

# Display the result DataFrame
result_df




|   | Text | Cosine_Similarity |
|---|------|-------------------|
| 0 | I love this product. It's amazing! | 0.404019 |
| 3 | Not sure why anyone would buy this. It's a was... | 0.311158 |
| 4 | This movie is great. I enjoyed every moment of... | 0.056885 |
| 2 | The weather today is neither good nor bad. | -0.024932 |
| 1 | The service was terrible. I'm very disappointed. | -0.063588 |
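To make the ranking step concrete, here is a minimal NumPy sketch of what cosine similarity computes: the dot product of two vectors divided by the product of their lengths. The 2-dimensional vectors are made-up stand-ins for real sentence embeddings, and `cosine_sim` is a hypothetical helper name. The sketch also shows why scores can be negative, as in the last two rows of the results: vectors that point in opposing directions have a negative dot product.

```python
import numpy as np

def cosine_sim(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # Dot product of every document vector with the query
    dots = doc_matrix @ query_vec
    # Product of vector lengths (assumes no all-zero vectors)
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / norms

# Toy 2-D "embeddings" (assumed values, not produced by any model)
query = [1.0, 0.0]
docs = [
    [2.0, 0.0],   # same direction as the query
    [0.0, 3.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite direction
    [1.0, 1.0],   # 45 degrees away from the query
]

scores = cosine_sim(query, docs)
# Sort document indices by similarity, highest first
order = np.argsort(-scores)
print(scores)
print(order)
```

Because the score depends only on direction and not on vector length, the first document scores the same as an exact copy of the query would.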


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [22]:
# Your answer here (no code for this question, write down your answer in as much detail as possible for the above questions):

'''
Please write your answer here:

The exercise provided a valuable learning experience in extracting features from text data for NLP.
Key concepts like Bag-of-Words, TF-IDF, N-grams, POS tagging, and sentiment lexicons were well-covered, enhancing understanding.
Challenges included environment setup and library installations, particularly with pre-trained models.
The relevance to NLP is significant, as feature extraction is foundational for tasks like sentiment analysis, text classification, and information retrieval.
The exercise offered practical insights applicable to real-world NLP scenarios.


'''
