
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

I am interested in the text classification task of detecting spam email. The goal is to distinguish spam from non-spam emails. The features extracted in the code for Question 2 can be extended to build a machine learning model for this task. Here are five types of features that might be useful:

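1. **Bag-of-Words Features:** Word counts capture vocabulary that is typical of spam, such as "free", "prize", or "click".
2. **TF-IDF Features:** Weighting each term by how rare it is across emails emphasizes distinctive words over common ones.
3. **Sentiment Features:** Spam often uses strongly positive or urgent language, which a sentiment score can pick up.
4. **Exclamation Mark Count:** Heavy use of exclamation marks is a common stylistic marker of promotional messages.
5. **Hyperlink Presence:** Spam frequently asks the reader to follow a link, so the presence of a URL is a useful binary signal.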
   
By combining these features, a machine learning model can learn to distinguish between spam and non-spam emails based on the specific characteristics commonly found in each category. It's important to note that feature selection and engineering are iterative processes, and additional features or variations of existing features may further improve model performance.
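The two surface-level features in the list (exclamation mark count and hyperlink presence) can be sketched with the standard library alone. This is a minimal illustration, not a tuned spam detector: the `URL_PATTERN` regular expression and the `surface_features` helper are hypothetical names, and the pattern assumes that a link starts with `http://`, `https://`, or `www.`.

```python
import re

# Hypothetical pattern (an assumption): explicit URLs or bare "www." hosts.
URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)", re.IGNORECASE)

def surface_features(text):
    """Return simple hand-crafted features for a single email."""
    return {
        # Number of '!' characters in the text
        "exclamation_count": text.count("!"),
        # 1 if the text contains something that looks like a link, else 0
        "has_hyperlink": int(bool(URL_PATTERN.search(text))),
    }

sample = "You WON! Claim now at www.example.com!"
features = surface_features(sample)
print(features)
```

Matching `www.` as well as the URL scheme is a design choice; a stricter or looser pattern may suit a real mailbox better.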

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [2]:
# Your code here (Please add comments in the code):
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the NLTK resources used below (stop-word list and VADER lexicon)
nltk.download('stopwords')
nltk.download('vader_lexicon')

# Sample email data
emails = [
    "Hey, congratulations! You've won a free cruise. Click the link to claim your prize now!",
    "Hi, just wanted to remind you about our meeting tomorrow at 10 AM. Regards, John",
    "URGENT: Your account needs verification. Please click the link and provide your details.",
    "Meeting agenda attached. Let me know if you have any questions. Thanks, Sarah"
]

# Labels for the sample emails (1 for spam, 0 for non-spam)
labels = [1, 0, 1, 0]

# Create a DataFrame for easy handling of data
df = pd.DataFrame({'email': emails, 'label': labels})

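# Function to extract bag-of-words features
# Note: the vectorizer is fitted on one email at a time, so each returned
# vector uses that email's own vocabulary. Every count is therefore 1 in the
# output, and column i does not refer to the same word across emails.
def extract_bag_of_words(text):
    stop_words = set(stopwords.words('english'))
    vectorizer = CountVectorizer(stop_words=list(stop_words))
    features = vectorizer.fit_transform([text])
    return features.toarray()[0]

# Function to extract TF-IDF features
# Note: as above, fitting on a single email means the IDF term carries no
# cross-document information; the values reduce to normalized term counts.
def extract_tfidf(text):
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform([text])
    return features.toarray()[0]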

# Function to extract sentiment features
def extract_sentiment_features(text):
    sid = SentimentIntensityAnalyzer()
    sentiment_scores = sid.polarity_scores(text)
    return sentiment_scores['compound']

# Function to extract exclamation mark count
def extract_exclamation_mark_count(text):
    return text.count('!')

# Function to extract hyperlink presence
def extract_hyperlink_presence(text):
    # Check if the text contains "http://" or "https://"
    return int("http://" in text or "https://" in text)

# Apply feature extraction functions to the sample data
df['bag_of_words'] = df['email'].apply(extract_bag_of_words)
df['tfidf'] = df['email'].apply(extract_tfidf)
df['sentiment'] = df['email'].apply(extract_sentiment_features)
df['exclamation_count'] = df['email'].apply(extract_exclamation_mark_count)
df['hyperlink_presence'] = df['email'].apply(extract_hyperlink_presence)

# Display the extracted features
print("Bag-of-Words Features:")
print(df['bag_of_words'])

print("\nTF-IDF Features:")
print(df['tfidf'])

print("\nSentiment Features:")
print(df['sentiment'])

print("\nExclamation Mark Count:")
print(df['exclamation_count'])

print("\nHyperlink Presence:")
print(df['hyperlink_presence'])




Bag-of-Words Features:
0       [1, 1, 1, 1, 1, 1, 1, 1]
1       [1, 1, 1, 1, 1, 1, 1, 1]
2    [1, 1, 1, 1, 1, 1, 1, 1, 1]
3       [1, 1, 1, 1, 1, 1, 1, 1]
Name: bag_of_words, dtype: object

TF-IDF Features:
0    [0.2581988897471611, 0.2581988897471611, 0.258...
1    [0.2581988897471611, 0.2581988897471611, 0.258...
2    [0.2581988897471611, 0.2581988897471611, 0.258...
3    [0.2773500981126146, 0.2773500981126146, 0.277...
Name: tfidf, dtype: object

Sentiment Features:
0    0.9411
1    0.0000
2    0.5904
3    0.4404
Name: sentiment, dtype: float64

Exclamation Mark Count:
0    2
1    0
2    0
3    0
Name: exclamation_count, dtype: int64

Hyperlink Presence:
0    0
1    0
2    0
3    0
Name: hyperlink_presence, dtype: int64




## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, and rank them by importance in descending order.

In [11]:
from sklearn.ensemble import RandomForestClassifier
import numpy as np

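# Pad bag_of_words and tfidf arrays to have the same length
# (zero-padding only equalizes lengths; because each vector was built from its
# own email's vocabulary, column i does not refer to the same word across emails)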
max_features = max(max(len(x) for x in df['bag_of_words']), max(len(x) for x in df['tfidf']))
df['bag_of_words'] = df['bag_of_words'].apply(lambda x: np.pad(x, (0, max_features - len(x))))
df['tfidf'] = df['tfidf'].apply(lambda x: np.pad(x, (0, max_features - len(x))))

# Combine all features into a single array
X = np.concatenate([np.vstack(df['bag_of_words'].values),
                    np.vstack(df['tfidf'].values),
                    df['sentiment'].values.reshape(-1, 1),
                    df['exclamation_count'].values.reshape(-1, 1),
                    df['hyperlink_presence'].values.reshape(-1, 1)], axis=1)

# Labels
y = df['label'].values

# Train a RandomForestClassifier to get feature importance
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

# Get feature importances
feature_importances = clf.feature_importances_

# Create a DataFrame to display feature importance
feature_names = [f'bag_of_words_{i}' for i in range(max_features)] + \
                [f'tfidf_{i}' for i in range(max_features)] + \
                ['sentiment', 'exclamation_count', 'hyperlink_presence']
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})

# Sort features by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Display the ranked features
print("Ranked Features based on Importance:")
print(feature_importance_df)


Ranked Features based on Importance:
               Feature  Importance
30           sentiment    0.291188
31   exclamation_count    0.122605
27            tfidf_12    0.083653
25            tfidf_10    0.055556
17             tfidf_2    0.049808
29            tfidf_14    0.047254
28            tfidf_13    0.045977
18             tfidf_3    0.045977
20             tfidf_5    0.042146
26            tfidf_11    0.040230
21             tfidf_6    0.039591
15             tfidf_0    0.038314
22             tfidf_7    0.030651
19             tfidf_4    0.022989
16             tfidf_1    0.020434
8       bag_of_words_8    0.017241
23             tfidf_8    0.005109
24             tfidf_9    0.001277
0       bag_of_words_0    0.000000
1       bag_of_words_1    0.000000
14     bag_of_words_14    0.000000
13     bag_of_words_13    0.000000
12     bag_of_words_12    0.000000
11     bag_of_words_11    0.000000
10     bag_of_words_10    0.000000
9       bag_of_words_9    0.000000
7       bag_of_wor

## Question 4 (10 points):
Write Python code to rank the texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [12]:
# Your code here (Please add comments in the code):

!pip install transformers scikit-learn






In [13]:
import pandas as pd
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Sample email data
emails = [
    "Hey, congratulations! You've won a free cruise. Click the link to claim your prize now!",
    "Hi, just wanted to remind you about our meeting tomorrow at 10 AM. Regards, John",
    "URGENT: Your account needs verification. Please click the link and provide your details.",
    "Meeting agenda attached. Let me know if you have any questions. Thanks, Sarah"
]

# Query
query = "Meeting tomorrow at 10 AM"

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize and obtain embeddings for the query
query_tokens = tokenizer(query, return_tensors='pt', truncation=True, padding=True)
with torch.no_grad():
    query_outputs = model(**query_tokens)

# Extract the embeddings for the query
query_embedding = query_outputs['last_hidden_state'].mean(dim=1).squeeze().numpy()

# Tokenize and obtain embeddings for each email
email_embeddings = []
for email in emails:
    email_tokens = tokenizer(email, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():
        email_outputs = model(**email_tokens)

    # Extract the embeddings for each email
    email_embedding = email_outputs['last_hidden_state'].mean(dim=1).squeeze().numpy()
    email_embeddings.append(email_embedding)

# Calculate cosine similarity between the query and each email
similarities = cosine_similarity([query_embedding], email_embeddings).flatten()

# Create a DataFrame to display the ranked similarities
ranked_similarity_df = pd.DataFrame({'Email': emails, 'Similarity': similarities})

# Sort emails by similarity in descending order
ranked_similarity_df = ranked_similarity_df.sort_values(by='Similarity', ascending=False)

# Display the ranked similarities
print("Ranked Similarities:")
print(ranked_similarity_df)



Ranked Similarities:
                                               Email  Similarity
1  Hi, just wanted to remind you about our meetin...    0.661114
3  Meeting agenda attached. Let me know if you ha...    0.619736
2  URGENT: Your account needs verification. Pleas...    0.509751
0  Hey, congratulations! You've won a free cruise...    0.487396
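
The ranking step itself does not depend on BERT: once the query and each text are vectors, cosine similarity is a normalized dot product. The sketch below shows that step in NumPy on toy 3-dimensional vectors standing in for the 768-dimensional mean-pooled embeddings; the `cosine_rank` helper and the vectors are hypothetical and unrelated to the scores printed above.

```python
import numpy as np

def cosine_rank(query_vec, doc_vecs):
    """Return (indices, scores) sorted by cosine similarity, descending."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # Dot product of each document with the query, divided by both norms
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    order = np.argsort(-scores)
    return order, scores[order]

# Toy vectors (assumptions for illustration, not real BERT embeddings)
query = [1.0, 0.0, 1.0]
docs = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

order, scores = cosine_rank(query, docs)
print(order, scores)
```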


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



Extracting features from text data enhanced my understanding of NLP. Concepts like Bag-of-Words, TF-IDF, and sentiment analysis proved beneficial. Learning feature selection methods, including classifier-based importance ranking, added depth to optimizing feature sets.
Challenges included handling feature dimensions, aligning bag_of_words and tfidf vectors of varying lengths, and ensuring library compatibility. Detailed attention to concatenation and feature alignment was necessary.
