<a href="https://colab.research.google.com/github/muppallajhansi/Jhansi_INFO5731_Fall2024/blob/main/Muppalla_Jhansi_Exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [1]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Text classification or text mining task:
The task is to find out the topic modelling for the News articles.

What kind of features might be useful for you to build the machine learning model?
The features that are useful to build the machine learning model are:
1. Word Frequencies (Bag of Words)
2. TF-IDF (Term Frequency-Inverse Document Frequency)
3. Named Entities
4. Latent Semantic Analysis (LSA) Features
5. Document Length and Structure

List your features and explain why these features might be helpful.

1. Word Frequencies (Bag of Words)
The bag of words feature count the frequencies of each word independently without considering the order.
This helps to identify main vocabulary associated with different topics.
Example: "Game", "Player" and "Score" are related to the sports topic.

2. TF-IDF (Term Frequency-Inverse Document Frequency)
It measure how important a word is for a particular document to how frequently it appears in other documents.It helps to distinguish the main themes of the article.
Example: The common words like "the," "and," or "news" are likely to appear across all topics and carry little information.

3. Named Entities
It identifies the proper nouns such as dates in the text. It helps to categorize articles into related topics.
Example:"World cup" is a word that indicates it is related to sports.

4. Latent Semantic Analysis (LSA) Features
It is a technique that captures the relationship between words by projecting them into a lower-dimensional space.
Words that frequently co-occur in similar contexts often belong to the same topic.
Example: In articles about space exploration, "NASA," "astronaut may frequently co-occur, even if they don’t always appear directly next to each other.


5. Document Length and Structure
It captures the length of the article and whether the text contains sections like headings, bullet points, or subheadings.
It helps us to do indepth analysis on specific topics.
Example: A short article with bullet points might be more likely to focus on financial summaries, while long articles with multiple headings might indicate in-depth political analysis.

'''

'\nText classification or text mining task:\nThe task is to find out the topic modelling for the News articles.\n\nWhat kind of features might be useful for you to build the machine learning model?\nThe features that are useful to build the machine learning model are:\n1. Word Frequencies (Bag of Words)\n2. TF-IDF (Term Frequency-Inverse Document Frequency)\n3. Named Entities\n4. Latent Semantic Analysis (LSA) Features\n5. Document Length and Structure\n\nList your features and explain why these features might be helpful.\n\n1. Word Frequencies (Bag of Words)\nThe bag of words feature count the frequencies of each word independently without considering the order.\nThis helps to identify main vocabulary associated with different topics.\nExample: "Game", "Player" and "Score" are related to the sports topic.\n\n2. TF-IDF (Term Frequency-Inverse Document Frequency)\nIt measure how important a word is for a particular document to how frequently it appears in other documents.It helps to dis

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [2]:
!pip install nltk scikit-learn spacy gensim
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m17.7 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
import nltk
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import re

# input data
documents = [
    "Flooding in Nepal has claimed over 100 lives, with many areas affected by landslides.",
    "Austria's far-right Freedom Party is poised for a historic national election victory.",
    "Israeli forces killed Hezbollah chief Hassan Nasrallah in an airstrike.",
    "NASA astronauts are preparing for a mission to Mars, advancing space exploration.",
    "Severe flooding in Kathmandu has led to widespread damage and school closures."
]



# Initialize spaCy's NER model
nlp = spacy.load('en_core_web_sm')

# 1. Word Frequencies (Bag of Words)
def extract_bag_of_words(docs):
    vectorizer = CountVectorizer(stop_words='english')
    bow_matrix = vectorizer.fit_transform(docs)
    return bow_matrix, vectorizer.get_feature_names_out()

# 2. TF-IDF
def extract_tfidf(docs):
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(docs)
    return tfidf_matrix, tfidf_vectorizer.get_feature_names_out()

# 3. Named Entities
def extract_named_entities(docs):
    named_entities = []
    for doc in docs:
        spacy_doc = nlp(doc)
        named_entities.append([(ent.text, ent.label_) for ent in spacy_doc.ents])
    return named_entities

# 4. Latent Semantic Analysis (LSA)
def extract_lsa_features(docs, n_components=2):
    # Using TF-IDF matrix as input for LSA
    tfidf_matrix, feature_names = extract_tfidf(docs)
    lsa = TruncatedSVD(n_components=n_components)
    lsa_matrix = lsa.fit_transform(tfidf_matrix)
    return lsa_matrix, lsa.components_

# 5. Document Length and Structure
def extract_document_length_and_structure(docs):
    lengths = [len(doc.split()) for doc in docs]
    structures = [bool(re.search(r"\b(heading|bullet point|subheading)\b", doc.lower())) for doc in docs]
    return lengths, structures

# Feature extraction
bow_matrix, bow_features = extract_bag_of_words(documents)
tfidf_matrix, tfidf_features = extract_tfidf(documents)
named_entities = extract_named_entities(documents)
lsa_matrix, lsa_components = extract_lsa_features(documents)
doc_lengths, doc_structures = extract_document_length_and_structure(documents)




# Displaying other extracted features
print("\n1. Bag of Words (Word Frequencies):")
print(bow_matrix.toarray())
print("Features:", bow_features)

print("\n2. TF-IDF:")
print(tfidf_matrix.toarray())
print("Features:", tfidf_features)

print("\n3. Named Entities:")
for i, entities in enumerate(named_entities):
    print(f"Document {i+1}: {entities}")

print("\n4. Latent Semantic Analysis (LSA):")
print("LSA Matrix:\n", lsa_matrix)
print("LSA Components:\n", lsa_components)

print("\n5. Document Length and Structure:")
for i, (length, structure) in enumerate(zip(doc_lengths, doc_structures)):
    print(f"Document {i+1}: Length = {length}, Has Structure = {structure}")



1. Bag of Words (Word Frequencies):
[[1 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0
  0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1
  0 0 0 1 0]
 [0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
  0 0 0 0 0]
 [0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0
  0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  1 1 0 0 1]]
Features: ['100' 'advancing' 'affected' 'airstrike' 'areas' 'astronauts' 'austria'
 'chief' 'claimed' 'closures' 'damage' 'election' 'exploration' 'far'
 'flooding' 'forces' 'freedom' 'hassan' 'hezbollah' 'historic' 'israeli'
 'kathmandu' 'killed' 'landslides' 'led' 'lives' 'mars' 'mission' 'nasa'
 'nasrallah' 'national' 'nepal' 'party' 'poised' 'preparing' 'right'
 'school' 'severe' 'space' 'victory' 'widespread']

2. TF-IDF:
[[0.36152912 0.         0.36152912 0.         0.36152912 0.
  0.         0.         0.36152912 0

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [4]:
import numpy as np

# 1. Rank TF-IDF Features
def rank_tfidf_features(tfidf_matrix, tfidf_features, top_n=10):
    # Sum TF-IDF scores for each term across all documents
    feature_importances = np.sum(tfidf_matrix.toarray(), axis=0)
    sorted_indices = np.argsort(feature_importances)[::-1]  # Descending order
    ranked_features = [(tfidf_features[i], feature_importances[i]) for i in sorted_indices[:top_n]]
    return ranked_features

# 2. Rank Named Entities
def rank_named_entities(named_entities):
    entity_count = {}
    for doc_entities in named_entities:
        for entity, label in doc_entities:
            if entity in entity_count:
                entity_count[entity] += 1
            else:
                entity_count[entity] = 1
    # Sort entities by frequency
    sorted_entities = sorted(entity_count.items(), key=lambda x: x[1], reverse=True)
    return sorted_entities

# 3. Rank LSA Components
def rank_lsa_components(lsa_matrix, lsa_components, top_n=10):
    # Sum the absolute values of components to get importance
    component_importances = np.sum(np.abs(lsa_components), axis=0)
    sorted_indices = np.argsort(component_importances)[::-1]
    ranked_components = [(f"Component {i+1}", component_importances[i]) for i in sorted_indices[:top_n]]
    return ranked_components

# Get rankings
ranked_tfidf = rank_tfidf_features(tfidf_matrix, tfidf_features)
ranked_entities = rank_named_entities(named_entities)
ranked_lsa = rank_lsa_components(lsa_matrix, lsa_components)

# Display results
print("\nRanked TF-IDF Features:")
for feature, score in ranked_tfidf:
    print(f"Feature: {feature}, Score: {score}")

print("\nRanked Named Entities:")
for entity, count in ranked_entities:
    print(f"Entity: {entity}, Count: {count}")

print("\nRanked LSA Components:")
for component, importance in ranked_lsa:
    print(f"{component}, Importance: {importance}")




Ranked TF-IDF Features:
Feature: flooding, Score: 0.5833588309315438
Feature: widespread, Score: 0.36152911730069653
Feature: lives, Score: 0.36152911730069653
Feature: affected, Score: 0.36152911730069653
Feature: areas, Score: 0.36152911730069653
Feature: claimed, Score: 0.36152911730069653
Feature: closures, Score: 0.36152911730069653
Feature: damage, Score: 0.36152911730069653
Feature: kathmandu, Score: 0.36152911730069653
Feature: landslides, Score: 0.36152911730069653

Ranked Named Entities:
Entity: Nepal, Count: 1
Entity: over 100, Count: 1
Entity: Austria, Count: 1
Entity: Freedom Party, Count: 1
Entity: Israeli, Count: 1
Entity: Hezbollah, Count: 1
Entity: Hassan Nasrallah, Count: 1
Entity: NASA, Count: 1
Entity: Mars, Count: 1
Entity: Kathmandu, Count: 1

Ranked LSA Components:
Component 15, Importance: 0.3959957487519463
Component 31, Importance: 0.30678599553894803
Component 33, Importance: 0.30678599553894803
Component 17, Importance: 0.30678599553894803
Component 14, Imp

## Question 4 (10 points):
Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant docments. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similary with descending order.

In [5]:
!pip install torch transformers scikit-learn



In [6]:
import numpy as np
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Sample text data (replace with real news articles)
documents = [
    "Flooding in Nepal has claimed over 100 lives, with many areas affected by landslides.",
    "Austria's far-right Freedom Party is poised for a historic national election victory.",
    "Israeli forces killed Hezbollah chief Hassan Nasrallah in an airstrike.",
    "NASA astronauts are preparing for a mission to Mars, advancing space exploration.",
    "Severe flooding in Kathmandu has led to widespread damage and school closures."
]

# Query
query = "Hezbollah leader killed in an Israeli airstrike"

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to encode text
def encode_text(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Return the mean of the token embeddings as the document representation
    return outputs.last_hidden_state.mean(dim=1).numpy()

# Encode the query and documents
query_embedding = encode_text(query)
document_embeddings = np.vstack([encode_text(doc) for doc in documents])

# Calculate cosine similarity
similarity_scores = cosine_similarity(query_embedding, document_embeddings)[0]

# Rank documents based on similarity scores
ranked_indices = np.argsort(similarity_scores)[::-1]  # Descending order

# Display results
print("\nRanked Documents based on Similarity to Query:")
for index in ranked_indices:
    print(f"Document: {documents[index]}, Similarity Score: {similarity_scores[index]:.4f}")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]


Ranked Documents based on Similarity to Query:
Document: Israeli forces killed Hezbollah chief Hassan Nasrallah in an airstrike., Similarity Score: 0.8846
Document: Flooding in Nepal has claimed over 100 lives, with many areas affected by landslides., Similarity Score: 0.6146
Document: Severe flooding in Kathmandu has led to widespread damage and school closures., Similarity Score: 0.6073
Document: Austria's far-right Freedom Party is poised for a historic national election victory., Similarity Score: 0.5836
Document: NASA astronauts are preparing for a mission to Mars, advancing space exploration., Similarity Score: 0.5276


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [7]:

'''
My overall learning experience was very good and i have practical experience about feature extraction, feature selection and text similarity.
I have faced few challenges while building the BERT model and finding the best query.
This exercise is relevant to the field of Natural Language Processing (NLP) as it contains fundamental techniques for feature extraction and similarity measurement, which are important tasks like information retrieval and text classification.
'''

'\nMy overall learning experience was very good and i have practical experience about feature extraction, feature selection and text similarity.\nI have faced few challenges while building the BERT model and finding the best query.\nThis exercise is relevant to the field of Natural Language Processing (NLP) as it contains fundamental techniques for feature extraction and similarity measurement, which are important tasks like information retrieval and text classification.\n'