
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the questions above):

'''
Please write your answer here:

An interesting text classification task is topic categorization for news articles. The goal is to sort news articles into predefined categories such as politics, sports, entertainment, and technology. The following five feature types can be used to build a machine learning model for this task:


1. Bag-of-Words (BoW) Features
A simple representation in which every article is described by the frequency (or just the presence or absence) of specific words.
Words like "game" or "team" in sports, or "government" or "election" in politics, tend to be strongly associated with particular topics. The topic can often be inferred from the frequency of such words alone, without any knowledge of word order.

2. TF-IDF (Term Frequency-Inverse Document Frequency)
A weighted version of word counts that combines how often a word occurs in an article with how rare it is across all articles.
Common words like "the" and "is" receive less weight under TF-IDF, while topic-specific words that are more informative but appear in fewer articles are emphasized.

3. Named Entity Recognition (NER)
Recognizing and classifying named entities in the text, such as people, organizations, dates, and places.
Some entities are closely tied to particular topics. References to athletes point to a sports article, whereas mentions of politicians or political organizations most likely indicate a politics article.

4. Document Length
The number of words in an article.
In general, different topics are covered at different lengths. For example, sports articles are often shorter and consist of brief event summaries, whereas business or politics articles may offer more in-depth analysis. Document length is a basic but potentially useful feature.

5. Topic-Specific Keywords
Using keyword extraction algorithms or prior domain knowledge to identify words or phrases that are strong indicators of particular categories.
In sports, terms like "match" and "score" are highly representative of the subject matter; in politics, terms like "policy" and "election" play the same role. These keywords offer clear hints about the article's topic.

Combining these features gives a richer representation of each article, which should help the model capture both the content and the context needed for topic categorization.




'''
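
The TF-IDF weighting described in the answer above can be illustrated with a small hand computation. This is a minimal sketch on three made-up sentences, not the notebook's data. The smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` is assumed to correspond to the default behaviour of scikit-learn's `TfidfVectorizer`; the helper names `smoothed_idf` and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Made-up toy corpus (an assumption for illustration only)
docs = [
    "the government announced new policies",
    "the team won an important game",
    "the election result surprised government officials",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each word at least once
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def smoothed_idf(word):
    # Smoothed inverse document frequency: rarer words get larger values
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    # Raw term count times IDF, followed by L2 normalisation of the vector
    counts = Counter(tokens)
    weights = {w: c * smoothed_idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

weights_doc2 = tfidf(tokenized[1])
print(f"idf(the) = {smoothed_idf('the'):.3f}, idf(game) = {smoothed_idf('game'):.3f}")
print(weights_doc2)
```

Because "the" occurs in every document, its IDF is the minimum possible value, while "game" occurs in only one document and is weighted higher.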

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [12]:
!pip install scikit-learn spacy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import spacy

nlp = spacy.load("en_core_web_sm")

documents = [
    "The government announced new policies to boost the economy after the election.",
    "Renewable energy sources like solar and wind power are becoming more popular.",
    "Microsoft announced a partnership with other tech companies to advance AI research.",
    "The Olympic Games are around the corner, and athletes are gearing up for the competition."
]

# 1. Bag-of-Words (BoW) Features
vectorizer_bow = CountVectorizer()
bow_matrix = vectorizer_bow.fit_transform(documents)

print("Bag-of-Words Feature Matrix (BoW):\n", bow_matrix.toarray())
print("BoW Feature Names:", vectorizer_bow.get_feature_names_out())

# 2. TF-IDF (Term Frequency-Inverse Document Frequency) Features
vectorizer_tfidf = TfidfVectorizer()
tfidf_matrix = vectorizer_tfidf.fit_transform(documents)

print("\nTF-IDF Feature Matrix:\n", tfidf_matrix.toarray())
print("TF-IDF Feature Names:", vectorizer_tfidf.get_feature_names_out())

# 3. Named Entity Recognition (NER) Features
def extract_named_entities(doc):
    doc_nlp = nlp(doc)
    entities = [(ent.text, ent.label_) for ent in doc_nlp.ents]
    return entities

named_entities = [extract_named_entities(doc) for doc in documents]
print("\nNamed Entities (NER):")
for i, entities in enumerate(named_entities):
    print(f"Document {i+1}: {entities}")

# 4. Document Length
doc_lengths = [len(doc.split()) for doc in documents]
print("\nDocument Lengths (in words):", doc_lengths)

# 5. Topic-Specific Keywords
keywords = {
    "politics": ["government", "election", "policy"],
    "sports": ["Olympic", "athletes", "competition", "game"],
    "technology": ["Microsoft", "AI", "tech", "research"],
    "environment": ["energy", "solar", "wind", "power"]
}

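import re

def extract_keywords(doc, keywords):
    # Lowercase and strip punctuation so that "election." matches "election"
    # and "Olympic" matches regardless of case. Matching is on exact tokens,
    # so "policy" does not match "policies".
    doc_tokens = set(re.findall(r"[a-z0-9]+", doc.lower()))
    found_keywords = {}
    for category, kw_list in keywords.items():
        found = [kw for kw in kw_list if kw.lower() in doc_tokens]
        if found:
            found_keywords[category] = found
    return found_keywords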

doc_keywords = [extract_keywords(doc, keywords) for doc in documents]
print("\nTopic-Specific Keywords:")
for i, kw in enumerate(doc_keywords):
    print(f"Document {i+1}: {kw}")



Bag-of-Words Feature Matrix (BoW):
 [[0 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 3
  1 0 0 0]
 [0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 0 1 1 0 0
  0 0 1 0]
 [1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 1 0
  1 0 0 1]
 [0 0 0 1 0 2 1 1 0 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 3
  0 1 0 0]]
BoW Feature Names: ['advance' 'after' 'ai' 'and' 'announced' 'are' 'around' 'athletes'
 'becoming' 'boost' 'companies' 'competition' 'corner' 'economy'
 'election' 'energy' 'for' 'games' 'gearing' 'government' 'like'
 'microsoft' 'more' 'new' 'olympic' 'other' 'partnership' 'policies'
 'popular' 'power' 'renewable' 'research' 'solar' 'sources' 'tech' 'the'
 'to' 'up' 'wind' 'with']

TF-IDF Feature Matrix:
 [[0.         0.26882576 0.         0.         0.21194532 0.
  0.         0.         0.         0.26882576 0.         0.
  0.         0.26882576 0.26882576 0.         0.         0.
  0.         0.26882576 0.       
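
The feature types extracted above are printed separately, but a classifier needs one fixed-length vector per document. The following is a rough sketch of how word counts, document length, and per-category keyword counts could be concatenated. It uses plain Python lists, two of the sample documents, and shortened keyword lists; the names `tokenize` and `combined_features` are hypothetical, and NER features are left out for brevity.

```python
import re

documents = [
    "The government announced new policies to boost the economy after the election.",
    "The Olympic Games are around the corner, and athletes are gearing up for the competition.",
]

# Shortened keyword lists (an assumption, adapted for exact token matching)
keywords = {
    "politics": ["government", "election", "policies"],
    "sports": ["olympic", "athletes", "competition", "games"],
}

def tokenize(text):
    # Lowercase and keep only alphanumeric runs, dropping punctuation
    return re.findall(r"[a-z0-9]+", text.lower())

# Shared vocabulary so that every document maps to a vector of the same length
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})

def combined_features(doc):
    tokens = tokenize(doc)
    bow = [tokens.count(term) for term in vocabulary]        # bag-of-words counts
    length = [len(tokens)]                                    # document length
    keyword_hits = [sum(tokens.count(kw) for kw in kws)      # one count per category
                    for kws in keywords.values()]
    return bow + length + keyword_hits

feature_matrix = [combined_features(doc) for doc in documents]
for row in feature_matrix:
    # Last three entries: length, politics keyword count, sports keyword count
    print(len(row), row[-3:])
```

In practice the columns sit on different scales (counts versus lengths), so some form of scaling would probably be needed before training a model.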

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [14]:
# Your code here (Please add comments in the code):

# Required libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

data = {'text': ['I love coding in Python', 'Data analysis is fascinating', 'Python makes data analysis easy',
                'Pandas makes data manipulation easier', 'Understanding AI impacts our future'],
        'label': [1, 1, 1, 0, 0]}  # 1 = Positive, 0 = Negative

df = pd.DataFrame(data)

# Extract Features using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(df['text'])  # Feature matrix
y = df['label']  # Labels

# Apply Chi-Square test
chi2_scores, p_values = chi2(X, y)

# Create a DataFrame with feature names and their corresponding Chi-Square scores
feature_ranking = pd.DataFrame({'feature': tfidf_vectorizer.get_feature_names_out(), 'chi2_score': chi2_scores})

# Rank features based on Chi-Square scores in descending order
ranked_features = feature_ranking.sort_values(by='chi2_score', ascending=False)

print("Ranked Features that are based on Chi-Square Test:")
print(ranked_features)





Ranked Features that are based on Chi-Square Test:
          feature  chi2_score
15         pandas    0.740849
4          easier    0.740849
13   manipulation    0.740849
0              ai    0.670820
7          future    0.670820
14            our    0.670820
8         impacts    0.670820
17  understanding    0.670820
1        analysis    0.597156
16         python    0.573138
6     fascinating    0.378676
10             is    0.378676
5            easy    0.361484
11           love    0.348906
2          coding    0.348906
9              in    0.348906
3            data    0.037978
12          makes    0.020479
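
To make the ranking above easier to interpret, here is a sketch of how a chi-square score per feature can be computed by hand. It follows what is understood to be the approach behind `sklearn.feature_selection.chi2`: for each feature, compare the observed total within each class against the total expected if the feature were independent of the class. The toy count matrix and the function name `chi2_by_hand` are assumptions for illustration, and every feature is assumed to occur at least once.

```python
def chi2_by_hand(X, y):
    """Chi-square score per column of a non-negative feature matrix X given labels y."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            # Observed: total value of feature j over the samples of class c
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected under independence: the class's share of the feature total
            class_prob = sum(1 for label in y if label == c) / n_samples
            expected = class_prob * feature_total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# Toy term counts (hypothetical); columns correspond to feature_names
feature_names = ["python", "data", "pandas"]
X = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]
y = [1, 1, 1, 0, 0]

scores = chi2_by_hand(X, y)
ranked = sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

Features that occur in only one class ("pandas", "python") score high, while a feature spread across both classes ("data") scores close to zero, which matches the pattern in the ranking above.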


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [15]:
!pip install transformers torch scikit-learn

from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import torch
import numpy as np

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

documents = [
    "The government announced new policies to boost the economy after the election.",
    "Renewable energy sources like solar and wind power are becoming more popular.",
    "Microsoft announced a partnership with other tech companies to advance AI research.",
    "The Olympic Games are around the corner, and athletes are gearing up for the competition."
]

query = "Government policies to boost the economy"

def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :].numpy()
    return cls_embedding

query_embedding = get_bert_embedding(query)

document_embeddings = np.vstack([get_bert_embedding(doc) for doc in documents])

# Calculate cosine similarity between query and each document
similarities = cosine_similarity(query_embedding, document_embeddings)[0]

# Rank documents based on similarity in descending order
ranked_indices = np.argsort(similarities)[::-1]

print("Ranked Documents based on Query Similarity:\n")
for idx in ranked_indices:
    print(f"Document: {documents[idx]}\nSimilarity Score: {similarities[idx]:.4f}\n")






Ranked Documents based on Query Similarity:

Document: The government announced new policies to boost the economy after the election.
Similarity Score: 0.8176

Document: Renewable energy sources like solar and wind power are becoming more popular.
Similarity Score: 0.7989

Document: The Olympic Games are around the corner, and athletes are gearing up for the competition.
Similarity Score: 0.7886

Document: Microsoft announced a partnership with other tech companies to advance AI research.
Similarity Score: 0.7816
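
For reference, the cosine similarity used in the ranking above is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on toy 3-dimensional vectors that stand in for the 768-dimensional `bert-base-uncased` embeddings; the vectors and the names `cosine`, `doc_a`, `doc_b`, and `doc_c` are made up for illustration.

```python
import math

def cosine(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical stand-ins for real embeddings)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query, different length
    "doc_b": [1.0, 1.0, 0.0],  # partially overlapping direction
    "doc_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

similarities = {name: cosine(query_vec, vec) for name, vec in doc_vecs.items()}

# Rank document names by similarity to the query, highest first
ranked = sorted(similarities, key=similarities.get, reverse=True)
for name in ranked:
    print(f"{name}: {similarities[name]:.4f}")
```

Cosine similarity depends only on direction, not on vector length, which is why `doc_a` scores as a perfect match even though it is twice as long as the query vector.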



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the questions above):

'''
Please write your answer here:

Working on feature extraction from text data gave me valuable experience and a deeper understanding of different Natural Language Processing (NLP) techniques.
Key ideas such as Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), Named Entity Recognition (NER), and the use of pre-trained models like BERT helped me understand the various ways textual data can be represented for machine learning tasks.

One of the major challenges was understanding how the various feature extraction techniques differ in complexity and applicability.
For example, BERT embeddings require more computational power and some knowledge of deep learning models, while Bag-of-Words and TF-IDF are simple but less effective at capturing word meaning.

This exercise is directly relevant to NLP, which deals with converting unstructured text into representations that computers can process.
Text classification, sentiment analysis, information retrieval, and document ranking are just a few of the NLP tasks that rely heavily on feature extraction techniques like BoW, TF-IDF, and BERT.




'''