
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:


Task: extractive text summarization, in which each sentence of a document is scored and the highest-scoring sentences form the summary. The features below are computed per sentence.

1. TF-IDF (Term Frequency-Inverse Document Frequency)
What it is: A numerical score for how important a word is to one document relative to all other documents in the collection. It combines term frequency (how often the word appears in the document) with inverse document frequency (how rare the word is across the whole collection).
Why it helps: Words with high TF-IDF scores usually carry more of the document's meaning. Sentences whose words have a higher average TF-IDF weight are more likely to be informative, so the model can favor them for the summary.

2. Sentence Position
What it is: Where a sentence occurs in the document. In many documents, especially news articles and research papers, the opening and closing sentences matter most for comprehension.
Why it helps: Key sentences tend to sit in the introduction or the conclusion, so position is a cheap heuristic for where informative content is likely to be.

3. Named Entity Recognition (NER)
What it is: NER labels spans of text as a person, organization, location, date, and so on. It captures the concrete facts in the text.
Why it helps: Named entities often represent crucial data, so sentences with more named entities can be scored as more important for the summary. In a news article, for instance, the people, places, and dates are central to the information being presented.

4. Sentence Length
What it is: The number of words or characters in a sentence.
Why it helps: A very short sentence rarely carries enough information, while a very long one can be hard to follow. Medium-length sentences are often informative and therefore good summary candidates, so this feature helps filter out sentences that are too thin or too complex.

5. Part-of-Speech (POS) Tagging
What it is: POS tagging labels each word in the text as a noun, verb, adjective, and so on.
Why it helps: Nouns and verbs generally carry the main thrust of a sentence, while adjectives and adverbs add secondary detail. By looking at the POS tags, the model can give greater weight to sentences with more nouns and verbs.

Bonus Feature:
Sentiment Analysis
What it is: The polarity of a sentence (positive, negative, or neutral). In some kinds of text, especially opinion articles and reviews, sentiment signals how much weight the author places on a point.
Why it helps: If tone is crucial to a document's message (for example, in a review or a political speech), the model may want to highlight the sentences with stronger sentiment.
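
To make the first feature concrete, here is a minimal sketch of TF-IDF computed by hand on three toy sentences. It assumes the plain textbook definition (tf = count / sentence length, idf = log(N / document frequency)); libraries such as scikit-learn apply smoothing and normalization, so their numbers will differ. The `tfidf_scores` helper and the toy sentences are illustrative only.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf-idf} dict per tokenized document.

    Assumes the unsmoothed textbook definition:
    tf = term count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy sentences (illustrative only), lowercased and split on whitespace.
toy_docs = [
    "apple buys startup".split(),
    "apple reports revenue growth".split(),
    "cook praises apple".split(),
]
toy_scores = tfidf_scores(toy_docs)

# "apple" occurs in every sentence, so its idf is log(3/3) = 0.
print(toy_scores[0]["apple"])
# "startup" occurs in one sentence only, so it gets a positive weight.
print(round(toy_scores[0]["startup"], 4))
```

A word that appears everywhere contributes nothing, while a word unique to one sentence is weighted up, which is the behavior the feature description relies on.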





## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict
import numpy as np

# Small English pipeline: tokenizer, POS tagger, parser, and NER.
nlp = spacy.load('en_core_web_sm')


text = """
Apple is looking at buying U.K. startup for $1 billion. The tech giant reported a significant growth in revenue.
Tim Cook said that the company is focusing on innovative solutions to drive the future of technology.
"""


doc = nlp(text)


sentences = list(doc.sents)

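def extract_tfidf_features(sentences):
    """Return the mean TF-IDF weight of each sentence (one value per sentence)."""
    sentences_str = [str(sent) for sent in sentences]

    # Each sentence is treated as its own "document" when fitting TF-IDF.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences_str)

    # Average over the whole vocabulary, including zero entries.
    tfidf_mean = tfidf_matrix.mean(axis=1)
    return np.squeeze(np.asarray(tfidf_mean))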


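def extract_sentence_position(sentences):
    """Relative position of each sentence: 0.0 for the first, approaching 1.0 for the last."""
    total_sentences = len(sentences)
    return np.array([i / total_sentences for i in range(total_sentences)])


def extract_named_entities(sentences):
    """Number of named entities spaCy found in each sentence."""
    ner_counts = []
    for sent in sentences:
        entities = sent.ents
        ner_counts.append(len(entities))
    return np.array(ner_counts)


def extract_sentence_length(sentences):
    """Sentence length in spaCy tokens (punctuation and whitespace tokens included)."""
    return np.array([len(sent) for sent in sentences])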


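def extract_pos_tags(sentences):
    """Count nouns, verbs, adjectives, and adverbs in each sentence."""
    pos_counts = []
    for sent in sentences:
        # Re-run the pipeline on the sentence text alone and tally coarse POS tags.
        doc_sent = nlp(str(sent))
        pos_count = defaultdict(int)
        for token in doc_sent:
            pos_count[token.pos_] += 1

        # Keep only the four tag classes used as features.
        pos_counts.append({
            'NOUN': pos_count.get('NOUN', 0),
            'VERB': pos_count.get('VERB', 0),
            'ADJ': pos_count.get('ADJ', 0),
            'ADV': pos_count.get('ADV', 0)
        })
    return pos_counts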

tfidf_features = extract_tfidf_features(sentences)
position_features = extract_sentence_position(sentences)
ner_features = extract_named_entities(sentences)
length_features = extract_sentence_length(sentences)
pos_features = extract_pos_tags(sentences)


for i, sent in enumerate(sentences):
    print(f"Sentence {i + 1}: {sent}")
    print(f"  TF-IDF Mean: {tfidf_features[i]}")
    print(f"  Sentence Position: {position_features[i]}")
    print(f"  Named Entities: {ner_features[i]}")
    print(f"  Sentence Length: {length_features[i]}")
    print(f"  POS Counts: {pos_features[i]}")
    print("-" * 40)





Sentence 1: 
Apple is looking at buying U.K. startup for $1 billion.
  TF-IDF Mean: 0.09396825119256713
  Sentence Position: 0.0
  Named Entities: 3
  Sentence Length: 13
  POS Counts: {'NOUN': 1, 'VERB': 2, 'ADJ': 0, 'ADV': 0}
----------------------------------------
Sentence 2: The tech giant reported a significant growth in revenue. 

  TF-IDF Mean: 0.0939682511925671
  Sentence Position: 0.3333333333333333
  Named Entities: 0
  Sentence Length: 11
  POS Counts: {'NOUN': 4, 'VERB': 1, 'ADJ': 1, 'ADV': 0}
----------------------------------------
Sentence 3: Tim Cook said that the company is focusing on innovative solutions to drive the future of technology.

  TF-IDF Mean: 0.13204887941935345
  Sentence Position: 0.6666666666666666
  Named Entities: 1
  Sentence Length: 19
  POS Counts: {'NOUN': 4, 'VERB': 3, 'ADJ': 1, 'ADV': 0}
----------------------------------------
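
The bonus sentiment feature from Question 1 is not extracted by the code above. A minimal sketch of how it could be added follows, assuming a tiny hand-written polarity lexicon. The `POSITIVE` and `NEGATIVE` word sets, the `sentiment_polarity` helper, and the third sample sentence are hypothetical; a real system would more likely use a trained model or an established lexicon.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"growth", "innovative", "significant", "good", "great"}
NEGATIVE = {"loss", "decline", "bad", "poor", "risk"}


def sentiment_polarity(sentence):
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    matched = pos + neg
    # Sentences with no lexicon words are treated as neutral.
    return 0.0 if matched == 0 else (pos - neg) / matched


sample_sentences = [
    "Apple is looking at buying U.K. startup for $1 billion.",
    "The tech giant reported a significant growth in revenue.",
    # Invented sentence, added only to show a negative score.
    "Analysts warned of a decline and a poor outlook despite growth.",
]
polarity_features = [sentiment_polarity(s) for s in sample_sentences]
print(polarity_features)
```

The resulting list has one value per sentence, so it could be added as another column next to the other per-sentence features.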


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import mutual_info_classif
import numpy as np
import pandas as pd
from collections import defaultdict


nlp = spacy.load('en_core_web_sm')


text = """
Apple is looking at buying U.K. startup for $1 billion. The tech giant reported a significant growth in revenue.
Tim Cook said that the company is focusing on innovative solutions to drive the future of technology.
"""


doc = nlp(text)


sentences = list(doc.sents)


print(f"Number of sentences: {len(sentences)}")


# The feature extractors below are the same as in Question 2.
def extract_tfidf_features(sentences):

    sentences_str = [str(sent) for sent in sentences]


    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences_str)


    tfidf_mean = tfidf_matrix.mean(axis=1)
    return np.squeeze(np.asarray(tfidf_mean))


def extract_sentence_position(sentences):
    total_sentences = len(sentences)
    return np.array([i / total_sentences for i in range(total_sentences)])

def extract_named_entities(sentences):
    ner_counts = []
    for sent in sentences:
        entities = sent.ents
        ner_counts.append(len(entities))
    return np.array(ner_counts)


def extract_sentence_length(sentences):
    return np.array([len(sent) for sent in sentences])


def extract_pos_tags(sentences):
    pos_counts = []
    for sent in sentences:
        doc_sent = nlp(str(sent))
        pos_count = defaultdict(int)
        for token in doc_sent:
            pos_count[token.pos_] += 1


        pos_counts.append({
            'NOUN': pos_count.get('NOUN', 0),
            'VERB': pos_count.get('VERB', 0),
            'ADJ': pos_count.get('ADJ', 0),
            'ADV': pos_count.get('ADV', 0)
        })
    return pos_counts


tfidf_features = extract_tfidf_features(sentences)
position_features = extract_sentence_position(sentences)
ner_features = extract_named_entities(sentences)
length_features = extract_sentence_length(sentences)
pos_features = extract_pos_tags(sentences)


print(f"TF-IDF Features Length: {len(tfidf_features)}")
print(f"Position Features Length: {len(position_features)}")
print(f"NER Features Length: {len(ner_features)}")
print(f"Length Features Length: {len(length_features)}")
print(f"POS Features Length: {len(pos_features)}")


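# Hand-assigned binary label per sentence (the target for feature selection).
# With only three samples, mutual information estimates are unreliable.
sentence_labels = np.array([1, 0, 1])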

feature_data = {
    'TF-IDF Mean': tfidf_features,
    'Sentence Position': position_features,
    'Named Entities': ner_features,
    'Sentence Length': length_features,
    'NOUN Count': [pos['NOUN'] for pos in pos_features],
    'VERB Count': [pos['VERB'] for pos in pos_features],
    'ADJ Count': [pos['ADJ'] for pos in pos_features],
    'ADV Count': [pos['ADV'] for pos in pos_features]
}


df = pd.DataFrame(feature_data)


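# Rescale every feature to [0, 1] so they are on a comparable scale.
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(df)

# Estimate the mutual information between each feature and the label.
mi_scores = mutual_info_classif(scaled_features, sentence_labels)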


mi_ranking = pd.DataFrame({
    'Feature': df.columns,
    'Mutual Information Score': mi_scores
})


ranked_features = mi_ranking.sort_values(by='Mutual Information Score', ascending=False)


print("Ranked Features based on Mutual Information scores:")
print(ranked_features)


Number of sentences: 3
TF-IDF Features Length: 3
Position Features Length: 3
NER Features Length: 3
Length Features Length: 3
POS Features Length: 3
Ranked Features based on Mutual Information scores:
             Feature  Mutual Information Score
0        TF-IDF Mean                         0
1  Sentence Position                         0
2     Named Entities                         0
3    Sentence Length                         0
4         NOUN Count                         0
5         VERB Count                         0
6          ADJ Count                         0
7          ADV Count                         0
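
The scores above are all zero, which most likely reflects the three-sentence sample rather than the features themselves. To illustrate how a mutual information ranking behaves with a little more data, the sketch below computes a plug-in estimate from counts for discretized features. The feature names, values, and labels are invented, and this count-based estimator is not the nearest-neighbor one that scikit-learn uses for continuous features.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y)))
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


# Invented toy data: 8 sentences, label 1 = "keep in summary".
labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_features = {
    "has_named_entity": [1, 1, 1, 1, 0, 0, 0, 0],  # matches the label exactly
    "is_first_or_last": [1, 1, 1, 0, 1, 0, 0, 0],  # partly informative
    "is_long_sentence": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}

# Rank features by mutual information with the label, highest first.
ranking = sorted(
    ((name, mutual_information(values, labels)) for name, values in toy_features.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

A feature that determines the label reaches the label entropy (log 2 here), a feature independent of the label scores zero, and partially informative features fall in between.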


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarity scores in descending order.

In [None]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


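def get_bert_embedding(sentence):
    """Return a (1, hidden_size) array representing the sentence."""
    inputs = tokenizer(sentence, return_tensors='pt', max_length=512, truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Use the hidden state of the first ([CLS]) token as the sentence embedding.
    sentence_embedding = outputs.last_hidden_state[:, 0, :].numpy()
    return sentence_embedding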


text = """
Apple is looking at buying U.K. startup for $1 billion. The tech giant reported a significant growth in revenue.
Tim Cook said that the company is focusing on innovative solutions to drive the future of technology.
"""


# Split into non-empty lines so that embeddings and texts share the same indices.
sentences = [line for line in text.split('\n') if line.strip() != ""]


query = "Apple is interested in acquiring a new company to expand its business."


query_embedding = get_bert_embedding(query)


sentence_embeddings = np.vstack([get_bert_embedding(sent) for sent in sentences])


cosine_similarities = cosine_similarity(query_embedding, sentence_embeddings).flatten()

# Indices of the texts, from most to least similar.
ranked_indices = np.argsort(cosine_similarities)[::-1]


print("Ranked Sentences based on Similarity to the Query:")
for idx in ranked_indices:
    print(f"Sentence: {sentences[idx]} \nSimilarity Score: {cosine_similarities[idx]}")
    print("-" * 40)


Ranked Sentences based on Similarity to the Query:
Sentence: Apple is looking at buying U.K. startup for $1 billion. The tech giant reported a significant growth in revenue. 
Similarity Score: 0.8995685577392578
----------------------------------------
Sentence: Tim Cook said that the company is focusing on innovative solutions to drive the future of technology. 
Similarity Score: 0.888329267501831
----------------------------------------
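
For reference, the cosine similarity and ranking steps can be sketched without any library. The three-dimensional vectors below stand in for BERT embeddings and are made up, as are the `cosine` and `rank_by_similarity` helpers. The sketch keeps each score paired with its own text while sorting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, texts, text_vecs):
    """Return (text, score) pairs sorted from most to least similar."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in zip(texts, text_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up 3-d vectors standing in for real sentence embeddings.
toy_texts = ["doc A", "doc B", "doc C"]
toy_vecs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 1.0, 0.0]

toy_ranking = rank_by_similarity(toy_query, toy_texts, toy_vecs)
for text, score in toy_ranking:
    print(f"{text}: {score:.4f}")
```

Sorting (text, score) pairs rather than a separate index array avoids any mismatch between a score and the text it belongs to.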


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
Questions 3 and 4 were tougher than the other two. Still, it was a good exercise for learning text mining concepts and a very good opportunity to refer to the paper by Deng, X., Li, Y., Weng, J., & Zhang, J. (2019).




