<a href="https://colab.research.google.com/github/kc6699c/Komal_INFO5731_Fall2024/blob/main/Cherukuri_INFO5731_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write your answer to the question above in as much detail as possible):

'''
Please write your answer here:
Fake news is widespread, so an interesting task is fake news detection: classifying whether a news item is factual or fake.
I will scrape the data from the Guardian news website.

Many features can be extracted from the text data:
1. Bag of Words
2. Parts of Speech
3. N-grams
4. TF-IDF
5. Word Embeddings


'''
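
As a rough illustration of the first and third features in the list, the sketch below counts unigrams (bag of words) and bigrams in plain Python. The sample sentence and the helper name `ngrams` are illustrative assumptions, not part of the scraped data or of any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-token tuples found in `tokens` (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample headline, used only for illustration
sample = "officials deny the report and deny the claims"
tokens = sample.lower().split()

bag_of_words = Counter(tokens)              # unigram counts
bigram_counts = Counter(ngrams(tokens, 2))  # bigram counts

print(bag_of_words.most_common(3))
print(bigram_counts.most_common(2))
```

Bigrams keep some local word order that a bag of words discards, which may help separate phrases that share the same vocabulary.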

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [7]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

url = 'https://www.theguardian.com/us'

response = requests.get(url)  # Send a request to fetch the page content
soup = BeautifulSoup(response.content, 'html.parser')  # Parse the returned HTML

tags_to_extract = ['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li']  # HTML tags whose text we want to extract

# Collect sentences in a list
sentences = []
for tag in tags_to_extract:
    elements = soup.find_all(tag)
    for element in elements:
        text = element.get_text(separator=" ", strip=True)  # Get text and remove extra spaces
        sentence_list = re.split(r'(?<=[.!?]) +', text)  # Split by sentence-ending punctuation
        # Add sentences with at least 8 words
        sentences.extend([sentence for sentence in sentence_list if len(sentence.split()) >= 8])

df = pd.DataFrame(sentences, columns=['Sentence'])

csv_filename = 'guardian_sentences.csv'
df.to_csv(csv_filename, index=False)

print(f"Sentences saved to {csv_filename}.")

Sentences saved to guardian_sentences.csv.
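
The sentence-splitting step can be checked in isolation, without any network access. This sketch applies the same regular expression and the same 8-word filter to a made-up paragraph; the function name `split_long_sentences` and the sample text are hypothetical.

```python
import re


def split_long_sentences(text, min_words=8):
    """Split after '.', '!' or '?' followed by spaces; keep sentences with at least `min_words` words."""
    parts = re.split(r'(?<=[.!?]) +', text)
    return [part for part in parts if len(part.split()) >= min_words]


# Made-up paragraph text, used only for illustration
sample = ("Short one. This second sentence has more than eight words in it. "
          "Is this long enough to pass the eight word filter? No.")

kept = split_long_sentences(sample)
for sentence in kept:
    print(sentence)
```

Note that this split is purely punctuation-based, so abbreviations such as "U.S. " would also be treated as sentence boundaries.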


In [1]:
pip install pandas scikit-learn nltk



## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, and rank the features by importance in descending order.

In [4]:
# Your code here (Please add comments in the code):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv('/content/guardian_sentences.csv')

vectorizer = TfidfVectorizer(stop_words='english') # Initialize the TF-IDF Vectorizer

tfidf_matrix = vectorizer.fit_transform(df['Sentence'])  # Fit and transform the text data

feature_names = vectorizer.get_feature_names_out()

tfidf_scores = tfidf_matrix.sum(axis=0).A1  # Sum each term's TF-IDF weight across all sentences

feature_df = pd.DataFrame({'Feature': feature_names, 'Score': tfidf_scores})

# Rank the features by their scores in descending order
ranked_features = feature_df.sort_values(by='Score', ascending=False)

print(ranked_features.head(10))  # Display the ranked features

        Feature     Score
1030      trump  6.649595
48          ago  6.558124
681         new  5.481117
449      harris  4.843865
460      helene  4.695527
913       smith  4.222738
665   nasrallah  4.058570
450      hassan  4.058570
358     fashion  3.914937
610      maggie  3.607302
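
The ranking above sums TF-IDF weights, which is an unsupervised score. A supervised filter such as the chi-square statistic needs class labels, which the scraped sentences do not have. The sketch below therefore uses four made-up documents with made-up `fake`/`real` labels. The helper `chi_square` is a hypothetical pure-Python function built on the standard 2x2 contingency-table formula; it is not scikit-learn's implementation.

```python
def chi_square(docs, labels, term, positive_label):
    """Chi-square score of `term` for `positive_label` from a 2x2 document count table."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc.lower().split()
        in_class = label == positive_label
        if has_term and in_class:
            a += 1  # in class, contains term
        elif has_term:
            b += 1  # not in class, contains term
        elif in_class:
            c += 1  # in class, lacks term
        else:
            d += 1  # not in class, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Made-up documents and labels, used only for illustration
docs = [
    "shocking miracle cure revealed",
    "miracle diet shocks doctors",
    "council approves new budget",
    "senate passes budget bill",
]
labels = ["fake", "fake", "real", "real"]

vocab = sorted({word for doc in docs for word in doc.lower().split()})
ranked = sorted(vocab, key=lambda t: chi_square(docs, labels, t, "fake"), reverse=True)

for term in ranked[:5]:
    print(term, round(chi_square(docs, labels, term, "fake"), 3))
```

Terms that appear only in one class ("miracle", "budget") receive the highest score here, while terms seen in a single document score lower.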


In [2]:
# Your code here (Please add comments in the code):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk import word_tokenize, pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

df = pd.read_csv('/content/guardian_sentences.csv')

print("Original Rows:")
print(df.head())

# 1. Bag of Words (BoW) Transformation
vectorizer = CountVectorizer(stop_words='english')
X_bow = vectorizer.fit_transform(df['Sentence'])

bow_df = pd.DataFrame(X_bow.toarray(), columns=vectorizer.get_feature_names_out()) # Convert BoW matrix to a DataFrame for readability

print("\nBag of Words (BoW):")
print(bow_df.head())

# 2. Part-of-Speech (POS) Tagging
def pos_tagging(text):
    tokens = word_tokenize(text)
    return pos_tag(tokens)

df['POS'] = df['Sentence'].apply(pos_tagging)  # Replace 'Sentence' with your actual text column name

print("\nPart of Speech (POS) Tagging:")
print(df[['Sentence', 'POS']].head())

# Print final DataFrame
print("\nFinal DataFrame after BoW and POS tagging:")
print(df.head())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Original Rows:
                                            Sentence
0        What’s really at risk in the 2024 election?
1  Adam Gabbatt guides you through the biggest qu...
2  We’ll focus not just on the odds, but the stakes.
3   Stay up to date on all of Donald Trump’s trials.
4  Guardian staff will send news and updates dire...

Bag of Words (BoW):
   10  10h  11h  12  12h  14  15  1h  2010  2024  ...  worship  worst  writer  \
0   0    0    0   0    0   0   0   0     0     1  ...        0      0       0   
1   0    0    0   0    0   0   0   0     0     0  ...        0      0       0   
2   0    0    0   0    0   0   0   0     0     0  ...        0      0       0   
3   0    0    0   0    0   0   0   0     0     0  ...        0      0       0   
4   0    0    0   0    0   0   0   0     0     0  ...        0      0       0   

   wrong  year  years  york  young  zelenskyy  zuckerberg  
0      0     0      0     0      0          0           0  
1      0     0      0     0      0     

## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [5]:
pip install transformers torch scikit-learn



In [6]:
import pandas as pd
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained BERT model and tokenizer from Hugging Face
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():  # Inference only, so gradient tracking is not needed
        outputs = model(**inputs)
    # Use the [CLS] token embedding as the sentence representation
    return outputs.last_hidden_state[:, 0, :].numpy()

df = pd.read_csv('/content/guardian_sentences.csv')

query = "I'm sweating right now"

query_embedding = get_bert_embedding(query)

# Calculate BERT embeddings for all the sentences
sentence_embeddings = [get_bert_embedding(sentence) for sentence in df['Sentence']]

similarities = [cosine_similarity(query_embedding, sentence_embedding)[0][0] for sentence_embedding in sentence_embeddings]

df['Similarity'] = similarities

df_sorted = df.sort_values(by='Similarity', ascending=False)

df_sorted.to_csv('ranked_sentences_by_similarity.csv', index=False)

print(df_sorted[['Sentence', 'Similarity']].head())


                                              Sentence  Similarity
202  It’s useful that the latest AI can ‘think’, bu...    0.936407
83   The Audio Long Read No god in the machine: the...    0.934671
245  The Audio Long Read No god in the machine: the...    0.928473
95   Books Stuart Murdoch: ‘I feel like this book w...    0.924237
94        I’m sweating right now telling you about it’    0.923774
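
The ranking step itself does not depend on BERT: once each text is a vector, the query is compared with every vector and the results are sorted. The sketch below shows that step in plain Python with made-up 3-dimensional vectors standing in for the 768-dimensional embeddings; the names `cosine`, `query_vec`, and `doc_vecs` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for sentence embeddings
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned
}

# Sort document names by similarity to the query, highest first
ranking = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)

for name in ranking:
    print(name, round(cosine(query_vec, doc_vecs[name]), 3))
```

The similarities reported above all fall between roughly 0.92 and 0.94, so the ranking rests on small differences between scores.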


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer to the questions above in as much detail as possible):

'''
Please write your answer here:

We need more time to understand the material and work through it carefully. It is hard to actually write the code in such a short time.

'''