
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, grant sharing rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
Here is an example of an interesting text classification task and some potential features for building a machine learning model:

Task: Classify news articles as political or non-political.

Potential features:

N-grams: Frequency of politically charged words and phrases such as "election", "candidate", and "policy". High counts indicate political content.

Named entities: Presence of names of political figures, parties, and governmental organizations, which also indicates political content.

Sentiment analysis: Political articles often express strong positive or negative sentiment, so a polarity score could help with classification.

Article source: The source of the article (website domain, publisher) can indicate whether it is likely to be political.

Word embeddings: Pre-trained word vectors can capture how semantically close an article's words are to political vocabulary.

Article section: The section of the newspaper in which the article appears can indicate whether it is political.

The goal is to extract features that capture political rhetoric, actors, issues and sentiment. The variety of lexical, semantic, metadata and source-based features can help the machine learning model determine if an article discusses politics.
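
The two metadata features (article source and article section) could be derived along the following lines. This is only a minimal sketch: the function name `metadata_features`, the `POLITICAL_SECTIONS` set, and the example URL are hypothetical and chosen for illustration, not taken from any dataset.

```python
from urllib.parse import urlparse

# Hypothetical set of politics-related section names (an assumption for this sketch)
POLITICAL_SECTIONS = {"politics", "elections", "government"}

def metadata_features(url, section):
    """Return simple source and section features for one article."""
    domain = urlparse(url).netloc.lower()
    # Strip a leading "www." so the same publisher maps to one value
    if domain.startswith("www."):
        domain = domain[4:]
    section = section.lower()
    return {
        "source_domain": domain,
        "section": section,
        "is_political_section": section in POLITICAL_SECTIONS,
    }

# Illustrative URL and section name, not real data
example = metadata_features("https://www.example.com/politics/budget-vote", "Politics")
print(example)
```

In a real model the domain and section would typically be one-hot encoded before being combined with the text-based features.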

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [1]:
# Your code here (Please add comments in the code):
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob

# Sample text data
text1 = "The prime minister debated the opposition leader on the new economic policy."
text2 = "The new restaurant on Main Street is having its grand opening tomorrow."

nlp = spacy.load("en_core_web_sm")

def get_features(text):

  # Extract named entities
  doc = nlp(text)
  entities = [ent.label_ for ent in doc.ents]

  # N-gram features
  cv = CountVectorizer(ngram_range=(1,2))
  cv_features = cv.fit_transform([text])

  # Sentiment analysis
  blob = TextBlob(text)
  sentiment = blob.sentiment.polarity

  # Create dictionary of features
  features = {
    "entities": entities,
    "ngrams": list(cv.get_feature_names_out()),  # get_feature_names() was removed in scikit-learn 1.2
    "sentiment": sentiment
  }

  return features

text1_features = get_features(text1)
text2_features = get_features(text2)

print(text1_features)
print(text2_features)

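
To make the n-gram feature more concrete, here is a minimal sketch of unigram and bigram extraction in plain Python. The helper name `ngrams` is hypothetical; the regular expression mirrors the default token pattern of `CountVectorizer` (lowercased tokens of two or more word characters), but this is an illustration rather than a reimplementation of that class.

```python
import re

def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, in order of n."""
    # Lowercase and keep tokens with at least two word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

sample_grams = ngrams("The new economic policy")
print(sample_grams)
```

Each distinct n-gram becomes one column of the count matrix, which is why the vectorizer should be fitted once on the whole corpus rather than separately per text when the counts are used for classification.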

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
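# Your code here (Please add comments in the code):
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from textblob import TextBlob

# Sample texts: text1 and text2 come from Question 2; text3 and text4 are extra illustrative examples
text3 = "The senate passed the election reform bill after a long debate."
text4 = "The bakery on the corner sells fresh bread every morning."
texts = [text1, text2, text3, text4]

# Target labels assigned by hand: 1 = political, 0 = non-political
y = np.array([1, 0, 1, 0])

# N-gram counts, fitted once on the whole corpus so every text shares the same columns
cv = CountVectorizer(ngram_range=(1, 2))
X_ngrams = cv.fit_transform(texts).toarray()

# Number of named entities per text (nlp is the spaCy pipeline loaded in Question 2)
entity_counts = np.array([[len(nlp(text).ents)] for text in texts])

# Sentiment polarity rescaled from [-1, 1] to [0, 1], because chi2 requires non-negative features
sentiment = np.array([[(TextBlob(text).sentiment.polarity + 1) / 2] for text in texts])

# Combine all features into one numeric matrix (rows = texts, columns = features)
X = np.hstack([X_ngrams, entity_counts, sentiment])

# Compute chi-squared stats between features and target
chi2_scores, p_values = chi2(X, y)

# Get feature names in the same column order as X
feature_names = list(cv.get_feature_names_out()) + ['entities', 'sentiment']

# Create dataframe with feature names and chi-squared scores
feature_scores = pd.DataFrame({'feature': feature_names, 'chi2_score': chi2_scores})

# Sort features by chi-squared score in descending order
ranked_features = feature_scores.sort_values('chi2_score', ascending=False)

print(ranked_features)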
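
For intuition about what the chi-squared score measures, the sketch below computes it by hand for a single non-negative feature: the observed feature mass in each class is compared with the mass expected if the feature were independent of the class. The function name `chi2_score` and the toy counts are hypothetical, and the sketch assumes the feature occurs at least once in the corpus (otherwise the expected value is zero and the score is undefined).

```python
def chi2_score(feature_values, labels):
    """Chi-squared score of one non-negative feature against class labels."""
    n = len(labels)
    total = sum(feature_values)  # assumed > 0
    score = 0.0
    for c in sorted(set(labels)):
        # Feature mass actually observed in class c
        observed = sum(v for v, y in zip(feature_values, labels) if y == c)
        # Mass expected in class c if feature and class were independent
        expected = (sum(1 for y in labels if y == c) / n) * total
        score += (observed - expected) ** 2 / expected
    return score

# Toy data: four documents, labels 1 = political, 0 = non-political
toy_labels = [1, 0, 1, 0]
toy_counts = {
    "election": [2, 0, 1, 0],  # appears only in political documents
    "the": [1, 1, 1, 1],       # appears equally in both classes
}

toy_ranking = sorted(toy_counts, key=lambda name: chi2_score(toy_counts[name], toy_labels), reverse=True)
print(toy_ranking)
```

A term concentrated in one class receives a high score, while a term spread evenly across classes scores zero, which is why stop words fall to the bottom of the ranking.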





## Question 4 (10 points):
Write Python code to rank the texts based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the texts by similarity in descending order.

In [None]:
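# Your code here (Please add comments in the code):
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 (registers the ops needed by the BERT preprocessing model)

# Sample texts: text1 and text2 come from Question 2; text3 and text4 are extra
# illustrative examples, defined here so this cell runs on its own
text3 = "The senate passed the election reform bill after a long debate."
text4 = "The bakery on the corner sells fresh bread every morning."
texts = [text1, text2, text3, text4]

# Load the BERT preprocessing model (tokenizer) and the pre-trained encoder from TensorFlow Hub
preprocessor = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=False)

# Define input query
query = "The prime minister gave a speech today"

def embed(sentences):
  # Tokenize the sentences, pass them through BERT, and return one pooled vector per sentence
  return encoder(preprocessor(tf.constant(sentences)))["pooled_output"]

# Embeddings for the query (shape 1 x 768) and the sample texts (shape 4 x 768)
query_embedding = embed([query])
texts_embeddings = embed(texts)

# Cosine similarity = dot product of L2-normalized vectors
# (tf.keras.losses.cosine_similarity returns the negative similarity, so it is not used here)
query_norm = tf.nn.l2_normalize(query_embedding, axis=1)
texts_norm = tf.nn.l2_normalize(texts_embeddings, axis=1)
cos_sim = tf.squeeze(tf.matmul(texts_norm, query_norm, transpose_b=True), axis=1)

# Sort text indices by cosine similarity in descending order
ranked_indices = tf.argsort(cos_sim, direction='DESCENDING').numpy()

print("Texts ranked by similarity:")
# Print ranked texts with their similarity scores
for rank, i in enumerate(ranked_indices, start=1):
  print(f"{rank}. {texts[i]} (cosine similarity: {cos_sim[i].numpy():.4f})")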

The key steps are:

*   Tokenize the query and texts using the BERT tokenizer.
*   Get embeddings from the BERT model.
*   Compute the cosine similarity between the query embedding and each text embedding.
*   Rank the texts by sorting on the cosine similarity scores in descending order.
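
The last two steps can be illustrated without a model. The sketch below ranks three toy vectors against a query vector by cosine similarity; the three-dimensional vectors and the names `doc_a`, `doc_b`, and `doc_c` are made up for illustration and stand in for the 768-dimensional BERT embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (hypothetical values standing in for BERT vectors)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# Score every document, then sort names by score from highest to lowest
scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
doc_ranking = sorted(scores, key=scores.get, reverse=True)
print(doc_ranking)
```

Because cosine similarity depends only on direction, a long document and a short one with similar content can still score close to each other.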





# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer to the above questions in as much detail as possible):

'''
Please write your answer here:
Extracting features from text data is key in NLP. Going through the process of identifying relevant features such as entities and sentiment, and implementing them in code, has helped reinforce my understanding.
Using pre-trained models like BERT to generate semantically meaningful embeddings for similarity analysis is very powerful. This exercise provided great practice in leveraging such models.
Implementing the end-to-end pipeline from data preprocessing to feature extraction to similarity ranking has improved my applied skills in NLP techniques.
Thinking of relevant features for a particular text classification task requires some domain experience. For an unfamiliar dataset, more exploration may be needed to identify useful features.
Overall, this was a great practical exercise to improve my understanding of core NLP concepts and workflows. The step-by-step implementation has prepared me for tackling similar problems in real applications.




'''