## The third In-class-exercise (2/28/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [1]:
# Your answer here (no code for this question; write down your answer in as much detail as possible for the question above):

'''
Please write your answer here:
An interesting text classification task could be classifying customer reviews of a product into positive, negative or neutral sentiments.
This task is useful for businesses to understand customer feedback and improve their product or service.

To build a machine learning model for this task, the following features could be helpful:

Bag of Words (BoW) - A BoW approach can be used to represent each review as a vector of word frequencies.
This is a commonly used feature in text classification tasks and it can help capture important keywords and their frequency in a review.

N-grams - N-grams refer to contiguous sequences of n words in a text.
They can be useful to capture the context of words in a review.
For example, bigrams (n=2) can capture phrases like "customer service" which can be a strong indicator of sentiment.

Part-of-speech (POS) tags - POS tags can help identify the role of each word in a sentence such as noun, verb, adjective etc.
This information can be used to capture grammatical features of a review and their impact on sentiment.

Sentiment lexicons - Sentiment lexicons are lists of words with their associated sentiment polarity (positive or negative).
They can be useful for identifying the sentiment of a review based on the presence of certain sentiment words.

Named entities - Named entities refer to specific names of people, organizations, and places in a text.
They can be useful for identifying the subject of a review and understanding how the sentiment relates to specific entities.

By using a combination of these features, the machine learning model can learn to distinguish between positive, negative, and neutral sentiments in customer reviews of a product.




'''

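To make the first, second, and fourth feature types in the answer above concrete, here is a minimal standard-library sketch. The `POSITIVE` and `NEGATIVE` word sets and the sample review are hypothetical and only for illustration; a real model would use an established lexicon and a proper tokenizer.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon, for illustration only (not a real sentiment resource)
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"terrible", "slow", "cold"}

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def extract_features(text):
    """Return bag-of-words counts, bigram counts, and lexicon hit counts."""
    tokens = tokenize(text)
    bow = Counter(tokens)                       # word -> frequency
    bigrams = Counter(zip(tokens, tokens[1:]))  # (word, next word) -> frequency
    lexicon = {
        "pos_hits": sum(tok in POSITIVE for tok in tokens),
        "neg_hits": sum(tok in NEGATIVE for tok in tokens),
    }
    return bow, bigrams, lexicon

# Hypothetical sample review
review = "The customer service was terrible, and the food was cold."
bow, bigrams, lexicon = extract_features(review)

print(bow["was"])                        # frequency of "was"
print(bigrams[("customer", "service")])  # frequency of the bigram "customer service"
print(lexicon)                           # lexicon hit counts
```

The bigram counter is what lets a phrase such as "customer service" become a single feature, while the lexicon counts give a coarse polarity signal that does not depend on the training vocabulary.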

Question 2 (10 points): Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [2]:
# Your code here (please add comments in the code):
import warnings
warnings.filterwarnings('ignore')

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample text data
texts = ['I love this movie!','This restaurant has terrible service.','The weather is beautiful today.','I feel neutral about this topic.','The book was not good.']

# Preprocess the text data
stop_words = set(stopwords.words('english'))
lemmatizer = nltk.WordNetLemmatizer()

def preprocess_text(text):
    tokens = nltk.word_tokenize(text.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token not in stop_words]
    return ' '.join(tokens)

preprocessed_texts = [preprocess_text(text) for text in texts]

# Bag-of-words features
bow_vectorizer = CountVectorizer()
bow_features = bow_vectorizer.fit_transform(preprocessed_texts)

# N-gram features
ngram_vectorizer = CountVectorizer(ngram_range=(2,2))
ngram_features = ngram_vectorizer.fit_transform(preprocessed_texts)

# Part-of-speech (POS) features
pos_tagged_texts = [nltk.pos_tag(nltk.word_tokenize(text)) for text in texts]

pos_features = []
for tagged_text in pos_tagged_texts:
    feature_dict = {}
    for token, tag in tagged_text:
        feature_dict[tag] = feature_dict.get(tag, 0) + 1
    pos_features.append(feature_dict)

# Sentiment lexicon features
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
sentiment_lexicon_features = [sia.polarity_scores(text) for text in texts]

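# Named-entity features: ne_chunk labels entity subtrees (e.g. PERSON, GPE).
# The root of the tree is labeled 'S', so it is skipped here.
# The variable keeps the name syntax_features used in the print section.
syntax_features = []
for text in texts:
    parse_tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    feature_dict = {}
    for subtree in parse_tree.subtrees(filter=lambda t: t.label() != 'S'):
        subtree_words = [word for word, tag in subtree.leaves()]
        feature_dict[' '.join(subtree_words)] = subtree.label()
    syntax_features.append(feature_dict)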


# Print the extracted features
print('Bag-of-words features:')
print(bow_vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(bow_features.toarray())
print('')

print('N-gram features:')
print(ngram_vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(ngram_features.toarray())
print('')

print('Part-of-speech (POS) features:')
print(pos_features)
print('')

print('Sentiment lexicon features:')
print(sentiment_lexicon_features)
print('')

print('Syntax features:')
print(syntax_features)
print('')



Bag-of-words features:
['beautiful', 'book', 'feel', 'good', 'love', 'movie', 'neutral', 'restaurant', 'service', 'terrible', 'today', 'topic', 'weather']
[[0 0 0 0 1 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 1 1 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 1 0 1]
 [0 0 1 0 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 0 0 0 0 0 0 0 0]]

N-gram features:
['beautiful today', 'book good', 'feel neutral', 'love movie', 'neutral topic', 'restaurant terrible', 'terrible service', 'weather beautiful']
[[0 0 0 1 0 0 0 0]
 [0 0 0 0 0 1 1 0]
 [1 0 0 0 0 0 0 1]
 [0 0 1 0 1 0 0 0]
 [0 1 0 0 0 0 0 0]]

Part-of-speech (POS) features:
[{'PRP': 1, 'VBP': 1, 'DT': 1, 'NN': 1, '.': 1}, {'DT': 1, 'NN': 2, 'VBZ': 1, 'JJ': 1, '.': 1}, {'DT': 1, 'NN': 2, 'VBZ': 1, 'JJ': 1, '.': 1}, {'PRP': 1, 'VBP': 1, 'JJ': 1, 'IN': 1, 'DT': 1, 'NN': 1, '.': 1}, {'DT': 1, 'NN': 1, 'VBD': 1, 'RB': 1, 'JJ': 1, '.': 1}]

Sentiment lexicon features:
[{'neg': 0.0, 'neu': 0.308, 'pos': 0.692, 'compound': 0.6696}, {'neg': 0.437, 'neu': 0.563, 'pos': 0.0, 'compound':
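The POS dictionaries printed above have different keys per document, so they cannot be stacked with the count matrices directly. A minimal sketch of aligning them into fixed columns, using the dictionaries from the output above as input (scikit-learn's `DictVectorizer` offers a similar conversion; this version uses only the standard library):

```python
# POS tag counts copied from the output above, one dict per document
pos_features = [
    {'PRP': 1, 'VBP': 1, 'DT': 1, 'NN': 1, '.': 1},
    {'DT': 1, 'NN': 2, 'VBZ': 1, 'JJ': 1, '.': 1},
    {'DT': 1, 'NN': 2, 'VBZ': 1, 'JJ': 1, '.': 1},
    {'PRP': 1, 'VBP': 1, 'JJ': 1, 'IN': 1, 'DT': 1, 'NN': 1, '.': 1},
    {'DT': 1, 'NN': 1, 'VBD': 1, 'RB': 1, 'JJ': 1, '.': 1},
]

# Use the sorted union of all tags as a stable column order
columns = sorted({tag for doc in pos_features for tag in doc})

# One row per document; tags missing from a document count as 0
pos_matrix = [[doc.get(tag, 0) for tag in columns] for doc in pos_features]

print(columns)
for row in pos_matrix:
    print(row)
```

Each row now has the same length, so the POS counts can be concatenated with the other feature blocks before feature selection.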

Question 3 (10 points): Use any of the feature selection methods mentioned in the paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)." Select the most important features you extracted above and rank them by importance in descending order.

In [3]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import chi2
from scipy.sparse import hstack
import textstat
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')

# Sample data
mytext = ["The service at this restaurant was excellent. The food was delicious and the atmosphere was perfect.",
          "I had a terrible experience at this restaurant. The service was slow and the food was cold.",
          "I highly recommend this book. It's well-written and the characters are engaging.",
          "This book was just okay. It didn't really capture my attention, but it wasn't terrible either.",
          "I regret buying this product. It was not what I was expecting and it didn't work as advertised."]

# Preprocessing
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove stopwords
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text

mytext_preprocessed = [preprocess_text(text) for text in mytext]

# Target class labels
labels = np.array([1, 0, 1, 2, 0])  # 1 for positive, 0 for negative, 2 for neutral

# Bag of Words feature extraction
vectorizer_bow = CountVectorizer()
bow_features = vectorizer_bow.fit_transform(mytext_preprocessed)

# N-Grams feature extraction
vectorizer_ngrams = CountVectorizer(ngram_range=(2, 2))
ngram_features = vectorizer_ngrams.fit_transform(mytext_preprocessed)

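# Token-count features with a custom token pattern (labeled as part-of-speech features, but no POS
# tagging is performed here; it re-counts unigrams, which is why some terms appear twice in the ranking)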
pos_vectorizer = CountVectorizer(token_pattern=r'\b\w\w+\b|!|\?|\"|\'', ngram_range=(1,1), analyzer='word', 
                                 stop_words='english')
pos_features = pos_vectorizer.fit_transform(mytext_preprocessed)

# Sentiment Lexicon feature extraction
analyzer = SentimentIntensityAnalyzer()
lexicon_features = []
for doc in mytext_preprocessed:
    vs = analyzer.polarity_scores(doc)
    # chi2 requires non-negative inputs, so take absolute values (note: abs() discards the sign of the compound score)
    lexicon_features.append([abs(vs['neg']), abs(vs['neu']), abs(vs['pos']), abs(vs['compound'])])
lexicon_features = np.array(lexicon_features)

# Readability feature extraction
readability_features = []
for doc in mytext_preprocessed:
    flesch_score = textstat.flesch_reading_ease(doc)
    smog_score = textstat.smog_index(doc)
    # apply non-negative transformation
    readability_features.append([abs(flesch_score), abs(smog_score)])
readability_features = np.array(readability_features)

# Concatenate all features horizontally
features = hstack((bow_features, ngram_features, pos_features, lexicon_features, readability_features))

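# Get feature names (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array)
feature_names = (list(vectorizer_bow.get_feature_names_out())
                 + list(vectorizer_ngrams.get_feature_names_out())
                 + list(pos_vectorizer.get_feature_names_out())
                 + ['neg', 'neu', 'pos', 'compound']
                 + ['flesch_score', 'smog_score'])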

# Chi-Square feature selection
chi2_scores, _ = chi2(features, labels)
feature_scores = list(zip(feature_names, chi2_scores))
feature_scores.sort(key=lambda x: x[1], reverse=True)
print("Chi-Square scores for all features:")
for feature, score in feature_scores:
    print(feature, score)

Chi-Square scores for all features:
flesch_score 19.385948616600796
attention 4.000000000000001
capture 4.000000000000001
either 4.000000000000001
okay 4.000000000000001
really 4.000000000000001
wasnt 4.000000000000001
attention wasnt 4.000000000000001
book okay 4.000000000000001
capture attention 4.000000000000001
didnt really 4.000000000000001
okay didnt 4.000000000000001
really capture 4.000000000000001
terrible either 4.000000000000001
wasnt terrible 4.000000000000001
attention 4.000000000000001
capture 4.000000000000001
okay 4.000000000000001
really 4.000000000000001
wasnt 4.000000000000001
book 1.75
didnt 1.75
terrible 1.75
book 1.75
didnt 1.75
terrible 1.75
advertised 1.5
atmosphere 1.5
buying 1.5
characters 1.5
cold 1.5
delicious 1.5
engaging 1.5
excellent 1.5
expecting 1.5
experience 1.5
highly 1.5
perfect 1.5
product 1.5
recommend 1.5
regret 1.5
slow 1.5
wellwritten 1.5
work 1.5
atmosphere perfect 1.5
book wellwritten 1.5
buying product 1.5
characters engaging 1.5
delicious a
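To see where scores such as 4.0, 1.75, and 1.5 in the ranking above come from, here is a hand-rolled sketch of the chi-square statistic for a single feature column. It assumes the usual formulation (per-class observed feature totals compared with totals expected from the class proportions); the per-document counts below are read off the five sample texts and their labels.

```python
def chi2_score(feature_counts, labels):
    """Chi-square statistic of one non-negative feature column against class labels.

    observed = total feature count inside each class
    expected = overall feature count * fraction of documents in that class
    """
    total = sum(feature_counts)
    n_docs = len(labels)
    score = 0.0
    for cls in sorted(set(labels)):
        observed = sum(x for x, y in zip(feature_counts, labels) if y == cls)
        expected = total * labels.count(cls) / n_docs
        score += (observed - expected) ** 2 / expected
    return score

labels = [1, 0, 1, 2, 0]  # 1 = positive, 0 = negative, 2 = neutral

# Per-document counts of three unigrams in the five sample texts
attention = [0, 0, 0, 1, 0]  # only in the single neutral document
book      = [0, 0, 1, 1, 0]  # one positive and one neutral document
excellent = [1, 0, 0, 0, 0]  # only in one positive document

for name, column in [("attention", attention), ("book", book), ("excellent", excellent)]:
    print(name, chi2_score(column, labels))
```

Terms that occur only in the lone neutral document score highest because that class is the rarest, so a single occurrence deviates most from its expected count. With only five documents, these scores say more about the class imbalance than about real predictive value.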



Question 4 (10 points): Write Python code to rank the texts based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [5]:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the BERT model
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Define the text data and the query
texts = ["This product is amazing! It exceeded my expectations. I would definitely recommend it to anyone.", 
            "I was really disappointed with this product. It didn't work as advertised.", 
            "I love this product so much! It has made my life so much easier.",    
            "This product is just okay. It wasn't great, but it wasn't terrible either.",  
            "I would never buy this product again. It was a complete waste of money."]
         
query = "restaurant service was great, but the food was cold"

# Compute the embeddings for the query and texts
query_embedding = model.encode([query])[0]
text_embeddings = model.encode(texts)

# Compute the cosine similarities between the query and texts
similarities = cosine_similarity([query_embedding], text_embeddings)[0]

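# Rank the texts based on their similarity to the query (highest first)
ranked_texts = sorted(zip(texts, similarities), key=lambda pair: pair[1], reverse=True)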
from tabulate import tabulate

table = []
for i, (text, similarity) in enumerate(ranked_texts):
    table.append([i+1, text, round(similarity, 4)])
    
print(tabulate(table, headers=['Rank', 'Text', 'Similarity'], tablefmt='orgtbl'))


|   Rank | Text                                                                                             |   Similarity |
|--------+--------------------------------------------------------------------------------------------------+--------------|
|      1 | This product is just okay. It wasn't great, but it wasn't terrible either.                       |       0.5735 |
|      2 | I was really disappointed with this product. It didn't work as advertised.                       |       0.4877 |
|      3 | I love this product so much! It has made my life so much easier.                                 |       0.4121 |
|      4 | This product is amazing! It exceeded my expectations. I would definitely recommend it to anyone. |       0.3285 |
|      5 | I would never buy this product again. It was a complete waste of money.                          |       0.294  |
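For reference, the cosine similarity used in the ranking above can be sketched without any libraries. The three-dimensional vectors and the names `doc_a`, `doc_b`, `doc_c` are hypothetical stand-ins for the BERT sentence embeddings, chosen so the arithmetic is easy to follow.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" (real sentence embeddings have hundreds of dimensions)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],  # shares one dimension with the query
    "doc_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Score every document against the query and sort from most to least similar
ranked = sorted(
    ((name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for rank, (name, score) in enumerate(ranked, start=1):
    print(rank, name, round(score, 4))
```

Because cosine similarity depends only on direction, not vector length, two texts of very different lengths can still score as highly similar.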
