
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [1]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
A text classification task that I found interesting is classifying legal case documents by their case type.
Features that extract and identify domain-specific keywords are very useful for classifying the documents
into different case types.
The five features I picked for this task are:
1. TF-IDF (Term Frequency-Inverse Document Frequency).
2. Legal Domain-Specific Keywords.
3. N-grams (Bigrams/Trigrams).
4. Named Entity Recognition (NER).
5. POS (Part-of-Speech) Tagging.

'''


## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [4]:
!pip install scikit-learn spacy nltk
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [5]:
import spacy
import nltk
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer

# example legal documents
legal_documents = [
    "The plaintiff claims that the defendant breached the contract by failing to deliver the goods as promised.",
    "The employee filed a lawsuit against the employer for wrongful termination and breach of employment contract.",
    "The company was sued for intellectual property theft and patent infringement.",
    "The court ruled that the custody of the children should be granted to the mother in the family law case.",
    "The defendant was charged with burglary and theft under the criminal law."
]

# Classification labels, one per document above
labels = ["Contract Dispute", "Labor Law", "Intellectual Property", "Family Law", "Criminal Law"]

nlp = spacy.load("en_core_web_sm")

# TF-IDF (Term Frequency-Inverse Document Frequency)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(legal_documents)
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())

# example Domain-Specific Keywords
keywords = {
    "Contract Dispute": ["contract", "breach", "plaintiff", "defendant"],
    "Labor Law": ["employee", "termination", "employer"],
    "Intellectual Property": ["intellectual property", "patent", "infringement"],
    "Family Law": ["custody", "children", "family"],
    "Criminal Law": ["defendant", "burglary", "theft", "criminal"]
}

def extract_keywords(doc, keywords):
    # Count, per case type, how many of its keywords occur in the document.
    # This is substring matching, so "breach" also matches "breached".
    text = doc.lower()
    return {
        case_type: sum(word in text for word in words)
        for case_type, words in keywords.items()
    }

keyword_features = [extract_keywords(doc, keywords) for doc in legal_documents]
print("\nKeyword Features:")
for doc_num, features in enumerate(keyword_features, 1):
    print(f"Document {doc_num}: {features}")

# N-Grams (Bigrams)
def get_ngrams(doc, n=2):
    tokens = nltk.word_tokenize(doc)
    return list(ngrams(tokens, n))

nltk.download('punkt')
for doc_num, doc in enumerate(legal_documents, 1):
    bigrams = get_ngrams(doc, n=2)
    print(f"\nBigrams for Document {doc_num}:")
    print(bigrams)

# Named Entity Recognition (NER)
def extract_named_entities(doc):
    ner_entities = []
    parsed_doc = nlp(doc)
    for ent in parsed_doc.ents:
        ner_entities.append((ent.text, ent.label_))
    return ner_entities

print("\nNamed Entities:")
for doc_num, doc in enumerate(legal_documents, 1):
    entities = extract_named_entities(doc)
    print(f"Document {doc_num} Named Entities: {entities}")

# Part-of-Speech (POS) Tagging
def pos_tagging(doc):
    parsed_doc = nlp(doc)
    pos_tags = [(token.text, token.pos_) for token in parsed_doc]
    return pos_tags

print("\nPart-of-Speech Tags:")
for doc_num, doc in enumerate(legal_documents, 1):
    pos_tags = pos_tagging(doc)
    print(f"Document {doc_num} POS Tags: {pos_tags}")





TF-IDF Matrix:
[[0.         0.         0.25618649 0.         0.         0.25618649
  0.         0.25618649 0.         0.         0.         0.25618649
  0.         0.20668965 0.         0.         0.         0.20668965
  0.25618649 0.         0.         0.         0.25618649 0.
  0.         0.         0.25618649 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.25618649 0.25618649 0.         0.         0.         0.
  0.         0.20668965 0.48829694 0.         0.20668965 0.
  0.         0.         0.        ]
 [0.28502303 0.19088324 0.         0.         0.28502303 0.
  0.         0.         0.         0.         0.         0.
  0.         0.22995478 0.         0.         0.         0.
  0.         0.28502303 0.28502303 0.28502303 0.         0.
  0.28502303 0.22995478 0.         0.         0.         0.
  0.         0.         0.28502303 0.         0.22995478 0.
  0.         0.         0.         0.         0.         0.
  0.28502303 0.         
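
The TF-IDF weights printed above can be reproduced by hand, which helps when checking why a term scores high or low. The sketch below is a minimal plain-Python version that assumes scikit-learn's default settings (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization). The helper names `tokenize` and `tfidf_row` are made up for this illustration and are not part of any library.

```python
import math
import re
from collections import Counter

docs = [
    "The plaintiff claims that the defendant breached the contract by failing to deliver the goods as promised.",
    "The employee filed a lawsuit against the employer for wrongful termination and breach of employment contract.",
    "The company was sued for intellectual property theft and patent infringement.",
    "The court ruled that the custody of the children should be granted to the mother in the family law case.",
    "The defendant was charged with burglary and theft under the criminal law.",
]

def tokenize(text):
    # Assumed to mirror TfidfVectorizer's defaults: lowercase, tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_row(doc_index, docs):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Raw term counts for the selected document.
    tf = Counter(tokenized[doc_index])
    # Term count times smoothed IDF.
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    # L2-normalize so the row has unit length.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf_row(0, docs)
print(round(weights["plaintiff"], 8))  # term unique to document 1
print(round(weights["contract"], 8))   # term shared with document 2
print(round(weights["the"], 8))        # occurs four times, but in every document
```

If the assumptions hold, these three values should line up with the 0.25618649, 0.20668965 and 0.48829694 entries in the first row of the matrix above. Note that "the" gets the largest weight in the row even though it appears in every document, because `TfidfVectorizer()` was created without stop-word removal.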

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Document 3 Named Entities: []
Document 4 Named Entities: []
Document 5 Named Entities: []

Part-of-Speech Tags:
Document 1 POS Tags: [('The', 'DET'), ('plaintiff', 'NOUN'), ('claims', 'VERB'), ('that', 'SCONJ'), ('the', 'DET'), ('defendant', 'NOUN'), ('breached', 'VERB'), ('the', 'DET'), ('contract', 'NOUN'), ('by', 'ADP'), ('failing', 'VERB'), ('to', 'PART'), ('deliver', 'VERB'), ('the', 'DET'), ('goods', 'NOUN'), ('as', 'SCONJ'), ('promised', 'VERB'), ('.', 'PUNCT')]
Document 2 POS Tags: [('The', 'DET'), ('employee', 'NOUN'), ('filed', 'VERB'), ('a', 'DET'), ('lawsuit', 'NOUN'), ('against', 'ADP'), ('the', 'DET'), ('employer', 'NOUN'), ('for', 'ADP'), ('wrongful', 'ADJ'), ('termination', 'NOUN'), ('and', 'CCONJ'), ('breach', 'NOUN'), ('of', 'ADP'), ('employment', 'NOUN'), ('contract', 'NOUN'), ('.', 'PUNCT')]
Document 3 POS Tags: [('The', 'DET'), ('company', 'NOUN'), ('was', 'AUX'), ('sued', 'VERB'), ('for', 'ADP'), ('intellectual', 'ADJ'), ('property', 'NOUN'), ('theft', 'NOUN'), ('
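
One detail of the keyword feature is worth a closer look: `extract_keywords` tests `word in doc.lower()`, which is substring matching. The sketch below contrasts that with whole-word matching using a regular expression. The function names and the keyword "act" are hypothetical and chosen only to show the difference.

```python
import re

def count_substring_hits(text, words):
    # Same matching rule as the keyword feature above: plain substring test.
    lowered = text.lower()
    return sum(word in lowered for word in words)

def count_whole_word_hits(text, words):
    # Whole-word variant: the keyword must be delimited by word boundaries.
    lowered = text.lower()
    return sum(
        re.search(rf"\b{re.escape(word)}\b", lowered) is not None
        for word in words
    )

doc = (
    "The plaintiff claims that the defendant breached the contract "
    "by failing to deliver the goods as promised."
)
# "act" is a made-up keyword for illustration; it hides inside "contract".
hypothetical_keywords = ["act", "breach"]

print(count_substring_hits(doc, hypothetical_keywords))   # counts "contract" and "breached"
print(count_whole_word_hits(doc, hypothetical_keywords))  # counts neither
```

Neither rule is strictly better. Substring matching can produce false hits such as "act" inside "contract", while whole-word matching misses inflected forms such as "breached". Matching on lemmas (for example spaCy's `token.lemma_`) is one possible middle ground.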

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [15]:
# The feature selection method I used from the paper is the correlation coefficient:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# example legal documents
legal_documents = [
    "The plaintiff claims that the defendant breached the contract by failing to deliver the goods as promised.",
    "The employee filed a lawsuit against the employer for wrongful termination and breach of employment contract.",
    "The company was sued for intellectual property theft and patent infringement.",
    "The court ruled that the custody of the children should be granted to the mother in the family law case.",
    "The defendant was charged with burglary and theft under the criminal law."
]


# Correlation Coefficient
# Computing the TF-IDF Matrix
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(legal_documents).toarray()

tfidf_df = pd.DataFrame(tfidf_matrix, columns=tfidf_vectorizer.get_feature_names_out())

# Computing the pairwise Pearson correlation between the TF-IDF term columns
correlation_matrix = tfidf_df.corr()

# Displaying the Correlation Coefficient Matrix
print("\nCorrelation Coefficient Matrix:")
print(correlation_matrix)

'''
Although all the features are important, their relative performance can vary with
the dataset they are applied to. The ranking I am going with is:
1. TF-IDF
2. Legal Domain-Specific Keywords
3. N-grams (Bigrams/Trigrams)
4. Named Entity Recognition (NER)
5. POS (Part-of-Speech) Tagging
'''


Correlation Coefficient Matrix:
               against       and        as        be    breach  breached  \
against       1.000000  0.294937 -0.250000 -0.250000  1.000000 -0.250000   
and           0.294937  1.000000 -0.607837 -0.607837  0.294937 -0.607837   
as           -0.250000 -0.607837  1.000000 -0.250000 -0.250000  1.000000   
be           -0.250000 -0.607837 -0.250000  1.000000 -0.250000 -0.250000   
breach        1.000000  0.294937 -0.250000 -0.250000  1.000000 -0.250000   
breached     -0.250000 -0.607837  1.000000 -0.250000 -0.250000  1.000000   
burglary     -0.250000  0.450283 -0.250000 -0.250000 -0.250000 -0.250000   
by           -0.250000 -0.607837  1.000000 -0.250000 -0.250000  1.000000   
case         -0.250000 -0.607837 -0.250000  1.000000 -0.250000 -0.250000   
charged      -0.250000  0.450283 -0.250000 -0.250000 -0.250000 -0.250000   
children     -0.250000 -0.607837 -0.250000  1.000000 -0.250000 -0.250000   
claims       -0.250000 -0.607837  1.000000 -0.250000 -0
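
The matrix above correlates term columns with each other. To rank terms by importance, a correlation-coefficient approach would usually correlate each term with a class label instead. The sketch below is one minimal way to do that in plain Python, using binary term presence rather than TF-IDF weights for simplicity. The binary label `is_contract_case` (documents 1 and 2 marked as contract-related) is a hypothetical grouping introduced only for this illustration, and the helper names are made up.

```python
import math
import re

docs = [
    "The plaintiff claims that the defendant breached the contract by failing to deliver the goods as promised.",
    "The employee filed a lawsuit against the employer for wrongful termination and breach of employment contract.",
    "The company was sued for intellectual property theft and patent infringement.",
    "The court ruled that the custody of the children should be granted to the mother in the family law case.",
    "The defendant was charged with burglary and theft under the criminal law.",
]
# Hypothetical binary label: 1 if the document mentions a contract dispute, else 0.
is_contract_case = [1, 1, 0, 0, 0]

def tokenize(text):
    # Lowercased set of tokens with 2+ word characters.
    return set(re.findall(r"\b\w\w+\b", text.lower()))

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature (e.g. "the") carries no signal
    return cov / math.sqrt(var_x * var_y)

doc_terms = [tokenize(d) for d in docs]
vocabulary = sorted(set().union(*doc_terms))

# Correlate each term's presence vector with the label.
scores = {
    term: pearson([int(term in terms) for terms in doc_terms], is_contract_case)
    for term in vocabulary
}

# Rank by absolute correlation, descending: strong negative signals also count.
ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
for term, score in ranked[:5]:
    print(f"{term:12s} {score:+.3f}")
```

With only five documents the scores are coarse and many terms tie, so this is a way to see the mechanics rather than a reliable ranking. On a real corpus the same idea applies per class, one label vector at a time.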


## Question 4 (10 points):
Write Python code to rank the texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the results by similarity in descending order.

In [14]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# example legal documents
legal_documents = [
    "The plaintiff claims that the defendant breached the contract by failing to deliver the goods as promised.",
    "The employee filed a lawsuit against the employer for wrongful termination and breach of employment contract.",
    "The company was sued for intellectual property theft and patent infringement.",
    "The court ruled that the custody of the children should be granted to the mother in the family law case.",
    "The defendant was charged with burglary and theft under the criminal law."
]

query = "breach of contract lawsuit"

# Loading BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].numpy()

query_embedding = get_bert_embedding(query)

doc_embeddings = np.array([get_bert_embedding(doc) for doc in legal_documents]).squeeze()

# Calculate cosine similarity
similarity_scores = cosine_similarity(query_embedding, doc_embeddings).flatten()

# Ranking documents based on similarity scores in descending order
ranked_indices = np.argsort(similarity_scores)[::-1]

print("Ranking of Documents based on Similarity to Query:")
for idx in ranked_indices:
    print(f"Document: {legal_documents[idx]} | Similarity Score: {similarity_scores[idx]:.4f}")



Ranking of Documents based on Similarity to Query:
Document: The employee filed a lawsuit against the employer for wrongful termination and breach of employment contract. | Similarity Score: 0.8186
Document: The company was sued for intellectual property theft and patent infringement. | Similarity Score: 0.8035
Document: The defendant was charged with burglary and theft under the criminal law. | Similarity Score: 0.7989
Document: The plaintiff claims that the defendant breached the contract by failing to deliver the goods as promised. | Similarity Score: 0.7591
Document: The court ruled that the custody of the children should be granted to the mother in the family law case. | Similarity Score: 0.7303
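
In the ranking above, the scores fall in a narrow band (0.73 to 0.82), and the document that explicitly describes a breached contract comes fourth. One possible reason is that the `[CLS]` vector of `bert-base-uncased`, used without fine-tuning, is not trained to act as a sentence-similarity embedding. A commonly tried alternative is to average all token vectors while ignoring padding (masked mean pooling). The sketch below shows that pooling step and the cosine computation on toy lists. The vectors are invented numbers, not BERT outputs, and whether mean pooling would improve this particular ranking has not been tested here.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-dimensional "token vectors"; the last position stands in for padding.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]

sentence_vector = mean_pool(toy_tokens, toy_mask)
print(sentence_vector)                      # the padding row does not affect the mean
print(cosine(sentence_vector, [4.0, 6.0]))  # same direction, different length
```

In the notebook's setting, the token vectors would correspond to `outputs.last_hidden_state[0]` and the mask to `inputs["attention_mask"][0]`; the function in `get_bert_embedding` currently keeps only position 0 of that sequence.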


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [16]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''

Learning Experience: I learned many new feature extraction methods, and the libraries used to implement them, in
this exercise.

Challenges Encountered: I encountered many challenges in understanding and implementing the new features, but with
the help of the internet and the abundant documentation for these libraries I completed the assignment.

Relevance to Your Field of Study: This exercise is highly relevant to my field of study (NLP).


'''
