<a href="https://colab.research.google.com/github/lilasu086/Individual_Coding_Project/blob/main/UnsupervisedMachineLearning_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Description of Data

The AI-GA (Artificial Intelligence Generated Abstracts) dataset is a collection of paper abstracts, either AI-generated or original.

The AI-generated abstracts are generated using state-of-the-art language generation techniques (GPT-3 model).

The dataset is provided in CSV format, with each row representing a single sample (i.e.,  a single abstract).

*The ultimate goal of this assignment is to classify the abstracts based on the source (i.e., whether it is AI-generated or original).*

Total sample size: 14,330 (7,248 AI-generated and 7,082 original)

Each sample contains three columns: abstract, title, and label. The label indicates whether the sample is an original abstract (labeled as 0) or an AI-generated abstract (labeled as 1).

##Package installs and imports

DO NOT CHANGE THIS CODE

In [None]:
!pip3 install nltk spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 11.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

Load dataset **"ai-ga-dataset.csv"** as a csv file and save it as a dataframe named **"abstracts_df"**

In [None]:
abstracts_df = pd.read_csv("https://raw.githubusercontent.com/elhamod/BA820/main/Assignment/Assignment2/ai-ga-dataset.csv")
abstracts_df.head()

Unnamed: 0,doc_id,title,abstract,label
0,1,Exaggerated Autophagy in Stanford Type A Aorti...,\n\nThis study presents a novel transcriptome ...,1
1,2,ABO blood types and sepsis mortality,\n\nThe ABO blood types have been associated w...,1
2,3,AAV8-Mediated Angiotensin-Converting Enzyme 2 ...,\n\nTitle: AAV8-Mediated Angiotensin-Convertin...,1
3,4,MyCare study: protocol for a controlled trial ...,INTRODUCTION: People with serious mental illne...,0
4,5,Exploring collective emotion transmission in f...,Collective emotion is the synchronous converge...,0


In [None]:
abstracts_df

Unnamed: 0,doc_id,title,abstract,label
0,1,Exaggerated Autophagy in Stanford Type A Aorti...,\n\nThis study presents a novel transcriptome ...,1
1,2,ABO blood types and sepsis mortality,\n\nThe ABO blood types have been associated w...,1
2,3,AAV8-Mediated Angiotensin-Converting Enzyme 2 ...,\n\nTitle: AAV8-Mediated Angiotensin-Convertin...,1
3,4,MyCare study: protocol for a controlled trial ...,INTRODUCTION: People with serious mental illne...,0
4,5,Exploring collective emotion transmission in f...,Collective emotion is the synchronous converge...,0
...,...,...,...,...
14325,14326,Social marketing interventions to promote phys...,BACKGROUND: Falls are a significant source of ...,0
14326,14327,Ganoderic acid A is the effective ingredient o...,Autosomal dominant polycystic kidney disease (...,0
14327,14328,Variability of contact process in complex netw...,We study numerically how the structures of dis...,0
14328,14329,Phospholipase A(2) in skin biology: new insigh...,\n\nThis paper aims to elucidate the role of P...,1


##Inspection:

**Maximum marks: 5**

- Print the number of abstracts that are human or AI generated, respectively.
- Check if any abstracts have invalid values. Address them appropriately.
- Check if any labels have invalid values. Address them appropriately.

In [None]:
abstracts_df['label'].value_counts()

1    7248
0    7082
Name: label, dtype: int64

In [None]:
abstracts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14330 entries, 0 to 14329
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   doc_id    14330 non-null  int64 
 1   title     14330 non-null  object
 2   abstract  14330 non-null  object
 3   label     14330 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 447.9+ KB


In [None]:
abstracts_df['abstract'].isnull().value_counts()

False    14330
Name: abstract, dtype: int64

In [None]:
abstracts_df['label'].isnull().value_counts()

False    14330
Name: label, dtype: int64

In [None]:
abstracts_df['abstract'].nunique()

14320

In [None]:
abstracts_df['label'].nunique()

2

**Answer**:
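1. There are 7,082 human-written (original) abstracts and 7,248 AI-generated abstracts.
2. Neither the `abstract` nor the `label` column contains null values, and `label` takes only the two expected values (0 and 1), so no rows need to be dropped or imputed. The `abstract` column has 14,320 unique values across 14,330 rows, so ten rows repeat text that appears elsewhere; these duplicates are left in place.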

#Pre-processing

##Question 1.1: text cleaning

**Maximum marks: 5**

Perform pre-processing on all abstracts by lower-casing and removing all non-alpha-numeric characters (i.e., only keep numbers, English alphabet letters, and white spaces).

In [None]:
abstracts_df['abstract'] = abstracts_df['abstract'].str.lower()
abstracts_df

Unnamed: 0,doc_id,title,abstract,label
0,1,Exaggerated Autophagy in Stanford Type A Aorti...,\n\nthis study presents a novel transcriptome ...,1
1,2,ABO blood types and sepsis mortality,\n\nthe abo blood types have been associated w...,1
2,3,AAV8-Mediated Angiotensin-Converting Enzyme 2 ...,\n\ntitle: aav8-mediated angiotensin-convertin...,1
3,4,MyCare study: protocol for a controlled trial ...,introduction: people with serious mental illne...,0
4,5,Exploring collective emotion transmission in f...,collective emotion is the synchronous converge...,0
...,...,...,...,...
14325,14326,Social marketing interventions to promote phys...,background: falls are a significant source of ...,0
14326,14327,Ganoderic acid A is the effective ingredient o...,autosomal dominant polycystic kidney disease (...,0
14327,14328,Variability of contact process in complex netw...,we study numerically how the structures of dis...,0
14328,14329,Phospholipase A(2) in skin biology: new insigh...,\n\nthis paper aims to elucidate the role of p...,1


In [None]:
# regex=True is required: recent pandas versions treat the pattern as a literal string by default
abstracts_df['abstract'] = abstracts_df['abstract'].str.replace(r'[^a-zA-Z0-9\s]+', '', regex=True)
#abstracts_df['abstract'] = abstracts_df['abstract'].str.replace('[\n]','')
abstracts_df['abstract'] = abstracts_df['abstract'].str.strip()
abstracts_df


Unnamed: 0,doc_id,title,abstract,label
0,1,Exaggerated Autophagy in Stanford Type A Aorti...,this study presents a novel transcriptome pilo...,1
1,2,ABO blood types and sepsis mortality,the abo blood types have been associated with ...,1
2,3,AAV8-Mediated Angiotensin-Converting Enzyme 2 ...,title aav8mediated angiotensinconverting enzym...,1
3,4,MyCare study: protocol for a controlled trial ...,introduction people with serious mental illnes...,0
4,5,Exploring collective emotion transmission in f...,collective emotion is the synchronous converge...,0
...,...,...,...,...
14325,14326,Social marketing interventions to promote phys...,background falls are a significant source of m...,0
14326,14327,Ganoderic acid A is the effective ingredient o...,autosomal dominant polycystic kidney disease a...,0
14327,14328,Variability of contact process in complex netw...,we study numerically how the structures of dis...,0
14328,14329,Phospholipase A(2) in skin biology: new insigh...,this paper aims to elucidate the role of phosp...,1


`Note`

The functions `abstracts_df['abstract'].str.replace('[\n]','')` and `abstracts_df['abstract'].str.strip()` don't have exactly the same functionality.

- `abstracts_df['abstract'].str.replace('[\n]','')` replaces newline characters in the abstracts with an empty string, effectively removing all newline characters from the abstracts.
- `abstracts_df['abstract'].str.strip()` removes leading and trailing whitespace and newline characters from the abstracts, but does not remove newline characters within the abstracts.

So, if you want to remove all newline characters from the abstracts, you would use the first method. If you only want to remove leading and trailing whitespace and newline characters, you would use the second method.
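
The difference described above can be checked on a single string. This is a minimal sketch using plain Python string methods, which behave like the pandas `.str` accessors for this case; the sample text is made up for illustration and is not taken from the dataset.

```python
# A toy abstract with leading, interior, and trailing newlines (illustrative only).
text = "\n\nfirst line\nsecond line\n"

# strip() trims whitespace and newlines at the two ends only.
stripped = text.strip()

# replace() deletes every newline, wherever it occurs.
no_newlines = text.replace("\n", "")

print(repr(stripped))      # 'first line\nsecond line'
print(repr(no_newlines))   # 'first linesecond line'
```

Note that deleting an interior newline glues the neighboring words together ("linesecond"), so replacing it with a space may be the safer choice when words on adjacent lines must stay separate.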

In [None]:
'''import re

def preprocessing(abstract):
    # Use regular expression to remove non-alphanumeric characters
    return re.sub(r'[^a-zA-Z0-9\s]', '', abstract)

# Apply the function to all abstracts in the DataFrame
abstracts_df['abstract'] = abstracts_df['abstract'].apply(preprocessing)

# Print the pre-processed DataFrame
abstracts_df'''

## Question 1.2: Stemming or Lemmatization

**Maximum Marks: 7.5**

We enhance the effectiveness of our text analysis algorithms by normalizing words and reducing them to their root/base forms.

Write a function `process_text` that



1.   removes `english` stop words.
2.   uses `PorterStemmer` and `WordNetLemmatizer` to stem AND lemmatize the tokenized abstracts.

The function would take in a document and return its tokenization as a list of tokens.

To verify its functionality, call the function with the first abstract as input, and then print the transformed abstract as a full text (i.e., as a string, not as a list of tokens).

`Note`

When processing natural language text, it's often necessary to convert words into their base forms to reduce variations and facilitate comparisons. Two commonly used tools for this purpose in natural language processing are PorterStemmer and WordNetLemmatizer.

1. **PorterStemmer**:
   - PorterStemmer is an algorithmic-based stemmer that converts words into their stems or base forms by removing suffixes. This process does not consider the semantics of words, but rather truncates word endings based on a set of predefined rules to make them easier to match and compare.
   - For example, the words "running" and "runs" would both be stemmed to "run" when processed by PorterStemmer.

2. **WordNetLemmatizer**:
   - WordNetLemmatizer is a lemmatizer that maps words to their dictionary base forms (lemmas). It looks words up in the WordNet lexical database and applies lemmatization rules that depend on the word's part of speech.
   - For example, "runs" is lemmatized to "run". "running" is lemmatized to "run" only when it is tagged as a verb (`pos='v'`); with the default noun tag it is returned unchanged.

In summary, PorterStemmer truncates word endings based on rules without considering semantics, while WordNetLemmatizer considers word semantics to identify the true base form of words.

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [None]:
'''def process_text(text):
  ## Fill in this function ##
  tokens = word_tokenize(text)
  tokens = [token for token in tokens if token not in stop_words]
  stemmed_tokens = [stemmer.stem(token) for token in tokens]
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
  stemmed_text = ', '.join(stemmed_tokens)
  lemmatized_text = ', '.join(lemmatized_tokens)

  return stemmed_text, lemmatized_text'''

In [None]:
def process_text(text):
  # Tokenize, drop English stop words, then stem and lemmatize the remaining tokens.
  tokens = word_tokenize(text)
  tokens = [token for token in tokens if token not in stop_words]
  stemmed_tokens = [stemmer.stem(token) for token in tokens]
  # The lemmatizer runs on the stems, so both transformations are applied in sequence.
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in stemmed_tokens]

  return lemmatized_tokens

In [None]:
'''abstracts_df['stemmed_tokens'], abstracts_df['lemmatized_tokens'] = zip(*abstracts_df['abstract'].apply(process_text)) #In Python, the asterisk (*) operator is used to unpack the elements of an iterable (such as a list or tuple) into individual arguments.
abstracts_df'''

In [None]:
'''## Test the function
print("Stemmed Tokens:")
print(abstracts_df.loc[0, 'stemmed_tokens'])
print()
print("Lemmatized Tokens:")
print(abstracts_df.loc[0, 'lemmatized_tokens'])'''

In [None]:
'''
abstract = abstracts_df.loc[0, 'abstract']
stemmed_tokens, lemmatized_tokens = process_text(abstract)

stemmed_text = ', '.join(stemmed_tokens)
lemmatized_text = ', '.join(lemmatized_tokens)

print("Stemmed Tokens:")
print(stemmed_text)

print("\nLemmatized Tokens:")
print(lemmatized_text)'''


In [None]:
## Test the function
test = ', '.join(process_text(abstracts_df['abstract'][0]))
print(test)

studi, present, novel, transcriptom, pilot, analysi, human, ascend, aortic, tissu, explor, mechan, behind, exagger, autophagi, stanford, type, aortic, dissect, recent, establish, excess, autophagi, associ, increas, risk, progress, complic, destruct, form, thorac, aortic, injuri, howev, underli, molecular, pathway, remain, mostli, unknown, investig, mechan, conduct, rna, sequenc, experi, ribosomaldeplet, sampl, ten, ascend, aorta, dissect, surgic, resect, seven, patient, stanford, type, aortic, dissect, staad, result, provid, insight, possibl, molecular, marker, might, contribut, acceler, autophag, activ, use, research, exagger, pathway, regul, stabil, staad, patholog


#Vectorization

Next, we will try different vector representations and see how well each performs.

## Question 2.1: Bag of Words

**Maximum Marks: 5**

Perform Bag of Words on the abstracts and store the vector representation as a DataFrame.

You are expected to apply the `process_text` tokenization.

Print the head of the resulting DataFrame.

How many tokens does BoW yield?

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer(tokenizer=process_text)

cv.fit(abstracts_df['abstract'])
dtm = cv.transform(abstracts_df['abstract'])
bow = pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())
bow



Unnamed: 0,0,00,000,0000,000001,000007,00001,00002,00003,000032,...,zymogen,zymographi,zymosan,zymosaninduc,zythia,zyz,zyz803,zzn,zzz,zzzn
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14325,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14326,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14327,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
14328,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
'''cv_s = CountVectorizer()

cv_s.fit(abstracts_df['stemmed_tokens'])
dtm_s = cv_s.transform(abstracts_df['stemmed_tokens'])
bow_s = pd.DataFrame(dtm_s.toarray(), columns=cv_s.get_feature_names_out())
bow_s'''

In [None]:
'''cv_l = CountVectorizer()

cv_l.fit(abstracts_df['lemmatized_tokens'])
dtm_l = cv_l.transform(abstracts_df['lemmatized_tokens'])
bow_l = pd.DataFrame(dtm_l.toarray(), columns=cv_l.get_feature_names_out())
bow_l'''

**Answer:**

With the `process_text` tokenizer, BoW yields 66,987 distinct tokens (one column per token in the vocabulary).
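
To make the meaning of "one column per token" concrete, here is a small sketch of Bag of Words built by hand with the standard library. The two documents and the whitespace tokenizer are assumptions for illustration; the notebook itself uses `CountVectorizer` with `process_text`.

```python
from collections import Counter

# Toy documents, not taken from the dataset.
docs = ["the cat sat", "the cat saw the dog"]

# Whitespace tokenization stands in for process_text here.
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all distinct tokens; its size is the number of columns.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Each row counts how often every vocabulary token occurs in one document.
rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    rows.append([counts[token] for token in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(rows)        # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```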

## Question 2.2: TF-IDF

**Maximum Marks: 5**

Using TF-IDF with `process_text` tokenization, vectorize the abstracts. Then, find the top 5 most similar abstracts to the document with doc_id=6 (shown below) in terms of content.
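
As background for this question, the sketch below computes TF-IDF weights by hand with NumPy. It assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF, raw term counts, L2 row normalization); the count matrix is a made-up example, not data from the abstracts.

```python
import numpy as np

# Toy term-count matrix: 3 documents (rows) x 3 tokens (columns). Illustrative only.
counts = np.array([
    [2, 1, 0],
    [0, 1, 1],
    [0, 0, 3],
], dtype=float)

n_docs = counts.shape[0]

# Document frequency: in how many documents each token appears.
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency: rarer tokens get larger weights.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale every row to unit length (L2 norm).
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

# With unit-length rows, cosine similarity reduces to a dot product.
similarity = tfidf @ tfidf.T
print(np.round(similarity, 3))
```

Because the rows are already normalized, documents that share no tokens (the first and third rows here) have a similarity of exactly zero.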

In [None]:
query_index = 6
print("document id.", query_index, ": ", abstracts_df["abstract"].iloc[query_index])

document id. 6 :  background advantages of multiple arterial conduits for coronary artery bypass grafting cabg have been reported previously we aimed to evaluate the midterm outcomes of multiple arterial cabg mabg among patients with mild to moderate left ventricular systolic dysfunction lvsd methods this multicenter study using propensity score matching took place from january 2013 to june 2019 in jiangsu province and shanghai china with a mean and maximum followup of 33 and 68 years respectively we included patients with mild to moderate lvsd undergoing primary isolated multivessel cabg with left internal thoracic artery the inhospital and midterm outcomes of mabg versus conventional left internal thoracic artery supplemented by saphenous vein grafts single arterial cabg were compared the primary end points were death from all causes and death from cardiovascular causes the secondary end points were stroke myocardial infarction repeat revascularization and a composite of all mentione

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf_model = TfidfVectorizer(tokenizer=process_text)

tfidf_model.fit(abstracts_df["abstract"])

df_tfidf_transformed = tfidf_model.transform(abstracts_df["abstract"])
#tfidf_vectors = pd.DataFrame(df_tfidf_transformed.toarray(), columns=tfidf_model.get_feature_names_out())
#tfidf_vectors



In [None]:
df_tfidf_transformed.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
'''tfidf_model_s = TfidfVectorizer()

tfidf_model_s.fit(abstracts_df['stemmed_tokens'])

df_tfidf_transformed_s = tfidf_model_s.transform(abstracts_df['stemmed_tokens'])
tfidf_vectors_s = pd.DataFrame(df_tfidf_transformed_s.toarray(), columns=tfidf_model_s.get_feature_names_out())
tfidf_vectors_s'''

In [None]:
'''tfidf_model_l = TfidfVectorizer()

tfidf_model_l.fit(abstracts_df['lemmatized_tokens'])

df_tfidf_transformed_l = tfidf_model_l.transform(abstracts_df['lemmatized_tokens'])
tfidf_vectors_l = pd.DataFrame(df_tfidf_transformed_l.toarray(), columns=tfidf_model_l.get_feature_names_out())
tfidf_vectors_l'''

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
q_vector = tfidf_model.transform([abstracts_df["abstract"].iloc[query_index]])
q_result = pd.DataFrame(cosine_similarity(q_vector, df_tfidf_transformed))

In [None]:
abstract_list = list(q_result.sort_values(by=0, axis=1, ascending= False).iloc[0, :6].index)
abstract_list = abstract_list[1:6]

In [None]:
abstract_list

[8270, 6632, 3076, 4288, 3244]

In [None]:
print("The top 5 most similar abstracts to the document with doc_id=6 (shown below) in terms of content are: ")
for index in abstract_list:
  title = abstracts_df["title"].iloc[index]
  print(title)

The top 5 most similar abstracts to the document with doc_id=6 (shown below) in terms of content are: 
Does additional coronary artery bypass grafting to aortic valve replacement in elderly patients affect the early and long-term outcome?
Does additional coronary artery bypass grafting to aortic valve replacement in elderly patients affect the early and long-term outcome?
Preoperative right ventricular dysfunction requires high vasoactive and inotropic support during off-pump coronary artery bypass grafting
Early changes in diaphragmatic function evaluated using ultrasound in cardiac surgery patients: a cohort study
Sex, Age, and Socioeconomic Differences in Nonfatal Stroke Incidence and Subsequent Major Adverse Outcomes


In [None]:
'''q_vector_s = tfidf_model_s.transform(abstracts_df["stemmed_tokens"].iloc[query_index])
pd.DataFrame(cosine_similarity(q_vector_s, tfidf_vectors_s))'''

In [None]:
'''q_vector_l = tfidf_model_l.transform(abstracts_df["stemmed_tokens"].iloc[query_index])
pd.DataFrame(cosine_similarity(q_vector_l, tfidf_vectors_l))'''

Q2.2 Answer

The row indices of the five abstracts most similar in content to the query document (`query_index = 6`) are 8270, 6632, 3076, 4288, and 3244. Their titles are:

1. Does additional coronary artery bypass grafting to aortic valve replacement in elderly patients affect the early and long-term outcome?

2. Does additional coronary artery bypass grafting to aortic valve replacement in elderly patients affect the early and long-term outcome?

3. Preoperative right ventricular dysfunction requires high vasoactive and inotropic support during off-pump coronary artery bypass grafting

4. Early changes in diaphragmatic function evaluated using ultrasound in cardiac surgery patients: a cohort study

5. Sex, Age, and Socioeconomic Differences in Nonfatal Stroke Incidence and Subsequent Major Adverse Outcomes
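
The retrieval step used for this question can also be written directly with NumPy. The helper below is a hypothetical sketch (the name `top_k_similar` and the toy vectors are not part of the notebook) and it assumes a dense array; for the sparse TF-IDF matrix, `cosine_similarity` from scikit-learn remains the practical choice.

```python
import numpy as np

def top_k_similar(vectors, query_index, k=5):
    """Return the indices of the k rows most similar to row `query_index` (hypothetical helper)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # avoid division by zero for all-zero rows
    unit = vectors / norms             # unit-length rows
    scores = unit @ unit[query_index]  # cosine similarity to the query
    scores[query_index] = -np.inf      # exclude the query itself by index
    return np.argsort(-scores)[:k].tolist()

# Toy 2-dimensional vectors, illustrative only.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
print(top_k_similar(toy, query_index=0, k=2))  # [1, 3]
```

Excluding the query by its index, rather than dropping the first ranked result, stays correct even if another row is an exact duplicate of the query and ties with it for first place.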

## Question 2.3 Word2Vec

**Maximum Marks: 7.5**

Now repeat Q 2.2 but using Word2Vec. For each token, the model should consider the two adjacent tokens on its left and the two on its right. Use `workers=4` to speed up computation. Include **all** words that occur in the abstracts.

Use vector averaging to calculate the vector representation of the sentence based on the vectors of its constituent words.

How do the results of Word2Vec and TF-IDF compare?
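
Vector averaging, as requested above, can be sketched on a toy vocabulary. The 3-dimensional vectors and the helper `average_embedding` are assumptions for illustration; a trained model would supply the real vectors. The sketch also shows two common ways to treat tokens the model does not know: counting them as zero vectors or skipping them.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a real model would supply these.
toy_vectors = {
    "heart": np.array([1.0, 0.0, 2.0]),
    "artery": np.array([3.0, 2.0, 0.0]),
}

def average_embedding(tokens, vectors, dim=3, skip_unknown=False):
    """Average the vectors of `tokens` into one document vector (hypothetical helper)."""
    if skip_unknown:
        # Ignore tokens that have no vector.
        rows = [vectors[t] for t in tokens if t in vectors]
    else:
        # Count unknown tokens as zero vectors, which pulls the mean toward zero.
        rows = [vectors.get(t, np.zeros(dim)) for t in tokens]
    if not rows:
        return np.zeros(dim)
    return np.mean(rows, axis=0)

tokens = ["heart", "artery", "zzz"]  # "zzz" is out of vocabulary
with_zeros = average_embedding(tokens, toy_vectors)
skipping = average_embedding(tokens, toy_vectors, skip_unknown=True)
print(with_zeros)
print(skipping)
```

Both conventions return a zero vector for a document with no usable tokens, so every document still gets a vector of the same length.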

In [None]:
from gensim.models import Word2Vec

In [None]:
tokenized_word2vec = [process_text(t) for t in abstracts_df['abstract']]
model_word2vec = Word2Vec(sentences=tokenized_word2vec, vector_size=300, window=2, min_count=1, workers=4, negative=20, epochs=50)

In [None]:
model_word2vec = model_word2vec.wv

In [None]:
def get_word_embedding(word, model):
    if word in model.key_to_index:
        return model[word]
    else:
        # Return a zero vector for Out-of-vocabulary
        return np.zeros(model.vector_size)

In [None]:
# Construct the embeddings (i.e., vectorization) using your Word2Vec model
embeddings = [] # List of message embeddings
for tokenized_document in tokenized_word2vec:# Iterate through the messages
  message_word_embeddings = [get_word_embedding(word, model_word2vec) for word in tokenized_document] # Calculate the embedding for each word in the message. Put them all in a list.
  message_embedding = np.mean(message_word_embeddings if len(message_word_embeddings) >0 else [np.zeros(model_word2vec.vector_size)], axis=0) # Average the word embeddings to get a sentence embedding
  embeddings = embeddings + [message_embedding] # Add the current message embedding into the list of embeddings for all messages.

embeddings = np.array(embeddings)
embeddings

array([[ 0.11034942, -0.4945055 , -0.24293113, ...,  0.58066237,
        -0.1354874 ,  0.19698319],
       [ 0.17572708, -0.5020566 , -0.24783997, ...,  0.39126864,
         0.37660497,  0.45509362],
       [-0.00186191, -0.25109252, -0.4485579 , ...,  0.59668475,
        -0.01498802,  0.11994077],
       ...,
       [-0.16302033, -0.56428593, -0.14549582, ...,  0.6531042 ,
        -0.00505586,  0.5549178 ],
       [ 0.01457141, -0.3587075 , -0.34182087, ...,  0.6981138 ,
        -0.09771834,  0.20925331],
       [ 0.34881055, -0.3895549 ,  0.33703044, ...,  0.3992461 ,
         0.13547543,  0.4437612 ]], dtype=float32)

In [None]:
q_result = list(pd.DataFrame(cosine_similarity(embeddings[query_index].reshape(1, -1), embeddings.reshape(embeddings.shape[0], -1))).sort_values(by=0, axis=1, ascending= False).iloc[0, :6].index)[1:6]

In [None]:
q_result

[4727, 8270, 8588, 14180, 4878]

In [None]:
'''print("The top 5 most similar abstracts to the document with doc_id=6 (shown below) in terms of content are: ")
for index in q_result:
  title = abstracts_df["title"].iloc[index]
  print(title)'''

In [None]:
title_0 = abstracts_df["title"].iloc[query_index]
label_0 = abstracts_df["label"].iloc[query_index]
print(title_0, label_0)

Multiple arterial conduits for multi-vessel coronary artery bypass grafting in patients with mild to moderate left ventricular systolic dysfunction: a multicenter retrospective study 0


In [None]:
for index in abstract_list:
  title_2 = abstracts_df["title"].iloc[index]
  label_2 = abstracts_df["label"].iloc[index]
  print(title_2, label_2)

Does additional coronary artery bypass grafting to aortic valve replacement in elderly patients affect the early and long-term outcome? 0
Does additional coronary artery bypass grafting to aortic valve replacement in elderly patients affect the early and long-term outcome? 1
Preoperative right ventricular dysfunction requires high vasoactive and inotropic support during off-pump coronary artery bypass grafting 1
Early changes in diaphragmatic function evaluated using ultrasound in cardiac surgery patients: a cohort study 1
Sex, Age, and Socioeconomic Differences in Nonfatal Stroke Incidence and Subsequent Major Adverse Outcomes 0


In [None]:
for index in q_result:
  title_1 = abstracts_df["title"].iloc[index]
  label_1 = abstracts_df["label"].iloc[index]
  print(title_1, label_1)

Cerebrovascular autoregulation and arterial carbon dioxide in patients with acute respiratory distress syndrome: a prospective observational cohort study 0
Does additional coronary artery bypass grafting to aortic valve replacement in elderly patients affect the early and long-term outcome? 0
Risk Factors for Dysphagia and the Impact on Outcome After Spontaneous Subarachnoid Hemorrhage 0
Risk of Readmission and Mortality Following Hospitalization with Hypercapnic Respiratory Failure 0
Cancer patients with community-acquired pneumonia treated in intensive care have poorer outcomes associated with increased illness severity and septic shock at admission to intensive care: a retrospective cohort study 0


**Answer:**

When comparing the results obtained from Word2Vec and TF-IDF, I utilized cosine similarity to identify the top 5 most similar abstracts to the document with doc_id=6 in terms of content.

Comparing the labels of the retrieved abstracts with the label of the query document (an original abstract, label 0), I observed that Word2Vec gave the more consistent result. All five of its neighbors are original abstracts, matching the query label. In contrast, TF-IDF returned three AI-generated abstracts (label 1) and two original ones. Only one abstract (row 8270) appears in both top-5 lists.

Based on this comparison, I conclude that Word2Vec is more effective than TF-IDF for this query, with the caveat that a single query and a label-based check are limited evidence.

# Classification

## Question 3.1: GloVe

**Maximum Marks: 7.5**

Instead of training our own Word2Vec model, we decided to use a [GloVe](https://nlp.stanford.edu/projects/glove/) model that was pre-trained by researchers at Stanford University. They used a much larger amount of text in their training (e.g., Wikipedia).

For this question, simply use `get_tokens(doc)` below for tokenization.

**Note:** *Vectorizing the entire dataset using GloVe may take 5-10 minutes. Use the guidelines we discussed in class to test and develop your code before fully applying it to the entire dataset.*

In [None]:
from gensim import downloader

# load the GloVe model
glove_model = downloader.load("glove-wiki-gigaword-50")



In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

def get_tokens(doc):
    doc_tokenized = nlp(doc)
    tokens = [token.text for token in doc_tokenized]
    return tokens

In [None]:
glove_tokens = []
for abstract in abstracts_df['abstract']:
    tokens = get_tokens(abstract)
    glove_tokens.append(tokens)

In [None]:
embeddings_glove = []  # one averaged vector per abstract
for tokenized_document in glove_tokens:
  # Look up the GloVe vector for every token; out-of-vocabulary tokens become zero vectors.
  word_embeddings = [get_word_embedding(word, glove_model) for word in tokenized_document]
  # Average the token vectors; fall back to a zero vector of the GloVe size if there are no tokens.
  if len(word_embeddings) > 0:
    document_embedding = np.mean(word_embeddings, axis=0)
  else:
    document_embedding = np.zeros(glove_model.vector_size)
  embeddings_glove.append(document_embedding)

embeddings_glove = np.array(embeddings_glove)
embeddings_glove

array([[ 0.60367081,  0.12071785, -0.04317449, ...,  0.17389467,
        -0.05993591, -0.05317814],
       [ 0.51261615,  0.11428579,  0.0767181 , ...,  0.11840849,
         0.01182068, -0.00559276],
       [ 0.49807717,  0.02423132, -0.06043726, ...,  0.15202371,
         0.01323068,  0.06847766],
       ...,
       [ 0.50933164,  0.11201573, -0.16306889, ...,  0.09674443,
        -0.0443515 , -0.12033479],
       [ 0.55038424,  0.11670404, -0.1271348 , ...,  0.1681476 ,
        -0.03529005, -0.08673325],
       [ 0.33897869,  0.11229256,  0.0465749 , ...,  0.1029314 ,
         0.12011714, -0.08015687]])

## Question 3.2: Random Forest Classifier

**Maximum Marks: 7.5**

Using a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), compare the classification results using GloVe to those using TF-IDF. Does GloVe do better or worse? Are there any particular issues you faced? Elaborate on your findings and justify them.

Use a test set of 20% the total dataset size. Use `random_state = 42`.

Print the `classification_report` of your model.
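
The `classification_report` requested above lists precision, recall, and F1 for each class. As a reminder of what those columns mean, here is a small standard-library sketch for one class; the helper `binary_report` and the toy labels are assumptions for illustration and are unrelated to the dataset.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the `positive` class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 1 = AI-generated, 0 = original (made up for illustration).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = binary_report(y_true, y_pred, positive=1)
print(precision, recall, f1)
```

Calling the helper with `positive=0` gives the row for the other class, which is how the report ends up with one line per label.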

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
'''def get_split_datasets(X, y, stratify=False, stratify_size = 400):
  if stratify:
    # split into AI-generated (label 1) and original (label 0)
    y_treat = y[y == 1]
    X_treat = X[y == 1]
    y_control = y[y == 0]
    X_control = X[y == 0]

    # split and randomly sample into train and test
    X_treat_train, X_treat_test, y_treat_train, y_treat_test = train_test_split(X_treat, y_treat, test_size=0.2, random_state=42)
    X_control_train, X_control_test, y_control_train, y_control_test = train_test_split(X_control, y_control, test_size=0.2, random_state=42)

    # merge again with equal number of samples per class
    X_train = pd.concat([X_treat_train[:stratify_size], X_control_train[:stratify_size]], axis=0)
    X_test = pd.concat([X_treat_test[:stratify_size], X_control_test[:stratify_size]], axis=0)
    y_train = y_treat_train[:stratify_size].append(y_control_train[:stratify_size])
    y_test = y_treat_test[:stratify_size].append(y_control_test[:stratify_size])
  else:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  return X_train, X_test, y_train, y_test'''

In [None]:
y = abstracts_df["label"]

In [None]:
#stratify = False
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(df_tfidf_transformed, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
rf_classifier = RandomForestClassifier(random_state=42)

In [None]:
rf_classifier.fit(X_train_tfidf, y_train_tfidf)
y_pred_tfidf = rf_classifier.predict(X_test_tfidf)

In [None]:
print(classification_report(y_test_tfidf,y_pred_tfidf))

              precision    recall  f1-score   support

           0       0.96      0.91      0.93      1426
           1       0.92      0.96      0.94      1440

    accuracy                           0.93      2866
   macro avg       0.94      0.93      0.93      2866
weighted avg       0.94      0.93      0.93      2866



In [None]:
#stratify = False
X_train_glove, X_test_glove, y_train_glove, y_test_glove = train_test_split(embeddings_glove, y, test_size=0.2, random_state=42)

In [None]:
rf_classifier.fit(X_train_glove, y_train_glove)
y_pred_glove = rf_classifier.predict(X_test_glove)

In [None]:
print(classification_report(y_test_glove,y_pred_glove))

              precision    recall  f1-score   support

           0       0.88      0.86      0.87      1426
           1       0.87      0.88      0.87      1440

    accuracy                           0.87      2866
   macro avg       0.87      0.87      0.87      2866
weighted avg       0.87      0.87      0.87      2866



**Answer**:


The two classification reports show that, with the same RandomForestClassifier, GloVe embeddings perform worse than TF-IDF features. Accuracy drops from 0.93 (TF-IDF) to 0.87 (GloVe), and the per-class F1 scores drop from 0.93 and 0.94 to 0.87 for both classes.
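The comparison can be made explicit by computing accuracy and macro F1 side by side for each set of predictions. A minimal sketch, assuming toy labels and predictions (the arrays and the names `pred_a` and `pred_b` are hypothetical stand-ins, not the notebook's results):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy ground truth and two sets of predictions (hypothetical values).
y_true = [0, 0, 0, 1, 1, 1]
predictions = {
    "a": [0, 0, 0, 1, 1, 0],
    "b": [0, 1, 0, 1, 0, 0],
}

# Collect the same two metrics for every set of predictions.
results = {}
for name, y_pred in predictions.items():
    results[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }
    print(name, results[name])
```

In the notebook, the same loop could be run over `y_pred_tfidf` and `y_pred_glove` against their respective test labels.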

This could be attributed to two main factors: dimensionality and model selection. First, the two representations differ in dimensionality and in the information they keep. Pretrained GloVe vectors are dense and typically have 50 to 300 dimensions, whereas a TF-IDF matrix usually has one sparse column per vocabulary term, which is often far more. If each abstract is represented by pooling (for example, averaging) its word vectors, much of the word-level frequency information that TF-IDF retains is compressed away. With a relatively small dataset (14,330 samples), this loss might degrade performance.

Second, Random Forests tend to work well with TF-IDF representations because each feature corresponds directly to one word, weighted by term frequency and inverse document frequency. A split on a single feature then has a clear meaning (whether a particular word is prominent in the abstract), which suits the axis-aligned splits of decision trees. Individual dimensions of a GloVe embedding carry no such standalone meaning.

In summary, the difference in dimensionality and information content between the two representations, together with Random Forests' affinity for TF-IDF features, could explain the observed performance gap.
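The dimensionality argument can be checked directly by comparing the shapes of the two feature matrices. Below is a minimal sketch on a toy corpus. The corpus is made up, random vectors stand in for pretrained GloVe vectors, and mean pooling is an assumed way of turning word vectors into one vector per document:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the abstracts (hypothetical data).
docs = [
    "we propose a novel method for image classification",
    "this paper studies language models for text generation",
    "a novel approach to text classification using language models",
]

# TF-IDF: one column per vocabulary term, stored as a sparse matrix.
tfidf = TfidfVectorizer().fit_transform(docs)
density = tfidf.nnz / (tfidf.shape[0] * tfidf.shape[1])

# Stand-in for pretrained word vectors: random vectors, NOT real GloVe.
EMBED_DIM = 50  # assumed embedding size
rng = np.random.default_rng(42)
vocab = sorted({tok for d in docs for tok in d.split()})
word_vectors = {tok: rng.normal(size=EMBED_DIM) for tok in vocab}

# Document embedding = mean of the word vectors of its tokens.
doc_embeddings = np.vstack(
    [np.mean([word_vectors[t] for t in d.split()], axis=0) for d in docs]
)

print("TF-IDF shape:", tfidf.shape, "density:", round(density, 2))
print("Embedding shape:", doc_embeddings.shape)
```

On a real corpus the vocabulary, and with it the TF-IDF width, grows with the data, while the embedding width stays fixed. In the notebook, the analogous check would be to print `df_tfidf_transformed.shape` and `np.asarray(embeddings_glove).shape`, assuming both are array-like.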


`Note`

There could be several reasons why the F1 score or accuracy obtained with GloVe embeddings is worse than with TF-IDF. Here are a few potential explanations:

1. **Embedding Quality**: Pretrained GloVe embeddings might be less suitable than TF-IDF vectors for this specific task. They are trained on large general-purpose corpora and might not capture the information relevant to the classification task as effectively as TF-IDF vectors, especially if the task involves domain-specific or nuanced language.

2. **Dimensionality**: The two representations differ greatly in dimensionality. GloVe vectors are dense and typically have 50 to 300 dimensions, while TF-IDF vectors usually have one sparse dimension per vocabulary term. Compressing a whole document into a short dense vector can discard word-level detail that the classifier could otherwise use, which might lead to worse performance.

3. **Fine-tuning**: GloVe embeddings are pre-trained on large corpora and might not be fine-tuned specifically for the classification task at hand. In contrast, TF-IDF vectors are derived directly from the training data and might capture more task-specific information.

4. **Data Representation**: TF-IDF vectors represent documents based on the frequency of words, which might be more informative for certain classification tasks, especially if the task relies heavily on keyword frequency. GloVe embeddings, on the other hand, represent documents based on semantic similarity, which might not always align perfectly with the classification requirements.

5. **Model Selection**: The choice of classification model could also impact the performance difference between GloVe embeddings and TF-IDF vectors. Some models might be better suited to handle the characteristics of one representation over the other.

To address this discrepancy, you could experiment with different hyperparameters, model architectures, or even try fine-tuning the GloVe embeddings on your specific classification task to see if performance improves. Additionally, it's essential to thoroughly analyze the data and the nature of the classification task to determine which representation method is more appropriate.
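As one way to experiment with hyperparameters, a small grid search over Random Forest settings can be run with cross-validation. This is a minimal sketch, not a tuned setup: the synthetic data from `make_classification` stands in for the real feature matrix, and the parameter grid is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a feature matrix and binary labels (hypothetical data).
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# Illustrative grid; the values are assumptions, not recommendations.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", 0.5],
}

# Evaluate every combination with 3-fold cross-validation, scored by F1.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

In the notebook, `X` and `y` would be replaced by the training split of either representation, so that the test split stays untouched until the final evaluation.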

`Note`

Generally, Random Forests tend to perform better with TF-IDF representations. Here's why:

1. **Direct Feature Representation**: TF-IDF provides a more direct representation of features by leveraging word frequency and inverse document frequency. These features can often offer straightforward information to differentiate between documents. Such direct feature representations are typically easier to interpret and utilize in tree-based models like Random Forests.

2. **Sparse Data Handling**: TF-IDF vectors are sparse, since each document contains only a small fraction of the vocabulary and every other entry is zero. Tree-based models like Random Forests cope well with such data: a split only needs to separate low from high weights of a single word, and scikit-learn's `RandomForestClassifier` accepts sparse matrices directly. Note that these zeros are genuine values (the word is absent), not missing values.

3. **Robustness to Outliers and Noise**: TF-IDF features may be more robust to noise in this setting because each feature depends on a single word, so an unusual word affects only its own column. In contrast, a pooled GloVe embedding mixes all words of a document into every dimension of a continuous dense vector, so noisy or out-of-vocabulary words might shift the whole representation.

However, this doesn't mean it's always the case. Depending on the dataset and the task at hand, GloVe embeddings might perform better, especially if the task involves semantic similarity or considers relationships between words. Therefore, it's advisable to evaluate the performance of different feature representations using techniques like cross-validation to determine which method performs better for a given task.
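The cross-validation comparison suggested above could look like the following. This is a minimal sketch under stated assumptions: both feature matrices are synthetic stand-ins (the names `representation_a` and `representation_b` are hypothetical), and the same stratified folds are used for both so the scores are comparable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for one feature representation (hypothetical data).
X_a, y = make_classification(
    n_samples=300, n_features=40, n_informative=10, random_state=0
)

# Second stand-in: a random 10-dimensional linear projection of the first.
rng = np.random.default_rng(0)
X_b = X_a @ rng.normal(size=(40, 10))

# Same folds for both representations so the comparison is fair.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = {}
for name, X in {"representation_a": X_a, "representation_b": X_b}.items():
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores[name] = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
    print(f"{name}: {scores[name].mean():.3f} +/- {scores[name].std():.3f}")
```

In the notebook, `X_a` and `X_b` would be `df_tfidf_transformed` and `embeddings_glove`, with `y` taken from `abstracts_df["label"]`. Reporting the mean and spread over folds gives a more reliable comparison than a single train/test split.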