## Event Extraction - Solution 4: Doc2Vec

We implemented this model out of curiosity. Because our event extraction task pulls out the sentences that contain key events, the team reasoned that document similarity could be used to find sentences that resemble known event sentences.

### Initial Model

In [11]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from datetime import datetime
from sklearn.metrics.pairwise import cosine_similarity
import re

# Function to preprocess and normalize sentences
def preprocess_and_normalize_sentence(sentence):
    if isinstance(sentence, str):
        sentence = re.sub(r"\b(\w+)\s's\b", r"\1's", sentence)
        sentence = sentence.lower()
    else:
        sentence = str(sentence)
    return sentence

# Load your data
df = pd.read_csv('news_cleaned_no_spaces.csv', encoding='latin1')
df = df[:101]

# The initial model uses the full article text in df['news_text'] as its documents
event_sentences = df['news_text'].tolist()

# Split the data into training and testing sets
train_sentences, test_sentences = train_test_split(event_sentences, test_size=0.2, random_state=42)

# Tagged documents for the training split only (not used below)
train_documents = [TaggedDocument(words=word_tokenize(str(doc).lower()), tags=[i]) for i, doc in enumerate(train_sentences) if doc == doc]

# Tagged documents for the full corpus; the model is trained on these (`doc == doc` skips NaN rows)
documents = [TaggedDocument(words=word_tokenize(str(doc).lower()), tags=[i]) for i, doc in enumerate(event_sentences) if doc == doc]

# Train a Doc2Vec model
model = Doc2Vec(documents, vector_size=100, window=5, min_count=2, workers=4, epochs=20, dm=1)

# Create a DataFrame that's a copy of the original
predicted_df = df.copy()

# Add a new column 'output' initialized with NaN
predicted_df['output'] = np.nan
# Assume df['news_text'] contains the text from which you want to extract event sentences
news_text = df['news_text'].tolist()

# Prepare a list to store the cosine similarities
similarities = []

# Initialize counters
TP = 0
FP = 0

# Assume golden_truth is a list of the actual events
golden_truth = df['gold_truth'].tolist()

# Preprocess and normalize the golden truth
normalized_golden_truth = [preprocess_and_normalize_sentence(sentence) for sentence in golden_truth]

# Iterate over the news texts
for idx, text in enumerate(news_text):
    # Check if text is not NaN
    if text == text:
        # Infer a vector for the sentence
        vector = model.infer_vector(word_tokenize(str(text).lower()))
        
        # Find the most similar training document (here, a full article from df['news_text'])
        similar_sentences = model.dv.most_similar([vector], topn=1)
        
        # Store the most similar sentence (the prediction) in the 'output' column
        predicted_df.loc[idx, 'output'] = event_sentences[similar_sentences[0][0]]

        # Cosine similarity between the input text's vector and a freshly inferred vector for the retrieved document
        predicted_vector = model.infer_vector(word_tokenize(predicted_df.loc[idx, 'output'].lower()))
        similarity = cosine_similarity([vector], [predicted_vector])
        
        # Add the cosine similarity to the list
        similarities.append(similarity[0][0])

        # Preprocess and normalize the predicted sentence
        normalized_output = preprocess_and_normalize_sentence(predicted_df.loc[idx, 'output'])

        # Check if the normalized predicted sentence is in the normalized golden truth
        if any(normalized_truth in normalized_output for normalized_truth in normalized_golden_truth):
            TP += 1
        else:
            FP += 1

FN = len(normalized_golden_truth) - TP

# Calculate precision, recall, and F1 score
precision = TP / (TP + FP) if TP + FP > 0 else 0
recall = TP / (TP + FN) if TP + FN > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

average_similarity = sum(similarities) / len(similarities)

print(f'Average Cosine Similarity: {average_similarity}')
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

# Save to CSV
predicted_df.to_csv(f'predicted_sentences_doc2vec{timestamp}.csv', index=False)

Precision: 0.6039603960396039
Recall: 0.6039603960396039
F1 Score: 0.6039603960396039
Average Cosine Similarity: 0.9955514002554487


Initially, the model appeared to perform reasonably well, with precision, recall, and F1 all at about 0.60. This seemed too good to be true, especially since we expected Doc2Vec to perform the worst: it was the only one of our models not normally used for event extraction. On investigation, we found that the predicted outputs were entire news articles. That defeats the purpose of event extraction, since returning the full text is of no use to the fund managers.
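The three identical scores follow from how the cell above counts outcomes. Every non-empty article yields exactly one prediction, so TP + FP equals the number of articles, and FN is defined as the number of gold entries minus TP. When both totals are the same N, precision, recall, and F1 all reduce to TP / N. The sketch below illustrates this; `tp = 61` is an assumption inferred from the printed 0.6039... with N = 101, since the notebook does not print the raw counts, and `prf` is a hypothetical helper that mirrors the formulas used in the cell.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 with the same zero guards as the notebook cell."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / (tp + fn) if tp + fn > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0
    return precision, recall, f1

n_articles = 101        # rows kept by df[:101]
tp = 61                 # assumption: inferred from 61 / 101 ≈ 0.60396
fp = n_articles - tp    # one prediction per article, so TP + FP = N
fn = n_articles - tp    # FN = len(gold) - TP, and len(gold) = N

precision, recall, f1 = prf(tp, fp, fn)
print(precision, recall, f1)
```

Under this counting scheme the three metrics carry the same information, so a single number is effectively being reported three times.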

We then realised that articles were being returned in full because there was no similarity threshold, so the model would accept anything that bore even a slight resemblance to an event sentence. We therefore introduced a threshold and kept only the sentences whose similarity exceeded it.
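The thresholding idea can be sketched independently of Doc2Vec. The example below is a minimal illustration on toy 2-D vectors, not the notebook's implementation: `cosine` and `keep_above_threshold` are hypothetical helpers, and the vectors and the 0.95 cut-off are made up for the example. A sentence is kept only if its best cosine similarity to any event vector exceeds the threshold.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def keep_above_threshold(sentence_vectors, event_vectors, threshold):
    """Return indices of sentences whose best match to any event vector exceeds the threshold."""
    kept = []
    for i, vec in enumerate(sentence_vectors):
        best = max(cosine(vec, ev) for ev in event_vectors)
        if best > threshold:
            kept.append(i)
    return kept

# Toy data (assumed for illustration only)
event_vectors = [[1.0, 0.0], [0.0, 1.0]]
sentence_vectors = [[0.99, 0.05], [1.0, 1.0], [0.02, 0.9]]

kept = keep_above_threshold(sentence_vectors, event_vectors, threshold=0.95)
print(kept)  # indices of the sentences that pass the threshold
```

The middle vector sits at 45 degrees to both event vectors (cosine of about 0.707), so it is dropped, while the other two are nearly parallel to an event vector and are kept. The right threshold depends on how tightly the embeddings cluster, which is why the final model below uses a much higher value.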

### Final Model

In [28]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize, sent_tokenize
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from datetime import datetime
from sklearn.metrics.pairwise import cosine_similarity
import re

# Function to preprocess and normalize sentences
def preprocess_and_normalize_sentence(sentence):
    if isinstance(sentence, str):
        sentence = re.sub(r"\b(\w+)\s's\b", r"\1's", sentence)
        sentence = sentence.lower()
    else:
        sentence = str(sentence)
    return sentence

# Load your data
df = pd.read_csv('news_cleaned_no_spaces.csv', encoding='latin1')
df = df[:101]

event_sentences = df['gold_truth'].tolist()

# Prepare training data
documents = [TaggedDocument(words=word_tokenize(str(doc).lower()), tags=[i]) for i, doc in enumerate(event_sentences) if doc == doc]

# Train a Doc2Vec model
model = Doc2Vec(documents, vector_size=100, window=5, min_count=2, workers=4, epochs=20, dm=1)

# Create a DataFrame that's a copy of the original
predicted_df = df.copy()

# Add a new column 'output' initialized with NaN
predicted_df['output'] = np.nan

# Assume df['news_text'] contains the text from which you want to extract event sentences
news_text = df['news_text'].tolist()

# Initialize counters
TP = 0
FP = 0

# Assume golden_truth is a list of the actual events
golden_truth = df['gold_truth'].tolist()

# Preprocess and normalize the golden truth
normalized_golden_truth = [preprocess_and_normalize_sentence(sentence) for sentence in golden_truth]

# Collect the trained document vector for each gold event sentence
event_vectors = [model.dv[i] for i in range(len(model.dv))]

# Initialize an empty list to store the event sentences for each news text
event_sentences_for_each_news = []

# Iterate over the rows in the DataFrame
for idx, row in df.iterrows():
    # Break down the news text into sentences
    news_sentences = sent_tokenize(row['news_text'])
    # Infer a vector for each sentence in the news text
    news_vectors = [model.infer_vector(word_tokenize(str(sentence).lower())) for sentence in news_sentences]
    # Initialize an empty list to store the event sentences for this news text
    event_sentences_for_this_news = []
    # Iterate over the vectors for each sentence in the news text
    for i, vector in enumerate(news_vectors):
        # Calculate the cosine similarity between the vector and all vectors in the event vectors
        similarities = cosine_similarity([vector], event_vectors)
        # If the maximum similarity is above a certain threshold, add the corresponding news sentence to the event sentences for this news text
        if np.max(similarities) > 0.995:  # Increase the threshold
            event_sentences_for_this_news.append(news_sentences[i])
    # Join the sentences in event_sentences_for_this_news with a space and append it to event_sentences_for_each_news
    event_sentences_for_each_news.append(' '.join(event_sentences_for_this_news))

# Populate the 'output' column with the event sentences for each news text
predicted_df['output'] = event_sentences_for_each_news

# Initialize counters
TP = 0
FP = 0
FN = 0

# Iterate over the predicted sentences
for idx, output in enumerate(predicted_df['output']):
    # Break down the output into sentences
    output_sentences = sent_tokenize(output)
    # Preprocess and normalize the output sentences
    normalized_output_sentences = [preprocess_and_normalize_sentence(sentence) for sentence in output_sentences]
    # Get the vector for the golden truth
    golden_truth_vector = model.infer_vector(word_tokenize(str(golden_truth[idx]).lower()))
    # Initialize a flag to indicate if a true positive has been found
    TP_found = False
    # Iterate over the normalized output sentences
    for sentence in normalized_output_sentences:
        # Get the vector for the sentence
        sentence_vector = model.infer_vector(word_tokenize(sentence))
        # Calculate the cosine similarity between the golden truth vector and the sentence vector
        similarity = cosine_similarity([golden_truth_vector], [sentence_vector])
        # If the similarity is above a certain threshold, count it as a true positive and break the loop
        if np.max(similarity) > 0.994:  # You can adjust the threshold as needed
            TP += 1
            TP_found = True
            break
    # If no true positive was found, count it as a false positive
    if not TP_found:
        FP += 1
    # Count a false negative when the gold sentence is not a substring of any kept sentence.
    # Note: golden_truth[idx] is compared as-is, while the kept sentences were lowercased above.
    if not any(str(golden_truth[idx]) in sentence for sentence in normalized_output_sentences):
        FN += 1

# Calculate precision, recall, and F1 score
precision = TP / (TP + FP) if TP + FP > 0 else 0
recall = TP / (TP + FN) if TP + FN > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

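# Note: `similarities` here is the (1, n_events) array left over from the last sentence scored in the
# extraction loop, so this prints one value per event vector rather than a single average.
average_similarity = sum(similarities) / len(similarities)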

print(f'Average Cosine Similarity: {average_similarity}')
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

# Save to CSV
predicted_df.to_csv(f'predicted_sentences_doc2vec{timestamp}.csv', index=False)

Precision: 0.2376237623762376
Recall: 0.192
F1 Score: 0.21238938053097345
Average Cosine Similarity: [0.95545834 0.97173965 0.99232304 0.99176323 0.983709   0.9751125
 0.99104214 0.9893758  0.9764856  0.9867459  0.99170744 0.9860076
 0.9889772  0.9881569  0.9500444  0.97948176 0.99188113 0.9920124
 0.99227077 0.9927397  0.9884864  0.9884565  0.9680052  0.9837416
 0.99145174 0.9896695  0.9897349  0.9887555  0.98674065 0.98668814
 0.98846817 0.98965216 0.9907625  0.9905275  0.9898803  0.99137473
 0.9785446  0.98943657 0.9612519  0.99118966 0.9895593  0.98613226
 0.8903091  0.9495916  0.9887185  0.97967005 0.97980666 0.9924343
 0.99143946 0.99022436 0.98589087 0.9904437  0.9866575  0.99009085
 0.99183273 0.9882486  0.98060626 0.9899615  0.98451376 0.9907757
 0.99113476 0.9870413  0.99122536 0.9912088  0.98177654 0.9782986
 0.98950666 0.9808341  0.9790357  0.9781259  0.9896027  0.99111485
 0.9893054  0.9908992  0.98903704 0.9898786  0.99194705 0.98986155
 0.99214077 0.9889309  0.9911332  0