# ---------------------------------------------------------------------------------------------------------------------------------------------



# Subset 1: Small Size, Low Quality (20 manual data)

# Subset 2: Medium Size, Low Quality (50 manual data)

# Subset 3: Large Size, Low Quality (100 manual data)

# Subset 4: Small Size, High Quality (20 manual data)

# Subset 5: Large Size, High Quality (100 manual data)

# ---------------------------------------------------------------------------------------------------------------------------------------------

# **NLP Application Models**


# 1.   Text Classification
# 2.   Named Entity Recognition
# 3.   Part of Speech Tagging
# 4.   Sentiment Analysis
# 5.   Text Summarizer


# ---------------------------------------------------------------------------------------------------------------------------------------------

# **Subset 1**

## Text Classification

This model is a text classification application built with spaCy; its main objective is to classify emails as ‘SPAM’ or ‘HAM’. First, a blank spaCy pipeline is created, and a text classifier component is added with two labels, SPAM and HAM. The dataset consists of 20 short, email-like texts mapped to their categories, and the only preprocessing applied is lowercasing. The data is then split into three portions: 60% for training, 20% for validation, and 20% for testing.
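
The 60/20/20 split can be reached with two successive hold-outs: setting aside 20% first and then 25% of the remaining 80% leaves 60% for training. The sketch below only illustrates that arithmetic in plain Python; the helper name `two_stage_split` and the simple slicing without shuffling are assumptions for the example, not the notebook's `train_test_split` calls.

```python
def two_stage_split(items, test_frac=0.2, val_frac_of_rest=0.25):
    """Split items into train/val/test using two successive hold-outs."""
    # First hold-out: the last test_frac of the items becomes the test set
    n_test = round(len(items) * test_frac)
    rest, test = items[:len(items) - n_test], items[len(items) - n_test:]
    # Second hold-out: a fraction of what remains becomes the validation set
    n_val = round(len(rest) * val_frac_of_rest)
    train, val = rest[:len(rest) - n_val], rest[len(rest) - n_val:]
    return train, val, test

samples = list(range(20))  # stand-ins for the 20 labelled emails
train, val, test = two_stage_split(samples)
print(len(train), len(val), len(test))  # 12 4 4
```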

Because the dataset is small, the model is trained in small batches, and the loss is printed after each iteration to monitor progress. Once training ends, the model can be tried on user input or on available samples, and it returns a probability for each label.

The code also defines a function that uses the trained model to predict whether a new email is SPAM or HAM. Finally, the model is evaluated on the test set to obtain accuracy, precision, recall, and F1 score, which makes it possible to assess its performance numerically. In addition, the user can type in text interactively and the model returns its predicted class. In total, the dataset contains 20 short labelled samples, of which 12 are used for training after the split.
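
For reference, the four metrics can be computed directly from the true and predicted labels. The sketch below is a plain-Python illustration for the binary case with SPAM as the positive class (1). The function name `binary_metrics` and the sample labels are made up for the example; the notebook itself relies on scikit-learn with `average='weighted'`, which can give different numbers.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Illustrative labels only (not the notebook's test set)
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]
accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, round(f1, 4))
```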

In [None]:
import spacy
from spacy.training import Example
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import re

# Create a blank SpaCy model and add the text classifier component
nlpTC = spacy.blank("en")
textcat = nlpTC.add_pipe("textcat")

# Add labels for classification
textcat.add_label("SPAM")
textcat.add_label("HAM")

# Define a minimal preprocessing function
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    return text

# Example training data
train_data = [
    ("This is spam", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Hello, how are you?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You WON a million dollars!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Claim YOUR free PRIZE now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Meeting at 10 AM tomorrow", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your INVOICE is attached", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("EXCLUSIVE offer just for you!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Get a FREE iPhone today", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we reschedule our call?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Update your account details", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Limited time deal, buy now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Your PACKAGE has been shipped", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Win a TRIP to Hawaii now", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Important meeting AGENDA", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Congratulations! You've been SELECTED", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we DISCUSS this project?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Did you catch the bus ? Are you frying an egg ? Did you make a tea? Are you eating your mom's left over dinner ? Do you feel my Love ?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Hello! How's you and how did saturday go? I was just texting to see if you'd decided to do anything tomo. Not that i'm trying to invite myself or anything!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I am waiting machan. Call me once you free.", {"cats": {"SPAM": 0, "HAM": 1}})
]

# Apply minimal text preprocessing
train_data = [(preprocess_text(text), annotations) for text, annotations in train_data]

# Prepare training data into SpaCy's Example format
train_examples = []
for text, annotations in train_data:
    doc = nlpTC.make_doc(text)
    example = Example.from_dict(doc, annotations)
    train_examples.append(example)

# Split data into training, validation, and testing sets
train_examples, test_examples = train_test_split(train_examples, test_size=0.2, random_state=42)
train_examples, val_examples = train_test_split(train_examples, test_size=0.25, random_state=42)  # 25% of the remaining 80% = 20% of the full dataset, used for validation

# Print the split data to visualize each set
print("TRAINING SET (60% of the data):")
for example in train_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

print("\nVALIDATION SET (20% of the data):")
for example in val_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

print("\nTESTING SET (20% of the data):")
for example in test_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

# Training the model with more iterations for small datasets
n_iter = 10  # Set iterations
optimizer = nlpTC.initialize()

for i in range(n_iter):
    losses = {}
    # Update the model on each mini-batch as a whole rather than one example at a time
    for batch in spacy.util.minibatch(train_examples, size=2):  # Small batch size for small data
        nlpTC.update(batch, sgd=optimizer, losses=losses)
    print(f"Iteration {i+1}/{n_iter} - Loss: {losses['textcat']}")

# Testing the model
print("\nSample Prediction Output with probabilities:")
doc = nlpTC("Claim your prize now!")
print(doc.cats)

# Function to classify user input emails
def classify_email(email):
    email = preprocess_text(email)
    doc = nlpTC(email)
    spam_score = doc.cats['SPAM']
    ham_score = doc.cats['HAM']

    if spam_score > ham_score:
        return "SPAM"
    else:
        return "HAM"

# Calculate accuracy, precision, recall, and F1 score on the test set
true_labels = [1 if example.reference.cats['SPAM'] == 1 else 0 for example in test_examples]
predicted_labels = [1 if classify_email(example.reference.text) == 'SPAM' else 0 for example in test_examples]

# Calculate and print metrics
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='weighted')
recall = recall_score(true_labels, predicted_labels, average='weighted')
f1 = f1_score(true_labels, predicted_labels, average='weighted')

# Display results
print(f"\nAccuracy: {accuracy * 100:.4f}%")
print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

# Allow users to test the model by inputting their own data
while True:
    user_input = input("\nEnter a sample email for classification (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    classification = classify_email(user_input)
    print(f"The email is classified as: {classification}")


TRAINING SET (60% of the data):
Text: limited time deal, buy now! - Label: {'SPAM': 1, 'HAM': 0}
Text: win a trip to hawaii now - Label: {'SPAM': 1, 'HAM': 0}
Text: update your account details - Label: {'SPAM': 0, 'HAM': 1}
Text: i am waiting machan. call me once you free. - Label: {'SPAM': 0, 'HAM': 1}
Text: your package has been shipped - Label: {'SPAM': 0, 'HAM': 1}
Text: exclusive offer just for you! - Label: {'SPAM': 1, 'HAM': 0}
Text: hello! how's you and how did saturday go? i was just texting to see if you'd decided to do anything tomo. not that i'm trying to invite myself or anything! - Label: {'SPAM': 0, 'HAM': 1}
Text: you won a million dollars! - Label: {'SPAM': 1, 'HAM': 0}
Text: meeting at 10 am tomorrow - Label: {'SPAM': 0, 'HAM': 1}
Text: get a free iphone today - Label: {'SPAM': 1, 'HAM': 0}
Text: claim your free prize now! - Label: {'SPAM': 1, 'HAM': 0}
Text: important meeting agenda - Label: {'SPAM': 0, 'HAM': 1}

VALIDATION SET (20% of the data):
Text: can we resche

## Named Entity Recognition

This model assesses named entity recognition (NER) using spaCy’s pre-trained `en_core_web_sm` model. The data comprises 20 sentences containing various named entities, including organizations, locations, dates, and people. Each sentence is annotated with its entities, given as the start and end character indexes of each entity together with its label. A lowercased copy of the texts is prepared to standardize them, although the evaluation loop in the code runs the model on the original-case `training_data`. The NER model processes each sentence and predicts its entities by recognizing and labeling spans of text. For evaluation, both the manually annotated ground-truth entities and the entities predicted by spaCy are collected.
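
Because each annotation is a character span, a quick way to inspect one is to slice the sentence with it. The sketch below does this for the first sentence of the data; the helper name `show_spans` is an assumption for illustration.

```python
def show_spans(text, entities):
    """Return the substring covered by each (start, end, label) annotation."""
    return [(text[start:end], label) for start, end, label in entities]

text = "Microsoft announced a new AI initiative in Seattle."
entities = [(0, 9, "ORG"), (39, 46, "GPE")]

spans = show_spans(text, entities)
for span, label in spans:
    print(f"{label}: {span!r}")

# Offsets that would cover the word "Seattle" exactly
start = text.find("Seattle")
print((start, start + len("Seattle"), "GPE"))
```

If a printed span differs from the intended entity, that annotation cannot match a model prediction under exact-offset comparison.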

To assess the model, each predicted entity is converted into a binary label: 1 if its start index, end index, and label all match a ground-truth entity, and 0 otherwise. Precision, recall, and F1 are then calculated from these matches. Precision is the fraction of the entities returned by the model that are correct, recall is the fraction of the ground-truth entities that the model retrieves, and the F1 score is the harmonic mean of precision and recall. Note that the code scores only the predicted entities and marks every one of them as a positive prediction, so the recall it reports is always 100%; precision is the informative figure here.
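
An alternative way to score NER, sketched here under assumptions, is to compare the sets of gold and predicted entities so that missed entities lower recall. Each entity is keyed by a sentence index in addition to its offsets and label, which keeps identical offsets in different sentences apart. The function name `entity_prf` and the toy data are hypothetical, and this is not the evaluation the notebook runs.

```python
def entity_prf(gold, pred):
    """Entity-level precision, recall and F1 from two collections of entity tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # entities found with exactly matching sentence, span and label
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Tuples are (sentence_index, start_char, end_char, label); values are illustrative
gold = [(0, 0, 9, "ORG"), (0, 43, 50, "GPE"), (1, 0, 13, "PERSON"), (1, 28, 34, "ORG")]
pred = [(0, 0, 9, "ORG"), (1, 0, 13, "PERSON"), (1, 21, 24, "ORG")]

precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```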

In [2]:
import spacy
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load a pre-trained NER model
nlp = spacy.load("en_core_web_sm")

# Sample training data (text and true entity annotations)
training_data = [
    ("Microsoft announced a new AI initiative in Seattle.", [(0, 9, "ORG"), (39, 46, "GPE")]),
    ("Google I/O will take place in May 2023.", [(0, 10, "EVENT"), (29, 37, "DATE")]),
    ("The unemployment rate in the U.S. dropped to 3.5%.", [(34, 38, "PERCENT"), (27, 31, "GPE")]),
    ("The Chinese economy grew by 5% last year.", [(4, 11, "NORP")]),
    ("Sundar Pichai is the CEO of Google.", [(0, 13, "PERSON"), (28, 34, "ORG")]),
    ("Tesla secured $2 billion in new funding.", [(14, 22, "MONEY")]),
    ("Amazon is opening a new office in Vancouver.", [(0, 6, "ORG"), (36, 45, "GPE")]),
    ("Samsung released its new Galaxy S22 phone.", [(0, 7, "ORG"), (23, 32, "PRODUCT")]),
    ("The Pacific Ocean is the largest body of water on Earth.", [(4, 17, "LOC")]),
    ("The headquarters of IBM is in New York City.", [(21, 24, "ORG"), (31, 44, "GPE")]),

    ("Satya Nadella leads Microsoft Corporation.", [(0, 12, "PERSON"), (19, 38, "ORG")]),
    ("The FIFA World Cup will be held in Qatar in 2022.", [(4, 18, "EVENT"), (34, 39, "GPE"), (43, 47, "DATE")]),
    ("Apple plans to invest $10 billion in manufacturing.", [(23, 32, "MONEY")]),
    ("A new skyscraper is being built in Dubai.", [(33, 38, "GPE")]),
    ("70% of the world's population is now online.", [(0, 3, "PERCENT")]),
    ("Elon Musk founded SpaceX and Tesla.", [(0, 9, "PERSON"), (17, 23, "ORG"), (28, 33, "ORG")]),
    ("The startup raised $50 million in Series B.", [(15, 25, "MONEY")]),
    ("The next Apple event is scheduled for March 25th.", [(9, 14, "ORG"), (39, 48, "DATE")]),
    ("The new company is aiming for a 15% market share.", [(28, 31, "PERCENT")]),
    ("Apple's iPhone 14 is expected to launch in 2023.", [(0, 5, "ORG"), (7, 15, "PRODUCT"), (46, 50, "DATE")])
]

# Preprocess: build a lowercased copy of the texts
# (note: the evaluation loop below iterates over the original-case training_data)
preprocessed_data = [(text.lower(), entities) for text, entities in training_data]

# Initialize lists for storing true and predicted entities
all_true_entities = []
all_pred_entities = []

# Iterate through training data
for text, true_entities in training_data:
    # Run NER model
    doc = nlp(text)

    # Predicted entities from the model as (start_char, end_char, label) tuples
    pred_entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

    # Store true and predicted entities for all examples
    all_true_entities.extend(true_entities)
    all_pred_entities.extend(pred_entities)

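# Convert to binary labels over the predicted entities:
# y_true is 1 when a predicted (start, end, label) tuple also appears in the annotations, 0 otherwise
y_true = [1 if ent in all_true_entities else 0 for ent in all_pred_entities]
# Every predicted entity counts as a positive prediction, so recall computed this way is always 100%
y_pred = [1 for _ in all_pred_entities]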

# Calculate Precision, Recall, F1
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

Precision: 36.8421%
Recall: 100.0000%
F1 Score: 53.8462%


## Part of Speech Tagging

This model tags parts of speech using spaCy's pre-trained, built-in POS tagger. The dataset consists of 20 sentences with manually assigned POS tags for every word. The text is lowercased and shuffled so that the training, validation, and test sets are assigned at random. 60% of the data is assigned to training, 20% to validation, and the remaining 20% to testing.

The core of the code is the `process_data` function, which takes each sentence in a dataset, processes it with spaCy, obtains the POS tags predicted by the model, and collects them alongside the true tags for comparison. The `ensure_equal_length` function guarantees that the true and predicted tag lists are the same size: any excess in the longer list is cut off so that the metric functions do not raise an error.
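
One detail worth checking is that spaCy emits a tag for every token, including punctuation, while the manual tag lists in this section cover only words. The sketch below uses a hand-written stand-in prediction rather than real model output. It drops `PUNCT` tags per sentence before comparing, so that a length mismatch in one sentence does not shift the tags of the next. The function name `align_sentence` is hypothetical.

```python
def align_sentence(true_tags, predicted_tags):
    """Drop punctuation tags from the prediction and pair tags position by position.

    Returns None when the lengths still differ, so the sentence can be
    inspected instead of being silently truncated.
    """
    words_only = [tag for tag in predicted_tags if tag != "PUNCT"]
    if len(words_only) != len(true_tags):
        return None
    return list(zip(true_tags, words_only))

true_tags = ['PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN']
# Stand-in prediction (hypothetical, not actual spaCy output) with a trailing PUNCT tag
predicted = ['PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT']

pairs = align_sentence(true_tags, predicted)
correct = sum(1 for gold, guess in pairs if gold == guess)
print(f"{correct}/{len(pairs)} tags match")
```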

Accuracy, precision, recall, and F1 score are calculated separately for the training, validation, and test sets using the `evaluate_metrics` function. The results of the three sets are then combined by taking the mean of each metric, giving an overall assessment of the model’s effectiveness. Finally, the averaged accuracy, precision, recall, and F1 score for POS tagging are printed.
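
With `average='weighted'`, each class's recall is weighted by its share of the true labels, which makes weighted recall equal to accuracy; this is why the two figures coincide in the printed output. The sketch below checks that identity in plain Python on made-up tags. The function names are assumptions for illustration and not scikit-learn's implementation.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted tag equals the true tag."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with weights proportional to class support."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Recall for class c is hits[c] / support[c]; its weight is support[c] / total
    return sum((support[c] / len(y_true)) * (hits[c] / support[c]) for c in support)

# Illustrative tags only
y_true = ['DET', 'NOUN', 'VERB', 'NOUN', 'ADJ']
y_pred = ['DET', 'NOUN', 'NOUN', 'VERB', 'ADJ']

acc = accuracy(y_true, y_pred)
w_rec = weighted_recall(y_true, y_pred)
print(acc, w_rec)
```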

In [None]:
import spacy
import random
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load spaCy's POS tagging model
nlp = spacy.load("en_core_web_sm")

# Sample expanded training data: list of (text, true_pos_tags) pairs
training_data = [
    ("She sells seashells by the seashore.", ['PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN']),
    ("The quick brown fox jumps over the lazy dog.", ['DET', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'ADP', 'DET', 'ADJ', 'NOUN']),
    ("I love coding in Python.", ['PRON', 'VERB', 'NOUN', 'ADP', 'PROPN']),
    ("Birds fly in the sky.", ['NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("Alice and Bob went to the market.", ['PROPN', 'CCONJ', 'PROPN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("Reading books is fun.", ['VERB', 'NOUN', 'AUX', 'ADJ']),
    ("My car is very fast.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("We are going to the zoo.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("It is raining today.", ['PRON', 'AUX', 'VERB', 'NOUN']),
    ("Programming languages are interesting.", ['NOUN', 'NOUN', 'AUX', 'ADJ']),

    ("The cat sleeps on the mat.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("John likes to play soccer.", ['PROPN', 'VERB', 'PART', 'VERB', 'NOUN']),
    ("She is learning French.", ['PRON', 'AUX', 'VERB', 'PROPN']),
    ("The weather is nice today.", ['DET', 'NOUN', 'AUX', 'ADJ', 'NOUN']),
    ("He bought a new laptop yesterday.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN', 'NOUN']),
    ("They are swimming in the pool.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("The pizza smells delicious.", ['DET', 'NOUN', 'VERB', 'ADJ']),
    ("Can you help me with this project?", ['AUX', 'PRON', 'VERB', 'PRON', 'ADP', 'DET', 'NOUN']),
    ("This task is quite difficult.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("He enjoys reading books.", ['PRON', 'VERB', 'VERB', 'NOUN']),
]

# Preprocess text: convert to lowercase and shuffle the training data
training_data = [(text.lower(), tags) for text, tags in training_data]
random.shuffle(training_data)

# Split data into training, validation, and test sets (60% train, 20% validation, 20% test)
train_data, temp_data = train_test_split(training_data, test_size=0.4, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Initialize lists to store true and predicted POS tags for all sets
all_true_pos_tags_train, all_predicted_pos_tags_train = [], []
all_true_pos_tags_val, all_predicted_pos_tags_val = [], []
all_true_pos_tags_test, all_predicted_pos_tags_test = [], []

# Function to process data and evaluate POS tagging
def process_data(data, all_true_pos_tags, all_predicted_pos_tags):
    for text, true_pos_tags in data:
        # Process the text with spaCy
        doc = nlp(text)
        # Extract predicted POS tags
        predicted_pos_tags = [token.pos_ for token in doc]
        # Extend lists with true and predicted tags for evaluation
        all_true_pos_tags.extend(true_pos_tags)
        all_predicted_pos_tags.extend(predicted_pos_tags)

# Process training, validation, and test data
process_data(train_data, all_true_pos_tags_train, all_predicted_pos_tags_train)
process_data(val_data, all_true_pos_tags_val, all_predicted_pos_tags_val)
process_data(test_data, all_true_pos_tags_test, all_predicted_pos_tags_test)

# Ensure both lists are the same length to avoid ValueError
def ensure_equal_length(true_tags, predicted_tags):
    if len(true_tags) != len(predicted_tags):
        min_length = min(len(true_tags), len(predicted_tags))
        true_tags = true_tags[:min_length]
        predicted_tags = predicted_tags[:min_length]
    return true_tags, predicted_tags

# Ensure correct lengths for all sets
all_true_pos_tags_train, all_predicted_pos_tags_train = ensure_equal_length(all_true_pos_tags_train, all_predicted_pos_tags_train)
all_true_pos_tags_val, all_predicted_pos_tags_val = ensure_equal_length(all_true_pos_tags_val, all_predicted_pos_tags_val)
all_true_pos_tags_test, all_predicted_pos_tags_test = ensure_equal_length(all_true_pos_tags_test, all_predicted_pos_tags_test)

# Function to calculate metrics for a dataset
def evaluate_metrics(true_tags, predicted_tags):
    accuracy = accuracy_score(true_tags, predicted_tags)
    precision = precision_score(true_tags, predicted_tags, average='weighted')
    recall = recall_score(true_tags, predicted_tags, average='weighted')
    f1 = f1_score(true_tags, predicted_tags, average='weighted')
    return accuracy, precision, recall, f1

# Evaluate on training, validation, and test sets
metrics_train = evaluate_metrics(all_true_pos_tags_train, all_predicted_pos_tags_train)
metrics_val = evaluate_metrics(all_true_pos_tags_val, all_predicted_pos_tags_val)
metrics_test = evaluate_metrics(all_true_pos_tags_test, all_predicted_pos_tags_test)

# Average each metric across the training, validation, and test sets
total_accuracy = (metrics_train[0] + metrics_val[0] + metrics_test[0]) / 3
total_precision = (metrics_train[1] + metrics_val[1] + metrics_test[1]) / 3
total_recall = (metrics_train[2] + metrics_val[2] + metrics_test[2]) / 3
total_f1 = (metrics_train[3] + metrics_val[3] + metrics_test[3]) / 3

# Print consolidated metrics
print("Consolidated Metrics across Training, Validation, and Test Data:")
print(f"Accuracy: {total_accuracy * 100:.4f}%")
print(f"Precision: {total_precision * 100:.4f}%")
print(f"Recall: {total_recall * 100:.4f}%")
print(f"F1 Score: {total_f1 * 100:.4f}%")

Consolidated Metrics across Training, Validation, and Test Data:
Accuracy: 30.7145%
Precision: 36.3046%
Recall: 30.7145%
F1 Score: 32.9712%




## Sentiment Analysis

This model performs sentiment analysis using a Naive Bayes classifier from the scikit-learn library. The code first instantiates a blank spaCy model and adds a text classification component with the labels POSITIVE and NEGATIVE, although the results reported in this section come from the Naive Bayes classifier. The training data consists of sentences, each paired with a sentiment label given in dict format.

Before the model is trained, all sentences are converted to lowercase. Next, the text is transformed into vectors using `CountVectorizer`, which turns each sentence into word counts that can be processed algorithmically. The data is then divided into two halves, one used for training and the other for testing.
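
To make the vectorization step concrete, here is a rough plain-Python sketch of a bag-of-words count matrix. It splits on whitespace only, whereas scikit-learn's `CountVectorizer` applies its own tokenization rules, so treat it as an illustration of the idea rather than a drop-in equivalent. `bag_of_words` is a hypothetical helper.

```python
from collections import Counter

def bag_of_words(texts):
    """Build a sorted vocabulary and one row of word counts per text."""
    vocab = sorted({word for text in texts for word in text.split()})
    matrix = []
    for text in texts:
        counts = Counter(text.split())
        # One column per vocabulary word; Counter returns 0 for absent words
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

texts = ["the movie was great", "the movie was a waste"]
vocab, matrix = bag_of_words(texts)
print(vocab)
for row in matrix:
    print(row)
```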

A Naive Bayes model is fitted to the vectorized training data, and predictions are made for the test data. The results are reported in terms of accuracy, precision, recall, and F1 score. Lastly, these measures are displayed to assess how well the model classifies sentiment.
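
As a rough picture of what a multinomial Naive Bayes classifier does with word counts, the sketch below scores each class by its log prior plus the smoothed log likelihood of each word. It is a simplified illustration with whitespace tokenization and add-one smoothing, not scikit-learn's `MultinomialNB`; the function names `train_nb` and `predict_nb` and the tiny dataset are assumptions for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Collect the counts needed for multinomial Naive Bayes with add-alpha smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, text):
    """Return the label with the highest log posterior score."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior of the class
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny illustrative dataset: 1 = positive, 0 = negative
texts = ["so happy with my new job", "a complete waste of time",
         "absolutely delicious food", "i regret spending money"]
labels = [1, 0, 1, 0]
model = train_nb(texts, labels)
print(predict_nb(model, "happy with the food"))
```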

In [None]:
import spacy
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load a blank model and add text classifier
nlpTC = spacy.blank("en")
textcat = nlpTC.add_pipe("textcat")

# Add labels for classification
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

train_data = [
    ("I'm so frustrated with how slow my internet is.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I'm so happy with my new job!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The customer service at that store is excellent.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The movie was a complete waste of time.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That movie was truly heartwarming and beautiful.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been feeling really down lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The food at the new restaurant was absolutely delicious.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t get the job, and I feel so defeated.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The sunset this evening was breathtaking.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m really upset that I missed the deadline.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("I finally finished the book, and it was such a rewarding read.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("This weather is terrible, I can’t wait for it to end.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The surprise party was such a success!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My laptop crashed again, and I lost all my work.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The concert was absolutely mind-blowing!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been struggling with my workload lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The flowers you sent me are absolutely stunning.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I regret spending money on that product.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just found out I won the contest! I’m over the moon.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Everything seems to be going wrong lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
]

# Lowercase the texts
train_data = [(text.lower(), labels) for text, labels in train_data]


In [None]:
# Extract text data from train_data
text = [data[0] for data in train_data]
labels = [data[1]['cats']['POSITIVE'] for data in train_data] # Extract labels

# Vectorize text data using the extracted text list
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)

# Train a Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict sentiments
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy * 100:.4f}%")
print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

Accuracy: 70.0000%
Precision: 60.0000%
Recall: 75.0000%
F1 Score: 66.6667%


## Text Summarizer

This model performs text summarization using the natural language processing tool spaCy. The goal is to create a summary of a piece of text and compare it against a provided reference summary. The summarization code is applied to a text about climate change: the causes of the phenomenon, the effects it can have, and the need to act swiftly to mitigate such detrimental changes; how to put together everything in the…

The `extractive_summary` function converts the input text to lowercase, splits it into sentences, and returns the first few sentences as the summary. Once the summary has been created, it is compared with a preset reference summary to derive precision, recall, and F1 score, which are reported as percentages. Because the comparison requires a sentence of the source text to appear verbatim in the reference summary, and the reference here is a paraphrase, no sentence matches and all three scores come out as 0%.
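
Exact sentence matching gives no credit to a summary that conveys the same content in different words. One common alternative, swapped in here purely as an illustration and not part of the notebook, is to score the overlap of word tokens between the generated and reference summaries, in the spirit of ROUGE-1. The function name `unigram_overlap` and the whitespace tokenization are assumptions.

```python
from collections import Counter

def unigram_overlap(generated, reference):
    """Precision, recall and F1 of word overlap between two texts."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps, for each word, the smaller of the two counts
    overlap = sum((gen & ref).values())
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Shortened illustrative inputs
generated = "climate change is one of the most pressing issues of our time."
reference = "climate change is caused by greenhouse gases."

precision, recall, f1 = unigram_overlap(generated, reference)
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall: {recall * 100:.2f}%")
print(f"F1 Score: {f1 * 100:.2f}%")
```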

This summarization approach can easily be applied to other subsets of data using the same code structure, which makes it useful for many text analytics tasks. The code will also work for different topics or datasets.


In [None]:
import spacy
from sklearn.metrics import precision_score, recall_score, f1_score
from nltk.tokenize import sent_tokenize
import nltk

# Download the punkt tokenizer if not already downloaded
nltk.download('punkt')

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Example text and reference summary
text = """Climate change is one of the most pressing issues of our time. The increasing levels of greenhouse gases in the atmosphere
have led to rising global temperatures. As a result, glaciers are melting, sea levels are rising, and extreme weather events
are becoming more frequent. Many governments around the world have pledged to reduce carbon emissions, but progress has been slow.
Renewable energy sources such as solar and wind power offer hope, but their adoption has not been widespread enough to make a significant impact yet.
Urgent action is needed to address this global crisis before it’s too late."""

reference_summary = """Climate change is caused by greenhouse gases and is leading to rising temperatures and extreme weather.
Renewable energy offers hope, but its adoption is slow."""

# Extractive summarization function with lowercase preprocessing
def extractive_summary(text, num_sentences=3):
    doc = nlp(text.lower())  # Convert text to lowercase before processing
    sentences = [sent.text.lower() for sent in doc.sents]  # Convert sentences to lowercase
    return sentences[:num_sentences]  # Return the first `num_sentences` as the summary

# Tokenizing the reference and generated summaries into sentences
generated_summary = extractive_summary(text)  # Summary in lowercase
reference_sentences = [sent.lower() for sent in sent_tokenize(reference_summary)]  # Reference in lowercase

# Convert to binary relevance: 1 if the sentence appears in the reference summary, 0 otherwise
y_true = [1 if sent in reference_sentences else 0 for sent in sent_tokenize(text.lower())]  # Compare with reference
y_pred = [1 if sent in generated_summary else 0 for sent in sent_tokenize(text.lower())]  # Compare with generated summary

# Ensure y_true and y_pred are of the same length
if len(y_true) != len(y_pred):
    min_length = min(len(y_true), len(y_pred))
    y_true = y_true[:min_length]
    y_pred = y_pred[:min_length]

# Calculate precision, recall, and F1 score
precision = precision_score(y_true, y_pred) * 100  # Convert to percentage
recall = recall_score(y_true, y_pred) * 100  # Convert to percentage
f1 = f1_score(y_true, y_pred) * 100  # Convert to percentage

# Output results
print(f"Generated Summary: {' '.join(generated_summary)}")
print(f"Precision: {precision:.2f}%")
print(f"Recall: {recall:.2f}%")
print(f"F1 Score: {f1:.2f}%")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Generated Summary: climate change is one of the most pressing issues of our time. the increasing levels of greenhouse gases in the atmosphere 
have led to rising global temperatures. as a result, glaciers are melting, sea levels are rising, and extreme weather events 
are becoming more frequent.
Precision: 0.00%
Recall: 0.00%
F1 Score: 0.00%




# ---------------------------------------------------------------------------------------------------------------------------------------------

# **Subset 2**

## Text Classification

The code is the same as in Subset 1, but the training data consists of 50 short emails.

In [None]:
import spacy
from spacy.training import Example
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import re

# Create a blank SpaCy model and add the text classifier component
nlpTC = spacy.blank("en")
textcat = nlpTC.add_pipe("textcat")

# Add labels for classification
textcat.add_label("SPAM")
textcat.add_label("HAM")

# Define a minimal preprocessing function
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    return text


# Example training data
train_data = [
    ("This is spam", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Hello, how are you?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You won a million dollars!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Claim your free prize now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Meeting at 10 AM tomorrow", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your invoice is attached", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Exclusive offer just for you!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Get a free iPhone today", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we reschedule our call?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Update your account details", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Limited time deal, buy now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Your package has been shipped", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Win a trip to Hawaii now", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Important meeting agenda", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Congratulations! You've been selected", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we discuss this project?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Did you catch the bus ? Are you frying an egg ? Did you make a tea? Are you eating your mom's left over dinner ? Do you feel my Love ?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Hello! How's you and how did saturday go? I was just texting to see if you'd decided to do anything tomo. Not that i'm trying to invite myself or anything!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I am waiting machan. Call me once you free.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Sorry, I'll call later", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("You will be in the place of that man", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Please don't text me anymore. I have nothing else to say.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Thanks a lot for your wishes on my birthday. Thanks you for making my birthday truly memorable.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Aight, I'll hit you up when I get some cash", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Dont worry. I guess he's busy.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a £1500 Bonus Prize, call 09066364589", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Good stuff, will do.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("What time you coming down later?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Sounds great! Are you home now?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Finally the match heading towards draw as your prediction.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Tired. I haven't slept well the past few nights.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Where are you?when wil you reach here?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Please call our customer service representative on FREEPHONE 0808 145 4742 between 9am-11pm as you have WON a guaranteed £1000 cash or £5000 prize!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("What you doing? how are you?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I'm back, lemme know when you're ready", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Lose 20 pounds in just 2 weeks with our miracle weight loss pills! 100% natural and safe. Order today and get a special discount: www.weightlosspills.com. Hurry, offer expires soon!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Awesome, I'll see you in a bit", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Just sent it. So what type of food do you like?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I accidentally deleted the message. Resend please.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("FREE MESSAGE Activate your 500 FREE Text Messages by replying to this message with the word FREE For terms & conditions, visit www.07781482378.com", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("I cant pick the phone right now. Pls send a message", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("He says he'll give me a call when his friend's got the money but that he's definitely buying before the end of the week", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You made my day. Do have a great day too.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Great news! You have been pre-approved for a personal loan of $10,000 with a low-interest rate! No credit check required. Apply today at www.getmymoney.com and get instant cash!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("No problem. How are you doing?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Just sleeping and surfing", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Busy here. Trying to finish for new year. I am looking forward to finally meeting you", {"cats": {"SPAM": 0, "HAM": 1}})
]

# Apply minimal text preprocessing
train_data = [(preprocess_text(text), annotations) for text, annotations in train_data]

# Prepare training data into SpaCy's Example format
train_examples = []
for text, annotations in train_data:
    doc = nlpTC.make_doc(text)
    example = Example.from_dict(doc, annotations)
    train_examples.append(example)

# Split data into training, validation, and testing sets
train_examples, test_examples = train_test_split(train_examples, test_size=0.2, random_state=42)
train_examples, val_examples = train_test_split(train_examples, test_size=0.25, random_state=42)  # 25% of the remaining 80% = 20% of the full dataset for validation

# Print the split data to visualize each set
print("TRAINING SET (60% of the data):")
for example in train_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

print("\nVALIDATION SET (20% of the data):")
for example in val_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

print("\nTESTING SET (20% of the data):")
for example in test_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

# Train the model; several passes over the data help with such a small dataset
n_iter = 10  # Number of training iterations
optimizer = nlpTC.initialize()

for i in range(n_iter):
    losses = {}
    for batch in spacy.util.minibatch(train_examples, size=2):  # Mini-batches of 2 examples
        # Each example is passed to update() on its own, so the effective batch size is 1
        for example in batch:
            nlpTC.update([example], sgd=optimizer, losses=losses)
    print(f"Iteration {i+1}/{n_iter} - Loss: {losses['textcat']}")

# Testing the model
print("\nSample Prediction Output with probabilities:")
doc = nlpTC("Claim your prize now!")
print(doc.cats)

# Function to classify user input emails
def classify_email(email):
    email = preprocess_text(email)
    doc = nlpTC(email)
    spam_score = doc.cats['SPAM']
    ham_score = doc.cats['HAM']

    if spam_score > ham_score:
        return "SPAM"
    else:
        return "HAM"

# Calculate accuracy, precision, recall, and F1 score on the test set
true_labels = [1 if example.reference.cats['SPAM'] == 1 else 0 for example in test_examples]
predicted_labels = [1 if classify_email(example.reference.text) == 'SPAM' else 0 for example in test_examples]

# Calculate and print metrics
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='weighted')
recall = recall_score(true_labels, predicted_labels, average='weighted')
f1 = f1_score(true_labels, predicted_labels, average='weighted')

# Display results
print(f"\nAccuracy: {accuracy * 100:.4f}%")
print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

# Allow users to test the model by inputting their own data
while True:
    user_input = input("\nEnter a sample email for classification (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    classification = classify_email(user_input)
    print(f"The email is classified as: {classification}")


TRAINING SET (60% of the data):
Text: you won a million dollars! - Label: {'SPAM': 1, 'HAM': 0}
Text: can we discuss this project? - Label: {'SPAM': 0, 'HAM': 1}
Text: where are you?when wil you reach here? - Label: {'SPAM': 0, 'HAM': 1}
Text: you will be in the place of that man - Label: {'SPAM': 0, 'HAM': 1}
Text: sorry, i'll call later - Label: {'SPAM': 0, 'HAM': 1}
Text: no problem. how are you doing? - Label: {'SPAM': 0, 'HAM': 1}
Text: dont worry. i guess he's busy. - Label: {'SPAM': 0, 'HAM': 1}
Text: winner!! as a valued network customer you have been selected to receivea £900 prize reward! to claim call 09061701461. claim code kl341. valid 12 hours only. - Label: {'SPAM': 1, 'HAM': 0}
Text: win a trip to hawaii now - Label: {'SPAM': 1, 'HAM': 0}
Text: busy here. trying to finish for new year. i am looking forward to finally meeting you - Label: {'SPAM': 0, 'HAM': 1}
Text: exclusive offer just for you! - Label: {'SPAM': 1, 'HAM': 0}
Text: did you catch the bus ? are you frying 
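
The 60/20/20 split comes from two chained calls: 20% of all examples are held out for testing, then 25% of the remaining 80% (that is, 20% of the full dataset) are held out for validation. The arithmetic can be sketched as below. The helper `split_sizes` is hypothetical, and it assumes the held-out size is rounded up, which is how scikit-learn is understood to treat a fractional `test_size`.

```python
import math


def split_sizes(n_samples, test_fraction=0.2, val_fraction_of_rest=0.25):
    """Return (n_train, n_val, n_test) for two chained hold-out splits."""
    # First split: hold out the test set from the full dataset
    n_test = math.ceil(n_samples * test_fraction)
    n_rest = n_samples - n_test
    # Second split: hold out the validation set from what remains
    n_val = math.ceil(n_rest * val_fraction_of_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test


# 50 samples -> 30 train, 10 validation, 10 test
n_train, n_val, n_test = split_sizes(50)
print(f"train={n_train}, validation={n_val}, test={n_test}")
```

With only 10 test emails, each misclassification moves accuracy by 10 percentage points, so the reported metrics should be read as rough indications.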

## Named Entity Recognition

This section reuses the Subset 1 code unchanged; the only difference is that the data now consists of 50 annotated sentences.


In [None]:
!python -m spacy download en_core_web_md
import spacy
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load a pre-trained NER model
nlp = spacy.load("en_core_web_md")

# Sample training data (text and true entity annotations)
training_data = [
    ("Microsoft announced a new AI initiative in Seattle.", [(0, 9, "ORG"), (39, 46, "GPE")]),
    ("Google I/O will take place in May 2023.", [(0, 10, "EVENT"), (29, 37, "DATE")]),
    ("The unemployment rate in the U.S. dropped to 3.5%.", [(34, 38, "PERCENT"), (27, 31, "GPE")]),
    ("The Chinese economy grew by 5% last year.", [(4, 11, "NORP")]),
    ("Sundar Pichai is the CEO of Google.", [(0, 13, "PERSON"), (28, 34, "ORG")]),
    ("Tesla secured $2 billion in new funding.", [(14, 22, "MONEY")]),
    ("Amazon is opening a new office in Vancouver.", [(0, 6, "ORG"), (36, 45, "GPE")]),
    ("Samsung released its new Galaxy S22 phone.", [(0, 7, "ORG"), (23, 32, "PRODUCT")]),
    ("The Pacific Ocean is the largest body of water on Earth.", [(4, 17, "LOC")]),
    ("The headquarters of IBM is in New York City.", [(21, 24, "ORG"), (31, 44, "GPE")]),

    ("Satya Nadella leads Microsoft Corporation.", [(0, 12, "PERSON"), (19, 38, "ORG")]),
    ("The FIFA World Cup will be held in Qatar in 2022.", [(4, 18, "EVENT"), (34, 39, "GPE"), (43, 47, "DATE")]),
    ("Apple plans to invest $10 billion in manufacturing.", [(23, 32, "MONEY")]),
    ("A new skyscraper is being built in Dubai.", [(33, 38, "GPE")]),
    ("70% of the world's population is now online.", [(0, 3, "PERCENT")]),
    ("Elon Musk founded SpaceX and Tesla.", [(0, 9, "PERSON"), (17, 23, "ORG"), (28, 33, "ORG")]),
    ("The startup raised $50 million in Series B.", [(15, 25, "MONEY")]),
    ("The next Apple event is scheduled for March 25th.", [(9, 14, "ORG"), (39, 48, "DATE")]),
    ("The new company is aiming for a 15% market share.", [(28, 31, "PERCENT")]),
    ("Apple's iPhone 14 is expected to launch in 2023.", [(0, 5, "ORG"), (7, 15, "PRODUCT"), (46, 50, "DATE")]),

    ("A German scientist won the Nobel Prize.", [(2, 8, "NORP")]),
    ("Facebook plans to launch new features in June.", [(0, 7, "ORG"), (30, 35, "DATE")]),
    ("The CEO of Apple, Tim Cook, announced new products.", [(14, 22, "PERSON"), (4, 9, "ORG")]),
    ("NASA's Perseverance rover landed on Mars.", [(0, 4, "ORG"), (34, 38, "GPE")]),
    ("The 2024 Summer Olympics will take place in Paris.", [(4, 24, "EVENT"), (40, 45, "GPE")]),
    ("The inflation rate reached 8.6% last month.", [(28, 32, "PERCENT")]),
    ("Coca-Cola launched a new flavor this spring.", [(0, 10, "ORG"), (36, 41, "DATE")]),
    ("The World Health Organization declared a health emergency.", [(4, 30, "ORG")]),
    ("Berkshire Hathaway's stock price increased by $500.", [(0, 23, "ORG"), (37, 40, "MONEY")]),
    ("In 2020, remote work became the new normal.", [(3, 7, "DATE")]),

    ("Mark Zuckerberg met with world leaders to discuss technology.", [(0, 15, "PERSON")]),
    ("The Great Wall of China is a popular tourist attraction.", [(4, 20, "LOC")]),
    ("The Grammy Awards will be held in Los Angeles.", [(0, 14, "EVENT"), (30, 43, "GPE")]),
    ("Intel announced a new chip that will improve processing speed.", [(0, 5, "ORG")]),
    ("The stock market saw a decline of 4% today.", [(28, 31, "PERCENT")]),
    ("Microsoft is acquiring LinkedIn for $26.2 billion.", [(0, 9, "ORG"), (26, 39, "ORG"), (44, 57, "MONEY")]),
    ("SpaceX plans to launch its Starship rocket next year.", [(0, 6, "ORG"), (34, 39, "DATE")]),
    ("The next big tech conference is set for September.", [(9, 13, "EVENT"), (38, 47, "DATE")]),
    ("The United Nations addresses global challenges.", [(4, 17, "ORG")]),
    ("Bill Gates founded Microsoft in 1975.", [(0, 10, "PERSON"), (21, 29, "ORG"), (32, 36, "DATE")]),

    ("A recent study showed that 60% of students prefer online classes.", [(36, 38, "PERCENT")]),
    ("The Louvre Museum is located in Paris.", [(4, 22, "ORG"), (30, 35, "GPE")]),
    ("The 2022 World Cup will be hosted in Qatar.", [(4, 18, "EVENT"), (35, 40, "GPE")]),
    ("Netflix added 8 million new subscribers in 2021.", [(7, 14, "ORG"), (23, 24, "MONEY"), (29, 33, "DATE")]),
    ("The first electric car was launched by Tesla in 2008.", [(29, 34, "ORG"), (39, 43, "DATE")]),
    ("Researchers found a new species of frog in Madagascar.", [(36, 49, "LOC")]),
    ("In 2019, the world saw significant advancements in AI.", [(3, 7, "DATE")]),
    ("The White House issued a statement regarding climate change.", [(4, 15, "GPE")]),
    ("Elon Musk is the founder of SpaceX and Tesla.", [(0, 9, "PERSON"), (23, 29, "ORG"), (34, 39, "ORG")]),
    ("Bill Gates pledged $100 million to fight malaria.", [(0, 10, "PERSON"), (24, 34, "MONEY")]),
]

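# Preprocess: build a lowercased copy of the texts
# (note: the evaluation loop below iterates over the original-case `training_data`, so this copy is not used)
preprocessed_data = [(text.lower(), entities) for text, entities in training_data]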

# Initialize lists for storing true and predicted entities
all_true_entities = []
all_pred_entities = []

# Iterate through training data
for text, true_entities in training_data:
    # Run NER model
    doc = nlp(text)

    # Predicted entities from the model
    pred_entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

    # Store true and predicted entities for all examples
    all_true_entities.extend(true_entities)
    all_pred_entities.extend(pred_entities)

# Convert to binary labels: 1 if a predicted (start, end, label) tuple appears in the pooled gold list, 0 otherwise
y_true = [1 if ent in all_true_entities else 0 for ent in all_pred_entities]
# Every prediction is marked positive, so recall is 100% by construction; only precision is informative
y_pred = [1 for _ in all_pred_entities]

# Calculate Precision, Recall, F1
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

Collecting en-core-web-md==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Precision: 36.9565%
Recall: 100.0000%
F1 Score: 53.9683%
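
The 100% recall above follows from the scoring scheme rather than from the model: only predicted entities are scored and each one is marked positive, so gold entities the model missed are never counted as false negatives. A minimal sketch of entity-level scoring that compares gold and predicted spans sentence by sentence is given below. The function name `entity_prf` and the toy spans are illustrative assumptions; exact `(start, end, label)` matching is one common convention, not the only one.

```python
def entity_prf(gold_per_doc, pred_per_doc):
    """Micro-averaged precision, recall and F1 over exact (start, end, label) matches."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_per_doc, pred_per_doc):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)   # predicted and annotated
        fp += len(pred_set - gold_set)   # predicted but not annotated
        fn += len(gold_set - pred_set)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (hypothetical): two sentences, one missed entity and one spurious entity
gold = [
    [(0, 9, "ORG"), (43, 50, "GPE")],
    [(0, 13, "PERSON"), (28, 34, "ORG")],
]
pred = [
    [(0, 9, "ORG")],
    [(0, 13, "PERSON"), (28, 34, "ORG"), (21, 24, "ORG")],
]

p, r, f = entity_prf(gold, pred)
print(f"Precision: {p * 100:.4f}%")
print(f"Recall: {r * 100:.4f}%")
print(f"F1 Score: {f * 100:.4f}%")
```

Because this comparison is exact, it also depends on the annotated character offsets being correct; checking that `text[start:end]` yields the intended entity string for every annotation is a cheap sanity test before trusting the scores.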


## Part of Speech Tagging

This section reuses the Subset 1 code unchanged; the only difference is that the data now consists of 50 tagged sentences.


In [None]:
!python -m spacy download en_core_web_md
import spacy
import random
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load spaCy's POS tagging model
nlp = spacy.load("en_core_web_md")

# Sample expanded training data: list of (text, true_pos_tags) pairs
training_data = [
    ("She sells seashells by the seashore.", ['PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN']),
    ("The quick brown fox jumps over the lazy dog.", ['DET', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'ADP', 'DET', 'ADJ', 'NOUN']),
    ("I love coding in Python.", ['PRON', 'VERB', 'NOUN', 'ADP', 'PROPN']),
    ("Birds fly in the sky.", ['NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("Alice and Bob went to the market.", ['PROPN', 'CCONJ', 'PROPN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("Reading books is fun.", ['VERB', 'NOUN', 'AUX', 'ADJ']),
    ("My car is very fast.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("We are going to the zoo.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("It is raining today.", ['PRON', 'AUX', 'VERB', 'NOUN']),
    ("Programming languages are interesting.", ['NOUN', 'NOUN', 'AUX', 'ADJ']),

    ("The cat sleeps on the mat.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("John likes to play soccer.", ['PROPN', 'VERB', 'PART', 'VERB', 'NOUN']),
    ("She is learning French.", ['PRON', 'AUX', 'VERB', 'PROPN']),
    ("The weather is nice today.", ['DET', 'NOUN', 'AUX', 'ADJ', 'NOUN']),
    ("He bought a new laptop yesterday.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN', 'NOUN']),
    ("They are swimming in the pool.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("The pizza smells delicious.", ['DET', 'NOUN', 'VERB', 'ADJ']),
    ("Can you help me with this project?", ['AUX', 'PRON', 'VERB', 'PRON', 'ADP', 'DET', 'NOUN']),
    ("This task is quite difficult.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("He enjoys reading books.", ['PRON', 'VERB', 'VERB', 'NOUN']),

    ("The dog barked loudly at the strangers.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),
    ("I have a meeting tomorrow.", ['PRON', 'AUX', 'DET', 'NOUN', 'ADJ']),
    ("They will travel to Spain next year.", ['PRON', 'AUX', 'VERB', 'ADP', 'PROPN', 'ADV', 'NOUN']),
    ("He plays the guitar beautifully.", ['PRON', 'VERB', 'DET', 'NOUN', 'ADV']),
    ("The book on the shelf is mine.", ['DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'AUX', 'PRON']),
    ("Jessica ran a marathon last summer.", ['PROPN', 'VERB', 'DET', 'NOUN', 'ADJ', 'NOUN']),
    ("Cooking is a wonderful hobby.", ['VERB', 'AUX', 'DET', 'ADJ', 'NOUN']),
    ("The stars shine brightly in the night sky.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN', 'NOUN']),
    ("I am learning how to code.", ['PRON', 'AUX', 'VERB', 'ADV', 'ADP', 'VERB']),
    ("The flowers bloom in spring.", ['DET', 'NOUN', 'VERB', 'ADP', 'NOUN']),

    ("My friends enjoy hiking on weekends.", ['DET', 'NOUN', 'VERB', 'VERB', 'ADP', 'NOUN']),
    ("Dogs are great companions.", ['NOUN', 'AUX', 'ADJ', 'NOUN']),
    ("She wrote an amazing story.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("The sun rises in the east.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("He plays soccer every weekend.", ['PRON', 'VERB', 'NOUN', 'ADV', 'NOUN']),
    ("Reading novels helps improve vocabulary.", ['VERB', 'NOUN', 'VERB', 'VERB', 'NOUN']),
    ("My family enjoys movie nights.", ['DET', 'NOUN', 'VERB', 'NOUN', 'NOUN']),
    ("She is very talented in music.", ['PRON', 'AUX', 'ADV', 'ADJ', 'ADP', 'NOUN']),
    ("We will celebrate his birthday soon.", ['PRON', 'AUX', 'VERB', 'PRON', 'NOUN', 'ADV']),
    ("The wind blew fiercely during the storm.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),

    ("They went hiking in the mountains.", ['PRON', 'VERB', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("This recipe is quite easy.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("The teacher explains concepts clearly.", ['DET', 'NOUN', 'VERB', 'NOUN', 'ADV']),
    ("We have been working on this project.", ['PRON', 'AUX', 'VERB', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("He prefers tea over coffee.", ['PRON', 'VERB', 'NOUN', 'ADP', 'NOUN']),
    ("The child laughed joyfully at the joke.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),
    ("She likes to dance at parties.", ['PRON', 'VERB', 'PART', 'VERB', 'ADP', 'NOUN']),
    ("They are playing video games right now.", ['PRON', 'AUX', 'VERB', 'NOUN', 'ADV', 'ADV']),
    ("The baby laughed at the puppy.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("She danced gracefully across the stage.", ['PRON', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN'])
]

# Preprocess text: convert to lowercase and shuffle the training data
training_data = [(text.lower(), tags) for text, tags in training_data]
random.shuffle(training_data)

# Split data into training, validation, and test sets (60% train, 20% validation, 20% test)
train_data, temp_data = train_test_split(training_data, test_size=0.4, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Initialize lists to store true and predicted POS tags for all sets
all_true_pos_tags_train, all_predicted_pos_tags_train = [], []
all_true_pos_tags_val, all_predicted_pos_tags_val = [], []
all_true_pos_tags_test, all_predicted_pos_tags_test = [], []

# Function to process data and evaluate POS tagging
def process_data(data, all_true_pos_tags, all_predicted_pos_tags):
    for text, true_pos_tags in data:
        # Process the text with spaCy
        doc = nlp(text)
        # Extract predicted POS tags (one per token, including punctuation, which the gold tag lists omit)
        predicted_pos_tags = [token.pos_ for token in doc]
        # Extend lists with true and predicted tags for evaluation
        all_true_pos_tags.extend(true_pos_tags)
        all_predicted_pos_tags.extend(predicted_pos_tags)

# Process training, validation, and test data
process_data(train_data, all_true_pos_tags_train, all_predicted_pos_tags_train)
process_data(val_data, all_true_pos_tags_val, all_predicted_pos_tags_val)
process_data(test_data, all_true_pos_tags_test, all_predicted_pos_tags_test)

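# Truncate both lists to the same length to avoid a ValueError in scikit-learn
# (this hides, rather than fixes, any per-sentence misalignment between gold and predicted tags)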
def ensure_equal_length(true_tags, predicted_tags):
    if len(true_tags) != len(predicted_tags):
        min_length = min(len(true_tags), len(predicted_tags))
        true_tags = true_tags[:min_length]
        predicted_tags = predicted_tags[:min_length]
    return true_tags, predicted_tags

# Ensure correct lengths for all sets
all_true_pos_tags_train, all_predicted_pos_tags_train = ensure_equal_length(all_true_pos_tags_train, all_predicted_pos_tags_train)
all_true_pos_tags_val, all_predicted_pos_tags_val = ensure_equal_length(all_true_pos_tags_val, all_predicted_pos_tags_val)
all_true_pos_tags_test, all_predicted_pos_tags_test = ensure_equal_length(all_true_pos_tags_test, all_predicted_pos_tags_test)

# Function to calculate metrics for a dataset
def evaluate_metrics(true_tags, predicted_tags):
    accuracy = accuracy_score(true_tags, predicted_tags)
    precision = precision_score(true_tags, predicted_tags, average='weighted')
    recall = recall_score(true_tags, predicted_tags, average='weighted')
    f1 = f1_score(true_tags, predicted_tags, average='weighted')
    return accuracy, precision, recall, f1

# Evaluate on training, validation, and test sets
metrics_train = evaluate_metrics(all_true_pos_tags_train, all_predicted_pos_tags_train)
metrics_val = evaluate_metrics(all_true_pos_tags_val, all_predicted_pos_tags_val)
metrics_test = evaluate_metrics(all_true_pos_tags_test, all_predicted_pos_tags_test)

# Average each metric across the training, validation, and test splits
total_accuracy = (metrics_train[0] + metrics_val[0] + metrics_test[0]) / 3
total_precision = (metrics_train[1] + metrics_val[1] + metrics_test[1]) / 3
total_recall = (metrics_train[2] + metrics_val[2] + metrics_test[2]) / 3
total_f1 = (metrics_train[3] + metrics_val[3] + metrics_test[3]) / 3

# Print consolidated metrics
print("Consolidated Metrics across Training, Validation, and Test Data:")
print(f"Accuracy: {total_accuracy * 100:.4f}%")
print(f"Precision: {total_precision * 100:.4f}%")
print(f"Recall: {total_recall * 100:.4f}%")
print(f"F1 Score: {total_f1 * 100:.4f}%")

Collecting en-core-web-md==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Consolidated Metrics across Training, Validation, and Test Data:
Accuracy: 16.3483%
Precision: 18.7806%
Recall: 16.3483%
F1 Score: 17.4103%


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
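
The low accuracy above may largely reflect how the tags are compared rather than the tagger itself. The gold lists carry no tag for the sentence-final period, while the model emits one tag per token including punctuation, so once the per-sentence lists are concatenated every sentence after the first is shifted out of step. The sketch below contrasts flat concatenation with a per-sentence alignment that drops `PUNCT` predictions. The helper names and the "predicted" tags are hypothetical and hand-written for illustration; they are not real model output.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the two tag sequences agree."""
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold) if gold else 0.0


def align_sentence(gold_tags, predicted_tags, skip_tag="PUNCT"):
    """Drop punctuation predictions; return None if the lengths still differ."""
    filtered = [t for t in predicted_tags if t != skip_tag]
    return filtered if len(filtered) == len(gold_tags) else None


# (gold tags, hypothetical predicted tags) for two sentences; gold omits punctuation
sentences = [
    (["NOUN", "VERB", "ADP", "DET", "NOUN"],
     ["NOUN", "VERB", "ADP", "DET", "NOUN", "PUNCT"]),
    (["DET", "NOUN", "VERB", "ADJ"],
     ["DET", "NOUN", "VERB", "ADJ", "PUNCT"]),
]

# Flat concatenation followed by truncation to the gold length
flat_gold = [t for gold, _ in sentences for t in gold]
flat_pred = [t for _, pred in sentences for t in pred][:len(flat_gold)]
flat_accuracy = accuracy(flat_gold, flat_pred)

# Per-sentence alignment; sentences that cannot be aligned are skipped
aligned_gold, aligned_pred = [], []
for gold, pred in sentences:
    kept = align_sentence(gold, pred)
    if kept is not None:
        aligned_gold.extend(gold)
        aligned_pred.extend(kept)
aligned_accuracy = accuracy(aligned_gold, aligned_pred)

print(f"Flat accuracy:    {flat_accuracy:.4f}")
print(f"Aligned accuracy: {aligned_accuracy:.4f}")
```

In this toy case every individual tag is correct, yet the flat comparison scores only the first sentence as matching. Lowercasing the input before tagging is a separate factor that can also change the tags a model assigns, for example to proper nouns.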


## Sentiment Analysis

This section reuses the Subset 1 code unchanged; the only difference is that the training data now consists of 50 labelled sentences.

In [None]:
import spacy
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load a blank model and add a text classifier
# (this spaCy pipeline is set up but not used; the next cell trains a scikit-learn Naive Bayes model instead)
nlpTC = spacy.blank("en")
textcat = nlpTC.add_pipe("textcat")

# Add labels for classification
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")


train_data = [
    ("I'm so frustrated with how slow my internet is.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I'm so happy with my new job!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The customer service at that store is excellent.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The movie was a complete waste of time.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That movie was truly heartwarming and beautiful.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been feeling really down lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The food at the new restaurant was absolutely delicious.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t get the job, and I feel so defeated.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The sunset this evening was breathtaking.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m really upset that I missed the deadline.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("I finally finished the book, and it was such a rewarding read.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("This weather is terrible, I can’t wait for it to end.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The surprise party was such a success!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My laptop crashed again, and I lost all my work.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The concert was absolutely mind-blowing!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been struggling with my workload lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The flowers you sent me are absolutely stunning.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I regret spending money on that product.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just found out I won the contest! I’m over the moon.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Everything seems to be going wrong lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("You did a fantastic job on that presentation.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m tired of dealing with all this stress.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I couldn’t be happier with how everything turned out.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My car broke down again, and I’m so frustrated.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That was one of the most enjoyable dinners I’ve had in ages.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m feeling really overwhelmed with everything going on.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m incredibly grateful for the support I’ve received.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t enjoy the event; it was a total letdown.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m so proud of everything we’ve accomplished this year.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("That comment really hurt my feelings.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("That was the best coffee I’ve had in a while!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I can’t believe how rude they were to me.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I got a promotion at work, and I couldn’t be more thrilled.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My phone screen cracked, and now I have to get it replaced.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("What a beautiful and sunny day!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m really anxious about everything going on.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("Spending time with family over the holidays was perfect.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t get enough sleep last night, and now I’m exhausted.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("This new app makes my life so much easier.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been really unmotivated lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("That vacation was exactly what I needed.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("That presentation did not go well at all.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I appreciate all the effort you put into this project.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My relationship with my friends hasn’t been great lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’ve made some great new friends recently.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The traffic was horrible, and I barely made it on time.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I had such a fun time with the kids at the park today.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t enjoy the book at all; it was so boring.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just got my dream job, and I’m beyond excited!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I feel like I’ve been making one mistake after another.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}})
]

# Lowercase all texts
train_data = [(text.lower(), labels) for text, labels in train_data]

In [None]:
# Extract text data from train_data
text = [data[0] for data in train_data]
labels = [data[1]['cats']['POSITIVE'] for data in train_data] # Extract labels

# Vectorize text data using the extracted text list
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)

# Train a Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict sentiments
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy * 100:.4f}%")
print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

Accuracy: 68.0000%
Precision: 62.5000%
Recall: 83.3333%
F1 Score: 71.4286%
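
The four figures above are tied together by the confusion-matrix counts. With 50 samples and `test_size=0.5`, the test split holds 25 items. The counts used below (10 true positives, 6 false positives, 2 false negatives, 7 true negatives) are an inference from the reported percentages, not values printed by the notebook; they are shown only to make the arithmetic behind each metric explicit.

```python
# Inferred confusion counts for the POSITIVE class (an assumption, see lead-in)
tp, fp, fn, tn = 10, 6, 2, 7

accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct predictions / all predictions
precision = tp / (tp + fp)                          # predicted positives that are correct
recall = tp / (tp + fn)                             # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"Accuracy: {accuracy * 100:.4f}%")
print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")
```

Read this way, the model finds most positive sentences (high recall) but also labels several negative ones as positive (lower precision).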


## Text Summarizer

In [None]:
!python -m spacy download en_core_web_md
import spacy
from sklearn.metrics import precision_score, recall_score, f1_score
from nltk.tokenize import sent_tokenize
import nltk

# Download the punkt tokenizer if not already downloaded
nltk.download('punkt')

# Load spaCy model
nlp = spacy.load("en_core_web_md")

# Example text and reference summary
text = """Climate change is one of the most pressing issues of our time. The increasing levels of greenhouse gases in the atmosphere
have led to rising global temperatures. As a result, glaciers are melting, sea levels are rising, and extreme weather events
are becoming more frequent. Many governments around the world have pledged to reduce carbon emissions, but progress has been slow.
Renewable energy sources such as solar and wind power offer hope, but their adoption has not been widespread enough to make a significant impact yet.
Urgent action is needed to address this global crisis before it’s too late."""

reference_summary = """Climate change is caused by greenhouse gases and is leading to rising temperatures and extreme weather.
Renewable energy offers hope, but its adoption is slow."""

# Extractive summarization function with lowercase preprocessing
def extractive_summary(text, num_sentences=3):
    doc = nlp(text.lower())  # Convert text to lowercase before processing
    sentences = [sent.text.lower() for sent in doc.sents]  # Convert sentences to lowercase
    return sentences[:num_sentences]  # Return the first `num_sentences` as the summary

# Tokenizing the reference and generated summaries into sentences
generated_summary = extractive_summary(text)  # Summary in lowercase
reference_sentences = [sent.lower() for sent in sent_tokenize(reference_summary)]  # Reference in lowercase

# Convert to binary relevance: 1 if the sentence appears in the reference summary, 0 otherwise
y_true = [1 if sent in reference_sentences else 0 for sent in sent_tokenize(text.lower())]  # Compare with reference
y_pred = [1 if sent in generated_summary else 0 for sent in sent_tokenize(text.lower())]  # Compare with generated summary

# Ensure y_true and y_pred are of the same length
if len(y_true) != len(y_pred):
    min_length = min(len(y_true), len(y_pred))
    y_true = y_true[:min_length]
    y_pred = y_pred[:min_length]

# Calculate precision, recall, and F1 score
precision = precision_score(y_true, y_pred) * 100  # Convert to percentage
recall = recall_score(y_true, y_pred) * 100  # Convert to percentage
f1 = f1_score(y_true, y_pred) * 100  # Convert to percentage

# Output results
print(f"Generated Summary: {' '.join(generated_summary)}")
print(f"Precision: {precision:.2f}%")
print(f"Recall: {recall:.2f}%")
print(f"F1 Score: {f1:.2f}%")


Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 14.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Generated Summary: climate change is one of the most pressing issues of our time. the increasing levels of greenhouse gases in the atmosphere 
have led to rising global temperatures. as a result, glaciers are melting, sea levels are rising, and extreme weather events 
are becoming more frequent.
Precision: 0.00%
Recall: 0.00%
F1 Score: 0.00%


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
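
The 0% scores above are expected with this evaluation: `y_true` marks a source sentence as relevant only if it appears verbatim in the reference summary, and the reference here is a paraphrase, so `y_true` is all zeros and precision, recall, and F1 collapse to zero. A common alternative is to score word overlap instead of exact sentence matches. The sketch below is a simplified unigram-overlap score in the spirit of ROUGE-1. It is not the notebook's method and not a full ROUGE implementation; `unigram_overlap_scores` and its regex tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

def unigram_overlap_scores(generated, reference):
    """Return (precision, recall, f1) based on overlapping lowercase word counts."""
    gen_tokens = re.findall(r"[a-z0-9']+", generated.lower())
    ref_tokens = re.findall(r"[a-z0-9']+", reference.lower())

    # Multiset intersection: each word counts at most as often as it
    # occurs in both the generated and the reference summary.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())

    precision = overlap / len(gen_tokens) if gen_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Illustrative strings (not the notebook's data)
p, r, f = unigram_overlap_scores(
    "Rising temperatures are melting glaciers.",
    "Climate change is leading to rising temperatures.",
)
print(f"Precision: {p * 100:.2f}%  Recall: {r * 100:.2f}%  F1: {f * 100:.2f}%")
```

Because this score credits partial overlap, a lead-3 summary like the one generated above would receive a non-zero score whenever it shares words with the reference.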


# ---------------------------------------------------------------------------------------------------------------------------------------------

# **Subset 3**

## Text Classification

The code is the same as in Subset 1, but the training data consists of 100 samples.

In [None]:
import spacy
from spacy.training import Example
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import re

# Create a blank SpaCy model and add the text classifier component
nlpTC = spacy.blank("en")
textcat = nlpTC.add_pipe("textcat")

# Add labels for classification
textcat.add_label("SPAM")
textcat.add_label("HAM")

# Define a minimal preprocessing function
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    return text

#Example Training data
train_data = [
    ("This is spam", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Hello, how are you?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You won a million dollars!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Claim your free prize now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Meeting at 10 AM tomorrow", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your invoice is attached", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Exclusive offer just for you!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Get a free iPhone today", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we reschedule our call?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Update your account details", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Limited time deal, buy now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Your package has been shipped", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Win a trip to Hawaii now", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Important meeting agenda", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Congratulations! You've been selected", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we discuss this project?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Did you catch the bus ? Are you frying an egg ? Did you make a tea? Are you eating your mom's left over dinner ? Do you feel my Love ?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Hello! How's you and how did saturday go? I was just texting to see if you'd decided to do anything tomo. Not that i'm trying to invite myself or anything!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I am waiting machan. Call me once you free.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Sorry, I'll call later", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("You will be in the place of that man", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Please don't text me anymore. I have nothing else to say.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Thanks a lot for your wishes on my birthday. Thanks you for making my birthday truly memorable.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Aight, I'll hit you up when I get some cash", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Dont worry. I guess he's busy.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a £1500 Bonus Prize, call 09066364589", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Good stuff, will do.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("What time you coming down later?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Sounds great! Are you home now?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Finally the match heading towards draw as your prediction.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Tired. I haven't slept well the past few nights.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Where are you?when wil you reach here?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Please call our customer service representative on FREEPHONE 0808 145 4742 between 9am-11pm as you have WON a guaranteed £1000 cash or £5000 prize!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("What you doing? how are you?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I'm back, lemme know when you're ready", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Lose 20 pounds in just 2 weeks with our miracle weight loss pills! 100% natural and safe. Order today and get a special discount: www.weightlosspills.com. Hurry, offer expires soon!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Awesome, I'll see you in a bit", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Just sent it. So what type of food do you like?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I accidentally deleted the message. Resend please.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("FREE MESSAGE Activate your 500 FREE Text Messages by replying to this message with the word FREE For terms & conditions, visit www.07781482378.com", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("I cant pick the phone right now. Pls send a message", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("He says he'll give me a call when his friend's got the money but that he's definitely buying before the end of the week", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You made my day. Do have a great day too.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Great news! You have been pre-approved for a personal loan of $10,000 with a low-interest rate! No credit check required. Apply today at www.getmymoney.com and get instant cash!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("No problem. How are you doing?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Just sleeping and surfing", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Busy here. Trying to finish for new year. I am looking forward to finally meeting you", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Cool, text me when you're ready", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("URGENT! Your Mobile number has been awarded with a £2000 prize GUARANTEED. Call 09058094455 from land line. Claim 3030. Valid 12hrs only", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Thanks for this hope you had a good day today", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I hope you that's the result of being consistently intelligent and kind. Start asking him about practicum links and keep your ears open and all the best. ttyl", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Just a quick reminder about our meeting tomorrow at 10:00 AM in the conference room. Please bring the project update documents with you.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Do you want to grab lunch today? I’m thinking of trying that new Italian place near the office. Let me know if you’re up for it!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Thank you for your recent purchase! Attached is the invoice for your order #45678. Your items will be shipped within 3-5 business days.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Guaranteed Highest Daily Rebates in the PH! Bet daily on any games and get up to 0.8% with UNLIMITED bonus! Check out more promos now: https://peryagame.com", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("I wanted to give you a quick update on the project. We’ve completed the design phase and will be moving into development next week. Let me know if you have any questions or need clarification on your tasks.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You’re invited to our annual company holiday party! Join us on December 15th at 7:00 PM for an evening of food, drinks, and fun. Please RSVP by December 1st.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Just wanted to let you know that we arrived safely at the cabin. The weather is beautiful, and we’re planning to go hiking tomorrow. I’ll send you some pictures later.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I hope this message finds you well. I wanted to follow up on my application for the software engineer position. I’m very interested in the role and would appreciate any updates on the hiring process.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Thank you for placing an order with us! Your order #78965 has been confirmed and is currently being processed. You will receive a shipping notification once it has been dispatched.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I’ve booked my tickets for the trip to Hawaii! We’re flying out on the 12th and coming back on the 18th. Let me know if you’re still interested in joining us—it’s going to be a great trip!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("PeryaGame 5% Cashback with Unlimited Bonus! PeryaGame offer Guaranteed Highest Daily Rebates in the PH! Check out more promos now: https://peryagame.com", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Welcome to our weekly newsletter! This week, we’re sharing tips on how to improve productivity and stay organized. Be sure to check out our latest articles and join our upcoming webinar on time management.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I hope you’re all doing well. I wanted to share a quick update on the project status. We’re on track to complete the next phase by the end of the week. I’ll schedule a meeting for next Monday to discuss the next steps.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Are we still on for dinner tonight? I’ve made a reservation at 7:30 PM at the new Thai place downtown. Let me know if anything changes.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Thank you for applying for the Marketing Manager position at XYZ Company. We are pleased to invite you for an interview on Monday, October 5th, at 2:00 PM. The interview will be conducted via Zoom, and the details will be sent to you soon.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Just a quick reminder that your homework assignment on Chapter 3 is due by Friday. Make sure to review the key concepts and submit your work on time.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("We are writing to inform you that a new software update is available for your device. Version 4.3 includes bug fixes, security enhancements, and new features. Please update your software at your earliest convenience to ensure optimal performance.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Claim your free 1999 bonus without deposit! Download the RG777 APP now for an 188P bonus. Install here: https://bit.ly/3Tn80Nc. Enjoy your rewards!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Dear User, we have detected suspicious activity in your bank account. To prevent your account from being suspended, please click on the link below and verify your details: www.banksecure.com. Failure to do so within 24 hours will result in account suspension.", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("We’d like to invite you to our upcoming webinar on 'Building Effective Remote Teams.' The session will take place on September 30th at 11:00 AM. You’ll learn tips and strategies for managing remote employees and improving team collaboration.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your appointment with Dr. Smith has been confirmed for Thursday, October 8th, at 10:00 AM. If you need to reschedule, please contact us at least 24 hours in advance.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Just a reminder that our family vacation is coming up soon! We’ll be flying to Florida on the 15th, so make sure to pack everything by then. Also, don’t forget to bring sunscreen and your camera!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Earn $5,000 per week working from home! No experience needed. Start today and make money by simply filling out surveys. Click here to learn more: www.earnmoneyathome.com.", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Just a heads up, we’re meeting at the library on Friday at 3:00 PM to review for the final exam. I’ll bring my notes, and we can go over the key chapters together. Let me know if that time works for everyone.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("We hope you’re enjoying your recent purchase from our store. We’d love to hear your feedback! Please take a moment to complete our brief survey, and let us know how we did.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your travel itinerary has been confirmed. You are scheduled to depart from New York on Flight 5678 at 9:00 AM on October 20th. Your return flight from London will be on Flight 6789 at 5:00 PM on October 27th.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Just a reminder to RSVP for our wedding on November 12th! We’re so excited to celebrate this special day with our family and friends. Please let us know if you’ll be attending by October 1st.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Search I88J1L1! Mag l0gin Get FREE B0nus P8888! PROMO CODE: 2MQS0CS live N0w!Claim unlimited B0nus N0w D0nt Miss 0ut Limited Days 0nly", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Our next book club meeting is scheduled for Tuesday, October 6th, at 6:00 PM. We’ll be discussing “The Alchemist” by Paulo Coelho. Make sure to finish reading it before the meeting, and bring your thoughts and questions for the discussion.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Thanks for sending over the initial draft. I’ve made a few changes to the document, and you can find the updated version attached. Let me know if you have any questions.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Congratulations! You have been selected as the winner of our $1,000,000 prize! Click here to claim your reward now: www.claimprize.com. Act fast! Offer expires soon.", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Just a quick reminder about the office potluck on Friday! Don’t forget to bring a dish to share with your colleagues. Looking forward to seeing everyone’s culinary creations!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I hope you are doing well. I wanted to let you know that I have submitted my final report for the course. Please confirm when you receive it.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Please find attached the agenda for tomorrow’s meeting. We’ll be discussing the Q4 sales targets and the marketing strategy for the new product launch. Let me know if you have any points to add.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Hurry! Get a 90% discount on all our products! This is a one-time offer just for you! Visit www.superdeals.com and use code SAVE90 at checkout. Don't miss out on this amazing opportunity!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Wishing you a very happy birthday! I hope you have a fantastic day filled with joy, laughter, and cake! Let’s catch up soon.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("I just wanted to follow up on our last meeting. Have you had a chance to review the proposal we sent over? We’d love to hear your thoughts and discuss the next steps.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("We’re planning a family gathering at Grandma’s house next Sunday. We’ll have lunch around 1 PM, and it’ll be great to catch up with everyone. Let me know if you can make it!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("GCash: Account verification needed due to suspicious transaction. Kindly Visit: gcares-protect-ph.li to continue using our services", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Please note that the lecture scheduled for Tuesday, October 12th, has been moved to Thursday, October 14th, at 3 PM. The classroom remains the same. I apologize for any inconvenience.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("This is a reminder of your upcoming dentist appointment on Monday, October 18th, at 9:30 AM. Please contact us if you need to reschedule.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("This is a reminder that the book you borrowed, “The Catcher in the Rye,” is due for return on October 7th. Please return or renew it by then to avoid any late fees.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You’re all invited to our annual block party on Saturday, October 16th! We’ll have food, games, and music from 12 PM to 6 PM. Bring your family, and let’s have some fun!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Hi there! You have been selected for an all-expense-paid trip to the Bahamas! To claim your FREE vacation, all you need to do is fill out a quick survey. Click here now: www.freevacation.com.", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Here are the notes from our weekly team meeting held today. Please review and let me know if there are any changes or additions. We’ll be following up on these action items next week.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your reservation at Oceanview Resort has been confirmed for October 10th to October 15th. We look forward to welcoming you. If you need assistance or have special requests, feel free to contact us.", {"cats": {"SPAM": 0, "HAM": 1}})
]

# Apply minimal text preprocessing
train_data = [(preprocess_text(text), annotations) for text, annotations in train_data]

# Prepare training data into SpaCy's Example format
train_examples = []
for text, annotations in train_data:
    doc = nlpTC.make_doc(text)
    example = Example.from_dict(doc, annotations)
    train_examples.append(example)

# Split data into training, validation, and testing sets
train_examples, test_examples = train_test_split(train_examples, test_size=0.2, random_state=42)
train_examples, val_examples = train_test_split(train_examples, test_size=0.25, random_state=42)  # 25% of the remaining 80% = 20% of all data for validation

# Print the split data to visualize each set
print("TRAINING SET (60% of the data):")
for example in train_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

print("\nVALIDATION SET (20% of the data):")
for example in val_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

print("\nTESTING SET (20% of the data):")
for example in test_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

# Training the model with more iterations for small datasets
n_iter = 10  # Set iterations
optimizer = nlpTC.initialize()

for i in range(n_iter):
    losses = {}
    for batch in spacy.util.minibatch(train_examples, size=2):  # Iterate in groups of 2 examples
        for example in batch:
            # Each example is passed to update() on its own, so every weight update uses a single example
            nlpTC.update([example], sgd=optimizer, losses=losses)
    print(f"Iteration {i+1}/{n_iter} - Loss: {losses['textcat']}")

# Testing the model
print("\nSample Prediction Output with probabilities:")
doc = nlpTC("Claim your prize now!")
print(doc.cats)

# Function to classify user input emails
def classify_email(email):
    email = preprocess_text(email)
    doc = nlpTC(email)
    spam_score = doc.cats['SPAM']
    ham_score = doc.cats['HAM']

    if spam_score > ham_score:
        return "SPAM"
    else:
        return "HAM"

# Calculate accuracy, precision, recall, and F1 score on the test set
true_labels = [1 if example.reference.cats['SPAM'] == 1 else 0 for example in test_examples]
predicted_labels = [1 if classify_email(example.reference.text) == 'SPAM' else 0 for example in test_examples]

# Calculate and print metrics
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='weighted')
recall = recall_score(true_labels, predicted_labels, average='weighted')
f1 = f1_score(true_labels, predicted_labels, average='weighted')

# Display results
print(f"\nAccuracy: {accuracy * 100:.4f}%")
print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

# Allow users to test the model by inputting their own data
while True:
    user_input = input("\nEnter a sample email for classification (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    classification = classify_email(user_input)
    print(f"The email is classified as: {classification}")


TRAINING SET (60% of the data):
Text: dear user, we have detected suspicious activity in your bank account. to prevent your account from being suspended, please click on the link below and verify your details: www.banksecure.com. failure to do so within 24 hours will result in account suspension. - Label: {'SPAM': 1, 'HAM': 0}
Text: hello, how are you? - Label: {'SPAM': 0, 'HAM': 1}
Text: can we discuss this project? - Label: {'SPAM': 0, 'HAM': 1}
Text: thank you for applying for the marketing manager position at xyz company. we are pleased to invite you for an interview on monday, october 5th, at 2:00 pm. the interview will be conducted via zoom, and the details will be sent to you soon. - Label: {'SPAM': 0, 'HAM': 1}
Text: can we reschedule our call? - Label: {'SPAM': 0, 'HAM': 1}
Text: i'm back, lemme know when you're ready - Label: {'SPAM': 0, 'HAM': 1}
Text: just a quick reminder about our meeting tomorrow at 10:00 am in the conference room. please bring the project update documen
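
The 60/20/20 proportions printed by the cell above follow from applying `test_size=0.2` first and then `test_size=0.25` to what remains. A minimal arithmetic sketch, assuming 100 examples and assuming scikit-learn rounds the held-out count up (the ceiling of `test_size * n_samples`); `split_sizes` is a hypothetical helper, not part of the notebook:

```python
import math

def split_sizes(n_samples, test_size=0.2, val_size_of_rest=0.25):
    """Return (n_train, n_val, n_test) for a two-stage hold-out split."""
    n_test = math.ceil(test_size * n_samples)      # first split: test set
    n_rest = n_samples - n_test                    # data left after removing the test set
    n_val = math.ceil(val_size_of_rest * n_rest)   # second split: validation set
    n_train = n_rest - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(100)
print(f"train={n_train}, validation={n_val}, test={n_test}")
```

The second fraction is 0.25 rather than 0.2 because it applies to the 80 examples left after the first split, and 0.25 × 80 = 20.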

## Named Entity Recognition

The code is the same as in Subset 1, but the training data consists of 100 samples.
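
The annotations in this section are `(start, end, label)` character offsets into each text. Since this is a low-quality subset, it can help to see which substring each annotated span actually covers before comparing it with the model's predictions. The sketch below is one way to do that; `span_texts` is a hypothetical helper and the sample is a single entry copied from the training data.

```python
def span_texts(text, entities):
    """Return the substring covered by each (start, end, label) annotation."""
    return [(text[start:end], label) for start, end, label in entities]

sample_text = "Microsoft announced a new AI initiative in Seattle."
sample_entities = [(0, 9, "ORG"), (39, 46, "GPE")]

# Print each label next to the exact characters its offsets select
for covered, label in span_texts(sample_text, sample_entities):
    print(f"{label}: {covered!r}")
```

A span that does not line up with word boundaries can never match a predicted entity exactly. With spaCy, `doc.char_span(start, end, label=label)` returns `None` for such offsets under its default strict alignment, which offers another way to flag them.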

In [None]:
!python -m spacy download en_core_web_lg
import spacy
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load a pre-trained NER model
nlp = spacy.load("en_core_web_lg")

# Sample training data (text and true entity annotations)
training_data = [
    ("Microsoft announced a new AI initiative in Seattle.", [(0, 9, "ORG"), (39, 46, "GPE")]),
    ("Google I/O will take place in May 2023.", [(0, 10, "EVENT"), (29, 37, "DATE")]),
    ("The unemployment rate in the U.S. dropped to 3.5%.", [(34, 38, "PERCENT"), (27, 31, "GPE")]),
    ("The Chinese economy grew by 5% last year.", [(4, 11, "NORP")]),
    ("Sundar Pichai is the CEO of Google.", [(0, 13, "PERSON"), (28, 34, "ORG")]),
    ("Tesla secured $2 billion in new funding.", [(14, 22, "MONEY")]),
    ("Amazon is opening a new office in Vancouver.", [(0, 6, "ORG"), (36, 45, "GPE")]),
    ("Samsung released its new Galaxy S22 phone.", [(0, 7, "ORG"), (23, 32, "PRODUCT")]),
    ("The Pacific Ocean is the largest body of water on Earth.", [(4, 17, "LOC")]),
    ("The headquarters of IBM is in New York City.", [(21, 24, "ORG"), (31, 44, "GPE")]),

    ("Satya Nadella leads Microsoft Corporation.", [(0, 12, "PERSON"), (19, 38, "ORG")]),
    ("The FIFA World Cup will be held in Qatar in 2022.", [(4, 18, "EVENT"), (34, 39, "GPE"), (43, 47, "DATE")]),
    ("Apple plans to invest $10 billion in manufacturing.", [(23, 32, "MONEY")]),
    ("A new skyscraper is being built in Dubai.", [(33, 38, "GPE")]),
    ("70% of the world's population is now online.", [(0, 3, "PERCENT")]),
    ("Elon Musk founded SpaceX and Tesla.", [(0, 9, "PERSON"), (17, 23, "ORG"), (28, 33, "ORG")]),
    ("The startup raised $50 million in Series B.", [(15, 25, "MONEY")]),
    ("The next Apple event is scheduled for March 25th.", [(9, 14, "ORG"), (39, 48, "DATE")]),
    ("The new company is aiming for a 15% market share.", [(28, 31, "PERCENT")]),
    ("Apple's iPhone 14 is expected to launch in 2023.", [(0, 5, "ORG"), (7, 15, "PRODUCT"), (46, 50, "DATE")]),

    ("A German scientist won the Nobel Prize.", [(2, 8, "NORP")]),
    ("Facebook plans to launch new features in June.", [(0, 7, "ORG"), (30, 35, "DATE")]),
    ("The CEO of Apple, Tim Cook, announced new products.", [(14, 22, "PERSON"), (4, 9, "ORG")]),
    ("NASA's Perseverance rover landed on Mars.", [(0, 4, "ORG"), (34, 38, "GPE")]),
    ("The 2024 Summer Olympics will take place in Paris.", [(4, 24, "EVENT"), (40, 45, "GPE")]),
    ("The inflation rate reached 8.6% last month.", [(28, 32, "PERCENT")]),
    ("Coca-Cola launched a new flavor this spring.", [(0, 10, "ORG"), (36, 41, "DATE")]),
    ("The World Health Organization declared a health emergency.", [(4, 30, "ORG")]),
    ("Berkshire Hathaway's stock price increased by $500.", [(0, 23, "ORG"), (37, 40, "MONEY")]),
    ("In 2020, remote work became the new normal.", [(3, 7, "DATE")]),

    ("Mark Zuckerberg met with world leaders to discuss technology.", [(0, 15, "PERSON")]),
    ("The Great Wall of China is a popular tourist attraction.", [(4, 20, "LOC")]),
    ("The Grammy Awards will be held in Los Angeles.", [(0, 14, "EVENT"), (30, 43, "GPE")]),
    ("Intel announced a new chip that will improve processing speed.", [(0, 5, "ORG")]),
    ("The stock market saw a decline of 4% today.", [(28, 31, "PERCENT")]),
    ("Microsoft is acquiring LinkedIn for $26.2 billion.", [(0, 9, "ORG"), (26, 39, "ORG"), (44, 57, "MONEY")]),
    ("SpaceX plans to launch its Starship rocket next year.", [(0, 6, "ORG"), (34, 39, "DATE")]),
    ("The next big tech conference is set for September.", [(9, 13, "EVENT"), (38, 47, "DATE")]),
    ("The United Nations addresses global challenges.", [(4, 17, "ORG")]),
    ("Bill Gates founded Microsoft in 1975.", [(0, 10, "PERSON"), (21, 29, "ORG"), (32, 36, "DATE")]),

    ("A recent study showed that 60% of students prefer online classes.", [(36, 38, "PERCENT")]),
    ("The Louvre Museum is located in Paris.", [(4, 22, "ORG"), (30, 35, "GPE")]),
    ("The 2022 World Cup will be hosted in Qatar.", [(4, 18, "EVENT"), (35, 40, "GPE")]),
    ("Netflix added 8 million new subscribers in 2021.", [(7, 14, "ORG"), (23, 24, "MONEY"), (29, 33, "DATE")]),
    ("The first electric car was launched by Tesla in 2008.", [(29, 34, "ORG"), (39, 43, "DATE")]),
    ("Researchers found a new species of frog in Madagascar.", [(36, 49, "LOC")]),
    ("In 2019, the world saw significant advancements in AI.", [(3, 7, "DATE")]),
    ("The White House issued a statement regarding climate change.", [(4, 15, "GPE")]),
    ("Elon Musk is the founder of SpaceX and Tesla.", [(0, 9, "PERSON"), (23, 29, "ORG"), (34, 39, "ORG")]),
    ("Tesla plans to produce 20 million cars by 2030.", [(0, 5, "ORG"), (34, 40, "PERCENT"), (44, 48, "DATE")]),

    ("The next FIFA World Cup will be in 2026.", [(9, 13, "EVENT"), (25, 29, "DATE")]),
    ("Apple's market share reached an all-time high.", [(0, 5, "ORG"), (27, 35, "PERCENT")]),
    ("Amazon Prime Video will launch new shows this fall.", [(0, 6, "ORG"), (36, 41, "DATE")]),
    ("Google's headquarters is in Mountain View.", [(0, 6, "ORG"), (29, 32, "GPE")]),
    ("Facebook was founded by Mark Zuckerberg.", [(0, 8, "ORG"), (22, 36, "PERSON")]),
    ("The United Kingdom is hosting the G7 summit.", [(4, 17, "GPE"), (31, 35, "EVENT")]),
    ("Sony released the PlayStation 5 in late 2020.", [(0, 4, "ORG"), (16, 30, "PRODUCT"), (34, 38, "DATE")]),
    ("The next lunar eclipse will be on November 8th.", [(9, 14, "EVENT"), (27, 34, "DATE")]),
    ("Elon Musk is developing a new satellite internet service.", [(0, 9, "PERSON"), (30, 40, "PRODUCT")]),
    ("The Amazon rainforest is crucial for biodiversity.", [(4, 10, "LOC")]),

    ("Toyota unveiled its electric car lineup this year.", [(0, 6, "ORG"), (39, 43, "DATE")]),
    ("The Summer Olympics will take place in Tokyo in 2021.", [(4, 24, "EVENT"), (38, 43, "GPE"), (46, 50, "DATE")]),
    ("The Eiffel Tower is one of the most visited monuments.", [(4, 15, "LOC")]),
    ("NASA's Artemis program aims to return humans to the Moon.", [(0, 4, "ORG"), (35, 50, "EVENT")]),
    ("The stock market experienced a significant downturn.", [(4, 9, "LOC")]),
    ("Gold prices surged to an all-time high this week.", [(0, 4, "MONEY"), (25, 35, "DATE")]),
    ("The Met Gala is a major fundraising event.", [(4, 12, "EVENT")]),
    ("The Berlin Wall fell in 1989.", [(4, 15, "LOC"), (19, 23, "DATE")]),
    ("Instagram was acquired by Facebook in 2012.", [(0, 9, "ORG"), (23, 30, "ORG"), (34, 38, "DATE")]),
    ("Microsoft will invest in renewable energy projects.", [(0, 9, "ORG")]),

    ("The World Cup is set to take place in Qatar.", [(4, 10, "EVENT"), (26, 32, "GPE")]),
    ("The Great Barrier Reef is located off the coast of Australia.", [(4, 21, "LOC")]),
    ("Bill Gates and Melinda Gates announced their divorce.", [(0, 10, "PERSON"), (15, 27, "PERSON")]),
    ("The 2024 presidential election will be highly competitive.", [(4, 36, "EVENT"), (40, 50, "DATE")]),
    ("SpaceX's Falcon Heavy launched successfully last year.", [(0, 6, "ORG"), (7, 17, "PRODUCT"), (36, 41, "DATE")]),
    ("The new iPhone model features advanced camera technology.", [(4, 9, "ORG"), (20, 27, "PRODUCT")]),
    ("Alibaba's revenue soared during the pandemic.", [(0, 7, "ORG")]),
    ("The Cannes Film Festival is a prestigious event.", [(4, 27, "EVENT")]),
    ("The Tesla Model 3 has become very popular.", [(0, 5, "ORG"), (10, 22, "PRODUCT")]),
    ("Virtual reality is gaining traction in gaming.", [(0, 7, "LOC")]),

    ("The COVID-19 vaccine rollout has accelerated globally.", [(4, 12, "EVENT")]),
    ("Google's Android operating system dominates the market.", [(0, 6, "ORG")]),
    ("The Nobel Peace Prize was awarded to Malala Yousafzai.", [(4, 28, "EVENT"), (33, 50, "PERSON")]),
    ("The tech industry is evolving rapidly with AI advancements.", [(4, 12, "LOC")]),
    ("Elon Musk plans to send humans to Mars.", [(0, 9, "PERSON"), (23, 27, "GPE")]),
    ("The 2021 Tokyo Olympics faced many challenges.", [(4, 25, "EVENT"), (31, 36, "DATE")]),
    ("The British Royal Family attended the funeral of Prince Philip.", [(4, 31, "GPE"), (39, 50, "PERSON")]),
    ("Netflix is producing a new documentary series.", [(0, 7, "ORG")]),
    ("The Paris Agreement addresses climate change issues.", [(4, 18, "EVENT")]),
    ("The Olympic Games in Paris are highly anticipated.", [(4, 20, "EVENT"), (26, 31, "GPE")]),

    ("The smartphone market is becoming saturated.", [(4, 14, "LOC")]),
    ("Amazon is facing increased competition from Walmart.", [(0, 6, "ORG"), (33, 39, "ORG")]),
    ("The United Nations General Assembly meets annually.", [(4, 36, "ORG")]),
    ("The 2023 Cricket World Cup will be hosted by India.", [(4, 27, "EVENT"), (40, 45, "GPE")]),
    ("Tesla's stock prices have fluctuated dramatically.", [(0, 5, "ORG")]),
    ("The Grammy Awards are held every year.", [(4, 18, "EVENT")]),
    ("The Eiffel Tower attracts millions of tourists every year.", [(4, 15, "LOC"), (36, 41, "DATE")]),
    ("NASA's Mars Rover is searching for signs of life.", [(0, 4, "ORG")]),
    ("The 2024 U.S. Presidential election is coming up.", [(4, 26, "EVENT"), (30, 34, "DATE")]),
    ("Tesla is set to launch its new Cybertruck.", [(0, 5, "ORG"), (30, 34, "PRODUCT")]),
]


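# Preprocess: convert all texts to lowercase
# Note: preprocessed_data is not used below; the evaluation loop iterates over training_data (original casing)
preprocessed_data = [(text.lower(), entities) for text, entities in training_data]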

# Initialize lists for storing true and predicted entities
all_true_entities = []
all_pred_entities = []

# Iterate through training data
for text, true_entities in training_data:
    # Run NER model
    doc = nlp(text)

    # Predicted entities from the model as (start_char, end_char, label) tuples
    pred_entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

    # Store true and predicted entities for all examples
    all_true_entities.extend(true_entities)
    all_pred_entities.extend(pred_entities)

# Convert to binary labels: 1 if a predicted entity matches a true entity, 0 otherwise
y_true = [1 if ent in all_true_entities else 0 for ent in all_pred_entities]
y_pred = [1 for _ in all_pred_entities]  # Every predicted entity counts as a positive, so recall is trivially 100%

# Calculate Precision, Recall, F1
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

Precision: 35.3933%
Recall: 100.0000%
F1 Score: 52.2822%
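
The evaluation above marks every predicted entity as a positive, so recall is 100% by construction, and because all spans are pooled into one list, a predicted span can match a true span from a different sentence. A minimal sketch of entity-level scoring that compares spans sentence by sentence is given here. The helper `entity_prf` and the toy data are hypothetical and are not part of the notebook.

```python
def entity_prf(gold_docs, pred_docs):
    """Micro-averaged precision, recall and F1 over exact (start, end, label) matches.

    gold_docs and pred_docs are parallel lists; each item holds the entity
    tuples of one sentence, so offsets are only compared within that sentence.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)   # predicted and annotated
        fp += len(pred_set - gold_set)   # predicted but not annotated
        fn += len(gold_set - pred_set)   # annotated but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy annotations and predictions for two sentences
gold_docs = [
    [(0, 9, "ORG")],
    [(0, 9, "PERSON"), (34, 38, "LOC")],
]
pred_docs = [
    [(0, 9, "ORG"), (25, 42, "PRODUCT")],
    [(0, 9, "PERSON")],
]

precision, recall, f1 = entity_prf(gold_docs, pred_docs)
print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
```

In the notebook, `gold_docs` would be the annotated entity lists and `pred_docs` the tuples built from `doc.ents`, one inner list per sentence. Missed entities then lower recall instead of being ignored.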


## Part of Speech Tagging

The code is the same as in Subset 1, but the training data consists of 100 samples.

In [None]:
!python -m spacy download en_core_web_lg
import spacy
import random
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load spaCy's POS tagging model
nlp = spacy.load("en_core_web_lg")

# Sample expanded training data: list of (text, true_pos_tags) pairs
training_data = [
    ("She sells seashells by the seashore.", ['PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN']),
    ("The quick brown fox jumps over the lazy dog.", ['DET', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'ADP', 'DET', 'ADJ', 'NOUN']),
    ("I love coding in Python.", ['PRON', 'VERB', 'NOUN', 'ADP', 'PROPN']),
    ("Birds fly in the sky.", ['NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("Alice and Bob went to the market.", ['PROPN', 'CCONJ', 'PROPN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("Reading books is fun.", ['VERB', 'NOUN', 'AUX', 'ADJ']),
    ("My car is very fast.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("We are going to the zoo.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("It is raining today.", ['PRON', 'AUX', 'VERB', 'NOUN']),
    ("Programming languages are interesting.", ['NOUN', 'NOUN', 'AUX', 'ADJ']),

    ("The cat sleeps on the mat.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("John likes to play soccer.", ['PROPN', 'VERB', 'PART', 'VERB', 'NOUN']),
    ("She is learning French.", ['PRON', 'AUX', 'VERB', 'PROPN']),
    ("The weather is nice today.", ['DET', 'NOUN', 'AUX', 'ADJ', 'NOUN']),
    ("He bought a new laptop yesterday.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN', 'NOUN']),
    ("They are swimming in the pool.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("The pizza smells delicious.", ['DET', 'NOUN', 'VERB', 'ADJ']),
    ("Can you help me with this project?", ['AUX', 'PRON', 'VERB', 'PRON', 'ADP', 'DET', 'NOUN']),
    ("This task is quite difficult.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("He enjoys reading books.", ['PRON', 'VERB', 'VERB', 'NOUN']),

    ("The dog barked loudly at the strangers.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),
    ("I have a meeting tomorrow.", ['PRON', 'AUX', 'DET', 'NOUN', 'ADJ']),
    ("They will travel to Spain next year.", ['PRON', 'AUX', 'VERB', 'ADP', 'PROPN', 'ADV', 'NOUN']),
    ("He plays the guitar beautifully.", ['PRON', 'VERB', 'DET', 'NOUN', 'ADV']),
    ("The book on the shelf is mine.", ['DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'AUX', 'PRON']),
    ("Jessica ran a marathon last summer.", ['PROPN', 'VERB', 'DET', 'NOUN', 'ADJ', 'NOUN']),
    ("Cooking is a wonderful hobby.", ['VERB', 'AUX', 'DET', 'ADJ', 'NOUN']),
    ("The stars shine brightly in the night sky.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN', 'NOUN']),
    ("I am learning how to code.", ['PRON', 'AUX', 'VERB', 'ADV', 'ADP', 'VERB']),
    ("The flowers bloom in spring.", ['DET', 'NOUN', 'VERB', 'ADP', 'NOUN']),

    ("My friends enjoy hiking on weekends.", ['DET', 'NOUN', 'VERB', 'VERB', 'ADP', 'NOUN']),
    ("Dogs are great companions.", ['NOUN', 'AUX', 'ADJ', 'NOUN']),
    ("She wrote an amazing story.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("The sun rises in the east.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("He plays soccer every weekend.", ['PRON', 'VERB', 'NOUN', 'ADV', 'NOUN']),
    ("Reading novels helps improve vocabulary.", ['VERB', 'NOUN', 'VERB', 'VERB', 'NOUN']),
    ("My family enjoys movie nights.", ['DET', 'NOUN', 'VERB', 'NOUN', 'NOUN']),
    ("She is very talented in music.", ['PRON', 'AUX', 'ADV', 'ADJ', 'ADP', 'NOUN']),
    ("We will celebrate his birthday soon.", ['PRON', 'AUX', 'VERB', 'PRON', 'NOUN', 'ADV']),
    ("The wind blew fiercely during the storm.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),

    ("They went hiking in the mountains.", ['PRON', 'VERB', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("This recipe is quite easy.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("The teacher explains concepts clearly.", ['DET', 'NOUN', 'VERB', 'NOUN', 'ADV']),
    ("We have been working on this project.", ['PRON', 'AUX', 'VERB', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("He prefers tea over coffee.", ['PRON', 'VERB', 'NOUN', 'ADP', 'NOUN']),
    ("The child laughed joyfully at the joke.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),
    ("She likes to dance at parties.", ['PRON', 'VERB', 'PART', 'VERB', 'ADP', 'NOUN']),
    ("They are playing video games right now.", ['PRON', 'AUX', 'VERB', 'NOUN', 'ADV', 'ADV']),
    ("The cat chased the mouse.", ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']),
    ("Jack and Jill went up the hill.", ['PROPN', 'CCONJ', 'PROPN', 'VERB', 'ADP', 'DET', 'NOUN']),

    ("The children are laughing in the park.", ['DET', 'NOUN', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("She loves to read novels.", ['PRON', 'VERB', 'PART', 'VERB', 'NOUN']),
    ("The fish swims in the ocean.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("He wrote a letter yesterday.", ['PRON', 'VERB', 'DET', 'NOUN', 'NOUN']),
    ("They are playing soccer after school.", ['PRON', 'AUX', 'VERB', 'NOUN', 'ADP', 'NOUN']),
    ("The chef prepares delicious meals.", ['DET', 'NOUN', 'VERB', 'ADJ', 'NOUN']),
    ("We will visit our grandparents next weekend.", ['PRON', 'AUX', 'VERB', 'PRON', 'NOUN', 'ADV', 'NOUN']),
    ("The dog fetches the ball.", ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']),
    ("She enjoys painting landscapes.", ['PRON', 'VERB', 'VERB', 'NOUN']),
    ("The phone rang unexpectedly.", ['DET', 'NOUN', 'VERB', 'ADV']),

    ("They will join us for dinner.", ['PRON', 'AUX', 'VERB', 'PRON', 'ADP', 'NOUN']),
    ("He is running very fast.", ['PRON', 'AUX', 'VERB', 'ADV', 'ADJ']),
    ("The train arrives at the station.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("She sings beautifully on stage.", ['PRON', 'VERB', 'ADV', 'ADP', 'NOUN']),
    ("The baby cried all night.", ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']),
    ("We are going shopping tomorrow.", ['PRON', 'AUX', 'VERB', 'VERB', 'NOUN']),
    ("He finished his homework before dinner.", ['PRON', 'VERB', 'PRON', 'NOUN', 'ADP', 'NOUN']),
    ("The sun sets in the west.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("She has a beautiful voice.", ['PRON', 'AUX', 'DET', 'ADJ', 'NOUN']),
    ("The garden is full of flowers.", ['DET', 'NOUN', 'AUX', 'ADJ', 'ADP', 'NOUN']),

    ("They watched a movie last night.", ['PRON', 'VERB', 'DET', 'NOUN', 'ADJ', 'NOUN']),
    ("The children played happily at the playground.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),
    ("We are studying for the exam.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("He kicked the ball into the goal.", ['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN']),
    ("She is going to the concert tonight.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN', 'ADV']),
    ("The computer crashed during the update.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("I saw a shooting star.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("They will attend the meeting next week.", ['PRON', 'AUX', 'VERB', 'DET', 'NOUN', 'ADV', 'NOUN']),
    ("The mountain trail is steep.", ['DET', 'NOUN', 'AUX', 'ADJ']),
    ("He traveled to Paris last summer.", ['PRON', 'VERB', 'ADP', 'PROPN', 'ADJ', 'NOUN']),

    ("She baked cookies for her friends.", ['PRON', 'VERB', 'NOUN', 'ADP', 'PRON', 'NOUN']),
    ("The artist painted a stunning mural.", ['DET', 'NOUN', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("We enjoy exploring new places.", ['PRON', 'VERB', 'VERB', 'ADJ', 'NOUN']),
    ("He repaired the broken fence.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("They discovered a hidden treasure.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("The baby laughed at the puppy.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("She danced gracefully across the stage.", ['PRON', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),
    ("The team won the championship.", ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']),
    ("I found a great restaurant.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("He enjoys hiking during the summer.", ['PRON', 'VERB', 'VERB', 'ADP', 'DET', 'NOUN']),

    ("The car sped down the highway.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("She likes to play the piano.", ['PRON', 'VERB', 'PART', 'VERB', 'DET', 'NOUN']),
    ("They ran a marathon in record time.", ['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'NOUN']),
    ("The flowers bloomed beautifully in spring.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'NOUN']),
    ("She plays the violin effortlessly.", ['PRON', 'VERB', 'DET', 'NOUN', 'ADV']),
    ("We visited the art museum yesterday.", ['PRON', 'VERB', 'DET', 'NOUN', 'NOUN', 'NOUN']),
    ("The car sped down the highway.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("She likes to play the piano.", ['PRON', 'VERB', 'PART', 'VERB', 'DET', 'NOUN']),
    ("They ran a marathon in record time.", ['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'NOUN']),
    ("The flowers bloomed beautifully in spring.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'NOUN']),
]

# Preprocess text: convert to lowercase and shuffle the training data
training_data = [(text.lower(), tags) for text, tags in training_data]
random.shuffle(training_data)

# Split data into training, validation, and test sets (60% train, 20% validation, 20% test)
train_data, temp_data = train_test_split(training_data, test_size=0.4, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Initialize lists to store true and predicted POS tags for all sets
all_true_pos_tags_train, all_predicted_pos_tags_train = [], []
all_true_pos_tags_val, all_predicted_pos_tags_val = [], []
all_true_pos_tags_test, all_predicted_pos_tags_test = [], []

# Function to process data and evaluate POS tagging
def process_data(data, all_true_pos_tags, all_predicted_pos_tags):
    for text, true_pos_tags in data:
        # Process the text with spaCy
        doc = nlp(text)
        # Extract predicted POS tags
        predicted_pos_tags = [token.pos_ for token in doc]
        # Extend lists with true and predicted tags for evaluation
        all_true_pos_tags.extend(true_pos_tags)
        all_predicted_pos_tags.extend(predicted_pos_tags)

# Process training, validation, and test data
process_data(train_data, all_true_pos_tags_train, all_predicted_pos_tags_train)
process_data(val_data, all_true_pos_tags_val, all_predicted_pos_tags_val)
process_data(test_data, all_true_pos_tags_test, all_predicted_pos_tags_test)

# Ensure both lists are the same length to avoid ValueError
def ensure_equal_length(true_tags, predicted_tags):
    if len(true_tags) != len(predicted_tags):
        min_length = min(len(true_tags), len(predicted_tags))
        true_tags = true_tags[:min_length]
        predicted_tags = predicted_tags[:min_length]
    return true_tags, predicted_tags

# Ensure correct lengths for all sets
all_true_pos_tags_train, all_predicted_pos_tags_train = ensure_equal_length(all_true_pos_tags_train, all_predicted_pos_tags_train)
all_true_pos_tags_val, all_predicted_pos_tags_val = ensure_equal_length(all_true_pos_tags_val, all_predicted_pos_tags_val)
all_true_pos_tags_test, all_predicted_pos_tags_test = ensure_equal_length(all_true_pos_tags_test, all_predicted_pos_tags_test)

# Function to calculate metrics for a dataset
def evaluate_metrics(true_tags, predicted_tags):
    accuracy = accuracy_score(true_tags, predicted_tags)
    precision = precision_score(true_tags, predicted_tags, average='weighted')
    recall = recall_score(true_tags, predicted_tags, average='weighted')
    f1 = f1_score(true_tags, predicted_tags, average='weighted')
    return accuracy, precision, recall, f1

# Evaluate on training, validation, and test sets
metrics_train = evaluate_metrics(all_true_pos_tags_train, all_predicted_pos_tags_train)
metrics_val = evaluate_metrics(all_true_pos_tags_val, all_predicted_pos_tags_val)
metrics_test = evaluate_metrics(all_true_pos_tags_test, all_predicted_pos_tags_test)

# Average each metric across the three splits (unweighted mean)
total_accuracy = (metrics_train[0] + metrics_val[0] + metrics_test[0]) / 3
total_precision = (metrics_train[1] + metrics_val[1] + metrics_test[1]) / 3
total_recall = (metrics_train[2] + metrics_val[2] + metrics_test[2]) / 3
total_f1 = (metrics_train[3] + metrics_val[3] + metrics_test[3]) / 3

# Print consolidated metrics
print("Consolidated Metrics across Training, Validation, and Test Data:")
print(f"Accuracy: {total_accuracy * 100:.4f}%")
print(f"Precision: {total_precision * 100:.4f}%")
print(f"Recall: {total_recall * 100:.4f}%")
print(f"F1 Score: {total_f1 * 100:.4f}%")

Consolidated Metrics across Training, Validation, and Test Data:
Accuracy: 13.6840%
Precision: 15.6382%
Recall: 13.6840%
F1 Score: 14.5731%
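
The annotated tag lists above contain no tag for sentence-final punctuation, while spaCy emits a separate token for it. Once all sentences are flattened into one list, the true and predicted tags drift out of step, and truncating both lists to the same length does not realign them. This likely contributes to the low scores. A minimal sketch of per-sentence alignment follows. The helper `align_sentence` and the hard-coded tagger output are hypothetical. In the notebook, the `(text, tag)` pairs would come from `[(t.text, t.pos_) for t in nlp(text)]`.

```python
def align_sentence(predicted, gold_tags, skip=("PUNCT", "SPACE")):
    """Drop tokens whose predicted tag is in `skip`, then pair with the gold tags.

    Returns (gold, pred) lists of equal length, or None when the token counts
    still differ and the sentence cannot be compared position by position.
    """
    pred_tags = [pos for _, pos in predicted if pos not in skip]
    if len(pred_tags) != len(gold_tags):
        return None
    return list(gold_tags), pred_tags


# Hypothetical tagger output paired with annotated tags
examples = [
    ([("birds", "NOUN"), ("fly", "VERB"), ("in", "ADP"),
      ("the", "DET"), ("sky", "NOUN"), (".", "PUNCT")],
     ["NOUN", "VERB", "ADP", "DET", "NOUN"]),
    ([("the", "DET"), ("phone", "NOUN"), ("rang", "VERB"),
      ("unexpectedly", "ADV"), (".", "PUNCT")],
     ["DET", "NOUN", "VERB", "ADJ"]),
]

all_gold, all_pred, skipped = [], [], 0
for predicted, gold_tags in examples:
    aligned = align_sentence(predicted, gold_tags)
    if aligned is None:
        skipped += 1  # count sentences that cannot be aligned
        continue
    all_gold.extend(aligned[0])
    all_pred.extend(aligned[1])

accuracy = sum(g == p for g, p in zip(all_gold, all_pred)) / len(all_gold)
print(f"Token accuracy: {accuracy:.4f} ({skipped} sentence(s) skipped)")
```

Skipping sentences that cannot be aligned, and reporting how many were skipped, keeps every remaining comparison position-accurate. The aligned lists can be passed to the same `sklearn.metrics` functions used above.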




## Sentiment Analysis

The code is the same as in Subset 1, but the training data consists of 100 samples.

In [None]:
import spacy
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Create a blank spaCy model and add a text classifier
# (the Naive Bayes model in the next cell does the actual classification; this pipeline is not used there)
nlpTC = spacy.blank("en")
textcat = nlpTC.add_pipe("textcat")

# Add labels for classification
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

train_data = [
    ("I'm so frustrated with how slow my internet is.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I'm so happy with my new job!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The customer service at that store is excellent.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The movie was a complete waste of time.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That movie was truly heartwarming and beautiful.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been feeling really down lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The food at the new restaurant was absolutely delicious.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t get the job, and I feel so defeated.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The sunset this evening was breathtaking.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m really upset that I missed the deadline.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("I finally finished the book, and it was such a rewarding read.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("This weather is terrible, I can’t wait for it to end.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The surprise party was such a success!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My laptop crashed again, and I lost all my work.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The concert was absolutely mind-blowing!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been struggling with my workload lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The flowers you sent me are absolutely stunning.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I regret spending money on that product.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just found out I won the contest! I’m over the moon.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Everything seems to be going wrong lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("You did a fantastic job on that presentation.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m tired of dealing with all this stress.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I couldn’t be happier with how everything turned out.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My car broke down again, and I’m so frustrated.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That was one of the most enjoyable dinners I’ve had in ages.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m feeling really overwhelmed with everything going on.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m incredibly grateful for the support I’ve received.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t enjoy the event; it was a total letdown.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m so proud of everything we’ve accomplished this year.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("That comment really hurt my feelings.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("That was the best coffee I’ve had in a while!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I can’t believe how rude they were to me.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I got a promotion at work, and I couldn’t be more thrilled.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My phone screen cracked, and now I have to get it replaced.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("What a beautiful and sunny day!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m really anxious about everything going on.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("Spending time with family over the holidays was perfect.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t get enough sleep last night, and now I’m exhausted.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("This new app makes my life so much easier.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been really unmotivated lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("That vacation was exactly what I needed.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("That presentation did not go well at all.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I appreciate all the effort you put into this project.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My relationship with my friends hasn’t been great lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’ve made some great new friends recently.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The traffic was horrible, and I barely made it on time.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I had such a fun time with the kids at the park today.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t enjoy the book at all; it was so boring.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just got my dream job, and I’m beyond excited!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I feel like I’ve been making one mistake after another.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("I had a terrible experience with the customer service rep.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’ve been feeling so energetic and positive lately!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I'm feeling completely hopeless today.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That was the most amazing concert I’ve ever attended!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I can’t believe I lost my wallet again.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("My birthday party was so much fun, I loved it!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The service at the restaurant was terrible, I’m never going back.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’ve been so productive today, I got everything done!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I had an argument with my best friend, and now I feel awful.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I received a surprise gift, and it made my entire day.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),

    ("I didn’t get the promotion, and now I’m feeling defeated.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That was the best vacation I’ve had in years.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been feeling anxious and restless lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m extremely proud of how far I’ve come.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I don’t know why, but I’m feeling really down today.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The presentation went really well, I’m so relieved.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m struggling to stay positive with everything going wrong.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I feel incredibly blessed to have such supportive friends.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My flight got canceled, and now I’m stuck at the airport.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I got a big raise at work, I’m so happy!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),

    ("I’ve been feeling very isolated and alone recently.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just adopted a puppy, and I’m beyond excited!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve had the worst headache all day, it won’t go away.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m thrilled to have finished that project ahead of time.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been getting really stressed about all my deadlines.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That new movie was so entertaining, I loved every minute.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m so frustrated with how long this process is taking.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’ve never been more excited for the future.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The noise in my neighborhood is driving me crazy.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I finally achieved my fitness goals, I feel amazing.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),

    ("I’ve been having such a hard time balancing work and life.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I got to reconnect with an old friend today, it was so great.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m feeling really insecure about everything right now.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That new café is fantastic, I’ll definitely be going back.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m so upset that my favorite show got canceled.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just finished a great workout, I feel so energized!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m so disappointed in how things turned out.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That speech was truly inspiring, I’m so motivated now.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m incredibly tired of dealing with all these problems.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just finished reading an amazing book, I couldn’t put it down.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),

    ("I feel like everything is falling apart.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just won tickets to see my favorite band live!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been feeling really disconnected from everyone lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just got engaged, and I couldn’t be happier!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m really disappointed with the service I received today.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m so excited to start this new chapter in my life.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m feeling really anxious about the upcoming event.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That was the most fun I’ve had in years.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m feeling really low and unsure about everything right now.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m grateful for all the blessings in my life.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}})
]

# Lowercase all texts
train_data = [(text.lower(), labels) for text, labels in train_data]

In [None]:
# Extract text data from train_data
text = [data[0] for data in train_data]
labels = [data[1]['cats']['POSITIVE'] for data in train_data] # Extract labels

# Vectorize text data using the extracted text list
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)

# Train/test split (50% train, 50% test)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)

# Train a Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict sentiments
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy * 100:.4f}%")
print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

Accuracy: 80.0000%
Precision: 73.3333%
Recall: 91.6667%
F1 Score: 81.4815%
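
Once the vectorizer and the Naive Bayes model are fitted, a new sentence can be scored by transforming it with the same fitted vocabulary. The sketch below illustrates this on a hypothetical four-sentence training set. The helper `predict_sentiment` and the toy data are assumptions for illustration. In the notebook, the already fitted `vectorizer` and `model` would be reused instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature training set (1 = positive, 0 = negative)
texts = [
    "i love this movie",
    "what a wonderful day",
    "this is terrible",
    "i hate waiting in traffic",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB()
model.fit(X, labels)


def predict_sentiment(sentence):
    """Return (label, probability of the positive class) for one sentence."""
    # Reuse the fitted vocabulary: transform, not fit_transform
    features = vectorizer.transform([sentence.lower()])
    positive_index = list(model.classes_).index(1)
    prob_positive = model.predict_proba(features)[0][positive_index]
    label = "POSITIVE" if prob_positive >= 0.5 else "NEGATIVE"
    return label, prob_positive


label, prob = predict_sentiment("I love this wonderful day")
print(label, round(prob, 4))
```

Words that never appeared in the training texts are ignored by `transform`, so a sentence made up entirely of unseen words falls back to the class priors.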


## Text Summarizer

In [4]:
!python -m spacy download en_core_web_lg
import spacy
from sklearn.metrics import precision_score, recall_score, f1_score
from nltk.tokenize import sent_tokenize
import nltk

# Download the punkt tokenizer if not already downloaded
nltk.download('punkt')

# Load spaCy model
nlp = spacy.load("en_core_web_lg")

# Example text and reference summary
text = """Climate change is one of the most pressing issues of our time. The increasing levels of greenhouse gases in the atmosphere
have led to rising global temperatures. As a result, glaciers are melting, sea levels are rising, and extreme weather events
are becoming more frequent. Many governments around the world have pledged to reduce carbon emissions, but progress has been slow.
Renewable energy sources such as solar and wind power offer hope, but their adoption has not been widespread enough to make a significant impact yet.
Urgent action is needed to address this global crisis before it’s too late."""

reference_summary = """Climate change is caused by greenhouse gases and is leading to rising temperatures and extreme weather.
Renewable energy offers hope, but its adoption is slow."""

# Lead-N extractive baseline with lowercase preprocessing: returns the first num_sentences sentences
def extractive_summary(text, num_sentences=3):
    doc = nlp(text.lower())  # Convert text to lowercase before processing
    sentences = [sent.text.lower() for sent in doc.sents]  # Convert sentences to lowercase
    return sentences[:num_sentences]  # Return the first `num_sentences` as the summary

# Tokenizing the reference and generated summaries into sentences
generated_summary = extractive_summary(text)  # Summary in lowercase
reference_sentences = [sent.lower() for sent in sent_tokenize(reference_summary)]  # Reference in lowercase

# Convert to binary relevance: 1 if the sentence appears in the reference summary, 0 otherwise
y_true = [1 if sent in reference_sentences else 0 for sent in sent_tokenize(text.lower())]  # Compare with reference
y_pred = [1 if sent in generated_summary else 0 for sent in sent_tokenize(text.lower())]  # Compare with generated summary

# Ensure y_true and y_pred are of the same length
if len(y_true) != len(y_pred):
    min_length = min(len(y_true), len(y_pred))
    y_true = y_true[:min_length]
    y_pred = y_pred[:min_length]

# Calculate precision, recall, and F1 score
precision = precision_score(y_true, y_pred) * 100  # Convert to percentage
recall = recall_score(y_true, y_pred) * 100  # Convert to percentage
f1 = f1_score(y_true, y_pred) * 100  # Convert to percentage

# Output results
print(f"Generated Summary: {' '.join(generated_summary)}")
print(f"Precision: {precision:.2f}%")
print(f"Recall: {recall:.2f}%")
print(f"F1 Score: {f1:.2f}%")




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Generated Summary: climate change is one of the most pressing issues of our time. the increasing levels of greenhouse gases in the atmosphere
have led to rising global temperatures. as a result, glaciers are melting, sea levels are rising, and extreme weather events
are becoming more frequent.
Precision: 0.00%
Recall: 0.00%
F1 Score: 0.00%
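
The 0.00% scores above follow from how relevance is defined: a source sentence counts as relevant only if it appears verbatim in the reference summary, and the reference here is a paraphrase, so no source sentence matches it exactly. A token-overlap measure gives partial credit instead. The sketch below is a simplified unigram-overlap score in the spirit of ROUGE-1. It is not the metric used in this notebook, and the helper names `tokenize` and `unigram_overlap` as well as the regex tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_overlap(candidate, reference):
    """Return (precision, recall, f1) from clipped unigram counts."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical candidate sentence scored against one reference sentence
candidate = "Renewable energy sources offer hope."
reference = "Renewable energy offers hope, but its adoption is slow."
p, r, f = unigram_overlap(candidate, reference)
print(f"Precision: {p:.2f}  Recall: {r:.2f}  F1: {f:.2f}")
```

Three of the five candidate tokens occur among the nine reference tokens, so this pair receives partial credit even though the sentences are not identical.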




# ---------------------------------------------------------------------------------------------------------------------------------------------

# **Subset 4**

## Text Classification





This code is the same as the code from Subset 1, and the training data again contains a total of 20 samples. The difference is a more detailed preprocessing function, which lowercases the text and removes special characters and stopwords, with the aim of improving the quality of the text before it is used for training.


In [None]:
import spacy
from spacy.training import Example
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import re

# Create a blank SpaCy model and add the text classifier component
nlpTC = spacy.blank("en")
textcat = nlpTC.add_pipe("textcat")

# Add labels for classification
textcat.add_label("SPAM")
textcat.add_label("HAM")

# Example Training data
train_data = [
    ("This is spam", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Hello, how are you?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You won a million dollars!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Claim your free prize now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Meeting at 10 AM tomorrow", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your invoice is attached", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Exclusive offer just for you!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Get a free iPhone today", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we reschedule our call?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Update your account details", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Limited time deal, buy now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Your package has been shipped", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Win a trip to Hawaii now", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Important meeting agenda", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Congratulations! You've been selected", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we discuss this project?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Did you catch the bus ? Are you frying an egg ? Did you make a tea? Are you eating your mom's left over dinner ? Do you feel my Love ?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Hello! How's you and how did saturday go? I was just texting to see if you'd decided to do anything tomo. Not that i'm trying to invite myself or anything!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I am waiting machan. Call me once you free.", {"cats": {"SPAM": 0, "HAM": 1}}),
]

# Prepare training data into SpaCy's Example format
train_examples = []
for text, annotations in train_data:
    doc = nlpTC.make_doc(text) # create doc prior to preprocessing
    example = Example.from_dict(doc, annotations)
    train_examples.append(example)

# Initialize the textcat component with the training examples
nlpTC.initialize(lambda: train_examples)

# Define a comprehensive preprocessing function
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Tokenize and remove stopwords
    tokens = [token.text for token in nlpTC(text) if not token.is_stop]  # Directly extract tokens
    # Join tokens back to a string
    return ' '.join(tokens)

# Apply the comprehensive text preprocessing to the raw training texts
train_data = [(preprocess_text(text), annotations) for text, annotations in train_data]

# Rebuild the training data in SpaCy's Example format from the preprocessed texts
train_examples = []
for text, annotations in train_data:
    doc = nlpTC.make_doc(text)
    example = Example.from_dict(doc, annotations)
    train_examples.append(example)

# Split data into training, validation, and testing sets
train_examples, test_examples = train_test_split(train_examples, test_size=0.2, random_state=42)
train_examples, val_examples = train_test_split(train_examples, test_size=0.25, random_state=42)  # 25% of the remaining 80%, i.e. 20% of all data, is used for validation

# Print the split data to visualize each set
print("TRAINING SET (60% of the data):")
for example in train_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

print("\nVALIDATION SET (20% of the data):")
for example in val_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

print("\nTESTING SET (20% of the data):")
for example in test_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

# Train for several passes over the data, since the dataset is small
n_iter = 10  # Number of training iterations
optimizer = nlpTC.initialize()  # Re-initializes the pipeline and returns its optimizer

for i in range(n_iter):
    losses = {}
    for batch in spacy.util.minibatch(train_examples, size=2):  # Small batch size for small data
        nlpTC.update(batch, sgd=optimizer, losses=losses)  # One update per batch
    print(f"Iteration {i+1}/{n_iter} - Loss: {losses['textcat']}")

# Testing the model
print("\nSample Prediction Output with probabilities:")
doc = nlpTC(preprocess_text("Claim your prize now!"))  # Preprocess the sample the same way as the training data
print(doc.cats)

# Function to classify user input emails
def classify_email(email):
    email = preprocess_text(email)
    doc = nlpTC(email)
    spam_score = doc.cats['SPAM']
    ham_score = doc.cats['HAM']

    if spam_score > ham_score:
        return "SPAM"
    else:
        return "HAM"

# Calculate accuracy, precision, recall, and F1 score on the test set
true_labels = [1 if example.reference.cats['SPAM'] == 1 else 0 for example in test_examples]
predicted_labels = [1 if classify_email(example.reference.text) == 'SPAM' else 0 for example in test_examples]

# Calculate and print metrics
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='weighted')
recall = recall_score(true_labels, predicted_labels, average='weighted')
f1 = f1_score(true_labels, predicted_labels, average='weighted')

# Display results
print(f"\nAccuracy: {accuracy * 100:.4f}%")
print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

# Allow users to test the model by inputting their own data
while True:
    user_input = input("\nEnter a sample email for classification (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    classification = classify_email(user_input)
    print(f"The email is classified as: {classification}")


TRAINING SET (60% of the data):
Text: limited time deal buy - Label: {'SPAM': 1, 'HAM': 0}
Text: win trip hawaii - Label: {'SPAM': 1, 'HAM': 0}
Text: update account details - Label: {'SPAM': 0, 'HAM': 1}
Text: waiting machan free - Label: {'SPAM': 0, 'HAM': 1}
Text: package shipped - Label: {'SPAM': 0, 'HAM': 1}
Text: exclusive offer - Label: {'SPAM': 1, 'HAM': 0}
Text: hello s saturday texting d decided tomo m trying invite - Label: {'SPAM': 0, 'HAM': 1}
Text: won million dollars - Label: {'SPAM': 1, 'HAM': 0}
Text: meeting 10 tomorrow - Label: {'SPAM': 0, 'HAM': 1}
Text: free iphone today - Label: {'SPAM': 1, 'HAM': 0}
Text: claim free prize - Label: {'SPAM': 1, 'HAM': 0}
Text: important meeting agenda - Label: {'SPAM': 0, 'HAM': 1}

VALIDATION SET (20% of the data):
Text: reschedule - Label: {'SPAM': 0, 'HAM': 1}
Text: invoice attached - Label: {'SPAM': 0, 'HAM': 1}
Text: catch bus   frying egg   tea eating moms left dinner   feel love - Label: {'SPAM': 0, 'HAM': 1}
Text: congratula
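
The split above is done in two steps: 20% of the data is held out for testing, then 25% of the remaining 80% is held out for validation, which is 0.25 × 0.80 = 20% of the full dataset. A minimal sketch of that arithmetic follows. The helper name `two_stage_split_sizes` is hypothetical, and rounding held-out sizes up is an assumption that has no effect when the fractions divide the data evenly, as they do for 20 samples.

```python
import math


def two_stage_split_sizes(n_samples, test_frac=0.2, val_frac_of_rest=0.25):
    """Return (train, val, test) sizes for a split performed in two steps."""
    # Step 1: hold out the test set from the full dataset.
    # Rounding up is an assumption; round() guards against float noise.
    n_test = math.ceil(round(n_samples * test_frac, 9))
    n_rest = n_samples - n_test
    # Step 2: hold out the validation set from what remains.
    n_val = math.ceil(round(n_rest * val_frac_of_rest, 9))
    n_train = n_rest - n_val
    return n_train, n_val, n_test


sizes_20 = two_stage_split_sizes(20)
print(sizes_20)  # (12, 4, 4), i.e. 60% / 20% / 20%
```

The 12 training texts printed above match the first number.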

## Named Entity Recognition

This code is the same as the code from Subset 1, with a total of 20 annotated samples. A detailed preprocessing function lowercases the text and removes special characters and stopwords, with the aim of improving the quality of the text before it is passed to the pre-trained model. Because the entity annotations are character offsets into the original sentences, they no longer reliably line up with the preprocessed text.
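
The entity annotations in this section are character offsets into the original sentences, while the model is run on preprocessed text. The sketch below illustrates how lowercasing, stripping special characters, and dropping stopwords shift a span. It uses only the standard library. The two-word `STOPWORDS` set and the helper name `preprocess` are assumptions made for illustration and stand in for spaCy's stopword list and the notebook's own function.

```python
import re

STOPWORDS = {"a", "in"}  # hypothetical stand-in for spaCy's stopword list


def preprocess(text):
    """Lowercase, strip special characters, and drop stopwords."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(word for word in text.split() if word not in STOPWORDS)


original = "Microsoft announced a new AI initiative in Seattle."
start = original.find("Seattle")
original_span = (start, start + len("Seattle"))

cleaned = preprocess(original)
new_start = cleaned.find("seattle")
cleaned_span = (new_start, new_start + len("seattle"))

print(cleaned)        # microsoft announced new ai initiative seattle
print(original_span)  # (43, 50)
print(cleaned_span)   # (38, 45)
```

An exact-match comparison between gold spans measured on the original text and predicted spans measured on the cleaned text will therefore miss entities that the model found correctly.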

In [None]:
import spacy
from sklearn.metrics import precision_score, recall_score, f1_score
import re

# Load a pre-trained NER model
nlp = spacy.load("en_core_web_sm")

# Sample training data (text and true entity annotations)
training_data = [
    ("Microsoft announced a new AI initiative in Seattle.", [(0, 9, "ORG"), (39, 46, "GPE")]),
    ("Google I/O will take place in May 2023.", [(0, 10, "EVENT"), (29, 37, "DATE")]),
    ("The unemployment rate in the U.S. dropped to 3.5%.", [(34, 38, "PERCENT"), (27, 31, "GPE")]),
    ("The Chinese economy grew by 5% last year.", [(4, 11, "NORP")]),
    ("Sundar Pichai is the CEO of Google.", [(0, 13, "PERSON"), (28, 34, "ORG")]),
    ("Tesla secured $2 billion in new funding.", [(14, 22, "MONEY")]),
    ("Amazon is opening a new office in Vancouver.", [(0, 6, "ORG"), (36, 45, "GPE")]),
    ("Samsung released its new Galaxy S22 phone.", [(0, 7, "ORG"), (23, 32, "PRODUCT")]),
    ("The Pacific Ocean is the largest body of water on Earth.", [(4, 17, "LOC")]),
    ("The headquarters of IBM is in New York City.", [(21, 24, "ORG"), (31, 44, "GPE")]),

    ("Satya Nadella leads Microsoft Corporation.", [(0, 12, "PERSON"), (19, 38, "ORG")]),
    ("The FIFA World Cup will be held in Qatar in 2022.", [(4, 18, "EVENT"), (34, 39, "GPE"), (43, 47, "DATE")]),
    ("Apple plans to invest $10 billion in manufacturing.", [(23, 32, "MONEY")]),
    ("A new skyscraper is being built in Dubai.", [(33, 38, "GPE")]),
    ("70% of the world's population is now online.", [(0, 3, "PERCENT")]),
    ("Elon Musk founded SpaceX and Tesla.", [(0, 9, "PERSON"), (17, 23, "ORG"), (28, 33, "ORG")]),
    ("The startup raised $50 million in Series B.", [(15, 25, "MONEY")]),
    ("The next Apple event is scheduled for March 25th.", [(9, 14, "ORG"), (39, 48, "DATE")]),
    ("The new company is aiming for a 15% market share.", [(28, 31, "PERCENT")]),
    ("Apple's iPhone 14 is expected to launch in 2023.", [(0, 5, "ORG"), (7, 15, "PRODUCT"), (46, 50, "DATE")]),

]

# Define a comprehensive preprocessing function
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Tokenization and stopword removal
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return ' '.join(tokens)

# Preprocess training data
preprocessed_data = [(preprocess_text(text), entities) for text, entities in training_data]

# Initialize lists for storing true and predicted entities
all_true_entities = []
all_pred_entities = []

# Iterate through preprocessed training data
for text, true_entities in preprocessed_data:
    # Run NER model
    doc = nlp(text)

    # Predicted entities from the model as (start_char, end_char, label) triples
    pred_entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

    # Store true and predicted entities for all examples
    all_true_entities.extend(true_entities)
    all_pred_entities.extend(pred_entities)

# Convert to binary classification (1 for correct entity, 0 for incorrect).
# Gold entities are pooled across sentences, so a prediction counts as correct
# if its (start, end, label) triple matches a gold triple from any sentence.
y_true = [1 if ent in all_true_entities else 0 for ent in all_pred_entities]
# Every predicted entity is a positive prediction, so no false negatives can occur
# and recall does not reflect gold entities that the model missed
y_pred = [1 for _ in all_pred_entities]

# Calculate Precision, Recall, F1
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")


Precision: 16.0000%
Recall: 100.0000%
F1 Score: 27.5862%
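
The 100% recall above is a consequence of the setup rather than of model quality: `y_pred` is all ones and both lists range over predicted entities only, so gold entities that the model missed are never counted. The F1 score then follows as 2 × 0.16 × 1 / (0.16 + 1) ≈ 0.2759. A common alternative is to compare the gold and predicted entities of each sentence as sets of exact `(start, end, label)` triples. The sketch below shows that calculation for a single sentence; the function name `entity_prf` and the example spans are hypothetical.

```python
def entity_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over (start, end, label) triples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one sentence
gold = [(0, 9, "ORG"), (26, 28, "ORG"), (43, 50, "GPE")]
predicted = [(0, 9, "ORG"), (43, 50, "LOC")]  # one exact match, one wrong label

precision, recall, f1 = entity_prf(gold, predicted)
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")
```

Here the missed gold entity and the mislabelled one both lower recall, which the all-ones `y_pred` formulation cannot express.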


## Part of Speech Tagging

This code is the same as the code from Subset 1, with a total of 20 tagged samples. The preprocessing lowercases the text, removes special characters, and drops stopwords and punctuation from the predicted tags, with the aim of improving the quality of the text before it is processed by the pre-trained model.

In [None]:
import spacy
import random
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import re

# Load spaCy's POS tagging model
nlp = spacy.load("en_core_web_sm")

# Sample expanded training data: list of (text, true_pos_tags) pairs
training_data = [
    ("She sells seashells by the seashore.", ['PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN']),
    ("The quick brown fox jumps over the lazy dog.", ['DET', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'ADP', 'DET', 'ADJ', 'NOUN']),
    ("I love coding in Python.", ['PRON', 'VERB', 'NOUN', 'ADP', 'PROPN']),
    ("Birds fly in the sky.", ['NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("Alice and Bob went to the market.", ['PROPN', 'CCONJ', 'PROPN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("Reading books is fun.", ['VERB', 'NOUN', 'AUX', 'ADJ']),
    ("My car is very fast.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("We are going to the zoo.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("It is raining today.", ['PRON', 'AUX', 'VERB', 'NOUN']),
    ("Programming languages are interesting.", ['NOUN', 'NOUN', 'AUX', 'ADJ']),

    ("The cat sleeps on the mat.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("John likes to play soccer.", ['PROPN', 'VERB', 'PART', 'VERB', 'NOUN']),
    ("She is learning French.", ['PRON', 'AUX', 'VERB', 'PROPN']),
    ("The weather is nice today.", ['DET', 'NOUN', 'AUX', 'ADJ', 'NOUN']),
    ("He bought a new laptop yesterday.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN', 'NOUN']),
    ("They are swimming in the pool.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("The pizza smells delicious.", ['DET', 'NOUN', 'VERB', 'ADJ']),
    ("Can you help me with this project?", ['AUX', 'PRON', 'VERB', 'PRON', 'ADP', 'DET', 'NOUN']),
    ("This task is quite difficult.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("He enjoys reading books.", ['PRON', 'VERB', 'VERB', 'NOUN']),
]

# Preprocess text: convert to lowercase and shuffle the training data
training_data = [(text.lower(), tags) for text, tags in training_data]
random.shuffle(training_data)

# Split data into training, validation, and test sets (60% train, 20% validation, 20% test)
train_data, temp_data = train_test_split(training_data, test_size=0.4, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Initialize lists to store true and predicted POS tags for all sets
all_true_pos_tags_train, all_predicted_pos_tags_train = [], []
all_true_pos_tags_val, all_predicted_pos_tags_val = [], []
all_true_pos_tags_test, all_predicted_pos_tags_test = [], []

# Function to preprocess and evaluate POS tagging
def process_data(data, all_true_pos_tags, all_predicted_pos_tags):
    for text, true_pos_tags in data:
        # Special character removal
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        # Process the text with spaCy
        doc = nlp(text)
        # Extract predicted POS tags, skipping stopwords and punctuation.
        # Note: true_pos_tags still covers every word, so the two lists can differ in length.
        predicted_pos_tags = [token.pos_ for token in doc if not token.is_stop and not token.is_punct]
        # Extend lists with true and predicted tags for evaluation
        all_true_pos_tags.extend(true_pos_tags)
        all_predicted_pos_tags.extend(predicted_pos_tags)

# Process training, validation, and test data
process_data(train_data, all_true_pos_tags_train, all_predicted_pos_tags_train)
process_data(val_data, all_true_pos_tags_val, all_predicted_pos_tags_val)
process_data(test_data, all_true_pos_tags_test, all_predicted_pos_tags_test)

# Ensure both lists are the same length to avoid ValueError
def ensure_equal_length(true_tags, predicted_tags):
    if len(true_tags) != len(predicted_tags):
        min_length = min(len(true_tags), len(predicted_tags))
        true_tags = true_tags[:min_length]
        predicted_tags = predicted_tags[:min_length]
    return true_tags, predicted_tags

# Ensure correct lengths for all sets
all_true_pos_tags_train, all_predicted_pos_tags_train = ensure_equal_length(all_true_pos_tags_train, all_predicted_pos_tags_train)
all_true_pos_tags_val, all_predicted_pos_tags_val = ensure_equal_length(all_true_pos_tags_val, all_predicted_pos_tags_val)
all_true_pos_tags_test, all_predicted_pos_tags_test = ensure_equal_length(all_true_pos_tags_test, all_predicted_pos_tags_test)

# Function to calculate metrics for a dataset
def evaluate_metrics(true_tags, predicted_tags):
    accuracy = accuracy_score(true_tags, predicted_tags)
    precision = precision_score(true_tags, predicted_tags, average='weighted')
    recall = recall_score(true_tags, predicted_tags, average='weighted')
    f1 = f1_score(true_tags, predicted_tags, average='weighted')
    return accuracy, precision, recall, f1

# Evaluate on training, validation, and test sets
metrics_train = evaluate_metrics(all_true_pos_tags_train, all_predicted_pos_tags_train)
metrics_val = evaluate_metrics(all_true_pos_tags_val, all_predicted_pos_tags_val)
metrics_test = evaluate_metrics(all_true_pos_tags_test, all_predicted_pos_tags_test)

# Average each metric across the training, validation, and test sets
total_accuracy = (metrics_train[0] + metrics_val[0] + metrics_test[0]) / 3
total_precision = (metrics_train[1] + metrics_val[1] + metrics_test[1]) / 3
total_recall = (metrics_train[2] + metrics_val[2] + metrics_test[2]) / 3
total_f1 = (metrics_train[3] + metrics_val[3] + metrics_test[3]) / 3

# Print consolidated metrics
print("Consolidated Metrics across Training, Validation, and Test Data:")
print(f"Accuracy: {total_accuracy * 100:.4f}%")
print(f"Precision: {total_precision * 100:.4f}%")
print(f"Recall: {total_recall * 100:.4f}%")
print(f"F1 Score: {total_f1 * 100:.4f}%")

Consolidated Metrics across Training, Validation, and Test Data:
Accuracy: 13.3333%
Precision: 7.8721%
Recall: 13.3333%
F1 Score: 9.6354%
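
One likely contributor to these low scores is alignment. The gold tag lists cover every word, while the predicted tags skip stopwords, so after truncation to a common length the tags are compared pairwise at different positions. The sketch below contrasts that with filtering gold and predicted tags together. It uses only the standard library; the `STOPWORDS` set, the helper names, and the "perfect tagger" predictions are assumptions made for illustration.

```python
STOPWORDS = {"by", "the"}  # hypothetical stand-in for spaCy's stopword list


def accuracy(gold, predicted):
    """Fraction of positions where the two tag sequences agree."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def filter_aligned(tokens, gold_tags, predicted_tags):
    """Drop stopword positions from gold and predicted tags in lockstep."""
    if not (len(tokens) == len(gold_tags) == len(predicted_tags)):
        raise ValueError("tokens and tag lists must have the same length")
    kept = [(g, p) for tok, g, p in zip(tokens, gold_tags, predicted_tags)
            if tok.lower() not in STOPWORDS]
    return [g for g, _ in kept], [p for _, p in kept]


tokens = ["she", "sells", "seashells", "by", "the", "seashore"]
gold_tags = ["PRON", "VERB", "NOUN", "ADP", "DET", "NOUN"]
predicted_tags = list(gold_tags)  # pretend the tagger is perfect

# Filtering only the predictions and truncating the gold list misaligns them
filtered_pred = [p for tok, p in zip(tokens, predicted_tags) if tok not in STOPWORDS]
misaligned = accuracy(gold_tags[:len(filtered_pred)], filtered_pred)

# Filtering both lists together keeps each tag paired with its own token
aligned_gold, aligned_pred = filter_aligned(tokens, gold_tags, predicted_tags)
aligned = accuracy(aligned_gold, aligned_pred)

print(misaligned, aligned)  # 0.75 1.0
```

Even with perfect predictions the misaligned comparison loses accuracy, and the effect compounds when the tags of many sentences are concatenated before truncation.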




## Sentiment Analysis

This code is the same as the code from Subset 1, and the training data contains a total of 20 samples. A detailed preprocessing function is defined, which lowercases the text and removes special characters and stopwords, with the aim of improving the quality of the text before it is used for training.

In [None]:
import re
import spacy
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load a blank model and add text classifier
nlpTC = spacy.blank("en")
textcat = nlpTC.add_pipe("textcat")

# Add labels for classification
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

train_data = [
    ("I'm so frustrated with how slow my internet is.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I'm so happy with my new job!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The customer service at that store is excellent.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The movie was a complete waste of time.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That movie was truly heartwarming and beautiful.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been feeling really down lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The food at the new restaurant was absolutely delicious.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t get the job, and I feel so defeated.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The sunset this evening was breathtaking.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m really upset that I missed the deadline.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("I finally finished the book, and it was such a rewarding read.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("This weather is terrible, I can’t wait for it to end.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The surprise party was such a success!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My laptop crashed again, and I lost all my work.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The concert was absolutely mind-blowing!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been struggling with my workload lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The flowers you sent me are absolutely stunning.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I regret spending money on that product.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just found out I won the contest! I’m over the moon.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Everything seems to be going wrong lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
]

# Define a comprehensive preprocessing function
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Tokenize with the tokenizer only (the classifier is not needed here) and remove stopwords
    tokens = [token.text for token in nlpTC.make_doc(text) if not token.is_stop]
    # Join tokens back to a string
    return ' '.join(tokens)

In [None]:
# Extract the raw text from train_data (preprocess_text is not applied here)
text = [data[0] for data in train_data]
labels = [data[1]['cats']['POSITIVE'] for data in train_data] # Extract labels

# Vectorize text data using the extracted text list
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)

# Train a Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict sentiments
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy * 100:.4f}%")
print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

Accuracy: 70.0000%
Precision: 60.0000%
Recall: 75.0000%
F1 Score: 66.6667%
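
With `test_size=0.5` on 20 samples, the test set holds 10 sentences, so each misclassification moves accuracy by 10 percentage points. The reported figures are consistent with 3 true positives, 2 false positives, 1 false negative, and 4 true negatives. These counts are inferred from the percentages and are not printed by the notebook. The sketch below recomputes the four metrics from them; the function name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts inferred from the reported percentages (assumption, see above)
acc, prec, rec, f1 = metrics_from_counts(tp=3, fp=2, fn=1, tn=4)
print(f"Accuracy: {acc * 100:.4f}%")
print(f"Precision: {prec * 100:.4f}%")
print(f"Recall: {rec * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")
```

On a test set this small, a single changed prediction shifts precision or recall by many points, so the scores are best read as rough indicators.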


## Text Summarizer

In [None]:
import spacy
from sklearn.metrics import precision_score, recall_score, f1_score
from nltk.tokenize import sent_tokenize
import nltk
import re

# Download the punkt tokenizer if not already downloaded
nltk.download('punkt')

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Example text and reference summary
text = """Climate change is one of the most pressing issues of our time. The increasing levels of greenhouse gases in the atmosphere
have led to rising global temperatures. As a result, glaciers are melting, sea levels are rising, and extreme weather events
are becoming more frequent. Many governments around the world have pledged to reduce carbon emissions, but progress has been slow.
Renewable energy sources such as solar and wind power offer hope, but their adoption has not been widespread enough to make a significant impact yet.
Urgent action is needed to address this global crisis before it’s too late."""

reference_summary = """Climate change is caused by greenhouse gases and is leading to rising temperatures and extreme weather.
Renewable energy offers hope, but its adoption is slow."""

# Function to preprocess text: remove special characters, stopwords, and tokenize
def preprocess_text(text):
    # Remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Keep only alphanumeric characters and spaces
    doc = nlp(text.lower())  # Convert to lowercase and process with spaCy
    # Remove stopwords and return tokens
    tokens = [token.text for token in doc if not token.is_stop]
    return tokens

# Enhanced extractive summarization function
def extractive_summary(text, reference_summary, num_sentences=3):
    doc = nlp(text)  # Process the original text
    sentences = [sent.text for sent in doc.sents]  # Extract original sentences

    # Preprocess the reference summary and turn it back into a spaCy doc
    ref_tokens = preprocess_text(reference_summary)
    ref_doc = nlp(' '.join(ref_tokens))

    # Score each sentence by the similarity of its preprocessed form to the reference summary
    sentence_scores = [(sent, nlp(' '.join(preprocess_text(sent))).similarity(ref_doc)) for sent in sentences]
    ranked_sentences = sorted(sentence_scores, key=lambda x: x[1], reverse=True)

    # Select top sentences
    top_sentences = [sent[0] for sent in ranked_sentences[:num_sentences]]
    return top_sentences

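# Generate the summary (top-ranked sentences, in their original case)
generated_summary = extractive_summary(text, reference_summary)

# Lowercase and collapse whitespace so that sentences from spaCy and NLTK can be compared
def normalize(sentence):
    return ' '.join(sentence.lower().split())

reference_sentences = [normalize(sent) for sent in sent_tokenize(reference_summary)]
generated_sentences = [normalize(sent) for sent in generated_summary]
source_sentences = [normalize(sent) for sent in sent_tokenize(text.lower())]

# Convert to binary relevance: 1 if the sentence appears in the reference (y_true) or generated (y_pred) summary, 0 otherwise
y_true = [1 if sent in reference_sentences else 0 for sent in source_sentences]
y_pred = [1 if sent in generated_sentences else 0 for sent in source_sentences]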

# Ensure y_true and y_pred are of the same length
if len(y_true) != len(y_pred):
    min_length = min(len(y_true), len(y_pred))
    y_true = y_true[:min_length]
    y_pred = y_pred[:min_length]

# Calculate precision, recall, and F1 score
precision = precision_score(y_true, y_pred) * 100
recall = recall_score(y_true, y_pred) * 100
f1 = f1_score(y_true, y_pred) * 100

# Output results
print(f"Generated Summary: {' '.join(generated_summary)}")
print(f"Precision: {precision:.2f}%")
print(f"Recall: {recall:.2f}%")
print(f"F1 Score: {f1:.2f}%")



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Generated Summary: Renewable energy sources such as solar and wind power offer hope, but their adoption has not been widespread enough to make a significant impact yet. 
 As a result, glaciers are melting, sea levels are rising, and extreme weather events 
are becoming more frequent. Many governments around the world have pledged to reduce carbon emissions, but progress has been slow. 

Precision: 0.00%
Recall: 0.00%
F1 Score: 0.00%
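
The summarizer above ranks sentences by `Doc.similarity` against the reference summary. The small English pipeline does not include static word vectors, so those similarity scores are a rough signal at best. As a point of comparison, the sketch below ranks sentences by cosine similarity over plain bag-of-words counts, using only the standard library. This is a swapped-in technique, not the notebook's method; the helper names and the example sentences are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_sentences(sentences, reference, top_k=2):
    """Return the top_k sentences most similar to the reference, in document order."""
    ref = bag_of_words(reference)
    order = sorted(range(len(sentences)),
                   key=lambda i: cosine(bag_of_words(sentences[i]), ref),
                   reverse=True)
    return [sentences[i] for i in sorted(order[:top_k])]


# Hypothetical shortened sentences on the same topic as the example text
sentences = [
    "Glaciers are melting and sea levels are rising.",
    "Renewable energy offers hope.",
    "Progress on emissions has been slow.",
]
reference = "Renewable energy offers hope, but its adoption is slow."
summary = rank_sentences(sentences, reference)
print(summary)
```

Returning the selected sentences in document order, rather than in rank order as the notebook does, is a design choice that tends to keep the summary readable.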




# ---------------------------------------------------------------------------------------------------------------------------------------------

# **Subset 5**

## Text Classification


This code is the same as the code from Subset 1, but the training data contains a total of 100 samples. A detailed preprocessing function is provided, which lowercases the text and removes special characters and stopwords, with the aim of improving the quality of the text before it is used for training.

In [None]:
import spacy
from spacy.training import Example
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import re

# Create a blank SpaCy model and add the text classifier component
nlpTC = spacy.blank("en")
textcat = nlpTC.add_pipe("textcat")

# Add labels for classification
textcat.add_label("SPAM")
textcat.add_label("HAM")

# Example training data
train_data = [
    ("This is spam", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Hello, how are you?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You won a million dollars!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Claim your free prize now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Meeting at 10 AM tomorrow", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your invoice is attached", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Exclusive offer just for you!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Get a free iPhone today", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we reschedule our call?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Update your account details", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Limited time deal, buy now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Your package has been shipped", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Win a trip to Hawaii now", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Important meeting agenda", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Congratulations! You've been selected", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we discuss this project?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Did you catch the bus ? Are you frying an egg ? Did you make a tea? Are you eating your mom's left over dinner ? Do you feel my Love ?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Hello! How's you and how did saturday go? I was just texting to see if you'd decided to do anything tomo. Not that i'm trying to invite myself or anything!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I am waiting machan. Call me once you free.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Sorry, I'll call later", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("You will be in the place of that man", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Please don't text me anymore. I have nothing else to say.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Thanks a lot for your wishes on my birthday. Thanks you for making my birthday truly memorable.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Aight, I'll hit you up when I get some cash", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Dont worry. I guess he's busy.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a £1500 Bonus Prize, call 09066364589", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Good stuff, will do.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("What time you coming down later?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Sounds great! Are you home now?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Finally the match heading towards draw as your prediction.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Tired. I haven't slept well the past few nights.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Where are you?when wil you reach here?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Please call our customer service representative on FREEPHONE 0808 145 4742 between 9am-11pm as you have WON a guaranteed £1000 cash or £5000 prize!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("What you doing? how are you?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I'm back, lemme know when you're ready", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Lose 20 pounds in just 2 weeks with our miracle weight loss pills! 100% natural and safe. Order today and get a special discount: www.weightlosspills.com. Hurry, offer expires soon!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Awesome, I'll see you in a bit", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Just sent it. So what type of food do you like?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I accidentally deleted the message. Resend please.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("FREE MESSAGE Activate your 500 FREE Text Messages by replying to this message with the word FREE For terms & conditions, visit www.07781482378.com", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("I cant pick the phone right now. Pls send a message", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("He says he'll give me a call when his friend's got the money but that he's definitely buying before the end of the week", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You made my day. Do have a great day too.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Great news! You have been pre-approved for a personal loan of $10,000 with a low-interest rate! No credit check required. Apply today at www.getmymoney.com and get instant cash!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("No problem. How are you doing?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Just sleeping and surfing", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Busy here. Trying to finish for new year. I am looking forward to finally meeting you", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Cool, text me when you're ready", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("URGENT! Your Mobile number has been awarded with a £2000 prize GUARANTEED. Call 09058094455 from land line. Claim 3030. Valid 12hrs only", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Thanks for this hope you had a good day today", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I hope you that's the result of being consistently intelligent and kind. Start asking him about practicum links and keep your ears open and all the best. ttyl", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Just a quick reminder about our meeting tomorrow at 10:00 AM in the conference room. Please bring the project update documents with you.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Do you want to grab lunch today? I’m thinking of trying that new Italian place near the office. Let me know if you’re up for it!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Thank you for your recent purchase! Attached is the invoice for your order #45678. Your items will be shipped within 3-5 business days.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Guaranteed Highest Daily Rebates in the PH! Bet daily on any games and get up to 0.8% with UNLIMITED bonus! Check out more promos now: https://peryagame.com", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("I wanted to give you a quick update on the project. We’ve completed the design phase and will be moving into development next week. Let me know if you have any questions or need clarification on your tasks.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You’re invited to our annual company holiday party! Join us on December 15th at 7:00 PM for an evening of food, drinks, and fun. Please RSVP by December 1st.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Just wanted to let you know that we arrived safely at the cabin. The weather is beautiful, and we’re planning to go hiking tomorrow. I’ll send you some pictures later.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I hope this message finds you well. I wanted to follow up on my application for the software engineer position. I’m very interested in the role and would appreciate any updates on the hiring process.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Thank you for placing an order with us! Your order #78965 has been confirmed and is currently being processed. You will receive a shipping notification once it has been dispatched.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I’ve booked my tickets for the trip to Hawaii! We’re flying out on the 12th and coming back on the 18th. Let me know if you’re still interested in joining us—it’s going to be a great trip!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("PeryaGame 5% Cashback with Unlimited Bonus! PeryaGame offer Guaranteed Highest Daily Rebates in the PH! Check out more promos now: https://peryagame.com", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Welcome to our weekly newsletter! This week, we’re sharing tips on how to improve productivity and stay organized. Be sure to check out our latest articles and join our upcoming webinar on time management.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I hope you’re all doing well. I wanted to share a quick update on the project status. We’re on track to complete the next phase by the end of the week. I’ll schedule a meeting for next Monday to discuss the next steps.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Are we still on for dinner tonight? I’ve made a reservation at 7:30 PM at the new Thai place downtown. Let me know if anything changes.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Thank you for applying for the Marketing Manager position at XYZ Company. We are pleased to invite you for an interview on Monday, October 5th, at 2:00 PM. The interview will be conducted via Zoom, and the details will be sent to you soon.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Just a quick reminder that your homework assignment on Chapter 3 is due by Friday. Make sure to review the key concepts and submit your work on time.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("We are writing to inform you that a new software update is available for your device. Version 4.3 includes bug fixes, security enhancements, and new features. Please update your software at your earliest convenience to ensure optimal performance.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Claim your free 1999 bonus without deposit! Download the RG777 APP now for an 188P bonus. Install here: https://bit.ly/3Tn80Nc. Enjoy your rewards!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Dear User, we have detected suspicious activity in your bank account. To prevent your account from being suspended, please click on the link below and verify your details: www.banksecure.com. Failure to do so within 24 hours will result in account suspension.", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("We’d like to invite you to our upcoming webinar on 'Building Effective Remote Teams.' The session will take place on September 30th at 11:00 AM. You’ll learn tips and strategies for managing remote employees and improving team collaboration.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your appointment with Dr. Smith has been confirmed for Thursday, October 8th, at 10:00 AM. If you need to reschedule, please contact us at least 24 hours in advance.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Just a reminder that our family vacation is coming up soon! We’ll be flying to Florida on the 15th, so make sure to pack everything by then. Also, don’t forget to bring sunscreen and your camera!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Earn $5,000 per week working from home! No experience needed. Start today and make money by simply filling out surveys. Click here to learn more: www.earnmoneyathome.com.", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Just a heads up, we’re meeting at the library on Friday at 3:00 PM to review for the final exam. I’ll bring my notes, and we can go over the key chapters together. Let me know if that time works for everyone.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("We hope you’re enjoying your recent purchase from our store. We’d love to hear your feedback! Please take a moment to complete our brief survey, and let us know how we did.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your travel itinerary has been confirmed. You are scheduled to depart from New York on Flight 5678 at 9:00 AM on October 20th. Your return flight from London will be on Flight 6789 at 5:00 PM on October 27th.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("Just a reminder to RSVP for our wedding on November 12th! We’re so excited to celebrate this special day with our family and friends. Please let us know if you’ll be attending by October 1st.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Search I88J1L1! Mag l0gin Get FREE B0nus P8888! PROMO CODE: 2MQS0CS live N0w!Claim unlimited B0nus N0w D0nt Miss 0ut Limited Days 0nly", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Our next book club meeting is scheduled for Tuesday, October 6th, at 6:00 PM. We’ll be discussing “The Alchemist” by Paulo Coelho. Make sure to finish reading it before the meeting, and bring your thoughts and questions for the discussion.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Thanks for sending over the initial draft. I’ve made a few changes to the document, and you can find the updated version attached. Let me know if you have any questions.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Congratulations! You have been selected as the winner of our $1,000,000 prize! Click here to claim your reward now: www.claimprize.com. Act fast! Offer expires soon.", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Just a quick reminder about the office potluck on Friday! Don’t forget to bring a dish to share with your colleagues. Looking forward to seeing everyone’s culinary creations!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("I hope you are doing well. I wanted to let you know that I have submitted my final report for the course. Please confirm when you receive it.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Please find attached the agenda for tomorrow’s meeting. We’ll be discussing the Q4 sales targets and the marketing strategy for the new product launch. Let me know if you have any points to add.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Hurry! Get a 90% discount on all our products! This is a one-time offer just for you! Visit www.superdeals.com and use code SAVE90 at checkout. Don't miss out on this amazing opportunity!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Wishing you a very happy birthday! I hope you have a fantastic day filled with joy, laughter, and cake! Let’s catch up soon.", {"cats": {"SPAM": 0, "HAM": 1}}),

    ("I just wanted to follow up on our last meeting. Have you had a chance to review the proposal we sent over? We’d love to hear your thoughts and discuss the next steps.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("We’re planning a family gathering at Grandma’s house next Sunday. We’ll have lunch around 1 PM, and it’ll be great to catch up with everyone. Let me know if you can make it!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("GCash: Account verification needed due to suspicious transaction. Kindly Visit: gcares-protect-ph.li to continue using our services", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Please note that the lecture scheduled for Tuesday, October 12th, has been moved to Thursday, October 14th, at 3 PM. The classroom remains the same. I apologize for any inconvenience.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("This is a reminder of your upcoming dentist appointment on Monday, October 18th, at 9:30 AM. Please contact us if you need to reschedule.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("This is a reminder that the book you borrowed, “The Catcher in the Rye,” is due for return on October 7th. Please return or renew it by then to avoid any late fees.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You’re all invited to our annual block party on Saturday, October 16th! We’ll have food, games, and music from 12 PM to 6 PM. Bring your family, and let’s have some fun!", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Hi there! You have been selected for an all-expense-paid trip to the Bahamas! To claim your FREE vacation, all you need to do is fill out a quick survey. Click here now: www.freevacation.com.", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Here are the notes from our weekly team meeting held today. Please review and let me know if there are any changes or additions. We’ll be following up on these action items next week.", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your reservation at Oceanview Resort has been confirmed for October 10th to October 15th. We look forward to welcoming you. If you need assistance or have special requests, feel free to contact us.", {"cats": {"SPAM": 0, "HAM": 1}})
]

# Define a comprehensive preprocessing function
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove special characters (anything that is not a letter, digit, or whitespace)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Tokenize and remove stopwords; make_doc runs the tokenizer only,
    # so the textcat component does not need to be initialized yet
    tokens = [token.text for token in nlpTC.make_doc(text) if not token.is_stop]
    # Join tokens back to a string
    return ' '.join(tokens)

# Apply the comprehensive text preprocessing to every text; labels are unchanged
train_data = [(preprocess_text(text), annotations) for text, annotations in train_data]

# Prepare the preprocessed training data in spaCy's Example format
train_examples = []
for text, annotations in train_data:
    doc = nlpTC.make_doc(text)
    example = Example.from_dict(doc, annotations)
    train_examples.append(example)

# Split data into training, validation, and testing sets
train_examples, test_examples = train_test_split(train_examples, test_size=0.2, random_state=42)
train_examples, val_examples = train_test_split(train_examples, test_size=0.25, random_state=42)  # 25% of the remaining 80% = 20% of the full dataset

# Print the split data to visualize each set
print("TRAINING SET (60% of the data):")
for example in train_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

print("\nVALIDATION SET (20% of the data):")
for example in val_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

print("\nTESTING SET (20% of the data):")
for example in test_examples:
    print(f"Text: {example.reference.text} - Label: {example.reference.cats}")

# Training the model with more iterations for small datasets
n_iter = 10  # Set iterations
# Initialize the pipeline on the training split; initialize() returns the optimizer
optimizer = nlpTC.initialize(lambda: train_examples)

for i in range(n_iter):
    losses = {}
    for batch in spacy.util.minibatch(train_examples, size=2):  # Small batch size for small data
        # Update on the whole mini-batch rather than one example at a time
        nlpTC.update(batch, sgd=optimizer, losses=losses)
    print(f"Iteration {i+1}/{n_iter} - Loss: {losses['textcat']}")

# Testing the model
print("\nSample Prediction Output with probabilities:")
# Apply the same preprocessing that was applied to the training text
doc = nlpTC(preprocess_text("Claim your prize now!"))
print(doc.cats)

# Function to classify user input emails
def classify_email(email):
    email = preprocess_text(email)
    doc = nlpTC(email)
    spam_score = doc.cats['SPAM']
    ham_score = doc.cats['HAM']

    if spam_score > ham_score:
        return "SPAM"
    else:
        return "HAM"

# Calculate accuracy, precision, recall, and F1 score on the test set
true_labels = [1 if example.reference.cats['SPAM'] == 1 else 0 for example in test_examples]
predicted_labels = [1 if classify_email(example.reference.text) == 'SPAM' else 0 for example in test_examples]

# Calculate and print metrics
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='weighted')
recall = recall_score(true_labels, predicted_labels, average='weighted')
f1 = f1_score(true_labels, predicted_labels, average='weighted')

# Display results
print(f"\nAccuracy: {accuracy * 100:.4f}%")
print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

# Allow users to test the model by inputting their own data
while True:
    user_input = input("\nEnter a sample email for classification (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    classification = classify_email(user_input)
    print(f"The email is classified as: {classification}")


TRAINING SET (60% of the data):
Text: dear user detected suspicious activity bank account prevent account suspended click link verify details wwwbanksecurecom failure 24 hours result account suspension - Label: {'SPAM': 1, 'HAM': 0}
Text: hello - Label: {'SPAM': 0, 'HAM': 1}
Text: discuss project - Label: {'SPAM': 0, 'HAM': 1}
Text: thank applying marketing manager position xyz company pleased invite interview monday october 5th 200 pm interview conducted zoom details sent soon - Label: {'SPAM': 0, 'HAM': 1}
Text: reschedule - Label: {'SPAM': 0, 'HAM': 1}
Text: m lemme know ready - Label: {'SPAM': 0, 'HAM': 1}
Text: quick reminder meeting tomorrow 1000 conference room bring project update documents - Label: {'SPAM': 0, 'HAM': 1}
Text: notes weekly team meeting held today review let know changes additions following action items week - Label: {'SPAM': 0, 'HAM': 1}
Text: hope wanted share quick update project status track complete phase end week ill schedule meeting monday discuss steps -
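
The evaluation code above reports weighted averages, which on a HAM-heavy dataset are dominated by the HAM class. As a complementary check, the following is a minimal sketch of precision, recall, and F1 for the SPAM class alone. The function name `spam_metrics` and the example label lists are illustrative assumptions, not part of the notebook; labels follow the notebook's encoding of 1 for SPAM and 0 for HAM.

```python
def spam_metrics(true_labels, predicted_labels, positive=1):
    """Precision, recall and F1 for the positive (SPAM) class only."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical labels for a 10-message test set (1 = SPAM, 0 = HAM)
example_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
example_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

precision, recall, f1 = spam_metrics(example_true, example_pred)
print(f"SPAM precision: {precision:.2f}, recall: {recall:.2f}, F1: {f1:.2f}")
```

In this made-up case nine of ten messages are classified correctly, yet half of the spam is missed, which the SPAM-class recall makes visible.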

## Named Entity Recognition

This code is the same as the code for Subset 1, except that the dataset contains 100 samples. It also defines a preprocessing function that lowercases the text and removes special characters and stopwords, with the aim of cleaning the text before it is passed to the model. In the code below the function is defined but not called: the evaluation loop runs on the original sentences, which is what the character offsets in the annotations refer to.
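
Each annotation in the data below is a `(start, end, label)` triple of character offsets into the sentence. Hand-written offsets are easy to get wrong: in the first sample, `(39, 46)` covers `" in Sea"` rather than `"Seattle"`. The following is a small sketch of deriving offsets from the entity's surface text instead; `span_for` is a hypothetical helper, not part of spaCy or of the notebook.

```python
def span_for(text, surface, label, occurrence=0):
    """Return (start, end, label) for the chosen occurrence of `surface` in `text`."""
    start = -1
    for _ in range(occurrence + 1):
        # Search from just past the previous match
        start = text.find(surface, start + 1)
        if start == -1:
            raise ValueError(f"{surface!r} not found in {text!r}")
    return (start, start + len(surface), label)


text = "Microsoft announced a new AI initiative in Seattle."
annotations = [span_for(text, "Microsoft", "ORG"), span_for(text, "Seattle", "GPE")]
print(annotations)

# Slicing with the derived offsets recovers the entity text exactly
for start, end, label in annotations:
    print(label, repr(text[start:end]))
```

The `occurrence` argument is there for sentences in which the same surface string appears more than once.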

In [None]:
!python -m spacy download en_core_web_lg
import re
import spacy
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load a pre-trained NER model
nlp = spacy.load("en_core_web_lg")

# Sample training data (text and true entity annotations)
training_data = [
    ("Microsoft announced a new AI initiative in Seattle.", [(0, 9, "ORG"), (39, 46, "GPE")]),
    ("Google I/O will take place in May 2023.", [(0, 10, "EVENT"), (29, 37, "DATE")]),
    ("The unemployment rate in the U.S. dropped to 3.5%.", [(34, 38, "PERCENT"), (27, 31, "GPE")]),
    ("The Chinese economy grew by 5% last year.", [(4, 11, "NORP")]),
    ("Sundar Pichai is the CEO of Google.", [(0, 13, "PERSON"), (28, 34, "ORG")]),
    ("Tesla secured $2 billion in new funding.", [(14, 22, "MONEY")]),
    ("Amazon is opening a new office in Vancouver.", [(0, 6, "ORG"), (36, 45, "GPE")]),
    ("Samsung released its new Galaxy S22 phone.", [(0, 7, "ORG"), (23, 32, "PRODUCT")]),
    ("The Pacific Ocean is the largest body of water on Earth.", [(4, 17, "LOC")]),
    ("The headquarters of IBM is in New York City.", [(21, 24, "ORG"), (31, 44, "GPE")]),

    ("Satya Nadella leads Microsoft Corporation.", [(0, 12, "PERSON"), (19, 38, "ORG")]),
    ("The FIFA World Cup will be held in Qatar in 2022.", [(4, 18, "EVENT"), (34, 39, "GPE"), (43, 47, "DATE")]),
    ("Apple plans to invest $10 billion in manufacturing.", [(23, 32, "MONEY")]),
    ("A new skyscraper is being built in Dubai.", [(33, 38, "GPE")]),
    ("70% of the world's population is now online.", [(0, 3, "PERCENT")]),
    ("Elon Musk founded SpaceX and Tesla.", [(0, 9, "PERSON"), (17, 23, "ORG"), (28, 33, "ORG")]),
    ("The startup raised $50 million in Series B.", [(15, 25, "MONEY")]),
    ("The next Apple event is scheduled for March 25th.", [(9, 14, "ORG"), (39, 48, "DATE")]),
    ("The new company is aiming for a 15% market share.", [(28, 31, "PERCENT")]),
    ("Apple's iPhone 14 is expected to launch in 2023.", [(0, 5, "ORG"), (7, 15, "PRODUCT"), (46, 50, "DATE")]),

    ("A German scientist won the Nobel Prize.", [(2, 8, "NORP")]),
    ("Facebook plans to launch new features in June.", [(0, 7, "ORG"), (30, 35, "DATE")]),
    ("The CEO of Apple, Tim Cook, announced new products.", [(14, 22, "PERSON"), (4, 9, "ORG")]),
    ("NASA's Perseverance rover landed on Mars.", [(0, 4, "ORG"), (34, 38, "GPE")]),
    ("The 2024 Summer Olympics will take place in Paris.", [(4, 24, "EVENT"), (40, 45, "GPE")]),
    ("The inflation rate reached 8.6% last month.", [(28, 32, "PERCENT")]),
    ("Coca-Cola launched a new flavor this spring.", [(0, 10, "ORG"), (36, 41, "DATE")]),
    ("The World Health Organization declared a health emergency.", [(4, 30, "ORG")]),
    ("Berkshire Hathaway's stock price increased by $500.", [(0, 23, "ORG"), (37, 40, "MONEY")]),
    ("In 2020, remote work became the new normal.", [(3, 7, "DATE")]),

    ("Mark Zuckerberg met with world leaders to discuss technology.", [(0, 15, "PERSON")]),
    ("The Great Wall of China is a popular tourist attraction.", [(4, 20, "LOC")]),
    ("The Grammy Awards will be held in Los Angeles.", [(0, 14, "EVENT"), (30, 43, "GPE")]),
    ("Intel announced a new chip that will improve processing speed.", [(0, 5, "ORG")]),
    ("The stock market saw a decline of 4% today.", [(28, 31, "PERCENT")]),
    ("Microsoft is acquiring LinkedIn for $26.2 billion.", [(0, 9, "ORG"), (26, 39, "ORG"), (44, 57, "MONEY")]),
    ("SpaceX plans to launch its Starship rocket next year.", [(0, 6, "ORG"), (34, 39, "DATE")]),
    ("The next big tech conference is set for September.", [(9, 13, "EVENT"), (38, 47, "DATE")]),
    ("The United Nations addresses global challenges.", [(4, 17, "ORG")]),
    ("Bill Gates founded Microsoft in 1975.", [(0, 10, "PERSON"), (21, 29, "ORG"), (32, 36, "DATE")]),

    ("A recent study showed that 60% of students prefer online classes.", [(36, 38, "PERCENT")]),
    ("The Louvre Museum is located in Paris.", [(4, 22, "ORG"), (30, 35, "GPE")]),
    ("The 2022 World Cup will be hosted in Qatar.", [(4, 18, "EVENT"), (35, 40, "GPE")]),
    ("Netflix added 8 million new subscribers in 2021.", [(7, 14, "ORG"), (23, 24, "MONEY"), (29, 33, "DATE")]),
    ("The first electric car was launched by Tesla in 2008.", [(29, 34, "ORG"), (39, 43, "DATE")]),
    ("Researchers found a new species of frog in Madagascar.", [(36, 49, "LOC")]),
    ("In 2019, the world saw significant advancements in AI.", [(3, 7, "DATE")]),
    ("The White House issued a statement regarding climate change.", [(4, 15, "GPE")]),
    ("Elon Musk is the founder of SpaceX and Tesla.", [(0, 9, "PERSON"), (23, 29, "ORG"), (34, 39, "ORG")]),
    ("Tesla plans to produce 20 million cars by 2030.", [(0, 5, "ORG"), (34, 40, "PERCENT"), (44, 48, "DATE")]),

    ("The next FIFA World Cup will be in 2026.", [(9, 13, "EVENT"), (25, 29, "DATE")]),
    ("Apple's market share reached an all-time high.", [(0, 5, "ORG"), (27, 35, "PERCENT")]),
    ("Amazon Prime Video will launch new shows this fall.", [(0, 6, "ORG"), (36, 41, "DATE")]),
    ("Google's headquarters is in Mountain View.", [(0, 6, "ORG"), (29, 32, "GPE")]),
    ("Facebook was founded by Mark Zuckerberg.", [(0, 8, "ORG"), (22, 36, "PERSON")]),
    ("The United Kingdom is hosting the G7 summit.", [(4, 17, "GPE"), (31, 35, "EVENT")]),
    ("Sony released the PlayStation 5 in late 2020.", [(0, 4, "ORG"), (16, 30, "PRODUCT"), (34, 38, "DATE")]),
    ("The next lunar eclipse will be on November 8th.", [(9, 14, "EVENT"), (27, 34, "DATE")]),
    ("Elon Musk is developing a new satellite internet service.", [(0, 9, "PERSON"), (30, 40, "PRODUCT")]),
    ("The Amazon rainforest is crucial for biodiversity.", [(4, 10, "LOC")]),

    ("Toyota unveiled its electric car lineup this year.", [(0, 6, "ORG"), (39, 43, "DATE")]),
    ("The Summer Olympics will take place in Tokyo in 2021.", [(4, 24, "EVENT"), (38, 43, "GPE"), (46, 50, "DATE")]),
    ("The Eiffel Tower is one of the most visited monuments.", [(4, 15, "LOC")]),
    ("NASA's Artemis program aims to return humans to the Moon.", [(0, 4, "ORG"), (35, 50, "EVENT")]),
    ("The stock market experienced a significant downturn.", [(4, 9, "LOC")]),
    ("Gold prices surged to an all-time high this week.", [(0, 4, "MONEY"), (25, 35, "DATE")]),
    ("The Met Gala is a major fundraising event.", [(4, 12, "EVENT")]),
    ("The Berlin Wall fell in 1989.", [(4, 15, "LOC"), (19, 23, "DATE")]),
    ("Instagram was acquired by Facebook in 2012.", [(0, 9, "ORG"), (23, 30, "ORG"), (34, 38, "DATE")]),
    ("Microsoft will invest in renewable energy projects.", [(0, 9, "ORG")]),

    ("The World Cup is set to take place in Qatar.", [(4, 10, "EVENT"), (26, 32, "GPE")]),
    ("The Great Barrier Reef is located off the coast of Australia.", [(4, 21, "LOC")]),
    ("Bill Gates and Melinda Gates announced their divorce.", [(0, 10, "PERSON"), (15, 27, "PERSON")]),
    ("The 2024 presidential election will be highly competitive.", [(4, 36, "EVENT"), (40, 50, "DATE")]),
    ("SpaceX's Falcon Heavy launched successfully last year.", [(0, 6, "ORG"), (7, 17, "PRODUCT"), (36, 41, "DATE")]),
    ("The new iPhone model features advanced camera technology.", [(4, 9, "ORG"), (20, 27, "PRODUCT")]),
    ("Alibaba's revenue soared during the pandemic.", [(0, 7, "ORG")]),
    ("The Cannes Film Festival is a prestigious event.", [(4, 27, "EVENT")]),
    ("The Tesla Model 3 has become very popular.", [(0, 5, "ORG"), (10, 22, "PRODUCT")]),
    ("Virtual reality is gaining traction in gaming.", [(0, 7, "LOC")]),

    ("The COVID-19 vaccine rollout has accelerated globally.", [(4, 12, "EVENT")]),
    ("Google's Android operating system dominates the market.", [(0, 6, "ORG")]),
    ("The Nobel Peace Prize was awarded to Malala Yousafzai.", [(4, 28, "EVENT"), (33, 50, "PERSON")]),
    ("The tech industry is evolving rapidly with AI advancements.", [(4, 12, "LOC")]),
    ("Elon Musk plans to send humans to Mars.", [(0, 9, "PERSON"), (23, 27, "GPE")]),
    ("The 2021 Tokyo Olympics faced many challenges.", [(4, 25, "EVENT"), (31, 36, "DATE")]),
    ("The British Royal Family attended the funeral of Prince Philip.", [(4, 31, "GPE"), (39, 50, "PERSON")]),
    ("Netflix is producing a new documentary series.", [(0, 7, "ORG")]),
    ("The Paris Agreement addresses climate change issues.", [(4, 18, "EVENT")]),
    ("The Olympic Games in Paris are highly anticipated.", [(4, 20, "EVENT"), (26, 31, "GPE")]),

    ("The smartphone market is becoming saturated.", [(4, 14, "LOC")]),
    ("Amazon is facing increased competition from Walmart.", [(0, 6, "ORG"), (33, 39, "ORG")]),
    ("The United Nations General Assembly meets annually.", [(4, 36, "ORG")]),
    ("The 2023 Cricket World Cup will be hosted by India.", [(4, 27, "EVENT"), (40, 45, "GPE")]),
    ("Tesla's stock prices have fluctuated dramatically.", [(0, 5, "ORG")]),
    ("The Grammy Awards are held every year.", [(4, 18, "EVENT")]),
    ("The Eiffel Tower attracts millions of tourists every year.", [(4, 15, "LOC"), (36, 41, "DATE")]),
    ("NASA's Mars Rover is searching for signs of life.", [(0, 4, "ORG")]),
    ("The 2024 U.S. Presidential election is coming up.", [(4, 26, "EVENT"), (30, 34, "DATE")]),
    ("Tesla is set to launch its new Cybertruck.", [(0, 5, "ORG"), (30, 34, "PRODUCT")])
]


# Define a comprehensive preprocessing function
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Tokenization and stopword removal
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return ' '.join(tokens)

# Initialize lists for storing true and predicted entities
all_true_entities = []
all_pred_entities = []

# Iterate through the annotated sentences
for text, true_entities in training_data:
    # Run NER model
    doc = nlp(text)

    # Predicted entities from the model
    pred_entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

    # Store true and predicted entities for all examples
    all_true_entities.extend(true_entities)
    all_pred_entities.extend(pred_entities)

# Label each predicted entity: 1 if its (start, end, label) triple appears in the annotations, else 0.
# The lookup is against the pooled list, so a prediction can also match an annotation from another sentence.
y_true = [1 if ent in all_true_entities else 0 for ent in all_pred_entities]
# Every entry is a prediction, so y_pred is all ones. Recall is therefore 100% by construction
# and only precision is informative; annotated entities the model missed are not counted.
y_pred = [1 for _ in all_pred_entities]

# Calculate Precision, Recall, F1
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 3.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Precision: 35.3933%
Recall: 100.0000%
F1 Score: 52.2822%
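
Because `y_pred` is all ones in the code above, the reported recall is 100% by construction and only the precision figure carries information. The following is a minimal sketch of span-level scoring that also counts annotated entities the model missed. It compares gold and predicted `(start, end, label)` triples sentence by sentence; the function name `score_spans` and the toy inputs are assumptions for illustration, and the predicted spans are made up rather than produced by `en_core_web_lg`.

```python
def score_spans(gold_per_doc, pred_per_doc):
    """Micro-averaged precision, recall and F1 over exact (start, end, label) matches."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_per_doc, pred_per_doc):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)   # predicted and annotated
        fp += len(pred_set - gold_set)   # predicted but not annotated
        fn += len(gold_set - pred_set)   # annotated but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy data: two sentences with gold spans and hypothetical predictions
gold = [
    [(0, 9, "ORG"), (43, 50, "GPE")],
    [(0, 13, "PERSON"), (28, 34, "ORG")],
]
pred = [
    [(0, 9, "ORG")],                                        # misses the GPE
    [(0, 13, "PERSON"), (28, 34, "ORG"), (17, 24, "ORG")],  # one spurious span
]

precision, recall, f1 = score_spans(gold, pred)
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}")
```

Applied to the notebook, `gold_per_doc` would be the annotation lists from `training_data` and `pred_per_doc` the `doc.ents` triples collected for each sentence. Keeping them per sentence avoids matching a prediction against an annotation from a different sentence.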


## Part of Speech Tagging

This code is the same as the code for Subset 1, except that the training data contains 100 samples. It also provides a preprocessing function that lowercases the text and removes special characters and stopwords, with the aim of improving text quality before it is used for training.
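
In the samples shown below, the gold tag lists contain one tag per word and omit sentence-final punctuation, so predicted tags need the punctuation filtered out before the two sequences can be compared position by position. The following is a minimal sketch of that alignment and of a token-level accuracy. `token_accuracy` and the hard-coded predictions are illustrative assumptions; in the notebook the `(token, tag)` pairs would presumably come from `token.text` and `token.pos_`.

```python
def token_accuracy(gold_tags, predicted_pairs):
    """Fraction of non-punctuation tokens whose predicted tag equals the gold tag."""
    # Drop punctuation so predictions line up with the punctuation-free gold list
    predicted_tags = [tag for _, tag in predicted_pairs if tag != "PUNCT"]
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("gold and predicted sequences differ in length after filtering")
    correct = sum(1 for g, p in zip(gold_tags, predicted_tags) if g == p)
    return correct / len(gold_tags) if gold_tags else 0.0


gold = ['PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN']
# Hypothetical tagger output for "She sells seashells by the seashore."
predicted = [
    ("She", "PRON"), ("sells", "VERB"), ("seashells", "NOUN"),
    ("by", "ADP"), ("the", "DET"), ("seashore", "PROPN"), (".", "PUNCT"),
]

print(f"Token accuracy: {token_accuracy(gold, predicted):.2%}")
```

The length check makes a tokenization mismatch fail loudly instead of silently comparing misaligned tags.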

In [None]:
!python -m spacy download en_core_web_lg
import spacy
import random
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load spaCy's POS tagging model
nlp = spacy.load("en_core_web_lg")

# Sample expanded training data: list of (text, true_pos_tags) pairs
training_data = [
    ("She sells seashells by the seashore.", ['PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN']),
    ("The quick brown fox jumps over the lazy dog.", ['DET', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'ADP', 'DET', 'ADJ', 'NOUN']),
    ("I love coding in Python.", ['PRON', 'VERB', 'NOUN', 'ADP', 'PROPN']),
    ("Birds fly in the sky.", ['NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("Alice and Bob went to the market.", ['PROPN', 'CCONJ', 'PROPN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("Reading books is fun.", ['VERB', 'NOUN', 'AUX', 'ADJ']),
    ("My car is very fast.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("We are going to the zoo.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("It is raining today.", ['PRON', 'AUX', 'VERB', 'NOUN']),
    ("Programming languages are interesting.", ['NOUN', 'NOUN', 'AUX', 'ADJ']),

    ("The cat sleeps on the mat.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("John likes to play soccer.", ['PROPN', 'VERB', 'PART', 'VERB', 'NOUN']),
    ("She is learning French.", ['PRON', 'AUX', 'VERB', 'PROPN']),
    ("The weather is nice today.", ['DET', 'NOUN', 'AUX', 'ADJ', 'NOUN']),
    ("He bought a new laptop yesterday.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN', 'NOUN']),
    ("They are swimming in the pool.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("The pizza smells delicious.", ['DET', 'NOUN', 'VERB', 'ADJ']),
    ("Can you help me with this project?", ['AUX', 'PRON', 'VERB', 'PRON', 'ADP', 'DET', 'NOUN']),
    ("This task is quite difficult.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("He enjoys reading books.", ['PRON', 'VERB', 'VERB', 'NOUN']),

    ("The dog barked loudly at the strangers.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),
    ("I have a meeting tomorrow.", ['PRON', 'AUX', 'DET', 'NOUN', 'ADJ']),
    ("They will travel to Spain next year.", ['PRON', 'AUX', 'VERB', 'ADP', 'PROPN', 'ADV', 'NOUN']),
    ("He plays the guitar beautifully.", ['PRON', 'VERB', 'DET', 'NOUN', 'ADV']),
    ("The book on the shelf is mine.", ['DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'AUX', 'PRON']),
    ("Jessica ran a marathon last summer.", ['PROPN', 'VERB', 'DET', 'NOUN', 'ADJ', 'NOUN']),
    ("Cooking is a wonderful hobby.", ['VERB', 'AUX', 'DET', 'ADJ', 'NOUN']),
    ("The stars shine brightly in the night sky.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN', 'NOUN']),
    ("I am learning how to code.", ['PRON', 'AUX', 'VERB', 'ADV', 'ADP', 'VERB']),
    ("The flowers bloom in spring.", ['DET', 'NOUN', 'VERB', 'ADP', 'NOUN']),

    ("My friends enjoy hiking on weekends.", ['DET', 'NOUN', 'VERB', 'VERB', 'ADP', 'NOUN']),
    ("Dogs are great companions.", ['NOUN', 'AUX', 'ADJ', 'NOUN']),
    ("She wrote an amazing story.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("The sun rises in the east.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("He plays soccer every weekend.", ['PRON', 'VERB', 'NOUN', 'ADV', 'NOUN']),
    ("Reading novels helps improve vocabulary.", ['VERB', 'NOUN', 'VERB', 'VERB', 'NOUN']),
    ("My family enjoys movie nights.", ['DET', 'NOUN', 'VERB', 'NOUN', 'NOUN']),
    ("She is very talented in music.", ['PRON', 'AUX', 'ADV', 'ADJ', 'ADP', 'NOUN']),
    ("We will celebrate his birthday soon.", ['PRON', 'AUX', 'VERB', 'PRON', 'NOUN', 'ADV']),
    ("The wind blew fiercely during the storm.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),

    ("They went hiking in the mountains.", ['PRON', 'VERB', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("This recipe is quite easy.", ['DET', 'NOUN', 'AUX', 'ADV', 'ADJ']),
    ("The teacher explains concepts clearly.", ['DET', 'NOUN', 'VERB', 'NOUN', 'ADV']),
    ("We have been working on this project.", ['PRON', 'AUX', 'VERB', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("He prefers tea over coffee.", ['PRON', 'VERB', 'NOUN', 'ADP', 'NOUN']),
    ("The child laughed joyfully at the joke.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),
    ("She likes to dance at parties.", ['PRON', 'VERB', 'PART', 'VERB', 'ADP', 'NOUN']),
    ("They are playing video games right now.", ['PRON', 'AUX', 'VERB', 'NOUN', 'ADV', 'ADV']),
    ("The cat chased the mouse.", ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']),
    ("Jack and Jill went up the hill.", ['PROPN', 'CCONJ', 'PROPN', 'VERB', 'ADP', 'DET', 'NOUN']),

    ("The children are laughing in the park.", ['DET', 'NOUN', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("She loves to read novels.", ['PRON', 'VERB', 'PART', 'VERB', 'NOUN']),
    ("The fish swims in the ocean.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("He wrote a letter yesterday.", ['PRON', 'VERB', 'DET', 'NOUN', 'NOUN']),
    ("They are playing soccer after school.", ['PRON', 'AUX', 'VERB', 'NOUN', 'ADP', 'NOUN']),
    ("The chef prepares delicious meals.", ['DET', 'NOUN', 'VERB', 'ADJ', 'NOUN']),
    ("We will visit our grandparents next weekend.", ['PRON', 'AUX', 'VERB', 'PRON', 'NOUN', 'ADV', 'NOUN']),
    ("The dog fetches the ball.", ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']),
    ("She enjoys painting landscapes.", ['PRON', 'VERB', 'VERB', 'NOUN']),
    ("The phone rang unexpectedly.", ['DET', 'NOUN', 'VERB', 'ADV']),

    ("They will join us for dinner.", ['PRON', 'AUX', 'VERB', 'PRON', 'ADP', 'NOUN']),
    ("He is running very fast.", ['PRON', 'AUX', 'VERB', 'ADV', 'ADJ']),
    ("The train arrives at the station.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("She sings beautifully on stage.", ['PRON', 'VERB', 'ADV', 'ADP', 'NOUN']),
    ("The baby cried all night.", ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']),
    ("We are going shopping tomorrow.", ['PRON', 'AUX', 'VERB', 'VERB', 'NOUN']),
    ("He finished his homework before dinner.", ['PRON', 'VERB', 'PRON', 'NOUN', 'ADP', 'NOUN']),
    ("The sun sets in the west.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("She has a beautiful voice.", ['PRON', 'AUX', 'DET', 'ADJ', 'NOUN']),
    ("The garden is full of flowers.", ['DET', 'NOUN', 'AUX', 'ADJ', 'ADP', 'NOUN']),

    ("They watched a movie last night.", ['PRON', 'VERB', 'DET', 'NOUN', 'ADJ', 'NOUN']),
    ("The children played happily at the playground.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),
    ("We are studying for the exam.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("He kicked the ball into the goal.", ['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN']),
    ("She is going to the concert tonight.", ['PRON', 'AUX', 'VERB', 'ADP', 'DET', 'NOUN', 'ADV']),
    ("The computer crashed during the update.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("I saw a shooting star.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("They will attend the meeting next week.", ['PRON', 'AUX', 'VERB', 'DET', 'NOUN', 'ADV', 'NOUN']),
    ("The mountain trail is steep.", ['DET', 'NOUN', 'AUX', 'ADJ']),
    ("He traveled to Paris last summer.", ['PRON', 'VERB', 'ADP', 'PROPN', 'ADJ', 'NOUN']),

    ("She baked cookies for her friends.", ['PRON', 'VERB', 'NOUN', 'ADP', 'PRON', 'NOUN']),
    ("The artist painted a stunning mural.", ['DET', 'NOUN', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("We enjoy exploring new places.", ['PRON', 'VERB', 'VERB', 'ADJ', 'NOUN']),
    ("He repaired the broken fence.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("They discovered a hidden treasure.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("The baby laughed at the puppy.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("She danced gracefully across the stage.", ['PRON', 'VERB', 'ADV', 'ADP', 'DET', 'NOUN']),
    ("The team won the championship.", ['DET', 'NOUN', 'VERB', 'DET', 'NOUN']),
    ("I found a great restaurant.", ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("He enjoys hiking during the summer.", ['PRON', 'VERB', 'VERB', 'ADP', 'DET', 'NOUN']),

    ("The car sped down the highway.", ['DET', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN']),
    ("She likes to play the piano.", ['PRON', 'VERB', 'PART', 'VERB', 'DET', 'NOUN']),
    ("They ran a marathon in record time.", ['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'NOUN', 'NOUN']),
    ("The flowers bloomed beautifully in spring.", ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'NOUN']),
    ("She is reading a fascinating book.", ['PRON', 'AUX', 'VERB', 'DET', 'ADJ', 'NOUN']),
    ("The dog wagged its tail excitedly.", ['DET', 'NOUN', 'VERB', 'PRON', 'NOUN', 'ADV']),
    ("We are planning a trip next month.", ['PRON', 'AUX', 'VERB', 'DET', 'NOUN', 'ADJ', 'NOUN']),
    ("He solved the puzzle in no time.", ['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN']),
    ("The storm caused significant damage.", ['DET', 'NOUN', 'VERB', 'ADJ', 'NOUN']),
    ("They are learning new skills at work.", ['PRON', 'AUX', 'VERB', 'ADJ', 'NOUN', 'ADP', 'NOUN'])
]
# Preprocess text: convert to lowercase and shuffle the training data
training_data = [(text.lower(), tags) for text, tags in training_data]
random.shuffle(training_data)

# Split data into training, validation, and test sets (60% train, 20% validation, 20% test)
train_data, temp_data = train_test_split(training_data, test_size=0.4, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Initialize lists to store true and predicted POS tags for all sets
all_true_pos_tags_train, all_predicted_pos_tags_train = [], []
all_true_pos_tags_val, all_predicted_pos_tags_val = [], []
all_true_pos_tags_test, all_predicted_pos_tags_test = [], []

# Function to preprocess and evaluate POS tagging
def process_data(data, all_true_pos_tags, all_predicted_pos_tags):
    for text, true_pos_tags in data:
        # Special character removal
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        # Process the text with spaCy
        doc = nlp(text)
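        # Extract predicted POS tags, skipping stopwords and punctuation.
        # Note: true_pos_tags still covers every word, so dropping stopwords here
        # leaves the predicted sequence shorter than, and misaligned with, the true one.
        predicted_pos_tags = [token.pos_ for token in doc if not token.is_stop and not token.is_punct]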
        # Extend lists with true and predicted tags for evaluation
        all_true_pos_tags.extend(true_pos_tags)
        all_predicted_pos_tags.extend(predicted_pos_tags)

# Process training, validation, and test data
process_data(train_data, all_true_pos_tags_train, all_predicted_pos_tags_train)
process_data(val_data, all_true_pos_tags_val, all_predicted_pos_tags_val)
process_data(test_data, all_true_pos_tags_test, all_predicted_pos_tags_test)

# Ensure both lists are the same length to avoid ValueError
def ensure_equal_length(true_tags, predicted_tags):
    if len(true_tags) != len(predicted_tags):
        min_length = min(len(true_tags), len(predicted_tags))
        true_tags = true_tags[:min_length]
        predicted_tags = predicted_tags[:min_length]
    return true_tags, predicted_tags

# Ensure correct lengths for all sets
all_true_pos_tags_train, all_predicted_pos_tags_train = ensure_equal_length(all_true_pos_tags_train, all_predicted_pos_tags_train)
all_true_pos_tags_val, all_predicted_pos_tags_val = ensure_equal_length(all_true_pos_tags_val, all_predicted_pos_tags_val)
all_true_pos_tags_test, all_predicted_pos_tags_test = ensure_equal_length(all_true_pos_tags_test, all_predicted_pos_tags_test)

# Function to calculate metrics for a dataset
def evaluate_metrics(true_tags, predicted_tags):
    accuracy = accuracy_score(true_tags, predicted_tags)
    precision = precision_score(true_tags, predicted_tags, average='weighted')
    recall = recall_score(true_tags, predicted_tags, average='weighted')
    f1 = f1_score(true_tags, predicted_tags, average='weighted')
    return accuracy, precision, recall, f1

# Evaluate on training, validation, and test sets
metrics_train = evaluate_metrics(all_true_pos_tags_train, all_predicted_pos_tags_train)
metrics_val = evaluate_metrics(all_true_pos_tags_val, all_predicted_pos_tags_val)
metrics_test = evaluate_metrics(all_true_pos_tags_test, all_predicted_pos_tags_test)

# Average each metric across the training, validation, and test sets
total_accuracy = (metrics_train[0] + metrics_val[0] + metrics_test[0]) / 3
total_precision = (metrics_train[1] + metrics_val[1] + metrics_test[1]) / 3
total_recall = (metrics_train[2] + metrics_val[2] + metrics_test[2]) / 3
total_f1 = (metrics_train[3] + metrics_val[3] + metrics_test[3]) / 3

# Print consolidated metrics
print("Consolidated Metrics across Training, Validation, and Test Data:")
print(f"Accuracy: {total_accuracy * 100:.4f}%")
print(f"Precision: {total_precision * 100:.4f}%")
print(f"Recall: {total_recall * 100:.4f}%")
print(f"F1 Score: {total_f1 * 100:.4f}%")

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 853.7 kB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Consolidated Metrics across Training, Validation, and Test Data:
Accuracy: 21.6055%
Precision: 11.8210%
Recall: 21.6055%
F1 Score: 15.1812%


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
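
The low scores above are consistent with a sequence alignment problem: stopwords are removed from the predicted tags but not from the reference tags, and truncating both lists to the same length does not line them up again. The following is a minimal sketch of one way to keep the two sequences aligned, assuming the tokens, reference tags, and predicted tags are available position by position. The names `STOPWORDS`, `filter_aligned`, and `tag_accuracy`, as well as the sample tagger output, are hypothetical and not part of the notebook.

```python
# Hypothetical stopword list for illustration; spaCy's built-in list is much larger.
STOPWORDS = {"the", "is", "in", "a", "to", "on"}


def filter_aligned(tokens, gold, pred, stopwords=STOPWORDS):
    """Drop stopword positions from the gold and predicted tags together."""
    if not (len(tokens) == len(gold) == len(pred)):
        raise ValueError("tokens, gold, and pred must have the same length")
    kept = [(g, p) for tok, g, p in zip(tokens, gold, pred)
            if tok.lower() not in stopwords]
    return [g for g, _ in kept], [p for _, p in kept]


def tag_accuracy(gold, pred):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


tokens = ["birds", "fly", "in", "the", "sky"]
gold = ["NOUN", "VERB", "ADP", "DET", "NOUN"]
pred = ["NOUN", "VERB", "ADP", "DET", "NOUN"]  # assumed tagger output, all correct

# Filtering only the predictions and then truncating the gold tags shifts the
# comparison, so a perfect tagger is scored as if it had made mistakes.
pred_only = [p for tok, p in zip(tokens, pred) if tok not in STOPWORDS]
misaligned_accuracy = tag_accuracy(gold[:len(pred_only)], pred_only)

# Filtering both sequences at the same positions keeps them comparable.
gold_kept, pred_kept = filter_aligned(tokens, gold, pred)
aligned_accuracy = tag_accuracy(gold_kept, pred_kept)

print(f"Misaligned accuracy: {misaligned_accuracy:.4f}")
print(f"Aligned accuracy: {aligned_accuracy:.4f}")
```

In this toy case every predicted tag is correct, yet the misaligned comparison reports a lower accuracy than the aligned one, which illustrates why the filter has to be applied to both sequences.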


## Sentiment Analysis

This code is the same as the code from subset 1, but the training data contains a total of 100 samples. A more detailed preprocessing function lowercases the text and strips special characters and stopwords, with the aim of improving the quality of the text before it is fed for training.
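
To make the effect of these preprocessing steps concrete, here is a small standalone sketch that uses only the standard library. It assumes a tiny hand-written stopword set (`STOPWORDS`, hypothetical) and whitespace tokenization instead of spaCy's tokenizer and stopword list, so its output can differ from what the notebook's own function produces.

```python
import re

# Hypothetical, deliberately small stopword set; spaCy ships a much longer one.
STOPWORDS = {"a", "an", "the", "is", "was", "so", "with", "my", "and"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip special characters, and drop stopwords."""
    text = text.lower()
    # Keep only letters, digits, and whitespace (apostrophes are removed too)
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Whitespace tokenization stands in for spaCy's tokenizer here
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


print(preprocess("I'm so happy with my new job!"))
```

Note that removing special characters also removes apostrophes, so a contraction such as "I'm" collapses into a single token "im".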

In [None]:
import re
import spacy
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load a blank model and add text classifier
nlpTC = spacy.blank("en")
textcat = nlpTC.add_pipe("textcat")

# Add labels for classification
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

train_data = [
    ("I'm so frustrated with how slow my internet is.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I'm so happy with my new job!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The customer service at that store is excellent.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The movie was a complete waste of time.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That movie was truly heartwarming and beautiful.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been feeling really down lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The food at the new restaurant was absolutely delicious.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t get the job, and I feel so defeated.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The sunset this evening was breathtaking.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m really upset that I missed the deadline.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("I finally finished the book, and it was such a rewarding read.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("This weather is terrible, I can’t wait for it to end.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The surprise party was such a success!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My laptop crashed again, and I lost all my work.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The concert was absolutely mind-blowing!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been struggling with my workload lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The flowers you sent me are absolutely stunning.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I regret spending money on that product.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just found out I won the contest! I’m over the moon.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Everything seems to be going wrong lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("You did a fantastic job on that presentation.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m tired of dealing with all this stress.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I couldn’t be happier with how everything turned out.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My car broke down again, and I’m so frustrated.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That was one of the most enjoyable dinners I’ve had in ages.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m feeling really overwhelmed with everything going on.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m incredibly grateful for the support I’ve received.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t enjoy the event; it was a total letdown.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m so proud of everything we’ve accomplished this year.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("That comment really hurt my feelings.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("That was the best coffee I’ve had in a while!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I can’t believe how rude they were to me.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I got a promotion at work, and I couldn’t be more thrilled.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My phone screen cracked, and now I have to get it replaced.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("What a beautiful and sunny day!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m really anxious about everything going on.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("Spending time with family over the holidays was perfect.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t get enough sleep last night, and now I’m exhausted.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("This new app makes my life so much easier.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been really unmotivated lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("That vacation was exactly what I needed.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("That presentation did not go well at all.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I appreciate all the effort you put into this project.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My relationship with my friends hasn’t been great lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’ve made some great new friends recently.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The traffic was horrible, and I barely made it on time.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I had such a fun time with the kids at the park today.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I didn’t enjoy the book at all; it was so boring.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just got my dream job, and I’m beyond excited!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I feel like I’ve been making one mistake after another.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),

    ("I had a terrible experience with the customer service rep.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’ve been feeling so energetic and positive lately!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I'm feeling completely hopeless today.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That was the most amazing concert I’ve ever attended!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I can’t believe I lost my wallet again.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("My birthday party was so much fun, I loved it!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The service at the restaurant was terrible, I’m never going back.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’ve been so productive today, I got everything done!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I had an argument with my best friend, and now I feel awful.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I received a surprise gift, and it made my entire day.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),

    ("I didn’t get the promotion, and now I’m feeling defeated.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That was the best vacation I’ve had in years.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been feeling anxious and restless lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m extremely proud of how far I’ve come.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I don’t know why, but I’m feeling really down today.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The presentation went really well, I’m so relieved.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m struggling to stay positive with everything going wrong.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I feel incredibly blessed to have such supportive friends.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("My flight got canceled, and now I’m stuck at the airport.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I got a big raise at work, I’m so happy!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),

    ("I’ve been feeling very isolated and alone recently.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just adopted a puppy, and I’m beyond excited!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve had the worst headache all day, it won’t go away.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m thrilled to have finished that project ahead of time.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been getting really stressed about all my deadlines.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That new movie was so entertaining, I loved every minute.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m so frustrated with how long this process is taking.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’ve never been more excited for the future.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The noise in my neighborhood is driving me crazy.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I finally achieved my fitness goals, I feel amazing.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),

    ("I’ve been having such a hard time balancing work and life.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I got to reconnect with an old friend today, it was so great.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m feeling really insecure about everything right now.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That new café is fantastic, I’ll definitely be going back.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m so upset that my favorite show got canceled.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just finished a great workout, I feel so energized!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m so disappointed in how things turned out.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That speech was truly inspiring, I’m so motivated now.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m incredibly tired of dealing with all these problems.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just finished reading an amazing book, I couldn’t put it down.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),

    ("I feel like everything is falling apart.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just won tickets to see my favorite band live!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’ve been feeling really disconnected from everyone lately.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I just got engaged, and I couldn’t be happier!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m really disappointed with the service I received today.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m so excited to start this new chapter in my life.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m feeling really anxious about the upcoming event.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("That was the most fun I’ve had in years.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I’m feeling really low and unsure about everything right now.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I’m grateful for all the blessings in my life.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}})
]

# Define a comprehensive preprocessing function
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Tokenize and remove stopwords
    tokens = [token.text for token in nlpTC(text) if not token.is_stop]  # Directly extract tokens
    # Join tokens back to a string
    return ' '.join(tokens)

In [None]:
# Extract text data from train_data
text = [data[0] for data in train_data]
labels = [data[1]['cats']['POSITIVE'] for data in train_data] # Extract labels

# Vectorize the raw text (preprocess_text, defined above, is not applied here)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)

# Train/test split (50% training, 50% test)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)

# Train a Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict sentiments
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy * 100:.4f}%")
print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")

Accuracy: 80.0000%
Precision: 73.3333%
Recall: 91.6667%
F1 Score: 81.4815%
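
As a worked check of how the four figures above relate to each other, the sketch below computes them from confusion-matrix counts. The counts are an assumption: the notebook does not print a confusion matrix, and these values were chosen only because they reproduce the printed percentages for a 50-sample test split. The helper `binary_metrics` is likewise hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Assumed counts (not reported by the notebook): 22 true positives,
# 8 false positives, 2 false negatives, 18 true negatives, 50 samples in total.
accuracy, precision, recall, f1 = binary_metrics(tp=22, fp=8, fn=2, tn=18)

print(f"Accuracy: {accuracy * 100:.4f}%")
print(f"Precision: {precision * 100:.4f}%")
print(f"Recall: {recall * 100:.4f}%")
print(f"F1 Score: {f1 * 100:.4f}%")
```

Under these assumed counts, the high recall and lower precision would mean the classifier rarely misses a positive sentence but labels a fair number of negative sentences as positive.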


## Text Summarizer

In [7]:
!python -m spacy download en_core_web_lg
import spacy
from sklearn.metrics import precision_score, recall_score, f1_score
from nltk.tokenize import sent_tokenize
import nltk
import re

# Download the punkt tokenizer if not already downloaded
nltk.download('punkt')

# Load spaCy model
nlp = spacy.load("en_core_web_lg")

# Example text and reference summary
text = """Climate change is one of the most pressing issues of our time. The increasing levels of greenhouse gases in the atmosphere
have led to rising global temperatures. As a result, glaciers are melting, sea levels are rising, and extreme weather events
are becoming more frequent. Many governments around the world have pledged to reduce carbon emissions, but progress has been slow.
Renewable energy sources such as solar and wind power offer hope, but their adoption has not been widespread enough to make a significant impact yet.
Urgent action is needed to address this global crisis before it’s too late."""

reference_summary = """Climate change is caused by greenhouse gases and is leading to rising temperatures and extreme weather.
Renewable energy offers hope, but its adoption is slow."""

# Function to preprocess text: remove special characters, stopwords, and tokenize
def preprocess_text(text):
    # Remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Keep only alphanumeric characters and spaces
    doc = nlp(text.lower())  # Convert to lowercase and process with spaCy
    # Remove stopwords and return tokens
    tokens = [token.text for token in doc if not token.is_stop]
    return tokens

# Enhanced extractive summarization function
def extractive_summary(text, reference_summary, num_sentences=3):
    doc = nlp(text)  # Process the original text
    sentences = [sent.text for sent in doc.sents]  # Extract original sentences
    preprocessed_sentences = [preprocess_text(sent) for sent in sentences]  # Preprocess each sentence

    # Build a spaCy doc from the preprocessed reference summary
    ref_tokens = preprocess_text(reference_summary)  # Preprocess reference summary
    ref_doc = nlp(' '.join(ref_tokens))  # Create a spaCy doc from the preprocessed tokens

    # Score sentences based on similarity to the reference summary
    sentence_scores = [(sent, nlp(' '.join(preprocess_text(sent))).similarity(ref_doc)) for sent in sentences]
    ranked_sentences = sorted(sentence_scores, key=lambda x: x[1], reverse=True)

    # Select top sentences
    top_sentences = [sent[0] for sent in ranked_sentences[:num_sentences]]
    return top_sentences

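# Normalize case and whitespace so sentences from spaCy and NLTK can be compared
def normalize_sentence(sentence):
    return ' '.join(sentence.lower().split())

# Generate the summary, then normalize the summary, reference, and source sentences
generated_summary = extractive_summary(text, reference_summary)
generated_sentences = {normalize_sentence(sent) for sent in generated_summary}
reference_sentences = {normalize_sentence(sent) for sent in sent_tokenize(reference_summary)}
source_sentences = [normalize_sentence(sent) for sent in sent_tokenize(text)]

# Convert to binary relevance: 1 if the source sentence appears verbatim in the
# reference summary (y_true) or in the generated summary (y_pred), 0 otherwise.
# Note: the reference summary is paraphrased, so no source sentence matches it
# exactly and y_true is all zeros.
y_true = [1 if sent in reference_sentences else 0 for sent in source_sentences]
y_pred = [1 if sent in generated_sentences else 0 for sent in source_sentences]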

# Ensure y_true and y_pred are of the same length
if len(y_true) != len(y_pred):
    min_length = min(len(y_true), len(y_pred))
    y_true = y_true[:min_length]
    y_pred = y_pred[:min_length]

# Calculate precision, recall, and F1 score
precision = precision_score(y_true, y_pred) * 100
recall = recall_score(y_true, y_pred) * 100
f1 = f1_score(y_true, y_pred) * 100

# Output results
print(f"Generated Summary: {' '.join(generated_summary)}")
print(f"Precision: {precision:.2f}%")
print(f"Recall: {recall:.2f}%")
print(f"F1 Score: {f1:.2f}%")


Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 1.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Generated Summary: Renewable energy sources such as solar and wind power offer hope, but their adoption has not been widespread enough to make a significant impact yet.
 The increasing levels of greenhouse gases in the atmosphere
have led to rising global temperatures. Many governments around the world have pledged to reduce carbon emissions, but progress has been slow.

Precision: 0.00%
Recall: 0.00%
F1 Score: 0.00%


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
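
The zero scores above follow from exact sentence matching: the reference summary is a paraphrase, so none of its sentences occurs verbatim in the source text. A different technique, shown here only as a sketch and not as the notebook's method, is to score word overlap between a generated sentence and the reference, in the spirit of a unigram-overlap (ROUGE-1 style) measure. The helper names `tokenize` and `unigram_overlap` are hypothetical, and the tokenizer is a simple regular expression rather than spaCy or NLTK.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_overlap(candidate, reference):
    """Return (precision, recall, f1) based on shared word counts."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


reference = "Renewable energy offers hope, but its adoption is slow."
candidate = ("Renewable energy sources such as solar and wind power offer hope, "
             "but their adoption has not been widespread enough to make a "
             "significant impact yet.")

precision, recall, f1 = unigram_overlap(candidate, reference)
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall: {recall * 100:.2f}%")
print(f"F1 Score: {f1 * 100:.2f}%")
```

Unlike exact matching, this kind of measure gives partial credit when a generated sentence shares vocabulary with the reference, although it still treats "offer" and "offers" as different words unless lemmatization is added.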
