Importing the `spacy` library provides tokenization, part-of-speech tagging, named entity recognition, and text classification. The `Example` class allows the example training data to train the models. By converting the text and its annotations into a format, it can be for model updates.

In [43]:
import spacy
from spacy.training import Example

The `nlpTC` is an instance of the spacy model or pipeline. The `spacy.blank("en")` model is for the English language, making it easy to understand what language is in the training data. A `blank` model doesn't have pre-built pipelines so that custom pipelines will be added. The `textcat` is the text classifier that gets added to the `nlpTC` pipeline using `nlpTC.add_pipe("textcat")`. It categorize text into predefined labels such as "SPAM" AND "HAM".

In [44]:
# Load a blank model and add text classifier
nlpTC = spacy.blank("en")
textcat = nlpTC.add_pipe("textcat")

This codes are adding a label of "SPAM' and "HAM" to the text classification model, the `textcat`. For the Spam label, it will identify if the data is from unknown person or email and for Ham label is for non-spam or regular emails from known person or email. This model will predict either "SPAM" or "HAM" for the given input text.

In [45]:
# Add labels for classification
textcat.add_label("SPAM")
textcat.add_label("HAM")

1

The `train_data` is a list of tuples that holds the training text examples with their corresponding labels. Example: "This is spam" is the text string, this is text that will train and `cats":{"SPAM":1, "HAM":0}` it specifies which category text belongs. This is how to teach the model to train the data if its SPAM or HAM.

In [46]:
#Example training data
train_data = [
    ("This is spam", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Hello, how are you?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You won a million dollars!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Claim your free prize now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Meeting at 10 AM tomorrow", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your invoice is attached", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Exclusive offer just for you!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Get a free iPhone today", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we reschedule our call?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Update your account details", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Limited time deal, buy now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Your package has been shipped", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Win a trip to Hawaii now", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Important meeting agenda", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Congratulations! You've been selected", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we discuss this project?", {"cats": {"SPAM": 0, "HAM": 1}}),
]


The `optimizer` is used to update the model's weights during training and the model's pipeline is initialized to setting it up for training. As the training data is fed into the model, an optimizer through a control means adjusts the parameters of the model in a way that reduces errors thereby improving its potential to learn correctly. The `for text, annotations in train_data` started a loop where every training example is utilized for model training. At each iteration, train_data are utilized to go through this loop. For every example, an Example object is created by taking its text and annotations. In that case, text and labels are transformed into a format which can be understood by spaCy. Thus `nlpTC.update([example], sgd=optimizer)` modifies the model with use of the Example object as a passing entity. To some extent it becomes clear that model learns how to connect particular patterns found within the text with SPAM or HAM labels. At this point, through optimizer the internal parameters of the model are readjusted so as to enhance performance.

In [47]:
# Training the model
optimizer = nlpTC.initialize()
for text, annotations in train_data:
  example = Example.from_dict(nlp.make_doc(text), annotations)
  nlpTC.update([example], sgd=optimizer)


This prints the message prior to demonstrating how well the model performs on a new example. The doc is a processed document object that holds the model’s prediction for the input text. A Doc object is returned when applying to any string by using nlpTC (for example, `doc = nlpTC("Claim your prize now!")`), which contains a processed text. Important things like tokenization, category predictions (SPAM or HAM), and other language features extracted by the model are included in this object.

In [48]:
#Testing the model
print("Sample Prediction Output with probabilities: ")
doc = nlpTC("Claim your prize now!")

Sample Prediction Output with probabilities: 


This model, `classify_email(email)`, processes the text and attempts to categorize it. The category scores for SPAM and HAM are extracted from it by processing the input email through the model. For `spam_score =doc.cats[‘SPAM’]`, it gets SPAM score of the model which represents how likely that email is going to be categorized as SPAM. The HAM score shows the probability that an email will not be classified as spam. The code `if spam_score > ham_score:` indicates if the SPAM score is higher than the HAM score, the function will return "SPAM", classifying the email as spam. While the else is identifying if the HAM score is higher, the email is classified as "HAM" (not spam).

In [49]:
# Function to classify user input emails
def classify_email(email):
  doc = nlpTC(email)
  spam_score =doc.cats['SPAM']
  ham_score =doc.cats['HAM']

  if spam_score > ham_score:
    return "SPAM"
  else:
      return "HAM"

From the line `while true`, an infinite loop begins and users can keep testing the model by entering email texts until they type “exit”. Under `user_input`, users are prompted to input a sample email text to classify. When the user types "exit", the loop stops and the program ends. The `classify_email` function should be called on user’s input in order to classify the email at hand. This result (SPAM or HAM) will be printed by this function on console interface.


In [50]:
# Allow users to test the model by inputting their own data
while True:
  user_input = input("Now,enter a sample email for classification (or type 'exit' to quit):")
  if user_input.lower() == 'exit':
    break
  classification = classify_email(user_input)
  print(f"The email is classified as: {classification}")

Now,enter a sample email for classification (or type 'exit' to quit):We are excited to inform you that you have been selected to receive an exclusive offer! Click the link below to claim your free gift. This is a limited-time offer, so don’t miss out!
The email is classified as: SPAM
Now,enter a sample email for classification (or type 'exit' to quit):EXIT


This code uses spaCy to perform Named Entity Recognition (NER), extracting and identifying important elements such as names, organizations, and dates from user-provided text.

In [27]:
import spacy

In [28]:
#Load pre-trained spaCy model
nlp = spacy.load("en_core_web_sm")

The first step is to define the function `‘analyze_text()’`, which will take in an input string and return a list of named entities together with their corresponding labels. The input text is processed into a doc object using spaCy’s pre-trained model (presumably saved in the nlp object). It contains processed text that consists of tokens and named entities. The code  then extracts these named entities from doc.ents attribute and stores them as tuples in entities_list. Each tuple comprises the entity’s text (e.g., for an individual it can be someone’s name or for a corporation it might be a company), and its label (e.g., PERSON, ORG, DATE).

After collecting the named entities, the function returns the entities_list for further use. The script then prompts the user to input a text string using the `input()` function. The user's input is passed to the `analyze_text()` function, which processes the text and extracts the named entities and their labels. Finally, the extracted entities are printed, preceded by a heading ("Named Entities and labels:") for clarity. The result is a list of named entities found in the user's input, along with the corresponding type of each entity, such as a person's name, an organization, or a date. This process effectively showcases how spaCy can be used for entity recognition in natural language processing applications.








In [29]:
#Function to analyze user input text and return entities as list
def analyze_text(text):
  doc = nlp(text)
  entities_list = [(ent.text, ent.label_)for ent in doc.ents]
  return entities_list

# Allow user input and analyze
user_input = input ("Enter a text for named entity analysis:")
entities = analyze_text(user_input)

#Display the result as a list

print("\nNamed Entities and labels:")
print(entities)

Enter a text for named entity analysis:Barack Obama was born in Honolulu, Hawaii, on August 4, 1961. He served as the 44th President of the United States from 2009 to 2017. During his presidency, he lived in Washington, D.C., and frequently visited Chicago, Illinois.

Named Entities and labels:
[('Barack Obama', 'PERSON'), ('Honolulu', 'GPE'), ('Hawaii', 'GPE'), ('August 4, 1961', 'DATE'), ('44th', 'ORDINAL'), ('the United States', 'GPE'), ('2009', 'DATE'), ('2017', 'DATE'), ('Washington', 'GPE'), ('D.C.', 'GPE'), ('Chicago', 'GPE'), ('Illinois', 'GPE')]


This code performs part-of-speech tagging using spaCy. It begins by loading the pre-trained small English model `en_core_web_sm`. This model is capable of analyzing text, and in this case, it is used to tokenize the user input and assign a part-of-speech (POS) tag to each token. The analyze_text function processes the input text using the nlp model and iterates through each token in the doc object (which holds the processed text). For each token, the text and its corresponding POS tag are stored in a list called pos_list. This list is then returned and displayed to the user, showing the individual tokens alongside their grammatical roles, such as nouns, verbs, adjectives, etc.

In [None]:
import spacy

In [30]:
#Load pre-trained spaCy model
nlp = spacy.load("en_core_web_sm")

In [31]:
#Function to analyze user input text and return tokens with POS tags as a list
def analyze_text(text):
  doc = nlp(text)
  pos_list = [(token.text, token.pos_)for token in doc]
  return pos_list

# Allow user input and analyze
user_input = input ("Enter a text for named entity analysis:")
pos_tags = analyze_text(user_input)

#Display the result as a list

print("\nTokens and POS Tags:")
print(pos_tags)

Enter a text for named entity analysis:Barack Obama was born in Honolulu, Hawaii, on August 4, 1961. He served as the 44th President of the United States from 2009 to 2017. During his presidency, he lived in Washington, D.C., and frequently visited Chicago, Illinois.

Tokens and POS Tags:
[('Barack', 'PROPN'), ('Obama', 'PROPN'), ('was', 'AUX'), ('born', 'VERB'), ('in', 'ADP'), ('Honolulu', 'PROPN'), (',', 'PUNCT'), ('Hawaii', 'PROPN'), (',', 'PUNCT'), ('on', 'ADP'), ('August', 'PROPN'), ('4', 'NUM'), (',', 'PUNCT'), ('1961', 'NUM'), ('.', 'PUNCT'), ('He', 'PRON'), ('served', 'VERB'), ('as', 'ADP'), ('the', 'DET'), ('44th', 'ADJ'), ('President', 'PROPN'), ('of', 'ADP'), ('the', 'DET'), ('United', 'PROPN'), ('States', 'PROPN'), ('from', 'ADP'), ('2009', 'NUM'), ('to', 'ADP'), ('2017', 'NUM'), ('.', 'PUNCT'), ('During', 'ADP'), ('his', 'PRON'), ('presidency', 'NOUN'), (',', 'PUNCT'), ('he', 'PRON'), ('lived', 'VERB'), ('in', 'ADP'), ('Washington', 'PROPN'), (',', 'PUNCT'), ('D.C.', 'PROP

This code focuses on text classification using a blank spaCy model for sentiment analysis. A new model nlpTC is created from scratch, and a text classification component (textcat) is added to the pipeline. The classifier is trained to predict two categories: "POSITIVE" and "NEGATIVE", which are added as labels. Training data is defined with a set of example texts labeled as "SPAM" or "HAM", although these could be adapted for sentiment analysis. The train_model function is responsible for training the model over multiple iterations , where it shuffles the training data and updates the model’s parameters using the optimizer. In each epoch, the model is updated, and the training loss is tracked. After training, the predict_sentiment function is used to predict the sentiment of user-provided text. The function returns the sentiment categories with their respective confidence scores. When the user inputs a sentence, the model predicts whether it is positive or negative.

In [32]:
import spacy
from spacy.training import Example
import random

# Load a blank model and add text classifier
nlpTC = spacy.blank("en")
textcat = nlpTC.add_pipe("textcat")

# Add labels for classification
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")


#Example training data
train_data = [
    ("This is spam", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Hello, how are you?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("You won a million dollars!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Claim your free prize now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Meeting at 10 AM tomorrow", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Your invoice is attached", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Exclusive offer just for you!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Get a free iPhone today", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we reschedule our call?", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Update your account details", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Limited time deal, buy now!", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Your package has been shipped", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Win a trip to Hawaii now", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Important meeting agenda", {"cats": {"SPAM": 0, "HAM": 1}}),
    ("Congratulations! You've been selected", {"cats": {"SPAM": 1, "HAM": 0}}),
    ("Can we discuss this project?", {"cats": {"SPAM": 0, "HAM": 1}}),
]

In [39]:
# Training the model
def train_model(data, n_iter=10):
  random.shuffle(data)
  optimizer = nlp.begin_training()
  for epoch in range(n_iter):
    losses = {}
    for text, annotation in data:
      doc = nlp.make_doc(text)
      example = Example.from_dict(doc, annotations)
      nlp.update([example], drop=0.5, losses=losses)
      print(f"Epoch {epoch+1}/{n_iter} - losses: {losses}")

      train_model(train_data)

  # Function to predict sentiment
      def predict_sentiment(text):
        doc = nlp(text)
        return doc.cats

# user input and sentiment analysis
user_input = input("Enter a text for sentiment analysis: ")
prediction = predict_sentiment(user_input)
print("\nSentiment Prediction: ")
print(prediction)

Enter a text for sentiment analysis: I had an amazing experience at the restaurant last night! The food was delicious, and the service was exceptional

Sentiment Prediction: 
{'POSITIVE': 0.5, 'NEGATIVE': 0.5}


This code performs text summarization. Once again, the pre-trained en_core_web_sm model is loaded. The summarize function processes the input text and uses token frequency to score sentences. For each sentence in the text, it counts how many important tokens (excluding stop words and punctuation) appear in the sentence. The sentences with the highest scores are selected as the most important and returned as the summary. The user inputs the text they want to summarize, and the function returns the top N sentences based on token frequency, giving a shorter version of the text.

In [40]:
import spacy
from collections import Counter

#Load pre-trained spaCy model
nlp = spacy.load("en_core_web_sm")


In [42]:
# Summarize text
def summarize(text, n_sentence = 2):
  doc = nlp(text)
  sentence_scores = Counter()

  # Score sentence based on token frequency
  for sent in doc.sents:
    for token in sent:
      if not token.is_stop and not token.is_punct:
        sentence_scores[sent] += 1

  # Select top N sentences
  top_sentences = [sent.text for sent, score in sentence_scores.most_common(n_sentence)]
  return " ".join(top_sentences)

# User input for summarization
user_text = input("Enter the text you want to summarize: ")
summary = summarize(user_text)
print("\nSummary: ")
print(summary)


Enter the text you want to summarize: Barack Obama was born in Honolulu, Hawaii, on August 4, 1961. He served as the 44th President of the United States from 2009 to 2017. During his presidency, he lived in Washington, D.C., and frequently visited Chicago, Illinois.

Summary: 
Barack Obama was born in Honolulu, Hawaii, on August 4, 1961. During his presidency, he lived in Washington, D.C., and frequently visited Chicago, Illinois.
