# 🙋 Answers

These won't run within this notebook. Copy / adapt them to the appropriate location within the other materials.

## 🧱 Foundations

1. Try changing the example text or adding another sentence and re-running the above code. For instance, what happens if you include a contraction or a numerical date? Examine how spaCy tokenizes and lemmatizes it. You can also try printing other token attributes like `token.dep_` (dependency relation) to see syntactic dependencies.

In [None]:
# Answer

text = "Apple is looking at buying U.K. startup for $1 billion. It doesn't know whether the deal will go through by 10th November."
doc = nlp(text) # spaCy tokenises and processes the text
print("Tokens:", [token.text for token in doc])

for token in doc:
    print(token.text, "→ lemma:", token.lemma_, "| POS:", token.pos_, "| Dep:", token.dep_, "| StopWord?", token.is_stop)

2. The spaCy pipeline we used includes POS tagging, lemmatization, etc. Suppose you wanted to add a custom preprocessing step, say, replacing all numbers with a special token like \<NUM>. This is useful for tasks like text classification, where the actual number may not be important but the fact that there is a number is. How might you do it?

*Hint: You could post-process the token list or use regex on the original text before sending to spaCy.*

In [None]:
# Answer
# Replace numbers with a placeholder
# This is useful for tasks like text classification where the actual number may not be important
# but the fact that there is a number is.

tokens_with_num = ["<NUM>" if token.like_num else token.text for token in doc]
print("Tokens with numbers replaced:", tokens_with_num)
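
The hint's other route, a regex applied to the raw text before it reaches spaCy, could look roughly like the sketch below. The pattern and the `replace_numbers` helper are illustrative assumptions, not part of the course materials. Unlike `token.like_num`, this pattern only catches digit-based numbers, so spelled-out numbers such as "ten" are left untouched.

```python
import re

# Assumed pattern: runs of digits, optionally with internal separators (1,000 or 3.14)
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")

def replace_numbers(text, placeholder="<NUM>"):
    """Replace every digit-based number in `text` with `placeholder`."""
    return NUMBER_PATTERN.sub(placeholder, text)

raw = "Apple is looking at buying U.K. startup for $1 billion by 10th November."
cleaned = replace_numbers(raw)
print(cleaned)
```

One thing to watch: spaCy's tokenizer may split a placeholder containing punctuation, such as `<NUM>`, into several tokens, so a plain alphabetic placeholder can be easier to work with if the text is tokenized afterwards.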

In [None]:
# Answer: filter to only include PERSON entities

# Create a network of the entities based on their co-occurrence in the same sentence
entity_sets = []
for sent in bookdoc.sents:
    entities = {ent for ent in sent.ents if ent.label_ == 'PERSON'}  # KEY CHANGE HERE
    if len(entities) > 1:
        entity_sets.append(entities)

# Each item in entity_sets is the group of entities that co-occur in the same sentence
entity_sets
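
To turn sets like these into network edges, one option is to count how often each unordered pair of entities shares a sentence. The sketch below uses plain strings with made-up names as a stand-in for the spaCy entities (for real data you would likely use `ent.text`), and the `cooccurrence_counts` helper is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stand-in for entity_sets, using entity text rather than spaCy Span objects
example_entity_sets = [
    {"Alice", "Bob"},
    {"Alice", "Carol", "Dave"},
    {"Bob", "Alice"},
]

def cooccurrence_counts(entity_sets):
    """Count how many sets each unordered pair of entities appears in together."""
    counts = Counter()
    for entities in entity_sets:
        # Sorting gives each pair a canonical order, so (A, B) and (B, A) are one edge
        for a, b in combinations(sorted(entities), 2):
            counts[(a, b)] += 1
    return counts

edge_weights = cooccurrence_counts(example_entity_sets)
for (a, b), weight in edge_weights.most_common():
    print(f"{a} -- {b}: {weight}")
```

The resulting pair counts can serve as edge weights when building a graph.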

## 🧠 Text Embeddings

1. Use the cosine similarity function to find the most similar documents to the first document in the df based on TF-IDF

In [None]:
tfidf_sim = cosine_similarity(tfidf_df.iloc[0].values.reshape(1, -1), tfidf_df.iloc[1:].values)
display(tfidf_sim)

In [None]:
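# Get positions of the top 5 most similar documents (excluding the first document itself).
# tfidf_sim was computed against rows 1 onward, so add 1 to map each position back to its row in covid_df.
tfidf_top5_indices = tfidf_sim[0].argsort()[-5:][::-1] + 1
covid_df.iloc[tfidf_top5_indices]['webTitle']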
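
When the first document is excluded from the comparison set, positions in the similarity array are shifted by one relative to the original rows. The pure-Python sketch below, with toy vectors and hypothetical helper names, keeps the original row index next to each score so the shift cannot cause a mix-up.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_similar(vectors, k):
    """Return (row_index, similarity) pairs for the k rows most similar to row 0."""
    query = vectors[0]
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors) if i != 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for TF-IDF rows
toy_vectors = [
    [1.0, 0.0, 1.0],  # row 0: the query document
    [0.0, 1.0, 0.0],  # row 1: no terms in common with the query
    [1.0, 0.0, 1.0],  # row 2: same direction as the query
    [1.0, 1.0, 0.0],  # row 3: partial overlap
]
top_two = top_k_similar(toy_vectors, k=2)
print(top_two)
```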

2. Use the cosine similarity function to find the most similar documents to the first document in the df based on embeddings

In [None]:
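# Get embeddings (passing a plain list of strings avoids index issues with a pandas Series)
covid_embeddings = model.encode(covid_df['bodyContent'].tolist(), convert_to_tensor=True)

# Use a separate name so the `df` used in the topic modelling answers is not overwritten
embedding_df = pd.DataFrame(covid_embeddings.cpu().numpy(), index=covid_df['webTitle'])

embedding_sim = cosine_similarity(embedding_df.iloc[0].values.reshape(1, -1), embedding_df.iloc[1:].values)
# Similarities were computed against rows 1 onward, so add 1 to map positions back to rows in covid_df
embedding_top5_indices = embedding_sim[0].argsort()[-5:][::-1] + 1
covid_df.iloc[embedding_top5_indices]['webTitle']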


3. Compare the results of the two methods (plot as a scatter graph). Do they yield similar results? Why or why not?

In [None]:
import matplotlib.pyplot as plt

plt.scatter(embedding_sim, tfidf_sim, alpha=0.1)
plt.xlim(0,1)
plt.ylim(0,1)
plt.xlabel("Embedding Similarity to First Document")
plt.ylabel("TF-IDF Similarity to First Document")
plt.title("Comparison of Embedding and TF-IDF Similarity")
plt.show()
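
Alongside the scatter plot, a rank correlation gives a single number for how far the two methods agree on the ordering of documents. The sketch below implements Spearman's rank correlation in pure Python. The scores are toy values, not results from the dataset, and the helper names are hypothetical.

```python
from math import sqrt

def ranks(values):
    """Rank each value (1 = smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy similarity scores for four documents (illustrative only)
embedding_scores = [0.82, 0.35, 0.60, 0.10]
tfidf_scores = [0.40, 0.05, 0.22, 0.08]
rho = spearman(embedding_scores, tfidf_scores)
print(round(rho, 3))
```

With the notebook's arrays, this could be called as `spearman(embedding_sim[0].tolist(), tfidf_sim[0].tolist())`. `scipy.stats.spearmanr` offers the same measure if SciPy is available.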

## 🗂️ Topic Modelling

1. Remove the `vectorizer_model=vectorizer_model` argument from the `BERTopic` constructor, and re-run the model. What happens to the topics?


In [None]:
# Filter the DataFrame for specific sections and drop rows with NaN in 'bodyContent'
sample_df = df[df['sectionName'].isin(['Opinion', 'Football'])
               ].dropna(subset='bodyContent').sample(n=1000, random_state=42) # Sample 1000 rows for speed today
docs = sample_df['bodyContent'].tolist()

topic_model = BERTopic() #  create BERTopic model
topics, probs = topic_model.fit_transform(docs) # fit the model to the documents
topic_info = topic_model.get_topic_info() # get topic information
topic_info # display the topics found

# We get basically the same topics, but the keywords are undesirably full of stopwords
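
To see why stopwords crowd out informative keywords when nothing filters them, here is a deliberately simplified illustration. It uses raw word counts rather than BERTopic's c-TF-IDF weighting, and the documents and stop-word list are made up.

```python
from collections import Counter

# Toy documents and a tiny, hypothetical stop-word list
toy_docs = [
    "the team won the match in the final minute",
    "the manager said the team was ready",
]
toy_stop_words = {"the", "in", "was", "a", "of", "and"}

def top_words(docs, stop_words=frozenset(), n=3):
    """Return the n most frequent words across docs, skipping any stop words."""
    counts = Counter(
        word
        for doc in docs
        for word in doc.lower().split()
        if word not in stop_words
    )
    return counts.most_common(n)

unfiltered = top_words(toy_docs)
filtered = top_words(toy_docs, stop_words=toy_stop_words)
print("Without stop-word filtering:", unfiltered)
print("With stop-word filtering:   ", filtered)
```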

2. Try selecting some articles from different sections of the Guardian, such as "Politics", "World news", "US news". What do we find for topics here?

In [None]:
# Filter the DataFrame for specific sections and drop rows with NaN in 'bodyContent'
sample_df = df[df['sectionName'].isin(["Politics", "World news", "US news"])
               ].dropna(subset='bodyContent').sample(n=1000, random_state=42) # Sample 1000 rows for speed today
docs = sample_df['bodyContent'].tolist()

vectorizer_model = CountVectorizer(stop_words="english") # only used for c-TF-IDF stage
topic_model = BERTopic(vectorizer_model=vectorizer_model) #  create BERTopic model
topics, probs = topic_model.fit_transform(docs) # fit the model to the documents
topic_info = topic_model.get_topic_info() # get topic information
topic_info # display the topics found

# The topics do not separate cleanly by section
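
One way to check how well topics line up with sections is to cross-tabulate the topic assigned to each document against its section. The sketch below uses made-up assignments and a hypothetical `topic_section_table` helper. With the real data, `pd.crosstab(sample_df['sectionName'], topics)` would produce a comparable table.

```python
from collections import Counter, defaultdict

def topic_section_table(topics, sections):
    """Count, for each topic id, how many documents come from each section."""
    table = defaultdict(Counter)
    for topic, section in zip(topics, sections):
        table[topic][section] += 1
    return table

# Hypothetical assignments; BERTopic uses -1 for documents it treats as outliers
example_topics = [0, 0, 1, 1, 1, -1]
example_sections = ["Politics", "World news", "US news", "Politics", "US news", "World news"]

table = topic_section_table(example_topics, example_sections)
for topic in sorted(table):
    print(topic, dict(table[topic]))
```

A topic whose documents are spread across several sections, like topic 0 in this toy data, is one that does not follow section boundaries.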

## 🤖 Applying State-of-the-Art NLP

1. We used the simple, fast, cheap `gpt-4o-mini` model. Try using the more powerful `gpt-4o` model instead on our labeled data. How does it perform? Can you think of a task (perhaps from your own work) where the more powerful model 

In [None]:
import json  # used below to parse the model's JSON output

def zeroshot_classify(article_text):
    response = client.responses.create(
        model="gpt-4o", # UPDATED MODEL
        input=[
            {
                # This part defines the system prompt, which sets the context for the model
                "role": "system",
                "content": [
                    {
                        "type": "input_text",
                        "text": system_prompt
                    }
                ]
            },
            {
                # This part defines the user input, which is the article text to be classified
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": article_text
                    }
                ]
            },
        ],

        # This next part defines the output format, a JSON object with a single key "classification"
        text={"format": {
            "name": "news_article_classification",
            "type": "json_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "classification": {
                        "type": "number",
                        "description": "The classification output of the model: 1 for supportive of UK's approach to COVID, 0 otherwise."
                    }
                },
                "required": ["classification"],
                "additionalProperties": False
            },
            "strict": True
        }},
    )
    return json.loads(response.output_text)['classification']

# Limited opportunity for improvement in this classification task
# It takes a bit longer and is more expensive, so it is probably not worth it here
# However, it may be suited to many other, more complex tasks
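
To compare models on the labeled data, the predictions need to be scored against the true labels. The sketch below parses JSON strings shaped like the schema above and computes accuracy. The raw outputs and labels are invented for illustration, and both helper functions are hypothetical.

```python
import json

def parse_classification(output_text):
    """Extract the 0/1 label from a JSON string shaped like the schema above."""
    return int(json.loads(output_text)["classification"])

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Invented model outputs and labels, for illustration only
raw_outputs = [
    '{"classification": 1}',
    '{"classification": 0}',
    '{"classification": 1}',
    '{"classification": 0}',
]
true_labels = [1, 0, 0, 0]

predictions = [parse_classification(raw) for raw in raw_outputs]
print("Accuracy:", accuracy(predictions, true_labels))
```

Running the same scoring for both models on the same labeled articles gives a like-for-like comparison to weigh against the extra time and cost.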