# Project 3 - Article Parsing Toolset

## Team 3 Members

- Matthew Dunbar
- Jeffrei Cher
- Basil James

## Project Problem

Parse article content and extract structured data from it:

- _**Article Classification into a set of pre-defined categories**_
  (a prediction model)

For Sports Articles:
- _**to identify the sport, and extract game/match scores if present**_
- _(to provide stats summaries if present?)_

## Learning Goal

Develop experience:

- Building and deploying models to leverage in real-world applications
- Leveraging custom training of LLMs to provide article analysis.
- Using a tool-based (extensible toolbox) approach to provide multiple analytics features


## Dataset

- https://www.kaggle.com/datasets/fabiochiusano/medium-articles  
(190k+ articles with multiple tags per article; includes titles, full article text, and URLs)

### Retrieve dataset

In [4]:
! kaggle datasets download -d fabiochiusano/medium-articles -p ./data --unzip

Dataset URL: https://www.kaggle.com/datasets/fabiochiusano/medium-articles
License(s): CC0-1.0
Downloading medium-articles.zip to ./data
100%|█████████████████████████████████████████| 369M/369M [00:03<00:00, 116MB/s]


### Local file

In [6]:
!ls -l ./data/

total 1017916
-rw-r--r-- 1 jupyter jupyter 1042340506 Apr  9 16:11 medium_articles.csv


### Import basic DataFrame dependencies

In [3]:
import os

import pandas as pd
from google.cloud import bigquery

import subprocess
import warnings

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
warnings.filterwarnings("ignore")

import json
from pprint import pprint
import markdown
# import tensorflow as tf

# print(tf.version.VERSION)

### Load dataframe

In [9]:
# Load dataset (change filename accordingly)
df = pd.read_csv('./data/medium_articles.csv')

# Display first few rows
df.head()

Unnamed: 0,title,text,url,authors,timestamp,tags
0,Mental Note Vol. 24,Photo by Josh Riemer on Unsplash\n\nMerry Chri...,https://medium.com/invisible-illness/mental-no...,['Ryan Fan'],2020-12-26 03:38:10.479000+00:00,"['Mental Health', 'Health', 'Psychology', 'Sci..."
1,Your Brain On Coronavirus,Your Brain On Coronavirus\n\nA guide to the cu...,https://medium.com/age-of-awareness/how-the-pa...,['Simon Spichak'],2020-09-23 22:10:17.126000+00:00,"['Mental Health', 'Coronavirus', 'Science', 'P..."
2,Mind Your Nose,Mind Your Nose\n\nHow smell training can chang...,https://medium.com/neodotlife/mind-your-nose-f...,[],2020-10-10 20:17:37.132000+00:00,"['Biotechnology', 'Neuroscience', 'Brain', 'We..."
3,The 4 Purposes of Dreams,Passionate about the synergy between science a...,https://medium.com/science-for-real/the-4-purp...,['Eshan Samaranayake'],2020-12-21 16:05:19.524000+00:00,"['Health', 'Neuroscience', 'Mental Health', 'P..."
4,Surviving a Rod Through the Head,"You’ve heard of him, haven’t you? Phineas Gage...",https://medium.com/live-your-life-on-purpose/s...,['Rishav Sinha'],2020-02-26 00:01:01.576000+00:00,"['Brain', 'Health', 'Development', 'Psychology..."
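
The preview suggests that `tags` (and `authors`) are stored as stringified Python lists rather than real lists. The following is a minimal sketch, assuming each value is a valid list literal, of turning those strings into lists and counting tag frequencies. The helper `parse_tags` and the sample values are hypothetical and not part of the notebook.

```python
import ast
from collections import Counter


def parse_tags(raw) -> list[str]:
    """Convert a stringified list such as "['Health', 'Science']" into a list.

    Returns an empty list when the value cannot be parsed (an assumption
    about how malformed or missing rows should be handled).
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []
    if not isinstance(value, list):
        return []
    return [str(tag) for tag in value]


# Hypothetical sample values shaped like the `tags` column in the preview.
sample_tags = [
    "['Mental Health', 'Health', 'Psychology']",
    "['Health', 'Neuroscience']",
    "not a list",
]

parsed = [parse_tags(raw) for raw in sample_tags]
tag_counts = Counter(tag for tags in parsed for tag in tags)
print(tag_counts.most_common(1))
```

On the real DataFrame the same helper could be applied with `df["tags"].apply(parse_tags)`.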


### Remove unwanted columns from the DataFrame

In [10]:
# Declare the CSV columns and select the ones to keep
CSV_COLUMNS = [
    "title",
    "text",
    "url",
    "authors",
    "timestamp",
    "tags",
]
LABEL_COLUMN = "text"
DEFAULTS = [["na"], ["na"], ["na"], ["na"], ["na"], ["na"]]
UNWANTED_COLS = ["title", "url", "authors", "timestamp"]

DESIRED_COLUMNS = [col for col in CSV_COLUMNS if col not in UNWANTED_COLS]
df = df[DESIRED_COLUMNS]

# Show full column width and prevent truncation
pd.set_option("display.max_colwidth", None)  # Show full text
pd.set_option("display.expand_frame_repr", False)  # Prevent wrapping

df.head()

Unnamed: 0,text,tags
0,"Photo by Josh Riemer on Unsplash\n\nMerry Christmas and Happy Holidays, everyone!\n\nWe just wanted everyone to know how much we appreciate everyone and how thankful we are for all our readers and writers here. We wouldn’t be anywhere without you, so thank you all for bringing informative, vulnerable, and important pieces that destigmatize mental illness and mental health.\n\nWithout further ado, here are ten of our top stories from last week, all of which were curated:\n\n“Just as the capacity to love and inspire is universal so is the capacity to hate and discourage. Irrespective of gender, race, age or religion none of us are exempt from aggressive proclivities. Those who are narcissistically disordered, and accordingly repress deep seated feelings of inferiority with inflated delusions of grandeur and superiority, are more prone to aggression and violence. They infiltrate our interactions in myriad environments from home, work, school and the cyber world. Hence, bullying does not happen in isolation. Although there is a ringleader she looks to her minions to either sanction her cruelty or look the other way.”\n\n“Even though the circumstances that brought me here were sad and challenging, I’m grateful for how this program has changed my life for the better. I can’t help but imagine what life would be like if everyone learned to accept their powerlessness over other people, prioritize their serenity, and take life one step at a time. We’ll never know, but I’d bet the world would be much happier.”\n\n“The prospect of spending a horrible Christmas, locked in on a psychiatric unit, was one of the low points of my life. For weeks, the day room was festooned with cheesy decorations and a sorry pink aluminum tree. All of our “activity” therapies revolved around the holidays. We baked and decorated cookies. We fashioned quick-drying clay into ornaments that turned out to be too heavy for the tree. Crappy Christmas carols were background torture. 
It was hard to get pissed off at the staff because they were making the best with what they had.”\n\n“Although I hate to admit it, even if my ex had never betrayed me, I still wouldn’t have been happy. I had set him up for an impossible job — to define me and make me whole. If I cannot find peace and contentment within myself, how could anyone else do it for me?”\n\n“On a personal note, significant feelings of loss and sadness can still flare up from time to time. That’s only natural; it’s no reason for self-critique. No matter how resilient we purport to be, we are all emotionally vulnerable human beings. Besides, we aren’t talking about some conceptual loss that we can just mechanically compartmentalize away — we are talking about the loss of our fathers, mothers, sisters and brothers.”\n\n“The next six weeks will be hard as cases continue to explode and government leadership remains nonexistent. I can’t control any of this. The only thing I can do is take deep breaths, remain vigilant when it comes to limiting exposure to the virus, and let lots of stuff go. I may always be a hypochondriac, but now that I recognize the beast, I’m hopeful I’ll be able to tame it.”\n\n“From anecdotal news reports and informal surveys, there is evidence that for some of us, this pandemic-imposed isolation is a boon rather than a trial. One study on mixed emotions showed that those with lower emotional stability (“moody” personalities) are actually better at responding to uncertainty.”\n\n“Every day I wish in my heart and soul that I didn’t have ME/CFS. Unfortunately, I do. It’s a result of a virus I had; 10–12 percent of people who experience a serious infection go on to develop ME. I’ve visualized life without CFS for over a year now; I can smell life without it, I can taste it. It’s in the smell of the lavender fields that I can no longer run through. It’s in the taste of the meals from my favorite restaurant that I can no longer walk to. It’s on the tip of my tongue. 
It’s in the potentialities; all the things I could be doing, as a twenty-four year-old, that I can’t. I cannot cross the chasm between the potential and the reality. And that’s nothing to do with manifestation.”\n\n“Whether it’s cabin fever, redundancy, loss, or general Covid anxieties, this year has caused us to be exposed to more uncertainty than ever. Uncertainty creates unease and feelings of stress. Some of us may have taken this year as one to motivate — plan dream trips, and prepare and be inspired for what the future could bring. For the rest, it has caused us to become irrational, emotional, and reserved.\n\n“To be more self-compassionate is a task that can be tricky because we always want to push ourselves and do better. Without realising it, this can lead to us being self-critical which can have damaging consequences.\n\nIt’s important to notice these times when we are harsh because we can easily turn it into self-compassion, which is linked to a better quality of life.”\n\nMerry Christmas and Happy Holidays, everyone!\n\n— Ryan, Juliette, Marie, and Meredith","['Mental Health', 'Health', 'Psychology', 'Science', 'Neuroscience']"
1,"Your Brain On Coronavirus\n\nA guide to the curious and troubling impact of the pandemic and isolation\n\nPhoto by cottonbro from Pexels\n\nThe coronavirus pandemic frustrates and confounds epidemiologists and immunologists, even after months of study. It frustrates politicians and public health officials dealing with mask non-compliance. It frustrates everyone stuck at home, whether they lost their job or adapting to Zoom.\n\nAfter exposure to the virus, it first enters the lungs, using host machinery to replicate. The virus itself is just a genetic sequence enclosed in a protein and lipid coat. It binds the ACE2 receptors on lung cells, with a spike protein located on its protein-lipid coat. This receptor, attached to the virus, trafficks into the lung cell. Here the virus hijacks the machinery of the cell to replicate, damaging lung tissue and spreading throughout the body.\n\nThe ACE2 receptor, expressed in many regions of the body, is vulnerable to further entry of these viral particles. The ACE2 receptor regulates blood pressure, nutrient absorption and inflammation. These pathways converge and mediate brain health and disease.\n\nThe novel coronavirus perplexed us for many different reasons. A large majority of people who get it don’t display any symptoms, while some display symptoms for many months and others require ventilators to breathe. It is unclear whether someone infected with coronavirus retains long-term immunity.\n\nAlso, troubling findings implicate this disease in the induction of stroke and the worsening of mental health. The realization that there are likely long-term complications of coronavirus infection is worrying, as millions of people may require expensive coverage for this new pre-existing condition.\n\nThose of us lucky to avoid being infected become more socially isolated and lonely. Many studies report the worsening of mental health symptoms, especially in frontline workers, nurses and doctors. 
These professionals are more prone to burning out and require extra care.\n\nCOVID-19 and Stroke\n\nThe cells in the brain require a disproportionate amount of energy to function. When deprived of oxygen, even for minutes, the cells begin to die, leading to a variety of debilitating sensory, motor or language deficits depending. When there is blood loss to a specific region of the brain, cells cannot use oxygen to generate energy. If there is a clot in an artery, fresh oxygen cannot travel to any regions primarily supplied by that blood vessel. These events, classified as ischemic strokes, cause lifelong disability in some of those afflicted.\n\nEarly findings in patients found abnormal clotting in blood vessels. Vessels around the lungs or even arterial blood-flow to the brain is interrupted. Thus, individuals infected with coronavirus who suffered abnormal blood clotting as a result, were at higher risk of stroke.\n\nIn June of 2020, researchers published a report of neurological symptoms in the New England Journal of Medicine. While they did not report common symptoms of having a stroke, they showed other strange brain-related features. Of thirteen COVID-19 patients who underwent brain imaging, three of them showed signs of an ischemic stroke. A subset of eight of these patients showed other types of inflammation, while eleven presented with a lack of blood flow to the frontal areas of the brain.\n\nThough a preliminary observational study, it suggested that the coronavirus impacted blood clotting and flow to the brain. Several studies since identified swathes of patients suffering from ischemic strokes or brain/vascular inflammation. Another study reviewed the current state of evidence, concluding that 41% of patients suffering from neurological symptoms after COVID-19 infection, suffered from strokes. 
Larger studies however, are needed to decipher how common this is among all those infected with the novel coronavirus.\n\nDepending on which region of the brain loses oxygen, stroke may manifest as a broad range of symptoms. If cells die in an area of the brain responsible for motor movement, it later manifests in unilateral or bilateral difficulties with movement. Other common symptoms involve fatigue, challenges with balance or walking, partial paralysis, pain or inattention to one entire side of the body. It prevents individuals from doing the things they do in their daily lives, such as dress themselves or go to the bathroom independently.\n\nCOVID-19 and Psychiatric Disorders\n\nPhoto by Jonathan Rados on Unsplash\n\nEither through neuroimmune signalling or by directly entering the cells of the brain, COVID-19 also contributes to psychiatric symptoms and disorders. It is unclear what role it may play in their pathology, but it may worsen existing conditions or as a contributing factor in its development.\n\nOne study compared individuals afflicted with the novel coronavirus to those in quarantine or the general public, finding elevated rates of depression (29.2%) in those with COVID-19. Another small study reported increased post-traumatic stress symptoms in these patients.\n\nIndividuals already living with psychiatric disorders reported a worsening of symptoms in two different studies. Several other studies reported depressive and anxious symptoms worsened among essential workers.\n\nAnother study surveyed >2000 individuals in Denmark, finding a reduction in overall psychological well-being measures during the pandemics. This study also reported that women were more negatively affected than men.\n\nAdditionally, it recognized that many older adults living in adult-care communities during shelter-in-place orders experience loneliness and depression. 
A study of older adults in San Francisco found that they showed increased rates of loneliness and depression.\n\nWe must do our best to check-up on our friends and loved ones. We are all affected differently by the pandemic, so it is important to recognize that the rates of anxiety, depression and stress-related disorders may arise.\n\nCOVID-19 Long-Haulers\n\nThousands of individuals initially infected with COVID-19, the long-haulers, continue to suffer symptoms many months later. On average, these individuals are women around the age of 44 who are otherwise healthy. Their infections were classified as mild severity because they could recover at home.\n\nFacing stigma and in need of a community, several groups sprouted up to support each other. Originally disbelieved, they rallied to raise awareness of their predicament within the medical establishment. It should no longer be sufficient to classify individuals infected with COVID-19, who don’t require a hospital stay, as mild.\n\nA few different studies report that most individuals affected with COVID-19 suffer from symptoms months later (Italy, UK, Germany). Intriguingly, many long-haulers did not produce high-levels of coronavirus antibodies. Many individuals experience pain, fatigue and many other debilitating symptoms.\n\nThese symptoms are consistent with disturbances in the autonomic nervous system, which is responsible for many automatic physiological functions like breathing or heart-rate but also influence fatigue. Preliminary physiotherapy involves reconditioning the nervous system of patients so that they may regain some of these functions. In his article, Ed Yong states:","['Mental Health', 'Coronavirus', 'Science', 'Psychology', 'Neuroscience']"
2,"Mind Your Nose\n\nHow smell training can change your brain in six weeks — and why it matters.\n\nBy Ann-Sophie Barwich\n\nWhen it comes to training your brain, your sense of smell is possibly the last thing you’d think could strengthen your neural pathways. Learning a new language or reading more books (and fewer social media posts) — sure. But your nose?\n\nThat’s because the olfactory system is one of the most plastic systems in your brain. Neuroplasticity describes how the brain flexibly adapts to changes in the environment or when exposed to neural damage. Stimulating the brain strengthens existing neural structures and further adds fuel to the brain’s capacity to remain adaptive, thereby keeping it young. And your smell system is particularly adept at repair and renewal. (Olfactory cells have recently been used in human transplant therapy to treat spinal cord injury, for example.)\n\nOne reason for the olfactory system’s adaptive responsiveness is that it undergoes adult neurogenesis. Humans grow new olfactory neurons every three to four weeks throughout their entire life, not just during child development. (These sensory neurons sit in the mucous of your nose, where they pick up airborne chemicals and send activity signals straight to the core of the brain.) If it weren’t for this ongoing regeneration of sensory cells in your nose, we would stop detecting smells after our first few colds.\n\nNeural plasticity weakens as we grow old — and so does our sense of smell. Olfactory performance decreases around the age of 70 as the regeneration of olfactory neurons slows down. Yet this process of regeneration never stops entirely. Training your nose helps slow down that decline and offers a great way to increase your brain’s plasticity. That said, increasing your sensitivity to odors in the environment does not always sound desirable. 
Smell usually comes with negative connotations: that whiff of urine in the metro, that overpowering literal skunk, or that trail of body odor from the person walking in front of you. But paying more attention to the smells around you also has benefits, and not just for a greater enjoyment of food aromas and neighbors’ gardens.\n\nRecent studies show that olfactory abilities correspond with differences in cortical areas involved in smell processing in the brain. Johannes Frasnelli, an olfactory scientist at the University of Quebec in Trois-Rivières, explained: “We did some studies where we saw that there is a link between the structure of certain brain regions-like the thickness of the cortex and the thickness of the gray matter layer in certain brain olfactory processing regions-and the ability to perceive.” Frasnelli and his colleagues found that people with better perceptual capacities had a thicker cortex. When they looked at people who had lost their sense of smell, they also saw a reduction of cortical matter in areas involved in odor processing.\n\nThat raises the question: Could you change the structure of your brain simply by smelling things? In 2019, Frasnelli’s group discovered that undergoing as little as six weeks of intense olfactory training results in significant structural changes in some regions of the brain (namely, the right inferior frontal gyrus, the bilateral fusiform gyrus, and the right entorhinal cortex).\n\nParticipants were given three tasks with a cognitive component.\n\nThe first task was a classification task. Participants had to organize two simple odor mixtures by ordering each from lowest to highest concentration. The second was an identification task. Participants were presented with a target odor blended with a citrus scent in a specific ratio (4%). Then they were given the same blend in different ratios and asked to order them according to quality (more citrusy or less?). 
Lastly, the detection task: Was the learned target odor present in a range of 14 samples of different odor mixtures or not?\n\nThis entire exercise was undertaken each day for 20 minutes during the six weeks. Responses were monitored and evaluated on speed and accuracy.\n\nSuch intense olfactory training led to a general improvement in olfactory performance. Plus, the increase of olfactory skill was not restricted to the training exercises but also transferred to other olfactory abilities-abilities that had not been tested as part of the training. These perceptual tests included: the detection threshold of an odor, accuracy in odor discrimination (same or different?), cued odor identification (which of these four descriptors is correct?), and even free odor identification (identifying an odor without cues!).\n\nIncreasing insight into what the nose knows, and how it communicates with the brain, has broader implications-even philosophical ones. Old (yet still prevalent) cookie-cutter views of the mind coax us to believe that our senses are passive-indifferently picking up signals in the world that are then processed by the brain. Perception, in such views, is a process separate from cognition. Highly plastic systems such as olfaction present us with a much more intriguing and interwoven picture of the mind: Training your nose’s performance (just like other cognitive capacities) fundamentally shapes what you perceive by rewiring the system.\n\nYour senses are far from being impartial transmitters; what you are able to perceive in the world ultimately hinges on the depth of your cognitive engagement with it. In other words, your mind does not emerge apathetically as a product of some remarkable, intricate molecular twists performed by the brain. The mind is enhanced by what you can train your brain to do. 
Just like strength is a result of muscle training, cognitive training of the senses is the bodybuilding of the brain.","['Biotechnology', 'Neuroscience', 'Brain', 'Wellness', 'Science']"
3,Passionate about the synergy between science and technology to provide better care. Check out my newsletter: scienceforreal.substack.com 📰\n\nFollow,"['Health', 'Neuroscience', 'Mental Health', 'Psychology', 'Science']"
4,"You’ve heard of him, haven’t you? Phineas Gage. The railroad worker who survived an explosion that involved an iron rod piercing through his left cheek and out of his brain and skull.\n\nYeah.\n\nI know.\n\nYou’re probably wondering “yeah, alright sweet. What about him?” Well, let’s just say that he was a really popular patient for the field of neuroscience (Cherry, par. 1). And what I found the most interesting about this tragic event was the science of his behavior afterward.\n\nFor those of you who don’t know much about Phineas Gage, let me fill you in with the help of my research.\n\nPhineas Gage, 25 years old, was a railroad worker in Vermont. One day, at work, he was using an iron rod to handle explosive gun powder. As he was using the iron rod to handle the gun powder, an explosion suddenly occurred. The iron rod then went through his left cheek and brain. Fortunately, he survived and was able to talk and walk after the accident (Cherry, par. 2–3).\n\nWhy did people say that Phineas Gage was a “different person” after his accident? It actually has to do with neuroscience.\n\nThe iron rod went through his brain, in particular, it went through the frontal lobe of his brain. Does this mean that the frontal lobe of your brain has to do with the kind of person you are? To answer this question, we have to understand what the frontal lobe in our brain is responsible for.\n\nOur frontal lobes are responsible for many things. Some of them are higher-order thinking, personality, and decision making. This explains why people who knew Phineas Gage said that he was a totally different person after the accident. Since the iron rod went through his frontal lobe, it means that his personality and thinking, as a whole, completely changed, making him seem like he was a whole different person due to the way he started acting.\n\nThis accident and the treatment of Phineas Gage actually played a big role in the field of neurology. 
His case helped scientists better understand the role of the frontal cortex of the brain (Cherry, par. 16–17).\n\nBibliography\n\nCherry, Kendra. “The Famous Case of Phineas Gage’s Astonishing Brain Injury.” Phineas Gage’s Astonishing Brain Injury, Verywell Mind, 3 Oct. 2019, www.verywellmind.com/phineas-gage-2795244#targetText=The%20rod%20penetrated%20Gage's%20left,be%20seen%20by%20a%20doctor.","['Brain', 'Health', 'Development', 'Psychology', 'Science']"


### Check formatting within text column

In [69]:
print(df["text"].iloc[4])  # Display the full text of the article at row index 4

You’ve heard of him, haven’t you? Phineas Gage. The railroad worker who survived an explosion that involved an iron rod piercing through his left cheek and out of his brain and skull.

Yeah.

I know.

You’re probably wondering “yeah, alright sweet. What about him?” Well, let’s just say that he was a really popular patient for the field of neuroscience (Cherry, par. 1). And what I found the most interesting about this tragic event was the science of his behavior afterward.

For those of you who don’t know much about Phineas Gage, let me fill you in with the help of my research.

Phineas Gage, 25 years old, was a railroad worker in Vermont. One day, at work, he was using an iron rod to handle explosive gun powder. As he was using the iron rod to handle the gun powder, an explosion suddenly occurred. The iron rod then went through his left cheek and brain. Fortunately, he survived and was able to talk and walk after the accident (Cherry, par. 2–3).

Why did people say that Phineas Gage wa
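
The printed article shows paragraphs separated by blank lines, and the earlier preview includes non-content lines such as photo credits. As a rough sketch of a pre-processing step, the snippet below splits text into paragraphs and drops credit-only lines. The function name and the credit pattern are assumptions made for illustration; the notebook does not define them.

```python
import re

# Assumed pattern for photo-credit lines such as "Photo by ... on Unsplash".
PHOTO_CREDIT = re.compile(r"^Photo by .+ (on|from) \w+$")


def split_paragraphs(text: str) -> list[str]:
    """Split article text on blank lines and drop empty or credit-only paragraphs."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text)]
    return [p for p in paragraphs if p and not PHOTO_CREDIT.match(p)]


sample = (
    "Photo by Josh Riemer on Unsplash\n\n"
    "Merry Christmas and Happy Holidays, everyone!\n\n"
    "Yeah."
)
print(split_paragraphs(sample))
```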

### Split the DataFrame into training, validation, and test sets

In [11]:
from sklearn.model_selection import train_test_split

# Split into training and temp (validation + test) set (70% train, 30% temp)
train, temp = train_test_split(df, test_size=0.3, random_state=42)

# Split the temp set into validation and test set (50% validation, 50% test of the 30%)
validate, test = train_test_split(temp, test_size=0.5, random_state=42)

# Print counts of rows in each set
print(f"Training Set Size: {train.shape[0]}")
print(f"Validation Set Size: {validate.shape[0]}")
print(f"Test Set Size: {test.shape[0]}")

Training Set Size: 134657
Validation Set Size: 28855
Test Set Size: 28856
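
The printed sizes can be checked by hand. The sketch below reproduces the 70/15/15 arithmetic, assuming the held-out portion of each split is rounded up to a whole row, which is consistent with the counts above. The helper `split_sizes` is hypothetical, and the total row count is derived from the three printed sizes rather than read from the dataset.

```python
import math


def split_sizes(n_rows: int, held_out_fraction: float) -> tuple[int, int]:
    """Return (kept, held_out) row counts, rounding the held-out part up."""
    held_out = math.ceil(n_rows * held_out_fraction)
    return n_rows - held_out, held_out


total_rows = 134657 + 28855 + 28856  # sum of the three printed set sizes
train_rows, temp_rows = split_sizes(total_rows, 0.3)
validate_rows, test_rows = split_sizes(temp_rows, 0.5)
print(total_rows, train_rows, validate_rows, test_rows)
```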


## Solution

### Approach

LLM-based tool(s), each with tool-specific training and tool-specific engineered prompt(s).
The design is modular, function-based, and extensible.


#### Input: content of the article to analyse

(Ideally, the article contents are supplied directly. An alternative would be to accept a URL, but that would require additional Python helper functions to fetch the page and clean up its contents prior to submission.)
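
To illustrate the clean-up half of the URL alternative, here is a minimal sketch that reduces an HTML page to plain text using only the standard library. It assumes the page has already been fetched; the class and function names are hypothetical and the notebook does not implement this path.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> blocks."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def get_text(self) -> str:
        # Join text chunks with blank lines, mirroring the dataset's layout.
        return "\n\n".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>First paragraph.</p>"
    "<script>track()</script></body></html>"
)
print(html_to_text(sample_html))
```

This sketch treats every text node as its own paragraph, so inline markup such as bold or links inside a sentence would split that sentence; a production version would need to handle that.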

#### Output: variable per tool/function

### Instantiate an LLM, and set it up to support one or more functions

#### Ensure the required pip packages are installed (needed whenever a new workbench instance is spun up)

In [16]:
!pip install google-genai
!pip show google-genai

Collecting google-genai
  Downloading google_genai-1.10.0-py3-none-any.whl.metadata (32 kB)
Collecting httpx<1.0.0,>=0.28.1 (from google-genai)
  Using cached httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Downloading google_genai-1.10.0-py3-none-any.whl (154 kB)
Using cached httpx-0.28.1-py3-none-any.whl (73 kB)
Installing collected packages: httpx, google-genai
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
asl 0.1 requires google-genai==1.2.0, but you have google-genai 1.10.0 which is incompatible.
langchain-google-vertexai 2.0.10 requires httpx<0.28.0,>=0.27.0, but you have httpx 0.28.1 which is incompatible.
Successfully installed google-genai-1.10.0 httpx-0.28.1
Name: google-genai
Version: 1.10.0
Summary: GenAI Python SDK
Home-page: https://github.com/googleapis/python-genai
Author: 
Author-email: Google LLC <googleapis-packages@google

#### Instantiate the new LLM

In [39]:
# Instantiate an LLM
from typing import Any, Callable, Optional, Tuple, Union

from google import genai
from google.cloud import bigquery
from google.genai.types import (
    FunctionDeclaration,
    GenerateContentConfig,
    GenerateContentResponse,
    Part,
    SafetySetting,
    Schema,
    Tool,
)
from IPython.display import Markdown

REGION = "us-central1"
PROJECT = !(gcloud config get-value core/project)
PROJECT = PROJECT[0]

MODEL = "gemini-2.0-flash-001"

client = genai.Client(vertexai=True, project=PROJECT, location=REGION)

# Define the ChatAgent parent class
class ChatAgent:
    def __init__(
        self,
        tools: list[Tool],
        tool_handler_fn: Callable[[str, dict], Any],
        max_iterative_calls: int = 5,
    ):
        
        self.tools = tools
        self.tool_handler_fn = tool_handler_fn
        
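        # Define the generate_content_config.
        # The tools are passed here so the model can actually request function calls.
        # NOTE: a 3-token output cap suits a one-word label but may truncate
        # longer replies; raise it if responses come back cut off.
        generate_content_config = GenerateContentConfig(
            tools=tools,
            max_output_tokens=3,
            response_modalities=["TEXT"],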
            safety_settings=[
                SafetySetting(
                    category="HARM_CATEGORY_HATE_SPEECH",
                    threshold="OFF"
                ),
                SafetySetting(
                    category="HARM_CATEGORY_DANGEROUS_CONTENT",
                    threshold="OFF"
                ),
                SafetySetting(
                    category="HARM_CATEGORY_SEXUALLY_EXPLICIT",
                    threshold="OFF"
                ),
                SafetySetting(
                    category="HARM_CATEGORY_HARASSMENT",
                    threshold="OFF"
                ),
            ]
        )        

        self.chat_session = client.chats.create(
            model=MODEL,
            config=generate_content_config,
        )
        # Honour the constructor argument instead of a hard-coded limit
        self.max_iterative_calls = max_iterative_calls

    def send_message(self, message: str) -> GenerateContentResponse:
        response = self.chat_session.send_message(message)
        # This is None if a function call was not triggered
        fn_calls = response.function_calls

        num_calls = 0
        # Reasoning loop. If fn_calls is empty then we never enter this
        # and simply return the response
        while fn_calls:
            if num_calls > self.max_iterative_calls:
                break

            # Handle the function calls
            fn_call_responses = []
            for fn_call in fn_calls:
                # Keep the tool result separate from the model response
                fn_result = self.tool_handler_fn(
                    fn_call.name, dict(fn_call.args)
                )
                fn_call_responses.append(
                    Part.from_function_response(
                        name=fn_call.name,
                        response={
                            "content": fn_result,
                        },
                    ),
                )
                num_calls += 1

            # Send the function call result back to the model
            response = self.chat_session.send_message(fn_call_responses)

            # If the response is another function call then we want to
            # stay in the reasoning loop and keep calling functions.
            fn_calls = response.function_calls

        return response
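
`ChatAgent` expects a `tool_handler_fn` that takes a function name and an argument dictionary. One way such a handler might be organised, in keeping with the modular, extensible approach, is a small registry that dispatches by name. This is only a sketch: `TOOL_REGISTRY`, `register_tool`, `tool_handler`, and the placeholder `count_words` tool are hypothetical names, not part of the notebook.

```python
from typing import Any, Callable

# Hypothetical registry mapping tool names to plain Python functions.
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {}


def register_tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Decorator that adds a function to the registry under its own name."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn


@register_tool
def count_words(article_text: str) -> int:
    """Placeholder tool: number of whitespace-separated words."""
    return len(article_text.split())


def tool_handler(name: str, args: dict) -> Any:
    """Dispatch a function call requested by the model to the registered tool."""
    if name not in TOOL_REGISTRY:
        return {"error": f"unknown tool: {name}"}
    return TOOL_REGISTRY[name](**args)


print(tool_handler("count_words", {"article_text": "Phineas Gage survived."}))
print(tool_handler("extract_scores", {}))
```

A handler shaped like this could be passed as `tool_handler_fn` when constructing the agent, with new tools added by decorating further functions.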

### Tools

#### Article Topic Classifier

input: article body  
output: one of a curated set of topic categories, based upon highest probability match
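
Because the output must be one of a fixed set of categories, the raw model reply usually needs to be checked against that set before it is used. The sketch below shows one possible normalisation step. The function name is hypothetical and `ALLOWED_TOPICS` is only an illustrative subset of labels, not the notebook's full list.

```python
from typing import Optional

# Illustrative subset of topic labels (an assumption for this example).
ALLOWED_TOPICS = {"business", "personal development", "science", "sports", "us news"}


def normalize_topic(raw_response: str) -> Optional[str]:
    """Map a raw model reply to an allowed topic, or None if it is not recognised."""
    candidate = raw_response.strip().strip(".\"'").lower().replace("_", " ")
    candidate = " ".join(candidate.split())  # collapse repeated whitespace
    return candidate if candidate in ALLOWED_TOPICS else None


print(normalize_topic(" Sports.\n"))
print(normalize_topic("Personal_Development"))
print(normalize_topic("gardening"))
```

Returning `None` for unrecognised replies lets the caller decide whether to retry, log the row, or fall back to a default category.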

##### Create Custom Query and Response Templates

###### LLM Input

In [57]:
# Define the function declaration for the topic classifier
from pydantic import BaseModel, Extra, Field
from typing import Dict, Any
from enum import Enum

import logging
import traceback

# Custom JSON Formatter for logging
class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_obj = {
            "timestamp": self.formatTime(record, self.datefmt),
            "level": record.levelname,
            "message": record.getMessage(),
            "article_text": getattr(record, 'article_text', 'No article text provided'),
            "row_index": getattr(record, 'row_index', 'No row index provided'),
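            # Prefer an explicitly supplied stack trace; otherwise format the
            # exception attached to this record (if any)
            "stack_trace": getattr(record, 'stack_trace', self.formatException(record.exc_info) if record.exc_info else 'No stack trace available')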
        }
        return json.dumps(log_obj)

# Get the root logger
logger = logging.getLogger()
logger.setLevel(logging.ERROR)

# Clear existing handlers (important in Jupyter)
logger.handlers.clear()

# Create new handlers
file_handler = logging.FileHandler("./data/error_log.json", mode='a')
stream_handler = logging.StreamHandler()

# Assign formatter
formatter = JSONFormatter()
file_handler.setFormatter(formatter)
stream_handler.setFormatter(formatter)

# Add handlers
logger.addHandler(file_handler)
logger.addHandler(stream_handler)

    
# Define an Enum for topics
class TopicEnum(str, Enum):
    arts = "arts"
    business = "business"
    entertainment = "entertainment"
    culture = "culture"
    literature = "literature"
    medicine = "medicine"
    music = "music"
    personal_development = "personal development"
    philosophy = "philosophy"
    politics = "politics"
    religion = "religion"
    science = "science"
    sports = "sports"
    technology = "technology"
    us_news = "us news"
    world = "world"

    # Use a @classmethod to safely store descriptions
    @classmethod
    def _get_descriptions(cls):
        return {
            "arts": "Cultural and creative activities, including fine arts, theater, and music.",
            "business": "The activities related to the production, distribution, and sale of goods and services.",
            "entertainment": "Media, performance, and activities designed to entertain an audience.",
            "culture": "The shared customs, arts, and social institutions of a particular group of people.",
            "literature": "Written works, books, essays, poems, et cetera, especially those considered of superior or lasting artistic merit.",
            "medicine": "The field of health and healing, including clinical practices and healthcare.",
            "music": "An art form that uses sound and rhythm to express emotions, ideas, and cultural identity.",
            "personal_development": "Activities and practices that improve awareness and identity, develop talents, and enhance the quality of life.",
            "philosophy": "The study of fundamental questions regarding existence, knowledge, ethics, reason, and the mind.",
            "politics": "The activities associated with governance, policy, and political ideologies.",
            "religion": "The system of beliefs, practices, and worship regarding a deity or deities.",
            "science": "Systematic enterprise that builds and organizes knowledge through testable explanations and predictions.",
            "sports": "Physical activities involving skill, competition, and fitness.",
            "technology": "The application of scientific knowledge for practical purposes, particularly in industry.",
            "us_news": "News related to events, politics, and issues within the United States.",
            "world": "Global news, issues, and events occurring internationally."
        }

    @classmethod
    def get_description(cls, topic: "TopicEnum") -> str:
        """Return the description for a given topic, looked up via _get_descriptions()."""
        descriptions = cls._get_descriptions()  # Safely fetch the descriptions
        # Convert spaces in topic.value to underscores to match the dictionary keys
        key = topic.value.replace(" ", "_")
        if key not in descriptions:
            raise ValueError(f"No description found for topic: {topic.value}")
        return descriptions[key]

class Schema(BaseModel):
    type: str
    properties: Dict[str, Any]
    required: list

class FunctionDeclarationWithExtra(BaseModel):
    name: str
    description: str
    parameters: Schema
    # Allow extra fields (like PROMPT)
    class Config:
        extra = Extra.allow
        
topic_list = "\n".join([f"{topic.value.capitalize()}: {TopicEnum.get_description(topic)}" for topic in TopicEnum])

# Define the function declaration for the topic classifier
topic_classifier_tool_handler_fn = FunctionDeclarationWithExtra(
    name="topic_classifier",
    description="Identify the article topic by analysing the contents of the article",
    parameters=Schema(
        type="OBJECT",
        properties={
            "article_text": {  # Define the expected 'article_text' input
                "type": "STRING",
                "description": "The content of the article to classify."
            },
            "topic": {
                "type": "STRING",
                "description": "Topic",
                "enum": [topic.value for topic in TopicEnum]
            },
        },
        required=["article_text", "topic"],
    ),
    PROMPT="""
    Identify the most relevant topic for the following article:

    {article_text}
    
    Review the enumerated topic categories and their descriptions, and categorize the article based upon both the topics and the descriptions.
    If an article IS a poem, the topic is 'literature'.
    Only return one of the enumerated topics:
    
    {topic_list}
    
    """
)

# Function to classify article text based on topic classification
def classify_article_and_get_tags(article_text: str, row_index: int = None):
    try:
        # Dynamically set the prompt with the actual article text
        formatted_prompt = topic_classifier_tool_handler_fn.PROMPT.format(
            article_text=article_text,
            topic_list=topic_list
        )
        
        # Send the article text to the model for classification
        response = client.models.generate_content(
            model=MODEL,
            contents=formatted_prompt,
            config=GenerateContentConfig(
                response_mime_type="text/x.enum",
                response_schema={
                    "type": "STRING",
                    "enum": [topic.value for topic in TopicEnum],
                },
            ),
        )
        
        # response.text may be missing or None; only a string counts as a
        # valid classification, anything else falls back to "unclassified"
        if response is not None and isinstance(getattr(response, 'text', None), str):
            classification_result = response.text.strip()  # Clean up the result
        else:
            classification_result = "unclassified"  # Fallback response

            # Log the failure once, with the row data needed to retry later.
            # No exception is active here, so no stack trace is attached.
            logger.error(
                "Received None response from model",
                extra={
                    'article_text': article_text,
                    'row_index': row_index,
                }
            )
        
        return classification_result
    
    except Exception as e:
        # Log any unexpected errors with the row data
        logger.error(
            f"An error occurred while classifying article at row {row_index}: {str(e)}",
            extra={
                'row_index': row_index,
                'stack_trace': traceback.format_exc()
            }
        )
        return "unclassified"  # Fallback

# Now, assuming we have a dataframe or list of articles to process
article_data = [
    "You’ve heard of him, haven’t you? Phineas Gage. The railroad worker who survived an explosion...",
    "Another article text that talks about politics and economy...",
    # Add more articles as needed
]

# Uncomment to inspect the generated topic list
# print(topic_list)

# Process each article and classify its topic
for idx, row in test.head(10).iterrows():
    article_text = row["text"]
    predicted_topic = classify_article_and_get_tags(article_text, row_index=idx)
    actual_tags = row["tags"]  # Get the tags for the current row
    print(f"{article_text}\n")
    print("---------------------------------------------------------------------------------\n")
    print(f"Predicted Article Topic: {predicted_topic}\t\tDataset tags: {actual_tags}\n")
    print("---------------------------------------------------------------------------------\n\n")

Lloyd Austin, Secretary of Defense

While Gen. Lloyd Austin would be the first Black Defense Secretary, his terrible record should overshadow this feat, but we know that is going to be brought up in mainstream media. As Gen. Austin was tapped by Obama to begin the process of the withdrawal of American troops from Iraq in 2010, it was an unorganized effort leaving a depleted Iraqi government vulnerable with a military not trained or equipped to quell the rise of ISIS. The disaster did not stop there as Gen. Austin was then given the responsibility of overseeing a Syrian rebel program to combat ISIS that cost us $384 million, which ended in failure after the millions of dollars earmarked to build the training camps were rarely if ever used.

Shortly after all of these debacles, Gen. Austin sold out to defense contractors and wealthy investment funds becoming a board member at Raytheon and a partner at Pine Island Capital Partners, who both stand to profit immensely from a Biden administr

### Having decided that Gemini's accuracy is Surprisingly High

Now create a new dataframe with the article ('text') column and a new 'topic' column, as a new functional dataset.  
(Append every 100 articles to a CSV file, so as not to overload memory and cause an abend.)
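
Because the labeling run is long and can be interrupted, it helps if the chunked append can pick up where the output file left off. The following is a minimal sketch of that idea using only the standard library `csv` module instead of pandas, so it runs without the dataset; `rows_already_written` and `label_in_chunks` are hypothetical helpers, and `classify` stands in for the model call.

```python
import csv
import os
from typing import Callable, Sequence

# Articles can be long; raise the per-field limit of the csv reader.
csv.field_size_limit(2**31 - 1)


def rows_already_written(csv_path: str) -> int:
    """Count data rows in an existing output CSV (0 if absent or empty)."""
    if not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0:
        return 0
    with open(csv_path, newline="", encoding="utf-8") as fh:
        # csv.reader treats a quoted multi-line article as a single row.
        return max(sum(1 for _ in csv.reader(fh)) - 1, 0)  # minus header


def label_in_chunks(
    texts: Sequence[str],
    csv_path: str,
    classify: Callable[[str], str],
    chunk_size: int = 100,
) -> int:
    """Append (text, topic) rows chunk by chunk; return the resume offset."""
    start = rows_already_written(csv_path)
    for start_row in range(start, len(texts), chunk_size):
        chunk = texts[start_row:start_row + chunk_size]
        rows = [(text, classify(text)) for text in chunk]
        new_file = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
        with open(csv_path, "a", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            if new_file:
                writer.writerow(["text", "topic"])  # header written once
            writer.writerows(rows)
    return start
```

Since each chunk is written only after all of its rows are classified, an interruption loses at most one chunk of work, and the row count of the file gives the offset to resume from.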

In [55]:
from datetime import datetime

def print_with_timestamp(message: str):
    timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')  # Format the time
    print(f"{timestamp} - {message}")

new_data_file = "./data/medium.article.topic.csv"
chunk_size = 100

# Start with a clean slate, so we don't make a huge mess within the new csv file while validating
# !rm {new_data_file} 2> /dev/null  # (commented out again for safety purposes)

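# Write the header only when the output file does not already have content
# (avoids a duplicate header row when resuming an interrupted run)
header_written = os.path.exists(new_data_file) and os.path.getsize(new_data_file) > 0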

# Initialize a list to store the rows of text and topic
rows = []

# Print dataset length to check if it's large enough
print_with_timestamp(f"Total rows in dataset: {len(df)}")

# Process the DataFrame in chunks.
# (The run originally started at row 0; after an error interrupted processing,
# safety logic was added and the run was continued from row 13900.)
for start_row in range(0, len(df), chunk_size):
    # Copy the slice so that adding a column does not trigger SettingWithCopyWarning
    chunk_df = df.iloc[start_row:start_row + chunk_size].copy()
    print_with_timestamp(f"Processing chunk starting at row {start_row}, ending at row {start_row + chunk_size}")
    
    # Generate the 'topic' column by applying the LLM function to the 'text' column
    chunk_df['topic'] = chunk_df.apply(
        lambda row: classify_article_and_get_tags(
            article_text=row['text'],         # Pass the 'text' column of the row
            row_index=row.name,               # Pass the index (row.name) as row_index
        ), axis=1
    )
    
    # Create a new DataFrame with the 'text' and 'topic' columns
    new_chunk_df = chunk_df[['text', 'topic']]
    
    # Append the chunk to the CSV file, ensuring header is written only once
    new_chunk_df.to_csv(new_data_file, mode='a', index=False, header=not header_written)
    
    # Ensure the header is not written again in subsequent chunks
    header_written = True
    
    # Print after writing
    print_with_timestamp(f"Written rows {start_row} to {start_row + chunk_size} to {new_data_file}")
    print(f"Current file size: {os.path.getsize(new_data_file)} bytes")

print_with_timestamp("Processing complete.")

2025-04-09 21:18:31 - Total rows in dataset: 192368
2025-04-09 21:18:31 - Processing chunk starting at row 0, ending at row 100
2025-04-09 21:19:04 - Written rows 0 to 100 to ./data/medium.article.topic.csv
Current file size: 508612 bytes
2025-04-09 21:19:04 - Processing chunk starting at row 100, ending at row 200
2025-04-09 21:19:37 - Written rows 100 to 200 to ./data/medium.article.topic.csv
Current file size: 1055060 bytes
2025-04-09 21:19:37 - Processing chunk starting at row 200, ending at row 300
2025-04-09 21:20:08 - Written rows 200 to 300 to ./data/medium.article.topic.csv
Current file size: 1633045 bytes
2025-04-09 21:20:08 - Processing chunk starting at row 300, ending at row 400
2025-04-09 21:20:38 - Written rows 300 to 400 to ./data/medium.article.topic.csv
Current file size: 2181381 bytes
2025-04-09 21:20:38 - Processing chunk starting at row 400, ending at row 500
2025-04-09 21:21:10 - Written rows 400 to 500 to ./data/medium.article.topic.csv
Current file size: 2773814

{"timestamp": "2025-04-09 22:29:59,234", "level": "ERROR", "message": "Received None response from model", "article_text": "How to Have a Sex Life When You Have Kids\n\nBecoming a parent doesn\u2019t have to extinguish the flame\n\nPhoto by rawpixel.com from Pexels\n\nI was lucky enough never to have had the misfortune to walk in on my parents during sex. I do have one specific memory of a time when I was about 16 and realized that my parents had just done the deed. As kids, thinking about our parents getting it on was enough to make us run wide-eyed from the room. Now that I\u2019m the parent, I realize how much having children can complicate having an active sex life.\n\nMy kids are 9 and 13, so there\u2019s no more nookie when the kids are napping. Having a fulfilling sex life when you\u2019ve got a family to manage is challenging. With one or both parents often exhausted at day\u2019s end, it\u2019s common to let too much time slip by without finding moments to really connect. Over

2025-04-09 22:30:21 - Written rows 13900 to 14000 to ./data/medium.article.topic.csv
Current file size: 79351815 bytes
2025-04-09 22:30:21 - Processing chunk starting at row 14000, ending at row 14100
2025-04-09 22:30:51 - Written rows 14000 to 14100 to ./data/medium.article.topic.csv
Current file size: 79889789 bytes
2025-04-09 22:30:51 - Processing chunk starting at row 14100, ending at row 14200
2025-04-09 22:31:20 - Written rows 14100 to 14200 to ./data/medium.article.topic.csv
Current file size: 80465177 bytes
2025-04-09 22:31:20 - Processing chunk starting at row 14200, ending at row 14300
2025-04-09 22:32:02 - Written rows 14200 to 14300 to ./data/medium.article.topic.csv
Current file size: 81018920 bytes
2025-04-09 22:32:02 - Processing chunk starting at row 14300, ending at row 14400
2025-04-09 22:32:32 - Written rows 14300 to 14400 to ./data/medium.article.topic.csv
Current file size: 81597640 bytes
2025-04-09 22:32:32 - Processing chunk starting at row 14400, ending at row 1



2025-04-09 22:38:54 - Written rows 15600 to 15700 to ./data/medium.article.topic.csv
Current file size: 88430071 bytes
2025-04-09 22:38:54 - Processing chunk starting at row 15700, ending at row 15800
2025-04-09 22:39:24 - Written rows 15700 to 15800 to ./data/medium.article.topic.csv
Current file size: 88900318 bytes
2025-04-09 22:39:24 - Processing chunk starting at row 15800, ending at row 15900
2025-04-09 22:39:53 - Written rows 15800 to 15900 to ./data/medium.article.topic.csv
Current file size: 89389701 bytes
2025-04-09 22:39:53 - Processing chunk starting at row 15900, ending at row 16000
2025-04-09 22:40:23 - Written rows 15900 to 16000 to ./data/medium.article.topic.csv
Current file size: 89801047 bytes
2025-04-09 22:40:23 - Processing chunk starting at row 16000, ending at row 16100
2025-04-09 22:40:51 - Written rows 16000 to 16100 to ./data/medium.article.topic.csv
Current file size: 90331819 bytes
2025-04-09 22:40:51 - Processing chunk starting at row 16100, ending at row 1

{"timestamp": "2025-04-09 22:47:13,457", "level": "ERROR", "message": "Received None response from model", "article_text": "I t was middle school of 2016 and there I was, a 15-year-old Maggie, a proud bisexual plebe in my first heterosexual relationship with my childhood best friend \u2014 we\u2019ll call him Sam.\n\nI had already lost my virginity to Sam at 14 and decided I largely hated penises and the way they ripped open my fragile, white, porcelain area down there, so we stuck to what we knew best \u2014 hand jobs, fingering, and a lot of oral. (And honestly, even in adulthood, I find oral is so much better than a long, hard, cylinder occasionally ramming into my cervix \u2014 it\u2019s a sensation I can do without.)\n\nSo one evening Sam and I begin fooling around under the covers of the lower portion of his bunk bed \u2014 which is to say aggressively attacking each other faces with mouths, lips and tongues and grabbing at each other\u2019s genitalia (over the clothes). After so

2025-04-09 22:47:15 - Written rows 17300 to 17400 to ./data/medium.article.topic.csv
Current file size: 97080428 bytes
2025-04-09 22:47:15 - Processing chunk starting at row 17400, ending at row 17500
2025-04-09 22:47:48 - Written rows 17400 to 17500 to ./data/medium.article.topic.csv
Current file size: 97560504 bytes
2025-04-09 22:47:48 - Processing chunk starting at row 17500, ending at row 17600
2025-04-09 22:48:18 - Written rows 17500 to 17600 to ./data/medium.article.topic.csv
Current file size: 98172284 bytes
2025-04-09 22:48:18 - Processing chunk starting at row 17600, ending at row 17700
2025-04-09 22:48:49 - Written rows 17600 to 17700 to ./data/medium.article.topic.csv
Current file size: 98781155 bytes
2025-04-09 22:48:49 - Processing chunk starting at row 17700, ending at row 17800
2025-04-09 22:49:17 - Written rows 17700 to 17800 to ./data/medium.article.topic.csv
Current file size: 99255777 bytes
2025-04-09 22:49:17 - Processing chunk starting at row 17800, ending at row 1

{"timestamp": "2025-04-09 22:52:02,420", "level": "ERROR", "message": "Received None response from model", "article_text": "Giuliano Bugiardini, Rape of Dinah (1554).\n\nRio Americano High School sits at the east end of American River Drive in a tony part of Sacramento. In the 1980s, Rio was the wealthiest public school in Sacramento, supported by a tax base generated from expansive California ranch houses and mutant versions of the English manor. Most of the kids who went to Rio were children of doctors, lawyers, and people high up in state politics. One of my friend\u2019s father owned Sacramento\u2019s biggest construction firm. Another acquaintance\u2019s dad was dentist to governors and attorney generals.\n\nMe? I was one of the school runts, a kid from the other side of the river, lower middle class and falling. My dad lived in a workingman\u2019s apartment on the edge of the district. I claimed his address so that I could go to Rio and play in its 5-watt radio station, the only 

2025-04-09 22:52:11 - Written rows 18300 to 18400 to ./data/medium.article.topic.csv
Current file size: 102298769 bytes
2025-04-09 22:52:11 - Processing chunk starting at row 18400, ending at row 18500
2025-04-09 22:52:39 - Written rows 18400 to 18500 to ./data/medium.article.topic.csv
Current file size: 102794631 bytes
2025-04-09 22:52:39 - Processing chunk starting at row 18500, ending at row 18600
2025-04-09 22:53:08 - Written rows 18500 to 18600 to ./data/medium.article.topic.csv
Current file size: 103299008 bytes
2025-04-09 22:53:08 - Processing chunk starting at row 18600, ending at row 18700
2025-04-09 22:53:37 - Written rows 18600 to 18700 to ./data/medium.article.topic.csv
Current file size: 103876367 bytes
2025-04-09 22:53:37 - Processing chunk starting at row 18700, ending at row 18800
2025-04-09 22:54:05 - Written rows 18700 to 18800 to ./data/medium.article.topic.csv
Current file size: 104415078 bytes
2025-04-09 22:54:05 - Processing chunk starting at row 18800, ending at 

{"timestamp": "2025-04-09 22:59:37,677", "level": "ERROR", "message": "Received None response from model", "article_text": "When Your Church is a Hate Group\n\nGrowing Up in a Southern Baptist Culture of Hate, Sexual Abuse, and Conversion Therapy\n\nFaith Baptist Church, Glen Burnie, MD www.welcometofaith.com\n\nI grew up in church. A lot of us, even in the LGBTQ+ community, grew up in church. But I feel like that is a gross oversimplification of what being part of the church was to my family at the time. We were at church Sunday morning, Sunday evening, Monday night, Wednesday night, and for a good chunk of the year during Upward basketball season Saturday too. Four out of seven days of the week I was in church. It didn\u2019t matter, if the church had something going on our family would be there. From worship services, to potlucks, to plays, to youth parties, to camps, to basketball, to revivals and concerts and everything in between being at this church was our life. We didn\u2019t 

2025-04-09 22:59:53 - Written rows 19900 to 20000 to ./data/medium.article.topic.csv
Current file size: 110679704 bytes
2025-04-09 22:59:53 - Processing chunk starting at row 20000, ending at row 20100
2025-04-09 23:00:22 - Written rows 20000 to 20100 to ./data/medium.article.topic.csv
Current file size: 111161575 bytes
2025-04-09 23:00:22 - Processing chunk starting at row 20100, ending at row 20200
2025-04-09 23:00:51 - Written rows 20100 to 20200 to ./data/medium.article.topic.csv
Current file size: 111659251 bytes
2025-04-09 23:00:51 - Processing chunk starting at row 20200, ending at row 20300
2025-04-09 23:01:24 - Written rows 20200 to 20300 to ./data/medium.article.topic.csv
Current file size: 112175784 bytes
2025-04-09 23:01:24 - Processing chunk starting at row 20300, ending at row 20400
2025-04-09 23:01:53 - Written rows 20300 to 20400 to ./data/medium.article.topic.csv
Current file size: 112598250 bytes
2025-04-09 23:01:53 - Processing chunk starting at row 20400, ending at 

{"timestamp": "2025-04-09 23:29:26,619", "level": "ERROR", "message": "Received None response from model", "article_text": "I didn\u2019t find out I was a middle until I was in my mid-30s. And then I felt like I had to figure everything out on my own.\n\nMy first hunch was that I was drawn to so many elements of the Daddy Dom, little girl kink.\n\nI love the dynamic created in DDlg relationships. And I adore the aesthetic that goes with it \u2014 everything cutesy, playful, and bubbly.\n\nAnd of course, I\u2019m really into the daddy doms. The way they can be dominant, nurturing, and giving all at once makes me weak in the knees.\n\nBut I could never fully identify with the littles I found online. The more I read up on their lifestyles and saw what being a little means in practice, the less it felt like my kink.\n\nI could get down with being playful and sweet, but I wasn\u2019t into the idea of putting a pacifier or a sippy cup to my mouth.\n\nSame with a lot of the activities littles

2025-04-09 23:29:40 - Written rows 26000 to 26100 to ./data/medium.article.topic.csv
Current file size: 142190854 bytes
2025-04-09 23:29:40 - Processing chunk starting at row 26100, ending at row 26200
2025-04-09 23:30:09 - Written rows 26100 to 26200 to ./data/medium.article.topic.csv
Current file size: 142694749 bytes
2025-04-09 23:30:09 - Processing chunk starting at row 26200, ending at row 26300
2025-04-09 23:30:39 - Written rows 26200 to 26300 to ./data/medium.article.topic.csv
Current file size: 143139771 bytes
2025-04-09 23:30:39 - Processing chunk starting at row 26300, ending at row 26400
2025-04-09 23:31:07 - Written rows 26300 to 26400 to ./data/medium.article.topic.csv
Current file size: 143699952 bytes
2025-04-09 23:31:07 - Processing chunk starting at row 26400, ending at row 26500
2025-04-09 23:31:35 - Written rows 26400 to 26500 to ./data/medium.article.topic.csv
Current file size: 144163601 bytes
2025-04-09 23:31:35 - Processing chunk starting at row 26500, ending at 

{"timestamp": "2025-04-09 23:31:52,704", "level": "ERROR", "message": "Received None response from model", "article_text": "1994.\n\nYou are eight and your step-cousin sleeps over. She goes down on you and maybe you go down on her too. Afterward she rushes into the bathroom crying, repeatedly wiping her tongue, washing her mouth. She is upset, and twelve \u2014 older than you \u2014 so you think you should be upset too. You never talk about it.\n\n1995.\n\nYou sit at the top of the stairs with your friend, Roman, and watch your parents\u2019 friends watch porn in the basement. You like it, but you just watch. Later that year you become friends with Anya and you watch more porn and touch each other. You think she is ugly, but you like doing this anyway. One day you want it so bad you do it in the living room with her grandmother in the kitchen. You hang two aprons in the doorway between the rooms to try and keep her out. It seems to work.\n\n1996.\n\nThere are other girl friends you do 

2025-04-09 23:32:06 - Written rows 26500 to 26600 to ./data/medium.article.topic.csv
Current file size: 144755529 bytes
2025-04-09 23:32:06 - Processing chunk starting at row 26600, ending at row 26700
2025-04-09 23:32:35 - Written rows 26600 to 26700 to ./data/medium.article.topic.csv
Current file size: 145306013 bytes
2025-04-09 23:32:35 - Processing chunk starting at row 26700, ending at row 26800
2025-04-09 23:33:04 - Written rows 26700 to 26800 to ./data/medium.article.topic.csv
Current file size: 145777584 bytes
2025-04-09 23:33:04 - Processing chunk starting at row 26800, ending at row 26900
2025-04-09 23:33:34 - Written rows 26800 to 26900 to ./data/medium.article.topic.csv
Current file size: 146255434 bytes
2025-04-09 23:33:34 - Processing chunk starting at row 26900, ending at row 27000
2025-04-09 23:34:04 - Written rows 26900 to 27000 to ./data/medium.article.topic.csv
Current file size: 146828091 bytes
2025-04-09 23:34:04 - Processing chunk starting at row 27000, ending at 

{"timestamp": "2025-04-09 23:48:00,725", "level": "ERROR", "message": "Received None response from model", "article_text": "(Photo by Alex Bl\u0103jan on Unsplash)\n\nAll psychopaths follow the same strategy when operating in intimate relationships. I know this strategy well because I was in a relationship with a psychopath for around four years.\n\nI also know other women who dated and are dating psychopaths. Some of them are still abused, some of them had their lives totally destroyed. Only a few manage to break out, and the only reason that they do is covered at the end of this article.\n\nMany people mistake normal persons for narcissists or psychopaths. Many humans these days display psychopathic traits because they believe that only inhumane behavior would allow them to survive in this cruel world.\n\nBecause of such reasoning, some people act like psychopaths, but this act is external \u2014 deep within they still have human warmth and finer feelings.\n\nReal psychopaths can be 

2025-04-09 23:48:03 - Written rows 29800 to 29900 to ./data/medium.article.topic.csv
Current file size: 161435053 bytes
2025-04-09 23:48:03 - Processing chunk starting at row 29900, ending at row 30000
2025-04-09 23:48:32 - Written rows 29900 to 30000 to ./data/medium.article.topic.csv
Current file size: 161925629 bytes
2025-04-09 23:48:32 - Processing chunk starting at row 30000, ending at row 30100
2025-04-09 23:49:00 - Written rows 30000 to 30100 to ./data/medium.article.topic.csv
Current file size: 162437165 bytes
2025-04-09 23:49:00 - Processing chunk starting at row 30100, ending at row 30200
2025-04-09 23:49:29 - Written rows 30100 to 30200 to ./data/medium.article.topic.csv
Current file size: 162865041 bytes
2025-04-09 23:49:29 - Processing chunk starting at row 30200, ending at row 30300
2025-04-09 23:49:57 - Written rows 30200 to 30300 to ./data/medium.article.topic.csv
Current file size: 163346273 bytes
2025-04-09 23:49:57 - Processing chunk starting at row 30300, ending at 

{"timestamp": "2025-04-09 23:57:43,562", "level": "ERROR", "message": "Received None response from model", "article_text": "I pulled the curtains closed, dimming out the moonlight and shrouding my room in darkness. I could barely make out the toy I had already placed onto my bed, but I knew it was there. I had put it there that morning in anticipation\u2026 I had spent all day teasing myself, not physically, but mentally.\n\nI had read erotica on my commute to work, my eyes gliding over words of passion, spanking and hard fucking, hidden behind the innocent plain cover of my e-reader. The man next to me having no clue that I was clenching my already wet pussy.\n\nOn my lunch break I read more, sucking Sir\u2019s cock, being watched and begging to cum. I crossed and uncrossed my legs a few times, my eyes flickering if the slight movement managed to rub my clit. I wanted to hide in the bathroom and touch myself, I keep a near silent bullet vibe in my bag for such occasions\u2026 but I di

2025-04-09 23:58:06 - Written rows 31900 to 32000 to ./data/medium.article.topic.csv
Current file size: 171943834 bytes
2025-04-09 23:58:06 - Processing chunk starting at row 32000, ending at row 32100
2025-04-09 23:58:34 - Written rows 32000 to 32100 to ./data/medium.article.topic.csv
Current file size: 172496183 bytes
2025-04-09 23:58:34 - Processing chunk starting at row 32100, ending at row 32200
2025-04-09 23:59:02 - Written rows 32100 to 32200 to ./data/medium.article.topic.csv
Current file size: 172958089 bytes
2025-04-09 23:59:02 - Processing chunk starting at row 32200, ending at row 32300
2025-04-09 23:59:30 - Written rows 32200 to 32300 to ./data/medium.article.topic.csv
Current file size: 173423195 bytes
2025-04-09 23:59:30 - Processing chunk starting at row 32300, ending at row 32400
2025-04-09 23:59:58 - Written rows 32300 to 32400 to ./data/medium.article.topic.csv
Current file size: 173905959 bytes
2025-04-09 23:59:58 - Processing chunk starting at row 32400, ending at 

{"timestamp": "2025-04-10 00:04:24,507", "level": "ERROR", "message": "Received None response from model", "article_text": "Why Should Women Feel Guilty About Not Being in the Mood?\n\nAnd why do males act microaggressively to females who rebuff their advances?\n\nAlright, anybody who knows me can vouch for this \u2014 I am pretty much DTF most of the time.\n\nBut like anybody, there are nights when I\u2019m tired or having PMS and the idea of you fucking me, dude, is just anathema to my personal well-being.\n\nNothing personal.\n\nYes, I know you are super sexy. Yes, I know you do a million push-ups a day and you are totally cut and your guns are enormous and your six-pack abs are up to seven last time I counted. It\u2019s not you, it\u2019s me.\n\nBut since you are giving me that face and you look like you\u2019re very disappointed, and you\u2019re kinda waiting for me to give you the mercy jerk-off, I would like to investigate this issue, if I may.\n\nWhy exactly am I the bad guy he

2025-04-10 00:04:25 - Written rows 33200 to 33300 to ./data/medium.article.topic.csv
Current file size: 178670971 bytes
2025-04-10 00:04:25 - Processing chunk starting at row 33300, ending at row 33400
2025-04-10 00:04:55 - Written rows 33300 to 33400 to ./data/medium.article.topic.csv
Current file size: 179230168 bytes
2025-04-10 00:04:55 - Processing chunk starting at row 33400, ending at row 33500
2025-04-10 00:05:23 - Written rows 33400 to 33500 to ./data/medium.article.topic.csv
Current file size: 179674607 bytes
2025-04-10 00:05:23 - Processing chunk starting at row 33500, ending at row 33600
2025-04-10 00:05:52 - Written rows 33500 to 33600 to ./data/medium.article.topic.csv
Current file size: 180110953 bytes
2025-04-10 00:05:52 - Processing chunk starting at row 33600, ending at row 33700
2025-04-10 00:06:21 - Written rows 33600 to 33700 to ./data/medium.article.topic.csv
Current file size: 180638368 bytes
2025-04-10 00:06:21 - Processing chunk starting at row 33700, ending at 



2025-04-10 00:17:12 - Written rows 35700 to 35800 to ./data/medium.article.topic.csv
Current file size: 191974692 bytes
2025-04-10 00:17:12 - Processing chunk starting at row 35800, ending at row 35900
2025-04-10 00:17:42 - Written rows 35800 to 35900 to ./data/medium.article.topic.csv
Current file size: 192460257 bytes
2025-04-10 00:17:42 - Processing chunk starting at row 35900, ending at row 36000
2025-04-10 00:18:11 - Written rows 35900 to 36000 to ./data/medium.article.topic.csv
Current file size: 193057728 bytes
2025-04-10 00:18:11 - Processing chunk starting at row 36000, ending at row 36100
2025-04-10 00:18:41 - Written rows 36000 to 36100 to ./data/medium.article.topic.csv
Current file size: 193524317 bytes
2025-04-10 00:18:41 - Processing chunk starting at row 36100, ending at row 36200


{"timestamp": "2025-04-10 00:18:57,307", "level": "ERROR", "message": "Received None response from model", "article_text": "Game of Thrones has never been a stranger to controversy, especially in regards to how it portrays women. From its very first episode, the series has depicted an onslaught of violence, sadism, and rape, often directed at female characters.\n\nTo be fair, much of this is runoff from the fact that the show is set in a brutally paternalistic world inspired by very real periods of human history. However, the repeated use of female objectification\u2014 whether as props for exposition, titillating scene enhancement, or objects of assault \u2014 grated on many critics and viewers over time as gratuitous, even demeaning. The story\u2019s depiction of a nonchalant attitude towards misogyny could sometimes be read as the series itself taking a nonchalant attitude, and admittedly that confusion could be understandable.\n\nThis tension reached a cultural nadir in the middle 

2025-04-10 00:19:09 - Written rows 36100 to 36200 to ./data/medium.article.topic.csv
Current file size: 194052068 bytes
2025-04-10 00:19:09 - Processing chunk starting at row 36200, ending at row 36300
2025-04-10 00:19:37 - Written rows 36200 to 36300 to ./data/medium.article.topic.csv
Current file size: 194600650 bytes
2025-04-10 00:19:37 - Processing chunk starting at row 36300, ending at row 36400
2025-04-10 00:20:07 - Written rows 36300 to 36400 to ./data/medium.article.topic.csv
Current file size: 195119615 bytes
2025-04-10 00:20:07 - Processing chunk starting at row 36400, ending at row 36500
2025-04-10 00:20:36 - Written rows 36400 to 36500 to ./data/medium.article.topic.csv
Current file size: 195640205 bytes
2025-04-10 00:20:36 - Processing chunk starting at row 36500, ending at row 36600
2025-04-10 00:21:05 - Written rows 36500 to 36600 to ./data/medium.article.topic.csv
Current file size: 196211488 bytes
2025-04-10 00:21:05 - Processing chunk starting at row 36600, ending at 

{"timestamp": "2025-04-10 00:26:36,651", "level": "ERROR", "message": "Received None response from model", "article_text": "1. The Scarborough Rapes (1987\u20131990)\n\nPaul Bernardo was a suave, intelligent, good-looking man who was well-aware of his popularity and boyish charm. Paul found it easy to converse with and get into relationships with gorgeous women. There was just one problem.\n\nHe found simple, sweet relationships way too boring. Conventional, \u201cvanilla\u201d intercourse just didn't do it for him. In order for him to find pleasure, he needed his girlfriends to do things they didn't like. He wanted to hurt, demean, and humiliate them. Their fear was the only thing that truly excited him.\n\nPaul soon realized that it was too risky to continue having his violent outbursts with his girlfriends. The only way he could satiate his need for control was by violating strangers \u2014 unfortunate women who would never know his identity.\n\nIn May 1987, Paul began sexually assa

2025-04-10 00:27:00 - Written rows 37700 to 37800 to ./data/medium.article.topic.csv
Current file size: 202832537 bytes
2025-04-10 00:27:00 - Processing chunk starting at row 37800, ending at row 37900
2025-04-10 00:27:30 - Written rows 37800 to 37900 to ./data/medium.article.topic.csv
Current file size: 203338243 bytes
2025-04-10 00:27:30 - Processing chunk starting at row 37900, ending at row 38000
2025-04-10 00:27:59 - Written rows 37900 to 38000 to ./data/medium.article.topic.csv
Current file size: 203841070 bytes
2025-04-10 00:27:59 - Processing chunk starting at row 38000, ending at row 38100
2025-04-10 00:28:28 - Written rows 38000 to 38100 to ./data/medium.article.topic.csv
Current file size: 204355267 bytes
2025-04-10 00:28:28 - Processing chunk starting at row 38100, ending at row 38200
2025-04-10 00:28:58 - Written rows 38100 to 38200 to ./data/medium.article.topic.csv
Current file size: 204859695 bytes
2025-04-10 00:28:58 - Processing chunk starting at row 38200, ending at 

{"timestamp": "2025-04-10 02:27:39,712", "level": "ERROR", "message": "Received None response from model", "article_text": "The Tragic Story of Rosalynn McGinnis, Who Was Kidnapped, Raped and Held Captive by Her Stepfather\n\nShe was kidnapped by her stepfather as a child in Oklahoma, held captive for 19 years until she had 9 children\n\nRosalynn McGinnis\u2019s kidnapping, which took place from 1997 to 2016, is sure to go down in history as one of the most notorious kidnapping cases \u2014 because she managed to escape.\n\nRosalynn McGinnis was kidnapped at the age of 12 by her stepfather Henri Piette. She has also been giving birth to their nine children for nearly two decades, and McGinnis managed to escape one day. It was almost a year before the authorities finally arrested the man she knew as the stepfather, kidnapper, and father of her children.\n\nBut thanks to a husband and wife who are observant of the oddities they see in McGinnis\u2019 life, they quickly help her escaped af

2025-04-10 02:28:07 - Written rows 62300 to 62400 to ./data/medium.article.topic.csv
Current file size: 331268197 bytes
2025-04-10 02:28:07 - Processing chunk starting at row 62400, ending at row 62500
2025-04-10 02:28:35 - Written rows 62400 to 62500 to ./data/medium.article.topic.csv
Current file size: 331880429 bytes
2025-04-10 02:28:35 - Processing chunk starting at row 62500, ending at row 62600
2025-04-10 02:29:05 - Written rows 62500 to 62600 to ./data/medium.article.topic.csv
Current file size: 332442486 bytes
2025-04-10 02:29:05 - Processing chunk starting at row 62600, ending at row 62700
2025-04-10 02:29:34 - Written rows 62600 to 62700 to ./data/medium.article.topic.csv
Current file size: 332979571 bytes
2025-04-10 02:29:34 - Processing chunk starting at row 62700, ending at row 62800
2025-04-10 02:30:02 - Written rows 62700 to 62800 to ./data/medium.article.topic.csv
Current file size: 333544714 bytes
2025-04-10 02:30:02 - Processing chunk starting at row 62800, ending at 

{"timestamp": "2025-04-10 02:32:10,906", "level": "ERROR", "message": "Received None response from model", "article_text": "Kyle Dean Murdered?\n\nThe gay-for-pay porn star died under mysterious circumstances.\n\nPhoto Kyle Dean\n\n\u201cKyle Dean Dead at 21!\u201d Queer media cried in October of 2018. Each of the stories made mention that Dean was the gay-for-pay stud of the moment. The dead porn star story plays out every day with very little thought given to it. With Dean\u2019s All-American good looks and thick cock, he was not the average gay porn star. He played up his aww-shucks charm and came across as being accessible. Things wouldn\u2019t end well for the hottie, but was it an accidental suicide as many have speculated or was it murder?\n\nMost porn stars use a stage name and Dean was no different. His birth name was Brandon Jason Chrisan. Dean liked to play sports, specifically football. He also claimed to have won fourth place in an Adult Body Training competition, though t

2025-04-10 02:32:27 - Written rows 63200 to 63300 to ./data/medium.article.topic.csv
Current file size: 336123767 bytes
2025-04-10 02:32:27 - Processing chunk starting at row 63300, ending at row 63400
2025-04-10 02:32:55 - Written rows 63300 to 63400 to ./data/medium.article.topic.csv
Current file size: 336637311 bytes
2025-04-10 02:32:55 - Processing chunk starting at row 63400, ending at row 63500
2025-04-10 02:33:25 - Written rows 63400 to 63500 to ./data/medium.article.topic.csv
Current file size: 337169238 bytes
2025-04-10 02:33:25 - Processing chunk starting at row 63500, ending at row 63600
2025-04-10 02:33:53 - Written rows 63500 to 63600 to ./data/medium.article.topic.csv
Current file size: 337748926 bytes
2025-04-10 02:33:53 - Processing chunk starting at row 63600, ending at row 63700
2025-04-10 02:34:21 - Written rows 63600 to 63700 to ./data/medium.article.topic.csv
Current file size: 338293098 bytes
2025-04-10 02:34:21 - Processing chunk starting at row 63700, ending at 

{"timestamp": "2025-04-10 03:27:51,920", "level": "ERROR", "message": "Received None response from model", "article_text": "The wilderness has a way of getting to you in the best ways. The colors of the changing sky and the mystery of everything around you. Even the little critters scattering around seems to be relaxing in their own right.\n\nMy boyfriend grabs my hand and pulls me off the trail. We giggle and climb a little up a mountainside; he turns to me with his signature \u201csexy time\u201d look. I laugh at him and think that he can not be serious.\n\nOh, and he was serious.\n\nHe gripped my hips and pulled me into him fast, and kissed me like he was not going to take no for an answer. Saying the word \u201cNo\u201d was not a thing in my mind at the moment.\n\nHis hands explored my body and slid under my clothing. My body began to heat up under his touch and the midday sun. I pushed myself into him and returned the favor. I unbuttoned his pants and got a good grip on what I wan

2025-04-10 03:27:56 - Written rows 74800 to 74900 to ./data/medium.article.topic.csv
Current file size: 392744932 bytes
2025-04-10 03:27:56 - Processing chunk starting at row 74900, ending at row 75000


{"timestamp": "2025-04-10 03:28:02,790", "level": "ERROR", "message": "Received None response from model", "article_text": "Image from Pixabay\n\nGuest Written by Nero Black. It\u2019s still Christmas eve and Ebony Scrooge, closed minded agony aunt, is expecting a visit from a ghost, as promised by her long-dead mentor. What can a spirit from the past teach her?[Part 1 available here]\n\nHer eyes weren\u2019t closed long before they were opened again by the chime of her clock. As she blinked and tried to focus, she saw it was 1 am \u2014 had she slept at all? She blinked again, noticing another shadowy apparition at the foot of her bed. This time it wasn\u2019t Marley, it was\u2026 Miranda?\n\n\u201cMiri?\u201d called out an incredulous Ebony \u201cwhy are you here?\u201d This night had started out crazy enough already and Ebony wasn\u2019t even going to question why her college roommate was in her dreams. This was a dream, right? It had to be a dream, it was so surreal and nothing mad

2025-04-10 03:28:25 - Written rows 74900 to 75000 to ./data/medium.article.topic.csv
Current file size: 393177151 bytes
2025-04-10 03:28:25 - Processing chunk starting at row 75000, ending at row 75100
2025-04-10 03:28:54 - Written rows 75000 to 75100 to ./data/medium.article.topic.csv
Current file size: 393581912 bytes
2025-04-10 03:28:54 - Processing chunk starting at row 75100, ending at row 75200
2025-04-10 03:29:22 - Written rows 75100 to 75200 to ./data/medium.article.topic.csv
Current file size: 393994966 bytes
2025-04-10 03:29:22 - Processing chunk starting at row 75200, ending at row 75300
2025-04-10 03:29:51 - Written rows 75200 to 75300 to ./data/medium.article.topic.csv
Current file size: 394558194 bytes
2025-04-10 03:29:51 - Processing chunk starting at row 75300, ending at row 75400
2025-04-10 03:30:19 - Written rows 75300 to 75400 to ./data/medium.article.topic.csv
Current file size: 394933429 bytes
2025-04-10 03:30:19 - Processing chunk starting at row 75400, ending at 

{"timestamp": "2025-04-10 03:57:24,862", "level": "ERROR", "message": "Received None response from model", "article_text": "The Monster Of The Andes\n\nen.wikipedia.org\n\nThere is a wonderful moment, a divine moment when I have my hands around a young girl\u2019s throat- Pedro Alonso Lopez\n\nThe world is yet to prove that people can be born evil, although through history we\u2019ve experienced people who might have been although it has been proven that people can grow to become monsters. Monsters are most times created by the environment they develop in, the uglier the environment the more terrifying the monster.\n\nThis is the biography of a man referred to as the monster of the Andes, a monster created by his environment.\n\nEarly Life\n\nPedro Lopez was born on October 8th, 1948 in Santa Isabel, Colombia. He was born to Medardo Ryes and Belinda Ryes. His father was killed six months before his birth in a grocery store which had been attacked by a rebellious mob, this attack occurr

2025-04-10 03:57:48 - Written rows 81000 to 81100 to ./data/medium.article.topic.csv
Current file size: 422793999 bytes
2025-04-10 03:57:48 - Processing chunk starting at row 81100, ending at row 81200
2025-04-10 03:58:16 - Written rows 81100 to 81200 to ./data/medium.article.topic.csv
Current file size: 423281341 bytes
2025-04-10 03:58:16 - Processing chunk starting at row 81200, ending at row 81300
2025-04-10 03:58:45 - Written rows 81200 to 81300 to ./data/medium.article.topic.csv
Current file size: 424122515 bytes
2025-04-10 03:58:45 - Processing chunk starting at row 81300, ending at row 81400
2025-04-10 03:59:14 - Written rows 81300 to 81400 to ./data/medium.article.topic.csv
Current file size: 424443108 bytes
2025-04-10 03:59:14 - Processing chunk starting at row 81400, ending at row 81500
2025-04-10 03:59:41 - Written rows 81400 to 81500 to ./data/medium.article.topic.csv
Current file size: 424781996 bytes
2025-04-10 03:59:41 - Processing chunk starting at row 81500, ending at 



2025-04-10 04:04:31 - Written rows 82400 to 82500 to ./data/medium.article.topic.csv
Current file size: 430094938 bytes
2025-04-10 04:04:31 - Processing chunk starting at row 82500, ending at row 82600
2025-04-10 04:05:00 - Written rows 82500 to 82600 to ./data/medium.article.topic.csv
Current file size: 430510533 bytes
2025-04-10 04:05:00 - Processing chunk starting at row 82600, ending at row 82700
2025-04-10 04:05:29 - Written rows 82600 to 82700 to ./data/medium.article.topic.csv
Current file size: 431004621 bytes
2025-04-10 04:05:29 - Processing chunk starting at row 82700, ending at row 82800
2025-04-10 04:05:58 - Written rows 82700 to 82800 to ./data/medium.article.topic.csv
Current file size: 431417154 bytes
2025-04-10 04:05:58 - Processing chunk starting at row 82800, ending at row 82900
2025-04-10 04:06:27 - Written rows 82800 to 82900 to ./data/medium.article.topic.csv
Current file size: 431876339 bytes
2025-04-10 04:06:27 - Processing chunk starting at row 82900, ending at 

{"timestamp": "2025-04-10 04:13:49,413", "level": "ERROR", "message": "An error occurred while classifying article at row 84337: 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.', 'status': 'RESOURCE_EXHAUSTED'}}", "article_text": "No article text provided", "row_index": 84337, "stack_trace": "Traceback (most recent call last):\n  File \"/var/tmp/ipykernel_10207/1077753089.py\", line 152, in classify_article_and_get_tags\n    response = client.models.generate_content(\n  File \"/opt/conda/lib/python3.10/site-packages/google/genai/models.py\", line 4959, in generate_content\n    response = self._generate_content(\n  File \"/opt/conda/lib/python3.10/site-packages/google/genai/models.py\", line 3935, in _generate_content\n    response_dict = self._api_client.request(\n  File \"/opt/conda/lib/python3.10/site-packages/google/genai/_api

2025-04-10 04:14:30 - Written rows 84300 to 84400 to ./data/medium.article.topic.csv
Current file size: 439345058 bytes
2025-04-10 04:14:30 - Processing chunk starting at row 84400, ending at row 84500
2025-04-10 04:14:59 - Written rows 84400 to 84500 to ./data/medium.article.topic.csv
Current file size: 439838582 bytes
2025-04-10 04:14:59 - Processing chunk starting at row 84500, ending at row 84600
2025-04-10 04:15:29 - Written rows 84500 to 84600 to ./data/medium.article.topic.csv
Current file size: 440319683 bytes
2025-04-10 04:15:29 - Processing chunk starting at row 84600, ending at row 84700
2025-04-10 04:15:57 - Written rows 84600 to 84700 to ./data/medium.article.topic.csv
Current file size: 440728390 bytes
2025-04-10 04:15:57 - Processing chunk starting at row 84700, ending at row 84800
2025-04-10 04:16:26 - Written rows 84700 to 84800 to ./data/medium.article.topic.csv
Current file size: 441085102 bytes
2025-04-10 04:16:26 - Processing chunk starting at row 84800, ending at 

{"timestamp": "2025-04-10 04:23:53,561", "level": "ERROR", "message": "Received None response from model", "article_text": "The WONDERFUL Feeling of Breeding a Young One\u2026\n\nThese are the stories about a man in his early 40\u2019s who is loosely based on me.\n\nFirst and foremost i must begin by saying that these stories are a work of fiction and\n\nare not based on actual events. Logan is a young black male with hepatitis, and it is his\n\nsupreme mission to infect as many \u201cyoung\u201d males as he can.\n\nTHE MISSION BEGINS\u2026\n\nLogan slowly opens his eyes and yawns as he begins to wake up. He rolls over to his other side and blinks a few times as he slowly rubs his eyes. \u201cOh what time is it?\u201d he says, as he leans up and looks at his clock on the dresser drawer in the other corner of his room. He then leans up as he begins to stretch and lets out another yawn. \u201cIts 12:18 AM\u201d he says as he slowly puts his feet on the floor and stands up awkwardly. He w

2025-04-10 04:24:13 - Written rows 86300 to 86400 to ./data/medium.article.topic.csv
Current file size: 448261155 bytes
2025-04-10 04:24:13 - Processing chunk starting at row 86400, ending at row 86500
2025-04-10 04:24:41 - Written rows 86400 to 86500 to ./data/medium.article.topic.csv
Current file size: 448604360 bytes
2025-04-10 04:24:41 - Processing chunk starting at row 86500, ending at row 86600
2025-04-10 04:25:09 - Written rows 86500 to 86600 to ./data/medium.article.topic.csv
Current file size: 449034127 bytes
2025-04-10 04:25:09 - Processing chunk starting at row 86600, ending at row 86700
2025-04-10 04:25:38 - Written rows 86600 to 86700 to ./data/medium.article.topic.csv
Current file size: 449455081 bytes
2025-04-10 04:25:38 - Processing chunk starting at row 86700, ending at row 86800
2025-04-10 04:26:07 - Written rows 86700 to 86800 to ./data/medium.article.topic.csv
Current file size: 449976886 bytes
2025-04-10 04:26:07 - Processing chunk starting at row 86800, ending at 

{"timestamp": "2025-04-10 06:00:23,078", "level": "ERROR", "message": "Received None response from model", "article_text": "Groundhog Day (1993) \u2014 Columbia Pictures\n\nUntil this past summer, I had never seen the 1993 comedy, Groundhog Day, directed by Harold Ramis, and starring Bill Murray and Andie MacDowell. I know, right? Where have I been, living under a rock?? Not exactly. I was 5 years old when the film came out, and I was living in my parents\u2019 apartment in Manhattan, where we did not have a TV, so entertainment came in the form of reading quietly, arguing less quietly around the dinner table, and the occasional outing to the Metropolitan Opera House. So, yeah. I\u2019m a fucking snob, a New Yorker, and a feminazi libtard, and I\u2019m here to ruin Groundhog Day for you \u2014 or at least, I\u2019m going to try my best.\n\n1993 was a while ago. Hair was big, greed was good, and women were a novelty in the workplace. Many of my favorite films from that time (watched at 

2025-04-10 06:00:50 - Written rows 106300 to 106400 to ./data/medium.article.topic.csv
Current file size: 540927962 bytes
2025-04-10 06:00:50 - Processing chunk starting at row 106400, ending at row 106500
2025-04-10 06:01:20 - Written rows 106400 to 106500 to ./data/medium.article.topic.csv
Current file size: 541271993 bytes
2025-04-10 06:01:20 - Processing chunk starting at row 106500, ending at row 106600
2025-04-10 06:01:51 - Written rows 106500 to 106600 to ./data/medium.article.topic.csv
Current file size: 541993429 bytes
2025-04-10 06:01:51 - Processing chunk starting at row 106600, ending at row 106700
2025-04-10 06:02:23 - Written rows 106600 to 106700 to ./data/medium.article.topic.csv
Current file size: 542684191 bytes
2025-04-10 06:02:23 - Processing chunk starting at row 106700, ending at row 106800
2025-04-10 06:02:51 - Written rows 106700 to 106800 to ./data/medium.article.topic.csv
Current file size: 542877382 bytes
2025-04-10 06:02:51 - Processing chunk starting at row

{"timestamp": "2025-04-10 06:26:26,951", "level": "ERROR", "message": "Received None response from model", "article_text": "AN OBSERVATION ON THE DIFFERENT COLOURS OF REALITY.\n\n\n\nWritten by Festus Obehi Destiny\n\n\n\nBeing poor will make you so angry. Your emotions so taut. Your thoughts overwhelm you until you feel claustrophobic. Your freedom is not yours to give. Your lips are locked, emotions pocketed, until you explode in different drabs of depression. Depression keeps you angry and anger is the spice of poverty. Although many of us have mastered the art of introducing humor to our situation to make an euphemism of the hyperbole that anguishes our everyday thoughts. How gruesome it is to wake up everyday to a recurring state of disappointment, go out, see the life you are living, the life you are trying to avoid and the dreams you are struggling to sail in. in secondary school, we are indoctrinated to master the arms of government and the fallacy of the fundamental human righ

2025-04-10 06:26:30 - Written rows 111500 to 111600 to ./data/medium.article.topic.csv
Current file size: 566181346 bytes
2025-04-10 06:26:30 - Processing chunk starting at row 111600, ending at row 111700
2025-04-10 06:27:00 - Written rows 111600 to 111700 to ./data/medium.article.topic.csv
Current file size: 566729137 bytes
2025-04-10 06:27:00 - Processing chunk starting at row 111700, ending at row 111800
2025-04-10 06:27:28 - Written rows 111700 to 111800 to ./data/medium.article.topic.csv
Current file size: 567110216 bytes
2025-04-10 06:27:28 - Processing chunk starting at row 111800, ending at row 111900
2025-04-10 06:27:59 - Written rows 111800 to 111900 to ./data/medium.article.topic.csv
Current file size: 567788180 bytes
2025-04-10 06:27:59 - Processing chunk starting at row 111900, ending at row 112000
2025-04-10 06:28:30 - Written rows 111900 to 112000 to ./data/medium.article.topic.csv
Current file size: 568292578 bytes
2025-04-10 06:28:30 - Processing chunk starting at row

{"timestamp": "2025-04-10 08:19:12,865", "level": "ERROR", "message": "Received None response from model", "article_text": "By Angela Orlando Cameli\n\n\u201cWhat are all those pills you\u2019re taking, honey?\u201d Dad\u2019s voice booms.\n\nFor a second, when I hear the H sound, I think he\u2019s going to call me Heidi, my chosen cult name. The wine bottle shakes in my hand. I haven\u2019t been called Heidi since I left the Children of God almost two decades ago.\n\n\u201cShh,\u201d I say, \u201cJohn\u2019s sleeping upstairs with the girls.\u201d I live with my husband and two daughters who are four and six years old, in a spacious townhouse downtown, no longer crammed into the commune\u2019s bedrooms which were flanked with bunk beds and sleeping bags.\n\nDad and I are in my kitchen in Chicago, cleaning up after a family party. In his sixties, he\u2019s wider and grayer. In my thirties, so am I. Dad lives in Spain with my mom and two younger sisters, but he visits Chicago often thes

2025-04-10 08:19:25 - Written rows 133900 to 134000 to ./data/medium.article.topic.csv
Current file size: 671111656 bytes
2025-04-10 08:19:25 - Processing chunk starting at row 134000, ending at row 134100
2025-04-10 08:19:56 - Written rows 134000 to 134100 to ./data/medium.article.topic.csv
Current file size: 671491741 bytes
2025-04-10 08:19:56 - Processing chunk starting at row 134100, ending at row 134200
2025-04-10 08:20:26 - Written rows 134100 to 134200 to ./data/medium.article.topic.csv
Current file size: 671707751 bytes
2025-04-10 08:20:26 - Processing chunk starting at row 134200, ending at row 134300
2025-04-10 08:20:55 - Written rows 134200 to 134300 to ./data/medium.article.topic.csv
Current file size: 671980763 bytes
2025-04-10 08:20:55 - Processing chunk starting at row 134300, ending at row 134400
2025-04-10 08:21:25 - Written rows 134300 to 134400 to ./data/medium.article.topic.csv
Current file size: 672481031 bytes
2025-04-10 08:21:25 - Processing chunk starting at row

{"timestamp": "2025-04-10 09:11:24,941", "level": "ERROR", "message": "Received None response from model", "article_text": "Top 5 Tourist Tips\n\nAs a tourist, traveling through Thailand is infinitely better when a Thai girl accompanies you. While many men wisely choose to book ahead and line up many Thai women using online dating, other guys land and then start wondering what to do next.\n\nIs that bewildered guy now you?\n\nHow do you go about quickly find Thai women in Thailand as a newly landed tourist?\n\nMeeting Thai girls when traveling in Thailand is not easy. Opportunities dramatically increase when a traveler uses proven strategies. Our research shows using business cards, asking direct questions, and giving Thai women compliments will provide you with a 260% uplift in meeting opportunities.\n\nWe have got you covered though.\n\nAll of these techniques will work for new tourists. You\u2019ll be entering Thailand through a major airport and a capital city.\n\nLarge cities are 

2025-04-10 09:11:44 - Written rows 144200 to 144300 to ./data/medium.article.topic.csv
Current file size: 723369416 bytes
2025-04-10 09:11:44 - Processing chunk starting at row 144300, ending at row 144400
2025-04-10 09:12:15 - Written rows 144300 to 144400 to ./data/medium.article.topic.csv
Current file size: 723894413 bytes
2025-04-10 09:12:15 - Processing chunk starting at row 144400, ending at row 144500
2025-04-10 09:12:46 - Written rows 144400 to 144500 to ./data/medium.article.topic.csv
Current file size: 724402605 bytes
2025-04-10 09:12:46 - Processing chunk starting at row 144500, ending at row 144600
2025-04-10 09:13:18 - Written rows 144500 to 144600 to ./data/medium.article.topic.csv
Current file size: 724868127 bytes
2025-04-10 09:13:18 - Processing chunk starting at row 144600, ending at row 144700
2025-04-10 09:13:49 - Written rows 144600 to 144700 to ./data/medium.article.topic.csv
Current file size: 725320661 bytes
2025-04-10 09:13:49 - Processing chunk starting at row

{"timestamp": "2025-04-10 12:30:12,916", "level": "ERROR", "message": "Received None response from model", "article_text": "I became aware of the human trafficking for the sex exploitation issue in Italy for the first time in 2015 when I moved to Bologna.\n\nI rented an apartment close to Porta San Vitale with a balcony that looked over an alley where three Nigerian women stayed to meet their clients every night. On the other side of the main road, two Romanian women waited for clients to stop by with their car.\n\nThey were from 19 up to 40 years old. These women are a constant nightly presence on the ring road that surrounds the historic center of Bologna. They are like shadows that appear at dusk and disappear at dawn.\n\nI owe their acquaintance to my dog\u2019s friendliness. She was a rescued dog. I had adopted her from an animal shelter and found out immediately after she was hopelessly sick. Yet, she outlived what the veterinarians initially predicted, and for sure, it was thank

2025-04-10 12:30:17 - Written rows 182200 to 182300 to ./data/medium.article.topic.csv
Current file size: 944345414 bytes
2025-04-10 12:30:17 - Processing chunk starting at row 182300, ending at row 182400
2025-04-10 12:30:49 - Written rows 182300 to 182400 to ./data/medium.article.topic.csv
Current file size: 944963598 bytes
2025-04-10 12:30:49 - Processing chunk starting at row 182400, ending at row 182500
2025-04-10 12:31:20 - Written rows 182400 to 182500 to ./data/medium.article.topic.csv
Current file size: 945522006 bytes
2025-04-10 12:31:20 - Processing chunk starting at row 182500, ending at row 182600
2025-04-10 12:31:58 - Written rows 182500 to 182600 to ./data/medium.article.topic.csv
Current file size: 945996396 bytes
2025-04-10 12:31:58 - Processing chunk starting at row 182600, ending at row 182700
2025-04-10 12:32:31 - Written rows 182600 to 182700 to ./data/medium.article.topic.csv
Current file size: 946452656 bytes
2025-04-10 12:32:31 - Processing chunk starting at row
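
The log above shows two failure modes: `Received None response from model` and a `429 RESOURCE_EXHAUSTED` error at row 84337. A rate-limit error is usually transient, so one way to handle it is to retry with exponential backoff. The sketch below is a minimal illustration, not the notebook's actual classification code: `call_with_retry` and its parameters (`fn`, `max_attempts`, `base_delay`, `sleep`) are hypothetical names, and `fn` stands in for whatever function wraps the model call.

```python
import random
import time


def call_with_retry(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it returns a non-None value or attempts run out.

    Hypothetical helper: `fn` is assumed to wrap a single model request.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            result = fn()
            if result is not None:
                return result
        except Exception as exc:
            # Assumption: real code would catch the client's specific
            # rate-limit exception rather than every Exception.
            last_error = exc
        if attempt < max_attempts - 1:
            # Exponential backoff (1s, 2s, 4s, ...) plus a little jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    if last_error is not None:
        raise last_error
    return None  # every attempt returned None


# Usage with a stand-in for the model call that succeeds on the third try
attempts = {"n": 0}

def flaky_classifier():
    attempts["n"] += 1
    return "science" if attempts["n"] >= 3 else None

print(call_with_retry(flaky_classifier, sleep=lambda seconds: None))
```

Retrying mainly helps with the 429 case. A `None` response may be deterministic for a given article (for example, if the model declines to answer), in which case the row would still need a fallback label or be excluded later.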

In [2]:
# check the new dataframe contents (reloaded from disk, since `new_df` is lost on kernel restart)
import pandas as pd
new_df = pd.read_csv('./data/medium.article.topic.csv')
new_df.head()

In [1]:
# Check the contents of the directory after writing
!ls -lh ./data/

total 1.9G
-rw-r--r-- 1 jupyter jupyter 949M Apr 10 09:24 medium.article.topic.csv
-rw-r--r-- 1 jupyter jupyter 995M Apr  9 12:11 medium_articles.csv


#### Next, train BERT (a much less expensive model than the LLM used for labeling above) to perform the topic identification.

##### set up science kit

In [76]:
!pip install torch

Collecting torch
  Downloading torch-2.6.0-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)


In [4]:
from google.cloud import aiplatform, bigquery
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

##### load new dataframe

In [5]:
bert_df = pd.read_csv('./data/medium.article.topic.csv')

##### check dataframe format

In [6]:
# Display first few rows
bert_df.head()

Unnamed: 0,text,topic
0,Photo by Josh Riemer on Unsplash\n\nMerry Chri...,personal development
1,Your Brain On Coronavirus\n\nA guide to the cu...,medicine
2,Mind Your Nose\n\nHow smell training can chang...,science
3,Passionate about the synergy between science a...,technology
4,"You’ve heard of him, haven’t you? Phineas Gage...",science
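
Because some rows in the labeling run logged `Received None response from model`, it may be worth checking the `topic` column for missing values and looking at the class distribution before training. The sketch below uses a toy dataframe; `summarize_labels` is a hypothetical helper, and whether failed rows were written with an empty topic is an assumption that the real CSV would need to confirm.

```python
import pandas as pd


def summarize_labels(df, label_col="topic"):
    """Return (number of rows with no label, per-label row counts)."""
    missing = int(df[label_col].isna().sum())
    counts = df[label_col].value_counts()  # NaN rows are excluded by default
    return missing, counts


# Toy stand-in for bert_df
toy_df = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d"],
        "topic": ["science", None, "science", "medicine"],
    }
)

missing, counts = summarize_labels(toy_df)
print(f"Rows without a topic: {missing}")
print(counts)
```

Rows with a missing topic could then be dropped with `df.dropna(subset=["topic"])` before the split.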


##### split out new training, validation, and test datasets

In [7]:
from sklearn.model_selection import train_test_split

# Split into training and temp (validation + test) set (70% train, 30% temp)
train, temp = train_test_split(bert_df, test_size=0.3, random_state=42)

# Split the temp set evenly into validation and test sets (15% of the full dataset each)
validate, test = train_test_split(temp, test_size=0.5, random_state=42)

# Print counts of rows in each set
print(f"Training Set Size: {train.shape[0]}")
print(f"Validation Set Size: {validate.shape[0]}")
print(f"Test Set Size: {test.shape[0]}")

Training Set Size: 134657
Validation Set Size: 28855
Test Set Size: 28856
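
The split above is purely random, so rare topics may end up unevenly represented across the three sets. A stratified variant is one option; the sketch below shows it on a toy dataframe using the `stratify` argument of `train_test_split`. The helper name `stratified_split` and the toy data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df, label_col="topic", seed=42):
    """70/15/15 split that keeps label proportions roughly equal in each set."""
    train, temp = train_test_split(
        df, test_size=0.3, random_state=seed, stratify=df[label_col]
    )
    validate, test = train_test_split(
        temp, test_size=0.5, random_state=seed, stratify=temp[label_col]
    )
    return train, validate, test


# Toy data: 80 "science" rows and 20 "medicine" rows
toy_df = pd.DataFrame(
    {
        "text": [f"article {i}" for i in range(100)],
        "topic": ["science"] * 80 + ["medicine"] * 20,
    }
)

toy_train, toy_validate, toy_test = stratified_split(toy_df)
print(len(toy_train), len(toy_validate), len(toy_test))
print(toy_train["topic"].value_counts(normalize=True))
```

Note that stratification raises an error if any label has only a single row, so very rare topics would need to be merged or removed first.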


##### set up training prereqs

In [80]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.51.2-py3-none-any.whl.metadata (38 kB)
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers)
  Downloading huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Downloading transformers-4.51.2-py3-none-any.whl (10.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.4/10.4 MB 79.0 MB/s eta 0:00:00
Downloading huggingface_hub-0.30.2-py3-none-any.whl (481 kB)
Downloading safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (471 kB)
Installing collected packages: safetensors, huggingface-hub, transformers
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.17.2
    Uninstalling huggingface-hub-0.17.2:
      Successfully uninstalled huggingface-hub-0.17.2
ERROR: pi

##### Create and train model

In [16]:
import datetime
import shutil

import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from google.cloud import aiplatform
from official.nlp import optimization  # to create AdamW optimizer

tf.get_logger().setLevel("ERROR")

In [17]:
bert_df.head()

Unnamed: 0,text,topic
0,Photo by Josh Riemer on Unsplash\n\nMerry Chri...,personal development
1,Your Brain On Coronavirus\n\nA guide to the cu...,medicine
2,Mind Your Nose\n\nHow smell training can chang...,science
3,Passionate about the synergy between science a...,technology
4,"You’ve heard of him, haven’t you? Phineas Gage...",science


In [20]:
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

BATCH_SIZE = 32

# Fit label encoder on all labels
label_encoder = LabelEncoder()
label_encoder.fit(pd.concat([train["topic"], validate["topic"], test["topic"]]))

# Encode the labels
train_labels = label_encoder.transform(train["topic"])
val_labels   = label_encoder.transform(validate["topic"])
test_labels  = label_encoder.transform(test["topic"])

# Convert DataFrames to tf.data.Dataset
raw_train_ds = tf.data.Dataset.from_tensor_slices((train["text"].values, train_labels))
val_ds       = tf.data.Dataset.from_tensor_slices((validate["text"].values, val_labels))
test_ds      = tf.data.Dataset.from_tensor_slices((test["text"].values, test_labels))

# Shuffle the training set, then batch and prefetch all three splits
raw_train_ds = raw_train_ds.shuffle(buffer_size=len(train)).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
val_ds       = val_ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_ds      = test_ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

# Print dataset sizes (samples, not batches)
print(f"Train samples: {train.shape[0]}")
print(f"Validation samples: {validate.shape[0]}")
print(f"Test samples: {test.shape[0]}")

Train samples: 134657
Validation samples: 28855
Test samples: 28856
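
For reference, the integer ids produced by `LabelEncoder` follow the sorted order of the unique labels, and the same ordering is what maps a predicted id back to a topic name. The sketch below mimics that mapping in plain Python on a handful of topics taken from `bert_df.head()`; the names `classes`, `to_id`, `encoded`, and `decoded` are illustrative only.

```python
# A few topic labels as they appear in bert_df.head()
topics = ["personal development", "medicine", "science", "technology", "science"]

# Sorted unique labels play the role of label_encoder.classes_
classes = sorted(set(topics))
to_id = {name: idx for idx, name in enumerate(classes)}

# Equivalent of label_encoder.transform(...)
encoded = [to_id[topic] for topic in topics]

# Equivalent of label_encoder.inverse_transform(...)
decoded = [classes[idx] for idx in encoded]

print(classes)
print(encoded)
```

With the real encoder, `label_encoder.inverse_transform(predicted_ids)` performs the same lookup for model predictions.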


##### Check splits

In [24]:
class_names = label_encoder.classes_

for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(3):
        print(f"Text  : {text_batch.numpy()[i].decode('utf-8')}")
        label = label_batch.numpy()[i]
        print(f"Label : {label} ({class_names[label]})")
        print("---")

Text  : PlayOjo is a great site for new players in the UK. It is licensed in Malta and the UK and has the same selection of games as its online counterpart. Players in the UK will also find that the casino is very user-friendly. The interface is colorful and easy to use. You can also access a number of exclusive competitions and guides to games at PlayOjo. The gaming industry is booming in Malta, and the casino is doing very well in the country.

You can contact the support team via live chat or email. Live chat is usually quick, but you should check the FAQ first to see if there is an answer to your question. You can also check their blog or FAQ section for any related questions. Once you have registered with the casino, you can begin playing. The site will even give you a free trial of the games. Besides that, the customer service team is very friendly and will be happy to assist you.

The support team at Play OJO login UK is friendly and will answer any questions you have about the 

##### Choose initial bert model

In [25]:
# defining the URL of the BERT-base encoder to use (uncased, 12 layers, hidden size 768, 12 attention heads)
tfhub_handle_encoder = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

# defining the corresponding preprocessing model for the BERT model above
tfhub_handle_preprocess = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"

print(f"BERT model selected           : {tfhub_handle_encoder}")
print(f"Preprocess model auto-selected: {tfhub_handle_preprocess}")

BERT model selected           : https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4
Preprocess model auto-selected: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3


##### Load it into Keras

In [26]:
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)

##### Initial preprocessor validation

In [27]:
text_test = ["this is such an amazing movie!"]
text_preprocessed = bert_preprocess_model(text_test)

# Inspect the keys in the preprocessed dictionary
print(f"Keys       : {list(text_preprocessed.keys())}")

# 1. input_word_ids holds the vocabulary ids of the tokens, padded with 0 up to the sequence length
print(f'Shape      : {text_preprocessed["input_word_ids"].shape}')
print(f'Word Ids   : {text_preprocessed["input_word_ids"][0, :12]}')

# 2. input_mask marks real tokens with 1 and padding with 0 (it is not the masked-language-model mask)
print(f'Input Mask : {text_preprocessed["input_mask"][0, :12]}')

# 3. input_type_ids is the segment id of each token (all 0 for a single-sentence input)
print(f'Type Ids   : {text_preprocessed["input_type_ids"][0, :12]}')

Keys       : ['input_type_ids', 'input_word_ids', 'input_mask']
Shape      : (1, 128)
Word Ids   : [ 101 2023 2003 2107 2019 6429 3185  999  102    0    0    0]
Input Mask : [1 1 1 1 1 1 1 1 1 0 0 0]
Type Ids   : [0 0 0 0 0 0 0 0 0 0 0 0]
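
To make the relationship between the three tensors concrete, the sketch below rebuilds them in plain Python from the nine token ids printed above. `pad_inputs`, `SEQ_LENGTH`, and `PAD_ID` are hypothetical names for illustration; the truncation here is naive and does not reproduce how the TF Hub preprocessor treats special tokens on long inputs.

```python
SEQ_LENGTH = 128  # matches the (1, 128) shape printed above
PAD_ID = 0        # padding id seen at the end of the Word Ids output


def pad_inputs(token_ids, seq_length=SEQ_LENGTH):
    """Pad one tokenized sentence and build the matching mask and type ids."""
    ids = list(token_ids[:seq_length])  # naive truncation, illustration only
    n_pad = seq_length - len(ids)
    word_ids = ids + [PAD_ID] * n_pad
    input_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    type_ids = [0] * seq_length                # single segment, so all zeros
    return word_ids, input_mask, type_ids


# Token ids for "this is such an amazing movie!" as printed above
token_ids = [101, 2023, 2003, 2107, 2019, 6429, 3185, 999, 102]
word_ids, input_mask, type_ids = pad_inputs(token_ids)

print(word_ids[:12])
print(input_mask[:12])
print(type_ids[:12])
```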


#### Optimize frame size

In [30]:
from transformers import BertTokenizer
import numpy as np

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def get_token_lengths(text_batch):
    # text_batch is an array of UTF-8 byte strings; return the WordPiece token count of each
    return [len(tokenizer.tokenize(text.decode('utf-8'))) for text in text_batch]

# Initialize the list to collect token lengths
token_lengths = []

# Sample a single batch (BATCH_SIZE articles) from raw_train_ds
for text_batch, label_batch in raw_train_ds.take(1):
    # Convert the string tensor to NumPy so each element is a bytes object with .decode()
    token_lengths.extend(get_token_lengths(text_batch.numpy()))

# Analyze the token lengths
print("Token count stats:")
print(f"Mean      : {np.mean(token_lengths):.1f}")
print(f"Median    : {np.median(token_lengths):.1f}")
print(f"95th perc : {np.percentile(token_lengths, 95):.0f}")
print(f"Max       : {np.max(token_lengths)}")

# Print dataset sizes
print(f"Train samples: {train.shape[0]}")
print(f"Validation samples: {validate.shape[0]}")
print(f"Test samples: {test.shape[0]}")

# Sanity check by printing a few text examples from the batch
class_names = label_encoder.classes_

for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(3):
        print(f"Text  : {text_batch.numpy()[i].decode('utf-8')}")
        label = label_batch.numpy()[i]
        print(f"Label : {label} ({class_names[label]})")
        print("---")

RuntimeError: Only a single TORCH_LIBRARY can be used to register the namespace prims; please put all of your definitions in a single TORCH_LIBRARY block.  If you were trying to specify implementations, consider using TORCH_LIBRARY_IMPL (which can be duplicated).  If you really intended to define operators for a single namespace in a distributed way, you can use TORCH_LIBRARY_FRAGMENT to explicitly indicate this.  Previous registration of TORCH_LIBRARY was registered at /dev/null:241; latest registration was registered at /dev/null:241
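
Once token counts are available, one way to turn them into a sequence length is to take a high percentile and cap it at the 512-position limit of BERT-base. The sketch below is a minimal version of that idea using only the standard library; `percentile`, `pick_seq_length`, and the `sample_lengths` values are hypothetical and are not measurements from this dataset.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def pick_seq_length(token_lengths, pct=95, cap=512, special_tokens=2):
    """Choose a sequence length covering `pct` percent of articles, capped at `cap`."""
    # Leave room for the [CLS] and [SEP] tokens added around each input
    target = percentile(token_lengths, pct) + special_tokens
    return min(cap, target)


# Hypothetical token counts, for illustration only
sample_lengths = [120, 340, 95, 800, 410, 2300, 150, 610, 275, 1900]

print(pick_seq_length(sample_lengths))
print(pick_seq_length(sample_lengths, pct=50))
```

For long-form articles the cap is likely to be reached, in which case the choice becomes a trade-off between truncating the text and the memory cost of longer sequences.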

##### Step 5: Main function to run the workflow

##### Kubeflow model with parallel training, comparison testing, and hyperparameter tuning on Kubernetes

#### Sport and Score Extractor

input: article body  
output: identified sport, score summary

In [None]:

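# Standard library and third-party imports used in this cell
import json
from typing import Any, Dict

import pandas as pd  # the test dataframe is a pandas DataFrame
from pydantic import BaseModel, Extra

# Assumption: `client` and `MODEL` are defined in an earlier cell using the google-genai SDK
from google.genai.types import GenerateContentConfig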

# Define the Schema class
class Schema(BaseModel):
    type: str
    properties: Dict[str, Any]
    required: list

# Define the FunctionDeclarationWithExtra class
class FunctionDeclarationWithExtra(BaseModel):
    name: str
    description: str
    parameters: Schema
    # Allow extra fields (like PROMPT)
    class Config:
        extra = Extra.allow

# Define the function declaration for score extraction
score_extraction_func = FunctionDeclarationWithExtra(
    name="score_extraction",
    description="Extract the scores from the article if feasible",
    parameters=Schema(
        type="OBJECT",
        properties={
            "article_text": {  # Define the expected 'article_text' input
                "type": "STRING",
                "description": "The content of the article to extract scores."
            },
            "score_available": {
                "type": "BOOLEAN",
                "description": "Indicates if scores are available in the article."
            },
            "team1_score": {
                "type": "INTEGER",
                "description": "Score for team 1 if available."
            },
            "team2_score": {
                "type": "INTEGER",
                "description": "Score for team 2 if available."
            },
            "team1_name": {
                "type": "STRING",
                "description": "Name of team 1 if mentioned."
            },
            "team2_name": {
                "type": "STRING",
                "description": "Name of team 2 if mentioned."
            },
        },
        required=["article_text", "score_available"],
    ),
    PROMPT="""
    Given the following article, extract the score if feasible:

    {article_text}

    Provide the output in a structured JSON format:
    {{
        "score_available": true/false,
        "team1_name": "name_of_team1",
        "team1_score": score_of_team1,
        "team2_name": "name_of_team2",
        "team2_score": score_of_team2
    }}
    """
)

def extract_scores_from_article(article_text: str):
    # Dynamically set the prompt with the actual article text
    formatted_prompt = score_extraction_func.PROMPT.format(article_text=article_text)

    # Example client and content generation logic (ensure the client and model are correctly defined)
    response = client.models.generate_content(
        model=MODEL,
        contents=formatted_prompt,
        config=GenerateContentConfig(
            response_mime_type="application/json",
            response_schema={
                "type": "OBJECT",
                "properties": {
                    "score_available": {
                        "type": "BOOLEAN",
                    },
                    "team1_name": {
                        "type": "STRING",
                    },
                    "team1_score": {
                        "type": "INTEGER",
                    },
                    "team2_name": {
                        "type": "STRING",
                    },
                    "team2_score": {
                        "type": "INTEGER",
                    }
                },
                "required": ["score_available"],
            },
        ),
    )

    # Serialize the full response; the schema-constrained payload sits under "parsed"
    response_data = json.loads(response.json())
    scores_extraction_result = response_data.get("parsed") or {}

    return scores_extraction_result

# Loop through the test dataframe and extract scores
for idx, row in test.head(2000).iterrows():  # Adjust the dataframe as necessary
    article_text = row["text"]
    extracted_scores = extract_scores_from_article(article_text)
    actual_tags = row["tags"]  # Get the tags for the current row

    if extracted_scores.get("score_available"):
        print(f"Extracted Scores: {extracted_scores}\t\tActual tags: {actual_tags}")
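
Because only `score_available` is required by the response schema, the model may report a score as available while omitting a team name or score. A small normalization step can guard the downstream code against that. The sketch below is one possible approach; `normalize_score_result` and `EMPTY_RESULT` are hypothetical helpers, not part of the notebook.

```python
EMPTY_RESULT = {
    "score_available": False,
    "team1_name": None,
    "team1_score": None,
    "team2_name": None,
    "team2_score": None,
}


def normalize_score_result(payload):
    """Return a dict with all five keys; mark incomplete results as unavailable."""
    if not isinstance(payload, dict) or not payload.get("score_available"):
        return dict(EMPTY_RESULT)

    scores = (payload.get("team1_score"), payload.get("team2_score"))
    # bool is a subclass of int, so exclude it explicitly
    if not all(isinstance(s, int) and not isinstance(s, bool) for s in scores):
        return dict(EMPTY_RESULT)

    result = {key: payload.get(key) for key in EMPTY_RESULT}
    result["score_available"] = True
    return result


complete = {
    "score_available": True,
    "team1_name": "Home",
    "team1_score": 3,
    "team2_name": "Away",
    "team2_score": 1,
}
print(normalize_score_result(complete))
print(normalize_score_result({"score_available": True}))
```

In the loop above, this would wrap the call as `normalize_score_result(extract_scores_from_article(article_text))`.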

#### Sport Statistics Summarizer

input: article body  
output: statistics summary

In [None]:

# Import necessary modules from pydantic
from pydantic import BaseModel, Extra
from typing import Dict, Any
import pandas as pd  # Assuming you're using pandas for the test dataframe
import json

# Define the Schema class
class Schema(BaseModel):
    type: str
    properties: Dict[str, Any]
    required: list

# Define the FunctionDeclarationWithExtra class
class FunctionDeclarationWithExtra(BaseModel):
    name: str
    description: str
    parameters: Schema
    # Allow extra fields (like PROMPT)
    class Config:
        extra = Extra.allow

# Define the function declaration for sport mention and person summary extraction
sport_person_extraction_func = FunctionDeclarationWithExtra(
    name="sport_person_extraction",
    description="Extract mentions of sports and summarize information related to individuals mentioned in the article if feasible",
    parameters=Schema(
        type="OBJECT",
        properties={
            "article_text": {  # Define the expected 'article_text' input
                "type": "STRING",
                "description": "The content of the article to extract sport mentions and person summaries."
            },
            "sport_mentioned": {
                "type": "BOOLEAN",
                "description": "Indicates if any sports are mentioned in the article."
            },
            "person_name": {
                "type": "STRING",
                "description": "Name of the person mentioned in relation to the sport."
            },
            "summary": {
                "type": "STRING",
                "description": "Summary of what is said about the person."
            },
        },
        required=["article_text", "sport_mentioned"],
    ),
    PROMPT="""
    Given the following article, identify if any sports are mentioned. If there are mentions of sports, find any names attached to the article and provide a summary about what the article is saying about that person.

    {article_text}

    Provide the output in a structured JSON format:
    {{
        "sport_mentioned": true/false,
        "person_name": "name_of_person_mentioned",
        "summary": "summary_about_person"
    }}
    """
)

def extract_sport_person_summary_from_article(article_text: str):
    # Dynamically set the prompt with the actual article text
    formatted_prompt = sport_person_extraction_func.PROMPT.format(article_text=article_text)

    # Example client and content generation logic (ensure the client and model are correctly defined)
    response = client.models.generate_content(
        model=MODEL,  # Make sure to replace with the actual model name you are using
        contents=formatted_prompt,
        config=GenerateContentConfig(
            response_mime_type="application/json",
            response_schema={
                "type": "OBJECT",
                "properties": {
                    "sport_mentioned": {
                        "type": "BOOLEAN",
                    },
                    "person_name": {
                        "type": "STRING",
                    },
                    "summary": {
                        "type": "STRING",
                    }
                },
                "required": ["sport_mentioned"],
            },
        ),
    )

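    # Serialize the full response; the schema-constrained payload sits under "parsed"
    response_data = json.loads(response.json())
    # Fall back to an empty dict when "parsed" is missing or null
    sport_person_extraction_result = response_data.get("parsed") or {}

    return sport_person_extraction_result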

# Loop through the test dataframe and extract sport mentions and person summaries
for idx, row in test.head(2000).iterrows():
    article_text = row["text"]
    extracted_info = extract_sport_person_summary_from_article(article_text)
    actual_tags = row["tags"]  # Get the tags for the current row

    if extracted_info.get("sport_mentioned"):
        print(f"Extracted Info: {extracted_info}\t\tActual tags: {actual_tags}")
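
Looping over 2,000 articles means 2,000 separate model calls, so a transient failure partway through would otherwise abort the whole run. The sketch below shows one way to retry a call with exponential backoff; `call_with_retries` is a hypothetical helper, and catching every `Exception` is a simplification that a real run would narrow to the client's transient error types.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception:
            # Re-raise once the final attempt has failed
            if attempt == attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt
            sleep(base_delay * 2 ** attempt)
```

Inside the loop, the extractor call would become `call_with_retries(extract_sport_person_summary_from_article, article_text)`.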

### API Deployment (Input/Output)

#### Stand up Webapp endpoint

##### Dockerize instance with Python scripting and the weighted BERT model

##### Create Support Pipeline

##### Stand up Vertex Instance (the actual endpoint)

## Reference Labs

- Gemini Function Calling
- Gemini Prompt Engineering
- AutoML for Text Classification - Vertex
- KFP Walkthrough - Vertex Containerization - Training and Deployment Pipelines