# Project 3 - Article Parsing Toolset

## Team 3 Members

- Matthew Dunbar
- Jeffrei Cher
- Basil James

## Project Problem

Parse article content and extract structured data from it.

For all articles:
- _**Classify the article into a set of pre-defined categories**_
  (a prediction model)

For sports articles:
- _**Identify the sport, and extract game/match scores if present**_
- _(Provide stats summaries if present?)_

## Learning Goal

Develop experience in:

- Building and deploying models for use in real-world applications
- Custom training of LLMs to provide article analysis
- Using a tool-based (extensible toolbox) approach to provide multiple analytics features


## Dataset

- https://www.kaggle.com/datasets/fabiochiusano/medium-articles  
  Size: 190k+ articles. Multiple tags per article. Includes titles, full article text, and URLs.


#### Retrieve dataset and load into BigQuery

In [1]:
! kaggle datasets download -d fabiochiusano/medium-articles -p ./data --unzip

Dataset URL: https://www.kaggle.com/datasets/fabiochiusano/medium-articles
License(s): CC0-1.0
Downloading medium-articles.zip to ./data
100%|█████████████████████████████████████████| 369M/369M [00:03<00:00, 144MB/s]
100%|█████████████████████████████████████████| 369M/369M [00:03<00:00, 119MB/s]


#### Fulfill basic BigQuery dependencies

In [2]:
!pip install google-cloud-bigquery pandas pyarrow



In [5]:
import os

import pandas as pd
from google.cloud import bigquery

# Load dataset (change filename accordingly)
df = pd.read_csv('./data/medium_articles.csv')

# Display first few rows
df.head()

| Unnamed: 0 | title | text | url | authors | timestamp | tags |
|---|---|---|---|---|---|---|
| 0 | Mental Note Vol. 24 | Photo by Josh Riemer on Unsplash\n\nMerry Chri... | https://medium.com/invisible-illness/mental-no... | ['Ryan Fan'] | 2020-12-26 03:38:10.479000+00:00 | ['Mental Health', 'Health', 'Psychology', 'Sci... |
| 1 | Your Brain On Coronavirus | Your Brain On Coronavirus\n\nA guide to the cu... | https://medium.com/age-of-awareness/how-the-pa... | ['Simon Spichak'] | 2020-09-23 22:10:17.126000+00:00 | ['Mental Health', 'Coronavirus', 'Science', 'P... |
| 2 | Mind Your Nose | Mind Your Nose\n\nHow smell training can chang... | https://medium.com/neodotlife/mind-your-nose-f... | [] | 2020-10-10 20:17:37.132000+00:00 | ['Biotechnology', 'Neuroscience', 'Brain', 'We... |
| 3 | The 4 Purposes of Dreams | Passionate about the synergy between science a... | https://medium.com/science-for-real/the-4-purp... | ['Eshan Samaranayake'] | 2020-12-21 16:05:19.524000+00:00 | ['Health', 'Neuroscience', 'Mental Health', 'P... |
| 4 | Surviving a Rod Through the Head | You’ve heard of him, haven’t you? Phineas Gage... | https://medium.com/live-your-life-on-purpose/s... | ['Rishav Sinha'] | 2020-02-26 00:01:01.576000+00:00 | ['Brain', 'Health', 'Development', 'Psychology... |
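
In the preview above, the `authors` and `tags` values look like Python list literals stored as strings. That is an assumption based on the five rows shown, not something verified against the full file. If it holds, a minimal sketch for turning them into real lists could use `ast.literal_eval`. The helper name `parse_list_column` is hypothetical.

```python
import ast


def parse_list_column(value):
    """Turn a stringified list such as "['Health', 'Science']" into a list.

    Returns an empty list for missing or malformed values instead of raising.
    """
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


# Stand-alone check on values shaped like the preview rows.
print(parse_list_column("['Biotechnology', 'Neuroscience']"))
print(parse_list_column("[]"))

# With the DataFrame loaded above, the same helper could be applied per column:
#     df["tags"] = df["tags"].apply(parse_list_column)
#     df["authors"] = df["authors"].apply(parse_list_column)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text that comes from an external dataset.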


In [7]:
!ls -l ./data/

total 1017916
-rw-r--r-- 1 jupyter jupyter 1042340506 Apr  2 17:18 medium_articles.csv


#### Upload to BigQuery

In [6]:
import subprocess

# Set BigQuery project details
project_id = subprocess.check_output(["gcloud", "config", "get-value", "project"], text=True).strip()
dataset_id = "articles"
table_id = "medium"
destination = f"{project_id}.{dataset_id}.{table_id}"

# Initialize BigQuery client
client = bigquery.Client(project=project_id)

# Upload dataframe to BigQuery, replacing any existing table contents
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
job = client.load_table_from_dataframe(df, destination, job_config=job_config)

# Wait for the job to complete
job.result()

print(f"Active Project ID: {project_id}")
print(f"Dataset ID: {dataset_id}")
print(f"Table ID: {table_id}")
print(f"Uploaded data to {destination}")

NotFound: 404 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/qwiklabs-asl-04-b3351f84df35/jobs?uploadType=resumable: Not found: Dataset qwiklabs-asl-04-b3351f84df35:articles
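
The `NotFound` error above reports that the `articles` dataset does not exist in the project. A load job can create the destination table, but it does not create the containing dataset. One possible fix, sketched here under the assumption that the active account is allowed to create datasets, is to call `client.create_dataset(..., exists_ok=True)` before the load. The helper name `ensure_dataset` is hypothetical.

```python
def ensure_dataset(client, project_id, dataset_id):
    """Create `project_id.dataset_id` if it is missing and return its full ID.

    `client` is expected to be a `google.cloud.bigquery.Client`. With
    `exists_ok=True` the call does nothing when the dataset already exists.
    """
    full_dataset_id = f"{project_id}.{dataset_id}"
    client.create_dataset(full_dataset_id, exists_ok=True)
    return full_dataset_id


# Intended use in the upload cell, before client.load_table_from_dataframe(...):
#     ensure_dataset(client, project_id, dataset_id)
```

Because the call is idempotent, the upload cell can be re-run without first checking whether the dataset is already there.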

## Solution

### Approach

LLM-based tools, each with its own training and its own engineered prompt(s).
The design is modular, function-based, and extensible.
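
As a rough illustration of the modular, function-based idea, the sketch below keeps tools in a registry keyed by name, so adding a tool means adding one decorated function. Every name here (`TOOLS`, `register_tool`, `run_tool`, `word_count`) is hypothetical, and `word_count` is only a placeholder for an LLM-backed tool.

```python
from typing import Callable, Dict

# Registry mapping a tool name to the function that implements it.
TOOLS: Dict[str, Callable[[str], dict]] = {}


def register_tool(name: str):
    """Decorator that adds a function to the toolbox under `name`."""

    def decorator(func: Callable[[str], dict]) -> Callable[[str], dict]:
        if name in TOOLS:
            raise ValueError(f"tool already registered: {name}")
        TOOLS[name] = func
        return func

    return decorator


@register_tool("word_count")
def word_count(article_text: str) -> dict:
    """Placeholder tool that stands in for an LLM-backed tool in this sketch."""
    return {"words": len(article_text.split())}


def run_tool(name: str, article_text: str) -> dict:
    """Look up a tool by name and run it on the article text."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](article_text)
```

Each tool takes the article text and returns a dictionary, which leaves room for the per-tool output shapes described below.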


#### Input: content of the article to analyse

(Ideally, the article contents are passed in directly. An alternative is to accept a URL, but that would require additional Python supporting functions to fetch the page and clean up its contents before submission.)
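
If the URL option is pursued, the supporting functions might look roughly like this standard-library sketch: one function strips markup and drops `<script>` and `<style>` contents, and another downloads the page. The names `html_to_text` and `fetch_article_text` are hypothetical, and a real Medium page would likely need more targeted cleanup (navigation, footers, paywall text).

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one text run per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_article_text(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its visible text (no retries, no auth)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)
```

Keeping the cleanup in `html_to_text` separate from the network call makes it possible to test the cleaning step on saved HTML without hitting the site.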

#### Output: varies by tool/function

### Tools

#### Article Topic Classifier
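
This tool is not specified yet. As a starting point only, the sketch below shows one way to constrain an LLM to the pre-defined categories and to validate its reply. The category list, the prompt wording, and the function names are all assumptions, and no model call is made here.

```python
from typing import Sequence

# Hypothetical category set; the project's real list is not defined yet.
CATEGORIES = ("Sports", "Health", "Technology", "Business", "Other")


def build_classification_prompt(article_text: str,
                                categories: Sequence[str] = CATEGORIES,
                                max_chars: int = 4000) -> str:
    """Build a prompt asking for exactly one label from `categories`."""
    options = ", ".join(categories)
    return (
        "Classify the article into exactly one of these categories: "
        f"{options}.\n"
        "Reply with the category name only.\n\n"
        f"Article:\n{article_text[:max_chars]}"
    )


def parse_category(model_reply: str,
                   categories: Sequence[str] = CATEGORIES,
                   fallback: str = "Other") -> str:
    """Map a free-text model reply onto a known category, else `fallback`."""
    cleaned = model_reply.strip().strip(".\"'").lower()
    for category in categories:
        if cleaned == category.lower():
            return category
    return fallback
```

Validating the reply against the known list keeps unexpected model output from leaking into downstream results.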



#### Sport and Score Extractor
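
This tool is also unspecified so far. Since function calling is among the reference labs, one possible shape is a function declaration that the model fills in, plus a validator for the returned arguments. The declaration below is written as a plain dictionary in an OpenAPI-style schema. Its field names, the home/away framing, and the helper names are assumptions, and the exact declaration format should be checked against the SDK that ends up being used.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical function declaration in an OpenAPI-style schema.
EXTRACT_SCORE_DECLARATION = {
    "name": "record_sport_and_score",
    "description": "Record the sport an article covers and the final score, if stated.",
    "parameters": {
        "type": "object",
        "properties": {
            "sport": {"type": "string"},
            "home_team": {"type": "string"},
            "away_team": {"type": "string"},
            "home_score": {"type": "integer"},
            "away_score": {"type": "integer"},
        },
        "required": ["sport"],
    },
}


@dataclass
class SportResult:
    sport: str
    home_team: Optional[str] = None
    away_team: Optional[str] = None
    home_score: Optional[int] = None
    away_score: Optional[int] = None


def _as_int(value):
    """Coerce a score to int when safe; return None if absent or malformed."""
    try:
        return int(value) if value is not None else None
    except (TypeError, ValueError):
        return None


def parse_function_args(args: dict) -> SportResult:
    """Validate the arguments a model returned for the declaration above."""
    sport = args.get("sport")
    if not isinstance(sport, str) or not sport.strip():
        raise ValueError("model reply is missing the required 'sport' field")
    return SportResult(
        sport=sport.strip(),
        home_team=args.get("home_team"),
        away_team=args.get("away_team"),
        home_score=_as_int(args.get("home_score")),
        away_score=_as_int(args.get("away_score")),
    )
```

Scores are optional because the project statement only asks for them "if present", and sports without a home/away pairing would need a different schema.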


#### Sport Statistics Summarizer


### API Deployment (Input/Output)

Webapp endpoint
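
The request and response format is still open. One possible contract, sketched without any web framework, is a JSON body naming the tool and carrying the article content, answered with a status code and a JSON result. The field names `tool` and `content`, the `handle_request` function, and the placeholder `echo_length` tool are all hypothetical.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical tool table; real entries would call the deployed models.
TOOLS: Dict[str, Callable[[str], dict]] = {
    "echo_length": lambda text: {"characters": len(text)},
}


def handle_request(raw_body: bytes) -> Tuple[int, dict]:
    """Return (HTTP status, JSON-serializable body) for one analysis request.

    Expected request JSON: {"tool": "<name>", "content": "<article text>"}.
    """
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "body must be a JSON object"}

    tool_name = payload.get("tool")
    content = payload.get("content")
    if not isinstance(content, str) or not content.strip():
        return 400, {"error": "'content' must be a non-empty string"}
    if not isinstance(tool_name, str) or tool_name not in TOOLS:
        return 404, {"error": f"unknown tool: {tool_name}"}

    return 200, {"tool": tool_name, "result": TOOLS[tool_name](content)}
```

A framework route would then only need to pass the raw request body to `handle_request` and serialize the returned pair.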

## Reference Labs

- Gemini Function Calling
- AutoML for Text Classification - Vertex
- KFP Walkthrough - Vertex Containerization - Training and Deployment Pipelines