# Gemini News Feature Playground

Interactively load historical news articles, call the Gemini 2.5 Flash model with the structured prompt, and inspect the extracted features alongside the rolling context summary.

## Prerequisites

- Install dependencies: `pip install google-generativeai pandas ipywidgets python-dateutil`.
- Export your Gemini API key before running: `setx GOOGLE_API_KEY "<your key>"` (PowerShell) then restart the notebook kernel.
- Ensure the cleaned news JSON files exist under `data_clean/bronze/news/historical_5year`.

All requests in this notebook run sequentially to respect rate limits.

In [None]:
# Standard library imports
import json,sys
from pathlib import Path
from datetime import datetime, date, time, timezone

# Third-party imports
import pandas as pd
import ipywidgets as widgets
from IPython.display import display, Markdown

sys.path.append(r"c:\\Users\\user\\Desktop\\kz code\\fx-ml-pipeline")

from src_clean.data_pipelines.silver.LLM_news_feature_pipelines import (
    NewsLLMFeaturePipeline,
    combine_article_text,
)

In [127]:
from pathlib import Path

import os
os.environ["GOOGLE_API_KEY"] = "AIzaSyCIh_EGlbK5LI0BpcOpzRkKFka-IgOtdH0"

PROJECT_ROOT = Path.cwd().resolve()
if PROJECT_ROOT.name == "notebooks":
    PROJECT_ROOT = PROJECT_ROOT.parent



try:
    from dotenv import load_dotenv
    load_dotenv()
except Exception:
    pass

import os
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
if not GOOGLE_API_KEY:
    display(Markdown('?? **Warning**: GOOGLE_API_KEY is not set. Set it before running the LLM calls.'))

if not INPUT_DIR.exists():
    raise FileNotFoundError(f'Expected directory not found: {INPUT_DIR}')

DEFAULT_START = date(2023, 6, 1)
DEFAULT_END = date(2023, 6, 7)


In [128]:
PROJECT_ROOT

WindowsPath('C:/Users/user/Desktop/kz code/fx-ml-pipeline')

In [129]:
# Widgets to select the date window and maximum number of articles
start_picker = widgets.DatePicker(description='Start Date', value=DEFAULT_START)
end_picker = widgets.DatePicker(description='End Date', value=DEFAULT_END)
max_articles_slider = widgets.IntSlider(description='Max articles', min=1, max=20, value=5)
model_text = widgets.Text(description='Gemini model', value='gemini-2.5-flash-lite')

widgets.VBox([
    widgets.HTML('<b>Select date window and settings</b>'),
    start_picker,
    end_picker,
    max_articles_slider,
    model_text,
])

VBox(children=(HTML(value='<b>Select date window and settings</b>'), DatePicker(value=datetime.date(2023, 6, 1…

In [None]:
from typing import List, Dict, Any
from datetime import datetime
from src_clean.data_pipelines.silver.LLM_news_manifest import filter_manifest

def load_articles_between(input_dir: Path, start: date, end: date, limit: int) -> List[Dict[str, Any]]:
    # Load and filter articles by published_at timestamp.
    if start is None or end is None:
        raise ValueError('Please select both start and end dates.')

    start_dt = datetime.combine(start, time.min, tzinfo=timezone.utc)
    end_dt = datetime.combine(end, time.max, tzinfo=timezone.utc)

    manifest = Path("../data_clean/silver/news/news_manifest_cleaned_20251022_122221.parquet")
    df = filter_manifest(
        manifest,
        start=start_dt,
        end=end_dt,
        limit=None,
    )

    articles = []
    for path_str in df.file_path:
        path = PROJECT_ROOT / Path(path_str)
        with path.open("r", encoding="utf-8") as f:
            article = json.load(f)
        articles.append(article)

    df = pd.DataFrame(articles)

    # remove duplicates, keep earliest published dates
    before = len(df)
    df = df.sort_values("published_at", ascending=True)
    df = df.drop_duplicates(subset=["headline"], keep="first")
    after = len(df)
    dropped = before - after
    if dropped:
        display(Markdown(f"⚠️ **Warning**: dropped {dropped} duplicate row(s)."))

    before = len(df)
    df = df[df['body'].str.len()>0]
    after = len(df)
    dropped = before - after
    if dropped:
        display(Markdown(f"⚠️ **Warning**: dropped {dropped} row(s) that has empty body."))


    return df


In [38]:
# Load articles for the selected window
selected_articles = load_articles_between(
    INPUT_DIR,
    start_picker.value or DEFAULT_START,
    end_picker.value or DEFAULT_END,
    max_articles_slider.value,
)

display(selected_articles)
selected_articles.to_clipboard()

Unnamed: 0,article_id,headline,body,url,source,published_at,collected_at,sp500_relevant,collection_method,language,content_fetched
0,36052f0b0998bc21fbecd60e7f7395ae,"Wall Street termine dans le rouge , les craint...",Temps de lecture 6 min Temps de lecture 1 min ...,http://lifestyle.boursorama.com/bourse/actuali...,gdelt_lifestyle.boursorama.com,2023-06-01T00:30:00+00:00,2025-10-20T11:12:55.522311+00:00,True,gdelt_api,French,True
1,89cdabc5d9bdf783236c1997f1cf560d,"Wall Street Anjlok , Investor Cermati Pemungut...","Liputan6.com, nEW yORK - Bursa saham Amerika S...",https://www.liputan6.com/saham/read/5304108/wa...,gdelt_liputan6.com,2023-06-01T01:30:00+00:00,2025-10-20T11:12:55.519182+00:00,True,gdelt_api,Indonesian,True
2,f90a6b0903fb43b21eed0a4bb42cca44,Saham Asia naik di tengah pembicaraan jeda ken...,Tokyo (ANTARA) - Sebagian besar pasar saham di...,https://www.antaranews.com/berita/3566778/saha...,gdelt_antaranews.com,2023-06-01T08:45:00+00:00,2025-10-20T11:12:55.521183+00:00,True,gdelt_api,Indonesian,True
3,d16913739539125c51ccfd15ff016fe2,American Eagle Stock Has Upside Potential To I...,"ByTrefis Team, Contributor. American Eagle Out...",https://www.forbes.com/sites/greatspeculations...,gdelt_forbes.com,2023-06-01T11:45:00+00:00,2025-10-20T11:12:55.520169+00:00,True,gdelt_api,English,True
4,4ae55e66bb63e912d78fb6c7e0cf639e,Insurance Stocks Moving In Thursday Intraday S...,According to Benzinga Pro following are the ga...,https://www.benzinga.com/trading-ideas/movers/...,gdelt_benzinga.com,2023-06-01T19:00:00+00:00,2025-10-20T11:12:55.518183+00:00,True,gdelt_api,English,True
5,379faa5b51ec03f81fe6a6083480c766,12 Information Technology Stocks Moving In Thu...,This article was generated by Benzinga's autom...,https://www.benzinga.com/trading-ideas/movers/...,gdelt_benzinga.com,2023-06-01T23:30:00+00:00,2025-10-20T11:12:55.519182+00:00,True,gdelt_api,English,True
6,47733be43dc74fe11fd108836400ca2a,12 Industrials Stocks Moving In Thursday After...,This article was generated by Benzinga's autom...,https://www.benzinga.com/trading-ideas/movers/...,gdelt_benzinga.com,2023-06-01T23:30:00+00:00,2025-10-20T11:12:55.518183+00:00,True,gdelt_api,English,True
7,c1aec02727f21b5f99e40f122231903e,12 Consumer Discretionary Stocks Moving In Thu...,This article was generated by Benzinga's autom...,https://www.benzinga.com/trading-ideas/movers/...,gdelt_benzinga.com,2023-06-01T23:30:00+00:00,2025-10-20T11:12:55.518183+00:00,True,gdelt_api,English,True
8,f7cd81922e35f99e1e6744d0a2f1c55f,Wall Street Melesat Terdorong Sentimen Plafon ...,"Liputan6.com, New York - Bursa saham Amerika S...",https://www.liputan6.com/saham/read/5304759/wa...,gdelt_liputan6.com,2023-06-02T00:45:00+00:00,2025-10-20T11:13:38.490539+00:00,True,gdelt_api,Indonesian,True
9,13184fb767580c0b1bc45f7b37629ed3,Saham Asia dibuka naik dipicu kemajuan RUU uta...,Singapura (ANTARA) - Saham Asia menguat pada a...,https://www.antaranews.com/berita/3567795/saha...,gdelt_antaranews.com,2023-06-02T03:15:00+00:00,2025-10-20T11:13:38.492543+00:00,True,gdelt_api,Indonesian,True


In [None]:
# Sequentially call Gemini with rolling context
from src_clean.data_pipelines.silver.LLM_news_manifest import filter_manifest
from src_clean.data_pipelines.silver.clean_manifest import clean_manifest_df

MANIFEST_PATH = PROJECT_ROOT / 'data_clean/silver/news/news_manifest_cleaned_20251022_122221.parquet'

def run_sequential_llm_calls(start: date, end: date, limit: int, model_name: str = 'gemini-2.5-flash'):
    manifest_df = filter_manifest(
        MANIFEST_PATH,
        start=datetime.combine(start, time.min, tzinfo=timezone.utc),
        end=datetime.combine(end, time.max, tzinfo=timezone.utc),
        limit=limit,
    )

    if manifest_df.empty:
        return []

    manifest_df = clean_manifest_df(manifest_df)

    pipeline = NewsLLMFeaturePipeline(
        manifest_path=MANIFEST_PATH,
        output_path=PROJECT_ROOT / 'data_clean/silver/news/gemini_feature_test.parquet',
        model=model_name or 'gemini-2.5-flash',
        context_history='No prior context provided.',
    )

    results = []
    context_summary = pipeline.context_history

    for idx, row in manifest_df.iterrows():
        pipeline.context_history = context_summary
        feature_row, article = pipeline._analyze_article(row)
        context_summary = feature_row.get('context_history_summary') or context_summary

        feature_row.update(
            {
                'sequence_idx': idx + 1,
                'article_id': row.get('article_id') or article.get('article_id'),
                'published_at': row.get('published_at') or article.get('published_at'),
                'headline': row.get('headline') or article.get('headline'),
            }
        )
        results.append(feature_row)
    return results


In [None]:
import json
from src_clean.data_pipelines.silver.LLM_news_feature_pipelines import NewsFeatureSchema
from src_clean.data_pipelines.silver import LLM_news_feature_pipelines as nf
import importlib

# Reload if you've just edited the module
importlib.reload(nf)

# 1) Raw Pydantic schema
raw_schema = NewsFeatureSchema.model_json_schema()
print(json.dumps(raw_schema, indent=2))

# 2) Schema after stripping unsupported fields (the one Gemini sees)
pipeline = nf.NewsLLMFeaturePipeline(
    manifest_path=PROJECT_ROOT / "data_clean/bronze/news/historical_5year",
    output_path=PROJECT_ROOT / "tmp.parquet",
    provider="gemini",
)
print(json.dumps(pipeline.response_schema, indent=2))


{
  "properties": {
    "sentiment_score": {
      "maximum": 1.0,
      "minimum": -1.0,
      "title": "Sentiment Score",
      "type": "number"
    },
    "subjectivity": {
      "maximum": 1.0,
      "minimum": 0.0,
      "title": "Subjectivity",
      "type": "number"
    },
    "volatility_flag": {
      "title": "Volatility Flag",
      "type": "boolean"
    },
    "uncertainty_score": {
      "maximum": 1.0,
      "minimum": 0.0,
      "title": "Uncertainty Score",
      "type": "number"
    },
    "sentiment_score_vs_expectation": {
      "maximum": 1.0,
      "minimum": -1.0,
      "title": "Sentiment Score Vs Expectation",
      "type": "number"
    },
    "confidence_in_trend": {
      "maximum": 1.0,
      "minimum": 0.0,
      "title": "Confidence In Trend",
      "type": "number"
    },
    "market_reaction_flag": {
      "title": "Market Reaction Flag",
      "type": "boolean"
    },
    "emotion_keywords": {
      "items": {
        "type": "string"
      },
      "tit

In [None]:
# Run the sequential analysis
sequential_results = run_sequential_llm_calls(
    start=start_picker.value or DEFAULT_START,
    end=end_picker.value or DEFAULT_END,
    limit=max_articles_slider.value,
    model_name=model_text.value,
)

if sequential_results:
    display(Markdown('### LLM Output'))
    display(pd.json_normalize(sequential_results))
    display(Markdown('### Final Context Summary'))
    display(Markdown(sequential_results[-1].get('context_history_summary', '(missing)')))
else:
    display(Markdown('⚠️ No articles processed.'))

In [None]:
import importlib
import pandas as pd
import src_clean.data_pipelines.silver.LLM_news_feature_pipelines as nf

# Reload module to pick up recent edits
importlib.reload(nf)

INPUT_MANIFEST = PROJECT_ROOT / "data_clean/gold/news/news_with_features.parquet"
pipeline = nf.NewsLLMFeaturePipeline(
    manifest_path=INPUT_MANIFEST,
    output_path=PROJECT_ROOT / "data_clean/silver/news/gemini_feature_test.parquet",
    model= "gemini-2.0-flash-lite",
    context_history="No prior context provided.",
)

manifest_df = pipeline._load_manifest()
if manifest_df.empty:
    raise ValueError("Manifest is empty. Run the join pipeline first.")

debug_row = manifest_df.iloc[5]

try:
    result, article, prompt = pipeline._analyze_article(debug_row)
    print("LLM output (dict):")
    print(result)
except Exception as exc:
    print("Validation failure:", exc)

# Load the model
import google.generativeai as genai
model = genai.GenerativeModel("gemini-2.0-flash-lite")


# Count tokens
count = model.count_tokens(prompt)
print("Input Token count:", count.total_tokens)
# Count tokens
count = model.count_tokens(str(result))
print("Output Token count:", count.total_tokens)


LLM output (dict):
{'sentiment_score': 0.3, 'sentiment_magnitude': 0.6, 'subjectivity': 0.2, 'uncertainty_score': 0.1, 'confidence_in_trend': 0.6, 'sentiment_score_vs_expectation': 0.1, 'sentiment_trend': 'Improving', 'sentiment_trend_strength': 0.5, 'surprise_indicator': 0.0, 'emotion_intensity_score': 0.4, 'volatility_flag': False, 'market_reaction_flag': True, 'reaction_magnitude_score': 0.5, 'volatility_pressure_score': 0.3, 'emotion_keywords': ['positive'], 'predicted_price_impact_direction': 1.0, 'predicted_price_impact_probability': 0.6, 'predicted_price_impact_magnitude': 0.3, 'impact_duration_category': 'intraday', 'impact_duration_score': 0.3, 'catalyst_type': 'earnings', 'catalyst_strength_score': 0.7, 'named_entities': ['S&P 500', 'Nasdaq', 'Netflix'], 'entity_sentiment': {'Netflix': -0.5}, 'is_macro_related': 3.0, 'macro_sentiment_score': 0.2, 'relevance_to_stock': 0.7, 'industry_sector': None, 'sector_sentiment_score': 0.3, 'cross_sector_correlation_score': 0.3, 'policy_i

In [150]:
result

{'sentiment_score': 0.4,
 'sentiment_magnitude': 0.4,
 'subjectivity': 0.3,
 'uncertainty_score': 0.2,
 'confidence_in_trend': 0.6,
 'sentiment_score_vs_expectation': 0.3,
 'sentiment_trend': 'improving',
 'sentiment_trend_strength': 0.4,
 'surprise_indicator': 0.1,
 'emotion_intensity_score': 0.3,
 'volatility_flag': False,
 'market_reaction_flag': True,
 'reaction_magnitude_score': 0.2,
 'volatility_pressure_score': 0.1,
 'emotion_keywords': ['erinomaisesti', 'pettymys'],
 'predicted_price_impact_direction': 0.3,
 'predicted_price_impact_probability': 0.6,
 'predicted_price_impact_magnitude': 0.4,
 'impact_duration_category': 'intraday',
 'impact_duration_score': 0.3,
 'catalyst_type': 'earnings',
 'catalyst_strength_score': 0.4,
 'named_entities': ['Netflix', 'S&P 500', 'Nasdaq'],
 'entity_sentiment': {'Netflix': -0.5},
 'is_macro_related': 0.5,
 'macro_sentiment_score': 0.0,
 'relevance_to_stock': 0.8,
 'industry_sector': 'technology',
 'sector_sentiment_score': 0.4,
 'cross_sector

In [181]:
df = pd.read_parquet('../data_clean/gold/news/LLM_news.parquet')
df["published_at"] = pd.to_datetime(df["published_at"]).dt.tz_localize(None)
df.to_excel("output.xlsx", index=False)

In [161]:
df

Unnamed: 0,sentiment_score,sentiment_magnitude,subjectivity,uncertainty_score,confidence_in_trend,sentiment_score_vs_expectation,sentiment_trend,sentiment_trend_strength,surprise_indicator,emotion_intensity_score,...,key_new_information,overall_market_sentiment,overall_market_sentiment_score,model_confidence_score,language_detected,article_id,published_at,source,headline,url
0,-0.5,0.5,0.4,0.4,0.3,-0.6,worsening,0.4,-0.5,0.3,...,[Chinese stock market reversed early gains and...,Bearish,-0.4,0.50,Chinese,fe5201b8430c30bb6eb9513fc9227a19,2020-10-19 09:30:00+00:00,gdelt_cn.reuters.com,中国股市逆转早盘涨势收低 之前公布的GDP数据令人失望,https://cn.reuters.com/article/china-stocks-gd...
1,-0.6,0.7,0.8,0.6,0.3,-0.7,worsening,0.7,-0.6,0.7,...,"[China's Q3 GDP grew by 4.9%., China's economy...",Bearish,-0.6,0.70,Chinese,ff434099c7b77b6ab6395b9e95560437,2020-10-19 11:00:00+00:00,gdelt_blog.eastmoney.com,统计局今日公布统计数据 ： 三季度GDP同比增长4 . 9 % 经济由负转正 。 东方财富网博客,http://blog.eastmoney.com/o8326325948088954/bl...
2,0.5,0.6,0.8,0.7,0.7,0.4,stable,0.3,0.1,0.5,...,[Stock market crashes are normal and unprevent...,Neutral,0.1,0.85,English,ba853d93b28568870abfea3b9959429e,2020-10-19 11:15:00+00:00,gdelt_fool.com,4 Perfect Stocks to Buy if the Stock Market Cr...,https://www.fool.com/investing/2020/10/19/4-pe...
3,0.2,0.5,0.7,0.6,0.6,0.1,stable,0.2,0.1,0.5,...,[S&P 500 has recorded 12 of its 15 largest sin...,Neutral,0.1,0.80,English,7027368bd2cc529ea3b7ef5b0f13a0f4,2020-10-21 12:45:00+00:00,gdelt_fool.com,5 Things to Do When the Stock Market Is Volatile,https://www.fool.com/investing/2020/10/21/5-th...
4,0.1,0.4,0.5,0.6,0.6,0.1,improving,0.4,0.1,0.4,...,[Wall Street menguat di awal perdagangan Rabu....,Neutral,0.1,0.80,id,95b030051539adae2a486ec59b50bbcf,2020-10-21 15:00:00+00:00,gdelt_investasi.kontan.co.id,"Wall Street menguat , masih ada ganjalan pada ...",https://investasi.kontan.co.id/news/wall-stree...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1835,,,,,,,,,,,...,,,,,,c1f2f9e80b00bbe39d931158bdb0f696,2021-10-08 00:00:00+00:00,gdelt_investmentwatchblog.com,Chart | Wall Street vs Main Street : Why the S...,https://www.investmentwatchblog.com/chart-wall...
1836,,,,,,,,,,,...,,,,,,b219bf59716ff57d361275b1d5b57573,2021-10-08 00:15:00+00:00,gdelt_economy.okezone.com,Wall Street Ditutup Menguat Didukung Kenaikan ...,https://economy.okezone.com/read/2021/10/08/27...
1837,,,,,,,,,,,...,,,,,,5bacfa05b7eb90e144daa7fac561d361,2021-10-09 00:30:00+00:00,gdelt_investors.com,"Dow Jones Futures : Market Rally , These 5 Sto...",https://www.investors.com/market-trend/stock-m...
1838,,,,,,,,,,,...,,,,,,d94379ed9ce513cf095222084863dcd0,2021-10-09 00:30:00+00:00,gdelt_investors.com,Stock Market Drifts Lower After Jobs Data ; 10...,https://www.investors.com/market-trend/the-big...


In [136]:
pd.DataFrame([prompt]).to_clipboard()

In [19]:
pipeline._call_gemini(prompt)

InvalidArgument: 400 * GenerateContentRequest.generation_config.response_schema.properties["entity_sentiment"].properties: should be non-empty for OBJECT type


## Tips

- Re-run the parameter cell after updating the date range or maximum articles.
- If you hit rate limits, increase the gap between runs or manually throttle the helper function.
- The notebook uses a private helper (`_analyze_article`) for demonstration purposes. For production use, wrap this logic in a dedicated orchestrator that handles persistence and retries as needed.