<a href="https://colab.research.google.com/github/meka-williams/Free-News-APIs/blob/main/Team%204%20News%20API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini Project: Retrieving and Storing News Articles from Free News APIs
Authors: Payal Moorti & Shameka Williams

[GitHub Repository](https://github.com/meka-williams/Free-News-APIs.git)

**Project Objectives**
1. Understand how to interact with public APIs to retrieve data
2. Learn to process, filter, and store large-scale textual data
3. Gain hands-on experience integrating web mining techniques with cloud storage solutions
4. Develop team collaboration and project management skills

**Project Description**

Each team will focus on retrieving news articles using a designated free API from the provided list. Teams will extract, clean, and store the data in a designated S3 bucket on AWS. This project will involve designing efficient workflows for API interaction, data processing, and storage, with documentation for reproducibility.

# Project Milestones and Timeline

* Week 1: Project Setup and API Familiarization
* Week 2: Data Retrieval and Preprocessing
* Week 3: Storing Data in S3
* Week 4: Final Presentation and Reporting

# Establish Variables for the Currents API Key and the Gemini API Key

The cell below reads both keys from Colab's `userdata` secrets store: `CURRENT_KEY` for the Currents API and `GOOGLE_API_KEY` for Gemini. The Gemini client itself is created in the next section.

*We used the Gemini API instead of the OpenAI API because it is free to use and has a high usage limit.*


In [None]:
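import requests
import json
from google.colab import userdata
import pprint
# google-genai SDK: provides genai.Client, which is used below.
# The legacy `import google.generativeai as genai` was removed because this import shadowed it.
from google import genai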

news_key=userdata.get("CURRENT_KEY")
gemma=userdata.get("GOOGLE_API_KEY")

# Gemini Client Creation & Testing Model Communication
Created a Gemini API client and a utility function for sending prompts to the `gemini-2.0-flash-lite` model (Gemini 2.0 Flash Lite).

In [None]:
# Create Gemini client
client = genai.Client(api_key=gemma)

# Utility function: send a prompt to Gemini and return the response text
def get_response(prompt, model='gemini-2.0-flash-lite'):
    response = client.models.generate_content(
        model=model,
        contents=prompt
    )
    return response.text

Checking the response from the AI

In [None]:
#testing get_response
get_response("What is a zero shot prompt?")

'A zero-shot prompt is a prompt given to a language model (like GPT-3, Bard, or Claude) **that requires the model to perform a task without any prior examples or demonstrations.** The model is expected to understand the instructions solely based on the prompt itself.\n\nThink of it like asking someone a question they\'ve never heard before and expecting them to answer correctly, purely based on their existing knowledge and understanding.\n\nHere\'s a breakdown:\n\n*   **Zero-Shot:** "Zero" refers to the number of examples provided in the prompt. There are no examples of how the task should be done.\n*   **Prompt:** The text or instruction given to the language model. It describes the task you want the model to perform.\n\n**How Zero-Shot Prompts Work:**\n\nThe language model relies on its pre-existing training data to understand the prompt and generate a response. This data includes a massive amount of text and code, allowing it to recognize patterns, relationships between words, and c

# Querying the News API
The `get_news_articles` function retrieves articles from the Currents API. By default it requests 150 English-language articles on the topic of health.

**Parameters**

*   `api_key` : key for the Currents API
*   `keywords` : topic of the articles (default `"Health"`)
*   `language` : language code of the articles (default `"en"`)
*   `page_size` : number of articles to request (default `150`)

**Challenges & Outcomes**

At first, the request returned only 30 articles. Changing the parameters of the retrieval function (it now passes `page_size=150`) fixed the issue.

In [None]:
def get_news_articles(api_key, keywords='Health', language='en', page_size=150):
    # Retrieve news articles from the Currents API for the given search parameters.
    url = 'https://api.currentsapi.services/v1/search'
    params = {
        'keywords': keywords,
        'language': language,
        'page_size': page_size,
        'apiKey': api_key,
    }
    # Passing params lets requests percent-encode the values.
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors
    data = response.json()

    print(f"Total Results: {len(data.get('news', []))}")  # Access 'news' from JSON data

    return data

# Example usage
news_data = get_news_articles(news_key)

Total Results: 150
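
Search terms that contain spaces or `&` have to be percent-encoded before they reach the URL. As a minimal sketch, assuming the same endpoint and parameter names as `get_news_articles`, the hypothetical helper `build_search_url` (not part of the notebook) shows the encoding with the standard library only, so it runs without an API key or network access:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the notebook's get_news_articles function.
BASE_URL = "https://api.currentsapi.services/v1/search"

def build_search_url(api_key, keywords="Health", language="en", page_size=150):
    """Return the search URL with every parameter percent-encoded."""
    params = {
        "keywords": keywords,
        "language": language,
        "page_size": page_size,
        "apiKey": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"

# "DEMO_KEY" is a placeholder, not a real key.
url = build_search_url("DEMO_KEY", keywords="mental health & wellness")
print(url)

# Round trip: decoding the query string gives back the original keywords.
print(parse_qs(urlsplit(url).query)["keywords"])
```

The round trip at the end is a quick way to confirm that the `&` inside the keywords was not mistaken for a parameter separator.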


In [None]:
#display sample article
pprint.pprint(news_data['news'][60])

{'author': 'foxnews',
 'category': ['general'],
 'description': 'LIFE-THREATENING - Abortion pill found to have "severe '
                'adverse effects" for 1 in 10 women, study finds. Continue '
                "reading…'SOUL DOG' - Woman says her cockapoo detected her "
                'breast cancer before doctors did. Continue reading…SUMMER '
                'SKINCARE - Save on sunscreens, moisturizers and self-tanners. '
                'Continue ...',
 'id': '5c01b7c1-4010-4124-b0fc-a40c9ad8955f',
 'image': 'https://static.foxnews.com/foxnews.com/content/uploads/2025/04/newsletter430.jpg',
 'language': 'en',
 'published': '2025-05-01 00:25:13 +0000',
 'title': 'Alzheimer’s, cancer and ALS breakthroughs to know about',
 'url': 'https://www.foxnews.com/health/alzheimers-cancer-als-breakthroughs-know-about'}


# Preprocessing and Cleaning Retrieved Articles
The `preprocess_articles` function cleans the retrieved articles and handles common data quality issues.

It first prints the number of articles to process and the keys of a sample article, then creates an empty set of seen URLs and an empty list for the processed articles. For each article, it:

*   Skips articles without URLs
*   Skips articles whose URL has already been seen
*   Builds a record with consistent fields, adding default values for missing data (such as "Unknown" for a missing author)
*   Adds a "PROCESSED:" prefix to the title for verification
*   Skips articles that are missing both title and description

It returns the list of cleaned, deduplicated articles. The code after the function prints the article count after preprocessing and shows an original article next to a preprocessed one.

**Parameters**

*   `articles` : retrieved articles
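
The two URL rules in the list above (skip articles with no URL, skip repeated URLs) can be tried on toy data. This is a minimal sketch: `dedupe_by_url` and the `example.com` records are made up for illustration and are not part of the notebook.

```python
def dedupe_by_url(articles):
    """Keep the first article seen for each URL; drop articles with no URL."""
    seen_urls = set()
    kept = []
    for article in articles:
        url = article.get("url")
        # A missing or empty URL, or one already seen, means the article is skipped.
        if not url or url in seen_urls:
            continue
        seen_urls.add(url)
        kept.append(article)
    return kept

sample = [
    {"url": "https://example.com/a", "title": "First"},
    {"url": "https://example.com/a", "title": "Duplicate of first"},
    {"title": "No URL"},
    {"url": "https://example.com/b", "title": "Second"},
]
cleaned = dedupe_by_url(sample)
print([a["title"] for a in cleaned])  # ['First', 'Second']
```

Using a set for the seen URLs keeps each membership check constant-time on average, which matters little at 150 articles but scales to much larger batches.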

In [None]:
#preprocessing retrieved articles: handling duplicates and missing data
def preprocess_articles(articles):
    print(f"Starting preprocessing on {len(articles)} articles")
    if articles:
        # Inspect the structure of a sample article from the input list
        # (the original read a fixed index from the global news_data instead)
        print("\nOriginal article keys:")
        print(sorted(articles[0].keys()))

    seen_urls = set()
    preprocessed_articles = []
    for article in articles:
        # Skip articles without a URL
        url = article.get('url')
        if not url:
            continue
        # Skip duplicate URLs
        if url in seen_urls:
            continue
        seen_urls.add(url)
        title = article.get("title")
        description = article.get("description")
        # Skip articles missing both title and description.
        # The check runs before the prefix is added; comparing the prefixed
        # title against the default string could never match.
        if not title and not description:
            continue
        processed_article = {
            "id": article.get("id", ""),
            "title": "PROCESSED: " + (title or "No title available"),  # Add a prefix to verify
            "description": description or "No description available",
            "url": url,
            "author": article.get("author", "Unknown"),
            "image": article.get("image", ""),
            "language": article.get("language", "en"),
            "category": article.get("category", []),
            "published": article.get("published", "na"),
        }
        preprocessed_articles.append(processed_article)
    return preprocessed_articles

processed_data = preprocess_articles(news_data.get('news', []))

print(f"Total articles after preprocessing: {len(processed_data)}")

#print sample original article
print("\nOriginal article:")
pprint.pprint(news_data.get('news', [])[60])

#print sample preprocessed article
print("\nPreprocessed article:")
pprint.pprint(processed_data[60])

Starting preprocessing on 150 articles

Original article keys:
['author', 'category', 'description', 'id', 'image', 'language', 'published', 'title', 'url']
Total articles after preprocessing: 150

Original article:
{'author': 'foxnews',
 'category': ['general'],
 'description': 'LIFE-THREATENING - Abortion pill found to have "severe '
                'adverse effects" for 1 in 10 women, study finds. Continue '
                "reading…'SOUL DOG' - Woman says her cockapoo detected her "
                'breast cancer before doctors did. Continue reading…SUMMER '
                'SKINCARE - Save on sunscreens, moisturizers and self-tanners. '
                'Continue ...',
 'id': '5c01b7c1-4010-4124-b0fc-a40c9ad8955f',
 'image': 'https://static.foxnews.com/foxnews.com/content/uploads/2025/04/newsletter430.jpg',
 'language': 'en',
 'published': '2025-05-01 00:25:13 +0000',
 'title': 'Alzheimer’s, cancer and ALS breakthroughs to know about',
 'url': 'https://www.foxnews.com/health/alzhei

# Set Up Access to S3 Bucket

Imported the libraries required for access. Created the variables `TEAM` and `BUCKET_NAME` with the team folder prefix and the name of the class bucket.

In [None]:
!pip install boto3



In [None]:
import os
import boto3
from botocore.config import Config
from botocore import UNSIGNED

In [None]:
TEAM = "TEAM_4/"
BUCKET_NAME = "cus635-spring2025"

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Upload a Test File to S3 Bucket
Uploaded `words.txt` to test access to S3.

In [None]:
file_path = "/content/sample_data/tmp/"
file_name = "words.txt"
object_name = file_name

s3.upload_file(file_path+file_name, BUCKET_NAME, TEAM+object_name)

In [None]:
response = s3.list_objects_v2(Bucket=BUCKET_NAME)
if "Contents" in response:
    print("Files in S3 Bucket:")
    for obj in response["Contents"]:
        print(f" - {obj['Key']}")
else:
    print("No files found in the bucket.")

Files in S3 Bucket:
 - TEAM_1/sources/ABC_News.json
 - TEAM_1/sources/ABC_News_AU_.json
 - TEAM_1/sources/ANSA_it.json
 - TEAM_1/sources/AppleInsider.json
 - TEAM_1/sources/Associated_Press.json
 - TEAM_1/sources/BBC_News.json
 - TEAM_1/sources/BBC_Sport.json
 - TEAM_1/sources/Bild.json
 - TEAM_1/sources/Bleacher_Report.json
 - TEAM_1/sources/Bloomberg.json
 - TEAM_1/sources/Breitbart_News.json
 - TEAM_1/sources/Business_Insider.json
 - TEAM_1/sources/CBC_News.json
 - TEAM_1/sources/CBS_News.json
 - TEAM_1/sources/CNET.json
 - TEAM_1/sources/CoinDesk.json
 - TEAM_1/sources/Crypto_Coins_News.json
 - TEAM_1/sources/Digital_Trends.json
 - TEAM_1/sources/ESPN.json
 - TEAM_1/sources/Educatedguesswork_org.json
 - TEAM_1/sources/Flowingdata_com.json
 - TEAM_1/sources/FourFourTwo.json
 - TEAM_1/sources/Fox_News.json
 - TEAM_1/sources/Fox_Sports.json
 - TEAM_1/sources/Genbeta_com.json
 - TEAM_1/sources/Github_com.json
 - TEAM_1/sources/Gizmodo_com.json
 - TEAM_1/sources/Gizmodo_jp.json
 - TEAM_

# Upload JSON Articles to Team Folder
Created `upload_article_to_s3` function to iterate through processed articles and upload the articles individually into the team folder in the S3 Bucket.

**Parameters**


*   `articles` : list of processed articles
*   `bucket_name` : name of the S3 bucket
*   `team_folder` : team folder to store articles

**Challenges & Outcomes**

The main challenge was naming each article's JSON file and removing spaces from the file name. We fixed this by using a regular expression to strip unwanted characters and then replacing spaces with underscores.
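
The naming step can be checked on its own, without touching S3. The sketch below uses a hypothetical helper, `title_to_object_key`, and a made-up title. It applies the same character filter as the upload function, but as an assumption it collapses each run of whitespace into a single underscore, so a removed character between two spaces does not leave a double underscore behind.

```python
import re

def title_to_object_key(title, team_folder="TEAM_4/"):
    """Turn an article title into an S3 object key ending in .json."""
    # Drop everything except word characters, whitespace, dots and hyphens.
    cleaned = re.sub(r"[^\w\s.-]", "", title)
    # Collapse each run of whitespace into one underscore.
    cleaned = re.sub(r"\s+", "_", cleaned.strip())
    return f"{team_folder}{cleaned}.json"

print(title_to_object_key("Health: 5 tips, tricks & more!"))
# TEAM_4/Health_5_tips_tricks_more.json
```

Two different titles can still map to the same key once punctuation is removed, in which case the later upload overwrites the earlier object. Appending the article `id` to the key would be one way to avoid that.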

In [None]:
import re

def upload_article_to_s3(articles, bucket_name, team_folder):
    for index, article in enumerate(articles):
        title = article.get("title", f"article_{index + 1}")
        # Keep only word characters, whitespace, dots, and hyphens
        file_name = re.sub(r"[^\w\s.-]", "", title)
        # Replace spaces so the object key contains no spaces
        file_name = file_name.replace(" ", "_")
        object_key = f"{team_folder}{file_name}.json"
        try:
            s3.put_object(
                Body=json.dumps(article),
                Bucket=bucket_name,
                Key=object_key,
                ContentType='application/json'
            )
            print(f"Article {index + 1} uploaded successfully to s3://{bucket_name}/{object_key}")
        except Exception as e:
            print(f"Error uploading article {index + 1}: {e}")

upload_article_to_s3(processed_data, BUCKET_NAME, TEAM)

Article 1 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Trump_fully_fit_for_duty_White_House_physician_says.json
Article 2 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_The_Many_Ways_Kennedy_Is_Already_Undermining_Vaccines.json
Article 3 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Alabama_woman_has_pig_kidney_removed_after_a_record_130_days.json
Article 4 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Bernie_Sanders_says_largest_Fighting_Oligarchy_rally_with_AOC_is_making_Trump_Musk_very_nervous.json
Article 5 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_How_Health_and_Human_Safety_Department_cuts_could_affect_your_health.json
Article 6 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Exclusive_--_Ironic_How_Liberal_Media_Weaponizes_Woke_Cancel_Culture_RFK_Jr._Hammers_Fake_News_Attacks_Falsely_Claiming_He_Mocked_Handicapped.json
Article 7 uploaded successfully to s3://cus635

# Clearing Team Folder
The `clear_s3_folder` function removes all files from the team folder.

**Parameters**

*   `bucket_name` : name of the S3 bucket
*   `folder_path` : name of the team folder

**Challenges & Outcomes**

Because the S3 client is anonymous (unsigned), it is not allowed to delete objects from the bucket, so the call below fails with `AccessDenied`. As a result, the team folder may still contain articles from earlier upload attempts, including duplicates.


In [None]:
def clear_s3_folder(bucket_name, folder_path):
    try:
        # 'Contents' is missing from the response when no object matches the prefix,
        # so the lookup stays inside the try block to reach the KeyError handler.
        objects = s3.list_objects_v2(Bucket=bucket_name, Prefix=folder_path)['Contents']
        for obj in objects:
            s3.delete_object(Bucket=bucket_name, Key=obj['Key'])
            print(f"Deleted {obj['Key']}")

        print(f"Cleared folder: s3://{bucket_name}/{folder_path}")
    except KeyError:
        print(f"Folder '{folder_path}' is empty or does not exist.")
    except Exception as e:
        print(f"Error clearing folder: {e}")

clear_s3_folder(BUCKET_NAME, TEAM)

Error clearing folder: An error occurred (AccessDenied) when calling the DeleteObject operation: Access Denied


# List Files Stored in Team Folder

In [None]:
response = s3.list_objects_v2(Bucket=BUCKET_NAME)
if "Contents" in response:
    print("Files in S3 Bucket:")
    for obj in response["Contents"]:
        print(f" - {obj['Key']}")
else:
    print("No files found in the bucket.")

Files in S3 Bucket:
 - TEAM_1/sources/ABC_News.json
 - TEAM_1/sources/ABC_News_AU_.json
 - TEAM_1/sources/ANSA_it.json
 - TEAM_1/sources/AppleInsider.json
 - TEAM_1/sources/Associated_Press.json
 - TEAM_1/sources/BBC_News.json
 - TEAM_1/sources/BBC_Sport.json
 - TEAM_1/sources/Bild.json
 - TEAM_1/sources/Bleacher_Report.json
 - TEAM_1/sources/Bloomberg.json
 - TEAM_1/sources/Breitbart_News.json
 - TEAM_1/sources/Business_Insider.json
 - TEAM_1/sources/CBC_News.json
 - TEAM_1/sources/CBS_News.json
 - TEAM_1/sources/CNET.json
 - TEAM_1/sources/CoinDesk.json
 - TEAM_1/sources/Crypto_Coins_News.json
 - TEAM_1/sources/Digital_Trends.json
 - TEAM_1/sources/ESPN.json
 - TEAM_1/sources/Educatedguesswork_org.json
 - TEAM_1/sources/Flowingdata_com.json
 - TEAM_1/sources/FourFourTwo.json
 - TEAM_1/sources/Fox_News.json
 - TEAM_1/sources/Fox_Sports.json
 - TEAM_1/sources/Genbeta_com.json
 - TEAM_1/sources/Github_com.json
 - TEAM_1/sources/Gizmodo_com.json
 - TEAM_1/sources/Gizmodo_jp.json
 - TEAM_
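
A single `list_objects_v2` call returns at most 1,000 keys and, without a `Prefix`, covers the whole bucket, which is why the listing above starts with another team's files. A minimal sketch of listing only one folder is shown below. It assumes a boto3 S3 client is passed in and uses boto3's `list_objects_v2` paginator. The `FakeS3` class is a made-up offline stand-in so the example runs without AWS access.

```python
def list_keys(s3_client, bucket, prefix):
    """Return every object key under `prefix`, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys

# In the notebook this would be called as: list_keys(s3, BUCKET_NAME, TEAM)

class FakeS3:
    """Offline stand-in that mimics only the two calls used above (not part of boto3)."""
    def __init__(self, pages):
        self._pages = pages

    def get_paginator(self, operation_name):
        return self

    def paginate(self, **kwargs):
        prefix = kwargs["Prefix"]
        for page in self._pages:
            matching = [o for o in page if o["Key"].startswith(prefix)]
            yield {"Contents": matching} if matching else {}

fake = FakeS3([
    [{"Key": "TEAM_1/a.json"}, {"Key": "TEAM_4/b.json"}],
    [{"Key": "TEAM_4/c.json"}],
])
print(list_keys(fake, "demo-bucket", "TEAM_4/"))  # ['TEAM_4/b.json', 'TEAM_4/c.json']
```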

# Installing Libraries

In [None]:
!pip install pinecone


Collecting pinecone
  Downloading pinecone-6.0.2-py3-none-any.whl.metadata (9.0 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone-6.0.2-py3-none-any.whl (421 kB)
   421.9/421.9 kB 9.9 MB/s eta 0:00:00
Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl (6.2 kB)
Installing collected packages: pinecone-plugin-interface, pinecone
Successfully installed pinecone-6.0.2 pinecone-plugin-interface-0.0.7


In [None]:
!pip install openai



In [None]:
!pip install mistralai

Collecting mistralai
  Downloading mistralai-1.6.0-py3-none-any.whl.metadata (30 kB)
Collecting eval-type-backport>=0.2.0 (from mistralai)
  Downloading eval_type_backport-0.2.2-py3-none-any.whl.metadata (2.2 kB)
Downloading mistralai-1.6.0-py3-none-any.whl (288 kB)
   288.7/288.7 kB 13.3 MB/s eta 0:00:00
Downloading eval_type_backport-0.2.2-py3-none-any.whl (5.8 kB)
Installing collected packages: eval-type-backport, mistralai
Successfully installed eval-type-backport-0.2.2 mistralai-1.6.0


In [None]:
!pip install sentence-transformers

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=1.11.0->sentence-transformers)
 

# Document Processing

In [None]:
from pinecone import Pinecone
import openai
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
from tokenizers import Tokenizer

In [None]:
pc = Pinecone(api_key=userdata.get("PINECONE_KEY"))

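#model = SentenceTransformer("sentence-transformers/stsb-bert-large")
# Note: a StaticEmbedding built from only a tokenizer and embedding_dim starts with
# randomly initialized weights, so similarity scores from this model carry little
# semantic meaning. The same `model` object must be used for upserts and queries.
tokenizer = Tokenizer.from_pretrained("google-bert/bert-base-uncased")
static_embedding = StaticEmbedding(tokenizer, embedding_dim=1024)

model = SentenceTransformer(modules=[static_embedding])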

#finding host name
index_id = userdata.get("INDEX_ID")
indexes = pc.list_indexes()
for idx in indexes:
    if idx['name'] == index_id:
        print(f"Host for index {index_id}: {idx['host']}")
        index = pc.Index(host=idx['host'])

index = pc.Index(host=userdata.get("INDEX_HOST"))

def get_articles_from_s3(bucket, prefix):
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    articles = []
    for obj in response.get('Contents', []):
        if obj['Key'].endswith('.json'):
            file_data = s3.get_object(Bucket=bucket, Key=obj['Key'])['Body'].read()
            article = json.loads(file_data)
            articles.append(article)
    return articles

def get_embedding(text):
    if not text:
        return []
    embedding = model.encode(text)
    return embedding.tolist()

def upsert_to_pinecone(index, articles, namespace="team-4"):
    records = []
    for i, article in enumerate(articles):
        text = article.get("description", "")
        if not text:
            continue
        embedding = get_embedding(text)
        record = {
            "id": f"article_{i}",
            "values": embedding,
            "metadata": {
                "chunk_text": text,
                "team": "Team 4",
                "category": "Policies",
                "title": article.get("title", ""),
                "published": article.get("published", ""),
                "url": article.get("url", "")
            }
        }
        records.append(record)

    for i in range(0, len(records), 100):
        batch = records[i:i+100]
        index.upsert(vectors=batch, namespace=namespace)
        print(f"Upserted batch {i//100 + 1} of {len(records)} vectors")

articles = get_articles_from_s3(BUCKET_NAME, TEAM)
upsert_to_pinecone(index, articles)


Host for index cus635: cus635-g311jqa.svc.aped-4627-b74a.pinecone.io
Upserted batch 1 of 613 vectors
Upserted batch 2 of 613 vectors
Upserted batch 3 of 613 vectors
Upserted batch 4 of 613 vectors
Upserted batch 5 of 613 vectors
Upserted batch 6 of 613 vectors
Upserted batch 7 of 613 vectors
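
The log lines above read "batch N of 613 vectors", which mixes a batch number with a vector count. The batching arithmetic can be sketched in isolation. `batched` is a hypothetical helper and the records are placeholders that carry only an `id`.

```python
def batched(records, batch_size=100):
    """Yield consecutive slices of `records`, each at most `batch_size` long."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Placeholder records, sized to match the 613 vectors reported in the log.
records = [{"id": f"article_{i}"} for i in range(613)]
batches = list(batched(records))
total_batches = len(batches)

for number, batch in enumerate(batches, start=1):
    print(f"Batch {number} of {total_batches}: {len(batch)} vectors")
```

With 613 records and a batch size of 100 this gives seven batches, six of 100 vectors and a final one of 13, which matches the seven upsert lines in the output.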


# Querying in Pinecone

In [None]:
query = "Trump Adminstration"
query_vector = model.encode(query).tolist()

results = index.query(
    namespace = "team-4",
    top_k = 5,
    include_metadata = True,
    vector = query_vector
)

print(f"\n Top Results for: \"{query}\"\n")
count = 0
for match in results['matches']:
  print(f"[{count}] Score: {match['score']:.3f}")
  print(f"Title: {match['metadata']['title']}")
  print(f"URL: {match['metadata']['url']}")
  print(f"Published: {match['metadata']['published']}")
  count+=1


 Top Results for: "Trump Adminstration"

[0] Score: 0.060
Title: PROCESSED: Gun-control measure signed into law, a Trump defense fund and more from the Colorado legislature this week
URL: https://www.denverpost.com/2025/04/12/colorado-gun-control-trump-defense-fund-legislature/
Published: 2025-04-12 12:00:22 +0000
[1] Score: 0.051
Title: PROCESSED: Olivia Munn faces wrath of parenting police after toddler goes shoeless on New York city streets
URL: https://www.foxnews.com/entertainment/olivia-munn-faces-wrath-parenting-police-after-toddler-tantrum-defeat
Published: 2025-04-28 19:34:31 +0000
[2] Score: 0.049
Title: PROCESSED: Some 9/11 first responders left in limbo amid funding cuts
URL: https://www.cbsnews.com/video/some-911-first-responders-left-in-limbo-amid-funding-cuts/
Published: 2025-05-02 00:42:00 +0000
[3] Score: 0.048
Title: PROCESSED: Stock Inhalers Save ER Trips, Keep Students in School
URL: https://www.medscape.com/viewarticle/stock-inhalers-relieve-asthma-symptoms-school

# Create AI Agent

In [None]:
%pip install -U langchain-google-genai



In [None]:
!pip install langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.23-py3-none-any.whl.metadata (2.5 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain-community)
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB

In [None]:
# Core LangChain
!pip install langchain

# Gemini support
!pip install langchain-google-genai google-generativeai

# Community tools (e.g., GoogleSearchAPIWrapper)
!pip install langchain-community

# For prompts and agent creation
!pip install langchain-core

# Optional: For secure key input
!pip install python-dotenv
!pip install pinecone


Collecting pinecone
  Downloading pinecone-6.0.2-py3-none-any.whl.metadata (9.0 kB)
Downloading pinecone-6.0.2-py3-none-any.whl (421 kB)
   421.9/421.9 kB 6.7 MB/s eta 0:00:00
Installing collected packages: pinecone
Successfully installed pinecone-6.0.2


In [None]:
import os
from pinecone import Pinecone
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.tools import Tool
from langchain_community.utilities import GoogleSearchAPIWrapper
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
from langchain.agents import create_tool_calling_agent, AgentExecutor

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
from tokenizers import Tokenizer

# 1. Initialize the Gemini LLM
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash-lite",
    google_api_key=gemma,
    temperature=0.0
)

GOOGLE_API_KEY = userdata.get("GOOGLE_API_KEY")
GOOGLE_CSE_ID = userdata.get("GOOGLE_CSE_ID")
INDEX_HOST =  userdata.get("INDEX_HOST")
INDEX_NAME = userdata.get("INDEX_ID")
NAMESPACE = "team-4"
PINECONE_KEY = userdata.get("PINECONE_KEY")

pc = Pinecone(api_key=PINECONE_KEY)
index = pc.Index(host=INDEX_HOST)

# 2. Setup the Google Search Tool
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
os.environ["GOOGLE_CSE_ID"] = GOOGLE_CSE_ID
search = GoogleSearchAPIWrapper()

# Reuse the embedding `model` created in the Document Processing section.
# Building a new StaticEmbedding here would draw fresh random weights, so query
# vectors would not be comparable with the vectors already stored in Pinecone.

# Define Pinecone search function
def custom_pinecone_search(query: str):
    query_vector = model.encode(query).tolist()
    results = index.query(
        namespace=NAMESPACE,
        top_k=50,
        include_metadata=True,
        vector=query_vector
    )

    if not results["matches"]:
        return "No relevant articles were found in the database."

    summary = []
    for match in results["matches"]:
        metadata = match["metadata"]
        title = metadata.get("title", "Untitled")
        chunk = metadata.get("chunk_text", "")
        published = metadata.get("published", "Unknown date")
        summary.append(f" *{title}* (Published: {published})\n{chunk.strip()}\n")

    return "\n---\n".join(summary)


search_tool = Tool(
    name="article_search",
    description="Use this tool to search embedded news articles for information to answer the query.",
    func=custom_pinecone_search,
)

# 3. Prompt Template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a smart assistant. Use external tools if needed."),
    ("user", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

# 4. Create the Agent using Gemini-compatible function
agent = create_tool_calling_agent(
    llm=llm,
    tools=[search_tool],
    prompt=prompt,
)

# 5. Agent Executor
agent_executor = AgentExecutor(
    agent=agent,
    tools=[search_tool],
    verbose=True,
)

# 6. Run
if __name__ == "__main__":
    question = "What is news on mental health"
    response = agent_executor.invoke({"input": question})
    print(response["output"])


> Entering new AgentExecutor chain...

Invoking: `article_search` with `mental health`


 *PROCESSED: Singapore’s lower-income families on ComLink+ scheme to get more help navigating healthcare, housing system* (Published: 2025-03-10 05:09:00 +0000)
Coaches will also work with these families to help them adopt a healthier lifestyle.

---
 *PROCESSED: Olivia Munn faces wrath of parenting police after toddler goes shoeless on New York city streets* (Published: 2025-04-28 19:34:31 +0000)
Olivia Munn picked her battles and averted another meltdown, albeit with a shoeless walk through the city.Munn, 44, caught both slack and praise from social media followers after sharing a series of photos of her 3-year-old son, Malcolm, walking down the street in New York City without shoes.While s...

---
 *PROCESSED: The Alignment-to-Value Pipeline* (Published: 2025-03-10 15:00:10 +0000)
This pipeline ensures strategic alignment and backlog health and focuses on 
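
The tool output above includes articles with little connection to the question, since `custom_pinecone_search` returns all 50 matches regardless of score. One option is to filter and cap the matches before formatting them. The sketch below works on plain dictionaries shaped like the matches in the notebook. The `format_matches` helper, its `min_score` and `limit` parameters, and the demo data are all assumptions for illustration.

```python
def format_matches(matches, min_score=0.0, limit=5):
    """Format the highest-scoring matches as text, skipping low scores."""
    kept = [m for m in matches if m["score"] >= min_score]
    kept.sort(key=lambda m: m["score"], reverse=True)

    blocks = []
    for match in kept[:limit]:
        metadata = match["metadata"]
        title = metadata.get("title", "Untitled")
        published = metadata.get("published", "Unknown date")
        chunk = metadata.get("chunk_text", "").strip()
        blocks.append(f"*{title}* (Published: {published})\n{chunk}")

    if not blocks:
        return "No relevant articles were found in the database."
    return "\n---\n".join(blocks)

# Made-up matches; real ones would come from index.query(...)["matches"].
demo_matches = [
    {"score": 0.82, "metadata": {"title": "A", "published": "2025-01-01", "chunk_text": "alpha"}},
    {"score": 0.10, "metadata": {"title": "B", "published": "2025-01-02", "chunk_text": "beta"}},
]
print(format_matches(demo_matches, min_score=0.5))
```

A sensible threshold depends on the embedding model and the index metric, so `min_score` would need tuning against real queries.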