<a href="https://colab.research.google.com/github/meka-williams/Free-News-APIs/blob/main/Team%204%20News%20API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini Project: Retrieving and Storing News Articles from Free News APIs
Authors: Payal Moorti & Shameka Williams

[GitHub Repository](https://github.com/meka-williams/Free-News-APIs.git)

**Project Objectives**
1. Understand how to interact with public APIs to retrieve data
2. Learn to process, filter, and store large-scale textual data
3. Gain hands-on experience integrating web mining techniques with cloud storage solutions
4. Develop team collaboration and project management skills

**Project Description**

Each team will focus on retrieving news articles using a designated free API from the provided list. Teams will extract, clean, and store the data in a designated S3 bucket on AWS. This project will involve designing efficient workflows for API interaction, data processing, and storage, with documentation for reproducibility.

# Project Milestones and Timeline

* Week 1: Project Setup and API Familiarization
* Week 2: Data Retrieval and Preprocessing
* Week 3: Storing Data in S3
* Week 4: Final Presentation and Reporting

# Establish Variables for the Currents API Key and the Gemini API Key

Loaded the Currents API key (`CURRENT_KEY`) and the Gemini API key (`GOOGLE_API_KEY`) from Colab's user secrets with `userdata.get`. The Gemini key is passed to the client when it is created.

*We used the Gemini API instead of the OpenAI API because it's free to use and has a high usage limit.*
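
To run the notebook outside Colab, the key lookup could fall back to environment variables. The sketch below is an illustration only: the helper name `load_key` is hypothetical and not part of the original notebook, and it assumes the Colab secret and the environment variable share the same name.

```python
import os

def load_key(name):
    """Return a secret from Colab's user secrets, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment
        return os.environ.get(name)
```

With this helper, `news_key = load_key("CURRENT_KEY")` would behave the same in Colab and return `None` elsewhere when the variable is not set.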


In [32]:
import requests
import json
import pprint
from google.colab import userdata
from google import genai  # google-genai SDK; provides genai.Client used below

news_key=userdata.get("CURRENT_KEY")
gemma=userdata.get("GOOGLE_API_KEY")

# Gemini Client Creation & Testing Model Communication
Created a Gemini API client and a utility function to communicate with the Gemini model `gemini-2.0-flash-lite` (Gemini 2.0 Flash-Lite).

In [33]:
# Create Gemini client
client = genai.Client(api_key=gemma)

# Utility function to send a prompt to Gemini and return the text reply
def get_response(prompt, model='gemini-2.0-flash-lite'):
    response = client.models.generate_content(
        model=model,
        contents=prompt
    )
    return response.text

Checking the response from the AI

In [34]:
#testing get_response
get_response("What is a zero shot prompt?")

'A zero-shot prompt is a type of prompt used in the context of large language models (LLMs) like GPT-3, GPT-4, and others. It allows the model to perform a task without being given any specific examples or training data. Essentially, you simply provide the model with a task description or question and expect it to understand and respond accordingly.\n\nHere\'s a breakdown:\n\n*   **Zero-Shot:** The model receives *no* examples during the prompt itself. It\'s expected to leverage its pre-trained knowledge to solve the task.\n*   **Prompt:** This is the input text you provide to the LLM, instructing it on what to do. It can be a question, a task description, or a request for generation.\n\n**How it works:**\n\nLLMs are trained on vast amounts of text and code. This pre-training allows them to:\n\n*   Understand language patterns.\n*   Recognize relationships between words and concepts.\n*   Generate text in a grammatically correct and contextually relevant manner.\n\nA zero-shot prompt t

# Querying the News API
The `get_news_articles` function retrieves articles from the Currents API. The request asks for 150 English-language articles on the topic of health.

**Parameters**

*   `api_key` : key for the Currents API
*   `keywords = "Health"` : topic of the articles
*   `language = "en"` : language of the articles
*   `page_size = 150` : number of articles

**Challenges & Outcomes**

At first, the API returned only 30 articles. Changing the parameters of the retrieval function (the request now sets `page_size=150`) fixed the issue.

In [35]:
def get_news_articles(api_key, keywords='Health', language='en', page_size=150):
    # Retrieves news articles from the Currents API based on the provided parameters.
    url = 'https://api.currentsapi.services/v1/search'
    params = {
        'keywords': keywords,
        'language': language,
        'page_size': page_size,
        'apiKey': api_key,
    }
    # Passing params lets requests URL-encode the values (e.g. spaces in keywords)
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors such as an invalid key
    data = response.json()

    print(f"Total Results: {len(data.get('news', []))}")  # Access 'news' from JSON data

    return data

# Example usage
news_data = get_news_articles(news_key)

Total Results: 150
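
To check which query string is sent, and that `page_size` is part of it, the URL can be assembled without making a request. This is a minimal sketch using only the standard library. The helper name `build_search_url` is hypothetical, and `YOUR_KEY` is a placeholder rather than a real key.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.currentsapi.services/v1/search"

def build_search_url(api_key, keywords="Health", language="en", page_size=150):
    """Assemble the search URL with URL-encoded query parameters."""
    params = {
        "keywords": keywords,
        "language": language,
        "page_size": page_size,
        "apiKey": api_key,
    }
    # urlencode escapes spaces and special characters in the values
    return f"{BASE_URL}?{urlencode(params)}"

print(build_search_url("YOUR_KEY", keywords="mental health"))
```

Encoding matters once the keyword contains spaces or punctuation, which a hand-built f-string URL would pass through unescaped.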


In [36]:
#display sample article
pprint.pprint(news_data['news'][60])

{'author': 'foxnews',
 'category': ['entertainment'],
 'description': 'King Charles III and Queen Camilla have figured out the '
                'secret to a successful marriage, and it may have something to '
                'do with where they sleep.\xa0"They have known each other '
                'since [they were] very young and are great friends. There is '
                'no competitive edge between King Charles and Queen Camilla," '
                'Britis...',
 'id': '6fab6c2d-fa87-4817-ac0d-db4b24c91f05',
 'image': 'https://static.foxnews.com/foxnews.com/content/uploads/2025/04/charles-and-camilla.jpg',
 'language': 'en',
 'published': '2025-04-09 08:00:02 +0000',
 'title': 'King Charles, Queen Camilla’s unconventional bedroom arrangement is '
          'secret sauce to staying together: expert',
 'url': 'https://www.foxnews.com/entertainment/king-charles-queen-camillas-unconventional-bedroom-arrangement-secret-sauce-staying-together-expert'}


# Preprocessing and Cleaning Retrieved Articles
This code preprocesses a collection of news articles by handling data quality issues.

It begins by printing the number of articles to be processed and the keys of a sample article. It then creates empty collections to track seen URLs and to store processed articles. For each article, it:

*   Skips articles without URLs
*   Skips articles with duplicate URLs
*   Creates a structure with consistent fields, adding default values for missing data (like "Unknown" for missing authors)
*   Adds a "PROCESSED: " prefix to titles for verification
*   Skips articles missing both title and description

The function returns the collection of cleaned, deduplicated articles. The cell then prints the article count after preprocessing and shows an original article next to a preprocessed one.

**Parameters**

*   `articles` : retrieved articles
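
The URL-based deduplication described above can be illustrated on a few toy records. This is a sketch only: the helper name `dedupe_by_url` and the `example.com` records are made up for illustration and do not come from the retrieved data.

```python
def dedupe_by_url(articles):
    """Keep the first article seen for each URL; drop entries with no URL."""
    seen_urls = set()
    unique = []
    for article in articles:
        url = article.get("url")
        if not url or url in seen_urls:
            continue  # missing URL or already seen
        seen_urls.add(url)
        unique.append(article)
    return unique

sample = [
    {"title": "A", "url": "https://example.com/a"},
    {"title": "A (repeat)", "url": "https://example.com/a"},
    {"title": "No link", "url": ""},
    {"title": "B", "url": "https://example.com/b"},
]
print([a["title"] for a in dedupe_by_url(sample)])  # ['A', 'B']
```

A set makes each membership check constant time on average, so the pass stays linear in the number of articles.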

In [37]:
#preprocessing retrieved articles: handling duplicates and missing data
def preprocess_articles(articles):
    print(f"Starting preprocessing on {len(articles)} articles")
    # Inspect the structure of a sample article (taken from the input, not a global)
    if articles:
        print("\nOriginal article keys:")
        print(sorted(articles[0].keys()))

    seen_urls = set()
    preprocessed_articles = []
    for article in articles:
        # Skip articles without a URL
        url = article.get('url')
        if not url:
            continue
        # Skip duplicate URLs
        if url in seen_urls:
            continue
        seen_urls.add(url)
        title = article.get("title")
        description = article.get("description")
        # Skip articles missing both title and description.
        # The check runs before the prefix is added; otherwise it could never match.
        if not title and not description:
            continue
        processed_article = {
            "id": article.get("id", ""),
            "title": "PROCESSED: " + (title or "No title available"),  # Add a prefix to verify
            "description": description or "No description available",
            "url": url,
            "author": article.get("author", "Unknown"),
            "image": article.get("image", ""),
            "language": article.get("language", "en"),
            "category": article.get("category", []),
            "published": article.get("published", "na"),
        }
        preprocessed_articles.append(processed_article)
    return preprocessed_articles

processed_data = preprocess_articles(news_data.get('news', []))

print(f"Total articles after preprocessing: {len(processed_data)}")

#print sample original article
print("\nOriginal article:")
pprint.pprint(news_data.get('news', [])[60])

#print sample preprocessed article
print("\nPreprocessed article:")
pprint.pprint(processed_data[60])

Starting preprocessing on 150 articles

Original article keys:
['author', 'category', 'description', 'id', 'image', 'language', 'published', 'title', 'url']
Total articles after preprocessing: 150

Original article:
{'author': 'foxnews',
 'category': ['entertainment'],
 'description': 'King Charles III and Queen Camilla have figured out the '
                'secret to a successful marriage, and it may have something to '
                'do with where they sleep.\xa0"They have known each other '
                'since [they were] very young and are great friends. There is '
                'no competitive edge between King Charles and Queen Camilla," '
                'Britis...',
 'id': '6fab6c2d-fa87-4817-ac0d-db4b24c91f05',
 'image': 'https://static.foxnews.com/foxnews.com/content/uploads/2025/04/charles-and-camilla.jpg',
 'language': 'en',
 'published': '2025-04-09 08:00:02 +0000',
 'title': 'King Charles, Queen Camilla’s unconventional bedroom arrangement is '
          'secret s

# Set Up Access to S3 Bucket

Imported the libraries required for access. Created the variables `TEAM` and `BUCKET_NAME` with the values for our team folder and the class bucket.

In [38]:
!pip install boto3



In [39]:
import os
import boto3
from botocore.config import Config
from botocore import UNSIGNED

In [40]:
TEAM = "TEAM_4/"
BUCKET_NAME = "cus635-spring2025"

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Upload a Test File to S3 Bucket
Uploaded `words.txt` to test access to S3.

In [41]:
file_path = "/content/sample_data/tmp/"
file_name = "words.txt"
object_name = file_name

s3.upload_file(file_path+file_name, BUCKET_NAME, TEAM+object_name)

FileNotFoundError: [Errno 2] No such file or directory: '/content/sample_data/tmp/words.txt'

In [None]:
response = s3.list_objects_v2(Bucket=BUCKET_NAME)
if "Contents" in response:
    print("Files in S3 Bucket:")
    for obj in response["Contents"]:
        print(f" - {obj['Key']}")
else:
    print("No files found in the bucket.")

# Upload JSON Articles to Team Folder
Created `upload_article_to_s3` function to iterate through processed articles and upload the articles individually into the team folder in the S3 Bucket.

**Parameters**


*   `articles` : list of processed articles
*   `bucket_name` : name of the S3 bucket
*   `team_folder` : team folder to store articles

**Challenges & Outcomes**

Naming the JSON file for each article and removing spaces from the file name. Fixed by using a regular expression to strip unsupported characters and then replacing spaces with underscores.
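
The naming rule can be tried on its own before any upload. The sketch below applies the same regular expression and space replacement as the upload cell. The helper name `title_to_object_key` is hypothetical, and the sample title is invented for illustration.

```python
import re

def title_to_object_key(title, team_folder):
    """Turn an article title into an S3 object key inside the team folder."""
    # Remove everything except word characters, whitespace, dots, and hyphens
    name = re.sub(r"[^\w\s.-]", "", title)
    # Replace spaces so the key contains no blanks
    name = name.replace(" ", "_")
    return f"{team_folder}{name}.json"

print(title_to_object_key("PROCESSED: Diabetes and weight-loss drug 'changed my life'", "TEAM_4/"))
# TEAM_4/PROCESSED_Diabetes_and_weight-loss_drug_changed_my_life.json
```

Two articles with the same title map to the same key, so the later upload would overwrite the earlier one.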

In [42]:
import re

def upload_article_to_s3(articles, bucket_name, team_folder):
    for index, article in enumerate(articles):
        title = article.get("title", f"article_{index + 1}")
        # Drop characters other than word characters, whitespace, dots, and hyphens
        file_name = re.sub(r"[^\w\s.-]", "", title)
        file_name = file_name.replace(" ", "_")
        object_key = f"{team_folder}{file_name}.json"
        try:
            s3.put_object(
                Body=json.dumps(article),
                Bucket=bucket_name,
                Key=object_key,
                ContentType='application/json'
            )
            print(f"Article {index + 1} uploaded successfully to s3://{bucket_name}/{object_key}")
        except Exception as e:
            print(f"Error uploading article {index + 1}: {e}")

upload_article_to_s3(processed_data, BUCKET_NAME, TEAM)

Article 1 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_The_Many_Ways_Kennedy_Is_Already_Undermining_Vaccines.json
Article 2 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Alabama_woman_has_pig_kidney_removed_after_a_record_130_days.json
Article 3 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Bernie_Sanders_says_largest_Fighting_Oligarchy_rally_with_AOC_is_making_Trump_Musk_very_nervous.json
Article 4 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_How_Health_and_Human_Safety_Department_cuts_could_affect_your_health.json
Article 5 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Exclusive_--_Ironic_How_Liberal_Media_Weaponizes_Woke_Cancel_Culture_RFK_Jr._Hammers_Fake_News_Attacks_Falsely_Claiming_He_Mocked_Handicapped.json
Article 6 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Diabetes_and_weight-loss_drug_changed_my_life_says_senator_I_feel_a_decade_younger.json
Article 7 uploa

# Clearing Team Folder
The `clear_s3_folder` function removes all files from the team folder.

**Parameters**

*   `bucket_name` : name of the S3 bucket
*   `folder_path` : name of the team folder

**Challenges & Outcomes**

As anonymous users, we are not allowed to delete files from the S3 bucket. As a result, the team folder may still contain articles from our earlier upload attempts.


In [None]:
def clear_s3_folder(bucket_name, folder_path):
    try:
        # 'Contents' is missing when the prefix matches no objects (raises KeyError)
        objects = s3.list_objects_v2(Bucket=bucket_name, Prefix=folder_path)['Contents']
        for obj in objects:
            s3.delete_object(Bucket=bucket_name, Key=obj['Key'])
            print(f"Deleted {obj['Key']}")

        print(f"Cleared folder: s3://{bucket_name}/{folder_path}")
    except KeyError:
        print(f"Folder '{folder_path}' is empty or does not exist.")
    except Exception as e:
        print(f"Error clearing folder: {e}")

clear_s3_folder(BUCKET_NAME, TEAM)

# List Files Stored in Team Folder

In [None]:
# Restrict the listing to the team folder with Prefix
response = s3.list_objects_v2(Bucket=BUCKET_NAME, Prefix=TEAM)
if "Contents" in response:
    print("Files in team folder:")
    for obj in response["Contents"]:
        print(f" - {obj['Key']}")
else:
    print("No files found in the team folder.")
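
A single `list_objects_v2` call returns at most 1,000 keys, so a larger listing needs pagination. The sketch below uses boto3's `list_objects_v2` paginator. The helper name `list_keys` is hypothetical, and the client is passed in as an argument so the function does not depend on a global.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Collect every object key under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

In this notebook the call would be `list_keys(s3, BUCKET_NAME, TEAM)`, which limits the listing to the team folder.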

In [None]:
!pip install pinecone


In [None]:
!pip install openai

In [44]:
!pip install mistralai

Collecting mistralai
  Downloading mistralai-1.6.0-py3-none-any.whl.metadata (30 kB)
Collecting eval-type-backport>=0.2.0 (from mistralai)
  Downloading eval_type_backport-0.2.2-py3-none-any.whl.metadata (2.2 kB)
Downloading mistralai-1.6.0-py3-none-any.whl (288 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 288.7/288.7 kB 6.0 MB/s eta 0:00:00
Downloading eval_type_backport-0.2.2-py3-none-any.whl (5.8 kB)
Installing collected packages: eval-type-backport, mistralai
Successfully installed eval-type-backport-0.2.2 mistralai-1.6.0


In [48]:
!pip install sentence-transformers



In [50]:
from pinecone import Pinecone
import openai
from sentence_transformers import SentenceTransformer


pc = Pinecone(api_key=userdata.get("PINECONE_KEY"))

model = SentenceTransformer("sentence-transformers/stsb-bert-large")


# Look up the host for our index by name (printed for reference)
index_id = userdata.get("INDEX_ID")
indexes = pc.list_indexes()
for idx in indexes:
    if idx['name'] == index_id:
        print(f"Host for index {index_id}: {idx['host']}")

# Connect to the index using the host stored in Colab secrets
index = pc.Index(host=userdata.get("INDEX_HOST"))

def get_articles_from_s3(bucket, prefix):
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    articles = []
    for obj in response.get('Contents', []):
        if obj['Key'].endswith('.json'):
            file_data = s3.get_object(Bucket=bucket, Key=obj['Key'])['Body'].read()
            article = json.loads(file_data)
            articles.append(article)
    return articles

def get_embedding(text):
    if not text:
        return []
    embedding = model.encode(text)
    return embedding.tolist()

def upsert_to_pinecone(index, articles, namespace="team-4"):
    records = []
    for i, article in enumerate(articles):
        text = article.get("description", "")
        if not text:
            continue
        embedding = get_embedding(text)
        record = {
            "id": f"article_{i}",
            "values": embedding,
            "metadata": {
                "chunk_text": text,
                "team": "Team 4",
                "category": "Policies",
                "title": article.get("title", ""),
                "published": article.get("published", ""),
                "url": article.get("url", "")
            }
        }
        records.append(record)

    for i in range(0, len(records), 100):
        batch = records[i:i+100]
        index.upsert(vectors=batch, namespace=namespace)
        print(f"Upserted batch {i//100 + 1} of {len(records)} vectors")

articles = get_articles_from_s3(BUCKET_NAME, TEAM)
upsert_to_pinecone(index, articles)


Host for index cus635: cus635-g311jqa.svc.aped-4627-b74a.pinecone.io
Upserted batch 1 of 450 vectors
Upserted batch 2 of 450 vectors
Upserted batch 3 of 450 vectors
Upserted batch 4 of 450 vectors
Upserted batch 5 of 450 vectors
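
The log above reports each batch number against the total vector count rather than the total number of batches. A sketch of the same batching with both figures reported separately is shown below. It uses placeholder records only and makes no Pinecone calls, and the helper name `batched` is hypothetical.

```python
import math

def batched(records, batch_size=100):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Placeholder records standing in for the 450 vectors reported above
records = [{"id": f"article_{i}"} for i in range(450)]
total_batches = math.ceil(len(records) / 100)

for number, batch in enumerate(batched(records), start=1):
    print(f"Batch {number} of {total_batches}: {len(batch)} vectors")
```

For 450 records this gives four batches of 100 and a final batch of 50, matching the five upsert calls in the log.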


{
  "namespaces": {
    "": {
      "vector_count": 122
    },
    "team-4": {
      "vector_count": 450
    }
  },
  "index_fullness": 0.0,
  "total_vector_count": 572,
  "dimension": 1024,
  "metric": "cosine",
  "vector_type": "dense"
}
