# Mini Project: Retrieving and Storing News Articles from Free News APIs
Authors: Payal Moorti & Shameka Williams

[GitHub Repository](https://github.com/meka-williams/Free-News-APIs.git)

**Project Objectives**
1. Understand how to interact with public APIs to retrieve data
2. Learn to process, filter, and store large-scale textual data
3. Gain hands-on experience integrating web mining techniques with cloud storage solutions
4. Develop team collaboration and project management skills

**Project Description**
Each team will retrieve news articles using a designated free API from the provided list, then extract, clean, and store the data in a designated S3 bucket on AWS. The project involves designing efficient workflows for API interaction, data processing, and storage, with documentation for reproducibility.

# Project Milestones and Timeline

* Week 1: Project Setup and API Familiarization
* Week 2: Data Retrieval and Preprocessing
* Week 3: Storing Data in S3
* Week 4: Final Presentation and Reporting

In [2]:
import json
import pprint

import requests
from google import genai  # google-genai SDK, which provides genai.Client
from google.colab import userdata

# API keys are read from Colab secrets
news_key = userdata.get("CURRENT_KEY")
gemma = userdata.get("GOOGLE_API_KEY")

In [3]:
# Create Gemini client
client = genai.Client(api_key=gemma)

# Utility function to send a prompt to Gemini and return the response text
def get_response(prompt, model='gemini-2.0-flash-lite'):
    response = client.models.generate_content(
        model=model,
        contents=prompt
    )
    return response.text

In [4]:
#testing get_response
get_response("What is a zero shot prompt?")

'A zero-shot prompt is a type of prompt used in natural language processing (NLP) and large language models (LLMs) that allows the model to perform a task it hasn\'t been explicitly trained on. In other words, the model is given a task *without* any specific examples or training data.\n\nHere\'s a breakdown:\n\n*   **Zero-Shot:** The model receives *no* examples of input-output pairs to learn from.\n*   **Prompt:** A text input that instructs the model what to do. This could be a question, a command, a description, or anything that guides the model\'s response.\n\n**How it Works:**\n\nThe model leverages its pre-existing knowledge and the relationships it learned during its extensive pre-training on vast amounts of text data. The prompt acts as a cue, helping the model to understand the task and generate an appropriate response based on its internal understanding of language and the world.\n\n**Example:**\n\n**Prompt:** "Translate \'Hello, world!\' to French."\n\nIn this example, the L

In [5]:
# Query the Currents API for health-related articles
url = f'https://api.currentsapi.services/v1/search?keywords=Health&language=en&page_size=150&apiKey={news_key}'
response = requests.get(url)
data = response.json()
print(data)
print(f"Total Results: {len(data.get('news', []))}")  # Access 'news' from JSON data

{'status': 'ok', 'news': [{'id': '0db8758c-b27c-480e-bbc3-35ad51095bef', 'title': "Supreme Court to hear challenge to 'conversion therapy' ban for minors", 'description': "More than 20 states restrict conversion therapy, which health providers say dangerously attempts to change young people's gender identity or sexual orientation.", 'url': 'https://www.washingtonpost.com/politics/2025/03/10/supreme-court-colorado-conversion-therapy-ban/', 'author': 'Ann Marimow', 'image': 'None', 'language': 'en', 'category': ['politics'], 'published': '2025-03-10 16:49:09 +0000'}, {'id': '1bead49b-e150-466b-bef5-3922a9a2d8f3', 'title': 'Anew Health Seeks U.S. IPO On Faltering Revenue Growth', 'description': 'Anew Health plans a $20M IPO amid slowing revenue growth and market challenges. Discover insights on its expansion plans and risks. Click for more on AVG stock now.', 'url': 'https://seekingalpha.com/article/4766190-anew-health-seeks-us-ipo-on-faltering-revenue-growth?source=feed_tag_ipo_analysis'
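
The request above interpolates the parameters directly into the URL string. As a minimal sketch, assuming the same endpoint and query parameters as that cell, the query string can instead be built with `urllib.parse.urlencode` so that keywords containing spaces or symbols are escaped. `build_search_url` is a hypothetical helper introduced here for illustration; it is not part of any Currents API client.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://api.currentsapi.services/v1/search"  # endpoint used in the cell above


def build_search_url(api_key, keywords="Health", language="en", page_size=150):
    """Hypothetical helper: assemble the search URL with escaped query parameters."""
    params = {
        "keywords": keywords,
        "language": language,
        "page_size": page_size,
        "apiKey": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# "DEMO_KEY" is a placeholder, not a real key
url = build_search_url("DEMO_KEY", keywords="public health")
print(url)
```

The resulting URL can be passed to `requests.get(url, timeout=30)`, and calling `response.raise_for_status()` before `response.json()` surfaces HTTP errors (for example an invalid key) instead of parsing an error body.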

In [6]:
#display sample article
pprint.pprint(data['news'][65])

{'author': 'MarketBeat',
 'category': ['finance'],
 'description': 'Kendall Capital Management decreased its position in shares '
                'of  CVS Health Co. (NYSE:CVS - Free Report) by 52.5% during '
                'the 4th quarter, according to the company in its most recent '
                'disclosure with the SEC. The fund owned 4,680 shares of the '
                "pharmacy operator's stock after selling 5,163 shares during",
 'id': '9c501918-931a-4965-9581-e8d7cae55c7e',
 'image': 'https://www.marketbeat.com/logos/cvs-health-co-logo-1200x675.png?v=20221020160339',
 'language': 'en',
 'published': '2025-03-10 09:19:29 +0000',
 'title': 'Kendall Capital Management Reduces Holdings in CVS Health Co. '
          '(NYSE:CVS)',
 'url': 'https://www.marketbeat.com/instant-alerts/kendall-capital-management-reduces-holdings-in-cvs-health-co-nysecvs-2025-03-10/'}


In [7]:
# Preprocess retrieved articles: drop duplicates and fill in missing fields
def preprocess_articles(articles):
    print(f"Starting preprocessing on {len(articles)} articles")
    if articles:
        # Show the keys of one raw article for reference
        print("\nOriginal article keys:")
        print(sorted(articles[0].keys()))

    seen_urls = set()
    preprocessed_articles = []
    for article in articles:
        # Skip articles without a URL
        url = article.get('url')
        if not url:
            continue
        # Skip duplicate URLs
        if url in seen_urls:
            continue
        seen_urls.add(url)
        # Skip articles missing both title and description.
        # This check runs before the "PROCESSED: " prefix is added; otherwise
        # the title could never equal the placeholder and nothing would be skipped.
        if not article.get("title") and not article.get("description"):
            continue
        processed_article = {
            "id": article.get("id", ""),
            "title": "PROCESSED: " + (article.get("title") or "No title available"),  # Prefix marks processed records
            "description": article.get("description") or "No description available",
            "url": url,
            "author": article.get("author", "Unknown"),
            "image": article.get("image", ""),
            "language": article.get("language", "en"),
            "category": article.get("category", []),
            "published": article.get("published", "na"),
        }
        preprocessed_articles.append(processed_article)
    return preprocessed_articles

processed_data = preprocess_articles(data.get('news', []))

print(f"Total articles after preprocessing: {len(processed_data)}")

#print sample original article
print("\nOriginal article:")
pprint.pprint(data.get('news', [])[17])

#print sample preprocessed article
print("\nPreprocessed article:")
pprint.pprint(processed_data[17])

Starting preprocessing on 150 articles

Original article keys:
['author', 'category', 'description', 'id', 'image', 'language', 'published', 'title', 'url']
Total articles after preprocessing: 150

Original article:
{'author': 'Susie Webb',
 'category': ['regional', 'pennsylvania'],
 'description': 'Texas health officials say an unvaccinated school-aged child '
                'was hospitalized in Lubbock last week and tested positive for '
                'measles. That case sadly turned deadly. Texas health '
                'officials say they are s...',
 'id': '2c8e0a72-6717-4d55-b739-f73e84e02908',
 'image': 'https://kubrick.htvapps.com/htv-prod-media.s3.amazonaws.com/images/measles-thumb-2-67c8d01ebfb3d.jpg?crop=0.912xw:0.912xh;0.0433xw,0.0313xh&resize=1200:*',
 'language': 'en',
 'published': '2025-03-10 13:30:00 +0000',
 'title': 'Get the Facts: Measles cases are 3 times higher this year',
 'url': 'https://www.wgal.com/article/measles-cases-vaccination-trends/64057180'}

Prepro
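
The `published` field in the sample articles is a string such as `'2025-03-10 13:30:00 +0000'`, and the preprocessing step falls back to `"na"` when it is missing. A possible further cleaning step, sketched here under the assumption that every timestamp follows that same layout, is to convert the field to ISO 8601. `normalize_published` is a hypothetical helper, not something the notebook defines.

```python
from datetime import datetime

# Layout assumed from the sample output, e.g. '2025-03-10 13:30:00 +0000'
PUBLISHED_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def normalize_published(value):
    """Hypothetical helper: return an ISO 8601 string, or None if parsing fails."""
    try:
        return datetime.strptime(value, PUBLISHED_FORMAT).isoformat()
    except (TypeError, ValueError):
        # Covers missing values (None) and placeholders such as "na"
        return None


print(normalize_published("2025-03-10 13:30:00 +0000"))  # 2025-03-10T13:30:00+00:00
print(normalize_published("na"))                          # None
```

Returning `None` rather than raising keeps one malformed record from stopping the whole batch; whether to drop or keep such records is a choice left to the pipeline.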

In [8]:
!pip install boto3

Collecting boto3
  Downloading boto3-1.37.9-py3-none-any.whl.metadata (6.6 kB)
Collecting botocore<1.38.0,>=1.37.9 (from boto3)
  Downloading botocore-1.37.9-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.12.0,>=0.11.0 (from boto3)
  Downloading s3transfer-0.11.4-py3-none-any.whl.metadata (1.7 kB)
Downloading boto3-1.37.9-py3-none-any.whl (139 kB)
Downloading botocore-1.37.9-py3-none-any.whl (13.4 MB)
Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)
Downloading s3transfer-0.11.4-py3-none-any.whl (84 kB)

In [9]:
import os
import boto3
from botocore.config import Config
from botocore import UNSIGNED

In [10]:
TEAM = "TEAM_4/"
BUCKET_NAME = "cus635-spring2025"

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

In [12]:
file_path = "/content/sample_data/tmp/"
file_name = "words.txt"
object_name = file_name

s3.upload_file(file_path+file_name, BUCKET_NAME, TEAM+object_name)

In [13]:
response = s3.list_objects_v2(Bucket=BUCKET_NAME)
if "Contents" in response:
    print("Files in S3 Bucket:")
    for obj in response["Contents"]:
        print(f" - {obj['Key']}")
else:
    print("No files found in the bucket.")

Files in S3 Bucket:
 - TEAM_1/words.txt
 - TEAM_1/words_01.txt
 - TEAM_2/
 - TEAM_2//Unknown_Keep-smiling-and-goals-will-come---Maresca-tells-Palmer.json
 - TEAM_2/ADG_1-1.-La-Sarriana-suma-un-punto-ante-el-Silva-gracias-a-un-penalti-de-Boedo.json
 - TEAM_2/Adam-Chitwood_Is-‘SNL’-New-Tonight?-Who’s-Hosting-the-Next-Episode.json
 - TEAM_2/Alex-Conrad_Troy-Deeney-claims-Rasmus-Hojlund-is-not-Man-United&#8217;s-biggest-problem,-but-one-&#8216;frustration&#8217;-remains.json
 - TEAM_2/Amber-Raiken_Cristiano-Ronaldo-raises-eyebrows-for-jokingly-calling-lookalike-fan-‘ugly’.json
 - TEAM_2/Annett-Meiritz_USA:-„Gold-Card“-–-Trump-kündigt-Visa-für-fünf-Millionen-Dollar-an.json
 - TEAM_2/Ariel-Cabrera_Resultados-de-las-loterías-y-chances-de-este-sábado-8-marzo-de-2025-en-Colombia.json
 - TEAM_2/Bola-Badmus_Abolish-Sarkin-Sasa-stool,-pan-Yoruba-group-tells-Olubadan.json
 - TEAM_2/Catootje_Entertainment-•-Een-huis-vol,-ook-buiten-het-tv-seizoen.-Deel-4.json
 - TEAM_2/Cpl.-Oliver-Nisbet_QUART-25.2:
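
The listing above covers the whole bucket, so other teams' objects appear too, and a single `list_objects_v2` call returns at most 1,000 keys. A minimal sketch of a narrower listing, assuming the same client and bucket as the cells above: pass the team prefix and let a boto3 paginator follow continuation tokens. `list_team_keys` is a hypothetical helper name.

```python
def list_team_keys(s3_client, bucket, prefix):
    """Hypothetical helper: collect every key under `prefix`, following pagination."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matches
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

With the names defined earlier in the notebook, the call would be `list_team_keys(s3, BUCKET_NAME, TEAM)`.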

In [15]:
type(processed_data)

list
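
Since `processed_data` is a list of dictionaries, one way to store it is to write each article as its own JSON object under the team prefix. The sketch below is an assumption about how that step could look, not the project's confirmed method: `article_to_s3_object` and `upload_articles` are hypothetical helpers, and the key is built from the article `id` to avoid the spaces, `?`, and HTML entities visible in some of the title-based keys listed above.

```python
import json


def article_to_s3_object(article, prefix):
    """Hypothetical helper: return (key, body) for one article dict."""
    key = f"{prefix}{article['id']}.json"
    # ensure_ascii=False keeps non-English titles readable in the stored file
    body = json.dumps(article, ensure_ascii=False).encode("utf-8")
    return key, body


def upload_articles(s3_client, bucket, prefix, articles):
    """Upload each article as its own JSON object and return the keys written."""
    keys = []
    for article in articles:
        key, body = article_to_s3_object(article, prefix)
        s3_client.put_object(
            Bucket=bucket,
            Key=key,
            Body=body,
            ContentType="application/json",
        )
        keys.append(key)
    return keys


# Illustrative record only; real input would be `processed_data`
sample = {"id": "abc-123", "title": "Example", "url": "https://example.com/a"}
key, body = article_to_s3_object(sample, "TEAM_4/")
print(key)  # TEAM_4/abc-123.json
```

With the notebook's own names, the upload would be `upload_articles(s3, BUCKET_NAME, TEAM, processed_data)`.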