<a href="https://colab.research.google.com/github/meka-williams/Free-News-APIs/blob/main/Team%204%20News%20API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini Project: Retrieving and Storing News Articles from Free News APIs
Authors: Payal Moorti & Shameka Williams

[GitHub Repository](https://github.com/meka-williams/Free-News-APIs.git)

**Project Objectives**
1. Understand how to interact with public APIs to retrieve data
2. Learn to process, filter, and store large-scale textual data
3. Gain hands-on experience integrating web mining techniques with cloud storage solutions
4. Develop team collaboration and project management skills

**Project Description**
Each team will retrieve news articles using a designated free API from the provided list, then extract, clean, and store the data in a designated S3 bucket on AWS. The project involves designing efficient workflows for API interaction, data processing, and storage, with documentation for reproducibility.

# Project Milestones and Timeline

* Week 1: Project Setup and API Familiarization
* Week 2: Data Retrieval and Preprocessing
* Week 3: Storing Data in S3
* Week 4: Final Presentation and Reporting

In [1]:
import json
import pprint

import requests
from google import genai  # google-genai SDK; provides genai.Client used below
from google.colab import userdata

news_key=userdata.get("CURRENT_KEY")
gemma=userdata.get("GOOGLE_API_KEY")

In [2]:
# Create Gemini client
client = genai.Client(api_key=gemma)

# Utility function: send a prompt to Gemini and return the response text
def get_response(prompt, model='gemini-2.0-flash-lite'):
    response = client.models.generate_content(
        model=model,
        contents=prompt
    )
    return response.text

In [3]:
#testing get_response
get_response("What is a zero shot prompt?")

'A zero-shot prompt is a type of prompt used with large language models (LLMs) where the model is asked to perform a task without being explicitly trained on that specific task or provided with any examples beforehand. It\'s like giving the model a task and expecting it to figure out how to do it based solely on its pre-existing knowledge and understanding of language.\n\nHere\'s a breakdown:\n\n*   **Zero-shot:** The model hasn\'t "seen" (or been trained on) examples related to the task it\'s being asked to do.\n*   **Prompt:** This is the instruction or question given to the LLM. It\'s the input that guides the model\'s response.\n\n**How it Works:**\n\nLLMs are trained on massive datasets of text and code. This pre-training allows them to:\n\n*   Understand language structure and grammar.\n*   Identify relationships between words and concepts.\n*   Recognize patterns in text.\n*   Have a broad understanding of the world based on the data they were trained on.\n\nWhen given a zero-sh

In [32]:
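def get_news_articles(api_key, keywords='Health', language='en', page_size=150):
    # Retrieve news articles from the Currents API search endpoint.
    url = 'https://api.currentsapi.services/v1/search'
    params = {'keywords': keywords, 'language': language,
              'page_size': page_size, 'apiKey': api_key}
    # Passing params lets requests URL-encode the values (e.g. spaces in keywords)
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    data = response.json()

    # Articles are under the 'news' key; this counts the articles in this response
    print(f"Total Results: {len(data.get('news', []))}")

    return data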

# Example usage
news_data = get_news_articles(news_key)

Total Results: 150


In [36]:
#display sample article
pprint.pprint(news_data['news'][60])

{'author': 'Author Raman Pathik',
 'category': ['company'],
 'description': "Now is the time to take a 'You-Turn' away from sickness "
                'towards happy emotional health... Post this\n'
                '\n'
                '"Young people, especially, are under so much pressure and '
                'experience an unprecedented level of anxi...',
 'id': '75ef17f8-ec9f-4d96-ade6-6292f785a58c',
 'image': 'https://mma.prnewswire.com/media/2629752/cover.jpg?p=facebook',
 'language': 'en',
 'published': '2025-03-10 12:00:00 +0000',
 'title': "Simple Solution Expert Brings Latest in Transforming One's Life and "
          'Reducing Stress',
 'url': 'https://www.prweb.com/releases/simple-solution-expert-brings-latest-in-transforming-ones-life-and-reducing-stress-302392819.html'}
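The sample above shows two fields that may benefit from extra cleaning before storage: `published` is a plain string, and `description` contains embedded newlines. The following is a minimal sketch, assuming every timestamp follows the `YYYY-MM-DD HH:MM:SS +0000` pattern seen in this sample. The helper names `parse_published` and `clean_text` are hypothetical and are not part of the notebook's pipeline.

```python
from datetime import datetime, timezone

def parse_published(value):
    """Parse 'YYYY-MM-DD HH:MM:SS +0000' into an aware UTC datetime, or None."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S %z")
    except (TypeError, ValueError):
        # Missing value or a format other than the one assumed above
        return None
    return parsed.astimezone(timezone.utc)

def clean_text(value):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(value.split()) if isinstance(value, str) else ""

published = parse_published("2025-03-10 12:00:00 +0000")
print(published.isoformat())  # 2025-03-10T12:00:00+00:00
print(clean_text("towards happy emotional health... Post this\n\n\"Young people"))
```

Returning `None` for unparseable timestamps keeps the record and lets a later step decide whether to drop it.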


In [35]:
# Preprocess retrieved articles: handle duplicates and missing data
def preprocess_articles(articles):
    print(f"Starting preprocessing on {len(articles)} articles")
    if articles:
        # Inspect the first article passed in, not a hard-coded index into the global news_data
        print("\nOriginal article keys:")
        print(sorted(articles[0].keys()))

    seen_urls = set()
    preprocessed_articles = []
    for article in articles:
        # Skip articles without a url
        url = article.get('url')
        if not url:
            continue
        # Skip duplicate urls
        if url in seen_urls:
            continue
        seen_urls.add(url)
        title = article.get("title")
        description = article.get("description")
        # Skip articles missing both title and description. Test the raw values:
        # the prefixed title built below never equals the placeholder string.
        if not title and not description:
            continue
        processed_article = {
            "id": article.get("id", ""),
            "title": "PROCESSED: " + (title or "No title available"),  # Prefix marks the record as preprocessed
            "description": description or "No description available",
            "url": url,
            "author": article.get("author", "Unknown"),
            "image": article.get("image", ""),
            "language": article.get("language", "en"),
            "category": article.get("category", []),
            "published": article.get("published", "na"),
        }
        preprocessed_articles.append(processed_article)
    return preprocessed_articles

processed_data = preprocess_articles(news_data.get('news', []))

print(f"Total articles after preprocessing: {len(processed_data)}")

#print sample original article
print("\nOriginal article:")
pprint.pprint(news_data.get('news', [])[60])

#print sample preprocessed article
print("\nPreprocessed article:")
pprint.pprint(processed_data[60])

Starting preprocessing on 150 articles

Original article keys:
['author', 'category', 'description', 'id', 'image', 'language', 'published', 'title', 'url']
Total articles after preprocessing: 150

Original article:
{'author': 'Author Raman Pathik',
 'category': ['company'],
 'description': "Now is the time to take a 'You-Turn' away from sickness "
                'towards happy emotional health... Post this\n'
                '\n'
                '"Young people, especially, are under so much pressure and '
                'experience an unprecedented level of anxi...',
 'id': '75ef17f8-ec9f-4d96-ade6-6292f785a58c',
 'image': 'https://mma.prnewswire.com/media/2629752/cover.jpg?p=facebook',
 'language': 'en',
 'published': '2025-03-10 12:00:00 +0000',
 'title': "Simple Solution Expert Brings Latest in Transforming One's Life and "
          'Reducing Stress',
 'url': 'https://www.prweb.com/releases/simple-solution-expert-brings-latest-in-transforming-ones-life-and-reducing-stress-302392

In [7]:
!pip install boto3

Collecting boto3
  Downloading boto3-1.37.9-py3-none-any.whl.metadata (6.6 kB)
Collecting botocore<1.38.0,>=1.37.9 (from boto3)
  Downloading botocore-1.37.9-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.12.0,>=0.11.0 (from boto3)
  Downloading s3transfer-0.11.4-py3-none-any.whl.metadata (1.7 kB)
Downloading boto3-1.37.9-py3-none-any.whl (139 kB)
Downloading botocore-1.37.9-py3-none-any.whl (13.4 MB)
Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)
Downloading s3transfer-0.11.4-py3-none-any.whl (84 kB)

In [8]:
import os
import boto3
from botocore.config import Config
from botocore import UNSIGNED

In [9]:
TEAM = "TEAM_4/"
BUCKET_NAME = "cus635-spring2025"

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

In [10]:
file_path = "/content/sample_data/tmp/"
file_name = "words.txt"
object_name = file_name

s3.upload_file(file_path+file_name, BUCKET_NAME, TEAM+object_name)

FileNotFoundError: [Errno 2] No such file or directory: '/content/sample_data/tmp/words.txt'
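The upload above fails because `words.txt` does not exist in the Colab filesystem. One way to avoid depending on a pre-existing file is to write the preprocessed articles to a JSON file first and upload that. The sketch below assumes the articles are a list of JSON-serializable dicts, as `processed_data` appears to be. The function names and the `articles.json` file name are hypothetical; `upload_file(Filename, Bucket, Key)` is the same boto3 call used above.

```python
import json
import os
import tempfile

def save_articles_as_json(articles, directory, file_name="articles.json"):
    """Write a list of article dicts to a JSON file and return its path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, file_name)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)
    return path

def upload_articles(s3_client, articles, bucket, key_prefix, file_name="articles.json"):
    """Serialize articles to a temporary directory, then upload the file to S3."""
    key = key_prefix + file_name
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = save_articles_as_json(articles, tmp_dir, file_name)
        s3_client.upload_file(path, bucket, key)
    return key

# Local round-trip check on made-up data; no S3 access is needed for this part
sample = [{"id": "1", "title": "Example", "url": "https://example.com/a"}]
local_path = save_articles_as_json(sample, tempfile.mkdtemp())
with open(local_path, encoding="utf-8") as f:
    round_tripped = json.load(f)
print(round_tripped == sample)  # True
```

In this notebook the call would look like `upload_articles(s3, processed_data, BUCKET_NAME, TEAM)`, assuming the bucket accepts uploads from the unsigned client configured earlier.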

In [None]:
response = s3.list_objects_v2(Bucket=BUCKET_NAME)
if "Contents" in response:
    print("Files in S3 Bucket:")
    for obj in response["Contents"]:
        print(f" - {obj['Key']}")
else:
    print("No files found in the bucket.")
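A single `list_objects_v2` call returns at most 1,000 keys and, without a prefix, covers the whole shared bucket. The sketch below narrows the listing to one team's prefix and follows pagination using boto3's paginator. The helper name `list_keys` is hypothetical, and whether the unsigned client is permitted to list this bucket is an assumption.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Return every object key under prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects have no "Contents" entry
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# Intended usage with the client and constants defined earlier in the notebook:
# for key in list_keys(s3, BUCKET_NAME, TEAM):
#     print(f" - {key}")
```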

In [None]:
type(processed_data)