<a href="https://colab.research.google.com/github/meka-williams/Free-News-APIs/blob/main/Team%204%20News%20API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini Project: Retrieving and Storing News Articles from Free News APIs
Authors: Payal Moorti & Shameka Williams

[GitHub Repository](https://github.com/meka-williams/Free-News-APIs.git)

**Project Objectives**
1. Understand how to interact with public APIs to retrieve data
2. Learn to process, filter, and store large-scale textual data
3. Gain hands-on experience integrating web mining techniques with cloud storage solutions
4. Develop team collaboration and project management skills

**Project Description**
Each team will retrieve news articles using a designated free API from the provided list, then extract, clean, and store the data in a designated S3 bucket on AWS. The project involves designing efficient workflows for API interaction, data processing, and storage, with documentation for reproducibility.

# Project Milestones and Timeline

* Week 1: Project Setup and API Familiarization
* Week 2: Data Retrieval and Preprocessing
* Week 3: Storing Data in S3
* Week 4: Final Presentation and Reporting

In [16]:
import requests
import json
import pprint
from google import genai  # google-genai SDK, which provides genai.Client
from google.colab import userdata

news_key=userdata.get("CURRENT_KEY")
gemma=userdata.get("GOOGLE_API_KEY")

In [17]:
# Create Gemini client
client = genai.Client(api_key=gemma)

# Utility function that sends a prompt to Gemini and returns the text reply
def get_response(prompt, model='gemini-2.0-flash-lite'):
    response = client.models.generate_content(
        model=model,
        contents=prompt
    )
    return response.text

In [18]:
#testing get_response
get_response("What is a zero shot prompt?")

'A zero-shot prompt is a specific type of prompt used in the context of **Natural Language Processing (NLP)**, particularly with large language models (LLMs) like GPT-3, GPT-4, or Bard.  It allows you to get a model to perform a task **without providing any examples** of how to perform that task beforehand.\n\nHere\'s a breakdown:\n\n*   **Zero-Shot:** The "zero" indicates that **zero examples** of the desired output are given to the model during the prompting phase.  The model relies purely on its pre-existing knowledge and the instructions in the prompt.\n*   **Prompt:**  The input you provide to the language model. It\'s a carefully crafted instruction or question designed to guide the model\'s response.\n\n**How it Works:**\n\nThe core principle of zero-shot learning relies on the massive amount of data the LLM has been trained on.  During training, these models learn intricate relationships between words, concepts, and tasks.  By providing a clear, concise, and well-defined prompt

In [19]:
def get_news_articles(api_key, keywords='Health', language='en', page_size=150):
    """Retrieve news articles from the Currents API search endpoint."""
    url = 'https://api.currentsapi.services/v1/search'
    params = {
        'keywords': keywords,
        'language': language,
        'page_size': page_size,
        'apiKey': api_key,
    }
    # Passing params lets requests URL-encode values such as multi-word keywords
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors
    data = response.json()

    print(f"Total Results: {len(data.get('news', []))}")  # Count of articles under the 'news' key

    return data

# Example usage
news_data = get_news_articles(news_key)

Total Results: 150


In [20]:
#display sample article
pprint.pprint(news_data['news'][60])

{'author': 'Author Raman Pathik',
 'category': ['company'],
 'description': "Now is the time to take a 'You-Turn' away from sickness "
                'towards happy emotional health... Post this\n'
                '\n'
                '"Young people, especially, are under so much pressure and '
                'experience an unprecedented level of anxi...',
 'id': '75ef17f8-ec9f-4d96-ade6-6292f785a58c',
 'image': 'https://mma.prnewswire.com/media/2629752/cover.jpg?p=facebook',
 'language': 'en',
 'published': '2025-03-10 12:00:00 +0000',
 'title': "Simple Solution Expert Brings Latest in Transforming One's Life and "
          'Reducing Stress',
 'url': 'https://www.prweb.com/releases/simple-solution-expert-brings-latest-in-transforming-ones-life-and-reducing-stress-302392819.html'}
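
The `published` field in the sample above is a plain string. If later steps need to sort or filter by date, it can be parsed into a timezone-aware `datetime`. The sketch below is not part of the original notebook; `parse_published` and `PUBLISHED_FORMAT` are hypothetical names, and the format string assumes every article uses the layout seen in this one sample.

```python
from datetime import datetime, timezone

# Assumed layout, taken from the sample article: '2025-03-10 12:00:00 +0000'
PUBLISHED_FORMAT = "%Y-%m-%d %H:%M:%S %z"

def parse_published(value):
    """Return a timezone-aware UTC datetime, or None if the value does not match."""
    try:
        return datetime.strptime(value, PUBLISHED_FORMAT).astimezone(timezone.utc)
    except (TypeError, ValueError):
        # Covers missing values (None) and placeholders such as 'na'
        return None

parsed = parse_published("2025-03-10 12:00:00 +0000")
print(parsed.isoformat())  # 2025-03-10T12:00:00+00:00
```

Returning `None` rather than raising keeps one malformed timestamp from stopping a batch of articles.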


In [21]:
#preprocessing retrieved articles: handling duplicates and missing data
def preprocess_articles(articles):
    print(f"Starting preprocessing on {len(articles)} articles")
    if articles:
        # Show the schema from the first article passed in rather than a global variable
        print("\nOriginal article keys:")
        print(sorted(articles[0].keys()))

    seen_urls = set()
    preprocessed_articles = []
    for article in articles:
        # Skip articles without a url
        url = article.get('url')
        if not url:
            continue
        # Skip duplicate urls
        if url in seen_urls:
            continue
        seen_urls.add(url)
        title = article.get("title")
        description = article.get("description")
        # Skip articles missing both title and description (checked before the prefix is added)
        if not title and not description:
            continue
        processed_article = {
            "id": article.get("id", ""),
            "title": "PROCESSED: " + (title or "No title available"),  # Prefix added to verify preprocessing ran
            "description": description or "No description available",
            "url": url,
            "author": article.get("author", "Unknown"),
            "image": article.get("image", ""),
            "language": article.get("language", "en"),
            "category": article.get("category", []),
            "published": article.get("published", "na"),
        }
        preprocessed_articles.append(processed_article)
    return preprocessed_articles

processed_data = preprocess_articles(news_data.get('news', []))

print(f"Total articles after preprocessing: {len(processed_data)}")

#print sample original article
print("\nOriginal article:")
pprint.pprint(news_data.get('news', [])[60])

#print sample preprocessed article
print("\nPreprocessed article:")
pprint.pprint(processed_data[60])

Starting preprocessing on 150 articles

Original article keys:
['author', 'category', 'description', 'id', 'image', 'language', 'published', 'title', 'url']
Total articles after preprocessing: 150

Original article:
{'author': 'Author Raman Pathik',
 'category': ['company'],
 'description': "Now is the time to take a 'You-Turn' away from sickness "
                'towards happy emotional health... Post this\n'
                '\n'
                '"Young people, especially, are under so much pressure and '
                'experience an unprecedented level of anxi...',
 'id': '75ef17f8-ec9f-4d96-ade6-6292f785a58c',
 'image': 'https://mma.prnewswire.com/media/2629752/cover.jpg?p=facebook',
 'language': 'en',
 'published': '2025-03-10 12:00:00 +0000',
 'title': "Simple Solution Expert Brings Latest in Transforming One's Life and "
          'Reducing Stress',
 'url': 'https://www.prweb.com/releases/simple-solution-expert-brings-latest-in-transforming-ones-life-and-reducing-stress-302392

In [22]:
!pip install boto3



In [23]:
import os
import boto3
from botocore.config import Config
from botocore import UNSIGNED

In [24]:
TEAM = "TEAM_4/"
BUCKET_NAME = "cus635-spring2025"

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

In [25]:
file_path = "/content/sample_data/tmp/"
file_name = "words.txt"
object_name = file_name

s3.upload_file(file_path+file_name, BUCKET_NAME, TEAM+object_name)

In [26]:
response = s3.list_objects_v2(Bucket=BUCKET_NAME)
if "Contents" in response:
    print("Files in S3 Bucket:")
    for obj in response["Contents"]:
        print(f" - {obj['Key']}")
else:
    print("No files found in the bucket.")

Files in S3 Bucket:
 - TEAM_1/words.txt
 - TEAM_1/words_01.txt
 - TEAM_2/
 - TEAM_2//Unknown_Keep-smiling-and-goals-will-come---Maresca-tells-Palmer.json
 - TEAM_2/ADG_1-1.-La-Sarriana-suma-un-punto-ante-el-Silva-gracias-a-un-penalti-de-Boedo.json
 - TEAM_2/Adam-Chitwood_Is-‘SNL’-New-Tonight?-Who’s-Hosting-the-Next-Episode.json
 - TEAM_2/Alex-Conrad_Troy-Deeney-claims-Rasmus-Hojlund-is-not-Man-United&#8217;s-biggest-problem,-but-one-&#8216;frustration&#8217;-remains.json
 - TEAM_2/Amber-Raiken_Cristiano-Ronaldo-raises-eyebrows-for-jokingly-calling-lookalike-fan-‘ugly’.json
 - TEAM_2/Annett-Meiritz_USA:-„Gold-Card“-–-Trump-kündigt-Visa-für-fünf-Millionen-Dollar-an.json
 - TEAM_2/Ariel-Cabrera_Resultados-de-las-loterías-y-chances-de-este-sábado-8-marzo-de-2025-en-Colombia.json
 - TEAM_2/Bola-Badmus_Abolish-Sarkin-Sasa-stool,-pan-Yoruba-group-tells-Olubadan.json
 - TEAM_2/Catootje_Entertainment-•-Een-huis-vol,-ook-buiten-het-tv-seizoen.-Deel-4.json
 - TEAM_2/Cpl.-Oliver-Nisbet_QUART-25.2:
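
A single `list_objects_v2` call returns at most 1,000 keys, and the listing above covers every team's folder. The sketch below, which is not part of the original notebook, uses boto3's paginator to walk all pages and a `Prefix` to restrict the listing to one folder. `list_keys` is a hypothetical helper name, and the client is passed in as an argument so the function can be exercised without network access.

```python
def list_keys(s3_client, bucket_name, prefix=""):
    """Collect every object key under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # 'Contents' is absent when a page has no matching objects
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

With the client and constants defined earlier, `list_keys(s3, BUCKET_NAME, TEAM)` would return only the keys under `TEAM_4/`.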

In [40]:
import re

def upload_article_to_s3(articles, bucket_name, team_folder):
    for index, article in enumerate(articles):
        # Build a file name from the title, keeping only word characters, whitespace, '.', and '-'
        title = article.get("title", f"article_{index + 1}")
        file_name = re.sub(r"[^\w\s.-]", "", title)
        file_name = file_name.replace(" ", "_")
        object_key = f"{team_folder}{file_name}.json"
        try:
            # Store each article as its own JSON object
            s3.put_object(
                Body=json.dumps(article),
                Bucket=bucket_name,
                Key=object_key,
                ContentType='application/json'
            )
            print(f"Article {index + 1} uploaded successfully to s3://{bucket_name}/{object_key}")
        except Exception as e:
            print(f"Error uploading article {index + 1}: {e}")

upload_article_to_s3(processed_data, BUCKET_NAME, TEAM)

Article 1 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Inmate_waiting_weeks_for_mental_health_transfer_dies_from_neglect_Illinois_suit_says.json
Article 2 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Free_dental_screenings_and_kits_at_Carrollton_Health__Safety_Fair.json
Article 3 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Teen_died_after_suffering_bleeding_in_her_brain_following_fall_during_epileptic_seizure_inquest_hears__BreakingNews.ie.json
Article 4 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_What_started_as_fatigue_turned_out_to_be_kidney_failure_for_Becky.json
Article 5 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Singer_David_Kushner_cancels_gig_after_mental_health_struggles.json
Article 6 uploaded successfully to s3://cus635-spring2025/TEAM_4/PROCESSED_Nasdaq_plunges_3_SP_drops_2_and_Dow_falls_as_well_as_Wall_Street_sell-off_continues.json
Article 7 uploaded successfully to s3
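
Because the object key is derived from the title alone, two articles with the same title would write to the same key, and the later upload would overwrite the earlier one. One possible variation, which is not what the notebook does, appends a short piece of the article `id` and caps the title length. `make_object_key` and `max_title_chars` are hypothetical names introduced for illustration, and the sample values come from the article displayed earlier.

```python
import re

def make_object_key(article, team_folder, index, max_title_chars=100):
    """Build an S3 key from the title plus a short id suffix (hypothetical scheme)."""
    title = article.get("title") or f"article_{index + 1}"
    # Same character filter as the upload function: keep word chars, whitespace, '.', '-'
    slug = re.sub(r"[^\w\s.-]", "", title)
    # Collapse any run of whitespace (including newlines) into one underscore
    slug = re.sub(r"\s+", "_", slug.strip())[:max_title_chars]
    # Fall back to the 1-based position when the article has no id
    suffix = (article.get("id") or str(index + 1))[:8]
    return f"{team_folder}{slug}_{suffix}.json"

sample = {
    "id": "75ef17f8-ec9f-4d96-ade6-6292f785a58c",
    "title": "PROCESSED: Simple Solution Expert Brings Latest in "
             "Transforming One's Life and Reducing Stress",
}
print(make_object_key(sample, "TEAM_4/", 0))
```

The first eight characters of the id are an arbitrary choice. A longer suffix lowers the chance of collisions at the cost of longer keys.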

In [36]:
def clear_s3_folder(bucket_name, folder_path):
    try:
        # List once; 'Contents' is absent when nothing matches the prefix
        response = s3.list_objects_v2(Bucket=bucket_name, Prefix=folder_path)
        objects = response.get('Contents', [])
        if not objects:
            print(f"Folder '{folder_path}' is empty or does not exist.")
            return
        for obj in objects:
            s3.delete_object(Bucket=bucket_name, Key=obj['Key'])
            print(f"Deleted {obj['Key']}")

        print(f"Cleared folder: s3://{bucket_name}/{folder_path}")
    except Exception as e:
        print(f"Error clearing folder: {e}")

clear_s3_folder(BUCKET_NAME, TEAM)

Error clearing folder: An error occurred (AccessDenied) when calling the DeleteObject operation: Access Denied


In [41]:
response = s3.list_objects_v2(Bucket=BUCKET_NAME)
if "Contents" in response:
    print("Files in S3 Bucket:")
    for obj in response["Contents"]:
        print(f" - {obj['Key']}")
else:
    print("No files found in the bucket.")

Files in S3 Bucket:
 - TEAM_1/words.txt
 - TEAM_1/words_01.txt
 - TEAM_2/
 - TEAM_2//Unknown_Keep-smiling-and-goals-will-come---Maresca-tells-Palmer.json
 - TEAM_2/ADG_1-1.-La-Sarriana-suma-un-punto-ante-el-Silva-gracias-a-un-penalti-de-Boedo.json
 - TEAM_2/Adam-Chitwood_Is-‘SNL’-New-Tonight?-Who’s-Hosting-the-Next-Episode.json
 - TEAM_2/Alex-Conrad_Troy-Deeney-claims-Rasmus-Hojlund-is-not-Man-United&#8217;s-biggest-problem,-but-one-&#8216;frustration&#8217;-remains.json
 - TEAM_2/Amber-Raiken_Cristiano-Ronaldo-raises-eyebrows-for-jokingly-calling-lookalike-fan-‘ugly’.json
 - TEAM_2/Annett-Meiritz_USA:-„Gold-Card“-–-Trump-kündigt-Visa-für-fünf-Millionen-Dollar-an.json
 - TEAM_2/Ariel-Cabrera_Resultados-de-las-loterías-y-chances-de-este-sábado-8-marzo-de-2025-en-Colombia.json
 - TEAM_2/Bola-Badmus_Abolish-Sarkin-Sasa-stool,-pan-Yoruba-group-tells-Olubadan.json
 - TEAM_2/Catootje_Entertainment-•-Een-huis-vol,-ook-buiten-het-tv-seizoen.-Deel-4.json
 - TEAM_2/Cpl.-Oliver-Nisbet_QUART-25.2: