<a href="https://colab.research.google.com/github/karegapauline/Analysis_papers_and_media_GS/blob/main/metaanalysis_mediaarticles_alone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Methodology Part 2

Article search to gauge how the media portrays air quality and health

We use Feedparser v6.0.12 to download and parse news feeds returned by Google News RSS searches. We obtained news articles about air quality and health in Kenya, South Africa, and the United Kingdom, targeting media houses that publish both digital and print articles. For Kenya, we used Nation Africa, Standard Media, and The Star; for South Africa, Daily Maverick, TimesLIVE, and News24; and for the UK, the BBC, The Guardian, and The Telegraph. We did not restrict the publication date and collected every article the feeds returned. Our search terms were as follows: "air pollution", "air quality", "respiratory illness", "respiratory disease", and "air quality policy".

In [None]:
pip install feedparser

Collecting feedparser
  Downloading feedparser-6.0.12-py3-none-any.whl.metadata (2.7 kB)
Collecting sgmllib3k (from feedparser)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Downloading feedparser-6.0.12-py3-none-any.whl (81 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.5/81.5 kB 2.7 MB/s eta 0:00:00
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6046 sha256=6199a6337f7e295f3a1ce9726a6da4a3d524d17b5af39b272936cbca186614ee
  Stored in directory: /root/.cache/pip/wheels/03/f5/1a/23761066dac1d0e8e683e5fdb27e12de53209d05a4a37e6246
Successfully built sgmllib3k
Installing collected packages: sgmllib3k, feedparser
Successfully installed feedparser-6.0.12 sgmllib3k-1.0.0


In [None]:
import feedparser
import pandas as pd
import time
from urllib.parse import quote_plus

# ------------------------
# CONFIGURATION
# ------------------------

SEARCH_TERMS = [
    "air pollution",
    "air quality",
    "respiratory illness",
    "respiratory disease",
    "air quality policy"
]

COUNTRY_SOURCES = {
    "Kenya": ["nation.africa", "standardmedia.co.ke", "the-star.co.ke"],
    "South Africa": ["dailymaverick.co.za", "timeslive.co.za", "news24.com"],
    "UK": ["bbc.co.uk", "theguardian.com", "telegraph.co.uk"]
}

# ------------------------
# FUNCTION TO PARSE GOOGLE RSS
# ------------------------

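def fetch_articles(search_term, site):
    """Query Google News RSS for one search term restricted to one site.

    Returns a list of dicts, one per feed entry.
    """
    query = f'{search_term} site:{site}'
    encoded_query = quote_plus(query)
    # Locale parameters are fixed to the GB English edition for every country
    url = f"https://news.google.com/rss/search?q={encoded_query}&hl=en-GB&gl=GB&ceid=GB:en"

    feed = feedparser.parse(url)
    articles = []

    for entry in feed.entries:
        articles.append({
            "search_term": search_term,
            "source_site": site,
            # .get() avoids an AttributeError if a field is missing from an entry
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "published": entry.get("published", ""),
            "summary": entry.get("summary", "")
        })

    return articles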

# ------------------------
# MAIN FUNCTION
# ------------------------

def scrape_google_news():
    for country, sources in COUNTRY_SOURCES.items():
        print(f"\n Scraping Google News for {country}...")
        all_records = []

        for site in sources:
            for term in SEARCH_TERMS:
                print(f"🔍 {term} @ {site}")
                articles = fetch_articles(term, site)
                all_records.extend(articles)
                time.sleep(1)  # be polite to Google servers

        # Save to CSV
        df = pd.DataFrame(all_records)
        filename = f"{country.lower().replace(' ', '_')}_gnews.csv"
        df.to_csv(filename, index=False)
        print(f"✅ Saved {len(df)} articles to {filename}")

if __name__ == "__main__":
    scrape_google_news()



 Scraping Google News for Kenya...
🔍 air pollution @ nation.africa
🔍 air quality @ nation.africa
🔍 respiratory illness @ nation.africa
🔍 respiratory disease @ nation.africa
🔍 air quality policy @ nation.africa
🔍 air pollution @ standardmedia.co.ke
🔍 air quality @ standardmedia.co.ke
🔍 respiratory illness @ standardmedia.co.ke
🔍 respiratory disease @ standardmedia.co.ke
🔍 air quality policy @ standardmedia.co.ke
🔍 air pollution @ the-star.co.ke
🔍 air quality @ the-star.co.ke
🔍 respiratory illness @ the-star.co.ke
🔍 respiratory disease @ the-star.co.ke
🔍 air quality policy @ the-star.co.ke
✅ Saved 1479 articles to kenya_gnews.csv

 Scraping Google News for South Africa...
🔍 air pollution @ dailymaverick.co.za
🔍 air quality @ dailymaverick.co.za
🔍 respiratory illness @ dailymaverick.co.za
🔍 respiratory disease @ dailymaverick.co.za
🔍 air quality policy @ dailymaverick.co.za
🔍 air pollution @ timeslive.co.za
🔍 air quality @ timeslive.co.za
🔍 respiratory illness @ timeslive.co.za
🔍 respira
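
To make the request format concrete, the sketch below rebuilds the query grid offline, without contacting Google. It reuses the search terms, domains, and URL template from the scraping cell; the helper name `build_feed_url` is hypothetical and exists only for illustration.

```python
from urllib.parse import quote_plus

SEARCH_TERMS = [
    "air pollution", "air quality", "respiratory illness",
    "respiratory disease", "air quality policy",
]
COUNTRY_SOURCES = {
    "Kenya": ["nation.africa", "standardmedia.co.ke", "the-star.co.ke"],
    "South Africa": ["dailymaverick.co.za", "timeslive.co.za", "news24.com"],
    "UK": ["bbc.co.uk", "theguardian.com", "telegraph.co.uk"],
}

def build_feed_url(search_term, site):
    """Hypothetical helper: return the Google News RSS URL for one term/site pair."""
    query = f"{search_term} site:{site}"
    return (
        "https://news.google.com/rss/search"
        f"?q={quote_plus(query)}&hl=en-GB&gl=GB&ceid=GB:en"
    )

# One request per (site, term) pair, in the same order as the scraping loop.
urls = [
    build_feed_url(term, site)
    for sources in COUNTRY_SOURCES.values()
    for site in sources
    for term in SEARCH_TERMS
]

print(len(urls))  # 9 sites x 5 terms = 45 requests
print(urls[0])
```

The `hl`, `gl`, and `ceid` parameters select the GB English edition for all three countries, as in the scraping cell.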

In [None]:
## FILTERING OF RELEVANT ARTICLES
# Remove duplicates first
import pandas as pd

# Load your file
df = pd.read_csv("kenya_gnews.csv")
df2 = pd.read_csv("uk_gnews.csv")
df3 = pd.read_csv("south_africa_gnews.csv")

# Normalize titles
df['clean_title'] = df['title'].str.lower().str.strip()
df2['clean_title'] = df2['title'].str.lower().str.strip()
df3['clean_title'] = df3['title'].str.lower().str.strip()

# Mark duplicates
df['duplicate'] = df.duplicated(subset='clean_title', keep='first')
df2['duplicate'] = df2.duplicated(subset='clean_title', keep='first')
df3['duplicate'] = df3.duplicated(subset='clean_title', keep='first')

# Save with duplicate flag
df.to_csv("kenya_gnews_deduped.csv", index=False)

df2.to_csv("uk_gnews_deduped.csv", index=False)

df3.to_csv("south_africa_gnews_deduped.csv", index=False)

# Save only unique articles
df[~df['duplicate']].to_csv("kenya_gnews_unique.csv", index=False)

df2[~df2['duplicate']].to_csv("uk_gnews_unique.csv", index=False)

df3[~df3['duplicate']].to_csv("south_africa_gnews_unique.csv", index=False)

print(f"✅ Found and removed {df['duplicate'].sum()} duplicates.")
print(f"✅ Found and removed {df2['duplicate'].sum()} duplicates.")
print(f"✅ Found and removed {df3['duplicate'].sum()} duplicates.")

✅ Found and removed 638 duplicates.
✅ Found and removed 565 duplicates.
✅ Found and removed 562 duplicates.
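
The deduplication rule used above (lower-case the title, strip surrounding whitespace, keep the first occurrence) can be illustrated without pandas. This is a minimal plain-Python sketch of the same logic as `duplicated(subset='clean_title', keep='first')`; the function name `flag_duplicates` and the example titles are hypothetical and not taken from the scraped data.

```python
def flag_duplicates(titles):
    """Return one boolean per title: True if its normalised form was seen before."""
    seen = set()
    flags = []
    for title in titles:
        key = title.lower().strip()  # same normalisation as the clean_title column
        flags.append(key in seen)
        seen.add(key)
    return flags

# Made-up titles for illustration only.
titles = [
    "Nairobi air quality worsens",
    "  NAIROBI AIR QUALITY WORSENS ",
    "Nairobi air quality worsens again",
]

flags = flag_duplicates(titles)
unique_titles = [t for t, dup in zip(titles, flags) if not dup]

print(flags)
print(unique_titles)
```

Only exact matches after normalisation count as duplicates, so the third title, which differs by one word, is kept.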


In [None]:
## NOW REMOVE ALL HEADINGS THAT ARE NOT AIR QUALITY AND HEALTH RELATED
import pandas as pd

# Load your file
df = pd.read_csv("kenya_gnews_unique.csv")
df2 = pd.read_csv("uk_gnews_unique.csv")
df3 = pd.read_csv("south_africa_gnews_unique.csv")

# Keep titles that match any air quality or health keyword (case-insensitive).
# The non-capturing group (?:...) avoids the pandas warning about match groups.
# Note: the spaces inside the alternation are literal, so every keyword except
# "air quality" and "health" matches only when it sits between two other words.
# The pattern is kept as run so that the reported counts stay reproducible.
KEYWORD_PATTERN = r'\b(?:air quality| pollution | cooking | electric | climate | diseases | breathing | cities | environmental |health)\b'

# .copy() returns independent DataFrames, so adding columns later does not raise SettingWithCopyWarning
df_filtered = df[df['title'].str.contains(KEYWORD_PATTERN, case=False, na=False)].copy()
df2_filtered = df2[df2['title'].str.contains(KEYWORD_PATTERN, case=False, na=False)].copy()
df3_filtered = df3[df3['title'].str.contains(KEYWORD_PATTERN, case=False, na=False)].copy()



# Normalize titles for deduplication
df_filtered['clean_title'] = df_filtered['title'].str.lower().str.strip()
df2_filtered['clean_title'] = df2_filtered['title'].str.lower().str.strip()
df3_filtered['clean_title'] = df3_filtered['title'].str.lower().str.strip()

# Mark exact duplicates
df_filtered['duplicate'] = df_filtered.duplicated(subset='clean_title', keep='first')
df2_filtered['duplicate'] = df2_filtered.duplicated(subset='clean_title', keep='first')
df3_filtered['duplicate'] = df3_filtered.duplicated(subset='clean_title', keep='first')

# Save filtered and deduplicated articles
df_filtered.to_csv("kenya_gnews_filtered_deduped.csv", index=False)
df2_filtered.to_csv("uk_gnews_filtered_deduped.csv", index=False)
df3_filtered.to_csv("south_africa_gnews_filtered_deduped.csv", index=False)

# Save only unique ones
df_filtered[~df_filtered['duplicate']].to_csv("kenya_gnews_filtered_unique.csv", index=False)
df2_filtered[~df2_filtered['duplicate']].to_csv("uk_gnews_filtered_unique.csv", index=False)
df3_filtered[~df3_filtered['duplicate']].to_csv("south_africa_gnews_filtered_unique.csv", index=False)

print(f"✅ Filtered to {len(df_filtered)} relevant articles, removed {df_filtered['duplicate'].sum()} duplicates.")
print(f"✅ Filtered to {len(df2_filtered)} relevant articles, removed {df2_filtered['duplicate'].sum()} duplicates.")
print(f"✅ Filtered to {len(df3_filtered)} relevant articles, removed {df3_filtered['duplicate'].sum()} duplicates.")


✅ Filtered to 147 relevant articles, removed 0 duplicates.
✅ Filtered to 139 relevant articles, removed 0 duplicates.
✅ Filtered to 134 relevant articles, removed 0 duplicates.
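
Because the keyword pattern in the filtering cell contains literal spaces around most alternatives, those keywords match only in the middle of a title. The sketch below compares the pattern as run with a hypothetical whitespace-free variant (`WHOLE_WORD`, which was not used for the reported counts) on made-up titles, to show where the two differ.

```python
import re

# Pattern as used in the filtering cell (spaces inside the group are literal).
AS_RUN = re.compile(
    r'\b(air quality| pollution | cooking | electric | climate | diseases '
    r'| breathing | cities | environmental |health)\b',
    re.IGNORECASE,
)

# Hypothetical alternative: same keywords, whole-word match, no literal spaces.
WHOLE_WORD = re.compile(
    r'\b(?:air quality|pollution|cooking|electric|climate|diseases'
    r'|breathing|cities|environmental|health)\b',
    re.IGNORECASE,
)

# Made-up titles for illustration only.
titles = [
    "How air pollution harms children",  # keyword between two words
    "Pollution: the silent killer",      # keyword at the start, before punctuation
    "Cities choke as smog returns",      # keyword at the start
    "Budget speech highlights",          # unrelated
]

results = [
    (bool(AS_RUN.search(t)), bool(WHOLE_WORD.search(t))) for t in titles
]
for title, (as_run, whole_word) in zip(titles, results):
    print(as_run, whole_word, title)
```

On these examples the pattern as run misses the second and third titles, while the whole-word variant keeps them. Switching patterns would change the article counts, so the variant is shown only as a possible refinement.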



Final iteration run on October 3rd 2025.

After deduplication and removal of articles unrelated to air quality and health, 147 articles remained for Kenya, 139 for the UK, and 134 for South Africa.