### Reddit API Notebook

This notebook uses the Reddit API to find examples of landslides and other natural disasters in Brazil, as a substitute for the Twitter API, which was not working.

In [2]:
import praw
import pandas as pd
import json, os
from openai import OpenAI
from secretcodes import reddit_secret, reddit_client, open_api_key
import requests
from datetime import datetime
from sentence_transformers import SentenceTransformer, util

reddit = praw.Reddit(
    client_id = reddit_client,
    client_secret=reddit_secret,
    user_agent="brazilian_urop by u/haleyhernan"
)


In [3]:
query = "(Brazil AND (landslide OR mudslide)) OR (deslizamento AND terra)"
subreddits = ["worldnews", "news", "brazil", "environment", "earthscience"]

posts = []

First, we run the search on a small working example to check that the API works.

In [6]:
for sub in subreddits: 
    print(f"\n🔎 Searching r/{sub}...")
    for post in reddit.subreddit(sub).search(query, limit=20, sort="new"):
        posts.append({
            "subreddit": sub,
            "title": post.title,
            "score": post.score,
            "url": post.url,
            "num_comments": post.num_comments,
            "created_utc": post.created_utc
        })


🔎 Searching r/worldnews...

🔎 Searching r/news...

🔎 Searching r/brazil...

🔎 Searching r/environment...

🔎 Searching r/earthscience...


In [7]:
df = pd.DataFrame(posts)
print(df.head())

   subreddit                                              title  score  \
0  worldnews  Brazil: Landslides and flooding kill 60 in Rio...    464   
1  worldnews  Brazil floods: death toll rises to 48 as lands...     41   
2  worldnews  Heavy rain triggers flooding, landslides in Br...     33   
3  worldnews  Death Toll From Floods, Landslides Rises to 36...     22   
4  worldnews  Landslides after heavy rains in southern Brazi...     26   

                                                 url  num_comments  \
0     https://www.bbc.com/news/articles/c0w03627kq4o            45   
1  https://www.theguardian.com/world/2023/feb/23/...             2   
2  https://www.abc.net.au/news/2023-02-20/brazil-...            10   
3  https://www.telesurenglish.net/news/Death-Toll...             0   
4  https://www.france24.com/en/americas/20221201-...             2   

    created_utc  
0  1.714859e+09  
1  1.677124e+09  
2  1.676920e+09  
3  1.676902e+09  
4  1.669935e+09  
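
The `created_utc` column above holds Unix epoch seconds, which pandas prints in scientific notation. A minimal sketch of converting it to readable UTC timestamps follows. The sample values are copied from the output above, and the `created_dt` column name is an assumption made for illustration.

```python
import pandas as pd

# Epoch-second values copied from the printed output above
sample = pd.DataFrame({"created_utc": [1.714859e9, 1.677124e9, 1.669935e9]})

# unit="s" reads the floats as seconds since 1970-01-01;
# utc=True makes the resulting timestamps timezone-aware
sample["created_dt"] = pd.to_datetime(sample["created_utc"], unit="s", utc=True)
print(sample["created_dt"].dt.strftime("%Y-%m-%d"))
```

Keeping the raw epoch column alongside the converted one makes later numeric comparisons against a date window straightforward.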


In [8]:
df.to_csv("brazil_landslides.csv", index=False)
print("\n✅ Saved results to brazil_landslides.csv")


✅ Saved results to brazil_landslides.csv


Now that we know it works, let's test it on a specific case: finding a particular landslide in Brazil using the dates of posts.

This lets us check whether it is feasible to use this data collection to find undocumented landslides.

Reddit's search API doesn't support filtering by an arbitrary date range, so we query PullPush (a Pushshift-style archive API) to try to filter by date.

In [21]:
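from datetime import timezone

query = "Brazil (landslide OR mudslide OR flooding OR rain OR storm OR deslizamento OR enchente)"
# Build the window in UTC so it lines up with Reddit's created_utc epoch seconds
start = datetime(2024, 10, 1, tzinfo=timezone.utc).timestamp()
end = datetime(2024, 10, 31, 23, 59, 59, tzinfo=timezone.utc).timestamp()
url = "https://api.pullpush.io/reddit/search/submission/"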

In [22]:
params = {
    "q": query,
    "after": int(start),   # window start, epoch seconds (defined in the previous cell)
    "before": int(end),    # window end, epoch seconds
    "size": 100,
    "sort": "desc",
    "sort_type": "score"
}

In [23]:
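response = requests.get(url, params=params, timeout=30)
response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
data = response.json().get("data", [])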
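
The `data` list returned by the request above is not used by the later cells. A minimal sketch of how it could be tabulated follows. It assumes each record is a dict with Pushshift-style keys (`subreddit`, `title`, `score`, `url`, `num_comments`, `created_utc`); the source does not confirm that schema, and `submissions_to_frame` and the example record are hypothetical.

```python
import pandas as pd

def submissions_to_frame(records):
    """Keep the same columns the PRAW loop collects; missing keys become NaN."""
    columns = ["subreddit", "title", "score", "url", "num_comments", "created_utc"]
    frame = pd.DataFrame(records).reindex(columns=columns)
    # created_utc is assumed to be epoch seconds, as in Reddit's own API
    frame["created_utc"] = pd.to_datetime(frame["created_utc"], unit="s", utc=True)
    return frame

# Hypothetical record standing in for one element of response.json()["data"]
example = [
    {"subreddit": "worldnews", "title": "Example title", "score": 10,
     "url": "https://example.com", "num_comments": 2, "created_utc": 1728880870},
]
print(submissions_to_frame(example))
```

Using `reindex` rather than direct column selection means a record that lacks one of the expected keys does not raise a `KeyError`.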

In [24]:
posts = []  # start fresh so results from the earlier test search are not mixed in
for sub in subreddits:
    print(f"\n🔎 Searching r/{sub}...")
    for post in reddit.subreddit(sub).search(query, limit=200, sort="new"):
        if start <= post.created_utc <= end:
            posts.append({
                "subreddit": sub,
                "title": post.title,
                "score": post.score,
                "url": post.url,
                "num_comments": post.num_comments,
                "created_utc": datetime.utcfromtimestamp(post.created_utc).strftime("%Y-%m-%d %H:%M:%S")
            })


🔎 Searching r/worldnews...

🔎 Searching r/news...

🔎 Searching r/brazil...

🔎 Searching r/environment...

🔎 Searching r/earthscience...


In [25]:
df = pd.DataFrame(posts)
print(df.head())

df.to_csv("brazil_landslides_oct2024.csv", index=False)
print("\n✅ Saved results to brazil_landslides_oct2024.csv")

   subreddit                                              title  score  \
0  worldnews  Eight dead as heavy rain thrashes Brazil after...     61   
1     brazil               Any suggestions north of Salvador ?       1   
2     brazil                                Amazon travel tips?      2   
3     brazil             Did I make a mistake booking my visit?      2   
4     brazil                                         Sent to me    323   

                                                 url  num_comments  \
0  https://phys.org/news/2024-10-dead-heavy-thras...             2   
1  https://www.reddit.com/r/Brazil/comments/1gg9f...             4   
2  https://www.reddit.com/r/Brazil/comments/1g016...             4   
3  https://www.reddit.com/r/Brazil/comments/1fzmz...            15   
4               https://i.redd.it/xe074bozqmsd1.jpeg            54   

           created_utc  
0  2024-10-14 04:41:10  
1  2024-10-31 08:08:20  
2  2024-10-09 20:24:26  
3  2024-10-09 08:57:48  
4  2024-1
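
The sample above includes r/brazil posts that fall inside the date window but are not about disasters (travel questions, for example). One possible way to screen these out is a keyword filter on the title. This is only a sketch: the keyword list is taken from the search query, and `filter_relevant` is a hypothetical helper, not part of the original pipeline.

```python
import pandas as pd

# Keywords borrowed from the search query; extend or translate as needed
KEYWORDS = ["landslide", "mudslide", "flood", "rain", "storm", "deslizamento", "enchente"]

def filter_relevant(frame, keywords=KEYWORDS):
    """Keep rows whose title contains at least one keyword (case-insensitive)."""
    pattern = "|".join(keywords)
    mask = frame["title"].str.contains(pattern, case=False, na=False)
    return frame[mask]

# Titles copied from the printed sample above
titles = pd.DataFrame({"title": [
    "Eight dead as heavy rain thrashes Brazil after...",
    "Amazon travel tips?",
    "Did I make a mistake booking my visit?",
]})
print(filter_relevant(titles))
```

Substring matching is crude: a keyword such as "rain" also matches words like "training", so word-boundary patterns may be preferable on larger samples.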

Next, we clean the data before feeding it to the LLM. This normalizes the post titles to a uniform format and removes duplicate posts and posts whose titles are very similar.

In [17]:
df = pd.read_csv("brazil_landslides.csv")

df["title_clean"] = df["title"].str.lower().str.strip()



In [18]:
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(df["title_clean"].tolist(), convert_to_tensor=True)
cosine_scores = util.cos_sim(embeddings, embeddings)


In [19]:
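# Greedy de-duplication: keep the first post, drop later posts whose title similarity to it exceeds 0.8
to_drop = set()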
for i in range(len(df)):
    if i in to_drop:
        continue
    for j in range(i + 1, len(df)):
        if cosine_scores[i][j] > 0.8:
            to_drop.add(j)

In [20]:
df_clean = df.drop(df.index[list(to_drop)])
df_clean.to_csv("brazil_landslides_clean.csv", index=False)
print(f"Removed {len(to_drop)} similar posts, {len(df_clean)} remain.")

Removed 22 similar posts, 24 remain.
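
To make the effect of the 0.8 threshold concrete, here is a small sketch that applies the same greedy loop to a toy similarity matrix. The vectors are hypothetical stand-ins for sentence embeddings, and `greedy_dedup` is a name introduced only for this illustration.

```python
import numpy as np

def greedy_dedup(similarity, threshold=0.8):
    """Return indices to drop: any later item too similar to an earlier kept item."""
    n = similarity.shape[0]
    to_drop = set()
    for i in range(n):
        if i in to_drop:
            continue
        for j in range(i + 1, n):
            if similarity[i, j] > threshold:
                to_drop.add(j)
    return to_drop

# Toy vectors standing in for sentence embeddings (hypothetical values)
vectors = np.array([[1.0, 0.0], [0.98, 0.199], [0.0, 1.0]])
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = vectors @ vectors.T  # cosine similarity of normalized rows
print(greedy_dedup(similarity))
```

Because the loop is greedy, the first post in each cluster of near-duplicates is the one that survives, so row order affects which title is kept.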


Next, we use an LLM to turn each record into a JSON object with its location, type of disaster, severity, sentiment, likely source type (news vs. personal), and a one-line summary.

In [22]:
client = OpenAI(api_key=open_api_key)

def extract_event_info(title: str):
    prompt = f"""
    You are extracting event information from Reddit posts about Brazilian landslides.
    For the post below, return a JSON with:
    - "location": most likely location or city/state if mentioned
    - "event_type": short phrase like "landslide", "flood", "mudslide"
    - "severity": estimate low/medium/high based on death toll or language intensity
    - "sentiment": classify as negative/neutral/positive (based on tone)
    - "source_type": "news" if linking to a news outlet, otherwise "personal/post"
    - "summary": a short 1-line summary

    Post title: "{title}"

    Return ONLY valid JSON.
    """

    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    # Extract and safely parse JSON; read content first so it is always defined in the handler
    content = response.choices[0].message.content
    try:
        data = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        print("⚠️ Error parsing JSON for:", title)
        print("Response content:", content)
        data = {}
    return data


In [23]:
results = []
for t in df_clean["title_clean"]:
    info = extract_event_info(t)
    info["title"] = t
    results.append(info)

events_df = pd.DataFrame(results)
events_df.to_csv("brazil_landslides_structured.csv", index=False)

⚠️ Error parsing JSON for: deadly landslide engulfs motorway in brazil
Response content: ```json
{
    "location": "Brazil",
    "event_type": "landslide",
    "severity": "high",
    "sentiment": "negative",
    "source_type": "personal/post",
    "summary": "A deadly landslide has engulfed a motorway in Brazil."
}
```
⚠️ Error parsing JSON for: dramatic footage shows moment brazil mudslide begins
Response content: ```json
{
    "location": "Brazil",
    "event_type": "mudslide",
    "severity": "medium",
    "sentiment": "negative",
    "source_type": "personal/post",
    "summary": "Video captures the initial moments of a mudslide in Brazil."
}
```
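
Both failures above share a cause: the reply is valid JSON wrapped in a Markdown code fence, which `json.loads` rejects. A minimal sketch of a more tolerant parser follows. `parse_json_reply` is a hypothetical helper, not part of the original notebook, and it only handles a single fence around the whole reply.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the snippet stays easy to embed

def parse_json_reply(content):
    """Strip an optional Markdown code fence, then parse the remainder as JSON."""
    text = content.strip()
    pattern = rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$"
    match = re.match(pattern, text, flags=re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

# A reply shaped like the ones printed above
reply = f'{FENCE}json\n{{"location": "Brazil", "event_type": "landslide"}}\n{FENCE}'
print(parse_json_reply(reply))
```

Calling this helper in place of `json.loads(content)` inside `extract_event_info` would let fenced replies through while still raising on genuinely malformed output.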


In [26]:
grouped = events_df.groupby("location")

summaries = []
for location, group in grouped:
    summary_prompt = f"""
    Summarize the following reports about landslides/floods in {location}.
    Include:
    - The likely event date range
    - Overall severity (low/medium/high)
    - The nature of the events (landslide, flood, mudslide, etc.)
    - The general sentiment (negative, neutral, or positive)
    - Notable details (like number of posts, sources, or key facts)
    
    Posts data:
    {group.to_json(orient='records')}
    """

    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": summary_prompt}],
        temperature=0.3
    )

    summaries.append({
        "location": location,
        "summary": resp.choices[0].message.content
    })

In [None]:
summary_df = pd.DataFrame(summaries)
summary_df.to_csv("brazil_landslides_event_summaries.csv", index=False)
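
One caveat about the grouping step: rows whose extraction failed have no `location`, and pandas `groupby` leaves out missing keys by default, so those posts never reach a summary. The sketch below demonstrates the behavior on hypothetical data and shows the `dropna=False` option, which keeps them as their own group.

```python
import pandas as pd

events = pd.DataFrame({
    "location": ["Rio de Janeiro", None, "Rio de Janeiro"],  # None mimics a failed extraction
    "title": ["post a", "post b", "post c"],
})

default_groups = events.groupby("location").size()
all_groups = events.groupby("location", dropna=False).size()

print(len(default_groups))  # the row with a missing location is left out
print(len(all_groups))      # the missing location forms its own group
```

Whether to keep or discard those rows is a design choice; keeping them at least makes the number of failed extractions visible in the output.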