# How popular are different social movements over time?

LSE DS105A - Data for Data Science (2024/25)

**Date**: 18/11/24

**Author**: Amelia Dunn

**Objective**:🌟 Pull data files from GDELT API using Pytrends to get popularity of different social movements or events over time.

## For google search using Pytrends instead - Cannot use this oops

- I tried investigating Reddit data, but you cannot access historical data
- Then I tried investigating X (previously known as Twitter), but you could only access the historical data through paying.
- We tried querying data from GDELT API, but we were only able to access data from 2014 onward, which we found to be too limiting. On top of this, it provides a seperate json file for each day from 2014 which is alot.
- Tryed to get data from Open sanctions and open corporate, but found that this was an API you have to pay for
- Now cycling back to GDELT and gathering data only from 2014 onwards. 

---

In [15]:
from pytrends.request import TrendReq
import pandas as pd
import json

def fetch_trends_data(keywords, geo='US', timeframe='2004-01-01 2023-12-31'):
    """
    Fetch Google Trends interest data for a list of keywords.
    """
    pytrends = TrendReq(hl='en-US', tz=360)
    trends_data = {}

    for keyword in keywords:
        try:
            pytrends.build_payload([keyword], timeframe=timeframe, geo=geo)
            interest_over_time = pytrends.interest_over_time()

            # Check if data exists
            if not interest_over_time.empty:
                # Convert Timestamp keys to string
                trends_data[keyword] = {date.strftime("%Y-%m-%d"): interest 
                                        for date, interest in interest_over_time[keyword].items()}
            else:
                print(f"No data available for '{keyword}'.")
        except Exception as e:
            print(f"An error occurred for '{keyword}': {e}")

    return trends_data

# Define a list of social movements - picked manually from a top 10 list of social movements researched online [find a webpage later]
movements = [
    "Black Lives Matter",
    "Me Too Movement",
    "Climate Strikes",
    "LGBTQ Rights",
    "Women's March",
    "Occupy Wall Street",
    "March for Our Lives",
    "Civil Rights Movement",
    "Environmental Activism",
    "Immigration Reform"
]

# Fetch data for the list of movements
raw_data = fetch_trends_data(movements, geo="US", timeframe="2004-01-01 2023-12-31")

# Save raw data to JSON
with open("../data_amelia/raw/social_movements_raw.json", "w") as json_file:
    json.dump(raw_data, json_file, indent=4)

print("Data has been saved to '../data_amelia/raw/social_movements_raw.json'")

Data has been saved to '../data_amelia/raw/social_movements_raw.json'


In [23]:
import json
import pandas as pd

def process_json_to_csv(json_file_path, csv_file_path):
    """
    Converts raw data from a JSON file to a CSV file.
    Args:
        json_file_path (str): Path to the input JSON file.
        csv_file_path (str): Path to the output CSV file.
    """
    try:
        # Load the JSON file
        with open(json_file_path, "r") as json_file:
            raw_data = json.load(json_file)
        
        # Process the data into a list of dictionaries
        processed_data = []
        for movement, data in raw_data.items():
            for date, interest in data.items():
                processed_data.append({"Movement": movement, "Date": date, "Interest": interest})
        
        # Create a DataFrame and save as CSV
        df = pd.DataFrame(processed_data)
        df.to_csv(csv_file_path, index=False)
        print(f"CSV file has been saved to {csv_file_path}.")
    except Exception as e:
        print(f"An error occurred while processing the JSON file: {e}")
        

# Example usage
json_file_path = "../data_amelia/raw/social_movements_raw.json"
csv_file_path = "../data_amelia/processed/social_movements.csv"

process_json_to_csv(json_file_path, csv_file_path)

CSV file has been saved to ../data_amelia/processed/social_movements.csv.


In [27]:
df

Unnamed: 0,Movement,Date,Interest
0,Black Lives Matter,2004-01-01,0
1,Black Lives Matter,2004-02-01,0
2,Black Lives Matter,2004-03-01,0
3,Black Lives Matter,2004-04-01,0
4,Black Lives Matter,2004-05-01,0
...,...,...,...
2395,Immigration Reform,2023-08-01,2
2396,Immigration Reform,2023-09-01,3
2397,Immigration Reform,2023-10-01,3
2398,Immigration Reform,2023-11-01,3


processing the data - move this to NB02 when have the time

**TODO:** use SQLite to put the dataframe into a database and any other processing.

---

## GDELT data:

Code for downloading all the zip files individually (if you want to use it)

In [1]:
import requests
import os
import time
from datetime import datetime, timedelta

def download_zip_file(url, output_dir, retries=3, delay=2):
    """
    Downloads a zip file from the given URL and stores it in the specified directory.
    
    Args:
        url (str): URL of the GDELT zip file.
        output_dir (str): Directory to save the zip files.
        retries (int): Number of retry attempts on failure.
        delay (int): Delay (in seconds) between retries.
        
    Returns:
        str: Path to the saved zip file, or None if the download failed.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    file_name = url.split("/")[-1]
    zip_path = os.path.join(output_dir, file_name)

    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 404:
                print(f"File not found (404): {url}")
                return None
            elif response.status_code == 200:
                with open(zip_path, "wb") as zip_file:
                    zip_file.write(response.content)
                print(f"Downloaded zip file: {zip_path}")
                return zip_path
            else:
                print(f"Failed to download {url}. HTTP Status Code: {response.status_code}")
        except requests.exceptions.RequestException as e:
            print(f"Error downloading {url}: {e}")
        
        print(f"Retrying in {delay} seconds...")
        time.sleep(delay)

    print(f"Failed to download after {retries} attempts: {url}")
    return None


def fetch_gdelt_data_zip(start_date, end_date, output_dir="../data_amelia/raw", delay=2, retries=3):
    """
    Fetch GDELT data for a given date range (2014 onwards) and store as zip files.
    
    Args:
        start_date (str): Start date in 'YYYY-MM-DD' format.
        end_date (str): End date in 'YYYY-MM-DD' format.
        output_dir (str): Directory to store the downloaded zip files.
        delay (int): Delay (in seconds) between requests.
        retries (int): Number of retry attempts on failure.
    """
    start_date = datetime.strptime(start_date, "%Y-%m-%d")
    end_date = datetime.strptime(end_date, "%Y-%m-%d")
    current_date = start_date
    missing_files = []

    while current_date <= end_date:
        # For 2014 onwards, we use the daily file structure: YYYYMMDD.export.CSV.zip
        url = f"http://data.gdeltproject.org/events/{current_date.strftime('%Y%m%d')}.export.CSV.zip"

        # Download the zip file
        zip_path = download_zip_file(url, output_dir, retries, delay)
        if not zip_path:
            missing_files.append(url)

        # Increment date by one day
        current_date += timedelta(days=1)

        # Delay to avoid overloading the server
        print(f"Sleeping for {delay} seconds...")
        time.sleep(delay)

    # Log missing files
    if missing_files:
        missing_file_log = os.path.join(output_dir, "missing_files.log")
        with open(missing_file_log, "w") as log_file:
            for url in missing_files:
                log_file.write(f"{url}\n")
        print(f"Missing files logged to {missing_file_log}")


# Example usage: Fetch data from 2014 onwards
fetch_gdelt_data_zip(
    start_date="2014-01-01",  # Start from 2014
    end_date="2023-12-31",    # End on December 31st, 2023
    output_dir="../data_amelia/raw",    # Directory to save the zip files
    delay=2,                  # Delay in seconds between requests
    retries=3                 # Retry 3 times in case of failure
)


Downloaded zip file: ../data_amelia/raw/20140101.export.CSV.zip
Sleeping for 2 seconds...
Downloaded zip file: ../data_amelia/raw/20140102.export.CSV.zip
Sleeping for 2 seconds...
Downloaded zip file: ../data_amelia/raw/20140103.export.CSV.zip
Sleeping for 2 seconds...
Downloaded zip file: ../data_amelia/raw/20140104.export.CSV.zip
Sleeping for 2 seconds...
Downloaded zip file: ../data_amelia/raw/20140105.export.CSV.zip
Sleeping for 2 seconds...
Downloaded zip file: ../data_amelia/raw/20140106.export.CSV.zip
Sleeping for 2 seconds...
Downloaded zip file: ../data_amelia/raw/20140107.export.CSV.zip
Sleeping for 2 seconds...
Downloaded zip file: ../data_amelia/raw/20140108.export.CSV.zip
Sleeping for 2 seconds...
Downloaded zip file: ../data_amelia/raw/20140109.export.CSV.zip
Sleeping for 2 seconds...
Downloaded zip file: ../data_amelia/raw/20140110.export.CSV.zip
Sleeping for 2 seconds...
Downloaded zip file: ../data_amelia/raw/20140111.export.CSV.zip
Sleeping for 2 seconds...
Downloaded

OSError: [Errno 122] Disk quota exceeded

Processing the data so to not exceed the data limit

In [8]:
import requests
import zipfile
import io
import os
import pandas as pd
import time
from datetime import datetime

# List of column names as provided
GDELT_COLUMNS = [
    "GLOBALEVENTID", "SQLDATE", "MonthYear", "Year", "FractionDate", "Actor1Code", "Actor1Name", 
    "Actor1CountryCode", "Actor1KnownGroupCode", "Actor1EthnicCode", "Actor1Religion1Code", 
    "Actor1Religion2Code", "Actor1Type1Code", "Actor1Type2Code", "Actor1Type3Code", "Actor2Code", 
    "Actor2Name", "Actor2CountryCode", "Actor2KnownGroupCode", "Actor2EthnicCode", "Actor2Religion1Code", 
    "Actor2Religion2Code", "Actor2Type1Code", "Actor2Type2Code", "Actor2Type3Code", "IsRootEvent", 
    "EventCode", "EventBaseCode", "EventRootCode", "QuadClass", "GoldsteinScale", "NumMentions", 
    "NumSources", "NumArticles", "AvgTone", "Actor1Geo_Type", "Actor1Geo_FullName", "Actor1Geo_CountryCode", 
    "Actor1Geo_ADM1Code", "Actor1Geo_Lat", "Actor1Geo_Long", "Actor1Geo_FeatureID", "Actor2Geo_Type", 
    "Actor2Geo_FullName", "Actor2Geo_CountryCode", "Actor2Geo_ADM1Code", "Actor2Geo_Lat", "Actor2Geo_Long", 
    "Actor2Geo_FeatureID", "ActionGeo_Type", "ActionGeo_FullName", "ActionGeo_CountryCode", "ActionGeo_ADM1Code", 
    "ActionGeo_Lat", "ActionGeo_Long", "ActionGeo_FeatureID", "DATEADDED", "SOURCEURL"
]

def download_and_process_zip(url, output_dir, retries=3, delay=2):
    """
    Downloads and processes a zip file, adding column headers, filtering for US protest events, 
    and returns a filtered DataFrame.
    
    Args:
        url (str): URL of the GDELT zip file.
        output_dir (str): Directory to store the extracted files.
        retries (int): Number of retry attempts on failure.
        delay (int): Delay (in seconds) between retries.
        
    Returns:
        pd.DataFrame: Processed data, or None if the download failed.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    file_name = url.split("/")[-1]
    zip_path = os.path.join(output_dir, file_name)

    # Try to download the file with retries
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 404:
                print(f"File not found (404): {url}")
                return None
            elif response.status_code == 200:
                with zipfile.ZipFile(io.BytesIO(response.content)) as z:
                    # Extract the first file from the zip
                    for file in z.namelist():
                        with z.open(file) as csv_file:
                            # Read the CSV file into a DataFrame without headers
                            df = pd.read_csv(csv_file, sep='\t', header=None, low_memory=False)
                            
                            # Check the number of columns and adjust headers accordingly
                            num_columns = df.shape[1]
                            if num_columns == len(GDELT_COLUMNS):
                                df.columns = GDELT_COLUMNS
                            else:
                                print(f"Warning: Column mismatch. Expected {len(GDELT_COLUMNS)} columns, but found {num_columns}.")
                                # Optionally: Add handling for other column mismatches (e.g., truncating or adding dummy columns)
                            
                            # Filter for US protest events only (adjust column names as needed)
                            df_filtered = df[(df['Actor1Geo_CountryCode'] == 'US') & (df['EventCode'] == 140)]
                            return df_filtered
            else:
                print(f"Failed to download {url}. HTTP Status Code: {response.status_code}")
        except requests.exceptions.RequestException as e:
            print(f"Error downloading {url}: {e}")
        
        print(f"Retrying in {delay} seconds...")
        time.sleep(delay)

    print(f"Failed to download after {retries} attempts: {url}")
    return None


def process_year_data(start_year, end_year, output_dir="../data_amelia/processed"):
    """
    Processes data for the entire year, one zip file at a time, and saves the result as a single CSV.
    
    Args:
        start_year (int): Start year for the data.
        end_year (int): End year for the data.
        output_dir (str): Directory to store the extracted and processed data.
    """
    for year in range(start_year, end_year + 1):
        print(f"Processing data for year {year}...")
        year_data = pd.DataFrame()  # To store processed data for the year

        # Process each day in the year
        for month in range(1, 13):  # Loop over each month
            for day in range(1, 32):  # Loop over each day
                # Build the URL for the daily file (assuming the GDELT format is daily post 2013)
                url = f"http://data.gdeltproject.org/events/{year}{str(month).zfill(2)}{str(day).zfill(2)}.export.CSV.zip"
                
                # Download and process the zip file
                df_filtered = download_and_process_zip(url, output_dir)
                if df_filtered is not None:
                    # Concatenate the filtered data for the day into the year's dataframe
                    year_data = pd.concat([year_data, df_filtered], ignore_index=True)
                
                # Add a small delay between requests
                time.sleep(2)

        # Save the year's data to a CSV file
        if not year_data.empty:
            output_file = os.path.join(output_dir, f"{year}_protest_data.csv")
            year_data.to_csv(output_file, index=False)
            print(f"Year {year} data saved to {output_file}")
        else:
            print(f"No data found for year {year}.")

# Example usage: Process data from 2014 to 2023
process_year_data(start_year=2014, end_year=2023)


Processing data for year 2014...
File not found (404): http://data.gdeltproject.org/events/20140123.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20140124.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20140125.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20140229.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20140230.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20140231.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20140319.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20140431.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20140631.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20140931.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20141131.export.CSV.zip
Year 2014 data saved to gdelt_csv/2014_protest_data.csv
Processin

  year_data = pd.concat([year_data, df_filtered], ignore_index=True)


File not found (404): http://data.gdeltproject.org/events/20161131.export.CSV.zip
Year 2016 data saved to gdelt_csv/2016_protest_data.csv
Processing data for year 2017...
File not found (404): http://data.gdeltproject.org/events/20170229.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20170230.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20170231.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20170431.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20170631.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20170931.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20171131.export.CSV.zip
Year 2017 data saved to gdelt_csv/2017_protest_data.csv
Processing data for year 2018...
File not found (404): http://data.gdeltproject.org/events/20180229.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20180230.export.CSV.zip
Fi

  year_data = pd.concat([year_data, df_filtered], ignore_index=True)


File not found (404): http://data.gdeltproject.org/events/20200230.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20200231.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20200431.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20200631.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20200931.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20201131.export.CSV.zip
Year 2020 data saved to gdelt_csv/2020_protest_data.csv
Processing data for year 2021...
File not found (404): http://data.gdeltproject.org/events/20210229.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20210230.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20210231.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20210431.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20210631.export.CSV.zip


**Note:** The function above will take at least 4 or 5 hours to get the data from. It also stopped running before it could collect data for the year 2023. 