# How popular are different social movements over time?

LSE DS105A - Data for Data Science (2024/25)

**Date**: 18/11/24

**Author**: Amelia Dunn

**Objective**: 🌟 Pull data files from the GDELT API to track the popularity of different social movements or events over time.

### Sources we looked into:
- I tried investigating Reddit data, but you cannot access historical data.
- Then I tried investigating X (previously known as Twitter), but the historical data is only accessible by paying.
- We tried querying data from the GDELT API, but we were only able to access data from 2014 onward, which we found too limiting. On top of this, it provides a separate JSON file for each day from 2014, which is a lot.
- Tried to get data from OpenSanctions and OpenCorporates, but found that these APIs have to be paid for.
- Now cycling back to GDELT and gathering data only from 2014 onwards.

In [1]:
import requests
import zipfile
import io
import os
import pandas as pd
import time
from datetime import datetime

---

## GDELT data:

- Pulling each day's data file from the GDELT [website](http://data.gdeltproject.org/events/index.html) and combining them into yearly CSV files.

**Processing the data so as not to exceed the data limit:**

In [2]:
# defining the column names of the data 

GDELT_COLUMNS = [
    "GLOBALEVENTID", "SQLDATE", "MonthYear", "Year", "FractionDate", "Actor1Code", "Actor1Name", 
    "Actor1CountryCode", "Actor1KnownGroupCode", "Actor1EthnicCode", "Actor1Religion1Code", 
    "Actor1Religion2Code", "Actor1Type1Code", "Actor1Type2Code", "Actor1Type3Code", "Actor2Code", 
    "Actor2Name", "Actor2CountryCode", "Actor2KnownGroupCode", "Actor2EthnicCode", "Actor2Religion1Code", 
    "Actor2Religion2Code", "Actor2Type1Code", "Actor2Type2Code", "Actor2Type3Code", "IsRootEvent", 
    "EventCode", "EventBaseCode", "EventRootCode", "QuadClass", "GoldsteinScale", "NumMentions", 
    "NumSources", "NumArticles", "AvgTone", "Actor1Geo_Type", "Actor1Geo_FullName", "Actor1Geo_CountryCode", 
    "Actor1Geo_ADM1Code", "Actor1Geo_Lat", "Actor1Geo_Long", "Actor1Geo_FeatureID", "Actor2Geo_Type", 
    "Actor2Geo_FullName", "Actor2Geo_CountryCode", "Actor2Geo_ADM1Code", "Actor2Geo_Lat", "Actor2Geo_Long", 
    "Actor2Geo_FeatureID", "ActionGeo_Type", "ActionGeo_FullName", "ActionGeo_CountryCode", "ActionGeo_ADM1Code", 
    "ActionGeo_Lat", "ActionGeo_Long", "ActionGeo_FeatureID", "DATEADDED", "SOURCEURL"
]

Function to download the data and process it (some processing had to be included here because the files were too large overall to download all of them before processing):

In [3]:
def download_and_process_zip(url, output_dir, retries=3, delay=2):
    """
    Downloads and processes a zip file, adding column headers, filtering for US protest events, 
    and returns a filtered DataFrame.
    
    Args:
        url (str): URL of the GDELT zip file.
        output_dir (str): Directory to store the extracted files.
        retries (int): Number of retry attempts on failure.
        delay (int): Delay (in seconds) between retries.
        
    Returns:
        pd.DataFrame: Processed data, or None if the download failed.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    file_name = url.split("/")[-1]
    zip_path = os.path.join(output_dir, file_name)

    # Try to download the file with retries
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 404:
                print(f"File not found (404): {url}")
                return None
            elif response.status_code == 200:
                with zipfile.ZipFile(io.BytesIO(response.content)) as z:
                    # Extract the first file from the zip
                    for file in z.namelist():
                        with z.open(file) as csv_file:
                            # Read the CSV file into a DataFrame without headers
                            df = pd.read_csv(csv_file, sep='\t', header=None, low_memory=False)
                            
                            # Check the number of columns and adjust headers accordingly
                            num_columns = df.shape[1]
                            if num_columns == len(GDELT_COLUMNS):
                                df.columns = GDELT_COLUMNS
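                            else:
                                # Without named columns the filter below would raise a KeyError, so skip this file
                                print(f"Warning: Column mismatch. Expected {len(GDELT_COLUMNS)} columns, but found {num_columns}. Skipping {url}.")
                                return None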
                            
                            # Filter for US protest events only 
                            df_filtered = df[(df['Actor1Geo_CountryCode'] == 'US') & (df['EventCode'] == 140)]
                            return df_filtered
            else:
                print(f"Failed to download {url}. HTTP Status Code: {response.status_code}")
        except requests.exceptions.RequestException as e:
            print(f"Error downloading {url}: {e}")
        
        print(f"Retrying in {delay} seconds...")
        time.sleep(delay)

    print(f"Failed to download after {retries} attempts: {url}")
    return None
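
The unzip-and-read step inside the function above can be tried offline. The following is a minimal sketch, not the notebook's own code: `read_first_tsv_from_zip` and the three-column sample are hypothetical names and data made up for illustration. It reads every column as text (`dtype=str`), an assumption chosen here so that codes with leading zeros survive unchanged.

```python
import io
import zipfile

import pandas as pd


def read_first_tsv_from_zip(zip_bytes, column_names=None):
    """Read the first member of an in-memory zip archive as a headerless TSV.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        first_member = archive.namelist()[0]
        with archive.open(first_member) as tsv_file:
            # dtype=str keeps every value as text (assumption of this sketch)
            df = pd.read_csv(tsv_file, sep="\t", header=None, dtype=str)
    # Only attach names when the width matches, so a schema change is not hidden
    if column_names is not None and df.shape[1] == len(column_names):
        df.columns = column_names
    return df


# Build a tiny archive in memory to stand in for a downloaded file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.export.CSV", "1\tUS\t140\n2\tFR\t010\n")

sample = read_first_tsv_from_zip(buffer.getvalue(), ["id", "country", "code"])
print(sample)
```

Because nothing is written to disk, the same pattern works directly on `response.content` from `requests`.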

**Function to build yearly summaries of the processed daily data:**

* Even though the data is partially processed, it will be saved in a raw data folder as we will further process the data in NB02. Some processing was necessary in this notebook to make the file sizes smaller, but this is minimal.

In [4]:
def process_year_data(start_year, end_year, output_dir="../../data/raw"):
    """
    Processes data for the entire year, one zip file at a time, and saves the result as a single CSV.
    
    Args:
        start_year (int): Start year for the data.
        end_year (int): End year for the data.
        output_dir (str): Directory to store the extracted and processed data.
    """
    for year in range(start_year, end_year + 1):
        print(f"Processing data for year {year}...")
        year_data = pd.DataFrame()  # To store processed data for the year

        # Process each day in the year
        for month in range(1, 13):  # Loop over each month
            for day in range(1, 32):  # Loop over each day
                # Build the URL for the daily file (assuming the GDELT format is daily post 2013)
                url = f"http://data.gdeltproject.org/events/{year}{str(month).zfill(2)}{str(day).zfill(2)}.export.CSV.zip"
                
                # Download and process the zip file
                df_filtered = download_and_process_zip(url, output_dir)
                if df_filtered is not None:
                    # Concatenate the filtered data for the day into the year's dataframe
                    year_data = pd.concat([year_data, df_filtered], ignore_index=True)
                
                # Add a small delay between requests
                time.sleep(2)

        # Save the year's data to a CSV file
        if not year_data.empty:
            output_file = os.path.join(output_dir, f"{year}_protest_data.csv")
            year_data.to_csv(output_file, index=False)
            print(f"Year {year} data saved to {output_file}")
        else:
            print(f"No data found for year {year}.")
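
The nested month and day loops also request dates that do not exist (for example 30 February), each of which costs a 404 and a pause. One possible alternative, sketched here under the assumption that the URL pattern used above stays the same, is to build the list of daily URLs from a real calendar with `pd.date_range`. `daily_gdelt_urls` and `BASE_URL` are hypothetical names introduced for this example.

```python
import pandas as pd

# Same base URL as in the notebook; the constant name is an assumption of this sketch
BASE_URL = "http://data.gdeltproject.org/events"


def daily_gdelt_urls(year):
    """Return one daily export URL per real calendar day in `year`."""
    days = pd.date_range(start=f"{year}-01-01", end=f"{year}-12-31", freq="D")
    return [f"{BASE_URL}/{day.strftime('%Y%m%d')}.export.CSV.zip" for day in days]


urls_2014 = daily_gdelt_urls(2014)
print(len(urls_2014))
print(urls_2014[0])
```

Iterating over this list would replace the two inner loops with a single one and handles leap years without extra code.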

In [6]:
# Process data from 2013 to 2023
process_year_data(start_year=2013, end_year=2023)


Processing data for year 2013...
File not found (404): http://data.gdeltproject.org/events/20130101.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20130102.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20130103.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20130104.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20130105.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20130106.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20130107.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20130108.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20130109.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20130110.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20130111.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/2013011

KeyboardInterrupt: 

**Note:** The function above will take at least 4 or 5 hours to fetch the data.
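
Since a full run takes hours and can be interrupted, it may help to skip years whose yearly CSV already exists. The sketch below is one way to do that, assuming the `{year}_protest_data.csv` naming used above. `years_still_needed` is a hypothetical helper, and the temporary directory only stands in for the real output folder.

```python
import os
import tempfile


def years_still_needed(start_year, end_year, output_dir):
    """List the years that do not yet have a saved yearly CSV in output_dir."""
    return [
        year
        for year in range(start_year, end_year + 1)
        if not os.path.exists(os.path.join(output_dir, f"{year}_protest_data.csv"))
    ]


# Demonstration in a throwaway directory with one year already "done"
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "2015_protest_data.csv"), "w").close()
    pending = years_still_needed(2014, 2016, demo_dir)

print(pending)
```

Note that this only detects years that were saved in full. A year interrupted midway has no file yet and would be processed again from January.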

### Pulling data for 2006 to 2013:

I decided it was necessary to pull this data after the initial visualisations showed that the art data we compared it to did not have much data in the 2010s. Therefore, it was necessary to have social movements data from earlier decades to be able to see some correlation between social movements and art themes.

In [None]:
# GDELT column names
GDELT_COLUMNS = [
    "GLOBALEVENTID", "SQLDATE", "MonthYear", "Year", "FractionDate", "Actor1Code", "Actor1Name",
    "Actor1CountryCode", "Actor1KnownGroupCode", "Actor1EthnicCode", "Actor1Religion1Code",
    "Actor1Religion2Code", "Actor1Type1Code", "Actor1Type2Code", "Actor1Type3Code", "Actor2Code",
    "Actor2Name", "Actor2CountryCode", "Actor2KnownGroupCode", "Actor2EthnicCode", "Actor2Religion1Code",
    "Actor2Religion2Code", "Actor2Type1Code", "Actor2Type2Code", "Actor2Type3Code", "IsRootEvent",
    "EventCode", "EventBaseCode", "EventRootCode", "QuadClass", "GoldsteinScale", "NumMentions",
    "NumSources", "NumArticles", "AvgTone", "Actor1Geo_Type", "Actor1Geo_FullName", "Actor1Geo_CountryCode",
    "Actor1Geo_ADM1Code", "Actor1Geo_Lat", "Actor1Geo_Long", "Actor1Geo_FeatureID", "Actor2Geo_Type",
    "Actor2Geo_FullName", "Actor2Geo_CountryCode", "Actor2Geo_ADM1Code", "Actor2Geo_Lat",
    "Actor2Geo_Long", "Actor2Geo_FeatureID", "ActionGeo_Type", "ActionGeo_FullName",
    "ActionGeo_CountryCode", "ActionGeo_ADM1Code", "ActionGeo_Lat", "ActionGeo_Long",
    "ActionGeo_FeatureID", "DATEADDED"
]

In [None]:
def download_and_process_monthly_zip(url, output_dir):
    """
    Downloads and processes a monthly GDELT zip file, filters for US social movement data, 
    and returns a DataFrame.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    file_name = url.split("/")[-1]
    zip_path = os.path.join(output_dir, file_name)

    try:
        response = requests.get(url, timeout=60)
        if response.status_code == 404:
            print(f"File not found (404): {url}")
            return None
        elif response.status_code == 200:
            with zipfile.ZipFile(io.BytesIO(response.content)) as z:
                for file in z.namelist():
                    with z.open(file) as csv_file:
                        # Read in chunks to handle large files
                        chunk_list = []
                        for chunk in pd.read_csv(csv_file, sep='\t', header=None, low_memory=True, chunksize=100000):
                            chunk.columns = GDELT_COLUMNS[:chunk.shape[1]]
                            filtered_chunk = chunk[(chunk['Actor1Geo_CountryCode'] == 'US') & (chunk['EventCode'] == 140)]
                            chunk_list.append(filtered_chunk)
                        if chunk_list:
                            return pd.concat(chunk_list, ignore_index=True)
    except requests.exceptions.RequestException as e:
        print(f"Error downloading {url}: {e}")
    except zipfile.BadZipFile as e:
        print(f"Zip file error: {e}")
    return None
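
The chunked filter can be checked on a small in-memory sample. This is a sketch under stated assumptions rather than the notebook's method: it reads all columns as text with `dtype=str` and compares `EventCode` to the string `"140"`, because CAMEO event codes can carry leading zeros (such as `014`) that are lost when a column is parsed as integers. `filter_protests_in_chunks` and the three-column sample are hypothetical.

```python
import io

import pandas as pd


def filter_protests_in_chunks(tsv_file, column_names, chunksize=100_000):
    """Stream a headerless TSV and keep only US rows with EventCode '140'."""
    kept = []
    reader = pd.read_csv(
        tsv_file,
        sep="\t",
        header=None,
        names=column_names,
        dtype=str,  # read everything as text so codes are compared as strings
        chunksize=chunksize,
    )
    for chunk in reader:
        mask = (chunk["Actor1Geo_CountryCode"] == "US") & (chunk["EventCode"] == "140")
        kept.append(chunk[mask])
    if not kept:
        return pd.DataFrame(columns=column_names)
    return pd.concat(kept, ignore_index=True)


# Four sample rows, read two at a time to exercise the chunking
sample_tsv = "1\tUS\t140\n2\tUS\t014\n3\tFR\t140\n4\tUS\t140\n"
result = filter_protests_in_chunks(
    io.StringIO(sample_tsv),
    ["GLOBALEVENTID", "Actor1Geo_CountryCode", "EventCode"],
    chunksize=2,
)
print(result)
```

Only the filtered rows of each chunk are held in memory, which is the point of chunking for the large monthly files.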


In [None]:
def process_year_data(start_year, end_year, output_dir="../../data/raw"):
    """
    Processes data from 2006 to 2013, combining monthly files into yearly CSVs.
    """
    for year in range(start_year, end_year + 1):
        print(f"Processing data for year {year}...")
        year_data = pd.DataFrame()  # Initialize yearly data storage

        for month in range(1, 13):
            # Build the URL for the monthly file
            url = f"http://data.gdeltproject.org/events/{year}{str(month).zfill(2)}.zip"
            
            # Download and process the monthly file
            df_filtered = download_and_process_monthly_zip(url, output_dir)
            if df_filtered is not None:
                year_data = pd.concat([year_data, df_filtered], ignore_index=True)

            # Add a delay to avoid overloading the server
            time.sleep(2)

        # Save the year's data to a CSV file
        if not year_data.empty:
            output_file = os.path.join(output_dir, f"{year}_protest_data.csv")
            year_data.to_csv(output_file, index=False)
            print(f"Year {year} data saved to {output_file}")
        else:
            print(f"No data found for year {year}.")

In [7]:
# Run the full processing from 2006 to 2013
process_year_data(2006, 2013)


Processing data for year 2013...


  for chunk in pd.read_csv(csv_file, sep='\t', header=None, low_memory=True, chunksize=100000):

File not found (404): http://data.gdeltproject.org/events/201304.zip
File not found (404): http://data.gdeltproject.org/events/201305.zip
File not found (404): http://data.gdeltproject.org/events/201306.zip
File not found (404): http://data.gdeltproject.org/events/201307.zip
File not found (404): http://data.gdeltproject.org/events/201308.zip
File not found (404): http://data.gdeltproject.org/events/201309.zip
File not found (404): http://data.gdeltproject.org/events/201310.zip
File not found (404): http://data.gdeltproject.org/events/201311.zip
File not found (404): http://data.gdeltproject.org/events/201312.zip
Year 2013 data saved to ../data_amelia/raw/2013_protest_data.csv
