# How popular are different social movements over time?

LSE DS105A - Data for Data Science (2024/25)

**Date**: 18/11/24

**Author**: Amelia Dunn

**Objective**: 🌟 Pull data files from the GDELT API to measure the popularity of different social movements or events over time.

### Sources we looked into:
- I tried investigating Reddit data, but it does not give access to historical data.
- Then I tried investigating X (previously known as Twitter), but its historical data is only accessible by paying.
- We tried querying data from the GDELT API, but we were only able to access data from 2014 onward, which we found too limiting. On top of this, it provides a separate JSON file for each day from 2014, which is a lot.
- We tried to get data from OpenSanctions and OpenCorporates, but found that these are APIs you have to pay for.
- We are now cycling back to GDELT and gathering data only from 2014 onwards.

In [1]:
import requests
import zipfile
import io
import os
import pandas as pd
import time
from datetime import datetime
import warnings

import sys
sys.path.append("..")  

from data_collection import download_and_process_zip, process_year_data, download_and_process_monthly_zip, Process_year_data


---

## GDELT data:

- Pulling each day's data file from the GDELT [website](http://data.gdeltproject.org/events/index.html) and combining them into yearly CSV files.

**Processing the data so as not to exceed the data limit:**

In [2]:
# defining the column names of the data 

GDELT_COLUMNS = [
    "GLOBALEVENTID", "SQLDATE", "MonthYear", "Year", "FractionDate", "Actor1Code", "Actor1Name", 
    "Actor1CountryCode", "Actor1KnownGroupCode", "Actor1EthnicCode", "Actor1Religion1Code", 
    "Actor1Religion2Code", "Actor1Type1Code", "Actor1Type2Code", "Actor1Type3Code", "Actor2Code", 
    "Actor2Name", "Actor2CountryCode", "Actor2KnownGroupCode", "Actor2EthnicCode", "Actor2Religion1Code", 
    "Actor2Religion2Code", "Actor2Type1Code", "Actor2Type2Code", "Actor2Type3Code", "IsRootEvent", 
    "EventCode", "EventBaseCode", "EventRootCode", "QuadClass", "GoldsteinScale", "NumMentions", 
    "NumSources", "NumArticles", "AvgTone", "Actor1Geo_Type", "Actor1Geo_FullName", "Actor1Geo_CountryCode", 
    "Actor1Geo_ADM1Code", "Actor1Geo_Lat", "Actor1Geo_Long", "Actor1Geo_FeatureID", "Actor2Geo_Type", 
    "Actor2Geo_FullName", "Actor2Geo_CountryCode", "Actor2Geo_ADM1Code", "Actor2Geo_Lat", "Actor2Geo_Long", 
    "Actor2Geo_FeatureID", "ActionGeo_Type", "ActionGeo_FullName", "ActionGeo_CountryCode", "ActionGeo_ADM1Code", 
    "ActionGeo_Lat", "ActionGeo_Long", "ActionGeo_FeatureID", "DATEADDED", "SOURCEURL"
]

Function to download the data and process it: `download_and_process_zip`, found in the [data_collection.py](../data_collection.py) file. Some processing had to happen at download time because the files were too large overall to download all of them before processing.
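As a rough illustration of the download-then-filter idea, the sketch below reads one zipped daily export that is already held in memory and keeps only protest rows. It is not the code in `data_collection.py`: the function name `filter_protest_rows`, the choice of kept columns, and the filter on `EventRootCode == "14"` (the CAMEO root code for protests) are assumptions made for this example. It also assumes the export is tab-delimited with no header row.

```python
import csv
import io
import zipfile


def filter_protest_rows(zip_bytes, columns,
                        keep=("GLOBALEVENTID", "SQLDATE", "EventRootCode")):
    """Return protest rows from a zipped, tab-delimited export held in memory.

    Hypothetical helper for illustration only.
    """
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Assumption: each daily archive holds a single CSV member.
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8",
                                    errors="replace", newline="")
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            for fields in reader:
                # Pair each value with its column name.
                record = dict(zip(columns, fields))
                # Keep only rows whose root event code is 14 (protest).
                if record.get("EventRootCode") == "14":
                    rows.append({name: record.get(name, "") for name in keep})
    return rows
```

Filtering row by row like this means the full daily file never has to be written to disk, which is one way to keep the stored data small.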

Function to build yearly summaries of the processed daily data: `process_year_data`, found in the same file as the function above.
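One way to picture the yearly step is as a loop over every calendar day of a year, forming one file URL per day. The sketch below only builds the URLs. The name `daily_urls` is hypothetical, and the URL pattern is taken from the download log printed in this notebook rather than from `data_collection.py`.

```python
from datetime import date, timedelta

# Assumption: base location of the daily event files, as seen in the log output.
BASE_URL = "http://data.gdeltproject.org/events"


def daily_urls(year):
    """Yield one daily export URL per calendar day of `year` (hypothetical helper)."""
    day = date(year, 1, 1)
    while day.year == year:
        yield f"{BASE_URL}/{day:%Y%m%d}.export.CSV.zip"
        day += timedelta(days=1)
```

A yearly routine could then request each URL in turn, skip any that return a 404, and append the filtered rows to a single CSV for that year.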

* Even though the data is partially processed, it is saved in the raw data folder because we process it further in NB02. Some processing was necessary in this notebook to make the file sizes smaller, but it is minimal.

In [11]:
# Process data from 2013 to 2023 using the download_and_process_zip and process_year_data functions
process_year_data(start_year=2013, end_year=2023)


Processing data for year 2013...
File not found (404): http://data.gdeltproject.org/events/20130101.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20130102.export.CSV.zip
File not found (404): http://data.gdeltproject.org/events/20130103.export.CSV.zip


KeyboardInterrupt: 

**Note:** The function above takes at least 4 to 5 hours to download the data. It shows an error in this notebook because we interrupted the run. Furthermore, no files were found for the first 3 months of 2013.
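Because a run this long is likely to be interrupted, it can help to check which yearly files already exist before restarting. The sketch below is one possible approach, not something `process_year_data` is known to do. The name `years_still_needed` is hypothetical, the `{year}_protest_data.csv` naming is borrowed from the log output of the 2006 to 2012 run in this notebook, and `end_year` is treated as inclusive.

```python
from pathlib import Path


def years_still_needed(start_year, end_year, raw_dir):
    """List years in [start_year, end_year] with no yearly CSV in raw_dir yet.

    Hypothetical helper; assumes files are named '<year>_protest_data.csv'.
    """
    raw_dir = Path(raw_dir)
    return [
        year
        for year in range(start_year, end_year + 1)
        if not (raw_dir / f"{year}_protest_data.csv").exists()
    ]
```

A restart could then loop over only the returned years, although a year whose file was partly written before the interruption would still need to be deleted and rerun by hand.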

### Pulling data for 2006 to 2013:

I decided it was necessary to pull this data so that we had a large enough dataset for our visualisation to properly demonstrate whether there is any correlation with GDP.

In [13]:
# GDELT column names for the older files (same as the list above, but without SOURCEURL)
GDELT_COLUMNS = [
    "GLOBALEVENTID", "SQLDATE", "MonthYear", "Year", "FractionDate", "Actor1Code", "Actor1Name",
    "Actor1CountryCode", "Actor1KnownGroupCode", "Actor1EthnicCode", "Actor1Religion1Code",
    "Actor1Religion2Code", "Actor1Type1Code", "Actor1Type2Code", "Actor1Type3Code", "Actor2Code",
    "Actor2Name", "Actor2CountryCode", "Actor2KnownGroupCode", "Actor2EthnicCode", "Actor2Religion1Code",
    "Actor2Religion2Code", "Actor2Type1Code", "Actor2Type2Code", "Actor2Type3Code", "IsRootEvent",
    "EventCode", "EventBaseCode", "EventRootCode", "QuadClass", "GoldsteinScale", "NumMentions",
    "NumSources", "NumArticles", "AvgTone", "Actor1Geo_Type", "Actor1Geo_FullName", "Actor1Geo_CountryCode",
    "Actor1Geo_ADM1Code", "Actor1Geo_Lat", "Actor1Geo_Long", "Actor1Geo_FeatureID", "Actor2Geo_Type",
    "Actor2Geo_FullName", "Actor2Geo_CountryCode", "Actor2Geo_ADM1Code", "Actor2Geo_Lat",
    "Actor2Geo_Long", "Actor2Geo_FeatureID", "ActionGeo_Type", "ActionGeo_FullName",
    "ActionGeo_CountryCode", "ActionGeo_ADM1Code", "ActionGeo_Lat", "ActionGeo_Long",
    "ActionGeo_FeatureID", "DATEADDED"
]

In [18]:
# Suppress pandas DtypeWarning messages
warnings.filterwarnings("ignore", category=pd.errors.DtypeWarning)

In [21]:
# Run the full processing from 2006 to 2012 using the Process_year_data function
Process_year_data(2006, 2012)


Processing data for year 2006...
Year 2006 data saved to ../../data/raw/2006_protest_data.csv
Processing data for year 2007...
Year 2007 data saved to ../../data/raw/2007_protest_data.csv
Processing data for year 2008...
Year 2008 data saved to ../../data/raw/2008_protest_data.csv
Processing data for year 2009...
Year 2009 data saved to ../../data/raw/2009_protest_data.csv
Processing data for year 2010...
Year 2010 data saved to ../../data/raw/2010_protest_data.csv
Processing data for year 2011...
Year 2011 data saved to ../../data/raw/2011_protest_data.csv
Processing data for year 2012...
Year 2012 data saved to ../../data/raw/2012_protest_data.csv


**Note**: The function above takes around 25 minutes to process.