
ada-2021-project-hivemind created by GitHub Classroom


hhildaa/epfl-ada-project


NATURAL DISASTERS IN QUOTES [Data Story]

📖 Table of Contents

  1. ➤ Abstract
  2. ➤ Research questions
  3. ➤ Additional datasets
  4. ➤ Folder Structure
  5. ➤ Methods
  6. ➤ Organization within the team

📝 Abstract

Every year, natural disasters strike and often claim many lives. After such events, the pages of newspapers fill with quotes from people mourning the tragedy, and the events often remain in people's memory for a lifetime. What influences how long these events stay in the conversation? In this research project, we explore how much is said about the biggest earthquakes after they have occurred and which factors influence this. We look for answers in Quotebank quotes from 2008-2020, matched against disasters from the international disasters database and combined with world development indicators from the World Data Bank. To simplify disaster quote detection, we also look into classifying quotes by whether or not they talk about a disaster.

⚛️ Research questions

We set out to explore two questions in this research project.

First, how well do NLP models trained on disaster tweets, like those in this kaggle challenge, generalize to classifying disaster quotes in Quotebank? This question matters for the robustness analysis of such models and their transfer-learning capabilities.

Second, what factors influence how long an earthquake is talked about in Quotebank quotes from 2008 to 2020? Factors of interest include total deaths, total damage in dollars, the country where the disaster struck, wealth indicators of that country, etc.

You can find the description of main results on our website: adahivemind.github.io.

💾 Additional datasets

Besides doing exploratory data analysis on Quotebank in quotes_eda.ipynb, we covered four additional datasets.

1. Disasters

Data source: The international disasters database

We use the international disasters database to introduce natural disasters of this century with their most important attributes.

This dataset was compiled from various sources including UN, governmental and non-governmental agencies, insurance companies, research institutes, and press agencies. In the majority of cases, a disaster will only be entered into EM-DAT if at least two sources report the disaster's occurrence in terms of deaths and/or affected persons.

2. World Development Indicators

Data source: https://databank.worldbank.org/source/world-development-indicators

One important factor in how much people talk about a disaster might be the affected country and its attributes. This dataset contains the country's most important development indicators, for example GDP, population, and fertility rate. A detailed per-indicator source and description is given in databank_wdi_metadata.csv. We want to observe whether there is a connection between these indicators and how long, and with what distribution over time, a disaster is talked about.
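Linking a disaster record to its country's indicators boils down to a country-keyed join. The sketch below uses tiny hypothetical frames in place of the real EM-DAT and WDI tables (the actual column names live in `datasets/emdat` and `datasets/wdi`); only the join pattern is meant to carry over.

```python
import pandas as pd

# Hypothetical stand-ins for the EM-DAT and WDI tables; real column
# names differ and are documented in databank_wdi_metadata.csv.
disasters = pd.DataFrame({
    "country_code": ["HTI", "JPN"],
    "event": ["earthquake", "earthquake"],
    "total_deaths": [222570, 19846],
})
wdi = pd.DataFrame({
    "country_code": ["HTI", "JPN"],
    "gdp_per_capita": [1200, 48000],
})

# Attach country-level development indicators to each disaster record.
merged = disasters.merge(wdi, on="country_code", how="left")
```

A left join keeps every disaster even when an indicator is missing for its country, which is the safe default here.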

3. GDELT Geographic Lookup of Domains

Data source: https://blog.gdeltproject.org/mapping-the-media-a-geographic-lookup-of-gdelts-sources/

The geographical location of a newspaper could affect the quotes it contains. Although the quotes in the Quotebank dataset contain links to the article in which they were found, the link itself does not reveal the true geographical location of the news source. For example, theguardian.com and nytimes.com both use the .com top-level domain, yet they report events in different countries. That is why we chose a GDELT dataset that maps a particular domain to the country its news source comes from.
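The lookup itself is a simple dictionary keyed by domain. A minimal sketch, with a toy two-entry mapping standing in for the table loaded from gdeltdomainsbycountry_may2018.txt and FIPS_country.txt:

```python
from typing import Optional
from urllib.parse import urlparse

# Toy domain -> country mapping; the real one is parsed from the
# GDELT lookup file together with the FIPS country-name table.
DOMAIN_TO_COUNTRY = {
    "theguardian.com": "United Kingdom",
    "nytimes.com": "United States",
}

def source_country(article_url: str) -> Optional[str]:
    """Return the country of the news outlet behind an article URL."""
    domain = urlparse(article_url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[len("www."):]
    return DOMAIN_TO_COUNTRY.get(domain)
```

Domains absent from the lookup simply yield `None`, so quotes from unmapped outlets can be dropped or binned as "unknown".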

4. Twitter Disaster Dataset

Data source: https://www.kaggle.com/c/nlp-getting-started

This dataset is a collection of tweets about disasters and about random topics. Each tweet is labeled according to whether it is about a disaster or not, and we use these labels to train NLP models for transfer learning.

🌵 Folder Structure

.
│
├── datasets
│   ├── quotebank
│   │   └── quotes-{year}.parquet
│   │
│   ├── gdelt_domains_by_country
│   │   ├── gdeltdomainsbycountry_may2018.txt
│   │   └── FIPS_country.txt
│   │
│   ├── emdat
│   │   ├── emdat_public_2021_11_06_clean.csv
│   │   └── emdat_public_2021_11_06.csv
│   │
│   └── wdi
│       ├── databank_wdi_data.csv
│       └── databank_wdi_metadata.csv
│
├── quotes_eda.ipynb
├── wdi_eda.ipynb
├── earthquake_quotebank_extraction.ipynb
├── media_influencing_factors.ipynb
└── disasters_eda.ipynb


🔍 Methods

To handle the large Quotebank dataset, we set up a pipeline using Dask, a flexible parallel computing library for analytics. The image below shows the Dask scheduler dashboard during a long-running task.

In tackling the research questions, the crucial component was to classify whether a specific quote talks about disasters in general or about one specific event. The first method we experimented with is simple regex matching of expected keywords; these keywords might include details about the place where the disaster happened.
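The keyword matching can be sketched as a regex with one lookahead per required keyword, so a quote matches only if it mentions both the place and the event type. The keyword pair below is illustrative:

```python
import re

# Hypothetical keyword pair for the 2010 Haiti earthquake: require
# both the place and the event type somewhere in the quote.
pattern = re.compile(r"(?=.*\bhaiti\b)(?=.*\bearthquake\b)", re.IGNORECASE)

quotes = [
    "Haiti needs our help after the earthquake",
    "The earthquake in Chile was devastating",
    "Haiti is rebuilding",
]
matches = [q for q in quotes if pattern.search(q)]
```

Because each lookahead scans independently, keyword order inside the quote does not matter.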

Overall, we found that matching on the event name together with the event type works best, so we decided to filter the quotes with this methodology. The following mini case study shows the same behaviour. It concerns the 2010 Haiti earthquake, which had the highest death toll in the period we study. The two figures below show the frequency over time of quotes that contain haiti+earthquake (top figure) and haiti+earthquake+2010 (bottom figure). The details of the case study are in the quotes_eda.ipynb notebook.
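The frequency-over-time curves in the case study come down to counting matched quotes per month. A minimal sketch with a few made-up dates standing in for the matched Quotebank quotes:

```python
import pandas as pd

# Toy dates of matched quotes; in the case study these are the
# Quotebank dates of haiti+earthquake matches.
matched = pd.DataFrame({
    "date": pd.to_datetime(
        ["2010-01-13", "2010-01-20", "2010-02-02", "2011-01-12"]
    ),
})

# Monthly frequency of matched quotes ("MS" = month-start bins).
per_month = matched.set_index("date").resample("MS").size()
```

Resampling (rather than a plain groupby on year-month) keeps empty months in the index as zeros, which matters when plotting how quickly attention decays.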

We also wanted to investigate how models trained on the Twitter disaster dataset, built for classifying disaster tweets, perform on quotes from Quotebank. To do so, we trained BERT and DistilBERT models and evaluated their performance on Quotebank by hand-labeling a subset of their positive predictions.
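Since Quotebank has no ground-truth disaster labels, hand-labeling a random sample of the model's positive predictions yields an estimate of its precision. The helper below is a hypothetical sketch of that evaluation step, not the notebooks' actual code:

```python
import random

def estimate_precision(positive_ids, hand_labels, sample_size=100, seed=0):
    """Estimate classifier precision from a hand-labeled random sample.

    positive_ids: ids of quotes the model predicted as disaster-related.
    hand_labels:  id -> True if we judged the quote truly disaster-related.
    """
    rng = random.Random(seed)
    sample = rng.sample(list(positive_ids),
                        min(sample_size, len(positive_ids)))
    correct = sum(hand_labels[qid] for qid in sample)
    return correct / len(sample)
```

With, say, four predictions of which three are truly about disasters, the estimate is 0.75; on the real data the sample size bounds how tight that estimate is.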

🎯 Organization within the team

Batuhan - NLP model training, preparation and deployment of the website

Frano - NLP model training, labeling of the quotes, outlining and enriching the data storyline

Hilda - Exploration of media influencing factors, evaluation and visualization of the outcomes, labeling of the quotes

Lovro - Labeling of the quotes, quote-disaster matching, evaluation and visualization of the outcomes
