<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Disaster Dashboards
## Leveraging News and Media for Situational Awareness (Problem #2)
---

### Team
 - Jonathan Ruiz
 - Paul Schimek
 - Michelle Cheung

### Project Statement
---
During a major disaster, it is essential to provide the public and responders with relevant local news updates so that they can maintain situational awareness during the event. News updates arrive from tens to hundreds of different sources, in different formats and across different websites, news channels, and other outlets, and it is often difficult to find what would be most helpful amid the chaos of unrelated news and media. There is currently no forum for rounding up and archiving relevant news for a live disaster event. This project will leverage news feeds relevant to specific disasters, gathered from multiple sources, to create a website that presents these live feeds in one dashboard.

### Objective
---
Social media provides valuable real-time awareness to first responders and relief workers by providing information on localized emergencies during a disaster. However, informative signals are often clouded with irrelevant noise.

The goal of this project is to separate informative from non-informative media in order to help filter and triage where actionable help is needed. An online dashboard will be built to publicly communicate the informative tweets. The dashboard will be built using Flask, a micro web framework written in Python. Several classification models were trained and compared to find the most effective one.
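
As a rough illustration of the Flask side, the sketch below exposes already-classified tweets through a single JSON route. The route name `/api/tweets`, the `CLASSIFIED_TWEETS` list, and the example tweets are all hypothetical placeholders, not the project's actual dashboard code.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory store; in practice these rows would come from
# the trained classifier's predictions on incoming tweets.
CLASSIFIED_TWEETS = [
    {"text": "Shelter open at the high school gym", "informative": True},
    {"text": "Can't believe this weather lol", "informative": False},
    {"text": "Bridge on Main St is closed due to flooding", "informative": True},
]


@app.route("/api/tweets")
def informative_tweets():
    """Return only the tweets the model flagged as informative."""
    informative = [t for t in CLASSIFIED_TWEETS if t["informative"]]
    return jsonify(informative)
```

During development, an app like this can be served with Flask's built-in `flask run` command, and a front-end page can poll the route to refresh the dashboard.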

### Procedure
---

The media used to train classification models were human-labeled tweets sourced from two previous studies on disaster-related social media. Approximately 13,000 tweets were retrieved from "CrisisMMD: Multimodal Twitter Datasets from Natural Disasters"<sup>1</sup>, which included tweets from Hurricane Irma (tweets dated circa Sept. 2017), Hurricane Harvey (tweets dated circa Aug. to Sept. 2017), and Hurricane Maria (tweets dated circa Sept. to Nov. 2017). Tweets used from this study were filtered so that they included at least one image and two words or hashtags. These tweets were labeled as informative or not-informative based on whether a given tweet or image contained information useful for humanitarian aid purposes.

An additional 3,000 tweets were retrieved from "Practical Extraction of Disaster-Relevant Information from Social Media"<sup>2</sup>, which included tweets from the Joplin Tornado (tweets dated circa May 2011) and Hurricane Sandy (tweets dated circa Oct. 2012). These tweets were labeled as either Personal (if a message was only of interest to its author and her immediate circle of family/friends and did not convey anything useful), Informative (if the message was of interest to other people beyond the author’s immediate circle), or Other (if the message was not related to the disaster).

Through iterations of models, it was found that training on the Sandy/Joplin dataset alone was more effective than training on both the Sandy/Joplin dataset and the Irma/Harvey/Maria dataset. Consequently, our final model was trained only on the Sandy/Joplin tweets.

To communicate the types of humanitarian aid needed more efficiently, classification was further broken down from informative/not informative into the following classes.

| Tweet Class | Description |
| --- | --- |
| **Not informative** | *Tweets which did not contain information valuable for disaster recovery and/or rescue.* |
| **Casualties and damage** | *Tweets which reported information about casualties or damage done by an incident.* |
| **Caution and advice** | *Tweets which conveyed/reported information about some warning or a piece of advice about a possible hazard of an incident.* |
| **Informative, other** | *Tweets which included a message of interest to other people beyond the author's immediate circle.* |
| **Information source** | *Tweets which conveyed/reported some information sources like photo, footage, video, or mentioned other sources like TV or radio related to an incident.* |
| **Donations of money, goods, or services** | *Tweets which spoke about money raised, donation offers, goods/services offered or asked by the victims of an incident.* |

Test data consisted of tweets related to Hurricane Michael, the most recent FEMA-declared hurricane disaster. Tweets were retrieved through the Python library "GetOldTweets3." These tweets were not labeled and were collected over a date range of October 9 through October 16, 2018, with a sole search term of "Hurricane Michael." Approximately 300,000 tweets were collected in total.

Cleaning and processing of tweets included converting all alphabetical characters to lowercase, removing Twitter-specific strings (e.g., "RT"), removing duplicates, and applying *TweetTokenizer*, which truncates elongated words and removes Twitter handles.
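
The cleaning steps above can be sketched roughly as follows, assuming *TweetTokenizer* refers to NLTK's `nltk.tokenize.TweetTokenizer`. The helper names `clean_tweet` and `clean_corpus`, the retweet pattern, and the choice to deduplicate after cleaning are illustrative assumptions rather than the project's exact code.

```python
import re

from nltk.tokenize import TweetTokenizer

# preserve_case=False lowercases tokens, reduce_len shortens runs of three or
# more repeated characters to exactly three, strip_handles drops @mentions.
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)

# Assumed pattern: a leading "RT" retweet marker.
RT_PATTERN = re.compile(r"^\s*RT\b\s*")


def clean_tweet(text):
    """Strip the retweet marker, then tokenize and re-join with spaces."""
    text = RT_PATTERN.sub("", text)
    return " ".join(tokenizer.tokenize(text))


def clean_corpus(tweets):
    """Clean every tweet and drop duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        result = clean_tweet(tweet)
        if result and result not in seen:
            seen.add(result)
            cleaned.append(result)
    return cleaned
```

Deduplicating after cleaning means that retweets and case variants of the same message collapse into a single row.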

Natural language processing included TF-IDF and CountVectorizer, each used separately with the Naive Bayes algorithm.
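
A minimal sketch of the two Naive Bayes variants, assuming scikit-learn's `CountVectorizer`, `TfidfVectorizer`, and `MultinomialNB` with default settings. The handful of tweets and labels below are made up for illustration and are not drawn from the training data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy examples standing in for the labeled tweets.
texts = [
    "bridge collapsed on main street avoid the area",
    "red cross shelter open at the high school",
    "power lines down near the hospital",
    "donate blankets and water at city hall",
    "so bored during this storm",
    "my cat hates the rain",
    "watching movies all day",
    "this weather is ruining my weekend",
]
labels = ["informative"] * 4 + ["not informative"] * 4

# Same vocabulary, different weights: raw term counts versus TF-IDF scores.
count_matrix = CountVectorizer().fit_transform(texts)
tfidf_matrix = TfidfVectorizer().fit_transform(texts)

# One pipeline per vectorizer, each feeding a multinomial Naive Bayes model.
nb_cvec = make_pipeline(CountVectorizer(), MultinomialNB())
nb_tfidf = make_pipeline(TfidfVectorizer(), MultinomialNB())

for name, model in [("NB w/CVEC", nb_cvec), ("NB w/TF-IDF", nb_tfidf)]:
    model.fit(texts, labels)
    print(name, model.predict(["shelter open near the bridge"]))
```

Wrapping the vectorizer and classifier in one pipeline ensures the vocabulary is learned only from the training split when the model is later evaluated on held-out tweets.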

The algorithms used were Logistic Regression, Support Vector Machine, Naive Bayes with TF-IDF, Naive Bayes with CountVectorizer, and Random Forest. Logistic regression yielded the most successful model. See results below.
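
One way such a comparison could be set up is sketched below with scikit-learn. The TF-IDF features, the `LinearSVC` choice for the Support Vector Machine, the hyperparameters, the 75/25 split, and the toy tweets are all assumptions made for illustration; the sketch scores every model by accuracy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy examples standing in for the labeled tweets.
texts = [
    "bridge collapsed on main street avoid the area",
    "red cross shelter open at the high school",
    "power lines down near the hospital",
    "donate blankets and water at city hall",
    "tornado warning issued for the county",
    "roads flooded downtown use highway instead",
    "so bored during this storm",
    "my cat hates the rain",
    "watching movies all day",
    "this weather is ruining my weekend",
    "storm selfie with my friends",
    "cannot sleep with all this wind",
]
labels = ["informative"] * 6 + ["not informative"] * 6

# Hold out a stratified quarter of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, clf in candidates.items():
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(X_train, y_train)
    # Pipeline.score returns mean accuracy for classifiers.
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_acc, test_acc) in results.items():
    print(f"{name:<24} train={train_acc:.3f} test={test_acc:.3f}")
```

Using the same split and the same vectorizer for every candidate keeps the train and test scores comparable across models.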


### Findings
---

The logistic regression model had the highest accuracy, as shown in the summary below.

| Model | Train | Test |
| --- | --- | --- |
| **Logistic Regression** | 0.987 | 0.871 |
| **Support Vector Machine** | 0.969 | 0.817 |
| **Naive Bayes w/TF-IDF** | 0.822 | 0.624 |
| **Naive Bayes w/CVEC** | 0.924 | 0.702 |
| **Random Forest** | 0.502 (R<sup>2</sup>) | 0.427 (R<sup>2</sup>) |
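
To make the gap between training and test performance easier to read, the short sketch below subtracts the test score from the train score for each row of the table. The scores are copied from the table; the Random Forest row is reported as R<sup>2</sup>, so its gap is not directly comparable to the others.

```python
# (train, test) scores copied from the summary table.
scores = {
    "Logistic Regression": (0.987, 0.871),
    "Support Vector Machine": (0.969, 0.817),
    "Naive Bayes w/TF-IDF": (0.822, 0.624),
    "Naive Bayes w/CVEC": (0.924, 0.702),
    "Random Forest (R^2)": (0.502, 0.427),
}

# A larger train-minus-test gap suggests more overfitting to the training set.
gaps = {model: round(train - test, 3) for model, (train, test) in scores.items()}

for model, gap in gaps.items():
    print(f"{model:<24} train-test gap = {gap:.3f}")
```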

### Dashboard
---

**To be edited...**

### Conclusions
---

Further improvement to our model could be achieved by obtaining more human-labeled training data similar to the existing set, which included only approximately 3,000 tweets. Increased accuracy and precision would further reduce unnecessary noise so that immediate post-impact help can be provided.

Future work could extend the models so that they can be applied to other natural disaster types, aside from hurricanes.


### Project Deliverables
---

| File | Description |
| --- | --- |
| **Project_Write-up.md** | *Project Technical Report* |
| **Models-Sandy_Joplin.ipynb** | *Train model with Hurricane Sandy/Joplin Tornado Tweets* |
| **Models-Three Hurricanes.ipynb** | *Train model with Hurricane Irma, Hurricane Harvey, and Hurricane Maria tweets* |
| **Training Data.ipynb** | *Code for training data retrieval* |
| **Hurricane_Michael_Tweets.ipynb** | *Code for test data retrieval* |
| **code for website** | *Open source code for disaster online dashboard* |
| **Project 4_ Twitter Dashboard for Disasters** | *Powerpoint presentation* |

### Data Sources
---
1. CrisisMMD: Multimodal Twitter Datasets from Natural Disasters: Firoj Alam, Ferda Ofli, Muhammad Imran (Qatar Computing Research Institute, HBKU, Doha, Qatar)
2. Practical Extraction of Disaster-Relevant Information from Social Media: Muhammad Imran (University of Trento), Shady Elbassuoni (American University of Beirut), Carlos Castillo (Qatar Computing Research Institute), Fernando Diaz (Microsoft Research), Patrick Meier (Qatar Computing Research Institute)
3. GetOldTweets3: https://pypi.org/project/GetOldTweets3/
