# 1. Readings

## 1.1. Text summarizer

Based on https://towardsdatascience.com/summarizing-tweets-in-a-disaster-part-ii-67db021d378d:
- look for situational words that describe the situation or casualties, using SpaCy: numerals (e.g. number of casualties, important phone numbers) and entities (e.g. places, dates, events, organisations)
    - use entity types, look for content words
- tf-idf score (rank something like "Nepal" highly, but not "the") --> use Textacy
- clean data before tokenizing: abbreviations, misspellings (NLTK has a Twitter-specific tokenizer)
- formulate the summary (the selection of words) as an ILP (integer linear programming) problem
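
The tf-idf ranking mentioned in the list above can be illustrated with a minimal sketch. It uses plain Python rather than Textacy, the function name `tf_idf` is made up for this example, and the three toy tweets are invented. A word that occurs in every document (like "the") gets a score of zero, while a rarer word (like "nepal") scores higher.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            # relative term frequency times log inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

# Toy tweets, invented for illustration only
tweets = [
    "the earthquake hit nepal".split(),
    "the road is closed".split(),
    "the hospital needs blood".split(),
]
scores = tf_idf(tweets)
print(scores[0])
```

Real tf-idf implementations usually add smoothing to the idf term, so the absolute numbers will differ from library output.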

check also the notebooks
- for SpaCy: https://github.com/gabrieltseng/datascience-projects/blob/master/natural_language_processing/twitter_disasters/spaCy/3%20-%20Abstractive%20Summary.ipynb
- for NLTK: https://github.com/gabrieltseng/datascience-projects/blob/master/natural_language_processing/twitter_disasters/NLTK/3%20-%20Abstractive%20Summary.ipynb

IBM Watson research paper
- https://arxiv.org/pdf/1602.06023.pdf

TensorFlow text summarization model
- https://github.com/tensorflow/models/tree/master/research/textsum

API services
- https://smmry.com/api

Facebook AI research: A Neural Attention Model for Abstractive Sentence Summarization
- https://arxiv.org/pdf/1509.00685.pdf

- ideas for overall approach: use tweets that occur during the event as well (e.g. a Twitter dataset for a wildfire)

## 1.2. Keyword Extraction

- based on https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34
- also very interesting points on text pre-processing in here

<img src = "KeyWordExtraction_HighLevel.png" width="500">

## 1.3. Further NLP Tools

- word embeddings: https://www.wikiwand.com/en/Word_embedding --> check word2vec
- sentiment analysis: https://www.wikiwand.com/en/Sentiment_analysis
    - for background on singular value decomposition https://www.wikiwand.com/en/Singular_value_decomposition
- part-of-speech (POS) tagging
- using word graphs (powerful when there are multiple sentences describing similar situations)
- linguistic quality: compare my sample sentence to "normal" English sentences
    - see also KenLM tool at https://kheafield.com/code/kenlm/
    - and more readings to understand this challenge http://masatohagiwara.net/training-an-n-gram-language-model-and-estimating-sentence-probability.html
    - can be compared to current "correct" American English https://www.english-corpora.org/coca/
- spell checker: https://pypi.org/project/pyspellchecker/
- regular expressions: https://docs.python.org/3/library/re.html
- term frequency-inverse document frequency (tf-idf): https://hackernoon.com/finding-the-most-important-sentences-using-nlp-tf-idf-3065028897a3
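
The "linguistic quality" idea in the list above (compare a sentence against "normal" English) can be sketched with a tiny bigram language model. This is not the KenLM API: it is a toy model with add-one smoothing, the function names are made up for this example, and the reference sentences are invented. In practice the reference corpus would be something like COCA.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def avg_log_prob(tokens, unigrams, bigrams):
    """Average add-one-smoothed log-probability per bigram (higher = more fluent)."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    pairs = list(zip(padded, padded[1:]))
    total = sum(
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
        for prev, word in pairs
    )
    return total / len(pairs)

# Invented reference corpus, standing in for "normal" English
reference = [
    "there is a fire in the building".split(),
    "there is smoke in the hallway".split(),
    "a person is in the building".split(),
]
unigrams, bigrams = train_bigram_model(reference)
fluent = avg_log_prob("there is a fire in the hallway".split(), unigrams, bigrams)
scrambled = avg_log_prob("hallway the in fire a is there".split(), unigrams, bigrams)
print(fluent > scrambled)
```

The sentence with familiar word order receives the higher average log-probability, which is the signal one would threshold on.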

### pre-trained language models
- ELMo: https://arxiv.org/abs/1802.05365
- ULMFiT: https://arxiv.org/abs/1801.06146
- OpenAI Transformer: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
- BERT: https://arxiv.org/abs/1810.04805

### NLP trends
- Commonsense inference like Event2Mind (https://arxiv.org/pdf/1805.06939.pdf) or SWAG (https://arxiv.org/abs/1808.05326)

- summary of trends to be found here: http://ruder.io/10-exciting-ideas-of-2018-in-nlp/

### more research to be done into
- general summarization
- statistical parsing
- knowledge extraction: are 911 calls given in a standard or recurring format?

### Summary of current trends in NLP
- https://www.analyticsvidhya.com/blog/2017/10/essential-nlp-guide-data-scientists-top-10-nlp-tasks/ (includes a lot of interesting and helpful links)

## 1.4. DL/ML tools

- transfer learning: https://machinelearningmastery.com/transfer-learning-for-deep-learning/

# 2. Disaster datasets

## 2.1. Twitter datasets

- https://arxiv.org/abs/1605.05894
- https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2834
- https://dl.acm.org/citation.cfm?id=2914600

## 2.2. Other datasets

- https://data.world/crowdflower/disasters-on-social-media
- collection of different datasets: https://crisisnlp.qcri.org/

## 2.3. Other github links

- Twitter: disaster classification, sentiment analysis, named entity recognition --> https://github.com/glrn/nlp-disaster-analysis
- Natural Language Understanding Bot translating unstructured text into structured data --> https://github.com/Kontikilabs/alter-nlu
- Emogram (Text Analysis for unstructured text): Acronym Resolution, Auto Correct, Key Phrase Extraction, Polarity Detection --> https://github.com/axenhammer/Emogram

# 3. Development

In [3]:
#import 
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd 
import bs4 as bs
import nltk
from nltk.tokenize import sent_tokenize # tokenizes sentences
import re
from nltk.stem import PorterStemmer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

eng_stopwords = stopwords.words('english')

# Load more data

source data from https://crisisnlp.qcri.org/lrec2016/lrec2016.html

In [87]:
file = pd.read_csv('2013_pakistan_eq.csv', skip_blank_lines=True, encoding = "ISO-8859-1")

In [88]:
file.columns

Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'choose_one_category',
       'choose_one_category:confidence', 'choose_one_category_gold',
       'tweet_id', 'tweet_text'],
      dtype='object')

In [89]:
file = file.drop(['_unit_id', '_golden', '_trusted_judgments',
       '_last_judgment_at', 'choose_one_category:confidence', 'choose_one_category_gold',
       'tweet_id'], axis=1)

In [138]:
file.choose_one_category.unique()

array(['other_useful_information', 'not_related_or_irrelevant',
       'donation_needs_or_offers_or_volunteering_services',
       'injured_or_dead_people', 'missing_trapped_or_found_people',
       'caution_and_advice', 'infrastructure_and_utilities_damage',
       'sympathy_and_emotional_support',
       'displaced_people_and_evacuations'], dtype=object)

In [125]:
reduced = file[file.choose_one_category == 'missing_trapped_or_found_people']

In [126]:
for i in reduced.tweet_text.values:
    print(i)

RT @Fahdhusain: 11 kids recovered alive from under earthquake rubble in Awaran. Shukar Allah!! #earthquake
Situation of #Balochistan: 18000+ Balochs are missing, thousands killed, no relief for #BalochistanEarthquake victims.#ReleaseAbductedBaloch
RT @AnjumKiani: CJP have you found out where ur 'Missing Persons' are? They are attacking #Earthquake relief workers, Meds teams &amp; SoldiersÃ¢â¬Â¦
RT @Atta_Waqas: #earthquake pk Army and Pk Govt... trying their level best for Rescue..  WHERE r all Baloch Nationalist Sardaars and BLA...?
CJP have you found out where your 'Missing Persons' are? They are attacking #Earthquake relief workers, Med teams &amp; Soldiers. #Balochistan


### cleaning steps

In [127]:
reduced.iloc[3].tweet_text

'RT @Atta_Waqas: #earthquake pk Army and Pk Govt... trying their level best for Rescue..  WHERE r all Baloch Nationalist Sardaars and BLA...?'

In [128]:
# remove Twitter-specific elements
# file.tweet_text = re.sub(r'http\S+', '', file.tweet_text)
# remove hyperlinks
reduced = reduced.replace(to_replace=r'http\S+', value='', regex=True)
# remove usernames (note: the pattern stops at '_', so '@Atta_Waqas' leaves '_Waqas' behind)
reduced = reduced.replace(to_replace=r'@[A-Za-z0-9]+', value='', regex=True)
# remove hashtags entirely
# reduced = reduced.replace(to_replace=r'#[A-Za-z0-9]+', value='', regex=True)
# or just remove the hash sign, but leave the actual word
reduced = reduced.replace(to_replace='#', value='', regex=True)
# remove the retweet marker (only matches when the whole username was removed)
reduced = reduced.replace(to_replace='RT :', value='', regex=True)

In [129]:
# Replace everything that is not a letter with a space (punctuation, but also digits such as "11" or "18000")
reduced.tweet_text = reduced.tweet_text.replace(to_replace='[^a-zA-Z]', value=' ', regex=True)

In [130]:
#4. Tokenize into words (all lower case)
reduced.tweet_text = reduced.tweet_text.str.lower()
reduced.tweet_text = reduced.tweet_text.str.split() 



In [131]:
reduced

| | _unit_state | choose_one_category | tweet_text |
|---|---|---|---|
| 17 | finalized | missing_trapped_or_found_people | [kids, recovered, alive, from, under, earthqua... |
| 1082 | finalized | missing_trapped_or_found_people | [situation, of, balochistan, balochs, are, mis... |
| 1126 | finalized | missing_trapped_or_found_people | [cjp, have, you, found, out, where, ur, missin... |
| 1210 | finalized | missing_trapped_or_found_people | [rt, waqas, earthquake, pk, army, and, pk, gov... |
| 1366 | finalized | missing_trapped_or_found_people | [cjp, have, you, found, out, where, your, miss... |
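
Row 1210 in the output above still starts with `rt, waqas`: the username pattern `@[A-Za-z0-9]+` stops at the underscore in `@Atta_Waqas`, so `_Waqas` survives and `RT :` never matches. One possible fix, sketched here on a plain string rather than on the DataFrame, is to use `\w` (which includes the underscore) and to drop the retweet marker together with the handle. The function name `clean_tweet` is made up for this sketch.

```python
import re

def clean_tweet(text):
    """Lower-cased word tokens of a tweet, without links, handles or retweet markers."""
    text = re.sub(r'http\S+', '', text)           # hyperlinks
    text = re.sub(r'^RT\s+@\w+:\s*', '', text)    # leading retweet marker plus handle
    text = re.sub(r'@\w+', '', text)              # remaining usernames (\w covers '_')
    text = text.replace('#', '')                  # keep the hashtag word, drop the symbol
    text = re.sub(r'[^a-zA-Z]', ' ', text)        # everything that is not a letter
    return text.lower().split()

sample = ("RT @Atta_Waqas: #earthquake pk Army and Pk Govt... trying their level "
          "best for Rescue..  WHERE r all Baloch Nationalist Sardaars and BLA...?")
tokens = clean_tweet(sample)
print(tokens[:5])  # ['earthquake', 'pk', 'army', 'and', 'pk']
```

The same patterns could be passed to `DataFrame.replace(..., regex=True)` in the cell above. They are kept separate here so the notebook output stays reproducible.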


In [None]:
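#5. Remove stopwords from each tokenized tweet
eng_stopwords = set(stopwords.words("english"))
reduced.tweet_text = reduced.tweet_text.apply(
    lambda words: [w for w in words if w not in eng_stopwords])

#6. Join the remaining tokens back into one sentence per tweet
# (emoticons are not extracted anywhere in this notebook, so none are appended)
reduced.tweet_text = reduced.tweet_text.str.join(' ')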

## Correct grammar --> slow
correct grammar and spelling, something along these lines: https://pypi.org/project/pyspellchecker/

In [82]:
liste = reduced.iloc[2].tweet_text.split(' ')

In [84]:
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
#misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])
misspelled = spell.unknown(liste)
print('falsch', misspelled)  # 'falsch' is German for 'wrong'

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))


falsch {'', 'cjp', 'are?', 'soldiersã¢â\x82¬â¦', 'workers,', 'meds', "persons'", '&amp;', "'missing"}
a
cup
are
soldiersã¢â¬â¦
workers
mess
persons
camp
missing
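
Several of the "misspelled" words in the output above are tokenization artifacts rather than spelling errors: `'are?'`, `'workers,'`, `"persons'"`, `'&amp;'` and the empty string. A minimal sketch of a preprocessing step that could run before `spell.unknown(...)` follows. It decodes HTML entities and strips surrounding punctuation using only the standard library. The function name `tokens_for_spellcheck` is made up, and the sample is the tweet from the cell above with the garbled trailing characters left out. The spell checker itself is not called here.

```python
import html
import string

def tokens_for_spellcheck(text):
    """Lower-cased tokens with HTML entities decoded and outer punctuation removed."""
    text = html.unescape(text)  # e.g. '&amp;' becomes '&'
    words = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)  # "are?" -> "are", "'missing" -> "missing"
        if word:  # skip tokens that consisted of punctuation only
            words.append(word)
    return words

sample = ("CJP have you found out where ur 'Missing Persons' are? They are attacking "
          "#Earthquake relief workers, Meds teams &amp; Soldiers")
cleaned = tokens_for_spellcheck(sample)
print(cleaned)
```

With this input, only genuine candidates such as `cjp`, `ur` and `meds` would be left for the spell checker to judge.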


## Named entity recognition/disambiguation
- find out name of school, city, street etc.

## Word embeddings

## Sentiment analysis
- check the write-up at https://www.analyticsvidhya.com/blog/2017/01/sentiment-analysis-of-twitter-posts-on-chennai-floods-using-python/, where sentiment analysis was performed on the Chennai flood dataset

In [None]:
#output: counting expressions (like Sandy Hook School or shooting)
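
A minimal sketch of the counting idea noted above, using `collections.Counter` on word n-grams. The function name `count_ngrams` is made up for this example. The input is the stopword-free transcript that appears in the Stopwords section of this notebook.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

transcript = ("hi sandy hook school think somebody shooting sandy hook school "
              "somebodys got gun caught glimpse someone theyre running hallway "
              "still running theyre still shooting sandy hook school please")
tokens = transcript.split()
words = count_ngrams(tokens, 1)
trigrams = count_ngrams(tokens, 3)
print(trigrams.most_common(1))  # [('sandy hook school', 3)]
print(words["shooting"])        # 2
```

Frequent multi-word expressions such as a location name are candidates for the named entity step above.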

## Lemmatization

In [38]:
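def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # default to noun, which is also WordNetLemmatizer's default
        return wordnet.NOUN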

In [39]:
wnl = WordNetLemmatizer()

# token_tag must hold (token, Treebank POS tag) pairs, as returned by pos_tag
wnl_stems = []
for token, tag in token_tag:
    wnl_stems.append(wnl.lemmatize(token, pos=get_wordnet_pos(tag)))

print(' '.join(wnl_stems))

hi sandy hook school i think there be somebody shoot in here in sandy hook school because somebody get a gun i catch a glimpse of someone theyre run down the hallway they be still run theyre still shoot sandy hook school please


## Stopwords

In [42]:
# build the stopword set once instead of reloading the list for every word
stop_set = set(stopwords.words("english"))
tsc_wo_stopwords = [w for w in tsc_words if w not in stop_set]
removed_stopwords = [w for w in tsc_words if w in stop_set]

print('REVIEW WITHOUT STOPWORDS:')
print(' '.join(tsc_wo_stopwords))
print()
print('Stop words removed', removed_stopwords)
print()
print('NUMBER OF STOPWORDS REMOVED:',len(removed_stopwords))

REVIEW WITHOUT STOPWORDS:
hi sandy hook school think somebody shooting sandy hook school somebodys got gun caught glimpse someone theyre running hallway still running theyre still shooting sandy hook school please

Stop words removed ['i', 'there', 'is', 'in', 'here', 'in', 'because', 'a', 'i', 'a', 'of', 'down', 'the', 'they', 'are']

NUMBER OF STOPWORDS REMOVED: 15


# ---------------------------------------------------------

### here follows a summary of what we extracted from the text (summary, keywords etc.) and how this influences the priority

### processing: spot what is an emergency situation

### recommend steps (what to do and how) --> what to employ and where to employ it

### some more links about what we can do
- https://blog.paralleldots.com/research/artificial-intelligence-can-make-public-transportation-safer/?source=post_page---------------------------