# 1. Readings

## 1.1. Text summarizer

Based on https://towardsdatascience.com/summarizing-tweets-in-a-disaster-part-ii-67db021d378d:
- look for situational words that describe the situation or casualties using spaCy: numerals (e.g. number of casualties, important phone numbers) and entities (e.g. places, dates, events, organisations)
    - use entity types, look for content words
- tf-idf score (rank something like "Nepal" highly, but not "the") --> use Textacy
- clean data before tokenizing: abbreviations, misspellings (NLTK has a Twitter-specific tokenizer)
- formulate the summary (the choice of words) as an integer linear programming (ILP) problem
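
The tf-idf idea from the list above can be sketched without any library. This is a minimal illustration, assuming a toy corpus of three invented tweets and the plain `tf * log(N / df)` weighting; Textacy and scikit-learn use smoothed variants, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three invented, already tokenized tweets.
docs = [
    "the earthquake in nepal destroyed the old temple".split(),
    "the rescue teams reached the village".split(),
    "the weather is calm in the valley".split(),
]

def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

scores = tf_idf(docs)
# "the" occurs in every document, so its idf is log(1) = 0,
# while "nepal" occurs in only one document and gets a positive score.
print(scores[0]["the"], scores[0]["nepal"])
```

Terms that appear in every tweet contribute nothing, which is why a stop word like "the" drops out even without an explicit stop-word list.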

See also the notebooks:
- for spaCy: https://github.com/gabrieltseng/datascience-projects/blob/master/natural_language_processing/twitter_disasters/spaCy/3%20-%20Abstractive%20Summary.ipynb
- for NLTK: https://github.com/gabrieltseng/datascience-projects/blob/master/natural_language_processing/twitter_disasters/NLTK/3%20-%20Abstractive%20Summary.ipynb

IBM Watson research paper
- https://arxiv.org/pdf/1602.06023.pdf

TensorFlow text summarization model
- https://github.com/tensorflow/models/tree/master/research/textsum

API services
- https://smmry.com/api

Facebook AI research: A Neural Attention Model for Abstractive Sentence Summarization
- https://arxiv.org/pdf/1509.00685.pdf

- idea for the overall approach: use tweets as they occur as well (e.g. a Twitter dataset for a wildfire)

## 1.2. Keyword Extraction

- based on https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34
- also very interesting points on text pre-processing in here

<img src = "KeyWordExtraction_HighLevel.png" width="500">

## 1.3. Further NLP Tools

- word embeddings: https://www.wikiwand.com/en/Word_embedding --> check word2vec
- sentiment analysis: https://www.wikiwand.com/en/Sentiment_analysis
    - for background on singular value decomposition https://www.wikiwand.com/en/Singular_value_decomposition
- part-of-speech (POS) tagging
- using word graphs (powerful when there are multiple sentences describing similar situations)
- linguistic quality: compare my sample sentence to "normal" English sentences
    - see also KenLM tool at https://kheafield.com/code/kenlm/
    - and more readings to understand this challenge http://masatohagiwara.net/training-an-n-gram-language-model-and-estimating-sentence-probability.html
    - can be compared to current "correct" American English https://www.english-corpora.org/coca/
- spell checker: https://pypi.org/project/pyspellchecker/
- regular expressions: https://docs.python.org/3/library/re.html
- term frequency-inverse document frequency (TF-IDF): https://hackernoon.com/finding-the-most-important-sentences-using-nlp-tf-idf-3065028897a3
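
One way to make the "linguistic quality" idea from the list above concrete is to score a sentence with an n-gram language model, which is what KenLM does at scale. The sketch below is a tiny bigram model with add-one smoothing trained on a few invented sentences. The corpus, the function names and the smoothing choice are assumptions for illustration, and the code does not use the KenLM API.

```python
import math
from collections import Counter

# Hypothetical reference corpus standing in for "normal" English.
corpus = [
    "the earthquake damaged the bridge",
    "the bridge is closed",
    "the earthquake injured many people",
]

def train_bigram_model(sentences):
    """Count unigrams and bigrams, with <s> and </s> as sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Log-probability of a sentence under the add-one-smoothed bigram model."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total

unigrams, bigrams = train_bigram_model(corpus)
fluent = log_prob("the earthquake damaged the bridge", unigrams, bigrams)
scrambled = log_prob("bridge the damaged earthquake the", unigrams, bigrams)
print(fluent, scrambled)  # the fluent word order receives the higher score
```

Both sentences contain the same words, so the difference in score comes only from word order, which is the property a fluency check needs.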

### pre-trained language models
- ELMo: https://arxiv.org/abs/1802.05365
- ULMFiT: https://arxiv.org/abs/1801.06146
- OpenAI Transformer: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
- BERT: https://arxiv.org/abs/1810.04805

### NLP trends
- Commonsense inference like Event2Mind (https://arxiv.org/pdf/1805.06939.pdf) or SWAG (https://arxiv.org/abs/1808.05326)

- summary of trends to be found here: http://ruder.io/10-exciting-ideas-of-2018-in-nlp/

### more research to be done into
- general summarization
- statistical parsing
- knowledge extraction: are 911 calls given in a standard or recurring format?

### Summary of current trends in NLP
- https://www.analyticsvidhya.com/blog/2017/10/essential-nlp-guide-data-scientists-top-10-nlp-tasks/ (includes a lot of interesting and helpful links)

## 1.4. DL/ML tools

- transfer learning: https://machinelearningmastery.com/transfer-learning-for-deep-learning/

# 2. Disaster datasets

## 2.1. Twitter datasets

- https://arxiv.org/abs/1605.05894
- https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2834
- https://dl.acm.org/citation.cfm?id=2914600

## 2.2. Other datasets

- https://data.world/crowdflower/disasters-on-social-media
- collection of different datasets: https://crisisnlp.qcri.org/

## 2.3. Other github links

- Twitter: disaster classification, sentiment analysis, named entity recognition --> https://github.com/glrn/nlp-disaster-analysis
- Natural Language Understanding Bot translating unstructured text into structured data --> https://github.com/Kontikilabs/alter-nlu
- Emogram (Text Analysis for unstructured text): Acronym Resolution, Auto Correct, Key Phrase Extraction, Polarity Detection --> https://github.com/axenhammer/Emogram

# 3. Development

In [3]:
#import 
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd 
import bs4 as bs
import nltk
from nltk.tokenize import sent_tokenize # tokenizes sentences
import re
from nltk.stem import PorterStemmer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

eng_stopwords = stopwords.words('english')

# Load more data

source data from https://crisisnlp.qcri.org/lrec2016/lrec2016.html

In [242]:
file = pd.read_csv('2014_california_eq.csv', skip_blank_lines=True, encoding = "ISO-8859-1")

In [243]:
file.columns

Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'choose_one_category',
       'choose_one_category:confidence', 'choose_one_category_gold',
       'tweet_id', 'tweet_text'],
      dtype='object')

In [244]:
file = file.drop(['_unit_id', '_golden', '_trusted_judgments',
       '_last_judgment_at', 'choose_one_category:confidence', 'choose_one_category_gold',
       'tweet_id'], axis=1)

In [245]:
file.groupby(['choose_one_category']).count()

choose_one_category,_unit_state,tweet_text
caution_and_advice,84,84
displaced_people_and_evacuations,4,4
donation_needs_or_offers_or_volunteering_services,83,83
infrastructure_and_utilities_damage,351,351
injured_or_dead_people,217,217
missing_trapped_or_found_people,6,6
not_related_or_irrelevant,157,157
other_useful_information,1028,1028
sympathy_and_emotional_support,83,83


In [246]:
file.choose_one_category.unique()

array(['other_useful_information', 'infrastructure_and_utilities_damage',
       'injured_or_dead_people', 'sympathy_and_emotional_support',
       'not_related_or_irrelevant', 'caution_and_advice',
       'donation_needs_or_offers_or_volunteering_services',
       'missing_trapped_or_found_people',
       'displaced_people_and_evacuations'], dtype=object)

In [247]:
#reduced = file[file.choose_one_category == 'injured_or_dead_people']

#try with all data points
reduced = file

In [248]:
for i in reduced.tweet_text.values:
    print(i)

RT @nicoleewayne: Tennessee USA Knoxville http://t.co/RdBC6022xO #earthquake BREAKING NEWS 816 earthquake Northern California braces for afÃ¢â¬Â¦
RT @SFGate: We're updating this interactive map of reported damage from the #Napaquake. Check it out: http://t.co/24Db832o4U http://t.co/qyÃ¢â¬Â¦
RT @YourAnonNews: Strong 6.1 Earthquake Rocks San Francisco Bay Area, Injures 87+, Significant Damage In Napa  http://t.co/ibSTSBU9BJ
RT @heyyouapp: Wisconsin USA Madison http://t.co/4LByErEec9 BREAKING NEWS 421 earthquake 6.0 earthquake jolts Bay Area damage and at least Ã¢â¬Â¦
RT @scullather: "@infodude: amazing use of #BigData from @Jawbone - How the Napa Earthquake Affected Bay Area Sleepers https://t.co/SSlD8wQÃ¢â¬Â¦
RT @MarshGlobal: Napa, California #earthquake was region's most severe since 1989. Get tips on how affected should respond http://t.co/X4G6Ã¢â¬Â¦
#USAHeadlines Cleanup Efforts Underway after Major Earthquake in California http://t.co/TscSQQ2byw
Jawbone Looks At UP Data To See 

Sleep-Tracking Data Shows Who Was Jolted Awake By The Napa Earthquake | Fast Company | Business + Innovation http://t.co/m4Q70Dfbyj
RT @vickydnguyen: Meadowbrook Lane in Napa-- skaters finding upside to quake damage. Photo from #nbcbayarea photog Jeremy Carroll http://t.Ã¢â¬Â¦
RT @heyyouapp: Georgia USA Athens-Clarke County ÃÂ» http://t.co/rrxgq7riwS  #earthquake 5 Northern California rocked by magnitude 6.0 earthquÃ¢â¬Â¦
RT @NapaCoRedCross: Shelter open in Napa - Crosswalk Community Church 2590 1st. St in Napa. #napaquake
RT @MatthewKeysLive: Free to use with credit: Significant earthquake damage to the U.S. Post Office building in Napa, California - http://tÃ¢â¬Â¦
How the California Earthquake Will Affect Wine Prices http://t.co/uzDYkJqvO3
The Napa Earthquake by the Numbers http://t.co/qwqvNBDcQ6
My cousin said he felt the Napa quake downtown in San Fran this morning. Hope this isn't a precursor to the big one. #CaliforniaAlarmClock
M6.0  - 6km NW of American Canyon, California 2

The Napa earthquake's effect on sleep: https://t.co/lBxOKk576v http://t.co/KribWEMkm9
( #SuNoviaAquÃÂ­Ã¯â¢Å ) Skaters Make Best of Napa Earthquake by Shredding Buckled Streets: To quot... http://t.co/jHBsIgKwZT (#daniel_zabala02)
RT @ginalimp: Ã¢â â USA http://t.co/4jzeaBJet9 Ã¢â â Scores of aftershocks from Napa earthquake felt, more on the way USA Ã¢â â Officials have warned tÃ¢â¬Â¦
RT @msav64: @winewankers Nooooooooo  "What About The Wine?!" re #Napa, California, M6.1 quake, via @DavidSilverOak  https://t.co/FwheCasCwHÃ¢â¬Â
California begins clear-up after San Francisco quake http://t.co/YOXWiZOx8r
Dramatic Photos From The Earthquake In Northern California http://t.co/0Bc2WO0A8D
Teen crushed by bricks in Napa earthquake recounts pain, fear http://t.co/HsIiAdNHgr
How the Napa earthquake affected Bay Area sleepers, via @Jawbone data https://t.co/OLsEAlpq3q http://t.co/AbV5KQCy0S
QUAKE HITS CALIFORNIA State of emergency issued as officials weigh damage  http://t.co/FnlPRN

### cleaning steps

In [249]:
reduced.iloc[4].tweet_text

'RT @scullather: "@infodude: amazing use of #BigData from @Jawbone - How the Napa Earthquake Affected Bay Area Sleepers https://t.co/SSlD8wQÃ¢â\x82¬Â¦'
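
The garbled characters at the end of the tweet above look like a UTF-8 ellipsis that was mis-decoded twice. The following sketch reverses that under the assumption that the first mis-decoding used Windows-1252 and the second ISO-8859-1 (the encoding passed to `read_csv`). `repair_mojibake` is a hypothetical helper, and text that does not fit this pattern is returned unchanged.

```python
def repair_mojibake(text):
    """Undo a double mis-decoding (assumed: cp1252 first, then ISO-8859-1)."""
    try:
        # Step 1: recover the bytes as stored in the CSV and decode them as UTF-8.
        once = text.encode("latin-1").decode("utf-8")
        # Step 2: undo the earlier Windows-1252 mis-decoding.
        return once.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not follow the assumed pattern; leave it untouched.
        return text

# Tail of the tweet shown above, written with escapes to avoid editor issues.
garbled = "https://t.co/SSlD8wQ\u00c3\u00a2\u00e2\u0082\u00ac\u00c2\u00a6"
print(repair_mojibake(garbled))            # ends with a single ellipsis character
print(repair_mojibake("Napa earthquake"))  # plain ASCII passes through unchanged
```

Applied with `Series.apply` before the cleaning steps, a helper like this would turn the truncation marker back into one character that a single regular expression can remove.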

In [250]:
# remove Twitter-specific markup
#file.tweet_text = re.sub(r'http\S+', '', file.tweet_text)
# remove HTML entities
reduced = reduced.replace(to_replace =r'&amp;', value = '', regex = True)
reduced = reduced.replace(to_replace =r'&gt;', value = '', regex = True)
# remove hyperlinks
reduced = reduced.replace(to_replace =r'http\S+', value = '', regex = True)
# remove usernames
reduced = reduced.replace(to_replace =r'@\S+', value = '', regex = True) 
# remove hashtags completely
#reduced = reduced.replace(to_replace =r'#[A-Za-z0-9]+', value = '', regex = True) 
# or just remove the hash sign, but leave the actual word
reduced = reduced.replace(to_replace ='#', value = '', regex = True) 
# remove the retweet marker
reduced = reduced.replace(to_replace ='RT :', value = '', regex = True) 
reduced = reduced.replace(to_replace ='RT ', value = '', regex = True) 
# replace selected punctuation with a space
reduced = reduced.replace(to_replace ='[",:!?\\-]', value = ' ', regex = True)
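
The substitutions in the cell above operate on the whole DataFrame. The same steps can be sketched for a single string with the standard `re` module, which makes them easy to check on one tweet. `clean_tweet` is a hypothetical helper, and two details are assumptions rather than part of the original cell: the retweet marker is only removed as a whole word, and repeated whitespace is collapsed at the end.

```python
import re

def clean_tweet(text):
    """Apply the cleaning substitutions to one tweet string."""
    text = re.sub(r"&amp;|&gt;", "", text)     # HTML entities
    text = re.sub(r"http\S+", "", text)        # hyperlinks
    text = re.sub(r"@\S+", "", text)           # usernames
    text = text.replace("#", "")               # keep the hashtag word itself
    text = re.sub(r"\bRT\b\s*:?", "", text)    # retweet marker as a whole word
    text = re.sub(r'[",:!?\-]', " ", text)     # selected punctuation
    # Extra step (assumption): collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()

sample = ("RT @SFGate: We're updating this interactive map of reported damage "
          "from the #Napaquake. Check it out: http://t.co/24Db832o4U")
print(clean_tweet(sample))
```

Matching `RT` with word boundaries avoids cutting into words that merely end in "RT", such as "ALERT", which the plain `'RT '` pattern would also hit.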

In [251]:
#reduced = reduced.replace(to_replace ='^[[-+]?[0-9]*\.?[0-9]*]', value = ' ', regex = True)
#'^[\d+\.$]'
#'^[[-+]?[0-9]*\.?[0-9]*]'

In [252]:
reduced.iloc[4].tweet_text

'   amazing use of BigData from    How the Napa Earthquake Affected Bay Area Sleepers '

In [253]:
for p in reduced.tweet_text.values:
    print(p)

 Tennessee USA Knoxville  earthquake BREAKING NEWS 816 earthquake Northern California braces for afÃ¢â¬Â¦
 We're updating this interactive map of reported damage from the Napaquake. Check it out   
 Strong 6.1 Earthquake Rocks San Francisco Bay Area  Injures 87+  Significant Damage In Napa  
 Wisconsin USA Madison  BREAKING NEWS 421 earthquake 6.0 earthquake jolts Bay Area damage and at least Ã¢â¬Â¦
   amazing use of BigData from    How the Napa Earthquake Affected Bay Area Sleepers 
 Napa  California earthquake was region's most severe since 1989. Get tips on how affected should respond 
USAHeadlines Cleanup Efforts Underway after Major Earthquake in California 
Jawbone Looks At UP Data To See How Many Were Woken Up By The Napa Earthquake  Jawbone has shown one of the mo... 
The San Francisco Earthquake In Wine Country Injured Dozens Of People  NAPAÃÂ Calif. (Reuters)   A 6.0 magnitude... 
 State of emergency after 6.0 earthquake hits California wine country  
 Thinking of all thos

  Just had an inquiry regarding the impact of the Napa earthquake  wineries are open as well as mostÃ¢â¬Â¦
~ Miira Ã¢â â California wine country rocked by 6.0 quake  dozens hurt   Reuters  ReutersCalifornia wine country rocked by 6.0 quake  doz...
Napa Valley wineries sustain damage from 6.0 earthquake  LA Times     Via 
 If Big Show    running  didn't cause a massive earthquake in California  nothing will. RAWTonight
 ÃÂ¦  371 ÃÂ¦ State of Emergency Declared After California Earthquake ÃÂ¦ Jerry Brown declared a state of emÃ¢â¬Â¦
 California USA El Cajon  earthquake BREAKING NEWS 67 earthquake Northern California braces for aftershÃ¢â¬Â¦
Its efficacious wood keeps us safe from disaster yet brick from dated chimneys cause such damage in California  quake
 After Napa earthquake  authorities warn that a big aftershock remains possible 
TZ    sad. California earthquake hits wine region hard  destroying thousands of b...  APNews
Despite damage from the earthquake  tourists continu

NO   NOT THE WINE   Ã¢â¬Å Will wine prices increase after California earthquake  
 ÃÂ¦  668 ÃÂ¦ Bay Area earthquake damages Napa County Airport  earthquake ÃÂ¦ Clean up is still underway in Ã¢â¬Â¦
Thank you Quakehold  steps from epicenter of 6.0 Napa earthquake... grandmothers hutch  dishes are safe thanks to you 
Civil servants supporting Napa earthquake relief are impressive. Great to see them rise to the occasion. napaquake
 100+ Injured In Northern California's Most Powerful Earthquake Since 1989.  
Napa Mops Up Wine and Tallies Its Losses After Quake  The Napa Valley quake wreaked havoc with California wine... 
The Napa earthquake woke this many Jawbone Up wearers    
 *hella California  This photo is very California. napa earthquake 
The Latest on Damaging Earthquake in California   ABC News 
 A magnitude 6.0 earthquake in California. It was so powerful it knocked Arnold Schwarzenegger off his housekeeper.   DaÃ¢â¬Â¦
 State of emergency after quake hits California wine cou

Netanyahu announces California earthquake payback for anti semites  self haters preventing God's boat from docking in oakland.
 ISIS Hijacks California Earthquake Hashtags To Post Threats On Twitter   
 VIDEO  California shaken by earthquake
California rocked awake during region's largest quake in 25 years 
Top story  70 Injured  Buildings Damaged After 6.0 Quake Rattles NaÃ¢â¬Â¦  see more 
Watching news coverage of today's earthquake  damage up in Napa. Thankful to be safe  BayArea residents visit 
 What 6.0 earthquake did to Napa Many injuries  no known deaths.  
San Francisco Bay Area/Napa earthquake  dykes be like   last night my normal dildos momentarily turned into 'no batteries needed vibrators' 
First time I've ever said  aw  poor friend in napa  earthquake winecountry
Biggest quakes by magnitude in California  By The Associated Press California  where a quake and fire devastated... 
The Napa earthquake woke this many Jawbone Up wearers | The Verge 
 LOOK  support bace CRACKED

In [254]:
#4. Tokenize into words (all lower case)
reduced.tweet_text = reduced.tweet_text.str.lower()
reduced.tweet_text = reduced.tweet_text.str.split() 



In [255]:
for p in reduced.tweet_text.values:
    print(p)

['tennessee', 'usa', 'knoxville', 'earthquake', 'breaking', 'news', '816', 'earthquake', 'northern', 'california', 'braces', 'for', 'afã¢â\x82¬â¦']
["we're", 'updating', 'this', 'interactive', 'map', 'of', 'reported', 'damage', 'from', 'the', 'napaquake.', 'check', 'it', 'out']
['strong', '6.1', 'earthquake', 'rocks', 'san', 'francisco', 'bay', 'area', 'injures', '87+', 'significant', 'damage', 'in', 'napa']
['wisconsin', 'usa', 'madison', 'breaking', 'news', '421', 'earthquake', '6.0', 'earthquake', 'jolts', 'bay', 'area', 'damage', 'and', 'at', 'least', 'ã¢â\x82¬â¦']
['amazing', 'use', 'of', 'bigdata', 'from', 'how', 'the', 'napa', 'earthquake', 'affected', 'bay', 'area', 'sleepers']
['napa', 'california', 'earthquake', 'was', "region's", 'most', 'severe', 'since', '1989.', 'get', 'tips', 'on', 'how', 'affected', 'should', 'respond']
['usaheadlines', 'cleanup', 'efforts', 'underway', 'after', 'major', 'earthquake', 'in', 'california']
['jawbone', 'looks', 'at', 'up', 'data', 'to', 's

['there', 'was', 'also', 'an', 'earthquake', 'in', 'california', 'prayers', 'for', 'those', 'people', 'that', 'had', 'a', 'lot', 'of', 'property', 'damage']
['quake', 'is', 'major', 'test', 'for', 'hard', 'luck', 'california', 'city', 'vallejo', 'calif.', '(ap)', 'ã¢â\x82¬â\x80\x9d', 'the', 'historic', 'blue', 'collar', 'town', 'of...']
['check', 'out', 'the', 'video', 'of', 'damage', 'from', 'the', 'napa', 'earthquake', 'shot', 'by', 'a', 'drone', 'quadcopter.', 'a', 'safer', 'way', 'to', 'assess', 'damage.']
['california', 'rattled', 'by', 'strongest', 'earthquake', 'in', '25', 'years', 'httã¢â\x82¬â¦']
['hoosiers', 'share', 'scary', 'stories', 'of', 'surviving', 'northern', 'california', 'earthquake', 'by', 'eric', 'levyindianapolis', 'ind.', '(aug....']
['california', 'earthquake', 'wine', 'country', 'gets', 'a', 'jolt', 'as', 'losses', 'run', 'into', 'billions', 'napa', 'california', 'a', 'strong', 'earthq...']
['video', 'california', 'shaken', 'by', 'earthquake', 'from', 'bbc', '

['usa', 'ã\x82â»', 'news', 'ã\x82â»', 'hot', 'news', '494', 'earthquake', 'damage', 'from', 'northern', 'california', 'earthquake', 'could', 'reach', '$1', 'billion', 'earthquake', 'uã¢â\x82¬â¦']
['california', 'has', 'been', 'trending', '30min', 'on', 'the', 'earthquake', 'page', 'tweetzup']
['damage', 'from', 'california', 'quake', 'could', 'top', '$1', 'billion', 'via']
['when', 'life', 'gives', 'you', 'lemons...', 'the', 'single', 'most', 'california', 'photo', 'of', 'all', 'time.', 'napaquake', 'napaproudã¢â\x82¬â\x9d']
['humanshld', 'update', '8', 'quake', 'rocks', 'california', 'wine', 'country', 'dozens', 'injured']
['6.0', 'california', 'earthquake', 'shakes', 'famed', 'wine', 'country', 'via']
['the', 'ring', 'of', 'fire', 'gets', 'angry', 'yesterday', 'earthquake', 'in', 'chile', '(6.6).', 'today', 'in', 'california', '(6.0)', 'and', 'now', 'peru', '(7.0).']
['usa', 'ã\x82â»', 'news', 'ã\x82â»', 'hot', 'news', '65', 'earthquake', 'california', 'hit', 'with', 'earthquake', 'o

['oh', 'my.', 'i', 'was', 'afraid', 'it', 'would', 'be', 'up', 'there...', 'total', 'economic', 'losses', 'estimated', 'by', 'for', 'napa', 'earthquake', 'at', '1', 'billion', 'usd.']
['interesting', 'graph', 'and', 'data', 'analysis', 'by', 'jawbone', 'how', 'the', 'napa', 'earthquake', 'affected', 'sleep', 'wearaã¢â\x82¬â¦']
['california', 'usa', 'fullerton', 'earthquake', 'breaking', 'news', '744', 'earthquake', '6.0', 'earthquake', 'jolts', 'bay', 'area', 'damaã¢â\x82¬â¦']
['state', 'of', 'emergency', 'after', 'us', 'quake', 'a', 'state', 'of', 'emergency', 'has', 'been', 'declared', 'in', 'california', 'after', 'a', 'strong', 'quake', 'r...']
['icymi', 'willie', 'brown', 'presented', 'a', 'donation', 'on', 'behalf', 'of', 'the', 'raiders', 'for', 'napa', 'earthquake', 'relief.']
['usaheadlines', 'cleanup', 'efforts', 'underway', 'after', 'major', 'earthquake', 'in', 'california']
['this', 'photo', 'is', 'very', 'california.', 'napa', 'earthquake']
['colorado', 'usa', 'fort', 'coll

In [256]:
eng_stopwords = set(stopwords.words("english"))
reduced['tweet_text'] = reduced['tweet_text'].apply(lambda x: [item for item in x if item not in eng_stopwords])

In [257]:
#join the list items back to one string
reduced['tweet_text'] = reduced['tweet_text'].apply(lambda x: ' '.join(x))

In [258]:
for p in reduced.tweet_text.values:
    print(p)

tennessee usa knoxville earthquake breaking news 816 earthquake northern california braces afã¢â¬â¦
we're updating interactive map reported damage napaquake. check
strong 6.1 earthquake rocks san francisco bay area injures 87+ significant damage napa
wisconsin usa madison breaking news 421 earthquake 6.0 earthquake jolts bay area damage least ã¢â¬â¦
amazing use bigdata napa earthquake affected bay area sleepers
napa california earthquake region's severe since 1989. get tips affected respond
usaheadlines cleanup efforts underway major earthquake california
jawbone looks data see many woken napa earthquake jawbone shown one mo...
san francisco earthquake wine country injured dozens people napaãâ calif. (reuters) 6.0 magnitude...
state emergency 6.0 earthquake hits california wine country
thinking dealing aftermath earthquake california.
economic losses california earthquake could top $1 billion
single california photo time. napaquake
usa ãâ» news ãâ» hot news 425 earthquake northern

northern california quake way know next ... via usaheadlines
hospital 200 treated since napa earthquake
northern california still risk aftershocks largest earthquake since 1989
california earthquake damage = skatepark. (photo jeremy carroll)
expert urges california nuclear plant closure quake threat critics call diablo canyon catã¢â¬â¦
patients continue arrive queen valley hospital. triage tents set treat injured. earthquake napa hã¢â¬â¦
damages city vallejo estimated $5 million napaquake
boy's survival napa earthquake 'a blessing' sfgate via
earthquakes california peru mega earthquake coming spanish
northern california quake bad sign things come earthquake jolting northern california struck mor...
earthquake rattles northern california
wall street journal quake jolts california wine region wall street journal california's wineã¢â¬â¦
usa ãâ» news ãâ» hot news 290 earthquake dozens injured 6 critically strong earthquake california...
n. california earthquake sparks major fires ã¢â

strongest earthquake 25 years struck northern california early sunday injuring 100 people
damage earthquake jolted california's napa valley wine country could billions
damage napaquake downtown napa vintners collective
northern california earthquake ....we survived
napa's wine community unbowed massive earthquake valley residents vintners assess damage clean... htã¢â¬â¦
typical fashion media napa earthquake devoting coverage white wine.
scores injured earthquake knocked power thousands napa calif. sparking fires buckling roads
photo california. napa earthquake
meadowbrook lane napa skaters finding upside quake damage. photo nbcbayarea photo jeremy carroll
ã£â¬âusgs breakingã£â¬â 1.4 48km n inyokern california pasthour 137 earthquake tsunami prayfromjapan
photo california. napa earthquake
usa ãâ» news ãâ» hot news 550 earthquake northern california rattled magnitude 6.0 earthquake. dozensã¢â¬â¦
new post napa residents begin process repairing quake damaged homes
usaheadlines cle

wine napaearthquake strong california quake shakes famed wine country via
usa ãâ» news ãâ» hot news 713 earthquake dozens injured 3 critically strong earthquake califoã¢â¬â¦
connecticut usa bridgeport ãâ» earthquake 151 scores aftershocks napa earthquake felt moreã¢â¬â¦
skaters turn damaged street ramp ca earthquake. life gives lemons make skateboard ramp.
napa food bank low food. need non perishable foods volunteers needed f $ donations napaquake earthquake
skaters make best napa earthquake shredding buckled streets quote jurassic park's esteemed fictiona...
meadowbrook lane napa skaters finding upside quake damage. photo nbcbayarea ...
napa earthquake earth ripples inner shift
seee ã¢â¬å wasnã¢â¬â¢t earthquake california god rolling red carpet beyonce. vmasã¢â¬â
photos show damage following napa earthquake
strong magnitude 6 earthquake rocks san francisco bay area dozens hurt significant damage napa ãâ« cbs san fran
update 10 california wine country shaken 6.0 quake dozen

In [169]:
## Example code BoW

from sklearn.feature_extraction.text import CountVectorizer

sent1 = "cool students study cool data science"
sent2 = "to know data science study data science"

vect = CountVectorizer() #instantiate

sents = np.array([sent1,sent2])

vect.fit(sents);


In [170]:
print('Total number of words in the vocabulary (and position in feature matrix):\n')
print(vect.vocabulary_)

# vocabulary for the BoW model is stored in a dictionary

Total number of words in the vocabulary (and position in feature matrix):

{'cool': 0, 'students': 4, 'study': 5, 'data': 1, 'science': 3, 'to': 6, 'know': 2}


In [171]:
# Transform to get feature vectors

bag = vect.transform(sents)

bag.toarray()

# each row corresponds to one sentence

array([[2, 1, 0, 1, 1, 1, 0],
       [0, 2, 1, 2, 0, 1, 1]], dtype=int64)

In [172]:
vect.get_feature_names() # column order of the feature matrix (removed in scikit-learn 1.2; use get_feature_names_out() there)

['cool', 'data', 'know', 'science', 'students', 'study', 'to']

In [173]:
# Put it in a DataFrame for interpretability

pd.DataFrame(bag.toarray(), columns=vect.get_feature_names(), index=[sent1,sent2])

# the numbers in the DataFrame are raw term frequencies:
# tf(t, d) is the number of times a term t occurs in a document d.

Unnamed: 0,cool,data,know,science,students,study,to
cool students study cool data science,2,1,0,1,1,1,0
to know data science study data science,0,2,1,2,0,1,1
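
To see what `CountVectorizer` computes here, the same matrix can be rebuilt with the standard library. This is a simplified sketch: it splits on whitespace, whereas `CountVectorizer` lowercases the text and keeps only tokens of two or more alphanumeric characters by default. For these two sentences both approaches give the same result.

```python
from collections import Counter

sent1 = "cool students study cool data science"
sent2 = "to know data science study data science"
sentences = [sent1, sent2]

# Vocabulary: every distinct token, sorted alphabetically like CountVectorizer.
vocabulary = sorted({token for s in sentences for token in s.split()})

# One row per sentence, one column per vocabulary term.
matrix = []
for s in sentences:
    counts = Counter(s.split())
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
print(matrix)
```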


## Classify dataset

In [259]:
file.columns

Index(['_unit_state', 'choose_one_category', 'tweet_text'], dtype='object')

In [260]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics # for confusion matrix, accuracy score etc
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(\
    reduced['tweet_text'], reduced['choose_one_category'], random_state=0, test_size=.2)


# CountVectorizer can actually handle a lot of the preprocessing for us
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000)

In [261]:
%%time
# Learn the vocabulary that maps the text data to features
# Only fit on the training data (to mimic the real world)

vectorizer.fit(X_train)

Wall time: 84.6 ms


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [262]:
# Check that it worked, 
# now we have fitted a model that can transform features
# to sparse matrix representation

print(vectorizer.get_feature_names()[:10])

['00', '000', '02', '03', '04', '05', '06', '07', '08', '09']


In [263]:
train_bag = vectorizer.transform(X_train) #transform to a feature matrix
test_bag = vectorizer.transform(X_test)

In [264]:
print(train_bag.toarray().shape) # 1,610 training tweets, 2,944 features (fewer than the max_features cap of 5,000)
print(test_bag.toarray().shape)

(1610, 2944)
(403, 2944)


In [265]:
type(train_bag) # sparse matrix representation

print(train_bag)

  (0, 492)	1
  (0, 504)	1
  (0, 629)	1
  (0, 763)	1
  (0, 799)	1
  (0, 1740)	1
  (0, 2059)	1
  (1, 625)	1
  (1, 626)	1
  (1, 705)	1
  (1, 754)	1
  (1, 913)	1
  (1, 959)	1
  (1, 1108)	1
  (1, 1316)	1
  (1, 1381)	1
  (1, 1511)	1
  (1, 1740)	1
  (1, 1841)	1
  (1, 2614)	1
  (2, 45)	1
  (2, 552)	2
  (2, 700)	1
  (2, 885)	1
  (2, 913)	2
  :	:
  (1607, 2726)	2
  (1608, 276)	1
  (1608, 330)	1
  (1608, 622)	1
  (1608, 913)	1
  (1608, 1073)	1
  (1608, 1581)	1
  (1608, 1740)	1
  (1608, 1875)	1
  (1608, 2248)	1
  (1608, 2680)	1
  (1608, 2813)	1
  (1609, 467)	1
  (1609, 530)	1
  (1609, 831)	1
  (1609, 913)	1
  (1609, 1088)	1
  (1609, 1100)	1
  (1609, 1305)	1
  (1609, 1402)	1
  (1609, 1634)	1
  (1609, 1702)	1
  (1609, 1740)	1
  (1609, 2153)	1
  (1609, 2271)	1


## Classify with Random Forest model

* Fit a Random Forest model to the bag-of-words features in order to classify each tweet into one of the `choose_one_category` labels, then print the **validation accuracy** by calling `forest.predict(test_bag)` and comparing the predicted categories with the ones stored in `y_test`.

*On this dataset the fit takes only a few seconds.*

In [266]:
reduced.columns

Index(['_unit_state', 'choose_one_category', 'tweet_text'], dtype='object')

In [272]:
#list(reduced.choose_one_category.unique())

In [280]:
forest.n_outputs_

1

In [281]:
from sklearn.ensemble import RandomForestClassifier

## Initialize a Random Forest classifier with 50 trees
# hyperparameter n_estimators always set in instantiation

forest = RandomForestClassifier(n_estimators = 50) 

In [282]:
%%time
# Fit the forest to the training set, using the bag of words as 
# features and the category labels as the target variable

forest = forest.fit(train_bag, y_train) # takes a few seconds on this dataset

Wall time: 3.28 s


In [283]:
# Make predictions

train_predictions = forest.predict(train_bag)
valid_predictions = forest.predict(test_bag)

## Accuracy

In [284]:
metrics.accuracy_score(y_train,train_predictions) # about 96% training accuracy

0.9596273291925466

In [285]:
metrics.accuracy_score(y_test,valid_predictions) # about 72% validation accuracy

0.7196029776674938
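
To put the validation accuracy of about 0.72 in context, it helps to compare it with a majority-class baseline, i.e. always predicting the most frequent category. The sketch below uses the label counts from the `groupby` output earlier in this notebook. It is computed on the full dataset rather than on the test split, so it is only an approximation of the baseline on `y_test`.

```python
# Label counts copied from the groupby output earlier in this notebook.
category_counts = {
    "caution_and_advice": 84,
    "displaced_people_and_evacuations": 4,
    "donation_needs_or_offers_or_volunteering_services": 83,
    "infrastructure_and_utilities_damage": 351,
    "injured_or_dead_people": 217,
    "missing_trapped_or_found_people": 6,
    "not_related_or_irrelevant": 157,
    "other_useful_information": 1028,
    "sympathy_and_emotional_support": 83,
}

total = sum(category_counts.values())
majority_label = max(category_counts, key=category_counts.get)
# Accuracy obtained by always predicting the most frequent label.
baseline_accuracy = category_counts[majority_label] / total
print(majority_label, round(baseline_accuracy, 3))
```

Because roughly half of all tweets fall into `other_useful_information`, accuracy alone overstates how well the rarer categories are recognized.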

In [287]:
# Confusion matrix: rows are the true categories, columns the predicted ones,
# both in sorted label order (only labels present in y_test or the predictions appear)
metrics.confusion_matrix(y_test,valid_predictions)

array([[  0,   1,   0,   0,   0,   0,  16,   0],
       [  0,  10,   1,   0,   0,   0,   3,   0],
       [  0,   0,  46,   1,   0,   0,  25,   1],
       [  0,   0,   0,  41,   0,   0,   2,   0],
       [  0,   0,   0,   0,   0,   0,   3,   0],
       [  0,   0,   1,   0,   0,   8,  22,   1],
       [  5,   4,  15,   3,   0,   6, 181,   2],
       [  0,   0,   0,   0,   0,   0,   1,   4]], dtype=int64)

In [288]:
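# Unpacking .ravel() into tn, fp, fn, tp only works for a binary (2x2) matrix.
# For this multi-class matrix, compute false positives and false negatives per class.
cm = metrics.confusion_matrix(y_test,valid_predictions)
fp = cm.sum(axis=0) - np.diag(cm) # predicted as the class, but actually another
fn = cm.sum(axis=1) - np.diag(cm) # actually the class, but predicted as another
fp, fn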

In [292]:
# the predictions are category strings, so comparing against the integer 3 is always False
valid_predictions==3

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

In [215]:
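# What are the characteristics of the misclassified tweets, for example?
# Inspecting them is good practice when doing analysis.
# Note: the labels here are category strings, so comparing against the
# integers 0 and 1 matches nothing and the result below is empty.

df_test = pd.DataFrame(X_test)
df_test[(y_test.values==0) & (valid_predictions==1)]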

Unnamed: 0,tweet_text
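
Because the labels in this notebook are category names, a filter for a specific kind of error has to compare against those names. The sketch below shows the idea with plain lists. The three tweets, their labels and the `misclassified` helper are invented stand-ins for `X_test`, `y_test` and `valid_predictions`.

```python
# Hypothetical stand-ins for X_test, y_test and valid_predictions.
tweets = [
    "dozens injured napa earthquake",
    "photo california napa earthquake",
    "napa food bank low food volunteers needed",
]
true_labels = [
    "injured_or_dead_people",
    "other_useful_information",
    "donation_needs_or_offers_or_volunteering_services",
]
predicted_labels = [
    "injured_or_dead_people",
    "other_useful_information",
    "other_useful_information",
]

def misclassified(tweets, true_labels, predicted_labels, true_label, predicted_label):
    """Return tweets with the given true label that received the given prediction."""
    return [
        text
        for text, truth, pred in zip(tweets, true_labels, predicted_labels)
        if truth == true_label and pred == predicted_label
    ]

errors = misclassified(
    tweets, true_labels, predicted_labels,
    true_label="donation_needs_or_offers_or_volunteering_services",
    predicted_label="other_useful_information",
)
print(errors)
```

With the real data, the same condition can be written as a boolean mask on `df_test`, using the two category names in place of `0` and `1`.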


## Feature importances

In [289]:
importances = forest.feature_importances_
# returns relative importance of all features.
# they are in the order of the columns
print(importances)

[3.23302134e-06 3.80829646e-04 3.74849619e-05 ... 1.34828742e-04
 2.61403472e-05 4.77084384e-05]


In [290]:
# sort importance scores
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
top_10 = indices[:10]

# Get top ten features
print([vectorizer.get_feature_names()[ind] for ind in top_10])

Feature ranking:
['damage', 'injured', 'dozens', 'photo', 'earthquake', 'help', 'california', 'napa', 'injuries', 'hurt']


# WIP: Method - to be adjusted

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics

# put everything together in a function

def predict_sentiment(cleaned_reviews, y):
    """Fit a bag-of-words Random Forest on cleaned texts and their labels y.

    Example for this notebook:
    predict_sentiment(reduced['tweet_text'], reduced['choose_one_category'])
    """

    print("Creating the bag of words model!\n")
    # CountVectorizer is scikit-learn's bag of words tool; the keyword
    # arguments below are spelled out for illustration
    vectorizer = CountVectorizer(analyzer = "word",
                                 tokenizer = None,
                                 preprocessor = None,
                                 stop_words = None,
                                 max_features = 2000)

    X_train, X_test, y_train, y_test = train_test_split(
        cleaned_reviews, y, random_state=0, test_size=.2)

    # Use fit_transform() to learn the vocabulary from the training data
    # and transform it into feature vectors, then transform the test data
    # with the same vocabulary.
    # The input should be a list of strings. .toarray() converts to a numpy array

    train_bag = vectorizer.fit_transform(X_train).toarray()
    test_bag = vectorizer.transform(X_test).toarray()

    # You can extract the vocabulary created by CountVectorizer
    # by running print(vectorizer.get_feature_names())
    # (get_feature_names_out() on scikit-learn >= 1.0)

    print("Training the random forest classifier!\n")
    # Initialize a Random Forest classifier with 50 trees
    forest = RandomForestClassifier(n_estimators = 50)

    # Fit the forest to the training set, using the bag of words as
    # features and the labels as the target variable
    forest = forest.fit(train_bag, y_train)

    train_predictions = forest.predict(train_bag)
    test_predictions = forest.predict(test_bag)

    train_acc = metrics.accuracy_score(y_train, train_predictions)
    valid_acc = metrics.accuracy_score(y_test, test_predictions)
    print("The training accuracy is: ", train_acc, "\n", "The validation accuracy is: ", valid_acc)

    return forest, vectorizer

## Compare Original cleaned to lemmatized and stemmed data set

Now carry out the same analysis as above, but on `review_clean_ps` and `review_clean_wnl`.

Which data preprocessing strategy worked best? Why do you think that is? Feel free to change the number of features extracted in the bag of words model and the number of trees in the random forest model (i.e. the hyperparameters of our model) to see how this affects your accuracy. Is the accuracy better or worse?

In [None]:
%%time

print('Original Reviews')
forest1,vec1 = predict_sentiment(review_clean_original)
print('Porter Stemmer')
forest2,vec2 = predict_sentiment(review_clean_ps)
print('Lemmatizing')
forest3,vec3 = predict_sentiment(review_clean_wnl)


# It seems that Porter stemming and lemmatizing do not affect the results as much as we thought.
# This is just what Sebastian Raschka points out in his book Python Machine Learning:

'''
The Porter stemming algorithm is probably the oldest and simplest
stemming algorithm. Other popular stemming algorithms include the
newer Snowball stemmer (Porter2 or "English" stemmer) or the Lancaster
stemmer (Paice-Husk stemmer), which is faster but also more aggressive
than the Porter stemmer. Those alternative stemming algorithms are also
available through the NLTK package
(http://www.nltk.org/api/nltk.stem.html).

While stemming can create non-real words, such as thu, (from thus) as
shown in the previous example, a technique called lemmatization aims to
obtain the canonical (grammatically correct) forms of individual words—
the so-called lemmas. However, lemmatization is computationally more
difficult and expensive compared to stemming and, in practice, it has
been observed that stemming and lemmatization have little impact on the
performance of text classification (Michal Toman, Roman Tesar, and Karel
Jezek. Influence of word normalization on text classification. Proceedings of
InSciT, pages 354–358, 2006).
'''
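
The stemmers named in the quote can be compared directly. A minimal sketch, assuming NLTK is installed (these three stemmers need no corpus download); the word list is an arbitrary choice for illustration:

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Arbitrary example words, partly taken from the transcript used in this notebook
words = ["running", "shooting", "caught", "thus"]

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Stem every word with every stemmer so the results can be read side by side
stems = {name: [stemmer.stem(w) for w in words] for name, stemmer in stemmers.items()}

for name, result in stems.items():
    print(f"{name:10s} {result}")
```

Differences between the rows show where one stemmer is more aggressive than another.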

In [None]:
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
vec1.get_feature_names_out()

In [None]:
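# Pair each vectorizer with the forest that was trained on its features
for forest, vectorizer in zip([forest1, forest2, forest3], [vec1, vec2, vec3]):
    print('TOP TEN IMPORTANT FEATURES:')
    importances = forest.feature_importances_
    indices = np.argsort(importances)[::-1]
    top_10 = indices[:10]
    feature_names = vectorizer.get_feature_names_out()
    print(feature_names[top_10].tolist())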

## Named entity recognition / disambiguation
- find out name of school, city, street etc.

## Word embeddings

## Sentiment analysis
- check the write-up at https://www.analyticsvidhya.com/blog/2017/01/sentiment-analysis-of-twitter-posts-on-chennai-floods-using-python/, where sentiment analysis was performed on the Chennai flood dataset

In [None]:
#output: counting expressions (like Sandy Hook School or shooting)
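
One possible way to count such expressions is to count n-grams with `collections.Counter`. This is a sketch only: `count_ngrams` is a hypothetical helper, and the input is the stopword-free transcript printed in the Stopwords section of this notebook.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams)


# Stopword-free transcript, copied from the Stopwords section
text = (
    "hi sandy hook school think somebody shooting sandy hook school "
    "somebodys got gun caught glimpse someone theyre running hallway "
    "still running theyre still shooting sandy hook school please"
)
tokens = text.split()

unigrams = count_ngrams(tokens, 1)
trigrams = count_ngrams(tokens, 3)

print(trigrams.most_common(1))   # [('sandy hook school', 3)]
print(unigrams["shooting"])      # 2
```

Repeated multi-word expressions such as a place name stand out as the most frequent trigrams.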

## Lemmatization

In [38]:
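def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to noun, which is also the lemmatizer's default
        return wordnet.NOUN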

In [39]:
wnl = WordNetLemmatizer()

wnl_stems = []
for pair in token_tag:
    res = wnl.lemmatize(pair[0],pos=get_wordnet_pos(pair[1]))
    wnl_stems.append(res)

print(' '.join(wnl_stems))

hi sandy hook school i think there be somebody shoot in here in sandy hook school because somebody get a gun i catch a glimpse of someone theyre run down the hallway they be still run theyre still shoot sandy hook school please


## Stopwords

In [42]:
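# Build the stopword set once; set lookups are much faster than
# calling stopwords.words("english") for every word
stop_words = set(stopwords.words("english"))
tsc_wo_stopwords = [w for w in tsc_words if w not in stop_words]
removed_stopwords = [w for w in tsc_words if w in stop_words]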

print('REVIEW WITHOUT STOPWORDS:')
print(' '.join(tsc_wo_stopwords))
print()
print('Stop words removed', removed_stopwords)
print()
print('NUMBER OF STOPWORDS REMOVED:',len(removed_stopwords))

REVIEW WITHOUT STOPWORDS:
hi sandy hook school think somebody shooting sandy hook school somebodys got gun caught glimpse someone theyre running hallway still running theyre still shooting sandy hook school please

Stop words removed ['i', 'there', 'is', 'in', 'here', 'in', 'because', 'a', 'i', 'a', 'of', 'down', 'the', 'they', 'are']

NUMBER OF STOPWORDS REMOVED: 15


# ---------------------------------------------------------

### here follows a summary of what we extracted from the text (summary, keywords etc.) and how this influences the priority

### processing: spot what is an emergency situation

### recommend steps: what to do and how to do it --> what to employ and where to employ it

### some more links about what we can do
- https://blog.paralleldots.com/research/artificial-intelligence-can-make-public-transportation-safer/?source=post_page---------------------------