<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Bag-of-Words" data-toc-modified-id="Bag-of-Words-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Bag of Words</a></span></li><li><span><a href="#Train-Words2Vec" data-toc-modified-id="Train-Words2Vec-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Train Words2Vec</a></span></li><li><span><a href="#Creating-the-Urgent-vs.-non-Urgent-Vectors" data-toc-modified-id="Creating-the-Urgent-vs.-non-Urgent-Vectors-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Creating the Urgent vs. non-Urgent Vectors</a></span></li></ul></div>

# Imports 

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 

import gensim

# Bag of Words

- **The words were selected by fitting a logistic regression to an open-source NLP dataset of messages sent during disasters, together with their categorizations. One category marks whether a message is a direct call for help, which we treat as a more "urgent" message.**
    - https://appen.com/datasets/combined-disaster-response-data/
    
    
- **Another way to decide whether a message is more urgent is to use FEMA's Public Assistance Program and Policy Guide.**
    - In this guide, public assistance work is either emergency work or permanent work. Important terms from each designation were added to our bag of words.
    - **Emergency**: 
        - Emergency Protective Measures
        - Debris Removal 
    - **Permanent**:
        - Roads and Bridges
        - Water Control Facilities
        - Buildings and Equipment
        - Utilities 
        - Parks, Recreational, other 
        
    - https://www.fema.gov/media-library-data/1525468328389-4a038bbef9081cd7dfe7538e7751aa9c/PAPPG_3.1_508_FINAL_5-4-2018.pdf
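The first bullet above says the urgent words came from a logistic regression on labeled disaster messages. The notebook does not show that step, so the following is only a minimal sketch of the idea, assuming bag-of-words counts as features and plain gradient descent as the fitting method. The messages, labels, learning rate, and iteration count are all made up for illustration; the words with the largest positive weights are the candidates for the urgent list.

```python
import numpy as np

# Made-up toy messages (not from the Appen dataset); 1 = direct call for help, 0 = not.
messages = [
    ("we need help", 1),
    ("please help need water", 1),
    ("help trapped rescue", 1),
    ("news report today", 0),
    ("government news update", 0),
    ("price report today", 0),
]

# Build a bag-of-words count matrix: one row per message, one column per word.
vocab = sorted({w for text, _ in messages for w in text.split()})
index = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(messages), len(vocab)))
y = np.array([label for _, label in messages], dtype=float)
for row, (text, _) in enumerate(messages):
    for w in text.split():
        X[row, index[w]] += 1

# Fit an unregularized logistic regression with plain gradient descent.
weights = np.zeros(len(vocab))
bias = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ weights + bias)))  # predicted P(urgent)
    weights -= 0.5 * X.T @ (p - y) / len(y)
    bias -= 0.5 * np.mean(p - y)

# Words with the largest positive weights push a message toward "urgent".
top_words = [vocab[i] for i in np.argsort(-weights)[:2]]
print(top_words)  # ['help', 'need']
```

On real data a library implementation with regularization would likely be a better choice; the point here is only how coefficients translate into a word list.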
   

In [1]:
#list of urgent words
urgent = ['help',
          'fire',
          'need',
          'hungry',
          'aid',
          'removal',
          'tent',
          'dying',
          'evacuation',
          'starving',
          'smoke',
          'shelter',
          'debris',
          'unsafe',
          'access',
          'rescue',
          'search',
          'lost',
          'victim',
          'sos'
         ]

In [None]:
#list of non-urgent words 
non_urgent = ['job',
              'news',
              'government',
              'country',
              'reported',
              'materials',
              'work',
              'price',
              'utilities',
              'facility',
              'parks',
              'playground',
              'bridge',
              'sidewalk',
              'guardrails',
              'erosion',
              'irrigation',
              'vegetation',
              'traffic',
              'restoration',
              'inspection',
              'assessment'
             ]

# Train Words2Vec

In [4]:
# Load the pretrained GoogleNews Word2Vec vectors (nothing is trained in this step)
model = gensim.models.KeyedVectors.load_word2vec_format('../data/GoogleNews-vectors-negative300.bin', binary=True)

In [6]:
#Checking vector size of GoogleNews Word2Vec
model.vector_size

300

# Creating the Urgent vs. non-Urgent Vectors

**Helpful Medium post:**
https://medium.com/@belen.sanchez27/leveraging-social-media-to-map-disasters-74b4cc34848d

In [None]:
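# Average the Word2Vec vectors of the urgent words that the model knows
urgent_vector = np.zeros((1, 300))
count = 0
for word in urgent:
    # Skip words missing from the model vocabulary
    # (`word in model` and `model[word]` work in gensim 3.x and 4.x)
    if word not in model:
        continue
    urgent_vector = urgent_vector + model[word]
    count += 1

urgent_vector = urgent_vector / count
urgent_vector = np.squeeze(urgent_vector)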

In [None]:
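# Average the Word2Vec vectors of the non-urgent words that the model knows
non_urgent_vector = np.zeros((1, 300))
count = 0
for word in non_urgent:
    # Skip words missing from the model vocabulary
    # (`word in model` and `model[word]` work in gensim 3.x and 4.x)
    if word not in model:
        continue
    non_urgent_vector = non_urgent_vector + model[word]
    count += 1

non_urgent_vector = non_urgent_vector / count
non_urgent_vector = np.squeeze(non_urgent_vector)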
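The two cells above run the same averaging loop on different word lists, so a small helper could cover both. The sketch below is one possible shape for it, not the notebook's own code: `mean_vector` and `toy_vectors` are hypothetical names, and a plain dictionary of 3-dimensional vectors stands in for the 300-dimensional Word2Vec model so the example runs without the GoogleNews file.

```python
import numpy as np

# Hypothetical 3-dimensional stand-in for the 300-dimensional Word2Vec model.
toy_vectors = {
    "help": np.array([1.0, 0.0, 0.0]),
    "fire": np.array([0.0, 1.0, 0.0]),
    "news": np.array([0.0, 0.0, 1.0]),
}

def mean_vector(words, vectors, size):
    """Average the vectors of the words found in `vectors`; return zeros if none are found."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        return np.zeros(size)
    return np.mean(found, axis=0)

# "sos" is not in the toy vocabulary, so only "help" and "fire" are averaged.
toy_urgent_vector = mean_vector(["help", "fire", "sos"], toy_vectors, 3)
print(toy_urgent_vector)
```

Because the helper only needs `in` and `[]` lookups, passing the loaded `model` in place of `toy_vectors` (with `size=300`) should work the same way, though that is untested here.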

In [None]:
#Where our classifications will go
urgent_message = []
#for each tweet in our clean tweet tokens
for tweet in clean_tweets:
    #set up a counter and a message vector of size 300, once per tweet
    count = 0
    message_vector = np.zeros((1, 300))
    #for each token in tweet
    for token in tweet:
        #if token is not in the Word2Vec model, do not include it
        if token not in model:
            continue
        message_vector = message_vector + model[token]
        count += 1
    #a tweet with no known tokens has no direction, so label it not urgent
    if count == 0:
        urgent_message.append(0)
        continue
    message_vector = np.squeeze(message_vector) / count

    #Compute the cosine similarity between the message vector and each reference vector
    #(dot product divided by the product of the norms), then label the message urgent
    #when it is at least as similar to the urgent vector as to the non-urgent vector
    message_norm = np.linalg.norm(message_vector)
    urgent_score = np.dot(message_vector, urgent_vector) / (np.linalg.norm(urgent_vector) * message_norm)
    non_urgent_score = np.dot(message_vector, non_urgent_vector) / (np.linalg.norm(non_urgent_vector) * message_norm)
    if urgent_score >= non_urgent_score:
        urgent_message.append(1)
    else:
        urgent_message.append(0)
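The decision rule in the cell above compares two cosine similarities, where the cosine similarity of vectors `a` and `b` is `dot(a, b) / (norm(a) * norm(b))`. As a small illustration of that comparison, here is a sketch with made-up 3-dimensional vectors; `cosine_similarity`, `urgent_toy`, `non_urgent_toy`, and `message` are hypothetical names, and returning 0.0 for a zero vector is an assumption chosen to avoid dividing by zero.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        return 0.0
    return float(np.dot(a, b) / norm)

# Made-up reference vectors and message vector, for illustration only.
urgent_toy = np.array([1.0, 0.2, 0.0])
non_urgent_toy = np.array([0.0, 0.3, 1.0])
message = np.array([0.9, 0.1, 0.1])

# Label the message 1 (urgent) when it points at least as much toward the urgent vector.
label = int(cosine_similarity(message, urgent_toy) >= cosine_similarity(message, non_urgent_toy))
print(label)  # 1
```

Cosine similarity ignores vector length, so a long tweet and a short tweet with the same mix of words get the same score.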

In [None]:
# add classification to df 
df['urgent_message'] = urgent_message