<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Bag-of-Words" data-toc-modified-id="Bag-of-Words-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Bag of Words</a></span></li><li><span><a href="#Train-Words2Vec" data-toc-modified-id="Train-Words2Vec-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Train Words2Vec</a></span></li><li><span><a href="#Importing-the-tweets" data-toc-modified-id="Importing-the-tweets-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Importing the tweets</a></span></li><li><span><a href="#Creating-the-Urgent-vs.-non-Urgent-Vectors" data-toc-modified-id="Creating-the-Urgent-vs.-non-Urgent-Vectors-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Creating the Urgent vs. non-Urgent Vectors</a></span></li></ul></div>

# Imports 

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
import regex as re
import unidecode
from nltk.corpus import stopwords

import gensim

# Bag of Words

- **words were created using a logistic regression on an open source database used for NLP that contains messages during disasters and their categorizations, one of those being if the message was a direct call for help, what we will classify as a message being more "urgent"**
    - https://appen.com/datasets/combined-disaster-response-data/
    
    
- **another way to classify is a message is more urgent is using FEMA's Public Assistance Program and Policy Guide**
    - in this guide, public assistance work is either emergency work or permanent work. Important terms from each designation were added to our bag of words 
    - **Emergency**: 
        - Emergency Protective Measures
        - Debris Removal 
    - **Permanent**:
        - Roads and Bridges
        - Water Control Facilities
        - Buildings and Equipment
        - Utilities 
        - Parks, Recreational, other 
        
    - https://www.fema.gov/media-library-data/1525468328389-4a038bbef9081cd7dfe7538e7751aa9c/PAPPG_3.1_508_FINAL_5-4-2018.pdf
   

In [18]:
#list of urgent words
urgent = ['help',
          'campfire',
          'tonight',
          'today',
          'fire',
          'need',
          'aid',
          'removal',
          'burn',
          'tree',
          'evacuation',
          'smoke',
          'unsafe',
          'search',
          'lost',
          'victim',
          'medical',
          'med',
          'urgent',
          'home',
          'haze',
          'serious',
          'fires',
          'reporting',
          'smoking',
          'come',
          'emergency',
          'wildfire',
          'Portland',
          'send',
          'park',
          'valley',
          'investigation',
          'structure'
         ]

In [19]:
#list of non-urgent words 
non_urgent = ['job',
              'news',
              'government',
              'country',
              'materials',
              'work',
              'price',
              'utilities',
              'facility',
              'erosion',
              'irrigation',
              'restoration',
              'shoulder',
              'mud',
              'silt',
              'ditch',
              'slip',
              'pray',
              'hope',
              'thanks',
              'thankful'
             ]

# Train Words2Vec

In [6]:
#Model is commented out because the google news dataset is too large to fit on git
#model = gensim.models.KeyedVectors.load_word2vec_format('..//datasets/GoogleNews-vectors-negative300.bin', binary=True)

In [7]:
#Checking vector size of GoogleNews Word2Vec
model.vector_size

300

# Importing the tweets 

In [9]:
df = pd.read_csv('..//datasets/tweets_lemmatized.csv')

In [10]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,username,text
0,0,0,mult co fire ems log,"['med', 'medical', 'at', 'block', 'of', 'ne', ..."
1,1,1,mult co fire ems log,"['med', 'medical', 'at', 'se', 'nd', 'ave', 's..."
2,2,2,michellebot,"['valley', 'of', 'fire', 'valley', 'of', 'fire..."
3,3,3,mult co fire ems log,"['med', 'medical', 'at', 'block', 'of', 'ne', ..."
4,4,4,houdini,"['special', 'shout', 'out', 'to', 'm', 'llyk',..."


In [11]:
stops = stopwords.words('english')

In [12]:
#Cleaning messages to ensure that they are adequately prepped for analysis 
def message_to_words(raw_message):
    
     # remove accents
    unaccented = unidecode.unidecode(raw_message)
    
    # remove all non-letter characters
    letters_only = re.sub("[^a-zA-Z]", " ", unaccented)
    
    # lowercase 
    words = letters_only.lower().split()
    
    # lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens_lem = [lemmatizer.lemmatize(i) for i in words]
    
    # stop words
    meaningful_words = [w for w in words if not w in stops]
    
    # return as a string 
    return(" ".join(meaningful_words))

In [13]:
total_message = df.shape[0]
clean_message = []

print("Cleaning and parsing the message...")

j = 0
for message in df['text']:
    clean_message.append(message_to_words(message))
    
    # If the index is divisible by 100, print a message
    if (j+1) % 100 == 0:
        print(f'Comment {j+1} of {total_message}.')
    
    j += 1
    
    if j == total_message:
        print('Done.')

Cleaning and parsing the message...
Comment 100 of 11784.
Comment 200 of 11784.
Comment 300 of 11784.
Comment 400 of 11784.
Comment 500 of 11784.
Comment 600 of 11784.
Comment 700 of 11784.
Comment 800 of 11784.
Comment 900 of 11784.
Comment 1000 of 11784.
Comment 1100 of 11784.
Comment 1200 of 11784.
Comment 1300 of 11784.
Comment 1400 of 11784.
Comment 1500 of 11784.
Comment 1600 of 11784.
Comment 1700 of 11784.
Comment 1800 of 11784.
Comment 1900 of 11784.
Comment 2000 of 11784.
Comment 2100 of 11784.
Comment 2200 of 11784.
Comment 2300 of 11784.
Comment 2400 of 11784.
Comment 2500 of 11784.
Comment 2600 of 11784.
Comment 2700 of 11784.
Comment 2800 of 11784.
Comment 2900 of 11784.
Comment 3000 of 11784.
Comment 3100 of 11784.
Comment 3200 of 11784.
Comment 3300 of 11784.
Comment 3400 of 11784.
Comment 3500 of 11784.
Comment 3600 of 11784.
Comment 3700 of 11784.
Comment 3800 of 11784.
Comment 3900 of 11784.
Comment 4000 of 11784.
Comment 4100 of 11784.
Comment 4200 of 11784.
Comment

In [14]:
df = df.assign(cleaned_message = clean_message)
df.head(3)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,username,text,cleaned_message
0,0,0,mult co fire ems log,"['med', 'medical', 'at', 'block', 'of', 'ne', ...",med medical block ne th ave portland portland ...
1,1,1,mult co fire ems log,"['med', 'medical', 'at', 'se', 'nd', 'ave', 's...",med medical se nd ave se johnson creek blvd po...
2,2,2,michellebot,"['valley', 'of', 'fire', 'valley', 'of', 'fire...",valley fire valley fire state park


In [15]:
df.shape

(11784, 5)

In [16]:
df = df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'username', 'text'])

In [17]:
df.shape

(11784, 1)

# Creating the Urgent vs. non-Urgent Vectors

**Helpful medium post**
https://medium.com/@belen.sanchez27/leveraging-social-media-to-map-disasters-74b4cc34848d

In [20]:
#Credit to NYC III group (Andrew Sternick, Nick Read, Preeya Sawadmanod), their code provided a better 
#visualization of the cosine similarity scores for each message vector so when I started to have issues 
#I found their code and applied it while I am tweeting my bags of words for a clearer understanding of the 
#cosine similarity scores 


#Function for vectorization of corpus 
def twitter_vector(vector):
    
    #Counter for number of words in corpus that exists in GoogleNews word list 
    count=0
   
    #Creating a template for cumulative corpus vector sum (300 dimensions)
    tweet_vector = np.zeros((1,300))
    
    #Iterating over each word in bag of word list 
    for word in vector:

        #Checking if word exists in GoogleNews word list
        if word in model.vocab:                    
            
            #Vectorizing the word if in list
            word_vector = model.word_vector(word)        
            
            #Updating counter
            count += 1
            
            #Updating cumulative vector sum 
            tweet_vector = tweet_vector + word_vector 

    #Computing average vector by taking cumulative vector sum and dividing it by number of words traced
    tweet_vector_score = tweet_vector/count
    
    #Using numpy to squeeze N-dimensional nested array object into a 1-D array 
    tweet_vector_score = np.squeeze(tweet_vector_score)
    
    return(tweet_vector_score)

In [21]:
# Applying function to vectorize both bag of words list
urgent_vec = twitter_vector(urgent)
nonurgent_vec = twitter_vector(non_urgent)

In [22]:
urgent_vec.shape

(300,)

In [23]:
nonurgent_vec.shape

(300,)

In [24]:

#Function for computing cosine similarity score 
def cos_sim_score(a,b): 
    
    #Calculating the dot product of two vectors
    dot = np.dot(a, b)
    
    #Calculating the magnitude of the vector 
    norma = np.linalg.norm(a)
    normb = np.linalg.norm(b)
    
    #Calculating cosine similarity
    cos = dot / (norma * normb)
    
    return cos

In [25]:
from sklearn.metrics.pairwise import cosine_similarity

In [28]:
#original copy - DO NOT CHANGE
#Empty lists
urgent_cos_sim = []
non_urgent_cos_sim = []

#Iterating through each tweet in out list of tokens
for tweet in df['cleaned_message']:
    
    #Call function 
    avg_vec_per_tweet = tweet_vector(tweet)
    
    #Creating new column in df for cosine simarlity score 
    urgent_cos_sim.append(cos_sim_score(avg_vec_per_tweet, urgent_vec))
    non_urgent_cos_sim.append(cos_sim_score(avg_vec_per_tweet, nonurgent_vec))

In [29]:
df['urgent_cos_sim'] = urgent_cos_sim
df['non_urgent_cos_sim'] = non_urgent_cos_sim
df.head()

Unnamed: 0,cleaned_message,urgent_cos_sim,non_urgent_cos_sim
0,med medical block ne th ave portland portland ...,0.148701,0.137452
1,med medical se nd ave se johnson creek blvd po...,0.149998,0.135317
2,valley fire valley fire state park,0.144539,0.141861
3,med medical block ne rodney ave portland portl...,0.145074,0.13124
4,special shout llyk comin fire video downtown l...,0.144783,0.133166


In [30]:
#Creating Classification 
urgent = []

for rows in df.index:  
    if df['urgent_cos_sim'][rows] > df['non_urgent_cos_sim'][rows]:
        urgent.append(1)
    else: 
        urgent.append(0)

In [31]:
#Creating a new column
df['urgent'] = urgent

In [32]:
df['urgent'].value_counts(normalize=True)

1    0.64927
0    0.35073
Name: urgent, dtype: float64

In [33]:
#view urgent tweets
#df[df['urgent'] == 0].head(5)

Unnamed: 0,cleaned_message,urgent_cos_sim,non_urgent_cos_sim,urgent
5,smokea smoke investigation outside structure n...,0.158059,0.164721,0
9,med medical se nd ave se morrison st morrison ...,0.141777,0.144964,0
10,smokea smoke investigation outside structure s...,0.149254,0.155813,0
11,buy two get one free decorate tree ghoulish de...,0.14231,0.147313,0
12,bay area thankful rain clean air fire brutal n...,0.142018,0.143126,0
25,smokea smoke investigation outside structure n...,0.158059,0.164721,0
29,med medical se nd ave se morrison st morrison ...,0.141777,0.144964,0
30,smokea smoke investigation outside structure s...,0.149254,0.155813,0
31,buy two get one free decorate tree ghoulish de...,0.14231,0.147313,0
32,bay area thankful rain clean air fire brutal n...,0.142018,0.143126,0


In [37]:
#read in the original dataframe to merge back with the raw tweets
df_original = pd.read_csv("../datasets/raw_df.csv")

In [38]:
#df_original.head()

In [41]:
df_original = df_original[['text','username']]

In [42]:
df_original.head()

Unnamed: 0,text,username
0,"MED - MEDICAL at 700 BLOCK OF NE 78TH AVE, POR...",Mult Co Fire/EMS log
1,MED - MEDICAL at SE 32ND AVE / SE JOHNSON CREE...,Mult Co Fire/EMS log
2,Valley of Fire @ Valley of Fire State Park htt...,MichelleBot
3,"MED - MEDICAL at 2000 BLOCK OF NE RODNEY AVE, ...",Mult Co Fire/EMS log
4,Special shout out to @m0llyk4y for comin throu...,Houdini


In [43]:
df = pd.merge(df, df_original, left_index=True, right_index=True)

In [44]:
df.sample(n = 5)

Unnamed: 0,cleaned_message,urgent_cos_sim,non_urgent_cos_sim,urgent,text,username
1353,black friday release sublime greatest hit firs...,0.138924,0.137414,1,with the Black Friday releases of Sublime-Gre...,Programme
5749,med medical block ne sandy blvd portland portl...,0.145306,0.128034,1,"MED - MEDICAL at 8300 BLOCK OF NE SANDY BLVD, ...",Mult Co Fire/EMS log
4008,almcom monitored commercial fire alarm block s...,0.142837,0.145155,0,ALMCOM - MONITORED COMMERCIAL FIRE ALARM at 14...,Mult Co Fire/EMS log
4594,much latergram actually posting copy angievelv...,0.144122,0.143888,1,"Much, #latergram I’m actually posting to copy ...",Felicity Kay
10888,smoke campfire thick see bay barely see emeryv...,0.138019,0.139797,0,The smoke from the #CampFire is so thick you c...,"Safer at home, Yoda is"


In [85]:
df_urgent = df.loc[df['urgent'] == 1]

In [86]:
df_nonurgent = df.loc[df['urgent'] == 0]

In [88]:
#export df to get lat long points 
df_urgent.to_csv('..//datasets/df_urgent.csv')

In [89]:
#export df to get lat long points 
df_nonurgent.to_csv('..//datasets/df_nonurgent.csv')