<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Bag-of-Words" data-toc-modified-id="Bag-of-Words-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Bag of Words</a></span></li><li><span><a href="#Train-Words2Vec" data-toc-modified-id="Train-Words2Vec-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Train Words2Vec</a></span></li><li><span><a href="#Importing-the-tweets" data-toc-modified-id="Importing-the-tweets-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Importing the tweets</a></span></li><li><span><a href="#Creating-the-Urgent-vs.-non-Urgent-Vectors" data-toc-modified-id="Creating-the-Urgent-vs.-non-Urgent-Vectors-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Creating the Urgent vs. non-Urgent Vectors</a></span></li></ul></div>

# Imports 

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
import regex as re
import unidecode
from nltk.corpus import stopwords

import gensim

# Bag of Words

- **words were created using a logistic regression on an open source database used for NLP that contains messages during disasters and their categorizations, one of those being if the message was a direct call for help, what we will classify as a message being more "urgent"**
    - https://appen.com/datasets/combined-disaster-response-data/
    
    
- **another way to classify is a message is more urgent is using FEMA's Public Assistance Program and Policy Guide**
    - in this guide, public assistance work is either emergency work or permanent work. Important terms from each designation were added to our bag of words 
    - **Emergency**: 
        - Emergency Protective Measures
        - Debris Removal 
    - **Permanent**:
        - Roads and Bridges
        - Water Control Facilities
        - Buildings and Equipment
        - Utilities 
        - Parks, Recreational, other 
        
    - https://www.fema.gov/media-library-data/1525468328389-4a038bbef9081cd7dfe7538e7751aa9c/PAPPG_3.1_508_FINAL_5-4-2018.pdf
   

In [18]:
#list of urgent words
urgent = ['help',
          'campfire',
          'tonight',
          'today',
          'fire',
          'need',
          #'hungry',
          'aid',
          'removal',
          #'forest',
          #'wood',
          #'inhalation',
          #'dark',
          #'smokey',
          #'breathe',
          'burn',
          'tree',
          #'yard',
          #'garage',
          #creek',
          #'tent',
          #'dying',
          'evacuation',
          #'starving',
          'smoke',
          #'shelter',
          #'debris',
          'unsafe',
          #'access',
          #'rescue',
          'search',
          'lost',
          'victim',
          #'sos', 
          'medical',
          'med',
          'urgent',
          #'quick',
          'home',
          #'waste',
          #'junk',
          #'ditch',
          #'seen',
          'haze',
          #'now',
          #'in',
          'serious',
          #'deadly',
          #'report',
          'fires',
          'reporting',
          'smoking',
          'come',
          #'quick',
          #'observe',
          #'surge',
          #'blaze',
          #'ems',
          'emergency',
          'wildfire',
          'Portland',
          'send',
          #'locate',
          'park',
          'valley',
          'investigation',
          #'ave',
          #'st',
          'structure'
         ]

In [19]:
#list of non-urgent words 
non_urgent = ['job',
              'news',
              'government',
              'country',
              #'reported',
              'materials',
              'work',
              'price',
              'utilities',
              'facility',
              #'st',
              #'ave',
              #'parks',
              #'playground',
              #'bridge',
              #'sidewalk',
              #'guardrails',
              'erosion',
              'irrigation',
              #'baseball',
              #'tennis',
              #'hiking',
              #'wildlife',
              #'vegetation',
              #'traffic',
              'restoration',
              'shoulder',
              #'stabilization',
              #'inspection',
              #'assessment',
              #'remediation',
              #'insurance',
              #'cops'
              #'help',
              'mud',
              'silt',
              'ditch',
              'slip',
              #'downtown',
              #'uptown',
              'pray',
              'hope',
              'thanks',
              'thankful'
             ]

# Train Words2Vec

In [6]:
#model = gensim.models.KeyedVectors.load_word2vec_format('..//datasets/GoogleNews-vectors-negative300.bin', binary=True)

In [7]:
#Checking vector size of GoogleNews Word2Vec
model.vector_size

300

In [8]:
#Example for word rescue
fire_vec = model['fire']
#Checking for 20 components
fire_vec[:20]

array([ 0.35546875,  0.18359375,  0.14941406, -0.09375   ,  0.17871094,
       -0.08398438, -0.06396484, -0.2578125 , -0.1171875 ,  0.13671875,
        0.22753906, -0.25585938, -0.18554688, -0.20800781, -0.17089844,
        0.02563477, -0.12011719, -0.10839844, -0.06347656,  0.01519775],
      dtype=float32)

# Importing the tweets 

In [9]:
df = pd.read_csv('..//datasets/tweets_lemmatized.csv')

In [10]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,username,text
0,0,0,mult co fire ems log,"['med', 'medical', 'at', 'block', 'of', 'ne', ..."
1,1,1,mult co fire ems log,"['med', 'medical', 'at', 'se', 'nd', 'ave', 's..."
2,2,2,michellebot,"['valley', 'of', 'fire', 'valley', 'of', 'fire..."
3,3,3,mult co fire ems log,"['med', 'medical', 'at', 'block', 'of', 'ne', ..."
4,4,4,houdini,"['special', 'shout', 'out', 'to', 'm', 'llyk',..."


In [11]:
stops = stopwords.words('english')

In [12]:
def message_to_words(raw_message):
    
     # remove accents
    unaccented = unidecode.unidecode(raw_message)
    
    # remove all non-letter characters
    letters_only = re.sub("[^a-zA-Z]", " ", unaccented)
    
    # lowercase 
    words = letters_only.lower().split()
    
    # lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens_lem = [lemmatizer.lemmatize(i) for i in words]
    
    # stop words
    meaningful_words = [w for w in words if not w in stops]
    
    # return as a string 
    return(" ".join(meaningful_words))

In [13]:
total_message = df.shape[0]
clean_message = []

print("Cleaning and parsing the message...")

j = 0
for message in df['text']:
    clean_message.append(message_to_words(message))
    
    # If the index is divisible by 100, print a message
    if (j+1) % 100 == 0:
        print(f'Comment {j+1} of {total_message}.')
    
    j += 1
    
    if j == total_message:
        print('Done.')

Cleaning and parsing the message...
Comment 100 of 11784.
Comment 200 of 11784.
Comment 300 of 11784.
Comment 400 of 11784.
Comment 500 of 11784.
Comment 600 of 11784.
Comment 700 of 11784.
Comment 800 of 11784.
Comment 900 of 11784.
Comment 1000 of 11784.
Comment 1100 of 11784.
Comment 1200 of 11784.
Comment 1300 of 11784.
Comment 1400 of 11784.
Comment 1500 of 11784.
Comment 1600 of 11784.
Comment 1700 of 11784.
Comment 1800 of 11784.
Comment 1900 of 11784.
Comment 2000 of 11784.
Comment 2100 of 11784.
Comment 2200 of 11784.
Comment 2300 of 11784.
Comment 2400 of 11784.
Comment 2500 of 11784.
Comment 2600 of 11784.
Comment 2700 of 11784.
Comment 2800 of 11784.
Comment 2900 of 11784.
Comment 3000 of 11784.
Comment 3100 of 11784.
Comment 3200 of 11784.
Comment 3300 of 11784.
Comment 3400 of 11784.
Comment 3500 of 11784.
Comment 3600 of 11784.
Comment 3700 of 11784.
Comment 3800 of 11784.
Comment 3900 of 11784.
Comment 4000 of 11784.
Comment 4100 of 11784.
Comment 4200 of 11784.
Comment

In [14]:
df = df.assign(cleaned_message = clean_message)
df.head(3)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,username,text,cleaned_message
0,0,0,mult co fire ems log,"['med', 'medical', 'at', 'block', 'of', 'ne', ...",med medical block ne th ave portland portland ...
1,1,1,mult co fire ems log,"['med', 'medical', 'at', 'se', 'nd', 'ave', 's...",med medical se nd ave se johnson creek blvd po...
2,2,2,michellebot,"['valley', 'of', 'fire', 'valley', 'of', 'fire...",valley fire valley fire state park


In [15]:
df.shape

(11784, 5)

In [16]:
df = df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'username', 'text'])

In [17]:
df.shape

(11784, 1)

In [597]:
def single_list(dataframe):
    
    single_list = []
    for row in dataframe['cleaned_message']:
        for word in row:
            single_list.append(word)
            
    return single_list

In [598]:
text_words = single_list(df)

In [None]:
# What are some of the most common words contained in these tweets?
# Method found on The Programming Historian: https://programminghistorian.org/en/lessons/counting-frequencies

# Create an empty list
word_frequency = []

# Loop through the list of words and count each one up
[word_frequency.append(text_words.count(word)) for word in text_words]

#print("Pairs\n" + str(list(zip(text_words, word_frequency))))

In [None]:
# Create blank dataframe
# This dataframe is ONLY being used to count up the words that are found the most frequently in these tweets
word_counts = pd.DataFrame()

# Add desired columns from previously defined variables
word_counts['words'] = text_words
word_counts['word_frequency'] = word_frequency

In [None]:
word_counts['words'].drop_duplicates(inplace=True)

In [None]:
# Sort dataframe by word frequency
word_counts.sort_values('word_frequency', ascending=False, inplace=True)

# Drop duplicate words
word_counts.drop_duplicates(subset = 'words', keep = 'first', inplace=True)

# Look at the 20 most commonly used words
word_counts.head(50)

# Creating the Urgent vs. non-Urgent Vectors

**Helpful medium post**
https://medium.com/@belen.sanchez27/leveraging-social-media-to-map-disasters-74b4cc34848d

In [20]:
#Credit to NYC group - their code provided a better visualization of the cosine similarity scores for each message
#vector so when I started to have issues I found their code and applied it while I am tweeting my bags of words
#for a clearer understanding of the cosine similarity scores 


#Function for vectorization of corpus 
def tweet_vector(bag_of_words_list):
    
    #Counter for number of words in corpus that exists in GoogleNews word list 
    counter=0
   
    #Creating a template for cumulative corpus vector sum (300 dimensions)
    tweet_vector_sum = np.zeros((1,300))
    
    #Iterating over each word in bag of word list 
    for word in bag_of_words_list:

        #Checking if word exists in GoogleNews word list
        if word in model.vocab:                    
            
            #Vectorizing the word if in list
            word_vec = model.word_vec(word)        
            
            #Updating counter
            counter += 1
            
            #Updating cumulative vector sum 
            tweet_vector_sum = tweet_vector_sum + word_vec 

    #Computing average vector by taking cumulative vector sum and dividing it by number of words traced
    tweet_vector_avg = tweet_vector_sum/counter
    
    #Using numpy to squeeze N-dimensional nested array object into a 1-D array 
    tweet_vector_avg = np.squeeze(tweet_vector_avg)
    
    return(tweet_vector_avg)

In [21]:
# Applying function to vectorize both bag of words list
urgent_vec = tweet_vector(urgent)
nonurgent_vec = tweet_vector(non_urgent)

In [22]:
urgent_vec.shape

(300,)

In [23]:
nonurgent_vec.shape

(300,)

In [24]:

#Function for computing cosine similarity score 
def cos_sim_score(a,b): 
    
    #Calculating the dot product of two vectors
    dot = np.dot(a, b)
    
    #Calculating the magnitude of the vector 
    norma = np.linalg.norm(a)
    normb = np.linalg.norm(b)
    
    #Calculating cosine similarity
    cos = dot / (norma * normb)
    
    return cos

In [25]:
from sklearn.metrics.pairwise import cosine_similarity

In [26]:
#urgent_vec.reshape(1, -1)
#nonurgent_vec.reshape(1, -1)

In [27]:
# #working copy 
# #Empty lists
# urgent_cos_sim = []
# non_urgent_cos_sim = []

# #Iterating through each tweet in out list of tokens
# for tweet in df['cleaned_message']:
    
#     #Call function 
#     avg_vec_per_tweet = tweet_vector(tweet)
    
#     urgent_vec = urgent_vec.reshape(1, -1)
#     nonurgent_vec = nonurgent_vec.reshape(1, -1)
#     avg_vec_per_tweet = avg_vec_per_tweet.reshape(1, -1)
    
#     #Creating new column in df for cosine simarlity score 
#     urgent_cos_sim.append(cosine_similarity(avg_vec_per_tweet, urgent_vec))
#     urgent_cos_sim.append(cosine_similarity(avg_vec_per_tweet, nonurgent_vec))
#     #urgent_cos_sim.append(cos_sim_score(avg_vec_per_tweet, urgent_vec))
#     #non_urgent_cos_sim.append(cos_sim_score(avg_vec_per_tweet, nonurgent_vec))

In [28]:
#original copy - DO NOT CHANGE
#Empty lists
urgent_cos_sim = []
non_urgent_cos_sim = []

#Iterating through each tweet in out list of tokens
for tweet in df['cleaned_message']:
    
    #Call function 
    avg_vec_per_tweet = tweet_vector(tweet)
    
    #Creating new column in df for cosine simarlity score 
    urgent_cos_sim.append(cos_sim_score(avg_vec_per_tweet, urgent_vec))
    non_urgent_cos_sim.append(cos_sim_score(avg_vec_per_tweet, nonurgent_vec))

In [29]:
df['urgent_cos_sim'] = urgent_cos_sim
df['non_urgent_cos_sim'] = non_urgent_cos_sim
df.head()

Unnamed: 0,cleaned_message,urgent_cos_sim,non_urgent_cos_sim
0,med medical block ne th ave portland portland ...,0.148701,0.137452
1,med medical se nd ave se johnson creek blvd po...,0.149998,0.135317
2,valley fire valley fire state park,0.144539,0.141861
3,med medical block ne rodney ave portland portl...,0.145074,0.13124
4,special shout llyk comin fire video downtown l...,0.144783,0.133166


In [30]:
#Creating Classification 
urgent = []

for rows in df.index:  
    if df['urgent_cos_sim'][rows] > df['non_urgent_cos_sim'][rows]:
        urgent.append(1)
    else: 
        urgent.append(0)

In [31]:
#Creating a new column
df['urgent'] = urgent

In [32]:
df['urgent'].value_counts(normalize=True)

1    0.64927
0    0.35073
Name: urgent, dtype: float64

In [33]:
df[df['urgent'] == 0].head(20)

Unnamed: 0,cleaned_message,urgent_cos_sim,non_urgent_cos_sim,urgent
5,smokea smoke investigation outside structure n...,0.158059,0.164721,0
9,med medical se nd ave se morrison st morrison ...,0.141777,0.144964,0
10,smokea smoke investigation outside structure s...,0.149254,0.155813,0
11,buy two get one free decorate tree ghoulish de...,0.14231,0.147313,0
12,bay area thankful rain clean air fire brutal n...,0.142018,0.143126,0
25,smokea smoke investigation outside structure n...,0.158059,0.164721,0
29,med medical se nd ave se morrison st morrison ...,0.141777,0.144964,0
30,smokea smoke investigation outside structure s...,0.149254,0.155813,0
31,buy two get one free decorate tree ghoulish de...,0.14231,0.147313,0
32,bay area thankful rain clean air fire brutal n...,0.142018,0.143126,0


In [34]:
#urgent
df['cleaned_message'][35]

'ta p traffic accident pin block ne th ave portland portland fire rp pdx'

In [35]:
#urgent
df['cleaned_message'][3]

'med medical block ne rodney ave portland portland fire rp pdx'

In [36]:
df['cleaned_message'][5]

'smokea smoke investigation outside structure n vancouver ave columbia slough multnomah county portland fire rp'

In [37]:
df_original = pd.read_csv("../datasets/raw_df.csv")

In [38]:
#df_original.head()

In [39]:
df_original['text'].head()

0    MED - MEDICAL at 700 BLOCK OF NE 78TH AVE, POR...
1    MED - MEDICAL at SE 32ND AVE / SE JOHNSON CREE...
2    Valley of Fire @ Valley of Fire State Park htt...
3    MED - MEDICAL at 2000 BLOCK OF NE RODNEY AVE, ...
4    Special shout out to @m0llyk4y for comin throu...
Name: text, dtype: object

In [40]:
df['cleaned_message'].head()

0    med medical block ne th ave portland portland ...
1    med medical se nd ave se johnson creek blvd po...
2                   valley fire valley fire state park
3    med medical block ne rodney ave portland portl...
4    special shout llyk comin fire video downtown l...
Name: cleaned_message, dtype: object

In [41]:
df_original = df_original[['text','username']]

In [42]:
df_original.head()

Unnamed: 0,text,username
0,"MED - MEDICAL at 700 BLOCK OF NE 78TH AVE, POR...",Mult Co Fire/EMS log
1,MED - MEDICAL at SE 32ND AVE / SE JOHNSON CREE...,Mult Co Fire/EMS log
2,Valley of Fire @ Valley of Fire State Park htt...,MichelleBot
3,"MED - MEDICAL at 2000 BLOCK OF NE RODNEY AVE, ...",Mult Co Fire/EMS log
4,Special shout out to @m0llyk4y for comin throu...,Houdini


In [43]:
df = pd.merge(df, df_original, left_index=True, right_index=True)

In [44]:
df.sample(n = 5)

Unnamed: 0,cleaned_message,urgent_cos_sim,non_urgent_cos_sim,urgent,text,username
1353,black friday release sublime greatest hit firs...,0.138924,0.137414,1,with the Black Friday releases of Sublime-Gre...,Programme
5749,med medical block ne sandy blvd portland portl...,0.145306,0.128034,1,"MED - MEDICAL at 8300 BLOCK OF NE SANDY BLVD, ...",Mult Co Fire/EMS log
4008,almcom monitored commercial fire alarm block s...,0.142837,0.145155,0,ALMCOM - MONITORED COMMERCIAL FIRE ALARM at 14...,Mult Co Fire/EMS log
4594,much latergram actually posting copy angievelv...,0.144122,0.143888,1,"Much, #latergram I‚Äôm actually posting to copy ...",Felicity Kay
10888,smoke campfire thick see bay barely see emeryv...,0.138019,0.139797,0,The smoke from the #CampFire is so thick you c...,"Safer at home, Yoda is"


In [45]:
#ugent example
df['text'][1785]

'TA1 - TRAFFIC ACCIDENT - 1ST RESPONSE (FIRE & EMS) at EB I84 FWY WO / EXIT 5 & NE 82ND AVE, PORTLAND, OR [Portland Fire #RP18000092329]'

In [46]:
#non-urgent example
df['text'][9746]

'My morning  weather report. Smoke? Wind has blown smoke from the horrible fires  in Malibu, Thousand Oaks, Agoura Hills all over Los Angeles. Pray everyone‚Ä¶ https://www.instagram.com/p/BqDMM2eBV5EhDie9Z0s5ycTcXZPq9bOGzQFZMs0/?utm_source=ig_twitter_share&igshid=kmuy4xajhwju\xa0‚Ä¶'

In [47]:
#non-urgent example
df['text'][4633]

'My birthday gift redeemed, a visit with the Giant Sequoia survivors of Fire.  #tbt #DaughteroftheSouth #forests #conservation #sentientbeings @ Sequoia & Kings Canyon National Parks -‚Ä¶ https://www.instagram.com/p/BqODiFSlZRQ/?utm_source=ig_twitter_share&igshid=1fa67qnjkn6m5\xa0‚Ä¶'

In [48]:
df['text'][5607]

'Two right lanes blocked-big rig was on fire on 805 SB at Palm Ave #SDtraffic http://bit.ly/Y31oyM\xa0'

In [49]:
df['text'][6182]

'APPLI - APPLIANCE OR EQUIPMENT FIRE at 400 BLOCK OF SE 127TH AVE, PORTLAND, OR [Portland Fire #RP18000094181] 10:35 #pdx911'

In [50]:
df['text'][2165]

'GRASS - GRASS, BARKDUST OR TREE FIRE at 5000 BLOCK OF NE 6TH AVE, PORTLAND, OR [Portland Fire #RP18000095796] 04:39 #pdx911'

In [51]:
#not urgent
df['text'][4412]

'An eerily hazy view down Market St at 2pm this Friday from the smoke and poor air quality due to the Wildfires burning 150+ miles north of the Bay Area.\n‚Ä¢\n#Smoke #Haze #CampFire #Wildfire‚Ä¶ https://www.instagram.com/p/BqRPwWvBWLf/?utm_source=ig_twitter_share&igshid=1vwr2ycd8f9ke\xa0‚Ä¶'

In [52]:
#urgent
df['text'][1084]

'Carr Fire, Camp Fire Smoke, Soot, Ash, Odor and Fire Damage Clean Up\nhttps://bit.ly/2K344fA\xa0\n#carrfire #campfire #shastacounty #servpro #reddingcaliforniapic.twitter.com/NHyssLjSnR'

In [53]:
#urgent
df['text'][8693]

'A very eerie day for Channel Islands Beach as the #smoke blankets the Sun. Praying for those effected.  by the #hillfire @visitventura #wildfire #fire visitoxnardca #ominous #emptybeach‚Ä¶ https://www.instagram.com/p/Bp-GPvuhaV_/?utm_source=ig_twitter_share&igshid=1kjuq70ahft77\xa0‚Ä¶'

In [54]:
#urgent
df['text'][7470]

'‚ÄúNov 9th, 2018, CO dominant, firestorm smoke with probable contamination‚Äù.  In preparation for future extreme weather events we need to prioritize the containment and cleanup of toxic‚Ä¶ https://www.instagram.com/p/BqGcoJbH1Yl/?utm_source=ig_twitter_share&igshid=1bydhyyy7aj4d\xa0‚Ä¶'

In [55]:
#urgent
df['text'][3611]

'Vehicle on fire on I-80 WB near Echo Cyn Rd #SLCtraffic http://bit.ly/Xid4QI\xa0'

In [56]:
df['text'][6104]

'!! sigalert !! the road is closed because of a brush fire. in #Grapevine on I-5 NB at Smokey Bear Rd, stopped traffic back to Templin Hwy'

In [61]:
df['text'][5193]

'Fire next to home. Looks like someone started on purpose! @ Sorrento, California https://www.instagram.com/p/BqLuqRXHoJN/?utm_source=ig_twitter_share&igshid=sykx448gicjo\xa0‚Ä¶'

In [53]:
#Urgent
df['text'][2586]

'Camp fire closure in #Oroville on Hwy 70 Both NB/SB between Pentz Rd and CA 89 #traffic http://bit.ly/Y31VAF\xa0'

In [61]:
#Urgent
df['text'][8195]

'Camp fire closure in #Oroville on 191 Both NB/SB between Durham-Pentz Rd and before Pearson Rd #traffic http://bit.ly/Y31VAF\xa0'

In [65]:
#NonUrgent
df['text'][5565]

'Searches Intensify With More Than 600 Reported Missing in Butte County‚Äôs Camp\xa0Fire http://bit.ly/2K8GLkp\xa0'

In [125]:
#NonUrgent
df['text'][10587]

'Really bad. So bad I called it in to the NWS (I‚Äôm a trained skywarn spotter) and they issued a Dense Smoke Advisory. @ Roseville, California https://www.instagram.com/p/BqBMDGjl2Njf9eO_zgiTRXZ7vkg61XF0PUxgeI0/?utm_source=ig_twitter_share&igshid=3vujyep2jbr1\xa0‚Ä¶'

In [140]:
#NonUrgent
df['text'][9012]

'Camp Fire smoke making it look like Mars rising in the East.\n\nReally, it looked much bigger than that. @ Sausalito, California https://www.instagram.com/p/Bp9tqe5gU4-/?utm_source=ig_twitter_share&igshid=18sy347f6epv6\xa0‚Ä¶'

In [142]:
#NonUrgent
df['text'][11517]

'Same chill in my blood as the ‚Äò94 Malibu fires which were only stopped by the Pacific Ocean itself.  We live in chaparral and choose to live amongst nature‚Äôs will but it‚Äôs still‚Ä¶ https://www.instagram.com/p/Bp_GMEhlKA2/?utm_source=ig_twitter_share&igshid=u3yj0qprdr5l\xa0‚Ä¶'

In [146]:
#NonUrgent
df['text'][5712]

'You got snow and we got smoke.'

In [None]:
df_example = df.iloc[[5193],:]

In [62]:
df_exampleb = df.iloc[[9012],:]

In [64]:
df_example = df_exampleb.append(df_example)

In [65]:
df_example.head()

Unnamed: 0,cleaned_message,urgent_cos_sim,non_urgent_cos_sim,urgent,text,username
9012,camp fire smoke making look like mar rising ea...,0.141834,0.147395,0,Camp Fire smoke making it look like Mars risin...,Frank Leahy
5193,fire next home look like someone started purpo...,0.145962,0.153339,0,Fire next to home. Looks like someone started ...,1John üá∫üá∏üá®üáÆ


In [67]:
df_example = df_example.rename(columns={"text": "tweet"})

In [68]:
df_example = df_example[['tweet']]

In [70]:
df_example.to_csv('./sample_points.csv')

In [85]:
df_urgent = df.loc[df['urgent'] == 1]

In [86]:
df_nonurgent = df.loc[df['urgent'] == 0]

In [88]:
df_urgent.to_csv('..//datasets/df_urgent.csv')

In [89]:
df_nonurgent.to_csv('..//datasets/df_nonurgent.csv')

In [10]:
# urgent_vector = np.zeros((1,300))
# count = 0
# for word in urgent:
#     if word not in model.vocab:
#         continue
#     else:
#         temp = model.word_vec(word)
#         emerg_vect = urgent_vector + temp
#         count +=1

# urgent_vector = urgent_vector/count
# urgent_vector = np.squeeze(urgent_vector)

In [11]:
# non_urgent_vector = np.zeros((1,300))
# count = 0
# for word in non_urgent:
    
#     if word not in model.vocab:
#         continue
#     else:
#         temp = model.word_vec(word)
#         permanent_vect = non_urgent_vector + temp
#         count +=1

# non_urgent_vector = non_urgent_vector/count
# non_urgent_vector = np.squeeze(non_urgent_vector)

In [16]:
# #Where our classifications will go
# urgent_message = [] 
# #for a tweet in our clean tweet tokes
# for tweet in df['text']:
# #set up counter    
#     count = 0
#     #for each token in tweet
#     for token in tweet:
#         #set up a message vector that is size of 300
#         message_vector = np.zeros((1, 300))
#         # if token is not in Word2Vec model, do not include
#         if token not in model.vocab.keys(): 
#             continue
#         else:
#             message_vector = message_vector + model.word_vec(token)
#             count += 1
#     if count == 0:
#         count = 1
#     message_vector = np.squeeze(message_vector)/count
    
    
#     #Calculate the dot product for each vector (urgent or non-urgent), then through cosine similarity assign
#     #whether the message is urgent or not urgent based on if the message is more similar to urgent or non-urgent
#     #score
    
#     if np.dot(message_vector, urgent_vector)/(np.linalg.norm(urgent_vector)*np.linalg.norm(message_vector)) >= np.dot(message_vector,non_urgent_vector)/(np.linalg.norm(non_urgent_vector)*np.linalg.norm(message_vector)):
#         urgent_message.append(1)
#     else:
#         urgent_message.append(0)

  if np.dot(message_vector, urgent_vector)/(np.linalg.norm(urgent_vector)*np.linalg.norm(message_vector)) >= np.dot(message_vector,non_urgent_vector)/(np.linalg.norm(non_urgent_vector)*np.linalg.norm(message_vector)):


In [17]:
# # add classification to df 
# df['urgent_message'] = urgent_message

In [18]:
# df['urgent_message'].value_counts()

0    11784
Name: urgent_message, dtype: int64