# Data wrangling for Our Disaster Tweets

Because the training data for this project comes from a Kaggle competition, it is already gathered in its entirety and fairly well wrangled. Instead of gathering and combining data, our focus will be on processing it with machine learning techniques to accurately predict how much concern a tweet should merit. Although the training set is as complete as it will get, the models we will use should benefit from additional features that can be derived from the data:

1. Processed Tweet texts
 - with tokens with low predictive value such as stop words or punctuation removed
 - with remaining tokens lemmatized for increased token predictive value across the corpus
 - with common bigrams joined into single tokens for increased predictive value
2. Primary likely topic of the tweet 
3. Representation of tweets in TF-IDF vector form

The first two of these feature sets will be appended to our original data set to produce an amended CSV. The third will be stored in a separate file (a sparse matrix rather than a CSV) to keep the first file readable.

## Check for Consistency

Before performing our feature engineering steps, we should run a sanity check on the incoming data to make sure we have the inputs we expect. First, every tweet should have an integer 0 or 1 in its 'target' column indicating whether the tweet describes a disaster; second, every tweet should have a string in its 'text' column. Any sample that fails either check should be removed from the data set before we begin feature engineering.

A subset of tweets also includes a keyword and a location, but since these fields are optional, we will not filter samples on them. We may use keywords as a predictive variable in the machine learning step later, but we will pass over location as not relevant to our problem: deciding automatically whether the text of a tweet deserves attention from human emergency responders.

In [1]:
import pandas as pd

tweet_df = pd.read_csv('../data/kaggle_training.csv')

In [2]:
# check if we have nulls in our columns
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [3]:
# no nulls, but let's verify the fields aren't otherwise invalid
print("{} of the text entries are empty".format(
    len(tweet_df[tweet_df.text.str.len() == 0])))
print("{} tweets aren't correctly targeted".format(
    len(tweet_df[~((tweet_df.target == 0) | (tweet_df.target == 1))])))

0 of the text entries are empty
0 tweets aren't correctly targeted


## Feature Engineering

### Tweet Processing

Our fields are consistent, so it's time to start engineering our features. First we'll use our preprocessing toolkit to normalize the tweet texts by stripping out whitespace, so-called stop words ('the', 'a', etc.), and punctuation with little predictive value. We'll lemmatize the words that remain to increase the amount of predictive information we can get from common terms, and then use phrase modeling to join likely bigrams in our corpus.

In [4]:
# We'll use spaCy for our tweet preprocessing, and add emoticons to our
# pipeline so we don't remove them as simple punctuation tokens
import spacy
from spacyemoticon import Emoticon

nlp = spacy.load('en_core_web_lg')
emoticon = Emoticon(nlp)
nlp.add_pipe(emoticon, first=True)

While stop words are unlikely to be good signal for our final models, we're also bound to find tokens in our tweets that lack predictive value and aren't in the spaCy vocabulary at all, e.g. 'http'. We'll traverse our tweets for out-of-vocabulary (OOV) words that we think do have predictive value (e.g. emoticons), and let the rest be filtered out of our final processed tweets.

In [5]:
# first some utility processing
def usable_token(tok):
    """ return if token could have predictive value """
    return not (tok.lemma_ in spacy.lang.en.stop_words.STOP_WORDS
                or tok.lemma_ == 'rt' # added after discovery in exploration
                or tok.is_space
                or (tok.is_punct and not tok._.is_emoticon))

def left_hash(tok):
    """
    Return hashtag if token is a hashtag

    Words are used with slightly different emphasis when used as
    hashtags, so we'll maintain this distinction post processing

    Parameters
    ----------
    tok : nlp.Token

    Returns
    -------
    string
        Either a '#' character to be prepended to a recognized hashtag
        token or empty string '' if not a hashtag
    """
    try:
        if tok.nbor(-1).orth_ == '#':
            return '#'
    except IndexError:
        # the first token in a doc has no left neighbor
        pass
    
    return ''
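
The hashtag handling above depends on '#' arriving as its own token just before the word it marks. As a rough illustration of that idea on plain token lists, here is a minimal sketch; `rejoin_hashtags` is a hypothetical helper, not part of the notebook or of spaCy.

```python
def rejoin_hashtags(tokens):
    """Glue a standalone '#' token onto the token that follows it.

    Hypothetical helper for illustration only: it works on plain strings,
    whereas the notebook's left_hash works on spaCy tokens.
    """
    joined = []
    pending_hash = False
    for tok in tokens:
        if tok == '#':
            # remember the marker and wait for the next token
            pending_hash = True
            continue
        joined.append('#' + tok if pending_hash else tok)
        pending_hash = False
    return joined


tokens = ['heavy', 'rain', '#', 'flood', '#', 'disaster']
rejoined = rejoin_hashtags(tokens)
```

A '#' with nothing after it is simply dropped, which matches the intent of treating bare punctuation as noise.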

In [6]:
# building a set of oov tokens
oov_tokens = set()

def is_oov(tok):
    if not usable_token(tok) or left_hash(tok) == '#':
        # we want to keep our emoticons and hashtags
        return False
    else:
        return tok.is_oov

def tt_oovs(nlp, tt):
    parsed = nlp(tt)
    for token in parsed:
        if is_oov(token):
            oov_tokens.add(token.lemma_)

In [7]:
# we'll need a wrapper that binds nlp so the result can be passed to Series.apply
def df_nlp_app(nlp, func):
    return lambda text: func(nlp, text)
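
The wrapper above is a closure that fixes the first argument of a two-argument function. The standard library's `functools.partial` expresses the same thing, as this small sketch suggests; `count_tokens` and the use of `str.split` as a stand-in for the spaCy pipeline are assumptions made purely for illustration.

```python
from functools import partial


def df_nlp_app(nlp, func):
    # same shape as the notebook's wrapper
    return lambda text: func(nlp, text)


def count_tokens(tokenizer, text):
    # hypothetical (nlp, text)-style function; the tokenizer is a stand-in
    return len(tokenizer(text))


via_closure = df_nlp_app(str.split, count_tokens)
via_partial = partial(count_tokens, str.split)

sample = "forest fire near town"
closure_result = via_closure(sample)
partial_result = via_partial(sample)
```

Either form can be handed to `Series.apply`, since both are callables that take a single text argument.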

In [8]:
tweet_df.text.apply(df_nlp_app(nlp, tt_oovs))

0       None
1       None
2       None
3       None
4       None
        ... 
7608    None
7609    None
7610    None
7611    None
7612    None
Name: text, Length: 7613, dtype: object

Now let's see what we caught:

In [9]:
print(len(oov_tokens))

8920


This is too many to sweep at once, so let's see if we can shorten that list in any way.

In [10]:
print(list(oov_tokens)[:50])

['http://t.co/vcq2icptki', '-population:6', 'http://t.co/w7siidujoh', 'http://t.co/irqujaesck', 'http://t.co/f9j3l2yjl4', '@rejectdcartoon', 'http://t.co/i1vpkq9yag', 'http://t.co/rqu5ub8plf', 'http://t.co/gyzpisbi1u', 'http://t.co/kgkz50q8tk', 'http://t.co/ns5lbs5zup', 'https://t.co/ma4ra7atql', 'http://t.co/qwijrriyif', 'http://t.co/nlfr8t3xqm', 'oamsgajagahahah', 'http://t.co/nnylxhinpx', 'http://t.co/weudlkc4o4', '40hourfamine', 'http://t.co/fj73gdvg2n', 'http://t.co/8rdxcfgqem', 'http://t.co/up30aqgnlf', '@slatukip', 'https://t.co/yrfz5wj7r2', 'http://t.co/g5zsru0zvq', 'http://t.co/pbya7uv3v5', 'http://t.co/wnptvbm5t7', 'matako_3', 'http://t.co/bnhtxaezmm', 'http://t.co/xfhh2xf9ga', 'pic.twitter.com/pnpizody', 'http://t.co/tgdonttkty', 'favori', '@davidvonderhaar', 'http://t.co/tyyfg4qqvm', '@rzimmermanjr', 'http://t.co/ykuauov9jo', 'http://t.co/xvco7slxhw', 'http://t.co/qgum9xheos', 'http://t.co/2zgiupn06t', 'http://t.co/qvx0vqtpz0', 'http://t.co/ae9cpiexak', 'harda', 'http://t.c

This tells us that we'll have to do a bit more processing on some of our lemmas to remove '\x89û', and also that we can start shrinking our set by removing all mentions and URLs.

In [11]:
filtered_oov = set()
for token in oov_tokens:
    if not (token.startswith("http") or token.startswith('@') or token.endswith("\x89û")):
        filtered_oov.add(token)

In [12]:
print(len(filtered_oov))

1905


In [13]:
print(list(filtered_oov)[:50])

['sk398', '-population:6', 'rabaa', 'wild#fire', '\x89û÷hoax', 'mitt.\x89û\x9d', '~peace', 'totoooooooooo', '1600-year', 'zones***thank', '\x89û÷the', 'reaad/', '3-inspired', 'pugwash', 'oamsgajagahahah', '40hourfamine', 'full\x89ã¢', 'macabrelolita', 'ks57', '429cj', 'matako_3', 'lilourry', 'pic.twitter.com/pnpizody', 'favori', 'goodlook', '-=-0!!!!.', 'llegaste', 'joeysterling', 'harda', 'i77', 'intro+desolation', 'lastingness', 'timestack', '2.4regionåêåênear', '\x89û÷05', 'glononium', 'airplaneåê(29', 'intact+mh370+part+lifts+odds+plane+glided+not+crashed+into+sea', 'mumbai24x7', 'australia\x89ûªs', 'socialwots', 'jimin', 'arm\x89ûò', 'since1970the', 'servicin', '9:12pm', 'vibez', 'entertainer\x89û\x9d', 'laighign', 'fwt']


Getting closer, but we see more forms of '\x89' that we'll want to filter out of our final tweets. For the moment, let's just shrink our set.

In [14]:
filtered_oov2 = set()
for token in filtered_oov:
    if not (token.endswith("\x89ûª") or token.endswith("\x89ûò")):
        filtered_oov2.add(token)

In [15]:
print(len(filtered_oov2))

1854


We gained very little from this filter, so at this point we may just need a manual sweep over the remaining tokens to see whether any are worth saving.

In [16]:
print(list(filtered_oov2))



We do now have the insight that '\x89ûª' can be replaced with a single quote in most cases (this doesn't fit exactly with shortening everything to lemmas, but it is probably a better compromise than removing the tokens outright), and that a few tokens can likely be saved by removing the prefixes '\x89û÷' and '\x89ûï'. There are some typos, but since we want something generalizable, we won't manually correct anything. We do spot a few more tokens that can be saved with patterns, though: some words ending with '?' and some words starting or ending with '/' or '//'. Our final pass will therefore keep tokens that match these patterns and let all other OOV tokens be removed.

In [20]:
import re

def keep_token(tok):
    """ Filter out tokens without predictive value """
    if not usable_token(tok) or tok.lemma_ == 'ûª':
        return False
    
    return ((not is_oov(tok)) or re.search('\x89û', tok.lemma_)
            or tok.lemma_.startswith('/') or tok.lemma_.endswith('/')
            or tok.lemma_.endswith('?'))

def trimmed_lemma(lemma):
    """ Strip encoding debris from a lemma, longest sequences first """
    # each step must build on the previous result, not on the raw lemma
    trimmed = lemma.replace("\x89ûª", "'")
    trimmed = trimmed.replace("\x89û÷", "")
    trimmed = trimmed.replace("\x89ûï", "")
    trimmed = trimmed.replace("\x89û\x9d", "")
    trimmed = trimmed.replace("\x89û", "")
    trimmed = trimmed.replace("/", "")
    trimmed = trimmed.replace("?", "")
    return trimmed

def preprocess(nlp, tt):
    parsed = nlp(tt)
    return " ".join([left_hash(token) + trimmed_lemma(token.lemma_) for token in parsed
                     if keep_token(token)])
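
The order of the replacements matters for this kind of cleanup: the bare '\x89û' rule has to run after the longer sequences that contain it, or it will split them and leave stray characters behind. The sketch below shows this with a table of rules applied in sequence. `MOJIBAKE_RULES` and `clean_mojibake` are hypothetical names, and the sample strings are taken from the OOV output above.

```python
# Longer sequences come first so the bare '\x89û' rule cannot split them.
MOJIBAKE_RULES = [
    ("\x89ûª", "'"),
    ("\x89û÷", ""),
    ("\x89ûï", ""),
    ("\x89û\x9d", ""),
    ("\x89û", ""),
]


def clean_mojibake(text, rules=MOJIBAKE_RULES):
    """Apply each (broken, fixed) replacement in order, chaining the results."""
    for broken, fixed in rules:
        text = text.replace(broken, fixed)
    return text


samples = ["australia\x89ûªs", "\x89û÷hoax", "mitt.\x89û\x9d"]
cleaned = [clean_mojibake(s) for s in samples]

# Running the rules in the wrong order leaves a stray 'ª' behind.
wrong_order = clean_mojibake("australia\x89ûªs",
                             rules=list(reversed(MOJIBAKE_RULES)))
```

Keeping the rules in a list makes the ordering explicit and makes it easy to add further sequences as they turn up.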

In [21]:
processed_tweets = tweet_df.text.apply(df_nlp_app(nlp, preprocess))

Let's do a quick consistency check on our process so far.

In [22]:
pd.DataFrame({'original': tweet_df.text.head(10), 'processed': processed_tweets[:10]})

Unnamed: 0,original,processed
0,Our Deeds are the Reason of this #earthquake M...,-PRON- deed reason #earthquake allah forgive -...
1,Forest fire near La Ronge Sask. Canada,forest fire near la sask canada
2,All residents asked to 'shelter in place' are ...,resident ask shelter place notify officer evac...
3,"13,000 people receive #wildfires evacuation or...","13,000 people receive #wildfire evacuation ord..."
4,Just got sent this photo from Ruby #Alaska as ...,send photo ruby #alaska smoke #wildfire pour s...
5,#RockyFire Update => California Hwy. 20 closed...,#rockyfire update = > california hwy 20 close ...
6,#flood #disaster Heavy rain causes flash flood...,#flood #disaster heavy rain cause flash floodi...
7,I'm on top of the hill and I can see a fire in...,-PRON- hill -PRON- fire wood
8,There's an emergency evacuation happening now ...,emergency evacuation happen building street
9,I'm afraid that the tornado is coming to our a...,-PRON- afraid tornado come -PRON- area


The '-PRON-' entries come from spaCy's lemmatizer, which in the 2.x releases replaces every pronoun with the placeholder lemma '-PRON-'. They carry no signal for us, so we'll remove them before proceeding.

In [23]:
processed_tweets = [tweet.replace('-PRON- ', '') for tweet in processed_tweets]
processed_tweets = [tweet.replace(' -PRON-', '') for tweet in processed_tweets]

In [24]:
pd.DataFrame({'original': tweet_df.text.head(10), 'processed': processed_tweets[:10]})

Unnamed: 0,original,processed
0,Our Deeds are the Reason of this #earthquake M...,deed reason #earthquake allah forgive
1,Forest fire near La Ronge Sask. Canada,forest fire near la sask canada
2,All residents asked to 'shelter in place' are ...,resident ask shelter place notify officer evac...
3,"13,000 people receive #wildfires evacuation or...","13,000 people receive #wildfire evacuation ord..."
4,Just got sent this photo from Ruby #Alaska as ...,send photo ruby #alaska smoke #wildfire pour s...
5,#RockyFire Update => California Hwy. 20 closed...,#rockyfire update = > california hwy 20 close ...
6,#flood #disaster Heavy rain causes flash flood...,#flood #disaster heavy rain cause flash floodi...
7,I'm on top of the hill and I can see a fire in...,hill fire wood
8,There's an emergency evacuation happening now ...,emergency evacuation happen building street
9,I'm afraid that the tornado is coming to our a...,afraid tornado come area


Before moving on, now that we've done some initial cleanup, let's do another sweep for potential stop words that aren't covered by spaCy.

In [25]:
from collections import Counter

words = []
for tweet in processed_tweets:
    for word in tweet.split(' '):
        words.append(word)
        
word_counts = Counter(words)

In [29]:
word_counts.most_common(150)

[('like', 394),
 ("'s", 381),
 ('fire', 345),
 ('amp', 298),
 ('new', 226),
 ('people', 197),
 ('kill', 174),
 ('video', 170),
 ('burn', 166),
 ('2', 165),
 ('crash', 160),
 ('attack', 154),
 ('emergency', 153),
 ('body', 153),
 ('come', 151),
 ('disaster', 151),
 ('bomb', 149),
 ('year', 147),
 ('look', 145),
 ('day', 143),
 ('|', 142),
 ('good', 142),
 ('police', 141),
 ('man', 138),
 ('know', 138),
 ('time', 133),
 ('love', 129),
 ('family', 127),
 ('building', 126),
 ('flood', 126),
 ('think', 126),
 ('storm', 125),
 ('life', 122),
 ('home', 121),
 ('suicide', 120),
 ('news', 118),
 ('watch', 118),
 ('want', 117),
 ('train', 117),
 ('california', 116),
 ('car', 114),
 ('collapse', 114),
 ('death', 111),
 ('work', 110),
 ('3', 107),
 ('world', 104),
 ('scream', 102),
 ('today', 101),
 ('need', 100),
 ('let', 99),
 ('dead', 97),
 ('wreck', 97),
 ('old', 96),
 ('bag', 95),
 ('war', 94),
 ('nuclear', 94),
 ('accident', 93),
 ('destroy', 92),
 ('fear', 90),
 ('drown', 90),
 ('way', 86),

While it's hard to say whether some words will be more associated with disaster or non-disaster tweets, we do see a few tokens that are obviously without meaning and that we can filter out at this point:

In [30]:
# tokens judged to carry no meaning on their own
noise_tokens = {"u", "2", "4", "'s", "s", "|", "amp", ""}

for i in range(len(processed_tweets)):
    tweet_words = processed_tweets[i].split(' ')
    processed_tweets[i] = ' '.join(
        [w for w in tweet_words if w not in noise_tokens])

As one final preprocessing step before adding this processed column to our data frame, we'll join likely bigrams using the gensim implementation of phrase modeling.

In [31]:
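from gensim.models import Phrases

# Phrases expects tokenized sentences (lists of tokens); given raw strings
# it would treat each character as a token
tokenized_tweets = [tweet.split(' ') for tweet in processed_tweets]
bigramed_tweets = Phrases(tokenized_tweets)
processed_tweets = pd.Series(
    [' '.join(bigramed_tweets[tokens]) for tokens in tokenized_tweets])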
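
To give a feel for how phrase modeling decides which word pairs to join, here is a simplified, standard-library sketch of the count-based score from the word2vec phrase-detection approach: (count(a, b) - min_count) * vocab_size / (count(a) * count(b)). It is an illustration under assumptions, not gensim's code. `bigram_scores` is a hypothetical name, the vocabulary size here counts distinct unigrams only, and gensim's own bookkeeping and defaults differ in detail.

```python
from collections import Counter


def bigram_scores(sentences, min_count=1):
    """Score adjacent token pairs; higher means more phrase-like."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        # zip pairs each token with its right-hand neighbor
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


toy_corpus = [
    ["forest", "fire", "near", "town"],
    ["forest", "fire", "spread"],
    ["fire", "truck"],
]
scores = bigram_scores(toy_corpus)
```

Pairs whose score exceeds a chosen threshold would be merged into a single token such as 'forest_fire'. In this toy corpus only ('forest', 'fire') occurs more than `min_count` times, so it is the only pair with a positive score.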

And now we can add the processed tweets to our original dataframe:

In [32]:
tweet_df['processed_text'] = processed_tweets

In [33]:
#do a sanity check on these entries
tweet_df.head(10)[['text', 'processed_text']]

Unnamed: 0,text,processed_text
0,Our Deeds are the Reason of this #earthquake M...,deed reason #earthquake allah forgive
1,Forest fire near La Ronge Sask. Canada,forest fire near la sask canada
2,All residents asked to 'shelter in place' are ...,resident ask shelter place notify officer evac...
3,"13,000 people receive #wildfires evacuation or...","13,000 people receive #wildfire evacuation ord..."
4,Just got sent this photo from Ruby #Alaska as ...,send photo ruby #alaska smoke #wildfire pour s...
5,#RockyFire Update => California Hwy. 20 closed...,#rockyfire update = > california hwy 20 close ...
6,#flood #disaster Heavy rain causes flash flood...,#flood #disaster heavy rain cause flash floodi...
7,I'm on top of the hill and I can see a fire in...,hill fire wood
8,There's an emergency evacuation happening now ...,emergency evacuation happen building street
9,I'm afraid that the tornado is coming to our a...,afraid tornado come area


### Topic Modeling
Next we will use LDA (latent Dirichlet allocation) to obtain the most likely topic of each tweet. Each tweet is converted to bag-of-words format before being passed to gensim's LDA implementation, after which each topic's top words are inspected to see whether the derived topic groupings can be named.
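
For readers unfamiliar with the bag-of-words format, the sketch below builds it by hand with the standard library. It mirrors the list of (token_id, count) pairs that gensim's `Dictionary.doc2bow` returns, but `build_vocab` and `to_bow` are hypothetical helpers, and the ids are assigned in first-seen order, which need not match gensim's numbering.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for tokens in docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


docs = [
    ["forest", "fire", "near", "fire"],
    ["flood", "near", "home"],
]
vocab = build_vocab(docs)
bow_first = to_bow(docs[0], vocab)
```

Word order is discarded: only which tokens occur, and how often, reaches the topic model.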

In [34]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

In [35]:
tokened_tweets = processed_tweets.apply(lambda tweet: tweet.split(' '))
pt_dictionary = Dictionary(tokened_tweets)

In [36]:
def pt_generator(pt_dict, tweets):
    for tweet in tweets:
        yield pt_dict.doc2bow(tweet)
        
MmCorpus.serialize(
    '../data/pt_bow_corpus.mm',
    pt_generator(pt_dictionary, tokened_tweets))

tweet_mmcorpus = MmCorpus('../data/pt_bow_corpus.mm')

In [37]:
import warnings

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    lda = LdaMulticore(tweet_mmcorpus,
                       num_topics=25,
                       id2word=pt_dictionary,
                       workers=3)

In [38]:
lda.save('../data/lda_topic_models')

A bit of exploration shows that it's difficult to derive names or themes for our topics, likely because tweets are such short documents. As a result, we will save just the primary topic number that LDA assigns to each tweet for its possible predictive value, and potentially use the saved LDA model for further exploration later.

In [39]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """

    print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')

    # honor the caller's topn instead of a hardcoded 25
    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

In [40]:
explore_topic(1, topn=25)

term                 frequency

time                 0.007
siren                0.005
terrorist            0.005
trouble              0.005
wound                0.005
sink                 0.005
road                 0.004
trauma               0.004
let                  0.004
video                0.004
face                 0.004
kill                 0.004
police               0.004
sinkhole             0.004
attack               0.004
volcano              0.003
fire                 0.003
08                   0.003
know                 0.003
storm                0.003
km                   0.003
lot                  0.003
test                 0.003
survive              0.003
west                 0.003


In [41]:
explore_topic(2, topn=25)

term                 frequency

like                 0.007
fire                 0.006
good                 0.005
sink                 0.004
snowstorm            0.004
storm                0.004
3                    0.004
video                0.004
body                 0.004
life                 0.004
look                 0.004
today                0.004
siren                0.004
wreck                0.004
day                  0.004
new                  0.004
little               0.003
@youtube             0.003
cross                0.003
bag                  0.003
people               0.003
shoulder             0.003
car                  0.003
half                 0.003
need                 0.003


In [42]:
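# lda[bow] yields (topic_id, probability) pairs ordered by topic id,
# so pick the pair with the highest probability rather than the first
topics = [max(lda[pt_dictionary.doc2bow(tweet)], key=lambda tp: tp[1])[0]
          for tweet in tokened_tweets]
tweet_df['primary_topic'] = pd.Series(topics)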

In [43]:
# verify state of dataframe
tweet_df.head()

Unnamed: 0,id,keyword,location,text,target,processed_text,primary_topic
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,deed reason #earthquake allah forgive,0
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la sask canada,12
2,5,,,All residents asked to 'shelter in place' are ...,1,resident ask shelter place notify officer evac...,3
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive #wildfire evacuation ord...",15
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,send photo ruby #alaska smoke #wildfire pour s...,17


And with that we'll save out our modified dataframe before constructing our tf-idf file.

In [44]:
tweet_df.to_csv('../data/processed_kaggle_training.csv')

## TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a representation similar to bag of words, but instead of storing a raw count for each term, it weights each term by how rarely it appears across the whole corpus. This adds a measure of term importance that a plain bag of words lacks, and it scales each document vector to values that tend to yield better results when run through various machine learning models.
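
To make the weighting concrete, here is a small standard-library sketch of TF-IDF. It assumes raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector, which is the default behavior of scikit-learn's `TfidfVectorizer`. `tfidf_vectors` is a hypothetical name, and the sketch skips tokenization and the `min_df`/`max_df` pruning.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in docs for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize; fall back to 1.0 so empty documents don't divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


toy_docs = [["fire", "forest"], ["fire", "flood"]]
vectors = tfidf_vectors(toy_docs)
```

In this toy corpus 'fire' appears in both documents, so it receives a lower weight than 'forest' or 'flood', each of which marks a single document.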

This time we will use scikit-learn to vectorize our corpus, then write the result to a separate data file. The TF-IDF matrix is sparse, as is typical for this representation, so we store it in SciPy's sparse .npz format rather than as a CSV.

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidvectorizer = TfidfVectorizer(min_df=4, max_df=0.8)
Xtfidf = tfidvectorizer.fit_transform(tweet_df['processed_text'])

In [46]:
import scipy.sparse
scipy.sparse.save_npz('../data/tweets-tf-idf.npz', Xtfidf)