# Data wrangling for Our Disaster Tweets

As the input training data for this project is coming out of a Kaggle competition, it is already gathered in it's entirety (and fairly well wrangled). Instead of gathering and combination, our focus will be on processing our data with machine learning techniques to accurately predict the concern that a tweet should merit. To prepare for this process, while our input tweet training set is as complete as it will be coming in, the models we will use will benefit from additional features that can be derived from the data:

1. Processed Tweet texts
 - with tokens with low predictive value such as stop words or punctuation removed
 - with remaining tokens lemmatized for increased token predictive value across the corpus
 - with common bigrams ajoined for increased predictive value
2. Primary likely topic of the tweet 
3. Representation of tweets in TF-IDF vector form

The first of these two steps will be appended to our original data set to produce an ammended CSV, the third step set of features will be stored in a seperate CSV to maintain readability of the first file.

## Check for Consistency

Before performing our feature engineering steps, we should first do a sanity check on our incoming data to make sure we have the inputs we expect. First, all tweets should have an integer 0 or 1 in their 'target' column indicating whether the tweet is a disaster or not, and second, all tweet samples should have a string object in their text column. If any sample lacks either of these features they should be removed from our data set before we begin our feature engineering.

Some subset of tweets also include keyword and location, but as they are not required, we will not be filtering our samples based on these fields. We may use keywords as a predictive variable in our machine learning step later, while we will pass over the location variable as not being relevant to our problem of automating answering whether the text of a tweet indicates it deserves attention from human emergency response.

In [1]:
import pandas as pd

tweet_df = pd.read_csv('../data/kaggle_training.csv')

In [2]:
# check if we have nulls in our columns
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [3]:
# no nulls, but lets verify fields aren't otherwise invalid
print("{} of the text entries are empty".format(
    len(tweet_df[tweet_df.text.str.len() == 0])))
print("{} tweets aren't correctly targeted".format(
    len(tweet_df[~((tweet_df.target == 0) | (tweet_df.target == 1))])))

0 of the text entries are empty
0 tweets aren't correctly targeted


## Feature Engineering

### Tweet Processing

Our fields are consistent, so it's time to start engineering our features. The first thing we aim to do is use our preprocessing toolkits to provide some normalization over our tweet texts. We'll be normalizing via stripping out white space, so called stop words ('the', 'a', etc.), and punctuation with little predictive value. We'll lemmatize remaining words after these filters to increase the amount of predictive information we can get from common terms, and then use phrase modeling to ajoin potential bigrams in our corpus.

In [4]:
# We'll use Spacy for our tweet preprocessing, and add emoticons to our
# pipeline so we don't remove them as simple punctuation tokens
import spacy
from spacyemoticon import Emoticon

nlp = spacy.load('en_core_web_md')
emoticon = Emoticon(nlp)
nlp.add_pipe(emoticon, first=True)

In [5]:
def keep_token(tok):
    """ Filter out tokens without predictive value """
    return not (tok.lemma_ in spacy.lang.en.stop_words.STOP_WORDS
                or tok.is_space
                or (tok.is_punct and not tok._.is_emoticon))

def left_hash(tok):
    """
    Return hashtag if token is a hashtag

    Words are used with slightly different emphasis when used as
    hashtags, so we'll maintain this distinction post processing

    Parameters
    ----------
    tok : nlp.Token

    Returns
    -------
    string
        Either a '#' character to be prepended to a recognized hashtag
        token or empty string '' if not a hashtag
    """
    try:
        if tok.nbor(-1).orth_ == '#':
            return '#'
    except:
        pass
    
    return ''

def preprocess(nlp, tt):
    parsed = nlp(tt)
    return " ".join([left_hash(token) + token.lemma_ for token in parsed
                     if keep_token(token)])

def df_preprocess(nlp, func):
    return lambda text: func(nlp, text)

In [6]:
processed_tweets = tweet_df.text.apply(df_preprocess(nlp, preprocess))

Let's do a quick consistency check on our process so far.

In [7]:
pd.DataFrame({'original': tweet_df.text.head(10), 'processed': processed_tweets[:10]})

Unnamed: 0,original,processed
0,Our Deeds are the Reason of this #earthquake M...,-PRON- deed reason #earthquake allah forgive -...
1,Forest fire near La Ronge Sask. Canada,forest fire near la ronge sask canada
2,All residents asked to 'shelter in place' are ...,resident ask shelter place notify officer evac...
3,"13,000 people receive #wildfires evacuation or...","13,000 people receive #wildfir evacuation orde..."
4,Just got sent this photo from Ruby #Alaska as ...,send photo ruby #alaska smoke #wildfire pour s...
5,#RockyFire Update => California Hwy. 20 closed...,#rockyfire update = > california hwy 20 close ...
6,#flood #disaster Heavy rain causes flash flood...,#flood #disaster heavy rain cause flash floodi...
7,I'm on top of the hill and I can see a fire in...,-PRON- hill -PRON- fire wood
8,There's an emergency evacuation happening now ...,emergency evacuation happen building street
9,I'm afraid that the tornado is coming to our a...,-PRON- afraid tornado come -PRON- area


Our spacy preprocessing seems to have added in '-PRON-'s of unknown origin, we'll remove these before proceeding.

In [8]:
processed_tweets = [tweet.replace('-PRON- ', '') for tweet in processed_tweets]
processed_tweets = [tweet.replace(' -PRON-', '') for tweet in processed_tweets]

In [9]:
pd.DataFrame({'original': tweet_df.text.head(10), 'processed': processed_tweets[:10]})

Unnamed: 0,original,processed
0,Our Deeds are the Reason of this #earthquake M...,deed reason #earthquake allah forgive
1,Forest fire near La Ronge Sask. Canada,forest fire near la ronge sask canada
2,All residents asked to 'shelter in place' are ...,resident ask shelter place notify officer evac...
3,"13,000 people receive #wildfires evacuation or...","13,000 people receive #wildfir evacuation orde..."
4,Just got sent this photo from Ruby #Alaska as ...,send photo ruby #alaska smoke #wildfire pour s...
5,#RockyFire Update => California Hwy. 20 closed...,#rockyfire update = > california hwy 20 close ...
6,#flood #disaster Heavy rain causes flash flood...,#flood #disaster heavy rain cause flash floodi...
7,I'm on top of the hill and I can see a fire in...,hill fire wood
8,There's an emergency evacuation happening now ...,emergency evacuation happen building street
9,I'm afraid that the tornado is coming to our a...,afraid tornado come area


Much better, but before adding this processed column to our data frame, we'll ajoin likely bigram words using the gensim implementation of phrase modeling.

In [10]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

bigramed_tweets = Phrases(processed_tweets)
processed_tweets = pd.Series(
    [''.join(bigramed_tweets[tweet]) for tweet in processed_tweets])

And now we can add the processed tweets to our original dataframe:

In [11]:
tweet_df['processed_text'] = processed_tweets

In [12]:
#do a sanity check on these entries
tweet_df.head(10)[['text', 'processed_text']]

Unnamed: 0,text,processed_text
0,Our Deeds are the Reason of this #earthquake M...,deed reason #earthquake allah forgive
1,Forest fire near La Ronge Sask. Canada,forest fire near la ronge sask canada
2,All residents asked to 'shelter in place' are ...,resident ask shelter place notify officer evac...
3,"13,000 people receive #wildfires evacuation or...","13,000 people receive #wildfir evacuation orde..."
4,Just got sent this photo from Ruby #Alaska as ...,send photo ruby #alaska smoke #wildfire pour s...
5,#RockyFire Update => California Hwy. 20 closed...,#rockyfire update = > california hwy 20 close ...
6,#flood #disaster Heavy rain causes flash flood...,#flood #disaster heavy rain cause flash floodi...
7,I'm on top of the hill and I can see a fire in...,hill fire wood
8,There's an emergency evacuation happening now ...,emergency evacuation happen building street
9,I'm afraid that the tornado is coming to our a...,afraid tornado come area


### Topic Modeling
And next we will use LDA modeling, to obtain the most likely topics of each tweet. Each tweet will be converted to a bag of words format before being passed to gensims LDA implementation. After which each topics words will be assessed to assign topic names to the derived topic groupings.

In [13]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

In [14]:
tokened_tweets = processed_tweets.apply(lambda tweet: tweet.split(' '))
pt_dictionary = Dictionary(tokened_tweets)

In [15]:
def pt_generator(pt_dict, tweets):
    for tweet in tweets:
        yield pt_dict.doc2bow(tweet)
        
MmCorpus.serialize(
    '../data/pt_bow_corpus.mm',
    pt_generator(pt_dictionary, tokened_tweets))

tweet_mmcorpus = MmCorpus('../data/pt_bow_corpus.mm')

In [16]:
import warnings

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    lda = LdaMulticore(tweet_mmcorpus,
                       num_topics=25,
                       id2word=pt_dictionary,
                       workers=3)

In [17]:
lda.save('../data/lda_topic_models')

A bit of exploration shows us that it's difficult to derive names or themes for our topics though, likely due to the short length of using a text as short as tweets as documents. As a result we will save just the primary topic number as assigned by LDA to each of our tweets for possible predictive value, and potentially use our saved lda model for further exploration later.

In [18]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')

    for term, frequency in lda.show_topic(topic_number, topn=25):
        print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

In [19]:
explore_topic(1, topn=10)

term                 frequency

's                   0.006
like                 0.006
officer              0.005
kill                 0.005
wound                0.005
come                 0.005
police               0.004
wreck                0.004
suspect              0.004
fire                 0.004
storm                0.004
large                0.003
fucking              0.003
violent              0.003
exchange             0.003
video                0.003
bomb                 0.003
news                 0.003
tornado              0.003
wreckage             0.003
life                 0.003
lake                 0.002
richmond             0.002
gunfire              0.002
whirlwind            0.002


In [20]:
explore_topic(2, topn=10)

term                 frequency

's                   0.007
like                 0.005
video                0.004
wreck                0.004
people               0.004
love                 0.004
think                0.004
movie                0.004
hollywood            0.004
s                    0.003
release              0.003
weapon               0.003
chile                0.003
snowstorm            0.003
thunder              0.003
sink                 0.003
trouble              0.003
land                 0.003
hear                 0.003
3                    0.003
storm                0.003
good                 0.002
come                 0.002
look                 0.002
violent              0.002


In [21]:
topics = [lda[pt_dictionary.doc2bow(tweet)][0][0] for tweet in tokened_tweets]
tweet_df['primary_topic'] = pd.Series(topics)

In [22]:
# verify state of dataframe
tweet_df.head()

Unnamed: 0,id,keyword,location,text,target,processed_text,primary_topic
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,deed reason #earthquake allah forgive,6
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada,18
2,5,,,All residents asked to 'shelter in place' are ...,1,resident ask shelter place notify officer evac...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive #wildfir evacuation orde...",2
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,send photo ruby #alaska smoke #wildfire pour s...,9


And with that we'll save out our modified dataframe before constructing our tf-idf file.

In [23]:
tweet_df.to_csv('../data/processed_kaggle_training.csv')

## TF-IDF
Term frequency x Inverse Document Frequency is a representation similar to bag of words, but instead of being a direct count of term, it weighs each terms across their appearence in an entire corpus. This serves to add a measure of term importance to each term in a document when storing that bag of words doesn't have, as well as scaling the representations in each document vector to values that tend to yield better results when ran through various machine models.

This time we will use a library provided by sklearn to vectorize our corpus before writing them to a seperate data file (this time for storage of a sparse matrix which is the typical form of this corpus).

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidvectorizer = TfidfVectorizer(min_df=4, max_df=0.8)
Xtfidf = tfidvectorizer.fit_transform(tweet_df['processed_text'])

In [25]:
import scipy.sparse
scipy.sparse.save_npz('../data/tweets-tf-idf.npz', Xtfidf)