## Cleaning Tweets <a class="anchor"  id="chapter5"></a>

Below, I will define a function that will clean the tweets by removing URLs, mentions, and punctuation, as well as splitting up hashtags into individual words. The function will also lowercase all words. 

Many tweets in the dataset contain the word 'amp'. After some further research, I found that 'amp' is what remains of the HTML entity `&amp;` (an encoded ampersand, &) once the punctuation around it has been stripped, so it will be cleaned from the text as well.
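
As a quick illustration of where the stray 'amp' tokens likely come from, the sketch below assumes the raw tweets encode ampersands as the HTML entity `&amp;`. The sample tweet is made up. Stripping punctuation leaves the letters 'amp' behind, whereas decoding the entity first with the standard library's `html.unescape` would be one possible alternative to deleting the word afterwards.

```python
import html
import re

# Hypothetical raw tweet, assuming ampersands arrive HTML-escaped as '&amp;'
raw = "Fire &amp; smoke downtown"

# Stripping punctuation (same pattern as the cleaning step) leaves 'amp' behind
stripped = re.sub(r'[^\w\s#]', '', raw)
print(stripped)  # Fire amp smoke downtown

# Alternative: decode HTML entities before removing punctuation
decoded = html.unescape(raw)
print(decoded)  # Fire & smoke downtown
```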

In [4]:
# Clean tweets: remove URLs, mentions, and punctuation, split hashtags made of
# words that are strung together, lowercase everything, and drop the stray 'amp'
import re

def clean_text(text: str) -> str:
    """
    Clean a tweet by removing URLs, mentions, punctuation, and occurrences of
    the word 'amp', and by splitting hashtags made of words that are strung
    together into individual words. The function also lowercases all words.
    
    Parameters:
        text (str): Text to be cleaned.

    Returns:
        str: Cleaned text.

    """
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    
    # Remove mentions
    text = re.sub(r'@\w+', '', text)
    
    # Remove punctuation except for hashtags
    text = re.sub(r'[^\w\s#]', '', text)
    
    # Split up hashtags into individual words and remove #
    text = re.sub(r'#(\w+)', lambda x: ' '.join(re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z])|[\'\w]+', 
                                                           x.group(1))), text)
    # Convert to lowercase
    text = text.lower()
    
    # Remove 'amp' 
    text = re.sub(r'\bamp\b', '', text)
    
    return text

# Create new column with cleaned text
df['text_cleaned'] = df['text'].apply(clean_text)
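
To make the hashtag-splitting step easier to follow, here is a small sketch that applies the same regular expression on its own. The helper name `split_hashtag` and the sample hashtags are made up for illustration; the pattern is the one used inside `clean_text`.

```python
import re

# Same pattern as in clean_text: capitalized or lowercase words, runs of
# capitals that precede a capitalized word, then anything left over
HASHTAG_WORDS = r"[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z])|[\'\w]+"

def split_hashtag(tag: str) -> str:
    """Split a hashtag body (without '#') into space-separated words."""
    return ' '.join(re.findall(HASHTAG_WORDS, tag))

print(split_hashtag('ForestFire'))  # Forest Fire
print(split_hashtag('NYCFire'))     # NYC Fire
print(split_hashtag('earthquake'))  # earthquake
```

Hashtags written entirely in lowercase, such as `#forestfire`, carry no capitalization cues and therefore stay as a single token.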

In [5]:
display(df.isna().sum())

# Clean the keyword column: turn URL-encoded spaces ('%20') back into spaces
df['keyword'] = df['keyword'].str.replace('%20', ' ', regex=False)

display(df['keyword'].unique())

id                                0
keyword                          87
location                       3638
text                              0
target                         3263
word_count                        0
unique_word_count                 0
character_count                   0
hashtag_count                     0
mention_count                     0
url_count                         0
capitalized_word_count            0
capitalized_word_proportion       0
text_cleaned                      0
dtype: int64

array([nan, 'ablaze', 'accident', 'aftershock', 'airplane accident',
       'ambulance', 'annihilated', 'annihilation', 'apocalypse',
       'armageddon', 'army', 'arson', 'arsonist', 'attack', 'attacked',
       'avalanche', 'battle', 'bioterror', 'bioterrorism', 'blaze',
       'blazing', 'bleeding', 'blew up', 'blight', 'blizzard', 'blood',
       'bloody', 'blown up', 'body bag', 'body bagging', 'body bags',
       'bomb', 'bombed', 'bombing', 'bridge collapse',
       'buildings burning', 'buildings on fire', 'burned', 'burning',
       'burning buildings', 'bush fires', 'casualties', 'casualty',
       'catastrophe', 'catastrophic', 'chemical emergency', 'cliff fall',
       'collapse', 'collapsed', 'collide', 'collided', 'collision',
       'crash', 'crashed', 'crush', 'crushed', 'curfew', 'cyclone',
       'damage', 'danger', 'dead', 'death', 'deaths', 'debris', 'deluge',
       'deluged', 'demolish', 'demolished', 'demolition', 'derail',
       'derailed', 'derailment', 'desol

The 'location' column has 3,638 missing values, and many of the values that are present do not seem to be actual locations, so I will be dropping it, as I don't think it will be useful for the metadata model.

The 'keyword' column, on the other hand, may be helpful, as some words are more likely to refer to disasters than others. For example, it's pretty easy to use "fire" colloquially, but less easy to use "bomber".

Below, I will define a function that fills in missing keywords by searching the cleaned text for words from a list of known keywords.

In [6]:
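def assign_keyword(df, keywords):
    """
    Fill in missing keywords: for each row whose 'keyword' is NaN or empty,
    assign the first entry of `keywords` found as a substring of 'text_cleaned'.
    Modifies df in place and returns it.
    """
    for index, row in df.iterrows():
        if pd.isna(row['keyword']) or row['keyword'] == '':
            for keyword in keywords:
                # Plain substring test, so 'fire' also matches 'fired'
                if keyword in row['text_cleaned']:
                    df.at[index, 'keyword'] = keyword
                    break  # keep only the first match in list order
    return df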


keywords_array = ['crash', 'earthquake', 'fire', 'disaster', 'emergency', 'tornado',
       'flood', 'ablaze', 'accident', 'aftershock',
       'airplane accident', 'ambulance', 'annihilated', 'annihilation',
       'apocalypse', 'armageddon', 'army', 'arson', 'arsonist', 'attack',
       'attacked', 'avalanche', 'battle', 'bioterror', 'bioterrorism',
       'blaze', 'blazing', 'bleeding', 'blew up', 'blight', 'blizzard',
       'blood', 'bloody', 'blown up', 'body bag', 'body bagging',
       'body bags', 'bomb', 'bombed', 'bombing', 'bridge collapse',
       'buildings burning', 'buildings on fire', 'burned', 'burning',
       'burning buildings', 'bush fires', 'casualties', 'casualty',
       'catastrophe', 'catastrophic', 'chemical emergency', 'cliff fall',
       'collapse', 'collapsed', 'collide', 'collided', 'collision',
       'crashed', 'crush', 'crushed', 'curfew', 'cyclone', 'damage',
       'danger', 'dead', 'death', 'deaths', 'debris', 'deluge', 'deluged',
       'demolish', 'demolished', 'demolition', 'derail', 'derailed',
       'derailment', 'desolate', 'desolation', 'destroy', 'destroyed',
       'destruction', 'detonate', 'detonation', 'devastated',
       'devastation', 'displaced', 'drought', 'drown', 'drowned',
       'drowning', 'dust storm', 'electrocute', 'electrocuted',
       'emergency plan', 'emergency services', 'engulfed', 'quarantine',
       'hazard', 'storm', 'epicentre', 'evacuate', 'evacuated',
       'evacuation', 'explode', 'exploded', 'explosion', 'eyewitness',
       'famine', 'fatal', 'fatalities', 'fatality', 'fear', 'fire truck',
       'first responders', 'flames', 'flattened', 'flooding', 'floods',
       'forest fire', 'forest fires', 'hail', 'hailstorm', 'harm',
       'hazardous', 'heat wave', 'hellfire', 'hijack', 'hijacker',
       'hijacking', 'hostage', 'hostages', 'hurricane', 'injured',
       'injuries', 'injury', 'inundated', 'inundation', 'landslide',
       'lava', 'lightning', 'loud bang', 'mass murder', 'mass murderer',
       'massacre', 'mayhem', 'meltdown', 'military', 'mudslide',
       'natural disaster', 'nuclear disaster', 'nuclear reactor',
       'obliterate', 'obliterated', 'obliteration', 'oil spill',
       'outbreak', 'pandemonium', 'panic', 'panicking', 'police',
       'quarantined', 'radiation emergency', 'rainstorm', 'razed',
       'refugees', 'rescue', 'rescued', 'rescuers', 'riot', 'rioting',
       'rubble', 'ruin', 'sandstorm', 'screamed', 'screaming', 'screams',
       'seismic', 'sinkhole', 'sinking', 'siren', 'sirens', 'smoke',
       'snowstorm', 'stretcher', 'structural failure', 'suicide bomb',
       'suicide bomber', 'suicide bombing', 'sunk', 'survive', 'survived',
       'survivors', 'terrorism', 'terrorist', 'threat', 'thunder',
       'thunderstorm', 'tragedy', 'trapped', 'trauma', 'traumatised',
       'trouble', 'tsunami', 'twister', 'typhoon', 'upheaval',
       'violent storm', 'volcano', 'war zone', 'weapon', 'weapons',
       'whirlwind', 'wild fires', 'wildfire', 'windstorm', 'wounded',
       'wounds', 'wreck', 'wreckage', 'wrecked']

df = assign_keyword(df, keywords_array)

display(df['keyword'].isna().sum())

16
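
One thing to keep in mind about `assign_keyword` is that `keyword in row['text_cleaned']` is a substring test and that the first match in list order wins. The sketch below shows both effects, along with one possible alternative that is not what was run in this notebook: matching whole words with `\b` and trying longer keywords first. The helper name `find_keyword` and the sample sentences are hypothetical.

```python
import re

def find_keyword(text, keywords):
    """Return the first keyword found as a whole word or phrase, else None."""
    # Try longer keywords first so 'forest fire' is preferred over 'fire'
    for keyword in sorted(keywords, key=len, reverse=True):
        if re.search(rf'\b{re.escape(keyword)}\b', text):
            return keyword
    return None

sample_keywords = ['harm', 'fire', 'forest fire']

# Substring test: 'harm' is found inside 'pharmacy'
print('harm' in 'the pharmacy is closed')                       # True
# Whole-word test: no keyword matches
print(find_keyword('the pharmacy is closed', sample_keywords))  # None
# Longest keyword wins over the shorter 'fire'
print(find_keyword('forest fire near town', sample_keywords))   # forest fire
```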

There are still 16 tweets with missing keywords. I will display them below, both cleaned and raw, to see whether keywords can be assigned manually.

In [7]:
display(df[df['keyword'].isna()]['text_cleaned'].values)
display(df[df['keyword'].isna()]['text'].values)

df['keyword'] = df['keyword'].fillna('none')

array(['whats up man', 'i love fruits', 'summer is lovely',
       'my car is so fast', 'what a goooooooaaaaaal',
       'this is ridiculous', 'london is cool ', 'love skiing',
       'what a wonderful day', 'looooool', 'no wayi cant eat that shit',
       'was in nyc last week', 'love my girlfriend', 'cooool ',
       'do you like pasta', 'the end'], dtype=object)

array(["What's up man?", 'I love fruits', 'Summer is lovely',
       'My car is so fast', 'What a goooooooaaaaaal!!!!!!',
       'this is ridiculous....', 'London is cool ;)', 'Love skiing',
       'What a wonderful day!', 'LOOOOOOL',
       "No way...I can't eat that shit", 'Was in NYC last week!',
       'Love my girlfriend', 'Cooool :)', 'Do you like pasta?',
       'The end!'], dtype=object)

Interestingly, none of the tweets above seems to contain a word that could be mistaken for a disaster keyword. They might be in the dataset erroneously. I group them all under the keyword 'none' to indicate that they don't have any disaster words in them. The neural network should be able to pick up that tweets with this keyword all belong to the 'Non-Disaster' class.

Below, I will define a function to count spelling errors in the cleaned tweets. Spelling errors could be an indication of class, as news outlets and people reporting real disasters may proofread their tweets more often than people casually tweeting about their daily lives. We will visualize the difference between classes in a bit.

In [8]:
# Create an instance of SpellChecker (provided by the pyspellchecker package)
from spellchecker import SpellChecker

spell = SpellChecker()

# Define a function to count spelling errors in a tweet
def count_spelling_errors(text: str) -> int:
    """
    Count the number of spelling errors in a tweet.

    Parameters:
    text (str): The text of the tweet to check.

    Returns:
    int: The number of spelling errors in the tweet.
    """
    # Split the tweet into words
    words = text.split()
    
    # Count the number of spelling errors
    num_errors = 0
    for word in words:
        # Check if the word is misspelled
        if spell.correction(word) != word:
            num_errors += 1
            
    return num_errors

# Apply the function to the 'text_cleaned' column and store the result in a new column called 'spelling_error_count'
df['spelling_error_count'] = df['text_cleaned'].apply(count_spelling_errors)

# Binary feature that shows presence of spelling errors (1 if any, else 0)
df['has_spelling_errors'] = (df['spelling_error_count'] > 0).astype(int)
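
To show how the two features are derived without needing the spell-checking library, the sketch below swaps in a tiny hand-written vocabulary for the real dictionary. The vocabulary and the helper `count_unknown_words` are made up for illustration, and the two sample tweets are taken from the cleaned output shown earlier, so the counts only demonstrate the mechanics and say nothing about the full dataset.

```python
# Toy stand-in for a real dictionary (assumption: illustration only)
VOCAB = {'what', 'a', 'this', 'is', 'ridiculous', 'goal'}

def count_unknown_words(text: str, vocab: set) -> int:
    """Count whitespace-separated words that are not in vocab."""
    return sum(1 for word in text.split() if word not in vocab)

tweets = ['what a goooooooaaaaaal', 'this is ridiculous']
counts = [count_unknown_words(t, VOCAB) for t in tweets]
flags = [int(c > 0) for c in counts]  # binary "has errors" feature

print(counts)  # [1, 0]
print(flags)   # [1, 0]
```

Elongated words such as 'goooooooaaaaaal' count as errors under this kind of check, which is worth keeping in mind when interpreting the feature.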