In [1]:
import re
from nltk.corpus import stopwords
import string
import pandas as pd

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

## Dataset

In [2]:
tweets_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
tweets_df.head() 

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


The competition description indicates that keywords matter for classifying disaster tweets, so a combined `tweet` column is created by joining `keyword` and `text`. Missing keywords are first replaced with an empty string.

In [3]:
tweets_df["keyword"] = tweets_df["keyword"].fillna("")
tweets_df["tweet"] = tweets_df["keyword"] + " " + tweets_df["text"]
tweets_df.sample(5, random_state=42)

Unnamed: 0,id,keyword,location,text,target,tweet
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1,destruction So you have a new weapon that can ...
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0,deluge The f$&amp;@ing things I do for #GISHWH...
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1,police DT @georgegalloway: RT @Galloway4Mayor:...
132,191,aftershock,,Aftershock back to school kick off was great. ...,0,aftershock Aftershock back to school kick off ...
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0,trauma in response to trauma Children of Addic...


## Lower Case

In [4]:
tweets_df["tweet_lower"] = tweets_df["tweet"].str.lower()
tweets_df["tweet_lower"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$&amp;@ing things i do for #gishwh...
5448    police dt @georgegalloway: rt @galloway4mayor:...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_lower, dtype: object

## Remove HTML

The text contains many HTML entities such as `&gt;` and `&lt;`. It might also contain HTML tags such as `<p>`, `<a>` or `<div>`.

In [5]:
from bs4 import BeautifulSoup
text = r"&gt;&gt; $15 Aftershock : Protect Yourself and Profit in the Next Global Financial... ##book http://t.co/f6ntUc734Z esquireattire"
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly for consistent results
soup.get_text()

'>> $15 Aftershock : Protect Yourself and Profit in the Next Global Financial... ##book http://t.co/f6ntUc734Z esquireattire'

In [6]:
def remove_html(text):
    # Parse with Python's built-in HTML parser and keep only the text content.
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()
    return text

In [7]:
tweets_df["tweet_noHTML"] = tweets_df["tweet_lower"].apply(remove_html)
tweets_df["tweet_noHTML"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$&@ing things i do for #gishwhes j...
5448    police dt @georgegalloway: rt @galloway4mayor:...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noHTML, dtype: object
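
As a rough cross-check on the step above, entity decoding and tag stripping can also be sketched with the standard library alone. The function name `remove_html_stdlib` and the tag pattern are assumptions made for this illustration; the regex is much cruder than a real HTML parser and is not what the notebook uses.

```python
import html
import re

# Anything that looks like an opening or closing tag, e.g. <p>, </a>, <div class="x">.
TAG_PATTERN = re.compile(r"</?[a-zA-Z][^>]*>")

def remove_html_stdlib(text):
    # Strip tags first, so that a decoded "&lt;" or "&gt;" is never mistaken for a tag.
    text = TAG_PATTERN.sub("", text)
    # Decode entities such as &gt; and &amp; into plain characters.
    return html.unescape(text)

print(remove_html_stdlib("&gt;&gt; $15 aftershock <b>protect</b> yourself &amp; profit"))
```

For messy or malformed markup, BeautifulSoup remains the safer choice.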

## Expand Contractions

Informal text contains many contractions, such as can't (cannot) and they've (they have), as well as modern slang forms such as sux (sucks). The Python package `contractions` expands them.

In [8]:
!pip install contractions
import contractions

tweets_df["tweet_noContractions"] = tweets_df["tweet_noHTML"].apply(contractions.fix)
tweets_df["tweet_noContractions"].sample(5, random_state=42)

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-2.0.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (101 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24

2644    destruction so you have a new weapon that can ...
2227    deluge the f$&@ing things i do for #gishwhes j...
5448    police dt @georgegalloway: rt @galloway4mayor:...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noContractions, dtype: object
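
To make the idea behind contraction expansion concrete, here is a minimal dictionary-based sketch. The mapping `CONTRACTION_MAP` and the function `expand_contractions` are hypothetical and deliberately tiny; they only illustrate the lookup-and-replace idea and say nothing about how the `contractions` package is implemented internally.

```python
import re

# Hypothetical, deliberately small mapping used only for illustration.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "they've": "they have",
    "i'm": "i am",
}

# One alternation over all keys, anchored on word boundaries.
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b"
)

def expand_contractions(text):
    # Replace every matched contraction with its expansion from the mapping.
    return CONTRACTION_PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(1)], text)

print(expand_contractions("i'm sure they've left, we can't stay"))
```

The sketch assumes lower-cased input, which matches the order of steps in this notebook.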

## Remove URLs

In [9]:
def remove_urls(text):
    pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)(/\w*)?')
    text = re.sub(pattern, "", text)
    return text

In [10]:
text = "#stlouis #caraccidentlawyer Speeding Among Top Causes of Teen Accidents https://t.co/k4zoMOF319 https://t.co/S2kXVM0cBA Car Accident"
remove_urls(text)

'#stlouis #caraccidentlawyer Speeding Among Top Causes of Teen Accidents   Car Accident'

In [11]:
tweets_df["tweet_noURLs"] = tweets_df["tweet_noContractions"].apply(remove_urls)
tweets_df["tweet_noURLs"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$&@ing things i do for #gishwhes j...
5448    police dt @georgegalloway: rt @galloway4mayor:...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noURLs, dtype: object
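
The pattern above handles the short `t.co` links in this dataset, but it stops after the first path segment, so a longer URL with several path segments or a query string would leave fragments behind. A broader sketch, under the assumption that a URL simply runs until the next whitespace, could look like this (`remove_urls_broad` is a name chosen for the example):

```python
import re

# Assumption: a URL starts with http://, https:// or www. and ends at whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls_broad(text):
    # Remove the whole URL, including path, query string and fragment.
    return URL_PATTERN.sub("", text)

print(remove_urls_broad("details at https://example.com/a/b?x=1 and www.example.org/page today"))
```

The trade-off is that punctuation glued to the end of a URL is removed along with it.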

## Remove Email IDs

In [12]:
def remove_emails(text):
    pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
    text = re.sub(pattern, "", text)
    return text

In [13]:
text = "please send your feedback to myemail@gmail.com "
remove_emails(text)

'please send your feedback to  '

In [14]:
tweets_df["tweet_noEmail"] = tweets_df["tweet_noURLs"].apply(remove_emails)
tweets_df["tweet_noEmail"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$&@ing things i do for #gishwhes j...
5448    police dt @georgegalloway: rt @galloway4mayor:...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noEmail, dtype: object

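## Remove Twitter Mentions
The text contains mentions written with @. These need to be removed before the punctuation is removed, because afterwards the @ marker is gone and a mention can no longer be told apart from an ordinary word.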

In [15]:
def remove_mentions(text):
    pattern = re.compile(r"@\w+")
    text = re.sub(pattern, "", text)
    return text

In [16]:
tweets_df["tweet_noMention"] = tweets_df["tweet_noEmail"].apply(remove_mentions)
tweets_df["tweet_noMention"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$& things i do for #gishwhes just ...
5448    police dt : rt : ûïthe col police can catch a...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noMention, dtype: object

Hashtags can be removed in a similar way, but in this competition they carry key information and are therefore kept.
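
For reference, a hashtag pattern analogous to the mention pattern might look like the sketch below. Both helper names are hypothetical. `extract_hashtags` shows one way to keep the hashtag words available as a separate feature, while `remove_hashtags` is the removal variant that this notebook deliberately does not apply.

```python
import re

# "#" followed by one or more word characters; the group captures the word itself.
HASHTAG_PATTERN = re.compile(r"#(\w+)")

def extract_hashtags(text):
    # Return the hashtag words without the leading "#".
    return HASHTAG_PATTERN.findall(text)

def remove_hashtags(text):
    # Drop the hashtags entirely (not used in this notebook's pipeline).
    return HASHTAG_PATTERN.sub("", text)

print(extract_hashtags("forest fire #wildfires near #alaska"))
```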

## Handling Emojis

Emojis are generally removed, but in disaster tweets they can carry information and therefore need to be handled properly.

I propose mapping emojis to the six basic emotions (happiness, sadness, anger, disgust, fear and surprise) plus a neutral state. Each emotion class can contain multiple emojis; for example, happiness can contain 😀 😃 😄 😁 😆 😅 😂 🤣

This step needs to be done before non-ASCII characters are removed in the next step, because emojis are Unicode characters and would otherwise be lost.

**I think I will wait for some discussion in the comments regarding this before implementing any approach for this, because there can be ambiguity in the use of emojis as well**
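
Purely as a starting point for that discussion, the proposed mapping could be sketched as follows. The table `EMOJI_TO_EMOTION` is hypothetical: apart from the happiness examples listed above, the emojis assigned to each class are assumptions made for illustration, not a validated lexicon.

```python
# Hypothetical mapping, for illustration only. Each value is a string of
# single-code-point emojis assigned (by assumption) to that emotion class.
EMOJI_TO_EMOTION = {
    "happiness": "😀😃😄😁😆😅😂🤣",
    "sadness": "😢😭😞",
    "anger": "😠😡",
    "fear": "😨😱",
    "surprise": "😮😲",
    "disgust": "🤢🤮",
    "neutral": "😐😶",
}

# Translation table: every emoji becomes its emotion word, padded with spaces.
EMOJI_TABLE = str.maketrans(
    {
        emoji: f" {emotion} "
        for emotion, emojis in EMOJI_TO_EMOTION.items()
        for emoji in emojis
    }
)

def replace_emojis(text):
    # Emojis not present in the table are left untouched.
    return text.translate(EMOJI_TABLE)

print(replace_emojis("the whole street is on fire 😱"))
```

The sketch ignores emojis made of several code points (skin-tone modifiers, joined sequences), which would need separate handling.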

## Remove Unicode Characters

In [17]:
def remove_unicode_chars(text):
    text = text.encode("ascii", "ignore").decode()
    return text

In [18]:
tweets_df["tweet_noUnicode"] = tweets_df["tweet_noMention"].apply(remove_unicode_chars)
tweets_df["tweet_noUnicode"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f$& things i do for #gishwhes just ...
5448    police dt : rt : the col police can catch a pi...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noUnicode, dtype: object
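
Encoding straight to ASCII drops accented letters completely, so a word such as "café" would become "caf". If that matters, one possible variation is to decompose characters first so that only the accent marks are discarded. This is a sketch of an alternative, not what the notebook does, and `fold_to_ascii` is a name chosen for the example.

```python
import unicodedata

def fold_to_ascii(text):
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # The ASCII encode step then drops the marks but keeps the base letters.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(fold_to_ascii("caf\u00e9 d\u00e9j\u00e0 vu"))
```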

## Abbreviation/Acronym Disambiguation
There are a large number of abbreviations and acronyms in the text. They can carry meaningful information for the classification task but might be removed or distorted by other preprocessing steps, so they need to be expanded early in the pipeline. @gunesevitan lists many of these abbreviations in his [notebook](https://www.kaggle.com/code/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert).

In [19]:
# Acronyms
def remove_abbreviations(text):
    text = re.sub(r"mh370", "missing malaysia airlines flight", text)
    text = re.sub(r"okwx", "oklahoma city weather", text)
    text = re.sub(r"arwx", "arkansas weather", text)    
    text = re.sub(r"gawx", "georgia weather", text)  
    text = re.sub(r"scwx", "south carolina weather", text)  
    text = re.sub(r"cawx", "california weather", text)
    text = re.sub(r"tnwx", "tennessee weather", text)
    text = re.sub(r"azwx", "arizona weather", text)  
    text = re.sub(r"alwx", "alabama weather", text)
    text = re.sub(r"wordpressdotcom", "wordpress", text)
    text = re.sub(r"usnwsgov", "united states national weather service", text)
    # Fixed: the original passed the undefined name `tweet` here.
    text = re.sub(r"suruc", "sanliurfa", text)
    return text

There are many more abbreviations in the dataset, and a more thorough check is required to find all of them.
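
As the list grows, a dictionary-driven version may be easier to maintain than a long chain of `re.sub` calls. The sketch below reuses three entries from the function above; the names `ABBREVIATIONS` and `expand_abbreviations` are hypothetical. It also adds word boundaries, which is an assumption about the desired behaviour: an abbreviation embedded in a longer word is then left alone.

```python
import re

# A few entries taken from the function above; extend as more are found.
ABBREVIATIONS = {
    "mh370": "missing malaysia airlines flight",
    "okwx": "oklahoma city weather",
    "cawx": "california weather",
}

# Single compiled pattern that matches any key as a whole word.
ABBREVIATION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in ABBREVIATIONS) + r")\b"
)

def expand_abbreviations(text):
    # Look up each matched abbreviation and substitute its expansion.
    return ABBREVIATION_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)

print(expand_abbreviations("#okwx storm near mh370 site"))
```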

## Remove Punctuation

In [20]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [21]:
def remove_punctuations(text):
    text = re.sub('[%s]' % re.escape(string.punctuation), " ",text)
    return text

In [22]:
tweets_df["tweet_noPuncts"] = tweets_df["tweet_noUnicode"].apply(remove_punctuations)
tweets_df["tweet_noPuncts"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f   things i do for  gishwhes just ...
5448    police dt   rt   the col police can catch a pi...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noPuncts, dtype: object

## Remove Digits or Words Containing Digits
This might not be appropriate in many cases. For example "MH370" mentioned in the tweets corresponds to Malaysia Airlines Flight 370 which went missing. In this case, keeping this number in the text might be useful in the disaster tweet classification.

In [23]:
def remove_digits(text):
    pattern = re.compile(r"\w*\d+\w*")  # raw string avoids invalid-escape warnings
    text = re.sub(pattern, "",text)
    return text

In [24]:
text = " m194 0104 utc5km s of volcano hawaii"
remove_digits(text)

'    s of volcano hawaii'

In [25]:
tweets_df["tweet_noDigits"] = tweets_df["tweet_noPuncts"].apply(remove_digits)
tweets_df["tweet_noDigits"].sample(5, random_state=42)

2644    destruction so you have a new weapon that can ...
2227    deluge the f   things i do for  gishwhes just ...
5448    police dt   rt   the col police can catch a pi...
132     aftershock aftershock back to school kick off ...
6845    trauma in response to trauma children of addic...
Name: tweet_noDigits, dtype: object
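
Given the caveat about tokens such as "MH370", one possible compromise is an allow-list of digit-bearing tokens that survive this step. The set `KEEP_TOKENS` and the function `remove_digits_except` are hypothetical names for this sketch; which tokens belong on the list would have to be decided from the data.

```python
import re

# Hypothetical allow-list of digit-bearing tokens worth keeping.
KEEP_TOKENS = {"mh370"}

DIGIT_TOKEN_PATTERN = re.compile(r"\w*\d+\w*")

def remove_digits_except(text, keep=KEEP_TOKENS):
    # Keep a matched token only if it is on the allow-list; otherwise drop it.
    return DIGIT_TOKEN_PATTERN.sub(
        lambda m: m.group(0) if m.group(0) in keep else "", text
    )

print(remove_digits_except(" mh370 0104 utc5km s of volcano hawaii"))
```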

## Remove Stopwords
Stopword removal is one of the fundamental preprocessing operations in many NLP tasks. I sometimes remove stopwords before removing punctuation, because many stopwords contain an apostrophe. Here, however, most of those stopwords have already been expanded during the contraction expansion step above.

In [26]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

{'their', 'from', 'further', 'where', 'myself', 'before', 'wouldn', 'during', 'me', 'yours', 'not', 'don', 'just', 'himself', 'now', 'they', 'with', 'this', 'very', 'what', 'few', 'she', 'if', 'these', 'couldn', 'too', "doesn't", 'own', "wouldn't", 'was', 'about', 'him', "won't", 'but', 'that', 'our', 'wasn', 'are', "needn't", "shan't", 'those', 'it', 'be', "you're", 't', 'does', 'so', "you'd", 'her', 'by', 'or', 'themselves', 'you', 'ma', 'both', 'mustn', 'yourselves', 'he', "wasn't", 'will', 'itself', 'how', 'we', 'his', 'because', 's', 'been', 'having', 'each', 'y', 'same', 'while', 'do', 'through', "you'll", 'here', 'only', 'have', 'than', 'whom', 'nor', 'ours', 'for', 'down', 'over', 'm', 'them', 'and', "couldn't", 'off', "that'll", 'ain', 'on', 'can', 'hadn', 'no', 'hasn', "mustn't", 'won', "weren't", 'isn', 'mightn', 'once', "didn't", 'to', 'being', 'theirs', 'out', 'shan', 'why', "aren't", 'needn', 'haven', "haven't", 'its', 'did', 'up', 'ourselves', "she's", 'yourself', 'shoul

In [27]:
def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in stop_words])

In [28]:
tweets_df["tweet_noStopwords"] = tweets_df["tweet_noDigits"].apply(remove_stopwords)
tweets_df["tweet_noStopwords"].sample(5, random_state=42)

2644    destruction new weapon cause un imaginable des...
2227    deluge f things gishwhes got soaked deluge goi...
5448    police dt rt col police catch pickpocket liver...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma children addicts develo...
Name: tweet_noStopwords, dtype: object

## Removing Extra Spaces
In this case, removing stopwords already splits the text on whitespace and rejoins it with single spaces, which removes extra spaces. We can still run the following code to be sure.

In [29]:
def remove_extra_spaces(text):
    text = re.sub(' +', ' ', text).strip()
    return text

In [30]:
tweets_df["tweet_noExtraspace"] = tweets_df["tweet_noStopwords"].apply(remove_extra_spaces)
tweets_df["tweet_noExtraspace"].sample(5, random_state=42)

2644    destruction new weapon cause un imaginable des...
2227    deluge f things gishwhes got soaked deluge goi...
5448    police dt rt col police catch pickpocket liver...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma children addicts develo...
Name: tweet_noExtraspace, dtype: object

## Stemming or Lemmatization
I generally prefer lemmatization over stemming because lemmatization returns valid dictionary words.

In [31]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    words = [lemmatizer.lemmatize(word) for word in text.split()]
    text = ' '.join(words)
    return text

In [32]:
tweets_df["tweet_lemmatised"] = tweets_df["tweet_noExtraspace"].apply(lemmatize_text)
tweets_df["tweet_lemmatised"].sample(5, random_state=42)

2644    destruction new weapon cause un imaginable des...
2227    deluge f thing gishwhes got soaked deluge goin...
5448    police dt rt col police catch pickpocket liver...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma child addict develop de...
Name: tweet_lemmatised, dtype: object

## Spelling Correction
Spelling correction can help with tweet classification because tweets are particularly susceptible to misspellings, deliberate or otherwise. There are a few options, such as the spell checker from TextBlob and Symspellpy (a Python port of SymSpell). TextBlob is prohibitively slow, while Symspellpy is very fast and accurate. It is also language agnostic when a suitable dictionary is supplied, so it is used here.

In [33]:
!pip install symspellpy
import pkg_resources
from symspellpy import SymSpell, Verbosity

Collecting symspellpy
  Downloading symspellpy-6.7.7-py3-none-any.whl (2.6 MB)
Collecting editdistpy>=0.1.3
  Downloading editdistpy-0.1.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (125 kB)
Installing collected packages: editdistpy, symspellpy
Successfully installed editdistpy-0.1.3 symspellpy-6.7.7

Symspellpy returns multiple suggestions for each word. With `Verbosity.CLOSEST`, the suggestions have the smallest edit distance found and are ordered by term frequency, so we select the first one.

In [34]:
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

True

In [35]:
def correct_spelling_symspell(text):
    words = [
        sym_spell.lookup(
            word, 
            Verbosity.CLOSEST, 
            max_edit_distance=2,
            include_unknown=True
            )[0].term 
        for word in text.split()] 
    text = " ".join(words)
    return text

The `include_unknown` option returns the original word unchanged when no dictionary word lies within `max_edit_distance` of it.
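
The lookup behaviour described above can be illustrated with a toy version: take the closest dictionary term within the allowed distance, break ties by frequency, and fall back to the original word otherwise. This sketch uses plain Levenshtein distance over a made-up three-word dictionary (`TOY_FREQUENCIES`). It is not SymSpell's algorithm, which is far faster, and all names in it are hypothetical.

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]

# Made-up frequency dictionary, for illustration only.
TOY_FREQUENCIES = {"earthquake": 120, "evacuation": 80, "wildfire": 60}

def closest_or_unknown(word, max_edit_distance=2):
    # Rank candidates by distance first, then by higher frequency.
    distance, _, term = min(
        (edit_distance(word, term), -count, term)
        for term, count in TOY_FREQUENCIES.items()
    )
    # Mirror include_unknown=True: keep the word if nothing is close enough.
    return term if distance <= max_edit_distance else word

print(closest_or_unknown("earthqake"), closest_or_unknown("gishwhes"))
```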

In [36]:
tweets_df["tweet_spellcheck"] = tweets_df["tweet_lemmatised"].apply(correct_spelling_symspell)
tweets_df["tweet_spellcheck"].sample(5, random_state=42)

2644    destruction new weapon cause in imaginable des...
2227    deluge of thing gishwhes got soaked deluge goi...
5448    police it it col police catch pickpocket liver...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma child addict develop de...
Name: tweet_spellcheck, dtype: object

The correction is not perfect: it turns some tokens into stopwords (for example "un" becomes "in" and "rt" becomes "it"), but it can help in many cases. More investigation against the competition results is required.

## Correcting Compounded Words
(Mostly in Hashtags)

In [37]:
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

True

In [38]:
def correct_spelling_symspell_compound(text):
    words = [
        sym_spell.lookup_compound(
            word, 
            max_edit_distance=2
            )[0].term 
        for word in text.split()] 
    text = " ".join(words)
    return text

In [39]:
text = "IranDeal PantherAttack TrapMusic StrategicPatience socialnews NASAHurricane onlinecommunities humanconsumption"
correct_spelling_symspell_compound(text)

'iran deal panther attack trap music strategic patience social news as hurricane online communities human consumption'

In [40]:
tweets_df["tweet_spellcheck_compound"] = tweets_df["tweet_spellcheck"].apply(correct_spelling_symspell_compound)
tweets_df["tweet_spellcheck_compound"].sample(5, random_state=42)

2644    destruction new weapon cause in imaginable des...
2227    deluge of thing gish hes got soaked deluge goi...
5448    police it it col police catch pickpocket liver...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma child addict develop de...
Name: tweet_spellcheck_compound, dtype: object
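
In the example above, "NASAHurricane" comes out as "as hurricane", so the acronym is lost. When the original casing is still available, a regex-based camel-case splitter is one possible complement. This is a sketch under the assumption that it runs before the lower-casing step, which is not how the notebook is currently ordered, and `split_camel_case` is a name chosen for the example.

```python
import re

# Alternatives, tried in order:
#   1. a run of capitals followed by a capitalised word ("NASA" in "NASAHurricane")
#   2. an optionally capitalised lowercase word ("Hurricane", "news")
#   3. any remaining run of capitals
#   4. a run of digits
CAMEL_PATTERN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")

def split_camel_case(token):
    # Join the pieces with spaces and lower-case the result.
    return " ".join(CAMEL_PATTERN.findall(token)).lower()

print(split_camel_case("NASAHurricane"), "|", split_camel_case("IranDeal"))
```

Tokens without any capitalisation cues, such as "socialnews", pass through unchanged and would still need the dictionary-based approach.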

## Final Stopword Removal
The spell-checking steps above introduce a few new stopwords into the data, so one final stopword removal step is required.

In [41]:
tweets_df["tweet_final"] = tweets_df["tweet_spellcheck_compound"].apply(remove_stopwords)
tweets_df["tweet_final"].sample(5, random_state=42)

2644    destruction new weapon cause imaginable destru...
2227    deluge thing gish hes got soaked deluge going ...
5448    police col police catch pickpocket liverpool s...
132     aftershock aftershock back school kick great w...
6845    trauma response trauma child addict develop de...
Name: tweet_final, dtype: object

The proper sequence of these operations still needs to be determined to make the preprocessing more effective.
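
One way to make the ordering easy to experiment with is to express the pipeline as an ordered list of functions and apply them in turn. The sketch below uses three small stand-in steps so that it runs on its own; `make_pipeline` and the step names are hypothetical, and in practice the functions defined in this notebook would be passed in instead.

```python
import re
from functools import reduce

# Small stand-in steps so the sketch is self-contained.
def to_lower(text):
    return text.lower()

def strip_mentions(text):
    return re.sub(r"@\w+", "", text)

def squeeze_spaces(text):
    return re.sub(r"\s+", " ", text).strip()

def make_pipeline(*steps):
    # Return a function that feeds the text through each step in order.
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)

clean = make_pipeline(to_lower, strip_mentions, squeeze_spaces)
print(clean("Fire near @BBCNews  HQ"))
```

Reordering the arguments to `make_pipeline` is then enough to compare different sequences on the same data.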

In [42]:
tweets_df.to_csv("distaster_tweets_cleaned.csv")