In [1]:
!pip install symspellpy
import pkg_resources
from symspellpy import SymSpell, Verbosity

import pandas as pd

Collecting symspellpy
  Downloading symspellpy-6.7.7-py3-none-any.whl (2.6 MB)
Collecting editdistpy>=0.1.3
  Downloading editdistpy-0.1.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (125 kB)
Installing collected packages: editdistpy, symspellpy
Successfully installed editdistpy-0.1.3 symspellpy-6.7.7

The approach used here is to first find all possible abbreviations using a spell checker: if a word is not in the dictionary, it might be an abbreviation or acronym. These abbreviations can then be expanded to their full forms to add more information to the tweet. Let's first import the dictionary and create the spell checker instance. The SymSpell spell checker is used here because it is very fast and relatively accurate. When a word is not in the dictionary, `lookup` (called with `include_unknown=False`) returns an empty list of suggestions, which is how unknown words are detected below. Both `max_dictionary_edit_distance` and `max_edit_distance` are set to zero to switch off spelling correction, so that as many abbreviations as possible are found. Another reason is that abbreviations are generally only 2 to 4 letters long, so with a `max_edit_distance` of 2 or even 1 the spell checker would often find a dictionary word within that distance of the abbreviation.

In [2]:
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

True
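
To make the edit-distance argument concrete, here is a small self-contained sketch. It does not use symspellpy: `edit_distance` is a plain Levenshtein implementation (SymSpell itself uses a Damerau-Levenshtein variant), and `toy_dictionary` is a hypothetical four-word stand-in for the real frequency dictionary. The abbreviations are taken from the candidate list found later in this notebook.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), for illustration only."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def near_matches(word, dictionary, max_distance):
    """Dictionary words within max_distance edits of word."""
    return sorted(w for w in dictionary if edit_distance(word, w) <= max_distance)


# Hypothetical mini-dictionary; the real one has roughly 82k entries.
toy_dictionary = {"as", "fox", "real", "bus"}

for abbreviation in ["asf", "fx", "ral", "bcs"]:
    exact = near_matches(abbreviation, toy_dictionary, 0)
    fuzzy = near_matches(abbreviation, toy_dictionary, 1)
    print(f"{abbreviation}: distance 0 -> {exact}, distance 1 -> {fuzzy}")
```

With a distance of 0 none of the abbreviations match anything, so they would all be flagged as candidates. With a distance of 1 each of them is "corrected" to an unrelated dictionary word and would be missed.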

## Dataset 

In [3]:
tweets_df = pd.read_csv("/kaggle/input/disaster-tweets-dataset-preprocessed-nlp/distaster_tweets_cleaned.csv")
tweets_df.sample(5, random_state=0)

Unnamed: 0.1,Unnamed: 0,id,keyword,location,text,target,tweet,tweet_lower,tweet_noHTML,tweet_noContractions,...,tweet_noMention,tweet_noUnicode,tweet_noPuncts,tweet_noDigits,tweet_noStopwords,tweet_noExtraspace,tweet_lemmatised,tweet_spellcheck,tweet_spellcheck_compound,tweet_final
311,311,454,armageddon,Wrigley Field,@KatieKatCubs you already know how this shit g...,0,armageddon @KatieKatCubs you already know how ...,armageddon @katiekatcubs you already know how ...,armageddon @katiekatcubs you already know how ...,armageddon @katiekatcubs you already know how ...,...,armageddon you already know how this shit goe...,armageddon you already know how this shit goe...,armageddon you already know how this shit goe...,armageddon you already know how this shit goe...,armageddon already know shit goes world series...,armageddon already know shit goes world series...,armageddon already know shit go world series a...,armageddon already know shit go world series a...,armageddon already know shit go world series a...,armageddon already know shit go world series a...
4970,4970,7086,meltdown,Two Up Two Down,@LeMaireLee @danharmon People Near Meltdown Co...,0,meltdown @LeMaireLee @danharmon People Near Me...,meltdown @lemairelee @danharmon people near me...,meltdown @lemairelee @danharmon people near me...,meltdown @lemairelee @danharmon people near me...,...,meltdown people near meltdown comics who hav...,meltdown people near meltdown comics who hav...,meltdown people near meltdown comics who hav...,meltdown people near meltdown comics who hav...,meltdown people near meltdown comics free time...,meltdown people near meltdown comics free time...,meltdown people near meltdown comic free time ...,meltdown people near meltdown comic free time ...,meltdown people near meltdown comic free time ...,meltdown people near meltdown comic free time ...
527,527,762,avalanche,Score Team Goals Buying @,1-6 TIX Calgary Flames vs COL Avalanche Presea...,0,avalanche 1-6 TIX Calgary Flames vs COL Avalan...,avalanche 1-6 tix calgary flames vs col avalan...,avalanche 1-6 tix calgary flames vs col avalan...,avalanche 1-6 tix calgary flames vs col avalan...,...,avalanche 1-6 tix calgary flames vs col avalan...,avalanche 1-6 tix calgary flames vs col avalan...,avalanche 1 6 tix calgary flames vs col avalan...,avalanche tix calgary flames vs col avalanch...,avalanche tix calgary flames vs col avalanche ...,avalanche tix calgary flames vs col avalanche ...,avalanche tix calgary flame v col avalanche pr...,avalanche tax calgary flame a col avalanche pr...,avalanche tax calgary flame a col avalanche pr...,avalanche tax calgary flame col avalanche pres...
6362,6362,9094,suicide%20bomb,Roadside,If you ever think you running out of choices i...,0,suicide%20bomb If you ever think you running o...,suicide%20bomb if you ever think you running o...,suicide%20bomb if you ever think you running o...,suicide%20bomb if you ever think you running o...,...,suicide%20bomb if you ever think you running o...,suicide%20bomb if you ever think you running o...,suicide 20bomb if you ever think you running o...,suicide if you ever think you running out of ...,suicide ever think running choices life rembr ...,suicide ever think running choices life rembr ...,suicide ever think running choice life rembr k...,suicide ever think running choice life member ...,suicide ever think running choice life member ...,suicide ever think running choice life member ...
800,800,1160,blight,Laventillemoorings,If you dotish to blight your car go right ahea...,0,blight If you dotish to blight your car go rig...,blight if you dotish to blight your car go rig...,blight if you dotish to blight your car go rig...,blight if you dotish to blight your car go rig...,...,blight if you dotish to blight your car go rig...,blight if you dotish to blight your car go rig...,blight if you dotish to blight your car go rig...,blight if you dotish to blight your car go rig...,blight dotish blight car go right ahead mine,blight dotish blight car go right ahead mine,blight dotish blight car go right ahead mine,blight dovish blight car go right ahead mine,blight dovish blight car go right ahead mine,blight dovish blight car go right ahead mine


The dataset contains the tweets at different stages of preprocessing, created using [this notebook](https://www.kaggle.com/code/rohitgarud/all-almost-data-preprocessing-techniques-for-nlp#Spelling-Correction). We now have to decide after which preprocessing step the abbreviation search should run. I think the stage after removing punctuation and before removing digits is a good place to start. Data at this stage is fairly clean but still contains digits, so abbreviations like MH370 (missing Malaysia Airlines flight) can be found.

In [4]:
tweets_df["tweet_noPuncts"].sample(5, random_state=42).values

array(['destruction so you have a new weapon that can cause un imaginable destruction ',
       'deluge the f   things i do for  gishwhes just got soaked in a deluge going for pads and tampons  thanks     ',
       'police dt   rt   the col police can catch a pickpocket in liverpool stree    ',
       'aftershock aftershock back to school kick off was great  i want to thank everyone for making it possible  what a great night ',
       'trauma in response to trauma children of addicts develop a defensive self   one that decreases vulnerability   3'],
      dtype=object)
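
The reason for stopping before digit removal can be illustrated with a short sketch. The `remove_digits` helper below is an assumption: it simply deletes digit characters and may not match the exact rule used in the preprocessing notebook. The tokens are taken from the candidate list found later in this notebook.

```python
import re


def remove_digits(text: str) -> str:
    """Delete every digit character (a stand-in for the digit-removal step)."""
    return re.sub(r"\d", "", text)


for token in ["mh370", "ny1", "us70", "2k15"]:
    print(token, "->", remove_digits(token))
```

After digit removal `mh370` is reduced to `mh` and `us70` to `us`, so the original abbreviation can no longer be recognised or expanded.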

Now let's combine all the text and look for words that are not in the imported standard dictionary.

In [5]:
all_text = " ".join(tweets_df["tweet_noPuncts"].values)
all_text[:1000]

'our deeds are the reason of this  earthquake may allah forgive us all forest fire near la ronge sask  canada all residents asked to  shelter in place  are being notified by officers  no other evacuation or shelter in place orders are expected 13 000 people receive  wildfires evacuation orders in california  just got sent this photo from ruby  alaska as smoke from  wildfires pours into a school   rockyfire update    california hwy  20 closed in both directions due to lake county fire    cafire  wildfires  flood  disaster heavy rain causes flash flooding of streets in manitou  colorado springs areas i am on top of the hill and i can see a fire in the woods    there is an emergency evacuation happening now in the building across the street i am afraid that the tornado is coming to our area    three people died from the heat wave so far haha south tampa is getting flooded hah  wait a second i live in south tampa what am i going to do what am i going to do fvck  flooding  raining  flooding

In [6]:
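possible_abbreviations = []
for word in all_text.split():
    # With max_edit_distance=0, only exact dictionary matches are returned
    suggestions = sym_spell.lookup(word,
                                   Verbosity.CLOSEST,
                                   max_edit_distance=0,
                                   include_unknown=False)
    # An empty result means the word is not in the dictionary
    if not suggestions:
        possible_abbreviations.append(word)
print(set(possible_abbreviations))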

{'asf', 'ny1', 'ryans', '158', 'tpanic', '1965', 'realestate', 'maca', 'infact', 'idfire', 'aul', 'idgaf', 'wsls', 'kororinpa', 'orangi', 'bestival', 'entretenimento', 'alwx', '573', 'boyhaus', 'govegan', 'saddlebrooke', '40mln', '300', 'whitehouse', 'iredell', 'exofficio', 'tornados', '28700', 'selfavowed', 'longaberger', '2k15', 'avigdorliberman', 'lembra', 'skardu', 'chandanee', '4playthursdays', 'tweeting', 'camila', 'fingerrockfire', 'haha', 'film4', 'greyjoys', 'watchin', 'fx', 'backtoback', 'pnpizody', 'mydrought', 'kfc', 'dupree', 'sl', 'peritoengrafoscopia', 'identitytheft', 'nsf', 'workd', 'kamloops', 't', 'bcs', 'suggs', 'kro', 'gigant', 'soloquiero', 'ral', 'protoshoggoth', 'cydia', 'smh', '45600', 'bachmann', 'preppers', 'offroad', 'smp', 'lolla', 'jtw', 'movietheatre', 'morty', 'ands', 'rijn', '20', 'pft', 'enviromental', 'jewelry', 'indah', 'marketingmediocrity', 'meelllttting', 'jacinta', 'aks', 'hastle', 'lcc', '2b', 'feelin', 'africansinsf', '420', 'us70', 'notgoingou

In [7]:
abb_df = pd.Series(possible_abbreviations)
abb_df.value_counts()

s           499
2           214
3           136
1           115
rt          111
           ... 
kikes         1
cluei         1
yiayplan      1
amiibos       1
forney        1
Length: 4787, dtype: int64

There are many numbers in this list, which need to be removed. Also, an abbreviation has to be at least 2 letters long, so single-character words should be removed as well.

In [8]:
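possible_abbreviations_noIntegers = []
for word in possible_abbreviations:
    # Skip single-character tokens
    if len(word) > 1:
        try:
            int(word)  # pure integers are not abbreviations
        except ValueError:
            # Not an integer, so keep it as a candidate
            possible_abbreviations_noIntegers.append(word)
print(set(possible_abbreviations_noIntegers))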

{'asf', 'ny1', 'ryans', 'tpanic', 'realestate', 'maca', 'infact', 'idfire', 'aul', 'idgaf', 'wsls', 'kororinpa', 'orangi', 'bestival', 'entretenimento', 'alwx', 'boyhaus', 'govegan', 'saddlebrooke', '40mln', 'whitehouse', 'iredell', 'exofficio', 'tornados', 'selfavowed', 'longaberger', '2k15', 'avigdorliberman', 'lembra', 'skardu', 'chandanee', '4playthursdays', 'tweeting', 'camila', 'fingerrockfire', 'haha', 'film4', 'greyjoys', 'watchin', 'fx', 'backtoback', 'pnpizody', 'mydrought', 'kfc', 'dupree', 'sl', 'peritoengrafoscopia', 'identitytheft', 'nsf', 'workd', 'kamloops', 'bcs', 'suggs', 'kro', 'gigant', 'soloquiero', 'ral', 'protoshoggoth', 'cydia', 'smh', 'bachmann', 'preppers', 'offroad', 'smp', 'lolla', 'jtw', 'movietheatre', 'morty', 'ands', 'rijn', 'pft', 'enviromental', 'jewelry', 'indah', 'marketingmediocrity', 'meelllttting', 'jacinta', 'aks', 'hastle', 'lcc', '2b', 'feelin', 'africansinsf', 'us70', 'notgoingoutinthat', 'difficultpeople', 'tgirl', 'nffc', 'nitclub', 'harman'

In [9]:
len(set(possible_abbreviations_noIntegers))

4408

In [10]:
abb_df = pd.Series(possible_abbreviations_noIntegers)
abb_df.value_counts()[:30]

rt              111
20fires          88
airplane         73
mh370            72
lol              71
bioterror        70
20storm          69
20disaster       68
20up             66
reddit           63
bioterrorism     58
20fire           52
20emergency      42
20bags           41
ok               38
20spill          38
20buildings      37
20fall           36
20reactor        36
oh               35
20collapse       35
20failure        35
20bomb           35
20accident       35
20burning        35
20plan           35
20wave           34
20bang           34
20on             33
20services       33
dtype: int64

It can be observed that mh370 occurs 72 times across the tweets and can be replaced by the expanded version "missing malaysia airlines flight". "rt" most probably stands for retweet. Many entries are dictionary words prefixed with 20, apparently left over from `%20` in the keyword column (for example, `suicide%20bomb` becomes `suicide 20bomb` once punctuation is removed); these will be removed in later steps, after segmentation. Many of the other words are compound words that can be broken into dictionary words, and symspellpy's `lookup_compound` offers a way to segment them.
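
Before moving on to segmentation, the eventual expansion step for the two abbreviations discussed above could look roughly like the sketch below. The `ABBREVIATION_EXPANSIONS` table and the `expand_abbreviations` helper are hypothetical names introduced for illustration, and the table holds only the two expansions mentioned in the text ("retweet" for rt being a guess, as noted).

```python
import re

# Hypothetical expansion table; only the two entries discussed above.
ABBREVIATION_EXPANSIONS = {
    "mh370": "missing malaysia airlines flight",
    "rt": "retweet",
}


def expand_abbreviations(text, expansions=ABBREVIATION_EXPANSIONS):
    """Replace whole-word abbreviations with their expanded forms."""
    if not expansions:
        return text
    # \b ensures that "rt" inside a longer word such as "art" is left alone.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, expansions)) + r")\b")
    return pattern.sub(lambda match: expansions[match.group(1)], text)


print(expand_abbreviations("rt debris found may be from mh370"))
```

Matching on word boundaries matters here because the abbreviations are short and would otherwise be replaced inside ordinary words.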

In [11]:
possible_abbreviations_segmented = []
for word in set(possible_abbreviations_noIntegers):
    result  = sym_spell.lookup_compound(word, max_edit_distance=0)[0].term
    possible_abbreviations_segmented.append(result)

print(set(possible_abbreviations_segmented))

{'asf', 'ny1', 'ryans', 'tpanic', 'flat liners', 'pine view', 'no emotion', 'bell erin', 'kororinpa', 'wsls', 'aul', 'idgaf', 'bestival', 'entretenimento', 'alwx', 'boyhaus', '40mln', 'exofficio', 'tornados', 'longaberger', '2k15', 'avigdorliberman', 'skardu', '4playthursdays', 'chandanee', 'thor gan', 'fx', 'saddle brooke', 'fingerrockfire', 'bangladesh affected', 'film4', 'backtoback', 'pnpizody', 'talk radio', 'kfc', 'dupree', 'sl', 'peritoengrafoscopia', 'planned parenthood', 'yuk i', 'dumb ass', 'nsf', 'workd', 'fins up', 'kamloops', 'bcs', 'no i', 'suggs', 'kro', 'soloquiero', 'ral', 'protoshoggoth', 'gym time', 'cydia', 'smh', 'preppers', 'smp', 'court of', 'jtw', 'ands', 'morty', 'rijn', 'speed tech', 'pft', 'enviromental', 'jewelry', 'end conflict', 'home buyer', 'meelllttting', 'social news', 'jacinta', 'aks', 'or pol', 'hastle', 'lcc', '2b', 'africansinsf', 'us70', 'pauli sta', 'notgoingoutinthat', 'smugglers nabbed', 'tgirl', 'nffc', 'ak', 'ssu', 'bronville', 'nema', 'adhd'

In [12]:
abb_df1 = pd.Series(possible_abbreviations_segmented)
abb_df1.value_counts()[:30]

asf              1
magichairbump    1
martinmj22       1
ler              1
sometimes i      1
jet star         1
blood bound      1
twenty nine      1
inst agram       1
20fire           1
yonews           1
pantofel         1
ahahahga         1
pjnet            1
lzk              1
ddos             1
nycha            1
100nd            1
music video      1
nsw              1
ridah            1
theyd            1
47km             1
dougkessler      1
totoooooooooo    1
nike plus        1
lesnar           1
offr             1
1008planet       1
guimaras         1
dtype: int64

This produces a large number of multi-word segmentations, so the results need to be refined and filtered.
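
One possible first filter, shown only as a sketch and not necessarily the approach taken later, is to separate the terms that stayed a single token from those that were split into several words. The `split_by_segmentation` helper is a hypothetical name, and the sample list is a handful of entries copied from the output above.

```python
def split_by_segmentation(terms):
    """Separate single-token terms from terms that were split into several words."""
    single, multi = [], []
    for term in terms:
        (multi if " " in term else single).append(term)
    return single, multi


# A few entries copied from the segmented output above.
segmented = ["asf", "flat liners", "pine view", "kfc", "talk radio", "smh", "nsf"]

candidates, compounds = split_by_segmentation(segmented)
print(candidates)  # terms still worth checking as abbreviations
print(compounds)   # terms that were broken into several words
```

The single-token list still contains names and misspellings, so further filtering would be needed before treating its entries as abbreviations.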