# Feature Extraction
In this step, we convert the raw text into numerical features for analysis. Both the keywords and the tweet text need to be converted. Let's start with the keywords.

In [1]:
import pandas as pd
import numpy as np

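train = pd.read_csv('../downloads/train.csv')
# NOTE: this reads the same file as `train`, so every "test" score below is a training-set score.
test = pd.read_csv('../downloads/train.csv')

# Replace missing keywords and locations with empty strings.
# Assigning back avoids chained in-place modification, which newer pandas versions warn about.
for df in (train, test):
    df['keyword'] = df['keyword'].fillna('')
    df['location'] = df['location'].fillna('')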

train.sample(5)

Unnamed: 0,id,keyword,location,text,target
1259,1815,buildings%20on%20fire,"Boston, MA",Three-alarm fire destroys two residential buil...,1
5328,7607,pandemonium,,I'll be at SFA very soon....#Pandemonium http:...,1
4601,6543,injury,"Sacramento, CA",Traffic Collision - No Injury: I5 S at I5 S 43...,1
330,477,armageddon,California,Check out #PREPPERS #DOOMSDAY MUST HAVE LIBRAR...,0
6215,8866,smoke,WORLDWI$E,I smoke toooooo much lmao I was scared to text...,0


In [2]:
from nltk import TweetTokenizer
tt = TweetTokenizer(strip_handles=True, reduce_len=True)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(tokenizer=tt.tokenize)
cv.fit(train['text'])

# Vectorize keywords and tweets.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
feature_names = cv.get_feature_names_out()

# First the training set
train_keywords = pd.get_dummies(train.keyword, prefix='KW')
train_tweets = pd.DataFrame(cv.transform(train['text']).toarray(), columns=feature_names)
X_train = pd.concat([train_keywords, train_tweets], axis=1)
y_train = train.target

# Then the test set, reusing the vocabulary fitted on the training text
test_keywords = pd.get_dummies(test.keyword, prefix='KW')
test_tweets = pd.DataFrame(cv.transform(test['text']).toarray(), columns=feature_names)
X_test = pd.concat([test_keywords, test_tweets], axis=1)
y_test = test.target



# Modelling
## Ridge Classifier
Let's start with a very simple model, a ridge classifier, and see how well it classifies the tweets.

In [3]:
from sklearn.linear_model import RidgeClassifier

model = RidgeClassifier()
model.fit(X_train,y_train)

RidgeClassifier()

In [4]:
y_pred = model.predict(X_test)

from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      4342
           1       1.00      0.99      0.99      3271

    accuracy                           0.99      7613
   macro avg       0.99      0.99      0.99      7613
weighted avg       0.99      0.99      0.99      7613



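Ahh, nice! Nearly perfect precision and recall, and with default parameters. Before celebrating, though, notice that `test` was read from the same file as `train` (the support of 7613 is the full training set), so these are training-set scores. They show that a linear model with one column per distinct token can fit the labels almost perfectly, not that it will generalize to unseen tweets. The data clearly contain useful information for classifying tweets, which makes sense: they were labeled by human readers who looked at the same text, and who may have selected tweets they were confident in classifying. How much of that carries over to new tweets can only be judged on held-out data.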
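
A minimal sketch of a held-out evaluation, assuming a toy corpus in place of `train['text']` and `train['target']` (the texts, labels, and the names `text_fit`, `text_val`, and `val_accuracy` are illustrative and not part of the notebook). Scores computed on rows the model was fitted on tend to be optimistic, so the sketch sets aside a quarter of the rows before fitting:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for train['text'] and train['target'].
texts = [
    "forest fire near the town",
    "evacuation ordered after flood",
    "earthquake damage reported downtown",
    "wildfire spreads across the hills",
    "bridge collapse injures several people",
    "storm leaves thousands without power",
    "i love this song so much",
    "my car is so fast",
    "what a wonderful day",
    "this pizza is the bomb",
    "summer is lovely",
    "on my way to the beach",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hold out 25% of the rows; stratify keeps both classes in each part.
text_fit, text_val, y_fit, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Fit the vocabulary on the fitting rows only, then reuse it for validation.
vec = CountVectorizer()
clf = RidgeClassifier()
clf.fit(vec.fit_transform(text_fit), y_fit)

val_accuracy = clf.score(vec.transform(text_val), y_val)
print(f"held-out accuracy: {val_accuracy:.2f}")
```

On the real data the same pattern would apply to `train['text']` and `train['target']`; the toy corpus is far too small for its score to mean anything.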

### Dropping the keyword data

Let's see whether the results still look as good without the keyword features.

In [5]:
model2 = RidgeClassifier()
model2.fit(cv.transform(train['text']).toarray(),y_train)
y_pred = model2.predict(cv.transform(test['text']).toarray())
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      4342
           1       1.00      0.99      0.99      3271

    accuracy                           0.99      7613
   macro avg       0.99      0.99      0.99      7613
weighted avg       0.99      0.99      0.99      7613



The numbers are identical, so the keywords clearly don't have a drastic effect on classification performance.

# Unsupervised methods

Now we can ask: why does it work so well? Unsupervised methods let us look for patterns in the data without using the labels.

Can we turn the problem around and predict the keyword from the tweet text? If we did a topic analysis, would the topics map to keywords?
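
One way to check whether topics line up with keywords is to assign each tweet to its strongest topic and cross-tabulate that against the keyword column. The sketch below does this on a toy corpus; the texts, keywords, and variable names are hypothetical stand-ins for `train['text']` and `train['keyword']`, and the keyword columns are deliberately left out of the topic model so that they cannot leak into the topics.

```python
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for train['text'] and train['keyword'].
texts = [
    "wildfire burning homes in california",
    "california wildfire forces evacuation",
    "firefighters battle wildfire near homes",
    "flood waters rise in the city",
    "heavy rain causes flood in city streets",
    "flood warning issued for the river",
]
keywords = ["wildfire"] * 3 + ["flood"] * 3

# Bag-of-words counts from the text only (no keyword columns).
counts = CountVectorizer().fit_transform(texts)

# Factorize into two topics and take each tweet's strongest topic.
topic_model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
doc_topic = topic_model.fit_transform(counts)
dominant_topic = doc_topic.argmax(axis=1)

# Rows are keywords, columns are topics; a near-diagonal table would
# suggest that topics and keywords carry similar information.
table = pd.crosstab(pd.Series(keywords, name="keyword"),
                    pd.Series(dominant_topic, name="topic"))
print(table)
```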

In [6]:
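from sklearn.decomposition import NMF, LatentDirichletAllocation

n_components = 10  # matches the 10 topics printed below
n_top_words = 20

# Note: `alpha` was split into `alpha_W`/`alpha_H` in scikit-learn 1.0 and removed in 1.2.
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5)

# NMF is unsupervised, so no target is passed.
nmf.fit(X_train)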


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
    
print_top_words(nmf,X_train.columns,n_top_words)



Topic #0: the that by was are with world has about at after still not last first more it's out we years
Topic #1: ? # why follow so la airlines missing aircraft debris reunion _ found no u mh370 crush malaysia KW_crush this
Topic #2: . with s it will not was & no u be p we they have it's o he KW_detonate m
Topic #3: : rt 2015-08- û_ california from as at 05 [ ] pm news s utc train police over ( 3
Topic #4: a like was by at up get this but video with from what just after watch that under sandstorm KW_sandstorm
Topic #5: ' from as are families wreckage by KW_wreckage confirmed who conclusively | were those mh370 malaysia ) it's video rescuers
Topic #6: ! be & all out what please we news from wind ass check day KW_loud%20bang bang just loud hey unconfirmed
Topic #7: to be with going have or get make want go how as do it out over not we back so
Topic #8: in killed suicide / crash people bomber who police up fire KW_hostages accident land two hostages released as are bomb
Topic #9: i it was

In [7]:
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(X_train)  # LDA is unsupervised; the target is not needed
print_top_words(lda, X_train.columns, n_top_words)

Topic #0: ' : more california by storm ; - here wildfire over than oil latest the northern KW_wildfire outbreak KW_razed damage
Topic #1: KW_windstorm windstorm tornado KW_mudslide mudslide KW_tornado fatalities KW_fatalities rules siren demolished KW_demolished crushed KW_crushed mph upheaval KW_upheaval super KW_siren nice
Topic #2: : ? ... in from of # mh370 on legionnaires - wreckage confirmed malaysia KW_wreckage is the found police missing
Topic #3: homes > collapse natural $ released KW_nuclear%20reactor KW_natural%20disaster meltdown KW_meltdown calgary reactor bridge KW_detonation detonation yourself bags KW_collapse KW_body%20bags gems
Topic #4:  : - û_ in for ... of disaster emergency to crash û ûªs full #news obama by a +
Topic #5: : - / ( ) the ... of at a . video in via s on pm fire by from
Topic #6: * screaming reddit lightning ruin KW_ruin KW_screaming screamed KW_screamed KW_quarantine quarantine KW_screams sirens KW_lightning spill screams offensive projected KW_sire

In [8]:
for col in X_train.columns:
    print(col,end=' ')

KW_ KW_ablaze KW_accident KW_aftershock KW_airplane%20accident KW_ambulance KW_annihilated KW_annihilation KW_apocalypse KW_armageddon KW_army KW_arson KW_arsonist KW_attack KW_attacked KW_avalanche KW_battle KW_bioterror KW_bioterrorism KW_blaze KW_blazing KW_bleeding KW_blew%20up KW_blight KW_blizzard KW_blood KW_bloody KW_blown%20up KW_body%20bag KW_body%20bagging KW_body%20bags KW_bomb KW_bombed KW_bombing KW_bridge%20collapse KW_buildings%20burning KW_buildings%20on%20fire KW_burned KW_burning KW_burning%20buildings KW_bush%20fires KW_casualties KW_casualty KW_catastrophe KW_catastrophic KW_chemical%20emergency KW_cliff%20fall KW_collapse KW_collapsed KW_collide KW_collided KW_collision KW_crash KW_crashed KW_crush KW_crushed KW_curfew KW_cyclone KW_damage KW_danger KW_dead KW_death KW_deaths KW_debris KW_deluge KW_deluged KW_demolish KW_demolished KW_demolition KW_derail KW_derailed KW_derailment KW_desolate KW_desolation KW_destroy KW_destroyed KW_destruction KW_detonate KW_deto

#np #nra #nri #ns #nsfw #nuclear #nuclearweapons #nude #nuke #nursing #nv #nwo #nws #nwt #ny #ny35 #nyc #nyg #nytimes #nz #nå¼36 #obama #obliteration #occasion2b #ocnj #offers2go #offshore #oil #oilandgas #ojoubot #ok #oklahoma #oktxduo #okwx #olympia #omg #oneborn #oneheartonemindonecss #onlinecommunities #onthisday #oocvg #oomf #ootd #oped #opp #opseaworld #or #orapinforma #orchardalley #orcot #oregon #orpol #orshow #osi2016 #otd #otleyhour #otrametlife #otratmetlife #overwatch #ovofest #p2 #painting #pakistan #palermo #palestine #palestinian #palestinians #palmoil #pandemonium #pantherattack #papicongress #papiichampoo #paraguay #paramedic #paramedics #paranormal #parents #parkchat #pathogen #patriots #patriotsnation #paulhollywood #pbban #pbs #pdx911 #pearlharbor #people #peritoengrafoscopia #perrychat #personalinjury #perspective #pets #pft #philippines #phoenix #phone #photo #photography #phuket #physician #pics #picthis #pieceofme #piling #pilot #piperwearsthepants #pitchwars #p

... .  . . . .. ... ... ... / /8 /: 0 0.45 0.6 0.75 0.9 00 000 00:11 00:25 00:52 00pm 01 01-06 012 014 01:01 01:02 01:04 01:11 01:20 01:26 01:50 02 02-06 025 03 03-08 03/08 030/6 032 05 05.08 05th 06 060/5 061 066gp 06:32 06:34 06jst 07 07:30 08 08.05 08/02 08/05 08/06 08/07 08/3 0840728 08780923344 08:00 08:02 09 097 09:13 09:36 0fsloths 0npzp 0sed 1 1-6 1.00 1.13 1.2 1.25 1.3 1.4 1.43 1.5 1.9 1.94 1/13 1/2 1/2007 10 10-6 10/3 100 100.000 1000 1000s 100bn 100mb 100nd 100s 101 1028 103 105 106.1 106:38 107 107.9 109 10:00 10:15 10:30 10:34 10:38 10:40 10am 10k 10km 10m 10news 10pm 10th 10w 10x 11 11.2 11000 111020 114 1141 1145 115 119000 11:00 11:03 11:15 11:30 11:45 11am 12 12.5 1200 12000 12022 123 1236 124.13 125 129 12:00 12:11 12:32 12hr 12jst 12m 12mm 12news 12th 12u 12v 13 13,000 130 13000 133 138:8 14 140 14000 14028 143 148 149 14th 15 150 1500 15000270364 15000270653 158-0853 158-1017 159 15901 15:04 15:41 15km 15t 15th 16 16.99 160 1600 1620 163 165000 16550 166 1665 1686b 

art arti articals article articles artificial artillery artist artists arts artwork ary as as10004 asap asb asbury ascend aseer asf ash ashayo ashdod ashenforest ashes ashley ashrafiyah asia asian asics aside ask asked askin asking asks asleep aspect aspects asphalt aspiring ass assad assailant assault assembly assertative asses assessment assets asshole assholes assistance assistant assisting associated association assume assumes assured asswipe astonishing astounding astrakhan astrologian astroturfers asylum at atamathon atc atcha ate athens athlete athletics atl atlanta atlantic atlas atleast atm atmosphere atmospheric atom atomic attached attack attack.share attacked attacking attacks attained attempt attempted attempting attend attendance attended attendees attending attention attic attila attitude attraction attractive atv atåêcinema aubrey auburn auc auckland auctions audi audience audiences audio audit aug august aul aunt aurora aussie aussies aust austin australia australia's 

causes causing caution cautious cave caves cbc cbc.ca cbplawyers cbs cc cd cdc cdt ce cebu cech cecil cee celebration celebrations celebrety celebrities celestial cell celtic celtic.indeed cement cena censor censorship census center centers centipede central centre ceo cereal cerography certain certainly certainty certificate certificates certified cervelli cervix cessna cest cfc cgg cgi ch ch4 chachi chain chains chainsaw chair chairman chairs chalked challenge challenged challenges challenging champ champagne champaign champions championsblackfoot championship championships chan chance chances chandanee chandrababu change change's changed changes changing chaning chann channel channels chaos chapoutier chapter character characters charcoal charge charged charger chargers charges charging charity charles charleston charlie charlie's charlize charlotte charmeuse charming charon charred chart chartreuse charts chase chaser chases chat chattanooga che cheap cheat cheated check checked ch

default defeat defeater defeats defective defects defence defend defendant defender defending defense defensenews.comus defensive define defined defines definite definitely definition deflategate defs deglin degree degrees dehydration dei del delany's delay delayed delaying delays delete delhi deliberately delicious deliver delivered delivers dell delmont's delphi delta deluded deluge deluged delusions deluxe dem demand demco demeanor demi demi's demise demo democracy democrat democratic democrats demolish demolish-deep demolished demolishes demolition demon demoness demons demonstrated demonstratio demonstration demonstrations dems den denali denial denied denier deniers denies denim denmark density dental dented denton denver deny denying department departments depends deploy deployed deployments depot depreciations depressed depressing depression dept depth deputies deputy der derail derailed derailed_benchmark derailment derails derby dere derivative derivatives derives derma desce

gauze gave gay gaynor gays gaza gazans gazebo gazette gba gbbo gbonyin gc gd gear gearing gears gecko geek_apocalypse gel geller gem gemini gems gemstone gen gene general generally generation generational generic generous geneva genisys genitals genius genocide gentle gentlemen gently genuine geometric george georgia georgina german germany germs gesserit gesture get gets gettin getting geyser geysers gf gg ghe ghetto ghost ghosts ghostwriter ghostwriting ghoul ghul gi giannis giant giants gibraltar gif gift gifts gig gigant gigawatts giggling gilgit gillibrand gimp gin ginga gio giorgio girardeau girl girlfriend girls gis giselle gist give giveaway given gives giving glacier glad gladbach glass glasses gleaned glen glenn glided glimpses glitch glitter global globe globi_inclusion glononium gloomy gloria glorious glory gloucester glove glue gm gmail gmcr gmmbc gmt gmtty gn gns gnwt go goal goals goat goats gobsmacked god god's godlike godly gods godslove goes gog goggles gohan goin goi

 http://t.co/7b2wf6ovfk http://t.co/7bevuje5ep http://t.co/7cadm3lnko http://t.co/7cehnv3dky http://t.co/7cibxls55f http://t.co/7cmf3noync http://t.co/7d7vweq3es http://t.co/7dyoglhmre http://t.co/7ennullkzm http://t.co/7evyelw4lc http://t.co/7fsn1gewux http://t.co/7giglwdmhy http://t.co/7hanpcr5rk http://t.co/7hkavtvx81 http://t.co/7huen4rwrn http://t.co/7ieiz619h0 http://t.co/7ijlz6bcsp http://t.co/7jfreteii4 http://t.co/7jggqwbv6s http://t.co/7k5shaiqiw http://t.co/7l6bhexixv http://t.co/7l9qazljvg http://t.co/7le5gq2psx http://t.co/7lhkjz0ivo http://t.co/7lvgcmyiyj http://t.co/7mepkbf9e8 http://t.co/7mlcd0l0b8 http://t.co/7mug2kahl7 http://t.co/7mzycu2iho http://t.co/7nagdsadwp http://t.co/7npbfrzejl http://t.co/7nu7prxeul http://t.co/7o4lnfbe7k http://t.co/7old5mjwph http://t.co/7pqs4rshhb http://t.co/7qpg80ud7v http://t.co/7rakhp3bwm http://t.co/7s1gfnebgt http://t.co/7s9nm1fict http://t.co/7uf7tst9zx http://t.co/7ufnxxavqs http://t.co/7vcezi6cbb http://t.co/7xglah10zl http://t.c

http://t.co/ifm6v6480p http://t.co/ifqqpur99x http://t.co/igcetumkcw http://t.co/igll3ph6o1 http://t.co/iglnqpgbnw http://t.co/igm2fc4t0m http://t.co/igm2fcmupm http://t.co/igtxhapo0k http://t.co/igwstttkwk http://t.co/igx8xfz8ko http://t.co/igxrqpotm7 http://t.co/igyu2peiu3 http://t.co/igz7v24ge9 http://t.co/ih0awv3l1o http://t.co/ih49kymsmp http://t.co/ihhrkg4v1s http://t.co/ihinj3enqi http://t.co/ihphzckm41 http://t.co/ihvmtmzxne http://t.co/ii4ewe1qir http://t.co/iidkc0jsbx http://t.co/iifgaz0fil http://t.co/iikssjgbdn http://t.co/ij0wq490cs http://t.co/ijd7wzv5t9 http://t.co/ijmccmhh5g http://t.co/ijobz3mzp0 http://t.co/ijwar15h16 http://t.co/ijzcytbffo http://t.co/ik7mgidvbm http://t.co/ik8m4yi9t4 http://t.co/ikfmektpcx http://t.co/iknyok9zzr http://t.co/ikpngs3dti http://t.co/ikuayuseqt http://t.co/ikuggvbyei http://t.co/iladqebxpn http://t.co/ildbeje225 http://t.co/ilq0wqj0xs http://t.co/im2hdsklq5 http://t.co/im6m4xaen2 http://t.co/imawvmzs3a http://t.co/imhfdaowrd http://t.co

 http://t.co/lwwojxttiv http://t.co/lxjjgyv86a http://t.co/lxmdiseucn http://t.co/lxtjc87kls http://t.co/lxvlqvbc8r http://t.co/ly8x7rqbwn http://t.co/lyj57pq3yx http://t.co/lyxnjlxl8s http://t.co/lzasr05ljo http://t.co/lzff4pt4az http://t.co/lzljzzkcfa http://t.co/lzml1xb2nh http://t.co/lzob8qoh1b http://t.co/lzxwoaye4x http://t.co/m0dap5xlwo http://t.co/m0utldif77 http://t.co/m19ivwrdkk http://t.co/m1rosi2wcs http://t.co/m1xykecrzr http://t.co/m203ul6o7p http://t.co/m2hpnoak8b http://t.co/m2y9ym3if6 http://t.co/m2yuxnqlqy http://t.co/m3njvvtygn http://t.co/m4cpmxmurk http://t.co/m4jdzmgjow http://t.co/m4pqkkeevc http://t.co/m4tczaawpt http://t.co/m5djllxozp http://t.co/m5kxlpkfa1 http://t.co/m5rjekvddp http://t.co/m5sbfrrsn7 http://t.co/m6lvkxl9ii http://t.co/m75dnf2xyg http://t.co/m78ir0ik01û http://t.co/m7na4skfwr http://t.co/m8ciks60bx http://t.co/m8ufjdtlsm http://t.co/m96kbqwior http://t.co/m9d2elimzi http://t.co/m9ig3wq8lq http://t.co/m9k08oazve http://t.co/m9mowcmvnj http://

http://t.co/pm2tnnfdww http://t.co/pmbuzfgin3 http://t.co/pmcp8czpnd http://t.co/pme0hojvya http://t.co/pmggavtokp http://t.co/pmhmmkspaq http://t.co/pmlohzurwr http://t.co/pms4pmur0q http://t.co/pmtqhivsxx http://t.co/pmxezuo4ay http://t.co/pnaqxprweg http://t.co/pnhpljho8e http://t.co/pnlucerp0x http://t.co/pnnunrnqja http://t.co/pnssia5e46 http://t.co/po19h8ycnd http://t.co/pol92mn8yz http://t.co/pp05etlk7t http://t.co/ppekbqdcnc http://t.co/ppji1tcnml http://t.co/pq0d7mh3qr http://t.co/pq3ipugkuy http://t.co/pqhuthss3i http://t.co/pqrjvgvgxg http://t.co/pramklrmhz http://t.co/praro2owia http://t.co/prci76howu http://t.co/prmtxjjdue http://t.co/prontouo91 http://t.co/prrb4fhxtv http://t.co/psbxl1hvu3 http://t.co/pseylyzck4 http://t.co/psi35au3pc http://t.co/pspm3ahgkq http://t.co/psqcnsvfgp http://t.co/pst5bbq0av http://t.co/ptc0xcragy http://t.co/ptevy815mt http://t.co/ptkrxtzjtv http://t.co/ptq3zmgnck http://t.co/pty9hrcuzh http://t.co/pu7c4hhbxj http://t.co/pue5xnznqb http://t.co

 http://t.co/wztz4hgmvq http://t.co/x0giy85bs8 http://t.co/x0qlgwoymt http://t.co/x1onv3d5ux http://t.co/x1x6d5enef http://t.co/x1xj0xvtj7 http://t.co/x2qsjod40u http://t.co/x39jwsyrqr http://t.co/x3g2ox6k8r http://t.co/x3rcchkago http://t.co/x3vqxdouvt http://t.co/x4ecggvnsn http://t.co/x5jgkjv6ma http://t.co/x5rc5nuamh http://t.co/x5xumtoeke http://t.co/x5yeuylt1x http://t.co/x6asgrjswc http://t.co/x6el3ysycn http://t.co/x713omh6ai http://t.co/x8i0mhyrmn http://t.co/x8moyevjsj http://t.co/x8w7tf6fhg http://t.co/x8zqbwnfo1 http://t.co/x9cuihib5n http://t.co/x9mdhocpda http://t.co/x9ofv1kmv7 http://t.co/xaermbmvlv http://t.co/xbmm7ite9q http://t.co/xbnlsbzzgi http://t.co/xbznu0qkvs http://t.co/xc96rwuszb http://t.co/xcgzc45gys http://t.co/xcolwugfjg http://t.co/xcq48ourvl http://t.co/xdt4vhfn7b http://t.co/xdxdprcpns http://t.co/xe0ee1fzfh http://t.co/xehwmsh7lv http://t.co/xevueefqbz http://t.co/xezbs3sq0y http://t.co/xfccvmxuwb http://t.co/xfguklrltf http://t.co/xfhh2xf9ga http://t.c

 https://t.co/8tygo0kizz https://t.co/8u07foqjzw https://t.co/9cpwiecegv https://t.co/9hkxxbb82o https://t.co/9jcibenckz https://t.co/9le0b19lvf https://t.co/9pmbtxmoal https://t.co/9zmwt9xydz https://t.co/abnzqwlig1 https://t.co/aim5cyhl0y https://t.co/akmihlris1 https://t.co/alnv51d95x https://t.co/an3w16c8f6 https://t.co/aomq1rykmj https://t.co/apod4eivba https://t.co/asoopcygwn https://t.co/b0zjwgmaiw https://t.co/b19z8vi3td https://t.co/b7omj7u3ei https://t.co/b7zwevsrgo https://t.co/bejftygjil https://t.co/bftou2nybw https://t.co/bhufevagpu https://t.co/biexwdldwc https://t.co/bjzssw4tid https://t.co/bptmlf4p10 https://t.co/budmke3nnf https://t.co/cavb7pgepv https://t.co/ccwzdtfbus https://t.co/cezhq7czll https://t.co/chkp0gfynj https://t.co/cic7h64qv8 https://t.co/ciyty0fgpr https://t.co/cm9tve2vsq https://t.co/cnxxmffrae https://t.co/cubdnsnuvt https://t.co/cvkqigr1az https://t.co/cvpdvhxd1r https://t.co/cyompz1a0z https://t.co/cyu8zxw1oh https://t.co/czdw8oowa2 https://t.co/cz

insure insurer insurers int intact intead integrates integrative integrity intel intelligence intended intending intense intensifies intensity intensive intentions interactions interest interested interesting interlaken interlocking intern internal internally internally-displaced international internet interpretation interracial interrogation interrupt intersectio intersection intersections interspersed interstate intertissue intertwine interval intervene interview interviewed interviews intl into intoxicated intragenerational intrigued intriguing intro introduced introducing introduction inundated inundation invaded invading invasion invented invest invested investigate investigated investigating investigation investigative investigators investing investment invincible invisible invited inviting invoices invokces involved involves involving invzices inws inåêchaos io iof iowa ip ipa ipad iphone iphoto ipo ipod ir iran iran's iranian iranians iraq iraqi iraqis iredell ireland ireporter

 meet meeting meets meg mega megadeth megadeth-symphony megalpolis megan melanie mello melt meltdown melted melts member members membuahkan meme meme-baiting memes memorable memorial memories memory memphis men men's menahem mencius mens mental mentality mentally mention mentioned mentions meowing mercenary mercury mercy mere merged merle mesh mesick mesmerizing mess message messages messed messenger messengers messi messiah messy met metal metallica metaphorically metastatic meteors meter meters method methods metlife metre metrics metro metrobus metroid metropolis metropolitan metrotown mets mexican mexico mezcal mf mfi mfs mgm mgr mgs mh mh17 mh370 mhmmm mhtw mi mia miami mic michael michel michele michigan micom microbes microchip microlight microphone microsoft microsoft's microsofts microwave mics mid mid-day mid-morning mid-south middle mideast midfield midges midget midnight mido midst midsummer midtown midweek midwest might migraine migrant migrants migrating mike mil mil-c mi

probability probably probe problem problems probs proc procedures proceeds process prod prodding produc produce produced producer produces product production productive professional professionally profile profit profit-hungry program programme progress progressive progressives prohibits project projected projectiles projects proliferation prolly prolong prom promise promised promises promote promoted promotion prompt prompted prompting prompts proms prone pronouncing proof propaganda propane propelled proper properly property property-casualty prophecy prophet prophets proportions proposal proposed pros prosecute prosecuted prosper prosser protect protected protecting protection protector protein protest protesters protesting protestors protests proto-states protoshoggoth proud prove proven provide provided providence providers province provocation provoke provokes proxies proxy prysmian ps ps1 ps2 ps3 ps4 psa psalm psalms psfda psm psp psqd psychiatric psychic psychological psychologi

 slowly slowpoke slows slsp slumber slums smack small smaller smantibatam smart smash smashing smaug smb smeared smell smelled smelling smells smfh smh smile smiles smiling smirking smithereens smithsonian smoke smokers smokes smokey smoking smoky smoochy smooth smoothed smp sms smth smug smugglers smugglersåênabbed sn snack snacks snake snakes snap snapchat snapping snd sneak sneaks sneezing sni sniff sniiiff snipe sniping snippets snoop snort snotgreen snow snowball snowden snowflake snowstorm snowy snuck snuff so soak soaked soaker soaking soap soaring sobbing soc soccer social social-media-driven socialism socialists socially socialtimes socialwots society sock socket sockets sods sofa soft softball softenza software soggy soil solano solar sold soldi soldier soldiers sole solicitor solid solitude solo soloquiero solution solve solved solving somalia some somebody someday somehow someone someone's someones somethin something sometimes sometimesi somewhere somme son son'd sona soner



Looking at the word list, there is a lot of room for cleanup. I think we can get rid of the following:

- URLs (`http://` and `https://` addresses), including the `t.co` short-link host
- handles (tokens with an @ prefix)
- tokens that start with a number
- a handful of non-ASCII characters at the end of the list?

I expect that this will considerably speed up fitting and predicting without a large impact on accuracy.
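
The cleanup rules listed above can be sketched as a token filter. This is only an illustration under stated assumptions: the helper names `keep_token` and `clean_tokens`, the regular expressions, and the sample tokens are hypothetical, and the rules are a first guess rather than a tuned list.

```python
import re

URL_RE = re.compile(r"^https?://")     # http://t.co/... and https://t.co/... links
HANDLE_RE = re.compile(r"^@\w+$")      # @handles
LEADING_DIGIT_RE = re.compile(r"^\d")  # tokens that start with a digit


def keep_token(token):
    """Return True if the token should stay in the vocabulary."""
    if token == "t.co":
        return False
    if URL_RE.match(token) or HANDLE_RE.match(token) or LEADING_DIGIT_RE.match(token):
        return False
    # Drop tokens containing non-ASCII characters (for example mis-encoded bytes).
    return token.isascii()


def clean_tokens(tokens):
    """Filter a list of tokens, keeping only those that pass keep_token."""
    return [t for t in tokens if keep_token(t)]


sample = ["fire", "http://t.co/7b2wf6ovfk", "@user", "10pm", "û_", "#wildfire", "t.co"]
cleaned = clean_tokens(sample)
print(cleaned)  # ['fire', '#wildfire']
```

A filter like this could be combined with the tokenizer used earlier, for example `CountVectorizer(tokenizer=lambda s: clean_tokens(tt.tokenize(s)))`, so that the unwanted tokens never enter the vocabulary.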

# Named Entity Recognition (NER)

In [26]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

(S
  Our/PRP$
  Deeds/NNS
  are/VBP
  the/DT
  (ORGANIZATION Reason/NNP)
  of/IN
  this/DT
  #earthquake/NN
  May/NNP
  ALLAH/NNP
  Forgive/NNP
  us/PRP
  all/DT)



In [35]:
tagged = nltk.pos_tag(tt.tokenize(train['text'][18]))
entities = nltk.chunk.ne_chunk(tagged)
print(entities)

(S My/PRP$ car/NN is/VBZ so/RB fast/JJ)


In [37]:
import spacy

In [39]:
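# The 'en' shortcut was removed in spaCy 3; load the model by its full name
# (assuming the small English model, en_core_web_sm, is the one installed).
nlp = spacy.load('en_core_web_sm')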

In [47]:
[nlp(tweet).ents for tweet in train['text'][:20]]

[(May,),
 (La Ronge Sask, Canada),
 (),
 (13,000, California),
 (Ruby, Alaska),
 (#, California Hwy, 20, Lake County, #CAfire #),
 (Manitou, Colorado Springs),
 (),
 (),
 (),
 (Three,),
 (Haha South Tampa, GONNA DO),
 (#raining, #, Florida, TampaBay, Tampa, 18 or 19 days),
 (Myanmar,),
 (80,),
 (),
 (),
 (Summer,),
 (),
 ()]

In [48]:
[nlp(tweet).ents for tweet in train['text'][20:30]]

[(), (London,), (), (a wonderful day,), (), (), (NYC, last week), (), (), ()]