# Doc2vec for tweets

## Loading data

In [1]:
import pandas as pd
train = pd.read_csv('train_cleaned.csv')
test = pd.read_csv('test_cleaned.csv')
train.head()

Unnamed: 0,id,keyword,location,text,target,cleaned_text
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,Our Deeds are the Reason of this earthquake Ma...
1,4,,,Forest fire near La Ronge Sask. Canada,1,Forest fire near La Ronge Sask. Canada
2,5,,,All residents asked to 'shelter in place' are ...,1,All residents asked to 'shelter in place' are ...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfires evacuation orders in ...
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,Just got sent this photo from Ruby Alaska as s...


## Tokenize

In [2]:
import gensim
def tokenize(df):
    return df['cleaned_text'].apply(lambda x: gensim.utils.simple_preprocess(x))

train['tokens'] = tokenize(train)
test['tokens'] = tokenize(test)

In [3]:
train.head()

Unnamed: 0,id,keyword,location,text,target,cleaned_text,tokens
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,Our Deeds are the Reason of this earthquake Ma...,"[our, deeds, are, the, reason, of, this, earth..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,Forest fire near La Ronge Sask. Canada,"[forest, fire, near, la, ronge, sask, canada]"
2,5,,,All residents asked to 'shelter in place' are ...,1,All residents asked to 'shelter in place' are ...,"[all, residents, asked, to, shelter, in, place..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfires evacuation orders in ...,"[people, receive, wildfires, evacuation, order..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,Just got sent this photo from Ruby Alaska as s...,"[just, got, sent, this, photo, from, ruby, ala..."
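To make the token lists above easier to interpret, here is a rough standard-library approximation of what `simple_preprocess` appears to do with its default settings: lowercase the text, keep runs of non-digit word characters, and drop tokens shorter than 2 or longer than 15 characters. The function name `simple_tokens` and the regular expression are illustrative assumptions, not gensim's actual implementation.

```python
import re

# Runs of word characters that are not digits. This is only an
# approximation of gensim's alphabetic-token pattern (assumption).
_TOKEN = re.compile(r"[^\W\d]+")


def simple_tokens(text, min_len=2, max_len=15):
    """Lowercase the text and keep alphabetic tokens of a sensible length."""
    tokens = _TOKEN.findall(text.lower())
    return [
        t for t in tokens
        if min_len <= len(t) <= max_len and not t.startswith("_")
    ]


# One-letter words such as "a" and numbers such as "28" are dropped.
print(simple_tokens("Just happened a terrible car crash"))
print(simple_tokens("Typhoon Soudelor kills 28 in China and Taiwan"))
```

This explains why short words like "a" are missing from the `tokens` column even though they are present in `cleaned_text`.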


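## Create one combined corpus for test and train

A single docvec model is trained on both sets, so every tweet in train and test gets its own document vector.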

In [4]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
corpus = pd.concat([train[['id', 'tokens']], test[['id', 'tokens']]], ignore_index=True)

In [5]:
len(corpus.id.unique())

10824

In [6]:
len(corpus)

10824

In [7]:
corpus['doc'] = corpus.apply(lambda x:gensim.models.doc2vec.TaggedDocument(x['tokens'], [x['id']]), axis=1)
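The uniqueness check in cells 5 and 6 matters because each tweet `id` is used as the document's tag, and documents that share a tag would share one vector. The sketch below illustrates that idea without gensim: `TaggedDoc` is a hypothetical stand-in with the same two fields (`words`, `tags`) as gensim's `TaggedDocument`, and `tag_documents` is an illustrative helper, not part of the notebook.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (assumption:
# used here only so that the example runs without gensim installed).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def tag_documents(ids, token_lists):
    """Pair each token list with its id as the only tag; reject duplicate ids."""
    ids = list(ids)
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique to get one vector per document")
    return [
        TaggedDoc(words=tokens, tags=[doc_id])
        for doc_id, tokens in zip(ids, token_lists)
    ]


docs = tag_documents([1, 4], [["our", "deeds"], ["forest", "fire", "near"]])
print(docs[1].tags, docs[1].words)
```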

# Train a docvec model

In [8]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=100, min_count=2, epochs=40)
model.build_vocab(corpus['doc'])

In [9]:
model.train(corpus['doc'], total_examples=model.corpus_count, epochs=model.epochs)

In [10]:
# Note: gensim >= 4.0 exposes these vectors as model.dv; model.docvecs is the older name
train['docvec'] = train['id'].apply(lambda x:model.docvecs[x])
train.head()

Unnamed: 0,id,keyword,location,text,target,cleaned_text,tokens,docvec
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,Our Deeds are the Reason of this earthquake Ma...,"[our, deeds, are, the, reason, of, this, earth...","[-0.12622303, -0.31655547, 0.0786976, 0.086720..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,Forest fire near La Ronge Sask. Canada,"[forest, fire, near, la, ronge, sask, canada]","[0.015244697, 0.09134338, 0.106286, -0.0205552..."
2,5,,,All residents asked to 'shelter in place' are ...,1,All residents asked to 'shelter in place' are ...,"[all, residents, asked, to, shelter, in, place...","[-0.028480066, -0.16509819, 0.005487719, 0.165..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfires evacuation orders in ...,"[people, receive, wildfires, evacuation, order...","[0.1438991, 0.010826383, 0.10716863, 0.0719983..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,Just got sent this photo from Ruby Alaska as s...,"[just, got, sent, this, photo, from, ruby, ala...","[0.35347494, -0.022264164, -0.012723667, 0.104..."


In [11]:
test['docvec'] = test['id'].apply(lambda x:model.docvecs[x])

In [12]:
test.head()

Unnamed: 0,id,keyword,location,text,cleaned_text,tokens,docvec
0,0,,,Just happened a terrible car crash,Just happened a terrible car crash,"[just, happened, terrible, car, crash]","[0.03387151, -0.02856735, 0.1680817, -0.070513..."
1,2,,,"Heard about #earthquake is different cities, s...","Heard about earthquake is different cities, st...","[heard, about, earthquake, is, different, citi...","[-0.04121971, -0.13796373, 0.02811779, -0.0775..."
2,3,,,"there is a forest fire at spot pond, geese are...","there is a forest fire at spot pond, geese are...","[there, is, forest, fire, at, spot, pond, gees...","[0.00040748215, 0.060268387, -0.076933786, 0.0..."
3,9,,,Apocalypse lighting. #Spokane #wildfires,Apocalypse lighting. Spokane wildfires,"[apocalypse, lighting, spokane, wildfires]","[0.06158373, 0.008174323, 0.11369743, -0.06341..."
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan,Typhoon Soudelor kills in China and Taiwan,"[typhoon, soudelor, kills, in, china, and, tai...","[0.06473778, -0.010235404, 0.13380131, -0.1048..."


## Inspect the docvecs

In [13]:
def check(i):
    print("Sentence:")
    print(test.iloc[i].text)
    sims = model.docvecs.most_similar([test['docvec'].iloc[i]], topn=5)
    idx = [x[0] for x in sims]
    print("Most similar:")
    print(train[train['id'].isin(idx)].text.values)
    print(test[test['id'].isin(idx)].text.values)
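For intuition about what `most_similar` returns, the following is a minimal sketch of a cosine-similarity ranking over a dictionary of tagged vectors. It is a simplified illustration with made-up two-dimensional vectors, not gensim's implementation; the names `cosine`, `most_similar` and `toy` are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, vectors, topn=5):
    """Rank the tags in `vectors` (tag -> vector) by similarity to `query`."""
    scored = [(tag, cosine(query, vec)) for tag, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors keyed by tweet id (made-up values for illustration only).
toy = {1: [1.0, 0.0], 4: [0.0, 1.0], 5: [0.7, 0.7]}
result = most_similar([1.0, 0.1], toy, topn=2)
print(result)
```

Because the query passed to `check` is itself one of the stored vectors, the queried tweet has similarity 1 with itself, which is why it shows up in its own result list in the outputs of the cells that follow.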

In [14]:
check(1)

Sentence:
Heard about #earthquake is different cities, stay safe everyone.
Most similar:
['@DarrylB1979 yea heard about that..not coming out until 2017 and 2019 ?????? Vampiro is bleeding'
 "i'm really sad about red 7 closing :( yuppies n tourists ruin everything"
 "I've been trying to write a theological short story about a monster living in a sinkhole. Then I heard about Brooklyn. #accidentalprophecy"
 'I just heard a really loud bang and everyone is asleep great']
['Heard about #earthquake is different cities, stay safe everyone.']


In [15]:
check(8)

Sentence:
What a nice hat?
Most similar:
['#ActionMoviesTaughtUs things actually can explode with a loud bang...in space.'
 "What's going on in Hollywood? #abc7eyewitness @ABC7 helicopters and sirens. #HometownGlory"
 "@HomeworldGym @thisisperidot D: What? That's a tragedy. You have a wonderful nose"]
['What a nice hat?' 'Soon to be loud bang as backpack detonated.']


In [16]:
check(80)

Sentence:
@margaretcho Call me a fag and I'm going to call you an ambulance :) #RainbowPower
Most similar:
['going to starve to death'
 'Going to the beach with Jim Alves means a guaranteed rainstorm.  #lucky http://t.co/fejs0Bu0sq'
 'This past week has been an absolute whirlwind.... Athens bound']
["@margaretcho Call me a fag and I'm going to call you an ambulance :) #RainbowPower"
 '@TheJasonTaylorR *EMS tries to stablize me and put me on a stretcher*']


In [17]:
check(800)

Sentence:
@PahandaBear @Nethaera Yup EU crashed too :P
Most similar:
['Madhya Pradesh Train Derailment: Village Youth Saved Many Lives'
 "I'm so traumatised."]
['@PahandaBear @Nethaera Yup EU crashed too :P'
 "Reddit's new content policy goes into effect many horrible subreddits banned or quarantined http://t.co/VdvLatelsJ http://t.co/OK2SFDN1I8"
 "@TremendousTroye I'm so traumatised"]


In [18]:
check(3000)

Sentence:
RT MMDA: ADVISORY: Stalled Bus at EDSA Service Road Cubao SB due to mechanical trouble as of 7:53 AM. 1 lane occupied. MMDA T/C on site. TÛ_
Most similar:
['Consent Order on cleanup underway at CSX derailment site - Knoxville News Sentinel http://t.co/xsZx9MWXYp http://t.co/NMFsgKf1Za'
 'FAAN orders evacuation of abandoned aircraft at MMA http://t.co/GsOMtDPmoJ']
['2 NNW Hana [Maui Co HI] COUNTY OFFICIAL reports COASTAL FLOOD at 5 Aug 10:00 AM HST -- WAIANAPANAPA STATE PARK CLOSED DUE TO LARGE SURF. \x89Û_'
 'RT MMDA: ADVISORY: Stalled Bus at EDSA Service Road Cubao SB due to mechanical trouble as of 7:53 AM. 1 lane occupied. MMDA T/C on site. T\x89Û_'
 'RT  ADVISORY: Stalled Bus at EDSA Service Road Cubao SB due to mechanical trouble as of 7:53 AM. 1 lane occupied.\x89Û_ https://t.co/HRNZKU66mm']


# Train a model using the docvecs

In [19]:
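import numpy
def get_X(df):
    # X holds the raw docvecs; X_ext holds the docvecs with extra feature columns appended
    X = []
    X_ext = []
    xcols = []  # names of extra feature columns; empty here, so X_ext equals X
    for index, row in df.iterrows():
        x = row['docvec']
        X.append(x)
        for xc in xcols:
            x = numpy.append(x, row[xc])
        X_ext.append(x)
    return X, X_ext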

In [20]:
X, X_ext = get_X(train)

In [21]:
y = train['target']

In [22]:
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter = 1000)
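`StratifiedKFold` keeps the share of each class roughly the same in every fold. The sketch below shows one simple way to get that property with the standard library: shuffle the indices of each class and deal them out round-robin. It is an illustration of the idea under that assumption, not scikit-learn's exact algorithm, and `stratified_folds` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Return n_splits index lists with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class out one fold at a time.
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds


# Toy labels: 40 positives and 60 negatives.
labels = [1] * 40 + [0] * 60
folds = stratified_folds(labels)
for fold in folds:
    print(len(fold), sum(labels[i] for i in fold))
```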

In [23]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

{'fit_time': array([0.08888626, 0.08878827, 0.08883762, 0.07926059, 0.08557606]),
 'score_time': array([0.0033567 , 0.00334048, 0.00277281, 0.00292158, 0.00292349]),
 'test_score': array([0.69179229, 0.63728814, 0.69230769, 0.65767285, 0.67491749]),
 'train_score': array([0.68085106, 0.68675951, 0.66905113, 0.67839933, 0.67943381])}

In [24]:
scores['test_score'].mean()

0.6707956928746281
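With `scoring='f1'`, each fold's score is the F1 of the positive class (`target` equal to 1), the harmonic mean of precision and recall. A minimal standard-library sketch of that computation, using made-up labels and a hypothetical helper name `f1_score_binary`:

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(f1_score_binary(y_true, y_pred))
```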

In [25]:
def prepare_submission(model, X, y, X_test, name):
    model.fit(X,y)
    pred = model.predict(X_test)
    submission = pd.DataFrame({"id":test['id'], "target":pred})
    submission.to_csv(name+'.csv', index=False)

In [26]:
X_test, X_test_ext = get_X(test)
prepare_submission(clf, X, y, X_test, 'simple_docvec')

## Try with an SVM
The logistic regression seems to underfit somewhat (train and test F1 are both around 0.67 to 0.68), so let's try a slightly more complex model.

In [27]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
svm = SVC(kernel="rbf")
params = {'C': [0.05, 0.1, 0.2, 0.5, 0.75, 1, 1.5, 2, 4]}
clf = GridSearchCV(svm, params, scoring="f1", verbose=1, n_jobs=-2, cv=5)

In [28]:
from sklearn.model_selection import cross_validate
scores = cross_validate(clf, X, y, cv=cv, return_train_score=True, scoring='f1')
scores

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.2min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.4min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.5min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  1.8min finished


Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  2.1min finished


{'fit_time': array([ 76.54185033,  90.08294225,  97.24450159, 114.81477833,
        134.82863832]),
 'score_time': array([0.76988053, 0.99055076, 1.00640273, 1.47335863, 1.47036552]),
 'test_score': array([0.70538002, 0.6609589 , 0.69707401, 0.69237217, 0.68585944]),
 'train_score': array([0.78899083, 0.78006231, 0.76841883, 0.82546878, 0.73794281])}
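Passing a `GridSearchCV` object to `cross_validate` gives nested cross-validation, which is why the "Fitting 5 folds for each of 9 candidates" message appears once per outer fold. The arithmetic can be sketched as follows; the variable names are illustrative, and the refit count assumes `GridSearchCV`'s default `refit=True`.

```python
n_candidates = 9   # values of C in the parameter grid
inner_folds = 5    # cv=5 inside GridSearchCV
outer_folds = 5    # n_splits=5 in the outer StratifiedKFold

# Fits performed by one grid search (one outer fold).
fits_per_search = n_candidates * inner_folds

# Grid-search fits over all outer folds.
search_fits = fits_per_search * outer_folds

# With refit=True, the best candidate is refit once per outer fold.
refits = outer_folds

total_fits = search_fits + refits
print(fits_per_search, search_fits, total_fits)
```

This also accounts for the long `fit_time` values compared with the logistic regression.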

In [29]:
scores['test_score'].mean()

0.6883289087330164

In [30]:
X_test, X_test_ext = get_X(test)
prepare_submission(clf, X, y, X_test, 'svm_docvec')

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  45 out of  45 | elapsed:  3.3min finished


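Still not so impressive: the mean F1 only goes from 0.671 to 0.688, and the SVM's train scores (0.74 to 0.83) now sit well above its test scores. Perhaps there is not enough data? Use averaged wordvecs instead?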
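As a starting point for the averaged-wordvec idea, here is a minimal sketch of how a tweet vector could be built as the mean of its word vectors. The embedding table `toy_vectors` holds made-up values and `average_vector` is a hypothetical helper; in practice the vectors would come from a trained or pretrained word embedding model.

```python
def average_vector(tokens, word_vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together.
    return [sum(component) / len(known) for component in zip(*known)]


# Made-up 3-dimensional word vectors for illustration only.
toy_vectors = {
    "forest": [1.0, 0.0, 0.0],
    "fire": [0.0, 1.0, 0.0],
    "near": [0.0, 0.0, 1.0],
}

# "la" is not in the table, so it is skipped.
vec = average_vector(["forest", "fire", "la"], toy_vectors, dim=3)
print(vec)
```

The resulting vectors could be fed to the same classifiers in place of the `docvec` column.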