# ⌚️ DM&ML 2020 - Team Rolex

## 🖋 Authors
- Francis Ruckstuhl, 16-821-738
- Hanna Birbaum, 16-050-114
- Loïc Rouiller-Monay, 16-832-453

## 🕵️ Project description

Real or Not? NLP with Disaster Tweets: Machine Learning model that can predict which tweets are about a real disaster and which are not. The project topic is based around a Kaggle competition.


## 📝 Commits

### Best commit:

**Commit 2 : 0.818%**
- data cleaning : remove unicode literals, urls, link, author, hashtags, rt
- feature engineering : num_chars, num_words, avg_words
- BOW
- LogisticRegression(solver='lbfgs', max_iter=1000)

### [B.] Previous commits

**Commit 1 : 0.808%**
- spacy_tokenizer: remove stopwords, punctuation, numbers then lemmatize and lowercase
- TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), tokenizer=spacy_tokenizer)
- LogisticRegression(solver='lbfgs', max_iter=1000)

**Commit 2 : 0.818%**
- data cleaning : remove unicode literals, urls, link, author, hashtags, rt
- feature engineering : num_chars, num_words, avg_words
- BOW
- LogisticRegression(solver='lbfgs', max_iter=1000)

**Commit 3 : 0.809%**
- data cleaning : remove unicode literals, urls, link, author, hashtags, rt, punctuations, lowercase
- feature engineering : num_chars, num_words, avg_words, num_hashtags
- BOW
- LogisticRegression(solver='lbfgs', max_iter=1000)

**Commit 4 : 0.801%**
- data cleaning : remove unicde literals, urls, link, author, hashtags, rt, punctuations, lowercase, lemmatize, stemming
- model_dbow = Doc2Vec(dm=0, vector_size=30, negative=6, hs=0, min_count=1, sample=0, workers=cores, epoch=300)
- Word2Vec
- LogisticRegression(max_iter=1000, solver='lbfgs')

**Commit 5 : 0.812%**
- Same as Commit 4 but without stemming

### [C.] Progression of accuracies

In [None]:
# /!\ You have to run Chapter 1. "libraries“ first before being able to plot the progression of accuracies
accuracy_progression = pd.read_csv('../documents/accuracy_progression.csv', sep=';')
sns.lineplot(x=accuracy_progression.commit_number, y=accuracy_progression.accuracy, linewidth=2)

# 📚 1. Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib as plt
%matplotlib inline
import seaborn as sns
sns.set_theme(style="darkgrid")
import spacy
from nltk.stem.snowball import SnowballStemmer
# load English language model of spacy
sp = spacy.load('en_core_web_sm')
import string
from spellchecker import SpellChecker
import pycountry
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score

from gensim.models.doc2vec import TaggedDocument

In [None]:
# Yet to discuss whether this will be used or not
from sklearn.preprocessing import LabelEncoder

# 📂 2. Download data


## Files
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission file in the correct format

In [2]:
train = pd.read_csv('../data/training_data.csv')
test = pd.read_csv('../data/test_data.csv')
sample_submission = pd.read_csv('../data/sample_submission.csv')

In [None]:
train.head(5)

## Features
- id - a unique identifier for each tweet
- text - the text of the tweet
- location - the location the tweet was sent from (may be blank)
- keyword - a particular keyword from the tweet (may be blank)
- target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

In [None]:
train.info()

# 🔬 3. Exploratory Data Analysis

## [A.] What is the baserate of the problem?

In [None]:
base_rate = train.target.value_counts().max()/len(train)
print(f'\nThe base rate is {base_rate}')

## Target class distribution

In [None]:
sns.catplot(x="target", kind="count", data=train);

## Missing values

In [None]:
train.isnull().any()

It misses value in two features : keyword and location.

### Missing value in "keyword"

In [None]:
train.keyword.isnull().value_counts()

### Missing value in "location"

In [None]:
train.location.isnull().value_counts()

## Tweets length

### Number of characters

In [None]:
train["num_char"] = train["text"].apply(len)
test["num_char"] = test["text"].apply(len)

In [None]:
sns.boxplot(x='target', y='num_char', data=train)

##### Findings
Tweets about real disaster seems to be lengthier.

### Number of words

In [None]:
train["num_words"] = train["text"].apply(lambda x: len(x.split()))
test["num_words"] = test["text"].apply(lambda x: len(x.split()))

In [None]:
sns.boxplot(x='target', y='num_words', data=train)

##### Findings
Tweets about real disaster do not seem to have more words. Maybe it'll help to take this into consideration.

### Average word length

In [None]:
train["avg_word_length"] = train['text'].apply(lambda x: np.sum([len(w) for w in x.split()]) / len(x.split()))
test["avg_word_length"] = test['text'].apply(lambda x: np.sum([len(w) for w in x.split()]) / len(x.split()))

In [None]:
sns.boxplot(x='target', y='avg_word_length', data=train)

##### Findings
Tweets about real disaster seems to have lengtier average word length

### Number of hashtags in text

In [None]:
train['num_hashtags'] = train['text'].apply(lambda x: x.count('#'))

In [None]:
sns.boxplot(x='target', y='num_hashtags', data=train)

### Keywords 

In [None]:
### DISCUSS WITH TEAMMATES ###
# Replace NaN values with "Unknown"? (NaNs need to be replace for label encoding)
train["keyword"] = train["keyword"].fillna("Unknown")

In [None]:
# Label encoding for keywords
label = LabelEncoder()
keyword_label = pd.Series(label.fit_transform(train["keyword"]), name="keyword_code")
keyword_label.head()

In [None]:
# Perhaps display the most frequent keywords? 
train["keyword"].value_counts().head()

## Disaster Location

In [None]:
# Where do most disasters occur / where do disaster tweets come from? 
# Potential problem to take care of: USA and United States are separate; Different US States are also separate;

In [None]:
# How many different locations are there?
train["location"].nunique()

In [None]:
# Create regex for countries that require cleaning:

# United States:
usa_regex = re.compile(r"""(?i)Alabama|\bAL\b|Alaska|\bAK\b|Arizona|\bAZ\b|Arkansas|\bAR\b|California|\bCA\b|Colorado|\bCO\b|
                Connecticut|\bCT\b|Delaware|\bDE\b|Florida|\bFL\b|Georgia|\bGA\b|Hawaii|\bHI\b|Idaho|\bID\b|Illinois|\bIL\b|
                Indiana|\bIN\b|Iowa\bIA\b|Kansas|\bKS\b|Kentucky|\bKY\b|Louisiana|\bLA\b|Maine|\bME\b|Maryland|\bMD\b|Massachusetts|
                \bMA\b|Michigan|\bMI\b|Minnesota|\bMN\b|Mississippi|\bMS\b|Missouri|\bMO\b|Montana|\bMT\b|Nebraska|\bNE\b|Nevada|
                \bNV\b|New\sHampshire|\bNH\b|New\sJersey|\bNJ\b|New Mexico|\bNM\b|New\sYork|\bNY\b|\bNYC\b|North\sCarolina|\bNC\b|
                North\sDakota|\bND\b|Ohio|\bOH\b|Oklahoma|\bOK\b|Oregon|\bOR\b|Pennsylvania|\bPA\b|Rhode\sIsland|\bRI\b|South\sCarolina|
                \bSC\b|South\sDakota|\bSD\b|Tennessee|\bTN\b|Texas|\bTX\b|Utah|\bUT\b|Vermont|\bVT\b|Virginia|\bVA\b|Washington|\bWA\b|
                West\sVirginia|\bWV\b|Wisconsin|\bWI\b|Wyoming|\bWY\b|\bUSA\b|San\sFrancisco|Los\sAngeles|Seattle|Chicago|
                Atlanta""", re.VERBOSE)

# United Kingdom:
uk_regex = re.compile(r"""(?i)UK|London|England|Scotland|Wales|Birmingham|Glasgow|Liverpool|Bristol|Manchester|
                      Sheffield|Leeds|Edinburgh|Leicester|Coventry|Bradford|Cardiff|Belfast|Oxford|Plymouth|Aberdeen""", re.VERBOSE)

# Canada:
ca_regex = re.compile(r"""(?i)Canada|Ontario|Quebec|Nova\sScotia|New Brunswick|Manitoba|British\sColumbia|Prince\sEdward\sIsland|
                      Saskatchewan|Alberta|Newfoundland|Labrator|Toronto|Ottawa|Vancouver|Calgary""", re.VERBOSE)

# Australia:
au_regex = re.compile(r"""(?i)australia|Brisbane|Melbourne|Sydney|Perth|Adelaide|Capital\sTerritory|Canberra|Hobart|
                      Darwin|Gold\sCoast|Queensland|Victoria|Tasmania""", re.VERBOSE)

# India:
in_regex = re.compile(r"""(?i)mumbai|Maharashtra|Delhi|Kolkata|West\sBengal|Chennai|Tamil\sNadu|Hyderabad|Bangalore|
                      Ahmedabad|Surat|Jaipur|Kanpur|Nagpur|Gujarat|Uttar\sPradesh""", re.VERBOSE)

In [None]:
# Iterate through the rows and check if any of the locations matches one of our regexes
# If so, the entire value will be replaced by a unified name:

for index, row in train.iterrows():

  # For any location in the United States:
  if re.search(usa_regex, str(train.loc[index, "location"])):
    train.loc[index, "location"] = "United States"

  # For any location in the United Kingdom:
  elif re.search(uk_regex, str(train.loc[index, "location"])):
    train.loc[index, "location"] = "United Kingdom"

  # For any location in Canada:
  elif re.search(ca_regex, str(train.loc[index, "location"])):
    train.loc[index, "location"] = "Canada"
  
  # For any location in Australia:
  elif re.search(au_regex, str(train.loc[index, "location"])):
    train.loc[index, "location"] = "Australia"
  
  # For any location in the India:
  elif re.search(in_regex, str(train.loc[index, "location"])):
    train.loc[index, "location"] = "India"

In [None]:
# Plot the top 5 countries:
countries = train["location"].value_counts()
countries = countries.sort_values(ascending=False).head(5)
countries.plot(kind="bar")

In [None]:
### DISCUSS WITH TEAMMATES ###
# Will this help with data cleaning?

# 🧹 4. Data cleaning

## Keywords

In [None]:
# remove '%20' from keyword feature
train.keyword = train.keyword.apply(lambda lex: str(lex).replace('%20', ' '))
test.keyword = train.keyword.apply(lambda ro: str(ro).replace('%20', ' '))

In [None]:
# check if NaN values in the keyword feature
print(train.keyword.isnull().any())
print(test.keyword.isnull().any())

# There's no null values between the keywords

In [None]:
# use pycountry in order to check if a country appears in the location
# if yes takes the country, else turn it to NaN
# with train and test data set

In [None]:
# train.location.apply(lambda x: print(x))

## Text

In [3]:
stop_words = spacy.lang.en.stop_words.STOP_WORDS
punctuations = string.punctuation

def separate_punct(text):
    temp = []
    for char in text:
        if char not in punctuations:
            temp.append(char)
        else:
            temp.append(' '+char)
    return ''.join(temp)

def clean_text(text):
    # remove unicode literals
    temp = text.encode('ascii',errors='ignore').decode('ascii')
    
    # remove &amp
    temp = temp.replace('&amp;', '')
    
    # remove urls
    temp = re.sub(r"http\S+", "", temp)
    
    # remove html
    temp = re.sub(r'<.*?>', "", temp)
    
    # remove hashtags
    temp = re.sub(r'#', "", temp)

    # remove people account with @
    temp = re.sub(r'@\S+', "", temp)
    
    # remove 'RT'
    temp = temp.replace('RT', '')
    
    # remove punctuation
    temp = ''.join([ char for char in temp if char not in punctuations ])
    
    # separate punctuation
    # temp = separate_punct(temp)

    # remove "."
    #temp.replace('.','')
    
    # lowercase
    temp = temp.lower()
    
    # spell checking
    spell = SpellChecker()
    temp_spellchecked = []
    for word in temp.split():
        temp_spellchecked.append(spell.correction(word))
        
    # stemming with nltk
    #stemmer = SnowballStemmer(language='english')
    #temp_stemmed = []
    #for word in temp_spellchecked:
    #    temp_stemmed.append(stemmer.stem((word)))
    
    # create spacy object
    temp = sp(' '.join(temp_spellchecked))

    # lemmatize each token and convert each token into lowercase
    temp = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in temp ]
    
    # remove stop words 
    temp = [ word for word in temp if word not in stop_words  ]
    
    # join
    temp = ' '.join(temp)
    
    return temp

In [None]:
%%time
# clean text
train.text = train.text.apply(lambda x: clean_text(x))
test.text = test.text.apply(lambda x: clean_text(x))

In [None]:
train.to_csv('train_spellchecked.csv')
test.to_csv('test_spellchecked.csv')


In [None]:
train = pd.read_csv('../data/train_spellchecked.csv')
test = pd.read_csv('../data/test_spellchecked.csv')

In [None]:
train.text.apply(lambda x: print(x))

## Location

In [None]:
train.location.isnull().value_counts()

# 🛠 [D.] 5. Feature Engineering

In [None]:
pycountry.countries.search_fuzzy('England')

In [None]:
# not good 
def location_to_country(location):
    temp = location.split(',')
    countries = list(pycountry.countries)
    
    for word in temp:
        for i in range(len(countries)):
            if (word.strip() in countries[i].alpha_2) or (word.strip() in countries[i].alpha_3) or (word.strip() in countries[i].name):
                return countries[i].name
    return 'Unknown'

train['country'] = train.location.apply(lambda x: location_to_country(str(x)))
train[['location', 'country']].head(20)

# ⚙️ 6. Preprocessing

In [4]:
# Create tokenizer function for preprocessing
def spacy_tokenizer(text):

    # Define stopwords, punctuation, rolex and numbers
    #stop_words = spacy.lang.en.stop_words.STOP_WORDS
    #punctuations = string.punctuation
    # numbers = "0123456789"

    # Create spacy object
    mytokens = sp(text)

    #Lemmatize each token and convert each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]
    #
    # Remove all word with less that 3 letters (remove noise)
    mytokens = [ word for word in mytokens if len(word)>2 ]

    # Return preprocessed list of tokens
    return mytokens

In [None]:
# Tokenize texts
processed_texts = []
for text in train.text:
    processed_text = spacy_tokenizer(text)
    processed_texts.append(processed_text)

# 🤖 7. Models

In [None]:
train = pd.read_csv('../data/training_data_spellchecked.csv')
test = pd.read_csv('../data/test_data_spellchecked.csv')
sample_submission = pd.read_csv('../data/sample_submission.csv')

# change type to string to prevent some errors
train.text = train.text.astype(str)
train.keyword = train.keyword.astype(str)
train.location = train.location.astype(str)

test.text = test.text.astype(str)
test.keyword = test.keyword.astype(str)
test.location = test.location.astype(str)

## BOW with Logistic Regression

In [None]:
# I'll clean this part during the week - Loïc

In [None]:
# Using default tokenizer 
count = CountVectorizer(ngram_range=(1,2), stop_words="english")
bow = count.fit_transform(train.text)

In [None]:
# Get feature names
feature_names = count.get_feature_names()

In [None]:
# Show as a dataframe
processed_train = pd.DataFrame(
    bow.todense(), 
    columns=feature_names
    )

In [None]:
# Select features
X = processed_train # the features we want to analyze
y = train['target'] # the labels, or answers, we want to test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

In [None]:
# Define classifier
classifier = LogisticRegressionCV(solver='lbfgs', max_iter=1000, cv=3)

In [None]:
# Fit model on training set
classifier.fit(X_train, y_train)

In [None]:
# Predictions
y_pred = classifier.predict(X_test)

# Evaluate model
print(round(accuracy_score(y_test, y_pred), 4))

#### BOW with more additional features

In [None]:
train_full = pd.concat([train[['num_char', 'num_words', 'avg_word_length', 'num_hashtags']], processed_train], axis=1)

In [None]:
# Select features
X = processed_train # the features we want to analyze
y = train['target'] # the labels, or answers, we want to test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

In [None]:
# Define classifier
classifier = LogisticRegressionCV(solver='lbfgs', max_iter=3000, cv=3)

In [None]:
%%time
# Fit model on training set
classifier.fit(X_train, y_train)

In [None]:
# Predictions
y_pred = classifier.predict(X_test)

# Evaluate model
print(round(accuracy_score(y_test, y_pred), 4))

#### BOW with additional features and Decision tree

In [None]:
train_full = pd.concat([train[['num_char', 'num_words', 'avg_word_length']], processed_train], axis=1)

In [None]:
# Select features
X = processed_train # the features we want to analyze
y = train['target'] # the labels, or answers, we want to test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

In [None]:
# Define classifier
classifier = DecisionTreeClassifier()

In [None]:
# Fit model on training set
classifier.fit(X_train, y_train)

In [None]:
# Predictions
y_pred = classifier.predict(X_test)

# Evaluate model
print(round(accuracy_score(y_test, y_pred), 4))

## TF-IDF with Logistic Regression

In [None]:
# Select features
X = train['text'] # the features we want to analyze
y = train['target'] # the labels, or answers, we want to test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=707)

In [None]:
%%time
# Define vectorizer
tfidf = TfidfVectorizer(ngram_range=(1, 1), tokenizer=spacy_tokenizer)

# Define classifier
classifier = LogisticRegressionCV(solver='lbfgs', max_iter=1000, cv=5)

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

In [None]:
# Predictions
y_pred = pipe.predict(X_test)

# Evaluate model
print(round(accuracy_score(y_test, y_pred), 4))

### Perhaps a random forest? 

In [None]:
# Maybe try a Random Forest? (- Hanna)
from sklearn.ensemble import RandomForestClassifier

# Define vectorizer
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer) 

# Define classifier
classifier = RandomForestClassifier()

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

## Decision tree

In [None]:
%%time
# Define vectorizer
tfidf = TfidfVectorizer(ngram_range=(1, 2), tokenizer=spacy_tokenizer)

# Define classifier
classifier = DecisionTreeClassifier()

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

In [None]:
# Predictions
y_pred = pipe.predict(X_test)

# Evaluate model
print(round(accuracy_score(y_test, y_pred), 4))

## Classification using Doc2Vec and Logistic Regression

In [5]:
sample_tagged = train.apply(lambda r: TaggedDocument(words=spacy_tokenizer(r['text']), tags=[r.target]), axis=1)

In [6]:
# Train test split - same split as before
train_tagged, test_tagged = train_test_split(sample_tagged, test_size=0.2, random_state=1234)

In [7]:
# Allows to speed up a bit
import multiprocessing
cores = multiprocessing.cpu_count()

In [8]:
# Define Doc2Vec and build vocabulary
from gensim.models import Doc2Vec

model_dbow = Doc2Vec(dm=1, vector_size=100, negative=5, hs=0, min_count=1, sample=0, workers=cores, epoch=500)
model_dbow.build_vocab([x for x in train_tagged.values])

In [9]:
# Train distributed Bag of Word model
model_dbow.train(train_tagged, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs)

In [10]:
# Select X and y
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=300)) for doc in sents])
    return targets, regressors

y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)

# Each document (i.e. complaint) is now a vector in the space of 30 dimentions.
# Similar complaints should have similar vector representation.

In [11]:
# Fit model on training set - same algorithm as before
logreg = LogisticRegressionCV(max_iter=3000, cv=9, solver='lbfgs')
logreg.fit(X_train, y_train)

LogisticRegressionCV(cv=9, max_iter=3000)

In [12]:
# Predictions
y_pred = logreg.predict(X_test)

# Evaluate model
print(round(accuracy_score(y_test, y_pred), 4))
print('cv = 9, basique, cleané + pays',0.7853)
print(0.7869)
print('avec cv = 9 : ', 0.7892)
print('avec cv = 9, basique clean: ', 0.7923)
print('avec cv = 9, + features ', 0.7861)
print('avec cv = 9, basique, pas clean [import, token --> c tout]', 0.7946)
print('avec cv = 10, basique, pas clean [import, token --> c tout]', 0.7954)

0.7876
cv = 9, basique, cleané + pays 0.7853
0.7869
avec cv = 9 :  0.7892
avec cv = 9, basique clean:  0.7923
avec cv = 9, + features  0.7861
avec cv = 9, basique, pas clean [import, token --> c tout] 0.7946
avec cv = 10, basique, pas clean [import, token --> c tout] 0.7954


In [None]:
train.info()

## Classification using Doc2Vec, more features and Logistic Regression

# 🏆 8. Submission

## BOW

In [None]:
# Using default tokenizer 
count = CountVectorizer(ngram_range=(1,2), stop_words="english")
bow = count.fit(train.text)
bow = count.transform(train.text)

In [None]:
# Get feature names
feature_names = count.get_feature_names()

In [None]:
# Show as a dataframe
processed_train = pd.DataFrame(
    bow.todense(), 
    columns=feature_names
    )

In [None]:
train_full = pd.concat([train[['num_char', 'num_words', 'avg_word_length', 'num_hashtags']], processed_train], axis=1)

In [None]:
# Select features
X = train_full # the features we want to analyze
y = train['target'] # the labels, or answers, we want to test against

In [None]:
# Define classifier
classifier = LogisticRegressionCV(solver='lbfgs', max_iter=6000, cv=3)

In [None]:
%%time
# Fit model on training set
classifier.fit(X, y)

In [None]:
bow_test = count.transform(test.text)
# Get feature names
feature_names_test = count.get_feature_names()

In [None]:
# Show as a dataframe
processed_test = pd.DataFrame(
    bow_test.todense(),
    columns=feature_names_test
    )

In [None]:
test_full = pd.concat([test[['num_char', 'num_words', 'avg_word_length' , 'num_hashtags']], processed_test], axis=1)

In [None]:
# Predictions

y_pred = classifier.predict(test_full)

## TF IDF

In [None]:
# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

In [None]:
pipe.fit(train.text, train.target)

In [None]:
preds = pipe.predict(test.text)

In [None]:
preds

# Word2Vec

In [None]:
train = pd.read_csv('../data/training_data_spellchecked.csv')
test = pd.read_csv('../data/test_data_spellchecked.csv')

train[['location', 'text']] = train[['location', 'text']].astype(str)
test['target'] = ''
test[['location', 'text']] = test[['location', 'text']].astype(str)

In [None]:
train_tagged = train.apply(lambda r: TaggedDocument(words=spacy_tokenizer(r['text']), tags=[r.target]), axis=1)

In [None]:
test_tagged = test.apply(lambda r: TaggedDocument(words=spacy_tokenizer(r['text']), tags=[r.target]), axis=1)

In [None]:
# Allows to speed up a bit
import multiprocessing
cores = multiprocessing.cpu_count()

In [None]:
# Define Doc2Vec and build vocabulary
from gensim.models import Doc2Vec

model_dbow = Doc2Vec(dm=0, vector_size=30, negative=6, hs=0, min_count=1, sample=0, workers=cores, epoch=300)
model_dbow.build_vocab([x for x in train_tagged.values])

In [None]:
# Train distributed Bag of Word model
model_dbow.train(train_tagged, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs)

In [None]:
# Select X and y
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=300)) for doc in sents])
    return targets, regressors

y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)

In [None]:
logreg = LogisticRegressionCV(max_iter=1000, solver='lbfgs', cv=3)
logreg.fit(X_train, y_train)

# Predictions
y_pred = logreg.predict(X_test)

## Export submission

In [None]:
sample_submission.target = y_pred

In [None]:
sample_submission

In [None]:
sample_submission.to_csv('submission-005.csv', index=False)