# ⌚️ DM&ML 2020 - Team Rolex

## 🖋 Authors
- Francis Ruckstuhl, 16-821-738
- Hanna Birbaum, 16-050-114
- Loïc Rouiller-Monay, 16-832-453

## 🕵️ Project description

Real or Not? NLP with Disaster Tweets: a machine learning model that predicts which tweets are about a real disaster and which are not. The project is based on a Kaggle competition.


## 📝 Commits

### Best commit:

**Commit 2 : 0.818**
- data cleaning : remove unicode literals, urls, link, author, hashtags, rt
- feature engineering : num_chars, num_words, avg_words
- BOW
- LogisticRegression(solver='lbfgs', max_iter=1000)
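The recipe listed for the best commit (bag of words plus a few hand-crafted counts, fed into a logistic regression) can be sketched roughly as follows. This is a minimal illustration, not the team's exact code: the toy tweets, the labels, and the helper `handcrafted_features` are invented here, and the data cleaning step is left out.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy tweets and labels, invented for illustration only (1 = real disaster)
texts = [
    "Forest fire near the village, residents evacuated",
    "Flood warning issued for the whole county",
    "Earthquake damages several buildings downtown",
    "This new album is fire, I love it",
    "My inbox is flooded with spam today",
    "That movie was a total disaster lol",
]
labels = np.array([1, 1, 1, 0, 0, 0])

def handcrafted_features(docs):
    """Hypothetical helper: num_chars, num_words and average word length."""
    rows = []
    for doc in docs:
        words = doc.split()
        rows.append([len(doc), len(words), sum(len(w) for w in words) / len(words)])
    return csr_matrix(np.array(rows, dtype=float))

# Bag of words, then append the three numeric columns on the right
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(texts)
X = hstack([bow, handcrafted_features(texts)]).tocsr()

model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X, labels)

# New text must go through the same fitted vectorizer and the same helper
new_texts = ["Wildfire spreads toward the town"]
X_new = hstack([vectorizer.transform(new_texts), handcrafted_features(new_texts)]).tocsr()
prediction = model.predict(X_new)
```

Keeping the matrix sparse avoids the memory cost of a dense bag-of-words table; whether that matters depends on the vocabulary size.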

### [B.] Previous commits

**Commit 1 : 0.808**
- spacy_tokenizer: remove stopwords, punctuation, numbers then lemmatize and lowercase
- TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), tokenizer=spacy_tokenizer)
- LogisticRegression(solver='lbfgs', max_iter=1000)

**Commit 2 : 0.818**
- data cleaning : remove unicode literals, urls, link, author, hashtags, rt
- feature engineering : num_chars, num_words, avg_words
- BOW
- LogisticRegression(solver='lbfgs', max_iter=1000)

**Commit 3 : 0.809**
- data cleaning : remove unicode literals, urls, link, author, hashtags, rt, punctuations, lowercase
- feature engineering : num_chars, num_words, avg_words, num_hashtags
- BOW
- LogisticRegression(solver='lbfgs', max_iter=1000)

**Commit 4 : 0.801**
- data cleaning : remove unicode literals, urls, link, author, hashtags, rt, punctuations, lowercase, lemmatize, stemming
- model_dbow = Doc2Vec(dm=0, vector_size=30, negative=6, hs=0, min_count=1, sample=0, workers=cores, epoch=300)
- Word2Vec
- LogisticRegression(max_iter=1000, solver='lbfgs')

**Commit 5 : 0.812**
- Same as Commit 4 but without stemming

### [C.] Progression of accuracies

In [None]:
# /!\ Run Chapter 1, "Libraries", first: this cell needs pandas (pd) and seaborn (sns)
accuracy_progression = pd.read_csv('../documents/accuracy_progression.csv', sep=';')
sns.lineplot(x=accuracy_progression.commit_number, y=accuracy_progression.accuracy, linewidth=2)
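The call above expects a semicolon-separated file with `commit_number` and `accuracy` columns. A minimal sketch of that layout, assuming the file simply lists the five scores reported in the Commits section (the actual file content is not shown in this notebook, so the CSV text below is a reconstruction):

```python
import io
import pandas as pd

# Assumed file content, rebuilt from the scores listed under "Commits"
csv_text = "commit_number;accuracy\n1;0.808\n2;0.818\n3;0.809\n4;0.801\n5;0.812\n"

progression = pd.read_csv(io.StringIO(csv_text), sep=';')

# Row with the highest score
best = progression.loc[progression.accuracy.idxmax()]
```

With these values, `best.commit_number` is 2, matching the commit highlighted as the best one.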

# 📚 1. Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_theme(style="darkgrid")
import spacy
from nltk.stem.snowball import SnowballStemmer
# load English language model of spacy
sp = spacy.load('en_core_web_sm')
import string
from spellchecker import SpellChecker
import pycountry
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score
from gensim.models import KeyedVectors
from sklearn.preprocessing import StandardScaler
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from gensim.models import Doc2Vec
from sklearn.model_selection import GridSearchCV

from gensim.models.doc2vec import TaggedDocument

# Yet to discuss whether this will be used or not
from sklearn.preprocessing import LabelEncoder

# 📂 2. Download data


## Files
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission file in the correct format

In [2]:
train = pd.read_csv('../data/training_data.csv')
test = pd.read_csv('../data/test_data.csv')
sample_submission = pd.read_csv('../data/sample_submission.csv')

In [3]:
test.head(5)

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 9972 | tsunami | | Crptotech tsunami and banks.\r\n http://t.co/K... |
| 1 | 9865 | traumatised | Portsmouth, UK | I'm that traumatised that I can't even spell p... |
| 2 | 1937 | burning%20buildings | | @foxnewsvideo @AIIAmericanGirI @ANHQDC So ... ... |
| 3 | 3560 | desolate | | Me watching Law &amp; Order (IB: @sauldale305)... |
| 4 | 2731 | crushed | bahstun/porta reeko | Papi absolutely crushed that ball |


## Features
- id - a unique identifier for each tweet
- text - the text of the tweet
- location - the location the tweet was sent from (may be blank)
- keyword - a particular keyword from the tweet (may be blank)
- target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6471 entries, 0 to 6470
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        6471 non-null   int64 
 1   keyword   6416 non-null   object
 2   location  4330 non-null   object
 3   text      6471 non-null   object
 4   target    6471 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 252.9+ KB


In [5]:
train['num_hashtags'] = train['text'].apply(lambda x: x.count('#'))
test['num_hashtags'] = test['text'].apply(lambda x: x.count('#'))

train["avg_word_length"] = train['text'].apply(lambda x: np.sum([len(w) for w in x.split()]) / len(x.split()))
test["avg_word_length"] = test['text'].apply(lambda x: np.sum([len(w) for w in x.split()]) / len(x.split()))

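train["num_char"] = train["text"].apply(len)
test["num_char"] = test["text"].apply(len)

# num_words is selected later by the BOW submission cells
train["num_words"] = train["text"].apply(lambda x: len(x.split()))
test["num_words"] = test["text"].apply(lambda x: len(x.split()))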

## Disaster Location

In [6]:
# Create regex for countries that require cleaning:

# United States:
usa_regex = re.compile(r"""(?i)Alabama|\bAL\b|Alaska|\bAK\b|Arizona|\bAZ\b|Arkansas|\bAR\b|California|\bCA\b|Colorado|\bCO\b|
                Connecticut|\bCT\b|Delaware|\bDE\b|Florida|\bFL\b|Georgia|\bGA\b|Hawaii|\bHI\b|Idaho|\bID\b|Illinois|\bIL\b|
                Indiana|\bIN\b|Iowa|\bIA\b|Kansas|\bKS\b|Kentucky|\bKY\b|Louisiana|\bLA\b|Maine|\bME\b|Maryland|\bMD\b|Massachusetts|
                \bMA\b|Michigan|\bMI\b|Minnesota|\bMN\b|Mississippi|\bMS\b|Missouri|\bMO\b|Montana|\bMT\b|Nebraska|\bNE\b|Nevada|
                \bNV\b|New\sHampshire|\bNH\b|New\sJersey|\bNJ\b|New\sMexico|\bNM\b|New\sYork|\bNY\b|\bNYC\b|North\sCarolina|\bNC\b|
                North\sDakota|\bND\b|Ohio|\bOH\b|Oklahoma|\bOK\b|Oregon|\bOR\b|Pennsylvania|\bPA\b|Rhode\sIsland|\bRI\b|South\sCarolina|
                \bSC\b|South\sDakota|\bSD\b|Tennessee|\bTN\b|Texas|\bTX\b|Utah|\bUT\b|Vermont|\bVT\b|Virginia|\bVA\b|Washington|\bWA\b|
                West\sVirginia|\bWV\b|Wisconsin|\bWI\b|Wyoming|\bWY\b|\bUSA\b|San\sFrancisco|Los\sAngeles|Seattle|Chicago|
                Atlanta""", re.VERBOSE)

# United Kingdom:
uk_regex = re.compile(r"""(?i)\bUK\b|London|England|Scotland|Wales|Birmingham|Glasgow|Liverpool|Bristol|Manchester|
                      Sheffield|Leeds|Edinburgh|Leicester|Coventry|Bradford|Cardiff|Belfast|Oxford|Plymouth|Aberdeen""", re.VERBOSE)

# Canada:
ca_regex = re.compile(r"""(?i)Canada|Ontario|Quebec|Nova\sScotia|New\sBrunswick|Manitoba|British\sColumbia|Prince\sEdward\sIsland|
                      Saskatchewan|Alberta|Newfoundland|Labrador|Toronto|Ottawa|Vancouver|Calgary""", re.VERBOSE)

# Australia:
au_regex = re.compile(r"""(?i)australia|Brisbane|Melbourne|Sydney|Perth|Adelaide|Capital\sTerritory|Canberra|Hobart|
                      Darwin|Gold\sCoast|Queensland|Victoria|Tasmania""", re.VERBOSE)

# India:
in_regex = re.compile(r"""(?i)mumbai|Maharashtra|Delhi|Kolkata|West\sBengal|Chennai|Tamil\sNadu|Hyderabad|Bangalore|
                      Ahmedabad|Surat|Jaipur|Kanpur|Nagpur|Gujarat|Uttar\sPradesh""", re.VERBOSE)

In [7]:
# Check every location against our regexes, in the order listed below.
# The first regex that matches decides the unified country name;
# rows without any match keep a missing value (NaN).
country_patterns = [
    (usa_regex, "United States"),
    (uk_regex, "United Kingdom"),
    (ca_regex, "Canada"),
    (au_regex, "Australia"),
    (in_regex, "India"),
]

def find_country(location):
    for pattern, country in country_patterns:
        if pattern.search(str(location)):
            return country
    return np.nan

# Same mapping for the training set and the test set
train["country"] = train["location"].apply(find_country)
test["country"] = test["location"].apply(find_country)
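One caveat with case-insensitive patterns of this kind: two-letter state codes such as `IN`, `OR` or `ME` are also ordinary English words, so `(?i)` makes them match free-text locations that have nothing to do with the United States. The sketch below uses a shortened, hypothetical pattern (not the notebook's full regex) to show the effect, and one possible variant in which only the full names are case-insensitive.

```python
import re

# Hypothetical, shortened pattern: state names plus two-letter codes, all case-insensitive
loose = re.compile(r"(?i)Indiana|\bIN\b|Oregon|\bOR\b|Maine|\bME\b")

# Variant: names are case-insensitive, two-letter codes must be upper case
strict = re.compile(r"(?i:Indiana|Oregon|Maine)|\b(?:IN|OR|ME)\b")

samples = ["Portland, OR", "somewhere in Europe", "follow me", "indiana"]

loose_hits = [bool(loose.search(s)) for s in samples]
strict_hits = [bool(strict.search(s)) for s in samples]

print(loose_hits)   # [True, True, True, True]
print(strict_hits)  # [True, False, False, True]
```

Whether the stricter variant is preferable depends on how often users type state codes in lower case; it trades a few missed matches for fewer false positives.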

# 🧹 4. Data cleaning

## Keywords

In [8]:
# replace the URL-encoded space '%20' in the keyword feature
# (the .str accessor keeps missing values as NaN, so the later fillna still applies)
train.keyword = train.keyword.str.replace('%20', ' ', regex=False)
test.keyword = test.keyword.str.replace('%20', ' ', regex=False)

# 🛠 [D.] 5. Feature Engineering

In [9]:
train.country = train.country.fillna('nocountry')
train.keyword = train.keyword.fillna('nokeyword')
train.location = train.location.fillna('nolocation')
train.text = train.text.fillna('notext')

test.country = test.country.fillna('nocountry')
test.keyword = test.keyword.fillna('nokeyword')
test.location = test.location.fillna('nolocation')
test.text = test.text.fillna('notext')

In [10]:
train.country = train.country.astype(str)
train.keyword = train.keyword.astype(str)
train.location = train.location.astype(str)
train.text = train.text.astype(str)

test.country = test.country.astype(str)
test.keyword = test.keyword.astype(str)
test.location = test.location.astype(str)
test.text = test.text.astype(str)

In [11]:
# As we cannot use multiple columns in the models (except with the BOW), we merge all text columns into a single column
train['full'] = train['country'].str.cat(train[['location', 'keyword', 'text']], sep=" ")
test['full'] = test['country'].str.cat(test[['location', 'keyword', 'text']], sep=" ")
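A small illustration of why the missing values are filled before the columns are merged: `Series.str.cat` returns a missing value for any row in which one of the inputs is missing, unless `na_rep` is given. The toy frame below reuses the notebook's column names, but its values are invented.

```python
import pandas as pd

# Toy frame with the notebook's column names; the values are invented
df = pd.DataFrame({
    "country": ["United Kingdom", None],
    "location": ["Portsmouth, UK", None],
    "keyword": ["traumatised", "tsunami"],
    "text": ["Excuse the typos!", "Tsunami and banks."],
})

# Without filling, the second row becomes missing in the merged column
unfilled = df["country"].str.cat(df[["location", "keyword", "text"]], sep=" ")

# With placeholders, every row yields a usable string
filled = df.fillna({"country": "nocountry", "location": "nolocation"})
full = filled["country"].str.cat(filled[["location", "keyword", "text"]], sep=" ")
```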

In [12]:
# remove 'nan' strings from the merged column (currently disabled)
#train.full = train.full.apply(lambda lex: str(lex).replace('nan', ''))
#test.full = test.full.apply(lambda ro: str(ro).replace('nan', ''))

In [13]:
test.text.head(2).apply(lambda x: print(x))

Crptotech tsunami and banks.
 http://t.co/KHzTeVeDja #Banking #tech #bitcoing #blockchain
I'm that traumatised that I can't even spell properly! Excuse the typos!


0    None
1    None
Name: text, dtype: object

In [14]:
test.full.head(2).apply(lambda x: print(x))

nocountry nolocation tsunami Crptotech tsunami and banks.
 http://t.co/KHzTeVeDja #Banking #tech #bitcoing #blockchain
United Kingdom Portsmouth, UK traumatised I'm that traumatised that I can't even spell properly! Excuse the typos!


0    None
1    None
Name: full, dtype: object

In [15]:
train.keyword.isnull().any()

False

# ⚙️ 6. Preprocessing

In [15]:
# Create tokenizer function for preprocessing
def spacy_tokenizer2(text):

    # Define stop words and punctuation
    stop_words = spacy.lang.en.stop_words.STOP_WORDS
    punctuations = string.punctuation
    
    # remove non-ASCII characters (unicode literals, emojis)
    temp = text.encode('ascii',errors='ignore').decode('ascii')
    
    # decode the HTML entity &amp; back to &
    temp = temp.replace('&amp;', '&')
    
    # remove urls
    temp = re.sub(r"http\S+", "", temp)
    
    # remove html tags
    temp = re.sub(r'<.*?>', "", temp)
    
    # remove the '#' sign but keep the hashtag word itself
    temp = re.sub(r'#', "", temp)
    
    # remove account mentions (@user)
    temp = re.sub(r"@\S+", "", temp)

    # Create spacy object
    mytokens = sp(temp)

    #Lemmatize each token and convert each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # Return preprocessed list of tokens
    return mytokens

In [16]:
print(spacy_tokenizer2(test.full[2]))

['nocountry', 'nolocation', 'burn', 'building', '...', 'rioter', 'looter', 'burn', 'building', 'white', 'live', 'matter']
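To see what the regex steps of the tokenizer do on their own, here is a stripped-down sketch without spaCy. The function name `clean_tweet`, the sample tweet, and the final whitespace collapse are additions for illustration; lemmatization and stop-word removal are deliberately left out.

```python
import re

def clean_tweet(text):
    """Apply only the regex clean-up steps of the tokenizer (no spaCy)."""
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII characters
    text = text.replace("&amp;", "&")    # decode the HTML ampersand entity
    text = re.sub(r"http\S+", "", text)  # strip URLs
    text = re.sub(r"<.*?>", "", text)    # strip HTML tags
    text = re.sub(r"#", "", text)        # drop '#', keep the hashtag word
    text = re.sub(r"@\S+", "", text)     # strip @mentions
    return " ".join(text.split())        # collapse whitespace (added for readability)

# Invented sample tweet
cleaned = clean_tweet("@user Fire \U0001F525 near #Lisbon http://t.co/abc123 stay safe &amp; sound")
print(cleaned)  # Fire near Lisbon stay safe & sound
```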


# BOW bang

In [None]:
%%time
# Using tokenizer 
count = CountVectorizer(ngram_range=(1,6), min_df=3, tokenizer=spacy_tokenizer2)
bow = count.fit_transform(train.full)

# Get feature names
feature_names = count.get_feature_names()

# Show as a dataframe
processed_train = pd.DataFrame(
    bow.todense(), 
    columns=feature_names
    )

processed_train.shape

In [None]:
train_full = pd.concat([train[['num_char', 'num_hashtags', 'avg_word_length']], processed_train], axis=1)

In [None]:
# Select features
X = train_full # the features we want to analyze
y = train['target'] # the labels, or answers, we want to test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=707)

In [None]:
scaler = StandardScaler()
# Don't cheat - fit only on training data
scaler.fit(X_train)
X_train = scaler.transform(X_train)
# apply same transformation to test data
X_test = scaler.transform(X_test)
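The "fit only on training data" rule means that the mean and standard deviation come from the training split alone and are then reused unchanged on the test split. A tiny sketch with invented numbers (the `_demo` names are not part of the notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_demo = np.array([[1.0], [2.0], [3.0]])
X_test_demo = np.array([[10.0]])

# Statistics are learned from the training rows only
scaler_demo = StandardScaler().fit(X_train_demo)

# The test row is shifted and scaled with the training mean and std
X_test_scaled = scaler_demo.transform(X_test_demo)
```

Here the training mean is 2.0, so the test value 10.0 ends up far from zero after scaling; fitting the scaler on the test data as well would hide that shift.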

In [None]:
%%time
# Define PCA
pca = PCA(n_components=3000)

# Fit on the training split, then apply the same projection to the test split
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
print('Shape after PCA: ', X_train_pca.shape)
print('Number of components: ', pca.n_components_)
print('Explained variance ratio: ', sum(pca.explained_variance_ratio_))

In [None]:
# Define classifier
classifier = LogisticRegressionCV(solver='liblinear', max_iter=2000, cv=5, class_weight='balanced', n_jobs=-1)
#classifier = LogisticRegression()

In [None]:
%%time
# Fit model on training set
classifier.fit(X_train, y_train)

# Predictions
y_pred = classifier.predict(X_test)
train_pred = classifier.predict(X_train)

# Evaluate model (accuracy_score expects y_true first, then y_pred)
print('Train Accuracy: ', round(accuracy_score(y_train, train_pred), 4))
print('Test Accuracy: ', round(accuracy_score(y_test, y_pred), 4))

print(confusion_matrix(y_test, y_pred))

## TF-IDF with Logistic Regression

In [None]:
# Select features
X = train['text'] # the features we want to analyze
y = train['target'] # the labels, or answers, we want to test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=707)

In [None]:
%%time
# Define vectorizer
tfidf = TfidfVectorizer(ngram_range=(1, 1), tokenizer=spacy_tokenizer)

# Define classifier
classifier = LogisticRegressionCV(solver='lbfgs', max_iter=1000, cv=5)

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

In [None]:
# Predictions
y_pred = pipe.predict(X_test)

# Evaluate model
print(round(accuracy_score(y_test, y_pred), 4))

### Perhaps a random forest? 

In [None]:
# Maybe try a Random Forest? (- Hanna)
from sklearn.ensemble import RandomForestClassifier

# Define vectorizer
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer) 

# Define classifier
classifier = RandomForestClassifier()

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

# Predictions
y_pred = pipe.predict(X_test)

# Bigbang

In [17]:
# Select features
X = train['full'] # the features we want to analyze
y = train['target'] # the labels, or answers, we want to test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=707)

In [18]:
%%time
# Define vectorizer - use above cleaning function
tfidf = TfidfVectorizer(sublinear_tf=True, ngram_range=(1,2), tokenizer=spacy_tokenizer2)

# Fit and transform X_train and X_test
X_train_vec = tfidf.fit_transform(X_train).toarray()
X_test_vec = tfidf.transform(X_test).toarray()
print(X_train_vec.shape)

(5176, 51904)
Wall time: 1min 8s


In [None]:
%%time
# Define Model
pipe = Pipeline([
                 ('scaler', StandardScaler()),
                 ('logistic reg', LogisticRegressionCV(solver='sag', max_iter=2000, cv=5, class_weight='balanced', n_jobs=-1))
                 ])
# Fit model
pipe.fit(X_train_vec, y_train)
y_pred = pipe.predict(X_test_vec)

print('Train Accuracy: ', round(pipe.score(X_train_vec, y_train), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec, y_test), 4))
print(confusion_matrix(y_test, y_pred))

In [None]:
%%time
# Define Model
pipe = Pipeline([
                 ('scaler', StandardScaler()),
                 ('logistic reg', LogisticRegressionCV(solver='lbfgs', max_iter=5000, cv=5))
                 ])
# Fit model
pipe.fit(X_train_vec, y_train)
print('Train Accuracy: ', round(pipe.score(X_train_vec, y_train), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec, y_test), 4))

In [None]:
%%time
# Define Model
pipe = Pipeline([
                 ('logistic reg', LogisticRegressionCV(solver='lbfgs', max_iter=5000, cv=7))
                 ])
# Fit model
pipe.fit(X_train_vec_pca, y_train)
print('Train Accuracy: ', round(pipe.score(X_train_vec_pca, y_train), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_pca, y_test), 4))

In [None]:
%%time
# Define Model
pipe = Pipeline([
                 ('logistic reg', LogisticRegressionCV(solver='lbfgs', max_iter=5000, cv=8))
                 ])
# Fit model
pipe.fit(X_train_vec_pca, y_train)
print('Train Accuracy: ', round(pipe.score(X_train_vec_pca, y_train), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_pca, y_test), 4))

In [None]:
%%time
# Define Model
pipe = Pipeline([
                 ('logistic reg', LogisticRegressionCV(solver='lbfgs', max_iter=5000, cv=10))
                 ])
# Fit model
pipe.fit(X_train_vec_pca, y_train)
print('Train Accuracy: ', round(pipe.score(X_train_vec_pca, y_train), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_pca, y_test), 4))

In [None]:
%%time
# Define Model
pipe = Pipeline([
                 ('logistic reg', LogisticRegressionCV(solver='lbfgs', max_iter=5000, cv=12))
                 ])
# Fit model
pipe.fit(X_train_vec_pca, y_train)
print('Train Accuracy: ', round(pipe.score(X_train_vec_pca, y_train), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_pca, y_test), 4))

In [None]:
%%time
# Define Model
pipe = Pipeline([
                 ('logistic reg', LogisticRegressionCV(solver='lbfgs', max_iter=6000, cv=15))
                 ])
# Fit model
pipe.fit(X_train_vec_pca, y_train)
print('Train Accuracy: ', round(pipe.score(X_train_vec_pca, y_train), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_pca, y_test), 4))

In [None]:
%%time
# Define Model
pipe = Pipeline([
                 ('logistic reg', LogisticRegressionCV(solver='lbfgs', max_iter=10000, cv=20))
                 ])
# Fit model
pipe.fit(X_train_vec_pca, y_train)
print('Train Accuracy: ', round(pipe.score(X_train_vec_pca, y_train), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec_pca, y_test), 4))

In [None]:
%%time
# Define Model
pipe = Pipeline([
                 ('scaler', scaler),
                 ('pca', pca),
                 ('logistic reg', LogisticRegressionCV(solver='lbfgs', max_iter=5000, cv=5)),
                 ])
# Fit model
pipe.fit(X_train_vec, y_train)
print('Train Accuracy: ', round(pipe.score(X_train_vec, y_train), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec, y_test), 4))

In [None]:
%%time
# Define Model
pipe = Pipeline([
                 ('pca', pca),
                 ('knn', KNeighborsClassifier(20)),
                 ])
# Fit model
pipe.fit(X_train_vec, y_train)
print('Train Accuracy: ', round(pipe.score(X_train_vec, y_train), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec, y_test), 4))

In [None]:
%%time
# Define Model
pipe = Pipeline([
                 ('scaler', scaler),
                 ('pca', pca),
                 ('knn', KNeighborsClassifier(20)),
                 ])
# Fit model
pipe.fit(X_train_vec, y_train)
print('Train Accuracy: ', round(pipe.score(X_train_vec, y_train), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec, y_test), 4))

In [None]:
%%time
tree_para = {'criterion':['gini','entropy'],'max_depth':[4,8,12,15,30,70,90,120,150]}

# Define Model
pipe = Pipeline([
                 ('pca', pca),
                 ('gscv', GridSearchCV(DecisionTreeClassifier(), tree_para, cv=5)),
                 ])
# Fit model

pipe.fit(X_train_vec, y_train)
print('Train Accuracy: ', round(pipe.score(X_train_vec, y_train), 4))
print('Test Accuracy: ', round(pipe.score(X_test_vec, y_test), 4))

## Classification using Doc2Vec and Logistic Regression

In [None]:
sample_tagged = train.apply(lambda r: TaggedDocument(words=spacy_tokenizer(r['text']), tags=[r.target]), axis=1)

In [None]:
# Train test split (note: random_state=1234 here, so this is not the same split as in the sections above)
train_tagged, test_tagged = train_test_split(sample_tagged, test_size=0.2, random_state=1234)

In [None]:
# Allows to speed up a bit
import multiprocessing
cores = multiprocessing.cpu_count()

In [None]:
# Define Doc2Vec and build vocabulary
from gensim.models import Doc2Vec

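# The keyword is `epochs` (plural). Note that dm=1 selects the distributed-memory (PV-DM) variant; dm=0 would select DBOW.
model_dbow = Doc2Vec(dm=1, vector_size=1000, negative=5, hs=0, min_count=1, sample=0, workers=cores, epochs=500)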
model_dbow.build_vocab([x for x in train_tagged.values])

In [None]:
# Train the Doc2Vec model (PV-DM, since dm=1 above)
model_dbow.train(train_tagged, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs)

In [None]:
# Select X and y
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=300)) for doc in sents])
    return targets, regressors

y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)

# Each document (i.e. tweet) is now a vector in a 1000-dimensional space.
# Similar tweets should have similar vector representations.
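"Similar vector representation" is usually measured with cosine similarity. The sketch below is a generic NumPy version on invented three-dimensional vectors; it is not taken from the notebook, where the vectors would come from `infer_vector`.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented document vectors
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
```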

In [None]:
# Fit model on training set - same algorithm as before
logreg = LogisticRegressionCV(max_iter=3000, cv=10, solver='lbfgs')
logreg.fit(X_train, y_train)

In [None]:
# Predictions
y_pred = logreg.predict(X_test)

# Evaluate model
print(round(accuracy_score(y_test, y_pred), 4))

In [None]:
train.info()

## Classification using Doc2Vec, more features and Logistic Regression

In [None]:
# TaggedDocument expects a list of tokens, so split the raw text on whitespace
sample_tagged = train.apply(lambda r: TaggedDocument(words=r['text'].split(), tags=[r.target]), axis=1)

# Train test split - same split as before
train_tagged, test_tagged = train_test_split(sample_tagged, test_size=0.2, random_state=707)

# Allows to speed up a bit
import multiprocessing
cores = multiprocessing.cpu_count()

In [None]:
model = Doc2Vec.load('./doc2vec.bin')

In [None]:
# Select X and y
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=300)) for doc in sents])
    return targets, regressors

# Use the pre-trained model loaded from './doc2vec.bin' above
y_train, X_train = vec_for_learning(model, train_tagged)
y_test, X_test = vec_for_learning(model, test_tagged)

In [None]:
# Fit model on training set - same algorithm as before
logreg = LogisticRegressionCV(max_iter=3000, cv=10, solver='lbfgs')
logreg.fit(X_train, y_train)

# 🏆 8. Submission

## BOW

In [None]:
# Using the default tokenizer
# (sublinear_tf is a TfidfVectorizer option; CountVectorizer does not accept it)
count = CountVectorizer(ngram_range=(1,10), stop_words="english", min_df=5, max_df=0.8)
bow = count.fit_transform(train.text)

In [None]:
# Get feature names
feature_names = count.get_feature_names()

In [None]:
# Show as a dataframe
processed_train = pd.DataFrame(
    bow.todense(), 
    columns=feature_names
    )

In [None]:
train_full = pd.concat([train[['num_char', 'num_words', 'avg_word_length', 'num_hashtags']], processed_train], axis=1)

In [None]:
# Select features
X = train_full # the features we want to analyze
y = train['target'] # the labels, or answers, we want to test against

In [None]:
# Define classifier
classifier = LogisticRegressionCV(solver='lbfgs', max_iter=6000, cv=3)

In [None]:
%%time
# Fit model on training set
classifier.fit(X, y)

In [None]:
bow_test = count.transform(test.text)
# Get feature names
feature_names_test = count.get_feature_names()

In [None]:
# Show as a dataframe
processed_test = pd.DataFrame(
    bow_test.todense(),
    columns=feature_names_test
    )

In [None]:
test_full = pd.concat([test[['num_char', 'num_words', 'avg_word_length' , 'num_hashtags']], processed_test], axis=1)

In [None]:
# Predictions

y_pred = classifier.predict(test_full)

## TF IDF

In [None]:
# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

In [None]:
pipe.fit(train.text, train.target)

In [None]:
preds = pipe.predict(test.text)

In [None]:
preds

# Word2Vec

In [None]:
train = pd.read_csv('../data/training_data_spellchecked.csv')
test = pd.read_csv('../data/test_data_spellchecked.csv')

train[['location', 'text']] = train[['location', 'text']].astype(str)
test['target'] = ''
test[['location', 'text']] = test[['location', 'text']].astype(str)

In [None]:
train_tagged = train.apply(lambda r: TaggedDocument(words=spacy_tokenizer(r['text']), tags=[r.target]), axis=1)

In [None]:
test_tagged = test.apply(lambda r: TaggedDocument(words=spacy_tokenizer(r['text']), tags=[r.target]), axis=1)

In [None]:
# Allows to speed up a bit
import multiprocessing
cores = multiprocessing.cpu_count()

In [None]:
# Define Doc2Vec and build vocabulary
from gensim.models import Doc2Vec

# The keyword is `epochs` (plural)
model_dbow = Doc2Vec(dm=0, vector_size=30, negative=6, hs=0, min_count=1, sample=0, workers=cores, epochs=300)
model_dbow.build_vocab([x for x in train_tagged.values])

In [None]:
# Train distributed Bag of Word model
model_dbow.train(train_tagged, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs)

In [None]:
# Select X and y
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=300)) for doc in sents])
    return targets, regressors

y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)

In [None]:
logreg = LogisticRegressionCV(max_iter=1000, solver='lbfgs', cv=3)
logreg.fit(X_train, y_train)

# Predictions
y_pred = logreg.predict(X_test)

## Export submission

In [None]:
%%time
# Define vectorizer - use above cleaning function
tfidf = TfidfVectorizer(sublinear_tf=True, ngram_range=(1,2), tokenizer=spacy_tokenizer2)

# Fit and transform X_train and X_test
X_train_vec = tfidf.fit_transform(train.full).toarray()
X_test_vec = tfidf.transform(test.full).toarray()
print(X_train_vec.shape)

In [None]:
%%time
# Define PCA
pca = PCA(n_components=0.8)

# Example on X_train_vec
X_train_vec_pca = pca.fit_transform(X_train_vec)
print('Shape after PCA: ', X_train_vec_pca.shape)
print('Number of components: ', pca.n_components_)
print('Explained variance ratio: ', sum(pca.explained_variance_ratio_))
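Passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to keep the smallest number of components whose cumulative explained variance exceeds that fraction. A small sketch on synthetic data (the data and the `_demo` names are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 20 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(200, 20))

# Keep as many components as needed to explain more than 80% of the variance
pca_demo = PCA(n_components=0.8)
X_reduced = pca_demo.fit_transform(X_demo)

explained = pca_demo.explained_variance_ratio_.sum()
print(pca_demo.n_components_, round(explained, 3))
```

Because almost all variance in this toy data lives in three directions, at most three components are kept; on the TF-IDF matrix the number is chosen the same way but will be far larger.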

In [None]:
%%time
X_test_vec_pca = pca.transform(X_test_vec)

In [None]:
%%time
# Define Model
pipe = Pipeline([
                 ('scaler', StandardScaler()),
                 ('logistic reg', LogisticRegressionCV(solver='sag', max_iter=2000, cv=5, class_weight='balanced', n_jobs=-1)),
                 ])
# Fit model
pipe.fit(X_train_vec, train.target)

In [None]:
print('Train Accuracy: ', round(pipe.score(X_train_vec, train.target), 4))

In [None]:
y_preds = pipe.predict(X_test_vec)

In [None]:
sample_submission.target = y_preds

In [None]:
sample_submission

In [None]:
sample_submission.to_csv('submission-025.csv', index=False)
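Before uploading, it can help to check that the exported file has the expected shape. The sketch below uses an invented three-row sample and invented predictions, and writes to an in-memory buffer instead of a file; the column names `id` and `target` follow the feature description above.

```python
import io
import pandas as pd

# Invented stand-ins for sample_submission and the model predictions
sample = pd.DataFrame({"id": [0, 2, 3], "target": [0, 0, 0]})
preds_demo = [1, 0, 1]

# One prediction per row, in the same order as the ids
assert len(preds_demo) == len(sample)
submission = sample.assign(target=preds_demo)

# Write without the index so the file holds only the id and target columns
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
```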