References:

[Kaggle challenge](https://www.kaggle.com/c/nlp-getting-started/)

[Cleaning text data](https://towardsdatascience.com/cleaning-text-data-with-python-b69b47b97b76)

[Text classification](https://towardsdatascience.com/introduction-to-text-classification-with-python-c9db137b9d80)

In [45]:
import re
import nltk
import string
import pandas as pd
from unicodedata import normalize

In [12]:
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

In [9]:
train.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [13]:
test.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


***

# Cleaning text data

### Lowercase the text:

In [15]:
train['text'] = train.text.str.lower()
test['text'] = test.text.str.lower()

### Remove non-ASCII characters:

In [27]:
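# NFKD splits accented letters into base letter + combining mark; the ASCII encode then drops whatever is not ASCII
remove_unicode = lambda txt: normalize('NFKD', txt).encode('ascii','ignore').decode()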

In [29]:
train['text'] = train.text.apply(remove_unicode)
test['text'] = test.text.apply(remove_unicode)
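To see what this normalization step does, here is a small check on made-up strings (the sample inputs are not taken from the dataset). NFKD splits an accented letter into a base letter plus a combining mark, and encoding to ASCII with `'ignore'` then drops the mark along with any character that has no ASCII form, such as emoji or curly quotes.

```python
from unicodedata import normalize

def remove_unicode(txt):
    # Decompose characters (NFKD), then drop everything that is not ASCII.
    return normalize('NFKD', txt).encode('ascii', 'ignore').decode()

# Made-up sample strings for illustration.
accented = remove_unicode('café in são paulo')
symbols = remove_unicode('fire 🔥 near “town”')

print(accented)  # cafe in sao paulo
print(symbols)   # the emoji and the curly quotes are removed entirely
```

Characters without an ASCII counterpart vanish rather than being replaced, so this step can leave doubled spaces behind.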

### Remove stop words:

In [30]:
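nltk.download()  # opens the interactive downloader; nltk.download('stopwords') would fetch only the stop-word corpus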

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [34]:
from nltk.corpus import stopwords

In [35]:
stop_words = set(stopwords.words('english'))  # a set makes the membership test below much faster than a list

In [38]:
remove_stopwords = lambda txt: ' '.join([word for word in txt.split(' ') if word not in stop_words])

In [40]:
train['text'] = train.text.apply(remove_stopwords)
test['text'] = test.text.apply(remove_stopwords)
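The filter above can be illustrated without downloading the NLTK corpus. The sketch below uses a tiny hand-picked stop list (an assumption for illustration; the notebook uses NLTK's English list). It also shows a side effect of splitting on spaces before punctuation is removed: a stop word with punctuation attached, such as `the,`, is not filtered out.

```python
# Tiny illustrative stop list; the notebook itself uses stopwords.words('english').
toy_stop_words = {'a', 'in', 'is', 'the', 'there'}

def remove_stopwords(txt, stop_words=toy_stop_words):
    # Keep only the space-separated tokens that are not in the stop list.
    return ' '.join(word for word in txt.split(' ') if word not in stop_words)

plain = remove_stopwords('there is a forest fire in the park')
with_punct = remove_stopwords('fire in the, park')

print(plain)       # forest fire park
print(with_punct)  # fire the, park
```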

### Remove mentions, links, hashtags etc.:

In [50]:
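# raw strings avoid invalid-escape warnings for \S, \w, \d and \s
mention_pattern = r'@\S+'
link_pattern = r'https?\S+'  # any token starting with http or https
hashtag_pattern = r'#\S+'
tick_pattern = r"'\w+"  # gov't -> gov
punctuation_pattern = '[%s]' % re.escape(string.punctuation)
number_pattern = r'\w*\d+\w*'  # any token containing a digit
overspace_pattern = r'\s{2,}'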

In [51]:
def remove_more(txt):
    # Order matters: mentions, links and hashtags must be removed before punctuation is stripped.
    new_str = re.sub(mention_pattern, ' ', txt)
    new_str = re.sub(link_pattern, ' ', new_str)
    new_str = re.sub(hashtag_pattern, ' ', new_str)
    new_str = re.sub(tick_pattern, '', new_str)
    new_str = re.sub(punctuation_pattern, ' ', new_str)
    new_str = re.sub(number_pattern, '', new_str)
    new_str = re.sub(overspace_pattern, ' ', new_str)

    return new_str


In [53]:
train['text'] = train.text.apply(remove_more)
test['text'] = test.text.apply(remove_more)
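As a rough check of how these substitutions combine, the sketch below applies the same sequence of patterns to a made-up tweet. The helper name `clean_tweet`, the sample text, and the final `strip()` are assumptions added for illustration; they are not part of the notebook.

```python
import re
import string

# Same patterns as above, applied in the same order.
steps = [
    (r'@\S+', ' '),                                  # mentions
    (r'https?\S+', ' '),                             # links
    (r'#\S+', ' '),                                  # hashtags
    (r"'\w+", ''),                                   # gov't -> gov
    ('[%s]' % re.escape(string.punctuation), ' '),   # punctuation
    (r'\w*\d+\w*', ''),                              # tokens containing digits
    (r'\s{2,}', ' '),                                # runs of whitespace
]

def clean_tweet(txt):
    # Hypothetical helper: run every substitution in order, then trim the ends.
    for pattern, replacement in steps:
        txt = re.sub(pattern, replacement, txt)
    return txt.strip()

sample = "@bob gov't says 13,000 evacuated #wildfires http://t.co/abc now!"
cleaned = clean_tweet(sample)
print(cleaned)  # gov says evacuated now
```

The hashtag pattern removes the whole tag, so the word `wildfires` is lost here. Stripping only the `#` character would keep the word as a feature, at the cost of treating tags and plain words alike.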

***

# Build Term-Document Matrix

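**Term-Document Matrix** (TDM): here each row represents a document and each column a term (strictly speaking, this orientation is a document-term matrix). Each cell holds a weight for that term in that document, such as a raw word count or a TF-IDF score.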
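A minimal sketch of such a matrix with raw counts, built in plain Python on two made-up documents (the variable names are illustrative only):

```python
docs = ['forest fire', 'fire fire alarm']

# Vocabulary: every distinct term, sorted so the column order is stable.
vocab = sorted({term for doc in docs for term in doc.split()})

# One row per document, one column per term, each cell a raw count.
matrix = [[doc.split().count(term) for term in vocab] for doc in docs]

print(vocab)   # ['alarm', 'fire', 'forest']
print(matrix)  # [[0, 1, 1], [1, 2, 0]]
```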

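**Term Frequency - Inverse Document Frequency** (TF-IDF): the product of TF and IDF, $tf_{t,d} \times idf_t$:

- Term Frequency (TF): $tf_{t,d} = \log_{10} \big( \text{count}(t,d) + 1 \big)$, where $\text{count}(t,d)$ is the number of times term $t$ occurs in document $d$.

- Inverse Document Frequency (IDF): $idf_t = \log_{10} \bigg( \displaystyle\frac{N}{df_t} \bigg)$, where $N$ is the total number of documents and $df_t$ is the number of documents that contain term $t$.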

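See [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) for other variants. Note that `TfidfVectorizer` uses a different variant by default: raw counts for TF, a smoothed natural-log IDF, and L2-normalized rows.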
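The two formulas can be checked with a short worked example. The numbers below are made up and chosen so the logarithms come out round: a term occurring 9 times in a document and appearing in 10 of 1000 documents.

```python
import math

def tf(count):
    # tf = log10(count + 1)
    return math.log10(count + 1)

def idf(n_docs, doc_freq):
    # idf = log10(N / df)
    return math.log10(n_docs / doc_freq)

tf_value = tf(9)            # log10(10) = 1
idf_value = idf(1000, 10)   # log10(100) = 2
tfidf = tf_value * idf_value

print(tf_value, idf_value, tfidf)
```

A term that appears in every document gets $idf_t = \log_{10}(1) = 0$, so its TF-IDF weight is zero however often it occurs.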


In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [55]:
vectorizer = TfidfVectorizer()

In [59]:
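# fit the vocabulary and IDF weights on the training text only
tfidf_arr_train = vectorizer.fit_transform(train['text']).toarray()
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
tfidf_train = pd.DataFrame(tfidf_arr_train, columns=vectorizer.get_feature_names_out())

# reuse the fitted vocabulary and weights for the test text
tfidf_arr_test = vectorizer.transform(test['text']).toarray()
tfidf_test = pd.DataFrame(tfidf_arr_test, columns=vectorizer.get_feature_names_out())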

***

# The concept of Naive Bayes

Naive Bayes calculates the probability that a document belongs to a certain class.

It assumes that the terms in a document are independent of one another, given the class.

$$
P(c|d) \propto P(c) \prod_{i=1}^{n} P(t_{id} | c)
$$

$P(c|d)$: probability that document $d$ belongs to class $c$.

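$P(c) = \displaystyle\frac{N_c}{N}$: prior probability of class $c$, where $N_c$ is the number of training documents in class $c$ and $N$ is the total number of training documents.

$P(t_{id} | c) = \displaystyle\frac{N_{ct} + 1}{N_c + 2}$: probability that term $t_i$ of document $d$ appears in a document of class $c$, where $N_{ct}$ is the number of training documents in class $c$ that contain $t_i$. The $+1$ and $+2$ are add-one smoothing, which keeps an unseen term from forcing the product to zero.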

**Solution**: $c_{map} = \underset{c\ \in\ C}{\arg \max}\ P(c|d)$.
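The formulas above can be sketched from scratch on a toy training set. This is a minimal illustration that follows the equations as written (a product over the terms present in the document), not the scikit-learn implementation: `BernoulliNB` also includes a factor for every vocabulary term that is absent from the document. The data, the `predict` helper, and the use of log-probabilities (to avoid underflow on long products) are assumptions for illustration.

```python
import math

# Toy training set: (set of terms in the document, class label). Made up for illustration.
train_docs = [
    ({'fire', 'forest'}, 1),
    ({'fire', 'evacuation'}, 1),
    ({'love', 'song'}, 0),
    ({'song', 'fire'}, 0),
]

def predict(doc_terms, training=train_docs):
    n_total = len(training)
    scores = {}
    for c in sorted({label for _, label in training}):
        class_docs = [terms for terms, label in training if label == c]
        n_c = len(class_docs)
        # log P(c) = log(N_c / N)
        log_score = math.log(n_c / n_total)
        for t in doc_terms:
            n_ct = sum(1 for terms in class_docs if t in terms)
            # log P(t|c) with add-one smoothing: (N_ct + 1) / (N_c + 2)
            log_score += math.log((n_ct + 1) / (n_c + 2))
        scores[c] = log_score
    # c_map: the class with the highest (log) score
    return max(scores, key=scores.get), scores

label_a, scores_a = predict({'fire', 'forest'})
label_b, _ = predict({'song'})
print(label_a, label_b)  # 1 0
```

For the document `{'fire', 'forest'}` the unnormalized scores are $0.5 \times 0.75 \times 0.5 = 0.1875$ for class 1 and $0.5 \times 0.5 \times 0.25 = 0.0625$ for class 0, so class 1 wins.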

In [64]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

In [63]:
X_train = tfidf_train.values
y_train = train.target.values
X_test = tfidf_test.values

In [65]:
kfold = KFold(n_splits=5)

In [66]:
cross_val_score(estimator=MultinomialNB(), X=X_train, y=y_train, cv=kfold, scoring='f1')

array([0.66093601, 0.68041237, 0.66773419, 0.64492754, 0.74962064])

In [67]:
cross_val_score(estimator=BernoulliNB(), X=X_train, y=y_train, cv=kfold, scoring='f1')

array([0.66095238, 0.69295302, 0.68721109, 0.66546763, 0.75111773])
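The per-fold scores are easier to compare as averages. A quick check using the two arrays printed above (the variable names are illustrative):

```python
from statistics import mean

# F1 scores per fold, copied from the cross_val_score outputs above.
multinomial_f1 = [0.66093601, 0.68041237, 0.66773419, 0.64492754, 0.74962064]
bernoulli_f1 = [0.66095238, 0.69295302, 0.68721109, 0.66546763, 0.75111773]

multinomial_mean = mean(multinomial_f1)
bernoulli_mean = mean(bernoulli_f1)

print(round(multinomial_mean, 4))  # 0.6807
print(round(bernoulli_mean, 4))    # 0.6915
```

The gap of about 0.01 is small relative to the spread between folds (roughly 0.64 to 0.75), so the ranking of the two models may not be stable under a different split.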

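`BernoulliNB` has the slightly higher mean F1 across the five folds (about 0.69 versus 0.68), so it is used for the final model: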

In [68]:
clf = BernoulliNB()

In [69]:
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [70]:
clf.score(X_train, y_train)  # mean accuracy on the training data itself, so an optimistic estimate

0.8922895047944306
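`clf.score` reports mean accuracy, while the cross-validation above is scored with F1, so the two numbers are not directly comparable. The sketch below computes both metrics from scratch on made-up labels to show how far apart they can be. The helper functions and the data are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    # F1 for the positive class (label 1): harmonic mean of precision and recall.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up labels: the classifier finds only one of the three positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

acc = accuracy(y_true, y_pred)
f1_value = f1(y_true, y_pred)
print(round(acc, 2), round(f1_value, 2))  # 0.75 0.5
```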

In [72]:
y_pred = clf.predict(X_test)

In [73]:
submission = pd.DataFrame()
submission['id'] = test['id']
submission['target'] = y_pred
submission.to_csv('submission.csv', index=False)