# NLP Logistic Regression

[Disaster Tweets Dataset](https://www.kaggle.com/c/nlp-getting-started)

This notebook is inspired by Course 1, Week 1 of the [deeplearning.ai Natural Language Processing Specialization](https://www.deeplearning.ai/natural-language-processing-specialization/), but the code here is implemented using moden library functions rather than using hand-coded implementions.

The approach here is to tokenize the text, and create word frequencies tables for each of the labels. 
A Nx2 numeric feature matrix is created for each tweet, containing the sum of positive and negative frequencies for each word tokens in the tweet.
Linear Regression solves the problem via gradient decent, and is trained to predict labels given an extracted feature matrix.

# Imports

In [19]:
!pip install -q frozendict > /dev/null

The system cannot find the path specified.


In [None]:
import numpy  as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
import pydash
import math
import os
import itertools

from pydash import flatten, flatten_deep
from collections import Counter, OrderedDict
from frozendict import frozendict
from humanize import intcomma
from operator import itemgetter
from typing import *
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from itertools import product, combinations
from joblib import Parallel, delayed

: 

In [21]:
df_train = pd.read_csv('dataset/train.csv', index_col=0)
df_test  = pd.read_csv('dataset/test.csv', index_col=0)
df_train

Unnamed: 0_level_0,keyword,location,text,target
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,,,Our Deeds are the Reason of this #earthquake M...,1
4,,,Forest fire near La Ronge Sask. Canada,1
5,,,All residents asked to 'shelter in place' are ...,1
6,,,"13,000 people receive #wildfires evacuation or...",1
7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...
10869,,,Two giant cranes holding a bridge collapse int...,1
10870,,,@aria_ahrary @TheTawniest The out of control w...,1
10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
10872,,,Police investigating after an e-bike collided ...,1


# Tokenization and Word Frequencies

Here we tokenize the text using nltk.TweetTokenizer, apply lowercasing, tweet preprocessing, and stemming.

Then compute a dictionary lookup of word counts for each label

In [22]:
print(nltk.corpus.stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [23]:
def tokenize_df(
    dfs: List[pd.DataFrame], 
    keys          = ('text', 'keyword', 'location'), 
    stemmer       = True, 
    preserve_case = True, 
    reduce_len    = False, 
    strip_handles = True,
    use_stopwords = True,
    **kwargs,
) -> List[List[str]]:
    # tokenizer = nltk.TweetTokenizer(preserve_case=True,  reduce_len=False, strip_handles=False)  # defaults 
    tokenizer = nltk.TweetTokenizer(preserve_case=preserve_case, reduce_len=reduce_len, strip_handles=strip_handles) 
    porter    = nltk.PorterStemmer()
    stopwords = set(nltk.corpus.stopwords.words('english') + [ 'nan' ])

    output    = []
    for df in flatten([ dfs ]):
        for index, row in df.iterrows():
            tokens = flatten([
                tokenizer.tokenize(str(row[key] or ""))
                for key in keys    
            ])
            if use_stopwords:
                tokens = [ 
                    token 
                    for token in tokens 
                    if token.lower() not in stopwords
                    and len(token) >= 2
                ]                
            if stemmer:
                tokens = [ 
                    porter.stem(token) 
                    for token in tokens 
                ]
            output.append(tokens)

    return output


def word_frequencies(df, **kwargs) -> Dict[int, Counter]:
    tokens = {
        0: flatten(tokenize_df( df[df['target'] == 0], **kwargs )),
        1: flatten(tokenize_df( df[df['target'] == 1], **kwargs )),
    }
    freqs = { 
        target: Counter(dict(Counter(tokens[target]).most_common())) 
        for target in [0, 1]
    }  # sort and cast
    return freqs

In [24]:
tokenize_df(df_train)[:2]

[['deed', 'reason', '#earthquak', 'may', 'allah', 'forgiv', 'us'],
 ['forest', 'fire', 'near', 'la', 'rong', 'sask', 'canada']]

In [25]:
freqs = word_frequencies(df_train)
print('freqs[0]', len(freqs[0]), freqs[0].most_common(10))
print('freqs[1]', len(freqs[1]), freqs[1].most_common(10))

freqs[0] 12811 [('...', 421), ('new', 320), ('like', 309), ('get', 224), ('bodi', 216), ("i'm", 207), ('scream', 194), ('û_', 171), ('burn', 159), ('obliter', 157)]
freqs[1] 10795 [('...', 637), ('fire', 303), ('bomb', 242), ('new', 207), ('suicid', 204), ('evacu', 185), ('flood', 176), ('û_', 171), ('derail', 170), ('kill', 160)]


# Feature Extraction

Here we create a Nx2 feature matrix containing the sum of positive and negative word frequencies for each tweet

In [26]:
def inverse_document_frequency( tokens: List[str] ) -> Counter:
    tokens = flatten_deep(tokens)
    idf = {
        token: math.log( len(tokens) / count ) 
        for token, count in Counter(tokens).items()
    }
    idf = Counter(dict(Counter(idf).most_common()))  # sort and cast
    return idf

def inverse_document_frequency_df( dfs ) -> Counter:
    tokens = flatten_deep([ tokenize_df(df) for df in flatten([ dfs ]) ])
    return inverse_document_frequency(tokens)

idf = inverse_document_frequency_df([ df_train, df_test ])
list(reversed(idf.most_common()))[:20]

[('...', 4.467633783633229),
 ('new', 5.142574999696602),
 ('fire', 5.360577151510393),
 ('like', 5.413220884995814),
 ('û_', 5.568216516288637),
 ('bomb', 5.654690114292464),
 ('get', 5.667677309819275),
 ('burn', 5.792840452773281),
 ('usa', 5.833148176261374),
 ('emerg', 5.8539281447531195),
 ('flood', 5.89136567182525),
 ("i'm", 5.918991738100181),
 ('bodi', 5.935941296413954),
 ('attack', 5.967781902269613),
 ('via', 5.97072741249937),
 ('fatal', 6.000669769114448),
 ('crash', 6.000669769114448),
 ('suicid', 6.015984004087491),
 ('build', 6.025286396749804),
 ('evacu', 6.034676137099644)]

In [13]:
def extract_features(df, freqs, use_idf=True, use_log=True, **kwargs) -> np.array:
    features = []
    tokens   = tokenize_df(df, **kwargs)
    for n in range(len(tokens)):
        bias     = 1  # bias term is implict when using sklearn
        positive = 1
        negative = 1        
        for token in tokens[n]:
            if use_idf:
                positive += freqs[0].get(token, 0) * idf.get(token, 1) 
                negative += freqs[1].get(token, 0) * idf.get(token, 1)
            else:
                positive += freqs[0].get(token, 0) 
                negative += freqs[1].get(token, 0) 
        features.append([ positive, negative ])  

    features = np.array(features)   # accuracy = 0.7166688559043741
    if use_log:
        features = np.log(features) # accuracy = 0.7136477078681204
    return features


Y_train = df_train['target'].to_numpy()
X_train = extract_features(df_train, freqs)
X_test  = extract_features(df_test,  freqs)

print('df_train', df_train.shape)
print('df_test ', df_test.shape)
print('Y_train ', Y_train.shape)
print('X_train ', X_train.shape)
print('X_test  ', X_test.shape)
print(X_test[:5])

df_train (7613, 4)
df_test  (3263, 3)
Y_train  (7613,)
X_train  (7613, 2)
X_test   (3263, 2)
[[6.92293033 7.38327619]
 [7.14708523 7.00546676]
 [7.29343584 8.00157928]
 [6.36825736 5.77734926]
 [5.70946644 7.56250014]]


# Hyperparameter Search


The optimal settings are:
- stemmer = True
- preserve_case = True
- strip_handles = Any
- use_stopwords = True

The above are exactly opposite compared to [TF-IDF Classifier](https://www.kaggle.com/jamesmcguigan/disaster-tweets-tf-idf-classifier?scriptVersionId=50898834), 
but the following settings are shared:

- use_idf = True
- use_log = True

In [27]:
def predict_df(df_train, df_test, **kwargs):
    freqs   = word_frequencies(df_train, **kwargs)

    Y_train = df_train['target'].to_numpy()
    X_train = extract_features(df_train, freqs, **kwargs)
    X_test  = extract_features(df_test,  freqs, **kwargs) \
              if df_train is not df_test else X_train

    model      = LinearRegression().fit(X_train, Y_train)
    prediction = model.predict(X_test)
    prediction = np.round(prediction).astype(np.int)
    return prediction


def get_train_accuracy(splits=3, **kwargs):
    """ K-Fold Split Accuracy """
    accuracy = 0.0
    for _ in range(splits):
        train, test = train_test_split(df_train, test_size=1/splits)      
        prediction  = predict_df(train, test, **kwargs)
        Y_train     = test['target'].to_numpy()
        accuracy   += np.sum( Y_train == prediction ) / len(Y_train) / splits    
    return accuracy
    
    
def train_accuracy_hyperparameter_search():
    results = Counter()
    jobs    = []
    
    # NOTE: reducing input fields has no effect on accuracy
    for keys in [ ('text', 'keyword', 'location'), ]: # ('text', 'keyword'), ('text',) ]:
        strip_handles = 1  # no effect on accuracy 
        # use_log       = 1  # no effect on accuracy
        for stemmer, preserve_case, reduce_len, use_stopwords, use_idf, use_log in product([1,0],[1,0],[1,0],[1,0],[1,0],[1,0]):
            def fn(keys, stemmer, preserve_case, reduce_len, strip_handles, use_stopwords, use_idf, use_log):
                kwargs = {
                    "stemmer":        stemmer,          # stemmer = True is always better
                    "preserve_case":  preserve_case, 
                    "reduce_len":     reduce_len, 
                    # "strip_handles": strip_handles,   # no effect on accuracy
                    "use_stopwords":  use_stopwords,    # use_stopwords = True is always better
                    "use_idf":        use_idf,          # use_idf = True is always better
                    "use_log":        use_log,          # use_log = True is always better
                }
                label = frozendict({
                    **kwargs,
                    # "keys": keys,                     # no effect on accuracy
                })
                accuracy = get_train_accuracy(**kwargs)
                return (label, accuracy)
            
            # hyperparameter search is slow, so multiprocess it
            jobs.append( delayed(fn)(keys, stemmer, preserve_case, reduce_len, strip_handles, use_stopwords, use_idf, use_log) )
            
    results = Counter(dict( Parallel(-1)(jobs) ))
    results = Counter(dict(results.most_common()))  # sort and cast
    return results

In [28]:
%%time
results = train_accuracy_hyperparameter_search()
for label, value in results.items():
    print(f'{value:.5f} |', "  ".join(f"{k.split('_')[-1]} = {v}" for k,v in label.items() ))  # pretty printdd

AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Verify train accuracy given default settings

In [18]:
print('train_accuracy = ', get_train_accuracy())

AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

# Submission

Without additional feature engineering, LinearRegression scores worst than my [TF-IDF Classifier](https://www.kaggle.com/jamesmcguigan/disaster-tweets-tf-idf-classifier?scriptVersionId=50898834)

In [13]:
df_submission = pd.DataFrame({
    "id":     df_test.index,
    "target": predict_df(df_train, df_test)
})
df_submission.to_csv('submission.csv', index=False)
! head submission.csv

id,target
0,1
2,0
3,1
9,0
11,1
12,1
21,0
22,0
27,0


# Further Reading

This notebook is part of a series exploring Natural Language Processing
- 0.74164 - [NLP Logistic Regression](https://www.kaggle.com/jamesmcguigan/disaster-tweets-logistic-regression/)
- 0.77536 - [NLP TF-IDF Classifier](https://www.kaggle.com/jamesmcguigan/disaster-tweets-tf-idf-classifier)
- 0.79742 - [NLP Naive Bayes](https://www.kaggle.com/jamesmcguigan/nlp-naive-bayes)