### Plan:
1. **Get tokens** for positive and negative tweets (in this context, a `token` is a `word`).
2. **Lemmatize** them (convert them to base word forms). For that we will use a Part-of-Speech tagger.
3. **Clean them up** (remove mentions, URLs, stop words).
4. **Prepare the data** for the classifier, based on the cleaned-up tokens.
5. **Run the Naive Bayes classifier**.

First, download the prepared tweet samples.

In [1]:
import nltk

In [2]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

Get some sample positive/negative tweets.

In [3]:
from nltk.corpus import twitter_samples


We can either get the actual string content of those tweets:

In [4]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

In [5]:
positive_tweets[50]

'@groovinshawn they are rechargeable and it normally comes with a charger when u buy it :)'

Or we can get a list of tokens using the [`tokenized` method](https://www.nltk.org/howto/twitter.html) of `twitter_samples`.

In [6]:
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(tweet_tokens[50])

['@groovinshawn', 'they', 'are', 'rechargeable', 'and', 'it', 'normally', 'comes', 'with', 'a', 'charger', 'when', 'u', 'buy', 'it', ':)']


Now let's set up a Part-of-Speech (PoS) tagger. First, download the averaged perceptron model that the PoS tagger uses.

In [7]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Import the Part-of-Speech tagger that will be used for lemmatization.

In [8]:
from nltk.tag import pos_tag

Check how it works. Note that it returns a list of tuples, where the second element of each tuple is a Part-of-Speech identifier.

In [9]:
pos_tag(tweet_tokens[50])

[('@groovinshawn', 'NN'),
 ('they', 'PRP'),
 ('are', 'VBP'),
 ('rechargeable', 'JJ'),
 ('and', 'CC'),
 ('it', 'PRP'),
 ('normally', 'RB'),
 ('comes', 'VBZ'),
 ('with', 'IN'),
 ('a', 'DT'),
 ('charger', 'NN'),
 ('when', 'WRB'),
 ('u', 'JJ'),
 ('buy', 'VB'),
 ('it', 'PRP'),
 (':)', 'JJ')]

Let's write a function that will lemmatize twitter tokens.

For that, let's first fetch the WordNet resource. WordNet is a semantically oriented dictionary of English; see chapter 2.5 of the NLTK book (in the online version, this is part 5 [here](https://www.nltk.org/book/ch02.html)).

In [10]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Now tag the tokens and map each PoS tag to the form that `WordNetLemmatizer` expects.

In [11]:
from nltk.stem.wordnet import WordNetLemmatizer
tokens = tweet_tokens[50]

lemmatizer = WordNetLemmatizer()
lemmatized_sentence = []

for word, tag in pos_tag(tokens):
    if tag.startswith('NN'):
        pos = 'n'
    elif tag.startswith('VB'):
        pos = 'v'
    else:
        pos = 'a'
    lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
print(lemmatized_sentence)

['@groovinshawn', 'they', 'be', 'rechargeable', 'and', 'it', 'normally', 'come', 'with', 'a', 'charger', 'when', 'u', 'buy', 'it', ':)']


Note that it converts words to their base forms ('are' -> 'be', 'comes' -> 'come').
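
The tag-to-PoS mapping used in the loop above can be factored into a small helper. This is a minimal sketch that mirrors the notebook's rule (nouns, verbs, everything else treated as an adjective); the name `penn_to_wordnet_pos` is hypothetical and not part of NLTK.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to the WordNet PoS letter passed to the lemmatizer."""
    if tag.startswith('NN'):
        return 'n'  # noun
    if tag.startswith('VB'):
        return 'v'  # verb
    return 'a'      # fallback used in this notebook: treat as adjective

# Tags taken from the pos_tag output shown earlier.
for tag in ['NN', 'VBZ', 'JJ', 'PRP']:
    print(tag, '->', penn_to_wordnet_pos(tag))
```

Keeping the mapping in one place makes it easier to refine later, for example if adverbs should get their own WordNet category.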

Now we can proceed to processing. 
During processing, we will perform cleanup:
  - remove URLs and mentions using regexes
  - after lemmatization, remove *stopwords*
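
The regex-based part of this cleanup can be tried in isolation. The sketch below uses deliberately simplified patterns rather than the exact ones used in the notebook, and `is_noise` is a hypothetical helper name; the sample tokens come from one of the corpus tweets.

```python
import re

# Simplified, assumed patterns: a token is noise if it is a URL, an @mention, or a #hashtag.
NOISE_PATTERNS = [
    re.compile(r'^https?://\S+$'),  # URL
    re.compile(r'^@\w+$'),          # mention
    re.compile(r'^#\w+$'),          # hashtag
]

def is_noise(token):
    """Return True if the token matches any of the noise patterns."""
    return any(pattern.match(token) for pattern in NOISE_PATTERNS)

tokens = ['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D',
          'https://t.co/bI8k8tb9ht']
kept = [t for t in tokens if not is_noise(t)]
print(kept)
```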

In [12]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

What are these stopwords? Let's see some.

In [13]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(len(stop_words))
for i in range(10):
    print(stop_words[i])


179
i
me
my
myself
we
our
ours
ourselves
you
you're
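
Stop-word removal is just a membership test, so converting the list to a `set` keeps lookups fast when thousands of tweets are processed. A minimal sketch, where a tiny hand-picked set stands in for `stopwords.words('english')` and the name `remove_stop_words` is hypothetical:

```python
# A small stand-in for the NLTK stop-word list (assumption: all of these are in it).
toy_stop_words = {'they', 'be', 'and', 'it', 'with', 'a'}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lower-cased form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

lemmas = ['they', 'be', 'rechargeable', 'and', 'it', 'normally', 'come', 'with', 'a', 'charger']
print(remove_stop_words(lemmas, toy_stop_words))
```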


Here comes the `process_tokens` function:

In [14]:
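import re, string

def process_tokens(tweet_tokens):
    """Drop URLs, mentions and hashtags, lemmatize, then remove punctuation and stop words."""
    cleaned_tokens = []
    stop_words = stopwords.words('english')
    lemmatizer = WordNetLemmatizer()

    for token, tag in pos_tag(tweet_tokens):
        # Skip URLs, @mentions and #hashtags entirely.
        if (re.search(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', token) or 
            re.search(r'(@[A-Za-z0-9_]+)', token) or 
            re.search(r'#.*', token)):
            continue

        # Map the Penn Treebank tag to a WordNet PoS: noun, verb, or adjective by default.
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        # Keep the lower-cased token unless it is a substring of string.punctuation
        # (e.g. a single punctuation mark) or a stop word.
        if token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens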

Let's test `process_tokens`:

In [15]:
print("Before:", tweet_tokens[50])
print("After:", process_tokens(tweet_tokens[50]))

Before: ['@groovinshawn', 'they', 'are', 'rechargeable', 'and', 'it', 'normally', 'comes', 'with', 'a', 'charger', 'when', 'u', 'buy', 'it', ':)']
After: ['rechargeable', 'normally', 'come', 'charger', 'u', 'buy', ':)']


Run `process_tokens` on all positive/negative tokens.

In [16]:
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = [process_tokens(tokens) for tokens in positive_tweet_tokens]
negative_cleaned_tokens_list = [process_tokens(tokens) for tokens in negative_tweet_tokens]

Let's see how the processing went.

In [17]:
print(positive_tweet_tokens[500])
print(positive_cleaned_tokens_list[500])

['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
['dang', 'rad', ':d']


Let's see which tokens are most common there. Add a helper function `get_all_words`:

In [18]:
def get_all_words(cleaned_tokens_list):
    return [w for tokens in cleaned_tokens_list for w in tokens]

all_pos_words = get_all_words(positive_cleaned_tokens_list)

Perform frequency analysis using `FreqDist`:

In [19]:
from nltk import FreqDist

freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(10))

[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]
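
`FreqDist` behaves like a counter over tokens (in current NLTK it subclasses `collections.Counter`), so the same analysis can be sketched with the standard library alone. The token lists below are made-up stand-ins for `positive_cleaned_tokens_list`.

```python
from collections import Counter

# Toy data standing in for the cleaned positive tweets.
toy_cleaned_tokens_list = [
    ['thanks', 'follow', ':)'],
    ['love', ':)'],
    ['thanks', ':)'],
]

# Flatten the per-tweet lists, then count occurrences of each token.
all_words = [w for tokens in toy_cleaned_tokens_list for w in tokens]
freq = Counter(all_words)
print(freq.most_common(2))
```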


Fine. Now we'll convert these token lists to the data structure that NLTK's naive Bayes classifier expects, a dictionary of features per tweet ([docs here](https://www.nltk.org/_modules/nltk/classify/naivebayes.html)):

In [20]:
positive_cleaned_tokens_list[0]

['top', 'engage', 'member', 'community', 'week', ':)']

In [21]:
def get_token_dict(tokens):
    return dict([token, True] for token in tokens)
    
def get_tweets_for_model(cleaned_tokens_list):   
    return [get_token_dict(tweet_tokens) for tweet_tokens in cleaned_tokens_list]

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
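
To make the feature format concrete, here is an equivalent sketch of `get_token_dict` written as a dict comprehension. Note that repeated tokens collapse into a single key, so only the presence of a word is recorded, not its count. The name `token_presence_features` is hypothetical.

```python
def token_presence_features(tokens):
    """Build a {token: True} feature dictionary; duplicates collapse into one key."""
    return {token: True for token in tokens}

features = token_presence_features(['good', 'good', 'service', ':)'])
print(features)
```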

Create two labelled datasets, one for positive and one for negative tweets, then shuffle them together. Use a 7000/3000 split for training and test data.

In [22]:
import random

positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

random.shuffle(dataset)

train_data = dataset[:7000]
test_data = dataset[7000:]
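
Because `random.shuffle` produces a different order on every run, the split (and therefore the reported accuracy) changes slightly between runs. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable; the seed value 42 and the helper name `split_dataset` are arbitrary choices, not part of the notebook:

```python
import random

def split_dataset(dataset, n_train, seed=42):
    """Shuffle a copy of the dataset with a seeded generator and split it."""
    rng = random.Random(seed)   # local generator, does not touch the global state
    shuffled = list(dataset)    # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Toy labelled examples in the same (features, label) shape as the notebook's dataset.
toy = [({'w%d' % i: True}, 'Positive' if i % 2 else 'Negative') for i in range(10)]
train, test = split_dataset(toy, 7)
print(len(train), len(test))
```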

Finally, we train NLTK's `NaiveBayesClassifier` on the training data we've just created:

In [23]:
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy without hashtags is:", classify.accuracy(classifier, test_data))

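# show_most_informative_features prints the table itself and returns None,
# which is why wrapping it in print() adds a trailing "None" to the output.
print(classifier.show_most_informative_features(10))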


Accuracy without hashtags is: 0.9956666666666667
Most Informative Features
                      :( = True           Negati : Positi =   2034.4 : 1.0
                      :) = True           Positi : Negati =   1666.8 : 1.0
                follower = True           Positi : Negati =     39.9 : 1.0
                     sad = True           Negati : Positi =     30.7 : 1.0
                  arrive = True           Positi : Negati =     21.9 : 1.0
                     bam = True           Positi : Negati =     20.8 : 1.0
                    sick = True           Negati : Positi =     19.2 : 1.0
              appreciate = True           Positi : Negati =     15.3 : 1.0
                    glad = True           Positi : Negati =     14.1 : 1.0
               community = True           Positi : Negati =     14.0 : 1.0
None


In [24]:
"""
Accuracy with hashtags is: 0.9956666666666667
Most Informative Features
                      :( = True           Negati : Positi =   2050.2 : 1.0
                follower = True           Positi : Negati =     36.2 : 1.0
                     sad = True           Negati : Positi =     25.8 : 1.0
                     bam = True           Positi : Negati =     22.7 : 1.0
                     x15 = True           Negati : Positi =     17.4 : 1.0
               community = True           Positi : Negati =     15.2 : 1.0
                followed = True           Negati : Positi =     15.2 : 1.0
                    luck = True           Positi : Negati =     14.5 : 1.0
               goodnight = True           Positi : Negati =     13.9 : 1.0
              appreciate = True           Positi : Negati =     13.2 : 1.0
              """

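Note the Positive:Negative ratios. The emoticon features have ratios in the thousands, whereas ordinary words such as `sad` or `follower` only reach the tens, so the classifier leans heavily on emoticons.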
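
Each ratio compares how likely a feature is under one label versus the other. Below is a simplified sketch of that computation with add-0.5 smoothing, which, as far as I know, is what NLTK's default estimator (`ELEProbDist`) applies. The counts are invented for illustration and `smoothed_ratio` is a hypothetical name, so treat this as an approximation of the idea rather than NLTK's exact code.

```python
def smoothed_ratio(count_a, total_a, count_b, total_b, gamma=0.5, bins=2):
    """Ratio P(feature | label A) / P(feature | label B) with additive smoothing.

    gamma is the pseudo-count added to every bin; bins is the number of
    possible feature values (assumed here: present or absent).
    """
    p_a = (count_a + gamma) / (total_a + gamma * bins)
    p_b = (count_b + gamma) / (total_b + gamma * bins)
    return p_a / p_b

# Made-up counts: the feature appears in 10 of 100 tweets of label A and in none of label B.
print(round(smoothed_ratio(10, 100, 0, 100), 1))
```

Smoothing is what keeps the ratio finite when a feature never occurs under one of the labels.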

Let's check a test phrase. First, download the punkt sentence tokenizer ([docs here](https://www.nltk.org/api/nltk.tokenize.punkt.html)).

In [25]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Now we won't rely on `twitter_samples.tokenized`, but rather will use a generic tokenization routine - `word_tokenize`.

In [26]:
from nltk.tokenize import word_tokenize

custom_tweet_pos = "I have a good feeling about this one, it is really great"

custom_tweet_neg = "I have a good feeling about this one, it is really"

custom_tweet = "the model doesn't work really GREAT, it just reacts on trigger words. It is expected cause our model doesn't even have trainable parameters."

custom_tokens = process_tokens(word_tokenize(custom_tweet))

print(classifier.classify(get_token_dict(custom_tokens)))

Positive


Let's package it as a function:

In [27]:
def get_sentiment(text):
    custom_tokens = process_tokens(word_tokenize(text))
    return classifier.classify(get_token_dict(custom_tokens))

texts = ["bad", "service is bad", "service is really bad", "service is so terrible", "great service", "they stole my money"]
for t in texts:
    print(t, ": ", get_sentiment(t))


bad :  Negative
service is bad :  Negative
service is really bad :  Negative
service is so terrible :  Negative
great service :  Positive
they stole my money :  Negative


In [28]:
from sklearn.linear_model import LogisticRegression
import numpy as np
import random

In [29]:
def get_vocab(corpus):
    vocab = {}
    for i, word in enumerate(set(corpus), 1):
        vocab[word] = i
    return vocab
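
Iterating over a `set` of strings is not guaranteed to give the same order across Python sessions, so the indices assigned by `get_vocab` can change between runs. If stable indices matter (for example, to reuse a saved model), one option is to sort the words first. A sketch, with `get_sorted_vocab` as a hypothetical name; indexing starts at 1 so that 0 stays free, as in the notebook:

```python
def get_sorted_vocab(corpus):
    """Assign indices 1..N to the unique words of the corpus in sorted order."""
    return {word: i for i, word in enumerate(sorted(set(corpus)), 1)}

print(get_sorted_vocab(['bad', 'service', 'bad', 'great']))
```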

In [30]:
vocab = get_vocab(get_all_words(positive_cleaned_tokens_list) + get_all_words(negative_cleaned_tokens_list))
len(vocab)

10600

In [31]:
def ohe(sentence, vocab):
    ohe_sentence = [[0 for _ in range(len(vocab) + 1)] for _ in range(len(sentence))]
    for i, word in enumerate(sentence):
        ohe_sentence[i][vocab.get(word, 0)] = 1
        
    return ohe_sentence

In [32]:
def num_words_representation(sentence, vocab, ohe):
    return list(map(sum, list(zip(*ohe(sentence, vocab)))))
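
`num_words_representation` builds a full one-hot matrix per sentence and then sums its columns. The same count vector can be produced directly, which avoids the intermediate matrix and also yields a full-length vector of zeros for an empty token list. A sketch under those assumptions (`bag_of_words` is a hypothetical name):

```python
def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = [0] * (len(vocab) + 1)  # slot 0 collects out-of-vocabulary words
    for word in sentence:
        counts[vocab.get(word, 0)] += 1
    return counts

toy_vocab = {'bad': 1, 'service': 2}
print(bag_of_words(['service', 'bad', 'bad', 'unknown'], toy_vocab))
```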

In [33]:
pos_prepared_data = [num_words_representation(sentence, vocab, ohe) for sentence in positive_cleaned_tokens_list]

In [34]:
max_len_pos = max(list(map(sum, pos_prepared_data)))
max_len_pos

51

In [35]:
neg_prepared_data = [num_words_representation(sentence, vocab, ohe) for sentence in negative_cleaned_tokens_list]

In [36]:
max_len_neg = max(list(map(sum, neg_prepared_data)))
max_len_neg

28

In [37]:
pos_prepared_data = list(zip(pos_prepared_data, [1 for _ in range(len(pos_prepared_data))]))
neg_prepared_data = list(zip(neg_prepared_data, [0 for _ in range(len(neg_prepared_data))]))

data = pos_prepared_data + neg_prepared_data

In [38]:
data = list(zip(*data))

In [39]:
X = np.array(data[0])
y = np.array(data[1])

In [40]:
X.shape, y.shape

((10000, 10601), (10000,))

In [41]:
X[0], y[0]

(array([0, 0, 0, ..., 0, 0, 0]), 1)

In [42]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=88)

In [43]:
del data, pos_prepared_data, neg_prepared_data, X, y

In [44]:
log_reg = LogisticRegression().fit(X_train, y_train)

In [45]:
y_test_pred = log_reg.predict(X_test)

In [46]:
from sklearn.metrics import accuracy_score


accuracy_score(y_test, y_test_pred)

0.993

In [82]:
custom_tweet_pos = "listen last night :) bleed amazing track scotland"
custom_tweet_neg = "I fucking hate you"
custom_tweet = "the model doesn't work really GREAT, it just reacts on trigger words. It is expected cause our model doesn't even have trainable parameters."

custom_tokens = process_tokens(word_tokenize(custom_tweet_neg))

test = np.array(num_words_representation(custom_tokens, vocab, ohe))[None, :]


# The model's behaviour is questionable: it gives very small weights to negative words,
# so we often get positive (or barely negative) predictions.
# The accuracy is high, though.
log_reg.predict_proba(test)

array([[0.52794484, 0.47205516]])
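
`predict_proba` returns one row per input with one column per class, ordered as in `log_reg.classes_`; here class 0 is negative and class 1 is positive, following the labels assigned when the data was prepared. A small pure-Python sketch that turns the row above into a label and a confidence margin (`label_from_proba` is a hypothetical helper):

```python
def label_from_proba(proba_row, labels=('Negative', 'Positive')):
    """Return the most probable label and the gap to the runner-up."""
    best = max(range(len(proba_row)), key=lambda i: proba_row[i])
    ordered = sorted(proba_row, reverse=True)
    margin = ordered[0] - ordered[1]
    return labels[best], margin

label, margin = label_from_proba([0.52794484, 0.47205516])
print(label, round(margin, 3))
```

A margin this small means the model is close to undecided on this input, even though the tweet is clearly negative.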