### Plan:
1. **Get tokens** for positive and negative tweets (by `token` in this context we mean `word`).
2. **Lemmatize** them (convert to base word forms). For that we will use a Part-of-Speech tagger.
3. **Clean them up** (remove mentions, URLs, stop words).
4. **Prepare models** for the classifier, based on cleaned-up tokens.
5. **Run the Naive Bayes classifier**.

First, download the prepared sample data.

In [27]:
import nltk

In [28]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

Get some sample positive/negative tweets.

In [29]:
from nltk.corpus import twitter_samples

We can either get the actual string content of those tweets:

In [30]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

In [31]:
positive_tweets[50]

'@groovinshawn they are rechargeable and it normally comes with a charger when u buy it :)'

Or we can get a list of tokens using the [`tokenized` method](https://www.nltk.org/howto/twitter.html) on `twitter_samples`.

In [32]:
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(tweet_tokens[50])

['@groovinshawn', 'they', 'are', 'rechargeable', 'and', 'it', 'normally', 'comes', 'with', 'a', 'charger', 'when', 'u', 'buy', 'it', ':)']


Now let's set up a Part-of-Speech tagger. Download the averaged perceptron model that the PoS tagger uses.

In [33]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Import the Part-of-Speech tagger that will be used for lemmatization.

In [34]:
from nltk.tag import pos_tag

Check how it works. Note that it returns tuples whose second element is a Part-of-Speech identifier.

In [35]:
pos_tag(tweet_tokens[50])

[('@groovinshawn', 'NN'),
 ('they', 'PRP'),
 ('are', 'VBP'),
 ('rechargeable', 'JJ'),
 ('and', 'CC'),
 ('it', 'PRP'),
 ('normally', 'RB'),
 ('comes', 'VBZ'),
 ('with', 'IN'),
 ('a', 'DT'),
 ('charger', 'NN'),
 ('when', 'WRB'),
 ('u', 'JJ'),
 ('buy', 'VB'),
 ('it', 'PRP'),
 (':)', 'JJ')]

Let's write a function that will lemmatize twitter tokens.

For that, let's first fetch the WordNet resource. WordNet is a semantically oriented dictionary of English; see chapter 2.5 of the NLTK book. In the online version, this is part 5 [here](https://www.nltk.org/book/ch02.html).

In [36]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Now tag the tokens with their parts of speech so that each word and its tag can be passed to `WordNetLemmatizer`.

In [37]:
from nltk.stem.wordnet import WordNetLemmatizer
tokens = tweet_tokens[50]

# Create a lemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_sentence = []
# Convert PoS tags into a format used by the lemmatizer
# and run lemmatize
for word, tag in pos_tag(tokens):
    if tag.startswith('NN'):
        pos = 'n'
    elif tag.startswith('VB'):
        pos = 'v'
    else:
        pos = 'a'
    lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
print(lemmatized_sentence)

['@groovinshawn', 'they', 'be', 'rechargeable', 'and', 'it', 'normally', 'come', 'with', 'a', 'charger', 'when', 'u', 'buy', 'it', ':)']


Note that it converts words to their base forms ('are' -> 'be', 'comes' -> 'come').
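The tag-conversion step in the loop above can be pulled out into a small helper. The sketch below mirrors that mapping; `penn_to_wordnet` is a hypothetical name (not part of NLTK), and like the original loop it treats every tag that is not a noun or a verb as an adjective.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the one-letter PoS code passed to the lemmatizer."""
    if tag.startswith('NN'):
        return 'n'  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('VB'):
        return 'v'  # verbs: VB, VBP, VBZ, ...
    return 'a'      # everything else is treated as an adjective, as in the loop above

# A few (word, tag) pairs taken from the pos_tag output shown earlier
tagged = [('they', 'PRP'), ('are', 'VBP'), ('comes', 'VBZ'), ('charger', 'NN')]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

With such a helper, the lemmatization call would reduce to `lemmatizer.lemmatize(word, penn_to_wordnet(tag))`.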

Now we can proceed to processing.
During processing, we will perform cleanup:
  - remove URLs and mentions using regexes
  - after lemmatization, remove *stopwords*
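As a rough illustration of the regex-based cleanup, the sketch below filters URLs and mentions out of a token list. The patterns are deliberately simplified and `drop_urls_and_mentions` is a hypothetical helper; the patterns used in the actual processing function may be stricter.

```python
import re

# Simplified patterns, for illustration only
URL_PATTERN = re.compile(r'https?://\S+')
MENTION_PATTERN = re.compile(r'@[A-Za-z0-9_]+')

def drop_urls_and_mentions(tokens):
    """Return the tokens that contain neither a URL nor an @mention."""
    return [t for t in tokens
            if not URL_PATTERN.search(t) and not MENTION_PATTERN.search(t)]

sample = ['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame',
          '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
print(drop_urls_and_mentions(sample))
```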

In [38]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

What are these stopwords? Let's see some.

In [39]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(len(stop_words))
for i in range(10):
    print(stop_words[i])


179
i
me
my
myself
we
our
ours
ourselves
you
you're


Here comes the `process_tokens` function:

In [66]:
import re, string

def process_tokens(tweet_tokens):

    cleaned_tokens = []
    stop_words = stopwords.words('english')
    lemmatizer = WordNetLemmatizer()

    for token, tag in pos_tag(tweet_tokens):
        token = re.sub(r'#', '', token)

        if (re.search(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', token) or
            re.search(r'(@[A-Za-z0-9_]+)', token)):
            continue

        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        if token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens


**After completing task 2 (removing the `#` sign), the accuracy decreased**


*   **Accuracy is: 0.9963333333333333**
*   **Accuracy is: 0.9956666666666667**

Let's test `process_tokens`:

In [41]:
print("Before:", tweet_tokens[50])
print("After:", process_tokens(tweet_tokens[50]))

Before: ['@groovinshawn', 'they', 'are', 'rechargeable', 'and', 'it', 'normally', 'comes', 'with', 'a', 'charger', 'when', 'u', 'buy', 'it', ':)']
After: ['rechargeable', 'normally', 'come', 'charger', 'u', 'buy', ':)']


Run `process_tokens` on all positive/negative tokens.

In [42]:
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = [process_tokens(tokens) for tokens in positive_tweet_tokens]
negative_cleaned_tokens_list = [process_tokens(tokens) for tokens in negative_tweet_tokens]

Let's see how the processing went.

In [43]:
print(positive_tweet_tokens[500])
print(positive_cleaned_tokens_list[500])

['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
['dang', 'rad', 'fanart', ':d']


Let's see what is most common there. Add a helper function `get_all_words`:

In [44]:
def get_all_words(cleaned_tokens_list):
    return [w for tokens in cleaned_tokens_list for w in tokens]

all_pos_words = get_all_words(positive_cleaned_tokens_list)

Perform frequency analysis using `FreqDist`:

In [45]:
from nltk import FreqDist

freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(10))

[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 361), ('love', 336), ('...', 290), ('good', 283), ('get', 263), ('thank', 253)]
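`FreqDist` is a subclass of `collections.Counter`, so the same kind of count can be sketched with the standard library alone. The token lists below are made up for illustration.

```python
from collections import Counter

# Toy stand-in for a list of cleaned token lists
toy_cleaned_tokens_list = [['thanks', ':)'], ['love', ':)'], ['thanks', 'follow', ':)']]

# Flatten the nested lists, then count occurrences of each token
all_words = [w for tokens in toy_cleaned_tokens_list for w in tokens]
counts = Counter(all_words)
print(counts.most_common(2))
```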


Fine. Now we'll convert these to a data structure usable for NLTK's naive Bayes classifier ([docs here](https://www.nltk.org/_modules/nltk/classify/naivebayes.html)):

In [46]:
[tweet_tokens for tweet_tokens in positive_cleaned_tokens_list][0]

['followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']

In [47]:
def get_token_dict(tokens):
    return dict([token, True] for token in tokens)

def get_tweets_for_model(cleaned_tokens_list):
    return [get_token_dict(tweet_tokens) for tweet_tokens in cleaned_tokens_list]

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
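Each tweet ends up as a dictionary that maps every token it contains to `True`, which is the feature-set shape the classifier is trained on here. A minimal sketch of the same idea as a dict comprehension (`to_features` is a hypothetical name):

```python
def to_features(tokens):
    """Turn a token list into a {token: True} feature dictionary."""
    return {token: True for token in tokens}

# A repeated token ('top') collapses into a single feature
features = to_features(['followfriday', 'top', 'engage', ':)', 'top'])
print(features)
```

Note that this representation records only whether a token is present, not how many times it occurs.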

Create two datasets for positive and negative tweets. Use 7000/3000 split for train and test data.

In [48]:
import random

positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

random.shuffle(dataset)

train_data = dataset[:7000]
test_data = dataset[7000:]
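Because `random.shuffle` is called without a seed, the split (and therefore the reported accuracy) changes slightly from run to run. A reproducible variant could look like the sketch below; `split_dataset`, the seed value, and the toy data are assumptions made for illustration.

```python
import random

def split_dataset(dataset, train_size, seed=42):
    """Shuffle a copy of the dataset with a fixed seed and split it into train/test parts."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]

# Toy dataset with the same (features, label) shape as above
toy_dataset = [({'token%d' % i: True}, 'Positive' if i % 2 == 0 else 'Negative')
               for i in range(10)]
train, test = split_dataset(toy_dataset, train_size=7)
print(len(train), len(test))
```

With a fixed seed, small accuracy differences between two runs can be attributed to the code change rather than to a different split.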

Finally, we train NLTK's `NaiveBayesClassifier` on the training data we've just created:

In [49]:
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

# show_most_informative_features prints its own output and returns None,
# so it does not need to be wrapped in print()
classifier.show_most_informative_features(10)

Accuracy is: 0.9956666666666667
Most Informative Features
                      :( = True           Negati : Positi =   2080.0 : 1.0
                      :) = True           Positi : Negati =    993.0 : 1.0
                     sad = True           Negati : Positi =     25.7 : 1.0
                      ff = True           Positi : Negati =     24.9 : 1.0
                     bam = True           Positi : Negati =     17.6 : 1.0
                    blog = True           Positi : Negati =     14.9 : 1.0
                 welcome = True           Positi : Negati =     14.7 : 1.0
               goodnight = True           Positi : Negati =     13.6 : 1.0
               community = True           Positi : Negati =     12.9 : 1.0
                 awesome = True           Positi : Negati =     12.5 : 1.0


**Here we use the Logistic Regression classifier instead**

In [64]:
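from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression

# Wrap the scikit-learn estimator so it accepts NLTK-style feature dictionaries
logistic_regression_classifier = SklearnClassifier(LogisticRegression())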
logistic_regression_classifier.train(train_data)

print("Accuracy is:", classify.accuracy(logistic_regression_classifier, test_data))

coefficients = logistic_regression_classifier._clf.coef_[0]
feature_names = logistic_regression_classifier._vectorizer.feature_names_
features_with_coefficients = list(zip(coefficients, feature_names))
features_with_coefficients.sort(reverse=True)

print("Most Informative Features:")
for coef, feat in features_with_coefficients[:10]:
    print(f"{feat} = {coef}")


Accuracy is: 0.997
Most Informative Features:
:) = 5.487644855817085
:-) = 4.236041086097078
:d = 4.177042207801741
:p = 2.8747290530812415
>:d = 1.277770821614759
catch = 1.0721826145804074
thank = 0.7230743749758483
dots = 0.6396696845806544
braindots = 0.6396696845806544
brain = 0.6334698018980282


**The accuracy of the Logistic Regression classifier is higher than that of NaiveBayesClassifier**

**Let's try cross-validation**

In [75]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

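# NOTE: X (feature matrix) and y (labels) are not defined in this excerpt;
# they must be built before this cell is run.
logistic_regression = LogisticRegression()
k_fold = StratifiedKFold(n_splits=7, shuffle=True, random_state=101)
cv_scores = cross_val_score(logistic_regression, X, y, cv=k_fold, scoring='accuracy')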

print("Cross-validation scores:", cv_scores)
print("Mean Accuracy:", cv_scores.mean())
print("Standard Deviation of Accuracy:", cv_scores.std())


Cross-validation scores: [0.75227432 0.74107768 0.74737579 0.7396781  0.73319328 0.73319328
 0.73739496]
Mean Accuracy: 0.7405981986916529
Standard Deviation of Accuracy: 0.006582123207988753


**Here we can draw the following conclusions (I first called these suggested improvements, but they are really just conclusions):**

*   **the variance may be high, and as a result the model may be less accurate on other data**
*   **averaging the results over several test sets gives a more reliable estimate of the model**
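The averaging behind the cross-validation summary can be reproduced with the standard library. The sketch below reuses the seven fold scores printed above; the population standard deviation (`pstdev`) is used because it matches the default behaviour of NumPy's `std`.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross-validation output above
cv_scores = [0.75227432, 0.74107768, 0.74737579, 0.7396781,
             0.73319328, 0.73319328, 0.73739496]

print("Mean accuracy:", round(mean(cv_scores), 4))
print("Standard deviation:", round(pstdev(cv_scores), 4))
print("Best fold minus worst fold:", round(max(cv_scores) - min(cv_scores), 4))
```

A single train/test split would report only one of these fold-level numbers, so the mean together with the spread is the more informative summary.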


Note the Positive:Negative ratios in the Naive Bayes "Most Informative Features" output above.
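A ratio such as `25.7 : 1.0` for `sad` says that the feature is about 25.7 times as likely under one label as under the other. The sketch below shows how such a ratio could be computed from raw counts. It assumes add-0.5 smoothing over two possible feature values, which is my understanding of NLTK's default estimator rather than something stated in this notebook; the counts are made up, and both function names are hypothetical.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Add-gamma estimate of P(feature is present | label)."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely the feature is under label A than under label B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: the token appears in 50 of 3500 negative tweets
# and in 2 of 3500 positive tweets
ratio = likelihood_ratio(50, 3500, 2, 3500)
print(f"Negative : Positive = {ratio:.1f} : 1.0")
```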

Let's check a test phrase. First, download the punkt sentence tokenizer ([docs here](https://www.nltk.org/api/nltk.tokenize.punkt.html)).

In [50]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Now we won't rely on `twitter_samples.tokenized`, but rather will use a generic tokenization routine - `word_tokenize`.

In [51]:
from nltk.tokenize import word_tokenize

custom_tweet = "the service was so bad"

custom_tokens = process_tokens(word_tokenize(custom_tweet))

print(classifier.classify(get_token_dict(custom_tokens)))

Negative


Let's package it as a function:

In [52]:
def get_sentiment(text):
    custom_tokens = process_tokens(word_tokenize(text))
    return classifier.classify(get_token_dict(custom_tokens))

texts = ["bad", "service is bad", "service is really bad", "service is so terrible", "great service", "they stole my money"]
for t in texts:
    print(t, ": ", get_sentiment(t))


bad :  Negative
service is bad :  Negative
service is really bad :  Negative
service is so terrible :  Negative
great service :  Positive
they stole my money :  Negative


Seems ok!