<center>
    <h1> Natural Language Processing and Large Language Models for Research Data Exploration and Analysis
 </h1> </center>

<center> <h1> Day-1: Text Classification and Sentiment Analysis using TextBlob </h1> </center>

<center> <h2> Exercise - 02 (part - 01) </h2> </center>

<center> <h4> Raghava Mukkamala (rrm.digi@cbs.dk)  </h4> </center>


### Instructions

#### Please use Python 3 for working on the following questions.



## Exercise 01 - Text Classification using NaiveBayesClassifier from textblob

Source: https://textblob.readthedocs.io/en/dev/classifiers.html

adapted by Raghava Mukkamala


Installation instructions: https://textblob.readthedocs.io/en/dev/install.html

In [None]:
# !pip install nltk
# !pip install textblob
# !pip install prettytable
# !pip install scikit-learn  # needed for classification_report further down

In [None]:
import textblob
import nltk

In [None]:
import collections
from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob
from prettytable import PrettyTable
from nltk import precision
import nltk.metrics



The classifier tokenizes text with NLTK, which requires the Punkt tokenizer models (these are tokenizer data, not a corpus). We therefore download punkt, together with punkt_tab, which recent NLTK releases look for.

More information can be found at: https://www.nltk.org/api/nltk.tokenize.punkt.html

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## Preparing training set for sentiment

In [None]:
train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]

train

[('I love this sandwich.', 'pos'),
 ('This is an amazing place!', 'pos'),
 ('I feel very good about these beers.', 'pos'),
 ('This is my best work.', 'pos'),
 ('What an awesome view', 'pos'),
 ('I do not like this restaurant', 'neg'),
 ('I am tired of this stuff.', 'neg'),
 ("I can't deal with this", 'neg'),
 ('He is my sworn enemy!', 'neg'),
 ('My boss is horrible.', 'neg')]
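
Before training, it can help to confirm that the two classes are balanced. The following is a minimal sketch using only the standard library; the list is copied from the training set above, and `label_counts` is a name introduced here for illustration.

```python
from collections import Counter

# Same ten labelled sentences as the training set above.
train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]

# Count how many training examples carry each label.
label_counts = Counter(label for _, label in train)
print(label_counts)
```

With five examples per class, neither label starts with an advantage from the class prior.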

## Build the NaiveBayesClassifier using the training set


In [None]:
cls = NaiveBayesClassifier(train)

In [None]:
test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]


In [None]:
print('classifier accuracy:', cls.accuracy(test))

classifier accuracy: 0.8333333333333334


## Agreement between human labels and classifier predictions

In [None]:
tab = PrettyTable(['text', 'human label', 'classifier prediction'])

predicted_labels = collections.defaultdict(set)

actual_labels = collections.defaultdict(set)

for i, (text, label) in enumerate(test):
    predicted = cls.classify(text)
    tab.add_row([text, label, predicted])
    # Record the test-item index under its true and its predicted label
    actual_labels[label].add(i)
    predicted_labels[predicted].add(i)


print(tab)


+---------------------------------+-------------+-----------------------+
|               text              | human label | classifier prediction |
+---------------------------------+-------------+-----------------------+
|        The beer was good.       |     pos     |          pos          |
|      I do not enjoy my job      |     neg     |          neg          |
|   I ain't feeling dandy today.  |     neg     |          neg          |
|         I feel amazing!         |     pos     |          pos          |
|    Gary is a friend of mine.    |     pos     |          neg          |
| I can't believe I'm doing this. |     neg     |          neg          |
+---------------------------------+-------------+-----------------------+
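
The accuracy reported earlier is simply the fraction of rows in this table where the two label columns agree. A minimal sketch, with the two lists copied by hand from the table rather than produced by the classifier:

```python
# Human labels and classifier predictions, copied row by row from the table above.
y_true = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']
y_pred = ['pos', 'neg', 'neg', 'pos', 'neg', 'neg']

# Accuracy = number of agreeing rows / total number of rows.
matches = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = matches / len(y_true)
print(matches, 'of', len(y_true), 'rows agree; accuracy =', accuracy)
```

Only the "Gary is a friend of mine." row disagrees, which gives 5/6.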


## Print Classification Report (precision, recall and F1-score per class)

In [None]:
from sklearn.metrics import classification_report

# Generate predictions
y_true = [label for _, label in test]  # True labels
y_pred = [cls.classify(text) for text, _ in test]  # Predicted labels

print(y_true)
print(y_pred)

# Pin the label order so that each display name maps to the intended class
report = classification_report(y_true, y_pred, labels=['neg', 'pos'], target_names=['negative', 'positive'])
print(report)

['pos', 'neg', 'neg', 'pos', 'pos', 'neg']
['pos', 'neg', 'neg', 'pos', 'neg', 'neg']
              precision    recall  f1-score   support

    negative       0.75      1.00      0.86         3
    positive       1.00      0.67      0.80         3

    accuracy                           0.83         6
   macro avg       0.88      0.83      0.83         6
weighted avg       0.88      0.83      0.83         6
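
A classification report summarises per-class scores; the underlying confusion matrix holds the raw counts those scores are derived from. The sketch below builds that matrix by hand from the two printed label lists, using only the standard library; `confusion`, `precision_neg` and `recall_pos` are names introduced for illustration.

```python
from collections import Counter

# The two lists printed above.
y_true = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']
y_pred = ['pos', 'neg', 'neg', 'pos', 'neg', 'neg']
labels = ['neg', 'pos']

# Count each (true label, predicted label) pair.
pair_counts = Counter(zip(y_true, y_pred))

# Rows are true labels, columns are predicted labels.
confusion = [[pair_counts[(t, p)] for p in labels] for t in labels]
for t, row in zip(labels, confusion):
    print('true', t, row)

# Precision for 'neg': correct 'neg' predictions / all 'neg' predictions (column 0).
precision_neg = confusion[0][0] / (confusion[0][0] + confusion[1][0])
# Recall for 'pos': correct 'pos' predictions / all true 'pos' items (row 1).
recall_pos = confusion[1][1] / sum(confusion[1])
print(precision_neg, recall_pos)
```

The single off-diagonal count is the one positive sentence predicted as negative, which explains both the 0.75 precision for negative and the 0.67 recall for positive in the report.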



## Performance measures using NLTK

In [None]:
tab2 = PrettyTable(['Label', 'precision', 'recall', 'f-measure'])

for label in actual_labels:
    tab2.add_row([label, nltk.precision(actual_labels[label], predicted_labels[label]),
                nltk.recall(actual_labels[label], predicted_labels[label]),
                nltk.f_measure(actual_labels[label], predicted_labels[label])])

print(tab2)

+-------+-----------+--------------------+--------------------+
| Label | precision |       recall       |     f-measure      |
+-------+-----------+--------------------+--------------------+
|  pos  |    1.0    | 0.6666666666666666 |        0.8         |
|  neg  |    0.75   |        1.0         | 0.8571428571428572 |
+-------+-----------+--------------------+--------------------+
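
The NLTK metrics operate on sets of item indices rather than on label lists. The sketch below re-derives the table values in plain Python under that set-based view. The index sets are written out by hand from the test results, the three functions are simplified stand-ins rather than NLTK's implementations, and empty sets are not handled.

```python
# Indices of test items per label: human labels (reference) and predictions (test).
actual = {'pos': {0, 3, 4}, 'neg': {1, 2, 5}}
predicted = {'pos': {0, 3}, 'neg': {1, 2, 4, 5}}

def precision(reference, test):
    # Share of predicted items that are truly in the class.
    return len(reference & test) / len(test)

def recall(reference, test):
    # Share of true class items that were predicted as such.
    return len(reference & test) / len(reference)

def f_measure(reference, test):
    # Harmonic mean of precision and recall.
    p, r = precision(reference, test), recall(reference, test)
    return 2 * p * r / (p + r)

scores = {
    label: (precision(actual[label], predicted[label]),
            recall(actual[label], predicted[label]),
            f_measure(actual[label], predicted[label]))
    for label in actual
}
for label, values in scores.items():
    print(label, values)
```

Note the argument order: the reference (human) set comes first and the predicted set second, matching the calls in the cell above.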


## Printing the most informative features

In [None]:
cls.show_informative_features(50)

Most Informative Features
          contains(this) = True              neg : pos    =      2.3 : 1.0
          contains(this) = False             pos : neg    =      1.8 : 1.0
          contains(This) = False             neg : pos    =      1.6 : 1.0
            contains(an) = False             neg : pos    =      1.6 : 1.0
             contains(I) = False             pos : neg    =      1.4 : 1.0
             contains(I) = True              neg : pos    =      1.4 : 1.0
            contains(He) = False             pos : neg    =      1.2 : 1.0
            contains(My) = False             pos : neg    =      1.2 : 1.0
          contains(What) = False             neg : pos    =      1.2 : 1.0
         contains(about) = False             neg : pos    =      1.2 : 1.0
            contains(am) = False             pos : neg    =      1.2 : 1.0
       contains(amazing) = False             neg : pos    =      1.2 : 1.0
       contains(awesome) = False             neg : pos    =      1.2 : 1.0
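
Each ratio in this listing compares how likely a feature value is under one label versus the other. The sketch below reproduces the first two rows by hand. It rests on two assumptions: a simple case-sensitive regex tokenizer stands in for TextBlob's own tokenization, and probabilities are smoothed by adding 0.5 to each count (expected likelihood estimation, which NLTK's Naive Bayes uses by default). `tokens` and `smoothed_prob` are hypothetical helpers, not library functions.

```python
import re

# Same training set as above.
train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]

def tokens(text):
    # Assumption: case-sensitive word tokens, punctuation dropped.
    return set(re.findall(r"[A-Za-z']+", text))

def smoothed_prob(word, label, present):
    # P(contains(word) == present | label) with add-0.5 smoothing over
    # the two possible feature values (True and False).
    docs = [tokens(text) for text, lab in train if lab == label]
    count = sum((word in doc) == present for doc in docs)
    return (count + 0.5) / (len(docs) + 2 * 0.5)

ratio_true = smoothed_prob('this', 'neg', True) / smoothed_prob('this', 'pos', True)
ratio_false = smoothed_prob('this', 'pos', False) / smoothed_prob('this', 'neg', False)
print(round(ratio_true, 1), round(ratio_false, 1))
```

Lower-case "this" occurs in three of the five negative sentences but only one positive sentence, while "This" is counted as a separate feature because the features are case-sensitive.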

## Test the classifier on new data

In [None]:
print('label for: "Their burgers are amazing" ', cls.classify("Their burgers are amazing"))
print('label for: "I don\'t like their pizza." ', cls.classify("I don't like their pizza."))

label for: "Their burgers are amazing"  pos
label for: "I don't like their pizza."  neg


## What is the sentiment of "my boss appreciated me"?

In [None]:
print('label for: "my boss appreciated me" ', cls.classify("my boss appreciated me"))

label for: "my boss appreciated me"  neg
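
One plausible reading of this counter-intuitive result is that the classifier can only use words it saw during training. The sketch below checks which words of the sentence occur in the original ten-sentence sentiment training set and under which labels. It assumes a simple case-sensitive regex tokenizer in place of TextBlob's, and `seen_in` and `unseen` are names introduced for illustration.

```python
import re

# The sentiment training set used to build cls.
train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]

def tokens(text):
    # Assumption: case-sensitive word tokens, punctuation dropped.
    return set(re.findall(r"[A-Za-z']+", text))

query = tokens("my boss appreciated me")

# Map each query word found in the training data to the labels it appeared under.
seen_in = {}
for text, label in train:
    for word in tokens(text) & query:
        seen_in.setdefault(word, set()).add(label)

unseen = query - set(seen_in)
print('seen:', seen_in)
print('unseen:', unseen)
```

Under these assumptions, "boss" appears only in a negative example and "appreciated" never appears at all, so the one word carrying positive meaning contributes no evidence. The real classifier also weighs every absent training word, so this is an illustration rather than a full account of the prediction.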


## <font color='red'>Task - 01:</font>

    Build a simple Naive Bayes classifier for a small set of emotions (e.g. fear, happiness, and sadness) using
    the TextBlob library. You can prepare a simple training set yourself, along the lines of the example
    above.

In [None]:
# YOUR SOLUTION HERE


# Create a list of labels



# Create a training set





In [None]:
# Create the Naive Bayes Classifier




In [None]:
# Create the test dataset





In [None]:
# Print the accuracy of the model






In [None]:
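# SOLUTION FOR TASK - 01:

# Emotion labels named in the task. Note that the training set below only
# introduces 'fear'; its other examples still carry the 'pos'/'neg' labels
# from the sentiment example, so this classifier predicts fear, pos or neg.
emotion_labels = ['fear', 'happiness', 'sadness']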

train = [
    ('I am scared of fear.', 'fear'),
    ('I hate ghost movies.', 'fear'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]





In [None]:
cls_emotions = NaiveBayesClassifier(train)

In [None]:
test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]


In [None]:
print('classifier accuracy:', cls_emotions.accuracy(test))

classifier accuracy: 0.8333333333333334


In [None]:
print('label for: "I dont like ghost movies." ', cls_emotions.classify("I dont like ghost movies"))

label for: "I dont like ghost movies."  fear
