# Playing with concepts

In this activity you are meant to play around with the concepts and the Python code, explore libraries, run experiments, and in general check theory against practice and reality. You can use any tool you want, including LLMs. Try to come away with some findings for most of the concepts we have seen. Use libraries such as NLTK and spaCy, and research how to implement the theory we have covered, such as n-grams, Naive Bayes, and language models.

Yes, this is very similar to exercise S05_3, so if you have already started that one, you can build on the code you already have.

In [1]:
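import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data. If it is missing, run
# nltk.download("punkt") once (recent NLTK releases ask for "punkt_tab" instead).

text = "Natural language processing is a fascinating field of study."

# Split the sentence into word and punctuation tokens.
tokens = word_tokenize(text)

# Slide a window of size n over the tokens to build the n-grams.
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print("Bigrams:", bigrams)
print("Trigrams:", trigrams)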

Bigrams: [('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'a'), ('a', 'fascinating'), ('fascinating', 'field'), ('field', 'of'), ('of', 'study'), ('study', '.')]
Trigrams: [('Natural', 'language', 'processing'), ('language', 'processing', 'is'), ('processing', 'is', 'a'), ('is', 'a', 'fascinating'), ('a', 'fascinating', 'field'), ('fascinating', 'field', 'of'), ('field', 'of', 'study'), ('of', 'study', '.')]
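
The n-gram lists above are the raw material for a language model. As a minimal sketch of that next step, the cell below estimates bigram probabilities P(w2 | w1) by relative frequency (maximum likelihood, no smoothing). The helper name `bigram_mle` and the toy sentence are made up for illustration, and plain `split()` stands in for `word_tokenize` so the cell runs without any NLTK data.

```python
from collections import Counter

def bigram_mle(tokens):
    """Map each bigram (w1, w2) to P(w2 | w1), estimated by relative frequency."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # A token only counts as context where it starts a bigram,
    # i.e. at every position except the last one.
    context_counts = Counter(tokens[:-1])
    return {
        (w1, w2): count / context_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }

# Hypothetical toy corpus, chosen so that some bigrams repeat.
tokens = "the cat sat on the mat and the cat slept".split()
probs = bigram_mle(tokens)

print(probs[("the", "cat")])  # "the" is followed by "cat" in 2 of its 3 occurrences
print(probs[("cat", "sat")])  # "cat" is followed by "sat" in 1 of its 2 occurrences
```

Any bigram that never occurs in the corpus is simply absent from the dictionary, which is the zero-probability problem that smoothing is meant to address.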


In [2]:
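from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy
import random

# Requires the corpus data: run nltk.download("movie_reviews") once if it is missing.

# Build (word list, label) pairs for every review, then shuffle them.
# No random seed is set, so the accuracy below changes from run to run.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Vocabulary used as features.
# Note: on Python 3.7+ list(all_words) follows first-seen order, so this keeps
# the first 2000 distinct words rather than the 2000 most frequent ones;
# all_words.most_common(2000) would give the latter.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    """Return one boolean feature per vocabulary word: is it in the document?"""
    document_words = set(document)  # set membership is faster than list membership
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]

# Hold out the first 100 shuffled reviews for testing and train on the rest.
train_set, test_set = featuresets[100:], featuresets[:100]

classifier = NaiveBayesClassifier.train(train_set)

print("Accuracy:", accuracy(classifier, test_set))

# Features whose presence shifts the odds most strongly toward one label.
classifier.show_most_informative_features(10)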

Accuracy: 0.81
Most Informative Features
        contains(seagal) = True              neg : pos    =     13.3 : 1.0
   contains(outstanding) = True              pos : neg    =     11.1 : 1.0
         contains(mulan) = True              pos : neg    =      8.5 : 1.0
   contains(wonderfully) = True              pos : neg    =      7.7 : 1.0
          contains(lame) = True              neg : pos    =      6.0 : 1.0
         contains(flynt) = True              pos : neg    =      5.8 : 1.0
         contains(damon) = True              pos : neg    =      5.8 : 1.0
         contains(awful) = True              neg : pos    =      5.3 : 1.0
        contains(wasted) = True              neg : pos    =      5.3 : 1.0
    contains(ridiculous) = True              neg : pos    =      5.0 : 1.0
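
To check the theory behind the classifier by hand, here is a minimal sketch of Naive Bayes written from scratch. It is not NLTK's implementation: it uses a multinomial model with add-one (Laplace) smoothing instead of NLTK's boolean features and default estimator. The function names `train_nb` and `predict_nb` and the four tiny training "reviews" are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(tokens, class_counts, word_counts, vocab):
    """Return the label with the highest log prior plus summed log likelihoods."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokens:
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the score.
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data, not taken from movie_reviews.
train = [
    ("a wonderfully outstanding film".split(), "pos"),
    ("outstanding cast and wonderfully told".split(), "pos"),
    ("a lame and ridiculous plot".split(), "neg"),
    ("awful acting and wasted talent".split(), "neg"),
]
model = train_nb(train)

print(predict_nb("an outstanding film".split(), *model))  # pos
print(predict_nb("lame and awful".split(), *model))       # neg
```

Working in log space avoids multiplying many small probabilities together, which would underflow on documents of realistic length.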
