In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("nlp is fun!")
for token in doc:
    explanation = spacy.explain(token.pos_)
    print(token.text, token.lemma_, token.is_alpha, token.is_stop, token.pos_, end=" ")
    print(f"({explanation})")

nlp nlp True False PROPN (proper noun)
is be True True AUX (auxiliary)
fun fun True False ADJ (adjective)
! ! False False PUNCT (punctuation)


# 1) Text normalization

First of all, removing too much from the words/tokens is risky for a keyboard, because the model has to predict the words people actually type. Punctuation can likely be omitted: commas and periods occur after a wide range of words, with few obvious patterns.

Stopword filtering is also not useful, as we want to predict words like "is", "the", etc.

Lower/uppercasing, however, is a normalization step we can likely include. Lowercasing merges variants such as "The" and "the", so each word form gets more training examples and the model finds more matches.
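
As a rough illustration of the normalization described so far, the sketch below lowercases the text and drops punctuation while keeping stop words. The `normalize` helper and its regex tokenizer are assumptions made for this example, not part of the exercise code.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase the text and keep only word tokens (stop words included)."""
    # Assumption: a simple regex tokenizer; apostrophes stay inside a word.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

text = "The cat is here. Is THE dog here, too?"

# Without normalization, "The" and "THE" never match "the".
print(Counter(text.split())["the"])
# After lowercasing, both occurrences are counted as the same word.
print(Counter(normalize(text))["the"])
print(normalize(text))
```

Stop words such as "is" and "the" survive on purpose, since they are exactly the kind of words a keyboard should be able to predict.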

Lemmatization/stemming could be used in some cases, but this may depend on your domain and other factors. Consider the sentences "I'm studying at NTNU" and "I often study late at night". If "studying" is reduced to "study", both sentences contribute to the same context, which could help the model predict both "at" and "late" as the next word after "study".
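
The effect on the two example sentences can be sketched with simple next-word counts. The `LEMMAS` table below is a hypothetical stand-in for a real lemmatizer or stemmer, and `next_word_counts` is a helper invented for this illustration.

```python
from collections import Counter, defaultdict

# Hypothetical lemma table for illustration only; a real pipeline would use
# a lemmatizer or stemmer instead of a hand-written dictionary.
LEMMAS = {"studying": "study"}

def next_word_counts(sentences, lemmatize):
    """Count which word follows each word, optionally after lemmatizing."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        if lemmatize:
            tokens = [LEMMAS.get(t, t) for t in tokens]
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts

sentences = ["I'm studying at NTNU", "I often study late at night"]
raw = next_word_counts(sentences, lemmatize=False)
merged = next_word_counts(sentences, lemmatize=True)

print(dict(raw["study"]))     # only "late" has been seen after "study"
print(dict(merged["study"]))  # "at" and "late" have both been seen
```

The trade-off is that the model then sees and predicts lemmas, so a keyboard would still need a way to produce the inflected form the user wants.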


# 2) TF-IDF
## 2.1)

In [2]:
import nltk
from math import log10

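def tokenize(document):
    # Lowercase so that tf and idf treat case the same way; keep alphabetic tokens only
    tokens = nltk.word_tokenize(document.lower())
    return [t for t in tokens if t.isalpha()]

def tf(document, term):
    # Raw count of the term in the document
    freq = nltk.FreqDist(tokenize(document))
    return freq[term.lower()]

def idf(documents, term):
    num_docs = len(documents)
    # Match whole tokens, not substrings ("cat" should not match "cats")
    num_docs_with_term = sum(term.lower() in tokenize(d) for d in documents)
    if num_docs_with_term == 0:
        return 0.0  # term never occurs in the corpus, so avoid dividing by zero

    return log10(num_docs / num_docs_with_term + 1)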

def tf_idf(all_documents, document, term):
    _tf = tf(document, term)
    _idf = idf(all_documents, term)
    # print(f"TF: {_tf}, IDF: {_idf}")
    return _tf * _idf

d1 = "I love cats"
d2 = "I love dogs"
d3 = "I love cats, but I also like dogs"

documents = [d1, d2, d3]

print(f"TF-IDF for 'love' is: {tf_idf(documents, d3, 'love')}")
print(f"TF-IDF for 'like' is: {tf_idf(documents, d3, 'like')}")

TF-IDF for 'love' is: 0.3010299956639812
TF-IDF for 'like' is: 0.6020599913279624
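
The two printed values can be checked by hand. The notation below is introduced here for convenience: N is the number of documents, tf(t, d) is the count of term t in document d, and df(t) is the number of documents that contain t. Note that the code adds 1 inside the logarithm, which is one of several common smoothing variants.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log_{10}\!\left(\frac{N}{\mathrm{df}(t)} + 1\right)

\mathrm{tfidf}(\text{love}, d_3) = 1 \cdot \log_{10}\!\left(\tfrac{3}{3} + 1\right) = \log_{10} 2 \approx 0.301

\mathrm{tfidf}(\text{like}, d_3) = 1 \cdot \log_{10}\!\left(\tfrac{3}{1} + 1\right) = \log_{10} 4 \approx 0.602
```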


## 2.2) 
I would not replace words with their TF-IDF values without doing quite a bit of preprocessing first, such as extracting part-of-speech tags and stems. There are many linguistic features to extract.

TF-IDF values can, however, be used directly in certain applications, such as search and information retrieval.
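
As a minimal sketch of the retrieval use case, the example below ranks the three example documents against a query by summing the TF-IDF scores of the query terms. The `tokenize` and `rank` helpers are assumptions made for this illustration; they use a plain regex tokenizer instead of NLTK so the block runs on its own, and the same `log10(N / df + 1)` weighting as the code above.

```python
import re
from math import log10

def tokenize(text):
    # Assumption: lowercase alphabetic tokens are good enough for this toy corpus
    return re.findall(r"[a-z]+", text.lower())

def rank(documents, query):
    """Return document indices ordered by summed TF-IDF of the query terms."""
    tokenized = [tokenize(d) for d in documents]
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in tokenize(query):
            df = sum(term in doc_tokens for doc_tokens in tokenized)
            if df == 0:
                continue  # term is absent from the corpus and contributes nothing
            score += tokens.count(term) * log10(len(documents) / df + 1)
        scores.append(score)
    # Best match first
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)

documents = ["I love cats", "I love dogs", "I love cats, but I also like dogs"]
print(rank(documents, "like dogs"))
```

The third document ranks first because it contains both query terms, and "like" is the rarer of the two.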


## 2.3)
You should use the logarithm of the inverse document frequency because the raw ratio can become very large when you work with a large corpus containing, e.g., millions of documents. The logarithm dampens these values so that very rare terms do not completely dominate the score.
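
A quick way to see the damping effect is to compare the raw ratio with its logarithm for a corpus of one million documents. The corpus size and document counts below are made-up numbers for illustration, and the plain ratio is used here without the `+ 1` from the code above.

```python
from math import log10

num_docs = 1_000_000  # hypothetical corpus size

for docs_with_term in (1, 1_000, 1_000_000):
    ratio = num_docs / docs_with_term
    # The raw ratio spans six orders of magnitude; its log10 stays in a small range
    print(f"{docs_with_term:>9} docs: raw ratio {ratio:>9.0f}, log10 {log10(ratio):.1f}")
```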

# 3) Part-of-speech tagging
We'll now compare part-of-speech taggers from NLTK and spaCy.

In [3]:
sentence = "I saw her duck"

nltk.pos_tag(nltk.word_tokenize(sentence), tagset="universal")

[('I', 'PRON'), ('saw', 'VERB'), ('her', 'PRON'), ('duck', 'NOUN')]

In [4]:
data = nltk.corpus.brown.tagged_sents(tagset="universal")

# Fall back to NOUN for unknown words; 'NN' is not a tag in the universal tagset
backoff = nltk.DefaultTagger('NOUN')
unigram_tagger = nltk.UnigramTagger(train=data, backoff=backoff)

In [5]:
unigram_tagger.tag(nltk.word_tokenize(sentence))

[('I', 'PRON'), ('saw', 'VERB'), ('her', 'DET'), ('duck', 'VERB')]

In [6]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
[(token.text, token.pos_) for token in doc]

[('I', 'PRON'), ('saw', 'VERB'), ('her', 'PRON'), ('duck', 'NOUN')]

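The unigram tagger assigns each word the tag it has most often in the training data and ignores the surrounding words, so it has no context for the word "duck". Apparently "duck" is used as a verb more often than not in the corpus. NLTK's default tagger and spaCy both take context into account and tag "duck" as a noun here.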
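
The behaviour of a unigram tagger can be sketched in a few lines of plain Python. The training sentences below are a hypothetical toy corpus, not data from Brown, and `train_unigram` and `tag` are helpers invented for this illustration rather than NLTK's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data, NOT taken from the Brown corpus.
tagged_sentences = [
    [("they", "PRON"), ("duck", "VERB"), ("quickly", "ADV")],
    [("you", "PRON"), ("must", "VERB"), ("duck", "VERB")],
    [("her", "DET"), ("duck", "NOUN"), ("swims", "VERB")],
]

def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # Keep only the single most frequent tag per word
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words, default="NOUN"):
    # Every word is looked up on its own; neighbouring words are never consulted
    return [(w, model.get(w.lower(), default)) for w in words]

model = train_unigram(tagged_sentences)
print(tag(model, ["her", "duck"]))
print(tag(model, ["they", "duck"]))
```

Because "duck" appears as a verb twice and as a noun once in this toy data, it is tagged VERB in both calls, even after "her". A tagger that conditions on the previous word or tag could tell the two cases apart.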