# Vectorising text

Vectorising is the process of converting text into a numerical form that can be used as input to a classifier or neural network. We need it because computers can't work with raw text directly! So we send our text through a preprocessing pipeline, which spits out numerical representations of the tokens in our text at the end.

There are various vectorisers. A simple one is called Bag of Words, which counts how often each word occurs in a text (or in each text of a collection of texts, called a corpus) and stores those counts in a dictionary. The issue here is that common words that don't add much to the meaning of the texts (e.g., articles, prepositions) get the largest counts, so they are weighted quite heavily.
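To make the Bag of Words idea concrete, here is a minimal sketch. It assumes plain whitespace splitting rather than the NLTK tokeniser, and both the helper name `bag_of_words` and the example sentence are made up for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lower-cased, whitespace-separated word occurs."""
    return dict(Counter(text.lower().split()))

# Hypothetical example sentence, not taken from any real corpus.
example = "the cat sat on the mat and the dog sat too"
bow = bag_of_words(example)

# The article "the" gets the largest count even though it carries little meaning.
most_common_word = max(bow, key=bow.get)
print(bow)
print(most_common_word)
```

The most frequent word here is an article, which is exactly the weakness described above.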

An improved vectoriser is called Term Frequency - Inverse Document Frequency (TF-IDF). This vectoriser takes a corpus as input and spits out a score for each token in each text of the corpus. The higher the score, the more important that token is to the meaning of that particular text. The trick here is that words are weighted not only according to their frequency in their own document, but also inversely to how many documents in the corpus they appear in. This means that words that appear often in all the texts (such as stop words) will be given a very low weighting, while words that appear frequently in only one text will be given a higher weighting, indicating that the word is relatively more important to the meaning of that text.

To assign a score to each word, this vectoriser first computes the term frequency (TF) of a word in a particular document: the number of times the word appears in the document divided by the number of words in the document. Then the inverse document frequency (IDF) is calculated by taking the logarithm of the number of documents in the corpus divided by the number of documents in which that word appears. A word that appears in every document therefore has an IDF of log(1) = 0. The TF and IDF are then multiplied together to get the TF-IDF score for that particular word in that document.
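The calculation just described can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses the natural logarithm with no smoothing, the function names `tf`, `idf` and `tf_idf` are chosen here for clarity, and the toy corpus is invented. Other implementations may use smoothed or rescaled variants.

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of term divided by the number of tokens in doc."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency: log(number of docs / docs containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    # A term that appears nowhere gets 0.0 here to avoid dividing by zero.
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Invented toy corpus: each document is already a list of tokens.
toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "sang"],
]

first_doc = toy_corpus[0]
scores = {term: tf_idf(term, first_doc, toy_corpus) for term in set(first_doc)}
for term, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{term}: {score:.4f}")
```

In this toy corpus "the" occurs in all three documents, so its score is zero, while "mat" occurs in only one document and scores highest alongside the other words unique to it.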

In [None]:
import os
import sys
import re
import nltk
from nltk import FreqDist, word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')

stop_words=set(stopwords.words("english"))

lem=WordNetLemmatizer()

def get_pos(tag):
    if re.match(r"^JJ",tag):
        return "a"
    elif re.match(r"^NN",tag) or re.match(r"^PRP",tag):
        return "n"
    elif re.match(r"^RB",tag):
        return "r"
    elif re.match(r"^VB",tag):
        return "v"
    return ""

def tokenize(doc):
    # Tag each sentence in its original word order so the POS tagger sees real context,
    # and keep repeated words so that the frequency counts stay meaningful.
    lemmed_words=[]
    for sent in sent_tokenize(doc):
        words=[word.lower() for word in word_tokenize(sent) if word.isalpha()]
        for word,tag in nltk.pos_tag(words):
            pos=get_pos(tag)
            lemmed_words.append(lem.lemmatize(word,pos=pos) if pos else lem.lemmatize(word))
    return lemmed_words

def tokenize_corpus(corpus):
    return [tokenize(doc) for doc in corpus]

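def vectorize_tf_idf(tokens):
    texts = nltk.text.TextCollection(tokens)
    for doc in tokens:
        # Score each distinct term once (in order of first appearance);
        # tf_idf() itself counts how often the term occurs in doc.
        yield {
            term: texts.tf_idf(term, doc)
            for term in dict.fromkeys(doc)
        }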

def vectorize_bow(tokens):
    for doc in tokens:
        bow = dict.fromkeys(doc,0)
        for word in doc:
            bow[word] += 1
        yield bow

def pipeline(corpus):
    return list(vectorize_tf_idf(tokenize_corpus(corpus)))

# Directory of plain-text files, one document per file. It must exist before this cell runs.
corpusdir="corpus"
corpus=[]

with os.scandir(corpusdir) as files:
    for file in files:
        if not os.path.isdir(file):
            with open(file,"r") as f:
                corpus.append(f.read())

tf_idf_words = pipeline(corpus)
for i in tf_idf_words:
    print(i)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


FileNotFoundError: [Errno 2] No such file or directory: 'corpus'

In [None]:
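import numpy as np

a={'a':1,'b':2}
# np.array(a) would wrap the whole dict in a 0-d object array, so this assumes
# the goal was a numeric vector of the dictionary's values.
b=np.array(list(a.values()))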