# Illustration of Vector Space Model Vectorizers

### 1. Data collection

For this "toy" example, we consider a few lines from the book "A Tale of Two Cities" by Charles Dickens.

- It was the best of times,
- it was the worst of times,
- it was the age of wisdom,
- it was the age of foolishness,

Each line is treated as a separate document, and the four lines together form the whole collection.

### 2. Preprocessing

Preprocessing consists of three steps: removing punctuation, lowercasing, and tokenizing. The resulting vocabulary consists of the following 10 words:

   - “it”
   - “was”
   - “the”
   - “best”
   - “of”
   - “times”
   - “worst”
   - “age”
   - “wisdom”
   - “foolishness”


In [None]:
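import string
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data; fetch it if it is missing
# (older NLTK releases use 'punkt', recent ones use 'punkt_tab')
for resource in ('punkt', 'punkt_tab'):
    nltk.download(resource, quiet=True)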

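# Note: unlike the quotation above, the first three documents contain "the" twice,
# so their term counts differ from their binary indicators
collection = [
    "It was the best of the times,",
    "it was the worst of the times,",
    "it was the age of the wisdom,",
    "it was the age of foolishness,",
]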

# removing punctuation
coll_nopunct = [sent.translate(str.maketrans('', '', string.punctuation)) for sent in collection]

# lowercasing 
coll_lower = [sent.lower() for sent in coll_nopunct]

# tokenizing
coll_ready = [' '.join(word_tokenize(sent)) for sent in coll_lower]

# show the preprocessed collection, one document per line
for doc in coll_ready:
    print(doc)
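
The 10-word vocabulary listed in this section can be reproduced with a few lines of standard-library Python. This is only a minimal sketch, not the notebook's own pipeline: it assumes that plain whitespace splitting is an adequate stand-in for `word_tokenize` on these simple sentences, and the helper `preprocess` and the variable `vocabulary` are names introduced here for illustration.

```python
import string

# The four documents as quoted in Section 1
collection = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

def preprocess(sentence):
    """Remove punctuation, lowercase, and split on whitespace (assumed tokenizer)."""
    no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower().split()

# Collect each distinct token once, in order of first appearance
vocabulary = []
for sentence in collection:
    for token in preprocess(sentence):
        if token not in vocabulary:
            vocabulary.append(token)

print(len(vocabulary))
print(vocabulary)
```

Keeping the words in order of first appearance makes the result easy to compare with the list above; scikit-learn's vectorizers, by contrast, order their columns alphabetically.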


### 3. Vectorizers

In this example we show the representation matrix produced by three common vectorizers: the binary vectorizer, the term frequency (tf) vectorizer, and the term frequency-inverse document frequency (tf-idf) vectorizer.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# fitting the binary vectorizer and printing the collection vocabulary
# (get_feature_names_out() requires scikit-learn >= 1.0)
vec_binar = CountVectorizer(binary=True)
X_binar = vec_binar.fit_transform(coll_ready)
print("Collection vocabulary:\n", vec_binar.get_feature_names_out(), "\n")

# printing binary vectorizer matrix
print("Binary vectorizer matrix:\n", X_binar.toarray(), "\n")

# printing tf vectorizer matrix
vec_tf = CountVectorizer()
X_tf = vec_tf.fit_transform(coll_ready)
print("Term frequency vectorizer matrix:\n", X_tf.toarray(), "\n")

# printing tf-idf vectorizer matrix
vec_tfidf = TfidfTransformer()
X_tfidf = vec_tfidf.fit_transform(X_tf)
print("Term frequency - inverse document frequency vectorizer matrix:\n", X_tfidf.toarray())
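
To make clear what the three matrices contain, the same quantities can be computed by hand. The following is a rough sketch in plain Python rather than a description of scikit-learn's internals: it assumes the documented default settings of `TfidfTransformer` (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. The names `docs`, `tokenized`, and `l2_normalize` are introduced here for illustration only.

```python
import math

# Preprocessed documents, matching the collection used in the cells above
docs = [
    "it was the best of the times",
    "it was the worst of the times",
    "it was the age of the wisdom",
    "it was the age of foolishness",
]
tokenized = [doc.split() for doc in docs]

# Alphabetically sorted vocabulary (the column order CountVectorizer also uses)
vocab = sorted({token for doc in tokenized for token in doc})

# Term frequency: raw count of each vocabulary word in each document
tf = [[doc.count(term) for term in vocab] for doc in tokenized]

# Binary: 1 if the word occurs in the document at all, else 0
binary = [[1 if count > 0 else 0 for count in row] for row in tf]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
df = [sum(row[j] for row in binary) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalize(row):
    """Scale a row to unit Euclidean length."""
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row]

# tf-idf: tf times idf, then each document row is L2-normalized
tfidf = [l2_normalize([t * w for t, w in zip(row, idf)]) for row in tf]

print("vocabulary:", vocab)
print("binary:", binary[0])
print("tf:", tf[0])
print("tf-idf:", [round(value, 3) for value in tfidf[0]])
```

Words that occur in every document ("it", "was", "the", "of") get the smallest idf, so after weighting the rarer words such as "best" or "wisdom" carry relatively more weight in their rows.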
