### Count Vectorizer

In machine learning and natural language processing, "Count Vectorizer" refers
to the method of converting a collection of text documents into a matrix of token counts.

This process involves two steps: "fit" (learn the vocabulary) and "transform" (count the tokens in each document).
scikit-learn's `CountVectorizer` combines both in its `fit_transform` method.

### Vocabulary

First, learn the vocabulary (the unique words) from the provided text data
by collecting the set of all unique tokens across all documents.

```python
# Sample text strings
a = 'London Paris London'
b = 'Paris Paris London'

def create_vocabulary(texts):
    """Return the set of unique tokens found across all documents."""
    vocabulary = set()

    for t in texts:
        # Split each document into tokens on whitespace
        words = t.split()

        # Add each token to the set; duplicates are ignored
        for w in words:
            vocabulary.add(w)
    return vocabulary

vocabulary = create_vocabulary([a, b])
print(vocabulary)  # a set is unordered, so the print order may vary
```

```
{'Paris', 'London'}
```


### Token counts

Next, convert the text documents into a numerical format: a token count matrix with one row per document and one column per vocabulary token.

```python
def fit_transform(texts):
    # Sort the vocabulary so the column order is deterministic
    # (a set has no defined order)
    vocabulary = sorted(create_vocabulary(texts))
    # Map each token to its column index
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []

    for t in texts:
        # One count per vocabulary token, in column order
        count_vector = [0] * len(vocabulary)

        for word in t.split():
            count_vector[index[word]] += 1

        matrix.append(count_vector)
    return matrix

# Sample text strings
a = 'London Paris London'
b = 'Paris Paris London'

# Get the frequency matrix (columns: 'London', 'Paris')
matrix = fit_transform([a, b])
print(matrix)
```

```
[[2, 1], [1, 2]]
```
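
The "fit" and "transform" steps can also be kept separate, which makes it possible to reuse a learned vocabulary on new documents. The sketch below is one way to do that in plain Python. The class name `SimpleCountVectorizer` is hypothetical, the vocabulary is sorted so that the columns are `London`, `Paris`, and ignoring tokens not seen during `fit` is an assumption of this sketch.

```python
class SimpleCountVectorizer:
    """Hypothetical minimal vectorizer; not scikit-learn's implementation."""

    def fit(self, texts):
        # Learn the vocabulary: map each unique token to a column index
        tokens = set()
        for t in texts:
            tokens.update(t.split())
        self.vocabulary_ = {word: i for i, word in enumerate(sorted(tokens))}
        return self

    def transform(self, texts):
        # Count tokens using the vocabulary learned in fit()
        matrix = []
        for t in texts:
            row = [0] * len(self.vocabulary_)
            for word in t.split():
                i = self.vocabulary_.get(word)
                if i is not None:  # skip tokens not seen during fit (assumption)
                    row[i] += 1
            matrix.append(row)
        return matrix

    def fit_transform(self, texts):
        return self.fit(texts).transform(texts)


vectorizer = SimpleCountVectorizer()
train = vectorizer.fit_transform(['London Paris London', 'Paris Paris London'])
new = vectorizer.transform(['London Berlin'])
print(train)
print(new)
```

Here `'Berlin'` is not in the learned vocabulary, so it contributes nothing to the second matrix.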
