### The code below uses the scikit-learn (sklearn) library to simplify the process from the previous example considerably.
### We can fit multiple documents to the BoW model, get the vocabulary, generate the document vectors, and compare document similarity.
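
To make the "compare document similarity" step less of a black box, here is a minimal sketch of cosine similarity computed by hand with NumPy. The `toy_vectors` matrix and the `manual_cosine_similarity` helper are hypothetical names introduced only for illustration; they are not part of sklearn. The sketch assumes no document vector is all zeros, since that would cause a division by zero.

```python
import numpy as np

# Hypothetical toy count vectors: rows are documents, columns are vocabulary terms
toy_vectors = np.array([
    [1, 1, 0, 2],
    [1, 0, 1, 0],
    [2, 2, 0, 4],  # same direction as the first row, just twice the counts
])

def manual_cosine_similarity(matrix):
    """Return the pairwise cosine similarity between the rows of `matrix`."""
    matrix = np.asarray(matrix, dtype=float)
    # Length (L2 norm) of each row; assumes no row is all zeros
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Scale every row to unit length
    normalized = matrix / norms
    # The dot product of unit vectors is the cosine of the angle between them
    return normalized @ normalized.T

similarities = manual_cosine_similarity(toy_vectors)
print(np.round(similarities, 3))
```

The first and third rows point in the same direction, so their similarity is 1 even though the raw counts differ. Calling sklearn's `cosine_similarity` on the same matrix should produce the same values.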

In [1]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents - For the sake of brevity, each of these sentences is a "document"
documents = [
    "This is an example of using the sklearn tookit to produce a vocabulary and output features.",
    "Each of these sentences is a document and each will represent how we can have multiple documents.",
    "This example shows how much easier it is to produce a vocabulary and get features using a toolkit like sklearn!",
    "I hope we see how much easier it is to use existing toolkits rather than re-inventing the wheel",
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer to the documents and transform them into count vectors. Remember building these vectors manually before? This is much easier.
document_vectors = vectorizer.fit_transform(documents)

# Get the vocabulary (a dict mapping each word to its column index) from the vectorizer
vocabulary = vectorizer.vocabulary_

# Print the vocabulary
print("Vocabulary:", vocabulary)

# Convert the document vectors to a dense numpy array
document_vectors = document_vectors.toarray()

# Print the document vectors
for i, document in enumerate(documents):
    print("Document:", document)
    print("Vector:", document_vectors[i])
    print()

# Get the feature names (words, ordered by column index) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Print the feature names
print("Feature Names:", feature_names)

# Compute pairwise cosine similarity
cosine_similarities = cosine_similarity(document_vectors)

# Print the pairwise cosine similarity
num_documents = len(documents)
for i in range(num_documents):
    for j in range(i+1, num_documents):
        similarity = cosine_similarities[i, j]
        print(f"Similarity between Document {i+1} and Document {j+1}: {similarity}")

Vocabulary: {'this': 33, 'is': 15, 'an': 0, 'example': 7, 'of': 20, 'using': 39, 'the': 31, 'sklearn': 29, 'tookit': 35, 'to': 34, 'produce': 22, 'vocabulary': 40, 'and': 1, 'output': 21, 'features': 9, 'each': 5, 'these': 32, 'sentences': 27, 'document': 3, 'will': 43, 'represent': 25, 'how': 13, 'we': 41, 'can': 2, 'have': 11, 'multiple': 19, 'documents': 4, 'shows': 28, 'much': 18, 'easier': 6, 'it': 16, 'get': 10, 'toolkit': 36, 'like': 17, 'hope': 12, 'see': 26, 'use': 38, 'existing': 8, 'toolkits': 37, 'rather': 23, 'than': 30, 're': 24, 'inventing': 14, 'wheel': 42}
Document: This is an example of using the sklearn tookit to produce a vocabulary and output features.
Vector: [1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 0 1 1 1 0
 0 0 1 1 0 0 0]

Document: Each of these sentences is a document and each will represent how we can have multiple documents.
Vector: [0 1 1 1 1 2 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0
 0 0 0 0 1 0 1]

Document: 