1. Difference between `CountVectorizer.transform()` and `collections.Counter`

In [78]:
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()

sentence = 'The quick brown fox jumps over the lazy dog'
tokens = sentence.split()
print('Vocabulary size:', len(set(tokens)))

counts = Counter(tokens)
counts_vectorized = count_vectorizer.fit_transform(tokens)

print(counts)
print(counts_vectorized.toarray())

Vocabulary size: 9
Counter({'The': 1, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1, 'over': 1, 'the': 1, 'lazy': 1, 'dog': 1})
[[0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 1 0]
 [1 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 1 0 0 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1]
 [0 0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0]]


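`CountVectorizer` lowercases the text, builds a sorted vocabulary, and returns a sparse document-term matrix with one row per input document and one column per vocabulary term. Because `fit_transform` received the list of tokens here, each token is treated as its own document, so the output is a 9 x 8 one-hot matrix: `The` and `the` share a column after lowercasing. `collections.Counter` returns a dictionary that maps each token to its count in the sentence. It is case-sensitive, so `The` and `the` are counted separately.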
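To compare the two on equal terms, the sketch below passes the whole sentence as a single document and lowercases the text before handing it to `Counter`. The variable names are illustrative; `vocabulary_` is the fitted term-to-column mapping of `CountVectorizer`.

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

sentence = 'The quick brown fox jumps over the lazy dog'

# Pass the whole sentence as a single document (a list with one string).
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform([sentence])

# vocabulary_ maps each term to its column index; sort the terms by that index.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
vector_counts = dict(zip(terms, bow.toarray()[0].tolist()))

# Lowercase before counting so Counter matches CountVectorizer's default behavior.
counter_counts = Counter(sentence.lower().split())

print(vector_counts)
print(vector_counts == dict(counter_counts))
```

With one document as input, the matrix has a single row, and that row holds the same counts as the `Counter`, with `the` counted twice.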

2. Can you use `TfidfVectorizer` on a large corpus (more than 1M documents) with a huge vocabulary (more than 1M tokens)? What problems do you expect to encounter?

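The short answer is no, at least not comfortably on a single machine. `TfidfVectorizer` returns a sparse matrix, so the zeros are not stored, but on a large corpus with a huge vocabulary the nonzero entries alone make for a very large matrix, and the vectorizer also keeps the whole vocabulary dictionary in memory while fitting. The resulting vectors have more than 1M dimensions, which slows down any similarity search or model built on top of them.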
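A back-of-the-envelope estimate makes the size concrete. The figure of 100 distinct terms per document is an assumption chosen for illustration, as are the 8-byte values and 4-byte indices of a typical CSR layout. The estimate ignores the vocabulary dictionary itself.

```python
n_docs = 1_000_000
n_terms = 1_000_000
nnz_per_doc = 100  # assumed average number of distinct terms per document

# Dense storage: one float64 (8 bytes) per cell.
dense_bytes = n_docs * n_terms * 8

# CSR storage: one float64 value and one int32 column index per nonzero
# entry, plus one int32 row pointer per row (and one extra at the end).
nnz = n_docs * nnz_per_doc
csr_bytes = nnz * (8 + 4) + (n_docs + 1) * 4

print(f"dense: {dense_bytes / 1e12:.1f} TB")
print(f"CSR:   {csr_bytes / 1e9:.2f} GB")
```

Under these assumptions the dense matrix would need about 8 TB, while the sparse one needs roughly 1.2 GB before counting the vocabulary and any intermediate copies.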

3. Think of an example of a corpus or task where term frequency (TF) will perform better than TF-IDF.

Technical manuals will benefit from TF because the relevant terms are likely to be concentrated in a single document. For example, suppose a PC manual consists of four documents: CPU, GPU, RAM, and peripherals. The CPU document will contain terms such as "transistor", "register", "x86", and "FLOPS", which are rare or even absent in the other documents. Searching for CPU topics with TF will therefore find the right document, and it will perform better than TF-IDF in the sense that it skips the extra processing for IDF.
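The intuition can be checked on a toy example: for a single-term query, IDF multiplies every document's score by the same factor, so the ranking does not change. The sketch below uses a hypothetical miniature manual and the unsmoothed textbook formula `log(N / df)`, not scikit-learn's smoothed and normalized variant.

```python
import math

# Hypothetical miniature manual; tokens are already split and lowercased.
docs = {
    "CPU": "register transistor register cache clock".split(),
    "GPU": "shader texture cache clock".split(),
    "RAM": "memory timings cache clock".split(),
}

def tf(term, tokens):
    # Term frequency: occurrences divided by the document length.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Textbook inverse document frequency; 0.0 if the term never occurs.
    df = sum(term in tokens for tokens in docs.values())
    return math.log(len(docs) / df) if df else 0.0

def rank(term, use_idf):
    # Score every document for a single-term query and sort best-first.
    weight = idf(term) if use_idf else 1.0
    scores = {title: tf(term, tokens) * weight for title, tokens in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank("register", use_idf=False))
print(rank("register", use_idf=True))
```

Both calls put the CPU document first. IDF starts to matter only when a query mixes terms of different rarity across the corpus.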

In [79]:
corpus = [
    {
        "title": "CPU",
        "content": """
            The Central Processing Unit (CPU) is the brain of your computer. Our latest model features:
            - Advanced x86-64 Intel/AMD architecture
            - 8 physical cores with 16 threads via hyper-threading
            - 3.5 GHz base clock, up to 5.0 GHz with Turbo Boost
            - 16 MB L3 cache, 512 KB L2 cache per core
            - Support for DDR4-3200 memory
            - 14nm manufacturing process with over 10 billion transistors
            - Integrated heat spreader for optimal thermal management
            - Advanced vector extensions (AVX) for enhanced FLOPS performance
            - Hardware-level virtualization support
            - Secure enclave for encrypted operations
            - Compatible with LGA 1200 socket motherboards
            The CPU's arithmetic logic unit (ALU) performs calculations, while the floating-point unit (FPU) handles decimal computations. The control unit manages instruction flow through the pipeline, optimizing IPC (Instructions Per Clock) for maximum efficiency.
        """,
    },
    {
        "title": "GPU",
        "content": """
            The Graphics Processing Unit (GPU) is responsible for rendering images, video, and 3D graphics. Key features include:
            - NVIDIA Ampere architecture
            - 10 GB GDDR6X memory
            - 8704 CUDA cores for parallel processing
            - 1.71 GHz boost clock
            - Real-time ray tracing capabilities
            - 8K HDR gaming support
            - PCIe 4.0 interface for high-speed data transfer
            - Three DisplayPort 1.4a and one HDMI 2.1 output
            - NVIDIA DLSS (Deep Learning Super Sampling) technology
            - GPU Boost 3.0 for intelligent clock speed management
            - VR Ready for immersive gaming experiences
            The GPU's shader units handle complex lighting and texture calculations, while its raster operations pipeline (ROP) manages final pixel output to your display.
        """,
    },
    
    {
        "title": "RAM",
        "content": """
            Random Access Memory (RAM) provides fast, temporary data storage for active programs and processes. Our latest DDR4 modules offer:
            - 3200 MHz clock speed
            - CL16-18-18-38 timings for responsive performance
            - 32 GB capacity (2 x 16 GB dual-channel kit)
            - XMP 2.0 support for easy overclocking
            - Aluminum heat spreaders for effective cooling
            - 1.35V operating voltage
            - Unbuffered, non-ECC design for consumer systems
            - Lifetime warranty
            RAM communicates with the CPU via the memory controller, utilizing multi-channel architecture to maximize bandwidth. The JEDEC standard ensures compatibility across different systems.
        """,
    },
    {
        "title": "Peripherals",
        "content": """
            Enhance your computing experience with our range of peripherals:
            1. Mechanical Keyboard
            - Cherry MX Blue switches for tactile feedback
            - Full N-key rollover
            - Customizable RGB backlighting
            - Programmable macro keys
            - Detachable USB-C cable
            2. Optical Mouse
            - 16,000 DPI optical sensor
            - 1000 Hz polling rate
            - 8 programmable buttons
            - Adjustable weight system
            - Ergonomic right-handed design
            3. 4K Monitor
            - 27-inch IPS panel
            - 3840 x 2160 resolution
            - 144 Hz refresh rate
            - 1 ms GTG response time
            - HDR400 certified
            - FreeSync and G-Sync compatible
            4. Webcam
            - 1080p/60fps video capture
            - Dual noise-cancelling microphones
            - Auto-focus and light correction
            - Privacy shutter
            - USB 3.0 connectivity
            These peripherals connect to your system via USB ports, utilizing plug-and-play technology for easy setup. Customization software allows for personalized configurations to suit your needs.
        """
    }
]

In [80]:
corpus_text = [doc['content'] for doc in corpus]

TF-IDF

In [81]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [82]:
def reply_tfidf(question):
    # Fit the vectorizer and vectorize the corpus (repeated on every call).
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_vectorizer.fit(corpus_text)
    tfidf_matrix = tfidf_vectorizer.transform(corpus_text)

    # Rows are L2-normalized by default, so the dot product is the cosine similarity.
    question_vectorized = tfidf_vectorizer.transform([question])
    similarity = tfidf_matrix.dot(question_vectorized.T)
    most_similar_idx = similarity.argmax()

    print('Question: ', question)
    print('Most relevant document: ', corpus[most_similar_idx]['title'])
    print()

In [83]:
import time
def test_search(search_method):
    start_time = time.time()
    search_method('x86-64 architecture')
    search_method('NVIDIA Ampere architecture')
    search_method('DDR4 memory')
    search_method('mechanical keyboard')
    search_method('1080p 60fps video capture')
    search_method('Intel')
    end_time = time.time()
    print('Time elapsed:', end_time - start_time)

test_search(reply_tfidf)

Question:  x86-64 architecture
Most relevant document:  CPU

Question:  NVIDIA Ampere architecture
Most relevant document:  GPU

Question:  DDR4 memory
Most relevant document:  RAM

Question:  mechanical keyboard
Most relevant document:  Peripherals

Question:  1080p 60fps video capture
Most relevant document:  Peripherals

Question:  Intel
Most relevant document:  CPU

Time elapsed: 0.010000228881835938


TF

In [84]:
import numpy as np
from nltk.tokenize import TreebankWordTokenizer

tb_tokenizer = TreebankWordTokenizer()

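def compute_tf(doc):
    # Tokens are case-sensitive here: 'Memory' and 'memory' are counted separately.
    word_counts = Counter(tb_tokenizer.tokenize(doc))
    # Note: counts are divided by the number of distinct tokens, not the total token count.
    return {
        word: count / len(word_counts) for word, count in word_counts.items()
    }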

def search_docs(query):
    query_tokens = tb_tokenizer.tokenize(query)
    
    doc_tfs = [compute_tf(doc) for doc in corpus_text]
    
    scores = []
    for i, doc_tf in enumerate(doc_tfs):
        score = sum(doc_tf.get(token, 0) for token in query_tokens)
        scores.append((i, score))
        
    return sorted(scores, key=lambda x: x[1], reverse=True)

def reply_tf(question):
    results = search_docs(question)
    most_relevant_idx = results[0][0]
    
    print('Question: ', question)
    print('Most relevant document:', corpus[most_relevant_idx]['title'])   
    print() 

In [85]:
test_search(reply_tf)

Question:  x86-64 architecture
Most relevant document: CPU

Question:  NVIDIA Ampere architecture
Most relevant document: GPU

Question:  DDR4 memory
Most relevant document: RAM

Question:  mechanical keyboard
Most relevant document: CPU

Question:  1080p 60fps video capture
Most relevant document: Peripherals

Question:  Intel
Most relevant document: CPU

Time elapsed: 0.006998300552368164
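
The TF search returns CPU for 'mechanical keyboard' because the tokens are compared case-sensitively. The Peripherals document contains 'Mechanical Keyboard', so no document matches the lowercase query. Every score is 0, and the stable sort keeps the first document (CPU) on top. `TfidfVectorizer` lowercases by default, which is why it finds Peripherals. A minimal sketch of the effect, using a hypothetical two-document corpus and `str.split()` as a stand-in for the Treebank tokenizer:

```python
from collections import Counter

# Hypothetical two-document corpus standing in for corpus_text.
docs = [
    "The CPU is the brain of your computer",
    "Mechanical Keyboard with Cherry MX Blue switches",
]
titles = ["CPU", "Peripherals"]

def compute_tf(doc, lowercase):
    # str.split() stands in for the Treebank tokenizer (an assumption).
    tokens = (doc.lower() if lowercase else doc).split()
    counts = Counter(tokens)
    return {tok: n / len(tokens) for tok, n in counts.items()}

def best_match(query, lowercase):
    query_tokens = (query.lower() if lowercase else query).split()
    scores = [
        sum(compute_tf(doc, lowercase).get(tok, 0) for tok in query_tokens)
        for doc in docs
    ]
    # max() returns the first index among ties, like the stable sort above.
    best = max(range(len(docs)), key=scores.__getitem__)
    return titles[best], scores

print(best_match("mechanical keyboard", lowercase=False))
print(best_match("mechanical keyboard", lowercase=True))
```

Without lowercasing, both scores are 0 and the first document wins by default. With lowercasing, only the Peripherals document scores above 0.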
