# Bag of Words (BoW) Model Explained

The **Bag of Words (BoW)** model is a foundational technique in natural language processing (NLP) for representing text data. It reduces a text to an unordered collection of words, discarding grammar and word order and keeping only which words occur and how often.

1. **Core Concept**
* BoW treats text as a "bag" (multiset) of words. Only the presence or frequency of words is considered, not their positions.
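The multiset idea can be sketched with Python's `collections.Counter`. The two sentences below are made-up illustrations, not data from this notebook: they contain the same words in a different order, so their bags come out identical.

```python
from collections import Counter

# Two illustrative sentences: same words, different order.
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

# A bag is a multiset: word -> count, with no positional information.
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a["the"])    # 2
print(bag_a == bag_b)  # True: word order is not part of the representation
```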

2. **Implementation Steps**
* **Tokenization**: Split text into individual words/tokens (languages written without spaces between words, such as Chinese, need a word segmentation step first).
* **Build Vocabulary**: Collect all unique words from the corpus and assign each an index.
* **Vectorization**:
    * **Frequency-based**: Count occurrences of each word in a document.
    * **Binary**: Mark presence (1) or absence (0) of words.
    * **TF-IDF**: Weight each word's frequency by how rare the word is across the corpus, which downweights common words.
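The three vectorization variants can be compared on a small count matrix. This is a minimal sketch under stated assumptions: the counts are invented for illustration, term frequency is the raw count, and inverse document frequency is the unsmoothed `ln(N / df)`. Libraries often use smoothed or normalized variants, so their numbers may differ.

```python
import numpy as np

# Toy count matrix: 3 documents (rows) x 4 vocabulary words (columns).
# The values are made up for illustration.
counts = np.array([
    [2, 1, 0, 1],
    [0, 1, 3, 1],
    [1, 0, 0, 1],
], dtype=float)

# Binary: 1 if the word occurs at all in the document, otherwise 0.
binary = (counts > 0).astype(int)

# TF-IDF, assuming tf = raw count and idf = ln(N / df).
# This form assumes every word occurs in at least one document (df > 0).
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word
idf = np.log(n_docs / doc_freq)
tfidf = counts * idf

print(binary)
print(np.round(tfidf, 3))
```

Under this definition the last word, which appears in every document, gets an IDF of `ln(3 / 3) = 0` and therefore a TF-IDF weight of zero in all rows, while the word that appears in only one document gets the largest weight.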

3. **Pros and Cons**
* Advantages:
    * Simple and fast to compute, and memory-efficient when the vectors are stored in sparse form.
    * Often a strong baseline for tasks such as text classification and sentiment analysis.
* Limitations:
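    * Ignores word order and context, so much of the semantic meaning is lost.
    * Produces high-dimensional, sparse vectors, which become a problem as the vocabulary grows.
    * Cannot represent out-of-vocabulary (OOV) words: a word that was not seen when the vocabulary was built has no index.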
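Two of these limitations are easy to see in a small sketch. The vocabulary and the helper `to_counts` below are hypothetical and exist only for this illustration. Skipping unknown words is one common way to handle OOV tokens, not the only one.

```python
# Hypothetical vocabulary, assumed to have been built from earlier training text.
vocabulary = ["bit", "dog", "man", "the"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def to_counts(text):
    """Return a count vector over the fixed vocabulary; unknown words are skipped."""
    vector = [0] * len(vocabulary)
    for token in text.split():
        index = word_to_index.get(token)
        if index is not None:
            vector[index] += 1
    return vector


# Opposite meanings, identical vectors: word order is lost.
print(to_counts("the dog bit the man"))  # [1, 1, 1, 2]
print(to_counts("the man bit the dog"))  # [1, 1, 1, 2]

# "cat" is out of vocabulary, so it leaves no trace in the vector.
print(to_counts("the cat bit the man"))  # [1, 0, 1, 2]
```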

In [1]:
import re

import numpy as np

# Sample documents (pre-segmented Chinese text, words separated by spaces)
documents = [
    '我 喜欢 编程，编程 是 一门 有趣的技术',
    '我 喜欢 旅游，旅游 可以 放松 心情',
    '编程 和 旅游 都是 我的 爱好'
]


# Tokenization: split on whitespace and punctuation, so that
# '编程，编程' yields two '编程' tokens instead of one fused token
def tokenize(documents):
    token_pattern = re.compile(r'[^\s，。！？、,.!?]+')
    return [token_pattern.findall(doc) for doc in documents]


# Build the vocabulary: every unique token, in sorted order
def build_vocabulary(tokenized_documents):
    vocabulary = set()
    for doc in tokenized_documents:
        vocabulary.update(doc)
    return sorted(vocabulary)


# Vectorization: one row per document, one column per vocabulary word
def vectorize(tokenized_documents, vocabulary):
    # Map each word to its column once instead of calling list.index() per token
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    vectors = np.zeros((len(tokenized_documents), len(vocabulary)))
    for i, doc in enumerate(tokenized_documents):
        for word in doc:
            vectors[i, word_to_index[word]] += 1
    return vectors


# Tokenization example
tokenized_documents = tokenize(documents)
print('Tokens:', tokenized_documents)

# Vocabulary example
vocabulary = build_vocabulary(tokenized_documents)
print('Vocabulary:', vocabulary)

# Vectorization example
vectors = vectorize(tokenized_documents, vocabulary)
print('Vectors:\n', vectors)

Tokens: [['我', '喜欢', '编程', '编程', '是', '一门', '有趣的技术'], ['我', '喜欢', '旅游', '旅游', '可以', '放松', '心情'], ['编程', '和', '旅游', '都是', '我的', '爱好']]
Vocabulary: ['一门', '可以', '和', '喜欢', '心情', '我', '我的', '放松', '旅游', '是', '有趣的技术', '爱好', '编程', '都是']
Vectors:
 [[1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0. 2. 0.]
 [0. 1. 0. 1. 1. 1. 0. 1. 2. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 1.]]
