# Chapter 02: Vectorization - Turning Words into Numbers

Machines don't understand text; they understand numbers. **Vectorization** is the process of converting text into numerical vectors that machine learning algorithms can process.

In this chapter, we will cover:
1. **Bag-of-Words (BoW)**: Simple frequency counting.
2. **TF-IDF**: Weighting words by their importance.

## 1. Bag-of-Words (BoW)
BoW represents text as a "bag" of its words, ignoring grammar and order but keeping track of frequency.
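
To make the idea concrete, here is a minimal pure-Python sketch of the counting step. The `tokenize` and `bag_of_words` helpers are hypothetical names for this illustration, and the tokenization rule (lowercase, keep runs of two or more word characters) is an assumption chosen to resemble common library defaults, not a definitive implementation.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and keep runs of 2+ word characters,
    # which drops punctuation and single-character tokens.
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(docs):
    tokenized = [tokenize(doc) for doc in docs]
    # The vocabulary is every distinct token, sorted alphabetically.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # One row per document, one column per vocabulary term.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


docs = ["The cat sat on the mat.", "The dog sat on the log."]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
for row in matrix:
    print(row)     # [1, 0, 0, 1, 1, 1, 2] then [0, 1, 1, 0, 1, 1, 2]
```

Each row records how many times each vocabulary term occurs in one document. Word order is gone: "the" is simply counted twice in both sentences.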

We'll use `scikit-learn`, a widely used library for classical machine learning in Python.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'The cat sat on the mat.',
    'The dog sat on the log.',
    'Cats and dogs are great pets.'
]

# Initialize the Vectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Look at the 'Vocabulary' (the words the vectorizer learned)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Look at the resulting matrix (Document-Term Matrix)
print("\nBoW Matrix (as array):")
print(X.toarray())

Vocabulary: ['and' 'are' 'cat' 'cats' 'dog' 'dogs' 'great' 'log' 'mat' 'on' 'pets'
 'sat' 'the']

BoW Matrix (as array):
[[0 0 1 0 0 0 0 0 1 1 0 1 2]
 [0 0 0 0 1 0 0 1 0 1 0 1 2]
 [1 1 0 1 0 1 1 0 0 0 1 0 0]]


## 2. Term Frequency-Inverse Document Frequency (TF-IDF)
BoW has a flaw: it treats every word as equally important. Common words like 'the' carry less information than rarer words like 'log', which appears in only one document.

- **TF (Term Frequency)**: How often a word appears in a document.
- **IDF (Inverse Document Frequency)**: How rare a word is across *all* documents.

TF-IDF gives high scores to words that are frequent in a specific document but rare in the overall collection.

In [2]:
import pandas as pd  # pandas gives a more readable view of the matrix
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses `corpus` from the Bag-of-Words cell above
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("\nTF-IDF Matrix:")
df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print(df)

Vocabulary: ['and' 'are' 'cat' 'cats' 'dog' 'dogs' 'great' 'log' 'mat' 'on' 'pets'
 'sat' 'the']

TF-IDF Matrix:
        and       are       cat      cats       dog      dogs     great  \
0  0.000000  0.000000  0.427554  0.000000  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.427554  0.000000  0.000000   
2  0.408248  0.408248  0.000000  0.408248  0.000000  0.408248  0.408248   

        log       mat        on      pets       sat       the  
0  0.000000  0.427554  0.325166  0.000000  0.325166  0.650331  
1  0.427554  0.000000  0.325166  0.000000  0.325166  0.650331  
2  0.000000  0.000000  0.000000  0.408248  0.000000  0.000000  
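
The scores in the table can be reproduced by hand, which also explains why 'the' still has the largest value in the first two documents. The sketch below assumes the `TfidfVectorizer` defaults: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, raw counts for TF, and L2 normalization of each row. The `tfidf_rows` helper is a hypothetical name for this illustration, and the documents are pre-tokenized by hand to keep the example short.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: the number of documents containing each term.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocabulary]
        # L2-normalize the row so every document vector has length 1.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append({t: w / norm for t, w in zip(vocabulary, raw) if w > 0})
    return rows


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are great pets".split(),
]
rows = tfidf_rows(docs)
for term in ("the", "cat", "sat"):
    print(term, round(rows[0][term], 6))
```

Under these assumptions, 'the' gets an IDF of about 1.29 and 'cat' about 1.69, so the discount for appearing in two of three documents is mild. Because 'the' occurs twice in the first sentence, its weight before normalization (about 2.58) still exceeds that of 'cat' (about 1.69). The down-weighting of common words becomes much more pronounced on larger corpora, where very frequent words appear in nearly every document.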


## Wrap-up Exercise
1. Define a new list of 5 sentences related to a topic (e.g., Space, Cooking, or Sports).
2. Apply `TfidfVectorizer` to it.
3. Find the word with the highest TF-IDF score in the first document. Hint: You can use `df.iloc[0].sort_values(ascending=False)`.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# 1. Define 5 sentences about Space
space_corpus = [
    "The Hubble Space Telescope has captured stunning images of distant nebulas.",
    "Mars is often called the Red Planet because of its iron oxide surface.",
    "Black holes are regions of space-time where gravity is so strong nothing escapes.",
    "The International Space Station orbits Earth every ninety minutes.",
    "Astronauts training for moon missions often practice in extreme environments."
]

# 2. Apply TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(space_corpus)

# Create a DataFrame for easy visualization
df_space = pd.DataFrame(
    tfidf_matrix.toarray(), 
    columns=tfidf_vectorizer.get_feature_names_out()
)

# 3. Find the word with the highest TF-IDF score in the first document
# We look at index 0 (the first sentence about Hubble)
first_doc_scores = df_space.iloc[0].sort_values(ascending=False)

print("Top words in the first document:")
print(first_doc_scores.head(5))

# Extract the top word. If several words tie for the highest score,
# this picks whichever one the sort happens to place first.
top_word = first_doc_scores.index[0]
top_score = first_doc_scores.values[0]

print(f"\nResult: The most important word in Document 1 is '{top_word}' with a score of {top_score:.4f}")

Top words in the first document:
stunning    0.327113
distant     0.327113
hubble      0.327113
has         0.327113
nebulas     0.327113
Name: 0, dtype: float64

Result: The most important word in Document 1 is 'stunning' with a score of 0.3271
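
Note that the five words printed above share the same score, so reporting 'stunning' as the single most important word is somewhat arbitrary. One way to surface ties is sketched below. The `top_terms` helper is a hypothetical name for this illustration, and the 'space' score is a made-up lower value included only to show that non-tied words are filtered out.

```python
import math


def top_terms(scores, rel_tol=1e-9):
    """Return every term whose score ties with the maximum, sorted alphabetically."""
    best = max(scores.values())
    return sorted(
        term for term, score in scores.items()
        if math.isclose(score, best, rel_tol=rel_tol)
    )


# Scores copied from the printed output above, plus one illustrative
# (made-up) lower score for 'space'.
example_scores = {
    "stunning": 0.327113,
    "distant": 0.327113,
    "hubble": 0.327113,
    "has": 0.327113,
    "nebulas": 0.327113,
    "space": 0.2,
}
print(top_terms(example_scores))
```

With a pandas Series such as `first_doc_scores`, a boolean mask like `first_doc_scores[first_doc_scores == first_doc_scores.max()]` should give a similar result.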
