## One-Hot Encoding and Label Encoding Exercise

This notebook demonstrates label encoding and one-hot encoding with scikit-learn on three short text documents.

In [1]:
# Import required libraries
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [2]:
# Define the text documents
S1 = 'Often, machine learning tutorials will recommend.'
S2 = 'Getting started in applied machine learning.'
S3 = 'One good example is to use.'

# Preprocess: convert to lowercase and remove punctuation
processed_docs = [doc.lower().replace(',', '').replace('.', '') for doc in [S1, S2, S3]]
print("Processed documents:")
for i, doc in enumerate(processed_docs, 1):
    print(f"S{i}: {doc}")

Processed documents:
S1: often machine learning tutorials will recommend
S2: getting started in applied machine learning
S3: one good example is to use
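
The chained `replace` calls above only remove commas and periods. As a minimal sketch of a more general cleanup, the helper below (`preprocess` is a hypothetical name, not part of the notebook) deletes every character in `string.punctuation`. Note that this also strips hyphens and apostrophes, which may not be wanted for other corpora.

```python
import string

def preprocess(doc: str) -> str:
    """Lowercase a document and drop every ASCII punctuation character."""
    # With three arguments, str.maketrans maps each character of the third
    # argument to None, so translate() deletes it.
    table = str.maketrans('', '', string.punctuation)
    return doc.lower().translate(table)

docs = [
    'Often, machine learning tutorials will recommend.',
    'Getting started in applied machine learning.',
    'One good example is to use.',
]
processed = [preprocess(doc) for doc in docs]
for i, doc in enumerate(processed, 1):
    print(f"S{i}: {doc}")
```

For these three sentences the result should match the processed documents shown above.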


## Label Encoding

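Label encoding maps each distinct word in the corpus to an integer between 0 and n-1, where n is the number of unique words. scikit-learn's `LabelEncoder` assigns the integers in sorted (here alphabetical) order of the words, so a repeated word such as `machine` always receives the same label.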

In [3]:
# Prepare data for encoding
data = [doc.split() for doc in processed_docs]
values = data[0] + data[1] + data[2]
print("All words:", values)
print("\nTotal words:", len(values))
print("Unique words:", len(set(values)))

All words: ['often', 'machine', 'learning', 'tutorials', 'will', 'recommend', 'getting', 'started', 'in', 'applied', 'machine', 'learning', 'one', 'good', 'example', 'is', 'to', 'use']

Total words: 18
Unique words: 16


In [4]:
# Label Encoding
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)

print("Label Encoded:", integer_encoded)
print("\nWord to Label mapping:")
for word, label in zip(values, integer_encoded):
    print(f"  {word:15s} -> {label}")

Label Encoded: [ 8  7  6 13 15 10  2 11  4  0  7  6  9  3  1  5 12 14]

Word to Label mapping:
  often           -> 8
  machine         -> 7
  learning        -> 6
  tutorials       -> 13
  will            -> 15
  recommend       -> 10
  getting         -> 2
  started         -> 11
  in              -> 4
  applied         -> 0
  machine         -> 7
  learning        -> 6
  one             -> 9
  good            -> 3
  example         -> 1
  is              -> 5
  to              -> 12
  use             -> 14


In [5]:
# Show unique label encodings
print("\nUnique word to label mapping:")
unique_words = sorted(set(values))
unique_labels = label_encoder.transform(unique_words)
for word, label in zip(unique_words, unique_labels):
    print(f"  {word:15s} -> {label}")


Unique word to label mapping:
  applied         -> 0
  example         -> 1
  getting         -> 2
  good            -> 3
  in              -> 4
  is              -> 5
  learning        -> 6
  machine         -> 7
  often           -> 8
  one             -> 9
  recommend       -> 10
  started         -> 11
  to              -> 12
  tutorials       -> 13
  use             -> 14
  will            -> 15
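
The sorted mapping above can also be read directly from the fitted encoder, and the integer codes can be turned back into words. The following is a small sketch on a shortened word list (the list `words` is chosen here for illustration); it uses the `classes_` attribute and the `inverse_transform` method of `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

# A short illustrative word list with one repeated word.
words = ['often', 'machine', 'learning', 'machine']

encoder = LabelEncoder()
codes = encoder.fit_transform(words)

# classes_ holds the unique words in sorted order; a word's label is its
# index in this array.
print("Classes:", list(encoder.classes_))
print("Codes:  ", list(codes))

# inverse_transform maps integer labels back to the original words.
decoded = encoder.inverse_transform(codes)
print("Decoded:", list(decoded))
```

Because the labels are just positions in a sorted list, their numeric order carries no meaning about the words themselves.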


## One-Hot Encoding

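One-hot encoding represents a categorical value as a binary vector in which exactly one element is 1 (hot) and all others are 0. Note that the cell below fits `OneHotEncoder` on `data`, a list of 3 documents with 6 words each, so the encoder treats each word position as a separate categorical feature rather than encoding words against one shared vocabulary. Each position has 3 distinct words, which gives 6 × 3 = 18 columns, and each row contains six 1s, one per position. This only works because all three documents have the same number of words.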

In [6]:
# One-Hot Encoding
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(data).toarray()

print("Onehot Encoded Matrix:")
print(onehot_encoded)
print("\nShape:", onehot_encoded.shape)
print("(rows = documents, columns = 6 word positions x 3 distinct words per position)")

Onehot Encoded Matrix:
[[0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 1. 0.]
 [1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0.]
 [0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1.]]

Shape: (3, 18)
(rows = documents, columns = 6 word positions x 3 distinct words per position)


In [7]:
# Display each document's encoding
for i, (doc, encoding) in enumerate(zip(processed_docs, onehot_encoded), 1):
    print(f"\nDocument {i}: '{doc}'")
    print(f"One-hot encoding: {encoding}")


Document 1: 'often machine learning tutorials will recommend'
One-hot encoding: [0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 1. 0.]

Document 2: 'getting started in applied machine learning'
One-hot encoding: [1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0.]

Document 3: 'one good example is to use'
One-hot encoding: [0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1.]
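
The matrix above has one row per document because `data` holds one list of words per document. To get one binary vector per word over a single shared vocabulary, one possible approach is to pass the words as a single column, so that `OneHotEncoder` sees one feature with 16 categories. The sketch below assumes the same three processed documents; the variable names `words` and `word_vectors` are illustrative.

```python
from sklearn.preprocessing import OneHotEncoder

docs = [
    'often machine learning tutorials will recommend',
    'getting started in applied machine learning',
    'one good example is to use',
]

# OneHotEncoder expects 2-D input of shape (n_samples, n_features).
# Wrapping each word in its own list gives 18 samples with 1 feature.
words = [[word] for doc in docs for word in doc.split()]

encoder = OneHotEncoder()
word_vectors = encoder.fit_transform(words).toarray()

print("Shape:", word_vectors.shape)              # (words, unique words)
print("Columns:", list(encoder.categories_[0]))  # sorted vocabulary
print("Vector for 'machine':", word_vectors[1])
```

Here each row contains exactly one 1, and both occurrences of `machine` map to the same vector.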


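## Vocabulary Analysis

The cell below builds a vocabulary by hand: each distinct word receives a 1-based index in order of first appearance, in contrast to the sorted, 0-based labels produced by `LabelEncoder`.
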
In [8]:
# Build vocabulary
vocab = {}
count = 0
for doc in processed_docs:
    for word in doc.split():
        if word not in vocab:
            count += 1
            vocab[word] = count

print("Vocabulary:")
for word, idx in sorted(vocab.items(), key=lambda x: x[1]):
    print(f"  {word:15s} -> {idx}")

Vocabulary:
  often           -> 1
  machine         -> 2
  learning        -> 3
  tutorials       -> 4
  will            -> 5
  recommend       -> 6
  getting         -> 7
  started         -> 8
  in              -> 9
  applied         -> 10
  one             -> 11
  good            -> 12
  example         -> 13
  is              -> 14
  to              -> 15
  use             -> 16
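
One way to use such a vocabulary is to rewrite each document as a sequence of word indices. The sketch below rebuilds the same first-appearance, 1-based vocabulary with `dict.setdefault` and then encodes the three documents; the name `sequences` is introduced here for illustration only.

```python
docs = [
    'often machine learning tutorials will recommend',
    'getting started in applied machine learning',
    'one good example is to use',
]

vocab = {}
for doc in docs:
    for word in doc.split():
        # setdefault only stores the index the first time a word is seen;
        # len(vocab) + 1 keeps the indices 1-based.
        vocab.setdefault(word, len(vocab) + 1)

# Replace every word with its vocabulary index.
sequences = [[vocab[word] for word in doc.split()] for doc in docs]
for doc, seq in zip(docs, sequences):
    print(f"{doc!r} -> {seq}")
```

Repeated words reuse their existing index, so `machine` and `learning` appear as 2 and 3 in both the first and second sequences.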
