# Bag-of-Words


Bag-of-words is a simple approach for converting text into numbers. As the name suggests, a text is represented by its collection ("bag") of words, i.e. each document becomes a vector of word counts over the set of unique words.
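
One consequence of keeping only counts is that word order is lost. The following minimal sketch illustrates this with Python's `collections.Counter`; the two example sentences are made up for illustration and are not part of the notebook's data.

```python
from collections import Counter

# Two hypothetical sentences: same words, different order
a = "the dog bites the man"
b = "the man bites the dog"

# A Counter maps each word to the number of times it occurs
bag_a = Counter(a.split())
bag_b = Counter(b.split())

print(bag_a["the"])    # 2
print(bag_a == bag_b)  # True: both sentences produce the same bag
```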

This notebook walks through a basic implementation of bag-of-words.

### Create word tokens

In [None]:
# Input sentences
s1="Federer is one of the greatest tennis players of all time."
s2="Federer has won twenty grand slam titles to date."

# Find unique word tokens for both sentences

# Remove '.' and concatenate the tokens of both sentences into one list
doc = s1.replace('.','').split() + s2.replace('.','').split()

# Track the tokens that have already been seen
seen = set()
# Iterate through the document and keep the first occurrence of each token
# (set.add() returns None, so the condition is True only for unseen tokens)
unique_tokens = [
    t for t in doc if not (t in seen or seen.add(t))
]

print(unique_tokens)


['Federer', 'is', 'one', 'of', 'the', 'greatest', 'tennis', 'players', 'all', 'time', 'has', 'won', 'twenty', 'grand', 'slam', 'titles', 'to', 'date']
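
As an alternative sketch of the same step, the order-preserving deduplication can also be written with `dict.fromkeys`, assuming Python 3.7 or later, where dictionaries keep insertion order. This is one possible variant, not the approach used in the rest of the notebook.

```python
s1 = "Federer is one of the greatest tennis players of all time."
s2 = "Federer has won twenty grand slam titles to date."

# Remove '.' and concatenate the tokens of both sentences into one list
doc = s1.replace('.', '').split() + s2.replace('.', '').split()

# Dictionary keys are unique and keep insertion order (Python 3.7+),
# so this drops repeated tokens while preserving first-seen order
unique_tokens = list(dict.fromkeys(doc))

print(len(doc), len(unique_tokens))  # 20 18
```

Both versions yield the same list; this one avoids relying on the side effect of `set.add()` inside a comprehension.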


### Count word token frequencies

In [None]:
# Count the frequency of each unique word token in sentence s1

vec_1 = []
token_1 = s1.replace('.','').split()
for t in unique_tokens:
  count = token_1.count(t)
  print(f'{t}: {count}')
  vec_1.append(count)

print(f'\nVector output:\n{s1}\n{vec_1}')

Federer: 1
is: 1
one: 1
of: 2
the: 1
greatest: 1
tennis: 1
players: 1
all: 1
time: 1
has: 0
won: 0
twenty: 0
grand: 0
slam: 0
titles: 0
to: 0
date: 0

Vector output:
Federer is one of the greatest tennis players of all time.
[1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]


In [None]:
# Count the frequency of each unique word token in sentence s2

vec_2 = []
token_2 = s2.replace('.','').split()
for t in unique_tokens:
  count = token_2.count(t)
  print(f'{t}: {count}')
  vec_2.append(count)

print(f'\nVector output:\n{s2}\n{vec_2}')

Federer: 1
is: 0
one: 0
of: 0
the: 0
greatest: 0
tennis: 0
players: 0
all: 0
time: 0
has: 1
won: 1
twenty: 1
grand: 1
slam: 1
titles: 1
to: 1
date: 1

Vector output:
Federer has won twenty grand slam titles to date.
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
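
The two counting cells above repeat the same logic, so it could be wrapped in a function. The sketch below is one way to do that; the helper names `tokenize` and `bag_of_words` are hypothetical and chosen for illustration. It uses `collections.Counter` so each sentence is counted once instead of calling `list.count` for every vocabulary token.

```python
from collections import Counter

def tokenize(sentence):
    # Same simple tokenization as above: drop '.' and split on whitespace
    return sentence.replace('.', '').split()

def bag_of_words(sentence, vocabulary):
    # Count every token once; a Counter returns 0 for missing tokens
    counts = Counter(tokenize(sentence))
    return [counts[t] for t in vocabulary]

s1 = "Federer is one of the greatest tennis players of all time."
s2 = "Federer has won twenty grand slam titles to date."

# Shared vocabulary of unique tokens in first-seen order
vocabulary = list(dict.fromkeys(tokenize(s1) + tokenize(s2)))

vec_1 = bag_of_words(s1, vocabulary)
vec_2 = bag_of_words(s2, vocabulary)
print(vec_1)
print(vec_2)
```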


### Print vectors

In [None]:
# Print both vectors
print(f'\n{s1}\n{vec_1}')
print(f'\n{s2}\n{vec_2}')


Federer is one of the greatest tennis players of all time.
[1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

Federer has won twenty grand slam titles to date.
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
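
The raw vectors are easier to read when each count is shown next to its token. As a small sketch, the values below are copied from the outputs above and printed side by side; the column widths are arbitrary choices.

```python
vocabulary = ['Federer', 'is', 'one', 'of', 'the', 'greatest', 'tennis',
              'players', 'all', 'time', 'has', 'won', 'twenty', 'grand',
              'slam', 'titles', 'to', 'date']
vec_1 = [1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
vec_2 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Pair each token with its count in sentence 1 and sentence 2
rows = list(zip(vocabulary, vec_1, vec_2))

# Print a header followed by one aligned row per token
print(f"{'token':<10}{'s1':>4}{'s2':>4}")
for token, c1, c2 in rows:
    print(f"{token:<10}{c1:>4}{c2:>4}")
```

In this layout, "Federer" is the only token with a nonzero count in both columns, and "of" is the only token with a count above 1.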
