### Tokenization
We want to eventually train a machine learning algorithm that predicts the number of upvotes a headline would receive. To do this, we'll need to be able to convert each headline to a numerical representation.

One way to do this is with something called a bag of words model. The bag of words model represents each piece of text as a numerical vector.
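To make the idea concrete, here is a minimal sketch of a bag of words built with only the standard library. The function name `bag_of_words` and the small set of punctuation characters it strips are assumptions made for illustration; they are not part of the notebook's own code.

```python
from collections import Counter


def bag_of_words(sentences):
    """Return (vocabulary, vectors) for a list of sentences.

    Tokens are lowercased and stripped of a few surrounding punctuation
    characters (an assumed, simplified cleanup step).
    """
    tokenized = [
        [word.strip(".,!?").lower() for word in sentence.split(" ")]
        for sentence in sentences
    ]

    # Vocabulary: unique tokens in order of first appearance.
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)

    # One count vector per sentence, aligned with the vocabulary.
    vectors = []
    for tokens in tokenized:
        token_counts = Counter(tokens)
        vectors.append([token_counts[token] for token in vocabulary])
    return vocabulary, vectors


sentences = [
    "I rode my horse to Berlin.",
    "You rode my horse to Berlin in the winter.",
]
vocabulary, vectors = bag_of_words(sentences)
print(vocabulary)
# ['i', 'rode', 'my', 'horse', 'to', 'berlin', 'you', 'in', 'the', 'winter']
for vector in vectors:
    print(vector)
# [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Each vector has one entry per vocabulary token, so every sentence ends up with the same length regardless of how many words it contains. The column order is arbitrary; this sketch uses order of first appearance.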

We'll look in more depth at each step in the bag of words process throughout this mission. Here's a high-level view of how two sentences, "I rode my horse to Berlin." and "You rode my horse to Berlin in the winter.", turn into a bag of words:

Sentences:

1. I rode my horse to Berlin.
2. You rode my horse to Berlin in the winter.

Bag of words:

| Sentence | i | you | rode | my | horse | to | berlin | in | the | winter |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

The first step in creating a bag of words model is known as tokenization. In tokenization, we break a sentence into disconnected words.

Here's how the two sentences from earlier are tokenized:

Sentences:

1. I rode my horse to Berlin.
2. You rode my horse to Berlin in the winter.

After tokenization:

1. [I, rode, my, horse, to, Berlin.]
2. [You, rode, my, horse, to, Berlin, in, the, winter.]

As you can see, all we're doing is splitting each sentence into a list of tokens. The split happens wherever a space character is found, so punctuation stays attached to the word next to it (Berlin. is a single token).

In [None]:
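# Split each headline on single spaces to get one list of tokens per headline.
# `submissions` is assumed to be the DataFrame of headlines loaded earlier.
tokenized_headlines = [line.split(" ") for line in submissions["headline"]]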
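As a quick check of what splitting on a space does, the sketch below applies the same `split(" ")` call to the two example sentences. The `messy` string is a hypothetical headline, added only to show how repeated spaces behave.

```python
sentences = [
    "I rode my horse to Berlin.",
    "You rode my horse to Berlin in the winter.",
]

# Splitting on a single space, as in the tokenization step.
tokenized = [sentence.split(" ") for sentence in sentences]
print(tokenized[0])  # ['I', 'rode', 'my', 'horse', 'to', 'Berlin.']

# A hypothetical headline with a double space yields an empty-string token...
messy = "Show HN:  my project"
print(messy.split(" "))  # ['Show', 'HN:', '', 'my', 'project']

# ...whereas split() with no argument splits on any run of whitespace.
print(messy.split())  # ['Show', 'HN:', 'my', 'project']
```

If the headlines may contain repeated spaces or tabs, `split()` without an argument is one way to avoid empty tokens; either way, punctuation is still attached to the words at this stage.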

### Preprocessing
We now have tokens, but they need some processing to make our predictions more accurate. We know that "Berlin", "Berlin.", and "berlin" all refer to the same word, but the computer doesn't know that unless we convert them all to the same form.

We can do this by lowercasing, so "Berlin" is turned into "berlin", and by removing punctuation, so "Berlin." becomes "Berlin".

Before preprocessing:

1. [I, rode, my, horse, to, Berlin.]
2. [You, rode, my, horse, to, Berlin, in, the, winter.]

After preprocessing:

1. [i, rode, my, horse, to, berlin]
2. [you, rode, my, horse, to, berlin, in, the, winter]

Preprocessing doesn't have to be perfect, but the more we can help the computer group instances of the same word together, the more accurate our predictions are likely to be. It's useful to look through your tokens and see if there are any instances of the same word that you haven't grouped together.
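The lowercasing and punctuation removal described above could be sketched as follows. The sample `tokenized_headlines` list is a hypothetical stand-in for the real tokenized data, and using `string.punctuation` as the set of characters to remove is an assumption; a real dataset may call for a different list.

```python
import string

# Hypothetical stand-in for the tokenized headlines from the tokenization step.
tokenized_headlines = [
    ["I", "rode", "my", "horse", "to", "Berlin."],
    ["You", "rode", "my", "horse", "to", "Berlin", "in", "the", "winter."],
]

# Translation table that deletes every ASCII punctuation character.
remove_punctuation = str.maketrans("", "", string.punctuation)

clean_tokenized = []
for tokens in tokenized_headlines:
    # Lowercase each token, then strip punctuation from it.
    cleaned = [token.lower().translate(remove_punctuation) for token in tokens]
    # Drop tokens that were pure punctuation and are now empty.
    cleaned = [token for token in cleaned if token]
    clean_tokenized.append(cleaned)

print(clean_tokenized[0])  # ['i', 'rode', 'my', 'horse', 'to', 'berlin']
print(clean_tokenized[1])
# ['you', 'rode', 'my', 'horse', 'to', 'berlin', 'in', 'the', 'winter']
```

The result has the same shape as the input, one list of tokens per headline, which is the form the token-counting code expects from `clean_tokenized`.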

In [None]:
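import numpy as np
import pandas as pd
from collections import Counter

# Flatten the preprocessed headlines into a single list of tokens.
# `clean_tokenized` is the list of cleaned token lists from the preprocessing step.
complete_tokens = [word for headline in clean_tokenized for word in headline]

# Count how often each token occurs across all headlines.
freqs = Counter(complete_tokens)

# Keep only the tokens that occur more than once.
unique_tokens = [token for token, count in freqs.items() if count > 1]

# One row per headline, one column per retained token, initialized to zero.
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)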