### [NLTK](https://www.nltk.org/)
##### The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. (From Wikipedia)
`conda install -c anaconda nltk`

A corpus is a collection of human-language sentences.

In [None]:
sentences = "The Eastern Conference champion Cleveland Cavaliers defeated the defending NBA champion and Western Conference champion Golden State Warriors 4–3 in a rematch of the 2015 NBA Finals. Golden State, which earned home-court advantage with setting the NBA regular season wins record (73–9), jumped to a 2–0 lead in the series while recording the largest combined margin of victory (48) through two games in NBA Finals history. Cleveland returned home and responded with a 120–90 win in Game 3, but the Warriors won Game 4 to take a 3–1 series lead. The Cavaliers won the next three games to become the first team in Finals history to successfully overcome a 3–1 deficit. It also marked the first time since 1978 that Game 7 was won by the road team. "

#### Import nltk and download necessary dependencies

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Uncomment on the first run to download the stop-word list and tokenizer data
# import nltk
# nltk.download('stopwords')
# nltk.download('punkt')

`nltk` can help us remove stopwords: common words that don't contribute much to the semantics of the sentence(s).
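One detail worth keeping in mind is that NLTK's English stop-word list is lowercase, so a plain membership test is case-sensitive. The sketch below illustrates this with a small hand-picked set, `demo_stop_words`, and a token list, `demo_tokens`; both are hypothetical stand-ins, not NLTK's actual list or tokenizer output.

```python
# Hypothetical, hand-picked stop-word set for illustration only;
# NLTK's stopwords.words('english') is much longer.
demo_stop_words = {"the", "it", "a", "of", "in", "to", "and"}

# Hypothetical tokens, standing in for word_tokenize() output.
demo_tokens = ["The", "Cavaliers", "won", "the", "next", "three", "games"]

# Case-sensitive filter: "The" survives because only "the" is in the set.
case_sensitive = [t for t in demo_tokens if t not in demo_stop_words]

# Case-insensitive filter: lowercase each token before the membership test.
case_insensitive = [t for t in demo_tokens if t.lower() not in demo_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before filtering is a design choice: it removes capitalized stop words such as "The", but the tokens kept in the vocabulary can still retain their original casing, as they do here.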

### Now let's create the vocabulary of our corpus.

In [12]:
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(sentences)
# We only want unique tokens
unique_word_tokens = set(word_tokens)
# Drop stop words (the match is case-sensitive, so 'The' and 'It' are kept)
listofwords = [w for w in unique_word_tokens if w not in stop_words]
print(listofwords)

['win', 'two', 'State', 'team', 'road', 'regular', 'home', '7', 'history', 'Golden', 'Finals', 'games', 'Eastern', 'take', 'jumped', 'earned', 'victory', '2015', 'responded', 'Warriors', 'next', 'Cavaliers', 'successfully', 'first', 'rematch', 'time', 'deficit', 'Cleveland', 'series', 'defeated', '4', 'defending', '48', ',', '(', 'champion', 'Western', 'also', 'Conference', 'largest', 'become', '1978', 'record', '.', 'advantage', '3–1', 'season', '120–90', 'NBA', 'It', 'lead', 'Game', 'returned', '4–3', '2–0', 'combined', 'setting', 'wins', '73–9', 'overcome', 'margin', 'three', 'home-court', 'The', 'marked', '3', 'recording', ')', 'since']
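Because the vocabulary is built from a Python `set`, the order of the words (and therefore the index each word gets) is not guaranteed to be the same from one run to the next. A minimal sketch of one way to make the indices reproducible is to sort the vocabulary and keep a word-to-index dictionary. The names `demo_vocab`, `sorted_vocab`, and `word_to_index` are hypothetical and use a tiny stand-in vocabulary rather than the full corpus.

```python
# Hypothetical mini-vocabulary standing in for the unique, filtered tokens.
demo_vocab = {"Cavaliers", "Cleveland", "champion", "Eastern", "Conference"}

# Sorting turns the unordered set into a list with a reproducible order
# (uppercase letters sort before lowercase ones).
sorted_vocab = sorted(demo_vocab)

# Map each word to its position so lookups do not need list.index().
word_to_index = {word: i for i, word in enumerate(sorted_vocab)}

print(sorted_vocab)
print(word_to_index["Cleveland"])
```

A dictionary lookup is also constant-time on average, whereas `list.index()` scans the list, which matters once the vocabulary grows beyond a toy example.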


#### Now let's build one-hot encoded representations of each word from a sample test sentence.

In [17]:
import numpy as np
test = "Cleveland Cavaliers are the Eastern Conference champion"
# Tokenize, deduplicate, and drop stop words, as we did for the corpus
test = set(word_tokenize(test))
test = [w for w in test if w not in stop_words]
for word in test:
    # Position of the word in the vocabulary (raises ValueError if it is absent)
    one_hot = listofwords.index(word)
    vector = np.zeros(len(listofwords))
    vector[one_hot] = 1
    print(word, " : ", vector)

Eastern  :  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Cleveland  :  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Conference  :  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Cavaliers  :  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
champion  :  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0.