# Hello!

## Today we will learn how to build word embeddings with the one-hot encoding (count-vectorizing) method

### In simple terms, word embedding is the conversion of words into numbers that a machine can work with

### We could simply map the words to random numbers, but then how would the machine know when two different words are related?

### For example, we know 'hi' and 'hello' are both greetings, but assigning them random numbers loses that relationship :(
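To make the problem concrete, here is a minimal sketch with a made-up lookup table. The words, the IDs, and the helper `id_gap` are all hypothetical and only serve as an illustration.

```python
# Hypothetical word-to-ID table; the IDs are arbitrary, as if drawn at random.
word_to_id = {'hi': 7, 'hello': 42, 'banana': 8}

def id_gap(a, b):
    """Absolute difference between the IDs of two words."""
    return abs(word_to_id[a] - word_to_id[b])

print(id_gap('hi', 'hello'))   # 35
print(id_gap('hi', 'banana'))  # 1
```

Judging by the numbers alone, 'hi' looks closer to 'banana' than to 'hello', which is the opposite of what we want.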

### Let's start with a naive method of word embedding called one-hot encoding, or count vectorizing
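Before reaching for a library, it may help to see one-hot encoding written by hand. This is a minimal sketch over a hypothetical three-word vocabulary; `one_hot` and `dot` are illustrative helpers, not part of any package.

```python
# Hypothetical three-word vocabulary, kept in sorted order.
vocabulary = ['banana', 'hello', 'hi']

def one_hot(word):
    """Vector of zeros with a single 1 at the word's position in the vocabulary."""
    return [1 if w == word else 0 for w in vocabulary]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

print(one_hot('hi'))                          # [0, 0, 1]
print(dot(one_hot('hi'), one_hot('hello')))   # 0
print(dot(one_hot('hi'), one_hot('banana')))  # 0
```

Every pair of different words has a dot product of 0, so 'hi' is no closer to 'hello' than to 'banana'. That is why the method is called naive.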

### Install all packages in your conda environment. You just need `scikit-learn` for this. `pandas` is used to visualize the results.

In [128]:
!/anaconda2/envs/word-embeddings/bin/pip install scikit-learn pandas



### Import the packages

In [129]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

### Let's define a simple corpus as a list of documents, each containing some text

In [130]:
corpus = [
    'This is our first example document',
    'this is the second.',
    'Third it is!'
]

### As you would expect, the length of our corpus is 3

In [131]:
len(corpus)

3

### Now define the count vectorizer object that will perform the one-hot encoding on our corpus

In [132]:
vectorizer = CountVectorizer()
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

### Fit the vectorizer on our corpus, and use the trained vectorizer to transform our corpus

In [133]:
transformed_corpus = vectorizer.fit_transform(corpus)
transformed_corpus.shape

(3, 10)

### The vectorizer found 10 features. These features are simply the unique lowercased words in the corpus.
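`fit_transform` combines two steps that can also be run one after the other. The sketch below repeats the same corpus and assumes a scikit-learn version where `fit`, `transform`, and the fitted `vocabulary_` attribute behave as documented.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is our first example document',
    'this is the second.',
    'Third it is!'
]

vectorizer = CountVectorizer()

# Step 1: fit learns the vocabulary, a mapping from each word to a column index.
vectorizer.fit(corpus)
print(vectorizer.vocabulary_['document'])

# Step 2: transform counts the known words of each document into those columns.
matrix = vectorizer.transform(corpus)
print(matrix.shape)  # (3, 10)
```

The result of `transform` is a sparse matrix, which is why `.toarray()` is needed to view it as a regular array.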

### Let's check those features

In [134]:
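# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use it instead
list(vectorizer.get_feature_names_out())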

['document',
 'example',
 'first',
 'is',
 'it',
 'our',
 'second',
 'the',
 'third',
 'this']
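Where do these 10 features come from? The vectorizer's printed settings show `lowercase=True` and a `token_pattern` that keeps runs of two or more word characters. The sketch below approximates that behavior with Python's `re` module. It is a stand-in for illustration, not the library's actual code.

```python
import re

corpus = [
    'This is our first example document',
    'this is the second.',
    'Third it is!'
]

# Same pattern as in the vectorizer's settings: words of 2+ characters.
token_pattern = r'(?u)\b\w\w+\b'

tokens = set()
for doc in corpus:
    # Lowercase first, then pull out every token that matches the pattern.
    tokens.update(re.findall(token_pattern, doc.lower()))

features = sorted(tokens)
print(features)
print(len(features))  # 10
```

Punctuation such as `.` and `!` never matches the pattern, and 'This' and 'this' collapse into one feature after lowercasing.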

### Let's see what the one-hot encoded vector looks like on our corpus

In [135]:
transformed_corpus.toarray()

array([[1, 1, 1, 1, 0, 1, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 1, 1, 0, 1],
       [0, 0, 0, 1, 1, 0, 0, 0, 1, 0]])

### So each document gets the count of every vocabulary word it contains (here always 1, since no word repeats within a document), and 0 for every other word
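The printed settings of the vectorizer include `binary=False`, which means each cell holds a count rather than a strict 0/1 flag. The sketch below shows the difference by hand on a hypothetical document that repeats a word. `count_vector` and `binary_vector` are illustrative helpers, not library functions.

```python
from collections import Counter

# The 10 features found above, in column order.
vocabulary = ['document', 'example', 'first', 'is', 'it',
              'our', 'second', 'the', 'third', 'this']

def count_vector(tokens):
    """Number of times each vocabulary word occurs in the tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def binary_vector(tokens):
    """1 if the vocabulary word occurs at all, else 0."""
    return [min(c, 1) for c in count_vector(tokens)]

# Hypothetical document in which 'is' appears twice.
tokens = 'this is it is'.split()

print(count_vector(tokens))   # [0, 0, 0, 2, 1, 0, 0, 0, 0, 1]
print(binary_vector(tokens))  # [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
```

With scikit-learn, passing `binary=True` to `CountVectorizer` gives the strict 0/1 form.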

### Let's check in more detail

In [136]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df = pd.DataFrame(transformed_corpus.toarray(),
                  columns=vectorizer.get_feature_names_out())
df

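|   | document | example | first | is | it | our | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |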


### Now if we have a new document, we can just pass this document through the trained vectorizer

In [137]:
new_doc = ['this is the fourth document']
transformed_new_doc = vectorizer.transform(new_doc)
transformed_new_doc.toarray()

array([[1, 0, 0, 1, 0, 0, 0, 1, 0, 1]])

In [138]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
pd.DataFrame(transformed_new_doc.toarray(),
             columns=vectorizer.get_feature_names_out())

|   | document | example | first | is | it | our | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |


### Observe that the new word 'fourth' was simply ignored by the vectorizer because it did not exist in our training corpus

### This is expected, and it happens far less often when the vectorizer is fit on thousands of documents. Sometimes it is better to fit it on a large book corpus that covers most of the common English vocabulary.
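A quick way to see how much of a new document gets dropped is to list the tokens that are missing from the vocabulary. The sketch below does this by hand with the same token pattern as the vectorizer's settings. The names `unknown` and `dropped_share` are hypothetical.

```python
import re

# The 10 features learned from the training corpus.
vocabulary = {'document', 'example', 'first', 'is', 'it',
              'our', 'second', 'the', 'third', 'this'}

new_doc = 'this is the fourth document'

# Tokenize roughly the way the vectorizer does: lowercase, words of 2+ characters.
tokens = re.findall(r'(?u)\b\w\w+\b', new_doc.lower())

# Tokens the vectorizer has never seen are silently dropped.
unknown = [t for t in tokens if t not in vocabulary]
dropped_share = len(unknown) / len(tokens)

print(unknown)        # ['fourth']
print(dropped_share)  # 0.2
```

If the dropped share stays high on real data, that is a sign the vectorizer should be fit on a larger corpus.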

### Next time we will learn a new word embedding method

## Hope you enjoyed the music and video. Do subscribe for more content like this.

# Jabraghe