## Pretrained Word Embeddings

Starting with Driverless AI version 1.7.0, text models can use pretrained word embeddings, which are configured through the expert settings. Several pretrained word embeddings are freely available, such as [GloVe](https://nlp.stanford.edu/projects/glove/) and [fastText](https://fasttext.cc/docs/en/crawl-vectors.html). These embeddings are trained on large corpora such as Wikipedia and Common Crawl, and we can download them and use them in our models.

We can also train our own embeddings on a domain-specific dataset instead of using the publicly available ones. This is particularly useful when we have a large amount of unlabeled text and want to make use of it. This notebook shows how to create such custom pretrained embeddings.

The data used in this example is the [US Airline Sentiment dataset](https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv) from the [Figure Eight’s Data for Everyone](https://www.figure-eight.com/data-for-everyone/) library. The dataset is split into training and test sets with this [simple script](https://gist.github.com/woobe/bd79d9f4d7ea139c5d2eb4cf1de1e7db), and the training file is used to create the word embeddings. Please use your own text corpus in place of this airline training file.

In [1]:
# Please enter the file name
file_name = "train_airline_sentiment.csv"
# Please enter the name of the text column
col_name = "text"

Import the `h2o` module and `H2OWord2vecEstimator`, and start (or connect to) a local H2O cluster.

In [2]:
import h2o
h2o.init()
from h2o.estimators.word2vec import H2OWord2vecEstimator

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.1" 2018-10-16; OpenJDK Runtime Environment 18.9 (build 11.0.1+13); OpenJDK 64-Bit Server VM 18.9 (build 11.0.1+13, mixed mode)
  Starting server from /Users/srk/envs/DS2/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/db/49r_20s91bg8qhg08qf78x100000gn/T/tmp8m3vtkx0
  JVM stdout: /var/folders/db/49r_20s91bg8qhg08qf78x100000gn/T/tmp8m3vtkx0/h2o_srk_started_from_python.out
  JVM stderr: /var/folders/db/49r_20s91bg8qhg08qf78x100000gn/T/tmp8m3vtkx0/h2o_srk_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime: 01 secs
H2O cluster timezone: Asia/Kolkata
H2O data parsing timezone: UTC
H2O cluster version: 3.24.0.4
H2O cluster version age: 1 month and 24 days
H2O cluster name: H2O_from_python_srk_z7y5eb
H2O cluster total nodes: 1
H2O cluster free memory: 4 Gb
H2O cluster total cores: 12
H2O cluster allowed cores: 12


Import the dataset file. Note that the input must be a CSV file with a valid header in the first line.

In [3]:
df = h2o.import_file(file_name, header=1, sep=",")
df = df[[col_name]].ascharacter()

Parse progress: |█████████████████████████████████████████████████████████| 100%
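
The header requirement above can be checked with the standard library before the file is handed to H2O. The following is a minimal sketch, not part of the original notebook: the helper name `has_text_column` and the sample header are made up for illustration.

```python
import csv
import io

def has_text_column(csv_source, col_name):
    """Return True if the first row of the CSV contains col_name."""
    reader = csv.reader(csv_source)
    header = next(reader, [])  # empty list if the file has no rows
    return col_name in [h.strip() for h in header]

# Made-up sample standing in for open(file_name, newline="") on a real file
sample = io.StringIO("id,label,text\n1,neutral,What a flight\n")
print(has_text_column(sample, "text"))
```

If the check fails, either the header line is missing or `col_name` does not match the name of the text column.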


Preprocess the text: tokenize the text column, lowercase the tokens, and drop tokens shorter than two characters.

In [4]:
def tokenize(sentences):
    # Split each sentence into word tokens, one token per row
    tokenized = sentences.tokenize("\\W+")
    # Lowercase the tokens
    tokenized = tokenized.tolower()
    # Keep tokens with at least 2 characters; also keep the missing values (NA),
    # which mark sentence boundaries in the tokenized frame
    tokenized = tokenized[(tokenized.nchar() >= 2) | (tokenized.isna()),:]
    return tokenized

words = tokenize(df[col_name])
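
To make the shape of `words` easier to picture, here is a rough pure-Python imitation of the same steps. It is only a sketch: `tokenize_py` is a hypothetical helper, and it assumes (as in H2O's own word2vec examples) that the tokenized frame holds one token per row with a missing value closing each sentence, represented here by `None`.

```python
import re

def tokenize_py(sentences):
    """Imitate the H2O steps: one token per entry, None closing each sentence."""
    rows = []
    for sentence in sentences:
        if sentence is not None:
            for token in re.split(r"\W+", sentence):
                token = token.lower()
                if len(token) >= 2:  # drop empty strings and 1-character tokens
                    rows.append(token)
        rows.append(None)  # sentence boundary marker
    return rows

print(tokenize_py(["I love flying!", "A delay, again..."]))
# ['love', 'flying', None, 'delay', 'again', None]
```

The single-character tokens "I" and "A" are dropped, while the boundary markers are kept so that context windows need not span two different sentences.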

The next step is to build the word2vec model. The model's parameters can be adjusted as needed; refer to the [documentation of H2OWord2vecEstimator](http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html#h2oword2vecestimator) for details.

In [5]:
print("Build word2vec model")
w2v_model = H2OWord2vecEstimator(min_word_freq=3,
                                 vec_size=300,
                                 window_size=5,
                                 epochs=10,
                                 word_model="skip_gram")
w2v_model.train(training_frame=words)

Build word2vec model
word2vec Model Build progress: |██████████████████████████████████████████| 100%
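
As an illustration of what `window_size` and `min_word_freq` control in a skip-gram model, the sketch below builds the (center, context) training pairs for a single sentence. It is a simplified teaching example, not H2O's implementation: `skipgram_pairs` is a hypothetical helper, and it counts word frequencies within the one sentence rather than over the whole corpus.

```python
from collections import Counter

def skipgram_pairs(tokens, window_size=2, min_word_freq=1):
    """Return (center, context) pairs for one tokenized sentence."""
    counts = Counter(tokens)
    # Words rarer than min_word_freq are removed before pairs are formed
    kept = [t for t in tokens if counts[t] >= min_word_freq]
    pairs = []
    for i, center in enumerate(kept):
        lo = max(0, i - window_size)
        hi = min(len(kept), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, kept[j]))
    return pairs

print(skipgram_pairs(["late", "flight", "again"], window_size=1))
# [('late', 'flight'), ('flight', 'late'), ('flight', 'again'), ('again', 'flight')]
```

A larger window pairs each word with more distant neighbours, which tends to capture topical similarity; a smaller window emphasises words that appear in the same immediate context.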


Save the word embeddings as a text file.

This file can be supplied to Driverless AI as the pretrained word embedding input. The option is located at `Expert Settings -> NLP -> Path to pretrained embeddings for TensorFlow NLP models`.

In [6]:
w2v_model.to_frame().as_data_frame().to_csv("w2vec.txt", float_format='%.6f', sep=" ", header=False, index=False)
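
Given the arguments passed to `to_csv` above, each line of the output file should hold a word followed by its vector values, separated by single spaces, with no header row. A small parser such as the following can be used to sanity-check the file. This is a sketch only: `parse_embedding_lines` is a hypothetical helper and the sample lines are made up.

```python
def parse_embedding_lines(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Made-up 3-dimensional sample; the real file has vec_size (300) values per line.
# For the real file: with open("w2vec.txt") as f: vectors = parse_embedding_lines(f)
sample = [
    "flight 0.012345 -0.456789 0.100000\n",
    "delay -0.200000 0.333333 0.000001\n",
]
vectors = parse_embedding_lines(sample)
print(len(vectors), len(vectors["flight"]))
# 2 3
```

On the real file, every vector should have the same length as the `vec_size` used to train the model.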