# Working with Skip-gram Embeddings


With neural networks, we can make the embedding values part of the training procedure. The first such method we will explore is called **skip-gram embedding**.

We will use a neural network that predicts the surrounding words given an input word. We could just as easily reverse this and predict a target word from a set of surrounding words. Both are variations of the **Word2vec** procedure. The former method, predicting the surrounding words (the context) from a target word, is called the **skip-gram model**. In the next recipe, we will implement the latter method, predicting the target word from the context, which is called the **continuous bag of words (CBOW)** method.
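
To make the skip-gram idea concrete, here is a minimal sketch of how (target, context) training pairs could be generated from a tokenized sentence. The helper name `skipgram_pairs` and the sample sentence are illustrative assumptions, not part of this recipe's code; the `window_size` parameter plays the same role as `WINDOW_SIZE` in the settings below.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (target, context) pairs for each word and its neighbours.

    Hypothetical helper for illustration only: every word within
    `window_size` positions of the target counts as a context word.
    """
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window at the sentence boundaries
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # the target is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Assumed example sentence, already lower-cased and tokenized
sentence = "the movie was surprisingly good".split()
pairs = skipgram_pairs(sentence, window_size=1)
for target, context in pairs:
    print(target, "->", context)
```

In the skip-gram setting each pair becomes one training example: the network receives the target word and is asked to predict the context word. CBOW would instead group all context words of a target into a single input.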

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from nltk.corpus import stopwords

In [None]:
sess = tf.Session()

In [None]:
BATCH_SIZE = 50
EMBEDDING_SIZE = 200
VOCABULARY_SIZE = 10000
WINDOW_SIZE = 2
GENERATIONS = 50000
LOSS_LOG_INTERVAL = 500
VALID_LOG_INTERVAL = 2000

en_stopwords = stopwords.words('english')
valid_words = ['cliche', 'love', 'hate', 'silly', 'sad']

In [None]:
!wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz

In [None]:
!tar -xf rt-polaritydata.tar.gz

In [None]:
def ddd():
    return (1, 2)

a = ddd()
a

In [1]:
from MovieData import MovieData

In [2]:
movieData = MovieData()

In [3]:
(pos_data, neg_data) = movieData.load_data()

In [4]:
pos_data.tail()

Unnamed: 0,Content
5326,exuberantly romantic serenely melancholy time ...
5327,mazel tov film familys joyous life acting yidd...
5328,standing shadows motown best kind documentary ...
5329,nice see piscopo years chaykin headly priceless
5330,provides porthole noble trembling incoherence ...


In [5]:
neg_data.tail()

Unnamed: 0,Content
5326,terrible movie people nevertheless find moving
5327,many definitions time waster movie must surely...
5328,stands crocodile hunter hurried badly cobbled ...
5329,thing looks like madeforhomevideo quickie
5330,enigma wellmade dry placid


In [None]:
dir(pos)

In [None]:
pos.read()

In [None]:
with open('test.txt', 'w') as f:
    f.write(pos)

In [None]:
import pandas as pd

df = pd.DataFrame(['a b c', 'a b', 'a b c d e', 'a', 'a b c d'], columns=['Content'])
df

In [None]:
# Note: .str.len() counts characters, not words, so 'a b' (3 characters) is kept
df[df['Content'].str.len() >= 3]

In [None]:
def count_words(text):
    # `text` is a pandas Series of strings; return the word count of each row
    return text.str.split().str.len()

In [None]:
df[df['Content'].str.split().str.len() >= 3]

In [None]:
df['Content'].loc[lambda s: count_words(s) >= 3]