# Tokenize and sequence a bigger corpus of text

## About the dataset
You will use a dataset containing Amazon and Yelp reviews of products and restaurants. This dataset was originally extracted from Kaggle.

The dataset includes reviews, and each review is labelled as 0 (bad) or 1 (good).

In [1]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import numpy as np
import pandas as pd

## Get the corpus of text

In [2]:
path = tf.keras.utils.get_file('reviews.csv', 
                               'https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P')
print(path)

/home/nikhil/.keras/datasets/reviews.csv


In [3]:
dataset = pd.read_csv(path)
dataset.head()

Unnamed: 0.1,Unnamed: 0,text,sentiment
0,0,So there is no way for me to plug it in here i...,0
1,1,Good case Excellent value.,1
2,2,Great for the jawbone.,1
3,3,Tied to charger for conversations lasting more...,0
4,4,The mic is great.,1


## Get the reviews from the CSV file

In [4]:
reviews = dataset['text'].to_list()
print(reviews)



## Tokenize the text

In [5]:
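# Reserve a dedicated token for words that were not seen during fitting
tokenizer = Tokenizer(oov_token='<OOV>')
# Build the vocabulary from the reviews
tokenizer.fit_on_texts(reviews)
print(tokenizer)

# word_index maps each word to its integer ID
word_index = tokenizer.word_index
print(len(word_index))
print(word_index)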

<keras_preprocessing.text.Tokenizer object at 0x7f3e37f6fcc0>
3261
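
To make the vocabulary step less of a black box, here is a minimal pure-Python sketch of roughly what `fit_on_texts` and `texts_to_sequences` do with their default settings: lowercase the text, replace punctuation with spaces, rank words by frequency, and reserve index 1 for the OOV token (index 0 is left free for padding). The helper names `split_words`, `build_word_index`, and `to_sequence` are hypothetical and only approximate the Keras behaviour; they are not part of the library.

```python
from collections import Counter

# Punctuation stripped by the Keras Tokenizer by default (assumed default settings).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase, turn filtered characters into spaces, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts, oov_token='<OOV>'):
    """Hypothetical helper: map words to IDs, most frequent word first."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # most_common() sorts by count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    # The OOV token takes index 1; real words start at 2.
    return {word: i for i, word in enumerate([oov_token] + ranked, start=1)}


def to_sequence(text, word_index, oov_token='<OOV>'):
    """Hypothetical helper: replace each word by its ID, unknown words by the OOV ID."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in split_words(text)]


sample_reviews = ['The mic is great.', 'Great for the jawbone.']
sample_index = build_word_index(sample_reviews)
print(sample_index)
# {'<OOV>': 1, 'the': 2, 'great': 3, 'mic': 4, 'is': 5, 'for': 6, 'jawbone': 7}

# 'battery' never appeared in sample_reviews, so it maps to the OOV ID 1.
print(to_sequence('The battery is great', sample_index))
# [2, 1, 5, 3]
```

This also explains why the most frequent word in the corpus gets ID 2 rather than 1: the first slot is taken by `<OOV>`.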


## Generate sequences for the reviews

In [6]:
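# Convert each review into a list of word IDs
sequences = tokenizer.texts_to_sequences(reviews)
# Append zeros so every sequence is as long as the longest review
padded_sequences = pad_sequences(sequences, padding='post')

# Shape is (number of reviews, length of the longest sequence)
print(padded_sequences.shape)

# The first review as raw text...
print(reviews[0])

# ...and as a zero-padded sequence of word IDs
print(padded_sequences[0])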

(1992, 139)
So there is no way for me to plug it in here in the US unless I go by a converter.
[  28   59    8   56  142   13   61    7  269    6   15   46   15    2
  149  449    4   60  113    5 1429    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0]
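
The shape `(1992, 139)` means that all 1992 reviews were padded to 139 IDs, the length of the longest review, which is why the 21-word first review is followed by a long run of zeros. The sketch below imitates `pad_sequences(..., padding='post')` in plain Python to show the mechanics. The function `pad_post` is a hypothetical helper, not the Keras implementation; it assumes the Keras default of dropping IDs from the front when a sequence exceeds `maxlen`.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Hypothetical helper: pad every sequence at the end to a common length."""
    if maxlen is None:
        # No explicit limit: use the longest sequence, as in the cell above.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            # Too long: keep only the last maxlen IDs (assumed 'pre' truncation).
            seq = seq[len(seq) - maxlen:]
        # Too short: append the padding value until the length is maxlen.
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded


toy_sequences = [[28, 59, 8], [4, 60], [5]]

print(pad_post(toy_sequences))
# [[28, 59, 8], [4, 60, 0], [5, 0, 0]]

print(pad_post(toy_sequences, maxlen=2))
# [[59, 8], [4, 60], [5, 0]]
```

If a few very long reviews dominate the padded width, passing `maxlen` to `pad_sequences` is one way to cap it, at the cost of truncating those reviews.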
