# Sentiment Analysis - Tokenizing news headlines for data preparation!
This notebook covers the data preparation step: tokenizing the news headlines and converting them into padded sequences.

Data preparation includes the following steps:
1. Download and read the data.
2. Segregate the headlines and their labels.
3. Tokenize the headlines.
4. Create sequences and add padding.

## 1. Download and read the news headlines data

The data comes from a [Kaggle dataset](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection) for sarcasm detection. A corrected copy of it is hosted on Google Cloud Storage.

In [1]:
!wget --no-check-certificate \
    https://storage.googleapis.com/wdd-2-node.appspot.com/x1.json \
    -O ./x1.json

In [2]:
!ls

sample_data  x1.json


In [3]:
# read the data using the pandas library
import pandas as pd

df = pd.read_json('./x1.json')
df.head()

| | is_sarcastic | headline | article_link |
|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... |


## 2. Segregate the headlines and their labels

In [4]:
# create lists to store the headlines and labels
headlines = list(df['headline'])
labels = list(df['is_sarcastic'])
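
The same split can be sketched without pandas. The records below are invented stand-ins for rows of the dataset (only the field names `headline` and `is_sarcastic` come from the notebook), and the two checks at the end are an optional sanity test that is not part of the original notebook.

```python
# Hypothetical records mimicking the dataset's columns; the values are invented.
records = [
    {"is_sarcastic": 1, "headline": "local man wins argument with toaster"},
    {"is_sarcastic": 0, "headline": "city council approves new budget"},
    {"is_sarcastic": 1, "headline": "area cat declares itself mayor"},
]

# Build two parallel lists: position i in each list refers to the same record.
sample_headlines = [r["headline"] for r in records]
sample_labels = [r["is_sarcastic"] for r in records]

# Sanity checks: one label per headline, and every label is 0 or 1.
assert len(sample_headlines) == len(sample_labels)
assert set(sample_labels) <= {0, 1}

print(sample_headlines[0], "->", sample_labels[0])
```

Keeping the two lists aligned by position matters because later steps pair each padded sequence with its label by index.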

In [8]:
# uncomment either line to inspect the lists
# headlines
# labels

## Import the APIs

In [11]:
# import the required APIs
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

## 3. Tokenize the data

In [14]:
# set up the tokenizer; words not seen during fitting map to the "<OOV>" token
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(headlines)

In [16]:
word_index = tokenizer.word_index
print(word_index)
# subtract 1 because word_index also contains the "<OOV>" entry
print('Number of unique words: ', len(word_index) - 1)

Number of unique words:  30884
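
To make the word index less of a black box, here is a rough pure-Python sketch of the idea: count words, then give smaller integers to more frequent words, with the out-of-vocabulary token at index 1. This is a simplification and not the Keras implementation. In particular it only lowercases and splits on whitespace, while the real `Tokenizer` also filters punctuation by default. The function names `build_word_index` and `to_sequence` are hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer; more frequent words get smaller indices."""
    counts = Counter()
    for text in texts:
        # Simplification: lowercase and split on whitespace only.
        counts.update(text.lower().split())
    # Index 0 is left unused (reserved for padding); the OOV token takes index 1.
    index = {oov_token: 1}
    # sorted() is stable, so words with equal counts keep first-seen order.
    for word, _ in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        index[word] = len(index) + 1
    return index

def to_sequence(text, index, oov_token="<OOV>"):
    """Replace each word with its index, falling back to the OOV index."""
    return [index.get(w, index[oov_token]) for w in text.lower().split()]

sample = ["the cat sat", "the dog sat down"]
word_index_sketch = build_word_index(sample)
print(word_index_sketch)
print(to_sequence("the bird sat", word_index_sketch))
```

The word "bird" never appears in `sample`, so it is mapped to the OOV index instead of being dropped. This is also why the count above subtracts 1: the OOV entry lives in the index but is not a real word.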


## 4. Create padded sequences

In [12]:
# find the length of the longest headline
# note: len(x) on a string counts characters, not words, so this is an upper bound on the token count
max_length = max(len(x) for x in headlines)
print(max_length)

926
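
Because `len()` on a string counts characters, the value above is the longest headline in characters. A sketch of measuring length in words instead is shown here, assuming whitespace splitting is a close enough proxy for the tokenizer's own splitting. The sample headlines are invented.

```python
sample_headlines = [
    "eat your veggies",
    "mother comes pretty close to using the word correctly",
]

# Longest headline measured in characters (what len() on a string returns).
max_chars = max(len(h) for h in sample_headlines)

# Longest headline measured in whitespace-separated words.
max_words = max(len(h.split()) for h in sample_headlines)

print(max_chars, max_words)
```

Padding to the word count rather than the character count would give much narrower arrays with the same information, since every position beyond the longest token sequence holds only zeros.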


In [18]:
# create sequences of the headlines
seqs = tokenizer.texts_to_sequences(headlines)

# post-pad sequences
padded_seqs = pad_sequences(seqs, maxlen=max_length, padding="post")
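
As an illustration of what post-padding does, here is a small pure-Python sketch. The helper `pad_post` is hypothetical and is not the Keras function. It appends zeros on the right, and for sequences longer than `maxlen` it keeps the last `maxlen` items, which mirrors the Keras default of `truncating="pre"`.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen`.

    Sequences longer than `maxlen` keep only their last `maxlen` items.
    """
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            trimmed = list(seq[len(seq) - maxlen:])
        else:
            trimmed = list(seq)
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

sample_seqs = [[5, 2, 9], [7], [1, 2, 3, 4, 5, 6]]
print(pad_post(sample_seqs, maxlen=4))
```

Every row ends up with the same length, which is what allows the result to be stored as a single rectangular array.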


In [19]:
# print a sample of the padded sequences
print(padded_seqs)

[[16004   355  3167 ...     0     0     0]
 [ 7475  1775   758 ...     0     0     0]
 [  863    33 11427 ...     0     0     0]
 ...
 [    4   100   629 ...     0     0     0]
 [ 1870  1313  3317 ...     0     0     0]
 [  217  3283    21 ...     0     0     0]]
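
One way to check that a padded row still represents its headline is to map the integers back to words. The sketch below uses an invented word index and an invented padded row, so it shows only the idea. Keras' `Tokenizer` also provides `sequences_to_texts` for this purpose.

```python
# Hypothetical word index and padded row, both invented for illustration.
word_index_sketch = {"<OOV>": 1, "the": 2, "sat": 3, "cat": 4}
padded_row = [2, 4, 3, 1, 0, 0]

# Invert the mapping so that indices can be looked up.
index_word = {i: w for w, i in word_index_sketch.items()}

# Skip 0, which is padding and has no entry in the word index.
decoded = " ".join(index_word[i] for i in padded_row if i != 0)
print(decoded)
```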


In [21]:
print(padded_seqs[1])

[7475 1775  758 3168   47  239   11 1844 1048    8 1528 2154 1845    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 