### Review of Basics

In [1]:
# Using lambda function to create a tokenizer

tokenizer = lambda x : x.split()

In [3]:
!wget https://github.com/jibinmathew69/PyTorch1.x-Tutorial/blob/master/NewsClassification/test.csv
!wget https://github.com/jibinmathew69/PyTorch1.x-Tutorial/blob/master/NewsClassification/train.csv
!wget https://github.com/jibinmathew69/PyTorch1.x-Tutorial/blob/master/NewsClassification/valid.csv

--2020-07-03 17:27:45--  https://github.com/jibinmathew69/PyTorch1.x-Tutorial/blob/master/NewsClassification/test.csv
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘test.csv’

test.csv                [    <=>             ]   1.01M   987KB/s    in 1.1s    

2020-07-03 17:27:47 (987 KB/s) - ‘test.csv’ saved [1063021]

--2020-07-03 17:27:49--  https://github.com/jibinmathew69/PyTorch1.x-Tutorial/blob/master/NewsClassification/train.csv
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘train.csv’

train.csv               [  <=>               ]  72.97K   200KB/s    in 0.4s    

2020-07-03 17:27:51 (200 KB/s) - ‘train.csv’ saved [74718]

--2020-07-03 17:27:52--  https://gi

In [4]:
mkdir NewsClassificaiton

In [5]:
from torchtext.data import Field

# Fields for Reviews
Review = Field(sequential= True, tokenize= tokenizer , lower= True)

# Fields for Labels
Label = Field(sequential= False, use_vocab= False)

# Adding token at end and starting of the input string

SequenceField = Field(tokenize= tokenizer, init_token= '<sos>', eos_token= '<sos>', lower= True)

# Setting a field with a fix length

SequenceField = Field(tokenize= tokenizer, init_token= '<sos>', eos_token= '<sos>', lower= True, fix_length= 50)

# Setting an unknown token

SequenceField = Field(tokenize= tokenizer, init_token= '<sos>', eos_token= '<sos>', lower= True, unk_token= '<unk>')

In [6]:
from torchtext.data import TabularDataset

# Selecting the training columns
train_datafields = [("id", None), ("content", Review), ("Business", Label), ("SciTech", Label), ("Sports", Label), ("World", Label)]

# Selecting the testing columns
test_datafields = [('id',None), ('content',Review)]

# Reading the training and validation file
train, valid = TabularDataset.splits(path= '/content/NewsClassificaiton' ,train='train.csv', validation='valid.csv', format='csv', skip_header=True, fields=train_datafields)

# Reading the test file

test = TabularDataset(path= '/content/NewsClassificaiton/test.csv', format = 'csv', skip_header= True, fields= test_datafields)

In [7]:
# Building the vocabulary
Review = Field(sequential= True, tokenize= tokenizer , lower= True)

Review.build_vocab(train, min_freq=2)

### Word Embeddings

Word embeddings are learned representations of words. They are dense representations of words, where each word is assigned a vector, that is, a real-valued vector in a pre-defined vector space.

In [9]:
from torchtext import vocab

# loading vector embeddings

vec = vocab.Vectors('glove.6B.100d.txt', cache='./vec/glove_embedding/', url='http://nlp.stanford.edu/data/glove.6B.zip')

./vec/glove_embedding/glove.6B.zip: 862MB [06:30, 2.21MB/s]                           
100%|█████████▉| 398113/400000 [00:18<00:00, 22447.38it/s]

In [10]:
Review.build_vocab(train, min_freq = 2, vectors = vec)

TorchText has a vocab module that deals with embeddings. We can download pretrained embeddings by mentioning the name of the embedding that we need in this recipe.   
We then built the vocabulary from those pretrained embeddings, which added to the vocabulary of our training data, using the build_vocab method of the Review field object.  

### Building LSTM Network

Long short-term memory (LSTM) networks are a type of recurrent neural network that has internal gates that helps in better information persistence. 

In [11]:
import torch.nn as nn

In [17]:
class LSTMClassifier(nn.Module):
  def __init__(self, embedding_dim, hidden_dim, output_dim, dropout):
    super().__init__()
    self.embedding = nn.Embedding(len(Review.vocab),embedding_dim= embedding_dim)
    self.lstm = nn.LSTM(embedding_dim,hidden_dim)
    self.fc = nn.Linear(hidden_dim,output_dim)
    self.dropout = nn.Dropout(dropout)
  
  def forward(self, x):
    x = self.embedding(x)
    output , (hidden,cell) = self.lstm(x)
    hidden = self.dropout(hidden)
    return self.fc(hidden)

In [18]:
# Setting the hyperparameters:

EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
DROPOUT = 0.2

model = LSTMClassifier(EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, DROPOUT)

In [19]:
print(model)

LSTMClassifier(
  (embedding): Embedding(2, 100)
  (lstm): LSTM(100, 256)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
