# Learning NLP using the FastAI and Hugging Face Modules

### This notebook can be used for learning NLP with the Hugging Face module. It was created in Deepnote, and we need to install fastai and PyTorch for it.

In [None]:
!pip install fastai==2.5.2

Collecting fastai==2.5.2
  Downloading fastai-2.5.2-py3-none-any.whl (186 kB)
Collecting fastdownload<2,>=0.0.5
  Downloading fastdownload-0.0.5-py3-none-any.whl (13 kB)
Collecting fastprogress>=0.2.4
  Downloading fastprogress-1.0.0-py3-none-any.whl (12 kB)
Collecting fastcore<1.4,>=1.3.8
  Downloading fastcore-1.3.26-py3-none-any.whl (56 kB)
Installing collected packages: fastprogress, fastcore, fastdownload, fastai
Successfully installed fastai-2.5.2 fastcore-1.3.26 fastdownload-0.0.5 fastprogress-1.0.0
You should consider upgrading via the '/root/venv/bin/python -m pip install --upgrade pip' command.


##  Looking at the Data [Pandas]
For this notebook, we'll be looking at the Amazon Reviews Polarity dataset! The task is to predict whether a review is of positive or negative sentiment. The original Amazon Reviews dataset contains review scores ranging from 1-5. This polarity dataset combines review scores 1-2 into the negative class, 4-5 into the positive class, and ignores/drops review scores of 3!
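The score-to-polarity rule described above can be sketched in a few lines of plain Python. This is only an illustration of the rule, not the code that was used to build the dataset; the function name `score_to_polarity` is hypothetical, and the label values 1 (negative) and 2 (positive) are an assumed convention.

```python
from typing import Optional

# Assumed label convention: 1 = negative, 2 = positive
NEGATIVE, POSITIVE = 1, 2


def score_to_polarity(score: int) -> Optional[int]:
    """Map a 1-5 review score to a polarity label, or None if the review is dropped."""
    if score in (1, 2):
        return NEGATIVE
    if score in (4, 5):
        return POSITIVE
    if score == 3:
        return None  # neutral reviews are left out of the polarity dataset
    raise ValueError(f"score must be between 1 and 5, got {score}")


labels = [score_to_polarity(s) for s in (1, 2, 3, 4, 5)]
print(labels)  # [1, 1, None, 2, 2]
```

Returning `None` for a score of 3 makes the "drop" case explicit, so a caller can filter those reviews out before training.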

In [1]:
from fastai.text.all import *
import pandas as pd



In [2]:
path = untar_data(URLs.AMAZON_REVIEWS_POLARITY)
path

Path('f:/Notebooks/.fastai/data/amazon_review_polarity_csv')

#### Let's go ahead and take a look at our two DataFrames: train_df and valid_df
We're going to use 40k instead of 3.6M samples for training, and 2k instead of 400k samples for validation.



In [3]:
train_df = pd.read_csv(path/'train.csv', names=['label', 'title', 'text'], nrows=40000)
valid_df = pd.read_csv(path/'test.csv', names=['label', 'title', 'text'], nrows=2000)
train_df.head()

Unnamed: 0,label,title,text
0,2,Stuning even for the non-gamer,This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^
1,2,The best soundtrack ever to anything.,"I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny."
2,2,Amazing!,"This soundtrack is my favorite music of all time, hands down. The intense sadness of ""Prisoners of Fate"" (which means all the more if you've played the game) and the hope in ""A Distant Promise"" and ""Girl who Stole the Star"" have been an important inspiration to me personally throughout my teen years. The higher energy tracks like ""Chrono Cross ~ Time's Scar~"", ""Time of the Dreamwatch"", and ""Chronomantique"" (indefinably remeniscent of Chrono Trigger) are all absolutely superb as well.This soundtrack is amazing music, probably the best of this composer's work (I haven't heard the Xenogears s..."
3,2,Excellent Soundtrack,"I truly like this soundtrack and I enjoy video game music. I have played this game and most of the music on here I enjoy and it's truly relaxing and peaceful.On disk one. my favorites are Scars Of Time, Between Life and Death, Forest Of Illusion, Fortress of Ancient Dragons, Lost Fragment, and Drowned Valley.Disk Two: The Draggons, Galdorb - Home, Chronomantique, Prisoners of Fate, Gale, and my girlfriend likes ZelbessDisk Three: The best of the three. Garden Of God, Chronopolis, Fates, Jellyfish sea, Burning Orphange, Dragon's Prayer, Tower Of Stars, Dragon God, and Radical Dreamers - Uns..."
4,2,"Remember, Pull Your Jaw Off The Floor After Hearing it","If you've played the game, you know how divine the music is! Every single song tells a story of the game, it's that good! The greatest songs are without a doubt, Chrono Cross: Time's Scar, Magical Dreamers: The Wind, The Stars, and the Sea and Radical Dreamers: Unstolen Jewel. (Translation varies) This music is perfect if you ask me, the best it can be. Yasunori Mitsuda just poured his heart on and wrote it down on paper."


In [4]:
sample_text = train_df['text'][0]
sample_text

'This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

#### We need to install torchtext for the next section

In [9]:
!pip install torchtext==0.10.0

Collecting torchtext==0.10.0
  Using cached torchtext-0.10.0-cp36-cp36m-win_amd64.whl (1.3 MB)
Installing collected packages: torchtext
Successfully installed torchtext-0.10.0


##  Tokenization and Numericalization [PyTorch]
We now want to first tokenize our inputs, then numericalize them using a vocab. Quick recap of these terms:

- Tokenization = The process of converting an input string into "pieces"
  - These pieces can be whole words, subwords, or even characters
- Numericalization = The process of converting a token into a numeric representation (e.g. token -> number)
  - This is done through the use (and creation) of a vocab

There are many fancy tokenizers out there, but since we're first doing things from scratch, we'll go ahead and use the simple basic_english tokenizer from torchtext, which lowercases the text, separates punctuation, and splits on spaces.
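To make the two terms concrete, here is a minimal sketch using only the standard library. The regex-based `simple_tokenize` is a hypothetical stand-in and not torchtext's basic_english tokenizer, and the tiny `stoi` mapping is made up for illustration.

```python
import re
from typing import Dict, List


def simple_tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())


def numericalize(tokens: List[str], stoi: Dict[str, int], unk_index: int = 0) -> List[int]:
    """Look up each token's id, falling back to unk_index for tokens not in the vocab."""
    return [stoi.get(tok, unk_index) for tok in tokens]


# Toy vocab (hypothetical): index 0 is reserved for unknown tokens
stoi = {"<unk>": 0, "this": 1, "was": 2, "beautiful": 3, "!": 4}

tokens = simple_tokenize("This sound track was beautiful!")
ids = numericalize(tokens, stoi)
print(tokens)  # ['this', 'sound', 'track', 'was', 'beautiful', '!']
print(ids)     # [1, 0, 0, 2, 3, 4]
```

The words "sound" and "track" are missing from the toy vocab, so both map to index 0, the unknown token.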

In [11]:
import torch
import torchtext
from torchtext.data import get_tokenizer

tokenizer = get_tokenizer("basic_english")

##### fastai's `L` is basically Python's list, but with some convenient properties, such as displaying the number of elements and not spamming your screen with output when the list is too long!

In [12]:
tokens = L(tokenizer(sample_text))
tokens

(#81) ['this','sound','track','was','beautiful','!','it','paints','the','senery'...]

##### Next we'll need to check how many tokens there are in our dataset, and keep the frequent ones as part of our vocab.

In [13]:
from collections import Counter

token_counter = Counter()

# Use a separate loop variable so the earlier sample_text is not overwritten
for text in train_df['text']:
    tokens = tokenizer(text)
    token_counter.update(tokens)

token_counter.most_common(n=25)

[('.', 213962),
 ('the', 158787),
 (',', 116525),
 ('i', 91270),
 ('and', 86059),
 ('a', 77977),
 ('to', 74984),
 ('it', 69999),
 ('of', 65144),
 ("'", 60523),
 ('this', 59382),
 ('is', 56445),
 ('in', 37890),
 ('that', 33891),
 ('for', 30532),
 ('was', 29163),
 ('you', 26740),
 ('!', 25238),
 ('book', 24698),
 ('s', 23897),
 ('but', 22602),
 ('with', 21998),
 ('not', 21988),
 ('on', 20759),
 ('t', 20097)]

In [14]:
len(token_counter)

75889

#### Now that we have our token frequency counter, we can go ahead and make our vocab!

##### Important: We'll be using `<unk>` as our default token for tokens that are out of our vocab!
Important: Notice how we pass in a min_freq argument. This ensures that the vocab only includes high-frequency tokens; we wouldn't want to include tokens that occur only once or rarely. With min_freq=20, our vocab count drops from 75,889 to 7,591, a ~90% reduction!
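The effect of a minimum frequency and a default unknown index can also be sketched in plain Python, independent of torchtext. The helper `build_vocab` and the toy sentence are hypothetical, and the sketch only mirrors the idea, not torchtext's implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_vocab(
    counter: Counter,
    min_freq: int,
    specials: Tuple[str, ...] = ("<unk>", "<pad>"),
) -> Tuple[List[str], Dict[str, int]]:
    """Return (itos, stoi): special tokens first, then tokens seen at least min_freq times."""
    kept = [
        tok
        for tok, freq in counter.most_common()
        if freq >= min_freq and tok not in specials
    ]
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


toy_counter = Counter("the cat sat on the mat the cat".split())
itos, stoi = build_vocab(toy_counter, min_freq=2)
print(itos)  # ['<unk>', '<pad>', 'the', 'cat']

# Tokens below the threshold (or never seen) fall back to the <unk> index
print(stoi.get("sat", stoi["<unk>"]))  # 0
```

Placing the special tokens at indices 0 and 1 matches the order used in this notebook, where `<unk>` is inserted at 0 and `<pad>` at 1.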



In [19]:
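from collections import OrderedDict

# Order tokens from most to least frequent; the vocab factory takes an ordered dict of token -> frequency
sorted_counter = OrderedDict(token_counter.most_common())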

# Create vocab containing tokens with a minimum frequency of 20
my_vocab = torchtext.vocab.vocab(sorted_counter, min_freq=20)

# Add the unknown token, and use this by default for unknown words
unk_token = '<unk>'
my_vocab.insert_token(unk_token, 0)
my_vocab.set_default_index(0)

# Add the pad token
pad_token = '<pad>'
my_vocab.insert_token(pad_token, 1)

# Show vocab size, and examples of tokens
len(my_vocab.get_itos()), my_vocab.get_itos()[:25]

AttributeError: module 'torchtext.vocab' has no attribute 'vocab'

#### Rather than starting from scratch, we can preload GloVe embeddings into our vocabulary!



In [16]:
glove = torchtext.vocab.GloVe(name='6B', dim=100)
glove.vectors.shape

.vector_cache\glove.6B.zip: 862MB [29:27, 488kB/s]                                                                     
100%|██████████████████████████████████████████████████████████████████████▉| 399999/400000 [00:33<00:00, 11839.95it/s]


torch.Size([400000, 100])

#### Since we're using GloVe vectors for transfer learning (by preloading our embedding), let's take a look at how many tokens can be successfully transferred from GloVe into our own vocab. Each token will have an embedding (vector) of size 100. This results in an embedding of size 7591x100

In [None]:
my_vocab.vectors = glove.get_vecs_by_tokens(my_vocab.get_itos())
my_vocab.vectors.shape

torch.Size([7591, 100])

##### By default, tokens that aren't able to transfer from GloVe into our own dataset get initialized with a vector of 0's. We can use this to count how many tokens were successfully preloaded!

In [None]:
tot_transferred = 0
for v in my_vocab.vectors:
    if not v.equal(torch.zeros(100)):
        tot_transferred += 1
        
tot_transferred, len(my_vocab)

NameError: name 'my_vocab' is not defined
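The zero-vector check can also be tried on toy data without torchtext or a defined `my_vocab`. In this sketch the `pretrained` dictionary is a hypothetical stand-in for GloVe, the embedding size is 3 instead of 100, and `vectors_for` only imitates the described behavior of returning zeros for missing tokens.

```python
from typing import Dict, List

DIM = 3  # toy embedding size; the notebook uses 100

# Hypothetical pretrained vectors standing in for GloVe
pretrained: Dict[str, List[float]] = {
    "the": [0.1, 0.2, 0.3],
    "book": [0.4, 0.5, 0.6],
}


def vectors_for(tokens: List[str]) -> List[List[float]]:
    """Return one vector per token, using all zeros when no pretrained vector exists."""
    return [pretrained.get(tok, [0.0] * DIM) for tok in tokens]


toy_itos = ["<unk>", "<pad>", "the", "book", "senery"]
vectors = vectors_for(toy_itos)

# A token counts as transferred if its vector is not all zeros
tot_transferred = sum(1 for v in vectors if any(x != 0.0 for x in v))
print(tot_transferred, len(toy_itos))  # 2 5
```

Here the special tokens and the misspelled "senery" have no pretrained vector, so only 2 of the 5 tokens are counted as transferred.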
