# **Working with text**


>

**Step 1. Import your data**

**Step 2. Clean text**
* Encoding errors
* HTML tags
* Punctuation (regular expressions)

**Step 3. Tokenization**
* Word_tokenizer
* Regex_tokenizer

**Step 4. More processing**
* Removing stop words
* Lemmatization
* Stemming

**Step 5. Text featurization**
- Bag of Words
- N-gram
- Word Embedding
...

.

**TorchText**

>

In [0]:
from google.colab import files
import os
import pandas as pd
import nltk
import pprint

>

## **Step 1. Import your data**

>

**What's in *example.txt*:**

< p >

"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.< br >
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.< br >
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.< br >
"We've got Father and Mother, and each other," said Beth contentedly from her corner.< br >
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time." She didn't say "perhaps never," but each silently added it, thinking of Father far away, where the fighting was.< br >

< /p >

>

In [5]:
# google file open
file_path = "./example.txt"
if not os.path.isfile(file_path):
    uploaded = files.upload()

Saving example.txt to example.txt


In [6]:
# OS file open
with open(file_path, "r", encoding="utf-8") as f:
    data = f.readlines()  # one string per line, newline characters included

pprint.pprint(data)

['<p>\n',
 '"Christmas won\'t be Christmas without any presents," grumbled Jo, lying on '
 'the rug.<br>\n',
 '"It\'s so dreadful to be poor!" sighed Meg, looking down at her old '
 'dress.<br>\n',
 '"I don\'t think it\'s fair for some girls to have plenty of pretty things, '
 'and other girls nothing at all," added little Amy, with an injured '
 'sniff.<br>\n',
 '"We\'ve got Father and Mother, and each other," said Beth contentedly from '
 'her corner.<br>\n',
 'The four young faces on which the firelight shone brightened at the cheerful '
 'words, but darkened again as Jo said sadly, "We haven\'t got Father, and '
 'shall not have him for a long time." She didn\'t say "perhaps never," but '
 'each silently added it, thinking of Father far away, where the fighting '
 'was.<br>\n',
 '</p>']


>

In [7]:
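# Pandas dataframe: one row per line of the file
# Note: sep="\n" worked in the pandas version used here; newer releases may reject it
data = pd.read_csv(file_path, sep="\n", header=None)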

pprint.pprint(data)

                                                   0
0                                                <p>
1  Christmas won't be Christmas without any prese...
2  It's so dreadful to be poor! sighed Meg, looki...
3  I don't think it's fair for some girls to have...
4  We've got Father and Mother, and each other, s...
5  The four young faces on which the firelight sh...
6                                               </p>


>

## **Step 2. Clean Text**

>

### **Study your data**

Some questions to consider:

- Are there any encoding errors?

- Should I remove tags?

- Are there any unnecessary spaces?

- Is the punctuation important?
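
The questions above can be turned into a small cleaning helper. The sketch below is only an illustration using the standard library: the function name `clean_line` and the sample line (including the `&amp;` entity and the extra spaces) are made up for this example and are not part of the notebook's data.

```python
import html
import re

def clean_line(raw: bytes) -> str:
    """Decode bytes, strip tags, unescape entities, and collapse whitespace."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising
    text = raw.decode("utf-8", errors="replace")
    text = re.sub(r"<.*?>", "", text)          # drop HTML tags such as <p> or <br>
    text = html.unescape(text)                 # e.g. &amp; becomes &
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text

# Hypothetical sample line with extra spaces, an entity, and a tag
sample = b'"It\'s so   dreadful to be poor!" sighed Meg &amp; Jo.<br>\n'
print(clean_line(sample))
```

Whether punctuation should also be removed depends on the task, so this sketch leaves it in place.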



>

**What's in *example.txt*:**

< p >

"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.< br >
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.< br >
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.< br >
"We've got Father and Mother, and each other," said Beth contentedly from her corner.< br >
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time." She didn't say "perhaps never," but each silently added it, thinking of Father far away, where the fighting was.< br >

< /p >

 >

In [0]:
import re

def preprocessing(data):
    """Strip HTML tags and lowercase every line in the first DataFrame column."""
    new_data = []
    for sentence in data[0]:  # column 0 holds one line of text per row
        new_sentence = re.sub(r'<.*?>', '', sentence)  # remove HTML tags
        # new_sentence = re.sub(r'[^\w\s]', '', new_sentence)  # optionally remove punctuation
        new_sentence = new_sentence.lower()  # convert to lowercase
        if new_sentence != '':  # skip lines that held only tags, e.g. <p>
            new_data.append(new_sentence)
    return new_data

>

***Why would you consider converting text to lowercase?***

>All these are considered different words:

>CHRISTMAS, Christmas, christmas, ChRiStMaS
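
A quick way to see the effect is to count the four spellings before and after lowercasing. This is a minimal sketch with the standard library; the variable names are chosen for this example only.

```python
from collections import Counter

words = ["CHRISTMAS", "Christmas", "christmas", "ChRiStMaS"]

raw_counts = Counter(words)                        # case-sensitive counting
lower_counts = Counter(w.lower() for w in words)   # case-insensitive counting

print(len(raw_counts))    # number of distinct tokens without lowercasing
print(lower_counts)       # all four spellings collapse into one token
```

Without lowercasing, the vocabulary holds four separate entries for what a reader would call one word.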


>



***Some examples of cases where you would keep the letter case***

>When your task requires analysis of proper nouns, such as the names of locations, organizations, and/or people.

>>- "Buttons" as a cat name **vs** "buttons" as objects

>>- "Fluffy" as a cat name **vs** "fluffy" as an adjective describing a property

>

In [11]:
cleaned_data = preprocessing(data)
pprint.pprint(cleaned_data)

["christmas won't be christmas without any presents, grumbled jo, lying on the "
 'rug.',
 "it's so dreadful to be poor! sighed meg, looking down at her old dress.",
 "i don't think it's fair for some girls to have plenty of pretty things, and "
 'other girls nothing at all, added little amy, with an injured sniff.',
 "we've got father and mother, and each other, said beth contentedly from her "
 'corner.',
 'the four young faces on which the firelight shone brightened at the cheerful '
 'words, but darkened again as jo said sadly, "we haven\'t got father, and '
 'shall not have him for a long time." she didn\'t say "perhaps never," but '
 'each silently added it, thinking of father far away, where the fighting was.']


>

## **Step 3. Tokenize into words**

>

In [12]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')


token_text = [word_tokenize(d) for d in cleaned_data] # tokenizes without removing punctuation
print(token_text[0])

token_text = [tokenizer.tokenize(d) for d in cleaned_data] # tokenizes with punctuation removed
print(token_text[0])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['christmas', 'wo', "n't", 'be', 'christmas', 'without', 'any', 'presents', ',', 'grumbled', 'jo', ',', 'lying', 'on', 'the', 'rug', '.']
['christmas', 'won', 't', 'be', 'christmas', 'without', 'any', 'presents', 'grumbled', 'jo', 'lying', 'on', 'the', 'rug']
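
The second output line shows what the pattern `\w+` does to contractions: the apostrophe is not a word character, so "won't" splits into "won" and "t". The sketch below reproduces that behaviour with the standard `re` module alone, which is assumed here to be a fair stand-in for a regex tokenizer with that pattern. The second pattern, which keeps contractions together, is a suggestion for illustration and not something the notebook uses.

```python
import re

sentence = "christmas won't be christmas without any presents, grumbled jo, lying on the rug."

# Every maximal run of word characters becomes a token; punctuation is dropped
tokens = re.findall(r"\w+", sentence)
print(tokens)

# Hypothetical alternative: allow one apostrophe inside a token to keep contractions whole
tokens_keep = re.findall(r"\w+(?:'\w+)?", sentence)
print(tokens_keep)
```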


>


## **Step 4. More Processing**


>

### **a. Remove stop words**
*Removing commonly used words (such as “the”, “a”, “an”, “in”)*

>

**Why would you REMOVE stop words?**
> If you use a bag-of-words method that relies on word counts and frequencies, removing stop words helps: it lowers the dimensionality of the feature space, and a handful of very frequent stop words no longer drives your analysis.

>

***Why would you KEEP stop words?***

>In the context of sentiment analysis, removing stop words can be problematic if context is affected.

>> example:
>>> "**I do not** like **this** movie"  **vs** "I like **this** movie"

>>> ---> *[I, do, not, this] are stop words*

>>> **output**: "like movie" **vs** "like movie"
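
The movie example can be reproduced in a few lines. This is a sketch under stated assumptions: `STOP_WORDS` is a hand-picked mini list containing only the four words named above (not the NLTK list), and keeping negations is one possible workaround rather than a recommendation from the notebook.

```python
# Hypothetical mini stop-word list, limited to the words from the example
STOP_WORDS = {"i", "do", "not", "this"}
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(sentence, stop_words):
    """Lowercase, split on whitespace, and drop every stop word."""
    return [w for w in sentence.lower().split() if w not in stop_words]

negative = "I do not like this movie"
positive = "I like this movie"

print(remove_stop_words(negative, STOP_WORDS))
print(remove_stop_words(positive, STOP_WORDS))

# Keeping negation words preserves the difference between the two reviews
keep_negation = STOP_WORDS - NEGATIONS
print(remove_stop_words(negative, keep_negation))
```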




>

In [13]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = stopwords.words('english')

print(stop_words[:5])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
['i', 'me', 'my', 'myself', 'we']


>

In [14]:
rm_stop = [[word for word in sent if word not in stop_words] for sent in token_text]
pprint.pprint(rm_stop[0])

['christmas',
 'christmas',
 'without',
 'presents',
 'grumbled',
 'jo',
 'lying',
 'rug']


>

### **b. Lemmatization**
*The process of grouping together the different inflected forms of a word so they can be analysed as a single item*


> **Examples of lemmatization:**

>rocks -> rock

>corpora -> corpus

>better -> good

.

***Why would you lemmatize the data?***
> We want our system to recognize different forms of a word as the same token.

>> rocks, rock -> rock

>> corpora, corpus -> corpus

>> better, well, good -> good

In [16]:
from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer() 

print("rocks :", lemmatizer.lemmatize("rocks")) 
print("corpora :", lemmatizer.lemmatize("corpora")) 
print("better :", lemmatizer.lemmatize("better", pos ="a")) # a for adj

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
rocks : rock
corpora : corpus
better : good


In [17]:
lemm_data = [[lemmatizer.lemmatize(word) for word in sent] for sent in token_text]
print(lemm_data)


[['christmas', 'won', 't', 'be', 'christmas', 'without', 'any', 'present', 'grumbled', 'jo', 'lying', 'on', 'the', 'rug'], ['it', 's', 'so', 'dreadful', 'to', 'be', 'poor', 'sighed', 'meg', 'looking', 'down', 'at', 'her', 'old', 'dress'], ['i', 'don', 't', 'think', 'it', 's', 'fair', 'for', 'some', 'girl', 'to', 'have', 'plenty', 'of', 'pretty', 'thing', 'and', 'other', 'girl', 'nothing', 'at', 'all', 'added', 'little', 'amy', 'with', 'an', 'injured', 'sniff'], ['we', 've', 'got', 'father', 'and', 'mother', 'and', 'each', 'other', 'said', 'beth', 'contentedly', 'from', 'her', 'corner'], ['the', 'four', 'young', 'face', 'on', 'which', 'the', 'firelight', 'shone', 'brightened', 'at', 'the', 'cheerful', 'word', 'but', 'darkened', 'again', 'a', 'jo', 'said', 'sadly', 'we', 'haven', 't', 'got', 'father', 'and', 'shall', 'not', 'have', 'him', 'for', 'a', 'long', 'time', 'she', 'didn', 't', 'say', 'perhaps', 'never', 'but', 'each', 'silently', 'added', 'it', 'thinking', 'of', 'father', 'far',

>

### **c. Stemming**
*Reducing the morphological variants of a word to a common root/base form (the stem)*


>**Example of stemming:**

>likes, liked, likely, liking -> like

>chocolates, chocolatey -> choco


In [18]:
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
   
ps = PorterStemmer() 
  
words = ["program", "programs", "programer", "programing", "programers"] 
  
for w in words: 
    print(w, " : ", ps.stem(w)) 

program  :  program
programs  :  program
programer  :  program
programing  :  program
programers  :  program


In [19]:
stem_data = [[ps.stem(word) for word in sent] for sent in token_text]
print(stem_data)

[['christma', 'won', 't', 'be', 'christma', 'without', 'ani', 'present', 'grumbl', 'jo', 'lie', 'on', 'the', 'rug'], ['it', 's', 'so', 'dread', 'to', 'be', 'poor', 'sigh', 'meg', 'look', 'down', 'at', 'her', 'old', 'dress'], ['i', 'don', 't', 'think', 'it', 's', 'fair', 'for', 'some', 'girl', 'to', 'have', 'plenti', 'of', 'pretti', 'thing', 'and', 'other', 'girl', 'noth', 'at', 'all', 'ad', 'littl', 'ami', 'with', 'an', 'injur', 'sniff'], ['we', 've', 'got', 'father', 'and', 'mother', 'and', 'each', 'other', 'said', 'beth', 'contentedli', 'from', 'her', 'corner'], ['the', 'four', 'young', 'face', 'on', 'which', 'the', 'firelight', 'shone', 'brighten', 'at', 'the', 'cheer', 'word', 'but', 'darken', 'again', 'as', 'jo', 'said', 'sadli', 'we', 'haven', 't', 'got', 'father', 'and', 'shall', 'not', 'have', 'him', 'for', 'a', 'long', 'time', 'she', 'didn', 't', 'say', 'perhap', 'never', 'but', 'each', 'silent', 'ad', 'it', 'think', 'of', 'father', 'far', 'away', 'where', 'the', 'fight', 'wa'

>

***Lemmatization vs Stemming***

Lemmatization performs morphological analysis, so it returns valid dictionary forms (lemmas),

whereas stemming cuts off parts of words, so the stems it produces are not always real words (for example 'christma' and 'plenti' in the output above).
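
The contrast can be sketched without any library. Both functions below are deliberately naive toys written for this illustration: `toy_stem` strips a few hard-coded suffixes and `toy_lemmatize` looks words up in a tiny hand-made dictionary. Neither is the Porter algorithm nor the WordNet lemmatizer.

```python
SUFFIXES = ("ing", "ly", "ed", "s")   # hypothetical, far from complete

def toy_stem(word):
    """Cut off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini dictionary mapping inflected forms to their lemmas
LEMMAS = {"better": "good", "corpora": "corpus", "rocks": "rock"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

for w in ["silently", "grumbled", "corpora", "better"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer produces "grumbl", which is not a word, while the lemmatizer can map irregular forms such as "corpora" only because it has a dictionary to consult.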

>

## **Step 5. Text featurization**


### **Bag of words**

Text (such as a sentence or a document) is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity.

.

**Step 1.** Create vocabulary

![alt text](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/assets/atap_0401.png)




**Step 2.** Encode
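
The two steps can be sketched by hand before reaching for a library. The two toy sentences and the helper name `encode` are made up for this illustration; the vocabulary order (alphabetical) is an arbitrary choice.

```python
# Two made-up documents for illustration
docs = [
    "the king threw a ball",
    "the dog chased the ball",
]
tokenized = [d.split() for d in docs]

# Step 1: the vocabulary is the sorted set of all distinct words
vocab = sorted({word for doc in tokenized for word in doc})
index = {word: i for i, word in enumerate(vocab)}

# Step 2: each document becomes a vector of word counts over the vocabulary
def encode(tokens):
    vector = [0] * len(vocab)
    for token in tokens:
        vector[index[token]] += 1
    return vector

print(vocab)
for doc in tokenized:
    print(encode(doc))
```

Each position in a vector corresponds to one vocabulary word, so word order inside the sentence no longer matters.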


>

### **a. Count/frequency based BOW**

*Counts how many times each word appears in a sentence from the bag*

![alt text](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/assets/atap_0402.png)


In [42]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(data[:][0])

print(vectors[1].toarray())


[[0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0]]


>



### **b. Binary one-hot encoding**

*Checks the existence of words in a sentence from the bag*

![alt text](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/assets/atap_0403.png)

In [41]:
from sklearn.preprocessing import Binarizer

freq   = CountVectorizer()
corpus = freq.fit_transform(data[:][0])

onehot = Binarizer()
corpus = onehot.fit_transform(corpus)

print(corpus[1].toarray())


[[0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0]]


>



### **c. TF-IDF**

*TF-IDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.*

*The value increases proportionally to the number of times a word appears in the sentence and is offset by the number of sentences in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.*

![alt text](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/assets/atap_0404.png)
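
To make the definition concrete, here is a minimal sketch of the textbook formula, tf-idf = tf × log(N / df), on three made-up documents. It is not what `TfidfVectorizer` computes exactly: scikit-learn by default uses a smoothed idf and L2-normalizes each row, so its numbers will differ from these.

```python
import math

# Three made-up, already tokenized documents
docs = [
    ["the", "king", "threw", "the", "ball"],
    ["the", "dog", "chased", "the", "ball"],
    ["a", "dog", "barked"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency; assumes `term` occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["the", "king"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

Although "the" occurs twice in the first document and "king" only once, "king" scores higher because it appears in fewer documents.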


In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf  = TfidfVectorizer()
corpus = tfidf.fit_transform(data[:][0])

print(corpus[1].toarray())


[[0.         0.         0.         0.         0.         0.
  0.26681038 0.         0.         0.         0.22147553 0.
  0.14397509 0.         0.         0.         0.53362075 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.26681038 0.         0.         0.         0.
  0.         0.         0.22147553 0.         0.         0.
  0.26681038 0.         0.         0.         0.         0.
  0.         0.         0.22147553 0.         0.         0.
  0.         0.26681038 0.         0.26681038 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.22147553 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.26681038 0.26681038 0.
  0.        ]]


>

***Pros and Cons of Bag-of-words representations:***

> **Pros:** Can work with a small amount of data. Different preprocessing methods can help control the dimensionality.

> **Cons:** Does not take the context of the text into account. Syntactic and semantic information is lost.
>> e.g., these sentences will be considered the same:
 >>>The king threw a ball at a dog.
 
 >>>A dog threw a ball at the king.
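
The claim is easy to check: after lowercasing, both sentences contain exactly the same words with the same counts. A minimal sketch with the standard library (the helper name `bag` is made up here):

```python
from collections import Counter

def bag(sentence):
    """Lowercase, drop the final period, and count the words."""
    return Counter(sentence.lower().rstrip(".").split())

s1 = "The king threw a ball at a dog."
s2 = "A dog threw a ball at the king."

print(bag(s1))
print(bag(s1) == bag(s2))   # identical bags despite opposite meanings
```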

### **N-gram**

*An n-gram is a sequence of n consecutive words or characters. For example, with n = 3 and # marking the sentence boundaries, the sentence "I live in NY" yields the trigrams "# I live", "I live in", "live in NY", "in NY #". Counting n-grams gives an index of how often words follow one another.*


![alt text](https://i.stack.imgur.com/8ARA1.png)
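
The "I live in NY" example can be reproduced with a short helper. This is a sketch, assuming a single `#` boundary marker on each side as in the example; the function name `ngrams` is chosen for this illustration.

```python
def ngrams(tokens, n, pad="#"):
    """Return all n-grams of `tokens`, with one boundary marker on each side."""
    padded = [pad] + list(tokens) + [pad]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

for gram in ngrams("I live in NY".split(), 3):
    print(" ".join(gram))
```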


>

In [44]:

bigram = CountVectorizer(ngram_range=(2,2)) # bigram example(unigram range(1,1),bigram(2,2),trigram(3,3),...)

corpus = bigram.fit_transform(data[:][0])

print(corpus[1].toarray())
for w in bigram.get_feature_names_out():  # get_feature_names() in older scikit-learn releases
    print(w)
    


[[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
  0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 0 0 0 0 1 1 0 0]]
added it
added little
again as
all added
amy with
an injured
and each
and mother
and other
and shall
any presents
as jo
at all
at her
at the
away where
be christmas
be poor
beth contentedly
brightened at
but darkened
but each
cheerful words
christmas without
christmas won
contentedly from
corner br
darkened again
didn say
don think
down at
dreadful to
dress br
each other
each silently
faces on
fair for
far away
father and
father far
fighting was
firelight shone
for long
for some
four young
from her
girls nothing
girls to
got father
grumbled jo
have him
have plenty
haven got
her corner
her old
him for
injured sniff
it fair
it so
it thinking
jo lying
jo said
little amy
long time
looking down
lying on
meg looking
mother and
never but
not have
noth


>

**Pros and Cons**

The N-gram model, like many statistical models, is very dependent on the training corpus. One implication of this is that the probabilities often encode very specific facts about a given training corpus. Another implication is that N-grams do a better and better job of modeling the training corpus as we increase the value of N.

>

### **Word Embedding**

*Vector representations of words*

![alt text](https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2018/01/word-vector-space-similar-words.png)

>

***Pros and Cons of Word Embeddings:***

> **Pros:**
>- Capable of capturing context of a word in a document,
>- Semantic and syntactic similarity, relation with other words.

> **Cons:** 

>- Large amount of data required for training.
>- Word embeddings are known to contain biases, and debiasing trained embeddings does not remove them entirely.
>- Words with the same spelling but different meanings are represented by the same vector.

>>> Walls that cannot **bear** a stone vault

>>> How to be a good mama **bear**







>

In [0]:
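# Word2vec model for embeddings
# (gensim >= 4 API; older releases use size=... and model.wv.vocab instead)
from gensim.models import Word2Vec

# min_count=1 keeps rare words; the default of 5 would drop most of this tiny corpus
model = Word2Vec(vector_size=300, min_count=1)
model.build_vocab(token_text)

total_examples = model.corpus_count  # number of sentences in the corpus
model.train(token_text, total_examples=total_examples, epochs=model.epochs)

words = list(model.wv.index_to_key)  # vocabulary, most frequent first
X_data = model.wv[words]             # embedding matrix, one row per word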



In [0]:
model.wv['christmas']

array([ 1.54286833e-03, -1.65511540e-03, -3.94200397e-05,  8.41436733e-04,
        1.22157449e-03,  3.10122879e-04,  4.81995703e-05, -4.56807757e-04,
       -7.90196413e-04, -4.09548433e-04,  1.28969131e-03,  2.05074291e-04,
        1.05342537e-03,  5.16896893e-04,  6.29765331e-04, -1.46103022e-03,
       -1.09345556e-05,  8.13270570e-04,  6.35003846e-04,  4.06869513e-04,
        3.47754365e-04,  6.57792727e-04,  1.29037525e-03,  1.49559265e-03,
        1.14368740e-03, -3.85907595e-04,  2.14804852e-04, -3.97174241e-04,
        1.06305641e-03, -1.24074775e-03,  6.78825541e-04,  6.65061671e-05,
       -1.47427330e-04,  1.61041308e-03,  6.63857674e-04,  1.72056083e-04,
       -4.65855497e-04,  5.62146830e-04, -1.08092767e-03, -1.28072267e-03,
        6.68620167e-04,  5.32450969e-04, -1.07941160e-03, -1.28256751e-03,
       -7.81939540e-04,  9.98259289e-04,  7.94147898e-04,  2.18402070e-04,
       -2.93476012e-04,  1.11167668e-03, -6.87367690e-04,  2.64775910e-04,
       -5.75456186e-04,  

In [0]:
model.wv.most_similar('christmas')



[('faces', 0.14987346529960632),
 ('things', 0.1376577764749527),
 ('t', 0.10588370263576508),
 ('but', 0.0995744988322258),
 ('silently', 0.07620739191770554),
 ('without', 0.07240990549325943),
 ('and', 0.0710473582148552),
 ('some', 0.06711910665035248),
 ('any', 0.06545284390449524),
 ('thinking', 0.06207112967967987)]
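
The scores in this list are cosine similarities between word vectors. The sketch below shows the computation on made-up three-dimensional vectors; the words, the numbers, and the helper names are all hypothetical and unrelated to the trained model.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings" for illustration only
vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "rug":   [0.10, 0.00, 0.95],
}

def most_similar(word, vectors):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(most_similar("king", vectors))
```

A score near 1 means the vectors point in almost the same direction, and a score near 0 means they are almost orthogonal.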

>

>

## **TorchText**


PyTorch package for text preprocessing: [TorchText documentation](https://torchtext.readthedocs.io/en/latest/)

In [0]:
# google file open
file_path_csv = "./example.csv"
if not os.path.isfile(file_path_csv):
    uploaded = files.upload()
    
# Pandas dataframe (on_bad_lines="skip" replaces the older error_bad_lines=False)
csv_data = pd.read_csv(file_path_csv, on_bad_lines="skip")
pprint.pprint(csv_data)

Saving example.csv to example.csv
   ID                                           Sentence  Label
0   1  Christmas won't be Christmas without any prese...      0
1   2  It's so dreadful to be poor! sighed Meg, looki...      1
2   3  I don't think it's fair for some girls to have...      0
3   4  We've got Father and Mother, and each other, s...      1


In [0]:
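# Note: Field and TabularDataset belong to the legacy TorchText API
# (moved to torchtext.legacy in later releases, then removed)
from torchtext.data import Field
from torchtext.data import TabularDataset
from torchtext.vocab import GloVe, Vectors

# You can define your own tokenizer or call existing tokenizers.
# Note: lemmatize() receives the whole cleaned string here, so individual words
# are not lemmatized; lemmatize each token after split() if that is the intent.
tokenize = lambda x: lemmatizer.lemmatize(re.sub(r'<.*?>|[^\w\s]|\d+', '', x)).split()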

# Define your field placeholders
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)
LABEL = Field(sequential=False, use_vocab=False)


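# Note: without skip_header=True the CSV header row is treated as an example
# (which would explain 'sentence' in the vocabulary printed below)
csv_data = TabularDataset(file_path_csv, format="csv",
                          fields = [("ID", None),("Sentence", TEXT), ("Label", LABEL)])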

# Build your vocabulary bank
TEXT.build_vocab(csv_data, max_size=200, vectors=GloVe(name='6B', dim=300))
word_emb = TEXT.vocab.vectors

LABEL.build_vocab(csv_data)

.vector_cache/glove.6B.zip: 862MB [00:50, 17.2MB/s]                           
100%|█████████▉| 399957/400000 [00:46<00:00, 8686.71it/s]

In [0]:
# Word List
word_lst =list(TEXT.vocab.itos)
print(word_lst)

['<unk>', '<pad>', 'and', 'at', 'be', 'christmas', 'girls', 'her', 'its', 'other', 'to', 'added', 'all', 'amy', 'an', 'any', 'beth', 'contentedly', 'corner', 'dont', 'down', 'dreadful', 'dress', 'each', 'fair', 'father', 'for', 'from', 'got', 'grumbled', 'have', 'i', 'injured', 'jo', 'little', 'looking', 'lying', 'meg', 'mother', 'nothing', 'of', 'old', 'on', 'plenty', 'poor', 'presents', 'pretty', 'rug', 'said', 'sentence', 'sighed', 'sniff', 'so', 'some', 'the', 'things', 'think', 'weve', 'with', 'without', 'wont']
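
The printed list suggests how the vocabulary is laid out: special tokens first, then words by descending frequency, with ties in alphabetical order. The sketch below imitates that layout in plain Python. It is an approximation written for this illustration, not TorchText's implementation, and the function name `build_vocab` and its parameters are hypothetical.

```python
from collections import Counter

def build_vocab(token_lists, max_size=None, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): index-to-string list and string-to-index dict."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort alphabetically first, then by descending frequency (the sort is stable)
    ordered = sorted(counts.items(), key=lambda kv: kv[0])
    ordered.sort(key=lambda kv: kv[1], reverse=True)
    words = [word for word, _ in ordered][:max_size]
    itos = list(specials) + words
    stoi = {word: i for i, word in enumerate(itos)}
    return itos, stoi

itos, stoi = build_vocab([["and", "at", "be", "and"], ["christmas", "at", "and"]])
print(itos)
print(stoi["and"])
# Unknown words fall back to the index of <unk>
print(stoi.get("rug", stoi["<unk>"]))
```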


In [0]:
print(word_lst[20])
print(word_emb[20])

down
tensor([-8.1429e-02, -1.1004e-01, -3.1034e-02,  6.0457e-01,  6.7606e-02,
        -3.1609e-01, -4.6059e-01, -2.0273e-01,  4.6852e-01, -1.6009e+00,
        -4.5330e-01, -1.9684e-01,  2.0119e-01, -1.7361e-01, -1.0069e-01,
         3.7192e-01, -4.9373e-02,  2.2970e-01, -2.1218e-01, -4.1316e-02,
        -2.7262e-01,  3.8110e-01,  4.9122e-01, -2.0979e-01, -5.1443e-01,
        -3.5561e-01,  2.2332e-01, -2.4093e-01, -3.2059e-03,  4.0946e-02,
         2.2596e-03,  2.5204e-01, -1.7363e-02,  1.3675e-01, -1.3253e+00,
        -1.4328e-01, -4.2798e-02,  1.5638e-01, -3.2210e-01, -1.0234e-01,
        -2.9886e-01,  1.5377e-01, -3.6970e-01,  2.0354e-01, -2.2908e-01,
         4.0350e-01,  3.7391e-01,  4.7247e-01,  4.1236e-02, -2.6902e-01,
        -2.5978e-01, -1.1549e-01, -5.9670e-02, -3.8557e-01, -1.7710e-01,
         5.7339e-02, -1.3457e-02,  1.4074e-01,  5.2561e-02, -8.4316e-02,
         1.8599e-01,  5.2734e-02,  2.2400e-01, -1.5718e-02, -1.9844e-01,
        -7.7889e-01,  8.4829e-02,  2.6423e-01,

In [0]:
# LABELS
LABEL.vocab.itos

['<unk>', '0', '1']