<a href="https://colab.research.google.com/github/navneetkrc/Flair_SOTA_NLP/blob/master/2_PreTrained_Model_for_Tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#1. NLP Base Types and tasks

## Install Dependencies

In [0]:
!pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu92/torch_nightly.html
!pip install flair

In [0]:
from flair.data import Sentence
from flair.models import SequenceTagger

##Check some basic tasks

In [5]:
# make a sentence
sentence = Sentence('I love Amsterdam .')

# load the NER tagger
tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)

[Sentence: "I love Amsterdam ." - 4 Tokens]

Done! The Sentence now has entity annotations. Print the sentence to see what the tagger found.



In [6]:
print(sentence)
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

Sentence: "I love Amsterdam ." - 4 Tokens
The following NER tags are found:
LOC-span [3]: "Amsterdam"


This should print:

Sentence: "I love Amsterdam ." - 4 Tokens

The following NER tags are found:

LOC-span [3]: "Amsterdam"



---



##Basic1- Creating a Sentence

In [7]:
# The Sentence object holds a sentence that we may want to embed or tag
from flair.data import Sentence

# Make a sentence object by passing a whitespace tokenized string
sentence = Sentence('The grass is green .')

# Print the object to see what's in there
print(sentence)

#expected output-> Sentence: "The grass is green ." - 5 Tokens

Sentence: "The grass is green ." - 5 Tokens


In [8]:
# The print-out tells us that the sentence consists of 5 tokens. You can access a token either by its token id (1-based) or by its index (0-based):

# using the token id
print(sentence.get_token(4))

# using the index itself
print(sentence[3])

Token: 4 green
Token: 4 green


In both cases it prints:

Token: 4 green

Token: 4 green
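
The relationship between token id and index can be sketched in plain Python. This is only an illustration with a list standing in for the sentence; `get_token_by_id` is a hypothetical helper, not part of Flair.

```python
# Plain-Python stand-in for a whitespace-tokenized sentence (not Flair's Sentence class)
tokens = 'The grass is green .'.split()

def get_token_by_id(tokens, token_id):
    """Return the token with the given 1-based id (hypothetical helper)."""
    if not 1 <= token_id <= len(tokens):
        raise IndexError(f'token id {token_id} is out of range')
    # ids start at 1, list indices start at 0
    return tokens[token_id - 1]

# id 4 and index 3 refer to the same token
print(get_token_by_id(tokens, 4))  # green
print(tokens[3])                   # green
```

This off-by-one difference is why `get_token(4)` and `sentence[3]` return the same token above.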

In [9]:
for token in sentence:
    print(token)

Token: 1 The
Token: 2 grass
Token: 3 is
Token: 4 green
Token: 5 .


This should print:

Token: 1 The

Token: 2 grass

Token: 3 is

Token: 4 green

Token: 5 .

##Tokenization

In some use cases, your text may not already be tokenized. For this case, Flair includes a simple tokenizer based on the lightweight segtok library.

Simply use the use_tokenizer flag when instantiating your Sentence with an untokenized string:



In [10]:
from flair.data import Sentence

# Make a sentence object by passing an untokenized string and the 'use_tokenizer' flag
sentence = Sentence('The grass is green.', use_tokenizer=True)

# Print the object to see what's in there
print(sentence)

Sentence: "The grass is green ." - 5 Tokens


This should print:

Sentence: "The grass is green ." - 5 Tokens
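
To make the effect of tokenization concrete, here is a minimal regex-based sketch that separates punctuation from words. It is an assumption-laden stand-in, not segtok's algorithm: segtok handles many more cases (abbreviations, URLs, and so on), and `simple_tokenize` is a hypothetical name.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize('The grass is green.')
print(tokens)       # ['The', 'grass', 'is', 'green', '.']
print(len(tokens))  # 5
```

As in the Flair output, the final period becomes its own token, which gives 5 tokens in total.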

###Adding Tags to tokens

In [11]:
# add a tag to a word in the sentence
sentence[3].add_tag('ner', 'color')

# print the sentence with all tags of this type
print(sentence.to_tagged_string())

The grass is green <color> .


This should print:

The grass is green <color> .

Each tag is an instance of the Label class, which holds a value and a score indicating confidence. You can print both like this:



In [12]:
from flair.data import Label

tag: Label = sentence[3].get_tag('ner')

print(f'"{sentence[3]}" is tagged as "{tag.value}" with confidence score "{tag.score}"')

"Token: 4 green" is tagged as "color" with confidence score "1.0"


This should print:

"Token: 4 green" is tagged as "color" with confidence score "1.0"

Our color tag has a score of 1.0 because we added it manually. If a tag is predicted by a sequence labeler, the score instead indicates the classifier's confidence.



###Adding Labels to Sentences

A Sentence can have one or more labels, which can be used, for example, in text classification tasks. The example below adds the label 'sports' to a sentence, marking it as belonging to the sports category.

In [0]:
sentence = Sentence('France is the current world cup winner.')

# add a label to a sentence
sentence.add_label('sports')

# a sentence can also belong to multiple classes
sentence.add_labels(['sports', 'world cup'])

# you can also set the labels while initializing the sentence
sentence = Sentence('France is the current world cup winner.', labels=['sports', 'world cup'])


Labels are also of the Label class. So, you can print a sentence's labels like this:



In [14]:
sentence = Sentence('France is the current world cup winner.', labels=['sports', 'world cup'])

print(sentence)
for label in sentence.labels:
    print(label)

Sentence: "France is the current world cup winner." - 7 Tokens
sports (1.0)
world cup (1.0)


This should print:

sports (1.0)

world cup (1.0)

**This indicates that the sentence belongs to these two classes, each with confidence score 1.0.**





---

**In the next section we will look at using pre-trained models to tag our text.**

#2. Use of Pretrained model to tag your text

##Tutorial 2: Tagging your Text
Here, we show how to use our pre-trained models to tag your text data.

##Tagging with Pre-Trained Sequence Tagging Models
Let's use a pre-trained model for named entity recognition (NER). 

This model was trained on the English CoNLL-03 task and can recognize 4 different entity types: person (PER), location (LOC), organization (ORG) and miscellaneous (MISC).

In [0]:
from flair.models import SequenceTagger

tagger = SequenceTagger.load('ner')

All you need to do is use the predict() method of the tagger on a sentence. 

This will add predicted tags to the tokens in the sentence. Let's use a sentence with two named entities:



In [16]:
sentence = Sentence('George Washington went to Washington .')

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())

George <B-PER> Washington <E-PER> went to Washington <S-LOC> .


This should print:

George <B-PER> Washington <E-PER> went to Washington <S-LOC> . 
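
The prefixes in the output follow a BIOES-style scheme: B- begins a multi-token entity, E- ends it, and S- marks a single-token entity. As a rough illustration of how such per-token tags can be grouped into entities, here is a plain-Python sketch. `bioes_to_spans` is a hypothetical helper written for this notebook, not Flair's implementation, and the I- (inside) and O (outside) tags it handles are assumed from the usual BIOES convention.

```python
def bioes_to_spans(tokens, tags):
    """Group BIOES-tagged tokens into (label, [token ids], text) tuples.

    Token ids are 1-based, matching the token ids printed by the notebook.
    """
    spans = []
    current_ids, current_label = [], None
    for token_id, tag in enumerate(tags, start=1):
        if tag == 'O':
            # outside any entity: discard any unfinished span
            current_ids, current_label = [], None
            continue
        prefix, label = tag.split('-', 1)
        if prefix == 'S':
            # single-token entity
            spans.append((label, [token_id], tokens[token_id - 1]))
            current_ids, current_label = [], None
        elif prefix == 'B':
            # start of a multi-token entity
            current_ids, current_label = [token_id], label
        elif prefix in ('I', 'E') and current_label == label:
            current_ids.append(token_id)
            if prefix == 'E':
                # end of the entity: emit the collected span
                text = ' '.join(tokens[i - 1] for i in current_ids)
                spans.append((label, current_ids, text))
                current_ids, current_label = [], None
        else:
            # malformed sequence (e.g. I- without a preceding B-): drop it
            current_ids, current_label = [], None
    return spans

tokens = 'George Washington went to Washington .'.split()
tags = ['B-PER', 'E-PER', 'O', 'O', 'S-LOC', 'O']
for span in bioes_to_spans(tokens, tags):
    print(span)
# ('PER', [1, 2], 'George Washington')
# ('LOC', [5], 'Washington')
```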


###Getting Annotated Spans
Many sequence labeling methods annotate spans that consist of multiple words, such as "George Washington" in our example sentence. You can directly get such spans in a tagged sentence like this:



In [17]:
for entity in sentence.get_spans('ner'):
    print(entity)

PER-span [1,2]: "George Washington"
LOC-span [5]: "Washington"


This should print:

PER-span [1,2]: "George Washington"

LOC-span [5]: "Washington"

This indicates that "George Washington" is a person (PER) and "Washington" is a location (LOC). Each such Span has a text, a tag value, its position in the sentence and a score that indicates how confident the tagger is that the prediction is correct. You can also get additional information, such as the position offsets of each entity in the sentence, by calling:


In [19]:
print(sentence.to_dict(tag_type='ner'))

{'text': 'George Washington went to Washington .', 'labels': [], 'entities': [{'text': 'George Washington', 'start_pos': 0, 'end_pos': 17, 'type': 'PER', 'confidence': 0.9999276995658875}, {'text': 'Washington', 'start_pos': 26, 'end_pos': 36, 'type': 'LOC', 'confidence': 0.9988662004470825}]}




This should print:

{'text': 'George Washington went to Washington .',
    'entities': [      
        {'text': 'George Washington', 'start_pos': 0, 'end_pos': 17, 'type': 'PER', 'confidence': 0.999},        
        {'text': 'Washington', 'start_pos': 26, 'end_pos': 36, 'type': 'LOC', 'confidence': 0.998}
    ]}
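
The `start_pos` and `end_pos` values are character offsets into the sentence text. The sketch below shows one way such offsets can be derived for a whitespace-tokenized string. `token_offsets` is a hypothetical helper for illustration only and makes no claim about how Flair computes them internally.

```python
def token_offsets(text):
    """Return (token, start_pos, end_pos) for each whitespace-separated token."""
    offsets, cursor = [], 0
    for token in text.split():
        # search from the cursor so repeated words get distinct positions
        start = text.index(token, cursor)
        end = start + len(token)
        offsets.append((token, start, end))
        cursor = end
    return offsets

text = 'George Washington went to Washington .'
offsets = token_offsets(text)

# span over tokens 1-2: start of the first token, end of the second
print(offsets[0][1], offsets[1][2])  # 0 17
# span over token 5 only
print(offsets[4][1], offsets[4][2])  # 26 36
```

The end offset is exclusive, so `text[0:17]` gives 'George Washington' and `text[26:36]` gives 'Washington'.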

##Tagging Multilingual Text
If you have text in many languages (such as English and German), you can use our new multilingual models:

I plan to try the same approach on a Hinglish (Hindi + English) dataset as well.


In [21]:
# load model
tagger = SequenceTagger.load('pos-multi')

# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort kaufte er einen Hut .')

# predict PoS tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())

George <PROPN> Washington <PROPN> went <VERB> to <ADP> Washington <PROPN> . <PUNCT> Dort <ADV> kaufte <VERB> er <PRON> einen <DET> Hut <NOUN> . <PUNCT>


This should print:

George <PROPN> Washington <PROPN> went <VERB> to <ADP> Washington <PROPN> . <PUNCT>

Dort <ADV> kaufte <VERB> er <PRON> einen <DET> Hut <NOUN> . <PUNCT>

So, both 'went' and 'kaufte' are identified as VERBs in these sentences.



##Experimental: Semantic Frame Detection
For English, we provide a pre-trained model that detects semantic frames in text, trained using PropBank 3.0 frames. This provides a sort of word sense disambiguation for frame-evoking words, and we are curious what researchers might do with this.

Here's an example:


In [23]:
# load model
tagger = SequenceTagger.load('frame')

# make two English sentences
sentence_1 = Sentence('George returned to Berlin to return his hat .')
sentence_2 = Sentence('He had a look at different hats .')

# predict semantic frame tags
tagger.predict(sentence_1)
tagger.predict(sentence_2)

# print sentence with predicted tags
print(sentence_1.to_tagged_string())
print(sentence_2.to_tagged_string())

George returned <return.01> to Berlin to return <return.02> his hat .
He had <have.LV> a look <look.01> at different hats .


This should print:

George returned <return.01> to Berlin to return <return.02> his hat .

He had <have.LV> a look <look.01> at different hats .





As we can see, the frame detector makes a distinction in sentence 1 between two different meanings of the word 'return'. 'return.01' means returning to a location, while 'return.02' means giving something back.

Similarly, in sentence 2 the frame detector finds a light verb construction in which 'have' is the light verb and 'look' is a frame evoking word.



##Tagging a List of Sentences
Often, you may want to tag an entire text corpus. In this case, you need to split the corpus into sentences and pass a list of Sentence objects to the .predict() method.

For instance, you can use the sentence splitter of segtok to split your text:


In [24]:
# your text of many sentences
text = "This is a sentence. This is another sentence. I love Berlin."

# use a library to split into sentences
from segtok.segmenter import split_single
sentences = [Sentence(sent, use_tokenizer=True) for sent in split_single(text)]

# predict tags for list of sentences
tagger: SequenceTagger = SequenceTagger.load('ner')
tagger.predict(sentences)


[Sentence: "This is a sentence ." - 5 Tokens,
 Sentence: "This is another sentence ." - 5 Tokens,
 Sentence: "I love Berlin ." - 4 Tokens]


Using the mini_batch_size parameter of the .predict() method, you can set the size of mini batches passed to the tagger. Depending on your resources, you might want to play around with this parameter to optimize speed.
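
To illustrate what a mini batch size means here, the following plain-Python sketch splits a list into consecutive chunks. It is not Flair's internal batching code: `mini_batches` is a hypothetical helper, and the default of 32 is an arbitrary value chosen for the example.

```python
def mini_batches(items, mini_batch_size=32):
    """Yield consecutive slices of at most mini_batch_size items."""
    if mini_batch_size < 1:
        raise ValueError('mini_batch_size must be at least 1')
    for start in range(0, len(items), mini_batch_size):
        yield items[start:start + mini_batch_size]

sentences = ['This is a sentence.', 'This is another sentence.', 'I love Berlin.']
for batch in mini_batches(sentences, mini_batch_size=2):
    print(len(batch), batch)
# 2 ['This is a sentence.', 'This is another sentence.']
# 1 ['I love Berlin.']
```

Larger batches usually mean fewer passes through the model but more memory per pass, which is why the best value depends on your hardware.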



##Tagging with Pre-Trained Text Classification Models
Let's use a pre-trained model for detecting positive or negative comments. This model was trained over the IMDB dataset and can recognize positive and negative sentiment in English text.


In [25]:

from flair.models import TextClassifier

classifier = TextClassifier.load('en-sentiment')

# All you need to do is use the predict() method of the classifier on a sentence.
# This will add the predicted label to the sentence. Let's use a sentence with negative sentiment:

sentence = Sentence('This film hurts. It is so bad that I am confused.')

# predict the sentiment label
classifier.predict(sentence)

# print sentence with predicted labels
print(sentence.labels)

#This should print:
#[NEGATIVE (1.0)]

[NEGATIVE (1.0)]




---

**In the next section we will look at using word embeddings to embed our text.**

#3. Use of word Embeddings

#4. Using Bert, Elmo and Flair Embeddings

#5. Using Document Embeddings

#6. Loading your own Corpus

#7. Training your own Model

#8. Optimizing our models

#9. Training your own Flair Embeddings



---

