<a href="https://colab.research.google.com/github/navneetkrc/Colab_fastai/blob/master/1_NLP_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. NLP Base Types and Tasks

## Install Dependencies

In [0]:
!pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu92/torch_nightly.html
!pip install flair

In [0]:
from flair.data import Sentence
from flair.models import SequenceTagger

## Check Some Basic Tasks

In [14]:
# make a sentence
sentence = Sentence('I love Amsterdam .')

# load the NER tagger
tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)

[Sentence: "I love Amsterdam ." - 4 Tokens]

Done! The Sentence now has entity annotations. Print the sentence to see what the tagger found.



In [15]:
print(sentence)
print('The following NER tags are found:')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

Sentence: "I love Amsterdam ." - 4 Tokens
The following NER tags are found:
LOC-span [3]: "Amsterdam"


This should print:

Sentence: "I love Amsterdam ." - 4 Tokens

The following NER tags are found:

LOC-span [3]: "Amsterdam"



---



## Basic 1: Creating a Sentence

In [16]:
# A Sentence object holds a sentence that we may want to embed or tag
from flair.data import Sentence

# Make a Sentence object by passing a whitespace-tokenized string
sentence = Sentence('The grass is green .')

# Print the object to see what's in there
print(sentence)

# Expected output: Sentence: "The grass is green ." - 5 Tokens

Sentence: "The grass is green ." - 5 Tokens


In [17]:
# The printout tells us that the sentence consists of 5 tokens.
# You can access a token by its token id (which starts at 1) or by its list index (which starts at 0):

# using the token id
print(sentence.get_token(4))

# using the index itself
print(sentence[3])

Token: 4 green
Token: 4 green


Both calls print the same token:

Token: 4 green

Token: 4 green

In [7]:
for token in sentence:
    print(token)

Token: 1 The
Token: 2 grass
Token: 3 is
Token: 4 green
Token: 5 .


This should print:

Token: 1 The

Token: 2 grass

Token: 3 is

Token: 4 green

Token: 5 .
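
The difference between token ids and list indices is easy to trip over: ids start at 1, while indices start at 0. The following is a minimal pure-Python sketch of that relationship. `ToyToken` and `ToySentence` are hypothetical stand-ins written for illustration only; they are not Flair's classes and make no claim about how Flair implements this internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyToken:
    """Hypothetical stand-in for a token: a 1-based id plus its text."""
    idx: int
    text: str

    def __str__(self) -> str:
        return f"Token: {self.idx} {self.text}"


class ToySentence:
    """Hypothetical stand-in for a sentence built from a whitespace-tokenized string."""

    def __init__(self, text: str) -> None:
        # Token ids are assigned starting at 1, in reading order.
        self.tokens: List[ToyToken] = [
            ToyToken(i, word) for i, word in enumerate(text.split(), start=1)
        ]

    def get_token(self, token_id: int) -> ToyToken:
        # Lookup by 1-based token id.
        return self.tokens[token_id - 1]

    def __getitem__(self, index: int) -> ToyToken:
        # Lookup by 0-based position, like a Python list.
        return self.tokens[index]

    def __len__(self) -> int:
        return len(self.tokens)


toy = ToySentence("The grass is green .")
print(toy.get_token(4))  # Token: 4 green
print(toy[3])            # Token: 4 green
```

Both lookups return the same object because id 4 and index 3 refer to the same position in the underlying list.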

## Tokenization

In some use cases, your text may not be tokenized yet. For this case, Flair includes a simple tokenizer built on the lightweight segtok library.

To use it, set the use_tokenizer flag when instantiating a Sentence with an untokenized string:



In [8]:
from flair.data import Sentence

# Make a sentence object by passing an untokenized string and the 'use_tokenizer' flag
sentence = Sentence('The grass is green.', use_tokenizer=True)

# Print the object to see what's in there
print(sentence)

Sentence: "The grass is green ." - 5 Tokens


This should print:

Sentence: "The grass is green ." - 5 Tokens
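
To make concrete what tokenization does to the string above, here is a minimal sketch that separates words from punctuation with a single regular expression. `toy_tokenize` is a hypothetical helper written for this illustration; it is not segtok's algorithm, and a real tokenizer handles many cases this rule ignores.

```python
import re
from typing import List


def toy_tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation tokens.

    Assumption: a simplified regex rule for illustration only,
    not the rule set used by segtok or Flair.
    """
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = toy_tokenize("The grass is green.")
print(tokens)            # ['The', 'grass', 'is', 'green', '.']
print(" ".join(tokens))  # The grass is green .
```

The joined result is the whitespace-tokenized form that a Sentence expects when no tokenizer is used.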

### Adding Tags to Tokens

In [9]:
# add a tag to a word in the sentence
sentence[3].add_tag('ner', 'color')

# print the sentence with all tags of this type
print(sentence.to_tagged_string())

The grass is green <color> .


This should print:

The grass is green <color> .

Each tag is an instance of the Label class, which holds a value and a score indicating confidence. You can print both like this:



In [10]:
from flair.data import Label

tag: Label = sentence[3].get_tag('ner')

print(f'"{sentence[3]}" is tagged as "{tag.value}" with confidence score "{tag.score}"')

"Token: 4 green" is tagged as "color" with confidence score "1.0"


This should print:

"Token: 4 green" is tagged as "color" with confidence score "1.0"

Our color tag has a score of 1.0 because we added it manually. If a tag is predicted by a sequence labeler, the score indicates the classifier's confidence.
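
The idea of a value paired with a confidence score can be sketched in a few lines of plain Python. `ToyLabel` is a hypothetical stand-in for illustration, not Flair's Label class, and the 0.87 score below is a made-up number standing in for a model prediction.

```python
from dataclasses import dataclass


@dataclass
class ToyLabel:
    """Hypothetical stand-in for a label: a value plus a confidence score."""
    value: str
    score: float = 1.0  # assumption of this sketch: manual labels default to 1.0


manual = ToyLabel("color")         # added by hand, so it gets the default score
predicted = ToyLabel("LOC", 0.87)  # made-up score standing in for a prediction

for label in (manual, predicted):
    print(f'"{label.value}" with confidence score "{label.score}"')
```

Keeping the score next to the value lets downstream code treat hand-made and predicted tags uniformly, for example by filtering on a confidence threshold.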



### Adding Labels to Sentences

A Sentence can have one or more labels, which can be used, for example, in text classification tasks. The example below adds the label 'sports' to a sentence, marking it as belonging to the sports category.

In [0]:
sentence = Sentence('France is the current world cup winner.')

# add a label to a sentence
sentence.add_label('sports')

# a sentence can also belong to multiple classes
sentence.add_labels(['sports', 'world cup'])

# you can also set the labels while initializing the sentence
sentence = Sentence('France is the current world cup winner.', labels=['sports', 'world cup'])


Sentence labels are also instances of the Label class, so you can print a sentence's labels like this:



In [12]:
sentence = Sentence('France is the current world cup winner.', labels=['sports', 'world cup'])

print(sentence)
for label in sentence.labels:
    print(label)

Sentence: "France is the current world cup winner." - 7 Tokens
sports (1.0)
world cup (1.0)


This should print:

sports (1.0)

world cup (1.0)

**This indicates that the sentence belongs to these two classes, each with confidence score 1.0.**
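
One common way a multi-label classifier might consume such labels is as a 0/1 vector over a fixed label vocabulary. The sketch below shows that encoding in plain Python. The `multi_hot` helper and the three-entry vocabulary are hypothetical and chosen for this example; this is not something Flair requires you to write yourself.

```python
from typing import List


def multi_hot(labels: List[str], vocabulary: List[str]) -> List[int]:
    """Encode a sentence's labels as a 0/1 vector over a fixed label vocabulary."""
    present = set(labels)
    return [1 if name in present else 0 for name in vocabulary]


# Hypothetical label vocabulary for a small classification task.
vocabulary = ["politics", "sports", "world cup"]

print(multi_hot(["sports", "world cup"], vocabulary))  # [0, 1, 1]
print(multi_hot(["sports"], vocabulary))               # [0, 1, 0]
```

A sentence with two labels simply has two positions set to 1, which is what distinguishes multi-label from single-label classification.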





---

**In the next section, we will use a pre-trained model to tag our text.**

# 2. Using a Pre-trained Model to Tag Your Text

# 3. Using Word Embeddings

# 4. Using BERT, ELMo and Flair Embeddings

# 5. Using Document Embeddings

# 6. Loading Your Own Corpus

# 7. Training Your Own Model

# 8. Optimizing Your Models

# 9. Training Your Own Flair Embeddings



---

