<a href="https://colab.research.google.com/github/mahaanand7/transformers/blob/master/text_classification_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install transformers[sentencepiece]

In [3]:
from transformers import pipeline
import textwrap
wrapper = textwrap.TextWrapper(width=80,break_long_words=False,break_on_hyphens=False)

# Classifying a sentence (Sentiment Analysis)

In [8]:
sentence = 'if you want to enjoy a feel good movie this movie is not your choice'
classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')
c = classifier(sentence)
print('\nSentence:')
print(wrapper.fill(sentence))
print(f"\nThis sentence is classified with a {c[0]['label']} sentiment")


Sentence:
if you want to enjoy a feel good movie this movie is not your choice

This sentence is classified with a NEGATIVE sentiment
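
The classifier returns a list with one dict per input, and each dict carries a `score` next to the `label`; the cell above prints only the label. Below is a minimal sketch of reporting both. The helper `describe_prediction`, its `threshold` default, and the sample score are assumptions made for illustration, not values produced by the model.

```python
def describe_prediction(prediction, threshold=0.75):
    """Format one {'label': ..., 'score': ...} dict as a readable line.

    `threshold` is an arbitrary cut-off chosen for this sketch.
    """
    label, score = prediction["label"], prediction["score"]
    certainty = "confident" if score >= threshold else "uncertain"
    return f"{label} ({score:.1%}, {certainty})"


# Stand-in for classifier(sentence); the score here is invented.
sample_output = [{"label": "NEGATIVE", "score": 0.9876}]
summary = describe_prediction(sample_output[0])
print(summary)
```

With a loaded pipeline, `describe_prediction(c[0])` would format the real result in the same way.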


# Classifying each word in a Sentence (Named Entity Recognition)

In [10]:
sentence = "Govt of India - parliment was situvated at New Delhi, Where the Prime Minister attends the session"
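# aggregation_strategy='simple' merges word pieces into whole entities (it supersedes the deprecated grouped_entities=True)
ner = pipeline('token-classification', model='dbmdz/bert-large-cased-finetuned-conll03-english', aggregation_strategy='simple')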
ners = ner(sentence)

print('\nSentence:')
print(wrapper.fill(sentence))
print('\n')

for n in ners:
  print(f"{n['word']} - > {n['entity_group']}")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Sentence:
Govt of India - parliment was situvated at New Delhi, Where the Prime Minister
attends the session


Govt of - > ORG
India - > LOC
New Delhi - > LOC
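
Each grouped entity is a dict with at least `word` and `entity_group` keys, so the results are easy to reorganize by entity type. The sketch below does that on a hand-written copy of the output above; `group_by_entity` is a hypothetical helper, and the extra keys the pipeline also returns (such as scores and offsets) are left out.

```python
from collections import defaultdict


def group_by_entity(entities):
    """Collect entity words under their entity_group label, keeping input order."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-copied from the printed output above, not a live pipeline result.
sample_entities = [
    {"entity_group": "ORG", "word": "Govt of"},
    {"entity_group": "LOC", "word": "India"},
    {"entity_group": "LOC", "word": "New Delhi"},
]

by_type = group_by_entity(sample_entities)
for entity_type, words in by_type.items():
    print(f"{entity_type}: {', '.join(words)}")
```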


# Answering a question given a context


In [22]:
context ='''

Evidence before this study
Parkinson’s disease is a neurodegenerative disorder heterogeneous in both its clinical manifestations and progression, which serves as possible evidence for the existence of disease subtypes. We searched PubMed and Google Scholar for original articles published in English with the terms “Parkinson’s disease” and “subtypes” or “clusters” published up until Aug 15, 2020. Previous studies have identified disease subtypes using data-driven approaches, but replication across cohorts has not always been possible, and translation of these subtypes into clinical practice has not yet been achieved. Models developed to date have several key limitations including use of cross-sectional data to define the subtypes, model assumptions that each subtype follows a fixed progression, and not accounting for both positive and negative effects of symptomatic therapies for Parkinson’s disease. Allowing for heterogeneity not only in disease manifestations but also progression, and accounting for medication effects is critical towards developing accurate disease models that can be used in clinical and research settings.'''

question = 'Please list down the key limitations of this study ?'

print('Text:')

print(wrapper.fill(context))
print('\nQuestion')
print(question)


Text:
  Evidence before this study Parkinson’s disease is a neurodegenerative disorder
heterogeneous in both its clinical manifestations and progression, which serves
as possible evidence for the existence of disease subtypes. We searched PubMed
and Google Scholar for original articles published in English with the terms
“Parkinson’s disease” and “subtypes” or “clusters” published up until Aug 15,
2020. Previous studies have identified disease subtypes using data-driven
approaches, but replication across cohorts has not always been possible, and
translation of these subtypes into clinical practice has not yet been achieved.
Models developed to date have several key limitations including use of
cross-sectional data to define the subtypes, model assumptions that each subtype
follows a fixed progression, and not accounting for both positive and negative
effects of symptomatic therapies for Parkinson’s disease. Allowing for
heterogeneity not only in disease manifestations but also progress

In [23]:
qa = pipeline('question-answering', model='distilbert-base-cased-distilled-squad')

print('\nQuestion:')
print(question + '\n')
print('Answer:')

a = qa(context=context, question=question)
a['answer']


Question:
Please list down the key limitations of this study ?

Answer:


'cross-sectional data to define the subtypes'
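
This pipeline is extractive: it returns one span copied from the context, which is why only the first limitation comes back even though the question asks for a list. Alongside `answer`, the result dict holds `score` plus `start` and `end` character offsets into the context. The sketch below uses those offsets to show the answer in place. The helper `show_in_context` is hypothetical, and the sample result is built by hand from a shortened context rather than taken from the model.

```python
def show_in_context(context, result, window=30):
    """Return the answer wrapped in [[...]] with up to `window` characters either side."""
    start, end = result["start"], result["end"]
    left = context[max(0, start - window):start]
    right = context[end:end + window]
    return f"...{left}[[{context[start:end]}]]{right}..."


# Shortened context and a hand-built result dict, for illustration only.
sample_context = (
    "Models developed to date have several key limitations including use of "
    "cross-sectional data to define the subtypes, model assumptions that each "
    "subtype follows a fixed progression."
)
answer = "cross-sectional data to define the subtypes"
start = sample_context.find(answer)
sample_result = {"answer": answer, "start": start, "end": start + len(answer)}

snippet = show_in_context(sample_context, sample_result)
print(snippet)
```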

# Summarizing a text

In [25]:
review ='''

Evidence before this study Parkinson’s disease is a neurodegenerative disorder
heterogeneous in both its clinical manifestations and progression, which serves
as possible evidence for the existence of disease subtypes. We searched PubMed
and Google Scholar for original articles published in English with the terms
“Parkinson’s disease” and “subtypes” or “clusters” published up until Aug 15,
2020. Previous studies have identified disease subtypes using data-driven
approaches, but replication across cohorts has not always been possible, and
translation of these subtypes into clinical practice has not yet been achieved.
Models developed to date have several key limitations including use of
cross-sectional data to define the subtypes, model assumptions that each subtype
follows a fixed progression, and not accounting for both positive and negative
effects of symptomatic therapies for Parkinson’s disease. Allowing for
heterogeneity not only in disease manifestations but also progression, and
accounting for medication effects is critical towards developing accurate
disease models that can be used in clinical and research settings. '''

print( '\nOriginal Text:\n')

print(wrapper.fill(review))
summarize = pipeline('summarization', model='sshleifer/distilbart-cnn-12-6')

summarized_text = summarize(review)[0]['summary_text']
print('\nSummarized text:')

print(wrapper.fill(summarized_text))


Original Text:

  Evidence before this study Parkinson’s disease is a neurodegenerative disorder
heterogeneous in both its clinical manifestations and progression, which serves
as possible evidence for the existence of disease subtypes. We searched PubMed
and Google Scholar for original articles published in English with the terms
“Parkinson’s disease” and “subtypes” or “clusters” published up until Aug 15,
2020. Previous studies have identified disease subtypes using data-driven
approaches, but replication across cohorts has not always been possible, and
translation of these subtypes into clinical practice has not yet been achieved.
Models developed to date have several key limitations including use of
cross-sectional data to define the subtypes, model assumptions that each subtype
follows a fixed progression, and not accounting for both positive and negative
effects of symptomatic therapies for Parkinson’s disease. Allowing for
heterogeneity not only in disease manifestations but al

# Fill in the Blanks

In [29]:
sentence = 'India is always <mask> with neighbouring countries'

mask = pipeline('fill-mask',model='distilroberta-base')

masks = mask(sentence)

for m in masks:
  print(m['sequence'])

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


India is always competitive with neighbouring countries
India is always friendly with neighbouring countries
India is always competing with neighbouring countries
India is always negotiating with neighbouring countries
India is always cooperating with neighbouring countries
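
The literal `<mask>` in the sentence is the mask token of RoBERTa-style tokenizers; BERT-style checkpoints use `[MASK]` instead, and once a pipeline is loaded the right token can be read from `mask.tokenizer.mask_token`. A small sketch of keeping the sentence independent of the checkpoint follows. The helper `with_mask` and the `{blank}` placeholder are assumptions made for this example.

```python
def with_mask(template, mask_token, placeholder="{blank}"):
    """Swap the single placeholder in `template` for a model-specific mask token."""
    if template.count(placeholder) != 1:
        raise ValueError("template must contain exactly one placeholder")
    return template.replace(placeholder, mask_token)


template = "India is always {blank} with neighbouring countries"

roberta_style = with_mask(template, "<mask>")  # e.g. distilroberta-base
bert_style = with_mask(template, "[MASK]")     # e.g. BERT-style checkpoints

print(roberta_style)
print(bert_style)
```

In the notebook this would become `mask(with_mask(template, mask.tokenizer.mask_token))`.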


# Translation (English to German)

In [32]:
english = ''' Singapore airlines is my favourite airline'''

translator = pipeline('translation_en_to_de', model='t5-base')

german = translator(english)

print('\nEnglish')
print(english)

print('\nGerman')
print(german[0]['translation_text'])

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.



English
 Singapore airlines is my favourite airline

German
Singapore Airlines ist meine Lieblingsfluggesellschaft
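
The task name passed to `pipeline` encodes the language pair. The original T5 checkpoints were trained on translation from English to German, French, and Romanian, so another target such as Tamil would need a different checkpoint. The sketch below builds the task name and rejects unsupported targets early. `T5_TARGETS` and `t5_translation_task` are hypothetical names introduced here, and the list of targets is stated as an assumption about `t5-base`.

```python
# Assumed target languages for the original T5 checkpoints (English source).
T5_TARGETS = {"de": "German", "fr": "French", "ro": "Romanian"}


def t5_translation_task(target_code):
    """Return the pipeline task name for an English-to-<target> translation."""
    if target_code not in T5_TARGETS:
        supported = ", ".join(sorted(T5_TARGETS))
        raise ValueError(
            f"'{target_code}' is not an assumed t5-base target; choose one of: {supported}"
        )
    return f"translation_en_to_{target_code}"


task = t5_translation_task("de")
print(task)
```

The returned string can be passed straight to `pipeline(task, model='t5-base')`.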
