<a href="https://colab.research.google.com/github/not-sid-29/transformers_huggingface/blob/main/1_Using_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing (NLP) - 1:<br>
## Introduction to Transformers and the `pipeline` function


- This notebook covers the most basic applications of transformer models to various language tasks, using the `pipeline` function from the Hugging Face `transformers` library.

In [None]:
!pip install -q datasets evaluate transformers

### Tasks included in NLP:<br>
- Sentiment Analysis
- Zero-Shot Classification
- Text Generation
- Mask Filling
- Named-Entity Recognition (NER)
- Question-Answering
- Summarization
- Language Translation

#### a. Sentiment Analysis: <br>
- Classifying the sentiment of a given sentence

In [None]:
from transformers import pipeline

pipe = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
pipe(["We won the Soccer game!", "The red team won the match by an enormous lead.", "The green team lost"])

[{'label': 'POSITIVE', 'score': 0.9998049139976501},
 {'label': 'POSITIVE', 'score': 0.9997949004173279},
 {'label': 'NEGATIVE', 'score': 0.9974135756492615}]
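
The `score` values above are probabilities. As a rough sketch of where such a number comes from, a classifier typically produces one raw score (logit) per label and passes them through a softmax. The logits below are made up for illustration, and the `softmax` helper is written here by hand; neither is taken from the model or from `transformers`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence, ordered as (NEGATIVE, POSITIVE).
labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.2, 4.3]

probs = softmax(logits)
# Report the most probable label together with its probability.
best = max(range(len(labels)), key=lambda i: probs[i])
result = {"label": labels[best], "score": probs[best]}
print(result)
```

A large gap between the two logits is what produces scores very close to 1, as in the outputs above.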

#### b. Zero-Shot Classification:<br>
- Classifies a text against candidate labels that you supply at inference time, without fine-tuning the model on those labels. This is useful for assigning labels to unlabeled data.

In [None]:
pipe = pipeline("zero-shot-classification")
pipe(
    "The presidential elections will take place in 2025",
    candidate_labels = ["politics", "military", "gaming"]
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.



{'sequence': 'The presidential elections will take place in 2025',
 'labels': ['politics', 'military', 'gaming'],
 'scores': [0.9754886627197266, 0.014444125816226006, 0.0100672272965312]}
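
The default model here is a natural language inference (NLI) model, so zero-shot classification can be pictured as follows: each candidate label is turned into a hypothesis sentence, the model scores how strongly the input entails each hypothesis, and the scores are normalized across labels. The sketch below assumes the template `"This example is {}."` and uses made-up entailment logits; it only mimics the shape of the output and does not call the model.

```python
import math

def softmax(values):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

sequence = "The presidential elections will take place in 2025"
candidate_labels = ["politics", "military", "gaming"]

# Each label becomes an NLI hypothesis that is paired with the sequence.
hypotheses = [f"This example is {label}." for label in candidate_labels]

# Made-up entailment logits, one per (sequence, hypothesis) pair.
entailment_logits = {"politics": 3.1, "military": -1.2, "gaming": -1.5}

scores = softmax([entailment_logits[label] for label in candidate_labels])

# Sort labels by descending score, as in the pipeline output.
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
result = {
    "sequence": sequence,
    "labels": [label for label, _ in ranked],
    "scores": [score for _, score in ranked],
}
print(result)
```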

- Using a different model in the pipeline:

In [None]:
classifier = pipeline("zero-shot-classification", model="sileod/deberta-v3-base-tasksource-nli")


In [None]:
classifier(
    "Cyberpunk 2077 was a movie-like phenomena!",
    candidate_labels = ["education", "gaming", "arts"]
)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'sequence': 'Cyberpunk 2077 was a movie-like phenomena!',
 'labels': ['gaming', 'arts', 'education'],
 'scores': [0.6532778143882751, 0.3049091398715973, 0.04181301221251488]}

#### c. Text-Generation:<br>
- Generates a continuation of a given prompt (e.g., the autocomplete feature on mobile keyboards).

In [None]:
text_gen = pipeline("text-generation", model="distilgpt2")


In [None]:
text_gen(
    "Once upon a time in the land of reeds lived a stray samurai",
    max_length = 99,
    num_return_sequences = 2,
    truncation=True
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Once upon a time in the land of reeds lived a stray samurai from the village of Jigga, where the samurai and others used their power to bring the samurai and the other race to their side and kill them, this war between the two nations started in the summer of 1958 when the Japanese Emperor Nobuko Todashi, his successor to Nobuko Todashi, invaded Japan against an enemy called the Tsudakugan, the largest country in the South Pacific. When'},
 {'generated_text': 'Once upon a time in the land of reeds lived a stray samurai on a farm and saw a boy walk around in a snowstorm. As the night turned upside down in the snow, a big shadow fell over him.\nThe samurai fell into the thick of a snowstorm when the fire broke out and he fell off the ground, so he did not suffer injuries and death.\nThe fire broke out at a very young age, and this caused a lot of damage and death and injury'}]
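
The two returned sequences differ, which suggests that each next token is sampled from a probability distribution rather than always taking the most likely one. A minimal sketch of that sampling step follows, assuming a toy four-token vocabulary with made-up logits; the `sample_next_token` helper and its `temperature` argument are illustrative and not the library's implementation.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample one token from a softmax over temperature-scaled logits."""
    tokens = list(logits)
    scaled = [logits[token] / temperature for token in tokens]
    peak = max(scaled)
    # Unnormalized softmax weights; random.choices normalizes them itself.
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token logits after the prompt "... lived a stray samurai".
next_token_logits = {" from": 2.0, " on": 1.8, " who": 1.5, " named": 1.0}

rng = random.Random(0)  # fixed seed so the run is repeatable
continuations = [sample_next_token(next_token_logits, rng=rng) for _ in range(5)]
print(continuations)
```

Lowering `temperature` concentrates the distribution on the highest-scoring token, while raising it makes the samples more varied.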

#### d. Mask Filling:<br>
- Predicts the most likely words for a masked position (`<mask>`) in a sentence.

In [None]:
mask = pipeline("fill-mask")

No model was supplied, defaulted to distilbert/distilroberta-base and revision ec58a5b (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
mask(
    "I love to study about <mask> in my university",
    top_k=3
)

[{'score': 0.06804715842008591,
  'token': 10561,
  'token_str': ' philosophy',
  'sequence': 'I love to study about philosophy in my university'},
 {'score': 0.04707271605730057,
  'token': 35638,
  'token_str': ' sociology',
  'sequence': 'I love to study about sociology in my university'},
 {'score': 0.045197565108537674,
  'token': 17759,
  'token_str': ' physics',
  'sequence': 'I love to study about physics in my university'}]
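
The scores above are low because many words fit the masked position. Assuming each score is a probability over the model's whole vocabulary, the short check below adds up the three scores copied from the output to see how much probability the top 3 candidates cover.

```python
# Tokens and scores copied from the three fill-mask results above.
results = [
    {"token_str": " philosophy", "score": 0.06804715842008591},
    {"token_str": " sociology", "score": 0.04707271605730057},
    {"token_str": " physics", "score": 0.045197565108537674},
]

# Probability covered by the top 3 candidates, and what is left over.
top_k_mass = sum(r["score"] for r in results)
remaining_mass = 1.0 - top_k_mass

print(f"top-3 probability mass: {top_k_mass:.3f}")
print(f"spread over all other tokens: {remaining_mass:.3f}")
```

The top 3 candidates account for only about 16% of the probability, so raising `top_k` would surface many other plausible completions.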

#### e. Named-Entity Recognition:<br>
- Finds the entities (such as persons, locations, or organizations) that occur in the input text.

In [None]:
ner_pipe = pipeline("ner", grouped_entities=True)


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).





In [None]:
ner_pipe(
    "My name is Jack and I work as a Machine Learning Engineer, I want to travel to Tokyo,Japan"
)

[{'entity_group': 'PER',
  'score': 0.9991054,
  'word': 'Jack',
  'start': 11,
  'end': 15},
 {'entity_group': 'MISC',
  'score': 0.61482406,
  'word': 'Engineer',
  'start': 49,
  'end': 57},
 {'entity_group': 'LOC',
  'score': 0.9996711,
  'word': 'Tokyo',
  'start': 79,
  'end': 84},
 {'entity_group': 'LOC',
  'score': 0.99974984,
  'word': 'Japan',
  'start': 85,
  'end': 90}]

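**Note**:<br>
- `grouped_entities` param: merges adjacent tokens that belong to the same entity into a single group, so each result is a whole entity rather than an individual token. Here, only `Engineer` from "Machine Learning Engineer" was tagged (as `MISC`, with a low score of about 0.61); the grouping did not shorten the phrase. Recent versions of `transformers` deprecate this parameter in favor of `aggregation_strategy`.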
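
To make the idea of grouping concrete, here is a simplified sketch that merges adjacent tokens sharing an entity type. It assumes word-level tokens with `B-`/`I-`/`O` style tags and ignores sub-word pieces, scores, and character offsets; the `group_entities` helper and the tagged sentence are hypothetical, not the pipeline's code or real model output.

```python
def group_entities(tagged_tokens):
    """Merge adjacent tokens that share an entity type into one group."""
    groups = []
    current = None  # the group currently being built, if any
    for word, tag in tagged_tokens:
        if tag == "O":
            current = None  # not an entity, so close any open group
            continue
        prefix, entity_type = tag.split("-", 1)
        # Extend the open group only for an I- tag of the same entity type.
        if current is not None and prefix == "I" and current["entity_group"] == entity_type:
            current["word"] += " " + word
        else:
            current = {"entity_group": entity_type, "word": word}
            groups.append(current)
    return groups

# Hypothetical per-token tags for a short sentence.
tagged = [
    ("Jack", "B-PER"), ("Smith", "I-PER"),
    ("lives", "O"), ("in", "O"),
    ("New", "B-LOC"), ("York", "I-LOC"),
]
print(group_entities(tagged))
```

Without grouping, "New" and "York" would be reported as two separate `LOC` tokens instead of one entity.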

#### f. Question-Answering:<br>
- Answers a question by extracting the answer span from a context passage provided along with it.



In [None]:
answering_pipe = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
answering_pipe(
    question="What's my interest in computer science?",
    context="I want to learn more and more about artificial intelligence."
)

{'score': 0.949404239654541,
 'start': 36,
 'end': 59,
 'answer': 'artificial intelligence'}
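
The answer is a span copied from the context, and `start` and `end` are its character offsets. One common way to pick such a span is to score every token as a possible start and as a possible end, then choose the pair with the highest combined score. The sketch below assumes whitespace tokens and made-up logits; the `best_span` helper is illustrative, and a real model works on sub-word tokens.

```python
def best_span(start_logits, end_logits, max_answer_len=5):
    """Return the (start, end) token indices with the highest summed logits."""
    best, best_score = None, float("-inf")
    for i, start_score in enumerate(start_logits):
        # The end token may not precede the start token.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

context = "I want to learn more and more about artificial intelligence."
tokens = context.rstrip(".").split()  # crude whitespace tokens (assumption)

# Made-up logits, one per token.
start_logits = [0.1, 0.0, 0.0, 0.2, 0.1, 0.0, 0.1, 0.3, 4.0, 0.5]
end_logits = [0.0, 0.1, 0.0, 0.1, 0.2, 0.0, 0.1, 0.2, 0.6, 4.2]

start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])

# Map the answer back to character offsets in the context.
char_start = context.find(answer)
char_end = char_start + len(answer)
print(answer, char_start, char_end)
```

The resulting offsets match the `start` and `end` values in the pipeline output above.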

#### g. Summarizer:<br>
- Converts a longer input text into a shorter one that keeps its main points.

In [None]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
summarizer(
    """
    Quantum mechanics allows the calculation of properties and behaviour of physical systems.
    It is typically applied to microscopic systems: molecules, atoms and sub-atomic particles.
    It has been demonstrated to hold for complex molecules with thousands of atoms, but its application to human beings raises philosophical problems,
    such as Wigner's friend, and its application to the universe as a whole remains speculative. Predictions of quantum mechanics have been verified experimentally to an extremely high degree of accuracy.
    For example, the refinement of quantum mechanics for the interaction of light and matter, known as quantum electrodynamics (QED), has been shown to agree with experiment to within 1 part in 10^12 when predicting the magnetic properties of an electron.
    A fundamental feature of the theory is that it usually cannot predict with certainty what will happen, but only give probabilities.
    Mathematically, a probability is found by taking the square of the absolute value of a complex number, known as a probability amplitude.
    This is known as the Born rule, named after physicist Max Born. For example, a quantum particle like an electron can be described by a wave function, which associates to each point in space a probability amplitude. Applying the Born rule to these amplitudes gives a probability density function for the position that the electron will be found to have when an experiment is performed to measure it. This is the best the theory can do; it cannot say for certain where the electron will be found. The Schrödinger equation relates the collection of probability amplitudes that pertain to one moment of time to the collection of probability amplitudes that pertain to another.
    One consequence of the mathematical rules of quantum mechanics is a tradeoff in predictability between different measurable quantities. The most famous form of this uncertainty principle says that no matter how a quantum particle is prepared or how carefully experiments upon it are arranged, it is impossible to have a precise prediction for a measurement of its position and also at the same time for a measurement of its momentum.

Another consequence of the mathematical rules of quantum mechanics is the phenomenon of quantum interference, which is often illustrated with the double-slit experiment. In the basic version of this experiment, a coherent light source, such as a laser beam, illuminates a plate pierced by two parallel slits, and the light passing through the slits is observed on a screen behind the plate.
The wave nature of light causes the light waves passing through the two slits to interfere, producing bright and dark bands on the screen – a result that would not be expected if light consisted of classical particles. However, the light is always found to be absorbed at the screen at discrete points, as individual particles rather than waves; the interference pattern appears via the varying density of these particle hits on the screen. Furthermore, versions of the experiment that include detectors at the slits find that each detected photon passes through one slit (as would a classical particle), and not through both slits (as would a wave). However, such experiments demonstrate that particles do not form the interference pattern if one detects which slit they pass through. This behavior is known as wave–particle duality. In addition to light, electrons, atoms, and molecules are all found to exhibit the same dual behavior when fired towards a double slit.
    """
)

[{'summary_text': ' Quantum mechanics allows the calculation of properties and behaviour of physical systems . Predictions of quantum mechanics have been verified experimentally to an extremely high degree of accuracy . It has been demonstrated to hold for complex molecules with thousands of atoms, but its application to the universe as a whole remains speculative .'}]