<a href="https://colab.research.google.com/github/iamhasanhumane/Hugging_Face/blob/main/Chapter_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installation

In [1]:
!pip install huggingface_hub
!pip install transformers



## Pipeline Function

The `pipeline` function returns an end-to-end object that performs an NLP task on one or several texts. Each call runs three stages in order:

Pre-processing -> Model -> Post-processing
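The post-processing stage can be illustrated without downloading a model. The sketch below assumes a two-class classifier whose raw outputs (logits) are turned into probabilities with a softmax. The logit values, the label order, and the `postprocess` helper are made up for illustration and are not part of the Transformers API.

```python
import math

LABELS = ["NEGATIVE", "POSITIVE"]  # label order is an assumption for this sketch


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits):
    """Return the most likely label and its probability as a label/score dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


example = postprocess([2.5, -3.0])  # hypothetical logits
print(example)
```

The result has the same `label` / `score` shape as the classifier outputs shown later in this notebook.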

In [2]:
from transformers import pipeline

### Sentiment Analysis

The first task we will try with the pipeline API is sentiment analysis, which classifies texts as positive or negative.

In [3]:
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for an AI Internship my whole life")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.9962140917778015}]
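The warning in the log above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the checkpoint and revision that the log reports. The `PINNED` dictionary and the `build_classifier` helper are names introduced here for illustration.

```python
# Checkpoint and revision copied from the log message above.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "714eb0f",
}


def build_classifier():
    """Create the pipeline with an explicit model and revision."""
    # Imported inside the function so that defining the helper triggers no download.
    from transformers import pipeline

    return pipeline(**PINNED)
```

Calling `build_classifier()` should load the same checkpoint as before, but the choice no longer depends on the library's default.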

We can pass several texts at once to the object returned by `pipeline`; it returns one prediction per text, in the same order.

In [4]:
classifier([
    "I've been waiting for a AI Engineer Job my whole life",
    "I hate this so much!"
])

[{'label': 'NEGATIVE', 'score': 0.9938697814941406},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
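Because the predictions come back in input order, they can be paired with their texts using `zip`. The sketch below reuses the output printed above as a literal, so it runs without loading the model; the variable names are illustrative.

```python
texts = [
    "I've been waiting for a AI Engineer Job my whole life",
    "I hate this so much!",
]
# Output copied from the cell above.
predictions = [
    {"label": "NEGATIVE", "score": 0.9938697814941406},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
]

# One (text, label, rounded score) tuple per input.
paired = [
    (text, pred["label"], round(pred["score"], 4))
    for text, pred in zip(texts, predictions)
]

for text, label, score in paired:
    print(f"{label} ({score}): {text}")
```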

### Zero Shot Classification

The zero-shot-classification pipeline lets you choose the candidate labels at call time, so the model does not need to have been fine-tuned on those labels.

In [5]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education","politics","business"]
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445994257926941, 0.11197380721569061, 0.04342673346400261]}

In [6]:
classifier(
    "Patience is the key to succeed in business",
    candidate_labels=["education","politics","business"]
)

{'sequence': 'Patience is the key to succeed in business',
 'labels': ['business', 'education', 'politics'],
 'scores': [0.994697093963623, 0.002830005483701825, 0.0024728807620704174]}

In [7]:
classifier(
    "Democracy is the rule by the people , for the people and to the people",
    candidate_labels=["education","politics","business"]
)

{'sequence': 'Democracy is the rule by the people , for the people and to the people',
 'labels': ['politics', 'business', 'education'],
 'scores': [0.8204069137573242, 0.0935339629650116, 0.08605916053056717]}
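In the three outputs above, the labels are listed from highest to lowest score and the scores of each result add up to roughly 1. A small sketch of reading such results, with the outputs copied as literals (the `sequence` field is omitted for brevity); `top_label` and `as_mapping` are hypothetical helper names.

```python
# Outputs copied from the three cells above.
results = [
    {"labels": ["education", "business", "politics"],
     "scores": [0.8445994257926941, 0.11197380721569061, 0.04342673346400261]},
    {"labels": ["business", "education", "politics"],
     "scores": [0.994697093963623, 0.002830005483701825, 0.0024728807620704174]},
    {"labels": ["politics", "business", "education"],
     "scores": [0.8204069137573242, 0.0935339629650116, 0.08605916053056717]},
]


def top_label(result):
    """The first label is the best match because labels are sorted by score."""
    return result["labels"][0]


def as_mapping(result):
    """Convert the parallel lists into a label -> score dictionary."""
    return dict(zip(result["labels"], result["scores"]))


winners = [top_label(r) for r in results]
print(winners)
print(as_mapping(results[0]))
```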

### Text Generation

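The text-generation pipeline takes an input prompt and generates a continuation. Because generation typically involves sampling, the output can differ from run to run.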

In [9]:
generator = pipeline("text-generation")
generator("In this course , we will teach you how to ")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course , we will teach you how to \xa0draw 3d shapes with a pencil through an image-based model, and provide a tutorial on the approach to combining them using BladerConversion and the Kino toolbox. As a'}]

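### Text Generation with the DistilGPT2 Model

Here is another text-generation pipeline, this time using the distilgpt2 model. `max_length` caps the total length in tokens (prompt included), and `num_return_sequences` sets how many candidates are returned.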

In [10]:
distill_generator = pipeline("text-generation", model="distilgpt2")
distill_generator(
    "In this course , we will learn how to ",
    max_length=30,
    num_return_sequences=2,
)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course , we will learn how to irc.paa(...).\n\n\nThe easiest way to test is to use a simple'},
 {'generated_text': 'In this course , we will learn how to ive egotistical. ives and ives for any variety of different topics related to the concept'}]
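Each `generated_text` above begins with the prompt itself. If only the newly generated part is needed, the prompt can be stripped afterwards. The sketch below works on the outputs copied from the cell above; `continuation` is a hypothetical helper, not a library function.

```python
prompt = "In this course , we will learn how to "
# Outputs copied from the cell above.
outputs = [
    {"generated_text": "In this course , we will learn how to irc.paa(...).\n\n\nThe easiest way to test is to use a simple"},
    {"generated_text": "In this course , we will learn how to ive egotistical. ives and ives for any variety of different topics related to the concept"},
]


def continuation(prompt, generated_text):
    """Return the text after the prompt, or the full text if the prompt is absent."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


continuations = [continuation(prompt, o["generated_text"]) for o in outputs]
for text in continuations:
    print(repr(text))
```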

### Text Completion (Mask Filling)

The fill-mask pipeline predicts the missing word in a sentence. The `top_k` argument sets how many candidates are returned.

In [None]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about solving complex <mask> .", top_k = 2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.5938500761985779,
  'token': 1272,
  'token_str': ' problems',
  'sequence': 'This course will teach you all about solving complex problems .'},
 {'score': 0.13669510185718536,
  'token': 43123,
  'token_str': ' equations',
  'sequence': 'This course will teach you all about solving complex equations .'}]
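The `sequence` field is the input with `<mask>` replaced by the predicted token. Note that `token_str` carries a leading space in the output above, so it is stripped before substitution. A short sketch that rebuilds the sequences from the candidates copied from that output; `fill` is a hypothetical helper.

```python
template = "This course will teach you all about solving complex <mask> ."
# Candidates copied from the output above.
candidates = [
    {"score": 0.5938500761985779, "token_str": " problems",
     "sequence": "This course will teach you all about solving complex problems ."},
    {"score": 0.13669510185718536, "token_str": " equations",
     "sequence": "This course will teach you all about solving complex equations ."},
]


def fill(template, token_str, mask="<mask>"):
    """Substitute the first mask token with the predicted word."""
    return template.replace(mask, token_str.strip(), 1)


rebuilt = [fill(template, c["token_str"]) for c in candidates]
for sentence in rebuilt:
    print(sentence)
```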

### Named Entity Recognition

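The NER pipeline identifies entities such as persons, organizations, or locations in a sentence. With grouping enabled, sub-word tokens that belong to the same entity are merged, so "IIT Madras" is returned as a single ORG span.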

In [11]:
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is Hassan and I am pursuing Data Science at IIT Madras in Chennai")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


[{'entity_group': 'PER',
  'score': 0.99946505,
  'word': 'Hassan',
  'start': 11,
  'end': 17},
 {'entity_group': 'ORG',
  'score': 0.95293397,
  'word': 'IIT Madras',
  'start': 52,
  'end': 62},
 {'entity_group': 'LOC',
  'score': 0.99823654,
  'word': 'Chennai',
  'start': 66,
  'end': 73}]
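The `start` and `end` fields are character offsets into the input sentence, so each entity can be recovered by slicing. The sketch below groups the entities by type, using the output above as a literal (scores omitted); the variable names are illustrative.

```python
from collections import defaultdict

text = "My name is Hassan and I am pursuing Data Science at IIT Madras in Chennai"
# Entities copied from the output above.
entities = [
    {"entity_group": "PER", "word": "Hassan", "start": 11, "end": 17},
    {"entity_group": "ORG", "word": "IIT Madras", "start": 52, "end": 62},
    {"entity_group": "LOC", "word": "Chennai", "start": 66, "end": 73},
]

grouped = defaultdict(list)
for ent in entities:
    span = text[ent["start"]:ent["end"]]  # slice the original text by offsets
    grouped[ent["entity_group"]].append(span)

by_type = dict(grouped)
print(by_type)
```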

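### Question Answering

The question-answering pipeline answers a question by extracting a span of text from the context you provide.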

In [12]:
question_answerer = pipeline("question-answering")
question_answerer(
    question = "Where am I studying?",
    context = "My name is Hasan and I am studying at IIT Madras"
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


{'score': 0.9873796105384827, 'start': 38, 'end': 48, 'answer': 'IIT Madras'}
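Since the answer is a span of the context, `start` and `end` index directly into it, and `score` can be used to reject uncertain answers. A minimal sketch on the output above; the `answer_or_none` helper and the 0.5 threshold are assumptions chosen for illustration.

```python
context = "My name is Hasan and I am studying at IIT Madras"
# Output copied from the cell above.
result = {"score": 0.9873796105384827, "start": 38, "end": 48, "answer": "IIT Madras"}


def answer_or_none(result, context, min_score=0.5):
    """Return the answer span, or None if it is inconsistent or below the threshold."""
    span = context[result["start"]:result["end"]]
    if span != result["answer"] or result["score"] < min_score:
        return None
    return span


print(answer_or_none(result, context))
```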

### Summarization

The summarization pipeline condenses a long text into a shorter summary.

In [None]:
summarizer = pipeline("summarization")
summarizer(
    """
    Deep learning (also known as deep structured learning) is part of a broader family of machine learning
methods based on artificial neural networks with representation learning.
Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as
deep neural networks, deep belief networks, deep reinforcement learning,
recurrent neural networks and convolutional neural networks have been applied to
fields including computer vision, speech recognition, natural language processing,
machine translation, bioinformatics, drug design, medical image analysis,
material inspection and board game programs, where they have produced results
comparable to and in some cases surpassing human expert performance.
Artificial neural networks (ANNs) were inspired by information processing and
distributed communication nodes in biological systems. ANNs have various differences
from biological brains. Specifically, neural networks tend to be static and symbolic,
while the biological brain of most living organisms is dynamic (plastic) and analogue.
The adjective "deep" in deep learning refers to the use of multiple layers in the network.
Early work showed that a linear perceptron cannot be a universal classifier,
but that a network with a nonpolynomial activation function with one hidden layer of
unbounded width can. Deep learning is a modern variation which is concerned with an
unbounded number of layers of bounded size, which permits practical application and
optimized implementation, while retaining theoretical universality under mild conditions.
In deep learning the layers are also permitted to be heterogeneous and to deviate widely
from biologically informed connectionist models, for the sake of efficiency, trainability
and understandability, whence the structured part.
    """
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'summary_text': ' Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning . Learning can be supervised, semi-supervised or unsupervised . Deep-learning architectures have been applied to computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis and board game programs .'}]

### Translation

The translation pipeline translates text from one language to another. Here the model is chosen explicitly: Helsinki-NLP/opus-mt-fr-en translates French to English.

In [13]:
translator = pipeline("translation", model = "Helsinki-NLP/opus-mt-fr-en")
translator("Je te regarde")

Device set to use cpu


[{'translation_text': "I'm looking at you."}]
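The checkpoint name above ends in a source-target language pair (`fr-en`). Assuming other language pairs follow the same naming pattern, a model id can be built from two language codes. This is only a sketch: `opus_mt_model_id` is a hypothetical helper, and not every pair is guaranteed to have a published checkpoint, so the resulting name should be checked on the Hub before use.

```python
def opus_mt_model_id(source_lang, target_lang):
    """Build a model id following the pattern of the checkpoint used above."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


print(opus_mt_model_id("fr", "en"))
```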

## Summary

This notebook covered the following pipeline tasks:

*   Text classification (sentiment analysis)
*   Zero-shot classification
*   Text generation
*   Text completion (mask filling)
*   Token classification (named entity recognition)
*   Question answering
*   Summarization
*   Translation

The pipeline API supports most common NLP tasks out of the box.
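As a quick reference, the task identifiers passed to `pipeline` in this notebook can be collected in one place together with the checkpoints they loaded. The checkpoint names are taken from the log messages and code cells above; the `TASKS` dictionary itself is just an illustrative summary.

```python
# Task identifier -> checkpoint used in this notebook.
TASKS = {
    "sentiment-analysis": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "zero-shot-classification": "facebook/bart-large-mnli",
    "text-generation": "openai-community/gpt2",
    "fill-mask": "distilbert/distilroberta-base",
    "ner": "dbmdz/bert-large-cased-finetuned-conll03-english",
    "question-answering": "distilbert/distilbert-base-cased-distilled-squad",
    "summarization": "sshleifer/distilbart-cnn-12-6",
    "translation": "Helsinki-NLP/opus-mt-fr-en",  # passed explicitly, not a default
}

for task, checkpoint in TASKS.items():
    print(f"{task:<26} {checkpoint}")
```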