In [21]:
# Transformers

# The Transformers library from Hugging Face.
# Its basic object is the pipeline() function, which connects a model with its necessary preprocessing and postprocessing steps.
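
The three stages that pipeline() chains together can be pictured with a toy example. This is only a sketch under loud assumptions: the vocabulary, the per-token logits, and the function names are all made up for illustration, and none of it is the transformers implementation.

```python
import math

# Hypothetical vocabulary and made-up per-token [NEGATIVE, POSITIVE] logits.
VOCAB = {"good": 0, "great": 1, "bad": 2, "awful": 3}
TOKEN_LOGITS = [[-1.0, 1.0], [-2.0, 2.0], [1.5, -1.5], [2.0, -2.0]]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Stage 1: lowercase, split on whitespace, map known words to ids."""
    return [VOCAB[word] for word in text.lower().split() if word in VOCAB]


def model(token_ids):
    """Stage 2: sum the per-token logits for each class."""
    logits = [0.0, 0.0]
    for token_id in token_ids:
        for i, value in enumerate(TOKEN_LOGITS[token_id]):
            logits[i] += value
    return logits


def postprocess(logits):
    """Stage 3: softmax the logits and report the best label with its score."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, as pipeline() does for a real model."""
    return postprocess(model(preprocess(text)))


print(toy_pipeline("a great course"))
print(toy_pipeline("an awful course"))
```

The returned dictionary has the same shape ('label' and 'score') as the real sentiment-analysis output, which is the only part taken from the notebook.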

In [22]:
from transformers import pipeline 
classifier = pipeline('sentiment-analysis') 
result_1= classifier('I have been waiting to learn NLP course my whole life')
result_2 = classifier(
    ['I have been waiting to learn NLP course my whole life',
    'I litle nervous']
    )
print(result_1)
print(result_2)

# The model is downloaded and cached when you create the classifier object.
# If you rerun the command, the cached model is used instead of being downloaded again.

# Some of the currently available pipelines are:
# -----------------------------------------------
# feature-extraction (get the vector representation of a text)
# fill-mask
# ner (named entity recognition)
# question-answering
# sentiment-analysis
# summarization
# text-generation
# translation
# zero-shot-classification

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.6276514530181885}]
[{'label': 'POSITIVE', 'score': 0.6276514530181885}, {'label': 'NEGATIVE', 'score': 0.9888225793838501}]
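
The warning in the output above recommends naming the model and revision explicitly. A minimal sketch of that, using the model name and revision hash printed in the warning; the helper name build_pinned_classifier and the PINNED_SENTIMENT dictionary are assumptions made for this example.

```python
# Model name and revision are copied from the warning printed by the cell above.
PINNED_SENTIMENT = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",
}


def build_pinned_classifier(spec=PINNED_SENTIMENT):
    """Create the pipeline from an explicit model name and revision."""
    # Imported lazily so the spec can be inspected without loading transformers.
    from transformers import pipeline

    return pipeline(spec["task"], model=spec["model"], revision=spec["revision"])


# Uncomment to build it; the first call downloads the model if it is not cached.
# classifier = build_pinned_classifier()
# print(classifier('I have been waiting to learn NLP course my whole life'))
```

Pinning both values means a later change to the library's default model does not silently change the results.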


In [23]:
# zero-shot classification
classifier_zs = pipeline('zero-shot-classification') 
classifier_zs(
    'this is a course about the Transformer library',
    candidate_labels = ['education','politics', 'business']
)
# This pipeline lets you specify which labels to use for the classification.
# It is called zero-shot because you don’t need to fine-tune the model on your data to use it.
# It can directly return probability scores for any list of labels you want!

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'this is a course about the Transformer library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.9569939970970154, 0.030598720535635948, 0.01240723580121994]}
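
The default model here is an NLI (natural language inference) model. To the best of my understanding, the pipeline turns each candidate label into a hypothesis sentence, scores how strongly the input entails it, and in single-label mode applies a softmax to those entailment logits across the labels. The sketch below shows only that last scoring step; the logits are made up, and the template string is what I believe the default to be, so treat both as assumptions.

```python
import math


def zero_shot_scores(entailment_logits):
    """Softmax the entailment logits across labels, highest score first."""
    top = max(entailment_logits.values())
    exps = {label: math.exp(v - top) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    ranked = sorted(exps, key=exps.get, reverse=True)
    return {"labels": ranked, "scores": [exps[label] / total for label in ranked]}


# Assumed default hypothesis template; each label is slotted into it.
template = "This example is {}."
candidate_labels = ["education", "politics", "business"]
hypotheses = [template.format(label) for label in candidate_labels]

# Made-up entailment logits, one per hypothesis (not real model output).
toy_logits = {"education": 3.0, "politics": -1.5, "business": -0.5}

result = zero_shot_scores(toy_logits)
print(hypotheses)
print(result)
```

Because the scores come from a softmax over the labels, they sum to 1, which is also true of the three scores in the real output above.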

In [24]:
# text generation 
generator = pipeline('text-generation')
generator('In this course, we will teach you how to')
# You provide a prompt and the model auto-completes it by generating the remaining text.
# Text generation involves randomness,
# so it’s normal if you don’t get the same results as shown below.

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to play and write the first chapter of the "Magic Magic of the Golden Ages" (part 3, part 4, part 5, part 6, part 7, part 8, part 9, part 10'}]
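
The randomness comes from sampling the next token from a probability distribution. A toy illustration using only the standard library: the distribution below is made up, and sample_tokens is a hypothetical helper, not part of transformers.

```python
import random

# Made-up next-token distribution for illustration only.
NEXT_TOKEN_PROBS = {"play": 0.4, "write": 0.3, "build": 0.2, "learn": 0.1}


def sample_tokens(n, seed=None):
    """Draw n tokens from the toy distribution; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=n)


print(sample_tokens(5, seed=0))
print(sample_tokens(5, seed=0))  # same seed, same draw
print(sample_tokens(5))          # unseeded, may differ on every run
```

For the real pipeline, transformers provides a set_seed helper that can be called before generating to make runs repeatable on the same setup.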

In [25]:
# using a specific model from the Hugging Face Hub (instead of the default model)
generator_disgpt2 = pipeline('text-generation', model='distilgpt2')
generator_disgpt2('In this course, we will teach you how to')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to set the "T" into operation, and will show you how to set it in real time and within seconds. In this course we will show you how to set the "T" into operation,'}]

In [26]:
# mask filling
unmask = pipeline('fill-mask')
unmask('This course will teach you all about <mask> models', top_k =2)
# The top_k argument controls how many possibilities you want to be displayed. 
# Note that here the model fills in the special <mask> word, which is often referred to as a mask token. 
# Other mask-filling models might have different mask tokens, so it’s always good to verify the proper mask word when exploring other models. 

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.19631513953208923,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models'},
 {'score': 0.04449228197336197,
  'token': 745,
  'token_str': ' building',
  'sequence': 'This course will teach you all about building models'}]
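
Since the mask token differs between model families, one way to avoid hard-coding it is to build the prompt from a placeholder. A small sketch: the helper build_masked_prompt and the {mask} placeholder are assumptions for this example, while the two tokens listed are the ones used by RoBERTa-style and BERT-style tokenizers.

```python
# Mask tokens for two common model families.
MASK_TOKENS = {"roberta": "<mask>", "bert": "[MASK]"}


def build_masked_prompt(template, mask_token):
    """Fill the {mask} placeholder with the model's own mask token."""
    return template.format(mask=mask_token)


template = "This course will teach you all about {mask} models"
for family, token in MASK_TOKENS.items():
    print(family, "->", build_masked_prompt(template, token))
```

With a loaded pipeline, the token can be read from the tokenizer instead of a lookup table, for example unmask.tokenizer.mask_token.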

In [None]:
# ner
ner = pipeline('ner', grouped_entities = True)
ner('My name is Praveen Veera and I am learning NLP from Hyderabad')
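# Passing grouped_entities=True when creating the pipeline tells it to regroup the parts of the sentence that belong to the same entity (here 'Praveen' and 'Veera' become one PER entity).
# Newer transformers releases deprecate this flag in favour of aggregation_strategy='simple'.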

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from 

[{'entity_group': 'PER',
  'score': 0.9993941,
  'word': 'Praveen Veera',
  'start': 11,
  'end': 24},
 {'entity_group': 'MISC',
  'score': 0.585764,
  'word': 'NL',
  'start': 43,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9987062,
  'word': 'Hyderabad',
  'start': 52,
  'end': 61}]
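
The regrouping step can be illustrated with a toy function that merges consecutive tokens carrying the same entity type. This is a simplified sketch, not the library's aggregation code: the tags below are hand-written for the example rather than real model predictions, and group_entities is a hypothetical helper.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that share an entity type into one span."""
    groups = []
    current = None
    for word, tag in tagged_tokens:
        if tag == "O":  # outside any entity: close the open group
            current = None
            continue
        prefix, entity = tag.split("-", 1)
        if current is not None and prefix == "I" and current["entity_group"] == entity:
            current["word"] += " " + word  # continue the open group
        else:
            current = {"entity_group": entity, "word": word}
            groups.append(current)
    return groups


# Hand-written tags for illustration only.
tokens = [
    ("My", "O"), ("name", "O"), ("is", "O"),
    ("Praveen", "B-PER"), ("Veera", "I-PER"),
    ("and", "O"), ("I", "O"), ("am", "O"), ("learning", "O"),
    ("NLP", "O"), ("from", "O"), ("Hyderabad", "B-LOC"),
]
print(group_entities(tokens))
```

The 'start' and 'end' values in the real output are character offsets into the input string, so sentence[11:24] gives 'Praveen Veera'.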

In [29]:
# qa
qa = pipeline('question-answering')
qa(
    question='Where do I work?',
    context='My name is Praveen Veera and I work at ITCI from Hyderabad'
)
# Note that this pipeline works by extracting information from the provided context; it does not generate the answer

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.5419266819953918, 'start': 39, 'end': 43, 'answer': 'ITCI'}
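
Because the answer is extracted rather than generated, 'start' and 'end' are character offsets into the context. A quick check using the values from the output above:

```python
context = 'My name is Praveen Veera and I work at ITCI from Hyderabad'

# Prediction copied from the cell output above.
prediction = {'score': 0.5419266819953918, 'start': 39, 'end': 43, 'answer': 'ITCI'}

# Slicing the context with the offsets recovers the answer text.
span = context[prediction['start']:prediction['end']]
print(span)
print(span == prediction['answer'])
```

If the answer is not present in the context, an extractive model can still only return some span of it, so a low score is worth checking before trusting the result.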

In [None]:
# translation
translator = pipeline('translation', model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.") # French to English


In [31]:
# summarization
summarizer = pipeline('summarization')
summarizer(
    '''I'm a Data Scientist with over 8.11 years of IT experience, specializing in data-centric applications, predictive modelling, Generative AI, and MLOps. Proficient in designing and implementing solutions using Azure AI Studio and Azure Machine Learning.

Accomplishments:
* Consistently delivering adaptable machine learning solutions, rooted in strong MLOps principles.
* Holder of the Microsoft Azure Data Scientist Associate certification.
* Winner of the Microsoft #BuildFor2030 Hackathon in 2022, focused on Climate Action and Sustainability.
* Enhanced my academic qualifications with a Post Graduate Diploma in Machine Learning and Artificial Intelligence.

AI Leadership and Expertise:
* Active contributor to ITC Infotech's Center of Excellence, specializing in Generative AI and Azure AI.
* Proficient in identifying AI and ML opportunities for process optimization, automation, and decision enhancement.
* Skilled in designing end-to-end AI and ML architectures aligned with business goals, considering data integration, model selection, and scalability.
* Leading discussions on AI solution design, providing technical leadership and insights.
* Evaluating and selecting AI technologies, frameworks, and tools for solution development.
* Leading experiments and initiatives in Generative AI, overseeing multiple use cases and task management.
* Actively participating in pioneering initiatives with expertise in OpenAI, Azure OpenAI, Azure Cognitive Services, and Langchain.
* Proficient in developing Generative AI conversation chatbots and solving Retrieval-augmented Generation (RAG) based GenAI use cases.

GenAI Use Cases:
Led and contributed to diverse GenAI use cases, including:
* Medical Product Comparison: Leveraged Generative AI for medical product comparisons through natural language interactions, extracting information from product documentation.
* Automated Medical Paper Writing: Developed AI chatbots to expedite the creation of medical research papers by referencing previous results and current research data.
* FAQ Generation for Medical Products: Designed an AI solution to auto-generate product FAQs from internal sources, reducing manual effort and turnaround time.
* Multiple-choice questions (MCQs) for LnD Team: Created automated solutions for MCQ generation, facilitating training and educational content development.

I'm passionate about leveraging AI to drive innovation, streamline processes, and enhance decision-making. I'm open to connecting for opportunities at the intersection of data-driven innovation and technology.
''')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' Data Scientist with over 8.11 years of IT experience, specializing in data-centric applications, predictive modelling, Generative AI, and MLOps . Proficient in designing and implementing solutions using Azure AI Studio and Azure Machine Learning . Awarded Microsoft #BuildFor2030 Hackathon in 2022, focused on Climate Action and Sustainability .'}]