# Modern AI Pro: Advanced Operations with Open Source models
You don't always need GPT-4. Depending on the business need, a collection of open source models can be cheaper, more flexible, and better for privacy.

## 1. Let's start with text classification using a smaller model

In [None]:
from transformers import pipeline

In [None]:
# Note: this checkpoint is BART fine-tuned on MNLI, so the variable is named for its role rather than for RoBERTa.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

In [None]:
sequence_to_classify = "I'm going to make a mark in the Software World. I can lead this."
candidate_labels = ['travel', 'cooking', 'dancing', 'auto suggestion', 'advice']
output = classifier(sequence_to_classify, candidate_labels)
output

In [None]:
# Find the index of the highest score
max_score_index = output['scores'].index(max(output['scores']))

# Retrieve the corresponding label
label_with_highest_score = output['labels'][max_score_index]

# Print the label
print("Label with the highest score:", label_with_highest_score)
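
The lookup above can be wrapped in a small helper that also refuses to answer when no label is convincing. This is a minimal sketch: `top_label` and its `min_score` threshold are hypothetical names, and the example dictionary is made up to mirror the shape of the pipeline output rather than taken from a real run.

```python
from typing import Optional


def top_label(output: dict, min_score: float = 0.5) -> Optional[str]:
    """Return the best label, or None if its score is below min_score."""
    # Pair each label with its score and pick the highest-scoring pair.
    best_label, best_score = max(
        zip(output["labels"], output["scores"]), key=lambda pair: pair[1]
    )
    return best_label if best_score >= min_score else None


# Hypothetical output in the shape the zero-shot pipeline returns.
example = {
    "sequence": "I'm going to make a mark in the Software World.",
    "labels": ["auto suggestion", "advice", "travel"],
    "scores": [0.62, 0.30, 0.08],
}
print(top_label(example))                 # best label clears the default threshold
print(top_label(example, min_score=0.9))  # None: nothing is confident enough
```

A threshold like this is a design choice, not part of the pipeline; the right value depends on how costly a wrong label is for the use case.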

## 2. Text Summarization

In [None]:
summarizer = pipeline(task="summarization",model="facebook/bart-large-cnn")

In [None]:
news = """ Even as the decision to omit Shreyas Iyer and Ishan Kishan from the central contracts list for not turning up for domestic tournaments continues to receive mixed reactions, there was plenty of discussion about Hardik Pandya’s inclusion as well. The Indian Express understands that the selectors and the BCCI only handed Hardik a contract after the all-rounder gave an undertaking that if there are no white-ball commitments with the national team, he would feature for Baroda in the Syed Mushtaq Ali T20s and Vijay Hazare Trophy.

During the recent meeting, apart from Shreyas and Ishan, there was discussion regarding Pandya’s place in Grade A of the annual contract list as well. Since injuring his ankle during the World Cup in October, Pandya had remained out of action till last week when he returned to competitive cricket in the DY Patil tournament, where he is turning out for Reliance. Like Ishan, Pandya has been training individually in Vadodara, but what worked in his favour is that he has been reporting at the National Cricket Academy (NCA) on a time-to-time basis to have his fitness assessed.

According to a top BCCI official, Pandya has also given assurance that he would feature in domestic tournaments if they don’t overlap with international commitments. “We have had discussions with Pandya, who has been told to play domestic white-ball tournaments when he is available. At this stage, according to the assessment of the BCCI’s medical team, he is not in a position to bowl in red-ball tournaments. So playing Ranji Trophy is out of the equation for Pandya. But he has to play other white-ball tournaments if there are no India commitments. If not, he will miss out on a contract,” the official told The Indian Express.

According to the Future Tours Programme, India are scheduled to play only three T20Is at home against Bangladesh as the team has a busy Test calendar. In the October-December period, when India don’t have any white-ball commitments, the Syed Mushtaq Ali T20s and Vijay Hazare Trophy would be conducted. And unless Pandya has any fitness issues, he has been directed to feature in both these tournaments.

Different yardsticks
While former cricketers have welcomed the BCCI’s decision to drop Shreyas and Ishan from the contracts list, Irfan Pathan raised the question regarding Pandya on X (previously Twitter). “They are talented cricketers, both Shreyas and Ishan. Hoping they bounce back and come back stronger. If players like Hardik don’t want to play red-ball cricket, should he and others like him participate in white-ball cricket when they aren’t on national duty? If this doesn’t apply to all, then Indian cricket won’t achieve the desired results!” Pathan tweeted on Thursday morning.

Festive offer
Also Read | How missing Ranji Trophy games resulted in Shreyas Iyer and Ishan Kishan being dropped from BCCI’s central contract list
It is understood that the BCCI will also instruct contracted players to report to their respective state units when they are not part of the national team set-up. There have been instances when in the middle of the domestic season, players from several states attended short camps with their respective IPL franchises, a move which didn’t go down well with the state units. Shreyas, for instance, attended a Kolkata Knight Riders camp after missing a Ranji fixture with Mumbai.



On Wednesday, based on the recommendations of the selectors, BCCI secretary Jay Shah announced that Shreyas and Ishan were not considered for the 2023-24 annual contracts. Though the two were part of the national set-up across all three formats for the majority of 2023, that the duo didn’t turn up for Ranji Trophy had drawn sharp criticism from several quarters. And despite Shah writing to contracted players, urging them to participate in the tournament, Shreyas and Ishan missed the next round of Ranji fixtures.
"""
summarizer(news)

[{'summary_text': 'Shreyas Iyer and Ishan Kishan were dropped from the central contracts list for not turning up for domestic tournaments. There was plenty of discussion about Hardik Pandya’s inclusion as well. The selectors and the BCCI only handed Hardik a contract after the all-rounder gave an undertaking that if there are no white-ball commitments with the national team.'}]

## 3. Generate Text embeddings

In [None]:
feature_extractor = pipeline("feature-extraction",framework="pt",model="sentence-transformers/all-mpnet-base-v2")

In [None]:
feature_extractor("The students are having an awesome time!",return_tensors = "pt")[0].numpy().mean(axis=0)

array([-7.81268999e-02, -9.37575996e-02, -4.04708982e-02,  1.30863458e-01,
        1.11380026e-01, -8.29988867e-02, -1.46962553e-01, -3.66405845e-02,
        1.00270100e-03, -5.69377057e-02, -1.02429450e-01, -5.24476469e-02,
       -1.47884190e-01,  6.67752931e-03,  2.74271541e-03, -1.64709508e-01,
        1.21444151e-01, -9.93012451e-03, -1.90667570e-01, -2.30152123e-02,
       -1.69264272e-01, -6.70521110e-02, -1.73123721e-02, -6.05200008e-02,
       -6.12550182e-03,  1.57465395e-02, -1.09484099e-01,  3.73060107e-02,
        8.96301270e-02,  7.46302446e-03,  1.74500361e-01,  4.30369303e-02,
       -3.76008675e-02,  2.32830644e-01,  4.10099801e-06, -1.31151021e-01,
       -2.69396510e-02, -7.39718825e-02, -1.37553230e-01,  2.13166803e-01,
        1.43781394e-01,  2.13579252e-01, -2.31139898e-01, -6.82962686e-02,
       -3.97868678e-02,  2.96852775e-02,  4.14562747e-02, -2.72159517e-01,
        1.03825666e-01, -5.75199835e-02,  2.97908336e-02, -2.83250868e-01,
       -1.66221008e-01, -
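
The `.mean(axis=0)` call above averages the per-token vectors into a single sentence vector (mean pooling), and such vectors are usually compared with cosine similarity. The sketch below shows both steps on toy numbers, using only the standard library so it runs without downloading a model; `mean_pool` and `cosine_similarity` are hypothetical helper names, and the 2-dimensional vectors are made up for illustration.

```python
import math
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average token vectors column-wise into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "token embeddings" (made up for illustration).
sentence_a = mean_pool([[1.0, 0.0], [0.0, 1.0]])
sentence_b = mean_pool([[2.0, 2.0], [4.0, 4.0]])
print(sentence_a, sentence_b)
print(cosine_similarity(sentence_a, sentence_b))
```

With real embeddings the same two functions apply unchanged; only the vector length differs.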

## 4. Question answering

In [None]:
qa_model = pipeline(task="question-answering", model="timpal0l/mdeberta-v3-base-squad2")

In [None]:
qa_model(question="Who is the supreme leader of Europe?",context="Narendra Modi is the Prime Minister of Mars and Joe Biden is the President of the Moon")

{'score': 2.5423190663786954e-08,
 'start': 47,
 'end': 57,
 'answer': ' Joe Biden'}
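
The context does not contain the answer, and the near-zero score reflects that. One way to use this signal is to abstain below a threshold. This is a minimal sketch: `answer_or_abstain` and the `min_score` default are hypothetical, and a sensible threshold would have to be tuned on real data.

```python
from typing import Optional


def answer_or_abstain(result: dict, min_score: float = 0.1) -> Optional[str]:
    """Return the stripped answer, or None when the model's score is too low."""
    if result["score"] < min_score:
        return None
    return result["answer"].strip()


# The result shown above, where the question cannot be answered from the context.
result = {"score": 2.5423190663786954e-08, "start": 47, "end": 57, "answer": " Joe Biden"}
print(answer_or_abstain(result))
```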

## 5. Translation between two major languages

In [None]:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
model_name = "facebook/mbart-large-50-many-to-many-mmt"


model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
translation = pipeline(task="translation",model=model,tokenizer=tokenizer)

In [None]:
translation("It is a nice summer evening in Paris", src_lang="en_XX", tgt_lang="hi_IN")


[{'translation_text': 'पेरिस में एक सुंदर गर्मियों की शाम है।'}]

## 6. Multimodal Models for Visual Question Answering
This has a separate [notebook](https://colab.research.google.com/drive/1YJ5pxuESgwcc107pNVIxh7uD2SWxRPDY#scrollTo=Kmb8h7TrWz5b) now.

In [None]:
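# Colab workaround: report UTF-8 as the preferred encoding so that the shell commands below do not fail on locale checks.
import locale
locale.getpreferredencoding = lambda: "UTF-8"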

!apt install -y tesseract-ocr
!pip install -q -U pytesseract

In [None]:
vqa = pipeline(model="impira/layoutlm-document-qa")
image = "https://www.invoicesimple.com/wp-content/uploads/2018/06/Sample-Invoice-printable.png"
vqa(image=image,question="What is the invoice date?")

In [None]:
vqa(image=image,question="What is the balance due?")

In [None]:
vqa(image=image, question="Who is this billed to?")

## 7. Answering questions based on tabular data

In [None]:
tqa = pipeline(task="table-question-answering", model="google/tapas-large-finetuned-wtq")

In [None]:
import pandas as pd

data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
tqa(table=pd.DataFrame.from_dict(data), query="how many movies does Leonardo Di Caprio have?")

{'answer': 'SUM > 53',
 'coordinates': [(1, 1)],
 'cells': ['53'],
 'aggregator': 'SUM'}
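
The answer string `'SUM > 53'` means the model selected one cell and predicted a SUM over it; the pipeline does not compute the number itself. A small post-processing step can do that. This is a sketch under assumptions: `resolve_table_answer` is a hypothetical helper, and the aggregator names `NONE`, `SUM`, `AVERAGE` and `COUNT` are assumed to be the ones this WTQ checkpoint emits.

```python
def resolve_table_answer(result: dict):
    """Apply the predicted aggregator to the selected cells."""
    cells = result["cells"]
    aggregator = result["aggregator"]
    if aggregator == "NONE":
        # No aggregation: the selected cells are the answer.
        return ", ".join(cells)
    if aggregator == "COUNT":
        return len(cells)
    # SUM and AVERAGE need numeric cells.
    values = [float(cell) for cell in cells]
    if aggregator == "SUM":
        return sum(values)
    if aggregator == "AVERAGE":
        return sum(values) / len(values)
    raise ValueError(f"Unexpected aggregator: {aggregator}")


# The result shown above.
result = {"answer": "SUM > 53", "coordinates": [(1, 1)], "cells": ["53"], "aggregator": "SUM"}
print(resolve_table_answer(result))
```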

## 8. Answering questions on Documents
This has some similarities with VQA, but the models here are custom-trained for documents such as PDFs and spreadsheets.

In [None]:
from PIL import Image
dqa = pipeline("document-question-answering", model="naver-clova-ix/donut-base-finetuned-docvqa")

In [None]:
dqa(question="What is the balance due?", image="https://www.invoicesimple.com/wp-content/uploads/2018/06/Sample-Invoice-printable.png")

## 9. Text Sentiment Analysis

In [None]:
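# return_all_scores=True returns a score for every label; newer transformers releases deprecate it in favor of top_k=None.
sentiment = pipeline(model="lxyuan/distilbert-base-multilingual-cased-sentiments-student", return_all_scores=True)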

In [None]:
sentiment("Apple stock is once again going terrible and the leadership is clueless. Short the stock.")

[[{'label': 'positive', 'score': 0.0587301068007946},
  {'label': 'neutral', 'score': 0.1007583737373352},
  {'label': 'negative', 'score': 0.8405115604400635}]]
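
Because every label comes back with a score, picking the prediction is a one-line `max`. A minimal sketch, with `top_sentiment` as a hypothetical helper name, run on the output shown above:

```python
def top_sentiment(all_scores):
    """Pick the highest-scoring label from one input's list of scores."""
    best = max(all_scores, key=lambda item: item["score"])
    return best["label"], best["score"]


# The output shown above: one inner list per input text.
results = [[{'label': 'positive', 'score': 0.0587301068007946},
            {'label': 'neutral', 'score': 0.1007583737373352},
            {'label': 'negative', 'score': 0.8405115604400635}]]
label, score = top_sentiment(results[0])
print(label, round(score, 3))
```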

## 10. Let's do Text Generation
### 10a. Let's try with GPT2

In [None]:
text_generation = pipeline(task="text-generation",model="gpt2")

In [None]:
text_generation("What is the capital of India?")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'What is the capital of India? Delhi, the capital of Hyderabad, and Delhi is located on the Indian Ocean." She added that he and his co-manager, Hrithik R. Kajiwal, have an estimated annual income'}]

### Let's try something more complex

In [None]:
text_generation("Summary of the Indian Constituition")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Summary of the Indian Constituition and the Right to Constituency\n\nThe fundamental principles of the Indian Constituency as implemented by the government were formulated by Supreme Court of India in June 1873. The purpose of the Article 4 Declaration of'}]

### This hallucinates like hell. Good thing we have moved beyond GPT2!
We will use Mistral 7B, which is a very good model (as of Jan 2024). However, its weights won't fit in the free Colab tier's memory as-is, so we first quantize the model to 4-bit.
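
A back-of-the-envelope calculation shows why quantization matters. The sketch below estimates the memory needed for the weights alone at different precisions. The parameter count of roughly 7.24 billion is an assumption, `weight_memory_gb` is a hypothetical helper, and the estimate ignores activations, the KV cache and the small overhead that quantization metadata adds.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


MISTRAL_7B_PARAMS = 7.24e9  # assumed approximate parameter count

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(MISTRAL_7B_PARAMS, bits):.2f} GB")
```

Going from 16-bit to 4-bit cuts the weight footprint to about a quarter, which is what makes a 7B model practical on a small GPU.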

### 10b. Text generation with Mistral
### Step 1: Setting up acceleration and quantization

In [None]:
!pip install -q -U bitsandbytes torch accelerate

In [None]:
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

In [None]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

### Step 2: Setting up the pipeline to use with LangChain

In [None]:
!pip install --upgrade --quiet langchain langchain-community
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

In [None]:
hf = pipeline(
    task="text-generation",
    model=model_4bit, #Quantized
    tokenizer=tokenizer,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    use_cache=True,
    max_length=500,
)
llm_mistral = HuggingFacePipeline(pipeline=hf)

In [None]:
from langchain_core.prompts import PromptTemplate
template = """Question: {question}

Summary: Summarize the following for a layman."""
summary_prompt = PromptTemplate.from_template(template)
summary_chain_m = summary_prompt | llm_mistral

In [None]:
summary_chain_m.invoke({"question": "Summary of the Indian Constituition"})

'\n\nThe Indian Constitution is the supreme law of India. It was adopted on 26th November, 1949 and came into effect on 26th January, 1950. The Constitution was drafted by Dr. B.R. Ambedkar, who was the Chairman of the Constituent Assembly. The Constitution has 395 articles and 108 amendments.\n\nThe Constitution provides for a federal system of government with a President, Prime Minister, and a Council of Ministers. The President is the head of state and the Prime Minister is the head of government. The President is elected by an electoral college and the Prime Minister is elected by the majority of the members of the Lok Sabha.\n\nThe Constitution guarantees fundamental rights to all citizens of India. These rights include the right to equality, the right to freedom, the right against exploitation, the right to freedom of religion, cultural and educational rights, and the right to constitutional remedies.\n\nThe Constitution also provides for the establishment of the Supreme Court, H

### Step 3: Let's now change the prompt to something more powerful

In [None]:
template = """Give a critical commentary and explain the background for this political context for a non-US resident:
{context}
"""
prompt = PromptTemplate(template=template, input_variables=["context"])  # the template only uses {context}
chain = prompt | llm_mistral

In [None]:
context = """
The Biden administration told the Supreme Court that “Texas has effectively prevented Border Patrol from monitoring the border” at Shelby Park. The state has defended the seizure, with Attorney General Ken Paxton saying he “will continue to defend Texas’s efforts to protect its southern border” against the federal government's attempts to undermine it.

At a ranch outside Eagle Pass where Abbott sympathizers gathered ahead of Saturday's rally, vendors sold Donald Trump-inspired MAGA hats and Trump flags. A homemade sign read, "The federal government has lost its way. Their job is to protect the states.”

Julio Vasquez, pastor of Iglesia Luterana San Lucas in Eagle Pass, said Abbott's campaign is a waste of money because migrants “come with empty hands asking for help.”

Alicia Garcia, a lifelong Eagle Pass resident who avoids Shelby Park but attended Friday's annual rodeo-themed festival at the nearby international bridge, questioned the value of Abbott's efforts because many asylum-seekers are released by U.S. authorities to argue their cases in immigration court.

“What’s with the show?” said Garcia, 38. "Better to just break everything down if they are still crossing.”
"""

chain.invoke({"context":context})

'The Biden administration has been under fire from Republicans for its immigration policies, which have led to a surge in the number of migrants crossing the border. The administration has also faced criticism for its treatment of asylum-seekers, who are often released from detention while their cases are processed.\n\nThe surge in migrants crossing the border has led to a humanitarian crisis, with many asylum-seekers living in makeshift camps under bridges and in the open. The administration has also faced criticism for its treatment of asylum-seekers, who are often released from detention while their cases are processed.\n\nThe Biden administration has also faced criticism for its treatment of asylum-seekers, who are often released from detention while their cases are processed. The administration has also faced criticism for its treatment of asylum-seekers, who are often released from detention'

### Step 4: Adding code execution capabilities

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install -U -q langchain-experimental
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_experimental.utilities import PythonREPL

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/173.7 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.7/173.7 kB[0m [31m1.9 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m173.7/173.7 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
def _sanitize_output(text: str) -> str:
    # Keep only the code inside the first fenced python block of the model's reply.
    _, after = text.split("```python", 1)
    return after.split("```")[0]

template = """Write python code to solve the user's problem: {problem}.

Return only python code in Markdown format, e.g.:

```python
....
```"""
prompt = ChatPromptTemplate.from_template(template)
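# Caution: PythonREPL executes whatever code the model returns, so only run this in a sandboxed environment.
chain = prompt | llm_mistral | StrOutputParser() | _sanitize_output | PythonREPL().run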

In [None]:
chain.invoke({"problem": "What is 2 plus 2"})



'4\n'
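
Tuple-unpacking the result of `split` in `_sanitize_output` raises a `ValueError` when the reply contains no python fence at all. A more forgiving extractor can be sketched with a regular expression. `extract_python` is a hypothetical replacement, not part of LangChain, and taking the last block is an assumption that guards against the prompt's own example fence being echoed back before the real answer.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way to keep this example readable
CODE_BLOCK = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_python(text: str) -> Optional[str]:
    """Return the last fenced python block in text, or None if there is none."""
    blocks = CODE_BLOCK.findall(text)
    return blocks[-1].strip() if blocks else None


# Made-up model reply with some chatter around the code.
reply = "Sure!\n" + FENCE + "python\nprint(2 + 2)\n" + FENCE + "\nHope that helps."
print(extract_python(reply))
print(extract_python("no code here"))
```

Returning `None` lets the caller decide whether to retry the prompt or report a failure instead of crashing mid-chain.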

## We can also work with non-textual data

## 11. Image Classification

In [None]:

vision_classifier = pipeline(task="image-classification",model="google/vit-base-patch16-224")

In [None]:
preds = vision_classifier(images="https://m.media-amazon.com/images/I/71N+DK0pEaL._AC_UF894,1000_QL80_.jpg")
preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds]
preds

## 12. Speech Recognition

In [None]:
transcriber = pipeline(model="openai/whisper-large-v2", chunk_length_s=30, return_timestamps=True)

In [None]:
transcriber("https://www.signalogic.com/melp/EngSamples/male600.wav")
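
With `return_timestamps=True` the transcription comes back with per-segment timestamps. The sketch below formats them into readable lines. It assumes the result is a dictionary with a `chunks` list whose items hold a `timestamp` pair and a `text` string, and that the final end time may be missing; `format_chunks` and the example data are hypothetical, not output from the audio file above.

```python
def format_chunks(result: dict) -> list:
    """Render each timestamped chunk as '[start - end] text'."""
    lines = []
    for chunk in result["chunks"]:
        start, end = chunk["timestamp"]
        # Assumption: the end of the final chunk can be None if no end time was predicted.
        end_text = f"{end:.1f}s" if end is not None else "end"
        lines.append(f"[{start:.1f}s - {end_text}] {chunk['text'].strip()}")
    return lines


# Hypothetical result in the assumed shape.
example = {
    "text": " Hello there. How are you?",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Hello there."},
        {"timestamp": (1.5, None), "text": " How are you?"},
    ],
}
for line in format_chunks(example):
    print(line)
```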

## 13. Video classification

In [None]:
!pip install -q -U decord
videoclassifer = pipeline(task = "video-classification", model="nateraw/videomae-base-finetuned-ucf101-subset")

In [None]:
videoclassifer("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/basketball.avi?download=true")
