<a href="https://colab.research.google.com/github/sudarshan-koirala/youtube-stuffs/blob/main/huggingface_crash_course.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face Crash Course

## Install necessary libraries

In [None]:
%%capture capturet 
#https://ipython.readthedocs.io/en/stable/interactive/magics.html#cellmagic-capture
!pip install transformers

In [None]:
#capturet.show()

## Pipelines
- https://huggingface.co/docs/transformers/main_classes/pipelines

In [None]:
from transformers import pipeline

In [None]:
classifier = pipeline('sentiment-analysis')  # select the sentiment-analysis task
result = classifier("I enjoy sunny weather.")
print(result)

- Looks easy, right? 😄 Behind the scenes, `pipeline` does the heavy lifting for us.
- First it preprocesses the text by applying a tokenizer.
- Then it feeds the preprocessed text to the model, in our case `distilbert-base-uncased-finetuned-sst-2-english`.
- Finally it post-processes the model output into the form we want. For sentiment analysis, that is a label (negative or positive) with a score.
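
To make the post-processing step concrete, here is a minimal sketch in plain Python. It assumes the model returned two raw scores (logits) for one sentence. The logit values, the `id2label` mapping, and the helper names `softmax` and `postprocess` are made up for illustration; they are not the actual output or internals of `distilbert-base-uncased-finetuned-sst-2-english`.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label):
    """Pick the most likely label and report its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits and label mapping, chosen only for illustration.
toy_logits = [-2.0, 3.0]
toy_id2label = {0: "NEGATIVE", 1: "POSITIVE"}

toy_result = postprocess(toy_logits, toy_id2label)
print(toy_result)
```

The dictionary has the same `label` / `score` shape as the entries in the list the pipeline prints above.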

## Tokenizers / Models
- Grab any model: https://huggingface.co/models

In [None]:
# Let's use the Auto classes, so the rest of the code keeps working when you change the model name.
# This matters because each model has its own tokenizer.
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
#from transformers import BertForSequenceClassification

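# Let's use a different model from the model hub.
# Note: bert-base-cased is not fine-tuned for sentiment, so its classification head is
# randomly initialized and the labels printed below are not meaningful.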
model_name = "bert-base-cased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer) 
result = classifier("I enjoy sunny weather.")
print(result)

### So what is a tokenizer?
- A tokenizer is in charge of preparing the inputs for a model.
- https://huggingface.co/docs/transformers/main_classes/tokenizer

In [None]:
# simple example
text = "I enjoy sunny weather."
tokenized_format = tokenizer(text)
print(tokenized_format)

- The 'input_ids' list contains a sequence of integers representing the tokenized text. Each integer corresponds to a specific token in the vocabulary of the language model. In this case, the four words map to the token IDs [146, 5548, 21162, 4250]; the printed list is longer because the tokenizer also adds IDs for the final period and for the special `[CLS]` and `[SEP]` tokens at the start and end.

- The 'token_type_ids' list indicates which tokens belong to the first sentence or the second sentence (in case of sentence pair tasks). Since there is only one sentence in this example, all the values in the list are zero.

- The 'attention_mask' list is used to indicate which tokens in the 'input_ids' list should be attended to by the model. In this case, all tokens have a value of 1, indicating that they should all be attended to.
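
These three fields are easiest to see when two sentences of different length are batched together. The sketch below uses plain Python with a tiny made-up vocabulary (the IDs are not BERT's real ones) and a hypothetical `toy_encode` helper. It only mimics the shape of the tokenizer output, with padding added so that the attention mask contains zeros.

```python
# Made-up vocabulary for illustration; real models ship their own.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "i": 3, "enjoy": 4,
             "sunny": 5, "weather": 6, "rain": 7}

def toy_encode(sentences, vocab=TOY_VOCAB):
    """Encode whitespace-split sentences and pad them to the same length."""
    rows = [
        [vocab["[CLS]"]] + [vocab[w] for w in s.lower().split()] + [vocab["[SEP]"]]
        for s in sentences
    ]
    width = max(len(r) for r in rows)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for r in rows:
        pad = width - len(r)
        batch["input_ids"].append(r + [vocab["[PAD]"]] * pad)
        batch["token_type_ids"].append([0] * width)  # single sentence, so all zeros
        batch["attention_mask"].append([1] * len(r) + [0] * pad)  # 0 marks padding
    return batch

toy_batch = toy_encode(["I enjoy sunny weather", "I enjoy rain"])
for name, values in toy_batch.items():
    print(name, values)
```

The shorter sentence ends with a padding ID in `input_ids` and a matching `0` in `attention_mask`, which is how the model knows to ignore that position.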

In [None]:
# Let's dig a little deeper
tokens = tokenizer.tokenize(text)
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
back_to_token = tokenizer.convert_ids_to_tokens(ids)
print(back_to_token)
text_back = tokenizer.decode(ids)
print(text_back)
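
`tokenizer.tokenize` may split a word that is not in the vocabulary into smaller pieces. BERT-style tokenizers do this with a greedy longest-match-first search (WordPiece) and mark continuation pieces with `##`. The following is a simplified sketch of that idea with a made-up vocabulary. `toy_wordpiece` is a hypothetical helper, not part of `transformers`, and real tokenizers also handle casing, punctuation, and length limits.

```python
def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Made-up vocabulary for illustration only.
toy_vocab = {"sun", "##ny", "weather", "enjoy", "##s"}

print(toy_wordpiece("sunny", toy_vocab))
print(toy_wordpiece("enjoys", toy_vocab))
print(toy_wordpiece("cloud", toy_vocab))
```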

## You can use different ML / DL libraries (PyTorch, TensorFlow, JAX), but let's see how to interact with models through LangChain.
- Fine-tuning a pretrained model with PyTorch: https://huggingface.co/docs/transformers/training
- Using the Hugging Face ecosystem with LangChain: https://python.langchain.com/en/latest/ecosystem/huggingface.html

## Hugging Face Hub with LangChain [link](https://python.langchain.com/en/latest/modules/models/llms/integrations/huggingface_hub.html)
- There are two Hugging Face LLM wrappers, one for a local pipeline and one for a model hosted on Hugging Face Hub. Note that these wrappers only work for models that support the following tasks: `text2text-generation`, `text-generation`.

In [None]:
%%capture
!pip install huggingface_hub langchain transformers

In [None]:
# get tokens --> https://huggingface.co/settings/tokens
# https://docs.python.org/3.8/library/getpass.html
from getpass import getpass
HUGGINGFACEHUB_API_TOKEN = getpass()

In [None]:
import os
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

### Hugging Face Hub

In [None]:
# load the model

from langchain import HuggingFaceHub

repo_id = "google/flan-t5-xl"

llm = HuggingFaceHub(repo_id=repo_id, model_kwargs={"temperature":0, "max_length":64})

In [None]:
from langchain import PromptTemplate, LLMChain

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "Who won the FIFA World Cup in the year 1994? "

print(llm_chain.run(question))
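
Before the chain calls the model, the `{question}` placeholder in the template is filled in with the question. Plain `str.format` is enough to preview that text. This sketch does not call LangChain or the Hub, it assumes the template is filled f-string style with a single input variable, and `rendered_prompt` is just an illustrative variable name.

```python
template = """Question: {question}

Answer: Let's think step by step."""

question = "Who won the FIFA World Cup in the year 1994? "

# Substitute the placeholder to preview the text that would be sent to the model.
rendered_prompt = template.format(question=question)
print(rendered_prompt)
```

Previewing the rendered prompt this way is a cheap check that the template variables are spelled correctly before spending an API call.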

### Hugging Face Local Pipelines [link](https://python.langchain.com/en/latest/modules/models/llms/integrations/huggingface_pipelines.html#integrate-the-model-in-an-llmchain)

In [None]:
%%capture
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

In [None]:
# load the model

from langchain import HuggingFacePipeline

#model_id = "bigscience/bloom-1b7" # https://huggingface.co/bigscience/bloom-1b7
model_id = "facebook/opt-350m"
llm = HuggingFacePipeline.from_model_id(model_id=model_id, task="text-generation", model_kwargs={"temperature":0.1, "max_length":64})

In [None]:
from langchain import PromptTemplate, LLMChain

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "What is electroencephalography?"

print(llm_chain.run(question))