<a href="https://colab.research.google.com/github/khanfawaz/LLM/blob/main/Text_classification_using_DistilGPT_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project builds a text classification app on top of a free, open-source, pre-trained decoder-only model from Hugging Face. We'll use DistilGPT-2 as the model and Gradio for the UI. The app takes input text and assigns it to one of a set of categories.

In [None]:
# Install necessary libraries
!pip install transformers gradio

# Installs the required libraries (transformers for Hugging Face models and gradio for creating the UI).

Collecting gradio
  Downloading gradio-4.10.0-py3-none-any.whl (16.6 MB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl (15 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.105.0-py3-none-any.whl (93 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.1.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client==0.7.3 (from gradio)
  Downloading gradio_client-0.7.3-py3-none-any.whl (304 kB)
Collecting httpx (from gradio)
  Downloading httpx-0.25.2-py3-none-any.whl (74 kB)

In [None]:
# Load libraries
import gradio as gr
from transformers import GPT2ForSequenceClassification, GPT2Tokenizer

# Loads the necessary libraries

In [None]:
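# Load pre-trained model and tokenizer
model_name = "distilgpt2"  # You can use other models from Hugging Face
# Note: distilgpt2 ships without a trained classification head, so predictions
# are not meaningful until the model is fine-tuned on labeled data.
model = GPT2ForSequenceClassification.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Loads the DistilGPT-2 model and tokenizer from Hugging Face.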

config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/353M [00:00<?, ?B/s]

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [None]:
# Define the text classification function
import torch

def classify_text(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt")
    # Run the model without tracking gradients (inference only)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    # Index of the highest-scoring class for the single input
    predicted_class = logits.argmax(dim=-1).item()

    # Return the predicted class index
    return f"Predicted Class: {predicted_class}"

# Defines a text classification function (classify_text) that tokenizes the input text, passes it through the model, and returns the predicted class label.
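
To see what the `argmax` step does without loading the model, here is a minimal sketch in plain Python. The logits and the label names (`negative`, `neutral`, `positive`) are invented for illustration; they are not produced by DistilGPT-2, whose classification head is untrained until you fine-tune it.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_label(logits, labels):
    """Return the highest-scoring label and the full probability table."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))

# Hypothetical values, for illustration only
example_logits = [0.2, -1.3, 2.1]
example_labels = ["negative", "neutral", "positive"]

label, table = pick_label(example_logits, example_labels)
print(label)
print({name: round(p, 3) for name, p in table.items()})
```

A dictionary of label to probability like `table` is also the shape that Gradio's `Label` output component accepts, should you want to show confidences instead of a bare class index.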

In [None]:
# Create the Gradio Interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(),
    outputs="text",
    live=True,
    #interpretation="default",
    title="Text Classification with GPT-2",
    description="Enter text for classification",
)

# Creates a Gradio interface with a textbox for user input and displays the predicted class label as output.

In [None]:
# Launch the Gradio app and get a public link
iface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://5e90ae87e616140976.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




Check Tokenization:
To check if the text is properly tokenized, you can print the tokens generated by the tokenizer. Here's an example assuming you're using the transformers library:

In [None]:
from transformers import AutoTokenizer

# Replace 'model_name' with the actual model name you are using
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example text
text = "I love my india"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print(tokens)

['I', 'Ġlove', 'Ġmy', 'Ġind', 'ia']
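
The `Ġ` at the start of some tokens in the output above is how GPT-2's byte-level BPE vocabulary represents a leading space; it is not a tokenization error. The following sketch hard-codes the token list printed above rather than calling the tokenizer, and shows how the tokens map back to the original text. The helper name `tokens_to_text` is hypothetical, and the simple replacement is only assumed to hold for plain ASCII input like this example.

```python
def tokens_to_text(tokens):
    """Rebuild text from GPT-2 style tokens by turning the 'Ġ' marker back into a space."""
    return "".join(tokens).replace("Ġ", " ")

# Token list copied from the tokenizer output above
tokens = ["I", "Ġlove", "Ġmy", "Ġind", "ia"]

print(tokens_to_text(tokens))

# Tokens without a leading 'Ġ' either start the text or continue the previous word
word_pieces = [t for t in tokens if not t.startswith("Ġ")]
print(word_pieces)
```

With the real tokenizer, `tokenizer.convert_tokens_to_string(tokens)` performs this conversion for you. Note that `india` is split into `Ġind` and `ia`, which is normal subword behavior.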


Confirm Class Labels:
Make sure the class labels you are using match the model's training data. If you're fine-tuning the model, check the training script or documentation for the expected class labels. For a pre-trained model, inspect the model's documentation or configuration file.

In [None]:
from transformers import AutoModelForSequenceClassification

# Replace "distilgpt2" with the actual model name you are using.
# distilgpt2 is a public checkpoint, so no access token is needed.
model = AutoModelForSequenceClassification.from_pretrained("distilgpt2")

# Print the model's class labels
print(model.config.id2label)


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at distilgpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{0: 'LABEL_0'}
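
The single `LABEL_0` entry printed above is a generic placeholder name rather than a trained category. If you fine-tune for your own categories, one common approach is to build the two label mappings yourself and pass them when loading the model. The sketch below builds them in plain Python; the category names are hypothetical, and the commented `from_pretrained` call is shown only to indicate where the mappings would go.

```python
# Hypothetical category names; replace with the classes in your own dataset
class_names = ["sports", "politics", "technology"]

# Index -> name and name -> index, the two mappings stored in a model config
id2label = {i: name for i, name in enumerate(class_names)}
label2id = {name: i for i, name in id2label.items()}

print(id2label)
print(label2id)

# These would then be passed when loading the model for fine-tuning, e.g.:
# model = AutoModelForSequenceClassification.from_pretrained(
#     "distilgpt2",
#     num_labels=len(class_names),
#     id2label=id2label,
#     label2id=label2id,
# )
```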


Ensure Correct Interface in Gradio:
If you're using Gradio to create an interface, ensure that you are passing the input text correctly to the model. Here's a simple example:

In [None]:
import gradio as gr
import torch  # needed for torch.argmax below

# Assume 'model' is your loaded model and 'tokenizer' is your loaded tokenizer
def predict(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors='pt')

    # Make predictions
    outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits).item()

    # Return the predicted class label
    return f'Predicted Class: {predicted_class}'

# Create a Gradio interface
iface = gr.Interface(fn=predict, inputs="text", outputs="text")

# Launch the Gradio app
iface.launch()


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://9764ecee53352e4c84.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
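
One input the interface can pass along is an empty or whitespace-only string, for example when the textbox is cleared. The tokenizer then has nothing to encode, so it can be worth checking for this before calling the model. The wrapper below is a sketch of that idea in plain Python; `with_empty_guard` and the stand-in `fake_predict` are hypothetical names, and in the notebook you would wrap the real `predict` function instead.

```python
def with_empty_guard(predict_fn, message="Please enter some text."):
    """Wrap a prediction function so blank input returns a message instead of reaching the model."""
    def guarded(text):
        if text is None or not text.strip():
            return message
        return predict_fn(text.strip())
    return guarded

# Stand-in for the real model call, used only to make this sketch runnable
def fake_predict(text):
    return f"Predicted Class: {len(text) % 2}"

safe_predict = with_empty_guard(fake_predict)

print(safe_predict(""))
print(safe_predict("   "))
print(safe_predict("I love my india"))
```

`safe_predict` can then be passed as `fn` to `gr.Interface` in place of the unguarded function.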


