# **Intent Classification**
  **The primary application of intent detection is in chatbots and virtual assistants, where the system needs to comprehend user queries to provide accurate information or perform specific tasks.**

* **Install the transformers library**

In [None]:
!pip install transformers



* **Import the necessary libraries**

In [None]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline

* **Refer to the documentation at any point: [model card](https://huggingface.co/qanastek/XLMRoberta-Alexa-Intents-Classification)**

**Model Name**

In [None]:
# Name of the pretrained model to load from the Hugging Face Hub
model_name = 'qanastek/XLMRoberta-Alexa-Intents-Classification'

**Fetch the tokenizer**

In [None]:
# The input is text, so it has to be tokenized before the model can use it.
# Each model has its own tokenizer, so load the one that matches model_name.
tokenizer = AutoTokenizer.from_pretrained(model_name)

**Call the model**

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)

**Combine the model and the tokenizer in a pipeline**

In [None]:
classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
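
The warning above says the pipeline stays on the CPU unless a `device` argument is passed. A minimal sketch of choosing that value is shown here. `pick_device` is a hypothetical helper (not part of transformers), and it assumes PyTorch is the backend; the integer convention (`-1` for CPU, `0` for the first CUDA GPU) is the one the pipeline classes accept.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) when one is visible, otherwise -1 (CPU)."""
    try:
        import torch  # assumption: PyTorch is the installed backend
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(device)

# Hypothetical usage with the model and tokenizer loaded above:
# classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=device)
```

On a Colab runtime with a GPU attached this would typically place the model on the GPU; on a CPU-only runtime it falls back to `-1`.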


In [None]:
res = classifier("What is the weather today?")

In [None]:
res[0]["label"]

'weather_query'

**Working with Gradio**

In [None]:
! pip install gradio

In [None]:
import gradio as gr

In [None]:
def intent_classifier(text):
  res = classifier(text)
  return res[0]["label"]
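
The handler above returns the top label for any input, including blank text or a low-confidence guess. A sketch of a more defensive variant follows. The names `make_intent_handler` and `stub_classifier`, the `min_score` threshold of 0.5, and the fallback strings are all hypothetical; the stub stands in for the real pipeline so the snippet runs without downloading the model, and its score is made up for illustration.

```python
from typing import Callable, Dict, List

Prediction = Dict[str, object]


def make_intent_handler(
    classify: Callable[[str], List[Prediction]],
    min_score: float = 0.5,  # hypothetical threshold; tune it on your own data
) -> Callable[[str], str]:
    """Wrap a text-classification callable so it tolerates blank or uncertain input."""

    def handler(text: str) -> str:
        # Blank input: skip the model call entirely.
        if not text or not text.strip():
            return "Please enter some text."
        top = classify(text)[0]  # pipelines return a list of {"label", "score"} dicts
        if float(top["score"]) < min_score:
            return "unknown"
        return str(top["label"])

    return handler


def stub_classifier(text: str) -> List[Prediction]:
    # Stand-in for the real pipeline; the score is illustrative only.
    return [{"label": "weather_query", "score": 0.99}]


safe_intent_classifier = make_intent_handler(stub_classifier)
print(safe_intent_classifier("What is the weather today?"))
print(safe_intent_classifier("   "))
```

In the notebook, `make_intent_handler(classifier)` could be passed as `fn` to `gr.Interface` in place of `intent_classifier`.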

In [None]:
iface = gr.Interface(
    fn = intent_classifier,
    inputs = gr.Textbox(lines = 2, placeholder = "Enter the text..."),
    outputs = "text",
    title = "Intent Classification using HuggingFace"
)

In [None]:
iface.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://22a0f653e48c06adbb.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




# **NER - Named Entity Recognition**

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Tokenizer, for tokenizing the words
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
# The model
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Create a pipeline, Model + Tokenizer
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
ner_results = nlp(example)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
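
Each entry in this output carries `start` and `end` character offsets into the input string, and the `B-` prefix on the tag marks the beginning of an entity. A small sketch of turning the raw list into readable lines is shown here, using the result printed above as literal data so it runs without the model. `format_entities` is a hypothetical helper, not part of transformers.

```python
example = "My name is Wolfgang and I live in Berlin"
ner_results = [
    {"entity": "B-PER", "score": 0.9990139, "index": 4, "word": "Wolfgang", "start": 11, "end": 19},
    {"entity": "B-LOC", "score": 0.999645, "index": 9, "word": "Berlin", "start": 34, "end": 40},
]


def format_entities(text, entities):
    """Return one 'span -> TYPE (score)' line per detected entity."""
    lines = []
    for ent in entities:
        span = text[ent["start"]:ent["end"]]     # slice the original text by offsets
        tag = ent["entity"].split("-", 1)[-1]    # drop the B-/I- prefix, keep PER/LOC/...
        lines.append(f"{span} -> {tag} ({ent['score']:.3f})")
    return "\n".join(lines)


print(format_entities(example, ner_results))
```

A string like this tends to display more cleanly in a Gradio `"text"` output than the raw list of dictionaries.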


In [None]:
# Output function: run NER on the input text
def ner_classifier(text):
  ner_results = nlp(text)
  return ner_results

# Gradio interface
iface = gr.Interface(
    fn = ner_classifier,
    inputs = gr.Textbox(lines = 2, placeholder = "Enter the text..."),
    outputs = "text",
    title = "NER using HuggingFace"
)

# To launch the interface
iface.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://8a2aa59337293799a4.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# **Topic Modelling**

In [None]:
!pip install bertopic

In [None]:
from bertopic import BERTopic

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Datasets/tokyo_2020_tweets.csv")


In [None]:
df.head()

Unnamed: 0,id,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet
0,1418888645105356803,Abhishek Srivastav,"Udupi, India",Trying to be mediocre in many things,2021-02-01 06:33:51,45.0,39.0,293.0,False,2021-07-24 10:59:49,Let the party begin\n#Tokyo2020,['Tokyo2020'],Twitter for Android,0.0,0.0,False
1,1418888377680678918,Saikhom Mirabai Channu🇮🇳,"Manipur, India",Indian weightlifter 48 kg category. Champion🏆,2018-04-07 10:10:22,5235.0,5.0,2969.0,False,2021-07-24 10:58:45,Congratulations #Tokyo2020 https://t.co/8OFKMs...,['Tokyo2020'],Twitter for Android,0.0,0.0,False
2,1418888260886073345,Big Breaking,Global,All breaking news related to Financial Market....,2021-05-29 08:51:25,3646.0,3.0,5.0,False,2021-07-24 10:58:17,Big Breaking Now \n\nTokyo Olympic Update \n\n...,,Twitter for Android,0.0,1.0,False
3,1418888172864299008,International Hockey Federation,Lausanne,Official International Hockey Federation Twitt...,2010-10-20 10:45:59,103975.0,2724.0,36554.0,True,2021-07-24 10:57:56,Q4: 🇬🇧3-1🇿🇦\n\nGreat Britain finally find a wa...,,Twitter Web App,1.0,0.0,False
4,1418886894478270464,Cameron Hart,Australia,Football & Tennis Coach,2020-10-31 08:46:17,6.0,37.0,31.0,False,2021-07-24 10:52:51,All I can think of every time I watch the ring...,"['Tokyo2020', 'ArtisticGymnastics', '7Olympics...",Twitter for iPhone,0.0,0.0,False


In [None]:
docs = df[0:10000].text.to_list()
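
The tweets contain URLs, @-mentions and embedded newlines. One optional step, not part of the original notebook, is to strip that noise before fitting the topic model. The sketch below is a hypothetical `clean_tweet` helper built on simple regular expressions; whether it improves the resulting topics is something to check empirically.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # t.co links and other URLs
MENTION_RE = re.compile(r"@\w+")       # @username mentions
WS_RE = re.compile(r"\s+")             # runs of spaces and newlines


def clean_tweet(text: str) -> str:
    """Remove URLs and mentions, then collapse whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return WS_RE.sub(" ", text).strip()


sample = [
    "Let the party begin\n#Tokyo2020",
    "Congratulations #Tokyo2020 https://t.co/8OFKMs9ukq",
]
cleaned = [clean_tweet(t) for t in sample]
print(cleaned)
```

In the notebook this could be applied as `docs = [clean_tweet(t) for t in docs]`, assuming every entry in `docs` is a string.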

In [None]:
docs

['Let the party begin\n#Tokyo2020',
 'Congratulations #Tokyo2020 https://t.co/8OFKMs9ukq',
 "Big Breaking Now \n\nTokyo Olympic Update \n\nJapan won his first Gold 🥇 Takato Naohisa won Gold in men's 60 kg Judo, C… https://t.co/tRcfDd7clY",
 'Q4: 🇬🇧3-1🇿🇦\n\nGreat Britain finally find a way way Pieterse, with Jack Waller finding the net via the stick of a Sou… https://t.co/kdeNYg9THk',
 'All I can think of every time I watch the rings event #Tokyo2020 #ArtisticGymnastics #7Olympics #OlympicGames… https://t.co/cJaxEFnyzD',
 '#Tokyo2020 #Olympics\n#MirabaiChanu\n#Weightlifting\n\nWomen Empowerment\nREAL                         Vs… https://t.co/XLsJb2RH76',
 "Can't help but cheer for them. Banda 6 goals in 2 games. \nZambia goal difference 7-14 😄\nWell done on getting that p… https://t.co/2G8UDgMClT",
 "@inquirerdotnet @ftjochoaINQ Caloy Yulo's 14.000, however, was only good for sixth in the rings preliminaries.… https://t.co/bXPm1E0RbF",
 "Q3 🇨🇦 1-4 🇩🇪\n\nGreen card for Canada's captain Sc

In [None]:
model = BERTopic(verbose = True)

In [None]:
model.fit(docs)

2024-11-26 03:17:05,589 - BERTopic - Embedding - Transforming documents to embeddings.


2024-11-26 03:17:25,697 - BERTopic - Embedding - Completed ✓
2024-11-26 03:17:25,699 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-11-26 03:18:39,805 - BERTopic - Dimensionality - Completed ✓
2024-11-26 03:18:39,810 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-11-26 03:18:40,483 - BERTopic - Cluster - Completed ✓
2024-11-26 03:18:40,502 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-11-26 03:18:41,416 - BERTopic - Representation - Completed ✓


<bertopic._bertopic.BERTopic at 0x7f5d468d4cd0>

In [None]:
topics, probs = model.transform(docs)

2024-11-26 03:18:50,528 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2024-11-26 03:18:50,558 - BERTopic - Dimensionality - Completed ✓
2024-11-26 03:18:50,559 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2024-11-26 03:18:50,943 - BERTopic - Cluster - Completed ✓


In [None]:
model.get_topic_freq().head(10)

| | Topic | Count |
|---|---|---|
| 8 | -1 | 2533 |
| 4 | 0 | 527 |
| 11 | 1 | 459 |
| 6 | 2 | 402 |
| 13 | 3 | 373 |
| 143 | 4 | 287 |
| 0 | 5 | 216 |
| 30 | 6 | 197 |
| 10 | 7 | 176 |
| 31 | 8 | 129 |
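
In BERTopic, topic `-1` collects the documents treated as outliers, that is, documents not assigned to any topic. A quick sketch of how large that group is, using the counts shown above as literal data. It assumes all 10,000 documents sliced from the dataframe were passed to `fit`.

```python
# Counts copied from the get_topic_freq() output above (first ten rows only).
topic_counts = {-1: 2533, 0: 527, 1: 459, 2: 402, 3: 373,
                4: 287, 5: 216, 6: 197, 7: 176, 8: 129}
n_docs = 10_000  # assumption: the dataframe had at least 10,000 rows

outlier_share = topic_counts[-1] / n_docs
in_top_topics = sum(count for topic, count in topic_counts.items() if topic != -1)

print(f"Outliers: {outlier_share:.1%} of documents")
print(f"Documents in the nine largest topics: {in_top_topics}")
```

Under that assumption, roughly a quarter of the tweets end up unassigned, which is worth keeping in mind when reading the individual topics.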


In [None]:
model.get_topic(2)

[('banda', 0.06129184580208688),
 ('zambia', 0.05416879229627973),
 ('barbra', 0.04098433603231592),
 ('china', 0.035586492351768716),
 ('barbara', 0.02733202706859233),
 ('hattricks', 0.020951114186643836),
 ('hattrick', 0.019192926106796716),
 ('goals', 0.018907847992350027),
 ('44', 0.01822021273347142),
 ('two', 0.016693939702296767)]
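
`get_topic` returns `(word, score)` pairs ordered by their c-TF-IDF score, so the first few words work as a rough label for the topic. A minimal sketch, using the first five pairs printed above as literal data; `topic_label` is a hypothetical helper, not part of BERTopic.

```python
# First five (word, score) pairs copied from the get_topic(2) output above.
topic_2 = [
    ("banda", 0.06129184580208688),
    ("zambia", 0.05416879229627973),
    ("barbra", 0.04098433603231592),
    ("china", 0.035586492351768716),
    ("barbara", 0.02733202706859233),
]


def topic_label(word_scores, n=3):
    """Join the n highest-scoring words into a short, human-readable label."""
    ranked = sorted(word_scores, key=lambda pair: pair[1], reverse=True)
    return ", ".join(word for word, _ in ranked[:n])


print(topic_label(topic_2))
```

In the notebook the same helper could be called as `topic_label(model.get_topic(2))`.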

In [None]:
model.visualize_topics()

In [None]:
model.visualize_barchart()