# **`Introduction to HuggingFace`**

# **`Author`** : **`Muhammad Adil Naeem`**

```
!pip install transformers datasets
```



### **Let's Check if a GPU Is Available**

In [4]:
!nvidia-smi

Thu Oct 10 11:07:36 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
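
The pipelines later in this notebook warn that a GPU is available but no `device` argument was passed, so the models run on CPU. A minimal sketch of picking the device from Python, assuming PyTorch is the installed backend; `pick_device` is a hypothetical helper, not part of any library:

```python
def pick_device() -> int:
    """Return 0 for the first CUDA GPU, or -1 for CPU."""
    try:
        import torch  # assumed backend; not imported elsewhere in this notebook
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("device index:", device)
```

The value can then be passed along as `pipeline(..., device=device)`, where `-1` selects the CPU and `0` the first GPU.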
                                                                    

## **What You Can Do With Hugging Face**



# --------------------------------------------------- #
#                     **NLP TASKS**                     #
# --------------------------------------------------- #

1. **Text Classification**: Assigning a category to a piece of text.
   - Sentiment Analysis
   - Topic Classification
   - Spam Detection

   ```python
   classifier = pipeline("text-classification")
   ```

2. **Token Classification**: Assigning labels to individual tokens in a sequence.
   - Named Entity Recognition (NER)
   - Part-of-Speech Tagging

   ```python
   token_classifier = pipeline("token-classification")
   ```

3. **Question Answering**: Extracting an answer from a given context based on a question.

   ```python
   question_answerer = pipeline("question-answering")
   ```

4. **Text Generation**: Generating text based on a given prompt.
   - Language Modeling
   - Story Generation

   ```python
   text_generator = pipeline("text-generation")
   ```

5. **Summarization**: Condensing long documents into shorter summaries.

   ```python
   summarizer = pipeline("summarization")
   ```

6. **Translation**: Translating text from one language to another.

   ```python
   translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
   ```

7. **Text2Text Generation**: General-purpose text transformation, including summarization and translation.

   ```python
   text2text_generator = pipeline("text2text-generation")
   ```

8. **Fill-Mask**: Predicting the masked token in a sequence.

   ```python
   fill_mask = pipeline("fill-mask")
   ```

9. **Feature Extraction**: Extracting hidden states or features from text.

   ```python
   feature_extractor = pipeline("feature-extraction")
   ```

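10. **Sentence Similarity**: Measuring the similarity between two sentences. `sentence-similarity` is a Hub task tag rather than a `transformers` pipeline task, so `pipeline("sentence-similarity")` raises an error; the `sentence-transformers` library is the usual route.

    ```python
    # Requires: pip install sentence-transformers
    from sentence_transformers import SentenceTransformer

    sentence_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    ```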
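
Sentence similarity is usually scored as the cosine similarity between two sentence embeddings. A minimal sketch of that scoring step in plain Python; the three-dimensional vectors are made up for illustration, and real models produce embeddings with hundreds of dimensions:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, not produced by any model
emb_cat = [0.9, 0.1, 0.0]
emb_kitten = [0.8, 0.2, 0.1]
emb_invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(emb_cat, emb_kitten))   # close to 1: similar direction
print(cosine_similarity(emb_cat, emb_invoice))  # close to 0: nearly orthogonal
```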

# --------------------------------------------------- #
#             **Computer Vision TASKS**                 #
# --------------------------------------------------- #

1. **Image Classification**: Classifying the main content of an image.

   ```python
   image_classifier = pipeline("image-classification")
   ```

2. **Object Detection**: Identifying objects within an image and their bounding boxes.

   ```python
   object_detector = pipeline("object-detection")
   ```

3. **Image Segmentation**: Segmenting different parts of an image into classes.

   ```python
   image_segmenter = pipeline("image-segmentation")
   ```

4. **Image Generation**: Generating images from textual descriptions, typically with diffusion models such as Stable Diffusion through the separate `diffusers` library (DALL-E is an OpenAI model).

# --------------------------------------------------- #
#             **Speech Processing TASKS**               #
# --------------------------------------------------- #

1. **Automatic Speech Recognition (ASR)**: Converting spoken language into text.

   ```python
   speech_recognizer = pipeline("automatic-speech-recognition")
   ```

2. **Speech Translation**: Translating spoken language from one language to another.
3. **Audio Classification**: Classifying audio signals into predefined categories.

# --------------------------------------------------- #
#                   **Multimodal TASKS**                #
# --------------------------------------------------- #

1. **Image Captioning**: Generating a textual description of an image.

   ```python
   image_captioner = pipeline("image-to-text")
   ```

2. **Visual Question Answering (VQA)**: Answering questions about the content of an image.

# --------------------------------------------------- #
#                     **Other TASKS**                   #
# --------------------------------------------------- #

1. **Table Question Answering**: Answering questions based on tabular data.

   ```python
   table_qa = pipeline("table-question-answering")
   ```

2. **Document Question Answering**: Extracting answers from documents like PDFs.

   ```python
   doc_qa = pipeline("document-question-answering")
   ```

3. **Time Series Forecasting**: Predicting future values in time series data. There is no `pipeline` task for it, but Transformers includes dedicated model classes such as `TimeSeriesTransformerForPrediction`.

---

### **Import Libraries**

In [35]:
from datasets import load_dataset
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import TrainingArguments, Trainer

## **NLP Task**

### **1. Sentiment Analysis**

##### **Method 1:**

In [2]:
classifier = pipeline("sentiment-analysis")
result = classifier("I was not happy with last Mission Imposible Movie")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'NEGATIVE', 'score': 0.9996819496154785}]
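
For a single-label classifier like this one, the `score` is the probability assigned to the predicted label, obtained by applying a softmax to the model's raw logits. A small sketch of that last step; the logit values are invented for illustration and are not taken from the model above:

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]
logits = [4.2, -3.8]  # hypothetical values

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": labels[best], "score": probs[best]})
```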


##### **Method 2:**

In [3]:
pipeline(task="sentiment-analysis")("I was confused with Barbie Movie")


[{'label': 'NEGATIVE', 'score': 0.9990049004554749}]

##### **Handle Multiline Text Without Specifying a Model**

In [4]:
pipeline(task = "sentiment-analysis")\
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think?")


[{'label': 'POSITIVE', 'score': 0.9964345693588257}]


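##### **Handle Multiline Text by Specifying a Model**

`facebook/bart-large-mnli` is a natural language inference checkpoint, so the labels it returns (`contradiction`, `neutral`, `entailment`) are NLI classes rather than sentiment classes; the `neutral` in the output below is not a neutral-sentiment prediction.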

In [5]:
pipeline(task = "sentiment-analysis", model="facebook/bart-large-mnli")\
                                      ("Everyday lots of LLMs papers are published about LLMs Evlauation. \
                                      Lots of them Looks very Promising. \
                                      I am not sure if we CAN actually Evaluate LLMs. \
                                      There is still lots to do.\
                                      Don't you think?")


[{'label': 'neutral', 'score': 0.76933354139328}]

##### **Batch Sentiment Analysis**

In [6]:
classifier = pipeline(task = "sentiment-analysis")

task_list = ["I really like Autoencoders, best models for Anomaly Detection",
             "I am not sure if we CAN actually Evaluate LLMs.",
             "PassiveAgressive is the name of a Linear Regression Model that so many people do not know.",
             "I hate long Meetings."]
classifier(task_list)


[{'label': 'POSITIVE', 'score': 0.9978686571121216},
 {'label': 'NEGATIVE', 'score': 0.9995476603507996},
 {'label': 'NEGATIVE', 'score': 0.9983084201812744},
 {'label': 'NEGATIVE', 'score': 0.9969879984855652}]

##### **Capturing Emotions from Text**

In [8]:
classifier = pipeline(task = "sentiment-analysis", model = "SamLowe/roberta-base-go_emotions")

task_list = ["I really like Autoencoders, best models for Anomaly Detection",
             "I am not sure if we CAN actually Evaluate LLMs.",
             "PassiveAgressive is the name of a Linear Regression Model that so many people do not know. It is pretty funny name for a Regression Model.",
             "I hate long Meetings."]
classifier(task_list)


[{'label': 'admiration', 'score': 0.7406534552574158},
 {'label': 'confusion', 'score': 0.9066852331161499},
 {'label': 'amusement', 'score': 0.9083252549171448},
 {'label': 'anger', 'score': 0.7870617508888245}]

### **2. Text Generation Task**

In [9]:
text_generator = pipeline("text-generation", model="distilbert/distilgpt2")
generated_text = text_generator("Today is a rainy day in London",
                                truncation=True,
                                num_return_sequences = 2)
print("Generated_text:\n ", generated_text[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated_text:
  Today is a rainy day in London and we’ll call it a night, and I am getting pretty sure I am going to be awake in the morning and looking at my iPad. No I'm not going to do anything that other people do


### **3. Question Answering Task**

In [10]:
qa_model = pipeline("question-answering")
question = "What is my job?"
context = "I am developing AI models with Python."
qa_model(question = question, context = context)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



{'score': 0.7823824286460876,
 'start': 5,
 'end': 25,
 'answer': 'developing AI models'}
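
The `start` and `end` fields are character offsets into `context`, with `end` exclusive, so slicing the context with them recovers the answer. A quick check using the values from the output above:

```python
context = "I am developing AI models with Python."

# Values copied from the pipeline output above
result = {"start": 5, "end": 25, "answer": "developing AI models"}

# Slice the context with the reported character offsets
span = context[result["start"]:result["end"]]
print(span)
print(span == result["answer"])
```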

### **4. Tokenization Task**

In [15]:
# Import necessary classes from the transformers library
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Specify the model name for sentiment analysis
model_name2 = "nlptown/bert-base-multilingual-uncased-sentiment"

# Load the pre-trained model for sequence classification
mymodel2 = AutoModelForSequenceClassification.from_pretrained(model_name2)

# Load the tokenizer associated with the specified model
mytokenizer2 = AutoTokenizer.from_pretrained(model_name2)

# Create a sentiment analysis pipeline using the model and tokenizer
classifier = pipeline("sentiment-analysis", model=mymodel2, tokenizer=mytokenizer2)

# Analyze the sentiment of the given text and store the result
res = classifier("I was so not happy with the Barbie Movie")

# Print the sentiment analysis result
print(res)


[{'label': '2 stars', 'score': 0.5099301934242249}]


#### **Let's See How the Tokenizer Works**

In [16]:
from transformers import AutoTokenizer

# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example text
text = "I was so not happy with the Barbie Movie"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

Tokens: ['i', 'was', 'so', 'not', 'happy', 'with', 'the', 'barbie', 'movie']


In [17]:
# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)

Input IDs: [1045, 2001, 2061, 2025, 3407, 2007, 1996, 22635, 3185]


In [18]:
# Encode the text (tokenization + converting to input IDs)
encoded_input = tokenizer(text)
print("Encoded Input:", encoded_input)

Encoded Input: {'input_ids': [101, 1045, 2001, 2061, 2025, 3407, 2007, 1996, 22635, 3185, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
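
The encoded input has two more IDs than the list from `convert_tokens_to_ids`: calling the tokenizer directly wraps the sequence in the special tokens `[CLS]` (ID 101) and `[SEP]` (ID 102). A small check using the IDs printed above:

```python
# IDs copied from the two outputs above
input_ids = [1045, 2001, 2061, 2025, 3407, 2007, 1996, 22635, 3185]
encoded_ids = [101, 1045, 2001, 2061, 2025, 3407, 2007, 1996, 22635, 3185, 102]

CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the bert-base-uncased vocabulary

# Wrapping the plain IDs in the special tokens reproduces the encoded input
with_special = [CLS_ID] + input_ids + [SEP_ID]
print(with_special == encoded_ids)
```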


In [20]:
# Decode the text
decoded_output = tokenizer.decode(input_ids)
print("Decode Output: ", decoded_output)

Decode Output:  i was so not happy with the barbie movie


###### **Let's Perform All Steps in One Cell**

In [21]:
from transformers import AutoTokenizer

# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Example text
text = "I was so not happy with the Barbie Movie"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)

# Encode the text (tokenization + converting to input IDs)
encoded_input = tokenizer(text)
print("Encoded Input:", encoded_input)

# Decode the text
decoded_output = tokenizer.decode(input_ids)
print("Decode Output: ", decoded_output)

Tokens: ['I', 'was', 'so', 'not', 'happy', 'with', 'the', 'Barbie', 'Movie']
Input IDs: [146, 1108, 1177, 1136, 2816, 1114, 1103, 25374, 8275]
Encoded Input: {'input_ids': [101, 146, 1108, 1177, 1136, 2816, 1114, 1103, 25374, 8275, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Decode Output:  I was so not happy with the Barbie Movie


### **5. Fine-Tuning a Model on the IMDB Dataset**

##### **1. Import Libraries**

In [27]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

##### **2. Load IMDB Dataset**

In [24]:
dataset  = load_dataset("imdb")

In [25]:
# metadata
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

##### **3. Pre-Process the Dataset**

In [26]:
# Load the tokenizer for the BERT model with uncased text
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Define a function to tokenize the dataset
def tokenize_function(examples):
    # Tokenize the 'text' field from the examples
    # Apply padding to the maximum length and truncate if necessary
    return tokenizer(examples['text'], padding="max_length", truncation=True)

# Apply the tokenize_function to the dataset
# The 'batched=True' option processes the examples in batches for efficiency
tokenized_datasets = dataset.map(tokenize_function, batched=True)



In [28]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})
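
`padding="max_length"` and `truncation=True` bring every review to the same length, and `attention_mask` marks real tokens with 1 and padding with 0. A plain-Python sketch of the idea, not the tokenizer's actual implementation; `pad_or_truncate` is a hypothetical helper, and a `max_length` of 8 is used for readability (the `bert-base-uncased` tokenizer pads to 512):

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # bert-base-uncased special-token IDs


def pad_or_truncate(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, then right-pad with [PAD]."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


short = pad_or_truncate([1045, 2001, 3407], max_length=8)          # gets padded
long = pad_or_truncate(list(range(1000, 1020)), max_length=8)      # gets truncated
print(short)
print(long)
```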

In [31]:
tokenized_datasets['train'][0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

##### **4. Set Up the Training Arguments**

- Specify the hyperparameters and training settings.

In [33]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    eval_strategy="epoch",           # Evaluate every epoch
    learning_rate=2e-5,              # Learning rate
    per_device_train_batch_size=16,  # Batch size for training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    num_train_epochs=1,              # Number of training epochs
    weight_decay=0.01,               # Strength of weight decay
)
training_args

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=epoch,
eval_use_gather_object=False,
evaluation_strategy=None,
fp1
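
These settings determine how many optimizer steps one epoch takes. A back-of-the-envelope calculation, assuming a single GPU and the default of no gradient accumulation (neither is set explicitly above):

```python
import math

num_train_examples = 25_000        # rows in the IMDB train split
per_device_train_batch_size = 16   # from TrainingArguments above
num_devices = 1                    # _n_gpu=1 in the printed arguments
gradient_accumulation_steps = 1    # assumed: library default, not set above
num_train_epochs = 1

effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)  # last batch is partial
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```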

#### **5. Initialize the Model**

- Load the pre-trained model and define the training procedure.

In [36]:
from transformers import AutoModelForSequenceClassification, Trainer

# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
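
The `Trainer` above is created without a `compute_metrics` function, so evaluation reports the loss but no accuracy. A minimal sketch of an accuracy metric in plain Python; the function name follows the `Trainer` parameter, while the toy logits and labels are invented for the check:

```python
def compute_metrics(eval_pred):
    """Return accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # The index of the largest logit in each row is the predicted class ID
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy check with hypothetical logits for four reviews
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.8]]
toy_labels = [0, 1, 0, 0]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

It could be passed as `Trainer(..., compute_metrics=compute_metrics)` so that each evaluation also reports accuracy.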


#### **6. Train the Model**

- Fine-tune the pre-trained model on your specific dataset.

In [None]:
# Train the model
trainer.train()


#### **7. Evaluate the Model**

- Assess the model's performance on the evaluation set (here, the IMDB test split passed as `eval_dataset`).

In [None]:
# Evaluate the model
results = trainer.evaluate()
print(results)

#### **8. Save the Fine-Tuned Model**

- Save the fine-tuned model for later use.

In [None]:
# Save the model
model.save_pretrained('./fine-tuned-model')
tokenizer.save_pretrained('./fine-tuned-tokenizer')

## **6. arXiv Project**

```python
!pip install arxiv
```

##### **1. Import Libraries**

In [8]:
import arxiv
import pandas as pd
from transformers import pipeline

##### **2. Query to fetch AI-related papers**

In [3]:
# Define a search query for arXiv
query = 'ai OR artificial intelligence OR machine learning'

# Create a search object for arXiv with the specified query
# Limit results to a maximum of 10 and sort by submission date
search = arxiv.Search(query=query, max_results=10, sort_by=arxiv.SortCriterion.SubmittedDate)

# Initialize an empty list to store paper details
papers = []

# Fetch papers through a Client; Search.results() is deprecated in arxiv 2.x
client = arxiv.Client()
for result in client.results(search):
    # Append the relevant details of each paper to the papers list
    papers.append({
        'published': result.published,  # Publication date
        'title': result.title,          # Title of the paper
        'abstract': result.summary,     # Abstract of the paper
        'categories': result.categories  # Categories the paper belongs to
    })

# Convert the list of papers into a pandas DataFrame for easier manipulation and display
df = pd.DataFrame(papers)

# Set pandas option to display the full content of the columns without truncation
pd.set_option('display.max_colwidth', None)

# Display the first 10 entries of the DataFrame
df.head(10)


Unnamed: 0,published,title,abstract,categories
0,2024-10-09 17:59:59+00:00,MM-Ego: Towards Building Egocentric Multimodal LLMs,"This research aims to comprehensively explore building a multimodal\nfoundation model for egocentric video understanding. To achieve this goal, we\nwork on three fronts. First, as there is a lack of QA data for egocentric video\nunderstanding, we develop a data engine that efficiently generates 7M\nhigh-quality QA samples for egocentric videos ranging from 30 seconds to one\nhour long, based on human-annotated data. This is currently the largest\negocentric QA dataset. Second, we contribute a challenging egocentric QA\nbenchmark with 629 videos and 7,026 questions to evaluate the models' ability\nin recognizing and memorizing visual details across videos of varying lengths.\nWe introduce a new de-biasing evaluation method to help mitigate the\nunavoidable language bias present in the models being evaluated. Third, we\npropose a specialized multimodal architecture featuring a novel ""Memory Pointer\nPrompting"" mechanism. This design includes a global glimpse step to gain an\noverarching understanding of the entire video and identify key visual\ninformation, followed by a fallback step that utilizes the key visual\ninformation to generate responses. This enables the model to more effectively\ncomprehend extended video content. With the data, benchmark, and model, we\nsuccessfully build MM-Ego, an egocentric multimodal LLM that shows powerful\nperformance on egocentric video understanding.","[cs.CV, cs.AI, cs.LG]"
1,2024-10-09 17:59:58+00:00,Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models,"Retrieval-Augmented Generation (RAG), while effective in integrating external\nknowledge to address the limitations of large language models (LLMs), can be\nundermined by imperfect retrieval, which may introduce irrelevant, misleading,\nor even malicious information. Despite its importance, previous studies have\nrarely explored the behavior of RAG through joint analysis on how errors from\nimperfect retrieval attribute and propagate, and how potential conflicts arise\nbetween the LLMs' internal knowledge and external sources. We find that\nimperfect retrieval augmentation might be inevitable and quite harmful, through\ncontrolled analysis under realistic conditions. We identify the knowledge\nconflicts between LLM-internal and external knowledge from retrieval as a\nbottleneck to overcome in the post-retrieval stage of RAG. To render LLMs\nresilient to imperfect retrieval, we propose Astute RAG, a novel RAG approach\nthat adaptively elicits essential information from LLMs' internal knowledge,\niteratively consolidates internal and external knowledge with source-awareness,\nand finalizes the answer according to information reliability. Our experiments\nusing Gemini and Claude demonstrate that Astute RAG significantly outperforms\nprevious robustness-enhanced RAG methods. Notably, Astute RAG is the only\napproach that matches or exceeds the performance of LLMs without RAG under\nworst-case scenarios. Further analysis reveals that Astute RAG effectively\nresolves knowledge conflicts, improving the reliability and trustworthiness of\nRAG systems.","[cs.CL, cs.AI, cs.LG]"
2,2024-10-09 17:59:45+00:00,Neural Circuit Architectural Priors for Quadruped Locomotion,"Learning-based approaches to quadruped locomotion commonly adopt generic\npolicy architectures like fully connected MLPs. As such architectures contain\nfew inductive biases, it is common in practice to incorporate priors in the\nform of rewards, training curricula, imitation data, or trajectory generators.\nIn nature, animals are born with priors in the form of their nervous system's\narchitecture, which has been shaped by evolution to confer innate ability and\nefficient learning. For instance, a horse can walk within hours of birth and\ncan quickly improve with practice. Such architectural priors can also be useful\nin ANN architectures for AI. In this work, we explore the advantages of a\nbiologically inspired ANN architecture for quadruped locomotion based on neural\ncircuits in the limbs and spinal cord of mammals. Our architecture achieves\ngood initial performance and comparable final performance to MLPs, while using\nless data and orders of magnitude fewer parameters. Our architecture also\nexhibits better generalization to task variations, even admitting deployment on\na physical robot without standard sim-to-real methods. This work shows that\nneural circuits can provide valuable architectural priors for locomotion and\nencourages future work in other sensorimotor skills.","[q-bio.NC, cs.AI, cs.LG, cs.NE, cs.RO]"
3,2024-10-09 17:59:33+00:00,Do better language models have crisper vision?,"How well do text-only Large Language Models (LLMs) grasp the visual world? As\nLLMs are increasingly used in computer vision, addressing this question becomes\nboth fundamental and pertinent. However, existing studies have primarily\nfocused on limited scenarios, such as their ability to generate visual content\nor cluster multimodal data. To this end, we propose the Visual Text\nRepresentation Benchmark (ViTeRB) to isolate key properties that make language\nmodels well-aligned with the visual world. With this, we identify large-scale\ndecoder-based LLMs as ideal candidates for representing text in vision-centric\ncontexts, counter to the current practice of utilizing text encoders. Building\non these findings, we propose ShareLock, an ultra-lightweight CLIP-like model.\nBy leveraging precomputable frozen features from strong vision and language\nmodels, ShareLock achieves an impressive 51% accuracy on ImageNet despite\nutilizing just 563k image-caption pairs. Moreover, training requires only 1 GPU\nhour (or 10 hours including the precomputation of features) - orders of\nmagnitude less than prior methods. Code will be released.","[cs.CL, cs.AI, cs.CV]"
4,2024-10-09 17:59:14+00:00,Glider: Global and Local Instruction-Driven Expert Router,"The availability of performant pre-trained models has led to a proliferation\nof fine-tuned expert models that are specialized to particular domains. This\nhas enabled the creation of powerful and adaptive routing-based ""Model\nMoErging"" methods with the goal of using expert modules to create an aggregate\nsystem with improved performance or generalization. However, existing MoErging\nmethods often prioritize generalization to unseen tasks at the expense of\nperformance on held-in tasks, which limits its practical applicability in\nreal-world deployment scenarios. We observe that current token-level routing\nmechanisms neglect the global semantic context of the input task. This\ntoken-wise independence hinders effective expert selection for held-in tasks,\nas routing decisions fail to incorporate the semantic properties of the task.\nTo address this, we propose, Global and Local Instruction Driven Expert Router\n(GLIDER) that integrates a multi-scale routing mechanism, encompassing a\nsemantic global router and a learned local router. The global router leverages\nLLM's advanced reasoning capabilities for semantic-related contexts to enhance\nexpert selection. Given the input query and LLM, the router generates semantic\ntask instructions that guide the retrieval of the most relevant experts across\nall layers. This global guidance is complemented by a local router that\nfacilitates token-level routing decisions within each module, enabling finer\ncontrol and enhanced performance on unseen tasks. Our experiments using\nT5-based models for T0 and FLAN tasks demonstrate that GLIDER achieves\nsubstantially improved held-in performance while maintaining strong\ngeneralization on held-out tasks. We also perform ablations experiments to dive\ndeeper into the components of GLIDER. Our experiments highlight the importance\nof our multi-scale routing that leverages LLM-driven semantic reasoning for\nMoErging methods.",[cs.LG]
5,2024-10-09 17:59:13+00:00,IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation,"Advanced diffusion models like RPG, Stable Diffusion 3 and FLUX have made\nnotable strides in compositional text-to-image generation. However, these\nmethods typically exhibit distinct strengths for compositional generation, with\nsome excelling in handling attribute binding and others in spatial\nrelationships. This disparity highlights the need for an approach that can\nleverage the complementary strengths of various models to comprehensively\nimprove the composition capability. To this end, we introduce IterComp, a novel\nframework that aggregates composition-aware model preferences from multiple\nmodels and employs an iterative feedback learning approach to enhance\ncompositional generation. Specifically, we curate a gallery of six powerful\nopen-source diffusion models and evaluate their three key compositional\nmetrics: attribute binding, spatial relationships, and non-spatial\nrelationships. Based on these metrics, we develop a composition-aware model\npreference dataset comprising numerous image-rank pairs to train\ncomposition-aware reward models. Then, we propose an iterative feedback\nlearning method to enhance compositionality in a closed-loop manner, enabling\nthe progressive self-refinement of both the base diffusion model and reward\nmodels over multiple iterations. Theoretical proof demonstrates the\neffectiveness and extensive experiments show our significant superiority over\nprevious SOTA methods (e.g., Omost and FLUX), particularly in multi-category\nobject composition and complex semantic alignment. IterComp opens new research\navenues in reward feedback learning for diffusion models and compositional\ngeneration. Code: https://github.com/YangLing0818/IterComp",[cs.CV]
6,2024-10-09 17:59:06+00:00,One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation,"Foundation models (FMs) are pre-trained on large-scale datasets and then\nfine-tuned on a downstream task for a specific application. The most successful\nand most commonly used fine-tuning method is to update the pre-trained weights\nvia a low-rank adaptation (LoRA). LoRA introduces new weight matrices that are\nusually initialized at random with a uniform rank distribution across model\nweights. Recent works focus on weight-driven initialization or learning of\nadaptive ranks during training. Both approaches have only been investigated in\nisolation, resulting in slow convergence or a uniform rank distribution, in\nturn leading to sub-optimal performance. We propose to enhance LoRA by\ninitializing the new weights in a data-driven manner by computing singular\nvalue decomposition on minibatches of activation vectors. Then, we initialize\nthe LoRA matrices with the obtained right-singular vectors and re-distribute\nranks among all weight matrices to explain the maximal amount of variance and\ncontinue the standard LoRA fine-tuning procedure. This results in our new\nmethod Explained Variance Adaptation (EVA). We apply EVA to a variety of\nfine-tuning tasks ranging from language generation and understanding to image\nclassification and reinforcement learning. EVA exhibits faster convergence than\ncompetitors and attains the highest average score across a multitude of tasks\nper domain.","[cs.LG, cs.AI, cs.CL, stat.ML]"
7,2024-10-09 17:59:04+00:00,Sylber: Syllabic Embedding Representation of Speech from Raw Audio,"Syllables are compositional units of spoken language that play a crucial role\nin human speech perception and production. However, current neural speech\nrepresentations lack structure, resulting in dense token sequences that are\ncostly to process. To bridge this gap, we propose a new model, Sylber, that\nproduces speech representations with clean and robust syllabic structure.\nSpecifically, we propose a self-supervised model that regresses features on\nsyllabic segments distilled from a teacher model which is an exponential moving\naverage of the model in training. This results in a highly structured\nrepresentation of speech features, offering three key benefits: 1) a fast,\nlinear-time syllable segmentation algorithm, 2) efficient syllabic tokenization\nwith an average of 4.27 tokens per second, and 3) syllabic units better suited\nfor lexical and syntactic understanding. We also train token-to-speech\ngenerative models with our syllabic units and show that fully intelligible\nspeech can be reconstructed from these tokens. Lastly, we observe that\ncategorical perception, a linguistic phenomenon of speech perception, emerges\nnaturally in our model, making the embedding space more categorical and sparse\nthan previous self-supervised learning approaches. Together, we present a novel\nself-supervised approach for representing speech as syllables, with significant\npotential for efficient speech tokenization and spoken language modeling.","[cs.CL, cs.SD, eess.AS]"
8,2024-10-09 17:59:00+00:00,Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making,"We aim to evaluate Large Language Models (LLMs) for embodied decision making.\nWhile a significant body of work has been leveraging LLMs for decision making\nin embodied environments, we still lack a systematic understanding of their\nperformance because they are usually applied in different domains, for\ndifferent purposes, and built based on different inputs and outputs.\nFurthermore, existing evaluations tend to rely solely on a final success rate,\nmaking it difficult to pinpoint what ability is missing in LLMs and where the\nproblem lies, which in turn blocks embodied agents from leveraging LLMs\neffectively and selectively. To address these limitations, we propose a\ngeneralized interface (Embodied Agent Interface) that supports the\nformalization of various types of tasks and input-output specifications of\nLLM-based modules. Specifically, it allows us to unify 1) a broad set of\nembodied decision-making tasks involving both state and temporally extended\ngoals, 2) four commonly-used LLM-based modules for decision making: goal\ninterpretation, subgoal decomposition, action sequencing, and transition\nmodeling, and 3) a collection of fine-grained metrics which break down\nevaluation into various types of errors, such as hallucination errors,\naffordance errors, various types of planning errors, etc. Overall, our\nbenchmark offers a comprehensive assessment of LLMs' performance for different\nsubtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI\nsystems, and providing insights for effective and selective use of LLMs in\nembodied decision making.","[cs.CL, cs.AI, cs.LG, cs.RO]"
9,2024-10-09 17:58:12+00:00,Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning,"In this work, we address the problem of large language model (LLM)\nunlearning, aiming to remove unwanted data influences and associated model\ncapabilities (e.g., copyrighted data or harmful content generation) while\npreserving essential model utilities, without the need for retraining from\nscratch. Despite the growing need for LLM unlearning, a principled optimization\nframework remains lacking. To this end, we revisit the state-of-the-art\napproach, negative preference optimization (NPO), and identify the issue of\nreference model bias, which could undermine NPO's effectiveness, particularly\nwhen unlearning forget data of varying difficulty. Given that, we propose a\nsimple yet effective unlearning optimization framework, called SimNPO, showing\nthat 'simplicity' in removing the reliance on a reference model (through the\nlens of simple preference optimization) benefits unlearning. We also provide\ndeeper insights into SimNPO's advantages, supported by analysis using mixtures\nof Markov chains. Furthermore, we present extensive experiments validating\nSimNPO's superiority over existing unlearning baselines in benchmarks like TOFU\nand MUSE, and robustness against relearning attacks. Codes are available at\nhttps://github.com/OPTML-Group/Unlearn-Simple.","[cs.CL, cs.AI, cs.LG]"


#### **3. Example abstract from API**

In [6]:
# Extract the abstract of the first paper from the DataFrame
abstract = df['abstract'][0]

# Display the extracted abstract
abstract

'This research aims to comprehensively explore building a multimodal\nfoundation model for egocentric video understanding. To achieve this goal, we\nwork on three fronts. First, as there is a lack of QA data for egocentric video\nunderstanding, we develop a data engine that efficiently generates 7M\nhigh-quality QA samples for egocentric videos ranging from 30 seconds to one\nhour long, based on human-annotated data. This is currently the largest\negocentric QA dataset. Second, we contribute a challenging egocentric QA\nbenchmark with 629 videos and 7,026 questions to evaluate the models\' ability\nin recognizing and memorizing visual details across videos of varying lengths.\nWe introduce a new de-biasing evaluation method to help mitigate the\nunavoidable language bias present in the models being evaluated. Third, we\npropose a specialized multimodal architecture featuring a novel "Memory Pointer\nPrompting" mechanism. This design includes a global glimpse step to gain an\noverarchin

#### **4. Summarize the Abstract**

In [9]:
# Create a summarization pipeline using the specified pre-trained model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
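The warning above means the pipeline was built without a `device` argument, so the model runs on the CPU even though a GPU is present. A minimal sketch of one way to address it follows. The helper names `pick_device` and `build_summarizer` are hypothetical and not part of the original notebook; the sketch assumes PyTorch is the backend and relies on `pipeline` accepting an integer `device` (`0` for the first GPU, `-1` for CPU).

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) when PyTorch can see one, otherwise -1 (CPU)."""
    try:
        import torch  # assumption: PyTorch is the installed backend
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


def build_summarizer(model_name: str = "facebook/bart-large-cnn"):
    """Create the summarization pipeline on the device chosen by pick_device()."""
    # Imported lazily so that merely defining this helper does not load transformers.
    from transformers import pipeline

    return pipeline("summarization", model=model_name, device=pick_device())


# Show which device would be used in the current environment.
print(pick_device())
```

Calling `summarizer = build_summarizer()` would then stand in for the original `pipeline(...)` call, placing the model on the GPU when one is visible.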


#### **5. Generate and Display the Summary**

In [10]:
# Use the summarization pipeline to generate a summary of the extracted abstract
summarization_result = summarizer(abstract)

In [11]:
# Access the summarized text from the summarization result
summary_text = summarization_result[0]['summary_text']

# Display the summary
summary_text

'There is a lack of QA data for egocentric videounderstanding. We develop a data engine that efficiently generates 7M high-quality QA samples. We introduce a new de-biasing evaluation method to help mitigate theavoidable language bias. We propose a specialized multimodal architecture featuring a novel "Memory PointerPrompting" mechanism.'
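The glued words in the summary above ("videounderstanding", "PointerPrompting") fall exactly where the abstract contains literal newline characters, so one plausible cause is that the line breaks are dropped rather than turned into spaces. That is an assumption, not something the notebook verifies. A minimal sketch of normalizing whitespace before summarizing, using a hypothetical helper `normalize_whitespace` and a short sample string taken from the abstract:

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    return " ".join(text.split())


# Fragment of the abstract shown earlier, including its hard line break.
sample = "egocentric video\nunderstanding, we develop a data engine"

clean = normalize_whitespace(sample)
print(clean)
```

Passing `normalize_whitespace(abstract)` to the summarizer instead of `abstract` may avoid the joined words, although the result has not been checked against this model.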

-----