<a href="https://colab.research.google.com/github/krugis/alitolgaeskici/blob/main/tweet_inference_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!pip install peft transformers datasets accelerate scikit-learn



In [8]:
!pip3 install emoji==0.6.0

Collecting emoji==0.6.0
  Downloading emoji-0.6.0.tar.gz (51 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/51.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.0/51.0 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... [?25l[?25hdone
  Created wheel for emoji: filename=emoji-0.6.0-py3-none-any.whl size=49719 sha256=211771cca9ff995912b965c9a05a5fd8faab155fd65ae707c4cc54d8c6e444ba
  Stored in directory: /root/.cache/pip/wheels/b7/23/31/f9b93f25b95da9b91729c4cd5f35a2b692ab06f688f6759630
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-0.6.0


In [11]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import Dataset, DatasetDict
import pandas as pd
import numpy as np
from peft import LoraConfig, get_peft_model, TaskType
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


# --- REPLACE THESE WITH YOUR ACTUAL VALUES ---
dataset_path = "/content/tweets_20250122074841.csv"
model_name = "vinai/bertweet-base"

# --- DATA LOADING AND PREPROCESSING ---
try:
    df = pd.read_csv(dataset_path)
    # Check if the 'text' column exists, if not, try other common names
    if 'text' not in df.columns:
        if 'Tweet Text' in df.columns:  # Check for 'Tweet Text'
            df.rename(columns={'Tweet Text': 'text'}, inplace=True)
        elif 'tweet' in df.columns:  # Check for 'tweet'
            df.rename(columns={'tweet': 'text'}, inplace=True)
        # ... Add more potential column names to check ...
        else:
            raise KeyError(
                "Could not find a column with tweet text. Please ensure your dataset has a column named 'text', 'Tweet Text', 'tweet', or similar."
            )

    # Ensure your dataframe has a column named "text"
    dataset = DatasetDict({
        "data": Dataset.from_pandas(df)  # Create a single dataset named "data"
    })

except FileNotFoundError:
    print(f"Error: Dataset file not found at {dataset_path}")
    raise  # Re-raise the exception to stop execution
except pd.errors.EmptyDataError:
    print(f"Error: Dataset file at {dataset_path} is empty.")
    raise  # Re-raise the exception to stop execution
except KeyError as e:
    print(
        f"Error: Missing column in your dataset. Ensure you have a 'text' column. Error: {e}"
    )
    raise  # Re-raise the exception to stop execution
except Exception as e:
    print(f"An unexpected error occurred while loading the dataset: {e}")
    raise  # Re-raise the exception to stop execution

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)  # You can keep num_labels, it won't be used for inference

# --- LoRA Configuration ---
config = LoraConfig(
    r=8,
    lora_alpha=32,
    # Target the specific Linear layers within RobertaSdpaSelfAttention
    target_modules=["query", "key", "value"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS,
)

# --- Apply LoRA to the model ---
model = get_peft_model(model, config)
model.print_trainable_parameters()

# --- Generate embeddings ---
def get_embeddings(examples):
    """Function to generate embeddings using the LoRA-adapted model."""
    inputs = tokenizer(
        examples["text"], padding=True, truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        embeddings = outputs.hidden_states[-1][:, 0, :].cpu(
        ).numpy()  # Get [CLS] token embeddings and move to CPU
    return {"embeddings": embeddings}


dataset_with_embeddings = dataset.map(get_embeddings, batched=True, batch_size=8)
embeddings = dataset_with_embeddings["data"]["embeddings"]

# --- Saving the embeddings to a file ---
output_file = "tweet_embeddings_lora.npy"
np.save(output_file, embeddings)
print(f"Embeddings saved to: {output_file}")

# --- K-Means Clustering and Evaluation ---
num_clusters = 5  # Choose the desired number of clusters
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)

# Evaluate clustering using silhouette score
silhouette_avg = silhouette_score(embeddings, cluster_labels)
print(f"Silhouette Score: {silhouette_avg}")

# Add cluster labels to the dataset
dataset_with_embeddings["data"] = dataset_with_embeddings["data"].add_column(
    "cluster", cluster_labels
)

for cluster_id in range(num_clusters):
    print(f"Cluster {cluster_id}:")
    # Filter the dataset to select rows belonging to the current cluster_id
    cluster_tweets = dataset_with_embeddings["data"].filter(lambda example: example['cluster'] == cluster_id)

    # Get the "text" column from the filtered dataset
    cluster_tweets_text = cluster_tweets["text"]

    for tweet in cluster_tweets_text[:5]:  # Print the first 5 tweets in the cluster
        print(f"- {tweet}")
    print("\n")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at vinai/bertweet-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,034,498 || all params: 135,936,004 || trainable%: 0.7610


Map:   0%|          | 0/15 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Embeddings saved to: tweet_embeddings_lora.npy
Silhouette Score: 0.05227946568154859
Cluster 0:


Filter:   0%|          | 0/15 [00:00<?, ? examples/s]

- smart to watch goat index closely. tools being built are the meat of this cycle. cookie was good but sol ai infra is where the juice is
- Finding great alpha using the tools being made available for free on . @GoatIndexAI.  . . The team are responsive to feedback and constantly building. I remember when cookie first dropped it was a god send but the fact goats focus is on the Solana ecosystem will help all AI projects.
- BREAKING: We're joining Dev Wars in 75 minutes, hosted by the legendary . @notthreadguy.  and . @frankdegods. !. . Get ready for exclusive alpha on the Goat Index roadmap. You don’t wanna miss this. 
- Glorious morning to be building AI infra. $SNAI
- Quick reminder 


Cluster 1:


Filter:   0%|          | 0/15 [00:00<?, ? examples/s]

- Solana AI Index Recap (Jan 21). . Total Projects: 539. Projects with tokens: 311. Total Market Cap: $7B (+12%). Trading Volume (24H): $733M (-52%). .  Top Projects by Mindshare (6H). - . @aixbt_agent.  (8.18% mindshare, $678M MC, +19.6%). - . @cleopetrafun.  (5.41% mindshare, no token yet). -
- Final update: . . This #crypto whale went from $20m to $390m all in one year. . So what coins did they buy and how did they do almost 20,000% plus in just a year . . His wallet + I’ll teach the easiest ways to find wallets like this - All in this 
- Following smart money in #crypto can turn you into a profitable trader.. . With a  portfolio worth $394 million in crypto holdings. . With $6M made in the last 24 hours.. . Here's what we can learn from this millionaire  ↓


Cluster 2:


Filter:   0%|          | 0/15 [00:00<?, ? examples/s]

- Introducing Solana AI Ecosystem Airdrop by . @GoatIndexAI. A movement to rally the Solana AI Ecosystem to grow 100x from here.. . Launching on . @MeteoraAG.  on Jan 16th.


Cluster 3:


Filter:   0%|          | 0/15 [00:00<?, ? examples/s]

- I strongly believe AI will create generational wealth in 2025!. . Here’s my curated list of game-changing AI coins across different categories for you. . Let’s dive in: . .  My Top Conviction AI Agents. . - $GRIFFAIN | . @griffaindotcom.  | Market Cap: $436M. . - 


Cluster 4:


Filter:   0%|          | 0/15 [00:00<?, ? examples/s]

- The biggest #crypto bull run is coming.. . But only the well-prepared will be able to go from $1,000 to $100,000. . Don't worry, I've got you covered.. . Here are 10 game-changing crypto resources that will help you move from 0 to Hero. . A THREAD  - Series 1


