# Document Classification



### 1. Environment Setup
This notebook uses `transformers` for the ParsBERT model, `datasets` for wrapping the data for training, `pandas` for data manipulation, `numpy` for numerical operations, `scikit-learn` for data splitting and evaluation metrics, `opendatasets` for downloading the dataset, and `torch` as the PyTorch backend. Install them with the following commands:

In [None]:
!pip install pandas numpy scikit-learn torch opendatasets datasets
!pip install transformers[torch] -U

Run this notebook in Google Colab with a T4 GPU runtime. To confirm that a GPU is attached, run:

In [None]:
!nvidia-smi
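
The same check can be made from Python, which is useful when the result should drive later code. The sketch below is a minimal example; the helper name `describe_device` is hypothetical, and the import guard is only there so the cell also runs before `torch` has been installed.

```python
import importlib.util


def describe_device() -> str:
    """Return a short description of the accelerator PyTorch can see."""
    # Guard the import so the check also runs where torch is not installed yet.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if torch.cuda.is_available():
        # Name of the first visible CUDA device, for example a T4 on Colab.
        return f"GPU: {torch.cuda.get_device_name(0)}"
    return "CPU only (no CUDA device visible)"


print(describe_device())
```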

### 2. Loading and Preparing the Data

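Here, we download the Tasnim dataset from Kaggle, load its CSV file, drop rows with a missing `body`, and keep a random 50% sample to shorten training. We then encode the category labels as integers and split the data into training and testing sets with a 90-10 split. We don't create a separate validation set for the base model, because it is evaluated with cross-validation.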

In [2]:
import opendatasets as od
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Download the dataset
od.download("https://www.kaggle.com/datasets/amirpourmand/tasnimdataset")
df = pd.read_csv("./tasnimdataset/tasnim.csv")

df = df.dropna(subset=['body'])
# Keep a random 50% sample to reduce training time
df = df.sample(frac=0.5, random_state=42)

label_encoder = LabelEncoder()
df['category_encoded'] = label_encoder.fit_transform(df['category'])

# Splitting the dataset into training and testing sets (90% train, 10% test)
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)

Skipping, found downloaded files in "./tasnimdataset" (use force=True to force download)
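
To make the label-encoding step concrete, here is a minimal pure-Python sketch of the mapping `LabelEncoder` builds: each distinct label gets an integer according to its position in sorted order. The function name `fit_label_mapping` and the category names are hypothetical and used only for illustration.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


# Hypothetical category names, used only for illustration.
categories = ["sport", "politics", "economy", "sport"]
mapping = fit_label_mapping(categories)
encoded = [mapping[c] for c in categories]

# Invert the mapping to recover the original names from the integers.
inverse = {index: label for label, index in mapping.items()}
decoded = [inverse[i] for i in encoded]

print(mapping)
print(encoded)
```

In the notebook itself, `label_encoder.classes_` holds the sorted category names and `label_encoder.inverse_transform` converts predicted integers back to names.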


### 3. Implementing the Base Model
For the base model, we'll use a TF-IDF vectorizer and a logistic regression classifier. We'll perform cross-validation to evaluate this base model.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Creating a pipeline with TF-IDF vectorization and Logistic Regression
pipeline_base = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Executing 5-fold cross-validation and reporting accuracy
scores = cross_val_score(pipeline_base, train_df['body'], train_df['category'], cv=5, scoring='accuracy')

print(f"Accuracy scores for each fold are: {scores}")
print(f"Mean Accuracy: {scores.mean()}, Standard Deviation in Accuracy: {scores.std()}")


Accuracy scores for each fold are: [0.86485527 0.8740675  0.86039076 0.86376554 0.8625222 ]
Mean Accuracy: 0.8651202538093964, Standard Deviation in Accuracy: 0.004713311898895908
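
For intuition about the features the baseline sees, the sketch below computes TF-IDF weights by hand. It assumes the `TfidfVectorizer` defaults (smoothed IDF and L2 normalization) and simplifies tokenization to lowercasing and whitespace splitting, so it is not a drop-in replacement. The function name `tfidf` and the two-document corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, multiplied by the raw count.
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        # Scale each document vector to unit length (skip empty documents).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Hypothetical two-document corpus for illustration.
vectors = tfidf(["economy market", "economy sport"])
print(vectors[0])
```

A term that appears in every document ("economy") receives a lower weight than a term specific to one document ("market"), which is what lets the logistic regression focus on category-specific vocabulary.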


### 4. Implementing the Main Model with ParsBERT
For the main model, we fine-tune ParsBERT. We first load the tokenizer and tokenize the article bodies, truncating or padding each one to 128 tokens. We then load the model with a classification head and train it for three epochs.


In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

# Function to tokenize the text
def tokenize_function(examples):
    return tokenizer(examples["body"], padding="max_length", truncation=True, max_length=128)

# Create Hugging Face Dataset objects from pandas dataframes
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

# Tokenizing the datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Rename the encoded category column to 'labels' (the name Trainer expects)
# and return the model inputs as PyTorch tensors
train_dataset = train_dataset.rename_column("category_encoded", "labels")
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset = test_dataset.rename_column("category_encoded", "labels")
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])


Map:   0%|          | 0/28151 [00:00<?, ? examples/s]

Map:   0%|          | 0/3128 [00:00<?, ? examples/s]
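
The effect of `padding="max_length"` and `truncation=True` can be sketched in a few lines of plain Python. This is a simplification: the real tokenizer also inserts special tokens such as `[CLS]` and `[SEP]` and keeps them when truncating. The helper name `pad_or_truncate`, the token IDs, and the padding ID of 0 are assumptions made for illustration.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]          # truncation
    padding = max_length - len(kept)       # number of pad positions needed
    input_ids = kept + [pad_id] * padding
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask


# Hypothetical token IDs, with a small max_length so the result is easy to read.
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every sequence ends up with the same length, any text beyond the first 128 tokens of an article is not seen by the model.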

In [5]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained("HooshvareLab/bert-base-parsbert-uncased", num_labels=len(df['category'].unique()))

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=64,
    evaluation_strategy="epoch",  # named `eval_strategy` in recent transformers releases
    fp16=True,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Train
trainer.train()

# Evaluate
results = trainer.evaluate()
print(results)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at HooshvareLab/bert-base-parsbert-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 0.117352        |
| 2     | 0.176600      | 0.101039        |
| 3     | 0.068600      | 0.100654        |


Checkpoint destination directory ./results/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
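
The "No log" entry for epoch 1 and the checkpoint names follow from simple step arithmetic. The sketch below assumes a single GPU, no gradient accumulation, and an interval of 500 steps for both logging and checkpointing (an assumption about the `TrainingArguments` defaults, which matches the `checkpoint-500` and `checkpoint-1000` directories).

```python
import math

train_examples = 28151   # from the training-set progress bar
batch_size = 64          # per_device_train_batch_size
epochs = 3               # num_train_epochs
interval = 500           # assumed logging/saving interval in steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
# Steps at which a loss is logged and a checkpoint is written.
checkpoint_steps = list(range(interval, total_steps + 1, interval))

print(steps_per_epoch, total_steps, checkpoint_steps)
```

Under these assumptions the first epoch ends at step 440, before the first training loss is logged at step 500, so no training loss is available when the first evaluation runs.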


{'eval_loss': 0.10065397620201111, 'eval_runtime': 8.3014, 'eval_samples_per_second': 376.802, 'eval_steps_per_second': 47.1, 'epoch': 3.0}
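
The evaluation output reports only the loss, so it cannot be compared directly with the baseline's accuracy. One way to get an accuracy figure is to give the `Trainer` a metrics function. The sketch below is a minimal example, not part of the original run; the logits and labels are hypothetical.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy from the (logits, labels) pair passed by Trainer."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Hypothetical logits for three examples and two classes.
logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = np.array([1, 0, 0])
print(compute_metrics((logits, labels)))
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should add an `eval_accuracy` entry to the dictionary returned by `trainer.evaluate()`.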
