<h1>BERT Method for Comment Categorization</h1>

In [2]:
%pip install torch transformers pandas scikit-learn

In [3]:
import pandas as pd
import torch
from transformers import BertModel, BertTokenizer
from sklearn.metrics.pairwise import cosine_similarity

# Load tokenizer and model from Hugging Face Transformers
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

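# Encode text as its final-layer [CLS] token embedding.
# For a single string this returns a NumPy array of shape (1, 768).
def encode_text(text):
    # Inputs longer than 512 tokens (BERT's limit) are truncated
    encoded_input = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():  # inference only, so skip gradient tracking
        model_output = model(**encoded_input)
    # Position 0 of the sequence is the [CLS] token
    return model_output.last_hidden_state[:, 0, :].numpy()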

# Category descriptions
category_descriptions = {
    'infrastructure': "Issues related to physical and organizational structures needed for operation",
    'social_resistance': "Community opposition or resistance to policies or projects",
    'financial_constraints': "Financial issues that prevent progress",
    'technological_shortcomings': "Failures or limitations in technology",
    'regulatory_challenges': "Difficulties arising from regulations or laws",
    'not-defined': "No specific category defined"
}

# Encode category descriptions to create their embeddings
category_embeddings = {key: encode_text(value) for key, value in category_descriptions.items()}

# List of files to process
files = [
    'australia_energy_analyzed.csv',
    'india_energy_analyzed.csv',
    'nz_energy_analyzed.csv',
    'usa_energy_analyzed.csv',
    'france_energy_analyzed.csv',
]

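# Return the category whose description embedding has the highest
# cosine similarity to the comment embedding
def find_best_category(comment):
    # Missing values (NaN) and blank comments cannot be tokenized meaningfully;
    # assumption: label them 'not-defined' rather than raising an error
    if not isinstance(comment, str) or not comment.strip():
        return 'not-defined'
    comment_embedding = encode_text(comment)
    similarities = {key: cosine_similarity(comment_embedding, emb)[0][0] for key, emb in category_embeddings.items()}
    return max(similarities, key=similarities.get)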

# Categorize every comment in each file and write the result back
# to the same path (the input CSV is overwritten in place)
for file in files:
    df = pd.read_csv(file)
    df['reasons'] = df['Cleaned_Comment'].apply(find_best_category)
    df.to_csv(file, index=False)
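
The cell above scores one comment against one category description at a time. As a rough illustration of the same nearest-category rule, the sketch below computes all comment-to-category cosine similarities in a single matrix product. It is not part of the original notebook: the function names `cosine_similarity_matrix` and `assign_categories` are hypothetical, and the 3-dimensional toy vectors merely stand in for the 768-dimensional [CLS] embeddings that `encode_text` would produce.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

def assign_categories(comment_vectors, category_vectors, labels):
    """Pick, for each comment vector, the label of the most similar category vector."""
    sims = cosine_similarity_matrix(comment_vectors, category_vectors)
    return [labels[i] for i in sims.argmax(axis=1)]

# Toy vectors (assumption): placeholders for real BERT embeddings.
toy_labels = ['infrastructure', 'financial_constraints']
toy_category_vectors = np.array([[1.0, 0.0, 0.0],
                                 [0.0, 1.0, 0.0]])
toy_comment_vectors = np.array([[0.9, 0.1, 0.0],
                                [0.2, 0.8, 0.1]])

toy_assignments = assign_categories(toy_comment_vectors, toy_category_vectors, toy_labels)
print(toy_assignments)  # ['infrastructure', 'financial_constraints']
```

With real data, the comment vectors would be the stacked outputs of `encode_text` and the category vectors the stacked values of `category_embeddings`. Because the rule always returns the closest category, every comment receives some label even when all similarities are low.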
