# 10 dimensions of Social Exchange

The goal of this notebook is to get the 10 dimensions of social exchange embeddings (one score per dimension) for the iNaturalist text. 
On two GPUs this runs fast for small inputs (roughly 1-2 minutes for all dimensions at about 10,000 instances each); the full sentence-level run logged further down takes considerably longer. 
TODO: Get the data into the right format. 

In [1]:
import logging
import torch
from torch.utils.data import TensorDataset, DataLoader
import os
from os.path import join
import sys
from transformers import BertTokenizer, BertForSequenceClassification
import numpy as np
from tqdm import tqdm



In [2]:
# setting some variables:
# set cuda (not advisable to run on cpu)
device = torch.device("cuda")

BERT_MODEL = 'bert-base-cased' # BERT model type
CACHE_DIR = 'cache/' # where BERT will look for pre-trained models to load parameters from

num_labels = 2

OUTPUT_MODE = 'classification'
CONFIG_NAME = "config.json"
WEIGHTS_NAME = "pytorch_model.bin"

The batch size can be tuned. With the current (fairly small) configuration, a batch size of 1000 uses only about 30 percent of GPU memory.

### Prepare Data

In [3]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
idx_list = []
sentence_list = []
text_array = np.load("data/naturalist_title_text.npy", allow_pickle=True)
for idx, text in enumerate(text_array):
    sentences = sent_tokenize(text)
    idx_list.extend([idx] * len(sentences))
    sentence_list.extend(sentences)
    

[nltk_data] Downloading package punkt to /home/bkomander/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
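
The bookkeeping in the cell above is what later lets sentence-level scores be mapped back to their source text. A minimal sketch of the same pattern, using a naive regex splitter as a stand-in for `sent_tokenize` (the splitter and the toy documents are assumptions for illustration, not the notebook's data):

```python
import re

def split_sentences(text):
    """Naive stand-in for nltk's sent_tokenize: split after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Hypothetical toy documents, not taken from the iNaturalist data.
docs = ["I saw a heron. It was fishing!", "Any idea what this beetle is?"]

idx_list = []       # idx_list[i] is the document index of sentence_list[i]
sentence_list = []
for idx, text in enumerate(docs):
    sentences = split_sentences(text)
    idx_list.extend([idx] * len(sentences))
    sentence_list.extend(sentences)

print(sentence_list)
print(idx_list)
```

Both lists always have the same length, so `idx_list` can be attached as a column to any per-sentence result.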


In [4]:
# Prepare the data (this step will probably change).
tokenizer = BertTokenizer.from_pretrained(BERT_MODEL, do_lower_case=False)
test_array = np.load("data/naturalist_question.npy", allow_pickle=True)  # loaded here but not used below
encodings = tokenizer(sentence_list, padding=True, add_special_tokens=True, truncation=True, return_tensors='pt').to(device)
data_set = TensorDataset(encodings['input_ids'], encodings['attention_mask'])
test_loader = DataLoader(data_set,  batch_size=100, shuffle=False)
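
To make explicit what `padding=True` and the attention mask do in the cell above, here is a pure-Python sketch of right-padding. The helper `pad_batch` and the token ids are hypothetical; the real tokenizer does this internally.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Hypothetical token ids; a real tokenizer would produce these from text.
toy = [[101, 7, 8, 9, 102], [101, 7, 102]]
input_ids, attention_mask = pad_batch(toy)
print(input_ids)
print(attention_mask)
```

Because all sentences are tokenized in a single call, each one is padded to the longest sentence in the whole set (capped by truncation). That may contribute to the runtime of the inference loop, although this was not measured here.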

In [5]:
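#dimensions = ["conflict","fun", "identity"]
import warnings
# Note: transformers emits this message through its logger rather than the warnings module,
# so this filter does not suppress it (the message still shows up in the cell output).
warnings.filterwarnings("ignore", message=".*Some weights of BertForSequenceClassification were not initialized.*")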
dimensions = ["conflict","fun", "identity", "knowledge", "power", "respect", "romance", "similarity", "social_support", "trust"]
results = {}
for dim in dimensions:
    OUTPUT_DIR = 'weights/BERT/%s' %dim 
    output_model_file = os.path.join(OUTPUT_DIR, WEIGHTS_NAME)
    model = BertForSequenceClassification.from_pretrained(BERT_MODEL,cache_dir=CACHE_DIR, num_labels=num_labels)
    model.load_state_dict(torch.load(output_model_file))
    if torch.cuda.is_available():
        #print('CUDA devices:', torch.cuda.device_count())  # This should print 2 if both GPUs are available
        device = torch.device("cuda")
    
        # Data Parallelism
        if torch.cuda.device_count() > 1:
            #print("Let's use", torch.cuda.device_count(), "GPUs!")
            model = torch.nn.DataParallel(model)
        model.to(device) 
    # Run inference batch by batch and keep the softmax probability of the positive class (label 1).
    preds = []
    for step, batch in tqdm(enumerate(test_loader)):
        inputs, att_mask = batch
        with torch.no_grad():
            outputs = model(inputs, attention_mask=att_mask)
            probabilities = torch.softmax(outputs.logits, dim=1)
            positive_class_probs = probabilities[:, 1].detach().cpu().numpy()
            preds.append(positive_class_probs)
    results[dim] = np.concatenate(preds)
    del model
    torch.cuda.empty_cache()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
835it [17:49,  1.28s/it]
835it [26:04,  1.87s/it]
835it [27:09,  1.95s/it]
Some weights of BertForSequenceClassification were n
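
For reference, the score stored per sentence is the softmax probability of class 1 over the two logits. A small pure-Python sketch of that computation (the logits are made up; the notebook does the same with `torch.softmax(outputs.logits, dim=1)[:, 1]`):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def positive_class_prob(logits):
    """Probability of class 1 for a two-logit output."""
    return softmax(logits)[1]

# Hypothetical logits for three sentences, ordered as (class 0, class 1).
batch_logits = [[0.0, 0.0], [2.0, -1.0], [-1.0, 2.0]]
probs = [positive_class_prob(pair) for pair in batch_logits]
print(probs)
```

With only two classes this equals a sigmoid of the logit difference, so each score lies strictly between 0 and 1.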

This runs for a while, but that is okay. One new thing to consider: the scores are per sentence, so they need to be averaged across all sentences of each text.
The GPU is not being used to its fullest, but that is acceptable at this stage.
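
A rough back-of-the-envelope check of the runtime, based on the progress bars above (835 iterations per dimension at `batch_size=100`). The helper names are hypothetical, and the estimate assumes a constant time per batch, which the logs only approximately support:

```python
def num_batches(n_items, batch_size):
    """Number of DataLoader iterations when the last partial batch is kept."""
    return -(-n_items // batch_size)  # ceiling division

def estimated_minutes(n_batches, seconds_per_batch):
    """Total time in minutes, assuming every batch takes the same time."""
    return n_batches * seconds_per_batch / 60

# 835 iterations at batch size 100 means at most 83,500 sentences.
n_sentences_upper = 835 * 100
print(num_batches(n_sentences_upper, 100))
print(round(estimated_minutes(835, 1.28), 1))
# The same data at batch size 1000 would need far fewer iterations.
print(num_batches(n_sentences_upper, 1000))
```

Whether a larger batch size actually shortens the run depends on how time per batch scales, which would need to be measured.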

In [10]:
import pandas as pd
df = pd.DataFrame(results)
df.loc[:, "index"] = idx_list

In [12]:
df.to_csv("data/inat_10dims.csv")

In [20]:
# Average the sentence-level scores within each original text (grouped by its index).
reduced_df = df.groupby('index')[df.columns].mean()

In [25]:
reduced_df.drop(columns="index").to_csv("data/reduced_inat_10dims.csv", index=False)
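
What the group-by step computes can be sketched in plain Python: one mean score per source text, using the sentence-to-text index. The function name and the toy values are assumptions for illustration; the notebook itself uses pandas.

```python
from collections import defaultdict

def mean_by_document(idx_list, scores):
    """Average sentence-level scores per document index."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for idx, score in zip(idx_list, scores):
        sums[idx] += score
        counts[idx] += 1
    return {idx: sums[idx] / counts[idx] for idx in sorted(sums)}

# Hypothetical scores for one dimension: two sentences from text 0, one from text 1.
idx_list = [0, 0, 1]
trust_scores = [0.25, 0.75, 0.5]
doc_means = mean_by_document(idx_list, trust_scores)
print(doc_means)
```

One caveat worth checking: a text that yields no sentences (for example an empty string) never appears in `idx_list`, so it has no row in the reduced output. Since the reduced CSV is written without the index, row positions would then no longer line up with the original array.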