# DeepKLM: A Library for Language Experiment using a Deep Language Model

_This is a light version. Please use the full version to see the examples and use visualization features._

*Last Update: October 26, 2022*
<!--what's new
- Load KR-BERT directly from HuggingFace Hub
- Cleaned up the unnecessary requirements for the light version
- setup.sh was only a single line after clean up 
    - thus the separated file was removed
    - the last remaining line is now in the .ipynb directly
- Now uses AutoModel and AutoTokenizerFast for versatility
- Model can be set with language name only for default models
- Will try loading the file with EUC-KR if UTF-8 fails
- Made an option to convert the output as xlsx
- Made an option to grab the required files by git-cloning
-->

## Setting up

To set up, run the following commands:

In [None]:
!git clone 

In [1]:
%pip install -r ./requirements.txt

Collecting torch
  Using cached torch-1.12.1-cp39-none-macosx_11_0_arm64.whl (49.1 MB)
Collecting numpy
  Downloading numpy-1.23.4-cp39-cp39-macosx_11_0_arm64.whl (13.4 MB)
Collecting transformers
  Using cached transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting tqdm
  Using cached tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting filelock
  Using cached filelock-3.8.0-py3-none-any.whl (10 kB)
Collecting huggingface-hub<1.0,>=0.10.0
  Using cached huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting requests
  Using cached requests-2.28.1-py3-none-any.whl (62 kB)
Collecting pyyaml>=5.1
  Using cached PyYAML-6.0-cp39-cp39-macosx_11_0_arm64.whl (173 kB)
Collecting regex!=2019.12.17
  Downloading regex-2022.9.13-cp39-cp39-macosx_11_0_arm64.whl (287 kB)

In [15]:
import os

import torch

import pandas as pd

from transformers import AutoTokenizer, AutoModelForMaskedLM

from surprisal import bert_token_surprisal

In [3]:
if torch.cuda.is_available():    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using CPU instead.')
    device = torch.device("cpu")

No GPU available, using CPU instead.


## Loading Models

### Set the language for the data

- Currently tested: Korean, English
- "Korean" will load "snunlp/KR-BERT-char16424" from the Hugging Face hub
- "English" will load "bert-large-uncased" from the Hugging Face hub
- You can optionally set LANGUAGE to the name of any model on the Hugging Face hub, but it is not guaranteed to work

In [4]:
LANGUAGE = "Korean"

In [5]:
if LANGUAGE.lower() == "korean":
    model_name = "snunlp/KR-BERT-char16424" 
elif LANGUAGE.lower() == "english":
    model_name = "bert-large-uncased"
else:
    model_name = LANGUAGE

try:
    mask_model = AutoModelForMaskedLM.from_pretrained(model_name, output_attentions=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
except OSError:
    # Raised when the name cannot be resolved to a model on the hub
    print(f"{LANGUAGE} is either...")
    print("\t- NOT a supported language")
    print("\t- NOT a model available at the HuggingFace Hub")
    print("Languages currently available:\tKorean, English")
except TypeError:
    print(f"{LANGUAGE} model does not have a MaskedLM model")
    print("Note that our method is only applicable to MaskedLM models")

Some weights of the model checkpoint at snunlp/KR-BERT-char16424 were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Loading the data

In [6]:
filename = "input.txt"

In [7]:
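try:
    # Try UTF-8 first; name it explicitly because the default encoding is platform-dependent.
    with open(filename, encoding='utf-8') as f:
        lines = f.readlines()
except UnicodeDecodeError:
    try:
        # Fall back to EUC-KR if the file is not valid UTF-8.
        with open(filename, encoding='euc-kr') as f:
            lines = f.readlines()
    except UnicodeDecodeError:
        print("Failed to load the file.")
        print("Make sure it is encoded in UTF-8 or, failing that, EUC-KR.")
        print("(Other EUC encodings may load without error but will not decode correctly.)")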
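
The fallback above can also be written as a loop over candidate encodings. The following is a minimal sketch, not part of the library: the helper name `read_lines_with_fallback` and its `encodings` parameter are hypothetical, and the order (UTF-8, then EUC-KR) simply mirrors the cell above.

```python
def read_lines_with_fallback(path, encodings=("utf-8", "euc-kr")):
    """Return the lines of `path`, decoded with the first encoding that works.

    Hypothetical helper for illustration; the encodings are tried in order.
    """
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.readlines()
        except UnicodeDecodeError:
            # This encoding cannot decode the file; try the next one.
            continue
    raise ValueError(f"Could not decode {path!r} with any of {encodings}")
```

With this helper, `lines = read_lines_with_fallback(filename)` would stand in for the nested `try` blocks, and a file that matches none of the encodings raises an error instead of leaving `lines` undefined.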

## Calculating

The result will be saved to the file named by `output_name`.
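
As a rough illustration of the file layout, the calculation cell in this section reads tab-separated input with a header row and at least four columns, and writes one tab-separated row per input line. The sketch below mimics that layout with a made-up scorer. `fake_scorer`, `format_row`, the input column names, and the sample text are all hypothetical and do not reflect what `bert_token_surprisal` computes.

```python
def fake_scorer(context, items):
    """Hypothetical stand-in for the real scorer: one made-up score per item."""
    return [float(len(item)) for item in items]


def format_row(line, scorer=fake_scorer):
    """Turn one tab-separated input line into one tab-separated output row."""
    fields = [field.strip() for field in line.strip().split("\t")]
    # Column 0 is the index, column 1 is passed as the first argument,
    # and columns 2 and 3 are the two items being scored.
    idx, context, item1, item2 = fields[:4]
    scores = scorer(context, [item1, item2])
    return "\t".join([idx] + [str(score) for score in scores])


# Made-up input: a header row (skipped) followed by one data row.
sample_lines = [
    "IDX\tCONTEXT\tITEM1\tITEM2\n",
    "1\tsome sentence\tfirst\tsecond!\n",
]

rows = ["IDX\tITEM1\tITEM2"] + [format_row(line) for line in sample_lines[1:]]
print("\n".join(rows))
```

The output keeps only the index and the two scores, so the input text itself is not copied into the result file.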

In [8]:
output_name = "output.txt"

In [10]:
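# Write one row per input line: the index followed by the score of each item.
with open(output_name, 'w', encoding='utf-8') as f:
    f.write("IDX\tITEM1\tITEM2\n")
    # Start at 1 to skip the header row of the input file.
    for i in range(1, len(lines)):
        line = lines[i].strip()
        if i % 10 == 0:
            print(i)  # progress indicator
        each = line.split("\t")
        # each[0]: index; each[1]: first argument; each[2], each[3]: the two items to score
        result = bert_token_surprisal(each[1].strip(), [each[2].strip(), each[3].strip()], mask_model, tokenizer, device, printing=False)
        # The third element of each result is written out as that item's score.
        scores = "\t".join(str(res[2]) for res in result)
        f.write(each[0] + "\t" + scores + "\n")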
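
For readers unfamiliar with the measure, surprisal is conventionally defined as the negative log probability of an outcome, so less probable items receive higher scores. The sketch below shows only that arithmetic. It assumes base-2 logarithms (bits) and uses made-up probabilities; the exact definition, log base, and tokenization used by `bert_token_surprisal` are not shown in this notebook and may differ.

```python
import math


def surprisal(prob):
    """Surprisal in bits of an outcome with probability `prob` (assumes log base 2)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


# Made-up probabilities that a masked language model might assign to two
# candidate items in the same position (illustrative numbers only).
candidate_probs = {"item1": 0.5, "item2": 0.125}

for name, prob in candidate_probs.items():
    print(name, surprisal(prob))
```

Under this definition, the item with the lower probability (`item2`) has the higher surprisal, which is the direction to keep in mind when comparing the two score columns.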

## Converting to xlsx

- Convert the file named by `output_name` to xlsx so that it can be opened in Microsoft Excel

In [12]:
df = pd.read_csv(output_name, sep='\t', index_col=0, header=0)

In [26]:
excel_name = os.path.splitext(output_name)[0] + '.xlsx'
df.to_excel(excel_name)