lopentu/CxLM

CxLM

CxLM is a masked language model fine-tuned on constructions in Traditional Chinese. The model generates construction-informed and context-aware candidates for the "variable" sites of a construction. The generated samples can then be used in corpus linguistics, psycholinguistics, or behavioral studies.

!pip -q install transformers
!git clone https://github.com/seantyh/CxLM
import sys
sys.path.append("CxLM/src")
import re
import numpy as np
import torch
from transformers import BertTokenizerFast, BertForMaskedLM
from conart.sample import sample_site
device = torch.device("cuda") \
         if torch.cuda.is_available() else torch.device("cpu")
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = BertForMaskedLM.from_pretrained('seantyh/CxLM')
model = model.to(device)
def generate_samples(input_text):
    # Split the input into single characters, keeping each [MASK] as one token.
    tokens = re.findall(r"\[MASK\]|.", input_text)
    # Positions of the variable sites, i.e. the [MASK] tokens.
    mindex = [i for i, x in enumerate(tokens) if x == "[MASK]"]
    batch = {
        "masked": [tokens],
        "mindex": np.array([mindex])
    }

    # Sample 10 candidates per masked site; one output line per site.
    samples = sample_site(batch, model, tokenizer, n_sample=10)[0]
    prob_sort = samples["probs"].sum(0).argsort()
    print("CxLM (separated): ")
    for site_x in tokenizer.batch_decode(samples["ids"]):
        print("  ", site_x)
    # Sample again with merge_pair2=True; the result prints as a single line.
    samples = sample_site(batch, model, tokenizer, merge_pair2=True)[0]
    prob_sort = samples["probs"].sum(0).argsort()
    print("CxLM (merged): ")
    print("  ", " ".join(tokenizer.batch_decode(samples["ids"])))
generate_samples("[MASK]一[MASK]")
CxLM (separated): 
   想 算 看 洗 笑 舔 聊 摸 走 動
   想 算 笑 洗 看 舔 聊 忍 摸 動
CxLM (merged): 
   想 算 看 洗 笑 舔 聊 摸 忍 動
generate_samples("買本書[MASK]一[MASK]")
CxLM (separated): 
   讀 看 寫 翻 唸 買 聽 逛 走 想
   讀 看 寫 翻 想 買 唸 逛 聽 走
CxLM (merged): 
   讀 看 寫 翻 唸 想 買 逛 聽 走
generate_samples("[MASK]一[MASK]也好")
CxLM (separated): 
   忍 哭 笑 算 罵 洗 死 吵 收 想
   忍 哭 算 笑 死 罵 洗 收 喊 吵
CxLM (merged): 
   忍 哭 算 笑 死 罵 洗 收 吵 喊
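
The input-preparation step inside generate_samples can be tried on its own, without downloading the model. The following is a minimal sketch that assumes only the standard library and NumPy. `build_batch` is a hypothetical helper name used for illustration and is not part of the CxLM code; the dictionary keys mirror the ones passed to sample_site above.

```python
import re
import numpy as np


def build_batch(input_text):
    """Hypothetical helper: tokenize the text and locate the [MASK] sites."""
    # Each [MASK] stays one token; every other character is its own token.
    tokens = re.findall(r"\[MASK\]|.", input_text)
    # Indices of the variable sites within the token list.
    mindex = [i for i, tok in enumerate(tokens) if tok == "[MASK]"]
    # Same layout as the batch built in generate_samples (batch size 1).
    return {"masked": [tokens], "mindex": np.array([mindex])}


batch = build_batch("買本書[MASK]一[MASK]")
print(batch["masked"][0])  # ['買', '本', '書', '[MASK]', '一', '[MASK]']
print(batch["mindex"])     # [[3 5]]
```

Because the pattern tries `\[MASK\]` before the single-character wildcard, a mask is never split into its individual characters.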
