# AI Generated Text Detection

With the recent dramatic increase in the prevalence of large language models (LLMs) and machine-generated content across all fields, the potential for misuse of such content has become a pressing concern. For instance, educational institutions worry that LLMs give students new ways to cheat on assignments. Furthermore, research has found that human performance on this classification task is only slightly better than chance, so automated models may be necessary to achieve meaningful performance on the Machine-Generated Text Detection task. The goal is to curb misuse of LLMs by building systems that identify machine-generated natural language content. This project focuses primarily on the first subtask of SemEval-2024 Task 8, which is to determine whether a given text is human-written or machine-generated.

As the discourse surrounding this issue has exploded in the past year, several university researchers and companies have already created detection tools to identify machine-generated text. However, continuous improvement of such tools is necessary, as false positives could potentially have pronounced consequences depending on the use case.

More information about the first subtask of SemEval-2024 Task 8 can be found at this link: https://github.com/mbzuai-nlp/SemEval2024-task8?tab=readme-ov-file

In [None]:
print(">:)")

>:)


## 1. Loading the Original Dataset

The data used in SemEval-2024 Task 8 is an extension of the M4 dataset, as described in https://aclanthology.org/2024.eacl-long.83/. The following are statistics on the dataset:

![picture](https://drive.google.com/uc?id=1vSGghvTFX0biktiA-LWnFK-qnwC40ZgV)

Install the `gdown` package with pip in order to download the dataset folders, and use the provided file IDs to access the subtask data. The data used for this project come exclusively from the task; no other external sources are used.

In [None]:
%pip install gdown
!gdown --folder https://drive.google.com/drive/folders/1CAbb3DjrOPBNm0ozVBfhvrEh9P9rAppc

Retrieving folder contents
Processing file 1e_G-9a66AryHxBOwGWhriePYCCa4_29e subtaskA_dev_monolingual.jsonl
Processing file 123UQ92LxtHaVTbNYlmjnG1CWwD-x7wDL subtaskA_dev_multilingual.jsonl
Processing file 1HeCgnLuDoUHhP-2OsTSSC3FXRLVoI6OG subtaskA_train_monolingual.jsonl
Processing file 13-9-DakCeLFbPgCiVIU0v6_BCQx0ppz6 subtaskA_train_multilingual.jsonl
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1e_G-9a66AryHxBOwGWhriePYCCa4_29e
To: /content/SubtaskA/subtaskA_dev_monolingual.jsonl
100% 10.8M/10.8M [00:00<00:00, 33.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=123UQ92LxtHaVTbNYlmjnG1CWwD-x7wDL
To: /content/SubtaskA/subtaskA_dev_multilingual.jsonl
100% 21.2M/21.2M [00:00<00:00, 86.5MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1HeCgnLuDoUHhP-2OsTSSC3FXRLVoI6OG
From (redirected): https://drive.google.com/uc?id=1HeCgnLuDoUHhP-2OsTSSC3FXRLVoI

The data is now located in the `SubtaskA` folder. For this project, only the monolingual datasets are used. Each object in a file has the following fields:

`{ \
  id -> identifier of the example, \
  label -> label (human text: 0, machine text: 1), \
  text -> text generated by a machine or written by a human, \
  model -> model that generated the data, \
  source -> source (Wikipedia, Wikihow, Peerread, Reddit, Arxiv) on English or language (Arabic, Russian, Chinese, Indonesian, Urdu, Bulgarian, German) \
}`
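
As a minimal sketch of how records in this layout can be read, the hypothetical helper `read_jsonl` below parses one JSON object per line and tallies the labels. The helper name, the temporary file, and the two sample records (including their `model` and `source` values) are illustrative assumptions, not part of the task's tooling.

```python
import json
import os
import tempfile
from collections import Counter


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts (one object per line)."""
    records = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Two made-up records that follow the field layout described above.
sample = [
    {"id": 0, "label": 0, "text": "A human-written sentence.",
     "model": "human", "source": "wikihow"},
    {"id": 1, "label": 1, "text": "A machine-generated sentence.",
     "model": "chatGPT", "source": "wikihow"},
]

# Write the sample to a temporary .jsonl file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as handle:
        for record in sample:
            handle.write(json.dumps(record) + "\n")
    records = read_jsonl(sample_path)

# Count how many examples carry each label (0 = human, 1 = machine).
label_counts = Counter(record["label"] for record in records)
print(label_counts)
```

Reading line by line, rather than loading the whole file as one JSON document, matches the one-object-per-line layout of the `.jsonl` files.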

In [None]:
from google.colab import drive
drive.mount('/content/drive')

!pip install datasets
!pip install evaluate
!pip install accelerate -U
!pip install tensorflow[and-cuda]

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch>=1.10.0

The following is an example of the dataset and its formatting before any manipulation is performed. Only the first 5 lines of the dataset are printed as an example; the actual dataset is much longer.

The dataset modifications are run on each of the train, dev, and test datasets. To switch datasets, change the file path passed to `open` on line 3 of the cell below (for example, use the path to subtaskA_dev_monolingual.jsonl when modifying the dev set). The lines of the opened file are then read into the `json_list` variable.

The test dataset used is also exclusive to the SemEval-2024 Task 8, and it contains test gold labels that can be found at this link: https://drive.google.com/drive/folders/13aFJK4UyY3Gxg_2ceEAWfJvzopB1vkPc

In [None]:
import json

with open('/content/drive/MyDrive/2023-2024/cpsc 477/cpsc 477 final project datasets/subtaskA_monolingual.jsonl', 'r') as json_file:
    json_list = list(json_file)

#for i in range(len(json_list)):
for i in range(5):
    result = json.loads(json_list[i])
    print(f"result: {result}")

result: {'text': "Today, many adults or teenage drivers are hooked onto their phones. While driving, they can be prompted to use their phones for text messaging. It may cause many accidents, death, serious injuries and more. I honestly think that drivers should use phones while driving  because they are taking risks that could kill or injure others and also yourself. There are also laws against using a phone while operating a moving vehicle but people still disobey them. I think that there should be more consequences when it comes down to texting and driving.  Using your cell phones causes many distractions. It only takes a blink of an eye to cause an accident. Yeah, resisting the urge to text while driving may be hard but it can and will save lives including yours. When driving, your eyes and mind are programmed to be focused on the road at all times. Having a cellphone on your person is a hazard in my opinion. Just think about it, anyone could be crossing a busy street. The person wa

## 2. Dataset Modification: Synonym Attacks

The cell below applies synonym manipulations to the input dataset. It prompts for the percentage of words to replace with synonyms, which synonym collection to use, the percentage of adjectives to change emphasis on, whether to ignore quotations, and whether to use only common words.

To make the FinNLP synonym datasets, the recommended settings were used: 70% of words replaced with synonyms, the FinNLP collection, emphasis changed on 50% of adjectives, quotations ignored, and no restriction to common words. Similarly, to make the Zaibacu Thesaurus synonym datasets, the recommended settings were used: 20% of words replaced with synonyms, the Zaibacu Thesaurus collection, emphasis changed on 50% of adjectives, quotations ignored, and no restriction to common words.

The code used to create the synonym attack datasets is taken directly from GPTZzzs (https://github.com/Declipsonator/GPTZzzs), as outlined in the final report.

Depending on which dataset is being modified, line 222 can be changed to write to a different file (e.g. synonym_test.jsonl). The test datasets do not include model and source information, so lines 294-295 must be commented out when processing them.
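
The core synonym-replacement step can be sketched in isolation as follows. This is a simplified illustration under stated assumptions, not the GPTZzzs code itself: the function name `replace_with_synonyms` and the toy dictionary are hypothetical, and the sketch omits the quotation handling and the adjective-emphasis pass.

```python
import random


def replace_with_synonyms(text, synonyms, percent, rng):
    """Replace roughly `percent`% of the words in `text` with a random synonym."""
    words = text.split(" ")
    num_words = int(len(words) * percent / 100)
    chosen = set(rng.sample(range(len(words)), num_words))
    trailing = (".", ",", "!", "'", "?", ":", ";")

    out = []
    for i, word in enumerate(words):
        if i in chosen:
            suffix = ""
            # Detach one trailing punctuation mark so the lookup can succeed.
            if word.endswith(trailing):
                word, suffix = word[:-1], word[-1]
            # Only swap words longer than three characters that have synonyms.
            if len(word) > 3 and synonyms.get(word):
                word = rng.choice(synonyms[word])
            word += suffix
        out.append(word)
    return " ".join(out)


# Toy dictionary standing in for the FinNLP or Zaibacu collections.
toy_synonyms = {"quick": ["fast"], "happy": ["glad"]}
rng = random.Random(0)
result = replace_with_synonyms("The quick dog is happy.", toy_synonyms, 100, rng)
print(result)  # The fast dog is glad.
```

Passing a seeded `random.Random` instance, rather than using the module-level functions, is a design choice made here so that a run can be reproduced.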

In [None]:
import os
import urllib.request
import random
import re
import time

file_name = "synonyms"
zaibacu_url = "https://raw.githubusercontent.com/zaibacu/thesaurus/master/en_thesaurus.jsonl"
finnlp_url = "https://raw.githubusercontent.com/FinNLP/synonyms/master/src.json"
adjective_url = "https://raw.githubusercontent.com/rgbkrk/adjectives/master/index.js"
common_words_url = "https://raw.githubusercontent.com/first20hours/google-10000-english/master/20k.txt"


percentToChangeSyn = input("\033[1;32;40mPercentage of words to replace with synonyms (Recommended: 20-30 for Zaibacu, 70-90 for FinNLP):\033[0m")

try:
    percentToChangeSyn = float(percentToChangeSyn)
except ValueError:
    print("\033[0;31;40mNot valid number\033[0m")
    exit()

if percentToChangeSyn < 0 or percentToChangeSyn > 100:
    print("\033[0;31;40mNumber needs to be between 0 and 100\033[0m")
    exit()


collection = input("""
Synonym Collection to Use:

1. Zaibacu Thesaurus
2. FinNLP Synonyms (Recommended)

\033[1;32;40mChoice: \033[0m""")

try:
    collection = int(collection)
except ValueError:
    print("\033[0;31;40mNot valid number\033[0m")
    exit()

if collection < 1 or collection > 2:
    print("\033[0;31;40mNumber needs to be between 1 and 2\033[0m")
    exit()



percentToChangeAdj = input("\n\033[1;32;40mPercentage of adjectives to change emphasis on (Recommended: 50-80): \033[0m")

try:
    percentToChangeAdj = float(percentToChangeAdj)
except ValueError:
    print("\033[0;31;40mNot valid number\033[0m")
    exit()

if percentToChangeAdj < 0 or percentToChangeAdj > 100:
    print("\033[0;31;40mNumber needs to be between 0 and 100\033[0m")
    exit()



ignore_quotes = input("""
\033[1;32;40mIgnore Quotations (y/N): \033[0m""")

if ignore_quotes.lower() == "y" or ignore_quotes.lower() == "yes":
    ignore_quotes = True

elif ignore_quotes.lower() == "n" or ignore_quotes.lower() == "no":
    ignore_quotes = False
else:
    print("\033[0;31;40mInvalid answer, answer with y or n\033[0m")
    exit()


use_common_words = input("""
\033[1;32;40mOnly Use Common Words (y/N) (Not Recommended): \033[0m""")

if use_common_words.lower() == "y" or use_common_words.lower() == "yes":
    use_common_words = True

elif use_common_words.lower() == "n" or use_common_words.lower() == "no":
    use_common_words = False
else:
    print("\033[0;31;40mInvalid answer, answer with y or n\033[0m")
    exit()



# Load synonym file or download it if it doesn't exist
print("")
if os.path.exists("{}{}.json".format(collection, file_name)) and (not use_common_words or not os.path.exists("{}{}-common.json".format(collection, file_name))):
    with open("{}{}.json".format(collection, file_name), "r") as f:
        text = f
        synonyms = json.load(text)
        print("\033[1;33;40mLoaded synonym file from local folder")
elif not os.path.exists("{}{}.json".format(collection, file_name)) and (not use_common_words or not os.path.exists("{}{}-common.json".format(collection, file_name))):
    try:
        if collection == 1:
            response = urllib.request.urlopen(zaibacu_url)
            data = response.read()
            text = data.decode('utf-8')

            lines = text.split("\n")
            word_synonyms = [json.loads(line) for line in lines if line]
            print("Loaded file from remote URL")

            # Create word-synonyms dictionary

            synonyms = {entry['word']: entry['synonyms'] for entry in word_synonyms}

        elif collection == 2:
            response = urllib.request.urlopen(finnlp_url)
            data = response.read()
            text = data.decode('utf-8')

            synonyms = json.loads(text)
            print("Loaded synonym file from remote URL")

            # Create word-synonyms dictionary

            for key, value in synonyms.items():
                synonyms[key] = []
                for k, v in value.items():
                    synonyms[key] += [word for word in v[1:]]
                if ("v" in synonyms[key]):
                    synonyms[key].remove("v")
                if ("s" in synonyms[key]):
                    synonyms[key].remove("s")
                if ("r" in synonyms[key]):
                    synonyms[key].remove("r")
                if ("a" in synonyms[key]):
                    synonyms[key].remove("a")
                if ("n" in synonyms[key]):
                    synonyms[key].remove("n")

        # Save dictionary to file

        dict_file_name = "{}synonyms.json".format(collection)
        with open(dict_file_name, "w") as f:
            f.write(json.dumps(synonyms))
            print("Saved synonym dictionary to local folder")

    except Exception as e:
        print("\033[0;31;40mFailed to load synonyms file\033[0m")
        print(e)
        exit()



# Make common words if needed
print("")
if use_common_words and os.path.exists("{}{}-common.json".format(collection, file_name)):
    with open("{}{}-common.json".format(collection, file_name), "r") as f:
        text = f
        synonyms = json.load(text)
        print("\033[1;33;40mLoaded common synonym file from local folder")
elif use_common_words:
    try:
        response = urllib.request.urlopen(common_words_url)
        data = response.read()
        text = data.decode('utf-8')

        common_words = text.split("\n")

        new_synonyms = {}
        for key, value in synonyms.items():
          updated_value = []
          for synonym in value:
              if synonym in common_words:
                updated_value.append(synonym)
          if updated_value:
            new_synonyms[key] = updated_value
        synonyms = new_synonyms




        # Save dictionary to file

        dict_file_name = "{}synonyms-common.json".format(collection)
        with open(dict_file_name, "w") as f:
            f.write(json.dumps(synonyms))
            print("Saved common synonym dictionary to local folder")

    except Exception as e:
        print("\033[0;31;40mFailed to load common synonyms file\033[0m")
        print(e)
        exit()


# Load adjective file or download it if it doesn't exist
print("")
if os.path.exists("adjectives.json"):
    with open("adjectives.json", "r") as f:
        text = f
        adjectives = json.load(text)
        print("\033[1;33;40mLoaded adjective file from local folder")
else:
    try:
        response = urllib.request.urlopen(adjective_url)
        data = response.read()
        text = data.decode('utf-8')

        lines = text.split("\n")
        adjectives = []

        for i in range(1, len(lines) - 2):
            adjective = re.search('\'(.*)\',', lines[i]).group(1)
            adjectives.append(adjective)
        print("Loaded adjective file from remote URL")

        dict_file_name = "adjectives.json"
        with open(dict_file_name, "w") as f:
            f.write(json.dumps(adjectives))
            print("Saved adjective dictionary to local folder")

    except Exception as e:
        print("\033[0;31;40mFailed to load adjective file\033[0m")
        print(e)
        exit()

# Load text file
with open("synonym_test_2.jsonl", "w") as f:

  for curr in range(len(json_list)):
    curr_json_line = json.loads(json_list[curr])
    text = curr_json_line["text"]

    words = text.split(" ")

    newWords = ""

    num_words = int(len(words) * percentToChangeSyn / 100)
    chosen_indices = random.sample(range(len(words)), num_words)

    quotation_count = 0
    for i in range(len(words)):
        if "\"" in words[i]:
            quotation_count += words[i].count("\"")

        if i in chosen_indices and (quotation_count % 2 == 0 or not ignore_quotes):
            word = words[i]
            endswith = ""
            if word.endswith((".", ",", "!", "'", "?", ":", ";")):
                endswith = word[len(word) - 1]
                word = word[:-1]

            if len(word) > 3 and word in synonyms.keys() and len(synonyms[word]) != 0:
                word = random.choice(synonyms[word])

            newWords = "{}{}{}".format(newWords, word, endswith)
            if i != len(words) - 1:
                newWords = "{}{}".format(newWords, " ")


        else:
            newWords = "{}{}".format(newWords, words[i])
            if i != len(words) - 1:
                newWords = "{} ".format(newWords)

    words = newWords.split(" ")

    newWords = ""

    num_words = int(len(words) * percentToChangeSyn / 100)

    quotation_count = 0
    emphasis_words = ["very", "very", "very", "really", "really", "really", "extremely", "quite", "so", "too", "very", "really"]
    dont_change = False
    for i in range(len(words)):
        if "\"" in words[i]:
            quotation_count += words[i].count("\"")

        if words[i] in emphasis_words and i + 1 < len(words) and words[i+1] and words[i+1] in adjectives and (quotation_count % 2 == 0 or not ignore_quotes):
            if random.randint(0, 100) < percentToChangeAdj:
                dont_change = True
                continue

        if words[i] in adjectives and not dont_change and (quotation_count % 2 == 0 or not ignore_quotes):
            if random.randint(0, 100) < percentToChangeAdj:
                emp_word = random.choice(emphasis_words)
                newWords = "{}{} {}".format(newWords, emp_word, words[i])
                if i != len(words) - 1:
                    newWords = "{} ".format(newWords)
                continue

        dont_change = False
        newWords = "{}{}".format(newWords, words[i])
        if i != len(words) - 1:
            newWords = "{} ".format(newWords)

    # format output and write to new output jsonl file
    new_input = {"text": newWords,
                 "label": curr_json_line["label"],
                 #"model": curr_json_line["model"],
                 #"source": curr_json_line["source"],
                 "id": curr_json_line["id"]}
    json.dump(new_input, f)
    f.write("\n")


Percentage of words to replace with synonyms (Recommended: 20-30 for Zaibacu, 70-90 for FinNLP):20

Synonym Collection to Use:

1. Zaibacu Thesaurus
2. FinNLP Synonyms (Recommended)

Choice: 1

Percentage of adjectives to change emphasis on (Recommended: 50-80): 50

Ignore Quotations (y/N): y

Only Use Common Words (y/N) (Not Recommended): N

Loaded synonym file from local folder


Loaded adjective file from local folder


## 3. Dataset Modification: Homoglyph Attack

Load the standard confusables table from Unicode Technical Standard #39 (described at http://www.unicode.org/reports/tr39/#def_whole_script_confusables). The entire table is available at https://www.unicode.org/Public/security/12.0.0/confusables.txt.

The cell below builds a dictionary whose keys are characters and whose values are lists of all the other characters they could be confused with.
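
As a compact sketch of the same idea, the hypothetical function `build_confusables` below parses lines in the file's `source ; target ; type # comment` layout and groups the sources under each target. The function name and the sample lines are assumptions made for illustration; the real table is read from the downloaded confusables.txt.

```python
def build_confusables(lines):
    """Map each target string to the list of source strings confusable with it."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2 or not fields[0]:
            continue  # header, comment-only, or blank line
        # Each field holds one or more space-separated hex code points.
        source = "".join(chr(int(code, 16)) for code in fields[0].split())
        target = "".join(chr(int(code, 16)) for code in fields[1].split())
        table.setdefault(target, []).append(source)
    return table


# Illustrative lines written in the file's layout (not copied from the file).
sample_lines = [
    "# illustrative lines only",
    "0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A",
    "043E ;\t006F ;\tMA\t# CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O",
    "03BF ;\t006F ;\tMA\t# GREEK SMALL LETTER OMICRON -> LATIN SMALL LETTER O",
]
table = build_confusables(sample_lines)
print(table)
```

Keying the table on the target (right-hand) column means a plain Latin letter such as `o` maps to every look-alike that the file lists for it.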

In [None]:
# code_confusables represents confusable conversions as key/val pairs in unicode
# char_confusables represents confusable conversions as key/val pairs as the actual characters
code_confusables = {}
char_confusables = {}

# read confusables table text file by line
# note: keys are set to the right conversion of the confusable file as many different left codes map to same right code
with open('drive/MyDrive/confusables.txt') as f:
  lines = f.readlines()
  for i in range(len(lines)):
    line = lines[i].strip()
    line = line.split(';')
    line = [j.strip() for j in line]
    if len(line) > 1:
      if line[1] in code_confusables.keys():
        code_confusables[line[1]].append(line[0])
      else:
        code_confusables[line[1]] = [line[0]]

for key in code_confusables.keys():
  # process keys such as combining if confusable is multiple characters, format to hex to use chr function
  code_key = key.split(' ')
  char_key = ''
  for code in code_key:
    code = '0x' + code
    char_key += chr(int(code, 16))

  # process vals such as combining if confusable is multiple characters, format to hex to use chr function
  all_char_vals = []
  for i in range(len(code_confusables[key])):
    code_val = code_confusables[key][i].split(' ')
    char_val = ''
    for code in code_val:
      code = '0x' + code
      char_val += chr(int(code, 16))
    all_char_vals.append(char_val)

  # check if either conversion is invalid (ex. \u2028 or empty string), in which case ignore that confusable
  if char_key == ' ' or all_char_vals == [' ']:
    continue
  else:
    char_confusables[char_key] = all_char_vals

print(char_confusables) # file total 6296 --> 3308 after switching key and val then combining

{'֖': ['֭'], '֘': ['֮'], '֙': ['֨'], '֚': ['֤'], 'ۛ': ['᪴', '⃛'], '̓': ['ؙ', 'ࣳ', '̓', '̕', 'ُ'], '̔': ['ٝ'], '́': ['֜', '֝', 'ؘ', '݇', '́', '॔', 'َ'], '̀': ['̀', '॓'], '̆': ['̌', '꙼', '٘', 'ٚ', 'ͮ'], '̆̇': ['ۨ', '̐', 'ँ', 'ঁ', 'ઁ', 'ଁ', 'ఀ', 'ಁ', 'ഁ', '𑒿'], '̂': ['᳐', '̑', 'ٛ', '߮', '꛰'], '̊': ['֯', '۟', '៓', '゚', 'ْ', 'ஂ', 'ံ', 'ំ', '𑌀', 'ํ', 'ໍ', 'ͦ', 'ⷪ'], '̈': ['࣫', '߳'], '̋': ['ً', 'ࣰ'], '̃': ['͂', 'ٓ'], '̇': ['ׄ', '۬', '݀', '࣪', '݁', '͘', 'ֹ', 'ֺ', 'ׂ', 'ׁ', '߭', 'ं', 'ਂ', 'ં', '்'], '̸': ['̷'], '̨': ['᪷', '̢', 'ͅ'], '̄': ['᳒', '̅', 'ٙ', '߫', '꛱'], '̎': ['᳚'], '̒': ['ٗ'], '͐': ['͗', 'ࣿ', 'ࣸ'], '͒': ['ऀ'], '̖': ['᳭'], '̩': ['᳜', 'ٖ'], '̫': ['᳕'], '̳': ['͇'], '͔': ['ࣹ'], '͕': ['ࣺ'], 'ﾞ': ['゛'], 'ﾟ': ['゜'], '̵': ['̶'], '̉': ['〬'], '̣': ['ׅ', '࣭', '᳝', 'ִ', 'ٜ', '़', '়', '਼', '઼', '଼', '𑇊', '𑓃', '𐨺'], '̤': ['࣮', '᳞'], '̥': ['༷', '〭'], '̦': ['̧', '̡', '̹'], '̭': ['᳙'], '̮': ['᳘'], '̱': ['॒', '̠'], 'ٌ': ['ࣱ', 'ࣨ', 'ࣥ'], 'ﹲّ': ['ﱞ'], 'ٍ': ['ࣲ'], 'ﹴّ': ['ﱟ'], 'ﹷّ': ['ﳲ'], 'ﹶّ': ['ﱠ'], 

The following code sets a homoglyph conversion rate. For each character in the input, it draws a random number and, if that number is less than or equal to the conversion rate and the character has an entry in the confusables table, replaces the character with one of its confusables. As with the synonym attack, line 1 should be changed to match the desired output file name, and lines 22-23 should be commented out when modifying test datasets, since those do not have model or source information.
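
The per-character substitution can be sketched on its own as below. The function name `homoglyph_attack` and the two-entry toy table are hypothetical; in the notebook the table would be `char_confusables`.

```python
import random


def homoglyph_attack(text, confusables, rate, rng):
    """Swap each character for a random confusable with probability `rate`."""
    new_chars = []
    for char in text:
        # Draw one random number per character; substitute only when the
        # draw falls within the rate and the character has confusables.
        if rng.random() <= rate and char in confusables:
            new_chars.append(rng.choice(confusables[char]))
        else:
            new_chars.append(char)
    return "".join(new_chars)


# Toy table mapping Latin letters to Cyrillic look-alikes.
toy_table = {"a": ["\u0430"], "o": ["\u043e"]}
attacked = homoglyph_attack("a photo", toy_table, 1.0, random.Random(0))
print(attacked == "a photo")  # False: the strings only look identical
print(len(attacked) == len("a photo"))  # True: one character in, one out
```

With a rate of 1.0 every character that has an entry is replaced, which makes the effect easy to check; the notebook uses a rate of 0.1.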

In [None]:
with open("homoglyph_test.jsonl", "w") as f:
  for curr in range(len(json_list)):
    curr_json_line = json.loads(json_list[curr])
    original_text = curr_json_line['text']
    original_chars = [x for x in original_text]
    new_chars = []

    # for each char in input change to a random confusable if the generated random number
    # is less than or equal to the conversion rate
    homoglyph_rate = 0.1
    for char in original_chars:
      temp = random.random()
      if temp <= homoglyph_rate and char in char_confusables.keys():
        new_chars.append(random.choice(char_confusables[char]))
      else:
        new_chars.append(char)

    # format output and write to new output jsonl file
    new_text = ''.join(new_chars)
    new_input = {"text": new_text,
                 "label": curr_json_line["label"],
                 #"model": curr_json_line["model"],
                 #"source": curr_json_line["source"],
                 "id": curr_json_line["id"]}
    json.dump(new_input, f)
    f.write("\n")


## 4. Dataset Modification: Duplication Attack

The attack below duplicates approximately 10% of the words in the input dataset so that the text appears to contain more errors. It uses formatting code similar to that of the synonym attack. As before, line 3 should be updated to reflect the desired output file name, and lines 44-45 should be commented out when modifying test datasets.
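
A stripped-down sketch of the duplication step is shown here. The function name `duplicate_words` is hypothetical, and the sketch leaves out the check that skips words inside quotations.

```python
import random


def duplicate_words(text, rate, rng):
    """Repeat a `rate` fraction of the words, keeping edge punctuation outside."""
    words = text.split(" ")
    num_words = int(len(words) * rate)
    chosen = set(rng.sample(range(len(words)), num_words))
    punctuation = (".", ",", "!", "'", "?", ":", ";", "(", ")")

    out = []
    for i, word in enumerate(words):
        if i in chosen:
            prefix = suffix = ""
            # Peel off one trailing and one leading punctuation mark.
            if word.endswith(punctuation):
                word, suffix = word[:-1], word[-1]
            if word.startswith(punctuation):
                prefix, word = word[0], word[1:]
            out.append(f"{prefix}{word} {word}{suffix}")
        else:
            out.append(word)
    return " ".join(out)


doubled = duplicate_words("Stay safe, (always) drive carefully.", 1.0, random.Random(0))
print(doubled)
# Stay Stay safe safe, (always always) drive drive carefully carefully.
```

A rate of 1.0 is used only to make the output predictable; the notebook sets `duplicate_rate = 0.1`.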

In [None]:
duplicate_rate = 0.1

with open("duplicate_test.jsonl", "w") as f:

  for curr in range(len(json_list)):
    curr_json_line = json.loads(json_list[curr])
    text = curr_json_line["text"]

    words = text.split(" ")
    newWords = ""

    num_words = int(len(words) * duplicate_rate)
    chosen_indices = random.sample(range(len(words)), num_words)

    quotation_count = 0
    for i in range(len(words)):
        if "\"" in words[i]:
            quotation_count += words[i].count("\"")

        if i in chosen_indices and quotation_count % 2 == 0:
            word = words[i]
            endswith = ""
            if word.endswith((".", ",", "!", "'", "?", ":", ";", "(", ")")):
                endswith = word[len(word) - 1]
                word = word[:-1]

            startswith = ""
            if word.startswith((".", ",", "!", "'", "?", ":", ";", "(", ")")):
                startswith = word[0]
                word = word[1:]

            newWords = "{}{}{} {}{}".format(newWords, startswith, word, word, endswith)
            if i != len(words) - 1:
                newWords = "{}{}".format(newWords, " ")

        else:
            newWords = "{}{}".format(newWords, words[i])
            if i != len(words) - 1:
                newWords = "{} ".format(newWords)

    # format output and write to new output jsonl file
    new_input = {"text": newWords,
                 "label": curr_json_line["label"],
                 #"model": curr_json_line["model"],
                 #"source": curr_json_line["source"],
                 "id": curr_json_line["id"]}
    json.dump(new_input, f)
    f.write("\n")
