# Large Language Model Training Tools

Textual data can be difficult to clean. This notebook helps students apply best practices for cleaning textual data before analysis or model training. As we see in the tokenization with word embeddings notebook, analyzing data before cleaning it properly can produce worse results and keep your model from performing as well as it could. You can also introduce hateful words and phrases into your trained model if you do not clean your data carefully.

This notebook will show you how to:

1. Clean data that has been ingested in a .txt file
2. Remove punctuation, numbers and stopwords
3. Create Counterfactual Datasets
4. Remove (prune) hateful words and phrases that have been tagged
5. Flag words and phrases for tagging

In [2]:
import os, json, glob
import pandas as pd
import numpy as np
import re
from typing import Iterable

## Saving Data From .txt File

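Many text datasets are stored in .txt files because they are relatively lightweight compared to .csv files. To load data from a .txt file into a list, follow the code below. The code assumes each line of the file holds one JSON object, since it parses every line with `json.loads`.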

In [None]:
ls = [] # list that will hold each parsed line from the .txt file

# Open connection to file
with open("<file>.txt", "r") as file: # add your file name
    # Read in tweets and store in list
    for line in file:
        data = json.loads(line) # parse the line as JSON
        ls.append(data) # add the parsed line to your list
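
Real files often contain blank or malformed lines, which make `json.loads` raise an error. The sketch below is one possible way to handle that, not part of the original notebook: the helper name `load_json_lines` and the sample data are hypothetical, and it assumes one JSON object per line.

```python
import json
import os
import tempfile

def load_json_lines(path):
    """Parse one JSON object per line, skipping blank or malformed lines."""
    records, skipped = [], 0
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # ignore empty lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                skipped += 1  # count lines that are not valid JSON
    return records, skipped

# Demo on a small temporary file (hypothetical sample data)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write('{"text": "first tweet"}\n\nnot json\n{"text": "second tweet"}\n')
    sample_path = tmp.name

records, skipped = load_json_lines(sample_path)
os.remove(sample_path)
print(len(records), "records loaded;", skipped, "skipped")
```

Counting skipped lines, rather than dropping them silently, makes it easier to notice when a file is not in the format you expected.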

## Cleaning Data Inside of More than One File

In the *word-embs-fun.ipynb* file, I show how to clean a single .txt file and save it as a dictionary. Below, I show how to gather multiple files into a list and clean them together.

In [3]:
path = 'src/<path>' # designate the path where your data is housed

format_txt = os.path.join(path, '*.txt')

file_list = glob.glob(format_txt)

In [5]:
def PreProcess(file_list):
    cleaned = [] # cleaned text of every file
    for file in file_list:
        with open(file) as f:
            data = f.read()
        data = data.lower()
        # runs before punctuation is stripped, because the pattern itself contains commas
        data = re.sub('â,â,â', ' ', data, flags=re.IGNORECASE) #removes special characters
        data = re.sub(r'\d+', '', data) #removes numbers
        data = re.sub(r'[^\w\s]', '', data) #removes punctuation
        cleaned.append(data)
    filename = 'corpora.txt'
    with open(filename, 'w') as f:
        f.write('\n'.join(cleaned)) # write all cleaned files into one corpus file
    print('Done!')

In [6]:
PreProcess(file_list)

Done!
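
To check the cleaning steps without touching any files, the same substitutions can be applied to a single string. This is a minimal sketch: `clean_text` is a hypothetical helper, and the final whitespace-collapsing step is an addition that the function above does not perform.

```python
import re

def clean_text(text):
    """Lowercase, strip digits and punctuation, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)           # remove numbers
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace (added step)

cleaned = clean_text("Hello, World! It's 2024...  Time to CLEAN text.")
print(cleaned)
```

Note that removing punctuation turns "it's" into "its", which may matter if you later match against a stopword list that contains contractions.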


## Removing Stopwords

Stopwords are commonly used words in a language that do not actively contribute to the meaning of a sentence. Common stopwords include 'and', 'the' and 'at'.

In [None]:
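from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# needs the NLTK stopwords corpus and tokenizer data, fetched once with nltk.download()
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(data) # data is a cleaned text string, e.g. the contents of corpora.txt
result = [i for i in tokens if i not in stop_words]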
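
The filtering step itself does not depend on NLTK. The sketch below illustrates it with a tiny hand-written stopword set and plain whitespace splitting; both `demo_stop_words` and `remove_stopwords` are stand-ins for this demo only, and NLTK's English list is much longer.

```python
# Small stand-in stopword set (an assumption for this demo, not NLTK's list)
demo_stop_words = {"the", "and", "at", "a", "is", "on"}

def remove_stopwords(text, stop_words):
    """Split on whitespace and drop any token found in stop_words."""
    return [token for token in text.split() if token not in stop_words]

filtered = remove_stopwords("the cat sat on the mat and purred", demo_stop_words)
print(filtered)
```

Because the comparison is case-sensitive, lowercase the text first, as the cleaning step earlier in this notebook does.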

## Create Counterfactual Dataset
In Stage Two of the Chapter, we discuss counterfactuals. You can create datasets that replace target words with non-target words to test whether a model provides the same output it would have if the target words had been left intact.

Once you have created the counterfactual dataset, run your model on both the counterfactual dataset and the original dataset to see if your model scores each sentiment the same on specific tasks. 

Tools such as DiCE (https://github.com/interpretml/DiCE) and TensorFlow Responsible AI (https://www.tensorflow.org/responsible_ai/model_remediation/counterfactual/guide/creating_a_custom_counterfactual_dataset) have more information on how to train a counterfactual model using counterfactual data.

In [None]:
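# Counterfactual Dataset

replacement = {"hate": "love", "bad": "good", "ugly": "beautiful"}

# match whole words only; regex=True replaces words inside each string,
# while a plain dict would only replace cells that equal a key exactly
pattern = {rf"\b{word}\b": new for word, new in replacement.items()}

# option 1: overwrite the original column
df["column"] = df["column"].replace(pattern, regex=True)

# option 2: save the replaced text as a new column and keep the original (use instead of option 1)
df["counter_column"] = df["column"].replace(pattern, regex=True)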
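
The comparison described above, running a model on both the original and the counterfactual text, can be sketched in plain Python. Everything here is illustrative: `toy_score` is a hypothetical stand-in for a real model, and `make_counterfactual` and the sample sentences are assumptions for the demo.

```python
import re

replacement = {"hate": "love", "bad": "good", "ugly": "beautiful"}
pattern = re.compile(r"\b(" + "|".join(map(re.escape, replacement)) + r")\b")

def make_counterfactual(sentence):
    """Swap each target word for its counterpart, matching whole words only."""
    return pattern.sub(lambda m: replacement[m.group(1)], sentence)

def toy_score(sentence):
    """Hypothetical stand-in for a model: +1 per positive word, -1 per negative word."""
    positive, negative = {"love", "good", "beautiful"}, {"hate", "bad", "ugly"}
    tokens = sentence.split()
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

originals = ["i hate this bad movie", "the badge looked fine"]
pairs = [(s, make_counterfactual(s)) for s in originals]
# Score gap between each original and its counterfactual; 0 means the output did not change
gaps = [toy_score(cf) - toy_score(orig) for orig, cf in pairs]
print(pairs)
print(gaps)
```

The word-boundary markers keep "badge" from being rewritten as "goodge". With a real model, a large gap on a pair that should be treated alike is a sign the model is sensitive to the swapped words.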

## Remove (Prune) Hateful Words In a String Within a List

If you have a list of hateful words, you can prune them from your dataset using this code. We learn about the benefits of pruning in Stage Three of the Chapter. 

In [None]:
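ls = ['hate', 'bad', 'ugly']

# one regex that matches any listed word as a whole word; words is your list of strings
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, ls)) + r')\b', flags=re.IGNORECASE)
words = [pattern.sub(' ', w) for w in words] #removes all words from list and replaces with a space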

In [None]:
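# can also replace hateful words with better ones
ls = ['hate', 'bad', 'ugly']

rep = ['love', 'good', 'beautiful']

mapping = dict(zip(ls, rep)) # pair each hateful word with its replacement
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, ls)) + r')\b')
words = [pattern.sub(lambda m: mapping[m.group(0)], w) for w in words]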
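
Pruning can leave stray double spaces behind and can miss capitalized variants. The following sketch shows one way to handle both, assuming the same three-word list; `prune` is a hypothetical helper name.

```python
import re

hateful = ["hate", "bad", "ugly"]
prune_pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, hateful)) + r")\b", re.IGNORECASE
)

def prune(sentence):
    """Delete listed words (any casing), then collapse the leftover whitespace."""
    return " ".join(prune_pattern.sub(" ", sentence).split())

pruned = [prune(s) for s in ["I HATE ugly weather", "a bad badge"]]
print(pruned)
```

Only whole words are removed, so "badge" survives even though it contains "bad".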

## Flag and Tag Words 

You can flag hateful words in a dataset and even create a new column designating whether a flagged word is in a given sentence. We discuss the benefits of flagging words in a dataset in Stage Three of the Chapter. 

In [None]:
# flag hateful words in a "flagged" column

ls = ['hate', 'bad', 'ugly']


# build one regex that matches any listed word as a whole word
hate_list = r'\b(?:' + '|'.join([re.escape(x.strip()) for x in ls]) + r')\b'

# add a 0/1 column marking rows whose text contains a flagged word
def flagged_column(df, column, flag_name, hate_list):
    #convert True to 1 and False to 0
    df[flag_name] = df[column].str.contains(hate_list, case=False, na=False).astype(int)
    return df


In [None]:
flagged_column(df_name, "text_column_name", "flagged_column_name", hate_list) # add df name, name of the text column, name of new flagged column and pattern of words to flag
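
When reviewing flagged rows it can help to record which words triggered the flag, not just whether one did. This is a minimal plain-Python sketch of that idea; `flag_rows`, the record keys and the sample sentences are all hypothetical.

```python
import re

flag_terms = ["hate", "bad", "ugly"]
flag_pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, flag_terms)) + r")\b", re.IGNORECASE
)

def flag_rows(sentences):
    """Return one record per sentence with a 0/1 flag and the terms that matched."""
    rows = []
    for sentence in sentences:
        matches = sorted({m.lower() for m in flag_pattern.findall(sentence)})
        rows.append({"text": sentence, "flagged": int(bool(matches)), "matched": matches})
    return rows

rows = flag_rows(["What an ugly, bad day", "A lovely day", "Badminton is fun"])
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` if you want the flags alongside the text in a DataFrame.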

## Let's Think

What are some ways that these functions can be used to improve your projects using Large Language Models? 

What are you excited to create now that you have some tools to begin exploring Large Language Model training responsibly?