# Data Cleaning/Preparation for LLMs

Training a Large Language Model (LLM) requires massive amounts of high-quality, diverse, and well-structured text data. However, raw data from the web, documents, chat logs, or repositories is often noisy, inconsistent, and redundant.

**Without proper cleaning and preprocessing:**

- The model may learn incorrect patterns or biases.
- Training can be inefficient and resource-heavy.
- Model performance on real-world tasks may be poor.

**Common Challenges in Data Cleaning for LLMs**

1. Data Volume

- LLMs are trained on billions of tokens, so manual cleaning is impractical.
- Cleaning at this scale requires automated yet intelligent workflows.

2. Data Diversity

- Data may come from multiple domains, formats, and languages.
- Requires normalization to maintain coherence and consistency.

3. Noise and Irrelevant Content

- Raw text includes typos, HTML tags, boilerplate code, spam, etc.
- Models trained on noisy data may produce unreliable or harmful outputs.

4. Bias and Toxicity

- Unfiltered data can contain societal, cultural, or political biases.
- Requires specialized filtering techniques to promote fairness and safety.

5. Redundancy and Duplicates

- Duplicate content can lead to overfitting and wasted computation.
- De-duplication is crucial for scalable and diverse learning.

6. Privacy and Compliance

- Training data must not include Personally Identifiable Information (PII).
- Legal constraints like GDPR make ethical data handling critical.
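
As a minimal sketch of the de-duplication challenge listed above, exact and trivially different duplicates can be dropped by hashing a normalized form of each document. The names `normalize` and `deduplicate`, the sample documents, and the normalization rule (lowercasing plus whitespace collapsing) are illustrative assumptions; catching near-duplicates would need an additional similarity-based step.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())

def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

# Made-up sample documents for illustration.
docs = [
    "How do I reset my password?",
    "how do I   reset my password?",   # same text, different case and spacing
    "Where can I find the schedule?",
]
unique_docs = deduplicate(docs)
print(len(unique_docs))  # 2
```

Storing only the hashes keeps memory use proportional to the number of unique documents rather than their total length.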

[The data dividend: Fueling generative AI (McKinsey)](https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-data-dividend-fueling-generative-ai)

### Data Quality for LLM Training: Key Parameters & Tools

1. Completeness: Ensures required fields are filled and entries are not missing. 
2. Cleanliness (Noise): Measures presence of irrelevant characters, malformed text, or corruption.
3. Redundancy: Identifies duplicate or near-duplicate entries.
4. Relevance: Measures how well data aligns with the target domain or task. tools: KeyBERT, RAKE, YAKE (keyword extraction)
5. Diversity: Checks variety in topics, styles, and intents. tools: Gensim (LDA)
6. Bias and Toxicity: Evaluates presence of harmful, offensive, or biased language.
    - [AI fairness 360](https://research.ibm.com/blog/ai-fairness-360)
    - Perspective API
7. Grammar and Coherence: Measures how fluent and logically connected the text is. tools: textstat (readability scores)
8. PII Detection: Identifies sensitive personal information in the dataset. tools: [Microsoft Presidio](https://microsoft.github.io/presidio/)
9. Scalability and Automation: For handling large datasets and building repeatable data pipelines. tools: Databricks and Airflow.
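
To illustrate the PII detection item, the sketch below masks e-mail addresses and US-style phone numbers with regular expressions. It is a simplified stand-in and does not use Microsoft Presidio's API: the two patterns and the `mask_pii` helper are assumptions made for illustration, and they will miss many real-world formats.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace matched e-mail addresses and phone numbers with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)

sample = "Contact jane.doe@example.com or call 555-123-4567."
masked = mask_pii(sample)
print(masked)  # Contact <EMAIL> or call <PHONE>.
```

Replacing matches with placeholders, rather than deleting them, keeps the sentence structure intact for downstream training.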


**Open-Source Data Sources Suitable for LLM Training**

1. [The Pile](https://pile.eleuther.ai/) - Diverse text sources (academic, code, books, web, etc.)
2. [C4 (Colossal Clean Crawled Corpus)](https://huggingface.co/datasets/legacy-datasets/c4) - Filtered web text from Common Crawl
3. [Harvard Library Datasets](https://curiosity.lib.harvard.edu/digital-collections/digital-collections)


**Synthetic Data Sources**

1. [Mostly AI](https://mostly.ai/)
2. [Gretel.ai](https://gretel.ai/)

### Tools and Techniques for LLM Data Cleaning

**Pandas – Data Structuring and Management**

Pandas is primarily used to:

- Load large datasets from CSV, JSON, or text files.
- Handle missing or null entries.
- Remove duplicates and irrelevant entries.
- Filter or tag specific content types.
- Prepare datasets for NLP pipeline input.
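
The operations listed above can be sketched on a small in-memory DataFrame. The column names mirror the question/answer layout used later in this notebook; the sample rows and the "spam" keyword filter are made up for illustration.

```python
import pandas as pd

# Made-up rows: one exact duplicate, one missing question, one irrelevant entry.
raw = pd.DataFrame({
    "question": ["How do I reset my password?", "How do I reset my password?",
                 None, "Buy cheap pills now"],
    "answer": ["Use the reset link.", "Use the reset link.",
               "Orphan answer", "spam"],
})

cleaned = (
    raw.dropna(subset=["question", "answer"])   # drop rows with missing fields
       .drop_duplicates()                        # drop exact duplicate rows
       .loc[lambda d: ~d["answer"].str.contains("spam", case=False)]  # drop irrelevant rows
       .reset_index(drop=True)
)
print(cleaned)
```

Chaining the steps keeps the raw frame untouched, which makes it easy to compare row counts before and after cleaning.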

**NLTK – Text Preprocessing and Linguistic Cleaning**

- NLTK is widely used for cleaning and standardizing raw text before it’s tokenized or embedded.



In [20]:
! pip install pandas nltk






In [24]:
import pandas as pd

df = pd.read_csv('C:/GenAI/AI-for-beginners/AI-for-beginners/week1/datasets/revature_chatbot_qa_uncleaned.csv')

df.columns = ['question', 'answer']

df.head(10)



Unnamed: 0,question,answer
0,Tell me more about Revature's technology train...,Revature provides training in many current tec...
1,What support do Revature employees get on clie...,Revature typically offers ongoing support for ...
2,How does Revature handle employee feedback and...,Revature generally has a system for regular pe...
3,Are there opportunities for development beyond...,Yes Revature generally encourages continuous p...
4,What is Revature's approach to balancing work ...,The balance between work and personal life can...
5,Can you describe a typical project for a Revat...,A Revature associate trained in full-stack Jav...
6,What are the travel expectations for Revature ...,Revature employees should generally be willing...
7,Does Revature have employee resource groups?,Many companies that value diversity and inclus...
8,How does Revature contribute to the tech commu...,Revature often partners with universities to c...
9,What is the typical commitment duration when j...,Revature's employment agreements often include...


In [25]:
# lowercase

df['question'] = df['question'].str.lower()
df['answer'] = df['answer'].str.lower()

df.head(10)



Unnamed: 0,question,answer
0,tell me more about revature's technology train...,revature provides training in many current tec...
1,what support do revature employees get on clie...,revature typically offers ongoing support for ...
2,how does revature handle employee feedback and...,revature generally has a system for regular pe...
3,are there opportunities for development beyond...,yes revature generally encourages continuous p...
4,what is revature's approach to balancing work ...,the balance between work and personal life can...
5,can you describe a typical project for a revat...,a revature associate trained in full-stack jav...
6,what are the travel expectations for revature ...,revature employees should generally be willing...
7,does revature have employee resource groups?,many companies that value diversity and inclus...
8,how does revature contribute to the tech commu...,revature often partners with universities to c...
9,what is the typical commitment duration when j...,revature's employment agreements often include...
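
Lowercasing is one form of text normalization. For text that may contain non-ASCII characters, a somewhat more robust variant (an addition to this notebook's approach, not part of it) applies Unicode NFKC normalization followed by `str.casefold()`, both from the Python standard library. The helper name `normalize_case` is hypothetical.

```python
import unicodedata

def normalize_case(text: str) -> str:
    # NFKC folds compatibility characters (ligatures, full-width letters)
    # into their plain equivalents; casefold() is a more aggressive lower().
    return unicodedata.normalize("NFKC", text).casefold()

print(normalize_case("\ufb01le"))   # the 'fi' ligature becomes 'fi' -> file
print(normalize_case("Straße"))     # strasse
```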


In [26]:
# collapse repeated whitespace and trim both ends

def remove_whitespaces(text):
    # split() with no arguments splits on any run of whitespace
    return ' '.join(text.split())

df['question'] = df['question'].apply(remove_whitespaces)
df['answer'] = df['answer'].apply(remove_whitespaces)
df.head(10)


Unnamed: 0,question,answer
0,tell me more about revature's technology train...,revature provides training in many current tec...
1,what support do revature employees get on clie...,revature typically offers ongoing support for ...
2,how does revature handle employee feedback and...,revature generally has a system for regular pe...
3,are there opportunities for development beyond...,yes revature generally encourages continuous p...
4,what is revature's approach to balancing work ...,the balance between work and personal life can...
5,can you describe a typical project for a revat...,a revature associate trained in full-stack jav...
6,what are the travel expectations for revature ...,revature employees should generally be willing...
7,does revature have employee resource groups?,many companies that value diversity and inclus...
8,how does revature contribute to the tech commu...,revature often partners with universities to c...
9,what is the typical commitment duration when j...,revature's employment agreements often include...
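
The `remove_whitespaces` function assumes every cell is a string; a missing value (`NaN`) would raise an `AttributeError` on `.split()`. One alternative, sketched here on made-up data, uses pandas' vectorized string methods, which pass missing values through instead of failing.

```python
import pandas as pd

# Made-up sample with irregular spacing and one missing value.
s = pd.Series(["  tell   me\tmore ", None, "what  is\nthis?"])

# Replace any run of whitespace with a single space, then trim both ends.
collapsed = s.str.replace(r"\s+", " ", regex=True).str.strip()
print(collapsed.tolist())
```

Rows with missing values can then be dropped or filled in a separate, explicit step.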


In [28]:
# tokenize

import nltk

nltk.download('punkt')
nltk.download('punkt_tab')


from nltk import word_tokenize

df['question'] = df['question'].apply(word_tokenize)
df['answer'] = df['answer'].apply(word_tokenize)

df.head(10)




Unnamed: 0,question,answer
0,"[tell, me, more, about, revature, 's, technolo...","[revature, provides, training, in, many, curre..."
1,"[what, support, do, revature, employees, get, ...","[revature, typically, offers, ongoing, support..."
2,"[how, does, revature, handle, employee, feedba...","[revature, generally, has, a, system, for, reg..."
3,"[are, there, opportunities, for, development, ...","[yes, revature, generally, encourages, continu..."
4,"[what, is, revature, 's, approach, to, balanci...","[the, balance, between, work, and, personal, l..."
5,"[can, you, describe, a, typical, project, for,...","[a, revature, associate, trained, in, full-sta..."
6,"[what, are, the, travel, expectations, for, re...","[revature, employees, should, generally, be, w..."
7,"[does, revature, have, employee, resource, gro...","[many, companies, that, value, diversity, and,..."
8,"[how, does, revature, contribute, to, the, tec...","[revature, often, partners, with, universities..."
9,"[what, is, the, typical, commitment, duration,...","[revature, 's, employment, agreements, often, ..."


In [29]:
# remove stopwords

nltk.download('stopwords')

from nltk.corpus import stopwords

# a set gives constant-time membership checks (a list is scanned linearly)
en_stopwords = set(stopwords.words('english'))

def remove_stopwords(text):
    # keep only the tokens that are not English stopwords
    return [token for token in text if token not in en_stopwords]

df['question'] = df['question'].apply(remove_stopwords)
df['answer'] = df['answer'].apply(remove_stopwords)

df.head(10)




Unnamed: 0,question,answer
0,"[tell, revature, 's, technology, training, .]","[revature, provides, training, many, current, ..."
1,"[support, revature, employees, get, client, si...","[revature, typically, offers, ongoing, support..."
2,"[revature, handle, employee, feedback, evaluat...","[revature, generally, system, regular, perform..."
3,"[opportunities, development, beyond, initial, ...","[yes, revature, generally, encourages, continu..."
4,"[revature, 's, approach, balancing, work, life...","[balance, work, personal, life, vary, based, s..."
5,"[describe, typical, project, revature, associa...","[revature, associate, trained, full-stack, jav..."
6,"[travel, expectations, revature, employees, ?]","[revature, employees, generally, willing, relo..."
7,"[revature, employee, resource, groups, ?]","[many, companies, value, diversity, inclusion,..."
8,"[revature, contribute, tech, community, ?]","[revature, often, partners, universities, crea..."
9,"[typical, commitment, duration, joining, revat...","[revature, 's, employment, agreements, often, ..."


In [30]:
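# remove punctuation

from nltk.tokenize import RegexpTokenizer

# build the tokenizer once; r'\w+' keeps runs of letters, digits, and underscores
word_only_tokenizer = RegexpTokenizer(r'\w+')

def remove_punctuations(text):
    # re-joining and re-tokenizing also splits hyphenated tokens ('full-stack' -> 'full', 'stack')
    return word_only_tokenizer.tokenize(' '.join(text))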

df['question'] = df['question'].apply(remove_punctuations)
df['answer'] = df['answer'].apply(remove_punctuations)

df.head(10)


Unnamed: 0,question,answer
0,"[tell, revature, s, technology, training]","[revature, provides, training, many, current, ..."
1,"[support, revature, employees, get, client, si...","[revature, typically, offers, ongoing, support..."
2,"[revature, handle, employee, feedback, evaluat...","[revature, generally, system, regular, perform..."
3,"[opportunities, development, beyond, initial, ...","[yes, revature, generally, encourages, continu..."
4,"[revature, s, approach, balancing, work, life,...","[balance, work, personal, life, vary, based, s..."
5,"[describe, typical, project, revature, associate]","[revature, associate, trained, full, stack, ja..."
6,"[travel, expectations, revature, employees]","[revature, employees, generally, willing, relo..."
7,"[revature, employee, resource, groups]","[many, companies, value, diversity, inclusion,..."
8,"[revature, contribute, tech, community]","[revature, often, partners, universities, crea..."
9,"[typical, commitment, duration, joining, revat...","[revature, s, employment, agreements, often, i..."
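
The `\w+` pattern treats hyphens and apostrophes as separators, which is why `full-stack` appears as two tokens in the output above. If hyphenated words and contractions should survive, a pattern along the lines of the one below could be applied to the raw text instead. The alternative pattern and the sample sentence are illustrative assumptions, not something this notebook uses.

```python
import re

raw = "Revature's full-stack Java training, isn't it?"

# Plain \w+ splits on every non-word character.
plain = re.findall(r"\w+", raw)

# Allow internal hyphens and apostrophes between runs of word characters.
keep_joined = re.findall(r"\w+(?:[-']\w+)*", raw)

print(plain)        # ['Revature', 's', 'full', 'stack', 'Java', 'training', 'isn', 't', 'it']
print(keep_joined)  # ["Revature's", 'full-stack', 'Java', 'training', "isn't", 'it']
```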


In [None]:
# Lemmatize

# nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize

# create the lemmatizer once instead of on every call
wordnet_lem = WordNetLemmatizer()

def lemmatize_text(text):
    result = []
    for token, pos in pos_tag(text):
        # Penn Treebank tags: J* = adjective, R* = adverb, N* = noun, V* = verb
        pos = pos[0].lower()
        if pos == 'j':
            pos = 'a'  # WordNet labels adjectives 'a', not 'j'
        elif pos not in ['r', 'n', 'v']:
            pos = 'n'  # fall back to noun for every other tag
        result.append(wordnet_lem.lemmatize(token, pos))
    return result

df['question'] = df['question'].apply(lemmatize_text)
df['answer'] = df['answer'].apply(lemmatize_text)

df.head(10)






Unnamed: 0,question,answer
0,"[tell, revature, s, technology, training]","[revature, provide, train, many, current, tech..."
1,"[support, revature, employee, get, client, site]","[revature, typically, offer, ongoing, support,..."
2,"[revature, handle, employee, feedback, evaluat...","[revature, generally, system, regular, perform..."
3,"[opportunity, development, beyond, initial, tr...","[yes, revature, generally, encourage, continuo..."
4,"[revature, s, approach, balance, work, life, e...","[balance, work, personal, life, vary, base, sp..."
5,"[describe, typical, project, revature, associate]","[revature, associate, train, full, stack, java..."
6,"[travel, expectation, revature, employee]","[revature, employee, generally, willing, reloc..."
7,"[revature, employee, resource, group]","[many, company, value, diversity, inclusion, o..."
8,"[revature, contribute, tech, community]","[revature, often, partner, university, create,..."
9,"[typical, commitment, duration, joining, revat...","[revature, s, employment, agreement, often, in..."
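
`pos_tag` returns Penn Treebank tags (`NN`, `VBD`, `JJ`, `RB`, and so on), while `WordNetLemmatizer.lemmatize` expects a WordNet code such as `'n'`, `'v'`, `'a'`, or `'r'`. Adjective tags start with `J` in the Penn tagset but map to `'a'` in WordNet, so taking the first letter alone is not enough for adjectives. A small explicit mapping, sketched below with the hypothetical helper name `penn_to_wordnet`, makes that translation easy to check in isolation.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank POS tag to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"   # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"   # verb (VB, VBD, VBG, ...)
    if tag.startswith("RB"):
        return "r"   # adverb (RB, RBR, RBS)
    return "n"       # nouns and every other tag default to noun

for tag in ["JJ", "VBD", "RB", "NNS", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Defaulting to noun matches the lemmatizer's own default part of speech.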


In [35]:
# remove tokens of length 1 or less (e.g. the stray 's' left after punctuation removal)

def remove_word(text):
    result = []
    for token in text:
        if len(token) > 1:
            result.append(token)
    return result

df['question'] = df['question'].apply(remove_word)
df['answer'] = df['answer'].apply(remove_word)
df.head(10)



Unnamed: 0,question,answer
0,"[tell, revature, technology, training]","[revature, provide, train, many, current, tech..."
1,"[support, revature, employee, get, client, site]","[revature, typically, offer, ongoing, support,..."
2,"[revature, handle, employee, feedback, evaluat...","[revature, generally, system, regular, perform..."
3,"[opportunity, development, beyond, initial, tr...","[yes, revature, generally, encourage, continuo..."
4,"[revature, approach, balance, work, life, empl...","[balance, work, personal, life, vary, base, sp..."
5,"[describe, typical, project, revature, associate]","[revature, associate, train, full, stack, java..."
6,"[travel, expectation, revature, employee]","[revature, employee, generally, willing, reloc..."
7,"[revature, employee, resource, group]","[many, company, value, diversity, inclusion, o..."
8,"[revature, contribute, tech, community]","[revature, often, partner, university, create,..."
9,"[typical, commitment, duration, joining, revat...","[revature, employment, agreement, often, inclu..."


In [36]:
# join each token list back into a single space-separated string
df['question'] = [' '.join(map(str,token)) for token in df['question']]
df['answer'] = [' '.join(map(str,token)) for token in df['answer']]

df.head(10)


Unnamed: 0,question,answer
0,tell revature technology training,revature provide train many current technology...
1,support revature employee get client site,revature typically offer ongoing support emplo...
2,revature handle employee feedback evaluation,revature generally system regular performance ...
3,opportunity development beyond initial trainin...,yes revature generally encourage continuous pr...
4,revature approach balance work life employee,balance work personal life vary base specific ...
5,describe typical project revature associate,revature associate train full stack java might...
6,travel expectation revature employee,revature employee generally willing relocate a...
7,revature employee resource group,many company value diversity inclusion often e...
8,revature contribute tech community,revature often partner university create caree...
9,typical commitment duration joining revature,revature employment agreement often include co...


In [37]:
df.to_csv('./cleaned_data.csv', index=False, encoding='utf-8')
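
The individual steps in this notebook can also be chained into one reusable function. The sketch below is a simplified, standard-library-only approximation: it uses a regex instead of `word_tokenize`, a tiny hand-picked stopword set instead of NLTK's English list, and skips lemmatization, so its output will not always match the notebook's. `clean_text` and `DEMO_STOPWORDS` are hypothetical names.

```python
import re

# Tiny illustrative stopword set; NLTK's English list is much larger.
DEMO_STOPWORDS = {"a", "about", "are", "do", "for", "is", "me", "more", "the", "what"}

def clean_text(text: str) -> str:
    tokens = re.findall(r"\w+", text.lower())                # lowercase, drop punctuation
    tokens = [t for t in tokens if t not in DEMO_STOPWORDS]  # drop stopwords
    tokens = [t for t in tokens if len(t) > 1]               # drop single-character tokens
    return " ".join(tokens)

print(clean_text("Tell me   more about Revature's technology training."))
# tell revature technology training
```

Keeping the steps in one function makes it straightforward to apply the same cleaning to new data, for example with `df['question'].apply(clean_text)`.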