## Building Custom Datasets for LLMs

### Lesson Objectives
By the end of this lesson, you will be able to:

- Identify relevant data sources for your domain
- Collect public data in an LLM-friendly way
- Clean and structure data for easy tokenization
- Structure datasets for fine-tuning Hugging Face models on causal language modeling
- Structure datasets for fine-tuning Hugging Face models on question answering

![image.png](attachment:3c045b63-abff-4b14-b4ea-7c6b7cf26b1a.png)

## Collecting Data

### Overview
In practice, internal data sources with clear formatting are rare. Most often, data is poorly structured and may lack context. To create a usable dataset, we often need to either supplement internal data with external sources or build on external datasets from the start.

### Steps to Collect Data
1. **Identify Potential Data Sources**: Start from what you want your model to accomplish, since the goal determines which sources are relevant.
2. **Define the Task**: State the task concretely and determine whether machine learning is the right tool for it.
3. **Understand Input and Output**:
   - What does typical input look like? (structured data, plain text, etc.)
   - What is the expected output? (summaries, classifications, etc.)
4. **Explore Data Resources**: Investigate what data resources are commonly used in the field.
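
One way to make step 3 concrete is to write a few input/output pairs down in a fixed record format before collecting anything at scale. The sketch below stores examples as JSON Lines (one JSON object per line). The `Example` class and its field names (`input_text`, `target_text`, `source`) are illustrative assumptions, not a required schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Example:
    """One training example; the field names are illustrative, not a standard."""
    input_text: str
    target_text: str
    source: str  # where the example came from, useful for licensing audits


def to_jsonl(examples):
    """Serialize examples to JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)


def from_jsonl(text):
    """Parse JSON Lines back into Example objects, skipping blank lines."""
    return [Example(**json.loads(line)) for line in text.splitlines() if line.strip()]


examples = [
    Example("What is an API?", "A programmatic interface to hosted data.", "hand-written"),
    Example("What is scraping?", "Retrieving HTML pages over HTTP.", "hand-written"),
]
jsonl_text = to_jsonl(examples)
```

Keeping a `source` field on every record is a design choice that makes it easier to trace an example back to its origin when licensing questions come up later.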

### Focus on External Data
This lesson focuses on external data sources, particularly the Internet, because they are often what makes it possible to build an effective dataset.


## Methods of Collecting Data

### APIs (Application Programming Interfaces)
- APIs provide programmatic access to hosted data.
- Typically, you need to register for API access to obtain an API key.
- Making requests to an API usually returns structured data, often in formats like JSON or XML.
- Because responses are already structured, APIs are well suited to data collection and are easy to integrate into larger systems.
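
The points above can be sketched with the Python standard library alone. This is a minimal illustration, not a client for any real service: the endpoint `https://api.example.com/v1/articles`, the bearer-token header, and the `{"results": [...]}` response shape are all assumptions, so check the documentation of the API you actually use. The network call is defined but not executed; the demonstration runs on a canned JSON string.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint used only for illustration; substitute a real API.
BASE_URL = "https://api.example.com/v1/articles"


def build_request(base_url, api_key, params):
    """Build a GET request that carries the API key in an Authorization header."""
    url = f"{base_url}?{urlencode(params)}"
    return Request(url, headers={
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Accept": "application/json",
    })


def extract_texts(payload, field="text"):
    """Pull non-empty text fields out of a payload shaped like {"results": [...]}."""
    data = json.loads(payload)
    return [item[field] for item in data.get("results", []) if item.get(field)]


def fetch_texts(api_key, query):
    """Perform the network call; defined for completeness, not run in this sketch."""
    request = build_request(BASE_URL, api_key, {"q": query, "limit": 10})
    with urlopen(request, timeout=10) as response:
        return extract_texts(response.read().decode("utf-8"))


# Offline demonstration with a canned response instead of a live call.
sample_payload = '{"results": [{"id": 1, "text": "First document."}, {"id": 2, "text": ""}]}'
texts = extract_texts(sample_payload)
```

Separating request construction from response parsing keeps the parsing logic testable without network access or a real API key.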

### Web Scraping
- Scraping involves making a series of HTTP requests to retrieve HTML documents from web pages.
- This method can be resource-intensive for the target site and may lead to being blocked if the site does not permit scraping.
- It's important to scrape responsibly and minimally to avoid overloading the host site.
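
A minimal sketch of responsible scraping, using only the standard library: pages are fetched one at a time with a pause between requests, and visible text is extracted from the returned HTML. The two-second delay, the `dataset-builder-example/0.1` user agent string, and the helper names are illustrative assumptions. The `scrape` function is defined but not run here; the demonstration parses an inline HTML string.

```python
import time
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def scrape(urls, delay_seconds=2.0, user_agent="dataset-builder-example/0.1"):
    """Fetch pages sequentially with a pause between requests; not run here."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # spread the load on the host site
        request = Request(url, headers={"User-Agent": user_agent})
        with urlopen(request, timeout=10) as response:
            pages[url] = html_to_text(response.read().decode("utf-8", errors="replace"))
    return pages


sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Some <b>bold</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(sample_html)
```

Real pages usually need more cleanup than this (navigation menus, footers, cookie banners), so treat the extractor as a starting point.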

## Legal and Ethical Considerations
- **Licensing**: Some data may require licensing, especially for commercial use. Always check the licensing terms of the data you collect.
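- **Scraping Legality**: Scraping publicly available pages is often legal, but the rules vary by jurisdiction and by the type of data, and scraping is not always welcome. Check the site's terms of service before collecting data.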
- **Data Toxicity**: Not all data on the Internet is safe to use. Be cautious of harmful or offensive content, since a model trained on it can reproduce it.
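
Alongside the terms of service, many sites publish a `robots.txt` file that states which paths crawlers may fetch. It is not a substitute for reading the terms or the license, but it is easy to check programmatically. The sketch below uses Python's `urllib.robotparser` on an inline example file; the rules and the user agent name `dataset-builder-example` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; a real file would be fetched from the site itself.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_checker(robots_lines):
    """Build a parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


checker = make_checker(robots_txt.splitlines())
allowed = checker.can_fetch("dataset-builder-example", "https://example.com/articles/1")
blocked = checker.can_fetch("dataset-builder-example", "https://example.com/private/data")
delay = checker.crawl_delay("dataset-builder-example")  # seconds requested by the site
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` would load the file over HTTP instead of parsing an inline string.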

## Conclusion
The Internet can be a valuable resource for collecting relevant data, but it's essential to operate within legal and ethical frameworks. Understanding the methods of data collection and the considerations involved will help ensure successful and responsible data gathering.
