# Document Loaders: Streamlined Data Ingestion
LangChain offers many document loaders; this section focuses on a few important ones. `TextLoader` handles plain text files, `CSVLoader` handles CSV files, and `PyPDFLoader` handles PDF files, giving easy access to page content and metadata. `SeleniumURLLoader` loads HTML documents from URLs that require JavaScript rendering. Lastly, `GoogleDriveLoader` integrates with Google Drive and imports data from Google Docs or folders.

---

## Set environment variables

In [1]:
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

## `TextLoader` (Text)
Handles plain text files and returns the whole file as a single document. Use the `encoding` argument to specify the file's character encoding.

In [3]:
from langchain.document_loaders import TextLoader

loader = TextLoader("../../data/PaLM.txt")
documents = loader.load()
print(len(documents))
documents

1


[Document(page_content="Google opens up its AI language model PaLM to challenge OpenAI and GPT-3\nGoogle is offering developers access to one of its most advanced AI language models: PaLM.\nThe search giant is launching an API for PaLM alongside a number of AI enterprise tools\nit says will help businesses “generate text, images, code, videos, audio, and more from\nsimple natural language prompts.”\n\nPaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or\nMeta's LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,\nPaLM is a flexible system that can potentially carry out all sorts of text generation and\nediting tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for\nexample, or you could use it for tasks like summarizing text or even writing code.\n(It's similar to features Google also announced today for its Workspace apps like Google\nDocs and Gmail.)\n", metadata={'source': '../../data/PaLM.txt'})
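To make the role of `encoding` concrete, here is a minimal standard-library sketch of the behavior shown above: read one file with a given encoding and wrap it in a single document that records its source. `SimpleDocument` and `load_text` are hypothetical names made up for this illustration; this is an approximation of what the loader does, not LangChain's implementation.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> List[SimpleDocument]:
    """Read the whole file as one document, decoding it with `encoding`."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Write a small Latin-1 file, then load it with the matching encoding.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_bytes("café au lait\n".encode("latin-1"))
    docs = load_text(sample_path, encoding="latin-1")

print(len(docs))
print(docs[0].page_content)
```

Decoding the same bytes as UTF-8 would raise a `UnicodeDecodeError`, which is the kind of failure the `encoding` argument is meant to avoid.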

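## `CSVLoader` (CSV)
Handles CSV files. Each row becomes one document, and the row index is stored in the document's metadata.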

In [2]:
from langchain.document_loaders import CSVLoader

file = "../../data/OutdoorClothingCatalog_1000_small.csv"
loader = CSVLoader(file_path=file)
data = loader.load()
data[0]

Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \r\n\r\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \r\n\r\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \r\n\r\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \r\n\r\nQuestions? Please contact us for any inquiries.", metadata={'source': '../../data/OutdoorClothingCatalog_1000_small.csv', 'row': 0})
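The output above suggests how a row is turned into a document: each column is rendered as `column: value` on its own line, and the metadata keeps the source file and the row index. The following standard-library sketch reproduces that shape under those assumptions. `rows_to_documents` and the two-row catalog are hypothetical and made up for this example; it is not LangChain's code.

```python
import csv
import io
from typing import Dict, List


def rows_to_documents(csv_text: str, source: str) -> List[Dict]:
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for i, row in enumerate(reader):
        # One "column: value" line per column, joined with newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": i}}
        )
    return documents


# Hypothetical two-row catalog, made up for this example.
sample = (
    "name,description\n"
    "Trail Boots,Waterproof leather\n"
    "Rain Shell,Packable jacket\n"
)
docs = rows_to_documents(sample, source="catalog.csv")
print(docs[0]["page_content"])
print(docs[1]["metadata"])
```

Keeping the row index in the metadata makes it possible to trace a retrieved document back to the exact line of the original file.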

## `PyPDFLoader` (PDF)
Handles PDF files. LangChain provides several loaders for PDF files, including `PyPDFLoader` and `PDFMinerLoader`. `PyPDFLoader` is simple to use and gives easy access to page content and metadata, such as page numbers, in a structured format. Its main drawback is more limited text extraction than `PDFMinerLoader`. The `load_and_split()` method loads the file and splits it into chunks in a single call.

In [3]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("../../data/WritingArticleSummary.pdf")
pages = loader.load_and_split()

print(pages[0])

page_content='1 \n \nAcademic Skills, Trent University    www.trentu.ca/academicskills  \nPeterborough, ON Canada                     © 2014  \n  Writing Article Summaries  \n \n \nUnderstanding Article Summaries  \nAn article summary is a short, focused paper about one scholarly \narticle. This paper is informed by critical reading of an article. For \nargumentative articles, the summary identifies, explains, and \nanalyses the thesis and supporting arguments; for empirica l articles, \nthe summary identifies, explains, and analyses the research \nquestions, methods, and findings.  \nAlthough article summaries are often short and rarely account for a \nlarge portion of your grade, they are a strong indicator of your \nreading and writ ing skills. Professors ask you to write article \nsummaries to help you to develop essential skills in critical reading, \nsummarizing, and clear, organized writing. Furthermore, an article \nsummary requires you to read a scholarly article quite closely

## `SeleniumURLLoader` (URL)
`SeleniumURLLoader` loads HTML documents from a list of URLs that require JavaScript rendering. The class accepts the following arguments:
- urls (List[str]): List of URLs to load.
- continue_on_failure (bool, default=True): Continues loading other URLs on failure if True.
- browser (str, default="chrome"): Browser selection, either "chrome" or "firefox".
- binary_location (Optional[str], default=None): Path to the browser binary.
- executable_path (Optional[str], default=None): Path to the browser driver executable (for example, chromedriver).
- headless (bool, default=True): Browser runs in headless mode if True.

> Note: Please provide the full path to your browser executable file in `binary_location`.

In [4]:
from langchain.document_loaders import SeleniumURLLoader

urls = [
    "https://www.youtube.com/watch?v=TFa539R09EQ&t=139s",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s",
]

loader = SeleniumURLLoader(
    urls=urls,
    binary_location=os.environ.get("BROWSER_EXEC_PATH"),
)
data = loader.load()

print(data[0])

page_content="OPENASSISTANT TAKES ON CHATGPT!\n\nInfo\n\nShopping\n\nWatch Later\n\nShare\n\nCopy link\n\nTap to unmute\n\nIf playback doesn't begin shortly, try restarting your device.\n\nUp next\n\nLiveUpcoming\n\nPlay now\n\nMachine Learning Street Talk\n\nSubscribe\n\nSubscribed\n\nYou're signed out\n\nVideos that you watch may be added to the TV's watch history and influence TV recommendations. To avoid this, cancel and sign in to YouTube on your computer.\n\nSwitch camera\n\nShare\n\nAn error occurred while retrieving sharing information. Please try again later.\n\n2:19\n\n2:19 / 59:51\n\nWatch full video\n\n•\n\nScroll for details\n\nNew!\n\nWatch ads now so that you can enjoy fewer interruptions\n\nGot it\n\nAbout\n\nPress\n\nCopyright\n\nContact us\n\nCreator\n\nAdvertise\n\nDevelopers\n\nTerms\n\nPrivacy\n\nPolicy & Safety\n\nHow YouTube works\n\nTest new features\n\n© 2023 Google LLC" metadata={'source': 'https://www.youtube.com/watch?v=TFa539R09EQ&t=139s'}
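Because the loader above reads the browser path from the `BROWSER_EXEC_PATH` environment variable, a missing or mistyped value only surfaces when the browser fails to start. As a small sketch, the path can be checked first. `browser_binary_from_env` is a hypothetical helper written for this example and is not part of LangChain.

```python
import os
from pathlib import Path
from typing import Optional


def browser_binary_from_env(var_name: str = "BROWSER_EXEC_PATH") -> Optional[str]:
    """Return the path stored in `var_name`, or None if the variable is unset.

    Raises FileNotFoundError when the variable is set but does not point
    to an existing file, so the mistake is reported before any browser starts.
    """
    value = os.environ.get(var_name)
    if not value:
        return None
    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"{var_name} points to a missing file: {path}")
    return str(path)
```

The result can then be passed as `binary_location=browser_binary_from_env()`. When the variable is unset the helper returns `None`, which matches the argument's default.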


## `GoogleDriveLoader`
Prepare necessary credentials and tokens:
- By default, `GoogleDriveLoader` looks for the `credentials.json` file at `~/.credentials/credentials.json`. Use the `credentials_path` keyword argument to change this path.
- The `token.json` file follows the same principle (keyword argument `token_path`) and is created automatically the first time the loader is used.

To set up the credentials file, follow these steps:
- Create a new Google Cloud Platform project or use an existing one by visiting the Google Cloud Console. Ensure that billing is enabled for your project.
- Enable the Google Drive API by navigating to its dashboard in the Google Cloud Console and clicking "Enable."
- Create a service account by going to the Service Accounts page in the Google Cloud Console. Follow the prompts to set up a new service account.
- Assign necessary roles to the service account, such as "Google Drive API - Drive File Access" and "Google Drive API - Drive Metadata Read/Write Access," depending on your needs.
- After creating the service account, open the "Actions" menu next to it, select "Manage keys," click "Add Key," and choose "JSON" as the key type. This generates a JSON key file and downloads it to your computer; this file serves as your credentials file.

Retrieve the folder or document ID from the URL:
- Folder: https://drive.google.com/drive/u/0/folders/{folder_id}
- Document: https://docs.google.com/document/d/{document_id}/edit
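The two URL patterns above can be parsed with a short helper. This is a minimal sketch that assumes IDs consist only of letters, digits, hyphens, and underscores; the function names and the sample IDs are hypothetical placeholders, not real Drive resources.

```python
import re
from typing import Optional

# Assumption: an ID is a run of letters, digits, hyphens, and underscores.
FOLDER_RE = re.compile(r"/folders/([A-Za-z0-9_-]+)")
DOCUMENT_RE = re.compile(r"/document/d/([A-Za-z0-9_-]+)")


def extract_folder_id(url: str) -> Optional[str]:
    """Return the folder ID from a Google Drive folder URL, or None."""
    match = FOLDER_RE.search(url)
    return match.group(1) if match else None


def extract_document_id(url: str) -> Optional[str]:
    """Return the document ID from a Google Docs URL, or None."""
    match = DOCUMENT_RE.search(url)
    return match.group(1) if match else None


# Placeholder URLs that follow the two patterns listed above.
folder_url = "https://drive.google.com/drive/u/0/folders/abc123_XYZ-folder"
document_url = "https://docs.google.com/document/d/doc456_ID/edit"

print(extract_folder_id(folder_url))
print(extract_document_id(document_url))
```

The extracted folder ID is the value to pass as `folder_id` when constructing the loader.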

In [None]:
from langchain.document_loaders import GoogleDriveLoader

loader = GoogleDriveLoader(
    folder_id="your_folder_id",
    recursive=False,  # Optional: Fetch files from subfolders recursively. Defaults to False.
)
docs = loader.load()