[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/use_cases/extraction/working_with_files.ipynb)

# Working with Files

Besides raw text data, you may wish to extract information from other file types such as PowerPoint presentations or PDFs.


Use LangChain [document loaders](/modules/data_connection/document_loaders/) to parse files into a text format that can be fed into LLMs.

Please see [integrations](/docs/integrations/document_loaders) for a large list of available document loaders.

Here, we share sample code that shows how to configure parsers based on MIME type.

## Mime Type Based Parsing

For simple applications, it's often sufficient to choose a loader based on the file extension.

However, if you're writing server code, it is safer to assume that the file extension could be wrong and to infer the MIME type from the binary content of the file instead.
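
To make the difference concrete, here is a minimal sketch that uses only the Python standard library. The helper `sniff_mime_type` and its small signature table are hypothetical and written for illustration only; they recognize just a handful of formats and are not how `python-magic` works internally. The file name `report.pdf` is likewise made up.

```python
import mimetypes
from typing import Optional

# Hypothetical, deliberately tiny signature table (illustration only).
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
]


def guess_from_extension(filename: str) -> Optional[str]:
    """Trust the file name: fast, but wrong if the file was renamed."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


def sniff_mime_type(data: bytes) -> Optional[str]:
    """Inspect the leading bytes instead of the file name."""
    for prefix, mime_type in _SIGNATURES:
        if data.startswith(prefix):
            return mime_type
    # Very rough HTML check on the first few bytes.
    head = data[:64].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return None


# An HTML payload uploaded under a misleading ".pdf" name.
payload = b"<!DOCTYPE html>\n<html><body>Hello</body></html>"
print(guess_from_extension("report.pdf"))  # application/pdf
print(sniff_mime_type(payload))  # text/html
```

The extension-based guess reports a PDF, while the content-based check identifies the payload as HTML. A real service would typically rely on a dedicated library for the second step rather than a hand-written table like this one.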

Let's download some content. This will be an HTML file, but the code below will work with other file types.

In [1]:
import requests

response = requests.get("https://en.wikipedia.org/wiki/Car")
data = response.content
data[:20]

b'<!DOCTYPE html>\n<htm'

Configure the parsers

In [2]:
import magic
from langchain.document_loaders.parsers import BS4HTMLParser, PDFMinerParser
from langchain.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain.document_loaders.parsers.txt import TextParser
from langchain_community.document_loaders import Blob

# Configure the parsers that you want to use per mime-type!
HANDLERS = {
    "application/pdf": PDFMinerParser(),
    "text/plain": TextParser(),
    "text/html": BS4HTMLParser(),
}

# Instantiate a mimetype based parser with the given parsers
MIMETYPE_BASED_PARSER = MimeTypeBasedParser(
    handlers=HANDLERS,
    fallback_parser=None,
)

mime = magic.Magic(mime=True)
mime_type = mime.from_buffer(data)

# A blob represents binary data by either reference (path on file system)
# or value (bytes in memory).
blob = Blob.from_data(
    data=data,
    mime_type=mime_type,
)

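# Look up the parser registered for the detected mime-type.
# This raises a KeyError if the mime-type has no entry in HANDLERS.
parser = HANDLERS[mime_type]
documents = parser.parse(blob=blob)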

In [3]:
print(documents[0].page_content[:30].strip())

Car - Wikipedia
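
The idea of mapping MIME types to handlers, with an optional fallback for unrecognized types, can also be sketched in plain Python. This is only an illustration of the dispatch pattern under simplifying assumptions; `make_dispatcher`, `parse_text`, and `parse_unknown` are hypothetical names and this is not LangChain's implementation.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[bytes], str]


def make_dispatcher(
    handlers: Dict[str, Handler],
    fallback: Optional[Handler] = None,
) -> Callable[[bytes, str], str]:
    """Return a function that routes data to a handler by MIME type."""

    def dispatch(data: bytes, mime_type: str) -> str:
        # Use the registered handler, else the fallback (which may be None).
        handler = handlers.get(mime_type, fallback)
        if handler is None:
            raise ValueError(f"Unsupported mime type: {mime_type}")
        return handler(data)

    return dispatch


def parse_text(data: bytes) -> str:
    return data.decode("utf-8")


def parse_unknown(data: bytes) -> str:
    # Last resort: decode leniently so unknown inputs still yield text.
    return data.decode("utf-8", errors="replace")


strict = make_dispatcher({"text/plain": parse_text})
lenient = make_dispatcher({"text/plain": parse_text}, fallback=parse_unknown)

print(strict(b"hello", "text/plain"))  # hello
print(lenient(b"raw bytes", "application/octet-stream"))  # raw bytes
```

Without a fallback, an unregistered type fails loudly, which is usually preferable on a server that should reject unexpected uploads. With a fallback, every input produces some text, at the cost of possibly parsing it poorly.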
