# Google Drive Loader
This notebook covers how to retrieve documents from Google Drive.

## Prerequisites

1. Create a Google Cloud project or use an existing project
1. Enable the [Google Drive API](https://console.cloud.google.com/flows/enableapi?apiid=drive.googleapis.com)
1. [Authorize credentials for desktop app](https://developers.google.com/drive/api/quickstart/python#authorize_credentials_for_a_desktop_application)
1. `pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib`

## Instructions for retrieving your Google Docs data
By default, the `GoogleDriveLoader` expects the `credentials.json` file to be located at `~/.credentials/credentials.json`, but this is configurable using the `GOOGLE_ACCOUNT_FILE` environment variable.
The `token.json` file is stored in the same directory (or at the location given by the `token_path` parameter). Note that `token.json` is created automatically the first time you use the loader.
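
As a minimal sketch of the lookup described above, the helper below resolves the credentials path from `GOOGLE_ACCOUNT_FILE`, falls back to `~/.credentials/credentials.json`, and places `token.json` in the same directory. The function name `resolve_credential_paths` is hypothetical and only illustrates the convention; it is not part of the loader's API, which performs its own lookup.

```python
import os
from pathlib import Path


def resolve_credential_paths(env=None):
    """Return (credentials_path, token_path) following the convention above.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    default = Path.home() / ".credentials" / "credentials.json"
    credentials = Path(env.get("GOOGLE_ACCOUNT_FILE", str(default))).expanduser()
    # token.json is expected next to credentials.json unless token_path is given.
    token = credentials.parent / "token.json"
    return credentials, token


credentials_path, token_path = resolve_credential_paths()
print(credentials_path, token_path)
```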


You can obtain your folder and document id from the URL:
* Folder: https://drive.google.com/drive/u/0/folders/1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5 -> folder id is `"1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5"`
* Document: https://docs.google.com/document/d/1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw/edit -> document id is `"1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw"`

The special value `root` refers to your personal home folder.
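
If you prefer not to copy ids by hand, a small regular expression can pull them out of the two URL shapes shown above. This is only a sketch: `extract_drive_id` is a hypothetical helper, and other Drive URL formats may need additional patterns.

```python
import re

# Matches the two URL shapes shown above: .../folders/<id> and .../d/<id>/...
_ID_PATTERN = re.compile(r"/(?:folders|d)/([A-Za-z0-9_-]+)")


def extract_drive_id(url):
    """Return the folder or document id embedded in a Google Drive URL, or None."""
    match = _ID_PATTERN.search(url)
    return match.group(1) if match else None


print(extract_drive_id("https://drive.google.com/drive/u/0/folders/1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5"))
print(extract_drive_id("https://docs.google.com/document/d/1bfaMQ18_i56204VaQDVeAFpqEijJTgvurupdEDiaUQw/edit"))
```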

In [1]:
# Requires: pip install langchain-googledrive
from langchain_googledrive.document_loaders import GoogleDriveLoader

In [2]:
!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib


In [3]:
folder_id='root'
#folder_id='1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5'

In [5]:
loader = GoogleDriveLoader(
    folder_id=folder_id,
    recursive=False,
    num_results=2,  # Maximum number of files to load
)

By default, all files with the following MIME types can be converted to `Document`:
- text/text
- text/plain
- text/html
- text/csv
- text/markdown
- image/png
- image/jpeg
- application/epub+zip
- application/pdf
- application/rtf
- application/vnd.google-apps.document (GDoc)
- application/vnd.google-apps.presentation (GSlide)
- application/vnd.google-apps.spreadsheet (GSheet)
- application/vnd.google.colaboratory (Colab notebook)
- application/vnd.openxmlformats-officedocument.presentationml.presentation (PPTX)
- application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)

This list can be updated or customized; see the documentation of `GoogleDriveLoader`.

Note that the corresponding packages must be installed.

In [5]:
!pip install unstructured


In [6]:
for doc in loader.load():
    print("---")
    print(doc.page_content.strip()[:60]+"...")

---
[

NOM

PROJET + Référence Mission]



Une fois la fiche ter...
---
[

NOM

PROJET + Référence Mission]



Une fois la fiche ter...


## Customize the search pattern

Any parameter supported by the Google Drive [`list()`](https://developers.google.com/drive/api/v3/reference/files/list)
API can be set.

To specify the pattern of the Google request, you can use a `PromptTemplate()`.
The variables for the prompt can be set with `kwargs` in the constructor.

Several predefined request templates are provided to select files; they use the variables `{query}`, `{folder_id}` and/or `{mime_type}`:

| template                               | description                                                           |
| -------------------------------------- | --------------------------------------------------------------------- |
| gdrive-all-in-folder                   | Return all compatible files from a `folder_id`                        |
| gdrive-query                           | Search `query` in all drives                                          |
| gdrive-by-name                         | Search file with name `query`                                         |
| gdrive-query-in-folder                 | Search `query` in `folder_id` (and sub-folders if `recursive=True`)   |
| gdrive-mime-type                       | Search a specific `mime_type`                                         |
| gdrive-mime-type-in-folder             | Search a specific `mime_type` in `folder_id`                          |
| gdrive-query-with-mime-type            | Search `query` with a specific `mime_type`                            |
| gdrive-query-with-mime-type-and-folder | Search `query` with a specific `mime_type` and in `folder_id`         |
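
To make the variable substitution concrete, here is a minimal sketch that fills a query pattern with plain `str.format`, independently of the loader. The pattern text and the helpers `escape_drive_value` and `render_drive_query` are assumptions for illustration (the actual text of the predefined templates is not shown in this notebook). The operators `fullText contains`, `in parents` and `trashed` come from the Drive search syntax, which expects single quotes inside values to be escaped with a backslash.

```python
def escape_drive_value(value):
    """Escape single quotes so the value can sit inside a quoted Drive query term."""
    return value.replace("'", "\\'")


def render_drive_query(pattern, **variables):
    """Fill a query pattern with escaped variables (hypothetical helper)."""
    escaped = {name: escape_drive_value(str(value)) for name, value in variables.items()}
    return pattern.format(**escaped)


# Illustrative pattern, not the loader's built-in template text.
pattern = "fullText contains '{query}' and '{folder_id}' in parents and trashed=false"
print(render_drive_query(pattern, query="machine learning", folder_id="root"))
```

Escaping matters as soon as a search term contains an apostrophe; without it, the generated query would be syntactically invalid.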


In [7]:
loader = GoogleDriveLoader(
    folder_id=folder_id,
    recursive=False,
    template="gdrive-query",  # Predefined template to use
    query="machine learning",
    num_results=2,            # Maximum number of files to load
    supportsAllDrives=False,  # GDrive `list()` parameter
)
for doc in loader.load():
    print("---")
    print(doc.page_content.strip()[:60]+"...")

---
A Document with the word machine learning.



Another paragr...
---
Autre document sur le machine learning...


You can also define your own pattern with a `PromptTemplate`.

In [8]:
from langchain.prompts.prompt import PromptTemplate

loader = GoogleDriveLoader(
    folder_id=folder_id,
    recursive=False,
    template=PromptTemplate(
        input_variables=["query", "query_name"],
        template="fullText contains '{query}' and name contains '{query_name}' and trashed=false",
    ),  # Custom template
    query="machine learning",
    query_name="ML",
    num_results=2,  # Maximum number of files to load
)
for doc in loader.load():
    print("---")
    print(doc.page_content.strip()[:60]+"...")

---
Je vous invite à lire

cette page

pour suivre les recommand...
---
The Springer Series on Challenges in Machine Learning

Frank...


## Modes for GSlide and GSheet

The parameter `mode` accepts different values:
- `"document"`: return the body of each document
- `"snippets"`: return the `description` of each file (set in the metadata of Google Drive files).


The parameter `gslide_mode` accepts different values:
- `"single"`   : one document with `<PAGE BREAK>`
- `"slide"`    : one document per slide
- `"elements"` : one document per element.
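
In `"single"` mode the slides arrive as one string separated by `<PAGE BREAK>`. If individual pages are needed later, a sketch such as the following can split them apart again. `split_pages` is a hypothetical helper, and the exact whitespace around the marker in real loader output is an assumption.

```python
PAGE_BREAK = "<PAGE BREAK>"


def split_pages(page_content):
    """Split a single-document string on the page-break marker, dropping empty pages."""
    pages = (page.strip() for page in page_content.split(PAGE_BREAK))
    return [page for page in pages if page]


# Made-up sample text standing in for doc.page_content.
sample = "Title slide\n\n<PAGE BREAK>\n\nAgenda\n\n<PAGE BREAK>\n\nConclusion"
print(split_pages(sample))
```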

In [9]:
loader = GoogleDriveLoader(
    template="gdrive-mime-type",
    mime_type="application/vnd.google-apps.presentation", # Only GSlide files
    gslide_mode="slide",
    num_results=2,  # Maximum number of files to load
)
for doc in loader.load():
    print("---")
    print(doc.page_content.strip()[:60]+"...")

---
Expériences  :

UX Researcher et UX Analyst chez OCTO Techno...
---
Mini-bio : FEBO



En tant que

UX researcher

et

UX analyt...


The parameter `gsheet_mode` accepts different values:
- `"single"`: generate one document per line
- `"elements"`: one document with a markdown array and `<PAGE BREAK>` tags.

In [10]:
loader = GoogleDriveLoader(
    template="gdrive-mime-type",
    mime_type="application/vnd.google-apps.spreadsheet", # Only GSheet files
    gsheet_mode="elements",
    num_results=2,  # Maximum number of files to load
)
for doc in loader.load():
    print("---")
    print(doc.page_content.strip()[:60]+"...")

---
Instructions for use: 1. Make a copy of this document; do no...
---
Instructions for use: 2. Check you're in the new copy and it...


## Advanced usage
- Every Google file has a `description` field in its metadata. This field can be used to store a summary of the document or other indexed tags (see the method `lazy_update_description_with_summary()`).
- If you use `mode="snippets"`, only the description is used for the body. Otherwise, the description is available in `metadata['summary']`.
- Sometimes a specific filter is needed to extract information from the filename or to select files that meet specific criteria. You can use a `filter` for this.
- Sometimes many documents are returned, and it is not necessary to hold all of them in memory at the same time. You can use the *lazy* versions of the methods to get one document at a time.
- Prefer a complex query to a recursive search: with `recursive=True`, a query must be applied to each folder.
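
The filter and lazy-loading ideas above can be sketched without touching Google Drive. The example assumes, as in the `filter=lambda search, file: ...` call used in this notebook, that a filter is a callable receiving the loader and a file metadata dictionary and returning `True` to keep the file. The `fake_files` records and the `lazy_filter` generator are hypothetical stand-ins, not part of the library.

```python
from itertools import islice


def keep_untagged(search, file):
    """Keep files whose description does not contain the '#test' tag."""
    return "#test" not in file.get("description", "")


def lazy_filter(files, predicate, search=None):
    """Yield matching files one at a time instead of building a full list."""
    for file in files:
        if predicate(search, file):
            yield file


# Hypothetical metadata records standing in for Drive API results.
fake_files = [
    {"name": "intro.gdoc", "description": "Course summary"},
    {"name": "draft.gdoc", "description": "work in progress #test"},
    {"name": "notes.gdoc"},
]

# islice stops the generator early, so later files are never examined.
for file in islice(lazy_filter(fake_files, keep_untagged), 2):
    print(file["name"])
```

With a real loader, the same consumption pattern should apply to a lazy method such as `lazy_load()`, which yields documents one at a time rather than returning a list.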

In [13]:
import os

loader = GoogleDriveLoader(
    gdrive_api_file=os.environ["GOOGLE_ACCOUNT_FILE"],
    num_results=2,
    template="gdrive-query",
    # Skip files whose description contains the "#test" tag
    filter=lambda search, file: "#test" not in file.get("description", ""),
    query="machine learning",
    supportsAllDrives=False,
)
for doc in loader.load():
    print("---")
    print(doc.page_content.strip()[:60]+"...")

Ignore 'Dossier - 01 - Introduction au Deep Learning.zip' with type 'application/zip'
Ignore 'Actualité - 01 - Deep learning et  humanités numériques.zip' with type 'application/zip'


---
A Document with the word machine learning.



Another paragr...
---
Autre document sur le machine learning...
---
Matrice Cynefin x

Machine Learning

- Aller vite en product...
---
Deep Learning humanités numériques

“les gens qui se forment...
---
Eli Stevens Luca Antiga Thomas Viehmann Foreword by Soumith ...
---
Deep Learning et humanités numériques

Karim Sayadi

Data Sc...
---
Le machine learning portable avec Go



Dans cet article, je...
---
Deep Learning humanités numériques



“Les gens qui se forme...
---
01

R&D collective ?



<PAGE BREAK>

Synthèse d’une discuss...
---
L’i-PPR n°02



Deep learning of Python

V1.0 -

5 Avril 201...
