# Extending these workflows to other repositories
**IMPORTANT**: Make sure you understand and respect the licenses of any repositories you crawl with any tool, including those provided in this demonstration. This repository is intended to help you quickly get started indexing documentation in GitHub repositories, not to support large-scale web scraping.

## The main tool

In [1]:
from rag_tools.repo.ops import DocumentationExtractor

The `DocumentationExtractor` exposes two methods, `filter_files` and `extract`.

## Finding files to chunk
First, a set of parameters defining how to crawl a GitHub repository is passed to the custom `DocumentationExtractor` object defined in `rag_tools.repo.ops`. One option is to clone the repository and pass its local path to `filter_files`, which returns the paths of all files whose extension is listed in `considered_extensions`.

In [7]:
import os
METAFLOW_DOCS_REPO_PATH = os.path.expanduser("~/Dev/metaflow-docs")
file_paths = DocumentationExtractor().filter_files(
    METAFLOW_DOCS_REPO_PATH,
    base_search_path = "docs",
    exclude_paths = ["docs/v"],
    exclude_files = ["README.md", "README"],
    considered_extensions = [".md"],
)

In [9]:
len(file_paths), file_paths[0]

(58, ('/Users/eddie/Dev/metaflow-docs/docs/index.md', 'index.md'))
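
To make the filtering step concrete, here is a minimal sketch of how such a filter could work using only the standard library. It is not the `rag_tools` implementation: the function name `filter_files_sketch` is hypothetical, and treating `exclude_paths` entries as path prefixes relative to the repository root is an assumption. It returns `(absolute_path, filename)` tuples, matching the shape of the output above.

```python
import os


def filter_files_sketch(
    repo_path,
    base_search_path="",
    exclude_paths=(),
    exclude_files=(),
    considered_extensions=(".md",),
):
    """Hypothetical stand-in for filter_files; returns (path, filename) tuples."""
    root = os.path.join(repo_path, base_search_path)
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        # Directory path relative to the repo root, with forward slashes.
        rel_dir = os.path.relpath(dirpath, repo_path).replace(os.sep, "/")
        # Assumption: exclude_paths entries are matched as prefixes.
        if any(rel_dir.startswith(prefix) for prefix in exclude_paths):
            dirnames[:] = []  # do not descend into excluded directories
            continue
        for name in sorted(filenames):
            if name in exclude_files:
                continue
            if os.path.splitext(name)[1] in considered_extensions:
                matches.append((os.path.join(dirpath, name), name))
    return matches
```

Note that under the prefix assumption, `"docs/v"` would also exclude a directory such as `docs/versions`; an exact-match rule would behave differently.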

Another option is to download the repository only temporarily by passing the remote repository's URL to `extract`, and letting this function handle the `filter_files` call and any other parsing internally.

In [19]:
import pandas as pd
from functools import partial
from rag_tools.filetypes.markdown import parse_md_file_headers

# We will pass this to the extract(parser=...) arg.
# You can replace it with any function that takes filepath and filename args.
def headers_df_parser(filepath, filename, deployment_url, subdir) -> pd.DataFrame:
    return parse_md_file_headers(filepath, deployment_url=deployment_url, subdir=subdir)
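
As an illustration of swapping in your own parser, the sketch below splits a Markdown file into one record per header using only the standard library. It is a simplified stand-in, not how `parse_md_file_headers` works: the names `split_markdown_by_headers` and `simple_parser` are hypothetical, and it returns a list of dicts rather than a `DataFrame` (wrapping the result in `pd.DataFrame(records)` would be needed before calling `pd.concat`).

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def split_markdown_by_headers(text):
    """Return one record per Markdown header with the text that follows it."""
    sections = []
    current = None
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            current = {
                "header": match.group(2).strip(),
                "type": f"H{len(match.group(1))}",
                "lines": [],
            }
            sections.append(current)
        elif current is not None:
            current["lines"].append(line)

    records = []
    for section in sections:
        contents = "\n".join(section["lines"]).strip()
        records.append(
            {
                "header": section["header"],
                "type": section["type"],
                "contents": contents,
                "char_count": len(contents),
                "word_count": len(contents.split()),
            }
        )
    return records


def simple_parser(filepath, filename):
    """Hypothetical parser with the (filepath, filename) signature."""
    with open(filepath, encoding="utf-8") as fh:
        records = split_markdown_by_headers(fh.read())
    for record in records:
        record["filename"] = filename
    return records
```

Text that appears before the first header is dropped in this sketch; a real parser may want to keep it.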

In [20]:
params = {
    "deployment_url": "docs.metaflow.org",
    "repository_path": "https://github.com/Netflix/metaflow-docs",
    "repository_ref": "master",
    "base_search_path": "docs",
    "exclude_paths": ["docs/v"],
    "exclude_files": ["README.md", "README"],
}

In [21]:
extractor = DocumentationExtractor(repo_url=params["repository_path"])

dfs = extractor.extract(
    base_path=params["base_search_path"],
    ref=params["repository_ref"],
    exclude_paths=params["exclude_paths"],
    exclude_files=params["exclude_files"],
    considered_extensions=[".md"],
    parser=partial(headers_df_parser, deployment_url=params['deployment_url'], subdir=params['base_search_path']),
)
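
The `partial` call above is what lets a four-argument parser fit a two-argument interface. The sketch below shows the mechanism in isolation, assuming (based on the comment in the earlier cell) that the parser is invoked with `filepath` and `filename`. The function `page_url_parser` and the URL-building rule are hypothetical and only serve to show the binding.

```python
import posixpath
from functools import partial


def page_url_parser(filepath, filename, deployment_url, subdir):
    """Hypothetical parser: map a file under `subdir` to a page URL."""
    # Keep the part of the path after "/<subdir>/", falling back to the filename.
    relative = filepath.partition(f"/{subdir}/")[2] or filename
    page = posixpath.splitext(relative)[0]  # drop the ".md" extension
    return f"https://{deployment_url}/{page}"


# Bind the two extra keyword arguments up front...
bound_parser = partial(
    page_url_parser, deployment_url="docs.metaflow.org", subdir="docs"
)

# ...so the caller only needs to supply filepath and filename.
url = bound_parser("/tmp/repo/docs/scaling/data.md", "data.md")
```

A `lambda` or a small closure would work equally well; `partial` simply keeps the bound values visible at the call site.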

In [24]:
df = pd.concat(dfs)

In [26]:
df.sample(5)

| | header | contents | type | page_url | is_howto | char_count | word_count |
|---|---|---|---|---|---|---|---|
| 10 | **Store and load objects to/from a known S3 lo... | The above examples inferred the S3 location ba... | H4 | https://docs.metaflow.org/scaling/data#store-a... | False | 1367 | 219 |
| 0 | Deploying Variants of Event-Triggered Flows | Consider this advanced scenario: You have depl... | H1 | https://docs.metaflow.org/production/event-tri... | False | 983 | 119 |
| 99 | [Fix `environment is not callable` error when ... | Using `@environment` would often result in an ... | H4 | https://docs.metaflow.org/internals/release-no... | False | 279 | 34 |
| 71 | State Machine execution history logging for AW... | Metaflow now logs [State Machine execution his... | H4 | https://docs.metaflow.org/internals/release-no... | False | 587 | 72 |
| 3 | Single Flow, multiple developers | If `ProjectFlow` did not have a `@project deco... | H3 | https://docs.metaflow.org/production/coordinat... | False | 1773 | 232 |
