# Markdown Chunking Results

The next cell uses Metaflow's [Client API](https://docs.metaflow.org/api/client) to load the results of the `MarkdownChunker` flow, which splits Markdown documents into chunks.

In [2]:
from metaflow import Flow

flow = Flow("MarkdownChunker")
run = flow.latest_successful_run
print(run.id)
df = run.data.df

1692826202549977


In [3]:
df.describe()

Unnamed: 0,char_count,word_count
count,839.0,839.0
mean,723.185936,107.460072
std,990.522047,152.878742
min,0.0,1.0
25%,79.0,10.0
50%,377.0,52.0
75%,985.0,145.5
max,8541.0,1593.0
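
The summary above reports a minimum `char_count` of 0, so at least one chunk has empty contents. Whether the downstream flow removes such rows is not shown in this notebook. The following is a minimal sketch, on a small hypothetical frame (`toy` and its rows are illustrative, only the column names come from `df`), of how empty chunks could be dropped before embedding:

```python
import pandas as pd

# Hypothetical stand-in for `df`; only the column names mirror the notebook.
toy = pd.DataFrame(
    {
        "header": ["Features", "Quick tour", "Defer execution"],
        "contents": ["", "Let's have a look", "When you run your usual script"],
    }
)
toy["char_count"] = toy.contents.str.len()

# Keep only chunks that actually contain text.
non_empty = toy[toy.char_count > 0].reset_index(drop=True)
print(len(toy), "rows before,", len(non_empty), "rows after")
```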


In [4]:
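# Filter out URLs that don't resolve before including them in a dataset.

FILTER_URLS = False  # this can take about a minute for about a thousand rows of data

import requests

def good_url(url):
    """Return (ok, status_code); ok is True when the status code is below 400."""
    try:
        # Without a timeout, a single unresponsive host can hang the whole cell.
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        # Connection errors and timeouts count as a bad URL with no status code.
        return False, None
    return response.ok, response.status_code

if FILTER_URLS:

    # unpack the (ok, status_code) tuples into two columns of the df
    url_info = df.page_url.apply(good_url)
    df['ok_url'] = [is_ok for is_ok, _ in url_info]
    df['url_status_code'] = [status for _, status in url_info]

    # Which pages failed the request (status code 400 or above, or no response)?
    # A bare expression inside an `if` block is not displayed, so print it.
    print(df[~df.ok_url].page_url.values)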
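
Many `page_url` values point at different sections of the same page and differ only in their `#fragment`. The fragment is never sent to the server, so one request per page is enough. The following is a minimal sketch of that idea, not part of the original flow: `check_pages` is a hypothetical helper, and `fake_checker` stands in for `good_url` so the example runs without network access.

```python
from urllib.parse import urldefrag

def check_pages(urls, checker):
    """Run `checker` once per distinct page and return results aligned with `urls`."""
    cache = {}
    results = []
    for url in urls:
        page = urldefrag(url).url  # drop the "#fragment" part
        if page not in cache:
            cache[page] = checker(page)
        results.append(cache[page])
    return results

# Offline stand-in for good_url: records each call and fakes a response.
calls = []
def fake_checker(url):
    calls.append(url)
    missing = url.endswith("/missing")
    return (not missing, 404 if missing else 200)

urls = [
    "https://example.com/docs/a#intro",
    "https://example.com/docs/a#usage",
    "https://example.com/docs/missing",
]
results = check_pages(urls, fake_checker)
print(results)
print(len(calls), "requests for", len(urls), "rows")
```

In the cell above, `url_info = check_pages(df.page_url, good_url)` could replace the `apply` call, since the two list comprehensions only iterate over `(ok, status_code)` tuples.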

In [5]:
# Fetch some random instances of metadata/links that
# will come along with the embeddings of the contents column.
df.sample(10)

Unnamed: 0,header,contents,type,page_url,is_howto,char_count,word_count
147,Handle null assignment to `IncludeFile` properly,A workflow executed without a required `Includ...,H4,https://docs.metaflow.org/internals/release-no...,False,213,34
48,"[2.4.6 (Dec 16, 2021)](https://github.com/Netf...",This version was skipped due to technical reasons,H2,https://docs.metaflow.org/internals/release-no...,False,49,8
190,"2.0.5 (Apr 30th, 2020)",The Metaflow 2.0.5 release is a minor patch re...,H2,https://docs.metaflow.org/internals/release-no...,False,52,9
562,Quick tour,Let's have a look at the 🤗 Accelerate main fea...,H1,https://huggingface.co/docs/accelerate/source/...,False,71,14
515,Compatibility with Conda decorator,The above instructions work even if you use [`...,H3,https://docs.metaflow.org/metaflow/debugging#c...,False,529,81
696,Prepare a 🤗 Accelerate fine-tuning script,The training script is very similar to a train...,H3,https://huggingface.co/docs/accelerate/source/...,False,299,46
262,Timing out with the `timeout` Decorator,"By default, there is no timeout for steps. If ...",H2,https://docs.metaflow.org/scaling/failures#tim...,False,1941,436
444,@project,[The @project decorator](/production/coordinat...,H3,https://docs.metaflow.org/api/current#project,False,1976,193
570,Defer execution,"When you run your usual script, instructions a...",H3,https://huggingface.co/docs/accelerate/source/...,False,765,128
103,Features,,H3,https://docs.metaflow.org/internals/release-no...,False,0,1


In [8]:
OB_DOCS_INCLUDED = False
# This cell extracts documents from outerbounds.com that have a specific Q&A format.
# The source repository is closed, so if your GH account doesn't have OB access there are zero samples and this code will error out.

if OB_DOCS_INCLUDED:

    # Get the how-to guides from outerbounds.com/docs that have explicit Q&A format,
    # which are the only place where the content itself will be a question.
    from IPython.display import display, Markdown
    
    # the is_howto column is a pattern match on the section headers of the how-to guides on outerbounds.com/docs
    howto_sample = df[df.is_howto == True].sample(10)
    for _, sample in howto_sample.iterrows():
        display(Markdown(f"[**Question**]({sample.page_url}): {sample.contents}"))

# Post-processing results

The following cells load the post-processing results of the `DataTableProcessor` flow.

In [10]:
from metaflow import Flow
import pandas as pd
flow = Flow('DataTableProcessor')
run = flow.latest_successful_run
print(run.id)

1692826394541260


In [13]:
from metaflow.cards import get_cards
get_cards(f'{run.pathspec}/start/1')

In [14]:
DATA_DIR = '../data'
df_post = pd.read_csv(f'{DATA_DIR}/processed_df_{run.id}.csv')
df_post.sample(3)

Unnamed: 0,index,header,contents,type,page_url,is_howto,char_count,word_count,tld
485,619,accelerate test,`accelerate test` or `accelerate-test`\n \n Ru...,H2,https://huggingface.co/docs/accelerate/source/...,False,747,150,https://huggingface.co
603,810,The slowdown in gradient accumulation,You now understand that PyTorch adds hooks to ...,H2,https://huggingface.co/docs/accelerate/source/...,False,1055,163,https://huggingface.co
541,715,Pre-Requisites,"You will need to install the latest pytorch, c...",H2,https://huggingface.co/docs/accelerate/source/...,False,1012,128,https://huggingface.co
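
The `tld` column in the sample above holds values such as `https://huggingface.co`, that is, the scheme plus host of `page_url` rather than a top-level domain in the strict sense. How `DataTableProcessor` computes it is not shown in this notebook. The following is a minimal sketch, assuming the column is derived from `page_url`; `site_root` is a hypothetical name, not a function from the flow.

```python
from urllib.parse import urlsplit

def site_root(url: str) -> str:
    """Return the scheme and host of a URL, e.g. 'https://docs.metaflow.org'."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

print(site_root("https://docs.metaflow.org/api/current#project"))
```

With pandas, the same helper could be applied as `df.page_url.map(site_root)` to build a comparable column.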
