<a href="https://colab.research.google.com/github/mlfisch3/Predibase/blob/main/PredibaseSDK2CodeLlama13BDocstringTutorial05232024as000.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Docstring Generation Using Codellama-13b Model on a Single GPU**

Ever wondered whether an assistant could write the docstring for any function you code? In this notebook, we show how a generative model can produce function docstrings from Python source code.

We take a pre-trained large language model and fine-tune it on the Predibase (https://app.predibase.com/) platform. By the end of this example, you will understand how to use Predibase to fine-tune an LLM and deploy the fine-tuned model.

<br>

👀 Try Predibase's free trial–complete with $25 of credit–by signing up [here](https://pbase.ai/3OD77wQ)

# **Goal: Use LLMs For Docstring-Generation** 💻

This notebook demonstrates how to fine-tune a Codellama-13b model on a docstring generation dataset on a single GPU.

As an example, if we prompt the model with this instruction:

```
Instruction: Write an appropriate docstring for the following Python function. Return the entire function with the in-line docstring.

Function:

def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):
    html = get_content(rebuilt_url(url))
    info = json.loads(match1(html, r'qualities":({.+?}),"'))
    title = match1(html, r'"video_title"\s*:\s*"([^"]+)"') or \
            match1(html, r'"title"\s*:\s*"([^"]+)"')
    title = unicodize(title)

    for quality in ['1080','720','480','380','240','144','auto']:
        try:
            real_url = info[quality][1]["url"]
            if real_url:
                break
        except KeyError:
            pass

    mime, ext, size = url_info(real_url)

    print_info(site_info, title, mime, size)
    if not info_only:
        download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)
```

We want the model to produce exactly this response:

```
Docstring:

    Download from dailymotion.com.

    Examples:
        >>> dailymotion_download('http://www.dailymotion.com/video/x2bq33')
```



# **Getting the Python Docstring Generation Dataset** 💽

The first step is to download the docstring dataset from https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text

In [None]:
!wget https://github.com/microsoft/CodeXGLUE/raw/main/Code-Text/code-to-text/dataset.zip && unzip dataset.zip && cd dataset && wget https://zenodo.org/record/7857872/files/python.zip && unzip python.zip
!cd dataset && python preprocess.py

--2024-05-24 05:32:41--  https://github.com/microsoft/CodeXGLUE/raw/main/Code-Text/code-to-text/dataset.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/microsoft/CodeXGLUE/main/Code-Text/code-to-text/dataset.zip [following]
--2024-05-24 05:32:41--  https://raw.githubusercontent.com/microsoft/CodeXGLUE/main/Code-Text/code-to-text/dataset.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12396864 (12M) [application/zip]
Saving to: ‘dataset.zip’


2024-05-24 05:32:41 (110 MB/s) - ‘dataset.zip’ saved [12396864/12396864]

Archive:  dataset.zip
   creating: dataset/
   creating: dataset/go/
  in

Once you run the previous commands, you should have a folder structure as follows:

```
python
  train.jsonl
  valid.jsonl
  test.jsonl
```

Next, we read these files using pandas.

In [None]:
import pandas as pd

train_df = pd.read_json("dataset/python/train.jsonl", lines=True)
test_df = pd.read_json("dataset/python/test.jsonl", lines=True)
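For readers new to the format: a `.jsonl` (JSON Lines) file stores one JSON object per line, which is why `lines=True` is passed to `pd.read_json`. The sketch below illustrates this with two made-up records held in memory. The records and the `toy_df` name are illustrative assumptions, not rows from the real dataset.

```python
import io

import pandas as pd

# Two made-up records standing in for lines of train.jsonl (not real dataset rows).
jsonl_text = (
    '{"func_name": "add", "code": "def add(a, b):\\n    return a + b", "docstring": "Add two numbers."}\n'
    '{"func_name": "neg", "code": "def neg(a):\\n    return -a", "docstring": "Negate a number."}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object (one row each).
toy_df = pd.read_json(io.StringIO(jsonl_text), lines=True)

print(toy_df.shape)
print(sorted(toy_df.columns))
```

Running the same kind of check on `train_df` and `test_df` (shape and column names) is a cheap way to confirm the download and preprocessing steps worked before moving on.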

Let's look at a few examples.

In [None]:
pd.options.display.max_colwidth = 999

train_df.head(3)

Unnamed: 0,repo,path,func_name,original_string,language,code,code_tokens,docstring,docstring_tokens,sha,url,partition
0,smdabdoub/phylotoast,phylotoast/util.py,split_phylogeny,"def split_phylogeny(p, level=""s""):\n """"""\n Return either the full or truncated version of a QIIME-formatted taxonomy string.\n\n :type p: str\n :param p: A QIIME-formatted taxonomy string: k__Foo; p__Bar; ...\n\n :type level: str\n :param level: The different level of identification are kingdom (k), phylum (p),\n class (c),order (o), family (f), genus (g) and species (s). If level is\n not provided, the default level of identification is species.\n\n :rtype: str\n :return: A QIIME-formatted taxonomy string up to the classification given\n by param level.\n """"""\n level = level+""__""\n result = p.split(level)\n return result[0]+level+result[1].split("";"")[0]",python,"def split_phylogeny(p, level=""s""):\n """"""\n Return either the full or truncated version of a QIIME-formatted taxonomy string.\n\n :type p: str\n :param p: A QIIME-formatted taxonomy string: k__Foo; p__Bar; ...\n\n :type level: str\n :param level: The different level of identification are kingdom (k), phylum (p),\n class (c),order (o), family (f), genus (g) and species (s). If level is\n not provided, the default level of identification is species.\n\n :rtype: str\n :return: A QIIME-formatted taxonomy string up to the classification given\n by param level.\n """"""\n level = level+""__""\n result = p.split(level)\n return result[0]+level+result[1].split("";"")[0]","[def, split_phylogeny, (, p, ,, level, =, ""s"", ), :, level, =, level, +, ""__"", result, =, p, ., split, (, level, ), return, result, [, 0, ], +, level, +, result, [, 1, ], ., split, (, "";"", ), [, 0, ]]","Return either the full or truncated version of a QIIME-formatted taxonomy string.\n\n :type p: str\n :param p: A QIIME-formatted taxonomy string: k__Foo; p__Bar; ...\n\n :type level: str\n :param level: The different level of identification are kingdom (k), phylum (p),\n class (c),order (o), family (f), genus (g) and species (s). 
If level is\n not provided, the default level of identification is species.\n\n :rtype: str\n :return: A QIIME-formatted taxonomy string up to the classification given\n by param level.","[Return, either, the, full, or, truncated, version, of, a, QIIME, -, formatted, taxonomy, string, .]",0b74ef171e6a84761710548501dfac71285a58a3,https://github.com/smdabdoub/phylotoast/blob/0b74ef171e6a84761710548501dfac71285a58a3/phylotoast/util.py#L159-L177,train
1,smdabdoub/phylotoast,phylotoast/util.py,ensure_dir,"def ensure_dir(d):\n """"""\n Check to make sure the supplied directory path does not exist, if so, create it. The\n method catches OSError exceptions and returns a descriptive message instead of\n re-raising the error.\n\n :type d: str\n :param d: It is the full path to a directory.\n\n :return: Does not return anything, but creates a directory path if it doesn't exist\n already.\n """"""\n if not os.path.exists(d):\n try:\n os.makedirs(d)\n except OSError as oe:\n # should not happen with os.makedirs\n # ENOENT: No such file or directory\n if os.errno == errno.ENOENT:\n msg = twdd(""""""One or more directories in the path ({}) do not exist. If\n you are specifying a new directory for output, please ensure\n all other directories in the path currently exist."""""")\n return msg.format(d)\n else:\n ...",python,"def ensure_dir(d):\n """"""\n Check to make sure the supplied directory path does not exist, if so, create it. The\n method catches OSError exceptions and returns a descriptive message instead of\n re-raising the error.\n\n :type d: str\n :param d: It is the full path to a directory.\n\n :return: Does not return anything, but creates a directory path if it doesn't exist\n already.\n """"""\n if not os.path.exists(d):\n try:\n os.makedirs(d)\n except OSError as oe:\n # should not happen with os.makedirs\n # ENOENT: No such file or directory\n if os.errno == errno.ENOENT:\n msg = twdd(""""""One or more directories in the path ({}) do not exist. 
If\n you are specifying a new directory for output, please ensure\n all other directories in the path currently exist."""""")\n return msg.format(d)\n else:\n ...","[def, ensure_dir, (, d, ), :, if, not, os, ., path, ., exists, (, d, ), :, try, :, os, ., makedirs, (, d, ), except, OSError, as, oe, :, # should not happen with os.makedirs, # ENOENT: No such file or directory, if, os, ., errno, ==, errno, ., ENOENT, :, msg, =, twdd, (, """"""One or more directories in the path ({}) do not exist. If\n you are specifying a new directory for output, please ensure\n all other directories in the path currently exist."""""", ), return, msg, ., format, (, d, ), else, :, msg, =, twdd, (, """"""An error occurred trying to create the output directory\n ({}) with message: {}"""""", ), return, msg, ., format, (, d, ,, oe, ., strerror, )]","Check to make sure the supplied directory path does not exist, if so, create it. The\n method catches OSError exceptions and returns a descriptive message instead of\n re-raising the error.\n\n :type d: str\n :param d: It is the full path to a directory.\n\n :return: Does not return anything, but creates a directory path if it doesn't exist\n already.","[Check, to, make, sure, the, supplied, directory, path, does, not, exist, if, so, create, it, ., The, method, catches, OSError, exceptions, and, returns, a, descriptive, message, instead, of, re, -, raising, the, error, .]",0b74ef171e6a84761710548501dfac71285a58a3,https://github.com/smdabdoub/phylotoast/blob/0b74ef171e6a84761710548501dfac71285a58a3/phylotoast/util.py#L180-L206,train
2,smdabdoub/phylotoast,phylotoast/util.py,file_handle,"def file_handle(fnh, mode=""rU""):\n """"""\n Takes either a file path or an open file handle, checks validity and returns an open\n file handle or raises an appropriate Exception.\n\n :type fnh: str\n :param fnh: It is the full path to a file, or open file handle\n\n :type mode: str\n :param mode: The way in which this file will be used, for example to read or write or\n both. By default, file will be opened in rU mode.\n\n :return: Returns an opened file for appropriate usage.\n """"""\n handle = None\n if isinstance(fnh, file):\n if fnh.closed:\n raise ValueError(""Input file is closed."")\n handle = fnh\n elif isinstance(fnh, str):\n handle = open(fnh, mode)\n\n return handle",python,"def file_handle(fnh, mode=""rU""):\n """"""\n Takes either a file path or an open file handle, checks validity and returns an open\n file handle or raises an appropriate Exception.\n\n :type fnh: str\n :param fnh: It is the full path to a file, or open file handle\n\n :type mode: str\n :param mode: The way in which this file will be used, for example to read or write or\n both. 
By default, file will be opened in rU mode.\n\n :return: Returns an opened file for appropriate usage.\n """"""\n handle = None\n if isinstance(fnh, file):\n if fnh.closed:\n raise ValueError(""Input file is closed."")\n handle = fnh\n elif isinstance(fnh, str):\n handle = open(fnh, mode)\n\n return handle","[def, file_handle, (, fnh, ,, mode, =, ""rU"", ), :, handle, =, None, if, isinstance, (, fnh, ,, file, ), :, if, fnh, ., closed, :, raise, ValueError, (, ""Input file is closed."", ), handle, =, fnh, elif, isinstance, (, fnh, ,, str, ), :, handle, =, open, (, fnh, ,, mode, ), return, handle]","Takes either a file path or an open file handle, checks validity and returns an open\n file handle or raises an appropriate Exception.\n\n :type fnh: str\n :param fnh: It is the full path to a file, or open file handle\n\n :type mode: str\n :param mode: The way in which this file will be used, for example to read or write or\n both. By default, file will be opened in rU mode.\n\n :return: Returns an opened file for appropriate usage.","[Takes, either, a, file, path, or, an, open, file, handle, checks, validity, and, returns, an, open, file, handle, or, raises, an, appropriate, Exception, .]",0b74ef171e6a84761710548501dfac71285a58a3,https://github.com/smdabdoub/phylotoast/blob/0b74ef171e6a84761710548501dfac71285a58a3/phylotoast/util.py#L209-L231,train


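To recover each function without its docstring, we strip the docstring text and the `"""` delimiter lines out of the code. Note that this simple replacement only removes a `"""` that is immediately followed by a newline, so when the docstring text starts on the same line as the opening `"""`, that opening delimiter is left behind.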

In [None]:
train_df['code_with_docstring'] = train_df['code'].copy()
train_df['raw_code'] = train_df.apply(lambda x: x['code_with_docstring'].replace('"""\n','').replace(x['docstring'],''), axis=1)
train_df = train_df[['raw_code','docstring','code_with_docstring']]

In [None]:
test_df['code_with_docstring'] = test_df['code'].copy()
test_df['raw_code'] = test_df.apply(lambda x: x['code_with_docstring'].replace('"""\n','').replace(x['docstring'],''), axis=1)
test_df = test_df[['raw_code','docstring','code_with_docstring']]
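The string replacement above depends on where the `"""` delimiters sit relative to the docstring text. As a rough alternative, the sketch below uses a regular expression to drop the first triple-quoted string that begins a line. The helper name `strip_first_docstring` is hypothetical, and the approach assumes the docstring is the first line-leading triple-quoted string in the function, which will not hold for every snippet.

```python
import re

# Matches the first triple-quoted string that starts a line (after indentation),
# including an optional string prefix such as r or u, through its closing quotes
# and the newline that follows them.
_FIRST_DOCSTRING = re.compile(
    r"^[ \t]*[rRuUbB]{0,2}(\"{3}|'{3}).*?\1[ \t]*\n?",
    re.DOTALL | re.MULTILINE,
)


def strip_first_docstring(code: str) -> str:
    """Remove the first line-leading triple-quoted string from `code`."""
    return _FIRST_DOCSTRING.sub("", code, count=1)


# Docstring text on the same line as the opening quotes.
same_line = 'def f(x):\n    """Add one.\n    """\n    return x + 1\n'
# Opening quotes on their own line.
own_line = 'def g(x):\n    """\n    Negate.\n    """\n    return -x\n'

print(strip_first_docstring(same_line))
print(strip_first_docstring(own_line))
```

Both layouts reduce to the bare function. Swapping this into the notebook would change `raw_code`, and therefore the prompts shown later, so treat it as an option to evaluate rather than a drop-in fix.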

Now let's create the properly templated prompt and completion columns.

In [None]:
df_dataset = train_df.iloc[:5000].copy()

In [None]:
df_test = test_df.iloc[:100].copy()
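The slices above keep the first rows of each file, and in the preview earlier the first three training rows all come from the same repository. If a more varied subset is wanted, one option (not what this notebook does) is a seeded random sample. The toy frame and the seed value below are illustrative assumptions.

```python
import pandas as pd

# Toy frame: rows are grouped by repository, as in a dataset sorted by source.
toy = pd.DataFrame({"repo": ["a"] * 6 + ["b"] * 4, "row": range(10)})

# Taking the head only sees the first repository.
head_subset = toy.iloc[:5]

# A seeded random sample draws rows from anywhere in the frame, reproducibly.
random_subset = toy.sample(n=5, random_state=42)

print(head_subset["repo"].nunique())
print(sorted(random_subset["row"]))
```

Fixing `random_state` keeps the subset identical across runs, which matters when comparing fine-tuning runs against each other.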

In [None]:
code_llama_13b_instruct_prompt_template: str = "<s>[INST] {prompt} [/INST]"

In [None]:
fine_tuning_prompt = """
    Write an appropriate docstring for the following Python function. Return the
    entire function with the in-line docstring.

    ### Function: {raw_code}

    ### Function with docstring:
"""

In [None]:
def convert_instruction_to_prompt(raw_code: str) -> str:
  return code_llama_13b_instruct_prompt_template.format(
      prompt=fine_tuning_prompt.format(
          raw_code=raw_code,
      ),
  )
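To make the two-level templating concrete, here is a self-contained sketch that rebuilds the same templates under shorter, hypothetical names (`INSTRUCT_TEMPLATE`, `TASK_PROMPT`, `build_prompt`) and applies them to a made-up function. The toy function deliberately contains curly braces to show that code passed in as a format value is not re-interpreted as a placeholder.

```python
INSTRUCT_TEMPLATE = "<s>[INST] {prompt} [/INST]"

TASK_PROMPT = """
    Write an appropriate docstring for the following Python function. Return the
    entire function with the in-line docstring.

    ### Function: {raw_code}

    ### Function with docstring:
"""


def build_prompt(raw_code: str) -> str:
    # Inner format fills in the code; outer format wraps it in the [INST] tags.
    return INSTRUCT_TEMPLATE.format(prompt=TASK_PROMPT.format(raw_code=raw_code))


# Made-up function whose body contains braces (a set comprehension).
toy_code = "def as_set(items):\n    return {item for item in items}"
prompt = build_prompt(toy_code)
print(prompt)
```

The printed prompt starts with `<s>[INST]`, contains the function verbatim, and ends with `[/INST]`, matching the `prompt` column built in the following cells.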

In [None]:
df_dataset["prompt"] = df_dataset.apply(
  lambda row: convert_instruction_to_prompt(raw_code=row["raw_code"]),
  axis=1,
)

In [None]:
df_dataset["completion"] = df_dataset["code_with_docstring"]

In [None]:
df_test["prompt"] = df_test.apply(
  lambda row: convert_instruction_to_prompt(raw_code=row["raw_code"]),
  axis=1,
)

In [None]:
df_test["completion"] = df_test["code_with_docstring"]

Let's take a final look at the dataset.

In [None]:
df_dataset.head(3)

Unnamed: 0,raw_code,docstring,code_with_docstring,prompt,completion
0,"def split_phylogeny(p, level=""s""):\n \n level = level+""__""\n result = p.split(level)\n return result[0]+level+result[1].split("";"")[0]","Return either the full or truncated version of a QIIME-formatted taxonomy string.\n\n :type p: str\n :param p: A QIIME-formatted taxonomy string: k__Foo; p__Bar; ...\n\n :type level: str\n :param level: The different level of identification are kingdom (k), phylum (p),\n class (c),order (o), family (f), genus (g) and species (s). If level is\n not provided, the default level of identification is species.\n\n :rtype: str\n :return: A QIIME-formatted taxonomy string up to the classification given\n by param level.","def split_phylogeny(p, level=""s""):\n """"""\n Return either the full or truncated version of a QIIME-formatted taxonomy string.\n\n :type p: str\n :param p: A QIIME-formatted taxonomy string: k__Foo; p__Bar; ...\n\n :type level: str\n :param level: The different level of identification are kingdom (k), phylum (p),\n class (c),order (o), family (f), genus (g) and species (s). If level is\n not provided, the default level of identification is species.\n\n :rtype: str\n :return: A QIIME-formatted taxonomy string up to the classification given\n by param level.\n """"""\n level = level+""__""\n result = p.split(level)\n return result[0]+level+result[1].split("";"")[0]","<s>[INST] \n Write an appropriate docstring for the following Python function. 
Return the\n entire function with the in-line docstring.\n\n ### Function: def split_phylogeny(p, level=""s""):\n \n level = level+""__""\n result = p.split(level)\n return result[0]+level+result[1].split("";"")[0]\n\n ### Function with docstring:\n [/INST]","def split_phylogeny(p, level=""s""):\n """"""\n Return either the full or truncated version of a QIIME-formatted taxonomy string.\n\n :type p: str\n :param p: A QIIME-formatted taxonomy string: k__Foo; p__Bar; ...\n\n :type level: str\n :param level: The different level of identification are kingdom (k), phylum (p),\n class (c),order (o), family (f), genus (g) and species (s). If level is\n not provided, the default level of identification is species.\n\n :rtype: str\n :return: A QIIME-formatted taxonomy string up to the classification given\n by param level.\n """"""\n level = level+""__""\n result = p.split(level)\n return result[0]+level+result[1].split("";"")[0]"
1,"def ensure_dir(d):\n \n if not os.path.exists(d):\n try:\n os.makedirs(d)\n except OSError as oe:\n # should not happen with os.makedirs\n # ENOENT: No such file or directory\n if os.errno == errno.ENOENT:\n msg = twdd(""""""One or more directories in the path ({}) do not exist. If\n you are specifying a new directory for output, please ensure\n all other directories in the path currently exist."""""")\n return msg.format(d)\n else:\n msg = twdd(""""""An error occurred trying to create the output directory\n ({}) with message: {}"""""")\n return msg.format(d, oe.strerror)","Check to make sure the supplied directory path does not exist, if so, create it. The\n method catches OSError exceptions and returns a descriptive message instead of\n re-raising the error.\n\n :type d: str\n :param d: It is the full path to a directory.\n\n :return: Does not return anything, but creates a directory path if it doesn't exist\n already.","def ensure_dir(d):\n """"""\n Check to make sure the supplied directory path does not exist, if so, create it. The\n method catches OSError exceptions and returns a descriptive message instead of\n re-raising the error.\n\n :type d: str\n :param d: It is the full path to a directory.\n\n :return: Does not return anything, but creates a directory path if it doesn't exist\n already.\n """"""\n if not os.path.exists(d):\n try:\n os.makedirs(d)\n except OSError as oe:\n # should not happen with os.makedirs\n # ENOENT: No such file or directory\n if os.errno == errno.ENOENT:\n msg = twdd(""""""One or more directories in the path ({}) do not exist. If\n you are specifying a new directory for output, please ensure\n all other directories in the path currently exist."""""")\n return msg.format(d)\n else:\n ...","<s>[INST] \n Write an appropriate docstring for the following Python function. 
Return the\n entire function with the in-line docstring.\n\n ### Function: def ensure_dir(d):\n \n if not os.path.exists(d):\n try:\n os.makedirs(d)\n except OSError as oe:\n # should not happen with os.makedirs\n # ENOENT: No such file or directory\n if os.errno == errno.ENOENT:\n msg = twdd(""""""One or more directories in the path ({}) do not exist. If\n you are specifying a new directory for output, please ensure\n all other directories in the path currently exist."""""")\n return msg.format(d)\n else:\n msg = twdd(""""""An error occurred trying to create the output directory\n ({}) with message: {}"""""")\n return msg.format(d, oe.strerror)\n\n ### Function with docstring:\n [/INST]","def ensure_dir(d):\n """"""\n Check to make sure the supplied directory path does not exist, if so, create it. The\n method catches OSError exceptions and returns a descriptive message instead of\n re-raising the error.\n\n :type d: str\n :param d: It is the full path to a directory.\n\n :return: Does not return anything, but creates a directory path if it doesn't exist\n already.\n """"""\n if not os.path.exists(d):\n try:\n os.makedirs(d)\n except OSError as oe:\n # should not happen with os.makedirs\n # ENOENT: No such file or directory\n if os.errno == errno.ENOENT:\n msg = twdd(""""""One or more directories in the path ({}) do not exist. If\n you are specifying a new directory for output, please ensure\n all other directories in the path currently exist."""""")\n return msg.format(d)\n else:\n ..."
2,"def file_handle(fnh, mode=""rU""):\n \n handle = None\n if isinstance(fnh, file):\n if fnh.closed:\n raise ValueError(""Input file is closed."")\n handle = fnh\n elif isinstance(fnh, str):\n handle = open(fnh, mode)\n\n return handle","Takes either a file path or an open file handle, checks validity and returns an open\n file handle or raises an appropriate Exception.\n\n :type fnh: str\n :param fnh: It is the full path to a file, or open file handle\n\n :type mode: str\n :param mode: The way in which this file will be used, for example to read or write or\n both. By default, file will be opened in rU mode.\n\n :return: Returns an opened file for appropriate usage.","def file_handle(fnh, mode=""rU""):\n """"""\n Takes either a file path or an open file handle, checks validity and returns an open\n file handle or raises an appropriate Exception.\n\n :type fnh: str\n :param fnh: It is the full path to a file, or open file handle\n\n :type mode: str\n :param mode: The way in which this file will be used, for example to read or write or\n both. By default, file will be opened in rU mode.\n\n :return: Returns an opened file for appropriate usage.\n """"""\n handle = None\n if isinstance(fnh, file):\n if fnh.closed:\n raise ValueError(""Input file is closed."")\n handle = fnh\n elif isinstance(fnh, str):\n handle = open(fnh, mode)\n\n return handle","<s>[INST] \n Write an appropriate docstring for the following Python function. 
Return the\n entire function with the in-line docstring.\n\n ### Function: def file_handle(fnh, mode=""rU""):\n \n handle = None\n if isinstance(fnh, file):\n if fnh.closed:\n raise ValueError(""Input file is closed."")\n handle = fnh\n elif isinstance(fnh, str):\n handle = open(fnh, mode)\n\n return handle\n\n ### Function with docstring:\n [/INST]","def file_handle(fnh, mode=""rU""):\n """"""\n Takes either a file path or an open file handle, checks validity and returns an open\n file handle or raises an appropriate Exception.\n\n :type fnh: str\n :param fnh: It is the full path to a file, or open file handle\n\n :type mode: str\n :param mode: The way in which this file will be used, for example to read or write or\n both. By default, file will be opened in rU mode.\n\n :return: Returns an opened file for appropriate usage.\n """"""\n handle = None\n if isinstance(fnh, file):\n if fnh.closed:\n raise ValueError(""Input file is closed."")\n handle = fnh\n elif isinstance(fnh, str):\n handle = open(fnh, mode)\n\n return handle"


In [None]:
df_test.head(3)

Unnamed: 0,raw_code,docstring,code_with_docstring,prompt,completion
0,"def sina_xml_to_url_list(xml_data):\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl",str->list\n Convert XML to URL List.\n From Biligrab.,"def sina_xml_to_url_list(xml_data):\n """"""str->list\n Convert XML to URL List.\n From Biligrab.\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl","<s>[INST] \n Write an appropriate docstring for the following Python function. Return the\n entire function with the in-line docstring.\n\n ### Function: def sina_xml_to_url_list(xml_data):\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl\n\n ### Function with docstring:\n [/INST]","def sina_xml_to_url_list(xml_data):\n """"""str->list\n Convert XML to URL List.\n From Biligrab.\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl"
1,"def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n \n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)",Downloads Dailymotion videos by URL.,"def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Dailymotion videos by URL.\n """"""\n\n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)","<s>[INST] \n Write an appropriate docstring for the following Python function. 
Return the\n entire function with the in-line docstring.\n\n ### Function: def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n \n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)\n\n ### Function with docstring:\n [/INST]","def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Dailymotion videos by URL.\n """"""\n\n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)"
2,"def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_page, r'vkey\s*:\s*""([^""]+)""')\n if vkey is None:\n vid = match1(url, r'#(\d+)')\n sina_download_by_vid(vid, output_dir=output_dir, me...",Downloads Sina videos by URL.,"def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Sina videos by URL.\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_page, r'vkey\s*:\s*""([^""]+)""')\n if vkey is None:\n vid = match1(url, r'#(\d+)')\n sina_download_by_...","<s>[INST] \n Write an appropriate docstring for the following Python function. 
Return the\n entire function with the in-line docstring.\n\n ### Function: def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_...","def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Sina videos by URL.\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_page, r'vkey\s*:\s*""([^""]+)""')\n if vkey is None:\n vid = match1(url, r'#(\d+)')\n sina_download_by_..."


Now that the dataset is prepared, let's dig into the Predibase platform and build an LLM for docstring generation.

# **Setup Predibase** 🧰

Predibase is a platform for fine-tuning and serving open-source AI models in the cloud. To set up Predibase, you need to:

* Install the Predibase SDK: ```pip install -U predibase```
* Generate a personal API token.
* Run ```pbase login``` and paste the API token when prompted.

In [None]:
!pip install -U predibase

In [None]:
!pbase login

Once setup is complete, let's import the Predibase client.

In [None]:
from predibase import Predibase, FinetuningConfig, DeploymentConfig
from predibase.resources.deployment import Deployment
from predibase.resources.dataset import Dataset
from predibase.resources.repo import Repo
from predibase.resources.finetuning_job import FinetuningJob
from lorax.client import Client
from lorax.types import Response

In [None]:
# `my_api_token` must be defined beforehand as a string holding your Predibase API token.
pb: Predibase = Predibase(api_token=my_api_token)
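The cell above expects a `my_api_token` variable that the notebook never defines. One way to supply it without pasting the secret into a cell is sketched below. The helper name `load_api_token` and the environment variable name `PREDIBASE_API_TOKEN` are assumptions made for this example, so adjust them to your setup.

```python
import os
from getpass import getpass


def load_api_token(env_var: str = "PREDIBASE_API_TOKEN") -> str:
    """Return the API token from the environment, or prompt for it interactively."""
    # env_var is an assumed name; any variable you export yourself works.
    token = os.environ.get(env_var, "").strip()
    if not token:
        # getpass hides the typed value so it does not end up in the cell output.
        token = getpass("Predibase API token: ").strip()
    if not token:
        raise ValueError("No API token provided.")
    return token
```

With this in place, `my_api_token = load_api_token()` can run before the `Predibase(api_token=my_api_token)` call, and the token stays out of the saved notebook.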

Predibase already hosts a series of pre-trained large language models. The list can be found at https://docs.predibase.com/user-guide/inference/models.

We use the pre-trained codellama-13b-instruct model, which is hosted on Hugging Face.

In [None]:
deployment: Deployment = pb.deployments.get(deployment_ref="codellama-13b-instruct")

In [None]:
client: Client = pb.deployments.client(deployment_ref=deployment.name)

In [None]:
def lorax_generate(
    client: Client,
    prompt: str,
    adapter_id: str | None = None,
    adapter_version: int | None = None,
    **kwargs,
) -> str:
    """Generate text from the deployment, optionally through a fine-tuned adapter."""
    # Adapters are addressed as "<repo name>/<version>".
    if adapter_id and adapter_version:
        adapter_id = f"{adapter_id}/{adapter_version}"

    # Extra keyword arguments (e.g. max_new_tokens, temperature) are passed through.
    response: Response = client.generate(
        prompt=prompt,
        adapter_id=adapter_id,
        **kwargs,
    )
    return response.generated_text

To fine-tune an LLM with Predibase, we first need a training dataset uploaded to the platform. For this demonstration, we use a small subset of the training data for fine-tuning the model. We register it under a dataset name that starts with `Docstring_generation_dataset`, followed by the deployment name.

If you already have the dataset uploaded to Predibase, you can use ```pb.datasets.get(dataset_ref=dataset_name)``` to retrieve it.

## **Instruction-tuning LLM**

Instruction tuning is a form of fine-tuning in which a model is trained on pairs of instructions (inputs) and desired responses (outputs), so that it learns to perform a task from the instruction itself. We design a tailored prompt to fine-tune the codellama model.
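
As a rough illustration, a prompt/completion pair of the kind stored in the `prompt` and `completion` columns could be assembled as follows. The template is reconstructed from the prompts displayed in this notebook, and its exact whitespace is an assumption. The names `PROMPT_TEMPLATE` and `build_training_example` are hypothetical.

```python
# Template reconstructed from the `prompt` column shown in this notebook;
# the exact whitespace of the original template is an assumption.
PROMPT_TEMPLATE = (
    "<s>[INST]\n"
    "Write an appropriate docstring for the following Python function. "
    "Return the entire function with the in-line docstring.\n\n"
    "### Function: {raw_code}\n\n"
    "### Function with docstring:\n"
    "[/INST]"
)


def build_training_example(raw_code: str, code_with_docstring: str) -> dict[str, str]:
    """Pair an instruction prompt with the completion the model should learn."""
    return {
        "prompt": PROMPT_TEMPLATE.format(raw_code=raw_code),
        "completion": code_with_docstring,
    }


example = build_training_example(
    raw_code="def add(a, b):\n    return a + b",
    code_with_docstring='def add(a, b):\n    """Return the sum of a and b."""\n    return a + b',
)
print(example["prompt"])
```

Keeping the instruction text identical across all rows means the only thing that varies between examples is the function itself, which is what the model should attend to.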

In [None]:
!mkdir -p /content/datasets/docstring_generation

In [None]:
dataset_file_path: str = f"/content/datasets/docstring_generation/{deployment.name}.csv"
dataset_file_path

'/content/datasets/docstring_generation/codellama-13b-instruct.csv'

In [None]:
df_dataset.to_csv(path_or_buf=dataset_file_path, index=False)

In [None]:
dataset_name: str = f"Docstring_generation_dataset_{deployment.name}"

In [None]:
repo_ref: str = f"Docstring_generation_adapter-{deployment.name}"

In [None]:
dataset: Dataset

In [None]:
# dataset = pb.datasets.from_file(file_path=dataset_file_path, name=dataset_name)

In [None]:
dataset = pb.datasets.get(dataset_ref=dataset_name)

In [None]:
dataset

Dataset(uuid='a03ff65e-ffd2-4760-9ab9-a79d11a77ef4', name='Docstring_generation_dataset_codellama-13b-instruct', connection_type='file', connection_name='file_uploads', status='connected')

In [None]:
repo: Repo

In [None]:
# repo = pb.repos.create(name=repo_ref, description="Fine-tuning on Docstring Generation dataset with Predibase.")

In [None]:
repo = pb.repos.get(repo_ref=repo_ref)

In [None]:
repo

Repo(uuid='464f9403-fb82-4927-8b69-ed9ea502e6b5', name='Docstring_generation_adapter-codellama-13b-instruct', description='Fine-tuning on Gridspace-Stanford Harper Valley speech dataset with Predibase.')

In [None]:
# Create an adapter
adapter: FinetuningJob = pb.finetuning.jobs.create(
    config={
        "base_model": deployment.name,
        "epochs": 5,
        "learning_rate": 0.0002,
    },
    dataset=dataset,
    repo=repo_ref,
    description=f'Fine-tune "{deployment.name}" with Docstring Generation dataset.',
)

Successfully requested finetuning of codellama-13b-instruct as `Docstring_generation_adapter-codellama-13b-instruct/1`. (Job UUID: 48e5c55e-2de3-4246-b039-2cea59fcd7d0).



# **Model Testing** 🧑‍🔬
Now that the model is trained, let's test it on a few test examples. Before that, we need to make the fine-tuned model available for inference. Because the codellama-13b-instruct model is already deployed on Predibase, we can load our fine-tuned adapter on top of that deployment. All the models deployed on Predibase are listed at https://docs.predibase.com/user-guide/inference/models.

We also compare the fine-tuned model with the base codellama-13b-instruct model to understand the performance improvement due to fine-tuning.

In [None]:
print(
    lorax_generate(
        client=client,
        prompt=df_test.prompt.iloc[1],
        max_new_tokens=2048,
        temperature=0.1,
    )
)

 ```
def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):
    """
    Download video from Dailymotion.

    Args:
        url (str): The URL of the Dailymotion video.
        output_dir (str, optional): The directory to save the video. Defaults to the current directory.
        merge (bool, optional): Whether to merge the video with other files that have been downloaded. Defaults to True.
        info_only (bool, optional): Whether to only show the information of the video. Defaults to False.
        **kwargs: Optional arguments for download.

    Returns:
        str: The path to the downloaded video file, or the information of the video.
    """
    html = get_content(rebuilt_url(url))
    info = json.loads(match1(html, r'qualities":({.+?}),"'))
    title = match1(html, r'"video_title"\s*:\s*"([^"]+)"') or \
            match1(html, r'"title"\s*:\s*"([^"]+)"')
    title = unicodize(title)

    for quality in ['1080','720','480','380','240','144','auto

In [None]:
print(
    lorax_generate(
        client=client,
        prompt=df_test.prompt.iloc[1],
        adapter_id=repo.name,
        adapter_version=1,
        max_new_tokens=2048, # fine-tuned LLMs actually know how to stop early, so it will not hit the 2048 token limit set here
        temperature=0.1,
    )
)

def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):
    """Download from dailymotion.com.

    url = video URL.
    """

    html = get_content(rebuilt_url(url))
    info = json.loads(match1(html, r'qualities":({.+?}),"'))
    title = match1(html, r'"video_title"\s*:\s*"([^"]+)"') or \
            match1(html, r'"title"\s*:\s*"([^"]+)"')
    title = unicodize(title)

    for quality in ['1080','720','480','380','240','144','auto']:
        try:
            real_url = info[quality][1]["url"]
            if real_url:
                break
        except KeyError:
            pass

    mime, ext, size = url_info(real_url)

    print_info(site_info, title, mime, size)
    if not info_only:
        download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)


Finally, we run the fine-tuned model on the test samples and compute the ROUGE score. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It measures how much a generated text overlaps with a ground-truth reference, so it can only be used when references are available.

The ROUGE-N metric computes precision, recall, and F-measure on the N-gram overlap between the generated text and the ground truth. This notebook reports ROUGE-L (the `rougeL_fmeasure` key), a variant based on the longest common subsequence instead of fixed-length N-grams. ROUGE is widely used for evaluating summarization and translation models. See https://huggingface.co/spaces/evaluate-metric/rouge for more details on the ROUGE metric.
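
To make the longest-common-subsequence idea concrete, here is a minimal sketch of a ROUGE-L F-measure on lowercased, whitespace-split tokens. It is only an illustration: `torchmetrics` applies its own tokenization and normalization, so its scores may differ from this simplified version. The function names are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(generated: str, reference: str) -> float:
    """Simplified ROUGE-L F-measure on whitespace tokens."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    if not gen_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(gen_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(gen_tokens)  # share of generated tokens in the LCS
    recall = lcs / len(ref_tokens)     # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("download video from dailymotion",
                 "downloads dailymotion videos by url"))
```

Because the score rewards tokens that appear in the same order in both texts, two outputs that share a long function body will score highly even if their docstrings differ.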

In [None]:
!pip install torchmetrics


In [None]:
from torchmetrics.text.rouge import ROUGEScore
from tqdm import tqdm

In [None]:
def get_rouge(generated_text, target_text):
    rouge = ROUGEScore()
    return rouge([generated_text], [target_text])["rougeL_fmeasure"].item()

In [None]:
for i in tqdm(range(df_test.shape[0])):
    result = lorax_generate(
        client=client,
        prompt=df_test.prompt.iloc[i],
        adapter_id=repo.name,
        adapter_version=1,
        max_new_tokens=2048, # fine-tuned LLMs actually know how to stop early, so it will not hit the 2048 token limit set here
        temperature=0.1,
    )
    df_test.loc[i, "Generated code_with_docstring finetunedmodel"] = result

    result = lorax_generate(
        client=client,
        prompt=df_test.prompt.iloc[i],
        max_new_tokens=2048,
        temperature=0.1,
    )
    df_test.loc[i, "Generated code_with_docstring basemodel"] = result

100%|██████████| 100/100 [34:56<00:00, 20.96s/it]


In [None]:
df_test['RougeL_basemodel'] = df_test.apply(lambda x: get_rouge(x['Generated code_with_docstring basemodel'], x['code_with_docstring']), axis=1)

In [None]:
df_test['RougeL_finetunedmodel'] = df_test.apply(lambda x: get_rouge(x['Generated code_with_docstring finetunedmodel'], x['code_with_docstring']), axis=1)

Let's see the average ROUGE-L score on the entire test corpus with and without model fine-tuning.

In [None]:
print (df_test['RougeL_basemodel'].mean(), df_test['RougeL_finetunedmodel'].mean())

0.6307747969031334 0.7910575208067894


In [None]:
df_test.head()

Unnamed: 0,raw_code,docstring,code_with_docstring,prompt,completion,Generated code_with_docstring finetunedmodel,Generated code_with_docstring basemodel,RougeL_basemodel,RougeL_finetunedmodel
0,"def sina_xml_to_url_list(xml_data):\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl",str->list\n Convert XML to URL List.\n From Biligrab.,"def sina_xml_to_url_list(xml_data):\n """"""str->list\n Convert XML to URL List.\n From Biligrab.\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl","<s>[INST] \n Write an appropriate docstring for the following Python function. Return the\n entire function with the in-line docstring.\n\n ### Function: def sina_xml_to_url_list(xml_data):\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl\n\n ### Function with docstring:\n [/INST]","def sina_xml_to_url_list(xml_data):\n """"""str->list\n Convert XML to URL List.\n From Biligrab.\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl","def sina_xml_to_url_list(xml_data):\n """"""Convert Sina xml data to url list.\n\n :param xml_data: xml data from Sina\n :returns: a list of urls\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl","```\ndef sina_xml_to_url_list(xml_data):\n """"""\n Parse the given XML data and return a list of URLs.\n\n The XML data should be in the format of the XML returned by the\n Sina Weibo API.\n\n Args:\n xml_data (str): The XML data to be parsed.\n\n Returns:\n list: A list of URLs extracted from the XML data.\n 
""""""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl\n```",0.606557,0.826087
1,"def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n \n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)",Downloads Dailymotion videos by URL.,"def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Dailymotion videos by URL.\n """"""\n\n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)","<s>[INST] \n Write an appropriate docstring for the following Python function. 
Return the\n entire function with the in-line docstring.\n\n ### Function: def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n \n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)\n\n ### Function with docstring:\n [/INST]","def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Dailymotion videos by URL.\n """"""\n\n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)","def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Download from dailymotion.com.\n\n url = video URL.\n output_dir = directory to save file.\n """"""\n\n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality 
in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)","```\ndef dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n Download video from Dailymotion.\n\n Args:\n url (str): The URL of the Dailymotion video.\n output_dir (str, optional): The directory to save the video. Defaults to the current directory.\n merge (bool, optional): Whether to merge the video with other files that have been downloaded. Defaults to True.\n info_only (bool, optional): Whether to only show the information of the video. Defaults to False.\n **kwargs: Optional arguments for download.\n\n Returns:\n str: The path to the downloaded video file, or the information of the video.\n """"""\n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','48...",0.600601,0.934579
2,"def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_page, r'vkey\s*:\s*""([^""]+)""')\n if vkey is None:\n vid = match1(url, r'#(\d+)')\n sina_download_by_vid(vid, output_dir=output_dir, me...",Downloads Sina videos by URL.,"def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Sina videos by URL.\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_page, r'vkey\s*:\s*""([^""]+)""')\n if vkey is None:\n vid = match1(url, r'#(\d+)')\n sina_download_by_...","<s>[INST] \n Write an appropriate docstring for the following Python function. 
Return the\n entire function with the in-line docstring.\n\n ### Function: def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_...","def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Sina videos by URL.\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_page, r'vkey\s*:\s*""([^""]+)""')\n if vkey is None:\n vid = match1(url, r'#(\d+)')\n sina_download_by_...","def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Download video from news.sina.com.cn using sina_download_by_vid or sina_download_by_vkey.\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n 
vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_page, r'vkey\s*:\s*""([^""]+)""')\n if vkey is None:\n ...","```\ndef sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n Download video from Sina video platform.\n\n Args:\n url (str): URL of the video.\n output_dir (str, optional): Output directory for the downloaded video. Defaults to the current directory.\n merge (bool, optional): Whether to merge video and audio streams. Defaults to True.\n info_only (bool, optional): Whether to only return the metadata of the video. Defaults to False.\n **kwargs: Optional arguments for the download function.\n\n Returns:\n str: Path to the downloaded video file, or the metadata of the video if info_only is True.\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\...",0.806452,0.951087
3,"def sprint(text, *colors):\n """""" return ""\33[{}m{content}\33[{}m"".format("";"".join([str(color) for color in colors]), RESET, content=text) if IS_ANSI_TERMINAL and colors else text",Format text with color or other effects into ANSI escaped string.,"def sprint(text, *colors):\n """"""Format text with color or other effects into ANSI escaped string.""""""\n return ""\33[{}m{content}\33[{}m"".format("";"".join([str(color) for color in colors]), RESET, content=text) if IS_ANSI_TERMINAL and colors else text","<s>[INST] \n Write an appropriate docstring for the following Python function. Return the\n entire function with the in-line docstring.\n\n ### Function: def sprint(text, *colors):\n """""" return ""\33[{}m{content}\33[{}m"".format("";"".join([str(color) for color in colors]), RESET, content=text) if IS_ANSI_TERMINAL and colors else text\n\n ### Function with docstring:\n [/INST]","def sprint(text, *colors):\n """"""Format text with color or other effects into ANSI escaped string.""""""\n return ""\33[{}m{content}\33[{}m"".format("";"".join([str(color) for color in colors]), RESET, content=text) if IS_ANSI_TERMINAL and colors else text","def sprint(text, *colors):\n """"""Print text with colors.""""""\n return ""\33[{}m{content}\33[{}m"".format("";"".join([str(color) for color in colors]), RESET, content=text) if IS_ANSI_TERMINAL and colors else text","```\ndef sprint(text, *colors):\n """"""\n Return the text with the given colors applied.\n\n If the terminal supports ANSI escape sequences, the text will be colored.\n Otherwise, the text will be returned as-is.\n\n Args:\n text (str): The text to be colored.\n *colors (int): The colors to apply to the text.\n\n Returns:\n str: The colored text.\n """"""\n return ""\33[{}m{content}\33[{}m"".format("";"".join([str(color) for color in colors]), RESET, content=text) if IS_ANSI_TERMINAL and colors else text\n```\nThis docstring provides a brief description of the function, as well as the arguments and return 
value. It also includes a brief explanation of the function's behavior, including the use of ANSI escape sequences to color the text. The `*colors` parameter is documented as a tuple of integers, which are the colors to apply to the text. The `RESET` constant is also documented, as it is used to reset the terminal's color settings.",0.329897,0.849315
4,"def print_log(text, *colors):\n """""" sys.stderr.write(sprint(""{}: {}"".format(script_name, text), *colors) + ""\n"")",Print a log message to standard error.,"def print_log(text, *colors):\n """"""Print a log message to standard error.""""""\n sys.stderr.write(sprint(""{}: {}"".format(script_name, text), *colors) + ""\n"")","<s>[INST] \n Write an appropriate docstring for the following Python function. Return the\n entire function with the in-line docstring.\n\n ### Function: def print_log(text, *colors):\n """""" sys.stderr.write(sprint(""{}: {}"".format(script_name, text), *colors) + ""\n"")\n\n ### Function with docstring:\n [/INST]","def print_log(text, *colors):\n """"""Print a log message to standard error.""""""\n sys.stderr.write(sprint(""{}: {}"".format(script_name, text), *colors) + ""\n"")","def print_log(text, *colors):\n """"""Prints text to stderr with a script name prefix.""""""\n sys.stderr.write(sprint(""{}: {}"".format(script_name, text), *colors) + ""\n"")","```\ndef print_log(text, *colors):\n """"""Print the given text to the log file.\n\n Args:\n text (str): The text to print to the log file.\n *colors (str): The colors to use for the text.\n\n Returns:\n None\n\n Raises:\n None\n """"""\n sys.stderr.write(sprint(""{}: {}"".format(script_name, text), *colors) + ""\n"")\n```\nThis docstring provides a brief description of the function, its arguments, and its return value. It also mentions any exceptions that the function may raise. The `sprint` function is not defined in the docstring, but it is assumed to be a built-in function that formats the text with the given colors. The `script_name` variable is also not defined in the docstring, but it is assumed to be a global variable that contains the name of the script.",0.243243,0.695652


## **Result Analysis**

The average ROUGE-L score rises from 0.631 for the base codellama-13b model to 0.791 for the fine-tuned model, a relative improvement of about 25%. The examples shown above also suggest that the base codellama model does not fully understand the task: it often appends an explanation of the function to its output instead of returning only the function with its docstring.

We also compute a similarity-based metric to measure the semantic similarity between the generated docstring and the ground truth.

In [None]:
!pip install -U sentence-transformers transformers


In [None]:
from sentence_transformers import SentenceTransformer

emb_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(target, gt):
  emb_gt = emb_model.encode([gt])
  emb_target = emb_model.encode([target])
  return cosine_similarity(emb_gt, emb_target)[0][0]
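
For intuition, the cosine similarity that `calculate_similarity` applies to sentence embeddings can be sketched in plain Python on toy vectors. The helper name `cosine` and the zero-vector convention are assumptions of this sketch. In the notebook itself the computation is done by scikit-learn on the embedding arrays.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal directions
```

The score depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing text embeddings.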

In [None]:
df_test['similarity_basemodel'] = df_test.apply(lambda x: calculate_similarity(x['Generated code_with_docstring basemodel'], x['code_with_docstring']), axis=1)
df_test['similarity_finetunedmodel'] = df_test.apply(lambda x: calculate_similarity(x['Generated code_with_docstring finetunedmodel'], x['code_with_docstring']), axis=1)

In [None]:
print (df_test['similarity_basemodel'].mean())

0.8684739


In [None]:
print (df_test['similarity_finetunedmodel'].mean())

0.90929055


## **Result Analysis**

Even on the semantic similarity measure, the fine-tuned model scores higher than the base codellama model: 0.909 versus 0.868, a relative improvement of about 4.7%.
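
The relative gains for both metrics can be recomputed directly from the means printed earlier in this notebook. The helper name `relative_improvement` is hypothetical. The input values are copied from the printed outputs.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


# Mean scores printed earlier in this notebook (base model, fine-tuned model).
rouge_gain = relative_improvement(0.6307747969031334, 0.7910575208067894)
similarity_gain = relative_improvement(0.8684739, 0.90929055)

print(f"ROUGE-L gain: {rouge_gain:.1%}")
print(f"Similarity gain: {similarity_gain:.1%}")
```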

## **Final Remarks**
For both models, we compute the ROUGE metric on the full generated text. Because the fine-tuned model generates the docstring together with the original Python code, the ROUGE numbers can be inflated by the code that both texts share. To evaluate the fine-tuned model on the generated docstring alone, we first extract the docstring from the generated text and then calculate ROUGE. This approach is not applicable to the base codellama model, because that model does not always reproduce the original Python code.

In [None]:
def extract_generated_docstring(text):
    l = text.split("\n")
    out = ""
    start_idx = 0
    end_idx = 0
    for idx, i in enumerate(l):
        if i.strip().startswith('"""'):
            start_idx = idx
            break

    for idx, i in enumerate(l):
        if i.strip().endswith('"""'):
            end_idx = idx
            break

    #print (start_idx, end_idx)
    l2 = [i for i in l[start_idx:end_idx+1] if i != '']

    return "\n".join(l2).replace('"""','').strip()
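
The line-based extraction above depends on where the triple quotes fall on each line. As a possible alternative, and only under the assumption that the generated text is syntactically valid Python, the standard-library `ast` module can locate the docstring instead. This sketch is not what the notebook ran, and the function name `extract_docstring_with_ast` is hypothetical.

```python
import ast
import textwrap


def extract_docstring_with_ast(generated_code: str) -> str:
    """Return the docstring of the first function in `generated_code`, or ""."""
    try:
        tree = ast.parse(textwrap.dedent(generated_code))
    except SyntaxError:
        # The generation was truncated or is not valid Python.
        return ""
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_docstring returns None when the function has no docstring.
            return ast.get_docstring(node) or ""
    return ""


sample = 'def f(x):\n    """Add one.\n\n    :param x: a number\n    """\n    return x + 1\n'
print(extract_docstring_with_ast(sample))
```

Generations that fail to parse yield an empty string here, so the share of empty results would need to be checked before trusting any averaged score.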

In [None]:
df_test['Generated docstring_finetunedmodel'] = df_test['Generated code_with_docstring finetunedmodel'].apply(extract_generated_docstring)

In [None]:
df_test['RougeL_finetunedmodel_only_docstring'] = df_test.apply(lambda x: get_rouge(x['Generated docstring_finetunedmodel'], x['docstring']), axis=1)

In [None]:
print (df_test['RougeL_finetunedmodel_only_docstring'].mean())

0.060346930846571924


In [None]:
df_test['similarity_finetunedmodel_only_docstring'] = df_test.apply(lambda x: calculate_similarity(x['Generated docstring_finetunedmodel'], x['docstring']), axis=1)

In [None]:
print (df_test['similarity_finetunedmodel_only_docstring'].mean())

0.15051812


In [None]:
df_test.head()

Unnamed: 0,raw_code,docstring,code_with_docstring,prompt,completion,Generated code_with_docstring finetunedmodel,Generated code_with_docstring basemodel,RougeL_basemodel,RougeL_finetunedmodel,similarity_basemodel,similarity_finetunedmodel,Generated docstring_finetunedmodel,RougeL_finetunedmodel_only_docstring,similarity_finetunedmodel_only_docstring
0,"def sina_xml_to_url_list(xml_data):\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl",str->list\n Convert XML to URL List.\n From Biligrab.,"def sina_xml_to_url_list(xml_data):\n """"""str->list\n Convert XML to URL List.\n From Biligrab.\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl","<s>[INST] \n Write an appropriate docstring for the following Python function. Return the\n entire function with the in-line docstring.\n\n ### Function: def sina_xml_to_url_list(xml_data):\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl\n\n ### Function with docstring:\n [/INST]","def sina_xml_to_url_list(xml_data):\n """"""str->list\n Convert XML to URL List.\n From Biligrab.\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl","def sina_xml_to_url_list(xml_data):\n """"""Convert Sina xml data to url list.\n\n :param xml_data: xml data from Sina\n :returns: a list of urls\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl","```\ndef sina_xml_to_url_list(xml_data):\n """"""\n Parse the given XML data and return a list of URLs.\n\n The XML data should be in the format of the XML returned by the\n Sina Weibo API.\n\n Args:\n xml_data (str): The XML data to be parsed.\n\n Returns:\n list: A list of URLs extracted from the XML data.\n 
""""""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.childNodes[0].data)\n return rawurl\n```",0.606557,0.826087,0.844222,0.943416,Convert Sina xml data to url list.\n :param xml_data: xml data from Sina\n :returns: a list of urls,0.428571,0.641119
1,"def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n \n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)",Downloads Dailymotion videos by URL.,"def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Dailymotion videos by URL.\n """"""\n\n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)","<s>[INST] \n Write an appropriate docstring for the following Python function. 
Return the\n entire function with the in-line docstring.\n\n ### Function: def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n \n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)\n\n ### Function with docstring:\n [/INST]","def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Dailymotion videos by URL.\n """"""\n\n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)","def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Download from dailymotion.com.\n\n url = video URL.\n output_dir = directory to save file.\n """"""\n\n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality 
in ['1080','720','480','380','240','144','auto']:\n try:\n real_url = info[quality][1][""url""]\n if real_url:\n break\n except KeyError:\n pass\n\n mime, ext, size = url_info(real_url)\n\n print_info(site_info, title, mime, size)\n if not info_only:\n download_urls([real_url], title, ext, size, output_dir=output_dir, merge=merge)","```\ndef dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n Download video from Dailymotion.\n\n Args:\n url (str): The URL of the Dailymotion video.\n output_dir (str, optional): The directory to save the video. Defaults to the current directory.\n merge (bool, optional): Whether to merge the video with other files that have been downloaded. Defaults to True.\n info_only (bool, optional): Whether to only show the information of the video. Defaults to False.\n **kwargs: Optional arguments for download.\n\n Returns:\n str: The path to the downloaded video file, or the information of the video.\n """"""\n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""([^""]+)""') or \\n match1(html, r'""title""\s*:\s*""([^""]+)""')\n title = unicodize(title)\n\n for quality in ['1080','720','48...",0.600601,0.934579,0.83129,0.968849,Download from dailymotion.com.\n url = video URL.\n output_dir = directory to save file.,0.222222,0.75692
2,"def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_page, r'vkey\s*:\s*""([^""]+)""')\n if vkey is None:\n vid = match1(url, r'#(\d+)')\n sina_download_by_vid(vid, output_dir=output_dir, me...",Downloads Sina videos by URL.,"def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Sina videos by URL.\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_page, r'vkey\s*:\s*""([^""]+)""')\n if vkey is None:\n vid = match1(url, r'#(\d+)')\n sina_download_by_...","<s>[INST] \n Write an appropriate docstring for the following Python function. 
Return the\n entire function with the in-line docstring.\n\n ### Function: def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_...","def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Sina videos by URL.\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_page, r'vkey\s*:\s*""([^""]+)""')\n if vkey is None:\n vid = match1(url, r'#(\d+)')\n sina_download_by_...","def sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Download video from news.sina.com.cn using sina_download_by_vid or sina_download_by_vkey.\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n 
vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\s*\'([^\']+)\'')\n if hd_vid == '0':\n vids = match1(video_page, r'[^\w]vid\s*:\s*\'([^\']+)\'').split('|')\n vid = vids[-1]\n\n if vid is None:\n vid = match1(video_page, r'vid:""?(\d+)""?')\n if vid:\n #title = match1(video_page, r'title\s*:\s*\'([^\']+)\'')\n sina_download_by_vid(vid, output_dir=output_dir, merge=merge, info_only=info_only)\n else:\n vkey = match1(video_page, r'vkey\s*:\s*""([^""]+)""')\n if vkey is None:\n ...","```\ndef sina_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""\n Download video from Sina video platform.\n\n Args:\n url (str): URL of the video.\n output_dir (str, optional): Output directory for the downloaded video. Defaults to the current directory.\n merge (bool, optional): Whether to merge video and audio streams. Defaults to True.\n info_only (bool, optional): Whether to only return the metadata of the video. Defaults to False.\n **kwargs: Optional arguments for the download function.\n\n Returns:\n str: Path to the downloaded video file, or the metadata of the video if info_only is True.\n """"""\n if 'news.sina.com.cn/zxt' in url:\n sina_zxt(url, output_dir=output_dir, merge=merge, info_only=info_only, **kwargs)\n return\n\n vid = match1(url, r'vid=(\d+)')\n if vid is None:\n video_page = get_content(url)\n vid = hd_vid = match1(video_page, r'hd_vid\s*:\...",0.806452,0.951087,0.79643,0.979025,Download video from news.sina.com.cn using sina_download_by_vid or sina_download_by_vkey.,0.181818,0.81832
3,"def sprint(text, *colors):\n """""" return ""\33[{}m{content}\33[{}m"".format("";"".join([str(color) for color in colors]), RESET, content=text) if IS_ANSI_TERMINAL and colors else text",Format text with color or other effects into ANSI escaped string.,"def sprint(text, *colors):\n """"""Format text with color or other effects into ANSI escaped string.""""""\n return ""\33[{}m{content}\33[{}m"".format("";"".join([str(color) for color in colors]), RESET, content=text) if IS_ANSI_TERMINAL and colors else text","<s>[INST] \n Write an appropriate docstring for the following Python function. Return the\n entire function with the in-line docstring.\n\n ### Function: def sprint(text, *colors):\n """""" return ""\33[{}m{content}\33[{}m"".format("";"".join([str(color) for color in colors]), RESET, content=text) if IS_ANSI_TERMINAL and colors else text\n\n ### Function with docstring:\n [/INST]","def sprint(text, *colors):\n """"""Format text with color or other effects into ANSI escaped string.""""""\n return ""\33[{}m{content}\33[{}m"".format("";"".join([str(color) for color in colors]), RESET, content=text) if IS_ANSI_TERMINAL and colors else text","def sprint(text, *colors):\n """"""Print text with colors.""""""\n return ""\33[{}m{content}\33[{}m"".format("";"".join([str(color) for color in colors]), RESET, content=text) if IS_ANSI_TERMINAL and colors else text","```\ndef sprint(text, *colors):\n """"""\n Return the text with the given colors applied.\n\n If the terminal supports ANSI escape sequences, the text will be colored.\n Otherwise, the text will be returned as-is.\n\n Args:\n text (str): The text to be colored.\n *colors (int): The colors to apply to the text.\n\n Returns:\n str: The colored text.\n """"""\n return ""\33[{}m{content}\33[{}m"".format("";"".join([str(color) for color in colors]), RESET, content=text) if IS_ANSI_TERMINAL and colors else text\n```\nThis docstring provides a brief description of the function, as well as the arguments and return 
value. It also includes a brief explanation of the function's behavior, including the use of ANSI escape sequences to color the text. The `*colors` parameter is documented as a tuple of integers, which are the colors to apply to the text. The `RESET` constant is also documented, as it is used to reset the terminal's color settings.",0.329897,0.849315,0.861546,0.962674,Print text with colors.,0.266667,0.574295
4,"def print_log(text, *colors):\n """""" sys.stderr.write(sprint(""{}: {}"".format(script_name, text), *colors) + ""\n"")",Print a log message to standard error.,"def print_log(text, *colors):\n """"""Print a log message to standard error.""""""\n sys.stderr.write(sprint(""{}: {}"".format(script_name, text), *colors) + ""\n"")","<s>[INST] \n Write an appropriate docstring for the following Python function. Return the\n entire function with the in-line docstring.\n\n ### Function: def print_log(text, *colors):\n """""" sys.stderr.write(sprint(""{}: {}"".format(script_name, text), *colors) + ""\n"")\n\n ### Function with docstring:\n [/INST]","def print_log(text, *colors):\n """"""Print a log message to standard error.""""""\n sys.stderr.write(sprint(""{}: {}"".format(script_name, text), *colors) + ""\n"")","def print_log(text, *colors):\n """"""Prints text to stderr with a script name prefix.""""""\n sys.stderr.write(sprint(""{}: {}"".format(script_name, text), *colors) + ""\n"")","```\ndef print_log(text, *colors):\n """"""Print the given text to the log file.\n\n Args:\n text (str): The text to print to the log file.\n *colors (str): The colors to use for the text.\n\n Returns:\n None\n\n Raises:\n None\n """"""\n sys.stderr.write(sprint(""{}: {}"".format(script_name, text), *colors) + ""\n"")\n```\nThis docstring provides a brief description of the function, its arguments, and its return value. It also mentions any exceptions that the function may raise. The `sprint` function is not defined in the docstring, but it is assumed to be a built-in function that formats the text with the given colors. The `script_name` variable is also not defined in the docstring, but it is assumed to be a global variable that contains the name of the script.",0.243243,0.695652,0.852648,0.974473,Prints text to stderr with a script name prefix.,0.125,0.429761
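
The score columns in the rows above are easier to compare once they are averaged. The following is a minimal sketch, not part of the original notebook: it copies the four score columns for the five rows shown (indices 0 to 4) into a plain dictionary, so the names `scores`, `averages`, and `finetuned_wins_every_row` are illustrative assumptions, and the averages cover only these five rows, not the full evaluation set.

```python
from statistics import mean

# Scores copied by hand from the five rows shown above (indices 0-4).
# This is only a sample of the table, not the full evaluation set.
scores = {
    "RougeL_basemodel": [0.606557, 0.600601, 0.806452, 0.329897, 0.243243],
    "RougeL_finetunedmodel": [0.826087, 0.934579, 0.951087, 0.849315, 0.695652],
    "similarity_basemodel": [0.844222, 0.83129, 0.79643, 0.861546, 0.852648],
    "similarity_finetunedmodel": [0.943416, 0.968849, 0.979025, 0.962674, 0.974473],
}

# Mean of each score column over the five sample rows.
averages = {name: mean(values) for name, values in scores.items()}

# Row-by-row check: does the fine-tuned model beat the base model on both metrics?
finetuned_wins_every_row = all(
    ft > base
    for metric in ("RougeL", "similarity")
    for base, ft in zip(scores[f"{metric}_basemodel"], scores[f"{metric}_finetunedmodel"])
)

for name, value in averages.items():
    print(f"{name}: {value:.3f}")
print("fine-tuned higher on every row:", finetuned_wins_every_row)
```

On these five rows the fine-tuned model scores higher than the base model on both RougeL and similarity in every case. The same comparison could be run on the whole table by loading the CSV into a DataFrame and calling `.mean()` on the score columns.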
