New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add proper Docstrings to all public functions and classes #119

Closed
m3nu opened this Issue May 14, 2018 · 14 comments

Comments

Projects
None yet
2 participants
@m3nu
Collaborator

m3nu commented May 14, 2018

To help users get started quickly, let's add some docstrings. I linked to some style guides in the wiki that deal with the practice.

@m3nu m3nu added this to the gsoc-1 milestone May 14, 2018

@duskybomb

This comment has been minimized.

Show comment
Hide comment
@duskybomb

duskybomb May 14, 2018

Collaborator

Should I add single line docstring (telling what it does and returns) or a multi-line (giving info about all variables and return values)?

Collaborator

duskybomb commented May 14, 2018

Should I add single line docstring (telling what it does and returns) or a multi-line (giving info about all variables and return values)?

@m3nu

This comment has been minimized.

Show comment
Hide comment
@m3nu

m3nu May 14, 2018

Collaborator

Complex public-facing functions will need multiple lines with all details.

Simple internal helper functions may only need a single line.

Collaborator

m3nu commented May 14, 2018

Complex public-facing functions will need multiple lines with all details.

Simple internal helper functions may only need a single line.

@m3nu

This comment has been minimized.

Show comment
Hide comment
@m3nu

m3nu May 14, 2018

Collaborator

In the end, anyone who never used the library needs to know enough about arguments and return values to get started quickly. This is true for invoice2data and even more so for facturx.

Collaborator

m3nu commented May 14, 2018

In the end, anyone who never used the library needs to know enough about arguments and return values to get started quickly. This is true for invoice2data and even more so for facturx.

@duskybomb

This comment has been minimized.

Show comment
Hide comment
@duskybomb

duskybomb May 14, 2018

Collaborator

How does this look for extract_data()

"""Function to extract data from e-invoices

:param invoicefile: path of invoice
:param templates: load templates (template name)
:param input_module: input module to use out of input_mapping dict. (default pdftotext)
:return: dict of extracted and matched fields or False if no template matches
"""
Collaborator

duskybomb commented May 14, 2018

How does this look for extract_data()

"""Function to extract data from e-invoices

:param invoicefile: path of invoice
:param templates: load templates (template name)
:param input_module: input module to use out of input_mapping dict. (default pdftotext)
:return: dict of extracted and matched fields or False if no template matches
"""
@m3nu

This comment has been minimized.

Show comment
Hide comment
@m3nu

m3nu May 14, 2018

Collaborator
  • The text could use some more detail on what the function does.
  • The return desc is fine.
  • Parameters also need more details. E.g. if a file point can be passed instead of a string (path).

Like in maths this needs precision because those things make a difference.

Collaborator

m3nu commented May 14, 2018

  • The text could use some more detail on what the function does.
  • The return desc is fine.
  • Parameters also need more details. E.g. if a file point can be passed instead of a string (path).

Like in maths this needs precision because those things make a difference.

@duskybomb

This comment has been minimized.

Show comment
Hide comment
@duskybomb

duskybomb May 14, 2018

Collaborator

Like this?

 """extract data from e-invoices
     * reads template if no template assigned
     * text is extracted from invoice using input module and stored in extracted_str
     * extracted text is sent for optimization in prepare_input()
     * required fields are matches from templates

    Note: Import all required module when using as a library
        Example: If you want to use pdfminer
            >>> from invoice2data.main import pdfminer
            >>> extract_data("/home/duskybomb/pdf/invoice.pdf", "com.aws-invoice.yaml", pdfminer)

    :param invoicefile: path of invoice (example: "/home/duskybomb/pdf/invoice.pdf")
    :param templates: load templates (template name)
    :param input_module: input module to use out of input_mapping dict. (default pdftotext)

    :var extracted_str: saves extracted text from input module
    :var optimized_str: stores optimized string after passing extracted_str to prepare_input()

    :return: dict of extracted and matched fields, False if no template matches
 """

Made new commits to branch issue/119. Do check.

Collaborator

duskybomb commented May 14, 2018

Like this?

 """extract data from e-invoices
     * reads template if no template assigned
     * text is extracted from invoice using input module and stored in extracted_str
     * extracted text is sent for optimization in prepare_input()
     * required fields are matches from templates

    Note: Import all required module when using as a library
        Example: If you want to use pdfminer
            >>> from invoice2data.main import pdfminer
            >>> extract_data("/home/duskybomb/pdf/invoice.pdf", "com.aws-invoice.yaml", pdfminer)

    :param invoicefile: path of invoice (example: "/home/duskybomb/pdf/invoice.pdf")
    :param templates: load templates (template name)
    :param input_module: input module to use out of input_mapping dict. (default pdftotext)

    :var extracted_str: saves extracted text from input module
    :var optimized_str: stores optimized string after passing extracted_str to prepare_input()

    :return: dict of extracted and matched fields, False if no template matches
 """

Made new commits to branch issue/119. Do check.

@m3nu

This comment has been minimized.

Show comment
Hide comment
@m3nu

m3nu May 15, 2018

Collaborator

I like the idea of giving an example. That's good. PEP doesn't define a sample for an example, but Google does.

For the first sentence, let's avoid the word "e-invoice" since it's not defined or used anywhere. Maybe like this:

Extracts structured elements (like data, amount, etc) from PDF invoices using different extraction backends and pre-defined templates.

The list (reads template...) is really an extension of input and output params. So let's move it there and extend the input- and output arguments a bit.

And regarding params:

  • invoicefile needs to be a path, more precisely a str with the path of the PDF invoice to extract data from.
  • templates: You need to mention that this is an optional param and what happens when it's not defined. You also need to tell the reader that templates are loaded using the helper class we have and best give an example for it. Or add a good Docstring to the InvoiceTemplate class.
  • input_module: Here you need to mention that the module needs to be imported first and maybe 2-3 words on what it does. And a note that people can define their own module.

Then the part of extracted- vs optimized string is an implementation detail and doesn't belong in the docstring. It's not important for using the function.

The return value could use an example again. Otherwise it's correct.

Collaborator

m3nu commented May 15, 2018

I like the idea of giving an example. That's good. PEP doesn't define a sample for an example, but Google does.

For the first sentence, let's avoid the word "e-invoice" since it's not defined or used anywhere. Maybe like this:

Extracts structured elements (like data, amount, etc) from PDF invoices using different extraction backends and pre-defined templates.

The list (reads template...) is really an extension of input and output params. So let's move it there and extend the input- and output arguments a bit.

And regarding params:

  • invoicefile needs to be a path, more precisely a str with the path of the PDF invoice to extract data from.
  • templates: You need to mention that this is an optional param and what happens when it's not defined. You also need to tell the reader that templates are loaded using the helper class we have and best give an example for it. Or add a good Docstring to the InvoiceTemplate class.
  • input_module: Here you need to mention that the module needs to be imported first and maybe 2-3 words on what it does. And a note that people can define their own module.

Then the part of extracted- vs optimized string is an implementation detail and doesn't belong in the docstring. It's not important for using the function.

The return value could use an example again. Otherwise it's correct.

@duskybomb

This comment has been minimized.

Show comment
Hide comment
@duskybomb

duskybomb May 15, 2018

Collaborator

I was using e-invoice and avoiding PDF invoice because invoicefile doesn't necessarily be a pdf. Can also be an image, right?

Collaborator

duskybomb commented May 15, 2018

I was using e-invoice and avoiding PDF invoice because invoicefile doesn't necessarily be a pdf. Can also be an image, right?

@m3nu

This comment has been minimized.

Show comment
Hide comment
@m3nu

m3nu May 15, 2018

Collaborator

Valid point. It can be in image if the right input-module is chosen. But that's not documented or tested anywhere. You can add this feature when improving the OCR module.

Until then you could use "path to electronic invoice file in PDF, JPEG or [whatever is supported] format". "E-invoice" alone is more like a brand name and not clearly defined.

Collaborator

m3nu commented May 15, 2018

Valid point. It can be in image if the right input-module is chosen. But that's not documented or tested anywhere. You can add this feature when improving the OCR module.

Until then you could use "path to electronic invoice file in PDF, JPEG or [whatever is supported] format". "E-invoice" alone is more like a brand name and not clearly defined.

@duskybomb

This comment has been minimized.

Show comment
Hide comment
@duskybomb

duskybomb May 15, 2018

Collaborator

Done.

"""Extracts structured elements (like data, amount, etc) from PDF/image invoices using different extraction backends
        and pre-defined templates.
     * reads template if no template assigned
     * required fields are matches from templates

    :param invoicefile:
        path of electronic invoice file in PDF,JPEG,PNG (example: "/home/duskybomb/pdf/invoice.pdf")
        :type invoicefile: str
    :param templates: This parameter is optional
        load templates as .yml file (template name). Templates are loaded using read_template function in loader.py
    :param input_module: This parameter is optional
        library to be used to extract text from given invoicefile,
        Currently supported libraries are: pdftotext, pdfminer, tesseract (default pdftotext)

        Note:
            Import required input_module when using invoice2data as a library
            Example: If you want to use pdfminer
                >>> from invoice2data.main import pdfminer
                >>> extract_data("/home/duskybomb/pdf/invoice.pdf", "com.aws-invoice.yaml", pdfminer)

    :return: extracted and matched fields, False if no template matches
    :rtype: dict
        Example:    {'issuer': 'OYO', 'amount': 1939.0, 'date': datetime.datetime(2017, 12, 31, 0, 0),
                    'invoice_number': 'IBZY2087', 'currency': 'INR', 'desc': 'Invoice IBZY2087
                    from OYO'}

"""

Should I make a PR now?

Collaborator

duskybomb commented May 15, 2018

Done.

"""Extracts structured elements (like data, amount, etc) from PDF/image invoices using different extraction backends
        and pre-defined templates.
     * reads template if no template assigned
     * required fields are matches from templates

    :param invoicefile:
        path of electronic invoice file in PDF,JPEG,PNG (example: "/home/duskybomb/pdf/invoice.pdf")
        :type invoicefile: str
    :param templates: This parameter is optional
        load templates as .yml file (template name). Templates are loaded using read_template function in loader.py
    :param input_module: This parameter is optional
        library to be used to extract text from given invoicefile,
        Currently supported libraries are: pdftotext, pdfminer, tesseract (default pdftotext)

        Note:
            Import required input_module when using invoice2data as a library
            Example: If you want to use pdfminer
                >>> from invoice2data.main import pdfminer
                >>> extract_data("/home/duskybomb/pdf/invoice.pdf", "com.aws-invoice.yaml", pdfminer)

    :return: extracted and matched fields, False if no template matches
    :rtype: dict
        Example:    {'issuer': 'OYO', 'amount': 1939.0, 'date': datetime.datetime(2017, 12, 31, 0, 0),
                    'invoice_number': 'IBZY2087', 'currency': 'INR', 'desc': 'Invoice IBZY2087
                    from OYO'}

"""

Should I make a PR now?

@m3nu

This comment has been minimized.

Show comment
Hide comment
@m3nu

m3nu May 15, 2018

Collaborator

Looks good. Let me just go over all the changes. Then I probably push a small commit. Then you can merge the branch. I'll also be on IRC today and working on factur-x for most of the day.

Collaborator

m3nu commented May 15, 2018

Looks good. Let me just go over all the changes. Then I probably push a small commit. Then you can merge the branch. I'll also be on IRC today and working on factur-x for most of the day.

@duskybomb

This comment has been minimized.

Show comment
Hide comment
@duskybomb

duskybomb May 15, 2018

Collaborator
"""Extracts structured data from PDF/image invoices.

    This function uses the text extracted from a PDF file or image and
    pre-defined regex templates to find structured data.

    Reads template if no template assigned
    Required fields are matches from templates

    Parameters
    ----------
    invoicefile : str
        path of electronic invoice file in PDF,JPEG,PNG (example: "/home/duskybomb/pdf/invoice.pdf")
    templates : list of instances of class `InvoiceTemplate`, optional
        Templates are loaded using `read_template` function in `loader.py`
    input_module : {'pdftotext', 'pdfminer', 'tesseract'}, optional
        library to be used to extract text from given `invoicefile`,

    Returns
    -------
    dict or False
        extracted and matched fields or False if no template matches

    Notes
    -----
    Import required `input_module` when using invoice2data as a library

    See Also
    --------
    read_template : Function where templates are loaded
    InvoiceTemplate : Class representing single template files that live as .yml files on the disk

    Examples
    --------
    When using `invoice2data` as an library

    >>> from invoice2data.input import pdftotext
    >>> extract_data("invoice2data/test/pdfs/oyo.pdf", None, pdftotext)
    {'issuer': 'OYO', 'amount': 1939.0, 'date': datetime.datetime(2017, 12, 31, 0, 0), 'invoice_number': 'IBZY2087', 
    'currency': 'INR', 'desc': 'Invoice IBZY2087 from OYO'}

"""

based on Numpydoc. Please check IRC

Collaborator

duskybomb commented May 15, 2018

"""Extracts structured data from PDF/image invoices.

    This function uses the text extracted from a PDF file or image and
    pre-defined regex templates to find structured data.

    Reads template if no template assigned
    Required fields are matches from templates

    Parameters
    ----------
    invoicefile : str
        path of electronic invoice file in PDF,JPEG,PNG (example: "/home/duskybomb/pdf/invoice.pdf")
    templates : list of instances of class `InvoiceTemplate`, optional
        Templates are loaded using `read_template` function in `loader.py`
    input_module : {'pdftotext', 'pdfminer', 'tesseract'}, optional
        library to be used to extract text from given `invoicefile`,

    Returns
    -------
    dict or False
        extracted and matched fields or False if no template matches

    Notes
    -----
    Import required `input_module` when using invoice2data as a library

    See Also
    --------
    read_template : Function where templates are loaded
    InvoiceTemplate : Class representing single template files that live as .yml files on the disk

    Examples
    --------
    When using `invoice2data` as an library

    >>> from invoice2data.input import pdftotext
    >>> extract_data("invoice2data/test/pdfs/oyo.pdf", None, pdftotext)
    {'issuer': 'OYO', 'amount': 1939.0, 'date': datetime.datetime(2017, 12, 31, 0, 0), 'invoice_number': 'IBZY2087', 
    'currency': 'INR', 'desc': 'Invoice IBZY2087 from OYO'}

"""

based on Numpydoc. Please check IRC

@duskybomb

This comment has been minimized.

Show comment
Hide comment
@duskybomb

duskybomb May 15, 2018

Collaborator

Done and Closing! 🥂

Collaborator

duskybomb commented May 15, 2018

Done and Closing! 🥂

@duskybomb duskybomb closed this May 15, 2018

@m3nu

This comment has been minimized.

Show comment
Hide comment
@m3nu

m3nu May 15, 2018

Collaborator

Hooray!

Collaborator

m3nu commented May 15, 2018

Hooray!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment