Identify Payment Method #141

Closed
Invigor opened this Issue Jun 7, 2018 · 12 comments

Invigor commented Jun 7, 2018

Hi,

I'm looking for a way to identify the payment method used for the invoice or receipt.

This is typically indicated by the words "VISA", "CASH" or "MCARD" on the receipt (see attached), but I don't think I can simply add a field to the template with a regex that searches for these terms, as I need to return a standard enum for the different payment types.

Can anyone think of a way to do this with a regex, or suggest an approach for adding this to the invoice_template module?

Thanks,

Michael

[Attached image: pepper_lunch_1]

Collaborator

m3nu commented Jun 7, 2018

can you help, @duskybomb ?

Invigor commented Jun 7, 2018

Hi all,

I've added a new option to the template as follows:

  find_<field_name>:
  - <option_static> <option_regex>
  - <option_static> <option_regex>
  - <option_static> <option_regex>

It checks the options against the text and returns field_name with the value option_static of the option that matched.

Note that the list is in priority order, i.e. the options are tried in order and only the first match is returned.

If no match is found, 'not_found' is returned as the value.

e.g.

  find_payment_method:
  - visa VISA
  - mastercard MCARD
  - amex AMEX
  - cash CASH

The code is as follows and is added to invoice_template.py

        for k, v in self['fields'].items():
            if k.startswith('static_'):
                logger.debug("field=%s | static value=%s", k, v)
                output[k.replace('static_', '')] = v
            # New code for <find_> option starts here
            # Check if the field name begins with 'find_' and we have a list of options
            elif k.startswith('find_') and isinstance(v, list):
                logger.debug("field=%s | find value=%s", k, v)
                # Reset for every field so an earlier match cannot leak through
                find_match = False

                # Loop through the options in priority order
                for v_option in v:
                    # Split at the first space only, so the regex itself may contain spaces
                    find_type, find_regex = v_option.split(' ', 1)
                    logger.debug("find_type=%s | find_regex=%s", find_type, find_regex)

                    if re.search(find_regex, optimized_str):
                        output[k.replace('find_', '', 1)] = find_type
                        logger.debug("Found=%s", find_type)
                        find_match = True
                        break

                if not find_match:
                    output[k.replace('find_', '', 1)] = 'not_found'
            # New code for <find_> option ends here

            else:
                logger.debug("field=%s | regexp=%s", k, v)
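
The first-match lookup described in this comment can be tried outside the template class. The following is a minimal standalone sketch, not the project's code: the names find_first_option and PAYMENT_OPTIONS are hypothetical, and it assumes each option is a string of the form "<option_static> <option_regex>".

```python
import re


def find_first_option(options, text, default="not_found"):
    """Return the static name of the first option whose regex matches text."""
    for option in options:
        # Each option is "<static_name> <regex>"; split at the first space only.
        name, pattern = option.split(" ", 1)
        if re.search(pattern, text):
            return name
    return default


# Hypothetical option list, in priority order (mirrors the template example).
PAYMENT_OPTIONS = ["visa VISA", "mastercard MCARD", "amex AMEX", "cash CASH"]

print(find_first_option(PAYMENT_OPTIONS, "Net Total 27.10\nVISA: 27.10"))  # visa
print(find_first_option(PAYMENT_OPTIONS, "Net Total 27.10"))  # not_found
```

Returning from inside the loop removes the need for a separate flag such as find_match.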

Feel free to add this to the code base if you think it is useful.

Cheers,

Michael

Collaborator

duskybomb commented Jun 8, 2018

Sometimes a bill is split across different payment methods, for example half in cash and half by card.
We can add a new payment_method option inside fields in the template:

payment_method:
    CASH: CASH:(\d+\.\d+)
    VISA: VISA:(\d+\.\d+)
    MCARD: MCARD:(\d+\.\d+)

Then there is a slight change to extract() in invoice_template. The additions are marked with # *

for k, v in self['fields'].items():
    if k.startswith('static_'):
        logger.debug("field=%s | static value=%s", k, v)
        output[k.replace('static_', '')] = v
    else:
        logger.debug("field=%s | regexp=%s", k, v)

        sum_field = False
        if k.startswith('sum_amount') and type(v) is list:
            k = k[4:]  # remove 'sum_' prefix
            sum_field = True
        # Fields can have multiple expressions
        if type(v) is list:
            res_find = []
            for v_option in v:
                res_val = re.findall(v_option, optimized_str)
                if res_val:
                    if sum_field:
                        res_find += res_val
                    else:
                        res_find = res_val
                        break
        elif k == 'payment_method':  # *
            # v maps each payment type to its regex; keep every type that matches
            res_find = []  # *
            for payment_type, reg_ex in v.items():  # *
                if re.findall(reg_ex, optimized_str):  # *
                    res_find.append(payment_type)  # *
        else:
            res_find = re.findall(v, optimized_str)
        if res_find:
            logger.debug("res_find=%s", res_find)
            if k.startswith('date') or k.endswith('date'):
                output[k] = self.parse_date(res_find[0])
                if not output[k]:
                    logger.error(
                        "Date parsing failed on date '%s'", res_find[0])
                    return None
            elif k.startswith('amount'):
                if sum_field:
                    output[k] = 0
                    for amount_to_parse in res_find:
                        output[k] += self.parse_number(amount_to_parse)
                else:
                    output[k] = self.parse_number(res_find[0])
            elif k == 'payment_method':  # *
                # re.findall always returns a list, so test the field name
                # rather than the type; otherwise every field would land here
                output[k] = res_find  # *
            else:
                output[k] = res_find[0]
        else:
            logger.warning("regexp for field %s didn't match", k)
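
The split-payment idea can be checked in isolation. This is a minimal sketch under stated assumptions: find_payment_methods and PATTERNS are hypothetical names, the sample text is invented, and the mapping mirrors the payment_method option proposed in this comment.

```python
import re


def find_payment_methods(patterns, text):
    """Return the names of all payment types whose regex matches text."""
    return [name for name, pattern in patterns.items() if re.search(pattern, text)]


# Hypothetical mapping of payment type -> regex, as in the proposed option.
PATTERNS = {
    "CASH": r"CASH:(\d+\.\d+)",
    "VISA": r"VISA:(\d+\.\d+)",
    "MCARD": r"MCARD:(\d+\.\d+)",
}

# Invented receipt text for a bill paid partly in cash and partly by card.
split_bill = "Total:27.10\nCASH:10.00\nVISA:17.10"
print(find_payment_methods(PATTERNS, split_bill))  # ['CASH', 'VISA']
```

The result follows the order of the mapping, not the order in which the methods appear on the receipt.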

What do you think @m3nu?

Collaborator

m3nu commented Jun 8, 2018

It's already possible to define custom fields. There is no need to hardcode it, like you did. If you do it like this, all your field settings will be in Python code eventually. Field settings need to be in the vendor template only.

Collaborator

duskybomb commented Jun 8, 2018

In that case this should work:

payment_method:
    - (CASH:\d+\.\d+)
    - (VISA:\d+\.\d+)
    - (MCARD:\d+\.\d+)

I don't think we were handling lists properly. In the changes below:
# + : marks an addition
# - : marks a deletion

for k, v in self['fields'].items():
    if k.startswith('static_'):
        logger.debug("field=%s | static value=%s", k, v)
        output[k.replace('static_', '')] = v
    else:
        logger.debug("field=%s | regexp=%s", k, v)

        sum_field = False
        if k.startswith('sum_amount') and type(v) is list:
            k = k[4:]  # remove 'sum_' prefix
            sum_field = True 
        # Fields can have multiple expressions
        if type(v) is list:
            res_find = []
            for v_option in v:
                res_val = re.findall(v_option, optimized_str)
                if res_val:
                    if sum_field:
                        res_find += res_val
                    else:
                        res_find.extend(res_val)  # +
                        # - break
        else:
            res_find = re.findall(v, optimized_str)
        if res_find:
            logger.debug("res_find=%s", res_find)
            if k.startswith('date') or k.endswith('date'):
                output[k] = self.parse_date(res_find[0])
                if not output[k]:
                    logger.error(
                        "Date parsing failed on date '%s'", res_find[0])
                    return None
            elif k.startswith('amount'):
                if sum_field:
                    output[k] = 0
                    for amount_to_parse in res_find:
                        output[k] += self.parse_number(amount_to_parse)
                else:
                    output[k] = self.parse_number(res_find[0])
            else:
                if len(res_find) == 1:  # +
                    output[k] = res_find[0]  # +
                else:  # +
                    output[k] = res_find  # +
        else:
            logger.warning("regexp for field %s didn't match", k)
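
The list handling proposed here (collect the matches of every expression, then unwrap a single result) can be sketched on its own. The names match_all and PATTERNS are hypothetical and the sample strings are invented; this is an illustration of the idea, not the merged code.

```python
import re


def match_all(patterns, text):
    """Collect matches from every pattern; unwrap the result if there is only one."""
    found = []
    for pattern in patterns:
        found.extend(re.findall(pattern, text))
    if not found:
        return None
    return found[0] if len(found) == 1 else found


PATTERNS = [r"(CASH:\d+\.\d+)", r"(VISA:\d+\.\d+)", r"(MCARD:\d+\.\d+)"]

print(match_all(PATTERNS, "VISA:27.10"))  # VISA:27.10
print(match_all(PATTERNS, "CASH:10.00 VISA:17.10"))  # ['CASH:10.00', 'VISA:17.10']
```

One consequence of this design is that the value is a string for a single match and a list otherwise, so callers have to handle both types.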

Collaborator

m3nu commented Jun 8, 2018

I think OP is only interested in the name of the payment method, like "VISA", so the amount doesn't need to be extracted here; we have that elsewhere. A pattern such as (VISA):\d+\.\d+ captures just the name.
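
A short illustration of how the position of the capturing group changes what re.findall returns. The sample line is invented and the variable names are hypothetical.

```python
import re

line = "VISA:27.10"

# Group around the whole expression: the amount is part of the result.
whole_match = re.findall(r"(VISA:\d+\.\d+)", line)
print(whole_match)  # ['VISA:27.10']

# Group around the name only: the amount must still be present for the
# pattern to match, but it is not returned.
name_only = re.findall(r"(VISA):\d+\.\d+", line)
print(name_only)  # ['VISA']
```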

Can you make a PR and test case for this addition?

Collaborator

duskybomb commented Jun 8, 2018

Working on the PR, but for the test case we would need a sample invoice (preferably in PDF format).

Collaborator

m3nu commented Jun 8, 2018

Merged. It would be interesting to see if Tesseract is good enough to process the invoice posted by OP.

Collaborator

duskybomb commented Jun 12, 2018

I was able to get this file to work with Tesseract, but I needed to edit the image (cropped it and adjusted brightness and contrast). Plus I used Tesseract 4.0 (alpha).
[Attached image: inv]
Here is the tesseract-ocr output text:

 

Outlet: Pepper Lunch - Jurong Point
Stall ID: 01 Pepper Lunch Restaurant
Machine No: 01

Cashier: Arpan

Date: 6/6/2018 12:24:44 PM

Receipt No: 000035 Tag No: 24

1, (8M) Salmon & Chicken 1X 7.90 7.90
Pepper Rice

~ (SM) Salmon & 1
Chicken Pepper Rice
- Iced Lemon Tea |

2. a BBQ Beef Pepper 1X 7,90 7.90
ce
- (SM) BBQ Beef Pepper 1
Rice
- Pepsi j
3. (WOL) Beef Pepper Rice 1X 11.90 11.90
- (WDL) Beef Pepper 1

Rice

- Iced Lemon Tea 1

- Miso Soup 1
“Total: ttStS~S 27.10
Sub Total : 27,70
Rounded Amt : 0.00
Net Total : Tl
Inclusive GST 7% : 1.81
~ VISA: 27.10

*xThank Your*
GST RegNo: 200408968W
Papper Lunch Jurong Point Telephone:
6265 7425
63 Jurong Hest Central 3
#B1-62/63 Jurong Point Shopping Centre
Singapore 648331
Pepper Lunch Website:
http://www. pepper lunch. com. so/
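
In the OCR text above, the payment line reads "~ VISA: 27.10", with a space after the colon, so a strict pattern such as VISA:\d+\.\d+ would not match it. The sketch below shows a more tolerant pattern. It assumes the text reaches the regex with its spaces intact, and the names PAYMENT_LINE and payment_methods_in are hypothetical.

```python
import re

# Tolerates optional spaces around the colon and a comma misread for the
# decimal point (the OCR output contains "7,90" for 7.90).
PAYMENT_LINE = re.compile(r"\b(VISA|CASH|MCARD)\s*:\s*\d+[.,]\d+")

# Three lines copied from the OCR output above.
ocr_text = """Net Total : Tl
Inclusive GST 7% : 1.81
~ VISA: 27.10
"""


def payment_methods_in(text):
    """Return the payment method names found on payment lines in text."""
    return PAYMENT_LINE.findall(text)


print(payment_methods_in(ocr_text))  # ['VISA']
```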

mahendra047 commented Jul 4, 2018

How can I use it to extract the retailer name, date, total, and the items with prices from receipts? I want to preprocess a lot of supermarket receipts before passing them to Tesseract 4.0.

Can we use invoice2data on receipt images alone to extract items and other fields? Any suggestions?
Thanks in advance.

Collaborator

m3nu commented Jul 5, 2018

You can define your own fields and add new templates. The process is roughly:

  1. Find the template folder in this repo or locally.
  2. Choose a similar template as base and look into the template docs.
  3. Make your own new template based on debugging and looking at OCR output.
  4. Add your own template folder to the templates to be loaded.
  5. Done.
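
For step 3, it can help to try draft regexes against the OCR output before putting them into a template. The following is a minimal sketch that uses only Python's re module and does not call the library. The names check_fields and DRAFT_FIELDS, the field names, and the patterns are all hypothetical; the sample lines come from the OCR output earlier in this thread.

```python
import re


def check_fields(fields, text):
    """Map each field name to its first captured match in text, or None."""
    results = {}
    for name, pattern in fields.items():
        match = re.search(pattern, text)
        results[name] = match.group(1) if match else None
    return results


# Hypothetical draft patterns; each has exactly one capturing group.
DRAFT_FIELDS = {
    "date": r"Date:\s*(\d+/\d+/\d+)",
    "invoice_number": r"Receipt No:\s*(\d+)",
    "amount": r"VISA:\s*(\d+\.\d+)",
    "payment_method": r"(VISA|CASH|MCARD):",
}

ocr_text = "Date: 6/6/2018 12:24:44 PM\nReceipt No: 000035 Tag No: 24\n~ VISA: 27.10\n"

for name, value in check_fields(DRAFT_FIELDS, ocr_text).items():
    print(name, "->", value)
```

A field that maps to None points at a pattern that needs adjusting for this OCR output.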

mahendra047 commented Jul 5, 2018

m3nu closed this Aug 9, 2018
