KIE annotation tool #434

Closed
VtlNmnk opened this issue Aug 13, 2021 · 20 comments

@VtlNmnk

VtlNmnk commented Aug 13, 2021

Hey! I really appreciate your excellent work!
I want to add some of my own examples of annotated receipts to the wildreceipt dataset to train the model with my dataset. Is there any annotation tool available? Or is there a converter from other formats?

@amitbcp
Contributor

amitbcp commented Aug 14, 2021

@VtlNmnk I used the free version of Label Studio for annotation. The dump from Label Studio is very easy to convert to SDMG-R format.

Use the JSON-MIN export from Label Studio.

@pushpalatha1405

Hi Amit,
Could you please help me with the steps to convert to SDMG-R format? I am using my custom document dataset and want to annotate the required key-value pairs and convert them to the format supported by SDMG-R.

@amitbcp
Contributor

amitbcp commented Aug 24, 2021

@pushpalatha1405 we discussed this over email. If you feel comfortable, you can close the issue

@VtlNmnk
Author

VtlNmnk commented Aug 25, 2021

@amitbcp, if you post here, for everyone, what to do with the data exported from the annotation tool, similar questions will not come up in the future.
As for me, I haven't finished labeling my data yet and haven't tried converting it yet.

@pushpalatha1405

@pushpalatha1405 we discussed this over email. If you feel comfortable, you can close the issue

OK Amit, I am closing the issue.

@amitbcp
Contributor

amitbcp commented Aug 25, 2021

@VtlNmnk once the data is dumped in JSON-MIN format from Label Studio, a simple script can convert it to SDMG-R format. The Label Studio dump has all the information needed for the conversion.

The most important step is configuring the labelling interface in Label Studio. A sample config that works well for conversion to SDMG-R is:

<View>
  <Image name="image" value="$ocr" zoomControl="true" rotateControl="true" zoom="true"/>
  <RectangleLabels name="label" toName="image" strokeWidth="2">
    <Label value="class_1" background="Aqua"/>
    <Label value="class_2" background="#D4380D"/>
  </RectangleLabels>
  <View visibleWhen="region-selected" style="width: 100%; display: block">
    <Header value="Write transcription:"/>
    <TextArea name="transcription" toName="image" editable="true" perRegion="true" required="true" maxSubmissions="1" rows="5" strokeWidth="2"/>
  </View>
</View>

Here we can annotate the OCR text and specify the labels. To add more classes, just repeat the <Label value="..."/> element for each class in your dataset.
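
For a long class list, the repeated <Label> elements could also be generated rather than typed by hand. The following is only a minimal sketch: the helper name `label_elements` and the colour palette are made up for illustration, and the output still has to be pasted into the <RectangleLabels> block of the config above.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical palette; colours are reused when there are more classes than colours.
PALETTE = ("#FFA39E", "#D4380D", "#FFC069", "#AD8B00")

def label_elements(class_names, palette=PALETTE, indent="    "):
    """Return one <Label .../> line per class name, with XML-escaped attribute values."""
    lines = []
    for i, name in enumerate(class_names):
        color = palette[i % len(palette)]
        lines.append(
            f"{indent}<Label value={quoteattr(name)} background={quoteattr(color)}/>"
        )
    return "\n".join(lines)

# Example class names taken from the WildReceipt-style labels discussed in this thread.
print(label_elements(["Ignore", "Store_name_value", "Store_name_key"]))
```

`quoteattr` takes care of quoting and escaping, so a class name containing `&` or `<` does not break the config.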

@pushpalatha1405

@amitbcp could you please share one labelled object from Label Studio (the JSON for any single annotated field) and its equivalent in SDMG-R format? It would be really helpful.

@VtlNmnk
Author

VtlNmnk commented Aug 25, 2021

yes, I figured out how to do this part. I use these settings to supplement the "wild receipts" dataset.

<View>
  <Image name="image" value="$ocr" zoom="true"/>
  <Labels name="label" toName="image">
    <Label value="Ignore" background="#FFA39E"/>
    <Label value="Store_name_value" background="#D4380D"/>
    <Label value="Store_name_key" background="#FFC069"/>
    <Label value="Store_addr_value" background="#AD8B00"/>
    <Label value="Store_addr_key" background="#D3F261"/>
    <Label value="Tel_value" background="#389E0D"/>
    <Label value="Tel_key" background="#5CDBD3"/>
    <Label value="Date_value" background="#096DD9"/>
    <Label value="Date_key" background="#ADC6FF"/>
    <Label value="Time_value" background="#9254DE"/>
    <Label value="Time_key" background="#F759AB"/>
    <Label value="Prod_item_value" background="#FFA39E"/>
    <Label value="Prod_item_key" background="#D4380D"/>
    <Label value="Prod_quantity_value" background="#FFC069"/>
    <Label value="Prod_quantity_key" background="#AD8B00"/>
    <Label value="Prod_price_value" background="#D3F261"/>
    <Label value="Prod_price_key" background="#389E0D"/>
    <Label value="Subtotal_value" background="#5CDBD3"/>
    <Label value="Subtotal_key" background="#096DD9"/>
    <Label value="Tax_value" background="#ADC6FF"/>
    <Label value="Tax_key" background="#9254DE"/>
    <Label value="Tips_value" background="#F759AB"/>
    <Label value="Tips_key" background="#FFA39E"/>
    <Label value="Total_value" background="#D4380D"/>
    <Label value="Total_key" background="#FFC069"/>
    <Label value="Others" background="#AD8B00"/>
  </Labels>
  <Rectangle name="bbox" toName="image" strokeWidth="3"/>
  <Polygon name="poly" toName="image" strokeWidth="3"/>
  <TextArea name="transcription" toName="image" editable="true" perRegion="true" required="true" maxSubmissions="1" rows="5" placeholder="Recognized Text" displayMode="region-list"/>
</View>

But it is not yet clear to me which script to use for conversion from Json-min to SDMG-R format.
It must be one of these converters, right?
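
For the label config above, the label names eventually have to be mapped to integer class indices, because that is how the labels are stored in the converted annotations. A minimal sketch of one way to build that mapping, assuming the indices simply follow the order of the <Label> elements in the config (an assumption; check it against the class list of your dataset). The helper name `class_dict_from_config` is made up, and the config is shortened to three labels here.

```python
import xml.etree.ElementTree as ET

def class_dict_from_config(config_xml):
    """Map each <Label value="..."> to its 0-based position in the config."""
    root = ET.fromstring(config_xml)
    # iter("Label") matches only <Label> elements, not the enclosing <Labels>.
    names = [element.get("value") for element in root.iter("Label")]
    return {name: index for index, name in enumerate(names)}

# Shortened copy of the labelling config, for illustration only.
config = """
<View>
  <Image name="image" value="$ocr" zoom="true"/>
  <Labels name="label" toName="image">
    <Label value="Ignore" background="#FFA39E"/>
    <Label value="Store_name_value" background="#D4380D"/>
    <Label value="Store_name_key" background="#FFC069"/>
  </Labels>
</View>
"""

class_dict = class_dict_from_config(config)
print(class_dict)  # {'Ignore': 0, 'Store_name_value': 1, 'Store_name_key': 2}
```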

@amitbcp
Contributor

amitbcp commented Aug 25, 2021

@VtlNmnk no, the script is a custom one that is not provided here. Let me share it here tomorrow.

@amitbcp
Contributor

amitbcp commented Aug 27, 2021

@VtlNmnk Since Label Studio stores coordinates on a different scale than SDMG-R, we need to convert those too. Here is the code I used:

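import json

# Maps each Label Studio label name to the integer class index that SDMG-R expects.
# Example values only: replace them with the labels from your own labelling config.
class_dict = {"class_1": 0, "class_2": 1}

with open("manual_ls.json", "r") as f:  # Label Studio JSON-MIN dump
    ls = json.load(f)

global_tags = []
for dl in ls:
    # The image name is whatever follows the last "=" in the task's "ocr" URL
    filename = "./annotate_dl/" + dl["ocr"].split("=")[-1]
    labels = dl['label']
    transcriptions = dl['transcription']
    annotations = []

    # Pairs each box with its transcription by position in the two lists
    for label, text in zip(labels, transcriptions):
        tag = label['rectanglelabels'][0]

        # Normalise the transcription: strip spaces and lowercase
        ocr = text['text'][0]
        ocr = ocr.replace(" ", "").lower()

        original_width = label['original_width']
        original_height = label['original_height']
        x, y = label['x'], label['y']
        width, height = label['width'], label['height']

        # Label Studio stores coordinates as percentages of the image size; convert to pixels
        x0 = (x * original_width) / 100
        y0 = (y * original_height) / 100
        w = (width * original_width) / 100
        h = (height * original_height) / 100

        x1 = x0 + w
        y1 = y0 + h

        # Four corners, clockwise from top-left
        box = [x0, y0, x1, y0, x1, y1, x0, y1]

        annt_dict = {'box': box, 'text': ocr, 'label': class_dict[tag]}  # label converted to int index
        annotations.append(annt_dict)
        global_tags.append((class_dict[tag], tag))  # to calculate statistics later

    # One JSON object per line, one line per image.
    # original_width/original_height come from the last region, so each image needs at least one.
    manual_dl = {"file_name": filename, "height": original_height, "width": original_width, "annotations": annotations}
    with open('./annotate_dl/manual_ls_exp_syn_v1.txt', 'a') as convert_file:
        convert_file.write(json.dumps(manual_dl))
        convert_file.write("\n")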
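
To make the conversion concrete, here is a small worked example for a single region. It is only a sketch: the record below is made up and contains just the fields that the script above reads (a real JSON-MIN dump may carry more), the `class_dict` mapping is hypothetical, and `region_to_box` is an illustrative helper name.

```python
import json

def region_to_box(region):
    """Convert a percent-based Label Studio rectangle to an 8-value pixel box."""
    ow, oh = region["original_width"], region["original_height"]
    x0 = region["x"] * ow / 100
    y0 = region["y"] * oh / 100
    x1 = x0 + region["width"] * ow / 100
    y1 = y0 + region["height"] * oh / 100
    # Corners in the same order as the script: top-left, top-right, bottom-right, bottom-left
    return [x0, y0, x1, y0, x1, y1, x0, y1]

# Made-up record with one labelled region and its transcription.
record = {
    "ocr": "/data/local-files/?d=receipt_001.jpg",
    "label": [{
        "x": 10, "y": 5, "width": 20, "height": 2.5,
        "original_width": 1000, "original_height": 2000,
        "rectanglelabels": ["class_1"],
    }],
    "transcription": [{"text": ["Total"]}],
}
class_dict = {"class_1": 0, "class_2": 1}  # hypothetical label-to-index mapping

region, text = record["label"][0], record["transcription"][0]
line = {
    "file_name": "./annotate_dl/" + record["ocr"].split("=")[-1],
    "height": region["original_height"],
    "width": region["original_width"],
    "annotations": [{
        "box": region_to_box(region),
        # Same normalisation as the script: strip spaces, lowercase
        "text": text["text"][0].replace(" ", "").lower(),
        "label": class_dict[region["rectanglelabels"][0]],
    }],
}
print(json.dumps(line))
```

Here a box at x = 10%, y = 5% with width 20% and height 2.5% on a 1000 x 2000 image becomes the pixel box [100.0, 100.0, 300.0, 100.0, 300.0, 150.0, 100.0, 150.0].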

@pushpalatha1405

Thanks amith for the script.

@amitbcp
Contributor

amitbcp commented Nov 2, 2021

@pushpalatha1405 @VtlNmnk did it work for you? If so, can we close the issue?

@pushpalatha1405

Thanks very much, Amit, for the initial script you sent for creating the annotation format accepted by the MMOCR model. With it I was able to create my custom dataset and custom model and use MMOCR fully in our project.

@pushpalatha1405

You can close the issue.

@gaotongxiao
Collaborator

Hi @amitbcp @VtlNmnk @pushpalatha1405, thanks for the great discussion! Would you summarize this discussion into a tutorial? We are planning a tutorial section in our documentation, similar to what MMDetection did, to improve the developer experience. If you make a PR, your contribution will be acknowledged and will help more people. :)

@pushpalatha1405

Sure, Thong, I will make time and contribute a PR.

@gaotongxiao
Collaborator

@pushpalatha1405 Thanks! Just create a file named make_dataset.md under docs/ in the PR and we will help you organize it.

@VtlNmnk
Author

VtlNmnk commented Nov 2, 2021

@pushpalatha1405 @VtlNmnk did it work for you? If so, can we close the issue?

yes, Label Studio and script are working. We can close the issue.

VtlNmnk closed this as completed on Nov 2, 2021

@pushpalatha1405

Hi Thong,
I have created a file named make_dataset.md (and opened a PR as well). I still need to add the contents. Are there specific topics I should cover and organize in the file? I will add the contents by the end of this week.

regards,
Pushpalatha M

@nabilragab

@VtlNmnk Since Label Studio stores coordinates on a different scale than SDMG-R, we need to convert those too. Here is the code I used:

Why was this not included in a pull request?
