Internship_Report_Roopa #47

roopavsm · 2022-08-02T14:37:23Z

Hi, this is a report on my internship at CEVOpen.

roopavsm · 2022-08-02T14:38:25Z

I am V.S.M. Roopa, a third-year student of BSc Mathematics, Statistics and Computer Science at St. Francis College for Women, Osmania University, Hyderabad. I was a Software Development and Alpha-Testing intern at CEVOpen for 6 months (from 3rd February 2022 to 3rd August 2022).
I worked on the Terpene Classification, Dictionaries, Pyamiimage, and SemanticClimate projects, supervised by Dr. Peter Murray-Rust and Dr. Gitanjali Yadav. My main responsibilities were alpha-testing and developing Pyamiimage and SemanticClimate, and preparing a Project Plan for 2022.

Terpene Classification is a phytochemistry project on classifying pathway images.
Dictionaries collects words related to terpenes and terpene synthases.
Pyamiimage is an image analysis tool written in Python.
SemanticClimate converts IPCC reports into semantic form.

roopavsm Aug 9, 2022
Author

1.0 Terpene Classification
- 1.1 Introduction
- 1.2 Image annotation
- 1.3 Statistical analysis of inter-annotator matrix
- 1.4 Collecting more images and testing the software
2.0 Creating Dictionaries and Alpha-testing Pyamiimage
- 2.1 Introduction
- 2.2 Collecting words from image analysis tools
- 2.3 Observations
- 2.4 Points to work on
- 2.5 Cleaning data manually
- 2.6 Fuzzywuzzy matching
- 2.7 Code for dictionaries
3.0 Project Plan 2022
4.0 SemanticClimate
- 4.1 Introduction
- 4.2 Conversion from PDF to TXT
- 4.3 Extracting Images
- 4.4 Text from images
- 4.5 Abbreviations
- 4.6 Raw text vs extracted text
- 4.7 HACKATHON PIAGET

roopavsm · 2022-08-02T14:53:02Z

1.0 Terpene Classification

1.1 Introduction

A pathway is a chain of events leading to a product. It contains multiple arrows between the elements. A metabolic pathway contains biochemical molecules.

Terpenes are found largely as constituents of essential oils. They are mostly hydrocarbons. A metabolic pathway image that contains terpenes is called a terpene pathway.
Rules were drafted to classify images as pathways, metabolic pathways, pre-terpene pathways and terpene pathways.

1.2 Image annotation

Alex Pico has been collecting information about terpene pathways, and he built an app with pathway images for us. To assess how well the terpene classification rules were understood, 8 volunteers (I was one of them) annotated 30 terpene pathway images from Alex Pico's app.

1.3 Statistical analysis of inter-annotator matrix

The annotations show how clearly the volunteers understood the terpene concept and whether the rules need to change.

The following is Ananya's project:

It is crucial to decide whether the rules are adequate for carrying out the later procedures. Hence, we calculate the relation between the volunteers' answers.
To select the method that best captures the relation between the data entries, we considered three methods using some example data.
The annotations were analysed using:
a) Correlation Coefficient - Computes the pairwise correlation of columns, excluding null values.

b) Confusion Matrix - Involves calculating the true negative, false positive, false negative and true positive values for the matrix.

c) Cohen's Kappa - A score that expresses the level of agreement between two annotators on a classification problem.
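
To make methods (b) and (c) concrete, here is a minimal sketch in plain Python. It is not the code used in the project: the function names and the two example annotation lists are hypothetical, and 1/0 simply stand for "terpene pathway" / "not a terpene pathway".

```python
from collections import Counter


def confusion_counts(reference, other, positive=1):
    """Count TP, FP, FN and TN, treating `reference` as the ground truth."""
    pairs = list(zip(reference, other))
    return {
        "tp": sum(1 for r, o in pairs if r == positive and o == positive),
        "fp": sum(1 for r, o in pairs if r != positive and o == positive),
        "fn": sum(1 for r, o in pairs if r == positive and o != positive),
        "tn": sum(1 for r, o in pairs if r != positive and o != positive),
    }


def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same, non-empty set")
    n = len(a)
    # Observed agreement: fraction of images given the same label by both.
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    # Expected agreement if both annotators labelled at random
    # with their own label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical label
    return (observed - expected) / (1 - expected)


# Hypothetical answers of two volunteers on eight images.
annotator_1 = [1, 1, 0, 1, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_counts(annotator_1, annotator_2))
print(cohen_kappa(annotator_1, annotator_2))
```

A kappa of 1 means complete agreement and a kappa of 0 means agreement no better than chance, which is why it is more informative than the raw fraction of matching answers.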

1.4 Collecting more images and testing the software

The purpose of this step is to create a corpus of 'terpene synthase' images, run the pyamiimage tool over the images to extract text, and store the output for creating a dictionary.

I collected additional terpene synthase images from PubMed Central using the following steps and added them to pyamiimage in the form of a Ctree (https://github.com/petermr/pyamiimage/tree/main/PMC).

I went to the homepage of PubMed Central, run by the National Center for Biotechnology Information.
I typed 'terpene synthase' in the search box, which leads to a results page.

On the right side of the results page, there is a 'PMC Images search' section with an option to 'see more' images. I clicked on it.

Finally, I looked exclusively for terpene synthase diagrams and saved them.
For example:

roopavsm · 2022-08-04T12:32:10Z

2.0 Creating Dictionaries and Alpha-testing Pyamiimage

2.1 Introduction

In the context of this project’s requirements, dictionaries are collections of words related to terpenes and terpene synthases. For example: Hemiterpenes, Isomerase, etc.

Unlike other dictionaries, these contain the frequency of the words as well as their synonyms.

The words are extracted from images using the 'optical character recognition' tools Tesseract and EasyOCR. The words are then looked up in Wikidata for their element code, description, and related element codes. This information is stored in the dictionaries.
I created a GitHub folder in pyamiimage for dictionaries (https://github.com/petermr/pyamiimage/tree/main/dictionaries).

The main purpose of alpha-testing was initially to check the installation and execution processes. This involved testing a sample of images for the accuracy of their output with the Tesseract and EasyOCR tools, checking the command line functionality, giving feedback on the output, and posting issues on GitHub when problems came up.

2.2 Collecting words from image analysis tool

The pyamiimage tool takes an image file as input and writes the extracted text to an output (.txt) file.
Both optical character recognition tools have been integrated into pyamiimage. We extract words from the terpene pathway images using one of them at a time.

To use EasyOCR, type the following command in the 'command prompt':

pyamiimage --infile abc.png --outfile abc.txt -t --ocr_wrapper easyocr

[.png is used here only as an example; use the file extension that matches your input file]

Example: For an image stored as 'img.png':

pyamiimage --infile img.png --outfile img_easy.txt -t --ocr_wrapper easyocr

For 'tesseract', here is the command:

pyamiimage --infile abc.png --outfile abc.txt -t --ocr_wrapper tesseract

Example: For an image stored as 'img.png':

pyamiimage --infile img.png --outfile img_tess.txt -t --ocr_wrapper tesseract
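
When many images need to be processed, the two commands above can be wrapped in a small script. The following is only a sketch: it assumes pyamiimage is installed and available on the command line, it uses only the flags shown above, and the function names, the '*.png' pattern and the '_easy' / '_tess' output naming are illustrative choices rather than part of pyamiimage.

```python
import subprocess
from pathlib import Path

# The two OCR wrappers named in this section.
OCR_WRAPPERS = ("easyocr", "tesseract")


def build_command(image_path, ocr_wrapper):
    """Build the pyamiimage command line for one image and one OCR wrapper."""
    if ocr_wrapper not in OCR_WRAPPERS:
        raise ValueError(f"unknown OCR wrapper: {ocr_wrapper}")
    image = Path(image_path)
    # Illustrative naming: img.png -> img_easy.txt or img_tess.txt.
    suffix = "easy" if ocr_wrapper == "easyocr" else "tess"
    outfile = image.with_name(f"{image.stem}_{suffix}.txt")
    return [
        "pyamiimage",
        "--infile", str(image),
        "--outfile", str(outfile),
        "-t",
        "--ocr_wrapper", ocr_wrapper,
    ]


def run_ocr(image_dir, ocr_wrapper="easyocr"):
    """Run pyamiimage on every .png file in a directory (requires pyamiimage)."""
    for image in sorted(Path(image_dir).glob("*.png")):
        subprocess.run(build_command(image, ocr_wrapper), check=True)


# Show the command that would be run for a single image.
print(" ".join(build_command("img.png", "tesseract")))
```

Calling run_ocr on a folder once per wrapper gives one EasyOCR file and one Tesseract file per image, which makes the two outputs easy to compare side by side.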

2.3 Observations

EasyOCR gives cleaner output than Tesseract. EasyOCR has only a slight chance of giving incomplete output, whereas for Tesseract the chance is high.
EasyOCR reads phrases, whereas Tesseract reads individual words.
Tesseract gives more errors than EasyOCR.

2.4 Points to work on

If EasyOCR could read Greek letters, it would help us get the exact output from the images.
Subscripts also need to be processed in EasyOCR.

I made a .csv file of the text extracted with EasyOCR and classified it into Compounds and Enzymes, which will be helpful in creating a dictionary (https://github.com/petermr/CEVOpen/blob/master/sample_eocr_classification.csv).

2.5 Cleaning data manually

The words acquired from the tool had a few errors, so I cleaned the data manually (https://github.com/petermr/pyamiimage/tree/main/cleaned_phrases).

The advantage of manual curation is that a misspelt word can be replaced with the correct spelling. For wrongly recognised words, however, there may be more than one possible replacement, which creates confusion.

For example: If 'ydroxy-3-me' is the output, it can be replaced with either 'hydroxy-3-methylbutyrate' or 'hydroxy-3-methylglutaryl'.

So, I used the Fuzzywuzzy library to choose the most suitable word.

2.6 Fuzzywuzzy matching

Fuzzywuzzy is a string-matching library in Python. The process involves manually searching the internet for candidate spellings and comparing them with the corrupted word obtained in the image analysis output. The candidate with the highest similarity score can then be chosen to replace the word manually.
(https://github.com/petermr/pyamiimage/blob/main/dictionaries/fuzzywuzzy.ipynb)

Example:

We replace 'ydroxy-3-met' with 'hydroxy-3-methylbenzene'.

Advantages of using this library:

Most likely to give the right word.
Simple and easy process after obtaining the word (after manual search).

Limitations of using this library:

A word with a high score may still not be the actual word.
The process is time-consuming.
Searching for the words manually is compulsory.
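
The scoring step of this section can be sketched as follows. To keep the example free of extra installs, it uses difflib from the Python standard library as a stand-in for Fuzzywuzzy; the scores are on the same 0 to 100 scale but are not guaranteed to match Fuzzywuzzy's exactly. The function names are hypothetical, and the candidate spellings are the ones from section 2.5.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity score from 0 to 100 between two strings (case-insensitive)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def rank_candidates(corrupted, candidates):
    """Return (candidate, score) pairs, best match first."""
    scored = [(candidate, similarity(corrupted, candidate)) for candidate in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Corrupted OCR output and the manually found candidate spellings.
corrupted_word = "ydroxy-3-me"
candidate_words = ["hydroxy-3-methylbutyrate", "hydroxy-3-methylglutaryl"]

for candidate, score in rank_candidates(corrupted_word, candidate_words):
    print(candidate, score)
```

As noted in the limitations, the top-ranked candidate is only a suggestion: when two candidates share the same prefix, their scores can be very close and the final choice still needs a human check.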

2.7 Code for Dictionaries

I wrote Python code to create the dictionaries.
The code also calculates the frequency of each phrase in a list.
(https://github.com/petermr/pyamiimage/blob/main/dictionaries/dict_code.py)
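
For readers who do not want to open the linked file, here is a minimal sketch of the frequency calculation. It is not a copy of dict_code.py: the function name, the lower-casing step and the sample phrases are assumptions made for illustration.

```python
from collections import Counter


def phrase_frequencies(phrases):
    """Map each cleaned phrase to the number of times it occurs in the list."""
    # Normalise case and whitespace so 'Isomerase' and 'isomerase ' count together.
    cleaned = [phrase.strip().lower() for phrase in phrases if phrase.strip()]
    return dict(Counter(cleaned))


# Hypothetical phrases as they might come out of the OCR step.
sample_phrases = ["Isomerase", "hemiterpenes", "isomerase ", " "]
print(phrase_frequencies(sample_phrases))
```

Synonyms and Wikidata information would be added to each entry in a later step; this sketch covers only the counting.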

roopavsm · 2022-08-04T12:58:34Z

3.0 Project Plan 2022

I made a project plan of events at CEVOpen for the year 2022, which includes the data components, workflow and goals (https://github.com/petermr/petermr/discussions/28).

roopavsm · 2022-08-04T13:42:24Z

4.0 SemanticClimate

4.1 Introduction

The project mainly focuses on converting the reports of the United Nations’ Intergovernmental Panel on Climate Change (IPCC) into semantic form, to make them more readable for citizens from any educational background.
I chose to work with Chapter 16 of the IPCC/ar6/wg3 report, titled “Innovation, Technology Development and Transfer”.

4.2 Conversion from PDF to TXT

The first task was to convert the PDF file of the chapter to a TXT file, in order to finally convert it into an HTML file. I initially tried the pdfminer package, which did not work for me. So, I installed the pdfminer.six package, which helped me convert my chapter to a TXT file.
(https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter16/fulltext.txt)
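
A conversion of this kind can be sketched with the extract_text function from pdfminer.six. This is an illustration, not the exact steps I ran: the helper names are hypothetical, the output is simply written next to the PDF with a .txt extension, and 'chapter.pdf' in the usage comment is a placeholder file name.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the path of the .txt file that sits next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_txt(pdf_path):
    """Extract the text of a PDF with pdfminer.six and save it as a .txt file."""
    # Imported here so this module still loads when pdfminer.six is not installed
    # (install it with: pip install pdfminer.six).
    from pdfminer.high_level import extract_text

    text = extract_text(str(pdf_path))
    out_path = txt_path_for(pdf_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path


# Usage (placeholder file name):
# pdf_to_txt("chapter.pdf")
```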

4.3 Extracting images

To extract text from the images, we first need to extract the images themselves. I used the ami3 tool to extract images from the chapter (https://github.com/petermr/semanticClimate/tree/main/ipcc/ar6/wg3/Chapter16/pdfimages). pdfminer.six also extracts images from the PDF (https://github.com/petermr/semanticClimate/tree/main/ipcc/ar6/wg3/Chapter16/images/pdfminer.six_ext).

This is one of the important images from Chapter 16.

4.4 Text from images

Now, we need text from the images. I ran the pyamiimage software on the images to extract text from them (https://github.com/petermr/semanticClimate/tree/main/ipcc/ar6/wg3/Chapter16/images/pdfminer.six_ext).

Here is the text extracted with EasyOCR from the image shown in the 'Extracting images' section.

4.5 Abbreviations

We then decided to build a clickable dictionary of abbreviations, as it would help readers understand them better. For this chapter, I manually made a list of abbreviations with their expansions, frequencies and Wikidata links (https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter16/abbrev.csv).
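
The list above was built by hand. As a possible aid for the frequency column, the sketch below counts candidate abbreviations in a text automatically. It is an assumption-laden illustration, not the method I used: it treats any run of two or more capital letters as an abbreviation, the function name and sample sentence are hypothetical, and expansions and Wikidata links would still have to be added manually.

```python
import re
from collections import Counter

# Assumption: an abbreviation is a standalone run of two or more capital letters.
ABBREVIATION = re.compile(r"\b[A-Z]{2,}\b")


def abbreviation_frequencies(text):
    """Count how often each candidate abbreviation occurs in the text."""
    return Counter(ABBREVIATION.findall(text))


# Hypothetical sample sentence.
sample_text = "The IPCC report was discussed at UNESCO. The IPCC chapter covers innovation."
print(abbreviation_frequencies(sample_text).most_common())
```

A simple pattern like this also picks up capitalised words that are not abbreviations and misses mixed-case ones, so its output is a starting list to review rather than a finished dictionary.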

4.6 Raw text vs Extracted text

To compare the raw (original) text with the text extracted using EasyOCR in pyamiimage, I created a .csv file (https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter16/raw_extracted_txt.csv).

4.7 HACKATHON PIAGET

The SemanticClimate team took part in the “Hackathon Piaget”, organised by UNESCO and the University of Geneva under the “Open Geneva” festival. We won the hackathon for our idea to make IPCC reports more readable and accessible.
