Internship_Report_Roopa #47
I am V.S.M. Roopa, a third-year student of BSc Mathematics, Statistics and Computer Science at St. Francis College for Women, Osmania University, Hyderabad. I was a Software Development and Alpha-Testing intern at CEVOpen for 6 months (from 3rd February 2022 to 3rd August 2022).
1.0 Terpene Classification

1.1 Introduction
A pathway is a chain of events leading to a product, drawn as elements connected by multiple arrows. A metabolic pathway is a pathway whose elements are biochemical molecules. Terpenes are found largely as constituents of essential oils and are mostly hydrocarbons. Images of metabolic pathways that contain terpenes are called terpene pathways.

1.2 Image annotation
Alex Pico has been collecting information about terpene pathways, and he built an app with pathway images for us. To assess how familiar annotators were with the rules of terpene classification, a test on 30 terpene pathway images from Alex Pico's app was conducted with the help of 8 volunteers (I was one of them).

1.3 Statistical analysis of inter-annotator matrix
The following is Ananya's project:
It is now crucial to decide whether the rules are adequate for carrying out the later procedures. Hence, we calculate the relation between the volunteers' answers.
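As an illustration of what an inter-annotator matrix can look like, here is a minimal sketch that computes pairwise percent agreement between volunteers. The volunteer names, the labels, and the choice of percent agreement as the statistic are all assumptions made for illustration; the statistic actually used in the project is not stated in this report.

```python
from itertools import combinations

# Hypothetical answers: one label per image, per volunteer (not the project's data).
answers = {
    "volunteer_a": ["terpene", "terpene", "other", "terpene"],
    "volunteer_b": ["terpene", "other", "other", "terpene"],
    "volunteer_c": ["terpene", "terpene", "other", "other"],
}


def percent_agreement(labels_1, labels_2):
    """Fraction of images on which two volunteers gave the same label."""
    if len(labels_1) != len(labels_2):
        raise ValueError("both volunteers must label the same images")
    matches = sum(a == b for a, b in zip(labels_1, labels_2))
    return matches / len(labels_1)


def agreement_matrix(answers):
    """Symmetric dict-of-dicts of pairwise agreement; the diagonal is 1.0."""
    names = sorted(answers)
    matrix = {name: {name: 1.0} for name in names}
    for first, second in combinations(names, 2):
        score = percent_agreement(answers[first], answers[second])
        matrix[first][second] = score
        matrix[second][first] = score
    return matrix


matrix = agreement_matrix(answers)
for name in sorted(matrix):
    row = "  ".join(f"{matrix[name][other]:.2f}" for other in sorted(matrix))
    print(f"{name}: {row}")
```

Low off-diagonal values would suggest that the volunteers read the rules differently. A chance-corrected statistic such as Cohen's kappa is a common alternative when some labels are much more frequent than others.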
1.4 Collecting more images and testing the software
The purpose of doing this is to create a corpus of 'terpene synthases', run
I collected additional images of terpene synthases from PubMed Central using the following steps and added them to
2.0 Creating Dictionaries and Alpha-testing Pyamiimage

2.1 Introduction
In the context of this project’s requirements, dictionaries are collections of words that all relate to terpene and terpene synthase elements, for example Hemiterpenes and Isomerase. Unlike ordinary dictionaries, these also record the frequency of each word and its synonyms. The words are extracted from images using the optical character recognition (OCR) tools Tesseract and EasyOCR. Each word is then looked up in WikiData for its element code, description, and related element codes, and this information is stored in the dictionaries.
The main purpose of alpha-testing was initially to check the installation and execution processes. This involved testing a sample of images for the accuracy of the Tesseract and EasyOCR output, checking the command-line functionality, giving feedback on the output, and posting any issues encountered on GitHub.

2.2 Collecting words from image analysis tool
Basically, the tool To use
[.png is used here as an example for the input file; use the file format of your own input.] Example: Given below is an image stored as 'img.png'.
For 'tesseract', here is the command:
Example: Given below is an image stored as 'img.png'.
2.3 Observations
2.4 Points to work on
I made a .csv file of the text extracted with EasyOCR and classified it into Compounds and Enzymes, which will be helpful in creating a dictionary (https://github.com/petermr/CEVOpen/blob/master/sample_eocr_classification.csv).

2.5 Cleaning data manually
The words acquired from the tool had a few errors, so I cleaned the data manually (https://github.com/petermr/pyamiimage/tree/main/cleaned_phrases). The advantage of manual curation is that a misspelt word can be replaced with the correct spelling. For truncated or misread words, however, more than one replacement may be possible, which creates confusion. For example, if 'ydroxy-3-me' is the output, it can be replaced with either 'hydroxy-3-methylbutyrate' or 'hydroxy-3-methylglutaryl'. To choose the most suitable word, I turned to the Fuzzywuzzy library.

2.6 Fuzzywuzzy matching
Fuzzywuzzy is a string-matching library in Python. The process involves manually searching the internet for candidate spellings and comparing them with the corrupted word obtained in the image analysis output. We can then replace the word manually with the candidate that has the highest score. Example: We replace 'ydroxy-3-met' with 'hydroxy-3-methylbenzene'.
Advantages of using this library:
Limitations of using this library:
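The matching step described in 2.6 can be sketched as follows. So that it runs without installing anything, this sketch uses the standard library's `difflib` as a stand-in for Fuzzywuzzy; Fuzzywuzzy's `fuzz.ratio` reports a comparable score from 0 to 100. The helper names `similarity` and `rank_candidates` are hypothetical, and the candidate list is the one from the example in 2.5.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity score from 0 to 100, based on difflib's ratio."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def rank_candidates(corrupted, candidates):
    """Return (candidate, score) pairs, best match first."""
    scored = [(candidate, similarity(corrupted, candidate)) for candidate in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Candidate spellings are still looked up by hand; these are from the example in 2.5.
candidates = ["hydroxy-3-methylbutyrate", "hydroxy-3-methylglutaryl"]
ranking = rank_candidates("ydroxy-3-me", candidates)
for word, score in ranking:
    print(f"{word}: {score}")
```

Two candidates can receive the same score, for instance when both contain the whole corrupted fragment and have the same length. In that case the score cannot decide, and the choice still has to be made by hand.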
2.7 Code for Dictionaries
I wrote code in Python to create dictionaries.
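The original code is not reproduced in this report. As a rough idea of the kind of structure described in 2.1, here is a minimal sketch that builds a dictionary recording each term's frequency and synonyms. The function name `build_dictionary`, the entry layout, the word list, and the synonym table are all hypothetical and are not taken from the project's code or data.

```python
from collections import Counter


def build_dictionary(words, synonyms=None):
    """Map each term to its frequency and any known synonyms."""
    synonyms = synonyms or {}
    # Normalise case and drop blank entries before counting.
    counts = Counter(word.strip().lower() for word in words if word.strip())
    return {
        term: {"frequency": count, "synonyms": sorted(synonyms.get(term, []))}
        for term, count in counts.most_common()
    }


# Hypothetical cleaned OCR output and synonym table, for illustration only.
ocr_words = ["Hemiterpenes", "Isomerase", "isomerase", "Limonene", " "]
known_synonyms = {"limonene": ["dipentene"]}

dictionary = build_dictionary(ocr_words, known_synonyms)
for term, entry in dictionary.items():
    print(term, entry)
```

Information looked up in WikiData, such as the element code and description, could be stored as further keys in each entry.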
3.0 Project Plan 2022
I made a project plan of events at CEVOpen for the year 2022, which includes the data components, workflow and goals (https://github.com/petermr/petermr/discussions/28).
4.0 SemanticClimate

4.1 Introduction
The project mainly focuses on converting the reports of the United Nations’ Intergovernmental Panel on Climate Change (IPCC) into semantic form, to make them more readable for citizens of any educational background.

4.2 Conversion from PDF to TXT
The first task was to convert the PDF file of the chapter to a TXT file, in order to finally convert it into an HTML file. I tried using only the

4.3 Extracting images
In order to extract text from the images, we first need to extract the images. I used the
This is one of the important images from Chapter 16.

4.4 Text from images
Now, we need the text from the images. I ran the pyamiimage software on the images to extract it (https://github.com/petermr/semanticClimate/tree/main/ipcc/ar6/wg3/Chapter16/images/pdfminer.six_ext). Here is the text from the image shown in the 'Extracting images' section using

4.5 Abbreviations
We then decided to build a clickable dictionary for abbreviations, as it would help the reader understand them better. I manually made a list of abbreviations with their expansions, frequencies and WikiData links for this chapter (https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter16/abbrev.csv).

4.6 Raw text vs Extracted text
To compare the raw (original) text with the text extracted using EasyOCR in pyamiimage, I created a .csv file (https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter16/raw_extracted_txt.csv).

4.7 HACKATHON PIAGET
The SemanticClimate team took part in the “Hackathon Piaget” organised by UNESCO and the University of Geneva under the “Open Geneva” festival. We won the hackathon for our idea to make IPCC reports more readable and accessible.
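The abbreviation list in 4.5 was compiled manually. As a possible aid, the frequency column could be counted automatically along the lines of the sketch below. The rule that an abbreviation is a standalone run of 2 to 6 capital letters, the column names, the sample sentence, and the helper names are all assumptions made for illustration, not the layout of the actual abbrev.csv.

```python
import csv
import io
import re
from collections import Counter

# Assumption: an abbreviation is a standalone run of 2 to 6 capital letters.
ABBREVIATION = re.compile(r"\b[A-Z]{2,6}\b")


def count_abbreviations(text):
    """Count candidate abbreviations in a block of text."""
    return Counter(ABBREVIATION.findall(text))


def to_csv(counts, expansions):
    """Write abbreviation, expansion, frequency rows to a CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["abbreviation", "expansion", "frequency"])
    for abbreviation, frequency in counts.most_common():
        # Abbreviations without a known expansion get an empty cell.
        writer.writerow([abbreviation, expansions.get(abbreviation, ""), frequency])
    return buffer.getvalue()


# Hypothetical sample text and expansions, for illustration only.
sample = "The IPCC report cites GHG data. GHG trends inform IPCC and UN work."
expansions = {
    "IPCC": "Intergovernmental Panel on Climate Change",
    "GHG": "greenhouse gas",
}
counts = count_abbreviations(sample)
print(to_csv(counts, expansions))
```

A pattern this simple will also pick up capitalised words that are not abbreviations, so the resulting list would still need manual review.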
Hi, this is about internship at CEVOpen.