Internship Report_Enakshi Das #53
Enakshi-1998 started this conversation in General
I am Enakshi Das, a second-year M.Sc. Bioinformatics student at Pondicherry University. I was a Software Development and Beta-Testing intern at CEVOpen for 12 weeks/3 months (from 3rd June 2022 to 15th August 2022).
I worked on the SemanticClimate project, supervised by Dr. Peter Murray-Rust and Dr. Gitanjali Yadav, and served as its project manager throughout my time here. SemanticClimate is about converting IPCC reports into semantic form. I also took an active part in the Open Geneva Hackathon Piaget. It was a great experience, since the role gave me an overview of the entire project and of how the work gets done.
Table of Contents
1. SemanticClimate
1.1 Introduction
1.2 Conversion from PDF to TXT
1.3 Extracting Images
1.4 Text from images
1.5 Abbreviations
1.6 HACKATHON PIAGET
1. SemanticClimate
1.1 Introduction
The project focuses mainly on converting the reports of the United Nations' Intergovernmental Panel on Climate Change (IPCC) into semantic form, to make them more readable for citizens of any educational background. I chose to work with Chapter04 of the IPCC/ar6/wg3 report, titled “Mitigation and Development Pathways in the near to mid-term”.
1.2 Conversion from PDF to TXT
The first task was to convert the PDF file of the chapter to a TXT file, as a step towards producing an HTML file. I cloned pdf2txt.py in the Git Bash terminal, downloaded the IPCC report from the web in PDF format, and ran the conversion commands in the terminal.
The PDF was successfully converted to fulltext.html with flowing text.
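As an illustration of this step, here is a minimal sketch that drives pdf2txt.py (the command-line script shipped with pdfminer.six) from Python. It is not necessarily the exact command used during the internship. The input name `fulltext.pdf` and the helper names `pdf2txt_command` and `convert` are assumptions made for the example. The flags `-o` (output file) and `-t` (output type) are pdf2txt.py options.

```python
import subprocess
from pathlib import Path


def pdf2txt_command(pdf_path, out_path, output_type="html"):
    """Build the pdf2txt.py command line for one PDF (hypothetical helper)."""
    return [
        "pdf2txt.py",
        "-o", str(out_path),   # file to write the converted text to
        "-t", output_type,     # one of: text, html, xml, tag
        str(pdf_path),         # input PDF
    ]


def convert(pdf_path, out_path, output_type="html"):
    """Run pdf2txt.py; raises CalledProcessError if the tool reports failure."""
    subprocess.run(pdf2txt_command(pdf_path, out_path, output_type), check=True)
    return Path(out_path)
```

With pdfminer.six installed and pdf2txt.py on the PATH, calling `convert("fulltext.pdf", "fulltext.html")` would write the HTML file. Passing `output_type="text"` would produce plain text instead.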
1.3 Extracting Images
To extract text from the images, we first need to extract the images themselves. I used the ami3 tool to extract images from the chapter (https://github.com/petermr/semanticClimate/tree/main/ipcc/ar6/wg3/Chapter04/pdfimages). pdfminer.six also pulls images from the PDF (https://github.com/petermr/semanticClimate/tree/main/ipcc/ar6/wg3/Chapter04/images/pdfminer.six_ext). This is one of the important images from Chapter04.
1.4 Text from Images
Next, we need the text from the images. I ran the pyamiimage software on the images to extract text from them.
Here is the text that EasyOCR extracted from the image shown in the 'Extracting Images' section.
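For readers who want to try the OCR step on its own, the sketch below calls EasyOCR directly rather than going through pyamiimage, so it is an approximation of the step and not the pipeline used above. `easyocr.Reader(...).readtext(...)` returns a list of (bounding box, text, confidence) tuples. The helper names `join_ocr_text` and `ocr_image` and the 0.5 confidence threshold are assumptions made for this example.

```python
def join_ocr_text(results, min_confidence=0.5):
    """Join OCR fragments whose confidence is at least min_confidence.

    `results` is a list of (bounding_box, text, confidence) tuples,
    the shape returned by EasyOCR's Reader.readtext().
    """
    kept = [text.strip() for _box, text, conf in results if conf >= min_confidence]
    return " ".join(fragment for fragment in kept if fragment)


def ocr_image(image_path, languages=("en",), min_confidence=0.5):
    """Run EasyOCR on one image and return the recognised text as one string."""
    import easyocr  # imported lazily; requires `pip install easyocr`

    reader = easyocr.Reader(list(languages))
    return join_ocr_text(reader.readtext(str(image_path)), min_confidence)
```

Filtering by confidence is a simple way to drop fragments that the OCR engine misread from axis ticks or decorative marks. The threshold would need tuning per figure.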
1.5 Abbreviations
We then decided to build a clickable dictionary of abbreviations to help readers understand them. For Chapter04, I manually compiled a list of acronyms with their expansions, frequencies, and Wikidata links.
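The list for this chapter was compiled by hand, but a first pass could be automated. The following is a rough sketch, not the method used in the project: it counts acronym-like tokens with a regular expression and looks for expansions written in the form "Long Form (ABBR)". The names `ACRONYM`, `count_acronyms`, and `find_expansions` are made up for this example, and the sample sentence is illustrative text, not a quotation from the report.

```python
import re
from collections import Counter

# A capital letter followed by one or more capitals or digits, e.g. NDC, SDG, CO2.
ACRONYM = re.compile(r"\b[A-Z][A-Z0-9]+\b")


def count_acronyms(text):
    """Return a Counter mapping each acronym-like token to its frequency."""
    return Counter(ACRONYM.findall(text))


def find_expansions(text):
    """Guess expansions from the pattern 'Long Form (ABBR)'.

    For each parenthesised all-caps abbreviation, take as many preceding
    words as it has letters and accept them only if their initials match.
    """
    expansions = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        words = re.findall(r"[A-Za-z][A-Za-z-]*", text[:match.start()])
        candidate = words[-len(abbr):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == abbr:
            expansions.setdefault(abbr, " ".join(candidate))
    return expansions


sample = (
    "Nationally Determined Contributions (NDC) and the Sustainable "
    "Development Goals (SDG) are linked. Each NDC is reviewed; "
    "NDC updates cite SDG targets."
)
print(count_acronyms(sample))
print(find_expansions(sample))
```

The initials check is deliberately strict, so it misses cases such as "greenhouse gas (GHG)", where the abbreviation has more letters than the expansion has words. Entries like that would still need manual review.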
1.6 HACKATHON PIAGET
The SemanticClimate team participated in the “Hackathon Piaget”, organized by UNESCO and the University of Geneva as part of the “Open Geneva” festival. The central theme was 'Rethinking education for the 21st Century'. Our group brought together many enthusiastic participants from all over the world, from India and Cameroon to Latin America.
We presented our project, SemanticClimate, which revolves around the ideas of #openscience and #opensource. We showed that we are building software using NLP and Python to extract useful information from thousands of scientific papers in Europe PMC. We hope to be able to implement our product at the school and college levels soon. We won the hackathon for our idea of making IPCC reports more readable and accessible.