Internship Report_Enakshi Das #53

Enakshi-1998 · 2022-09-02T14:32:38Z

I am Enakshi Das, a second-year M.Sc. Bioinformatics student at Pondicherry University. I was a Software Development and Beta-Testing intern at CEVOpen for 12 weeks/3 months (from 3rd June 2022 to 15th August 2022).
I worked on the SemanticClimate project, supervised by Dr. Peter Murray-Rust and Dr. Gitanjali Yadav, and served mainly as project manager throughout. I also took an active part in the Open Geneva Hackathon Piaget. SemanticClimate is about converting IPCC reports into semantic form. Being project manager was a great experience, since it gave me an overview of the entire project and of how the work gets done.

Table of Contents

1. SemanticClimate
1.1 Introduction
1.2 Conversion from PDF to TXT
1.3 Extracting Images
1.4 Text from images
1.5 Abbreviations
1.6 HACKATHON PIAGET

1. SemanticClimate

1.1 Introduction

The project mainly focuses on converting the reports of the United Nations' Intergovernmental Panel on Climate Change (IPCC) into semantic form, to make them more readable for citizens from any educational background. I chose to work with Chapter04 of the IPCC/ar6/wg3 report, titled “Mitigation and Development Pathways in the near to mid-term”.

1.2 Conversion from PDF to TXT

The first task was to convert the PDF file of the chapter to a TXT file, so that it could then be converted into an HTML file. I cloned pdf2txt.py in the Git Bash terminal. After downloading the IPCC report from the web in PDF format, I ran the following commands in the terminal.

     mkdir Chapter04
     cp Chapter04.pdf Chapter04/fulltext.pdf
     cd Chapter04
     pdf2txt.py -o fulltext.txt fulltext.pdf 
     ls
     fulltext.pdf	fulltext.txt

The PDF was successfully converted to fulltext.txt with flowing text.
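The same conversion can also be done from Python, and the resulting text can then be wrapped in HTML. This is only a minimal sketch, not the converter the project actually uses: it assumes pdfminer.six is installed (the package that provides pdf2txt.py), and the helper names `pdf_to_text` and `text_to_html`, as well as the one-paragraph-per-`<p>` layout, are my own illustrative choices.

```python
import html
import re


def pdf_to_text(pdf_path):
    """Return the flowing text of a PDF via pdfminer.six's high-level API."""
    # Lazy import: text_to_html below works even without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def text_to_html(text, title="fulltext"):
    """Wrap plain text in a bare-bones HTML page, one <p> per paragraph."""
    # Assumption: paragraphs are separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", text)
    # Collapse line breaks and repeated spaces inside each paragraph.
    paragraphs = [" ".join(block.split()) for block in blocks if block.strip()]
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n<html>\n"
        f'<head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body>\n{body}\n</body>\n</html>\n"
    )


# Example usage (file names as in the commands above, run inside Chapter04):
#   text = pdf_to_text("fulltext.pdf")
#   with open("fulltext.html", "w", encoding="utf-8") as out:
#       out.write(text_to_html(text, title="Chapter04"))
```

Escaping the text with `html.escape` keeps characters such as `&` and `<` from breaking the page; headings, tables and figures would still need separate handling.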

1.3 Extracting Images

Before text can be extracted from the images, the images themselves have to be extracted. I used the ami3 tool to extract images from the chapter (https://github.com/petermr/semanticClimate/tree/main/ipcc/ar6/wg3/Chapter04/pdfimages). pdfminer.six can also pull images from the PDF (https://github.com/petermr/semanticClimate/tree/main/ipcc/ar6/wg3/Chapter04/images/pdfminer.six_ext).

This is one of the important images from Chapter04.
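For the pdfminer.six route, the extraction step might look roughly like the sketch below. It assumes pdfminer.six is installed and relies on the `output_dir` argument of `extract_text_to_fp`; the function names, the directory layout and the list of image suffixes are hypothetical choices for illustration, and the ami3 workflow is not shown.

```python
import io
from pathlib import Path

# Assumed set of suffixes; the files actually written depend on the PDF's image encodings.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".jp2", ".png", ".bmp", ".tif", ".tiff"}


def extract_images(pdf_path, output_dir):
    """Write the images embedded in a PDF into output_dir using pdfminer.six."""
    # Lazy import so list_images below can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text_to_fp

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    with open(pdf_path, "rb") as pdf_file:
        # The text output is discarded; output_dir tells pdfminer.six where to save images.
        extract_text_to_fp(pdf_file, io.StringIO(), output_dir=str(output_dir))


def list_images(output_dir):
    """Return the sorted names of image files found in output_dir."""
    return sorted(
        path.name
        for path in Path(output_dir).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


# Example usage (hypothetical paths):
#   extract_images("fulltext.pdf", "images")
#   print(list_images("images"))
```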

1.4 Text from Images

Next, we need the text contained in the images. I ran the pyamiimage software on the images to extract text from them.
Here is the text extracted with EasyOCR from the image shown in the 'Extracting Images' section.
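To give an idea of the OCR step, here is a minimal sketch that calls EasyOCR directly rather than going through pyamiimage, whose own interface is not shown here. It assumes the `easyocr` package is installed; the helper names and the 0.5 confidence threshold are arbitrary choices for illustration.

```python
def read_image_text(image_path, languages=("en",)):
    """Run EasyOCR on one image and return its raw results."""
    # Lazy import: EasyOCR is heavy and downloads model files on first use.
    import easyocr

    reader = easyocr.Reader(list(languages))
    # Each result is a (bounding_box, text, confidence) tuple.
    return reader.readtext(str(image_path))


def keep_confident_text(results, min_confidence=0.5):
    """Return the recognised strings whose confidence is at least min_confidence."""
    return [text for _box, text, confidence in results if confidence >= min_confidence]


# Example usage (hypothetical file name):
#   results = read_image_text("images/figure.png")
#   print("\n".join(keep_confident_text(results)))
```

Filtering on confidence helps to drop fragments that the OCR misreads from axis ticks or small labels, at the cost of sometimes losing genuine text.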

1.5 Abbreviations

We then decided to build a clickable dictionary of abbreviations, as it would help readers understand them better. For Chapter04, I manually compiled a list of the acronyms, their expansions, their frequencies, and their WikiData links.
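Part of this manual work could be supported by a small script. The sketch below is an illustration only, not the method used for the list: it counts acronym-like tokens in the chapter text and wraps known ones in `<abbr>` tags. The regular expression, the function names and the sample expansion are assumptions, and WikiData links would still have to be looked up by hand.

```python
import html
import re
from collections import Counter

# Candidate acronyms: a capital letter followed by at least one more capital or digit.
ACRONYM = re.compile(r"\b[A-Z][A-Z0-9]+\b")


def count_acronyms(text):
    """Count how often each acronym-like token occurs in the text."""
    return Counter(ACRONYM.findall(text))


def mark_abbreviations(text, expansions):
    """Wrap known acronyms in <abbr> tags so the expansion shows on hover."""
    def wrap(match):
        token = match.group(0)
        expansion = expansions.get(token)
        if expansion is None:
            return token  # unknown token: leave it untouched
        return f'<abbr title="{html.escape(expansion)}">{token}</abbr>'

    return ACRONYM.sub(wrap, text)


# Example usage (assumes fulltext.txt from section 1.2 is in the current directory):
#   with open("fulltext.txt", encoding="utf-8") as f:
#       text = f.read()
#   print(count_acronyms(text).most_common(20))
#   print(mark_abbreviations("GHG emissions", {"GHG": "greenhouse gas"}))
```

The pattern also matches ordinary words written in capitals, such as headings, so the counts are a starting point for manual review rather than a finished list. For a clickable entry, the `<abbr>` tag could be swapped for a link once the WikiData page is known.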

1.6 HACKATHON PIAGET

The SemanticClimate team participated in the “Hackathon Piaget”, organized by UNESCO and the University of Geneva as part of the “Open Geneva” festival. The central theme was 'Rethinking education for the 21st Century'. Our group brought together many enthusiastic participants from all over the world, from India and Cameroon to Latin America.

We presented our project, SemanticClimate, which revolves around the ideas of #openscience and #opensource. We showed that we are building software using NLP and Python to extract useful information from thousands of scientific papers available in Europe PMC. Hopefully, we will soon be able to introduce our product at the school and college levels. We won the hackathon for our idea of making IPCC reports more readable and accessible.
