Automatic-Metadata-Extraction-from-Scientific-Documents

This project performs automatic metadata extraction from scientific documents. The relevant metadata includes the title, authors, abstract, keywords, journal name, volume, etc.

Metadata is extracted by analyzing the relevant text with suitable information extraction and natural language processing techniques. The methods include font analysis and processing of the uncompressed PDF file after conversion to XML and plain text, using information extraction techniques such as regular expressions and tokenization.
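As a rough illustration of these two ideas, the sketch below picks the title by font size and pulls the abstract and keywords with regular expressions. It is not the project's actual code: the XML layout (one `<text>` element per line with a `size` attribute), the sample document, and every function name are assumptions made for this example, and real converter output will differ.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical converted PDF: one <text> element per line, with its font size.
SAMPLE_XML = """<page>
  <text size="18">Automatic Metadata Extraction</text>
  <text size="18">from Scientific Documents</text>
  <text size="11">A. Author, B. Writer</text>
  <text size="9">Abstract: We extract metadata from PDF files.</text>
  <text size="9">Keywords: metadata; information extraction; PDF</text>
</page>"""

# Assumed conventions: a labelled line such as "Abstract: ..." or "Keywords: ...".
ABSTRACT_RE = re.compile(r"^\s*Abstract\s*[:\-]\s*(.+)$", re.IGNORECASE | re.MULTILINE)
KEYWORDS_RE = re.compile(r"^\s*Keywords?\s*[:\-]\s*(.+)$", re.IGNORECASE | re.MULTILINE)


def extract_title(xml_string):
    """Font analysis: join the lines set in the largest font size."""
    root = ET.fromstring(xml_string)
    lines = [(float(el.get("size", 0)), (el.text or "").strip())
             for el in root.iter("text")]
    if not lines:
        return ""
    largest = max(size for size, _ in lines)
    return " ".join(text for size, text in lines if size == largest and text)


def xml_to_text(xml_string):
    """Flatten the XML into plain text, one line per <text> element."""
    root = ET.fromstring(xml_string)
    return "\n".join((el.text or "").strip() for el in root.iter("text"))


def extract_abstract(plain_text):
    """Regular expression: capture the text following the 'Abstract' label."""
    match = ABSTRACT_RE.search(plain_text)
    return match.group(1).strip() if match else ""


def extract_keywords(plain_text):
    """Regular expression plus tokenizing: split the keyword line on ; or ,"""
    match = KEYWORDS_RE.search(plain_text)
    if not match:
        return []
    return [tok.strip() for tok in re.split(r"[;,]", match.group(1)) if tok.strip()]


def extract_metadata(xml_string):
    """Combine the font-based and regex-based extractors into one record."""
    plain_text = xml_to_text(xml_string)
    return {
        "title": extract_title(xml_string),
        "abstract": extract_abstract(plain_text),
        "keywords": extract_keywords(plain_text),
    }


if __name__ == "__main__":
    print(extract_metadata(SAMPLE_XML))
```

The largest-font heuristic is only a starting point: documents that set a journal banner or section heading in a larger font than the title would need extra rules, such as restricting the search to the first page or combining size with position.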

This project was carried out at the Indira Gandhi Centre for Atomic Research (IGCAR) to automate metadata extraction for scientific documents submitted to the IGCAR knowledge repository. Further details are included in the final report.