Files with code related to the mentoring program with KaggleX

mhrecaldeb/KaggleXMentoringProgram

KaggleXMentoringProgram

Files with code related to the mentoring program with KaggleX

What it contains

This repository includes a series of notebooks and the JSON files produced by applying them to the data in the ./dinardap folder.

Notebooks and JSON files

webscrappingcsvmanaged-SINARDAP.ipynb

This notebook contains an example program that downloads PDF files from a website. It can be adapted to download other types of files from other websites.
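
As a rough illustration of the idea, the sketch below finds every link to a PDF on a page and saves the files locally. It is not the code from the notebook: it uses only the Python standard library, and the names `PdfLinkParser`, `find_pdf_links`, `download_pdfs` and the `pdfs` output folder are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PdfLinkParser(HTMLParser):
    """Collects the target of every <a> tag that points to a .pdf file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and urlparse(href).path.lower().endswith(".pdf"):
            # Resolve relative links against the URL of the page.
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return the absolute URLs of all PDF links found in an HTML string."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def download_pdfs(page_url, out_dir="pdfs"):
    """Download every PDF linked from page_url into out_dir (hypothetical folder)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    saved = []
    for link in find_pdf_links(html, page_url):
        target = out / Path(urlparse(link).path).name
        with urlopen(link) as pdf:
            target.write_bytes(pdf.read())
        saved.append(target)
    return saved
```

Calling `download_pdfs` with the URL of a listing page would save each linked PDF under `pdfs/`. Sites that build their links with JavaScript or require a session would need a different approach.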

PDF Text extracting-resx.ipynb

The repository contains five notebooks with this name pattern, numbered 1, 2, 3, 4 and 8, each with code that extracts the text from a different PDF file. One notebook per PDF was needed because, although all the files are of the same kind (resolucion_no._00x-NG-Dinarp-2023.pdf, that is, DINARDAP resolutions from 2023), each document differs enough to make fully automating the extraction very complex. The objective was to extract the text while keeping track of which part of the document it comes from. Those parts are CONSIDERACIONES, RESOLUCIONES and DISPOSICIONES.
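
To make the "keep track of the part" idea concrete, here is a minimal sketch that splits text already extracted from a PDF into labelled sections. It is not the notebooks' code. It assumes each part starts on a line beginning with one of the three heading words in upper case, which may not hold for every resolution; `split_sections` and the record keys `section` and `text` are hypothetical names.

```python
import re

# Assumed heading words; the real PDFs may spell or format them differently.
SECTION_NAMES = ("CONSIDERACIONES", "RESOLUCIONES", "DISPOSICIONES")


def split_sections(text, section_names=SECTION_NAMES):
    """Return a list of {"section": name, "text": body} dicts in document order."""
    # A heading is any line that starts with one of the section names.
    pattern = re.compile(
        r"^[ \t]*(" + "|".join(map(re.escape, section_names)) + r")\b.*$",
        re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    records = []
    for i, match in enumerate(matches):
        # The body runs from the end of this heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        records.append({"section": match.group(1), "text": body})
    return records
```

Text that appears before the first heading is dropped by this sketch. Each returned record could then be written as one line of a JSONL file.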

jsonlfile_resx.jsonl

These files are the result of extracting the information from the PDF files.

concatenate.ipynb

This notebook contains several methods tried for concatenating the JSONL files generated from the individual PDF files into a single file, so that it can be used to fine-tune an LLM on the information extracted from the PDFs.
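
One simple way to do the concatenation is sketched below, as an assumption rather than a copy of any method in the notebook. It reads each input file line by line, checks that every line is valid JSON, and writes it to a single output file. The function name `concatenate_jsonl` is hypothetical.

```python
import json


def concatenate_jsonl(input_paths, output_path):
    """Copy every JSON line from input_paths into output_path; return the record count."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    record = json.loads(line)  # raises if a line is not valid JSON
                    # ensure_ascii=False keeps accented Spanish characters readable.
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    count += 1
    return count
```

For example, `concatenate_jsonl(sorted(pathlib.Path(".").glob("jsonlfile_res*.jsonl")), "combined.jsonl")` would merge the per-PDF files, where `combined.jsonl` is a hypothetical output name chosen so that it does not match the input pattern.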

How to run these files

You need an environment with Python 3.9 or higher and Jupyter Notebook or JupyterLab.

The file EnvRequirements.txt lists the version of each library that was used, for reference.
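
A quick way to confirm that the interpreter meets this requirement is sketched below. It can be run in the first cell of a notebook; `python_is_supported` is a hypothetical helper name.

```python
import sys

MIN_VERSION = (3, 9)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


if python_is_supported():
    print("Python version is supported.")
else:
    print("Python 3.9 or higher is required.")
```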
