onto-gen

Module for generating ontologies from a text corpus.

I. Preparing the corpus

The current pipeline supports only PDF files. To use other file types, convert them to plain text and place them in the "corpus/txt" directory.

Place the PDF files in the "corpus/pdf" directory, then run:
make

When the process finishes, the corpus file will be placed in "corpus/final/final.txt".

II. Creating topics for Latent Semantic Indexing (LSI).

1. Creating a dictionary
./topics/analyser.py -i INPUT-CORPUS-FILE

This command creates a dictionary file in the same directory as INPUT-CORPUS-FILE.
INPUT-CORPUS-FILE is a corpus file with one sentence per line,
for example the corpus created in the previous section.
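
If you start from raw text rather than the pipeline output, the one-sentence-per-line format can be sketched as below. This is a minimal illustration, not part of onto-gen: the function names to_sentence_lines and write_corpus are hypothetical, and the regex splitter is a naive heuristic that will mis-split abbreviations such as "e.g.".

```python
import re
from pathlib import Path


def to_sentence_lines(raw_text: str) -> list[str]:
    """Split raw text into sentences using a naive punctuation heuristic."""
    # Collapse all whitespace (including newlines) into single spaces.
    flat = " ".join(raw_text.split())
    # Split after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [part for part in parts if part]


def write_corpus(raw_text: str, out_path: str) -> int:
    """Write one sentence per line; return the number of sentences written."""
    sentences = to_sentence_lines(raw_text)
    Path(out_path).write_text("\n".join(sentences) + "\n", encoding="utf-8")
    return len(sentences)


if __name__ == "__main__":
    sample = "Ontologies describe concepts. They also describe\nrelations between them!"
    for sentence in to_sentence_lines(sample):
        print(sentence)
```

The resulting file can then be passed as INPUT-CORPUS-FILE to ./topics/analyser.py.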

2. Creating LSI models
./topics/models.py -i INPUT-PATH
INPUT-PATH is the path to the dictionary created in the first step.
This command creates the LSI models in the same directory as the dictionary.

III. Creating an inverted index

1. Creating the index
./search/search_manager.py -c PATH-TO-SCHEMA -i PATH-TO-INDEX
PATH-TO-INDEX is the path to the directory where the index will be created.
PATH-TO-SCHEMA is the path to a schema file in the format described below.
Example schema format (means: store title, store full text, allow full text search):
title	TEXT	True
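
A schema file can be sanity-checked before it is passed to search_manager.py. The sketch below only assumes what the example line shows: three tab-separated columns. Reading them as field name, field type, and a boolean flag is an assumption, parse_schema_line is a hypothetical helper, and the parser inside search_manager.py may behave differently.

```python
def parse_schema_line(line: str) -> tuple[str, str, bool]:
    """Parse one schema line into (name, field_type, flag).

    Assumes exactly three tab-separated columns, as in the README example.
    The meaning given to each column here is an assumption.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 3:
        raise ValueError(f"expected 3 tab-separated columns, got {len(columns)}")
    name, field_type, flag = columns
    if flag not in ("True", "False"):
        raise ValueError(f"flag must be 'True' or 'False', got {flag!r}")
    return name, field_type, flag == "True"


if __name__ == "__main__":
    print(parse_schema_line("title\tTEXT\tTrue"))
```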

2. Reading corpus files into the index
To read a corpus file into the index, use:
./search/search_manager.py -i PATH-TO-INDEX -q -af PATH-TO-FILE
PATH-TO-INDEX is the path to the directory where the index was created in the previous step.
PATH-TO-FILE is the path to the corpus file you want to read into the index.

IV. Creating an ontology

WARNING! All distance matrix calculations are stored in the "./temp/" directory to speed up later generations. Remember to clear this directory (rm ./temp/*) after changing the INPUT-TERMS list.
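
The cache clean-up can also be scripted, for example as part of a wrapper that regenerates the ontology. The following is a minimal sketch under stated assumptions: clear_cache is a hypothetical helper, not part of onto-gen, and unlike the shell glob in rm ./temp/* it also removes hidden files.

```python
from pathlib import Path


def clear_cache(temp_dir: str = "./temp") -> int:
    """Delete regular files directly inside temp_dir; return how many were removed."""
    directory = Path(temp_dir)
    if not directory.is_dir():
        return 0  # nothing has been cached yet
    removed = 0
    for entry in directory.iterdir():
        # Like `rm` without -r, leave subdirectories untouched.
        if entry.is_file():
            entry.unlink()
            removed += 1
    return removed
```

Calling clear_cache() after editing the terms file forces the distance matrix to be recomputed on the next run.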

1. Creating a new ontology
./ontology_factory.py -i INDEX-PATH -c CORPUS-PATH -t INPUT-TERMS -o OUTPUT-OWL-FILE
Additionally, you can use the -l option to lemmatize the input terms.

INDEX-PATH is the path to the directory in which the inverted index was created in the previous section.
CORPUS-PATH is the path to the topic dictionary created in section II.
OUTPUT-OWL-FILE is the path for the generated output OWL ontology.
INPUT-TERMS is the path to a file with terms listed one per line.
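
An INPUT-TERMS file can be produced from a list of terms as sketched below. The helper names normalise_terms and write_terms are hypothetical. Stripping whitespace and dropping blank or duplicate entries is a precaution chosen for this sketch; the README does not state how ontology_factory.py treats such lines.

```python
from pathlib import Path


def normalise_terms(terms):
    """Strip whitespace, drop empty entries and duplicates, keep first-seen order."""
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        if term and term not in seen:
            seen.add(term)
            cleaned.append(term)
    return cleaned


def write_terms(terms, out_path):
    """Write the cleaned terms one per line; return the list that was written."""
    cleaned = normalise_terms(terms)
    Path(out_path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```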

2. Extending an existing ontology
./ontology_factory.py -i INDEX-PATH -c CORPUS-PATH -t INPUT-TERMS -o OUTPUT-OWL-FILE -g INPUT-OWL
Additionally, you can use the -l option to lemmatize the input terms.

INDEX-PATH is the path to the directory in which the inverted index was created in the previous section.
CORPUS-PATH is the path to the topic dictionary created in section II.
OUTPUT-OWL-FILE is the path for the generated output OWL ontology.
INPUT-TERMS is the path to a file with terms listed one per line.
INPUT-OWL is the path to the OWL file you want to extend.
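
Both invocations differ only in the optional -g and -l flags, so a wrapper script can assemble the argument list in one place. This is a sketch, not part of onto-gen: build_ontology_command is a hypothetical helper, the file paths in the demo are placeholders, and it assumes -l takes no argument, as the description above suggests.

```python
import shlex


def build_ontology_command(index_path, corpus_path, terms_path, output_owl,
                           input_owl=None, lemmatize=False):
    """Return the argument list for ontology_factory.py using the README flags."""
    command = [
        "./ontology_factory.py",
        "-i", index_path,
        "-c", corpus_path,
        "-t", terms_path,
        "-o", output_owl,
    ]
    if input_owl is not None:
        # -g extends an existing OWL file instead of starting from scratch.
        command += ["-g", input_owl]
    if lemmatize:
        # Assumed to be a bare switch with no value.
        command.append("-l")
    return command


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    args = build_ontology_command("index", "corpus.dict", "terms.txt",
                                  "out.owl", input_owl="base.owl", lemmatize=True)
    print(shlex.join(args))
```

The returned list can be passed to subprocess.run(args, check=True) from the repository root.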
