Wikipedia lexical miner
The Wiki acquisition tool was developed on an Intel® Core™ i7-2670QM CPU (2.2 GHz, quad core) with 8 GB of RAM. A complete run on the English Wikipedia took about a week on that machine, which served as both the database server and the application server. A weaker machine will likely need more time.
The tool is multi-lingual: the IDM module can be tuned for a specific language by implementing the eu.excitementproject.eop.lexicalminer.definition.idm.IIDM and the eu.excitementproject.eop.lexicalminer.definition.idm.SyntacticUtils interfaces, as currently done for English and Italian.
The software requirements are:
- An installed MySQL server (development was done on version 5.5.27).
- Java version 1.6.
- JWPL (Java-based Wikipedia Library) for creating and filling the Wikipedia database. Full instructions on how to build a full Wikipedia JWPL database from Wikipedia dumps can be found at http://code.google.com/p/jwpl/wiki/DataMachine.
- Python 2.7 (for the EasyFirst English parser).
How to build the rules database
Create the system schema using the "CreateDB.sql" script supplied in the "DB_Scripts" folder.
Fill in the Wikipedia Miner configuration file according to your requirements. The file is divided into modules, each responsible for a specific set of parameters. The modules are self-explanatory. The most important module, and the one you will probably want to edit, is "Extractors", which determines the extractors used to fill the database. The other modules are the JWPL database configuration, the target database configuration (the database that will be filled with rules), and "processing_tools", which determines which processing tools are used. Note that the classpath should refer to the libraries of the configured tools; the current configuration, for instance, requires the lingpipe, stanford-postagger, stanford-ner, opennlp-tools, and gate jars. In addition, the stopwords file path and the JARS folder path are defined at the top of the configuration file as ENTITY definitions.
Important note: do not fill the database using both the lexicalIDM and syntacticIDM extractors. Using both can result in wrong classifier ranks.
The system uses the log4j framework for logging. Make sure you have a log4j configuration file (log4j.properties) in the log4j directory. You can change the logger configuration as you wish.
The English EasyFirst parser runs in a client-server manner; run its server side on port 8081.
Run the system using
java -Xmx<Allocated Memory Size> eu.excitementproject.eop.lexicalminer.wikipedia.MinerExecuter <Configuration file path>
- We set "Allocated Memory Size" to 4000M, but a bigger value can reduce the running time.
- The system has a recovery mechanism that can recover from a crash and skip data that has already been processed. After a crash, all you have to do is run the system again; the mechanism is used automatically.
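Because recovery only requires running the same command again, the rerun can be automated. The following is a minimal sketch, not part of the tool: the class `MinerRerunner` and its methods are hypothetical, it assumes that a crashed run exits with a non-zero exit code, that the required jars are already on the `CLASSPATH`, and it uses `ProcessBuilder.inheritIO`, which needs Java 7 or later for the wrapper itself.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: re-runs the miner so that its recovery mechanism
// can pick up where a crashed run stopped.
class MinerRerunner {
    // Main class named in this README.
    static final String MINER_CLASS =
            "eu.excitementproject.eop.lexicalminer.wikipedia.MinerExecuter";

    /** Builds the "java -Xmx<size> <main class> <config path>" command. */
    static List<String> buildCommand(String heapSize, String configPath) {
        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        cmd.add("-Xmx" + heapSize);
        cmd.add(MINER_CLASS);
        cmd.add(configPath);
        return cmd;
    }

    /** Runs the command up to maxAttempts times; returns the last exit code. */
    static int runWithRetries(List<String> cmd, int maxAttempts)
            throws IOException, InterruptedException {
        int exitCode = -1;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Child output goes to this process's console.
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            exitCode = p.waitFor();
            if (exitCode == 0) {
                break; // assumed to mean the run completed
            }
            System.err.println("Attempt " + attempt + " exited with code " + exitCode);
        }
        return exitCode;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the Wikipedia Miner configuration file.
        int code = runWithRetries(buildCommand("4000M", args[0]), 3);
        System.exit(code);
    }
}
```

The retry limit of 3 is arbitrary; a crash that repeats at the same point usually needs a look at the log files rather than another rerun.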
During the execution you can view the log files, which are written to the location defined in the log4j.properties file.
After a fully successful execution you have a database containing all the rules extracted by the system. This database lacks some indexes that are important for retrieving the rules. To add those indexes, run the "CreateIndexes.sql" script from the DB_Scripts folder.
At this point you have a full database ready for use by the retrieval tool. If you chose to run the lexical or syntactic extractors and wish to use an offline classifier, go to the "Build the offline classifiers" section; otherwise, you can skip to the "How to retrieve rules" section.
### Build the offline classifiers
The DB now contains all the rules and indexes, but statistical data must be collected before the classifiers can run. To gather these statistics, run "CollectStatictics.sql" from the DB_Scripts folder. (This script can take a while.)
To run the offline classifiers, choose which of the supplied classifiers you want in the classifiers configuration file.
Run the classifiers ranks calculation process using
java -Xmx<Allocated Memory Size> eu.excitementproject.eop.lexicalminer.definition.classifier.BuildClassifiers <Configuration file path>
An example of such a configuration file is given at: /src/eu/excitementproject/eop/lexicalminer/definition/classifier/BuildClassifiersConfig.xml
- The database is now ready to retrieve rules using the offline classifiers.
### Conversion of the SQL database to Redis
To convert the generated SQL database to Redis, run the SQL2RedisConverter program as follows:
java -Xmx32G eu.excitementproject.eop.lexicalminer.redis.SQL2RedisConverter <in sql dump file> <out l2r redis file> <out r2l redis file> <number of classifiers>
- in sql dump file: the dump file of the SQL db
- out l2r redis file: a path to the output Redis left-to-right similarities file
- out r2l redis file: a path to the output Redis right-to-left similarities file
- number of classifiers: the number of classifiers in the database. Assumption: the classifier ids are in the range [0, number-1] (this can be verified by running the "select id from classifiers" query).
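The id-range assumption can be checked before starting the conversion. The sketch below is illustrative only: `ClassifierIdCheck` and its method are hypothetical names, and the ids are assumed to have been read beforehand (for example from the result of the "select id from classifiers" query) into a list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: verifies that classifier ids are exactly 0..count-1.
class ClassifierIdCheck {
    /** Returns true when ids is exactly the set {0, 1, ..., count-1}, with no duplicates. */
    static boolean coversZeroToCountMinusOne(List<Integer> ids, int count) {
        if (ids.size() != count) {
            return false;
        }
        Set<Integer> seen = new HashSet<Integer>();
        for (int id : ids) {
            // Reject ids outside the range and ids that appear twice.
            if (id < 0 || id >= count || !seen.add(id)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(coversZeroToCountMinusOne(Arrays.asList(0, 1, 2), 3)); // true
        System.out.println(coversZeroToCountMinusOne(Arrays.asList(1, 2, 3), 3)); // false
    }
}
```

If the check fails, the value passed as the number of classifiers does not match the ids in the database, and the assumption stated above does not hold.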
How to retrieve rules
- Define the desired classifier and the other configurable parameters in the retrieval configuration file.
An example of such a configuration file for the SQL-based resource is given at: /src/eu/excitementproject/eop/lexicalminer/LexiclRulesRetrieval/wikipediaLexicalResourceConfig.xml
An example of such a configuration file for the Redis-based resource is given at: /src/eu/excitementproject/eop/lexicalminer/LexiclRulesRetrieval/redis/RedisBasedWikipediaLexicalResourceConfig.xml
- Use the LexicalResource interface to access the rules.
An example of this usage for the SQL-based database is given in the main method of the eu.excitementproject.eop.lexicalminer.LexiclRulesRetrieval.WikipediaLexicalResource class.
An example of this usage for the Redis-based database is given in the main method of the eu.excitementproject.eop.lexicalminer.redis.RedisBasedWikipediaLexicalResource class.