This repository currently holds the Solr files for indexing news articles from the NY Times news corpus and Wikipedia articles from Wikimedia XML files.
- Install Solr as described here
- Navigate to the example folder in the directory of your Solr installation
- Delete the solr folder
- Clone this repository, and rename the resulting folder to "solr"
- Create a folder
- Download all tgz files from the NYT Corpus folder under projects on ublearns, and place them in the folder you just created
- Extract the tgz files
- Change the newsDataDirectory property on line 10 of newsArticleCollection/core.properties to the ABSOLUTE path of the folder created above. No quotes are required, but on Windows you'll have to replace each "\" with "\\"
- Download and extract the Wikipedia XML files [[TODO:place url here]] and put them in a folder
- Repeat the final step for the NYT corpus above, replacing newsDataDirectory with wikiDataDirectory and newsArticleCollection with wikiArticleCollection
- Run the Solr admin panel, and execute a dataImport on the newsArticleCollection or wikiArticleCollection core to index news articles or Wikipedia articles, respectively
- Navigate to the Flask app base folder
- With your virtual environment activated, run the command "python run.py refresh_index"
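
A few of the manual steps above can be scripted. The following is a minimal sketch, not part of this repository: the helper names (`extract_archives`, `properties_value`, `dataimport_url`) are made up for illustration. The import URL assumes Solr is running on its default port 8983 and that the data import handler is registered at `/dataimport` in each core's solrconfig.xml, which may not match your setup.

```python
import tarfile
from pathlib import Path
from urllib.parse import urlencode


def extract_archives(folder):
    """Extract every .tgz archive found directly inside `folder`, in place.

    Only run this on archives you trust (here: the NYT corpus downloads).
    Returns the names of the archives that were extracted.
    """
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.tgz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=folder)
        extracted.append(archive.name)
    return extracted


def properties_value(path):
    """Format an absolute path for core.properties.

    Backslashes are escape characters in .properties files, so each one
    in a Windows path has to be doubled. Other paths are left unchanged.
    """
    return str(path).replace("\\", "\\\\")


def dataimport_url(core, host="localhost", port=8983):
    """Build a full-import URL for the given core.

    Assumption: the handler path is /dataimport and Solr listens on
    host:port. Adjust both to match your installation.
    """
    query = urlencode({"command": "full-import"})
    return f"http://{host}:{port}/solr/{core}/dataimport?{query}"
```

For example, `properties_value` applied to the Windows path `C:\data\nyt` gives the text to paste after `newsDataDirectory=`, and opening `dataimport_url("newsArticleCollection")` in a browser should have the same effect as starting the import from the admin panel, provided the assumptions above hold.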