Code to run the Extra algorithm for unsupervised topic/aspect extraction on English texts.
Read the Official Documentation here
IMPORTANT:
- When running Extra inside a Docker container, make sure that the Docker process has enough resources. For example, on Mac/Windows it should have at least 8 GB of RAM available to it. Read more about RAM requirements
- The GitHub repo does not come with GloVe embeddings. See the section Downloading Embeddings for how to download the required embeddings.
To run extra-model with Docker, first build the image:
docker-compose build
Then, run the following command to make sure that extra-model was installed correctly:
docker-compose run test
The next step is to download the embeddings (we use GloVe from Stanford in this project).
To download the required embeddings, run the following command:
docker-compose run --rm setup
The embeddings will be downloaded, unzipped, and formatted into a space-efficient format. Files will be saved in the embeddings/ directory in the root of the project directory. If the process fails, it can be safely restarted. If you want to restart the process with new files, delete all files except README.md in the embeddings/ directory.
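The cleanup step before a fresh download can be scripted. The following is a minimal sketch, not part of extra-model: the function name `clear_embeddings` is hypothetical, and removing subdirectories as well as files is an assumption, since the text above only mentions files.

```python
import shutil
from pathlib import Path
from typing import List


def clear_embeddings(embeddings_dir: Path, keep: str = "README.md") -> List[str]:
    """Delete everything in embeddings_dir except the entry named `keep`.

    Hypothetical helper. Returns the sorted names of the removed entries.
    """
    removed = []
    for entry in embeddings_dir.iterdir():
        if entry.name == keep:
            continue  # README.md must survive the cleanup
        if entry.is_dir():
            shutil.rmtree(entry)  # assumption: leftover subdirectories go too
        else:
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)
```

Calling `clear_embeddings(Path("embeddings"))` from the project root would then leave only README.md behind, after which the setup command can be run again.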
After you've downloaded the embeddings, you may want to run docker-compose build again.
This will build an image with embeddings already present inside the image.
The tradeoff here is that the image will be much bigger, but you won't spend ~2 minutes each time you run extra-model waiting for embeddings to be mounted into the container.
On the other hand, building an image with embeddings in the context will increase build time from ~3 minutes to ~10 minutes.
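To make the tradeoff concrete, here is a small worked calculation using the approximate figures quoted above (~3 vs. ~10 minutes of build time, ~2 minutes of mounting per run). It is only a rough sketch: the function name is hypothetical, it assumes a single rebuild, and it ignores the larger image size.

```python
import math


def break_even_runs(build_without_min: float, build_with_min: float,
                    mount_wait_min: float) -> int:
    """Number of runs after which baking the embeddings into the image pays off."""
    extra_build_time = build_with_min - build_without_min  # one-off cost
    return math.ceil(extra_build_time / mount_wait_min)    # per-run saving


# ~7 extra minutes of build time vs. ~2 minutes saved per run
print(break_even_runs(3, 10, 2))  # 4
```

With these figures, rebuilding the image is worthwhile once you expect to run extra-model about four or more times between rebuilds.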
Finally, running extra-model is as simple as:
docker-compose run extra-model /package/tests/resources/100_comments.csv
NOTE: when using this approach, the input file should be mounted inside the container.
By default, everything from the extra-model folder is mounted to the /package/ folder.
This can be changed in docker-compose.yaml.
The command above produces a result.csv file in the /io/ folder (the default setting).
The output location can be changed by supplying a second path, e.g.:
docker-compose run extra-model /package/tests/resources/100_comments.csv /io/another_folder
The output filename can also be changed, if you want it to be something other than result.csv, by supplying a third argument:
docker-compose run extra-model /package/tests/resources/100_comments.csv /io/another_folder another_filename.csv
More examples, as well as an explanation of input/output, are available in the official documentation.
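If you drive these Docker invocations from a script, the three forms shown above (input only, input plus output folder, input plus output folder plus filename) can be assembled by a small helper. This is an illustrative sketch: `extra_model_command` is a hypothetical name, and the only thing it assumes about extra-model is the positional argument order described above.

```python
from typing import List, Optional


def extra_model_command(input_file: str,
                        output_dir: Optional[str] = None,
                        output_filename: Optional[str] = None) -> List[str]:
    """Build the argument list for `docker-compose run extra-model ...`."""
    if output_filename is not None and output_dir is None:
        # Arguments are positional, so a filename cannot be given without a folder.
        raise ValueError("output_filename requires output_dir")
    cmd = ["docker-compose", "run", "extra-model", input_file]
    if output_dir is not None:
        cmd.append(output_dir)
    if output_filename is not None:
        cmd.append(output_filename)
    return cmd


# To execute the command: subprocess.run(cmd, check=True)
cmd = extra_model_command("/package/tests/resources/100_comments.csv",
                          "/io/another_folder", "another_filename.csv")
```

Building the command as a list, rather than a single string, avoids shell-quoting problems if a path contains spaces.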
To use the command-line interface instead, first install extra-model via pip:
pip install extra-model
Next, run the following to download and set up the required embeddings (we use GloVe from Stanford in this project):
extra-model-setup
The embeddings will be downloaded, unzipped, formatted into a space-efficient format, and saved in /embeddings.
If the process fails, it can be safely restarted. If you want to restart the process with new files, delete all files except README.md in the embeddings directory.
Once set up, running extra-model is as simple as:
extra-model tests/resources/100_comments.csv
This will produce a result.csv file in /io. If you want to change the output directory, provide it as a second argument to extra-model like so:
extra-model tests/resources/100_comments.csv /path/to/store/output
The output filename can also be changed, if you want it to be something other than result.csv, by supplying a third argument to extra-model:
extra-model tests/resources/100_comments.csv /path/to/store/output another_filename.csv
To use extra-model as a Python package, first install it via pip:
pip install extra-model
Next, use either the extra-model-setup CLI or docker-compose to download and set up the required embeddings (we use GloVe from Stanford in this project):
extra-model-setup
or
docker-compose run --rm setup
The embeddings will be downloaded, unzipped, and formatted into a space-efficient format. For the Docker-based workflow, the embeddings will be saved to the embeddings directory. For the CLI workflow, files will be saved in /embeddings by default. You can set another directory by providing it as an argument when running extra-model-setup like so:
extra-model-setup /path/to/store/embeddings
If the process fails, it can be safely restarted. If you want to restart the process with new files, delete all files except README.md in the embeddings directory.
Once set up, you can use extra-model by calling the run() function in extra_model/_run.py:
from pathlib import Path

from extra_model._run import run

run(
    input_path=Path("input/path/file.csv"),
    output_path=Path("output/path"),
)
This will process input/path/file.csv and produce a result.csv file in output/path. If you want to change the output filename to something other than result.csv, you can do so by providing an additional argument to run():
from pathlib import Path

from extra_model._run import run

run(
    input_path=Path("input/path"),
    output_path=Path("output/path"),
    output_filename=Path("output_filename.csv"),
)
More examples, as well as an explanation of input/output, are available in the official documentation.
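When calling run() from a larger script, it can be handy to know where the result is expected to land. The sketch below only encodes the convention described above (result.csv inside output_path unless output_filename is given); `expected_result_file` is a hypothetical helper, not part of extra-model's API.

```python
from pathlib import Path


def expected_result_file(output_path: Path,
                         output_filename: Path = Path("result.csv")) -> Path:
    """Return the path where the result is expected, per the defaults described above."""
    return output_path / output_filename


default_result = expected_result_file(Path("output/path"))
custom_result = expected_result_file(Path("output/path"), Path("output_filename.csv"))
```

After run() returns, a check such as `default_result.exists()` is one way to confirm that the output was written.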
extra-model was written by mbalyasin@wayfair.com, mmozer@wayfair.com.