goclassy

An asynchronous, concurrent pipeline for classifying Common Crawl by language, based on fastText's pipeline.
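As a rough illustration of what "asynchronous, concurrent" can mean in Go, here is a minimal sketch of a channel-based worker pool that labels lines of text and groups them by label. It is not taken from `main.go`: the names `runPipeline`, `labeled`, and `classify` are hypothetical, and `classify` is a toy rule standing in for a real language identifier such as fastText.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// labeled pairs a line of text with the label assigned to it.
type labeled struct {
	Lang string
	Text string
}

// classify is a hypothetical stand-in for a real language identifier.
// It uses a toy rule for illustration only and is NOT how fastText works.
func classify(line string) string {
	if strings.Contains(line, "é") || strings.Contains(line, " le ") {
		return "fr"
	}
	return "en"
}

// runPipeline fans lines out to a pool of worker goroutines and groups
// the classified lines by label. Each group is sorted so the result does
// not depend on goroutine scheduling.
func runPipeline(lines []string, workers int) map[string][]string {
	if workers < 1 {
		workers = 1
	}
	in := make(chan string)
	out := make(chan labeled)

	// Workers: read lines, classify them, and emit labeled results.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for line := range in {
				out <- labeled{Lang: classify(line), Text: line}
			}
		}()
	}

	// Producer: feed the input channel, then signal that no more lines follow.
	go func() {
		for _, l := range lines {
			in <- l
		}
		close(in)
	}()

	// Close the output channel once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	// Collector: group results by label.
	byLang := make(map[string][]string)
	for r := range out {
		byLang[r.Lang] = append(byLang[r.Lang], r.Text)
	}
	for _, texts := range byLang {
		sort.Strings(texts)
	}
	return byLang
}

func main() {
	lines := []string{"hello world", "voici le texte", "good morning", "café noir"}
	result := runPipeline(lines, 2)
	fmt.Println(len(result["en"]), len(result["fr"])) // prints "2 2"
}
```

The unbuffered channels keep memory use bounded: the producer only advances as fast as the workers consume, which is one common way to process large inputs on modest hardware.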

For more information, see our paper (https://hal.inria.fr/hal-02148693), cited under References.

If you want to download OSCAR you can do it here.

Note: The downloader and decompression parts of the pipeline are not yet available because they are still experimental; they will be open-sourced in a future release.

References

@inproceedings{ortizsuarez:hal-02148693,
  TITLE = {{Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures}},
  AUTHOR = {Ortiz Su{\'a}rez, Pedro Javier and Sagot, Beno{\^i}t and Romary, Laurent},
  URL = {https://hal.inria.fr/hal-02148693},
  BOOKTITLE = {{7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7)}},
  ADDRESS = {Cardiff, United Kingdom},
  YEAR = {2019},
  MONTH = Jul,
  PDF = {https://hal.inria.fr/hal-02148693/file/Asynchronous_Pipeline_for_Processing_Huge_Corpora_on_Medium_to_Low_Resource_Infrastructures.pdf},
  HAL_ID = {hal-02148693},
  HAL_VERSION = {v1},
}