by Pranav Karnani, Bandish Parikh, Faizan Khan
- Crawlers
  - A web crawler that extracts all research papers published on the ACL Anthology in or after 2010.
  - A separate web crawler that extracts hyperparameter information from PyTorch and Hugging Face.
  - A script that scrapes commonly used datasets, metrics, and tasks from Papers with Code.
- Annotator Scripts
  - An automated annotator that annotates the research papers for us.
- Training model
  - A set of notebooks highlighting our modeling process.
PS: The annotator scripts can be made more robust by cleaning the keywords extracted by the crawlers.
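That cleaning pass is not implemented here; below is a minimal sketch of what it could look like. It assumes the crawlers emit one-column CSVs of raw keyword strings under data/dataset/ (the file name keywords.csv is hypothetical; adjust to the real schema):

```python
# Hypothetical keyword-cleaning pass over the crawler output.
# Assumes a CSV with one keyword per row; adjust to the real schema.
import csv

def clean_keywords(in_path: str, out_path: str) -> None:
    seen = set()
    cleaned = []
    with open(in_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row:
                continue
            kw = row[0].strip().lower()
            # Drop empties, single characters, and duplicates.
            if len(kw) < 2 or kw in seen:
                continue
            seen.add(kw)
            cleaned.append(kw)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for kw in cleaned:
            writer.writerow([kw])

if __name__ == "__main__":
    clean_keywords("data/dataset/keywords.csv", "data/dataset/keywords_clean.csv")
```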
- Run all the crawlers via the main.py file
- Run the annotator script pipeline
- Use the notebooks in the training model directory to train models
- Download the dataset from Papers with Code: https://production-media.paperswithcode.com/about/evaluation-tables.json.gz
- Move the downloaded file into the ResearchPaper-NER directory and rename it to evaluation-tables.json (a scripted version of this step is sketched after this list)
- Run the following commands:
  1. python crawlers/ACLScraper/main.py
  2. python crawlers/HuggingFaceScraper/main.py
  3. python crawlers/main.py
- All papers from aclanthology.org will be downloaded into the folder named data. The PDF and text files will be stored in separate directories.
- The hyperparameters, keywords, datasets, metrics, and tasks will be saved as separate CSV files in the dataset directory inside data
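As promised above, a minimal sketch of the download-and-rename step. It fetches the archive and decompresses it to evaluation-tables.json; if the repository's scripts actually expect the raw .gz bytes under that name, replace the decompression with a plain rename:

```python
# Sketch: fetch the Papers with Code evaluation tables and place them
# where the crawlers expect them. Run from the ResearchPaper-NER directory.
import gzip
import shutil
import urllib.request

URL = "https://production-media.paperswithcode.com/about/evaluation-tables.json.gz"

# Download the gzipped archive.
urllib.request.urlretrieve(URL, "evaluation-tables.json.gz")

# Decompress to the file name the repository expects.
with gzip.open("evaluation-tables.json.gz", "rb") as src, \
        open("evaluation-tables.json", "wb") as dst:
    shutil.copyfileobj(src, dst)
```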
- Run the command: python annotator_scripts/main.py
- All downloaded papers will be tokenized with spaCy and automatically annotated for consumption by Label Studio (sketched below)
- These papers will be stored in a directory named tokenized inside the data directory
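The output format is easiest to see with an example. The sketch below is not the repository's annotator: it tokenizes one text with spaCy and wraps keyword matches in Label Studio's pre-annotation JSON (a task with a predictions list). The label names, keyword lexicon, and from_name/to_name values are placeholders that must match your Label Studio labeling config:

```python
# Sketch: spaCy tokenization + Label Studio pre-annotations for one document.
# Not the repository's actual annotator; labels and keywords are placeholders.
import json
import spacy

nlp = spacy.blank("en")  # a blank pipeline is enough for tokenization
text = "We evaluate BLEU on the WMT14 dataset."
keywords = {"bleu": "METRIC", "wmt14": "DATASET"}  # placeholder lexicon

doc = nlp(text)
results = []
for token in doc:
    label = keywords.get(token.text.lower())
    if label:
        # Label Studio pre-annotation: character offsets into the raw text.
        results.append({
            "from_name": "label",
            "to_name": "text",
            "type": "labels",
            "value": {
                "start": token.idx,
                "end": token.idx + len(token.text),
                "text": token.text,
                "labels": [label],
            },
        })

task = {"data": {"text": text}, "predictions": [{"result": results}]}
print(json.dumps(task, indent=2))
```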
- Train the model on masked language modeling first, using roberta_mlm_aws.ipynb
- Take the resulting pretrained masked language model and fine-tune it on our custom NER task using 11711_ner.ipynb
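The full MLM pipeline lives in roberta_mlm_aws.ipynb; for orientation only, here is a minimal sketch of the same idea with Hugging Face transformers. The data path data/text/*.txt and the training arguments are assumptions, not the notebook's settings:

```python
# Sketch of domain-adaptive masked language modeling with RoBERTa.
# The real pipeline is in roberta_mlm_aws.ipynb; paths here are assumptions.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Assumes the crawled paper text files sit under data/text/.
dataset = load_dataset("text", data_files={"train": "data/text/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked in each batch.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_out", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("mlm_out")
tokenizer.save_pretrained("mlm_out")
```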
PS: Sorry for the tqdm massacre in 11711_ner.ipynb 🥲
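And a matching sketch of the NER fine-tuning step, again an illustration rather than 11711_ner.ipynb's code: the tag set and the toy example are placeholders, and the mlm_out checkpoint path ties back to the MLM sketch above:

```python
# Sketch: fine-tune the MLM-adapted checkpoint for token classification.
# Labels and data are placeholders; the real code is in 11711_ner.ipynb.
import torch
from transformers import RobertaForTokenClassification, RobertaTokenizerFast

labels = ["O", "B-DATASET", "I-DATASET", "B-METRIC", "I-METRIC"]  # placeholder tags
tokenizer = RobertaTokenizerFast.from_pretrained("mlm_out")
model = RobertaForTokenClassification.from_pretrained(
    "mlm_out",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

# One toy example: word-level tags aligned to whitespace-split words.
words = ["We", "report", "BLEU", "on", "WMT14", "."]
tags = ["O", "O", "B-METRIC", "O", "B-DATASET", "O"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level tags to subword tokens; mask special tokens with -100.
word_ids = enc.word_ids()
label_ids = [-100 if w is None else labels.index(tags[w]) for w in word_ids]
enc["labels"] = torch.tensor([label_ids])

loss = model(**enc).loss  # loss for a single training step
loss.backward()
print(float(loss))
```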