by Pranav Karnani, Bandish Parikh, Faizan Khan
- Crawlers
  - A web crawler that extracts all research papers published on the ACL Anthology in or after 2010.
  - A separate web crawler that extracts hyperparameter information from PyTorch and Hugging Face.
  - A script that scrapes commonly used datasets, metrics, and tasks from Papers with Code.
- Annotator Scripts
  - An automated annotator that annotates the research papers for us.
- Training model
  - A set of notebooks highlighting our modeling process.
PS: The annotator scripts can be made more robust by cleaning the keywords extracted by the crawlers.
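That cleaning pass is not implemented here; below is a minimal sketch of what it could look like. It assumes the crawlers emit one-column CSVs of raw keyword strings under data/dataset/ (the file name keywords.csv is hypothetical; adjust to the real schema):

```python
# Hypothetical keyword-cleaning pass over the crawler output.
# Assumes a CSV with one keyword per row; adjust to the real schema.
import csv

def clean_keywords(in_path: str, out_path: str) -> None:
    seen = set()
    cleaned = []
    with open(in_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row:
                continue
            kw = row[0].strip().lower()
            # Drop empties, single characters, and duplicates.
            if len(kw) < 2 or kw in seen:
                continue
            seen.add(kw)
            cleaned.append(kw)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for kw in cleaned:
            writer.writerow([kw])

if __name__ == "__main__":
    clean_keywords("data/dataset/keywords.csv", "data/dataset/keywords_clean.csv")
```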
- Run all the crawlers via the main.py file
- Run the annotator script pipeline
- Use the notebooks in the training model directory to train models
- Download the dataset from Papers with Code: https://production-media.paperswithcode.com/about/evaluation-tables.json.gz
- Move the downloaded file into the ResearchPaper-NER directory and rename it to evaluation-tables.json (a scripted version of this step is sketched after this list)
- Run the following commands:
  1. python crawlers/ACLScraper/main.py
  2. python crawlers/HuggingFaceScraper/main.py
  3. python crawlers/main.py
- All papers from aclanthology.org will be downloaded into the folder named data. The PDF and text files will be stored in separate directories.
- The hyperparameters, keywords, datasets, metrics, and tasks will be saved as separate CSV files in the dataset directory inside data
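As promised above, a minimal sketch of the download-and-rename step. It fetches the archive and decompresses it to evaluation-tables.json; if the repository's scripts actually expect the raw .gz bytes under that name, replace the decompression with a plain rename:

```python
# Sketch: fetch the Papers with Code evaluation tables and place them
# where the crawlers expect them. Run from the ResearchPaper-NER directory.
import gzip
import shutil
import urllib.request

URL = "https://production-media.paperswithcode.com/about/evaluation-tables.json.gz"

# Download the gzipped archive.
urllib.request.urlretrieve(URL, "evaluation-tables.json.gz")

# Decompress to the file name the repository expects.
with gzip.open("evaluation-tables.json.gz", "rb") as src, \
        open("evaluation-tables.json", "wb") as dst:
    shutil.copyfileobj(src, dst)
```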
- Run the command: python annotator_scripts/main.py
- All downloaded papers will be tokenized with spaCy and automatically annotated for consumption by Label Studio (sketched below)
- These papers will be stored in a directory named tokenized inside the data directory
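The output format is easiest to see with an example. The sketch below is not the repository's annotator: it tokenizes one text with spaCy and wraps keyword matches in Label Studio's pre-annotation JSON (a task with a predictions list). The label names, keyword lexicon, and from_name/to_name values are placeholders that must match your Label Studio labeling config:

```python
# Sketch: spaCy tokenization + Label Studio pre-annotations for one document.
# Not the repository's actual annotator; labels and keywords are placeholders.
import json
import spacy

nlp = spacy.blank("en")  # a blank pipeline is enough for tokenization
text = "We evaluate BLEU on the WMT14 dataset."
keywords = {"bleu": "METRIC", "wmt14": "DATASET"}  # placeholder lexicon

doc = nlp(text)
results = []
for token in doc:
    label = keywords.get(token.text.lower())
    if label:
        # Label Studio pre-annotation: character offsets into the raw text.
        results.append({
            "from_name": "label",
            "to_name": "text",
            "type": "labels",
            "value": {
                "start": token.idx,
                "end": token.idx + len(token.text),
                "text": token.text,
                "labels": [label],
            },
        })

task = {"data": {"text": text}, "predictions": [{"result": results}]}
print(json.dumps(task, indent=2))
```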
- Train the model on masked language modeling first, using roberta_mlm_aws.ipynb
- Take the resulting pretrained masked language model and fine-tune it on our custom NER task using 11711_ner.ipynb
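The full MLM pipeline lives in roberta_mlm_aws.ipynb; for orientation only, here is a minimal sketch of the same idea with Hugging Face transformers. The data path data/text/*.txt and the training arguments are assumptions, not the notebook's settings:

```python
# Sketch of domain-adaptive masked language modeling with RoBERTa.
# The real pipeline is in roberta_mlm_aws.ipynb; paths here are assumptions.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Assumes the crawled paper text files sit under data/text/.
dataset = load_dataset("text", data_files={"train": "data/text/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked in each batch.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_out", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("mlm_out")
tokenizer.save_pretrained("mlm_out")
```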
PS: Sorry for the tqdm massacre in 11711_ner.ipynb 🥲
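And a matching sketch of the NER fine-tuning step, again an illustration rather than 11711_ner.ipynb's code: the tag set and the toy example are placeholders, and the mlm_out checkpoint path ties back to the MLM sketch above:

```python
# Sketch: fine-tune the MLM-adapted checkpoint for token classification.
# Labels and data are placeholders; the real code is in 11711_ner.ipynb.
import torch
from transformers import RobertaForTokenClassification, RobertaTokenizerFast

labels = ["O", "B-DATASET", "I-DATASET", "B-METRIC", "I-METRIC"]  # placeholder tags
tokenizer = RobertaTokenizerFast.from_pretrained("mlm_out")
model = RobertaForTokenClassification.from_pretrained(
    "mlm_out",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

# One toy example: word-level tags aligned to whitespace-split words.
words = ["We", "report", "BLEU", "on", "WMT14", "."]
tags = ["O", "O", "B-METRIC", "O", "B-DATASET", "O"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level tags to subword tokens; mask special tokens with -100.
word_ids = enc.word_ids()
label_ids = [-100 if w is None else labels.index(tags[w]) for w in word_ids]
enc["labels"] = torch.tensor([label_ids])

loss = model(**enc).loss  # loss for a single training step
loss.backward()
print(float(loss))
```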