Commit
Showing 6 changed files with 333 additions and 0 deletions.
56 changes: 56 additions & 0 deletions
model_cards/aliosm/ai-soco-c++-roberta-small-clas/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
--- | ||
language: "c++" | ||
tags: | ||
- exbert | ||
- authorship-identification | ||
- fire2020 | ||
- pan2020 | ||
- ai-soco | ||
- classification | ||
license: "mit" | ||
datasets: | ||
- ai-soco | ||
metrics: | ||
- accuracy | ||
--- | ||
|
||
# ai-soco-c++-roberta-small-clas | ||
|
||
## Model description | ||
|
||
The `ai-soco-c++-roberta-small` model fine-tuned on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) task. | ||
|
||
#### How to use | ||
|
||
You can use the model directly after tokenizing the input with the tokenizer provided alongside the model files. | ||
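A minimal sketch of the usage described above, assuming the checkpoint is published on the Hugging Face Hub under the ID that appears in this card's exBERT link and that it loads through the generic `AutoTokenizer` and `AutoModelForSequenceClassification` classes. The helper names `load_classifier` and `predict_class` are hypothetical, and the returned integer is a class index whose mapping to author IDs is not documented in this card.

```python
# Hub ID taken from the exBERT link in this card; assumed to be loadable as-is.
MODEL_ID = "aliosm/ai-soco-c++-roberta-small-clas"


def load_classifier(model_id: str = MODEL_ID):
    """Load the tokenizer and the fine-tuned classifier (downloads the files)."""
    # Imported lazily so the module can be imported without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    return tokenizer, model


def predict_class(code: str, tokenizer, model) -> int:
    """Return the index of the highest-scoring class for one C++ source string."""
    import torch

    # Mirror the card's preprocessing: every run of 4 spaces becomes one tab.
    text = code.replace("    ", "\t")
    # Truncate to the 512-token limit stated under "Training procedure".
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item()
```

Calling `load_classifier()` needs network access plus the `transformers` and `torch` packages; `predict_class` can then be applied to any C++ source string.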
|
||
#### Limitations and bias | ||
|
||
The model is limited to the C++ programming language. | ||
|
||
## Training data | ||
|
||
The model was initialized from the [`ai-soco-c++-roberta-small`](https://github.com/huggingface/transformers/blob/master/model_cards/aliosm/ai-soco-c++-roberta-small) model and trained on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset for text classification. | ||
|
||
## Training procedure | ||
|
||
The model was trained on the Google Colab platform using a V100 GPU for 10 epochs, with a batch size of 32 and a maximum sequence length of 512 (longer sequences were truncated). Each run of 4 consecutive spaces was converted to a single tab character (`\t`) before tokenization. | ||
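The space-to-tab conversion can be sketched as follows. The card does not include the author's preprocessing code, so this is only one plausible reading: non-overlapping runs of exactly four spaces are replaced from left to right, and any leftover spaces are kept. The function name `spaces_to_tabs` is hypothetical.

```python
def spaces_to_tabs(code: str, width: int = 4) -> str:
    """Replace each non-overlapping run of `width` spaces with one tab."""
    return code.replace(" " * width, "\t")


sample = "int main() {\n    return 0;\n}"
print(repr(spaces_to_tabs(sample)))  # 'int main() {\n\treturn 0;\n}'

# Six spaces become one tab followed by the two leftover spaces.
print(repr(spaces_to_tabs("      x")))  # '\t  x'
```

The same conversion should be applied at inference time so that inputs match what the tokenizer saw during training.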
|
||
## Eval results | ||
|
||
The model achieved 93.19%/92.88% accuracy on the AI-SOCO task and ranked 4th. | ||
|
||
### BibTeX entry and citation info | ||
|
||
```bibtex | ||
@inproceedings{ai-soco-2020-fire, | ||
title = "Overview of the {PAN@FIRE} 2020 Task on {Authorship Identification of SOurce COde (AI-SOCO)}", | ||
author = "Fadel, Ali and Musleh, Husam and Tuffaha, Ibraheem and Al-Ayyoub, Mahmoud and Jararweh, Yaser and Benkhelifa, Elhadj and Rosso, Paolo", | ||
booktitle = "Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)", | ||
year = "2020" | ||
} | ||
``` | ||
|
||
<a href="https://huggingface.co/exbert/?model=aliosm/ai-soco-c++-roberta-small-clas"> | ||
<img width="300px" src="https://hf-dinosaur.huggingface.co/exbert/button.png"> | ||
</a> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
--- | ||
language: "c++" | ||
tags: | ||
- exbert | ||
- authorship-identification | ||
- fire2020 | ||
- pan2020 | ||
- ai-soco | ||
license: "mit" | ||
datasets: | ||
- ai-soco | ||
metrics: | ||
- perplexity | ||
--- | ||
|
||
# ai-soco-c++-roberta-small | ||
|
||
## Model description | ||
|
||
A RoBERTa model with 6 layers and 12 attention heads, pre-trained from scratch on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset, which consists of C++ source code crawled from the CodeForces website. | ||
|
||
## Intended uses & limitations | ||
|
||
The model can be used for code classification, authorship identification, and other downstream tasks on C++ source code. | ||
|
||
#### How to use | ||
|
||
You can use the model directly after tokenizing the input with the tokenizer provided alongside the model files. | ||
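Because this checkpoint was pre-trained with an MLM objective, one way to try it is a fill-mask query. The sketch below assumes the model is available on the Hugging Face Hub under the ID from this card's exBERT link and that it works with the `transformers` fill-mask pipeline; the helper `top_completions` and the `____` placeholder convention are hypothetical.

```python
# Hub ID taken from the exBERT link in this card; assumed to be loadable as-is.
MODEL_ID = "aliosm/ai-soco-c++-roberta-small"


def top_completions(code_with_blank: str, model_id: str = MODEL_ID,
                    blank: str = "____", k: int = 5):
    """Return the k most likely tokens for the first `blank` in a C++ snippet."""
    # Imported lazily so the module can be imported without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Apply the card's preprocessing, then swap the placeholder for the
    # tokenizer's own mask token rather than hard-coding it.
    text = code_with_blank.replace("    ", "\t")
    text = text.replace(blank, fill.tokenizer.mask_token, 1)
    return [(c["token_str"], c["score"]) for c in fill(text, top_k=k)]
```

For example, `top_completions("for (int i = 0; i < n; ____)")` would return candidate tokens with their scores; it needs network access and the `transformers` package.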
|
||
#### Limitations and bias | ||
|
||
The model is limited to the C++ programming language. | ||
|
||
## Training data | ||
|
||
The model was initialized randomly and trained on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset, which contains 100K C++ source code samples. | ||
|
||
## Training procedure | ||
|
||
The model was trained on the Google Colab platform with 8 TPU cores for 200 epochs, with a batch size of 16\*8, a maximum sequence length of 512, and the MLM objective. Other parameters were left at the defaults defined in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script. Each run of 4 consecutive spaces was converted to a single tab character (`\t`) before tokenization. | ||
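As a rough worked example of what these numbers imply, the sketch below reads `16*8` as a per-core batch of 16 across 8 TPU cores and assumes all 100K samples were seen in every epoch. The card confirms neither assumption, so the step counts are illustrative only; the function name `training_steps` is hypothetical.

```python
import math


def training_steps(num_samples: int, per_core_batch: int, num_cores: int, epochs: int):
    """Return (global batch size, steps per epoch, total optimizer steps)."""
    global_batch = per_core_batch * num_cores
    # The last, partially filled batch still counts as one step.
    steps_per_epoch = math.ceil(num_samples / global_batch)
    return global_batch, steps_per_epoch, steps_per_epoch * epochs


# Assumed: 100K samples per epoch, batch of 16 on each of 8 cores, 200 epochs.
print(training_steps(100_000, 16, 8, 200))  # (128, 782, 156400)
```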
|
||
### BibTeX entry and citation info | ||
|
||
```bibtex | ||
@inproceedings{ai-soco-2020-fire, | ||
title = "Overview of the {PAN@FIRE} 2020 Task on {Authorship Identification of SOurce COde (AI-SOCO)}", | ||
author = "Fadel, Ali and Musleh, Husam and Tuffaha, Ibraheem and Al-Ayyoub, Mahmoud and Jararweh, Yaser and Benkhelifa, Elhadj and Rosso, Paolo", | ||
booktitle = "Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)", | ||
year = "2020" | ||
} | ||
``` | ||
|
||
<a href="https://huggingface.co/exbert/?model=aliosm/ai-soco-c++-roberta-small"> | ||
<img width="300px" src="https://hf-dinosaur.huggingface.co/exbert/button.png"> | ||
</a> |
56 changes: 56 additions & 0 deletions
model_cards/aliosm/ai-soco-c++-roberta-tiny-96-clas/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
--- | ||
language: "c++" | ||
tags: | ||
- exbert | ||
- authorship-identification | ||
- fire2020 | ||
- pan2020 | ||
- ai-soco | ||
- classification | ||
license: "mit" | ||
datasets: | ||
- ai-soco | ||
metrics: | ||
- accuracy | ||
--- | ||
|
||
# ai-soco-c++-roberta-tiny-96-clas | ||
|
||
## Model description | ||
|
||
The `ai-soco-c++-roberta-tiny-96` model fine-tuned on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) task. | ||
|
||
#### How to use | ||
|
||
You can use the model directly after tokenizing the input with the tokenizer provided alongside the model files. | ||
|
||
#### Limitations and bias | ||
|
||
The model is limited to the C++ programming language. | ||
|
||
## Training data | ||
|
||
The model was initialized from the [`ai-soco-c++-roberta-tiny-96`](https://github.com/huggingface/transformers/blob/master/model_cards/aliosm/ai-soco-c++-roberta-tiny-96) model and trained on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset for text classification. | ||
|
||
## Training procedure | ||
|
||
The model was trained on the Google Colab platform using a V100 GPU for 10 epochs, with a batch size of 16 and a maximum sequence length of 512 (longer sequences were truncated). Each run of 4 consecutive spaces was converted to a single tab character (`\t`) before tokenization. | ||
|
||
## Eval results | ||
|
||
The model achieved 91.12%/91.02% accuracy on the AI-SOCO task and ranked 7th. | ||
|
||
### BibTeX entry and citation info | ||
|
||
```bibtex | ||
@inproceedings{ai-soco-2020-fire, | ||
title = "Overview of the {PAN@FIRE} 2020 Task on {Authorship Identification of SOurce COde (AI-SOCO)}", | ||
author = "Fadel, Ali and Musleh, Husam and Tuffaha, Ibraheem and Al-Ayyoub, Mahmoud and Jararweh, Yaser and Benkhelifa, Elhadj and Rosso, Paolo", | ||
booktitle = "Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)", | ||
year = "2020" | ||
} | ||
``` | ||
|
||
<a href="https://huggingface.co/exbert/?model=aliosm/ai-soco-c++-roberta-tiny-96-clas"> | ||
<img width="300px" src="https://hf-dinosaur.huggingface.co/exbert/button.png"> | ||
</a> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
--- | ||
language: "c++" | ||
tags: | ||
- exbert | ||
- authorship-identification | ||
- fire2020 | ||
- pan2020 | ||
- ai-soco | ||
license: "mit" | ||
datasets: | ||
- ai-soco | ||
metrics: | ||
- perplexity | ||
--- | ||
|
||
# ai-soco-c++-roberta-tiny-96 | ||
|
||
## Model description | ||
|
||
A RoBERTa model with 1 layer and 96 attention heads, pre-trained from scratch on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset, which consists of C++ source code crawled from the CodeForces website. | ||
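To illustrate what 96 attention heads means in practice, the sketch below computes the per-head dimension. The card does not state the hidden size; 768 (the `RobertaConfig` default) is assumed here purely for illustration, and the helper name `head_dim` is hypothetical.

```python
def head_dim(hidden_size: int, num_heads: int) -> int:
    """Return the width of each attention head; the split must be exact."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


ASSUMED_HIDDEN_SIZE = 768  # assumption: not stated in the card

print(head_dim(ASSUMED_HIDDEN_SIZE, 96))  # 8
print(head_dim(ASSUMED_HIDDEN_SIZE, 12))  # 64
```

Under that assumption, the 96-head variant trades wide heads for many narrow ones: each head attends over an 8-dimensional projection instead of a 64-dimensional one.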
|
||
## Intended uses & limitations | ||
|
||
The model can be used for code classification, authorship identification, and other downstream tasks on C++ source code. | ||
|
||
#### How to use | ||
|
||
You can use the model directly after tokenizing the input with the tokenizer provided alongside the model files. | ||
|
||
#### Limitations and bias | ||
|
||
The model is limited to the C++ programming language. | ||
|
||
## Training data | ||
|
||
The model was initialized randomly and trained on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset, which contains 100K C++ source code samples. | ||
|
||
## Training procedure | ||
|
||
The model was trained on the Google Colab platform with 8 TPU cores for 200 epochs, with a batch size of 16\*8, a maximum sequence length of 512, and the MLM objective. Other parameters were left at the defaults defined in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script. Each run of 4 consecutive spaces was converted to a single tab character (`\t`) before tokenization. | ||
|
||
### BibTeX entry and citation info | ||
|
||
```bibtex | ||
@inproceedings{ai-soco-2020-fire, | ||
title = "Overview of the {PAN@FIRE} 2020 Task on {Authorship Identification of SOurce COde (AI-SOCO)}", | ||
author = "Fadel, Ali and Musleh, Husam and Tuffaha, Ibraheem and Al-Ayyoub, Mahmoud and Jararweh, Yaser and Benkhelifa, Elhadj and Rosso, Paolo", | ||
booktitle = "Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)", | ||
year = "2020" | ||
} | ||
``` | ||
|
||
<a href="https://huggingface.co/exbert/?model=aliosm/ai-soco-c++-roberta-tiny-96"> | ||
<img width="300px" src="https://hf-dinosaur.huggingface.co/exbert/button.png"> | ||
</a> |
56 changes: 56 additions & 0 deletions
model_cards/aliosm/ai-soco-c++-roberta-tiny-clas/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
--- | ||
language: "c++" | ||
tags: | ||
- exbert | ||
- authorship-identification | ||
- fire2020 | ||
- pan2020 | ||
- ai-soco | ||
- classification | ||
license: "mit" | ||
datasets: | ||
- ai-soco | ||
metrics: | ||
- accuracy | ||
--- | ||
|
||
# ai-soco-c++-roberta-tiny-clas | ||
|
||
## Model description | ||
|
||
The `ai-soco-c++-roberta-tiny` model fine-tuned on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) task. | ||
|
||
#### How to use | ||
|
||
You can use the model directly after tokenizing the input with the tokenizer provided alongside the model files. | ||
|
||
#### Limitations and bias | ||
|
||
The model is limited to the C++ programming language. | ||
|
||
## Training data | ||
|
||
The model was initialized from the [`ai-soco-c++-roberta-tiny`](https://github.com/huggingface/transformers/blob/master/model_cards/aliosm/ai-soco-c++-roberta-tiny) model and trained on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset for text classification. | ||
|
||
## Training procedure | ||
|
||
The model was trained on the Google Colab platform using a V100 GPU for 10 epochs, with a batch size of 32 and a maximum sequence length of 512 (longer sequences were truncated). Each run of 4 consecutive spaces was converted to a single tab character (`\t`) before tokenization. | ||
|
||
## Eval results | ||
|
||
The model achieved 87.66%/87.46% accuracy on the AI-SOCO task and ranked 9th. | ||
|
||
### BibTeX entry and citation info | ||
|
||
```bibtex | ||
@inproceedings{ai-soco-2020-fire, | ||
title = "Overview of the {PAN@FIRE} 2020 Task on {Authorship Identification of SOurce COde (AI-SOCO)}", | ||
author = "Fadel, Ali and Musleh, Husam and Tuffaha, Ibraheem and Al-Ayyoub, Mahmoud and Jararweh, Yaser and Benkhelifa, Elhadj and Rosso, Paolo", | ||
booktitle = "Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)", | ||
year = "2020" | ||
} | ||
``` | ||
|
||
<a href="https://huggingface.co/exbert/?model=aliosm/ai-soco-c++-roberta-tiny-clas"> | ||
<img width="300px" src="https://hf-dinosaur.huggingface.co/exbert/button.png"> | ||
</a> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
--- | ||
language: "c++" | ||
tags: | ||
- exbert | ||
- authorship-identification | ||
- fire2020 | ||
- pan2020 | ||
- ai-soco | ||
license: "mit" | ||
datasets: | ||
- ai-soco | ||
metrics: | ||
- perplexity | ||
--- | ||
|
||
# ai-soco-c++-roberta-tiny | ||
|
||
## Model description | ||
|
||
A RoBERTa model with 1 layer and 12 attention heads, pre-trained from scratch on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset, which consists of C++ source code crawled from the CodeForces website. | ||
|
||
## Intended uses & limitations | ||
|
||
The model can be used for code classification, authorship identification, and other downstream tasks on C++ source code. | ||
|
||
#### How to use | ||
|
||
You can use the model directly after tokenizing the input with the tokenizer provided alongside the model files. | ||
|
||
#### Limitations and bias | ||
|
||
The model is limited to the C++ programming language. | ||
|
||
## Training data | ||
|
||
The model was initialized randomly and trained on the [AI-SOCO](https://sites.google.com/view/ai-soco-2020) dataset, which contains 100K C++ source code samples. | ||
|
||
## Training procedure | ||
|
||
The model was trained on the Google Colab platform with 8 TPU cores for 200 epochs, with a batch size of 32\*8, a maximum sequence length of 512, and the MLM objective. Other parameters were left at the defaults defined in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script. Each run of 4 consecutive spaces was converted to a single tab character (`\t`) before tokenization. | ||
|
||
### BibTeX entry and citation info | ||
|
||
```bibtex | ||
@inproceedings{ai-soco-2020-fire, | ||
title = "Overview of the {PAN@FIRE} 2020 Task on {Authorship Identification of SOurce COde (AI-SOCO)}", | ||
author = "Fadel, Ali and Musleh, Husam and Tuffaha, Ibraheem and Al-Ayyoub, Mahmoud and Jararweh, Yaser and Benkhelifa, Elhadj and Rosso, Paolo", | ||
booktitle = "Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)", | ||
year = "2020" | ||
} | ||
``` | ||
|
||
<a href="https://huggingface.co/exbert/?model=aliosm/ai-soco-c++-roberta-tiny"> | ||
<img width="300px" src="https://hf-dinosaur.huggingface.co/exbert/button.png"> | ||
</a> |