
TOPICAL

Topical introduces a deep learning model that automatically tags source code repositories with semantic labels. Topical incorporates an embedding mechanism that projects the textual information, the full dependency graph, and the source code structure into a common representational embedding that exploits the synergies between these domains.
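For intuition only, the following is a minimal sketch of this kind of multi-modal fusion, assuming precomputed per-modality embeddings. It is not the actual Topical architecture; the dimensions, the linear projections, and the averaging fusion are all illustrative assumptions.

import torch
import torch.nn as nn

class HybridEmbedder(nn.Module):
    """Project three modality embeddings into one shared space and fuse them."""

    def __init__(self, text_dim=768, graph_dim=128, code_dim=768, common_dim=256):
        super().__init__()
        # One linear projection per modality into the common space.
        self.text_proj = nn.Linear(text_dim, common_dim)
        self.graph_proj = nn.Linear(graph_dim, common_dim)
        self.code_proj = nn.Linear(code_dim, common_dim)

    def forward(self, text_emb, graph_emb, code_emb):
        # Fuse by averaging the projections; concatenation or attention
        # are common alternatives.
        return (self.text_proj(text_emb)
                + self.graph_proj(graph_emb)
                + self.code_proj(code_emb)) / 3

The fused vector can then feed a downstream classification head that predicts the semantic labels.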

GitHub Crawler

We developed a GitHub crawler that combines the official API with page scraping for additional metadata (git commits, repository tree, etc.). GitHub repositories are often classified by their owners using hand-picked topics, which can contain abbreviations, typos, and repetitions. Because of the large variation in topic names, GitHub also defines 480 featured topics, a limited set of predefined topics that owners can associate with their repositories. To obtain a unique label for each category, the crawler maps the non-featured (hand-picked) topics to the GitHub featured topics using partial token matching based on Levenshtein distance, as sketched below.
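As a hedged illustration of that mapping step (not the crawler's actual code), the snippet below matches a hand-picked topic to the closest featured topic when the edit distance between the whole strings or any partial token is small enough. The tokenisation scheme and the distance threshold are assumptions.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def map_to_featured(topic, featured, max_dist=2):
    # Return the closest featured topic, or None if nothing is close enough.
    tokens = topic.lower().replace("-", " ").split()
    best, best_dist = None, max_dist + 1
    for candidate in featured:
        dist = min([levenshtein(topic.lower(), candidate)]
                   + [levenshtein(t, candidate) for t in tokens])
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best

For example, map_to_featured("machinelearning", ["machine-learning", "nlp"]) returns "machine-learning" (edit distance 1).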

To scrape the dataset, execute this command from the crawler directory:

python api\github_crawler.py --topics topics.json

after ensuring that the crawler directory contains a JSON file with the topics to scrape (either as a simple list or as a nested dictionary). The topics.json above is just an example of how to pass topics to the crawler.
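For illustration, a minimal topics.json in the simple-list form could look like the snippet below; the exact schema is an assumption, and in the nested form category names would map to lists of topics. See examples/featured_topics.json for the format actually used.

["machine-learning", "natural-language-processing", "security"]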

Please see examples/featured_topics.json for the tags used in our specific use case, and examples/dataset.zip for the dataset itself. As stated above, you can re-create the 20 topics and the dataset using featured_topics.json:

python api\github_crawler.py --topics examples/featured_topics.json

Maintenance Level

This repository is maintained to fix bugs and ensure the stability of the existing codebase. However, please note that the team does not plan to introduce new features or enhancements in the future.

Script Hybrid Embedding Generation

Once your dataset is ready, generate the DataFrame containing the repository information and script embeddings. To do so, execute this command from the Topical folder, making sure your dataset is in a dataset folder inside the Topical folder:

python compute_embeddings.py
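Assuming the script writes its output as a pickled pandas DataFrame (the file name and format below are assumptions; check compute_embeddings.py for the actual ones), you could sanity-check the result like this:

import pandas as pd

# Hypothetical output path -- check compute_embeddings.py for the real one.
df = pd.read_pickle("dataset/embeddings.pkl")
print(df.shape)             # one row per repository
print(df.columns.tolist())  # repository metadata plus embedding columns
print(df.head())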

Base with Classification Head

Once the DataFrame containing all the information has been generated, you can run the Topical classification head from the same Topical directory:

python run.py

To change any of the training parameters, display all the available arguments from the Topical directory:

python run.py --help
