Project Ideas Improve File Classification

Philippe Ombredanne edited this page Feb 28, 2020 · 2 revisions

Improve file classification in ScanCode

ScanCode currently detects the programming language, file type, and MIME type of files, but this detection is not as accurate as it could be. We also need a better way to classify files for further automation, particularly to identify the likely "purpose" of a file: for example, to focus on source and binary files that represent code, as opposed to files that are documentation, scripts, etc. This is similar to the concept of "facets" in the ClearlyDefined project.

The first goal of this project is to improve the quality of detection of file characteristics, including the programming language (detection currently uses only Pygments) and the Linux "magic" file type. The second goal is to design and implement a flexible framework of rules that automates assigning a "purpose" to files, possibly with the help of machine learning.

What do we mean by purpose? For instance, the obvious purpose of a Makefile is to be a build script. An HTML file may serve as documentation in some cases and be part of the core code in others. Other files may exist for tests, configuration, etc. The purpose or classification of each file matters because the license of core code may be more important than the license of test code.
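
To make the idea of purpose rules more concrete, here is a minimal sketch of what a rule-based classifier could look like. It is not ScanCode code: the `PurposeRule` class, the `classify_purpose` function, the rule list, and the purpose labels (`build`, `test`, `documentation`, `configuration`, `core`) are all hypothetical names chosen for illustration. It only assumes that rules are ordered and that the first matching rule wins, so that a path-based rule can override a generic extension rule.

```python
import fnmatch
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class PurposeRule:
    """Hypothetical rule: a glob pattern mapped to a purpose label."""
    pattern: str
    purpose: str
    match_on: str = "name"  # "name": file name only; "path": full path


# Illustrative rules only. Order matters: the first match wins, so the
# path-based rules come before the generic extension-based rules.
RULES = [
    PurposeRule("makefile", "build"),
    PurposeRule("*.mk", "build"),
    PurposeRule("*/tests/*", "test", match_on="path"),
    PurposeRule("test_*.py", "test"),
    PurposeRule("*/docs/*", "documentation", match_on="path"),
    PurposeRule("*.md", "documentation"),
    PurposeRule("*.cfg", "configuration"),
    PurposeRule("*.ini", "configuration"),
    PurposeRule("*.html", "core"),
    PurposeRule("*.py", "core"),
    PurposeRule("*.c", "core"),
]


def classify_purpose(path, rules=RULES, default="unknown"):
    """Return the purpose of the first rule matching `path`, else `default`."""
    posix = PurePosixPath(path.replace("\\", "/"))
    name = posix.name.lower()
    # Add a leading slash so "*/docs/*" also matches a top-level "docs/".
    full = "/" + str(posix).lower().lstrip("/")
    for rule in rules:
        target = full if rule.match_on == "path" else name
        if fnmatch.fnmatchcase(target, rule.pattern):
            return rule.purpose
    return default


if __name__ == "__main__":
    for p in ("Makefile", "docs/index.html", "src/app/templates/index.html"):
        print(p, "->", classify_purpose(p))
```

With these example rules, the same HTML file type is classified as `documentation` under `docs/` and as `core` elsewhere, which mirrors the HTML case described above. A real framework would likely combine such path rules with the detected language and file type, and rules like these could also serve as a baseline or as features for a machine learning approach.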
