by Yehor Atell Krasnopolskyi and Oliver Vainikko.
In this project, we have implemented a tool that you can use to cluster different Python programs. We used Python 3 together with several data analysis packages (such as `sklearn`).
The files `analysis.py` and `vis.py` contain all the working routines, while `main.py` and `clustering_tests.py` are executable scripts with usage examples. `main.py` exposes the basic functionality of our program (for instance, finding the closest program for each program in the data set), and `clustering_tests.py` shows how to test and visualise various clustering algorithms.
We used a data set collected at the University of Tartu by Reimo Palm. It consists of submissions for Python programming course assignments by different students. The `data` directory contains two subfolders, `test1` and `test2`, that illustrate how the data is prepared for use. Each contains multiple Python source files that should, in theory, produce identical or similar results. Extracting the files from the data set is automated in `get.py`. We do not provide the entire data set in this repository.
Our approach can be described as follows:
- Load all the data files with Python source code from the given folder.
- Use Python's abstract syntax trees (ASTs) and visitors to produce a sequence of tokens for every program while preserving the program's structure (see the sketch after this list).
- Feed the token sequences to a vectoriser.
- Perform clustering on the resulting vectors. We also use PCA for dimensionality reduction when plotting.
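To make the tokenisation step more concrete, below is a minimal sketch of how an AST visitor can turn every source file in a folder into a sequence of node-type tokens. The names used here (`TokenVisitor`, `tokenise_folder`) are illustrative and not necessarily the ones defined in `analysis.py`.

```python
import ast
from pathlib import Path


class TokenVisitor(ast.NodeVisitor):
    """Walks the AST and records node types in visiting order."""

    def __init__(self):
        self.tokens = []

    def generic_visit(self, node):
        # Record the node type (e.g. FunctionDef, For, BinOp) as a token,
        # then recurse into children so the program's structure is preserved.
        self.tokens.append(type(node).__name__)
        super().generic_visit(node)


def tokenise_folder(folder):
    """Return a {filename: token string} mapping for all .py files in folder."""
    sequences = {}
    for path in Path(folder).glob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        visitor = TokenVisitor()
        visitor.visit(tree)
        sequences[path.name] = " ".join(visitor.tokens)
    return sequences
```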
The tf-idf vectoriser approach was inspired by https://github.com/scriptographers/UnPlag.
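As an illustration of the remaining steps, the sketch below vectorises the token sequences with scikit-learn's `TfidfVectorizer`, clusters them with k-means, and projects the vectors to two dimensions with PCA for plotting. The parameters (number of clusters, token pattern) are placeholders rather than the project's actual settings, and `tokenise_folder` refers to the sketch above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Token sequences produced by the AST visitor sketch above.
token_sequences = tokenise_folder("data/test1")
documents = list(token_sequences.values())

# Turn each token sequence into a tf-idf weighted vector.
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
vectors = vectorizer.fit_transform(documents)

# Cluster the vectors; the number of clusters here is purely illustrative.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(vectors)

# Reduce to two dimensions with PCA for plotting.
points = PCA(n_components=2).fit_transform(vectors.toarray())
plt.scatter(points[:, 0], points[:, 1], c=labels)
plt.show()
```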