This is a reproduction project for the MSR course 2021/22 at UniKo, CS department, SoftLang Team.
This repository is a fork of https://github.com/gorjatschev/applying-apis for the MSR 2021/22 coursework.
Team: Mike
- Prayuth Patumcharoenpol
- Jorge Gavilan
Reproduction of the visualization stage of the empirical study, in which a representation of the applied API categorization is generated using treemaps for better understanding of the results.
The CSV file, pre-processed by the preceding analysis step, lists the APIs and their corresponding API categories, with the following columns:
filePath, packageName, className, methodName, line, column, javaParserTypeOfElement, usedClassOfElement, isAPIClass, api, mcrCategories, mcrTags
The original CSV file is available in the project's repository and contains the same data as the file used for this replication.
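Before running the visualization, it can help to confirm that an input file has the columns listed above. The following is a minimal sketch using only the Python standard library; the function name `missing_columns` and the sample row are hypothetical and not part of the original code.

```python
import csv
import io

# Column names as listed in this README for the analyzed CSV.
EXPECTED_COLUMNS = [
    "filePath", "packageName", "className", "methodName", "line", "column",
    "javaParserTypeOfElement", "usedClassOfElement", "isAPIClass", "api",
    "mcrCategories", "mcrTags",
]


def missing_columns(csv_text):
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [name for name in EXPECTED_COLUMNS if name not in header]


# Hypothetical one-row sample; the values are made up for illustration.
sample = (
    ",".join(EXPECTED_COLUMNS) + "\n"
    "src/Main.java,com.example,Main,run,10,5,MethodCallExpr,"
    "org.apache.commons.io.FileUtils,true,commons-io:commons-io,I/O Utilities,io\n"
)
print(missing_columns(sample))  # an empty list means the header is complete
```

For a real file, the text could be read with `open(path, newline="").read()` and passed to the same function.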
- Treemaps with the hierarchical structure and size of abstractions
- Colorized APIs and API categories
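To illustrate how a treemap hierarchy with sized abstractions can be derived from the rows, here is a minimal sketch in plain Python. It is not the code used in this project: the function `treemap_nodes`, the choice of levels, and the use of usage counts as sizes are assumptions made for illustration.

```python
from collections import Counter


def treemap_nodes(rows, levels):
    """Build ids, labels, parents and values for a treemap hierarchy.

    rows   -- list of dicts, one per API usage
    levels -- column names from the outermost to the innermost level
    """
    counts = Counter()
    for row in rows:
        path = ()
        for level in levels:
            path = path + (row[level],)
            counts[path] += 1  # every ancestor counts the usage too

    ids, labels, parents, values = [], [], [], []
    for path in sorted(counts):
        ids.append("/".join(path))           # assumes names contain no "/"
        labels.append(path[-1])
        parents.append("/".join(path[:-1]))  # "" marks a root node
        values.append(counts[path])
    return ids, labels, parents, values


# Hypothetical rows for illustration only.
rows = [
    {"packageName": "com.example", "className": "Main"},
    {"packageName": "com.example", "className": "Util"},
    {"packageName": "com.example", "className": "Main"},
]
ids, labels, parents, values = treemap_nodes(rows, ["packageName", "className"])
print(list(zip(ids, parents, values)))
```

The four lists match the `ids`, `labels`, `parents` and `values` arguments of plotly's `plotly.graph_objects.Treemap`. Because each parent value here is the sum of its children, `branchvalues="total"` would be the fitting setting.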
We believe that our process for creating the visualizations is identical to the one in the thesis, since the visualizations are created by an external library (plotly). The data we used is also identical to the thesis data, so the process should not differ.
Since the data used to generate the visualizations was the same as in the original work [Gorjatschev21], the results show no major differences. The output data is identical; only the figure structure differs, which we do not consider relevant. We expect to see a significant delta when processing the data generated by other teams, which we can then compare with the results obtained in this stage.
This replication uses the code from the Gorjatschev21 repository as a baseline. We removed the files that are not needed for the visualization stage and restructured the project. We then installed all software requirements and adjusted the application configuration to run accordingly.
To generate the visualizations:
python process/repositories_visualizer.py
- Operating system: Linux (Ubuntu 16.04 or higher recommended), macOS, or Windows 7 to 10.
- Memory: At least 4GB RAM (8GB preferable)
- Java 11
- Python 3.9.6 (plotly>=5.1.0, pyspark>=3.1.2)
- kaleido (python module for image export)
Comparing the visualization results with the input data shows that the hierarchy of the data, particularly the dependency relationships between the APIs, is less complex to interpret. As for the execution of the code, we could not compare the runtime data (the data held while the code is running) with the thesis, so we checked the results by hand, going through each visualization figure and comparing the final results.
The input file is the analyzed data produced by the analysis part of the thesis; it contains the same columns as listed for the input data.
The output is the treemap visualization figures as HTML and PDF files, one generated per visualization.
Running the main process also generates intermediate data:
- Spark, used to analyze the parsed repositories, generates a CSV file with [packageName,className,methodName,mcrCategories,mcrTags] as columns and a _SUCCESS flag file for the overall process.
- A "characterization" folder is also created. It contains CSV files with [packageName,className,methodName,mcrCategories,mcrTags] as columns; the main process uses these files to visualize by characterization type and to group by dependency relationships when generating the visualizations.
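The shape of these intermediate files can be sketched as a projection of the analyzed rows onto the five listed columns. The example below uses plain Python instead of Spark, and dropping duplicate rows is an assumption made for illustration; the helper names `project_rows` and `to_csv_text` are hypothetical and do not appear in the project.

```python
import csv
import io

# Columns of the intermediate CSV files as listed above.
INTERMEDIATE_COLUMNS = [
    "packageName", "className", "methodName", "mcrCategories", "mcrTags",
]


def project_rows(rows, columns=INTERMEDIATE_COLUMNS):
    """Keep only the given columns and drop duplicate rows, preserving order."""
    seen = set()
    projected = []
    for row in rows:
        key = tuple(row[name] for name in columns)
        if key not in seen:
            seen.add(key)
            projected.append(dict(zip(columns, key)))
    return projected


def to_csv_text(rows, columns=INTERMEDIATE_COLUMNS):
    """Serialize the projected rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical analyzed rows; only some of the input columns are shown.
analyzed = [
    {"filePath": "a.java", "packageName": "p", "className": "C",
     "methodName": "m", "mcrCategories": "Logging", "mcrTags": "log"},
    {"filePath": "b.java", "packageName": "p", "className": "C",
     "methodName": "m", "mcrCategories": "Logging", "mcrTags": "log"},
]
print(to_csv_text(project_rows(analyzed)))
```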
For interaction we selected the datasets from teams Whiskey and Xray. Both teams generated a consistent set of CSV files following the original structure from [Gorjatschev21], with the following fields:
filePath, packageName, className, methodName, line, column, javaParserTypeOfElement, usedClassOfElement, isAPIClass, api, mcrCategories, mcrTags
Both teams also provide analyzed data, which we needed to run the visualization process.
We confirm that the datasets from both teams allow us to reproduce visualizations similar to those generated by [Gorjatschev21]. We did not observe a large delta compared with the visualizations generated from the original [Gorjatschev21] data. Because the datasets represent different repositories, the hierarchical data differs in ways inherent to the components of each repository, and the generated visualizations made these differences easier to interpret.