TDMINING (Technical Debt Mining Process) is a process designed to generate data and knowledge about code smells, with a focus on the developer. It is independent of any specific static code analysis or statistical analysis tool.
Experiments with TDMINING use tools such as SonarQube and Python scripts built on NumPy, Pandas, and other machine learning and data science packages. Data visualization is performed in Jupyter Notebooks, but can be adapted to Grafana or other tools.
This repository contains all artifacts required to run the TDMINING experiment. Below is a detailed description of the main directories and files:
- `tdmining_bizagi_ENUS.bpm`: BPMN diagram of the TDMINING process (English). Can be viewed with Bizagi Modeler.
- `DATA_DICTIONARY.md`: Data dictionary for the experiment's database schema.
- `DM_ACTION.csv`, `DM_COMMITS.csv`, `DM_DEVELOPER.csv`, `DM_PROJECT.csv`, `DM_RULE.csv`, `DM_SEVERITY.csv`, `DM_TDITEM.csv`, `DM_TIME.csv`, `DM_VERSION.csv`: Dimension tables in CSV format for the star schema used in Experiment 1.
- `FT_TECHNICALDEBT.csv`, `FT_TECHNICALDEBT_IGNORE_CONDITION.csv`: Fact tables with technical debt data and ignore conditions.
- `ORGANIZATION.csv`, `ORGANIZATION_PROJECT.csv`, `PROJECT_VERSION.csv`: Additional supporting tables in CSV format.
- `DOC1 - INFRASTRUCTURE FORM_EN.docx`: Infrastructure form (English).
- `DOC2 - ENVIRONMENT VARIABLES DOCUMENT_EN.docx`: Documentation of environment variables (English).
- `DOC3 - RESULT EXECUTION SCRIPTS JUPYTER_EN.docx`: Results documentation for Jupyter scripts (English).
- `DOC4 - RESTRICTIONS CHECKLIST _EN.docx`: Restrictions checklist (English).
Python scripts for data processing, ETL, and experiment automation:
- `condition-ignore.py`: Updates the technical debt fact table to mark records as ignored, based on conditions from the ignore-condition table.
- `configuration.py`: Loads configuration values from `config.ini` for database and project settings.
- `executer.py`: Automates running SonarQube analysis for each commit, updating the database, and managing project properties and branches.
- `extract.py`: Extracts commit and file modification data from a Git repository using PyDriller and saves it to CSV files.
- `import_csvs_to_sqlite.py`: Imports CSV files into a SQLite database, normalizing column names and creating indexes for ID columns.
- `insert.py`: Inserts commit metadata from a Git repository into the database using PyDriller.
- `Query.py`: Loads SQL queries from `query.ini` for use in scripts and notebooks.
- `reset-database.py`: Resets the database to a previous state using a query from `query.ini`.
- `config.ini`, `query.ini`: Configuration files for database connections and SQL queries.
- `requirements.txt`: Python dependencies for the project.
- `get-pip.py`: Script to install pip if needed.
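The CSV-to-SQLite import step can be sketched as follows. This is a minimal, standard-library illustration of the behavior described for `import_csvs_to_sqlite.py` (normalize headers, load rows, index ID columns); the function name and the exact normalization rule are assumptions, not the script's actual code.

```python
import csv
import re
import sqlite3
from pathlib import Path

def normalize(name: str) -> str:
    """Normalize a CSV header to a safe SQL column name (illustrative rule)."""
    return re.sub(r"\W+", "_", name.strip()).upper()

def import_csv(db_path: str, csv_path: str) -> None:
    """Load one CSV file into a table named after the file, indexing *_ID columns."""
    table = Path(csv_path).stem.upper()
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = [normalize(c) for c in next(reader)]
        rows = list(reader)
    con = sqlite3.connect(db_path)
    with con:
        cols = ", ".join(f"{c} TEXT" for c in header)
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        placeholders = ", ".join("?" for _ in header)
        con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
        # Index columns that look like identifiers, as the script description suggests.
        for col in header:
            if col.endswith("ID"):
                con.execute(
                    f'CREATE INDEX IF NOT EXISTS IX_{table}_{col} ON "{table}" ({col})'
                )
    con.close()
```

Storing every column as `TEXT` keeps the sketch simple; the real script may infer types.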
Jupyter Notebooks for data analysis and visualization:
- `question-1-metric-1.ipynb`: Analyzes the number of developers and the amount of open code smells using SQL queries and data visualization.
- `question-1-metric-2.ipynb`: Analyzes the number of developers and the amount of closed code smells using SQL queries and data visualization.
- `question-2-metric-1.ipynb`: Performs a multi-step analysis of code smells per developer, including several SQL scripts and result displays.
- `question-3-metric-1.ipynb`: Applies association rule mining (Apriori, FP-Growth) to code smell data to find frequent patterns and rules.
- `question-4-metric-1.ipynb`: Calculates and visualizes moving averages for code smell creation and effort indices, both for all developers and for a selected developer.
- `question-5-metric-1.ipynb`: Performs time series forecasting (ARIMA) on code smell creation, including stationarity tests and error metrics.
- `test-hypothesis.ipynb`: Tests the correlation between the number of code smells and the number of commits using Spearman and Pearson coefficients.
- `test-normality.ipynb`: Tests the normality of code smell creation data using Shapiro-Wilk and Kolmogorov-Smirnov tests, with histograms and descriptive statistics.
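The correlation test in `test-hypothesis.ipynb` boils down to computing both coefficients on two aligned count series. A minimal sketch with SciPy (the helper name, input layout, and significance threshold are illustrative, not the notebook's exact code):

```python
from scipy.stats import pearsonr, spearmanr

def correlate(smells_per_period, commits_per_period, alpha=0.05):
    """Correlate code-smell counts with commit counts over the same periods.

    Returns (coefficient, p-value, significant-at-alpha) per test.
    """
    r_p, p_p = pearsonr(smells_per_period, commits_per_period)   # linear correlation
    r_s, p_s = spearmanr(smells_per_period, commits_per_period)  # rank (monotonic) correlation
    return {
        "pearson": (r_p, p_p, p_p < alpha),
        "spearman": (r_s, p_s, p_s < alpha),
    }
```

Spearman is the safer default here since issue counts are rarely normally distributed, which is also why the repository includes `test-normality.ipynb`.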
SQL scripts for database creation, data loading, analysis, and utility functions:
- `ANALISYS_TEMP.sql`: Performs ETL operations to identify and aggregate technical debt issues that were opened and closed, joining data from multiple sources. It creates temporary tables for open/closed issues, aggregates commit and issue data, and updates a summary table with commit hashes and timestamps for open/closed states.
- `CREATE_DDL.sql`: Contains Data Definition Language (DDL) statements to create the main database schema for the experiment. It defines tables for organizations, projects, developers, actions, time, technical debt, rules, severities, and their relationships, including primary and foreign keys. Some commented-out sections suggest additional profile and remediation tables.
- `LOAD_DATA.sql`: Populates the main dimension tables with data extracted from source repositories. Inserts developers, time periods, commits, actions, technical debt item types, rules, and severities. Also creates a temporary fact table for technical debt and loads it into the main fact table.
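The open/closed pairing that `ANALISYS_TEMP.sql` performs can be illustrated in miniature in Python. The record layout below (item key, action, commit hash, timestamp) is a hypothetical simplification of the real schema, kept only to show the pairing logic:

```python
def pair_open_close(events):
    """Summarize, per technical-debt item, the commit/timestamp at which it
    was first opened and last closed (None if never closed)."""
    summary = {}
    for key, action, commit, ts in sorted(events, key=lambda e: e[3]):
        row = summary.setdefault(key, {"open": None, "closed": None})
        if action == "OPEN" and row["open"] is None:
            row["open"] = (commit, ts)       # keep the earliest OPEN
        elif action == "CLOSE":
            row["closed"] = (commit, ts)     # keep the latest CLOSE
    return summary
```

The SQL script does the same with temporary tables and joins, which scales better than in-memory dictionaries for large histories.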
- `COMMITS.sql`: Creates the COMMITS table to store commit metadata, including author, hash, date, project, and processing status. Adds a column to track whether each commit has been processed.
- `INDEX.sql`: Creates indexes on the ISSUES and ISSUES_SONAR tables to optimize queries by commit hash. Also adds a primary key column to ISSUES_SONAR.
- `FN_FORMAT_TEXT.sql`: Defines a SQL function to format and normalize text by removing special characters, accents, punctuation, numbers, and spaces, and converting to lowercase. Useful for text standardization in data processing.
- `FN_SLINTSTRING .SQL`: Defines a table-valued function to split a string into elements based on a separator character. Useful for parsing delimited strings into rows.
- `FN_UNIXDATE_TODATE.sql`: Defines a function to convert a Unix timestamp (seconds since 1970-01-01) to a SQL DATETIME value.
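The normalization that `FN_FORMAT_TEXT.sql` applies (strip accents, punctuation, digits, and spaces; lowercase the rest) can be mirrored in Python when the same standardization is needed outside the database, e.g. for matching developer names across sources. This is an illustrative analogue, not a port of the SQL function:

```python
import re
import unicodedata

def format_text(text: str) -> str:
    """Strip accents, punctuation, digits, and spaces; return lowercase letters only."""
    # Decompose accented characters (e.g. 'e' + combining acute), then drop
    # everything that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Keep letters only, then lowercase.
    return re.sub(r"[^A-Za-z]", "", ascii_only).lower()
```

Applying the same rule on both the Python and SQL sides keeps keys comparable across the ETL pipeline.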
- `VERSION_WORDPRESS.sql`: Populates tables with WordPress project and version information, including version names and release dates. Calculates version periods and links commits to the corresponding version based on commit date.
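Linking a commit to a version by date amounts to an interval lookup on the sorted release dates. A minimal sketch; the release data below is hypothetical (the real values come from `VERSION_WORDPRESS.sql`):

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical (version, release date) pairs, sorted by date.
RELEASES = [
    ("5.0", datetime(2018, 12, 6)),
    ("5.1", datetime(2019, 2, 21)),
    ("5.2", datetime(2019, 5, 7)),
]

def version_for_commit(commit_date: datetime):
    """Return the latest version released on or before the commit date,
    or None if the commit predates the first release."""
    dates = [d for _, d in RELEASES]
    i = bisect_right(dates, commit_date)
    return RELEASES[i - 1][0] if i > 0 else None
```

The SQL script expresses the same idea as a join on date ranges; the binary search here is just the in-memory equivalent.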
For more information, refer to the documentation in the DOCS folder (English files only).
Copy the `process_sonar` folder to `C:\`.