This project is a Python tool to implement, compare, and analyze the practical performance of the two primary algorithms for topological sorting:
- DFS-based Sort: The classic recursive, depth-first search approach.
- Kahn's Algorithm: The iterative, source-removal (or in-degree-based) approach.
Both algorithms are theoretically $O(V+E)$, so this project focuses on how they compare in practice.
- Algorithm Implementations: Provides robust, cycle-detecting implementations for both DFS-based sort and Kahn's algorithm.
- Efficient Kahn's Algorithm: Uses `collections.deque` for $O(1)$ queue operations, ensuring a true $O(V+E)$ runtime.
- Random DAG Generation: Includes a utility to generate random DAGs of specified sizes and densities for fair and scalable analysis.
- Performance Plotting: Uses `seaborn` and `matplotlib` to generate a `Time vs. V+E` plot, which is the most accurate way to compare $O(V+E)$ algorithms.
- Data Export: Saves all raw experiment data to a `.csv` file.
- Modular Package: Code is cleanly structured in a `topo_analyzer` package.
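One common way to build a random DAG of a given size and density is sketched below. This is an illustration under stated assumptions, not necessarily how `graph.py` does it: the function name `random_dag`, its parameters, and the adjacency-list representation are all hypothetical.

```python
import random


def random_dag(num_nodes, density, seed=None):
    """Return a random DAG as an adjacency list {node: [successors]}.

    Hypothetical helper: `density` is treated as the probability of
    including each possible forward edge.
    """
    rng = random.Random(seed)
    # A random permutation acts as a hidden topological order.
    order = list(range(num_nodes))
    rng.shuffle(order)
    adj = {node: [] for node in range(num_nodes)}
    # Edges only go from earlier to later positions, so no cycle can form.
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            if rng.random() < density:
                adj[order[i]].append(order[j])
    return adj


dag = random_dag(num_nodes=6, density=0.5, seed=0)
print(dag)
```

Because every edge points forward in the shuffled order, the result is acyclic by construction, and the expected edge count grows with `density`.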
topological-sort-analyzer/
├── .gitignore
├── LICENSE
├── README.md # This documentation
├── requirements.txt # Project dependencies
├── main.py # Main runnable script (CLI)
└── topo_analyzer/
    ├── __init__.py # Makes 'topo_analyzer' a package
    ├── graph.py # Graph class with random DAG generator
    ├── algorithms.py # Implementations of DFS sort and Kahn's
    ├── analyzer.py # The 'timeit' analysis runner
    └── plotter.py # Plot generation logic
- Clone the repository:

      git clone https://github.com/msmrexe/python-topological-sort-analyzer.git
      cd python-topological-sort-analyzer

- Create a virtual environment (Recommended):

      python -m venv venv
      source venv/bin/activate # On Windows: venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

- Run the analysis:

      # Run with default settings (max 2000 nodes, 0.1 density)
      python main.py

      # Run a larger, denser analysis
      python main.py --max-nodes 5000 --steps 25 --density 0.2

      # Specify output files
      python main.py --csv my_data.csv --plots-dir my_plots

After running, check the `plots/` directory for your `.png` graph and `topo_sort_results.csv` for the raw data.
The DFS-based sort uses the post-order traversal of a DFS.
- Logic:
  - Start a DFS traversal from an unvisited node.
  - Mark the node as "visiting" (to detect cycles).
  - Recursively call DFS on all its unvisited neighbors.
  - After all neighbors have been fully explored, add the current node to the end of a list (or the front of a stack).
  - Mark the node as "visited."
- Cycle Detection: If the DFS encounters a node that is already in the "visiting" state (i.e., in the current recursion stack), it has found a back-edge, which means a cycle exists.
- Result: The final list, when reversed, is a valid topological sort.
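The steps above can be sketched as follows. This is a minimal illustration, not the code in `algorithms.py`: the function name, the three-state encoding, and the adjacency-list input (a dict in which every node appears as a key) are assumptions made for the example.

```python
def dfs_topological_sort(adj):
    """Return a topological order of `adj`, or raise ValueError on a cycle.

    Assumes `adj` maps every node to a list of its successors.
    """
    UNVISITED, VISITING, VISITED = 0, 1, 2
    state = {node: UNVISITED for node in adj}
    post_order = []

    def visit(node):
        state[node] = VISITING
        for neighbor in adj[node]:
            if state[neighbor] == VISITING:
                # Back-edge into the current recursion stack: a cycle.
                raise ValueError("graph contains a cycle")
            if state[neighbor] == UNVISITED:
                visit(neighbor)
        # All descendants are finished, so record the node in post-order.
        post_order.append(node)
        state[node] = VISITED

    for node in adj:
        if state[node] == UNVISITED:
            visit(node)
    # Reversed post-order is a valid topological order.
    return post_order[::-1]


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(dfs_topological_sort(graph))  # prints ['a', 'c', 'b', 'd']
```

Note that `['a', 'b', 'c', 'd']` would be an equally valid order for this graph; which one comes out depends on the order in which neighbors are explored.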
Kahn's algorithm works by iteratively removing nodes that have no incoming edges.
- Logic:
  - Calculate the in-degree (count of incoming edges) for every node in the graph.
  - Create a queue (using `deque` for $O(1)$ pops) and add all nodes with an in-degree of `0` to it.
  - Initialize an empty list for the topological order.
  - Loop: While the queue is not empty:
    - Pop a node `u` from the queue and add it to the topological order list.
    - For each neighbor `v` of `u`:
      - Decrement the in-degree of `v`.
      - If the in-degree of `v` becomes `0`, add `v` to the queue.
- Cycle Detection: If the final topological order list contains fewer nodes than the total number of nodes in the graph, it means the loop terminated early because there was a cycle (which left nodes with an in-degree > 0).
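A minimal sketch of this procedure is shown below. As with any short example, it is an illustration rather than the project's actual implementation: the function name and the adjacency-list input (a dict in which every node appears as a key) are assumptions.

```python
from collections import deque


def kahn_topological_sort(adj):
    """Return a topological order of `adj`, or raise ValueError on a cycle.

    Assumes `adj` maps every node to a list of its successors.
    """
    # Count incoming edges for every node.
    in_degree = {node: 0 for node in adj}
    for node in adj:
        for neighbor in adj[node]:
            in_degree[neighbor] += 1

    # Start with all source nodes (in-degree 0).
    queue = deque(node for node in adj if in_degree[node] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                queue.append(v)

    # Nodes on a cycle never reach in-degree 0, so they are never emitted.
    if len(order) < len(adj):
        raise ValueError("graph contains a cycle")
    return order


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(kahn_topological_sort(graph))  # prints ['a', 'b', 'c', 'd']
```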
Both algorithms have the same, efficient linear time complexity of $O(V+E)$, where $V$ is the number of vertices and $E$ is the number of edges.
- DFS-based Sort: The DFS visits every vertex (node) and every edge exactly once. The work done at each step is constant. Total time: $O(V+E)$.
- Kahn's Algorithm:
  - Calculating all in-degrees requires traversing all edges: $O(E)$ (or $O(V+E)$).
  - Initializing the queue by checking all nodes: $O(V)$.
  - The main loop processes each vertex exactly once (when it's popped from the queue) and each edge exactly once (when it decrements its neighbor's in-degree).
  - Total time: $O(V+E) + O(V) = O(V+E)$.
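The "exactly once" claims for Kahn's main loop can be checked empirically by counting operations. The sketch below is only an instrumented illustration with a hypothetical function name and a hand-written example graph; it is not part of the project's analyzer.

```python
from collections import deque


def kahn_operation_counts(adj):
    """Run Kahn's main loop on `adj` and count vertex pops and edge visits."""
    in_degree = {node: 0 for node in adj}
    for node in adj:
        for neighbor in adj[node]:
            in_degree[neighbor] += 1
    queue = deque(node for node in adj if in_degree[node] == 0)
    pops = 0
    edge_visits = 0
    while queue:
        u = queue.popleft()
        pops += 1
        for v in adj[u]:
            edge_visits += 1
            in_degree[v] -= 1
            if in_degree[v] == 0:
                queue.append(v)
    return pops, edge_visits


graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
num_edges = sum(len(successors) for successors in graph.values())
pops, edge_visits = kahn_operation_counts(graph)
print(pops == len(graph), edge_visits == num_edges)  # prints True True
```

On a DAG the two counters come out to exactly $V$ and $E$, which is where the $O(V+E)$ bound for the main loop comes from.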
Despite having the same theoretical complexity, the plots show that Kahn's Algorithm is consistently faster than the DFS-based sort.
This is a classic example of how theoretical bounds don't tell the whole story. The reasons for this practical difference are:
- Python Recursion Overhead: The DFS algorithm is recursive, and Python has a relatively high overhead for function calls. The DFS makes one recursive call per node, so a graph with thousands of nodes incurs thousands of calls, each adding a small amount of overhead that accumulates. A deep, "stringy" graph additionally produces a very deep recursion stack.
- Iterative vs. Recursive: Kahn's algorithm is iterative (it's just a `while` loop). Iterative code is almost always faster in Python than equivalent recursive code because it avoids the function call overhead.
- Data Structures: The (now efficient) Kahn's algorithm relies on a `deque`, which is a highly optimized C-based data structure in Python, making its $O(1)$ `append` and `popleft` operations extremely fast.
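The call-overhead argument can be probed with a small `timeit` micro-benchmark. The sketch below uses two toy functions (hypothetical, unrelated to the project's code) that do the same work recursively and iteratively. The timings depend on the machine and the Python version, so treat the numbers as indicative only.

```python
import timeit


def count_recursive(n):
    """Count from n down to 0 using one function call per step."""
    if n == 0:
        return 0
    return 1 + count_recursive(n - 1)


def count_iterative(n):
    """Count from n down to 0 using a plain while loop."""
    total = 0
    while n > 0:
        total += 1
        n -= 1
    return total


depth = 300  # kept well below Python's default recursion limit
recursive_time = timeit.timeit(lambda: count_recursive(depth), number=2000)
iterative_time = timeit.timeit(lambda: count_iterative(depth), number=2000)
print(f"recursive: {recursive_time:.4f}s, iterative: {iterative_time:.4f}s")
```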
The shape of the graph also matters. A very "wide" graph (low depth) might favor DFS, while a very "deep" graph (long chains) will heavily favor the iterative Kahn's algorithm. Because the tests run on random DAGs, the analysis estimates average-case performance, which clearly shows Kahn's advantage.
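Deep graphs raise a second practical issue for recursive code in Python: the interpreter's recursion limit. The sketch below builds a single long chain and walks it with a hypothetical recursive helper. It illustrates the general behavior of recursive traversals and says nothing about how the project's DFS handles this case.

```python
import sys


def chain_length_recursive(adj, node):
    """Follow a chain recursively and return the number of nodes reached."""
    successors = adj[node]
    if not successors:
        return 1
    return 1 + chain_length_recursive(adj, successors[0])


# Build a chain 0 -> 1 -> 2 -> ... that is longer than the recursion limit.
length = sys.getrecursionlimit() + 100
chain = {i: [i + 1] for i in range(length - 1)}
chain[length - 1] = []

try:
    chain_length_recursive(chain, 0)
    hit_limit = False
except RecursionError:
    hit_limit = True
print(hit_limit)  # prints True
```

An iterative algorithm such as Kahn's keeps its pending work in a queue instead of the call stack, so it is not subject to this limit.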
Feel free to connect or reach out if you have any questions!
- Maryam Rezaee
- GitHub: @msmrexe
- Email: ms.maryamrezaee@gmail.com
This project is licensed under the MIT License. See the LICENSE file for full details.