---
badges: true
categories:
- kedro
date: '2022-11-13'
description: 'Meta-analysis of the kedro codebase'
output-file: kedro-meta-analysis.html
title: 'Understanding the Kedro codebase - A quick dirty meta-analysis - (Part I)'
toc: true

---


{{< video https://www.youtube.com/embed/pjq3QOxl9Ok >}}
This post was inspired by the talk above.

# How many lines of code in Kedro?

In [None]:
from pathlib import Path
import pandas as pd
from collections import Counter

In [None]:
REPO_PATH = Path("/Users/Nok_Lam_Chan/GitHub/kedro")
list(REPO_PATH.iterdir())

[PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/test_requirements.txt'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/CODE_OF_CONDUCT.md'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/LICENSE.md'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/tools'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/kedro_technical_charter.pdf'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/.DS_Store'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/test'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/.pytest_cache'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/derby.log'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/iris'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/kedro.egg-info'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/.pre-commit-config.yaml'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/.coverage'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/Makefile'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/CITATION.cff'),
 PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/CODEOWNERS'),
 PosixPath('/Users/N

In [None]:
def count_effective_line(counter, fn):
    # Despite the name, this counts every line in the file,
    # including blank lines and comments.
    with open(fn) as f:
        for line in f:
            counter[fn] += 1
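
The function above counts raw lines. If you want something closer to "effective" lines, a minimal sketch could skip blank lines and pure `#` comment lines. The name `count_effective_lines` is hypothetical and is not what produced the numbers in this post; docstrings and multi-line strings are still counted as code here.

```python
from collections import Counter
from pathlib import Path


def count_effective_lines(counter: Counter, fn: Path) -> None:
    """Count lines that are neither blank nor pure `#` comments.

    Simplification: docstrings and multi-line strings still count as code.
    """
    with open(fn, encoding="utf-8") as f:
        for line in f:
            stripped = line.strip()
            # Skip empty lines and lines that contain only a comment.
            if stripped and not stripped.startswith("#"):
                counter[fn] += 1
```

Swapping this in would lower every count in the tables below, so the totals would no longer match the ones shown.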

In [None]:
lines_count = Counter()
# rglob("*/*.py") only matches .py files at least one directory below REPO_PATH
for fn in REPO_PATH.rglob("*/*.py"):
    count_effective_line(lines_count, fn)
print(lines_count)

Counter({PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/tests/extras/datasets/spark/test_spark_dataset.py'): 984, PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/tests/pipeline/test_pipeline.py'): 940, PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/kedro/pipeline/pipeline.py'): 926, PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/tests/framework/session/test_session.py'): 891, PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/kedro/framework/cli/micropkg.py'): 854, PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/tests/framework/cli/micropkg/test_micropkg_pull.py'): 846, PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/kedro/io/core.py'): 748, PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/tests/framework/cli/test_cli.py'): 730, PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/tests/io/test_data_catalog.py'): 685, PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/tests/framework/cli/test_starters.py'): 639, PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro/features/steps/cli_steps.py'): 623, PosixPath('/Users/Nok_Lam_Chan/GitHub/kedro
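
One detail worth knowing about the pattern used above: `rglob("*/*.py")` requires at least one directory between the root and the file, so `.py` files sitting directly in the repository root are not counted. Here is a small sketch on a throwaway directory (the file names are made up for illustration):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg" / "sub").mkdir(parents=True)
    # One file in the root, one a level down, one two levels down.
    for rel in ["setup.py", "pkg/a.py", "pkg/sub/b.py"]:
        (root / rel).write_text("x = 1\n")

    nested_only = sorted(p.relative_to(root).as_posix() for p in root.rglob("*/*.py"))
    everything = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.py"))

print(nested_only)  # ['pkg/a.py', 'pkg/sub/b.py']
print(everything)   # ['pkg/a.py', 'pkg/sub/b.py', 'setup.py']
```

For this analysis that is probably fine, since the interesting code lives in subdirectories, but `rglob("*.py")` is the pattern to use if root-level scripts should be included.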

# Clean up the dictionary a little bit

In [None]:
line_counts_df = pd.DataFrame(lines_count.items(), columns=["fullpath","line_of_code"])
line_counts_df["fullpath"] = line_counts_df["fullpath"].apply(str)
line_counts_df["fullpath"] = line_counts_df["fullpath"].str.replace(f"{REPO_PATH}/", "", regex=False)
line_counts_df.head(2)

|   | fullpath | line_of_code |
|---|---|---|
| 0 | tools/cli.py | 62 |
| 1 | features/environment.py | 128 |


In [None]:
line_counts_df[["toplevel","module","submodule","filename"]] = line_counts_df["fullpath"].str.split("/",expand=True, n=3)

In [None]:
line_counts_df

|   | fullpath | line_of_code | toplevel | module | submodule | filename |
|---|---|---|---|---|---|---|
| 0 | tools/cli.py | 62 | tools | cli.py | | |
| 1 | features/environment.py | 128 | features | environment.py | | |
| 2 | tests/test_utils.py | 30 | tests | test_utils.py | | |
| 3 | tests/conftest.py | 89 | tests | conftest.py | | |
| 4 | docs/conf.py | 598 | docs | conf.py | | |
| ... | ... | ... | ... | ... | ... | ... |
| 276 | kedro/extras/datasets/pandas/feather_dataset.py | 191 | kedro | extras | datasets | pandas/feather_dataset.py |
| 277 | kedro/extras/datasets/pandas/hdf_dataset.py | 204 | kedro | extras | datasets | pandas/hdf_dataset.py |
| 278 | kedro/extras/datasets/pandas/csv_dataset.py | 194 | kedro | extras | datasets | pandas/csv_dataset.py |
| 279 | kedro/extras/datasets/pandas/excel_dataset.py | 254 | kedro | extras | datasets | pandas/excel_dataset.py |
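
Note that in the table above a shallow path such as `tools/cli.py` puts the file name in the `module` column, because `str.split` fills columns from left to right. If you would rather keep file names in `filename`, one possible alternative is to split on directories first. This is only a sketch; `split_path` and `COLUMNS` are hypothetical names and were not used for the results in this post.

```python
from pathlib import PurePosixPath

COLUMNS = ["toplevel", "module", "submodule"]


def split_path(fullpath: str) -> dict:
    """Map leading directories to named levels; the file name stays in `filename`."""
    *dirs, name = PurePosixPath(fullpath).parts
    row = {col: None for col in COLUMNS}
    # zip stops at the shorter sequence, so shallow paths leave later levels as None.
    for col, directory in zip(COLUMNS, dirs):
        row[col] = directory
    # Directories deeper than the named levels stay attached to the file name.
    extra = dirs[len(COLUMNS):]
    row["filename"] = "/".join([*extra, name])
    return row


print(split_path("tools/cli.py"))
# {'toplevel': 'tools', 'module': None, 'submodule': None, 'filename': 'cli.py'}
print(split_path("kedro/extras/datasets/pandas/csv_dataset.py"))
# {'toplevel': 'kedro', 'module': 'extras', 'submodule': 'datasets', 'filename': 'pandas/csv_dataset.py'}
```

The trade-off is that a `groupby` on `module` or `submodule` drops rows where the key is missing by default, so files like `kedro/utils.py` would disappear from the grouped tables unless you handle the missing values.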


In [None]:
# Sort by top-level directory
line_counts_df.groupby(["toplevel"]).sum(numeric_only=True).sort_values(ascending=False, by="line_of_code")

| toplevel | line_of_code |
|---|---|
| tests | 25341 |
| kedro | 18683 |
| features | 1587 |
| docs | 1185 |
| resume-kedro | 1007 |
| iris-demo | 550 |
| iris | 547 |
| test | 405 |
| tools | 88 |


Interestingly, `tests` has more code than `kedro` itself: 25,341 lines versus 18,683, a ratio of roughly 1.4:1.
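
To put a number on that comparison, here is the arithmetic using the two totals from the table above:

```python
# Totals copied from the `toplevel` table above.
tests_loc = 25341
kedro_loc = 18683

ratio = tests_loc / kedro_loc
print(f"{ratio:.2f} lines of test code per line of library code")
# 1.36 lines of test code per line of library code
```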

In [None]:
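# Note: this groups across all top-level directories, so `tests` and `kedro` files are mixed together
line_counts_df.groupby(["module","submodule"]).sum(numeric_only=True).sort_values(ascending=False, by="line_of_code")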

| module | submodule | line_of_code |
|---|---|---|
| extras | datasets | 15775 |
| framework | cli | 8837 |
| framework | session | 2574 |
| pipeline | test_pipeline.py | 940 |
| pipeline | pipeline.py | 926 |
| ... | ... | ... |
| config | `__init__.py` | 19 |
| runner | `__init__.py` | 16 |
| pipeline | `__init__.py` | 9 |
| extras | `__init__.py` | 2 |


In [None]:
# Aggregate the `kedro` package by module and submodule
kedro_line_counts_df = line_counts_df[line_counts_df["toplevel"] == "kedro"]
tmp = kedro_line_counts_df.groupby("module").sum(numeric_only=True).rename(columns={"line_of_code": "module_line_of_code"})
kedro_line_counts_df_group = kedro_line_counts_df.groupby(["module","submodule"]).sum(numeric_only=True).reset_index().merge(tmp, left_on="module", right_on="module")

In [None]:
kedro_line_counts_df.groupby(["module"]).sum(numeric_only=True).sort_values(ascending=False, by="line_of_code")

| module | line_of_code |
|---|---|
| extras | 6871 |
| framework | 5246 |
| io | 2284 |
| pipeline | 1837 |
| runner | 1068 |
| config | 721 |
| templates | 443 |
| ipython | 164 |
| utils.py | 28 |
| `__init__.py` | 11 |


In [None]:
# Sort by module size, then by submodule size within each module
kedro_line_counts_df_group.sort_values(ascending=False, by=["module_line_of_code","line_of_code"])

|   | module | submodule | line_of_code | module_line_of_code |
|---|---|---|---|---|
| 6 | extras | datasets | 6734 | 6871 |
| 8 | extras | logging | 110 | 6871 |
| 7 | extras | extensions | 25 | 6871 |
| 5 | extras | `__init__.py` | 2 | 6871 |
| 10 | framework | cli | 3439 | 5246 |
| 14 | framework | session | 511 | 5246 |
| 12 | framework | hooks | 418 | 5246 |
| 13 | framework | project | 369 | 5246 |
| 11 | framework | context | 352 | 5246 |
| 15 | framework | startup.py | 156 | 5246 |


In [None]:
# Total number of LOC
kedro_line_counts_df["line_of_code"].sum()

18683

# Conclusion
The Kedro codebase is not huge: the `kedro` package has about 18,700 lines of code, more than 10x smaller than pandas, which has over 250,000. The `datasets` code (under `extras`) and `framework` are the largest parts, which isn't a surprise to me.
What is more surprising is how small `config` is (about 700 lines), given how much complexity it creates in a Kedro project. The `cli` is also relatively large at roughly 3,400 lines, which I didn't expect.
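
As a rough sanity check on these impressions, the shares can be worked out from the numbers in the tables above. This is a back-of-the-envelope sketch rather than part of the original notebook:

```python
# Line counts copied from the tables in this post.
total_kedro = 18683
parts = {"extras/datasets": 6734, "framework/cli": 3439, "config": 721}

# Share of the `kedro` package, in percent, rounded to one decimal place.
shares = {name: round(100 * loc / total_kedro, 1) for name, loc in parts.items()}
for name, pct in shares.items():
    print(f"{name:<16} {pct:>5.1f}%")
```

By this count `datasets` is about 36% of the package, `cli` about 18%, and `config` under 4%.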