# Top tool filter

In this notebook, we will use the tool inventory statistics to choose a subset of tools for further manual feature and metadata extraction.

Our approach is to take the top 30 tools for each of several repository statistics (number of stars, forks, downloads, ...) and then see which tools appear on the most of these top-30 lists.
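
As a minimal sketch of this overlap count, the toy example below applies the same idea to invented data. The tool ids, stat names, and values are hypothetical and only illustrate the mechanics; the cut-off is 2 rather than 30 to keep the table small.

```python
import pandas as pd

# Hypothetical toy data: four tools and three stats (illustrative values only).
toy = pd.DataFrame(
    {
        "stars": [50, 40, 10, 5],
        "forks": [7, 9, 1, 8],
        "downloads": [100, 300, 200, 50],
    },
    index=pd.Index(["tool_a", "tool_b", "tool_c", "tool_d"], name="id"),
)

top_n = 2  # the notebook uses 30

# One Series of tool ids per stat, holding that stat's top-N tools.
top_lists = [
    toy[stat].sort_values(ascending=False).head(top_n).index.to_series()
    for stat in toy.columns
]

# Count how many top-N lists each tool appears in.
overlap = pd.concat(top_lists).value_counts()
print(overlap)
```

Here `tool_b` is in the top 2 for all three stats, so it has the highest overlap count; every other tool appears on exactly one list.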

In [15]:
from pathlib import Path

import pandas as pd
import plotly.express as px

cwd = Path("").parent

In [16]:
tools = pd.read_csv(cwd / "output" / "filtered.csv", index_col="url")
stats = pd.read_csv(cwd / "output" / "stats.csv", index_col=0)

# Merge tools stats and other tool metadata
df = (
    pd.merge(left=tools, right=stats, right_index=True, left_index=True)
    .rename_axis(index="url")
    .reset_index()
    .set_index("id")
)

In [17]:
key_stats = [
    "stargazers_count",
    "forks_count",
    "last_month_downloads",
    "commit_stats.dds",
    "commit_stats.total_committers",
]
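# For each key stat, collect the ids of the 30 highest-ranked tools
top_30_list = []
for key_stat in key_stats:
    top_30_list.append(
        df.sort_values(key_stat, ascending=False).head(30).reset_index().id
    )
# Count how many top-30 lists each tool appears in
top_tools = pd.concat(top_30_list).value_counts()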

In [18]:
fig = px.bar((top_tools * 100 / len(key_stats)).astype(int).value_counts())
fig.update_layout(
    yaxis={"title": "Number of tools"},
    xaxis={"title": "Percentage of key stats for which a tool is a top-30 entry"},
    showlegend=False,
)

In [19]:
# Keep only those with 3 or more top-30 entries
best_of_the_best = top_tools[top_tools > 2]
categories = df.loc[best_of_the_best.index].category
print(best_of_the_best)

id
pypsa                      5
pandapower                 5
egret                      5
oemof_solph                5
grid2op                    4
urbs                       4
powersimulations           4
genx                       4
hopp                       4
powermodels_jl             4
gridcal                    4
pypower                    3
fine                       3
dpsim                      3
andes                      3
power_grid_model           3
temoa                      3
pysam                      3
calliope                   3
open_modeling_framework    3
powsybl                    3
Name: count, dtype: int64


## Filling empty categories

We fill in missing categories for the top tools by manually inspecting each tool's documentation.
Tools we have already categorised are listed in `categories.csv`.
Any tools that still need a category are printed below.
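
One way such manual assignments could be folded back in is sketched below. The layout of `categories.csv` is not shown in this notebook, so the example assumes a simple id-to-category mapping and uses invented tool ids and category names throughout.

```python
import pandas as pd

# Hypothetical categories for three top tools; None marks a missing entry.
categories = pd.Series(
    ["dispatch", None, None],
    index=pd.Index(["tool_a", "tool_b", "tool_c"], name="id"),
    name="category",
)

# Hypothetical manual assignments, standing in for the contents of
# categories.csv (its real layout is an assumption here).
manual = pd.Series({"tool_b": "power flow"}, name="category")

# Keep existing values and fill gaps from the manual assignments,
# aligned on the tool id index.
filled = categories.fillna(manual)

# Tools that still have no category after the fill.
remaining = filled[filled.isnull()].index.tolist()
print(remaining)
```

In this toy case only `tool_c` is left without a category, which is the kind of list the next cell prints for the real data.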


In [23]:
categories_to_fill = categories[categories.isnull()]
if not categories_to_fill.empty:
    print(df.loc[categories_to_fill.index].url)
else:
    print("All top tools have been categorised!")

All top tools have been categorised!
