# Overview

At the time of writing, 1339 projects have been identified worldwide. Of these:
- 1188 projects are hosted on GitHub,
- 27 on GitLab and
- 125 on other websites or self-hosted Git platforms.

**We found 996 active project repositories in total on GitHub**. A project is considered active if its public repository has had at least one commit or closed issue within the last year. We have excluded inactive projects from our analysis, as their inclusion would distort current trends. The listed inactive open source projects are those that have become inactive since data collection began two years ago. The statistics on all active and inactive projects in the table below are based on the raw dataset. **Unless otherwise noted, all subsequent plots in the study refer to the active projects.**
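
The activity criterion can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not the study's pipeline: the dataset already ships a precomputed `project_active` column, the helper name `is_active` and the toy values are hypothetical, and the column names `total_commits_last_year` and `issues_closed_last_year` are the ones used in the statistics table of this chapter.

```python
import pandas as pd


def is_active(df: pd.DataFrame) -> pd.Series:
    """Flag projects with at least one commit or closed issue in the last year."""
    commits = df["total_commits_last_year"].fillna(0)
    issues = df["issues_closed_last_year"].fillna(0)
    return (commits > 0) | (issues > 0)


# Toy data for illustration only (not taken from the study)
toy = pd.DataFrame(
    {
        "project_name": ["a", "b", "c"],
        "total_commits_last_year": [12, 0, 0],
        "issues_closed_last_year": [0, 3, 0],
    }
)
toy["project_active"] = is_active(toy)
print(toy)
```

Missing values are treated as zero activity here, which is a design choice of this sketch rather than something stated in the study.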

In [9]:
import numpy as np
import pandas as pd
import plotly.io as pio
import plotly.graph_objects as go
import plotly.express as px
from opensustainTemplate import *

In [10]:
df_raw = pd.read_csv("../csv/projects.csv")
df_raw.rename(columns={"rubric": "topic"}, inplace=True)
df_raw.rename(columns={"topics": "labels"}, inplace=True)

# This project appears twice in the database
df_raw = df_raw[
    df_raw["git_url"] != "https://github.com/openfoodfacts/openfoodfacts-server.git"
]

In [11]:
# Express the project age in years, which reads better in plots
df_raw["project_age_in_years"] = df_raw["project_age_in_days"] / 365
max_age_in_years = 8.0

In [12]:
fig = go.Figure(
    data=[
        go.Table(
            columnwidth=[100, 30],
            header=dict(
                values=["Dimension", "Value"],
                line_color="#000000",
                fill_color="#ffffff",
                font_size=18,
            ),
            cells=dict(
                fill_color="#ffffff",
                line_color="#ffffff",
                font_size=16,
                height=30,
                values=[
                    [
                        "Total number of projects",
                        "GitHub projects",
                        "GitLab projects",
                        "Other platforms",
                        "Number of projects in personal namespace",
                        "Number of projects in community namespace",
                        "Total stars of all projects",
                        "Total contributors in all projects",
                        "Active GitHub projects",
                        "Inactive GitHub projects",
                        "Projects with contribution guide in %",
                        "Projects with code of conduct in %",
                        "Projects accepting donations in %",
                        "Median number of commits",
                        "Median stargazers",
                        "Median stars last year",
                        "Median Development Distribution Score",
                        "Median number of contributors",
                        "Median closed issues last year",
                        "Median commits last year",
                        "Median age in years",
                    ],
                    [
                        df_raw["project_name"].count(),
                        df_raw["platform"].value_counts()["github"],
                        df_raw["platform"].value_counts()["gitlab"],
                        df_raw["platform"].value_counts()["custom"],
                        df_raw["project_name"].count() - df_raw["organization"].count(),
                        df_raw["organization"].count(),
                        df_raw["stargazers_count"].sum(),
                        df_raw["contributors"].sum(),
                        df_raw["project_active"].value_counts()[True],
                        df_raw["project_active"].value_counts()[False],
                        round(
                            df_raw["contribution_guide"].value_counts(normalize=True)[
                                True
                            ]
                            * 100,
                            2,
                        ),
                        round(
                            df_raw["code_of_conduct"].value_counts(normalize=True)[True]
                            * 100,
                            2,
                        ),
                        round(
                            df_raw["accepts_donations"].value_counts(normalize=True)[
                                True
                            ]
                            * 100,
                            2,
                        ),
                        df_raw["total_number_of_commits"].median(),
                        df_raw["stargazers_count"].median(),
                        df_raw["stars_last_year"].median(),
                        round(df_raw["development_distribution_score"].median(), 4),
                        df_raw["contributors"].median(),
                        df_raw["issues_closed_last_year"].median(),
                        df_raw["total_commits_last_year"].median(),
                        round(df_raw["project_age_in_years"].median(), 2),
                    ],
                ],
            ),
        )
    ]
)


fig["layout"].update(margin=dict(l=5, r=5, b=0, t=5))
fig.update_layout(height=700, dragmode=False)
fig.show(responsive=True)

```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
:figclass: caption-hack
:name: statistics-all-projects

<br/>Statistics on all active and inactive projects
```

In [13]:
df_active = df_raw.copy()
# Filter out the inactive projects for further analysis
df_active = df_active[(df_active["project_active"] == True)]
# Curated lists are not classical open source projects and are excluded from the analysis
df_active = df_active[(df_active["topic"] != "Curated Lists")]
# At the time of data processing, only one project was active in this topic.
df_active = df_active[(df_active["topic"] != "Production and Industry")]

# Filter out the projects not on the GitHub platform
df_active = df_active[(df_active["platform"] == "github")]
df_active["project_name"] = df_active["project_name"].replace(
    {
        "A Global Inventory of Commerical-, Industrial-, and Utility-Scale Photovoltaic Solar Generating Units": "A Global Inventory of Photovoltaic"
    }
)
df_active["project_name"] = df_active["project_name"].replace(
    {
        "Asset-level Transition Risk in the Global Coal, Oil, and Gas Supply Chains": "Global Coal, Oil, and Gas Supply Chains"
    }
)

df_active["project_name"] = df_active["project_name"].replace(
    {
        "The REgional Model of INvestments and Development": "REMIND"
    }
)

df_active["project_name"] = df_active["project_name"].replace(
    {
        "Hierarchical Engine for Large-scale Infrastructure Co-Simulation": "HELICS"
    }
)

df_active["project_name"] = df_active["project_name"].replace(
    {
        "Grid Singularity Energy Exchange Engine (D3A)": "Grid Singularity Energy Exchange"
    }
)

df_active["project_name"] = df_active["project_name"].replace(
    {
        "Integrated Valuation of Ecosystem Services and Tradeoffs": "InVEST"
    }
)



def text_to_link(project_name, git_url):
    return '<a href="' + git_url + '" target="_blank" style = "color: black">' + project_name + "</a>"


df_active["project_name"] = df_active.apply(
    lambda x: text_to_link(x.project_name, x.git_url), axis=1
)

In [14]:
# Derive the field of each project from its topic


def topic_to_field(topic):
    if topic in (
        "Photovoltaics and Solar Energy",
        "Wind Energy",
        "Hydro Energy",
        "Geothermal Energy",
        "Bioenergy",
    ):
        field = "Renewable Energy"
    elif topic in ("Battery", "Hydrogen"):
        field = "Energy Storage"
    elif topic in (
        "Energy Modeling and Optimization",
        "Energy Monitoring and Control",
        "Energy Distribution and Grids",
        "Datasets on Energy Systems",
    ):
        field = "Energy Systems"
    elif topic in (
        "Buildings and Heating",
        "Mobility and Transportation",
        "Production and Industry",
        "Computation and Communication",
    ):
        field = "Consumption of Energy and Resources"
    elif topic in (
        "Carbon Intensity and Accounting",
        "Carbon Capture and Removal",
        "Emission Observation and Modeling",
    ):
        field = "Emissions"
    elif topic in ("Life Cycle Assessment", "Circular Economy and Waste"):
        field = "Industrial Ecology"
    elif topic in ("Biosphere", "Cryosphere", "Hydrosphere", "Atmosphere"):
        field = "Earth Systems"
    elif topic in (
        "Earth and Climate Modeling",
        "Radiative Transfer",
        "Meteorological Observation and Forecast",
        "Climate Data Processing and Access",
        "Integrated Assessment",
    ):
        field = "Climate and Earth Science"
    elif topic in (
        "Air Quality",
        "Water Supply and Quality",
        "Soil and Land",
        "Agriculture and Nutrition",
        "Natural Hazard and Poverty",
    ):
        field = "Natural Resources"
    elif topic in (
        "Sustainable Development Goals",
        "Sustainable Investment",
        "Knowledge Platforms",
        "Data Catalogs and Interfaces",
        "Curated Lists",
    ):
        field = "Sustainable Development"
    else:
        print(topic)
        raise ValueError("Topic not within fields")
    return field


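# Correct a misspelled topic name. The result is assigned back because
# inplace=True on a column selection may not modify df_active itself.
df_active["topic"] = df_active["topic"].replace(
    {"Carbon Capture and Removel": "Carbon Capture and Removal"}
)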
df_active["field"] = df_active["topic"].apply(topic_to_field)
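
The topic-to-field assignment could also be written as a lookup table instead of an `if`/`elif` chain. The following is a hedged sketch, not the notebook's implementation: `FIELD_TOPICS`, `TOPIC_TO_FIELD` and `topic_to_field_lookup` are hypothetical names, and only two of the fields are included for brevity.

```python
# Hypothetical lookup-table variant; only two fields are shown for brevity
FIELD_TOPICS = {
    "Energy Storage": ("Battery", "Hydrogen"),
    "Industrial Ecology": ("Life Cycle Assessment", "Circular Economy and Waste"),
}

# Invert the table so that each topic points to its field
TOPIC_TO_FIELD = {
    topic: field for field, topics in FIELD_TOPICS.items() for topic in topics
}


def topic_to_field_lookup(topic):
    """Return the field of a topic, failing loudly for unknown topics."""
    try:
        return TOPIC_TO_FIELD[topic]
    except KeyError:
        raise ValueError(f"Topic not within fields: {topic}") from None


print(topic_to_field_lookup("Hydrogen"))
```

Keeping the mapping as data makes it easier to spot a topic that is missing from every field, while the explicit `ValueError` preserves the fail-fast behaviour of the original function.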

In [15]:
# Each project is ranked according to different indicators in the categories of activity, community and size.
# Percentile ranks are used: the highest rank is 1 and the lowest approaches 0.
# The individual ranks are averaged within each category to create the category scores.
df_active["activity"] = (
    df_active["total_commits_last_year"].rank(pct=True)
    + df_active["issues_closed_last_year"].rank(pct=True)
    + df_active["days_until_last_issue_closed"].rank(pct=True)
    + df_active["last_released_date"].rank(pct=True, na_option="top")
) / 4

df_active["community"] = (
    df_active["contributors"].rank(pct=True)
    + df_active["development_distribution_score"].rank(pct=True)
    + df_active["reviews_per_pr"].rank(pct=True)
) / 3

df_active["size"] = (
    df_active["total_number_of_commits"].rank(pct=True)
    + df_active["contributors"].rank(pct=True)
    + df_active["closed_issues"].rank(pct=True)
    + df_active["closed_pullrequests"].rank(pct=True)
) / 4

# Each category score is normalised by its maximum and the three are averaged, so a total score of 1 is the upper bound.
df_active["total_score"] = (
    df_active["activity"] / df_active["activity"].max()
    + df_active["community"] / df_active["community"].max()
    + df_active["size"] / df_active["size"].max()
) / 3
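
To make the percentile ranking concrete, the following toy example applies the same `rank(pct=True)` call to two indicators and averages them into a score. The values are invented for illustration, and only two of the indicators are used, so this is a sketch of the technique rather than a reproduction of the scores above.

```python
import pandas as pd

# Invented values for four hypothetical projects
toy = pd.DataFrame(
    {
        "contributors": [1, 5, 20, 50],
        "total_number_of_commits": [10, 400, 150, 9000],
    }
)

# rank(pct=True) returns each row's rank divided by the number of rows,
# so the largest value gets 1.0 and the smallest gets 1/n (here 0.25), not 0
contributors_rank = toy["contributors"].rank(pct=True)
commits_rank = toy["total_number_of_commits"].rank(pct=True)

# Averaging the percentile ranks gives a category score in the range (0, 1]
toy["size"] = (contributors_rank + commits_rank) / 2
print(toy)
```

Because ranks rather than raw values are averaged, a single project with an extreme number of commits cannot dominate the score.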

Open source projects are grouped into fields based on their primary topic of focus. While the boundaries often overlap, these fields help to paint a broad landscape and can provide insight into the ecosystem health and complexity of fields relative to each other. The following sunburst diagram shows the relationship between fields, topics, and projects. The colour represents the {ref}`dds_chapter`.

`````{admonition} Tip
:class: tip
The plot is fully interactive. Hover over fields, topics, and projects to explore them, and click on the project names to jump to the repositories.
`````

In [16]:
import numpy as np
import pandas as pd
import plotly.io as pio
import plotly.graph_objects as go
import plotly.express as px
from opensustainTemplate import *

df_active = pd.read_csv("../csv/project_analysis.csv")


fig = px.sunburst(
    df_active.assign(
        hole='<a href="https://opensustain.tech/" style = "color: black">Open Sustainable Technology</a>'
    ),
    path=["hole", "field", "topic", "project_name"],
    maxdepth=3,
    color="development_distribution_score",
    custom_data=["oneliner", "topic", "git_url"],
    # Diverging colors
    color_continuous_scale=color_continuous_scale,
    # color_continuous_midpoint=df_active['development_distribution_score'].median(),
)

fig.update_layout(
    coloraxis_colorbar=dict(title="Development Distribution Score",
    orientation='h',
    y=-0.15,
    x=0.5
    ),
    height=1000,
    width=1000,
    title_font_size=22,
    font_size=12,
    dragmode=False,
)
# animated transitions are currently not implemented when uniformtext is used
fig.update_traces(
    insidetextorientation="radial",
    textinfo="label",
    marker=dict(line=dict(color="#000000", width=1)),
    hovertemplate="<br>".join(
        [
            "Project Info: <b>%{customdata[0]}</b>",
            "Topic: <b>%{customdata[1]}</b>",
            "Git URL: <b>%{customdata[2]}</b>",
        ]
    ),
)
fig.show()

```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
:figclass: caption-hack
:name: projects-within-sectors

<br/>All studied projects grouped into the corresponding fields and topics
```

The following scatter plot provides an overview of all projects studied. The size of the circles is proportional to the size score of the projects, which combines total commits, contributors, closed issues, and closed pull requests. The colour bar shows the Development Distribution Score (DDS), a measure of how work is distributed among the individual developers. A high value indicates that work is widely shared and thus points to a strong developer community. More details can be found in the chapter {ref}`dds_chapter`.
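
As a rough intuition for the DDS, the sketch below assumes the commonly used definition of one minus the share of commits made by the most active contributor. The function name and the commit counts are hypothetical, and the referenced chapter remains the authoritative description of how the score is computed in this study.

```python
def development_distribution_score(commits_per_author):
    """Return 1 minus the commit share of the most active author (assumed definition)."""
    total = sum(commits_per_author)
    if total == 0:
        raise ValueError("at least one commit is required")
    return 1 - max(commits_per_author) / total


# A single developer doing all the work versus work shared among three developers
solo = development_distribution_score([100])
shared = development_distribution_score([50, 30, 20])
print(solo, shared)
```

Under this assumed definition, a project maintained by one person scores lowest, and the score rises as commits are spread more evenly.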

In [17]:
fig = px.scatter(
    df_active,
    x="project_age_in_years",
    y="topic",
    size="size",
    color="development_distribution_score",
    color_continuous_scale=color_continuous_scale,
    custom_data=["project_name", "oneliner", "git_url"],
    size_max=20,
)

fig.update_layout(
    coloraxis_colorbar=dict(
        title="Development Distribution Score",
        orientation='h',
        y=-0.15
    ),
    yaxis=dict(type="category", categoryorder="total ascending"),
    yaxis_title=None,
    xaxis_title="Project age in years",
    height=1000,
    width=1210,
    title="Overview of all projects",
    hoverlabel=dict(
        bgcolor="white",
    ),
    dragmode=False,
)
fig.update_traces(
    hovertemplate="<br>".join(
        [
            "Project Name: <b>%{customdata[0]}</b>",
            "Project Info: <b>%{customdata[1]}</b>",
            "Git URL: <b>%{customdata[2]}</b>",
        ]
    )
)
fig.add_layout_image(
    dict(
        source=logo_img,
        xref="paper",
        yref="paper",
        x=1,
        y=1,
        sizex=0.05,
        sizey=0.05,
        xanchor="right",
        yanchor="top",
    )
)
fig["layout"].update(margin=dict(l=0, r=0, b=0, t=40))
fig["layout"]["xaxis"]["autorange"] = "reversed"
fig.show()

```{figure} data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
:figclass: caption-hack
:name: overview-all-projects

<br/>Overview of all projects of the last 14 years since the launch of GitHub
```

This overview depicts how projects within specific topics have evolved over time. Early, diverse developments can be seen in topics such as biosphere, photovoltaics, and mobility and transportation. Other topics, such as carbon intensity and accounting or computation and communication, have emerged only recently. More information can be found in the chapter {ref}`growth`. We can also see that the majority of OSS developments emerged rapidly approximately three years ago. Fewer and fewer new projects are appearing, which is examined in greater detail in the chapter on {ref}`age`.

In [18]:
# Save the dataset with the scores
df_active_path = "../csv/project_analysis.csv"
df_active.to_csv(df_active_path)