## Data Analysis on Tools 

### 0.0 Setup

#### 0.1 Imports

In [827]:
import pandas as pd
import plotly.express as px
import numpy as np
from datetime import datetime, timezone
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from ast import literal_eval

In [828]:
zkp_repos = pd.read_csv('zkp_repos.csv', sep=';')
tool_commits = pd.read_csv('tool_commits.csv')
application_commits = pd.read_csv('application_commits.csv')
tool_issues = pd.read_csv('tool_issues.csv')
repo_contributors = pd.read_csv('repo_contributors.csv')
contributor_data = pd.read_csv('contributor_data.csv')
branches_data = pd.read_csv('branches_data.csv')
application_contributors = pd.read_csv('application_contributors.csv')
zkp_tool_info = pd.read_csv('zkp_tool_info.csv', sep=';')

In [829]:
# TODO: check if this is still needed
remove_tools = ['cairo/starkware-libs', 'noir/noir-lang', 'starknet-rs/xjonathanlei', 'zokrates/zokrates', 'circom-compat/arkworks-rs', 'snarky/o1-labs']
tool_commits = tool_commits[~tool_commits['UniqueID'].isin(remove_tools) ]
tool_issues = tool_issues[~tool_issues['UniqueID'].isin(remove_tools) ]
repo_contributors = repo_contributors[~repo_contributors['UniqueID'].isin(remove_tools)]

#### 0.2 Get AppCount

In [830]:
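# Strip the list punctuation from the stringified tool list, then split it into a real list.
# regex=False makes pandas treat '[' and ']' as literal characters on every version.
zkp_repos.loc[zkp_repos["Type"] == "Application", "Tool"] = (
    zkp_repos["Tool"]
    .str.replace("[", "", regex=False)
    .str.replace("]", "", regex=False)
    .str.replace("'", "", regex=False)
    .str.split(", ")
)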
zkp_repos_exploded = zkp_repos.explode('Tool')

In [831]:
tool_counts = zkp_repos_exploded["Tool"].value_counts().reset_index()

tool_counts.columns = ["Tool", "AppCount"]
tool_counts = tool_counts.merge(
    zkp_repos[zkp_repos["Type"] == "Tool"], left_on="Tool", right_on="Name", how="outer"
)[["UniqueID", "AppCount"]]
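
The `Tool` column for applications appears to hold a stringified Python list such as `"['circom', 'snarkjs']"`; that cell format is an assumption based on the string replacements above. Since `literal_eval` is already imported, a minimal sketch of parsing such cells directly, on made-up rows, could look like this:

```python
from ast import literal_eval

import pandas as pd

# Hypothetical rows mimicking the assumed layout of zkp_repos.csv.
sample = pd.DataFrame(
    {
        "Name": ["app-a", "app-b", "circom"],
        "Type": ["Application", "Application", "Tool"],
        "Tool": ["['circom', 'snarkjs']", "['circom']", None],
    }
)


def parse_tools(cell):
    """Turn "['a', 'b']" into ['a', 'b']; leave any other value untouched."""
    if isinstance(cell, str) and cell.startswith("["):
        return literal_eval(cell)
    return cell


sample["Tool"] = sample["Tool"].apply(parse_tools)

# One row per (application, tool) pair, then count applications per tool.
exploded = sample.explode("Tool")
app_counts = exploded.loc[exploded["Type"] == "Application", "Tool"].value_counts()
print(app_counts)
```

Parsing the list this way keeps tool names that contain commas or quotes intact, which the replace-and-split approach would not.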

### 1.0 AppCount <a id='AppCount'></a>



#### 1.1 AppCount per Tool

In [832]:
fig = px.bar(tool_counts, 
             x='UniqueID', 
             y='AppCount', 
             title='Application Usage per Tool',
             template="plotly_dark",
             text='AppCount'
             )
fig.show()

### 2.0 Issues 

#### 2.1 Tools & No. of Issues

In [833]:
fig = px.bar(tool_issues.groupby('UniqueID')['URL'].nunique().reset_index().sort_values(by='URL', ascending=False),
    x="UniqueID",
    y="URL",
    template="plotly_dark",
    title="Tools & No. of Total Issues",
    labels={'URL': 'No. of Issues'}
)

fig.show()

As shown in the graph above, the tools `leo`, `plonky2` and `miden-vm` have the highest number of issues. 

Here, "issues" are a combination of issues and pull requests (PRs) as returned by the GitHub API. Contributions are managed by creating issues and pull requests and merging them.

High issue counts are often a sign of an active development ecosystem: these repos mostly follow a fork-and-pull development model, which requires contributors to create issues and PRs when contributing. Repositories often prefer that each issue or PR address a single, small problem or feature, which can increase the issue count further. As the `plonky2` documentation states, *'Under no circumstances should a single PR mix different purposes: Your PR is either a bug fix, a new feature, or a performance improvement, never a combination. Nor should you include, for example, two unrelated performance improvements in one PR. Please just submit separate PRs.'*

\* Note that the issues for the tools were collected using the GitHub API. The issues returned are a combination of the Issues and Pull Requests for a repository.
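
If the two kinds of records ever need to be separated, GitHub's REST API marks pull requests in an issue listing with a `pull_request` key. The sketch below works on hand-written records, not on `tool_issues.csv`, whose columns may not carry this field; the URLs and numbers are made up.

```python
# Hand-written records shaped like items from GitHub's "list repository issues"
# endpoint; the URLs and numbers are made up for illustration.
api_items = [
    {"number": 1, "state": "closed", "html_url": "https://example.invalid/issues/1"},
    {
        "number": 2,
        "state": "open",
        "html_url": "https://example.invalid/pull/2",
        "pull_request": {"url": "https://example.invalid/pulls/2"},
    },
    {"number": 3, "state": "open", "html_url": "https://example.invalid/issues/3"},
]


def split_issues_and_prs(items):
    """Separate plain issues from pull requests using the `pull_request` key."""
    issues = [item for item in items if "pull_request" not in item]
    pull_requests = [item for item in items if "pull_request" in item]
    return issues, pull_requests


issues, pull_requests = split_issues_and_prs(api_items)
print(len(issues), "issues,", len(pull_requests), "pull requests")
```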

#### 2.2 No. of Open Issues

In [834]:
open_issues = tool_issues[tool_issues['ClosedAt'].isna()].groupby('UniqueID')['URL'].nunique().reset_index().sort_values(by='URL', ascending=False).rename(columns={'URL': 'OpenCount'})

fig = px.bar(open_issues, 
             x='UniqueID', 
             y='OpenCount',
             title='No. of Open Issues for Each Tool',
             template='plotly_dark'
             )

fig.show()

#### 2.3 No. of Closed Issues

In [835]:
closed = tool_issues[~tool_issues['ClosedAt'].isna()].groupby('UniqueID')['URL'].nunique().reset_index().sort_values(by='URL', ascending=False).rename(columns={'URL': 'ClosedCount'})

fig = px.bar(closed, 
             x='UniqueID', 
             y='ClosedCount',
             title='No. of Closed Issues for Each Tool',
             template='plotly_dark'
             )

fig.show()

#### 2.4 No. of Open & Closed Issues

In [836]:
open_closed = open_issues.merge(closed, left_on='UniqueID', right_on='UniqueID')
open_closed.sort_values(by='ClosedCount', inplace=True, ascending=False)

fig = px.bar(open_closed, x='UniqueID', y=['OpenCount', 'ClosedCount'],
             title='No. of Opened & Closed Counts for Each Tool',
             labels={'value': 'Count', 'variable': 'Status'},
             template='plotly_dark')

fig.show()


`leo`, `plonky2` and `miden-vm` are the repos with the highest number of total issues; however, they also have the largest number of closed issues. The number of open issues for these repositories is a small fraction of their total issue count.

These are all active repositories with an active development team creating issues and PRs. `miden-vm` and `plonky2` have the highest number of active committers; `leo` has fewer active committers, but the number is still relatively high compared to other repos.


`openzkp` (last commit date Dec 2020) and `libsnark` (last commit date Jul 2020) have a high number of open issues. These repositories are no longer active or maintained. Since development of the repositories has halted, these issues are not addressed and remain open.

`halo2` has a high number of open issues and remains an active repository. The commit history shows that contributions are made by creating PRs and issues and merging them into main. When there are many contributors and many changes to the package, this can result in a high number of issues. `halo2` has a high number of active committers, meaning there is a team of contributors making changes to the project by opening issues and PRs.

#### 2.5 App Count & Issue Resolution

In [837]:
tool_issues['ClosedAt'] = tool_issues['ClosedAt'].fillna(0)
issue_counts = tool_issues.groupby(["UniqueID", "State"]).size().unstack(fill_value=0)
issue_counts.reset_index(inplace=True)
issue_counts.columns = ["UniqueID", "Closed", "Open"]
issue_counts["IssueResolutionRate"] = (issue_counts["Closed"]) / (
    issue_counts["Closed"] + issue_counts["Open"]
)
issue_counts.sort_values("IssueResolutionRate", ascending=False, inplace=True)

merged_tools_issues = (
    issue_counts[["UniqueID", "IssueResolutionRate"]]
    .merge(tool_counts, left_on="UniqueID", right_on="UniqueID", how="outer")
    .sort_values("AppCount")
)

fig = px.scatter(merged_tools_issues,
    x="IssueResolutionRate",
    y="AppCount",
    template="plotly_dark",
    title="Application Usage & Issue Resolution Rate",
    trendline="ols", 
    hover_name='UniqueID'
)

fig.show()

The following have the lowest resolution rates:

|    | UniqueID                  |   IssueResolutionRate |   AppCount |
|---:|:--------------------------|----------------------:|-----------:|
| 34 | libsnark/scipr-lab        |              0.423645 |         63 |
| 35 | cairo-lang/starkware-libs |              0.341176 |        110 |
| 33 | circomlib/iden3           |              0.446602 |        167 |


- `libsnark/scipr-lab`, `cairo-lang/starkware-libs` and `circomlib/iden3` are no longer actively maintained. 

- `cairo/starkware` is a Rust rewrite of `cairo-lang/starkware-libs`. `cairo-lang` was the previous `cairo` stack, written in Python. Since the rewrite, the stack is in Rust and the main repo is `cairo/starkware`. For this reason, the `cairo-lang` repository is no longer actively maintained, so its issues and PRs are not closed or merged.

- `libsnark/scipr-lab` is no longer maintained or active. The repo's last commit date was in Jul 2020.

- `circomlib/iden3` is no longer maintained or active. The repo's last commit date was in Jun 2022.

Since these repos are inactive and no longer actively maintained, their open issues have not been addressed.

From the graph we can see that there is virtually no linear correlation (R^2 = 0.003) between a tool's issue resolution rate and the number of apps that use it.
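
The R^2 values quoted in this notebook are presumably read off the OLS trendlines. For a straight-line fit with a single predictor, R^2 equals the squared Pearson correlation, so it can be cross-checked without plotting. A minimal sketch with NumPy, using made-up numbers, not values from the dataset:

```python
import numpy as np


def r_squared(x, y):
    """R^2 of a simple least-squares line fitted to (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up values, not taken from the dataset.
resolution_rate = [0.35, 0.45, 0.80, 0.90, 0.95]
app_count = [110, 167, 20, 300, 15]

r2 = r_squared(resolution_rate, app_count)
pearson_r = np.corrcoef(resolution_rate, app_count)[0, 1]
print(f"R^2 = {r2:.3f}, squared Pearson r = {pearson_r ** 2:.3f}")
```

Values close to 0 mean the fitted line explains almost none of the variation in `AppCount`.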

#### 2.6 App Count & Issues Opened per Month

In [838]:
monthly_opened_issues = tool_issues

monthly_opened_issues["CreatedAt"] = pd.to_datetime(monthly_opened_issues["CreatedAt"])
monthly_opened_issues["CreatedYearMonth"] = monthly_opened_issues[
    "CreatedAt"
].dt.to_period("M")
monthly_opened_issues = monthly_opened_issues[["UniqueID", "CreatedYearMonth"]].rename(
    columns={"CreatedYearMonth": "YearMonth", "UniqueID": "UniqueID"}
)
monthly_opened_issues = (
    monthly_opened_issues.groupby(["UniqueID", "YearMonth"])
    .size()
    .reset_index(name="OpenedCount")
)
monthly_opened_issues = (
    monthly_opened_issues.groupby("UniqueID")["OpenedCount"]
    .mean()
    .reset_index(name="AverageOpenedPerMonth")
)
monthly_opened_issues = monthly_opened_issues.merge(
    tool_counts, left_on="UniqueID", right_on="UniqueID", how="left"
)

fig = px.scatter(monthly_opened_issues, x='AverageOpenedPerMonth', y='AppCount', 
                template="plotly_dark",
                title='Application Usage & Issues Opened per Month',
                hover_data=['UniqueID'],
                trendline='ols')
fig.show()
                



From the graph above, it can be seen that *generally* tools with a low number of issues opened per month are commonly used in apps. There is a weak correlation (R^2 = 0.026) between AverageOpenedPerMonth and AppCount.

The outlier is `leo/aleohq`, with a high number of issues opened per month and a very low app count. Investigating the Aleo ecosystem suggests that the tool is still in its development phase: the ecosystem has not yet launched a mainnet, only a testnet. There are multiple projects in the Aleo ecosystem, and the Leo DSL is their foundation. The high number of opened issues is a sign of the active development of the DSL; the low application usage can be explained by the fact that the tool is still in development and not yet entirely ready for public use, unlike other DSLs. Leo is also much younger than other tools in the space, specifically other DSLs.

#### 2.7 No. of Issues Closed per Month

In [839]:
monthly_closed_issues = tool_issues

monthly_closed_issues["ClosedAt"] = pd.to_datetime(monthly_closed_issues["ClosedAt"])
monthly_closed_issues["ClosedYearMonth"] = monthly_closed_issues[
    "ClosedAt"
].dt.to_period("M")
monthly_closed_issues = monthly_closed_issues[["UniqueID", "ClosedYearMonth"]].rename(
    columns={"ClosedYearMonth": "YearMonth", "UniqueID": "UniqueID"}
)
monthly_closed_issues = (
    monthly_closed_issues.groupby(["UniqueID", "YearMonth"])
    .size()
    .reset_index(name="ClosedCount")
)
monthly_closed_issues = (
    monthly_closed_issues.groupby("UniqueID")["ClosedCount"]
    .mean()
    .reset_index(name="AverageClosedPerMonth")
)
monthly_closed_issues = monthly_closed_issues.merge(
    tool_counts, left_on="UniqueID", right_on="UniqueID", how="left"
)

fig = px.scatter(
    monthly_closed_issues,
    x="AverageClosedPerMonth",
    y="AppCount",
    trendline="ols",
    template="plotly_dark",
    title="Application Usage & Issues Closed per Month",
    hover_name='UniqueID'
)

fig.show()



The graph above shows the average number of issues closed per month for each tool. There is a weak correlation (R^2 = 0.019) between the average number of issues closed per month and how often the tool is used.

Once again, the outlier is `leo/aleohq` for reasons explained in the previous graph. 
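
How the monthly average is computed matters here. An issue that is still open has no `ClosedAt` value, and if missing values are replaced with `0` before calling `pd.to_datetime`, they can be parsed as 1970-01-01 and form a spurious month. The sketch below drops open issues first and removes the timezone explicitly before building monthly periods. The rows are made up, and ISO 8601 UTC timestamps are an assumption about the CSV.

```python
import pandas as pd

# Made-up issue rows; column names follow tool_issues, timestamps are assumed ISO 8601 UTC.
issues = pd.DataFrame(
    {
        "UniqueID": ["a/x", "a/x", "a/x", "b/y", "b/y"],
        "ClosedAt": [
            "2023-01-05T10:00:00Z",
            "2023-01-20T12:30:00Z",
            "2023-03-02T08:15:00Z",
            "2023-02-11T09:00:00Z",
            None,  # still open
        ],
    }
)

# Keep only issues that were actually closed.
closed = issues.dropna(subset=["ClosedAt"]).copy()
closed["ClosedAt"] = pd.to_datetime(closed["ClosedAt"], utc=True)
# Dropping the timezone explicitly avoids the warning raised by to_period.
closed["YearMonth"] = closed["ClosedAt"].dt.tz_localize(None).dt.to_period("M")

avg_closed_per_month = (
    closed.groupby(["UniqueID", "YearMonth"])
    .size()
    .groupby(level="UniqueID")
    .mean()
    .rename("AverageClosedPerMonth")
)
print(avg_closed_per_month)
```

As in the cell above, the mean is taken only over months in which at least one issue was closed, so quiet months do not lower the average.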

In [840]:
# Join the per-tool monthly averages once and reuse the result for the plot.
opened_vs_closed = monthly_closed_issues.merge(
    monthly_opened_issues, on="UniqueID", how="left"
)

fig = px.scatter(
    opened_vs_closed,
    x="AverageClosedPerMonth",
    y="AverageOpenedPerMonth",
    template="plotly_dark",
    title="Issues Closed per Month & Issues Opened Per Month",
    trendline="ols",
    hover_name='UniqueID'
)
fig.show()

From the trendline above, it is clear that there is a strong correlation (R^2 = 0.987) between the average number of issues opened and closed per month. This suggests that these tool repos were actively maintained during their lifetime: on average, the number of issues/PRs opened per month is matched by the number closed per month.

#### 2.8 Tools & Issues Over Time

In [841]:
tool_issues_over_time = (
    tool_issues[["CreatedYearMonth", "UniqueID", "Name"]]
    .groupby(["UniqueID", "CreatedYearMonth"])
    .count()
    .reset_index()
    .rename(columns={"Name": "IssueCount"})
)
tool_issues_over_time["CreatedYearMonth"] = tool_issues_over_time[
    "CreatedYearMonth"
].astype("datetime64[ns]")

px.line(
    tool_issues_over_time,
    x="CreatedYearMonth",
    y="IssueCount",
    color="UniqueID",
    title="Issues over time",
    template="plotly_dark",
)

The graph above illustrates that tools often have a surge in the number of issues within the first year of development. This should be taken into account when interpreting the issue counts of a repository.

#### 2.9 Tools & Initial Issues

In [842]:
tool_issues_over_time['CreatedYearMonth'] = pd.to_datetime(tool_issues_over_time['CreatedYearMonth'], format='%Y-%m')

tool_issues_over_time['Year'] = tool_issues_over_time['CreatedYearMonth'].dt.year

total_issues_first_year = (
    tool_issues_over_time.groupby(['UniqueID', 'Year'])['IssueCount']
    .sum()
    .reset_index()
    .sort_values(by=['UniqueID', 'Year'])
)

total_issues_first_year = total_issues_first_year.groupby('UniqueID').first().reset_index()
total_issues_first_year.sort_values(by='IssueCount', inplace=True, ascending=False)

total_issues = tool_issues.groupby('UniqueID')['URL'].nunique().reset_index().sort_values(by='URL', ascending=False)
total_issues = total_issues.merge(total_issues_first_year, left_on='UniqueID', right_on='UniqueID', how='left')
total_issues.rename(columns={'URL': 'TotalIssues', 'IssueCount': 'InitialIssues'}, inplace=True)
total_issues['Ratio'] = total_issues['InitialIssues']/total_issues['TotalIssues']
total_issues.sort_values('Ratio', inplace=True, ascending=False)


px.bar(
    total_issues,
    x='UniqueID',
    y='Ratio',
    title="Ratio of Initial Issues to Total Issues By Tools",
    template="plotly_dark",
)


`plonky3`, `plonky` and `gemini` have a high ratio of initial issues to total issues.

These tools have a short lifespan, which could explain the large ratio (as mentioned above, repos tend to have a surge in the number of issues created within their first year).

- `plonky3` is less than a year old (first commit in Feb 2023)

- `plonky`'s lifetime was just over a year (Feb 2020 - Oct 2021)

- `gemini`'s lifetime was also just over a year (Nov 2021 - Jan 2023)
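
Note that the "initial issues" figure above is taken from the first calendar year in which a tool has issues, so a repository whose first issue was opened in November contributes only two months. A sketch of an alternative definition counts issues opened within 365 days of each tool's first issue. The rows are made up, and the 365-day window is an assumption, not the notebook's method.

```python
import pandas as pd

# Made-up rows; only the column names mirror tool_issues.
issues = pd.DataFrame(
    {
        "UniqueID": ["a/x"] * 4 + ["b/y"] * 2,
        "CreatedAt": pd.to_datetime(
            [
                "2021-11-01", "2021-12-15", "2022-06-30", "2023-01-10",
                "2022-02-01", "2022-03-01",
            ]
        ),
    }
)

# Date of the first issue of each tool, repeated on every row of that tool.
first_issue = issues.groupby("UniqueID")["CreatedAt"].transform("min")
issues["InFirstYear"] = issues["CreatedAt"] < first_issue + pd.Timedelta(days=365)

# The mean of a boolean column is the share of issues opened in the first 365 days.
initial_ratio = issues.groupby("UniqueID")["InFirstYear"].mean().rename("Ratio")
print(initial_ratio)
```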



#### 2.10 No. of Issues Closed Per Month & No. of Contributors

In [843]:
repo_contributors = repo_contributors[repo_contributors['UniqueID'].isin(tool_counts['UniqueID'])]
contributor_counts = repo_contributors.groupby('UniqueID')['Contributor'].nunique().reset_index()
issues_contributors = contributor_counts.merge(monthly_closed_issues, left_on='UniqueID', right_on='UniqueID', how='left')
issues_contributors.rename(columns={'Contributor': 'ContributorCount'}, inplace=True)

px.scatter(
    issues_contributors,
    x="ContributorCount",
    y="AverageClosedPerMonth",
    hover_name="UniqueID",
    trendline='ols',
    title="No. of Issues Closed Per Month & No. of Contributors",
    template="plotly_dark",
)

As seen in the graph above, there is a weak correlation (R^2 = 0.145) between the number of contributors and average number of issues closed per month for a tool. 


### 3.0 Language

#### 3.1 App Count & Language

In [844]:
language_tool_counts = tool_counts.merge(
    zkp_repos[zkp_repos["Type"] == "Tool"],
    left_on="UniqueID",
    right_on="UniqueID",
    how="right",
)[["UniqueID", "AppCount", "Language"]]

language_tool_counts.sort_values(by='AppCount', inplace=True, ascending=False)

fig = px.scatter(
    language_tool_counts,
    x="UniqueID",
    y="AppCount",
    color="Language",
    template="plotly_dark",
    title="Application count of Tool (Languages Used)",
    category_orders={'UniqueID': language_tool_counts['UniqueID']}
    
)

fig.show()

From the graph above, it is clear that most tools are built using Rust as their primary language. 

`circom` is the only tool built using WebAssembly. However, in its documentation it says 'Circom compiler is a circom language compiler written in Rust that can be used to generate a R1CS file with a set of associated constraints and a program (written either in C++ or WebAssembly)'

Why Rust? *(speculation)*
 - **performance**: provides low-level control over system resources, making it suitable for performance-critical applications like cryptographic operations. ZKP protocols often involve complex mathematical computations, and Rust's performance characteristics are advantageous in this context.
 - **crypto libraries**: well-maintained cryptographic libraries, such as `rust-crypto` and `ring`
 - **memory safety**: ensures memory safety without the need for garbage collection (memory safety is crucial in cryptographic applications to prevent vulnerabilities such as buffer overflows and other memory-related issues)
 - **concurrency**
 - **community & ecosystem**: active community of developers


In [845]:
# Sum only AppCount; summing the whole frame would also concatenate the UniqueID strings.
language_counts = (
    language_tool_counts.groupby("Language")["AppCount"]
    .sum()
    .reset_index()
    .sort_values("AppCount", ascending=False)
)

fig = px.bar(
    language_counts,
    x="Language",
    y="AppCount",
    color="Language",
    template="plotly_dark",
    title="Total Application Usage by Language",
)

fig.show()

When looking at the AppCount of the languages (instead of the individual tools), it is clear that tools written in Rust contribute the highest count, followed by JavaScript (mostly attributable to `snarkjs` and `circomlib`).

### 4.0 Contributors

#### 4.1 App Count & Contributors

In [846]:
repo_contributors = repo_contributors[repo_contributors['UniqueID'].isin(tool_counts['UniqueID'])]
contributor_counts = repo_contributors.groupby('UniqueID')['Contributor'].nunique().reset_index()
contributor_counts.rename(columns={'Contributor': 'ContributorCount'}, inplace=True)
contributor_counts = contributor_counts.merge(tool_counts, left_on='UniqueID', right_on='UniqueID')
contributor_counts.sort_values(by=['ContributorCount'], ascending=False, inplace=True)

fig = px.scatter(contributor_counts, x='ContributorCount', y='AppCount', 
             title='Application Usage and Contributor Count',
             template="plotly_dark",
             hover_name='UniqueID',
             trendline='ols',
             labels={'ContributorCount': 'No. of Contributors', 'UniqueID': 'Repository'}
             )
fig.show()

As seen in the graph above, there is a weak correlation (R^2 = 0.159) between the number of contributors for a tool and the usage of the tool. 

### 5.0 Age

#### 5.1 App Count & Tool Age

In [847]:
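# Copy the slice so the new columns are not written to a view of zkp_repos.
tool_age = zkp_repos[zkp_repos['Type'] == 'Tool'].copy()
# Parse as UTC so the subtraction from a timezone-aware "now" is well defined.
tool_age['Created'] = pd.to_datetime(tool_age['Created'], utc=True)
tool_age['Age'] = (datetime.now(timezone.utc) - tool_age['Created']).dt.days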
tool_age = tool_age[['UniqueID', 'Age']]
tool_age = tool_age.merge(tool_counts, left_on='UniqueID', right_on='UniqueID', how='left')

fig = px.scatter(tool_age, 
                x='Age', 
                y='AppCount', 
                title='Application Usage and Age',
                template="plotly_dark",
                hover_name='UniqueID',
                trendline='ols',
             )
fig.show()




There is virtually no correlation (R^2 = 0.006) between the age of the tool and the number of apps that use it.

`libsnark` is the oldest out of all the tools with a low AppCount. `libsnark` is no longer actively maintained with its last commit date being Jul 2020. This tool's lifespan ended before some other tools were even developed. The low AppCount could be because `libsnark` is outdated or perhaps another tool has improved upon its functionality. The only other C++ tool is `Risc0`, but this tool is based on zk-STARKs whereas `libsnark` is based on zk-SNARK schemes.

TODO: verify the 'CREATED' field

### 6.0 Lifespan

#### 6.1 Lifespan of Tool (first commit date - last commit date)

In [848]:
first_commit = tool_commits.groupby('UniqueID')['CommitterDate'].min().reset_index()
first_commit.rename(columns={'CommitterDate': 'FirstCommit'}, inplace=True)
last_commit = tool_commits.groupby('UniqueID')['CommitterDate'].max().reset_index()
last_commit.rename(columns={'CommitterDate': 'LastCommit'}, inplace=True)

lifespan = first_commit.merge(last_commit, left_on='UniqueID', right_on='UniqueID')
lifespan['FirstCommit'] = pd.to_datetime(lifespan['FirstCommit'], utc=True)
lifespan['LastCommit'] = pd.to_datetime(lifespan['LastCommit'], utc=True)
lifespan['Lifespan'] = (lifespan['LastCommit'] - lifespan['FirstCommit']).dt.days

fig = px.scatter(lifespan.sort_values('Lifespan', ascending=False), 
                x='UniqueID', 
                y='Lifespan', 
                title='Lifespan of Tools',
                template="plotly_dark",
                hover_name='UniqueID',
                # trendline='ols',
             )
fig.show()


From the graph above, it can be seen that `bellman` has the longest lifespan. 

#### 6.2 Lifespan & App Count

In [849]:
lifespan_count = lifespan.merge(tool_counts,  left_on='UniqueID', right_on='UniqueID')

fig = px.scatter(lifespan_count, 
                x='Lifespan', 
                y='AppCount', 
                title='Application Usage and Lifespan',
                template="plotly_dark",
                hover_name='UniqueID',
                trendline='ols',
             )
fig.show()


There is a weak correlation (R^2 = 0.042) between the Lifespan and tool usage. 

#### Active Tools

In [850]:
lifespan[lifespan['FirstCommit'].dt.year == 2022].UniqueID

27    risc0/risc0
Name: UniqueID, dtype: object

### 7.0 Commits

#### 7.1 Date of Tool First Commit

In [851]:
start_dates = tool_commits.groupby('UniqueID')['AuthorDate'].min().reset_index()
start_dates['AuthorDate'] = pd.to_datetime(start_dates['AuthorDate'], utc=True)
start_dates['AuthorDate'] = start_dates['AuthorDate'].astype(str)
# start_dates['YearMonth'] = start_dates['AuthorDate'].dt.to_period('M')
start_dates.rename(columns={'AuthorDate': 'FirstCommit'}, inplace=True)
start_dates.sort_values(by='FirstCommit', inplace=True)

fig = px.scatter(start_dates, 
              x='FirstCommit', 
              y='UniqueID', 
              title='Date of Tool First Commit',
              template="plotly_dark",
              hover_name='UniqueID',
             )
fig.show()

#### 7.2 Date of Tool Last Commit

In [852]:
end_dates = tool_commits.groupby('UniqueID')['AuthorDate'].max().reset_index()
end_dates.rename(columns={'AuthorDate': 'LastCommit'}, inplace=True)
end_dates['LastCommit'] = pd.to_datetime(end_dates['LastCommit'], utc=True)
end_dates.sort_values(by='LastCommit', inplace=True)
# A tool counts as active if its last commit was made in 2023.
end_dates['Active'] = end_dates['LastCommit'].dt.year == 2023

fig = px.scatter(end_dates, 
              x='LastCommit', 
              y='UniqueID', 
              title='Date of Tool Last Commit',
              template="plotly_dark",
              hover_name='UniqueID',
              color='Active'
             )
fig.show()

In [853]:
fig = px.bar(end_dates['Active'].value_counts().reset_index(), 
              x='Active', 
              y='count', 
              title='No. of Active & Inactive Tools',
              template="plotly_dark",
              labels={'count': 'Count'},
              color='Active'
             )
fig.show()

As seen from the graph above, most tools are still being actively maintained. 

In [854]:
fig = px.bar(end_dates.merge(tool_counts, left_on='UniqueID', right_on='UniqueID').sort_values('AppCount', ascending=False), 
              x='UniqueID', 
              y='AppCount', 
              title='Application Usage of Active & Inactive Tools',
              template="plotly_dark",
              color='Active',
              category_orders={'UniqueID': tool_counts['UniqueID']}
             )
fig.show()

As seen above, *generally* tools which are active have high app counts. 

`circomlib` is inactive but has a high AppCount. The other tools in its ecosystem, `circom` and `snarkjs` are both active. 
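
"Active" above means that the last commit falls in calendar year 2023, which ties the result to when the notebook was run. A sketch of a relative definition is shown below: a tool is active if its last commit lies within a fixed window before a snapshot date. The snapshot date, the 180-day window and the rows are all assumptions for illustration.

```python
import pandas as pd

# Made-up last-commit dates; the snapshot date and window are assumptions.
last_commits = pd.DataFrame(
    {
        "UniqueID": ["a/x", "b/y", "c/z"],
        "LastCommit": pd.to_datetime(
            ["2023-10-01", "2022-06-15", "2020-07-20"], utc=True
        ),
    }
)

snapshot = pd.Timestamp("2023-11-01", tz="UTC")
window = pd.Timedelta(days=180)

# Active means the last commit is at most `window` older than the snapshot.
last_commits["Active"] = (snapshot - last_commits["LastCommit"]) <= window
active_summary = last_commits["Active"].value_counts()
print(active_summary)
```

Fixing the snapshot date, rather than using the current time, keeps the classification reproducible when the notebook is rerun later.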

#### 7.3 Tool Commits Over Time

In [855]:
all_commits_df = tool_commits.sort_values(by='CommitterDate')
all_commits_df['UniqueID'] =( all_commits_df['Name'] + '/' + all_commits_df['Owner']).str.lower()
all_commits_df['CommitterDate'] = pd.to_datetime(all_commits_df['CommitterDate'])
fig = px.scatter(all_commits_df, x='CommitterDate', y='UniqueID', color='UniqueID',
                 title='All Commits Over Time for Various Repositories',
                 labels={'Date': 'Commit Date', 'Name': 'Repository'},
                 template="plotly_dark")

fig.update_xaxes(title_text='Date')
fig.update_yaxes(title_text='Repository')
fig.update_layout(legend_title_text='Repository', height=1000)
fig.show()




`gnark`, `leo`, `plonky2` and `miden-vm` show very frequent commit patterns.

#### 7.4 App Count & Commit Count

In [856]:
total_commits = tool_commits
total_commits = total_commits.groupby('UniqueID')['CommitHash'].nunique().reset_index()
total_commits.columns = ['UniqueID', 'CommitCount']
total_commits = total_commits.merge(tool_counts, left_on='UniqueID', right_on='UniqueID', how='right')

fig = px.scatter(total_commits, 
                x='CommitCount', 
                y='AppCount', 
                title='Application Usage and CommitCount',
                template="plotly_dark",
                hover_name='UniqueID',
                trendline='ols',
                labels={'UniqueID': 'Repository'}
             )
fig.show()

As seen in the graph above, there is a weak correlation (R^2 = 0.027) between the number of commits a tool has and its usage. 

`zksync` seems to be slightly inactive (last commit in Feb 2023) with the focus put on the development of the `zksync-era` tool. `zksync` seems to be in a similar situation as `leo` in the sense that it is still in its development phase.  ZkSync was developed by Matter Labs, a blockchain research and engineering firm, allowing the tool to have a set of developers actively contributing to the project. This could explain its high development metrics but low app usage. 

#### 7.5 App Count & Commit Frequency

In [857]:
commit_frequency = tool_commits
commit_frequency["CommitterDate"] = pd.to_datetime(
    commit_frequency["CommitterDate"], utc=True
)
commit_frequency["YearMonth"] = commit_frequency["CommitterDate"].dt.to_period("M")
commit_frequency = (
    commit_frequency.groupby(["UniqueID", "YearMonth"])
    .size()
    .reset_index(name="CommitCount")
)
commit_frequency = (
    commit_frequency.groupby("UniqueID")["CommitCount"]
    .mean()
    .reset_index(name="AverageCommitFrequency")
)
commit_frequency = commit_frequency.merge(
    tool_counts, left_on="UniqueID", right_on="UniqueID", how="left"
)

fig = px.scatter(
    commit_frequency,
    x="AverageCommitFrequency",
    y="AppCount",
    title="Application Usage and Average Number Of Commits per Month",
    template="plotly_dark",
    hover_name='UniqueID',
    trendline='ols',
    labels={"UniqueID": "Repository"},
)
fig.show()



Why do these repos have so many commits per month?

|    | UniqueID             |   AverageCommitFrequency |   AppCount |
|---:|:---------------------|-------------------------:|-----------:|
| 15 | leo/aleohq           |                  140.14  |          7 |
| 21 | openzkp/0xproject    |                  100.19  |          1 |
| 23 | plonky2/mir-protocol |                  141.156 |         17 |
| 35 | zksync/matter-labs   |                  265.065 |          4 |

A high commit count can be attributed to the following factors:
- active development
  - `leo`: average lifespan. active.
  - `openzkp`: lower lifespan. inactive.
  - `plonky2`: average-to-low lifespan. active.
  - `zksync`: average-to-high lifespan. active.
- contribution by issues and PRs
  - `leo`: yes, as stated in docs; automatic commits by `dependabot[bot]` also add to the commit count
  - `openzkp`: mostly yes, except for one contributor who often pushes small commits straight to main
  - `plonky2`: yes, as stated in docs, with a subset of contributors pushing to main
  - `zksync`: yes; automatic commits by `dependabot[bot]`
- multiple contributors
  - `leo`: 31
  - `openzkp`: 2
  - `plonky2`: 28
  - `zksync`: 54
- branching strategy
  - `leo`: fork & pull, with a subset of contributors pushing to main
  - `openzkp`: fork & pull, one contributor often pushing straight to main. atomic commits.
  - `plonky2`: fork & pull, with a subset of contributors pushing to main
  - `zksync`: fork & pull, with a subset of contributors pushing to main
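
Since automated accounts such as `dependabot[bot]` add to the commit count, it can help to check how much of the monthly average they account for. The sketch below compares the average with and without bot commits on made-up data. The `Author` column name follows the notebook, but treating every login that ends in `[bot]` as automated is an assumption.

```python
import pandas as pd

# Made-up commits; treating logins ending in "[bot]" as automated is an assumption.
commits = pd.DataFrame(
    {
        "UniqueID": ["a/x"] * 6,
        "Author": ["alice", "dependabot[bot]", "bob", "dependabot[bot]", "alice", "bob"],
        "CommitterDate": pd.to_datetime(
            [
                "2023-01-03", "2023-01-04", "2023-01-20",
                "2023-02-01", "2023-02-10", "2023-02-11",
            ]
        ),
    }
)


def average_commits_per_month(df):
    """Mean number of commits per calendar month that has at least one commit."""
    months = df["CommitterDate"].dt.to_period("M")
    return df.groupby(["UniqueID", months]).size().groupby(level="UniqueID").mean()


is_bot = commits["Author"].str.endswith("[bot]")
with_bots = average_commits_per_month(commits)
without_bots = average_commits_per_month(commits[~is_bot])
print(with_bots["a/x"], without_bots["a/x"])
```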


`std/arkworks`, `algebra/arkworks` and `curves/arkworks` have high AppCounts but low ActiveCommitters. These repos are mostly used indirectly - tools often use the Cairo DSL which is built using these repos. 

#### 7.6 Active Committers

In [858]:
tool_commits['CommitterDate'] = pd.to_datetime(tool_commits['CommitterDate'])
active_commits = tool_commits[tool_commits['CommitterDate'].dt.year == 2023]

active_commits = active_commits.merge(contributor_data, left_on='Author', right_on='Login', how='left') 
active_commits = active_commits.dropna(subset=['Login'])

active_committers = active_commits.groupby('UniqueID')['Author'].nunique().reset_index()

active_committers.rename(columns={'Author': 'ActiveCommitters'}, inplace=True)

fig = px.scatter(
    active_committers.sort_values('ActiveCommitters', ascending=False),
    x="UniqueID",
    y="ActiveCommitters",
    title="No. of Active Committers per Tool",
    template="plotly_dark",
    hover_name='UniqueID',
)
fig.show()


#### 7.7 App Count & Active Committers

In [859]:
active_counts = active_committers.merge(tool_counts, left_on='UniqueID', right_on='UniqueID')

fig = px.scatter(
    active_counts.sort_values('AppCount'),
    x="ActiveCommitters",
    y="AppCount",
    title="Application Usage and Active Committers",
    template="plotly_dark",
    hover_name='UniqueID',
    trendline='ols'
)
fig.show()

There is virtually no correlation (R^2 = 0.005) between the number of active committers and the number of times a tool is used.

`std/arkworks`, `algebra/arkworks` and `curves/arkworks` have high AppCounts but low ActiveCommitters. 

Many applications use these repositories indirectly by using the Cairo DSL (which is built using these three tools). 

Keep in mind that this metric starts from the contributors returned for a repository and checks whether each of them is still active. There may be additional authors who contribute to the project without having linked a GitHub account to their commits. In some cases the same person commits both from their GitHub account and from an unlinked identity, so the commit `Author` differs even though the person is the same. To avoid counting such duplicates, contributors were counted instead of authors (contributors are a subset of authors).
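
The effect of counting contributors rather than raw authors can be sketched on toy data. The frames below are invented for illustration (`commits_demo`, `contributors_demo` and the names in them are assumptions); only the merge, `dropna` and `nunique` steps mirror the cell above.

```python
import pandas as pd

# Toy data: 'alice' commits once with her GitHub login and once with an unlinked identity.
commits_demo = pd.DataFrame({
    "UniqueID": ["tool-a", "tool-a", "tool-a", "tool-b"],
    "Author": ["alice", "Alice Smith", "bob", "alice"],
})
contributors_demo = pd.DataFrame({"Login": ["alice", "bob"]})

# Left-merge on the login, then drop commits whose author is not a known contributor.
matched = commits_demo.merge(
    contributors_demo, left_on="Author", right_on="Login", how="left"
)
matched = matched.dropna(subset=["Login"])

# Counting raw authors treats the two identities of the same person as two people.
raw_authors = commits_demo.groupby("UniqueID")["Author"].nunique()
contributors_only = matched.groupby("UniqueID")["Author"].nunique()
```

In this toy case `tool-a` has three distinct author strings but only two contributors. The trade-off is that authors who never linked an account are not counted at all.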

#### 7.8 App Count & New Committers

In [860]:
contributor_commits = tool_commits.merge(contributor_data, left_on='Author', right_on='Login', how='left') 
contributor_commits = contributor_commits.dropna(subset='Login')
contributor_commits['CommitterDate'] = pd.to_datetime(contributor_commits['CommitterDate'])
active_contributor_commits = contributor_commits[contributor_commits['CommitterDate'].dt.year == 2023]
old_committers = contributor_commits['Login'].unique()

new_committers = contributor_commits.groupby('Login')['CommitterDate'].min().reset_index()
new_committers = new_committers[new_committers['CommitterDate'].dt.year == 2023]
new_committers.rename(columns={'CommitterDate': 'FirstCommit'}, inplace=True)
new_committers = new_committers.merge(active_contributor_commits, left_on='Login', right_on='Login')
new_committers = new_committers.groupby('UniqueID')['Login'].nunique().reset_index()
new_committers.rename(columns={'Login': 'NewCommitterCount'}, inplace=True)
new_committers = new_committers.merge(tool_counts, left_on='UniqueID', right_on='UniqueID', how='right')
new_committers['NewCommitterCount'] = new_committers['NewCommitterCount'].fillna(0)

fig = px.scatter(
    new_committers.sort_values('NewCommitterCount', ascending=False),
    x="UniqueID",
    y="NewCommitterCount",
    title="No. of New Committers per Tool",
    template="plotly_dark",
    hover_name='UniqueID',
)
fig.show()


`leo` and `risc0` have the highest number of new committers. 

`risc0` is still a fairly new project, with its first commit being in Feb 2022. It makes sense that its community is gaining traction to develop the tool. 

Although `leo` has been around for longer, it is also still in its development phase. Perhaps the need to increase the rate of development has drawn in more new committers. 
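
Note that the cell above takes each contributor's first commit across all tracked tools, so someone who was already active on one tool before 2023 is not counted as new on another. The sketch below contrasts that definition with a per-tool one on invented data; the per-tool variant is an assumption shown for comparison, not the method used in this notebook.

```python
import pandas as pd

commits_demo = pd.DataFrame({
    "UniqueID": ["tool-a", "tool-a", "tool-b", "tool-b"],
    "Login": ["alice", "bob", "alice", "carol"],
    "CommitterDate": pd.to_datetime(
        ["2021-05-01", "2023-02-01", "2023-03-01", "2023-04-01"], utc=True
    ),
})

# Definition used above: the first commit to *any* tracked tool falls in 2023.
first_any = commits_demo.groupby("Login")["CommitterDate"].min()
new_logins = first_any[first_any.dt.year == 2023].index
in_2023 = commits_demo[commits_demo["CommitterDate"].dt.year == 2023]
new_overall = (
    in_2023[in_2023["Login"].isin(new_logins)].groupby("UniqueID")["Login"].nunique()
)

# Alternative (assumption): the first commit to *this* tool falls in 2023.
first_per_tool = (
    commits_demo.groupby(["UniqueID", "Login"])["CommitterDate"].min().reset_index()
)
new_per_tool = (
    first_per_tool[first_per_tool["CommitterDate"].dt.year == 2023]
    .groupby("UniqueID")["Login"]
    .nunique()
)
```

Here `alice` counts as new to `tool-b` only under the per-tool definition, because her first commit anywhere dates from 2021.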

#### 7.9 App Count & New Commits

In [861]:
new_commits = tool_commits.copy()  # copy so the conversion below does not modify tool_commits

new_commits["CommitterDate"] = pd.to_datetime(new_commits["CommitterDate"], utc=True)
new_commits = new_commits[new_commits["CommitterDate"].dt.year == 2023]
new_commits = new_commits.groupby("UniqueID").size().reset_index(name="NewCommits")
new_commits = new_commits.merge(
    tool_counts, left_on="UniqueID", right_on="UniqueID", how="right"
).fillna(0)

fig = px.scatter(
    new_commits,
    x="NewCommits",
    y="AppCount",
    title="Application Usage and New Commits",
    template="plotly_dark",
    hover_name='UniqueID',
    labels={"UniqueID": "Repository"},
)
fig.show()

`plonky2` has the most new commits. As discussed before, `plonky2`'s high commit count could be explained by its active development team, contribution methods and branching strategy. The same applies to `gnark`. 

### 8.0 Branches

#### 8.1 App Count & Branch Count

In [862]:
branch_count = branches_data.groupby('UniqueID').size().reset_index(name='BranchCount')
branch_count = branch_count.merge(tool_counts, left_on='UniqueID',  right_on='UniqueID', how='left')

fig = px.scatter(branch_count, x='BranchCount', y='AppCount', 
             title='Application Usage and Branch Count',
             template="plotly_dark",
             hover_name='UniqueID',
             labels={'UniqueID': 'Repository'}
             )
fig.show()

`snarky` has the highest number of branches. As with other tools, `snarky` seems to follow a fork-and-push development strategy, whereby contributors fork the repo, address a feature or an issue, and then merge those changes into main. 

In the case of `snarky`, there are many branches which have been abandoned. 

### 9.0 Tool Type

#### 9.1 App Count and Tool Type 

In [863]:
tools = zkp_repos[zkp_repos['Type'] == 'Tool'][['UniqueID', 'ToolType']]
tools_types = tool_counts.merge(tools, left_on='UniqueID',  right_on='UniqueID', how='right')
tools_types.sort_values(by='AppCount', inplace=True, ascending=False)

fig = px.bar(tools_types, x='UniqueID', y='AppCount', 
             title='Application Usage and Tool Type',
             template="plotly_dark",
             color='ToolType',
             labels={'UniqueID': 'Repository'},
             category_orders={'UniqueID': tools_types['UniqueID']},
             text='AppCount'
             )
fig.show()


#### 9.2 Total Application Usage by Tool Type

In [864]:
# Sum only AppCount; a bare .sum() would also aggregate the string column UniqueID.
fig = px.bar(tools_types.groupby('ToolType')['AppCount'].sum().reset_index().sort_values(by='AppCount', ascending=False),
             x='ToolType', y='AppCount',
             title='Total Application Usage by Tool Type',
             template="plotly_dark",
             color='ToolType',
             labels={'ToolType': 'Tool Type'}
)
fig.show()

As seen above, low-level ZKP development tools account for the most application usage. Many applications use the Cairo DSL, which is built using these low-level tools. 

### 10.0 Tool Combinations

#### 10.1 Tools Used In Combination

In [865]:
from itertools import combinations
from collections import Counter

zkp_applications = zkp_repos[zkp_repos['Type'] == 'Application']
tool_combinations = zkp_applications['Tool'].apply(lambda x: list(combinations(x, 2)))
tool_combinations = [tuple(item) for sublist in tool_combinations for item in sublist]
tool_combinations = Counter(tool_combinations)
tool_combinations = pd.DataFrame(list(tool_combinations.items()), columns=['Tool Combination', 'Frequency'])
tool_combinations.sort_values('Frequency', inplace=True, ascending=False)
tool_combinations['Tool Combination'] = tool_combinations['Tool Combination'].astype(str)

fig = px.bar(tool_combinations[tool_combinations['Frequency'] > 10],
             x='Tool Combination', 
             y='Frequency',
             title='No. of Times Tool Combinations Are Used',   
             template="plotly_dark",
)
fig.show()


The top 3 most common combinations are:
- (`arkworks/algebra`, `arkworks/std`) with count 611
- (`arkworks/std`, `arkworks/curves`) with count 483
- (`arkworks/algebra`, `arkworks/curves`) with count 451

`arkworks/algebra`: Libraries for finite field, elliptic curve, and polynomial arithmetic 

`arkworks/std`: A base library for interfacing with streams of vectors and matrices.

`arkworks/curves`: Implementations of popular elliptic curves

These tools are used in the construction of ZKP circuits and computation of proofs. Each library presents a different functionality, which is why they are often used in combination. 
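
One thing to keep in mind when counting pairs: `itertools.combinations` emits pairs in the order of the input list, so `(a, b)` and `(b, a)` become different keys if the tool lists are not consistently ordered. Whether that affects the counts above depends on how the `Tool` lists were ordered, which is not shown here. A minimal sketch of an order-independent count, on invented lists (`apps_demo` is hypothetical):

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-application tool lists; the same pair appears in both orders.
apps_demo = [
    ["algebra", "std", "curves"],
    ["std", "algebra"],
    ["circom"],
]

# Sorting (and de-duplicating) each list first makes (a, b) and (b, a) the same key.
pair_counts = Counter(
    pair for tools in apps_demo for pair in combinations(sorted(set(tools)), 2)
)
```

In this example `("algebra", "std")` is counted twice, and no `("std", "algebra")` key is created.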

#### 10.2 Tools Always Used Alone

In [866]:
zkp_applications = zkp_repos[zkp_repos['Type'] == 'Application']
tool_combinations = zkp_applications['Tool'].apply(lambda x: list(combinations(x, 2)))
flat_combinations = [tuple(item) for sublist in tool_combinations for item in sublist]
combination_counts = Counter(flat_combinations)
all_tools = set([tool for sublist in zkp_applications['Tool'] for tool in sublist])
tools_in_combinations = set([tool for combination in flat_combinations for tool in combination])
tools_never_used_in_combination = all_tools - tools_in_combinations
tools_never_used_in_combination

{'bulletproofs (sdiehl)', 'plonky3', 'zksync'}

`bulletproofs (sdiehl)`, `plonky3` and `zksync` are never used in combination with another tool. 

These three tools have a low app count. 

`bulletproofs` is a proof system. `bulletproofs/dalek-cryptography` seems to be the preferred tool to use when implementing this proof system. 

`plonky3` is a library for implementing polynomial IOPs (PIOPs), such as PLONK and STARKs. It is the "youngest" tool, having its first commit in Feb 2023. 

`zksync` is a zkEVM. Tools of this kind may ship with built-in functionality for creating ZKPs, so there is no need to combine them with other tools.  
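
The set difference used in this section can be illustrated on a few invented applications (`apps_demo` and the tool lists are assumptions): a tool is "always used alone" if it never appears in an application that lists at least two tools.

```python
from itertools import combinations

# Hypothetical tool lists, one per application.
apps_demo = [["circom", "snarkjs"], ["plonky3"], ["circom"], ["zksync"]]

all_tools_demo = {tool for tools in apps_demo for tool in tools}
# Every tool that occurs in at least one pair, i.e. alongside another tool.
paired_tools_demo = {
    tool for tools in apps_demo for pair in combinations(tools, 2) for tool in pair
}
always_alone_demo = all_tools_demo - paired_tools_demo
```

`circom` is not in the result: it appears alone in one application but alongside `snarkjs` in another.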

### 11.0 External Resources

#### 11.1 Application Usage per Tool (External resources available)

In [867]:
resources = tool_counts.merge(zkp_repos, left_on='UniqueID', right_on='UniqueID')

fig = px.bar(resources, 
             x='UniqueID', 
             y='AppCount', 
             title='Application Usage per Tool (External resources available)',
             template="plotly_dark",
             color='Tool Resources', 
             category_orders = {'UniqueID':resources['UniqueID'] }
             )
fig.show()

All tools with high AppCount have external resources available. 

### 12.0 Proof Constructions & Proof Systems

#### 12.1 Proof Construction Frequency of Tools

In [868]:
zkp_tool_info[['UniqueID','ProofConstruction','ProvingSystem' ]]

Unnamed: 0,UniqueID,ProofConstruction,ProvingSystem
0,algebra/arkworks-rs,snark,
1,crypto-primitives/arkworks-rs,snark,
2,curves/arkworks-rs,snark,
3,gemini/arkworks-rs,snark,gemini
4,gm17/arkworks-rs,snark,gm17
5,groth16/arkworks-rs,snark,groth16
6,marlin/arkworks-rs,snark,marlin
7,nonnative/arkworks-rs,snark,
8,poly-commit/arkworks-rs,snark,marlin
9,r1cs-std/arkworks-rs,snark,


In [869]:
zkp_tool_info = pd.read_csv('zkp_tool_info.csv', sep=';')
zkp_tool_info['ProofConstruction'] = zkp_tool_info['ProofConstruction'].str.split(', ')
zkp_tool_info_exp = zkp_tool_info.explode('ProofConstruction')

# One row per (tool, construction) pair, then count tools per construction.
construction_counts = zkp_tool_info_exp.drop_duplicates(subset=['UniqueID', 'ProofConstruction'])['ProofConstruction'].value_counts().reset_index()
fig = px.bar(construction_counts, 
             x='ProofConstruction', 
             y='count', 
             title='Frequency of Proof Constructions Supported by Tool',
             template="plotly_dark",
             labels={'count':'No. of Tools', 'ProofConstruction': 'Proof Construction'}
             )
fig.show()

zk-SNARKs are the most commonly supported proof construction. They were among the first constructions used in ZKP development; other constructions, such as zk-STARKs, are slightly newer. 

It is important to note that a tool may support multiple proof constructions.
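
Because one tool can list several constructions, the counts can add up to more than the number of tools. A small sketch of the split, explode and count steps on invented rows (`tool_info_demo` and its values are assumptions, not the contents of `zkp_tool_info.csv`):

```python
import pandas as pd

tool_info_demo = pd.DataFrame({
    "UniqueID": ["tool-a", "tool-b", "tool-c"],
    "ProofConstruction": ["snark", "snark, stark", "stark"],
})

# Split the comma-separated field into lists, then give each value its own row.
exploded_demo = (
    tool_info_demo
    .assign(ProofConstruction=tool_info_demo["ProofConstruction"].str.split(", "))
    .explode("ProofConstruction")
    .drop_duplicates(subset=["UniqueID", "ProofConstruction"])
)
construction_counts_demo = exploded_demo["ProofConstruction"].value_counts()
```

Three tools produce four counted entries here because `tool-b` supports both constructions.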

#### 12.2 Proof Systems Frequency of Tools

In [870]:
zkp_tool_info_exp['ProvingSystem'] = zkp_tool_info_exp['ProvingSystem'].str.split(', ')
zkp_tool_info_exp = zkp_tool_info_exp.explode('ProvingSystem')

# One row per (tool, proving system) pair, then count tools per proving system.
proving_system_counts = zkp_tool_info_exp.drop_duplicates(subset=['UniqueID', 'ProvingSystem'])['ProvingSystem'].value_counts().reset_index()
fig = px.bar(proving_system_counts, 
             x='ProvingSystem', 
             y='count', 
             title='Frequency of Proving Systems Supported by Tool',
             template="plotly_dark",
             labels={'count':'No. of Tools', 'ProvingSystem': 'Proving System'}
             )
fig.show()


Plonk is the most commonly supported proving system. 

It is important to note that a tool may support multiple proving systems.

#### 12.3 Proof Construction Over Time

In [871]:

zkp_tools = zkp_repos[zkp_repos['Type'] == 'Tool']
start_info = zkp_tool_info.merge(zkp_tools[['UniqueID', 'Created']], left_on='UniqueID', right_on='UniqueID')
start_info['Created'] = pd.to_datetime(start_info['Created'])
start_info['ProofConstruction'] = start_info['ProofConstruction'].astype(str)

fig = px.scatter(start_info.sort_values(by='Created'), 
                 x='Created', 
                 y='UniqueID', 
                 title='Tool Creation Date by Proof Construction',
                 template="plotly_dark",
                 color='ProofConstruction',
                )
fig.show()


#### 12.4 Proof Systems Over Time

In [872]:
zkp_tools = zkp_repos[zkp_repos['Type'] == 'Tool']
start_info = zkp_tool_info.merge(zkp_tools[['UniqueID', 'Created']], left_on='UniqueID', right_on='UniqueID')
start_info['Created'] = pd.to_datetime(start_info['Created'])
start_info['ProvingSystem'] = start_info['ProvingSystem'].astype(str)

fig = px.scatter(start_info.sort_values(by='Created'), 
                 x='Created', 
                 y='UniqueID', 
                 title='Tool Creation Date by Proving System',
                 template="plotly_dark",
                 color='ProvingSystem',
                )
fig.show()

#### 12.5 Frequency of Proof Constructions by Application Use

In [873]:
zkp_apps_exploded = zkp_repos_exploded[zkp_repos_exploded['Type'] == 'Application'] 
zkp_tool_info_exp.rename(columns={'Name': 'ToolName'}, inplace=True)
zkp_apps_exploded = zkp_apps_exploded.merge(zkp_tool_info_exp[['ToolName', 'ProofConstruction', 'ProvingSystem']], left_on='Tool', right_on='ToolName') 
print(zkp_apps_exploded.drop_duplicates(subset=['UniqueID', 'ProofConstruction'])['ProofConstruction'].value_counts().reset_index().to_markdown())

fig = px.bar(zkp_apps_exploded.drop_duplicates(subset=['UniqueID', 'ProofConstruction'])['ProofConstruction'].value_counts().reset_index(), 
             x='ProofConstruction', 
             y='count', 
             title='Frequency of Proof Constructions by Application Use',
             template="plotly_dark",
             labels={'count':'No. of Applications', 'ProofConstruction': 'Proof Construction'}
             )
fig.show()


|    | ProofConstruction   |   count |
|---:|:--------------------|--------:|
|  0 | snark               |     899 |
|  1 | stark               |     194 |
|  2 | bulletproofs        |      58 |


This follows a similar distribution to that of the tools: tools that support SNARKs are the most commonly used. 

#### 12.6 Frequency of Proof Systems by Application Use

In [874]:
print(zkp_apps_exploded.drop_duplicates(subset=['UniqueID','ProvingSystem' ])['ProvingSystem'].value_counts().reset_index().to_markdown())

zkp_apps_exploded['ProvingSystem'] = zkp_apps_exploded['ProvingSystem'].fillna('unspecified')
fig = px.bar(zkp_apps_exploded.drop_duplicates(subset=['UniqueID','ProvingSystem' ])['ProvingSystem'].value_counts().reset_index(), 
             x='ProvingSystem', 
             y='count', 
             title='Frequency of Proving Systems by Application Use',
             template="plotly_dark",
             labels={'count':'No. of Applications', 'ProvingSystem': 'Proving System'}
             )
fig.show()

|    | ProvingSystem   |   count |
|---:|:----------------|--------:|
|  0 | groth16         |     379 |
|  1 | plonk           |     304 |
|  2 | gm17            |      73 |
|  3 | merlin          |      57 |
|  4 | marlin          |       9 |
|  5 | pinocchio       |       4 |
|  6 | halo            |       1 |
|  7 | gemini          |       1 |


Tools that support groth16 are the most commonly used (379 applications), followed by those that support plonk (304). iden3's Circom stack (`circom`, `snarkjs` and `circomlib`) supports the plonk proving system. These tools are used frequently, and often in combination, as demonstrated previously, which could explain the high frequency of plonk. 

#### 12.7 Creation of Tools Supporting Proof Construction

In [875]:
tool_first_commit = tool_commits.groupby('UniqueID')['CommitterDate'].min().reset_index()
tool_first_commit = tool_first_commit.merge(zkp_tool_info, left_on='UniqueID', right_on='UniqueID')
tool_first_commit['CommitterDate'] = pd.to_datetime(tool_first_commit['CommitterDate']) 
tool_first_commit['MonthYear'] = tool_first_commit['CommitterDate'].dt.to_period('M')
tool_first_commit['MonthYear'] = tool_first_commit['MonthYear'].astype(str)
tool_first_commit['ProofConstruction'] = tool_first_commit['ProofConstruction'].astype(str)

fig = px.scatter(tool_first_commit,
             x='MonthYear', 
             y='Name',
             title='First Commit Month of Tools by Proof Construction',   
             template="plotly_dark",
             color='ProofConstruction',
             color_discrete_sequence=px.colors.qualitative.Light24,
)
fig.show()


Converting to PeriodArray/Index representation will drop timezone information.



#### 12.8 Creation of Tools Supporting Proving Systems

In [876]:
fig = px.scatter(tool_first_commit,
             x='MonthYear', 
             y='Name',
             title='First Commit Month of Tools by Proving System',   
             template="plotly_dark",
             color='ProvingSystem',
             color_discrete_sequence=px.colors.qualitative.Light24,
)
fig.show()

#### 12.9 Proof Construction Use Over Time

In [877]:
zkp_applications_exp = zkp_repos_exploded[zkp_repos_exploded['Type']=='Application']

zkp_applications_exp = zkp_applications_exp.merge(application_commits[['UniqueID', 'CommitterDate']], left_on='UniqueID', right_on='UniqueID')
zkp_applications_exp['CommitterDate'] = pd.to_datetime(zkp_applications_exp['CommitterDate'])
zkp_applications_exp.sort_values('CommitterDate', inplace=True)
zkp_applications_exp = zkp_applications_exp.drop_duplicates(['UniqueID'])

zkp_applications_exp['MonthYear'] = zkp_applications_exp['CommitterDate'].dt.to_period('M')
zkp_applications_exp['MonthYear'] = zkp_applications_exp['MonthYear'].astype(str)

applicationFirstCommit = zkp_applications_exp.groupby(['MonthYear', 'Tool'])['UniqueID'].count().reset_index()
applicationFirstCommit['MonthYear'] = applicationFirstCommit['MonthYear'].astype(str)
applicationFirstCommit.rename(columns={'UniqueID': 'No. of Applications'}, inplace=True)
applicationFirstCommit = applicationFirstCommit.merge(zkp_tool_info[['Name', 'ProofConstruction', 'ProvingSystem']], left_on='Tool', right_on='Name', )
applicationFirstCommit['ProofConstruction'] = applicationFirstCommit['ProofConstruction'].astype(str)

fig = px.scatter(applicationFirstCommit,
             x='MonthYear', 
             y='No. of Applications',
             title='No. of Applications Created per Month Using Tool',   
             template="plotly_dark",
             color='ProofConstruction',
             color_discrete_sequence=px.colors.qualitative.Light24,
)
fig.show()


Converting to PeriodArray/Index representation will drop timezone information.



#### 12.10 Proving System Use Over Time

In [878]:
applicationFirstCommit['ProvingSystem'] = applicationFirstCommit['ProvingSystem'].fillna('unspecified')

fig = px.scatter(applicationFirstCommit,
             x='MonthYear', 
             y='No. of Applications',
             title='No. of Applications Created per Month Using Tool',   
             template="plotly_dark",
             color='ProvingSystem',
             color_discrete_sequence=px.colors.qualitative.Light24
)
fig.show()

#### 12.11 HeatMap of Proof Construction Popularity Over Time

In [879]:
applicationFirstCommit = zkp_applications_exp.groupby(['MonthYear', 'Tool'])['UniqueID'].count().reset_index()
applicationFirstCommit['MonthYear'] = applicationFirstCommit['MonthYear'].astype(str)
applicationFirstCommit.rename(columns={'UniqueID': 'No. of Applications'}, inplace=True)
applicationFirstCommit = applicationFirstCommit.merge(zkp_tool_info[['Name', 'ProofConstruction', 'ProvingSystem']], left_on='Tool', right_on='Name', )

applicationFirstCommit_exp  = applicationFirstCommit.explode('ProofConstruction')

fig = px.imshow(applicationFirstCommit_exp.pivot_table(index='ProofConstruction', columns='MonthYear', values='No. of Applications'),
                labels=dict(color='No. of Applications'),
                title='Proof Construction Popularity Over Time',
                x=sorted(applicationFirstCommit_exp['MonthYear'].unique()), 
                y=sorted(applicationFirstCommit_exp['ProofConstruction'].unique()),
                template='plotly_dark'
                )

fig.show()

#### 12.12 HeatMap of Proving System Popularity Over Time

In [880]:
applicationFirstCommit = zkp_applications_exp.groupby(['MonthYear', 'Tool'])['UniqueID'].count().reset_index()
applicationFirstCommit['MonthYear'] = applicationFirstCommit['MonthYear'].astype(str)
applicationFirstCommit.rename(columns={'UniqueID': 'No. of Applications'}, inplace=True)
applicationFirstCommit = applicationFirstCommit.merge(zkp_tool_info[['Name', 'ProofConstruction', 'ProvingSystem']], left_on='Tool', right_on='Name', )

applicationFirstCommit['ProvingSystem'] = applicationFirstCommit['ProvingSystem'].str.split(',')
applicationFirstCommit_exp  = applicationFirstCommit.explode('ProvingSystem')
applicationFirstCommit_exp['ProvingSystem'] = applicationFirstCommit_exp['ProvingSystem'].fillna('unspecified')

fig = px.imshow(applicationFirstCommit_exp.pivot_table(index='ProvingSystem', columns='MonthYear', values='No. of Applications'),
                labels=dict(color='No. of Applications'),
                title='Proving System Popularity Over Time',
                x=sorted(applicationFirstCommit_exp['MonthYear'].unique()), 
                y=sorted(applicationFirstCommit_exp['ProvingSystem'].unique()),
                template='plotly_dark'
                )

fig.show()

### 13.0 Original Tools

#### 13.1 App Count for Original Tools

`cairo`, `starknet-rs`, `noir` and `zokrates` were in the original set of tools. However, it was discovered that these tools use a set of `arkworks` libraries in the backend, so they have since been treated as applications. 

In [881]:
zkp_repos_original = pd.read_csv('zkp_repos_original.csv', sep=';')
zkp_repos_original.loc[zkp_repos_original["Type"] == "Application", "Tool"] = zkp_repos_original["Tool"].str.replace('[', '').str.replace(']', '').str.replace('\'', '').str.split(', ')
zkp_repos_original_exploded = zkp_repos_original.explode('Tool')

In [882]:
fig = px.bar(zkp_repos_original_exploded[zkp_repos_original_exploded['Type'] == 'Application']['Tool'].value_counts().reset_index(), 
             x='Tool', 
             y='count', 
             title='Original Frequency Distribution of Tools',
             template="plotly_dark",
             )
fig.show()

fig = px.bar(tools_types, 
             x='UniqueID', 
             y='AppCount', 
             title='Current Frequency Distribution of Tools',
             template="plotly_dark",
             )
fig.show()

When `cairo`, `starknet-rs`, `noir` and `zokrates` are treated as tools instead of applications, `cairo` is the most commonly used tool, followed by `circom` and `snarkjs`. When they are treated as applications, the `arkworks` libraries used to construct them are the most common. 

#### 13.2 Tool Types for Original Tools

In [883]:
original_applications = (zkp_repos_original_exploded[zkp_repos_original_exploded['Type'] == 'Application']).drop(columns=['ToolType'])
original_tool = (zkp_repos_original[zkp_repos_original['Type'] == 'Tool']).rename(columns={'Name': 'ToolName'})
original_tool_type = original_applications.merge(original_tool[['ToolName', 'ToolType']] , left_on='Tool', right_on='ToolName', how='left')

fig = px.bar(original_tool_type['ToolType'].value_counts().reset_index(), 
             x='ToolType', 
             y='count', 
             title='Original Frequency Distribution of Tool Types',
             template="plotly_dark",
             )
fig.show()

# Sum only AppCount; a bare .sum() would also aggregate the string column UniqueID.
fig = px.bar(tools_types.groupby('ToolType')['AppCount'].sum().reset_index().sort_values(by='AppCount', ascending=False),
             x='ToolType', 
             y='AppCount',
             title='Current Frequency Distribution of Tool Types',   
             template="plotly_dark",
)
fig.show()

When `cairo`, `starknet-rs`, `noir` and `zokrates` are treated as tools instead of applications, DSLs are the most commonly used tool type. When they are treated as applications, Low-Level ZK Development tools are the most common tool type. The order of the other types remains the same. 

### 14.0 Application Tool Usage Over Time

#### 14.1 Applications Created Over Time (by `Created` field)

In [884]:
zkp_applications_exp = zkp_repos_exploded[zkp_repos_exploded['Type']=='Application']
zkp_applications_exp['Created'] = pd.to_datetime(zkp_applications_exp['Created'])
zkp_applications_exp['MonthYear'] = zkp_applications_exp['Created'].dt.to_period('M')

tools_counts_monthly = zkp_applications_exp.groupby(['MonthYear', 'Tool'])['UniqueID'].count().reset_index()
tools_counts_monthly['MonthYear'] = tools_counts_monthly['MonthYear'].astype(str)

fig = px.scatter(tools_counts_monthly,
             x='MonthYear', 
             y='UniqueID',
             title='Number of Applications Created per Month Using Tool',   
             template="plotly_dark",
             color='Tool',
             color_discrete_sequence=px.colors.qualitative.Light24
)
fig.show()




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Converting to PeriodArray/Index representation will drop timezone information.





#### 14.2 Applications Created Over Time (by `CommitterDate` field) <a id='applications_created'></a>

In [885]:
zkp_applications_exp = zkp_repos_exploded[zkp_repos_exploded['Type']=='Application']

zkp_applications_exp = zkp_applications_exp.merge(application_commits[['UniqueID', 'CommitterDate']], left_on='UniqueID', right_on='UniqueID')
zkp_applications_exp['CommitterDate'] = pd.to_datetime(zkp_applications_exp['CommitterDate'])
zkp_applications_exp.sort_values('CommitterDate', inplace=True)
zkp_applications_exp = zkp_applications_exp.drop_duplicates(['UniqueID'])

zkp_applications_exp['MonthYear'] = zkp_applications_exp['CommitterDate'].dt.to_period('M')
zkp_applications_exp['MonthYear'] = zkp_applications_exp['MonthYear'].astype(str)

applicationFirstCommit = zkp_applications_exp.groupby(['MonthYear', 'Tool'])['UniqueID'].count().reset_index()
applicationFirstCommit['MonthYear'] = applicationFirstCommit['MonthYear'].astype(str)
applicationFirstCommit.rename(columns={'UniqueID': 'No. of Applications'}, inplace=True)

fig = px.scatter(applicationFirstCommit,
             x='MonthYear', 
             y='No. of Applications',
             title='No. of Applications Created per Month Using Tool',   
             template="plotly_dark",
             color='Tool',
             color_discrete_sequence=px.colors.qualitative.Light24,
)
fig.show()



Converting to PeriodArray/Index representation will drop timezone information.



As seen in the graph, the earliest applications mostly used `libsnark`, after which the tools in use began to diversify. From the beginning of January, a higher number of applications were created using the `arkworks` libraries. 

- the `circom` stack saw an increase after ~2021
- the less popular `arkworks` libraries remained fairly constant


#### 14.3 Applications Stopped Over Time (by `CommitterDate` field)

In [886]:
zkp_applications_exp = zkp_repos_exploded[zkp_repos_exploded['Type']=='Application']

zkp_applications_exp = zkp_applications_exp.merge(application_commits[['UniqueID', 'CommitterDate']], left_on='UniqueID', right_on='UniqueID')
zkp_applications_exp['CommitterDate'] = pd.to_datetime(zkp_applications_exp['CommitterDate'])
zkp_applications_exp.sort_values('CommitterDate', inplace=True, ascending=False)
zkp_applications_exp = zkp_applications_exp.drop_duplicates(['UniqueID'])

zkp_applications_exp['MonthYear'] = zkp_applications_exp['CommitterDate'].dt.to_period('M')
zkp_applications_exp['MonthYear'] = zkp_applications_exp['MonthYear'].astype(str)

applicationFirstCommit = zkp_applications_exp.groupby(['MonthYear', 'Tool'])['UniqueID'].count().reset_index()
applicationFirstCommit['MonthYear'] = applicationFirstCommit['MonthYear'].astype(str)
applicationFirstCommit.rename(columns={'UniqueID': 'No. of Applications'}, inplace=True)

fig = px.scatter(applicationFirstCommit,
             x='MonthYear', 
             y='No. of Applications',
             title='No. of Applications Last Committed per Month Using Tool',   
             template="plotly_dark",
             color='Tool',
             color_discrete_sequence=px.colors.qualitative.Light24,
)
fig.show()


Converting to PeriodArray/Index representation will drop timezone information.



`libsnark` applications ended the earliest. 

Otherwise, the trends are similar to those seen for the first-commit dates. 
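
The first-commit and last-commit months used in 14.2 and 14.3 can also be derived in a single pass. The sketch below uses invented commits (`app_commits_demo` is an assumption) and drops the timezone explicitly before converting to monthly periods, which is one way to avoid the "will drop timezone information" warning shown above.

```python
import pandas as pd

app_commits_demo = pd.DataFrame({
    "UniqueID": ["app-1", "app-1", "app-2"],
    "CommitterDate": [
        "2020-01-15T10:00:00+02:00",
        "2021-06-01T09:00:00+00:00",
        "2022-03-03T12:00:00+00:00",
    ],
})

# Parse to UTC, then remove the timezone so to_period() has nothing to discard.
app_commits_demo["CommitterDate"] = pd.to_datetime(
    app_commits_demo["CommitterDate"], utc=True
).dt.tz_convert(None)

# Earliest and latest commit per application in one groupby.
span = app_commits_demo.groupby("UniqueID")["CommitterDate"].agg(
    FirstCommit="min", LastCommit="max"
)
span["FirstMonth"] = span["FirstCommit"].dt.to_period("M").astype(str)
span["LastMonth"] = span["LastCommit"].dt.to_period("M").astype(str)
```

`span` then holds one row per application, which can be merged with the exploded tool list before grouping by month and tool.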

### Cluster Tools


In [887]:
tools = zkp_tools[['UniqueID','ToolType', 'Size', 'Language', 'Stars', 'Forks', 'Watchers', 'Issues']]
tools = tools.merge(tool_age, left_on='UniqueID', right_on='UniqueID')
tools = tools.merge(total_commits[['UniqueID', 'CommitCount']], left_on='UniqueID', right_on='UniqueID')
tools = tools.merge(active_committers, left_on='UniqueID', right_on='UniqueID', how='left')
tools['ActiveCommitters'] = tools['ActiveCommitters'].fillna(0)
tools = tools.merge(contributor_counts[['UniqueID', 'ContributorCount']], left_on='UniqueID', right_on='UniqueID', how='left')
tools = tools.merge(last_commit, left_on='UniqueID', right_on='UniqueID', how='left')
tools['LastCommit'] = pd.to_datetime(tools['LastCommit'], utc=True)
tools['Active'] = tools['LastCommit'].dt.year == 2023
tools['Active'] = tools['Active'].astype(int)
tools.drop(columns=['LastCommit'], inplace=True)
tools = tools.merge(zkp_tool_info[['UniqueID', 'Production']], left_on='UniqueID', right_on='UniqueID')
tools['Production'] = tools['Production'].astype(int)

In [888]:
cols = [
    # 'UniqueID', 

    # 'ToolType', 

    # 'Size', 

    # 'Language',

    # popularity features  
    # 'Stars', 
    # 'Watchers',  
    # 'AppCount', 

    # development features  
    'Age', 
    'Forks',
    'Issues', 
    'CommitCount',
    'ContributorCount',

    # 'ActiveCommitters', 
    # 'Active',
    # 'Production'
 ]

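nr_components = 2
X = tools[cols]
# Standardize the features so each has zero mean and unit variance.
scaler = StandardScaler()
df_standardized = scaler.fit_transform(X)
# PCA fitted on the standardized features.
pca = PCA(n_components=nr_components)
df_pca = pca.fit_transform(df_standardized)
# K-means is fitted on the standardized features (not on the PCA projection).
kmeans = KMeans(n_clusters=3, random_state=42)
tools['Cluster'] = kmeans.fit_predict(df_standardized)
# Caution: this call refits `pca` on the unscaled X, so explained_variance_ratio_
# and components_ below describe the raw, unstandardized features.
components = pca.fit_transform(X)
pca.explained_variance_ratio_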





array([0.92583324, 0.06669184])

In [889]:
pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2'], index=cols)

Unnamed: 0,PC1,PC2
Age,-0.001997,0.979375
Forks,0.134982,0.200237
Issues,0.00741,0.008997
CommitCount,0.990814,-0.025365
ContributorCount,0.002924,-0.002287
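
The loadings above are dominated by `CommitCount` (PC1) and `Age` (PC2), which is what PCA tends to produce when features on very different scales are not standardized first. The sketch below illustrates the effect with NumPy only, on a tiny synthetic matrix; `X_demo` and `explained_variance_ratio` are hypothetical names and the numbers are not the notebook's data.

```python
import numpy as np

# Two uncorrelated synthetic features on very different scales.
X_demo = np.array([
    [1000.0, 1.0],
    [-1000.0, 1.0],
    [1000.0, -1.0],
    [-1000.0, -1.0],
])


def explained_variance_ratio(X):
    """Share of total variance carried by each principal component, largest first."""
    eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    return eigenvalues / eigenvalues.sum()


# Without scaling, the large-scale feature absorbs almost all of the variance.
raw_ratio = explained_variance_ratio(X_demo)

# After standardizing, both features contribute equally.
X_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
scaled_ratio = explained_variance_ratio(X_scaled)
```

If the intent is to interpret components of the standardized features, the ratios and loadings would need to be read from a PCA fitted on `df_standardized` rather than on `X`.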


In [890]:
tinkering = ['openzkp/0xproject','libsnark/scipr-lab','marlin/arkworks-rs','pysnark/charterhouse','merlin/dalek-cryptography','bulletproofs/sdiehl','plonky/mir-protocol','nonnative/arkworks-rs']
implem = ['algebra/arkworks-rs', 'snarkjs/iden3', 'gnark/consensys', 'plonky2/mir-protocol', 'circom/iden3', 'halo2/zcash', 'bulletproofs/dalek-cryptography', 'bellman/zkcrypto', 'std/arkworks-rs', 'curves/arkworks-rs', 'winterfell/facebook']
book = ['leo/aleohq', 'plonky3/plonky3', 'zksync/matter-labs', 'risc0/risc0', 'miden-vm/0xpolygonmiden']

print(tools[tools['UniqueID'].isin(implem)][['UniqueID','Age', 'Stars', 'Forks', 'Watchers', 'Issues', 'AppCount', 'CommitCount','ContributorCount', 'Active']].sort_values('Age', ascending=False).reset_index().drop(columns='index').to_markdown())



|    | UniqueID                        |   Age |   Stars |   Forks |   Watchers |   Issues |   AppCount |   CommitCount |   ContributorCount |   Active |
|---:|:--------------------------------|------:|--------:|--------:|-----------:|---------:|-----------:|--------------:|-------------------:|---------:|
|  0 | bellman/zkcrypto                |  2881 |     823 |     582 |        823 |       30 |         30 |           353 |                 18 |        1 |
|  1 | bulletproofs/dalek-cryptography |  2110 |     931 |     189 |        931 |       45 |         17 |           892 |                 14 |        1 |
|  2 | snarkjs/iden3                   |  1922 |    1494 |     355 |       1494 |       89 |        195 |           643 |                 39 |        1 |
|  3 | gnark/consensys                 |  1357 |    1055 |     227 |       1055 |       82 |         21 |          2480 |                 22 |        1 |
|  4 | halo2/zcash                     |  1175 |     528 |     335 |        

In [891]:
# Tools not assigned to any of the three manual groups
assigned = book + implem + tinkering
out = tools[~tools['UniqueID'].isin(assigned)]
out['UniqueID'].to_list()

['crypto-primitives/arkworks-rs',
 'gemini/arkworks-rs',
 'gm17/arkworks-rs',
 'groth16/arkworks-rs',
 'poly-commit/arkworks-rs',
 'r1cs-std/arkworks-rs',
 'snark/arkworks-rs',
 'sponge/arkworks-rs',
 'cairo-lang/starkware-libs',
 'circomlib/iden3']
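
Because `tinkering`, `implem`, and `book` are maintained by hand, it may be worth checking that no ID appears in two groups and that no ID is misspelled, in addition to listing the unassigned tools as above. The following is a minimal sketch of such a check. `check_groups` is a hypothetical helper, and the toy input uses only a handful of IDs from the lists above rather than the full `tools` frame.

```python
from collections import Counter


def check_groups(all_ids, groups):
    """Report IDs that are in several groups, in no group, or not in all_ids."""
    known = set(all_ids)
    # Count in how many groups each ID occurs (duplicates within a group count once)
    counts = Counter(uid for members in groups.values() for uid in set(members))
    return {
        "duplicated": sorted(uid for uid, n in counts.items() if n > 1),
        "unassigned": [uid for uid in all_ids if uid not in counts],
        "unknown": sorted(uid for uid in counts if uid not in known),
    }


# Toy subset of the notebook's IDs, for illustration only
all_ids = ["gnark/consensys", "leo/aleohq", "libsnark/scipr-lab", "circomlib/iden3"]
groups = {
    "tinkering": ["libsnark/scipr-lab"],
    "implem": ["gnark/consensys"],
    "book": ["leo/aleohq"],
}

report = check_groups(all_ids, groups)
print(report)
```

In the notebook itself, the same call could take `tools['UniqueID'].to_list()` and the three lists defined earlier. A non-empty `unknown` entry would point to an ID that is missing from `tools`, for example a typo or a tool removed by an earlier filter.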

In [892]:
tool_meta = zkp_tool_info[['UniqueID', 'Production', 'License']].merge(zkp_tools[['UniqueID', 'Tool Resources']], on='UniqueID')
print(tool_meta.to_markdown())

|    | UniqueID                        | Production   | License                    | Tool Resources   |
|---:|:--------------------------------|:-------------|:---------------------------|:-----------------|
|  0 | algebra/arkworks-rs             | False        | Apache 2.0, MIT            | True             |
|  1 | crypto-primitives/arkworks-rs   | False        | Apache 2.0, MIT            | True             |
|  2 | curves/arkworks-rs              | False        | Apache 2.0, MIT            | True             |
|  3 | gemini/arkworks-rs              | False        | MIT                        | True             |
|  4 | gm17/arkworks-rs                | False        | Apache 2.0, MIT            | True             |
|  5 | groth16/arkworks-rs             | False        | Apache 2.0, MIT            | True             |
|  6 | marlin/arkworks-rs              | False        | Apache 2.0, MIT            | True             |
|  7 | nonnative/arkworks-rs           | False        | Apache 2

In [893]:
fig = px.scatter(
    tools,
    x=df_pca[:, 0],
    y=df_pca[:, 1],
    color='Cluster',
    hover_name='UniqueID',
    labels={'x': 'PC1', 'y': 'PC2'},
    title='PCA Scatter Plot with Clusters',
    template='plotly_dark',
    color_continuous_scale='Rainbow',
)

fig.show()


In [894]:
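# Sorting changes the row order of `tools`, so df_pca no longer lines up with it;
# re-run the PCA cell before re-plotting
tools = tools.sort_values(by='Cluster')
print(tools[['UniqueID', 'Forks', 'Issues', 'Age', 'CommitCount', 'ContributorCount', 'Active', 'Production', 'Cluster']].to_markdown())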
# print(tools[['UniqueID', 'Stars', 'Watchers', 'AppCount', 'Active', 'Production', 'Cluster']].to_markdown())


|    | UniqueID                        |   Forks |   Issues |   Age |   CommitCount |   ContributorCount |   Active |   Production |   Cluster |
|---:|:--------------------------------|--------:|---------:|------:|--------------:|-------------------:|---------:|-------------:|----------:|
| 16 | cairo-lang/starkware-libs       |     224 |      112 |  1107 |            57 |                  4 |        1 |            1 |         0 |
| 26 | plonky/mir-protocol             |      12 |        8 |  1377 |           410 |                  6 |        0 |            0 |         0 |
| 28 | plonky3/plonky3                 |      33 |       22 |   228 |           491 |                 13 |        1 |            1 |         0 |
| 18 | circomlib/iden3                 |     178 |       57 |  1848 |           191 |                 12 |        0 |            1 |         0 |
| 29 | pysnark/charterhouse            |      10 |        2 |  2133 |            38 |                  1 |        0 |            0

In [895]:
features_to_plot = cols

for feature in features_to_plot:
    fig = px.box(tools, 
                 x='Cluster', 
                 y=feature, 
                 points='all', 
                 title=f'Distribution of {feature} by Cluster',
                 template='plotly_dark',
                 hover_name='UniqueID'
                 )
    # fig.show()
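
Since `fig.show()` is commented out in the loop, a compact numeric summary per cluster can stand in for the box plots. The sketch below computes a per-cluster median in plain Python. `median_by_cluster` is a hypothetical helper, and the input is limited to the four complete rows visible in the cluster table above, all of which belong to cluster 0, so the result has a single key. In the notebook, `tools.groupby('Cluster')[cols].median()` should give the same summary for all clusters.

```python
from collections import defaultdict
from statistics import median


def median_by_cluster(rows, feature):
    """Group rows by their 'Cluster' value and return the median of one feature."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Cluster"]].append(row[feature])
    return {cluster: median(values) for cluster, values in sorted(grouped.items())}


# Complete rows copied from the printed cluster table (all in cluster 0)
rows = [
    {"UniqueID": "cairo-lang/starkware-libs", "Age": 1107, "CommitCount": 57, "Cluster": 0},
    {"UniqueID": "plonky/mir-protocol", "Age": 1377, "CommitCount": 410, "Cluster": 0},
    {"UniqueID": "plonky3/plonky3", "Age": 228, "CommitCount": 491, "Cluster": 0},
    {"UniqueID": "circomlib/iden3", "Age": 1848, "CommitCount": 191, "Cluster": 0},
]

commit_medians = median_by_cluster(rows, "CommitCount")
age_medians = median_by_cluster(rows, "Age")
print(commit_medians, age_medians)
```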
    