We use ChatGPT to help us classify the repositories.

Using the ChatGPT API (https://openai.com/blog/introducing-chatgpt-and-whisper-apis) and the GitHub REST API
(https://docs.github.com/en/rest), we develop a Python program that retrieves the README content of each repository and feeds it into the procedure below.
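
As a rough illustration of the retrieval step, the sketch below fetches a README through the GitHub REST endpoint `GET /repos/{owner}/{repo}/readme`, which returns the file base64-encoded. The helper names `decode_readme` and `fetch_readme` and the optional `token` parameter are our own assumptions for illustration; they are not taken from the original program.

```python
import base64
import json
import urllib.request
from typing import Optional


def decode_readme(payload: dict) -> str:
    """Decode the JSON body returned by GET /repos/{owner}/{repo}/readme."""
    if payload.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {payload.get('encoding')!r}")
    # b64decode ignores the newlines GitHub inserts into the encoded content.
    return base64.b64decode(payload["content"]).decode("utf-8", errors="replace")


def fetch_readme(owner: str, repo: str, token: Optional[str] = None) -> str:
    """Fetch and decode a repository README (performs a network request)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/readme"
    headers = {"Accept": "application/vnd.github+json"}
    if token:  # hypothetical: an access token, if one is available
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return decode_readme(json.load(response))
```

A call such as `fetch_readme("ethereum", "go-ethereum")` would return the README text, which can then be stored next to the repository name and description.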

Scenario 1: We start with a supervised approach, supplying cluster topics based on the Ethereum EIP categories (https://eips.ethereum.org): Core, Networking, Interface, ERC, Meta, and Informational. This approach is the most deterministic, in the sense that the resulting categories match our expectations. However, it does not transfer to other OSS projects that lack externally predetermined categories, so we design further scenarios to overcome this limitation.

* Note: running this code block requires uploading "ethereum_repos.json", a file containing the name, description, and README of each repository.
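
To make the record layout concrete, the sketch below builds the per-repository user message from one entry of that file. The field names `name`, `desc`, and `readme` and the truncation limits match the code in this notebook; the helper `build_user_message` and the example record are hypothetical.

```python
def build_user_message(repo: dict, desc_limit: int = 1024, readme_limit: int = 3000) -> str:
    """Format one repo record as the text sent to the model."""
    # Missing fields are replaced by the literal string 'None'; long fields are truncated.
    desc = (repo.get("desc") or "None")[:desc_limit]
    readme = (repo.get("readme") or "None")[:readme_limit]
    return f"Repo name: {repo['name']}. Description: {desc}. Readme: {readme}"


# Hypothetical record: no description and an overly long README.
example = {"name": "go-ethereum", "desc": None, "readme": "x" * 5000}
message = build_user_message(example)
```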

In [None]:
!pip3 install -q openai

In [None]:
##########################################################################
# Scenario 1
##########################################################################

import json
from openai import OpenAI
from random import shuffle
import time
import pickle

with open('ethereum_repos.json', 'r') as f:
    repos = json.load(f)

client = OpenAI(api_key='xxxxxx')

instruction = '''Classify each given repository Readme within one of the given categories:

Core: Improvements requiring a consensus fork, as well as changes that are not necessarily consensus critical but may be relevant to “core dev” discussions (for example, the PoA algorithm for testnets).
Networking: Includes improvements around devp2p and Light Ethereum Subprotocol, as well as proposed improvements to network protocol specifications of whisper and swarm.
Interface: Includes improvements around client API/RPC specifications and standards, and also certain language-level standards like method names and contract ABIs. The label “interface” aligns with the interfaces repo and discussion should primarily occur in that repository before an EIP is submitted to the EIPs repository.
ERC: Application-level standards and conventions, including contract standards such as token standards, name registries, URI schemes, library/package formats, and account abstraction.
Meta: Describes a process surrounding Ethereum or proposes a change to (or an event in) a process. Process EIPs are like Standards Track EIPs but apply to areas other than the Ethereum protocol itself. They may propose an implementation, but not to Ethereum's codebase; they often require community consensus; unlike Informational EIPs, they are more than recommendations, and users are typically not free to ignore them. Examples include procedures, guidelines, changes to the decision-making process, and changes to the tools or environment used in Ethereum development. Any meta-EIP is also considered a Process EIP.
Informational: Describes an Ethereum design issue or provides general guidelines or information to the Ethereum community but does not propose a new feature. Informational EIPs do not necessarily represent Ethereum community consensus or a recommendation, so users and implementers are free to ignore Informational EIPs or follow their advice.

Your output should be in a consistent format <repo name>: <class>.
'''

import os

# Make sure the output directory exists before writing results.
os.makedirs('results', exist_ok=True)

completions = []

# Classify one repository per request. Iterating over the full list avoids
# skipping the last few repos when len(repos) is not a multiple of the
# partition count.
for repo in repos:
    repo_name = repo['name']
    # Truncate long fields to keep each request small.
    repo_desc = repo['desc'][:1024] if repo['desc'] else 'None'
    readme = repo['readme'][:3000] if repo['readme'] else 'None'
    messages = [
        {"role": "system", "content": instruction},
        {"role": "user", "content": f"Repo name: {repo_name}. Description: {repo_desc}. Readme: {readme}"},
    ]
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=messages,
    )
    print(f'Done {repo_name}')
    completions.append(completion)
    # Persist each response so an interrupted run does not lose finished work.
    with open(f'results/{repo_name}', 'wb') as f:
        pickle.dump(completion, f)

print([completion.choices for completion in completions])
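
Because the prompt asks for replies of the form `<repo name>: <class>`, each reply can be reduced to a single category label. The sketch below is one possible way to do this, assuming the model follows the requested format; `parse_label` is a hypothetical helper, and replies that do not match yield `None` so they can be reviewed by hand.

```python
CATEGORIES = ("Core", "Networking", "Interface", "ERC", "Meta", "Informational")


def parse_label(reply: str, repo_name: str):
    """Return the category named in a '<repo name>: <class>' reply, or None."""
    for line in reply.splitlines():
        name, sep, label = line.partition(":")
        if not sep or name.strip() != repo_name:
            continue
        label = label.strip().rstrip(".")
        # Accept the label regardless of letter case.
        for category in CATEGORIES:
            if label.lower() == category.lower():
                return category
    return None
```

With the `completions` list built above, the reply text for a given entry is `completion.choices[0].message.content`.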


Scenario 2: We continue with a semi-supervised approach. In scenario 2, we ease the constraints on the number of categories and their respective topics. While we continue to recommend the initial six categories to the model, we now permit the model to generate additional clusters if there are repositories that support the formation of these new groups.

* Note: due to token limits, we feed only the repository descriptions to ChatGPT. Based on these descriptions, ChatGPT proposes several cluster topics.
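
If the descriptions alone still exceed the limit, one option is to split them into batches. The sketch below groups descriptions under a character budget; `batch_descriptions` and the default `max_chars` value are assumptions for illustration, and a character count is only a crude stand-in for a real token count. Clustering batch by batch is also not equivalent to clustering everything in a single prompt, so this is a fallback rather than a drop-in replacement.

```python
def batch_descriptions(descriptions, max_chars=12000):
    """Group descriptions into batches whose joined length stays within max_chars."""
    batches, current, size = [], [], 0
    for desc in descriptions:
        line_len = len(desc) + 1  # +1 for the newline added when joining
        if current and size + line_len > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(desc)
        size += line_len
    if current:
        batches.append(current)
    return batches
```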

In [None]:
##########################################################################
# Scenario 2
##########################################################################

!pip3 install -q openai

In [None]:
import json
from openai import OpenAI

In [None]:
# Collect all repo descriptions.

with open('ethereum_repos.json', 'r') as f:
  repos = json.load(f)

# One entry per repo: "<name> <description>"; a missing description becomes a blank.
information_feed = []
for repo in repos:
  desc = repo["name"] + " " + (repo["desc"] or " ")
  information_feed.append(desc)

In [None]:
# Feed ChatGPT with all repo descriptions and instructions.

client = OpenAI(api_key='xxxxx')

instruction = '''Group all repositories into 6-10 clusters, provide a short definition of each cluster, and list all repo names under the corresponding cluster. Suggested categories for clusters (not limited to these): Core: Improvements requiring a consensus fork, and changes that are not necessarily consensus critical but might be relevant to “core dev” discussions. Networking: improvements around devp2p and Light Ethereum Subprotocol, and proposed improvements to network protocol specifications of whisper and swarm. Interface: improvements around client API/RPC specifications and standards, and certain language-level standards like method names and contract ABIs. ERC: Application-level standards and conventions, including contract standards such as token standards, name registries, URI schemes, library/package formats, and account abstraction. Meta: Describes a process surrounding Ethereum or proposes a change to (or an event in) a process. Process EIPs are like Standards Track EIPs but apply to areas other than the Ethereum protocol itself. They may propose an implementation, but not to Ethereum's codebase; they require community consensus; unlike Informational EIPs, they are more than recommendations, and users are typically not free to ignore them. Examples include procedures, guidelines, changes to the decision-making process, and changes to the tools or environment used in Ethereum development. Any meta-EIP is also considered a Process EIP. Informational: Describes an Ethereum design issue, or provides general guidelines or information to the Ethereum community, but does not propose a new feature.'''
# '''Be sure to print all repos in the corresponding cluster. For each repo I gave you, display its name under corresponding cluster you come up with'''

messages = [
      {"role": "system", "content": instruction},
  ]

# Build the prompt: one repo description per line.
prompt = ''.join(f'{desc}\n' for desc in information_feed)

# The user message is the newline-separated list of descriptions.
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
  model="gpt-4",
  messages=messages,
  seed=3407
)

import pickle

with open('completion', 'wb') as f:
  pickle.dump(completion, f)


# completions = []

# for i in range(6):
#     partition = len(repos)//6
#     for repo in repos[partition*i:min(partition*(i+1), len(repos))]:
#         messages = [
#             {"role": "system", "content": instruction},
#         ]
#         repo_name, repo_desc, readme = repo['name'], repo['desc'][:1024] if repo['desc'] else 'None', repo['readme'][:3000] if repo['readme'] else 'None'
#         messages.append({"role": "user", "content": f"Repo name: {repo_name}. Description: {repo_desc}. Readme: {readme}"})
#         completion = client.chat.completions.create(
#           model="gpt-3.5-turbo",
#           messages=messages,
#            seed=3407)
#         print(f'Done {repo_name}')
#         completions.append(completion)
#         with open(f'results/{repo_name}', 'wb') as f:
#             pickle.dump(completion, f)


print(completion.choices[0].message.content)

Cluster 1 - Ethereum Core
Definition: These repositories have improvements which require a consensus fork, and changes that are not necessarily consensus critical but might be relevant to “core dev” discussions. Implementation to Ethereum's codebase is provided in these repositories.
Repositories: aleth, alethzero, cpp-ethereum, go-ethereum, pyethereum, libethereum

Cluster 2 - Networking
Definition: Repositories in this cluster have improvements around devp2p and Light Ethereum Subprotocol and propose improvements to network protocol specifications of whisper and swarm.
Repositories: libwhisper, devp2p, remix-project, pydevp2p, node-ethereum, ethereum-ppa

Cluster 3 - Interface
Definition: These repositories have improvements around client API/RPC specifications and standards. They provide certain language-level standards like method names and contract ABIs.
Repositories: solc-bin, solc-js, web3.py, ethereum-console, pyethapp, eth-typing

Cluster 4 - ERC
Definition: Repositories in th
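
The reply printed above follows a regular layout (a `Cluster N - Name` header, a `Definition:` line, and a `Repositories:` line), so it can be turned into a dictionary for later analysis. The sketch below assumes the model keeps this layout throughout; `parse_clusters` is a hypothetical helper, and the sample text is an abridged excerpt of the output shown above.

```python
import re

# Matches headers such as "Cluster 2 - Networking".
HEADER = re.compile(r"^Cluster\s+(\d+)\s*[-:]\s*(.+)$")


def parse_clusters(reply: str) -> dict:
    """Map each cluster name to its definition and list of repository names."""
    clusters, current = {}, None
    for raw in reply.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = header.group(2).strip()
            clusters[current] = {"definition": "", "repos": []}
        elif current and line.startswith("Definition:"):
            clusters[current]["definition"] = line[len("Definition:"):].strip()
        elif current and line.startswith("Repositories:"):
            names = line[len("Repositories:"):].split(",")
            clusters[current]["repos"] = [n.strip() for n in names if n.strip()]
    return clusters


# Abridged excerpt of the reply above (definition shortened).
sample = (
    "Cluster 2 - Networking\n"
    "Definition: Improvements around devp2p and Light Ethereum Subprotocol.\n"
    "Repositories: libwhisper, devp2p, remix-project, pydevp2p, node-ethereum, ethereum-ppa\n"
)
parsed = parse_clusters(sample)
```

In the notebook, `parse_clusters(completion.choices[0].message.content)` would apply the same parsing to the full reply.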

Scenario 3: Finally, we explore an unsupervised (that is, highly autonomous and minimally supervised) approach. In this scenario, the model classifies without predefined categories; we only cap the number of categories so that they remain interpretable by humans.

In [None]:
##########################################################################
# Scenario 3
##########################################################################
import json
from openai import OpenAI

In [None]:
# Collect all repo descriptions.

with open('ethereum_repos.json', 'r') as f:
  repos = json.load(f)

# One entry per repo: "<name> <description>"; a missing description becomes a blank.
information_feed = []
for repo in repos:
  desc = repo["name"] + " " + (repo["desc"] or " ")
  information_feed.append(desc)

In [None]:
# Feed ChatGPT with all repo descriptions and instructions.

client = OpenAI(api_key='xxxxx')

instruction = '''Based on the content provided, group all repositories into up to 15 clusters and provide a short definition of each cluster. Note that each cluster should contain more than 1 repository. In addition, the topics of repositories in the cluster should be as similar as possible. The repositories are primarily software development repositories. For each repo I gave you, display its name under the corresponding cluster you come up with.'''

messages = [
      {"role": "system", "content": instruction},
  ]

# Build the prompt: one repo description per line.
prompt = ''.join(f'{desc}\n' for desc in information_feed)

# The user message is the newline-separated list of descriptions.
messages.append({"role": "user", "content": prompt})
completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=messages,
  seed=3407
)

import pickle

with open('completion', 'wb') as f:
  pickle.dump(completion, f)

print(completion.choices[0].message.content)
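
The prompt imposes three constraints that the reply may or may not respect: at most 15 clusters, more than one repository per cluster, and every repository listed under a cluster. The sketch below checks them on an already-parsed result; `check_clustering` is a hypothetical helper, and it assumes the reply has been converted into a mapping from cluster name to a list of repository names.

```python
def check_clustering(clusters: dict, repo_names, max_clusters: int = 15) -> list:
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if len(clusters) > max_clusters:
        problems.append(f"{len(clusters)} clusters exceeds the limit of {max_clusters}")
    seen = {}  # repo name -> first cluster it appeared in
    for cluster, members in clusters.items():
        if len(members) < 2:
            problems.append(f"cluster {cluster!r} has fewer than 2 repositories")
        for name in members:
            if name in seen:
                problems.append(f"{name!r} appears in both {seen[name]!r} and {cluster!r}")
            seen.setdefault(name, cluster)
    for name in repo_names:
        if name not in seen:
            problems.append(f"{name!r} was not assigned to any cluster")
    return problems
```

Running this check before reusing the cluster names as fixed categories helps catch repositories that the model silently dropped.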

In [None]:
# Feed ChatGPT with all repo descriptions and instructions.

client = OpenAI(api_key='xxxxx')

instruction = '''
Classify each given repository within one of the given categories:

Cluster 1: Ethereum Development Tools and Libraries

Cluster 2: Ethereum Protocol Specifications

Cluster 3: Ethereum Improvement Proposals (EIPs) Management

Cluster 4: Ethereum Client Implementations

Cluster 5: Ethereum Networking and Communication

Cluster 6: Ethereum Simulation and Testing

Cluster 7: Centralized Management and Deployment Tools for Ethereum

Cluster 8: Solidity Programming Language and Tools

Cluster 9: Ethereum Web Development

Cluster 10: Ethereum Education and Documentation

Cluster 11: Miscellaneous Ethereum Tools and Utilities

Cluster 12: Ethereum Research and Formal Verification

Cluster 13: Ethereum Networking Monitoring and Dashboards

Cluster 14: Ethereum Staking and Validator Support

Cluster 15: Ethereum Ecosystem Support Programs

Your output should be in a consistent format <repo name>: <class>.
'''

# messages = [
#       {"role": "system", "content": instruction},
#   ]

# Build the prompt to integrate all repo descriptions into it
prompt = ''

# for i, desc in enumerate(information_feed):
#   prompt = prompt + f'{desc}\n'

# prompt: 'these are ... repo 1 desc is ...; repo 2 ...'
# messages.append({"role": "user", 'content': prompt})
# completion = client.chat.completions.create(
#   model="gpt-4",
#   messages=messages
# )

import pickle

# with open('completion', 'wb') as f:
#   pickle.dump(completion, f)


completions = []


# Classify one repository per request against the 15 clusters listed above.
for repo in repos:
    repo_name = repo['name']
    # Truncate long fields to keep each request small.
    repo_desc = repo['desc'][:1024] if repo['desc'] else 'None'
    readme = repo['readme'][:3000] if repo['readme'] else 'None'
    messages = [
        {"role": "system", "content": instruction},
        {"role": "user", "content": f"Repo name: {repo_name}. Description: {repo_desc}. Readme: {readme}"},
    ]
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        seed=3407,
    )
    print(f'Done {repo_name}')
    completions.append(completion)
    # Persist each response; the 'results' directory must already exist.
    with open(f'results/{repo_name}', 'wb') as f:
        pickle.dump(completion, f)

print([completion.choices for completion in completions])
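
Once every repository has a reply, a quick tally shows how the repositories are distributed over the 15 clusters. The sketch below assumes each reply mentions its cluster as `Cluster <number>`, as the category list in the prompt does; `tally_clusters` is a hypothetical helper, and replies without a recognizable cluster number are counted under `None`.

```python
import re
from collections import Counter

CLUSTER_ID = re.compile(r"Cluster\s+(\d+)")


def tally_clusters(replies) -> Counter:
    """Count replies per cluster number; unparseable replies are counted under None."""
    counts = Counter()
    for reply in replies:
        match = CLUSTER_ID.search(reply)
        counts[int(match.group(1)) if match else None] += 1
    return counts
```

With the list built above, the call would be `tally_clusters(c.choices[0].message.content for c in completions)`.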

In [None]:
import os
BASE = '/content/drive/Shareddrives/Chri-City🏡/ethereum/scenario3/results'
files = os.listdir(BASE)
files = sorted(files)
print(files)
for file in files:
  with open(f'{BASE}/{file}', 'rb') as f:
    try:
      completion = pickle.load(f)
      print(completion.choices[0].message.content)
    except Exception as e:
      # Report which file failed and why, then continue with the rest.
      print(f'\n{file} could not be loaded: {e}\n')

['EIP-Bot', 'EIPs', 'ERCs', 'GitSync', 'RIPs', 'UniversalLoginSDK', 'Yul-K', 'abm1559', 'act', 'aio-run-in-process', 'aleth', 'alethzero', 'alexandria', 'annotated-spec', 'async-service', 'asyncio-cancel-token', 'asyncio-run-in-process', 'awesome-remix', 'beacon-APIs', 'beacon-metrics', 'beacon_chain', 'beaconrunner', 'bench', 'benchmarking', 'bimini', 'blake2b-py', 'blockies', 'bls12-381-tests', 'browser-solidity', 'btcrelay', 'builder-specs', 'c-kzg-4844', 'cable', 'casper', 'cbc-casper', 'clef-ui', 'clrfund', 'common', 'consensus-deployment-ansible', 'consensus-spec-tests', 'consensus-specs', 'cpp-build-env', 'cpp-dependencies', 'cpp-dependencies-win64', 'cpp-ethash', 'cpp-ethereum-cmake', 'cryptography-research-website', 'cryptopp', 'cthaeh', 'dapp-bin', 'dapp-styles', 'ddht', 'deposit_contract', 'devcommon', 'devops-test-prater-redirect', 'devp2p', 'diary', 'discv4-crawl', 'discv4-dns-lists', 'distributed-validator-specs', 'docker-pyeth-dev', 'dopple', 'economic-modeling', 'ecp', 