In [1]:
import pandas as pd

## List of projects

One of the inputs is a pandas DataFrame in which the project name is one of the main columns.

This notebook / script enriches that DataFrame with project metadata features.

In [2]:
infile='df_commits_with_one_project.parquet'

In [3]:
df=pd.read_parquet(infile)

Examine the DataFrame

In [5]:
df.columns

Index(['commit', 'commit_cves', 'commit_time', 'project_names', 'cve',
       'published_date', 'project_name', 'date_difference'],
      dtype='object')

In [10]:
projects=df['project_name'].unique()
len(projects)

35758

The starting point could also be just a plain list of projects:

In [13]:
projects_s=pd.Series(projects)

In [12]:
pd.Series(projects).to_csv('interesting_projects_list.csv', sep=';', header=False, index=False)
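
A downstream script that does not depend on pandas could read this list back with the standard `csv` module. The following is a minimal sketch, assuming the file was written as above (one project name per row, `;` as separator, no header); the function name `read_project_list` is hypothetical.

```python
import csv


def read_project_list(path):
    """Read project names from a ';'-separated file with one name per row."""
    with open(path, newline="", encoding="utf-8") as f:
        # skip empty rows, keep only the first (and only) column
        return [row[0] for row in csv.reader(f, delimiter=";") if row]
```

For example, `read_project_list('interesting_projects_list.csv')` would return the project names as a list of strings.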

## Accessing MongoDB database

Accessing the World of Code MongoDB database with various metadata is described in https://github.com/woc-hack/tutorial#mongo-database

> On the da1 server, there is a MongoDB server holding some relevant data. This data includes some information that was used for data analysis in the past. Mongo provides an excellent place to store relatively small data without requiring relational information.
>
> Two collections in the WoC database can be helpful for sampling projects and authors: `A_metadata.V` and `P.metadata.V`, where `V` represents the version (e.g., `T`), `A` stands for aliased author id and `P` for deforked repository name.

There is no direct access to the `da1` server. The connection has to be tunneled via `da0`; otherwise pymongo fails with the following error:

```
ServerSelectionTimeoutError: da1.eecs.utk.edu:27017: [Errno 113] No route to host, Timeout: 30s, Topology Description: <TopologyDescription id: 639216df3fa72534c7d7e1c5, topology_type: Unknown, servers: [<ServerDescription ('da1.eecs.utk.edu', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('da1.eecs.utk.edu:27017: [Errno 113] No route to host')>]>
```
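
Before involving pymongo, it can help to check whether the MongoDB port is reachable at all, since a plain TCP check fails quickly instead of waiting for the 30 s server selection timeout. The following is a minimal sketch using only the standard library; the helper name `can_reach` is hypothetical, and port 27017 is taken from the error message above.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers "No route to host", refused connections, DNS failures, timeouts
        return False
```

Calling `can_reach("da1.eecs.utk.edu", 27017)` from a machine without a route to `da1` would be expected to return `False`.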

One possible solution is to use [`ssh-pymongo`](https://pypi.org/project/ssh-pymongo/), see e.g.  
https://stackoverflow.com/questions/56239184/python-script-to-connect-to-remote-mongodb-using-ssh-tunnel-and-pymongo-client-i

The SSH client configuration (e.g. in `~/.ssh/config`) for connecting to World of Code servers looks like this:
```
Host da0
	Hostname da0.eecs.utk.edu
	Port 443
	User <username>
	IdentityFile <path to private key, e.g. %d/.ssh/id_rsa_woc>
```
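
With such a `da0` entry in place, an alternative to a Python-level tunnel would be a local port forward started with the OpenSSH client. The sketch below only builds the command line; it assumes the `da0` alias from the configuration above and the default MongoDB port 27017 on `da1`, and the function name `build_tunnel_command` is hypothetical.

```python
import shlex


def build_tunnel_command(gateway="da0", remote_host="da1.eecs.utk.edu",
                         remote_port=27017, local_port=27017):
    """Build an `ssh` argument list forwarding local_port to remote_host:remote_port."""
    forward = f"{local_port}:{remote_host}:{remote_port}"
    # -N: do not run a remote command; -L: local port forwarding
    return ["ssh", "-N", "-L", forward, gateway]


# Show the command that would be run in a separate terminal
print(shlex.join(build_tunnel_command()))
```

While such a tunnel is running, `pymongo.MongoClient("mongodb://localhost:27017/")` should talk to the MongoDB server on `da1` through `da0`.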

In [24]:
import os

username = os.environ['JUPYTERHUB_USER']
username

'jnareb'

In [34]:
home = os.getenv('HOME', '/home/'+username)
home

'/home/jnareb'

In [20]:
from ssh_pymongo import MongoSession

session = MongoSession(
    host="da0.eecs.utk.edu",
    port=443,
    user=username,
    key=home+'/.ssh/id_rsa_woc',
    uri="mongodb://da1.eecs.utk.edu/"
)

client=session.connection

2022-12-08 18:50:21,129| ERROR   | Could not open connection to gateway


BaseSSHTunnelForwarderError: Could not establish session to SSH gateway

Example from the tutorial, split into two cells and slightly modified:

In [None]:
import pymongo

client = pymongo.MongoClient("mongodb://da1.eecs.utk.edu/")

In [18]:
db = client['WoC']
coll = db['A_metadata.U']

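# find_one() returns a single document (a dict) or None, not a cursor,
# so there is nothing to iterate over or to close
data = coll.find_one({})
if data is not None:
    # in Python 3 the value is already a Unicode str, no encode() needed
    a = data["AuthorID"].strip()
    print(a)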

  return Cursor(self, *args, **kwargs)


ServerSelectionTimeoutError: da1.eecs.utk.edu:27017: [Errno 113] No route to host, Timeout: 30s, Topology Description: <TopologyDescription id: 639216df3fa72534c7d7e1c5, topology_type: Unknown, servers: [<ServerDescription ('da1.eecs.utk.edu', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('da1.eecs.utk.edu:27017: [Errno 113] No route to host')>]>

In [None]:
client

In [16]:
!mongo --host "da1.eecs.utk.edu"

/bin/bash: line 1: mongo: command not found


In [15]:
projects_s[1]

'buildroot_buildroot'
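
Once a connection works, the metadata for names such as the one above could be fetched in batches rather than one query per project. The following is a rough sketch under stated assumptions: `coll` is a pymongo collection holding project metadata, and the field name `ProjectID` is an assumption that should be checked against an actual document; the helper names `batched` and `fetch_project_metadata` are hypothetical.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` elements from `items`."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fetch_project_metadata(coll, project_names, batch_size=1000, key="ProjectID"):
    """Yield documents from `coll` whose `key` field is one of project_names.

    `key` is an assumed field name, not confirmed by the tutorial.
    """
    for batch in batched(project_names, batch_size):
        # one `$in` query per batch keeps each query document small
        yield from coll.find({key: {"$in": batch}})
```

The resulting documents could then be collected into a DataFrame and merged with `df` on `project_name`.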