# Introduction to Connecting to and Querying the Augur DB

If you made it to this point, welcome! :) This short tutorial shows how to connect to the database and how to run a simple query. If you need the config file, please email cdolfi@redhat.com.

For Project Sandiego's data, we will be using a personal instance of Augur. Augur is a software suite for collecting and measuring structured data about free and open-source software (FOSS) communities.

Augur gathers trace data for a group of repositories, normalizes it into the data model, and provides a variety of metrics about said data. The structure of the data model enables us to synthesize data across various platforms to provide meaningful context for meaningful questions about the way these communities evolve.

All the notebooks in this folder are based on https://github.com/chaoss/augur-community-reports templates. 

## Connect to your database

Until the Operate First environment can connect to the DB, use the config file for access. Do not push the config file to the GitHub repo.

In [1]:
import psycopg2
import pandas as pd 
import sqlalchemy as salc
import json
import os

with open("../comm_cage.json") as config_file:
    config = json.load(config_file)
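
Before building the connection, it can help to check that the config file holds every key the connection string in the next cell reads. The following is a minimal sketch, assuming the file is a flat JSON object. The helper name `load_db_config`, the demo file name, and the placeholder values are hypothetical; only the five key names come from this notebook.

```python
import json
import os
import tempfile

# Keys that the connection string in this notebook reads from the config.
REQUIRED_KEYS = ("user", "password", "host", "port", "database")


def load_db_config(path):
    """Load a JSON config file and fail early if a required key is missing."""
    with open(path) as config_file:
        config = json.load(config_file)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config file is missing keys: {', '.join(missing)}")
    return config


# Demonstration with a throwaway file holding placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_config.json")
    with open(demo_path, "w") as handle:
        json.dump(
            {"user": "augur", "password": "secret", "host": "localhost",
             "port": 5432, "database": "augur"},
            handle,
        )
    demo_config = load_db_config(demo_path)
```

In the notebook itself, the call would presumably be `config = load_db_config("../comm_cage.json")`.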

In [2]:
database_connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(config['user'], config['password'], config['host'], config['port'], config['database'])

dbschema='augur_data'
engine = salc.create_engine(
    database_connection_string,
    connect_args={'options': '-csearch_path={}'.format(dbschema)})
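
A connection string built with plain `format` can break if the user name or password contains characters such as `@` or `/`. One way around this, sketched below, is to percent-encode those two fields before inserting them. The helper name `build_connection_string` and all values are hypothetical placeholders; whether the real config needs this depends on the credentials in use.

```python
from urllib.parse import quote


def build_connection_string(config):
    """Build a postgresql+psycopg2 URL, percent-encoding the user and password."""
    return "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
        quote(str(config["user"]), safe=""),
        quote(str(config["password"]), safe=""),
        config["host"],
        config["port"],
        config["database"],
    )


# Placeholder values only; "p@ss/word" would break an unescaped URL.
example = build_connection_string(
    {"user": "augur", "password": "p@ss/word", "host": "localhost",
     "port": 5432, "database": "augur"}
)
print(example)  # postgresql+psycopg2://augur:p%40ss%2Fword@localhost:5432/augur
```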

### Retrieve Available Repositories

In [3]:
# Join repo groups to their repos and list the key identifying columns.
repo_query = salc.sql.text("""
             SET SCHEMA 'augur_data';
             SELECT a.rg_name,
                a.repo_group_id,
                b.repo_name,
                b.repo_id,
                b.forked_from,
                b.repo_archived,
                b.repo_git
            FROM
                repo_groups a,
                repo b
            WHERE
                a.repo_group_id = b.repo_group_id
            ORDER BY
                rg_name,
                repo_name;
    """)
aval_repos = pd.read_sql(repo_query, con=engine)
display(aval_repos)
aval_repos.dtypes

Unnamed: 0,rg_name,repo_group_id,repo_name,repo_id,forked_from,repo_archived,repo_git
0,Default Repo Group,10,blueprint,24442,Parent not available,0.0,https://github.com/operate-first/blueprint
1,Default Repo Group,10,grimoirelab-hatstall,25450,Parent not available,0.0,https://github.com/chaoss/grimoirelab-hatstall
2,Default Repo Group,10,grimoirelab-perceval-opnfv,25445,Parent not available,0.0,https://github.com/chaoss/grimoirelab-perceval...
3,Default Repo Group,10,operate-first-twitter,24441,Parent not available,0.0,https://github.com/operate-first/operate-first...
4,Default Repo Group,10,update-test,25430,Parent not available,0.0,https://github.com/SociallyCompute/update-test
...,...,...,...,...,...,...,...
116,torvalds,25648,linux,36109,Parent not available,0.0,https://github.com/torvalds/linux
117,torvalds,25648,pesconvert,36106,Parent not available,0.0,https://github.com/torvalds/pesconvert
118,torvalds,25648,subsurface-for-dirk,36108,subsurface/subsurface,0.0,https://github.com/torvalds/subsurface-for-dirk
119,torvalds,25648,test-tlb,36111,Parent not available,0.0,https://github.com/torvalds/test-tlb


rg_name           object
repo_group_id      int64
repo_name         object
repo_id            int64
forked_from       object
repo_archived    float64
repo_git          object
dtype: object
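
The query above joins `repo_groups` and `repo` by listing both tables in `FROM` and matching them in `WHERE`. The same result can be written with an explicit `JOIN ... ON`, which keeps the join condition apart from any filters added later. The sketch below illustrates the equivalence under stated assumptions: it uses an in-memory SQLite database as a stand-in for the Augur Postgres instance, with only three columns per table and three rows copied from the output above.

```python
import sqlite3

# In-memory stand-in for the Augur database (assumption: reduced schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE repo_groups (repo_group_id INTEGER, rg_name TEXT);
    CREATE TABLE repo (repo_id INTEGER, repo_group_id INTEGER, repo_name TEXT);
    INSERT INTO repo_groups VALUES (10, 'Default Repo Group'), (25648, 'torvalds');
    INSERT INTO repo VALUES
        (24442, 10, 'blueprint'),
        (36109, 25648, 'linux'),
        (36111, 25648, 'test-tlb');
    """
)

# Implicit join, in the style used by the notebook query.
implicit_join = """
    SELECT a.rg_name, b.repo_name, b.repo_id
    FROM repo_groups a, repo b
    WHERE a.repo_group_id = b.repo_group_id
    ORDER BY rg_name, repo_name
"""

# Explicit join: same rows, with the join condition stated in ON.
explicit_join = """
    SELECT a.rg_name, b.repo_name, b.repo_id
    FROM repo_groups a
    JOIN repo b ON a.repo_group_id = b.repo_group_id
    ORDER BY rg_name, repo_name
"""

implicit_rows = conn.execute(implicit_join).fetchall()
explicit_rows = conn.execute(explicit_join).fetchall()
conn.close()
```

Both forms return the same three rows here; which one to use is a matter of readability.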

### Create a Simpler List for Quickly Identifying repo_group_ids and repo_ids for Other Queries

In [4]:
# Same join as above, reduced to the columns needed to look up IDs.
repo_query = salc.sql.text("""
             SET SCHEMA 'augur_data';
             SELECT b.repo_id,
                a.repo_group_id,
                b.repo_name,
                a.rg_name,
                b.repo_git
            FROM
                repo_groups a,
                repo b 
            WHERE
                a.repo_group_id = b.repo_group_id 
            ORDER BY
                rg_name,
                repo_name;   

    """)

repolist = pd.read_sql(repo_query, con=engine)

repolist

Unnamed: 0,repo_id,repo_group_id,repo_name,rg_name,repo_git
0,24442,10,blueprint,Default Repo Group,https://github.com/operate-first/blueprint
1,25450,10,grimoirelab-hatstall,Default Repo Group,https://github.com/chaoss/grimoirelab-hatstall
2,25445,10,grimoirelab-perceval-opnfv,Default Repo Group,https://github.com/chaoss/grimoirelab-perceval...
3,24441,10,operate-first-twitter,Default Repo Group,https://github.com/operate-first/operate-first...
4,25430,10,update-test,Default Repo Group,https://github.com/SociallyCompute/update-test
...,...,...,...,...,...
116,36109,25648,linux,torvalds,https://github.com/torvalds/linux
117,36106,25648,pesconvert,torvalds,https://github.com/torvalds/pesconvert
118,36108,25648,subsurface-for-dirk,torvalds,https://github.com/torvalds/subsurface-for-dirk
119,36111,25648,test-tlb,torvalds,https://github.com/torvalds/test-tlb


In [5]:
repolist[50:70]

Unnamed: 0,repo_id,repo_group_id,repo_name,rg_name,repo_git
50,30740,25481,move2kube,konveyor,https://github.com/konveyor/move2kube
51,30770,25481,move2kube-api,konveyor,https://github.com/konveyor/move2kube-api
52,30789,25481,move2kube-demos,konveyor,https://github.com/konveyor/move2kube-demos
53,30729,25481,move2kube-katacoda,konveyor,https://github.com/konveyor/move2kube-katacoda
54,30720,25481,move2kube-operator,konveyor,https://github.com/konveyor/move2kube-operator
55,30718,25481,move2kube-tests,konveyor,https://github.com/konveyor/move2kube-tests
56,30745,25481,move2kube-ui,konveyor,https://github.com/konveyor/move2kube-ui
57,30783,25481,mtc-breakfix,konveyor,https://github.com/konveyor/mtc-breakfix
58,30788,25481,must-gather,konveyor,https://github.com/konveyor/must-gather
59,30779,25481,oadp-capstone,konveyor,https://github.com/konveyor/oadp-capstone
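
Once the IDs are known, later queries usually filter on them. Rather than formatting values into the SQL string, they can be passed as bound parameters. The sketch below shows the idea with an in-memory SQLite table as a stand-in, using rows copied from the outputs above. The `:group_id` placeholder style is also what `salc.sql.text` accepts, in which case the values would be supplied through the `params` argument of `pd.read_sql`. The helper name `repos_in_group` is hypothetical.

```python
import sqlite3

# In-memory stand-in for the repo table (assumption: reduced schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repo (repo_id INTEGER, repo_group_id INTEGER, repo_name TEXT)")
conn.executemany(
    "INSERT INTO repo VALUES (?, ?, ?)",
    [
        (30740, 25481, "move2kube"),
        (30788, 25481, "must-gather"),
        (36109, 25648, "linux"),
    ],
)


def repos_in_group(connection, group_id):
    """Return (repo_id, repo_name) pairs for one repo group, using a bound parameter."""
    query = """
        SELECT repo_id, repo_name
        FROM repo
        WHERE repo_group_id = :group_id
        ORDER BY repo_name
    """
    return connection.execute(query, {"group_id": group_id}).fetchall()


konveyor_repos = repos_in_group(conn, 25481)
conn.close()
```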


### Create a list of all of the tables with the total number of data entries 

In [6]:
# n_live_tup is PostgreSQL's estimate of live rows; ANALYZE refreshes the statistics first.
repo_query = salc.sql.text("""
                ANALYZE;
                SELECT schemaname,relname,n_live_tup 
                  FROM pg_stat_user_tables 
                  ORDER BY n_live_tup DESC;

    """)

data_entries = pd.read_sql(repo_query, con=engine)

data_entries

Unnamed: 0,schemaname,relname,n_live_tup
0,augur_data,commits,2776170
1,augur_data,contributor_repo,292334
2,augur_data,dm_repo_weekly,274448
3,augur_data,dm_repo_group_weekly,273697
4,augur_data,dm_repo_monthly,176061
...,...,...,...
100,spdx,files,0
101,augur_operations,repos_fetch_log,0
102,augur_data,pull_request_repo,0
103,spdx,files_licenses,0


Congrats, you have run your first queries! Below are a few more simple examples that show how to pull an entire table. If you would like to explore on your own, the schema.png file in the sandiego home directory will be very helpful. Happy querying :)

### Data from the messages 

This table collects all comments made on issues, PRs, commits, and so on. The example shows another side of the database and the kind of data we can pull from it.

In [7]:
# Pull every column of the message table into a DataFrame.
repo_query = salc.sql.text("""
             SET SCHEMA 'augur_data';
             SELECT * FROM message
    """)

mes_data = pd.read_sql(repo_query, con=engine)

mes_data

Unnamed: 0,msg_id,rgls_id,platform_msg_id,platform_node_id,repo_id,cntrb_id,msg_text,msg_timestamp,msg_sender_email,msg_header,pltfrm_id,tool_source,tool_version,data_source,data_collection_date
0,25430,,826722981,MDEyOklzc3VlQ29tbWVudDgyNjcyMjk4MQ==,,25440,I've enabled actions for this repo. LMK if it ...,2021-04-26 10:35:38,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-10-21 13:41:55
1,25431,,826812453,MDEyOklzc3VlQ29tbWVudDgyNjgxMjQ1Mw==,,25449,[APPROVALNOTIFIER] This PR is **NOT APPROVED**...,2021-04-26 12:55:35,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-10-21 13:41:55
2,25432,,826815197,MDEyOklzc3VlQ29tbWVudDgyNjgxNTE5Nw==,,25438,Actions seems to be working now. Thanks 👍,2021-04-26 12:59:50,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-10-21 13:41:55
3,25433,,826817026,MDEyOklzc3VlQ29tbWVudDgyNjgxNzAyNg==,,25440,/cc @oindrillac,2021-04-26 13:02:13,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-10-21 13:41:55
4,25434,,826822231,MDEyOklzc3VlQ29tbWVudDgyNjgyMjIzMQ==,,25449,[APPROVALNOTIFIER] This PR is **NOT APPROVED**...,2021-04-26 13:09:08,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-10-21 13:41:55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39835,69684,,882431101,IC_kwDOEMblf840mNR9,,57762,"Thank you for the reminder, we forgot to add.\...",2021-07-19 10:20:22,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-11-03 22:25:47
39836,69685,,882431342,IC_kwDOEMblf840mNVu,,57762,Closes https://github.com/konveyor/imagestream...,2021-07-19 10:20:42,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-11-03 22:25:47
39837,69686,,846151549,MDEyOklzc3VlQ29tbWVudDg0NjE1MTU0OQ==,,57772,Closing as preview is confirmed working!,2021-05-21 18:25:47,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-11-03 22:25:52
39838,69687,,847149399,MDEyOklzc3VlQ29tbWVudDg0NzE0OTM5OQ==,,57772,currently working this in a branch,2021-05-24 16:04:03,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-11-03 22:25:52
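
With the messages in a DataFrame, a natural first step is to summarize them, for example by counting messages per contributor. The sketch below assumes a small stand-in for `mes_data` built from three columns of the first five rows shown above; on the real table the same `groupby` would run over all rows.

```python
import pandas as pd

# Stand-in for mes_data: three columns of the first five rows shown above.
sample = pd.DataFrame(
    {
        "msg_id": [25430, 25431, 25432, 25433, 25434],
        "cntrb_id": [25440, 25449, 25438, 25440, 25449],
        "msg_timestamp": [
            "2021-04-26 10:35:38",
            "2021-04-26 12:55:35",
            "2021-04-26 12:59:50",
            "2021-04-26 13:02:13",
            "2021-04-26 13:09:08",
        ],
    }
)

# Parse timestamps so that date-based grouping is possible later.
sample["msg_timestamp"] = pd.to_datetime(sample["msg_timestamp"])

# Count messages per contributor, most active first.
per_contributor = (
    sample.groupby("cntrb_id")["msg_id"].count().sort_values(ascending=False)
)
counts = per_contributor.to_dict()
```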


### Contributor affiliation data


This data tells us the company affiliation of many open source contributors, which can help show which companies are involved in a certain open source project.

In [8]:
# Pull every column of the contributor_affiliations table into a DataFrame.
repo_query = salc.sql.text("""
             SET SCHEMA 'augur_data';
             SELECT * FROM contributor_affiliations
    """)

con_aff = pd.read_sql(repo_query, con=engine)

con_aff

Unnamed: 0,ca_id,ca_domain,ca_start_date,ca_last_used,ca_affiliation,ca_active,tool_source,tool_version,data_source,data_collection_date
0,1,samsung.com,1970-01-01,2018-08-01 18:37:54,Samsung,1,load,1.0,load,1970-01-01
1,2,linuxfoundation.org,1970-01-01,2018-08-01 18:37:54,Linux Foundation,1,load,1.0,load,1970-01-01
2,3,ibm.com,1970-01-01,2018-08-01 18:37:54,IBM,1,load,1.0,load,1970-01-01
3,8,walmart.com,1970-01-01,2018-09-01 06:00:00,Walmart,1,load,1.0,load,1970-01-01
4,9,exxonmobil.com,1970-01-01,2018-09-01 06:00:00,Exxon Mobil,1,load,1.0,load,1970-01-01
...,...,...,...,...,...,...,...,...,...,...
515,516,twitter.com,1970-01-01,2018-09-01 06:00:00,Twitter,1,load,1.0,load,1970-01-01
516,517,adobe.com,1970-01-01,2018-09-01 06:00:00,Adobe,1,load,1.0,load,1970-01-01
517,519,acm.org,1970-01-01,2018-09-12 02:01:59,ACM,1,load,1.0,load,1970-01-01
518,520,outdoors@acm.org,1970-01-01,2018-09-12 02:32:53,University of Missouri,1,load,1.0,load,2013-07-15
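
The `ca_domain` column holds mostly domains, but the last row above shows it can also hold a full email address. One way such a table might be used to label a contributor is sketched below: try the exact address first, then fall back to the domain. This matching rule and the helper name `lookup_affiliation` are assumptions for illustration, not a description of how Augur itself resolves affiliations. The mapping contains a few pairs copied from the output above.

```python
# A few (ca_domain, ca_affiliation) pairs copied from the table above.
AFFILIATIONS = {
    "samsung.com": "Samsung",
    "ibm.com": "IBM",
    "acm.org": "ACM",
    "outdoors@acm.org": "University of Missouri",
}


def lookup_affiliation(email, affiliations=AFFILIATIONS):
    """Return an affiliation for an email: exact address first, then its domain."""
    email = email.strip().lower()
    if email in affiliations:
        return affiliations[email]
    # Everything after the last "@" is treated as the domain.
    domain = email.rpartition("@")[2]
    return affiliations.get(domain)


print(lookup_affiliation("dev@ibm.com"))       # IBM
print(lookup_affiliation("outdoors@acm.org"))  # University of Missouri
print(lookup_affiliation("member@acm.org"))    # ACM
print(lookup_affiliation("someone@example.org"))  # None
```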
