# Introduction to Connecting to and Querying the Augur DB

If you made it to this point, welcome! :) This short tutorial shows how to connect to the database and run a simple query. If you need the config file, please email cdolfi@redhat.com.

All the notebooks in this folder are based on https://github.com/chaoss/augur-community-reports templates. 

## Connect to your database

Until the Operate First environment can connect to the DB directly, use the config file for access. Do not push the config file to the GitHub repo.

In [2]:
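import psycopg2  # PostgreSQL driver used by the SQLAlchemy engine below
import pandas as pd
import sqlalchemy as salc
import json
import os

# Load the database credentials from the local config file (never commit this file).
with open("../../config.json") as config_file:
    config = json.load(config_file)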
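
The connection cell in this notebook reads five keys from the config: `user`, `password`, `host`, `port`, and `database`. As a minimal sketch, the loader below checks that all of them are present before anything tries to connect. The names `load_config` and `REQUIRED_KEYS` are hypothetical and are not part of Augur.

```python
import json

# Keys that the connection string in this notebook reads from the config.
REQUIRED_KEYS = ("user", "password", "host", "port", "database")


def load_config(path):
    """Load a JSON config file and check that every required key is present.

    Hypothetical helper for illustration only.
    """
    with open(path) as config_file:
        config = json.load(config_file)

    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config file is missing keys: {', '.join(missing)}")
    return config
```

It could be called as `config = load_config("../../config.json")` in place of the plain `json.load` call.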

In [3]:
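# Build the SQLAlchemy connection URL from the values in the config file.
database_connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(
    config['user'], config['password'], config['host'], config['port'], config['database'])

# Default to the augur_data schema so tables can be referenced without a prefix.
dbschema='augur_data'
engine = salc.create_engine(
    database_connection_string,
    connect_args={'options': '-csearch_path={}'.format(dbschema)})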
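
One thing to watch with a hand-formatted connection string: a password containing characters such as `@` or `/` can break the URL. The sketch below shows one way to escape the credentials with the standard library's `urllib.parse.quote_plus`. The helper name `build_connection_string` and the example values are made up for illustration.

```python
from urllib.parse import quote_plus


def build_connection_string(config):
    """Build a postgresql+psycopg2 URL, escaping the user name and password.

    Hypothetical helper; the key names match the config used in this notebook.
    """
    return "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
        quote_plus(str(config["user"])),
        quote_plus(str(config["password"])),
        config["host"],
        config["port"],
        config["database"],
    )


# Example values are invented for illustration, not real credentials.
example = {
    "user": "augur",
    "password": "p@ss/word",
    "host": "localhost",
    "port": 5432,
    "database": "augur",
}
print(build_connection_string(example))
```

The result can be passed to `salc.create_engine` exactly like `database_connection_string`.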

### Retrieve Available Repositories

In [4]:
repolist = pd.DataFrame()
repo_query = salc.sql.text(f"""
             SET SCHEMA 'augur_data';
             SELECT a.rg_name,
                a.repo_group_id,
                b.repo_name,
                b.repo_id,
                b.forked_from,
                b.repo_archived,
                b.repo_git
            FROM
                repo_groups a,
                repo b
            WHERE
                a.repo_group_id = b.repo_group_id
            ORDER BY
                rg_name,
                repo_name;
    """)
repolist = pd.read_sql(repo_query, con=engine)
display(repolist)
repolist.dtypes

Unnamed: 0,rg_name,repo_group_id,repo_name,repo_id,forked_from,repo_archived,repo_git
0,Default Repo Group,1,augur,1,Parent not available,0,https://github.com/chaoss/augur.git
1,konveyor,101,,25442,Parent not available,0,https://github.com/konveyor/labs.git
2,konveyor,101,,25443,Parent not available,0,https://github.com/konveyor/velero-examples.git
3,konveyor,101,,25445,Parent not available,0,https://github.com/konveyor/move2kube-demos.git
4,konveyor,101,,25446,Parent not available,0,https://github.com/konveyor/tackle-documentati...
...,...,...,...,...,...,...,...
66,konveyor,101,,25497,Parent not available,0,https://github.com/konveyor/tackle-commons-res...
67,konveyor,101,,25498,Parent not available,0,https://github.com/konveyor/eng-kbase.git
68,konveyor,101,,25486,Parent not available,0,https://github.com/konveyor/mtc-breakfix.git
69,konveyor,101,,25441,Parent not available,0,https://github.com/konveyor/mig-demo-apps.git


rg_name          object
repo_group_id     int64
repo_name        object
repo_id           int64
forked_from      object
repo_archived     int64
repo_git         object
dtype: object
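
In the output above, `repo_name` shows no value for the konveyor rows, while `repo_git` is filled in. If a readable name is needed, one option is to derive it from the URL. This is only a sketch: `repo_name_from_url` is a hypothetical helper, and it assumes the URL ends in the repository name with an optional `.git` suffix.

```python
def repo_name_from_url(repo_git):
    """Return the last path segment of a Git URL without a trailing '.git'."""
    name = repo_git.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return name


# URLs taken from the output above.
urls = [
    "https://github.com/chaoss/augur.git",
    "https://github.com/konveyor/labs.git",
    "https://github.com/konveyor/velero-examples.git",
]
names = [repo_name_from_url(url) for url in urls]
print(names)
```

On the DataFrame itself, `repolist["repo_git"].map(repo_name_from_url)` would apply the same function to every row, assuming every `repo_git` value is a string.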

### Create a Simpler List for Quickly Identifying repo_group_id and repo_id Values for Other Queries

In [5]:

repolist = pd.DataFrame()

repo_query = salc.sql.text(f"""
             SET SCHEMA 'augur_data';
             SELECT b.repo_id,
                a.repo_group_id,
                b.repo_name,
                a.rg_name,
                b.repo_git
            FROM
                repo_groups a,
                repo b 
            WHERE
                a.repo_group_id = b.repo_group_id 
            ORDER BY
                rg_name,
                repo_name;   

    """)

repolist = pd.read_sql(repo_query, con=engine)

repolist[50:150]

Unnamed: 0,repo_id,repo_group_id,repo_name,rg_name,repo_git
50,25490,101,,konveyor,https://github.com/konveyor/crane-ui-tests.git
51,25491,101,,konveyor,https://github.com/konveyor/mig-ci.git
52,25492,101,,konveyor,https://github.com/konveyor/tackle-config-disc...
53,25493,101,,konveyor,https://github.com/konveyor/must-gather.git
54,25494,101,,konveyor,https://github.com/konveyor/data-mover.git
55,25495,101,,konveyor,https://github.com/konveyor/virt-payload.git
56,25484,101,,konveyor,https://github.com/konveyor/mig-analytics-tool...
57,25496,101,,konveyor,https://github.com/konveyor/wadsworth.git
58,25430,101,,konveyor,https://github.com/konveyor/move2kube.git
59,25431,101,,konveyor,https://github.com/konveyor/pelorus.git
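
Once the IDs are known, later queries usually filter on one repo group. The sketch below shows the same join written with explicit `JOIN ... ON` syntax and a bound parameter. To keep it runnable without database access, it uses an in-memory SQLite database as a stand-in for the Augur Postgres database, with a handful of rows copied from the outputs above.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the Augur Postgres database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE repo_groups (repo_group_id INTEGER, rg_name TEXT);
    CREATE TABLE repo (repo_id INTEGER, repo_group_id INTEGER, repo_git TEXT);
    INSERT INTO repo_groups VALUES (1, 'Default Repo Group'), (101, 'konveyor');
    INSERT INTO repo VALUES
        (1, 1, 'https://github.com/chaoss/augur.git'),
        (25430, 101, 'https://github.com/konveyor/move2kube.git'),
        (25431, 101, 'https://github.com/konveyor/pelorus.git');
    """
)

# Explicit JOIN plus a named parameter instead of a hard-coded group name.
query = """
    SELECT b.repo_id, a.repo_group_id, a.rg_name, b.repo_git
    FROM repo_groups a
    JOIN repo b ON a.repo_group_id = b.repo_group_id
    WHERE a.rg_name = :rg_name
    ORDER BY b.repo_id
"""
rows = conn.execute(query, {"rg_name": "konveyor"}).fetchall()
for row in rows:
    print(row)
conn.close()
```

Against the Augur database, the same SQL text could be wrapped in `salc.sql.text(...)` and run with `pd.read_sql(..., con=engine, params={"rg_name": "konveyor"})`.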


### Create a list of all of the tables with the total number of data entries 

In [6]:
repolist = pd.DataFrame()

repo_query = salc.sql.text(f"""
                ANALYZE;
                SELECT schemaname,relname,n_live_tup 
                  FROM pg_stat_user_tables 
                  ORDER BY n_live_tup DESC;

    """)

repolist = pd.read_sql(repo_query, con=engine)

repolist[:35]

Unnamed: 0,schemaname,relname,n_live_tup
0,augur_data,pull_request_files,137311
1,augur_data,issue_events,67098
2,augur_data,pull_request_events,51052
3,augur_data,pull_request_commits,41075
4,augur_data,message,20023
5,augur_data,pull_request_meta,14943
6,augur_data,pull_request_reviews,11630
7,augur_data,issues,10242
8,augur_data,issue_message_ref,9932
9,augur_data,pull_requests,7507
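
Note that `n_live_tup` in `pg_stat_user_tables` is PostgreSQL's estimate of the number of live rows, so the figures above may differ slightly from an exact count. When an exact number matters, a `COUNT(*)` query per table is one option. The sketch below only builds the SQL string. `exact_count_sql` is a hypothetical helper, and because table names cannot be passed as bound parameters, it validates them against a simple identifier pattern first.

```python
import re

# Accept only plain identifiers: letters, digits, and underscores.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def exact_count_sql(table, schema="augur_data"):
    """Return a COUNT(*) query for one table, rejecting unsafe names.

    Hypothetical helper for illustration only.
    """
    for name in (schema, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain SQL identifier: {name!r}")
    return f'SELECT COUNT(*) AS row_count FROM "{schema}"."{table}"'


print(exact_count_sql("message"))
```

The returned string can be run with `pd.read_sql(salc.sql.text(...), con=engine)` like the other queries in this notebook.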


Congrats, you have run your first queries! Below are a few more simple examples of how to pull an entire table. If you would like to explore on your own, the schema.png in the home sandiego directory will be a great help. Happy querying :)

### Data from the message table

In [12]:
repolist = pd.DataFrame()

repo_query = salc.sql.text(f"""
             SET SCHEMA 'augur_data';
             SELECT * FROM message
    """)

repolist = pd.read_sql(repo_query, con=engine)

repolist[:100]

Unnamed: 0,msg_id,rgls_id,msg_text,msg_timestamp,msg_sender_email,msg_header,pltfrm_id,tool_source,tool_version,data_source,data_collection_date,cntrb_id
0,1691707,,"Hello, I am an Outreachy applicant. I am inter...",2020-03-06 12:16:26,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-07-16 21:02:43,278615
1,1691708,,"Hi, I am Abhishek, currently pursuing B.Tech i...",2020-03-14 08:24:40,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-07-16 21:02:43,278628
2,1691709,,Currently it seems that changes are required i...,2020-06-07 19:32:47,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-07-16 21:02:43,277384
3,1691710,,@ccarterlandis I had a question regarding the ...,2020-03-22 16:14:31,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-07-16 21:02:43,277384
4,1691711,,@sgoggins the first bug is just key not found ...,2021-01-05 15:48:50,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-07-16 21:02:43,277384
...,...,...,...,...,...,...,...,...,...,...,...,...
95,1691798,,"Yes, I am using Ubuntu 19.10\n\nOn Fri, 13 Mar...",2020-03-13 07:54:58,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-07-16 21:02:43,277795
96,1691799,,"Oh I am so sorry, actually I replied directly ...",2020-04-05 15:06:23,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-07-16 21:02:43,277795
97,1691800,,@sgoggins \r\nIs it like creating GitHub's Dep...,2021-03-30 16:52:19,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-07-16 21:02:43,278703
98,1691801,,I started out using the local environment but ...,2018-02-20 21:02:37,,,25150,GitHub API Worker,1.0.0,GitHub API,2021-07-16 21:02:43,278354
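
The cell above pulls the whole `message` table and then slices the first 100 rows in pandas. For a large table it can be cheaper to let the database do the limiting. The sketch below illustrates the idea with `LIMIT` and a bound parameter. It runs against an in-memory SQLite stand-in filled with made-up rows, not against the Augur database.

```python
import sqlite3

# In-memory SQLite stand-in with invented rows; only the table name matches Augur.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message (msg_id INTEGER, msg_text TEXT)")
conn.executemany(
    "INSERT INTO message VALUES (?, ?)",
    [(1691707 + i, f"message {i}") for i in range(250)],
)

# Ask the database for the first 100 rows instead of slicing afterwards.
query = """
    SELECT msg_id, msg_text
    FROM message
    ORDER BY msg_id
    LIMIT :limit
"""
first_rows = conn.execute(query, {"limit": 100}).fetchall()
print(len(first_rows), first_rows[0][0], first_rows[-1][0])
conn.close()
```

The `ORDER BY` matters: without it, the database is free to return any 100 rows.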


### Contributor affiliation data

In [11]:
repolist = pd.DataFrame()

repo_query = salc.sql.text(f"""
             SET SCHEMA 'augur_data';
             SELECT * FROM contributor_affiliations
    """)

repolist = pd.read_sql(repo_query, con=engine)

repolist[650:700]

Unnamed: 0,ca_id,ca_domain,ca_start_date,tool_source,tool_version,data_source,data_collection_date,ca_last_used,ca_affiliation,ca_active
650,24277,ross@heptio.com,1970-01-01,Helper Script,,Dawn's vmware_mapping JSON,2020-04-28 18:52:49,2020-04-28 18:52:49,VMware,1
651,24278,ross@kukulinski.com,1970-01-01,Helper Script,,Dawn's vmware_mapping JSON,2020-04-28 18:52:49,2020-04-28 18:52:49,VMware,1
652,24279,ralph@heptio.com,1970-01-01,Helper Script,,Dawn's vmware_mapping JSON,2020-04-28 18:52:49,2020-04-28 18:52:49,VMware,1
653,24280,ralph.l.bankston@gmail.com,1970-01-01,Helper Script,,Dawn's vmware_mapping JSON,2020-04-28 18:52:49,2020-04-28 18:52:49,VMware,1
654,24281,alex_brand@heptio.com,1970-01-01,Helper Script,,Dawn's vmware_mapping JSON,2020-04-28 18:52:49,2020-04-28 18:52:49,VMware,1
655,24282,alexbrand09@gmail.com,1970-01-01,Helper Script,,Dawn's vmware_mapping JSON,2020-04-28 18:52:49,2020-04-28 18:52:49,VMware,1
656,24283,joe@heptio.com,1970-01-01,Helper Script,,Dawn's vmware_mapping JSON,2020-04-28 18:52:49,2020-04-28 18:52:49,VMware,1
657,24284,joe.github@bedafamily.com,1970-01-01,Helper Script,,Dawn's vmware_mapping JSON,2020-04-28 18:52:49,2020-04-28 18:52:49,VMware,1
658,24285,vince@heptio.com,1970-01-01,Helper Script,,Dawn's vmware_mapping JSON,2020-04-28 18:52:49,2020-04-28 18:52:49,VMware,1
659,24286,vince@vincepri.com,1970-01-01,Helper Script,,Dawn's vmware_mapping JSON,2020-04-28 18:52:49,2020-04-28 18:52:49,VMware,1
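
The rows above map an address in `ca_domain` to an organization in `ca_affiliation`. As a rough sketch of how such a mapping might be used, the hypothetical helper below looks up an email by its full address first and then falls back to the bare domain. Whether `ca_domain` also holds bare domains is an assumption here, based only on the column name.

```python
# A few (ca_domain, ca_affiliation) pairs copied from the output above.
affiliations = {
    "ross@heptio.com": "VMware",
    "joe@heptio.com": "VMware",
    "vince@vincepri.com": "VMware",
}


def affiliation_for(email, mapping):
    """Look up an affiliation by full address first, then by bare domain.

    Hypothetical helper; returns None when nothing matches.
    """
    email = email.strip().lower()
    if email in mapping:
        return mapping[email]
    domain = email.rsplit("@", 1)[-1]
    return mapping.get(domain)


print(affiliation_for("Ross@heptio.com", affiliations))
print(affiliation_for("someone@example.org", affiliations))
```

With the DataFrame from the cell above, a dictionary like `affiliations` could be built with `dict(zip(repolist["ca_domain"], repolist["ca_affiliation"]))`.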
