# SQL Querying

This notebook queries tables in the Congressional Data database. To use it, set the environment variable `CD_DWH` to the database connection string. If you do not have the credentials, message us in the #datasci-congressdata Slack channel and/or talk to a project lead.

**Do not hard-code database URI strings in notebooks or code: once the file is pushed to GitHub, the credentials are public for anyone to see.**
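
As a minimal sketch of this practice, the helper below reads the connection string from the environment and fails with a clear message when the variable is unset, rather than passing `None` on to the database driver. The function name `get_db_uri` is hypothetical and not part of the project; only the variable name `CD_DWH` comes from this notebook.

```python
import os


def get_db_uri(var_name="CD_DWH"):
    """Return the connection string stored in the environment variable `var_name`.

    Hypothetical helper for illustration: raises RuntimeError when the
    variable is missing or empty, so the problem surfaces early.
    """
    uri = os.getenv(var_name)
    if not uri:
        raise RuntimeError(
            f"Environment variable {var_name!r} is not set. "
            "Set it to the database connection string before running this notebook."
        )
    return uri
```

The returned value can then be passed to `create_engine` in place of a hard-coded string.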

In [None]:
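import math
import os
import sys

import pandas as pd
import sqlalchemy as sqla
from plotnine import *
from sqlalchemy import create_engine

# Show (effectively) all columns when displaying DataFrames
pd.options.display.max_columns = 999

# Read the connection string from the environment; never hard-code it
DB_URI = os.getenv('CD_DWH')
engine = create_engine(DB_URI)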

In [None]:
# Check that the kernel is using the Conda environment datasci-congressional-data.
# You should see something like '/Users/Username/anaconda3/envs/datasci-congressional-data/bin/python'.
# If you do NOT see "datasci-congressional-data", you are not in the right Python environment.
# Please make sure you have gone through the onboarding docs and/or talk to a project lead.
sys.executable

Below are the tables that currently exist in the database!
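
One way to produce that list is SQLAlchemy's runtime inspector. The sketch below is an assumption about how the listing could be done, not the project's own method; `list_tables` is a hypothetical helper name, and the schema argument is optional.

```python
from sqlalchemy import inspect


def list_tables(engine, schema=None):
    """Return the sorted table names visible through `engine`.

    Hypothetical helper for illustration. When `schema` is None, the
    database's default schema is inspected.
    """
    inspector = inspect(engine)
    return sorted(inspector.get_table_names(schema=schema))
```

With the `engine` created above, `list_tables(engine, schema='trg_analytics')` would return the tables in the schema queried later in this notebook, assuming your credentials can see that schema.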

## Query factors and visualize frequency plots

In [None]:
QUERY = """
SELECT
    recipient_candidate_name
    , recipient_candidate_office
    , donor_name
    , donor_organization
    , transaction_amount
    , transaction_date
  FROM trg_analytics.candidate_contributions
"""
# No GROUP BY here: the query selects raw rows without aggregate functions,
# which most SQL engines reject when combined with GROUP BY.
# Aggregation is done in pandas further down.
with engine.begin() as conn:
    results = pd.read_sql(QUERY, conn)

In [None]:
results.head()

In [None]:
results.groupby(['recipient_candidate_office'])['transaction_amount'].sum()
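
The same per-office total could also be computed in the database with `GROUP BY`, so that only the aggregated rows are transferred. The sketch below illustrates the pattern against a toy in-memory SQLite table rather than the real warehouse. The rows and office values are made up for the example, and the table is unqualified because SQLite has no `trg_analytics` schema.

```python
import sqlite3

# Toy stand-in for trg_analytics.candidate_contributions (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE candidate_contributions ("
    "transaction_id INTEGER, "
    "recipient_candidate_office TEXT, "
    "transaction_amount REAL)"
)
conn.executemany(
    "INSERT INTO candidate_contributions VALUES (?, ?, ?)",
    [(1, "Senate", 100.0), (2, "Senate", 250.0), (3, "House", 50.0)],
)

# Every selected column is either in GROUP BY or wrapped in an aggregate.
AGG_QUERY = """
SELECT
    recipient_candidate_office
    , SUM(transaction_amount) AS total_amount
  FROM candidate_contributions
  GROUP BY recipient_candidate_office
  ORDER BY recipient_candidate_office
"""
totals = dict(conn.execute(AGG_QUERY).fetchall())
conn.close()

print(totals)  # {'House': 50.0, 'Senate': 350.0}
```

Against the real database, the equivalent query would read from `trg_analytics.candidate_contributions` and could be passed to `pd.read_sql` in the same way as the query above.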

In [None]:
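# Histogram of the first 10 transaction amounts, in bins of width 100.
# aes() takes the column name; plotnine looks it up in `data`.
(ggplot(data=results[:10], mapping=aes(x='transaction_amount')) +
     geom_histogram(binwidth=100))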