# SELECT {field}, COUNT() FROM {table} GROUP BY {field} HAVING COUNT() condition

- Aggregate functions: COUNT(), SUM(), AVG(), MIN(), and MAX().
- GROUP BY... HAVING.
- Examples in SQL Query vs Pandas 

## Two examples of GROUP BY and aggregate values in this case
1. What was the total turnover of a given product selected by ProductID? - b: add columns like UnitPrice and ProductName.
2. List the products whose total invoiced is between x and y $, ordered by ProductID or Total Invoiced, Asc or Desc: - b: as in 1.

- I will use only the Purchasing.PurchaseOrderDetail table
- I will use only the df_from_query function
- As always, I build the code first with a direct SQL query to the DB and second with native Pandas, using a DF that is the whole table extracted from the DB

## 1. Establish DB connection and define a function to convert SQL query results to a DF
- I define the df_from_query() function to avoid pd.read_sql(), which shows a warning message when used with a pyodbc connection

In [2]:
### Connect to the DB - Establish the connection
import pyodbc

# Valid values for the connection string
driver = '{ODBC Driver 17 for SQL Server}'
server = '(local)'
dbname = 'AdventureWorks2019'
user = 'user1'
passwd = 'pass1'

# Construct the Connection String
connection_string = f'DRIVER={driver};SERVER={server};\
    DATABASE={dbname};UID={user};PWD={passwd}'
print('Connection String:\n', connection_string)

# Establish the connection
try:
    cnx = pyodbc.connect(connection_string)
    cur = cnx.cursor()
except pyodbc.Error as e:
    print('ERROR:', e)
else:
    print('SUCCESS: Connection Established')

# mk function to convert SQL queries to DF
import pandas as pd

def df_from_query(qry):     # convert cursor.execute(query) to DF
    cur.execute(qry)
    field_names = [i[0] for i in cur.description]
    get_data = [list(x) for x in cur]
    df = pd.DataFrame(data=get_data, columns=field_names)
    return df

Connection String:
 DRIVER={ODBC Driver 17 for SQL Server};SERVER=(local);    DATABASE=AdventureWorks2019;UID=user1;PWD=pass1
SUCCESS: Connection Established
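
The df_from_query() helper relies only on the DB-API cursor: `cursor.description` carries the column names and iterating the cursor yields the rows. A minimal sketch of the same idea, assuming an in-memory SQLite database as a stand-in for SQL Server (the `demo` table, its values and the `df_from_cursor` name are made up for illustration):

```python
import sqlite3
import pandas as pd

# Stand-in database: in-memory SQLite instead of the pyodbc connection
demo_cnx = sqlite3.connect(':memory:')
demo_cur = demo_cnx.cursor()
demo_cur.execute('CREATE TABLE demo (ProductID INTEGER, LineTotal REAL)')
demo_cur.executemany('INSERT INTO demo VALUES (?, ?)',
                     [(1, 10.0), (1, 5.5), (2, 7.0)])

def df_from_cursor(cursor, qry):
    """Run qry on a DB-API cursor and return the result as a DataFrame."""
    cursor.execute(qry)
    # description holds one 7-item tuple per column; item 0 is the column name
    field_names = [col[0] for col in cursor.description]
    rows = [list(row) for row in cursor]
    return pd.DataFrame(data=rows, columns=field_names)

demo_df = df_from_cursor(demo_cur,
                         'SELECT * FROM demo ORDER BY ProductID, LineTotal')
print(demo_df)
```

Passing the cursor as an argument, instead of using the global `cur`, is only a choice made for this sketch so that it runs on its own.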


## 2. Get the table as a DF (orders_df)
- To apply native pandas code I need the same DB table that I'll query to be available as a DF

In [3]:
query = ''' SELECT * FROM Purchasing.PurchaseOrderDetail'''
orders_df = df_from_query(query)
print(orders_df.columns)
orders_df

Index(['PurchaseOrderID', 'PurchaseOrderDetailID', 'DueDate', 'OrderQty',
       'ProductID', 'UnitPrice', 'LineTotal', 'ReceivedQty', 'RejectedQty',
       'StockedQty', 'ModifiedDate'],
      dtype='object')


| | PurchaseOrderID | PurchaseOrderDetailID | DueDate | OrderQty | ProductID | UnitPrice | LineTotal | ReceivedQty | RejectedQty | StockedQty | ModifiedDate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2011-04-30 | 4 | 1 | 50.2600 | 201.0400 | 3.00 | 0.00 | 3.00 | 2011-04-23 00:00:00.000 |
| 1 | 2 | 2 | 2011-04-30 | 3 | 359 | 45.1200 | 135.3600 | 3.00 | 0.00 | 3.00 | 2011-04-23 00:00:00.000 |
| 2 | 2 | 3 | 2011-04-30 | 3 | 360 | 45.5805 | 136.7415 | 3.00 | 0.00 | 3.00 | 2011-04-23 00:00:00.000 |
| 3 | 3 | 4 | 2011-04-30 | 550 | 530 | 16.0860 | 8847.3000 | 550.00 | 0.00 | 550.00 | 2011-04-23 00:00:00.000 |
| 4 | 4 | 5 | 2011-04-30 | 3 | 4 | 57.0255 | 171.0765 | 2.00 | 1.00 | 1.00 | 2011-04-23 00:00:00.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8840 | 4011 | 8841 | 2014-07-24 | 1000 | 880 | 20.5600 | 20560.0000 | 1000.00 | 0.00 | 1000.00 | 2015-08-12 12:25:46.470 |
| 8841 | 4012 | 8842 | 2014-07-24 | 6000 | 881 | 41.5700 | 249420.0000 | 6000.00 | 0.00 | 6000.00 | 2015-08-12 12:25:46.483 |
| 8842 | 4012 | 8843 | 2014-07-24 | 6000 | 882 | 41.5700 | 249420.0000 | 6000.00 | 0.00 | 6000.00 | 2015-08-12 12:25:46.483 |
| 8843 | 4012 | 8844 | 2014-07-24 | 6000 | 883 | 41.5700 | 249420.0000 | 6000.00 | 0.00 | 6000.00 | 2015-08-12 12:25:46.483 |


## 3. First Question
- 1. What was the total turnover of a given product selected by ProductID? - b: add columns like UnitPrice and ProductName.

### 3.1. Get the list of valid ProductIDs
Since I will ask the user to enter the ProductID, I'll have to validate it. First I'll get the list of valid ProductIDs from the table by three different methods (and time each one):
1. Direct native SQL query to the table
2. Using pandas .groupby()
3. Using pandas .unique()

Methods 2 and 3 have the advantage that the table is already loaded in memory as a pandas DF. So, for a fairer comparison, I'll also time a complete round: get the DF from the DB and then apply the faster pandas method.

In [None]:
#%%timeit - 2.89 ms ± 263 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 1. Direct native SQL query to the table
q_31 = ''' SELECT ProductID
            FROM Purchasing.PurchaseOrderDetail
            GROUP BY ProductID
            ORDER BY ProductID'''

# To get a list with the result I use the cursor() just created
cur.execute(q_31)
prods_IDs = [el[0] for el in cur.fetchall()] 
#prods_IDs = [el[0] for el in cur]    # same time as above
#print(prods_IDs, type(prods_IDs))

In [None]:
#%%timeit - 6.12 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2. Using pandas .groupby()
prods_IDsp = orders_df.groupby('ProductID').first().index 

In [None]:
#%%timeit - 621 µs ± 77.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# 3. Using pandas .unique()
prods_IDsp1 = orders_df.ProductID.sort_values().unique()
print(prods_IDsp1, type(prods_IDsp1))

In [None]:
#%%timeit - 146 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 4. Complete timing comparison (vs. 1): load the table into a DF, then apply method 3
qc = ''' SELECT * FROM Purchasing.PurchaseOrderDetail'''
cur.execute(qc)
cols = [i[0] for i in cur.description]
dats = [list(x) for x in cur]
cdf = pd.DataFrame(data=dats, columns=cols)

prods_IDsp2 = cdf.ProductID.sort_values().unique()

In [None]:
#%%timeit - 181 ms ± 3.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 5. Same comparison using pd.read_sql(), which emits a warning with pyodbc ;)
qd = ''' SELECT * FROM Purchasing.PurchaseOrderDetail'''
ddf = pd.read_sql(qd, cnx)

prods_IDsp3 = ddf.ProductID.sort_values().unique()
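
Methods 2 and 3 should return the same sorted set of IDs. A small sketch to check that on toy data, assuming a made-up `toy_df` in place of orders_df (it says nothing about timing, only about the result):

```python
import numpy as np
import pandas as pd

# Toy stand-in for orders_df (values are made up)
toy_df = pd.DataFrame({'ProductID': [530, 1, 359, 1, 530, 4],
                       'LineTotal': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]})

# Method 2: group by the column and read the sorted group keys from the index
ids_groupby = toy_df.groupby('ProductID').first().index.to_numpy()

# Method 3: sort the column, then keep each distinct value once
ids_unique = toy_df.ProductID.sort_values().unique()

# Both approaches yield the same sorted array of distinct IDs
print(ids_groupby, ids_unique)
print(np.array_equal(ids_groupby, ids_unique))
```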

### 3.2. Ask user the ProductID and validate it

In [None]:
message = f'ERROR: ProductID must be an integer included in: \n {prods_IDsp1}'

while True:
    try:
        prodID = int(input('Enter the ID of the Product - ProductID -: '))
        assert prodID in prods_IDsp1        # (prods_IDs - prods_IDsp)
    except (ValueError, AssertionError) as e:
        print(f'{message} \n\n {e}')
    except Exception as e:
        print(f'ERROR: Unknown! \n {e}')
    else:
        print(f'ProductID entered: {prodID}', type(prodID))
        break
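
The loop above validates with `assert`, which Python skips when it runs with the `-O` flag. A possible alternative, only a sketch with a hypothetical `parse_product_id()` helper and a made-up list of IDs, is to raise ValueError explicitly so one exception type covers both bad text and unknown IDs:

```python
def parse_product_id(raw, valid_ids):
    """Return raw as an int if it is one of valid_ids, else raise ValueError."""
    prod_id = int(raw)              # raises ValueError for non-numeric text
    if prod_id not in valid_ids:
        raise ValueError(f'{prod_id} is not a known ProductID')
    return prod_id

valid = [1, 4, 359, 530]            # made-up list of valid IDs
print(parse_product_id('359', valid))

try:
    parse_product_id('999', valid)
except ValueError as e:
    print('ERROR:', e)
```

Inside the `while True` loop, the `try` block would then only need to call this function and catch ValueError.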


In [None]:
## Aux code to find products that have more than x and fewer than y rows
# Useful for picking example products that are easy to check manually
x = 1
y = 15
for pID in prods_IDsp1:
    # for now this is easier for me to do with pandas
    rows_num = len(orders_df.loc[orders_df.ProductID == pID])
    if x < rows_num < y:
        print(f'ProductID {pID} -> {rows_num} rows')

### 3.3. Native SQL query to answer the question

In [None]:
prodID = 461   # To avoid input code

q_3 = f''' SELECT ProductID, SUM(LineTotal) AS SUM_LineTotal
            FROM Purchasing.PurchaseOrderDetail
            WHERE ProductID = {prodID}
            GROUP BY ProductID'''
df_3 = df_from_query(q_3)
df_3
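
The query above inserts prodID into the SQL text with an f-string, which is safe here only because the value was validated as an integer. A sketch of the parameterized form, assuming in-memory SQLite with made-up rows as a stand-in (pyodbc also accepts `?` markers, with the values passed as extra arguments to `cursor.execute()`):

```python
import sqlite3

# Stand-in for the SQL Server table (made-up rows), using in-memory SQLite
pcnx = sqlite3.connect(':memory:')
pcur = pcnx.cursor()
pcur.execute('CREATE TABLE PurchaseOrderDetail (ProductID INTEGER, LineTotal REAL)')
pcur.executemany('INSERT INTO PurchaseOrderDetail VALUES (?, ?)',
                 [(461, 100.0), (461, 50.5), (530, 8847.3)])

demo_prodID = 461
q_param = '''SELECT ProductID, SUM(LineTotal) AS SUM_LineTotal
             FROM PurchaseOrderDetail
             WHERE ProductID = ?
             GROUP BY ProductID'''
# The value travels separately from the SQL text, so it is never parsed as SQL
pcur.execute(q_param, (demo_prodID,))
result = pcur.fetchall()
print(result)
```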

### 3.4. Native Pandas to answer the question

In [None]:
# 1. get only a DF of the ProductID we are interested in, then group then agg. SUM
df_34 = orders_df.loc[orders_df.ProductID == prodID]
# 2. group by and sum agg by 'LineTotal' column
df_34 = df_34.groupby('ProductID').agg(
    SUM_LineTotal=pd.NamedAgg(column='LineTotal', aggfunc='sum'))
# 3. reset index to transform ProductID index to a new pandas col
df_34.reset_index(inplace=True)
df_34
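
When only the number is needed, and not a one-row table, the filter and the sum can be combined in a single expression. A minimal sketch on made-up data (`toy_orders` and `demo_prodID` are placeholders for orders_df and prodID):

```python
import pandas as pd

toy_orders = pd.DataFrame({'ProductID': [461, 461, 530],
                           'LineTotal': [100.0, 50.5, 8847.3]})
demo_prodID = 461

# The boolean mask selects the rows; the column label selects LineTotal
total = toy_orders.loc[toy_orders.ProductID == demo_prodID, 'LineTotal'].sum()
print(total)
```

For a ProductID with no rows this expression returns 0.0, whereas the groupby version returns an empty DF.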

## 4. Second Question
- 2. List the products whose total invoiced is between min and max $, ordered by ProductID or Total Invoiced, Asc or Desc: - b: add columns like UnitPrice and ProductName.

### 4.1. Ask lower and upper limits of the range of total invoiced

In [None]:
txt = 'total turnover value'
t_low = 'Lower' + ' ' + txt
t_hig = 'Higher' + ' ' + txt

while True: 
    try:
        # NOTE: the names min and max shadow Python's built-in functions from here on
        min = int(input(f'{t_low}? '))
        max = int(input(f'{t_hig}? '))
        assert min > 0 and max > 0 and min <= max 
    except ValueError as e:
        print(f'ERROR: {e} \n'
              f'{t_low} and {t_hig} must be integers')             
    except AssertionError as e:
        print(f'ERROR: {e} \n'
              f'{t_low} must be less than or equal to {t_hig} and both greater than 0')
    except Exception as e:
        print(f'ERROR: {e}')
    else:
        print(f'Total income between {min:,} and {max:,}')
        break

### 4.2. Ask Sorting info: Column (ProductID, Total_turnover) and order (asc, desc)

In [5]:
# Dicts that contain the valid values: key -> (text shown to the user, value used in code)
dcol_txs = ('Total_turnover', 'ProductID')
dcol = {'t': (dcol_txs[0], dcol_txs[0]), 'p': (dcol_txs[1], dcol_txs[1])}
dord = {'a': ('Ascending', 'ASC'), 'd': ('Descending', 'DESC')}

# Function to build the options text shown in the input prompt:
def inp_trin(dic):
    opts = ', '.join(f'{key}: {val[0]}' for key, val in dic.items())
    return f'[{opts}]'

# Function to check the sorting option entered and return its code value:
def sort_inputed(dic, inp):
    if inp.lower() in dic:
        return dic[inp.lower()][1]
    else:
        raise AssertionError(f'{inp!r} is not a valid option')
# Loop to input and validate sorting options    
while True: 
    try:
        sort_col = input(f'Sorting column {inp_trin(dcol)}? ')
        sort_col = sort_inputed(dcol, sort_col)
        sort_order = input(f'Sorting order {inp_trin(dord)}? ')
        sort_order = sort_inputed(dord, sort_order)  
    except AssertionError as e:
        print(f'ERROR: Invalid Input {e} \n'
              f' Valid Columns: {inp_trin(dcol)} \n'
              f' Valid ordering {inp_trin(dord)}')
    except Exception as e:
        print(f'ERROR: {e}')
    else:
        print(f"Sorted by '{sort_col}' column and '{sort_order}'")
        break

Sorted by 'Total_turnover' column and 'DESC'


### 4.3. Native SQL query

In [12]:
## Hard-coded values to speed up testing
min = 300_000
max = 1_000_000
sort_col = dcol['p'][1]
sort_order = dord['a'][1]
print(f'{min:,} - {max:,}  |  {sort_col} - {sort_order}')

300,000 - 1,000,000  |  ProductID - ASC


In [13]:
q_43 = f''' SELECT ProductID, SUM(LineTotal) AS {dcol['t'][0]}
            FROM Purchasing.PurchaseOrderDetail
            GROUP BY ProductID
            HAVING SUM(LineTotal) >= {min} and SUM(LineTotal) <= {max}
            ORDER BY {sort_col} {sort_order} '''
df_43 = df_from_query(q_43)
df_43

| | ProductID | Total_turnover |
|---|---|---|
| 0 | 509 | 798105.0 |
| 1 | 523 | 592289.775 |
| 2 | 524 | 759874.5 |
| 3 | 530 | 451212.3 |
| 4 | 908 | 707720.475 |
| 5 | 911 | 823740.225 |
| 6 | 914 | 464079.0 |
| 7 | 915 | 669669.0 |
| 8 | 916 | 878152.275 |
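
HAVING filters the groups after the aggregation, while WHERE filters the rows before it. A small sketch of the same kind of query on made-up data, assuming in-memory SQLite as a stand-in and a hypothetical `pod` table; it also uses BETWEEN, which is inclusive on both ends like the `>=` and `<=` pair above:

```python
import sqlite3

hcnx = sqlite3.connect(':memory:')
hcur = hcnx.cursor()
hcur.execute('CREATE TABLE pod (ProductID INTEGER, LineTotal REAL)')
hcur.executemany('INSERT INTO pod VALUES (?, ?)',
                 [(1, 100.0), (1, 250.0), (2, 50.0), (3, 400.0), (3, 700.0)])

# Totals per product: 1 -> 350, 2 -> 50, 3 -> 1100
q_having = '''SELECT ProductID, SUM(LineTotal) AS Total_turnover
              FROM pod
              GROUP BY ProductID
              HAVING SUM(LineTotal) BETWEEN ? AND ?
              ORDER BY ProductID ASC'''
# Only the groups whose total falls inside the range are kept
hcur.execute(q_having, (300, 1000))
kept = hcur.fetchall()
print(kept)
```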


### 4.4. Native Pandas to answer the question

In [16]:
## Native PANDAS - orders_df is the table
# Build the aggregate table with the total turnover grouped by product
df_44 = orders_df.groupby('ProductID').agg(
    Total_turnover=pd.NamedAgg(column='LineTotal', aggfunc='sum'))  # -N1-
# Turn the index into a column to filter and sort by Total_turnover or ProductID
df_44.reset_index(inplace=True)

# Convert the SQL sort option to the pandas `ascending` parameter
asc = sort_order == 'ASC'

# Filter by the turnover range, then sort according to the sorting options
df_44 = df_44.loc[(df_44.Total_turnover >= min) & (df_44.Total_turnover <= max)]\
    .sort_values(by=[sort_col], ascending=asc).reset_index(drop=True)   # -N2-

df_44

# -N1-: Total_turnover is written literally because a keyword argument name
#   cannot be an expression like dcol['t'][0]
# -N2-: reset_index(drop=True) replaces the filtered index with a fresh
#   0..n-1 index and discards the old one


| | ProductID | Total_turnover |
|---|---|---|
| 0 | 509 | 798105.0 |
| 1 | 523 | 592289.775 |
| 2 | 524 | 759874.5 |
| 3 | 530 | 451212.3 |
| 4 | 908 | 707720.475 |
| 5 | 911 | 823740.225 |
| 6 | 914 | 464079.0 |
| 7 | 915 | 669669.0 |
| 8 | 916 | 878152.275 |
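
About note -N1-: the output column name can come from a variable if the named aggregation is passed as an unpacked dict. A sketch on made-up data, where `toy_orders`, `total_col`, `low` and `high` are placeholder names; it also uses `Series.between()` for the range filter:

```python
import pandas as pd

toy_orders = pd.DataFrame({'ProductID': [1, 1, 2, 3, 3],
                           'LineTotal': [100.0, 250.0, 50.0, 400.0, 700.0]})
total_col = 'Total_turnover'        # could come from dcol['t'][0]
low, high = 300, 1200               # made-up limits

# ** unpacks the dict, so the output column name can be held in a variable
totals = (toy_orders
          .groupby('ProductID')
          .agg(**{total_col: pd.NamedAgg(column='LineTotal', aggfunc='sum')})
          .reset_index())

# between() is inclusive on both ends by default, like >= and <= combined
filtered = (totals[totals[total_col].between(low, high)]
            .sort_values(by=total_col, ascending=False)
            .reset_index(drop=True))
print(filtered)
```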


## 5. Point b. of Q1 and Q2
It would be interesting to see other characteristics of the products in these result tables, such as ProductName and UnitPrice.

In [None]:
# ...
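
One possible direction for point b in pandas is a merge with a product lookup table. This is only a sketch under assumptions: `products_df` is a hypothetical DF, its names are placeholders, and the two totals are copied from the result table above.

```python
import pandas as pd

# Aggregate result shaped like df_43 / df_44 (first two rows of the output above)
totals_df = pd.DataFrame({'ProductID': [509, 523],
                          'Total_turnover': [798105.0, 592289.775]})
# Hypothetical product lookup table; names and columns are placeholders
products_df = pd.DataFrame({'ProductID': [509, 523, 524],
                            'ProductName': ['Product A', 'Product B', 'Product C']})

# A left join keeps every aggregated row and adds the matching name
named_df = totals_df.merge(products_df, on='ProductID', how='left')
print(named_df)
```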