`GROUP BY` is an optional clause of the `SELECT` statement that groups rows by one or more columns.

The clause returns one row per group and lets you apply aggregate functions such as `MIN`, `MAX`, `SUM`, `COUNT`, and `AVG` to each group.
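
As a minimal sketch of the behavior described above, the snippet below builds a small in-memory SQLite table and groups it by one column. The table `People`, its rows, and the connection name `demo` are hypothetical and exist only for illustration; they are not part of the database used in this notebook.

```python
import sqlite3

# Hypothetical in-memory table, used only to illustrate the clause.
demo = sqlite3.connect(':memory:')
demo.execute('CREATE TABLE People (Occupation TEXT, AnnualIncome REAL)')
demo.executemany(
    'INSERT INTO People VALUES (?, ?)',
    [('Manual', 10000.0), ('Manual', 30000.0), ('Clerical', 20000.0)],
)

# One output row per distinct Occupation; each aggregate is computed per group.
rows = demo.execute('''
SELECT Occupation,
    COUNT(*) AS Total,
    MIN(AnnualIncome) AS MinIncome,
    MAX(AnnualIncome) AS MaxIncome,
    SUM(AnnualIncome) AS SumIncome,
    AVG(AnnualIncome) AS AvgIncome
FROM People
GROUP BY Occupation
ORDER BY Occupation
''').fetchall()

for row in rows:
    print(row)

demo.close()
```

Three input rows collapse into two output rows, one per occupation.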

In [1]:
import pandas as pd
import sqlite3

In [3]:
con = sqlite3.connect('../primeiro_banco')
cur = con.cursor()
print('Connection successful!')

Connection successful!


In [15]:
query = 'SELECT * FROM Customers LIMIT 3'

df = pd.read_sql_query(query, con)
df

Unnamed: 0,CustomerKey,FirstName,LastName,BirthDate,MaritalStatus,Gender,EmailAddress,AnnualIncome,TotalChildren,EducationLevel,Occupation,HomeOwner
0,11000,JON,YANG,4/8/1966,M,M,jon24@adventure-works.com,90000.0,2,Bachelors,Professional,Y
1,11001,EUGENE,HUANG,5/14/1965,S,M,eugene10@adventure-works.com,60000.0,3,Bachelors,Professional,N
2,11002,RUBEN,TORRES,8/12/1965,M,M,ruben35@adventure-works.com,60000.0,3,Bachelors,Professional,Y


Counting unique values

In [12]:
query = '''
SELECT COUNT(DISTINCT Occupation) AS Occupations,
    COUNT(DISTINCT EducationLevel) AS Education
FROM Customers
'''

df = pd.read_sql_query(query, con)
print(df)
# 5 occupation types and 5 education levels

   Occupations  Education
0            5          5
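
To make the difference between the `COUNT` variants concrete, here is a small sketch on a hypothetical in-memory table (`People` and `demo` are illustrative names, not objects from this notebook). It assumes one row has a missing occupation, to show that `COUNT(column)` and `COUNT(DISTINCT column)` both skip `NULL` values while `COUNT(*)` does not.

```python
import sqlite3

# Hypothetical data: four rows, one of them with a NULL occupation.
demo = sqlite3.connect(':memory:')
demo.execute('CREATE TABLE People (Occupation TEXT)')
demo.executemany(
    'INSERT INTO People VALUES (?)',
    [('Manual',), ('Manual',), ('Clerical',), (None,)],
)

total_rows, non_null, distinct = demo.execute('''
SELECT COUNT(*),                   -- every row
    COUNT(Occupation),             -- rows where Occupation is not NULL
    COUNT(DISTINCT Occupation)     -- different non-NULL values
FROM People
''').fetchone()

print(total_rows, non_null, distinct)  # prints: 4 3 2
demo.close()
```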


Listing unique values with `GROUP BY`

In [21]:
print('Occupation types:')
query = '''
SELECT Occupation
FROM Customers
GROUP BY Occupation
'''
df = pd.read_sql_query(query, con)
print(df)

print()
print('Education levels:')
query = '''
SELECT EducationLevel
FROM Customers
GROUP BY EducationLevel
'''

df = pd.read_sql_query(query, con)
print(df)

Occupation types:
       Occupation
0        Clerical
1      Management
2          Manual
3    Professional
4  Skilled Manual

Education levels:
        EducationLevel
0            Bachelors
1      Graduate Degree
2          High School
3      Partial College
4  Partial High School
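
A `GROUP BY` without aggregate functions returns the same set of values as `SELECT DISTINCT`. The sketch below checks this on a hypothetical in-memory table (`People` and `demo` are illustrative names), then shows what `GROUP BY` adds: a per-group aggregate alongside each value.

```python
import sqlite3

# Hypothetical in-memory table; names and rows are for illustration only.
demo = sqlite3.connect(':memory:')
demo.execute('CREATE TABLE People (Occupation TEXT)')
demo.executemany(
    'INSERT INTO People VALUES (?)',
    [('Manual',), ('Clerical',), ('Manual',), ('Management',)],
)

distinct_rows = demo.execute(
    'SELECT DISTINCT Occupation FROM People ORDER BY Occupation'
).fetchall()
grouped_rows = demo.execute(
    'SELECT Occupation FROM People GROUP BY Occupation ORDER BY Occupation'
).fetchall()

# Both queries list each occupation exactly once.
print(distinct_rows == grouped_rows)

# GROUP BY can also carry an aggregate for each group.
counted_rows = demo.execute('''
SELECT Occupation, COUNT(*) AS Total
FROM People
GROUP BY Occupation
ORDER BY Occupation
''').fetchall()
print(counted_rows)

demo.close()
```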


Extracting more information with `GROUP BY`, aggregate functions, and sorting with `ORDER BY`

In [73]:
# Count customers per occupation and education level, with their average, highest, and lowest annual income, largest groups first
query = '''
SELECT Occupation,
EducationLevel, 
COUNT(Occupation) AS TotalWorkers,
ROUND(AVG(AnnualIncome),2) AS AvgIncome,
MAX(AnnualIncome) AS MaxIncome,
MIN(AnnualIncome) AS MinIncome
FROM Customers
GROUP BY Occupation, EducationLevel
ORDER BY TotalWorkers DESC
'''

df = pd.read_sql_query(query, con)
print(df)

        Occupation       EducationLevel  TotalWorkers  AvgIncome  MaxIncome  \
0     Professional            Bachelors          1862   68367.35   150000.0   
1     Professional      Partial College          1627   82771.97   170000.0   
2       Management            Bachelors          1564   88529.41   170000.0   
3   Skilled Manual      Partial College          1288   57663.04    90000.0   
4         Clerical      Partial College          1226   34510.60    40000.0   
5   Skilled Manual            Bachelors          1099   49126.48    80000.0   
6       Management      Graduate Degree          1039   90057.75   170000.0   
7   Skilled Manual          High School          1020   37882.35    80000.0   
8     Professional          High School           955   71832.46   170000.0   
9           Manual          High School           883   18414.50    30000.0   
10    Professional      Graduate Degree           846   66241.13   130000.0   
11          Manual      Partial College           74

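Filtering data with the `HAVING` clause

`HAVING` is applied after grouping, so it is meant for conditions on groups or aggregates. In the query below, `TotalChildren` is neither grouped nor aggregated; SQLite accepts this and evaluates the condition against an arbitrarily selected row of each group, so the result does not mean "customers with at least two children". To filter rows before grouping, use `WHERE`; to filter groups, use an aggregate such as `AVG(TotalChildren)`.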

In [79]:
query = '''
SELECT Occupation,
EducationLevel, 
COUNT(Occupation) AS TotalWorkers,
ROUND(AVG(AnnualIncome),2) AS AvgIncome,
MAX(AnnualIncome) AS MaxIncome,
MIN(AnnualIncome) AS MinIncome
FROM Customers
GROUP BY Occupation, EducationLevel
HAVING TotalChildren >= 2
ORDER BY TotalWorkers DESC
'''

df = pd.read_sql_query(query, con)
print(df)

       Occupation       EducationLevel  TotalWorkers  AvgIncome  MaxIncome  \
0    Professional      Partial College          1627   82771.97   170000.0   
1      Management            Bachelors          1564   88529.41   170000.0   
2        Clerical      Partial College          1226   34510.60    40000.0   
3      Management      Graduate Degree          1039   90057.75   170000.0   
4  Skilled Manual          High School          1020   37882.35    80000.0   
5    Professional          High School           955   71832.46   170000.0   
6        Clerical  Partial High School           457   23982.49    40000.0   
7      Management          High School           304  107960.53   170000.0   
8      Management      Partial College            85  121529.41   170000.0   

   MinIncome  
0    40000.0  
1    40000.0  
2    20000.0  
3    50000.0  
4    10000.0  
5    30000.0  
6    10000.0  
7    80000.0  
8   100000.0  
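
The following sketch contrasts row-level filtering with `WHERE` and group-level filtering with `HAVING`. It uses a hypothetical in-memory table (`People`, `demo`, and the sample rows are assumptions made for illustration) rather than the `Customers` table, so the numbers are not comparable with the output above.

```python
import sqlite3

# Hypothetical data: occupation and number of children per person.
demo = sqlite3.connect(':memory:')
demo.execute('CREATE TABLE People (Occupation TEXT, TotalChildren INTEGER)')
demo.executemany(
    'INSERT INTO People VALUES (?, ?)',
    [('Manual', 0), ('Manual', 3), ('Manual', 4),
     ('Clerical', 0), ('Clerical', 1)],
)

# WHERE removes individual rows before the groups are formed.
where_rows = demo.execute('''
SELECT Occupation, COUNT(*) AS Total
FROM People
WHERE TotalChildren >= 2
GROUP BY Occupation
ORDER BY Occupation
''').fetchall()
print(where_rows)  # only the two Manual rows with 2+ children are counted

# HAVING removes whole groups after aggregation, based on an aggregate value.
having_rows = demo.execute('''
SELECT Occupation, COUNT(*) AS Total, AVG(TotalChildren) AS AvgChildren
FROM People
GROUP BY Occupation
HAVING AVG(TotalChildren) >= 2
ORDER BY Occupation
''').fetchall()
print(having_rows)  # the Manual group is kept with all three of its rows

demo.close()
```

With `WHERE`, the count for a group shrinks to the matching rows; with `HAVING`, a group is either kept whole or dropped.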


In [93]:
query = '''
SELECT
Occupation,
EducationLevel,
COUNT(EducationLevel) AS TotalEducation
FROM Customers
GROUP BY Occupation, EducationLevel
ORDER BY Occupation DESC
'''

df = pd.read_sql_query(query, con)
print(df)

        Occupation       EducationLevel  TotalEducation
0   Skilled Manual            Bachelors            1099
1   Skilled Manual      Graduate Degree             737
2   Skilled Manual          High School            1020
3   Skilled Manual      Partial College            1288
4   Skilled Manual  Partial High School             357
5     Professional            Bachelors            1862
6     Professional      Graduate Degree             846
7     Professional          High School             955
8     Professional      Partial College            1627
9     Professional  Partial High School             134
10          Manual            Bachelors              82
11          Manual      Graduate Degree              60
12          Manual          High School             883
13          Manual      Partial College             740
14          Manual  Partial High School             588
15      Management            Bachelors            1564
16      Management      Graduate Degree         