# SQL to pandas cheat sheet

This cheat sheet maps common SQL commands to their equivalents in pandas, the Python data analysis library.

In [48]:
# Import libraries
import pandas as pd
import numpy as np

# Install pydataset
#!pip install pydataset

# Import data from pydataset library (used as example data)
from pydataset import data

In [25]:
# Load iris dataset 
df = data('iris')
df.head()

,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5.0,3.6,1.4,0.2,setosa


## SQL WHERE Clause in pandas

Used to extract only those records that fulfill a specified condition. Similar to the SQL command:

<code>SELECT *
FROM table_name
WHERE condition</code>

In [14]:
df[df['Species']=='setosa'].head()

,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,Cod.Species
1,5.1,3.5,1.4,0.2,setosa,1
2,4.9,3.0,1.4,0.2,setosa,1
3,4.7,3.2,1.3,0.2,setosa,1
4,4.6,3.1,1.5,0.2,setosa,1
5,5.0,3.6,1.4,0.2,setosa,1


To select only specific columns, as in:
    
<code>SELECT column1, column2, ...
FROM table_name
WHERE condition</code>

In [17]:
# Filter rows and pick columns in a single .loc call
df.loc[df['Species'] == 'setosa', ['Sepal.Length', 'Sepal.Width']].head()

,Sepal.Length,Sepal.Width
1,5.1,3.5
2,4.9,3.0
3,4.7,3.2
4,4.6,3.1
5,5.0,3.6
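
SQL WHERE clauses often combine several conditions with AND, OR, or IN. The following is a minimal sketch of how those might be written in pandas. The `sample` DataFrame is a small stand-in with illustrative values, not the full iris data, and the variable names are hypothetical.

```python
import pandas as pd

# Small stand-in for the iris data (illustrative values only)
sample = pd.DataFrame({
    'Sepal.Length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'Sepal.Width':  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'Species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})

# WHERE Species = 'setosa' AND "Sepal.Length" > 5
and_rows = sample[(sample['Species'] == 'setosa') & (sample['Sepal.Length'] > 5)]

# WHERE Species = 'setosa' OR "Sepal.Width" < 3
or_rows = sample[(sample['Species'] == 'setosa') | (sample['Sepal.Width'] < 3)]

# WHERE Species IN ('versicolor', 'virginica')
in_rows = sample[sample['Species'].isin(['versicolor', 'virginica'])]
```

Each condition needs its own parentheses because `&` and `|` bind more tightly than comparison operators in Python.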


## Distinct in pandas

Similar to the SQL command <code> SELECT DISTINCT COL_1 FROM TABLE </code>

### Method 1

In [4]:
df['Species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

### Method 2

In [5]:
# Attribute access works only when the column name is a valid Python identifier
df.Species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)
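
For DISTINCT over several columns, or for COUNT(DISTINCT ...), `unique()` on a single column is not enough. The sketch below shows one possible approach using `drop_duplicates()` and `nunique()`; the `sample` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Illustrative data, not the real iris values
sample = pd.DataFrame({
    'Species': ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica'],
    'Petal.Width': [0.2, 0.2, 1.4, 1.5, 2.5],
})

# SELECT DISTINCT Species, "Petal.Width" FROM sample
distinct_rows = sample[['Species', 'Petal.Width']].drop_duplicates()

# SELECT COUNT(DISTINCT Species) FROM sample
n_species = sample['Species'].nunique()
```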

## Group by in pandas

### Group by Count

Similar to the SQL command
<code> SELECT COL_1, COUNT(*) FROM TABLE GROUP BY COL_1 </code>

In [18]:
df.groupby(['Species']).size()

Species
setosa        50
versicolor    50
virginica     50
dtype: int64
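
`size()` returns a Series indexed by the group key, whereas a SQL query returns a table with a named count column. A minimal sketch of getting closer to that shape, and of filtering the groups as HAVING would, is shown below. The `sample` data and the column name `n` are assumptions for illustration.

```python
import pandas as pd

# Illustrative data with unequal group sizes
sample = pd.DataFrame({
    'Species': ['setosa', 'setosa', 'versicolor',
                'virginica', 'virginica', 'virginica'],
})

# SELECT Species, COUNT(*) AS n FROM sample GROUP BY Species
counts = sample.groupby('Species').size().reset_index(name='n')

# ... HAVING COUNT(*) > 1
having = counts[counts['n'] > 1]
```

Note that `size()` counts all rows in each group, like COUNT(*), while `count()` counts only non-null values, like COUNT(COL_1).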

### Group by and Aggregate by different functions

Use GROUP BY with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to summarize the result set by one or more columns. Similar to the SQL command:

<code>SELECT COUNT(COL_1), SUM(COL_2)
FROM TABLE
GROUP BY COL_3, COL_4</code>

In [29]:
df.groupby('Species').agg({'Sepal.Length':['max', 'min'], 
                         'Sepal.Width':'mean', 
                         'Petal.Length':'sum', 
                         'Petal.Width': lambda x: x.max() - x.min()})

,Sepal.Length,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,max,min,mean,sum,<lambda>
Species,,,,,
setosa,5.8,4.3,3.428,73.1,0.5
versicolor,7.0,4.9,2.77,213.0,0.8
virginica,7.9,4.9,2.974,277.6,1.1


To avoid the "<lambda>" column name, define a named function and give it a custom name through its <code>__name__</code> attribute:

In [44]:
def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('Species').agg({'Sepal.Length':['max', 'min'], 
                         'Sepal.Width':'mean', 
                         'Petal.Length':'sum', 
                         'Petal.Width': max_min})

,Sepal.Length,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,max,min,mean,sum,Max minus Min
Species,,,,,
setosa,5.8,4.3,3.428,73.1,0.5
versicolor,7.0,4.9,2.77,213.0,0.8
virginica,7.9,4.9,2.974,277.6,1.1


To flatten the hierarchical (MultiIndex) columns into single-level column names:

In [45]:
df_groups = df.groupby('Species').agg({'Sepal.Length':['max', 'min'], 
                         'Sepal.Width':'mean', 
                         'Petal.Length':'sum', 
                         'Petal.Width': max_min})

df_groups.columns

MultiIndex([('Sepal.Length',           'max'),
            ('Sepal.Length',           'min'),
            ( 'Sepal.Width',          'mean'),
            ('Petal.Length',           'sum'),
            ( 'Petal.Width', 'Max minus Min')],
           )

In [47]:
# Join each (column, aggregation) pair into a single flat column name
df_groups.columns = [' '.join(col).strip() for col in df_groups.columns.values]
df_groups

,Sepal.Length max,Sepal.Length min,Sepal.Width mean,Petal.Length sum,Petal.Width Max minus Min
Species,,,,,
setosa,5.8,4.3,3.428,73.1,0.5
versicolor,7.0,4.9,2.77,213.0,0.8
virginica,7.9,4.9,2.974,277.6,1.1
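
As an alternative to renaming a function and flattening the columns afterwards, pandas 0.25 and later support named aggregation, which produces flat, custom-named columns in one step, much like `AS` aliases in SQL. The sketch below assumes a small illustrative `sample` DataFrame; the output column names are hypothetical.

```python
import pandas as pd

# Illustrative data, not the real iris values
sample = pd.DataFrame({
    'Species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
    'Sepal.Length': [5.1, 4.9, 7.0, 6.4],
    'Petal.Width': [0.2, 0.4, 1.4, 1.5],
})

# Each keyword is the output column name; the tuple is (input column, function)
summary = sample.groupby('Species').agg(
    sepal_length_max=('Sepal.Length', 'max'),
    sepal_length_min=('Sepal.Length', 'min'),
    petal_width_range=('Petal.Width', lambda x: x.max() - x.min()),
)
```

The resulting columns form a plain single-level index, so no flattening step is needed.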


## SQL CASE Statement in pandas

Used for creating new columns based on multiple conditions. Similar to the SQL command:

<code>CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    WHEN conditionN THEN resultN
    ELSE result
END </code>

In [19]:
# Define a function with the rules to apply to each row

def create_column(row):
    
    if row['Species'] == 'setosa':
        return 1
    elif row['Species'] == 'versicolor':
        return 2
    elif row['Species'] == 'virginica':
        return 3
    return None  # ELSE branch: no rule matched
    
# Then create the column by applying create_column to every row (axis=1)

df['Cod.Species'] = df.apply(create_column, axis=1)
df.head()

,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,Cod.Species
1,5.1,3.5,1.4,0.2,setosa,1
2,4.9,3.0,1.4,0.2,setosa,1
3,4.7,3.2,1.3,0.2,setosa,1
4,4.6,3.1,1.5,0.2,setosa,1
5,5.0,3.6,1.4,0.2,setosa,1
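
Row-wise `apply` can be slow on large tables. One vectorized alternative is `numpy.select`, which takes a list of conditions and a list of results and mirrors the WHEN/THEN/ELSE structure fairly directly. This is a minimal sketch on a made-up `sample` DataFrame; the `'unknown'` row and the default value `0` are assumptions added to show the ELSE branch.

```python
import numpy as np
import pandas as pd

# Illustrative data, including a value that matches no rule
sample = pd.DataFrame({'Species': ['setosa', 'versicolor', 'virginica', 'unknown']})

# One condition per WHEN, evaluated in order
conditions = [
    sample['Species'] == 'setosa',
    sample['Species'] == 'versicolor',
    sample['Species'] == 'virginica',
]
# One result per THEN, in the same order as the conditions
results = [1, 2, 3]

# default plays the role of ELSE
sample['Cod.Species'] = np.select(conditions, results, default=0)
```

For a simple one-to-one lookup like this, `sample['Species'].map({...})` with a dictionary is another option; values missing from the dictionary become NaN.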
