# Comparing Pandas vs. SQL: Ponder Citibike Mobility Analysis

<div class="alert alert-block alert-info"> <b>Before we get started: </b> 
    <ul style="list-style-type: none;margin: 0;padding: 0;">
        <li>✍️ To run this notebook, you need to have Ponder installed and set up on your machine. If you have not done so already, please refer to our <a href="https://docs.ponder.io/getting_started/quickstart.html">Quickstart guide</a> to get started.</li>
        <li>📖 Otherwise, if you're just interested in browsing through the notebook, keep reading below!</li>
    </ul>
</div>

Pandas is easy to use, flexible, and concise. Relational databases are scalable and reliable, but force you to use SQL. This demo shows how Ponder lets you combine the best of both worlds. With Ponder, you can write code that is more concise and maintainable than the equivalent SQL, and you can perform operations that aren't even relational in nature!

This notebook shows you how you can use Ponder to unleash your pandas code directly in your database! Ponder gives you a native pandas interface to your data warehouse. If you haven't already, sign up for a free trial account at [***app.ponder.io***](www.app.ponder.io) to try it out!

# Tutorial Overview

In this notebook, we walk through a simple analysis of Citibike bikeshare data to show you how easy it is to use Ponder. The dataset includes specific bikeshare trips in New York City, and we'll answer 6 mobility questions inspired by Kevin Chan's [Kaggle notebook](https://www.kaggle.com/code/cjinquan/citibike-analysis-sql/notebook). We use a sample of the public data available [here](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=new_york_citibike&page=dataset&project=geocoding-314221&ws=!1m4!1m3!3m2!1sbigquery-public-data!2snew_york_citibike) for this analysis.

**To answer these questions with SQL it takes about 72 lines of code. With Pandas, we can answer these same questions with almost 10X less code.**


Specifically we'll show you:

#### [How to set up Ponder](#setup)


#### [How Ponder Works](#how-it-works)


#### [The Citibike Use Case](#key-questions)
    


 <h1 align="center">🔥9 lines of pandas vs. ~70 lines of SQL🔥</h1>

&nbsp;

<a class="anchor" id="setup"></a>
# 🛠️ How to set up Ponder 🛠️

### Import Requirements 

In [1]:
import ponder; ponder.init()
import modin.pandas as pd
import snowflake.connector

2023-05-25 06:13:19 - Ponder package successfully imported
2023-05-25 06:13:19 - Creating session LvN3TVvaeRH2K6mJiEtlcEVgmvzfD84mCJ7lF57PC8


### Configure Database Connection

This example uses your Snowflake database as the backend engine. If you have Snowflake, you can find the documentation for the Snowflake Python Connector [here](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector).

If you don't have Snowflake, visit our documentation to [Connect Ponder to your DB](https://docs.ponder.io/getting_started/connection.html) for a list of other supported databases, or reach out to support@ponder.io.

In [2]:
import os; os.chdir("..")
import credential
sfcon = snowflake.connector.connect(
    user=credential.params["user"],
    password=credential.params["password"],
    account=credential.params["account"],
    role=credential.params["role"],
    database=credential.params["database"],
    schema=credential.params["schema"],
    warehouse=credential.params["warehouse"]
)
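
The `credential` module imported above is not shown in this notebook. One possible layout, sketched here purely as an assumption, is a small helper that builds the `params` dictionary from environment variables so that secrets stay out of the notebook. The `SNOWFLAKE_*` variable names and the `load_params` helper are hypothetical.

```python
import os

# Keys that the snowflake.connector.connect(...) call above reads from credential.params.
REQUIRED_KEYS = ["user", "password", "account", "role", "database", "schema", "warehouse"]


def load_params(env=os.environ, prefix="SNOWFLAKE_"):
    """Build the connection parameter dict from environment variables.

    The variable names (e.g. SNOWFLAKE_USER) are hypothetical; adapt them to your setup.
    """
    missing = [key for key in REQUIRED_KEYS if f"{prefix}{key.upper()}" not in env]
    if missing:
        raise KeyError(f"Missing environment variables for: {', '.join(missing)}")
    return {key: env[f"{prefix}{key.upper()}"] for key in REQUIRED_KEYS}


# Demonstration with a fake environment (no real credentials involved).
fake_env = {f"SNOWFLAKE_{key.upper()}": f"my_{key}" for key in REQUIRED_KEYS}
params = load_params(fake_env)
print(sorted(params))
```

With a file like this saved as `credential.py`, a line such as `params = load_params()` at module level would make `credential.params["user"]` resolve as in the cell above.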

### Select Data Source

Ponder lets you work with data in flat files as well as in your existing database tables. Below, we use a CSV file hosted on GitHub.

**So how is Ponder different than vanilla pandas?**

* With vanilla pandas, the `read_csv()` method pulls the data from disk into a dataframe in your local memory.

* With Ponder, the `read_csv()` method automatically creates a database table, configures the schema for your csv file, and loads the data into the warehouse for analysis.

Here, we configure Snowflake as the default database connection to use when reading CSVs.

In [3]:
ponder.configure(default_connection=sfcon)

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/citibike_tutorial.csv')

2023-05-25 06:13:23 - Preparing table in Snowflake using CSV file...
2023-05-25 06:13:26 - Configuring Ponder DataFrame in Snowflake...
2023-05-25 06:13:30 - Ponder DataFrame successfully configured in Snowflake
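
For contrast with the Ponder `read_csv()` call above, here is a minimal sketch of the vanilla pandas behavior: the whole file is parsed into a dataframe that lives in local memory. The three inline rows are copied from the sample data for illustration, and the variable names are assumptions.

```python
import io

import pandas  # vanilla pandas, deliberately not modin.pandas

# A tiny inline CSV standing in for the real file.
csv_text = """tripduration,starttime,usertype
730,2/9/15 8:37,Subscriber
704,11/20/13 20:21,Subscriber
425,1/6/16 17:01,Subscriber
"""

# Vanilla pandas materializes every row in the local Python process.
local_df = pandas.read_csv(io.StringIO(csv_text))

# Bytes held in local RAM by this dataframe; this grows with the size of the file.
local_bytes = int(local_df.memory_usage(deep=True).sum())
print(local_df.shape, local_bytes)
```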


<a class="anchor" id="how-it-works"></a>
# 🧪 How it Works 🧪

As you go through the rest of this notebook, it may seem like any other pandas tutorial notebook, and that is the magic of Ponder! 

Ponder gives you a pandas interface for your data warehouse. As you run each code block, Ponder automatically compiles your pandas code into SQL and runs it directly in your database. None of the computation is done in your local Python environment!

Don't believe us? Try operating on a large dataset that doesn't fit into your local RAM and run a memory profiler to see how Ponder can help you scale up your workflows! For more background on what we've been building, check out this [Ponder Overview Blogpost](https://ponder.io/run-pandas-on-1tb-directly-in-your-data-warehouse/).
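
To make the pandas-to-SQL idea concrete, the sketch below computes the same aggregate twice: once with pandas and once with hand-written SQL. This is not Ponder's compiler or its generated SQL. It uses Python's built-in `sqlite3` as a stand-in database, with made-up sample rows, only to illustrate the kind of correspondence involved.

```python
import sqlite3

import pandas as pd

# Made-up sample rows: (usertype, tripduration).
rows = [("Subscriber", 730), ("Customer", 704), ("Subscriber", 425), ("Subscriber", 373)]
trips = pd.DataFrame(rows, columns=["usertype", "tripduration"])

# pandas version: average trip duration per user type.
pandas_result = trips.groupby("usertype")["tripduration"].mean().to_dict()

# Equivalent hand-written SQL, executed inside the database engine.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (usertype TEXT, tripduration INTEGER)")
con.executemany("INSERT INTO trips VALUES (?, ?)", rows)
sql_result = dict(
    con.execute(
        "SELECT usertype, AVG(tripduration) FROM trips GROUP BY usertype"
    ).fetchall()
)
con.close()

print(pandas_result)
print(sql_result)
```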

When we print the dataframe, we see that it contains Citibike trips and associated details, including pickup/drop-off stations and times.

In [5]:
df.head()

Unnamed: 0,tripduration,starttime,stoptime,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bikeid,usertype,birth_year,gender
0,730,2/9/15 8:37,2/9/15 8:49,72,W 52 St & 11 Ave,40.767272,-73.993929,520,W 52 St & 5 Ave,40.759923,-73.976485,18809,Subscriber,1975.0,male
1,704,11/20/13 20:21,11/20/13 20:33,72,W 52 St & 11 Ave,40.767272,-73.993929,470,W 20 St & 8 Ave,40.743453,-74.00004,20515,Subscriber,1981.0,male
2,425,1/6/16 17:01,1/6/16 17:08,72,W 52 St & 11 Ave,40.767272,-73.993929,469,Broadway & W 53 St,40.763441,-73.982681,17116,Subscriber,1947.0,male
3,373,11/9/15 12:50,11/9/15 12:56,72,W 52 St & 11 Ave,40.767272,-73.993929,469,Broadway & W 53 St,40.763441,-73.982681,20892,Subscriber,1947.0,male
4,1149,8/3/13 17:14,8/3/13 17:33,72,W 52 St & 11 Ave,40.767272,-73.993929,325,E 19 St & 3 Ave,40.736245,-73.984738,17711,Subscriber,1981.0,female


&nbsp;

<a class="anchor" id="key-questions"></a>
# 6 Mobility Trend Questions Answered

The following 6 exploratory questions were answered using SQL in the bikeshare analysis on Kaggle, so we'll walk you through how you can do the same with Ponder + pandas.



* [What time frame does the data set contain?](#first-question)
* [Which age group uses Citi Bike most often and what is the trend?](#second-question)
* [Which gender uses Citi Bike most often and what is the trend?](#third-question)
* [Which day of the week is Citi Bike most utilized?](#fourth-question)
* [What is the average trip duration per day of the week?](#fifth-question)
* [What is the total number of trips per month?](#sixth-question)

Many of these questions involve time series, so below we convert the trip start time column `starttime` to a datetime type to simplify our analysis.

In [6]:
df['starttime'] = pd.to_datetime(df.starttime,format="%m/%d/%y %H:%M")
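
As a small standalone illustration of the format string used above, the sketch below parses three `starttime` values copied from the sample rows. It uses vanilla pandas so it runs without a database connection.

```python
import pandas as pd

# Raw strings as they appear in the CSV: month/day/two-digit year, then hour:minute.
raw = pd.Series(["2/9/15 8:37", "11/20/13 20:21", "1/6/16 17:01"])

# %m/%d/%y %H:%M matches that layout; %y expands the two-digit year.
parsed = pd.to_datetime(raw, format="%m/%d/%y %H:%M")

print(parsed.dtype)
print(parsed.dt.year.tolist())
```

Once the column has a datetime dtype, the `.dt` accessor used in the later questions (`.dt.year`, `.dt.day_of_week`) becomes available.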

<a class="anchor" id="first-question"></a>
# ❓Q1: What time frame does the data set contain ❓

### pandas  - 1 line of code

Calculate the min and max values of trip starts.

In [7]:
df.starttime.min(),df.starttime.max()

(Timestamp('2013-07-01 08:43:00'), Timestamp('2016-09-30 18:10:00'))
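
If you prefer a single labeled result, one alternative (a sketch on five sample timestamps, not part of the original analysis) is to request both aggregates at once with `agg`:

```python
import pandas as pd

# Five start times copied from the sample rows shown earlier.
starts = pd.to_datetime(
    pd.Series(["2/9/15 8:37", "11/20/13 20:21", "1/6/16 17:01", "11/9/15 12:50", "8/3/13 17:14"]),
    format="%m/%d/%y %H:%M",
)

# One call returns a Series indexed by the aggregate names.
span = starts.agg(["min", "max"])

# Length of the covered time frame as a Timedelta.
covered = span["max"] - span["min"]
print(span)
print(covered)
```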

## <h2 align="center"> 💡 1 line of pandas vs. 5 lines of code required using SQL approach [below](https://www.kaggle.com/code/cjinquan/citibike-analysis-sql?scriptVersionId=78471031&cellId=4) </h2>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-y698{background-color:#efefef;border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">In Python</th>
    <th class="tg-0pky">With SQL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky"><div class="lm-Widget p-Widget jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown"><pre style=""><code class="cm-s-jupyter language-python"><span class="cm-variable">df</span>.<span class="cm-property">starttime</span>.<span class="cm-property">min</span>(),<span class="cm-variable">df</span>.<span class="cm-property">starttime</span>.<span class="cm-property">max</span>()
</code></pre>
</div></td>
    <td class="tg-0pky"><div class="lm-Widget p-Widget jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown"><pre style=""><code class="cm-s-jupyter language-sql"><span class="cm-keyword">SELECT</span>  
    MAX<span class="cm-bracket">(</span><span class="cm-builtin">DATE</span><span class="cm-bracket">(</span>starttime<span class="cm-bracket">)</span><span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> max_date<span class="cm-punctuation">,</span>
    MIN<span class="cm-bracket">(</span><span class="cm-builtin">DATE</span><span class="cm-bracket">(</span>starttime<span class="cm-bracket">)</span><span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> min_date
<span class="cm-keyword">FROM</span> 
    `citibike<span class="cm-operator">-</span>project<span class="cm-operator">-</span><span class="cm-number">330415</span><span class="cm-variable-2">.CITIBIKE</span><span class="cm-variable-2">.citibike_trips</span>` 
</code></pre>
</div></td>
  </tr>
  <tr>
    <td class="tg-y698">1 Line </td>
    <td class="tg-y698">5 Lines</td>
  </tr>
</tbody>
</table>

&nbsp;

<a class="anchor" id="second-question"></a>
# ❓Q2: Which age group uses CitiBike most often & what is the trend ❓

### pandas - 3 lines of code

Define a mapping of birth years to age groups, replace birth years with age groups, and calculate how many trips there are per age group.

In [8]:
replace_dict = { 'boomer': 1960, 'genx': 1980, 'geny': 1994,'genz':2023,"missing":2030}
df['birth_year'] = pd.cut(df['birth_year'],bins=[1]+list(replace_dict.values()),labels=list(replace_dict.keys()))

df.groupby('birth_year',as_index=False)['tripduration'].count()

Unnamed: 0,birth_year,tripduration
0,boomer,1029
1,genx,4490
2,geny,3932
3,genz,125
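
To see how `pd.cut` assigns the labels, here is a minimal sketch using the same bin edges and labels on a few hand-picked birth years (the sample values are chosen for illustration). By default, `pd.cut` intervals are closed on the right, so a boundary year such as 1960 falls into the lower group.

```python
import pandas as pd

# Same edges and labels as the cell above.
edges = [1, 1960, 1980, 1994, 2023, 2030]
names = ["boomer", "genx", "geny", "genz", "missing"]

# Hand-picked years, including a bin boundary (1960) and a missing value.
birth_years = pd.Series([1947.0, 1960.0, 1975.0, 1981.0, 1995.0, float("nan")])

# Intervals are (1, 1960], (1960, 1980], (1980, 1994], (1994, 2023], (2023, 2030].
groups = pd.cut(birth_years, bins=edges, labels=names)
print(groups.tolist())
```

A missing birth year stays missing after `pd.cut`, so it is not assigned to the `missing` label. Note also that the SQL version in the comparison uses slightly different year boundaries, so the two sets of counts are not expected to match exactly.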


### <h2 align="center"> 💡 3 lines of pandas vs. 14 lines of code required using SQL approach [below](https://www.kaggle.com/code/cjinquan/citibike-analysis-sql?scriptVersionId=78471031&cellId=4) </h2>

<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">In Python</th>
    <th class="tg-0pky">With SQL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky"><div class="lm-Widget p-Widget jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown"><pre style=""><code class="cm-s-jupyter language-python"><span class="cm-variable">replace_dict</span> <span class="cm-operator">=</span> { <span class="cm-string">'boomer'</span>: <span class="cm-number">1960</span>, <span class="cm-string">'genx'</span>: <span class="cm-number">1980</span>, <span class="cm-string">'geny'</span>: <span class="cm-number">1994</span>,<span class="cm-string">'genz'</span>:<span class="cm-number">2023</span>,<span class="cm-string">"missing"</span>:<span class="cm-number">2030</span>}
<span class="cm-variable">df</span>[<span class="cm-string">'birth_year'</span>] <span class="cm-operator">=</span> <span class="cm-variable">pd</span>.<span class="cm-property">cut</span>(<span class="cm-variable">df</span>[<span class="cm-string">'birth_year'</span>],<span class="cm-variable">bins</span><span class="cm-operator">=</span>[<span class="cm-number">1</span>]<span class="cm-operator">+</span><span class="cm-builtin">list</span>(<span class="cm-variable">replace_dict</span>.<span class="cm-property">values</span>()),<span class="cm-variable">labels</span><span class="cm-operator">=</span><span class="cm-builtin">list</span>(<span class="cm-variable">replace_dict</span>.<span class="cm-property">keys</span>()))

<span class="cm-variable">df</span>.<span class="cm-property">groupby</span>(<span class="cm-string">'birth_year'</span>,<span class="cm-variable">as_index</span><span class="cm-operator">=</span><span class="cm-keyword">False</span>)[<span class="cm-string">'tripduration'</span>].<span class="cm-property">count</span>()
</code></pre>
</div></td>
    <td class="tg-0pky"><div class="lm-Widget p-Widget jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown"><pre><code class="cm-s-jupyter language-sql"><span class="cm-keyword">SELECT</span>  
    EXTRACT<span class="cm-bracket">(</span><span class="cm-builtin">year</span> <span class="cm-keyword">FROM</span> starttime<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> <span class="cm-builtin">year</span><span class="cm-punctuation">,</span>
    <span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>CASE WHEN birth_year<span class="cm-operator">&gt;=</span><span class="cm-number">1940</span> <span class="cm-keyword">AND</span> birth_year<span class="cm-operator">&lt;</span><span class="cm-number">1959</span> THEN <span class="cm-number">1</span> END<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> boomer<span class="cm-punctuation">,</span>
    <span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>CASE WHEN birth_year<span class="cm-operator">&gt;=</span><span class="cm-number">1960</span> <span class="cm-keyword">AND</span> birth_year<span class="cm-operator">&lt;</span><span class="cm-number">1979</span> THEN <span class="cm-number">1</span> END<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> genx<span class="cm-punctuation">,</span>
    <span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>CASE WHEN birth_year<span class="cm-operator">&gt;=</span><span class="cm-number">1980</span> <span class="cm-keyword">AND</span> birth_year<span class="cm-operator">&lt;</span><span class="cm-number">1994</span> THEN <span class="cm-number">1</span> END<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> geny<span class="cm-punctuation">,</span>
    <span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>CASE WHEN birth_year<span class="cm-operator">&gt;=</span><span class="cm-number">1995</span> <span class="cm-keyword">AND</span> birth_year<span class="cm-operator">&lt;</span><span class="cm-number">2012</span> THEN <span class="cm-number">1</span> END<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> genz
<span class="cm-keyword">FROM</span> 
    `citibike<span class="cm-operator">-</span>project<span class="cm-operator">-</span><span class="cm-number">330415</span><span class="cm-variable-2">.CITIBIKE</span><span class="cm-variable-2">.citibike_trips</span>` 
<span class="cm-keyword">GROUP</span> <span class="cm-keyword">BY</span> 
    <span class="cm-builtin">year</span>
<span class="cm-keyword">HAVING</span> 
    <span class="cm-builtin">year</span> <span class="cm-keyword">IS</span> <span class="cm-keyword">NOT</span> <span class="cm-atom">NULL</span> <span class="cm-keyword">AND</span> <span class="cm-builtin">year</span> <span class="cm-operator">!=</span> <span class="cm-number">2013</span> <span class="cm-keyword">AND</span> <span class="cm-builtin">year</span> <span class="cm-operator">!=</span> <span class="cm-number">2018</span>
<span class="cm-keyword">ORDER</span> <span class="cm-keyword">BY</span>
    <span class="cm-builtin">year</span> <span class="cm-keyword">ASC</span>
</code></pre>
</div></td>
  </tr>
  <tr>
    <td class="tg-y698">3 Lines</td>
    <td class="tg-y698">14 Lines</td>
  </tr>
</tbody>
</table>

<a class="anchor" id="third-question"></a>
# ❓Q3: Which gender uses Citi Bike most often and what's the trend ❓

### pandas  - 2 lines of code

Group by year and gender (excluding 2013), then count how many trips there are per gender per year.

In [9]:
df['start_year'] = df.starttime.dt.year
df[df.start_year != 2013].groupby(['start_year','gender']).size() 

start_year  gender 
2014        female      571
            male       2053
            unknown     268
2015        female      601
            male       2036
            unknown     430
2016        female      569
            male       1912
            unknown     359
dtype: int64
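
The SQL version in the comparison also reports each gender's percentage share per year. A possible pandas equivalent, sketched here on made-up sample rows, is `value_counts(normalize=True)` within each year:

```python
import pandas as pd

# Made-up sample rows for illustration only.
sample = pd.DataFrame(
    {
        "start_year": [2014, 2014, 2014, 2014, 2015, 2015],
        "gender": ["male", "male", "female", "unknown", "female", "male"],
    }
)

# Share of trips per gender within each year, as a percentage rounded to 2 decimals.
share = (
    sample.groupby("start_year")["gender"]
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
)
print(share)
```

As in the SQL, the denominator here includes the `unknown` rows, so the female and male shares need not sum to 100.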

## <h2 align="center"> 💡 2 lines of pandas vs 16 lines required with SQL approach [below](https://www.kaggle.com/code/cjinquan/citibike-analysis-sql?scriptVersionId=78471031&cellId=11) </h2>

<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">In Python</th>
    <th class="tg-0pky">With SQL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky"><div class="lm-Widget p-Widget jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown"><pre><code class="cm-s-jupyter language-python"><span class="cm-variable">df</span>[<span class="cm-string">'start_year'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>.<span class="cm-property">starttime</span>.<span class="cm-property">dt</span>.<span class="cm-property">year</span>
<span class="cm-variable">df</span>[<span class="cm-variable">df</span>.<span class="cm-property">start_year</span> <span class="cm-operator">!=</span> <span class="cm-number">2013</span>].<span class="cm-property">groupby</span>([<span class="cm-string">'start_year'</span>,<span class="cm-string">'gender'</span>]).<span class="cm-property">size</span>() 
</code></pre>
</div></td>
    <td class="tg-0pky"><div class="lm-Widget p-Widget jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown"><pre><code class="cm-s-jupyter language-sql"><span class="cm-keyword">SELECT</span>  
    EXTRACT<span class="cm-bracket">(</span> <span class="cm-builtin">year</span> <span class="cm-keyword">FROM</span> starttime<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> <span class="cm-builtin">year</span><span class="cm-punctuation">,</span>
    <span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>CASE WHEN gender<span class="cm-operator">=</span> <span class="cm-string">"female"</span> THEN <span class="cm-number">1</span> END<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> female<span class="cm-punctuation">,</span>
    ROUND<span class="cm-bracket">(</span><span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>CASE WHEN gender<span class="cm-operator">=</span> <span class="cm-string">"female"</span> THEN <span class="cm-number">1</span> END<span class="cm-bracket">)</span><span class="cm-operator">/</span><span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>gender<span class="cm-bracket">)</span><span class="cm-operator">*</span><span class="cm-number">100</span><span class="cm-punctuation">,</span> <span class="cm-number">2</span><span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> female_percentage<span class="cm-punctuation">,</span>
    <span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>CASE WHEN gender <span class="cm-operator">=</span> <span class="cm-string">"male"</span> THEN <span class="cm-number">1</span> END<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> male<span class="cm-punctuation">,</span>
    ROUND<span class="cm-bracket">(</span><span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>CASE WHEN gender <span class="cm-operator">=</span> <span class="cm-string">"male"</span> THEN <span class="cm-number">1</span> END<span class="cm-bracket">)</span><span class="cm-operator">/</span><span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>gender<span class="cm-bracket">)</span><span class="cm-operator">*</span><span class="cm-number">100</span><span class="cm-punctuation">,</span> <span class="cm-number">2</span><span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> male_percentage
<span class="cm-keyword">FROM</span> 
    `citibike<span class="cm-operator">-</span>project<span class="cm-operator">-</span><span class="cm-number">330415</span><span class="cm-variable-2">.CITIBIKE</span><span class="cm-variable-2">.citibike_trips</span>` 
<span class="cm-keyword">GROUP</span> <span class="cm-keyword">BY</span> 
    <span class="cm-builtin">year</span>
<span class="cm-keyword">HAVING</span> 
    <span class="cm-builtin">year</span> <span class="cm-keyword">IS</span> <span class="cm-keyword">NOT</span> <span class="cm-atom">NULL</span> <span class="cm-keyword">AND</span> <span class="cm-builtin">year</span> <span class="cm-operator">!=</span> <span class="cm-number">2013</span> <span class="cm-keyword">AND</span> <span class="cm-builtin">year</span> <span class="cm-operator">!=</span> <span class="cm-number">2018</span>
<span class="cm-keyword">ORDER</span> <span class="cm-keyword">BY</span>
    <span class="cm-builtin">year</span> <span class="cm-keyword">ASC</span>
</code></pre>
</div></td>
  </tr>
    
  <tr>
    <td class="tg-y698">2 Lines</td>
    <td class="tg-y698">16 Lines</td>
  </tr>
</tbody>
</table>

&nbsp;

<a class="anchor" id="fourth-question"></a>
#  ❓Q4: Which day of the week is the most popular ❓

### pandas - 1 line

Group records by day of week (0 = Monday, 6 = Sunday) and count how many trips there are.

In [10]:
df.groupby(df.starttime.dt.day_of_week).size()

starttime
0    1632
1    1653
2    1643
3    1723
4    1643
5    1314
6    1281
dtype: int64
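
The integer index can be hard to read at a glance. The sketch below, run on five sample timestamps with vanilla pandas, shows the same grouping keyed by weekday name via `dt.day_name()`. Whether this accessor is supported on every Ponder backend is an assumption to verify.

```python
import pandas as pd

# Five start times copied from the sample rows shown earlier.
starts = pd.to_datetime(
    pd.Series(["2/9/15 8:37", "11/20/13 20:21", "1/6/16 17:01", "11/9/15 12:50", "8/3/13 17:14"]),
    format="%m/%d/%y %H:%M",
)

# Integer weekday: pandas counts from Monday = 0 to Sunday = 6.
by_number = starts.groupby(starts.dt.day_of_week).size()

# Same grouping, keyed by the weekday name instead.
by_name = starts.groupby(starts.dt.day_name()).size()

print(by_number)
print(by_name)
```

Keep the numbering difference in mind when comparing against the SQL: the query appears to be BigQuery SQL, where `EXTRACT(DAYOFWEEK ...)` runs from 1 (Sunday) to 7 (Saturday).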

## <h2 align="center"> 💡 1 line of pandas vs 11 lines of code required with SQL approach [below](https://www.kaggle.com/code/cjinquan/citibike-analysis-sql?scriptVersionId=78471031&cellId=16) </h2>

<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">In Python</th>
    <th class="tg-0pky">With SQL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky"><div class="lm-Widget p-Widget jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown"><pre style=""><code class="cm-s-jupyter language-python"><span class="cm-variable">df</span>.<span class="cm-property">groupby</span>(<span class="cm-variable">df</span>.<span class="cm-property">starttime</span>.<span class="cm-property">dt</span>.<span class="cm-property">day_of_week</span>).<span class="cm-property">size</span>()
</code></pre>
</div></td>
    <td class="tg-0pky"><div class="lm-Widget p-Widget jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown"><pre style=""><code class="cm-s-jupyter language-sql"><span class="cm-keyword">SELECT</span>  
    EXTRACT <span class="cm-bracket">(</span> dayofweek <span class="cm-keyword">FROM</span> starttime<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> day_of_week<span class="cm-punctuation">,</span>
    <span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span><span class="cm-operator">*</span><span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> number_of_trip
<span class="cm-keyword">FROM</span> 
    `citibike<span class="cm-operator">-</span>project<span class="cm-operator">-</span><span class="cm-number">330415</span><span class="cm-variable-2">.CITIBIKE</span><span class="cm-variable-2">.citibike_trips</span>` 
<span class="cm-keyword">GROUP</span> <span class="cm-keyword">BY</span> 
    day_of_week
<span class="cm-keyword">HAVING</span>
    day_of_week <span class="cm-keyword">IS</span> <span class="cm-keyword">NOT</span> <span class="cm-atom">NULL</span> 
<span class="cm-keyword">ORDER</span> <span class="cm-keyword">BY</span> 
    day_of_week <span class="cm-keyword">ASC</span>
</code></pre>
</div></td>
  </tr>
  <tr>
    <td class="tg-y698">1 Line</td>
    <td class="tg-y698">11 Lines</td>
  </tr>
</tbody>
</table>

&nbsp;

<a class="anchor" id="fifth-question"></a>
# ❓Q5: What is the average trip duration per day of the week ❓

### pandas - 1 line

Group records by day of week, then calculate the average trip duration for each day.

In [11]:
df.groupby(df.starttime.dt.day_of_week)['tripduration'].mean()

starttime
0     978.046569
1     792.236540
2     825.590383
3    1129.637261
4     878.025563
5    1040.140791
6    1035.018735
Name: tripduration, dtype: float64
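
The averages above are easier to interpret in minutes. The sketch below repeats the grouping on five sample rows and divides by 60, assuming `tripduration` is recorded in seconds (the notebook does not state the unit, so treat that as an assumption).

```python
import pandas as pd

# Five sample rows: start times and durations copied from the rows shown earlier.
sample = pd.DataFrame(
    {
        "starttime": pd.to_datetime(
            ["2/9/15 8:37", "11/20/13 20:21", "1/6/16 17:01", "11/9/15 12:50", "8/3/13 17:14"],
            format="%m/%d/%y %H:%M",
        ),
        "tripduration": [730, 704, 425, 373, 1149],
    }
)

# Mean duration per weekday (0 = Monday), in the original unit.
mean_seconds = sample.groupby(sample.starttime.dt.day_of_week)["tripduration"].mean()

# Assumed conversion: seconds to minutes.
mean_minutes = mean_seconds / 60
print(mean_minutes)
```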

## <h2 align="center">  💡1 line of pandas vs. 11 lines of code required with SQL approach [below](https://www.kaggle.com/code/cjinquan/citibike-analysis-sql?scriptVersionId=78471031&cellId=21) </h2>

<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">In Python</th>
    <th class="tg-0pky">With SQL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky"><div class="lm-Widget p-Widget jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown"><pre><code class="cm-s-jupyter language-python"><span class="cm-variable">df</span>.<span class="cm-property">groupby</span>(<span class="cm-variable">df</span>.<span class="cm-property">starttime</span>.<span class="cm-property">dt</span>.<span class="cm-property">day_of_week</span>)[<span class="cm-string">'tripduration'</span>].<span class="cm-property">mean</span>()
</code></pre>
</div></td>
    <td class="tg-0pky"><div class="lm-Widget p-Widget jp-Cell jp-MarkdownCell jp-Notebook-cell jp-mod-rendered"><div class="lm-Widget p-Widget jp-CellHeader jp-Cell-header"></div><div class="lm-Widget p-Widget lm-Panel p-Panel jp-Cell-inputWrapper"><div class="lm-Widget p-Widget jp-Collapser jp-InputCollapser jp-Cell-inputCollapser"><div class="jp-Collapser-child"></div></div><div class="lm-Widget p-Widget jp-InputArea jp-Cell-inputArea"><div class="lm-Widget p-Widget jp-InputPrompt jp-InputArea-prompt"></div><div class="lm-Widget p-Widget jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown"><pre style=""><code class="cm-s-jupyter language-sql"><span class="cm-keyword">SELECT</span>  
    EXTRACT<span class="cm-bracket">(</span>dayofweek <span class="cm-keyword">FROM</span> starttime<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> day<span class="cm-punctuation">,</span>
    ROUND<span class="cm-bracket">(</span>AVG<span class="cm-bracket">(</span>tripduration<span class="cm-bracket">)</span><span class="cm-punctuation">,</span> <span class="cm-number">2</span><span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> average_trip_duration_minutes
<span class="cm-keyword">FROM</span> 
    `citibike<span class="cm-operator">-</span>project<span class="cm-operator">-</span><span class="cm-number">330415</span><span class="cm-variable-2">.CITIBIKE</span><span class="cm-variable-2">.citibike_trips</span>` 
<span class="cm-keyword">GROUP</span> <span class="cm-keyword">BY</span> 
    day
<span class="cm-keyword">HAVING</span> 
    day <span class="cm-keyword">IS</span> <span class="cm-keyword">NOT</span> <span class="cm-atom">NULL</span>
<span class="cm-keyword">ORDER</span> <span class="cm-keyword">BY</span> 
    day <span class="cm-keyword">ASC</span>
</code></pre>
</div></div></div><div class="lm-Widget p-Widget jp-CellFooter jp-Cell-footer"></div></div></td>
  </tr>
  <tr>
    <td class="tg-y698">1 Line</td>
    <td class="tg-y698">11 Lines</td>
  </tr>
</tbody>
</table>

&nbsp;

<a class="anchor" id="sixth-question"></a>
# ❓Q6: What is the total number of trips per month for a year (2015) ❓

### pandas - 1 line of code

Filter to trips that started in 2015, then group by month and count the trips in each month

In [12]:
df[df['start_year']==2015].groupby(df.starttime.dt.month).size()

starttime
1      86
2      62
3     119
4     194
5     286
6     312
7     376
8     388
9     436
10    359
11    250
12    199
dtype: int64
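The one-liner above relies on a precomputed `start_year` column and on pandas aligning the group key with the filtered rows. As a hedged sketch of the same idea, the snippet below derives the year directly from `starttime` and groups the filtered frame by its own month values. The `toy` dataframe is made up for illustration, and plain pandas stands in for a configured Ponder session.

```python
import pandas as pd

# Hypothetical toy data: three trips in 2015 and one in 2016.
toy = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2015-01-15", "2015-01-20", "2015-03-02", "2016-01-05",
    ]),
})

# Filter first, then group the filtered frame by its own month values.
trips_2015 = toy[toy.starttime.dt.year == 2015]
trips_per_month = trips_2015.groupby(trips_2015.starttime.dt.month).size()
print(trips_per_month)
```

Splitting the filter and the groupby into two statements costs one extra line but makes it explicit which rows each step operates on.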

## <h2 align="center">💡1 line of pandas vs. 15 lines of code required with SQL approach [below](https://www.kaggle.com/code/cjinquan/citibike-analysis-sql?scriptVersionId=78471031&cellId=26)</h2>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-y698{background-color:#efefef;border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">In Python</th>
    <th class="tg-0pky">With SQL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky"><pre style=""><code class="cm-s-jupyter language-python"><span class="cm-variable">df</span>[<span class="cm-variable">df</span>[<span class="cm-string">'start_year'</span>]<span class="cm-operator">==</span><span class="cm-number">2015</span>].<span class="cm-property">groupby</span>(<span class="cm-variable">df</span>.<span class="cm-property">starttime</span>.<span class="cm-property">dt</span>.<span class="cm-property">month</span>).<span class="cm-property">size</span>()
</code></pre></td>
    <td class="tg-0pky"><div class="lm-Widget p-Widget jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput" data-mime-type="text/markdown"><pre style=""><code class="cm-s-jupyter language-sql"><span class="cm-keyword">SELECT</span>  
    EXTRACT<span class="cm-bracket">(</span><span class="cm-builtin">year</span> <span class="cm-keyword">FROM</span> starttime<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> <span class="cm-builtin">year</span><span class="cm-punctuation">,</span>
    EXTRACT<span class="cm-bracket">(</span>month <span class="cm-keyword">FROM</span> starttime<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> month<span class="cm-punctuation">,</span>
    <span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span><span class="cm-operator">*</span><span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> number_of_trips
<span class="cm-keyword">FROM</span> 
    `citibike<span class="cm-operator">-</span>project<span class="cm-operator">-</span><span class="cm-number">330415</span><span class="cm-variable-2">.CITIBIKE</span><span class="cm-variable-2">.citibike_trips</span>` 
<span class="cm-keyword">GROUP</span> <span class="cm-keyword">BY</span> 
    <span class="cm-builtin">year</span><span class="cm-punctuation">,</span> month
<span class="cm-keyword">HAVING</span> 
    month <span class="cm-keyword">IS</span> <span class="cm-keyword">NOT</span> <span class="cm-atom">NULL</span> <span class="cm-keyword">AND</span> 
    <span class="cm-builtin">year</span> <span class="cm-keyword">IS</span> <span class="cm-keyword">NOT</span> <span class="cm-atom">NULL</span> <span class="cm-keyword">AND</span> 
    <span class="cm-builtin">year</span> <span class="cm-operator">=</span> <span class="cm-number">2015</span> 
<span class="cm-keyword">ORDER</span> <span class="cm-keyword">BY</span> 
    <span class="cm-builtin">year</span> <span class="cm-keyword">ASC</span><span class="cm-punctuation">,</span>
    month <span class="cm-keyword">ASC</span>
</code></pre>
</div></td>
  </tr>
  <tr>
    <td class="tg-y698">1 Line</td>
    <td class="tg-y698">15 Lines</td>
  </tr>
</tbody>
</table>

&nbsp;

## So in total...

 <h1 align="center">🔥9 lines of pandas vs. ~70 lines of SQL🔥</h1>

&nbsp;

# But what if I want to do more than the simple summary stats above?

As part of my analysis I may want to go beyond the 6 questions we've already answered. I may want to do more complex data cleaning and transformation steps that aren't included in the Kaggle analysis. If my operations aren't relational in nature, what do I do? 

See two examples below:

### Drop rows with missing values, sort by trip duration, get the top 10 longest trips

A key characteristic of dataframes is that they are ordered, the order is preserved across operations, and we can use indexing to select subsets of data. These characteristics are critical for interactive data science.

With pandas, the prompt above can be answered with one highly expressive line of code. With SQL, by contrast, preserving order across operations is nearly impossible, and dealing with null values is very involved.

In [13]:
df.dropna(axis=0).sort_values(by='tripduration',ascending=False).head(10)

Unnamed: 0,tripduration,starttime,stoptime,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bikeid,usertype,birth_year,gender,start_year
1266,444671,2015-06-04 09:18:00,6/9/15 12:49,72,W 52 St & 11 Ave,40.767272,-73.993929,484,W 44 St & 5 Ave,40.755003,-73.980144,17272,Subscriber,geny,female,2015
7211,209125,2015-10-19 08:44:00,10/21/15 18:49,83,Atlantic Ave & Fort Greene Pl,40.683826,-73.976323,2001,Sands St & Navy St,40.699773,-73.979927,23115,Subscriber,genx,unknown,2015
1798,37328,2015-12-10 23:07:00,12/11/15 9:29,72,W 52 St & 11 Ave,40.767272,-73.993929,380,W 4 St & 7 Ave S,40.734011,-74.002939,18186,Subscriber,geny,male,2015
4632,33396,2015-10-16 08:08:00,10/16/15 17:25,79,Franklin St & W Broadway,40.719116,-74.006667,116,W 17 St & 8 Ave,40.741776,-74.001497,18657,Subscriber,geny,male,2015
6719,31215,2015-09-25 01:17:00,9/25/15 9:57,83,Atlantic Ave & Fort Greene Pl,40.683826,-73.976323,314,Cadman Plaza West & Montague St,40.69383,-73.990539,23247,Subscriber,genx,male,2015
6645,23450,2016-07-22 20:46:00,7/23/16 3:17,83,Atlantic Ave & Fort Greene Pl,40.683826,-73.976323,245,Myrtle Ave & St Edwards St,40.69327,-73.977039,25899,Subscriber,geny,male,2016
6627,19151,2015-09-05 20:16:00,9/6/15 1:35,83,Atlantic Ave & Fort Greene Pl,40.683826,-73.976323,3103,N 11 St & Wythe Ave,40.721533,-73.957824,16570,Subscriber,geny,female,2015
4335,17890,2013-08-01 18:46:00,8/1/13 23:44,79,Franklin St & W Broadway,40.719116,-74.006667,496,E 16 St & 5 Ave,40.737262,-73.99239,20152,Subscriber,genx,female,2013
5321,17504,2016-04-13 12:50:00,4/13/16 17:42,79,Franklin St & W Broadway,40.719116,-74.006667,3236,W 42 St & Dyer Ave,40.758985,-73.9938,16900,Subscriber,genx,female,2016
10344,15095,2016-03-13 13:29:00,3/13/16 17:40,116,W 17 St & 8 Ave,40.741776,-74.001497,212,W 16 St & The High Line,40.743349,-74.006818,20310,Subscriber,geny,male,2016
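Note that the leftmost values in the output (1266, 7211, ...) are the original row labels, which survive both the `dropna` and the sort. The sketch below illustrates this on a small, made-up dataframe (`toy` is hypothetical) and also shows `nlargest`, a pandas shortcut that should give the same rows as sorting and taking the head. Plain pandas is used here as a stand-in for a configured Ponder session.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two different columns.
toy = pd.DataFrame({
    "tripduration": [300.0, 1200.0, np.nan, 900.0, 60.0],
    "usertype": ["Subscriber", "Customer", "Subscriber", None, "Subscriber"],
})

# Drop any row with a missing value; rows labeled 2 and 3 are removed.
clean = toy.dropna(axis=0)

# Same pattern as above: sort descending, then take the top rows.
top_sorted = clean.sort_values(by="tripduration", ascending=False).head(2)

# Alternative: nlargest selects the top rows by a column directly.
top_nlargest = clean.nlargest(2, "tripduration")

print(top_sorted)
```

Both results keep the original row labels, so each trip can still be traced back to its position in the source data.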


### Create dummies for feature engineering/modeling

The `gender` column in our dataset takes three values: `male`, `female`, and `unknown`.

A common feature engineering task is to one-hot encode a category in order to do some predictive modeling.

In a relational database, users must define a schema upfront before they load data into a table or do their analysis. This makes one-hot encoding particularly challenging in that context, because the schema needs to be updated based on the unknown categories contained in a column. 

In order to work around these challenges with SQL, a user will have to write a brittle query that will require hard-coding category values and will be difficult to maintain.

With Ponder, you can just use `get_dummies` for one-hot encoding inside your database.

In [14]:
pd.get_dummies(df, columns=["gender"])

Unnamed: 0,tripduration,starttime,stoptime,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bikeid,usertype,birth_year,start_year,gender_female,gender_male,gender_unknown
0,730.0,2015-02-09 08:37:00,2/9/15 8:49,72.0,W 52 St & 11 Ave,40.767272,-73.993929,520.0,W 52 St & 5 Ave,40.759923,-73.976485,18809.0,Subscriber,genx,2015.0,0,1,0
1,704.0,2013-11-20 20:21:00,11/20/13 20:33,72.0,W 52 St & 11 Ave,40.767272,-73.993929,470.0,W 20 St & 8 Ave,40.743453,-74.000040,20515.0,Subscriber,geny,2013.0,0,1,0
2,425.0,2016-01-06 17:01:00,1/6/16 17:08,72.0,W 52 St & 11 Ave,40.767272,-73.993929,469.0,Broadway & W 53 St,40.763441,-73.982681,17116.0,Subscriber,boomer,2016.0,0,1,0
3,373.0,2015-11-09 12:50:00,11/9/15 12:56,72.0,W 52 St & 11 Ave,40.767272,-73.993929,469.0,Broadway & W 53 St,40.763441,-73.982681,20892.0,Subscriber,boomer,2015.0,0,1,0
4,1149.0,2013-08-03 17:14:00,8/3/13 17:33,72.0,W 52 St & 11 Ave,40.767272,-73.993929,325.0,E 19 St & 3 Ave,40.736245,-73.984738,17711.0,Subscriber,geny,2013.0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1001027,,NaT,,,,,,,,,,,,,,0,0,0
1001028,,NaT,,,,,,,,,,,,,,0,0,0
1001029,,NaT,,,,,,,,,,,,,,0,0,0
1001030,,NaT,,,,,,,,,,,,,,0,0,0
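To make the brittleness argument concrete, here is a minimal sketch that contrasts a hard-coded encoding (similar in spirit to writing one SQL `CASE WHEN` per category) with `get_dummies`. The `toy` dataframe and the `known_categories` list are made up for illustration, and plain pandas stands in for a configured Ponder session.

```python
import pandas as pd

# Hypothetical toy data with three gender categories.
toy = pd.DataFrame({
    "tripduration": [730, 704, 1149, 373],
    "gender": ["male", "male", "female", "unknown"],
})

# Hard-coded approach: every category must be known and listed in advance.
# "unknown" is deliberately left out to show how a category gets missed.
known_categories = ["female", "male"]
manual = toy.copy()
for cat in known_categories:
    manual[f"gender_{cat}"] = (manual["gender"] == cat).astype(int)
manual = manual.drop(columns="gender")

# get_dummies discovers the categories from the data itself.
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)

print(manual.columns.tolist())
print(encoded.columns.tolist())
```

The hard-coded version silently omits the `gender_unknown` column, while `get_dummies` picks it up without any change to the code.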


For a more detailed comparison of one-hot encoding in pandas vs. SQL, check out our blog post [here](https://ponder.io/pandas-vs-sql-part-2-pandas-is-more-concise/#2.-One-hot-encoding).

&nbsp;

##  💡 Ponder gives you a high fidelity pandas experience for your databases 💡

* We answered 6 questions from a Kaggle notebook to demonstrate that queries are often more concise and easier to express in pandas than in SQL.
* We showed how Ponder can help you explore your data using almost 10X less code.
* We demonstrated how certain dataframe operations, such as `get_dummies`, are extremely challenging or even impossible to do in SQL.

### <h1><center>To try Ponder for free, sign up for an account at [app.ponder.io/signup](https://app.ponder.io/signup)!</center></h1>