# Ponder Citibike Mobility Analysis

<div class="alert alert-block alert-info"> <b>Before we get started: </b> 
    <ul style="list-style-type: none;margin: 0;padding: 0;">
        <li>✍️ To run this notebook, you need to have Ponder installed and set up on your machine. If you have not done so already, please refer to our <a href="https://docs.ponder.io/getting_started/quickstart.html">Quickstart guide</a> to get started.</li>
        <li>📖 Otherwise, if you're just interested in browsing through the notebook, keep reading below!</li>
    </ul>
</div>

Pandas is easy to use, flexible, and concise. Relational databases are scalable and reliable, but force you to use SQL. This demo shows you how Ponder allows you to combine the best of both worlds. With Ponder you can write code that is more concise and maintainable than SQL, and you can perform operations that aren't even relational in nature!

This notebook shows you how you can use Ponder to unleash your pandas code directly in your database! Ponder gives you a native pandas interface to your data warehouse. If you haven't already, sign up for a free trial account at [***app.ponder.io***](www.app.ponder.io) to try it out!

# Tutorial Overview

In this notebook, we will walk through a simple analysis of Citi Bike bikeshare data to show you how easy it is to use Ponder. The dataset contains individual bikeshare trips in New York City, and we'll answer 6 mobility questions inspired by Kevin Chan's [Kaggle notebook](https://www.kaggle.com/code/cjinquan/citibike-analysis-sql/notebook). We use a sample of the public data available [here](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=new_york_citibike&page=dataset&project=geocoding-314221&ws=!1m4!1m3!3m2!1sbigquery-public-data!2snew_york_citibike) for this analysis.

**Answering these questions with SQL takes about 70 lines of code. With pandas, we can answer the same questions in 9 lines.**


Specifically we'll show you:

#### [How to Set Up Ponder](#setup)


#### [How Ponder Works](#how-it-works)


#### [The Citi Bike Use Case](#key-questions)
    


 <h1 align="center">🔥9 lines of pandas vs. ~70 lines with the SQL🔥</h1>

&nbsp;

<a class="anchor" id="setup"></a>
# 🛠️ How to Set Up Ponder 🛠️

### Import Requirements 

In [1]:
import ponder; ponder.init()
import modin.pandas as pd
import snowflake.connector



### Configure Database Connection

This example uses your Snowflake database as the backend engine. If you have Snowflake, you can find the documentation for the Snowflake Python Connector [here](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector).

If you don't have Snowflake, visit our documentation to [Connect Ponder to your DB](https://docs.ponder.io/getting_started/connection.html) for a list of other supported databases, or reach out to support@ponder.io.

In [None]:
import os; os.chdir("..")

import credential
sfcon = snowflake.connector.connect(
    user=credential.params["user"],
    password=credential.params["password"],
    account=credential.params["account"],
    role=credential.params["role"],
    database=credential.params["database"],
    schema=credential.params["schema"],
    warehouse=credential.params["warehouse"]
)
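
The cell above imports a local `credential` module that is not shown in this notebook. Below is a minimal sketch of what such a module might look like, assuming it only needs to expose a `params` dict. Reading the values from environment variables, and the variable names themselves (`SNOWFLAKE_USER` and so on), are assumptions made for illustration; they are not required by Ponder or Snowflake.

```python
# credential.py (hypothetical): keeps secrets out of the notebook itself.
import os

# Keys match the keyword arguments passed to snowflake.connector.connect above.
_KEYS = ["user", "password", "account", "role", "database", "schema", "warehouse"]

# Environment variable names such as SNOWFLAKE_USER are an assumption of this sketch.
params = {key: os.environ.get(f"SNOWFLAKE_{key.upper()}", "") for key in _KEYS}

# Report which settings are still empty so a failed connection is easier to debug.
missing = [key for key, value in params.items() if not value]
if missing:
    print("Unset Snowflake settings:", ", ".join(missing))
```

Any other mechanism that produces the same dict (a config file, a secrets manager) would work equally well with the connection cell.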

### Select Data Source

Ponder allows you to work with data in flat files as well as in your existing database tables. Below, we use a CSV file stored on the local machine.

**So how is Ponder different than vanilla pandas?**

* With vanilla pandas, the `read_csv()` method pulls the data from disk into a dataframe in your local memory.

* With Ponder, the `read_csv()` method automatically creates a database table, configures the schema for your csv file, and loads the data into the warehouse for analysis.

Here, we configure Snowflake as the default database connection to use when reading CSVs.

In [6]:
ponder.configure(default_connection=sfcon)

In [None]:
df = pd.read_csv('example/citibike_tutorial.csv',on_bad_lines='skip')

<a class="anchor" id="how-it-works"></a>
# 🧪 How it Works 🧪

As you go through the rest of this notebook, it may seem like any other pandas tutorial notebook, and that is the magic of Ponder! 

Ponder gives you a pandas interface for your data warehouse. As you run each code block, Ponder automatically compiles your pandas code into SQL and runs it directly in your database. None of the computation is done in your local Python environment!

Don't believe us? Try operating on a large dataset that doesn't fit into your local RAM and run a memory profiler to see how Ponder can help you scale up your workflows! For more background on what we've been building, check out this [Ponder Overview Blogpost](https://ponder.io/run-pandas-on-1tb-directly-in-your-data-warehouse/).
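
To make the idea of "pandas compiled to SQL" concrete, here is a small sketch that computes the same aggregation twice: once as a dataframe expression and once as a hand-written SQL query. It uses vanilla pandas and an in-memory SQLite database with made-up rows; it only illustrates that the two forms are equivalent and is not a description of how Ponder generates SQL internally.

```python
import sqlite3

import pandas as pd

# A tiny made-up sample with two of the columns from the trips data.
trips = pd.DataFrame({
    "gender": ["male", "female", "male", "unknown", "male"],
    "tripduration": [616, 582, 669, 352, 2092],
})

# The dataframe expression a user writes.
pandas_counts = trips.groupby("gender").size().to_dict()

# A hand-written SQL query with the same meaning, run inside the database.
con = sqlite3.connect(":memory:")
trips.to_sql("trips", con, index=False)
rows = con.execute("SELECT gender, COUNT(*) FROM trips GROUP BY gender").fetchall()
sql_counts = dict(rows)
con.close()

print(pandas_counts)
print(sql_counts)
```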

When we print the dataframe, we see that it contains Citi Bike trips and associated details, including pickup/drop-off stations and times.

In [9]:
df.head()

Unnamed: 0,tripduration,starttime,stoptime,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bikeid,usertype,birth_year,gender
0,616,9/20/16 18:12,9/20/16 18:22,369,Washington Pl & 6 Ave,40.732241,-74.000264,473,Rivington St & Chrystie St,40.721101,-73.991925,22404,Subscriber,1988.0,male
1,582,8/20/16 22:29,8/20/16 22:39,369,Washington Pl & 6 Ave,40.732241,-74.000264,498,Broadway & W 32 St,40.748549,-73.988084,17498,Subscriber,1996.0,male
2,669,3/15/14 22:31,3/15/14 22:42,369,Washington Pl & 6 Ave,40.732241,-74.000264,511,E 14 St & Avenue B,40.729387,-73.977724,17111,Subscriber,1992.0,female
3,352,10/15/15 8:56,10/15/15 9:02,298,3 Ave & Schermerhorn St,40.686832,-73.979677,392,Jay St & Tech Pl,40.695065,-73.987167,19014,Subscriber,1976.0,male
4,2092,9/28/16 13:12,9/28/16 13:47,298,3 Ave & Schermerhorn St,40.686832,-73.979677,330,Reade St & Broadway,40.714505,-74.005628,23004,Subscriber,1984.0,male


&nbsp;

<a class="anchor" id="key-questions"></a>
# 6 Mobility Trend Questions Answered

The following 6 exploratory questions were answered using SQL in the bikeshare analysis on Kaggle, so we'll walk you through how you can do the same with Ponder + pandas.



* [What time frame does the data set contain?](#first-question)
* [Which age group uses Citi Bike most often and what is the trend?](#second-question)
* [Which gender uses Citi Bike most often and what is the trend?](#third-question)
* [Which day of the week is Citi Bike most utilized?](#fourth-question)
* [What is the average trip duration per day of the week?](#fifth-question)
* [What is the total number of trips per month?](#sixth-question)

We have quite a few time-series-related mobility questions, so below we convert the trip start time column `starttime` to a datetime type to simplify our analysis.

In [11]:
df['starttime'] = pd.to_datetime(df.starttime,format="%m/%d/%y %H:%M")
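
As a quick check of the format string used above, the sketch below parses two `starttime` values copied from the preview table using vanilla pandas. The directives `%m/%d/%y %H:%M` accept months and days without zero padding and a two-digit year.

```python
import pandas as pd

# Two raw values taken from the df.head() output above.
raw = pd.Series(["9/20/16 18:12", "3/15/14 22:31"])

# Same format string as the notebook cell.
parsed = pd.to_datetime(raw, format="%m/%d/%y %H:%M")

print(parsed.dt.year.tolist())
print(parsed.dt.hour.tolist())
```

Passing an explicit `format` also avoids ambiguity between month-first and day-first dates.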

<a class="anchor" id="first-question"></a>
# ❓Q1: What time frame does the data set contain ❓

### pandas  - 1 line of code

Calculate the min and max values of trip starts.

In [12]:
df.starttime.min(),df.starttime.max()

(Timestamp('2013-07-01 00:00:00'), Timestamp('2016-09-30 23:57:00'))

## <h2 align="center"> 💡 5 lines of code required using SQL Approach [below](https://www.kaggle.com/code/cjinquan/citibike-analysis-sql?scriptVersionId=78471031&cellId=4) </h2>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-y698{background-color:#efefef;border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">In Python</th>
    <th class="tg-0pky">With SQL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky">df.starttime.min(),df.starttime.max()</td>
    <td class="tg-0pky">SELECT  <br><br>        MAX(DATE(starttime)) AS max_date,<br><br>        MIN(DATE(starttime)) AS min_date<br><br>    FROM <br><br>       <span style="font-weight:400;font-style:normal">'citibike-project-330415.CITIBIKE.citibike_trips'</span></td>
  </tr>
  <tr>
    <td class="tg-y698">1 Line </td>
    <td class="tg-y698">5 Lines</td>
  </tr>
</tbody>
</table>

&nbsp;

<a class="anchor" id="second-question"></a>
# ❓Q2: Which age group uses CitiBike most often & what is the trend ❓

### pandas - 3 lines of code

Define a mapping from birth-year upper bounds to age groups, replace birth years with age groups, then count how many trips fall into each age group.

In [13]:
replace_dict = { 'boomer': 1960, 'genx': 1980, 'geny': 1994,'genz':2023,"missing":2030}
df['birth_year'] = pd.cut(df['birth_year'],bins=[1]+list(replace_dict.values()),labels=list(replace_dict.keys()))

df.groupby('birth_year',as_index=False)['tripduration'].count()

Unnamed: 0,birth_year,tripduration
0,boomer,94665
1,genx,385055
2,geny,388611
3,genz,10753
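
To see how the bin edges in `replace_dict` translate into age groups, here is a small sketch using vanilla pandas on a few made-up birth years. `pd.cut` builds right-inclusive intervals by default, so 1960 still counts as `boomer` while 1961 is `genx`. A missing birth year stays missing rather than receiving the `missing` label, which only covers the interval from 2023 to 2030.

```python
import pandas as pd

# Same upper bounds as replace_dict in the cell above.
edges = {"boomer": 1960, "genx": 1980, "geny": 1994, "genz": 2023, "missing": 2030}

# Made-up birth years, including boundary values and one missing entry.
years = pd.Series([1955, 1960, 1961, 1988, 1996, None])

# Bins are (1, 1960], (1960, 1980], (1980, 1994], (1994, 2023], (2023, 2030].
groups = pd.cut(years, bins=[1] + list(edges.values()), labels=list(edges.keys()))

print(groups.tolist())
```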


### <h2 align="center"> 💡 3 lines of pandas vs. 14 lines of code required using SQL approach [below](https://www.kaggle.com/code/cjinquan/citibike-analysis-sql?scriptVersionId=78471031&cellId=4) </h2>

<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">In Python</th>
    <th class="tg-0pky">With SQL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky">replace_dict = { 'boomer': 1960, 'genx': 1980, 'geny': 1994,'genz':2023,"missing":2030}<br>df['birth_year'] = pd.cut(df['birth_year'],bins=[1]+list(replace_dict.values()),labels=list(replace_dict.keys()))<br><br>df.groupby('birth_year',as_index=False)['tripduration'].count()<br></td>
    <td class="tg-0pky">SELECT  <br><br>        EXTRACT(year FROM starttime) AS year,<br><br>        COUNT(CASE WHEN birth_year&gt;=1940 AND birth_year&lt;1959 THEN 1 END) AS boomer,<br><br>        COUNT(CASE WHEN birth_year&gt;=1960 AND birth_year&lt;1979 THEN 1 END) AS genx,<br><br>        COUNT(CASE WHEN birth_year&gt;=1980 AND birth_year&lt;1994 THEN 1 END) AS geny,<br><br>        COUNT(CASE WHEN birth_year&gt;=1995 AND birth_year&lt;2012 THEN 1 END) AS genz<br><br><br>    FROM <br><br>        `citibike-project-330415.CITIBIKE.citibike_trips` <br><br>    GROUP BY <br><br>        year<br><br>    HAVING <br><br>        year IS NOT NULL AND year != 2013 AND year != 2018<br><br>    ORDER BY<br><br>        year ASC<br></td>
  </tr>
  <tr>
    <td class="tg-y698">3 Lines</td>
    <td class="tg-y698">14 Lines</td>
  </tr>
</tbody>
</table>

<a class="anchor" id="third-question"></a>
# ❓Q3: Which gender uses Citi Bike most often and what's the trend ❓

### pandas  - 2 lines of code

Extract the start year, exclude 2013, then group by year and gender and count how many trips there are per gender per year.

In [28]:
df['start_year'] = df.starttime.dt.year
df[df.start_year != 2013].groupby(['start_year','gender']).size() 

start_year  gender 
2014        female      49816
            male       169068
            unknown     23972
2015        female      59882
            male       198921
            unknown     39787
2016        female      66222
            male       203057
            unknown     39335
dtype: int64
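
The Kaggle SQL also reports each gender's percentage of trips per year. A minimal sketch of the same calculation in vanilla pandas follows, using made-up rows. Whether every step here (in particular `unstack`) is pushed down to the warehouse by Ponder is not something this notebook confirms, so treat it as an illustration of the dataframe logic only.

```python
import pandas as pd

# Made-up sample: four trips in 2014 and two in 2015.
trips = pd.DataFrame({
    "start_year": [2014, 2014, 2014, 2014, 2015, 2015],
    "gender": ["male", "male", "male", "female", "male", "female"],
})

# One row per year, one column per gender.
counts = trips.groupby(["start_year", "gender"]).size().unstack(fill_value=0)

# Divide each row by its own total to get per-year percentages.
share = (counts.div(counts.sum(axis=1), axis=0) * 100).round(2)

print(share)
```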

## <h2 align="center"> 💡 2 lines of pandas vs 16 lines required with SQL approach [below](https://www.kaggle.com/code/cjinquan/citibike-analysis-sql?scriptVersionId=78471031&cellId=11) </h2>

<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">In Python</th>
    <th class="tg-0pky">In SQL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky"><pre style=""><code class="cm-s-jupyter language-python"><span class="cm-variable">df</span>[<span class="cm-string">'start_year'</span>] <span class="cm-operator">=</span> <span class="cm-variable">df</span>.<span class="cm-property">starttime</span>.<span class="cm-property">dt</span>.<span class="cm-property">year</span>
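<span class="cm-variable">df</span>[<span class="cm-variable">df</span>.<span class="cm-property">start_year</span> <span class="cm-operator">!=</span> <span class="cm-number">2013</span>].<span class="cm-property">groupby</span>([<span class="cm-string">'start_year'</span>,<span class="cm-string">'gender'</span>]).<span class="cm-property">size</span>()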
</code></pre></td>
    <td class="tg-0pky"><pre style=""><code class="cm-s-jupyter language-sql"><span class="cm-keyword">SELECT</span>  
    EXTRACT<span class="cm-bracket">(</span> <span class="cm-builtin">year</span> <span class="cm-keyword">FROM</span> starttime<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> <span class="cm-builtin">year</span><span class="cm-punctuation">,</span>
    <span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>CASE WHEN gender<span class="cm-operator">=</span> <span class="cm-string">"female"</span> THEN <span class="cm-number">1</span> END<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> female<span class="cm-punctuation">,</span>
    ROUND<span class="cm-bracket">(</span><span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>CASE WHEN gender<span class="cm-operator">=</span> <span class="cm-string">"female"</span> THEN <span class="cm-number">1</span> END<span class="cm-bracket">)</span><span class="cm-operator">/</span><span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>gender<span class="cm-bracket">)</span><span class="cm-operator">*</span><span class="cm-number">100</span><span class="cm-punctuation">,</span> <span class="cm-number">2</span><span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> female_percentage<span class="cm-punctuation">,</span>
    <span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>CASE WHEN gender <span class="cm-operator">=</span> <span class="cm-string">"male"</span> THEN <span class="cm-number">1</span> END<span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> male<span class="cm-punctuation">,</span>
    ROUND<span class="cm-bracket">(</span><span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>CASE WHEN gender <span class="cm-operator">=</span> <span class="cm-string">"male"</span> THEN <span class="cm-number">1</span> END<span class="cm-bracket">)</span><span class="cm-operator">/</span><span class="cm-keyword">COUNT</span><span class="cm-bracket">(</span>gender<span class="cm-bracket">)</span><span class="cm-operator">*</span><span class="cm-number">100</span><span class="cm-punctuation">,</span> <span class="cm-number">2</span><span class="cm-bracket">)</span> <span class="cm-keyword">AS</span> male_percentage
<span class="cm-keyword">FROM</span> 
    `citibike<span class="cm-operator">-</span>project<span class="cm-operator">-</span><span class="cm-number">330415</span><span class="cm-variable-2">.CITIBIKE</span><span class="cm-variable-2">.citibike_trips</span>` 
<span class="cm-keyword">GROUP</span> <span class="cm-keyword">BY</span> 
    <span class="cm-builtin">year</span>
<span class="cm-keyword">HAVING</span> 
    <span class="cm-builtin">year</span> <span class="cm-keyword">IS</span> <span class="cm-keyword">NOT</span> <span class="cm-atom">NULL</span> <span class="cm-keyword">AND</span> <span class="cm-builtin">year</span> <span class="cm-operator">!=</span> <span class="cm-number">2013</span> <span class="cm-keyword">AND</span> <span class="cm-builtin">year</span> <span class="cm-operator">!=</span> <span class="cm-number">2018</span>
<span class="cm-keyword">ORDER</span> <span class="cm-keyword">BY</span>
    <span class="cm-builtin">year</span> <span class="cm-keyword">ASC</span>
</code></pre></td>
  </tr>
    
  <tr>
    <td class="tg-y698">2 Lines<br></td>
    <td class="tg-y698">16 Lines</td>
  </tr>
</tbody>
</table>

&nbsp;

<a class="anchor" id="fourth-question"></a>
#  ❓Q4: Which day of the week is the most popular ❓

### pandas - 1 line

Group records by day of week and count how many trips there are

In [22]:
df.groupby(df.starttime.dt.day_of_week).size()

starttime
0    145702
1    153066
2    157988
3    154676
4    149662
5    121801
6    118137
dtype: int64
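
The index values above are pandas day numbers, where 0 is Monday and 6 is Sunday. BigQuery's `EXTRACT(DAYOFWEEK ...)`, used in the Kaggle SQL, numbers days from 1 (Sunday) to 7 (Saturday), so the two outputs do not share the same labels. The sketch below, in vanilla pandas with made-up dates, shows the pandas numbering next to the day names.

```python
import pandas as pd

# 2016-09-19 was a Monday, 2016-09-20 a Tuesday, 2016-09-25 a Sunday.
starts = pd.Series(pd.to_datetime(["2016-09-19", "2016-09-20", "2016-09-25"]))

print(starts.dt.day_of_week.tolist())
print(starts.dt.day_name().tolist())
```

Read this way, the counts above peak midweek (day 2, Wednesday) and are lowest on the weekend (days 5 and 6).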

## <h2 align="center"> 💡 1 line of pandas vs 11 lines of code required with SQL approach [below](https://www.kaggle.com/code/cjinquan/citibike-analysis-sql?scriptVersionId=78471031&cellId=16) </h2>

<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">In Python</th>
    <th class="tg-0pky">With SQL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky">df.groupby(df.starttime.dt.day_of_week).size()<br><br></td>
    <td class="tg-0pky">SELECT  <br><br><br>        EXTRACT ( dayofweek FROM starttime) AS day_of_week,<br><br><br>        COUNT(*) AS number_of_trip<br><br><br>    FROM <br><br><br>        `citibike-project-330415.CITIBIKE.citibike_trips` <br><br><br>    GROUP BY <br><br><br>        day_of_week<br><br><br>    HAVING<br><br><br>        day_of_week IS NOT NULL <br><br><br>    ORDER BY <br><br><br>        day_of_week ASC<br></td>
  </tr>
  <tr>
    <td class="tg-y698">1 Line</td>
    <td class="tg-y698">11 Lines</td>
  </tr>
</tbody>
</table>

&nbsp;

<a class="anchor" id="fifth-question"></a>
# ❓Q5: What is the average trip duration per day of the week ❓

### pandas - 1 line

Group records by day of week then calculate the average duration of trips for each day of the week

In [24]:
df.groupby(df.starttime.dt.day_of_week)['tripduration'].mean()

starttime
0     887.048881
1     850.023983
2     863.566176
3     855.236055
4     889.788971
5    1138.640561
6    1125.504829
Name: tripduration, dtype: float64
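
The averages above are in the same unit as `tripduration`. Assuming, as in the public Citi Bike schema, that this column is recorded in seconds, the values correspond to roughly 14 to 19 minutes per trip. A small sketch of the conversion in vanilla pandas, with made-up trips:

```python
import pandas as pd

# Made-up trips: two on a Monday and one on a Sunday, durations in seconds.
trips = pd.DataFrame({
    "starttime": pd.to_datetime(
        ["2016-09-19 08:00", "2016-09-19 09:00", "2016-09-25 10:00"]
    ),
    "tripduration": [600, 1200, 1800],
})

# Average per day of week, converted from seconds to minutes.
mean_minutes = trips.groupby(trips.starttime.dt.day_of_week)["tripduration"].mean() / 60

print(mean_minutes.round(2))
```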

## <h2 align="center">  💡1 line of pandas vs. 11 lines of code required with SQL approach [below]() </h2>

<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">In Python</th>
    <th class="tg-0pky">With SQL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky">df.groupby(df.starttime.dt.day_of_week)['tripduration'].mean()<br></td>
    <td class="tg-0pky">SELECT  <br><br><br>        EXTRACT(dayofweek FROM starttime) AS day,<br><br><br>        ROUND(AVG(tripduration), 2) AS average_trip_duration_minutes<br><br><br>    FROM <br><br><br>        `citibike-project-330415.CITIBIKE.citibike_trips` <br><br><br>    GROUP BY <br><br><br>        day<br><br><br>    HAVING <br><br><br>        day IS NOT NULL<br><br><br>    ORDER BY <br><br><br>        day ASC<br></td>
  </tr>
  <tr>
    <td class="tg-y698">1 Line</td>
    <td class="tg-y698">11 Lines</td>
  </tr>
</tbody>
</table>

&nbsp;

<a class="anchor" id="sixth-question"></a>
#  ❓Q6: What is the total number of trips per month for a year (2015) ❓

### pandas - 1 line of code

Group by month, then count trips per month

In [27]:
df[df['start_year']==2015].groupby(df.starttime.dt.month).size()

starttime
1      8722
2      5989
3     10180
4     19649
5     28979
6     28292
7     32840
8     35388
9     38535
10    36304
11    29709
12    24003
dtype: int64
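
One detail in the cell above is worth spelling out: the frame is filtered to 2015, but the group key `df.starttime.dt.month` comes from the unfiltered frame. In vanilla pandas this works because a Series used as a group key is aligned to the filtered rows by index label. The sketch below checks that on made-up data; that Ponder follows the same alignment rule is suggested by the output above but is otherwise an assumption here.

```python
import pandas as pd

trips = pd.DataFrame({
    "starttime": pd.to_datetime(["2014-03-15", "2015-01-10", "2015-01-20", "2015-10-15"]),
})
trips["start_year"] = trips.starttime.dt.year

in_2015 = trips[trips["start_year"] == 2015]

# Group key taken from the full frame, as in the cell above.
by_full_key = in_2015.groupby(trips.starttime.dt.month).size()

# Group key taken from the filtered frame.
by_filtered_key = in_2015.groupby(in_2015.starttime.dt.month).size()

print(by_full_key.to_dict())
print(by_full_key.to_dict() == by_filtered_key.to_dict())
```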

## <h2 align="center">💡1 line of pandas vs. 15 lines of code required with SQL approach [below]()</h2>

<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">In Python</th>
    <th class="tg-0pky">With SQL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky">df[df['start_year']==2015].groupby(df.starttime.dt.month).size()<br></td>
    <td class="tg-0pky">SELECT  <br><br><br>        EXTRACT(year FROM starttime) AS year,<br><br><br>        EXTRACT(month FROM starttime) AS month,<br><br><br>        COUNT(*) AS number_of_trips<br><br><br>    FROM <br><br><br>        `citibike-project-330415.CITIBIKE.citibike_trips` <br><br><br>    GROUP BY <br><br><br>        year, month<br><br><br>    HAVING <br><br><br>        month IS NOT NULL AND <br><br><br>        year IS NOT NULL AND <br><br><br>        year = 2015 <br><br><br>    ORDER BY <br><br><br>        year ASC,<br><br><br>        month ASC<br></td>
  </tr>
  <tr>
    <td class="tg-y698">1 Line</td>
    <td class="tg-y698">15 Lines</td>
  </tr>
</tbody>
</table>

&nbsp;

## So in total...

 <h1 align="center">🔥9 lines of pandas vs. ~70 lines with the SQL🔥</h1>

&nbsp;

# But what if I want to do more than the simple summary stats above?

As part of my analysis I may want to go beyond the 6 questions we've already answered. I may want to do more complex data cleaning and transformation steps that aren't included in the Kaggle analysis. If my operations aren't relational in nature, what do I do? 

See two examples below:

### Drop rows with missing values, sort by trip duration, get the top 10 longest trips

A key characteristic of dataframes is that they are ordered, the order is preserved across operations, and we can use indexing to select subsets of data. These characteristics are critical for interactive data science.

With pandas, the above prompt can be answered with one highly expressive line of code, whereas with SQL, preserving order across operations is nearly impossible and dealing with null values is very involved.

In [None]:
df.dropna(axis=0).sort_values(by='tripduration',ascending=False).head(10)
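
Here is the same chain on a tiny made-up frame in vanilla pandas, so the effect of each step is easy to follow. The row with the longest trip is dropped because its `birth_year` is missing, and the surviving rows keep their original index labels after sorting.

```python
import pandas as pd

trips = pd.DataFrame({
    "tripduration": [616, 2092, 352, 669],
    "birth_year": [1988.0, None, 1976.0, 1992.0],
})

# Drop rows with any missing value, sort longest first, keep the top 2.
top2 = trips.dropna(axis=0).sort_values(by="tripduration", ascending=False).head(2)

print(top2)
```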

### Create dummies for feature engineering/modeling

Gender values in our data set are `male`, `female`, and `unknown`.

A common feature engineering task is to one-hot encode a category in order to do some predictive modeling.

In a relational database, users must define a schema upfront before they load data into a table or do their analysis. This makes one-hot encoding particularly challenging in that context, because the schema needs to be updated based on the unknown categories contained in a column. 

In order to work around these challenges with SQL, a user will have to write a brittle query that will require hard-coding category values and will be difficult to maintain.

With Ponder, you can just use `get_dummies` for one-hot encoding inside of your database.

In [None]:
pd.get_dummies(df, columns=["gender"])  # columns must be list-like, not a bare string
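
To show what the encoded result looks like, here is a small sketch in vanilla pandas on made-up rows. Each distinct value in `gender` becomes its own indicator column, and no category list has to be written by hand.

```python
import pandas as pd

trips = pd.DataFrame({
    "tripduration": [616, 582, 669],
    "gender": ["male", "female", "unknown"],
})

# One indicator column is created per distinct gender value.
encoded = pd.get_dummies(trips, columns=["gender"])

print(encoded.columns.tolist())
```

If a new category appeared in the data, rerunning the same line would add a column for it without any change to the code.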

For a more detailed comparison of one-hot encoding in pandas vs. SQL, check out our blog post [here](https://ponder.io/pandas-vs-sql-part-2-pandas-is-more-concise/#2.-One-hot-encoding).

&nbsp;

##  💡 Ponder gives you a high fidelity pandas experience for your databases 💡

* We answered 6 questions from a Kaggle notebook to demonstrate that queries are often more concise and easier to express in pandas than in SQL.
* We showed how Ponder can help you explore your data with far less code: 9 lines of pandas versus about 70 lines of SQL.
* We demonstrated how certain dataframe operations, such as `get_dummies`, are extremely challenging or even impossible to express in SQL.

### <h1><center>To try Ponder for free, sign up for an account at [app.ponder.io](app.ponder.io) !</center></h1>