In [6]:
import sys
import os

import pandas as pd
%matplotlib inline
import urllib.request

In [7]:
try:
    import KustoPandas
except ImportError:
    # Not installed: fall back to the copy in the parent directory
    sys.path.insert(0, os.path.abspath(os.path.join(os.path.abspath(""), '..')))
    import KustoPandas
from KustoPandas import Wrap

Get a sample dataset and import it as a Pandas DataFrame

In [8]:
def download_dataset_if_necessary(url, filename):
    # Download only once; later runs reuse the local copy
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

url = "https://projects.fivethirtyeight.com/trump-approval-data/approval_polllist.csv"
filename = "approval_polllist.csv"
download_dataset_if_necessary(url, filename)
data = pd.read_csv(filename, parse_dates=["modeldate", "startdate", "enddate"])

Wrap it using KustoPandas

In [9]:
w = Wrap(data)

Now we can start to explore it using KustoPandas commands.  

Let's start with something simple and just look at a few lines.

The corresponding Kusto command would be

```
w | take 5
```

Python has no pipe syntax for chaining operators this way, so we use method calls with `.` instead

In [10]:
w.take(5)

Unnamed: 0,president,subgroup,modeldate,startdate,enddate,pollster,grade,samplesize,population,weight,...,disapprove,adjusted_approve,adjusted_disapprove,multiversions,tracking,url,poll_id,question_id,createddate,timestamp
0,Donald Trump,All polls,2021-01-20,2017-01-20,2017-01-22,Morning Consult,B/C,1992.0,rv,0.680029,...,37.0,45.686784,38.055805,,,http://static.politico.com/9b/13/82a3baf542ae9...,49249,77261,1/23/2017,11:47:59 20 Jan 2021
1,Donald Trump,All polls,2021-01-20,2017-01-20,2017-01-22,Gallup,B,1500.0,a,0.262323,...,45.0,45.861441,43.539189,,T,http://www.gallup.com/poll/201617/gallup-daily...,49253,77265,1/23/2017,11:47:59 20 Jan 2021
2,Donald Trump,All polls,2021-01-20,2017-01-20,2017-01-24,Ipsos,B-,1632.0,a,0.153481,...,45.2,43.451563,43.780389,,T,http://polling.reuters.com/#poll/CP3_2/,49426,77599,3/1/2017,11:47:59 20 Jan 2021
3,Donald Trump,All polls,2021-01-20,2017-01-21,2017-01-23,Gallup,B,1500.0,a,0.242845,...,46.0,45.861441,44.539189,,T,http://www.gallup.com/poll/201617/gallup-daily...,49262,77274,1/24/2017,11:47:59 20 Jan 2021
4,Donald Trump,All polls,2021-01-20,2017-01-22,2017-01-24,Gallup,B,1500.0,a,0.22738,...,45.0,46.861441,43.539189,,T,http://www.gallup.com/poll/201617/gallup-daily...,49236,77248,1/25/2017,11:47:59 20 Jan 2021


Notice that the output is a nicely formatted table.  That is because KustoPandas is just a shallow wrapper around a pandas DataFrame.  Pandas is doing the hard work of formatting it nicely in Jupyter.

Let's explore the data a bit more using the `summarize` operator.  Here is the Kusto command

```
w | summarize count(), min(startdate), max(startdate), dcount(pollster), AverageDisapproval = avg(disapprove)
```

To execute this in Python, the top-level Kusto operator (`summarize`) becomes a method on the wrapped object, and all of its arguments are passed to that method as a single string

In [11]:
w.summarize("count(), min(startdate), max(startdate), dcount(pollster), AverageDisapproval = avg(disapprove)")

Unnamed: 0,count_,min_startdate,max_startdate,dcount_pollster,AverageDisapproval
0,16500,2017-01-20,2021-01-16,93,53.31449
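
For comparison, the same aggregation written directly in pandas might look like the sketch below. The rows are made up for illustration and are not the FiveThirtyEight data; the output column names simply mirror the ones shown above.

```python
import pandas as pd

# Made-up rows for illustration only; not the real polling data.
polls = pd.DataFrame({
    "startdate": pd.to_datetime(["2017-01-20", "2017-01-21", "2017-01-21"]),
    "pollster": ["Gallup", "Ipsos", "Gallup"],
    "disapprove": [45.0, 45.2, 46.0],
})

# One-row summary, one column per aggregate in the summarize string.
summary = pd.DataFrame({
    "count_": [len(polls)],
    "min_startdate": [polls["startdate"].min()],
    "max_startdate": [polls["startdate"].max()],
    "dcount_pollster": [polls["pollster"].nunique()],
    "AverageDisapproval": [polls["disapprove"].mean()],
})
print(summary)
```

The single query string replaces five separate pandas expressions, which is most of the appeal of the `summarize` syntax.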


Similarly we can use the `where` operator to do filtering.  The Kusto query would be

```
w | where grade in ("A", "B")
```

Again (and this is always the case), the top-level operator `where` is the method, and the rest of the query is passed as a string.

The output of the query above would be too long to show here, so let's chain operators together

```
w | where grade in ("A", "B") | summarize count(), avg(disapprove) by grade, pollster
```

In [12]:
w.where("grade in ('A', 'B')").summarize("count(), avg(disapprove) by grade, pollster")

Unnamed: 0,grade,pollster,count_,avg_disapprove
0,A,CBS News,41,54.02439
1,A,Suffolk University,38,52.594737
2,A,SurveyUSA,4,51.5
3,B,American Research Group,127,57.708661
4,B,GQR Research,28,54.27381
5,B,Gallup,859,55.54482
6,B,Public Policy Polling,82,52.926829
7,B,YouGov,4582,52.215976
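
As a rough point of comparison, a plain pandas version of this filter-then-group chain could be sketched as follows. The data is invented for the example, and `count_` and `avg_disapprove` are chosen only to match the column names in the output above.

```python
import pandas as pd

# Invented rows for illustration only.
polls = pd.DataFrame({
    "grade": ["A", "B", "B", "C", "B"],
    "pollster": ["CBS News", "Gallup", "Gallup", "Other", "YouGov"],
    "disapprove": [54.0, 45.0, 47.0, 50.0, 52.0],
})

# where grade in ('A', 'B')
filtered = polls[polls["grade"].isin(["A", "B"])]

# summarize count(), avg(disapprove) by grade, pollster
result = (
    filtered.groupby(["grade", "pollster"], as_index=False)
    .agg(count_=("disapprove", "size"), avg_disapprove=("disapprove", "mean"))
)
print(result)
```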


KustoPandas fully supports arbitrary mathematical expressions, just like Kusto

w.extend("NewWeight = exp(weight - 4) * 0.5").summarize("min(NewWeight), max(NewWeight), avg(NewWeight)")
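
For readers more familiar with pandas, the expression above corresponds roughly to `DataFrame.assign` with NumPy math. This is a sketch on two made-up weights, not a description of how KustoPandas evaluates the expression internally.

```python
import numpy as np
import pandas as pd

# Two made-up weights for illustration only.
polls = pd.DataFrame({"weight": [4.0, 5.0]})

# extend NewWeight = exp(weight - 4) * 0.5
extended = polls.assign(NewWeight=np.exp(polls["weight"] - 4) * 0.5)

# summarize min(NewWeight), max(NewWeight), avg(NewWeight)
stats = extended["NewWeight"].agg(["min", "max", "mean"])
print(stats)
```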

It also supports Kusto's nice syntax for binning time values

In [13]:
w.summarize("count() by bin(startdate, 1d)").take(5)

Unnamed: 0,bin_startdate,count_
0,2017-01-20,9
1,2017-01-21,5
2,2017-01-22,7
3,2017-01-23,12
4,2017-01-24,7
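
To see what `bin(startdate, 1d)` saves you from writing, here is one way the same daily count might be expressed in pandas, by flooring each timestamp to the start of its day. The timestamps are made up, and the column names are picked to match the output above.

```python
import pandas as pd

# Made-up timestamps for illustration only.
polls = pd.DataFrame({
    "startdate": pd.to_datetime([
        "2017-01-20 08:00", "2017-01-20 17:30", "2017-01-21 09:15",
    ]),
})

# Floor each timestamp to midnight, then count rows per day.
per_day = (
    polls.groupby(polls["startdate"].dt.floor("D"))
    .size()
    .reset_index(name="count_")
    .rename(columns={"startdate": "bin_startdate"})
)
print(per_day)
```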


Notice that all the above commands leave `w` unchanged.  In fact, that will always be true: every operation leaves `w` unchanged and returns a new instance of the `Wrap` class wrapping a new pandas DataFrame.  This is generally the behavior that pandas follows as well, and pandas makes it easy to do this without duplicating the data inside the DataFrame.
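
The pattern can be illustrated with a tiny stand-in class. `TinyWrap` and its two methods are hypothetical and written only for this example; they are not the KustoPandas implementation, just a sketch of a wrapper whose operations return new wrappers.

```python
import pandas as pd


class TinyWrap:
    """Hypothetical stand-in for illustration; not the real Wrap class."""

    def __init__(self, df):
        self.df = df

    def where_grade(self, grade):
        # Boolean indexing builds a new DataFrame; self.df is untouched.
        return TinyWrap(self.df[self.df["grade"] == grade])

    def take(self, n):
        # head() also returns a new DataFrame object.
        return TinyWrap(self.df.head(n))


# Made-up rows for illustration only.
polls = pd.DataFrame({"grade": ["A", "B", "A"], "disapprove": [54.0, 45.0, 51.0]})

tiny = TinyWrap(polls)
first_a = tiny.where_grade("A").take(1)

print(len(tiny.df), len(first_a.df))
```

Because each method returns a fresh wrapper, calls can be chained freely and the starting object can be reused for the next query.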

If you want to save the output of a calculation, just assign it to a variable

In [14]:
A_rated = w.where("grade == 'A'")
print("number of polls with grade A")
print(A_rated.count())
print("total number of polls")
print(w.count())

number of polls with grade A
   Count
0     83
total number of polls
   Count
0  16500


There are some things that are easier to do in Python and pandas than with Kusto, so KustoPandas provides easy access to the underlying DataFrame using `w.df`

For example, we can format the output in the above expression more nicely by accessing the DataFrame directly

In [15]:
print("number of polls with grade A: ", A_rated.df.shape[0])
print("total number of polls:        ", w.df.shape[0])

number of polls with grade A:  83
total number of polls:         16500


Of course, the above example is a bit contrived.  The same information can be obtained more succinctly using `summarize`

In [16]:
w.summarize("TotalNumberOfPolls=count(), NumberOfPollsWithGradeA = countif(grade == 'A')")

Unnamed: 0,TotalNumberOfPolls,NumberOfPollsWithGradeA
0,16500,83
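
The `countif` aggregate has a simple pandas counterpart: summing a boolean Series counts its `True` values. A minimal sketch on made-up grades:

```python
import pandas as pd

# Made-up grades for illustration only.
polls = pd.DataFrame({"grade": ["A", "B", "A", "C"]})

# count()
total_polls = len(polls)

# countif(grade == 'A'): True counts as 1 and False as 0 when summed.
grade_a_polls = int((polls["grade"] == "A").sum())

print(total_polls, grade_a_polls)
```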


As mentioned above, we can chain tabular operators together

```
w.where("grade in ('A', 'B')").summarize("count(), avg(disapprove) by grade, pollster")
```

However, KustoPandas also supports full Kusto queries, so you can write the whole chain as one query using the `execute` method and the Kusto pipe operator. We use `self` to refer to the current table (`w`) that the operation is performed on

In [17]:
w.execute("""
self 
| where grade in ('A', 'B')
| summarize count(), avg(disapprove) by grade, pollster
| sort by count_ desc
| take 3
""")

Unnamed: 0,grade,pollster,count_,avg_disapprove
7,B,YouGov,4582,52.215976
5,B,Gallup,859,55.54482
3,B,American Research Group,127,57.708661


KustoPandas even supports chaining multiple query statements together using `;`

In [20]:
w.execute("""
# show data only from the top N most prolific pollsters
let N = 3;
let topPollsters = self 
| summarize count() by pollster
| sort by count_ desc
| take N
| project pollster;
self 
| join (topPollsters) on pollster
| take 2
""")

Unnamed: 0,president,subgroup,modeldate,startdate,enddate,pollster,grade,samplesize,population,weight,...,disapprove,adjusted_approve,adjusted_disapprove,multiversions,tracking,url,poll_id,question_id,createddate,timestamp
0,Donald Trump,All polls,2021-01-20,2017-01-20,2017-01-22,Morning Consult,B/C,1992.0,rv,0.680029,...,37.0,45.686784,38.055805,,,http://static.politico.com/9b/13/82a3baf542ae9...,49249,77261,1/23/2017,11:47:59 20 Jan 2021
1,Donald Trump,All polls,2021-01-20,2017-01-26,2017-01-28,Morning Consult,B/C,1991.0,rv,0.560098,...,41.0,48.686784,42.055805,,,https://www.politico.com/f/?id=00000159-f6e7-d...,49241,77253,1/29/2017,11:47:59 20 Jan 2021
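
For comparison, the top-N-then-join logic could be sketched in pandas as below. The rows are invented, and an inner `merge` is used as an approximation of the join; how KustoPandas maps `join` onto pandas is not shown in this notebook, so treat the choice of `how="inner"` as an assumption.

```python
import pandas as pd

# Invented rows for illustration only.
polls = pd.DataFrame({
    "pollster": ["YouGov", "YouGov", "YouGov", "Gallup", "Gallup", "Ipsos"],
    "disapprove": [52.0, 53.0, 51.0, 45.0, 46.0, 45.2],
})

N = 2

# let topPollsters = self | summarize count() by pollster
#                         | sort by count_ desc | take N | project pollster
counts = polls.groupby("pollster").size().reset_index(name="count_")
top_pollsters = counts.sort_values("count_", ascending=False).head(N)[["pollster"]]

# self | join (topPollsters) on pollster
joined = polls.merge(top_pollsters, on="pollster", how="inner")
print(joined)
```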


If there is anything that Kusto (or KustoPandas) does not support, you can use the `let_elementwise` method to pass in custom Python functions and then call them by name inside the query

In [19]:
def last_word(myString):
    return myString.split(" ")[-1]

w.let_elementwise(last_word=last_word).execute("""
self 
| extend pollster_last_name = last_word(pollster)
| project pollster_last_name
| take 4
""")

Unnamed: 0,pollster_last_name
0,Consult
1,Gallup
2,Ipsos
3,Gallup
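
In plain pandas, applying a Python function to each value of a column is typically done with `Series.map`. The sketch below reuses the `last_word` function from the cell above on a few pollster names taken from the earlier output; it illustrates the elementwise idea only and makes no claim about how `let_elementwise` is implemented.

```python
import pandas as pd


def last_word(my_string):
    # Return the final space-separated token of the string.
    return my_string.split(" ")[-1]


polls = pd.DataFrame({"pollster": ["Morning Consult", "Gallup", "Ipsos"]})

# Apply the function to every value in the column.
last_names = polls["pollster"].map(last_word)
print(last_names.tolist())
```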


In [21]:
w.execute("""
self 
| where grade !contains "D"
| extend AorB = grade in ("A", "B")
| summarize count(), AverageAdjustedApprovalDiff = avg(adjusted_approve - approve) by AorB
| take 5
""")

Unnamed: 0,AorB,count_,AverageAdjustedApprovalDiff
0,False,10111,-1.029531
1,True,5761,0.276601
