In [None]:
%matplotlib inline

import ibis
import pandas as pd

## Default Row Limits
By default, ibis limits result sets returned to the local process to 10,000 rows. If you know you need more than 10,000 rows returned, remember to change the default limit.

## Interactive Mode
Ibis also offers an interactive mode that automatically executes all expressions. This can be useful in a notebook or REPL. I personally prefer to execute expressions explicitly, but this is a matter of taste. If you use interactive mode, I recommend setting the default limit low so that you do not accidentally return an unreasonable number of rows to your local process.

In [None]:
ibis.options.sql.default_limit = None

hdfs_conn = ibis.hdfs_connect(host='cdh3.2.guerilla-python.internal')

ibis_conn = ibis.impala.connect(host='cdh1.c.guerilla-python.internal',
                                hdfs_client=hdfs_conn)

In [None]:
pageviews_tbl = ibis_conn.table('wiki_pageviews', database='u_srowen')

What is in a project name? What does this data look like?

In [None]:
project_names_expr = pageviews_tbl.project_name.distinct()
project_names = ibis_conn.execute(project_names_expr)
project_names

Maybe we can understand this by finding the projects with the most pages. Let's group by project name and then count the size of each group.

In [None]:
project_page_counts = pageviews_tbl.group_by(pageviews_tbl.project_name)\
                                   .size()\
                                   .sort_by(('count', False))
project_page_counts = ibis_conn.execute(project_page_counts)
project_page_counts
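
If you don't have a cluster at hand, the same group-and-count idea can be tried locally in pandas. This is only a rough sketch: the toy frame and its values are invented for illustration, and only the `project_name` column name comes from the table above.

```python
import pandas as pd

# Toy stand-in for the pageviews table (all values are made up).
toy = pd.DataFrame({
    'project_name': ['en', 'en', 'en', 'de', 'de', 'fr.b'],
    'page_name': ['A', 'B', 'C', 'A', 'D', 'E'],
})

# Count the rows in each project group, largest group first.
toy_counts = (toy.groupby('project_name')
                 .size()
                 .sort_values(ascending=False))
print(toy_counts)
```

The result is a Series indexed by project name, which plays the same role as the `count` column in the ibis result.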

To find something interesting, it will help to stick to a language we can read, so let's look for project names containing 'en'.

In [None]:
[name for name in project_names if 'en' in name]

The part of the project name after the '.' specifies a special type of wiki. Let's look only at the standard wiki pages (i.e., not media-wiki) that are written in English.

In [None]:
ibis_conn.execute(pageviews_tbl[pageviews_tbl.project_name == 'en'].limit(10))
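
To make the naming convention concrete, here is a small sketch that splits a project name at the first '.'. The helper `split_project_name` and the sample names are hypothetical, and I am assuming the part before the '.' is the language code. Note that a plain substring test such as `'en' in name` also matches names where 'en' appears somewhere other than the prefix, which is why the filter above uses an exact comparison.

```python
def split_project_name(name):
    """Split 'en.b' into ('en', 'b'); a bare 'en' gives ('en', None)."""
    language, sep, wiki_type = name.partition('.')
    return language, (wiki_type if sep else None)


# Hypothetical sample; the real values come from `project_names` above.
sample_names = ['en', 'en.b', 'de', 'fr.d']

# Keep only names that are exactly the English prefix with no suffix.
standard_english = [n for n in sample_names
                    if split_project_name(n) == ('en', None)]
print(standard_english)
```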

`project_name` is homogeneous in this filtered dataset, so let's just take the projection of all the other columns.

In [None]:
en_pageviews = pageviews_tbl[pageviews_tbl.project_name == 'en'].projection([
    'page_name',
    'n_views',
    'n_bytes',
    'day',
    'hour',
    'month',
    'year',
])

In [None]:
ibis_conn.execute(en_pageviews.limit(10))

It seems that we should exclude these pages with no names and NaN counts. (With big data sets, you will find all types of messed-up data.)

In [None]:
top_10_pg_views_hourly = en_pageviews.sort_by((en_pageviews.n_views, False)).limit(10)
ibis_conn.execute(top_10_pg_views_hourly)

In [None]:
null_pg_views = en_pageviews[en_pageviews.n_views.isnull()]

In [None]:
ibis_conn.execute(null_pg_views)

In [None]:
nn_pg_views = en_pageviews[en_pageviews.n_views.notnull()]
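
The null and non-null filters above have direct pandas counterparts, which can be handy for checking the logic on a small sample. A minimal sketch, assuming a toy frame with invented values and the same `page_name` and `n_views` column names:

```python
import numpy as np
import pandas as pd

# Toy stand-in with one bad row (values are made up).
toy_views = pd.DataFrame({
    'page_name': ['Champagne', None, 'Hangover'],
    'n_views': [120.0, np.nan, 45.0],
})

# Rows with a missing view count, and the complement.
null_rows = toy_views[toy_views.n_views.isnull()]
non_null_rows = toy_views[toy_views.n_views.notnull()]
print(len(null_rows), len(non_null_rows))
```

The two masks partition the frame, so the row counts of the two results always add up to the original row count.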

What are the top ten pages in this series once the null counts are removed?

In [None]:
ibis_conn.execute(nn_pg_views.sort_by((nn_pg_views.n_views, False)).limit(10))

hangover, brands of champagne, New Year's traditions, time differences, international datetime

In [None]:
champagne_df = ibis_conn.execute(nn_pg_views[nn_pg_views.page_name.lower() == 'champagne'])

In [None]:
# DataFrame.sort was removed from pandas; sort_values is its replacement.
champagne_df.sort_values(['day', 'hour'])

In [None]:
champagne_df['time'] = pd.to_datetime(champagne_df[['year', 'month', 'day', 'hour']])
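
`pd.to_datetime` can assemble timestamps from a DataFrame whose columns are named `year`, `month`, `day` and, optionally, `hour`. A quick sketch on a toy frame (the dates are invented, not taken from the dataset):

```python
import pandas as pd

# Made-up date parts, using the same column names as the cell above.
toy_parts = pd.DataFrame({
    'year': [2015, 2016],
    'month': [12, 1],
    'day': [31, 1],
    'hour': [23, 0],
})

# Each row of date parts becomes a single Timestamp.
toy_parts['time'] = pd.to_datetime(toy_parts[['year', 'month', 'day', 'hour']])
print(toy_parts['time'])
```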

In [None]:
# Plot hourly views against time, in chronological order.
champagne_df.sort_values('time').plot(x='time', y='n_views')

In [None]:
w_daily_views = nn_pg_views.group_by(['page_name', 'month', 'day']).aggregate(daily_views=nn_pg_views.n_views.sum())

ibis_conn.execute(w_daily_views.sort_by((w_daily_views.daily_views, False)).limit(10))

In [None]:
tot_view = nn_pg_views.group_by('page_name').aggregate(all_views=nn_pg_views.n_views.sum())
ibis_conn.execute(tot_view.sort_by((tot_view.all_views, False)).limit(30))