# This is XXXXXXXXXXXXXXX's notebook.

# Getting Wiki Pageview Data
Wikipedia makes a lot of its data publicly available, in line with its open access philosophy. It is one of my favorite examples of a large, publicly available dataset, because it comes from an incredibly popular website that covers a ton of domains and languages. (Note: the data is particularly large because much of it is released in formats that do not optimize for efficient representation or reduced redundancy.)

Their data is popular and people make many interesting things with it. In particular, I think [Wikitrends is a great example.](http://www.wikipediatrends.com/) Currently their front page displays an interactive plot of pageviews for presidential candidates' pages.

![trending on wikitrends](img/trending-on-wikitrends.png)

They also provide a customizable view where anyone can look up view numbers for any page.

![lemmy views](img/search-interface.png)

This data is interesting because it tells us what people are looking up on Wikipedia. From this we can infer which topics are on their minds. Talk about a look into the cultural zeitgeist!

We will use pageview data from which the Wikimedia team has already removed as many identifiable bot requests as possible. Documentation on this data is [available here.](https://dumps.wikimedia.org/other/pagecounts-raw/) The data is made available in hourly files, and the URLs are split first by year, then by month.

### dumps.wikimedia.org/other/pageviews/
![screen shot of file index1](img/pgvw1.png)

### dumps.wikimedia.org/other/pageviews/2016/
![screen shot of file index1](img/pgvw2.png)

### dumps.wikimedia.org/other/pageviews/2016/2016-01/
![screen shot of file index1](img/pgvw3.png)

[Files reside here](https://dumps.wikimedia.org/other/pageviews/)

[Docs on pageview stats](https://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics)

[Other, related data](https://dumps.wikimedia.org/other/analytics/)
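
Because the files are laid out by year and then by month, the URL of any hourly file can be derived from a timestamp. The following is a minimal sketch, assuming the `pageviews-YYYYMMDD-HH0000.gz` naming that the download code in this notebook filters on. The helper name `hourly_file_url` is hypothetical and not part of any library.

```python
from datetime import datetime

# Base location of the hourly pageview dumps (same URL as the "Files reside here" link).
BASE_URL = "https://dumps.wikimedia.org/other/pageviews/"


def hourly_file_url(ts):
    """Build the expected URL of the hourly dump file for timestamp `ts`.

    Assumes the layout <year>/<year>-<month>/pageviews-<date>-<hour>0000.gz.
    """
    return BASE_URL + ts.strftime("%Y/%Y-%m/pageviews-%Y%m%d-%H0000.gz")


# Example: the file covering 01:00 on January 1st, 2016.
example_url = hourly_file_url(datetime(2016, 1, 1, 1))
print(example_url)
```

This only builds the URL; whether a file actually exists at that address still has to be checked against the index page.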

Our strategy for moving this data into HDFS is to download it locally and then use the HDFS client 'put' command to move it from local disk to HDFS. To download the data we will:

* Send an HTTP request to get a page listing all files in a certain month.
* Parse the returned HTML to retrieve all file names.
* Filter file names by day so we don't spend our entire lives downloading data (i.e. only download a day's worth of data at a time).

In [None]:
import os

import ibis
import requests
from bs4 import BeautifulSoup

We will parse the HTML with Beautiful Soup, NOT A REGEX:

![screen shot of the stack overflow QA with the amazing characters](img/stackoverflow.png)

We will:
* retrieve the index page for the hosted files as text. These will look like:


    <li>
        <a href='pageviews-20160101-010000.gz'>
        ...etc...
    </li>

* iterate through all links (i.e. tag 'a') and grab the associated URLs (labeled 'href')
* yield the fully qualified URL back.

In [None]:
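def get_pageviews(year, month, day):
    """Yield the URL of every hourly pageview file for the given day."""
    pageviews_url = 'https://dumps.wikimedia.org/other/pageviews/{0}/{0}-{1}/'.format(year, month)
    # Name the parser explicitly so Beautiful Soup does not have to guess one.
    soup = BeautifulSoup(requests.get(pageviews_url).text, 'html.parser')
    day_prefix = 'pageviews-{0}{1}{2}'.format(year, month, day)
    for a in soup.find_all('a'):
        # Use .get() because not every <a> tag is guaranteed to carry an href.
        href = a.get('href', '')
        if day_prefix in href:
            yield pageviews_url + href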
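
To check the filtering logic without hitting the network, the same idea can be run against a small, made-up index page. This sketch swaps in the standard library's `html.parser.HTMLParser` for Beautiful Soup so it has no third-party dependencies. The sample HTML, the `LinkCollector` class and the `filter_day` helper are all hypothetical and only mirror what `get_pageviews` does.

```python
from html.parser import HTMLParser

# A made-up excerpt of an index page, for illustration only.
SAMPLE_INDEX = """
<ul>
  <li><a href="../">Parent directory</a></li>
  <li><a href="pageviews-20160101-000000.gz">pageviews-20160101-000000.gz</a></li>
  <li><a href="pageviews-20160101-010000.gz">pageviews-20160101-010000.gz</a></li>
  <li><a href="pageviews-20160102-000000.gz">pageviews-20160102-000000.gz</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def filter_day(html, base_url, year, month, day):
    """Return fully qualified URLs of the links that belong to one day."""
    parser = LinkCollector()
    parser.feed(html)
    day_prefix = "pageviews-{0}{1}{2}".format(year, month, day)
    return [base_url + href for href in parser.hrefs if day_prefix in href]


base = "https://dumps.wikimedia.org/other/pageviews/2016/2016-01/"
jan_first = filter_day(SAMPLE_INDEX, base, "2016", "01", "01")
print(jan_first)
```

Only the two files from January 1st pass the filter. The parent-directory link and the file from January 2nd are dropped.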

Given the fully qualified URL, we will use the requests package to retrieve the contents of the file hosted there. Pretty straightforward.

In [None]:
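def write_file(url):
    # Stream the response so the whole file is never held in memory.
    req = requests.get(url, stream=True)
    req.raise_for_status()
    local_filename = url.split("/")[-1]
    # Assumes the local 'data/' directory already exists.
    with open('data/' + local_filename, 'wb') as f:
        for chunk in req.iter_content(chunk_size=1024):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)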

Collect the URLs of all hourly files matching the pattern we expect for January 1st, 2016, then download the first one and write it to local disk:

In [None]:
all_pageviews = list(get_pageviews('2016', '01', '01'))
write_file(all_pageviews[0])

# Moving data to HDFS

HDFS provides a CLI for common operations. We would like to move our data without leaving Python. We have two options to do this:

* Use the subprocess package to shell out commands to the HDFS CLI, and run from a machine where that CLI is installed.
* Use Ibis' HDFS client (a wrapper around WebHDFS) to move the data.

Using subprocess always feels like a hack, so let's go with the more straightforward HDFS connection:

In [None]:
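# Local directory holding the downloaded .gz files (joined in a system-independent way).
local_data_path = os.path.join(os.getcwd(), "pageviews-gz")

def mv_files(filename, hdfs_dir, hdfs_conn):
    # One HDFS directory per hourly file, named after the file minus '.gz'.
    # HDFS paths always use '/', so join with it explicitly.
    dir_name = '/'.join([hdfs_dir.rstrip('/'), filename[:-3]])
    hdfs_conn.mkdir(dir_name)
    filepathtarget = '/'.join([dir_name, filename])
    hdfs_conn.put(filepathtarget, os.path.join(local_data_path, filename))
    return dir_name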
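
For comparison, the subprocess route could look roughly like the sketch below. It assumes the `hdfs` command line tool is installed and on the PATH of the machine running the notebook. The helper names `hdfs_put_commands` and `run_hdfs_put` are hypothetical. The commands are built separately from being run so they can be inspected first.

```python
import subprocess


def hdfs_put_commands(local_path, hdfs_dir):
    """Build the HDFS CLI invocations needed to upload one local file."""
    return [
        # Create the target directory (and any missing parents).
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # Copy the local file into that directory.
        ["hdfs", "dfs", "-put", local_path, hdfs_dir],
    ]


def run_hdfs_put(local_path, hdfs_dir):
    """Run the commands; raises CalledProcessError if one of them fails."""
    for cmd in hdfs_put_commands(local_path, hdfs_dir):
        subprocess.run(cmd, check=True)


# Inspect the commands without executing them.
commands = hdfs_put_commands(
    "pageviews-gz/pageviews-20160101-000000.gz",
    "/user/srowen/pageviews-20160101-000000",
)
for cmd in commands:
    print(" ".join(cmd))
```

Passing the arguments as a list avoids going through a shell, which sidesteps quoting problems with unusual file names.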

Now that the data resides in HDFS, we would like to transform it so that we have fast access for analysis. The current format has a few problems:

* Information contained in file name.
* Many GZipped files.
* The underlying files are text.
* The text files contain space delimited data.
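
To make the current format concrete, here is a minimal sketch that reads a gzipped, space-delimited file with the standard library. It assumes each line has four fields in the order used by the schema later in this notebook (project name, page name, view count, byte count). The sample rows are made up so the sketch runs without a download, and `read_pageview_rows` is a hypothetical helper.

```python
import gzip
import os
import tempfile


def read_pageview_rows(path, limit=5):
    """Return up to `limit` parsed rows from a gzipped pageview file."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(" ")
            if len(fields) != 4:
                continue  # skip malformed lines rather than guess
            project, page, views, n_bytes = fields
            rows.append((project, page, int(views), int(n_bytes)))
            if len(rows) >= limit:
                break
    return rows


# Write a tiny made-up file; real files are far larger.
sample_path = os.path.join(tempfile.mkdtemp(), "pageviews-20160101-000000.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("en Main_Page 42 0\n")
    f.write("de Hauptseite 7 0\n")

sample_rows = read_pageview_rows(sample_path)
print(sample_rows)
```

Note that the hour the rows belong to appears nowhere in the rows themselves. It lives only in the file name, which is the first problem in the list above.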

In general, especially with analytics, we want:

* Scans/aggregations over columns.
* Good, splittable compression.
* A binary format with performant encodings.
* Support for complex nested data structures.

My default recommendation is Parquet. It is a binary file format designed for Hadoop and fast column-oriented aggregations. We can use it in conjunction with a splittable compression codec like LZO.

We can do this transformation with either Ibis (Impala) or (Py)Spark. Parquet files generated by Spark are not compatible with Impala, so for the sake of our sanity we will use Ibis to do this transformation. The steps will be:

* Read in files as a temporary table in Ibis.
* Extract time information from file path and add to table.
* Insert data into permanent table in Impala.

## Read Data from an HDFS directory

We can define the schema of the files we want to read in and create a temporary table for Ibis to read from.

In [None]:
file_schema = ibis.schema([('project_name', 'string'),
                           ('page_name', 'string'),
                           ('n_views', 'int64'),
                           ('n_bytes', 'int64')])


tmp_table = ibis_conn.delimited_file(hdfs_dir=data_dir,
                                     schema=file_schema,
                                     delimiter=' ')

## Create New Columns

We can create new named columns using the 'mutate' method. Here, year, month, day, and hour are string values.

In [None]:
def extract_datetime(filename):
    _, date_str, time_str = filename.split("-")
    year = date_str[:4]
    month = date_str[4:6]
    day = date_str[-2:]
    hour = time_str[:2]
    return year, month, day, hour

year, month, day, hour = extract_datetime(data_dir.split("/")[-1])

# create a column for year, month, day and hour.
tmp_w_time = tmp_table.mutate(year=year,
                              month=month,
                              day=day,
                              hour=hour)
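
The slicing in `extract_datetime` trusts that the directory name has exactly the expected shape. As a hedged alternative, the sketch below uses `datetime.strptime` so a malformed name raises a `ValueError` instead of silently producing wrong values. The function name `extract_datetime_checked` is hypothetical. It still returns strings, matching the columns created above.

```python
from datetime import datetime


def extract_datetime_checked(dir_name):
    """Parse 'pageviews-YYYYMMDD-HHMMSS' into (year, month, day, hour) strings.

    Raises ValueError if the name does not match the expected pattern.
    """
    ts = datetime.strptime(dir_name, "pageviews-%Y%m%d-%H%M%S")
    return (ts.strftime("%Y"), ts.strftime("%m"),
            ts.strftime("%d"), ts.strftime("%H"))


# Same call pattern as above: take the last component of the HDFS path.
example_dir = "/user/srowen/pageviews-20160101-010000"
parts = extract_datetime_checked(example_dir.split("/")[-1])
print(parts)
```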

## Hive Metastore
Ibis allows us to interrogate the Hive metastore. We can determine whether databases or tables exist by using functions defined directly on the Ibis connection.

It is useful for us to determine if a database exists and then create it if it does not. 

In [None]:
if not ibis_conn.exists_database(db_name):
    ibis_conn.create_database(db_name)

working_db = ibis_conn.database(db_name)

## Insert Data
We can then create a table from an ibis expression or insert more data into a table with the same schema.

In [None]:
if 'wiki_pageviews' in working_db.tables:
    ibis_conn.insert('wiki_pageviews', tmp_w_time, database=db_name)
else:
    ibis_conn.create_table('wiki_pageviews', obj=tmp_w_time,
                           database=db_name)

Then, wrap this all up in a function so we can use it in a list comprehension.

In [None]:
def gz_2_data_insert(data_dir, ibis_conn, db_name):
    tmp_table = ibis_conn.delimited_file(hdfs_dir=data_dir,
                                         schema=file_schema,
                                         delimiter=' ')
    year, month, day, hour = extract_datetime(data_dir.split("/")[-1])
    # create a column for year, month, day and hour
    tmp_w_time = tmp_table.mutate(year=year, month=month, day=day, hour=hour)

    # create the database if it does not exist yet, then fetch it
    if not ibis_conn.exists_database(db_name):
        ibis_conn.create_database(db_name)
    working_db = ibis_conn.database(db_name)

    if 'wiki_pageviews' in working_db.tables:
        ibis_conn.insert('wiki_pageviews', tmp_w_time, database=db_name)
    else:
        ibis_conn.create_table('wiki_pageviews', obj=tmp_w_time,
                               database=db_name)

In [None]:
local_files = os.listdir(local_data_path)
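# There is an Impala daemon on each machine. You likely want to change
# the host name to match the machine you are working on.
hdfs_conn = ibis.hdfs_connect(host='cdh1.c.guerilla-python.internal')
# Assumption: the Impala connection is not defined anywhere above, so it is
# created here against the same host, reusing the HDFS client.
ibis_conn = ibis.impala.connect(host='cdh1.c.guerilla-python.internal',
                                hdfs_client=hdfs_conn)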
db_name = 'u_srowen.wikipageviews'
hdfs_dir = '/user/srowen'

hdfs_gz_dirs = [mv_files(filename, hdfs_dir, hdfs_conn) for filename in local_files]
[gz_2_data_insert(data_dir, ibis_conn, db_name) for data_dir in hdfs_gz_dirs]