<h1 align="center">Visualizing Twitter data with Blaze and Bokeh</h1>

## 1.1 Tweet volume and hashtags - PySpark locally

### Objectives:

- Show the advantages of wrapping Spark with [Blaze](http://blaze.pydata.org/docs/latest/index.html)
    - `Table`
    - `head`
    - `selection`
    - `map`
    - `by`
    - `sort`

- Visualize tweet volume and top hashtag by date with [Bokeh](http://bokeh.pydata.org/)
    - `scatter`
    - `line`
    - `hover`



In [1]:
from blaze import *

We know the data has the following schema. We write it in shortened form rather than the full datashape syntax.

In [2]:
schema="""{longitude: float64,
latitude: float64,
dateTime: string,
userid: int64,
text: string}"""

## 1. Tweet volume and hashtags

In [3]:
# `coerce` was removed in Python 3, so we define a small replacement that
# converts both arguments to the type of their sum.
# Python 3 has only one string type, so str() is all you need for strings.
def coerce(x, y):
    t = type(x + y)
    return (t(x), t(y))
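
As a quick illustration of what this helper does, here is a minimal sketch that repeats the definition so it runs on its own. The sample values `1` and `2.5` are made up for the example.

```python
def coerce(x, y):
    # Same helper as in the cell above: cast both values to the type of x + y.
    t = type(x + y)
    return (t(x), t(y))

# int + float gives a float, so both values come back as floats.
pair = coerce(1, 2.5)
print(pair)  # (1.0, 2.5)
```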

In [4]:
sc

<pyspark.context.SparkContext at 0x7f4cf412d080>

In [5]:
rdd = sc.textFile("file:///lus/snx11141/jsparks/blog-spark-kmeans/tweets.csv")

In [6]:
# Check that we get the expected number of tweets
rdd.count()

215229

In [7]:
# Same operation on Blaze tables
import datetime as dt
# Create the schema
ds = dshape("var * {longitude: float64, latitude: float64, place: string, country: string, dateTime: string, userid: int64, text: string}")
d = Data('/lus/snx11141/jsparks/blog-spark-kmeans/tweets.csv',sep=',', dshape=ds)

In [8]:
# Get the first tweet from the raw Spark RDD
rdd.take(1)

['31.189324,30.0109738,Giza Egypt,مصر,Tue Nov 24 15:12:53 CST 2015,2307225594,@AnasAlaa6 دي المخروبة القديمة بتاعتي']
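
To make the link between this raw line and the schema explicit, the sketch below splits the string into the seven named fields using only the standard library. It is not how Blaze or Spark parse the file. The helper names `parse_tweet` and `parse_timestamp` are hypothetical, and the code assumes that only the last field (`text`) may contain commas.

```python
import csv
from datetime import datetime

# Field names taken from the datashape used for the Blaze Data object.
FIELDS = ["longitude", "latitude", "place", "country", "dateTime", "userid", "text"]

raw = "31.189324,30.0109738,Giza Egypt,مصر,Tue Nov 24 15:12:53 CST 2015,2307225594,@AnasAlaa6 دي المخروبة القديمة بتاعتي"

def parse_tweet(line):
    """Hypothetical helper: turn one CSV line into a typed record."""
    values = next(csv.reader([line]))
    # Assumption: any extra commas belong to the trailing text field.
    values = values[:6] + [",".join(values[6:])]
    record = dict(zip(FIELDS, values))
    record["longitude"] = float(record["longitude"])
    record["latitude"] = float(record["latitude"])
    record["userid"] = int(record["userid"])
    return record

def parse_timestamp(stamp):
    """Hypothetical helper: parse the timestamp, ignoring the zone token."""
    parts = stamp.split()
    # Drop the fifth token ("CST"); zone abbreviations do not parse portably.
    return datetime.strptime(" ".join(parts[:4] + parts[5:]), "%a %b %d %H:%M:%S %Y")

record = parse_tweet(raw)
when = parse_timestamp(record["dateTime"])
print(record["country"], record["userid"], when.date())
```

Dropping the time zone keeps the example portable, at the cost of treating every timestamp as naive local time.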

In [9]:
# Count the tweets in the same file, via Blaze
d.count()

In [10]:
d.dshape

dshape("""var * {
  longitude: float64,
  latitude: float64,
  place: string,
  country: string,
  dateTime: string,
  userid: int64,
  text: string
  }""")

In [11]:
d

Unnamed: 0,longitude,latitude,place,country,dateTime,userid,text
0,-35.206311,-5.811433,Natal Rio Grande do Norte,Brasil,Tue Nov 24 15:12:53 CST 2015,255228769,I'm at Midway Mall in Natal RN https://t.co/l...
1,-119.176996,34.182791,Oxnard CA,United States,Tue Nov 24 15:12:53 CST 2015,1429460480,Ariel camacho af
2,-5.63956,42.581287,Valverde de la Virgen Castilla y León,España,Tue Nov 24 15:12:53 CST 2015,924016321,En lo malo se conoce a los buenos🍀🙅🏽 @ Virgen ...
3,-62.965661,-40.783581,Buenos Aires Argentina,Argentina,Tue Nov 24 15:12:53 CST 2015,3354477083,Como algo y nada qcyo
4,-51.171934,-30.0136,Porto Alegre Rio Grande do Sul,Brasil,Tue Nov 24 15:12:53 CST 2015,339796957,É impressionante como ela me acalma e me faz e...
5,51.324703,35.736273,Islamic Republic of Iran,جمهوری اسلامی ایران,Tue Nov 24 15:12:53 CST 2015,187139551,@_albaloo_ جبران کنیم :))
6,-115.040119,36.066723,Henderson NV,United States,Tue Nov 24 15:12:53 CST 2015,2252314255,Interested in a #Sales #job near #Henderson N...
7,-93.55783,44.857207,Chanhassen MN,United States,Tue Nov 24 15:12:53 CST 2015,2194738604,#CustomerService in #Chanhassen MN: Technical...
8,-117.396156,33.953349,Riverside CA,United States,Tue Nov 24 15:12:53 CST 2015,27313171,If you're a #Nursing professional in #Riversid...
9,-58.855186,-27.480223,Capital - Corrientes Argentina,Argentina,Tue Nov 24 15:12:53 CST 2015,197537837,💅💋🌸 @ club Boca unidos https://t.co/bynRlra1MR


In [16]:
# Number of tweets from the US
t = d[d.country == 'United States'].count()
t

In [38]:
s=by(d.country, count=d.country.count())
s.sort('count',ascending=False)


Unnamed: 0,country,count
174,United States,49938
79,Indonesia,27447
25,Brasil,17877
98,Malaysia,11346
7,Argentina,10884
200,日本,10562
170,Türkiye,8459
107,México,7047
133,Republika ng Pilipinas,6933
197,ประเทศไทย,6724
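
For readers unfamiliar with the selection and `by`/`sort` steps used above, the following plain-Python sketch shows the same logic on a tiny list: filter and count, then group, count, and order by descending count. The sample values are made up and stand in for the `country` column. This is only an illustration of the semantics, not what Blaze executes on Spark.

```python
from collections import Counter

# Made-up sample standing in for the `country` column.
countries = [
    "United States", "Brasil", "United States",
    "Indonesia", "Brasil", "United States",
]

# Selection followed by count, like d[d.country == 'United States'].count()
us_tweets = sum(1 for c in countries if c == "United States")

# Group by country and count, like by(d.country, count=d.country.count())
counts = Counter(countries)

# Sort by count, descending, like s.sort('count', ascending=False)
ranked = counts.most_common()

print(us_tweets)  # 3
print(ranked)     # [('United States', 3), ('Brasil', 2), ('Indonesia', 1)]
```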


### Interacting with Spark
We'd like to flatten these lists of strings. Unfortunately, Blaze doesn't currently support this (it's not a standard relational algebra operation), so we'll have to rely on raw Spark. Fortunately, the raw data structures are never far away: we drop down to Spark, perform the flattening, and then return to Blaze.

In [None]:
countries = d[d.country != 'None']
countries2 = countries[['country','count']].map(lambda x, y)
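
To show what "flattening lists of strings" means here, this is a minimal plain-Python sketch that extracts hashtags from a few tweet texts and flattens the per-tweet lists into one list. In Spark the corresponding operation is `flatMap` on an RDD. The regular expression `#\w+` is an assumed, simplistic definition of a hashtag, and the sample texts are the truncated ones shown in the table output earlier.

```python
import re
from itertools import chain

# Truncated sample texts taken from the table output above.
texts = [
    "Interested in a #Sales #job near #Henderson N...",
    "#CustomerService in #Chanhassen MN: Technical...",
    "Ariel camacho af",
]

# Assumption: a hashtag is '#' followed by one or more word characters.
HASHTAG = re.compile(r"#\w+")

# One list of hashtags per tweet (possibly empty).
nested = [HASHTAG.findall(text) for text in texts]

# Flatten the list of lists into a single list of hashtags.
flat = list(chain.from_iterable(nested))

print(nested)
print(flat)
```

The flattened list can then be counted per hashtag in the same way the tweets were counted per country.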

In [15]:
!date

Wed Nov 25 11:09:39 CST 2015
