In [4]:
%load_ext watermark
%watermark

11/02/2015 22:10:55

CPython 2.7.10
IPython 4.0.0

compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system     : Linux
release    : 3.13.0-66-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit


**TWEETS HEATMAP OF MURCIA**

**1. Getting the tweets**

*To get the tweets, we use the Twitter Streaming API. [Tweepy](http://tweepy.readthedocs.org/en/latest/getting_started.html), one of many Python wrappers for the Twitter API, does an outstanding job of capturing tweets from the Streaming API.*

Here is the code I used. I filtered the tweets by location: Twitter accepts a bounding box given as a list of coordinates with the following structure:

**locations=[sw_longitude, sw_latitude, ne_longitude, ne_latitude]**

*(Interesting how some APIs use (lat, lon) and others (lon, lat). We need a standard.)*
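
As a small illustration of that ordering (a sketch, not part of the capture script; `twitter_bbox` is a hypothetical helper name), the box can be built from south-west and north-east corners written as (lat, lon) pairs:

```python
def twitter_bbox(sw, ne):
    """Build a Twitter-style bounding box from two (lat, lon) corners.

    Returns [sw_longitude, sw_latitude, ne_longitude, ne_latitude].
    """
    sw_lat, sw_lon = sw
    ne_lat, ne_lon = ne
    if sw_lat > ne_lat or sw_lon > ne_lon:
        raise ValueError('south-west corner must be below and left of north-east')
    return [sw_lon, sw_lat, ne_lon, ne_lat]


# Corners of the Murcia box used in this notebook, written as (lat, lon)
murcia_sw = (37.951741, -1.157420)
murcia_ne = (38.029126, -1.081202)
murcia_box = twitter_bbox(murcia_sw, murcia_ne)
```

Keeping the swap in one place makes it harder to mix up the two conventions later on.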

In [None]:
import json
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener


# Replace these placeholder strings with your own Twitter app credentials
ckey = 'YOUR_CONSUMER_KEY_HERE'
csecret = 'YOUR_CONSUMER_SECRET_HERE'
atoken = 'YOUR_TWITTER_APP_TOKEN_HERE'
asecret = 'YOUR_TWITTER_APP_SECRET_HERE'

murcia = [-1.157420, 37.951741, -1.081202, 38.029126]  # Check it out, it is a very nice city!

file =  open('tweets.txt', 'a')

class listener(StreamListener):

    def on_data(self, data):
        # Twitter returns data in JSON format - we need to decode it first
        try:
            decoded = json.loads(data)
        except Exception as e:
            print e #we don't want the listener to stop
            return True
   
        if decoded.get('geo') is not None:
            location = decoded.get('geo').get('coordinates')
        else:
            location = '[,]'
        text = decoded['text'].replace('\n',' ')
        user = '@' + decoded.get('user').get('screen_name')
        created = decoded.get('created_at')
        tweet = '%s|%s|%s|%s\n' % (user, location, created, text)
        
        file.write(tweet)
        print tweet
        return True

    def on_error(self, status):
        print status

if __name__ == '__main__':
    print 'Starting'
    
    auth = OAuthHandler(ckey, csecret)
    auth.set_access_token(atoken, asecret)
    twitterStream = Stream(auth, listener())
    twitterStream.filter(locations=murcia)

**Run it...and wait.**

The script will capture all the tweets that fall within the bounding box we set up.

One important thing to notice is that the API is not 100% accurate in the data it returns. I found several geocoded tweets that did not fall inside the specified box.
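
One way to guard against this is to re-check each coordinate pair against the box before keeping it. A minimal sketch, assuming coordinates arrive as (lat, lon) as in the `geo` field used by the listener; `in_bbox` is a hypothetical helper:

```python
def in_bbox(lat, lon, bbox):
    """Return True if (lat, lon) lies inside bbox.

    bbox uses Twitter's order: [sw_lon, sw_lat, ne_lon, ne_lat].
    """
    sw_lon, sw_lat, ne_lon, ne_lat = bbox
    return sw_lat <= lat <= ne_lat and sw_lon <= lon <= ne_lon


murcia = [-1.157420, 37.951741, -1.081202, 38.029126]
inside = in_bbox(37.98, -1.12, murcia)           # a point inside the box
outside = in_bbox(38.068789, -1.192239, murcia)  # a point outside the box
```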

Since the script has to keep running to capture tweets, you can run it on a spare computer if you have one. Alternatively, consider online services such as [RedHat](http://www.redhat.com/) or [PythonAnywhere](https://www.pythonanywhere.com/), or rent your own tiny machine in the cloud with services like [Digital Ocean](digitalocean.com) or [Amazon Web Services](aws.amazon.com).

<hr>

Now we have a file containing one tweet per line. Each line has the following structure:

*@USER|[LAT, LON]|TIMESTAMP|TWEET*
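
A single line in this format can be split back into its fields. A minimal sketch (`parse_line` is a hypothetical helper, and the sample lines are made up); it splits on the first three `|` only, so a `|` inside the tweet text stays in the text:

```python
def parse_line(line):
    """Split one 'user|[lat, lon]|timestamp|text' line into a dict."""
    user, geo, created, text = line.rstrip('\n').split('|', 3)
    lat, lon = None, None
    coords = geo.strip('[]').split(',')
    # Tweets without coordinates were written as '[,]'
    if len(coords) == 2 and coords[0].strip() and coords[1].strip():
        lat, lon = float(coords[0]), float(coords[1])
    return {'user': user, 'lat': lat, 'lon': lon,
            'timestamp': created, 'text': text}


geo_tweet = parse_line('@someone|[37.9798, -1.06197]|Tue Apr 15 02:23:32 +0000 2014|hi | there\n')
plain_tweet = parse_line('@someone|[,]|Tue Apr 15 02:24:44 +0000 2014|JAJAJA\n')
```

A line with fewer than four fields raises `ValueError` here, which is one way to spot malformed records.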

Now we turn it into a more usable file.

In [None]:
import pandas as pd
import numpy as np

tweets_raw = pd.read_table('tweets.txt', header=None, iterator=True)

while True:
    try:
        tweets = tweets_raw.get_chunk(10000)
    except StopIteration:
        break  # the whole file has been processed
    tweets.columns = ['tweets']
    tweets['len'] = tweets.tweets.apply(lambda x: len(x.split('|')))
    tweets[tweets.len < 4] = np.nan
    del tweets['len']
    tweets = tweets[tweets.tweets.notnull()]
    tweets['user'] = tweets.tweets.apply(lambda x: x.split('|')[0])
    tweets['geo'] = tweets.tweets.apply(lambda x: x.split('|')[1])
    tweets['timestamp'] = tweets.tweets.apply(lambda x: x.split('|')[2])
    tweets['tweet'] = tweets.tweets.apply(lambda x: x.split('|')[3])
    tweets['lat'] = tweets.geo.apply(lambda x: x.split(',')[0].replace('[',''))
    tweets['lon'] = tweets.geo.apply(lambda x: x.split(',')[1].replace(']',''))
    del tweets['tweets']
    del tweets['geo']
    tweets['lon'] = tweets.lon.convert_objects(convert_numeric=True)
    tweets['lat'] = tweets.lat.convert_objects(convert_numeric=True)
    tweets.to_csv('tweets.csv', mode='a', header=False,index=False)

In [1]:
import pandas as pd
tweets = pd.read_csv('tweets.csv',header=False)
tweets.columns = ['a','user','timestamp','tweet','lat','lon']
del tweets['a']
tweets.head()



Unnamed: 0,user,timestamp,tweet,lat,lon
0,@AlexLuna_72,Tue Apr 15 02:23:01 +0000 2014,Ahora mismo si tuviese una pistola me pegaba u...,37.977438,-1.063001
1,@Gissell_Tytta,Tue Apr 15 02:23:22 +0000 2014,Hola. Visto 05:36. Ms que Visto 05:38. I...,,
2,@69Rikifriki,Tue Apr 15 02:23:32 +0000 2014,Buenas noches a todos... Fav ;) juju,37.9798,-1.06197
3,@Gissell_Tytta,Tue Apr 15 02:24:44 +0000 2014,JAJAJA,,
4,@AndreaGalian,Tue Apr 15 02:25:22 +0000 2014,Salimos para a corua :),38.068789,-1.192239


**NOW WE FILTER THE TWEETS WITH LAT/LON FITTING IN THE DESIRED BOUNDING BOX OF MURCIA**

In [2]:
min_lon = -1.157420
max_lon = -1.081202
min_lat = 37.951741
max_lat = 38.029126

tweets = tweets[(tweets.lat.notnull()) & (tweets.lon.notnull())]

tweets = tweets[(tweets.lon > min_lon) & (tweets.lon < max_lon) & (tweets.lat > min_lat) & (tweets.lat < max_lat)]
print(tweets.shape[0])

104981


**POINT MAP**

Looking good! 

First, we will clone the code in [Eric Fischer's repository](https://github.com/ericfischer/datamaps):

*(Eric currently works at Mapbox; you might have seen his [maps](https://www.mapbox.com/blog/mapping-millions-of-dots/) before.)*

`git clone https://github.com/ericfischer/datamaps.git`

To use Eric's code to produce the map, the file needs to follow the structure:

40.711017,-74.011017

40.710933,-74.011250

40.710867,-74.011400

40.710783,-74.011483

40.710650,-74.011500

40.710517,-74.011483

So we need to write another file containing just lat,lon pairs.

In [4]:
def create_point(lat, lon, tweet):
    point =  { "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"Tweet": tweet}
        }
    return point

import json
with open('murcia.geojson') as file:
    geo = json.load(file)

map_file = open('map_lines.txt', 'w')

for i, row in tweets.iterrows():
    # Rows without coordinates were already dropped by the bounding-box filter
    map_line = '{},{}\n'.format(row.lat, row.lon)
    # GeoJSON coordinates must be numbers, so pass the floats, not strings
    geo['features'].append(create_point(row.lat, row.lon, row.tweet))
    map_file.write(map_line)
map_file.close()

# This file is for use on GitHub
with open('murcia2.geojson', 'w') as file:
    json.dump(geo, file)

In [22]:
# Sentiment scoring with Repustate
import requests
api_key = "YOUR_REPUSTATE_API_KEY_HERE"

repustate_url = 'https://api.repustate.com/v2/{}/score.json'.format(api_key)
def get_sentiment(tweet):
    params = {'text':tweet, 'lang':'es'}
    response = requests.get(repustate_url, params=params, verify=True )
    return response

In [23]:
a = get_sentiment(tweets.tweet[272474])

SSLError: [Errno 1] _ssl.c:510: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Now we go to the terminal

`git clone https://github.com/ericfischer/datamaps`

`cd datamaps`

`make`

*copy the txt file with the tweets to the datamaps folder*

`cat map_lines.txt | ./encode -o tweets -z 17`

`./render -A -- tweets 16 37.951741 -1.157420 38.029126 -1.081202 > tweets.png`

`./render -A -- tweets 17 37.951741 -1.157420 38.029126 -1.081202 > tweets.png`
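
Note that `render` is given the box as min lat, min lon, max lat, max lon, the opposite order to Twitter's `locations`. A sketch of building the command arguments from the Twitter-style box, assuming only the `render` flags shown above; `render_args` is a hypothetical helper whose result could be passed to something like `subprocess.call`:

```python
def render_args(name, zoom, bbox):
    """Arguments for datamaps' render, from a [sw_lon, sw_lat, ne_lon, ne_lat] box."""
    sw_lon, sw_lat, ne_lon, ne_lat = bbox
    coords = [sw_lat, sw_lon, ne_lat, ne_lon]  # render is called with lat before lon
    return ['./render', '-A', '--', name, str(zoom)] + ['%.6f' % c for c in coords]


murcia = [-1.157420, 37.951741, -1.081202, 38.029126]
args = render_args('tweets', 17, murcia)
```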

**WORD CLOUD**

In [4]:
tweets = pd.read_table('tweets.txt', header=None)
tweets.columns = ['tweets']
tweets['text'] = tweets.tweets.apply(lambda x: x.split('|')[-1])

In [None]:
import collections
import vincent

vincent.initialize_notebook()

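# Join the tweet texts directly; to_string() would also emit the row index
text = ' '.join(tweets.text.astype(str))
counter = collections.Counter(text.split())
# Compute the count range once, as a float, so that Python 2 integer
# division does not round every word size down to 10
span = float(max(max(counter.values()) - min(counter.values()), 1))
normalize = lambda x: int(x / span * 90 + 10)
word_list = {k: normalize(v) for k, v in counter.items()}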
word = vincent.Word(word_list)
word
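
The size scaling can be checked on its own, without vincent. The sketch below is a variation on the cell above rather than the same formula: it lowercases words and uses plain min-max scaling, so the rarest word gets `smallest` and the most frequent gets `largest`. `word_sizes` is a hypothetical helper:

```python
import collections


def word_sizes(text, smallest=10, largest=100):
    """Count words and scale the counts linearly into [smallest, largest]."""
    counts = collections.Counter(text.lower().split())
    low, high = min(counts.values()), max(counts.values())
    span = float(high - low) or 1.0  # avoid dividing by zero when all counts match
    return {word: int(round(smallest + (n - low) / span * (largest - smallest)))
            for word, n in counts.items()}


sizes = word_sizes('murcia murcia murcia hola hola adios')
```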

#HEATMAP

Using https://github.com/rybo32/heatmap

#PARSE TWEETS (ONLY ONCE)

In [2]:
import pandas as pd
reg = r'^(.*)\|\[(.*),(.*)\]\|(.*?)\|(.*)'
tweets = pd.read_table('tweets.txt', sep=reg, encoding='utf-8',header=False)
del tweets['Unnamed: 0']
del tweets['Unnamed: 6']
tweets.columns = ['user','lat','lon','timestamp','message']
tweets['lon'] = tweets.lon.convert_objects(convert_numeric=True)
tweets['lat'] = tweets.lat.convert_objects(convert_numeric=True)
tweets.to_csv('tweets2.csv',index=False, encoding='utf-8')


from dateutil.parser import parse

def safe_parse(date):
    try:
        return parse(date)
    except Exception as e:
        return None
tweets.timestamp = tweets.timestamp.apply(safe_parse)
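
To see what the regular expression captures, it can be tried on a single line with the standard `re` module. A sketch using made-up sample lines in the format written by the listener:

```python
import re

# Same pattern as in the cell above
reg = r'^(.*)\|\[(.*),(.*)\]\|(.*?)\|(.*)'

line = '@someone|[37.9798, -1.06197]|Tue Apr 15 02:23:32 +0000 2014|Buenas noches'
user, lat, lon, timestamp, message = re.match(reg, line).groups()

# Tweets without coordinates were written with '[,]', giving empty lat/lon
no_geo = re.match(reg, '@someone|[,]|Tue Apr 15 02:24:44 +0000 2014|JAJAJA').groups()
```

When the pattern is used as a separator, the five captured groups presumably end up between an empty leading and an empty trailing column, which would explain why the cell deletes `Unnamed: 0` and `Unnamed: 6`.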

#LOAD TWEETS CSV

In [1]:
import pandas as pd

tweets = pd.read_csv('tweets2.csv', encoding='utf-8', parse_dates=[3],infer_datetime_format=True)
# NOTE: the converted frame is not assigned back, so `tweets` keeps its UTC timestamps
tweets.set_index('timestamp').tz_localize('UTC').tz_convert('Europe/Madrid').reset_index()
tweets.head()

Unnamed: 0,user,lat,lon,timestamp,message
0,@AlexLuna_72,37.977438,-1.063001,2014-04-15 02:23:01,Ahora mismo si tuviese una pistola me pegaba u...
1,@Gissell_Tytta,,,2014-04-15 02:23:22,Hola. Visto 05:36. Ms que Visto 05:38. I...
2,@69Rikifriki,37.9798,-1.06197,2014-04-15 02:23:32,Buenas noches a todos... Fav ;) juju
3,@Gissell_Tytta,,,2014-04-15 02:24:44,JAJAJA
4,@AndreaGalian,38.068789,-1.192239,2014-04-15 02:25:22,Salimos para a corua :)


#we only care about geocoded tweets

In [2]:
min_lon = -1.157420
max_lon = -1.081202
min_lat = 37.951741
max_lat = 38.029126

tweets = tweets[(tweets.lat.notnull()) & (tweets.lon.notnull())]

tweets = tweets[(tweets.lon > min_lon) & (tweets.lon < max_lon) & (tweets.lat > min_lat) & (tweets.lat < max_lat)]
tweets.shape

(104983, 5)

In [3]:
'''
https://github.com/rybo32/heatmap
requires ffmpeg for animations
http://www.ubuntugeek.com/install-ffmpeg-on-ubuntu-14-10-using-ppa.html

example queries
python heatmap.py -b black -p tweets_heatmap -W 1800 -o g1.png -P equirectangular --decay 0.8 -v -r 5 --osm --osm_base=http://a.basemaps.cartocdn.com/dark_all/

'''



In [12]:
with open('tweets_heatmap','w') as file:
    file.write(tweets.to_string(header=False, index=False))

In [16]:
with open('tweets_heatmap_test','w') as file:
    file.write(tweets.head(10000).to_string(header=False, index=False))

#LETS JUST GET THE COMMUTING TIME TWEETS AND TRY TO FIND COMMUTING SPOTS
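
The rule applied in the next cell is: Monday to Friday (`dayofweek` 0 to 4) and an hour in the morning or evening rush. The same test on a single timestamp, as a sketch with the standard library (`is_commute` is a hypothetical helper; `datetime.weekday()` uses the same Monday=0 convention as pandas):

```python
from datetime import datetime

COMMUTE_HOURS = {7, 8, 9, 17, 18, 19}


def is_commute(ts):
    """True for Monday-Friday timestamps whose hour is a commuting hour."""
    return ts.weekday() < 5 and ts.hour in COMMUTE_HOURS


# Twitter's created_at format, as stored in tweets.txt
fmt = '%a %b %d %H:%M:%S +0000 %Y'
tuesday_morning = datetime.strptime('Tue Apr 15 08:23:01 +0000 2014', fmt)
saturday_morning = datetime.strptime('Sat Apr 19 08:23:01 +0000 2014', fmt)
```

The hours tested are whatever the timestamps contain, so UTC timestamps are compared as UTC hours.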

In [49]:
commute_hours = [7,8,9,17,18,19]
commute_days = range(5)
tweets_commute = tweets[['timestamp','lat','lon']].copy()
tweets_commute = tweets_commute[(tweets_commute.timestamp.dt.dayofweek.isin(commute_days))&
                                (tweets_commute.timestamp.dt.hour.isin(commute_hours))]
del tweets_commute['timestamp']
print(tweets_commute.shape)

with open('tweets_heatmap_commute','w') as file:
    file.write(tweets_commute.to_string(header=False, index=False))

#run the heatmap command to get a plot
!python heatmap.py -b black -p tweets_heatmap_commute -W 1800 -o commute.png -P equirectangular --decay 0.8 -v -r 5 --osm --osm_base=http://a.basemaps.cartocdn.com/dark_all/
!python heatmap.py -b black -p tweets_heatmap_commute -W 1800 -o commute1.png -P equirectangular -v -r 10 --osm --osm_base=http://a.basemaps.cartocdn.com/dark_all/

(18052, 2)
      33 ms  // reading points from tweets_heatmap_commute
     138 ms  // read 18052 points
     371 ms  // scale: 0.000146 meters/pixel (763549741.511111 pixels/degree)
     371 ms  // Choosing OSM zoom level 15
     371 ms  // scale: 4.777314 meters/pixel (23301.688889 pixels/degree)
Retrieving 80 tiles...
... done.
     431 ms  // input extent: 37.9514125993,-1.15770915344,38.0294550424,-1.08086784656
     431 ms  // output extent: -959448.636109,-26976.5785173,-957141.200743,-25186.0462905
     431 ms  // creating an appending matrix
     431 ms  // processing data
    8086 ms  // combining coincident points
   10705 ms  // saving image (1781 x 2298)
   13804 ms  // end
      37 ms  // reading points from tweets_heatmap_commute
     157 ms  // read 18052 points
     399 ms  // scale: 0.000146 meters/pixel (763549741.511111 pixels/degree)
     399 ms  // Choosing OSM zoom level 15
     399 ms  // scale: 4.777314 meters/pixel (23301.688889 pixels/degree)
Retrieving 80 til

In [52]:
!python heatmap.py -b black -p tweets_heatmap_commute -W 1800 -o commute2.png -P equirectangular --decay 1 -v -r 10 --osm --osm_base=http://a.basemaps.cartocdn.com/dark_all/

      30 ms  // reading points from tweets_heatmap_commute
     134 ms  // read 18052 points
     372 ms  // scale: 0.000146 meters/pixel (763549741.511111 pixels/degree)
     372 ms  // Choosing OSM zoom level 15
     372 ms  // scale: 4.777314 meters/pixel (23301.688889 pixels/degree)
Retrieving 80 tiles...
... done.
     431 ms  // input extent: 37.9510741971,-1.15813830688,38.0297930833,-1.08043869312
     431 ms  // output extent: -959458.636109,-26986.5785173,-957131.200743,-25176.0462905
     431 ms  // creating a summing matrix
     431 ms  // processing data
   32384 ms  // saving image (1791 x 2308)
   38585 ms  // end


In [59]:
!python heatmap.py -b black -p tweets_heatmap_commute -W 1800 -o commute3.png -P equirectangular --decay 0.9 -r 5 --osm --osm_base=http://a.basemaps.cartocdn.com/dark_all/

Retrieving 80 tiles...
... done.


#LETS SEE WHERE PEOPLE ARE DURING WEEKENDS

In [62]:
weekend_hours = range(24)  # every hour of the day, 0 to 23
weekend_days = [5,6]
tweets_weekend = tweets[['timestamp','lat','lon']].copy()
tweets_weekend = tweets_weekend[(tweets_weekend.timestamp.dt.dayofweek.isin(weekend_days))&
                                (tweets_weekend.timestamp.dt.hour.isin(weekend_hours))]
#del tweets_weekend['timestamp']
print(tweets_weekend.shape)

with open('tweets_heatmap_weekend','w') as file:
    file.write(tweets_weekend[['lat','lon']].to_string(header=False, index=False))

#run the heatmap command to get a plot
!python heatmap.py -b black -p tweets_heatmap_weekend -W 1800 -o weekend.png -P equirectangular --decay 0.8 -r 5 --osm --osm_base=http://a.basemaps.cartocdn.com/dark_all/
!python heatmap.py -b black -p tweets_heatmap_weekend -W 1800 -o weekend2.png -P equirectangular --decay 1 -r 10 --osm --osm_base=http://a.basemaps.cartocdn.com/dark_all/

(32823, 2)
Retrieving 80 tiles...
... done.
Retrieving 80 tiles...
... done.


##WE TRY AN ANIMATION BY HOUR

In [31]:
mkdir -p minute_frames/png minute_frames/data

In [60]:
import os, time
from PIL import Image
from PIL import ImageFont
from PIL import ImageDraw 

hour_format = '%y-%m-%d_%H:00'

def generate(hour, frame_number):
    hour_str = generate_data(hour)
    if hour_str:
        generate_frame(hour_str, frame_number)
        add_time(hour_str, frame_number)

def generate_data(hour): 
    hour_str = hour.strftime(hour_format)
    hour_slice = tweets[['lat', 'lon']][tweets.hour==hour_str]
    with open('minute_frames/data/{}'.format(hour_str), 'w') as f:
        if not hour_slice.empty:
            f.write(hour_slice.to_string(header=False, index=False))
        else:
            return None
    return hour_str

def generate_frame(hour_str, frame_number):
    os.system('''
    python heatmap.py -b black -p minute_frames/data/{} -W 1800 -o minute_frames/png/{}.png\
    -e  37.951741,-1.157420,38.029126,-1.081202\
    -P equirectangular --decay 0.1 -r 10 --osm --osm_base=http://a.basemaps.cartocdn.com/dark_all/
    '''.format(hour_str, frame_number)
    )
    
def add_time(hour_str, frame_number):
    image_name = 'minute_frames/png/{}.png'.format(frame_number)
    img = Image.open(image_name)
    draw = ImageDraw.Draw(img)
    font_size = 50
    font_src = '/usr/share/texlive/texmf-dist/fonts/truetype/public/opensans/OpenSans-Bold.ttf'
    font = ImageFont.truetype(font_src, font_size)
    draw.text((0, 0),hour_str,(255,255,255),font=font)
    img.save(image_name)
    
hours = pd.date_range(tweets.timestamp.min(),tweets.timestamp.max(), freq='h')
tweets['hour'] =  tweets.timestamp.apply(lambda x: x.strftime(hour_format))   

In [63]:
for frame_number, hour in enumerate(hours):
    try:
        generate(hour, frame_number)
    except Exception as E:
        print(E)
    time.sleep(1)

In [19]:
from minute_frames.images2gif import writeGif

In [16]:
import os
from scipy.misc import imread, imresize

images = os.listdir('./minute_frames/png/')
images = map(lambda x: int(x.split('.')[0]), images)
images.sort()
images = map(lambda x: '{}.png'.format(x), images)

images_list = []
for image in images:
    im = imread('./minute_frames/png/{}'.format(image))
    im = imresize(im, 0.5)
    images_list.append(im)
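
The cell above relies on Python 2, where `map` returns a list that can be sorted in place. A version-independent sketch of the same numeric ordering (`sort_frames` is a hypothetical helper and the frame names are made up):

```python
def sort_frames(names):
    """Sort '<number>.png' names by their number, not alphabetically."""
    pngs = [n for n in names if n.endswith('.png')]
    return sorted(pngs, key=lambda n: int(n.split('.')[0]))


frames = sort_frames(['10.png', '2.png', '0.png', 'notes.txt'])
```

Sorting by the integer keeps frame 2 ahead of frame 10, which a plain string sort would not.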

In [21]:
writeGif('anim.gif', images_list, duration=0.1)

##DATAMAPS

In [2]:
import pandas as pd
tweets1 = pd.read_csv('tweets.csv', encoding='utf-8', header=None,  parse_dates=[4],infer_datetime_format=True)
tweets1.columns = ['a', 'user', 'timestamp', u'message', u'lat', u'lon']
del tweets1['a']
tweets2 = pd.read_csv('tweets2.csv', encoding='utf-8', parse_dates=[3],infer_datetime_format=True)
tweets3 = pd.read_csv('tweets3.csv', encoding='utf-8', parse_dates=[3],infer_datetime_format=True)

tweets_comb = pd.concat([tweets1, tweets2, tweets3])
tweets_comb = tweets_comb[(tweets_comb.lat.notnull())&(tweets_comb.lon.notnull())]
tweets_comb.drop_duplicates(inplace=True)
tweets_comb['timestamp'] = pd.to_datetime(tweets_comb.timestamp)
tweets_comb = tweets_comb.set_index('timestamp').tz_localize('UTC').tz_convert('Europe/Madrid').reset_index()
tweets_comb['hour' ]= tweets_comb.timestamp.dt.hour



In [3]:
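def get_hour_color(hour):
    # NOTE: this early return gives every point the same value (50);
    # the hour-based branches below are never reached
    return 50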
    if 0<=hour<16:
        return 0
    elif 16<=hour<=23:
        return 170
tweets_comb['hour_color'] = tweets_comb.hour.apply(get_hour_color)
tweets_comb['hour_color'] = tweets_comb['hour_color'].astype(int)

###Render datamap

In [4]:
map_file = open('map_lines.txt', 'w')

for i , row in tweets_comb.iterrows():
    if len(str(row.lat)) > 0 and len(str(row.lon)) > 0:
        lon = str(row.lon).strip()
        map_line = '{},{} :{}\n'.format(row.lat,lon, row.hour_color)
        map_file.write(map_line)
map_file.close()
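
Each line written by the loop above is `lat,lon` followed by ` :value` for the colour. Note that the `len(str(...)) > 0` test is always true, since even a missing value prints as `nan`. A small sketch that formats one line and really skips missing coordinates (`datamaps_line` is a hypothetical helper, and the 0 to 255 check is an assumption based on the `-m8` and `-C256` flags used in the next cell):

```python
import math


def datamaps_line(lat, lon, value=None):
    """Format one point as 'lat,lon' or 'lat,lon :value'; None if unusable."""
    if lat is None or lon is None or math.isnan(lat) or math.isnan(lon):
        return None
    if value is None:
        return '{},{}\n'.format(lat, lon)
    if not 0 <= value <= 255:
        raise ValueError('value must fit in 8 bits')
    return '{},{} :{}\n'.format(lat, lon, int(value))


line = datamaps_line(37.977438, -1.063001, 50)
skipped = datamaps_line(float('nan'), -1.063001, 50)
```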

In [5]:
%%bash
cd datamaps
mv ../map_lines.txt .
rm -rf tweets
cat map_lines.txt | ./encode -m8 -o tweets -z 17
./render -C256  -A -- tweets 17 37.951741 -1.157420 38.029126 -1.081202 > tweets.png




Read 0.4 million records
Sorting 425460 shapes of 1 point(s), zoom level 0
Sorting part 1 of 1
Merging: 0% ... 61%



`git clone https://github.com/ericfischer/datamaps`

`cd datamaps`

`make`

*copy the txt file with the tweets to the datamaps folder*

`cat map_lines.txt | ./encode -o tweets -z 17`

*In case you are using colors:*

`./render -C256 -A -- tweets 17 37.951741 -1.157420 38.029126 -1.081202 > tweets.png`

*Otherwise:*

`./render -A -- tweets 17 37.951741 -1.157420 38.029126 -1.081202 > tweets.png`

###Download OSM background map

In [42]:
from heatmap import _get_osm_image

In [55]:
bbox = [37.951741, -1.157420, 38.029126, -1.08120]
zoom = 17
osm_base = 'http://a.basemaps.cartocdn.com/dark_all/'

def _get_osm_image(bbox, zoom, osm_base=''):
    # Just a wrapper for osm.createOSMImage to translate coordinate schemes
    from osmviz.manager import PILImageManager, OSMManager
    osm = OSMManager(
           image_manager=PILImageManager('RGB'),
            server=osm_base)
        
    image, bounds = osm.createOSMImage((bbox[0], bbox[2], bbox[1], bbox[3]), zoom)
    (lat1, lat2, lon1, lon2) = bounds
    return image

In [56]:
im = _get_osm_image(bbox, zoom, osm_base)
im.save('datamap.png')

Retrieving 1073 tiles...
... done.
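
The number of tiles grows quickly with the zoom level. A rough way to estimate it beforehand is the standard OpenStreetMap slippy-map tile formula, sketched below; `tile_xy` and `tile_grid` are hypothetical helpers, and osmviz may count tiles slightly differently, so treat the result as an estimate to compare with the "Retrieving ... tiles" message:

```python
import math


def tile_xy(lat, lon, zoom):
    """Slippy-map tile indices containing (lat, lon) at the given zoom."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


def tile_grid(bbox, zoom):
    """Columns and rows of tiles covering [min_lat, min_lon, max_lat, max_lon]."""
    min_lat, min_lon, max_lat, max_lon = bbox
    x0, y0 = tile_xy(max_lat, min_lon, zoom)  # top-left tile
    x1, y1 = tile_xy(min_lat, max_lon, zoom)  # bottom-right tile
    return x1 - x0 + 1, y1 - y0 + 1


cols, rows = tile_grid([37.951741, -1.157420, 38.029126, -1.08120], 17)
estimated_tiles = cols * rows
```

Each extra zoom level roughly quadruples the count, so a lower zoom is much cheaper to download.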
