### LOAD AND PARSE THE TWEETS FILE

In [1]:
%load_ext watermark
%watermark

11/03/2015 18:36:43

CPython 2.7.10
IPython 4.0.0

compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system     : Linux
release    : 3.13.0-66-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit


The Tweepy stream handler writes to a file containing one tweet per line. Each line has the following structure:

*@USER | [LAT,LON] | TIMESTAMP | TWEET*
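
To make the line structure concrete, here is a minimal sketch of parsing a single line. The helper name `parse_line` and the sample line are hypothetical, made up for illustration; limiting the split to three separators is also an assumption, chosen so that a `|` inside the tweet text is not lost.

```python
def parse_line(line):
    """Split one raw line into (user, lat, lon, timestamp, tweet), or None if malformed."""
    # Split at most 3 times so a '|' inside the tweet text stays in the last field
    fields = line.rstrip('\n').split('|', 3)
    if len(fields) < 4:
        return None
    user, geo, timestamp, tweet = [f.strip() for f in fields]
    try:
        lat, lon = [float(v) for v in geo.strip('[]').split(',')]
    except ValueError:
        # Not geocoded, or not parseable as exactly two numbers
        lat, lon = None, None
    return user, lat, lon, timestamp, tweet

# Hypothetical sample line, made up to match the structure described above
sample = '@someuser|[38.0,-1.1]|2015-03-29 15:58:23|hello | world'
print(parse_line(sample))
```

Lines with fewer than four fields return `None`, and lines whose coordinates cannot be read keep their text but get empty coordinates.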

Next, we turn it into a more usable file.

Since the tweets file was so big, I processed it in chunks instead of loading all of it into memory at once.

In [None]:
import pandas as pd

# Read the raw file lazily in chunks instead of loading it all into memory
tweets_raw = pd.read_table('tweets.txt', header=None, chunksize=10000)

# Iterating over the reader stops cleanly once the file is exhausted
for tweets in tweets_raw:
    tweets.columns = ['tweets']
    # Keep only well-formed lines with at least 4 '|'-separated fields
    tweets = tweets[tweets.tweets.apply(lambda x: len(x.split('|'))) >= 4].copy()
    tweets['user'] = tweets.tweets.apply(lambda x: x.split('|')[0])
    tweets['geo'] = tweets.tweets.apply(lambda x: x.split('|')[1])
    tweets['timestamp'] = tweets.tweets.apply(lambda x: x.split('|')[2])
    # Split at most 3 times so a '|' inside the tweet text is not cut off
    tweets['tweet'] = tweets.tweets.apply(lambda x: x.split('|', 3)[3])
    # geo looks like [LAT,LON]: split on the comma and strip the brackets
    tweets['lat'] = tweets.geo.apply(lambda x: x.split(',')[0].replace('[',''))
    tweets['lon'] = tweets.geo.apply(lambda x: x.split(',')[1].replace(']',''))
    del tweets['tweets']
    del tweets['geo']
    # Coerce coordinates to floats; values that cannot be parsed become NaN
    tweets['lon'] = pd.to_numeric(tweets.lon, errors='coerce')
    tweets['lat'] = pd.to_numeric(tweets.lat, errors='coerce')
    # Append the parsed chunk to the output CSV (no header row)
    tweets.to_csv('tweets.csv', mode='a', header=False, index=False)
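
The cells that follow work on a single DataFrame whose `timestamp` column is already `datetime64`, but the step that reads `tweets.csv` back is not shown. A minimal sketch of how that could look, assuming the CSV has no header row and the columns are in the order written above; `load_parsed` and the demo rows are hypothetical, and the real timestamp format may need extra parsing options.

```python
import io
import pandas as pd

# Column order as written by the parsing loop above
COLUMNS = ['user', 'timestamp', 'tweet', 'lat', 'lon']

def load_parsed(path_or_buffer):
    """Read the header-less parsed CSV and parse the timestamp column as datetimes."""
    return pd.read_csv(path_or_buffer, header=None, names=COLUMNS,
                       parse_dates=['timestamp'])

# Stand-in for 'tweets.csv': two made-up rows in the same layout
demo_csv = io.StringIO(
    u'@user_a,2015-03-29 15:58:23,hello,38.0,-1.1\n'
    u'@user_b,2015-03-29 16:01:29,"hi, there",37.99,-1.12\n'
)
demo_tweets = load_parsed(demo_csv)
print(demo_tweets.dtypes)
```

With the real data, the call would be `load_parsed('tweets.csv')`.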

In [31]:
tweets.shape

(181987, 5)

In [9]:
tweets.dtypes

user                 object
timestamp    datetime64[ns]
tweet                object
lat                 float64
lon                 float64
dtype: object

In [33]:
# convert time zone from UTC to Spain time for later time-of-day analyses
# (this returns a new DataFrame; it is not assigned here, so `tweets` is unchanged)
tweets.set_index('timestamp').tz_localize('UTC').tz_convert('Europe/Madrid').reset_index()
tweets.head()

|   | user | timestamp | tweet | lat | lon |
|---|------|-----------|-------|-----|-----|
| 1 | @IkiduAuren | 2015-03-29 15:58:23 | Feliz tarde de Domingo. http://t.co/jxL7v5zFwd | 38.842937 | -0.115407 |
| 3 | @monorex2 | 2015-03-29 15:59:37 | Good afternoon:-D:-D | 38.026032 | -1.208355 |
| 4 | @Santos_Poveda | 2015-03-29 16:01:29 | @InkUtv @OilVirgin @RiobuenoRafael @NinaNebo @... | 38.095896 | -1.181909 |
| 6 | @anittaaML | 2015-03-29 16:03:46 | @caarmens98 te voy a reportar por hj de p | 37.992632 | -1.1977 |
| 7 | @helenatovar0210 | 2015-03-29 16:04:46 | Lucha por lo qe quieres qe les joda a los qe h... | 38.055685 | -1.081301 |
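
`tz_localize` and `tz_convert` return new objects rather than modifying the data in place, so the converted timestamps have to be assigned to be kept. A minimal sketch of one way to do that, using the `.dt` accessor on the column instead of moving it to the index; the `demo` frame is made up for illustration.

```python
import pandas as pd

# Made-up frame with naive UTC timestamps, standing in for the parsed tweets
demo = pd.DataFrame({
    'timestamp': pd.to_datetime(['2015-03-29 15:58:23', '2015-01-15 12:00:00']),
    'user': ['@user_a', '@user_b'],
})

# Mark the naive timestamps as UTC, convert to Madrid time, and keep the result
demo['timestamp'] = (demo['timestamp']
                     .dt.tz_localize('UTC')
                     .dt.tz_convert('Europe/Madrid'))

# Local hour of day: UTC+2 in summer time, UTC+1 in winter
print(demo['timestamp'].dt.hour.tolist())  # → [17, 13]
```

Working on the column directly also leaves the column order untouched, which `set_index(...).reset_index()` would change.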


Since we want to build a heatmap, we only keep the tweets that are geocoded and whose latitude and longitude fall within the Murcia area.

In [34]:
min_lon = -1.157420
max_lon = -1.081202
min_lat = 37.951741
max_lat = 38.029126

tweets = tweets[(tweets.lat.notnull()) & (tweets.lon.notnull())]

tweets = tweets[(tweets.lon > min_lon) & (tweets.lon < max_lon) & (tweets.lat > min_lat) & (tweets.lat < max_lat)]
tweets.shape

(95384, 5)

Finally, we save the parsed tweets to use with [heatmap.py](http://www.sethoscope.net/heatmap/)

In [40]:
cd ../heatmap

/media/manuel/DATA/Backup/Proyectos/tweepy murcia/heatmap


In [42]:
# write one "lat lon" pair per line, without header or index
with open('tweets_heatmap','w') as f:
    f.write(tweets[['lat','lon']].to_string(header=False, index=False))
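
`to_string` pads the columns with spaces so that they line up. If plain single-space separation is preferred, `to_csv` with `sep=' '` is an alternative. This is only a sketch, assuming the downstream tool accepts one whitespace-separated `lat lon` pair per line; the `demo` frame and the temporary path are made up for illustration.

```python
import os
import tempfile
import pandas as pd

# Made-up coordinates standing in for the filtered tweets
demo = pd.DataFrame({'user': ['@user_a', '@user_b'],
                     'lat': [38.0, 37.99],
                     'lon': [-1.1, -1.12]})

# Write only the coordinates, one space-separated pair per line
path = os.path.join(tempfile.mkdtemp(), 'tweets_heatmap')
demo[['lat', 'lon']].to_csv(path, sep=' ', header=False, index=False)

with open(path) as f:
    lines = f.read().splitlines()
print(lines)
```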