In [4]:
import csv
import numpy as np
import pandas as pd
import pyspark as ps

In [7]:
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Dataset analysis for Dataset 1

This notebook contains an initial analysis of the swiss-tweets dataset that was provided on the cluster.

We first preform the analysis on the sample set using pandas and then show how to use spark to scale up.

## Fields contained

We first read the schema to see what fields are contained:

In [6]:
schema = pd.read_table('twitter-swisscom/schema.txt', header=None, delim_whitespace=True, index_col=0,
                       names=['name', 'type', 'specification', '??', 'format'] )
schema

Unnamed: 0,name,type,specification,??,format
1,id,bigint(20),UNSIGNED,No,
2,userId,bigint(20),UNSIGNED,No,
3,createdAt,timestamp,No,0000-00-00,00:00:00
4,text,text,utf8_unicode_ci,No,
5,longitude,float,Yes,,
6,latitude,float,Yes,,
7,placeId,varchar(25),utf8_general_ci,Yes,
8,inReplyTo,bigint(20),UNSIGNED,Yes,
9,source,int(10),UNSIGNED,No,
10,truncated,bit(1),No,,


While we have no exact information of what exactly each column contains, we can infer it form the column names.

For out analysis, the following columns are potentialy useful:
- userId: to see if some users occur frequently
- createdAt: to see when the tweet was created, so we can look into seasonal changes
- text: the actual tweet
- longitude & latitude / placeLongitude & placeLatitude: giving us the exact location of the tweet
- followersCount & friendCounts: to see how sociable or integrated in twitter a user is
- userLocation: to give us the location of the user


We see that we have no way of dirrectly seeing the language of the tweet.

## Dataset analysis

We now take a look at the sample tsv provided, containing a smaller subset of the data.

In [7]:
#after previous reads we saw that \N was used as NA value
#tsv -> use \t as separator
#use schema names for column name
twitter_df = pd.read_table('twitter-swisscom/sample.tsv', 
              sep='\t', engine='c', encoding='utf-8', quoting=csv.QUOTE_NONE,
              header=None, names=schema.name, na_values='\\N')

In [16]:
twitter_df.dtypes

id                 object
userId            float64
createdAt          object
text               object
longitude         float64
latitude          float64
placeId            object
inReplyTo         float64
source            float64
truncated          object
placeLatitude      object
placeLongitude     object
sourceName         object
sourceUrl          object
userName           object
screenName         object
followersCount     object
friendsCount      float64
statusesCount     float64
userLocation       object
dtype: object

In [11]:
twitter_df.head()

Unnamed: 0,id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation
0,776522983837954049,7.354492e+17,2016-09-15 20:48:01,se lo dici tu... https://t.co/x7Qm1VHBKL,,,51c0e6b24c64e54e,,1.0,,46.0027,8.96044,Twitter for iPhone,http://twitter.com/#!/download/iphone,plvtone filiae.,hazel_chb,146,110.0,28621.0,Earleen.
1,776523000636203010,2741686000.0,2016-09-15 20:48:05,https://t.co/noYrTnqmg9,,,4e7c21fd2af027c6,,1.0,,46.8131,8.22414,Twitter for iPhone,http://twitter.com/#!/download/iphone,samara,letisieg,755,2037.0,3771.0,Suisse
2,776523045200691200,435239200.0,2016-09-15 20:48:15,@BesacTof @Leonid_CCCP Tu dois t'engager en si...,,,12eb9b254faf37a3,7.765221e+17,5.0,,47.201,5.94082,Twitter for Android,http://twitter.com/download/android,lebrübrü❤,lebrubru,811,595.0,30191.0,Fontain
3,776523058404290560,503244200.0,2016-09-15 20:48:18,@Mno0or_Abyat اشوف مظاهرات على قانون العمل الج...,,,30bcd7f767b4041e,7.765216e+17,1.0,,45.8011,6.16552,Twitter for iPhone,http://twitter.com/#!/download/iphone,عبدالله القنيص,bingnais,28433,417.0,12262.0,Shargeyah
4,776523058504925185,452805300.0,2016-09-15 20:48:18,Greek night #geneve (@ Emilios in Genève) http...,6.14414,46.1966,c3a6437e1b1a726d,,3.0,,46.2048,6.14319,foursquare,http://foursquare.com,Alkan Şenli,Alkanoli,204,172.0,3390.0,İstanbul/Burgazada


We immediatly see the following things:

- longitude/latitude are often nan, this is note the case for the place_ equivalent
- userLocation contains places outside of Switzerland
- there are a lot of different languages present in the dataset

We look at bit more closely at the nan values:

In [12]:
twitter_df.isnull().sum()

id                  11
userId            1032
createdAt         1032
text               609
longitude         8541
latitude          7928
placeId           1902
inReplyTo         6973
source            1209
truncated         9306
placeLatitude     1209
placeLongitude    1273
sourceName        1209
sourceUrl         1209
userName          1246
screenName        1209
followersCount    1290
friendsCount      1902
statusesCount     1902
userLocation      3541
dtype: int64

- longtiude/latitude contains a lot of nan values
- this is not the case for placeLatitude/placeLongitude, but still 10% nan
- we have Nans in every column, we would have to remove these

As natural language processing methods are at the heart of our project, knowing which language is used in a tweet is essential.

We thus tried to find a way to detect language in a tweet. The langdetect library provides this functionality by making calls to the google translate api.

In [8]:
from langdetect import detect

we try to apply this to the sample set and get an error:

In [9]:
#we try to get the distribution of the languages in the tweets
twitter_df['text'].map(detect).value_counts()

LangDetectException: No features in text.

The issue is that some tweets contain no language or language usable for the classification. We thus have to first remove the tweets containing these values

In [33]:
twitter_df['text'] = twitter_df['text'].astype(str)
twitter_df.text.dropna() #remove NaNs
twitter_df.text = twitter_df.text.str.replace('http\S+|www.\S+', '', case=False)#remove website
twitter_df.text = twitter_df.text.str.replace('@\S+|via', '', case=False)#remove @ and via
twitter_df.text = twitter_df.text.str.replace('\((.+?)\)', '', case=False)#remove content in ()
twitter_df.text = twitter_df.text.str.replace('([^\s\w]|_)+','', case=False)#remove non alphanumeric (needed for language dec)
twitter_df = twitter_df[twitter_df.text.map(lambda x: len(x)) > 0 ] #remove empty strings

In [18]:
twitter_df.text = twitter_df.text[twitter_df.text.map(lambda x:any(c.isalpha() for c in x))]

In [38]:
twitter_df['text'].map(detect).value_counts()

en    2284
fr    1682
de    1148
tl     950
it     751
ar     389
pt     385
cy     324
es     315
ca     254
tr     216
so     122
nl     106
id     102
ja      79
af      75
ru      70
da      53
et      48
fi      46
sv      45
no      45
ro      43
vi      43
pl      27
sl      25
hu      24
hr      22
fa      20
lv      17
sk      17
sq      17
sw      16
th      16
lt      13
mk      10
bg       4
uk       3
cs       3
el       2
ur       2
ko       2
he       1
Name: text, dtype: int64

We can get a sense of the overall distribution of languages in the dataset.
We see that the majority of tweets is either english, french or german.

We note that there is a small issue with the accuracy of detection, sometimes swiss german gets categorized as something other than german:

In [40]:
detect('wie gahts dir') #How are you

'af'

We see that this sentence is categorized as afrikaans

## moving to spark

We also looked into how we could read this in spark:

In [None]:
sc = SparkContext('local', 'pyspark')
sqlContext = SQLContext(sc)

In [8]:
df = sqlContext.read.format('com.databricks.spark.csv')
    .option("delimiter", "\t")
    .options(header='false')
    .load('twitter-swisscom/sample.tsv')

In [9]:
df.show(n=4, truncate=False)

+------------------+------------------+-------------------+--------------------------------------------------------------------------------------------------------------------------------------+---+---+----------------+------------------+---+---+-------+-------+-------------------+-------------------------------------+---------------+---------+-----+----+-----+---------+
|_c0               |_c1               |_c2                |_c3                                                                                                                                   |_c4|_c5|_c6             |_c7               |_c8|_c9|_c10   |_c11   |_c12               |_c13                                 |_c14           |_c15     |_c16 |_c17|_c18 |_c19     |
+------------------+------------------+-------------------+--------------------------------------------------------------------------------------------------------------------------------------+---+---+----------------+------------------+---+---+------