In [None]:
import csv
import numpy as np
import pandas as pd
import pyspark as ps

In [None]:
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Dataset analysis for Dataset 1

This notebook contains an initial analysis of the swiss-tweets dataset that was provided on the cluster.

We first perform the analysis on the sample set using pandas and then show how to use Spark to scale up.

## Fields contained

We first read the schema to see what fields are contained:

In [None]:
# sep=r'\s+' splits on any run of whitespace (equivalent to the deprecated delim_whitespace=True)
schema = pd.read_table('twitter-swisscom/schema.txt', header=None, sep=r'\s+', index_col=0,
                       names=['name', 'type', 'specification', '??', 'format'])
schema

While we have no exact information about what each column contains, we can infer it from the column names.

For our analysis, the following columns are potentially useful:
- userId: to see if some users occur frequently
- createdAt: to see when the tweet was created, so we can look into seasonal changes
- text: the actual tweet
- longitude & latitude / placeLongitude & placeLatitude: giving us the exact location of the tweet
- followersCount & friendCounts: to see how sociable or integrated into Twitter a user is
- userLocation: to give us the location of the user


We see that no column directly gives the language of the tweet.

## Dataset analysis

We now take a look at the sample TSV file provided, which contains a smaller subset of the data.

In [None]:
#after previous reads we saw that \N was used as NA value
#tsv -> use \t as separator
#use schema names for column name
twitter_df = pd.read_table('twitter-swisscom/sample.tsv', 
              sep='\t', engine='c', encoding='utf-8', quoting=csv.QUOTE_NONE,
              header=None, names=schema.name, na_values='\\N')

In [None]:
twitter_df.dtypes

In [None]:
twitter_df.head()

We immediately see the following things:

- longitude/latitude are often NaN; this is not the case for the place equivalents (placeLongitude/placeLatitude)
- userLocation contains places outside of Switzerland
- there are a lot of different languages present in the dataset

We look a bit more closely at the NaN values:

In [None]:
twitter_df.isnull().sum()

- longitude/latitude contain a lot of NaN values
- this is less of a problem for placeLatitude/placeLongitude, although about 10% of their values are still NaN
- we have NaNs in every column, so we would have to remove these
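To compare columns more easily, the missing values can also be expressed as a share of rows rather than a raw count. The following is a minimal sketch on a toy TSV, not on the actual sample file: the four column names and the rows are made up for illustration, and only the `\N` convention for missing values is taken from the dataset.

```python
import csv
import io

import pandas as pd

# Toy TSV standing in for the sample file (hypothetical rows and columns).
# "\\N" in the Python source is the two-character marker \N used for missing values.
raw = "\n".join([
    "1\thello\t8.5\t47.3",
    "2\tsalut\t\\N\t\\N",
    "3\t\\N\t\\N\t\\N",
    "4\thallo\t\\N\t\\N",
])

toy_df = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                     names=['id', 'text', 'longitude', 'latitude'],
                     na_values='\\N', quoting=csv.QUOTE_NONE)

# isnull() yields a boolean frame; the column-wise mean is the fraction of missing rows
missing_share = toy_df.isnull().mean()
print(missing_share)
```

On the real data the same call would be `twitter_df.isnull().mean()`, which makes statements such as "about 10% NaN" directly readable from the output.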

As natural language processing methods are at the heart of our project, knowing which language is used in a tweet is essential.

We thus looked for a way to detect the language of a tweet. The langdetect library provides this functionality. It is a Python port of Google's language-detection library and runs locally, without making calls to an external API.

In [None]:
from langdetect import detect

We try to apply this to the sample set and get an error:

In [None]:
#we try to get the distribution of the languages in the tweets
twitter_df['text'].map(detect).value_counts()

The issue is that some tweets contain no text from which a language can be detected, or are missing altogether. We thus first have to clean the text and remove these tweets:

In [None]:
twitter_df = twitter_df.dropna(subset=['text']).copy() #remove NaNs before casting, otherwise they become the string 'nan'
twitter_df['text'] = twitter_df['text'].astype(str)
twitter_df.text = twitter_df.text.str.replace(r'http\S+|www\.\S+', '', case=False, regex=True) #remove websites
twitter_df.text = twitter_df.text.str.replace(r'@\S+|\bvia\b', '', case=False, regex=True) #remove @mentions and the word "via"
twitter_df.text = twitter_df.text.str.replace(r'\((.+?)\)', '', case=False, regex=True) #remove content in ()
twitter_df.text = twitter_df.text.str.replace(r'([^\s\w]|_)+', '', case=False, regex=True) #remove non alphanumeric (needed for language detection)
twitter_df = twitter_df[twitter_df.text.str.len() > 0] #remove empty strings

In [None]:
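#keep only rows whose text contains at least one letter (filter the rows, otherwise the dropped entries come back as NaN)
twitter_df = twitter_df[twitter_df.text.map(lambda x: any(c.isalpha() for c in x))]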

In [None]:
twitter_df['text'].map(detect).value_counts()

We now get a sense of the overall distribution of languages in the dataset.
We see that the majority of tweets are in English, French or German.
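The cleaning and detection steps can also be condensed into two small helpers that never raise on unusable input. This is only a sketch under assumptions: `clean_tweet`, `detect_or_none` and `toy_detector` are hypothetical names introduced here, the detector is passed in as an argument so that the example runs without langdetect installed, and catching every exception is a deliberately blunt choice for exploratory work.

```python
import re

# Patterns mirroring the cleaning steps used in this notebook
URL_RE = re.compile(r'http\S+|www\.\S+', flags=re.IGNORECASE)
MENTION_RE = re.compile(r'@\S+|\bvia\b', flags=re.IGNORECASE)
PAREN_RE = re.compile(r'\(.+?\)')
NON_WORD_RE = re.compile(r'(?:[^\s\w]|_)+')


def clean_tweet(text):
    """Strip URLs, @mentions, 'via', parenthesised content and punctuation."""
    for pattern in (URL_RE, MENTION_RE, PAREN_RE, NON_WORD_RE):
        text = pattern.sub('', text)
    return text.strip()


def detect_or_none(text, detector):
    """Return detector(cleaned text), or None if the input is unusable."""
    if not isinstance(text, str):  # e.g. NaN from a missing value
        return None
    cleaned = clean_tweet(text)
    if not any(c.isalpha() for c in cleaned):  # nothing left to classify
        return None
    try:
        return detector(cleaned)
    except Exception:  # broad on purpose: any detection failure maps to None
        return None


def toy_detector(text):
    """Hypothetical stand-in for a real language detector."""
    return 'de' if 'hallo' in text.lower() else 'en'


print(clean_tweet('Hallo @bob (lol) http://t.co/x !!'))
print(detect_or_none('1234 :)', toy_detector))
```

With langdetect available, the real detector could be plugged in as `twitter_df['text'].map(lambda t: detect_or_none(t, detect))`, and `value_counts()` would then simply ignore the `None` entries.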

We note that there is a small issue with the accuracy of the detection: Swiss German sometimes gets categorized as a language other than German:

In [None]:
detect('wie gahts dir') #How are you

We see that this sentence is categorized as Afrikaans.

## Moving to Spark

We also looked into how we could read this file in Spark:

In [None]:
sc = SparkContext('local', 'pyspark')
sqlContext = SQLContext(sc)

In [None]:
# wrap the chained calls in parentheses so the expression can span several lines
df = (sqlContext.read.format('com.databricks.spark.csv')
      .option("delimiter", "\t")
      .options(header='false')
      .load('twitter-swisscom/sample.tsv'))

In [None]:
df.show(n=4, truncate=False)

As an alternative dataset was later provided, we did not perform any further tasks in Spark on this dataset; we will be using the other dataset instead.