# Data exploration


In this first notebook, we start by exploring the data. This should give us more insight into which features are important when building our model. In the CRISP-DM model, understanding the data is one of the first steps towards extracting business value from it. Before this step, we will also look at the existing literature and define the goal of this study.

## 1) Initialize pyspark and inspect the data

#### initialize pyspark

In [1]:
import findspark

# initialize findspark with spark directory

# NOTE: always change this path to your own Spark directory
#path = "/Users/konstantinlazarov/Desktop/Big_Data/PySpark/Week_5/spark"
path = "/Users/Artur/spark"
findspark.init(path) 

# import pyspark
import pyspark
# create spark context
sc = pyspark.SparkContext()
# create spark session 
spark = pyspark.sql.SparkSession(sc)

#### Import necessary packages

In [2]:
# import packages
import os
import re
import ast
import pickle
from datetime import datetime

import requests
import pytz

import numpy as np
import pandas as pd

import pyspark.sql.functions as F
from pyspark.sql.functions import *
from pyspark.sql.types import *

import matplotlib.pyplot as plt
import seaborn as sns

#### Import the twitter data 

In [3]:
# set this path to your own data directory
# (reading in all the files at once raised an error that is not yet understood)
path_json = ".././../Big_Data_Group_3/data/Topic_vegan/*.json"



#### inspect the data

In [4]:
# read all JSON files into one DataFrame
twitter_all = spark.read.option("multiline","true").json(path_json)


In [5]:
# inspect the number of rows
nr_observations = twitter_all.count()
nr_observations

3428559

In [6]:
# inspect the structure of the data
twitter_all.printSchema()

root
 |-- contributors: string (nullable = true)
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- display_text_range: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- display_url: string (nullable = true)
 |    |    |    |-- expanded_url: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- id_str: string (nullable = true)
 |    |    |    |

## 2) Define the goal of our analysis

### 2.1 Literature study

### 2.2 The goal of our research

## 3) Inspect the individual variables of the twitter data

We will now look at all the individual variables of the Twitter data. This will give us more insight into which features will be useful for further analysis.

We start with the 'Tweet Object' data. First, however, we want to clarify its structure:

The Tweet object has a long list of ‘root-level’ attributes, including fundamental attributes such as id, created_at, and text. Tweet objects are also the ‘parent’ object to several child objects. Tweet child objects include user, entities, and extended_entities. Tweets that are geo-tagged will have a place child object.


We also look up the explanation of unclear variables in the data dictionary of the Twitter developer platform:
https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet


Note: If a lot of null values were present, we increased the number in show() temporarily to 100 to gain more insights into the variable.

### Contributors

As contributors consists only of null values, we will drop this variable. It does not add any information to our data.

In [10]:
# inspect the first rows of the variable
twitter_all.select(F.col("contributors")).show(5)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("contributors").isNull()).count()\
        /nr_observations*100, " %")


+------------+
|contributors|
+------------+
|        null|
|        null|
|        null|
|        null|
|        null|
+------------+
only showing top 5 rows

Percentage of null values: 100.0  %


### Coordinates

Definition: Nullable. Represents the geographic location of this Tweet as reported by the user or client application. The inner coordinates array is formatted as geoJSON (longitude first, then latitude)

As coordinates consists almost entirely of null values (99.9%), we will drop this variable. It adds hardly any information to our data.

In [11]:
# inspect the first rows of the variable
twitter_all.select(F.col("coordinates")).show(5)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("coordinates").isNull()).count()\
        /nr_observations*100, " %")

+-----------+
|coordinates|
+-----------+
|       null|
|       null|
|       null|
|       null|
|       null|
+-----------+
only showing top 5 rows

Percentage of null values: 99.8574911500721  %


### Created at

Definition: It indicates when a tweet was created.

This variable does not contain any null values. We will use it in our analysis as it contains useful information.

In [20]:
# inspect the first rows of the variable
twitter_all.select(F.col("created_at")).show(5, truncate = False)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("created_at").isNull()).count()\
        /nr_observations*100, " %")

+------------------------------+
|created_at                    |
+------------------------------+
|Mon Apr 04 10:09:55 +0000 2022|
|Mon Apr 04 10:09:54 +0000 2022|
|Mon Apr 04 10:09:54 +0000 2022|
|Mon Apr 04 10:09:52 +0000 2022|
|Mon Apr 04 10:09:52 +0000 2022|
+------------------------------+
only showing top 5 rows

Percentage of null values: 0.0  %
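
As an illustration of how the `created_at` strings above can be turned into timestamps, here is a small sketch in plain Python. The format string is inferred from the five rows shown above, and `parse_created_at` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# format inferred from values such as "Mon Apr 04 10:09:55 +0000 2022"
TWITTER_TS_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TS_FORMAT)


ts = parse_created_at("Mon Apr 04 10:09:55 +0000 2022")
print(ts.isoformat())
```

Once parsed, features such as the hour of the day or the weekday can be derived from the timestamp.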


### Display text range

This variable does not contain any null values. It gives the start and end character offsets of the displayed text within the tweet, so the difference between the two is the length of the displayed text. We will use it in our analysis as it contains useful information.

In [13]:
# inspect the first rows of the variable
twitter_all.select(F.col("display_text_range")).show(5)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("display_text_range").isNull()).count()\
        /nr_observations*100, " %")

+------------------+
|display_text_range|
+------------------+
|          [0, 139]|
|          [0, 140]|
|          [17, 87]|
|          [0, 139]|
|          [0, 139]|
+------------------+
only showing top 5 rows

Percentage of null values: 0.0  %
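
To make the meaning of the two offsets concrete, the sketch below slices a made-up reply with a range that starts at 17, like the third row above. It assumes that the offsets count Unicode code points and that the end offset is exclusive, which matches Python string slicing; the sample text is invented.

```python
def displayed_text(full_text, display_text_range):
    """Return the part of the tweet that lies inside display_text_range."""
    start, end = display_text_range
    return full_text[start:end]


# hypothetical reply: the leading mention lies outside the displayed range
sample_text = "@trudiebakescake That cake looks great!"
sample_range = [17, 39]

shown = displayed_text(sample_text, sample_range)
print(shown)
```

The length of the displayed text is then simply `end - start`.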


### Entities

Entities provide metadata and additional contextual information about content posted on Twitter. The entities section provides arrays of common things included in Tweets: hashtags, user mentions, links, stock tickers (symbols), Twitter polls, and attached media. These arrays are convenient for developers when ingesting Tweets, since Twitter has essentially pre-processed, or pre-parsed, the text body. 

So, for entities we will look at its subvariables:

1) hashtags = Represents hashtags which have been parsed out of the Tweet text.
2) media = Represents media elements uploaded with the Tweet.
3) symbols = Represents symbols, i.e. $cashtags, included in the text of the Tweet.
4) urls = Represents URLs included in the text of a Tweet.
5) user_mentions = Represents other Twitter users mentioned in the text of the Tweet.

Each of these variables in turn consists of several components. Therefore, we will look at each of them individually and discuss the most interesting subcomponents. Variables that are not discussed will be dropped from our analysis, as we do not think they add any value.
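
One way to turn these arrays into simple presence features is sketched below on a single tweet represented as a Python dictionary. The sample tweet and the helper `entity_flags` are hypothetical; the keys follow the entity names listed above.

```python
ENTITY_KEYS = ["hashtags", "media", "symbols", "urls", "user_mentions"]


def entity_flags(tweet):
    """Return one boolean per entity type: True if at least one item is present."""
    entities = tweet.get("entities") or {}
    # a missing key, a null value and an empty array all count as "not present"
    return {f"has_{key}": bool(entities.get(key)) for key in ENTITY_KEYS}


# made-up tweet with one hashtag and one mention, no media key at all
sample_tweet = {
    "entities": {
        "hashtags": [{"text": "vegan", "indices": [0, 6]}],
        "symbols": [],
        "urls": [],
        "user_mentions": [{"screen_name": "ohmpawatt"}],
    }
}
flags = entity_flags(sample_tweet)
print(flags)
```

Treating null and empty arrays alike matters here, because media is null when absent while the other arrays are empty.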


#### 1) hashtags

The entities section will contain a hashtags array containing an object for every hashtag included in the Tweet body, and include an empty array if no hashtags are present.

Within hashtags, only the variable text could provide any value. This is the name of the hashtag. However, the first rows are all empty arrays, and overall 57.5% of the tweets contain no hashtag.

In [17]:
# inspect the first rows of the variable
twitter_all.select(F.col("entities.hashtags.text")).show(5)

# look at the percentage of empty arrays (tweets without hashtags)
print("Percentage of empty arrays:", twitter_all.filter(F.col("entities.hashtags.text")== F.array()).count()\
        /nr_observations*100, " %")



+----+
|text|
+----+
|  []|
|  []|
|  []|
|  []|
|  []|
+----+
only showing top 5 rows

Percentage of empty arrays: 57.50905847033696  %


#### 2) Media

The entities section will contain a media array containing a single media object if any media object has been ‘attached’ to the Tweet. If no native media has been attached, there will be no media array in the entities. For the following reasons the extended_entities section should be used to process Tweet native media:
+ Media type will always indicate ‘photo’ even in cases of a video and GIF being attached to Tweet.
+ Even though up to four photos can be attached, only the first one will be listed in the entities section.

TODO: It could be interesting to see whether or not media was used in the tweet and what the effect on the reach is. It is not yet clear which subvariables to include, as the media type is always photo and this was the most interesting one for a predictive model. Check the rest on this page: https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/entities



In [27]:
# inspect the first rows of the variable
twitter_all.select(F.col("entities.media")).show(5)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("entities.media").isNull()).count()\
        /nr_observations*100, " %")

+-----+
|media|
+-----+
| null|
| null|
| null|
| null|
| null|
+-----+
only showing top 5 rows

Percentage of null values: 76.36076847445239  %


#### 3) Symbols

Definition: The entities section will contain a symbols array containing an object for every $cashtag included in the Tweet body, and include an empty array if no symbol is present.

It can be interesting to indicate whether or not symbols were included. However, 99.7% of the tweets have an empty symbols array, so symbols are rare in this dataset.

In [54]:
# inspect the first rows of the variable
twitter_all.select(F.col("entities.symbols")).show(5)

# look at the percentage of empty arrays (tweets without symbols)
print("Percentage of empty arrays:", twitter_all.filter(F.col("entities.symbols")== F.array()).count()\
        /nr_observations*100, " %")

+-------+
|symbols|
+-------+
|     []|
|     []|
|     []|
|     []|
|     []|
+-------+
only showing top 5 rows

Percentage of empty arrays: 99.68062384226143  %


#### 4) URLs

Definition: The entities section will contain a urls array containing an object for every link included in the Tweet body, and include an empty array if no links are present.

Again, we can check whether a URL was included in the tweet. About one fifth of the tweets (21.9%) contain at least one URL. Beyond the presence of a URL, the further details of this variable do not seem useful.

In [31]:
# inspect the first rows of the variable
twitter_all.select(F.col("entities.urls")).show(5)

# look at the percentage of empty arrays (tweets without URLs)
print("Percentage of empty arrays:", twitter_all.filter(F.col("entities.urls")== F.array()).count()\
        /nr_observations*100, " %")

+----+
|urls|
+----+
|  []|
|  []|
|  []|
|  []|
|  []|
+----+
only showing top 5 rows

Percentage of empty arrays: 78.05637878770644  %


#### 5) User mentions

The entities section will contain a user_mentions array containing an object for every user mention included in the Tweet body, and include an empty array if no user mention is present.

Again, we can check whether another user is mentioned in the tweet. About 26.1% of the tweets mention no other user.

In [38]:
# inspect the first rows of the variable
twitter_all.select(F.col("entities.user_mentions")).show(5, truncate = False)

# look at the percentage of tweets without user mentions
# (the filter matches empty arrays, which the printed label calls null values)
print("Percentage of null values:", twitter_all.filter(F.col("entities.user_mentions")== F.array()).count()\
        /nr_observations*100, " %")

+--------------------------------------------------------------------------------+
|user_mentions                                                                   |
+--------------------------------------------------------------------------------+
|[{4886284964, 4886284964, [3, 13], Ohmpawat ( คิดถึงBBS ), ohmpawatt}]          |
|[{296107534, 296107534, [3, 17], TRIPLE.N, mynameisnanon}]                      |
|[{755329842, 755329842, [0, 16], Trudie Bellamy 💕😻👀👩🏻‍🚀, trudiebakescake}]|
|[{4886284964, 4886284964, [3, 13], Ohmpawat ( คิดถึงBBS ), ohmpawatt}]          |
|[{4886284964, 4886284964, [3, 13], Ohmpawat ( คิดถึงBBS ), ohmpawatt}]          |
+--------------------------------------------------------------------------------+
only showing top 5 rows

Percentage of null values: 26.13663641197366  %


In [53]:
# idea: look at how often a user is referenced in tweets
# name = display name of the referenced user
# use id instead to get the id of the referenced user
# note: this groups by the whole array of names per tweet
nr_times_referenced = twitter_all.groupBy('entities.user_mentions.name').count()
nr_times_referenced.sort(col('count').desc()).show(5, truncate = False)
                


+-------------------------+------+
|name                     |count |
+-------------------------+------+
|[]                       |896110|
|[E! News]                |28620 |
|[🌱 Vegan Animal Lover ⓥ]|27466 |
|[Angie KaranⓋ🌱🐾🌻🇺🇦] |27130 |
|[ANIMAL ADVOCATE  Ⓥ]     |22022 |
+-------------------------+------+
only showing top 5 rows



### Extended entities

Use this link for the definitions and extra information:
https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/extended-entities

Extra explanation: If a Tweet contains native media (shared with the Tweet user-interface as opposed to via a link to elsewhere), there will also be an extended_entities section. When it comes to any native media (photo, video, or GIF), the extended_entities is the preferred metadata source for several reasons. Currently, up to four photos can be attached to a Tweet. The entities metadata will only contain the first photo (until 2014, only one photo could be included), while the extended_entities section will include all attached photos.

With native media, another deficiency of the entities.media metadata is that the media type will always indicate ‘photo’, even in cases where the attached media is a video or animated GIF. The actual type of media is specified in the extended_entities.media[].type attribute and is set to either photo, video, or animated_gif. For these reasons, if you are working with native media, the extended_entities metadata is the way to go.

We saw that around 76.4% of our tweets have no media object in entities (the value is null). Given the definitions provided by Twitter, it is more useful to examine extended_entities when analyzing media data. The share of null values is the same for both variables, which is what we expected.

In [64]:
# inspect the first rows of the variable
twitter_all.select(F.col("extended_entities.media")).show(5, truncate = False)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("extended_entities.media").isNull()).count()\
        /nr_observations*100, " %")

+-----+
|media|
+-----+
|null |
|null |
|null |
|null |
|null |
+-----+
only showing top 5 rows

Percentage of null values: 76.36076847445239  %


Unlike the other entity arrays, this variable contains null values rather than empty arrays. This is consistent with the documentation quoted above: without native media there is no media array at all.

Next, we look at the media types, as these are interesting to include in our model.

The media type is indeed not always photo. Furthermore, the number of elements in the array indicates how many media items were attached, which we can easily add to the data.


In [70]:
# inspect the first rows of the variable
twitter_all.select(F.col("extended_entities.media.type")).show(5, truncate = False)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("extended_entities.media.type").isNull()).count()\
        /nr_observations*100, " %")

+----+
|type|
+----+
|null|
|null|
|null|
|null|
|null|
+----+
only showing top 5 rows

Percentage of null values: 76.36076847445239  %


In [79]:
# look at the number of media elements
# problem: size() returns -1 when there is no array; not handled yet
twitter_all.select(size(F.col("extended_entities.media.type"))).show(5)

+----------------------------------+
|size(extended_entities.media.type)|
+----------------------------------+
|                                -1|
|                                -1|
|                                -1|
|                                -1|
|                                -1|
+----------------------------------+
only showing top 5 rows



### favorite count

Nullable. Indicates approximately how many times this Tweet has been liked by Twitter users.

This is an interesting variable for the analysis. Note that the check below counts zeros rather than nulls: about 78.1% of the tweets have not been liked at all.

In [85]:
# inspect the first rows of the variable
twitter_all.select(F.col("favorite_count")).show(5)

# look at the percentage of tweets without likes
print("Percentage of zero values:", twitter_all.filter(F.col("favorite_count") ==0).count()\
        /nr_observations*100, " %")

+--------------+
|favorite_count|
+--------------+
|             0|
|             0|
|             0|
|             0|
|             0|
+--------------+
only showing top 5 rows

Percentage of zero values: 78.11229149038998  %


### favorited

Nullable. Indicates whether this Tweet has been liked by the authenticating user. 

We will not include this variable as it is False for every tweet.

In [87]:
# inspect the first rows of the variable
twitter_all.select(F.col("favorited")).show(5)

# look at the percentage of False values
print("Percentage of False values:", twitter_all.filter(F.col("favorited") == False).count()\
        /nr_observations*100, " %")

+---------+
|favorited|
+---------+
|    false|
|    false|
|    false|
|    false|
|    false|
+---------+
only showing top 5 rows

Percentage of False values: 100.0  %


### full_text

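This variable contains the full text of the tweet. Note that the check below compares the text with the boolean False, so it does not count missing texts; it matches a single tweet out of 3,428,559. Many texts start with RT, which most likely indicates a retweet.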

In [88]:
# inspect the first rows of the variable
twitter_all.select(F.col("full_text")).show(5)

# count the tweets whose text compares equal to False
# (note: this does not test for missing text; .isNull() would do that)
print("Percentage of null values:", twitter_all.filter(F.col("full_text") == False).count()\
        /nr_observations*100, " %")

+--------------------+
|           full_text|
+--------------------+
|RT @ohmpawatt: เพ...|
|RT @mynameisnanon...|
|@trudiebakescake ...|
|RT @ohmpawatt: เพ...|
|RT @ohmpawatt: เพ...|
+--------------------+
only showing top 5 rows

Percentage of null values: 2.9166772396216602e-05  %


### geo

Definition: Deprecated. Nullable. Use the coordinates field instead. This deprecated attribute has its coordinates formatted as [lat, long], while all other Tweet geo is formatted as [long, lat].

As Twitter itself advises against this field, we will not use it. Besides, 99.9% of its values are null.

In [90]:
# inspect the first rows of the variable
twitter_all.select(F.col("geo")).show(5)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("geo").isNull()).count()\
        /nr_observations*100, " %")

+----+
| geo|
+----+
|null|
|null|
|null|
|null|
|null|
+----+
only showing top 5 rows

Percentage of null values: 99.8574911500721  %


### id, id_str

The integer representation of the unique identifier for this Tweet. This number is greater than 53 bits and some programming languages may have difficulty/silent defects in interpreting it. Using a signed 64 bit integer for storing this identifier is safe. Use id_str to fetch the identifier to be safe.

So, we will use the id_str variable to identify tweets.
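
The precision issue behind this choice can be illustrated in plain Python. The id below is a made-up value larger than 2**53; converting it to a double, as a parser based on 64-bit floats would, loses the last digits, while the string form round-trips exactly.

```python
# hypothetical tweet id, larger than 2**53
tweet_id = 1510919550011326467
id_str = str(tweet_id)

# what a parser that stores numbers as 64-bit floats would keep
as_float = float(tweet_id)
round_trip = int(as_float)

print(tweet_id > 2**53)
print(round_trip == tweet_id)
print(int(id_str) == tweet_id)
```

Python integers are exact, so the loss only shows up after the detour through a float.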

#### in_reply_to_screen_name

Nullable. If the represented Tweet is a reply, this field will contain the screen name of the original Tweet’s author.

In [96]:
# inspect the first rows of the variable
twitter_all.select(F.col("in_reply_to_screen_name")).show(5)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("in_reply_to_screen_name").isNull()).count()\
        /nr_observations*100, " %")

+-----------------------+
|in_reply_to_screen_name|
+-----------------------+
|                   null|
|                   null|
|        trudiebakescake|
|                   null|
|                   null|
+-----------------------+
only showing top 5 rows

Percentage of null values: 84.66705691808133  %


#### in_reply_to_status_id, in_reply_to_status_id_str

Nullable. If the represented Tweet is a reply, this field will contain the string representation of the original Tweet’s ID. (the tweet itself)

Before, we stated that we will use the string version of the id.

#### in_reply_to_user_id, in_reply_to_user_id_str

Nullable. If the represented Tweet is a reply, this field will contain the string representation of the original Tweet’s author ID. (the author of the tweet)

Before, we stated that we will use the string version of the id.

#### is_quote_status

Indicates whether this is a Quoted Tweet. 

We can see that 5.1% of the tweets are quoted.

In [100]:
# inspect the first rows of the variable
twitter_all.select(F.col("is_quote_status")).show(5)

# look at the percentage of quoted tweets
print("Percentage of quoted tweets:", twitter_all.filter(F.col("is_quote_status") == True).count()\
        /nr_observations*100, " %")

+---------------+
|is_quote_status|
+---------------+
|           true|
|           true|
|          false|
|           true|
|           true|
+---------------+
only showing top 5 rows

Percentage of quoted tweets: 5.110222691223922  %


### lang


Nullable. When present, indicates a BCP 47 language identifier corresponding to the machine-detected language of the Tweet text, or und if no language could be detected.

In [104]:
# inspect the first rows of the variable
twitter_all.select(F.col("lang")).show(5)

# look at the percentage of tweets with an undetermined language
print("Percentage of undetermined language (und):", twitter_all.filter(F.col("lang") == 'und').count()\
        /nr_observations*100, " %")

+----+
|lang|
+----+
|  th|
|  th|
|  en|
|  th|
|  th|
+----+
only showing top 5 rows

Percentage of undetermined language (und): 3.1184238042862904  %


### metadata

This does not contain any valuable information, so it will not be included in our analysis.

In [102]:
# inspect the first rows of the variable
twitter_all.select(F.col("metadata")).show(5)


+------------+
|    metadata|
+------------+
|{th, recent}|
|{th, recent}|
|{en, recent}|
|{th, recent}|
|{th, recent}|
+------------+
only showing top 5 rows



### Place

Nullable. When present, indicates that the tweet is associated with (but not necessarily originating from) a Place. This root-level attribute also has some subvariables that we will explore to learn more about the location of our tweets.

However, it is clear from the start that this attribute also has a lot of null values (98.9%).

In [105]:
# inspect the first rows of the variable
twitter_all.select(F.col("place")).show(5)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("place").isNull()).count()\
        /nr_observations*100, " %")

+-----+
|place|
+-----+
| null|
| null|
| null|
| null|
| null|
+-----+
only showing top 5 rows

Percentage of null values: 98.89901267558761  %


#### 1) coordinates

A series of longitude and latitude points, defining a box which will contain the Place entity this bounding box is related to. Each point is an array in the form of [longitude, latitude]. Points are grouped into an array per bounding box.

In [106]:
# inspect the first rows of the variable
twitter_all.select(F.col("place.bounding_box.coordinates")).show(5)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("place.bounding_box.coordinates").isNull()).count()\
        /nr_observations*100, " %")

+-----------+
|coordinates|
+-----------+
|       null|
|       null|
|       null|
|       null|
|       null|
+-----------+
only showing top 5 rows

Percentage of null values: 98.89901267558761  %
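
Since the bounding box is a list of [longitude, latitude] points, a single representative location can be derived from it. The following is a minimal sketch, assuming a box that does not cross the 180° meridian; the sample coordinates and the helper name are hypothetical.

```python
def bounding_box_center(coordinates):
    """Return (longitude, latitude) of the centre of a place bounding box."""
    # coordinates is a list of rings, each ring a list of [longitude, latitude]
    points = [point for ring in coordinates for point in ring]
    lons = [lon for lon, lat in points]
    lats = [lat for lon, lat in points]
    return ((min(lons) + max(lons)) / 2, (min(lats) + max(lats)) / 2)


# made-up box with four corner points
sample_box = [[[4.0, 50.0], [4.0, 51.0], [5.0, 51.0], [5.0, 50.0]]]
center = bounding_box_center(sample_box)
print(center)
```

Given that almost 99% of the tweets have no place, such a feature would only be available for a small subset.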


#### 2) country

Name of the country containing this place. For example: United States.

In [107]:
# inspect the first rows of the variable
twitter_all.select(F.col("place.country")).show(5)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("place.country").isNull()).count()\
        /nr_observations*100, " %")

+-------+
|country|
+-------+
|   null|
|   null|
|   null|
|   null|
|   null|
+-------+
only showing top 5 rows

Percentage of null values: 98.89901267558761  %


#### 3) place_type

The type of location represented by this place. For example: 'city'

In [109]:
# inspect the first rows of the variable
twitter_all.select(F.col("place.place_type")).show(5)

# look at the percentage of null values
print("Percentage of null values:", twitter_all.filter(F.col("place.place_type").isNull()).count()\
        /nr_observations*100, " %")

+----------+
|place_type|
+----------+
|      null|
|      null|
|      null|
|      null|
|      null|
+----------+
only showing top 5 rows

Percentage of null values: 98.89901267558761  %


### possibly_sensitive

## The volume of a brand's tweets


In [43]:
# so first we look at the total number of tweets per brand. Therefore, I followed the same steps as Viktor did,
# but I need to group the number of tweets per brand

In [10]:
# select interesting features 
twitter_sub = twitter_all.select(F.col("user.name"),
                                 F.col("user.screen_name"),
                                 F.col("created_at"),
                                 F.col("full_text"),
                                 F.col("user.followers_count"))

In [11]:
# remove duplicates and retweets
twitter_processed = twitter_sub.filter(~F.col("full_text").startswith("RT")) \
                               .drop_duplicates() \
                               .cache()
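
The `startswith("RT")` filter above also drops original tweets that merely begin with the letters RT. A slightly stricter check, sketched here in plain Python as an assumption rather than as the method used in this notebook, requires the `RT @user:` prefix that is visible in the full_text rows shown earlier. `is_retweet` is a hypothetical helper.

```python
import re

# "RT", a space, a mention and a colon at the very start of the text
RETWEET_PATTERN = re.compile(r"^RT @\w+:")


def is_retweet(full_text):
    """Return True if the text starts with the usual retweet prefix."""
    return bool(full_text) and RETWEET_PATTERN.match(full_text) is not None


print(is_retweet("RT @ohmpawatt: hello"))
print(is_retweet("RTÉ covers vegan food"))
```

In Spark, the same pattern could be applied with `F.col("full_text").rlike(...)` instead of `startswith`.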

In [12]:
# calculate the number of tweets in our dataset per brand
nr_tweets_brand = twitter_processed.groupBy('name').count()\
                    .withColumnRenamed('count', 'count_tweets')
nr_tweets_brand.toPandas().head(10)

Unnamed: 0,name,count_tweets
0,coopgrafik,10
1,Foxborough Chuck 🏳️‍⚧️🏳️‍🌈💯,1
2,omnigreensa,23
3,veganfestargentina,62
4,634BAGEL（ムサシ・ベーグル）,106
5,Générations Vegan - JM,70
6,Brian Skellenger on Gab,3
7,K,131
8,MareⓋ1111🐾,2
9,レッドゆっくりーマ,594


## The language used in the tweets (words, sensitivity, emojis)

In [14]:
# select interesting features 
twitter_sub = twitter_all.select(F.col("user.name"),
                                 F.col("user.screen_name"),
                                 F.col("created_at"),
                                 F.col("full_text"),
                                 F.col("user.followers_count"))

In [15]:
# 1) Import the necessary packages
!pip install textblob
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import functions as F
from textblob import TextBlob



In [None]:
# second option


In [16]:
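def apply_blob(sentence):
    # return the TextBlob polarity score in [-1, 1]; skip missing texts
    if sentence is None:
        return None
    return float(TextBlob(sentence).sentiment.polarity)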

In [17]:
# register the function as a Spark UDF that returns a double
sentiment = udf(apply_blob, DoubleType())

In [18]:
# apply to the dataframe
twitter_processed.withColumn("sentiment", sentiment(twitter_processed['full_text'])).show(5)


+-------------------+---------------+--------------------+--------------------+---------------+---------+
|               name|    screen_name|          created_at|           full_text|followers_count|sentiment|
+-------------------+---------------+--------------------+--------------------+---------------+---------+
|       Malaury Buis|         BMalau|Mon Apr 04 10:22:...|I inform you that...|           1933|      0.0|
|    PS.Cafe Harding|      PSharding|Mon Oct 18 09:56:...|BREAKFAST - PS KA...|            261|      0.0|
|FairWild Foundation|       fairwild|Mon Oct 18 09:31:...|Announcing the la...|            456|  0.23125|
|             Riddhi|    _Riddhi1609|Sun Oct 17 09:58:...|1.5 hours and alr...|            205|   0.4375|
|   kokorobotanicals|kokorobotanical|Mon Oct 18 10:57:...|Today is #worldch...|             23|      1.0|
+-------------------+---------------+--------------------+--------------------+---------------+---------+
only showing top 5 rows



In [19]:
# plot results
twitter_processed.plot.bar('sentiment') 

AttributeError: 'DataFrame' object has no attribute 'plot'
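
The AttributeError above occurs because the Spark DataFrame used here has no plot accessor. In addition, the sentiment column was only displayed by the previous cell and not stored. A possible way forward, sketched under assumptions, is to keep the result of withColumn, collect the sentiment column to pandas and plot it there. The values below are the five scores shown above and only stand in for the collected column; `scored` is a hypothetical name.

```python
import pandas as pd
import matplotlib.pyplot as plt

# stand-in for: scored.select("sentiment").toPandas()
# ("scored" is a hypothetical name for the DataFrame returned by withColumn)
sentiment_pdf = pd.DataFrame({"sentiment": [0.0, 0.0, 0.23125, 0.4375, 1.0]})

# histogram of the polarity scores over their full range [-1, 1]
fig, ax = plt.subplots()
sentiment_pdf["sentiment"].plot.hist(bins=10, range=(-1, 1), ax=ax)
ax.set_xlabel("sentiment polarity")
ax.set_ylabel("number of tweets")
```

A histogram is used instead of a bar chart because the polarity is a continuous score.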