# Extracting Twitter Data

* The dataset provides only the tweet ID and its emotion label. The code below retrieves the text of each tweet by ID.
* The final dataset is smaller than the original, apparently because some tweets were deleted after the dataset was created.

## Dataset information:

Paper: http://knoesis.org/sites/default/files/wenbo_socialcom_2012_0.pdf

Data: http://knoesis.org/projects/emotion


In [16]:
# Import libraries
import tweepy
import pandas as pd
import time
from extract_twitter_data import import_data, subsetting_emotions, connect_to_twitter_OAuth, get_tweets

__Importing Data__

In [17]:
filelist = [
    './data/test.txt',
    './data/dev.txt',
    './data/train_1.txt',
    './data/train_2_1.txt',
    './data/train_2_10.txt',
    './data/train_2_2.txt',
    './data/train_2_3.txt',
    './data/train_2_4.txt',
    './data/train_2_5.txt',
    './data/train_2_6.txt',
    './data/train_2_7.txt',
    './data/train_2_8.txt',
    './data/train_2_9.txt',
]

In [18]:
df = import_data(filelist)
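
`import_data` lives in `extract_twitter_data` and its source is not shown here. As a rough sketch of what such a loader might look like, the function below reads each file into a two-column frame and concatenates the results. The name `import_data_sketch`, the tab separator, and the absence of a header row are all assumptions for illustration, not details confirmed by the notebook.

```python
import pandas as pd


def import_data_sketch(paths, sep="\t"):
    """Read (id, emotion) rows from each file and stack them into one DataFrame.

    Assumes every file holds one tweet per line, with no header row and the
    tweet ID and emotion label separated by `sep`.
    """
    frames = [
        pd.read_csv(path, sep=sep, header=None, names=["id", "emotion"])
        for path in paths
    ]
    # ignore_index=True produces a single 0..n-1 index across all files
    return pd.concat(frames, ignore_index=True)
```

Called as `import_data_sketch(filelist)`, this would return one frame with an integer `id` column and a string `emotion` column, provided the files match the assumed layout.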

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2488982 entries, 0 to 2488981
Data columns (total 2 columns):
id         int64
emotion    object
dtypes: int64(1), object(1)
memory usage: 38.0+ MB


In [20]:
df.shape

(2488982, 2)

In [21]:
df.emotion.value_counts()

joy             706182
sadness         616471
anger           574170
love            301759
fear            135154
thankfulness    131340
surprise         23906
Name: emotion, dtype: int64

## Subsetting by emotion

* Splitting the data by emotion lets the extraction run in batches, one emotion at a time.

In [22]:
sad = subsetting_emotions(df,'sadness')
sad.shape

(616471, 2)

In [23]:
joy = subsetting_emotions(df,'joy')
joy.shape

(706182, 2)

In [24]:
anger = subsetting_emotions(df,'anger')
anger.shape

(574170, 2)

In [25]:
fear = subsetting_emotions(df,'fear')
fear.shape

(135154, 2)

In [26]:
surprise = subsetting_emotions(df,'surprise')
surprise.shape

(23906, 2)

In [27]:
thankfulness = subsetting_emotions(df,'thankfulness')
thankfulness.shape

(131340, 2)

In [28]:
love = subsetting_emotions(df,'love')
love.shape

(301759, 2)
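
The seven cells above repeat the same call with a different label. A more compact alternative, sketched here under the assumption that `subsetting_emotions` simply filters rows on the `emotion` column (its source is not shown), is to build all subsets in one pass with `groupby`. The helper name `split_by_emotion` is hypothetical.

```python
import pandas as pd


def split_by_emotion(df, column="emotion"):
    """Return a dict mapping each emotion label to the rows carrying that label."""
    # groupby yields (label, sub-frame) pairs; each sub-frame keeps its original index
    return {label: group for label, group in df.groupby(column)}
```

With this, `subsets = split_by_emotion(df)` gives access to `subsets['joy']`, `subsets['sadness']`, and so on, and the row counts of the subsets should add up to `len(df)`.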

## Setting up Twitter API

* Credentials have been removed because this code is shared on GitHub.

In [29]:
# Variables that contain the credentials for accessing the Twitter API

ACCESS_TOKEN =  'input value here'
ACCESS_SECRET = 'input value here'
CONSUMER_KEY = 'input value here'
CONSUMER_SECRET = 'input value here'

# Create API object
api = connect_to_twitter_OAuth(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
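
One way to keep credentials out of a shared notebook without editing placeholder strings is to read them from environment variables. The sketch below is an illustration only: the `TWITTER_` prefix, the variable names, and the helper `load_twitter_credentials` are assumptions, not part of the original code.

```python
import os

REQUIRED = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")


def load_twitter_credentials(env=os.environ, prefix="TWITTER_"):
    """Collect the four credentials from `env`, failing loudly if any is missing."""
    missing = [prefix + name for name in REQUIRED if not env.get(prefix + name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    # Keys in the returned dict match the variable names used in the cell above
    return {name: env[prefix + name] for name in REQUIRED}
```

The result could then be passed on as `connect_to_twitter_OAuth(creds['CONSUMER_KEY'], creds['CONSUMER_SECRET'], creds['ACCESS_TOKEN'], creds['ACCESS_SECRET'])`, where `creds = load_twitter_credentials()`.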

__Get Tweet posts by Tweet ID__

Twitter Reference: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets
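
The source of `get_tweets` is not shown, so the following is only a sketch of how a lookup by ID might be organised. It assumes the IDs are sent in batches of at most 100 (the per-request limit of the v1.1 `statuses/lookup` endpoint) and that deleted tweets are simply absent from the response. The `lookup` argument is a hypothetical callable that takes a list of IDs and returns `(id, text)` pairs; with Tweepy 3.x it could wrap `api.statuses_lookup`, though the notebook does not confirm which call is used.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Sequence, Tuple


def chunked(ids: Sequence[int], size: int = 100) -> Iterator[List[int]]:
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_texts(
    ids: Sequence[int],
    lookup: Callable[[List[int]], Iterable[Tuple[int, str]]],
    size: int = 100,
) -> Dict[int, str]:
    """Look up tweet texts batch by batch; IDs the API no longer returns are skipped."""
    texts: Dict[int, str] = {}
    for batch in chunked(ids, size):
        for tweet_id, text in lookup(batch):
            texts[tweet_id] = text
    return texts
```

Because missing IDs never appear in the result, `len(fetch_texts(...))` can be smaller than the number of IDs requested, which is consistent with the smaller final dataset noted at the top.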

In [30]:
get_tweets(api, joy,'joy')

2020-02-21 07:36:12 - extracting  joy


Rate limit reached. Sleeping for: 417
Rate limit reached. Sleeping for: 422
Rate limit reached. Sleeping for: 449
Rate limit reached. Sleeping for: 434
Rate limit reached. Sleeping for: 396
Rate limit reached. Sleeping for: 384
Rate limit reached. Sleeping for: 349


2020-02-21 09:30:09 done with joy batch.
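
The number of pauses in the log can be sanity-checked with a little arithmetic. The sketch below rests on two assumptions that the notebook does not state: that IDs are looked up 100 per request, and that the endpoint allows 900 requests per 15-minute window (the documented v1.1 `statuses/lookup` limit under user authentication). The helper name `expected_pauses` is hypothetical.

```python
import math


def expected_pauses(n_ids, ids_per_request=100, requests_per_window=900):
    """Return (requests needed, rate-limit pauses expected) for `n_ids` lookups."""
    requests = math.ceil(n_ids / ids_per_request)
    # A pause occurs each time a full window's quota is used up and requests remain
    pauses = max(0, (requests - 1) // requests_per_window)
    return requests, pauses


print(expected_pauses(706182))  # joy subset: (7062, 7)
```

Under these assumptions the joy subset needs 7062 requests and 7 pauses, which agrees with the seven "Rate limit reached" messages above. Counts for later batches may be off by one, since each batch starts partway through a rate-limit window left over from the previous one.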


In [31]:
get_tweets(api,surprise,'surprise')

2020-02-21 10:11:56 - extracting  surprise
2020-02-21 10:14:22 done with surprise batch.


In [32]:
get_tweets(api, thankfulness,'thankfulness')

2020-02-21 10:14:22 - extracting  thankfulness


Rate limit reached. Sleeping for: 335


2020-02-21 10:33:47 done with thankfulness batch.


In [33]:
get_tweets(api, fear,'fear')

2020-02-21 10:33:48 - extracting  fear


Rate limit reached. Sleeping for: 343
Rate limit reached. Sleeping for: 348


2020-02-21 10:59:21 done with fear batch.


In [34]:
get_tweets(api, love,'love')

2020-02-21 10:59:22 - extracting  love


Rate limit reached. Sleeping for: 338
Rate limit reached. Sleeping for: 331
Rate limit reached. Sleeping for: 342


2020-02-21 11:48:28 done with love batch.


In [35]:
get_tweets(api, anger,'anger')

2020-02-21 11:48:28 - extracting  anger


Rate limit reached. Sleeping for: 313
Rate limit reached. Sleeping for: 334
Rate limit reached. Sleeping for: 344
Rate limit reached. Sleeping for: 357
Rate limit reached. Sleeping for: 299
Rate limit reached. Sleeping for: 348


2020-02-21 13:21:43 done with anger batch.


In [36]:
get_tweets(api, sad,'sadness')

2020-02-21 13:21:44 - extracting  sadness


Rate limit reached. Sleeping for: 356
Rate limit reached. Sleeping for: 354
Rate limit reached. Sleeping for: 339
Rate limit reached. Sleeping for: 286
Rate limit reached. Sleeping for: 365
Rate limit reached. Sleeping for: 351
Rate limit reached. Sleeping for: 354


2020-02-21 15:06:11 done with sadness batch.
