# Target

We downloaded two datasets from two different sources and converted each into a CSV file in a format suitable for later use.

Sources:
1. Tweets dataset containing emotions downloaded from: http://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html
2. Emotion classification data downloaded from: https://www.kaggle.com/eray1yildiz/emotion-classification

In [1]:
import pandas as pd
import os

# Preparing dataset 1

Each text in this dataset has an emotion label and a score (called confidence in the code below) that represents the intensity of that emotion. Only tweets with a confidence greater than 0.5 qualify for our dataset, and they have been picked accordingly.
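
The selection rule can be sketched on a small table before it is applied to the real files. The rows below are made up for illustration, and `filter_by_intensity` and the `intensity` column are hypothetical names that do not appear in the notebook itself.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the dataset.
sample = pd.DataFrame(
    {
        "text": ["tweet a", "tweet b", "tweet c"],
        "tag": ["anger", "joy", "fear"],
        "intensity": [0.75, 0.50, 0.20],
    }
)


def filter_by_intensity(df, threshold=0.5):
    """Keep rows whose intensity is strictly greater than the threshold."""
    kept = df[df["intensity"] > threshold]
    return kept[["text", "tag"]].reset_index(drop=True)


filtered = filter_by_intensity(sample)
```

The comparison is strict, so a tweet scored exactly 0.5 is left out, which matches the `> 0.5` rule stated above.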

In [2]:
data_loc = 'data/base'

files = os.listdir(data_loc)
train_files = []
test_files = []

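for file in sorted(files):
    # Only the original .txt rating files are inputs; skip anything else
    # (for example the CSVs written to this folder later on).
    if not file.endswith('.txt'):
        continue
    if 'test' in file:
        test_files.append(file)
    elif 'train' in file or 'dev' in file:
        # The dev split is merged into the training data.
        train_files.append(file)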
        
print(train_files)
print(test_files)

['anger-ratings-0to1.dev.gold.txt', 'anger-ratings-0to1.train.txt', 'fear-ratings-0to1.dev.gold.txt', 'fear-ratings-0to1.train.txt', 'joy-ratings-0to1.dev.gold.txt', 'joy-ratings-0to1.train.txt', 'sadness-ratings-0to1.dev.gold.txt', 'sadness-ratings-0to1.train.txt']
['anger-ratings-0to1.test.gold.txt', 'fear-ratings-0to1.test.gold.txt', 'joy-ratings-0to1.test.gold.txt', 'sadness-ratings-0to1.test.gold.txt']


In [7]:
def get_dataframe(files):
    """Collect tweets with confidence > 0.5 from the given files into a DataFrame."""
    base_text = []
    tags = []
    for file in files:
        file_name = f"{data_loc}/{file}"
        # Read as UTF-8 so emoji survive; undecodable bytes are skipped.
        with open(file_name, encoding="utf-8", errors="ignore") as f:
            lines = [line.split() for line in f]
        # Drop the first two whitespace-separated tokens of each line.
        lines = [line[2:] for line in lines]
        for line in lines:
            # The last two tokens are the emotion tag and its confidence.
            tag = line[-2]
            confidence = line[-1]
            if float(confidence) > 0.5:
                base = ' '.join(line[:-2])
                base_text.append(base)
                tags.append(tag)

    df = pd.DataFrame({'text': base_text, 'tag': tags})
    return df
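
Because `get_dataframe` splits on any whitespace and then discards the first two tokens, the first word of each tweet is dropped whenever the first token is an ID. A minimal sketch of an alternative is shown below. It assumes each line is tab-separated as id, tweet, emotion, score; that layout is an assumption that is not confirmed in this notebook, and `parse_rating_line` is a hypothetical helper.

```python
def parse_rating_line(line):
    """Split one tab-separated line into (text, tag, intensity).

    Assumes the layout id<TAB>tweet<TAB>emotion<TAB>score.
    Returns None for lines that do not have exactly four fields.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return None
    _tweet_id, text, tag, score = fields
    return text, tag, float(score)


# Hypothetical line, made up for illustration.
example = "10001\tSo angry about the delay today\tanger\t0.812\n"
parsed = parse_rating_line(example)
```

Splitting on tabs keeps the tweet text intact, including its first word and its internal spacing.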

In [8]:
train_df = get_dataframe(train_files)
test_df = get_dataframe(test_files)

In [9]:
train_df.head()

Unnamed: 0,text,tag
0,that Rutgers game was an abomination. An affro...,anger
1,I get mad over something so minuscule I try to...,anger
2,I get mad over something so minuscule I try to...,anger
3,eyes have been dilated. I hate the world right...,anger
4,One chosen by the CLP members! MP seats are no...,anger


In [10]:
print(test_df.shape)
test_df.head()

(1505, 2)


Unnamed: 0,text,tag
0,game has pissed me off more than any other gam...,anger
1,@mrsajhargreaves @Melly77 @GaryBarlow if he ca...,anger
2,@mrsajhargreaves @Melly77 @GaryBarlow if he ca...,anger
3,I've been disconnected whilst on holiday 😤 ...,anger
4,I'm furious 😩😩😩,anger


In [12]:
train_df.to_csv('data/base/train.csv', index=False)
test_df.to_csv('data/base/test.csv', index=False)

# Preparing Kaggle dataset

In [28]:
data = pd.read_csv('data/raw/emotion.data')
data.shape

In [30]:
data.head()

Unnamed: 0.1,Unnamed: 0,text,emotions
0,27383,i feel awful about it too because it s my job ...,sadness
1,110083,im alone i feel awful,sadness
2,140764,ive probably mentioned this before but i reall...,joy
3,100071,i was feeling a little low few days back,sadness
4,2837,i beleive that i am much more sensitive to oth...,love


In [32]:
del data['Unnamed: 0']
data.head()
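
A column named `Unnamed: 0` typically appears when a CSV has a first column with an empty header, for example a saved DataFrame index. The sketch below reproduces this on made-up CSV text and removes the column with `drop`. The sample rows are hypothetical, and the actual layout of `emotion.data` is assumed rather than verified here.

```python
import io

import pandas as pd

# Hypothetical CSV text whose first column has an empty header,
# which is what a saved DataFrame index looks like on disk.
raw = io.StringIO(
    ",text,emotions\n"
    "27383,some text,sadness\n"
    "110083,more text,joy\n"
)

frame = pd.read_csv(raw)
# pandas names the header-less column "Unnamed: 0".
columns_before = list(frame.columns)

# drop() returns a new frame without the stray column.
cleaned = frame.drop(columns=["Unnamed: 0"])
columns_after = list(cleaned.columns)
```

Unlike `del`, `drop` leaves the original frame unchanged, which makes the cell safe to run more than once.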

In [34]:
data.to_csv('data/raw/kaggle_emotion_data.csv', index=False)