## Setup 
The first step is to set up file reading. Because the models are evaluated during training by periodically testing on a held-out portion of the data, the official training and development sets can be joined to provide more data.

In [1]:
import pandas as pd
import re


fear_files = ['train/EI-oc-En-fear-train.txt', 'dev/2018-EI-oc-En-fear-dev.txt']
anger_files = ['train/EI-oc-En-anger-train.txt', 'dev/2018-EI-oc-En-anger-dev.txt']
joy_files = ['train/EI-oc-En-joy-train.txt', 'dev/2018-EI-oc-En-joy-dev.txt']
sad_files = ['train/EI-oc-En-sadness-train.txt', 'dev/2018-EI-oc-En-sadness-dev.txt']


Next, create a function that takes a list of file paths as a parameter and combines the files into a single data frame. This is the base frame for the rest of the cleaning procedures.

In [2]:
def combine_data(file_list):
    frames = []
    for f in file_list:
        frames.append(pd.read_csv(f, sep="\t"))
    
    return pd.concat(frames)
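
One detail worth knowing about `pd.concat`: by default it keeps each file's own row labels, so the combined frame ends up with repeated index values. The sketch below uses two small made-up tab-separated strings in place of the real train and dev files (the rows are invented for illustration) to show the difference `ignore_index=True` makes.

```python
import io
import pandas as pd

# Two tiny in-memory stand-ins for the train and dev files (made-up rows).
train_tsv = "ID\tTweet\n1\tfirst tweet\n2\tsecond tweet\n"
dev_tsv = "ID\tTweet\n3\tthird tweet\n"

frames = [pd.read_csv(io.StringIO(text), sep="\t") for text in (train_tsv, dev_tsv)]

# Default concat keeps each file's own row labels, so labels repeat.
stacked = pd.concat(frames)

# ignore_index=True renumbers the rows 0..n-1 instead.
renumbered = pd.concat(frames, ignore_index=True)

print(list(stacked.index))     # [0, 1, 0]
print(list(renumbered.index))  # [0, 1, 2]
```

The repeated labels are harmless for the boolean-mask filtering used later, but they matter for any label-based lookup with `.loc`.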
        

The second step is to strip the fluff from the class labels. `TensorFlow` expects numerical classes, so we pull the numbers out of the 'Intensity Class' column.

In [4]:
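def convert_classes(frame, tri_class=False):
    # pull the first integer out of each 'Intensity Class' label
    int_class = list(map(lambda x: int(re.findall(r'\d+', x)[0]), frame['Intensity Class']))
    frame['Sentiment'] = int_class
    
    if tri_class:
        # collapse four classes into three; any class missing from the mapping would become NaN
        frame['Sentiment'] = frame['Sentiment'].map({0: 0, 1: 1, 2: 1, 3: 2})
    return frame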
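
To make the class conversion concrete, here is a minimal sketch on a few made-up labels. It assumes the 'Intensity Class' values look like `"<digit>: description"`. The label strings below are illustrative and were not copied from the data files.

```python
import re
import pandas as pd

# Made-up labels in the assumed "<digit>: description" format.
labels = pd.Series([
    "0: no fear can be inferred",
    "1: low amount of fear can be inferred",
    "2: moderate amount of fear can be inferred",
    "3: high amount of fear can be inferred",
])

# Pull the leading integer out of each label.
four_class = labels.map(lambda s: int(re.findall(r"\d+", s)[0]))

# Collapse to three classes: merge classes 1 and 2, shift class 3 down to 2.
tri_map = {0: 0, 1: 1, 2: 1, 3: 2}
three_class = four_class.map(tri_map)

# A partial mapping leaves every unlisted class as NaN.
partial = four_class.map({2: 1})

print(four_class.tolist())   # [0, 1, 2, 3]
print(three_class.tolist())  # [0, 1, 1, 2]
print(int(partial.isna().sum()))  # 3
```

`Series.map` with a dictionary replaces every value that is not a key with `NaN`, which is why the mapping has to list all four original classes.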


From here, we need to downsample the frame so that each class is evenly represented. The downsampling function randomly samples rows from each class to match the row count of the least represented class, then shuffles the result.

In [42]:
def downsample(frame):
    # split data across sentiments
    sub_list = []
    for v in pd.unique(frame['Sentiment']):
        sub_list.append(frame.loc[frame['Sentiment'] == v])
    
    # find number of rows for each subset 
    sub_row_lengths = [len(d) for d in sub_list]
    
    downsamp_num = min(sub_row_lengths)
    
    # downsample based off of the smallest subset 
    downsamped = pd.concat([d.sample(downsamp_num) for d in sub_list])
    
    return downsamped.sample(len(downsamped))
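
The same idea can be sketched with `groupby` on a small made-up frame. The data, the `random_state` seed, and the variable names here are assumptions added for illustration. The function above does not fix a seed, so its output differs from run to run.

```python
import pandas as pd

# Made-up imbalanced frame: 5 rows of class 0, 3 of class 1, 2 of class 2.
toy = pd.DataFrame({
    "Tweet": [f"tweet {i}" for i in range(10)],
    "Sentiment": [0] * 5 + [1] * 3 + [2] * 2,
})

# Size of the least represented class.
smallest = toy["Sentiment"].value_counts().min()

# Sample `smallest` rows from every class, then shuffle the combined result.
parts = [group.sample(smallest, random_state=0) for _, group in toy.groupby("Sentiment")]
balanced = pd.concat(parts).sample(frac=1, random_state=0)

print(balanced["Sentiment"].value_counts().to_dict())
```

Passing `random_state` makes the sample reproducible, which helps when comparing models trained on the same balanced subset.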

Wrap this all up into a class and voila...

In [15]:
class DataShogun(object):
    """DataShogun is a data preparation class for data drawn from SemEval.
    """
    def __init__(self, tri_class=False, *paths):
        self.files = paths
        self.combined = self._combine_data(paths)
        self.clean_class = self._convert_classes(self.combined, tri_class)
        self.downsamped = self._downsample(self.clean_class)
        self.final_frame = self.downsamped.rename(columns={"Tweet": "SentimentText"})
        
        
        
    def _combine_data(self, file_list):
        frames = []
        for f in file_list:
            frames.append(pd.read_csv(f, sep="\t"))
    
        return pd.concat(frames)
    
    def _convert_classes(self, frame, tri_class=False):
        int_class = list(map(lambda x: int(re.findall(r'\d+', x)[0]), frame['Intensity Class']))
        frame['Sentiment'] = int_class
        
        if tri_class:
            frame['Sentiment'] = frame['Sentiment'].map({0: 0, 1:1, 2: 1, 3: 2})
        return frame
    
    def _downsample(self, frame):
        # split data across sentiments
        sub_list = []
        for v in pd.unique(frame['Sentiment']):
            sub_list.append(frame.loc[frame['Sentiment'] == v])
        
        # find number of rows for each subset 
        sub_row_lengths = [len(d) for d in sub_list]
        
        downsamp_num = min(sub_row_lengths)
        
        # downsample based off of the smallest subset 
        downsamped = pd.concat([d.sample(downsamp_num) for d in sub_list])
        
        return downsamped.sample(len(downsamped))

Now create a separate DataShogun object for each of the four emotions and write each one to a CSV file.

In [22]:
fear = DataShogun(True, 'train/EI-oc-En-fear-train.txt', 'dev/2018-EI-oc-En-fear-dev.txt')
anger = DataShogun(True, 'train/EI-oc-En-anger-train.txt', 'dev/2018-EI-oc-En-anger-dev.txt')
sad = DataShogun(True, 'train/EI-oc-En-sadness-train.txt', 'dev/2018-EI-oc-En-sadness-dev.txt')
joy = DataShogun(True, 'train/EI-oc-En-joy-train.txt', 'dev/2018-EI-oc-En-joy-dev.txt')
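
Note that `True` has to come first in these calls: since `tri_class` sits before `*paths` in the signature, the first positional argument always binds to `tri_class`. The two stand-in functions below are hypothetical and not part of `DataShogun`. They only illustrate that binding and one possible keyword-only alternative.

```python
def positional_first(tri_class=False, *paths):
    """Mirrors the signature style used above: the flag comes before *paths."""
    return tri_class, paths


def keyword_only(*paths, tri_class=False):
    """Alternative: the flag can only be passed by name."""
    return tri_class, paths


# Forgetting the flag silently swallows the first path as tri_class.
misbound = positional_first("train.txt", "dev.txt")

# With a keyword-only flag, every positional argument is a path.
bound = keyword_only("train.txt", "dev.txt", tri_class=True)

print(misbound)  # ('train.txt', ('dev.txt',))
print(bound)     # (True, ('train.txt', 'dev.txt'))
```

In the first form a forgotten flag raises no error, because a non-empty string is truthy and the first file is quietly dropped from the path list.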

In [23]:
fear.final_frame.to_csv("train/training_fear_tri_class.csv")
anger.final_frame.to_csv("train/training_anger_tri_class.csv")
sad.final_frame.to_csv("train/training_sad_tri_class.csv")
joy.final_frame.to_csv("train/training_joy_tri_class.csv")
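
One thing to watch when writing the frames out: `to_csv` writes the row index by default, and after concatenation and shuffling that index is a set of repeated, out-of-order labels. The sketch below round-trips a small made-up frame through an in-memory buffer to show what comes back with and without `index=False`. The helper `round_trip_columns` is a hypothetical name introduced here.

```python
import io
import pandas as pd

# Made-up frame with a shuffled, non-consecutive index, like a downsampled frame.
toy = pd.DataFrame({"SentimentText": ["a", "b"], "Sentiment": [0, 1]}, index=[7, 3])


def round_trip_columns(frame, **to_csv_kwargs):
    """Write the frame to an in-memory CSV and return the columns read back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)


default_cols = round_trip_columns(toy)
no_index_cols = round_trip_columns(toy, index=False)

print(default_cols)   # ['Unnamed: 0', 'SentimentText', 'Sentiment']
print(no_index_cols)  # ['SentimentText', 'Sentiment']
```

Whether to keep the index depends on whether the downstream loader expects it. Dropping it avoids an extra `Unnamed: 0` column when the file is read back.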

| Sentiment | ID | SentimentText | Affect Dimension | Intensity Class |
|---|---|---|---|---|
| 0 | 217 | 217 | 217 | 217 |
| 1 | 217 | 217 | 217 | 217 |
| 2 | 217 | 217 | 217 | 217 |
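
A per-class count like the one above can be produced with `groupby`. The cell that generated it is not shown, so the following is only a sketch on a made-up frame, not the original command.

```python
import pandas as pd

# Made-up balanced frame with two classes of two rows each.
toy = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "SentimentText": ["w", "x", "y", "z"],
    "Sentiment": [0, 0, 1, 1],
})

# Non-null count of every other column, per class.
counts = toy.groupby("Sentiment").count()

# Plain row count per class.
sizes = toy.groupby("Sentiment").size()

print(counts)
print(sizes.tolist())  # [2, 2]
```

Identical counts in every row confirm that downsampling left each class with the same number of examples.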
