# Analysis of ISEAR Data Set

In [1]:
import cufflinks as cf
import pandas as pd
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

## Reading in the CSV File

First, we read in the CSV file containing the entire data set, with one column for the emotion label and one for the text. We skip malformed lines and drop rows containing NaNs. Finally, we filter out all 'no response' items, which always appear between square brackets.
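
The bracket filter described above can be tried in isolation before running it on the full data set. The following is a minimal sketch using only the standard library's `re` module; the sample strings are hypothetical and are not taken from the data set.

```python
import re

# Bracket pattern: a literal "[", any characters (matched lazily), a literal "]".
NO_RESPONSE = re.compile(r"\[.*?\]")

# Hypothetical sample texts, made up for illustration only.
samples = [
    "Got a big fish in fishing.",
    "[ No response.]",
    "Had an insulting letter from my father.",
    "[ Do not remember any incident.]",
]

# Keep only the texts that contain no bracketed placeholder.
kept = [text for text in samples if not NO_RESPONSE.search(text)]
print(kept)
```

Because the pattern has no capture group, it can be passed to `str.contains` without triggering the match-group warning.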

In [2]:
column_names = ['Emotion', 'Text']
df = pd.read_csv('data.csv', names=column_names, header=None, on_bad_lines='skip')
df = df.dropna()
# Raw string without a capture group: matches any "[...]" placeholder
pattern = r"\[.*?\]"
filtered = df['Text'].str.contains(pattern)
df = df[~filtered]



In [3]:
print("Dimensions are: ", df.shape)
df.head(25)

Dimensions are:  (7430, 2)


| | Emotion | Text |
|---|---|---|
| 0 | joy | When I understood that I was admitted to the U... |
| 1 | fear | I broke a window of a neighbouring house and I... |
| 2 | joy | Got a big fish in fishing. |
| 3 | fear | Whenever I am alone in a dark room, walk alone... |
| 4 | shame | I bought a possible answer to a homework probl... |
| 5 | disgust | I read about a murderer who brutalized his vic... |
| 6 | joy | The day that my boyfriend appeared at home wit... |
| 7 | guilt | I went to a pub with a group of friends (not v... |
| 8 | anger | Had an insulting letter from my father. |
| 10 | fear | I was to be given an audition to get a role. I... |


## Examining the Distribution of Emotion Labels

The chart below shows that the emotion labels are distributed roughly evenly across the whole data set.
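
The balance of the labels can also be checked numerically instead of read off a chart. The sketch below assumes a small, made-up label column (`labels` is a hypothetical stand-in for `df['Emotion']`) and reports the gap between the most and least frequent label; with seven perfectly balanced emotions each share would be 1/7, about 0.143.

```python
import pandas as pd

# Hypothetical label column; in the notebook this would be df['Emotion'].
labels = pd.Series(['joy', 'fear', 'joy', 'anger', 'fear', 'anger', 'joy', 'guilt'])

# Share of each label, as a proportion between 0 and 1.
shares = labels.value_counts(normalize=True)

# A small spread between the largest and smallest share indicates a balanced data set.
spread = shares.max() - shares.min()

print(shares)
print(f"Largest minus smallest share: {spread:.3f}")
```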

In [4]:
# Emotion distributions (normalize=True yields proportions in [0, 1], not percentages)
df['Emotion'].value_counts(normalize=True).iplot(
    kind='bar',
    yTitle='Proportion of texts',
    linecolor='black',
    opacity=0.7,
    color='blue',
    theme='pearl',
    bargap=0.2,
    gridcolor='white',
    title='Distribution of emotions')

## Examining Text Length Distributions

The charts below show the text length distributions, measured in words, for the entire data set and for each individual emotion. Across all emotions, relatively few texts are longer than 40 words. Fear, however, appears to contain the most outliers.

In [5]:
df['len_text'] = df['Text'].astype(str).apply(lambda x: len(x.split()))
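
The observation about texts longer than 40 words can be backed up with a per-emotion summary table. This is a minimal sketch on a hypothetical toy frame (`toy` and its texts are assumptions made for illustration); on the real data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for df; the texts are hypothetical.
toy = pd.DataFrame({
    'Emotion': ['joy', 'joy', 'fear', 'fear'],
    'Text': [
        'Got a big fish in fishing.',
        'word ' * 45,
        'I was alone in a dark room.',
        'word ' * 10,
    ],
})

# Word count per text, split on whitespace as in the cell above.
toy['len_text'] = toy['Text'].str.split().str.len()

# Median length, longest text, and share of texts above 40 words per emotion.
summary = toy.groupby('Emotion')['len_text'].agg(
    median='median',
    longest='max',
    share_over_40=lambda s: (s > 40).mean(),
)
print(summary)
```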

In [6]:
anger = df[df['Emotion']=='anger']
joy = df[df['Emotion']=='joy']
disgust = df[df['Emotion']=='disgust']
guilt = df[df['Emotion']=='guilt']
shame = df[df['Emotion']=='shame']
sadness = df[df['Emotion']=='sadness']
fear = df[df['Emotion']=='fear']
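
As an alternative to one filter per emotion, the subsets can be collected in a dictionary with a single `groupby`. The sketch below is one possible approach, shown on a hypothetical toy frame (`toy` and `by_emotion` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Toy frame standing in for df; the values are hypothetical.
toy = pd.DataFrame({
    'Emotion': ['anger', 'joy', 'anger', 'fear'],
    'len_text': [5, 6, 12, 7],
})

# One sub-frame per emotion label, keyed by the label.
by_emotion = {emotion: group for emotion, group in toy.groupby('Emotion')}

# Equivalent to toy[toy['Emotion'] == 'anger'].
anger = by_emotion['anger']
print(anger)
```

Each value keeps the original index and columns, so it can be used wherever the individually filtered frames are used.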

In [7]:
df['len_text'].iplot(
    kind='hist',
    bins=100,
    xTitle='Text Length (num words)',
    linecolor='black',
    color='black',
    yTitle='Number of Documents',
    title='Text Length Distribution')
    
anger['len_text'].iplot(
    kind='hist',
    bins=100,
    xTitle='Text Length (num words)',
    linecolor='black',
    color='red',
    yTitle='count',
    title='Anger Text Length Distribution')

joy['len_text'].iplot(
    kind='hist',
    bins=100,
    xTitle='Text Length (num words)',
    linecolor='black',
    color='green',
    yTitle='count',
    title='Joy Text Length Distribution')

disgust['len_text'].iplot(
    kind='hist',
    bins=100,
    xTitle='Text Length (num words)',
    linecolor='black',
    color='yellow',
    yTitle='count',
    title='Disgust Text Length Distribution')

guilt['len_text'].iplot(
    kind='hist',
    bins=100,
    xTitle='Text Length (num words)',
    linecolor='black',
    color='blue',
    yTitle='count',
    title='Guilt Text Length Distribution')

shame['len_text'].iplot(
    kind='hist',
    bins=100,
    xTitle='Text Length (num words)',
    linecolor='black',
    color='teal',
    yTitle='count',
    title='Shame Text Length Distribution')

sadness['len_text'].iplot(
    kind='hist',
    bins=100,
    xTitle='Text Length (num words)',
    linecolor='black',
    color='purple',
    yTitle='count',
    title='Sadness Text Length Distribution')

fear['len_text'].iplot(
    kind='hist',
    bins=100,
    xTitle='Text Length (num words)',
    linecolor='black',
    yTitle='count',
    title='Fear Text Length Distribution')