In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [2]:
tweet_df = pd.read_csv('../data/labeled-twitter-hate-speech.csv')

## Exploring the Data Structure

In [3]:
tweet_df.columns

Index([u'_unit_id', u'_golden', u'_unit_state', u'_trusted_judgments',
       u'_last_judgment_at', u'does_this_tweet_contain_hate_speech',
       u'does_this_tweet_contain_hate_speech:confidence', u'_created_at',
       u'orig__golden', u'orig__last_judgment_at', u'orig__trusted_judgments',
       u'orig__unit_id', u'orig__unit_state', u'_updated_at',
       u'orig_does_this_tweet_contain_hate_speech',
       u'does_this_tweet_contain_hate_speech_gold',
       u'does_this_tweet_contain_hate_speech_gold_reason',
       u'does_this_tweet_contain_hate_speechconfidence', u'tweet_id',
       u'tweet_text'],
      dtype='object')

In [6]:
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14509 entries, 0 to 14508
Data columns (total 20 columns):
_unit_id                                           14509 non-null int64
_golden                                            14509 non-null bool
_unit_state                                        14509 non-null object
_trusted_judgments                                 14509 non-null int64
_last_judgment_at                                  14442 non-null object
does_this_tweet_contain_hate_speech                14509 non-null object
does_this_tweet_contain_hate_speech:confidence     14509 non-null float64
_created_at                                        0 non-null float64
orig__golden                                       67 non-null object
orig__last_judgment_at                             0 non-null float64
orig__trusted_judgments                            67 non-null float64
orig__unit_id                                      67 non-null float64
orig__unit_state               

## Insights into Data Structure

Many of the columns are almost entirely null (several have 0 or only 67 non-null values out of 14,509 rows), so they aren't useful and we'll have to drop them. I'm curious what the `golden` designation means, though, so let's explore that.
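As an aside, the null-heavy columns could also be identified programmatically instead of by reading the `info()` output. This is only a minimal sketch: `demo_df` is a small made-up stand-in for `tweet_df`, and the 50% cutoff (`max_null_fraction`) is an arbitrary assumption, not a value used in this notebook.

```python
import numpy as np
import pandas as pd

# Stand-in frame (hypothetical data) with one full, one sparse, and one empty column
demo_df = pd.DataFrame({
    'tweet_text': ['a', 'b', 'c', 'd'],
    'orig__golden': [True, np.nan, np.nan, np.nan],
    '_created_at': [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values in each column
null_fraction = demo_df.isnull().mean()

# Hypothetical cutoff: keep columns with at most 50% missing values
max_null_fraction = 0.5
keep_cols = null_fraction[null_fraction <= max_null_fraction].index.tolist()
trimmed_df = demo_df[keep_cols]

print(null_fraction)
print(keep_cols)
```

On the real data, the same two lines (`isnull().mean()` plus the filter) would run against `tweet_df`; the right cutoff is a judgment call.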

In [4]:
tweet_df.orig__golden.unique()

array([True, nan], dtype=object)

In [8]:
tweet_df.orig__golden.value_counts()

True    67
Name: orig__golden, dtype: int64

In [12]:
golden_df = tweet_df[tweet_df['orig__golden'] == True]

In [13]:
golden_df.does_this_tweet_contain_hate_speech.value_counts()

The tweet uses offensive language but not hate speech    29
The tweet is not offensive                               21
The tweet contains hate speech                           17
Name: does_this_tweet_contain_hate_speech, dtype: int64

In [14]:
golden_df._trusted_judgments.value_counts()

91    8
93    7
90    7
88    7
94    6
92    6
95    5
89    5
87    5
98    3
86    3
96    2
97    1
85    1
84    1
Name: _trusted_judgments, dtype: int64

## Golden Tweets

It turns out that `golden` tweets are tweets that were reviewed by a large number of contributors (84 to 98 trusted judgments each) who reached a consensus. Unfortunately, we won't have this information when ingesting new tweets from the Twitter streaming API, so we'll have to drop these columns (along with all other features relating to the contributors who labeled the tweets).
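One way to back up that reading would be to compare judgment counts for golden and non-golden tweets side by side. The sketch below uses a made-up `demo_df`; the non-golden counts in particular are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the non-golden judgment counts here are invented
demo_df = pd.DataFrame({
    'orig__golden': [True, True, np.nan, np.nan, np.nan],
    '_trusted_judgments': [91, 88, 3, 3, 4],
})

# orig__golden is either True or NaN, so notna() flags the golden rows
group_key = np.where(demo_df['orig__golden'].notna(), 'golden', 'other')

# Summarize judgment counts within each group
judgment_summary = (
    demo_df.groupby(group_key)['_trusted_judgments']
    .agg(['min', 'median', 'max'])
)
print(judgment_summary)
```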

## Cleaning Data for Analysis

Let's build a table that contains only the tweet text and the labels, so we can focus on feature engineering and modeling.

In [15]:
tweet_df.rename(columns={'does_this_tweet_contain_hate_speech':'labels'}, inplace=True)

In [18]:
labels_text_only_cols = ['labels', 'tweet_text']

In [19]:
labels_text_only_df = tweet_df[labels_text_only_cols].copy()

In [21]:
labels_text_only_df['labels'].value_counts()

The tweet is not offensive                               7274
The tweet uses offensive language but not hate speech    4836
The tweet contains hate speech                           2399
Name: labels, dtype: int64

In [22]:
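def convert_labels(label):
    # Map the verbose annotation strings to short class names.
    if label == 'The tweet is not offensive':
        return 'not offensive'
    elif label == 'The tweet uses offensive language but not hate speech':
        return 'offensive'
    else:
        # Anything else is treated as hate speech; the only other value
        # in this dataset is 'The tweet contains hate speech'.
        return 'hate'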

In [24]:
labels_text_only_df['labels'] = labels_text_only_df['labels'].apply(convert_labels)

In [25]:
labels_text_only_df['labels'].value_counts()

not offensive    7274
offensive        4836
hate             2399
Name: labels, dtype: int64
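
The counts match the originals, so the conversion worked here. An alternative would be an explicit dictionary with `Series.map`, which leaves any unexpected label as `NaN` rather than falling through to `'hate'`. This is a sketch only; `label_map`, `raw_labels`, and the `'something unexpected'` value are made up for illustration.

```python
import pandas as pd

# Explicit mapping from the original label strings to the short names
label_map = {
    'The tweet is not offensive': 'not offensive',
    'The tweet uses offensive language but not hate speech': 'offensive',
    'The tweet contains hate speech': 'hate',
}

# Stand-in labels, including one unexpected value (hypothetical)
raw_labels = pd.Series([
    'The tweet is not offensive',
    'The tweet contains hate speech',
    'something unexpected',
])

# Series.map returns NaN for values that are missing from the dict
short_labels = raw_labels.map(label_map)

# Collect the original values that could not be mapped
unmapped = raw_labels[short_labels.isnull()]

print(short_labels.tolist())
print(unmapped.tolist())
```

An empty `unmapped` would confirm that every label was one of the three expected strings.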

## Outputting Cleaned Data to New File

In [28]:
labels_text_only_df.to_csv('../data/labels_tweets_only.csv',
                           index=False)
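
Since tweet text can contain commas and quotes, it may be worth checking that the file reads back unchanged. Below is a minimal round-trip sketch, assuming a made-up `demo_df` and a temporary directory instead of `../data/`.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned frame (hypothetical rows)
demo_df = pd.DataFrame({
    'labels': ['not offensive', 'hate'],
    'tweet_text': ['good morning, everyone', 'text with "quotes" and, commas'],
})

# Write to a temporary location rather than ../data/
out_path = os.path.join(tempfile.mkdtemp(), 'labels_tweets_only.csv')
demo_df.to_csv(out_path, index=False)

# Read the file back and compare it with what was written
reloaded_df = pd.read_csv(out_path)
round_trip_ok = demo_df.equals(reloaded_df)
print(round_trip_ok)
```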