In [1]:
import pandas as pd

## Explore test-balanced.csv

In [4]:
# Forward slashes avoid "\t" in the path being interpreted as a tab character.
df = pd.read_csv("../data/test-balanced.csv", 
                 sep='\t', 
                 header=None, 
                 names=['label','comment','author','subreddit','score',
                        'ups','downs','date','created_utc','parent_comment'])
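A Windows-style path like the one used for this file is fragile, because `\t` inside a normal string literal is a tab character rather than a backslash followed by `t`. The sketch below shows the problem and two common ways around it. The variable names are illustrative only, and the `../data` layout is taken from the path in the cell.

```python
from pathlib import Path

# In a normal string literal, "\t" is a tab, so the file name is corrupted.
broken = "..\\data\test-balanced.csv"

# Option 1: a raw string keeps every backslash literal (Windows only).
raw = r"..\data\test-balanced.csv"

# Option 2: build the path from parts; this works on any operating system.
portable = Path("..") / "data" / "test-balanced.csv"

print("\t" in broken)        # True: the path contains a tab
print("\t" in raw)           # False
print(portable.as_posix())   # ../data/test-balanced.csv
```

Either `raw` or `portable` can be passed directly to `pd.read_csv`.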

In [None]:
# `null_counts` was deprecated and later removed from pandas; `show_counts` replaces it.
df.info(memory_usage='deep', show_counts=True)

In [18]:
df.head()

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,Actually most of her supporters and sane peopl...,Quinnjester,politics,3,3,0,2016-09,1473569605,Hillary's Surrogotes Told to Blame Media for '...
1,0,They can't survive without an echo chamber whi...,TheGettysburgAddress,The_Donald,13,-1,-1,2016-11,1478788413,Thank God Liberals like to live in concentrate...
2,0,you're pretty cute yourself 1729 total,Sempiternally_free,2007scape,8,-1,-1,2016-11,1478042903,Saw this cutie training his Attack today...
3,0,If you kill me you'll crash the meme market,Catacomb82,AskReddit,2,-1,-1,2016-10,1477412597,If you were locked in a room with 49 other peo...
4,0,I bet he wrote that last message as he was sob...,Dorian-throwaway,niceguys,5,-1,-1,2016-11,1477962278,You're not even that pretty!


In [35]:
test_bal_shape = df.shape
test_bal_shape

(251608, 10)

In [20]:
df.describe()

Unnamed: 0,label,score,ups,downs,created_utc
count,251608.0,251608.0,251608.0,251608.0,251608.0
mean,0.5,6.757452,5.410953,-0.143751,1438535000.0
std,0.500001,48.450781,39.402618,0.350838,39502260.0
min,0.0,-329.0,-329.0,-1.0,1230881000.0
25%,0.0,1.0,0.0,0.0,1420403000.0
50%,0.5,2.0,1.0,0.0,1448768000.0
75%,1.0,4.0,3.0,0.0,1468522000.0
max,1.0,9923.0,4835.0,0.0,1483229000.0


### Are there any NA values?

In [21]:
df.isnull().values.any()  # True

True

### How many?

In [22]:
df.isnull().sum().sum()  # 14

14

### Which columns?

In [23]:
df.isnull().any()  # comment col only

label             False
comment            True
author            False
subreddit         False
score             False
ups               False
downs             False
date              False
created_utc       False
parent_comment    False
dtype: bool

### Which rows?

In [24]:
df[df['comment'].isnull()]  # 14 rows, 9 labeled sarcasm

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
3860,1,,Red_Hawk13,GlobalOffensive,1,-1,-1,2016-11,1479233380,If that was me in mm peeking there I'd be head...
13981,0,,MrVacuous,Warhammer,1,-1,-1,2016-10,1477159100,how much money have you spent on this hobby? H...
139814,0,,John_Wall_Star,nba,4,4,0,2015-12,1450934641,#Wallstar
147372,1,,D_VoN,gaming,1,1,0,2015-07,1436199078,How do you not know which way to put the cartr...
151112,0,,OH_NO_MR_BILL,pics,1,1,0,2015-07,1437708816,[click the 'source' link](/intensifies)
153607,1,,EranVonBaron,CivilizatonExperiment,1,1,0,2015-09,1441934837,yo wot
165219,1,,embGOD,leagueoflegends,1,1,0,2015-01,1420585477,what's wrong with that :|
178639,0,,TheSoupKitchen,TeraOnline,1,1,0,2015-03,1426902595,You on EU? I had this problem a year ago but i...
181985,0,,vVs_Pidgeon,heroesofthestorm,1,1,0,2015-02,1425035932,On which region are you?
192810,1,,doppelwurzel,funny,1,1,0,2015-05,1432681571,Colorblind maybe? I dunno. I'm sure he won't d...


### Will these NAs negatively affect our planned analysis?

To consider it from another angle: might a nonresponse be useful as we seek to detect sarcasm?

1. In the context of the parent_comment, might the empty response be intentional?
 - Consider index 228124. The parent_comment is "I hate "not sure if sarcasm" comments but your...." This suggests that the author had previously written or indicated "not sure if sarcasm," and then answered "I hate 'not sure if sarcasm'" with an empty string. In a live human conversation, especially a contentious one, a silent stare might be interpreted as anger, aggression, or sarcasm. An empty response could therefore be not only intentional but pointed and full of meaning.
 - Also, because some of the empty comments were flagged as sarcasm by their authors, we can be fairly confident that the lack of content is intentional in most of those cases. Intentionality is less clear for empty comments that are _not_ annotated as sarcasm by the author.
2. Will dropping or keeping the NAs affect our analysis?
 - Because there are so few of them (14 of 251,608 rows), dropping the NAs would likely not damage our analysis significantly.
 - However, because they can represent a specific form of sarcasm, we will keep them.
 
**Decision: Keep empty comments. They should not hurt and will likely help our analysis.**
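One way to keep the empty comments usable downstream is to record their emptiness as an explicit feature before replacing the missing values with empty strings. This is only a sketch on a tiny stand-in frame: the column name `comment_is_empty` is hypothetical, and the notebook itself does not prescribe this step.

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this would be the loaded `df`.
sample = pd.DataFrame({
    "label": [1, 0, 1],
    "comment": ["Sure, great idea", None, None],
})

# Hypothetical feature: flag empty comments so a model can still use them.
sample["comment_is_empty"] = sample["comment"].isnull()

# Replace missing text with an empty string so text processing does not fail.
sample["comment"] = sample["comment"].fillna("")

print(sample)
```

Flagging before filling matters, because after `fillna("")` the missing rows can no longer be found with `isnull()`.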

## Explore train-balanced.csv

In [25]:
df = pd.read_csv("train-balanced.csv", 
                 sep='\t', 
                 header=None, 
                 names=['label','comment','author','subreddit','score',
                        'ups','downs','date','created_utc','parent_comment'])

In [26]:
df.head()

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,NC and NH.,Trumpbart,politics,2,-1,-1,2016-10,1476662123,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,Shbshb906,nba,-4,-1,-1,2016-11,1477959850,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",Creepeth,nfl,3,3,0,2016-09,1474580737,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",icebrotha,BlackPeopleTwitter,-8,-1,-1,2016-10,1476824627,deadass don't kill my buzz
4,0,I could use one of those tools.,cush2push,MaddenUltimateTeam,6,-1,-1,2016-12,1483117213,Yep can confirm I saw the tool they use for th...


In [27]:
train_bal_shape = df.shape
train_bal_shape

(1010826, 10)

In [28]:
df.describe()

Unnamed: 0,label,score,ups,downs,created_utc
count,1010826.0,1010826.0,1010826.0,1010826.0,1010826.0
mean,0.5,6.885676,5.498885,-0.1458629,1438684000.0
std,0.5,48.34288,41.27297,0.3529689,39458120.0
min,0.0,-507.0,-507.0,-1.0,1230851000.0
25%,0.0,1.0,0.0,0.0,1420734000.0
50%,0.5,2.0,1.0,0.0,1448915000.0
75%,1.0,4.0,3.0,0.0,1468588000.0
max,1.0,9070.0,5163.0,0.0,1483229000.0


### Are there any NA values?

In [29]:
df.isnull().values.any()  # True

True

### How many?

In [30]:
df.isnull().sum().sum()  # 53

53

### Which columns?

In [31]:
df.isnull().any()  # comment col only

label             False
comment            True
author            False
subreddit         False
score             False
ups               False
downs             False
date              False
created_utc       False
parent_comment    False
dtype: bool

### How many rows and how many are labeled sarcasm?

In [34]:
nulls = df[df['comment'].isnull()]  
print(len(nulls))
nulls_sarc = nulls[nulls['label'] == 1]
print(len(nulls_sarc))
print(len(nulls_sarc) / len(nulls))

53
45
0.8490566037735849


Note that in the balanced dataset the incidence of sarcasm for empty comments is 0.85, much higher than the expected value of 0.5 for comments in general. (The incidence of sarcasm in the _unbalanced_ data is < 0.01.)
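To gauge whether 45 sarcastic labels out of 53 empty comments could plausibly arise by chance at a base rate of 0.5, an exact one-sided binomial tail can be computed with the standard library. This is a rough check, not part of the original analysis, and it assumes the comments are independent draws.

```python
from math import comb

n_empty = 53       # empty comments in train-balanced.csv (from the output above)
n_sarcastic = 45   # of those, the number labeled sarcasm
base_rate_denominator = 2 ** n_empty  # every outcome has probability 0.5 ** 53

# P(X >= 45) for X ~ Binomial(n=53, p=0.5)
tail_count = sum(comb(n_empty, k) for k in range(n_sarcastic, n_empty + 1))
p_value = tail_count / base_rate_denominator

print(f"P(X >= {n_sarcastic}) = {p_value:.2e}")
```

A very small tail probability would support treating emptiness as a signal rather than noise, subject to the independence assumption.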
### We will keep the nulls.

## Balanced test to train ratio

In [None]:
# Shapes are tuples, which cannot be divided directly; compare the row counts instead.
test_bal_shape[0] / train_bal_shape[0]
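As a worked example of this ratio, the row counts can be taken from the `count` rows of the two `df.describe()` outputs above. The variable names below are illustrative, not part of the notebook.

```python
test_rows = 251608     # `count` from df.describe() on test-balanced.csv
train_rows = 1010826   # `count` from df.describe() on train-balanced.csv

# Test rows per train row, and the test share of all balanced rows.
ratio = test_rows / train_rows
test_share = test_rows / (test_rows + train_rows)

print(round(ratio, 3), round(test_share, 3))  # 0.249 0.199
```

A test share of about 0.2 corresponds to a split of roughly 80% train and 20% test.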

## Explore test-unbalanced.csv
This file is too large to load into memory with a single `read_csv` call.

In [None]:
df = pd.read_csv("test-unbalanced.csv", 
                 sep='\t', 
                 header=None, 
                 names=['label','comment','author','subreddit','score',
                        'ups','downs','date','created_utc','parent_comment'])
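When a file is too large for a single `read_csv` call, one option is the `chunksize` parameter, which makes pandas return the file in pieces so that counts can be accumulated without holding everything in memory. The sketch below runs on a small in-memory stand-in (`fake_file`, a made-up three-row sample) rather than the real file, and the chunk size of 2 is chosen only to force more than one chunk.

```python
import io

import pandas as pd

COLUMNS = ['label', 'comment', 'author', 'subreddit', 'score',
           'ups', 'downs', 'date', 'created_utc', 'parent_comment']

# Stand-in for the real file so the sketch runs anywhere; in the notebook,
# pass the path to test-unbalanced.csv instead of this buffer.
fake_file = io.StringIO(
    "0\thello\ta\tpics\t1\t1\t0\t2016-01\t1\tparent one\n"
    "1\t\tb\tnba\t2\t2\t0\t2016-02\t2\tparent two\n"
    "0\tbye\tc\tnfl\t3\t3\t0\t2016-03\t3\tparent three\n"
)

n_rows = 0
n_null_comments = 0

# Each chunk is an ordinary DataFrame with at most `chunksize` rows.
for chunk in pd.read_csv(fake_file, sep='\t', header=None,
                         names=COLUMNS, chunksize=2):
    n_rows += len(chunk)
    n_null_comments += int(chunk['comment'].isnull().sum())

print(n_rows, n_null_comments)
```

For the real file a much larger chunk size (for example, hundreds of thousands of rows) would be more practical. The right value depends on available memory.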

In [26]:
df.head()



In [27]:
test_unbal_shape = df.shape
test_unbal_shape


In [28]:
df.describe()



### Are there any NA values?

In [29]:
df.isnull().values.any()  # True


### How many?

In [30]:
df.isnull().sum().sum()  # 53


### Which columns?

In [31]:
df.isnull().any()  # comment col only


### How many rows and how many are labeled sarcasm?

In [34]:
nulls = df[df['comment'].isnull()]  
print(len(nulls))
nulls_sarc = nulls[nulls['label'] == 1]
print(len(nulls_sarc))
print(len(nulls_sarc) / len(nulls))

### We will keep the nulls.

## Unbalanced test to train ratio
Assuming that the ratio is similar to that for the balanced train/test data:

In [None]:
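# train-unbalanced.csv is not loaded in this notebook, so estimate its row
# count by assuming the balanced test/train ratio also holds here.
bal_ratio = test_bal_shape[0] / train_bal_shape[0]
est_train_unbal_rows = round(test_unbal_shape[0] / bal_ratio)
print("estimated train_unbal rows:", est_train_unbal_rows)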