## Summary of changes to original dataset
- Remove date and created_utc columns
- Remove empty comments
- Add column for word count
- Add column for capitalised character frequency
- Add column for punctuation frequency
- Remove subreddit and author columns
- Add column for processed comment (remove punctuation and change characters to lowercase)
- Split into training (90%) and validation (10%) sets

## Preparation


In [1]:
# Mount google drive to access google drive on colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/CS3244_Project/

/content/drive/MyDrive/CS3244_Project


In [3]:
import pandas as pd
import numpy as np
import string
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

## Explore original dataset

In [4]:
df = pd.read_csv('train-balanced-sarcasm.csv',engine='python')
df.head(5)

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,NC and NH.,Trumpbart,politics,2,-1,-1,2016-10,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,Shbshb906,nba,-4,-1,-1,2016-11,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",Creepeth,nfl,3,3,0,2016-09,2016-09-22 21:45:37,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",icebrotha,BlackPeopleTwitter,-8,-1,-1,2016-10,2016-10-18 21:03:47,deadass don't kill my buzz
4,0,I could use one of those tools.,cush2push,MaddenUltimateTeam,6,-1,-1,2016-12,2016-12-30 17:00:13,Yep can confirm I saw the tool they use for th...


In [5]:
shape = df.shape
print(f'Rows: {shape[0]}')
print(f'Features: {shape[1]}')
print("Label count")
print(df['label'].value_counts())  # Balanced dataset

Rows: 1010826
Features: 10
Label count
1    505413
0    505413
Name: label, dtype: int64


## Remove columns

In [6]:
del df['date']
del df['created_utc']
del df['author']
del df['subreddit']
df.head()

Unnamed: 0,label,comment,score,ups,downs,parent_comment
0,0,NC and NH.,2,-1,-1,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,-4,-1,-1,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",3,3,0,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",-8,-1,-1,deadass don't kill my buzz
4,0,I could use one of those tools.,6,-1,-1,Yep can confirm I saw the tool they use for th...
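
The four `del` statements could also be written as a single `DataFrame.drop` call. The following is a minimal sketch on a tiny stand-in frame; the frame `demo` and its second row are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Tiny stand-in frame with the same column names as the original dataset.
# The values are illustrative only.
demo = pd.DataFrame({
    'label': [0, 1],
    'comment': ['NC and NH.', 'Sure, great idea.'],
    'author': ['a', 'b'],
    'subreddit': ['politics', 'nba'],
    'score': [2, 5],
    'ups': [-1, 5],
    'downs': [-1, 0],
    'date': ['2016-10', '2016-11'],
    'created_utc': ['2016-10-16 23:55:23', '2016-11-01 00:24:10'],
    'parent_comment': ['x', 'y'],
})

# drop() returns a new frame, so `demo` itself is left unchanged.
dropped = demo.drop(columns=['date', 'created_utc', 'author', 'subreddit'])
print(list(dropped.columns))
```

Because `drop` returns a new frame, the original stays available for comparison, whereas `del` modifies the frame in place.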


## Empty Comments 
### - Removed

In [7]:
df_na = df.loc[df.isna().any(axis=1)]
print(f'# of empty comments: {df_na.shape[0]}')
print("Label count")
print(df_na['label'].value_counts())  # Majority of empty comments are sarcastic, might still be useful for predicting sarcasm
df_na.head()

# of empty comments: 53
Label count
1    45
0     8
Name: label, dtype: int64


Unnamed: 0,label,comment,score,ups,downs,parent_comment
56269,1,,1,1,0,"LPL shitshow, EU LCS shitshow. What isn't a sh..."
68590,1,,1,-1,-1,Car fires smell delicious to you? You should p...
135348,0,,1,-1,-1,Will do. EU or NA?
199910,0,,1,1,0,"woah, thanks."
258718,1,,5,5,0,"No, doing drugs while forming a fetus (your ki..."


In [8]:
# Drop rows whose comment is NaN
df = df[df['comment'].notna()]
print(f'# of empty comments: {df.loc[df.isna().any(axis=1)].shape[0]}')

# of empty comments: 0
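
An equivalent way to drop the rows is `dropna` restricted to the `comment` column. The sketch below uses a made-up three-row frame (`demo` is hypothetical). It also counts missing values per column, which shows where the NaNs actually sit, and takes an explicit `.copy()` so that later column assignments act on an independent frame rather than on a filtered slice.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing comment.
demo = pd.DataFrame({
    'label': [1, 0, 1],
    'comment': ['ok', np.nan, 'fine'],
    'parent_comment': ['a', 'b', 'c'],
})

# Missing values per column, rather than rows with any missing value.
print(demo.isna().sum())

# Drop only the rows where 'comment' is missing and detach the result.
cleaned = demo.dropna(subset=['comment']).copy()
print(len(cleaned))
```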


## Word count

In [9]:
df['word_count'] = df['comment'].apply(lambda x: len(x.split(' ')))  # Add word count column
df.head()

Unnamed: 0,label,comment,score,ups,downs,parent_comment,word_count
0,0,NC and NH.,2,-1,-1,"Yeah, I get that argument. At this point, I'd ...",3
1,0,You do know west teams play against west teams...,-4,-1,-1,The blazers and Mavericks (The wests 5 and 6 s...,14
2,0,"They were underdogs earlier today, but since G...",3,3,0,They're favored to win.,19
3,0,"This meme isn't funny none of the ""new york ni...",-8,-1,-1,deadass don't kill my buzz,12
4,0,I could use one of those tools.,6,-1,-1,Yep can confirm I saw the tool they use for th...,7
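
One detail of `split(' ')` is worth keeping in mind: consecutive or trailing spaces produce empty strings, which are counted as words. The sketch below contrasts it with `split()` (no argument), which splits on any run of whitespace. The helper names and the second sample string are made up for illustration.

```python
def word_count_space(text):
    # Same rule as the notebook: split on single space characters.
    return len(text.split(' '))


def word_count_whitespace(text):
    # Alternative: split on runs of whitespace, ignoring leading/trailing space.
    return len(text.split())


print(word_count_space('NC and NH.'))        # 3
print(word_count_whitespace('NC and NH.'))   # 3
print(word_count_space('so  funny '))        # 4
print(word_count_whitespace('so  funny '))   # 2
```

The two rules agree on cleanly spaced text and differ only where spacing is irregular.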


## Capitalised character frequency
### - No. of capitalised characters / No. of characters

In [10]:
def cap_freq(text):
  total = 0
  cap_total = 0
  for char in text:
    total += 1
    if char.isupper():
      cap_total += 1
  return cap_total / total if total != 0 else 0

In [11]:
df['capitilised_frequency'] = df['comment'].apply(cap_freq) 
df.head()

Unnamed: 0,label,comment,score,ups,downs,parent_comment,word_count,capitilised_frequency
0,0,NC and NH.,2,-1,-1,"Yeah, I get that argument. At this point, I'd ...",3,0.4
1,0,You do know west teams play against west teams...,-4,-1,-1,The blazers and Mavericks (The wests 5 and 6 s...,14,0.013514
2,0,"They were underdogs earlier today, but since G...",3,3,0,They're favored to win.,19,0.024793
3,0,"This meme isn't funny none of the ""new york ni...",-8,-1,-1,deadass don't kill my buzz,12,0.016667
4,0,I could use one of those tools.,6,-1,-1,Yep can confirm I saw the tool they use for th...,7,0.032258
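
On a dataset of this size, a vectorised version may be faster than a Python-level loop per row. The following is a sketch, not a drop-in replacement: it counts only ASCII capitals via the pattern `[A-Z]`, whereas `isupper()` also recognises non-ASCII uppercase letters. The sample series is made up apart from the first comment.

```python
import pandas as pd

comments = pd.Series(['NC and NH.', 'you are right', ''])

upper = comments.str.count(r'[A-Z]')   # ASCII capitals per comment
length = comments.str.len()            # characters per comment

# An empty string gives 0/0 = NaN, which is mapped to 0 like in cap_freq.
cap_freq_vec = (upper / length).fillna(0.0)
print(cap_freq_vec.tolist())
```

For the first comment this gives 4 capitals out of 10 characters, matching the 0.4 shown in the table above.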


## Punctuation character frequency
### - No. of punctuation characters / No. of characters

In [12]:
def punctuation_freq(text):
  total = 0
  punc_total = 0
  for char in text:
    total += 1
    if char in string.punctuation:  # ASCII punctuation only
      punc_total += 1
  return punc_total / total if total != 0 else 0

In [13]:
df['punctuation_frequency'] = df['comment'].apply(punctuation_freq) 
df.head()

Unnamed: 0,label,comment,score,ups,downs,parent_comment,word_count,capitilised_frequency,punctuation_frequency
0,0,NC and NH.,2,-1,-1,"Yeah, I get that argument. At this point, I'd ...",3,0.4,0.1
1,0,You do know west teams play against west teams...,-4,-1,-1,The blazers and Mavericks (The wests 5 and 6 s...,14,0.013514,0.013514
2,0,"They were underdogs earlier today, but since G...",3,3,0,They're favored to win.,19,0.024793,0.033058
3,0,"This meme isn't funny none of the ""new york ni...",-8,-1,-1,deadass don't kill my buzz,12,0.016667,0.066667
4,0,I could use one of those tools.,6,-1,-1,Yep can confirm I saw the tool they use for th...,7,0.032258,0.032258
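
As a worked check of the ratio, the first comment `NC and NH.` has 10 characters, one of which is punctuation, giving 0.1. The compact sketch below reproduces that value and also shows that `string.punctuation` covers ASCII punctuation only, so a typographic apostrophe (U+2019) is not counted. The function name is made up for illustration.

```python
import string


def punctuation_fraction(text):
    """Share of characters that appear in string.punctuation (ASCII only)."""
    if not text:
        return 0
    return sum(ch in string.punctuation for ch in text) / len(text)


print(punctuation_fraction('NC and NH.'))   # 0.1
print(punctuation_fraction('don\u2019t'))   # 0.0
```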


## Process comment text
### - Remove punctuation
### - Lowercase all characters

In [14]:
def process_text(text):
  return text.translate(str.maketrans('', '', string.punctuation)).lower()

In [15]:
df['comment_processed'] = df['comment'].apply(process_text) 
df['parent_comment_processed'] = df['parent_comment'].apply(process_text) 
df.head()

Unnamed: 0,label,comment,score,ups,downs,parent_comment,word_count,capitilised_frequency,punctuation_frequency,comment_processed,parent_comment_processed
0,0,NC and NH.,2,-1,-1,"Yeah, I get that argument. At this point, I'd ...",3,0.4,0.1,nc and nh,yeah i get that argument at this point id pref...
1,0,You do know west teams play against west teams...,-4,-1,-1,The blazers and Mavericks (The wests 5 and 6 s...,14,0.013514,0.013514,you do know west teams play against west teams...,the blazers and mavericks the wests 5 and 6 se...
2,0,"They were underdogs earlier today, but since G...",3,3,0,They're favored to win.,19,0.024793,0.033058,they were underdogs earlier today but since gr...,theyre favored to win
3,0,"This meme isn't funny none of the ""new york ni...",-8,-1,-1,deadass don't kill my buzz,12,0.016667,0.066667,this meme isnt funny none of the new york nigg...,deadass dont kill my buzz
4,0,I could use one of those tools.,6,-1,-1,Yep can confirm I saw the tool they use for th...,7,0.032258,0.032258,i could use one of those tools,yep can confirm i saw the tool they use for th...
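
`process_text` rebuilds its translation table on every call. A small variation, sketched below under the assumption that the same ASCII punctuation set is wanted, builds the table once and reuses it. The names `_PUNCT_TABLE` and `process_text_cached` are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def process_text_cached(text):
    # Same result as process_text: strip punctuation, then lowercase.
    return text.translate(_PUNCT_TABLE).lower()


print(process_text_cached("They're favored to win."))  # theyre favored to win
```

Note that removing the apostrophe merges contractions into a single token (`They're` becomes `theyre`), as the processed columns above also show.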


## Split train/validation set
### - 90% training, 10% validation

In [16]:
train, valid = train_test_split(df, train_size=0.9, random_state=88, shuffle=True)

In [17]:
# Training set is still balanced
print(f"# of rows: {train.shape[0]}")
print("Training label counts")
train['label'].value_counts()

# of rows: 909695
Training label counts


0    454858
1    454837
Name: label, dtype: int64

In [18]:
print(f"# of rows: {valid.shape[0]}")
print("Validation label counts")
valid['label'].value_counts()

# of rows: 101078
Validation label counts


0    50547
1    50531
Name: label, dtype: int64

In [19]:
print(f"# of rows: {df.shape[0]}")
print("Training+Validation label counts")
df['label'].value_counts()

# of rows: 1010773
Training+Validation label counts


0    505405
1    505368
Name: label, dtype: int64
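
The random split leaves both sets close to balanced but not exactly so. If an exact class ratio were required, `train_test_split` accepts a `stratify` argument. The sketch below is an alternative, not what the notebook does, and runs on a made-up 100-row frame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up balanced frame: 50 rows per label.
demo = pd.DataFrame({'label': [0, 1] * 50, 'x': range(100)})

# stratify keeps the label proportions identical in both splits.
train_s, valid_s = train_test_split(
    demo,
    train_size=0.9,
    random_state=88,
    shuffle=True,
    stratify=demo['label'],
)

print(train_s['label'].value_counts().to_dict())
print(valid_s['label'].value_counts().to_dict())
```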

In [20]:
train.to_csv('sarcasm_train.csv',index = False)
valid.to_csv('sarcasm_valid.csv',index = False)
df.to_csv('sarcasm_train_valid.csv', index = False)

In [21]:
train.isnull().sum()

label                       0
comment                     0
score                       0
ups                         0
downs                       0
parent_comment              0
word_count                  0
capitilised_frequency       0
punctuation_frequency       0
comment_processed           0
parent_comment_processed    0
dtype: int64

## Test set
### - Prepare the provided test set

In [22]:
df2 = pd.read_table('test-balanced.csv',engine='python', header=None)
df2.columns = ['label', 'comment', 'author', 'subreddit', 'score', 'ups', 'downs',
       'date', 'created_utc', 'parent_comment']

In [23]:
df2

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,Actually most of her supporters and sane peopl...,Quinnjester,politics,3,3,0,2016-09,1473569605,Hillary's Surrogotes Told to Blame Media for '...
1,0,They can't survive without an echo chamber whi...,TheGettysburgAddress,The_Donald,13,-1,-1,2016-11,1478788413,Thank God Liberals like to live in concentrate...
2,0,you're pretty cute yourself 1729 total,Sempiternally_free,2007scape,8,-1,-1,2016-11,1478042903,Saw this cutie training his Attack today...
3,0,If you kill me you'll crash the meme market,Catacomb82,AskReddit,2,-1,-1,2016-10,1477412597,If you were locked in a room with 49 other peo...
4,0,I bet he wrote that last message as he was sob...,Dorian-throwaway,niceguys,5,-1,-1,2016-11,1477962278,You're not even that pretty!
...,...,...,...,...,...,...,...,...,...,...
251603,1,Respect your elders you little snot.,Tiffany_Butler,sports,7,7,0,2009-06,1245445833,"Aren't you a little old to be on the internet,..."
251604,1,I'm just glad they won't be using taxpayer mon...,harryballsagna,canada,8,8,0,2009-06,1246140814,"""I'm sorry, I can't hear you over the sound of..."
251605,1,what.. with this awesome narration?,aberant,lost,4,4,0,2009-04,1240452084,"So far, so lame."
251606,1,He looks trustworthy.,permaculture,unitedkingdom,1,1,0,2009-01,1231343418,"""I don't care"" says Lapland boss"


In [24]:
del df2['date']
del df2['created_utc']
del df2['author']
del df2['subreddit']

In [25]:
df2_na = df2.loc[df2.isna().any(axis=1)]
print(f'# of empty comments: {df2_na.shape[0]}')
df2 = df2[df2['comment'].notna()]
print(f'# of empty comments: {df2.loc[df2.isna().any(axis=1)].shape[0]}')

# of empty comments: 14
# of empty comments: 0


In [26]:
df2['word_count'] = df2['comment'].apply(lambda x: len(x.split(' ')))  # Add word count column

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [27]:
def cap_freq(text):
  total = 0
  cap_total = 0
  for char in text:
    total += 1
    if char.isupper():
      cap_total += 1
  return cap_total / total if total != 0 else 0

df2['capitilised_frequency'] = df2['comment'].apply(cap_freq) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.


In [28]:
def punctuation_freq(text):
  total = 0
  punc_total = 0
  for char in text:
    total += 1
    if char in string.punctuation:  # ASCII punctuation only
      punc_total += 1
  return punc_total / total if total != 0 else 0

df2['punctuation_frequency'] = df2['comment'].apply(punctuation_freq) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()


In [29]:
def process_text(text):
  return text.translate(str.maketrans('', '', string.punctuation)).lower()

df2['comment_processed'] = df2['comment'].apply(process_text) 
df2['parent_comment_processed'] = df2['parent_comment'].apply(process_text) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


In [30]:
df2.head()

Unnamed: 0,label,comment,score,ups,downs,parent_comment,word_count,capitilised_frequency,punctuation_frequency,comment_processed,parent_comment_processed
0,0,Actually most of her supporters and sane peopl...,3,3,0,Hillary's Surrogotes Told to Blame Media for '...,17,0.038462,0.048077,actually most of her supporters and sane peopl...,hillarys surrogotes told to blame media for de...
1,0,They can't survive without an echo chamber whi...,13,-1,-1,Thank God Liberals like to live in concentrate...,12,0.028571,0.028571,they cant survive without an echo chamber whic...,thank god liberals like to live in concentrate...
2,0,you're pretty cute yourself 1729 total,8,-1,-1,Saw this cutie training his Attack today...,6,0.0,0.026316,youre pretty cute yourself 1729 total,saw this cutie training his attack today
3,0,If you kill me you'll crash the meme market,2,-1,-1,If you were locked in a room with 49 other peo...,9,0.023256,0.023256,if you kill me youll crash the meme market,if you were locked in a room with 49 other peo...
4,0,I bet he wrote that last message as he was sob...,5,-1,-1,You're not even that pretty!,11,0.019608,0.019608,i bet he wrote that last message as he was sob...,youre not even that pretty


In [31]:
df2.to_csv('sarcasm_test.csv', index = False)
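
The training and test sets go through the same steps, with the helper functions defined twice. One way to avoid the duplication is a single function applied to both frames. The sketch below is an assumption about how that could look, not the notebook's own code: `add_text_features` and `char_fraction` are hypothetical names, and the column names follow the ones used above. Working on an explicit `.copy()` means the new columns are assigned to an independent frame rather than to a filtered slice.

```python
import string

import pandas as pd

_PUNCT = set(string.punctuation)
_TABLE = str.maketrans('', '', string.punctuation)


def char_fraction(text, predicate):
    """Fraction of characters in `text` for which `predicate` is true."""
    return sum(1 for ch in text if predicate(ch)) / len(text) if text else 0


def strip_and_lower(text):
    """Remove ASCII punctuation and lowercase the text."""
    return text.translate(_TABLE).lower()


def add_text_features(frame):
    """Drop rows without a comment and add the derived text columns."""
    out = frame.dropna(subset=['comment']).copy()
    out['word_count'] = out['comment'].apply(lambda t: len(t.split(' ')))
    out['capitilised_frequency'] = out['comment'].apply(
        lambda t: char_fraction(t, str.isupper))
    out['punctuation_frequency'] = out['comment'].apply(
        lambda t: char_fraction(t, lambda ch: ch in _PUNCT))
    out['comment_processed'] = out['comment'].apply(strip_and_lower)
    out['parent_comment_processed'] = out['parent_comment'].apply(strip_and_lower)
    return out


# Two-row example; the second row has no comment and is dropped.
demo = pd.DataFrame({
    'label': [0, 1],
    'comment': ['NC and NH.', None],
    'parent_comment': ["They're favored to win.", 'Will do. EU or NA?'],
})
features = add_text_features(demo)
print(features.iloc[0].to_dict())
```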