README.md

This is the repo for my Springboard capstone project files.

Project Proposal: Formality in Reddit comments as a predictor of sarcasm

The problem of detecting sarcasm in text is of great interest to Natural Language Processing (NLP) practitioners and their clients. For example, consumer brands analyze online comments to gauge customer satisfaction, and political campaigns seek to measure public reaction to policy proposals in user-created online content. Detecting sarcasm in text may also help disambiguate satire and related genres in the context of fake news detection.

When we use NLP techniques to measure sentiment in text, we think in terms of positive or negative (or neutral). Sarcasm, however, is a little more difficult to pin down. The literature is divided on how to define sarcasm, and on whether sarcasm is "net" positive or negative. Sometimes sarcasm is interpreted as positive by one person and negative by another. If sarcasm contains so much ambiguity that humans can't agree on what it is, do computers stand a chance of doing better?

In this study we analyze a large collection of online discussion forum comments to see if their degree of formality is associated with sarcasm. Formality here is defined as a text's conformance to certain standards of formal written English: correct spelling, proper punctuation, proper capitalization, formal grammar, and formal register (tone). If necessary to keep the study manageable, we will limit the number of elements of "formality" included as features in the model. Using these features as predictors, we seek to learn whether comments that are more (or less) formal than the average comment are more (or less) likely to be labeled as sarcasm.
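
As a rough illustration of what surface-level formality features could look like, the sketch below computes a few capitalization and punctuation ratios for a single comment. The function name `formality_features` and the specific features are assumptions made for this example, not the project's final feature set; spelling, grammar, and register would need additional tooling and are not covered here.

```python
import re
from typing import Dict


def formality_features(comment: str) -> Dict[str, float]:
    """Compute a few hypothetical surface-level formality features."""
    text = comment.strip()
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    words = re.findall(r"[A-Za-z']+", text)
    n_sent = len(sentences)
    n_words = len(words)

    capitalized = sum(1 for s in sentences if s[0].isupper())
    terminated = sum(1 for s in sentences if s[-1] in ".!?")
    lowercase_i = sum(1 for w in words if w == "i")
    # Words of two or more letters written entirely in capitals.
    all_caps = sum(1 for w in words if len(w) > 1 and w.isupper())

    return {
        "capitalized_sentence_ratio": capitalized / n_sent if n_sent else 0.0,
        "terminated_sentence_ratio": terminated / n_sent if n_sent else 0.0,
        "lowercase_i_ratio": lowercase_i / n_words if n_words else 0.0,
        "all_caps_word_ratio": all_caps / n_words if n_words else 0.0,
        # Runs such as "!!!" or "??" are treated as informal punctuation.
        "repeated_punct_count": float(len(re.findall(r"[!?.]{2,}", text))),
    }


if __name__ == "__main__":
    print(formality_features("This is fine. It works well."))
    print(formality_features("i dont know lol"))
```

Each comment is reduced to a small dictionary of numeric values, which could then be fed to a classifier as a feature vector.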

Our dataset is A Large Self-Annotated Corpus for Sarcasm (SARC) <1>. SARC provides over 1 million sarcastic comments in two forms: a balanced set, which pairs them with an equal number of non-sarcastic comments, and an unbalanced set, which includes 100+ million non-sarcastic comments to reflect the ratio found in the wild. SARC data is labeled via Reddit's self-annotation convention, by which authors add an "/s" notation to indicate that all or part of their comment should be read as sarcasm.
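
To make the "/s" convention concrete, here is a minimal sketch of how a trailing marker could be detected and stripped from a raw comment. The helper `split_sarcasm_marker` is hypothetical and deliberately simplified: it only handles a marker at the very end of a comment, and it is not the filtering procedure used to build SARC, which is described in <1>.

```python
import re
from typing import Tuple

# A trailing "/s" that stands alone, i.e. preceded by whitespace or the
# start of the comment, so that URLs ending in "/s" are not matched.
_MARKER = re.compile(r"(?:^|\s)/s\s*$")


def split_sarcasm_marker(comment: str) -> Tuple[str, bool]:
    """Return (comment text without the marker, whether a marker was found)."""
    match = _MARKER.search(comment)
    if match is None:
        return comment.strip(), False
    return comment[:match.start()].strip(), True


if __name__ == "__main__":
    print(split_sarcasm_marker("Great idea /s"))
    print(split_sarcasm_marker("Visit example.com/s"))
```

Stripping the marker matters for modeling: if it stayed in the text, a classifier could learn the label directly from it.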

Hypothesis: Reddit authors tend to spend more time crafting their sarcastic comments than their non-sarcastic comments, and thus sarcastic comments are characterized by greater degrees of formality.

Counter-hypothesis: Reddit authors tend to spend less time writing their sarcastic comments, dashing them off in fits of righteous indignation, and thus include less formality in their sarcastic comments than in their non-sarcastic comments.

Either hypothesis, if true, would likely be helpful in predicting sarcasm.

We start by using one or two supervised machine learning classification techniques, such as logistic regression, to establish a baseline for our model's accuracy in predicting sarcasm. We then use several deep learning techniques, such as deep convolutional neural networks (CNNs) and Long Short-Term Memory (LSTM) recurrent neural networks (RNNs), to see if we can improve the model's accuracy. The final model is then productionized and deployed as a web service via an API.
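
The baseline step can be sketched as follows: a tiny logistic regression trained with batch gradient descent on numeric feature vectors. This is an illustration under stated assumptions, not the project's implementation; in practice a library implementation would likely be used, and the two-feature toy data in the example is invented purely to show the mechanics.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.5,
    epochs: int = 1000,
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = sum(w * v for w, v in zip(weights, xi)) + bias
            error = sigmoid(z) - yi  # derivative of log loss w.r.t. z
            for j, v in enumerate(xi):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return 1 (sarcastic) if the predicted probability is at least 0.5."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


def accuracy(weights, bias, X, y) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    hits = sum(1 for xi, yi in zip(X, y) if predict(weights, bias, xi) == yi)
    return hits / len(y)


if __name__ == "__main__":
    # Invented toy data: two made-up feature values per comment, label 1 = sarcastic.
    X = [[1.0, 1.0], [0.9, 1.0], [0.1, 0.0], [0.0, 0.0]]
    y = [1, 1, 0, 0]
    w, b = train_logistic_regression(X, y)
    print("training accuracy:", accuracy(w, b, X, y))
```

On the balanced set, a classifier that always predicts one class scores 50% accuracy, so that is the floor any baseline of this kind should beat.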

Minimum computational resources needed to do this project:

Processing power (CPU): 4+ CPUs
Memory: 32GB
Specialized hardware such as GPUs: 1 GPU

This capstone project is undertaken in partial fulfilment of the requirements for the AI/Machine Learning Engineer Career Track Bootcamp at Springboard, under the mentorship of Jeff Hevrin.

<1> Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. A Large Self-Annotated Corpus for Sarcasm. Unpublished, 2017. https://arxiv.org/abs/1704.05579

Dataset

The SARC dataset is hosted at https://nlp.cs.princeton.edu/SARC/0.0/main/.

We use the version 0.0 main files: the balanced set for EDA and traditional ML techniques, and the unbalanced set for DL models. Files in main have been cleaned and filtered; note the file sizes for raw files compared to main files in the table below:

File (index of /SARC/0.0/)	main	raw
sarc.csv.bz2	11G	59G
stats.json	13M	23M
test-balanced.csv.bz2	19M
test-unbalanced.csv.bz2	2.3G
train-balanced.csv.bz2	78M
train-unbalanced.csv.bz2	8.9G
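
Because the unbalanced files run to several gigabytes compressed, it can help to stream rows instead of decompressing a whole file into memory. The sketch below does this with the standard library only. The helper name `iter_bz2_rows`, the default delimiter, and the `limit` parameter are assumptions for illustration; the actual column layout and delimiter of the SARC files should be checked against the dataset's documentation before use.

```python
import bz2
import csv
from typing import Iterator, List, Optional


def iter_bz2_rows(
    path: str,
    delimiter: str = "\t",  # assumed delimiter; verify against the real files
    limit: Optional[int] = None,
) -> Iterator[List[str]]:
    """Yield rows from a bz2-compressed delimited text file, one at a time."""
    # Text mode ("rt") decompresses lazily, so only a small buffer is in memory.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        for index, row in enumerate(reader):
            if limit is not None and index >= limit:
                break
            yield row


if __name__ == "__main__":
    # Peek at the first five rows of a locally downloaded file.
    for row in iter_bz2_rows("train-balanced.csv.bz2", limit=5):
        print(row)
```

Passing a small `limit` is a cheap way to inspect a file's structure before committing to a full pass over the data.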
