
## One Million Reddit Jokes

### Introduction

**Overview**
The dataset was downloaded as a CSV file containing 1M posts from the r/Jokes subreddit. Of the relevant features, the "title" is the post's title, i.e. the joke's setup. The "selftext" is the punchline, i.e. what a user sees after clicking through to the post's content. It's worth noting that many posts in this table don't fit this pattern: their "selftext" is missing (NaN).

**Score**
The "score" value describes the net number of upvotes, i.e. the positive ratings the post received minus any downvotes. While Reddit allows for negative values, the minimum value in the dataset is zero. Because a user's own post automatically receives a single upvote when it is submitted, I assume that posts with a score of zero in this dataset were downvoted at least once.
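
The zero-score assumption can be sanity-checked by looking at the minimum score and at how many posts sit at zero. A minimal sketch, using a small made-up DataFrame in place of the real file (only the `score` column name comes from the dataset; the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the real dataset; these scores are invented.
toy = pd.DataFrame({"score": [0, 1, 1, 5, 0, 120]})

min_score = toy["score"].min()            # lowest score present
zero_share = (toy["score"] == 0).mean()   # fraction of posts with a score of zero

print(f"minimum score: {min_score}")
print(f"share of posts with score 0: {zero_share:.2%}")
```

On the real data, the same two lines would run against `df_one_million_reddit_jokes` once `score` has been converted to integers.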

[Original Source](https://query.data.world/s/htrdsouy327xqa4w457qx6k6sjtj6r)

### Project Ideas

**Exploratory Data Analysis** 
- Try to understand intuitively "what makes a joke funny" using simple exploratory data analysis. 

**Funny / Not Funny - Classification**
- The ultimate goal in wrangling these data is to build a dataset for classifying jokes as either funny or not funny, using the upvotes as the label.
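
One possible way to turn upvotes into a funny / not funny label is to threshold the score. The sketch below is only an illustration: the median cutoff and the `funny` column name are assumptions, not something the dataset prescribes, and the toy values are made up.

```python
import pandas as pd

# Toy stand-in; the real data has a numeric `score` column.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "score": [0, 1, 3, 40, 500],
})

# Hypothetical cutoff: a joke counts as "funny" if its score is strictly above the median.
threshold = toy["score"].median()
toy["funny"] = (toy["score"] > threshold).astype(int)

print(toy[["score", "funny"]])
```

A median cutoff gives roughly balanced classes on skewed score data; a fixed score or a higher quantile would be equally defensible choices.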

**Jokes Generation**
- Train and generate jokes using a language generation model (GPT for example).

**Funny Jokes Generation**
- Training a language model to generate jokes is one thing, but generating **funny** jokes is a completely different (and much, much harder) task!

Are you up for a challenge? ;) 

### Data Loading

In [None]:
import numpy as np
import pandas as pd
import warnings; warnings.filterwarnings("ignore")

df_one_million_reddit_jokes = pd.read_csv('../input/one-million-reddit-jokes/one-million-reddit-jokes.csv', names = ['type', 'id', 'subreddit.id', 'subreddit.name', 'subreddit.nsfw', 'created_utc', 'permalink', 'domain', 'url', 'selftext', 'title', 'score'])

print('Loaded')

### Exploratory Data Analysis

- Let's take a look at the average score for each setup (note that in a joke the setup is the title). Which setups have the highest average score?

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# let's check the number of missing values per column
df_one_million_reddit_jokes.isnull().sum()

Lots of NaNs in the selftext column, as we'd expect.

In [None]:
# Keep only rows whose score is a whole number; .copy() avoids chained-assignment warnings below
df_one_million_reddit_jokes = df_one_million_reddit_jokes.loc[df_one_million_reddit_jokes['score'].apply(lambda x: str(x).isdigit())].copy()
df_one_million_reddit_jokes['score'] = df_one_million_reddit_jokes['score'].astype(int)
df_one_million_reddit_jokes['title'] = df_one_million_reddit_jokes['title'].astype(str)

# Get average score per setup
setup_score = df_one_million_reddit_jokes.groupby('title').score.mean()
setup_score = setup_score.reset_index()

# Take a look at the top 25 setups
setup_score.sort_values(by = 'score', ascending = False).head(25)
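
A setup that was posted only once has an "average" equal to that single post's score, so the top of the ranking above can be dominated by one-off posts. A minimal sketch of computing the post count alongside the mean, on made-up data; the `min_posts` cutoff and the column names `mean_score` and `n_posts` are hypothetical:

```python
import pandas as pd

# Toy stand-in for the real data; titles and scores are invented.
toy = pd.DataFrame({
    "title": ["Knock knock", "Knock knock", "Why?", "Why?", "Why?", "One-off"],
    "score": [10, 20, 1, 2, 3, 900],
})

# Mean score and number of posts per setup, computed in one pass
setup_stats = (
    toy.groupby("title")["score"]
       .agg(mean_score="mean", n_posts="count")
       .reset_index()
)

# Hypothetical cutoff: only rank setups that appear at least twice
min_posts = 2
repeated = setup_stats[setup_stats["n_posts"] >= min_posts]
top = repeated.sort_values("mean_score", ascending=False)

print(top)
```

Whether to apply such a filter depends on the question: for "which recurring setups do well" it helps, while for "which individual posts did best" the unfiltered ranking is the right one.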

In [None]:
# Plot the first 300 rows of setup_score (in groupby order, not sorted by score)
plt.figure(figsize = (48,8))
sns.barplot(x = 'title', y = 'score', data = setup_score.head(300))
plt.xticks(rotation = 90)
plt.show()

- What's the average setup length? What's the average setup length for the setups with the highest average score?

In [None]:
import matplotlib.pyplot as plt

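# Plot the distribution of setup lengths (in characters)
# histplot replaces the deprecated distplot; stat="density" with kde=True mirrors its default look
plt.figure(figsize = (12,8))
sns.histplot(df_one_million_reddit_jokes.title.str.len(), stat = "density", kde = True)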
plt.show()

# Let's get a list of the top 25 setups
top_setups = setup_score.sort_values(by = 'score', ascending = False).head(25).title.values

# Plot the distribution of setup lengths for the setups with the highest average score
plt.figure(figsize = (12,8))
sns.histplot(df_one_million_reddit_jokes[df_one_million_reddit_jokes.title.isin(top_setups)].title.str.len(), stat = "density", kde = True)
plt.show()
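
The plots show the distribution of lengths; the averages the question asks about can also be computed directly. A minimal sketch on made-up data (the toy titles and the `top_n` value of 2 are for illustration only; the notebook uses the top 25):

```python
import pandas as pd

# Toy stand-in; titles and scores are invented.
toy = pd.DataFrame({
    "title": ["abcd", "ab", "abcdef", "abcdefgh"],
    "score": [1, 2, 50, 100],
})

# Length of every setup, in characters
title_len = toy["title"].str.len()
overall_mean = title_len.mean()

# Setups with the highest average score
top_n = 2
top_titles = toy.groupby("title")["score"].mean().nlargest(top_n).index

# Mean length restricted to those top setups
top_mean = title_len[toy["title"].isin(top_titles)].mean()

print(f"mean setup length (all): {overall_mean:.1f}")
print(f"mean setup length (top {top_n}): {top_mean:.1f}")
```

The same pattern should carry over to the punchline by swapping `title` for `selftext` in the length computation; `.mean()` skips missing values by default.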

- What's the average punchline length? What's the average punchline length for the setups with the highest average score?

In [None]:
# Plot the distribution of punchline lengths (in characters); missing punchlines are ignored
plt.figure(figsize = (12,8))
sns.histplot(df_one_million_reddit_jokes.selftext.str.len(), stat = "density", kde = True)
plt.show()

# Plot the distribution of punchline lengths for the setups with the highest average score
plt.figure(figsize = (12,8))
sns.histplot(df_one_million_reddit_jokes[df_one_million_reddit_jokes.title.isin(top_setups)].selftext.str.len(), stat = "density", kde = True)
plt.show()