### **Step 1: Import Necessary Libraries**

Start by importing the required libraries:

In [12]:
import openai
import json
import os
import time
from tqdm import tqdm
import random

### **Step 2: Define File Paths**

Set the input and output file paths:

In [7]:
input_file = '../../data/raw/fullrjokesdata.json'
output_file = '../../data/processed/joke_selection_untaged.json'


### **Step 3: Specify Relevant Columns**

List the columns we want to retain:

In [5]:
relevant_columns = ['id', 'title', 'selftext', 'ups', 'score', 'created_utc']


### **Step 4: Process the Data**

The raw file is large (over a million lines, one JSON object per line), so we read and process it line by line rather than loading it into memory all at once:

In [10]:
# Initialize a counter for the number of jokes processed
jokes_count = 0

# Open the input and output files
with open(input_file, 'r', encoding='utf-8') as infile, \
     open(output_file, 'w', encoding='utf-8') as outfile:

    # Iterate over each line (each joke)
    for line in tqdm(infile, desc='Processing jokes'):
        try:
            # Parse the JSON line
            joke = json.loads(line)
            
            # Keep only jokes whose score is greater than 50
            if joke.get('score', 0) > 50:
                # Extract relevant columns
                filtered_joke = {key: joke.get(key, None) for key in relevant_columns}
                
                # Write the filtered joke to the output file
                json.dump(filtered_joke, outfile)
                outfile.write('\n')  # Write each joke on a new line
                
                jokes_count += 1
        except json.JSONDecodeError:
            # Skip lines that are not valid JSON
            continue

print(f"Total jokes after filtering: {jokes_count}")

Processing jokes: 1064928it [00:05, 190753.01it/s]

Total jokes after filtering: 98070
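
The same filter can also be wrapped in a small generator so it can be reused or tried on a handful of lines. This is a minimal sketch, not the notebook's own code: the function name `iter_filtered_jokes` and its parameters are assumptions, and it additionally skips records whose `score` is missing or not numeric, a case the loop above does not guard against.

```python
import json

def iter_filtered_jokes(lines, columns, min_score=50):
    """Yield filtered joke dicts from an iterable of JSON lines (hypothetical helper)."""
    for line in lines:
        try:
            joke = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(joke, dict):
            continue  # skip valid JSON that is not an object
        score = joke.get('score')
        if not isinstance(score, (int, float)):
            continue  # skip missing or non-numeric scores (e.g. null)
        if score > min_score:
            # Keep only the requested columns; missing ones become None
            yield {key: joke.get(key) for key in columns}

# Small in-memory example instead of the real input file
sample_lines = [
    '{"id": "a", "title": "t1", "score": 73, "extra": 1}',
    'not json',
    '{"id": "b", "title": "t2", "score": null}',
    '{"id": "c", "title": "t3", "score": 10}',
]
kept = list(iter_filtered_jokes(sample_lines, ['id', 'title', 'score']))
print(kept)  # [{'id': 'a', 'title': 't1', 'score': 73}]
```

Because the helper accepts any iterable of lines, an open file handle such as `infile` could be passed in directly.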





### **Step 5: Verify the Output**

To confirm the data was processed correctly, we read the first few lines of the output file:


In [11]:
with open(output_file, 'r', encoding='utf-8') as f:
    for _ in range(5):
        line = f.readline()
        joke = json.loads(line)
        print(joke)

{'id': '8rch9', 'title': 'How do you know your girlfriend is getting fat?', 'selftext': '', 'ups': 73, 'score': 73, 'created_utc': 1244640182.0}
{'id': '99k53', 'title': 'Knock knock... [pic]', 'selftext': '', 'ups': 84, 'score': 84, 'created_utc': 1249996447.0}
{'id': '9p7em', 'title': 'Husband Asks Wife "What would you do if I hit the lottery?"', 'selftext': "A husband ask's his wife...what would you do if I hit the lotto? She replies I'd take half and leave your ass...the husband says..okay I just won 12 bucks on this scratch off...here's 6 bux now get the fuck out!", 'ups': 62, 'score': 62, 'created_utc': 1254242875.0}
{'id': 'a0ut6', 'title': "Don't do that.", 'selftext': 'One day a little girl is outside with her father. She claps her hands together and said "Daddy, I killed a butterfly." Her father replied "Don\'t do that, butterflies are our friends. No butter for a week." A little while later the girl was playing and she clapped her hands and said "Daddy, daddy, I killed a hon
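
Printing a few records is a useful spot check, but it does not cover the whole file. As a rough sketch of a fuller check, the helper below verifies that every record has exactly the expected keys and a score above the threshold. The name `check_records` and its return format are assumptions made for illustration; the example runs on in-memory lines rather than the real output file.

```python
import json

def check_records(lines, expected_keys, min_score):
    """Return (records_checked, problems) for an iterable of JSON lines (hypothetical helper)."""
    checked = 0
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines such as a trailing newline
        record = json.loads(line)
        checked += 1
        score = record.get('score')
        if set(record) != set(expected_keys):
            problems.append((number, 'unexpected keys'))
        elif not isinstance(score, (int, float)) or score <= min_score:
            problems.append((number, 'score not above threshold'))
    return checked, problems

# In-memory example: the second record is missing a column, the third has a low score
demo_lines = [
    '{"id": "a", "title": "t1", "score": 73}',
    '{"id": "b", "score": 84}',
    '{"id": "c", "title": "t3", "score": 12}',
]
checked, problems = check_records(demo_lines, ['id', 'title', 'score'], 50)
print(checked)   # 3
print(problems)  # [(2, 'unexpected keys'), (3, 'score not above threshold')]
```

In the notebook, the same call could be made with the open output file, `relevant_columns`, and the threshold of 50 used in Step 4.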

### **Step 6: Select a Test Set of High-Scoring Jokes**

In this step, we will load the processed jokes from the file `joke_selection_untaged.json` and extract a small test set of jokes with high scores. The purpose of this test set is to explore tagging strategies and test our heuristics on jokes that received positive feedback from users. We'll define a threshold for "high-scoring" jokes and randomly select a small subset for experimentation.

#### Criteria for Selecting the Test Set:
- Jokes with a score above a certain threshold (e.g., `score > 90`).
- Randomly select 20 jokes (`test_set_size`) to form the test set.


In [14]:

# Define the threshold for high-scoring jokes
score_threshold = 90
test_set_size = 20

# Load the filtered jokes and gather high-scoring jokes
high_scoring_jokes = []

with open(output_file, 'r', encoding='utf-8') as infile:
    for line in infile:
        joke = json.loads(line)
        if joke.get('score', 0) > score_threshold:
            high_scoring_jokes.append(joke)

# Randomly select a subset of high-scoring jokes for the test set
test_set = random.sample(high_scoring_jokes, min(test_set_size, len(high_scoring_jokes)))

# Display the selected test set of jokes
print(test_set)
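
Note that `random.sample` returns a different subset on every run unless the random generator is seeded. If a stable test set is wanted while experimenting with tagging strategies, one option is a dedicated seeded generator. The sketch below assumes a hypothetical helper `sample_test_set` and an arbitrary seed of 42; neither comes from the notebook.

```python
import random

def sample_test_set(jokes, size, seed=42):
    """Return a reproducible random subset of jokes (hypothetical helper)."""
    rng = random.Random(seed)  # local generator; leaves the global random state untouched
    return rng.sample(jokes, min(size, len(jokes)))

# Stand-in data with the same shape as the high-scoring jokes
demo = [{'id': str(i), 'score': 91 + i} for i in range(50)]

first = sample_test_set(demo, 20)
second = sample_test_set(demo, 20)
print(first == second)                     # True: same seed, same subset
print(len(sample_test_set(demo[:5], 20)))  # 5: capped at the number of available jokes
```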




### **Step 7: Tagging Jokes Using ChatGPT**

In this step, we use an OpenAI GPT model to tag each joke in the test set with predefined categories. Each joke is sent to the model through the OpenAI API, and the model returns the corresponding tags.
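
Before wiring up the API call, it can help to separate the two pieces that do not need network access: building the request messages and parsing the reply. The sketch below is illustrative only. The category list is hypothetical (the notebook's actual categories are not shown here), `build_messages` and `parse_tags` are assumed names, and the API call itself is deliberately left out.

```python
import json

# Hypothetical category list, used only for illustration
CATEGORIES = ['pun', 'dark', 'observational', 'other']

def build_messages(joke, categories=CATEGORIES):
    """Build a chat-style message list asking for tags as a JSON array."""
    text = joke['title']
    if joke.get('selftext'):
        text += '\n' + joke['selftext']
    system = (
        'You tag jokes. Reply with a JSON array containing only tags from: '
        + ', '.join(categories)
    )
    return [
        {'role': 'system', 'content': system},
        {'role': 'user', 'content': text},
    ]

def parse_tags(reply, categories=CATEGORIES):
    """Parse the model reply, keeping only known categories; return [] on bad output."""
    try:
        tags = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(tags, list):
        return []
    return [tag for tag in tags if tag in categories]

messages = build_messages({'title': 'Knock knock... [pic]', 'selftext': ''})
print(messages[1]['content'])              # Knock knock... [pic]
print(parse_tags('["pun", "unknown"]'))    # ['pun']
print(parse_tags('Sorry, I cannot help'))  # []
```

Asking for a JSON array and discarding unknown tags keeps the downstream data clean even when the model replies with free text or invents a category.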