## **NOTEBOOK FOR TESTING THE FORMATTING OF THE MODEL OUTPUT**

## **Overview**
This notebook addresses a formatting issue in the output generated by a model. The model returns its responses as JSON, but the values of some fields (such as `post_summary` and `comments_summary`) are themselves stringified JSON objects. The outer JSON is still valid, but the nested content is stored as an escaped string rather than as structured data, so it is hard to read and cannot be accessed by key without a second decoding step.
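To make the problem concrete, here is a minimal sketch that uses a made-up payload. The field contents and the brand name are illustrative assumptions, not real model output. It shows how a stringified JSON field looks once serialized and how `json.loads` recovers the nested object.

```python
import json

# Hypothetical model output: the value is a string that happens to
# contain JSON, not a nested object.
model_output = {
    "post_summary": '{"summary": "Asks for shoe advice", "brands": ["Nike"]}'
}

# Serializing escapes the inner quotes, so the field reads as one long
# escaped string instead of a nested structure.
serialized = json.dumps(model_output)
print(serialized)

# Decoding the inner string turns it back into a dictionary.
decoded = json.loads(model_output["post_summary"])
print(decoded["brands"])
```

The rest of the notebook applies the same second decoding step to the real output.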

### **Set up the functions**

In [None]:
from transformers import pipeline
import json

summarizer = pipeline(
    "summarization",
    model="../models/redditsummary",
    tokenizer="../models/redditsummary",
    # device=-1
)
        
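def summarize(text, prompt):
    # Prepend the prompt so the model sees the instruction with the text.
    inputs = f"{prompt}: {text}"
    # Count the input tokens to scale the summary length bounds.
    input_tokens = summarizer.tokenizer.encode(inputs, truncation=False)
    input_len = len(input_tokens)
    # Allow up to twice the input length (capped at 1024 tokens) and
    # require at least a quarter of the input length (minimum 32).
    max_length = min(input_len * 2, 1024)
    min_length = max(32, input_len // 4)
    # Make sure max_length exceeds min_length; the bounds above can
    # cross for very short or very long inputs.
    if min_length >= max_length:
        max_length = min_length + 20
    summary = summarizer(
        inputs,
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=4,
    )
    return summary[0]['summary_text']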
    
def process_data(response, prompt):
    # response[0] holds the post; response[1] holds the comments.
    post_content = response[0]['data']['children'][0]['data'].get('selftext', '')
    # Collect comment bodies, skipping entries that have no 'body' key.
    comments = [
        comment['data']['body']
        for comment in response[1]['data']['children']
        if 'body' in comment['data']
    ]
    comments_all = ' '.join(comments)

    post_summary = summarize(post_content, prompt)
    comments_summary = summarize(comments_all, prompt)

    return {
        "post_summary": post_summary,
        "comments_summary": comments_summary
    }


### **Run the Functions**
- Open the scraped Reddit post
- Run the summarization function
- Save the summary to a JSON file

In [None]:
with open('../data/response.json', 'r') as file:
    response = json.load(file) # scraped reddit post saved in a json file
    
summary = process_data(response, "Summarize and highlight popular brands")
    
print("Post Summary:", json.dumps(summary["post_summary"], indent=4))
print("Comments Summary:", json.dumps(summary["comments_summary"], indent=4))

# save the raw summary to a json file as well
with open("../data/test_output.json", "w") as output_file:
    json.dump(summary, output_file, indent=4)

### **Raw Data**

The following is the raw data loaded from the file `test_output.json`. The file is valid JSON, but the `post_summary` and `comments_summary` values are stringified JSON objects rather than nested objects, so the output is not structured correctly. You can view the raw data file [here](../data/test_output.json).

In [None]:
with open('../data/test_output.json', 'r') as file:
    raw_data = json.load(file)
    
raw_data
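Before fixing anything, it can help to confirm which fields actually hold stringified JSON. The helper below is a hypothetical sketch (the name `is_stringified_json` and the sample dictionary are assumptions for illustration, not part of the original notebook). It treats a value as stringified JSON only if it is a string that decodes to an object or an array.

```python
import json

def is_stringified_json(value):
    """Return True if value is a str that decodes to a JSON object or array."""
    if not isinstance(value, str):
        return False
    try:
        decoded = json.loads(value)
    except json.JSONDecodeError:
        return False
    return isinstance(decoded, (dict, list))

# Hypothetical sample standing in for raw_data.
sample = {
    "post_summary": '{"summary": "text"}',
    "comments_summary": "plain text summary",
}

# Map each field name to whether its value is stringified JSON.
flags = {key: is_stringified_json(value) for key, value in sample.items()}
print(flags)
```

In the notebook itself, the same check could be applied to `raw_data` in place of `sample`.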

### **Fixing the JSON Format**

In this step, we define the function `fix_json` to process the raw data. The function checks that the raw data is a dictionary, attempts to decode the JSON strings in the `post_summary` and `comments_summary` fields, and saves the formatted output to a new file. If either field cannot be decoded, it prints the error and returns without writing the file.


In [None]:
def fix_json(raw_data, fixed_path):
    """Decode the stringified summary fields and write them to fixed_path."""
    if not isinstance(raw_data, dict):
        raise ValueError(f"Expected a dictionary, but got: {type(raw_data)}")

    try:
        formatted_data = {
            "post_summary": json.loads(raw_data["post_summary"]),  
            "comments_summary": json.loads(raw_data["comments_summary"])  
        }
    except json.JSONDecodeError as e:
        print("Error decoding JSON:", e)
        return
    
    with open(fixed_path, "w") as file:
        json.dump(formatted_data, file, indent=4)

    print(f"Formatted JSON saved to {fixed_path}")


fix_json(raw_data, "../data/formatted_output.json")
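`fix_json` stops as soon as one field fails to decode. If the model sometimes returns plain text in one field and stringified JSON in the other, a more tolerant variant may be useful. The sketch below is one possible approach, not the notebook's method: `decode_fields` is a hypothetical helper that decodes each field independently and leaves a value unchanged when it is not valid JSON.

```python
import json

def decode_fields(raw_data, fields=("post_summary", "comments_summary")):
    """Decode each listed field if it holds valid JSON; otherwise keep it as-is."""
    result = dict(raw_data)  # shallow copy so the input is not modified
    for field in fields:
        value = raw_data.get(field)
        if not isinstance(value, str):
            continue  # missing or already structured: nothing to decode
        try:
            result[field] = json.loads(value)
        except json.JSONDecodeError:
            pass  # not valid JSON: keep the original string
    return result

# Hypothetical input mixing a stringified object with plain text.
example = {
    "post_summary": '{"summary": "text", "brands": ["A", "B"]}',
    "comments_summary": "plain text summary",
}
print(json.dumps(decode_fields(example), indent=4))
```

The trade-off is that a malformed field passes through silently, so the strict behavior of `fix_json` may still be preferable when every field is expected to be JSON.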


Once the raw data has been formatted, we can load the resulting `formatted_output.json` file to confirm that the JSON is structured correctly. View the file [here](../data/formatted_output.json).

In [None]:
with open('../data/formatted_output.json', 'r') as file:
    formatted_data = json.load(file)

formatted_data