In [1]:
import random

random.seed(42)

# Task 1: Build a raw log dataset
Write code that generates a list of dictionaries representing support tickets. Each dictionary should include the fields described in the setup. Include at least 200 entries so that summaries are meaningful. Introduce realistic variation, such as a few categories that appear more frequently and occasional missing or malformed resolution_minutes values to simulate dirty data.

You are expected to write the generator logic yourself. Keep it readable and explain the logic in short markdown notes where necessary. After generating the list, print the first five entries and the total count to validate the structure.

In [2]:
def generate_raw_data(n=200):
    categories = ["Technical", "Billing", "Account", "General"]
    data = []
    for i in range(n):
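        res_time = random.randint(3, 100)
        # Dirty data: every 20th ticket has a missing value.
        if i % 20 == 0:
            res_time = None
        # Dirty data: every 25th ticket stores the value as a string.
        # Where both rules apply (i = 0 and i = 100) the result is the string 'None'.
        if i % 25 == 0:
            res_time = str(res_time)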

        ticket = {
            'ticket_id': 1000 + i,
            'customer_id': random.randint(101, 150),
            'category': random.choice(categories).lower() + (" " if i % 15 == 0 else ""),
            'resolution_minutes': res_time,
            'escalated': random.random() < 0.2
        }
        data.append(ticket)
    return data

raw_logs = generate_raw_data(200)
print(f"Total records: {len(raw_logs)}")
print("First 5 entries:", raw_logs[:5])

Total records: 200
First 5 entries: [{'ticket_id': 1000, 'customer_id': 108, 'category': 'technical ', 'resolution_minutes': 'None', 'escalated': False}, {'ticket_id': 1001, 'customer_id': 115, 'category': 'billing', 'resolution_minutes': 34, 'escalated': False}, {'ticket_id': 1002, 'customer_id': 148, 'category': 'technical', 'resolution_minutes': 89, 'escalated': False}, {'ticket_id': 1003, 'customer_id': 102, 'category': 'technical', 'resolution_minutes': 7, 'escalated': False}, {'ticket_id': 1004, 'customer_id': 139, 'category': 'technical', 'resolution_minutes': 67, 'escalated': False}]


# Task 2: Design validation helpers
Create small functions that validate the dataset. For example, write one function that checks whether all required keys are present in each record, and another function that identifies records with missing or invalid resolution_minutes. These functions should return clear results such as a list of bad records or counts of issues.

Keep function signatures simple and explicit. For instance, a validation function should take the list of records as input and return a list of indices or a filtered list. Avoid printing inside these functions; return values instead so you can reuse them in other contexts.

In [3]:
%run m1-02-summary-functions.py

In [4]:
def identify_invalid_resolution(data):
    """Return the indices of records whose resolution_minutes is not numeric."""
    invalid = []
    for i, record in enumerate(data):
        val = record.get('resolution_minutes')
        if not isinstance(val, (int, float)):
            invalid.append(i)
    return invalid
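
The task also asks for a check that every record carries the required keys. The helper script loaded with %run is not shown here, so the following is only a sketch of what such a check might look like. `REQUIRED_KEYS` and `find_missing_keys` are hypothetical names, and the key set is assumed from the fields produced by the generator in Task 1.

```python
# Hypothetical helper; the key set mirrors the fields built in generate_raw_data.
REQUIRED_KEYS = {"ticket_id", "customer_id", "category", "resolution_minutes", "escalated"}


def find_missing_keys(data, required=REQUIRED_KEYS):
    """Return {record index: sorted list of missing keys} for incomplete records."""
    problems = {}
    for i, record in enumerate(data):
        missing = required - set(record)
        if missing:
            problems[i] = sorted(missing)
    return problems


sample = [
    {"ticket_id": 1, "customer_id": 101, "category": "billing",
     "resolution_minutes": 12, "escalated": False},
    {"ticket_id": 2, "category": "general"},  # deliberately incomplete
]
issues = find_missing_keys(sample)
```

Like the resolution_minutes check above, the function returns its findings instead of printing them, so the result can be counted, logged, or used to filter records.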

# Task 3: Clean and normalize records
Write a function that takes the raw records and returns a cleaned version. At minimum, it should handle missing resolution_minutes values in a defined way and normalize category strings (such as trimming whitespace and standardizing case). If you introduced malformed values, decide whether to drop those records or repair them, and document the decision in a short markdown cell.

Use list comprehensions or loops to build the cleaned list. Avoid mutating the original list in place. At the end, show the number of records before and after cleaning, and display a few cleaned records.

In [5]:
cleaned_logs = clean_data(raw_logs)

In this task, the raw dataset is transformed into a clean version using the following logic.

Category normalization: All category strings are stripped of leading and trailing whitespace and capitalized to ensure consistency across records.

Missing values: Records with None in the resolution_minutes field are removed because calculations such as averages require valid numerical data.

Type casting: Values stored as strings are converted to integers. If a value cannot be converted (for example, the non-numeric string 'None'), the record is removed to maintain data integrity.

Immutability: Each new_record is created with the .copy() method so that the original raw_logs list remains unchanged. This preserves the raw data for auditing and verification.
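
The implementation of clean_data lives in the helper script and is not reproduced in this notebook. A minimal sketch that follows the rules described above might look like this; the body is an assumption, not the actual helper.

```python
def clean_data(records):
    """Return a cleaned copy of the records; the input list is not modified."""
    cleaned = []
    for record in records:
        value = record.get("resolution_minutes")
        if value is None:
            continue  # missing value: drop the record
        if isinstance(value, str):
            try:
                value = int(value.strip())
            except ValueError:
                continue  # non-numeric string such as 'None': drop the record
        new_record = record.copy()  # leave the raw record untouched
        new_record["category"] = record["category"].strip().capitalize()
        new_record["resolution_minutes"] = value
        cleaned.append(new_record)
    return cleaned


raw_sample = [
    {"ticket_id": 1, "category": "billing ", "resolution_minutes": "45"},
    {"ticket_id": 2, "category": "technical", "resolution_minutes": None},
    {"ticket_id": 3, "category": "general", "resolution_minutes": "None"},
]
cleaned_sample = clean_data(raw_sample)
print(f"Before: {len(raw_sample)} | After: {len(cleaned_sample)}")
```

Only the first sample record survives: its numeric string is repaired, while the missing value and the unparseable string are dropped.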

# Task 4: Build summary functions
Create functions that compute useful summaries from the cleaned data. At a minimum, include:

- Average resolution time per category
- Count of tickets per customer
- Escalation rate overall and by category

Use dictionaries to store summary results, with clear keys and values. For example, the average resolution time per category should be a dictionary mapping category name to average minutes. Your functions should return these dictionaries rather than printing them directly.

After computing each summary, write a small validation check. For example, confirm that the sum of category counts matches the total number of cleaned records. These checks are essential for catching logic errors early.
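
The summary functions used in this notebook come from the helper script, which is not shown. The sketch below is one possible shape for them. The function names match the calls in the notebook, but the bodies, the rounding, and the exact keys returned by escalation_metrics are assumptions inferred from the printed outputs.

```python
from collections import Counter, defaultdict


def avg_res_by_cat(records):
    """Map each category to its mean resolution time in minutes (2 decimals)."""
    totals = defaultdict(lambda: [0, 0])  # category -> [sum of minutes, count]
    for r in records:
        totals[r["category"]][0] += r["resolution_minutes"]
        totals[r["category"]][1] += 1
    return {cat: round(s / n, 2) for cat, (s, n) in totals.items()}


def tickets_per_customer(records):
    """Map each customer_id to the number of tickets it filed."""
    return dict(Counter(r["customer_id"] for r in records))


def escalation_metrics(records):
    """Return the overall escalation rate plus per-category rates and counts."""
    counts = Counter(r["category"] for r in records)
    escalated = Counter(r["category"] for r in records if r["escalated"])
    overall = sum(escalated.values()) / len(records) if records else 0.0
    return {
        "overall_rate": round(overall, 4),
        "rate_by_category": {c: round(escalated[c] / n, 4) for c, n in counts.items()},
        "category_counts": dict(counts),
    }


sample = [
    {"customer_id": 101, "category": "Billing", "resolution_minutes": 30, "escalated": True},
    {"customer_id": 101, "category": "Billing", "resolution_minutes": 50, "escalated": False},
    {"customer_id": 102, "category": "General", "resolution_minutes": 20, "escalated": False},
]
metrics = escalation_metrics(sample)
# Validation check in the spirit of the task: category counts must cover every record.
assert sum(metrics["category_counts"].values()) == len(sample)
```

Returning plain dictionaries keeps each summary easy to validate on its own and easy to nest into a larger report later.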

In [6]:
avg_res_time = avg_res_by_cat(cleaned_logs)
customer_stats = tickets_per_customer(cleaned_logs)
escalation_report = escalation_metrics(cleaned_logs) 

In [7]:
total_cat_count = sum(escalation_report['category_counts'].values())
is_valid = total_cat_count == len(cleaned_logs)

In [8]:
print(f"Validation Check (Category Sum == Total Records): {is_valid}")
print(f"Total Category Count: {total_cat_count} | Cleaned Log Count: {len(cleaned_logs)}")

Validation Check (Category Sum == Total Records): True
Total Category Count: 190 | Cleaned Log Count: 190


In [9]:
print(f"(Avg Res Time): {avg_res_time}")

(Avg Res Time): {'Billing': 53.27, 'Technical': 61.81, 'General': 52.59, 'Account': 53.24}


# Task 5: Package a final report
Write a function that combines the outputs of your summaries into a single report structure. This might be a dictionary that contains other dictionaries. The goal is to provide a single object that could be serialized or used by another part of a pipeline.

In a final notebook cell, print a compact report and add a short text explanation of one insight you observed. Keep the report readable and avoid overly verbose output.
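
A packaging function of this kind can be quite small. The sketch below assumes the four-argument call used in this notebook and the key names that appear in the report it prints; the body itself is an assumption, since the helper script is not shown.

```python
import json


def package_final_report(avg_res_time, customer_stats, escalation_report, total_records):
    """Combine the individual summaries into one nested, serializable dictionary."""
    return {
        "report_metadata": {
            "total_records_processed": total_records,
            "unique_customers": len(customer_stats),
        },
        "resolution_metrics": {"avg_time_by_category": avg_res_time},
        "escalation_metrics": {
            "overall_rate": escalation_report["overall_rate"],
            "rate_by_category": escalation_report["rate_by_category"],
        },
    }


# Tiny made-up inputs, only to show the shape of the result.
demo_report = package_final_report(
    {"Billing": 40.0},
    {101: 2, 102: 1},
    {"overall_rate": 0.3333, "rate_by_category": {"Billing": 0.5}},
    3,
)
serialized = json.dumps(demo_report)  # works because the report holds only plain types
```

Because the report contains only dictionaries, strings, and numbers, it can be passed to json.dumps or handed to another pipeline stage without conversion.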

In [11]:
final_report = package_final_report(
    avg_res_time, 
    customer_stats, 
    escalation_report, 
    len(cleaned_logs)
)

In [12]:
import pprint
pprint.pprint(final_report, sort_dicts=False)

{'report_metadata': {'total_records_processed': 190, 'unique_customers': 49},
 'resolution_metrics': {'avg_time_by_category': {'Billing': 53.27,
                                                 'Technical': 61.81,
                                                 'General': 52.59,
                                                 'Account': 53.24}},
 'escalation_metrics': {'overall_rate': 0.1579,
                        'rate_by_category': {'Billing': 0.25,
                                             'Technical': 0.0926,
                                             'General': 0.2432,
                                             'Account': 0.0909}}}


High Complexity in General Inquiries: Interestingly, the General category has the lowest average resolution time (52.59 minutes) yet the second-highest escalation rate at 24.32%, just behind Billing at 25%. This suggests that while general issues are usually quick to handle, roughly 1 in 4 tickets involves a scenario that the initial support tier cannot resolve, indicating a need to refine the "General" classification or update the knowledge base for common miscellaneous queries.
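
An insight like this is easy to check against the report before it is written down. The short sketch below copies the two dictionaries from the printed report and looks up the extremes; the variable names are illustrative only.

```python
# Values copied from the printed report.
rate_by_category = {"Billing": 0.25, "Technical": 0.0926, "General": 0.2432, "Account": 0.0909}
avg_time_by_category = {"Billing": 53.27, "Technical": 61.81, "General": 52.59, "Account": 53.24}

# Category with the highest escalation rate and category with the lowest average time.
highest_escalation = max(rate_by_category, key=rate_by_category.get)
fastest_category = min(avg_time_by_category, key=avg_time_by_category.get)

print(f"Highest escalation rate: {highest_escalation} ({rate_by_category[highest_escalation]:.2%})")
print(f"Lowest average resolution time: {fastest_category} ({avg_time_by_category[fastest_category]} min)")
```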