# Computer Vision with detr-resnet-50 
## Computer eyes vs human eyes?

Hello everyone,

This Jupyter notebook is a step-by-step guide to counting cars in a video with an open-source Object Detection (OD) model from Hugging Face. It uses the `transformers` and `opencv-python` libraries to read the input video, run detection on every frame, and write a labeled output video with a per-frame counter. `matplotlib` is used to display images for visualization.

This script is inspired by various computer vision tutorials and adapted to demonstrate practical applications of machine learning in video analysis.

Requirements:
* `transformers` (Library for state-of-the-art machine learning models)
* `opencv-python` (Library for computer vision tasks)
* `pillow` (Image library, used to pass frames to the model)
* `Python 3.10` (use the kernel selector in the top-right corner to change the version)
* `matplotlib 3.4.3` (Plotting library, used here only to display images)
* `torch` (Deep learning library, required for running the Hugging Face models)
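
Before running the cells in this notebook, it can help to confirm that these packages are importable in the current kernel. Here is a minimal sketch that uses only the standard library. The helper name `check_requirements` is hypothetical, and the import names are the ones this notebook relies on (`opencv-python` imports as `cv2`, `pillow` imports as `PIL`).

```python
import importlib.util

# Import names used in this notebook
# (opencv-python imports as cv2, pillow imports as PIL).
REQUIRED_MODULES = ["transformers", "torch", "cv2", "PIL", "matplotlib"]


def check_requirements(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    # Print one status line per package
    for name, found in check_requirements().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Anything reported as `MISSING` can be installed with the `%pip install` cell in step 1.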

## Video Sources

For this tutorial, we'll be using free stock videos from Pexels. You can use your own videos or download these examples:

- People walking: https://www.pexels.com/video/black-and-white-video-of-people-853889/ (works ok)
- Dogs in snow: https://www.pexels.com/video/dogs-enjoying-the-snow-4157418/ (works meh/bad)
- Dogs playing: https://www.pexels.com/video/dogs-playing-854179/ (works well!)
- Cars: https://www.pexels.com/video/cars-on-highway-854671/ (works well!)


# Step-by-Step Guide

# 1. First, let's install the necessary libraries:

In [None]:
%pip install transformers torch pillow opencv-python matplotlib

Note: you may need to restart the kernel to use updated packages.


# 2. Load the pre-trained object detection model using the transformers library

In [None]:
import torch
from transformers import DetrForObjectDetection, DetrImageProcessor

if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Using MPS")
elif torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using CUDA")
else:
    device = torch.device("cpu")
    print("Using CPU")

model_name = "facebook/detr-resnet-50"
processor = DetrImageProcessor.from_pretrained(model_name, revision="no_timm")
model = DetrForObjectDetection.from_pretrained(model_name, revision="no_timm").to(device)
print(f"Model and processor loaded: {model_name}")

Using MPS
Model and processor loaded: facebook/detr-resnet-50
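
The device check above assumes a PyTorch build that exposes `torch.backends.mps`. Older builds do not have that attribute, so a more defensive version of the same priority order (MPS, then CUDA, then CPU) could look like the sketch below. The function name `pick_device_name` and the stand-in `fake_torch` object are hypothetical; the stand-in only exists so the sketch runs without PyTorch installed.

```python
from types import SimpleNamespace


def pick_device_name(torch_module):
    """Return "mps", "cuda", or "cpu", in that order of preference."""
    # Older PyTorch builds have no torch.backends.mps, so look it up defensively.
    backends = getattr(torch_module, "backends", None)
    mps = getattr(backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch_module.cuda.is_available():
        return "cuda"
    return "cpu"


# Stand-in object for illustration only; in the notebook you would call
# pick_device_name(torch) and wrap the result in torch.device(...).
fake_torch = SimpleNamespace(
    backends=SimpleNamespace(),  # no "mps" attribute, like an older build
    cuda=SimpleNamespace(is_available=lambda: False),
)
print(pick_device_name(fake_torch))
```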


# 3. Define some helper functions for processing frames and drawing bounding boxes:

# detect_objects

## Inputs:
- `frame`: Single frame from the video (in BGR format)
- `processor`: Image processor for the model
- `model`: Computer vision model used for object detection
- `threshold`: Confidence threshold for object detection (default: 0.7)

## Imports:
- `PIL.Image`: For image processing
- `cv2`: OpenCV library for computer vision tasks

## Logic:
1. Prepare the image:
   - Convert the frame from BGR to RGB color space
   - Create a PIL Image object from the RGB frame

2. Process the image:
   - Use the `processor` to prepare the image for the model
   - Move the processed inputs to the appropriate device (e.g., CPU or GPU)

3. Perform object detection:
   - Run the model on the processed inputs
   - Get the output from the model

4. Post-process the results:
   - Create a tensor with the original image size
   - Use the processor's post-processing method to refine the detection results
   - Apply the confidence threshold to filter detections

## Output:
- Returns `results`: Processed detection results containing:
  - Bounding boxes
  - Labels
  - Confidence scores
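
To make step 4 more concrete: DETR predicts each box as normalized `(center_x, center_y, width, height)` values between 0 and 1, and post-processing converts them to pixel corner coordinates using the original image size. The pure-Python sketch below mimics that conversion and the threshold filter. The function names are hypothetical and the numbers are made up; this illustrates the idea and is not the library's actual implementation.

```python
def to_pixel_corners(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    x_max = (cx + w / 2) * image_width
    y_max = (cy + h / 2) * image_height
    return [x_min, y_min, x_max, y_max]


def filter_by_threshold(scores, boxes, threshold=0.7):
    """Keep only the (score, box) pairs whose score is above the threshold."""
    return [(s, b) for s, b in zip(scores, boxes) if s > threshold]


# Made-up example: two candidate boxes in a 100x50 pixel image
scores = [0.92, 0.35]
boxes = [
    to_pixel_corners((0.5, 0.5, 0.2, 0.4), image_width=100, image_height=50),
    to_pixel_corners((0.1, 0.1, 0.1, 0.1), image_width=100, image_height=50),
]
kept = filter_by_threshold(scores, boxes)
print(kept)
```

This is also why the image size is passed as `(height, width)`: PIL reports `(width, height)`, so the size tuple has to be reversed before post-processing.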

In [None]:
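from PIL import Image
import cv2
import torch

def detect_objects(frame, processor, model, threshold=0.7):
    # OpenCV delivers frames as BGR; the model expects RGB
    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    pil_image = Image.fromarray(rgb_frame)

    # Preprocess and move the tensors to the device the model lives on
    inputs = processor(images=pil_image, return_tensors="pt").to(model.device)

    # Inference only, so skip gradient tracking to save memory and time
    with torch.no_grad():
        outputs = model(**inputs)

    # PIL reports (width, height); post-processing expects (height, width)
    target_sizes = torch.tensor([pil_image.size[::-1]])
    results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=threshold)[0]

    return results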

# draw_boxes_and_count

## Inputs:
- `frame`: Single frame of the video
- `results`: Dictionary returned by `detect_objects`, holding for each detected object:
  - Box coordinates (where the object is in frame)
  - Label (what the object is)
  - Scores (how certain the model is the object is labeled correctly)
- `model`: Computer vision model used to detect objects (`detr-resnet-50` in our case)
- `thing`: Object you want to count

## Logic: 

For every object that the model detected in the frame (packaged nicely in `results`):
1. Check if the `label` of the object is `thing`
2. If yes:
   - Add one to the frame's count
   - Draw the box around the object
   - Add the `label` to the box
3. After checking all the objects detected in the frame:
   - Put the total `thing` count on the frame

## Outputs:
- `frame`: edited version of frame
- `count`: how many things there are in frame
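
`draw_boxes_and_count` tallies a single `thing`. If you are curious what else the model found in a frame, a `collections.Counter` over the label names gives every count at once. This is a minimal sketch: the helper name `count_labels` is hypothetical, and the label ids and the tiny `id2label` mapping are made up for illustration. In the notebook the real mapping comes from `model.config.id2label`.

```python
from collections import Counter


def count_labels(label_ids, id2label):
    """Count how many detections fall under each label name."""
    return Counter(id2label[label_id] for label_id in label_ids)


# Made-up ids and mapping, NOT the model's real label table
id2label = {0: "car", 1: "person", 2: "dog"}
detected_label_ids = [0, 0, 1, 0, 2]

counts = count_labels(detected_label_ids, id2label)
print(counts["car"])      # count for one label
print(counts["bicycle"])  # a Counter returns 0 for labels it never saw
```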

In [None]:
def draw_boxes_and_count(frame, results, model, thing):
    color = (255, 0, 0)  # OpenCV colors are BGR, so this draws in blue
    count = 0
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        # Keep only detections whose class name matches the requested object
        if model.config.id2label[label.item()] == thing:
            count += 1
            x_min, y_min, x_max, y_max = (int(v) for v in box.tolist())
            cv2.rectangle(frame, (x_min, y_min), (x_max, y_max), color, 2)
            label_text = f"{thing}: {score.item():.2f}"
            cv2.putText(frame, label_text, (x_min, y_min - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, color, 2)

    # Total for this frame, drawn near the top-left corner
    cv2.putText(frame, f"{thing} count: {count}", (10, 50), cv2.FONT_HERSHEY_SIMPLEX, 1, color, 2)
    return frame, count

# process_video

## Inputs:
- `video_path`: Path to the input video file
- `output_path`: Path where the processed video will be saved
- `thing`: Object to detect and count in the video
- `model`: Computer vision model used for object detection
- `processor`: Image processor for the model

## Logic:
1. Load the video:
   - Open the video file
   - Get video properties (width, height, fps)
   - Set up video writer for output

2. Initialize variables:
   - `frame_count`: Counter for processed frames
   - `total_num_frames`: Total number of frames in the video
   - `total_count`: Total count of detected objects across all frames

3. Process each frame:
   - Read a frame from the video
   - If frame is empty, end processing
   - Increment `frame_count`
   - Display progress (current frame / total frames)
   - Detect objects in the frame using `detect_objects` function
   - Draw boxes and count objects using `draw_boxes_and_count` function
   - Add the frame's object count to `total_count`
   - Write the processed frame to the output video

4. Clean up:
   - Release video capture and writer
   - Close all OpenCV windows

5. Print results:
   - Confirmation of video processing completion
   - Total number of objects detected across all frames
   - Average number of objects per frame

## Output:
- Processed video file saved to `output_path`
- Returns the last processed frame
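
One thing to keep in mind when reading the printed results: the total is a sum of per-frame counts, so a car that stays visible for 100 frames contributes 100 to it. Nothing in this notebook tracks objects between frames, so the total is not the number of distinct cars. The sketch below shows how the summary numbers relate to each other. The helper name `summarize_counts` and the `peak` statistic are additions for illustration, and the counts are made up.

```python
def summarize_counts(per_frame_counts):
    """Summarize a list holding one object count per processed frame."""
    frames = len(per_frame_counts)
    total = sum(per_frame_counts)
    return {
        "frames": frames,
        "total": total,  # sum over frames, not distinct objects
        "average": total / frames if frames else 0.0,  # guard against empty videos
        "peak": max(per_frame_counts, default=0),  # busiest single frame
    }


# Made-up counts for a four-frame clip
summary = summarize_counts([2, 3, 3, 1])
print(summary)
```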

In [None]:
def process_video(video_path, output_path, thing, model, processor):

    print('Loading video...')
    cap = cv2.VideoCapture(video_path)
    # VideoCapture does not raise on a bad path, so check explicitly
    if not cap.isOpened():
        raise IOError(f"Could not open video: {video_path}")
    print('Video loaded successfully.')

    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)  # keep fractional rates such as 29.97

    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_path, fourcc, fps, (frame_width, frame_height))

    frame_count = 0
    total_num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    total_count = 0
    last_frame = None  # last annotated frame, returned to the caller

    while True:
        ret, frame = cap.read()
        if not ret:  # no more frames to read
            break

        frame_count += 1
        print(f"\rProcessing frame {frame_count}/{total_num_frames}", end='', flush=True)

        results = detect_objects(frame, processor, model)
        frame_with_boxes, count = draw_boxes_and_count(frame, results, model, thing)

        total_count += count
        last_frame = frame_with_boxes

        out.write(frame_with_boxes)

    cap.release()
    out.release()
    cv2.destroyAllWindows()

    print()  # finish the progress line before printing the summary
    print(f"Video processing complete. Output saved to {output_path}")
    print(f"Total number of {thing} detected across all frames: {total_count}")
    if frame_count:  # avoid dividing by zero when no frames were read
        print(f"Average number of {thing} per frame: {total_count / frame_count:.2f}")
    return last_frame

# 4. Setting the video path and processing video

In [None]:
video_path = 'sample videos/cars.mp4'  # Replace with your input video path
output_path = video_path + 'output.mp4'  # Saved next to the input video; replace with your desired output path
thing = "car"  # Replace with the object you want to detect

process_video(video_path, output_path, thing, model, processor)

Loading video...
Video loaded successfully.
Processing frame 1501/1501Video processing complete. Output saved to sample videos/cars.mp4output.mp4
Total number of car detected across all frames: 8109
Average number of car per frame: 5.40
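
If you would rather have a tidier output file name, one option is to build it from the input path with `pathlib`. This is a minimal sketch: the helper name `build_output_path` and the `_output` suffix are hypothetical choices, and `PurePosixPath` is used here only so the example behaves the same on every operating system. In the notebook, `pathlib.Path` would be the natural choice.

```python
from pathlib import PurePosixPath


def build_output_path(video_path, suffix="_output"):
    """Insert a suffix before the file extension, keeping the same folder."""
    path = PurePosixPath(video_path)
    return str(path.with_name(f"{path.stem}{suffix}{path.suffix}"))


print(build_output_path("sample videos/cars.mp4"))
```

Keeping the output in the same folder as the input avoids pointing the video writer at a directory that does not exist.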


# Ta da!

Done! This script processes your input video, detects and counts `thing` in each frame, and produces an output video with bounding boxes around every detected `thing` and a `thing` count in the top-left corner.

Feel free to experiment with different videos or adjust the detection threshold to see how it affects the results. Happy coding!
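
To get a feel for what the threshold does before re-running a whole video, here is a small sketch that counts how many detections would survive at several thresholds. The function name `counts_by_threshold` is hypothetical and the confidence scores are made up. Note that `detect_objects` accepts a `threshold` argument, but `process_video` calls it with the default of 0.7, so you would need to pass a different value through to try this on real frames.

```python
def counts_by_threshold(scores, thresholds):
    """Map each threshold to the number of scores strictly above it."""
    return {t: sum(1 for s in scores if s > t) for t in thresholds}


# Made-up confidence scores for the detections in one frame
scores = [0.95, 0.82, 0.71, 0.55, 0.40]
sweep = counts_by_threshold(scores, [0.5, 0.7, 0.9])
print(sweep)
```

A lower threshold keeps more detections, including more false positives; a higher one keeps fewer but more confident ones.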