# What is Video Classification?  

A video is a sequence of frames arranged in a specific order.  

### Image Classification Steps  
When we perform **image classification**:  
- We take **images**  
- We **extract features** (using a CNN)  
- We **classify the image** based on the extracted features  

### Video Classification Steps  
For **video classification**, the necessary steps are:  
1. **Extraction of frames** from the video  
2. **Feature extraction** from each frame (using CNNs)  
3. **Classify every frame** based on the extracted features  

human activity recognition

classifying the action performed by a human

how is it different from a normal classification task?

    It is different because we need series of frames to predict the activity performed by a person.

    imagine if you provide one frame
    an algorithm will give a wrong result

    if you want to accurately recognise the action, you need the entire video chopped into a series of frames to make prediction

with a VCM, you can perform human action recognition

it can also make use of environmental context when making prediction

1 drawback:
    our algorithm will classify on the basis of every single frame
    for some frames, the algorithm is confident it is a certain acgtion being performed but sometimes it predicts incorrect label for a particular frame

solution: 
    classify a video not based on single frames but based on average over each frame


practical i

use cnn and lstm

cnn is used to extract the features of all the frames of the video 

and then each output of the cnn is fed into the rnn and then that one fuses something sha

create dataset folder
    train and test directory splitts
    inside the train folder, you should have three different classes of actions
    you should have short videos on each action (at least 50 😂)

    in the test directory, five videos of each action as well


In [1]:
import os

def delete_ds_store(root_dir):
    for root, dirs, files in os.walk(root_dir):
        for file in files:
            if file == ".DS_Store":
                file_path = os.path.join(root, file)
                os.remove(file_path)
                print(f"Deleted: {file_path}")

# Usage
delete_ds_store("/Users/morakinyo.akin-jimoh/Desktop/flow/src/dataset")  # Replace with your dataset path
print("All .DS_Store files removed!")

All .DS_Store files removed!


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

dataset_path = os.listdir('dataset/train')
label_types = os.listdir('dataset/train')
print(label_types)

['music', 'cooking', 'gymnastics', 'soccer', 'basketball', 'workout']


### Prepare Training Data

In [3]:
rooms = []

for item in dataset_path:
# Get all file names
    all_rooms = os.listdir('dataset/train' + '/' + item)

# add them to the list    
    for room in all_rooms:
        rooms.append((item, str('dataset/train' + '/' + item) + '/' + room))
        
train_df = pd.DataFrame(data=rooms, columns=['tag', 'video_name'])
print(train_df.head())
print(train_df.tail())

     tag                        video_name
0  music  dataset/train/music/music-31.avi
1  music  dataset/train/music/music-25.avi
2  music  dataset/train/music/music-19.avi
3  music  dataset/train/music/music-18.avi
4  music  dataset/train/music/music-24.avi
         tag                            video_name
650  workout  dataset/train/workout/workout-60.avi
651  workout  dataset/train/workout/workout-61.avi
652  workout  dataset/train/workout/workout-75.avi
653  workout   dataset/train/workout/workout-3.avi
654  workout  dataset/train/workout/workout-49.avi


In [4]:
df = train_df.loc[:,['video_name','tag']]
df
df.to_csv('train.csv')

### Preparing Test Data

In [5]:
dataset_path = os.listdir('dataset/test')
print(dataset_path)

room_types = os.listdir('dataset/test')
print("Type of activities found: ", len(dataset_path))

rooms = []

for item in dataset_path: 
    all_rooms = os.listdir('dataset/test' + '/' + item)

    for room in all_rooms:
        rooms.append((item, str('dataset/test' + '/' + item) + '/' + room))

test_df = pd.DataFrame(data=rooms, columns=['tag','video_name'])
print(test_df.head())
print(test_df.tail())

df = test_df.loc[:,['video_name', 'tag']]
df
df.to_csv('test.csv')

['music', 'cooking', 'gymnastics', 'soccer', 'basketball', 'workout']
Type of activities found:  6
     tag                           video_name
0  music  dataset/test/music/music-3.DS_Store
1  music       dataset/test/music/music-9.avi
2  music       dataset/test/music/music-8.avi
3  music       dataset/test/music/music-6.avi
4  music      dataset/test/music/music-10.avi
        tag                          video_name
62  workout  dataset/test/workout/workout-4.avi
63  workout  dataset/test/workout/workout-5.avi
64  workout  dataset/test/workout/workout-1.avi
65  workout  dataset/test/workout/workout-2.avi
66  workout  dataset/test/workout/workout-3.avi


In [6]:
# pip install imageio

### Data Preparation

In [7]:
from tensorflow_docs.vis import embed
from tensorflow import keras
from imutils import paths

import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np
import imageio
import cv2
import os



In [8]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(f"Total videos for training: {len(train_df)}")
print(f"Total videos for testing: {len(test_df)}")

train_df.sample(10)

Total videos for training: 655
Total videos for testing: 67


Unnamed: 0.1,Unnamed: 0,video_name,tag
230,230,dataset/train/gymnastics/gymnastics-58.avi,gymnastics
405,405,dataset/train/soccer/soccer-83.avi,soccer
580,580,dataset/train/workout/workout-9.avi,workout
285,285,dataset/train/gymnastics/gymnastics-8.avi,gymnastics
93,93,dataset/train/music/music-29.avi,music
494,494,dataset/train/basketball/basketball-87.avi,basketball
303,303,dataset/train/soccer/soccer-1.avi,soccer
329,329,dataset/train/soccer/soccer-6.avi,soccer
638,638,dataset/train/workout/workout-67.avi,workout
61,61,dataset/train/music/music-64.avi,music


### Feed the videos to a network

In [9]:
import cv2
import numpy as np

IMG_SIZE = 224

def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)  # Fixed: Use y instead of x
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]

def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        print(f"Error: Could not open video {path}")
        return np.array([])  # Return empty array if video can't be opened

    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break  # End of video or read error
            
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]  # BGR to RGB
            frames.append(frame)

            if max_frames > 0 and len(frames) == max_frames:
                break  # Stop if max_frames is reached
    except Exception as e:
        print(f"Error processing video {path}: {e}")
        return np.array([])  # Return empty on error
    finally:
        cap.release()

    return np.array(frames) if len(frames) > 0 else np.array([])

#### Feature Extraction

In [10]:
def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")

feature_extractor = build_feature_extractor()

#### Label Encoding

In [11]:
label_processor = keras.layers.StringLookup(num_oov_indices=0, vocabulary=np.unique(train_df["tag"]))
print(label_processor.get_vocabulary())

labels = train_df["tag"].values
labels = label_processor(labels[..., None]).numpy()
labels

[np.str_('basketball'), np.str_('cooking'), np.str_('gymnastics'), np.str_('music'), np.str_('soccer'), np.str_('workout')]


array([[3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
       [3],
    

#### Define hyperparameters

In [12]:
IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 100

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048

In [None]:
def prepare_all_videos(df, root_dir):
    num_samples = len(df)
    video_paths = df["video_name"].values.tolist()

    labels = df["tag"].values

    labels = label_processor(labels[..., None]).numpy()

    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    for idx, path in enumerate(video_paths):
        frames = load_video(os.path.join(root_dir, path))
        frames = frames[None, ...]

        temp_frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :]
                )
            temp_frame_mask[i, :length] = 1 # 1 = not masked, 0 - masked
        frame_features[idx,] = temp_frame_features.squeeze()
    return(frame_features, frame_masks), labels

train_data, train_labels = prepare_all_videos(train_df, "")
test_data, test_labels = prepare_all_videos(test_df, "")

print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")

print(f"train_labels in train set: {train_labels.shape}")
print(f"test_labels in train set: {test_labels.shape}")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 65ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 65ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 60ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 59ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 64ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 66ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 67ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 61ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 63ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 70ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 54ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 62

#### The sequence model

In [None]:
def get_sequence_model():
    class_vocab = label_processor.get_vocabulary()

    frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

    x = keras.layers.GRU(16, return_sequences=True)(frame_features_input, mask=mask_input)
    x = keras.layers.GRU(8)(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Dense(8, activation="relu")(x)
    output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

    rnn_model = keras.Model([frame_features_input, mask_input], output)

    rnn_model.compile(
        loss="sparse_categorical_crossentropy", optimizer="adam",metrics=["accuracy"]
    )
    return rnn_model

EPOCHS= 30
# Utility for running experiments
def run_experiment():
    filepath = "./tmp/video_classifier.weights.h5"
    checkpoint = keras.callbacks.ModelCheckpoint(
        filepath, save_weights_only=True, save_best_only=True, verbose = 1
    )
    seq_model = get_sequence_model()
    history = seq_model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        validation_split=0.3,
        epochs=EPOCHS,
        callbacks=[checkpoint],
    )
    seq_model.load_weights(filepath)
    _, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f"Test accuracy: {round(accuracy * 100,2)}%")

    return history, seq_model

_, sequence_model = run_experiment()


Epoch 1/30
[1m13/15[0m [32m━━━━━━━━━━━━━━━━━[0m[37m━━━[0m [1m0s[0m 9ms/step - accuracy: 0.2766 - loss: 1.7234 
Epoch 1: val_loss improved from inf to 1.98358, saving model to ./tmp/video_classifier.weights.h5
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 23ms/step - accuracy: 0.2900 - loss: 1.6986 - val_accuracy: 0.0000e+00 - val_loss: 1.9836
Epoch 2/30
[1m 8/15[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 8ms/step - accuracy: 0.5245 - loss: 1.3730 
Epoch 2: val_loss did not improve from 1.98358
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - accuracy: 0.5352 - loss: 1.3500 - val_accuracy: 0.0000e+00 - val_loss: 2.2025
Epoch 3/30
[1m 8/15[0m [32m━━━━━━━━━━[0m[37m━━━━━━━━━━[0m [1m0s[0m 7ms/step - accuracy: 0.6968 - loss: 1.1936 
Epoch 3: val_loss did not improve from 1.98358
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - accuracy: 0.7034 - loss: 1.1863 - val_accuracy: 0.0000e+00 - val_lo

## Inference

In [None]:
def prepare_single_video(frames):
    frames = frames[None, ...]
    frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
    frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    for i, batch in enumerate(frames):
        video_length = batch.shape[0]
        length = min(MAX_SEQ_LENGTH, video_length)
        for j in range(length):
            frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
        frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

    return frame_features, frame_mask


def sequence_prediction(path):
    class_vocab = label_processor.get_vocabulary()

    frames = load_video(os.path.join("", path))
    frame_features, frame_mask = prepare_single_video(frames)
    probabilities = sequence_model.predict([frame_features, frame_mask])[0]

    for i in np.argsort(probabilities)[::-1]:
        print(f"{class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return frames

test_video = np.random.choice(test_df["video_name"].values.tolist())
print(f"Test video path: {test_video}")

test_frames = sequence_prediction(test_video)

Test video path: dataset/test/soccer/soccer-110.avi
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 46ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 69ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 49ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 46ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 44ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 44ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 53ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m 

In [None]:
from IPython.display import HTML

HTML("""
    <video alt="test" width="520" height="440" controls>
        <source src="UCF-101/Rowing/v_Rowing_g01_c02.mp4" type="video/mp4" style="height:300px; width:300px">
    </video>
""")