# **Feature Extraction Function Guide**

## *1. Importing Functions from feature_extraction.py*


- Purpose:

    The cell below imports all functions and variables from the file feature_extraction.py, which is expected to contain utility functions for processing the MVTec anomaly detection dataset.

- Key Imported Function:

    The primary function used is assemble_mvtec_dataset_train_test().

In [1]:
# import all from file feature_extraction.py
from feature_extraction import *

ModuleNotFoundError: No module named 'feature_extraction'
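
A ModuleNotFoundError like the one above usually means that the folder containing feature_extraction.py is not on Python's import path (for example, the notebook was started from a different working directory). The following is a minimal sketch of one way to handle this, not part of feature_extraction.py itself; the helper names add_to_import_path and module_is_importable are hypothetical.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def add_to_import_path(module_dir):
    """Put module_dir at the front of sys.path (only once) and return its absolute path."""
    resolved = str(Path(module_dir).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


def module_is_importable(module_name):
    """Return True if Python can locate module_name on the current import path."""
    importlib.invalidate_caches()  # pick up files created after interpreter start
    return importlib.util.find_spec(module_name) is not None
```

Calling add_to_import_path with the folder that holds feature_extraction.py, and then checking module_is_importable("feature_extraction"), should confirm whether the star import will succeed before running it.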

## *2. About assemble_mvtec_dataset_train_test()*

- ##### Purpose: 
  - This function processes the MVTec anomaly detection dataset, creating multiple derived datasets and saving them as pickle files for efficient reuse.
  
- ##### Inputs:

  - path_to_mvtec_dataset_dir: Path to the dataset directory containing image categories.
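  - force_pkl_overwrite and the per-dataset flags force_pkl_overwrite_resized, force_pkl_overwrite_metadata, force_pkl_overwrite_features, force_pkl_overwrite_pixel, and force_pkl_overwrite_train_test (all False by default): force re-computation and overwriting of the corresponding datasets even if pickle files already exist.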

- ##### Outputs:

  - Returns seven datasets:
    - categories: A list of image category names.
    - image_paths: A dictionary mapping categories to lists of image file paths.
    - resized_images: A dictionary mapping categories to resized images (as numpy arrays).
    - metadata_df: A DataFrame containing metadata about all images.
    - features_df: A DataFrame of extracted features for all images.
    - pixel_df: A DataFrame where images are flattened into 1D arrays.
    - train_test_df: A merged DataFrame containing metadata, features, and pixel data.

- ##### Saves:

  - These datasets are saved as pickle files in the directory /pickle_files for reuse.
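
The load-or-recompute behaviour described above follows a common caching pattern. The sketch below illustrates that pattern in isolation; it is an assumption about how such caching can be written, not the actual code in feature_extraction.py, and the names load_or_compute, compute_fn, and force_overwrite are hypothetical.

```python
import pickle
from pathlib import Path


def load_or_compute(pkl_path, compute_fn, force_overwrite=False):
    """Return the cached object from pkl_path, or compute it, save it, and return it."""
    pkl_path = Path(pkl_path)
    # Reuse the cached result unless the caller forces a re-computation
    if pkl_path.exists() and not force_overwrite:
        with pkl_path.open("rb") as f:
            return pickle.load(f)
    # Otherwise compute the result and write it to disk for next time
    result = compute_fn()
    pkl_path.parent.mkdir(parents=True, exist_ok=True)
    with pkl_path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

With this design, an expensive step runs once; later calls read the pickle file, and passing force_overwrite=True plays the role of the force_pkl_overwrite flags.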


In [2]:
# principal function in feature_extraction.py is assemble_mvtec_dataset_train_test()
help(assemble_mvtec_dataset_train_test)

Help on function assemble_mvtec_dataset_train_test in module feature_extraction:

assemble_mvtec_dataset_train_test(path_to_mvtec_dataset_dir, force_pkl_overwrite=False, force_pkl_overwrite_resized=False, force_pkl_overwrite_metadata=False, force_pkl_overwrite_features=False, force_pkl_overwrite_pixel=False, force_pkl_overwrite_train_test=False)
    computes and/or loads several datasets and lists from the mvtec anomaly detection dataset. 
    the data is read from .pkl file if it has been previously computed, unless re-computation and overwrite is forced by setting the corresponding flag.
    
    returns 'categories', 'image_paths', 'resized_images', 'metadata_df', 'features_df', 'pixel_df', 'train_test_df' where 
        categories -- list of categories in mvtec dataset
        image_paths -- dictionary (keys: categories. values: list of paths to all image files for given category)
        resized_images -- dictionary (keys: categories. values: list of resized images as np.arrays fo

## *3. Defining the Dataset Directory*

Specifies the path to the MVTec anomaly detection dataset directory. This directory should contain subfolders for each category (e.g., "bottle", "cable", etc.).

In [3]:
# path to the mvtec_anomaly_detection dataset (the folder that contains all the category folders)
dataset_dir = 'C:/Mici/Unterlagen/DataScientest/Project/DS_project_workspace/Data/mvtec_anomaly_detection'
print(dataset_dir)


C:/Mici/Unterlagen/DataScientest/Project/DS_project_workspace/Data/mvtec_anomaly_detection
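
Before running the pipeline, it can help to confirm that the directory exists and contains the expected category subfolders. The following is a small sketch using only the standard library; the helper name list_category_dirs is hypothetical and not part of feature_extraction.py.

```python
from pathlib import Path


def list_category_dirs(dataset_dir):
    """Return the sorted names of the immediate subfolders of dataset_dir."""
    root = Path(dataset_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {root}")
    # Plain files (e.g. a readme or license) are skipped; only folders count as categories
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling list_category_dirs(dataset_dir) should return names such as "bottle" and "cable" if the path points at the right folder.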


## *4. Using assemble_mvtec_dataset_train_test()*

- ##### Purpose:

  - Calls the function, which will:
    - Load preprocessed datasets from pickle files (if they exist).
    - Otherwise, preprocess the raw dataset and save the results as pickle files.

- ##### Result:

  - The outputs are loaded into variables for further processing or analysis.

In [4]:
# use function to create datasets or read from files, if present 
categories, image_paths, resized_images, metadata_df, features_df, pixel_df, train_test_df = assemble_mvtec_dataset_train_test(dataset_dir)
print('done.')

category names have been extracted.
image paths have been extracted.
loaded resized_images from .pkl file.
loaded metadata_df from .pkl file.
loaded features_df from .pkl file.
loaded pixel_df from .pkl file.
loaded train_test_df from .pkl file.
done.


## *5. Checking the Returned Outputs*

Purpose: 
Verifies the types and sizes of the returned datasets.

Key Observations:
1. Categories: A list of 15 image category names.
2. Image Paths: A dictionary mapping categories to image paths.
3. Resized Images: A dictionary with preprocessed (resized) images.
4. Metadata DataFrame: Contains 8 columns (e.g., image names and labels) for 5,354 images.
5. Features DataFrame: Contains 40 features extracted for each of the 5,354 images.
6. Pixel DataFrame: Contains one flattened image per row (224×224×3 → 150,528 pixel columns).
7. Train-Test DataFrame: Combines metadata, features, and pixel data into one DataFrame with 150,576 columns (8 + 40 + 150,528).
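
The column counts in the list above can be checked with simple arithmetic. The snippet below is only a sanity check on the reported shapes; it assumes that the three DataFrames share no columns, which is consistent with the totals adding up exactly.

```python
# Image shape taken from the 224×224×3 figure quoted above
height, width, channels = 224, 224, 3
pixel_columns = height * width * channels

metadata_columns = 8   # columns in metadata_df
feature_columns = 40   # columns in features_df
total_columns = metadata_columns + feature_columns + pixel_columns

print(pixel_columns)  # 150528
print(total_columns)  # 150576
```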


In [5]:
# check what has been returned
import pandas as pd  # imported explicitly rather than relying on the star import above

returned = {'categories': categories, 
            'image_paths': image_paths, 
            'resized_images': resized_images, 
            'metadata_df': metadata_df, 
            'features_df': features_df, 
            'pixel_df': pixel_df, 
            'train_test_df': train_test_df
            }
for name, data in returned.items():
    print('\n')
    print('type of ', name, ': ', type(data))
    if isinstance(data, list):
        print('    length is ', len(data))
    if isinstance(data, dict):
        print('    keys are ', data.keys())
    if isinstance(data, pd.DataFrame):
        print('    size is ', data.shape)





type of  categories :  <class 'list'>
    length is  15


type of  image_paths :  <class 'dict'>
    keys are  dict_keys(['bottle', 'cable', 'capsule', 'carpet', 'grid', 'hazelnut', 'leather', 'metal_nut', 'pill', 'screw', 'tile', 'toothbrush', 'transistor', 'wood', 'zipper'])


type of  resized_images :  <class 'dict'>
    keys are  dict_keys(['bottle', 'cable', 'capsule', 'carpet', 'grid', 'hazelnut', 'leather', 'metal_nut', 'pill', 'screw', 'tile', 'toothbrush', 'transistor', 'wood', 'zipper'])


type of  metadata_df :  <class 'pandas.core.frame.DataFrame'>
    size is  (5354, 8)


type of  features_df :  <class 'pandas.core.frame.DataFrame'>
    size is  (5354, 40)


type of  pixel_df :  <class 'pandas.core.frame.DataFrame'>
    size is  (5354, 150528)


type of  train_test_df :  <class 'pandas.core.frame.DataFrame'>
    size is  (5354, 150576)


## *6. MVTec Anomaly Detection Dataset*

A dataset containing various categories of objects (e.g., "bottle", "cable").
Includes images labeled as normal or anomalous for anomaly detection tasks.

### Purpose of the Workflow

The workflow preprocesses the dataset for anomaly detection experiments and converts the raw image data into reusable datasets:

- Categories: List of object types.
- Image Paths: For accessing raw images.
- Resized Images: Preprocessed images with a uniform size.
- Metadata: Useful information about images.
- Features: High-level features extracted (e.g., embeddings from a model).
- Pixels: Raw image data in flattened form.
- Train-Test Dataset: Combined dataset for training/testing models.

### Pickle Files

Enables efficient storage and retrieval of preprocessed datasets to save computation time.

### Train-Test DataFrame

Combines:

- Metadata about images.
- Extracted features.
- Flattened pixel data.

This makes it ready for training or testing anomaly detection models.
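
To make the combination concrete, here is a toy sketch that flattens a few small images and joins them column-wise with metadata and features. It assumes the frames are aligned by row index and joined with pd.concat; the actual merge logic in feature_extraction.py may differ, and the column names (category, mean_intensity, pixel_0, ...) are hypothetical.

```python
import numpy as np
import pandas as pd


def flatten_images(images):
    """Turn a list of equally shaped image arrays into a DataFrame, one row per image."""
    flat = np.stack([np.asarray(img).reshape(-1) for img in images])
    columns = [f"pixel_{i}" for i in range(flat.shape[1])]
    return pd.DataFrame(flat, columns=columns)


def combine_frames(metadata_df, features_df, pixel_df):
    """Join the three frames column-wise; rows are matched by index."""
    return pd.concat([metadata_df, features_df, pixel_df], axis=1)


# Toy data: two 2x2 RGB images (12 pixel values each)
images = [np.zeros((2, 2, 3), dtype=np.uint8), np.ones((2, 2, 3), dtype=np.uint8)]
toy_metadata = pd.DataFrame({"category": ["bottle", "cable"]})
toy_features = pd.DataFrame({"mean_intensity": [0.0, 1.0]})
toy_pixels = flatten_images(images)
toy_combined = combine_frames(toy_metadata, toy_features, toy_pixels)
print(toy_combined.shape)  # (2, 14)
```

The toy result has 1 + 1 + 12 = 14 columns, mirroring how 8 + 40 + 150,528 gives the 150,576 columns of train_test_df.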

### Usage of Flags

Flags like force_pkl_overwrite are used to recompute specific datasets (e.g., resized images or features) even if existing pickle files are present.

### Example Output Sizes

| Dataset        | Type       | Details                       |
|----------------|------------|-------------------------------|
| categories     | List       | 15 categories.                |
| image_paths    | Dictionary | Keys are category names.      |
| resized_images | Dictionary | Keys are category names.      |
| metadata_df    | DataFrame  | 5,354 rows × 8 columns.       |
| features_df    | DataFrame  | 5,354 rows × 40 columns.      |
| pixel_df       | DataFrame  | 5,354 rows × 150,528 columns. |
| train_test_df  | DataFrame  | 5,354 rows × 150,576 columns. |


This structured workflow prepares the MVTec dataset for machine learning tasks, ensuring flexibility for preprocessing and efficient loading of intermediate data.