Switch branches/tags
Nothing to show
Find file History
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.



By Zhedong Zheng

This is a University of Technology Sydney computer vision practical, authored by Zhedong Zheng. The practical explores the basis of learning pedestrian features. In this practical, we will learn to build a simple person re-ID system step by step. Any suggestion is welcomed.

Person re-ID can be viewed as an image retrieval problem. Given one query image in Camera A, we need to find the images of the same person in other Cameras. The key of the person re-ID is to find a discriminative representation of the person. Many recent works apply deeply learned models to extract visual features, and achieve the state-of-the-art performance.


  • Python 3.6
  • GPU Memory >= 6G
  • Numpy
  • Pytorch 0.3+ (http://pytorch.org/)
  • Torchvision from the source
git clone https://github.com/pytorch/vision
cd vision
python setup.py install

Getting started

Check the Prerequisites. The download links for this practice are:

Part 1: Training

Part 1.1: Prepare Data Folder (python prepare_data.py)

You may notice that the downloaded folder is organized as:

β”œβ”€β”€ Market/
β”‚   β”œβ”€β”€ bounding_box_test/          /* Files for testing (candidate images pool)
β”‚   β”œβ”€β”€ bounding_box_train/         /* Files for training 
β”‚   β”œβ”€β”€ gt_bbox/                    /* We do not use it 
β”‚   β”œβ”€β”€ gt_query/                   /* Files for multiple query testing 
β”‚   β”œβ”€β”€ query/                      /* Files for testing (query images)
β”‚   β”œβ”€β”€ readme.txt

Open and edit the script prepare.py in the editor. Change the fifth line in prepare.py to your download path, such as \home\zzd\Download\Market. Run this script in the terminal.

python prepare.py

We create a subfolder called pytorch under the download folder.

β”œβ”€β”€ Market/
β”‚   β”œβ”€β”€ bounding_box_test/          /* Files for testing (candidate images pool)
β”‚   β”œβ”€β”€ bounding_box_train/         /* Files for training 
β”‚   β”œβ”€β”€ gt_bbox/                    /* We do not use it 
β”‚   β”œβ”€β”€ gt_query/                   /* Files for multiple query testing 
β”‚   β”œβ”€β”€ query/                      /* Files for testing (query images)
β”‚   β”œβ”€β”€ readme.txt
β”‚   β”œβ”€β”€ pytorch/
β”‚       β”œβ”€β”€ train/                   /* train 
β”‚           β”œβ”€β”€ 0002
|           β”œβ”€β”€ 0007
|           ...
β”‚       β”œβ”€β”€ val/                     /* val
β”‚       β”œβ”€β”€ train_all/               /* train+val      
β”‚       β”œβ”€β”€ query/                   /* query files  
β”‚       β”œβ”€β”€ gallery/                 /* gallery files  

In every subdir, such as pytorch/train/0002, images with the same ID are arranged in the folder. Now we have successfully prepared the data for torchvision to read the data.

+ Quick Question. How to recognize the images of the same ID?

For Market-1501, the image name contains the identity label and camera id. Check the naming rule at here.

Part 1.2: Build Neural Network (model.py)

We can use the pretrained networks, such as AlexNet, VGG16, ResNet and DenseNet. Generally, the pretrained networks help to achieve a better performance, since it preserves some good visual patterns from ImageNet [1].

In pytorch, we can easily import them by two lines. For example,

from torchvision import models
model = models.resnet50(pretrained=True)

You can simply check the structure of the model by:


But we need to modify the networks a little bit. There are 751 classes (different people) in Market-1501, which is different with 1,000 classes in ImageNet. So here we change the model to use our classifier.

import torch
import torch.nn as nn
from torchvision import models

# Define the ResNet50-based Model
class ft_net(nn.Module):
    def __init__(self, class_num = 751):
        super(ft_net, self).__init__()
        #load the model
        model_ft = models.resnet50(pretrained=True) 
        # change avg pooling to global pooling
        model_ft.avgpool = nn.AdaptiveAvgPool2d((1,1))
        self.model = model_ft
        self.classifier = ClassBlock(2048, class_num) #define our classifier.

    def forward(self, x):
        x = self.model.conv1(x)
        x = self.model.bn1(x)
        x = self.model.relu(x)
        x = self.model.maxpool(x)
        x = self.model.layer1(x)
        x = self.model.layer2(x)
        x = self.model.layer3(x)
        x = self.model.layer4(x)
        x = self.model.avgpool(x)
        x = torch.squeeze(x)
        x = self.classifier(x) #use our classifier.
        return x
+ Quick Question. Why we use AdaptiveAvgPool2d? What is the difference between the AvgPool2d and AdaptiveAvgPool2d?
+ Quick Question. Does the model have parameters now? How to initialize the parameter in the new layer?

More details are in model.py. You may check it later after you have gone through this practical.

Part 1.3: Training (python train.py)

OK. Now we have prepared the training data and defined model structure.

We can train a model by

python train.py --gpu_ids 0 --name ft_ResNet50 --train_all --batchsize 32  --data_dir your_data_path

--gpu_ids which gpu to run.

--name the name of the model.

--data_dir the path of the training data.

--train_all using all images to train.

--batchsize batch size.

--erasing_p random erasing probability.

Let's look at what we do in the train.py. The first thing is how to read data and their labels from the prepared folder. Using torch.utils.data.DataLoader, we can obtain two iterators dataloaders['train'] and dataloaders['val'] to read data and label.

image_datasets = {}
image_datasets['train'] = datasets.ImageFolder(os.path.join(data_dir, 'train'),
image_datasets['val'] = datasets.ImageFolder(os.path.join(data_dir, 'val'),

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=opt.batchsize,
                                             shuffle=True, num_workers=8) # 8 workers may work faster
              for x in ['train', 'val']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}

Here is the main code to train the model. Yes. It's only about 20 lines. Make sure you can understand every line of the code.

            # Iterate over data.
            for data in dataloaders[phase]:
                # get a batch of inputs
                inputs, labels = data
                now_batch_size,c,h,w = inputs.shape
                if now_batch_size<opt.batchsize: # skip the last batch
                # print(inputs.shape)
                # wrap them in Variable, if gpu is used, we transform the data to cuda.
                if use_gpu:
                    inputs = Variable(inputs.cuda())
                    labels = Variable(labels.cuda())
                    inputs, labels = Variable(inputs), Variable(labels)

                # zero the parameter gradients

                #-------- forward --------
                outputs = model(inputs)
                _, preds = torch.max(outputs.data, 1)
                loss = criterion(outputs, labels)

                #-------- backward + optimize -------- 
                # only if in training phase
                if phase == 'train':
+ Quick Question. Why we need optimizer.zero_grad()? What happens if we remove it?
+ Quick Question. The dimension of the outputs is batchsize*751. Why?

Every 10 training epoch, we save a snapshot and update the loss curve.

                if epoch%10 == 9:
                    save_network(model, epoch)

Part 2: Test

Part 2.1: Extracting feature (python test.py)

In this part, we load the network weight (we just trained) to extract the visual feature of every image.

python test.py --gpu_ids 0 --name ft_ResNet50 --test_dir your_data_path  --batchsize 32 --which_epoch 59

--gpu_ids which gpu to run.

--name the dir name of the trained model.

--batchsize batch size.

--which_epoch select the i-th model.

--data_dir the path of the testing data.

Let's look at what we do in the test.py. First, we need to import the model structure and then load the weight to the model.

model_structure = ft_net(751)
model = load_network(model_structure)

For every query and gallery image, we extract the feature by simply forward the data.

outputs = model(input_img) 
# ---- L2-norm Feature ------
ff = outputs.data.cpu()
fnorm = torch.norm(ff, p=2, dim=1, keepdim=True)
ff = ff.div(fnorm.expand_as(ff))
+ Quick Question. Why we flip the test image horizontally when testing? How to fliplr in pytorch?
+ Quick Question. Why we L2-norm the feature?

Part 2.2: Evaluation

Yes. Now we have the feature of every image. The only thing we need to do is matching the images by the feature.

python evaluate_gpu.py

Let's look what we do in evaluate_gpu.py. We sort the predicted similarity score.

query = qf.view(-1,1)
# print(query.shape)
score = torch.mm(gf,query) # Cosine Distance
score = score.squeeze(1).cpu()
score = score.numpy()
# predict index
index = np.argsort(score)  #from small to large
index = index[::-1]

Note that there are two kinds of images we do not consider as right-matching images.

  • Junk_index1 is the index of mis-detected images, which contain the body parts.

  • Junk_index2 is the index of the images, which are of the same identity in the same cameras.

    query_index = np.argwhere(gl==ql)
    camera_index = np.argwhere(gc==qc)
    # The images of the same identity in different cameras
    good_index = np.setdiff1d(query_index, camera_index, assume_unique=True)
    # Only part of body is detected. 
    junk_index1 = np.argwhere(gl==-1)
    # The images of the same identity in same cameras
    junk_index2 = np.intersect1d(query_index, camera_index)

We can use the function compute_mAP to obtain the final result. In this function, we will ignore the junk_index.

CMC_tmp = compute_mAP(index, good_index, junk_index)

Part 3: A simple visualization (python demo.py)

To visualize the result,

python demo.py --query_index 777

--query_index which query you want to test. You may select a number in the range of 0 ~ 3367.

It is similar to the evaluate.py. We add the visualization part.

try: # Visualize Ranking Result 
    # Graphical User Interface is needed
    fig = plt.figure(figsize=(16,4))
    ax = plt.subplot(1,11,1)
    for i in range(10): #Show top-10 images
        ax = plt.subplot(1,11,i+2)
        img_path, _ = image_datasets['gallery'].imgs[index[i]]
        label = gallery_label[index[i]]
        if label == query_label:
            ax.set_title('%d'%(i+1), color='green') # true matching
            ax.set_title('%d'%(i+1), color='red') # false matching
except RuntimeError:
    for i in range(10):
        img_path = image_datasets.imgs[index[i]]
    print('If you want to see the visualization of the ranking result, graphical user interface is needed.')

Part 4: Your Turn.

  • Market-1501 is a dataset collected at Tsinghua University in summer.

Let's try another dataset called DukeMTMC-reID, which is collected at Duke University in winter.

You may download the dataset at Here. Try it by yourself.

The dataset is quite similar to Market-1501. You may also check with the state-of-the-art results at Here.

+ Quick Question. Could we directly apply the model trained on Market-1501 to DukeMTMC-reID? Why?

Part5: Other Related Works

  • Could we use natural language as query? Check this paper.

  • Could we use other losses (i.e. contrastive loss) to further improve the performance? Check this paper.

  • Person-reID dataset is not large enough to train a deep-learned network? You may check this paper and try some data augmentation method like random erasing.

  • Pedestrian detection is bad? Try Open Pose and Spatial Transformer to align the images.


[1] Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "Imagenet: A large-scale hierarchical image database." In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248-255. Ieee, 2009.