# 02. Making Train set & Test set
* Before running this notebook, check the following:
    * Run 01_video-to-numpy-save.ipynb first.
    * **`Do these files exist?`** They are created by 01_video-to-numpy-save.ipynb
        * 01_data_Fight_210512.pickle
        * 01_label_Fight_210512.pickle
        * 01_data_NonFight_210507.pickle
        * 01_label_NonFight_210507.pickle
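
The check above can be automated. The following is a minimal sketch, assuming the files live in the same directory used by the load cells in this notebook; `find_missing` is a hypothetical helper name, not part of the original code.

```python
from pathlib import Path

# Directory taken from the load cells in this notebook; adjust to your machine.
DATA_DIR = Path("D:/datasets/AllVideo_numpy_list_pickle")

REQUIRED_FILES = [
    "01_data_Fight_210512.pickle",
    "01_label_Fight_210512.pickle",
    "01_data_NonFight_210507.pickle",
    "01_label_NonFight_210507.pickle",
]


def find_missing(directory, names):
    """Return the file names from `names` that do not exist in `directory`."""
    directory = Path(directory)
    return [name for name in names if not (directory / name).is_file()]


missing = find_missing(DATA_DIR, REQUIRED_FILES)
if missing:
    print("Missing files:", missing)
else:
    print("All input files found.")
```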

# imports

In [1]:
import numpy as np
import pickle
from random import shuffle

# 02-A. Load Fight / NonFight Video Pickle Files
* These files were saved by 21XXXX_01_video-to-numpy-save.ipynb
    * **data_Fight.pickle** : List of NumPy arrays holding the frames of each Fight video
    * **data_NonFight.pickle** : List of NumPy arrays holding the frames of each NonFight video
    * **label_Fight.pickle** : List of NumPy label arrays for the Fight videos
    * **label_NonFight.pickle** : List of NumPy label arrays for the NonFight videos
* **`Why I repeatedly save and load .pickle files`** :
    * Simply because of RAM limits.
    * When I ran this code, my desktop had 16 GB of RAM, 100 GB free on C:, and a 1 TB D: drive.
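
To make the memory pressure concrete, here is a rough sketch that estimates how much RAM a dense array of videos needs. The shape `(30, 160, 160, 3)` and the count of 2878 training videos are the values reported later in this notebook; `array_gib` is a hypothetical helper, and the dtype of the pickled arrays is not stated, so several dtypes are shown.

```python
import numpy as np


def array_gib(shape, dtype):
    """Size in GiB of a dense NumPy array with the given shape and dtype."""
    n_items = int(np.prod(shape, dtype=np.int64))
    return n_items * np.dtype(dtype).itemsize / 2**30


# Shape of one video as reported later in this notebook.
video_shape = (30, 160, 160, 3)
n_train = 2878  # training-set size reported later in this notebook

for dtype in (np.float16, np.float32, np.float64):
    size = array_gib((n_train, *video_shape), dtype)
    print(f"{np.dtype(dtype).name}: {size:.1f} GiB")
```

Even as float16, the training array alone comes to roughly 12 GiB, which is close to the 16 GB of RAM available here and may be why the work is split into separate save and load stages.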

In [None]:
# Fight Video frames Numpy array list
with open("D:/datasets/AllVideo_numpy_list_pickle/01_data_Fight_210512.pickle","rb") as fr:
    data_Fight=pickle.load(fr)
print(len(data_Fight))

In [None]:
# Fight label Numpy array list
with open("D:/datasets/AllVideo_numpy_list_pickle/01_label_Fight_210512.pickle","rb") as fr:
    label_Fight=pickle.load(fr)
print(len(label_Fight))

In [None]:
# NonFight Video frames Numpy array list
with open("D:/datasets/AllVideo_numpy_list_pickle/01_data_NonFight_210507.pickle","rb") as fr:
    data_NonFight=pickle.load(fr)
print(len(data_NonFight))

In [None]:
# NonFight label Numpy array list
with open("D:/datasets/AllVideo_numpy_list_pickle/01_label_NonFight_210507.pickle","rb") as fr:
    label_NonFight=pickle.load(fr)
print(len(label_NonFight))

# 02-B. Merge data & Random Shuffle

## 1. Merge data : Fight + NonFight

In [None]:
data_total=data_Fight+data_NonFight
print(len(data_total))

In [None]:
label_total=label_Fight+label_NonFight
print(len(label_total))

## 2. Shuffle merged dataset

In [None]:
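import random
random.seed(42) # `shuffle` comes from Python's `random` module, so seed that module (np.random.seed would not affect it)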

In [None]:
c=list(zip(data_total, label_total)) # zip : pair each video array with its label
shuffle(c) # shuffle the (data, label) pairs in place
data_total, label_total=zip(*c) # unpack the shuffled pairs back into data and labels (both become tuples)
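
Note that `shuffle` is imported from Python's `random` module, so its order is controlled by that module's generator rather than NumPy's. The sketch below shows one way to make the paired shuffle reproducible with a dedicated `random.Random` instance. `shuffle_pairs` and the toy values are hypothetical and only stand in for the real video arrays and labels.

```python
import random


def shuffle_pairs(data, labels, seed=42):
    """Shuffle two equally long sequences together, keeping pairs aligned."""
    if len(data) != len(labels):
        raise ValueError("data and labels must have the same length")
    pairs = list(zip(data, labels))
    random.Random(seed).shuffle(pairs)  # private generator, seeded explicitly
    if not pairs:
        return [], []
    shuffled_data, shuffled_labels = zip(*pairs)
    return list(shuffled_data), list(shuffled_labels)


# Toy stand-ins for the video arrays and their labels (hypothetical values).
toy_data = ["v0", "v1", "v2", "v3", "v4"]
toy_labels = [0, 1, 2, 3, 4]

d1, l1 = shuffle_pairs(toy_data, toy_labels)
d2, l2 = shuffle_pairs(toy_data, toy_labels)
print(d1 == d2 and l1 == l2)                      # same seed gives the same order
print(all(d == f"v{l}" for d, l in zip(d1, l1)))  # each item keeps its own label
```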

## 3. Save shuffled dataset as .pickle
* **`pickle.dump(protocol=pickle.HIGHEST_PROTOCOL)`** : Using the highest protocol can resolve out-of-memory errors while saving; protocol 4 and later also support objects larger than 4 GiB.

In [None]:
# Save data
with open("D:/datasets/AllVideo_numpy_list_pickle/02_data_total_210512.pickle","wb") as fw:
    pickle.dump(data_total, fw, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Save label
with open("D:/datasets/AllVideo_numpy_list_pickle/02_label_total_210512.pickle","wb") as fw:
    pickle.dump(label_total, fw)
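
Before writing a file of several gigabytes, the save and load steps can be tried on a small stand-in. This is a sketch under stated assumptions: the toy arrays and the temporary file name are hypothetical, and it only checks that a round trip with `pickle.HIGHEST_PROTOCOL` preserves values and dtypes.

```python
import os
import pickle
import tempfile

import numpy as np

# Small stand-in for the real list of video arrays (hypothetical data).
toy_videos = [np.zeros((2, 4, 4, 3), dtype=np.float16) for _ in range(3)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_data.pickle")  # hypothetical file name

    with open(path, "wb") as fw:
        pickle.dump(toy_videos, fw, protocol=pickle.HIGHEST_PROTOCOL)

    with open(path, "rb") as fr:
        restored = pickle.load(fr)

# Compare every array in the restored list with its original.
same = len(restored) == len(toy_videos) and all(
    np.array_equal(a, b) and a.dtype == b.dtype
    for a, b in zip(toy_videos, restored)
)
print("protocol used:", pickle.HIGHEST_PROTOCOL)
print("round trip preserved the data:", same)
```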

# 02-C. Split training set / test set

## 1. Load shuffled dataset(.pickle)
* **`Why I repeatedly save and load .pickle files`** :
    * Simply because of RAM limits.

In [2]:
# load data
with open("D:/datasets/AllVideo_numpy_list_pickle/02_data_total_210512.pickle","rb") as fr:
    data_total=pickle.load(fr)

In [3]:
# load label
with open("D:/datasets/AllVideo_numpy_list_pickle/02_label_total_210512.pickle","rb") as fr:
    label_total=pickle.load(fr)

## 2. Split dataset into training set / test set (8:2 ratio)

### 1) Number of samples in the training set and test set

In [4]:
training_set=int(len(data_total)*0.8) # number of training samples (80%)
test_set=len(data_total)-training_set # the remaining samples form the test set (20%)

In [5]:
data_training=data_total[0:training_set] # training set data
data_test=data_total[training_set:] # test set data

label_training=label_total[0:training_set] # training set labels
label_test=label_total[training_set:] # test set labels

In [6]:
len(data_training), len(label_training), len(data_test), len(label_test)

(2878, 2878, 720, 720)
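
Because the split is a plain slice of the shuffled list, it can be worth checking how many samples of each class land in each part. The sketch below assumes the labels are one-hot vectors, which the label shape `(2,)` suggests but the notebook does not state. `class_counts` and the toy labels are hypothetical, and which index corresponds to Fight is not specified here.

```python
import numpy as np


def class_counts(labels):
    """Count samples per class, assuming each label is a one-hot vector."""
    labels = np.asarray(labels)
    return np.bincount(labels.argmax(axis=1), minlength=labels.shape[1])


# Toy one-hot labels (hypothetical): 3 samples of class 0, 2 of class 1.
toy_labels = [np.array([1, 0]), np.array([0, 1]), np.array([1, 0]),
              np.array([1, 0]), np.array([0, 1])]

split = int(len(toy_labels) * 0.8)  # same 8:2 rule as in this notebook
print("train:", class_counts(toy_labels[:split]))
print("test: ", class_counts(toy_labels[split:]))
```

If the two parts turn out to be badly unbalanced, a stratified split would be one option to consider.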

### 2) Check the shape of elements

In [7]:
data_training[0].shape, label_training[0].shape

((30, 160, 160, 3), (2,))

In [8]:
data_training[0][0, :, :, 0]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## 3. Save training set & test set as .pickle file
* **`Why I repeatedly save and load .pickle files`** :
    * Simply because of RAM limits.

In [9]:
# training set, data
with open("D:/datasets/AllVideo_numpy_list_pickle/02_data_training_210512.pickle","wb") as fw:
    pickle.dump(data_training, fw, protocol=pickle.HIGHEST_PROTOCOL)

In [10]:
# training set, label
with open("D:/datasets/AllVideo_numpy_list_pickle/02_label_training_210512.pickle","wb") as fw:
    pickle.dump(label_training, fw)

In [11]:
# test set, data
with open("D:/datasets/AllVideo_numpy_list_pickle/02_data_test_210512.pickle","wb") as fw:
    pickle.dump(data_test, fw, protocol=pickle.HIGHEST_PROTOCOL)

In [12]:
# test set, label
with open("D:/datasets/AllVideo_numpy_list_pickle/02_label_test_210512.pickle","wb") as fw:
    pickle.dump(label_test, fw)

# 02-D. Convert dataset to Numpy arrays

## 1. Load training set & test set (.pickle)

### 1) Training set : data, label

In [None]:
with open("D:/datasets/AllVideo_numpy_list_pickle/02_data_training_210512.pickle","rb") as fr:
    data_training=pickle.load(fr)

In [None]:
with open("D:/datasets/AllVideo_numpy_list_pickle/02_label_training_210512.pickle","rb") as fr:
    label_training=pickle.load(fr)

### 2) Test set : data, label

In [2]:
with open("D:/datasets/AllVideo_numpy_list_pickle/02_data_test_210512.pickle","rb") as fr:
    data_test=pickle.load(fr)

In [3]:
with open("D:/datasets/AllVideo_numpy_list_pickle/02_label_test_210512.pickle","rb") as fr:
    label_test=pickle.load(fr)

## 2. Convert training set & test set to Numpy arrays, and save them (.npy)

### 1) Training set

In [13]:
data_training_ar=np.array(data_training, dtype=np.float16) #> (2878, 30, 160, 160, 3)

In [14]:
np.save('D:/datasets/AllVideo_numpy_list_pickle/02_data_training_Numpy_210512.npy', data_training_ar)

In [15]:
label_training_ar=np.array(label_training) #> (2878, 2)

In [16]:
np.save('D:/datasets/AllVideo_numpy_list_pickle/02_label_training_Numpy_210512.npy', label_training_ar)

In [17]:
data_training_ar.shape, label_training_ar.shape

((2878, 30, 160, 160, 3), (2878, 2))

### 2) Test set

In [4]:
data_test_ar=np.array(data_test, dtype=np.float16) #> (720, 30, 160, 160, 3)

In [5]:
np.save('D:/datasets/AllVideo_numpy_list_pickle/02_data_test_Numpy_210512.npy', data_test_ar)

In [6]:
label_test_ar=np.array(label_test) #> (720, 2)

In [7]:
np.save('D:/datasets/AllVideo_numpy_list_pickle/02_label_test_Numpy_210512.npy', label_test_ar)

In [8]:
data_test_ar.shape, label_test_ar.shape

((720, 30, 160, 160, 3), (720, 2))
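
Given the RAM limits mentioned earlier, one option when reading the saved `.npy` files back is NumPy's memory mapping, which avoids loading the whole array at once. This is a minimal sketch on a small stand-in array; the temporary file name and the toy shape are hypothetical, and it is not necessarily how the later notebooks load the data.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_test_set.npy")  # hypothetical file name

    # Small 5-D stand-in for the real (720, 30, 160, 160, 3) test array.
    toy = np.arange(2 * 3 * 4 * 4 * 3, dtype=np.float16).reshape(2, 3, 4, 4, 3)
    np.save(path, toy)

    # mmap_mode="r" maps the file read-only instead of reading it all at once.
    mapped = np.load(path, mmap_mode="r")
    print(type(mapped).__name__, mapped.shape, mapped.dtype)

    # Slicing and copying pulls only the requested batch into memory.
    batch = np.array(mapped[0:1])

    # Release the mapping before the temporary directory is removed.
    del mapped

print(batch.shape)
```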