## 1.  Reading CSV files

In my case, I have a folder containing vectors that come from the data. <br>
Each vector is an encoding or a feature obtained through pre-processing, which could be applied to images, videos, audio, etc. The pre-processing itself is outside the scope of this document. Sometimes the raw information is required instead; raw data can also be reshaped into a vector.

The folder contains one Comma-Separated Values (CSV) file for each class. Each CSV file contains rows of vectors. Each row could represent an image, a video segment, or something else. The figure below depicts this scenario.

Note: In my case the values are actually separated by tabs rather than commas. <br>
Note 2: Recall that each file contains all the data from one class. Prior processing is required to gather all the data of each class into one file like the ones used here. A procedure similar to the one in section 3 could be used for this.

<img src="images/Pre-processing_CSVFiles.png">
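
To reproduce this layout without real features, a minimal sketch that writes small tab-separated class files could look like the following. The helper `write_toy_class_files`, the folder name `toy_data`, and the random values are assumptions made for illustration; only the layout (one file per class, one vector per row, tab delimiter) follows the text.

```python
import csv
import os
import random
import tempfile


def write_toy_class_files(folder, n_classes=3, n_rows=20, n_cols=10, seed=0):
    """Write one tab-separated file per class, one random vector per row."""
    rng = random.Random(seed)
    os.makedirs(folder, exist_ok=True)
    paths = []
    for c in range(n_classes):
        path = os.path.join(folder, "class_%d.csv" % c)
        with open(path, "wt", newline="") as csvfile:
            writer = csv.writer(csvfile, delimiter="\t")
            for _ in range(n_rows):
                # Hypothetical feature values; real vectors come from pre-processing
                writer.writerow([round(rng.random(), 4) for _ in range(n_cols)])
        paths.append(path)
    return paths


# Hypothetical location: a temporary folder, so nothing in prueba/ is overwritten
toy_folder = os.path.join(tempfile.mkdtemp(), "toy_data")
toy_paths = write_toy_class_files(toy_folder)
print([os.path.basename(p) for p in toy_paths])
```

The default sizes (20 rows of 10 columns) are chosen only to resemble the example files used in this document.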

First, we import the libraries that we will use: <br>
csv - to read and write the CSV files <br>
numpy - to convert the data to numpy arrays and to shuffle; the conversion is not required, but it helps if some processing, such as framing, is needed <br>
os - to read directories <br>
math - to calculate values <br>
random - to do shuffling (imported here, although the code below shuffles with np.random.shuffle) <br>


In [1]:
#Import libraries 
import csv            
import numpy as np
import os
import math
from random import shuffle

Now, some functions that will be used. <br>
get_number_of_lines_in_file - Returns the number of rows in a text file; this requires reading the whole file. <br>
load_csv - Loads the whole file and stores it in a list. <br>
str_dataset_2_numpy_dataset - Converts the list into a numpy array.

In [2]:
#This function returns the number of rows in a CSV file
def get_number_of_lines_in_file(fileName):
    #The with statement closes the file even if reading fails
    with open(fileName) as file_object:
        row_count = sum(1 for row in file_object)
    return row_count


# Load a CSV file
def load_csv(filename):
    dataset = []
    with open(filename, 'rt') as csvfile:
        readData = csv.reader(csvfile, delimiter='\t')
        for row in readData:
            dataset.append(row)    
    return dataset

# Convert string data to a numpy matrix
# This is not appropriate for a huge amount of data because it requires loading everything into one array
def str_dataset_2_numpy_dataset(dataset):
    Mat = np.zeros((int(len(dataset)),len(dataset[0])))
    for i in range(len(dataset)):   
        Mat[i] = np.array(dataset[i][:], dtype=float)
    return Mat

In [3]:
#Example of getting the number of rows in a file
fileName = "prueba/class_0.csv"
print(get_number_of_lines_in_file(fileName))

20


In [4]:
#Example of reading a file and converting to a numpy matrix
dataset = load_csv(fileName)
Mat = str_dataset_2_numpy_dataset(dataset)
Mat.shape


(20, 10)
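
As a hedged alternative to the two-step route above (list first, array second), `np.loadtxt` can parse a tab-separated numeric file directly into an array. The sketch below writes a tiny temporary file with made-up values so that it is self-contained; the names `tmp_path` and `Mat_demo` are illustrative only.

```python
import os
import tempfile

import numpy as np

# Write a tiny tab-separated file (made-up values) to read back
tmp_path = os.path.join(tempfile.mkdtemp(), "class_demo.csv")
with open(tmp_path, "wt") as fh:
    fh.write("0.1\t0.2\t0.3\n")
    fh.write("0.4\t0.5\t0.6\n")

# ndmin=2 keeps a 2-D result even if the file has a single row
Mat_demo = np.loadtxt(tmp_path, delimiter="\t", ndmin=2)
print(Mat_demo.shape)
```

Like the functions above, this still loads the whole file into memory, so it does not help with very large files.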

## 2. Separating train and test

First, we use a routine to split the vectors of each class into train and test files. In this example I put 70% of the data in train and 30% in test. The figure below shows this.



<img src="images/separate_train_test.png">

The next code splits each file into two files, the train file and the test file. <br>
It first loads the dataset and converts it to a numpy array. We get the number of rows, create a list of row indices, and shuffle it. We then traverse the shuffled indices and write the first 70% of the rows to the train file and the remaining 30% to the test file.

In [5]:
#In this code, I have to set the file name. In this example I have 3 files and I would need to change the class name {0,1,2}
#to perform the separation over all files. However, this can be automated using os.walk as shown ahead
className = "2"  #{0, 1 or 2 in this example}
fileName = "prueba/class_" + className + ".csv"
fileNameTrain = "prueba/train/class_" + className + ".csv"
fileNameTest  ="prueba/test/class_" + className + ".csv"

#Load the data
dataset = load_csv(fileName)

#Convert the data to a numpy array (these two functions could be merged so that the array is built while reading,
#without first requiring a list)
dataset_array = str_dataset_2_numpy_dataset(dataset)

#Get the number of rows and columns
len_file_csv,nCols = dataset_array.shape

#Create an index starting from 0 to the number of rows
index_shuffle = [i for i in range(len_file_csv)]

#Shuffle the rows
np.random.shuffle(index_shuffle)

#Getting how many lines are going to be in the train file
n_train =  math.floor(len_file_csv*.7)

count = 0

#Saves 70% of the shuffled rows into a file (the folder prueba/train must already exist)
with open(fileNameTrain, 'wt', newline='') as csvfile:
    MV_writer = csv.writer(csvfile, delimiter='\t')
    while (count < n_train):
        MV_writer.writerow(dataset_array[index_shuffle[count]])
        count += 1

#Saves the other 30% of the shuffled rows into a file (the folder prueba/test must already exist)
with open(fileNameTest, 'wt', newline='') as csvfile:
    MV_writer = csv.writer(csvfile, delimiter='\t')
    while (count < len_file_csv):
        MV_writer.writerow(dataset_array[index_shuffle[count]])
        count += 1

print("------ Train----------")
print(get_number_of_lines_in_file(fileNameTrain))
print("------ Test-------")
print(get_number_of_lines_in_file(fileNameTest))

------ Train----------
14
------ Test-------
6
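
The cell above handles one class at a time. A minimal sketch of how the split could be automated over every class file is shown below. The function `split_class_file`, its parameters, and the temporary demo data are assumptions for illustration, not the code used to produce the outputs in this document; it also keeps the rows as strings instead of converting them to a numpy array.

```python
import csv
import math
import os
import random
import tempfile


def split_class_file(src, dst_train, dst_test, train_fraction=0.7, seed=None):
    """Shuffle the rows of one class file and write train/test files."""
    with open(src, "rt", newline="") as fh:
        rows = list(csv.reader(fh, delimiter="\t"))
    random.Random(seed).shuffle(rows)
    n_train = math.floor(len(rows) * train_fraction)
    for path, part in ((dst_train, rows[:n_train]), (dst_test, rows[n_train:])):
        # Create the train/ or test/ folder if it does not exist yet
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wt", newline="") as fh:
            csv.writer(fh, delimiter="\t").writerows(part)
    return n_train, len(rows) - n_train


# Demo on a temporary folder with made-up data (3 classes, 20 rows each)
base = tempfile.mkdtemp()
for c in range(3):
    with open(os.path.join(base, "class_%d.csv" % c), "wt", newline="") as fh:
        csv.writer(fh, delimiter="\t").writerows([[c, r] for r in range(20)])

split_counts = {}
for name in sorted(os.listdir(base)):
    src = os.path.join(base, name)
    if os.path.isfile(src):
        split_counts[name] = split_class_file(
            src,
            os.path.join(base, "train", name),
            os.path.join(base, "test", name),
            seed=0,
        )
print(split_counts)
```

Passing a fixed `seed` makes the split repeatable; leaving it as `None` gives a different shuffle on each run.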


## 3. Putting together all train and test files
So far, we have the files shown in the figure below. One folder contains the training data and another the test data. <br>
Each of these folders so far consists of one file per class, containing all the vectors from that class.

<img src="images/train_test_classesFiles.png">


We want to put them all in a single file containing the data from all classes. However, we do not want the rows to be ordered by class; we want the rows from every class mixed together. The figure below shows, on the left, an example of a file containing the classes in order with its corresponding index and, on the right, a file with the data from all classes mixed and its corresponding shuffled index.

<img src="images/whole_shuffled.png">

We create a list with the class index of every vector and then shuffle it. We then write the new file row by row, reading the next row from the class file that the shuffled list indicates, as depicted in the figure below. We do this for train and test separately. The idea is to avoid loading all the data in memory, because it can be a huge amount of data.

<img src="images/step_by_step_write.png">
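
The idea can be illustrated in memory with a small sketch. The lists below are made-up stand-ins for three class files, and the names `class_rows`, `readers`, and `mixed` are illustrative only; each iterator plays the role of a file that is read one row at a time.

```python
import random

# Made-up stand-ins for three class files; each iterator yields rows in file order
class_rows = {
    0: ["a0", "a1", "a2"],
    1: ["b0", "b1"],
    2: ["c0", "c1", "c2", "c3"],
}
readers = {label: iter(rows) for label, rows in class_rows.items()}

# One label per row, repeated as many times as the class has rows
index = [label for label, rows in class_rows.items() for _ in rows]
random.Random(0).shuffle(index)

# Pull the next unread row from whichever class the shuffled index names
mixed = [(label, next(readers[label])) for label in index]
print(mixed)
```

Note that only the interleaving of the classes is random: within each class the rows keep their file order, which is acceptable here because each class file was already shuffled during the train/test split.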

The next code uses os.walk to walk through the directories and files contained in a root path. More information here: <br>
https://docs.python.org/3/library/os.html

In [6]:
#This function gets information about how many files are within a folder and how many rows each file has.
#It also returns the file names as class names (recall that all information about one class is now in one file)
def get_info_folder_with_classes(pathFile):
    i=0
    Classes = []
    ClassesNames = []
    nRowsInClasses = []
    for root, dirs, files in os.walk(pathFile):
        path = root.split(os.sep)
        print((len(path) - 1) * '---', os.path.basename(root))
        for file in files:
            Classes.append(i)
            i +=1
            ClassesNames.append(file)
            print(len(path) * '---', file)
            nRows = get_number_of_lines_in_file(os.path.join(root,file))
            nRowsInClasses.append(nRows)
    return Classes, ClassesNames, nRowsInClasses


#This function writes the shuffled data from all classes into a file. It reads row by row according to a shuffling of the classes.
def write_shuffled_classes(folderWithFiles,folderToSave,Classes,ClassesNames,nRowsInClasses):
    #Name of the folder being processed (train or test), derived from the path instead of relying on a global variable
    dirName = os.path.basename(os.path.normpath(folderWithFiles))
    #Files to be opened
    fileNameToSave = os.path.join(folderToSave,"all_Shuffled_"+dirName+".csv")
    fileIndexClass = os.path.join(folderToSave,"classesIndex_"+dirName+".csv")
    fileClassNames = os.path.join(folderToSave,"classesNames_"+dirName+".csv")
    
    #File to write all data coming from all classes
    csvFile = open(fileNameToSave, 'wt',newline='') 
    Shuffle_data_writer = csv.writer(csvFile, delimiter='\t')
    
    #File to write which class each row corresponds to
    csvIndex = open(fileIndexClass, 'wt',newline='') 
    NameIx_data_writer = csv.writer(csvIndex, delimiter='\t')
    
    #File to write the name of the classes
    csvNameCl = open(fileClassNames, 'wt') 
    
    #Makes a list with one class index per row of each class
    index = np.array([])
    for i in range(len(Classes)):
        index = np.append(index,np.ones((1,nRowsInClasses[i]))*Classes[i])
        csvNameCl.write(ClassesNames[i]+"\n")
        
    #Shuffle the index
    np.random.shuffle(index)
    
    #Makes a list of file readers, one per class, to read from according to the shuffled index
    fileObjectsReader = []
    for i in range(len(Classes)):
        fileObjectsReader.append(csv.reader(open(os.path.join(folderWithFiles,ClassesNames[i])),delimiter = "\t"))
        
    #Writes the whole data step by step, getting the next row from the class given by the shuffled list
    for i in range(len(index)):
        Shuffle_data_writer.writerow(next(fileObjectsReader[int(index[i])]))
        #writerow expects a sequence of fields; passing a string would split a label such as "10" into two columns
        NameIx_data_writer.writerow([int(index[i])])
        
    #Closes all files
    csvFile.close()
    csvIndex.close()
    csvNameCl.close()
    
        
        
#The path with the folders containing the data (train and test)
folderWithFiles="prueba/"
#The name of the folder where the data will be saved (it must already exist)
folderToSave = folderWithFiles + "all_train_and_test"


for root, dirs, files in os.walk(folderWithFiles):
    if root != folderWithFiles and root!= folderToSave:  #To avoid reading the files from the root and the folder where the data will be saved
        pathRoot = root.split('/') #root.split(os.sep) depending on how to separate path {\\,/}
        dirName = os.path.basename(root)
        Classes, ClassesNames, nRowsInClasses = get_info_folder_with_classes(os.path.join(pathRoot[0],dirName))
        print(Classes)
        print(ClassesNames)
        print(nRowsInClasses)
        write_shuffled_classes(os.path.join(pathRoot[0],dirName),folderToSave,Classes,ClassesNames,nRowsInClasses)

--- test
------ class_0.csv
------ class_1.csv
------ class_2.csv
[0, 1, 2]
['class_0.csv', 'class_1.csv', 'class_2.csv']
[6, 6, 6]
--- train
------ class_0.csv
------ class_1.csv
------ class_2.csv
[0, 1, 2]
['class_0.csv', 'class_1.csv', 'class_2.csv']
[14, 14, 14]
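
A quick sanity check after writing is to count the labels in the index file and compare them with the per-class row counts printed above. The sketch below is an assumption-laden example: the helper `count_labels` and the demo index file are made up for illustration, and the demo file only stands in for a file such as classesIndex_test.csv.

```python
import csv
import os
import tempfile
from collections import Counter


def count_labels(index_path):
    """Count how many rows of the index file belong to each class."""
    with open(index_path, "rt", newline="") as fh:
        return Counter(int(row[0]) for row in csv.reader(fh, delimiter="\t") if row)


# Made-up index file standing in for classesIndex_test.csv
demo_index = os.path.join(tempfile.mkdtemp(), "classesIndex_demo.csv")
with open(demo_index, "wt", newline="") as fh:
    csv.writer(fh, delimiter="\t").writerows([[2], [0], [1], [0], [2], [1]])

label_counts = count_labels(demo_index)
print(sorted(label_counts.items()))
```

For the test folder in this document, the counts would be expected to match the printed list `[6, 6, 6]`, one entry per class.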


## 4. Writing the HDF5 file

In [7]:
import h5py
#Path and filenames
path_data = "prueba/all_train_and_test/"
fileNameTest_x = "all_Shuffled_test.csv"
fileNameTest_y = "classesIndex_test.csv"

#Read the datasets
dataset = load_csv(os.path.join(path_data,fileNameTest_x))
dataset_array = str_dataset_2_numpy_dataset(dataset)

datasetY = load_csv(os.path.join(path_data,fileNameTest_y))
datasetY_array = str_dataset_2_numpy_dataset(datasetY)

#Creates a hdf5 file
fileName_hdf5 = "prueba/all_train_and_test/My_test_file_.hdf5"
f = h5py.File(fileName_hdf5, "w")

#Creates datasets HDF5
dset = f.create_dataset("test_data", data=dataset_array)
dsetY =  f.create_dataset("test_data_Y", data=datasetY_array)
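
The cell above loads the whole CSV file before writing it. If the data is too large for that, one possible approach is to create a resizable HDF5 dataset and append the rows block by block. The sketch below assumes h5py is installed; the helper `append_rows`, the dataset name `demo_data`, the block size, and the demo file are made up for illustration.

```python
import csv
import os
import tempfile

import h5py
import numpy as np


def append_rows(dset, rows):
    """Grow the dataset along axis 0 and write the new rows at the end."""
    start = dset.shape[0]
    dset.resize(start + len(rows), axis=0)
    dset[start:] = np.asarray(rows, dtype="float64")


tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "demo.csv")
h5_path = os.path.join(tmp_dir, "demo.hdf5")

# Made-up data: 7 rows of 4 columns
with open(csv_path, "wt", newline="") as fh:
    csv.writer(fh, delimiter="\t").writerows(
        [[r + 0.1 * c for c in range(4)] for r in range(7)]
    )

block_size = 3  # hypothetical block size; tune it to the available memory
with h5py.File(h5_path, "w") as h5, open(csv_path, "rt", newline="") as fh:
    # maxshape=(None, 4) lets the first axis grow without a fixed limit
    dset = h5.create_dataset(
        "demo_data", shape=(0, 4), maxshape=(None, 4), dtype="float64"
    )
    block = []
    for row in csv.reader(fh, delimiter="\t"):
        block.append([float(v) for v in row])
        if len(block) == block_size:
            append_rows(dset, block)
            block = []
    if block:  # write the rows left over after the last full block
        append_rows(dset, block)

with h5py.File(h5_path, "r") as h5:
    demo_shape = h5["demo_data"].shape
print(demo_shape)
```

Using `with h5py.File(...)` also closes the file automatically, so no separate `close()` call is needed in this variant.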


## 5. Reading the HDF5 file


In [8]:
dataset = h5py.File(fileName_hdf5, "r")

test_set_x_orig = np.array(dataset["test_data"][:]) # your test set features
test_set_y_orig = np.array(dataset["test_data_Y"][:]) # your test set labels

print(test_set_x_orig.shape)
print(test_set_y_orig.shape)

(18, 10)
(18, 1)


In [9]:
#Close both file handles (the writer from In [7] and the reader from In [8])
f.close()
dataset.close()