<a href="https://colab.research.google.com/github/luciamartinf/BigData/blob/main/LeNet5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LeNet5 Model with BigDL

#### Ángela Gómez Sacristán, Álvaro González Berdasco and Lucía Martín Fernández

## BigDL environment set-up

### BigDL Dllib installation

In [1]:
# Install latest pre-release version of bigdl-dllib with spark3
# Find the latest bigdl-dllib with spark3 at https://sourceforge.net/projects/analytics-zoo/files/dllib-py-spark3/ and install it
!pip install https://sourceforge.net/projects/analytics-zoo/files/dllib-py-spark3/bigdl_dllib_spark3-0.14.0b20211107-py3-none-manylinux1_x86_64.whl

#exit() # restart the runtime to refresh installed pkg

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bigdl-dllib-spark3==0.14.0b20211107
  Downloading https://sourceforge.net/projects/analytics-zoo/files/dllib-py-spark3/bigdl_dllib_spark3-0.14.0b20211107-py3-none-manylinux1_x86_64.whl (93.9 MB)
Collecting pyspark==3.1.2
  Downloading pyspark-3.1.2.tar.gz (212.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Creat

In [2]:
!pip install findspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


### Importing the main libraries

In [3]:
import matplotlib
matplotlib.use('Agg')
%pylab inline
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

import datetime as dt
import tempfile
import numpy as np
import seaborn as sns
import os, random
import pandas as pd
from pathlib import Path
from IPython.display import Markdown, display


from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


from bigdl.dllib.nncontext import *
from bigdl.dllib.nnframes import *
from bigdl.dllib.nn.criterion import *
from bigdl.dllib.nn.layer import *
from bigdl.dllib.optim.optimizer import *

from bigdl.dllib import keras
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import load_img
from keras.models import Sequential
from keras.layers import *
from keras.optimizers import Adam
from keras import backend as K
from keras.preprocessing import image

Prepending /usr/local/lib/python3.8/dist-packages/bigdl/share/dllib/conf/spark-bigdl.conf to sys.path


### Setting up the Spark session

In [4]:
import findspark
findspark.init()

In [5]:
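sc = init_nncontext(cluster_mode="local")  # start a local SparkContext configured for BigDL
spark = SparkSession(sc)
# getOrCreate() reuses the session created above instead of building a new one.
# Memory settings passed at this point may not take effect, since the driver JVM is already running.
spark = SparkSession \
    .builder \
    .appName("Foo") \
    .config("spark.executor.memory", '50G') \
    .config("spark.driver.memory", '50G') \
    .getOrCreate()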

Current pyspark location is : /usr/local/lib/python3.8/dist-packages/pyspark/__init__.py
Start to getOrCreate SparkContext
pyspark_submit_args is:  --driver-class-path /usr/local/lib/python3.8/dist-packages/bigdl/share/dllib/lib/bigdl-dllib-spark_3.1.2-0.14.0-SNAPSHOT-jar-with-dependencies.jar pyspark-shell 
Successfully got a SparkContext


## Import data

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
# Define path to the data directory
dir_alldata = Path('/content/drive/MyDrive/Colab Notebooks/chest_xray')
# Path to the train directory (built with pathlib rather than os.path)
train_data_dir = dir_alldata / 'train'

# Path to validation directory
validation_data_dir = dir_alldata / 'val'

# Path to test directory
test_data_dir = dir_alldata / 'test'

# Get the path to the normal and pneumonia sub-directories
normal_cases_train = train_data_dir / 'NORMAL'
pneumonia_cases_train = train_data_dir / 'PNEUMONIA'
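
Before loading the images, it can help to confirm how many files each class sub-directory holds. The following is a minimal sketch, not part of the original notebook: the helper name `count_images_per_class` and the list of accepted file suffixes are our own assumptions, and the demonstration runs on a temporary directory that only mimics the `train/NORMAL` and `train/PNEUMONIA` layout.

```python
import tempfile
from pathlib import Path


def count_images_per_class(split_dir, suffixes=(".jpeg", ".jpg", ".png")):
    """Return {class_name: number_of_image_files} for one split directory."""
    counts = {}
    for class_dir in sorted(p for p in Path(split_dir).iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir() if f.suffix.lower() in suffixes
        )
    return counts


# Demonstration on a throwaway directory that mimics the train/ layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_train = Path(tmp) / "train"
    for class_name, n_files in [("NORMAL", 3), ("PNEUMONIA", 5)]:
        (demo_train / class_name).mkdir(parents=True)
        for i in range(n_files):
            (demo_train / class_name / f"img_{i}.jpeg").touch()
    demo_counts = count_images_per_class(demo_train)

print(demo_counts)
```

On the real data, the same helper could be called as `count_images_per_class(train_data_dir)`.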

### Data transformation

We first load the images through generators and then convert them to NumPy arrays, together with their corresponding labels.

In [29]:
img_width, img_height = 150, 150  # must match the target_size used by the generators below
nb_train_sample = 5216
nb_validation_samples = 16
nb_test_samples = 624

batch_size = 32

if K.image_data_format() == "channels_first":
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

# Training and validation images are rescaled and randomly augmented; test images are only rescaled
train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True)
validation_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(train_data_dir, target_size=(img_height, img_width), batch_size=batch_size, class_mode="binary")
validation_generator = validation_datagen.flow_from_directory(validation_data_dir, target_size=(img_height, img_width), batch_size=batch_size, class_mode="binary")
test_generator = test_datagen.flow_from_directory(test_data_dir, target_size=(img_height, img_width), batch_size=batch_size, class_mode="binary")

Found 5216 images belonging to 2 classes.
Found 16 images belonging to 2 classes.
Found 624 images belonging to 2 classes.


The generators yield the data in batches. To obtain the full dataset as NumPy arrays, we iterate over all the batches and append them, using the tqdm library to display progress:

In [30]:
#!pip install tqdm  # if not installed
import tqdm

In [41]:
# For the train dataset
train_generator.reset()
X_train, y_train = next(train_generator)
for i in tqdm.tqdm(range(int(train_generator.n/batch_size)-1)): 
  img, label = next(train_generator)
  X_train = np.append(X_train, img, axis=0 )
  y_train = np.append(y_train, label, axis=0)


100%|██████████| 162/162 [02:26<00:00,  1.11it/s]


In [34]:
print(X_train.shape)
print(y_train.shape)

(5216, 150, 150, 3)
(5216,)


In [33]:
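# For the test dataset
# Note: 624 is not a multiple of 32, so this loop collects 19 full batches
# (608 images) and leaves out the final partial batch of 16 images.
test_generator.reset()
X_test, Y_test = next(test_generator)
for i in tqdm.tqdm(range(int(test_generator.n/batch_size)-1)): 
  img, label = next(test_generator)
  X_test = np.append(X_test, img, axis=0 )
  Y_test = np.append(Y_test, label, axis=0)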

100%|██████████| 18/18 [00:06<00:00,  2.66it/s]


In [42]:
# The validation set (16 images) is smaller than the batch size, so a single call returns all of it and no loop is needed.
X_val, Y_val = next(validation_generator)
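
The loops above run `int(n/batch_size)-1` iterations after the first batch and grow the arrays with `np.append`, which copies the whole array on every step. A possible alternative, sketched here under our own assumptions, is to gather the batches in a list, concatenate once, and use ceiling division so that a final partial batch is included. The names `collect_batches` and `toy_batches` are hypothetical; `toy_batches` is only a stand-in for a Keras directory iterator so the example runs on its own.

```python
import numpy as np


def collect_batches(batch_iter, n_samples, batch_size):
    """Concatenate every batch from an iterator of (images, labels) pairs."""
    n_batches = -(-n_samples // batch_size)  # ceiling division keeps the last partial batch
    images, labels = [], []
    for _ in range(n_batches):
        x, y = next(batch_iter)
        images.append(x)
        labels.append(y)
    return np.concatenate(images, axis=0), np.concatenate(labels, axis=0)


def toy_batches(n_samples, batch_size, shape=(4, 4, 3)):
    """Stand-in for a Keras directory iterator: yields zero arrays in batches."""
    for start in range(0, n_samples, batch_size):
        size = min(batch_size, n_samples - start)
        yield np.zeros((size, *shape), dtype=np.float32), np.zeros(size, dtype=np.float32)


X_demo, y_demo = collect_batches(toy_batches(624, 32), n_samples=624, batch_size=32)
print(X_demo.shape, y_demo.shape)
```

With the real generators the call would look like `collect_batches(test_generator, test_generator.n, batch_size)` after `test_generator.reset()`. Note that this would return all 624 test images, so the evaluation figures would no longer be directly comparable with those reported in this notebook.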

## LeNet5 Model

### Building the model

In [35]:
def build_lenet_model():

    """
    Build the LeNet model as a Sequential model.
    """

    model = Sequential()
    model.add(Reshape((3, 150, 150), input_shape=(150, 150, 3)))
    model.add(Convolution2D(6, 5, 5, activation="tanh", name="conv1_5x5"))
    model.add(MaxPooling2D())
    model.add(Convolution2D(12, 5, 5, activation="tanh", name="conv2_5x5"))
    model.add(MaxPooling2D())
    model.add(Flatten())
    model.add(Dense(100, activation="tanh", name="fc1"))
    model.add(Dense(2, activation="softmax", name="fc2")) # 2 classes


    return model
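
One point worth checking in the first layer: `Reshape((3, 150, 150))` changes the shape of a channels-last image, but a reshape is generally not the same operation as moving the channel axis to the front. The NumPy sketch below illustrates the difference on a tiny array. Whether BigDL's `Reshape` layer orders elements exactly as NumPy does is an assumption we have not verified.

```python
import numpy as np

# Tiny "image": height 2, width 2, 3 channels, channels-last layout.
img_hwc = np.arange(2 * 2 * 3).reshape(2, 2, 3)

reshaped = img_hwc.reshape(3, 2, 2)        # regroups the flat buffer, element order unchanged
transposed = img_hwc.transpose(2, 0, 1)    # moves the channel axis to the front

print(img_hwc[..., 0])   # channel 0 of the original image
print(transposed[0])     # identical to the original channel 0
print(reshaped[0])       # the first four flat values, which mix channels
```

If the layer does behave like `reshape`, each of the three planes fed to `conv1_5x5` mixes values from different colour channels. Transposing `X_train` with `transpose(0, 3, 1, 2)` before training would be one way to supply true channels-first data, although the notebook does not do this.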

In [36]:
lenet = build_lenet_model()

creating: createZooKerasSequential
creating: createZooKerasReshape
creating: createZooKerasConvolution2D
creating: createZooKerasMaxPooling2D
creating: createZooKerasConvolution2D
creating: createZooKerasMaxPooling2D
creating: createZooKerasFlatten
creating: createZooKerasDense
creating: createZooKerasDense
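
To get a feel for the size of this network, the spatial dimensions can be traced layer by layer. The sketch below assumes convolutions without padding and with stride 1, and 2x2 max pooling with stride 2. These are the usual Keras 1.x defaults, but we have not confirmed that they apply to the BigDL layers used here. The helper names `conv_out` and `pool_out` are our own.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (size - kernel) // stride + 1


def pool_out(size, pool=2):
    """Output length of non-overlapping max pooling along one axis."""
    return size // pool


side = 150
side = conv_out(side, 5)   # conv1_5x5 -> 146
side = pool_out(side)      # pooling   -> 73
side = conv_out(side, 5)   # conv2_5x5 -> 69
side = pool_out(side)      # pooling   -> 34

flat_features = 12 * side * side          # 12 feature maps after conv2_5x5
fc1_params = flat_features * 100 + 100    # weights plus biases of fc1

print(side, flat_features, fc1_params)
```

Under these assumptions, almost all trainable parameters sit in the first dense layer, a consequence of feeding 150x150 images to an architecture originally designed for much smaller inputs.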


### Model Compilation

Next, we compile the model, specifying the loss function, the optimizer that updates the weights at every training step, and the metric to report.

In [37]:
lenet.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adadelta',
                  metrics=['accuracy'])

creating: createAdadelta
creating: createZooKerasSparseCategoricalCrossEntropy
creating: createZooKerasSparseCategoricalAccuracy
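
As a reminder of what the chosen loss measures, sparse categorical cross-entropy is the mean negative log-probability that the softmax output assigns to the true class, with labels given as integer class indices. The NumPy sketch below is our own illustration on made-up probabilities, not BigDL's implementation.

```python
import numpy as np


def sparse_categorical_crossentropy(probs, labels, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.log(true_class_probs).mean())


# Three samples, two classes (0 = NORMAL, 1 = PNEUMONIA); each row sums to 1.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])
labels = np.array([0, 1, 0])

loss = sparse_categorical_crossentropy(probs, labels)
accuracy = float((probs.argmax(axis=1) == labels).mean())
print(loss, accuracy)
```

The example also shows why loss and accuracy can diverge: every sample here is classified correctly, yet the loss is not zero because the predictions are not fully confident. Conversely, a few confidently wrong predictions can produce a high loss alongside a moderate accuracy.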


### Model fitting

We were not able to fit the model with the whole dataset, so we selected a smaller portion of the data:

In [38]:
sX_train = X_train[:500]
sy_train = y_train[:500]
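
Since `flow_from_directory` shuffles by default, the first 500 samples should be a random draw from the training set, but it is still worth checking that both classes are represented in the subset. A minimal sketch follows; the helper `class_counts` is hypothetical and the labels are randomly generated stand-ins for `sy_train` with an arbitrary class proportion.

```python
import numpy as np


def class_counts(labels, n_classes=2):
    """Count how many samples fall into each integer class."""
    return np.bincount(np.asarray(labels).astype(int), minlength=n_classes)


# Made-up float labels standing in for sy_train (0 = NORMAL, 1 = PNEUMONIA).
rng = np.random.default_rng(0)
demo_labels = (rng.random(500) < 0.7).astype(np.float32)

counts = class_counts(demo_labels)
print(counts, counts / counts.sum())
```

On the real subset, `class_counts(sy_train)` would give the number of NORMAL and PNEUMONIA images used for training.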

In [39]:
lenet.fit(
        sX_train, sy_train,
        nb_epoch = 200,
        batch_size = 50,
        validation_data=(X_val, Y_val))

With a fit_generator method, we could have passed the train_generator object directly as input, and the fit could have been much better. However, this function is not implemented in the bigdl.dllib library.

### Model Evaluation

In [40]:
results = lenet.evaluate(X_test, Y_test)  # results[0] is the loss, results[1] the accuracy
print("TestLoss: ", results[0])
print("Accuracy: ", results[1])

TestLoss:  2.4382448196411133
Accuracy:  0.7713815569877625


The model still has considerable room for improvement, since the test loss is fairly high (about 2.44). The accuracy (about 0.77), however, is not so bad. This shows the potential of the model: with further training and more data, we could expect some promising results.
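
Accuracy alone does not show how the errors are split between the two classes. A confusion matrix, together with precision and recall, gives a more detailed picture. The sketch below uses made-up labels and predictions purely to show the calculation; the helper `binary_confusion` is our own, and in practice the predicted classes could be taken, for instance, as the arg-max of the model's softmax outputs.

```python
import numpy as np


def binary_confusion(y_true, y_pred):
    """Return (tn, fp, fn, tp) for integer labels in {0, 1}."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    return tn, fp, fn, tp


# Made-up labels and predictions, only to show the calculation.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 1, 1, 1, 0])

tn, fp, fn, tp = binary_confusion(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)        # share of class-1 cases that were found
precision = tp / (tp + fp)     # share of class-1 predictions that were right
print(tn, fp, fn, tp, accuracy, recall, precision)
```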

## Comments

We found that the functions and methods implemented in the bigdl.dllib library are sometimes quite different from the original Keras ones, which complicated building different models with the bigdl.dllib library.


Considering these difficulties, we decided to build the LeNet model with the bigdl.dllib library and two other models using TensorFlow.