# Spark and TensorFlow

In the previous tutorials we worked with text datasets.

In this tutorial, we will learn how to use Spark to process images. We will also see how to use
machine learning models inside a Spark data pipeline. Our data pipeline follows these steps:
1. Use Spark to read and clean the original images.
2. Use a Haar cascade model to detect faces and extract each face as an individual image.
3. Use a pre-trained TensorFlow classification model to check whether the mask is worn
properly or not.
4. Overlay the model's predictions on the original images.

## Install the dependencies

Because we are going to use non-standard libraries, we need to install them on our Spark
driver. Open a terminal and run the following commands.
```shell
pip install opencv-contrib-python
pip install tensorflow
pip install --upgrade  protobuf

# you can check that a package is installed correctly with
pip show <package-name>
```
You can also install them directly from the notebook:

In [None]:
%pip install opencv-contrib-python

In [None]:
%pip install tensorflow

In [None]:
%pip install --upgrade  protobuf
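
After installing, it can be useful to confirm from Python that the packages are actually visible to the notebook kernel. The snippet below is a minimal sketch: the helper name `find_missing_modules` and the `REQUIRED_MODULES` list are illustrative, not part of the original notebook. Note that the importable module names differ from the pip package names (`opencv-contrib-python` is imported as `cv2`, `protobuf` as `google.protobuf`).

```python
import importlib.util

# Module names are what you import, not what you pip-install:
# opencv-contrib-python -> cv2, protobuf -> google.protobuf.
REQUIRED_MODULES = ["cv2", "tensorflow", "google.protobuf"]


def find_missing_modules(module_names):
    """Return the module names that cannot be located in the current environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package (e.g. "google") is itself absent.
            found = False
        if not found:
            missing.append(name)
    return missing


print("Missing modules:", find_missing_modules(REQUIRED_MODULES))
```

An empty list means the driver can import everything; the executors still need the same packages, which is handled separately when the Spark session is configured.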

## Create the Spark session

In [1]:
import os

from pyspark.sql import SparkSession

# spark.sql.shuffle.partitions is lowered from its default (200) to 4
# to match the number of executors
spark = SparkSession \
    .builder.master("k8s://https://kubernetes.default.svc:443") \
    .appName("SparkStreamingComputerVision") \
    .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:master") \
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
    .config("spark.executor.instances", "4") \
    .config("spark.executor.memory", "8g") \
    .config("spark.sql.shuffle.partitions", "4") \
    .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
    .config("spark.kubernetes.executor.podTemplateFile", "/home/jovyan/work/SparkStreamingCV/k8s_pod_template/custom_python_dependencies.yaml") \
    .getOrCreate()


Note that:
* We request 4 executors. The intent is also to override Spark's parallelism for shuffles and repartitions to 4
rather than the default of 200; that default is controlled by `spark.sql.shuffle.partitions`.
* Because Spark is a distributed computing framework, the executors must have the same dependencies as the driver.
There are several options: (1) build a dedicated image and tell Spark to use it for the executors, or (2) use the
Spark "podTemplate" feature to install the dependencies on top of a standard image. We choose option 2 because it
is more flexible and lightweight.

## Configure variables


In [2]:
# input image path
image_input_folder_path = "s3a://projet-spark-lab/diffusion/spark_cv/data/input/"
# model path
face_mask_model_url="https://minio.lab.sspcloud.fr/projet-spark-lab/diffusion/spark_cv/models/masknet.h5"

endpoint = "https://"+os.environ['AWS_S3_ENDPOINT']
AWS_ACCESS_KEY_ID=os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY=os.getenv("AWS_SECRET_ACCESS_KEY")
SESSION_TOKEN=os.environ['AWS_SESSION_TOKEN']

## Define helper functions

In [3]:
import io

import cv2 as cv
import numpy as np
from IPython.display import display
from PIL import Image
from pyspark.sql import functions as f


# deserialize bytes into an OpenCV image (BGR numpy array)
def convert_byte_to_nparr(img_byte):
    np_array = cv.imdecode(np.asarray(bytearray(img_byte)), cv.IMREAD_COLOR)
    return np_array


# serialize an OpenCV image to PNG-encoded bytes
def convert_nparr_to_byte(img_np_array):
    success, img = cv.imencode('.png', img_np_array)
    if not success:
        raise ValueError("PNG encoding failed")
    return img.tobytes()


# column function that extracts the file name from a full path
def extract_file_name(path):
    return f.substring_index(path, "/", -1)


# render a list of encoded images in Jupyter
def render_image(image_bytes_list):
    for image_bytes in image_bytes_list:
        image = Image.open(io.BytesIO(image_bytes))
        display(image)



### Helper functions for step 2 (detect faces and extract each face as an individual image)

Notice that, at the end of the block, we create a Spark UDF (user-defined function). A UDF
creates a new column in a dataframe whose value is the result of a computation
that can use the values of one or more existing columns.

In our case, the UDF takes an image and returns a list of objects (sub-images) containing all the
faces extracted from the original image. Each object has two fields:
1. The name, a string that encodes the position of the face in the original image.
2. The image content, as bytes.
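
To make the naming convention concrete, here is a small pure-Python sketch of how a bounding box can be encoded into a face image name and recovered later. The helper names `build_face_name` and `parse_face_name` are illustrative only; they mirror the `_x.._y.._w.._h..` pattern used by the notebook's helpers but are not part of it.

```python
def build_face_name(image_name, x, y, w, h):
    """Encode the bounding box (x, y, w, h) into the face image name."""
    stem = image_name[:-4]  # drops a 4-character extension such as ".png"
    return f"{stem}_x{x}_y{y}_w{w}_h{h}.png"


def parse_face_name(face_image_name):
    """Recover (x, y, w, h) from a name produced by build_face_name."""
    stem = face_image_name.rsplit(".", 1)[0]
    # Split from the right so underscores in the original name are harmless.
    x, y, w, h = (int(part[1:]) for part in stem.rsplit("_", 4)[1:])
    return x, y, w, h


name = build_face_name("group_photo.png", 34, 50, 120, 118)
print(name)                   # group_photo_x34_y50_w120_h118.png
print(parse_face_name(name))  # (34, 50, 120, 118)
```

Carrying the coordinates in the name means no extra column is needed to place the prediction back on the original image in step 4.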

In [None]:
from pyspark.sql.types import ArrayType, BinaryType, StringType, StructField, StructType


def face_extraction(image_name, raw_img_content):
    haar_model_name = "haarcascade_frontalface_default.xml"
    # location of the Haar cascade files shipped with OpenCV on the executors
    haar_model_path = "{}{}".format("/opt/conda/lib/python3.7/site-packages/cv2/data/", haar_model_name)
    img = cv.imdecode(np.asarray(bytearray(raw_img_content)), cv.IMREAD_COLOR)
    # detect on a grayscale copy, but crop the faces from the color image
    gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
    face_model = cv.CascadeClassifier(haar_model_path)
    faces = face_model.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)  # returns (x, y, w, h) boxes

    # Extract faces from the original image
    extracted_face_list = []
    for (x, y, w, h) in faces:
        img_content = img[y:y + h, x:x + w]
        img_content = cv.resize(img_content, (128, 128))
        # encode the face position in the name so it can be recovered in step 4
        extracted_face_img_name = "{}_x{}_y{}_w{}_h{}.png".format(image_name[:-4], x, y, w, h)
        img_byte = convert_nparr_to_byte(img_content)

        extracted_face_list.append((extracted_face_img_name, img_byte))
    return extracted_face_list


face_extraction_schema = ArrayType(StructType([
    StructField("img_name", StringType(), False),
    StructField("img_content", BinaryType(), False)
]))

Face_Extraction_UDF = f.udf(face_extraction, face_extraction_schema)

### Helper functions for step 3 (predict whether a mask is worn properly or not)

This Spark UDF takes a face image. If the model predicts that the mask is worn properly, the UDF
returns true; otherwise it returns false.
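
The decision rule behind this UDF can be sketched without TensorFlow. The example below assumes the model outputs one score per class and that class index 0 corresponds to "mask worn", which is how the notebook's prediction helper interprets the result; the function names `scale_pixels` and `is_mask_worn` and the sample scores are made up for illustration.

```python
def scale_pixels(pixels):
    """Map 8-bit pixel values (0..255) to floats in [0, 1]."""
    return [value / 255.0 for value in pixels]


def is_mask_worn(scores):
    """Return True when the highest score belongs to class index 0."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index == 0


print(scale_pixels([0, 51, 255]))  # [0.0, 0.2, 1.0]
print(is_mask_worn([0.91, 0.09]))  # True
print(is_mask_worn([0.20, 0.80]))  # False
```

In the real UDF the same two steps are applied to a 128x128 color image and to the array returned by the Keras model.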

In [4]:
import tensorflow as tf
from pyspark.sql.types import BooleanType
from tensorflow.keras.utils import get_file


def face_mask_prediction(np_img_str):
    # decode the raw face image and shape it as a batch of one 128x128 color image
    np_arr_img = convert_byte_to_nparr(np_img_str)
    img = np.reshape(np_arr_img, [1, 128, 128, 3])
    img = img / 255.0
    # download the model (get_file caches it locally), then load it
    model_path = get_file('masknet.h5', face_mask_model_url)
    vgg19_model = tf.keras.models.load_model(model_path)

    score = vgg19_model.predict(img)
    # class index 0 means the mask is worn; return a plain Python bool for Spark
    return bool(np.argmax(score) == 0)


Face_Mask_Prediction_UDF = f.udf(face_mask_prediction, BooleanType())

### Helper functions for step 4 (overlay the results on the original image)

To make the prediction results easier to read, we draw them on the original image
using the labels MASK / NO MASK.

In [5]:
# extract the face coordinates from the face image name
def get_face_coordinate_of_origin_image(face_image_name):
    stem = face_image_name.rsplit(".", 1)[0]
    # split from the right so underscores in the original image name do not shift the fields
    x, y, w, h = (int(part[1:]) for part in stem.rsplit("_", 4)[1:])
    return x, y, w, h


def integrate_face_mask_prediction(origin_image_name, face_list, origin_image_content):
    buffer_img = cv.imdecode(np.asarray(bytearray(origin_image_content)), cv.IMREAD_COLOR)
    # BGR colors: green for MASK, red for NO MASK
    mask_label_color = {"MASK": (0, 255, 0), "NO MASK": (0, 0, 255)}
    for face in face_list:
        face_image_name = face[0]
        has_mask = face[1]
        mask_label = "MASK" if has_mask else "NO MASK"
        # Get the coordinates and size of the face image
        (x, y, w, h) = get_face_coordinate_of_origin_image(face_image_name)

        # Write the mask label just above the face
        buffer_img = cv.putText(buffer_img, mask_label, (x, y - 10), cv.FONT_HERSHEY_SIMPLEX, 0.5,
                                mask_label_color[mask_label], 2)
        # Draw a rectangle around the face
        buffer_img = cv.rectangle(buffer_img, (x, y), (x + w, y + h), mask_label_color[mask_label], 1)
    # serialize the OpenCV image to bytes
    img_bytes = convert_nparr_to_byte(buffer_img)
    return img_bytes


Integrate_Face_Mask_Prediction_UDF = f.udf(integrate_face_mask_prediction, BinaryType())




## The main data pipeline for processing the images

1. Use Spark to read and clean the original images.
2. Use a Haar cascade model to detect faces and extract each face as an individual image.
3. Use a pre-trained TensorFlow classification model to check whether the mask is worn
properly or not.
4. Overlay the model's predictions on the original images.


### Step 1: Read the images from S3

In [None]:
image_schema = spark.read.format("binaryFile").load(image_input_folder_path).schema
raw_image_df = spark.read \
        .format("binaryFile") \
        .option("maxFilesPerTrigger", "500") \
        .option("recursiveFileLookup", "true") \
        .option("pathGlobFilter", "*.png") \
        .schema(image_schema) \
        .load(image_input_folder_path)
raw_image_df.show(5)

In [6]:
# show the origin image
origin_col_name="content"
origin_image_list = raw_image_df.select(origin_col_name).toPandas()[origin_col_name]
render_image(origin_image_list)

#### 1.1 Clean the dataframe

In [None]:
image_name_df = raw_image_df \
        .select("path", "content") \
        .withColumn("origin_image_name", extract_file_name(f.col("path"))).drop("path")
image_name_df.show()

### Step 2: Extract the faces

In [None]:
# use udf Face_Extraction_UDF to extract faces
detected_face_list_df = image_name_df.withColumn("detected_face_list",Face_Extraction_UDF("origin_image_name", "content"))
detected_face_list_df.show()
detected_face_list_df.printSchema()

In [9]:
# Flatten the list column into multiple rows
detected_face_ob_df = detected_face_list_df.withColumn("extracted_face",f.explode(f.col("detected_face_list"))).drop("detected_face_list")
detected_face_ob_df.show()
detected_face_ob_df.printSchema()

In [None]:
# flatten the struct column into primitive columns
detected_face_df = detected_face_ob_df.select(f.col("origin_image_name"),f.col("content"),
                                                  f.col("extracted_face.img_name").alias("extracted_face_image_name"),
                                                  f.col("extracted_face.img_content").alias(
                                                      "extracted_face_image_content"))
detected_face_df.show()
detected_face_df.printSchema()

In [10]:
# show the extracted faces
face_col_name="extracted_face_image_content"
face_image_list = detected_face_df.select(face_col_name).toPandas()[face_col_name]
render_image(face_image_list)

### Step 3: Predict mask wearing on the extracted faces

In [11]:
predicted_mask_df = detected_face_df.withColumn("with_mask",Face_Mask_Prediction_UDF("extracted_face_image_content")).cache()
predicted_mask_df.show()
predicted_mask_df.printSchema()

### Step 4: Overlay the model's predictions on the original images

In [13]:
# map the face name with the mask prediction result and group them by their origin
grouped_face_df = predicted_mask_df.drop("extracted_face_image_content").groupBy("origin_image_name","content").agg(f.collect_list(f.struct(
                *[f.col("extracted_face_image_name").alias("face_name"), f.col("with_mask").alias("with_mask")]))
            .alias("face_list"))

grouped_face_df.show()
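
The `groupBy` / `collect_list` step above gathers, for each original image, the list of its faces together with their predictions. As a rough mental model only, the same regrouping can be written in plain Python; the sample rows and the helper name `group_faces_by_origin` are hypothetical and do not come from the dataset.

```python
from collections import defaultdict

# Hypothetical rows: (origin_image_name, extracted_face_image_name, with_mask)
rows = [
    ("a.png", "a_x1_y2_w3_h4.png", True),
    ("a.png", "a_x9_y8_w7_h6.png", False),
    ("b.png", "b_x5_y5_w5_h5.png", True),
]


def group_faces_by_origin(prediction_rows):
    """Collect (face_name, with_mask) pairs under their original image name."""
    grouped = defaultdict(list)
    for origin, face_name, with_mask in prediction_rows:
        grouped[origin].append((face_name, with_mask))
    return dict(grouped)


for origin, face_list in group_faces_by_origin(rows).items():
    print(origin, face_list)
```

Each resulting list plays the role of the `face_list` column that is passed to the overlay UDF in the next cell.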




In [None]:
# overlay the predictions on the original images
final_df = grouped_face_df.withColumn("marked_img_content",Integrate_Face_Mask_Prediction_UDF("origin_image_name", "face_list","content"))
final_df.show()

In [14]:
# display the images with the "MASK" / "NO MASK" labels
col_name="marked_img_content"
image_list = final_df.select(col_name).toPandas()[col_name]
render_image(image_list)

In [15]:
# stop the Spark session
spark.stop()
