# Open Data Hub - Basic Tutorial

The intent of this notebook is to provide examples of how data scientists can use Open Data Hub with object storage, specifically Ceph object storage, in much the same way they are accustomed to interacting with Amazon Simple Storage Service (S3) for data science work.

*Table of Contents:*
1. Working with Boto
2. Working with a TensorFlow Neural Network example
3. Working with Spark and machine learning libraries

# Working with Boto

Boto is an integrated interface to current and future infrastructure services offered by Amazon Web Services. Among many other services, it provides an interface to Amazon S3 and, because Ceph Object Storage exposes an S3-compatible API, to Ceph as well; both are object stores often used for data lakes. For lightweight analysis of data using Python tools like NumPy or pandas, it is handy to interact with data stored in object storage using pure Python. This is where Boto shines. Some notebooks from [Open Data Hub](https://radanalytics.io) may not include Boto, but you can install it from the comfort of a notebook (for example, by running '!pip install boto3' in a cell). If you find yourself using Boto frequently, it might be worth modifying [base-notebook](https://github.com/radanalyticsio/base-notebook) and building a custom notebook image that includes Boto.

You'll use environment variables passed into the notebook from OpenShift for access to the Ceph Object Storage and Spark.

In [None]:
import sys
import os
import boto3

s3 = boto3.client('s3','us-east-1', endpoint_url= os.environ['S3_ENDPOINT_URL'],
                       aws_access_key_id = os.environ['S3_ACCESS_KEY'],
                       aws_secret_access_key = os.environ['S3_SECRET_KEY'])
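
If one of the environment variables is not set, the cell above stops with a bare `KeyError`. The following is a minimal sketch of a helper that checks all three variables up front and returns the keyword arguments for `boto3.client`. The function name `s3_client_kwargs` and the constant `REQUIRED_VARS` are hypothetical; the variable names are the ones this notebook already uses.

```python
import os

# Environment variables this notebook expects OpenShift to provide.
REQUIRED_VARS = ("S3_ENDPOINT_URL", "S3_ACCESS_KEY", "S3_SECRET_KEY")

def s3_client_kwargs(env=None):
    """Collect the S3 connection settings from the environment.

    Raises a RuntimeError that names every missing variable at once.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "endpoint_url": env["S3_ENDPOINT_URL"],
        "aws_access_key_id": env["S3_ACCESS_KEY"],
        "aws_secret_access_key": env["S3_SECRET_KEY"],
    }
```

With this in place, the client could be created as `boto3.client('s3', 'us-east-1', **s3_client_kwargs())`.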


Create a bucket, upload an example object with the 'put_object' call, and list the bucket contents.

In [None]:
s3.create_bucket(Bucket=os.environ['ATTENDEE_ID'])
s3.put_object(Bucket=os.environ['ATTENDEE_ID'],Key='object',Body='data')
for key in s3.list_objects(Bucket=os.environ['ATTENDEE_ID'])['Contents']:
    print(key['Key'])
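
The listing loop above assumes the response contains a `Contents` entry, which is absent when the bucket is empty, and a single `list_objects` call returns only one page of keys. A possible sketch that handles both cases with the client's paginator follows; the helper name `iter_keys` is hypothetical, and `client` is expected to be a boto3 S3 client such as `s3` above.

```python
def iter_keys(client, bucket, prefix=""):
    """Yield every object key in `bucket`, one result page at a time."""
    paginator = client.get_paginator("list_objects")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page for an empty bucket (or an unmatched prefix) has no 'Contents'.
        for entry in page.get("Contents", []):
            yield entry["Key"]
```

Usage would look like `for key in iter_keys(s3, os.environ['ATTENDEE_ID']): print(key)`.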

# Working with a TensorFlow Neural Network example

Before we do anything else with Ceph and data, let's run a TensorFlow example. We'll start by installing the machine learning libraries that the examples in this notebook need.

In [None]:
!pip install keras==2.1.2 scikit-learn tensorflow matplotlib seaborn

Build a fully connected neural network with two hidden layers (a.k.a. multilayer perceptron) with TensorFlow.

This example uses some of TensorFlow's higher-level wrappers (tf.estimator, tf.layers, tf.metrics, ...). See the 'neural_network_raw' example in the project linked below for a raw, more detailed TensorFlow implementation.

- Author: Aymeric Damien
- Project: https://github.com/aymericdamien/TensorFlow-Examples/

## Neural Network Overview

<img src="http://cs231n.github.io/assets/nn1/neural_net2.jpeg" alt="nn" style="width: 400px;"/>

## MNIST Dataset Overview

This example is using MNIST handwritten digits. The dataset contains 60,000 examples for training and 10,000 examples for testing. The digits have been size-normalized and centered in a fixed-size image (28x28 pixels) with values from 0 to 1. For simplicity, each image has been flattened and converted to a 1-D numpy array of 784 features (28*28).

![MNIST Dataset](http://neuralnetworksanddeeplearning.com/images/mnist_100_digits.png)

More info: http://yann.lecun.com/exdb/mnist/

In [None]:
from __future__ import print_function

# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=False)

import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# Parameters
learning_rate = 0.1
num_steps = 1000
batch_size = 128
display_step = 100

# Network Parameters
n_hidden_1 = 256 # 1st layer number of neurons
n_hidden_2 = 256 # 2nd layer number of neurons
num_input = 784 # MNIST data input (img shape: 28*28)
num_classes = 10 # MNIST total classes (0-9 digits)

In [None]:
# Define the input function for training
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'images': mnist.train.images}, y=mnist.train.labels,
    batch_size=batch_size, num_epochs=None, shuffle=True)
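
The input function above feeds the estimator shuffled mini-batches, and `num_epochs=None` means the data repeats indefinitely, so the length of training is decided by the `steps` argument rather than by the data. The sketch below illustrates that behaviour in plain Python. It is not TensorFlow's implementation (which may, for instance, fill batches across epoch boundaries), and the name `batch_stream` is hypothetical.

```python
import itertools
import random

def batch_stream(examples, batch_size, num_epochs=None, shuffle=True, seed=0):
    """Yield lists of up to `batch_size` examples; repeat forever if num_epochs is None."""
    rng = random.Random(seed)
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        order = list(examples)
        if shuffle:
            rng.shuffle(order)  # a fresh order for every pass over the data
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]

# With num_epochs=None the stream never ends, so the caller decides when to stop,
# much as steps=num_steps does for model.train().
first_three = list(itertools.islice(batch_stream(range(10), batch_size=4), 3))
```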

In [None]:
# Define the neural network
def neural_net(x_dict):
    # TF Estimator input is a dict, in case of multiple inputs
    x = x_dict['images']
    # Hidden fully connected layer with 256 neurons
    layer_1 = tf.layers.dense(x, n_hidden_1)
    # Hidden fully connected layer with 256 neurons
    layer_2 = tf.layers.dense(layer_1, n_hidden_2)
    # Output fully connected layer with a neuron for each class
    out_layer = tf.layers.dense(layer_2, num_classes)
    return out_layer

In [None]:
# Define the model function (following TF Estimator Template)
def model_fn(features, labels, mode):
    
    # Build the neural network
    logits = neural_net(features)
    
    # Predictions
    pred_classes = tf.argmax(logits, axis=1)
    pred_probas = tf.nn.softmax(logits)
    
    # If prediction mode, early return
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions=pred_classes) 
        
    # Define loss and optimizer
    loss_op = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=tf.cast(labels, dtype=tf.int32)))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    train_op = optimizer.minimize(loss_op, global_step=tf.train.get_global_step())
    
    # Evaluate the accuracy of the model
    acc_op = tf.metrics.accuracy(labels=labels, predictions=pred_classes)
    
    # TF Estimators require the model function to return an EstimatorSpec that
    # specifies the ops for training, evaluating, ...
    estim_specs = tf.estimator.EstimatorSpec(
      mode=mode,
      predictions=pred_classes,
      loss=loss_op,
      train_op=train_op,
      eval_metric_ops={'accuracy': acc_op})

    return estim_specs
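
The model function above derives three quantities from the logits: the predicted class (argmax), the class probabilities (softmax), and the loss (sparse softmax cross-entropy, averaged over the batch). For a single example these reduce to the arithmetic below. This is an illustrative sketch in plain Python with made-up logits for three classes, not the TensorFlow ops themselves.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(logits, label):
    """Negative log-probability of the true class, given an integer label."""
    return -math.log(softmax(logits)[label])

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probas = softmax(logits)
pred_class = max(range(len(logits)), key=logits.__getitem__)  # argmax
loss = sparse_cross_entropy(logits, label=0)
```

The label is an integer class index rather than a one-hot vector, which is why the MNIST data was loaded with `one_hot=False`.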

In [None]:
# Build the Estimator
model = tf.estimator.Estimator(model_fn)

In [None]:
# Train the Model
model.train(input_fn, steps=num_steps)

In [None]:
# Evaluate the Model
# Define the input function for evaluating
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'images': mnist.test.images}, y=mnist.test.labels,
    batch_size=batch_size, shuffle=False)
# Use the Estimator 'evaluate' method
model.evaluate(input_fn)

In [None]:
# Predict single images
n_images = 4
# Get images from test set
test_images = mnist.test.images[:n_images]
# Prepare the input data
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'images': test_images}, shuffle=False)
# Use the model to predict the class of each image
preds = list(model.predict(input_fn))

# Display
for i in range(n_images):
    plt.imshow(np.reshape(test_images[i], [28, 28]), cmap='gray')
    plt.show()
    print("Model prediction:", preds[i])

# Working with Spark and machine learning libraries

When running an application you can either establish a Spark session locally in the notebook pod, or point it to a remote Spark cluster running in OpenShift or somewhere else externally accessible.  For this tutorial, each data scientist is given their own Spark environment running in OpenShift.

We'll start by creating a Spark session and loading the Hadoop and AWS packages that Spark needs to reach Ceph. The example that follows builds a model for detecting sentiment in text.

In [None]:
import os
import pyspark

from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext

#Add the necessary Hadoop and AWS jars to access Ceph from Spark
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4 pyspark-shell'

spark = SparkSession.builder.master('local[3]').getOrCreate()

**Set the parameters for connecting Spark to Ceph**

In [None]:
hadoopConf=spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.endpoint", os.environ['S3_ENDPOINT_URL'])
hadoopConf.set("fs.s3a.access.key", os.environ['S3_ACCESS_KEY'])
hadoopConf.set("fs.s3a.secret.key", os.environ['S3_SECRET_KEY'])
hadoopConf.set("fs.s3a.path.style.access", "true")
hadoopConf.set("fs.s3a.connection.ssl.enabled", "false")
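
The settings above hard-code `fs.s3a.connection.ssl.enabled` to `false`, which fits an `http://` endpoint. As a sketch, the same options could be assembled in one place with the SSL flag derived from the endpoint URL. The helper name `s3a_options` is hypothetical; the property names are the ones set above.

```python
from urllib.parse import urlparse

def s3a_options(endpoint_url, access_key, secret_key):
    """Build the fs.s3a.* settings used above, deriving the SSL flag from the URL."""
    use_ssl = urlparse(endpoint_url).scheme == "https"
    return {
        "fs.s3a.endpoint": endpoint_url,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Path-style access puts the bucket name in the URL path, not the hostname.
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": "true" if use_ssl else "false",
    }
```

The result could then be applied with `for key, value in s3a_options(...).items(): hadoopConf.set(key, value)`.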

**Run a basic Spark command to test out the connection to Spark.**

In [None]:
import socket
spark.range(5, numPartitions=5).rdd.map(lambda x: socket.gethostname()).distinct().collect()

**Read the contents of the file uploaded to your Ceph bucket a few steps earlier using Spark and display it.**

In [None]:
df0 = spark.read.text("s3a://" + os.environ['ATTENDEE_ID'] + "/object")
df0.show()

**Upload a sample data set to use for training the sentiment analysis model.**

In [None]:
# Install the wget library, used to download the sample data set
!pip install wget

import wget
import boto3

s3 = boto3.client('s3','us-east-1', endpoint_url= os.environ['S3_ENDPOINT_URL'],
                       aws_access_key_id = os.environ['S3_ACCESS_KEY'],
                       aws_secret_access_key = os.environ['S3_SECRET_KEY'])

#upload the text file to Ceph
url = "https://gitlab.com/opendatahub/opendatahub-operator/raw/master/tutorials/basic_workshop_tutorial/sample_text_data.tsv?inline=false"
filename = wget.download(url=url, out='sample_text_data.tsv')
s3.upload_file(filename, os.environ['ATTENDEE_ID'], "sample_text_data.tsv")

__Access the data using Spark__

In [None]:
feedbackFile = spark.read.csv("s3a://" + os.environ['ATTENDEE_ID'] + "/sample_text_data.tsv",sep="\t", header=True)

__Convert the data to a Pandas data frame__

In [None]:
import re

import pandas as pd
import matplotlib.pyplot as plt

df = feedbackFile.toPandas()

df.head()

# Visualize the data

__Types of trip outcomes by field representatives__

In [None]:
import numpy as np
np.random.seed(sum(map(ord, "categorical")))

from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)

outcome_dict = {'Successful':0,'Partial Success':1,'Unsuccessful':2 }

# Copy the two columns so the new column is added to an independent frame, not a view of df
df_vis = df[['Your Name', 'Outcome']].copy()
df_vis['outcome_numeric'] = df_vis['Outcome'].apply(lambda a:outcome_dict[a])



outcome_cross_table = pd.crosstab(index=df_vis["Your Name"], 
                          columns=df_vis["Outcome"])


outcome_cross_table.plot(kind="bar", 
                 figsize=(16,12),
                 stacked=True,fontsize=12)
plt.show();
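
The cross table plotted above is a count of rows for every (name, outcome) pair. The sketch below reproduces that counting in plain Python on a few made-up records, to show what the stacked bars represent; the names and outcomes in `records` are hypothetical sample values.

```python
from collections import Counter

# Hypothetical (name, outcome) pairs standing in for the two DataFrame columns.
records = [
    ("Alex", "Successful"),
    ("Alex", "Unsuccessful"),
    ("Alex", "Successful"),
    ("Sam", "Partial Success"),
]

counts = Counter(records)  # (name, outcome) -> number of rows
names = sorted({name for name, _ in records})
outcomes = sorted({outcome for _, outcome in records})

# One row per name, one column per outcome, zero where a pair never occurs.
table = {name: [counts[(name, outcome)] for outcome in outcomes] for name in names}
```

Each row of `table` corresponds to one bar in the chart, and each entry in the row to one coloured segment of that bar.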

__Types of outcomes by event type__

In [None]:
event_type_cross_table = pd.crosstab(index=df["Primary Audience Engaged"], 
                          columns=df["Outcome"])

event_type_cross_table.plot(kind="bar", 
                 figsize=(16,12),
                 stacked=True,fontsize=12)
plt.show();

# Now convert "Highlights" data to prepare for training the sentiment analysis model

In [None]:
df['Highlights'] = df['Highlights'].astype(str)

df[['Highlights','Outcome']].head(20)

In [None]:
df_outcome = df[['Highlights','Outcome']]

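# Concatenate all highlights that share an outcome into one string per outcome
grouped_highlights = pd.DataFrame(df_outcome.groupby('Outcome')['Highlights'].apply(lambda x: ' '.join(x)))

# Index.get_values() is no longer available in current pandas; list(index) works on every version
grouped_highlights['Outcome'] = list(grouped_highlights.index)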
grouped_highlights.reset_index(drop=True, inplace=True)

grouped_highlights['Highlights'] = grouped_highlights['Highlights'].astype(str)

df['Highlights'] = df['Highlights'].apply(lambda a: a.lower())

df_success = df[df['Outcome'] == 'Successful']
df_unsuccess = df[df['Outcome'] == 'Unsuccessful']
df_part_success = df[df['Outcome'] == 'Partial Success']

__Import additional Machine Learning libraries__

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical

__Separating train and test data. Taking successful and unsuccessful separately__

In [None]:
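# DataFrame.append is no longer available in current pandas; pd.concat is the equivalent.
# Treat partial successes as failures so the model has two classes.
df_failure = pd.concat([df_part_success, df_unsuccess], ignore_index=True)

df_failure['Outcome'] = 'Unsuccessful'

test_hold_out = 0.1

# Success: hold out the last 10% of rows for testing
n_test_success = int(test_hold_out * len(df_success))
train = df_success[ : -n_test_success]
test = df_success[-n_test_success : ]

# Failure: same split, added to the success rows
n_test_failure = int(test_hold_out * len(df_failure))
train = pd.concat([train, df_failure[ : -n_test_failure]])
test = pd.concat([test, df_failure[-n_test_failure : ]])

# Shuffle the training rows, then tag each row with the set it belongs to
train = train.sample(frac = 1)
train['type'] = "Train"
test['type'] = "Test"

# Stack the test rows after the training rows in a single frame
train = pd.concat([train, test])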

train.reset_index(drop=True,inplace=True)

Y = pd.get_dummies(train['Outcome']).values

test_index_list = list(train[train['type'] == 'Test'].index)

test_index_list
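
One caveat about the tail split in the cell above: `int(test_hold_out * len(...))` is 0 for a class with fewer than ten rows, and because `-0` equals `0`, the slice `[:-0]` is then empty while `[-0:]` is the whole frame. The sketch below shows one way to guard against that on plain lists; the same slicing applies to DataFrames. The helper name `split_tail` is hypothetical, and always holding out at least one row is an assumption about the desired behaviour.

```python
def split_tail(rows, hold_out):
    """Split `rows` into (train, test), holding out the last `hold_out` fraction.

    At least one row is always held out, so the slice bounds are never -0.
    """
    n_test = max(1, int(hold_out * len(rows)))
    return rows[:-n_test], rows[-n_test:]

train_rows, test_rows = split_tail(list(range(25)), hold_out=0.1)
small_train, small_test = split_tail(list(range(5)), hold_out=0.1)
```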

# Use the HIGHLIGHTS field for sentiment analysis

__max_features__ = vocabulary size; it is a hyperparameter (spelled 'max_fatures' in the code below).

The Tokenizer turns text into vectors. It works like a dictionary over the whole vocabulary: each text becomes a list of integers, where every integer is the index of a word in that dictionary.

In [None]:
max_fatures = 10000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(train['Highlights'].values)
X_highlights = tokenizer.texts_to_sequences(train['Highlights'].values)
X_highlights = pad_sequences(X_highlights)
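
To make the two steps above concrete, here is a simplified sketch in plain Python of what the tokenizer and the padding do: words are numbered from 1 in order of frequency, each text becomes a list of those numbers, and shorter lists are padded on the left with zeros. It is an illustration of the idea only. The real Keras classes also filter punctuation and offer more options, and the function names below are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace words by their index, dropping unknown and out-of-vocabulary words."""
    sequences = []
    for text in texts:
        ids = [word_index.get(word) for word in text.lower().split()]
        sequences.append(
            [i for i in ids if i is not None and (num_words is None or i < num_words)]
        )
    return sequences

def pad_left(sequences, value=0):
    """Left-pad every sequence with `value` to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [[value] * (width - len(seq)) + seq for seq in sequences]

texts = ["great demo great team", "demo cancelled"]  # made-up sample highlights
word_index = fit_word_index(texts)
padded = pad_left(texts_to_sequences(texts, word_index))
```

Index 0 is reserved for padding, which is why the word numbering starts at 1.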

__Creating the network layer by layer__

The first layer is a word embedding layer, the second is an LSTM-based RNN, and the third is a dense layer with softmax activation, because the outcome is categorical.

In [None]:
embed_dim = 128
lstm_out = 196

model = Sequential()
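# Keras 2 no longer supports the 'dropout' argument on Embedding, so it is omitted here
model.add(Embedding(max_fatures, embed_dim,input_length = X_highlights.shape[1]))
# Keras 2 names: dropout (formerly dropout_W) and recurrent_dropout (formerly dropout_U)
model.add(LSTM(lstm_out, dropout=0.1, recurrent_dropout=0.1))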
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())
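
The summary printed above lists a parameter count for each layer. Those counts follow directly from the layer sizes, as the arithmetic sketch below shows using the standard formulas for an embedding, an LSTM, and a dense layer. The variable names are local to this sketch; `num_classes = 2` matches the two output units of the dense layer.

```python
max_features = 10000  # vocabulary size (max_fatures in the cell above)
embed_dim = 128
lstm_out = 196
num_classes = 2

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = max_features * embed_dim

# LSTM: four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (embed_dim * lstm_out + lstm_out * lstm_out + lstm_out)

# Dense: one weight per (LSTM unit, class) pair plus one bias per class.
dense_params = lstm_out * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
```

The embedding layer accounts for 1,280,000 of the 1,535,194 parameters, so the vocabulary size is the hyperparameter with the largest effect on model size.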

__Separating train and test data__

In [None]:
X_highlights_train = X_highlights[0:test_index_list[0]]
Y_highlights_train = Y[0:test_index_list[0]]

X_highlights_test = X_highlights[test_index_list[0]:]
Y_highlights_test = Y[test_index_list[0]:]

__Running the model__

In [None]:
batch_size = 20
model.fit(X_highlights_train, Y_highlights_train, epochs = 10, batch_size=batch_size, verbose = 2)

__Printing test data accuracy__

In [None]:
score,accuracy = model.evaluate(X_highlights_test, Y_highlights_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("accuracy: %.2f" % (accuracy))

# Store the model, tokenizer and feature dimension in Ceph for later use

In [None]:
model.save("./model")

import pickle

with open('./tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

feature_dimension = X_highlights_train.shape[1]
with open('./feature_dimension.pickle', 'wb') as handle:
    pickle.dump(feature_dimension, handle, protocol=pickle.HIGHEST_PROTOCOL)
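
Reading the pickled objects back later is the mirror image of the cell above. The sketch below shows a round trip with a stand-in value; the helper names `save_pickle` and `load_pickle` and the value 42 are hypothetical, and a temporary directory is used so the example leaves no files behind.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write `obj` to `path` with the highest available pickle protocol."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    """Read an object back; only unpickle files you created yourself,
    because loading a pickle can execute arbitrary code."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Round trip with a stand-in value; the notebook stores the tokenizer and an int.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature_dimension.pickle")
    save_pickle(42, path)
    restored = load_pickle(path)
```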

In [None]:
import boto3

# Create a boto3 session with the Ceph credentials
session = boto3.Session(
    aws_access_key_id = os.environ['S3_ACCESS_KEY'],
    aws_secret_access_key = os.environ['S3_SECRET_KEY']
)

# verify=False disables TLS certificate verification for the endpoint
s3 = session.resource('s3', endpoint_url=os.environ['S3_ENDPOINT_URL'], verify=False)

# Upload the model to S3
s3.meta.client.upload_file('./model', os.environ['ATTENDEE_ID'], 'models/trip_report_model')

# Upload the tokenizer to S3
s3.meta.client.upload_file('./tokenizer.pickle', os.environ['ATTENDEE_ID'], 'models/trip_report_tokenizer.pickle')

# Upload the feature dimension to S3
s3.meta.client.upload_file('./feature_dimension.pickle', os.environ['ATTENDEE_ID'], 'models/trip_report_feature_dimension.pickle')
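
To use the model in another notebook, the three objects can be fetched again with the client's `download_file` method. The following is a sketch under the assumption that the keys are exactly those uploaded above; the helper name `download_artifacts` and the constant `ARTIFACT_KEYS` are hypothetical, and `client` is expected to be a boto3 S3 client such as `s3.meta.client`.

```python
import os

# Object keys written by the upload cell above.
ARTIFACT_KEYS = (
    "models/trip_report_model",
    "models/trip_report_tokenizer.pickle",
    "models/trip_report_feature_dimension.pickle",
)

def download_artifacts(client, bucket, dest_dir):
    """Download the three uploaded objects into dest_dir and return the local paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for key in ARTIFACT_KEYS:
        local_path = os.path.join(dest_dir, os.path.basename(key))
        client.download_file(bucket, key, local_path)  # arguments: bucket, key, filename
        paths.append(local_path)
    return paths
```

A call such as `download_artifacts(s3.meta.client, os.environ['ATTENDEE_ID'], './downloaded')` would place the files in a local folder.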

The model has been saved to Ceph as binary objects and can be viewed or used at a later time.  You should see three model files from the above step now stored in Ceph.

In [None]:
s3 = boto3.client('s3','us-east-1', endpoint_url= os.environ['S3_ENDPOINT_URL'],
                       aws_access_key_id = os.environ['S3_ACCESS_KEY'],
                       aws_secret_access_key = os.environ['S3_SECRET_KEY'])

for key in s3.list_objects(Bucket=os.environ['ATTENDEE_ID'], Prefix='models/')['Contents']:
    print(key['Key'])

### Thank you for participating in the Open Data Hub workshop!  For more information on the project or how to contribute check out [OpenDataHub.io](https://opendatahub.io)