Analytics Reference Architecture

Big data analytics (BDA) and cloud computing are top priorities for CIOs. Harnessing the value and power of big data and cloud computing can give your company a competitive advantage, spark new innovations, and increase revenue. As cloud computing and big data technologies converge, they offer a cost-effective delivery model for cloud-based analytics.

This project provides a reference implementation for building and running analytics applications deployed on a hybrid cloud environment. We focus on offering tools and practices for data scientists to work efficiently and for IT architects to design hybrid analytics solutions.

For an explanation of the components involved in this architecture, see the Architecture Center - Analytics Architecture article.

Table of Contents

  • Data Science Introduction
  • Challenges
  • Methodology
  • Solution Overview
  • Repositories
  • Build and run
  • DevOps
  • Compendium
  • Contribute

Data Science Introduction

The goal of data science is to infer from data actionable insights that improve business execution. The main stakeholders are business users and upper managers who want to improve the key metrics and indicators that track their business goals and objectives. Data scientists have to work closely with business users and be able to explain and present findings clearly, with good visualizations that are pertinent to those users.

Data science falls into these three categories:

Descriptive analytics

This is likely the most common type of analytics, leveraged to create dashboards and reports. It describes and summarizes events that have already occurred. For example, think of a grocery store owner who wants to know how many items of each product were sold in all stores within a region in the last five years.
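
As an illustration, the owner's question maps to a simple aggregation. The sketch below uses pandas with hypothetical file and column names; it is not part of this repository.

```python
# Minimal descriptive-analytics sketch: units sold per store and product
# in one region over the last five years (hypothetical columns: store,
# region, product, quantity, date).
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["date"])  # hypothetical extract

cutoff = sales["date"].max() - pd.DateOffset(years=5)
recent = sales[(sales["region"] == "Northeast") & (sales["date"] >= cutoff)]

units_sold = (recent.groupby(["store", "product"])["quantity"]
                    .sum()
                    .reset_index(name="units_sold"))
print(units_sold.head())
```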

Predictive analytics

This is all about using mathematical and statistical methods to forecast future outcomes. The grocery store owner wants to understand how many products could potentially be sold in the next couple of months so that he can make a decision on inventory levels.
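
To make the forecasting idea concrete, here is a minimal sketch that fits a linear trend to a toy monthly sales series with scikit-learn; a production model would also account for seasonality, trend breaks, and promotions.

```python
# Minimal predictive-analytics sketch: extrapolate monthly unit sales two
# months ahead with a linear regression (toy data, not from this project).
import numpy as np
from sklearn.linear_model import LinearRegression

monthly_units = np.array([120, 135, 128, 150, 161, 158, 170, 182])
months = np.arange(len(monthly_units)).reshape(-1, 1)

model = LinearRegression().fit(months, monthly_units)

# Forecast the next two months to inform inventory decisions.
future = np.array([[len(monthly_units)], [len(monthly_units) + 1]])
print(model.predict(future))
```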

Prescriptive analytics

Prescriptive analytics is used to optimize business decisions by simulating scenarios based on a set of constraints. The grocery store owner wants to create a staffing schedule for his employees, but to do so he has to account for factors like availability, vacation time, number of working hours, potential emergencies, and so on (the constraints) and create a schedule that works for everyone while ensuring that his business is able to function on a day-to-day basis.
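
Scheduling under constraints is a classic optimization problem. The toy sketch below phrases a stripped-down version as a linear program with scipy, using made-up wages and availabilities; a real staffing model would carry many more constraints.

```python
# Minimal prescriptive-analytics sketch: assign weekly hours to three
# employees at minimum payroll cost while covering the store's needs.
import numpy as np
from scipy.optimize import linprog

wage = np.array([15.0, 17.0, 14.0])         # hourly cost per employee (toy)
availability = [(0, 30), (0, 40), (0, 20)]  # min/max hours each can work
required_hours = 60                         # total coverage needed

# linprog minimizes wage @ hours subject to A_ub @ hours <= b_ub, so the
# coverage constraint sum(hours) >= required_hours is negated.
result = linprog(c=wage,
                 A_ub=-np.ones((1, 3)), b_ub=[-required_hours],
                 bounds=availability)
print(result.x)  # optimal hours per employee
```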

Some concepts

  • supervised learning: learn a model from labeled training data that allows us to make predictions about unseen or future data. During training we give the algorithm a dataset with the right answers (y), and we validate the model's accuracy with a test dataset that also has right answers; a dataset therefore needs to be split into training and test sets (see the sketch after this list).
  • unsupervised learning: given a dataset without labels, try to find tendencies or structure in the data using techniques like clustering.
  • a classification problem is one where we try to predict one of a small number of discrete-valued outputs
  • a regression problem is one where the goal is to predict a continuous-valued output
  • a feature is an attribute of the data used as input for classification or regression
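
The following minimal sketch shows the supervised-learning loop just described, using scikit-learn's bundled Iris dataset: split the labeled data into training and test sets, fit a classifier, and measure accuracy on the held-out data.

```python
# Minimal supervised-learning sketch: train/test split plus a classifier.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # features X, right answers y

# Hold out a test set so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```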

Challenges

There is a set of standard challenges in developing an IT solution that integrates results from an analytics model. We list some that we want to address, document, and support as requirements.

  • Are we considering a scoring service or a classification one?
  • Is it batch processing to update static records, or real-time processing on streaming or transactional data?
  • How do we control resource allocation for machine learning jobs?
  • How do we manage consistency between model, data, and code (version management)?
  • How do we assess the features needed for the training and test sets?
  • How do we leverage real-time cognitive / deep learning classification inside a scoring service?

Methodology

Combining the development of analytics and machine learning with traditional software development involves adapting the agile iterative methodology. At IBM we use design thinking and a lean approach for developing innovative business applications; the garage method explains this approach. To support AI and analytics, the method needs to be extended to focus on data and data science. In this article we cover the activities specific to analytics.

Algorithm selection

With the application at https://samrose3.github.io/algorithm-explorer you can assess which algorithm to use to address a specific problem.

Solution Overview

The solution needs to cover the following capabilities:

  • Develop the model with Data Science eXperience (DSX) using static data loaded in DSX or in Db2 Warehouse
  • Move data from a static database to Db2 Warehouse (a generic sketch follows this list)
  • Integrate with different data sources in real time to get values for the selected features
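
As a rough illustration of the data-movement capability, the sketch below streams a table between two databases with pandas and SQLAlchemy. The connection URLs, table name, and the ibm_db_sa dialect are assumptions; the repository documents a different, Db2-native approach in the Build and run section.

```python
# Generic bulk-copy sketch between two relational stores; placeholders only,
# assuming the ibm_db_sa SQLAlchemy dialect is installed for Db2.
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("db2+ibm_db://user:pwd@onprem-host:50000/SRCDB")
target = create_engine("db2+ibm_db://user:pwd@icp-host:50000/WAREHOUSE")

# Stream the table in chunks to keep memory bounded for large datasets.
for chunk in pd.read_sql_table("CUSTOMERS", source, chunksize=10_000):
    chunk.to_sql("CUSTOMERS", target, if_exists="append", index=False)
```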

The system context may look like the following figure:

  • The data scientists use IBM Data Science eXperience to collaborate on models. The data they work on comes from Db2 Warehouse.
  • The finalized model is implemented and deployed as a scoring service using a microservice approach (sketched after the next paragraph).
  • Data may come from different sources: the customer's internal database, a SaaS CRM application, a tone analyzer service, and tweet classifications.
  • A front-end application is developed to interact with the user and to consume the scoring service to assess the risk of churn.
  • The front-end application uses a product catalog managed by an MDM and can interact with the customer database via APIs defined in API management and supported by a data access layer microservice.

The proposed allocation of the components in this system context is still open, but we want to represent a hybrid deployment and use IBM Cloud Private to leverage cloud development practices for any new microservices and web application deployments.
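
To make the scoring-service idea concrete, here is a minimal sketch of how a churn model might be exposed as a REST microservice using Flask and a pickled scikit-learn model. The model file, route, and payload shape are assumptions; the actual back end in this project is built with JAX-RS on Liberty (see Build and run).

```python
# Illustrative churn-scoring microservice; not the project's JAX-RS service.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("churn_model.pkl", "rb") as f:  # hypothetical serialized model
    model = pickle.load(f)

@app.route("/score", methods=["POST"])
def score():
    # Expect a JSON payload holding the feature values chosen during training.
    features = request.get_json()["features"]
    risk = model.predict_proba([features])[0][1]  # probability of churn
    return jsonify({"churn_risk": float(risk)})

if __name__ == "__main__":
    app.run(port=8080)
```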

Repositories

The following repositories are part of the reference architecture implementation.

Build and run

For Data Science

For customer manager

The user interface is packaged as part of the Case Portal application, and we documented how to add the specific customer-management UI inside the portal. The back-end component is a microservice developed with JAX-RS and packaged with Liberty server as a Docker image.

For Db2 Warehouse data ingestion

To install Db2 Warehouse on IBM Cloud Private, read this article. For moving data from Db2 running on-premises to Db2 Warehouse on ICP there are different approaches; we documented one approach based on Db2 Warehouse's out-of-the-box capabilities.

For taxi scenario

The taxi forecast scenario is based on public data about New York City taxi usage and illustrates how a data scientist responds to a manager's request with a quick turnaround time for data analysis, using IBM data management and Jupyter notebooks.
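
To give a flavor of the kind of notebook analysis this scenario involves, the sketch below aggregates a hypothetical CSV extract of trip records into daily counts with pandas; the file and column names are assumptions, and the actual notebooks live under notebooks/.

```python
# Notebook-style sketch: daily NYC taxi trip counts as a forecasting input.
import pandas as pd

trips = pd.read_csv("nyc_taxi_trips.csv", parse_dates=["pickup_datetime"])

# Aggregate to daily trip counts, the starting point for usage forecasting.
daily = (trips.set_index("pickup_datetime")
              .resample("D")
              .size()
              .rename("trip_count"))
print(daily.describe())
```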

DevOps

Continuous integration

Deployments

Service management

Compendium

Contribute

We welcome your contributions. There are multiple ways to contribute: report bugs and improvement suggestions, improve documentation, and contribute code. We really value contributions, and to maximize the impact of code contributions we request that any contribution follow these guidelines:

  • Please ensure you follow the coding standard and code formatting used throughout the existing code base
  • All new features must be accompanied by associated tests
  • Make sure all tests pass locally before submitting a pull request
  • New pull requests should be created against the integration branch of the repository. This ensures new code is included in full stack integration tests before being merged into the master branch.
  • One feature / bug fix / documentation update per pull request
  • Include tests with every feature enhancement, improve tests with every bug fix
  • One commit per pull request (squash your commits)
  • Always pull the latest changes from upstream and rebase before creating a pull request.

If you want to contribute, start by forking this repository, then clone your fork to your local workstation for development. Add the upstream repository to keep your fork synchronized with the master branch.

Please contact me for any questions.

Authors:

  • Jerome Boyer - IBM
  • Rob Wilson - IBM
  • Sunil Dube - IBM
  • Sandra Tucker - IBM
  • Amaresh Rajasekharan - IBM
  • Zach Silverstein - IBM
  • John Marting - IBM