# Module 10 Part 2: Tools for Moderately Sized Datasets

# Introduction

This notebook introduces several popular tools for analysis of moderately sized datasets. These tools are easy to use, mature in their capabilities, and can be used when your dataset is small enough to not require distributed processing.

This module consists of 3 parts:

- **Part 1** - Databases and SQL Basics
- **Part 2** - Tools for Moderately Sized Datasets
- **Part 3** - Data Privacy and Security

Each part is provided in a separate notebook file. The notebooks can be reviewed in any order.

# Learning Outcomes

In this workbook, you will develop familiarity with a sampling of tools that can be used to quickly develop models for moderately sized datasets. The tools that you will explore are:

- MicroStrategy
- Weka
- Tableau
- SAS 

# Readings and Resources

Detailed information about the tools covered in this notebook can be found on the web. Here are the links:

- http://www.microstrategy.com


- http://www.cs.waikato.ac.nz/ml/weka/


- http://www.tableau.com


- http://www.sas.com

<h1>Table of Contents<span class="tocSkip"></span></h1>
<br>
<div class="toc">
<ul class="toc-item">
<li><span><a href="#Module-10-Part-2:-Tools-for-Moderately-Sized-Datasets" data-toc-modified-id="Module-10-Part-2:-Tools-for-Moderately-Sized-Datasets">Module 10 Part 2: Tools for Moderately Sized Datasets</a></span>
</li>
<li><span><a href="#Introduction" data-toc-modified-id="Introduction">Introduction</a></span>
</li>
<li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes">Learning Outcomes</a></span>
</li>
<li><span><a href="#Readings-and-Resources" data-toc-modified-id="Readings-and-Resources">Readings and Resources</a></span>
</li>
<li><span><a href="#Table-of-Contents" data-toc-modified-id="Table-of-Contents">Table of Contents</a></span>
<ul class="toc-item">
<li><span><a href="#Moderately-Sized-Datasets" data-toc-modified-id="Moderately-Sized-Datasets">Moderately Sized Datasets</a></span>
</li>
<li><span><a href="#MicroStrategy" data-toc-modified-id="MicroStrategy">MicroStrategy</a></span>
</li>
<li><span><a href="#Weka" data-toc-modified-id="Weka">Weka</a></span>
</li>
<li><span><a href="#Tableau" data-toc-modified-id="Tableau">Tableau</a></span>
</li>
<li><span><a href="#SAS" data-toc-modified-id="SAS">SAS</a></span>
</li>
<li><span><a href="#Other-Tools" data-toc-modified-id="Other-Tools">Other Tools</a></span>
</li>
</ul>
</li>
<li><span><a href="#References" data-toc-modified-id="References">References</a></span>
</li>
</ul>
</div>

## Moderately Sized Datasets
   
In this module, a **moderately sized dataset** is one for which either of the following holds:

* The dataset fits in the physical memory of a single desktop computer.
* The dataset can be loaded, manipulated and processed as a stream, without chunking it or needing the power of distributed processing.

**Chunking** refers to breaking up a large dataset into smaller parts where each part is called a chunk. For large data, we often chunk the data and process each chunk in a batch that will fit in memory.
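
To make the idea of chunking concrete, here is a minimal sketch using only the Python standard library. The helper `iter_chunks`, the column names and the tiny in-memory CSV are all hypothetical and chosen for illustration; in practice the input would be a large file on disk.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows from an iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical data standing in for a file too large to load at once.
sample_csv = io.StringIO("id,amount\n1,10.0\n2,20.0\n3,30.0\n4,40.0\n5,50.0\n")
reader = csv.DictReader(sample_csv)

total = 0.0
chunk_sizes = []
for chunk in iter_chunks(reader, chunk_size=2):
    # Only the current chunk is held in memory at any time.
    chunk_sizes.append(len(chunk))
    total += sum(float(row["amount"]) for row in chunk)

print(chunk_sizes, total)
```

Each batch is processed and then discarded, so memory use depends on the chunk size rather than on the size of the whole dataset. A moderately sized dataset, by the definition above, does not need this extra step.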

In other words, these datasets are *small*, not *big* data. Although many of these tools were developed prior to the dawn of big data, many now can read or write data to Hadoop and other big data datastores for highly interactive analysis of subsets of larger datasets.

Tools for working with such datasets are often referred to as **business intelligence** (BI) tools. The figure below, from Gartner's 2021 report on BI platforms, maps some of the most popular ones.

![Figure_1_Magic_Quadrant_for_Analytics_and_Business_Intelligence_Platforms.png](attachment:Figure_1_Magic_Quadrant_for_Analytics_and_Business_Intelligence_Platforms.png)


**Source**: https://www.qlik.com/us/lp/sem/gartner-magic-quadrant-2021

## MicroStrategy

**MicroStrategy** was founded in 1989 by MIT alumni Michael J. Saylor and Sanju Bansal with the focus on developing data mining software for businesses. The MicroStrategy platform supports interactive dashboards, scorecards, highly formatted reports, ad hoc queries, thresholds and alerts, and automated report distribution.

There are three main products:

1. **MicroStrategy Analytics**: Provides tools for business intelligence and predictive analytics to search through and analyze big data from a variety of sources, including data warehouses, Excel files, and Apache Hadoop distributions.<br><br>

2. **MicroStrategy Mobile**: Introduced in 2010, this is a software platform integrating analytics capabilities into apps for mobile devices. It allows easier access without needing to reformat the data for different platforms.<br><br>

3. **Usher**: Usher is a digital credentials and identity intelligence product that provides a secure way for organizations to control digital and physical access. It replaces physical badges and passwords with secure digital badges and generates information on user behavior and resource usage.

The MicroStrategy Analytics Platform has the following features:

- Dashboards and visualizations
- Support for mobile apps
- Library of 300+ OLAP, mathematical, financial and data mining functions
- Out-of-the-box integration with R and PMML (an XML-based predictive model interchange format) 
- Connectors for a wide variety of datastores, including Hadoop, Cassandra and other big data sources
- Secure Cloud


The MicroStrategy website can be found here: http://www.microstrategy.com.

## Weka

**Weka** is a free and open source point-and-click machine learning tool. It was developed at the University of Waikato, New Zealand, and stands for **Waikato Environment for Knowledge Analysis**. Weka is a standalone desktop tool and comes equipped with an array of popular ML algorithms. It also provides a few standard datasets to play around with.

Weka is written in Java and runs on Windows, macOS, and Linux operating systems. The toolset contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

Weka supports:

- Over 100 classification algorithms
- 20 algorithms for clustering, finding association rules, etc. 
- 75 data cleansing and preparation methods 
- 25 feature selection methods

Here are some helpful resources for learning more about Weka:

- More about the Weka Project can be found here: http://www.cs.waikato.ac.nz/ml/weka/ 


- Weka also has a companion book that provides an easy-to-read introduction to data mining: https://www.amazon.ca/Data-Mining-Practical-Learning-Techniques/dp/0128042915/


- If you wish, you may download the software from here: http://www.cs.waikato.ac.nz/ml/weka/downloading.html


- Weka packages can be downloaded at: http://weka.sourceforge.net/packageMetaData/

Here are some screenshots:

![7_3.png](attachment:7_3.png)

**Source**: http://www.cs.waikato.ac.nz/~ml/weka/gui_explorer.html

The screenshots above show different panels of the Weka software for classification, clustering and visualizing.  

## Tableau 

Tableau was founded "with a simple mission: help people see and understand their data" (Tableau Software, 2019).

It is a very popular enterprise tool for data visualization. Unlike basic plotting software, Tableau can read data from a wide variety of sources, including relational databases and flat files. You can also run queries in the background and feed in aggregations or result sets from big data tools to visualize data extracts and build reports. For example, some organizations use Tableau to report on extract, transform and load (ETL) processes that move data between systems: you specify a query, the Tableau dashboard runs it, and the result is used to plot a graph of progress or results. However, for performance reasons, Tableau should not be used as an ETL tool itself.
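
As a rough illustration of the kind of aggregation query a dashboard might run, the sketch below uses Python's built-in `sqlite3` module. The `etl_runs` table, its columns and its rows are hypothetical; this is not Tableau's API, only an example of reducing detailed records to a small result set suitable for plotting.

```python
import sqlite3

# In-memory database standing in for an operational datastore.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE etl_runs (job TEXT, run_date TEXT, status TEXT, rows_loaded INTEGER)"
)
conn.executemany(
    "INSERT INTO etl_runs VALUES (?, ?, ?, ?)",
    [
        ("orders", "2021-03-01", "success", 1200),
        ("orders", "2021-03-02", "failed", 0),
        ("customers", "2021-03-01", "success", 300),
        ("customers", "2021-03-02", "success", 320),
    ],
)

# Aggregate the detailed run log down to one row per job.
summary = conn.execute(
    """
    SELECT job,
           COUNT(*) AS runs,
           SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) AS successes,
           SUM(rows_loaded) AS total_rows
    FROM etl_runs
    GROUP BY job
    ORDER BY job
    """
).fetchall()
conn.close()

for row in summary:
    print(row)
```

The visualization tool only ever sees the summarized rows; the heavy lifting stays in the database.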

Tableau can be a very powerful tool for slicing multidimensional datasets to look at and interpret different cuts. Tableau is meant for enterprises, and licenses can be expensive for a single user. Microsoft Power BI is comparable to Tableau in terms of look and feel. However, Tableau provides additional capabilities, such as Tableau Server, where reports can be hosted and updated regularly as a production service.

## SAS

Development of SAS began in the late 1960s at North Carolina State University. Originally intended for analyzing agricultural data to improve crop yields, it has since been used in many other areas where statistical analysis is needed. The name SAS originates from "Statistical Analysis System".

Today, SAS is a software suite used for advanced analytics, data management and predictive analytics, particularly in financial services, though its popularity has waned somewhat with the rise of open source tools such as the Python library pandas.

Here is the company website: https://www.sas.com/en_ca/home.html.

The SAS product suite includes more than 200 components and solutions, including:

- SAS/STAT
- SAS Analytics Pro
- SAS Customer Intelligence 360
- SAS Enterprise Miner
- SAS Data Management
- SAS Visual Analytics
- SAS Analytics for IoT
- SAS Text Miner
- SAS Viya, a new, open architecture built for analytics innovation

Here is a complete list of all products: https://www.sas.com/en_ca/software/all-products.html.

The SAS toolset is highly programmable and has its own programming language for writing complex queries and new algorithms, although the language syntax is quite dated. A SAS program can contain a DATA step, a PROC step, or any combination. 

A **DATA** step allows you to manage and manipulate your data. You typically use a DATA step to read data from an input source, process it, and create a SAS table. With DATA steps, you can:

- Put your data into a SAS table
- Compute the values for new variables
- Check for and correct errors in your data
- Produce new SAS datasets by subsetting, merging, and updating existing datasets
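
For readers more familiar with Python, the sketch below shows roughly equivalent operations in pandas. It is an analogy rather than SAS code, and the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical input; a DATA step would typically read this from a file or table.
raw = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cara", "Dev"],
        "height_cm": [165.0, 180.0, -1.0, 172.0],  # -1 marks a data entry error
        "weight_kg": [60.0, 82.0, 55.0, 70.0],
    }
)

# Check for and correct errors: treat non-positive heights as missing.
clean = raw.copy()
clean.loc[clean["height_cm"] <= 0, "height_cm"] = float("nan")

# Compute the values for a new variable.
clean["bmi"] = clean["weight_kg"] / (clean["height_cm"] / 100) ** 2

# Produce a new dataset by subsetting the existing one.
tall = clean[clean["height_cm"] >= 170].reset_index(drop=True)

print(tall)
```

Rows with a missing height are excluded from `tall` because comparisons against a missing value evaluate to false.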

A **PROC** step consists of a group of SAS statements that call and execute a procedure, usually with a SAS dataset as input. Use PROCs to:

- Analyze the data in a SAS dataset
- Produce formatted reports or other results
- Manage SAS files

You can modify PROCs with minimal effort to generate the output you need. PROCs can also perform functions such as displaying information about a SAS dataset. 
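
As a loose Python analogy to a PROC step, the pandas sketch below summarizes a dataset and reports its structure, comparable in spirit to what SAS procedures such as PROC MEANS and PROC CONTENTS produce. The `sales` data is hypothetical, and the pandas calls are stand-ins rather than equivalents of any specific procedure.

```python
import pandas as pd

# Hypothetical dataset standing in for a SAS dataset.
sales = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West", "West"],
        "units": [10, 14, 7, 9, 11],
    }
)

# Analyze the data: summary statistics for each group.
summary = sales.groupby("region")["units"].agg(["count", "mean", "max"])

# Display information about the dataset itself: column names and types.
structure = sales.dtypes

print(summary)
print(structure)
```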

Large enterprises often invest in SAS Grid to run SAS jobs, which helps with workload balancing, high availability and faster processing.

## Other Tools

There are many, many other tools in the business intelligence category. They vary in their cost, licensing arrangements and whether they are stand-alone or cloud-hosted. A good, comprehensive list can be found here: https://en.wikipedia.org/wiki/Business_intelligence_software.

**End of Part 2**

This notebook makes up one part of this module. Now that you have completed this part, please proceed to the next notebook in this module.

If you have any questions, please reach out to your peers using the discussion boards. If you and your peers are unable to come to a suitable conclusion, do not hesitate to reach out to your instructor on the designated discussion board.

# References

- 2021 Gartner Magic Quadrant for Analytics and Business Intelligence Platforms. Retrieved from: https://www.qlik.com/us/lp/sem/gartner-magic-quadrant-2021


- MicroStrategy's platform capabilities. Retrieved from: https://www.microstrategy.com/us/platform


- Weka's wiki page. Retrieved from: http://weka.wikispaces.com/


- List of Weka packages. Retrieved from: http://weka.sourceforge.net/packageMetaData/


- List of all SAS products. Retrieved from: https://www.sas.com/en_ca/software/all-products.html


- Business Intelligence Software list. Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Business_intelligence_software


- Tableau Software (2019). Changing the way you think about data. Retrieved from: https://www.tableau.com/