# [CptS 215 Data Analytics Systems and Algorithms](https://github.com/gsprint23/cpts215)
[Washington State University](https://wsu.edu)

[Gina Sprint](http://eecs.wsu.edu/~gsprint/)
# Introduction

Learner objectives for this lesson:
* Gain an overview of the course and its expectations
* Understand the details of the syllabus
* Discuss the general field of data analytics
* Re-visit algorithms/data structures learned in previous courses

Content used in this lesson is based upon information in the following sources:
* None to report

## Course Overview
Welcome to CptS 215! Please see the [course website](https://piazza.com/wsu/fall2017/cpts215/home) for the syllabus and information about the course. Now let's get started!

## What is Data Analytics?
Data analytics (DA) is the science of analyzing data to gain insight, draw conclusions, or make decisions about the data. 

What are examples of data in the real world, and how is that data being analyzed?
* Social media data collected from social networks, posting, news feeds, etc.
    * Analyzed to suggest friends, deliver user-specific content, recommend products, target advertising, etc.
* Financial data collected from banking transactions, trading, etc.
    * Analyzed to project stock market trends, recommend certain investments, determine credit scores, etc.
* Medical data collected from electronic health records, physician/nurse notes, etc.
    * Analyzed to determine health risk factors, detect the early onset of disease, support insurance billing, etc.
* Many others

## What is Data Analytics Systems and Algorithms?
Data analytics systems and algorithms are the software designed to implement and support data analytics. In this class, we are going to focus on the fundamental data structures used to implement data analytics systems and the algorithms that are associated with such data structures. For the data structures and algorithms covered in this course, we are going to investigate their *efficiency* in terms of time and space complexity. This is especially important for data analytics systems that include algorithms operating on big data sets.
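
To make the idea of time efficiency concrete, here is a minimal sketch that counts how many elements linear search and binary search examine on the same sorted list. The function names and the one-million-item list are made up for illustration; they are not part of any assignment.

```python
def linear_search(items, target):
    """Return (index, probes); index is -1 if target is absent."""
    probes = 0
    for index, value in enumerate(items):
        probes += 1  # one element examined
        if value == target:
            return index, probes
    return -1, probes


def binary_search(sorted_items, target):
    """Return (index, probes) for a list sorted in ascending order."""
    probes = 0
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        probes += 1  # one element examined
        if sorted_items[mid] == target:
            return mid, probes
        if sorted_items[mid] < target:
            low = mid + 1   # discard the lower half
        else:
            high = mid - 1  # discard the upper half
    return -1, probes


data = list(range(1_000_000))  # hypothetical data set, already sorted
target = 999_999               # last item: worst case for linear search

print(linear_search(data, target))  # (999999, 1000000)
print(binary_search(data, target))  # finds index 999999 in at most 20 probes
```

Linear search examines every item in the worst case, while binary search halves the remaining range on each probe. That gap is why the choice of algorithm matters more as data sets grow.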

What data structures are you already familiar with from CptS 121/131 and 122/132?
* Arrays/lists
* Structs/Classes/Objects
* Linked lists
* Stacks
* Queues
* Trees
* Others?

In this class we are going to cover the following data structures:
* Hash tables
* Trees
* Heaps
* Graphs

And we are going to cover searching and sorting in detail, particularly:
* Linear search
* Binary search
* Selection sort
* Insertion sort
* Bubble sort
* Shell sort
* Merge sort
* Quick sort
* Heap sort
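
As a small preview of the sorting algorithms listed above, here is one possible sketch of insertion sort in Python. The function name and sample data are illustrative only, and the version we develop in class may differ.

```python
def insertion_sort(items):
    """Return a new list with the items in ascending order."""
    result = list(items)  # copy so the caller's list is left unchanged
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Shift larger elements one slot to the right to open a gap.
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current  # drop the current item into the gap
    return result


scores = [88, 42, 97, 42, 15]  # hypothetical data
print(insertion_sort(scores))  # [15, 42, 42, 88, 97]
```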

For assignments, we will implement and use the above data structures and algorithms in the context of data analytics. Some topics related to data analytics that we will cover in this class (at a high level) include the following:
* [Data munging/wrangling](https://en.wikipedia.org/wiki/Data_wrangling): Describes the overall process of manipulating unstructured and/or messy data into a structured and clean form.
* [Data mining](https://en.wikipedia.org/wiki/Data_mining): The computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
* [Machine learning](https://en.wikipedia.org/wiki/Machine_learning): Provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can change when exposed to new data.
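
To give a feel for data munging, the sketch below cleans a few messy comma-separated lines using only built-in Python. The rows, field names, and the `clean_row` helper are all invented for illustration.

```python
# Hypothetical messy input: inconsistent spacing, casing, and a missing age.
raw_rows = [
    "  Alice , 34 ,Pullman",
    "BOB,,Seattle ",
    "carol, 29, spokane",
]


def clean_row(row):
    """Turn one comma-separated line into a dict with normalized fields."""
    name, age, city = [field.strip() for field in row.split(",")]
    return {
        "name": name.title(),
        "age": int(age) if age else None,  # a missing age becomes None
        "city": city.title(),
    }


records = [clean_row(row) for row in raw_rows]
for record in records:
    print(record)
```

After cleaning, every record has the same three keys and consistent capitalization, which makes later analysis much simpler.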

## Python
In this class, we are going to learn and use the Python programming language for all of our coding assignments. According to [IEEE Spectrum](http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages), Python is a top-3 programming language of 2016, and according to [KDNuggets](http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html), Python is a top-2 programming language for analytics, data mining, and data science (second to R, which is taught in our WSU CptS 115 course).

### Why Use Python for Data Analytics?
Advantages of learning Python include:
1. Easy to learn
1. Free, open source
1. Support for the life cycle of software (prototyping, development, testing, release, maintenance)
1. Many available libraries, especially for data analytics:
    1. [numpy](http://www.numpy.org/)
    1. [scipy](https://www.scipy.org/)
    1. [SciKits](https://scikits.appspot.com/) (especially [scikit-learn](http://scikit-learn.org/stable/) for machine learning)
    1. [pandas](http://pandas.pydata.org/)
    1. [Plotting libraries](https://wiki.python.org/moin/NumericAndScientific/Plotting), such as [matplotlib](http://matplotlib.org/) and [Plotly](https://plot.ly/)
1. Many supported GUI backends
1. LOTS of community support/development online
1. Cross platform support
    * Python is an interpreted language, so the same code can run on any system with the Python interpreter installed. The trade-off is that Python code can be SLOW to run compared with compiled languages like C.
    
### Python Distribution and IDE
We will use the [Anaconda3](https://docs.continuum.io/anaconda/index) Python3 distribution. This is a free distribution of Python version 3 available for Windows, OS X, and Linux. You can download Anaconda3 [here](https://www.continuum.io/downloads) and view the installation instructions [here](https://docs.continuum.io/anaconda/install).

Anaconda comes packaged with an easy-to-use editor called [Spyder](http://spyder-ide.org/) (Scientific Python Development Environment). I encourage you to use Spyder or one of the following [Anaconda-supported IDEs](https://docs.continuum.io/anaconda/ide_integration) to develop your Python code:
1. PyCharm
1. Eclipse with the PyDev Plugin
1. Visual Studio with Python Tools
1. Wing IDE

Should you choose to use an IDE that uses a project structure (e.g. Eclipse, Visual Studio, etc.), *only turn in your .py source files and ensure your code is runnable from the Python command prompt.*

## Algorithms
Let's review: what is an algorithm? **A sequence of instructions that solves a problem**. A sequence of instructions is an algorithm if it meets the following criteria:
* Well-ordered instructions
* Unambiguous instructions
* Computable instructions
* Produces a result
* Doesn't run forever
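
As a quick illustration of the criteria above, here is a simple sketch of an algorithm that finds the largest value in a list, with comments noting where each criterion shows up. The `find_max` name is just an illustrative choice.

```python
def find_max(values):
    """Return the largest item in a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    largest = values[0]        # well-ordered: start with the first item
    for value in values[1:]:   # doesn't run forever: each item is visited once
        if value > largest:    # unambiguous and computable comparison
            largest = value
    return largest             # produces a result


print(find_max([3, 9, 2]))  # 9
```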

Why are algorithms so important to computer science? If we can specify an algorithm...
* We can automate the solution
* We can also repeat a solution to a problem

Running algorithms at this scale would be impossible without computers (see [Google Inside Search](http://www.google.com/insidesearch/howsearchworks/thestory/)): in 60 seconds, there are 2,314,800 Google searches!

### Examples of Algorithms Learned
What are algorithms you learned/implemented in previous CS courses?
* String/array/list algorithms
    * Length, compare, equals, reverse, etc.
* Searching algorithms
    * Linear, binary, etc.
* Sorting algorithms
* Recursive solutions to common problems (Fibonacci, Towers of Hanoi, maze traversal, etc.)
* Linked list algorithms
    * Insert, insert in order, remove node, is empty, etc.
* Stack algorithms
    * Push, peek, pop, is empty, etc.
* Queue algorithms
    * Enqueue, dequeue, peek, is empty, etc.
* Tree algorithms
    * Breadth first traversal, depth first traversal, search, etc.
* Others? There are lots of them!
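
For reference, the stack and queue operations listed above can be sketched in Python with a built-in `list` and `collections.deque` from the standard library. This is only one possible mapping, and the variable names are illustrative.

```python
from collections import deque

# Stack (last in, first out) using a list.
stack = []
stack.append("a")        # push
stack.append("b")        # push
top = stack[-1]          # peek: "b"
popped = stack.pop()     # pop: removes and returns "b"
stack_is_empty = len(stack) == 0  # False, "a" is still on the stack

# Queue (first in, first out) using a deque.
queue = deque()
queue.append("x")            # enqueue
queue.append("y")            # enqueue
front = queue[0]             # peek: "x"
dequeued = queue.popleft()   # dequeue: removes and returns "x"
queue_is_empty = len(queue) == 0  # False, "y" is still waiting

print(top, popped, stack_is_empty)
print(front, dequeued, queue_is_empty)
```

A `deque` is used for the queue because removing from the front of a plain list requires shifting every remaining element, while `popleft()` does not.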