# Lecture 0 - Data Analytics for Engineers

<center>
<img src="images/analytics_overview.jpg" width="600">
</center>

## Key questions for the course:

* What is "data analytics"?
  - Why do engineers care?
* What is "big data"?
* What is "machine learning"?
* Will students be "data scientists" after this class?

## What is Data Analytics?

According to [Wikipedia](https://en.wikipedia.org/wiki/Data_analysis), data analysis is the "process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making".

For the purpose of this class, "data analysis" = "data analytics" = "data science" = "data mining".

### Why should engineers care?

* Many engineering problems involve using data to make decisions
  - Statistical models can be easier to develop than physical models

* Engineers will need to work increasingly closely with statisticians, analysts, and data scientists.
  - Knowing the "language" of data analytics enables communication.

## What is "big data"?

Nobody really knows, but according to McKinsey and NIST, big datasets are "datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze."

This means that the definition of "big" data depends on the "typical database software tools" used in a particular field.

Many engineers rely primarily on Excel for data storage and analysis. Assuming Excel is the typical tool, what might constitute "big data" for engineers?
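
As a rough starting point for this question, one concrete limit is the size of a single Excel worksheet: the `.xlsx` format holds at most 1,048,576 rows and 16,384 columns. The sketch below checks a table's shape against those limits. The function name `fits_in_excel_sheet` and the sensor example are hypothetical, chosen only for illustration.

```python
# Worksheet size limits for the .xlsx format (Excel 2007 and later).
EXCEL_MAX_ROWS = 1_048_576   # 2**20
EXCEL_MAX_COLS = 16_384      # 2**14


def fits_in_excel_sheet(n_rows, n_cols):
    """Return True if a table of this shape fits in a single worksheet."""
    return n_rows <= EXCEL_MAX_ROWS and n_cols <= EXCEL_MAX_COLS


# Hypothetical example: one sensor reading per second, 10 channels.
rows_per_day = 24 * 60 * 60          # 86,400 readings per day
days_until_full = EXCEL_MAX_ROWS / rows_per_day

print(fits_in_excel_sheet(rows_per_day, 10))        # a single day of data
print(fits_in_excel_sheet(30 * rows_per_day, 10))   # a month of data
print(f"Sheet is full after about {days_until_full:.1f} days")
```

Under these assumptions, roughly twelve days of once-per-second data already fills a worksheet, so "big" can be quite modest by other fields' standards.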

## What is "machine learning"?

According to [Wikipedia](https://en.wikipedia.org/wiki/Machine_learning), machine learning is the use of "statistical techniques to give computers the ability to 'learn' (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed."

This definition is not very specific: it would include linear regression! A more specific definition is "the use of statistical techniques to quantitatively optimize the complexity of a data-driven model". This is a concept that we will discuss in the course.

<center>
<img src="images/overfitting.png" width="600">
</center>

* Simplest possible model: Assume the answer is a constant.
   - Problem: This is usually wrong! The model has not learned anything (underfitting).
   
* Most complex possible model: Fit a curve that goes through all points.
   - Problem: The information will not generalize to new points. The model has memorized the data (overfitting).
   
* Optimum complexity model: 
   - The model is (approximately) correct for data it has seen, and can generalize to new data. This is learning!
   
We will discuss a number of machine learning models, and how to optimize their complexity, in this course. We will **not** cover neural networks or "deep learning", but will try to explain them conceptually. 
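
The three cases in the list above can be illustrated with polynomial fits of increasing degree. This is a minimal sketch, assuming `numpy` is installed. The sine-shaped "true" function, the noise level, and the degrees 0, 3, and 9 are arbitrary choices for illustration, not the models used later in the course.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def true_function(x):
    # Assumed "ground truth", for illustration only.
    return np.sin(2 * np.pi * x)


# Small noisy training set, plus a larger test set the model has not seen.
x_train = np.linspace(0, 1, 10)
y_train = true_function(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = true_function(x_test) + rng.normal(scale=0.2, size=x_test.size)

train_mse = {}
test_mse = {}
for degree in (0, 3, 9):
    # Degree 0 is a constant; degree 9 passes through all 10 training points.
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = np.mean((model(x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((model(x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse[degree]:.4f}, "
          f"test MSE = {test_mse[degree]:.4f}")
```

The training error can only shrink as the degree grows, so it cannot be used on its own to choose the complexity. The error on data the model has not seen is the more informative number.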

## Will students be "data scientists" after this class?

No.

Data analytics includes statistics, machine learning, database design, data visualization and many other topics. This course will **not** make you an expert in these topics, but will give you an idea of the basics and provide literacy in the jargon and methods that are commonly used.

This course will help you decide if you want to learn more about data science/analytics, but more study and experience will certainly be needed before you can call yourself a "data scientist".

## Approach and Tools

* Python programming language
   - More versatile than Matlab (and much more versatile than Excel!)
* Jupyter notebooks
   - Enable integration of content with code (this is a Jupyter notebook).

## What is Python?

[Python](http://www.python.org/) is a modern, general-purpose, object-oriented, high-level programming language.

General characteristics of Python:
* **clean and simple language:** Easy-to-read and intuitive code, easy-to-learn minimalistic syntax, maintainability scales well with size of projects.
* **expressive language:** Fewer lines of code, fewer bugs, easier to maintain.

In [5]:
# This is Python code!
x = 2.0
y = 3 + 2
print(x * y)

10.0


## Why are we using Python?

* Python is commonly used for data analysis, and its use is growing rapidly
* Python is a full programming language

<center>
<img src="images/python-growth.png" width="600">
</center>

## Python vs. Matlab

<center>
<img src="images/python-vs-matlab.png" width="800">
</center>

[Python vs. Matlab details](http://www.pyzo.org/python_vs_matlab.html)

### Advantages:
* Flexibility and portability
* Free and open source
* Huge and supportive community

### Disadvantages:
* Speed and efficiency
* Installation and version management can be difficult
* No built-in IDE (but see [Spyder](https://pythonhosted.org/spyder/), etc.)

## Python vs. Compiled Languages (C++, Fortran)

<center>
<img src="images/optimizing-what.png" width="600">
</center>

### Advantages:
* Fast development time
* Easy to read/write/maintain

### Disadvantages:
* Speed/efficiency
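
One common way to narrow the speed gap is to hand inner loops to a library whose core is written in a compiled language. The sketch below, which assumes `numpy` is installed, computes the same sum of squares with a pure-Python loop and with a single NumPy call. The function names are hypothetical, and the timings depend on the machine, so none are quoted here.

```python
import time

import numpy as np

values = np.arange(1_000_000, dtype=np.float64)


def sum_of_squares_loop(data):
    """Pure-Python loop: every iteration runs in the interpreter."""
    total = 0.0
    for v in data:
        total += v * v
    return total


def sum_of_squares_numpy(data):
    """Vectorized version: the loop runs inside NumPy's compiled code."""
    return float(np.dot(data, data))


start = time.perf_counter()
loop_result = sum_of_squares_loop(values)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
numpy_result = sum_of_squares_numpy(values)
numpy_seconds = time.perf_counter() - start

print(f"loop:  {loop_seconds:.4f} s")
print(f"numpy: {numpy_seconds:.4f} s")
print("results agree:", np.isclose(loop_result, numpy_result))
```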

## Homework 0: "Due" Thursday 8/23

Install Python, Jupyter, and OpenRefine on your computer. Install the `numpy`, `scipy`, `pandas`, `matplotlib`, and `seaborn` packages. If successful, you should be able to open this file and execute the block below, and you should be able to open the OpenRefine GUI from a desktop link.

This homework will not be collected, but you will need it to complete your next homework and to follow along in the next lecture.

The easiest way to achieve this is to install the [Anaconda](https://conda.io/docs/user-guide/install/index.html) suite, which contains Python and Jupyter. You can then easily install additional packages using `conda install ...`. These [instructions](https://datacarpentry.org/OpenRefine-ecology-lesson/setup.html) from Data Carpentry will be useful for OpenRefine.

In [6]:
import numpy as np
import scipy
import pandas
import matplotlib
import seaborn

print("All packages successfully installed")

All packages successfully installed
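
If one of the imports above fails, it can help to see which packages are present and which are missing. The following is a minimal sketch using the standard-library `importlib.metadata` module (Python 3.8 or later). The helper name `installed_versions` is hypothetical and is not part of the homework.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED_PACKAGES = ["numpy", "scipy", "pandas", "matplotlib", "seaborn"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


report = installed_versions(REQUIRED_PACKAGES)
for name, found in report.items():
    print(f"{name:12s} {found if found is not None else 'NOT INSTALLED'}")
```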


## Bonus: Set up a dedicated `conda` environment

Python package management can be tricky. If you already use Python, you may want to create a dedicated "environment" for the packages we will use in this course. You can read more about this here:

https://conda.io/docs/user-guide/tasks/manage-environments.html
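
As one possible starting point, an environment can be described in a small YAML file. The file below is a sketch, not an official course file: the environment name `data-analytics` is an assumption, and no versions are pinned.

```yaml
# environment.yml (hypothetical file; the environment name is an assumption)
name: data-analytics
dependencies:
  - python
  - jupyter
  - numpy
  - scipy
  - pandas
  - matplotlib
  - seaborn
```

With this file in the current directory, `conda env create -f environment.yml` builds the environment and `conda activate data-analytics` switches to it.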

## Need Help?

Ray and Aish will run a hands-on workshop to help work out any kinks in software installation and go over the basics of Python. This will be Friday 8/24 from 10am-12pm in Whitaker 1103 BME Classroom.