# [CPSC 310](https://github.com/GonzagaCPSC310) Data Mining
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Introduction

## Learner Objectives
What are our learning objectives for this lesson?
* Understand the general field of data analytics
* Run a Python program on your own computer
    * Interactive mode
    * Scripting mode

## Acknowledgments
Content used in this lesson is based upon information in the following sources:
* None to report

## What is Data Mining?
Data mining is the science of analyzing data to gain insight, draw conclusions, or make decisions about the data. 

What are examples of data in the real world, and how is that data being analyzed?
* Medical data collected from electronic health records, physician/nurse notes, etc.
    * Analyzed to determine health risk factors, onset of early disease, insurance billing, etc.
* Time series data collected from sensors installed in the environment or worn on the body (wearables)
    * Analyzed to detect physical activity, daily behavior, changes in behavior over time, etc.
* Social media data collected from social networks, posting, news feeds, etc.
    * Analyzed to suggest friends, deliver user-specific content, recommend products, target advertising, etc.
* Financial data collected from banking transactions, trading, etc.
    * Analyzed to project stock market trends, recommend certain investments, determine credit scores, etc.
* Many others

Some topics related to data analytics that we will cover in this class (at a high level) include the following:
* [Data cleaning/munging/wrangling](https://en.wikipedia.org/wiki/Data_wrangling): Describes the overall process of manipulating unstructured and/or messy data into a structured and clean form.
* [Data mining](https://en.wikipedia.org/wiki/Data_mining): The computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
* [Machine learning](https://en.wikipedia.org/wiki/Machine_learning): Provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can change when exposed to new data.
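
As a rough illustration of data cleaning, the following minimal sketch tidies a few made-up records using only built-in Python. The records, the field names, and the `clean_record` helper are all hypothetical and exist just for this example; real data sets are usually cleaned with libraries we will meet later in the course.

```python
# Hypothetical messy records (made up for illustration only)
raw_records = [
    {"name": "  Alice ", "steps": "10432"},
    {"name": "BOB", "steps": ""},          # missing step count
    {"name": "carol", "steps": " 8730 "},
]

def clean_record(record):
    """Return a cleaned copy of one record (hypothetical helper)."""
    # Remove stray whitespace and normalize capitalization
    name = record["name"].strip().title()
    steps_text = record["steps"].strip()
    # Represent a missing value as None instead of an empty string
    steps = int(steps_text) if steps_text else None
    return {"name": name, "steps": steps}

cleaned = [clean_record(r) for r in raw_records]

# A very simple "analysis": average steps over records that have a value
known_steps = [r["steps"] for r in cleaned if r["steps"] is not None]
average_steps = sum(known_steps) / len(known_steps)

print(cleaned)
print(average_steps)
```

Once the data is in a consistent form, even a simple summary such as an average becomes straightforward to compute.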

## Python
In this class, we are going to learn and use the Python programming language for all of our coding assignments. According to [IEEE Spectrum](http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages), Python was a top three programming language of 2016. According to [KDNuggets](http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html), Python is a top two programming language for analytics, data mining, and data science (second only to R).

### Why Use Python for Data Mining?
Advantages of learning Python include:
1. Easy to learn
1. Free, open source
1. Support for the life cycle of software (prototyping, development, testing, release, maintenance)
1. Many available libraries, especially for data analytics:
    1. [numpy](http://www.numpy.org/)
    1. [scipy](https://www.scipy.org/)
    1. [sci-kits](https://scikits.appspot.com/) (especially [sci-kit learn](http://scikit-learn.org/stable/) for machine learning)
    1. [pandas](http://pandas.pydata.org/)
    1. [Plotting libraries](https://wiki.python.org/moin/NumericAndScientific/Plotting), such as [matplotlib](http://matplotlib.org/) and [Plotly](https://plot.ly/)
1. Many supported GUI backends
1. LOTS of community support/development online
1. Cross platform support
    * Python is an interpreted language, so the same code can run on any system with a Python interpreter installed. The tradeoff is speed: Python code can be slow to run compared with code written in compiled languages like C.
    
### Python Distribution and IDE
We will use the [Anaconda v3.7](https://docs.continuum.io/anaconda/index) Python3 distribution. This is a free distribution of Python version 3 available for Windows, OS X, and Linux. You can download Anaconda3 [here](https://www.continuum.io/downloads) and view the installation instructions [here](https://docs.continuum.io/anaconda/install).

Anaconda comes packaged with an easy-to-use integrated development environment (IDE) called [Spyder](http://spyder-ide.org/) (Scientific Python Development Environment). I encourage you to use Spyder or one of the following [Anaconda-supported IDEs](https://docs.continuum.io/anaconda/ide_integration) to develop your Python code:
1. PyCharm
1. Eclipse with the PyDev Plugin
1. Visual Studio with Python Tools
1. Wing IDE

## Download/Install Anaconda
Visit the [Anaconda downloads page](https://www.continuum.io/downloads) and download the Anaconda v3.7 graphical installer for your operating system. Once the download is complete, run the installer.

On a Mac machine, the graphical installer looks similar to this: 
<img src="https://raw.githubusercontent.com/GonzagaCPSC310/U0-Introduction/master/figures/anaconda_mac_installer.png" width="500">

On a Windows machine, at the "Advanced Options" screen, make sure both check boxes are selected, as follows:
![](https://raw.githubusercontent.com/gsprint23/aha/master/lessons/figures/anaconda_adv.png)

At the "Microsoft Visual Studio Code" screen, install the Visual Studio Code editor.
<img src="https://raw.githubusercontent.com/GonzagaCPSC310/U0-Introduction/master/figures/anaconda_install_vscode.png" width="500">

You can uncheck the box to "Learn more about Anaconda Cloud." We won't be using Anaconda Cloud in this class.

Test that your Python installation is complete and correct. To do this, open the [Anaconda Navigator](https://docs.continuum.io/anaconda/navigator.html) located in the Anaconda3 folder. On my Windows machine it is at C:\ProgramData\Microsoft\Windows\Start Menu\Programs\Anaconda3 (64-bit), and on my Mac it is at /Users/gsprint/anaconda3. You can also search your computer for "Anaconda Navigator": on a Windows machine, press the Windows key and start typing; on a Mac, press command + space and start typing.
![](https://raw.githubusercontent.com/gsprint23/aha/master/lessons/figures/anaconda_folder.png)

Note: You may want to make desktop shortcuts for Anaconda Navigator, IPython, and Spyder.

Anaconda Navigator should look like this (note that Visual Studio Code has been installed):
![](https://raw.githubusercontent.com/GonzagaCPSC310/U0-Introduction/master/figures/anaconda_navigator_dashboard.png)

Try launching "qtconsole". Something like this program should pop up:
<img src="https://raw.githubusercontent.com/GonzagaCPSC310/U0-Introduction/master/figures/qt_console.png" width="500">

If it does, congrats! You successfully installed Python and Anaconda, and you are ready to start developing Python code. If you want, you can write your first line of Python code right now! In the open Jupyter QtConsole, type exactly the following:

```
print("Hello World!")
```

And press enter. The text `Hello World!` will be displayed back.
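
If you would like a further check of the installation, a minimal sketch like the following reports which Python interpreter is running. It uses only the standard `sys` module; the variable names `major` and `minor` are just for illustration. With the Anaconda3 distribution, the major version should be 3, and the interpreter location will typically be inside your Anaconda3 folder.

```python
import sys

# sys.version_info holds the version of the interpreter running this code
major = sys.version_info.major
minor = sys.version_info.minor

print("Python version:", f"{major}.{minor}")

# sys.executable is the path of the interpreter in use
print("Interpreter location:", sys.executable)
```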

## Execute Python Code
Python code can be executed in *interactive* mode and in *scripting* mode. In interactive mode, Python code is entered/executed in a command prompt/console/terminal. In scripting mode, Python code in a source file (e.g. a .py file) is executed as a program.

We are going to perform the following steps to re-write the previous `"Hello World!"` program in *interactive* mode and again in *scripting* mode using VS Code and the command line.

### Interactive Python
1. Open the Qt Console and type `print("Hello World!")` and press enter. You should see "Hello World!" echoed back out on the console. Congrats! You just wrote and executed your first line of Python code. We just executed this code in "interactive" mode.

1. Let's explore some features of interactive Python. Type the following commands into the IPython shell and observe the output:
    1. `help(print)`: You can type the name of any identifier between the parens of the `help()` command to learn more about the Python construct.
    1. `pwd()`: Tells you the "present working directory" where Python is executing
    1. `x = 5`: Declare a variable named `x` and assign it the value 5
    1. `type(x)`: Returns the data type of the value stored in variable `x`. What is it?
    1. `course_name = "AHA"`: Declare a variable named `course_name` and assign it the string "AHA"
    1. `type(course_name)`
    1. `course_name.<tab>`: Type the variable name `course_name` and a dot. Then press tab. IPython will provide you with all of the available information and behaviors associated with this variable. We will learn much more about this information later in the course.
    1. `course_name.upper()`: What does `upper()` do?
    1. `dir()`: Lists the names (such as variables) known to Python in the current session. Do you see your variable names?
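
Most of the commands above also work outside of the interactive shell. The following is a minimal sketch, assuming plain Python rather than IPython: `pwd()` and tab completion are conveniences provided by IPython, so here `os.getcwd()` from the standard library stands in for `pwd()`, and `dir(course_name)` stands in for pressing tab. The name `string_attributes` is just for illustration.

```python
import os

# Standard-library way to get the present working directory
print(os.getcwd())

x = 5
print(type(x))  # the data type of the value stored in x

course_name = "AHA"
print(type(course_name))
print(course_name.upper())

# dir(course_name) lists the attribute names that tab completion would offer
string_attributes = dir(course_name)
print("upper" in string_attributes)
```

In a script, results are only displayed when you call `print()`, whereas the interactive shell echoes the value of each expression you type.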
    
### Scripting Python
1. Open VS Code and create a new file.
1. Type `print("Hello World!")`
1. Navigate to the menu bar. Select File -> Save As and select a folder to save your Python code files in. Name this file hello_world.py and save it in your newly made folder.
1. Run the program! You can do this by pressing F5 on your keyboard or selecting Debug -> Start Debugging. The output of your code will be in the Terminal. Congrats! You just wrote and executed your first Python *script*.
<img src="https://raw.githubusercontent.com/GonzagaCPSC310/U0-Introduction/master/figures/vscode_hello_world.png">
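
The same script can also be run from the command line. The following is a minimal sketch, assuming `hello_world.py` is saved in the terminal's current folder and that the `python` command on your system starts the Anaconda Python 3 interpreter (on some systems the command is `python3`). The `main` function and the `if __name__ == "__main__":` guard are optional additions for illustration; the one-line version from the steps above runs the same way.

```python
# hello_world.py
# Run from a terminal in the folder containing this file with:
#     python hello_world.py

def main():
    """Print a greeting and return it so it can be checked."""
    message = "Hello World!"
    print(message)
    return message

# This block runs only when the file is executed as a script,
# not when it is imported by another Python file
if __name__ == "__main__":
    main()
```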