Author: Benjamin W. Ong
Date: September 9, 2018

Goals for this module:
* Introduction to data science
* Introduction to Jupyter Notebooks
* Introduction to Pandas

# What is Data Science?
* A systematic approach to infer knowledge from data.
* A truly interdisciplinary field that utilizes techniques and theories from, among others, mathematics, statistics, information science and computer science.
* Career opportunities span a wide range of industries. Common job titles include data scientist, data analyst, data engineer, computer and information research scientist, operations research analyst, ...
* Although the discipline of data science is very broad, most employers seek data scientists who 
    1. have a strong statistical background
    2. have a strong analytical background
    3. communicate well (both written and oral)
* Here is a YouTube snippet from a data scientist at a large, unnamed tech company: https://www.youtube.com/watch?v=xC-c7E5PK0Y


# UN5550 - Introduction to Data Science
This course is intended to give you a broad overview of, and gentle introduction to, foundational topics in data science.  As part of the broader Masters in Data Science curriculum, you will study several of these topics in depth.

Course Objectives:
* Proficiency using the Python programming language and awareness of tools/libraries that are available
* Proficiency presenting findings using Jupyter notebooks
* Proficiency in data management: getting data, cleaning data, dealing with missing data, dimension reduction, exploratory analysis
* Appreciation for machine learning: regression, classification and clustering


# What can you expect from UN5550?
* Weekly lectures on Tuesdays: it will be team taught, although a large percentage will be taught by Ben.  
* Lectures taught by Ben will use Jupyter notebooks, which you can edit and run on the fly.
* Weekly projects (due Friday at 5pm). You can work in teams of two or three, but each person is responsible for submitting their own notebook -- no plagiarism tolerated! 
* Weekly labs on Thursdays will be run by Neelima (TA).  Labs are intended to be a brief review of material covered on Tuesday, and an in-depth dive into practical issues related to the project.

# My expectations
* Come to **every** class
* Be respectful
* Ask questions
* Check your email
* Consult course page on Canvas regularly
* **Adhere to the MTU academic integrity policy**

# Computational Platform
Following previous iterations of UN5550, we will primarily be using Python as the computational platform for this course.  
* Python is a general-purpose, high-level programming language that is portable and, in recent years, widely adopted in the data science community.  
* It has many tools and libraries for data analytics, data processing and visualization (funded by DARPA in 2012)
* Used across many industries (e.g. social media, finance), at least in the US. 
* Ben is **NOT** an expert at Python, but I hope to be more proficient by the end of the course.
* Your weekly projects are to be done using Jupyter Notebooks.  Why?  Interpreted, easy-to-read code with embedded markdown. 
* Let's take some time to set up our environment: https://github.com/ongbw/UN5550-Fall2018/blob/master/installing_python.md


# About Python
* There are two different versions of Python: Python 2.x and Python 3.x
* Unfortunately, the versions are not compatible.  Python 2.0 was introduced in 2000 and Python 3.0 in 2008.  
* Most of the scientific community did not immediately change over to Python 3.x, though most of the libraries have now been ported.
* We will use Python 2.7 in this course, to be consistent with the textbook.  You may use Python 3.x if you are comfortable with it, but be aware that some comments from the text will then be inaccurate.
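
Two of the most visible incompatibilities are printing and integer division: in Python 2, print is a statement and 7 / 2 evaluates to 3, whereas in Python 3, print is a function and 7 / 2 evaluates to 3.5. The following is a minimal sketch (not taken from the textbook) of code written so that it behaves the same way under both versions.

```python
import sys

# Report which interpreter is running this cell.
print("Running Python %d.%d" % sys.version_info[:2])

quotient = 7 // 2   # explicit floor division: 3 in Python 2 and Python 3
ratio = 7 / 2.0     # a float operand forces true division: 3.5 in both

print(quotient)
print(ratio)
```

Calling print with parentheses and a single argument, as above, works in both versions.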


# Python Ecosystem
* There are many toolboxes and libraries for data analytics, data processing and visualization
* By the end of the course, you should be proficient with:
    * NumPy: basic operations for arrays and useful linear algebra functions
    * SciPy: collection of numerical algorithms
    * Matplotlib: visualization
    * scikit-learn: machine learning (classification, clustering, dimensionality reduction, model selection, pre-processing)
    * Pandas: data structures and data analysis tools
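
As a small taste of the array and linear algebra functionality mentioned for NumPy, here is a hedged sketch that solves a 2-by-2 linear system. The system itself is an arbitrary illustration, not part of the course material.

```python
import numpy as np

# Coefficient matrix and right-hand side of the system
#   2x +  y = 3
#    x + 3y = 5
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)   # solve A x = b

print(x)                    # the solution vector
print(np.dot(A, x) - b)     # residual; should be numerically zero
```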

# Getting Started
* Launch a jupyter notebook by opening a terminal, navigating to an appropriate directory, then typing
```shell
  jupyter notebook
```
This should open up a browser window, listing the files in your directory.  For now, click New, then Python 2, to create your new notebook.
* Let's import the libraries:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
To execute, click on the "run" button, or press the Ctrl+Enter keys.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Data Frames
* The key feature of Pandas is a fast and efficient DataFrame object for data manipulation
* A DataFrame is a tabular data structure, with rows and columns.  Let's learn about DataFrames by creating one and manipulating it.

In [None]:
data = {"country": ["Canada","USA","Mexico","India","Singapore","China"], 
       "capital": ["Ottawa","Washington","Mexico City","New Delhi","Singapore","Beijing"],
       "population": [37.0, 327.2, 130.8, 1356.5, 5.8,1415.0]}
myworld = pd.DataFrame(data)
print(myworld)

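Notice that the DataFrame object has stored the columns in alphabetical order.  (This is what happens under Python 2.7, where dictionaries are unordered; recent versions of Python and Pandas keep the order in which the keys were entered.)  We can set the order at construction time using the "columns" keyword, along with a list of the columns ordered the way we want.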


In [None]:
myworld2 = pd.DataFrame(data, columns = ["country","capital","population"])
print(myworld2)
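
The column order can also be changed after construction. The sketch below, which uses a shortened copy of the data above, selects the columns with a list of names; the variable name reordered is only for illustration.

```python
import pandas as pd

data = {"country": ["Canada", "USA", "Mexico"],
        "capital": ["Ottawa", "Washington", "Mexico City"],
        "population": [37.0, 327.2, 130.8]}
myworld = pd.DataFrame(data)

# Selecting with a list of column names returns a new DataFrame
# whose columns appear in exactly the order given in the list.
reordered = myworld[["country", "capital", "population"]]
print(list(reordered.columns))
```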

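Pandas has also assigned a "key" (the index) for each row, in this case with numerical values from 0 through 5.  You can access a subset of rows (observations) by slicing with square brackets; as with Python lists, the end of the slice is excluded, so [1:3] returns rows 1 and 2.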

In [None]:
print(myworld[1:3])
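
Pandas also provides two explicit indexers for rows, iloc (by position) and loc (by index label). The following sketch, on a shortened copy of the data above, shows how they differ at the end of a slice.

```python
import pandas as pd

myworld = pd.DataFrame({"country": ["Canada", "USA", "Mexico", "India"],
                        "population": [37.0, 327.2, 130.8, 1356.5]})

by_position = myworld.iloc[1:3]  # positions 1 and 2; the end is excluded
by_label = myworld.loc[1:3]      # labels 1, 2 and 3; the end is included

print(by_position)
print(by_label)
```

With the default integer index, a plain slice such as myworld[1:3] returns the same rows as the iloc version.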

If you only want one column from a DataFrame, you can put the column name in square brackets.  The result will be a Series data structure (not a DataFrame) because only one column is retrieved.

In [None]:
myworld["country"]
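
If you would rather keep a tabular object, pass a list containing the single column name instead. A minimal sketch of the difference, on a shortened copy of the data above:

```python
import pandas as pd

myworld = pd.DataFrame({"country": ["Canada", "USA", "Mexico"],
                        "population": [37.0, 327.2, 130.8]})

one_column = myworld["country"]      # single brackets: a Series
still_table = myworld[["country"]]   # list of names: a one-column DataFrame

print(type(one_column).__name__)
print(type(still_table).__name__)
```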

Now that we have some experience with DataFrame objects, let's import a larger data set from a csv file.  We'll use the data/population.csv file (which can be regenerated with updated data, if desired, using the WorldbankData.ipynb script).

In [None]:
pop = pd.read_csv('data/population.csv')
pop
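
If you do not have the file at hand, read_csv can also parse text held in memory. The sketch below assumes a layout with the three columns used later in this notebook (Country Name, Year, Value); the rows are made-up placeholders, not real population data, and the actual file may contain additional columns.

```python
import io
import pandas as pd

# Made-up placeholder rows, NOT real population data.
csv_text = u"""Country Name,Year,Value
Country A,2014,100
Country A,2015,110
Country B,2014,200
Country B,2015,220
"""

# read_csv accepts a file-like object as well as a file path.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.dtypes)   # Year and Value are parsed as integers
print(toy.shape)    # (number of rows, number of columns)
```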

We can use the head() function to look at the first few rows, and the tail() function to look at the last few rows.  (The default is 5; you can specify the number of rows to print in the parentheses.)

In [None]:
pop.head()

In [None]:
pop.tail(3)
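
To see exactly which rows head() and tail() return, here is a small sketch on a ten-row DataFrame whose single column simply counts from 0 to 9 (the name numbers is only for illustration).

```python
import pandas as pd

numbers = pd.DataFrame({"n": list(range(10))})  # rows 0 through 9

first_five = numbers.head()    # default: the first 5 rows
last_three = numbers.tail(3)   # the last 3 rows

print(first_five)
print(last_three)
```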

The describe() function gives quick summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for all __numeric__ columns.

In [None]:
pop.describe()
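
The sketch below applies describe() to two columns of the small data set from the Data Frames section, to show that the non-numeric country column is left out of the summary by default.

```python
import pandas as pd

myworld = pd.DataFrame({
    "country": ["Canada", "USA", "Mexico", "India", "Singapore", "China"],
    "population": [37.0, 327.2, 130.8, 1356.5, 5.8, 1415.0]})

summary = myworld.describe()   # only the numeric column is summarized

print(summary)
print(summary.loc["mean", "population"])   # pick out a single statistic
```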

Often, we want to filter data based on some criteria.  For example, if we only care about populations above 1 billion,

In [None]:
pop[pop['Value'] > 1000000000]

Hmm, that was less useful than I expected.  I guess there are many Country names that I wasn't expecting.  Let's search specifically for the year 2015.  (When combining logical statements, don't forget to wrap each comparison in parentheses so that it produces a boolean Series before & is applied; & binds more tightly than > and ==.)

In [None]:
pop[(pop['Value'] > 1000000000) & (pop['Year'] == 2015)]
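
The filter above works because each comparison produces a boolean Series with one entry per row, and indexing with that Series keeps the rows marked True. A hedged sketch on made-up placeholder data (not real populations) that names the intermediate masks:

```python
import pandas as pd

# Made-up placeholder values, NOT real population data.
toy = pd.DataFrame({
    "Country Name": ["Country A", "Country A", "Country B", "Country B"],
    "Year": [2014, 2015, 2014, 2015],
    "Value": [100, 110, 200, 220]})

is_large = toy["Value"] > 105    # boolean Series: False, True, True, True
is_2015 = toy["Year"] == 2015    # boolean Series: False, True, False, True

both = toy[is_large & is_2015]     # rows where both conditions hold
either = toy[is_large | is_2015]   # rows where at least one condition holds

print(both)
print(either)
```

Storing each mask in its own variable avoids the need for extra parentheses and makes longer filters easier to read.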

A useful way to inspect data is to group it according to some criterion.  For example, perhaps it would be nice to group all the data by country, regardless of year.  We thus need to aggregate the data in an appropriate fashion.  For example, we could take the mean population (over time) for each country. 

In [None]:
group = pop[['Country Name', 'Value']].groupby('Country Name').mean()
group.head()
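
To check what groupby is doing, it helps to try it on data small enough to verify by hand. The values below are made-up placeholders, not real populations.

```python
import pandas as pd

# Made-up placeholder values, NOT real population data.
toy = pd.DataFrame({
    "Country Name": ["Country A", "Country A", "Country B", "Country B"],
    "Year": [2014, 2015, 2014, 2015],
    "Value": [100, 110, 200, 220]})

# Mean over the years for each country; the result is indexed by country.
means = toy[["Country Name", "Value"]].groupby("Country Name").mean()

# Other aggregations work the same way, for example the maximum.
peak = toy.groupby("Country Name")["Value"].max()

print(means)
print(peak)
```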

Let's explore some data visualization.  Suppose we were interested in plotting the population of China over time.  Let's first create a variable, cn, that extracts the rows for China, and then plot that new variable.

In [None]:
cn = pop[pop['Country Name'] == 'China']
cn.plot(x="Year",y="Value")
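
The plot method returns a Matplotlib Axes object, which can be used to label the figure. The following is a sketch on made-up placeholder values (not real data); the Agg backend line is only needed when running outside a notebook, and the label and title text are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder values, NOT real population data.
toy = pd.DataFrame({"Year": [2013, 2014, 2015],
                    "Value": [100, 105, 110]})

ax = toy.plot(x="Year", y="Value", legend=False)  # returns the Axes
ax.set_xlabel("Year")
ax.set_ylabel("Population")
ax.set_title("Toy population series")
```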

Or perhaps, we can make use of the aggregate data and plot a pie chart based on the first five entries ...

In [None]:
group.head().plot.pie(y='Value')

The plotting utilities in Matplotlib are extensive.  If you are new to generating graphs in Python, please review the examples given in the textbook, as well as some examples at: https://pandas.pydata.org/pandas-docs/stable/visualization.html