# Introduction to Computers, Python, and Data Science

## Objectives

Students will __learn__:

* recent developments in computing
* hardware, software, and Internet basics
* the data hierarchy, from bits to databases
* the different types of programming languages
* the basics of the object-oriented paradigm
* strengths of Python as well as other programming languages
* importance of libraries
* Python's standard library and data science libraries



---


In [None]:
# Comments - use the hash symbol (#) to start a comment
# Comments are used to describe the code
# They are read only by humans; Python ignores them

# Block comments use triple quotes (technically a string that is never used)
"""
Placeholder text that spans several lines.
Python evaluates it as a string and then discards it,
so it behaves like a block comment.
"""

# 1.1 - Introduction

# Python

* Object-oriented scripting language
* Developed by Guido van Rossum at the National Research Institute for Mathematics and Computer Science (CWI) in Amsterdam
* Publicly released in 1991
* Surpassed R as the most widely used data science language
* Open source, large community, platform independent
* Easier to learn than C++ and Java
* Thousands of libraries to extend capabilities
* Widely used in data science (and finance)
* Extensive job market, especially in data science, with high salaries

---

# 1.2 - Hardware and Software

* Computers perform billions of calculations per second and make logical decisions
* Supercomputers perform quadrillions of calculations per second (e.g., 122 quadrillion)
  - IBM Summit at Oak Ridge National Laboratory
  - Fugaku at RIKEN
  - https://en.wikipedia.org/wiki/TOP500
* Programs (software) process data

### 1.2.1 - Moore's Law
* Computer capabilities have roughly doubled every 18 months, at about half the cost
  - communication capacity (e.g., bandwidth) has followed a similar trend
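
As a rough illustration of what this doubling rule implies, the sketch below computes the growth factor after a given number of years. The function name `growth_factor` is hypothetical, and the 18-month default is simply the figure quoted above, not a precise statement of Moore's Law.

```python
def growth_factor(years, doubling_period_months=18):
    """Return how many times capability multiplies after `years`,
    assuming it doubles once every `doubling_period_months`."""
    doublings = (years * 12) / doubling_period_months
    return 2 ** doublings

# Compound doubling grows quickly: compare 3, 6, and 15 years.
for years in (3, 6, 15):
    print(f"After {years:>2} years: about {growth_factor(years):,.0f}x")
```

Changing `doubling_period_months` to 24 shows how sensitive the long-run estimate is to the assumed doubling period.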

### 1.2.2 - Computer Organization
Computers are divided into logical units:
* Input - keyboards, mice, touch screens
* Output - monitors, printers, audio, video, hard drives, SSDs
* Memory - RAM (volatile)
* Arithmetic and Logic Unit (ALU) - performs +, -, *, /, <, >, ...
* CPU - coordinates and supervises all operations
  - multicore processors perform operations simultaneously
* Secondary Storage Unit - hard drive
  - persistent
  - cheap
  - SSD, USB flash drive (1 TB = 1 trillion bytes)


# 1.3 - Data Hierarchy

* Bits - binary digits (each is 0 or 1)
  - a byte is 8 bits
  - a kilobyte is 1024 bytes (binary convention)
* Characters/character set (composed of bits)
  - Unicode UTF-8 (https://docs.python.org/3/howto/unicode.html)
  - ASCII is a subset of UTF-8
* Fields (composed of characters)
* Records (composed of fields)
  - e.g., name, address, DOB, ...
* Files (groups of records, or more generally sequences of bytes)
* Database - organized collection of data
  - relational database
  - graph database
  - flat-file database
* Big Data
  - What is big? (TB, PB, EB, ZB)
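
The lower levels of this hierarchy can be inspected directly in Python. The following is a minimal sketch; the sample names and values in the records are made up for illustration, and a list of dictionaries stands in for a file of records.

```python
# Bits and bytes: the character 'A' is code point 65, stored as one byte in UTF-8.
letter = "A"
code_point = ord(letter)
bits = format(code_point, "08b")  # the 8 bits of that byte, as text
print(letter, code_point, bits)

# Characters outside ASCII need more than one byte in UTF-8.
encoded = "\u00e9".encode("utf-8")  # \u00e9 is e with an acute accent
print(len(encoded), "bytes")

# ASCII text has the same bytes whether encoded as ASCII or UTF-8.
print("A".encode("ascii") == "A".encode("utf-8"))

# Fields make up a record; records make up a file or table (hypothetical data).
records = [
    {"name": "Ada Lovelace", "city": "London", "birth_year": 1815},
    {"name": "Alan Turing", "city": "London", "birth_year": 1912},
]
for record in records:
    print(record["name"], record["birth_year"])
```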



# 1.4 - Machine Languages, Assembly Languages, and High-Level Languages
* Machine language
  - strings of 0s and 1s
  - machine dependent
* Assembly language
  - _assemblers_ convert assembly language to machine language
* High-level languages
  - source code is English-like
  - easy for humans to read and write
  - Compilers - convert source code into machine language before execution
    - fast execution
  - Interpreters - translate and execute the program as it runs (some use just-in-time, JIT, compilation)
    - slower execution
    - avoid the delay of compiling
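
Python sits between these two approaches: the reference implementation (CPython) first compiles source code to bytecode and then interprets that bytecode. The sketch below uses the standard library's `dis` module to list the bytecode instructions for a small, hypothetical function `add`; the exact instruction names vary between Python versions, so no specific output is shown.

```python
import dis

def add(a, b):
    return a + b

# Collect the names of the bytecode instructions CPython generated for add().
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)
```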
  

# 1.5 - Introduction to Object Technology
* A class is the fundamental software unit
* Objects are created (instantiated) from classes
* Key principles include inheritance and polymorphism
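
These ideas can be sketched in a few lines of Python. The class names `Shape`, `Rectangle`, and `Circle` are hypothetical examples chosen for illustration, not part of any library.

```python
import math

class Shape:
    """Base class: a blueprint from which objects are created."""

    def __init__(self, name):
        self.name = name

    def area(self):
        raise NotImplementedError("subclasses must implement area()")

    def describe(self):
        return f"{self.name} with area {self.area():.2f}"

class Rectangle(Shape):  # inheritance: Rectangle is a kind of Shape
    def __init__(self, width, height):
        super().__init__("rectangle")
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height

class Circle(Shape):  # inheritance: Circle is a kind of Shape
    def __init__(self, radius):
        super().__init__("circle")
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2

# Polymorphism: the same describe() call runs a different area() per object.
shapes = [Rectangle(2, 3), Circle(1)]
for shape in shapes:
    print(shape.describe())
```

Both subclasses reuse `describe()` from `Shape` and only supply their own `area()`, which is the practical benefit of inheritance here.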


# 1.6 - Operating Systems
* Windows (proprietary)
* Linux (open source and free; developed by a community of contributors)
  - Linux kernel = core
* macOS (desktop GUI concepts originated at Xerox PARC)
  - built on BSD Unix by way of NeXT
* iOS
* Android (88% market share)
  - based on the Linux kernel

## Notebooks

Interactive notebooks are a convenient and popular way to run Python programs. Common notebook environments:

* Jupyter Notebook
* SageMath (CoCalc)
* IPython
* Google Colab

# Libraries

* Libraries provide means to do significant tasks with less code.
* Avoid reinventing the wheel


### Standard Library
Provides capabilities for:
* Text processing
* Mathematics
* File I/O
* Cryptography
* OS services
* Network protocols
* Multimedia

Here are some commonly used modules in the standard library:

* ```csv``` - process comma-separated data
* ```math``` - common math functions and constants (e.g., pi)
* ```os``` - interaction with the operating system
* ```statistics``` - mean, median, mode, ...
* ```string``` - string processing
* ```random``` - pseudorandom numbers
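
A few of these modules are shown in action below. This is a minimal sketch: the scores and the CSV text are made-up sample data, and an in-memory string stands in for a real file.

```python
import csv
import io
import math
import random
import statistics

# math: constants and common functions
print(math.pi, math.sqrt(16))

# statistics: descriptive statistics on a small sample
scores = [88, 92, 79, 92, 85]
print(statistics.mean(scores), statistics.median(scores), statistics.mode(scores))

# random: seeding makes the pseudorandom sequence reproducible
random.seed(42)
roll = random.randint(1, 6)  # simulate one roll of a six-sided die
print(roll)

# csv: parse comma-separated text into one dictionary per row
text = "name,score\nAda,92\nAlan,88\n"
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["name"], rows[0]["score"])
```

Note that `csv` returns every field as a string, so numeric columns need an explicit conversion such as `int(rows[0]["score"])`.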

See page 18 for more.

----





## Data Science Libraries

Here is a list of some libraries used in data science

###### __Scientific Computing and Statistics__
* _NumPy_ (Numerical Python) - used in numerical analysis; provides data structures for working with matrices and vectors. Plain Python _lists_ can be used instead, but they are much slower than NumPy's ```ndarray``` data structure.

* _SciPy_ (Scientific Python) - integrals, differential equations

* _pandas_ - data manipulation and analysis (i.e., DataFrames)

* _matplotlib_ - visualization
* _seaborn_ - visualization

* _scikit-learn_ - machine learning (classification, regression, clustering)

---

###### __Data Manipulation and Analysis__

_pandas_ - provides a ```DataFrame``` structure similar to R's ```data.frame```.

___

###### __Visualization__

_Matplotlib_ - Plotting library supporting scatter, bar, contour, pie, quiver, grid, polar, and 3D.

_Seaborn_ - Built on _matplotlib_; provides additional visualizations with less code.

---

###### __Machine Learning, Deep Learning, and Reinforcement Learning__

___

###### __Natural Language Processing__


---

## 1.9 - Other (Popular) Programming Languages
* __BASIC__: Beginner's All-purpose Symbolic Instruction Code

* __C__ (general purpose)

* __C++__ (Most applications)

* __Fortran__ (Mathematical computations)

* __Java__ (portability)

* __JavaScript__ (website interactivity)

* __Swift__ (iOS)

---

## 1.10 - Skip
* Covers other IPython notebook environments such as Jupyter.
* We use Colab by Google for our Interactive Python Notebook.

## 1.11 - Internet and World Wide Web (WWW)
* In the late 1960s, the US Department of Defense's Advanced Research Projects Agency (ARPA) developed a network of computers called ARPANET, the forerunner of today's Internet.
* The Transmission Control Protocol (TCP) is a set of rules for communicating over the network.
* The Internet is a network of networks that uses the TCP/IP protocols.
* Bandwidth is information-carrying capacity.
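
To make bandwidth concrete, the sketch below estimates the ideal time to move a file across links of different speeds. The function name `transfer_seconds`, the file size, and the link speeds are assumptions chosen for illustration; real transfers are slower because of protocol overhead and congestion.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_second):
    """Ideal transfer time: data size in bits divided by bandwidth."""
    return (size_bytes * 8) / bandwidth_bits_per_second

file_size = 1_000_000_000  # 1 GB, using decimal units

# Bandwidth is quoted in bits per second, file sizes in bytes.
for label, bps in [("10 Mbps", 10e6), ("100 Mbps", 100e6), ("1 Gbps", 1e9)]:
    print(f"{label:>8}: {transfer_seconds(file_size, bps):,.0f} s")
```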
 

## Mashups and IoT

A __mashup__ combines two or more services to develop a new application.
For example, apartment rental advertisements on Craigslist can be paired with Google Maps to show the locations of the rentals.

See ProgrammableWeb (http://www.programmableweb.com/) for a directory of over 20,000 web services and 8,000 mashups.

The Internet of Things (IoT) consists of __things__ that have an IP address and can send and receive data. For example, some _things_ are:

* heart monitors (e.g., Fitbit, Apple Watch)
* radiation detectors
* wildlife cameras
* home appliances
* thermostats
* smart meters
* toll tags


## Big Data

### Four V's
1. Volume - amount of data (see below)
2. Velocity - speed at which data is produced
3. Variety - text, audio, video
4. Veracity - validity of the data: accuracy, reliability, and the presence of fake data


Is 50 MB big?

#### Megabytes
* MP3 audio files ~ 2 MB
* Photos ~ 10 MB
* Videos ~ 100 MB 

#### Gigabytes
10 GB is about:
* 150 hours of MP3 audio files
* 1000 photos
* 10 minutes of video

#### Terabytes



#### Petabytes, Exabytes, and Zettabytes



## FLOPS
Floating-point operations per second

* Measure of computer performance.
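
To get a feel for the scale, the sketch below asks how long a person doing one calculation per second would need to match a single second of work by a machine rated at 122 quadrillion operations per second (the supercomputer figure quoted in section 1.2). The function name `years_by_hand` and the one-per-second human rate are assumptions for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def years_by_hand(flops, per_second_by_hand=1):
    """Years a person needs to match one second of machine work."""
    return flops / per_second_by_hand / SECONDS_PER_YEAR

supercomputer_flops = 122e15  # 122 quadrillion operations per second
print(f"{years_by_hand(supercomputer_flops):.2e} years")
```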

# Data Science Use Cases
* cancer diagnosis
* criminal recidivism
* fraud detection
* marital infidelity
* terrorism prevention

See pages 39-40 for a more exhaustive list.
* There has to be something on this list that you find interesting and would like to solve.

---

# Homework (Due Sept. 1)

Chapter 1: 4, 12, 14, 19

---
