# Data Science

## Intro

### Machine Learning

- Supervised Learning
  - Regression
    - Linear Regression
      - low dimensional, ridge regression, lasso, greedy regression
    - Nonparametric Regression:
      - kernel regression, local polynomials, additive, RKHS regression
  - Classification
    - Linear Classification: 
      - linear, logistic, SVM, sparse logistic
    - Nonparametric Classification:
      - NN, naive Bayes, plug-in, kernelized SVM
  - Conformal Prediction
  - Cross Validation

- Unsupervised Learning
  - Clustering: 
    - k-means, mixtures, single-linkage, density clustering, spectral clustering
  - Nonparametric Density Estimation
  - Measures of Dependence
  - Graphical Models: 
    - correlation graphs, partial correlation graphs, conditional independence graphs
    

### Deep Learning

# DataScience

---
id: DataScience
title: DataScience
sidebar_label: DataScience
---

## History and Evolution

---

## Theories

### CAP

- Consistency
- Availability
- Partition Tolerance

- CAP theorem: a distributed computer system cannot simultaneously provide all three of the following guarantees: Consistency, Availability, and Partition Tolerance

### ACID

- Atomicity
- Consistency
- Isolation
- Durability
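
To make atomicity concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the account names are assumptions made up for illustration; the point is only that both updates either commit together or are rolled back together.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; return True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Raising here undoes BOTH updates (atomicity).
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and commits
failed = transfer(conn, "alice", "bob", 500)  # overdraws, so it is rolled back
```

After the failed transfer the balances are the same as before it, which is the behaviour the Atomicity bullet describes.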

### Categories

- IMQAV model
  - Ingest
  - Model
  - Query
    - SQL:
      - RDBMS: MySQL, PostgreSQL, SQLServer, Oracle
      - Store: Row + Column
      - Schema on write
    - NoSQL:
      - DBMS: MongoDB, Cassandra, Redis, Neo4j, HBase
      - Store:
        - Document
        - Key-Value
        - Graph-based
      - Schema on read
  - Analyze
    - Descriptive Analysis
    - Exploratory Analysis
    - Predictive Analysis
    - Mechanistic Analysis
    - Inferential Analysis
    - Causal Analysis
  - Visualize
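
The difference between "schema on write" and "schema on read" in the Query branch above can be sketched with the standard library alone. The `users` table, the sample documents, and the `read_names` helper are assumptions for illustration; SQLite stands in for an RDBMS and a list of JSON strings stands in for a document store.

```python
import json
import sqlite3

# Schema on write: the table definition is checked when a row is inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Ada"))
try:
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (2, None))
    write_rejected = False
except sqlite3.IntegrityError:
    # The NOT NULL constraint rejects the malformed row at write time.
    write_rejected = True

# Schema on read: raw documents are stored as-is, and the reader decides
# how to interpret them (including how to handle missing fields).
raw_documents = [
    '{"id": 1, "name": "Ada"}',
    '{"id": 2, "email": "x@example.com"}',  # no "name" field, still stored
]


def read_names(documents):
    """Apply a 'schema' only when reading: default missing names."""
    return [json.loads(doc).get("name", "<unknown>") for doc in documents]


names = read_names(raw_documents)
```

In the first half the database refuses the bad row; in the second half the store accepts anything and the reading code carries the burden of interpretation.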

---

## Applications

### Cyber-Physical-Social Systems

## Technologies

### NoSQL

#### Introduction

- Categories

  - Key-Value Store
  - Document Store
  - Graph Store
  - Column Store

- Consistency Models

  - not ACID
  - BASE
    - Basic Availability
    - Soft-state
    - Eventual consistency
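
Eventual consistency can be illustrated with a toy model of two replicas that accept writes independently and converge after exchanging state. The `Replica` class and the last-write-wins rule based on a logical timestamp are assumptions chosen for brevity; real NoSQL systems use a range of conflict-resolution strategies.

```python
class Replica:
    """A toy key-value replica that keeps a logical timestamp per key."""

    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        # Last-write-wins: keep the entry with the highest timestamp.
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        # Pull every entry from the other replica and merge it.
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


a, b = Replica(), Replica()
a.write("cart", ["book"], timestamp=1)
b.write("cart", ["book", "pen"], timestamp=2)

stale = a.read("cart")  # replica a still serves the older value (soft state)

# After the replicas exchange state, both return the same value.
a.sync_from(b)
b.sync_from(a)
```

Between the write and the sync, a reader may see stale data; once updates stop and the replicas have synchronized, every read returns the same value.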

---

---

## Big Data

### Introduction

- 5V Features:

  - Volume
    - huge amount of data
  - Variety
    - variety of data sources
    - variety of data formats
  - Velocity
    - high speed of accumulation of data
  - Veracity
    - inconsistency and uncertainty in data
    - Quality
    - Validation
  - Value
    - usefulness of data

- Data Categories:

  - Structured
    - database tables, Excel spreadsheets
    - machine-generated sensor data
    - machine-generated web logs
  - Semi-structured
    - webpage
    - JSON
    - XML
  - Unstructured
    - PDF
    - txt
    - images
    - videos

- Data Sources:

  - Web Logs
  - IoT Sensors
  - Social Network
  - Web 2.0/3.0
  - Scientific Data
  - By Vertical Industries:
    - Healthcare
      - ICU monitoring
      - Epidemic early warning
    - Transportation
      - traffic congestion
      - logistics optimization
    - Telecommunication
      - network ops
      - geo-mapping
    - IT
      - cyber security
      - system log analysis
    - Retail
      - realtime promotion
      - timely analysis of inventory
    - Fintech
      - Fraud detection
      - Audit trails
      - Risk management
      - Customer Insights
      - Cybersecurity

- Data Lifecycle:

  - Business Case Development
  - Identify Data
  - Data filtering
  - Data extracting
  - Data aggregation
  - Data analysis
  - Data visualization
  - Business Case Validation

- Analytic Categories

  - Descriptive Analytics: what happened
  - Diagnostic Analytics: why it happened
  - Predictive Analytics: what will happen
  - Prescriptive Analytics: what should be done about it

- Technology Categories

  - Flow Perspective

    - data integration
      - Online
      - Offline
    - data in transit
      - Operational
    - data at rest
      - Analytic
      - Real-time interactive
      - Batch-Oriented Analytic

  - Stack Perspective

    - Data Storage and Management
    - Data Cleaning
    - Data Mining
    - Data Visualization
    - Data Reporting
    - Data Ingestion
    - Data Analysis
    - Data Acquisition

### Processing Framework

- Categories:

  ![Alt](/img/DS-Hadoop-ComputingFramwork.png "Computing Framework")

  - General-purpose processing frameworks
  - Abstraction frameworks
  - SQL frameworks
  - Graph processing frameworks
  - Machine learning frameworks
  - Real-time/streaming frameworks
  - Batch Process Framework: bounded, persistent, large
    - MapReduce
      - Input -> Split -> Map -> Reduce -> Output
  - Stream Process Framework: unbounded
    - Storm
      - real-time stream processing.
    - Samza
      - near real-time stream processing.
  - Hybrid Process Framework
    - Spark
      - Spark SQL
      - Spark Streaming
      - Spark MLlib
      - GraphX
    - Flink
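
The `Input -> Split -> Map -> Reduce -> Output` flow listed under MapReduce above can be sketched in plain Python as a single-process word count. This is not the Hadoop API: the function names (`split_input`, `map_phase`, `shuffle`, `reduce_phase`, `word_count`) are assumptions for illustration, and everything runs in memory rather than across a cluster.

```python
from collections import defaultdict


def split_input(text, n_splits):
    """Split: cut the input lines into roughly equal chunks."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(mapped_splits):
    """Shuffle/sort: group all emitted values by key, sorted by key."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(text, n_splits=2):
    splits = split_input(text, n_splits)
    mapped = [map_phase(split) for split in splits]  # one map task per split
    return reduce_phase(shuffle(mapped))


counts = word_count("big data\nbig cluster\ndata data")
```

In a real cluster each split would be processed by a separate map task and the grouped keys would be partitioned across reduce tasks; the sketch keeps only the data flow.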

### Hadoop

#### Hadoop Overview

- Hadoop

  - a big data processing framework that runs on clusters of commodity hardware
  - provides data storage, resource management, and data processing capabilities
  - written in Java
  - governed by the Apache Software Foundation

- Vendors
  - Amazon, Microsoft, AliCloud
  - Huawei, IBM, HPE
  - Cloudera, Hortonworks, MapR

#### Hadoop Architecture

- Intro

  - Data Storage
  - Cluster Management
  - Data Processing

- HDFS

  - Features
    - cost-effective: built on commodity hardware
    - scale-out: distributed system for storing large volumes of data
    - fault-tolerance: n copies of each block
  - Building Blocks
    - Client
    - NameNode
    - Secondary NameNode
    - DataNode
    - Data Block: 128MB by default
    - Metadata
      - editlog
      - fsimage
  - Architecture
    ![Alt](/img/DS-Hadoop-Architecture-HDFS.png "HDFS Architecture")
    - Primary-Secondary namenode
    - master-worker model
    - n-copies redundancy: 3 by default
    - rack awareness HA
  - Use Case
    - not for large number of small files
    - WORM: write once, read many times

- YARN

  - scalability, compatibility, resource utilization, multi-tenancy
  - Building Block
    - Client
    - ResourceManager
      - allocates cluster resources requested by the App Master
      - Scheduler
      - Application Manager
    - NodeManager
      - launches and monitors containers on its node as directed by the ResourceManager
      - Container
        - resource abstraction over RAM, CPU, and I/O
      - App Master
        - get task execution done
  - Architecture
    ![Alt](/img/DS-Hadoop-Architecture-YARN.png "YARN Architecture")
  - Use Case

- Oozie

  - a workflow scheduler system for managing Hadoop jobs in a distributed environment.
  - Building Block
    - Jobs
      - Oozie workflow jobs
      - Oozie coordinator jobs
      - Oozie bundle
  - Architecture
  - Use Case

- HBase

  - a NoSQL, non-relational, distributed, column-oriented database system that runs on top of HDFS
    - Scalable
    - HA
  - Building Blocks
    - client
    - table, row, column family, column, k-v pair
    - Cell
    - HBase Data Model
      ![Alt](/img/DS-Hadoop-Architecture-HBase-DataModel.png "HBase DataModel")
  - Architecture
    ![Alt](/img/DS-Hadoop-Architecture-HBase.png "HBase Architecture")
    - HMaster
    - RegionServer
      - Region
        - BlockCache
        - Memstore
      - HDFS
        - HFile
        - Index
    - Zookeeper
    - HBase Shell
  - Use Case

    - real-time random r/w data service
    - sparse tables processing
    - structured and unstructured data processing
    - no transaction integrity
    - no referential integrity

- MapReduce

  - a batch processing framework for large dataset
  - Building Blocks
    - Map Tasks
      - K-V pairs
      - split the dataset according to business logic
    - Reduce Tasks
      - K-V pairs
      - shuffle, aggregate, sort, summarize, etc.
    - Data Flow
      ![Alt](/img/DS-Hadoop-ComputingFramwork-MapReduce-Dataflow.png "MapReduce Data Flow")
  - Architecture
    ![Alt](/img/DS-Hadoop-ComputingFramwork-MapReduce.png "MapReduce Architecture")
    - Input-Map-Reduce-Output Model
      - Input from HDFS
      - Mapper Class
        - Init Inputs
        - Mapping
        - Shuffling/Sorting
      - Reducer Class
        - Searching
        - Reducing
      - Output to HDFS
    - Job Scheduling (YARN ResourceManager)
      - FIFO Scheduler
      - Capacity Scheduler
      - Fair Scheduler
  - Use Case
    - Batch processing framework

- Hive

  - data warehouse infrastructure for processing structured data with HQL on top of HDFS.
  - Building Block
    - Client
      - Thrift
      - JDBC
      - ODBC
    - Hive Server
    - Hive GUI
    - Hive CLI
    - Hive Driver
      - Compiler
      - Optimizer
      - Executor
    - Metastore
    - Table
    - Partition
    - Bucket
  - Architecture
    ![Alt](/img/DS-Hadoop-Architecture-Hive.png "Hive Architecture")
    - Hive Data Model
      - Data Types
    - Hive Dataflow
      ![Alt](/img/DS-Hadoop-Architecture-Hive-Dataflow.png "Hive Dataflow")
  - Use Case
    - EDW

- Pig

  - an abstraction layer over MapReduce that processes data using the Pig Latin language and the Pig engine
  - Building Block
    - Pig Shell
      - Grunt
    - Pig Server
    - Parser
    - Optimizer
    - Compiler
    - Execution Engine
  - Architecture
    - Pig Latin scripts are translated into MapReduce tasks for execution.
    - Data Model
      - atom: int, long, float, double, chararray, datetime, boolean, bytearray
      - atom -> field -> tuple -> bag -> relation
      - map
    - Operators
      - LOAD, STORE, FILTER, DISTINCT, FOREACH ... GENERATE, STREAM, DUMP
      - JOIN, COGROUP, GROUP, CROSS, ORDER, LIMIT, UNION, SPLIT, DESCRIBE
  - Use Case

    - structured and unstructured data processing

- Spark

  - in-memory cluster processing framework
  - Building Block
    - Spark SQL
    - Spark Streaming
    - MLlib
    - GraphX
    - SparkR
    - Spark Shell
      - scala
  - Architecture
    - APIs
      - R, SQL, Python, Scala, Java
    - RDD
      - Resilient Distributed Dataset
      - Operation:
        - Transformation
        - Action
  - Use Case

- Sqoop
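
The Transformation/Action split listed under Spark's RDD above can be sketched without Spark itself. In Spark, transformations are evaluated lazily and only an action triggers computation; the toy `MiniRDD` class below is a hypothetical stand-in that mimics that behaviour in plain Python. It is not the PySpark API, although the method names `map`, `filter`, `collect`, and `count` mirror real RDD operations.

```python
class MiniRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable, e.g. a list or range
        self._steps = steps    # recorded (kind, function) pairs

    # Transformations: record the step and return a new MiniRDD.
    def map(self, fn):
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return MiniRDD(self._source, self._steps + (("filter", predicate),))

    def _iterate(self):
        # Run every recorded step over each element of the source.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter step rejected the item
                    keep = False
                    break
            if keep:
                yield item

    # Actions: these actually execute the recorded pipeline.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []  # records when the mapped function really runs


def square(x):
    calls.append(x)
    return x * x


numbers = MiniRDD(range(1, 6))
pipeline = numbers.map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)  # nothing has been computed yet
result = pipeline.collect()       # the action triggers the computation
```

Building `pipeline` does no work; `square` is only called once `collect()` runs, which is the lazy behaviour that lets an engine plan the whole chain before executing it.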

#### Hadoop Ecosystem

- Intro
  ![Alt](/img/DS-Hadoop-Ecosystem-1.png "Hadoop Ecosystem")

---

## Best Practices

### Schema Design Principle

- understand your DBMS features and limitations
- understand your APP and DATA patterns
- balance these two facets during the data modeling decision-making process

---



# Python Anaconda

---
id: Python4DS
title: Python Data Science DevOps.
sidebar_label: Data Science with Python
---

## Introduction

---

---

## Anaconda

### Package Management

- conda
  - basics
    - `conda --version`
  - virtual env
    - `conda info --envs`
    - `conda create -n MYENV`
    - `conda create -n MYENV --clone OLDENV`
    - `conda create -n MYENV python=3.6.0`
    - `conda activate MYENV`
    - `conda deactivate`
  - package management
    - `conda list`
    - `conda search PACKAGE`
    - `conda update conda`
    - `conda update anaconda`
    - `conda update --all`
    - `conda update PACKAGE`
    - `conda install PACKAGE`
    - `conda install PACKAGE=M.N.P`
    - `conda remove PACKAGE`
    - `conda build PACKAGE`
  - config sources
    - `conda config --show-source`
    - `conda config --remove channels NOT_WANTED`
  - set conda-forge
    - `conda config --add channels conda-forge`
    - `conda config --set channel_priority strict`

---

---

## Misc

### Learning Resources

- [Anaconda Documentation](https://docs.anaconda.com/anaconda/)
