# Data Science 

## Process

- Problem, hypothesis, goal, question definition
- Data wrangling
    - gathering, collecting
    - understanding
    - quality and structure tidiness (mess) assessment
        - missing values
        - invalid values
        - inaccurate values
        - inconsistent values
    - cleaning
    - re-assess and iterate
- Data preprocessing
- Data storing
- EDA
- Modeling

- Exploratory data analysis (EDA) and cleaning (ensure better information understanding, availability and accuracy)
    - Variables Identification
        - Naming convention consistency application
        - (classification) balanced/imbalanced dataset handling
    - Univariate Analysis
    - Bi-variate Analysis
    - Missing Values Handling
    - Outliers
- Data preprocessing, feature engineering (makes data ready for ML and has a remarkable impact on predictive power)
    - Variables transformation
        - normalization
        - standardization
    - Variable creation
        - character encoding
    - Working with dates
    - Inconsistent data entry
- Re-assessment and iteration
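
The two variable transformations listed above, normalization and standardization, can be sketched with the standard library alone. This is a minimal illustration, assuming min-max scaling for normalization and z-scores (population standard deviation) for standardization; the `heights_cm` values and function names are made up for the example.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical feature column, invented for illustration.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
normalized = min_max_normalize(heights_cm)
standardized = standardize(heights_cm)
```

Normalization keeps the result bounded, which suits algorithms that expect inputs in a fixed range; standardization is less sensitive to the exact minimum and maximum and is a common default for distance-based models.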


## Big Data and Data Science Industries

- Agriculture (Precision Farming), [Big Data in Smart Farming](https://www.sciencedirect.com/science/article/pii/S0308521X16303754)
- [Manufacturing](https://www.mckinsey.com/business-functions/operations/our-insights/how-big-data-can-improve-manufacturing), [Modeling Smart Farm](https://pdf.sciencedirectassets.com/287278/1-s2.0-S2214317317X00048/1-s2.0-S2214317316301287/main.pdf)
- [Construction](https://www.bostonglobe.com/business/2018/11/23/she-building-better-construction-process/tkR6qB9Ngp6ELpqw0AVfHO/story.html)
- [Smart Cities](https://datafloq.com/read/how-barcelona-deploys-big-data-to-improve-lives-an/297)

## Advantages of using big data and statistics:

- finding hidden opportunities for efficiency,
- using data to become more responsive to clients,
- developing entirely new and unanticipated product lines.

## Three Types of Analytics

- **Descriptive analytics** tell you what happened.
- **Predictive analytics** tell you what is likely to happen in the future.
- **Prescriptive analytics** are built on top of predictive analytics to make recommendations.
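
To make the three types concrete, here is a minimal sketch on a tiny sales series. The figures, the `stock_on_hand` value, and the reorder rule are all hypothetical, and a least-squares line stands in for whatever predictive model would be used in practice.

```python
from statistics import mean

# Hypothetical monthly sales (units sold), invented for illustration.
monthly_sales = [100.0, 110.0, 120.0, 130.0]
months = list(range(len(monthly_sales)))  # 0, 1, 2, 3

# Descriptive: summarize what happened.
average_sales = mean(monthly_sales)

# Predictive: fit a least-squares line and extrapolate one month ahead.
x_bar, y_bar = mean(months), mean(monthly_sales)
covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, monthly_sales))
variance = sum((x - x_bar) ** 2 for x in months)
slope = covariance / variance
intercept = y_bar - slope * x_bar
forecast_next = intercept + slope * len(monthly_sales)

# Prescriptive: turn the forecast into a recommendation using an
# assumed business rule (order enough stock to cover the forecast).
stock_on_hand = 125.0
units_to_order = max(0.0, forecast_next - stock_on_hand)
```

Each stage consumes the output of the previous one: the description feeds the forecast, and the forecast feeds the recommendation.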

## Organization's Readiness for big data and analytics

### Data

#### Data Lake
A large repository of organizational data characterized by best practices in architecture, curation, and access. A data lake is a loose confederation of databases that:

- can have different structures,
- can come from different vendors (internal and external sources),
- can be processed through different tools.

The goals for a data lake include:

- Integrating new sources of data, such as social media feeds or sensor data from IoT,
- Democratization of data (self-service): business users with little or no programming skill can create their own reports and dashboards,
- Access to broader ranges of data coupled with better security and privacy guarantees (personal data regulations, trade secrets).

By sharing data, data lakes can open up communications between company divisions, cutting down silos and multiplying the benefits of insights from each division.

### Technology

- [MapReduce: Simplified Data Processing on Large Clusters (2004)](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf) - MapReduce is a two-step algorithm. `The Map step` accepts input data as key/value pairs, or creates a key where the input has none, and determines all of the values associated with a key; for instance, all the web pages that contain a word. This step is highly parallelizable. `The Reduce step` takes all output of the Map jobs and creates the final dataset linking documents to words. The algorithm is designed from the start to distribute work among large numbers of computers. The best-known open-source implementation of MapReduce is Hadoop.

    In 2009, [Spark](http://spark.apache.org/) extended the MapReduce algorithm so that you can set up any kind of pipeline you like with key/value algorithms. Hadoop traditionally processed data in big batches. Modern needs call for streaming (fast) data to be processed in near real time, so additional tools were added to the picture, like [Storm](http://storm.apache.org/) and [Flink](https://flink.apache.org/). These tools separate the streams into manageable chunks for effective processing, creating microbatches of data called `windows`.
    
- **Containers**. Modern programming involves dividing programs into small modules that perform individual, well-defined services and expose them through APIs (`microservices`). `Virtual machines` (VMs) and `containers` have become important because microservices don't need a full computer system, so running many of them on a single chip can save a lot of hardware. In addition, each service may be created and torn down quickly, or fail and be replaced automatically.
    
    [Docker](https://www.docker.com/) is by far the most popular container platform. [Kubernetes](https://kubernetes.io/) is an open-source system for automating deployment, scaling, and management of containerized applications. There are many other tools for administering and providing resources (`orchestration`).
    
- **AI libraries**. Analytics are being increasingly formalized and packaged into programming libraries, usually providing modern AI tools. When these algorithms were first discovered, they were published in papers and each data scientist would code up their own functions. As they became standard, programmers released libraries to handle classic data analysis tasks such as clustering and classification ([scikit-learn](https://scikit-learn.org/stable/)). A few years later, programmers took the next step and released libraries such as [TensorFlow](https://www.tensorflow.org/) and [MXNet](http://mxnet.apache.org/).
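
The Map and Reduce steps described in the MapReduce item above can be sketched in a single process. This is an illustration of the idea only, not the Hadoop or Spark API: the `pages` corpus and the function names `map_step`, `shuffle`, and `reduce_step` are hypothetical, and the grouping that a real framework performs across machines is simulated with a dictionary.

```python
from collections import defaultdict

# Hypothetical corpus: page identifier -> page text.
pages = {
    "page1.html": "big data needs big clusters",
    "page2.html": "data lakes store raw data",
}


def map_step(page_id, text):
    """Emit one (word, page_id) pair per distinct word on the page."""
    return [(word, page_id) for word in set(text.split())]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(word, page_ids):
    """Collapse all values for one key into a sorted list of pages."""
    return word, sorted(page_ids)


# Each map_step call is independent, which is what makes Map parallelizable.
mapped = [pair for page_id, text in pages.items() for pair in map_step(page_id, text)]
index = dict(reduce_step(word, ids) for word, ids in shuffle(mapped).items())
```

Here `index["data"]` lists both pages, while `index["big"]` lists only the first, which is the word-to-documents dataset the Reduce step is described as producing.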

### People

### Processes

#### Curation: Cleaning, Prepping, and Provenance

Curation means putting things in a suitable order for use. In terms of datasets, it covers a number of tasks in cleaning, prepping, and preserving the provenance (`lineage`) of data.

Typical `cleaning operations` (usually automated) include:
- converting inconsistent values to standards (e.g. units)
- removing bad values (e.g. human age above 150)
- filling in missing values (e.g. through arithmetic tricks such as interpolation)
- removing rows that contain bad values
- deduplication (e.g. based on the proximity of different records)
TODO: work on this list and link to an external notebook for steps
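
As a starting point for the list above, the cleaning operations can be sketched on a small series of sensor readings. Everything here is an assumption made for illustration: the `readings` data, the plausibility range, the choice of the `hour` field as the deduplication key, and linear interpolation as the fill strategy.

```python
# Hypothetical hourly temperature readings with typical defects.
readings = [
    {"hour": 0, "temp": 68.0, "unit": "F"},   # inconsistent unit
    {"hour": 1, "temp": None, "unit": "C"},   # missing value
    {"hour": 1, "temp": None, "unit": "C"},   # duplicate record
    {"hour": 2, "temp": 24.0, "unit": "C"},
    {"hour": 3, "temp": 999.0, "unit": "C"},  # bad value (sensor glitch)
]


def to_celsius(record):
    """Convert inconsistent units to one standard (degrees Celsius)."""
    temp = record["temp"]
    if temp is not None and record["unit"] == "F":
        temp = (temp - 32.0) * 5.0 / 9.0
    return {"hour": record["hour"], "temp": temp, "unit": "C"}


def is_plausible(record, low=-90.0, high=60.0):
    """Keep missing values for later filling; reject out-of-range ones."""
    return record["temp"] is None or low <= record["temp"] <= high


def deduplicate(rows):
    """Keep the first record seen for each hour."""
    seen, unique = set(), []
    for row in rows:
        if row["hour"] not in seen:
            seen.add(row["hour"])
            unique.append(row)
    return unique


def interpolate_missing(values):
    """Fill None gaps linearly between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


converted = [to_celsius(r) for r in readings]
cleaned = deduplicate([r for r in converted if is_plausible(r)])
temps = interpolate_missing([r["temp"] for r in cleaned])
```

The order matters: units are standardized before the range check so that the same limits apply to every row, and gaps are filled last so that bad values never influence the interpolation.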

Even after data is cleaned, there may be a need for more preprocessing before querying. Security might also require `preparation` in order to conform to regulations and protect corporate secrets. The preparation may include:
- Merging records from different sources (can be also considered as part of cleaning)
- Putting data into a schema
- Adding tags to mark the value of the data for various purposes or to indicate provenance information such as source of the data
- Providing aggregate information such as totals or averages
- Determining the security and privacy rules that apply to datasets or individual fields and preparing them so that they are kept safe from unauthorized view (access rules, encryption, separation of sensitive data to different datasets)
- De-identifying or anonymizing data to protect privacy. This involves generalizing the data so that it applies to a larger group of entities.
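
The last preparation step, generalizing data so that each record applies to a larger group, can be sketched as follows. The `patients` records, the ten-year age buckets, and the three-digit ZIP prefix are hypothetical choices for illustration; real de-identification rules depend on the applicable regulation.

```python
from collections import Counter

# Hypothetical records, invented for illustration.
patients = [
    {"name": "Ann", "age": 34, "zip": "02139", "diagnosis": "flu"},
    {"name": "Bob", "age": 37, "zip": "02142", "diagnosis": "asthma"},
    {"name": "Cid", "age": 52, "zip": "02446", "diagnosis": "flu"},
    {"name": "Dee", "age": 58, "zip": "02445", "diagnosis": "flu"},
]


def generalize(record, age_bucket=10, zip_digits=3):
    """Drop the direct identifier (name) and coarsen the quasi-identifiers."""
    low = record["age"] // age_bucket * age_bucket
    masked = "*" * (len(record["zip"]) - zip_digits)
    return {
        "age_range": f"{low}-{low + age_bucket - 1}",
        "zip_prefix": record["zip"][:zip_digits] + masked,
        "diagnosis": record["diagnosis"],
    }


deidentified = [generalize(p) for p in patients]

# Count how many records share each combination of generalized attributes.
group_sizes = Counter((r["age_range"], r["zip_prefix"]) for r in deidentified)
smallest_group = min(group_sizes.values())
```

The value of `smallest_group` indicates how well individuals blend in: a group of size one means a record is still unique after generalization, so the buckets would need to be made coarser.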



#### Architecture