## Data Pipelines

https://towardsdatascience.com/data-science-for-startups-introduction-80d022a18aec

Common data scientist roles in an organization:

**For Product (Inference Scientist)**

**As Product (Applied Scientist)**

**For Operations (Systems Scientist)**

**As Operations (ML Engineer)**


**“For” vs “As”:** Is the data scientist supporting a team that is building something (For), or building something themselves (As)?

**Product vs Operations:** Is the data scientist building something that is customer-facing (Product) or a backend system that is critical to running the business (Operations)?

I’ve worked in many of these different functions myself: at Twitch I was embedded on the mobile product team, and had a product analytics focused role (For Product), at Windfall Data I have an applied science role focused on building customer-facing data products (As Product), and at Twitch I supervised a scientist focused on forecasting operational metrics of the platform, such as page-load times (For Operations). 

### Data Science "For" Product:

One of the key responsibilities of this role is to provide insights to teams, which are then used to improve products and company roadmaps. Ex: Data Scientists at Ford Motors.

Performing well in this role usually requires the following skills:

**Exploratory Analysis**

**Experimentation:** If the product team makes a change, how do you evaluate the impact? This can include A/B testing.
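As a rough illustration of evaluating an A/B test, the sketch below compares conversion rates between two variants with a two-proportion z-test. The function name and the sample counts are made up for the example, and a z-test is only one of several reasonable ways to analyze an experiment.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: variant B converts 250/5000 vs. 200/5000 for A.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference is unlikely to be due to chance alone; the threshold to use is a product decision.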

### Data Science "As" Product

Data science teams use machine learning to build data products, and then continuously focus on improving them. Ex: Google Analytics.

A data product is defined as a "product that facilitates an end goal through the use of data".

Skills needed in this role are:

**Machine Learning:**

**Prototyping:** building MVPs

**Software Engineering:** coding to build data products.

### Data Science "For" Operations

The key responsibility of this position is to understand how different factors influence the operational metrics of a product, such as page-load times. It requires an understanding of the infrastructure.

**Forecasting**

**Alerting**

### Data Science "As" Operations

This is a data science role that is usually part of an engineering team, where the goal is to build data products that are required to run the business but are not customer-facing. Building automated ad bidding systems is one example of this role, and building fraud detection systems is another.

**Building distributed systems (e.g., with Spark)**

**Online learning - Incremental Learning**

**DevOps**

## Tracking Data

In order to make data-driven decisions at a startup, you need to collect data about how your products are being used. Usually data is generated directly by the product. For example, a mobile game can generate data points about launching the game, starting additional sessions, and leveling up.

#### Why record data?

*Track the health of the application.*

*Enable experimentation: To determine if making changes to a product is beneficial, you need to be able to measure results.*

*Build data products: In order to make something like a recommendation system, you need to know which items users are interacting with.*

Data collected about how products are used is frequently called **tracking** data.

#### What to record?

**Installs**: How big is the user base? This data usually comes from the app store, such as Google Play.

**Sessions**: How engaged is the user base? This data comes from the client application.

*DAU: Daily active users*

*MAU: Monthly Active Users*

*Avg no. of Sessions per DAU*

*DAU/MAU*: The ratio of daily active users to monthly active users, which indicates how well the app retains users. It ranges from 0 to 1; Facebook's is reported to be around 50%.

*Retention*

*Churn:* The opposite of retention: how many people downloaded your app but are no longer using it.
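The session metrics above can be sketched from a simple session log. The `sessions` data and the `engagement` helper are hypothetical, and MAU is simplified here to distinct users in the same calendar month (a trailing 30-day window is also common).

```python
from datetime import date

# Hypothetical session log: one (user_id, session_date) pair per session.
sessions = [
    ("u1", date(2024, 1, 1)), ("u1", date(2024, 1, 1)),
    ("u2", date(2024, 1, 1)), ("u1", date(2024, 1, 2)),
    ("u3", date(2024, 1, 15)), ("u4", date(2024, 1, 20)),
]

def engagement(sessions, day):
    """Compute DAU, MAU, sessions per DAU and DAU/MAU for the given day."""
    day_sessions = [user for user, d in sessions if d == day]
    dau = len(set(day_sessions))
    # Assumption: MAU = distinct users in the same calendar month as `day`.
    mau = len({user for user, d in sessions
               if (d.year, d.month) == (day.year, day.month)})
    return {
        "dau": dau,
        "mau": mau,
        "sessions_per_dau": len(day_sessions) / dau if dau else 0.0,
        "dau_mau": dau / mau if mau else 0.0,
    }

stats = engagement(sessions, date(2024, 1, 1))
print(stats)
```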


**Monetization**: How much are users spending?

*ARPDAU:* Average Revenue Per Daily Active User

*Conversion Rate*

*Player Rank, player goals, player time, player ID, etc*
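A minimal sketch of the monetization metrics, assuming conversion rate means the share of daily active users who made a purchase (definitions vary between teams). The function name and the numbers are illustrative only.

```python
def monetization(daily_revenue, dau, paying_users):
    """Return (ARPDAU, conversion rate) for a single day."""
    # ARPDAU: revenue for the day divided by daily active users.
    arpdau = daily_revenue / dau if dau else 0.0
    # Assumed definition: fraction of active users who paid that day.
    conversion_rate = paying_users / dau if dau else 0.0
    return arpdau, conversion_rate

arpdau, conversion = monetization(daily_revenue=500.0, dau=1000, paying_users=25)
print(f"ARPDAU = {arpdau:.2f}, conversion = {conversion:.1%}")
```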

**Tracking Specs**

Web browser name and version, user ID, landing page, referring page, URL, client timestamp, etc.
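One way to pin down a tracking spec is to write it as a typed record. The sketch below uses a dataclass with the fields listed above; the class name, field names, and validation rules are assumptions made for illustration.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PageViewEvent:
    """Hypothetical tracking spec for a page-view event."""
    user_id: str
    browser_name: str
    browser_version: str
    url: str
    landing_page: str
    referring_page: str
    client_timestamp: float  # seconds since the Unix epoch, set by the client

    def __post_init__(self):
        # Basic checks so malformed events are rejected at creation time.
        if not self.user_id:
            raise ValueError("user_id is required")
        if self.client_timestamp <= 0:
            raise ValueError("client_timestamp must be positive")

event = PageViewEvent(
    user_id="u123",
    browser_name="Firefox",
    browser_version="121.0",
    url="https://example.com/pricing",
    landing_page="https://example.com/",
    referring_page="https://example.org/article",
    client_timestamp=1704103200.0,
)
record = asdict(event)  # plain dict, ready to be encoded and sent
```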

### Client vs Server Tracking

Collecting data directly from the user's browser is called client tracking, whereas collecting data from the backend server is called server tracking. Server tracking is generally the safer choice: client tracking requires exposing an endpoint on the public web, and you may miss data because users run ad blockers in their browsers.

Generating tracking events from servers rather than client applications helps avoid issues around fraud, security, and versioning.

## Different Ways of Sending Events to an Endpoint

### 1.Web Call

The easiest way to set up tracking is to make web calls with the event data to a website, which can be implemented with a lightweight PHP script. This approach does not scale and is not secure.

### 2.Web Server

Set up a web service to collect tracking events, for example a Jetty web service. This can scale but is still insecure. Nowadays companies use Kafka, Amazon Kinesis, or Google Pub/Sub to build stream processing systems.

### 3.Subscription Services

Using a messaging service such as Pub/Sub enables you to collect massive amounts of tracking data and forward it to a number of consumers.

Some systems, such as Kafka, require setting up and maintaining services, while other approaches, such as Pub/Sub, are managed services that are serverless.

Managed services are great for startups because they reduce the amount of support needed, but they can be costly.
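To make the publish/subscribe idea concrete, here is a toy in-memory broker. It is not the API of Kafka or Pub/Sub; the `InMemoryBroker` class and topic name are invented for the sketch, and it only shows how one published event fans out to several consumers.

```python
from collections import defaultdict

class InMemoryBroker:
    """Toy stand-in for a messaging service: topics with many subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of handlers

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        """Deliver the message to every subscriber; return how many got it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)

broker = InMemoryBroker()
warehouse_rows, alerts = [], []

# Two independent consumers of the same event stream.
broker.subscribe("tracking", warehouse_rows.append)
broker.subscribe("tracking", lambda e: alerts.append(e) if e["type"] == "crash" else None)

broker.publish("tracking", {"type": "session_start", "user_id": "u1"})
broker.publish("tracking", {"type": "crash", "user_id": "u2"})
```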

### 4.Message Encoding

Scalable & Secure.

It is good to avoid language-specific encodings such as Java serialization, because applications and backend servers are often implemented in different languages.

Common ways of encoding tracking events are *JSON and Google's Protocol Buffers*.

A benefit of JSON is that a schema does not need to be defined before you send events, since metadata about the event (the field names) is included in the message. Protocol Buffers require a shared message definition, but produce more compact messages.
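The trade-off between self-describing and schema-based encodings can be sketched with the standard library. JSON carries field names in every message, whereas a schema-based binary format needs the layout agreed in advance. The `struct` layout below is only a stand-in for schema-based formats; it is not the Protocol Buffers wire format, and the event fields are hypothetical.

```python
import json
import struct

event = {"user_id": 42, "level": 7, "session_seconds": 310.5}

# JSON is self-describing: field names travel with every message.
json_bytes = json.dumps(event, separators=(",", ":")).encode("utf-8")
decoded_json = json.loads(json_bytes)

# Fixed binary layout: sender and receiver must agree on field order
# and types ahead of time, but each message is much smaller.
SCHEMA = struct.Struct("<IHd")  # uint32 user_id, uint16 level, float64 seconds
FIELDS = ("user_id", "level", "session_seconds")
binary_bytes = SCHEMA.pack(*(event[name] for name in FIELDS))
decoded_binary = dict(zip(FIELDS, SCHEMA.unpack(binary_bytes)))

print(len(json_bytes), "bytes as JSON,", len(binary_bytes), "bytes as binary")
```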

## Building a Tracking API

A production system should handle the following issues:

**Delivery Failures:** If a message delivery fails, the system should retry sending the message and have a backup mechanism.

**Queuing:** If the endpoint is not available, the tracking library should store events for later transmission.

**Batching:** Instead of sending a large number of small requests, it's often useful to send batches of tracking events.

**Prioritization:** A tracking library should prioritize critical events.
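The four concerns above can be combined in a small client-side sketch. The `TrackingClient` class and its `send_batch` callback are hypothetical; a real library would also persist the queue to disk and wait between retries.

```python
import heapq
import itertools

class TrackingClient:
    """Queues events, sends them in priority order in batches, retries on failure."""

    def __init__(self, send_batch, batch_size=3, max_retries=2):
        self._send_batch = send_batch    # callable(list_of_events) -> bool
        self._batch_size = batch_size
        self._max_retries = max_retries
        self._queue = []                 # heap of (priority, seq, event)
        self._seq = itertools.count()    # tie-breaker keeps FIFO order

    def track(self, event, priority=1):
        """Queue an event; lower priority numbers are sent first."""
        heapq.heappush(self._queue, (priority, next(self._seq), event))

    def pending(self):
        return len(self._queue)

    def flush(self):
        """Send one batch; if every attempt fails, keep the events queued."""
        count = min(self._batch_size, len(self._queue))
        items = [heapq.heappop(self._queue) for _ in range(count)]
        if not items:
            return True
        batch = [event for _, _, event in items]
        for _ in range(1 + self._max_retries):
            if self._send_batch(batch):
                return True
        for item in items:               # delivery failed: requeue for later
            heapq.heappush(self._queue, item)
        return False

# Usage with a fake endpoint that fails on the first attempt only.
sent, attempts = [], {"count": 0}

def flaky_send(batch):
    attempts["count"] += 1
    if attempts["count"] == 1:
        return False
    sent.extend(batch)
    return True

client = TrackingClient(flaky_send, batch_size=2, max_retries=1)
client.track({"type": "level_up"})
client.track({"type": "purchase"}, priority=0)   # critical event
client.track({"type": "session_start"})
ok = client.flush()
```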

## Data Pipelines

The data pipeline is a core component of a data platform.

Typically the destination for a data pipeline is a data lake, such as Hadoop or Parquet files on S3, or a relational database such as Redshift.

#### Data Lake:

Stores a vast amount of raw data.

While a data warehouse stores data in files and folders, a data lake uses a flat architecture to store data.

Each data element in a data lake is assigned a unique identifier and tagged with extended metadata tags.

A data lake holds data in an unstructured way, and there is no hierarchy or organization among the individual pieces of data.

It holds the data in its rawest form: it is not processed or analyzed. Additionally, data lakes accept and retain data from all data sources and of all data types; schemas are applied only when the data is ready to be used.
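The flat layout and schema-on-read idea can be illustrated with a toy in-memory store. The `TinyDataLake` class, its tags, and the sample objects are all invented for this sketch; a real data lake would sit on object storage such as S3.

```python
import json
import uuid

class TinyDataLake:
    """Flat store: every raw object gets a unique ID and metadata tags."""

    def __init__(self):
        self._objects = {}  # object_id -> (tags, raw_bytes)

    def put(self, raw_bytes, **tags):
        """Store raw bytes as-is, with no schema applied."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (tags, raw_bytes)
        return object_id

    def find(self, **tags):
        """Return IDs of objects whose tags match all given key/value pairs."""
        return [oid for oid, (t, _) in self._objects.items()
                if all(t.get(k) == v for k, v in tags.items())]

    def read(self, object_id, schema):
        """Schema-on-read: parse JSON and coerce fields only at query time."""
        _, raw = self._objects[object_id]
        record = json.loads(raw)
        return {field: cast(record[field]) for field, cast in schema.items()}

lake = TinyDataLake()
oid = lake.put(b'{"user_id": "7", "level": "3", "extra": "kept raw"}',
               source="game", event="level_up")
lake.put(b"free-form server log line", source="server")

matches = lake.find(source="game")
row = lake.read(oid, {"user_id": int, "level": int})
```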

#### Data Warehouse:

Stores data in an organized manner.

Operational databases focus on transactions (OLTP), whereas a data warehouse focuses on data analysis. Many types of business data are analyzed via the data warehouse.

Data warehousing is a technique for collecting data from various data sources.

A data warehouse is maintained separately from the operational database. It is the heart of BI and is used for reporting and analysis.

A data warehouse is a central repository where data arrives from one or more data sources.

Data flows into the data warehouse from transactional systems and other sources, and may be structured, semi-structured, or unstructured.

Data is processed, transformed, and ingested so that users can access the processed data in the data warehouse through BI tools.

A data warehouse merges information coming from different data sources into one comprehensive database, which helps with data mining (finding patterns in data).

#### Types of Data Warehouses

Enterprise DW,  Operational DW, Data Mart

**Components of DW**

Load manager, Warehouse Manager, Query Manager, End-user access tools.

### Properties of an Ideal Pipeline

Low event latency

Scalability

Interactive Querying

Versioning

## Types of Data

**Raw Data:** Tracking data with no processing applied. This data is stored in the message-encoded format used to send tracking events, such as JSON.

Raw data does not have any schema applied. All data that is tracked is sent to the endpoint as raw data, and a schema is applied to it later.

**Processed Data:** Processed data is raw data that has been decoded into event-specific formats, with a schema applied.

**Cooked Data:** Processed data that has been aggregated or summarized. Data scientists usually work with processed data and use tools to create cooked data.
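The three stages can be sketched end to end on a few events. The raw messages, field names, and the `process` and `cook` helpers are hypothetical; the point is only to show decoding with a schema, then aggregating.

```python
import json
from collections import Counter

# Raw data: messages exactly as they arrived at the endpoint.
raw = [
    '{"event": "session_start", "user_id": "u1", "ts": "2024-01-01T10:00:00"}',
    '{"event": "session_start", "user_id": "u2", "ts": "2024-01-01T11:30:00"}',
    '{"event": "level_up", "user_id": "u1", "ts": "2024-01-01T11:45:00"}',
    '{"event": "session_start", "user_id": "u1", "ts": "2024-01-02T09:15:00"}',
]

def process(raw_events):
    """Processed data: decode each message and apply a simple schema."""
    processed = []
    for line in raw_events:
        msg = json.loads(line)
        processed.append({
            "event": str(msg["event"]),
            "user_id": str(msg["user_id"]),
            "day": msg["ts"][:10],  # keep only the date part of the timestamp
        })
    return processed

def cook(processed):
    """Cooked data: aggregate processed events into sessions per day."""
    return Counter(e["day"] for e in processed if e["event"] == "session_start")

processed = process(raw)
cooked = cook(processed)
```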

## ETL

ETL is the process of extracting data from different source systems (such as RDBMS sources), transforming it (applying calculations, concatenation, etc.), and finally loading it into the data warehouse.

![image.png](attachment:image.png)

#### Extraction
Extraction moves data from the source systems into a staging area. Transformations are done in the staging area.

Data extraction methods:

1. Full extraction

2. Partial extraction - with update notification

3. Partial extraction - without update notification

Irrespective of the method used, extraction should not affect the source systems, because they are in live production.
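As a hedged illustration of full versus partial extraction, the sketch below reads from an in-memory SQLite table. The `orders` table, its columns, and the timestamp "watermark" approach are assumptions for the example; it is one common way to extract only changed rows, not the only one.

```python
import sqlite3

# Hypothetical source system with a last-updated column.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 25.0, "2024-01-02"), (3, 7.5, "2024-01-03")],
)

def extract(conn, since=None):
    """Full extraction when `since` is None, otherwise only newer rows."""
    if since is None:
        query = "SELECT id, amount, updated_at FROM orders ORDER BY id"
        params = ()
    else:
        # ISO dates compare correctly as strings, so this acts as a watermark.
        query = ("SELECT id, amount, updated_at FROM orders "
                 "WHERE updated_at > ? ORDER BY id")
        params = (since,)
    # Read-only query: the source table is never modified.
    return conn.execute(query, params).fetchall()

full = extract(source)
partial = extract(source, since="2024-01-01")
```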

#### Validation during extraction:

Reconcile records with the source data

Make sure no spam or unwanted data is loaded

Check data types

Remove duplicates
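A few of these checks can be sketched in plain Python. The `validate_extract` function, the row format, and the expected types are hypothetical; reconciliation here is reduced to comparing row counts with what the source reported.

```python
def validate_extract(rows, source_count, types):
    """Return (clean_rows, problems) after basic extraction checks."""
    problems = []
    # Reconcile: did we receive as many rows as the source reported?
    if len(rows) != source_count:
        problems.append(f"row count {len(rows)} != source count {source_count}")

    clean, seen = [], set()
    for row in rows:
        # Data type check: every expected field must have the expected type.
        bad = [k for k, t in types.items() if not isinstance(row.get(k), t)]
        if bad:
            problems.append(f"bad types in {bad}: {row}")
            continue
        # Remove duplicates: identical rows are kept only once.
        key = tuple(row[k] for k in sorted(types))
        if key in seen:
            continue
        seen.add(key)
        clean.append(row)
    return clean, problems

rows = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},    # duplicate
    {"id": 2, "amount": "n/a"},   # wrong type
    {"id": 3, "amount": 7.5},
]
clean, problems = validate_extract(rows, source_count=4,
                                   types={"id": int, "amount": float})
```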

#### Transformation
Extracted raw data needs to be cleansed, mapped, and transformed.

**Validations during Transformation**

Filtering - selecting only required columns

Character set conversion

Data threshold validation (age should not be more than 150)

Conversion of units of measurement

Mapping null values to 0

Splitting and merging columns
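Several of the transformations listed above can be shown on a single row. The input columns (`full_name`, `height_in`, `purchases`) and the `transform` function are made up for illustration.

```python
def transform(row):
    """Apply a few typical transformations to one extracted row (a dict)."""
    # Threshold validation: reject implausible ages.
    age = row.get("age")
    if age is not None and not 0 <= age <= 150:
        raise ValueError(f"age out of range: {age}")

    # Splitting a column: full name into first and last name.
    first, _, last = row["full_name"].partition(" ")

    # Filtering: only the required columns are returned.
    return {
        "first_name": first,
        "last_name": last,
        "age": age,
        # Unit conversion: inches to centimetres.
        "height_cm": round(row["height_in"] * 2.54, 1),
        # Map null to 0.
        "purchases": row["purchases"] if row["purchases"] is not None else 0,
    }

out = transform({
    "full_name": "Ada Lovelace", "age": 36, "height_in": 65,
    "purchases": None, "debug_flag": "x",
})
```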