# Big Data
# Principles and best practices of scalable real-time data systems
## **Part 1 Batch Layer**
1. What is the name of the innovative distributed key/value store that was created by Amazon?

Dynamo. The open source community responded in the years following with Hadoop, HBase, MongoDB, Cassandra, RabbitMQ, and countless other projects.

**1.2.1 Scaling with a queue**

2. How can the problem of an overloaded database be resolved provisionally?

Insert a queue between the web server and the database, so that a worker can batch many writes into a single database request.
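
A minimal sketch of the queue idea in Python, under stated assumptions: the `database` counter, the `apply_batch` and `drain` helpers, and the batch size of 100 are all hypothetical names and values chosen for illustration, and an in-memory `queue.Queue` stands in for a real message queue.

```python
import queue
from collections import Counter

# Hypothetical stand-in for the database: URL -> pageview count.
database = Counter()
db_requests = 0  # number of round trips made to the "database"

def apply_batch(events):
    """Fold a whole batch of pageview events into one database update."""
    global db_requests
    database.update(events)  # Counter.update counts each URL in the batch
    db_requests += 1

def drain(events_queue, batch_size=100):
    """Worker loop: read up to batch_size events at a time and write them together."""
    while True:
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(events_queue.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return
        apply_batch(batch)

# The web server only enqueues events; it never writes to the database directly.
pending = queue.Queue()
for i in range(250):
    pending.put("/home" if i % 2 == 0 else "/about")

drain(pending)
print(database["/home"], database["/about"], db_requests)  # 125 125 3
```

Here 250 pageviews reach the database in 3 requests instead of 250, which is the relief the queue provides.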

**1.2.2 Scaling by sharding the database**

3. Which technique spreads the write load across multiple machines?

Horizontal partitioning, or sharding.
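
As a rough illustration, sharding can be sketched in Python by hashing each key to pick a shard. The shard count, the `shard_for` and `increment` helpers, and the use of plain dictionaries in place of database servers are assumptions made for this example only.

```python
import hashlib

NUM_SHARDS = 4  # assumed number of database servers
shards = [dict() for _ in range(NUM_SHARDS)]  # one dict stands in for each server

def shard_for(key):
    """Map a key to a shard index using a hash that is stable across runs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def increment(url):
    """Send the write for this URL to the one shard that owns it."""
    shard = shards[shard_for(url)]
    shard[url] = shard.get(url, 0) + 1

for url in ["/home", "/about", "/home", "/contact", "/pricing"]:
    increment(url)

# Every key always lands on the same shard, so reads know where to look.
print(shards[shard_for("/home")]["/home"])  # 2
```

Note that changing `NUM_SHARDS` moves keys to different shards, so adding machines requires resharding the existing data.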

**1.2.5 What went wrong?**

4. If the application becomes more and more complex, what is most likely to happen?

The most likely outcome is a human error that corrupts the database and breaks the application.

**1.2.6 How will Big Data techniques help?**

5. How will Big Data techniques help?

Big Data techniques help us balance our database and computing resources when we run into storage and resource problems. They also let us automate and simplify our programs so that the system does not become overly complex.

**1.3 NoSQL is not a panacea**

6. When should we not use Hadoop?

Do not use Hadoop for anything where you need low-latency results.

**1.4 First principles**

7. What does a data system do?

A data system answers questions based on information that was acquired in the past up to the present.

8. What does the Lambda Architecture provide?

The Lambda Architecture provides a general-purpose approach to implementing
an arbitrary function on an arbitrary dataset and having the function return its results
with low latency.

**1.5.7 Minimal maintenance**

9. What are the desired properties of a Big Data system?

Robustness and fault tolerance, low latency reads and updates, scalability, generalization, extensibility, ad hoc queries, minimal maintenance, and debuggability. 

10. Why should we keep the maintenance of our system to a minimum?

Maintenance is the work required to keep a system running smoothly. This includes anticipating when to add machines to scale, keeping processes up and running, and debugging anything that goes wrong in production. An important part of minimizing maintenance is choosing components that have as little implementation complexity as possible.

**1.6.1 Operational complexity**

11. Which operational complexity should we focus on in Big Data systems?

We will focus on one: the need for read/write databases to perform online compaction, and what you have to do operationally to keep things running smoothly.

**1.6.2 Extreme complexity of achieving eventual consistency**

12. What is consistency in terms of database and complexity?

Consistency in database systems refers to the requirement that any given database transaction must change affected data only in allowed ways. It turns out that achieving high availability competes directly with consistency.

**1.7 Lambda Architecture**

13. What is Lambda Architecture? 

The main idea of the Lambda Architecture is to build Big Data systems as a series of layers (speed layer, serving layer, batch layer). Each layer satisfies a subset of the properties and builds upon the functionality provided by the layers beneath it.

**1.7.1 Batch layer**

14. What does the batch layer do?

The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset.

**1.7.2 Serving layer**

15. What does the serving layer do?

The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it.

**1.7.4 Speed layer**

16. What does the speed layer do?

The speed layer compensates for the high latency of updates to the serving layer. It uses fast, incremental algorithms, and the batch layer eventually overrides its results.

### Important!

The Lambda Architecture in full is summarized by these three equations:

- batch view = function(all data)
- realtime view = function(realtime view, new data)
- query = function(batch view, realtime view)
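
The three equations can be sketched in Python using pageview counting as the example function. The function names and sample URLs below are hypothetical; the sketch assumes the realtime view only holds data that the batch view has not absorbed yet.

```python
from collections import Counter

def batch_view(all_data):
    """batch view = function(all data): recomputed from scratch on every run."""
    return Counter(all_data)

def update_realtime_view(realtime_view, new_data):
    """realtime view = function(realtime view, new data): an incremental update."""
    updated = Counter(realtime_view)
    updated.update(new_data)
    return updated

def query(batch, realtime, url):
    """query = function(batch view, realtime view): merge both views."""
    return batch[url] + realtime[url]

master_dataset = ["/home", "/about", "/home"]  # already processed by the batch layer
recent_events = ["/home"]                      # arrived after the last batch run

batch = batch_view(master_dataset)
realtime = update_realtime_view(Counter(), recent_events)

print(query(batch, realtime, "/home"))   # 3
print(query(batch, realtime, "/about"))  # 1
```

Once a later batch run includes the recent events, the corresponding entries can be dropped from the realtime view, which is how the batch layer eventually overrides the speed layer.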
 
## **Part 2 Data model for Big Data**

**2.1 The properties of data**

1. What shape should the data have?

Our data must carry information, and it helps to know what kind when designing a Big Data system. Each piece of data should have something unique about it so that it can be manipulated and used by systems, collected from other databases, and queried.

2. Why do companies use data normalization?

Companies use data normalization to correct duplicates in the database, prevent unwanted storage, reduce database review time and complexity, and facilitate interpretation.

**2.1.1 Data is raw**

3. When should we store unstructured data?

Because unstructured data has no fixed structure, it does not fit a relational architecture, which is where Big Data tools come in. We can still store it in order to extract information from it later and to adapt our algorithms.

**2.1.2 Data is immutable**

4. What do we get from immutability?

We gain tolerance to human error: people make mistakes, but with immutable data a mistake does not cost much of the value in our database, because no existing data can be lost. We also gain simplicity, since relevant data is recorded by adding new data rather than by modifying what is already stored.
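
A small Python sketch of an immutable, append-only dataset, assuming a hypothetical `LocationFact` record and in-memory list as the master dataset. A change of location is recorded as a new fact with a later timestamp, and the old fact stays in place.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class LocationFact:
    user_id: int
    location: str
    timestamp: int  # when this fact became true

master_dataset = []  # append-only list standing in for the master dataset

def record(fact):
    """Add a new fact; existing facts are never modified or removed."""
    master_dataset.append(fact)

def current_location(user_id):
    """Derive the current value from the fact with the latest timestamp."""
    relevant = [f for f in master_dataset if f.user_id == user_id]
    if not relevant:
        return None
    return max(relevant, key=lambda f: f.timestamp).location

record(LocationFact(1, "Philadelphia", 100))
record(LocationFact(1, "Chicago", 200))  # a move is a new fact, not an update

print(current_location(1))  # Chicago
print(len(master_dataset))  # 2
```

If the second fact had been written by mistake, the first one would still be available, which is the protection against human error described above.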

**2.1.3 Data is eternally true**

5. Can data ever be deleted?

Yes, in special cases. With garbage collection you delete data units that have low value and are no longer needed. With regulations, you may be required to purge information under specific conditions.


**2.2 The fact-based model for representing data**

6. What is Fact Based Modeling?

The main purpose of fact based modeling is to capture as much of the semantics as possible, to validate intermediate and final results with the subject matter expert in his preferred language, preferably using concrete illustrations and to remain independent of the representation for a specific implementation.

7. Are duplicates bad?

Duplicates can be useful in Big Data. When user counts and storage reach their limits, or when the provider makes a mistake in handling the data, information can be erased. With duplicates, we have reserve copies that may still contain what is needed.

**2.3 Graph schemas**

8. What is a graph schema?

It is a visual representation of structured, relational data that shows each type of information present in the database. It consists of nodes and the edges that relate them.
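
The node-and-edge structure can be sketched with plain Python collections. The entity names, the relationship labels, and the `related` helper are invented for illustration and are not taken from the book.

```python
# Nodes: entity id -> entity type (hypothetical example data).
nodes = {"alice": "Person", "bob": "Person", "chicago": "Location"}

# Edges: (source node, relationship, target node).
edges = [
    ("alice", "friend", "bob"),
    ("alice", "lives_in", "chicago"),
    ("bob", "lives_in", "chicago"),
]

def related(node, relationship):
    """Return the nodes reached from `node` over edges of the given type."""
    return [target for source, rel, target in edges
            if source == node and rel == relationship]

print(related("alice", "friend"))    # ['bob']
print(related("alice", "lives_in"))  # ['chicago']
```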

## **Part 3 Data model for Big Data: Illustration**

**3.1 Why a serialization framework?**

1. Which errors in the data take the longest to debug?

Data corruption errors.

2. What are Serialization frameworks?

They generate code for whatever languages you wish to use for reading, writing, and validating objects that match your schema.

**3.2 Apache Thrift**

3. What does the Protocol Buffers tool do?

It's a method of serializing structured data. It is useful in developing programs to communicate with each other over a wire or for storing data. The method involves an interface description language that describes the structure of some data and a program that generates source code from that description for generating or parsing a stream of bytes that represents the structured data.

**3.3 Limitations of serialization frameworks**

4. How does the schema language for Apache Thrift benefit us?

It helps us enforce restrictions on the data when we write it to the database, which lowers the error rate. It does not, however, provide functions for dividing the work for greater efficiency.

5. Why does the choice of programming language matter when you read/write data?

It matters because of the decisions you make when working with data. If you use more than one language to read/write data, you need to keep the same logic in each of them, which may be easy for you but hard for anyone else who has to read the code. The recommendation is to work with one language that carries the same logic at all times, so that the code is not confusing to other people.

## **Part 4 Data Storage on the batch layer**

**4.2.1 Using a key/value store for the master dataset**

1. What is the key/value storage intended for?

Key/value stores are meant to be used as mutable stores, which is a problem if enforcing immutability is so crucial for the master dataset. 

2. What is the biggest problem with a key/value store?

The biggest problem, though, is that a key/value store has a lot of things you don’t need: random reads, random writes, and all the machinery behind making those work.

**4.2.2 Distributed filesystems**

3. What is a *distributed filesystem*?

A distributed file system (DFS) is a method of storing and accessing files based on a client/server architecture. In a distributed file system, one or more central servers store files that can be accessed, with proper authorization rights, by any number of remote clients in the network.

**4.4 Storing a master dataset with a distributed filesystem**

4. What are some problems with the empirical data?

Gathering empirical data requires modification to an operating system to monitor these parameters. This process itself may impact on the system performance in a small way.

When interpreting the data, consideration must be given to the environment in which it was gathered. A study of a file system used in an academic environment may not be sufficiently general for other kinds of environments.