## Why Learn Spark?
Spark is currently one of the most popular tools for big data analytics. You might have heard of other tools such as Hadoop. Hadoop is a slightly older technology, although some companies still use it. Spark is generally faster than Hadoop, which is why Spark has become more popular over the last few years.

There are many other big data tools and systems, each with its own use case. For example, there are database systems like [Apache Cassandra](http://cassandra.apache.org/) and SQL query engines like [Presto](https://prestodb.io/). But Spark is still one of the most popular tools for analyzing large data sets.

### QUIZ QUESTION
To test your current hardware knowledge, match each computer hardware component with the best corresponding description. Don't worry if you're not sure. You'll get more information about this in the next few videos.

|HARDWARE COMPONENT|DESCRIPTION|
|------------------|-----------|
|Memory            |Short-term, quick data storage|
|Solid State Drive |Long-term, safe data storage|
|Network|Connection between computers|
|CPU|Brain of the computer|

## Numbers Everyone Should Know
### CPU (Central Processing Unit)
The CPU is the "brain" of the computer. Every process on your computer is eventually handled by your CPU. This includes calculations and also instructions for the other components of the computer.

### Memory (RAM)
When your program runs, data gets temporarily stored in memory before getting sent to the CPU. Memory is ephemeral storage - when your computer shuts down, the data in the memory is lost.

### Storage (SSD or Magnetic Disk)
Storage is used for keeping data over long periods of time. When a program runs, the CPU will direct the memory to temporarily load data from long-term storage.

### Network (LAN or the Internet)
Network is the gateway for anything that you need that isn't stored on your computer. The network could connect to other computers in the same room (a Local Area Network) or to a computer on the other side of the world, connected over the internet.

### Other Numbers to Know?
You may have noticed a few other numbers involving the L1 and L2 Cache, mutex locking, and branch mispredicts. While these concepts are important for a detailed understanding of what's going on inside your computer, you don't need to worry about them for this course. If you're curious to learn more, check out [Peter Norvig's original blog post](http://norvig.com/21-days.html) from a few years ago, and [an interactive version](http://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html) for today's current hardware.

<img src="images/four_key_machine_components.png">

### QUIZ QUESTION
Rank the hardware components in order from fastest to slowest.

|SPEED|HARDWARE COMPONENT|
|-----|------------------|
|Fastest|CPU|
|2nd Fastest|Memory (RAM)|
|3rd Fastest|Disk Storage (SSD)|
|Slowest|Network|

## Hardware: CPU
The CPU is the brains of a computer. The CPU has a few different functions including directing other components of a computer as well as running mathematical calculations. The CPU can also store small amounts of data inside itself in what are called **registers**. These registers hold data that the CPU is working with at the moment.

For example, say you write a program that reads in a 40 MB data file and then analyzes the file. When you execute the code, the instructions are loaded into the CPU. The CPU then instructs the computer to take the 40 MB from disk and store the data in memory (RAM). If you want to sum a column of data, then the CPU will essentially take two numbers at a time and sum them together. The accumulation of the sum needs to be stored somewhere while the CPU grabs the next number.

This cumulative sum will be stored in a register. The registers make computations more efficient: the registers avoid having to send data unnecessarily back and forth between memory (RAM) and the CPU.
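The running-total idea can be sketched in Python. This is only an analogy, not how a CPU is actually programmed: the variable `running_total` plays the role of the register, and the list `column` stands in for data held in memory. Both names are illustrative and not from the course.

```python
def sum_column(column):
    """Sum values one at a time, keeping the partial result in one accumulator."""
    running_total = 0  # plays the role of the register holding the cumulative sum
    for value in column:  # each value is fetched from "memory" in turn
        running_total = running_total + value  # add two numbers at a time
    return running_total


column = [3, 5, 8, 13]  # hypothetical column of data
print(sum_column(column))  # 29
```

Because the partial sum stays in one place between additions, nothing has to be written back to memory until the final result is ready.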

### QUESTION 1 OF 2
A 2.5 Gigahertz CPU means that the CPU processes 2.5 billion operations per second. Let's say that for each operation, the CPU processes 8 bytes of data. How many bytes could this CPU process per second?
- [ ] 312.5 million bytes per second
- [ ] 3.2 billion bytes per second
- [x] 20 billion bytes per second
- [ ] I'm not sure how to calculate this
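The arithmetic behind this question can be checked with a few lines of Python. The constant names are illustrative; the values come straight from the question.

```python
CLOCK_HZ = 2.5e9         # 2.5 GHz: 2.5 billion operations per second
BYTES_PER_OPERATION = 8  # bytes processed per operation, as stated in the question

bytes_per_second = CLOCK_HZ * BYTES_PER_OPERATION
print(f"{bytes_per_second:,.0f} bytes per second")  # 20,000,000,000 bytes per second
```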

### QUESTION 2 OF 2
Twitter generates about 6,000 tweets per second, and each tweet contains 200 bytes. So in one day, Twitter generates data on the order of:

(6000 tweets / second) x (86400 seconds / day) x (200 bytes / tweet) = 104 billion bytes / day

Knowing that tweets create approximately 104 billion bytes of data per day, how long would it take the 2.5 Gigahertz CPU to analyze a full day of tweets?
- [ ] 0.19 seconds
- [ ] 3.5 seconds
- [x] 5.2 seconds
- [ ] 47 seconds
- [ ] 136 seconds
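As a rough check of both the daily data volume and the processing time, here is a small Python sketch. It assumes the CPU sustains the 20 billion bytes per second worked out in Question 1; the constant names are illustrative.

```python
TWEETS_PER_SECOND = 6_000
SECONDS_PER_DAY = 86_400
BYTES_PER_TWEET = 200
CPU_BYTES_PER_SECOND = 2.5e9 * 8  # 20 billion bytes per second, from Question 1

bytes_per_day = TWEETS_PER_SECOND * SECONDS_PER_DAY * BYTES_PER_TWEET
seconds_to_process = bytes_per_day / CPU_BYTES_PER_SECOND

print(bytes_per_day)                 # 103680000000
print(round(seconds_to_process, 1))  # 5.2
```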

## Hardware: Memory
Most of the time, the CPU sits idle while it waits for input data from memory. Memory takes about 250 times longer to find a random byte than the CPU takes to process that byte. The solution is to load data sequentially.

With CPU and Memory, what else do we need?
It seems like the right combination of CPU and memory can help you quickly load and process data. We could build a single computer with lots of CPUs and a ton of memory. The computer would be incredibly fast.

### QUESTION 1 OF 2
What are the potential trade-offs of creating one computer with lots of CPUs and memory?

It's true that you could build a computer with a ton of CPUs and memory, which is effectively a supercomputer, but it would not be cost-effective.

### Characteristics of Memory
* Efficient
* Ephemeral
* Expensive

Beyond the fact that memory is expensive and ephemeral, we'll learn that for most industry use cases, memory and CPU aren't the bottleneck. Instead, storage and network, which you'll learn about in the next videos, slow down many of the tasks you'll work on in industry.

### QUESTION 2 OF 2
What are the limitations of memory (RAM)?

- [x] RAM is relatively expensive
- [ ] You can only fit 8GB RAM onto a single computer
- [x] Data stored in RAM gets erased when the computer shuts down
- [ ] Operations in RAM are relatively inefficient compared to disk storage and network operations

## Hardware: Storage
Long-term storage such as a hard disk drive (magnetic disk) is cheap and durable, but it is much slower than memory. Loading data from a hard drive can be 200 times slower than loading it from memory. Even newer SSDs are still about 15 times slower.

<img src="images/diff_mem_storage.png">

The difference between seconds and milliseconds may seem small, but when we are working with gigabytes and terabytes of data, it adds up.
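To see how these ratios add up, here is a minimal Python sketch that scales a memory load time by the slowdown factors quoted in this section. The 100 ms baseline and the function name `estimated_load_ms` are hypothetical, chosen only for illustration.

```python
HDD_SLOWDOWN = 200  # hard drive: up to 200 times slower than memory (from the text)
SSD_SLOWDOWN = 15   # SSD: about 15 times slower than memory (from the text)


def estimated_load_ms(memory_ms, slowdown):
    """Scale a memory load time by a slowdown factor to estimate a storage load time."""
    return memory_ms * slowdown


memory_ms = 100  # hypothetical: loading some data set from memory takes 100 ms

print(estimated_load_ms(memory_ms, SSD_SLOWDOWN))  # 1500
print(estimated_load_ms(memory_ms, HDD_SLOWDOWN))  # 20000
```

Under these assumptions, a load that takes a tenth of a second from memory takes 1.5 seconds from an SSD and 20 seconds from a hard drive.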

## Hardware: Network
Over the past few decades, nearly every part of the computer has improved, doubling in efficiency every few years. Unfortunately, network speed has lagged behind the improvements in CPU, memory, and storage, as the following figure shows:

<img src="images/hardware_improvement.png">

As a result, moving data from one machine to another is the most common bottleneck when working with big data.
For example, the same hour of tweets that would take half a second to process from storage would take 30 seconds to download from the Twitter API on a typical network.
<img src="images/ssd_tweet.png">

<img src="images/network_tweet.png">

It usually takes at least 20 times longer to process data when we have to download it from another machine first. For this reason, distributed systems try to minimize **shuffling**.
**Shuffling** is the process of moving data back and forth between the nodes of a cluster.
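The two timings quoted earlier in this section for an hour of tweets can be compared directly. The following Python sketch uses only those two figures; the constant names are illustrative.

```python
STORAGE_SECONDS = 0.5  # processing an hour of tweets from storage (from the text)
NETWORK_SECONDS = 30   # downloading the same hour of tweets (from the text)

slowdown = NETWORK_SECONDS / STORAGE_SECONDS
print(slowdown)        # 60.0
print(slowdown >= 20)  # True
```

In this particular example the network path is 60 times slower, which is consistent with the "at least 20 times" rule of thumb.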

## Hardware: Key Ratios

<img src="images/key_ratios.png">


## Small Data Numbers
If the data you are working with fits into memory, you are working with small data. Suppose you want to analyze a 4 GB log file. It fits easily into memory, so it counts as small data, and a plain Python script is the better choice.

## Big Data Numbers
Now suppose your log data explodes to 200 GB. The same Python code won't work because you'll run out of memory.
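The small-versus-big distinction comes down to one comparison, sketched below in Python. The 8 GB of available memory is a hypothetical laptop, not a figure from the course, and `fits_in_memory` is an illustrative helper.

```python
GB = 10**9  # decimal gigabyte, used here for round numbers


def fits_in_memory(data_bytes, available_memory_bytes):
    """Return True when the whole data set can be held in memory at once."""
    return data_bytes <= available_memory_bytes


AVAILABLE_MEMORY = 8 * GB  # hypothetical laptop; an assumption, not from the course

print(fits_in_memory(4 * GB, AVAILABLE_MEMORY))    # True
print(fits_in_memory(200 * GB, AVAILABLE_MEMORY))  # False
```

In practice a program needs extra working memory beyond the raw file size, so treat this check as an optimistic lower bound.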

### QUIZ QUESTION
What happened to your computer when you started to run the program with the larger data set? Keep in mind that in this configuration, the data set is stored on your local computer and does not need to travel across a network. (Select all that apply.)
- [ ] It took too long to move data across the network
- [x] The CPU couldn't load data quickly from memory
- [x] The memory couldn't load data quickly from storage
- [ ] The CPU simply couldn't handle the large data set
- [ ] The storage couldn't load data quickly from the network
- [ ] The CPU couldn't load data quickly from the network

Click [here](https://www.youtube.com/watch?v=QjPr7qeJTQk) for more explanation about the question