# Characteristics of Big Data

## Assignment 1

In this assignment, you will calculate the estimated sizes of big data sets and the latency involved in transmitting data. 

This notebook contains the skeleton necessary for you to complete the assignment.  Look for comments that include `# TODO:` for sections that you need to complete. This notebook also contains the functions `check_data_items` and `check_latency_items` that check that you completed the assignment correctly.  Before you submit the assignment, the notebook should run without any assertion errors. 

Warning: Do not change the names of the dataframes (i.e. `df1_1`, `df1_2`, `df1_3`) as the instructor uses these names when checking the assignments. 

In [2]:
# This code helps check assignment data

import pandas as pd
from collections import namedtuple
from dataclasses import dataclass

InformationUnit = namedtuple('InformationUnit', ['name', 'size'])
DataItem = namedtuple('DataItem', ['name', 'size', 'unit'])
LatencyItem = namedtuple('LatencyItem', ['name', 'time', 'unit', 'explanation'])

information_units = dict(
    B=InformationUnit("byte", 1),
    KB=InformationUnit("kilobyte", 1e3),
    MB=InformationUnit("megabyte", 1e6),
    GB=InformationUnit("gigabyte", 1e9),
    TB=InformationUnit("terabyte", 1e12),
    PB=InformationUnit("petabyte", 1e15),
    EB=InformationUnit("exabyte", 1e18),
    ZB=InformationUnit("zettabyte", 1e21),
    YB=InformationUnit("yottabyte", 1e24)
)

time_units = {
    "ms": "millisecond",
    "s": "second",
    "min": "minute"
}

def check_data_items(items):
    # Checks to see if data sizes and units are filled out correctly
    for item in items:
        assert item.size > 0, 'Size for "{}" should be greater than zero'.format(item.name)
        assert item.unit in information_units, 'Unit "{}" not in units dictionary'.format(item.unit)
        
def check_latency_items(items):
    # Checks to see if time sizes and units are filled out correctly
    for item in items:
        assert item.time > 0, 'Time for "{}" should be greater than zero'.format(item.name)
        assert item.unit in time_units, 'Unit "{}" not in time units dictionary'.format(item.unit)
        assert item.explanation != "FILL IN THE EXPLANATION HERE", 'Fill in explanation for "{}"'.format(item.name)
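
The `information_units` table above uses decimal (SI) multipliers, so converting between units is a single multiply and divide. The sketch below illustrates this; `UNIT_BYTES` and `convert_size` are hypothetical names introduced here for illustration and are not part of the assignment skeleton.

```python
# Decimal (SI) byte multipliers, mirroring the information_units table.
UNIT_BYTES = {
    "B": 1, "KB": 1e3, "MB": 1e6, "GB": 1e9, "TB": 1e12,
    "PB": 1e15, "EB": 1e18, "ZB": 1e21, "YB": 1e24,
}


def convert_size(size, from_unit, to_unit):
    """Convert a size between two decimal byte units (hypothetical helper)."""
    return size * UNIT_BYTES[from_unit] / UNIT_BYTES[to_unit]


# Example conversions between adjacent and non-adjacent units.
print(convert_size(1500, "MB", "GB"))
print(convert_size(0.5, "TB", "GB"))
```

Keeping every estimate in bytes until the final step, and converting only for display, avoids mixing units when later parts multiply these sizes by item counts.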

### Assignment 1.1

Provide estimates for the size of various data items.  Please explain how you arrived at the estimates for the size of each item by citing references or providing calculations. 

* Assume all videos are 30 frames per second
* [HEVC](https://en.wikipedia.org/wiki/High_Efficiency_Video_Coding) stands for High Efficiency Video Coding
* See the Wikipedia article on [display resolution](https://en.wikipedia.org/wiki/Display_resolution) for information on HD (1080p) and 4K UHD resolutions. 

| Data Item                                  | Size per Item | 
|:-------------------------------------------|--------------:|
| 128 character message                      | ? Bytes       |
| 1024x768 PNG image                         | ? MB          |
| 1024x768 RAW image                         | ? MB          | 
| HD (1080p) HEVC Video (15 minutes)         | ? MB          |
| HD (1080p) Uncompressed Video (15 minutes) | ? MB          |
| 4K UHD HEVC Video (15 minutes)             | ? MB          |
| 4k UHD Uncompressed Video (15 minutes)     | ? MB          |
| Human Genome (Uncompressed)                | ? GB          |
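
The uncompressed rows of the table follow directly from resolution, color depth, frame rate, and duration. The sketch below works through that arithmetic. It assumes 24-bit RGB color (3 bytes per pixel), 1 byte per character, and decimal units; none of these are stated in the assignment, and the function names are hypothetical. Compressed sizes (PNG, HEVC) depend on content and encoder settings, so they are not computed here.

```python
# Assumptions (not stated in the assignment): 24-bit RGB color, i.e.
# 3 bytes per pixel, and decimal units (1 MB = 1e6 B, 1 GB = 1e9 B).
BYTES_PER_PIXEL = 3
FPS = 30               # frame rate given in the assignment
DURATION_S = 15 * 60   # 15 minutes in seconds


def raw_image_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def raw_video_bytes(width, height, fps=FPS, seconds=DURATION_S):
    """Size of an uncompressed video: frame size x frame rate x duration."""
    return raw_image_bytes(width, height) * fps * seconds


message_bytes = 128 * 1  # 128 characters at 1 byte each (ASCII assumed)
print(f"128 character message: {message_bytes} B")
print(f"1024x768 RAW image: {raw_image_bytes(1024, 768) / 1e6:.2f} MB")
print(f"HD uncompressed, 15 min: {raw_video_bytes(1920, 1080) / 1e9:.1f} GB")
print(f"4K UHD uncompressed, 15 min: {raw_video_bytes(3840, 2160) / 1e9:.1f} GB")
```

Because 4K UHD has four times as many pixels as 1080p, its uncompressed size is four times larger at the same frame rate and duration.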

In [10]:
# TODO: Fill in the estimated sizes for each item
# You may need to adjust the units as well

items1_1 = [
    DataItem('1 Byte', 1, 'B'),
    DataItem("128 character message", 128, "B"),  # 1 character = 8 bits = 1 byte
    DataItem("1024x768 PNG image", 1.6, "MB"),  # 1024 x 768 = 0.786 megapixels; compressed estimate, about two-thirds of the RAW size
    DataItem("1024x768 RAW image", 2.4, "MB"),  # 1024 x 768 = 0.786 megapixels (4:3 aspect ratio) x 3 bytes per pixel ≈ 2.36 MB
    DataItem("HD (1080p) HEVC Video (15 minutes)", 300, "MB"),  # source table row for 1080p (FHD): 5 Mbps, 20 MB per minute, 50 minutes per GB; 20 MB x 15 = 300 MB
    DataItem("HD (1080p) Uncompressed Video (15 minutes)", 168, "GB"),  # 1920 x 1080 pixels x 3 bytes x 30 fps x 900 s ≈ 168 GB
    DataItem("4K UHD HEVC Video (15 minutes)", 6000, "MB"),  # source figure: 4K @ 60 fps, 15 minutes ≈ 6,000 MB
    DataItem("4k UHD Uncompressed Video (15 minutes)", 672, "GB"),  # 3840 x 2160 pixels x 3 bytes x 30 fps x 900 s ≈ 672 GB
    DataItem("Human Genome (Uncompressed)", 3, "GB"),  # ~3 billion nucleotides (3 Gb) stored at 1 byte each ≈ 3 GB
]

# Checks if items properly updated
check_data_items(items1_1)

df1_1 = pd.DataFrame(items1_1)
df1_1.style.hide(axis="index")  # replaces the deprecated Styler.hide_index()


name,size,unit
1 Byte,1.0,B
128 character message,128.0,B
1024x768 PNG image,1.6,MB
1024x768 RAW image,2.4,MB
HD (1080p) HEVC Video (15 minutes),300.0,MB
HD (1080p) Uncompressed Video (15 minutes),168.0,GB
4K UHD HEVC Video (15 minutes),6000.0,MB
4k UHD Uncompressed Video (15 minutes),672.0,GB
Human Genome (Uncompressed),3.0,GB


### Assignment 1.2

Using the estimates for data sizes in the previous part, determine how much storage space you would need for the following items.

* [Twitter statistics](https://www.internetlivestats.com/twitter-statistics/) estimates 500 million tweets are sent each day. For simplicity, assume each tweet is 128 characters. 
* See the [Snappy Github repository](https://github.com/google/snappy) for estimates of Snappy's performance. 
* [Instagram statistics](https://www.omnicoreagency.com/instagram-statistics/) estimates over 100 million videos and photos are uploaded to Instagram every day.   Assume that 75% of those items are 1024x768 PNG photos.
* [YouTube statistics](https://www.omnicoreagency.com/youtube-statistics/) estimates 500 hours of video is uploaded to YouTube every minute.  For simplicity, assume all videos are HD quality encoded using HEVC at 30 frames per second. 


| Data Item                                  | Size per Item | 
|:-------------------------------------------|--------------:|
| Daily Twitter Tweets (Uncompressed)        | ??? TB        |                       
| Daily Twitter Tweets (Snappy Compressed)   | ??? PB        |                       
| Daily Instagram Photos                     | ??? GB        |                       
| Daily YouTube Videos                       | ??? TB        |                       
| Yearly Twitter Tweets (Uncompressed)       | ??? PB        |                       
| Yearly Twitter Tweets (Snappy Compressed)  | ??? PB        |                       
| Yearly Instagram Photos                    | ??? PB        |                       
| Yearly YouTube Videos                      | ??? PB        | 
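
Each row in this table is a per-item size multiplied by a daily count, and then by 365 for the yearly figure. The sketch below shows one way to set that up. The per-item sizes (1.6 MB per PNG, 300 MB per 15 minutes of HD HEVC video) are the Part 1.1 estimates, and the Snappy ratio of 1.6 is an assumption taken from the middle of the roughly 1.5-1.7x range the Snappy README reports for plain text. All variable names are hypothetical.

```python
DAYS_PER_YEAR = 365

# Twitter: 500 million tweets per day at 128 bytes each.
tweets_daily_bytes = 500e6 * 128
# Assumption: Snappy compresses plain text by about 1.6x.
SNAPPY_TEXT_RATIO = 1.6
tweets_snappy_daily_bytes = tweets_daily_bytes / SNAPPY_TEXT_RATIO

# Instagram: 75% of 100 million daily uploads are PNG photos.
# Assumption: 1.6 MB per 1024x768 PNG (Part 1.1 estimate).
photos_daily_bytes = 0.75 * 100e6 * 1.6e6

# YouTube: 500 hours uploaded per minute, 1,440 minutes per day.
# Assumption: 300 MB per 15 minutes of HD HEVC video (Part 1.1 estimate).
hevc_bytes_per_hour = 4 * 300e6
youtube_daily_bytes = 500 * 60 * 24 * hevc_bytes_per_hour

daily = {
    "Twitter (uncompressed)": tweets_daily_bytes,
    "Twitter (Snappy)": tweets_snappy_daily_bytes,
    "Instagram photos": photos_daily_bytes,
    "YouTube videos": youtube_daily_bytes,
}

for name, size in daily.items():
    yearly = size * DAYS_PER_YEAR
    print(f"{name}: {size / 1e12:.3f} TB per day, {yearly / 1e15:.4f} PB per year")
```

Working in bytes and converting only when printing makes it easy to confirm that each yearly figure is exactly 365 times the daily one.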

In [9]:
# TODO: Fill in the estimated sizes for each item
# You may need to adjust the units as well

items1_2 = [
    DataItem("Daily Twitter Tweets (Uncompressed)", 0.064, "TB"),  # 500 million tweets x 128 B = 64 GB
    DataItem("Daily Twitter Tweets (Snappy Compressed)", 0.00004, "PB"),  # 64 GB / 1.6 (Snappy plain-text ratio is about 1.5-1.7x) = 40 GB
    DataItem("Daily Instagram Photos", 120000, "GB"),  # 75 million photos x 1.6 MB
    DataItem("Daily YouTube Videos", 864, "TB"),  # 500 h/min x 1,440 min/day x 1.2 GB/h (300 MB per 15 min)
    DataItem("Yearly Twitter Tweets (Uncompressed)", 0.023, "PB"),  # 64 GB x 365 = 23.4 TB
    DataItem("Yearly Twitter Tweets (Snappy Compressed)", 0.015, "PB"),  # 40 GB x 365 = 14.6 TB
    DataItem("Yearly Instagram Photos", 43.8, "PB"),  # 120 TB x 365 = 43,800 TB
    DataItem("Yearly YouTube Videos", 315.4, "PB"),  # 864 TB x 365 = 315,360 TB
]

# Checks if items properly updated
check_data_items(items1_2)

df1_2 = pd.DataFrame(items1_2)
df1_2.style.hide(axis="index")  # replaces the deprecated Styler.hide_index()


name,size,unit
Daily Twitter Tweets (Uncompressed),0.064,TB
Daily Twitter Tweets (Snappy Compressed),0.00004,PB
Daily Instagram Photos,120000.0,GB
Daily YouTube Videos,864.0,TB
Yearly Twitter Tweets (Uncompressed),0.023,PB
Yearly Twitter Tweets (Snappy Compressed),0.015,PB
Yearly Instagram Photos,43.8,PB
Yearly YouTube Videos,315.4,PB


### Assignment 1.3

Provide estimates of the one way latency for each of the following items.  Please explain how you arrived at the estimates for each item by citing references or providing calculations. 

|                           | One Way Latency      |
|:--------------------------|---------------------:|
| Los Angeles to Amsterdam  | ? ms                 |
| Low Earth Orbit Satellite | ? ms                 |
| Geostationary Satellite   | ? ms                 |
| Earth to the Moon         | ? ms                 |
| Earth to Mars             | ? min                | 
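
A physical lower bound on one-way latency is distance divided by signal speed. The sketch below computes that bound for the space links. The distances are approximate, commonly quoted figures and should be read as assumptions; the Mars range reflects roughly the closest and farthest Earth-Mars separations. Measured latencies are higher because of routing, queuing, and processing, and terrestrial links are slower still because light travels at roughly two-thirds of its vacuum speed in optical fiber.

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum


def one_way_delay_s(distance_km, speed_km_per_s=C_KM_PER_S):
    """Propagation delay in seconds over a straight-line path."""
    return distance_km / speed_km_per_s


# Approximate distances (assumptions, not taken from the assignment).
GEO_ALTITUDE_KM = 35_786   # geostationary altitude above the equator
MOON_KM = 384_400          # average Earth-Moon distance
MARS_MIN_KM = 54.6e6       # approximate closest Earth-Mars distance
MARS_MAX_KM = 401e6        # approximate farthest Earth-Mars distance

geo_hop_ms = one_way_delay_s(GEO_ALTITUDE_KM) * 1000
moon_ms = one_way_delay_s(MOON_KM) * 1000
mars_min_minutes = one_way_delay_s(MARS_MIN_KM) / 60
mars_max_minutes = one_way_delay_s(MARS_MAX_KM) / 60

print(f"Ground to geostationary satellite: {geo_hop_ms:.1f} ms")
print(f"Ground to satellite to ground: {2 * geo_hop_ms:.1f} ms")
print(f"Earth to the Moon: {moon_ms:.0f} ms")
print(f"Earth to Mars: {mars_min_minutes:.1f} to {mars_max_minutes:.1f} min")
```

These figures are useful as a sanity check: an estimate below the propagation bound cannot be right, and one far above it should be explained by routing or processing overhead.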

In [11]:
# TODO: Provide explanations for how you arrived at each estimation

los_angeles_to_amsterdam_explanation = """
https://www.consoleconnect.com/locations/amsterdam/
Round-trip latency between Amsterdam and Los Angeles, CA, USA is 224 ms; one way = 224 / 2 = 112 ms.
"""
low_earth_orbit_satellite_explanation = """
https://www.omniaccess.com/leo/
"""
geostationary_satellite_explanation = """
1. https://web.archive.org/web/20160103125227/https://www.isoc.org/inet97/proceedings/F5/F5_1.HTM#:~:text=For%20GEO%20satellite%20communications%20systems,as%20high%20as%20400%20milliseconds).
2. https://www.satsig.net/latency.htm
"""
earth_to_the_moon_explanation = """
https://www.spaceacademy.net.au/spacelink/commdly.htm
1.3 seconds estimated, multiplied by 1000 to get milliseconds.
"""
earth_to_mars_explanation = """
1. https://interimm.org/comms-latency/en/
2. https://mars.nasa.gov/mars2020/spacecraft/rover/communications/#:~:text=It%20generally%20takes%20about%205,Earth%2C%20depending%20on%20planet%20positions.
About 5 minutes estimated, depending on planet positions.
"""

# TODO: Fill in the estimated times for each item

items1_3 = [
    LatencyItem(
        "Los Angeles to Amsterdam",
        112,
        "ms",
        los_angeles_to_amsterdam_explanation.strip()
    ),
    LatencyItem(
        "Low Earth Orbit Satellite",
        40,
        "ms",
        low_earth_orbit_satellite_explanation.strip()
    ),
    LatencyItem(
        "Geostationary Satellite",
        280,
        "ms",
        geostationary_satellite_explanation.strip()
    ),
    LatencyItem(
        "Earth to the Moon",
        1300,
        "ms",
        earth_to_the_moon_explanation.strip()
    ),
    LatencyItem(
        "Earth to Mars",
        5,
        "min",
        earth_to_mars_explanation.strip()
    ),
]

# Checks if items properly updated
check_latency_items(items1_3)

df1_3 = pd.DataFrame(items1_3)
df1_3.style.hide(axis="index")  # replaces the deprecated Styler.hide_index()


name,time,unit,explanation
Los Angeles to Amsterdam,112,ms,"https://www.consoleconnect.com/locations/amsterdam/ Round-trip latency between Amsterdam and Los Angeles, CA, USA is 224 ms; one way = 224 / 2 = 112 ms."
Low Earth Orbit Satellite,40,ms,https://www.omniaccess.com/leo/
Geostationary Satellite,280,ms,"1. https://web.archive.org/web/20160103125227/https://www.isoc.org/inet97/proceedings/F5/F5_1.HTM#:~:text=For%20GEO%20satellite%20communications%20systems,as%20high%20as%20400%20milliseconds). 2. https://www.satsig.net/latency.htm"
Earth to the Moon,1300,ms,"https://www.spaceacademy.net.au/spacelink/commdly.htm 1.3 seconds estimated, multiplied by 1000 to get milliseconds."
Earth to Mars,5,min,"1. https://interimm.org/comms-latency/en/ 2. https://mars.nasa.gov/mars2020/spacecraft/rover/communications/#:~:text=It%20generally%20takes%20about%205,Earth%2C%20depending%20on%20planet%20positions. About 5 minutes estimated, depending on planet positions."
