# 1.13 How Big Is Big Data?
For computer scientists and data scientists, data is now as important as writing programs
* According to IBM, approximately 2.5 quintillion bytes (2.5 _exabytes_) of data are created daily, and 90% of the world’s data was created in the last two years
* According to IDC, the global data supply will reach 175 _zettabytes_ (equal to 175 trillion gigabytes or 175 billion terabytes) annually by 2025

### Megabytes (MB)
* One megabyte is about one million (actually 2<sup>20</sup>) bytes
* Many of the files we use on a daily basis require one or more MBs of storage
    * MP3 audio files—High-quality MP3s range from 1 to 2.4 MB per minute
    * Photos—JPEG format photos taken on a digital camera can require about 8 to 10 MB per photo 
    * Video—Smartphone cameras can record video at various resolutions
        * Each minute of video can require many megabytes of storage
        * On one of our iPhones, the **Camera** settings app reports that 1080p video at 30 frames-per-second (FPS) requires 130 MB/minute and 4K video at 30 FPS requires 350 MB/minute

### Gigabytes (GB)
* One gigabyte is about 1000 megabytes (actually 2<sup>30</sup> bytes)
* A dual-layer DVD can store up to 8.5 GB, which translates to:
    * as much as 141 hours of MP3 audio
    * approximately 1000 photos from a 16-megapixel camera
    * approximately 65 minutes of 1080p video at 30 FPS (at 130 MB/minute)
    * approximately 24 minutes of 4K video at 30 FPS (at 350 MB/minute)
* Highest-capacity Ultra HD Blu-ray discs can store up to 100 GB of video
* Streaming a 4K movie can use between 7 and 10 GB per hour (highly compressed)
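
The dual-layer DVD figures can be rechecked by dividing the disc's capacity by the per-file rates quoted in the Megabytes section. The following is a minimal sketch, not a definitive calculation: the function and constant names are our own, it uses the decimal approximation of 1000 MB per GB, it takes the low end of the MP3 range (1 MB/minute), and it assumes 8.5 MB per photo.

```python
# Approximate storage rates taken from the Megabytes section.
MB_PER_GB = 1000                 # decimal approximation (assumption)

MP3_MB_PER_MINUTE = 1.0          # low end of the 1 to 2.4 MB/minute range
PHOTO_MB = 8.5                   # assumed size of one 16-megapixel JPEG
VIDEO_1080P_MB_PER_MINUTE = 130  # iPhone Camera settings figure
VIDEO_4K_MB_PER_MINUTE = 350     # iPhone Camera settings figure


def capacity_summary(gigabytes):
    """Return how much of each media type fits in the given capacity."""
    megabytes = gigabytes * MB_PER_GB
    return {
        'mp3_hours': megabytes / MP3_MB_PER_MINUTE / 60,
        'photos': megabytes / PHOTO_MB,
        'minutes_1080p': megabytes / VIDEO_1080P_MB_PER_MINUTE,
        'minutes_4k': megabytes / VIDEO_4K_MB_PER_MINUTE,
    }


dvd = capacity_summary(8.5)  # dual-layer DVD

for name, amount in dvd.items():
    print(f'{name}: {amount:,.1f}')
```

Calling `capacity_summary` with a different capacity (for example, `100` for an Ultra HD Blu-ray disc) gives the corresponding estimates for other media.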

### Terabytes (TB)
* One terabyte is about 1000 gigabytes (actually 2<sup>40</sup> bytes)
* Recent disk drives for desktop computers come in sizes up to 15 TB, which is equivalent to
    * approximately 28 years of MP3 audio
    * approximately 1.68 million photos from a 16-megapixel camera
    * approximately 1,920 hours of 1080p video at 30 FPS (at 130 MB/minute)
    * approximately 714 hours of 4K video at 30 FPS (at 350 MB/minute)
* Nimbus Data now has the largest solid-state drive (SSD) at 100 TB, which can store 6.67 times the 15-TB examples of audio, photos and video listed above
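
Each unit in this section is described as "about" a power of 1000 but "actually" a power of two. The sketch below compares the two definitions so you can see how the gap widens as the units get larger. The `UNITS` list and the `unit_sizes` function are hypothetical names introduced only for this illustration.

```python
# Units discussed in this section, smallest first.
UNITS = ['megabyte', 'gigabyte', 'terabyte',
         'petabyte', 'exabyte', 'zettabyte']


def unit_sizes():
    """Yield (name, decimal_bytes, binary_bytes) for each unit."""
    # A megabyte is 1000**2 (decimal) or 2**20 (binary) bytes,
    # a gigabyte is 1000**3 or 2**30 bytes, and so on.
    for power, name in enumerate(UNITS, start=2):
        yield name, 1000 ** power, 2 ** (10 * power)


for name, decimal, binary in unit_sizes():
    difference = (binary - decimal) / decimal
    print(f'{name:>9}: {binary:,} bytes is {difference:.1%} '
          f'more than {decimal:,}')
```

The difference is under 5% for a megabyte but grows to roughly 10% for a terabyte, which is one reason advertised drive capacities and the sizes reported by operating systems can disagree.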

### Petabytes, Exabytes and Zettabytes 
* There are nearly four billion people online creating about 2.5 quintillion bytes of data each day
    * That is 2500 petabytes (each petabyte is about 1000 terabytes) or 2.5 exabytes (each exabyte is about 1000 petabytes)
* According to a March 2016 _AnalyticsWeek_ article, within five years there will be over 50 billion devices connected to the Internet and by 2020 we’ll be producing 1.7 megabytes of new data every second _for every person on the planet_
* At today’s numbers (approximately 7.7 billion people), that’s about 
    * 13 petabytes of new data per second
    * 780 petabytes per minute
    * 46,800 petabytes (46.8 exabytes) per hour 
    * 1,123 exabytes per day—that’s 1.123 zettabytes (ZB) per day (each zettabyte is about 1000 exabytes)
* At the rates quoted earlier (350 MB/minute for 4K video and about 10 MB per photo), that’s the equivalent of roughly 53 billion hours (about six million years) of 4K video every day, or more than 100 trillion photos every day!
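
The per-second, per-minute, per-hour and per-day figures are a chain of unit conversions. Here is a minimal sketch of that arithmetic, assuming decimal units (1000 of each unit per next-larger unit) and rounding the per-second figure to whole petabytes before scaling up, as the list appears to do. The variable names are our own.

```python
PEOPLE = 7.7e9                   # approximate world population
MB_PER_PERSON_PER_SECOND = 1.7   # projected new data per person

MB_PER_PB = 1000 ** 3            # MB -> GB -> TB -> PB (decimal units)
PB_PER_EB = 1000
EB_PER_ZB = 1000

# About 13.09 PB/second, rounded to whole petabytes before scaling up.
pb_per_second = round(PEOPLE * MB_PER_PERSON_PER_SECOND / MB_PER_PB)

pb_per_minute = pb_per_second * 60
pb_per_hour = pb_per_minute * 60
eb_per_day = pb_per_hour * 24 / PB_PER_EB
zb_per_day = eb_per_day / EB_PER_ZB

print(f'{pb_per_second:,} PB per second')
print(f'{pb_per_minute:,} PB per minute')
print(f'{pb_per_hour:,} PB per hour')
print(f'{eb_per_day:,.0f} EB per day ({zb_per_day:.3f} ZB per day)')
```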

### Additional Big-Data Stats 
* For an entertaining real-time sense of big data, check out https://www.internetlivestats.com, with various statistics, including the numbers so far today of
    * Google searches
    * Tweets
    * Videos viewed on YouTube
    * Photos uploaded on Instagram

### Additional Big-Data Stats (cont.)
* Every hour, YouTube users upload 24,000 hours of video, and almost 1 billion hours of video are watched on YouTube every day
* Every second, there are 51,773 GBs (or 51.773 TBs) of Internet traffic, 7894 tweets sent, 64,332 Google searches and 72,029 YouTube videos viewed
* On Facebook each day there are 800 million “**likes**,” 60 million emojis are sent, and there are over two billion searches of the more than 2.5 trillion Facebook posts since the site’s inception

### Additional Big-Data Stats (cont.)
* In June 2017, Will Marshall, CEO of Planet, said the company has 142 satellites that image the whole planet’s land mass once per day
    * They add one million images and seven TBs of new data each day
    * They’re using machine learning on that data to improve crop yields, see how many ships are in a given port and track deforestation
    * With respect to Amazon deforestation, he said: “Used to be we’d wake up after a few years and there’s a big hole in the Amazon. Now we can literally count every tree on the planet every day.” 

### Additional Big-Data Stats (cont.)
* Domo, Inc. has a nice infographic called “Data Never Sleeps 6.0” showing how much data is generated _every minute_, including:
    * 473,400 tweets sent
    * 2,083,333 Snapchat photos shared
    * 97,222 hours of Netflix video viewed
    * 12,986,111 text messages sent
    * 49,380 Instagram posts
    * 176,220 Skype calls
    * 750,000 Spotify songs streamed
    * 3,877,140 Google searches
    * 4,333,560 YouTube videos watched

### Computing Power Over the Years
* Data is getting more massive and so is the computing power for processing it
* Performance of today’s processors is measured in terms of **FLOPS (floating-point operations per second)**
* In the early to mid-1990s, the fastest supercomputer speeds were measured in gigaflops (10<sup>9</sup> FLOPS)
* Late 1990s: Intel produced the first teraflop (10<sup>12</sup> FLOPS) supercomputers
* Early-to-mid 2000s: Speeds reached hundreds of teraflops
* 2008: IBM released the first petaflop (10<sup>15</sup> FLOPS) supercomputer
* Currently, the fastest supercomputer—the IBM Summit, located at the Department of Energy’s (DOE) Oak Ridge National Laboratory (ORNL)—is capable of 122.3 petaflops
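
To get a feel for these orders of magnitude, the sketch below estimates how long one fixed job would take at each speed in the timeline. The workload of 10<sup>18</sup> floating-point operations is a hypothetical figure chosen only for illustration, and the calculation assumes each machine sustains its full rated speed, which real workloads rarely achieve.

```python
# Speeds from the timeline, in floating-point operations per second.
SPEEDS_IN_FLOPS = {
    'gigaflop machine (early to mid-1990s)': 10 ** 9,
    'teraflop machine (late 1990s)': 10 ** 12,
    'petaflop machine (2008)': 10 ** 15,
    'IBM Summit (122.3 petaflops)': 122.3e15,
}

WORKLOAD = 10 ** 18  # hypothetical job size in floating-point operations


def seconds_needed(operations, flops):
    """Return the idealized time to run a job at a sustained speed."""
    return operations / flops


for machine, flops in SPEEDS_IN_FLOPS.items():
    seconds = seconds_needed(WORKLOAD, flops)
    days = seconds / (60 * 60 * 24)
    print(f'{machine}: {seconds:,.1f} seconds ({days:,.4f} days)')
```

Under these assumptions, a job that would occupy a gigaflop machine for about 31.7 years finishes in roughly 8 seconds at Summit's rated speed.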

### Computing Power Over the Years (cont.)
* Distributed computing can link thousands of personal computers via the Internet to produce even more FLOPS
* 2016: The Folding@home network—a distributed network in which people volunteer their personal computers’ resources for use in disease research and drug design—was capable of over 100 petaflops
* Companies like IBM are now working toward supercomputers capable of exaflops (10<sup>18</sup> FLOPS) 

### Computing Power Over the Years (cont.)
* **Quantum computers** now under development theoretically could operate at 18,000,000,000,000,000,000 times the speed of today’s “conventional computers”! 
* In one second, a quantum computer theoretically could do staggeringly more calculations than the total that have been done by all computers since the world’s first computer appeared. 
    * Such speeds could wreak havoc with blockchain-based cryptocurrencies like Bitcoin
    * Engineers are already rethinking blockchain to prepare for such massive increases in computing power

### Computing Power Over the Years (cont.)
* Computing power’s cost continues to decline, especially with cloud computing
* People used to ask the question, “How much computing power do I need on my system to deal with my _peak_ processing needs?” 
* That thinking has shifted to “Can I quickly carve out on the cloud what I need _temporarily_ for my most demanding computing chores?” 
    * Pay for only what you use to accomplish a given task

### Processing the World’s Data Requires Lots of Electricity
* Data from the world’s Internet-connected devices is exploding, and processing that data requires tremendous amounts of energy. 
* According to a recent article, energy use for processing data in 2015 was growing at 20% per year and consuming approximately three to five percent of the world’s power
    * That total data-processing power consumption could reach 20% by 2025 
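
A quick compound-growth check shows how 20% annual growth relates to the 2025 figure. This is only a rough sketch under a strong simplifying assumption that is ours, not the article's: the world's total power consumption stays flat, so data processing's share grows at the same 20% per year as its energy use. The function name is hypothetical.

```python
GROWTH_RATE = 0.20                 # 20% per year, from the text
START_YEAR, END_YEAR = 2015, 2025


def projected_share(start_share, years, rate=GROWTH_RATE):
    """Compound a starting percentage share over a number of years.

    Assumes total world power consumption does not change.
    """
    return start_share * (1 + rate) ** years


for start in (3.0, 5.0):           # the three-to-five-percent range
    share = projected_share(start, END_YEAR - START_YEAR)
    print(f'{start:.0f}% in {START_YEAR} -> {share:.1f}% in {END_YEAR}')
```

Starting from 3%, ten years of 20% growth gives about 18.6%, which is in the neighborhood of the 20% projection. Starting from 5% gives about 31%, so the result is sensitive to the starting estimate.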

### Processing the World’s Data Requires Lots of Electricity (cont.)
* Another enormous electricity consumer is the blockchain-based cryptocurrency Bitcoin
    * Processing just one Bitcoin transaction uses approximately the same amount of energy as powering the average American home for a week! 
    * The energy use comes from the process Bitcoin “miners” use to prove that transaction data is valid

### Big-Data Opportunities
* Big data’s appeal to big business is undeniable given the rapidly accelerating accomplishments
* Many companies are making significant investments and getting valuable results through technologies in this book, such as big data, machine learning, deep learning and natural-language processing
* This is forcing competitors to invest as well, rapidly increasing the need for computing professionals with data-science and computer-science experience

## 1.13.1 Big Data Analytics
* The term “data analysis” was coined in 1962, though people have been analyzing data using statistics for thousands of years going back to the ancient Egyptians
* Big data analytics is a more recent phenomenon—the term “big data” was coined around 2000
* Four of the V’s of big data:
    1. Volume: the amount of data the world is producing is growing exponentially.
    2. Velocity: the speed at which that data is being produced, the speed at which it moves through organizations and the speed at which data changes are all growing quickly.
    3. Variety: data used to be alphanumeric (that is, consisting of alphabetic characters, digits, punctuation and some special characters). Today it also includes images, audio, video and data from an exploding number of Internet of Things sensors in our homes, businesses, vehicles, cities and more.
    4. Veracity: the validity of the data. Is it complete and accurate? Can we trust that data when making crucial decisions? Is it real?

## 1.13.1 Big Data Analytics (cont.)
* Most data is now being created digitally in a _variety_ of types, in extraordinary _volumes_ and moving at astonishing _velocities_
* Digital data storage has become so vast in capacity, cheap and small that we can now conveniently and economically retain _all_ the digital data we’re creating

## 1.13.1 Big Data Analytics (cont.)
To get a sense of big data’s scope in industry, government and academia, check out the high-resolution graphic 
> http://mattturck.com/wp-content/uploads/2018/07/Matt_Turck_FirstMark_Big_Data_Landscape_2018_Final.png

## 1.13.2 Data Science and Big Data Are Making a Difference: Use Cases
* The data-science field is growing rapidly because it’s producing significant results that are making a difference
* The following table lists some data-science and big-data use cases

| Data-science use cases 
| ------------ 
| anomaly detection
| assisting people with disabilities
| auto-insurance risk prediction
| automated closed captioning
| automated image captions
| automated investing
| autonomous ships
| brain mapping
| caller identification
| cancer diagnosis/treatment 
| carbon emissions reduction 
| classifying handwriting
| computer vision
| credit scoring
| crime: predicting locations 
| crime: predicting recidivism 
| crime: predictive policing
| crime: prevention
| CRISPR gene editing
| crop-yield improvement
| customer churn
| customer experience
| customer retention
| customer satisfaction
| customer service
| customer service agents
| customized diets
| cybersecurity
| data mining
| data visualization
| detecting new viruses
| diagnosing breast cancer 
| diagnosing heart disease
| diagnostic medicine
| disaster-victim identification
| drones
| dynamic driving routes
| dynamic pricing
| electronic health records
| emotion detection
| energy-consumption reduction
| facial recognition
| fitness tracking
| fraud detection
| game playing
| genomics and healthcare
| Geographic Information Systems (GIS)
| GPS Systems
| health outcome improvement
| hospital readmission reduction
| human genome sequencing
| identity-theft prevention
| immunotherapy
| insurance pricing
| intelligent assistants
| Internet of Things (IoT) and medical device monitoring
| Internet of Things and weather forecasting
| inventory control
| language translation
| location-based services
| loyalty programs
| malware detection
| mapping 
| marketing
| marketing analytics 
| music generation
| natural-language translation
| new pharmaceuticals 
| opioid abuse prevention
| personal assistants
| personalized medicine 
| personalized shopping 
| phishing elimination
| pollution reduction
| precision medicine
| predicting cancer survival
| predicting disease outbreaks
| predicting health outcomes
| predicting student enrollments
| predicting weather-sensitive product sales 
| predictive analytics
| preventative medicine
| preventing disease outbreaks
| reading sign language
| real-estate valuation
| recommendation systems
| reducing overbooking
| ride sharing
| risk minimization
| robo financial advisors 
| security enhancements
| self-driving cars
| sentiment analysis
| sharing economy
| similarity detection
| smart cities
| smart homes
| smart meters
| smart thermostats
| smart traffic control
| social analytics
| social graph analysis 
| spam detection 
| spatial data analysis
| sports recruiting and coaching
| stock market forecasting
| student performance assessment
| summarizing text
| telemedicine
| terrorist attack prevention
| theft prevention 
| travel recommendations
| trend spotting 
| visual product search
| voice recognition
| voice search
| weather forecasting

------
&copy;1992&ndash;2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 1 of the book [**Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud**](https://amzn.to/2VvdnxE).
