---
title: "Big data tools"
subtitle: "Lesson 3: Cloud computing and the AWS platform"
author: Arthur Katossky & Rémi Pépin
institute: "ENSAI (Rennes, France)"
date: "March 2020"
output:
  xaringan::moon_reader:
    css: ["default", "css/presentation.css"]
    nature:
      ratio: 16:10
      scroll: false
---
<!-- Course: 1h30
Cloud computing principles:
> infrastructure as a service
> platform as a service
> software as a service
Cloud computing ecosystem: Azure, AWS, GCP...
Interface & principles
- organisation: storage space, VMs, PaaS (elastic map reduce), SaaS (SageMaker)
- services not used: security, DNS servers, etc.
- pricing
In practice:
- security issues (ssh etc.)
- setting up a storage space
- setting up a virtual machine
- setting up a cluster
-->
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
# TO DO:
# [X] The principles of IaaS
# [X] Storage
# [X] Virtual machines
# [X] Other services
# [X] The principles of PaaS
# [ ] Detailed services
# [ ] Comparison between providers
# [ ] Add non-AWS examples
# [X] Add something about data grids
# [ ] Read: http://www.ianfoster.org/wordpress/wp-content/uploads/2014/01/History-of-the-Grid-numbered.pdf
# [ ] Add outline for the course
# [ ] Add something about data duplication and data decay
# [ ] More about encryption
# [ ] Add something about the physical location of clusters (why do you always have to choose a zone?)
# [ ] Cloud vs. on-premises
# [ ] Hybrid cloud
```
## Course outline
1. A quick history of IT hardware in enterprise
2. What is cloud computing?
3. Cloud computing services
    1. Infrastructure as a Service (IaaS)
    2. Platform as a Service (PaaS)
    3. Software as a Service (SaaS)
4. Virtualization
5. Ecology
---
## A quick history of IT hardware in enterprise
.pull-left[
- 1940s to mid-1970s (still in use today): the mainframe era
  - A mainframe is not a supercomputer
  - Commercial, scientific, and large-scale computing only
  - Only large enterprises can afford mainframes
  - No direct user interaction (screen, keyboard); input goes through punched cards
  - Optimized for throughput (I/O operations)
  - Use cases: credit card transactions, flight reservations, stock markets
]
.pull-right[![](http://www.rhod.fr/images_dossiers/wargames/wargames03.jpg)
*Source : WarGames, John Badham, 1983*
]
---
## A quick history of IT hardware in enterprise
.pull-left[
- Mid-1970s to mid-1990s (still in use today): the fat client era
  - Development of microcomputing: personal computers become cheaper and cheaper
  - Instead of one mainframe for both data and computing, one server holds the data and computations run on personal computers (**fat clients**)
  - Fat clients are independent of each other. They need the data server only to download the data
]
.pull-right[.center.height450[![](https://img-9gag-fun.9cache.com/photo/a0Rz2rL_700bwp.webp)]
*Source : Friends Season 2 Episode 8*
]
---
## A quick history of IT hardware in enterprise
.pull-left[
- Mid-1990s to 2010 (still in use today): the in-house datacenter era
  - Mid-1990s: democratization of the Internet
  - Communication speeds increase quickly
  - Centralisation of data and computing in datacenters
  - Personal computers for daily tasks (mail, office software, etc.)
  - Servers for computation

Note: you can go one step further, as at ENSAI, and put everything on the server.
]
.pull-right[<img src="https://leshorizons.net/wp-content/uploads/2019/06/DataCenter.jpg" style="zoom:100%;" />
*Source : https://leshorizons.net/datacenter/*
]
---
## A quick history of IT hardware in enterprise
.pull-left[
- Mid-1990s to 2010 (still in use today): grid computing
  - what enterprises could afford was out of reach for scientists
  - idea: use idle machines and make them work together on bigger tasks (**CPU scavenging**)
  - parallel-computing ideas could now be applied _between_ computers
  - in November 2019, the largest grid (BOINC, Berkeley, [link](https://boinc.berkeley.edu)) gathered 495,820 computers and averaged about 30 petaflops, on par with the 5th fastest supercomputer
  - for comparison, the fastest supercomputer at that date (Summit, IBM, [link](https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit)) delivers ~200 petaflops
  - still in use today ([SETI@home](https://setiathome.ssl.berkeley.edu/))

**Source:** Top 500, the List ([link](https://www.top500.org/lists/2019/11))
]
.pull-right[<img src="https://www.lebigdata.fr/wp-content/uploads/2018/02/grid-computing-vs-cluster-computing-1.jpg" style="zoom:100%;" />
*Source : https://www.lebigdata.fr/wp-content/uploads/2018/02/grid-computing-vs-cluster-computing-1.jpg*
]
---
## A quick history of IT hardware in enterprise
.pull-left[
- 2005 to ?: the cloud computing era
  - 2002: Amazon Web Services
  - 2006: Amazon Elastic Compute Cloud (EC2)
  - 2008: Google App Engine
  - 2008: the analyst firm Gartner announces the rise of cloud computing
  - 2010: Microsoft Azure

Cloud computing revenue in 2018: 182.4 billion USD; forecast for 2022: 331.4 billion USD ([Source](https://www.gartner.com/en/newsroom/press-releases/2019-04-02-gartner-forecasts-worldwide-public-cloud-revenue-to-g))
]
.pull-right[ ![](https://www.astelis.fr/wp-content/uploads/cloud-maillage.jpg)
*Source : https://www.astelis.fr/wp-content/uploads/cloud-maillage.jpg*
]
---
## A quick history of IT hardware in enterprise
.pull-left[
- From around 2020 (?): the edge/fog/mist computing era
  - 2016: W. Shi, J. Cao, Q. Zhang, Y. Li and L. Xu, "Edge Computing: Vision and Challenges", IEEE Internet of Things Journal
  - 2017: M. Satyanarayanan, "The Emergence of Edge Computing"
  - 2019: Jad Darrous, Thomas Lambert, Shadi Ibrahim, "On the Importance of Container Image Placement for Service Provisioning in the Edge"
  - 2019: Pedro Silva, Alexandru Costan, Gabriel Antoniu, "Investigating Edge vs. Cloud Computing Trade-offs for Stream Processing", BigData 2019
]
.pull-right[ ![](https://img.alicdn.com/tfs/TB1QdC2KhjaK1RjSZKzXXXVwXXa-3302-1854.png)
*Source : https://www.alibabacloud.com/fr/knowledge/what-is-edge-computing*
]
---
## Traditional IT : 1 server = 1 physical machine = 1 function
- SMTP (mail) server
- DNS
- Web server, FTP server
- proxy
- ...
--
- .green[Pros : security, no middleware]
- .red[Cons : cost, oversized hardware]
---
## Operating system virtualization
Idea: one physical machine = multiple virtual machines (VMs).
--
- VMs must be independent of each other
--
- Each VM can have its own OS
--
- A VM does not know that other VMs run on the same machine
--
.green[Pros :]
- Shared costs (fewer machines)
- Better CPU utilization
- Independent OS
- Easy migration and deployment
- Can run old OSes
--
.red[Cons :]
- If the physical machine goes down, all its VMs go down too
- There are some security issues
???
This is nothing new: virtualization has existed since the 1960s, but it is still relevant!
---
## The key to virtualization : Hypervisor
- Manages VMs
- Presents resources to the VMs
--
- 2 types
  - Type 1: bare-metal hypervisor, used in datacenters
  - Type 2: hosted hypervisor, used by end users
--
- Examples of hypervisors
  - Type 1
    - Hyper-V (Microsoft)
    - KVM (Red Hat)
    - VMware Infrastructure (VMware)
    - Xen (Xen Project, Linux Foundation)
  - Type 2
    - VirtualBox (Oracle)
    - Microsoft Virtual Server (Microsoft)
    - VMware Fusion (VMware)
---
## The key to virtualization : Hypervisor
![](https://raw.githubusercontent.com/katossky/panorama-bigdata/master/img/hypervisors.jpg)
---
## Another type of virtualization : containerization
Virtualizing a full OS has a cost in resources, especially if you only need one specific application (database, Python server, R server...).
--
So instead of virtualizing a full OS, you can create an image that contains only your app and the libraries it needs.
--
You need a runtime engine to run and manage your containers, but containers are more lightweight and more portable than VMs.
--
Examples of container technologies:
- Docker
- rktlet
- CRI-O
- Microsoft Containers
- Linux Containers
---
## Another type of virtualization : containerization
![](https://raw.githubusercontent.com/katossky/panorama-bigdata/master/img/container.jpg)
---
## Cloud computing definition
Wikipedia: Cloud computing is **Internet-based computing**, whereby **shared resources, software, and information** are provided to computers and other devices **on demand**, like the electricity grid. Cloud computing is a style of computing in which **dynamically scalable** and often **virtualized resources** are provided as a **service** over the Internet.
--
In a nutshell :
- Shared physical resources (pooled costs) in a remote location
- Access via the Internet
- On-demand, scalable services
---
## Why cloud computing ?
- The world is changing faster than ever
--
- New problems need quick responses
--
- Building a new datacenter, buying new servers or creating new architectures takes too much time!
--
- The on-demand nature of cloud computing makes it possible to create new services in no time (or during a coffee break)
---
## Why cloud computing ?
- Less investment: you don't have to buy anything, you just pay for what you use
- Easily scalable: "just" pay for more resources
- Flexible: create and delete resources in no time
- No maintenance cost: it's the responsibility of your cloud provider
- Reliable: it's the responsibility of your cloud provider
- Innovation: new services are offered regularly, and you still pay only for what you use
---
## Different services
Suppose you need somewhere to sleep. You can:
--
- Buy a site to build your house.
-> You have to keep everything in good condition by yourself. The investment is large, and you can't sleep in your house until it is built.
--
- Rent a site to build a house.
-> Your landlord maintains the site and you maintain your home. The investment is lower, but you still have to wait for your house.
--
- Rent an already built empty house.
-> Your landlord keeps everything in good condition. You still have to buy a bed.
--
- Go to a hotel.
-> You do not have to do anything, BUT you don't control anything.
---
## Different services
Suppose you need somewhere to sleep. You can:
--
- Buy a site to build your house: **in-house datacenter**.
High investment, but you have full control over your infrastructure.
--
- Rent a site to build a house: **Infrastructure as a Service**.
You pay for an infrastructure, but you have to build your own IT system on it. You have full control over what you do with your resources.
--
- Rent an already built empty house: **Platform as a Service**.
You pay for an already installed platform; you do not manage lower-level resources and can develop your own IT system on top of it.
--
- Go to a hotel: **Software as a Service**.
You directly use an existing IT solution, with little control over the software.
---
## Different services
.center.height400[![](https://www.supinfo.com/articles/resources/174920/9516/0.png)
]
Source: https://www.supinfo.com/articles/single/9516-cloud-computing-une-partie-integrante-informatique-aujourd-hui
---
## Some cloud providers
- Amazon Web Services (AWS)
- Google Cloud Platform (GCP)
- Microsoft Azure
- OVHcloud
<!-- --- -->
<!-- ## Infrastructure as a Service, IaaS -->
<!-- - Provides the consumer with -->
<!-- - Processing capacity -->
<!-- - Storage -->
<!-- - Network -->
<!-- - and other fundamental computing resources -->
<!-- - The consumer does not manage the physical infrastructure but has control over the provided resources -->
<!-- - Examples : -->
<!-- - VM : Amazon Elastic Compute Cloud, Google Compute Engine, Azure Virtual Machines, OVH Public Cloud -->
<!-- - Storage : Amazon Simple Storage Service, Google Cloud Storage, Azure Blob Storage, OVH Object Storage -->
<!-- --- -->
<!-- ## Platform as a Service, PaaS -->
<!-- - Provides the consumer with an already configured infrastructure on which to deploy its own application -->
<!-- - The consumer neither manages the physical infrastructure nor controls the provided resources -->
<!-- - The consumer controls the deployed application -->
<!-- - Examples : Amazon Relational Database Service, Amazon Elastic MapReduce -->
<!-- --- -->
<!-- ## Software as a Service, SaaS -->
<!-- - Provides the consumer with the ability to use a running application managed by the cloud provider -->
<!-- - The consumer does not control anything, except some user-specific settings -->
<!-- - Examples : all the Google Apps -->
---
# Principles of IaaS
---
## Principles of IaaS
When it comes to infrastructure as a service, all cloud providers share essentially the same two basic products:
1. normal storage
2. virtual machines
---
## Normal storage
This is where you put all your files: scripts, logs, data, etc.

You create as many "buckets" as you wish, where a "bucket" is a unit of storage.

"Normal", in "normal storage", means:
- slower access than directly from the virtual machine
- faster than recovery storage

In return, the provider takes care of protecting your data against corruption, usually by keeping 3 distinct copies in different racks located in different rooms or even different sites.

Each resource you store this way gets a unique URL, which you can refer to from inside the cloud provider's ecosystem (for instance, from a virtual machine) or, if you allow it, from the outside world. You control who gets access to these resources, and how.

Data is usually encrypted server-side (meaning that physically stealing disks from a Google server is useless), but by default the provider itself stores the encryption key somewhere.
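
To make the "unique URL" idea concrete, here is a minimal Python sketch that builds the URL of a stored object. The bucket and object names are made up for illustration. The patterns used are the virtual-hosted-style S3 endpoint and the `storage.googleapis.com` endpoint; whether such a URL is actually reachable still depends on the access rules you set.

```python
from urllib.parse import quote


def s3_object_url(bucket: str, key: str, region: str) -> str:
    """Virtual-hosted-style URL of an object stored in Amazon S3."""
    # quote() percent-encodes spaces and special characters but keeps "/".
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


def gcs_object_url(bucket: str, name: str) -> str:
    """URL of an object stored in Google Cloud Storage."""
    return f"https://storage.googleapis.com/{bucket}/{quote(name)}"


# Hypothetical bucket and object names, for illustration only.
print(s3_object_url("ensai-bigdata", "data/flights 2019.csv", "eu-west-3"))
print(gcs_object_url("ensai-bigdata", "data/flights 2019.csv"))
```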
???
(Of course you may want to use database services instead, but let's forget about that for now.)
---
## Normal storage
Storage solutions have different names, but similar pricing schemes:

| Provider | Service's name | Store | Write | Read* |
|----------|-----------------|------------|----------------------|----------------------|
| Google | Cloud Storage | 0.03 \$/GB | free | free |
| Amazon | S3 | 0.02 \$/GB | free | 0.09 \$/GB |
| Azure | Blob Storage | 0.02 \$/GB | 0.00 \$ / 10,000 op. | 0.02 \$ / 10,000 op. |
| OVH | Object Storage | 0.01 €/GB | free | 0.01 €/GB |

Prices are given per month.

\* Local reads are made inside _one_ of the provider's clusters. You may be charged extra for an external reader, or for reading data from a cluster located elsewhere (e.g. reading data stored in Europe from a virtual machine located in Australia).

**Sources:** GCP ([link](https://cloud.google.com/storage/pricing)), AWS ([link](https://aws.amazon.com/fr/s3/pricing)), Azure ([link](https://azure.microsoft.com/en-us/pricing/details/storage/blobs)), OVH ([link](https://www.ovhcloud.com/fr/public-cloud/prices)), visited on 2020/01/18.
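
As a rough illustration of how such a pricing scheme turns into a bill, the sketch below applies the per-GB prices from the table to a hypothetical workload (500 GB stored, 100 GB read in a month). Azure is left out because the table gives its read and write prices per operation, not per GB. The workload figures are assumptions, and real bills include other items.

```python
# Prices copied from the table above (per GB; storage is per month).
PRICES = {
    "Google": {"store": 0.03, "read": 0.00, "currency": "USD"},
    "Amazon": {"store": 0.02, "read": 0.09, "currency": "USD"},
    "OVH": {"store": 0.01, "read": 0.01, "currency": "EUR"},
}


def monthly_storage_cost(provider: str, stored_gb: float, read_gb: float) -> float:
    """Monthly bill for storing `stored_gb` and reading `read_gb` (writes are free)."""
    price = PRICES[provider]
    return price["store"] * stored_gb + price["read"] * read_gb


# Hypothetical workload, for illustration only.
for provider, price in PRICES.items():
    cost = monthly_storage_cost(provider, stored_gb=500, read_gb=100)
    print(f"{provider}: {cost:.2f} {price['currency']} per month")
```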
---
## Normal storage
Other forms of storage, beyond "normal" storage, include:
- disks that can be "attached" to a virtual machine (faster, but more expensive)
- storage whose physical location adapts to usage
- so-called "cold" storage for infrequently accessed data (less expensive to store, but reads and writes cost more)
- archive solutions (inexpensive, but access takes minutes to hours)
---
## Normal storage, the snowmobile
.center.height400[![](https://media.datacenterdynamics.com/media/images/Snowmobile-AWS-truck-data-delivery.original.jpg)]
More info : https://www.datacenterdynamics.com/en/news/aws-snowmobile-delivers-petabytes-to-the-cloud-by-truck/
---
## Virtual machines
The other component of a typical IaaS environment is the **virtual machine** (VM).

A VM is made of a combination of components:
- memory
- processing units
- local storage (hard disk)
- an operating system (FR: système d'exploitation), typically Linux

The VM is... **virtual**. This does **not** mean that there is no physical memory, no physical processing unit or no physical storage, but rather that your virtual computer has rights over pooled resources located on neighbouring servers of the same cluster.

VMs come in a fixed (but high) number of pre-configured options. Some providers offer tailored configurations, but they are usually more expensive than the closest pre-configured option.
???
Imagine that on cluster C, on rack R, on server S, there are in total 124 CPUs, 14 GPUs, 200 GB of memory and 10 TB of storage. If the configuration you request for your virtual machine V fits on S, then V will get access to the physical resources there. If V does not fit on S, the orchestrator of the cluster will try to fit it on machine S' in the same rack, then on another rack. If it can't find room, it may try to fit part of the configuration (say memory and processing units) on S and the rest (local storage) on S'.
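
The placement logic described in these notes can be sketched as a simple "first fit" search. This is only an illustration under strong assumptions: the `Server`, `VmRequest` and `place` names are hypothetical, real orchestrators are far more sophisticated, and the sketch does not cover splitting one VM across several servers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    rack: str
    cpus: int
    memory_gb: int
    storage_tb: float


@dataclass
class VmRequest:
    cpus: int
    memory_gb: int
    storage_tb: float


def fits(server: Server, request: VmRequest) -> bool:
    """True if the server has enough free resources of every kind."""
    return (server.cpus >= request.cpus
            and server.memory_gb >= request.memory_gb
            and server.storage_tb >= request.storage_tb)


def place(request: VmRequest, servers: List[Server], preferred_rack: str) -> Optional[str]:
    """Return the name of the first server that can host the VM, or None."""
    # Servers of the preferred rack come first; sorted() is stable, so the
    # original order is kept inside each group.
    candidates = sorted(servers, key=lambda s: s.rack != preferred_rack)
    for server in candidates:
        if fits(server, request):
            # Reserve the resources so that later requests see what is left.
            server.cpus -= request.cpus
            server.memory_gb -= request.memory_gb
            server.storage_tb -= request.storage_tb
            return server.name
    return None


# Hypothetical cluster: S has the capacity quoted in the notes above.
servers = [
    Server("S", rack="R", cpus=124, memory_gb=200, storage_tb=10),
    Server("S'", rack="R", cpus=64, memory_gb=512, storage_tb=20),
    Server("T", rack="R2", cpus=256, memory_gb=1024, storage_tb=40),
]
print(place(VmRequest(cpus=8, memory_gb=32, storage_tb=1), servers, "R"))
print(place(VmRequest(cpus=16, memory_gb=256, storage_tb=2), servers, "R"))
```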
---
## Virtual machines
Virtual machines (VMs) come with little pre-configured software, and _you_ are the one who performs the rest of the configuration. Such machines are typically used in one of two ways.

VMs **may be launched and stopped for specific tasks**. Statisticians typically use VMs this way, for instance for training one model, for exploring one large dataset, for using software that needs a specific configuration, etc. **VMs may be launched and stopped programmatically.**

VMs, on the other hand, **may also be used continuously**. Think of a mail server, a live traffic visualisation, a back-end for a web service, etc. Statisticians are less often concerned with this kind of usage.

**BEWARE!** Holding the physical resource and doing nothing with it comes at a cost for the provider. Thus, **you will pay for a VM even if you do not use it.** It is a good reflex to shut down all your VMs before ending a session.
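
One way to make the "always shut down" reflex automatic when VMs are driven from a script is to wrap the start/stop calls in a context manager. The sketch below is only an illustration: `FakeCloud`, `start_vm` and `stop_vm` are hypothetical stand-ins, not any provider's SDK, which you would substitute with the real client.

```python
from contextlib import contextmanager


class FakeCloud:
    """Hypothetical stand-in for a provider SDK; it only records VM states."""

    def __init__(self):
        self.running = set()

    def start_vm(self, name: str) -> None:
        self.running.add(name)

    def stop_vm(self, name: str) -> None:
        self.running.discard(name)


@contextmanager
def temporary_vm(cloud: FakeCloud, name: str):
    """Start a VM for the duration of a `with` block, then always stop it."""
    cloud.start_vm(name)
    try:
        yield name
    finally:
        # Runs even if the task fails, so the VM never keeps billing by accident.
        cloud.stop_vm(name)


cloud = FakeCloud()
with temporary_vm(cloud, "model-training") as vm:
    print(f"{vm} is running: {vm in cloud.running}")
print(f"VMs still running: {len(cloud.running)}")
```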
---
## Virtual machines

| Provider | Service's name | Processing* | Memory** | Storage*** |
|----------|------------------|-------------|-----------|------------|
| Google | Compute Engine | ~0.03 \$ | ~0.004 \$ | ~0.04 \$ |
| Amazon | EC2 | ~0.05 \$ | ~0.005 \$ | ? |
| Azure | Virtual Machines | ~0.05 \$ | ~0.005 \$ | ~0.00 \$ |
| OVH | Public Cloud | ~0.04 € | ~0.002 € | ~0.00 € |

\* (Average) cost per processor (CPU or GPU) per hour

\*\* (Average) cost per GB of memory (CPU or GPU) per hour

\*\*\* (Average) cost per GB of storage per month

.footnote[**Sources:** GCP, Iowa cluster (links [for processing and memory](https://cloud.google.com/compute/vm-instance-pricing) and [for storage](https://cloud.google.com/compute/disks-image-pricing)) ; AWS, London cluster, own calculation ([link](https://aws.amazon.com/ec2/pricing/on-demand/)) ; Azure, France South cluster ([link](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)) ; OVH, Gravelines cluster, own calculation ([link](https://www.ovhcloud.com/fr/public-cloud/prices/)) ; all visited on 2020/01/18.]
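
To give an order of magnitude, the sketch below combines the approximate unit prices of the table for a hypothetical VM with 8 processors and 32 GB of memory, and shows what it costs to leave it on for 30 days. The configuration is an assumption, and real providers price pre-configured machine types rather than a simple sum of components.

```python
# Approximate unit prices from the table above: per processor and per GB of
# memory, both per hour. Storage is left out because it is billed per month.
VM_PRICES = {
    "Google": {"cpu": 0.03, "memory_gb": 0.004, "currency": "USD"},
    "Amazon": {"cpu": 0.05, "memory_gb": 0.005, "currency": "USD"},
    "Azure": {"cpu": 0.05, "memory_gb": 0.005, "currency": "USD"},
    "OVH": {"cpu": 0.04, "memory_gb": 0.002, "currency": "EUR"},
}


def vm_cost(provider: str, cpus: int, memory_gb: float, hours: float) -> float:
    """Rough cost of keeping a VM booked for `hours`, whether it works or idles."""
    price = VM_PRICES[provider]
    hourly = cpus * price["cpu"] + memory_gb * price["memory_gb"]
    return hourly * hours


# Hypothetical configuration: 8 processors and 32 GB of memory.
for provider, price in VM_PRICES.items():
    one_hour = vm_cost(provider, cpus=8, memory_gb=32, hours=1)
    forgotten_month = vm_cost(provider, cpus=8, memory_gb=32, hours=24 * 30)
    print(f"{provider}: {one_hour:.2f} per hour, "
          f"{forgotten_month:.2f} {price['currency']} if left on for 30 days")
```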
???
Comparison Azure vs. AWS by Microsoft:
https://docs.microsoft.com/fr-fr/azure/architecture/aws-professional/
Comparison GCP vs. AWS by Google:
https://cloud.google.com/docs/compare/aws/
---
## Virtual machines
Depending on the providers, you may have extra features such as:
- preemptible machines / spot instances (for tasks that don't need to run at a specific moment, you can get much cheaper machines that may be stopped at any time and relaunched later)
- redundancy (you get exact copies of a given VM); this may be useful to scale a process such as a web server
- specialised VMs, such as high CPU-to-memory ratio (e.g. for batch processing), high memory-to-CPU ratio (e.g. for in-memory computation) or tailored configurations
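
A job running on a machine that can be stopped at any time should be able to resume where it left off. The sketch below shows one common pattern, checkpointing, on a toy task (summing squares); the `stop_after` argument only simulates a preemption. This is an assumption-laden illustration: on a real preemptible VM the checkpoint would be written to bucket storage rather than to the local disk, which disappears with the machine.

```python
import json
import os
import tempfile


def run_with_checkpoints(items, checkpoint_path, stop_after=None):
    """Sum the squares of `items`, saving progress after each one.

    `stop_after` simulates a preemption after that many newly processed items.
    """
    state = {"next_index": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume where the previous run stopped

    processed = 0
    while state["next_index"] < len(items):
        if stop_after is not None and processed >= stop_after:
            return None  # the machine was taken back before the job finished
        state["total"] += items[state["next_index"]] ** 2
        state["next_index"] += 1
        processed += 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state["total"]


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "checkpoint.json")
    data = list(range(10))
    first_try = run_with_checkpoints(data, path, stop_after=4)  # preempted
    second_try = run_with_checkpoints(data, path)  # relaunched later
    print(first_try, second_try)
```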
---
## Infrastructure as a service: billing
**BEWARE!** **Everything you book is billed, even what you end up not using!** If there is something you do not use, _you_ have to shut it down. It is good practice to start small, and scale up when you need to.

Also, to make customers mind their architectural choices, **most providers bill communication between your virtual machines and your storage spaces**, except in some cases when both are located in precisely the same cluster.
---
## Other services
Cloud providers offer other pure infrastructure services, but you won't have to worry about them as a statistician:
- network
- security
- orchestration
---
# PaaS and SaaS
---
## Platform as a service
Besides storage and VMs, you will probably get to use services without having to install and configure a machine yourself from the ground up. This is **platform as a service** (**PaaS**).

As a statistician, you may come to use services such as:
- databases
- data analytics
- machine-learning
- serverless computing

.green[++Pros]: scale automatically, less maintenance, easy to access

.red[--Cons]: less configurable, possibly more expensive, cloud provider dependent

Example: Hadoop platforms (AWS EMR, GCP Cloud Dataproc, Azure HDInsight)
---
## Software as a Service, SaaS
- Gives the consumer the ability to use a running application managed by someone else (the application provider)
- The consumer does not control anything, except some user-specific settings
- You must agree to some terms and conditions before use
- I strongly discourage using this kind of service to analyse sensitive data (especially if the service is "free")
- Examples: the whole Google suite, Facebook, Twitter, Instagram, GitHub, Binder...
---
## Some examples of cloud computing success: New York Times
- The need
  - Convert all articles from 1851 to 1922 from TIFF to PDF
  - 4 TB of input (405,000 very large TIFF images, 3.3 million articles in SGML and 405,000 XML files mapping articles to rectangular regions in the TIFFs)
--
- The implementation
  - Run around 100 EC2 instances as a Hadoop cluster for 24h (estimated cost: 800 USD)
  - Realize they made a mistake
  - Rerun the cluster with the fix for 24h (estimated cost: 800 USD)
--
- Conclusion
  - About 70 years of newspapers converted to PDF for less than 2000 USD, in no time!
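
The arithmetic behind these figures can be checked quickly. The numbers below are the slide's own estimates; the price per instance-hour is only what those estimates imply, not an official EC2 price.

```python
# Figures from the slide: about 100 EC2 instances, 24 hours per run,
# an estimated 800 USD per run, and two runs (the first one had a mistake).
instances = 100
hours_per_run = 24
cost_per_run = 800
runs = 2

instance_hours = instances * hours_per_run * runs
total_cost = cost_per_run * runs
implied_hourly_price = cost_per_run / (instances * hours_per_run)

print(f"Total: {total_cost} USD for {instance_hours} instance-hours")
print(f"Implied price: {implied_hourly_price:.2f} USD per instance-hour")
```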
---
## Some examples of cloud computing success: Animoto
Animoto is a cloud-based video creation service (SaaS)
- Start-up without any physical server
- Rents servers (IaaS) to deploy its application
- Starts with 40 servers
- Scales up from 40 to 5000 servers in 4 days to meet the growing number of subscribers!
--
This cannot be done with an on-site datacenter.
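
To get a feel for that pace, here is a back-of-the-envelope computation. It assumes the growth was exponential over the 4 days, which the slide does not state; under that assumption the fleet doubled roughly every 14 hours.

```python
import math

servers_before = 40
servers_after = 5000
days = 4

growth_factor = servers_after / servers_before
# Assumption: exponential growth, i.e. a constant doubling time.
doubling_time_hours = days * 24 * math.log(2) / math.log(growth_factor)

print(f"Fleet multiplied by {growth_factor:.0f} in {days} days")
print(f"Doubling time: about {doubling_time_hours:.1f} hours")
```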
---
# Cloud and ecology
.center.height500[![](https://pbs.twimg.com/media/CbeMAqaUAAEFWZU.png)]
---
## Cloud and ecology
- Not owning the server does not mean that the server does not pollute.
- The Internet causes about 4% of worldwide greenhouse gas (GHG) emissions
- Datacenters are responsible for 25% of that share, i.e. about 1% of worldwide emissions
- In theory, by pooling resources, cloud computing could be greener than on-site datacenters
- But because cloud computing is becoming cheaper, more people use it and pay less attention to cost. So more servers are running, some of them oversized, and cloud computing can end up polluting more than old-school datacenters (rebound effect).
- There is no real evidence of this so far.

*Source: ADEME, La face cachée du numérique, accessed on 19/01/2020 ([link](https://www.ademe.fr/sites/default/files/assets/documents/guide-pratique-face-cachee-numerique.pdf))*