
how Spark manages data and how you can read and write tables from Python.

\ What is Spark, anyway? /

 | parallel computation | ( split data across a cluster )

 + platform for cluster computing: Spark lets you spread data and computations over clusters with multiple nodes.
 + Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.
 + Parallel computation can make certain types of programming tasks much faster (see the sketch after this note block).

 | Using Spark in Python |

 - cluster : <remote machine that connects to all nodes>
 - master  : ( main pc ) <manages splitting up the data and the computations>
 - worker  : ( rest of the computers in the cluster ) <receive the calculations, return the results>

 . create connection : <SparkContext> class <sc>  # create an instance of the SparkContext class
 . attributes : <SparkConf()> constructor  # holds the connection attributes
"""
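#|
#|
### Sketch: parallel computation across the cluster
# A minimal sketch, not one of the original exercises. It assumes the <sc>
# connection provided by the environment (examined below). parallelize()
# splits the numbers across the cluster's nodes; map() and sum() then run
# on each piece in parallel before the partial results are combined.
print(sc.parallelize(range(1000)).map(lambda x: x * x).sum())
#- 332833500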
#|
#|
### How do you connect to a Spark cluster from PySpark?
# ANSW: Create an instance of the SparkContext class.
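#|
#|
### Sketch: creating a SparkContext by hand
# A minimal sketch, not from the course: in these exercises <sc> already
# exists. The local[*] master and the app name are illustrative choices,
# and getOrCreate() reuses a running context instead of building a new one.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("pyspark-notes")
sc = SparkContext.getOrCreate(conf=conf)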
#|
#|
### Examining The SparkContext
# Verify SparkContext in environment
print(sc)

# Print Spark version
print(sc.version)
#- <SparkContext master=local[*] appName=pyspark-shell>
#- 3.2.0
#|
#|
"""
\ Using Spark DataFrames /

 . Spark's core structure : the Resilient Distributed Dataset (RDD)
 . DataFrame : built on top of RDDs; behaves like a SQL table
 # DataFrames are more optimized for complicated operations than RDDs
 # (see the SQL sketch after this note block).

 - create a Spark DataFrame : create a <SparkSession> object from your <SparkContext>
     <SparkSession>  # the interface   <'spark'>
     <SparkContext>  # the connection  <'sc'>
"""
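#|
#|
### Sketch: a DataFrame behaving like a SQL table
# A minimal sketch, not one of the original exercises: the column names and
# rows are made up, and it assumes a SparkSession named <spark> (created in
# the next exercise). Registering a temp view lets you query the DataFrame
# with plain SQL.
df = spark.createDataFrame([("ORD", 120), ("SFO", 95)], ["origin", "n_flights"])
df.createOrReplaceTempView("flights")
spark.sql("SELECT origin FROM flights WHERE n_flights > 100").show()
#- +------+
#- |origin|
#- +------+
#- |   ORD|
#- +------+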
#|
#|
### Which of the following is an advantage of Spark DataFrames over RDDs?
# ANSW: Operations using DataFrames are automatically optimized.
#|
#|
### Creating a SparkSession
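# A minimal sketch, not necessarily the exercise's exact solution:
# builder.getOrCreate() returns the SparkSession that already exists,
# or builds a new one if there is none.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark)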