
Commit 18b1d24

Update I Getting to know PySpark.py
1 parent a899ec9


Introduction to PySpark/I Getting to know PySpark.py

Lines changed: 49 additions & 1 deletion
@@ -2,6 +2,54 @@
how Spark manages data and how you can read and write tables from Python.

\ What is Spark, anyway? /

| parallel computation | ( splits data across a cluster )

Spark is a platform for cluster computing: it lets you spread data and computations over clusters with multiple nodes.
Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.
Parallel computation can make certain types of programming tasks much faster.
| Using Spark in Python |

- cluster : <hosted on a remote machine, connected to all other nodes>
- master : ( main pc ) <manages splitting the data and the computations>
- worker : ( rest of the computers in the cluster ) <receive calculations, return results>

. create connection : create an instance of the <SparkContext> class -> <sc>
. attributes : specified with the <SparkConf()> constructor ( see the sketch below )
"""
#|
#|
### How do you connect to a Spark cluster from PySpark?
# ANSW: Create an instance of the SparkContext class.
#|
#|
### Examining The SparkContext
# Verify SparkContext in environment
print(sc)

# Print Spark version
print(sc.version)
#- <SparkContext master=local[*] appName=pyspark-shell>
#- 3.2.0
#|
#|
"""
\ Using Spark DataFrames /

. Spark's core structure : Resilient Distributed Dataset (RDD)
. DataFrame : behaves like a SQL table
# DataFrames are more optimized for complicated operations than RDDs.

- create a Spark DataFrame : create an object from your
  <SparkSession> #interface <'spark'>
  <SparkContext> #connection <'sc'>
( see the sketch below )
"""
#|
#|
### Which of the following is an advantage of Spark DataFrames over RDDs?
# ANSW: Operations using DataFrames are automatically optimized.
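# One way to see "automatically optimized" (assumes the `spark` session from
# the sketch above): DataFrame operations build a logical plan that Spark's
# Catalyst optimizer rewrites before anything runs, and explain() prints the
# physical plan that actually executes.
spark.range(10).filter("id > 5").select("id").explain()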
#|
#|
### Creating a SparkSession
