Commit d360b38

Update II Data engineering toolbox.py
1 parent 7a13f26 commit d360b38

1 file changed: +21 -0 lines changed

Introduction to Data Engineering/II Data engineering toolbox.py

Lines changed: 21 additions & 0 deletions
@@ -116,3 +116,24 @@ def parallel_apply(apply_func, groups, nb_cores):
 Built from the need to use structured queries for parallel processing
 Initially used Hadoop MapReduce """
 
+#---
+#A PySpark groupby
+"""The methods you're going to use in this exercise are:
+.printSchema(): prints the schema of a Spark DataFrame.
+.groupBy(): grouping statement for an aggregation.
+.mean(): takes the mean over each group.
+.show(): shows the results."""
+# Print the type of athlete_events_spark
+print(type(athlete_events_spark))
+
+# Print the schema of athlete_events_spark (printSchema() prints by itself and returns None)
+athlete_events_spark.printSchema()
+
+# Group by the Year, and find the mean Age (lazy: this prints only the DataFrame description)
+print(athlete_events_spark.groupBy('Year').mean('Age'))
+
+# Group by the Year, find the mean Age, and show the results (show() prints by itself and returns None)
+athlete_events_spark.groupBy('Year').mean('Age').show()
+
+#---
+#
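
Note on what the added exercise computes: the plain-Python sketch below mirrors the result of `groupBy('Year').mean('Age')` on a few rows, so it can be checked without a Spark session. The sample rows and the helper name `mean_age_by_year` are hypothetical and not part of the commit. The sketch is not PySpark code and does not model Spark's lazy, distributed execution. It assumes, as Spark's average does, that missing ages are skipped.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for athlete_events_spark.
rows = [
    {"Year": 1896, "Age": 23.0},
    {"Year": 1896, "Age": 25.0},
    {"Year": 1900, "Age": 30.0},
    {"Year": 1900, "Age": None},  # missing age, skipped by the average
]


def mean_age_by_year(records):
    """Group records by Year and average Age, ignoring missing ages."""
    totals = defaultdict(lambda: [0.0, 0])  # Year -> [sum of ages, count]
    for rec in records:
        acc = totals[rec["Year"]]  # creates the group even if Age is missing
        if rec["Age"] is not None:
            acc[0] += rec["Age"]
            acc[1] += 1
    # A group with no known ages yields None rather than a number.
    return {year: (s / n if n else None) for year, (s, n) in totals.items()}


result = mean_age_by_year(rows)
for year in sorted(result):
    print(year, result[year])
```

In the committed code the same aggregation is only described until `.show()` is called, which is why printing the grouped result alone does not display any rows.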

0 commit comments
