Import Spark Libraries and Start a Spark Session

Link to Spark documentation:
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumnRenamed.html

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField,StructType,StringType,IntegerType
spark=SparkSession.builder.appName("Filtering,Slicing and View Creation").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/07 15:15:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/10/07 15:15:37 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/10/07 15:15:37 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


Read Data with Schema Changes

In [2]:
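# Declare the schema explicitly instead of letting Spark infer it
schema = StructType([
    StructField('age', IntegerType(), True),    # nullable integer column
    StructField('name', StringType(), True),    # nullable string column
])
df = spark.read.json('/Users/solansah/Desktop/Data/people.json', schema=schema)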
df.show()
df.describe().show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+



In [3]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)



Select Columns

In [4]:
type(df['age'])  # indexing a DataFrame by column name returns a Column object

pyspark.sql.column.Column

In [5]:
df.select('age')  # select() returns a DataFrame, not a Column

DataFrame[age: int]

In [6]:
df.select('age').show()  # select() returns a DataFrame, so DataFrame methods such as show() can be chained


+----+
| age|
+----+
|null|
|  30|
|  19|
+----+



In [7]:
df.head(2)

[Row(age=None, name='Michael'), Row(age=30, name='Andy')]

In [8]:
df.head(3)[1]

Row(age=30, name='Andy')

In [9]:
type(df.head(3)[1])  # head() returns a list of rows, so indexing it yields a Row object

pyspark.sql.types.Row

In [10]:
'''Remember: Spark has many specialized objects (such as Column and Row) because
it reads from distributed data sources and maps that data onto distributed computation.'''


'Remember: Spark has many specialized objects (such as Column and Row) because\nit reads from distributed data sources and maps that data onto distributed computation.'

In [11]:
df.select(['age','name']).show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



Creating New Columns

In [12]:
df=df.withColumn('newage',df['age']*2 ) # withColumn() returns a new DataFrame; reassign the variable to keep the new column

In [13]:
df.show()

+----+-------+------+
| age|   name|newage|
+----+-------+------+
|null|Michael|  null|
|  30|   Andy|    60|
|  19| Justin|    38|
+----+-------+------+



In [14]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- newage: integer (nullable = true)



Column Renaming

In [15]:
df=df.withColumnRenamed('age','age_renamed')

In [16]:
df.show()

+-----------+-------+------+
|age_renamed|   name|newage|
+-----------+-------+------+
|       null|Michael|  null|
|         30|   Andy|    60|
|         19| Justin|    38|
+-----------+-------+------+



In [18]:
df2=spark.createDataFrame([('solomon',28),('edem',30),('moses',35),('bright',10)],schema=['firstname','age'])
df2.show()

+---------+---+
|firstname|age|
+---------+---+
|  solomon| 28|
|     edem| 30|
|    moses| 35|
|   bright| 10|
+---------+---+



In [19]:
df2.withColumnsRenamed({'firstname':'firstnamenew','age':'agenew'}).show()

+------------+------+
|firstnamenew|agenew|
+------------+------+
|     solomon|    28|
|        edem|    30|
|       moses|    35|
|      bright|    10|
+------------+------+



In [20]:
df2.show()

+---------+---+
|firstname|age|
+---------+---+
|  solomon| 28|
|     edem| 30|
|    moses| 35|
|   bright| 10|
+---------+---+



Create Views

In [21]:
df.createOrReplaceTempView('myview')

In [22]:
spark.sql('select * from myview').show()

+-----------+-------+------+
|age_renamed|   name|newage|
+-----------+-------+------+
|       null|Michael|  null|
|         30|   Andy|    60|
|         19| Justin|    38|
+-----------+-------+------+



In [23]:
df2.createOrReplaceTempView('myview2')

In [24]:
spark.sql( 'select * from myview union all select * from myview ').show()

+-----------+-------+------+
|age_renamed|   name|newage|
+-----------+-------+------+
|       null|Michael|  null|
|         30|   Andy|    60|
|         19| Justin|    38|
|       null|Michael|  null|
|         30|   Andy|    60|
|         19| Justin|    38|
+-----------+-------+------+



In [25]:
df2.createOrReplaceTempView('myview3')

In [26]:
res=spark.sql('select * from myview3')
res.show()

+---------+---+
|firstname|age|
+---------+---+
|  solomon| 28|
|     edem| 30|
|    moses| 35|
|   bright| 10|
+---------+---+



In [27]:
spark