## Select Columns From DataFrame

The select() function is used to select a single column, multiple columns, columns by index, all columns from a list, or nested columns from a DataFrame. PySpark select() is a transformation, so it returns a new DataFrame with the selected columns.

Create a DataFrame:

In [None]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('select').getOrCreate()

data = [("James","Smith","USA","CA"),
        ("Michell","Frost","CAN","TO"),
        ("Roberto","Gonzales","MEX","TL"),
        ("Maria","Jones","USA","FL")
  ]
columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data = data, schema = columns)
df.show(truncate=False)

+---------+--------+-------+-----+
|firstname|lastname|country|state|
+---------+--------+-------+-----+
|James    |Smith   |USA    |CA   |
|Michell  |Frost   |CAN    |TO   |
|Roberto  |Gonzales|MEX    |TL   |
|Maria    |Jones   |USA    |FL   |
+---------+--------+-------+-----+



### Select Single & Multiple Columns From PySpark
You can select a single column or multiple columns of the DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function displays the DataFrame contents.

The three calls below are equivalent: they pass the columns as name strings, as DataFrame attributes, and as bracket-indexed columns.

In [22]:
df.select("firstname","lastname").show()
df.select(df.firstname,df.lastname).show()
df.select(df["firstname"],df["lastname"]).show()

+---------+--------+
|firstname|lastname|
+---------+--------+
|    James|   Smith|
|  Michell|   Frost|
|  Roberto|Gonzales|
|    Maria|   Jones|
+---------+--------+

+---------+--------+
|firstname|lastname|
+---------+--------+
|    James|   Smith|
|  Michell|   Frost|
|  Roberto|Gonzales|
|    Maria|   Jones|
+---------+--------+

+---------+--------+
|firstname|lastname|
+---------+--------+
|    James|   Smith|
|  Michell|   Frost|
|  Roberto|Gonzales|
|    Maria|   Jones|
+---------+--------+



In [23]:
# Select columns using the col() function

from pyspark.sql.functions import col
df.select(col("firstname"),col("lastname")).show()

+---------+--------+
|firstname|lastname|
+---------+--------+
|    James|   Smith|
|  Michell|   Frost|
|  Roberto|Gonzales|
|    Maria|   Jones|
+---------+--------+



In [24]:
# Select columns whose names match a regular expression (the pattern goes inside backticks)

df.select(df.colRegex("`^.*name*`")).show()

+---------+--------+
|firstname|lastname|
+---------+--------+
|    James|   Smith|
|  Michell|   Frost|
|  Roberto|Gonzales|
|    Maria|   Jones|
+---------+--------+
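
One detail of the pattern above is worth checking: a trailing `*` applies only to the character before it, so `name*` means "nam" followed by zero or more "e" characters. The sketch below uses plain Python's re module as a stand-in for Spark's regex engine, assuming the pattern must match the whole column name. The extra names "surnam" and "username_id" are hypothetical and were added only to show the difference.

```python
import re

# Column names from the DataFrame above, plus two hypothetical names
# ("surnam", "username_id") added only to expose the difference.
candidates = ["firstname", "lastname", "country", "state", "surnam", "username_id"]

def matching(pattern, names):
    """Return the names that the pattern matches in full."""
    return [n for n in names if re.fullmatch(pattern, n)]

# "name*" = "nam" followed by zero or more "e", so "surnam" also matches.
loose = matching(r"^.*name*", candidates)

# "name$" requires the name to end with the literal text "name".
strict = matching(r"^.*name$", candidates)

print(loose)
print(strict)
```

For the four columns in this notebook both patterns pick the same two names, which is why the output above looks as expected either way.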



### Select All Columns From List
Sometimes you may need to select all DataFrame columns from a Python list. In the example below, the columns list holds all of the column names.

In [25]:
# termcolor is a third-party package; cprint is used here only to print colored labels
from termcolor import cprint

# Select All columns from List
cprint("df.select(*columns)", 'red')
df.select(*columns).show()

# Select All columns
cprint("df.select([col for col in df.columns])", 'red')
df.select([col for col in df.columns]).show()
cprint("df.select(*)", 'red')
df.select("*").show()

df.select(*columns)
+---------+--------+-------+-----+
|firstname|lastname|country|state|
+---------+--------+-------+-----+
|    James|   Smith|    USA|   CA|
|  Michell|   Frost|    CAN|   TO|
|  Roberto|Gonzales|    MEX|   TL|
|    Maria|   Jones|    USA|   FL|
+---------+--------+-------+-----+

df.select([col for col in df.columns])
+---------+--------+-------+-----+
|firstname|lastname|country|state|
+---------+--------+-------+-----+
|    James|   Smith|    USA|   CA|
|  Michell|   Frost|    CAN|   TO|
|  Roberto|Gonzales|    MEX|   TL|
|    Maria|   Jones|    USA|   FL|
+---------+--------+-------+-----+

df.select(*)
+---------+--------+-------+-----+
|firstname|lastname|country|state|
+---------+--------+-------+-----+
|    James|   Smith|    USA|   CA|
|  Michell|   Frost|    CAN|   TO|
|  Roberto|Gonzales|    MEX|   TL|
|    Maria|   Jones|    USA|   FL|
+---------+--------+-------+-----+
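
The same list-based approach extends to selecting every column except a few, or to moving one column to the front. The sketch below builds such lists in plain Python from the column names used in this notebook. The excluded set and the variable names are illustrative assumptions, and the select() calls appear only as comments because they need the live df.

```python
# Column names as defined for df earlier in this notebook.
columns = ["firstname", "lastname", "country", "state"]

# Hypothetical set of columns to leave out.
excluded = {"state"}

# Keep the original order while dropping the excluded names.
kept = [c for c in columns if c not in excluded]

# Move one column to the front and keep the rest in their original order.
reordered = ["country"] + [c for c in columns if c != "country"]

print(kept)
print(reordered)

# With the DataFrame from this notebook, the lists would be passed as:
# df.select(*kept).show()
# df.select(*reordered).show()
```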



### Select Columns by Index
Because df.columns is a regular Python list, you can use list slicing to select columns by index.


In [26]:
# Selects the first 3 columns and the top 3 rows
cprint("df.select(df.columns[:3]).show(3)", 'red')

df.select(df.columns[:3]).show(3)

# Selects the columns at index 2 and 3 (the 3rd and 4th columns) and the top 3 rows
cprint("df.select(df.columns[2:4]).show(3)", 'red')
df.select(df.columns[2:4]).show(3)

df.select(df.columns[:3]).show(3)
+---------+--------+-------+
|firstname|lastname|country|
+---------+--------+-------+
|    James|   Smith|    USA|
|  Michell|   Frost|    CAN|
|  Roberto|Gonzales|    MEX|
+---------+--------+-------+
only showing top 3 rows
df.select(df.columns[2:4]).show(3)
+-------+-----+
|country|state|
+-------+-----+
|    USA|   CA|
|    CAN|   TO|
|    MEX|   TL|
+-------+-----+
only showing top 3 rows
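
Since the slicing happens on an ordinary list of names, its rules can be checked without Spark. The sketch below works on a copy of the column names from this notebook and shows that the slice end is exclusive, plus two variations (non-adjacent indexes and negative indexes). The variable names are illustrative assumptions.

```python
# Column names as defined for df earlier in this notebook.
columns = ["firstname", "lastname", "country", "state"]

# Slice end is exclusive: [:3] yields indexes 0, 1, 2.
first_three = columns[:3]

# [2:4] yields indexes 2 and 3, i.e. the 3rd and 4th columns.
third_and_fourth = columns[2:4]

# Non-adjacent columns can be picked by listing their indexes.
picked = [columns[i] for i in (0, 3)]

# Negative indexes count from the end of the list.
last_two = columns[-2:]

print(first_three, third_and_fourth, picked, last_two)

# Any of these lists can then be passed to select(), for example:
# df.select(picked).show()
```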


### Select Nested Struct Columns from PySpark
If you have a nested struct (StructType) column in a PySpark DataFrame, you need to use an explicit column qualifier to select its fields.

Let's create a new DataFrame with a struct type. The column 'name' is a struct that consists of the fields firstname, middlename, and lastname.

In [27]:
data = [
        (("James",None,"Smith"),"IL","M"),
        (("Anna","Rose",""),"SC","F"),
        (("Julia","","Williams"),"IL","F"),
        (("Maria","Anne","Jones"),"NY","M"),
        (("Jen","Martin","Brown"),"TX","M"),
        (("Mike","Keith","Williams"),"IL","M")
        ]

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('name', StructType([
         StructField('firstname', StringType(), True),
         StructField('middlename', StringType(), True),
         StructField('lastname', StringType(), True)
         ])),
     StructField('state', StringType(), True),
     StructField('gender', StringType(), True)
     ])
df2 = spark.createDataFrame(data = data, schema = schema)
df2.printSchema()
df2.show(truncate=False) # truncate=False prints full column values without truncation

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- state: string (nullable = true)
 |-- gender: string (nullable = true)

+-----------------------+-----+------+
|name                   |state|gender|
+-----------------------+-----+------+
|{James, NULL, Smith}   |IL   |M     |
|{Anna, Rose, }         |SC   |F     |
|{Julia, , Williams}    |IL   |F     |
|{Maria, Anne, Jones}   |NY   |M     |
|{Jen, Martin, Brown}   |TX   |M     |
|{Mike, Keith, Williams}|IL   |M     |
+-----------------------+-----+------+



Let's select the struct column.

In [28]:
df2.select("name").show(truncate=False)

+-----------------------+
|name                   |
+-----------------------+
|{James, NULL, Smith}   |
|{Anna, Rose, }         |
|{Julia, , Williams}    |
|{Maria, Anne, Jones}   |
|{Jen, Martin, Brown}   |
|{Mike, Keith, Williams}|
+-----------------------+



To select a specific field from a nested struct, you need to explicitly qualify the field with the struct column name.

In [29]:
df2.select("name.firstname","name.lastname").show(truncate=False)

+---------+--------+
|firstname|lastname|
+---------+--------+
|James    |Smith   |
|Anna     |        |
|Julia    |Williams|
|Maria    |Jones   |
|Jen      |Brown   |
|Mike     |Williams|
+---------+--------+



To get all fields from the struct column, use the wildcard qualifier name.* as shown here.

In [30]:
df2.select("name.*").show(truncate=False)

+---------+----------+--------+
|firstname|middlename|lastname|
+---------+----------+--------+
|James    |NULL      |Smith   |
|Anna     |Rose      |        |
|Julia    |          |Williams|
|Maria    |Anne      |Jones   |
|Jen      |Martin    |Brown   |
|Mike     |Keith     |Williams|
+---------+----------+--------+
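
The wildcard expands one struct at a time. To flatten a whole schema, one option is to generate the qualified names for every leaf field and pass them to select(). The sketch below does this in plain Python over a nested dict that is only a hypothetical stand-in for df2's schema. The helper qualified_names is likewise an assumption and not part of PySpark.

```python
def qualified_names(schema, prefix=""):
    """Return dotted paths for every leaf field of a nested schema dict."""
    names = []
    for field, sub in schema.items():
        path = f"{prefix}.{field}" if prefix else field
        if isinstance(sub, dict):
            # A nested dict stands for a struct: recurse with the path as prefix.
            names.extend(qualified_names(sub, path))
        else:
            names.append(path)
    return names

# Hypothetical dict mirroring the printSchema() output of df2 above.
schema = {
    "name": {"firstname": "string", "middlename": "string", "lastname": "string"},
    "state": "string",
    "gender": "string",
}

leaf_paths = qualified_names(schema)
print(leaf_paths)

# With the DataFrame from this notebook, the paths would be passed as:
# df2.select(*leaf_paths).show(truncate=False)
```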

