# Overview
Individual tasks and one-off analyses often require only a subset of a larger dataset. Selecting just the columns you need can improve performance, reduce costs, and make the data easier to work with because there are fewer dimensions to keep track of.

We will start by reading a CSV file stored in DBFS (the Databricks File System).

In [0]:
csv_file_path = '/FileStore/tables/bankcard_data.csv'

sdf = spark.read.csv(csv_file_path,
                     header=True,        # first row contains the column names
                     inferSchema=True)   # infer each column's type; without this, every column is read as a string

display(sdf)

checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,residence_since,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,class
<0,6,critical/other existing credit,radio/tv,1169,no known savings,>=7,4,male single,none,4,real estate,67,none,own,2,skilled,1,yes,yes,good
0<=X<200,48,existing paid,radio/tv,5951,<100,1<=X<4,2,female div/dep/mar,none,2,real estate,22,none,own,1,skilled,1,none,yes,bad
no checking,12,critical/other existing credit,education,2096,<100,4<=X<7,2,male single,none,3,real estate,49,none,own,1,unskilled resident,2,none,yes,good
<0,42,existing paid,furniture/equipment,7882,<100,4<=X<7,2,male single,guarantor,4,life insurance,45,none,for free,1,skilled,2,none,yes,good
<0,24,delayed previously,new car,4870,<100,1<=X<4,3,male single,none,4,no known property,53,none,for free,2,skilled,2,none,yes,bad
no checking,36,existing paid,education,9055,no known savings,1<=X<4,2,male single,none,4,no known property,35,none,for free,1,unskilled resident,2,yes,yes,good
no checking,24,existing paid,furniture/equipment,2835,500<=X<1000,>=7,3,male single,none,4,life insurance,53,none,own,1,skilled,1,none,yes,good
0<=X<200,36,existing paid,used car,6948,<100,1<=X<4,2,male single,none,2,car,35,none,rent,1,high qualif/self emp/mgmt,1,yes,yes,good
no checking,12,existing paid,radio/tv,3059,>=1000,4<=X<7,2,male div/sep,none,4,real estate,61,none,own,1,unskilled resident,1,none,yes,good
0<=X<200,30,critical/other existing credit,new car,5234,<100,unemployed,4,male mar/wid,none,2,car,28,none,own,2,high qualif/self emp/mgmt,1,none,yes,bad


# Selecting Columns

The most common method is PySpark's .select() function. Columns can be selected by explicitly naming them as arguments or by implicitly referencing them through a list.

In [0]:
# explicitly referencing columns
sdf_pyspark_select = sdf.select('purpose', 'credit_history', 'class')
print(sdf_pyspark_select.columns)

# implicitly referencing columns via list
columns = ['purpose', 'credit_history', 'class']
sdf_pyspark_select = sdf.select(*columns)
print(sdf_pyspark_select.columns)

['purpose', 'credit_history', 'class']
['purpose', 'credit_history', 'class']


Alternatively, you can explicitly reference the columns with the col() function from PySpark's SQL functions library (pyspark.sql.functions). This function converts the column's string name into a Column object.

In [0]:
from pyspark.sql.functions import col
sdf_col_select = sdf.select(col('purpose'), col('credit_history'), col('class'))
print(sdf_col_select.columns)                                     

['purpose', 'credit_history', 'class']


The .selectExpr() function can be used to select columns much like .select(), but it also accepts SQL expression strings, which we will get into in later notebooks.

In [0]:
sdf_selectExpr_select = sdf.selectExpr('purpose', 'credit_history', 'class')
print(sdf_selectExpr_select.columns)   

['purpose', 'credit_history', 'class']
