# Overview
Individual tasks and one-off analyses often require working with subsets of larger data sets. Most of the time, a .select() function will suffice for reducing the number of columns. However, a .drop() statement is sometimes more effective, depending on the situation. For example, if you want to drop fewer columns than you want to retain, use the .drop() function, because fewer fields need to be explicitly stated.

Start by creating a DataFrame:

In [0]:
sdf = spark.createDataFrame(
    data=[(1001, 'Chicago', 535),    # values: one tuple per row
          (1002, 'Boston', 495),
          (1003, 'Seattle', 318)],
    schema=['station_id', 'city', 'rainfall']  # column names
)

# Print top 5 rows (.show() prints the table itself and returns None, so print() is not needed)
sdf.show(n=5, truncate=False)

+----------+-------+--------+
|station_id|city   |rainfall|
+----------+-------+--------+
|1001      |Chicago|535     |
|1002      |Boston |495     |
|1003      |Seattle|318     |
+----------+-------+--------+


# Dropping Columns

## .drop() function:

Dropping a single column:

In [0]:
print(sdf.columns)          
sdf.drop('station_id')            # result is not assigned, so it is discarded
print(sdf.columns)                # station_id is still there: .drop() returns a new DataFrame and leaves sdf unchanged
sdf_drop = sdf.drop('station_id') # assign the result to keep it
print(sdf_drop.columns)           # the new DataFrame no longer has the column

['station_id', 'city', 'rainfall']
['station_id', 'city', 'rainfall']
['city', 'rainfall']


Dropping multiple columns by naming each one explicitly within the .drop() function:

In [0]:
print(sdf.columns)
sdf_drop = sdf.drop('station_id', 'city') # separate column names with commas
print(sdf_drop.columns)

['station_id', 'city', 'rainfall']
['rainfall']


Dropping multiple columns via list specification:

In [0]:
print(sdf.columns)
cols = ['station_id', 'city'] # create a list of column names (a tuple also works)
sdf_drop = sdf.drop(*cols)    # unpack the list into separate arguments with *
print(sdf_drop.columns)

['station_id', 'city', 'rainfall']
['rainfall']
