# Removing and splitting pandas DataFrame columns

When you are preparing to train machine learning models, you often need to delete specific columns, or split certain columns from your DataFrame into a new DataFrame.

We need the pandas library and a DataFrame to explore.

In [1]:
import pandas as pd

Let's load a bigger CSV file with more columns. **flight_delays.csv** provides information about flights and flight delays.

In [2]:
delays_df = pd.read_csv('Data/flight_delays.csv')
delays_df.head()

Unnamed: 0,FL_DATE,OP_UNIQUE_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN,DEST,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,DISTANCE
0,2018-10-01,WN,N221WN,802,ABQ,BWI,905,903,-2,1450,1433,-17,225,210,197,1670
1,2018-10-01,WN,N8329B,3744,ABQ,BWI,1500,1458,-2,2045,2020,-25,225,202,191,1670
2,2018-10-01,WN,N920WN,1019,ABQ,DAL,1800,1802,2,2045,2032,-13,105,90,80,580
3,2018-10-01,WN,N480WN,1499,ABQ,DAL,950,947,-3,1235,1223,-12,105,96,81,580
4,2018-10-01,WN,N227WN,3635,ABQ,DAL,1150,1151,1,1430,1423,-7,100,92,80,580


## Removing a column from a DataFrame

When you are preparing your data for machine learning, you may need to delete specific columns from the DataFrame before training the model.

For example, imagine you are training a model to predict how many minutes late a flight will be (ARR_DELAY).

If the model knew the scheduled arrival time (CRS_ARR_TIME) and the actual arrival time (ARR_TIME), it would quickly figure out that ARR_DELAY = ARR_TIME - CRS_ARR_TIME.

When we predict arrival times for future flights, we won't have a value for the actual arrival time (ARR_TIME). So we should remove this column from the DataFrame, so that it is not used as a feature when training the model to predict ARR_DELAY.
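To see why ARR_TIME gives the answer away, here is a small sketch that rebuilds three columns of the five rows shown above by hand. The name `sample_df` is ours and is not part of the notebook. For these rows, subtracting the scheduled time from the actual time reproduces ARR_DELAY exactly. The times look like hhmm integers, so plain subtraction only matches when both times fall within the same hour, as they do here.

```python
import pandas as pd

# Three columns of the five rows shown in the delays_df.head() output above.
# sample_df is a hypothetical name used only for this illustration.
sample_df = pd.DataFrame({
    'CRS_ARR_TIME': [1450, 2045, 2045, 1235, 1430],
    'ARR_TIME': [1433, 2020, 2032, 1223, 1423],
    'ARR_DELAY': [-17, -25, -13, -12, -7],
})

# Recompute the delay from the two arrival-time columns
recomputed = sample_df['ARR_TIME'] - sample_df['CRS_ARR_TIME']

# Check whether the recomputed values match the label in every row
print((recomputed == sample_df['ARR_DELAY']).all())
```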

In [3]:
# Create new_df, a new DataFrame without the column ARR_TIME
# (delays_df itself is left unchanged)

# Equivalent: new_df = delays_df.drop(['ARR_TIME'], axis=1)
new_df = delays_df.drop(columns=['ARR_TIME'])
new_df.head()

Unnamed: 0,FL_DATE,OP_UNIQUE_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN,DEST,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,CRS_ARR_TIME,ARR_DELAY,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,DISTANCE
0,2018-10-01,WN,N221WN,802,ABQ,BWI,905,903,-2,1450,-17,225,210,197,1670
1,2018-10-01,WN,N8329B,3744,ABQ,BWI,1500,1458,-2,2045,-25,225,202,191,1670
2,2018-10-01,WN,N920WN,1019,ABQ,DAL,1800,1802,2,2045,-13,105,90,80,580
3,2018-10-01,WN,N480WN,1499,ABQ,DAL,950,947,-3,1235,-12,105,96,81,580
4,2018-10-01,WN,N227WN,3635,ABQ,DAL,1150,1151,1,1430,-7,100,92,80,580
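By default, `drop` returns a new DataFrame and leaves the original alone. The sketch below checks this on a small hand-built DataFrame and also shows that `columns` accepts more than one name. `small_df` and `trimmed_df` are illustrative names standing in for delays_df, with values copied from the first rows above.

```python
import pandas as pd

# small_df is a hypothetical stand-in for delays_df
small_df = pd.DataFrame({
    'ORIGIN': ['ABQ', 'ABQ'],
    'CRS_ARR_TIME': [1450, 2045],
    'ARR_TIME': [1433, 2020],
    'ARR_DELAY': [-17, -25],
})

# drop returns a new DataFrame without the listed columns
trimmed_df = small_df.drop(columns=['ARR_TIME', 'CRS_ARR_TIME'])

# Columns left in the new DataFrame
print(list(trimmed_df.columns))

# The original DataFrame still contains ARR_TIME
print('ARR_TIME' in small_df.columns)
```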


Use the **inplace** parameter to drop the column from the original DataFrame instead of returning a new one.

In [4]:
# Remove the column ARR_TIME from the DataFrame delays_df itself

# Equivalent: delays_df = delays_df.drop(['ARR_TIME'], axis=1)
delays_df.drop(
    columns=['ARR_TIME'],
    inplace=True
)
delays_df.head()

Unnamed: 0,FL_DATE,OP_UNIQUE_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN,DEST,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,CRS_ARR_TIME,ARR_DELAY,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,DISTANCE
0,2018-10-01,WN,N221WN,802,ABQ,BWI,905,903,-2,1450,-17,225,210,197,1670
1,2018-10-01,WN,N8329B,3744,ABQ,BWI,1500,1458,-2,2045,-25,225,202,191,1670
2,2018-10-01,WN,N920WN,1019,ABQ,DAL,1800,1802,2,2045,-13,105,90,80,580
3,2018-10-01,WN,N480WN,1499,ABQ,DAL,950,947,-3,1235,-12,105,96,81,580
4,2018-10-01,WN,N227WN,3635,ABQ,DAL,1150,1151,1,1430,-7,100,92,80,580
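Two details of `inplace=True` are easy to trip over in a notebook: the call returns `None`, and running the same cell a second time fails because the column is already gone. The sketch below shows both on a hand-built DataFrame (`small_df` is an illustrative name). It also shows the `errors='ignore'` option of `drop`, which is one way to make such a cell safe to rerun.

```python
import pandas as pd

# small_df is a hypothetical stand-in for delays_df
small_df = pd.DataFrame({
    'ARR_TIME': [1433, 2020],
    'ARR_DELAY': [-17, -25],
})

# With inplace=True, drop modifies small_df and returns None
result = small_df.drop(columns=['ARR_TIME'], inplace=True)
print(result is None)

# Dropping the same column again raises KeyError because it no longer exists
try:
    small_df.drop(columns=['ARR_TIME'], inplace=True)
    second_drop_failed = False
except KeyError:
    second_drop_failed = True
print(second_drop_failed)

# errors='ignore' skips labels that are not present instead of raising
small_df.drop(columns=['ARR_TIME'], inplace=True, errors='ignore')
print(list(small_df.columns))
```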


We use different techniques to predict from quantitative values, which are usually numeric (e.g. distance, number of minutes, weight), and from qualitative (descriptive) values, which may not be numeric (e.g. what airport a flight left from, what airline operated the flight).

Quantitative data may be moved into a separate DataFrame before training a model.
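One possible way to make this split is `select_dtypes`, which picks columns by data type. This is a sketch on a hand-built DataFrame, not the method used in this notebook, and the names `small_df`, `numeric_df` and `other_df` are ours. A numeric dtype does not guarantee a quantitative value: a flight number such as OP_CARRIER_FL_NUM is stored as a number but is really an identifier, so the result still needs a manual check.

```python
import pandas as pd

# small_df is a hypothetical stand-in for delays_df
small_df = pd.DataFrame({
    'ORIGIN': ['ABQ', 'ABQ'],
    'DEST': ['BWI', 'DAL'],
    'OP_CARRIER_FL_NUM': [802, 1019],
    'DISTANCE': [1670, 580],
    'ARR_DELAY': [-17, -13],
})

# Columns with a numeric dtype (includes the flight number, which is an identifier)
numeric_df = small_df.select_dtypes(include='number')

# Everything else, here the airport codes
other_df = small_df.select_dtypes(exclude='number')

print(list(numeric_df.columns))
print(list(other_df.columns))
```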

You also need to put the value you want to predict, called the label (ARR_DELAY), in a separate DataFrame from the values you think can help you make the prediction, called the features.
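A minimal sketch of the label and feature split, assuming a hand-built DataFrame with values taken from the rows above. The names `small_df`, `label_df` and `features_df` are illustrative, and treating every remaining column as a feature is a simplification.

```python
import pandas as pd

# small_df is a hypothetical stand-in for delays_df
small_df = pd.DataFrame({
    'DEP_DELAY': [-2, 2, -3],
    'DISTANCE': [1670, 580, 580],
    'ARR_DELAY': [-17, -13, -12],
})

# Label: the column we want to predict, kept as a one-column DataFrame
label_df = small_df.loc[:, ['ARR_DELAY']]

# Features: every column except the label
features_df = small_df.drop(columns=['ARR_DELAY'])

print(list(label_df.columns))
print(list(features_df.columns))
```

Both DataFrames keep the same row index, so row i of `features_df` still lines up with row i of `label_df`.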

We need to be able to create a new DataFrame from the columns in an existing DataFrame.

In [None]:
# Create a new DataFrame called desc_df
# include all rows
# include the columns ORIGIN, DEST, OP_CARRIER_FL_NUM, OP_UNIQUE_CARRIER, TAIL_NUM

desc_df = delays_df.loc[:,['ORIGIN', 'DEST', 'OP_CARRIER_FL_NUM', 'OP_UNIQUE_CARRIER', 'TAIL_NUM']]
desc_df.head()
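Passing a list of column names in square brackets is a common shorthand for the same selection. The sketch below compares the two forms on a hand-built DataFrame; `small_df`, `by_loc` and `by_brackets` are illustrative names.

```python
import pandas as pd

# small_df is a hypothetical stand-in for delays_df
small_df = pd.DataFrame({
    'ORIGIN': ['ABQ', 'ABQ'],
    'DEST': ['BWI', 'DAL'],
    'DISTANCE': [1670, 580],
})

columns_to_keep = ['ORIGIN', 'DEST']

# All rows, selected columns, using .loc
by_loc = small_df.loc[:, columns_to_keep]

# The same selection using a list inside square brackets
by_brackets = small_df[columns_to_keep]

# Check whether both selections hold the same data
print(by_loc.equals(by_brackets))
```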