# <center> Python for Data Analysis</center>
### <center> Session 1c </center>

In [1]:
%%HTML
<style>
td,th {
  font-size: 20px
}
</style>


## <font color=green>Table of Contents</font>
- Quick Introduction to Pandas
- DataFrame and Series  
    - Series
    - DataFrame
- Loading Files
    - The NYC flights Dataset
- Getting a quick look at your data
- Filtering a DataFrame
- Sorting
- Selecting Multiple columns
- Renaming columns
- Rearranging columns
- Creating new columns
- [Grouping in Pandas](#grouping)
- [Merging in Pandas](#merge)
- [The Axis parameter](#axis)
- The Apply Function
- Working with Null Values
- Reshaping/Pivoting

In [2]:
import pandas as pd
import numpy as np
import re
import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt
import seaborn.objects as so
alt.data_transformers.disable_max_rows()
print(f"Pandas:{pd.__version__}")
print(f"Altair:{alt.__version__}")
print(f"numpy:{np.__version__}")
print(f"seaborn:{sns.__version__}")

Pandas:2.1.4
Altair:5.2.0
numpy:1.23.5
seaborn:0.12.2


In [3]:
flights=pd.read_csv("https://github.com/niradsp/Python-for-Data-Analysis/raw/main/flights.csv.gz",compression="gzip",index_col=0)

# <a id="grouping"> Grouping in Pandas</a>

Grouping in Pandas follows what is called the split-apply-combine process.  Briefly, this process has three steps:  
- <b> Splitting </b> the data into groups
- <b>  Applying </b> an aggregation (such as mean or a custom function) or a transformation (e.g. cumulative sum) to each group.
- <b>  Combining </b> the data into a DataFrame. 
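
To make the three steps concrete, here is a minimal sketch that performs them by hand and compares the result with a one-line groupby. The `toy` DataFrame and its values are made up for illustration; they are not part of the flights dataset.

```python
import pandas as pd

# Toy data (assumed values, for illustration only)
toy = pd.DataFrame({"month": [1, 1, 2, 2, 2],
                    "dep_delay": [10.0, 20.0, 5.0, 15.0, 40.0]})

# Split: one sub-DataFrame per distinct month
pieces = {month: part for month, part in toy.groupby("month")}

# Apply: reduce each piece to a single number
means = {month: part["dep_delay"].mean() for month, part in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(means, name="dep_delay").rename_axis("month")

# The usual one-liner does all three steps at once
auto = toy.groupby("month")["dep_delay"].mean()

print(manual)
print(auto)
```

Both printed Series should hold the same per-month means.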

Let's try a simple grouping.  Let's group by month and then count the number of values in each month.

The way to do this is with the groupby method.  We pass the column we want to group by inside the parentheses.
Next, we select only the month column.  Finally, we count the values in each group.

In [4]:
flights.groupby("month").month.count()

month
1     27004
2     24951
3     28834
4     28330
5     28796
6     28243
7     29425
8     29327
9     27574
10    28889
11    27268
12    28135
Name: month, dtype: int64

Let's verify that the grouping worked. 
This operation gives the same result as value_counts(), which we discussed already.

The <b> sort_index() </b> method sorts the result by its index.  By default, value_counts() puts the month with the largest count on top.

In [5]:
flights.month.value_counts().sort_index()

month
1     27004
2     24951
3     28834
4     28330
5     28796
6     28243
7     29425
8     29327
9     27574
10    28889
11    27268
12    28135
Name: count, dtype: int64

We can save the result of groupby in a variable. However, displaying it shows only a GroupBy object, not the grouped data. 

In [6]:
group=flights.groupby("month")
group

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001270CEE0550>

You can use the <b> .groups </b> attribute to find the number of groups.
It returns a dictionary that maps each group key to the index labels of its rows, so its length is the number of groups.

In [7]:
len(group.groups)

12
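
As a small sketch of what a GroupBy object holds, the example below inspects `.groups`, uses the `ngroups` attribute as a shortcut for the number of groups, and iterates over the groups. The `toy` DataFrame is made up for illustration.

```python
import pandas as pd

# Toy data (assumed values, for illustration only)
toy = pd.DataFrame({"origin": ["JFK", "EWR", "JFK", "LGA"],
                    "dep_delay": [2.0, 4.0, -1.0, 7.0]})
grouped = toy.groupby("origin")

# .groups maps each group key to the index labels of its rows
group_rows = {key: list(labels) for key, labels in grouped.groups.items()}
print(group_rows)

# ngroups gives the number of groups directly
n_groups = grouped.ngroups
print(n_groups)

# Iterating over a GroupBy yields (key, sub-DataFrame) pairs
for key, part in grouped:
    print(key, len(part))
```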

If you want to extract the group for month 12 (December), you can use the <b> get_group() </b> method.

In [8]:
group.get_group(12).head(3)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
83161,2013,12,1,13.0,2359,14.0,446.0,445,1.0,B6,745,N715JB,JFK,PSE,195.0,1617,23,59,12/1/2013 23:00
83162,2013,12,1,17.0,2359,18.0,443.0,437,6.0,B6,839,N593JB,JFK,BQN,186.0,1576,23,59,12/1/2013 23:00
83163,2013,12,1,453.0,500,-7.0,636.0,651,-15.0,US,1895,N197UW,EWR,CLT,86.0,529,5,0,12/1/2013 5:00


Now let's try a more complicated example.  
In the example below, we group by year, month and day, and then take the mean departure delay for each group.

Note that the <b> year </b> column is redundant here because the dataset contains only one year.

In [9]:
flights.groupby(['year','month','day'])['dep_delay'].mean().head()

year  month  day
2013  1      1      11.548926
             2      13.858824
             3      10.987832
             4       8.951595
             5       5.732218
Name: dep_delay, dtype: float64

The <b> reset_index() </b> method moves the index levels back into regular columns, so that the result no longer has a multi-level row index.
Unlike R, where a data frame has a single set of row names, a Pandas index can have multiple levels.

In [10]:
flights.groupby(['year','month','day'])['dep_delay'].mean().reset_index().rename(columns={"dep_delay":"mean_dep_delay"})

Unnamed: 0,year,month,day,mean_dep_delay
0,2013,1,1,11.548926
1,2013,1,2,13.858824
2,2013,1,3,10.987832
3,2013,1,4,8.951595
4,2013,1,5,5.732218
...,...,...,...,...
360,2013,12,27,10.937630
361,2013,12,28,7.981550
362,2013,12,29,22.309551
363,2013,12,30,10.698113


Did the grouping work?  Let's test to see if we get the same mean, if we subset for Jan 1, 2013.

In [11]:
condition=(flights.month==1) & (flights.year==2013) & (flights.day==1)
flights[condition].dep_delay.mean()

11.54892601431981

The mean is the same.  Therefore, the grouping worked.

As an aside, in order to extract groups where we have multiple grouping columns, we need to use tuples.  In the example below, I extract the data for January 10, 2013.

In [12]:
flights.groupby(['year','month','day']).get_group((2013,1,10)).head(3)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
7900,2013,1,10,3.0,2359,4.0,426.0,437,-11.0,B6,727,N571JB,JFK,BQN,183.0,1576,23,59,1/10/2013 23:00
7901,2013,1,10,16.0,2359,17.0,447.0,444,3.0,B6,739,N564JB,JFK,PSE,191.0,1617,23,59,1/10/2013 23:00
7902,2013,1,10,450.0,500,-10.0,634.0,648,-14.0,US,1117,N171US,EWR,CLT,78.0,529,5,0,1/10/2013 5:00


In the example below, I first group by the "dest" column.
Then, I select the distance and arr_delay columns.  Note the double square brackets, which select a list of columns.  
Finally, for the distance column, I take the average and count the non-missing values in each "dest" group.  For the arr_delay column, I calculate the mean.
The <b> agg </b> method is handy for this.  Each key is a column of interest, and each value is the operation (or list of operations) we want (mean, count, etc).

In [13]:
avg_distance_arr_delay=flights.groupby(
    "dest")[['distance','arr_delay']].agg(
    {"distance":['mean','count'],"arr_delay":"mean"}).reset_index()

In [14]:
avg_distance_arr_delay.head()

Unnamed: 0_level_0,dest,distance,distance,arr_delay
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,count,mean
0,ABQ,1826.0,254,4.38189
1,ACK,199.0,265,4.852273
2,ALB,143.0,439,14.397129
3,ANC,3370.0,8,-2.5
4,ATL,757.10822,17215,11.300113


Notice that the columns are <b>multi-indexed</b>.

To convert from multiple column labels to single column labels, we can use the <b> to_flat_index()</b> method.

In [15]:
avg_distance_arr_delay.columns.to_flat_index()

Index([         ('dest', ''),  ('distance', 'mean'), ('distance', 'count'),
       ('arr_delay', 'mean')],
      dtype='object')

Notice above that the multi-indexed columns are now tuples.  We can thus join the elements of each tuple to create single-level column names.

In [16]:
avg_distance_arr_delay.columns=avg_distance_arr_delay.columns.to_flat_index().str.join("_")

Note that when you join the multi-index columns, the "dest" column is changed to "dest_". We thus need to rename this column.

In [17]:
avg_distance_arr_delay.rename(columns={"dest_":"dest"},inplace=True)

In [18]:
avg_distance_arr_delay.head()

Unnamed: 0,dest,distance_mean,distance_count,arr_delay_mean
0,ABQ,1826.0,254,4.38189
1,ACK,199.0,265,4.852273
2,ALB,143.0,439,14.397129
3,ANC,3370.0,8,-2.5
4,ATL,757.10822,17215,11.300113
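
As an alternative sketch, pandas also supports "named aggregation" in agg, where each keyword names an output column and its value is a (column, function) pair. This produces flat column names directly, so the flattening and renaming steps are not needed. The `toy` DataFrame below is made up for illustration.

```python
import pandas as pd

# Toy data (assumed values, for illustration only)
toy = pd.DataFrame({"dest": ["ATL", "ATL", "BOS"],
                    "distance": [760, 750, 190],
                    "arr_delay": [10.0, 20.0, -5.0]})

# Each keyword becomes an output column: name=(input column, function)
summary = (toy.groupby("dest")
              .agg(distance_mean=("distance", "mean"),
                   distance_count=("distance", "count"),
                   arr_delay_mean=("arr_delay", "mean"))
              .reset_index())

print(summary.columns.tolist())
print(summary)
```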


Below, I group by month and day, and count the number of unique destinations. Here, I use <b> nunique() </b>.

In [49]:
flights.groupby(["month","day"]).dest.nunique().head(3)

month  day
1      1      87
       2      88
       3      87
Name: dest, dtype: int64

The <b> nth() </b> method can be used to extract the nth row for each group.  Positions are zero-based, so nth(1) returns the second row of each group.

In [51]:
flights.groupby(["month","day"]).nth(1).sort_values(["year","month","day"]).head(5)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,1/1/2013 5:00
843,2013,1,2,126.0,2250,156.0,233.0,2359,154.0,B6,22,N636JB,JFK,SYR,49.0,209,22,50,1/2/2013 22:00
1786,2013,1,3,50.0,2145,185.0,203.0,2311,172.0,B6,104,N329JB,JFK,BUF,58.0,301,21,45,1/3/2013 21:00
2700,2013,1,4,106.0,2245,141.0,201.0,2356,125.0,B6,608,N192JB,JFK,PWM,44.0,273,22,45,1/4/2013 22:00
3615,2013,1,5,37.0,2230,127.0,341.0,131,130.0,B6,11,N527JB,JFK,FLL,163.0,1069,22,30,1/5/2013 22:00


There is also the <b> size() </b> method.  Note that size() counts every row in a group, including rows with missing (NaN) values, whereas count() skips them.

In [55]:
flights.groupby(["year","month"]).size().head()

year  month
2013  1        27004
      2        24951
      3        28834
      4        28330
      5        28796
dtype: int64
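
The difference between size() and count() is easiest to see on a small example with a missing value. This is a minimal sketch on a made-up `toy` DataFrame.

```python
import numpy as np
import pandas as pd

# Toy data (assumed values); one dep_delay is missing in month 1
toy = pd.DataFrame({"month": [1, 1, 2],
                    "dep_delay": [5.0, np.nan, 3.0]})
by_month = toy.groupby("month")["dep_delay"]

sizes = by_month.size()    # counts every row, including the NaN
counts = by_month.count()  # counts only non-missing values

print(sizes)
print(counts)
```

For month 1, size() reports 2 rows while count() reports 1.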

Similarly, we can also group and find the standard deviation.

In [56]:
flights.groupby(["year","month"]).dep_delay.std().head(3)

year  month
2013  1        36.390313
      2        36.266553
      3        40.130967
Name: dep_delay, dtype: float64

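You can use the <b> .all() </b> method to test whether all of the values in each group are non-zero.  Every month returns False below, which means each month has at least one flight with a departure delay of exactly 0.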

In [57]:
flights.groupby(["year","month"]).dep_delay.all()

year  month
2013  1        False
      2        False
      3        False
      4        False
      5        False
      6        False
      7        False
      8        False
      9        False
      10       False
      11       False
      12       False
Name: dep_delay, dtype: bool

Here, I use the <b> assign()</b> method to first create a column called dep_gt_1100 (note that with assign, the new column name is written without quotes, as a keyword argument).  It is a boolean column: True if a row has dep_delay greater than 1100, and False otherwise.  Next, we check whether any flight in each month has a departure delay greater than 1100.  We can see below that January and June are True.

In [58]:
flights.assign(dep_gt_1100=flights['dep_delay']>1100).groupby(['month']).dep_gt_1100.any()

month
1      True
2     False
3     False
4     False
5     False
6      True
7     False
8     False
9     False
10    False
11    False
12    False
Name: dep_gt_1100, dtype: bool
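
Here is a minimal sketch that shows any() and all() side by side on a made-up `toy` DataFrame. The column name `long_delay` is a hypothetical name chosen for this example.

```python
import pandas as pd

# Toy data (assumed values, for illustration only)
toy = pd.DataFrame({"month": [1, 1, 2, 2],
                    "dep_delay": [1200.0, 0.0, 30.0, 45.0]})

# any(): does at least one row in the group satisfy the condition?
has_long_delay = (toy.assign(long_delay=toy["dep_delay"] > 1100)
                     .groupby("month")["long_delay"]
                     .any())

# all(): are all values in the group non-zero?
all_nonzero = toy.groupby("month")["dep_delay"].all()

print(has_long_delay)
print(all_nonzero)
```

Month 1 contains a delay above 1100 and also a delay of 0, so any() is True and all() is False for it; month 2 is the opposite.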

Use the <b> describe() </b> method to print descriptive statistics for each group.
Below, I used describe() on the dep_time column only, but if you omit the column name, it will compute descriptive statistics for all columns.

In [59]:
flights.groupby(['year','month','day']).dep_time.describe().head(n=3)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,count,mean,std,min,25%,50%,75%,max
year,month,day,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2013,1,1,838.0,1384.991647,470.954331,517.0,940.25,1439.5,1756.75,2356.0
2013,1,2,935.0,1353.960428,484.82681,42.0,912.5,1412.0,1749.5,2354.0
2013,1,3,904.0,1356.665929,486.37812,32.0,909.0,1423.0,1755.5,2349.0


Next, we will talk about <b> Transformation </b> Functions.  Transformation Functions are functions that return the same number of rows as the original data.  We will look at the cumsum() function first.  
Let's look at what the dep_delay column looks like first.

In [60]:
flights.dep_delay.head()

0    2.0
1    4.0
2    2.0
3   -1.0
4   -6.0
Name: dep_delay, dtype: float64

In [61]:
flights.groupby(['year','month','day']).dep_delay.cumsum().head(n=10)

0     2.0
1     6.0
2     8.0
3     7.0
4     1.0
5    -3.0
6    -8.0
7   -11.0
8   -14.0
9   -16.0
Name: dep_delay, dtype: float64

You can see below that the shape of this transformed data is the same as the original data (336776).

In [63]:
flights.groupby(['year','month','day']).dep_delay.cumsum().shape

(336776,)
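
In the flights output above, the first ten rows all belong to the same day, so it is hard to see that the running total restarts for each group. The minimal sketch below uses a made-up `toy` DataFrame with two days to show the restart.

```python
import pandas as pd

# Toy data (assumed values, for illustration only)
toy = pd.DataFrame({"day": [1, 1, 1, 2, 2],
                    "dep_delay": [2.0, 4.0, -1.0, 10.0, 5.0]})

# The running total starts over at the first row of day 2
running = toy.groupby("day")["dep_delay"].cumsum()

print(running.tolist())  # [2.0, 6.0, 5.0, 10.0, 15.0]
print(len(running) == len(toy))
```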

Next is the <b> diff() </b> method, which calculates the difference between the current value and the previous value within each group.
The first value in each group does not have a previous value, so diff() returns NaN for it.

In [64]:
flights.dep_delay.head()

0    2.0
1    4.0
2    2.0
3   -1.0
4   -6.0
Name: dep_delay, dtype: float64

In [65]:
flights.groupby(['year','month','day']).dep_delay.diff().head()

0    NaN
1    2.0
2   -2.0
3   -3.0
4   -5.0
Name: dep_delay, dtype: float64

You can control which values are subtracted by using the <b> periods </b> parameter.  For example, with periods=2, it subtracts the value in position <b> n-2 </b> from the value in position <b> n</b>.

In [66]:
flights.groupby(['year','month','day']).dep_delay.diff(periods=2).head()

0    NaN
1    NaN
2    0.0
3   -5.0
4   -8.0
Name: dep_delay, dtype: float64
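
One way to read diff(periods=k) is as "the value minus the value k rows earlier in the same group". The minimal sketch below checks this against groupby shift() on a made-up `toy` DataFrame.

```python
import pandas as pd

# Toy data (assumed values, for illustration only)
toy = pd.DataFrame({"day": [1, 1, 1, 2, 2],
                    "dep_delay": [2.0, 4.0, -1.0, 10.0, 5.0]})
by_day = toy.groupby("day")["dep_delay"]

step1 = by_day.diff()            # current value minus previous value
step2 = by_day.diff(periods=2)   # current value minus value two rows back

# The same calculation written out with shift()
by_hand = toy["dep_delay"] - by_day.shift(2)

print(step1.tolist())
print(step2.tolist())
print(by_hand.tolist())
```

Day 2 has only two rows, so with periods=2 both of its values are NaN.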

These functions (diff, cumsum) are <b> Transform Functions </b>.  There is also the <b> transform method </b>.  
The  "transform" method also returns data of the <b> same length </b> as the original data.  

In the example below, I use transform to return mean of each group. 
Notice though that the mean gets repeated (a single group will have just one mean).

Using <b> iloc </b> I selected only columns 0,1,2, 18 and 19.

In [69]:
flights_small=flights.iloc[1:1000,:].copy()
flights_small['mean_dep_delay']=flights_small.groupby(['year','month','day']).dep_delay.transform("mean")
flights_small.iloc[:,[0,1,2,18,19]].head()

Unnamed: 0,year,month,day,time_hour,mean_dep_delay
1,2013,1,1,1/1/2013 5:00,11.560335
2,2013,1,1,1/1/2013 5:00,11.560335
3,2013,1,1,1/1/2013 5:00,11.560335
4,2013,1,1,1/1/2013 6:00,11.560335
5,2013,1,1,1/1/2013 5:00,11.560335
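
Because transform returns one value per original row, its result lines up with the original columns. A common use is comparing each row with its own group's mean. Here is a minimal sketch on a made-up `toy` DataFrame; the column name `delay_vs_day_mean` is hypothetical.

```python
import pandas as pd

# Toy data (assumed values, for illustration only)
toy = pd.DataFrame({"day": [1, 1, 2, 2],
                    "dep_delay": [10.0, 20.0, 0.0, 4.0]})

# One mean per row, repeated within each day
day_mean = toy.groupby("day")["dep_delay"].transform("mean")

# Row-by-row arithmetic works because the lengths match
toy["delay_vs_day_mean"] = toy["dep_delay"] - day_mean

print(day_mean.tolist())                  # [15.0, 15.0, 2.0, 2.0]
print(toy["delay_vs_day_mean"].tolist())  # [-5.0, 5.0, -2.0, 2.0]
```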


Finally, there is a <b> filter() </b> method that can be used with groupby.

Let's first look at the number of rows per month.

In [70]:
flights.groupby("month").dep_delay.size()

month
1     27004
2     24951
3     28834
4     28330
5     28796
6     28243
7     29425
8     29327
9     27574
10    28889
11    27268
12    28135
Name: dep_delay, dtype: int64

You can see in the output above that only July (month=7) has a count greater than 29400.  

We can use the filter method as follows.  We create an anonymous function that tests whether a group has more than 29400 rows.  Notice that the results are all from July.

In [71]:
flights.groupby(['month']).filter(lambda x:len(x['dep_delay'])>29400)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
250450,2013,7,1,1.0,2029,212.0,236.0,2359,157.0,B6,915,N653JB,JFK,SFO,315.0,2586,20,29,7/1/2013 20:00
250451,2013,7,1,2.0,2359,3.0,344.0,344,0.0,B6,1503,N805JB,JFK,SJU,200.0,1598,23,59,7/1/2013 23:00
250452,2013,7,1,29.0,2245,104.0,151.0,1,110.0,B6,234,N348JB,JFK,BTV,66.0,266,22,45,7/1/2013 22:00
250453,2013,7,1,43.0,2130,193.0,322.0,14,188.0,B6,1371,N794JB,LGA,FLL,143.0,1076,21,30,7/1/2013 21:00
250454,2013,7,1,44.0,2150,174.0,300.0,100,120.0,AA,185,N324AA,JFK,LAX,297.0,2475,21,50,7/1/2013 21:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279870,2013,7,31,2352.0,2245,67.0,49.0,2359,50.0,B6,1816,N296JB,JFK,SYR,40.0,209,22,45,7/31/2013 22:00
279871,2013,7,31,,655,,,930,,AA,711,N3BAAA,LGA,DFW,,1389,6,55,7/31/2013 6:00
279872,2013,7,31,,1400,,,1508,,US,2130,,LGA,BOS,,184,14,0,7/31/2013 14:00
279873,2013,7,31,,959,,,1125,,UA,700,,EWR,ORD,,719,9,59,7/31/2013 9:00
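
The same idea on a small scale: filter() keeps or drops whole groups, and returns the original rows of the groups that pass the test. This is a minimal sketch on a made-up `toy` DataFrame.

```python
import pandas as pd

# Toy data (assumed values): month 1 has 3 rows, month 2 has 2, month 3 has 1
toy = pd.DataFrame({"month": [1, 1, 1, 2, 2, 3],
                    "dep_delay": [2.0, 4.0, -1.0, 10.0, 5.0, 7.0]})

# Keep only the months that have more than one row
busy = toy.groupby("month").filter(lambda g: len(g) > 1)

print(busy)
```

Month 3 is dropped entirely, while every row of months 1 and 2 is returned unchanged.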


# <a id="merge"> Merging</a>

To merge two DataFrames, use the <b> pd.merge() </b> function.  Alternatively, there is also the <b> join() </b> method.  
Let's create 2 toy DataFrames first.  
Each DataFrame has 2 columns: a category column (category1 or category2) and value.

In [72]:
df1=pd.DataFrame({"category1":["a","b","c","d"],"value":[1,2,3,4]})

In [73]:
df2=pd.DataFrame({"category2":["a","c","d","e"],"value":[1,2,3,4]})

Let's look at what the DataFrames look like.

In [74]:
df1

Unnamed: 0,category1,value
0,a,1
1,b,2
2,c,3
3,d,4


In [75]:
df2

Unnamed: 0,category2,value
0,a,1
1,c,2
2,d,3
3,e,4


Given that the keys we want to merge on have different names (category1 and category2), we need to specify these columns, using left_on and right_on.  By default, merge will add the suffixes _x and _y to the overlapping "value" columns.

In [76]:
pd.merge(df1,df2,left_on="category1",right_on="category2")

Unnamed: 0,category1,value_x,category2,value_y
0,a,1,a,1
1,c,3,c,2
2,d,4,d,3


We see that there are 2 category columns, and they are identical.  We can thus drop 1 and rename the other one.

In [77]:
df1.merge(df2,left_on="category1",right_on="category2",suffixes=("_left","_right")).drop("category1",axis=1).rename(columns={"category2":"category"})

Unnamed: 0,value_left,category,value_right
0,1,a,1
1,3,c,2
2,4,d,3


By default, the merge() function performs an inner join.  Let's do a "left" merge.

A left merge is where all of the values in the left dataframe are kept.

In [78]:
df1.merge(df2,left_on="category1",right_on="category2",how="left")

Unnamed: 0,category1,value_x,category2,value_y
0,a,1,a,1.0
1,b,2,,
2,c,3,c,2.0
3,d,4,d,3.0


Similarly here is right join.  Here, all values in the right dataframe are preserved.

In [19]:
df1.merge(df2,left_on="category1",right_on="category2",how="right")

Unnamed: 0,category1,value_x,category2,value_y
0,a,1.0,a,1
1,c,3.0,c,2
2,d,4.0,d,3
3,,,e,4

Finally, the outer join will have all keys listed.

In [80]:
df1.merge(df2,left_on="category1",right_on="category2",how="outer")

Unnamed: 0,category1,value_x,category2,value_y
0,a,1.0,a,1.0
1,b,2.0,,
2,c,3.0,c,2.0
3,d,4.0,d,3.0
4,,,e,4.0
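
When checking an outer join, it can help to see where each row came from. As a sketch, the `indicator=True` option of merge adds a `_merge` column that labels each row as `both`, `left_only` or `right_only`. The two DataFrames are re-created here so the example runs on its own.

```python
import pandas as pd

# Same toy DataFrames as above, re-created so this cell is self-contained
df1 = pd.DataFrame({"category1": ["a", "b", "c", "d"], "value": [1, 2, 3, 4]})
df2 = pd.DataFrame({"category2": ["a", "c", "d", "e"], "value": [1, 2, 3, 4]})

merged = df1.merge(df2, left_on="category1", right_on="category2",
                   how="outer", indicator=True)

# Use whichever key is present to label each row
key = merged["category1"].fillna(merged["category2"])
source = dict(zip(key, merged["_merge"].astype(str)))
print(source)
```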


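There is also the <b> join() </b> method, which offers similar functionality to merge().  By default, it joins on the index of the other DataFrame, which is why category2 is set as the index of df2 below.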


In [81]:
df1.join(df2.set_index("category2"),lsuffix="_left",rsuffix="_right",on="category1",how="outer")

Unnamed: 0,category1,value_left,value_right
0.0,a,1.0,1.0
1.0,b,2.0,
2.0,c,3.0,2.0
3.0,d,4.0,3.0
,e,,4.0


You can read more about Joins in the documentation below:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html

Now let's rename the category1 and category2 columns.  We will set inplace=True so that the dataframe itself is modified.

In [82]:
df1.rename(columns={"category1":"category"},inplace=True)

In [83]:
df2.rename(columns={"category2":"category"},inplace=True)

In [84]:
df1.head()

Unnamed: 0,category,value
0,a,1
1,b,2
2,c,3
3,d,4


In [85]:
df2.head()

Unnamed: 0,category,value
0,a,1
1,c,2
2,d,3
3,e,4


Both df1 and df2 now have the same column names.

If you want to concatenate 2 DataFrames, you can use <b> pd.concat()</b>.  By default, it stacks them by row (like rbind in R).

In [86]:
pd.concat([df1,df2])

Unnamed: 0,category,value
0,a,1
1,b,2
2,c,3
3,d,4
0,a,1
1,c,2
2,d,3
3,e,4
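
Notice that the index labels 0 to 3 are repeated in the output above, because concat keeps each DataFrame's original index. As a sketch, passing `ignore_index=True` gives the result a fresh index instead. The DataFrames are re-created here so the example runs on its own.

```python
import pandas as pd

# Same toy DataFrames as above (after renaming), re-created for this example
df1 = pd.DataFrame({"category": ["a", "b", "c", "d"], "value": [1, 2, 3, 4]})
df2 = pd.DataFrame({"category": ["a", "c", "d", "e"], "value": [1, 2, 3, 4]})

kept = pd.concat([df1, df2])                       # index 0..3 appears twice
fresh = pd.concat([df1, df2], ignore_index=True)   # index renumbered 0..7

print(kept.index.tolist())
print(fresh.index.tolist())
```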


# <a id="axis">  About the "axis" parameter in various pandas methods.</a>

You will notice that many different methods take the "axis" parameter.
With <b> axis=0</b>, the operation runs down the <b>rows</b> (along the index), so a reduction such as mean() returns one value per column.  With <b> axis=1 </b>, it runs across the <b>columns</b>, so a reduction returns one value per row.
As you will notice below, for concat(), axis=0 concatenates by row (as in rbind in R).

In [87]:
pd.concat([df1,df2],axis=0)

Unnamed: 0,category,value
0,a,1
1,b,2
2,c,3
3,d,4
0,a,1
1,c,2
2,d,3
3,e,4


If you want to concatenate by columns, use axis=1. This is like R's <b> cbind</b>.
Here I added 3 DataFrames to the list (df2 repeated deliberately) to show that you can concatenate more than two DataFrames.

In [91]:
pd.concat([df1,df2,df2],axis=1)

Unnamed: 0,category,value,category.1,value.1,category.2,value.2
0,a,1,a,1,a,1
1,b,2,c,2,c,2
2,c,3,d,3,d,3
3,d,4,e,4,e,4


Now let's select the numeric columns using the <b> select_dtypes </b> method, so that we can perform some calculations.

In [92]:
flights_numeric=flights.select_dtypes(['int64','float64']).copy()

In [93]:
flights_numeric.shape

(336776, 14)
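
As a small sketch, select_dtypes also accepts the generic string "number", which selects integer and float columns without listing each dtype. The `toy` DataFrame below is made up for illustration.

```python
import pandas as pd

# Toy data (assumed values): one text column, one integer, one float
toy = pd.DataFrame({"carrier": ["UA", "B6"],
                    "flight": [1545, 745],
                    "dep_delay": [2.0, 14.0]})

numeric = toy.select_dtypes(include="number")
text = toy.select_dtypes(exclude="number")

print(numeric.columns.tolist())  # ['flight', 'dep_delay']
print(text.columns.tolist())     # ['carrier']
```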

Let's take the mean of the flights_numeric dataset along the row axis (axis=0).

In [94]:
flights_numeric.mean(axis=0)

year              2013.000000
month                6.548510
day                 15.710787
dep_time          1349.109947
sched_dep_time    1344.254840
dep_delay           12.639070
arr_time          1502.054999
sched_arr_time    1536.380220
arr_delay            6.895377
flight            1971.923620
air_time           150.686460
distance          1039.912604
hour                13.180247
minute              26.230100
dtype: float64

The axis=0 parameter returned the average of each column.
The way to think of this is as follows.  
flights_numeric has dimension (336776, 14).
With axis=0, the operation collapses the rows and returns a Series of length 14, with one value per column.  In other words, the row dimension has been reduced, but the columns are the same.

To calculate mean of each row (or the rowmean), you use axis=1.

In [95]:
flights_numeric.mean(axis=1).shape

(336776,)

Let's quickly look at the shape of the flights_numeric.

In [96]:
flights_numeric.shape

(336776, 14)

Note that the rows match.
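
The shapes are easier to follow on a tiny example. This minimal sketch uses a made-up `toy` DataFrame with 3 rows and 2 columns.

```python
import pandas as pd

# Toy data (assumed values): shape (3, 2)
toy = pd.DataFrame({"a": [1, 2, 3],
                    "b": [10, 20, 30]})

col_means = toy.mean(axis=0)  # collapses the rows: one value per column
row_means = toy.mean(axis=1)  # collapses the columns: one value per row

print(col_means.shape)     # (2,)
print(row_means.shape)     # (3,)
print(col_means.tolist())  # [2.0, 20.0]
print(row_means.tolist())  # [5.5, 11.0, 16.5]
```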

Similarly, drop() with axis set to 1 will reduce the number of columns by one.

In [97]:
flights_numeric.drop(['month'],axis=1,inplace=True)

In [98]:
flights_numeric.shape

(336776, 13)

Notice that the "month" column is not there anymore.  We removed it using drop().

In [99]:
flights_numeric.head()

Unnamed: 0,year,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,flight,air_time,distance,hour,minute
0,2013,1,517.0,515,2.0,830.0,819,11.0,1545,227.0,1400,5,15
1,2013,1,533.0,529,4.0,850.0,830,20.0,1714,227.0,1416,5,29
2,2013,1,542.0,540,2.0,923.0,850,33.0,1141,160.0,1089,5,40
3,2013,1,544.0,545,-1.0,1004.0,1022,-18.0,725,183.0,1576,5,45
4,2013,1,554.0,600,-6.0,812.0,837,-25.0,461,116.0,762,6,0


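How about dropping a row?  It is as simple as specifying the index label with axis=0.  Note that without inplace=True, drop() returns a new DataFrame and leaves flights_numeric unchanged.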

In [100]:
flights_numeric.drop(0,axis=0)

Unnamed: 0,year,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,flight,air_time,distance,hour,minute
1,2013,1,533.0,529,4.0,850.0,830,20.0,1714,227.0,1416,5,29
2,2013,1,542.0,540,2.0,923.0,850,33.0,1141,160.0,1089,5,40
3,2013,1,544.0,545,-1.0,1004.0,1022,-18.0,725,183.0,1576,5,45
4,2013,1,554.0,600,-6.0,812.0,837,-25.0,461,116.0,762,6,0
5,2013,1,554.0,558,-4.0,740.0,728,12.0,1696,150.0,719,5,58
...,...,...,...,...,...,...,...,...,...,...,...,...,...
336771,2013,30,,1455,,,1634,,3393,,213,14,55
336772,2013,30,,2200,,,2312,,3525,,198,22,0
336773,2013,30,,1210,,,1330,,3461,,764,12,10
336774,2013,30,,1159,,,1344,,3572,,419,11,59


You can specify the axis with rename() as well. Using axis=1 will rename a column.

In [101]:
flights_numeric.rename({"dep_time":"departure_time"},axis=1).head(2)

Unnamed: 0,year,day,departure_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,flight,air_time,distance,hour,minute
0,2013,1,517.0,515,2.0,830.0,819,11.0,1545,227.0,1400,5,15
1,2013,1,533.0,529,4.0,850.0,830,20.0,1714,227.0,1416,5,29


Similarly, using axis=0 allows you to rename the index label.

In [102]:
flights_numeric.rename({0:"Zero"},axis=0).head(2)

Unnamed: 0,year,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,flight,air_time,distance,hour,minute
Zero,2013,1,517.0,515,2.0,830.0,819,11.0,1545,227.0,1400,5,15
1,2013,1,533.0,529,4.0,850.0,830,20.0,1714,227.0,1416,5,29
