# Loading the data

- pandas has a built-in function to help us load data from csv files
- the data is taken from the file and loaded into a pandas DataFrame

In [1]:
import pandas as pd
df = pd.read_csv('my_strava_activities.csv', index_col=0)

We passed the name of our file, which is located in the same directory as this Jupyter notebook. We also set the parameter `index_col`, which tells pandas which column to use as the row labels of the DataFrame.
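
To see what `index_col` changes, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The two-row sample text is made up for illustration and only mimics the layout of `my_strava_activities.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for 'my_strava_activities.csv'.
csv_text = ",distance,type\n0,51690.0,Ride\n1,133126.0,Ride\n"

# Without index_col, the unnamed first column is read as a regular column.
raw = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row labels instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(indexed.columns.tolist())
```

Comparing the two printed lists shows that the extra `Unnamed: 0` column only appears when `index_col` is left out.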

###  What is a DataFrame?

DataFrames are the main object of the pandas library. They help us store data in a table-like format with rows (for the observations) and columns (for the variables or features).

Pandas has a lot of built-in machinery for working with DataFrames that is both efficient and user friendly. A great place to discover the methods and attributes of the DataFrame object is the official documentation at:

[https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

Most often we will define DataFrames by reading in data from a file. However, we can also define them using the constructor `pd.DataFrame()` and passing it data in one of the accepted formats.

In [2]:
pd.DataFrame([[1,2,3],[4,5,6]], index=[0,1], columns = ['A','B', 'C'])

Unnamed: 0,A,B,C
0,1,2,3
1,4,5,6
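
A dictionary is another commonly used input format for the constructor. The sketch below builds the same small table from a dictionary of columns; the variable names `from_rows` and `from_dict` are just illustrative.

```python
import pandas as pd

# A list of rows, with the labels given explicitly (as in the cell above).
from_rows = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=[0, 1], columns=['A', 'B', 'C'])

# A dictionary: each key becomes a column name, each list holds that column's values.
from_dict = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})

print(from_dict)
```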


###  Looking at some entries

We can take a look at the top entries of a DataFrame by using the `DataFrame.head()` method.

In [3]:
df.head()

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
0,5.465,129.9,131.7,51690.0,3:23:50,1414.0,Ride,False,2018-05-21T09:59:24
1,6.948,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False,2018-05-20T08:46:44
2,6.853,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False,2018-05-19T08:12:35
3,6.756,,135.3,59019.1,2:38:43,937.0,Ride,False,2018-05-17T17:50:55
4,7.454,,117.7,23830.9,0:55:01,122.0,Ride,False,2018-05-17T07:15:06


Or the last entries with the `DataFrame.tail()` method

In [4]:
df.tail()

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
971,3.32,,,10018.3,0:54:34,56.3,Run,False,2013-07-05T18:55:07
972,2.521,,,15818.8,1:50:18,120.6,Run,False,2013-06-23T12:43:01
973,3.559,,,10118.4,0:47:23,60.8,Run,False,2013-06-16T08:59:56
974,2.357,,,7798.5,0:56:43,110.3,Run,False,2013-06-14T13:40:09
975,3.034,,,20998.9,1:55:21,99.0,Run,False,2013-06-01T09:02:40


### The index and column labels

Essentially a DataFrame is composed of the following three elements:

* the **index labels** (these are the bold numbers on the left hand side of the table)
* the **column names** (these are the bold names on the top of the table)
* the **data** itself (this is everything else inside the actual cells of the table)


We can see the index of a DataFrame using the attribute `DataFrame.index`

In [5]:
df.index

Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            966, 967, 968, 969, 970, 971, 972, 973, 974, 975],
           dtype='int64', length=976)

The default index is a sequence of integers starting at `0`, but this can be changed to anything we would like.

We can see the column labels using the attribute `DataFrame.columns`

In [6]:
df.columns

Index(['average_speed', 'average_heartrate', 'average_watts', 'distance',
       'elapsed_time', 'total_elevation_gain', 'type', 'commute',
       'start_date_local'],
      dtype='object')

We can select any column of a DataFrame by passing its label inside the index operator `[]`

In [7]:
df['elapsed_time'].head()

0    3:23:50
1    6:17:54
2    4:39:40
3    2:38:43
4    0:55:01
Name: elapsed_time, dtype: object

We can see the total number of rows and columns using the attribute `DataFrame.shape`.

In [8]:
df.shape

(976, 9)

###  Summary statistics

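In a DataFrame the data in each column must have the same type. We can check the data type of each column, together with its number of non-missing values, using the `DataFrame.info()` method. The number of missing values in a column is the total number of entries minus its non-null count.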

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 976 entries, 0 to 975
Data columns (total 9 columns):
average_speed           976 non-null float64
average_heartrate       424 non-null float64
average_watts           352 non-null float64
distance                976 non-null float64
elapsed_time            976 non-null object
total_elevation_gain    976 non-null float64
type                    976 non-null object
commute                 976 non-null bool
start_date_local        976 non-null object
dtypes: bool(1), float64(5), object(3)
memory usage: 69.6+ KB
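
If we want the missing values counted directly rather than subtracting by hand, one option is to combine `DataFrame.isna()` with `sum()`. This is a small sketch on a made-up three-row table (`sample` is not the real data).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the activities table with some missing readings.
sample = pd.DataFrame({
    'average_speed': [5.465, 6.756, 2.864],
    'average_heartrate': [129.9, np.nan, np.nan],
})

non_null = sample.count()       # non-missing values per column (what info() reports)
missing = sample.isna().sum()   # missing values per column

print(missing)
```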


For the numerical columns we can easily get some summary statistics using the `DataFrame.describe()` method

In [10]:
df.describe()

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,total_elevation_gain
count,976.0,424.0,352.0,976.0,976.0
mean,4.220466,135.874057,133.175852,31351.092008,494.94375
std,2.106024,12.621878,28.293171,37743.943179,772.615557
min,0.0,96.9,53.4,0.0,0.0
25%,2.76775,128.0,112.575,7227.425,42.025
50%,3.076,135.65,129.55,15435.8,128.95
75%,6.39575,143.125,149.7,38473.125,636.675
max,10.238,172.7,269.3,210087.0,5799.2


If we ask nicely we can also get summary statistics for the other columns

In [11]:
df.describe(include=[object, bool])

Unnamed: 0,elapsed_time,type,commute,start_date_local
count,976,976,976,976
unique,919,4,2,976
top,1:00:00,Run,False,2015-06-01T16:05:41
freq,17,585,946,1


### Types of variables

Variables can be classified according to two main types:
* **categorical**: discrete, finite number of values, can be numerical or not
* **continuous**: infinite number of values, usually corresponds to a measurement, always numerical

With categorical variables we usually want to get a list of all the possible values together with the number of observations for each value. We can do this by selecting the column and calling the `Series.value_counts()` method

In [12]:
df['type'].value_counts()

Run            585
Ride           346
Swim            30
VirtualRide     15
Name: type, dtype: int64

In [13]:
df['commute'].value_counts()

False    946
True      30
Name: commute, dtype: int64
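
Sometimes proportions are easier to read than raw counts. As a small sketch on made-up data (the Series `activity_type` is illustrative), `value_counts()` accepts `normalize=True` to return the share of each value instead:

```python
import pandas as pd

# Hypothetical handful of activity types.
activity_type = pd.Series(['Run', 'Ride', 'Run', 'Swim'], name='type')

counts = activity_type.value_counts()                 # number of observations per value
shares = activity_type.value_counts(normalize=True)   # fraction of observations per value

print(shares)
```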

# Selecting data from a DataFrame

A fundamental task when working with a DataFrame is selecting data from it. We have already seen a way of selecting a single column of a DataFrame by passing its label to the index operator `[]`. 

The DataFrame object has two specific operators to help us perform data selection: 

* the `iloc` operator to select rows and columns by integer positions
* the `loc` operator to select rows and columns by labels

The syntax is the following:

```
DataFrame.iloc[rows, columns]
```

and

```
DataFrame.loc[rows, columns]
```

where **rows** specifies the rows that we want and **columns** specifies the columns that we want. Let's look at each one of them in detail.

### The iloc operator


In [14]:
df.iloc[0,0]

5.4649999999999999

In [15]:
df.iloc[0, 1:4]

average_heartrate    129.9
average_watts        131.7
distance             51690
Name: 0, dtype: object

In [16]:
df.iloc[0:10, [2,3]]

Unnamed: 0,average_watts,distance
0,131.7,51690.0
1,127.8,133126.0
2,136.4,108418.0
3,135.3,59019.1
4,117.7,23830.9
5,129.0,23773.2
6,,10847.6
7,,4995.2
8,121.9,129491.0
9,,3337.9


**Question**: what do you notice about the returned values of these selections? 

### The loc operator


In [17]:
df.loc[2,'average_speed']

6.8529999999999998

In [18]:
df.loc[0:2, :]

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
0,5.465,129.9,131.7,51690.0,3:23:50,1414.0,Ride,False,2018-05-21T09:59:24
1,6.948,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False,2018-05-20T08:46:44
2,6.853,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False,2018-05-19T08:12:35


In [19]:
df.iloc[0:2, :]

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
0,5.465,129.9,131.7,51690.0,3:23:50,1414.0,Ride,False,2018-05-21T09:59:24
1,6.948,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False,2018-05-20T08:46:44


In [20]:
df.loc[0:2, ['average_speed','commute']]

Unnamed: 0,average_speed,commute
0,5.465,False
1,6.948,False
2,6.853,False


**Question**: which of the following commands correctly selects the first column of the DataFrame:

```
1. df['average_speed']
2. df.iloc[:,0]
3. df.loc[:,'average_speed']
4. df.average_speed
```


### Boolean selection

The methods we have considered so far select rows and columns based on either their position in the DataFrame or their index label and column names respectively. We can also select rows and columns based on a boolean condition. The most common way is to select some rows based on a boolean condition on one or more of the columns. 

For example suppose we want to select only the cycling activities. We cannot do this using position or label because we do not know where they are located in the DataFrame.

Instead we can write the following condition

In [21]:
df['type']=='Ride'

0       True
1       True
2       True
3       True
4       True
5       True
6      False
7      False
8       True
9      False
10     False
11     False
12      True
13      True
14     False
15      True
16     False
17      True
18     False
19      True
20      True
21     False
22      True
23     False
24     False
25      True
26     False
27      True
28      True
29      True
       ...  
946    False
947    False
948    False
949    False
950    False
951    False
952    False
953    False
954    False
955    False
956    False
957    False
958    False
959    False
960    False
961    False
962    False
963    False
964    False
965    False
966    False
967    False
968    False
969    False
970    False
971    False
972    False
973    False
974    False
975    False
Name: type, Length: 976, dtype: bool

When executed by itself it returns a Series of `True`/`False` values, one for each row of the DataFrame.

To actually get back the rows corresponding to `True` values we need to pass this condition to the `loc` operator.

In [22]:
df.loc[df['type']=='Ride',:].head()

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
0,5.465,129.9,131.7,51690.0,3:23:50,1414.0,Ride,False,2018-05-21T09:59:24
1,6.948,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False,2018-05-20T08:46:44
2,6.853,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False,2018-05-19T08:12:35
3,6.756,,135.3,59019.1,2:38:43,937.0,Ride,False,2018-05-17T17:50:55
4,7.454,,117.7,23830.9,0:55:01,122.0,Ride,False,2018-05-17T07:15:06


Equivalently we can also pass the boolean condition inside the index operator `[]` directly

In [23]:
df[df['type']=='Ride'].head()

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
0,5.465,129.9,131.7,51690.0,3:23:50,1414.0,Ride,False,2018-05-21T09:59:24
1,6.948,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False,2018-05-20T08:46:44
2,6.853,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False,2018-05-19T08:12:35
3,6.756,,135.3,59019.1,2:38:43,937.0,Ride,False,2018-05-17T17:50:55
4,7.454,,117.7,23830.9,0:55:01,122.0,Ride,False,2018-05-17T07:15:06


We can add multiple boolean conditions as long as we separate each one inside parentheses and join them with the logical operators:
* `&` for AND
* `|` for OR
* `~` for NOT

Let's get all activities which are a Ride and a commute

In [24]:
df.loc[(df['type']=='Ride') & (df['commute']==True),:].head()

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
20,8.367,114.1,112.5,23845.2,0:49:27,105.0,Ride,True,2018-04-20T08:05:37
29,8.282,129.8,131.2,24009.0,0:52:23,113.0,Ride,True,2018-04-11T07:49:03
138,6.725,107.5,85.8,23585.4,0:58:27,120.0,Ride,True,2017-10-25T08:26:41
173,7.76,,127.6,23466.7,0:50:24,119.0,Ride,True,2017-09-07T08:36:07
182,8.78,128.0,161.0,24398.7,0:48:10,117.0,Ride,True,2017-08-30T08:12:35


### Selecting columns with DataFrame methods

One final type of selection that we will look at is using some specific DataFrame methods.

Often it is useful to select columns based on their data type. For example we might want to select only the numerical columns to apply some operation on them. To do this we can use the `DataFrame.select_dtypes()` method and pass in the desired data types to the `include` parameter

In [25]:
df.select_dtypes(include=['number']).head()

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,total_elevation_gain
0,5.465,129.9,131.7,51690.0,1414.0
1,6.948,131.3,127.8,133126.0,2280.0
2,6.853,135.4,136.4,108418.0,1950.0
3,6.756,,135.3,59019.1,937.0
4,7.454,,117.7,23830.9,122.0


Another useful method is the `DataFrame.filter()` method. This method provides several parameters for searching column names or index labels. For example we can get all the columns whose name contains the string `average` as follows

In [26]:
df.filter(like='average').head()

Unnamed: 0,average_speed,average_heartrate,average_watts
0,5.465,129.9,131.7
1,6.948,131.3,127.8
2,6.853,135.4,136.4
3,6.756,,135.3
4,7.454,,117.7
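
Besides `like`, the method also accepts `items` (an explicit list of names) and `regex` (a regular expression). The following sketch tries both on a made-up one-row table named `toy`:

```python
import pandas as pd

# Hypothetical row using some of the column names from our data.
toy = pd.DataFrame({
    'average_speed': [5.465],
    'average_watts': [131.7],
    'distance': [51690.0],
    'total_elevation_gain': [1414.0],
})

by_name = toy.filter(items=['distance', 'total_elevation_gain'])  # exact names
ends_with_gain = toy.filter(regex='gain$')                        # names ending in 'gain'

print(by_name.columns.tolist())
print(ends_with_gain.columns.tolist())
```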


# Modifying a DataFrame

### Renaming rows and columns

One of the most basic operations on a DataFrame is to rename the row or column labels. Good column names should be descriptive yet brief and follow a common convention when it comes to capitalization, spaces, underscores, or other features. 

To rename rows or columns we can use the `DataFrame.rename()` method and pass a dictionary that maps the current names to the new names to the parameter `index` or the parameter `columns` respectively.

In [27]:
df.rename(index = {0: 'A', 1:'B'}, columns = {'average_speed': 'speed'})

Unnamed: 0,speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
A,5.465,129.9,131.7,51690.0,3:23:50,1414.0,Ride,False,2018-05-21T09:59:24
B,6.948,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False,2018-05-20T08:46:44
2,6.853,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False,2018-05-19T08:12:35
3,6.756,,135.3,59019.1,2:38:43,937.0,Ride,False,2018-05-17T17:50:55
4,7.454,,117.7,23830.9,0:55:01,122.0,Ride,False,2018-05-17T07:15:06
5,6.259,129.8,129.0,23773.2,1:06:00,504.0,Ride,False,2018-05-16T18:09:43
6,2.864,,,10847.6,1:08:12,347.3,Run,False,2018-05-15T19:49:01
7,2.928,,,4995.2,0:31:15,50.0,Run,False,2018-05-14T19:38:03
8,7.289,132.8,121.9,129491.0,5:55:04,1833.0,Ride,False,2018-05-12T09:13:34
9,3.004,134.0,,3337.9,0:18:54,50.0,Run,False,2018-05-05T09:47:42


### Changing the index

The default index is a list of integers from `0` to `n-1` where `n` is the total number of rows. However it is often more useful to change the index to a column that is more meaningful as an identifier. In our example we can change the index to the column giving the starting time of the activity. To do this we use the `DataFrame.set_index()` method

In [28]:
df.set_index('start_date_local')

Unnamed: 0_level_0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute
start_date_local,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2018-05-21T09:59:24,5.465,129.9,131.7,51690.0,3:23:50,1414.0,Ride,False
2018-05-20T08:46:44,6.948,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False
2018-05-19T08:12:35,6.853,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False
2018-05-17T17:50:55,6.756,,135.3,59019.1,2:38:43,937.0,Ride,False
2018-05-17T07:15:06,7.454,,117.7,23830.9,0:55:01,122.0,Ride,False
2018-05-16T18:09:43,6.259,129.8,129.0,23773.2,1:06:00,504.0,Ride,False
2018-05-15T19:49:01,2.864,,,10847.6,1:08:12,347.3,Run,False
2018-05-14T19:38:03,2.928,,,4995.2,0:31:15,50.0,Run,False
2018-05-12T09:13:34,7.289,132.8,121.9,129491.0,5:55:04,1833.0,Ride,False
2018-05-05T09:47:42,3.004,134.0,,3337.9,0:18:54,50.0,Run,False


By default this drops the column used as the index from the DataFrame, so if we try to select it now by its name we will get an error. We can either access it using the `index` attribute or we can request that the column is kept in the DataFrame by using the `drop=False` parameter

In [29]:
df.set_index('start_date_local', drop=False)

Unnamed: 0_level_0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
start_date_local,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2018-05-21T09:59:24,5.465,129.9,131.7,51690.0,3:23:50,1414.0,Ride,False,2018-05-21T09:59:24
2018-05-20T08:46:44,6.948,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False,2018-05-20T08:46:44
2018-05-19T08:12:35,6.853,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False,2018-05-19T08:12:35
2018-05-17T17:50:55,6.756,,135.3,59019.1,2:38:43,937.0,Ride,False,2018-05-17T17:50:55
2018-05-17T07:15:06,7.454,,117.7,23830.9,0:55:01,122.0,Ride,False,2018-05-17T07:15:06
2018-05-16T18:09:43,6.259,129.8,129.0,23773.2,1:06:00,504.0,Ride,False,2018-05-16T18:09:43
2018-05-15T19:49:01,2.864,,,10847.6,1:08:12,347.3,Run,False,2018-05-15T19:49:01
2018-05-14T19:38:03,2.928,,,4995.2,0:31:15,50.0,Run,False,2018-05-14T19:38:03
2018-05-12T09:13:34,7.289,132.8,121.9,129491.0,5:55:04,1833.0,Ride,False,2018-05-12T09:13:34
2018-05-05T09:47:42,3.004,134.0,,3337.9,0:18:54,50.0,Run,False,2018-05-05T09:47:42


We can always revert to the default indexing with the `DataFrame.reset_index()` method. This will take the current index and turn it into a column of the DataFrame

In [30]:
# restores the default integer index; start_date_local returns as the first column
df.set_index('start_date_local').reset_index()

Unnamed: 0,start_date_local,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute
0,2018-05-21T09:59:24,5.465,129.9,131.7,51690.0,3:23:50,1414.0,Ride,False
1,2018-05-20T08:46:44,6.948,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False
2,2018-05-19T08:12:35,6.853,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False
3,2018-05-17T17:50:55,6.756,,135.3,59019.1,2:38:43,937.0,Ride,False
4,2018-05-17T07:15:06,7.454,,117.7,23830.9,0:55:01,122.0,Ride,False
5,2018-05-16T18:09:43,6.259,129.8,129.0,23773.2,1:06:00,504.0,Ride,False
6,2018-05-15T19:49:01,2.864,,,10847.6,1:08:12,347.3,Run,False
7,2018-05-14T19:38:03,2.928,,,4995.2,0:31:15,50.0,Run,False
8,2018-05-12T09:13:34,7.289,132.8,121.9,129491.0,5:55:04,1833.0,Ride,False
9,2018-05-05T09:47:42,3.004,134.0,,3337.9,0:18:54,50.0,Run,False


### Changing the order of the columns

It is often useful to order the columns in some logical way. We can do this by simply passing a list of the column labels, written in the exact order we want, to the index operator `[]` of our DataFrame

In [31]:
new_order = ['start_date_local','distance','average_speed','elapsed_time',
             'total_elevation_gain','average_heartrate','average_watts',
             'type','commute']
df[new_order]

Unnamed: 0,start_date_local,distance,average_speed,elapsed_time,total_elevation_gain,average_heartrate,average_watts,type,commute
0,2018-05-21T09:59:24,51690.0,5.465,3:23:50,1414.0,129.9,131.7,Ride,False
1,2018-05-20T08:46:44,133126.0,6.948,6:17:54,2280.0,131.3,127.8,Ride,False
2,2018-05-19T08:12:35,108418.0,6.853,4:39:40,1950.0,135.4,136.4,Ride,False
3,2018-05-17T17:50:55,59019.1,6.756,2:38:43,937.0,,135.3,Ride,False
4,2018-05-17T07:15:06,23830.9,7.454,0:55:01,122.0,,117.7,Ride,False
5,2018-05-16T18:09:43,23773.2,6.259,1:06:00,504.0,129.8,129.0,Ride,False
6,2018-05-15T19:49:01,10847.6,2.864,1:08:12,347.3,,,Run,False
7,2018-05-14T19:38:03,4995.2,2.928,0:31:15,50.0,,,Run,False
8,2018-05-12T09:13:34,129491.0,7.289,5:55:04,1833.0,132.8,121.9,Ride,False
9,2018-05-05T09:47:42,3337.9,3.004,0:18:54,50.0,134.0,,Run,False


### Dropping data

We can drop rows and columns of a DataFrame by passing their labels to the `DataFrame.drop()` method. We must also specify whether they are rows or columns by using the `axis` parameter.

In [32]:
df.drop(0, axis='rows')

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
1,6.948,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False,2018-05-20T08:46:44
2,6.853,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False,2018-05-19T08:12:35
3,6.756,,135.3,59019.1,2:38:43,937.0,Ride,False,2018-05-17T17:50:55
4,7.454,,117.7,23830.9,0:55:01,122.0,Ride,False,2018-05-17T07:15:06
5,6.259,129.8,129.0,23773.2,1:06:00,504.0,Ride,False,2018-05-16T18:09:43
6,2.864,,,10847.6,1:08:12,347.3,Run,False,2018-05-15T19:49:01
7,2.928,,,4995.2,0:31:15,50.0,Run,False,2018-05-14T19:38:03
8,7.289,132.8,121.9,129491.0,5:55:04,1833.0,Ride,False,2018-05-12T09:13:34
9,3.004,134.0,,3337.9,0:18:54,50.0,Run,False,2018-05-05T09:47:42
10,2.993,132.5,,5887.3,0:36:16,35.9,Run,False,2018-05-03T19:14:36


In [33]:
df.drop([0,1], axis='rows')

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
2,6.853,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False,2018-05-19T08:12:35
3,6.756,,135.3,59019.1,2:38:43,937.0,Ride,False,2018-05-17T17:50:55
4,7.454,,117.7,23830.9,0:55:01,122.0,Ride,False,2018-05-17T07:15:06
5,6.259,129.8,129.0,23773.2,1:06:00,504.0,Ride,False,2018-05-16T18:09:43
6,2.864,,,10847.6,1:08:12,347.3,Run,False,2018-05-15T19:49:01
7,2.928,,,4995.2,0:31:15,50.0,Run,False,2018-05-14T19:38:03
8,7.289,132.8,121.9,129491.0,5:55:04,1833.0,Ride,False,2018-05-12T09:13:34
9,3.004,134.0,,3337.9,0:18:54,50.0,Run,False,2018-05-05T09:47:42
10,2.993,132.5,,5887.3,0:36:16,35.9,Run,False,2018-05-03T19:14:36
11,3.114,134.1,,8158.0,0:46:41,46.9,Run,False,2018-05-01T19:23:56


In [34]:
df.drop('average_speed', axis='columns').head()

Unnamed: 0,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
0,129.9,131.7,51690.0,3:23:50,1414.0,Ride,False,2018-05-21T09:59:24
1,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False,2018-05-20T08:46:44
2,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False,2018-05-19T08:12:35
3,,135.3,59019.1,2:38:43,937.0,Ride,False,2018-05-17T17:50:55
4,,117.7,23830.9,0:55:01,122.0,Ride,False,2018-05-17T07:15:06
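
As an alternative to the `axis` parameter, `DataFrame.drop()` also accepts the keyword arguments `index` and `columns`, which some people find easier to read. A minimal sketch on a made-up table (`toy` is illustrative):

```python
import pandas as pd

# Hypothetical 3x3 table.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

fewer_columns = toy.drop(columns=['a', 'b'])   # same as toy.drop(['a', 'b'], axis='columns')
fewer_rows = toy.drop(index=[0, 1])            # same as toy.drop([0, 1], axis='rows')

print(fewer_columns.columns.tolist())
print(fewer_rows.index.tolist())
```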


### Question: are these modifications being applied to the original DataFrame?

In [35]:
df.head()

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local
0,5.465,129.9,131.7,51690.0,3:23:50,1414.0,Ride,False,2018-05-21T09:59:24
1,6.948,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False,2018-05-20T08:46:44
2,6.853,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False,2018-05-19T08:12:35
3,6.756,,135.3,59019.1,2:38:43,937.0,Ride,False,2018-05-17T17:50:55
4,7.454,,117.7,23830.9,0:55:01,122.0,Ride,False,2018-05-17T07:15:06


By default, most pandas DataFrame methods do not modify the original object, but instead return a new object that is a copy of the original and contains the specified modifications. To keep the modifications we can assign the returned object to a variable. Alternatively, when we want the modifications to be applied to the original object itself, methods that support it accept the parameter:

```
inplace = True
```
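
The sketch below contrasts the three situations on a made-up two-column table (`toy` and the other names are illustrative): calling a method and discarding the result, assigning the result to a name, and using `inplace=True`.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# The returned copy is discarded here, so toy keeps both columns.
toy.drop('a', axis='columns')
after_plain_call = toy.columns.tolist()

# Option 1: assign the returned copy to a name.
reassigned = toy.drop('a', axis='columns')

# Option 2: ask the method to modify toy itself (the call then returns None).
toy.drop('b', axis='columns', inplace=True)
after_inplace = toy.columns.tolist()

print(after_plain_call, reassigned.columns.tolist(), after_inplace)
```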



### Adding data

We can add a new column to a DataFrame by simply defining it directly

In [36]:
df['new column'] = 0

In [37]:
df.head()

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local,new column
0,5.465,129.9,131.7,51690.0,3:23:50,1414.0,Ride,False,2018-05-21T09:59:24,0
1,6.948,131.3,127.8,133126.0,6:17:54,2280.0,Ride,False,2018-05-20T08:46:44,0
2,6.853,135.4,136.4,108418.0,4:39:40,1950.0,Ride,False,2018-05-19T08:12:35,0
3,6.756,,135.3,59019.1,2:38:43,937.0,Ride,False,2018-05-17T17:50:55,0
4,7.454,,117.7,23830.9,0:55:01,122.0,Ride,False,2018-05-17T07:15:06,0


We can define a new row similarly, but this time we must distinguish that it is a row and not a column. We can do this with the `loc` operator

In [38]:
df.loc[len(df.index)] = 0

In [39]:
df.tail()

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local,new column
972,2.521,,,15818.8,1:50:18,120.6,Run,0,2013-06-23T12:43:01,0
973,3.559,,,10118.4,0:47:23,60.8,Run,0,2013-06-16T08:59:56,0
974,2.357,,,7798.5,0:56:43,110.3,Run,0,2013-06-14T13:40:09,0
975,3.034,,,20998.9,1:55:21,99.0,Run,0,2013-06-01T09:02:40,0
976,0.0,0.0,0.0,0.0,0,0.0,0,0,0,0


### Adding DataFrames together

When we want to add both new rows and new columns at once, we have several options. First let's consider the `pandas.concat()` function. Note that this is a pandas function and not a DataFrame method.

The first argument to the `pandas.concat()` function must be a list of DataFrames. Then we can specify the `axis` parameter which tells pandas how we want the DataFrames to be concatenated. There are essentially two options:
- stacking them vertically one on top of the other
- stacking them horizontally side by side


In [40]:
import numpy as np
df1 = pd.DataFrame(np.full((2,3),'x', dtype=object), index=['key1', 'key2'], columns=['A', 'B', 'C'])
df1

Unnamed: 0,A,B,C
key1,x,x,x
key2,x,x,x


In [41]:
df2 = pd.DataFrame(np.full((3,3),'o', dtype=object), index=['key2', 'key3', 'key4'], columns=['A', 'B', 'C'])
df2 

Unnamed: 0,A,B,C
key2,o,o,o
key3,o,o,o
key4,o,o,o


The default value for concatenation is `axis='rows'` (equivalently `axis=0`), which tells pandas to stack the second DataFrame under the first one. Pandas will check for us whether the column names are the same or not. If they are, then it will just stack them directly underneath each other so that the columns line up. Let's try this out with df1 and df2, which have the same column names



In [42]:
pd.concat([df1,df2])

Unnamed: 0,A,B,C
key1,x,x,x
key2,x,x,x
key2,o,o,o
key3,o,o,o
key4,o,o,o


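A simpler version of the `pandas.concat()` function is the `DataFrame.append()` method, which takes a set of new rows and appends them at the bottom of the DataFrame. Internally it just calls the `pandas.concat()` function. Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions use `pd.concat([df1, df2])` instead.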

In [43]:
df1.append(df2)

Unnamed: 0,A,B,C
key1,x,x,x
key2,x,x,x
key2,o,o,o
key3,o,o,o
key4,o,o,o


It is also possible to concatenate horizontally by changing the `axis` parameter to `columns`

In [44]:
df3 = pd.DataFrame(np.full((2,2),'v', dtype=object), index=['key1', 'key2'], columns=['D', 'E'])
df3

Unnamed: 0,D,E
key1,v,v
key2,v,v


In [45]:
pd.concat([df1,df3], axis='columns')

Unnamed: 0,A,B,C,D,E
key1,x,x,x,v,v
key2,x,x,x,v,v


### Changing data

What if we actually want to assign new values to entries of a DataFrame? A big advantage of having data stored in DataFrames is that we can perform vectorized operations on entire columns at once without having to write any for loops.


In [46]:
df['distance'] = df['distance']/1000
df.head()

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,elapsed_time,total_elevation_gain,type,commute,start_date_local,new column
0,5.465,129.9,131.7,51.69,3:23:50,1414.0,Ride,0,2018-05-21T09:59:24,0
1,6.948,131.3,127.8,133.126,6:17:54,2280.0,Ride,0,2018-05-20T08:46:44,0
2,6.853,135.4,136.4,108.418,4:39:40,1950.0,Ride,0,2018-05-19T08:12:35,0
3,6.756,,135.3,59.0191,2:38:43,937.0,Ride,0,2018-05-17T17:50:55,0
4,7.454,,117.7,23.8309,0:55:01,122.0,Ride,0,2018-05-17T07:15:06,0


This is much more efficient than updating entries one by one. When performing operations on a DataFrame's columns it is always a good idea to first try to express them in terms of vectorized operations, since this is usually the most efficient approach.
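
To make the comparison concrete, here is a small sketch that converts a few made-up distances from metres to kilometres both ways (the names `distance_m`, `vectorized` and `looped` are illustrative). Both give the same values; the vectorized form is shorter and avoids the Python-level loop.

```python
import pandas as pd

# Hypothetical distances in metres.
distance_m = pd.Series([51690.0, 133126.0, 108418.0])

# Vectorized: one expression applied to the whole column.
vectorized = distance_m / 1000

# Element by element: the same arithmetic written as an explicit loop.
looped = pd.Series([value / 1000 for value in distance_m])

print(vectorized.equals(looped))
```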

Pandas also provides its own methods equivalent to the main operators:

```
Series.add()
Series.mul()
Series.div()
Series.floordiv()
Series.pow()
Series.mod()
```

What is the purpose of this? Methods can have parameters allowing for more flexibility.

In [47]:
df['average_heartrate'].add(1).head()

0    130.9
1    132.3
2    136.4
3      NaN
4      NaN
Name: average_heartrate, dtype: float64

In [48]:
df['average_heartrate'].add(1, fill_value=0).head()

0    130.9
1    132.3
2    136.4
3      1.0
4      1.0
Name: average_heartrate, dtype: float64

# Dealing with missing data

We saw that the `DataFrame.info()` method reports the number of non-null values in each column, from which we can work out how many values are missing

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 977 entries, 0 to 976
Data columns (total 10 columns):
average_speed           977 non-null float64
average_heartrate       425 non-null float64
average_watts           353 non-null float64
distance                977 non-null float64
elapsed_time            977 non-null object
total_elevation_gain    977 non-null float64
type                    977 non-null object
commute                 977 non-null int64
start_date_local        977 non-null object
new column              977 non-null int64
dtypes: float64(5), int64(2), object(3)
memory usage: 84.0+ KB
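
Two common ways of handling missing values are to drop the rows that contain them or to fill them with a replacement value. The following is only a sketch on a made-up table (`toy` is illustrative), and filling with `0` is an arbitrary choice that will not suit every column.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps.
toy = pd.DataFrame({
    'average_heartrate': [129.9, np.nan, 134.0],
    'average_watts': [131.7, 135.3, np.nan],
})

complete_rows = toy.dropna()   # keep only rows without any missing value
filled = toy.fillna(0)         # replace every missing value with 0

print(len(complete_rows))
print(filled.isna().sum().sum())
```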


# Exercise: maximum values

Play around with the DataFrame `df` to perform the following tasks:

1) Select only numerical columns

2) Drop all rows that contain a missing value

3) Convert all data types to integer

4) Find the maximum of each column

5) Select only the rows that contain a maximum value in at least one column

# Solution

In [50]:
df = pd.read_csv('my_strava_activities.csv', index_col=0)
df = df.select_dtypes(include=['number'])

In [51]:
df.head()

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,total_elevation_gain
0,5.465,129.9,131.7,51690.0,1414.0
1,6.948,131.3,127.8,133126.0,2280.0
2,6.853,135.4,136.4,108418.0,1950.0
3,6.756,,135.3,59019.1,937.0
4,7.454,,117.7,23830.9,122.0


In [52]:
df = df.dropna()
df = df.astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 260 entries, 0 to 600
Data columns (total 5 columns):
average_speed           260 non-null int64
average_heartrate       260 non-null int64
average_watts           260 non-null int64
distance                260 non-null int64
total_elevation_gain    260 non-null int64
dtypes: int64(5)
memory usage: 12.2 KB


In [53]:
df.max().values

array([     9,    160,    203, 210087,   5799])

In [54]:
df[df.eq(df.max().values).any(axis='columns')]

Unnamed: 0,average_speed,average_heartrate,average_watts,distance,total_elevation_gain
62,9,138,145,72708,428
219,7,138,203,210087,3801
335,7,160,127,23872,71
471,5,151,186,161359,5799
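
The last step packs several operations into one line. The sketch below spells it out on a made-up three-row table (`toy` and the intermediate names are illustrative): compare every entry with its column maximum, check each row for at least one `True`, then use the result as a boolean selection.

```python
import pandas as pd

# Hypothetical table: the maximum of 'a' is in row 1, the maximum of 'b' in row 0.
toy = pd.DataFrame({'a': [1, 9, 3], 'b': [7, 2, 5]})

col_max = toy.max()                    # one maximum per column
is_max = toy.eq(col_max)               # True where an entry equals its column maximum
has_max = is_max.any(axis='columns')   # True for rows with at least one True
rows_with_max = toy[has_max]

print(rows_with_max)
```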
