# Tutorial - How long can you run?
*by Debora Azevedo, Eliseu, Igor A. Brandão, Paiva*


**Goals**
The purpose of this notebook is to help you explore the following:

- Access the dataset inside a csv file;
- Explore the data (get the info);
- Extract some insights;
- Apply the content seen in class.

<hr>

# Analysis of sports data

<table style="width:100%">
  <tr>
    <th><img width="700" src="https://drive.google.com/uc?export=view&id=1H2JxLB0EoTgePMAv6apH68t8biPgquaR">
    </th>
  </tr>
  </table>
---

Ivano Vitch is an amazing athlete who has been preparing for a marathon. He has been training constantly and recording many aspects of his training sessions with the help of an app. Now he needs a survey of all the information his app has been collecting, so he can find out what he needs to improve in order to get a gold medal. Unfortunately, we have yet to discover a way to work out one day and wake up the next morning looking like the Hulk. It takes a very, very long time to achieve the desired results, but with time and dedication it will eventually happen. So Ivano is counting on you to analyse the data from his last month of training and to find out what he still needs to improve to achieve his goals. So, hands on! Help our friend Vitch prepare in the best possible way and, in doing so, win all the gold medals he can!

# Contextualizing

Ivano Vitch is a person who does physical activity, and likes it very much. He's getting ready for a marathon that happens in his town. For this, he has been training constantly, and has kept a record with an app about various aspects of his workouts. However, he needs to know if he is having good results on his performance, yet he doesn't know how to do the analysis of his training data. To do so, he asked you, the person who is using this programming notebook, to help him understand whether he's making progress or not by analyzing his data. But do not worry: let us help you in this endeavor to help Ivano.

The activities are divided into cycling, running and walking. Each activity is categorized by distance covered: from 0 up to 10 km, from 11 km up to 20 km, and above 20 km.

From this information, and from the workout.csv file that Ivano has sent you, we will begin to assist our friend in this fitness journey.

Ivano will be competing for 20 medals. Every time you solve a question in this notebook you're helping Ivano to get these medals, so work hard so that Ivano can get as many medals as possible!

# 1.0 First things first

Welcome, programmer! We're going to help you analyze Ivano's dataset, to help him in his run and you in your run to become a better data scientist.

To guide you in this journey, we divided this tutorial into the following sections:

- *Importing data with pandas*;
- *Data handling and cleaning with pandas*;
- *Data manipulation with pandas and numpy*;





The dataset is a CSV file called **workout.csv**. Here is a data dictionary for some of the columns in the CSV:

- **duration** - Duration of the activity, measured in minutes.
- **entry_mode** - Entry mode of the information in the dataset.
- **has_path** - ???.
- **source** - Application used to feed the dataset.
- **start_time** - Day and time of the beginning of each activity.
- **total_calories** -  Number of calories spent after the given activity
- **total_distance** -  Total distance traveled, in meters, in a given activity
- **tracking_mode** - ???
- **type** - Type of physical activity
- **uri** - ???
- **utc_offset** - UTC offset of the country the activities were done


Let's begin! Run the cell below to get access to Ivano's dataset, so you can start working with his data.



In [1]:
# Start by uploading "workout.csv" to get access to the dataset

# The following code uploads a file from your local file system
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving workout.csv to workout.csv
User uploaded file "workout.csv" with length 66653 bytes


## 1.1 Importing and analyzing the data

Now that you have access to Ivano's dataset in your machine, we're going to help you read his data by using **pandas**.

**Pandas** is an open-source, high-performance library for Python that provides data structures and data analysis tools. We are going to use **pandas** to read CSV files and manipulate them as [pandas.core.frame.DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame) objects. This library also allows you to:

- Use column names as labels to access data.
- Work with different data types in the same structure, e.g. a **dataframe** can have both strings and floats.
- Use **numpy** operations on columns and rows.

Just like any other Python module, we use the import keyword to bring **pandas** to our workspace. The import convention for **pandas** is:

```python
import pandas as pd
```

Before you start to manipulate the data, you must first bring it to your workflow. The most common way of doing so is by using the [pandas.read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to read a CSV file and assign it to a variable.

Run the cell below to read the **workout.csv** file and assign it to the **runningData** variable. The **index_col** parameter tells **pandas** which column to use as the index of our **dataframe**; here `index_col=0` selects the first column. Later in our mission, we're going to use this information to access and manipulate data.

In [2]:
import pandas as pd

# Use pandas to read the workout.csv file to a variable called runningData.
runningData = pd.read_csv("workout.csv", index_col=0)
runningData.index.name = None

runningData.columns

Index([u'duration', u'entry_mode', u'has_path', u'source', u'start_time',
       u'total_calories', u'total_distance', u'tracking_mode', u'type', u'uri',
       u'utc_offset'],
      dtype='object')
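
To see what **index_col** changes, here is a minimal sketch that reads a tiny in-memory CSV twice. The CSV text and its `id` column are made up for illustration and are not taken from **workout.csv**.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """id,duration,type
0,25.97,Running
1,33.07,Cycling
2,66.53,Walking
"""

# index_col=0 turns the first column ("id") into the row index
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without index_col, pandas creates a default 0..n-1 index
# and keeps "id" as an ordinary column
df_default = pd.read_csv(io.StringIO(csv_text))

print(list(df_indexed.columns))  # ['duration', 'type']
print(list(df_default.columns))  # ['id', 'duration', 'type']
```

In the indexed version the `id` values become row labels, which is why they no longer show up in the column list.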

All right. Now that you have access to the dataset, you can start working on it to help Ivano get his deserved medals! But before manipulating the data, it is important to first analyze it so that you know what you're dealing with. The workflow of a data scientist basically consists of:

- Defining the objective.
- Importing the data.
- Analyzing it for possible flaws.
- Exploring and cleaning the data.
- Manipulating the data. This step depends directly on the scientist's objective.

You already know your objective and you have already imported the data, so you're currently on step 3. As we already mentioned, when working with **pandas**, we're mainly dealing with **dataframe** type objects. You can check this by using python's built-in **type** function. When using this kind of object, **pandas** provides you with a lot of functions and attributes for analyzing your dataframes, namely:

- The **.shape** attribute returns a tuple representing the dimensions of each axis of the object.
- **DataFrame.info()** is a method that, as the name suggests, prints information about the examined **dataframe**, such as column names, non-null counts and data types.
- **DataFrame.head()** and **DataFrame.tail()** return the first and last rows of a dataframe, respectively (5 rows by default).
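
The inspection tools listed above can be tried on a small, made-up dataframe before applying them to Ivano's data. The `toy` dataframe and its values are assumptions used only for this sketch.

```python
import pandas as pd

# Hypothetical data, only used to illustrate the inspection tools
toy = pd.DataFrame({
    "duration": [25.97, 33.07, 25.38, 66.53, 24.87, 40.10, 12.50],
    "type": ["Running", "Cycling", "Running", "Walking",
             "Running", "Cycling", "Walking"],
})

print(type(toy))   # the DataFrame class
print(toy.shape)   # (7, 2): 7 rows, 2 columns

# info() prints its summary directly and returns None
toy.info()

first_rows = toy.head()   # first 5 rows by default
last_rows = toy.tail(2)   # last 2 rows when a count is given
print(first_rows)
print(last_rows)
```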

Now that you know about some functions provided by **pandas** to analyze data, you can start working on those medals! Solve the exercises below to gain more knowledge about the data and help Ivano get his first medal!

##Exercise <img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D">

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">

1. Use Python's **type()** function to assign the type of **runningData** to **runningData_type.**
2. Use the **DataFrame.shape** attribute to assign the shape of **runningData** to **runningData_shape.**
3. Use the **DataFrame.info()** method to print the general info of **runningData**, and assign its return value to **runningData_info.** Keep in mind that **info()** prints its summary and returns `None`.
4. Print the first 5 rows in the data frame by using the **DataFrame.head()** method.
5. Print the last 5 rows in the data frame by using the **DataFrame.tail()** method.

In [3]:
# [Your turn] Assign the dataSet type to runningData_type
runningData_type = type(runningData)

# [Your turn] Assign the dataSet shape to runningData_shape
runningData_shape = runningData.shape

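# [Your turn] Assign the dataSet info to runningData_info
# Note: info() prints the summary itself and returns None,
# so runningData_info ends up holding None
runningData_info = runningData.info()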

# [Your turn] Print the first 5 rows in runningData
print(runningData.head())

# [Your turn] Print the last 5 rows in runningData
print(runningData.tail())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 567 entries, 0 to 566
Data columns (total 11 columns):
duration          567 non-null float64
entry_mode        567 non-null object
has_path          567 non-null bool
source            567 non-null object
start_time        567 non-null object
total_calories    567 non-null int64
total_distance    567 non-null float64
tracking_mode     567 non-null object
type              567 non-null object
uri               567 non-null object
utc_offset        567 non-null int64
dtypes: bool(1), float64(2), int64(2), object(6)
memory usage: 49.3+ KB
   duration entry_mode  has_path     source                 start_time  \
0     25.97        API      True  RunKeeper   Thu, 20 Sep 2018 7:12:25   
1     33.07        API      True  RunKeeper  Tue, 18 Sep 2018 12:54:14   
2     25.38        API      True  RunKeeper   Tue, 18 Sep 2018 7:22:49   
3     66.53        API      True  RunKeeper  Thu, 13 Sep 2018 14:03:45   
4     24.87        API      True  Run

## 1.2 Selecting data from pandas DataFrames


Now that you know how to bring datasets to your workflow as **dataframes** using **pandas**, you can start working on the data. But to do so, you must first learn how to select specific rows and columns, so that you can perform operations on them. Fortunately, **pandas** provides many ways to do this kind of selection, and we're going to help you understand a few of them.

As we've already mentioned, you can check the different columns of your dataframe by accessing them with the **DataFrame.columns** attribute. It turns out that these columns are also attributes of your dataframe, and they can be accessed by using this very simple syntax.

```python
totalCalories = runningData.total_calories
```

Another way to do this would be to use the column's name as an index, just like this:

```python
totalCalories = runningData["total_calories"]
```

When you select a single column from a dataframe and assign it to a variable, this column will be assigned to a **Series** type of object. You can verify this by using the **type** function. The main difference between **Series** and **Dataframes** is that a **Series** is a one-dimensional labeled array, whereas a **Dataframe** is a two-dimensional collection of one or more **Series**, usually labeled by column names.
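
A short sketch of the difference, using a made-up two-column dataframe (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "duration": [25.97, 33.07],
    "total_calories": [300, 410],
})

calories_attr = toy.total_calories      # attribute access
calories_key = toy["total_calories"]    # key access, same result
two_dim = toy[["total_calories"]]       # a list of names keeps a DataFrame

print(type(calories_key))   # Series: one-dimensional
print(type(two_dim))        # DataFrame: two-dimensional, even with one column
print(calories_attr.equals(calories_key))  # True
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute such as `shape`, so key access is the safer general choice.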

Even though selecting single columns is useful, most of the time you will need more than this to effectively manipulate your data, so **pandas** provides [DataFrame.loc[]](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc). You may have noticed that, unlike the methods we've seen so far, **loc** uses brackets (**[]**) instead of parentheses (**()**). That is because **loc** is an indexer reached through an attribute rather than a method you call, so it follows Python's usual bracket syntax for selection.

There are many ways to use **loc** for selection. We're going to show you a few of them in this section, and we might come back to this subject if needed. For starters, the usual syntax for **loc** is:

```python
runningData.loc[row, column]
```
In this syntax, **row** and **column** stand for row and column labels. **Pandas** gives us many ways to specify them, and the choice directly influences the kind of output you are going to get, especially the way you specify the columns.

If you pass a single column name as **column** together with several rows (for example `:` for all rows), **loc** returns a single **Series** containing the values of that column for those rows.

```python
runningData.loc[row, "column_name"]
```

On the other hand, if you use the following syntax:

```python
runningData.loc[row, ["column_a", "column_b"]]
```

**Pandas** will return a **Dataframe** containing all specified rows and specified columns in the order of input. As you can see, the columns are passed as a list, so you can pass as many columns as you want in whatever order you wish.

Yet another way to select data is by using the split (slice) syntax:

```python
runningData.loc[row, "column_a":"column_b"]
```

As expected, using **loc** with the split syntax will return a **Dataframe** with all the specified rows and all columns from *column_a* to *column_b*, both inclusive. Unlike ordinary Python slices, label slices in **loc** include the end label.
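
The **loc** forms described above can be compared side by side on a small, made-up dataframe. The values below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "duration": [25.97, 33.07, 66.53],
    "total_calories": [300, 410, 520],
    "total_distance": [5000.0, 12000.0, 7000.0],
    "type": ["Running", "Cycling", "Walking"],
})

# Single column name with all rows -> Series
one_col = toy.loc[:, "duration"]

# List of column names -> DataFrame, columns in the requested order
col_list = toy.loc[:, ["type", "duration"]]

# Label slices are inclusive on both ends, for rows and for columns
col_slice = toy.loc[0:1, "duration":"total_distance"]

# Single row label and single column name -> one scalar value
single = toy.loc[1, "type"]

print(type(one_col))
print(list(col_list.columns))  # ['type', 'duration']
print(col_slice.shape)         # (2, 3)
print(single)                  # Cycling
```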

Now that you know how to select specific data from your dataset, you can use your newly acquired skills to help Ivano separate different values from the dataset so that they can be cleaned or processed. This will be a huge help, since some data is unneeded and he currently finds the dataset too complicated to understand. Do the following exercise to separate some columns from the dataset so that Ivano can get a clearer view of his data!

##Exercise <img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D"><img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D">

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">

1. Print all the column indices in the dataset by using the **DataFrame.columns** attribute.
2. Select the **duration** column and assign it to a variable called **durationValues**.
3. Use python's built-in **type** function to check **durationValues**' type.
4. Select all columns from **duration** to **total_distance** by using the **DataFrame.loc** method with the split syntax. You should assign the result to a variable called **workoutSplit**.
5. Select **duration**, **start_time**, **total_calories**, **total_distance** and **type** columns in order and assign them to a variable named **workoutInfoList**. Use the **DataFrame.loc** method with the column list syntax.




In [5]:
# Print the columns names
print(runningData.columns)

# Select the duration column from runningData and assign it to durationValues
durationValues = runningData.duration

# Print durationValues' type
print(type(durationValues))

# Assign all columns from duration to total_distance to the workoutSplit variable
workoutSplit = runningData.loc[:, "duration":"total_distance"]

# Assign duration, start_time, total_calories, total_distance and type columns
# to workoutInfoList
columns = ["duration", "start_time", "total_calories", "total_distance", "type"]
workoutInfoList = runningData.loc[:, columns]


Index([u'duration', u'entry_mode', u'has_path', u'source', u'start_time',
       u'total_calories', u'total_distance', u'tracking_mode', u'type', u'uri',
       u'utc_offset'],
      dtype='object')
<class 'pandas.core.series.Series'>



### 1.2.1 Boolean indexing

Even though selecting specific columns is useful, sometimes you'll need even more specific ways of selecting data, and that's where **boolean indexing** comes in.

**Boolean indexing** works by selecting rows or columns from the dataset according to some rule defined by the user. So let's say we wanted to select all rows of the dataset whose **type** column has the value *Cycling*. One way to do this would be to loop through all the rows in the dataset and select only the ones we're interested in, but this method is hacky and too expensive for large datasets. The best way to do this would be to use **boolean indexing**, which is a lot cleaner and benefits from **pandas** vectorization, a mechanism that **pandas** uses to speed up operations.

So for instance, if we wanted to select only rows from the dataset whose **column_name** column has the *"val"* value,  we could use **boolean indexing** by following this syntax:

```python
typeBool = runningData["column_name"] == "val"
runningData.loc[typeBool, :]
```

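Alternatively, whenever you use **DataFrame.loc** to specify only rows and select all columns, you can drop the column part. The reverse shortcut does not exist: a list passed alone to **loc** is interpreted as row labels, so to select only columns either keep the `:` for the rows or index the dataframe directly:

```python
typeBool = runningData["column_name"] == "val"
runningData.loc[typeBool]                     # same as runningData.loc[typeBool, :]
runningData.loc[:, ["column_a", "column_b"]]  # all rows, two columns
runningData[["column_a", "column_b"]]         # shorthand for the line above
```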
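
As a runnable illustration of boolean indexing, the sketch below builds the boolean mask explicitly and then combines two conditions with `&`. The `toy` dataframe is made up, and combining masks with `&` is an addition not covered in the text above.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "type": ["Cycling", "Running", "Cycling", "Walking"],
    "duration": [20.0, 45.0, 50.0, 15.0],
})

# Comparing a column with a value yields a boolean Series (the mask)
is_cycling = toy["type"] == "Cycling"
print(list(is_cycling))  # [True, False, True, False]

# Rows where the mask is True are kept
cycling = toy.loc[is_cycling]

# Masks can be combined with & (and) or | (or);
# each comparison needs its own parentheses
short_cycling = toy.loc[is_cycling & (toy["duration"] < 30)]

print(cycling)
print(short_cycling)
```

Chaining two separate selections gives the same rows as the combined mask; the combined form just avoids the intermediate variable.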

Now that you know how to select specific values from the dataset, it would be a good idea to select only the relevant values for Ivano's cycling marathon. This way, Ivano can get an even better glimpse of his data and you can start working on cleaning it and manipulating only the relevant rows. By clearing the next exercise, you're helping Ivano in this way, so you're helping him get 3 more medals!

##Exercise <img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D"><img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D"><img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D">

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">

1. Use boolean indexing to select all rows whose **type** column equals *"Cycling"* and assign them to a variable called **cyclingValues**.
2. Select all rows from **cyclingValues** with **duration** less than 30 and assign them to a variable called **durationLess30**.
3. Select all rows from **durationLess30** with **total_distance** greater than 8000 and assign them to a variable called **distanceGreater8k**.


In [0]:
# put your code here :)
boolType = runningData["type"] == "Cycling"
cyclingValues = runningData.loc[boolType]

boolDuration = cyclingValues["duration"] < 30
durationLess30 = cyclingValues.loc[boolDuration]

boolDistance = durationLess30["total_distance"] > 8000
distanceGreater8k = durationLess30.loc[boolDistance]


## 1.3 Using numpy functions on dataframes

Now that you know how to select different elements from your dataset, it's time to learn how to perform **NumPy** operations on your data. This is a useful feature **pandas** provides, and it will help you retrieve important information and do statistical analysis of the data.

As we've already mentioned, when using **pandas** you're usually dealing with either **Dataframes** or **Series**. What's cool about **NumPy** operations with **pandas** is that they have the same syntax for both **Dataframes** and **Series**, so you don't have to memorize different syntaxes. The most commonly used operations are:

- [Series.max()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html) and [DataFrame.max()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html), which return the greatest element(s) in the object.
- [Series.min()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html) and [DataFrame.min()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html), which return the smallest element(s) in the object.
- [Series.mean()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html) and [DataFrame.mean()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html), which return the mean of the elements.
- [Series.median()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html) and [DataFrame.median()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html), which return the median of the elements.
- [Series.sum()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html) and [DataFrame.sum()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html), which return the sum of the elements.
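
Here is a minimal sketch of the operations listed above, applied to a made-up dataframe whose values were picked so the results are easy to check by hand:

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "duration": [10.0, 20.0, 60.0],
    "total_calories": [100, 200, 900],
})

dur = toy["duration"]    # a Series
dur_max = dur.max()      # 60.0
dur_min = dur.min()      # 10.0
dur_mean = dur.mean()    # (10 + 20 + 60) / 3 = 30.0
dur_median = dur.median()  # middle value = 20.0
dur_sum = dur.sum()      # 90.0

# On a DataFrame the same call returns one value per column, as a Series
col_max = toy.max()
print(col_max)
```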

You probably noticed that the methods are very intuitive: they do exactly what their names suggest. The main difference between using these methods on **Dataframes** and on **Series** is that for the former you can specify an **axis** parameter. As the name suggests, this parameter tells **pandas** along which **axis** it should operate, so:

```python
runningData[["col_1", "col_2"]].mean(axis = 0)
```

Returns the mean of each column, computed down the rows (**axis = 0**), giving one value for **col_1** and one for **col_2**, and similarly:

```python
runningData[["col_1", "col_2"]].mean(axis = 1)
```

Returns the mean of each row, computed across the columns (**axis = 1**), giving one value per row that averages **col_1** and **col_2**.
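
The effect of **axis** is easiest to see on a tiny example. The dataframe below is made up, reusing the placeholder names **col_1** and **col_2** from the snippets above.

```python
import pandas as pd

# Hypothetical 2x2 data for illustration
toy = pd.DataFrame({
    "col_1": [1.0, 3.0],
    "col_2": [5.0, 7.0],
})

# axis=0: collapse the rows, one result per column
per_column = toy[["col_1", "col_2"]].mean(axis=0)
# col_1 -> (1 + 3) / 2 = 2.0, col_2 -> (5 + 7) / 2 = 6.0

# axis=1: collapse the columns, one result per row
per_row = toy[["col_1", "col_2"]].mean(axis=1)
# row 0 -> (1 + 5) / 2 = 3.0, row 1 -> (3 + 7) / 2 = 5.0

print(per_column)
print(per_row)
```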

All these methods will be very handy in helping Ivano on his quest to become the marathon champion, because now you can actually work on his data and get useful information about it. Solve the next question to award Ivano with 1 more medal!

##Exercise <img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D">

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">

1. Get the minimum value for **total_distance** on the **cyclingValues** dataframe, assign it to a variable called **minDist** and **print** it.
2. Assign the mean value for **total_calories** in **cyclingValues** to a variable called **meanCal**.
3. Assign the maximum value for **duration** in **cyclingValues** to a variable called **maxDur**.


In [7]:
minDist = cyclingValues["total_distance"].min()
print(minDist)

meanCal = cyclingValues["total_calories"].mean()

maxDur = cyclingValues["duration"].max()

0.0


## Keep making progress
To start the preparations for the championship, Ivano needs to keep his food intake and energy expenditure in balance, so help him learn more about his caloric expenditure during training.

<img width="300" src="https://drive.google.com/uc?export=view&id=1E12-Qp-ZQRq4fBCd1z4DhCXZ22r2yRQn">
 

##Now answer the following:
####What is the total amount of calories burnt?

In [0]:
# put your code here :)

####How many workouts are there of each type (cycling, running and walking)?

In [0]:
# put your code here :)

####What is the (%) of each workout type?

In [0]:
# put your code here :)

####What is the total distance?

In [0]:
# put your code here :)

####Which month has the highest distance?

In [0]:
# put your code here :)

# 2. Maximum values and value selection in pandas

## 2.1 Select rows from a DataFrame based on values in a column in pandas
**Pandas** lets you select multiple values in a simple and practical way.
- To select rows whose column value equals a scalar or string, `some_value`, use `==`:
  - `df[df['column_name'] == some_value]`

Example for a DataFrame (df):

<img width="700" src="https://drive.google.com/uc?export=view&id=1aGL4s2D71Qj-HpFEp_e7zaNdXGmX5CNe">

## 2.2 Find the maximum value of a column and return the corresponding row values using pandas
To get the maximum value of a given column, call the **max()** method on that column; it returns the highest value the column contains.
- `df['column_name'].max()`

Example of the maximum value in a column of the DataFrame (df):

<img width="500" src="https://drive.google.com/uc?export=view&id=1cStqoCLWkRwsaK8mWFHJTmv-HlcuI8Oo">


Another useful fact is that we can select multiple columns of a DataFrame by writing
`df[['column_name_1', 'column_name_2']]`,
which returns all the values of the listed columns.
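
Putting the pieces of this section together, the sketch below finds a column's maximum, selects the row(s) holding it, and keeps only two columns. The `city` and `temperature` columns are made up for illustration, and **idxmax()** is an extra pandas method not mentioned in the text above.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "city": ["Natal", "Recife", "Fortaleza"],
    "temperature": [29, 31, 30],
    "humidity": [70, 75, 72],
})

# Highest value in one column
highest = toy["temperature"].max()   # 31

# Rows whose value equals that maximum (there may be more than one)
rows_at_max = toy[toy["temperature"] == highest]

# Keep only the columns of interest
selected = rows_at_max[["city", "temperature"]]
print(selected)

# idxmax() returns the index label of the first maximum
label_of_max = toy["temperature"].idxmax()   # 1
```

The equality filter returns every row tied for the maximum, whereas **idxmax()** returns only the label of the first one.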

**Exercise**<img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D"><img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D">

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">


1. Use the [pandas.DataFrame.max()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html) method to calculate the maximum value of **total_calories** and assign the result to **calories_max**.
2. Use boolean indexing ([pandas.DataFrame['column_name'] == value](https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas)) to select the rows of **runningData** whose **total_calories** equals **calories_max**, and assign the result to **maximum_calories**.
3. Now select from **runningData** only the **start_time** and **total_calories** columns for the date with the highest total amount of calories spent.


In [0]:
# put your code here :)


## Nice!
If you got here, then you're doing very well. Vitch is already adapting his workouts based on the new information you have given him. Continue analyzing his data and the medals will come.

<img width="350" src="https://drive.google.com/uc?export=view&id=1Z5xMZ-aCeruVZMHKFG_YNpMX1FCg4zE-">


## 3.1 Group series
[pandas.DataFrame.groupby()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) groups rows using a mapper (a dict or key function) or by one or more columns. A function is then applied to each group and the results are combined.
## 3.2 Sum of values in the dataframe column
[pandas.DataFrame['column_name'].sum()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html) returns the sum of the values for the requested axis.
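
A minimal sketch of grouping followed by an aggregation, on made-up data (the `sport` and `laps` columns are assumptions for illustration and do not come from the workout dataset):

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "sport": ["swim", "row", "swim", "row", "swim"],
    "laps": [10, 4, 20, 6, 30],
})

# Sum of one column over the whole dataframe
total_laps = toy["laps"].sum()   # 70

# Group rows by the values of "sport", then sum "laps" inside each group
laps_per_sport = toy.groupby("sport")["laps"].sum()
# row -> 4 + 6 = 10, swim -> 10 + 20 + 30 = 60

# Any other aggregation works the same way, e.g. the mean per group
mean_laps = toy.groupby("sport")["laps"].mean()
# row -> 5.0, swim -> 20.0

print(laps_per_sport)
print(mean_laps)
```

The result of a grouped aggregation is a **Series** indexed by the group keys, so a single group can be read with `laps_per_sport["swim"]`.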
###3.2.1 What is the total distance per workout type? <img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D">

In [0]:
# put your code here :)

###3.2.2 What is the total calories burnt per workout type?<img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D">

In [0]:
# put your code here :)


###3.2.3 What is average calories burnt per workout type?<img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D">

In [0]:
# put your code here :)

### 3.3
To answer the next question, we can learn a bit more about [pandas.DataFrame.assign()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html).
See also: [pandas.Series.str.slice()](https://github.com/pandas-dev/pandas/issues/8748)
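
As a hedged sketch of how these two tools can work together: **assign()** returns a new dataframe with an extra column, and **str.slice()** cuts a substring out of every string in a column. The `stamp` column below is made up, with its text format modeled on the **start_time** values shown earlier.

```python
import pandas as pd

# Hypothetical data; the string format mimics the start_time column
toy = pd.DataFrame({
    "stamp": [
        "Thu, 20 Sep 2018 7:12:25",
        "Tue, 18 Sep 2018 12:54:14",
        "Tue, 18 Sep 2018 7:22:49",
    ],
})

# str.slice(0, 3) keeps characters 0..2 of each string (the weekday);
# assign() returns a copy with the new column and leaves toy unchanged
with_day = toy.assign(weekday=toy["stamp"].str.slice(0, 3))

# Count how many rows fall on each weekday
counts = with_day.groupby("weekday").size()

print(with_day)
print(counts)
```

Slicing fixed character positions only works because every string here starts with a three-letter weekday; parsing the text into real dates would be a more robust alternative.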
####What is the workout frequency per weekday?<img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D"><img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D"><img width="30" src="https://drive.google.com/uc?export=view&id=1RiQjMxnKQKqj_3YRQTeR83o0Pv_Ha80D">

In [0]:
# put your code here :)

# Congratulations
Incredible! You finished the data analysis, and our friend Ivano finally managed to achieve his dreams. Now he can celebrate all the medals he has won. We hope you also celebrate all the medals brought by your new skills in data analysis!

<img width="500" src="https://drive.google.com/uc?export=view&id=1qOfFw3jViXJcW3_RJKpYz09NEvydwvTt">
