# Worksheet 1: Introduction to Data Science

Welcome to DSCI 100: Introduction to Data Science!  

Each week you will complete a lecture assignment like this one. For this worksheet, there are two parts:

1. [Introduction to using Jupyter Notebooks](#1.-Introduction-to-Jupyter-Notebooks)
2. [Introduction to analyzing data in Python](#2.-Analyze-some-data)

## 1. Introduction to Jupyter Notebooks
This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results.

### 1.1. Text Cells
In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to learn Markdown, but you might want to.

After you edit a text cell, click the "run cell" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions of the lab.)

**Question 1.1.1.** This paragraph is in its own text cell.  Try editing it so that all of the sentences following this one are deleted, then click the "run cell" ▶| button.  This sentence, for example, should be deleted.  So should this one.

### 1.2. Code Cells
Other cells contain code in the Python language. Running a code cell will execute all of the code it contains.

To run the code in a cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, either press Run ▶| or hold down the `shift` key and press `return` or `enter`.

Try running the next cell:

In [None]:
print("Hello, World!")

The above code cell contains a single line of code, but cells can also contain multiple lines of code. When you run a cell, the lines of code are executed in the order in which they appear. Every `print` expression prints a line. Run the next cell and notice the order of the output.

In [None]:
print("First this line is printed,")
print("and then this one.")

### 1.3. Adding cells to Jupyter Notebooks
You can use Jupyter notebooks for your own projects or documents.  When you make your own notebook, you'll need to create your own cells for text and code.

To add a cell, click the + button in the menu bar.  It'll start out as a code cell.  You can change it to a text cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing "Markdown".

**Question 1.3.1.** Add a code cell below this one.  Write code in it that prints out:
   
    A whole new code cell!

Run your cell to verify that it works.

**Question 1.3.2.** Add a text/Markdown cell below this one. Write the text "A whole new Markdown cell" in it.

## 2. Analyze some data

Now that you know how to use a Jupyter notebook, we will start to analyze some data. As you work, we will provide automated feedback via tests so you can check whether your work is correct. To enable this feedback, run the cell below to set things up.

In [None]:
# run this cell to set up the automated feedback
import tests_worksheet_01 as t

### Question 2.1 - importing packages
{points: 1} 

Using the `import` statement, import the `pandas` package using the alias `pd`, and the `altair` package using the alias `alt`, in the cell below:

In [None]:
# write your code here


To check whether the code you wrote above was correct, run the cell below. If your code is correct, it will print "Success" to tell you that. If your code is not correct, it will give you a hint towards the correct answer.

In [None]:
# run this cell to test your answer to the question above
t.test_2_1(dir())

> Note: if you run the above cell and see an error like this:
>
>```
>---------------------------------------------------------------------------
>NameError                                 Traceback (most recent call last)
><ipython-input-1-1be3d32133e4> in <module>
>      1 # run this cell to test your answer to the question above
>----> 2 t.test_2_1(dir())
>
>NameError: name 't' is not defined
>```
>
>That means you probably forgot to run the cell above to set up the automated feedback. Try running `import tests_worksheet_01 as t` and testing your answer again.
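
The name `t` in the note above is an alias created by `import ... as`. As a minimal sketch of how aliases behave (using the standard-library `math` module and the made-up name `undefined_alias` purely for illustration, not the packages this worksheet asks for), a name works once the import has run and raises `NameError` if it was never defined:

```python
# Illustration only: math and the alias m are stand-ins for any package.
import math as m

# The alias is now a defined name, so functions are reachable through it.
print(m.sqrt(16))  # prints 4.0

# A name that was never imported or assigned raises NameError,
# which is the same error shown in the note above.
try:
    undefined_alias.sqrt(16)
except NameError as err:
    print(err)  # prints: name 'undefined_alias' is not defined
```

If you see a `NameError`, re-running the cell that contains the relevant `import` usually resolves it.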

### Is there a relationship between 5 km race time and body mass index in women runners?

Now let's use Python to answer a research question for which we have some data (described below): is there a relationship between two quantitative variables, 5 km race time and body mass index (BMI), for women runners in this data set? To answer this exploratory question, we will need to do the following things in Python:

1. load the data set into Python
2. subset & transform the data we are interested in visualizing from the loaded dataset
3. create a new column to get the unit of time in minutes instead of seconds
4. create a plot to visualize this modified data

*Note - subsetting the data and converting from seconds to minutes is not absolutely required to answer our question, but it will give us practice manipulating data in Python, and make our data tables and figures more readable.*


> #### About the data set
> Researchers Vickers and Vertosick performed [a study in 2016](https://bmcsportsscimedrehabil.biomedcentral.com/articles/10.1186/s13102-016-0052-y) that aimed to identify what factors had a relationship with race performance of recreational runners so that they could better predict future 5 km, 10 km and marathon race times for individual runners. Such predictions (and knowing what drives these predictions) can help runners by suggesting changes they could make to modifiable factors, such as training, to help them improve race time. Unmodifiable factors that contribute to the prediction, such as age or sex, allow for fair comparisons to be made between different runners.
>
>Vickers and Vertosick reasoned that their study is important because all previous research done to predict race times has focused on data from elite athletes. This biased data set means that the predictions generated from it do not necessarily do a good job predicting race times for recreational runners (whose data was not in the dataset that was used to create the model that generates the predictions). Additionally, previous research focused on reporting/measuring factors that require special expertise or equipment that are not freely available to recreational runners. This means that recreational runners may not be able to put their characteristics/measurements for these factors in the race time prediction models and so they will not be able to obtain an accurate prediction, or a prediction at all (in the case of some models).
>
>To make a better model, Vickers and Vertosick performed a large survey. They put their survey on the news website [Slate.com](https://slate.com/) attached to a news story about race time prediction. They were able to obtain 2,497 responses. The survey included questions that allowed them to collect a data set that included: 
>- age,
>- sex,
>- body mass index (BMI),
>- whether they are an endurance runner or speed demon,
>- what type of shoes they wear,
>- what type of training they do,
>- race time for 2-3 races they completed in the last 6 months,
>- self-rated fitness for each race,
>- and race difficulty for each race.
>



### Question 2.2 - Multiple Choice
{points: 1}

What kind of graph will we be creating? Choose the correct answer from the options below. 

A. Bar Graph 

B. Pie Chart

C. Scatter Plot

D. Box Plot 

*Assign the letter that corresponds to your answer to an object called `answer2_2`. Be sure to surround your answer with quotation marks.* 

In [None]:
# Replace None with the letter that corresponds to your answer.
# Be sure to surround your answer with quotation marks.
answer2_2 = None

answer2_2

In [None]:
# run this cell to test your answer to the question above
t.test_2_2(answer2_2)

### Question 2.3 - load the dataset into Python
{points: 1}

The data set we are loading is called `race_times.csv` and it contains a subset of the data from the study described above. The file is in the same directory/folder as the file for this notebook. It is a comma separated file (meaning the columns are separated by the `,` character).

Fill in the `...` in the cell below to load this data into Python. To do this use the `pd.read_csv()` function. Doing this will save the data from `race_times.csv` to an object called `race_times`. 

If you need additional help try `?pd.read_csv` and/or ask your neighbours or the Instructional team for help.
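
As a rough sketch of what `pd.read_csv()` does, the example below reads a tiny comma-separated table from an in-memory string instead of a file. The table contents, the column names (`name`, `height_cm`), and the variable name `toy` are made up for illustration and are not part of `race_times.csv`.

```python
import io

import pandas as pd

# Hypothetical comma-separated text: the first line holds the column names.
csv_text = "name,height_cm\nAda,165\nGrace,170\n"

# read_csv accepts a file path or a file-like object; StringIO wraps the
# string so it can be read like a file.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # prints (2, 2): two rows, two columns
print(list(toy.columns))  # prints ['name', 'height_cm']
```

When the data lives in a file, the file's path (as a string) is passed in place of the `StringIO` object.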

In [None]:
# race_times = ...

race_times.head()

In [None]:
# run this cell to test your answer to the question above
t.test_2_3(race_times)

### Question 2.4 - subset and transform the data we are interested in 
{points: 1}

Rearrange the lines of code given below to subset the data we are interested in visualizing (BMI & 5 km race time for women runners), drop missing values (for 5 km race times only), and transform the race time units from seconds to minutes.
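
The kinds of operations named above can be sketched on a small made-up table. Everything in this example (the data, the column names `group`, `score`, `duration_seconds`, `note`, and the variable names) is hypothetical, and it uses `dropna(subset=...)` and `assign`, which are alternatives to the lines in the exercise rather than its expected answer.

```python
import pandas as pd

# Hypothetical toy data with some missing values (None becomes NaN).
toy = pd.DataFrame({
    "group": ["a", "b", "a", "a"],
    "score": [1.0, 2.0, None, 4.0],
    "duration_seconds": [120.0, 90.0, 300.0, None],
    "note": ["w", "x", "y", "z"],
})

# Keep only the rows where a condition holds (boolean filtering).
group_a = toy[toy["group"] == "a"]

# Keep only the columns of interest; .copy() gives an independent table.
two_cols = group_a[["score", "duration_seconds"]].copy()

# Drop rows with a missing value in one specific column only.
complete = two_cols.dropna(subset=["duration_seconds"])

# Add a new column derived from an existing one (seconds to minutes).
complete = complete.assign(duration_minutes=complete["duration_seconds"] / 60)

print(complete["duration_minutes"].tolist())  # prints [2.0, 5.0]
```

Each step takes the table produced by the previous step, so the order of the lines matters.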

In [None]:
# Rearrange the commented out lines of code given below to subset the data we are interested in visualizing,
# drop missing values, and transform the race time units from seconds to minutes.
# To run the code, uncomment it (remove "#").

# race_times_women = race_times_women.dropna()
# race_times_women['km5_time_minutes'] = race_times_women['km5_time_seconds'] / 60
# race_times_women  = race_times_women[['bmi', 'km5_time_seconds']]
# race_times_women = race_times[race_times['sex'] == 'female']


race_times_women.head()

In [None]:
# run this cell to test your answer to the question above
t.test_2_4(race_times_women)

### Question 2.5 - create a plot to visualize this modified data

{points: 1}

Fill in the `...` in the lines of code given below to create a scatter plot with `bmi` on the x axis and `km5_time_minutes` on the y axis. We can use this visualization to start exploring whether there is a relationship between 5 km race time and body mass index in women runners.

In [None]:
# Fill in the correct values in place of "..." in the code skeleton given below.
# To run the code, uncomment it (remove "#").

#race_times_plot = alt.Chart(...).mark_circle(size=60, opacity=0.3).encode(
#    x=alt.X(..., title=...),
#    y=alt.Y(..., title=...)
#)
#race_times_plot = race_times_plot.configure_axis(
#    labelFontSize=12,
#    titleFontSize=14
#)


race_times_plot

In [None]:
# run this cell to test your answer to the question above
t.test_2_5(race_times_plot)

### Question 2.6 - Multiple Choice
{points: 1}

Looking at the graph above, choose the statement below that best reflects what we see.

A. There appears to be no relationship between 5 km run time and body mass index for women; as the value for body mass index increases, we see neither an increase nor a decrease in the time it takes to run 5 km.

B. There may be a positive relationship between 5 km run time and body mass index for women; as the value for body mass index increases, so does the time it takes to run 5 km.

C. There may be a negative relationship between 5 km run time and body mass index for women; as the value for body mass index increases, the time it takes to run 5 km decreases.

*Assign the letter that corresponds to your answer to an object called `answer2_6`. Be sure to surround your answer with quotation marks.* 

In [None]:
# Replace None with the letter that corresponds to your answer.
# Be sure to surround your answer with quotation marks.
answer2_6 = None

answer2_6

In [None]:
# run this cell to test your answer to the question above
t.test_2_6(answer2_6)

## Attributions
- UBC [DSCI 100 Public Material](https://github.com/UBC-DSCI/dsci-100-assets)
- UC Berkeley [Data 8 Public Materials](https://github.com/data-8/data8assets)