In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("lab1.ipynb")

<img src="img/dsci511_header.png" width="600">

# Lab 1: Reading in and wrangling data

## Instructions
rubric={mechanics:5}

Check off that you have read and followed each of these instructions:

- [ ] All files necessary to run your work must be pushed to your GitHub.ubc.ca repository for this lab.
- [ ] You need to have a minimum of 3 commit messages associated with your GitHub.ubc.ca repository for this lab.
- [ ] You must also submit the `.ipynb` file and the rendered PDF for this lab to Gradescope. The entire notebook must be executed so the TAs can see the results of your work.
- [ ] **There is autograding in this lab, so please do not move or rename this file. Also, do not copy and paste cells; if you need to add new cells, create them via the "Insert a cell below" button instead.**
- [ ] To ensure you do not break the autograder, remove all code for installing packages (i.e., DO NOT have `! conda install ...` or `! pip install ...` in your homework)!
- [ ] Follow the [MDS general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).
- [ ] <mark>This lab has hidden tests. In this lab, the visible tests are just there to ensure you create an object with the correct name. The remaining tests are hidden intentionally. This is so you get practice deciding when you have written the correct code and created the correct data object. This is a necessary skill for data scientists, and if we were to provide robust visible tests for all questions you would not develop this skill, or at least not to its full potential.</mark>


## Code Quality
rubric={quality:5}

The code that you write for this assignment will be given one overall grade for code quality; see our code quality rubric as a guide to what we are looking for. Also, for this course (and other MDS courses that use Python), we are trying to follow the PEP 8 code style. There is a guide you can refer to: https://peps.python.org/pep-0008/

Each code question will also be assessed for code accuracy (i.e., does it do what it is supposed to do?).

## Writing 
rubric={writing:5}

To get the marks for this writing component, you should:

- Use proper English, spelling, and grammar throughout your submission (the non-coding parts).
- Be succinct. This means being specific about what you want to communicate, without being superfluous.


## Let's get started!

Run the cell below to load the packages needed for this lab.

In [2]:
import pandas as pd
import numpy as np
import altair as alt
pd.set_option('display.max_rows', 6)

## Exercise 1: Reading in Data

Read the data files listed in the table below, and store them as pandas data frames with the names provided in the table. We will use hidden tests to grade this, so you will get to practice deciding that your job is done, and done correctly.

**Note: if the column names are missing from any data set, you need to add them yourself programmatically via Python.**

| File  | Name for Data Frame | File location |
|---|---|----|
| `abbotsford_lang.xlsx`  | `abbotsford` | `data` directory of this repo |
| `calgary_lang.csv`  | `calgary`  | `data` directory of this repo |
| `edmonton_lang.xlsx`  | `edmonton`  | https://github.com/ttimbers/canlang/blob/master/inst/extdata/edmonton_lang.xlsx?raw=true |
|  `kelowna_lang.csv` | `kelowna`  | `data` directory of this repo |
| `vancouver_lang.csv`  | `vancouver`  | `data` directory of this repo |
| `victoria_lang.tsv`  | `victoria`  | https://github.com/ttimbers/canlang/raw/master/inst/extdata/victoria_lang.tsv |
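
As a hedged illustration of the note about missing column names, the sketch below reads a tiny header-less CSV from an in-memory string and supplies the names in code. The file contents and the column names (`category`, `language`, `speakers`) are hypothetical and are not taken from the lab's data files; inspect each real file before deciding how to read it.

```python
import io

import pandas as pd

# Hypothetical file contents: a CSV whose first row is data, not a header.
raw_text = "Aboriginal languages,Cree,10\nOfficial languages,English,500\n"

# Hypothetical column names, for illustration only.
column_names = ["category", "language", "speakers"]

# header=None tells pandas not to treat the first row as column names;
# names= supplies the labels programmatically.
toy = pd.read_csv(io.StringIO(raw_text), header=None, names=column_names)
print(toy)
```

`pd.read_csv` also accepts a file path or a URL in place of the in-memory buffer, and `pd.read_excel` takes the same `header` and `names` parameters for `.xlsx` files.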


### The Data

The data you will be working with in this first exercise is language data from the 2016 Canadian Census for cities in Western Canada. If you are unfamiliar with Western Canadian geography, here’s a map to help you start to get more familiar:

<img src="https://www.canadatours.com/images/maps/Canada_W.gif" width=500>

Image source: https://www.canadatours.com/canada_maps.cfm?#W 

### Exercise 1.1: Read in the Abbotsford language Data
rubric={autograde:5}

In [None]:
abbotsford = ...
abbotsford

In [None]:
grader.check("ex1_1")

### Exercise 1.2: Read in the Calgary language Data
rubric={autograde:5}

In [None]:
calgary = ...
calgary

In [None]:
grader.check("ex1_2")

### Exercise 1.3: Read in the Edmonton language Data
rubric={autograde:5}

In [None]:
url = ...
edmonton = ...
edmonton

In [None]:
grader.check("ex1_3")

### Exercise 1.4: Read in the Kelowna language Data
rubric={autograde:5}

In [None]:
...
kelowna

In [None]:
grader.check("ex1_4")

### Exercise 1.5: Read in the Vancouver language Data
rubric={autograde:5}

In [None]:
vancouver = ...
vancouver

In [None]:
grader.check("ex1_5")

### Exercise 1.6: Read in the Victoria language Data
rubric={autograde:5}

In [None]:
url = ...
victoria = pd.read_csv(url, sep='\t')
victoria

In [None]:
grader.check("ex1_6")

## Exercise 2: Basic Data Wrangling

rubric={autograde:10}

Read the file `region_lang.csv` (located in the `data` directory of this repo) into a pandas data frame. We will use this data frame to uncover the name of the Canadian census metropolitan area which has the second greatest number of people who claim that the language they speak most often at home is **Spanish**. Return the region name as a string and assign this string to a variable named `spanish2`.
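
One possible pattern for "second greatest" questions, sketched on a made-up data frame: filter to the rows of interest, sort in descending order, and take the second row by position. The column names (`region`, `language`, `speakers`) and the values are assumptions for illustration; the real `region_lang.csv` may be laid out differently.

```python
import pandas as pd

# Made-up data; the real file's columns and values may differ.
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B", "C", "C"],
    "language": ["Spanish", "French", "Spanish", "French", "Spanish", "French"],
    "speakers": [30, 5, 50, 7, 40, 9],
})

# Keep only the rows for the language of interest.
spanish_rows = toy[toy["language"] == "Spanish"]

# Sort from largest to smallest count, then pick the second row by position.
ranked = spanish_rows.sort_values("speakers", ascending=False)
second_region = ranked["region"].iloc[1]

print(second_region)
```

Using `.iloc[1]` on a single column returns a scalar value rather than a one-row data frame, which is convenient when the result has to be a plain string.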

In [None]:
...

In [None]:
spanish2

In [None]:
grader.check("ex2")

## Exercise 3: More Data Wrangling

rubric={accuracy:20}

For this exercise, we want you to choose a Canadian census metropolitan area from the `region_lang` data set you encountered in the previous question and find the top 5 languages spoken most often at home from that area. Your final result should be a data frame with two columns: 1. `language` 2. `perc_pop`.

The column `perc_pop` should be the percentage of the area’s population who reported that they speak that language most often at home. You can find the population size for each Canadian census metropolitan area in the file `region_data.csv` located in the `data` directory of this repo.
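
A minimal sketch of the general idea, on made-up data: combine a table of counts with a table of population sizes, compute a percentage column, and keep the largest rows. All column names other than `language` and `perc_pop` (here `region`, `speakers`, `population`) are hypothetical, and the toy example keeps the top 2 rather than the top 5.

```python
import pandas as pd

# Made-up language counts and population sizes.
lang = pd.DataFrame({
    "region": ["A", "A", "A", "B"],
    "language": ["English", "French", "Spanish", "English"],
    "speakers": [600, 150, 50, 900],
})
pop = pd.DataFrame({"region": ["A", "B"], "population": [1000, 2000]})

# Keep one region and attach its population to every row.
area = lang[lang["region"] == "A"].merge(pop, on="region")

# Percentage of the region's population speaking each language.
area["perc_pop"] = 100 * area["speakers"] / area["population"]

# Keep the rows with the largest percentages and only the two requested columns.
top = area.nlargest(2, "perc_pop")[["language", "perc_pop"]].reset_index(drop=True)
print(top)
```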

In [None]:
...

## Exercise 4: Tidying Data

rubric={autograde:10}

Let’s load a data set that is not tidy, because it is too wide for the statistical question being asked, and then use pandas to tidy it.

This next data set that we will be looking at contains environmental data from 1914 to 2018. The data was collected by the DFO (Canada’s Department of Fisheries and Oceans) at the Pacific Biological Station (Departure Bay). Daily sea surface temperatures were recorded. Original data source: http://www.pac.dfo-mpo.gc.ca/science/oceans/data-donnees/lightstations-phares/index-eng.html

A statistical question we might be interested in answering with this data set is, has sea surface temperature been changing over time, and is there an association between time of year (i.e., month) and this change over time? Read the `departure_bay_temperature.csv` data set in from the `data` directory and decide what tidying you will have to do, and then get to work and tidy it!

Assign the tidy data frame you create to the variable `tidy_temps`. Set the second column name to `month` and the third column name to `temp`.
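
For reference, here is a hedged sketch of reshaping a too-wide table with `DataFrame.melt`, using made-up values. It assumes one column per month next to a `Year` column; check the actual layout of `departure_bay_temperature.csv` before applying the same pattern.

```python
import pandas as pd

# Made-up wide data: one row per year, one column per month.
wide = pd.DataFrame({
    "Year": [2000, 2001],
    "Jan": [7.1, 7.4],
    "Feb": [7.3, 7.0],
})

# id_vars stays as-is; every other column becomes a (month, temp) pair.
long = wide.melt(id_vars="Year", var_name="month", value_name="temp")
print(long)
```

Each original row contributes one output row per month column, so 2 years and 2 months give 4 rows here.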

In [None]:
...

In [None]:
tidy_temps

In [None]:
grader.check("ex4")

### Reward: Visualizing the data

Let’s take a look and see whether sea surface temperature has been changing over time at Departure Bay, BC. Given that time of year is a factor that influences temperature, we’ll plot this for each month separately:

In [None]:
alt.Chart(tidy_temps).mark_point().encode(
    alt.X('Year:N', axis=alt.Axis(labels=False, ticks=False), title='Year'),
    alt.Y('temp:Q', title='Temperature')
).properties(
    width=200,
    height=200
).facet(
    alt.Facet('month:N', sort=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']),
    columns=4
).interactive()

## Exercise 5: More Tidying

rubric={autograde:10}

Use one of the `pandas` functions to tidy the data that you will load in from the `language_diversity.csv` file located in the `data` directory. This data was collected to answer research questions, such as what factors are associated with language diversity (as measured by the number of languages spoken in a country). Read in the `language_diversity.csv` data set and decide what tidying you will have to do, and then get to work and tidy it! Assign the tidy data frame you create to the variable `tidy_lang`.
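
Untidy data is not always too wide. As a hedged sketch on made-up values, the snippet below handles the opposite case, where several variables are stacked in one column, using `DataFrame.pivot`. The column names `country`, `measure`, and `value` are hypothetical, and whether `language_diversity.csv` needs this or a different reshaping is for you to decide after inspecting it.

```python
import pandas as pd

# Made-up long data: two different variables share the "value" column.
long = pd.DataFrame({
    "country": ["X", "X", "Y", "Y"],
    "measure": ["Langs", "Population", "Langs", "Population"],
    "value": [12, 3000, 40, 9000],
})

# Spread the entries of "measure" into their own columns.
wide = long.pivot(index="country", columns="measure", values="value").reset_index()

# Drop the leftover name on the column index so the frame prints cleanly.
wide.columns.name = None
print(wide)
```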

In [None]:
...

In [None]:
tidy_lang

In [None]:
grader.check("ex5")

### Let's plot!

Now that we have this data in a tidy format, let’s explore it and plot the number of languages spoken in each country in the data set against the country’s population:

In [None]:
alt.Chart(tidy_lang).mark_point().encode(
    x=alt.X('Population').scale(type="log"),
    y=alt.Y('Langs').scale(type="log"),
    color='Continent:N',
    shape='Continent:N',
).interactive()

## Exercise 6 (Challenging)

rubric={accuracy:5}

(This exercise may be more time consuming than the previous ones. Attempt it only if you finish the previous questions early and want a bit more of a challenge.)

The file `data/beach_data.xlsx` contains data from the Narrabeen beach survey program in Sydney, Australia. The survey program started in the 1970s and has continued to the present day. It aims to measure the width of the beach every few weeks. Measurements are made at five locations along the beach, from location 1 at the northern end to location 5 at the southern end. All the data is available [here](http://narrabeen.wrl.unsw.edu.au/explore_data/time_series/).

Your tasks:

* Determine the largest absolute deviation in width for each beach location in 2010, relative to the mean beach width at that location across all time.
* Determine the standard deviation in width for each beach location in 2010.
* Present the results in a single data frame and sort it in descending order of maximum absolute deviation. For example, the corresponding data frame for the year 2011 would look like this:


| Location  | Abs Max | Std |
| --- | --- | --- |
|  3 | 35.805778 | 13.258404 |
|  4 | 31.611717 | 9.892066 |
|  5 | 26.424559 | 7.463615 |
|  2 | 25.652488 | 8.867712 |
|  1 | 23.943018 | 8.244922 |
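
The sketch below illustrates the kind of computation described in the tasks, on made-up numbers: deviations are taken from each group's mean across all years, then summarised for a single year. The column names (`year`, `location`, `width`) and the flat table layout are assumptions; the real `beach_data.xlsx` may need reshaping first. Note that pandas' `std` uses the sample standard deviation (`ddof=1`) by default, which may or may not be the convention intended here.

```python
import pandas as pd

# Made-up measurements; the real file's structure may differ.
toy = pd.DataFrame({
    "year": [2009, 2009, 2010, 2010, 2010, 2010],
    "location": [1, 2, 1, 1, 2, 2],
    "width": [50.0, 80.0, 60.0, 40.0, 70.0, 100.0],
})

# Mean width per location across ALL years, aligned to every row.
overall_mean = toy.groupby("location")["width"].transform("mean")
toy["abs_dev"] = (toy["width"] - overall_mean).abs()

# Restrict to one year, then summarise each location.
one_year = toy[toy["year"] == 2010]
summary = (
    one_year.groupby("location")
    .agg(abs_max=("abs_dev", "max"), std=("width", "std"))
    .sort_values("abs_max", ascending=False)
    .reset_index()
)
print(summary)
```

Computing the all-time mean with `transform` before filtering keeps the reference value independent of the year being summarised.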

In [None]:
...

**Congratulations!** You are done with the lab! Pat yourself on the back, convert the notebook to PDF, and submit your lab to **GitHub** and Gradescope! Make sure you have at least 3 Git commits!