> In all assignments you will be solving exercises posted in a Jupyter notebook that looks like this one. 

> **Note (that only applies if you did not download this assignment from Absalon but got it from GitHub):** Because you will generally be cloning a GitHub repository that only we can push to, you should **NEVER EDIT** any of the files you pull from GitHub. Instead, **make a copy of this notebook and save it somewhere else** on your computer, not inside the `isds2023` folder that you cloned, and write your answers in there. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the assignment) may be overwritten and **lost** the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear. If you downloaded this file from Absalon, you should have no problems!

# Assignment 0 

## Practical information
To have this assignment approved you must 
* Hand in no later than Thursday, **July 27th at 12.00 (noon)**.
* Work alone for this assignment (this will benefit YOU in the long run!).

If you are having difficulties solving the majority of the exercises, please let us know. You can ask your questions in a GitHub issue in the [isds repo](https://github.com/isdsucph/isds2023/issues).

**Note**: 
- It is important that you submit your edited version of this [.ipynb file](https://fileinfo.com/extension/ipynb#:~:text=An%20IPYNB%20file%20is%20a,Python%20language%20and%20their%20data.) as a .ipynb file and nothing else. 
    - This is because of the grading software used to grade the assignments. 
- If many of you are struggling, we might host a virtual workshop before the due date, where we cover the skills required to answer this problem set.

## Where to start and aim

This assignment is intended to ensure that you all know some basic python when you begin the summer school. The material will be explained in the notebook. Along the way we will point to additional resources if you want the material presented in a different manner.

A very good (but perhaps slightly confusing) approach to learning anything programming related is to simply search google whenever you have a question, or cannot get your code to run. In 99% of the cases the first or second result will contain an answer to your question, and this saves you the time required to look for the right pages in a book, or the right tutorial on some website.

#### First things first
Here is a video that Andreas made a few years ago. It is a great overview of what Python is and is not. Run the cell to show the video. If the video does not run, open the link [here](https://youtu.be/kaSz6n9ePik).


In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('kaSz6n9ePik', width=640, height=360)

In [None]:
print("hello world")

#### Jupyter
In this course we use Jupyter for working with Python as it provides a clean interface and makes it easy to iterate on working with data. As mentioned on our [course page](https://isdsucph.github.io/isds2023/post/install/), the notebooks used in this course can either be run using the classic notebook interface or JupyterLab. JupyterLab is an extension of the classic notebook interface and provides, among others, a file browser and a ready to use dark mode theme. The videos in this assignment are recorded using the classic notebook interface but you are welcome to follow them using JupyterLab. 

Again, run the cell below to show the video. If the video does not run open the link [here](https://youtu.be/0U0MYrPLuI4).

In [None]:
YouTubeVideo('0U0MYrPLuI4', width=640, height=360)

### **<font color="red">STOP!</font>**

#### Checklist on Jupyter


Before you move on from this part, make sure that you:
- Can execute a cell - check that the video above loads and runs
- Can edit a cell
- Can add and remove cells

I recommend checking out my short guide on the most relevant hotkeys for being fast with Jupyter.


## Overview of notebook
This notebook consists of two independent parts. The first is about basic Python, where you get familiar with the most important concepts and tools. The second is a short introduction to pandas, which is a tool for structuring data in Python.

#### Overview of Python content
In this integrated assignment and teaching module, we will learn the following things about basic Python:
- Fundamental data types: numeric, string and boolean
- Operators: numerical and logical
- Conditional logic
- Containers with indices
- Loops: for and while
- Reusable code: functions, classes and modules

*Additional sources*: You may find that the explanations here are insufficient. You can check out the reading list for the course [here](https://isdsucph.github.io/isds2023/page/readings/). 

Alternatively, if you don't like that material, we suggest you try one of the three resources below to learn the skills necessary for answering the questions. 

- Videos: [pythonprogramming.net fundamental](https://pythonprogramming.net/python-fundamental-tutorials/) (basics and intermediate)

- A tutorial website: [The official python 3 tutorial](https://docs.python.org/3/tutorial/introduction.html) (sections 3, 4 and 5)

- A book: [learn python the hard way](https://learnpythonthehardway.org/) 

You are free to pick any of the above, or find your own resources online.

## A quick note on using this notebook
Below each problem you will find two code cells. The first one contains the line 
```python
raise NotImplementedError()
```
Erase this line and implement your answer here. 

The cell below is blank and cannot be edited. Under the hood, this contains a number of `assert` statements. These check if your answer is correct when you hand in your answers. Do not edit, delete, or move the blank cells. Otherwise, we cannot grade your notebook!

# 0.1 Fundamentals of Python
## Elementary Data Types

*Can you give me an example of what data python is working with?*

Watch the video below to get an explanation.

In [None]:
YouTubeVideo('r7IvHf0Rz4c', width=640, height=360)

#### Examples with data types
Execute the cell below to create a variable `A` as a float equal to 1.5:

In [None]:
A = 1.5

Execute the cell below to convert the variable `A` to an integer by typing: 

In [None]:
int(A) # drops the decimals (truncates toward zero), so 1.5 becomes 1

We can do the same for converting to `float`, `str`, `bool`. Note some may at first come across as slightly odd:

In [None]:
bool(A)

While some are simply not allowed:

In [None]:
float('A') # Attempt at converting the string (text) 'A' to a number
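
As a small illustrative sketch (the variable names are made up for this example), the block below collects a few more conversions, including some that tend to surprise newcomers:

```python
# A few conversions between the elementary data types.
as_text = str(1.5)            # number -> string
as_float = float('2.5')       # string containing a number -> float
as_int_pos = int(1.9)         # decimals are cut off ...
as_int_neg = int(-1.9)        # ... toward zero, also for negative numbers

print(as_text, as_float, as_int_pos, as_int_neg)

# Zero and the empty string convert to False; the other two values convert to True,
# even the non-empty string 'False'.
print(bool(0), bool(''), bool(-3), bool('False'))
```

Note that `int` does not round to the nearest integer; the built-in `round` function does that.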

## Printing Stuff

An essential procedure in Python is `print`. Try executing some code - in this case, printing a string of text:

In [None]:
my_str = 'I can do in Python, whatever I want' # define a string of text
print(my_str) # print it

We can also print several objects at the same time. Try it!

In [None]:
my_var1 = 33
my_var2 = 2
print(my_var1, my_var2)

*Why do we print?*
- It allows us to inspect values
    - For instance when trying to understand why a piece of code does not give us what we expect (i.e. debugging)
- In particular helpful when inspecting intermediate output (within a function, loops etc.).
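
A minimal sketch of printing intermediate output, assuming some made-up numbers (the names `price`, `quantity` and the 25 percent rate are purely illustrative):

```python
# Print intermediate values to follow what a computation is doing.
price = 20
quantity = 3
subtotal = price * quantity
print('subtotal:', subtotal)      # inspect an intermediate result

total = subtotal * 1.25           # add 25 percent on top (an illustrative rate)
print('total:', total)

# The optional `sep` argument controls what is printed between the objects.
print(price, quantity, total, sep=' | ')
```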


## Debugging 

*What happens if my code has errors?*
- Either you fix it or you're in trouble. See below on how we handle them.


Try executing the following code block that we have already tried to execute once:

In [None]:
float('A')

*What does the error message mean?*

See the video below for an interpretation of the problem and how to fix it.

In [None]:
YouTubeVideo('wEjO-0rx0c0', width=640, height=360)    

## Numeric Operators

*What numeric computations can python do?*

An operator in Python manipulates various data types.

We have seen addition `+`. Other basic numeric operators:
- multiplication `*`;
- subtraction `-`; 
- division `/`;
- power, `**` 

Try executing the code below. Explain in words what the expression does.

In [None]:
2**4 
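
The operators listed above can be tried out on a pair of example numbers. This is only a sketch; the values `7` and `2` are arbitrary:

```python
a = 7
b = 2

product = a * b       # multiplication
difference = a - b    # subtraction
quotient = a / b      # division (the result of / is always a float)
power = a ** b        # a raised to the power of b, here 7 * 7

print(product, difference, quotient, power)

# Parentheses control the order of evaluation, as in ordinary arithmetic.
print(a - b * 3, (a - b) * 3)
```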

Having seen all the information, you are now ready for the first exercise. The exercise is found below in the indented text. You should provide your answer in the cell below using Python code. Remember not to edit, delete, or move the empty cell.
> **Ex. 0.1.1:**  Add the two integers `3` and `5` and store the result in the variable named `answer_001`

In [None]:
#answer_001 = 
# YOUR CODE HERE
raise NotImplementedError()

### Problem 0.1.2
Python also has a built in data type called a _string_. This is simply a sequence of letters/characters and can thus contain names, sentences etc. To define a string in python, you need to wrap your sentence in either double or single quotation marks. For example you could write `"Hello world!"`. 

> **Ex. 0.1.2:** In python the `+` is not only used for numbers. Use the `+` to add together the four strings `"Social"`, `"Data"`, `"Science"` and `"Intro"` without any whitespace. What is the result? Store it in a new variable called `answer_002`

In [None]:
#answer_002 =
# YOUR CODE HERE
raise NotImplementedError()

## Boolean Operators

Helpful advice: If you are not certain what a boolean value is, go back to [Elementary Data Types](#Elementary-Data-Types).

*What else can operators do?*

We can check the validity of a statement - using the equal operator, `==`, or not equal operator `!=`. Try the examples below:

In [None]:
3 == (2 + 1)

In [None]:
3 != (2 + 1)

In [None]:
11 != 2 * 5

In all these cases, the outcomes were boolean.

We can also do numeric comparisons, such as greater than `>`, greater than or equal `>=`, etc.:

In [None]:
11 <= 2 * 5

*How can we manipulate boolean values?*

Combining boolean values can be done using:

- the `and` operator - for boolean values, `&` gives the same result
- the `or` operator - for boolean values, `|` gives the same result

Let's try this!

In [None]:
print(True | False)
print(True & False)
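
One caveat is worth a short sketch: `&` and `|` give the same results as `and` and `or` for plain boolean values, but they bind more tightly than comparisons such as `>`. The example values below are arbitrary:

```python
# With plain boolean values the two spellings agree.
print(True and False, True & False)
print(True or False, True | False)

# They differ once comparisons are involved, because `&` binds more tightly
# than `>`: the second expression is read as 3 > (2 & 1) > 0.
with_and = 3 > 2 and 1 > 0
with_ampersand = 3 > 2 & 1 > 0
print(with_and, with_ampersand)

# Parentheses around each comparison make `&` behave as expected.
with_parentheses = (3 > 2) & (1 > 0)
print(with_parentheses)
```

When in doubt, use `and` and `or` for conditions on single values, and put parentheses around comparisons whenever `&` or `|` is used.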

*What other things can we do?*

We can negate/reverse the statement with the `not` operator:

In [None]:
not (True and False)

### Problem 0.1.3
Above you added two integers together, and got a result of `8`. Python separates numbers in two classes, the _integers_ $...,-1,0,1,2,...$ and the _floats_, which are an approximation of the real numbers $\mathbb{R}$ (exactly how floats differ from reals is taught in introductory computer science courses).


> **Ex. 0.1.3:**  
* Add `1.7` to `4`, store the result in the variable `answer_0031`.
* What is `0.667 * 100` in python? Store the result in the variable `answer_0032`

> *Note:* the notebook will only print out the result of the last line in each cell.

In [None]:
#answer_0031 =
#answer_0032 =
# YOUR CODE HERE
raise NotImplementedError()

## Containers 

*What is a composite data type?*

A data type that can contain more than one entry of data, e.g. multiple numbers.

*What are the most fundamental composite data types?*

Three of the most fundamental composite data types are the _tuple_, the _list_ and the _dictionary_. 

* The _tuple_ is declared with round parentheses, e.g. `(1, 2, 3)`, where each element in the tuple is separated by a comma. Once you have declared a tuple you cannot change its content without making a copy of the tuple first (you will read that the tuple is an _immutable_ data type).

* The _list_ is almost identical to the tuple. It is declared using square brackets, e.g. `[1, 2, 3]`. Unlike the tuple, a list can be changed after definition, by adding, changing or removing elements. This is called a _mutable_ data type.

* The _dictionary_ or simply _dict_ is also a mutable data type. Unlike the other data types the dict works like a lookup table, where each element of data stored in the dictionary is associated with a name. To look up an item in the dictionary you don't need to know its position in the dictionary, only its name. The dict is defined with curly braces and a colon to separate the name from the value, e.g. `{'name_of_first_entry': 1, 'name_of_second_entry': 2}`.

Watch the video below to get an explanation why these data structures are really important.

In [None]:
YouTubeVideo('8yIupGinliw', width=640, height=360)    
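
The difference between the three containers can be sketched as follows. The data is hypothetical and unrelated to the exercises, and the `try`/`except` construction is only used here to catch the error so that the example runs to the end:

```python
# Hypothetical example data.
point = (4, 7)                           # tuple: immutable
shopping = ['milk', 'eggs']              # list: mutable
prices = {'milk': 12.5, 'eggs': 24.0}    # dict: lookup by name (key)

shopping.append('bread')                 # add an element
shopping[0] = 'oat milk'                 # change an element
print(shopping)

prices['bread'] = 18.0                   # add a new entry
print(prices['eggs'])                    # look up a value by its name

try:
    point[0] = 5                         # tuples cannot be changed ...
except TypeError as error:
    print('Not allowed:', error)         # ... so Python raises a TypeError
```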

### Problem 0.1.4

> **Ex. 0.1.4:** Define the variable `y` as a list containing the elements `'k', 2, 'b', 9`. Also define a variable `z` which is a tuple containing the same elements. Try to access the 0th element of `y` (python is 0-indexed) and the 1st element of `z`. Store these two values in the two variables `answer_004_y0` and `answer_004_z1`.

> *Hint:* To access the _n_'th element of a list/tuple write `myData[n]`, for example `y[0]` gets the 0th element of `y`. 

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# 0.2 Control Flow

## If-then syntax

_Control flow_ means writing code that controls the way data or information flows through the program. The concepts of control flow should be recognizable outside of coding as well. For example, when you go shopping you might want to buy koldskål, but only **if** the kammerjunker are on sale, or **else** you will buy ice cream. This kind of logic comes up everywhere in coding: self-driving cars should go forward only **if** the light is green, items should be listed for sale in a web shop only **if** they are in stock, stars should be put on the estimates **if** they are significant, etc. 

Another kind of control flow deals with doing things repeatedly. For example dishes should be done **while** there are still dirty dishes to wash, **for** each student **in** social data science a grade should be given, etc.

In the following problems you will work with both kinds of control flow. 

*How can we activate code based on data in Python?*

In Python, the syntax is easy with the `if` syntax. 

```python
if statement:  
    code
```

In the example above, the block called `code` is run if the condition called `statement` (either a variable or an expression) is true.

#### Examples using if
Try to run the examples:

In [None]:
my_statement = (4 == 4)
if my_statement:  
    print ("I'm being executed, yay!")

#### Introducing an alternative


If the statement in our condition is false, then we can execute other code
with the `else` statement. Try the example below - and change the boolean value of `my_statement`.


In [None]:
my_statement = False
if my_statement:  
    print ("I'm being executed, yay!")
else:
    print ("Shoot! I'm still being executed!")

#### More material

We have not covered the statements `break` and `continue`, or `try` and `except` which are also control flow statements. These are slightly more advanced, but it can be a good idea to look them up yourself. 
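
For the curious, here is a minimal sketch of how these four statements can work together. It uses a `for` loop, which is introduced in the Loops section further down, and the input list is made up for illustration:

```python
values = ['3', '4.5', 'A', '10', 'stop', '99']
numbers = []

for text in values:
    if text == 'stop':
        break                 # leave the loop entirely
    try:
        number = float(text)  # may fail, e.g. for 'A'
    except ValueError:
        continue              # skip this element and go to the next one
    numbers.append(number)

print(numbers)
```

The element `'A'` is skipped because the conversion fails, and `'99'` is never reached because the loop stops at `'stop'`.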


### Problem 0.2.1
In python the if/else logic consists of three keywords: *if*, *elif* (else if) and *else*. The if and elif keywords should be followed by a logical statement that is either `True` or `False`. The code that should be executed `if` the logic is `True` is written on the lines below the `if`, and should be indented one TAB (or 4 spaces). Finally all control flow statements must end with a colon. 

> **Ex. 0.2.1:**  Read the code in the cell below. Assign a value to the variable `x` that makes the code print "Good job!"

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

if x > 5:
    answer_021_sentinel = 0
    print("x is too large")

elif x >= 3 and x <= 5:
    answer_021_sentinel = 1
    print("Good job!")

else: 
    answer_021_sentinel = 0    
    print("x is too small")
    

### Problem 0.2.2
Above we used three different types of comparison: `>`, `>=` and `<=`. To compare two values and check whether they are equal, python uses double equal signs `==` (remember a single = was used to assign values to a variable). 

> **Ex. 0.2.2:**  The code below draws a random number between 0 and 1 and stores it in the variable `randnum`. Write an if/else block that defines a new variable `answer_022` which is equal to 1 if `randnum <= 0.1` and is 0 if `randnum > 0.1`. That is, write code that defines 

<br>
$$
\text{answer_022} = \begin{cases}
1 \text{  if } \text{randnum} \leq 0.1 \\
0 \text{  if } \text{randnum} > 0.1
\end{cases}
$$

In [None]:
import random
randnum = random.uniform(0,1)

# YOUR CODE HERE
raise NotImplementedError()

## Loops 

#### For loops

Control flow that does the same thing repeatedly is called a _loop_. In python you can loop through anything that is _iterable_, e.g. anything where it makes sense to say "for each element in this item, do whatever." 

Lists, tuples and dictionaries are all iterable, and can thus be looped over. This kind of loop is called a _for loop_. The basic syntax is 

```python
for element in some_iterable:
    do_something(element)
```
where `element` is a temporary name given to the current element reached in the loop, and `do_something` can be any valid python function applied to `element`. 

Example - try the following code:

In [None]:
A = []

for i in [1, 3, 5]:
    i_squared = i ** 2
    A.append(i_squared)
    
print(A)
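
Since tuples and dictionaries are iterable as well, they can be looped over in the same way. A brief sketch with hypothetical data:

```python
ages = {'Ann': 31, 'Bo': 27}        # hypothetical example data

# Looping over a dictionary gives its names (keys) ...
names = []
for name in ages:
    names.append(name)
print(names)

# ... while .items() gives pairs of name and value.
for name, age in ages.items():
    print(name, 'is', age, 'years old')

# Tuples are looped over exactly like lists.
total = 0
for number in (1, 3, 5):
    total = total + number
print(total)
```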

For loops are useful when, for example, iterating over files in a directory or over a specific set of columns.

*Quiz*: How does Python know where the code inside the loop begins and ends?

*Answer*: By indentation: the lines inside the loop are indented by four spaces, see the example above. This works the same way as for if statements.

> **Ex. 0.2.3:**  Begin by initializing an empty list in the variable `answer_023` (simply write `answer_023 = []`). Then loop through the list `y` that you defined in [problem 0.1.4](#Problem-0.1.4). For each element in `y`, multiply that element by 7 and *append* it to `answer_023`. (You can finish off by showing the content of `answer_023` after the loop has run.)

> *Hint:* To append data to a list you can write `answer_023.append(new_element)` where `new_element` is the new data that you want to append.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### While loops

The other kind of loop in Python is the _while loop_. Instead of looping over an iterable, the `while` loop continues going as long as a supplied logical condition is True. 

Most commonly, the while loop is combined with a counting variable that keeps track of how many times the loop has been run. 

One specific application where a while loop can be useful is data collection on the internet (scraping) which is often open ended. Another application is when we are computing something that we do not know how long will take to compute, e.g. when a model is being estimated.

The basic syntax is seen in the example below. This code will run 100 times before stopping. At each iteration, it checks that `i` is smaller than 100. If it is, it does something and adds 1 to the variable `i` before repeating.

```python 

i = 0
while i < 100:
    do_something()    
    i = i + 1
```

In the example below, we provide an example of what `do_something()` can be. Try the code below and explain why it outputs what it does.

In [None]:
i = 0
L = []
while (i < 5):
    L.append(i * 3)
    i += 1

print(L)    
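
The open-ended use of a while loop mentioned above can be sketched with a toy problem, where the number of iterations is not known in advance (the threshold of 1000 is arbitrary):

```python
value = 1
doublings = 0

# Keep doubling until the value exceeds 1000.
while value <= 1000:
    value = value * 2
    doublings += 1

print(doublings, value)
```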

### Problem 0.2.4
> **Ex. 0.2.4:**  Begin by defining an empty list named `answer_024`. Write a while loop that runs from $i=0$ up to but not including $i=1500$. In each loop, it should determine whether the current value of `i` is a multiple of 19. If it is, append the number to the answer list `answer_024`. (recall that $i$ is divisible by $a$ if $i \text{ mod } a = 0$. The modulo operator in python is `%`)

> *Hint:* The `if` statement does not need to be followed by an `else`. You can simply code the `if` part and python will automatically skip it and continue if the logical condition is False.
>
> *Hint:* Remember to increment `i` in each iteration. Otherwise the program will run forever. If this happens, press _kernel > interrupt_ in the menu bar. 

In [None]:
i = 0
answer_024 = []
# YOUR CODE HERE
raise NotImplementedError()

# 0.3. Reusable Code

## Functions

If you have never programmed in anything but statistical software such as Stata or SAS, the concept of functions might be new to you. In python, a function is simply a "recipe" that is first written, and then later used to compute something. 

Conceptually, functions in programming are similar to functions in math. They have between $0$ and "$\infty$" inputs, do some calculation using their inputs and then return between 1 and "$\infty$" outputs. 

By making these recipes, we can save time by making a concise function that undertakes exactly the task that we want to complete. 

Python contains a large number of [built-in functions](https://docs.python.org/3/library/functions.html). Below, you are given examples of how to use the most commonly used built-ins. You should make yourself comfortable using each of the functions shown below.

In [None]:
# Setup for the examples. We define two lists to show you the built-in functions.
l1 = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
l2 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [None]:
# The len(x) function gives you the length of the input
len(l2)

In [None]:
# The abs(x) function returns the absolute value of x
abs(-5)

In [None]:
# The min(x) and max(x) functions return the minimum and maximum of the input.
min(l1), max(l1)

In [None]:
# The map(function, Iterable) function applies the supplied function to each element in Iterable:
# Note that the list() call just converts the result to a list
list(map(len, l2))

In [None]:
# The range([start], stop, [step]) function returns a range of numbers from `start` up to (but not including) `stop`, in increments of `step`.
# The values in [] are optional.
# If no start value is set, it defaults to 0.
# If no step value is set it defaults to 1. 
# A stop value must always be set.
# A range only produces its numbers when they are needed, so we wrap it in list() to print them.

print("Range from 0 to 100, step=1:", list(range(100)))
print("Range from 0 to 100, step=2:", list(range(0, 100, 2)))
print("Range from 10 to 65, step=3:", list(range(10, 65, 3)))

In [None]:
# The reversed(x) function reverses the input.
# We can then loop through it backwards
l1_reverse = reversed(l1)

for e in l1_reverse:
    print(e)

In [None]:
# The enumerate(x) function returns the index of the item as well as the item itself in sequence.
# With it, you can loop through things while keeping track of their position:
l2_enumerate = enumerate(l2)

for index, element in l2_enumerate:
    print(index, element)

In [None]:
# The zip(x,y,...) function "zips" together two or more iterables allowing you to loop through them pairwise:
l1l2_zip = zip(l1, l2)

for e1, e2 in l1l2_zip:
    print(e1, e2)
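
A few further details about these built-ins, sketched with two short made-up lists. In particular, the objects returned by `zip`, `enumerate`, `reversed` and `map` can only be looped through once:

```python
letters = ['a', 'b', 'c']
numbers = [0, 10, 20]

# zip pairs up the elements; dict() turns the pairs into a lookup table.
lookup = dict(zip(letters, numbers))
print(lookup)

# enumerate accepts an optional start value for the counter.
for position, letter in enumerate(letters, start=1):
    print(position, letter)

# sorted is another commonly used built-in; it returns a new sorted list.
descending = sorted(numbers, reverse=True)
print(descending)

# A zip object is used up after one pass.
pairs = zip(letters, numbers)
first_pass = list(pairs)
second_pass = list(pairs)     # empty, because `pairs` is already exhausted
print(first_pass, second_pass)
```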


#### The how
You can also write your own python functions. A python function is defined with the `def` keyword, followed by a user-defined name of the function, the inputs to the function and a colon. On the following lines, the _function body_ is written, indented by one TAB. 

Functions use the keyword `return` to signal what values the function should return after doing its calculations on the inputs. For example, we can define a function named `my_first_function` seen in the cell below. Run the cell below and explain the printed output.



In [None]:
def my_first_function(x): # takes input x
    x_squared = x ** 2 # x squared
    return x_squared + 1

print('Output for input of 0: ', my_first_function(0))
print('Output for input of 1: ', my_first_function(1))
print('Output for input of 2: ', my_first_function(2))
print('Output for input of 3: ', my_first_function(3))

We can also make more complex functions. The function below, named `my_second_function`, takes two inputs `a` and `b` that are used to compute the values $a^b$ (written in python as `a ** b`) and $b^a$ and returns the larger of the two. 

Provide the function below with different inputs of `a` and `b`. Explain the output to yourself. 

In [None]:
def my_second_function(a, b):
    v1 = a ** b
    v2 = b ** a
    
    if v1 > v2:
        return v1
    else:
        return v2
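
Functions can also return more than one output. The sketch below uses a hypothetical function, `divide_with_remainder`, together with the integer division operator `//`, which has not been introduced above:

```python
def divide_with_remainder(a, b):
    """Return how many times b fits in a, and what is left over."""
    quotient = a // b      # integer division
    remainder = a % b      # modulo: the remainder after division
    return quotient, remainder   # two outputs, packed in a tuple

result = divide_with_remainder(17, 5)
print(result)

# The tuple can be unpacked into separate variables.
q, r = divide_with_remainder(17, 5)
print(q, r)
```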

### Problem 0.3.1
> **Ex. 0.3.1:**  Write a function called `minimum` that takes as input a list of numbers, and returns the index and value of the minimum number as a `tuple`. Use your function to calculate the index and value of the minimum number in the list `[-342, 195, 573, -234, 762, -175, 847, -882, 153, -22]`. 
Store the result of this computation in a variable named `answer_031`.

> *Hint:* A ["pythonic"](https://stackoverflow.com/questions/25011078/what-does-pythonic-mean#:~:text=Pythonic%20means%20code%20that%20doesn,is%20intended%20to%20be%20used) way to keep count of the index of the minimum value would be to loop over the list of numbers by using the [enumerate](#Functions) function on the list of numbers.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Problem 0.3.2
> **Ex. 0.3.2:**  Write a function called `average` that takes as input a list of numbers, and returns the average of the values in the list. Use your function to calculate the average of the values `[-1, 2, -3, 4, 0, -4, 3, -2, 1]`. Store the result of this computation in a variable named `answer_032`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Problem 0.3.3
Recall that the exponential function $e^x$, where $e$ is [Euler's number](https://en.wikipedia.org/wiki/E_(mathematical_constant)), can be calculated as 
$$
e^x=\lim_{n\rightarrow \infty}\left(1+\frac{x}{n}\right)^{n}
$$
Of course we cannot compute the limit on a finite memory computer. Instead we can calculate approximations by taking $n$ large enough.
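
For instance, setting $x = 1$ gives approximations of $e \approx 2.71828$ itself that improve as $n$ grows:
$$
\left(1+\frac{1}{1}\right)^{1} = 2, \qquad
\left(1+\frac{1}{2}\right)^{2} = 2.25, \qquad
\left(1+\frac{1}{10}\right)^{10} \approx 2.5937, \qquad
\left(1+\frac{1}{100}\right)^{100} \approx 2.7048
$$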

> **Ex. 0.3.3:** Write a function named `eulers_e` that takes two inputs `x` and `n`, calculates 
$$
\left(1+\frac{x}{n}\right)^{n}
$$
and returns this value. Use your function to calculate `eulers_e(1, 5)` and store this value in the variable `answer_033`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Problem 0.3.4
The inverse of the exponential is the [logarithm](https://en.wikipedia.org/wiki/Logarithm). As with the exponential function, the logarithm can be defined as a limit. One such definition, valid for $x > 0$, is 
$$
\log(x) = 2 \cdot \sum_{k=0}^{\infty} \frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1}
$$

where $\sum_{k=0}^{\infty}$ signifies the sum of infinitely many elements, starting from $k=0$. Each element in the sum takes the value $\frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1}$ for some $k$. As before, we must approximate this with a finite sum. 

> **Ex. 0.3.4:**  Define another function called `natural_logarithm` which takes two inputs `x` and `k_max`. In the function body calculate 
$$
2 \cdot \sum_{k=0}^{k\_max} \frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1}
$$
and return this value. Use your function to calculate `natural_logarithm(2.71, 1)`. Store this value in a variable named `answer_034`.

> *Hint:* to calculate the sum, first initialize a value total = 0, loop through $k\in \{0, 1, \ldots, k\_max\}$ and compute $\frac{1}{2k+1} \left( \frac{x-1}{x+1} \right)^{2k+1}$. Add the computed value to your total in each step of the loop. After finalizing the loop you can then multiply the total by 2 and return the result.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Problem 0.3.5
Just like numbers, strings and other data types, python treats functions as objects. This means you can write functions that take a function as input, and functions that return functions after being executed. This is sometimes a useful tool to have when you need to add extra functionality to an already existing function, or if you need to write _function factories_.
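
Both ideas can be sketched with two small hypothetical functions, `apply_twice` (a function taking a function as input) and `make_multiplier` (a function factory). Neither is part of the exercise below:

```python
def apply_twice(func, x):
    """Take a function as input and apply it two times to x."""
    return func(func(x))

def make_multiplier(factor):
    """A function factory: build and return a new function."""
    def multiply(x):
        return x * factor
    return multiply

double = make_multiplier(2)      # `double` is itself a function
print(double(5))
print(apply_twice(double, 5))
print(apply_twice(abs, -3))      # built-in functions can be passed around too
```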

> **Ex. 0.3.5:**  Write a function called `exponentiate` that takes one input named `func`. In the body of `exponentiate` define a nested function (i.e. a function within a function) called `with_exp` that takes two inputs `x` and `k`. The nested function should return `func(e, k)` where `e = eulers_e(x, k)`. The outer function should return the nested function `with_exp`, i.e. write something like

>```python
def exponentiate(func):
    def with_exp(x, k):
        e = eulers_e(x, k)
        value = #[FILL IN]
        return value
    return with_exp
```

> Call the `exponentiate` function on `natural_logarithm` and store the result in a new variable called `logexp`. Then call `logexp(1, 100)` and store this value in the variable `answer_035`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Getting More General

#### Modules

Whatever we attempt in programming, it is likely nowadays that someone has done it before us. Therefore, we can reuse code, which allows us to 
1. save time by using others' code, and
2. learn from others' code. 

Moreover, code implemented by someone with more experience is often likely to work better and faster than what we can come up with! That's why we introduce modules. These are packages of Python code that we can load - and by doing that, we get access to powerful tools.

Let's see how modules work. Run the cell below to load a module called `numpy` which allows us to work with linear algebra and other numeric tools. 

In [None]:
import numpy as np

Let's create an `array` with numpy.

In [None]:
row1 = [1, 2]
row2 = [3, 4]
table = [row1, row2]

my_array = np.array(table)
my_array

*What is a numpy array?*

An n-dimensional container that can store specific data types, e.g. bool and float. The arrays come with certain available methods and tools. For example, a 2-d array can act like a matrix, and a 3-d array can act like a tensor. 

Objects can have useful attributes and methods that are built-in. These are accessed using `"."`. For example, an array can be transposed as follows:

In [None]:
my_array.T
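
A few more attributes and methods of numpy arrays, as a sketch on a small example array (the name `example` is made up here):

```python
import numpy as np

example = np.array([[1, 2], [3, 4]])

print(example.shape)          # attribute: number of rows and columns
print(example.dtype)          # attribute: data type of the elements
print(example.sum())          # method: sum of all elements
print(example.sum(axis=0))    # method with an argument: sum of each column
print(example @ example.T)    # matrix product of the array and its transpose
```

Attributes are written without parentheses, while methods are called with parentheses, possibly with arguments inside.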

#### Classes

In Python, we can also define our own types of objects, known as classes, using the `class` keyword. Each class contains rules and properties that govern how objects of the class will behave. If you are curious and want to learn more, which is totally optional, then read more [here](https://docs.python.org/3/tutorial/classes.html) (note: quite technical). Otherwise move on to part 0.4.
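
For the curious, a minimal sketch of what a class can look like. The class `Rectangle` is hypothetical and only serves as an illustration:

```python
class Rectangle:
    """A hypothetical class describing rectangles."""

    def __init__(self, width, height):
        # Attributes: data stored on each object of the class.
        self.width = width
        self.height = height

    def area(self):
        # Method: a function that belongs to the class.
        return self.width * self.height

my_rectangle = Rectangle(3, 4)   # create an object of the class
print(my_rectangle.width)        # access an attribute with "."
print(my_rectangle.area())       # call a method with "."
```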

# 0.4 Pandas for data structuring

You may ask yourself: Why do we need to learn data structuring? 

Data never comes in the form of our model. We need to 'wrangle' our data. As of right now, even the most advanced techniques need data in a structured format to work with it.

<font color="red">STOP!</font> Before proceeding, make sure that you have completed all the material above (except the optional part on Python classes).


## An Overview

In the motivation in the beginning of the class, we discussed how a number of modules were driving the recent popularity of Python. The most important one is pandas - which is a neat tool for working with tabular data. 

Tabular data is like the table below. Each row is an observation which consists of two entries, one for each of the columns/fields, i.e. animal and date.  

| index | Animal     | Date           |
|---------------:|:-----------|:---------------|
|  Observation 1 | Elk        | July 1, 2019   |
|  Observation 2 | Pig        | July 3, 2019   |

What pandas provides is a smart way of structuring data. It has two fundamental data types, see below. These are essentially just containers, but they come with a lot of extra functionality for structuring data and performing analysis. 

- `Series`: tabular data with a single column (field)
  - akin to a vector in mathematics
  - has named rows, called indices
- `DataFrame`: tabular data that allows for more than one column (multiple fields)
  - akin to a matrix in mathematics
  - has labelled columns (e.g. Animal and Date above) and named rows, called indices
  
Run the code below to make your first pandas dataframe. Try to print it and explain the content it shows.

In [None]:
import pandas as pd

df1 = pd.DataFrame(data=[[1, 2],[3, 4],[5, 6],[7, 8]],
                   index=['i', 'ii','iii','iv'],
                   columns=['A', 'B'])

The code below makes a series from a list. We can see that it contains all the four fundamental data types!

In [None]:
L = [1, 1.2, 'abc', True]
ser1 = pd.Series(L)
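
The relation between the two data types can be sketched as follows, using a small stand-alone dataframe (`example_df` and `mixed` are names made up for this example):

```python
import pandas as pd

example_df = pd.DataFrame(data=[[1, 2], [3, 4]],
                          index=['i', 'ii'],
                          columns=['A', 'B'])

column_a = example_df['A']            # one column of a DataFrame is a Series
print(type(column_a))
print(column_a['ii'])                 # look up a value by its index label
print(example_df.loc['ii', 'B'])      # row label first, then column label

# A Series with mixed content falls back to the general `object` data type.
mixed = pd.Series([1, 1.2, 'abc', True])
print(mixed.dtype)
```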

Now you may ask yourself: *why don't we just use numpy?* 

There are many reasons. Pandas is easier for loading, structuring and making simple analyses of tabular data. However, in many cases, if you are working with custom data or need to perform fast and complex array computations, then numpy is a better option. If you are interested, see the discussion [here](https://stackoverflow.com/questions/30067051/python-what-are-the-major-improvement-of-pandas-over-numpy-scipy).

## Switching Among Python, Numpy and Pandas

Pandas dataframes can be thought of as numpy arrays with some additional stuff. Note that columns can have different datatypes!

Most functions from `numpy` can be applied directly to Pandas objects. We can convert a DataFrame to a `numpy` array with the `values` attribute:

In [None]:
df1.values

In Python, we can describe it as a *list of lists*.

In [None]:
df1.values.tolist()
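
As a small illustration of moving between the two libraries, the sketch below applies a numpy function to a DataFrame and then converts the DataFrame to an array. The frame `df_demo` is made up for this example, and `to_numpy()` is used here as an alternative to the `values` attribute.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1.0, 4.0], 'B': [9.0, 16.0]},
                       index=['i', 'ii'])

# A numpy function applied to a DataFrame keeps the row and column labels.
roots = np.sqrt(df_demo)
print(roots)

# Converting to a numpy array drops the labels and keeps only the numbers.
as_array = df_demo.to_numpy()
print(as_array)
```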

Both dataframes and series have indices, which are both a blessing and a curse. These indices mean that we can often convert a Series into a dictionary:

In [None]:
ser1.to_dict()

**<font color="red">WARNING!</font>**: Series indices are NOT necessarily unique, so we may lose data if we convert to a dict, which requires unique keys.
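
The data loss can be seen in a small sketch. The series `ser_dup` below is a made-up example where the label `'a'` is deliberately used twice.

```python
import pandas as pd

# Hypothetical series in which the index label 'a' appears twice.
ser_dup = pd.Series([10, 20, 30], index=['a', 'a', 'b'])

as_dict = ser_dup.to_dict()
print(len(ser_dup))  # number of entries in the Series
print(as_dict)       # the dict can hold only one value per key

# Checking the index first tells us whether the conversion is safe.
print(ser_dup.index.is_unique)
```

If `is_unique` is `False`, it may be safer to reset the index before converting.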

## Inspection
Often we want to see what our dataframe contains. This can be done by putting the dataframe at the end of our cell; it will then automatically be printed.

The example below consists of 100 rows and 5 columns of random data. We see that putting the dataframe at the end of the cell prints it.

In [None]:
import numpy as np  # needed for np.random below

df2 = pd.DataFrame(data=np.random.rand(100, 5), 
                   columns=['A','B','C','D','E'])
df2

We can also use the `head` and `tail` methods, which select the first and last observations in a DataFrame, respectively. The code below selects the first four rows and prints them.

In [None]:
df3 = df2.head(n=4)
df3
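
The `tail` method works the same way from the other end. A minimal sketch on a made-up frame `df_demo`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10) ** 2})

first_rows = df_demo.head(3)  # rows at positions 0, 1, 2
last_rows = df_demo.tail(3)   # rows at positions 7, 8, 9
print(first_rows)
print(last_rows)
```

Calling `head()` or `tail()` without an argument returns five rows.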

## Input-output

We can load and save dataframes from our computer or the internet. Try the code below to save our dataframe as a CSV file called `my_data.csv`. If you are unsure what a CSV file is then check [the Wikipedia  description](https://en.wikipedia.org/wiki/Comma-separated_values).

In [None]:
df3.to_csv('my_data.csv')


Loading data is just as easy. Some data sources are open and easy to collect data from. They do not require formatting as they come in a table format. The code below loads a CSV file on school test data from NYC. 

In [None]:
my_url = 'https://data.cityofnewyork.us/api/views/zt9s-n5aj/rows.csv'
my_df = pd.read_csv(my_url)

my_df.head(10)
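
Saving and loading can also be combined into a round trip. The sketch below is an illustration on made-up data: it writes to an in-memory text buffer (`io.StringIO`) instead of a file so that nothing is left on disk, and the names `df_small`, `df_plain` and `df_back` are hypothetical. It shows that the index is written as the first column of the CSV, and that `index_col=0` restores it when reading.

```python
import io
import pandas as pd

df_small = pd.DataFrame({'A': [1, 2], 'B': [3.5, 4.5]}, index=['i', 'ii'])

# Write the CSV to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_small.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Without index_col, the saved index comes back as an ordinary column.
df_plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas that the first column holds the row labels.
df_back = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_back)
```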

#### Working with weather data
We move on to the next problem in the assignment. The following exercises use weather data and belong to a set of exercises that you will see again later in the course. Our source will be the National Oceanic and Atmospheric Administration (NOAA), which has a global data collection going back a couple of centuries. This collection is called the Global Historical Climatology Network (GHCN). The data contains daily weather recorded at the weather stations. A description of GHCN can be found [here](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/readme-by_year.txt).

> **Ex. 0.4.1:** Use Pandas' CSV reader to fetch daily weather data from 1863 for various stations - available here: https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/. Save the dataframe in the variable `df_weather`. 

> *Hints*: you will need to give `read_csv` some keywords. Here are some suggestions
  - Specify the path as the URL linking directly to the 1863 file, as in the example above. To do this, open the link above, scroll through the .csv files and copy the link address of the right CSV file. 
  - for [compressed files](https://www.winzip.com/win/en/gz-file.html) you may need to specify the keyword `compression` when calling the `.read_csv` method.
  - `header` can be specified as the CSV has no column names.  

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Selecting Rows and Columns

In pandas there are two canonical ways of accessing subsets of a dataframe.
- The `iloc` attribute: access rows and columns using integer indices (like a list).
- The `loc` attribute: access rows and columns using immutable keys, e.g. numbers, strings (like a dictionary).

In what follows we will describe some different ways of selecting data using `.iloc` and `.loc`, as well as a simpler way of accessing the dataframe directly using `[]`. The different ways are meant to give you an overview. 
 
#### Using list of keys/indices

Below is an example of using the `iloc` attribute to select specific rows:

In [None]:
df1 # show df1 before indexing it with .iloc[]

In [None]:
my_irows = [0, 3]
df1.iloc[my_irows]

We can select columns and rows simultaneously. Below is an example of using the `loc` attribute, which does that:

In [None]:
my_rows = ['i', 'iii']
my_cols = ['A']
df1.loc[my_rows, my_cols]
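
To see how the two attributes relate, the sketch below selects the same cells once by position and once by label. It rebuilds a frame with the same content as `df1` under the made-up name `df_demo` so that it can run on its own.

```python
import pandas as pd

df_demo = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                       index=['i', 'ii', 'iii', 'iv'],
                       columns=['A', 'B'])

# Same selection, expressed by integer position and by label.
by_position = df_demo.iloc[[0, 2], [0]]
by_label = df_demo.loc[['i', 'iii'], ['A']]
print(by_position.equals(by_label))

# A single row label and a single column label return one value.
value = df_demo.loc['iii', 'B']
print(value)
```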

#### Using thresholds
We can also use `iloc` and `loc` for selecting rows and/or columns below or above some threshold, see below. Note that the position of the `:` determines the direction: `:3` selects everything before position 3, while `3:` selects everything from position 3 onwards.

In [None]:
df2.iloc[:3, :4]
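
The sketch below illustrates both directions on a made-up frame `df_demo`. It also shows one difference worth knowing: slicing by label with `loc` includes the end label, whereas slicing by position with `iloc` excludes the end position.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [1, 2, 3, 4]},
                       index=['i', 'ii', 'iii', 'iv'])

first_two = df_demo.iloc[:2]     # positions 0 and 1 (2 is excluded)
from_two = df_demo.iloc[2:]      # positions 2 and 3
up_to_iii = df_demo.loc[:'iii']  # label slice: 'iii' is included

print(first_two)
print(from_two)
print(up_to_iii)
```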

#### Using boolean data
If we provide the dataframe with a list of booleans, it will select the rows where the value is `True` (this also works with `iloc` and `loc`). We will see soon that this is an extremely useful way of selecting certain rows.

In [None]:
df3[[True, False, False, True]]
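
In practice the booleans are rarely typed by hand; they usually come from a comparison on a column. The following is a minimal sketch on made-up data (the frame `df_demo` and its columns are hypothetical).

```python
import pandas as pd

df_demo = pd.DataFrame({'animal': ['Elk', 'Pig', 'Elk', 'Cow'],
                        'weight': [300, 120, 350, 600]})

# A comparison on a column returns a boolean Series, one value per row.
is_elk = df_demo['animal'] == 'Elk'
print(is_elk.tolist())

# Using the boolean Series as a selector keeps the rows marked True.
elks = df_demo[is_elk]

# Conditions can be combined; each one needs its own parentheses.
heavy_elks = df_demo.loc[is_elk & (df_demo['weight'] > 320)]
print(heavy_elks)
```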

#### Selecting columns

Often we need to select specific columns. If we provide the dataframe with a list of column names, it returns a dataframe that keeps only these columns:

In [None]:
df3[['B', 'D']]


> **Ex 0.4.2:** Select the *four* left-most columns which contain: station identifier, date, observation type, observation value. Rename them as *'station', 'datetime', 'obs_type', 'obs_value'*. 

> *Hints:* 
- rename can be done with `df.columns = cols` where `cols` is a list of column names.
- we can require that column values come from a list of values by using the [.isin](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html) method for a pandas Series


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Basic Operations 

How do we perform elementary operations like the ones we learned for basic Python? E.g. numeric operations such as summation (`+`) or logical operations such as greater than (`>`). Actually we are in luck - they are exactly the same. 

Let's see how it works for numeric data using a numpy array (works the same way as Pandas).

In [None]:
my_arr1 = np.array([2, 3, 2, 1, 1])
my_arr2 = my_arr1 ** 2
my_arr2

*Can we do the same with two vectors?* Yes, we can also do elementwise addition, multiplication, subtractions etc. of series. Example: 

In [None]:
my_arr1 + my_arr2
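
The same operators work on pandas Series, with one difference from numpy arrays: values are matched by index label rather than by position. A minimal sketch with two made-up series, `s1` and `s2`, whose labels are in different order:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# Addition pairs up values with the same label, not the same position.
total = s1 + s2
print(total)

# Comparisons are elementwise too and return a boolean Series.
mask = s1 > 1
print(mask)
```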

## Changing and Copying Data

Everything in the dataframe can be changed. For instance, we can update our dataframe with new values, e.g. by making new variables or overwriting existing ones. In the example below we add a new column to a DataFrame. 

In [None]:
df2['F'] = df2['A'] > df2['D']
df2.head(10)

**<font color="red">WARNING!</font>**: If you work on a subset of data taken from another dataframe, then the subset may be what is known as a *view* of the original! Changes made in a view can then also be made in the original version. 

In the example below, we try to change the dataframe `df3`, which was created as a subset of `df2` (using `head`), and pandas may show a *warning*. Depending on your pandas version, changes to `df3` may also happen in `df2`. Notice that we can also use `loc` for changing the data.

In [None]:
df3.loc[:,'D'] = df3['A'] - df3['E']
print(df2['D'].head(3), '\n')
print(df3['D'].head(3))

To avoid the problem of having a *view*, we can instead *copy* the data as in the example below. Try to verify that if you change things in `df4` things do not change in `df2`.

In [None]:
df4 = df2.copy()

In [None]:
# Verify that the code from above doesn't throw the same "SettingWithCopyWarning" 
# when using the copied dataframe, df4, instead of df3.
df4.loc[:, 'D'] = df4['A'] - df4['E']
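
The effect of copying can be checked on a small made-up example. In the sketch below, `df_orig` and `df_copy` are hypothetical names; the copy is changed and the original is printed afterwards.

```python
import pandas as pd

df_orig = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Take a subset and copy it explicitly, so it owns its own data.
df_copy = df_orig.head(2).copy()

# Overwrite a column in the copy only.
df_copy.loc[:, 'B'] = 0

print(df_copy)
print(df_orig)  # the original is unchanged
```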


> **Ex. 0.4.3:**  Further, select the subset of data for the station `EZE00100082` and only observations for maximal temperature. Make a copy of the DataFrame and store this in the variable `df_select`. Explain in one or two sentences how copying works. Write your answer in a multi line comment like `""" Your answer here """`.

> *Hint*: The `&` operator works elementwise on boolean series (like `and` in core python). This allows us to combine conditions for selections.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

> **Ex 0.4.4:** Make sure that max temperature is correctly formatted (how many decimals should we add? one? Look through this .txt file for an answer https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt). Make a new column called `TMAX_F` where you have converted the temperature variables to Fahrenheit.

> *Hint*: Conversion is $F = 32 + 1.8*C$ where $F$ is Fahrenheit and $C$ is Celsius.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Changing and Rearranging Indices

In addition to replacing values in our data, we can also rearrange the order of variables and rows as well as make new ones. We have already seen how to change column names, but we can also reset the index, as seen below. Alternatively, we can set our own custom index using [`set_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html), e.g. with temporal data, which provides the DataFrame with new functionality.

In [None]:
df1_new_index = df1.reset_index(drop=True)
df1_new_index
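
As an illustration of the functionality that a custom index can provide, the sketch below uses `set_index` to make a date column the index of a made-up frame. With a date index, `loc` can select a whole month from a partial date string. The frame and its column names are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'date': pd.to_datetime(['2019-07-01', '2019-07-03', '2019-08-02']),
    'animal': ['Elk', 'Pig', 'Cow'],
})

# Use the date column as the index instead of 0, 1, 2.
df_by_date = df_demo.set_index('date')

# With a date index, a partial date string selects all rows in that month.
july = df_by_date.loc['2019-07']
print(july)
```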

A powerful tool for re-organizing the data is to `sort` the data. That is, we can re-organize rows (or columns) such that they are ascending or descending according to one or more columns.

In [None]:
df3_sorted = df3.sort_values(by=['A','B'], ascending=True)
df3_sorted
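
Sorting by several columns can also mix directions. The sketch below, on a made-up frame `df_demo`, sorts ascending by `A` and, within equal values of `A`, descending by `B`. Note that the rows keep their old index labels after sorting, so the index is reset afterwards.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [2, 1, 2, 1], 'B': [0.3, 0.9, 0.1, 0.5]})

# One sort direction per column: A ascending, B descending.
sorted_mixed = df_demo.sort_values(by=['A', 'B'], ascending=[True, False])
print(sorted_mixed)

# The old row labels travel with the rows; reset them if they are not needed.
sorted_clean = sorted_mixed.reset_index(drop=True)
print(sorted_clean)
```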

> **Ex 0.4.5:**  Inspect the indices in `df_select`. Are they following the sequence of natural numbers, 0,1,2,...? If not, reset the index and make sure to drop the old.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

> **Ex 0.4.6:** Make a new DataFrame where you have sorted by the maximum temperature. Save this DataFrame as the variable `df_sorted`. What is the date for the first and last observations?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Combining what we have learned
In the final exercise, we will combine some of the things that we have learned throughout this assignment. In particular, we'll:
1. Code a function that can preprocess some data.
2. Iterate over an iterable and apply our function on the specific object at each iteration.
3. Concatenate everything into a single pandas DataFrame.
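
The three steps above can be sketched on made-up data before applying them to the weather files. The helper `make_year_frame` below is hypothetical and builds a tiny DataFrame instead of downloading anything; only the overall pattern (function, iteration, concatenation) is the point.

```python
import pandas as pd

def make_year_frame(year):
    """Hypothetical helper: build a tiny DataFrame for one year."""
    return pd.DataFrame({'year': [year, year],
                         'value': [year % 10, year % 100]})

# Apply the function to every element of an iterable.
frames = [make_year_frame(y) for y in [2001, 2002, 2003]]

# Stack the frames on top of each other; ignore_index gives a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```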

#### f-strings
One last concept that could be helpful in the following exercise is the concept of [f-strings](https://www.freecodecamp.org/news/python-f-strings-tutorial-how-to-use-f-strings-for-string-formatting/). 
f-strings are a nice way to easily format strings in Python. 

Say that we want a function to return a URL for a specific website with one or more characters in the URL replaced. 

In the following example, we want to get the url of the website of Wikipedia in different languages. This can be done by changing the part of the url string: https://en.wikipedia.org/. To do this, we can utilize f-strings. 

As an example, consider the function `change_url` below. It takes as input the variable `chars` and returns the string stored in the variable `url` with `chars` inserted inside the curly brackets. Setting `chars = "da"` and applying the function returns https://da.wikipedia.org/, the Danish wikipedia site. For more examples of f-strings, see [here](https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python).

In [None]:
def change_url(chars):
    url = f"https://{chars}.wikipedia.org/"
    return url

chars = "da"
change_url(chars)
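
f-strings can hold several expressions at once and can also control how numbers are shown. The function `describe_reading` and its inputs below are made up for illustration; `:.1f` rounds the number to one decimal when it is inserted into the string.

```python
def describe_reading(station, temp_c):
    # Each pair of curly brackets is replaced by the value of the expression inside.
    # The format specifier :.1f shows the number with one decimal.
    return f"Station {station} measured {temp_c:.1f} degrees Celsius"

message = describe_reading("ABC123", 21.456)
print(message)
```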

> **Ex. 0.5.0:** Use the `change_url` function above together with a list comprehension to get a list with the url of Wikipedia in the languages contained in the list `["en", "da", "de"]`. Store the result in the variable `answer_050`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

> **Ex. 0.5.1:** Finally, returning to the weather data, get the weather data for the years 1860-1863 and turn this into a single DataFrame. In order to do this, follow the steps below: 
1. Create a function named `load_weather` that takes an integer year as input and returns a preprocessed DataFrame by loading the URL `'https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/1863.csv.gz'` with the year in the URL replaced by the given year
    - Process the loaded DataFrame by using `.iloc[]` and changing the column names for each year as we previously did for the year 1863
2. Apply the function to each of the years 1860-1863 and store the resulting DataFrames in a list called `list_of_dfs`
3. Convert the list into a single DataFrame by [concatenating vertically](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) 
4. Store the final DataFrame as the variable `df_weather_period` and reset the index 

In [None]:
# YOUR CODE HERE
raise NotImplementedError()