# Introduction to numerical algorithms
## Practice class 1 - Set up a workspace, read and structure data

We are going to use Jupyter notebooks. Suggested programming environments:
* VS Code with miniconda
* Jupyter with miniconda
* Google Colab

**Set up miniconda:**
1. Install: https://docs.anaconda.com/miniconda/miniconda-install/
2. Create a Python virtual environment: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#
3. `environment.yml` is provided

*Pros:*
* virtual environments
* safety
* any version of any package can be installed in your env
* you can have many different envs

*Cons:*
* needs some experience to set up

### Pip on Linux/Windows (without miniconda)

- Create the virtual environment with the following terminal command, then activate it:
```bash
python3 -m venv venv_name
```
- Windows:
```bash
venv_name\Scripts\activate
```
- Linux/Mac:
```bash
source venv_name/bin/activate
```

Note that a `.yml` file is nothing more than a plain text file. If you are using miniconda, use the `.yml` file directly. If you are using native Python with `pip`, you have to convert it to a plain `.txt` requirements file. This is done by renaming the extension and rewriting the contents as in the example below:
- YAML:
```yaml
dependencies:
  - numpy=1.19.2
  - pandas=1.1.3
  - scikit-learn=0.23.2

```
- TXT:
```bash
numpy==1.19.2
pandas==1.1.3
scikit-learn==0.23.2
```
Use the command
```bash
pip install -r /path/to/environment.txt
```
to install the requirements.
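
The rewriting step can also be scripted. Below is a minimal sketch, assuming the `dependencies` list contains only plain `name=version` entries as in the example above. Real `environment.yml` files may also contain channels, build strings or a nested `pip:` section, which this sketch does not handle. `conda_to_pip` is a hypothetical helper name, not part of conda or pip.

```python
def conda_to_pip(yaml_text):
    """Turn simple conda 'name=version' entries into pip 'name==version' lines."""
    requirements = []
    in_dependencies = False
    for raw_line in yaml_text.splitlines():
        line = raw_line.strip()
        if line.startswith("dependencies:"):
            in_dependencies = True
            continue
        # Only list items below 'dependencies:' are of interest
        if not in_dependencies or not line.startswith("- "):
            continue
        spec = line[2:].strip()
        if "==" in spec or "=" not in spec:
            requirements.append(spec)          # already pip style, or unpinned
        else:
            name, version = spec.split("=", 1)
            requirements.append(f"{name}=={version}")
    return requirements

example_yaml = """dependencies:
  - numpy=1.19.2
  - pandas=1.1.3
  - scikit-learn=0.23.2
"""
pip_lines = conda_to_pip(example_yaml)
print("\n".join(pip_lines))
```

Writing `pip_lines` to a file, one entry per line, gives a file that can be passed to `pip install -r`.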

**Remarks**

I personally prefer pip: it is more native to Python, it makes package management easier, it is available natively on Linux systems and it doesn't require a miniconda setup.

Miniconda, on the other hand, is a compact way to handle project files and is more widespread on Windows systems. Some types of projects and compatibility issues may require it.

Note that:
- since the 2023 release cycle, Ubuntu and other Debian-based Linux systems mark the system Python as externally managed (PEP 668), so `pip install` outside a virtual environment is refused by default.
- `pyenv` is worth mentioning on Linux: it handles different Python versions and environments easily.
- virtual environments are preferred because they keep track of the versions of your packages and ensure backward compatibility.
- you can specify the version number in the requirements file. If you don't, the package manager will look for the newest stable version.
- in VS Code you can use the command palette instead of the terminal to create virtual environments with either venv or conda.
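
To check which interpreter and package versions a notebook is actually using, something like the following can help. It is only a sketch: the `sys.prefix` comparison detects `venv` environments, while a conda environment is a separate installation and is not detected this way. `installed_version` is a hypothetical helper name.

```python
import sys
from importlib import metadata

# Inside a venv, sys.prefix points at the environment and differs from sys.base_prefix
in_venv = sys.prefix != sys.base_prefix
print("Interpreter:", sys.executable)
print("Inside a venv:", in_venv)

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for package in ("numpy", "pandas", "scikit-learn"):
    print(package, installed_version(package))
```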

**Set up Jupyter:**

Install: https://docs.jupyter.org/en/latest/install/notebook-classic.html

**Set up VS Code**

Install: https://code.visualstudio.com/download

**Set up colab:**

Visit: https://colab.research.google.com/

You need a Google account; everything is stored in your Google Drive.

*Pros:*
* easy to access
* no initial setup
* stored in the cloud, can be accessed anywhere

*Cons:*
* complicated file management
* runs in the cloud, requires internet access
* slow

### Python and numpy basics:
**Reference materials:**
* python: https://docs.python.org/3/
* numpy: https://numpy.org/doc/stable/user/index.html

But Google is your friend: with any question, search for it first!

### Numpy Tutorial
`numpy` is the most general vector and linear algebra tool in Python. It includes the most common operations of linear algebra, complex numbers, mathematical functions and random numbers. I reiterate: do not mix it with the `math` and `random` packages. They use different number types and different precisions. Python is a mess by itself, there is no need to make your life harder!

In [10]:
import numpy as np #always use it this way!
import sys

```python
import numpy as foo
```
would also work, but no one would understand it.

In [None]:
a = np.arange(10)
b = range(10)
print(a,b,list(b))
print(type(a),type(b))
#please note the subtle difference of missing commas. Do not rely on it! 

[0 1 2 3 4 5 6 7 8 9] range(0, 10) [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
<class 'numpy.ndarray'> <class 'range'>


In [None]:
#numpy array and list can be made from each other
b = [1,2,3]
print(type(b),b,b[0])
a = np.array(b)
print(type(a),a,a[0])
c = list(a)
print(type(c),c,c[0])

<class 'list'> [1, 2, 3] 1
<class 'numpy.ndarray'> [1 2 3] 1
<class 'list'> [1, 2, 3] 1


In [8]:
#numpy cannot mix types; if the list does, it takes the most inclusive one, here the string
print(np.array([1,"a",1.5]))
#here float
print(np.array([1,1.5,7]))

['1' 'a' '1.5']
[1.  1.5 7. ]
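
The chosen type can be inspected through the `dtype` attribute instead of guessing from the printout. A small sketch (the variable names are only for illustration):

```python
import numpy as np

mixed = np.array([1, "a", 1.5])    # a string is present -> everything becomes a string
numbers = np.array([1, 1.5, 7])    # ints and floats -> everything becomes a float
ints = np.array([1, 2, 3])         # only ints -> integer array

print(mixed.dtype, numbers.dtype, ints.dtype)

# The conversion can also be requested explicitly
as_float = ints.astype(float)
print(as_float, as_float.dtype)
```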


In [None]:
#very efficient in storage. I have 8 byte integers by default:
a = np.arange(10000)
b = np.arange(100000)
print(sys.getsizeof(a),sys.getsizeof(b))

80096 800096
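
The array itself reports how its memory is used. In this sketch the dtype is fixed to `np.int64` so that the numbers do not depend on the platform default:

```python
import sys
import numpy as np

a = np.arange(10000, dtype=np.int64)

print(a.itemsize)                    # bytes per element
print(a.nbytes)                      # bytes used by the data alone
print(sys.getsizeof(a) - a.nbytes)   # small, size-independent overhead of the array object
```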


In [None]:
#list first stores a list of pointers to the elements which are in turn stored individually in the memory.
la = [i for i in range(10000)]
print(sys.getsizeof(la))
print(sys.getsizeof(la[0]))

87616
24


In [12]:
la = ["A LONG STRING THAT DEFINITELY"+str(i)+" DOES NOT FIT" for i in range(10000)]
print(sys.getsizeof(la))
print(sys.getsizeof(la[0]))

85176
92


In [None]:
#the extra space stores the shape of the numpy array. reshaping is cheap:
a = np.array([1,2,3,4,5,6])
b = np.array([[1,2,3],[4,5,6]])
print(np.ndim(a))
print(np.ndim(b),np.ndim(b[0]))
print(a.reshape(2,3),a.reshape(2,3).ndim)

1
2 1
[[1 2 3]
 [4 5 6]] 2
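
Reshaping is cheap because, for a contiguous array like this one, `reshape` only creates a new view with a different shape on the same data. A short sketch to check this:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])
m = a.reshape(2, 3)

print(a.shape, m.shape)
print(np.shares_memory(a, m))   # True: no data was copied

m[0, 0] = 100                   # writing through the reshaped array ...
print(a)                        # ... changes the original as well

auto = a.reshape(-1, 2)         # -1 lets numpy work out that dimension
print(auto.shape)
```

The flip side is that a reshaped array is not an independent copy; use `.copy()` if you need one.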


In [None]:
#obviously, the total number of elements must match:
a = np.array([1,2,3,4,5,6])
c = a.reshape(2,4)

ValueError: cannot reshape array of size 6 into shape (2,4)

In [None]:
#zeros is the usual way to create a numpy array. Always define the type, even though the default is float.
a = np.zeros((3,3))
b = np.zeros((3,3),dtype=int)
print(a)
print(b)
c = np.ones(5,dtype=np.uint8) #8 bit unsigned integer
print(c)

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
[[0 0 0]
 [0 0 0]
 [0 0 0]]
[1 1 1 1 1]


In [None]:
#indexing, and slicing
a = np.zeros((3,3),dtype=int)
print("a=",a)
a[0][1] = 1
a[1, 2] = 3.5 # just to show that it will be converted back to integer
print("\na=",a)
a[:,0:3:2] = 2  # slicing: a single : selects everything along that axis, 0:3:2 means start:stop:step
print("\na=",a)

a= [[0 0 0]
 [0 0 0]
 [0 0 0]]

a= [[0 1 0]
 [0 0 3]
 [0 0 0]]

a= [[2 1 2]
 [2 0 2]
 [2 0 2]]


In [None]:
c[0] += 255
print(c)
#it is zero because an 8-bit unsigned integer goes from 0 to 255, so 1 + 255 wraps around

[0 1 1 1 1]
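
Integer arrays wrap around silently when a result does not fit into their type. The limits of a type can be looked up with `np.iinfo`, and the array can be widened first when larger values are expected. A minimal sketch:

```python
import numpy as np

info = np.iinfo(np.uint8)
print(info.min, info.max)

c = np.array([1, 200, 255], dtype=np.uint8)
wrapped = c + np.uint8(100)         # arithmetic is done modulo 256
print(wrapped)

safe = c.astype(np.int64) + 100     # widen the type before adding
print(safe)
```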


In [None]:
# operations on the whole array or on a part of it
a = np.zeros((3,3),dtype=float)
a[1] += 1
print(a)
a[:,2] += 2
print(a)

[[0. 0. 0.]
 [1. 1. 1.]
 [0. 0. 0.]]
[[0. 0. 2.]
 [1. 1. 3.]
 [0. 0. 2.]]


In [None]:
a = a*(-1)
print(a)

[[-0. -0. -2.]
 [-1. -1. -3.]
 [-0. -0. -2.]]


In [None]:
#the same works on a part of the array only:
a[:2,::2] = a[:2,::2] + 10
print(a)

[[10. -0.  8.]
 [ 9. -1.  7.]
 [-0. -0. -2.]]


In [None]:
a[:2,::2] += np.ones((2,2))*10
print(a)

[[20. -0. 18.]
 [19. -1. 17.]
 [-0. -0. -2.]]


In [None]:
# columns and rows
a = np.arange(9).reshape(3,3)
print(a)
print("")
print(a[1])
print(a[:,1])

[[0 1 2]
 [3 4 5]
 [6 7 8]]

[3 4 5]
[1 4 7]


### Random numbers
All random numbers created by Python are pseudo-random numbers, which means they are created by a deterministic algorithm whose output passes the usual statistical tests of randomness. Each new number is derived from the state left by the previous one, so there must be a first one, which is called the seed. If you omit it, numpy takes the seed from the operating system's entropy source, or from the current time if that is not available. Advantages of using your own seed:
* You can debug your code.
* Results are reproducible.
* If you happen to run parallel code, all instances started with the same seed generate the same random sequence.
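
The effect of the seed can be checked directly. The sketch below uses the global `np.random.seed` interface and, as an alternative not used elsewhere in this notebook, the `np.random.default_rng` generator interface. The last part shows one way to give parallel workers different but still reproducible streams with `SeedSequence.spawn`; the seed value is arbitrary.

```python
import numpy as np

# Global interface: the same seed gives the same sequence
np.random.seed(12345)
first = np.random.random(3)
np.random.seed(12345)
second = np.random.random(3)
print(np.array_equal(first, second))

# Generator interface: every generator carries its own state
rng_a = np.random.default_rng(12345)
rng_b = np.random.default_rng(12345)
same_generators = np.array_equal(rng_a.random(3), rng_b.random(3))
print(same_generators)

# Child seeds: each worker gets its own reproducible stream
children = np.random.SeedSequence(12345).spawn(2)
worker_draws = [np.random.default_rng(child).random(3) for child in children]
print(np.array_equal(worker_draws[0], worker_draws[1]))
```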

In [None]:
np.random.seed(12345)
a = np.random.random(10)
print(a)

[0.92961609 0.31637555 0.18391881 0.20456028 0.56772503 0.5955447
 0.96451452 0.6531771  0.74890664 0.65356987]


In [None]:
#events with given probability. Here 30%
N = 10
p = 0.3
for i in range(N):
    if np.random.random() < p:
        print("It happened in step %d." % (i))

It happened in step 2.
It happened in step 3.
It happened in step 4.


In [None]:
# do it in one step
p = 0.3
r = np.random.random(10) < p
print(r)

[False False False False False False False False False  True]


In [None]:
#the True/False array can be used as an index to restrict the other array to the True part
np.arange(10)[r]

array([9])

In [None]:
np.where(r)

(array([9]),)
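
A few more things the True/False array can do, as a small sketch: counting the events, listing their indices as a plain array, and selecting the complement.

```python
import numpy as np

np.random.seed(12345)
p = 0.3
r = np.random.random(10) < p

print(r.sum())               # number of events, True counts as 1
print(np.flatnonzero(r))     # indices of the events as a 1-D array
print(np.arange(10)[~r])     # ~ inverts the mask: steps where nothing happened
```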

In [None]:
# Other useful random functions:
print(np.random.choice(5, 10))
print(np.random.choice(5,3,replace=False))
print(np.random.choice(5, 10, p=[0.1, 0, 0.3, 0.6, 0]))
c = ["Budapest", "Pécs", "Debrecen", "Miskolc"]
print(np.random.choice(c,1))
print(np.random.choice(c,2))
sr = np.arange(5)
np.random.shuffle(sr)
print(sr)

[3 1 3 1 3 4 0 0 3 2]
[0 1 4]
[3 2 2 3 3 2 0 2 3 2]
['Budapest']
['Miskolc' 'Budapest']
[0 1 3 2 4]


In [None]:
# shape of the array
np.random.seed(12345)
a = np.random.random(size=(2,5))
print(a)

[[0.92961609 0.31637555 0.18391881 0.20456028 0.56772503]
 [0.5955447  0.96451452 0.6531771  0.74890664 0.65356987]]


**Copying**

In [None]:
a = np.arange(5)
b = a
b[3] = 9
print(a,b) #they are the same

[0 1 2 9 4] [0 1 2 9 4]


In [None]:
a = np.arange(5)
b = a.copy()
b[3] = 9
print(a,b) #they are not the same

[0 1 2 3 4] [0 1 2 9 4]


In [None]:
a = np.arange(6).reshape(3,2)
b = a[0:2]
b[1,1] = 9
print(a,"\n",b) #they share the same data: a slice is a view, not a copy

[[0 1]
 [2 9]
 [4 5]] 
 [[0 1]
 [2 9]]
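
Whether two arrays use the same memory can be tested with `np.shares_memory`. The sketch below compares a basic slice, indexing with a list, and an explicit copy:

```python
import numpy as np

a = np.arange(6).reshape(3, 2)

view = a[0:2]               # basic slice: a view on the same data
fancy = a[[0, 1]]           # indexing with a list: a new array
explicit = a[0:2].copy()    # explicit copy of the slice

print(np.shares_memory(a, view))
print(np.shares_memory(a, fancy))
print(np.shares_memory(a, explicit))
```

Only the basic slice is affected when `a` changes later.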


### Random number example
Typical problem: We have an array with many values and we have to change them randomly with a given probability `p`.

First, with a for loop:

In [None]:
N = 1000000
p = 0.00001
a = np.zeros(N,dtype=float)

In [None]:
for i in range(N):
    if np.random.random() < p:
        a[i] = np.random.normal(0,0.1)

In [None]:
%%timeit
a = np.zeros(N,dtype=float)
for i in range(N):
    if np.random.random() < p:
        a[i] = np.random.normal(0,0.1)

271 ms ± 3.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
a = np.zeros(N,dtype=float)
r = np.random.random(N) < p
a[r] = np.random.normal(0,0.1,size=r.sum())

In [None]:
%%timeit
a = np.zeros(N,dtype=float)
r = np.random.random(N) < p
a[r] = np.random.normal(0,0.1,size=r.sum())

5.92 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
n = np.random.binomial(N,p)
print(n)

6


Idea: We "know" how many numbers will be changed, or at least their distribution: binomial. So we draw that count from a binomial distribution and generate only that many random values and positions. The tricky part in Python is getting the positions, because `np.random.choice` is slow with the `replace=False` option.

In [None]:
%%timeit
a = np.zeros(N,dtype=float)
n = np.random.binomial(N,p)
a[np.random.choice(N,n,replace=False)] = np.random.normal(0,0.1,size=n)

13 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
%%timeit
a = np.zeros(N,dtype=float)
n = np.random.binomial(N,p)
a[np.random.choice(N,n)] = np.random.normal(0,0.1,size=n)

272 µs ± 6.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
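
Note that the last variant draws the positions with replacement, so the same index can come up twice and fewer than `n` entries get changed (rare when `n` is much smaller than `N`). A possible alternative, sketched below, is the generator interface `np.random.default_rng`, whose `choice` also supports `replace=False`. It is generally reported to handle this case more efficiently than the legacy function, but no timing was measured for this notebook, so treat that as an assumption to check with `%%timeit`.

```python
import numpy as np

N = 1000000
p = 0.00001
rng = np.random.default_rng(12345)

a = np.zeros(N, dtype=float)
n = rng.binomial(N, p)                        # how many entries change
idx = rng.choice(N, size=n, replace=False)    # n distinct positions
a[idx] = rng.normal(0, 0.1, size=n)

print(n, np.unique(idx).size)                 # both numbers agree: no duplicates
```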


### Task 1: Create your first virtual env
1. Install miniconda.
2. Create a virtual env with the provided `environment.yml` file.
3. Activate/deactivate your environment.

Note: you are going to use it throughout the whole semester!

### Task 2: Write Hello World!
1. Install VS Code/JupyterLab (Colab is not recommended, use it only if the others are not working).
2. Open a local work folder in VS Code/JupyterLab.
3. Set your virtual environment as the Python interpreter.
4. Create your first notebook called `helloworld.ipynb`.
5. Write a short piece of code that prints 'Hello World!' and run it.

### Task 3: Basics of file read
1. Read the layer charges from `example1.txt` and plot them.
* First read the file until you find the header.
* Stop if a line does not look as expected.
* Create a simple line plot with
```python
from matplotlib import pyplot as plt
plt.plot(yourdata)
```
* Modify your code to write out the last occurrence of the layer charges.

2. Read the $\chi$ superconducting order parameter from `example2.txt` and plot it in the same way as before.

Use the following code to read the file line by line:
```python
infile='data.txt'
with open(infile,'r') as f:
    line=f.readline()
    print(line)
```
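
The snippet above reads only the first line. Looping over the file object gives the remaining lines one by one, which is enough to implement the "find the header, then stop at the first unexpected line" logic. The sketch below runs on made-up content: the header text `LAYER CHARGES` and the two-column layout are assumptions for illustration, so adapt both to what `example1.txt` really contains. `read_block` is a hypothetical helper name.

```python
import io

# Hypothetical file content; the real example1.txt may look different
sample = io.StringIO(
    "some log text\n"
    "LAYER CHARGES\n"
    "1 0.25\n"
    "2 0.50\n"
    "3 0.75\n"
    "end of block\n"
)

HEADER = "LAYER CHARGES"   # assumed header text

def read_block(f, header):
    """Return the second-column numbers of the block that follows `header`."""
    values = []
    for line in f:                 # skip everything up to the header
        if line.strip() == header:
            break
    for line in f:                 # continue right after the header
        parts = line.split()
        if len(parts) != 2:
            break                  # the line does not look as expected
        try:
            values.append(float(parts[1]))
        except ValueError:
            break                  # not a number: stop as well
    return values

charges = read_block(sample, HEADER)
print(charges)
```

With a real file, pass the object from `with open(infile,'r') as f:` to `read_block` instead of the `io.StringIO` stand-in.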


### Task 4: Built in routines
There are also built-in routines in Python that are very effective, but not for all cases:
```python
import numpy as np
import pandas as pd
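np.genfromtxt  # text files with named columns or missing values
np.loadtxt     # simple numeric text files
pd.read_csv    # CSV files, returns a DataFrame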
```
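
A small sketch of the two numpy readers on made-up CSV text (the column names and values are invented for illustration, they are not taken from the dataset below). `pd.read_csv` accepts the same kind of input and returns a DataFrame.

```python
import io
import numpy as np

# Hypothetical CSV content standing in for a real data file
text = "age,glucose\n50,148.0\n31,85.0\n32,183.0\n"

# loadtxt: purely numeric tables, the header line is skipped by hand
table = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1)
print(table.shape, table.dtype)

# genfromtxt: names=True takes the column names from the first line
named = np.genfromtxt(io.StringIO(text), delimiter=",", names=True)
print(named.dtype.names)
print(named["glucose"])
```

Both functions also accept a file name instead of the `io.StringIO` object.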

1. Download the following dataset: https://www.kaggle.com/datasets/muhammadehsan000/diabetes-healthcare-dataset
2. Read the data with the previous method, 'by hand'.
  * read each data column into a list
  * cast the lists to numpy arrays with the appropriate data type
  * create a dictionary from the data where the keys are the column names
3. Do it with the built-in methods

### Task 5: Save and load numpy arrays
1. Load the provided numpy array `a.npy` with `np.load()`
2. Determine the:
 * Data type
 * Dimensions
3. Create an array with dimension (3,6,9) and save it as `b.npy`
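
For reference, a round trip with `np.save` and `np.load` can look like the sketch below. It writes into a temporary folder so that nothing in your work folder is overwritten; the array contents are arbitrary.

```python
import os
import tempfile
import numpy as np

b = np.arange(3 * 6 * 9, dtype=float).reshape(3, 6, 9)

with tempfile.TemporaryDirectory() as folder:   # temporary folder, only for this demo
    path = os.path.join(folder, "b.npy")
    np.save(path, b)
    loaded = np.load(path)

print(loaded.dtype, loaded.ndim, loaded.shape)
print(np.array_equal(b, loaded))
```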

### Homework
From nature.com find an article with provided data sets.

E.g.: https://www.nature.com/articles/s41586-024-07818-x#data-availability

1. Write a script that reads the data 
2. Determine the data type and the dimension of the data
3. Extra: Reproduce the plot