# Introduction to NumPy


## What is NumPy?


- NumPy is the fundamental package for scientific computing in Python.
- It is a Python library that provides a multidimensional array object.
- At the core of the NumPy package is the ndarray object.

There are several important differences between NumPy arrays and the standard Python sequences:

- NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original.
- The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory.
- NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.
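
The first two points can be checked directly. The following is a minimal sketch (the names `arr` and `bigger` are illustrative only): `np.append` returns a new, larger array instead of growing the original, and a value assigned into an integer array is cast to that array's data type.

```python
import numpy as np

arr = np.array([1, 2, 3])

# "Growing" an array allocates a new one; the original keeps its size.
bigger = np.append(arr, 4)
print(arr.size, bigger.size)  # 3 4

# Every element shares one dtype, so a float stored into an integer
# array is cast to an integer (the fractional part is dropped).
arr[0] = 7.9
print(arr)  # [7 2 3]
```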


## Understanding Data Types in Python


Effective data-driven science and computation **requires understanding how data is stored and manipulated**. This section outlines and contrasts how arrays of data are handled in the Python language itself, and how NumPy improves on this.

Users of Python are often drawn in by its ease of use, one piece of which is dynamic typing. While a **statically-typed language like C or Java** requires each variable to be explicitly declared, a dynamically-typed language like Python skips this specification. For example, in C you might specify a particular operation as follows:

```C
/* C code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
```

While in Python the equivalent operation could be written this way:

```python
# Python code
result = 0
for i in range(100):
    result += i
```


Notice the main difference: in C, the data types of each variable are explicitly declared, while in Python the types are dynamically inferred. This means, for example, that **we can assign any kind of data to any variable**:


In [1]:
# Python code
x = 4
x = "four"


Here we've switched the contents of x from an integer to a string. The same thing in C would lead (depending on compiler settings) to a compilation error or other unintended consequences:

```C
/* C code */
int x = 4;
x = "four";  // FAILS
```

This sort of flexibility is one piece that makes Python and other dynamically-typed languages convenient and easy to use. Understanding how this works is an important piece of learning to analyze data efficiently and effectively with Python. But what this type-flexibility also points to is the fact that **Python variables are more than just their value; they also contain extra information about the type of the value**. We'll explore this more in the sections that follow.


### A Python Integer Is More Than Just an Integer

The standard Python implementation is written in C. This means that every Python object is simply a cleverly-disguised C structure, which contains not only its value, but other information as well. For example, when we define an integer in Python, such as x = 10000, x is not just a "raw" integer. It's actually a pointer to a compound C structure, which contains several values. Looking through the Python 3.4 source code, we find that the integer (long) type definition effectively looks like this (once the C macros are expanded):

```C
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
```

A single integer in Python 3.4 actually contains four pieces:

- ob_refcnt, a reference count that helps Python silently handle memory allocation and deallocation
- ob_type, which encodes the type of the variable
- ob_size, which specifies the size of the following data members
- ob_digit, which contains the actual integer value that we expect the Python variable to represent.

This means that there is some overhead in storing an integer in Python as compared to an integer in a compiled language like C, as illustrated in the following figure:

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/cint_vs_pyint.png" alt="Integer Memory Layout">


Here PyObject_HEAD is the part of the structure containing the reference count, type code, and other pieces mentioned before.

Notice the difference here: a C integer is essentially a label for a position in memory whose bytes encode an integer value. A Python integer is a pointer to a position in memory containing all the Python object information, including the bytes that contain the integer value. This extra information in the Python integer structure is what allows Python to be coded so freely and dynamically. All this additional information in Python types comes at a cost, however, which becomes especially apparent in structures that combine many of these objects.
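
The overhead can be observed from Python itself. This is a rough sketch, assuming CPython: `sys.getsizeof` reports the size of the whole integer object, while `itemsize` reports the size of one raw integer as NumPy stores it. The exact object size depends on the Python build, so no value is given for it here.

```python
import sys
import numpy as np

x = 10000

# Size of the whole Python int object (header plus digits), in bytes.
print(sys.getsizeof(x))

# Size of one raw 32-bit integer as stored inside a NumPy array.
print(np.dtype(np.int32).itemsize)  # 4
```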


### A Python List Is More Than Just a List

Let's consider now what happens when we use a Python data structure that holds many Python objects. The standard mutable multi-element container in Python is the list. We can create a list of integers as follows:


In [2]:
L = list(range(10))
L


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [3]:
type(L[0])

int

Or, similarly, a list of strings:


In [4]:
L2 = [str(c) for c in L]

In [5]:
type(L2[0])

str

Because of Python's dynamic typing, we can even create heterogeneous lists:


In [6]:
L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]


[bool, str, float, int]

But this flexibility comes at a cost: to allow these flexible types, each item in the list must contain its own type info, reference count, and other information; that is, each item is a complete Python object. In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array. The difference between a dynamic-type list and a fixed-type (NumPy-style) array is illustrated in the following figure:

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png" alt="Array Memory Layout">


At the implementation level, the array essentially contains a single pointer to one contiguous block of data. The Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object like the Python integer we saw earlier. Again, the advantage of the list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type. Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.
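
A rough way to see the difference in storage cost is sketched below. The estimate for the list is an assumption-laden approximation: it adds the size of the list's pointer block to the size of every integer object, ignoring the fact that CPython may share some small integers between references.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# The list stores n pointers, and each pointer refers to a full int object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# The array stores n raw 8-byte integers in one contiguous buffer.
array_bytes = np_arr.nbytes

print(list_bytes, array_bytes)
```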


### Fixed-Type Arrays in Python

Python offers several different options for storing data in efficient, fixed-type data buffers. The built-in array module can be used to create dense arrays of a uniform type:


In [7]:
import array

L = list(range(10))
A = array.array("i", L)
A


array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Here 'i' is a type code indicating the contents are integers.
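
The type code is enforced on every write. A small sketch (the array contents are arbitrary) showing that an `'i'` array accepts further integers but rejects a float:

```python
import array

A = array.array("i", range(5))
A.append(5)  # fine: 5 is an integer

try:
    A.append(2.5)  # rejected: the buffer only holds C ints
except TypeError as exc:
    print("rejected:", exc)

print(A.typecode, len(A))  # i 6
```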

Much more useful, however, is the ndarray object of the NumPy package. While Python's array object provides efficient storage of array-based data, NumPy adds to this efficient operations on that data. We will explore these operations in later sections; here we'll demonstrate several ways of creating a NumPy array.


## NumPy Speed


The points about sequence size and speed are particularly important in scientific computing. As a simple example, consider the case of multiplying each element in a 1-D sequence with the corresponding element in another sequence of the same length. If the data are stored in two Python lists, a and b, we could iterate over each element:


In [8]:
a = list(range(100))
b = list(range(200, 300))
c = []

for i in range(len(a)):
    c.append(a[i] * b[i])

print(c[:10])


[0, 201, 404, 609, 816, 1025, 1236, 1449, 1664, 1881]


This produces the correct answer, but if a and b each contain millions of numbers, we will pay the price for the inefficiencies of looping in Python. We could accomplish the same task much more quickly in C by writing (for clarity we neglect variable declarations and initializations, memory allocation, etc.)

```c
for (i = 0; i < rows; i++) {
  c[i] = a[i]*b[i];
}
```


This saves all the overhead involved in interpreting the Python code and manipulating Python objects, but at the expense of the benefits gained from coding in Python. Furthermore, the coding work required increases with the dimensionality of our data. In the case of a 2-D array, for example, the C code (abridged as before) expands to

```c
for (i = 0; i < rows; i++) {
  for (j = 0; j < columns; j++) {
    c[i][j] = a[i][j]*b[i][j];
  }
}
```


**NumPy gives us the best of both worlds**: element-by-element operations are the “default mode” when an ndarray is involved, but the element-by-element operation is speedily executed by pre-compiled C code. In NumPy, `c = a * b` does what the earlier examples do, at near-C speeds, but with the code simplicity we expect from something based on Python.


In [9]:
import numpy as np

a = np.arange(100)
b = np.arange(200, 300)

c = a * b

print(c[:10])


[   0  201  404  609  816 1025 1236 1449 1664 1881]


**Why is NumPy Fast?**

- **Vectorization** describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:
  - vectorized code is more concise and easier to read
  - fewer lines of code generally means fewer bugs
  - the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematical constructs)
  - vectorization results in more “Pythonic” code. Without vectorization, our code would be littered with inefficient and difficult-to-read for loops.
- **Broadcasting** is the term used to describe the implicit element-by-element behavior of operations; generally speaking, in NumPy all operations, not just arithmetic operations, but logical, bit-wise, functional, etc., behave in this implicit element-by-element fashion, i.e., they broadcast. Moreover, in the example above, a and b could be multidimensional arrays of the same shape, or a scalar and an array, or even two arrays with different shapes, provided that the smaller array is “expandable” to the shape of the larger in such a way that the resulting broadcast is unambiguous.
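
The broadcasting cases mentioned above can be sketched as follows. The arrays `row` and `col` are illustrative; the point is only how the shapes combine.

```python
import numpy as np

row = np.arange(3)                # shape (3,)
col = np.arange(3).reshape(3, 1)  # shape (3, 1)

# Scalar and array: the scalar is applied to every element.
print(row * 10)  # [ 0 10 20]

# Shapes (3, 1) and (3,) broadcast to a common shape of (3, 3).
table = col + row
print(table.shape)  # (3, 3)
print(table)
```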


## Example: Data analysis in pure Python


In [10]:
import csv

dataset_path = "data/f500_small.csv"

with open(dataset_path, "r") as f:
    f500_small = list(csv.reader(f))

print(sum([int(row[2]) for row in f500_small[1:]]))

4305395


## How Vectorization Makes Code Faster


One of the reasons that the Python language is extremely popular is that it makes writing programs easy. When you execute Python code, the Python interpreter converts your code into bytecode, a lower-level set of instructions, and then runs that bytecode. When you write code in Python, you don't have to worry about things like allocating memory on your computer or choosing how certain operations are done by your computer's processor. Python takes care of that for you.

<p><img alt="Translating Python code to bytecode" src="https://s3.amazonaws.com/dq-content/289/bytecode.svg"></p>

Python is what we call a high-level language. High-level languages allow you to write programs faster, because the interpreter decides how to execute your instructions. In contrast, when you use a low-level language like C, you define exactly how memory will be managed and how the processor will execute your instructions. This means that coding in a low-level language takes longer; however, you have more ability to optimize your code to run faster.

<table>
<thead>
<tr>
<th>Language Type</th>
<th>Example</th>
<th>Time taken to write program</th>
<th>Control over program performance</th>
</tr>
</thead>
<tbody>
<tr>
<td>High-Level</td>
<td>Python</td>
<td>Low</td>
<td>Low</td>
</tr>
<tr>
<td>Low-Level</td>
<td>C</td>
<td>High</td>
<td>High</td>
</tr>
</tbody>
</table>

When choosing between a high-level and a low-level language, you have to make a trade-off between being able to work quickly and having programs that run quickly and efficiently. Luckily, there are two Python libraries that were created to give us the best of both worlds: NumPy and pandas. Together, pandas and NumPy provide a powerful toolset for working with data in Python. They allow us to write code quickly without sacrificing performance. But how do they do this? What is it that makes these libraries faster than raw Python? The answer is vectorization.

Let's look at an example where we have two columns of data. Each row contains two numbers we wish to add together. Using just Python, we would use a list of lists structure to store our data, and use for loops to iterate over that data. Let's see what this would look like as Python code:


<p><img alt="For loop to sum rows" src="https://s3.amazonaws.com/dq-content/289/for_loop.svg"></p>


In [11]:
my_numbers = [[6, 5], [1, 3], [5, 6]]

sums = []

for row in my_numbers:
    row_sum = row[0] + row[1]
    sums.append(row_sum)

print(sums)


[11, 4, 11]


When this code is run, the Python interpreter will turn our code into bytecode, following the logic of our for loop. In each iteration of our loop, the bytecode asks our computer's processor to add the two numbers together and stores the result. The diagram shows the first calculation our computer's processor would make:


<p><img src="./images/numpy_for.gif"></p>


Our computer would take eight processor cycles to process the eight rows of our data.

Vectorization takes advantage of a processor feature called **Single Instruction Multiple Data (SIMD)** to process data faster. Most modern computer processors support SIMD. SIMD allows a processor to perform the same operation, on **multiple data points, in a single processor cycle**. Let's look at how a vectorized version of our code above might be processed using a SIMD instruction that allows four data points to be processed at once:


<p><img src="./images/numpy_vectorized.gif"></p>


The vectorized version of our code will only take two processor cycles to process our eight rows of data, a fourfold speed-up. Vectorized operations might process as few as two and as many as hundreds of operations per processor cycle, depending on the capabilities of the processor and the size of each data point.

The good news is that you don't have to worry about SIMD and processor cycles, because NumPy and pandas take care of this for you.


In [12]:
%%timeit -n 3 -r 1
# Native Python
size = 5_000_000
list1 = [i for i in range(size)]
list2 = [i for i in range(size)]

sums = []

for el1, el2 in zip(list1, list2):
    row_sum = el1 + el2
    sums.append(row_sum)

print(sums[:5])

[0, 2, 4, 6, 8]
[0, 2, 4, 6, 8]


[0, 2, 4, 6, 8]
3.26 s ± 0 ns per loop (mean ± std. dev. of 1 run, 3 loops each)


In [13]:
%%timeit -n 10 -r 1
# NumPy - vectorized operations
import numpy as np

size = 5_000_000  # same size as the native Python example

# Numpy - declaring arrays
array1 = np.arange(size)
array2 = np.arange(size)

sums = array1 + array2

print(sums[:5])

[0 2 4 6 8]
[0 2 4 6 8]
[0 2 4 6 8]
[0 2 4 6 8]


[0 2 4 6 8]
[0 2 4 6 8]
[0 2 4 6 8]
[0 2 4 6 8]
[0 2 4 6 8]
[0 2 4 6 8]
38.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
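
The `%%timeit` magic only works inside IPython or Jupyter. Outside a notebook, a similar comparison can be sketched with the standard-library `timeit` module. The array size, the repeat count, and the function names below are assumptions chosen for illustration, and the measured times will differ from machine to machine.

```python
import timeit
import numpy as np

size = 100_000
list1 = list(range(size))
list2 = list(range(size))
array1 = np.arange(size)
array2 = np.arange(size)

def add_lists():
    # Element-by-element addition in a Python-level loop.
    return [x + y for x, y in zip(list1, list2)]

def add_arrays():
    # The same addition as a single vectorized operation.
    return array1 + array2

t_list = timeit.timeit(add_lists, number=5)
t_array = timeit.timeit(add_arrays, number=5)
print(f"lists:  {t_list:.4f} s")
print(f"arrays: {t_array:.4f} s")
```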


## NumPy library


NumPy (Numerical Python) is an **open source Python library that’s used in almost every field of science and engineering**. It’s the universal standard for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems. NumPy users include everyone from beginning coders to experienced researchers doing state-of-the-art scientific and industrial research and development. The NumPy API is used extensively in Pandas, SciPy, Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages.

The NumPy library contains **multidimensional array and matrix data structures**. It provides **ndarray**, a homogeneous n-dimensional array object, with methods to efficiently operate on it. NumPy can be used to perform a wide variety of mathematical operations on arrays. It adds powerful data structures to Python that guarantee efficient calculations with arrays and matrices and it supplies an enormous library of high-level mathematical functions that operate on these arrays and matrices.


One of the reasons NumPy is so important for numerical computations in Python is
because it is designed for efficiency on large arrays of data.

NumPy-based algorithms are generally 10 to 100 times faster (or more) than their
pure Python counterparts and use significantly less memory.

If you use pip, you can install NumPy with: `pip install numpy`


To access NumPy and its functions import it in your Python code like this:


In [14]:
import numpy as np

In [15]:
np.__version__

'1.24.1'

## Introduction to Ndarrays


A multidimensional array is a **central data structure** of the NumPy library. An array is a grid of values and it contains information about the raw data, how to locate an element, and how to interpret an element. It has a grid of elements that can be **indexed in various ways**. The elements are all of the **same type, referred to as the array dtype**.

NumPy arrays are faster and more compact than Python lists: they use much less memory to store the same data, and NumPy lets you specify the data type of the elements, which allows the code to be optimized even further.

An ndarray is a table of elements (usually numbers), **all of the same type**, indexed by a tuple of non-negative integers. In NumPy **dimensions are called axes**.


For example, the array for the coordinates of a point in 3D space, `[1, 2, 1]`, has one axis. That axis has 3 elements in it, so we say it has a length of 3. In the example pictured below, the array has 2 axes. The first axis has a length of 2, the second axis has a length of 3.

    [[1., 0., 0.],
     [0., 1., 2.]]

NumPy’s array class is called ndarray. It is also known by the alias array. Note that numpy.array is not the same as the Standard Python Library class array.array, which only handles one-dimensional arrays and offers less functionality.


In [16]:
a = np.array([1, 3, 4, 5, 6, 7, 8])
a

array([1, 3, 4, 5, 6, 7, 8])

In [17]:
print(type(a))

<class 'numpy.ndarray'>


In [18]:
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(b)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [19]:
a.ndim, b.ndim  # ndarray.ndim: the number of axes (dimensions) of the array.

(1, 2)

In [20]:
(
    a.shape,
    b.shape,
)  # ndarray.shape: the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension.


((7,), (3, 3))

In [21]:
(
    a.size,
    b.size,
)  # ndarray.size: the total number of elements of the array. This is equal to the product of the elements of shape.

(7, 9)

In [22]:
a.dtype, b.dtype  # ndarray.dtype: an object describing the type of the elements in the array.

(dtype('int32'), dtype('int32'))

In [23]:
a.itemsize, b.itemsize  # ndarray.itemsize: the size in bytes of each element of the array.

(4, 4)

In [24]:
a.data

<memory at 0x000001C8C8A19FC0>

In [25]:
import numpy as np

<img alt="Dimensional Arrays" src="./images/one_dim.svg">
<img alt="Dimensional Arrays" src="./images/Two_Dim.svg">


In [26]:
a = np.array([1,2,3,4])
a

array([1, 2, 3, 4])

In [27]:
a.dtype

dtype('int32')

In [28]:
b = np.array([23.3, 4.5, 6])
b.dtype

dtype('float64')

In [29]:
np.array(1, 2, 3, 4)  # Wrong: the values must be passed as a single sequence, e.g. np.array([1, 2, 3, 4])

TypeError: array() takes from 1 to 2 positional arguments but 4 were given

In [30]:
c = np.array([[1,2], [3,4]], dtype=np.int64)
print(c.dtype)
c

int64


array([[1, 2],
       [3, 4]], dtype=int64)

Often, the elements of an array are originally unknown, but its size is known. Hence, NumPy offers several functions to create arrays with initial placeholder content. These minimize the necessity of growing arrays, an expensive operation.

In [31]:
np.zeros((3,4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [32]:
np.ones((3,4,4), dtype=np.int16)

array([[[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]], dtype=int16)

In [33]:
np.empty((2,3))

array([[0., 0., 0.],
       [0., 0., 0.]])
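
Note that `np.empty` does not initialize its contents: the values it returns are whatever happened to be in memory, so zeros are not guaranteed. A minimal sketch of the preallocate-then-fill pattern (the array name and the values written are illustrative):

```python
import numpy as np

n = 5
squares = np.empty(n, dtype=np.int64)  # allocated once, contents arbitrary

for i in range(n):
    squares[i] = i * i  # overwrite every slot before reading it

print(squares)  # [ 0  1  4  9 16]
```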

In [34]:
np.arange(10, 40, 5)

array([10, 15, 20, 25, 30, 35])

In [35]:
np.arange(10.5, 40, 3.5) # it accepts float arguments

array([10.5, 14. , 17.5, 21. , 24.5, 28. , 31.5, 35. , 38.5])

In [36]:
np.linspace(0,10,15)

array([ 0.        ,  0.71428571,  1.42857143,  2.14285714,  2.85714286,
        3.57142857,  4.28571429,  5.        ,  5.71428571,  6.42857143,
        7.14285714,  7.85714286,  8.57142857,  9.28571429, 10.        ])
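
The difference between the two functions is what you specify: `np.arange` takes a step, while `np.linspace` takes a number of points. The sketch below uses the optional `retstep=True` argument of `np.linspace` to show the spacing it computed.

```python
import numpy as np

# linspace: you choose the number of points; both endpoints are included.
points, step = np.linspace(0, 10, 15, retstep=True)
print(len(points), points[-1])  # 15 10.0
print(step)                     # spacing between points, 10 / 14

# arange: you choose the step; the stop value itself is excluded.
print(np.arange(10, 40, 5)[-1])  # 35
```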

In [37]:
# exercise
x = np.linspace(0, 2*np.pi, 100)
f = np.sin(x)
f[:10]

array([0.        , 0.06342392, 0.12659245, 0.18925124, 0.25114799,
       0.31203345, 0.37166246, 0.42979491, 0.48619674, 0.54064082])

In [38]:
a = np.arange(6)
print(a)
b = a.reshape(2,3)
print(b)

[0 1 2 3 4 5]
[[0 1 2]
 [3 4 5]]
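
Two details of `reshape` are worth a quick sketch: one dimension may be given as `-1` so that NumPy infers it, and the new shape must hold exactly the same number of elements as the original.

```python
import numpy as np

a = np.arange(6)

# -1 asks NumPy to infer that dimension from the total size.
b = a.reshape(2, -1)
print(b.shape)  # (2, 3)

# The total number of elements must stay the same.
try:
    a.reshape(4, 2)
except ValueError as exc:
    print("cannot reshape:", exc)
```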


In [39]:
print(np.arange(100000))

[    0     1     2 ... 99997 99998 99999]


## Datatypes

https://numpy.org/doc/stable/reference/arrays.dtypes.html

https://numpy.org/doc/stable/reference/arrays.scalars.html#arrays-scalars-built-in

<div class="text_cell_render border-box-sizing rendered_html">
<table>
<thead><tr>
<th>Data type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>bool_</code></td>
<td>Boolean (True or False) stored as a byte</td>
</tr>
<tr>
<td><code>int_</code></td>
<td>Default integer type (same as C <code>long</code>; normally either <code>int64</code> or <code>int32</code>)</td>
</tr>
<tr>
<td><code>intc</code></td>
<td>Identical to C <code>int</code> (normally <code>int32</code> or <code>int64</code>)</td>
</tr>
<tr>
<td><code>intp</code></td>
<td>Integer used for indexing (same as C <code>ssize_t</code>; normally either <code>int32</code> or <code>int64</code>)</td>
</tr>
<tr>
<td><code>int8</code></td>
<td>Byte (-128 to 127)</td>
</tr>
<tr>
<td><code>int16</code></td>
<td>Integer (-32768 to 32767)</td>
</tr>
<tr>
<td><code>int32</code></td>
<td>Integer (-2147483648 to 2147483647)</td>
</tr>
<tr>
<td><code>int64</code></td>
<td>Integer (-9223372036854775808 to 9223372036854775807)</td>
</tr>
<tr>
<td><code>uint8</code></td>
<td>Unsigned integer (0 to 255)</td>
</tr>
<tr>
<td><code>uint16</code></td>
<td>Unsigned integer (0 to 65535)</td>
</tr>
<tr>
<td><code>uint32</code></td>
<td>Unsigned integer (0 to 4294967295)</td>
</tr>
<tr>
<td><code>uint64</code></td>
<td>Unsigned integer (0 to 18446744073709551615)</td>
</tr>
<tr>
<td><code>float_</code></td>
<td>Shorthand for <code>float64</code>.</td>
</tr>
<tr>
<td><code>float16</code></td>
<td>Half precision float: sign bit, 5 bits exponent, 10 bits mantissa</td>
</tr>
<tr>
<td><code>float32</code></td>
<td>Single precision float: sign bit, 8 bits exponent, 23 bits mantissa</td>
</tr>
<tr>
<td><code>float64</code></td>
<td>Double precision float: sign bit, 11 bits exponent, 52 bits mantissa</td>
</tr>
<tr>
<td><code>complex_</code></td>
<td>Shorthand for <code>complex128</code>.</td>
</tr>
<tr>
<td><code>complex64</code></td>
<td>Complex number, represented by two 32-bit floats</td>
</tr>
<tr>
<td><code>complex128</code></td>
<td>Complex number, represented by two 64-bit floats</td>
</tr>
</tbody>
</table>

</div>


In [40]:
np.array([66, 57, 2345], dtype=np.int8)

[truncated DeprecationWarning] For the old behavior, usually:
    np.array(value).astype(dtype)
will give the desired result (the cast overflows).


array([66, 57, 41], dtype=int8)
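
The value 41 comes from integer overflow: 2345 does not fit into the range of `int8`, so only its lowest 8 bits survive the cast. A short sketch using an explicit `astype` cast:

```python
import numpy as np

info = np.iinfo(np.int8)
print(info.min, info.max)  # -128 127

# 2345 does not fit in 8 bits; an explicit cast keeps only the low byte.
values = np.array([66, 57, 2345])
wrapped = values.astype(np.int8)
print(wrapped)     # [66 57 41]
print(2345 % 256)  # 41
```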

In [41]:
np.array(["a", "b", "c"])

array(['a', 'b', 'c'], dtype='<U1')

In [42]:
np.array([12,34,5, True, "dsdsdds", 45])

array(['12', '34', '5', 'True', 'dsdsdds', '45'], dtype='<U11')
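
Because all elements must share one type, NumPy picks a common type that can represent every input; in the cell above that common type is a string. The sketch below shows a milder case of the same rule, and the `dtype=object` escape hatch, which keeps the original Python objects but gives up the compact fixed-type storage.

```python
import numpy as np

# Integers and a float: everything becomes float64.
print(np.array([1, 2, 3.5]).dtype)  # float64

# dtype=object stores references to the original Python objects.
mixed = np.array([12, True, "abc"], dtype=object)
print([type(v).__name__ for v in mixed])  # ['int', 'bool', 'str']
```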

## Basic Operations and Universal Functions

In [43]:
a = np.array([23,4,56,66,77])

In [44]:
a - 10

array([13, -6, 46, 56, 67])

In [45]:
a - np.arange(5)

array([23,  3, 54, 63, 73])

In [46]:
a

array([23,  4, 56, 66, 77])

In [47]:
a < 50

array([ True,  True, False, False, False])

In [48]:
np.sqrt(a)

array([4.79583152, 2.        , 7.48331477, 8.1240384 , 8.77496439])
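
Two follow-ups to the cells above, as a small sketch: the boolean array produced by a comparison can be summed to count matches, and an operator gives the same result as the corresponding universal function.

```python
import numpy as np

a = np.array([23, 4, 56, 66, 77])

mask = a < 50      # elementwise comparison gives a boolean array
print(mask.sum())  # 2, since True counts as 1

# The operator and the ufunc are two spellings of the same operation.
print(np.array_equal(a - 10, np.subtract(a, 10)))  # True
```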

## Indexing, Slicing and Iterating

One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other Python sequences.

In [49]:
a= np.arange(10)**3
b=5
a

array([  0,   1,   8,  27,  64, 125, 216, 343, 512, 729], dtype=int32)

In [50]:
print(a[2:5])
print(a[:6:2])
print(a[::-1])

[ 8 27 64]
[ 0  8 64]
[729 512 343 216 125  64  27   8   1   0]


## Selecting and Slicing Rows and Items from ndarrays

An array can be indexed by a tuple of nonnegative integers, by booleans, by another array, or by integers.


<img alt="Dimensional Arrays" src="./images/selection_rows.svg">

<img alt="Dimensional Arrays" src="./images/selection_item.svg">
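
The indexing forms listed above can be sketched on one small array. The array `grid` is introduced here only for the two-dimensional case.

```python
import numpy as np

a = np.arange(10) ** 3

print(a[3])          # single integer index -> 27
print(a[[0, 2, 4]])  # list (or array) of indices -> [ 0  8 64]
print(a[a > 100])    # boolean mask -> [125 216 343 512 729]

grid = a[:9].reshape(3, 3)
print(grid[1, 2])    # tuple of integers: row 1, column 2 -> 125
```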


In [51]:
for i in a:
    print(i**(1/3.0))

0.0
1.0
2.0
3.0
3.9999999999999996
5.0
5.999999999999999
6.999999999999999
7.999999999999999
8.999999999999998
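
The loop above works, but it computes each root separately and shows floating-point rounding in the results. A vectorized sketch using `np.cbrt`, with a tolerance-based comparison instead of exact equality:

```python
import numpy as np

a = np.arange(10) ** 3

roots = np.cbrt(a)  # vectorized cube root, no explicit loop
print(roots)

# Compare floating-point results with a tolerance rather than ==.
print(np.allclose(roots, np.arange(10)))  # True
```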


In [52]:
test=np.random.randint(0,10,(5,5))
test



array([[6, 7, 1, 9, 3],
       [6, 3, 6, 1, 2],
       [9, 7, 6, 8, 6],
       [1, 0, 6, 9, 5],
       [1, 1, 7, 0, 1]])

In [53]:

first_row=test[0] 
last_row=test[-1] 
first_row
last_row

array([1, 1, 7, 0, 1])

In [61]:
# selecting the 2nd and 3rd row
row_2_and_3=test[1:3]
row_2_and_3

array([[6, 3, 6, 1, 2],
       [9, 7, 6, 8, 6]])

In [60]:
# select the item at row index 2 and column index 3
test[2,3]

8

In [56]:
# Compare with indexing a plain Python list of lists
a=[[1,2,3,4,5],
   [2,4,5,6,7],
   [3,5,2,1,3],
   [2,3,4,5,6]]
a

[[1, 2, 3, 4, 5], [2, 4, 5, 6, 7], [3, 5, 2, 1, 3], [2, 3, 4, 5, 6]]

In [57]:
a[1:3]

[[2, 4, 5, 6, 7], [3, 5, 2, 1, 3]]

In [58]:
a[2,3]

TypeError: list indices must be integers or slices, not tuple

In [None]:
a[2][3]

1
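
The difference becomes more visible when a whole column is needed. In this sketch (the name `rows` is illustrative), the list of lists has to be traversed row by row, while the ndarray needs a single indexing expression.

```python
import numpy as np

rows = [[1, 2, 3, 4, 5],
        [2, 4, 5, 6, 7],
        [3, 5, 2, 1, 3],
        [2, 3, 4, 5, 6]]

# Plain Python: a column has to be collected row by row.
col_list = [row[3] for row in rows]
print(col_list)  # [4, 6, 1, 5]

# NumPy: the same column is one indexing expression.
col_arr = np.array(rows)[:, 3]
print(col_arr)  # [4 6 1 5]
```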

### Columns indexing

In [None]:
columns_test=np.random.random((5,5))
columns_test

array([[0.69225051, 0.12883187, 0.5986135 , 0.06918783, 0.64603718],
       [0.83629415, 0.45564276, 0.04109224, 0.79771475, 0.36463934],
       [0.57518776, 0.64658139, 0.81659486, 0.37410568, 0.65464183],
       [0.31336384, 0.49318262, 0.3340242 , 0.22635323, 0.11385745],
       [0.06078545, 0.47001831, 0.49027777, 0.23054076, 0.80207408]])

In [None]:
columns_test[:,3]

array([0.06918783, 0.79771475, 0.37410568, 0.22635323, 0.23054076])

In [None]:
columns_test[:,1:3]


array([[0.12883187, 0.5986135 ],
       [0.45564276, 0.04109224],
       [0.64658139, 0.81659486],
       [0.49318262, 0.3340242 ],
       [0.47001831, 0.49027777]])

In [None]:
cols=[1,3,4]
columns_test[:,cols]

array([[0.12883187, 0.06918783, 0.64603718],
       [0.45564276, 0.79771475, 0.36463934],
       [0.64658139, 0.37410568, 0.65464183],
       [0.49318262, 0.22635323, 0.11385745],
       [0.47001831, 0.23054076, 0.80207408]])

In [None]:

columns_test[:,[1,3,4]]

array([[0.12883187, 0.06918783, 0.64603718],
       [0.45564276, 0.79771475, 0.36463934],
       [0.64658139, 0.37410568, 0.65464183],
       [0.49318262, 0.22635323, 0.11385745],
       [0.47001831, 0.23054076, 0.80207408]])

In [None]:
columns_test[2,1:4]

array([0.64658139, 0.81659486, 0.37410568])

In [None]:
columns_test[1:4,:3]

array([[0.83629415, 0.45564276, 0.04109224],
       [0.57518776, 0.64658139, 0.81659486],
       [0.31336384, 0.49318262, 0.3340242 ]])
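
One behavior worth knowing when selecting data: a basic slice returns a view that shares memory with the original array, while indexing with a list of positions returns an independent copy. A minimal sketch (the array and variable names are illustrative):

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)

view = grid[:, 1:3]       # basic slice: shares memory with grid
picked = grid[:, [1, 2]]  # list of indices: independent copy

view[0, 0] = 99           # also changes grid[0, 1]
picked[0, 1] = -1         # grid is unaffected

print(grid[0, 1], grid[0, 2])          # 99 2
print(np.shares_memory(view, grid))    # True
print(np.shares_memory(picked, grid))  # False
```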

## Vector Math

As we saw in the previous sections, NumPy ndarrays let us select data much more easily. Beyond this, computations on the selected data are a lot faster with vectorized operations, because the operations are applied to multiple data points at once.

When we first talked about vectorized operations, we used the example of adding two columns of data. With data in a list of lists, we'd have to construct a for-loop and add each pair of values from each row individually:


In [None]:
my_numbers=[[6,3],[9,1],[2,4],[7,14],[0,4]]
type(my_numbers)

list

In [None]:
sums=[]
for row in my_numbers:
    row_sums=row[0]+row[1]
    sums.append(row_sums)

print(sums)

[9, 10, 6, 21, 4]


At the time, we only talked about how vectorized operations make this faster; however, vectorized operations also make our code easier to write and read. Here's how we would perform the same task with vectorized operations:

In [None]:
my_numbers=np.array(my_numbers)
print(type(my_numbers)) 
print(my_numbers.dtype)

<class 'numpy.ndarray'>
int32


In [None]:
col1=my_numbers[:,0]
col2=my_numbers[:,1]

In [None]:
sums=col1+col2
sums

array([ 9, 10,  6, 21,  4])

In [None]:
sums2=my_numbers[:,0]+my_numbers[:,1]
sums2

array([ 9, 10,  6, 21,  4])

<div>
<p>Here are some key observations about this code:</p>
<ul>
<li>When we selected each column, we used the syntax <code>ndarray[:,c]</code> where <code>c</code> is the column index we wanted to select. As we saw in the previous section, the colon selects all rows.</li>
<li>To add the two 1D ndarrays, <code>col1</code> and <code>col2</code>, we simply use the addition operator (<code>+</code>) between them.</li>
</ul>

<p>The result of adding two 1D ndarrays is a 1D ndarray of the same shape (or dimensions) as the original. In this context, ndarrays can also be called <strong>vectors</strong>, a term taken from a branch of mathematics called linear algebra. What we just did, adding two vectors together, is called <strong>vector addition</strong>.</p></div>


As you start to feel more comfortable with these libraries, you should start exploring the documentation. This is useful because it builds out your knowledge of available functions and methods, but also because it gets you used to reading the documentation. It's not possible to remember the syntax for every variation of every data science library, but if you remember what is possible, and can read the documentation, you'll always be able to quickly refamiliarize yourself with some syntax whenever you need it.

When a function or method is mentioned for the first time, it is worth looking it up in the documentation. Take a moment now to find the numpy.divide() function in the NumPy documentation and read its entry. It may seem a little overwhelming at first, but it is well worth your time.

<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The following table lists the arithmetic operators implemented in NumPy:</p>
<table>
<thead><tr>
<th>Operator</th>
<th>Equivalent ufunc</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>+</code></td>
<td><code>np.add</code></td>
<td>Addition (e.g., <code>1 + 1 = 2</code>)</td>
</tr>
<tr>
<td><code>-</code></td>
<td><code>np.subtract</code></td>
<td>Subtraction (e.g., <code>3 - 2 = 1</code>)</td>
</tr>
<tr>
<td><code>-</code></td>
<td><code>np.negative</code></td>
<td>Unary negation (e.g., <code>-2</code>)</td>
</tr>
<tr>
<td><code>*</code></td>
<td><code>np.multiply</code></td>
<td>Multiplication (e.g., <code>2 * 3 = 6</code>)</td>
</tr>
<tr>
<td><code>/</code></td>
<td><code>np.divide</code></td>
<td>Division (e.g., <code>3 / 2 = 1.5</code>)</td>
</tr>
<tr>
<td><code>//</code></td>
<td><code>np.floor_divide</code></td>
<td>Floor division (e.g., <code>3 // 2 = 1</code>)</td>
</tr>
<tr>
<td><code>**</code></td>
<td><code>np.power</code></td>
<td>Exponentiation (e.g., <code>2 ** 3 = 8</code>)</td>
</tr>
<tr>
<td><code>%</code></td>
<td><code>np.mod</code></td>
<td>Modulus/remainder (e.g., <code>9 % 4 = 1</code>)</td>
</tr>
</tbody>
</table>

</div>
</div>
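
A few rows of the table, sketched on two small arrays (`x` and `y` are illustrative names). Each operator is applied element by element and gives the same result as its ufunc.

```python
import numpy as np

x = np.array([9, 8, 7])
y = np.array([4, 3, 2])

print(x // y)  # [2 2 3]
print(x % y)   # [1 2 1]
print(x ** 2)  # [81 64 49]

# Each operator gives the same result as its ufunc.
print(np.array_equal(x // y, np.floor_divide(x, y)))  # True
print(np.array_equal(x % y, np.mod(x, y)))            # True
print(np.array_equal(-x, np.negative(x)))             # True
```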


So far, we have used operators like the + symbol to perform vectorized operations over our data. NumPy provides a second way to make these calculations: arithmetic functions. Let's look at how we would divide `col1` by `col2` with the `numpy.divide` function, the equivalent of the / operator:

In [None]:
d_dols=np.divide(col1,col2)
d_dols

array([2. , 9. , 0.5, 0.5, 0. ])

In [None]:
d_sums=np.add(col1,col2)
d_sums

array([ 9, 10,  6, 21,  4])
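
The operator form and the function form are interchangeable. The sketch below checks this for three of the operators in the table. The values of `col1` and `col2` are an assumption: they were chosen to reproduce the outputs shown above, because the original definitions are not part of this section.

```python
import numpy as np

# Assumed inputs: chosen to reproduce the outputs above, not taken from the original data
col1 = np.array([6, 9, 2, 7, 0])
col2 = np.array([3, 1, 4, 14, 4])

# Each operator should give the same result as its equivalent ufunc
same_div = np.array_equal(col1 / col2, np.divide(col1, col2))
same_add = np.array_equal(col1 + col2, np.add(col1, col2))
same_mod = np.array_equal(col1 % col2, np.mod(col1, col2))

print(same_div, same_add, same_mod)
```

Which form to use is mostly a matter of readability; the operator form is usually shorter.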

## Calculating statistics

In [None]:
columns_test=np.arange(100)
print(columns_test)
columns_test.min()

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99]


0

<div>

<p>NumPy ndarrays have methods for many different calculations. A few key methods are:</p>
<ul>
<li><a target="_blank" href="https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.min.html#numpy.ndarray.min"><code>ndarray.min()</code> to calculate the minimum value</a></li>
<li><a target="_blank" href="https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.ndarray.max.html"><code>ndarray.max()</code> to calculate the maximum value</a></li>
<li><a target="_blank" href="https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.mean.html#numpy.ndarray.mean"><code>ndarray.mean()</code> to calculate the mean or average value</a></li>
<li><a target="_blank" href="https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.sum.html#numpy.ndarray.sum"><code>ndarray.sum()</code> to calculate the sum of the values</a></li>
</ul>
<p>You can see the full list of ndarray methods in the <a target="_blank" href="https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.ndarray.html#calculation">NumPy ndarray documentation</a>.</p>
<p>It's important to get comfortable with the documentation because it's not possible to remember the syntax for every variation of every data science library. However, if you remember what is possible and can read the documentation, you'll always be able to refamiliarize yourself with it.</p>

</div>

<div>
<p>In NumPy, sometimes there are operations that are implemented as both methods and functions, which can be confusing. Let's look at some examples:</p>
<table>
<thead>
<tr>
<th>Calculation</th>
<th>Function Representation</th>
<th>Method Representation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Calculate the minimum value of <code>trip_mph</code></td>
<td><code>np.min(trip_mph)</code></td>
<td><code>trip_mph.min()</code></td>
</tr>
<tr>
<td>Calculate the maximum value of <code>trip_mph</code></td>
<td><code>np.max(trip_mph)</code></td>
<td><code>trip_mph.max()</code></td>
</tr>
<tr>
<td>Calculate the <a target="_blank" href="https://en.wikipedia.org/wiki/Mean">mean average</a> value of <code>trip_mph</code></td>
<td><code>np.mean(trip_mph)</code></td>
<td><code>trip_mph.mean()</code></td>
</tr>
<tr>
<td>Calculate the <a target="_blank" href="https://en.wikipedia.org/wiki/Median">median average</a> value of <code>trip_mph</code></td>
<td><code>np.median(trip_mph)</code></td>
<td>There is no ndarray median method</td>
</tr>
</tbody>
</table>
<p>To remember the right terminology, anything that starts with <code>np</code> (e.g. <code>np.mean()</code>) is a function and anything expressed with an object (or variable) name first (e.g. <code>trip_mph.mean()</code>) is a method. When both exist, it's up to you to decide which to use, but it's much more common to use the method approach.</p></div>
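
The function and method forms from the table can be compared directly. This is a minimal sketch; the values in `trip_mph` are made up for illustration, since the data behind that name is not shown in this section.

```python
import numpy as np

# Hypothetical speeds; the real trip_mph data is not shown here
trip_mph = np.array([12.5, 30.0, 18.0, 22.5])

# Function form and method form return the same value
print(np.min(trip_mph) == trip_mph.min())
print(np.mean(trip_mph) == trip_mph.mean())

# The median is available only as a function
print(np.median(trip_mph))
print(hasattr(trip_mph, "median"))
```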


In [None]:
columns_test.max() #99
columns_test.mean() #49.5
columns_test.sum() #4950

4950

## 2D Statistics

<img alt="Dimensional Arrays" src="./images/array_method_axis_none.svg">

<img alt="Dimensional Arrays" src="./images/array_method_axis_0.svg">


To help you remember which is which, you can think of the first axis as rows, and the second axis as columns, just in the same way as when we're indexing a 2D NumPy array we use ndarray[row,column]. Then you think about which axis you want to apply the method along. The tricky part is to remember that when you apply the method along one axis, you get results in the other axis. Here is an illustration of that:

<p><img alt="The axis parameter" src="https://s3.amazonaws.com/dq-content/289/axis_param.svg"></p>

In [None]:
data=np.arange(15).reshape(5,3)
data

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [None]:
data2=np.arange(15).reshape(2,8) #ValueError: cannot reshape array of size 15 into shape (2,8)

ValueError: cannot reshape array of size 15 into shape (2,8)

In [None]:
data.max()

14

In [None]:
data.max(axis=0)

array([12, 13, 14])

In [None]:
data.max(axis=1)

array([ 2,  5,  8, 11, 14])

In [None]:
data.sum(axis=0)

array([30, 35, 40])
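
One way to keep the `axis` parameter straight is to look at the shape of the result: the axis you pass is the one that disappears. A small sketch using the same `data` array:

```python
import numpy as np

data = np.arange(15).reshape(5, 3)

# axis=0 collapses the rows: one result per column
col_sums = data.sum(axis=0)
# axis=1 collapses the columns: one result per row
row_sums = data.sum(axis=1)

print(col_sums.shape, col_sums)
print(row_sums.shape, row_sums)
```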

## Boolean Indexing

<div><p>In the last mission, we learned how to index — or select — data from ndarrays. In this mission, we're going to focus on arguably the most powerful method, the boolean array.  A <strong>boolean array</strong>, as the name suggests, is an array of boolean values. Boolean arrays are sometimes called <strong>boolean vectors</strong> or <strong>boolean masks</strong>.</p>
<p>You may recall that the boolean (or <code>bool</code>) type is a built-in Python type that can be one of two unique values:</p>
<ul>
<li><code>True</code></li>
<li><code>False</code></li>
</ul>
<p>You may also remember that we've used boolean values when working with Python <a target="_blank" href="https://docs.python.org/3.4/library/stdtypes.html#comparisons">comparison operators</a> like <code>==</code> (equal), <code>&gt;</code> (greater than), <code>&lt;</code> (less than), and <code>!=</code> (not equal). Below are a couple of examples of simple boolean operations:</p>
</div>


In [None]:
c=np.array([80.0,103.4,6.9,200.3])
c

array([ 80. , 103.4,   6.9, 200.3])

In [None]:
c_bool=c>100
c_bool

array([False,  True, False,  True])

In [None]:
result=c[c_bool]
result

array([103.4, 200.3])

### Boolean indexing with 2D ndarrays

When working with 2D ndarrays, you can use boolean indexing in combination with any of the indexing methods we learned in the previous mission. The only limitation is that the boolean array must have the same length as the dimension you're indexing.


<img alt="Dimensional Arrays" src="./images/bool_dims_updated.svg">

In [None]:
#2D 

data=np.arange(15).reshape(5,3)
data

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [None]:
data[data>5]

array([ 6,  7,  8,  9, 10, 11, 12, 13, 14])

In [None]:
data[:,[True,False,True]]

array([[ 0,  2],
       [ 3,  5],
       [ 6,  8],
       [ 9, 11],
       [12, 14]])

In [None]:
data[:,[True,False,True,False,True]] #IndexError: boolean index did not match indexed array along dimension 1; dimension is 3 but corresponding boolean dimension is 5

IndexError: boolean index did not match indexed array along dimension 1; dimension is 3 but corresponding boolean dimension is 5
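
In practice the boolean array is rarely typed by hand; it usually comes from a comparison. The sketch below builds a row mask from the first column, so its length (5) matches the row dimension it is used on. The name `row_mask` is only illustrative.

```python
import numpy as np

data = np.arange(15).reshape(5, 3)

# Compare the first column with 5: one boolean per row (length 5)
row_mask = data[:, 0] > 5
print(row_mask)

# Apply the mask to the row dimension and keep every column
selected = data[row_mask, :]
print(selected)
```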

### Shape formatting

In [None]:
data[[True,False,True,False,True],:]

array([[ 0,  1,  2],
       [ 6,  7,  8],
       [12, 13, 14]])

In [None]:
a=np.array([1,2,3,4,5,6])
print(a.shape)

b=np.expand_dims(a,axis=1)
print(b.shape)

c=np.expand_dims(a,axis=0)
print(c.shape)

(6,)
(6, 1)
(1, 6)
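
`np.expand_dims` is one of several ways to add a dimension of length 1. As a sketch, the following compares it with indexing by `np.newaxis` and with `reshape`, which should all produce the same column-shaped array:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

# Three ways to turn shape (6,) into shape (6, 1)
b1 = np.expand_dims(a, axis=1)
b2 = a[:, np.newaxis]
b3 = a.reshape(6, 1)

print(b1.shape, b2.shape, b3.shape)
print(np.array_equal(b1, b2) and np.array_equal(b1, b3))
```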


## Assigning values

So far, we've learned how to retrieve data from ndarrays. Next, we'll use the same indexing techniques we've already learned to modify values within an ndarray. The syntax we'll use (in pseudocode) is:

    ndarray[location_of_values] = new_value


In [None]:
a=np.array(["red","blue","black","blue","purple"])
a

array(['red', 'blue', 'black', 'blue', 'purple'], dtype='<U6')

In [None]:
a[0]="orange"

In [None]:
print(a)

['orange' 'blue' 'black' 'blue' 'purple']


In [None]:
a[3:]="pink"
a


array(['orange', 'blue', 'black', 'pink', 'pink'], dtype='<U6')

In [None]:
ones=np.ones((3,5))
ones[1,2]=99
print(ones)

[[ 1.  1.  1.  1.  1.]
 [ 1.  1. 99.  1.  1.]
 [ 1.  1.  1.  1.  1.]]


In [None]:
ones[0]=42
ones

array([[42., 42., 42., 42., 42.],
       [ 1.,  1., 99.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [None]:
ones[:,2]=0
ones

array([[42., 42.,  0., 42., 42.],
       [ 1.,  1.,  0.,  1.,  1.],
       [ 1.,  1.,  0.,  1.,  1.]])

In [None]:
a=np.array([1,2,3,4,5])
a[a>2]=99
print(a)

[ 1  2 99 99 99]


In [None]:
b=np.linspace(1,9,num=9,dtype=np.int32)
b=np.reshape(b,(3,3))
c=b.copy()
b

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [None]:
b[b>4]=99
b

array([[ 1,  2,  3],
       [ 4, 99, 99],
       [99, 99, 99]])

In [None]:
c[c[:,1]>2,1]=99

In [None]:
c

array([[ 1,  2,  3],
       [ 4, 99,  6],
       [ 7, 99,  9]])
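
The last assignment combines a boolean row mask with a column index in one expression. Splitting it into two steps may make it easier to follow. This sketch builds the same 3x3 array with `np.arange` instead of `np.linspace`; the name `row_mask` is illustrative.

```python
import numpy as np

c = np.arange(1, 10).reshape(3, 3)

# Step 1: compare the second column (2, 5, 8) with 2 -> one boolean per row
row_mask = c[:, 1] > 2
print(row_mask)

# Step 2: in the selected rows, overwrite only column 1
c[row_mask, 1] = 99
print(c)
```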

## Adding rows and columns to ndarrays

In [None]:
#Example 2 - 1D
ones=np.ones(shape=3)
print(ones)
print(ones.shape)

[1. 1. 1.]
(3,)


In [None]:
#Example 3 - 2D
ones=np.ones(shape=(3,1))
print(ones)
print("--------")
print(ones[0],"first example")
print("--------")
print(ones[0,0],"second example")
print(ones.shape)

[[1.]
 [1.]
 [1.]]
--------
[1.] first example
--------
1.0 second example
(3, 1)


In [None]:
#Example 5 - 3D
ones=np.ones(shape=(2,3,2))
print(ones)
print("--------")
print(ones[0],"first example")
print("--------")
print(ones[0,0],"second example")
print("--------")
print(ones[0,0,0],"third example")
print(ones.shape)

[[[1. 1.]
  [1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]
  [1. 1.]]]
--------
[[1. 1.]
 [1. 1.]
 [1. 1.]] first example
--------
[1. 1.] second example
--------
1.0 third example
(2, 3, 2)


In [None]:
zeros=np.zeros(3)
print(zeros)

[0. 0. 0.]


In [None]:
ones=np.ones((2,3))
print(ones)

[[1. 1. 1.]
 [1. 1. 1.]]


In [None]:
combined=np.concatenate([zeros,ones],axis=0) #ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 2 dimension(s)

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 2 dimension(s)

In [None]:
print(zeros.shape)
print(ones.shape)

(3,)
(2, 3)


In [None]:
zeros=np.expand_dims(zeros,axis=0)


In [None]:
combined_2=np.concatenate([ones,zeros],axis=0)
combined_2

array([[1., 1., 1.],
       [1., 1., 1.],
       [0., 0., 0.]])
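
Adding a column works the same way, except that the new data must be shaped as a column and the arrays are joined along `axis=1`. A minimal sketch, where `new_col` is an illustrative name:

```python
import numpy as np

ones = np.ones((2, 3))
new_col = np.zeros(2)  # shape (2,)

# Turn the 1D array into a column of shape (2, 1) so the row counts match
new_col = np.expand_dims(new_col, axis=1)

# axis=1 joins the arrays side by side
combined = np.concatenate([ones, new_col], axis=1)
print(combined.shape)
print(combined)
```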

## Copies and Views

In [None]:
#No copy at all
a=np.array([[0,1,2,3],[4,5,6,7],[8,9,10,11]])
b=a #No new object is created
b is a 

True

In [None]:
def f(x):
    print(id(x))

print(id(a))
f(a)

2744802431920
2744802431920


In [None]:
#View or shallow Copy
c=a.view()
c

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [None]:
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [None]:
c is a

False

In [None]:
print(id(a)) #2744802431920
print(id(c)) #2744802439504


2744802431920
2744802439504


In [None]:
c.base

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [None]:
c.base is a

True

In [None]:
c[0,1]=100099

In [None]:
c

array([[     0, 100099,      2,      3],
       [     4,      5,      6,      7],
       [     8,      9,     10,     11]])

In [None]:
c is a

False

In [None]:
c.base is a

True

In [None]:
print(a.shape)
c=c.reshape((2,6))


(3, 4)


In [None]:
a.shape

(3, 4)

In [None]:
print(a,"first print")
c[0,4]=1234
print(a, "second print")

[[     0 100099      2      3]
 [     4      5      6      7]
 [     8      9     10     11]] first print
[[     0 100099      2      3]
 [  1234      5      6      7]
 [     8      9     10     11]] second print


### Deep copy


In [None]:
d=a.copy()
d

array([[     0, 100099,      2,      3],
       [  1234,      5,      6,      7],
       [     8,      9,     10,     11]])

In [None]:
print(d is a)
print(d.base is a)

False
False


In [None]:
d[0,0]=9999
print(a)

[[     0 100099      2      3]
 [  1234      5      6      7]
 [     8      9     10     11]]


In [None]:
print(d)

[[  9999 100099      2      3]
 [  1234      5      6      7]
 [     8      9     10     11]]
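
Besides checking `.base`, `np.shares_memory` offers a direct way to ask whether two arrays use the same data. The following sketch uses a fresh array and the illustrative names `view` and `copy`:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

view = a.view()  # new array object, same data buffer
copy = a.copy()  # new array object, new data buffer

print(np.shares_memory(a, view))
print(np.shares_memory(a, copy))

# Writing through the view changes a; writing to the copy does not
view[0, 0] = -1
copy[0, 1] = -2
print(a[0, 0], a[0, 1])
```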


## Reading CSV files with NumPy

In [None]:
f500=np.genfromtxt(r"C:\Users\ICTA\Downloads\python-data-analysis-public-master\Del_02_Uvod_v_Numpy\data\f500_small.csv",delimiter=",",usecols=(1,2,3,4,5,6),skip_header=1)
f500[:5]


array([[ 1.00000e+00,  4.85873e+05,  8.00000e-01,  1.36430e+04,
         1.98825e+05, -7.20000e+00],
       [ 2.00000e+00,  3.15199e+05, -4.40000e+00,  9.57130e+03,
         4.89838e+05, -6.20000e+00],
       [ 3.00000e+00,  2.67518e+05, -9.10000e+00,  1.25790e+03,
         3.10726e+05, -6.50000e+01],
       [ 4.00000e+00,  2.62573e+05, -1.23000e+01,  1.86750e+03,
         5.85619e+05, -7.37000e+01],
       [ 5.00000e+00,  2.54694e+05,  7.70000e+00,  1.68993e+04,
         4.37575e+05, -1.23000e+01]])

In [None]:
f500.shape

(20, 6)

In [None]:
f500.dtype

dtype('float64')

In [None]:
#Exercise 1
#Select an arbitrary value from a row of the f500 array
f500[1,3]

9571.3

In [None]:
#Exercise 2
#Use vectorized addition to add the columns that contain revenue (1) and profit (3)
#Print the result
revenue=f500[:,1]
profit=f500[:,3]
result=revenue+profit
result

array([499516. , 324770.3, 268775.9, 264440.5, 271593.3, 246201.3,
       244608. , 247678. , 261326. , 212844. , 203603. , 186721. ,
       191857. , 182843. , 193273.5, 175262. , 178911.4, 175807. ,
       176762. ])

In [None]:
#Exercise 3
#Find the minimum revenue - revenue (1)
f500[:,1].min()

163786.0
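
To try `np.genfromtxt` without the CSV file, the same parameters can be applied to text held in memory. This is a sketch under stated assumptions: the CSV content and column names below are made up, and `io.StringIO` stands in for the file on disk.

```python
import io
import numpy as np

# Made-up CSV content with a header row and a text column
csv_text = """company,rank,revenues,profits
Alpha,1,500.0,20.5
Beta,2,300.0,-4.0
Gamma,3,250.0,7.5
"""

# skip_header drops the header line; usecols keeps only the numeric columns
table = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                      usecols=(1, 2, 3), skip_header=1)

print(table.shape)
print(table.dtype)
print(table[:, 1].min())
```

Leaving the text column out with `usecols` matters because an ndarray holds a single data type; a text column read as a float would become `nan`.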