# Introduction to NumPy

In [Tutorial 1](https://github.com/krittikaiitb/tutorials/tree/master/Tutorial_1), you saw lists, tuples and dictionaries (and we hinted at sets). However, these are not particularly useful when you want to do numerical calculations or matrix operations. You could write a for loop that iterates through each element, but Python is very slow at operations like that.

A better (more Pythonic) way is to use the package known as `numpy`. NumPy arrays are similar to lists and tuples in that they hold multiple values. However, all elements in a NumPy array must be of the same data type.

Disregarding that small setback, however, we now have a lot more functionality in numpy arrays. It is convenient to think of the behaviour of numpy arrays as similar to a row vector.

In [1]:
import numpy as np # This is how you can import packages in python. 

The above line imports the package numpy (which includes all the functions we will be using). The part `as np` renames it to `np` (just for this program, don't worry), which is slightly more convenient. 

Let us assign a numpy array to a variable x. 

In [2]:
x=np.array([1,2,3,4]) # np.array converts any array-like object to a numpy array
#x=[1,2,3,4]
y=2*x
print("{}\n{}".format(x,y))

[1 2 3 4]
[2 4 6 8]


Try the above code with a normal list. Now, hopefully it is clear why `numpy` is powerful. But that is just the beginning!
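As a small sketch of that comparison (the variable names here are only illustrative), multiplying a list by 2 repeats it, while multiplying a numpy array by 2 doubles every element:

```python
import numpy as np

values = [1, 2, 3, 4]

# For a plain list, * means repetition, not arithmetic
doubled_list = 2 * values

# For a numpy array, * acts on every element
doubled_array = 2 * np.array(values)

# The pure-Python equivalent of the element-wise product needs an explicit loop
doubled_loop = [2 * v for v in values]

print(doubled_list)   # the list repeated twice
print(doubled_array)  # each element doubled
print(doubled_loop)   # same numbers as doubled_array, but as a list
```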

In [3]:
x = np.arange(0,100, 1) # This is similar to the range function from earlier, except that it returns a numpy array
print(x[[4,5,6,7,8,9,10]])

[ 4  5  6  7  8  9 10]


Trying a similar trick with a list instead: (this should give an error!)

In [4]:
x = [1,2,3,4,5,6,7,8,9,10]
print(x[[0,1,2]])

TypeError: list indices must be integers or slices, not list

So `numpy` arrays can be indexed with a list of indices, in addition to integers and slices, whereas plain lists cannot.
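For comparison, here is a minimal sketch of how the same selection could be done with a plain list (using a list comprehension) and with an array. The names `numbers` and `wanted` are made up for this example.

```python
import numpy as np

numbers = list(range(1, 11))  # [1, 2, ..., 10]
wanted = [0, 1, 2]            # positions we want to pick out

# A plain list needs an explicit loop over the positions
from_list = [numbers[i] for i in wanted]

# A numpy array accepts the list of positions directly
from_array = np.array(numbers)[wanted]

print(from_list)
print(from_array)
```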

Similar to the `len` function, numpy also has several functions to describe an array. The same information is available as attributes of the array (note that there are no parentheses), which can be used as follows:

In [5]:
x=np.array([[1,2,3],[4,5,6],[7,8,9]])
print(x)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [6]:
print(x.ndim) # number of dimensions. This is a 2D array
print(x.shape) # Shape of array
print(x.size) # Total number of elements in array (product of elements in shape)

2
(3, 3)
9


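Multiplication, like addition, is element-wise. If matrix multiplication is required, we can either convert the array into a numpy matrix or use `np.dot()`. (The NumPy documentation no longer recommends `np.matrix` for new code, so `np.dot()` is the safer habit.)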

In [7]:
x*x

array([[ 1,  4,  9],
       [16, 25, 36],
       [49, 64, 81]])

In [8]:
np.matrix(x)*np.matrix(x)

matrix([[ 30,  36,  42],
        [ 66,  81,  96],
        [102, 126, 150]])

In [9]:
np.dot(x,x)

array([[ 30,  36,  42],
       [ 66,  81,  96],
       [102, 126, 150]])
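Python also has a dedicated matrix-multiplication operator, `@`, which works on ordinary numpy arrays. The sketch below (with an illustrative array `a`) suggests that for 2D arrays it gives the same result as `np.dot` and `np.matmul`:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

product_at = a @ a               # matrix product using the @ operator
product_dot = np.dot(a, a)       # same product using np.dot
product_matmul = np.matmul(a, a) # same product using np.matmul

print(product_at)
print(np.array_equal(product_at, product_dot))
print(np.array_equal(product_at, product_matmul))
```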

Comparisons on numpy arrays are also element-wise and return an array of booleans:

In [10]:
x>5

array([[False, False, False],
       [False, False,  True],
       [ True,  True,  True]])

We can also use conditional expressions as indices:

In [11]:
y=np.copy(x) # Simply typing y=x does not make a new array; it just adds a reference to the old one.
             # Changes in y would then be reflected in x
y[y>5]=0
y

array([[1, 2, 3],
       [4, 5, 0],
       [0, 0, 0]])
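The difference between a reference and a copy can be checked directly. This is a small sketch with made-up names (`original`, `alias`, `independent`):

```python
import numpy as np

original = np.array([1, 2, 3])

alias = original                 # just another name for the same array
independent = np.copy(original)  # a new array with its own data

alias[0] = 100       # this also changes original
independent[1] = -1  # this does not touch original

print(original)                                  # first element is now 100
print(alias is original)                         # True: same object
print(np.shares_memory(original, independent))   # False: separate data
```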

If we just want the indices where a certain condition is satisfied, we can use `np.where()`. This is one of the more **important** functions, as we can use it to filter out data according to conditions.

In [12]:
x=np.array([0,1,2,3,4,5,6,7,0.5,0.6])
indices=np.where(x<5)
x[indices]

array([0. , 1. , 2. , 3. , 4. , 0.5, 0.6])
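It can help to look at what `np.where` actually returns. The sketch below (using an illustrative array `data`) shows that the result is a tuple with one index array per dimension, that a boolean mask selects the same elements, and that the three-argument form of `np.where` picks values element by element:

```python
import numpy as np

data = np.array([0, 1, 2, 3, 4, 5, 6, 7, 0.5, 0.6])

indices = np.where(data < 5)  # a tuple holding one array of indices
mask = data < 5               # a boolean array of the same shape as data

print(type(indices), len(indices))
print(indices[0])
print(np.array_equal(data[indices], data[mask]))

# Three-argument form: keep values below 5, replace everything else by 5
capped = np.where(data < 5, data, 5)
print(capped)
```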

We can also delete elements using `numpy.delete`. 

Note the array slicing in the cell below. `x` is 2-dimensional and requires 2 indices. `x[:, 0]` returns the first column, `np.array([1, 4, 7])`. Play around with different combinations to become comfortable with using indices, as well as `np.where`. 

Also note the `axis` argument of the function. For a 2D array, `axis=0` refers to the rows and `axis=1` refers to the columns. Here `np.where` returns the tuple `(np.array([2]),)`, since only the element at index 2 of the first column is greater than 5. Deleting index 2 along `axis=1` removes the third column, which results in `y`. If you wanted to remove the third row instead, use `axis=0`.

In [13]:
x=np.array([[1,2,3],[4,5,6],[7,8,9]])
index=np.where(x[:,0]>5)
y=np.delete(x,index,axis=1)
print(y)

[[1 2]
 [4 5]
 [7 8]]
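To see the effect of `axis` side by side, here is a short sketch that uses the same index with both values. The names `grid` and `rows_to_drop` are illustrative, and taking element `[0]` of the `np.where` result is simply one way to get a plain index array.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Index array of the rows whose first element is greater than 5
rows_to_drop = np.where(grid[:, 0] > 5)[0]

without_row = np.delete(grid, rows_to_drop, axis=0)  # removes the third row
without_col = np.delete(grid, rows_to_drop, axis=1)  # removes the third column

print(without_row)
print(without_col)
print(grid)  # np.delete returns a new array; grid itself is unchanged
```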


**Make sure you understand the part about indexing, slicing, and using `np.where` and `np.delete`, and are comfortable using it. Try out different arrays that you define yourself, of different dimensions and shapes, and experiment!!**

Other functions in numpy, like `numpy.arange` and `numpy.linspace`, often come in handy. We have already seen `arange`.

In [14]:
np.arange(0,1,0.1) # start(inclusive):stop(exclusive):step

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

In [15]:
np.linspace(0,1,5) #start:stop(inclusive):number of elements

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

That is not all. Several mathematical functions are also available:

In [16]:
x = np.arange(-np.pi, np.pi, 0.1)
y = np.sin(x)
x_abs = np.abs(x)

In [17]:
y2 = np.exp(x)

If you thought `numpy` has revealed all its cards, it hasn't yet! <br>
You can also find the mean, standard deviation, maximum, minimum of an array, sort an array, do matrix computations and so much more. 
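A few of these are sketched below on a small made-up array (`sample` is not part of the tutorial data):

```python
import numpy as np

sample = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

sample_mean = np.mean(sample)   # arithmetic mean
sample_std = np.std(sample)     # standard deviation (population form by default)
sample_max = np.max(sample)     # largest element
sample_min = np.min(sample)     # smallest element
sample_sorted = np.sort(sample) # sorted copy; sample itself is unchanged

print(sample_mean, sample_std, sample_max, sample_min)
print(sample_sorted)
```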

If you feel right now that NumPy is vast, remember we have barely scratched the surface. This module and the next are where your Googling skills will be put to the test.

If you want to find the official docstring for any function, an easy way to get it from Jupyter is to use `?`. For example:

In [18]:
# Note that this is a very useful function, since we can use it to filter out data. 
np.where?

As a final note, we will end the tutorial part of this notebook by introducing `np.loadtxt`, a function that can be used to load a txt file into a numpy array. (If you remember how difficult it was to import a text file using `open` and `.split()`, then you will appreciate this one!)

In [19]:
beehive_data = np.loadtxt('Beehive_data.csv', delimiter=',')

Yes, that's it. 
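If you want to try `np.loadtxt` without the data file, the sketch below feeds it an in-memory text buffer instead. The numbers are made up for illustration and are not rows from `Beehive_data.csv`.

```python
import io
import numpy as np

# Made-up comma-separated values, NOT taken from Beehive_data.csv
fake_csv = io.StringIO("6.5,1.2,95\n7.1,0.9,40\n8.0,0.5,99\n")

# np.loadtxt accepts a file name or an open file-like object
table = np.loadtxt(fake_csv, delimiter=',')

print(table.shape)  # one row per line, one column per comma-separated field
print(table[:, 2])  # the third column
```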

### Your assignment ...
... should you choose to accept it, will be the following:
1. Re-do Tutorial 1 assignment with numpy. (This should be fairly easy, but use only numpy functions to reproduce your results)
2. Use the data from `Beehive_data.csv` to find the distance to the Beehive Cluster. Read on ahead to find out more about the concepts in astronomy you will need to solve this.

Data downloaded from [VizieR](https://cdsarc.unistra.fr/viz-bin/cat/V/19), from data used in [Piskunov 1980](https://ui.adsabs.harvard.edu/?#abs/1980BICDS..19...67P)

# Magnitudes in Astronomy

Magnitudes are a way to describe how bright an object (in our case, a star) is. The system is similar to the decibel system for sound in that magnitudes are logarithmic. They can be calculated according to the formula
$$\rm m = -2.5 \log\left(\frac{F}{F_0}\right)$$
where F is the flux from the star (measured in W m$^{-2}$), F$_0$ is a reference flux, and $\log$ denotes the base-10 logarithm. Because of the minus sign, brighter objects have smaller magnitudes.

To read up more about magnitudes, hit up the [Wikipedia article on it](https://en.wikipedia.org/wiki/Magnitude_(astronomy)#History)

Now, we can calculate the flux of a star at some distance d away as 
$$ F = \frac{L}{4\pi d^2} $$
where L is the Luminosity of the star (measured in W). 
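The two formulas can be tried out numerically. The sketch below uses hypothetical helper functions (`magnitude`, `flux_at_distance`) and arbitrary units, so only the differences between magnitudes are meaningful:

```python
import numpy as np

def magnitude(flux, reference_flux):
    """Magnitude from flux, using m = -2.5 log10(F / F0)."""
    return -2.5 * np.log10(flux / reference_flux)

def flux_at_distance(luminosity, distance):
    """Flux at a given distance, using F = L / (4 pi d^2)."""
    return luminosity / (4 * np.pi * distance**2)

reference_flux = 1.0  # arbitrary choice for this illustration

# A source 100 times brighter has a magnitude that is smaller by 5
m_faint = magnitude(1.0, reference_flux)
m_bright = magnitude(100.0, reference_flux)

# Moving the same star twice as far away reduces its flux by a factor of 4
flux_near = flux_at_distance(1.0, 1.0)
flux_far = flux_at_distance(1.0, 2.0)
delta_m = magnitude(flux_far, reference_flux) - magnitude(flux_near, reference_flux)

print(m_bright - m_faint)  # close to -5
print(delta_m)             # close to 5*log10(2), about 1.5
```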

There are a lot of details we have skimmed over (for example, the use of filters) which we will visit again in more detail in `Tutorial 8: Image Reduction`.

As a final note on the data you have been given in `Beehive_data.csv`, the columns are, respectively: the apparent magnitude of the stars as seen from Earth; the logarithm of the luminosity of the stars in units of the luminosity of the Sun (i.e. $\log(L/L_\odot)$, where $L_\odot$ is the luminosity of the Sun); and the probability that the star belongs to the Beehive Cluster.

You must find the distance to the cluster (you are given that the absolute magnitude of the Sun is +4.83). You can do this in two ways:
1. Exclude all the stars with low probability of belonging to the cluster, and then calculate distance for each star, and find the mean.
2. Find the distance for all stars (including the low probability ones) and find the weighted average of the distances, where the weights are the probability.

Good Luck!

### Solutions to Day 2, NumPy assignments

In [20]:
beehive_data = np.loadtxt('Beehive_data.csv', delimiter=',')

#### Part 1: Finding distance considering only high probability stars

To do this, as a first step one must filter out the low probability stars. We saw in the tutorial that `np.where()` can be used to return certain indices based on some comparison.

There is no fixed rule to decide what threshold should be chosen to select high probability stars (the dataset is too small to come up with a precise value). For this particular example let us choose a probability above 90.0 as the criterion for selecting stars (the threshold suggests that the third column is expressed in percent).

Note that your answer may differ depending on the threshold you choose; the method is what counts.

In [21]:
indices = np.where(beehive_data[:, 2]>90) #The third column of the 2D array contains the probabilities
high_prob = beehive_data[indices] #This now selects only the high probability stars

Let us now find distance for these stars. 

We have been given that the absolute magnitude of the Sun is 4.83, as well as the formula:
$$m = -2.5\log\left(\frac{F}{F_0}\right)$$
where the variables and the flux $F$ are as defined in the last part of the tutorial.

If you search for information about absolute magnitude, you will find that it is the apparent magnitude the star would have at a distance of 10 pc, and that it is related to the apparent magnitude by (and try to see if you can derive this):
$$m-M=5(\log(d)-1)$$
where $M$ is the absolute magnitude and $d$ the distance of the star from us in parsecs (pc).
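One possible derivation, sketched under the assumptions that both magnitudes use the same reference flux $F_0$ and that $d$ is measured in parsecs: since $M$ is the magnitude at $d = 10$ pc,
$$M = -2.5\log\left(\frac{L/4\pi (10)^2}{F_0}\right), \qquad m = -2.5\log\left(\frac{L/4\pi d^2}{F_0}\right)$$
The luminosity and $F_0$ cancel when the two are subtracted:
$$m - M = -2.5\log\left(\frac{10^2}{d^2}\right) = 5\log\left(\frac{d}{10}\right) = 5(\log(d) - 1)$$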

Now let $m_\odot$ denote the apparent magnitude of the Sun and $d_\odot$ the distance of the Sun from us. Taking $m$ to be the apparent magnitude of the star considered and $d$ its distance, we can write:
$$m = -2.5\log\left(\frac{L/4\pi d^2}{F_0}\right)$$
$$m_\odot = -2.5\log\left(\frac{L_\odot/4\pi d_\odot^2}{F_0}\right)$$
Subtracting the second from the first, we get
$$m-m_\odot = -2.5\log\left(\frac{L}{L_\odot}\right) + 5\log\left(\frac{d}{d_\odot}\right)$$ 

Now applying the absolute magnitude relation to the Sun,
$$m_\odot - 4.83 = 5(\log(d_\odot) - 1)$$
and substituting in the above equation we get after some rearrangement:
$$5 \log(d) = m + 2.5\log\left(\frac{L}{L_\odot}\right) + 0.17$$
$$\implies \log(d) = \left[m + 2.5\log\left(\frac{L}{L_\odot}\right) + 0.17\right]/5$$

Let us code this

In [22]:
logd = (high_prob[:,0] + 2.5*high_prob[:,1] + 0.17)/5 #First column contains apparent magnitudes
                                                            #and the second the required log values
d = 10**logd #We now have the distances of high probability stars in a 1D array
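One way to sanity-check the formula is to feed it a case where the answer is known. The sketch below wraps the expression in a hypothetical helper, `distance_pc`, and applies it to a Sun-like star ($\log(L/L_\odot) = 0$) whose apparent magnitude equals the Sun's absolute magnitude, which by definition should sit at 10 pc:

```python
import numpy as np

def distance_pc(apparent_mag, log_lum_solar):
    """Distance in parsecs from log(d) = [m + 2.5 log(L/Lsun) + 0.17] / 5."""
    log_d = (apparent_mag + 2.5 * log_lum_solar + 0.17) / 5
    return 10 ** log_d

# A Sun-like star seen at m = 4.83 should be at 10 pc
d_sunlike = distance_pc(4.83, 0.0)

# A star 100 times more luminous with the same apparent magnitude is 10 times farther
d_bright = distance_pc(4.83, 2.0)

# The function also works on whole arrays at once
d_both = distance_pc(np.array([4.83, 4.83]), np.array([0.0, 2.0]))

print(d_sunlike, d_bright)
print(d_both)
```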

All we need to do now is find the sum of the elements of the distance array and divide by its number of elements:

In [23]:
sum_of_dist = 0
for i in d: #Here we take the sum of all elements in the array
    sum_of_dist = sum_of_dist + i
dist_of_cluster= sum_of_dist/d.size #Answer is the average
print(dist_of_cluster)

159.9814397624894


And we have the answer to part 1!
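The explicit loop makes the arithmetic visible, but the same average can be obtained with `np.mean`. The sketch below compares the two on a few made-up distances (not values from the dataset):

```python
import numpy as np

# Hypothetical distances in parsecs, for illustration only
example_d = np.array([150.0, 160.0, 170.0])

# Loop version, as in the cell above
loop_sum = 0.0
for value in example_d:
    loop_sum = loop_sum + value
loop_mean = loop_sum / example_d.size

# Vectorised version
vector_mean = np.mean(example_d)

print(loop_mean, vector_mean)
```

On real data the two results may differ in the last few digits, since the additions can be carried out in a different order.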

#### Part 2: Finding distance by taking average weighted by probabilities

For this part, we need not select any particular elements; we can work with the entire array `beehive_data`. We do, however, need to take the weighted average, where the weights are the probabilities. The weighted average of a list of quantities is given by:
$$\frac{\sum \text{quantity}\times \text{weight of this quantity}}{\sum \text{weight of quantity}} $$

The first step is again to find the distances as in the previous part, but this time with the whole of `beehive_data`.

In [24]:
logd_full = (beehive_data[:,0] + 2.5*beehive_data[:,1] + 0.17)/5 
d_full = 10**logd_full

Now we must find the weighted average as per the above formula. The weights are the probabilities of each star and the quantities are the elements of the distance array.

In [25]:
weighted_dist = d_full * beehive_data[:,2] #Using the elementwise multiplication property
weighted_sum = 0
sum_of_weights = 0
for i in range(0, weighted_dist.size):
    weighted_sum = weighted_sum + weighted_dist[i] #Numerator in the weighted average formula
    sum_of_weights = sum_of_weights + beehive_data[i,2] #Denominator in the weighted average formula
weighted_dist_of_cluster = weighted_sum/sum_of_weights #Answer is Numerator/Denominator
print(weighted_dist_of_cluster)

159.5464083490186


Of course, `numpy` provides a function to calculate this weighted average (`np.average`, with its `weights` argument), which would reduce the above block to one line!
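A minimal sketch of that one-liner, using made-up distances and probabilities rather than the real dataset:

```python
import numpy as np

# Hypothetical distances (pc) and membership probabilities, for illustration only
example_d = np.array([150.0, 160.0, 200.0])
example_p = np.array([90.0, 100.0, 10.0])

# Weighted average written out from the formula
manual = np.sum(example_d * example_p) / np.sum(example_p)

# The same quantity using np.average with the weights argument
builtin = np.average(example_d, weights=example_p)

print(manual, builtin)
```

With the tutorial's arrays, the equivalent call would presumably be `np.average(d_full, weights=beehive_data[:,2])`.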