<hr style="height: 1px;">
<i>This notebook was authored by the 8.S50x Course Team, Copyright 2022 MIT All Rights Reserved.</i>
<hr style="height: 1px;">
<br>

<h1>Lesson 25: Anomaly Detection</h1>

<a name='section_25_0'></a>
<hr style="height: 1px;">


## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L25.0 Overview</h2>


<h3>Navigation</h3>

<table style="width:100%">
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_25_1">25.1 The Gaia Experiment</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_25_1">L25.1 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_25_2">25.2 Building Projections with Gaia</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_25_2">L25.2 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_25_3">25.3 Anomaly Detection with Gaia</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_25_3">L25.3 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_25_4">25.4 Anomaly Detection with Lots of Gaia Data</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_25_4">L25.4 Exercises</a></td>
    </tr>
</table>

<h3>Slides</h3>

You can access the slides related to this lecture at the following link: <a href="https://github.com/mitx-8s50/slides/raw/main/module3_slides/L25_slides.pdf" target="_blank">L25 Slides</a>

<h3>Learning Objectives</h3>

This lecture reviews AI-based anomaly detection using an astrophysics dataset. For this, we are going to use data from the Gaia satellite. The Gaia dataset is a rich dataset that has previously been used for anomaly detection, as detailed in the papers below:

>source: https://arxiv.org/abs/2303.01529<br>
>attribution: Shih et al., arXiv:2303.01529 [astro-ph.GA] (2023)

>source: https://arxiv.org/abs/2104.12789 <br>
>attribution: Shih et al., arXiv:2104.12789 [astro-ph.GA] (2021)  



Now let's load a toolkit to process Gaia data.

<h3>Data</h3>

Download the directory where we will save data.
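
If the data directory is not already present, the cells in this notebook only need a place to write to. A minimal sketch, assuming the relative path `data/L25` that is used later when the query results are saved:

```python
import os

# Assumption: 'data/L25' matches the path used when saving the Gaia query results.
data_dir = os.path.join('data', 'L25')

# Create the directory (and any missing parents); no error if it already exists.
os.makedirs(data_dir, exist_ok=True)
print("Data directory ready:", os.path.isdir(data_dir))
```

Using `exist_ok=True` makes the cell safe to re-run.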

<h3>Importing Libraries</h3>

Before beginning, run the cell below to import the relevant libraries for this notebook. 

In [None]:
#>>>RUN: L25.0-runcell02

# astropy imports
import astropy.coordinates as coord
from astropy.table import QTable
import astropy.units as u
from astroquery.gaia import Gaia

# Third-party imports
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# gala imports
import gala.coordinates as gc
import gala.dynamics as gd
import gala.potential as gp
from gala.units import galactic

#ML imports
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torch.autograd import Variable


<h3>Setting Default Figure Parameters</h3>

The following code cell sets default values for figure parameters.


In [None]:
#>>>RUN: L25.0-runcell03

#set plot resolution
%config InlineBackend.figure_format = 'retina'

#set default figure parameters
plt.rcParams['figure.figsize'] = (9,6)

medium_size = 12
large_size = 15

plt.rc('font', size=medium_size)          # default text sizes
plt.rc('xtick', labelsize=medium_size)    # xtick labels
plt.rc('ytick', labelsize=medium_size)    # ytick labels
plt.rc('legend', fontsize=medium_size)    # legend
plt.rc('axes', titlesize=large_size)      # axes title
plt.rc('axes', labelsize=large_size)      # x and y labels
plt.rc('figure', titlesize=large_size)    # figure title

<a name='section_25_1'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C"> L25.1 The Gaia Experiment </h2>  

| [Top](#section_25_0) | [Previous Section](#section_25_0) | [Exercises](#exercises_25_1) | [Next Section](#section_25_2) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.3x+3T2023/block-v1:MITxT+8.S50.3x+3T2023+type@sequential+block@seq_LS25/block-v1:MITxT+8.S50.3x+3T2023+type@vertical+block@vert_LS25_vid1" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>

The Gaia experiment is a satellite experiment launched by the European Space Agency (more information can be found <a href="https://www.esa.int/Science_Exploration/Space_Science/Gaia" target="_blank">here</a> and <a href="https://www.esa.int/Science_Exploration/Space_Science/Gaia_overview" target="_blank">here</a>). This experiment has been running for the last 10 years and has focused on collecting data on nearby stars. There have been 3 data releases since the start of the experiment, with the first coming in 2016. This experiment has had a large impact on our knowledge of the local galaxy by enabling a detailed catalog of nearby stars in the Milky Way.


<h3>Loading the Data</h3>

Before we do anything, we need to load Gaia data. To do this, we make a query to the Gaia database selecting the first 4096 stars that have a parallax significance > 10, a parallax > 10, and for which the radial velocity has been calculated. Fortunately, this is relatively easy to do since it's an SQL-style query that we can perform in Python. Requiring a high parallax significance means that the parallax is very well determined by the measurements. As explained below, requiring a larger parallax selects stars that are relatively close to the Sun.

We are going to extract the following information from the Gaia database:

  * Coordinates (ra,dec)
  * Parallax (which yields distance)
  * Proper motion (pmra, pmdec)
  * Radial velocity
  * Magnitude of the light through 3 filters (general, blue and red)
  
With this information, we can build a good understanding of stellar data.  One other small note is that we will use Gaia Data Release 3 (GaiaDR3), giving us the latest greatest data to study. Later on, we can download more data, but we'll start small for now.

>source:\
>This work has made use of data from the European Space Agency (ESA) mission *Gaia* https://www.cosmos.esa.int/gaia, processed by the *Gaia* Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the *Gaia* Multilateral Agreement.
>
>related papers:\
>Gaia Collaboration, T. Prusti, J.H.J. de Bruijne, et al. (2016b) The Gaia mission. A&A 595, pp. A1.\
>Gaia Collaboration, A. Vallenari, A. G. A. Brown, et al. (2023j) Gaia Data Release 3. Summary of the content and survey properties. A&A 674, pp. A1.\
>C. Babusiaux, C. Fabricius, S. Khanna, et al. (2023) Gaia Data Release 3. Catalogue validation. A&A 674, pp. A32.


In [None]:
#>>>RUN: L25.1-runcell01

query_text = '''SELECT TOP 4096 ra, dec, parallax, pmra, pmdec, radial_velocity,
phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag
FROM gaiadr3.gaia_source
WHERE parallax_over_error > 10 AND
    parallax > 10 AND
    radial_velocity IS NOT null
ORDER BY random_index
'''

job = Gaia.launch_job(query_text)
gaia_data = job.get_results()
gaia_data.write('data/L25/gaia_data1.fits',overwrite=True)

#Note, if you return to this section after closing the kernel,
#you can load the data again using the following code
gaia_data = QTable.read('data/L25/gaia_data1.fits')


print("Total Events:",len(gaia_data))
print()
print(gaia_data[:4])

#Note, the columns in our dataset are defined by our query above.
#To print one column, use the command gaia_data['column_name'], e.g.: gaia_data['parallax']

Let's consider the units here. Firstly, astronomical distances have units of parsecs (1 parsec = 3.26 light years). By definition, a parsec is the distance to an object with a parallax of one arcsecond (for reference, the Moon subtends an angle of roughly 0.5 degrees in the sky, or about 1800 arcseconds).

In the dataset that we are using, the parallax is reported in milliarcseconds (mas), for which we use the variable $\theta_{\rm mas}$. To convert from $\theta_{\rm mas}$ to a distance in parsecs, $d_{\rm parsec}$, take the reciprocal of the parallax expressed in arcseconds:

$$
d_{\rm parsec} = \frac{1}{\theta_{\rm mas}\times10^{-3}}
$$
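
As a quick worked example of this formula, the sketch below converts a few parallaxes by hand. The function name `parallax_mas_to_pc` is hypothetical and is only used for illustration.

```python
def parallax_mas_to_pc(parallax_mas):
    """Distance in parsecs from a parallax given in milliarcseconds."""
    # Convert mas to arcseconds, then take the reciprocal.
    return 1.0 / (parallax_mas * 1e-3)

# The query cut "parallax > 10" (in mas) keeps stars closer than about 100 pc.
for theta_mas in (10.0, 100.0, 1000.0):
    print(theta_mas, "mas ->", parallax_mas_to_pc(theta_mas), "pc")
```

This simple inversion ignores the measurement uncertainty on the parallax, which is one reason the query also requires a high parallax significance.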


This dataset also contains 2 variables for tangential motion, `pmra` and `pmdec`, which give the proper motion in the angular coordinates ra and dec, in units of milliarcseconds per year. We can use the distance (found from the parallax) to convert them into a tangential velocity in parsecs per year. The radial velocity, by contrast, is measured directly along the line of sight.
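
The conversion from proper motion to tangential velocity can be sketched by hand. The example below expresses the result in km/s rather than parsecs per year, which is a common convention; the names `K` and `tangential_velocity_kms` are hypothetical and chosen only for this illustration.

```python
AU_KM = 149_597_870.7        # kilometres in one astronomical unit
YEAR_S = 365.25 * 86_400.0   # seconds in one Julian year

# By the definition of the parsec, 1 arcsec/yr at a distance of 1 pc is 1 AU/yr.
K = AU_KM / YEAR_S           # 1 AU/yr expressed in km/s (about 4.74)

def tangential_velocity_kms(pm_mas_per_yr, parallax_mas):
    """Tangential velocity in km/s from proper motion and parallax (both in mas units)."""
    # pm [arcsec/yr] * distance [pc] = pm_mas / parallax_mas, in AU/yr.
    return K * pm_mas_per_yr / parallax_mas

# A star at 100 pc (parallax 10 mas) moving 10 mas/yr across the sky:
print(tangential_velocity_kms(10.0, 10.0), "km/s")
```

The same function applies to each proper-motion component separately, or to their quadrature sum for the total tangential speed.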

The library `astropy` has lots of nice tools to do this conversion for us. One very simple conversion is to calculate the distance.

In [None]:
#>>>RUN: L25.1-runcell02

dist = coord.Distance(parallax=u.Quantity(gaia_data['parallax']))
print("Min:",dist.min(), "Max:",dist.max())
print(dist[0],1e3/gaia_data['parallax'][0])

plt.hist(dist)
plt.xlabel("distance [pc]")
plt.ylabel("N")
plt.show()

You may be surprised that the Sun appears to be in a "hole" with few stars nearby and lots of stars farther away. However, each bin in this histogram represents a spherical shell with a radius $r$ and width $\Delta r$, whose volume is approximately $4\pi r^2 \Delta r$. So, a uniform density would manifest itself as a quadratic rise in the number of stars as the radius increases.
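
The shell-volume argument can be checked with a toy simulation. This is a sketch with made-up data, assuming points scattered uniformly inside a sphere of radius 100 pc; it does not use the Gaia sample.

```python
import numpy as np

rng = np.random.default_rng(0)
R = 100.0  # sphere radius in pc (toy value)

# Draw points uniformly in a cube, then keep those inside the sphere.
xyz = rng.uniform(-R, R, size=(200_000, 3))
r = np.linalg.norm(xyz, axis=1)
r = r[r < R]

# Histogram the radii and compare with the exact shell-volume fractions.
edges = np.linspace(0.0, R, 6)
counts, _ = np.histogram(r, bins=edges)
fractions = counts / counts.sum()
expected = (edges[1:]**3 - edges[:-1]**3) / R**3

print("simulated:", np.round(fractions, 3))
print("expected: ", np.round(expected, 3))
```

Even though the density is constant, the counts rise steeply with radius, which is the same shape seen in the distance histogram.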

Given the position of the Solar System in the Milky Way, the angular coordinates ra and dec of the star, and its distance, we can transform the star's location into Galactocentric coordinates. The function used for this conversion also transforms the velocities.


In [None]:
#>>>RUN: L25.1-runcell03

c = coord.SkyCoord(ra=gaia_data['ra'], dec=gaia_data['dec'],distance=dist,
                   pm_ra_cosdec=gaia_data['pmra'], pm_dec=gaia_data['pmdec'],
                   radial_velocity=gaia_data['radial_velocity'])

print(c[:4],'\n')
print()

#Now we can translate it to galactic coordinates
print(c.galactic[:4])
coord.Galactocentric()

Notice that the distances to the stars don't change when changing coordinate systems. The function used above only changes the angular coordinates into the galactic frame. In order to move the center of our coordinate system, we need to do one more transform.

To define the Galactocentric coordinate system, the position of the Sun is assumed to be on the $x$ axis. That is, the $x$ axis points from the position of the Sun projected along the Galactic midplane to the Galactic center. The $y$ axis points roughly towards Galactic longitude 90$^{\circ}$ and the $z$ axis points roughly towards the North Galactic Pole.
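
The geometry of this definition can be sketched without any astronomy library. The example below is a simplified stand-in, assuming the Sun lies exactly in the midplane (zero height) at 8.1 kpc from the Galactic center; the function name `galactic_to_galactocentric` is hypothetical, and this is not the transformation code that `astropy` uses.

```python
import numpy as np

def galactic_to_galactocentric(l_deg, b_deg, dist_pc, r_sun_pc=8100.0):
    """Toy conversion from Galactic (l, b, distance) to Galactocentric (x, y, z) in pc."""
    l = np.radians(l_deg)
    b = np.radians(b_deg)
    # Heliocentric Cartesian axes: x toward the Galactic center, z toward the North Galactic Pole.
    x = dist_pc * np.cos(b) * np.cos(l) - r_sun_pc  # shift the origin to the Galactic center
    y = dist_pc * np.cos(b) * np.sin(l)
    z = dist_pc * np.sin(b)
    return x, y, z

print(galactic_to_galactocentric(0.0, 0.0, 0.0))    # the Sun itself
print(galactic_to_galactocentric(90.0, 0.0, 50.0))  # 50 pc toward l = 90 degrees
print(galactic_to_galactocentric(0.0, 90.0, 50.0))  # 50 pc toward the North Galactic Pole
```

In this picture the Sun sits at a large negative $x$ with $y=z=0$, and a star within 100 pc can only move those values by at most 100 pc.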

In [None]:
#>>>RUN: L25.1-runcell04

galcen = c.transform_to(coord.Galactocentric(z_sun=0*u.pc, galcen_distance=8.1*u.kpc))
print(galcen[:4],'\n')
plt.hist(galcen.z.value, bins=np.linspace(-110, 110, 32),alpha=0.5,label='z')
#plt.hist(galcen.x.value, bins=np.linspace(-110, 110, 32),alpha=0.5)
plt.hist(galcen.y.value, bins=np.linspace(-110, 110, 32),alpha=0.5,label='y')
plt.xlabel('Coordinate position (about Galactic Plane) [{0:latex_inline}]'.format(galcen.z.unit));
plt.legend()
plt.show()

plt.hist(galcen.x.value, alpha=0.5, label='x')  # label so that plt.legend() has an entry to show
plt.xlabel('X-position (nearby stars) [{0:latex_inline}]'.format(galcen.x.unit))
plt.legend()
plt.show()


Because the $x$ axis is defined to pass through the position of the Sun, the Sun sits at $(y, z)=(0,0)$, so nearby stars are centered around those coordinates. However, the Sun is very far from the Galactic center, so the $x$ coordinates of nearby stars form a peak that is similar in shape and width to those of the 2 other coordinates, but with a center that is offset quite a bit from $0$.

We can also look at the transformed velocities of our stars (below).


In [None]:
#>>>RUN: L25.1-runcell05

fig, ax = plt.subplots(1, 1, figsize=(6, 6))
ax.plot(galcen.v_x.value, galcen.v_y.value,marker='.', linestyle='none', alpha=0.5)
ax.set_xlim(-125, 125)
ax.set_ylim(200-125, 200+125)
ax.set_xlabel('vx [{0:latex_inline}]'.format(u.km/u.s))
ax.set_ylabel('vy [{0:latex_inline}]'.format(u.km/u.s))
plt.show()

fig, ax = plt.subplots(1, 1, figsize=(6, 6))
plt.plot(galcen.v_x.value, galcen.v_z.value,marker='.', linestyle='none', alpha=0.5)
plt.xlabel('vx [{0:latex_inline}]'.format(u.km/u.s))
plt.ylabel('vz [{0:latex_inline}]'.format(u.km/u.s))
plt.xlim(-125, 125)
plt.ylim(-125, 125)
plt.show()


Notice that the $x$ and $z$ velocities are centered around zero, but the $y$ velocities are not, indicating an average motion in a specific direction. This is the result of the rotation of the Galaxy. The Sun, and all of the stars around it, are moving together in a tangential direction around the Galactic center which, given the definition of the coordinate system, means the $y$ direction.
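
To get a feel for the scale of this rotation, the sketch below estimates how long one circular orbit around the Galactic center would take. It assumes a radius of 8.1 kpc (the Sun's distance used in the transformation above) and a round-number rotation speed of 220 km/s, which is an assumption made for illustration rather than a value measured in this notebook.

```python
import math

KPC_KM = 3.0857e16                  # kilometres in one kiloparsec
MYR_S = 1e6 * 365.25 * 86_400.0     # seconds in one million years

def circular_period_myr(radius_kpc, speed_kms):
    """Time for one circular orbit, in Myr, given radius (kpc) and speed (km/s)."""
    circumference_km = 2.0 * math.pi * radius_kpc * KPC_KM
    return circumference_km / speed_kms / MYR_S

# Assumed values: 8.1 kpc and 220 km/s.
period = circular_period_myr(8.1, 220.0)
print(f"Approximate orbital period: {period:.0f} Myr")
```

The result is a couple of hundred million years, so a time span of several hundred Myr covers a few trips around the Galaxy.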

Now that we have transformed things, let's do some basic analysis of the different stars to try to understand their properties. The Gaia satellite measures the brightness of each star in 3 different bands called general, blue, and red. General denotes the magnitude measured with an inclusive filter, while blue and red denote the magnitudes measured with filters of those two colors. This information is enough for us to start categorizing the stars into different populations.

Around 100 years ago, astronomers started classifying the different populations of stars and stumbled upon a classification using a diagram known as the <a href="https://en.wikipedia.org/wiki/Hertzsprung%E2%80%93Russell_diagram" target="_blank">Hertzsprung-Russell diagram</a>. The strategy is to look at the intensity of the star as a function of color (or more accurately as a function of the frequency of the light). It was found that most stars are concentrated in a few specific regions on a plot of intensity versus color. Stars that are bluer tend to be larger and younger. Stars that are redder tend to be older and smaller. Most stars are found along a specific trajectory, called the "Main Sequence".

There are two regions of anomalous stars that are not along this main trajectory. Those that are not very bright but have a blue tint are known as white dwarfs. Likewise, bright red stars are known as red giants. Stars along the main sequence will evolve into red giants, and then into white dwarfs, which is the final stage in stellar evolution.

Stars are placed on the H-R diagram by taking their magnitudes and correcting for the distance modulus (i.e., the $1/r^{2}$ drop in intensity) to find the vertical coordinate on the plot. For the horizontal coordinate, each star's color is found by computing the difference in magnitude between the blue and red filtered light.
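
The distance-modulus correction can be written out explicitly. This is a minimal sketch using the standard definition $m - M = 5\log_{10}(d/10\,{\rm pc})$ and ignoring any dimming by dust; the function names are hypothetical and the magnitudes are made-up numbers.

```python
import numpy as np

def distance_modulus(dist_pc):
    """Distance modulus m - M for a distance in parsecs."""
    return 5.0 * np.log10(dist_pc / 10.0)

def absolute_magnitude(apparent_mag, dist_pc):
    """Magnitude the star would have if it were placed at 10 pc."""
    return apparent_mag - distance_modulus(dist_pc)

# Made-up star: apparent G magnitude 10 at 100 pc, with BP = 10.6 and RP = 9.4.
M_G_example = absolute_magnitude(10.0, 100.0)
color_example = 10.6 - 9.4   # difference of magnitudes, BP - RP
print("M_G =", M_G_example, " BP-RP =", round(color_example, 2))
```

Because magnitudes are logarithmic, a factor of 10 in distance (a factor of 100 in intensity) shifts the magnitude by exactly 5.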


In [None]:
#>>>RUN: L25.1-runcell06

M_G = gaia_data['phot_g_mean_mag'] - dist.distmod #corrected Star Magnitude (general-distance mod)
BP_RP = gaia_data['phot_bp_mean_mag'] - gaia_data['phot_rp_mean_mag'] #Blue filter - red filter

plt.plot(BP_RP.value, M_G.value, marker='.', linestyle='none', alpha=0.3)

plt.xlim(0, 3)
plt.ylim(11, 1)

#expand range
#plt.xlim(0, 3.75)
#plt.ylim(13.5, 0)

plt.xlabel('$G_{BP}-G_{RP}$')
plt.ylabel('$M_{G}$')
plt.show()

These local stars are predominantly along the main sequence, but you do see a few red giants at about $(1-1.5,0-4)$. There may also be several white dwarfs around $(1.0-2.0,10)$.

Let's have some fun and see what we can do with this data. One simple thing is to look at where the big bright stars are compared to the small red stars. Specifically, we will characterize a group of large stars using a `BP_RP` color in the range `[0.5,0.7]` and a magnitude in the range `[2,3.75]`. The group of low mass stars will be characterized with a `BP_RP` color in the range `[2,2.4]` and a magnitude in the range `[8.2,9.7]`. First, we'll show exactly where these stars are on our H-R diagram.


In [None]:
#>>>RUN: L25.1-runcell07

np.seterr(invalid="ignore")
#Red+0.7 > Blue > Red+0.5 and 2 < Star magnitude < 3.75
hi_mass_mask = ((BP_RP > 0.5*u.mag) & (BP_RP < 0.7*u.mag) & (M_G > 2*u.mag) & (M_G < 3.75*u.mag))
#                &                 (np.abs(galcen.v_y - 220*u.km/u.s) < 50*u.km/u.s))

#Red+2.4 > Blue > Red+2 and 8.2 < Star magnitude < 9.7
lo_mass_mask = ((BP_RP > 2*u.mag) & (BP_RP < 2.4*u.mag) & (M_G > 8.2*u.mag) & (M_G < 9.7*u.mag))
#                &                (np.abs(galcen.v_y - 220*u.km/u.s) < 50*u.km/u.s))

hi_mass_color = 'tab:purple'
lo_mass_color = 'tab:red'
hi_mass_label = 'high mass'
lo_mass_label = 'low mass'
milky_way = gp.MilkyWayPotential()
milky_way

plt.plot(BP_RP.value, M_G.value, marker='.', linestyle='none', alpha=0.1)

for mask, color, label in zip([lo_mass_mask, hi_mass_mask],[lo_mass_color, hi_mass_color], [lo_mass_label, hi_mass_label]):
    plt.plot(BP_RP[mask].value, M_G[mask].value, marker='.', linestyle='none', 
            alpha=0.5, color=color, label=label)

plt.xlim(0, 3)
plt.ylim(11, 1)

plt.xlabel('$G_{BP}-G_{RP}$')
plt.ylabel('$M_{G}$')
plt.legend()
plt.show()
     
     

Next, look at where these stars are found in the galaxy. This plot shows the location of all stars as fainter dots.

In [None]:
#>>>RUN: L25.1-runcell08

fig, ax = plt.subplots(1, 1, figsize=(10, 10))

ax.plot(galcen[1==1].x.value, galcen[1==1].y.value,marker='.', linestyle='none', alpha=0.02,color='blue')
ax.plot(galcen[lo_mass_mask].x.value, galcen[lo_mass_mask].y.value,marker='.', linestyle='none', alpha=0.5,color=lo_mass_color, label=lo_mass_label)
ax.plot(galcen[hi_mass_mask].x.value, galcen[hi_mass_mask].y.value,marker='.', linestyle='none', alpha=0.5,color=hi_mass_color, label=hi_mass_label)
ax.set_xlabel('x [{0:latex_inline}]'.format(galcen.x.unit));
ax.set_ylabel('y [{0:latex_inline}]'.format(galcen.y.unit));
plt.legend()
plt.show()

At first glance, there is no obvious deviation between the different populations visible here. However, I think it's been made clear in this class that simply plotting points doesn't reveal much about broad variations. To look further, we'll plot the distributions of position, in-plane ($x$-$y$) speed (labeled $v_r$ below), and $z$ velocity for each group.

In [None]:
#>>>RUN: L25.1-runcell09

r_ce = np.sqrt(galcen[1==1]        .y.value**2 + galcen[1==1]        .z.value**2)
r_lo = np.sqrt(galcen[lo_mass_mask].y.value**2 + galcen[lo_mass_mask].z.value**2)
r_hi = np.sqrt(galcen[hi_mass_mask].y.value**2 + galcen[hi_mass_mask].z.value**2)

hist_ce, bin_edges = np.histogram(r_ce,bins=10, density=True)
hist_lo, bin_edges = np.histogram(r_lo, bins=bin_edges, density=True)
hist_hi, bin_edges = np.histogram(r_hi, bins=bin_edges, density=True)
bin_center = 0.5*(bin_edges[:-1] + bin_edges[1:])
sc=1./(len(galcen))
sl=1./(len(galcen[lo_mass_mask]))
sh=1./(len(galcen[hi_mass_mask]))
plt.errorbar(bin_center,hist_ce,yerr=sc*hist_ce**0.5,alpha=0.5,drawstyle="steps-mid",marker='.',color='blue',label='all')
plt.errorbar(bin_center,hist_lo,yerr=sl*hist_lo**0.5,alpha=0.5,drawstyle="steps-mid",marker='.',color=lo_mass_color,label=lo_mass_label)
plt.errorbar(bin_center,hist_hi,yerr=sh*hist_hi**0.5,alpha=0.5,drawstyle="steps-mid",marker='.',color=hi_mass_color,label=hi_mass_label)
plt.xlabel(r'$\sqrt{y^2+z^2}$ [pc]')  # label matches the quantity computed in r_ce, r_lo, r_hi
plt.legend()
plt.show()


vr_ce = np.sqrt(galcen[1==1]        .v_y.value**2 + galcen[1==1]        .v_x.value**2)
vr_lo = np.sqrt(galcen[lo_mass_mask].v_y.value**2 + galcen[lo_mass_mask].v_x.value**2)
vr_hi = np.sqrt(galcen[hi_mass_mask].v_y.value**2 + galcen[hi_mass_mask].v_x.value**2)

hist_ce, bin_edges = np.histogram(vr_ce,bins=20, density=True)
hist_lo, bin_edges = np.histogram(vr_lo, bins=bin_edges, density=True)
hist_hi, bin_edges = np.histogram(vr_hi, bins=bin_edges, density=True)
bin_center = 0.5*(bin_edges[:-1] + bin_edges[1:])
sc=1./(len(galcen))
sl=1./(len(galcen[lo_mass_mask]))
sh=1./(len(galcen[hi_mass_mask]))
plt.errorbar(bin_center,hist_ce,yerr=sc*hist_ce**0.5,alpha=0.5,drawstyle="steps-mid",marker='.',color='blue',label='all')
plt.errorbar(bin_center,hist_lo,yerr=sl*hist_lo**0.5,alpha=0.5,drawstyle="steps-mid",marker='.',color=lo_mass_color,label=lo_mass_label)
plt.errorbar(bin_center,hist_hi,yerr=sh*hist_hi**0.5,alpha=0.5,drawstyle="steps-mid",marker='.',color=hi_mass_color,label=hi_mass_label)
plt.xlabel('v$_{r}$[km/s]')
plt.legend()
plt.show()


vz_ce = galcen[1==1]        .v_z.value
vz_lo = galcen[lo_mass_mask].v_z.value
vz_hi = galcen[hi_mass_mask].v_z.value

hist_ce, bin_edges = np.histogram(vz_ce,bins=20, density=True)
hist_lo, bin_edges = np.histogram(vz_lo, bins=bin_edges, density=True)
hist_hi, bin_edges = np.histogram(vz_hi, bins=bin_edges, density=True)
bin_center = 0.5*(bin_edges[:-1] + bin_edges[1:])
sc=1./(len(galcen))
sl=1./(len(galcen[lo_mass_mask]))
sh=1./(len(galcen[hi_mass_mask]))
plt.errorbar(bin_center,hist_ce,yerr=sc*hist_ce**0.5,alpha=0.5,drawstyle="steps-mid",marker='.',color='blue',label='all')
plt.errorbar(bin_center,hist_lo,yerr=sl*hist_lo**0.5,alpha=0.5,drawstyle="steps-mid",marker='.',color=lo_mass_color,label=lo_mass_label)
plt.errorbar(bin_center,hist_hi,yerr=sh*hist_hi**0.5,alpha=0.5,drawstyle="steps-mid",marker='.',color=hi_mass_color,label=hi_mass_label)
plt.xlabel('v$_{z}$[km/s]')
plt.legend()
plt.show()


As before, these distributions do not appear by eye to be different from each other.
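
One way to go beyond a by-eye comparison is to attach a statistical uncertainty to each normalized bin. The sketch below assumes independent Poisson fluctuations in each bin, so that a bin with $N_i$ entries has an uncertainty of $\sqrt{N_i}$ before normalization. The helper name `density_hist_with_errors` is hypothetical, and the input values are made up.

```python
import numpy as np

def density_hist_with_errors(values, bin_edges):
    """Normalized histogram and Poisson uncertainties for each bin."""
    counts, edges = np.histogram(values, bins=bin_edges)
    widths = np.diff(edges)
    norm = counts.sum() * widths       # same normalization as density=True
    density = counts / norm
    errors = np.sqrt(counts) / norm    # sqrt(N_i), scaled like the density
    return density, errors

# Made-up sample: four values in the first bin, one in the second.
toy_values = np.array([0.1, 0.2, 0.3, 0.4, 1.5])
toy_density, toy_errors = density_hist_with_errors(toy_values, [0.0, 1.0, 2.0])
print("density:", toy_density)
print("errors: ", toy_errors)
```

Two populations can then be compared bin by bin in units of their combined uncertainty. Note that the helper divides by the total number of entries, so it should only be called on a non-empty selection.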

<a name='exercises_25_1'></a>     

| [Top](#section_25_0) | [Restart Section](#section_25_1) | [Next Section](#section_25_2) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 25.1.1</span>

We've compared the distributions of high mass stars and low mass stars. Now, partition the intermediate mass stars and compare their distributions.

Specifically, define two groups of stars, `intermed1`, with a `BP_RP` color in the range `[1,1.5]` and a magnitude in the range `[5,7.5]`, and `intermed2`, with a `BP_RP` color in the range `[1.5,2]` and a magnitude in the range `[6.5,8.5]`.

Hint: It might be convenient to wrap some of the cells above into a function, which takes the masks, colors, and labels as inputs.


What do you see?

A) There appears to be no difference in the distributions.

B) These distributions are now clearly different.


<br>

In [None]:
#>>>EXERCISE: L25.1.1

np.seterr(invalid="ignore")

#DEFINE MASKS
intermed1_mask = #YOUR CODE HERE
intermed2_mask = #YOUR CODE HERE

intermed1_color = 'tab:green'
intermed2_color = 'tab:orange'
intermed1_label = 'intermed1'
intermed2_label = 'intermed2'


#YOUR CODE HERE (APPLY MASKS, MAKE PLOTS)

     

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 25.1.2</span>

What if we consider some of the other variables? Select objects with extreme radial velocity, and see how they look in the HR diagram.  Specifically, select events with radial velocity < 100 km/s and compare with radial velocity > 290 km/s. Where do these objects appear on the HR diagram? Select ALL that apply.

A) The low velocity objects appear roughly evenly distributed in color and magnitude.
    
B) The low velocity objects appear to be bluer (younger) stars.

C) The low velocity objects appear to be redder (older) stars.

D) The high velocity objects appear roughly evenly distributed in color and magnitude.

E) The high velocity objects appear to be bluer (younger) stars.

F) The high velocity objects appear to be redder (older) stars.


<br>

In [None]:
#>>>EXERCISE: L25.1.2

#DEFINE MASKS
np.seterr(invalid="ignore")

lo_vel_mask = #YOUR CODE HERE
hi_vel_mask = #YOUR CODE HERE

hi_vel_color = 'tab:purple'
lo_vel_color = 'tab:orange'
hi_vel_label = 'high velocity'
lo_vel_label = 'low velocity'

#MAKE PLOT OF HR DIAGRAM
#suggestion: use larger marker style or size to see clearly

     

<a name='section_25_2'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C"> L25.2 Building Projections with Gaia </h2>  

| [Top](#section_25_0) | [Previous Section](#section_25_1) | [Exercises](#exercises_25_2) | [Next Section](#section_25_3) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.3x+3T2023/block-v1:MITxT+8.S50.3x+3T2023+type@sequential+block@seq_LS25/block-v1:MITxT+8.S50.3x+3T2023+type@vertical+block@vert_LS25_vid2" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>

Now, we would like to see if we can understand some differences using the Gaia data toolkit, mostly because the toolkit is so well developed, and we would like to show you the extent of what you can do easily. Let's take our same data that doesn't have many obvious differences and see if we can start to conjure a difference in the data.

What we are going to do is take our previous data set and evolve it for a long period of time using a stepping integrator. This integrator will not step all stars in the galaxy, but will instead use a constant radial mass profile (i.e., mass distribution as a function of radius), consistent with that observed for the Milky Way.

Let's go ahead and define it. What we will end up plotting is a histogram of z-positions for all stars, before and after evolving them in time for 500 million years. Again, we are comparing high-mass stars (before and after evolution) with low-mass stars (before and after evolution).

This will start to show us if some stars are going in or out of the plane of the galaxy!
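
To make the idea of a stepping integrator concrete, here is a minimal sketch of a leapfrog (kick-drift-kick) scheme for a single body orbiting a point mass. This is a toy under stated assumptions: units with $GM=1$, a point-mass potential rather than the Milky Way mass profile, and hypothetical helper names `acceleration`, `leapfrog`, and `energy`. It is not the code that `gala` runs internally.

```python
import numpy as np

def acceleration(pos, gm=1.0):
    """Acceleration from a point mass at the origin (toy potential)."""
    r = np.linalg.norm(pos)
    return -gm * pos / r**3

def leapfrog(pos, vel, dt, n_steps, gm=1.0):
    """Advance position and velocity with kick-drift-kick leapfrog steps."""
    pos = np.array(pos, dtype=float)
    vel = np.array(vel, dtype=float)
    acc = acceleration(pos, gm)
    for _ in range(n_steps):
        vel = vel + 0.5 * dt * acc      # half kick
        pos = pos + dt * vel            # full drift
        acc = acceleration(pos, gm)     # force at the new position
        vel = vel + 0.5 * dt * acc      # second half kick
    return pos, vel

def energy(pos, vel, gm=1.0):
    """Specific orbital energy (kinetic plus potential)."""
    return 0.5 * np.dot(vel, vel) - gm / np.linalg.norm(pos)

# Circular orbit of radius 1 with GM = 1 (speed 1), followed for 1000 steps.
pos0, vel0 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
pos1, vel1 = leapfrog(pos0, vel0, dt=0.01, n_steps=1000)
print("final radius:", np.linalg.norm(pos1))
print("energy drift:", energy(pos1, vel1) - energy(pos0, vel0))
```

The star is moved in small time steps using only the acceleration from a fixed mass distribution, so the other stars never need to be stepped.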

In [None]:
#>>>RUN: L25.2-runcell01

H = gp.Hamiltonian(milky_way)
w0_hi = gd.PhaseSpacePosition(galcen[hi_mass_mask].cartesian)
w0_lo = gd.PhaseSpacePosition(galcen[lo_mass_mask].cartesian)

orbits_hi = H.integrate_orbit(w0_hi, dt=1*u.Myr, t1=0*u.Myr, t2=500*u.Myr)
orbits_lo = H.integrate_orbit(w0_lo, dt=1*u.Myr, t1=0*u.Myr, t2=500*u.Myr)

w0_hlo,bin_edges = np.histogram(w0_lo.z.value,bins=10,density=True)
w0_hhi,bin_edges = np.histogram(w0_hi.z.value,bins=bin_edges,density=True)
bin_center = 0.5*(bin_edges[:-1] + bin_edges[1:])

print(orbits_lo[-1,0],w0_lo[0])
o0_hlo,bin_edges = np.histogram(1e3*orbits_lo[-1,:].z.value,bins=bin_edges,density=True)
o0_hhi,bin_edges = np.histogram(1e3*orbits_hi[-1,:].z.value,bins=bin_edges,density=True)

plt.plot(bin_center,w0_hlo,alpha=0.5,marker='.',drawstyle="steps-mid",label='z-start low mass')
plt.plot(bin_center,w0_hhi,alpha=0.5,marker='.',drawstyle="steps-mid",label='z-start high mass')

plt.plot(bin_center,o0_hlo,alpha=0.5,marker='.',drawstyle="steps-mid",label='z-end low mass')
plt.plot(bin_center,o0_hhi,alpha=0.5,marker='.',drawstyle="steps-mid",label='z-end high mass')
plt.legend()
plt.xlabel('z[pc]')
plt.ylabel('N')

plt.show()

Now let's plot the orbits of some stars as a function of time over the 500 Myr, to see how things look. We will just choose one high-mass star and one low-mass star to compare the two orbits.


In [None]:
#>>>RUN: L25.2-runcell02

fig = orbits_hi[0:500, 0].plot(color=hi_mass_color)
_   = orbits_lo[0:500, 0].plot(axes=fig.axes, color=lo_mass_color)

So, the stars tend to stay in the plane of the galaxy at $z\approx 0$, while orbiting around the galactic center at $(x, y)=(0,0)$.

Let's also plot the initial and final positions of the evolved stars.


In [None]:
#>>>RUN: L25.2-runcell03

plt.plot(1e-3*galcen[1==1].x.value, 1e-3*galcen[1==1].y.value,marker='.', linestyle='none', alpha=0.2,color='blue')
plt.plot(1e-3*w0_lo.x.value,1e-3*w0_lo.y.value,marker='.', linestyle='none', alpha=0.5,color='green',label='start')
plt.plot(orbits_lo[-1].x.value,orbits_lo[-1].y.value,marker='.', linestyle='none', alpha=0.5,color=lo_mass_color,label='low mass')
plt.plot(orbits_hi[-1].x.value,orbits_hi[-1].y.value,marker='.', linestyle='none', alpha=0.5,color=hi_mass_color,label='high mass')
plt.xlabel('x[kpc]')
plt.ylabel('y[kpc]')
plt.legend()
plt.show()

plt.plot(1e-3*galcen[1==1].z.value, 1e-3*galcen[1==1].y.value,marker='.', linestyle='none', alpha=0.2,color='blue')#,label='origin')
plt.plot(1e-3*w0_lo.z.value,1e-3*w0_lo.y.value,marker='.', linestyle='none', alpha=0.5,color='green',label='start')
plt.plot(orbits_lo[-1].z.value,orbits_lo[-1].y.value,marker='.', linestyle='none', alpha=0.5,color=lo_mass_color,label='low mass')
plt.plot(orbits_hi[-1].z.value,orbits_hi[-1].y.value,marker='.', linestyle='none', alpha=0.5,color=hi_mass_color,label='high mass')
plt.xlabel('z[kpc]')
plt.ylabel('y[kpc]')
plt.legend()
plt.show()


We now start to see that there are some interesting features in these results. Look in particular at the evolved $x$ and $y$ positions. This is a reflection of the velocities of the nearby stars based on their type. Also, note that the stars that we selected are not comoving, so we do not expect them to all follow the same orbits. Remember, we are just analyzing stars that are near the Sun.


Another way to visualize the orbits is in cylindrical coordinates. Then we can see the variation in $z$ and radius $\rho$ (instead of Cartesian position). In the first plot, we again take the first star in each group (high-mass vs. low-mass) and plot its trajectory over the 500 Myr evolution. In the second plot, we show the final position of every star in the selected groups.
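
For reference, the change of variables behind these plots is short enough to write by hand. This is a sketch with a hypothetical function name, `cartesian_to_cylindrical`, and made-up positions in kpc.

```python
import numpy as np

def cartesian_to_cylindrical(x, y, z):
    """Return (rho, phi, z): distance from the z axis, azimuthal angle in radians, height."""
    rho = np.hypot(x, y)      # sqrt(x**2 + y**2)
    phi = np.arctan2(y, x)    # angle measured from the +x axis
    return rho, phi, z

# A point on the negative x axis, 8.1 kpc from the center and in the midplane:
rho_sun, phi_sun, z_sun = cartesian_to_cylindrical(-8.1, 0.0, 0.0)
print(rho_sun, phi_sun, z_sun)

# A made-up point slightly above the plane:
print(cartesian_to_cylindrical(3.0, 4.0, 0.2))
```

For an orbit that circles the Galactic center, $\rho$ stays within a narrow band while the angle sweeps around, which is why the $(\rho, z)$ view is compact.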


In [None]:
#>>>RUN: L25.2-runcell04

fig = orbits_hi[:, 0].cylindrical.plot(['rho', 'z'], 
                                       color=hi_mass_color,
                                       label='high mass')
_ = orbits_lo[:, 0].cylindrical.plot(['rho', 'z'], color=lo_mass_color,
                                     axes=fig.axes,
                                     label='low mass')

fig.axes[0].legend(loc='upper left')
fig.axes[0].set_ylim(-0.3, 0.3)
plt.show()

plt.plot(1e-3*galcen[1==1].cylindrical.rho, 1e-3*galcen[1==1].z.value,marker='.', linestyle='none', alpha=0.2,color='blue')
plt.plot(1e-3*w0_lo.cylindrical.rho,        1e-3*w0_lo.z.value,marker='.', linestyle='none', alpha=0.5,color='green')
plt.plot(orbits_lo[-1].cylindrical.rho,orbits_lo[-1].z.value,marker='.', linestyle='none', alpha=0.5,color=lo_mass_color)
plt.plot(orbits_hi[-1].cylindrical.rho,orbits_hi[-1].z.value,marker='.', linestyle='none', alpha=0.5,color=hi_mass_color)
plt.xlabel(r'$\rho$ [kpc]')
plt.ylabel('z [kpc]')
plt.show()

We can try to maximize the trends that we see above by computing more complex observables. You can see that the low-mass (red) stars tend to reach larger $|z|$ values and spread out more in $\rho$. We can further separate these out by looking at the projected $z_{\rm max}$ and the orbital eccentricity (a measure of how far an orbit deviates from a circle, with $ecc=0$ indicating a perfect circle and $ecc=1$ indicating an ellipse that is so elongated that it forms a line).
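
Both observables can be illustrated on a toy orbit. The sketch below assumes the convention $ecc = (r_{\rm apo} - r_{\rm per})/(r_{\rm apo} + r_{\rm per})$, where $r_{\rm per}$ and $r_{\rm apo}$ are the smallest and largest distances from the center along the orbit. The function name `zmax_and_eccentricity` and the toy orbit are hypothetical, and this is a simplified stand-in rather than the implementation used by the toolkit.

```python
import numpy as np

def zmax_and_eccentricity(x, y, z):
    """Largest height |z| and eccentricity estimated from a sampled orbit."""
    r = np.sqrt(x**2 + y**2 + z**2)
    r_per, r_apo = r.min(), r.max()           # pericenter and apocenter distances
    ecc = (r_apo - r_per) / (r_apo + r_per)
    return np.abs(z).max(), ecc

# Toy orbit: an ellipse with semi-major axis 8 and eccentricity 0.25 (focus at the
# origin), plus a small vertical oscillation of amplitude 0.3.
theta = np.linspace(0.0, 2.0 * np.pi, 1001)
r_plane = 8.0 * (1.0 - 0.25**2) / (1.0 + 0.25 * np.cos(theta))
x_toy = r_plane * np.cos(theta)
y_toy = r_plane * np.sin(theta)
z_toy = 0.3 * np.sin(3.0 * theta)

zmax_toy, ecc_toy = zmax_and_eccentricity(x_toy, y_toy, z_toy)
print("zmax:", zmax_toy, " eccentricity:", ecc_toy)
```

The toy orbit ranges between distances of 6 and 10 from the center, so the recovered eccentricity is close to the 0.25 that was put in.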

In [None]:
#>>>RUN: L25.2-runcell05

zmax_hi = orbits_hi.zmax(approximate=True)
zmax_lo = orbits_lo.zmax(approximate=True)
bins = np.linspace(0, 2, 50)

plt.hist(zmax_hi.value, bins=bins, alpha=0.4, density=True, label='high-mass', color=hi_mass_color)
plt.hist(zmax_lo.value, bins=bins, alpha=0.4, density=True, label='low-mass',color=lo_mass_color);
plt.legend(loc='best', fontsize=14)
print("Mean high: ", zmax_hi.value.mean(),"Mean Low:",zmax_lo.mean())

plt.yscale('log')
plt.xlabel(r" zmax" + " [{0:latex}]".format(zmax_hi.unit))
plt.show()

# Orbital eccentricity of each star (0 = circular orbit)
ecc_hi = orbits_hi.eccentricity()
ecc_lo = orbits_lo.eccentricity()
print("Ecc high: ", ecc_hi.value.mean(),"Ecc Low:",ecc_lo.mean())

bins = np.linspace(0, 2, 50) #bins = np.linspace(0, 0.75, 50)
plt.hist(ecc_hi.value, bins=bins, alpha=0.4, density=True, label='high-mass', color=hi_mass_color)
plt.hist(ecc_lo.value, bins=bins, alpha=0.4, density=True, label='low-mass',color=lo_mass_color);
plt.legend(loc='best', fontsize=14)
plt.yscale('log')
plt.xlabel('Eccentricity')
plt.show()

<h3>Observations</h3>

So what have we observed? The low-mass stars move much farther out of the galactic plane than the high-mass stars, and they also have larger orbital eccentricities.

Another piece to this puzzle is the fact that the low mass stars are old (they've been burning for billions of years), and the high mass stars are young.


In this case, we have observed two types of stars: Population I and Population II. You can read more about this <a href="https://en.wikipedia.org/wiki/Stellar_population" target="_blank">here</a>. 

<a name='exercises_25_2'></a>     

| [Top](#section_25_0) | [Restart Section](#section_25_2) | [Next Section](#section_25_3) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 25.2.1</span>

Let's make a histogram of another characteristic, namely the orbital period. Plot the distribution of orbital periods for these populations of stars. Note you will need the `estimate_period(radial=True)` function from <a href="https://gala.adrian.pw/en/latest/api/gala.dynamics.Orbit.html#gala.dynamics.Orbit.estimate_period" target="_blank">here</a>. 

As a rough measurement of how regular the periods are for each group, calculate the standard deviation of the orbital periods for the high-mass vs. low-mass stars. Report your answer as a list of numbers with `[per_hi_stdev,per_lo_stdev]`, with precision 1 Myr.


<br>

In [None]:
#>>>EXERCISE: L25.2.1

per_hi = #YOUR CODE HERE
per_lo = #YOUR CODE HERE

#CALCULATE THE STDEV OF EACH
#YOUR CODE HERE

#PLOT THE HISTOGRAM
bins = np.linspace(100, 250, 50)
plt.hist(per_hi.value, bins=bins, alpha=0.4, density=True, label='high-mass', color=hi_mass_color)
plt.hist(per_lo.value, bins=bins, alpha=0.4, density=True, label='low-mass',color=lo_mass_color);
plt.legend(loc='best', fontsize=14)
plt.yscale('log')
plt.xlabel('period')
plt.show()

<a name='section_25_3'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C"> L25.3 Anomaly Detection with Gaia </h2>  

| [Top](#section_25_0) | [Previous Section](#section_25_2) | [Exercises](#exercises_25_3) | [Next Section](#section_25_4) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.3x+3T2023/block-v1:MITxT+8.S50.3x+3T2023+type@sequential+block@seq_LS25/block-v1:MITxT+8.S50.3x+3T2023+type@vertical+block@vert_LS25_vid3" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>

Now let's do something more interesting with machine learning to see if we can detect more anomalies in the Gaia data! First, we will train an autoencoder on part of the dataset. Then, we'll apply this to another part of the dataset.  Our strategy will be to see if we can find some of the most anomalous stars in the dataset. We can then use the above observable exploration to see if the AI is telling us something meaningful.

To do this, we want to identify stars that are unlike the others. The idea is to construct an autoencoder that looks for anomalous stars in the dataset. To that end, we will build a torch dataset from variables that don't make our location in the galaxy too special. We will use:

 * galactic coordinates $\vec{r} = (r_{1}, r_{2}, r_{3})$
 * galactic velocity coordinates $\dot{\vec{r}} = (\dot{r}_{1}, \dot{r}_{2}, \dot{r}_{3})$
 * distance-corrected magnitudes in the G, BP, and RP bands

Naturally, to clean the data, we will remove any stars with `NaN` entries (where a value is missing from the data) and split the result into training and testing datasets.
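The cleaning steps can be sketched on a toy array before applying them to the Gaia table. This is only an illustration under stated assumptions: the array `features` is random stand-in data, and the names `good`, `clean`, and `standardized` are hypothetical and do not appear in the notebook's own `prepData` function.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=2.0, size=(10, 3))  # toy (N stars, n features) array
features[2, 1] = np.nan                                  # one star with a missing value

# keep only rows (stars) with no NaN in any column
good = ~np.isnan(features).any(axis=1)
clean = features[good]

# standardize each column to zero mean and unit standard deviation
standardized = (clean - clean.mean(axis=0)) / clean.std(axis=0)

# first half for training, second half for testing
n_train = int(len(standardized) * 0.5)
train, test = standardized[:n_train], standardized[n_train:]
print(clean.shape, train.shape, test.shape)
```

Keeping the boolean mask `good` around is handy, because the same mask can be applied to any other per-star array (such as the coordinates) so that everything stays aligned row by row.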

In [None]:
#>>>RUN: L25.3-runcell01

def prepData(igaia_data,idist,igalcen,iSplit=0.5):
    gaia_vars=['phot_g_mean_mag','phot_bp_mean_mag','phot_rp_mean_mag']
    var0=(igaia_data[gaia_vars[0]]-idist.distmod).value
    var1=(igaia_data[gaia_vars[1]]-idist.distmod).value
    var2=(igaia_data[gaia_vars[2]]-idist.distmod).value
    var3=igalcen.x.value
    var4=igalcen.y.value
    var5=igalcen.z.value
    var6=igalcen.v_x.value
    var7=igalcen.v_y.value
    var8=igalcen.v_z.value

    # stack the nine features into an (N, 9) array: one row per star
    processed_data = np.vstack((var0,var1,var2,var3,var4,var5,var6,var7,var8)).T

    # drop stars with a NaN in any feature; apply the same mask to igalcen
    # so that the coordinates stay aligned row-by-row with the features
    good = ~np.isnan(processed_data).any(axis=1)
    processed_data     = processed_data[good]
    galcen_clean       = igalcen[good]
    processed_data_raw = processed_data.copy()

    # standardize each feature: dividing by the std and then subtracting
    # the mean of the rescaled data is equivalent to (x - mean)/std
    processed_data /= np.std(processed_data,axis=0)
    processed_data -= np.mean(processed_data,axis=0)

    # convert to a float32 torch tensor
    tprocessed_data = torch.tensor(processed_data).float()
    #split
    maxindex       = int(len(processed_data)*iSplit)
    trainset       = torch.tensor(processed_data[0:maxindex]).float()
    trainset       = trainset[~torch.any(trainset.isnan(),dim=1)]
    testset        = torch.tensor(processed_data[maxindex:len(processed_data)]).float()
    testset        = testset[~torch.any(testset.isnan(),dim=1)]
    print(processed_data_raw.shape,testset.shape,trainset.shape)
    return testset,trainset,processed_data_raw,tprocessed_data,galcen_clean

btestset,btrainset,bprocessed_data_raw,btprocessed_data,bgalcen_clean=prepData(gaia_data,dist,galcen)

Now, we'll make a multilayer perceptron autoencoder with this dataset. The encoder expands the 9 inputs to 20 dimensions and then compresses them, via 6 dimensions, down to a latent space of 2; the decoder mirrors these steps to reconstruct the 9 inputs. The idea is that, in performing this compression, the autoencoder will learn the most relevant features of the data. Again, the features that we will give it are the three positions, three velocities, and three distance-corrected magnitudes.
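To get a feeling for how small this network is, the following sketch counts its trainable parameters. The layer widths are read off the `MLP` class defined in this section; the variable names `layer_sizes` and `n_params` are introduced here purely for illustration.

```python
layer_sizes = [9, 20, 6, 2, 6, 20, 9]  # inputs -> hidden -> latent -> hidden -> outputs

# each fully connected layer from n_in to n_out holds
# n_in*n_out weights plus n_out biases
n_params = sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
latent_dim = min(layer_sizes)
print(n_params, latent_dim)
```

With fewer than a thousand parameters and a 2-dimensional bottleneck, the network cannot memorize individual stars, so it has to capture the bulk structure of the data. Note also that the latent layer is followed by a ReLU, so the latent values are restricted to be non-negative.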


In [None]:
#>>>RUN: L25.3-runcell02

class MLP(nn.Module):
    def __init__(self,n_inputs,n_outputs):
        super(MLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_inputs, 20),
            nn.ReLU(),
            nn.Linear(20, 6),
            nn.ReLU(),
            nn.Linear(6, 2),
            nn.ReLU(),
            nn.Linear(2, 6),
            nn.ReLU(),
            nn.Linear(6, 20),
            nn.ReLU(),
            nn.Linear(20, n_outputs),
        )
        
    def forward(self, x):        
        x = self.layers(x)
        return x

def train(x,y,net,loss_func,opt,sched,nepochs):
    net.train(True)
    for epoch in range(nepochs):
        prediction = net(x)
        opt.zero_grad()
        loss = loss_func(prediction,y) 
        loss.backward() 
        opt.step()
        if epoch % 500 == 0: 
            print('[%d] loss: %.4f ' % (epoch + 1, loss.item()  ))
    #sched.step()
    return    

basicmodel     = MLP(btrainset.shape[1],btrainset.shape[1])
optimizer = torch.optim.Adam(basicmodel.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1, last_epoch=-1, verbose=False)
loss_fn   =  nn.MSELoss(reduction='sum')

Let's use our dataset to train this.

In [None]:
#>>>RUN: L25.3-runcell03

train(btrainset,btrainset,basicmodel,loss_fn,optimizer,scheduler,10001)

Now that we have trained the model, let's look at its overall performance to understand what is going on. First, take a look at the loss. In the context of anomaly detection, the loss has a particular meaning. Any inputs with high loss represent anomalous events, since they don't match what the network has learned. These inputs (stars in our case) may be quite different from the majority of those in the dataset!
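As a small illustration of what "loss per star" means, the sketch below computes the summed squared reconstruction error for three toy inputs. The arrays `inputs` and `outputs` are made-up numbers standing in for standardized features and autoencoder outputs; they are not taken from the Gaia data.

```python
import numpy as np

inputs = np.array([[0.1, -0.2,  0.3],
                   [1.0,  0.5, -1.0],
                   [0.0,  0.0,  0.0]])
# hypothetical autoencoder outputs; the second star is reconstructed badly
outputs = np.array([[ 0.1, -0.1, 0.3],
                    [-2.0,  3.5, 2.0],
                    [ 0.0,  0.1, 0.0]])

# per-star loss: squared error summed over the features (one number per row)
loss = np.sum((inputs - outputs) ** 2, axis=1)
most_anomalous = int(np.argmax(loss))
print(loss, most_anomalous)
```

The star whose features the network fails to reproduce gets by far the largest loss, and that is the quantity we will cut on.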

In [None]:
#>>>RUN: L25.3-runcell04

basicmodel.train(False)
boutput=basicmodel(btestset)
btestloss=torch.sum((btestset-boutput)**2,axis=1)
plt.hist(btestloss[btestloss < 100].detach().numpy(),density=True)
plt.yscale('log')
plt.xlabel('loss')
plt.ylabel('pdf')
plt.show()

varlabels=['Mag','B-Mag','R-Mag','x','y','z','vx','vy','vz']

fig, ax = plt.subplots(3, 3, figsize=(20, 20))
for var in range(btestset.shape[1]):
    _,bins,_=ax[var//3,var % 3].hist(btestset[:,var].detach().numpy(),density=True,alpha=0.5,label='Input')
    ax[var//3,var % 3].hist(boutput [:,var].detach().numpy(),density=True,alpha=0.5,bins=bins,label='Output')
    ax[var//3,var % 3].set_xlabel(varlabels[var])
    ax[var//3,var % 3].legend()

We can put a cut on the loss and see where the most anomalous objects sit in the data. Specifically, the call below selects all objects with a loss above 10 (but you can try changing the cut if you want!).

The color scale in the plots below represents the loss, clipped at `maxlosscolor` and normalized to its maximum, so brighter colors mark more anomalous objects. Objects that pass the cut are overplotted in orange.
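If you would rather not pick the cut by hand, one option is to set it from a quantile of the loss distribution. The sketch below does this on toy data and also shows the clip-and-normalize step used for the color scale. The exponential toy losses, the 1% fraction, and the names `cut` and `color` are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
loss = rng.exponential(scale=2.0, size=1000)   # toy stand-in for per-star losses

# choose the cut so that roughly the top 1% of stars are flagged
cut = np.quantile(loss, 0.99)
anomalies = loss > cut

# clip the loss and rescale it to [0, 1] for the color scale
max_loss_color = 20.0
color = np.minimum(loss, max_loss_color)
color = color / color.max()
print(cut, anomalies.sum())
```

A quantile-based cut keeps the number of selected stars fixed when you retrain the network, which makes different trainings easier to compare than a fixed loss value would.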

In [None]:
#>>>RUN: L25.3-runcell05

def plotAnomaly(iCut,iRaw,iLoss,igalcen,maxlosscolor=20):
    loss = np.minimum(iLoss,maxlosscolor)
    anomalies=(iLoss > iCut)
    baseidex=len(iRaw)-len(iLoss)
    btestdata_raw = iRaw[baseidex:]
    anomaly_raw  = btestdata_raw[anomalies]
    btestgalcen   = igalcen[baseidex:]
    anomaly_galcen = btestgalcen[anomalies]
    loss = loss/np.max(loss)
    if maxlosscolor != 20:
        loss=np.ones(loss.shape)
    print(loss.shape,btestdata_raw[:,2].shape)
    # columns: 0 = M_G, 1 = M_BP, 2 = M_RP, so the color G_BP - G_RP
    # is column 1 minus column 2
    scat=plt.scatter(btestdata_raw[:,1]-btestdata_raw[:,2],btestdata_raw[:,0], marker='.',c=loss,cmap="viridis")
    plt.plot(anomaly_raw[:,1]-anomaly_raw[:,2],anomaly_raw[:,0], marker='.', linestyle='none',c='orange')
    plt.xlabel('$G_{BP}-G_{RP}$')
    plt.ylabel('$M_{G}$')
    plt.ylim(15,0)
    plt.xlim(-1,5)
    plt.colorbar(scat)
    plt.show()
    
    scat=plt.scatter(btestdata_raw[:,3],btestdata_raw[:,4],c=loss, marker='.',cmap="viridis")
    plt.plot(anomaly_raw[:,3],anomaly_raw[:,4], marker='.', linestyle='none',c='orange')
    plt.xlabel("x[pc]")
    plt.ylabel("y[pc]")
    plt.colorbar(scat)
    plt.show()

    scat=plt.scatter(btestdata_raw[:,3],btestdata_raw[:,5],c=loss, marker='.', cmap="viridis")
    plt.plot(anomaly_raw[:,3],anomaly_raw[:,5], marker='.', linestyle='none',c='orange')
    plt.xlabel("x[pc]")
    plt.ylabel("z[pc]")
    plt.colorbar(scat)
    plt.show()

    scat=plt.scatter(btestdata_raw[:,6],btestdata_raw[:,7],c=loss, marker='.', cmap="viridis")
    plt.plot(anomaly_raw[:,6],anomaly_raw[:,7], marker='.', linestyle='none',c='orange')
    plt.xlabel("vx[km/s]")
    plt.ylabel("vy[km/s]")
    plt.colorbar(scat)
    plt.show()

    scat=plt.scatter(btestdata_raw[:,6],btestdata_raw[:,8],c=loss, marker='.',cmap="viridis")
    plt.plot(anomaly_raw[:,6],anomaly_raw[:,8], marker='.', linestyle='none',c='orange')
    plt.xlabel("vx[km/s]")
    plt.ylabel("vz[km/s]")
    plt.colorbar(scat)
    plt.show()

    print(len(btestdata_raw),len(iRaw),len(igalcen),len(btestgalcen))
    H = gp.Hamiltonian(milky_way)
    w0_anom = gd.PhaseSpacePosition(anomaly_galcen.cartesian)
    orbits_anom = H.integrate_orbit(w0_anom, dt=1*u.Myr, t1=0*u.Myr, t2=100*u.Myr)
    w0_all  = gd.PhaseSpacePosition(igalcen.cartesian)  # use the argument, not the global bgalcen_clean
    orbits_all  = H.integrate_orbit(w0_all,  dt=1*u.Myr, t1=0*u.Myr, t2=100*u.Myr)

    zmax_all = orbits_all.zmax(approximate=True)
    zmax_anom = orbits_anom.zmax(approximate=True)
    print("ZMax mean:",zmax_anom[~np.isnan(zmax_anom.value)].mean(),len(zmax_anom),"default:",zmax_all.mean())
    bins = np.linspace(0, 10, 50)
    plt.hist(zmax_all.value, bins=bins, alpha=0.4, density=True, label='all')
    plt.hist(zmax_anom.value, bins=bins, alpha=0.4, density=True, label='anom')
    plt.legend(loc='best', fontsize=14)
    plt.yscale('log')
    plt.xlabel(r" zmax" + " [{0:latex}]".format(zmax_all.unit))
    plt.show()

    # orbital eccentricity; use dedicated names instead of overwriting zmax_*
    ecc_all  = orbits_all.eccentricity()
    ecc_anom = orbits_anom.eccentricity()
    print("Ecc mean:",ecc_anom[~np.isnan(ecc_anom.value)].mean(),len(ecc_anom),"default:",ecc_all.mean())
    bins = np.linspace(0, 1, 50)
    plt.hist(ecc_all.value,  bins=bins, alpha=0.4, density=True, label='all')
    plt.hist(ecc_anom.value, bins=bins, alpha=0.4, density=True, label='anom')
    plt.legend(loc='best', fontsize=14)
    plt.yscale('log')
    plt.xlabel('Eccentricity')
    plt.show()


plotAnomaly(10,bprocessed_data_raw,btestloss.detach().numpy(),bgalcen_clean)

It's clear that we are really finding some of the weirdest stars, which are very far from the galactic plane!

<a name='exercises_25_3'></a>     

| [Top](#section_25_0) | [Restart Section](#section_25_3) | [Next Section](#section_25_4) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 25.3.1</span>

We saw that anomalous stars tended to have large $|v_{x}|$ values. Let's try to select the stars with $|v_{x}| > 100$ and thereby effectively reverse engineer the anomaly detection.

In the function `plotAnomaly(iCut,iRaw,iLoss,igalcen,maxlosscolor=20)`, the anomalous objects have `iLoss` values that are above the threshold defined by `iCut`. So to accomplish our goal, let's define an array of `iLoss` values to be the velocities  $|v_{x}|$, then use the appropriate `iCut`.

Consider the options below for defining this cut on the data, and choose the best one. Hint: look at the definitions within the `prepData` function at the beginning of this section.

A) `cutvals=np.abs(bprocessed_data_raw[:,0][2032:]).copy()`\
B) `cutvals=np.abs(bprocessed_data_raw[:,1][2032:]).copy()`\
C) `cutvals=np.abs(bprocessed_data_raw[:,2][2032:]).copy()`\
D) `cutvals=np.abs(bprocessed_data_raw[:,3][2032:]).copy()`\
E) `cutvals=np.abs(bprocessed_data_raw[:,4][2032:]).copy()`\
F) `cutvals=np.abs(bprocessed_data_raw[:,5][2032:]).copy()`\
G) `cutvals=np.abs(bprocessed_data_raw[:,6][2032:]).copy()`\
H) `cutvals=np.abs(bprocessed_data_raw[:,7][2032:]).copy()`\
I) `cutvals=np.abs(bprocessed_data_raw[:,8][2032:]).copy()`


Upon completing your selection above, try to run the code `plotAnomaly(100,bprocessed_data_raw,cutvals,bgalcen_clean,maxlosscolor=150)` and compare the HR diagram to what we saw before. Do you think this does a good job?

<br>

In [None]:
#>>>EXERCISE: L25.3.1

cutvals=#YOUR CODE HERE
plotAnomaly(100,bprocessed_data_raw,cutvals,bgalcen_clean,maxlosscolor=150)

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 25.3.2</span>

Search for anomalies without using the three magnitudes, i.e., the red (RP), blue (BP), and broadband G filters. Hint: to do this you could either redefine the `prepData` function or perform the relevant selections on the existing data (e.g., `btestset[:,3:9]`). How do things compare? Select ALL that apply.

A) The selected anomalies are identical.

B) The selected anomalies are totally different.

C) Some anomalies are the same, but the selection is different from before.

D) The loss function is nearly identical.

E) The loss function has a narrower tail, so fewer anomalies are selected.

F) The loss function has a wider tail, so more anomalies are selected.

<br>

<a name='section_25_4'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C"> L25.4 Anomaly Detection with Lots of Gaia Data</h2>  

| [Top](#section_25_0) | [Previous Section](#section_25_3) | [Exercises](#exercises_25_4) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.3x+3T2023/block-v1:MITxT+8.S50.3x+3T2023+type@sequential+block@seq_LS25/block-v1:MITxT+8.S50.3x+3T2023+type@vertical+block@vert_LS25_vid4" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>

Recall that what we've done so far used only a sample of 2k nearby stars. What if we do this on a much larger scale, namely with about 170k stars, which produce a dataset file of approximately 9 MB?


In [None]:
#>>>RUN: L25.4-runcell01

#note, our query looks for 1e6 stars, but only finds about 170k
query_text = '''SELECT TOP 1000000 ra, dec, parallax, pmra, pmdec, radial_velocity,
phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag
FROM gaiadr3.gaia_source
WHERE parallax_over_error > 10 AND
    parallax > 10 AND
    radial_velocity IS NOT null
ORDER BY random_index
'''

job = Gaia.launch_job(query_text)
gaia_data = job.get_results()
gaia_data.write('data/L25/gaia_data2.fits',overwrite=True)

#Note, if you return to this section after closing the kernel,
#you can load the data again using the following code
gaia_data = QTable.read('data/L25/gaia_data2.fits')
print("Total Events:",len(gaia_data))


How do these stars look on an H-R diagram?

In [None]:
#>>>RUN: L25.4-runcell02

dist = coord.Distance(parallax=u.Quantity(gaia_data['parallax']))
M_G = gaia_data['phot_g_mean_mag'] - dist.distmod
BP_RP = gaia_data['phot_bp_mean_mag'] - gaia_data['phot_rp_mean_mag']

plt.plot(BP_RP.value, M_G.value, marker='.', linestyle='none', alpha=0.3)

plt.xlim(-1, 5)
plt.ylim(17, -2)

plt.xlabel('$G_{BP}-G_{RP}$')
plt.ylabel('$M_{G}$')
plt.show()

With this larger sample, the red giants at the top are much more prominent and you start to more clearly see some points down at the bottom that correspond to the white dwarfs. As before, we'll split this dataset in two and train an autoencoder, **this time using 4 latent dimensions.**

As before, we are performing a compression that forces the autoencoder to learn the most relevant features of the data. Again, the features that we will give it are the three positions, three velocities, and three distance-corrected spectral magnitudes.
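For reference, the "distance correction" is the distance modulus, which converts an apparent magnitude into an absolute one. Here is a minimal sketch assuming the distance is taken as the simple inverse of the parallax (reasonable for the small parallax errors selected in the query) and ignoring extinction. The function names are hypothetical helpers, not astropy calls.

```python
import math

def distance_pc(parallax_mas):
    """Distance in parsec from a parallax in milliarcseconds."""
    return 1000.0 / parallax_mas

def distance_modulus(d_pc):
    """m - M for a star at distance d_pc (extinction ignored)."""
    return 5.0 * math.log10(d_pc / 10.0)

def absolute_magnitude(apparent_mag, parallax_mas):
    """Absolute magnitude M = m - (m - M)."""
    return apparent_mag - distance_modulus(distance_pc(parallax_mas))

# a star right at the edge of the `parallax > 10` selection sits at 100 pc
M_G_example = absolute_magnitude(apparent_mag=10.0, parallax_mas=10.0)
print(distance_pc(10.0), M_G_example)
```

Since the query requires `parallax > 10` (in milliarcseconds), every star in this sample lies within about 100 pc of the Sun, so the distance modulus is always below 5.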

Let's define everything.

In [None]:
#>>>RUN: L25.4-runcell03

#redefine as above
def prepData(igaia_data,idist,igalcen,iSplit=0.5):
    gaia_vars=['phot_g_mean_mag','phot_bp_mean_mag','phot_rp_mean_mag']
    var0=(igaia_data[gaia_vars[0]]-idist.distmod).value
    var1=(igaia_data[gaia_vars[1]]-idist.distmod).value
    var2=(igaia_data[gaia_vars[2]]-idist.distmod).value
    var3=igalcen.x.value
    var4=igalcen.y.value
    var5=igalcen.z.value
    var6=igalcen.v_x.value
    var7=igalcen.v_y.value
    var8=igalcen.v_z.value

    # stack the nine features into an (N, 9) array: one row per star
    processed_data = np.vstack((var0,var1,var2,var3,var4,var5,var6,var7,var8)).T

    # drop stars with a NaN in any feature; apply the same mask to igalcen
    # so that the coordinates stay aligned row-by-row with the features
    good = ~np.isnan(processed_data).any(axis=1)
    processed_data     = processed_data[good]
    galcen_clean       = igalcen[good]
    processed_data_raw = processed_data.copy()

    # standardize each feature: dividing by the std and then subtracting
    # the mean of the rescaled data is equivalent to (x - mean)/std
    processed_data /= np.std(processed_data,axis=0)
    processed_data -= np.mean(processed_data,axis=0)

    # convert to a float32 torch tensor
    tprocessed_data = torch.tensor(processed_data).float()
    #split
    maxindex       = int(len(processed_data)*iSplit)
    trainset       = torch.tensor(processed_data[0:maxindex]).float()
    trainset       = trainset[~torch.any(trainset.isnan(),dim=1)]
    testset        = torch.tensor(processed_data[maxindex:len(processed_data)]).float()
    testset        = testset[~torch.any(testset.isnan(),dim=1)]
    print(processed_data_raw.shape,testset.shape,trainset.shape)
    return testset,trainset,processed_data_raw,tprocessed_data,galcen_clean

c = coord.SkyCoord(ra=gaia_data['ra'], dec=gaia_data['dec'],distance=dist,pm_ra_cosdec=gaia_data['pmra'], pm_dec=gaia_data['pmdec'],
                   radial_velocity=gaia_data['radial_velocity'])
galcen = c.transform_to(coord.Galactocentric(z_sun=0*u.pc, galcen_distance=8.1*u.kpc))

testset,trainset,processed_data_raw,tprocessed_data,galcen_clean=prepData(gaia_data,dist,galcen)

In [None]:
#>>>RUN: L25.4-runcell04

class MLP(nn.Module):
    def __init__(self,n_inputs,n_outputs):
        super(MLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_inputs, 100),
            nn.ReLU(),
            nn.Linear(100, 20),
            nn.ReLU(),
            nn.Linear(20, 4),
            nn.ReLU(),
            nn.Linear(4, 20),
            nn.ReLU(),
            nn.Linear(20, 100),
            nn.ReLU(),
            nn.Linear(100, n_outputs),
        )
        
    def forward(self, x):        
        x = self.layers(x)
        return x

def train(x,y,net,loss_func,opt,sched,nepochs):
    net.train(True)
    for epoch in range(nepochs):
        prediction = net(x)
        opt.zero_grad()
        loss = loss_func(prediction,y) 
        loss.backward() 
        opt.step()
        if epoch % 500 == 0: 
            print('[%d] loss: %.4f ' % (epoch + 1, loss.item()  ))
    #sched.step()
    return    

model     = MLP(trainset.shape[1],trainset.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1, last_epoch=-1, verbose=False)
loss_fn   =  nn.MSELoss(reduction='sum')

Then, we train on half of the data. This will take a significant amount of time (perhaps more than 30 min, but probably not more than an hour - timed in Colab).

In [None]:
#>>>RUN: L25.4-runcell05

train(trainset,trainset,model,loss_fn,optimizer,scheduler,10001)
#train(trainset,trainset,model,loss_fn,optimizer,scheduler,5001) #train for shorter amount of epochs

In [None]:
#>>>RUN: L25.4-runcell06

model.train(False)
output=model(testset)
testloss=torch.sum((testset-output)**2,axis=1)
plt.hist(testloss[testloss < 100].detach().numpy(),density=True)
plt.yscale('log')
plt.xlabel('loss')
plt.ylabel('pdf')
plt.show()

varlabels=['Mag','B-Mag','R-Mag','x','y','z','vx','vy','vz']

fig, ax = plt.subplots(3, 3, figsize=(20, 20))
for var in range(testset.shape[1]):
    _,bins,_=ax[var//3,var % 3].hist(testset[:,var].detach().numpy(),density=True,alpha=0.5,label='Input')
    ax[var//3,var % 3].hist(output [:,var].detach().numpy(),density=True,alpha=0.5,bins=bins,label='Output')
    ax[var//3,var % 3].set_xlabel(varlabels[var])
    ax[var//3,var % 3].legend()

As before, we'll look for anomalies in this scenario and study the dynamics of these objects, first checking whether anything appears strange by eye.

In [None]:
#>>>RUN: L25.4-runcell07

milky_way = gp.MilkyWayPotential()
H = gp.Hamiltonian(milky_way)
w0_all  = gd.PhaseSpacePosition(galcen_clean[1==1].cartesian)
orbits_all  = H.integrate_orbit(w0_all,  dt=1*u.Myr, t1=0*u.Myr, t2=100*u.Myr)
#plotAnomaly(45,processed_data_raw,testloss,galcen_clean,orbits_all)

In [None]:
#>>>RUN: L25.4-runcell08

anomalies=(testloss > 45.)
baseidex=len(processed_data_raw)-len(testloss)
testgalcen   = galcen_clean[baseidex:]
anomaly_galcen = testgalcen[anomalies]

w0_anom = gd.PhaseSpacePosition(anomaly_galcen.cartesian)
orbits_anom = H.integrate_orbit(w0_anom, dt=1*u.Myr, t1=0*u.Myr, t2=100*u.Myr)

#redefine, as above
hi_mass_color = 'tab:purple'
lo_mass_color = 'tab:red'
hi_mass_label = 'high mass'
lo_mass_label = 'low mass'

fig = orbits_all[0:200, 0].plot(color=hi_mass_color,label='normal')
_   = orbits_all[0:200, 1].plot(axes=fig.axes,color=hi_mass_color)
_   = orbits_all[0:200, 2].plot(axes=fig.axes,color=hi_mass_color)
_   = orbits_anom[0:200, 0].plot(axes=fig.axes,label='anom 1')
_   = orbits_anom[0:200, 1].plot(axes=fig.axes,label='anom 2')
_   = orbits_anom[0:200, 2].plot(axes=fig.axes,label='anom 3')
_   = orbits_anom[0:200, 3].plot(axes=fig.axes,label='anom 4',color=lo_mass_color)
plt.legend()


zmax=orbits_anom.zmax(approximate=True)
print(zmax)

Wow! Some of these objects appear to have trajectories that make them shoot out of the galactic plane. That's certainly not how most stars behave!

Now, let's again cut on the loss and look at the characteristics of the anomalous population.

In [None]:
#>>>RUN: L25.4-runcell09

def plotAnomalyBasic(iCut,iRaw,iLoss,igalcen,iOrbitsAll):
    loss = np.minimum(iLoss.detach().numpy(),20)
    anomalies=(iLoss > iCut)
    baseidex=len(iRaw)-len(iLoss)
    testdata_raw = iRaw[baseidex:]
    anomaly_raw  = testdata_raw[anomalies]
    testgalcen   = igalcen[baseidex:]
    anomaly_galcen = testgalcen[anomalies]

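    # color G_BP - G_RP is column 1 minus column 2; plot M_G directly and flip
    # the axis so that bright stars (small M_G) appear at the top
    scat=plt.scatter(testdata_raw[:,1]-testdata_raw[:,2],testdata_raw[:,0],c=loss)
    plt.plot(anomaly_raw[:,1]-anomaly_raw[:,2],anomaly_raw[:,0], marker='.',c='orange', linestyle='none')
    plt.xlabel('$G_{BP}-G_{RP}$')
    plt.ylabel('$M_{G}$')
    plt.gca().invert_yaxis()
    plt.colorbar(scat)
    plt.show()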

    scat=plt.scatter(testdata_raw[:,3],testdata_raw[:,4], marker='.', c=loss)
    plt.plot(anomaly_raw[:,3],anomaly_raw[:,4], marker='.',c='orange', linestyle='none')
    plt.xlabel("x[pc]")
    plt.ylabel("y[pc]")
    plt.colorbar(scat)
    plt.show()

    scat=plt.scatter(testdata_raw[:,3],testdata_raw[:,5], marker='.', c=loss)
    plt.plot(anomaly_raw[:,3],anomaly_raw[:,5], marker='.',c='orange', linestyle='none')
    plt.xlabel("x[pc]")
    plt.ylabel("z[pc]")
    plt.colorbar(scat)
    plt.show()

    scat=plt.scatter(testdata_raw[:,6],testdata_raw[:,7], marker='.', c=loss)
    plt.plot(anomaly_raw[:,6],anomaly_raw[:,7], marker='.',c='orange', linestyle='none')
    plt.xlabel("vx[km/s]")
    plt.ylabel("vy[km/s]")
    plt.colorbar(scat)
    plt.show()

    scat=plt.scatter(testdata_raw[:,6],testdata_raw[:,8], marker='.', c=loss)
    plt.plot(anomaly_raw[:,6],anomaly_raw[:,8], marker='.',c='orange', linestyle='none')
    plt.xlabel("vx[km/s]")
    plt.ylabel("vz[km/s]")
    plt.colorbar(scat)
    plt.show()

def plotAnomalyComplex(iCut,iRaw,iLoss,igalcen,iOrbitsAll,iEccAll,iZAll,iT2):
    anomalies=(iLoss > iCut)
    baseidex=len(iRaw)-len(iLoss)
    testdata_raw = iRaw[baseidex:]
    anomaly_raw  = testdata_raw[anomalies]
    testgalcen   = igalcen[baseidex:]
    anomaly_galcen = testgalcen[anomalies]

    print(len(testdata_raw),len(iRaw),len(igalcen),len(testgalcen))
    H = gp.Hamiltonian(milky_way)
    w0_anom = gd.PhaseSpacePosition(anomaly_galcen.cartesian)
    orbits_anom = H.integrate_orbit(w0_anom, dt=1*u.Myr, t1=0*u.Myr, t2=iT2)

    # maximum height above the plane for the anomalous orbits;
    # iZAll holds the precomputed values for all stars
    zmax_anom = orbits_anom.zmax(approximate=True)
    print("ZMax mean:",zmax_anom[~np.isnan(zmax_anom.value)].mean(),len(zmax_anom),"default:",iZAll.mean())
    bins = np.linspace(0, 10, 50)
    plt.hist(iZAll.value, bins=bins, alpha=0.4, density=True, label='all')
    plt.hist(zmax_anom.value, bins=bins, alpha=0.4, density=True, label='anom')
    plt.legend(loc='best', fontsize=14)
    plt.yscale('log')
    plt.xlabel(r" zmax" + " [{0:latex}]".format(iZAll.unit))
    plt.show()

    # orbital eccentricity for the anomalous orbits;
    # iEccAll holds the precomputed values for all stars
    ecc_anom = orbits_anom.eccentricity()
    print("Ecc mean:",ecc_anom[~np.isnan(ecc_anom.value)].mean(),len(ecc_anom),"default:",iEccAll.mean())
    bins = np.linspace(0, 1, 50)
    plt.hist(iEccAll.value,  bins=bins, alpha=0.4, density=True, label='all')
    plt.hist(ecc_anom.value, bins=bins, alpha=0.4, density=True, label='anom')
    plt.legend(loc='best', fontsize=14)
    plt.yscale('log')
    plt.xlabel('Eccentricity')
    plt.show()

plotAnomalyBasic(45,processed_data_raw,testloss,galcen_clean,orbits_all)

In [None]:
#>>>RUN: L25.4-runcell10

#Note, this will take some time to run
zmax_all = orbits_all.zmax(approximate=True)
ecc_all  = orbits_all.eccentricity()
plotAnomalyComplex(20,processed_data_raw,testloss,galcen_clean,orbits_all,ecc_all,zmax_all,iT2=100*u.Myr)

These results make it clear that we are selecting stars with very unusual orbital paths. We could go into even more detail and investigate these further.

Instead, let's try a different type of anomaly detector: a variational autoencoder (VAE), to see whether it selects anything different.

<a name='exercises_25_4'></a>     

| [Top](#section_25_0) | [Restart Section](#section_25_4) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 25.4.1</span>

We performed the analysis above using a multilayer perceptron (MLP). Consider what similarities or differences might arise if we instead used a variational autoencoder (VAE). Select ALL statements that accurately describe these machine learning methods.

A) A VAE would provide a probabilistic representation of the data, while an MLP would provide a deterministic mapping from input to output.

B) Both VAE and MLP can be used for anomaly detection, but a VAE can generate new samples similar to the training data, which an MLP cannot do.

C) A VAE would learn a continuous latent space, while an MLP would not.


**Afterwards, perform the same analysis as above with a VAE, using 4 latent dimensions.** What do you find? Note, you can check the solution to this problem to see the code that we used.
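As a starting point, here is a minimal sketch of the two terms that typically make up a VAE loss for a single star: the reconstruction error and the Kullback-Leibler divergence between the encoder's Gaussian $\mathcal{N}(\mu, \sigma^2)$ and a standard normal prior. This is the standard textbook form, not necessarily the one used in the course solution; the function name `vae_loss_terms` and the toy numbers are assumptions made for illustration.

```python
import math

def vae_loss_terms(x, x_recon, mu, logvar):
    """Return (reconstruction, KL) terms for one sample.

    x, x_recon : input features and their reconstruction
    mu, logvar : mean and log-variance of the latent Gaussian
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mu, logvar))
    return recon, kl

# perfect reconstruction with a latent code equal to the prior: both terms vanish
recon0, kl0 = vae_loss_terms([0.5, -1.0], [0.5, -1.0], [0.0] * 4, [0.0] * 4)
# poor reconstruction and a latent mean shifted away from zero
recon1, kl1 = vae_loss_terms([0.5, -1.0], [0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0] * 4)
print(recon0, kl0, recon1, kl1)
```

For anomaly detection, either the reconstruction term alone or the sum of both terms can serve as the per-star anomaly score; which works better here is something to explore.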

<br>