In [1]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sps
import pandas as pd

Hipparcos, which operated from 1989 to 1993, was the first scientific satellite devoted to precision astrometry, i.e. the accurate measurement of the positions of stars. By measuring the parallax motion of stars on the sky as the Earth (and the satellite) moves in its orbit around the Sun, Hipparcos could obtain accurate distances to stars out to a few hundred parsecs (pc). We will use some data from the Hipparcos mission as our example data set, in order to plot a ‘colour-magnitude’ diagram of the general population of stars. We will see how to read the data into a Pandas dataframe, clean it of bad and low-precision data, and transform the data into useful values which we can plot.
                

The file `hipparcos.txt` (see the Lesson data [**here**](https://github.com/philuttley/statistical-inference/tree/gh-pages/data/hipparcos.txt)) is a multivariate data set containing a lot of information. To start with, you should look at the raw data file using your favourite text editor, Python's native text input/output commands, or the `more` or `cat` commands in the Linux shell. The file is formatted in a complex way, so we need to skip the first 53 lines to get to the data. We will also need to skip the final couple of lines. Using the `pandas.read_csv` command to read in the file, we specify `delim_whitespace=True` since the values are separated by spaces, not commas, in this file, and we use the `skiprows` and `skipfooter` arguments to skip the lines at the start and end of the file that do not correspond to data. We specify `engine='python'` to avoid a warning message (the default parser does not support `skipfooter`), and `index_col=False` ensures that Pandas does not automatically assume that the integer ID values in the first column correspond to the indices in the array (this way we ensure direct correspondence of our index with our position in the array, so it is easier to diagnose problems with the data if we encounter any).

Note also that here we specify the names of our columns. We could also use names given in a specific header row in the file if one exists. Here, the header row is not formatted such that the names are easy to use, so we give our own names for the columns.

Finally, we need to account for the fact that some of our values (in the parallax and its error, the `Plx` and `ePlx` columns) are not defined and are denoted with `-`. We handle this by telling Pandas to treat `-` as a `NaN` value, using `na_values='-'`. If we don’t include this instruction in the command, those columns will appear as strings (`object`) according to the `dtypes` list.

In [2]:
hipparcos = pd.read_csv('hipparcos.txt', delim_whitespace=True, skiprows=53, skipfooter=2, engine='python',
                        names=['ID','Rah','Ram','Ras','DECd','DECm','DECs','Vmag','Plx','ePlx','BV','eBV'],
                        index_col=False, na_values='-')
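
To see what whitespace splitting and `na_values='-'` do without needing the full catalogue, the following minimal sketch parses a small in-memory table. The three rows are invented for illustration (they are not taken from `hipparcos.txt`), the helper `read_sample` is hypothetical, and the sketch uses `sep=r'\s+'`, which is the regular-expression equivalent of `delim_whitespace=True` (recent pandas releases deprecate the latter in favour of it).

```python
import io
import pandas as pd

# Invented sample rows (NOT real Hipparcos data); '-' marks a missing value.
sample = """1 9.10 3.54 1.39
2 9.27 - -
3 6.61 2.81 0.63
"""
names = ['ID', 'Vmag', 'Plx', 'ePlx']

def read_sample(**kwargs):
    # sep=r'\s+' splits on runs of whitespace, like delim_whitespace=True
    return pd.read_csv(io.StringIO(sample), sep=r'\s+', names=names,
                       index_col=False, **kwargs)

with_nan = read_sample(na_values='-')   # '-' is converted to NaN
without_nan = read_sample()             # '-' is kept as text

print(with_nan.dtypes)      # Plx and ePlx are numeric here
print(without_nan.dtypes)   # Plx and ePlx are not numeric here
```

Comparing the two printed `dtypes` lists shows why the `na_values` argument matters: a single unrecognised `-` is enough to stop a whole column being read as numbers.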



Note that Pandas automatically assigns a datatype (`dtype`) to each column based on the type of values it contains. It is always good to check that the correct types have been assigned (here using the `pandas.DataFrame.dtypes` attribute), or errors may arise. If needed, we can also assign a `dtype` to each column using the `dtype` argument of the `pandas.read_csv` command.

In [3]:
print(hipparcos.dtypes,hipparcos.shape)

ID        int64
Rah       int64
Ram       int64
Ras     float64
DECd      int64
DECm      int64
DECs    float64
Vmag    float64
Plx     float64
ePlx    float64
BV      float64
eBV     float64
dtype: object (85509, 12)
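
If the inferred types are not what we want, the `dtype` argument of `pandas.read_csv` accepts a dictionary mapping column names to types. The following is a minimal sketch using invented rows rather than the real file; storing `Vmag` as `float32` is an arbitrary choice made only to show that the dictionary is respected.

```python
import io
import pandas as pd

# Invented sample rows (NOT real Hipparcos data).
sample = "1 9.10 3.54\n2 9.27 21.90\n"

# sep=r'\s+' splits on whitespace (equivalent to delim_whitespace=True)
df = pd.read_csv(io.StringIO(sample), sep=r'\s+',
                 names=['ID', 'Vmag', 'Plx'], index_col=False,
                 dtype={'ID': 'int64', 'Vmag': 'float32'})

print(df.dtypes)  # ID and Vmag follow the dictionary; Plx is still inferred
```

Columns that are not listed in the dictionary keep their automatically inferred type.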


Once we have read the data in, we should also clean it to remove `NaN` values (use the Pandas `.dropna` method). We add a print statement to see how many rows of data are left. We should then also remove parallax values ($p$) with large error bars $\Delta p$: use a conditional statement to select only items in the pandas array which satisfy $\Delta p/p \lt 0.05$. Then, let’s calculate the distance (distance in parsecs is $d=1/p$ where $p$ is the parallax in arcsec) and the absolute V-band magnitude ($V_{\rm abs} = V_{\rm mag} - 5\left[\log_{10}(d) -1\right]$), which is needed for the colour-magnitude diagram. Since the `Plx` column is given in milliarcseconds, the conversion in the code uses $d = 10^{3}/p$.
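
As a quick sanity check of these formulas, the sketch below applies them to two made-up stars. It assumes that parallaxes are stored in milliarcseconds, as the factor of `1.e3` in the lesson code implies. The function names are illustrative only and are not part of the lesson.

```python
import numpy as np

def parallax_to_distance_pc(plx_mas):
    """Distance in parsecs from parallax in milliarcseconds (d = 1/p, p in arcsec)."""
    return 1.0e3 / np.asarray(plx_mas, dtype=float)

def absolute_magnitude(vmag, dist_pc):
    """V_abs = V_mag - 5 [log10(d) - 1], with d in parsecs."""
    return np.asarray(vmag, dtype=float) - 5.0 * (np.log10(dist_pc) - 1.0)

plx = np.array([100.0, 10.0])   # made-up parallaxes in milliarcseconds
vmag = np.array([5.0, 5.0])     # made-up apparent magnitudes

dist = parallax_to_distance_pc(plx)      # 10 pc and 100 pc
vabs = absolute_magnitude(vmag, dist)    # 5.0 and 0.0
print(dist, vabs)
```

At 10 pc the apparent and absolute magnitudes coincide, which is how absolute magnitude is defined. A star with the same apparent magnitude at 100 pc must be intrinsically 5 magnitudes brighter, i.e. have a smaller $V_{\rm abs}$.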

In [4]:
hnew = hipparcos[:].dropna(how="any") # get rid of NaNs if present
print(len(hnew),"rows remaining")

# get rid of data with parallax error > 5 per cent
hclean = hnew[hnew.ePlx/np.abs(hnew.Plx) < 0.05]

# Note: the result of this column selection is not assigned, so it leaves hclean
# unchanged and does not suppress the warning shown below
hclean[['Rah','Ram','Ras','DECd','DECm','DECs','Vmag','Plx','ePlx','BV','eBV']]

hclean['dist'] = 1.e3/hclean["Plx"] # Convert parallax to distance in pc
# Convert to absolute magnitude using distance modulus
hclean['Vabs'] = hclean.Vmag - 5.*(np.log10(hclean.dist) - 1.) # Note: larger magnitudes are fainter!

85446 rows remaining


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  hclean['dist'] = 1.e3/hclean["Plx"] # Convert parallax to distance in pc
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  hclean['Vabs'] = hclean.Vmag - 5.*(np.log10(hclean.dist) - 1.) # Note: larger magnitudes are fainter!


You will probably see a `SettingWithCopyWarning` on running the cell containing this code. It arises because `hclean` was created by selecting rows from another dataframe (`hnew`), so Pandas cannot be sure whether assigning to it is meant to change only `hclean` or the original data as well. We get a warning because in some situations this kind of operation is dangerous: we could modify our dataframe in a way that affects things in unexpected ways later on. However, here we are safe, as we are creating a new column rather than modifying any existing column, so we can proceed and ignore the warning.
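
If you would rather not see the warning at all, one common approach is to take an explicit copy of the filtered rows before adding columns to them. The sketch below illustrates this on a small invented dataframe (not the Hipparcos data). It is one option among several, not the method used in this lesson.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only (parallaxes in milliarcseconds).
df = pd.DataFrame({'Vmag': [5.0, 6.0, 7.0],
                   'Plx': [100.0, 10.0, 1.0],
                   'ePlx': [1.0, 0.2, 0.9]})

# .copy() makes the filtered result an independent dataframe, so adding
# columns to it cannot be mistaken for an attempt to modify df.
clean = df[df.ePlx / np.abs(df.Plx) < 0.05].copy()

clean['dist'] = 1.e3 / clean['Plx']                            # distance in pc
clean['Vabs'] = clean.Vmag - 5. * (np.log10(clean.dist) - 1.)  # absolute magnitude

print(clean)
print(df.columns.tolist())  # the original dataframe has no new columns
```

The extra copy costs some memory, but it makes the intent explicit: `clean` is a separate table and `df` is left untouched.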
               

### **1. 2. Exploratory Data Analysis (6 pts):**
#### The parameter `Prob` gives a conservative estimate of the probability that the star is associated with the cluster, by doing a ‘clustering’ analysis of the stars in the 5-dimensional astrometric parameter space, i.e. using `RAdeg`, `DEdeg`, `Plx`, `pmRA` and `pmDE`.