# Data analysis for Pipe 3D, Integral Field Spectroscopy

For further information about the dataset, see the project [README](https://github.com/nestornav/astroestadistica_2022).

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from astropy.table import Table
from astropy.io import fits

### Importing the data

The original files list the column names in their first N rows, one name per row. I decided to remove those rows from all the CSV files and to define a custom column list per table. I also gave the error columns distinct suffixes so they cannot be confused, even after merging all the tables.
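
As an alternative to editing the files by hand, the header rows could be skipped at read time. The following is a minimal sketch under stated assumptions: the in-memory text, the `N_HEADER_ROWS` value and the `read_table` helper are hypothetical and only stand in for the real files. It also shows how `suffixes` keeps identically named error columns apart after a merge.

```python
import io
import pandas as pd

# Hypothetical value: number of leading rows that hold column names.
N_HEADER_ROWS = 2

# Hypothetical miniature files: header rows first, then ';'-separated data.
raw_avg = "califa_name\nerror_ay\nIC5376;0.51\nUGC00005;0.42\n"
raw_re = "califa_name\nerror_ay\nIC5376;0.5\nUGC00005;0.28\n"


def read_table(text, names):
    """Read one table, skipping the header rows and applying custom names."""
    return pd.read_csv(io.StringIO(text), sep=';',
                       skiprows=N_HEADER_ROWS, names=names)


df_avg = read_table(raw_avg, ['califa_name', 'error_ay'])
df_re = read_table(raw_re, ['califa_name', 'error_ay'])

# Both tables share the name 'error_ay'; the suffixes keep them distinct.
merged = df_avg.merge(df_re, on='califa_name', suffixes=('_avg', '_re'))
print(merged.columns.tolist())
```

Without explicit `suffixes`, pandas would fall back to `_x` and `_y`, which makes it harder to tell which table an error column came from.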


### Context information
* Table 8: contains the integrated properties of the CALIFA DR2 galaxies.
* Table 9: contains the average properties of the CALIFA DR2 galaxies.
* Table 10: contains the properties of the CALIFA DR2 galaxies at the effective radius.

After importing the data, I will check its integrity and take action where needed.

In [2]:
cols_tab_10 = ['califa_name', 'log(age/yr)_ssp_integrated', 'error_ay','[Z/H]_ssp_integrated', 'error_zh',
              'av_ssp_integrated', 'error_spp', '12+log(O/H)_O3N2_integrated', 'error_oh',
              'av_gas(mag)_integrated', 'error_mag']
cols_tab_9 = ['califa_name', 'log(age/yr)_ssp_avg', 'error_ay','[Z/H]_ssp_avg', 'error_zh',
              'av_ssp_avg', 'error_avg', '12+log(O/H)_O3N2_avg', 'error_oh', 'av_gas_mag_avg', 'error_mag_avg']
cols_tab_8 = ['califa_name', 'id', 'ra_J2000', 'dec_J200', 'redshift', 'log(Mass/Msun)_er',
               'error_mm_er', 'log(SFR/Msun/yr)_er', 'error_sfr']

In [14]:
table8_path = '../data/pipe3d/DR2_Pipe3D_obj.tab.csv'
table9_path = '../data/pipe3d/DR2_Pipe3D_mean.tab.csv'
table10_path = '../data/pipe3d/DR2_Pipe3D_Re.tab.csv'

df_tab_8 = pd.read_csv(table8_path, names=cols_tab_8, delimiter=',')
df_tab_9 = pd.read_csv(table9_path, names=cols_tab_9, delimiter=';')
df_tab_10 = pd.read_csv(table10_path, names=cols_tab_10, delimiter=';')

In [5]:
print(f'Shape of Table 8, integrated properties of galaxies: {df_tab_8.shape}')
print(f'Shape of Table 9, average properties of galaxies: {df_tab_9.shape}')
print(f'Shape of Table 10, properties of galaxies at the effective radius: {df_tab_10.shape}')

Shape of Table 8, integrated properties of galaxies: (200, 9)
Shape of Table 9, average properties of galaxies: (200, 11)
Shape of Table 10, properties of galaxies at the effective radius: (200, 11)


In [15]:
df_tab_8.head()

| | califa_name | id | ra_J2000 | dec_J200 | redshift | log(Mass/Msun)_er | error_mm_er | log(SFR/Msun/yr)_er | error_sfr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | IC;5376; | 1 | 00;:01:19.77 | +34:31:32.52 | 0.0166 | 10.65 | 0.1 | 0.006 | 0.105 |
| 1 | UG;C000;05 | 2 | ;00:03:05.63 | -01:54:49.67 | 0.024 | 11.16 | 0.08 | 0.867 | 0.066 |
| 2 | NG;C781;9 | 3 | 0;0:04:24.50 | +31:28:19.20 | 0.0165 | 10.61 | 0.08 | 0.412 | 0.067 |
| 3 | IC;1528; | 5 | 00;:05:05.37 | -07:05:36.23 | 0.0125 | 10.54 | 0.11 | 0.197 | 0.064 |
| 4 | UG;C000;36 | 7 | ;00:05:13.87 | +06:46:19.20 | 0.0208 | 11.06 | 0.09 | 0.186 | 0.134 |


In [9]:
df_tab_8.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   califa_name          200 non-null    object 
 1   id                   200 non-null    object 
 2   ra_J2000             200 non-null    object 
 3   dec_J200             200 non-null    object 
 4   redshift             200 non-null    float64
 5   log(Mass/Msun)_er    200 non-null    float64
 6   error_mm_er          200 non-null    float64
 7   log(SFR/Msun/yr)_er  200 non-null    float64
 8   error_sfr            200 non-null    float64
dtypes: float64(5), object(4)
memory usage: 14.2+ KB


In [7]:
df_tab_9.head()

| | califa_name | log(age/yr)_ssp_avg | error_ay | [Z/H]_ssp_avg | error_zh | av_ssp_avg | error_avg | 12+log(O/H)_O3N2_avg | error_oh | av_gas_mag_avg | error_mag_avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | IC5376 | 8.99 | 0.51 | -0.35 | 0.16 | 0.57 | 1.17 | 8.5 | 0.05 | 1.24 | 0.03 |
| 1 | UGC00005 | 8.97 | 0.42 | -0.31 | 0.13 | 0.26 | 0.83 | 8.54 | 0.06 | 1.3 | 0.04 |
| 2 | NGC7819 | 8.68 | 0.45 | -0.29 | 0.11 | 0.1 | 0.62 | 8.47 | 0.07 | 0.9 | 0.08 |
| 3 | IC1528 | 8.75 | 0.54 | -0.34 | 0.14 | 0.22 | 0.82 | 8.47 | 0.08 | 0.77 | 0.03 |
| 4 | UGC00036 | 9.42 | 0.26 | -0.17 | 0.12 | 0.1 | 0.69 | 8.52 | 0.06 | 1.33 | 0.09 |


In [10]:
df_tab_9.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   califa_name           200 non-null    object 
 1   log(age/yr)_ssp_avg   200 non-null    float64
 2   error_ay              200 non-null    float64
 3   [Z/H]_ssp_avg         200 non-null    float64
 4   error_zh              200 non-null    float64
 5   av_ssp_avg            200 non-null    float64
 6   error_avg             200 non-null    float64
 7   12+log(O/H)_O3N2_avg  168 non-null    float64
 8   error_oh              168 non-null    float64
 9   av_gas_mag_avg        186 non-null    float64
 10  error_mag_avg         186 non-null    float64
dtypes: float64(10), object(1)
memory usage: 17.3+ KB


In [8]:
df_tab_10.head()

| | califa_name | log(age/yr)_ssp_integrated | error_ay | [Z/H]_ssp_integrated | error_zh | av_ssp_integrated | error_spp | 12+log(O/H)_O3N2_integrated | error_oh | av_gas(mag)_integrated | error_mag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | IC5376 | 9.37 | 0.5 | -0.44 | 0.13 | 0.48 | 0.11 | 8.53 | 0.04 | 1.34 | 0.21 |
| 1 | UGC00005 | 9.18 | 0.28 | -0.37 | 0.11 | 0.29 | 0.09 | 8.57 | 0.04 | 1.28 | 0.18 |
| 2 | NGC7819 | 8.81 | 0.44 | -0.28 | 0.14 | 0.06 | 0.06 | 8.47 | 0.07 | 0.82 | 0.17 |
| 3 | IC1528 | 9.03 | 0.31 | -0.41 | 0.14 | 0.19 | 0.1 | 8.51 | 0.04 | 0.77 | 0.14 |
| 4 | UGC00036 | 9.56 | 0.44 | -0.24 | 0.15 | 0.1 | 0.06 | 8.53 | 0.05 | 1.22 | 0.25 |


In [11]:
df_tab_10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   califa_name                  200 non-null    object 
 1   log(age/yr)_ssp_integrated   200 non-null    float64
 2   error_ay                     200 non-null    float64
 3   [Z/H]_ssp_integrated         200 non-null    float64
 4   error_zh                     200 non-null    float64
 5   av_ssp_integrated            200 non-null    object 
 6   error_spp                    200 non-null    float64
 7   12+log(O/H)_O3N2_integrated  156 non-null    float64
 8   error_oh                     156 non-null    float64
 9   av_gas(mag)_integrated       162 non-null    float64
 10  error_mag                    162 non-null    float64
dtypes: float64(9), object(2)
memory usage: 17.3+ KB


For all the imports I needed to set the delimiter parameter to get a correct import. After checking the data I realized that table 8 (integrated properties) has several stray characters (such as `;`) in its values, as the `head()` output above shows. In that case, the solution was to change the delimiter.
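
To illustrate why the delimiter matters, here is a small sketch that reads the same text with two different delimiters. The two sample rows are a hypothetical excerpt built from values shown above, not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample using ';' as the field separator.
sample = "IC5376;8.99;0.51\nUGC00005;8.97;0.42\n"

# With the wrong delimiter every row collapses into a single column.
wrong = pd.read_csv(io.StringIO(sample), header=None, delimiter=',')

# With the matching delimiter the three fields are split as expected.
right = pd.read_csv(io.StringIO(sample), header=None, delimiter=';')

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

Comparing the resulting shape against the expected number of columns is a quick way to notice a wrong delimiter.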

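Table 8 also has errors in columns such as `califa_name` and `ra_J2000`, so I need to clean them to get good data quality. The remaining tables were imported correctly and showed no obvious errors in the displayed values, although `av_ssp_integrated` in table 10 was read as `object`, which suggests it contains some non-numeric entries.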
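
One possible way to repair the affected text columns is sketched below. It assumes that stray `;` characters are the only corruption, which holds for the rows displayed above but has not been verified for the whole table. The small `sample` frame is built from those displayed rows.

```python
import pandas as pd

# Rows copied from the head() output of table 8 shown earlier.
sample = pd.DataFrame({
    'califa_name': ['IC;5376;', 'UG;C000;05', 'NG;C781;9'],
    'ra_J2000': ['00;:01:19.77', ';00:03:05.63', '0;0:04:24.50'],
})

cleaned = sample.copy()
for col in ['califa_name', 'ra_J2000']:
    # Drop every literal ';' (regex=False) and trim surrounding whitespace.
    cleaned[col] = cleaned[col].str.replace(';', '', regex=False).str.strip()

print(cleaned)
```

After this step the galaxy names match the form used in tables 9 and 10 (for example `IC5376`), which matters if `califa_name` is later used as a merge key.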

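Tables 9 and 10 have null values in the gas-related columns: 32 rows (oxygen abundance) and 14 rows (gas extinction) in table 9, and 44 and 38 rows respectively in table 10. This will be tackled in the data preparation step.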
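
The null counts can be read off `info()`, but a direct tally is easier to compare across tables. A minimal sketch on a toy frame follows; the galaxy names `A` to `D` and the placement of the missing values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two of the gas-related columns with some missing values.
toy = pd.DataFrame({
    'califa_name': ['A', 'B', 'C', 'D'],
    '12+log(O/H)_O3N2_avg': [8.5, np.nan, 8.47, np.nan],
    'av_gas_mag_avg': [1.24, 1.3, np.nan, 0.77],
})

# Number and fraction of missing values per column.
null_counts = toy.isna().sum()
null_fraction = toy.isna().mean()

print(null_counts)
print(null_fraction)
```

The same two calls applied to `df_tab_9` and `df_tab_10` would give the per-column totals needed to decide between dropping rows and imputing values.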

# Data preparation

The main idea of this step is to provide a dataset to start working on the assignment.