# Galaxy Zoo 2: Images from Original Sample

Willett, Kyle W.; Lintott, Chris J.; Bamford, Steven P.; Masters, Karen L.; Simmons, Brooke D.; Casteels, Kevin R. V.; Edmonson, Edward M.; Fortson, Lucy F.; Kaviraj, Sugata; Keel, William C.; Melvin, Thomas; Nichol, Robert C.; Raddick, M. Jordan; Schawinski, Kevin; Simpson, Robert J.; Skibba, Ramin A.; Smith, Arfon M.; Thomas, Daniel

The Galaxy Zoo team regularly receives requests for subject images for various versions of Galaxy Zoo, in order to facilitate other investigations, e.g. machine learning projects. This repository is an updated attempt to provide those in a way that is useful to the wider community.

The images here are meant to be used with the data tables available at data.galaxyzoo.org. They are the "original" sample of subject images in Galaxy Zoo 2 (Willett et al. 2013, MNRAS, 435, 2835, DOI: 10.1093/mnras/stt1458) as identified in Table 1 of Willett et al. and also in Hart et al. (2016, MNRAS, 461, 3663, DOI: 10.1093/mnras/stw1588). The original GZ2 subjects also gave the option to view an inverted version of the subject image; these inverted images are not provided but are easily reproducible from the included subject images. 
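As a rough illustration of how the inverted views could be regenerated, the sketch below subtracts every 8-bit channel value from 255. It assumes that "inverted" means a plain colour negative, which the description above does not spell out. The helper names `invert_rgb_bytes` and `invert_image_file` are hypothetical, and the second one relies on the third-party Pillow package.

```python
def invert_rgb_bytes(raw: bytes) -> bytes:
    """Return the negative of a raw 8-bit pixel buffer (each value v becomes 255 - v)."""
    return bytes(255 - v for v in raw)


def invert_image_file(src_path: str, dst_path: str) -> None:
    """Write an inverted copy of an image file (requires Pillow)."""
    # Imported lazily so the pure-Python helper above works without Pillow.
    from PIL import Image

    with Image.open(src_path) as img:
        rgb = img.convert("RGB")
        inverted = Image.frombytes("RGB", rgb.size, invert_rgb_bytes(rgb.tobytes()))
        inverted.save(dst_path)
```

With Pillow installed, a call such as `invert_image_file("16.jpg", "16_inverted.jpg")` would write the negative next to the original; the output filename is only an example.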

If you use this dataset, please cite Willett et al. (2013) as the general data release and also cite the DOI for this dataset; if you use the updated debiased tables from Hart et al. (2016) please cite that as well.

There are 243,434 images in total. This is off by about 0.08% from the total count in the tables - it's not clear what the cause of the discrepancy is, but we don't think the missing images have any particular sampling bias, so this sample should be useful for research.

The images are available in a single zip file (images_gz2.zip).

The most recent and reliable source for morphology measurements is "GZ2 - Table 1 - Normal-depth sample with new debiasing method – CSV" (from Hart et al. 2016), which is available at data.galaxyzoo.org. To cross-reference the images with Table 1, this sample includes another CSV table (gz2_filename_mapping.csv) which contains three columns and 355,990 rows. The columns are:

    objid: the Data Release 7 (DR7) object ID for each galaxy. This should match the first column in Table 1.
    sample: string indicating the subsampling of the galaxy.  
    asset_id: an integer that corresponds to the filename of the image in the zipped file linked above.

As an example row:

587722981742084144,original,16

The galaxy in this row is in Table 1 and was identified by GZ2 volunteers as a barred spiral galaxy with a mild bulge and two tightly-wound arms (morphology='Sc2t'). It is in the original GZ2 sample, and can be found in the zipped file as 16.jpg.
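A minimal sketch of this lookup, using only the standard library, might look as follows. It assumes the mapping file has a header row naming the three columns, and that each image is stored as `<asset_id>.jpg`; the function name `load_asset_lookup` is hypothetical. Object IDs are kept as strings so that the 18-digit values are never rounded by a float conversion.

```python
import csv
import io


def load_asset_lookup(csv_file):
    """Map each objid (kept as a string) to the list of image filenames for it."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        lookup.setdefault(row["objid"], []).append(f"{row['asset_id']}.jpg")
    return lookup


# Tiny in-memory stand-in for gz2_filename_mapping.csv, using the example row.
sample_csv = io.StringIO("objid,sample,asset_id\n587722981742084144,original,16\n")
lookup = load_asset_lookup(sample_csv)
print(lookup["587722981742084144"])  # ['16.jpg']
```

For the real file, the same function could be given an open file handle, for example `open("gz2_filename_mapping.csv", newline="")`.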

The overlap between the set of images, the attached table, and Table 1 is not 100%; there are a few rows in the tables that don't have a corresponding image. Again, it's not clear what the exact reason is for this, but we suggest just dropping any missing rows/images from your analysis unless you have a need for analyzing specific subjects. If you do need a 100% complete sample, you can obtain the missing images directly from SDSS. 
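Dropping rows without an image could be sketched as below. The example builds a small in-memory archive rather than reading images_gz2.zip, and it matches on the filename stem so that it works whether or not the archive places the images in a subfolder. The internal layout of the real archive is an assumption here, as are the helper names `asset_ids_in_zip` and `keep_rows_with_images`.

```python
import io
import posixpath
import zipfile


def asset_ids_in_zip(zip_source):
    """Return the set of asset IDs (filename stems) of .jpg members in the archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return {
            posixpath.splitext(posixpath.basename(name))[0]
            for name in archive.namelist()
            if name.lower().endswith(".jpg")
        }


def keep_rows_with_images(rows, available_ids):
    """Keep only mapping rows whose asset_id has an image in the archive."""
    return [row for row in rows if str(row["asset_id"]) in available_ids]


# Demo on an in-memory archive; the "images/" folder is an assumed layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("images/16.jpg", b"")
rows = [
    {"objid": "587722981742084144", "sample": "original", "asset_id": "16"},
    {"objid": "1", "sample": "original", "asset_id": "17"},  # made-up row, no image
]
kept = keep_rows_with_images(rows, asset_ids_in_zip(buffer))
print([row["asset_id"] for row in kept])  # ['16']
```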

Based on spot checks the mappings between asset ID and DR7 object ID appear correct, but we strongly suggest that you pick some random images and verify on your own that the image seems to match the label/classifications that are listed in Table 1. 
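One possible way to choose images for such a spot check is to draw a small, reproducible random sample of mapping rows and inspect each one by eye. The rows below are made up for illustration, and `pick_spot_checks` is a hypothetical helper, not part of any Galaxy Zoo tooling.

```python
import random


def pick_spot_checks(rows, k=5, seed=0):
    """Return k randomly chosen mapping rows, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


rows = [{"objid": str(1000 + i), "asset_id": str(i)} for i in range(100)]  # made-up IDs
for row in pick_spot_checks(rows, k=3):
    print(f"open {row['asset_id']}.jpg and compare with Table 1 row {row['objid']}")
```

Fixing the seed means the same images are drawn on every run, which makes it easier to share or repeat a check.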

If you have any issues using this dataset, please contact the Galaxy Zoo team, in particular Brooke Simmons (b.simmons@lancaster.ac.uk). Should Dr Simmons be unavailable, try contacting Karen Masters or Chris Lintott.

- the GZ team, 5 Dec 2019

In [8]:
import pandas as pd

# Load the tables into DataFrames
table1 = pd.read_csv("data/GalaxyZoo1_DR_table2.csv")
table2 = pd.read_csv("data/gz2_hart16.csv")
filename_tables = pd.read_csv('data/gz2_filename_mapping.csv')

# Collect the column names of the first two tables
cols1 = set(table1.columns)
cols2 = set(table2.columns)

# Find the columns the two tables share
cols_communes = cols1.intersection(cols2)

# Print the shared columns (an empty set means no column names match)
print(cols_communes)

set()


In [7]:
merged_df = pd.merge(table1, table2, left_on='OBJID', right_on='dr7objid')

# Select the "OBJID" and "dr7objid" columns to show the shared identifiers
common_ids = merged_df[['OBJID', 'dr7objid']]
print(f"Found {len(common_ids)}/{len(table1)} galaxies in the most recent dataset.")
print(f"The initial dataset contains {len(common_ids)}/{len(table2)} of the galaxies in the most recent dataset.")

Found 239695/667944 galaxies in the most recent dataset.
The initial dataset contains 239695/239695 of the galaxies in the most recent dataset.
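
The same overlap could also be measured without discarding the unmatched rows. The sketch below uses a left merge with `indicator=True`, which adds a `_merge` column recording whether each row found a partner. The two frames are tiny made-up stand-ins; only the column names `OBJID` and `dr7objid` come from the cells above.

```python
import pandas as pd

# Made-up stand-ins for table1 and table2; only the ID columns matter here.
table1_demo = pd.DataFrame({"OBJID": [1, 2, 3, 4]})
table2_demo = pd.DataFrame({"dr7objid": [2, 4]})

# A left merge keeps every table1 row; `_merge` records whether a match was found.
flagged = pd.merge(
    table1_demo, table2_demo,
    left_on="OBJID", right_on="dr7objid",
    how="left", indicator=True,
)
n_both = int((flagged["_merge"] == "both").sum())
n_left_only = int((flagged["_merge"] == "left_only").sum())
print(n_both, n_left_only)  # 2 2
```

Filtering `flagged` on `_merge == "left_only"` would then list the galaxies that appear only in the first table.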


In [10]:
id_to_filename = pd.merge(table1, filename_tables, left_on='OBJID', right_on='objid')

# Select the "OBJID" and "objid" columns to show the shared identifiers
common_ids = id_to_filename[['OBJID', 'objid']]
print(f"Found {len(common_ids)}/{len(table1)} galaxies with an available image.")
# print(f"The initial dataset contains {len(common_ids)}/{len(table2)} of the galaxies in the most recent dataset.")

Found 248895/667944 galaxies with an available image.


In [11]:
id_to_filename.head()

Unnamed: 0,OBJID,RA,DEC,NVOTE,P_EL,P_CW,P_ACW,P_EDGE,P_DK,P_MG,P_CS,P_EL_DEBIASED,P_CS_DEBIASED,SPIRAL,ELLIPTICAL,UNCERTAIN,objid,sample,asset_id
0,587731186203885750,00:00:01.55,-00:05:33.3,59,0.712,0.0,0.0,0.22,0.068,0.0,0.22,0.64,0.29,0,0,1,587731186203885750,stripe82,278415
1,587731187277627676,00:00:01.86,+00:43:09.3,38,0.5,0.0,0.053,0.289,0.105,0.053,0.342,0.351,0.473,0,0,1,587731187277627676,stripe82,280228
2,588015507658768464,00:00:03.24,-01:06:46.8,57,0.474,0.088,0.0,0.263,0.175,0.0,0.351,0.324,0.48,0,0,1,588015507658768464,stripe82,287951
3,587731187277693069,00:00:04.12,+00:45:07.9,30,0.933,0.0,0.033,0.0,0.033,0.0,0.033,0.913,0.054,0,1,0,587731187277693069,stripe82,280231
4,587731187277693072,00:00:04.74,+00:46:54.2,36,0.722,0.083,0.111,0.083,0.0,0.0,0.278,0.606,0.394,0,0,1,587731187277693072,stripe82,280232
