
# Data Preparation

## Data Info

* The sample file (test_sample_posvel) contains position and velocity (posvel) information for 1000 random samples, used for testing.
* Gal_phi can be omitted, as it is always 0 in our case.
* For each sample, our goal is to obtain the orbital parameters and/or Action_Energy information, based on the test grids.

The samples are the orbital parameters (orb_param) and/or actions.
The hyperparameters are the positions and velocities (posvel).

#### Samples

In [1]:
import pandas as pd
df_OrbP= pd.read_csv('../data/dyn_grid_test/grid/test_grid_orbparam.csv')
df_OrbP

Unnamed: 0,rmin,rmax,zmax,ecc,incl
0,4.048712,8.006926,0.0,0.529039,0.855806
1,3.633552,7.950900,0.0,0.545379,0.811017
2,3.262327,7.902003,0.0,0.560080,0.761661
3,2.953663,7.868850,0.0,0.571909,0.707319
4,2.744357,7.864369,0.0,0.579107,0.647628
...,...,...,...,...,...
4084096,3.133519,21.157232,0.0,0.449318,0.194738
4084097,3.743006,21.511620,0.0,0.455086,0.221849
4084098,4.392057,21.938562,0.0,0.462619,0.249772
4084099,5.074083,22.427464,0.0,0.470489,0.278077


In [2]:
len(df_OrbP['zmax'].unique())

1
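The check above shows that zmax takes a single value across the whole grid. As a minimal sketch, the same test can be applied to every column at once; the helper name `constant_columns` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) == 1]


# Toy stand-in for the orbital-parameter grid (values are illustrative only).
toy = pd.DataFrame({
    "rmin": [4.0, 3.6, 3.2],
    "zmax": [0.0, 0.0, 0.0],
    "ecc": [0.53, 0.55, 0.56],
})

const_cols = constant_columns(toy)
print(const_cols)
```

Columns found this way carry no information for training and are candidates for dropping.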

#### Hyperparameters

In [4]:
import pandas as pd
df_posval= pd.read_csv('../data/dyn_grid_test/grid/test_grid_posvel.csv')
df_posval

Unnamed: 0,Gal_R,Gal_phi,Gal_z,v_R,v_phi,v_z
0,7.0,0.0,-1.0,-100.0,100.0,-100.0
1,7.0,0.0,-1.0,-100.0,100.0,-90.0
2,7.0,0.0,-1.0,-100.0,100.0,-80.0
3,7.0,0.0,-1.0,-100.0,100.0,-70.0
4,7.0,0.0,-1.0,-100.0,100.0,-60.0
...,...,...,...,...,...,...
4084096,9.0,0.0,1.0,100.0,300.0,60.0
4084097,9.0,0.0,1.0,100.0,300.0,70.0
4084098,9.0,0.0,1.0,100.0,300.0,80.0
4084099,9.0,0.0,1.0,100.0,300.0,90.0


Slim down the repeated values by converting low-cardinality columns to categorical data.

In [5]:
for col in df_posval.columns:
    n_unique = df_posval[col].nunique()
    # Columns with few distinct values are stored more compactly as categories.
    if n_unique < 200:
        print(col, 'only has', n_unique, 'unique value(s)')
        df_posval[col] = df_posval[col].astype('category')

df_posval.dtypes


Gal_R only has 21 unique value(s)
Gal_phi only has 1 unique value(s)
Gal_z only has 21 unique value(s)
v_R only has 21 unique value(s)
v_phi only has 21 unique value(s)
v_z only has 21 unique value(s)


Gal_R      category
Gal_phi    category
Gal_z      category
v_R        category
v_phi      category
v_z        category
dtype: object
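To illustrate why the categorical conversion reduces the data size, here is a small sketch comparing the memory of a float column with its categorical version. The toy column and its 21 levels are assumptions chosen to resemble Gal_R; the real savings depend on the actual data.

```python
import numpy as np
import pandas as pd

# Toy column with 21 distinct levels, loosely mimicking Gal_R (assumed values).
rng = np.random.default_rng(0)
levels = np.round(np.linspace(7.0, 9.0, 21), 1)
toy = pd.DataFrame({"Gal_R": rng.choice(levels, size=100_000)})

# float64 stores 8 bytes per row; a category stores a small integer code per row
# plus one copy of each distinct level.
bytes_float = toy["Gal_R"].memory_usage(index=False, deep=True)
as_cat = toy["Gal_R"].astype("category")
bytes_cat = as_cat.memory_usage(index=False, deep=True)

print(bytes_float, bytes_cat)

# Cast back to float before any numeric use, since arithmetic is not
# defined on categorical columns.
restored = as_cat.astype("float64")
```

The conversion is lossless here: casting back to float64 recovers the original values.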

### Merge and prep

Keep the rows of both tables aligned, and drop the constant columns to reduce the data size (for now).



In [6]:
df = pd.concat([df_OrbP, df_posval], axis=1)
df = df.drop(columns=['Gal_phi', 'zmax'])  # both columns hold a single unique value
df


Unnamed: 0,rmin,rmax,ecc,incl,Gal_R,Gal_z,v_R,v_phi,v_z
0,4.048712,8.006926,0.529039,0.855806,7.0,-1.0,-100.0,100.0,-100.0
1,3.633552,7.950900,0.545379,0.811017,7.0,-1.0,-100.0,100.0,-90.0
2,3.262327,7.902003,0.560080,0.761661,7.0,-1.0,-100.0,100.0,-80.0
3,2.953663,7.868850,0.571909,0.707319,7.0,-1.0,-100.0,100.0,-70.0
4,2.744357,7.864369,0.579107,0.647628,7.0,-1.0,-100.0,100.0,-60.0
...,...,...,...,...,...,...,...,...,...
4084096,3.133519,21.157232,0.449318,0.194738,9.0,1.0,100.0,300.0,60.0
4084097,3.743006,21.511620,0.455086,0.221849,9.0,1.0,100.0,300.0,70.0
4084098,4.392057,21.938562,0.462619,0.249772,9.0,1.0,100.0,300.0,80.0
4084099,5.074083,22.427464,0.470489,0.278077,9.0,1.0,100.0,300.0,90.0
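`pd.concat(..., axis=1)` aligns rows on the index, so the merge is only row-for-row correct when both frames share the same index. A minimal sketch of a guard for this follows; the helper name `concat_aligned` and the toy frames are hypothetical.

```python
import pandas as pd


def concat_aligned(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Concatenate column-wise, refusing frames whose indices differ."""
    if not left.index.equals(right.index):
        raise ValueError("indices differ; concat(axis=1) would introduce NaN rows")
    return pd.concat([left, right], axis=1)


# Toy frames standing in for df_OrbP and df_posval (illustrative values).
orb = pd.DataFrame({"rmin": [4.0, 3.6], "rmax": [8.0, 7.9]})
pos = pd.DataFrame({"Gal_R": [7.0, 7.0], "v_z": [-100.0, -90.0]})

merged = concat_aligned(orb, pos)
print(merged.shape)
```

Both CSV files here are read with the default integer index, so the guard would pass as long as they have the same number of rows.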


In [8]:
train_frac = 0.8
val_frac = 0.1
test_frac = 0.1

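from sklearn.model_selection import train_test_split
# First split off the training set; the remainder is shared by validation and test.
df_train, df_val_test = train_test_split(df, train_size=train_frac, test_size=val_frac + test_frac)
# Split the remainder (not the full df) in half so that the three sets stay disjoint.
df_val, df_test = train_test_split(df_val_test, train_size=0.5, test_size=0.5)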

df_val
 



Unnamed: 0,rmin,rmax,ecc,incl,Gal_R,Gal_z,v_R,v_phi,v_z
349619,0.796403,9.903461,0.195646,0.087156,7.1,0.6,50.0,260.0,10.0
120556,1.402020,8.393775,0.370242,0.363647,7.0,0.3,-100.0,170.0,60.0
2939773,2.105200,8.539642,0.412610,0.446002,8.5,-0.8,-10.0,130.0,-60.0
1206934,3.077691,12.414352,0.249501,0.340186,7.6,-0.6,-40.0,270.0,-90.0
1905206,2.421332,8.300717,0.370370,0.542881,7.9,0.6,50.0,140.0,-80.0
...,...,...,...,...,...,...,...,...,...
3963547,0.855373,12.741110,0.305564,0.120058,9.0,-0.3,100.0,230.0,-30.0
1866132,0.247925,7.902914,0.053931,0.051982,7.9,0.2,0.0,220.0,-10.0
2701458,2.668953,11.699124,0.198603,0.309218,8.3,0.8,40.0,250.0,80.0
561282,1.485979,9.098982,0.125287,0.217565,7.2,0.8,20.0,250.0,50.0
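A three-way split should leave the training, validation, and test sets disjoint and jointly covering every row. The sketch below shows one way to check this on toy data. It swaps in a plain NumPy permutation for `train_test_split` to stay dependency-light; the function `split_three_way` and the fixed seed are assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd


def split_three_way(frame, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle row positions once, then cut them into train/val/test slices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(frame))
    n_train = int(round(train_frac * len(frame)))
    n_val = int(round(val_frac * len(frame)))
    train = frame.iloc[order[:n_train]]
    val = frame.iloc[order[n_train:n_train + n_val]]
    test = frame.iloc[order[n_train + n_val:]]  # whatever remains
    return train, val, test


toy = pd.DataFrame({"x": np.arange(1000)})
train, val, test = split_three_way(toy)

# The three index sets must not overlap and must cover all rows.
disjoint = (
    set(train.index).isdisjoint(val.index)
    and set(train.index).isdisjoint(test.index)
    and set(val.index).isdisjoint(test.index)
)
print(len(train), len(val), len(test), disjoint)
```

Fixing the seed makes the split reproducible, which helps when the saved files are regenerated later.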


In [9]:
df_train.to_parquet('../data/train_'+str(train_frac)+'.pq')
df_val.to_parquet('../data/val_'+str(val_frac)+'.pq')
df_test.to_parquet('../data/test_'+str(test_frac)+'.pq')


In [13]:
!python ../nf_train.py -device cpu -workers 8 -N 100 -sample_cols rmin:rmax:ecc:incl -pop_cols Gal_R:Gal_z:v_R:v_phi:v_z

5 4
Using cpu , Early stopping is on.DataLoader is on.
start
100%|██████████████████████████████████████| 100/100 [5:46:41<00:00, 208.01s/it]
Time used for training:20801.201719522476 s
run infomation saved to ../default.json
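The command above passes column names as colon-separated lists. The source of `nf_train.py` is not shown here, so the following is only a sketch of how such arguments could be parsed with `argparse`; the helper `parse_cols` is hypothetical and the real script may work differently.

```python
import argparse


def parse_cols(spec: str) -> list:
    """Split a colon-separated column spec such as 'rmin:rmax' into names."""
    return [name for name in spec.split(":") if name]


# Hypothetical parser mirroring two of the flags used in the command above.
parser = argparse.ArgumentParser()
parser.add_argument("-sample_cols", type=parse_cols, required=True)
parser.add_argument("-pop_cols", type=parse_cols, required=True)

args = parser.parse_args([
    "-sample_cols", "rmin:rmax:ecc:incl",
    "-pop_cols", "Gal_R:Gal_z:v_R:v_phi:v_z",
])
print(len(args.pop_cols), len(args.sample_cols))
```

With these inputs there are five population columns and four sample columns, which is consistent with the `5 4` at the top of the log, although what the script actually prints there is not confirmed.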


In [11]:
print('done')

done


In [91]:
df[['rmin','rmax', 'ecc', 'incl']].values.shape

(4084101, 4)
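The row count above is consistent with a full Cartesian grid: five non-constant posvel columns with 21 levels each give 21^5 rows. The sketch below checks the arithmetic and builds a tiny two-axis grid in the same way. That the real grid was generated as a full product, and the level values used here, are assumptions.

```python
import itertools

import numpy as np

n_levels = 21  # unique values per column, as reported above
n_axes = 5     # Gal_R, Gal_z, v_R, v_phi, v_z (Gal_phi is constant)
expected_rows = n_levels ** n_axes
print(expected_rows)  # 4084101

# Tiny demo grid: 3 levels on each of two axes gives 3 * 3 = 9 points.
axes = {
    "Gal_R": np.linspace(7.0, 9.0, 3),
    "Gal_z": np.linspace(-1.0, 1.0, 3),
}
grid = list(itertools.product(*axes.values()))
print(len(grid))  # 9
```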

The Action_Energy (Act E) data will be added later.

In [7]:
import pandas as pd
df_ActE= pd.read_csv('../data/dyn_grid_test/grid/test_grid_Action_Energy.csv')
df_ActE

Unnamed: 0,JR,Jz,Jphi,Energy
0,218.887129,205.709175,700.0,-174961.401486
1,233.331941,168.998918,700.0,-175911.401486
2,246.498037,137.403249,700.0,-176761.401486
3,258.360725,110.657939,700.0,-177511.401486
4,268.880658,88.436184,700.0,-178161.401486
...,...,...,...,...
4084096,458.623415,50.607875,2700.0,-125342.731228
4084097,478.602574,63.050676,2700.0,-124692.731228
4084098,501.925121,78.107250,2700.0,-123942.731228
4084099,528.876465,95.829913,2700.0,-123092.731228


#### Check MPS or GPU config

In [1]:
import torch
torch.backends.mps.is_available()

True