Featurizing seems rather slow #2

sgbaird · 2021-08-05T04:04:48Z

Does it have to do with chunksize=1?

Lines 435 to 444 in 7c82fdc

    
    def featurize(self, compositions, how="mean"):
        elmd_obj = ElMD(metric=self.metric)
        # if type(elmd_obj.periodic_tab[self.metric]["H"]) is int:
        #     vectors = np.ndarray((len(compositions), len(elmd_obj.periodic_tab[self.metric])))
        # else:
        #     vectors = np.ndarray((len(compositions), len(elmd_obj.periodic_tab[self.metric]["H"])))
        print(f"Constructing compositionally weighted {self.metric} feature vectors for each composition")
        vectors = process_map(self._pool_featurize, compositions, chunksize=1)

SurgeArrester · 2021-08-05T14:09:02Z

That's a good point, thanks for bringing it up. I'm afraid I haven't got the bandwidth to test the best chunksizes for different dataset sizes at the moment. I've added chunksize as a parameter to the ElM2D class in 0.3.15, which will be used in each of these lines, if that's of use?

ElM2D(chunksize=64)
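
For anyone wondering what chunksize changes: as far as I understand, process_map from tqdm.contrib.concurrent forwards chunksize to the executor's map, which splits the input into chunks and submits each chunk to a worker as one task. The sketch below uses neither ElM2D nor multiprocessing; iter_chunks and count_tasks are hypothetical helpers that only show how many tasks get dispatched for a given chunksize, and the compositions list is made-up placeholder data.

```python
from itertools import islice
from math import ceil


def iter_chunks(items, chunksize):
    """Yield successive lists of at most `chunksize` items (illustrative helper)."""
    it = iter(items)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


def count_tasks(n_items, chunksize):
    """Number of chunks (tasks) needed to cover n_items."""
    return ceil(n_items / chunksize)


# Placeholder data standing in for a list of compositions.
compositions = [f"comp_{i}" for i in range(1000)]

# chunksize=1: every composition becomes its own task.
tasks_1 = sum(1 for _ in iter_chunks(compositions, 1))

# chunksize=64: compositions are grouped, so far fewer tasks are dispatched.
tasks_64 = sum(1 for _ in iter_chunks(compositions, 64))

print(tasks_1, tasks_64)
```

Fewer, larger tasks usually mean less per-task overhead (pickling and inter-process communication), but the best value likely depends on dataset size and per-item cost, so it still needs benchmarking.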

I'm afraid I haven't been able to fix the other Spyder issue, but will leave it open for now.

sgbaird · 2021-08-05T23:32:42Z

This is great, thanks! Also, no pressure on the Spyder issue.

sgbaird · 2021-09-08T02:01:28Z

Something like the following seems to be a lot faster (probably because it only involves a single call to ElMD()):

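from operator import attrgetter

import numpy as np
from ElMD import ElMD

# Build the ElMD object once and reuse it for every composition.
E = ElMD()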

def gen_ratio_vector(comp):
    """Create a numpy array from a composition dictionary."""
    if isinstance(comp, str):
        comp = E._parse_formula(comp)
        comp = E._normalise_composition(comp)

    sorted_keys = sorted(comp.keys())
    comp_labels = [E._get_position(k) for k in sorted_keys]
    comp_ratios = [comp[k] for k in sorted_keys]

    indices = np.array(comp_labels, dtype=np.int64)
    ratios = np.array(comp_ratios, dtype=np.float64)

    numeric = np.zeros(shape=len(E.periodic_tab[E.metric]), dtype=np.float64)
    numeric[indices] = ratios

    return numeric

def gen_ratio_vectors(comps):
    return np.array([gen_ratio_vector(comp) for comp in comps])

U = gen_ratio_vectors(formulas)
V = gen_ratio_vectors(formulas2)

lookup, periodic_tab, metric = attrgetter("lookup", "periodic_tab", "metric")(E)
ptab_metric = periodic_tab[metric]

def get_mod_petti(x):
    return [ptab_metric[lookup[a]] if b > 0 else 0 for a, b in enumerate(x)]

def get_mod_pettis(X):
    return np.array([get_mod_petti(x) for x in X])

U_weights = get_mod_pettis(U)
V_weights = get_mod_pettis(V)
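
The nested list comprehension in get_mod_pettis could probably be replaced by a single np.where call. This is only a sketch under assumptions: it assumes the metric maps each element to a single scalar (as I believe mod_petti does), and the lookup and ptab_metric values below are made-up toy data, not the real ElMD tables.

```python
import numpy as np

# Toy stand-ins for E.lookup and E.periodic_tab[E.metric] (values are invented).
lookup = ["H", "He", "Li", "Be"]
ptab_metric = {"H": 10.0, "He": 20.0, "Li": 30.0, "Be": 40.0}


def get_mod_pettis_loop(X):
    """Loop version, mirroring the snippet above."""
    return np.array(
        [[ptab_metric[lookup[a]] if b > 0 else 0 for a, b in enumerate(x)] for x in X]
    )


# Precompute the scale value for every position once.
scale = np.array([ptab_metric[lookup[a]] for a in range(len(lookup))])


def get_mod_pettis_vectorized(X):
    """Keep the scale value where the ratio is positive, 0 elsewhere."""
    X = np.asarray(X)
    return np.where(X > 0, scale, 0.0)


# Two toy ratio vectors, one row per composition.
X = np.array([[0.5, 0.0, 0.5, 0.0], [0.0, 0.25, 0.0, 0.75]])

loop_result = get_mod_pettis_loop(X)
vec_result = get_mod_pettis_vectorized(X)
```

Both versions should give the same array; the vectorized one relies on broadcasting scale across the rows of X.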

SurgeArrester · 2021-10-11T13:33:30Z

It looks like this issue was introduced when the additional lookup tables were added. Because it was loading a large JSON file from disk into RAM for each composition, parsing slowed down considerably. I've reduced the memory overhead of this function in ElMD and partly followed this suggestion by caching the function's output in local memory, which has significantly sped up parsing. That's now pushed to ElM2D==0.4.0 and ElMD==0.4.2.
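
For readers who hit the same pattern elsewhere, here is a minimal sketch of the general idea of caching a table load so it happens once rather than once per composition. It is not the actual ElMD implementation: load_table, featurize_one, the "toy_metric" name, and the JSON payload are all hypothetical, and the payload is parsed from a string instead of read from disk.

```python
import json
from functools import lru_cache

# Stand-in for a lookup table stored as JSON on disk (invented values).
_FAKE_JSON = '{"H": 1.0, "O": 2.0}'


@lru_cache(maxsize=None)
def load_table(metric):
    """Parse the table once per metric; later calls return the cached dict."""
    return json.loads(_FAKE_JSON)


def featurize_one(elements, metric="toy_metric"):
    """Look up a value for each element using the cached table."""
    table = load_table(metric)
    return [table[el] for el in elements]


# 1000 compositions, but the table is only parsed on the first call.
vectors = [featurize_one(["H", "O"]) for _ in range(1000)]

info = load_table.cache_info()
print(info.misses, info.hits)
```

The cached dict is shared between calls, so callers should treat it as read-only.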

SurgeArrester closed this as completed Oct 11, 2021
