Featurizing seems rather slow #2

Closed · sgbaird opened this issue Aug 5, 2021 · 4 comments

Comments

sgbaird commented Aug 5, 2021

Does it have to do with chunksize=1?

ElM2D/ElM2D/ElM2D.py, lines 435 to 444 at commit 7c82fdc:

def featurize(self, compositions, how="mean"):
elmd_obj = ElMD(metric=self.metric)
# if type(elmd_obj.periodic_tab[self.metric]["H"]) is int:
# vectors = np.ndarray((len(compositions), len(elmd_obj.periodic_tab[self.metric])))
# else:
# vectors = np.ndarray((len(compositions), len(elmd_obj.periodic_tab[self.metric]["H"])))
print(f"Constructing compositionally weighted {self.metric} feature vectors for each composition")
vectors = process_map(self._pool_featurize, compositions, chunksize=1)
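
(For context, process_map here is tqdm.contrib.concurrent.process_map, which splits the iterable into chunks of chunksize before handing them to worker processes, so with chunksize=1 every composition pays its own inter-process dispatch cost. A rough sketch of the effect; featurize_one is a hypothetical stand-in for _pool_featurize, not ElM2D code:)

from tqdm.contrib.concurrent import process_map

def featurize_one(comp):
    # hypothetical cheap per-composition featurizer standing in for _pool_featurize
    return len(comp)

if __name__ == "__main__":
    comps = ["NaCl", "Fe2O3", "LiFePO4"] * 10_000
    # chunksize=1 dispatches one composition per worker task (high overhead for cheap work)
    slow = process_map(featurize_one, comps, chunksize=1)
    # a larger chunksize amortises the dispatch cost over many compositions
    fast = process_map(featurize_one, comps, chunksize=64)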

SurgeArrester (Collaborator) commented

That's a good point, thanks for bringing it up. I'm afraid I haven't got the bandwidth to test out the best chunk sizes for different-sized datasets at the moment, but I've added chunksize as a parameter to the ElM2D class in 0.3.15, which will be used in each of these lines, if that's of use:

ElM2D(chunksize=64)
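
(A fuller usage sketch, assuming the featurize signature quoted above; the import path and method names may differ between versions:)

from ElM2D import ElM2D

mapper = ElM2D(chunksize=64)          # larger chunks cut per-task dispatch overhead
vectors = mapper.featurize(formulas)  # formulas: a list of composition strings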

I'm afraid I haven't been able to fix the other Spyder issue, but I will leave it open for now.

sgbaird commented Aug 5, 2021

This is great, thanks! Also, no pressure on the Spyder issue.

sgbaird commented Sep 8, 2021

Something like the following seems to be a lot faster (probably because it only involves a single call to ElMD()):

import numpy as np
from operator import attrgetter
from ElMD import ElMD

E = ElMD()

def gen_ratio_vector(comp):
    """Create a numpy array from a composition dictionary."""
    if isinstance(comp, str):
        comp = E._parse_formula(comp)
        comp = E._normalise_composition(comp)

    sorted_keys = sorted(comp.keys())
    comp_labels = [E._get_position(k) for k in sorted_keys]
    comp_ratios = [comp[k] for k in sorted_keys]

    indices = np.array(comp_labels, dtype=np.int64)
    ratios = np.array(comp_ratios, dtype=np.float64)

    numeric = np.zeros(shape=len(E.periodic_tab[E.metric]), dtype=np.float64)
    numeric[indices] = ratios

    return numeric

def gen_ratio_vectors(comps):
    return np.array([gen_ratio_vector(comp) for comp in comps])

U = gen_ratio_vectors(formulas)   # formulas, formulas2: lists of composition strings
V = gen_ratio_vectors(formulas2)

lookup, periodic_tab, metric = attrgetter("lookup", "periodic_tab", "metric")(E)
ptab_metric = periodic_tab[metric]

def get_mod_petti(x):
    return [ptab_metric[lookup[a]] if b > 0 else 0 for a, b in enumerate(x)]

def get_mod_pettis(X):
    return np.array([get_mod_petti(x) for x in X])

U_weights = get_mod_pettis(U)
V_weights = get_mod_pettis(V)
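
(One way to check the speed difference on your own data; a rough sketch that assumes the helpers above are defined and formulas is a list of composition strings:)

import time
from ElM2D import ElM2D

start = time.perf_counter()
U = gen_ratio_vectors(formulas)            # single shared ElMD instance
print(f"gen_ratio_vectors: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
vectors = ElM2D().featurize(formulas)      # library path quoted earlier in this thread
print(f"ElM2D.featurize:   {time.perf_counter() - start:.2f} s")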

SurgeArrester (Collaborator) commented

It looks like this issue was introduced when the additional lookup tables were added. Because it was loading a large JSON file from disk into RAM for each composition, it was slowing things down a lot. I've reduced the memory overhead of this function in ElMD and followed this suggestion a bit by caching the function's output in local memory, which has significantly sped up parsing. That's now pushed in ElM2D==0.4.0 and ElMD==0.4.2.
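
(For anyone following along, functools.lru_cache is one way to get this kind of caching, i.e. loading a lookup table from disk once and reusing it across compositions. A sketch of the idea only; the file path is hypothetical and the actual ElMD implementation may differ:)

import json
from functools import lru_cache

@lru_cache(maxsize=None)
def load_lookup(metric):
    # read the elemental lookup table once per metric, then serve it from memory
    with open(f"{metric}.json") as f:  # hypothetical path
        return json.load(f)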
