# Synthesize data

<!--<badge>--><a href="https://colab.research.google.com/github/kuennethgroup/ml_in_ms_st25/blob/main/05_ex/synthesize_your_own_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><!--</badge>-->


# Tutorial 

1. Synthesize 500 data points with the columns Nanoclay Content (wt%), Tensile Strength (MPa), Flexural Strength (MPa), and Impact Strength (kJ/m²).
2. Generate Nanoclay Content (wt%) in the range 0-10 with the NumPy random number generator, e.g., `np.random.uniform`. Don't forget to set a seed so the results are reproducible.
3. Compute the tensile strength as `50 + 10 * np.log1p(nanoclay_content) + np.random.normal(0, 2, 500)`
4. Compute the flexural_strength as `40 + 8 * np.log1p(nanoclay_content) + np.random.normal(0, 1.5, 500)`
5. Compute the impact_strength as `15 + 5 * np.exp(-0.1 * (nanoclay_content - 5) ** 2) + np.random.normal(0, 0.5, 500)`
6. Create a pandas dataframe that contains all four columns.
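
The steps above can be sketched as follows. This is one possible solution, not the reference one: the seed value `42`, the variable name `nanoclay_df`, and the exact column labels are assumptions chosen for illustration. The legacy `np.random` functions are used so that the formulas match the task text verbatim.

```python
import numpy as np
import pandas as pd

# Fix the seed so the synthetic data is reproducible (42 is an arbitrary choice).
np.random.seed(42)
n_samples = 500

# Step 2: nanoclay content drawn uniformly from 0 to 10 wt%.
nanoclay_content = np.random.uniform(0, 10, n_samples)

# Steps 3-5: property formulas from the task, each with Gaussian noise.
tensile_strength = 50 + 10 * np.log1p(nanoclay_content) + np.random.normal(0, 2, n_samples)
flexural_strength = 40 + 8 * np.log1p(nanoclay_content) + np.random.normal(0, 1.5, n_samples)
impact_strength = (
    15
    + 5 * np.exp(-0.1 * (nanoclay_content - 5) ** 2)
    + np.random.normal(0, 0.5, n_samples)
)

# Step 6: collect everything in one dataframe (column labels are an assumption).
nanoclay_df = pd.DataFrame(
    {
        "Nanoclay Content (wt%)": nanoclay_content,
        "Tensile Strength (MPa)": tensile_strength,
        "Flexural Strength (MPa)": flexural_strength,
        "Impact Strength (kJ/m²)": impact_strength,
    }
)
print(nanoclay_df.head())
```

Because the seed is set once before the first draw, rerunning the whole cell reproduces the same 500 rows.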

# Tutorial
Solve the previous task with the help of `perplexity.ai`, Google Gemini, Copilot, or ChatGPT.

### Tutorial

- Create a pandas dataframe of 1000 polymers named P0 to P999.
- Randomly generate a tensile strength value for each polymer in the range of 50 to 200 MPa and add it to the dataframe.
- Randomly generate PDI and Mw values for each polymer (PDI range 1 to 2.5, Mw range 0 to 200000) and add them to the dataframe.

In [3]:
import pandas as pd
import numpy as np

df = pd.DataFrame(index=[f"P{n}" for n in range(1000)])
df["TS"] = np.random.uniform(50, 200, 1000)
# Continue from here
df

Unnamed: 0,TS
P0,63.957068
P1,92.870803
P2,149.942191
P3,106.055984
P4,122.805240
...,...
P995,160.005779
P996,86.416020
P997,196.930394
P998,162.971390
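
One way to continue from the starter cell is sketched below. It assumes uniform distributions for PDI and Mw (the task only gives ranges, not distributions), and the seed value `0` and the name `polymers` are illustrative choices rather than part of the exercise.

```python
import numpy as np
import pandas as pd

# Seed for reproducibility (0 is an arbitrary choice).
np.random.seed(0)
n_polymers = 1000

# Index P0 ... P999, as in the starter cell.
polymers = pd.DataFrame(index=[f"P{n}" for n in range(n_polymers)])

# Tensile strength in MPa, as in the starter cell.
polymers["TS"] = np.random.uniform(50, 200, n_polymers)

# PDI and Mw drawn uniformly from the ranges given in the task
# (the uniform distribution is an assumption).
polymers["PDI"] = np.random.uniform(1, 2.5, n_polymers)
polymers["Mw"] = np.random.uniform(0, 200000, n_polymers)

print(polymers.describe())
```

`describe()` gives a quick check that each column stays within its intended range.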


### SDV

Use SDV (`!pip install sdv`) to synthesize new data for the polymers' tendency-to-crystallize dataset.

In [5]:
import pandas as pd

df = pd.read_json(
    "https://raw.githubusercontent.com/kuennethgroup/materials_datasets/refs/heads/main/polymer_tendency_to_crystalize/polymers_tend_to_crystalize.json"
)
# ... and easy-peasy
df

Unnamed: 0,smiles,property,value,fingerprint
0,[*]C[*],Xc,47.80,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,[*]CC([*])C,Xc,44.47,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,[*]CC([*])CC,Xc,34.04,"[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,[*]CC([*])CCC,Xc,20.01,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,[*]CC([*])CC(C)C,Xc,21.64,"[0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
...,...,...,...,...
427,[*]C([*])(F)F,Xc,31.84,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
428,[*]C/C=C\C[*],Xc,25.58,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
429,[*]O[Si](C)(C)CCCC(=O)Oc1ccc(C=Nc2ccc(N=Cc3ccc...,Xc,29.05,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ..."
430,[*]O[Si](C)(C)CCCC(=O)Oc1ccc(C=Nc2ccc(Cc3ccc(N...,Xc,21.74,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ..."


In [None]:
from sdv.metadata import MultiTableMetadata

metadata = MultiTableMetadata()
# Continue from here