# Load, synthesize, manipulate, and save your data


To train an ML model, you need to have the data in a pandas DataFrame. An example DataFrame for predicting Nanoclay Content (wt%) from Tensile Strength (MPa), Flexural Strength (MPa), and Impact Strength (kJ/m²) would look like:


<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Nanoclay Content (wt%)</th>
      <th>Tensile Strength (MPa)</th>
      <th>Flexural Strength (MPa)</th>
      <th>Impact Strength (kJ/m²)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>3.745401</td>
      <td>66.255272</td>
      <td>53.333300</td>
      <td>19.922674</td>
    </tr>
    <tr>
      <th>1</th>
      <td>9.507143</td>
      <td>77.272895</td>
      <td>58.277504</td>
      <td>16.436492</td>
    </tr>
    <tr>
      <th>2</th>
      <td>7.319939</td>
      <td>73.087397</td>
      <td>57.835222</td>
      <td>17.934958</td>
    </tr>
    <tr>
      <th>3</th>
      <td>5.986585</td>
      <td>68.286111</td>
      <td>57.214990</td>
      <td>19.159551</td>
    </tr>
    <tr>
      <th>4</th>
      <td>1.560186</td>
      <td>57.603971</td>
      <td>48.751364</td>
      <td>16.761423</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>495</th>
      <td>3.533522</td>
      <td>64.053254</td>
      <td>51.360126</td>
      <td>19.840274</td>
    </tr>
    <tr>
      <th>496</th>
      <td>5.836561</td>
      <td>68.243970</td>
      <td>58.614241</td>
      <td>19.044872</td>
    </tr>
    <tr>
      <th>497</th>
      <td>0.777346</td>
      <td>57.839536</td>
      <td>43.692399</td>
      <td>15.544365</td>
    </tr>
    <tr>
      <th>498</th>
      <td>9.743948</td>
      <td>75.107209</td>
      <td>60.107884</td>
      <td>15.513528</td>
    </tr>
    <tr>
      <th>499</th>
      <td>9.862107</td>
      <td>77.546218</td>
      <td>59.531182</td>
      <td>15.610293</td>
    </tr>
  </tbody>
</table>
<p>500 rows × 4 columns</p>
</div>

## Actions:
1. Gather and Consolidate Data:

- Upload your existing dataset. If you have multiple data sources, combine them into a single, cohesive dataset.
- Consult with Chris to evaluate whether the current data is sufficient.
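
Combining several sources can be sketched with `pd.concat`. This is a minimal example, assuming both sources already share the same column names; the two small DataFrames are hypothetical stand-ins built from the first five rows of the table above, with one overlapping row to show duplicate removal.

```python
import pandas as pd

columns = [
    "Nanoclay Content (wt%)",
    "Tensile Strength (MPa)",
    "Flexural Strength (MPa)",
    "Impact Strength (kJ/m²)",
]

# Hypothetical source A: rows 0-2 of the example table.
source_a = pd.DataFrame(
    [
        [3.745401, 66.255272, 53.333300, 19.922674],
        [9.507143, 77.272895, 58.277504, 16.436492],
        [7.319939, 73.087397, 57.835222, 17.934958],
    ],
    columns=columns,
)

# Hypothetical source B: rows 2-4 of the example table (row 2 overlaps with A).
source_b = pd.DataFrame(
    [
        [7.319939, 73.087397, 57.835222, 17.934958],
        [5.986585, 68.286111, 57.214990, 19.159551],
        [1.560186, 57.603971, 48.751364, 16.761423],
    ],
    columns=columns,
)

# Stack the rows, drop exact duplicate rows, and rebuild a clean 0..n-1 index.
combined = (
    pd.concat([source_a, source_b], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
print(combined)
```

If your sources use different column names or units, harmonize them before concatenating.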
 
2. Load and Explore the Data:
- Load the data into a pandas DataFrame for manipulation and analysis.
- Conduct an initial exploratory data analysis (EDA) to understand the dataset's structure, identify data types, and check for missing values or anomalies. Use pandas DataFrame methods such as `.info()`, `.describe()`, and `.head()`.
- Plot histograms (e.g., `df.plot.hist()`) to visualize the distribution of each column.
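
A first look at the data might go as follows. This is only a sketch: the DataFrame is rebuilt here from the first five rows of the example table so the snippet runs on its own, and the histogram call is skipped if matplotlib (which pandas plotting relies on) is not installed.

```python
import pandas as pd

# First five rows of the example table, used as a small stand-in dataset.
df = pd.DataFrame(
    {
        "Nanoclay Content (wt%)": [3.745401, 9.507143, 7.319939, 5.986585, 1.560186],
        "Tensile Strength (MPa)": [66.255272, 77.272895, 73.087397, 68.286111, 57.603971],
        "Flexural Strength (MPa)": [53.333300, 58.277504, 57.835222, 57.214990, 48.751364],
        "Impact Strength (kJ/m²)": [19.922674, 16.436492, 17.934958, 19.159551, 16.761423],
    }
)

df.info()              # column names, dtypes, and non-null counts
print(df.describe())   # count, mean, std, min, quartiles, max per column
print(df.head())       # first rows, to check that values look plausible

# Number of missing values per column.
missing = df.isna().sum()
print(missing)

# Histograms of all numeric columns; needs matplotlib to be installed.
try:
    df.plot.hist(bins=10, alpha=0.5)
except ImportError:
    print("matplotlib is not installed; skipping the histogram")
```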

3. Define Project Objectives and Data Requirements:
- Clearly define the goal of the project. What problem are you trying to solve?
- Based on the objective, identify the input features (X) and the target variable (y). The inputs are the data you will use to make predictions, and the output is the value you are trying to predict.
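
Splitting a DataFrame into inputs and target can be sketched as below. The choice of `Nanoclay Content (wt%)` as the target is an assumption taken from the example at the top of this page; substitute the column that matches your own objective.

```python
import pandas as pd

# First five rows of the example table, used as a small stand-in dataset.
df = pd.DataFrame(
    {
        "Nanoclay Content (wt%)": [3.745401, 9.507143, 7.319939, 5.986585, 1.560186],
        "Tensile Strength (MPa)": [66.255272, 77.272895, 73.087397, 68.286111, 57.603971],
        "Flexural Strength (MPa)": [53.333300, 58.277504, 57.835222, 57.214990, 48.751364],
        "Impact Strength (kJ/m²)": [19.922674, 16.436492, 17.934958, 19.159551, 16.761423],
    }
)

# Assumed target column; everything else is treated as an input feature.
target = "Nanoclay Content (wt%)"

X = df.drop(columns=[target])  # input features
y = df[target]                 # value to predict

print(X.columns.tolist())
print(y.name)
```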

4. Feature Engineering:
- For polymers, use the `psmiles` package to generate fingerprints. See [here](https://github.com/kuennethgroup/materials_datasets/blob/main/polymer_band_gap_computational/convert.ipynb) for an example.
- For other materials, talk to Chris.
  
5. Save the pandas DataFrame as a `data.json` file.
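
Saving and reloading can be sketched with `DataFrame.to_json` and `pd.read_json`. This example assumes the default JSON layout is acceptable and writes `data.json` into the current working directory; the DataFrame is again a stand-in built from the first five rows of the example table.

```python
import pandas as pd

# First five rows of the example table, used as a small stand-in dataset.
df = pd.DataFrame(
    {
        "Nanoclay Content (wt%)": [3.745401, 9.507143, 7.319939, 5.986585, 1.560186],
        "Tensile Strength (MPa)": [66.255272, 77.272895, 73.087397, 68.286111, 57.603971],
        "Flexural Strength (MPa)": [53.333300, 58.277504, 57.835222, 57.214990, 48.751364],
        "Impact Strength (kJ/m²)": [19.922674, 16.436492, 17.934958, 19.159551, 16.761423],
    }
)

# Write the DataFrame to disk as JSON.
df.to_json("data.json")

# Read it back to confirm that the file can be loaded again.
restored = pd.read_json("data.json")
print(restored.head())
```

Reloading the file right after saving is a cheap check that column names and values survived the round trip.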

