# Loading Data

Importing Libraries

In [None]:
import os
import numpy as np
import pandas as pd

# Helper Functions to keep notebook clean
import functions as func

Define file and folder paths

In [None]:
PATH = os.getcwd() + "/"

PCD_folder = PATH + "DATA/cifs/PCD/"
ICSD_folder = PATH + "DATA/cifs/ICSD/"

PCD_pickle_raw = PATH + "DATA/pickle/PCD_raw.pkl"
ICSD_pickle_raw = PATH + "DATA/pickle/ICSD_raw.pkl"

## Downloading data

### PCD

*The PCD dataset was provided by our assistant, so no download step is included for it.*

### ICSD

This part requires a large amount of data, so part of it is downloaded from the ICSD database, courtesy of FIZ Karlsruhe. The download goes through their API, using a script by GitHub user "simonverret": https://github.com/simonverret/materials_data_api_scripts.

That script saves the data to a `.csv` file rather than to individual `.cif` files, which become important later on, so we modified it to suit our needs.
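The modified script itself is not reproduced here. As a rough illustration of the change, the sketch below writes a mapping of entry IDs to CIF strings into one `.cif` file per entry. The helper name `save_cif_strings`, the input format, and the `<id>.cif` naming scheme are all assumptions made for this example, not part of the original script.

```python
from pathlib import Path


def save_cif_strings(cif_by_id, out_folder):
    """Write each CIF string to its own <entry_id>.cif file (hypothetical helper)."""
    out_dir = Path(out_folder)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for entry_id, cif_text in cif_by_id.items():
        target = out_dir / f"{entry_id}.cif"
        target.write_text(cif_text, encoding="utf-8")
        written.append(target)
    return written
```

Keeping one file per entry means a single malformed entry can be inspected or fixed without touching the rest of the download.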

In [None]:
import ICSD_download as icd

credentials = icd.get_credentials()
icd.download_all(credentials["loginid"], credentials["password"], min_N=1, max_N=5)

logged in ICSD  (token=0BF8D8664C079C29AB4E276E0D525E4E)
materials with 1 elements
Progress: [---->] 100%
received 3199/3199 cif strings
materials with 2 elements
Progress: [---------------------------------------------->] 100%
received 46216/46216 cif strings
materials with 3 elements
Progress: [------------------------------------------------------------------------------------->] 100%
received 85045/85045 cif strings
materials with 4 elements
Progress: [------------------------------------------------------------>] 100%
received 60334/60334 cif strings
materials with 5 elements
Progress: [---------------------------------------->] 100%
received 40133/40133 cif strings
logged out ICSD (token=523FC3D1C4A4406837FD656353D1E2C0)


## Convert Data

After downloading all the `.cif` files, we convert them into a `pandas.DataFrame` for easier data handling. We have two main objectives:

1. Calculate coefficient of thermal expansion
2. Generate feature vectors

Step 1 is easy to handle in a dataframe. Step 2 is less straightforward. For the feature vectors we want to use **pymatgen**, which can load `.cif` files into `pymatgen.core.Structure` objects. However, once we filter or reorder the dataframe, it becomes easy to lose track of which `.cif` files should be loaded afterwards. To avoid this, we attach the `dict` version of each **pymatgen** object at the end of its row while creating the dataframe. This stores all the raw data in one dataframe and keeps each structure together with its row when we perform row operations. The cost is additional complexity later on, when we have to fetch the **pymatgen** objects out of the dataframe, create the feature vectors, and merge them back in.
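The pattern can be sketched as follows. To keep the example runnable without **pymatgen**, it uses a hypothetical stand-in class `ToyStructure` that mimics only the `as_dict()` / `from_dict()` pair that `pymatgen.core.Structure` provides. The column names and placeholder values are assumptions for illustration and do not reflect the real dataframe layout.

```python
import pandas as pd


class ToyStructure:
    """Hypothetical stand-in for pymatgen.core.Structure (illustration only)."""

    def __init__(self, formula, volume):
        self.formula = formula
        self.volume = volume

    def as_dict(self):
        return {"formula": self.formula, "volume": self.volume}

    @classmethod
    def from_dict(cls, d):
        return cls(d["formula"], d["volume"])


# Placeholder rows: (temperature, structure). The values carry no physical meaning.
entries = [
    (300.0, ToyStructure("A", 1.0)),
    (100.0, ToyStructure("B", 2.0)),
    (250.0, ToyStructure("C", 3.0)),
]

# Store the dict form of each structure in the last column of its row.
df = pd.DataFrame(
    {
        "formula": s.formula,
        "temperature": t,
        "structure": s.as_dict(),
    }
    for t, s in entries
)

# Row operations (filter, sort) carry the structure dicts along with the rows.
subset = df[df["temperature"] > 200].sort_values("temperature").reset_index(drop=True)

# Rebuild the objects only when they are needed, e.g. for feature generation.
restored = [ToyStructure.from_dict(d) for d in subset["structure"]]
```

With real data, the `ToyStructure.from_dict` call would be replaced by `Structure.from_dict` on the stored dictionaries.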

### PCD

Here we actually convert the PCD `.cif` files into a dataframe.

In [None]:
df_PCD_raw = func.load_PCD_cif(PCD_folder)

df_PCD_raw.to_pickle(PCD_pickle_raw)

Loading: PCD cifs
Progress: [------------------->] 100%
Final Report: 7% Failed to load, 144 minutes taken for entire operation
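Since the conversion takes over two hours, it can be worth reusing the pickle on later runs instead of parsing the `.cif` files again. The sketch below shows one possible way to do this; `load_or_build` is a hypothetical helper and not part of `functions.py`.

```python
from pathlib import Path

import pandas as pd


def load_or_build(pickle_path, build):
    """Return the pickled dataframe if it exists, otherwise build and save it."""
    path = Path(pickle_path)
    if path.exists():
        return pd.read_pickle(path)

    df = build()  # the slow step, e.g. parsing all .cif files
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(path)
    return df
```

In this notebook it could be called as `load_or_build(PCD_pickle_raw, lambda: func.load_PCD_cif(PCD_folder))`, and in the same way for the ICSD data.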


And a quick look at how large the useable raw PCD dataset is:

In [None]:
print(len(df_PCD_raw))

281865


### ICSD

And here we convert the ICSD `.cif` files into a dataframe. 

**Note:** the `.cif` files downloaded from the ICSD database sometimes contain an apostrophe (') within the publication title. CIF uses single quotes as string delimiters, and the `MMCIF2Dict` module reads parts of the file according to that formatting, so a stray apostrophe breaks parsing of the affected files. We went through the entire list, identified the files that contained a loose apostrophe, and removed it. If you try to replicate the results, keep this in mind and consider handling the problem from the beginning.
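We did not record the exact procedure used to find the affected files. One simple heuristic, sketched below, flags every line with an odd number of single quotes. This is an assumption about how such a check could look, not the method we used. It will miss titles that contain an even number of stray apostrophes, so flagged and unflagged files may both still need a manual check. The function names are hypothetical.

```python
from pathlib import Path


def find_suspect_lines(cif_text):
    """Return (line_number, line) pairs whose single-quote count is odd."""
    suspects = []
    for number, line in enumerate(cif_text.splitlines(), start=1):
        if line.count("'") % 2 == 1:
            suspects.append((number, line))
    return suspects


def scan_folder(folder):
    """Map each .cif file name in `folder` to its suspect lines, if any."""
    report = {}
    for path in sorted(Path(folder).glob("*.cif")):
        text = path.read_text(encoding="utf-8", errors="replace")
        suspects = find_suspect_lines(text)
        if suspects:
            report[path.name] = suspects
    return report
```

Running `scan_folder(ICSD_folder)` before the conversion would give a list of candidate files to inspect by hand.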

In [None]:
df_ICSD_raw = func.load_ICSD_cif(ICSD_folder)

df_ICSD_raw.to_pickle(ICSD_pickle_raw)

Loading: ICSD cifs
Progress: [------------------->] 100%
Final Report: 2% Failed to load, 155 minutes taken for entire operation


And a quick look at how large the useable raw ICSD dataset is:

In [None]:
print(len(df_ICSD_raw))

229320
