# Component Specifications

# <u>Component Breakdown</u>

## The package contains two main modules, **data_compile.py** and **<span style="color:red">model_module_name_here.py</span>**.


### 1. **data_compile.py** - Data processing module

This module contains the functions that process and manipulate raw data from the molecular database and from the user. The functions fall into three categories:

<u>Database setup:</u> **database_setup()**, **sample_subset()** and **get_id()**. These functions set up the database folder that stores molecules retrieved from the ChemSpider server through ChemSpiPy, create smaller sample sets for testing, and keep the id list for later handling.
    
<u>Function wrapper and input reader:</u> **get_df()**, **df_cleaner()**, **get_df_database()**, **get_df_user()** and **get_all_dataset()** read user input and database files. Both kinds of input are generally text files, and several steps are needed to turn them into properly formatted dataframes. These steps are required before the data can be manipulated further and passed to the model.

<u>Data manipulation:</u> **trim_hydrogen()**, **atom_connect()** and **atom_periodic_number_convert()** trim unnecessary data and compute model inputs from the raw values. In data files that contain 3D information, the locations of hydrogen atoms are often not known, so hydrogens are removed from the data sets. How each atom connects to the others is crucial for identifying the relative locations of atoms, so that information is computed and kept. Finally, converting the atom symbol from a string to an integer reduces the possibility of errors when the model uses those values in its calculations.

### 2. Machine learning model module

### 3. Data source
The database used for this package is a part of [ChemSpider](http://www.chemspider.com/).

User input files can easily be created with [this online tool](http://www.cheminfo.org/Chemistry/Generate_molfiles/index.html).


### <u>User Interfaces</u>
#### Below are the function descriptions for the **data_compile** module
**get_user_df()**
```
Convert user input into the correct format to feed into the model

input: list of file paths provided by the user (list)

return: dataframe of compiled user input in the correct format (pandas.DataFrame)
```

**get_all_dataset(set1=None, set2=0)**
```
Get all datasets from the database and combine them into one dataframe of randomly selected samples. When two sets are requested, both are drawn at random from the same list, so the same sample can appear in both sets

input_1: number of samples wanted for the first set (training set) (int)
input_2 (optional): number of samples wanted for the second set (test set) (int)

return: compiled dataframe that contains all of the datasets (pandas.DataFrame)
```
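The overlap mentioned above can be illustrated with a minimal sketch. It is not the package's implementation: `draw_two_sets`, its `seed` argument and the id list are hypothetical, and it only shows what happens when two sets are sampled independently from the same list.

```python
import random


def draw_two_sets(id_list, set1, set2=0, seed=None):
    """Draw two independent random samples of ids from the same list (hypothetical helper)."""
    rng = random.Random(seed)
    first = rng.sample(id_list, set1)
    # The second draw starts again from the full list, so ids may repeat across sets.
    second = rng.sample(id_list, set2) if set2 else []
    return first, second


ids = list(range(1, 21))  # stand-in for the list of database ids
train_ids, test_ids = draw_two_sets(ids, 15, 10, seed=0)
shared = set(train_ids) & set(test_ids)
print(f"{len(shared)} ids appear in both sets")
```

With 15 and 10 ids drawn from a list of 20, at least 5 ids must be shared. If the two sets have to be disjoint, the second draw would need to exclude the ids already picked.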

**get_df_database(id_num, raw=False, hydrogen=False)**
```
Access the database folder using the id number to get a list of dataframes containing 2D and 3D data

input id_num: id number of the molecule (int)
input raw (optional): return dataframes in raw form from the web server, without processing (bool)
input hydrogen (optional): return dataframes without trimming hydrogen (bool)

return: a list of dataframes containing atom coordinates, bonding types and arrangement in 2D and 3D (list)
```

**trim_hydrogen(coord_input, bond_input)**
```
Return copies of the input dataframes with hydrogen atoms removed

input coord_input: coordinate dataframe (pandas.DataFrame)
input bond_input: dataframe containing atom pairs and their connections (pandas.DataFrame)

return: the same coordinate and bond dataframes without any information on hydrogen locations and bonding (pandas.DataFrame)
```
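A minimal sketch of this trimming step is given here under stated assumptions: the coordinate dataframe is indexed by atom number and stores element symbols in an `atom` column, and the bond dataframe uses `atom_1`, `atom_2` and `bond_type` columns. These names are hypothetical and may differ from the ones the module actually uses.

```python
import pandas as pd


def trim_hydrogen_sketch(coord_input, bond_input):
    """Return copies of both dataframes with every hydrogen row removed."""
    # Atom numbers (index labels) of all hydrogens in the coordinate table.
    hydrogen_ids = coord_input.index[coord_input["atom"] == "H"]
    coord = coord_input.drop(index=hydrogen_ids).copy()
    # Drop any bond that has a hydrogen on either end.
    touches_h = bond_input["atom_1"].isin(hydrogen_ids) | bond_input["atom_2"].isin(hydrogen_ids)
    bond = bond_input.loc[~touches_h].copy()
    return coord, bond


# Formaldehyde: one carbon, one oxygen, two hydrogens.
coord = pd.DataFrame(
    {"atom": ["C", "O", "H", "H"], "x": [0.0, 1.2, -0.5, -0.5], "y": [0.0, 0.0, 0.9, -0.9]},
    index=[1, 2, 3, 4],
)
bond = pd.DataFrame({"atom_1": [1, 1, 1], "atom_2": [2, 3, 4], "bond_type": [2, 1, 1]})

trimmed_coord, trimmed_bond = trim_hydrogen_sketch(coord, bond)
print(trimmed_coord)
print(trimmed_bond)
```

The inputs are left untouched, which matches the description of returning a copy rather than editing in place.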

**atom_connect(coord_input, bond_input)**
```
Create an array of connection information for each atom and store it in a new column of the coordinate dataframe

input coord_input: dataframe to be updated with a new connection column (pandas.DataFrame)
input bond_input: dataframe containing atom pairs and their connections (pandas.DataFrame)

return: the same dataframe as coord_input with an added column of connection arrays (pandas.DataFrame)
```
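One way such a connection column could be built is sketched here. The column names (`atom`, `atom_1`, `atom_2`, `connect`) are assumptions, and plain Python lists stand in for whatever array type the module stores.

```python
import pandas as pd


def atom_connect_sketch(coord_input, bond_input):
    """Return a copy of coord_input with a column listing each atom's bonded neighbours."""
    coord = coord_input.copy()
    neighbours = {atom_id: [] for atom_id in coord.index}
    for first, second in zip(bond_input["atom_1"], bond_input["atom_2"]):
        first, second = int(first), int(second)
        # Ignore bonds to atoms missing from the coordinate table (e.g. trimmed hydrogens).
        if first in neighbours and second in neighbours:
            neighbours[first].append(second)
            neighbours[second].append(first)
    coord["connect"] = [neighbours[atom_id] for atom_id in coord.index]
    return coord


# Heavy atoms of ethanol: C1 - C2 - O3.
coord = pd.DataFrame({"atom": ["C", "C", "O"]}, index=[1, 2, 3])
bond = pd.DataFrame({"atom_1": [1, 2], "atom_2": [2, 3], "bond_type": [1, 1]})

connected = atom_connect_sketch(coord, bond)
print(connected)
```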

**atom_periodic_number_convert(coord_input)**
```
Add a new column containing the periodic (atomic) number of the corresponding atom

param coord_input: coordinate dataframe of 2D or 3D data (pandas.DataFrame)

return: the same dataframe with an added column of periodic numbers (pandas.DataFrame)
```
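The symbol-to-integer conversion can be sketched as a dictionary lookup. The `atom` and `periodic_number` column names are assumptions, and the lookup table is deliberately limited to a few common elements rather than the full periodic table.

```python
import pandas as pd

# Atomic numbers for a few common elements; a full table would cover every element.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9, "P": 15, "S": 16, "Cl": 17, "Br": 35, "I": 53}


def periodic_number_sketch(coord_input):
    """Return a copy of coord_input with an integer atomic-number column added."""
    coord = coord_input.copy()
    numbers = coord["atom"].map(ATOMIC_NUMBERS)
    # Symbols missing from the table map to NaN; fail loudly instead of passing NaN on.
    if numbers.isna().any():
        unknown = sorted(set(coord.loc[numbers.isna(), "atom"]))
        raise ValueError(f"unknown element symbols: {unknown}")
    coord["periodic_number"] = numbers.astype(int)
    return coord


coord = pd.DataFrame({"atom": ["C", "O", "Cl"]}, index=[1, 2, 3])
converted = periodic_number_sketch(coord)
print(converted)
```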

**get_df(filename, dim=2)**
```
Extract the atom coordinates and bonding data from a text file according to the given dimension
Can be used for both database and user input files

param filename: text file name (str)
param dim: dimension of the molecule structure in the text file, 2 for 2D or 3 for 3D (int)

return: coordinate and bonding dataframes from the text file (pandas.DataFrame)
```
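The following is a rough sketch of the kind of parsing this involves. It assumes the text file follows the common MDL molfile (V2000) layout, which the source does not confirm: three header lines, a counts line, then one line per atom and one per bond. The function name, the column names and the use of whitespace splitting are all illustrative choices, and the sketch parses a string instead of opening a file.

```python
import pandas as pd


def parse_molfile_text(text):
    """Split molfile-style text into a coordinate dataframe and a bond dataframe."""
    lines = text.splitlines()
    # Line 4 holds the atom and bond counts. Whitespace splitting is an assumption
    # that only holds while both counts stay below 100 (the fields are fixed width).
    counts = lines[3].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atom_rows = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atom_rows.append([float(x), float(y), float(z), symbol])

    bond_rows = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, bond_type = line.split()[:3]
        bond_rows.append([int(first), int(second), int(bond_type)])

    # Index atoms from 1 so the index matches the atom numbers used in the bond table.
    coord = pd.DataFrame(atom_rows, columns=["x", "y", "z", "atom"], index=range(1, n_atoms + 1))
    bond = pd.DataFrame(bond_rows, columns=["atom_1", "atom_2", "bond_type"])
    return coord, bond


# Hypothetical input: a carbon double-bonded to an oxygen.
MOLFILE_TEXT = "\n".join([
    "example molecule",
    "  hypothetical sketch input",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  2  0",
    "M  END",
])

coord, bond = parse_molfile_text(MOLFILE_TEXT)
print(coord)
print(bond)
```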

**df_cleaner(df, new_df)**
```
Reformat a single-column input dataframe into multiple predetermined columns

param df: input dataframe from reading the id.txt file, with only 1 column of whitespace-separated values (pandas.DataFrame)
param new_df: output dataframe with predetermined columns (pandas.DataFrame)

return: dataframe with the predetermined columns, with the data sorted into them (pandas.DataFrame)
```
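A minimal sketch of this reshaping step uses pandas string splitting. It assumes every row holds at least as many whitespace-separated fields as the template dataframe has columns; the sample rows and column names are hypothetical.

```python
import pandas as pd


def df_cleaner_sketch(df, new_df):
    """Split the single text column of df into the columns named by new_df."""
    columns = list(new_df.columns)
    # Split each row on runs of whitespace; expand=True gives one column per field.
    parts = df.iloc[:, 0].str.split(expand=True)
    # Keep only as many fields as there are target columns, then name them.
    parts = parts.iloc[:, :len(columns)].copy()
    parts.columns = columns
    return parts


raw = pd.DataFrame({0: ["0.0000 0.0000 0.0000 C", "1.2000 0.0000 0.0000 O"]})
template = pd.DataFrame(columns=["x", "y", "z", "atom"])

cleaned = df_cleaner_sketch(raw, template)
# The split fields are still strings, so numeric columns are converted afterwards.
cleaned[["x", "y", "z"]] = cleaned[["x", "y", "z"]].astype(float)
print(cleaned)
```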

**get_id()**
```
Return a list of the ids of the whole database for later use (list)
```

**sample_subset(directory=DATABASE, size=50)**
```
Create a smaller database folder inside the main database folder for testing. Typical users do not need to use the database functions

param directory: directory of the database (str)
param size: number of samples in the new folder (int)

return: a list of the sample ids in the created folder for later use (list)
```

**database_setup()**
```
Download 2D and 3D molecule structures from the ChemSpider server to create a database. Typical users do not need to use the database functions
```

## <u>Implementation Guidance</u>



## <u>Reference</u>

