# Component Specifications

# <u>Component Breakdown</u>

## The package contains two main modules, **data_compile.py** and **model.py**
 


### 1. **data_compile.py** - Data processing module

This module contains functions for processing and manipulating raw data from molecular databases and from user input. The functions fall into three categories:

<u>Database setup:</u> **database_setup()**, **sample_subset()** and **get_id()**: These functions set up the database folder that stores molecules retrieved from the ChemSpider server through ChemSpiPy, create smaller sample sets for testing, and keep an id list for later handling.
    
<u>Function wrappers and input readers:</u> **get_df()**, **df_cleaner()**, **get_df_database()**, **get_df_user()**, **get_all_dataset()** read user input and database files. Both kinds of input are generally text files, and several steps are needed to turn them into properly formatted dataframes. This step is necessary for later data manipulation before the data is passed to the model.

<u>Data manipulation:</u> **trim_hydrogen()**, **atom_connect()**, **atom_periodic_number_convert()** trim unnecessary data and compute model inputs from raw values. In data files that contain 3D information, the locations of hydrogen atoms are often not known, so they have to be removed from the data sets. Information about how atoms connect to one another is crucial for identifying the relative locations of atoms. Finally, converting atom symbols from strings to integers reduces the possibility of errors when the model uses those values in its calculations.

### 2. **model.py** - Machine learning model module
This module contains the functions that prepare data for, and train, the machine learning model (XGBoost). The functions fall into three categories:

<u>Read dataframe:</u> **get_user_df()**: This function reads the demo dataframe from a csv file.

<u>Data preparation:</u> **get_centroid(coord_matrix)**, **translation_centroid(coord_matrix)**, **get_max_dist(coord_matrix)**, **unit_vector(vector)**, **angle_between(v1, v2)**, **rotation_matrix_2d(theta)**, **absmax_index(a, axis=None)**, **translate_rotate_2d(coord_2d)**, **rotation_matrix_from_vectors(vec1, vec2)**, **rotation_around_x(axis, theta)**, **translate_rotate_3d(x_index, y_index, coord_3d)** are the building blocks of our algorithm for predicting 3D structure. The inputs are generally the compound's basic properties, such as its bond types. Several steps are needed to tell the computer how to connect atoms with a particular bond type. Each element forms different types of bonds, so the algorithm has to follow these chemical principles.

<u>Train model:</u> **build_model()**, **get_model(data)**, **model_eval(model, data, n_fold=5)**, **predict_3d(user_input, model)** train the model using XGBoost. With these functions we can obtain a trained model and tune the XGBoost parameters to improve accuracy.


### 3. Data source
The database used for this package was a part of [ChemSpider](http://www.chemspider.com/).

User input can easily be created with [this online tool](http://www.cheminfo.org/Chemistry/Generate_molfiles/index.html).


### <u>User Interfaces</u>
#### Below are the function descriptions for the **data_compile** module
**get_user_df()**
```
Prepare user input in the correct format to feed into the model

input: list of file directories from the user (list)

return: dataframe of compiled user input in the correct format (pandas.DataFrame)
```

**get_all_dataset(set1=None, set2=0)**
```
Get all datasets from the database and combine them into one dataframe. Samples are selected at random. When two sets are requested, both are drawn from the same list, so the same sample can appear in both sets

input set1: number of samples wanted for the first set (training set) (int)
input set2 (optional): number of samples wanted for the second set (test set) (int)

return: compiled dataframe that contains all of the datasets (pandas.DataFrame)
```

**get_df_database(id_num, raw=False, hydrogen=False)**
```
Access the database folder using the id number to get a list of dataframes containing 2D and 3D data

input id_num: id number of the molecule (int)
input raw (optional): return dataframes in raw form from the web server without processing (bool)
input hydrogen (optional): return dataframes without trimming hydrogen (bool)

return: a list of dataframes containing atom coordinates, bond types and arrangement in 2D and 3D (list)
```

**trim_hydrogen(coord_input, bond_input)**
```
Return copies of the input dataframes after removing hydrogen atoms

input coord_input: coordinate dataframe (pandas.DataFrame)
input bond_input: dataframe containing atom pairs and their connections (pandas.DataFrame)

return: the same coordinate and bond dataframes without any information about hydrogen locations and bonding (pandas.DataFrame)
```
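The trimming step can be pictured with a small pandas sketch. This is not the package's implementation: the column names `atom`, `atom_1`, `atom_2` and the use of atom numbers as the dataframe index are assumptions made only for illustration, and the sketch does not renumber the remaining atoms.

```python
import pandas as pd

def trim_hydrogen_sketch(coord_df, bond_df):
    """Drop hydrogen rows and every bond that touches a hydrogen (illustrative only)."""
    # Assumed layout: coord_df is indexed by atom number and has an "atom" column;
    # bond_df has "atom_1" and "atom_2" columns that hold atom numbers.
    hydrogen_ids = coord_df.index[coord_df["atom"] == "H"]
    coord_out = coord_df.drop(index=hydrogen_ids).copy()
    touches_h = bond_df["atom_1"].isin(hydrogen_ids) | bond_df["atom_2"].isin(hydrogen_ids)
    bond_out = bond_df.loc[~touches_h].copy()
    return coord_out, bond_out

# Made-up four-atom fragment: C and O bonded together, each with one hydrogen.
coord = pd.DataFrame(
    {"atom": ["C", "O", "H", "H"], "x": [0.0, 1.2, -0.6, 1.8], "y": [0.0, 0.0, 0.9, 0.8]},
    index=[1, 2, 3, 4],
)
bond = pd.DataFrame({"atom_1": [1, 1, 2], "atom_2": [2, 3, 4], "bond_type": [1, 1, 1]})
coord_trimmed, bond_trimmed = trim_hydrogen_sketch(coord, bond)
```

Only the C-O bond and the two heavy atoms survive in this example; the inputs themselves are left unmodified because the function works on copies.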

**atom_connect(coord_input, bond_input)**
```
Create an array of connection info for each atom and put it into a new column of the coordinate dataframe

input coord_input: dataframe to be updated with a new column of connections (pandas.DataFrame)
input bond_input: dataframe containing atom pairs and their connections (pandas.DataFrame)

return: the same dataframe as coord_input with an added column of connection arrays (pandas.DataFrame)
```
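As a rough illustration of what a connection column could look like, the sketch below collects the neighbours of each atom from a bond table. The column names (`atom_1`, `atom_2`, `connect`) and the choice of a plain Python list per atom are assumptions, not the package's actual format.

```python
import pandas as pd

def atom_connect_sketch(coord_df, bond_df):
    """Return a copy of coord_df with a list of bonded neighbours per atom."""
    # Assumed layout: coord_df is indexed by atom number; bond_df lists atom pairs.
    neighbours = {int(atom_id): [] for atom_id in coord_df.index}
    for first, second in zip(bond_df["atom_1"], bond_df["atom_2"]):
        neighbours[int(first)].append(int(second))
        neighbours[int(second)].append(int(first))
    out = coord_df.copy()
    out["connect"] = pd.Series([neighbours[int(i)] for i in out.index], index=out.index)
    return out

coord = pd.DataFrame({"atom": ["C", "C", "O"]}, index=[1, 2, 3])
bond = pd.DataFrame({"atom_1": [1, 1], "atom_2": [2, 3]})
connected = atom_connect_sketch(coord, bond)
```

Here atom 1 ends up with neighbours 2 and 3, while atoms 2 and 3 each list only atom 1.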

**atom_periodic_number_convert(coord_input)**
```
Add a new column containing the periodic number of the corresponding atom

param coord_input: coordinate dataframe of 2D or 3D data (pandas.DataFrame)

return: same dataframe with an added column of periodic numbers (pandas.DataFrame)
```
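The symbol-to-integer conversion can be sketched with a lookup table. The table below covers only a few common elements, and the column names `atom` and `periodic_number` are hypothetical; the package may use a fuller table and different names.

```python
import pandas as pd

# Atomic numbers for a handful of common elements (deliberately incomplete).
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9, "P": 15, "S": 16, "Cl": 17}

def periodic_number_sketch(coord_df):
    """Return a copy of coord_df with an integer atomic-number column added."""
    out = coord_df.copy()
    # Symbols missing from the table would become NaN, which is easy to detect.
    out["periodic_number"] = out["atom"].map(ATOMIC_NUMBER)
    return out

coord = pd.DataFrame({"atom": ["C", "O", "N"], "x": [0.0, 1.2, -1.1]})
converted = periodic_number_sketch(coord)
```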

**get_df(filename, dim=2)**
```
Extract the atom coordinates and bonding data from a txt file according to the provided dimension
Can be used for both database and user input files

param filename: text file name (str)
param dim: dimension of the molecule structure in the text file, 2 for 2D or 3 for 3D (int)

return: coordinate and bonding dataframes from the text file (pandas.DataFrame)
```

**df_cleaner(df, new_df)**
```
Reformat a single-column input dataframe into multiple predetermined columns

param df: input dataframe from reading an id.txt file, with only 1 column of whitespace-separated values (pandas.DataFrame)
param new_df: output dataframe with predetermined columns (pandas.DataFrame)

return: dataframe with predetermined columns and sorted data (pandas.DataFrame)
```

**get_id()**
```
Return a list of ids for the whole database for later calling (list)
```

**sample_subset(directory=DATABASE, size=50)**
```
Create a smaller database folder inside the main database folder for testing. Typical users do not need to use the database functions

param directory: directory of the database (str)
param size: the number of samples in the folder (int)

return: a list of sample ids in the created folder for later calling (list)
```

**database_setup()**
```
Download 2D and 3D molecule structures from the ChemSpider server to create a database. Typical users do not need to use the database functions
```

#### Below are the function descriptions for the **model** machine learning module
**get_csv()**
```
return the demo dataframe (read from a csv file)
```

**get_centroid(coord_matrix)**
```
Defined as the center point of all atoms in either 2d or 3d structures

param coord_matrix: 2d or 3d matrix of coordinates

return coordinates of the centroid
```

**translation_centroid(coord_matrix)**
```
Move the centroid of a 2d or 3d matrix to the origin by matrix translation.

param coord_matrix: 2d or 3d matrix of coordinates

return coordinates of the matrix whose centroid is moved to the origin
```
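A minimal NumPy sketch of the two centroid helpers is given below, assuming that `coord_matrix` stores one atom per row and one axis per column. The `_sketch` names are illustrative and are not the package's functions.

```python
import numpy as np

def get_centroid_sketch(coord_matrix):
    """Mean position of all atoms (rows are atoms, columns are axes)."""
    return np.asarray(coord_matrix, dtype=float).mean(axis=0)

def translation_centroid_sketch(coord_matrix):
    """Shift every atom so that the centroid lands on the origin."""
    coords = np.asarray(coord_matrix, dtype=float)
    return coords - get_centroid_sketch(coords)

coords = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
centroid = get_centroid_sketch(coords)          # centroid of the three atoms: (1, 1)
centered = translation_centroid_sketch(coords)  # same shape, centroid at (0, 0)
```

The same code works unchanged for 3d input because the mean is taken per column.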

**get_max_dist(coord_matrix)**
```
Get the index of the atom that has the largest distance to the centroid.

param coord_matrix: 2d or 3d matrix of coordinates

return the index of the atom that is furthest from the centroid
```
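One way to find the furthest atom, sketched under the assumption that rows are atoms and that Euclidean distance is meant:

```python
import numpy as np

def get_max_dist_sketch(coord_matrix):
    """Index of the row that lies furthest from the centroid."""
    coords = np.asarray(coord_matrix, dtype=float)
    distances = np.linalg.norm(coords - coords.mean(axis=0), axis=1)
    return int(np.argmax(distances))

coords = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 1.0]])
furthest = get_max_dist_sketch(coords)  # the third atom (index 2) is the outlier
```

If several atoms are equally far away, `np.argmax` returns the first of them.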

**unit_vector(vector)**
```
Returns the unit vector of the vector. 

param vector: a 2d or 3d vector

return unit vector
```

**angle_between(v1, v2)**
```
Returns the angle in radians between vectors 'v1' and 'v2'.

param v1,v2: two vectors

return the angle between the two vectors, in radians
```
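The unit vector and angle helpers follow a common NumPy idiom, sketched here. Clipping the cosine is an assumption about how rounding errors might be handled, not something the specification states.

```python
import numpy as np

def unit_vector_sketch(vector):
    """Scale a non-zero vector to length 1."""
    vector = np.asarray(vector, dtype=float)
    return vector / np.linalg.norm(vector)

def angle_between_sketch(v1, v2):
    """Angle in radians between two non-zero vectors."""
    cosine = np.dot(unit_vector_sketch(v1), unit_vector_sketch(v2))
    # Clipping guards against values such as 1.0000000002 caused by rounding.
    return float(np.arccos(np.clip(cosine, -1.0, 1.0)))

right_angle = angle_between_sketch([1, 0, 0], [0, 1, 0])
same_direction = angle_between_sketch([2, 0, 0], [5, 0, 0])
opposite = angle_between_sketch([1, 0, 0], [-1, 0, 0])
```

The result always lies between 0 and pi, so it carries no information about the direction of rotation.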

**rotation_matrix_2d(theta)**
```
Get the 2d rotation matrix for the target angle theta

param theta: angle in radians

return rotation matrix to be applied for matrix rotation
```
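For reference, the standard counterclockwise 2d rotation matrix can be written as follows. The sketch assumes the column-vector convention, where a point `p` is rotated as `R @ p`.

```python
import numpy as np

def rotation_matrix_2d_sketch(theta):
    """Counterclockwise rotation by theta radians (column-vector convention)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

quarter_turn = rotation_matrix_2d_sketch(np.pi / 2)
rotated = quarter_turn @ np.array([1.0, 0.0])  # the x unit vector turns onto the y axis
```

When coordinates are stored one atom per row, the whole matrix can be rotated at once with `coords @ R.T`.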

**absmax_index(a, axis=None)**
```
Find the index of the second atom, which is used to align the y axis of the 3d matrix.

param a: vector

return the index of the target atom
```

**translate_rotate_2d(coord_2d)**
```
Wrapper function for 2d translation and rotation: translates and rotates the 2d matrix

param coord_2d: input 2d structure of the atoms

return max_x_index: the index of the furthest atom
       coord_2d_new: completed 2d matrix after translation and rotation
```
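The 2d wrapper can be pictured as three steps chained together: center the molecule, pick the atom furthest from the centroid, and rotate so that this atom lies on the positive x axis. The sketch below is one possible reading of that description, not the package's code.

```python
import numpy as np

def translate_rotate_2d_sketch(coord_2d):
    """Center the atoms, then rotate so the furthest atom lies on the +x axis."""
    coords = np.asarray(coord_2d, dtype=float)
    centered = coords - coords.mean(axis=0)
    max_x_index = int(np.argmax(np.linalg.norm(centered, axis=1)))
    x, y = centered[max_x_index]
    theta = -np.arctan2(y, x)  # rotate by the negative of the atom's polar angle
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s], [s, c]])
    coord_2d_new = centered @ rotation.T  # rows are atoms, hence the transpose
    return max_x_index, coord_2d_new

index, aligned = translate_rotate_2d_sketch([[0.0, 0.0], [1.0, 0.0], [1.0, 4.0]])
```

Because a rotation about the origin keeps the centroid in place, the result is still centered after the alignment.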

**rotation_matrix_from_vectors(vec1, vec2)**
```
Find the rotation matrix that aligns vec1 with vec2

param vec1: A 3d "source" vector
param vec2: A 3d "destination" vector

return a transform matrix (3x3) which, when applied to vec1, aligns it with vec2.
```
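A rotation that aligns one 3d vector with another is commonly built with Rodrigues' formula. The sketch below uses that formula; the handling of parallel and antiparallel inputs is an added assumption, since the specification does not say how those cases are treated.

```python
import numpy as np

def rotation_matrix_from_vectors_sketch(vec1, vec2):
    """3x3 rotation R such that R @ unit(vec1) equals unit(vec2)."""
    a = np.asarray(vec1, dtype=float) / np.linalg.norm(vec1)
    b = np.asarray(vec2, dtype=float) / np.linalg.norm(vec2)
    v = np.cross(a, b)             # rotation axis, scaled by sin(angle)
    c = float(np.dot(a, b))        # cos(angle)
    s = float(np.linalg.norm(v))   # sin(angle)
    if s < 1e-12:
        if c > 0:
            return np.eye(3)       # already aligned
        # Antiparallel: half turn about any axis perpendicular to a.
        helper = np.eye(3)[np.argmin(np.abs(a))]
        p = np.cross(a, helper)
        p /= np.linalg.norm(p)
        return 2.0 * np.outer(p, p) - np.eye(3)
    k = np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])
    return np.eye(3) + k + (k @ k) * ((1.0 - c) / s**2)

source = np.array([1.0, 2.0, 3.0])
R = rotation_matrix_from_vectors_sketch(source, [0.0, 0.0, 1.0])
aligned = R @ (source / np.linalg.norm(source))  # points along the z axis
```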

**rotation_around_x(axis, theta)**
```
Return the rotation matrix associated with counterclockwise rotation about the given axis by theta radians.

param axis: the axis to rotate around
      theta: the rotation angle in radians
      
return rotation matrix
```
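A rotation about an arbitrary axis can also be expressed with Rodrigues' formula, R = I + sin(theta) K + (1 - cos(theta)) K^2, where K is the cross-product matrix of the unit axis. The sketch below assumes the right-hand rule for "counterclockwise"; the package may compute the same matrix differently.

```python
import numpy as np

def rotation_about_axis_sketch(axis, theta):
    """Rotation matrix for a right-handed turn of theta radians about axis."""
    u = np.asarray(axis, dtype=float)
    u = u / np.linalg.norm(u)
    k = np.array([[0.0, -u[2], u[1]], [u[2], 0.0, -u[0]], [-u[1], u[0], 0.0]])
    return np.eye(3) + np.sin(theta) * k + (1.0 - np.cos(theta)) * (k @ k)

about_x = rotation_about_axis_sketch([1.0, 0.0, 0.0], np.pi / 2)
turned = about_x @ np.array([0.0, 1.0, 0.0])  # the y unit vector turns onto the z axis
```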

**translate_rotate_3d(x_index, y_index, coord_3d)**
```
Translate and rotate the 3d matrix so that the centroid is located at the origin and the furthest atom is located on the x axis

param x_index: index of the atom to be placed on the x axis
      y_index: index of the second atom, used to align the y axis
      coord_3d: input 3d structure of the atoms

return the 3d matrix after translation and rotation
```

**build_model()**
```
This is the function to build the predictive model
(import all the data into a csv file)
```

**get_model(data)**
```
Extract the model and parameters from the training data

param data: training data

return MultiOutputRegressor (predicts the 3 output numbers)
```
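To show the shape of a multi-output regressor, the sketch below wraps a gradient-boosted tree model in scikit-learn's `MultiOutputRegressor`. It deliberately substitutes scikit-learn's `GradientBoostingRegressor` for XGBoost so that the example needs only scikit-learn, and the features and targets are random placeholders rather than molecular data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))  # placeholder features, one row per atom
# Placeholder targets standing in for the three output numbers.
Y = np.column_stack([X[:, 0] + X[:, 1], 2.0 * X[:, 2], X[:, 3] - X[:, 0]])

# One independent boosted-tree regressor is fitted per output column.
model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=50, random_state=0))
model.fit(X, Y)
predicted = model.predict(X[:5])  # array with 5 rows and 3 columns
```

The wrapper trains each output separately, so correlations between the three outputs are not modelled.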

**model_eval(model,data,n_fold=5)**
```
Evaluate the performance of the model built from the existing database

param model: predictive model (gradient boosted trees)
      data: the data given for cross validation
      n_fold: number of cross validation folds
      
return the accuracy score with 95% confidence interval
```
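A cross-validated score with a rough interval might be computed as sketched below. Several things here are assumptions: the score is scikit-learn's default R² for regressors, the interval is the mean plus or minus 1.96 standard deviations of the fold scores, features and targets are passed separately instead of as one `data` argument, and scikit-learn's gradient boosting again stands in for XGBoost on random placeholder data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.multioutput import MultiOutputRegressor

def model_eval_sketch(model, X, Y, n_fold=5):
    """Mean cross-validated score and the half-width of a rough 95% interval."""
    scores = cross_val_score(model, X, Y, cv=n_fold)  # default scorer: R^2
    return float(scores.mean()), float(1.96 * scores.std())

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))  # placeholder features
Y = np.column_stack([X[:, 0] + X[:, 1], 2.0 * X[:, 2], X[:, 3] - X[:, 0]])
model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=50, random_state=0))
mean_score, half_width = model_eval_sketch(model, X, Y, n_fold=5)
print(f"score: {mean_score:.2f} (+/- {half_width:.2f})")
```

With only five folds the interval is a coarse indication of spread between folds, not a rigorous confidence interval.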

**predict_3d(user_input,model)**
```
Take the user's 2d input and the predictive model built from the database to predict the 3d structure

param user_input: user input of the 2d information about the molecule
      model: predictive model built from the database
      
return user_output: 3d information about the molecule
```










