datascience_tfm

The current work is presented as a first approach toward a future, complete data science model.

Manual pages

Documentation for the project utilities is available by cloning this repository.

Installation and requirements

The project requires:

Linux system (Ubuntu >= 18.04 LTS recommended; no other distributions were tested)
Python (>=3.7 recommended). The Python language can be easily installed through the Anaconda environment.

The remaining dependencies can be installed using the included setup.sh script. Basic instructions:

git clone https://github.com/nacasfer/datascience_tfm
cd datascience_tfm
bash setup.sh

The installer may also be used to check for updates to this and co-dependent packages.

Some considerations before starting

The project currently starts from a set of genomic variants. These variants were collected by our group and are not available for distribution. A random example built from the 1000 Genomes phase 3 database is provided for testing the scripts. The alignment, variant calling and VCF annotation steps were performed with the following modules, using the UCSC [HG37] FASTA reference and a suitable BED file:

The mapping step was performed using the BWA software.
The variant calling step was performed following GATK best practices, through the HaplotypeCaller module.
The variant annotation step was performed by joining the output of two annotation workflows: Annovar and the IonReporter web service.
The final datasets were obtained by joining the manually annotated TSV patient files for the disease variant candidate.
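
As a rough illustration of what joining two annotation outputs can look like, here is a minimal sketch in plain Python. It is not the project's actual code: the key columns (chrom, pos, ref, alt), the other column names and the sample rows are all assumptions made up for the example, and the choice to let the first table win on conflicting columns is arbitrary.

```python
import csv
import io

# Hypothetical key columns; the real column names in the project's files may differ.
KEY = ("chrom", "pos", "ref", "alt")

def read_tsv(handle):
    """Index the rows of a tab-separated table by their variant key."""
    rows = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        rows[tuple(row[k] for k in KEY)] = row
    return rows

def join_annotations(first, second):
    """Outer-join two keyed tables, keeping variants found in either one."""
    merged = {}
    for key in first.keys() | second.keys():
        row = dict(second.get(key, {}))
        row.update(first.get(key, {}))  # values from `first` win on shared columns
        merged[key] = row
    return merged

# Tiny made-up inputs standing in for the two annotation outputs.
first_tsv = "chrom\tpos\tref\talt\tgene\n1\t12345\tA\tG\tGENE1\n"
second_tsv = (
    "chrom\tpos\tref\talt\tsignificance\n"
    "1\t12345\tA\tG\tbenign\n"
    "2\t500\tC\tT\tunknown\n"
)

merged = join_annotations(read_tsv(io.StringIO(first_tsv)),
                          read_tsv(io.StringIO(second_tsv)))
for key in sorted(merged):
    print(key, merged[key])
```

A variant reported by only one workflow keeps just the columns that workflow provides, so missing annotations show up later as absent values.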

Filling NaN's: Analysis of the relation between variables

Filling NaN values is necessary for correct predictions. In this case, the NaN values of one feature are filled through linear, logarithmic and logistic regression on other features. These models are available in the "MLmodel" folder, which is downloaded when cloning this repository. The models are applied automatically by the predict.py script.
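
To make the idea concrete, the sketch below fills the NaN values of one feature from a linear fit on another feature. It only illustrates the linear case and is not the code stored in "MLmodel": the function names and the data are invented, and it assumes the predictor column itself has no missing values.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fill_nan_linear(predictor, target):
    """Return a copy of `target` with NaN entries replaced by fitted values."""
    # Fit only on the rows where the target is known.
    known = [(x, y) for x, y in zip(predictor, target) if not math.isnan(y)]
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
    return [slope * x + intercept if math.isnan(y) else y
            for x, y in zip(predictor, target)]

# Made-up columns: where it is known, `target` follows 2 * predictor + 1.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
target = [3.0, 5.0, math.nan, 9.0, math.nan]
filled = fill_nan_linear(predictor, target)
print(filled)
```

The logarithmic case can reuse the same fit after transforming the predictor with math.log, provided all predictor values are positive.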

Clustering

Several models are available in the predict.py script. To skip some of them, just comment out the unselected ones in the code. Otherwise, all the predictions will be stored in a new folder named after the model and the sample.
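
The following sketch shows one way such a layout could work: a list selects which models run, and each result goes to a folder named after the model and the sample. Everything here is hypothetical, including the model names, the "model_sample" folder pattern and the predictions.tsv file name; predict.py may organise its output differently.

```python
from pathlib import Path
import tempfile

def output_dir(base, model_name, sample_name):
    """Build and create a results folder named after the model and the sample."""
    path = Path(base) / f"{model_name}_{sample_name}"
    path.mkdir(parents=True, exist_ok=True)
    return path

# Hypothetical stand-ins for fitted models: each maps a list of rows to labels.
MODELS = {
    "model_a": lambda rows: [0 for _ in rows],
    "model_b": lambda rows: [1 for _ in rows],
}
SELECTED = ["model_a"]  # list only the models you want to run

def run_selected(rows, sample_name, base):
    """Run each selected model and write its labels to its own folder."""
    written = []
    for name in SELECTED:
        labels = MODELS[name](rows)
        target = output_dir(base, name, sample_name) / "predictions.tsv"
        target.write_text("\n".join(str(label) for label in labels) + "\n")
        written.append(target)
    return written

# Write into a temporary folder so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    paths = run_selected([[0.1, 0.2], [0.3, 0.4]], "sample01", tmp)
    result = [p.relative_to(tmp).as_posix() for p in paths]
    contents = paths[0].read_text()

print(result)
```

Selecting models through a list avoids editing the code each time, which is an alternative to commenting models out.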

Final considerations

The model is currently ready to use with the command below. Just change the path to the folder on which predictions will be made, or create one as instructed.

python3 predict.py

If you wish to train the models with your own data, create a main dataset using the TrainDataFormat.py script and run the main.py script as follows. Remember to set the correct path.

python3 main.py

For validation, set the correct path to data that was not used for training and run the Comp_w_Test.py script as follows.

python3 Comp_w_Test.py
