In [1]:
from gentype import EnsemblClient, DataManager, PiCollapsedNonparametricGibbsSampler

Set up the client and the data manager:

In [2]:
Database_Name = "Gentype_DB"
client = EnsemblClient()
data_manager = DataManager(client, Database_Name)

The following cell fills the database, which can take a while. If the database has already been filled, you do not need to run this cell again.
You can, however, copy and alter individual statements to fetch different data.
To inspect the resulting database, I recommend https://sqlitebrowser.org/.
The database file should be in the same directory as this notebook.

In [3]:
data_manager.fetch_reference_set()
data_manager.fetch_reference_sequences("GRCh37.p13")
data_manager.fetch_populations(pop_filter = None)
data_manager.fetch_individuals()
data_manager.fetch_individuals("CHB", "1000GENOMES:phase_3")
data_manager.fetch_variants(17671934, 17675934, "22")
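
If you would rather check the database from Python than from a GUI, the standard-library sqlite3 module can list its tables and their row counts. This is only a minimal sketch: `list_tables` is a hypothetical helper and not part of gentype, and the demo runs on a throwaway in-memory database rather than the real one.

```python
import sqlite3


def list_tables(connection):
    """Return {table_name: row_count} for every table in an SQLite database."""
    cursor = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    counts = {}
    for (name,) in cursor.fetchall():
        # Table names come from sqlite_master itself, so quoting them is enough.
        row = connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()
        counts[name] = row[0]
    return counts


# Self-contained demo on a throwaway in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE example (id INTEGER)")
demo.executemany("INSERT INTO example VALUES (?)", [(1,), (2,), (3,)])
demo_counts = list_tables(demo)
print(demo_counts)
demo.close()
```

To use it on the notebook's database, pass `sqlite3.connect(path)` instead of the demo connection, where `path` is whatever file the DataManager created next to this notebook (the exact file name is not stated here, so check the directory first).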

The following generates an inference matrix from the data in the local database. The inference matrix can be constructed for a given population and for a region specified by start and end. Be sure to fetch the corresponding population before constructing the matrix.

In [6]:
inference_matrix = data_manager.generate_inference_matrix(start = 17671934, end = 17675934, population = "ALL")
inference_matrix.shape

(2504, 123)

The following cell runs the sampler with the inference matrix.

In [7]:
sampler = PiCollapsedNonparametricGibbsSampler()
sampler.fit(inference_matrix)

Iteration: 1; Current clusters: 3; Likelihood: -28968.778
Iteration: 2; Current clusters: 2; Likelihood: -28930.802
Iteration: 3; Current clusters: 2; Likelihood: -28912.060
Iteration: 4; Current clusters: 2; Likelihood: -28838.925
Iteration: 5; Current clusters: 2; Likelihood: -28846.841
Iteration: 6; Current clusters: 2; Likelihood: -28816.714
Iteration: 7; Current clusters: 2; Likelihood: -28818.587
Iteration: 8; Current clusters: 2; Likelihood: -28812.185
Iteration: 9; Current clusters: 2; Likelihood: -28793.387
Iteration: 10; Current clusters: 2; Likelihood: -28762.678
Iteration: 11; Current clusters: 2; Likelihood: -28771.244
Iteration: 12; Current clusters: 2; Likelihood: -28699.184
Iteration: 13; Current clusters: 2; Likelihood: -28618.639
Iteration: 14; Current clusters: 2; Likelihood: -28530.348
Iteration: 15; Current clusters: 2; Likelihood: -28455.006
Iteration: 16; Current clusters: 2; Likelihood: -28185.044
Iteration: 17; Current clusters: 2; Likelihood: -27960.246
Iterat
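
The progress lines above are plain text, so comparing iterations or plotting the likelihood requires parsing them first. The sketch below assumes exactly the line format shown in the output. `parse_progress` is a hypothetical helper and not part of gentype; if the sampler exposes these values directly, reading them from the sampler would be preferable.

```python
import re

# Matches lines such as "Iteration: 4; Current clusters: 2; Likelihood: -28838.925".
LOG_LINE = re.compile(
    r"Iteration: (\d+); Current clusters: (\d+); Likelihood: (-?\d+(?:\.\d+)?)"
)


def parse_progress(text):
    """Return a list of (iteration, clusters, likelihood) tuples from sampler output."""
    return [
        (int(iteration), int(clusters), float(likelihood))
        for iteration, clusters, likelihood in LOG_LINE.findall(text)
    ]


# A few lines copied from the output above.
sample_output = """\
Iteration: 1; Current clusters: 3; Likelihood: -28968.778
Iteration: 4; Current clusters: 2; Likelihood: -28838.925
Iteration: 5; Current clusters: 2; Likelihood: -28846.841
Iteration: 17; Current clusters: 2; Likelihood: -27960.246
"""

progress = parse_progress(sample_output)
# Pick the iteration with the highest reported likelihood.
best = max(progress, key=lambda row: row[2])
print(best)
```

Note that the reported likelihood does not increase at every step (iteration 5 is lower than iteration 4). Gibbs sampling is stochastic, so such dips are not a sign of an error in themselves.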

The following generates the distribution (as a dict) of the number of variations per strand in the specified region, i.e. {n: number of strands with n variations}.

In [4]:
distribution = data_manager.get_variation_distribution(start = 17671934, end = 17675934, population = "CHB")
distribution

{2: 137, 1: 6, 3: 54, 4: 8, 5: 1}
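
Because the distribution is an ordinary dict, summary statistics can be computed with plain Python. The sketch below uses the values printed above; the variable names are only illustrative.

```python
# Values copied from the output above: {n: number of strands with n variations}.
distribution = {2: 137, 1: 6, 3: 54, 4: 8, 5: 1}

# Total number of strands covered by the distribution.
total_strands = sum(distribution.values())

# Total number of variations, weighting each n by how many strands have it.
total_variations = sum(n * count for n, count in distribution.items())

mean_variations = total_variations / total_strands

print(total_strands)              # 206
print(round(mean_variations, 3))  # 2.325

# Share of strands for each number of variations, in ascending order of n.
for n in sorted(distribution):
    print(n, distribution[n], f"{distribution[n] / total_strands:.1%}")
```

The total of 206 strands corresponds to 103 individuals if each individual contributes two strands.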