Merge pull request #133 from pachterlab/dev
Dev -> main
lauraluebbert committed May 27, 2024
2 parents cae23dc + 8fee3b2 commit 9984599
Showing 24 changed files with 1,075 additions and 289 deletions.
1 change: 1 addition & 0 deletions dev-requirements.txt
Original file line number Diff line number Diff line change
@@ -2,3 +2,4 @@ coverage>=5.1
pytest>=7.0.0
openai<=0.28.1
cellxgene-census
parameterized==0.9.0
6 changes: 4 additions & 2 deletions docs/src/SUMMARY.md
@@ -19,7 +19,8 @@
* [gget enrichr](en/enrichr.md)
* [gget gpt](en/gpt.md)
* [gget info](en/info.md)
* [gget muscle](en/muscle.md)
* [gget muscle](en/muscle.md)
* [gget mutate](en/mutate.md)
* [gget pdb](en/pdb.md)
* [gget ref](en/ref.md)
* [gget search](en/search.md)
@@ -54,7 +55,8 @@
* [gget enrichr](es/enrichr.md)
* [gget gpt](es/gpt.md)
* [gget info](es/info.md)
* [gget muscle](es/muscle.md)
* [gget muscle](es/muscle.md)
* [gget mutate](es/mutate.md)
* [gget pdb](es/pdb.md)
* [gget ref](es/ref.md)
* [gget search](es/search.md)
2 changes: 1 addition & 1 deletion docs/src/en/blast.md
@@ -60,7 +60,7 @@ gget.blast("MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRI
| PREDICTED: gamma-aminobutyric acid receptor-as...| Colobus angolensis palliatus | NaN | 336983 | 180 | 180 | 100% | ... |
| . . . | . . . | . . . | . . . | . . . | . . . | . . . | ... |


<br/><br/>
**BLAST from .fa or .txt file:**
```bash
gget blast fasta.fa
86 changes: 69 additions & 17 deletions docs/src/en/cosmic.md
@@ -1,28 +1,67 @@
> Python arguments are equivalent to long-option arguments (`--arg`), unless otherwise specified. Flags are True/False arguments in Python. The manual for any gget tool can be called from the command-line using the `-h` `--help` flag.
## gget cosmic 🪐
Search for genes, mutations, and other factors associated with cancer using the [COSMIC](https://cancer.sanger.ac.uk/cosmic) (Catalogue Of Somatic Mutations In Cancer) database.
Return format: JSON (command-line) or data frame/CSV (Python).
This module was written by [@AubakirovArman](https://github.com/AubakirovArman).
Return format: JSON (command-line) or data frame/CSV (Python) when `download_cosmic=False`. When `download_cosmic=True`, downloads the requested database into the specified folder.

This module was written in part by [@AubakirovArman](https://github.com/AubakirovArman) (information querying) and [@josephrich98](https://github.com/josephrich98) (database download).

NOTE: License fees apply for the commercial use of COSMIC. You can read more about licensing COSMIC data [here](https://cancer.sanger.ac.uk/cosmic/license).

**Positional argument**
**Positional argument (for querying information)**
`searchterm`
Search term, which can be a mutation, gene name (or Ensembl ID), cancer type, tumor site, study ID, PubMed ID, or sample ID, as defined using the `entity` argument. Example: 'EGFR'
Search term, which can be a mutation, gene name (or Ensembl ID), sample, etc.
Examples for the `searchterm` and `entity` arguments:

| searchterm | entity | |
|--------------|-------------| ---|
| EGFR | mutations | -> Find mutations in the EGFR gene that are associated with cancer |
| v600e | mutations | -> Find genes for which a v600e mutation is associated with cancer |
| COSV57014428 | mutations | -> Find mutations associated with this COSMIC mutations ID |
| EGFR | genes | -> Get the number of samples, coding/simple mutations, and fusions observed in COSMIC for EGFR |
| prostate | cancer | -> Get number of tested samples and mutations for prostate cancer |
| prostate | tumour_site | -> Get number of tested samples, genes, mutations, fusions, etc. with 'prostate' as primary tissue site |
| ICGC | studies | -> Get project code and descriptions for all studies from the ICGC (International Cancer Genome Consortium) |
| EGFR | pubmed | -> Find PubMed publications on EGFR and cancer |
| ICGC | samples | -> Get metadata on all samples from the ICGC (International Cancer Genome Consortium) |
| COSS2907494 | samples | -> Get metadata on this COSMIC sample ID (cancer type, tissue, # analyzed genes, # mutations, etc.) |

**Optional arguments**
NOTE: (Python only) Set to `None` when downloading COSMIC databases with `download_cosmic=True`.

**Optional arguments (for querying information)**
`-e` `--entity`
'mutations' (default), 'genes', 'cancer', 'tumour site', 'studies', 'pubmed', or 'samples'.
Defines the type of the supplied search term.
Defines the type of the results to return.

`-l` `--limit`
Limits number of hits to return. Default: 100.

**Flags (for downloading COSMIC databases)**
`-d` `--download_cosmic`
Switches into database download mode.

`-gm` `--gget_mutate`
TURNS OFF creation of a modified version of the database for use with gget mutate.
Python: `gget_mutate` is True by default. Set `gget_mutate=False` to disable.

**Optional arguments (for downloading COSMIC databases)**
`-mc` `--mutation_class`
'cancer' (default), 'cell_line', 'census', 'resistance', 'screen', or 'cancer_example'
Type of COSMIC database to download.

`-cv` `--cosmic_version`
Version of the COSMIC database. Default: None -> Defaults to latest version.

`-gv` `--grch_version`
Version of the human GRCh reference genome the COSMIC database was based on (37 or 38). Default: 37

**Optional arguments (general)**
`-o` `--out`
Path to the file the results will be saved in, e.g. path/to/directory/results.csv (or .json). Default: Standard out.
Python: `save=True` will save the output in the current working directory.
Path to the file (or folder when downloading databases with the `download_cosmic` flag) the results will be saved in, e.g. 'path/to/results.json'.
Default: None
-> When download_cosmic=False: Results will be returned to standard out
-> When download_cosmic=True: Database will be downloaded into current working directory

**Flags**
**Flags (general)**
`-csv` `--csv`
Command-line only. Returns results in CSV format.
Python: Use `json=True` to return output in JSON format.
@@ -32,18 +71,31 @@ Command-line only. Prevents progress information from being displayed.
Python: Use `verbose=False` to prevent progress information from being displayed.


### Example
### Examples
#### Query information
```bash
gget cosmic -e genes EGFR
gget cosmic EGFR
```
```python
# Python
gget.cosmic("EGFR", entity="genes")
gget.cosmic("EGFR")
```
&rarr; Returns the COSMIC hits for gene 'EGFR' in the format:
&rarr; Returns mutations in the EGFR gene that are associated with cancer in the format:

| Gene | Alternate IDs | Tested samples | Simple Mutations | Fusions | Coding Mutations | ... |
| -------------- |-------------------------| ------------------------| -------------- | ----------|-----|---|
| EGFR| EGFR,ENST00000275493.6,... | 210280 | 31900 | 0 | 31900 | ... |
| . . . | . . . | . . . | . . . | . . . | . . . | ... |
| Gene | Syntax | Alternate IDs | Canonical |
| -------- |------------| -------------------------------| ---------- |
| EGFR | c.*2446A>G | EGFR c.*2446A>G, EGFR p.?, ... | y |
| EGFR | c.(2185_2283)ins(18) | EGFR c.(2185_2283)ins(18), EGFR p.?, ... | y |
| . . . | . . . | . . . | . . . |


### Downloading COSMIC databases
```bash
gget cosmic --download_cosmic
```
```python
# Python
gget.cosmic(searchterm=None, download_cosmic=True)
```
&rarr; Downloads the COSMIC cancer database of the latest COSMIC release into the current working directory.

104 changes: 104 additions & 0 deletions docs/src/en/mutate.md
@@ -0,0 +1,104 @@
> Python arguments are equivalent to long-option arguments (`--arg`), unless otherwise specified. Flags are True/False arguments in Python. The manual for any gget tool can be called from the command-line using the `-h` `--help` flag.
## gget mutate 🧟
Takes in nucleotide sequences and mutations (in [standard mutation annotation](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1867422/)) and returns mutated versions of the input sequences according to the provided mutations.
Return format: Saves mutated sequences in FASTA format (or returns a list containing the mutated sequences if `out=None`).

**Positional argument**
`sequences`
Path to the FASTA file containing the sequences to be mutated, e.g., 'path/to/seqs.fa'.
Sequence identifiers following the '>' character must correspond to the identifiers in the seq_ID column of `mutations`.
NOTE: Only the string following the '>' until the first space or dot will be used as a sequence identifier. As a result, version numbers of Ensembl IDs will be ignored.

Example format of the FASTA file:
```
>seq1 (or ENSG00000106443)
ACTGCGATAGACT
>seq2
AGATCGCTAG
```

Alternatively: Input sequence(s) as a string or list, e.g. 'AGCTAGCT'.
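
The identifier rule described in the NOTE above can be illustrated with a small sketch. This is not gget's own code: the helper name `fasta_identifier` and the version suffix `.17` are hypothetical, chosen only to show how a header is cut at the first space or dot.

```python
import re


def fasta_identifier(header_line: str) -> str:
    """Return the text after '>' up to the first space or dot (hypothetical helper)."""
    text = header_line.lstrip(">").strip()
    # Split once on the first space or dot and keep the leading part.
    return re.split(r"[ .]", text, maxsplit=1)[0]


# The version suffix '.17' is made up for illustration.
print(fasta_identifier(">ENSG00000106443.17 some description"))
print(fasta_identifier(">seq1 (or ENSG00000106443)"))
```

Under this rule, the values in the seq_ID column of `mutations` should be written without version numbers or descriptions so that they match the shortened identifiers.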

**Required arguments**
`-m` `--mutations`
Path to a CSV or TSV file (e.g., 'path/to/mutations.csv'), or a data frame (DataFrame object), containing information about the mutations in the following format:

| mutation | mut_ID | seq_ID |
|------------------|--------|--------|
| c.2C>T | mut1 | seq1 |
| c.9_13inv | mut2 | seq2 |
| c.9_13inv | mut2 | seq4 |
| c.9_13delinsAAT | mut3 | seq4 |
| ... | ... | ... |

'mutation' = Column containing the mutations to be performed written in standard mutation annotation
'mut_ID' = Column containing the identifier for each mutation
'seq_ID' = Column containing the identifiers of the sequences to be mutated (must correspond to the string following the '>' character in the 'sequences' FASTA file; do NOT include spaces or dots)

Alternatively: Input mutation(s) as a string or list, e.g., 'c.2C>T'.
If a list is provided, the number of mutations must equal the number of input sequences.
For use from the terminal (bash): Enclose individual mutation annotations in quotation marks to prevent parsing errors.
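
As a minimal sketch of the table format above, the following builds the same rows with Python's standard `csv` module. The function name `mutations_to_csv` is hypothetical; only the three default column names ('mutation', 'mut_ID', 'seq_ID') and the example rows come from the table.

```python
import csv
import io

# Example rows copied from the table above.
rows = [
    {"mutation": "c.2C>T", "mut_ID": "mut1", "seq_ID": "seq1"},
    {"mutation": "c.9_13inv", "mut_ID": "mut2", "seq_ID": "seq2"},
    {"mutation": "c.9_13inv", "mut_ID": "mut2", "seq_ID": "seq4"},
    {"mutation": "c.9_13delinsAAT", "mut_ID": "mut3", "seq_ID": "seq4"},
]


def mutations_to_csv(mutation_rows):
    """Serialize mutation rows to CSV text using the default column names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["mutation", "mut_ID", "seq_ID"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(mutation_rows)
    return buffer.getvalue()


csv_text = mutations_to_csv(rows)
print(csv_text)
```

Writing `csv_text` to a file such as 'path/to/mutations.csv' would give a file in the layout that the `mutations` argument describes.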

**Optional arguments**
`-k` `--k`
Length of sequences flanking the mutation. Default: 31.
If k > total length of the sequence, the entire sequence will be kept.

`-mc` `--mut_column`
Name of the column containing the mutations to be performed in `mutations`. Default: 'mutation'.

`-mic` `--mut_id_column`
Name of the column containing the IDs of each mutation in `mutations`. Default: 'mut_ID'.

`-sic` `--seq_id_column`
Name of the column containing the IDs of the sequences to be mutated in `mutations`. Default: 'seq_ID'.

`-o` `--out`
Path to output FASTA file containing the mutated sequences, e.g., 'path/to/output_fasta.fa'.
Default: None -> returns a list of the mutated sequences to standard out.
The identifiers (following the '>') of the mutated sequences in the output FASTA will be '>[seq_ID]_[mut_ID]'.

**Flags**
`-q` `--quiet`
Command-line only. Prevents progress information from being displayed.
Python: Use `verbose=False` to prevent progress information from being displayed.

### Examples
```bash
gget mutate ATCGCTAAGCT -m 'c.4G>T'
```
```python
# Python
gget.mutate("ATCGCTAAGCT", "c.4G>T")
```
&rarr; Returns ATCTCTAAGCT.

<br/><br/>

**List of sequences with a mutation for each sequence provided in a list:**
```bash
gget mutate ATCGCTAAGCT TAGCTA -m 'c.4G>T' 'c.1_3inv' -o mut_fasta.fa
```
```python
# Python
gget.mutate(["ATCGCTAAGCT", "TAGCTA"], ["c.4G>T", "c.1_3inv"], out="mut_fasta.fa")
```
&rarr; Saves 'mut_fasta.fa' file containing:
```
>seq1_mut1
ATCTCTAAGCT
>seq2_mut2
GATCTA
```

<br/><br/>

**One mutation applied to several sequences with adjusted `k`:**
```bash
gget mutate ATCGCTAAGCT TAGCTA -m 'c.1_3inv' -k 3
```
```python
# Python
gget.mutate(["ATCGCTAAGCT", "TAGCTA"], "c.1_3inv", k=3)
```
&rarr; Returns ['CTAGCT', 'GATCTA'].
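
To make the role of `k` in the examples above concrete, here is a simplified sketch that reproduces their outputs. It is not gget's implementation: the function `apply_mutation` is hypothetical, it handles only substitutions (`c.4G>T`) and inversions (`c.1_3inv`), and it models an inversion as a plain reversal of the span because that matches the results shown above.

```python
import re

# Patterns for the two annotation types covered by this sketch.
SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")
INVERSION = re.compile(r"^c\.(\d+)_(\d+)inv$")


def apply_mutation(sequence, mutation, k=31):
    """Apply one mutation and keep up to k flanking bases on each side."""
    sub = SUBSTITUTION.match(mutation)
    inv = INVERSION.match(mutation)
    if sub:
        position, ref, alt = int(sub.group(1)), sub.group(2), sub.group(3)
        start, end = position - 1, position  # convert 1-based position to a slice
        if sequence[start:end] != ref:
            raise ValueError(f"Reference base at position {position} is not {ref}")
        replacement = alt
    elif inv:
        start, end = int(inv.group(1)) - 1, int(inv.group(2))
        # Assumption: the inverted span is simply reversed.
        replacement = sequence[start:end][::-1]
    else:
        raise ValueError(f"Unsupported mutation annotation: {mutation}")
    left = sequence[max(0, start - k):start]
    right = sequence[end:end + k]
    return left + replacement + right


print(apply_mutation("ATCGCTAAGCT", "c.4G>T"))
print([apply_mutation(s, "c.1_3inv", k=3) for s in ["ATCGCTAAGCT", "TAGCTA"]])
```

With the default `k=31` the flanks cover the whole 11-base sequence, so the full mutated sequence is returned; with `k=3` only three bases after the inverted span are kept.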
85 changes: 68 additions & 17 deletions docs/src/es/cosmic.md
@@ -1,28 +1,67 @@
> Python arguments are equivalent to long-option arguments (`--arg`), unless otherwise specified. Flags are True/False arguments in Python. The manual for any gget module can be called from the command line using the `-h` `--help` flag.
## gget cosmic 🪐
Search for genes, mutations, etc. associated with cancers using the [COSMIC](https://cancer.sanger.ac.uk/cosmic) (Catalogue Of Somatic Mutations In Cancer) database.
Returns: Results in JSON format (command line) or data frame/CSV (Python).
`gget cosmic` was written by [@AubakirovArman](https://github.com/AubakirovArman).
Returns: JSON (command line) or data frame/CSV (Python) when `download_cosmic=False`. When `download_cosmic=True`, downloads the requested database into the specified folder.

License fees apply for the commercial use of COSMIC. You can read more about licensing COSMIC data [here](https://cancer.sanger.ac.uk/cosmic/license).
This module was written in part by [@AubakirovArman](https://github.com/AubakirovArman) (information querying) and [@josephrich98](https://github.com/josephrich98) (database download).

**Positional argument**
NOTE: License fees apply for the commercial use of COSMIC. You can read more about licensing COSMIC data [here](https://cancer.sanger.ac.uk/cosmic/license).

**Positional argument (for querying information)**
`searchterm`
Search term. Can be a mutation, a gene name (or Ensembl ID), cancer type, tumor site, study ID, PubMed ID, or sample ID, as defined with the `entity` argument. Example: 'EGFR'
Search term, which can be a mutation, a gene name (or Ensembl ID), a sample, etc.
Examples for the `searchterm` and `entity` arguments:

| searchterm | entity | |
|--------------|------------|-|
| EGFR | mutations | -> Find mutations in the EGFR gene that are associated with cancer |
| v600e | mutations | -> Find genes for which a v600e mutation is associated with cancer |
| COSV57014428 | mutations | -> Find mutations associated with this COSMIC mutation ID |
| EGFR | genes | -> Get the number of samples, coding/simple mutations, and fusions observed in COSMIC for EGFR |
| prostate | cancer | -> Get the number of tested samples and mutations for prostate cancer |
| prostate | tumour_site | -> Get the number of tested samples, genes, mutations, fusions, etc. with 'prostate' as primary tissue site |
| ICGC | studies | -> Get project code and descriptions for all studies from the ICGC (International Cancer Genome Consortium) |
| EGFR | pubmed | -> Find PubMed publications on EGFR and cancer |
| ICGC | samples | -> Get metadata on all samples from the ICGC (International Cancer Genome Consortium) |
| COSS2907494 | samples | -> Get metadata on this COSMIC sample ID (cancer type, tissue, # analyzed genes, # mutations, etc.) |

**Optional arguments**
NOTE: (Python only) Set to `None` when downloading COSMIC databases with `download_cosmic=True`.

**Optional arguments (for querying information)**
`-e` `--entity`
'mutations' (mutation), 'genes' (gene name / Ensembl ID), 'cancer' (cancer type), 'tumour site' (tumor site), 'studies' (study ID), 'pubmed' (PubMed ID), or 'samples' (sample ID). Default: 'mutations'.
Defines the type of the search term (`searchterm`).
'mutations' (default), 'genes', 'cancer', 'tumour_site', 'studies', 'pubmed', or 'samples'.
Defines the type of results to return.

`-l` `--limit`
Limits the number of results returned. Default: 100.
Limits the number of results to return. Default: 100.

**Flags (for downloading COSMIC databases)**
`-d` `--download_cosmic`
Switches into database download mode.

`-gm` `--gget_mutate`
TURNS OFF creation of a modified version of the database for use with gget mutate.
Python: `gget_mutate` is True by default. Set `gget_mutate=False` to disable.

**Optional arguments (for downloading COSMIC databases)**
`-mc` `--mutation_class`
'cancer' (default), 'cell_line', 'census', 'resistance', 'screen', or 'cancer_example'
Type of COSMIC database to download.

`-cv` `--cosmic_version`
Version of the COSMIC database. Default: None -> Defaults to the latest version.

`-gv` `--grch_version`
Version of the human GRCh reference genome the COSMIC database was based on (37 or 38). Default: 37

**Optional arguments (general)**
`-o` `--out`
Path to the file the results will be saved in, e.g. path/to/directory/results.csv (or .json). Default: Standard out (STDOUT).
For Python, use `save=True` to save the results in the current working directory.
Path to the file (or folder when downloading databases with the `download_cosmic` flag) the results will be saved in, e.g. 'path/to/results.json'.
Default: None
-> When download_cosmic=False: Results will be returned to standard out
-> When download_cosmic=True: Database will be downloaded into the current working directory

**Flags**
**Flags (general)**
`-csv` `--csv`
Command-line only. Returns results in CSV format.
For Python, use `json=True` to return results in JSON format.
@@ -33,16 +72,28 @@ For Python, use `verbose=False` to prevent progress information from being


### Example
#### Query information
```bash
gget cosmic -e genes EGFR
```
```python
# Python
gget.cosmic("EGFR", entity="genes")
```
&rarr; Returns the COSMIC results for the gene 'EGFR':
&rarr; Returns mutations in the EGFR gene that are associated with cancer in the format:

| Gene | Syntax | Alternate IDs | Canonical |
| -------- |------------| -------------------------------| ---------- |
| EGFR | c.*2446A>G | EGFR c.*2446A>G, EGFR p.?, ... | y |
| EGFR | c.(2185_2283)ins(18) | EGFR c.(2185_2283)ins(18), EGFR p.?, ... | y |
| . . . | . . . | . . . | . . . |

| Gene | Alternate IDs | Tested samples | Simple Mutations | Fusions | Coding Mutations | ... |
| -------------- |-------------------------| ------------------------| -------------- | ----------|-----|---|
| EGFR| EGFR,ENST00000275493.6,... | 210280 | 31900 | 0 | 31900 | ... |
| . . . | . . . | . . . | . . . | . . . | . . . | ... |
### Downloading COSMIC databases
```bash
gget cosmic --download_cosmic
```
```python
# Python
gget.cosmic(searchterm=None, download_cosmic=True)
```
&rarr; Downloads the COSMIC cancer database of the latest COSMIC release.