# Submit zebrafish marker lists using the submit list notebook

This notebook allows you to create and upload a marker list. <br>
<br>
A marker list consists of two components: the metadata and the list(s) of markers. <br>
The metadata is entered manually, while the markers are taken from a tab delimited file with one or two columns. <br>
<br>
In the settings you only have to specify the path to this marker file and information about the column(s) of the file <b>(2)</b>. 
<br>
The submit lists notebook was modified to process the format of the two zebrafish marker lists obtained from the scientific publication "Characterization of the Zebrafish Cell Landscape at Single-Cell Resolution".
The lists are stored in two Excel files: DataSheet1.XLSX and DataSheet2.XLSX. DataSheet1.XLSX contains the markers of the entire zebrafish cell landscape, and DataSheet2.XLSX contains the markers of each tissue separately. Because the two files have distinct formats, the first code block processes the marker list from DataSheet1.XLSX and the second block processes DataSheet2.XLSX <b>(2.1)</b>. <br>
<br>
The input of the metadata, as well as the verification and extension of the markers themselves, is supported by whitelists. The whitelists are located in an external repository, which is downloaded or updated before the data is entered <b>(3)</b>. <br>
<br>
Now the first metadata can be entered: the name of the list, the organism, and the type of markers (genes or genomic regions) <b>(4)</b>. <br>
<br>
At this point the marker file is being read in. If the markers are genes, they will be filtered and expanded using whitelists (Gene name and Ensembl ID) <b>(5)</b>. <br>
<br>
Finally, the remaining metadata can be entered and the new marker list can be saved <b>(6)</b>. <br>
<br>
The accuracy and integrity of the lists are confirmed through checks performed by the validation function <b>(7)</b>.

## 1. Loading packages

In [1]:
import markerrepo.marker_repo as mr
import markerrepo.parsing as parse
import markerrepo.update_uids as u_uids
import markerrepo.generate_metafile as gm
import markerrepo.validate_yaml as validate
import markerrepo.utils as utils
import pandas as pd

%load_ext autoreload
%autoreload 2

## 2. Settings

Specify path of the cloned repository.

In [2]:
# The path to the local clone of the repository must be specified here
# annotate_by_marker_and_features is the latest version of the repository

repo_path = "/mnt/workspace_stud/allstud/wp2/annotate_by_marker_and_features-sort"

Path of the list to be added to the marker repo. <br>
The file must contain either one marker per line or two tab-separated columns (marker and additional info such as cell type).
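
As an illustration of the expected file layout, the following is a minimal sketch of reading such a file with pandas. The file contents and the variable names (`two_column_file`, `markers_example`, `markers_single`) are made up for this example, and an in-memory buffer stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical two-column marker file: marker <TAB> info (here: cell type)
two_column_file = io.StringIO(
    "tuba7l\tSpermatid\n"
    "syce2\tSpermatid\n"
    "cox7a2a\tEmbryonic macrophage\n"
)
markers_example = pd.read_csv(
    two_column_file, sep="\t", header=None, names=["Marker", "Info"]
)

# Hypothetical one-column marker file: one marker per line
one_column_file = io.StringIO("tuba7l\nsyce2\n")
markers_single = pd.read_csv(
    one_column_file, sep="\t", header=None, names=["Marker"]
)

print(markers_example)
print(markers_single)
```

With a real file, the `io.StringIO` object would be replaced by the path to the marker file.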

## 2.1 Submit zebrafish markers

In [4]:
# This code line reads the Excel file "DataSheet1.XLSX" into a pandas DataFrame named "df"
# It opens the sheet named "zebrafish cell landscape"
df = pd.read_excel(open("/mnt/workspace_stud/allstud/wp2/DataSheet1.XLSX", 'rb'), sheet_name="zebrafish cell landscape", index_col=[0])

# Select the columns whose name contains "gene name", rename them using the first row
# of the DataFrame (the cell types), and then drop that first row
cellGene = df.loc[:, df.columns.str.contains("gene name")].rename(columns=df.iloc[0]).drop(df.index[0])

# Collect the unique [gene, cell type] pairs in the list "marker_infos"
# The outer loop iterates over the column names of "cellGene" (the cell types),
# the inner loop iterates over the gene names in each column
# "marker_info" holds the gene name and the cell type
# A pair is appended only if it is not already in the list

marker_infos = []
for cellType in list(cellGene):
    for gene in cellGene[cellType]:
        marker_info = [gene, cellType]
        if marker_info not in marker_infos:
            marker_infos.append(marker_info)
            
# The variable "headers" defines the column headers for the DataFrame            
headers = ["Marker", "Info"]

# This line of code creates a new DataFrame named "markers"
markers = pd.DataFrame(marker_infos, columns=headers)

# The code returns the DataFrame "markers", which contains the marker information from the Excel file
markers

Unnamed: 0,Marker,Info
0,zgc:55461,Spermatid
1,si:ch73-111k22.3,Spermatid
2,tuba7l,Spermatid
3,si:dkey-40g16.5,Spermatid
4,syce2,Spermatid
...,...,...
29009,atp5meb,Embryonic macrophage
29010,ndufb7,Embryonic macrophage
29011,cox7a2a,Embryonic macrophage
29012,cnbpb,Embryonic macrophage
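
The nested loop above turns a wide table (one column per cell type) into a long table of unique gene and cell type pairs. The same reshaping can be sketched with `DataFrame.melt` on a small toy table. The data below is invented, and, unlike the loop, this variant also drops empty cells explicitly, so it is an alternative under that assumption rather than an exact replacement.

```python
import pandas as pd

# Toy stand-in for "cellGene": one column per cell type, gene names as values
cell_gene_example = pd.DataFrame({
    "Spermatid": ["tuba7l", "syce2", "tuba7l"],
    "Embryonic macrophage": ["ndufb7", "cox7a2a", None],
})

markers_example = (
    cell_gene_example
    .melt(var_name="Info", value_name="Marker")  # wide -> long
    .dropna(subset=["Marker"])                   # drop empty cells
    .drop_duplicates()                           # keep each pair once
    [["Marker", "Info"]]                         # same column order as above
    .reset_index(drop=True)
)

print(markers_example)
```

For large tables this avoids the repeated list membership test, which becomes slow as `marker_infos` grows.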


In [5]:
# The tissue-specific zebrafish markers are stored in the Excel file "DataSheet2.XLSX"
# The following code was written specifically for the format of this file
# and must be adjusted for a list in a different format

# Read the sheet named "Testis" from "DataSheet2.XLSX" into a pandas DataFrame, with "index_col" set to None
# To process the markers of another tissue (e.g. "Gill"), change "sheet_name" manually

df = pd.read_excel(open("/mnt/workspace_stud/allstud/wp2/DataSheet2.XLSX", 'rb'), sheet_name="Testis", index_col=None)

# This line specifies the column names for gene name and cell type in the DataFrame to create the marker list
gene_column_name = "gene"
cell_type_column_name = "cell type"

# Fill the missing values in the "cell type" column with the last valid value
# above them (forward fill)
df[cell_type_column_name] = df[cell_type_column_name].ffill()

# Create a new DataFrame "cellGene" containing only the "gene" and "cell type" columns
# and drop the rows whose cell type is still missing
cellGene = df[[gene_column_name, cell_type_column_name]].dropna(subset=[cell_type_column_name])

# Finally, we rename the columns of the DataFrame to "Marker" (gene) and "Info" (cell type)
cellGene.columns = ["Marker", "Info"]

# This line creates a new DataFrame named "markers"
markers = pd.DataFrame(cellGene)

# The resulting DataFrame "markers" is displayed, which contains gene names and the corresponding cell type information
markers


Unnamed: 0,Marker,Info
0,si:ch211-121a2.2,Spermatid
1,zgc:92249,Spermatid
2,cremb,Spermatid
3,zgc:92789,Spermatid
4,BX004999.1,Spermatid
...,...,...
944,maco1b,Innate immune cell
945,rpl21,Innate immune cell
946,si:ch211-5k11.8,Innate immune cell
947,aldoaa,Innate immune cell
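
Since the sheet name has to be changed by hand for every tissue, the steps above could be wrapped in a helper and applied to several sheets at once. The sketch below uses toy DataFrames in place of the real sheets. The helper name `sheet_to_markers`, the toy contents, and the assumption that every tissue sheet uses the same "gene" and "cell type" columns are all hypothetical. With the real file, `pd.read_excel(path, sheet_name=None)` returns such a dict of sheet name to DataFrame.

```python
import pandas as pd

def sheet_to_markers(sheet, gene_col="gene", cell_type_col="cell type"):
    """Turn one tissue sheet into a two-column Marker/Info DataFrame."""
    sheet = sheet.copy()
    # Forward fill: reuse the last valid cell type for the rows below it
    sheet[cell_type_col] = sheet[cell_type_col].ffill()
    out = sheet[[gene_col, cell_type_col]].dropna(subset=[cell_type_col])
    out.columns = ["Marker", "Info"]
    return out.reset_index(drop=True)

# Toy stand-in for the dict returned by pd.read_excel(..., sheet_name=None)
sheets_example = {
    "Testis": pd.DataFrame({
        "gene": ["cremb", "tekt1", "rpl21"],
        "cell type": ["Spermatid", None, "Innate immune cell"],
    }),
    "Gill": pd.DataFrame({
        "gene": ["aldoaa", "maco1b"],
        "cell type": ["Innate immune cell", None],
    }),
}

markers_by_tissue = {
    tissue: sheet_to_markers(sheet) for tissue, sheet in sheets_example.items()
}
print(markers_by_tissue["Testis"])
```

Each entry of `markers_by_tissue` could then be passed through the remaining steps of the notebook as a separate marker list.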


## 3. Get whitelists

Pull whitelist repository and update if necessary.

In [7]:
mr.get_whitelists()

Initiating whitelist fetching...
Directory ./metadata_whitelists already exists.
Update is set to False. Skipping update.
Whitelist fetching process completed.



## 4. Get essential metadata

Enter essential metadata: list name, organism, and marker type.

In [9]:
# LIST_NAME = input("Please enter the name of the marker list: ")
LIST_NAME = "zebrafish_gene_cellType"
ORGANISM = mr.select(key="organism")
MARKER_TYPE = mr.select(key="marker_type")

Select organism
1:	human 9606
2:	mouse 10090
3:	zebrafish 7955
4:	rat 10114
5:	pig 9823
6:	medaka 8090
7:	chicken 9031
8:	drosophila 7215
9:	yeast 4932
3
Selection: zebrafish 7955

Select marker_type
1:	Genes
2:	Genomic regions
1
Selection: Genes



Read the gene whitelist in order to filter and extend the marker genes.

In [10]:
if MARKER_TYPE == "Genes":
    gene_dict = mr.get_gene_dict(ORGANISM)

## 5. Transform marker list

<b>Read</b>, <b>filter</b>, <b>extend</b> and <b>convert</b> marker list in order to append it to the yaml file.

In [11]:
markers['Marker'] = markers['Marker'].str.upper()
print("All markers of provided list:")
display(markers)

if MARKER_TYPE == "Genes":
    markers_removed = markers[~markers['Marker'].isin(gene_dict.keys())]
    print("Removed markers:")
    display(markers_removed)
    markers_filtered = markers[markers['Marker'].isin(gene_dict.keys())]
    markers_extended = mr.update_markers(markers_filtered, gene_dict)
    print("Filtered and extended markers:")
    display(markers_extended)
    marker_dict = parse.dataframe_to_dict(markers_extended)
else:
    marker_dict = parse.dataframe_to_dict(markers)
    
marker_list = []
for name in marker_dict.keys():
    marker_list.append({'name': name, 'markers': marker_dict[name]})

All markers of provided list:


Unnamed: 0,Marker,Info
0,SI:CH211-121A2.2,Spermatid
1,ZGC:92249,Spermatid
2,CREMB,Spermatid
3,ZGC:92789,Spermatid
4,BX004999.1,Spermatid
...,...,...
944,MACO1B,Innate immune cell
945,RPL21,Innate immune cell
946,SI:CH211-5K11.8,Innate immune cell
947,ALDOAA,Innate immune cell


Removed markers:


Unnamed: 0,Marker,Info
1,ZGC:92249,Spermatid
3,ZGC:92789,Spermatid
25,LRRC69,Spermatid
34,GB:EH456644,Spermatid
42,SI:CH211-57H10.1,Spermatid
58,H1F0,Spermatid
59,NUPL2,Spermatid
62,SI:DKEY-148F10.4,Spermatid
63,SI:DKEYP-46H3.1,Spermatid
68,ZGC:110130,Spermatid


Filtered and extended markers:


Unnamed: 0,Marker,Info
0,SI:CH211-121A2.2 ENSDARG00000039682,Spermatid
2,CREMB ENSDARG00000102899,Spermatid
4,BX004999.1 ENSDARG00000096792,Spermatid
5,FAM166B ENSDARG00000100292,Spermatid
6,TEKT1 ENSDARG00000101331,Spermatid
...,...,...
944,MACO1B ENSDARG00000012741,Innate immune cell
945,RPL21 ENSDARG00000010516,Innate immune cell
946,SI:CH211-5K11.8 ENSDARG00000079078,Innate immune cell
947,ALDOAA ENSDARG00000011665,Innate immune cell
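
To make the filter and extend step easier to follow, here is a minimal sketch on toy data. It assumes a whitelist that maps upper-case gene names to Ensembl IDs; the actual structure of `gene_dict` and the internals of `mr.update_markers` may differ, and all names ending in `_example` are hypothetical.

```python
import pandas as pd

# Assumed whitelist layout: gene name -> Ensembl ID (IDs taken from the output above)
gene_dict_example = {
    "CREMB": "ENSDARG00000102899",
    "RPL21": "ENSDARG00000010516",
}

markers_example = pd.DataFrame({
    "Marker": ["cremb", "zgc:92249", "rpl21"],
    "Info": ["Spermatid", "Spermatid", "Innate immune cell"],
})
markers_example["Marker"] = markers_example["Marker"].str.upper()

# Filter: split the markers into those found in the whitelist and the rest
in_whitelist = markers_example["Marker"].isin(list(gene_dict_example))
removed_example = markers_example[~in_whitelist]
kept_example = markers_example[in_whitelist].copy()

# Extend: append the Ensembl ID to each remaining gene name
kept_example["Marker"] = (
    kept_example["Marker"] + " " + kept_example["Marker"].map(gene_dict_example)
)

print(removed_example)
print(kept_example)
```

Computing the boolean mask once and reusing it for both the removed and the kept markers keeps the two tables complementary.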


## 6. Enter metadata

Enter general metadata, tags and add marker list(s) automatically.

In [12]:
UID = mr.get_uid()
file_name = f"{LIST_NAME}_{UID}.yaml"

list_path = gm.generate_file(UID, LIST_NAME, False, marker_list, ORGANISM, MARKER_TYPE, repo_path=repo_path)

____________________________________________________________________________________________________

                                              Metadata                                              
____________________________________________________________________________________________________

This part contains the metadata of the list.


----------------------------------------------------------------------------------------------------
                                             Submitter                                              
----------------------------------------------------------------------------------------------------

The one who submitted the list to the repository.


The name should be entered in the format 'Last name, First name'.
name: Quintanilla, Marta

Do you want to add any of the following optional keys? (1,...,1 or n)
1:  email  The email adress of the submitter.
n

Do you want to add any of the following optional keys? (1,...,4 or n)
1:  date   

## 7. Validation

Check whether the format of the yaml file is correct.

In [13]:
if validate.validate_file(utils.read_in_yaml(f"{list_path}", marker_list=False), repo_path=repo_path):
    print(f"No errors were found concerning the '{LIST_NAME}' marker list.")

No errors were found concerning the 'zebrafish_gene_cellType' marker list.
