# Analyzing Gene Functions in Arabidopsis thaliana

[The Arabidopsis Information Resource (TAIR)](https://www.arabidopsis.org/) maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana. Many analyses are possible with this data; this post focuses on gene functions, as described by the Gene Ontology (GO) annotations. You can download the annotation data (`ATH_GO_GOSLIM.txt.gz`) from the TAIR site, but I have also included it in my GitHub repo [mlg556/arabidopsis](https://github.com/mlg556/arabidopsis), so you can simply clone that.

First, let's import the libraries. We will use [pandas](https://pandas.pydata.org/) to read and analyze the data, and `zipfile` to extract the archived data file. You can install pandas with `pip install pandas`; `zipfile` is part of the Python standard library.

In [1]:
import pandas as pd
import zipfile

We extract the archive `ATH_GO_GOSLIM.txt.gz` to `ATH_GO_GOSLIM.txt`. Note that although the `.gz` extension suggests a gzip file, the archive is actually a zip file, so we need a regular zip extractor.

In [2]:
# extract the txt file; the extension is erroneously .gz, but it's actually a zip file
fname = "ATH_GO_GOSLIM.txt"
fname_gz = f"{fname}.gz"

with zipfile.ZipFile(fname_gz, 'r') as z:
    z.extractall(".")
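
If you are unsure which format a downloaded copy really has, the check can be made explicit. The following is a minimal sketch, not part of the original notebook: `extract_annotation` is a hypothetical helper name, and the gzip fallback assumes that any archive that is not a zip file is a single gzip-compressed file.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def extract_annotation(archive_path, out_dir="."):
    """Extract an archive whose extension may not match its real format.

    Hypothetical helper: tries zip first, then falls back to gzip.
    Returns the list of extracted file paths.
    """
    archive_path = Path(archive_path)
    out_dir = Path(out_dir)

    # is_zipfile inspects the file contents, not the extension
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as z:
            z.extractall(out_dir)
            return [out_dir / name for name in z.namelist()]

    # assumption: anything else is a plain gzip stream;
    # "ATH_GO_GOSLIM.txt.gz" -> "ATH_GO_GOSLIM.txt"
    target = out_dir / archive_path.stem
    with gzip.open(archive_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return [target]
```

With this helper, `extract_annotation(fname_gz)` would replace the `zipfile.ZipFile` block above and keep working if a future download turns out to be a genuine gzip file.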

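Next we read the tab-separated file into a DataFrame, using the column names listed in `ATH_GO.README.txt`.

In [4]:
# column names from ATH_GO.README.txt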
columns = [
    "locus_name",
    "tair_acc",
    "obj_name",
    "rel_type",
    "go_term",
    "go_id",
    "tair_id",
    "aspect",
    "go_slim",
    "evidence_code",
    "evidence_desc",
    "evidence_with",
    "reference",
    "annotator",
    "date"
]


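# read into dataframe: skip the first four lines of the file; with header=0
# the next line is treated as a header row and replaced by our own `names`
df0 = pd.read_csv(fname, sep="\t", names=columns, skiprows=[0,1,2,3], index_col=False, header=0)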
df0

,locus_name,tair_acc,obj_name,rel_type,go_term,go_id,tair_id,aspect,go_slim,evidence_code,evidence_desc,evidence_with,reference,annotator,date
0,AT1G01010,locus:2200935,AT1G01010,acts upstream of or within,response to oxidative stress,GO:0006979,6625,P,response to stress,IEA,traceable computational prediction,AGI_LocusCode:AT5G19875,Publication:501796011|PMID:34562334,klaasvdp,2022-11-14
1,AT1G01010,locus:2200935,AT1G01010,acts upstream of or within,response to abscisic acid,GO:0009737,11395,P,response to chemical,IEA,traceable computational prediction,AGI_LocusCode:AT4G27410,Publication:501796011|PMID:34562334,klaasvdp,2022-11-14
2,AT1G01010,locus:2200935,AT1G01010,acts upstream of or within,response to lipid,GO:0033993,28865,P,response to chemical,IEA,traceable computational prediction,AGI_LocusCode:AT4G27410|AGI_LocusCode:AT2G0299...,Publication:501796011|PMID:34562334,klaasvdp,2022-11-14
3,AT1G01010,locus:2200935,AT1G01010,acts upstream of or within,oxoacid metabolic process,GO:0043436,21524,P,other cellular processes,IEA,traceable computational prediction,AGI_LocusCode:AT5G63790,Publication:501796011|PMID:34562334,klaasvdp,2022-11-14
4,AT1G01010,locus:2200935,AT1G01010,acts upstream of or within,defense response to other organism,GO:0098542,46569,P,response to external stimulus,IEA,traceable computational prediction,AGI_LocusCode:AT2G43510|AGI_LocusCode:AT4G1473...,Publication:501796011|PMID:34562334,klaasvdp,2022-11-14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456200,YAK,gene:1945468,YAK,is active in,cellular_component,GO:0005575,163,C,unknown cellular components,ND,'Unknown' cellular component,NONE,Communication:1345790,TAIR,2022-02-01
456201,YAK,gene:1945468,YAK,enables,molecular_function,GO:0003674,3226,F,unknown molecular functions,ND,'Unknown' molecular function,,Communication:1345790,TAIR,2006-10-20
456202,YI,gene:1945470,YI,involved in,biological_process,GO:0008150,5239,P,unknown biological processes,ND,'Unknown' biological process,NONE,Communication:1345790,TAIR,2022-02-01
456203,YI,gene:1945470,YI,is active in,cellular_component,GO:0005575,163,C,unknown cellular components,ND,'Unknown' cellular component,NONE,Communication:1345790,TAIR,2022-02-01


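We only need the gene name and its function, so we keep those two columns, drop duplicate rows, and exclude the three root GO terms, which only indicate that the function is unknown.

In [12]: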
select_columns = ["locus_name", "go_term"]

excluded_go_terms = ["molecular_function", "biological_process", "cellular_component"]

df = df0[select_columns] # select gene name and function
df = df.drop_duplicates() # drop duplicates
df = df.query("go_term not in @excluded_go_terms") # exclude go_terms

df

,locus_name,go_term
0,AT1G01010,response to oxidative stress
1,AT1G01010,response to abscisic acid
2,AT1G01010,response to lipid
3,AT1G01010,oxoacid metabolic process
4,AT1G01010,defense response to other organism
...,...,...
456174,XRS9,DNA damage response
456175,XRS9,DNA repair
456181,XRS9,response to X-ray
456182,XTC1,embryo development ending in seed dormancy
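
To see what the two filtering steps do in isolation, here is a small sketch on a made-up table. The `GENE_A`/`GENE_B` rows are invented for illustration and are not TAIR data; only the column names and the excluded terms come from the cell above.

```python
import pandas as pd

# toy data (invented): one exact duplicate row and one root GO term
toy = pd.DataFrame({
    "locus_name": ["GENE_A", "GENE_A", "GENE_A", "GENE_B"],
    "go_term": ["nucleus", "nucleus", "molecular_function", "nucleus"],
})

excluded = ["molecular_function", "biological_process", "cellular_component"]

# drop_duplicates removes the repeated (GENE_A, nucleus) row;
# query then drops rows whose go_term is one of the root terms
toy_filtered = toy.drop_duplicates().query("go_term not in @excluded")
print(toy_filtered)
```

The surviving rows keep their original index labels, which is why the index in the output above has gaps (for example, it jumps from 456175 to 456181).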


Grouping by `go_term` lists together all genes that share a function.

In [13]:
dfg = df.groupby("go_term", as_index=False).apply(lambda x: x)
dfg

,,locus_name,go_term
0,59239,AT1G36280,'de novo' AMP biosynthetic process
0,264419,AT3G57610,'de novo' AMP biosynthetic process
0,301470,AT4G18440,'de novo' AMP biosynthetic process
1,52412,AT1G30820,'de novo' CTP biosynthetic process
1,161071,AT2G34890,'de novo' CTP biosynthetic process
...,...,...,...
7460,97087,AT1G69770,zygote asymmetric cytokinesis in embryo sac
7460,118067,AT2G01210,zygote asymmetric cytokinesis in embryo sac
7460,412149,AT5G49160,zygote asymmetric cytokinesis in embryo sac
7461,129873,AT2G17090,zygote elongation


Finally, we count the genes annotated with each GO term and sort the counts in descending order.

In [17]:
dfg_sum = df.groupby("go_term").count()
dfg_sum = dfg_sum.sort_values(by=['locus_name'], ascending=False)
dfg_sum

go_term,locus_name
nucleus,10494
chloroplast,4966
cytoplasm,4771
protein binding,4519
mitochondrion,4449
...,...
organelle fission,1
"beta,beta digalactosyldiacylglycerol galactosyltransferase activity",1
regulation of MAPK cascade,1
regulation of GTPase activity,1
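
Because duplicate (gene, term) pairs were dropped earlier, `count()` here gives the number of distinct genes per term. As a hedged aside, the same tally can be obtained with `value_counts`, which counts and sorts in one call. The sketch below uses an invented toy table rather than the TAIR data to show that the two approaches agree.

```python
import pandas as pd

# toy data (invented): already de-duplicated (gene, term) pairs
toy = pd.DataFrame({
    "locus_name": ["G1", "G2", "G3", "G1", "G2", "G1"],
    "go_term": ["nucleus", "nucleus", "nucleus",
                "chloroplast", "chloroplast", "cytoplasm"],
})

# approach used in the post: group, count, then sort descending
by_count = (
    toy.groupby("go_term")
    .count()
    .sort_values(by="locus_name", ascending=False)
)

# alternative: value_counts counts rows per term and sorts descending
by_value_counts = toy["go_term"].value_counts()

print(by_count)
print(by_value_counts)
```

The two results hold the same numbers; `by_count` is a DataFrame with a `locus_name` column, while `by_value_counts` is a Series indexed by GO term.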


# Citations

* Berardini, TZ, Mundodi, S, Reiser, R, Huala, E, Garcia-Hernandez, M, Zhang, P, Mueller, LM, Yoon, J, Doyle, A, Lander, G, Moseyko, N, Yoo, D, Xu, I, Zoeckler, B, Montoya, M, Miller, N, Weems, D, and Rhee, SY (2004) Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol. 135(2):1-11.

* TAIR Terms of Use: http://www.arabidopsis.org/doc/about/tair_terms_of_use/417.