# Data exploration with Pandas

###  Tree of Life
<table><tr><td><img src="http://t3.gstatic.com/licensed-image?q=tbn:ANd9GcSq4PRaxgfpjNOSe81JgN8l71DWtDHpkSfH3xo8EOk7khAlqQozXnJm8ubupyHj" width=300></td><td>
<img src="https://i.guim.co.uk/img/static/sys-images/Guardian/Pix/pictures/2008/04/17/DarwinSketch.article.jpg?width=445&quality=85&auto=format&fit=max&s=c7f89552d12b8495b2b4eb4d7a5bc391" width=300></td><td><img src="https://i.pinimg.com/originals/78/3f/98/783f983d622b06b9a990ad67efabbbe8.png" width=300></td></tr></table>

### Objective of this notebook
We will continue our data exploration with the dataset we used in Colab_Lec04. In this notebook, we will tackle real-world questions with pandas. We will explore a kingdom-of-life data file. Each row in this dataset represents a particular organism, and each organism is classified under a particular Kingdom and Class.

Notebook adapted from Wendy Lee

In [None]:
# Import libraries
import pandas as pd

In [None]:
euk_filepath = "https://raw.githubusercontent.com/csbfx/advpy122-data/master/euk.tsv"
euk_df = pd.read_csv(euk_filepath, sep='\t')
euk_df.head()

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status
0,Emiliania huxleyi CCMP1516,Protists,Other Protists,167.676,64.5,38549,38554,2013,Scaffold
1,Arabidopsis thaliana,Plants,Land Plants,119.669,36.0529,38311,48265,2001,Chromosome
2,Glycine max,Plants,Land Plants,979.046,35.1153,59847,71219,2010,Chromosome
3,Medicago truncatula,Plants,Land Plants,412.924,34.047,37603,41939,2011,Chromosome
4,Solanum lycopersicum,Plants,Land Plants,828.349,35.6991,31200,37660,2010,Chromosome


In [None]:
## You can sort by one or more columns
euk_df.sort_values(['Species', 'Publication year'], ascending=[True, False])

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
2968,Abeoforma whisleri,Protists,Other Protists,101.55400,31.4,,,2017,Scaffold,Abeoforma,
3764,Abrus precatorius,Plants,Land Plants,347.23000,31.8,28735.0,40048.0,2018,Scaffold,Abrus,1.393701
2467,Absidia glauca,Fungi,Other Fungi,48.74640,44.5,15117.0,14891.0,2016,Scaffold,Absidia,0.985050
2695,Absidia repens,Fungi,Other Fungi,47.42290,38.2,15151.0,14915.0,2017,Contig,Absidia,0.984423
1905,Acanthamoeba astronyxis,Protists,Other Protists,83.43250,42.9,,,2015,Scaffold,Acanthamoeba,
...,...,...,...,...,...,...,...,...,...,...,...
932,fungal sp. EF0021,Fungi,Other Fungi,44.37210,47.1,,,2012,Contig,fungal,
3069,fungal sp. Mo6-1,Fungi,Other Fungi,25.10290,60.2,,,2018,Contig,fungal,
1920,fungal sp. No.11243,Fungi,Other Fungi,21.71980,54.0,9730.0,9693.0,2015,Scaffold,fungal,0.996197
2639,fungal sp. No.14919,Fungi,Other Fungi,48.73490,46.9,14606.0,14342.0,2017,Scaffold,fungal,0.981925
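
The two-column sort above can be checked on a tiny example. This is a minimal sketch on a hypothetical three-row dataframe named `toy` (the values are copied from rows shown in this notebook), not the full dataset. Each entry in `ascending` applies to the column at the same position in the list.

```python
import pandas as pd

# Hypothetical toy dataframe; values copied from rows shown in this notebook
toy = pd.DataFrame({
    "Species": ["Glycine max", "Arabidopsis thaliana", "Arabidopsis thaliana"],
    "Publication year": [2010, 2000, 2001],
})

# Species A to Z; within the same species, newest publication first
toy_sorted = toy.sort_values(["Species", "Publication year"],
                             ascending=[True, False])
print(toy_sorted)
```

Note that `sort_values` returns a new dataframe and leaves `toy` unchanged unless the result is assigned back.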


### Q1. How many fungal species have a genome size bigger than 100 Mb? What are their names?

**We need to filter a few things to address this question.**
1. Select all the Fungi under the "Kingdom" column.
2. Select all the Fungi with genome size greater than 100 Mb.
3. Select Species from the filtered data from step 2 above.

**Let's do it step by step, and then combine all the steps in one line of code.**

In [None]:
## Narrow down to only Fungi
euk_df[euk_df.Kingdom == 'Fungi']

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
8,Saccharomyces cerevisiae S288C,Fungi,Ascomycetes,12.157100,38.1556,6445.0,6002.0,1999,Complete Genome,Saccharomyces,0.931265
10,Pneumocystis carinii B80,Fungi,Ascomycetes,7.661460,27.8000,3695.0,3646.0,2015,Contig,Pneumocystis,0.986739
11,Schizosaccharomyces pombe,Fungi,Ascomycetes,12.591300,36.0381,6974.0,5132.0,2002,Chromosome,Schizosaccharomyces,0.735876
12,Aspergillus nidulans FGSC A4,Fungi,Ascomycetes,30.276000,50.2721,9586.0,9556.0,2003,Scaffold,Aspergillus,0.996870
13,Aspergillus fumigatus Af293,Fungi,Ascomycetes,29.385000,49.8105,19832.0,19260.0,2005,Chromosome,Aspergillus,0.971158
...,...,...,...,...,...,...,...,...,...,...,...
8297,Saccharomyces cerevisiae,Fungi,Ascomycetes,3.993920,38.2000,,,2017,Scaffold,Saccharomyces,
8298,Saccharomyces cerevisiae,Fungi,Ascomycetes,0.586761,38.5921,155.0,298.0,1992,Chromosome,Saccharomyces,1.922581
8299,Saccharomyces cerevisiae,Fungi,Ascomycetes,12.020400,38.2971,,,2018,Chromosome,Saccharomyces,
8300,Saccharomyces cerevisiae,Fungi,Ascomycetes,11.960900,38.2413,,,2018,Chromosome,Saccharomyces,


To **combine two conditional statements** in the filtering, we need to surround each conditional statement with its own pair of **parentheses** and join them with `&`.

In [None]:
## Narrow down to Fungi that have genome size > 100 Mb
euk_df[(euk_df.Kingdom == 'Fungi') &
    (euk_df['Size (Mb)'] > 100)
    ]

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
323,Blumeria graminis f. sp. hordei DH14,Fungi,Ascomycetes,124.489,43.5,,,2018,Scaffold,Blumeria,
347,Puccinia triticina 1-1 BBBD Race 1,Fungi,Basidiomycetes,135.344,36.8,15539.0,15685.0,2009,Scaffold,Puccinia,1.009396
354,Tuber melanosporum,Fungi,Ascomycetes,124.946,44.9,7496.0,7496.0,2010,Scaffold,Tuber,1.000000
372,Puccinia striiformis f. sp. tritici,Fungi,Basidiomycetes,156.834,44.4,,,2018,Contig,Puccinia,
427,Melampsora larici-populina 98AG31,Fungi,Basidiomycetes,101.129,41.3,16380.0,16372.0,2011,Scaffold,Melampsora,0.999512
...,...,...,...,...,...,...,...,...,...,...,...
6406,Rhizophagus irregularis,Fungi,Other Fungi,131.335,25.8,24574.0,24485.0,2016,Scaffold,Rhizophagus,0.996378
6502,Rhizophagus irregularis,Fungi,Other Fungi,211.467,,,,2018,Scaffold,Rhizophagus,
6511,Rhizophagus irregularis,Fungi,Other Fungi,156.891,,,,2018,Scaffold,Rhizophagus,
6520,Puccinia striiformis,Fungi,Basidiomycetes,144.837,44.2,,,2019,Scaffold,Puccinia,
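
Why are the parentheses needed? In Python, `&` binds more tightly than `==` and `>`, so without the parentheses the expression is grouped differently from what we intend and typically fails with an error. One way to sidestep the issue is to give each condition its own name first. The sketch below assumes a hypothetical three-row dataframe `toy` with values copied from this notebook.

```python
import pandas as pd

# Hypothetical toy dataframe; values copied from rows shown in this notebook
toy = pd.DataFrame({
    "Kingdom": ["Fungi", "Fungi", "Plants"],
    "Size (Mb)": [124.946, 12.1571, 119.669],
})

# Each condition is a boolean Series with one True/False per row
is_fungi = toy["Kingdom"] == "Fungi"
is_large = toy["Size (Mb)"] > 100

# & is element-wise "and" (| is "or", ~ is "not")
large_fungi = toy[is_fungi & is_large]
print(large_fungi)
```

Only the first row is both a fungus and larger than 100 Mb, so it is the only row kept.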


In [None]:
## Species that are Fungi with genome size > 100 Mb
euk_df[(euk_df.Kingdom == 'Fungi') & (euk_df['Size (Mb)'] > 100)].Species

Unnamed: 0,Species
323,Blumeria graminis f. sp. hordei DH14
347,Puccinia triticina 1-1 BBBD Race 1
354,Tuber melanosporum
372,Puccinia striiformis f. sp. tritici
427,Melampsora larici-populina 98AG31
...,...
6406,Rhizophagus irregularis
6502,Rhizophagus irregularis
6511,Rhizophagus irregularis
6520,Puccinia striiformis


#### Convert the Series from step 3 above to a Python list using the `to_list` method

In [None]:
speciesList = euk_df[ (euk_df.Kingdom == 'Fungi') &
                  (euk_df['Size (Mb)'] > 100)
                 ]["Species"].to_list()

## Slice for the top 10 on the list
speciesList[0:10]

['Blumeria graminis f. sp. hordei DH14',
 'Puccinia triticina 1-1 BBBD Race 1',
 'Tuber melanosporum',
 'Puccinia striiformis f. sp. tritici',
 'Melampsora larici-populina 98AG31',
 'Ophiocordyceps sinensis',
 'Gigaspora rosea',
 'Leucoagaricus gongylophorus Ac12',
 'Hemileia vastatrix HvCat',
 'Cenococcum geophilum 1.58']
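
Q1 also asks "how many". Since the same species name can appear on several rows, the number of matching rows and the number of distinct names can differ. The sketch below shows both counts on a hypothetical four-row dataframe `toy` (values copied from this notebook); the names `n_rows` and `n_species` are illustrative only.

```python
import pandas as pd

# Hypothetical toy dataframe; values copied from rows shown in this notebook
toy = pd.DataFrame({
    "Species": ["Rhizophagus irregularis", "Rhizophagus irregularis",
                "Tuber melanosporum", "Saccharomyces cerevisiae S288C"],
    "Kingdom": ["Fungi"] * 4,
    "Size (Mb)": [131.335, 211.467, 124.946, 12.1571],
})

large = toy[(toy["Kingdom"] == "Fungi") & (toy["Size (Mb)"] > 100)]

n_rows = len(large)                     # matching rows (one per genome entry)
n_species = large["Species"].nunique()  # distinct species names
print(n_rows, n_species)
```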

### Q2. How many organisms are there for each Kingdom (plants, animals, fungi, protists, and other), and how many unique species names?

**HINT**: Any time we see a question involving the words **"how many ... for each ..."**, the answer is usually **`value_counts`**.

In [None]:
euk_df.Kingdom.value_counts()

Unnamed: 0_level_0,count
Kingdom,Unnamed: 1_level_1
Fungi,4494
Animals,2181
Plants,870
Protists,727
Other,30
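
As a small illustration of what `value_counts` returns, the sketch below uses a hypothetical six-element Series `kingdoms` (made-up counts, not the real dataset). The result is sorted from most to least frequent, and `normalize=True` gives proportions instead of counts.

```python
import pandas as pd

# Hypothetical toy Series; the counts here are made up for illustration
kingdoms = pd.Series(
    ["Fungi", "Plants", "Fungi", "Animals", "Fungi", "Plants"],
    name="Kingdom",
)

counts = kingdoms.value_counts()                # rows per kingdom
shares = kingdoms.value_counts(normalize=True)  # fraction of rows per kingdom
print(counts)
print(shares)
```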


To address "how many unique species names?", we first filter the dataframe to one kingdom (plants here) and then count the distinct species names. [**`nunique()`**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html) returns the number of distinct observations.

In [None]:
euk_df[(euk_df["Kingdom"] == "Plants")].Species.nunique()

464

In [None]:
## To expand this to the other kingdoms

for k in ["Protists", "Plants", "Fungi", "Animals", "Other"]:
    print(k, euk_df[euk_df.Kingdom == k].Species.nunique())

Protists 449
Plants 464
Fungi 2554
Animals 1442
Other 27


#### To make the above more scalable
Hardcoding the kingdom names is not scalable. A more elegant approach is to get the list of unique kingdom names directly from the dataframe. [**`unique()`**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html) returns the unique values in order of appearance. It does NOT sort.

In [None]:
## Return an array of unique values in Kingdom
# euk_df.Kingdom.unique()

## Use an array to make it more scalable
for kingdom in euk_df.Kingdom.unique():
    print(kingdom, euk_df[euk_df.Kingdom == kingdom].Species.nunique())

Protists 449
Plants 464
Fungi 2554
Animals 1442
Other 27
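
The loop above can also be written without an explicit loop. The sketch below is an alternative, not the approach used in this notebook: it groups a hypothetical toy dataframe by `Kingdom` and counts the distinct species names in each group.

```python
import pandas as pd

# Hypothetical toy dataframe; species names taken from this notebook
toy = pd.DataFrame({
    "Kingdom": ["Protists", "Plants", "Plants", "Fungi", "Fungi", "Fungi"],
    "Species": ["Emiliania huxleyi CCMP1516",
                "Arabidopsis thaliana", "Arabidopsis thaliana",
                "Tuber melanosporum",
                "Saccharomyces cerevisiae", "Saccharomyces cerevisiae"],
})

# One number per kingdom: how many distinct species names it contains
per_kingdom = toy.groupby("Kingdom")["Species"].nunique()
print(per_kingdom)
```

Unlike `unique()`, `groupby` sorts the group labels by default, so the kingdoms come out in alphabetical order.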


### Q3. Make a new dataframe containing just the rows for the *Aquila* genus.

**Let's go over some biology terminology**
- The names under the **Species** column are scientific names that are made up of a *genus* name and a *species* name separated by a space. Example: *Homo sapiens*. Note: some entries don't follow that format, and we will ignore them for now.

To solve this problem, we will need to **separate the genus and species names** for each value in the **Species** column.

In [None]:
## Review how to split a string
a = "abc def"
a_split = a.split()
print(a_split)
print(a_split[0])


# Split the strings stored in the column Species
euk_df.Species.str.split()

['abc', 'def']
abc


Unnamed: 0,Species
0,"[Emiliania, huxleyi, CCMP1516]"
1,"[Arabidopsis, thaliana]"
2,"[Glycine, max]"
3,"[Medicago, truncatula]"
4,"[Solanum, lycopersicum]"
...,...
8297,"[Saccharomyces, cerevisiae]"
8298,"[Saccharomyces, cerevisiae]"
8299,"[Saccharomyces, cerevisiae]"
8300,"[Saccharomyces, cerevisiae]"


In [None]:
## Take the first element of each of the resulting lists (again remembering to refer to the str attribute):
## This gives us our series of genus names.
euk_df.Species.str.split().str[0]

Unnamed: 0,Species
0,Emiliania
1,Arabidopsis
2,Glycine
3,Medicago
4,Solanum
...,...
8297,Saccharomyces
8298,Saccharomyces
8299,Saccharomyces
8300,Saccharomyces


In [None]:
## Add the condition to get a series of boolean values:
euk_df.Species.str.split(' ').str[0] == "Aquila"

Unnamed: 0,Species
0,False
1,False
2,False
3,False
4,False
...,...
8297,False
8298,False
8299,False
8300,False


In [None]:
## Plug the above into the original dataframe to select the rows that are True

aquila_data = euk_df[ (euk_df.Species.str.split(' ').str[0] == "Aquila") ]
aquila_data

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
1755,Aquila chrysaetos canadensis,Animals,Birds,1192.74,41.9001,17520.0,31284.0,2014,Scaffold,Aquila,1.785616
4388,Aquila chrysaetos canadensis,Animals,Birds,1548.48,43.5,,,2014,Scaffold,Aquila,
5342,Aquila chrysaetos chrysaetos,Animals,Birds,1228.51,42.2,,,2018,Scaffold,Aquila,


In [None]:
## Extract the genus name and combine it with other columns in the dataframe to create a new dataframe
euk_df['Genus']=euk_df.Species.str.split(" ").str[0]

neweuk = euk_df[["Species","Genus","Class","Kingdom"]]
neweuk

Unnamed: 0,Species,Genus,Class,Kingdom
0,Emiliania huxleyi CCMP1516,Emiliania,Other Protists,Protists
1,Arabidopsis thaliana,Arabidopsis,Land Plants,Plants
2,Glycine max,Glycine,Land Plants,Plants
3,Medicago truncatula,Medicago,Land Plants,Plants
4,Solanum lycopersicum,Solanum,Land Plants,Plants
...,...,...,...,...
8297,Saccharomyces cerevisiae,Saccharomyces,Ascomycetes,Fungi
8298,Saccharomyces cerevisiae,Saccharomyces,Ascomycetes,Fungi
8299,Saccharomyces cerevisiae,Saccharomyces,Ascomycetes,Fungi
8300,Saccharomyces cerevisiae,Saccharomyces,Ascomycetes,Fungi
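
As an alternative to `.str.split(" ").str[0]`, `str.split` can return the pieces as separate columns. The sketch below assumes a hypothetical three-element Series `species`; the column names `Genus` and `Rest` are illustrative. With `n=1` the string is split at the first space only, so anything after the genus stays together.

```python
import pandas as pd

# Hypothetical toy Series; names taken from this notebook
species = pd.Series(
    ["Aquila chrysaetos canadensis", "Homo sapiens", "Glycine max"],
    name="Species",
)

# Split once, at the first space, and spread the two parts into columns
parts = species.str.split(" ", n=1, expand=True)
parts.columns = ["Genus", "Rest"]  # illustrative column names
print(parts)

# The Genus column can then be used as a filter, as in Q3
aquila = species[parts["Genus"] == "Aquila"]
print(aquila)
```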


In [None]:
euk_df.head()


Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
0,Emiliania huxleyi CCMP1516,Protists,Other Protists,167.676,64.5,38549.0,38554.0,2013,Scaffold,Emiliania,1.00013
1,Arabidopsis thaliana,Plants,Land Plants,119.669,36.0529,38311.0,48265.0,2001,Chromosome,Arabidopsis,1.259821
2,Glycine max,Plants,Land Plants,979.046,35.1153,59847.0,71219.0,2010,Chromosome,Glycine,1.190018
3,Medicago truncatula,Plants,Land Plants,412.924,34.047,37603.0,41939.0,2011,Chromosome,Medicago,1.11531
4,Solanum lycopersicum,Plants,Land Plants,828.349,35.6991,31200.0,37660.0,2010,Chromosome,Solanum,1.207051


### Q4. Which organisms have at least 10% more proteins than genes?

There are a few different ways to interpret "10% more", but for the purposes of this question we will divide the number of proteins by the number of genes and include the organism if the result is greater than or equal to 1.1.

If you look at the `euk_df` dataframe, you will see that the columns **Number of proteins** and **Number of genes** contain a mix of numeric values and dashes. To ensure that all the values in these two columns are numeric, we will use the pandas function [`to_numeric`](https://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.to_numeric.html) to convert them. We will set the `errors` argument to `'coerce'` so that any non-numeric value becomes NaN (not a number). The cell below applies the same conversion to **GC%**.

In [None]:
euk_df["Number of genes"] =  pd.to_numeric(euk_df["Number of genes"],  errors='coerce')
euk_df["Number of proteins"] =  pd.to_numeric(euk_df["Number of proteins"],  errors='coerce')
euk_df["GC%"] =  pd.to_numeric(euk_df["GC%"],  errors='coerce')

# automatically applying the division for each row
euk_df["Proteins per gene"] = euk_df["Number of proteins"]/euk_df["Number of genes"]
euk_df.head()

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
0,Emiliania huxleyi CCMP1516,Protists,Other Protists,167.676,64.5,38549.0,38554.0,2013,Scaffold,Emiliania,1.00013
1,Arabidopsis thaliana,Plants,Land Plants,119.669,36.0529,38311.0,48265.0,2001,Chromosome,Arabidopsis,1.259821
2,Glycine max,Plants,Land Plants,979.046,35.1153,59847.0,71219.0,2010,Chromosome,Glycine,1.190018
3,Medicago truncatula,Plants,Land Plants,412.924,34.047,37603.0,41939.0,2011,Chromosome,Medicago,1.11531
4,Solanum lycopersicum,Plants,Land Plants,828.349,35.6991,31200.0,37660.0,2010,Chromosome,Solanum,1.207051
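
To see what `errors='coerce'` does in isolation, here is a minimal sketch on a hypothetical three-element Series `raw`. It assumes a missing value is stored as the string `"-"`, which is a guess at how the dashes appear in the file.

```python
import pandas as pd

# Hypothetical toy Series; "-" stands in for a missing value (an assumption)
raw = pd.Series(["38549", "-", "6445"], name="Number of genes")

# Strings that cannot be parsed as numbers become NaN instead of raising
clean = pd.to_numeric(raw, errors="coerce")
print(clean)
print(clean.dtype)
```

This also explains why gene counts appear as `38549.0` after the conversion: a column that holds NaN is stored as floating point.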


In [None]:
# To filter the rows where the ratio of proteins and genes is
# greater than or equal to 1.1, we will get a series of boolean
# values by setting the condition.

euk_df[(euk_df["Proteins per gene"] >= 1.1)]

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
1,Arabidopsis thaliana,Plants,Land Plants,119.669000,36.0529,38311.0,48265.0,2001,Chromosome,Arabidopsis,1.259821
2,Glycine max,Plants,Land Plants,979.046000,35.1153,59847.0,71219.0,2010,Chromosome,Glycine,1.190018
3,Medicago truncatula,Plants,Land Plants,412.924000,34.0470,37603.0,41939.0,2011,Chromosome,Medicago,1.115310
4,Solanum lycopersicum,Plants,Land Plants,828.349000,35.6991,31200.0,37660.0,2010,Chromosome,Solanum,1.207051
6,Oryza sativa Japonica Group,Plants,Land Plants,374.423000,43.5769,35219.0,42580.0,2015,Chromosome,Oryza,1.209007
...,...,...,...,...,...,...,...,...,...,...,...
6487,Fusarium oxysporum f. sp. melonis 26406,Fungi,Ascomycetes,54.034300,47.5000,20030.0,26719.0,2012,Scaffold,Fusarium,1.333949
6523,Fusarium oxysporum Fo47,Fungi,Ascomycetes,49.664600,47.7000,18553.0,24818.0,2012,Scaffold,Fusarium,1.337681
6626,Arabidopsis thaliana,Plants,Land Plants,93.654500,36.0433,16842.0,20111.0,2000,Chromosome,Arabidopsis,1.194098
6781,Mus musculus,Animals,Mammals,3251.250000,41.8306,31682.0,45437.0,2005,Chromosome,Mus,1.434158
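
Rows where the ratio is NaN drop out of this filter, because any comparison with NaN evaluates to False. The sketch below shows this on a hypothetical three-element Series `ratio` (the two numeric values are copied from this notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series; the last entry has no gene or protein counts
ratio = pd.Series([1.259821, 0.931265, np.nan], name="Proteins per gene")

mask = ratio >= 1.1             # NaN >= 1.1 is False
kept = ratio[mask]
n_missing = ratio.isna().sum()  # rows that could not be evaluated
print(mask.to_list(), n_missing)
```

So the filtered result answers the question only for organisms where both counts are known.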


### Indexing in pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, `iloc` and `loc`. For more advanced operations, these are the ones you're supposed to be using.

#### Index-based selection
Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. `iloc` follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

In [None]:
## New df of fungi
fungi = euk_df[(euk_df.Kingdom == 'Fungi')]

## Selecting first row
fungi.iloc[0]

Unnamed: 0,8
Species,Saccharomyces cerevisiae S288C
Kingdom,Fungi
Class,Ascomycetes
Size (Mb),12.1571
GC%,38.1556
Number of genes,6445.0
Number of proteins,6002.0
Publication year,1999
Assembly status,Complete Genome
Genus,Saccharomyces


## Extracting rows and columns

-  Extract specific rows and all columns

    ```df.iloc[rows]```

-  Extract specific rows and columns

    ```df.iloc[rows, columns]```

-  Extract specific rows and columns with specific indices

    ```df.iloc[start:stop:skip, start:stop:skip]```

#### Note that stop is exclusive

 ```[1:3]``` will extract the second and third rows for all columns.

```[1::2, :]```  will extract every other row from the 2nd row to the end, and all the columns.

```[:, :3]```  all rows, from the beginning to the 3rd column (indices 0, 1, 2)
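
The three slices listed above can be tried on a small example. This is a minimal sketch on a hypothetical 6-row, 4-column dataframe `toy` with made-up values.

```python
import pandas as pd

# Hypothetical toy dataframe with 6 rows and 4 columns (made-up values)
toy = pd.DataFrame({
    "a": list(range(6)),
    "b": list(range(10, 16)),
    "c": list(range(20, 26)),
    "d": list(range(30, 36)),
})

rows_1_2 = toy.iloc[1:3]            # positions 1 and 2 (stop is exclusive)
every_other = toy.iloc[1::2, :]     # positions 1, 3, 5, all columns
first_three_cols = toy.iloc[:, :3]  # all rows, column positions 0, 1, 2
print(rows_1_2)
print(every_other)
print(first_three_cols)
```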


In [None]:
## Getting the second and third rows for all columns
fungi.iloc[1:3]

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
10,Pneumocystis carinii B80,Fungi,Ascomycetes,7.66146,27.8,3695.0,3646.0,2015,Contig,Pneumocystis,0.986739
11,Schizosaccharomyces pombe,Fungi,Ascomycetes,12.5913,36.0381,6974.0,5132.0,2002,Chromosome,Schizosaccharomyces,0.735876


In [None]:
# Getting every other row from the 1st to the 10th row, and extracting the 4th and 5th columns
fungi.iloc[0:10:2, 3:5]

Unnamed: 0,Size (Mb),GC%
8,12.1571,38.1556
11,12.5913,36.0381
13,29.385,49.8105
15,14.2827,33.4827
33,2.49752,47.3005


Both `loc` and `iloc` are row-first, column-second. This is the opposite of plain bracket indexing on a DataFrame (for example `fungi['Species'][8]`), which is column-first, row-second.

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with `iloc`, we can do the following:

In [None]:
## Select all rows, first column
fungi.iloc[:, 0]

Unnamed: 0,Species
8,Saccharomyces cerevisiae S288C
10,Pneumocystis carinii B80
11,Schizosaccharomyces pombe
12,Aspergillus nidulans FGSC A4
13,Aspergillus fumigatus Af293
...,...
8297,Saccharomyces cerevisiae
8298,Saccharomyces cerevisiae
8299,Saccharomyces cerevisiae
8300,Saccharomyces cerevisiae


On its own, the `:` operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to omit the first 3 columns and extract just the first three rows, we would do:

In [None]:
fungi.iloc[:3, 3:]

Unnamed: 0,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
8,12.1571,38.1556,6445.0,6002.0,1999,Complete Genome,Saccharomyces,0.931265
10,7.66146,27.8,3695.0,3646.0,2015,Contig,Pneumocystis,0.986739
11,12.5913,36.0381,6974.0,5132.0,2002,Chromosome,Schizosaccharomyces,0.735876


#### Label-based selection
The second paradigm for attribute selection is the one followed by the `loc` operator: label-based selection. In this paradigm, it's the data index value, not its position, that matters.

In [None]:
## Getting all rows, and specific columns based on their headers
fungi.loc[:, ['Species', 'Number of genes', 'Number of proteins']]

Unnamed: 0,Species,Number of genes,Number of proteins
8,Saccharomyces cerevisiae S288C,6445.0,6002.0
10,Pneumocystis carinii B80,3695.0,3646.0
11,Schizosaccharomyces pombe,6974.0,5132.0
12,Aspergillus nidulans FGSC A4,9586.0,9556.0
13,Aspergillus fumigatus Af293,19832.0,19260.0
...,...,...,...
8297,Saccharomyces cerevisiae,,
8298,Saccharomyces cerevisiae,155.0,298.0
8299,Saccharomyces cerevisiae,,
8300,Saccharomyces cerevisiae,,


### Set row index using a specific column

In [None]:
fungi.set_index('Species', inplace=True)
fungi.head()

Unnamed: 0_level_0,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Saccharomyces cerevisiae S288C,Fungi,Ascomycetes,12.1571,38.1556,6445.0,6002.0,1999,Complete Genome,Saccharomyces,0.931265
Pneumocystis carinii B80,Fungi,Ascomycetes,7.66146,27.8,3695.0,3646.0,2015,Contig,Pneumocystis,0.986739
Schizosaccharomyces pombe,Fungi,Ascomycetes,12.5913,36.0381,6974.0,5132.0,2002,Chromosome,Schizosaccharomyces,0.735876
Aspergillus nidulans FGSC A4,Fungi,Ascomycetes,30.276,50.2721,9586.0,9556.0,2003,Scaffold,Aspergillus,0.99687
Aspergillus fumigatus Af293,Fungi,Ascomycetes,29.385,49.8105,19832.0,19260.0,2005,Chromosome,Aspergillus,0.971158


### Select rows by index label, then reset the dataframe's index

In [None]:
# Getting specific rows based on the row index and all columns
fungi.loc[['Saccharomyces cerevisiae'], :]

Unnamed: 0_level_0,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Saccharomyces cerevisiae,Fungi,Ascomycetes,12.086300,38.1473,,,2015,Complete Genome,Saccharomyces,
Saccharomyces cerevisiae,Fungi,Ascomycetes,12.165500,38.3121,,,2018,Complete Genome,Saccharomyces,
Saccharomyces cerevisiae,Fungi,Ascomycetes,12.134300,38.1573,,,2019,Complete Genome,Saccharomyces,
Saccharomyces cerevisiae,Fungi,Ascomycetes,12.150900,38.5859,,,2015,Chromosome,Saccharomyces,
Saccharomyces cerevisiae,Fungi,Ascomycetes,12.071900,38.3205,,,2016,Chromosome,Saccharomyces,
...,...,...,...,...,...,...,...,...,...,...
Saccharomyces cerevisiae,Fungi,Ascomycetes,3.993920,38.2000,,,2017,Scaffold,Saccharomyces,
Saccharomyces cerevisiae,Fungi,Ascomycetes,0.586761,38.5921,155.0,298.0,1992,Chromosome,Saccharomyces,1.922581
Saccharomyces cerevisiae,Fungi,Ascomycetes,12.020400,38.2971,,,2018,Chromosome,Saccharomyces,
Saccharomyces cerevisiae,Fungi,Ascomycetes,11.960900,38.2413,,,2018,Chromosome,Saccharomyces,


In [None]:
fungi.reset_index(inplace=True)
fungi.head()

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
0,Saccharomyces cerevisiae S288C,Fungi,Ascomycetes,12.1571,38.1556,6445.0,6002.0,1999,Complete Genome,Saccharomyces,0.931265
1,Pneumocystis carinii B80,Fungi,Ascomycetes,7.66146,27.8,3695.0,3646.0,2015,Contig,Pneumocystis,0.986739
2,Schizosaccharomyces pombe,Fungi,Ascomycetes,12.5913,36.0381,6974.0,5132.0,2002,Chromosome,Schizosaccharomyces,0.735876
3,Aspergillus nidulans FGSC A4,Fungi,Ascomycetes,30.276,50.2721,9586.0,9556.0,2003,Scaffold,Aspergillus,0.99687
4,Aspergillus fumigatus Af293,Fungi,Ascomycetes,29.385,49.8105,19832.0,19260.0,2005,Chromosome,Aspergillus,0.971158
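
The same round trip can be written without `inplace=True` by assigning the returned dataframe to a new name. This is a sketch of that alternative on a hypothetical three-row dataframe `toy` (values copied from this notebook); the names `indexed` and `restored` are illustrative.

```python
import pandas as pd

# Hypothetical toy dataframe; values copied from rows shown in this notebook
toy = pd.DataFrame({
    "Species": ["Saccharomyces cerevisiae S288C",
                "Saccharomyces cerevisiae",
                "Saccharomyces cerevisiae"],
    "Size (Mb)": [12.1571, 12.0204, 11.9609],
})

indexed = toy.set_index("Species")  # Species becomes the row index

# A label that occurs more than once returns every matching row
sizes = indexed.loc[["Saccharomyces cerevisiae"], "Size (Mb)"]

restored = indexed.reset_index()    # Species is a regular column again
print(sizes)
print(restored)
```

Because `set_index` and `reset_index` return new dataframes here, `toy` itself is left unchanged.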
