## **Data Exploration with Pandas**
Copyright © Wendy Lee 2022

###  Tree of Life
<table><tr><td><img src="http://t3.gstatic.com/licensed-image?q=tbn:ANd9GcSq4PRaxgfpjNOSe81JgN8l71DWtDHpkSfH3xo8EOk7khAlqQozXnJm8ubupyHj" width=300></td><td>
<img src="https://i.guim.co.uk/img/static/sys-images/Guardian/Pix/pictures/2008/04/17/DarwinSketch.article.jpg?width=445&quality=85&auto=format&fit=max&s=c7f89552d12b8495b2b4eb4d7a5bc391" width=300></td><td><img src="https://i.pinimg.com/originals/78/3f/98/783f983d622b06b9a990ad67efabbbe8.png" width=300></td></tr></table>

In this lecture, we will tackle real-world questions with pandas. We will explore a data file covering the kingdoms of life. Each row in this data set represents a particular organism, and each organism is classified into a Kingdom and a Class.




### **Learning Outcomes:**
- To learn how to import delimited data into a pandas dataframe.
- To learn how to get summary information and statistics for the data stored in a dataframe.
- To learn how to filter data or extract specific rows or columns from a dataframe using indices.
- To apply data analysis techniques to solve real-world problems.

In [None]:
import pandas as pd
# this input file is tab-delimited instead of comma-delimited
tsvFile = "https://raw.githubusercontent.com/csbfx/advpy122-data/master/euk.tsv"
# we can specify the delimiter by using the sep keyword argument
euk = pd.read_csv(tsvFile, sep='\t')
# You can sort by one or more columns
euk.sort_values(['Publication year','Species'], ascending=[False, True]) # returning a new dataframe

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status
3875,Acaulopage tetraceros,Fungi,Other Fungi,11.176800,44.4,-,-,2019,Scaffold
1594,Acipenser ruthenus,Animals,Fishes,1732.550000,34.4,-,-,2019,Scaffold
1360,Acomys cahirinus,Animals,Mammals,2306.070000,42.7,-,-,2019,Scaffold
392,Acropora millepora,Animals,Other Animals,386.600000,35.1,-,-,2019,Scaffold
2758,Actinidia eriantha,Plants,Land Plants,690.611000,35.9315,-,-,2019,Chromosome
...,...,...,...,...,...,...,...,...,...
4987,Guillardia theta,Other,Other,0.672788,27.5981,743,632,1999,Chromosome
8,Saccharomyces cerevisiae S288C,Fungi,Ascomycetes,12.157100,38.1556,6445,6002,1999,Complete Genome
17,Leishmania major strain Friedlin,Protists,Kinetoplasts,32.855100,59.7114,9388,8316,1998,Complete Genome
27,Plasmodium falciparum 3D7,Protists,Apicomplexans,23.326900,19.3315,5670,5392,1998,Complete Genome
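
The same loading and sorting steps can be tried without a network connection. The sketch below is a minimal illustration, assuming a made-up three-row sample held in a string; the name `sample_tsv` and the values in it are for illustration only and are not read from `euk.tsv`.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like a tab-delimited file
sample_tsv = (
    "Species\tKingdom\tPublication year\n"
    "Glycine max\tPlants\t2010\n"
    "Aquila chrysaetos\tAnimals\t2014\n"
    "Tuber melanosporum\tFungi\t2010\n"
)

# sep="\t" tells read_csv to split fields on tabs instead of commas
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Newest publications first; ties are broken alphabetically by species
sample_sorted = sample.sort_values(
    ["Publication year", "Species"], ascending=[False, True]
)
print(sample_sorted)

# sort_values returned a new dataframe; the original row order is unchanged
print(sample["Species"].to_list())
```

Because `sort_values` returns a new dataframe, the sorted result has to be assigned to a name if it is needed later.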


In [None]:
euk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8302 entries, 0 to 8301
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Species             8302 non-null   object 
 1   Kingdom             8302 non-null   object 
 2   Class               8302 non-null   object 
 3   Size (Mb)           8302 non-null   float64
 4   GC%                 8302 non-null   object 
 5   Number of genes     8302 non-null   object 
 6   Number of proteins  8302 non-null   object 
 7   Publication year    8302 non-null   int64  
 8   Assembly status     8302 non-null   object 
dtypes: float64(1), int64(1), object(7)
memory usage: 583.9+ KB
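
`info()` reports column types and non-null counts; summary statistics come from `describe()`. The sketch below is a small illustration on a hypothetical three-row frame (the name `toy` and its values are assumptions, not part of the data set). It also shows why a column that mixes digits and dashes is not treated as numeric.

```python
import pandas as pd

# Hypothetical miniature frame: "-" marks a missing gene count
toy = pd.DataFrame({
    "Species": ["A one", "B two", "C three"],
    "Size (Mb)": [12.5, 100.0, 7.5],
    "Number of genes": ["6445", "-", "3695"],
})

# A column holding both digits and dashes is read as text, not as numbers
print(toy.dtypes)
print(pd.api.types.is_numeric_dtype(toy["Number of genes"]))

# By default, describe() summarises the numeric columns only
stats = toy.describe()
print(stats)
```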


### Q1. How many fungal species have a genome size greater than 100 Mb? What are their names?

**We need to filter a few things to address this question.**
1. Select all the Fungi under the "Kingdom" column.
2. Select all the Fungi with a genome size greater than 100 Mb.
3. Select the Species column from the filtered data in step 2.

**Let's do it step by step. And then we will combine all of them in one line of code.**

In [None]:
# Step 1: narrow down to only Fungi
fungi_only = euk[euk['Kingdom'] == 'Fungi']
# Step 2: keep the Fungi with genome size > 100 Mb. The mask is built from
# fungi_only so that its index matches the rows being filtered.
fungi_only[fungi_only['Size (Mb)'] > 100]



Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status
323,Blumeria graminis f. sp. hordei DH14,Fungi,Ascomycetes,124.489,43.5,-,-,2018,Scaffold
347,Puccinia triticina 1-1 BBBD Race 1,Fungi,Basidiomycetes,135.344,36.8,15539,15685,2009,Scaffold
354,Tuber melanosporum,Fungi,Ascomycetes,124.946,44.9,7496,7496,2010,Scaffold
372,Puccinia striiformis f. sp. tritici,Fungi,Basidiomycetes,156.834,44.4,-,-,2018,Contig
427,Melampsora larici-populina 98AG31,Fungi,Basidiomycetes,101.129,41.3,16380,16372,2011,Scaffold
...,...,...,...,...,...,...,...,...,...
6406,Rhizophagus irregularis,Fungi,Other Fungi,131.335,25.8,24574,24485,2016,Scaffold
6502,Rhizophagus irregularis,Fungi,Other Fungi,211.467,-,-,-,2018,Scaffold
6511,Rhizophagus irregularis,Fungi,Other Fungi,156.891,-,-,-,2018,Scaffold
6520,Puccinia striiformis,Fungi,Basidiomycetes,144.837,44.2,-,-,2019,Scaffold


**To combine two conditions in one filter, join them with `&` and surround each condition with its own pair of parentheses, because `&` binds more tightly than comparison operators such as `==` and `>`.**

In [None]:
# Narrow down to Fungi that have a genome size > 100 Mb
euk[(euk.Kingdom == 'Fungi') &
    (euk['Size (Mb)'] > 100)
    ].Species

323         Blumeria graminis f. sp. hordei DH14
347           Puccinia triticina 1-1 BBBD Race 1
354                           Tuber melanosporum
372          Puccinia striiformis f. sp. tritici
427            Melampsora larici-populina 98AG31
                          ...                   
6406                     Rhizophagus irregularis
6502                     Rhizophagus irregularis
6511                     Rhizophagus irregularis
6520                        Puccinia striiformis
6579    Puccinia striiformis f. sp. tritici CY32
Name: Species, Length: 76, dtype: object
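
The role of the parentheses is easier to see on a tiny frame. The sketch below is a minimal illustration, assuming a hypothetical three-row frame named `toy`; the mask names `is_fungi` and `is_large` are also made up for the example.

```python
import pandas as pd

# Hypothetical three-row frame for illustration
toy = pd.DataFrame({
    "Kingdom": ["Fungi", "Fungi", "Plants"],
    "Size (Mb)": [124.9, 12.1, 690.6],
})

# Each comparison produces a boolean Series (a mask)
is_fungi = toy["Kingdom"] == "Fungi"
is_large = toy["Size (Mb)"] > 100

# Named masks can be combined directly with &
large_fungi = toy[is_fungi & is_large]
print(large_fungi)

# Written inline, each comparison needs its own parentheses
same_result = toy[(toy["Kingdom"] == "Fungi") & (toy["Size (Mb)"] > 100)]

# Without parentheses, Python would group the expression as
#   toy["Kingdom"] == ("Fungi" & toy["Size (Mb)"]) > 100
# which is not the intended condition.
```

Storing each mask under a descriptive name is one way to keep long filters readable; `|` (or) and `~` (not) combine masks in the same way.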

In [None]:
# Species that are Fungi with genome size > 100 Mb
euk[(euk.Kingdom == 'Fungi') & (euk['Size (Mb)'] > 100)].Species

323         Blumeria graminis f. sp. hordei DH14
347           Puccinia triticina 1-1 BBBD Race 1
354                           Tuber melanosporum
372          Puccinia striiformis f. sp. tritici
427            Melampsora larici-populina 98AG31
                          ...                   
6406                     Rhizophagus irregularis
6502                     Rhizophagus irregularis
6511                     Rhizophagus irregularis
6520                        Puccinia striiformis
6579    Puccinia striiformis f. sp. tritici CY32
Name: Species, Length: 76, dtype: object

#### Convert the Series from step 3 above to a Python list using the `to_list` method

In [None]:
speciesList = euk[ (euk.Kingdom == 'Fungi') &
                  (euk['Size (Mb)'] > 100)
                 ]["Species"].to_list()
# first 10 species on the list
speciesList[0:10] # slice syntax is [start:stop:step]; start is inclusive, stop is exclusive, step is optional

['Blumeria graminis f. sp. hordei DH14',
 'Puccinia triticina 1-1 BBBD Race 1',
 'Tuber melanosporum',
 'Puccinia striiformis f. sp. tritici',
 'Melampsora larici-populina 98AG31',
 'Ophiocordyceps sinensis',
 'Gigaspora rosea',
 'Leucoagaricus gongylophorus Ac12',
 'Hemileia vastatrix HvCat',
 'Cenococcum geophilum 1.58']
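
List slicing follows the pattern `[start:stop:step]`. The sketch below illustrates it on the first six names from the output above; the variable names are chosen for the example only.

```python
# First six species names copied from the output above
species_list = [
    "Blumeria graminis f. sp. hordei DH14",
    "Puccinia triticina 1-1 BBBD Race 1",
    "Tuber melanosporum",
    "Puccinia striiformis f. sp. tritici",
    "Melampsora larici-populina 98AG31",
    "Ophiocordyceps sinensis",
]

first_three = species_list[0:3]    # indices 0, 1, 2 (stop is exclusive)
every_other = species_list[0:6:2]  # indices 0, 2, 4 (step of 2)
last_two = species_list[-2:]       # negative start counts from the end

print(first_three)
print(every_other)
print(last_two)
```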

### Q2. How many organisms are there for each Kingdom (plants, animals, fungi, protists, and other), and how many unique species names?

**The first part of this problem is relatively easy to answer: any time we see a question involving the words "how many ... for each ...", the answer is `value_counts`.**

In [None]:
euk.Kingdom.value_counts()

Fungi       4494
Animals     2181
Plants       870
Protists     727
Other         30
Name: Kingdom, dtype: int64
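
`value_counts` can also report proportions instead of raw counts. The sketch below is a small illustration on a hypothetical six-element Series; the values are made up and are not the real kingdom counts.

```python
import pandas as pd

# Hypothetical Kingdom column with six entries
kingdoms = pd.Series(
    ["Fungi", "Animals", "Fungi", "Plants", "Fungi", "Animals"],
    name="Kingdom",
)

# Counts are sorted from most to least frequent
counts = kingdoms.value_counts()
print(counts)

# normalize=True returns the fraction of rows in each group
shares = kingdoms.value_counts(normalize=True)
print(shares)
```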

**The second part is a bit trickier. Let's start by solving part of the problem: we'll begin by counting just the unique species names for plants by filtering the dataframe.** [`nunique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html) returns the number of distinct observations.

In [None]:
euk[(euk["Kingdom"] == "Plants")].Species.nunique()

464

**Now let's expand this to all kingdoms**

In [None]:
for k in ["Protists", "Plants", "Fungi", "Animals", "Other"]:
    print(k, euk[euk.Kingdom == k].Species.nunique())

Protists 449
Plants 464
Fungi 2554
Animals 1442
Other 27


**Hardcoding the kingdom names is not scalable. A more elegant approach is to get the list of unique kingdom names directly from the dataframe.** [`unique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html) returns the unique values in order of appearance. It does NOT sort.

In [None]:
euk.Kingdom.unique()

array(['Protists', 'Plants', 'Fungi', 'Animals', 'Other'], dtype=object)

In [None]:
for kingdom in euk.Kingdom.unique():
    print(kingdom, euk[euk.Kingdom == kingdom].Species.nunique())

Protists 449
Plants 464
Fungi 2554
Animals 1442
Other 27
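
The loop over kingdoms can also be written without a loop by grouping. The sketch below is an alternative offered for comparison, not the approach used in this lecture; it assumes a hypothetical five-row frame named `toy`.

```python
import pandas as pd

# Hypothetical frame: one species appears twice within Fungi
toy = pd.DataFrame({
    "Kingdom": ["Fungi", "Fungi", "Fungi", "Plants", "Plants"],
    "Species": [
        "Saccharomyces cerevisiae",
        "Saccharomyces cerevisiae",
        "Tuber melanosporum",
        "Glycine max",
        "Arabidopsis thaliana",
    ],
})

# Group the rows by Kingdom, then count distinct Species within each group
unique_per_kingdom = toy.groupby("Kingdom")["Species"].nunique()
print(unique_per_kingdom)

# The explicit loop gives the same numbers
for kingdom in toy.Kingdom.unique():
    print(kingdom, toy[toy.Kingdom == kingdom].Species.nunique())
```

Note that `groupby` sorts the group labels alphabetically by default, whereas `unique()` keeps the order of appearance.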


### Q3. Make a new dataframe containing just the rows for the *Aquila* genus.

**Let's go over some biology terminology**
- The names under the **Species** column are scientific names that are made up of a *genus* name and a *species* name separated by a space. Example: *Homo sapiens*. Note: some species don't follow that format, and we will ignore them for now.

**To solve this problem, we will need to separate the genus and species names for each value in the Species column.**

In [None]:
# Review how to split a string
a = "abc def"
a_split = a.split()
# print(a_split)
# print(a_split[0])


# Split the strings stored in the column Species
aquila_list = euk[euk.Species.str.split().str[0] == 'Aquila']
aquila_list

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status
1755,Aquila chrysaetos canadensis,Animals,Birds,1192.74,41.9001,17520,31284,2014,Scaffold
4388,Aquila chrysaetos canadensis,Animals,Birds,1548.48,43.5,-,-,2014,Scaffold
5342,Aquila chrysaetos chrysaetos,Animals,Birds,1228.51,42.2,-,-,2018,Scaffold


In [None]:
# Split the strings stored in the column Species
euk.Species.str.split()

0       [Emiliania, huxleyi, CCMP1516]
1              [Arabidopsis, thaliana]
2                       [Glycine, max]
3               [Medicago, truncatula]
4              [Solanum, lycopersicum]
                     ...              
8297       [Saccharomyces, cerevisiae]
8298       [Saccharomyces, cerevisiae]
8299       [Saccharomyces, cerevisiae]
8300       [Saccharomyces, cerevisiae]
8301       [Saccharomyces, cerevisiae]
Name: Species, Length: 8302, dtype: object

In [None]:
# Now we can take the first element of each of the resulting
# lists (again remembering to refer to the str attribute):
# This gives us our series of genus names.
euk.Species.str.split().str[0]

0           Emiliania
1         Arabidopsis
2             Glycine
3            Medicago
4             Solanum
            ...      
8297    Saccharomyces
8298    Saccharomyces
8299    Saccharomyces
8300    Saccharomyces
8301    Saccharomyces
Name: Species, Length: 8302, dtype: object

In [None]:
# Now we can add the condition to get a series of boolean values:
euk.Species.str.split(' ').str[0] == "Aquila"

0       False
1       False
2       False
3       False
4       False
        ...  
8297    False
8298    False
8299    False
8300    False
8301    False
Name: Species, Length: 8302, dtype: bool

In [None]:
# we can plug into the original dataframe to just select the
# rows with True

aquila_data = euk[ (euk.Species.str.split(' ').str[0] == "Aquila") ]
aquila_data

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status
1755,Aquila chrysaetos canadensis,Animals,Birds,1192.74,41.9001,17520,31284,2014,Scaffold
4388,Aquila chrysaetos canadensis,Animals,Birds,1548.48,43.5,-,-,2014,Scaffold
5342,Aquila chrysaetos chrysaetos,Animals,Birds,1228.51,42.2,-,-,2018,Scaffold
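
Comparing the extracted genus exactly matters, because a simple prefix test can match other genera as well. The sketch below is a minimal illustration on a hypothetical three-element Series; the names in it are chosen for the example and are not taken from the data file.

```python
import pandas as pd

# Hypothetical species names; the last genus merely starts with "Aquila"
species = pd.Series([
    "Aquila chrysaetos canadensis",
    "Homo sapiens",
    "Aquilaria sinensis",
])

# Split on whitespace and keep the first word of every name
genus = species.str.split().str[0]
print(genus.to_list())

# An exact comparison selects only the Aquila genus
is_aquila = genus == "Aquila"
print(is_aquila.to_list())

# A prefix test would also match Aquilaria
prefix_match = species.str.startswith("Aquila")
print(prefix_match.to_list())
```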


In [None]:
# once we have figured out how to extract the genus name
# we combine it with other columns in the dataframe to create
# a new dataframe
euk['Genus']=euk.Species.str.split(" ").str[0]

neweuk = euk[["Species","Genus","Class","Kingdom"]]
neweuk

Unnamed: 0,Species,Genus,Class,Kingdom
0,Emiliania huxleyi CCMP1516,Emiliania,Other Protists,Protists
1,Arabidopsis thaliana,Arabidopsis,Land Plants,Plants
2,Glycine max,Glycine,Land Plants,Plants
3,Medicago truncatula,Medicago,Land Plants,Plants
4,Solanum lycopersicum,Solanum,Land Plants,Plants
...,...,...,...,...
8297,Saccharomyces cerevisiae,Saccharomyces,Ascomycetes,Fungi
8298,Saccharomyces cerevisiae,Saccharomyces,Ascomycetes,Fungi
8299,Saccharomyces cerevisiae,Saccharomyces,Ascomycetes,Fungi
8300,Saccharomyces cerevisiae,Saccharomyces,Ascomycetes,Fungi


In [None]:
euk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8302 entries, 0 to 8301
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Species             8302 non-null   object 
 1   Kingdom             8302 non-null   object 
 2   Class               8302 non-null   object 
 3   Size (Mb)           8302 non-null   float64
 4   GC%                 8302 non-null   object 
 5   Number of genes     8302 non-null   object 
 6   Number of proteins  8302 non-null   object 
 7   Publication year    8302 non-null   int64  
 8   Assembly status     8302 non-null   object 
 9   Genus               8302 non-null   object 
dtypes: float64(1), int64(1), object(8)
memory usage: 648.7+ KB


### Q4. Which organisms have at least 10% more proteins than genes?

There are a few different ways to interpret "10% more", but for the purposes of this question we'll say that we want to divide the number of proteins by the number of genes, and if the result is greater than or equal to 1.1 then we want to include the organism.

If you look at the `euk` dataframe, you will see that the columns **Number of proteins** and **Number of genes** contain a mix of numeric values and dashes. To make all the values in these two columns numeric, we will use the pandas function [`to_numeric`](https://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.to_numeric.html) to convert them. We will set the `errors` argument to `'coerce'` so that any non-numeric value becomes NaN (not a number). The same conversion is applied to the **GC%** column.

In [None]:
euk["Number of genes"] = pd.to_numeric(euk["Number of genes"], errors='coerce')
euk["Number of proteins"] = pd.to_numeric(euk["Number of proteins"], errors='coerce')
euk["GC%"] = pd.to_numeric(euk["GC%"], errors='coerce')

# automatically applying the division for each row
euk["Proteins per gene"] = euk["Number of proteins"]/euk["Number of genes"]

In [None]:
euk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8302 entries, 0 to 8301
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Species             8302 non-null   object 
 1   Kingdom             8302 non-null   object 
 2   Class               8302 non-null   object 
 3   Size (Mb)           8302 non-null   float64
 4   GC%                 7895 non-null   float64
 5   Number of genes     2372 non-null   float64
 6   Number of proteins  2371 non-null   float64
 7   Publication year    8302 non-null   int64  
 8   Assembly status     8302 non-null   object 
 9   Genus               8302 non-null   object 
 10  Proteins per gene   2370 non-null   float64
dtypes: float64(5), int64(1), object(5)
memory usage: 713.6+ KB
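
The drop in non-null counts comes from the dashes being turned into NaN. The sketch below is a small illustration of `errors='coerce'` on a hypothetical three-element Series named `raw`.

```python
import pandas as pd

# Hypothetical column mixing digit strings with a dash
raw = pd.Series(["6445", "-", "3695"])

# errors="coerce" replaces anything that cannot be parsed with NaN
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)

# The dash is now a missing value, so only two entries are non-null
print(numeric.isna().sum())
print(numeric.count())
```

Because NaN is a floating-point value, the converted column has a float dtype even though the original entries were whole numbers.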


In [None]:
# To filter the rows where the ratio of proteins and genes is
# greater than or equal to 1.1, we will get a series of boolean
# values by setting the condition.

euk[(euk["Proteins per gene"] >= 1.1)]
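
Rows with a missing ratio are excluded by this filter, because any comparison with NaN evaluates to False. The sketch below is a minimal illustration using three rows whose gene and protein counts appear elsewhere in this notebook; the frame name `toy` is an assumption for the example.

```python
import pandas as pd

# Three rows with counts taken from outputs in this notebook
toy = pd.DataFrame({
    "Species": [
        "Saccharomyces cerevisiae S288C",
        "Saccharomyces cerevisiae",
        "Phanerochaete chrysosporium",
    ],
    "Number of genes": [6445.0, 155.0, float("nan")],
    "Number of proteins": [6002.0, 298.0, float("nan")],
})

# Division is applied row by row; NaN inputs give a NaN ratio
toy["Proteins per gene"] = toy["Number of proteins"] / toy["Number of genes"]

# NaN >= 1.1 is False, so rows without counts drop out of the selection
selected = toy[toy["Proteins per gene"] >= 1.1]
print(selected)
```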

## Select data from DataFrame

### Select rows with specific column value

In [None]:
fungi = euk[(euk.Kingdom == 'Fungi')]
fungi.head(5)

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
8,Saccharomyces cerevisiae S288C,Fungi,Ascomycetes,12.1571,38.1556,6445.0,6002.0,1999,Complete Genome,Saccharomyces,0.931265
10,Pneumocystis carinii B80,Fungi,Ascomycetes,7.66146,27.8,3695.0,3646.0,2015,Contig,Pneumocystis,0.986739
11,Schizosaccharomyces pombe,Fungi,Ascomycetes,12.5913,36.0381,6974.0,5132.0,2002,Chromosome,Schizosaccharomyces,0.735876
12,Aspergillus nidulans FGSC A4,Fungi,Ascomycetes,30.276,50.2721,9586.0,9556.0,2003,Scaffold,Aspergillus,0.99687
13,Aspergillus fumigatus Af293,Fungi,Ascomycetes,29.385,49.8105,19832.0,19260.0,2005,Chromosome,Aspergillus,0.971158


### Indexing in pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, `iloc` and `loc`. For more advanced operations, these are the ones you're supposed to be using.

#### Index-based selection
Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. `iloc` follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

In [None]:
fungi.iloc[0] # first row

Species               Saccharomyces cerevisiae S288C
Kingdom                                        Fungi
Class                                    Ascomycetes
Size (Mb)                                    12.1571
GC%                                          38.1556
Number of genes                               6445.0
Number of proteins                            6002.0
Publication year                                1999
Assembly status                      Complete Genome
Genus                                  Saccharomyces
Proteins per gene                           0.931265
Name: 8, dtype: object

## Extracting rows and columns

-  Extract specific rows and all columns

    ```df.iloc[rows]```

-  Extract specific rows and columns

    ```df.iloc[rows, columns]```

-  Extract specific rows and columns using slices

    ```df.iloc[start:stop:step, start:stop:step]```

#### Note that stop is exclusive

 ```[1:3]``` will extract the second and third rows for all columns.

```[1::2, :]``` will extract every other row from the second row to the end, for all columns.

```[:, :3]``` will extract all rows, from the first column through the third column (indices 0, 1, 2).
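
The three slice patterns above can be checked on a small frame. The sketch below assumes a hypothetical 5x4 frame named `grid` whose row labels are deliberately not 0 to 4, to show that `iloc` looks at positions and ignores labels.

```python
import pandas as pd

# Hypothetical 5x4 frame; row labels are deliberately not 0..4
grid = pd.DataFrame(
    {
        "c0": [0, 1, 2, 3, 4],
        "c1": [10, 11, 12, 13, 14],
        "c2": [20, 21, 22, 23, 24],
        "c3": [30, 31, 32, 33, 34],
    },
    index=[8, 10, 11, 12, 13],
)

second_and_third = grid.iloc[1:3]    # positions 1 and 2, all columns
every_other = grid.iloc[1::2, :]     # positions 1 and 3, all columns
first_three_cols = grid.iloc[:, :3]  # all rows, column positions 0, 1, 2

print(second_and_third)
print(every_other)
print(first_three_cols)
```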


In [None]:
# Getting the second and third rows for all columns
fungi.iloc[1:3]

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
10,Pneumocystis carinii B80,Fungi,Ascomycetes,7.66146,27.8,3695.0,3646.0,2015,Contig,Pneumocystis,0.986739
11,Schizosaccharomyces pombe,Fungi,Ascomycetes,12.5913,36.0381,6974.0,5132.0,2002,Chromosome,Schizosaccharomyces,0.735876


In [None]:
# Getting every other row among the first 10 rows, and extracting the 4th and 5th columns
fungi.iloc[0:10:2, 3:5]

Unnamed: 0,Size (Mb),GC%
8,12.1571,38.1556
11,12.5913,36.0381
13,29.385,49.8105
15,14.2827,33.4827
33,2.49752,47.3005


In [None]:
# let's check to see if we have extracted the desired rows and columns
fungi.head(10)

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
8,Saccharomyces cerevisiae S288C,Fungi,Ascomycetes,12.1571,38.1556,6445.0,6002.0,1999,Complete Genome,Saccharomyces,0.931265
10,Pneumocystis carinii B80,Fungi,Ascomycetes,7.66146,27.8,3695.0,3646.0,2015,Contig,Pneumocystis,0.986739
11,Schizosaccharomyces pombe,Fungi,Ascomycetes,12.5913,36.0381,6974.0,5132.0,2002,Chromosome,Schizosaccharomyces,0.735876
12,Aspergillus nidulans FGSC A4,Fungi,Ascomycetes,30.276,50.2721,9586.0,9556.0,2003,Scaffold,Aspergillus,0.99687
13,Aspergillus fumigatus Af293,Fungi,Ascomycetes,29.385,49.8105,19832.0,19260.0,2005,Chromosome,Aspergillus,0.971158
14,Phanerochaete chrysosporium,Fungi,Basidiomycetes,39.2051,56.5,,,2016,Contig,Phanerochaete,
15,Candida albicans SC5314,Fungi,Ascomycetes,14.2827,33.4827,6263.0,6030.0,2016,Chromosome,Candida,0.962797
16,Neurospora crassa OR74A,Fungi,Ascomycetes,41.1024,48.2319,10455.0,10812.0,2003,Chromosome,Neurospora,1.034146
33,Encephalitozoon cuniculi GB-M1,Fungi,Other Fungi,2.49752,47.3005,2029.0,1996.0,2001,Chromosome,Encephalitozoon,0.983736
46,Aspergillus terreus NIH2624,Fungi,Ascomycetes,29.364,52.8,10551.0,10401.0,2005,Scaffold,Aspergillus,0.985783


Both `loc` and `iloc` are row-first, column-second. This is the opposite of the plain indexing operator, where `df[column][row]` is column-first, row-second.

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with `iloc`, we can do the following:

In [None]:
fungi.iloc[:, 0] # all rows, first column

0       Saccharomyces cerevisiae S288C
1             Pneumocystis carinii B80
2            Schizosaccharomyces pombe
3         Aspergillus nidulans FGSC A4
4          Aspergillus fumigatus Af293
                     ...              
4489          Saccharomyces cerevisiae
4490          Saccharomyces cerevisiae
4491          Saccharomyces cerevisiae
4492          Saccharomyces cerevisiae
4493          Saccharomyces cerevisiae
Name: Species, Length: 4494, dtype: object

On its own, the `:` operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to omit the first 3 columns and extract just the first three rows, we would do:

In [None]:
fungi.iloc[:3, 3:]

Unnamed: 0,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
8,12.1571,38.1556,6445.0,6002.0,1999,Complete Genome,Saccharomyces,0.931265
10,7.66146,27.8,3695.0,3646.0,2015,Contig,Pneumocystis,0.986739
11,12.5913,36.0381,6974.0,5132.0,2002,Chromosome,Schizosaccharomyces,0.735876


#### Label-based selection
The second paradigm for selection is the one followed by the `loc` operator: label-based selection. In this paradigm, it is the index label, not the position, that matters.

In [None]:
# Getting all rows, and specific columns based on their headers
fungi.loc[:, ['Species', 'Number of genes', 'Number of proteins']]

Unnamed: 0,Species,Number of genes,Number of proteins
8,Saccharomyces cerevisiae S288C,6445.0,6002.0
10,Pneumocystis carinii B80,3695.0,3646.0
11,Schizosaccharomyces pombe,6974.0,5132.0
12,Aspergillus nidulans FGSC A4,9586.0,9556.0
13,Aspergillus fumigatus Af293,19832.0,19260.0
...,...,...,...
8297,Saccharomyces cerevisiae,,
8298,Saccharomyces cerevisiae,155.0,298.0
8299,Saccharomyces cerevisiae,,
8300,Saccharomyces cerevisiae,,
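
One practical difference between the two paradigms is how slices end: `iloc` excludes the stop position, while `loc` includes the stop label. The sketch below is a minimal illustration using the first four fungal rows shown earlier; the frame name `toy` is an assumption for the example.

```python
import pandas as pd

# First four fungal rows from the output above, with their original labels
toy = pd.DataFrame(
    {
        "Species": [
            "Saccharomyces cerevisiae S288C",
            "Pneumocystis carinii B80",
            "Schizosaccharomyces pombe",
            "Aspergillus nidulans FGSC A4",
        ],
        "Size (Mb)": [12.1571, 7.66146, 12.5913, 30.276],
    },
    index=[8, 10, 11, 12],
)

by_position = toy.iloc[0:2]        # positions 0 and 1; stop is exclusive
by_label = toy.loc[8:11]           # labels 8 through 11; stop is inclusive
one_cell = toy.loc[10, "Species"]  # a single value by row and column label

print(by_position.index.to_list())
print(by_label.index.to_list())
print(one_cell)
```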


## Set row index using a specific column

In [None]:
# Assign the result back; re-running this cell raises the KeyError
# "None of ['Species'] are in the columns" because Species is already the index.
fungi = fungi.set_index('Species')
fungi

In [None]:
# Getting specific rows based on the row index and all columns
fungi.loc[['Saccharomyces cerevisiae'], :]

Unnamed: 0_level_0,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Saccharomyces cerevisiae,Fungi,Ascomycetes,12.086300,38.1473,,,2015,Complete Genome,Saccharomyces,
Saccharomyces cerevisiae,Fungi,Ascomycetes,12.165500,38.3121,,,2018,Complete Genome,Saccharomyces,
Saccharomyces cerevisiae,Fungi,Ascomycetes,12.134300,38.1573,,,2019,Complete Genome,Saccharomyces,
Saccharomyces cerevisiae,Fungi,Ascomycetes,12.150900,38.5859,,,2015,Chromosome,Saccharomyces,
Saccharomyces cerevisiae,Fungi,Ascomycetes,12.071900,38.3205,,,2016,Chromosome,Saccharomyces,
...,...,...,...,...,...,...,...,...,...,...
Saccharomyces cerevisiae,Fungi,Ascomycetes,3.993920,38.2000,,,2017,Scaffold,Saccharomyces,
Saccharomyces cerevisiae,Fungi,Ascomycetes,0.586761,38.5921,155.0,298.0,1992,Chromosome,Saccharomyces,1.922581
Saccharomyces cerevisiae,Fungi,Ascomycetes,12.020400,38.2971,,,2018,Chromosome,Saccharomyces,
Saccharomyces cerevisiae,Fungi,Ascomycetes,11.960900,38.2413,,,2018,Chromosome,Saccharomyces,
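
An index built from a column does not have to be unique, which is why one label can return many rows. The sketch below is a small illustration on a hypothetical three-row frame; it also shows that `set_index` cannot be applied twice, because the column has moved into the index.

```python
import pandas as pd

# Hypothetical frame in which one species name appears twice
toy = pd.DataFrame({
    "Species": [
        "Saccharomyces cerevisiae",
        "Tuber melanosporum",
        "Saccharomyces cerevisiae",
    ],
    "Publication year": [2015, 2010, 2018],
})

# set_index returns a new frame with Species as the row labels
by_species = toy.set_index("Species")

# A duplicated label selects every matching row
rows = by_species.loc[["Saccharomyces cerevisiae"], :]
print(rows)

# Species is no longer a column, so a second set_index fails
already_index = False
try:
    by_species.set_index("Species")
except KeyError as err:
    already_index = True
    print("KeyError:", err)
```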


## Reset dataframe's index

In [None]:
fungi = fungi.reset_index()
fungi

Unnamed: 0,Species,Kingdom,Class,Size (Mb),GC%,Number of genes,Number of proteins,Publication year,Assembly status,Genus,Proteins per gene
0,Saccharomyces cerevisiae S288C,Fungi,Ascomycetes,12.157100,38.1556,6445.0,6002.0,1999,Complete Genome,Saccharomyces,0.931265
1,Pneumocystis carinii B80,Fungi,Ascomycetes,7.661460,27.8000,3695.0,3646.0,2015,Contig,Pneumocystis,0.986739
2,Schizosaccharomyces pombe,Fungi,Ascomycetes,12.591300,36.0381,6974.0,5132.0,2002,Chromosome,Schizosaccharomyces,0.735876
3,Aspergillus nidulans FGSC A4,Fungi,Ascomycetes,30.276000,50.2721,9586.0,9556.0,2003,Scaffold,Aspergillus,0.996870
4,Aspergillus fumigatus Af293,Fungi,Ascomycetes,29.385000,49.8105,19832.0,19260.0,2005,Chromosome,Aspergillus,0.971158
...,...,...,...,...,...,...,...,...,...,...,...
4489,Saccharomyces cerevisiae,Fungi,Ascomycetes,3.993920,38.2000,,,2017,Scaffold,Saccharomyces,
4490,Saccharomyces cerevisiae,Fungi,Ascomycetes,0.586761,38.5921,155.0,298.0,1992,Chromosome,Saccharomyces,1.922581
4491,Saccharomyces cerevisiae,Fungi,Ascomycetes,12.020400,38.2971,,,2018,Chromosome,Saccharomyces,
4492,Saccharomyces cerevisiae,Fungi,Ascomycetes,11.960900,38.2413,,,2018,Chromosome,Saccharomyces,
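
`reset_index` is also useful right after filtering, since a filtered frame keeps the row labels of the original. The sketch below is a minimal illustration on a hypothetical four-row frame; the names and values are made up for the example.

```python
import pandas as pd

# Hypothetical frame with the default 0..3 row labels
toy = pd.DataFrame({
    "Kingdom": ["Protists", "Fungi", "Plants", "Fungi"],
    "Size (Mb)": [167.7, 12.2, 119.7, 7.7],
})

# Filtering keeps the original labels of the selected rows
fungi_toy = toy[toy["Kingdom"] == "Fungi"]
print(fungi_toy.index.to_list())

# reset_index() moves the old labels into a column named "index"
kept = fungi_toy.reset_index()
print(kept)

# drop=True discards the old labels and renumbers from 0
renumbered = fungi_toy.reset_index(drop=True)
print(renumbered)
```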
