In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import glob

# RNA Sequences

**TASK**: In the `Data/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10th file that describes the content of each. Use pandas to import the first 9 spreadsheets into a single DataFrame. Then, add the metadata information from the 10th spreadsheet as columns in the combined `DataFrame`. Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.

First we load the 9 spreadsheets into a single `DataFrame`.

In [2]:
def load_MIDs(num_list):
    """Load the MID<num>.xls spreadsheets and stack them into one DataFrame."""
    list_dataframes = []
    for num in num_list:
        # The spreadsheets have no header row: column 0 is the taxon, column 1 its count.
        df = pd.read_excel('Data/microbiome/MID{}.xls'.format(num), header=None)
        df = df.rename(columns={0: 'Taxonomy', 1: 'Count'})
        # Tag every row with the barcode of the file it came from.
        df['Barcode'] = 'MID{}'.format(num)
        list_dataframes.append(df)

    # ignore_index=True renumbers the rows from 0, so the combined index is unique.
    return pd.concat(list_dataframes, ignore_index=True, sort=True)

In [3]:
MIDs_data = load_MIDs(range(1,10))

In [4]:
MIDs_data.head(2)

| | Barcode | Count | Taxonomy |
|---|---|---|---|
| 0 | MID1 | 7 | Archaea "Crenarchaeota" Thermoprotei Desulfuro... |
| 1 | MID1 | 2 | Archaea "Crenarchaeota" Thermoprotei Desulfuro... |


Once the data from all the spreadsheets has been merged, `MIDs_data` contains 2396 rows and 3 columns (`Barcode`, `Count` and `Taxonomy`), and it already has a unique index:

In [5]:
MIDs_data.tail(2)

| | Barcode | Count | Taxonomy |
|---|---|---|---|
| 2394 | MID9 | 1 | Bacteria Cyanobacteria Cyanobacteria Chloropl... |
| 2395 | MID9 | 10 | Bacteria Cyanobacteria Cyanobacteria Chloropl... |
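
The unique index is a consequence of passing `ignore_index=True` to `pd.concat`. As a minimal sketch, the two toy frames below (hypothetical names and made-up values, standing in for two of the spreadsheets) show what happens with and without that option:

```python
import pandas as pd

# Two toy frames standing in for two spreadsheets (values are made up).
first = pd.DataFrame({'Taxonomy': ['taxon A', 'taxon B'], 'Count': [7, 2]})
second = pd.DataFrame({'Taxonomy': ['taxon A', 'taxon C'], 'Count': [1, 10]})

# Without ignore_index, each frame keeps its own 0..n-1 labels, so labels repeat.
kept = pd.concat([first, second])

# With ignore_index=True, the rows are renumbered from 0 to len - 1.
renumbered = pd.concat([first, second], ignore_index=True)

print(kept.index.is_unique)        # False
print(renumbered.index.is_unique)  # True
```

On the real data, `MIDs_data.index.is_unique` gives the same kind of check without having to inspect the tail by eye.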


So far there are no `NaN` values in the `DataFrame`, so nothing needs to be replaced yet:

In [6]:
print(MIDs_data.isnull().values.any())
#MIDs_data.fillna('unknown', inplace=True)

False
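
Had the check returned `True`, the missing values could be replaced with `fillna`. The sketch below uses a small toy frame (an assumption, not the real data) with one missing taxonomy entry to illustrate the check and the replacement:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (made-up data, for illustration only).
toy = pd.DataFrame({
    'Barcode': ['MID1', 'MID2', 'MID3'],
    'Taxonomy': ['taxon A', np.nan, 'taxon C'],
})

# Count the missing values per column before deciding what to fill.
print(toy.isnull().sum())

# Replace every NaN with the tag 'unknown'; the original frame is left unchanged.
filled = toy.fillna('unknown')
print(filled.isnull().values.any())  # False
```

Note that filling a numeric column such as `Count` with a string would change its dtype to `object`, so in practice it may be preferable to restrict the fill to the text columns.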


Now we will add the metadata information to the `DataFrame`.