# XML parsing 
This notebook parses the raw XML data files with Biopython's `Entrez` parser and saves the results as pandas DataFrames.

1. Loop through the XML files and collect every parsed record into one long list of dictionaries.
2. Turn the list of dictionaries into a pandas DataFrame (`records_df`).
3. Extract the sample data, stored as sub-dictionaries in the 'Samples' field of each record, into a second DataFrame (`samples_df`).
4. Add the number of samples per record to `records_df`.
5. Save the two DataFrames as pickle files in the same folder as the raw data files:
    - records_df --> records.pkl
    - samples_df --> samples.pkl
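Steps 2 to 4 can be sketched on toy data before running them on the full files. This is a minimal illustration, not the notebook's actual run: the record contents (`Accession`, `Title`, the `Id` values) are made-up placeholders, and only the `Id` and `Samples` field names are taken from the notebook.

```python
import pandas as pd

# Toy records mimicking the parsed structure (hypothetical values, not real data).
toy_records = [
    {"Id": "100", "Samples": [{"Accession": "GSM1", "Title": "a"},
                              {"Accession": "GSM2", "Title": "b"}]},
    {"Id": "200", "Samples": [{"Accession": "GSM3", "Title": "c"}]},
    {"Id": "300", "Samples": []},  # a record without any samples
]

# Step 2: list of dictionaries -> DataFrame.
toy_records_df = pd.DataFrame(toy_records)

# Step 3: build one small frame per record, then concatenate once at the end.
frames = []
for rec_id, samples in zip(toy_records_df["Id"], toy_records_df["Samples"]):
    frame = pd.DataFrame(samples)
    frame["Id"] = rec_id
    frame["nsamples"] = len(frame)
    frames.append(frame)
toy_samples_df = pd.concat(frames, ignore_index=True, sort=False)

# Step 4: one (Id, nsamples) pair per record, merged back onto the records.
counts = toy_samples_df[["Id", "nsamples"]].drop_duplicates()
toy_with_counts = toy_records_df.merge(counts, on="Id", how="left")
```

A record with an empty 'Samples' list contributes no rows to the samples frame, so its `nsamples` value is missing (NaN) after the merge rather than 0.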

In [31]:
from Bio import Entrez
import pandas as pd
import glob
import os

In [67]:
print('First we loop through xml files and get a very long list of dictionaries.')
print('We turn the list of dictionaries into a pandas data frame.')
print('')

Entrez.email = "A.N.Other@example.com" # Always tell NCBI who you are

data_path = './data_claire' # rename with template
file_base_name = "all_gse_series_homo_sapiens_part"
output_file_rec = os.path.join(data_path, 'records.pkl')
output_file_sam = os.path.join(data_path, 'samples.pkl')
raw_files = sorted(glob.glob(os.path.join(data_path, file_base_name+'*')))

record_list = []
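for ifile in raw_files:
    print('Parsing ',ifile)
    # Entrez.parse expects a binary-mode handle in recent Biopython releases;
    # the with-block also closes each file once its records are consumed.
    with open(ifile, 'rb') as handle:
        for record in Entrez.parse(handle):
            record_list.append(record)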
        
records_df = pd.DataFrame(record_list)

print('')
print('Now we go through each row of the larger dataframe and get the sample data from each row.')
print('')


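# Collect one small DataFrame per record and concatenate once at the end;
# DataFrame.append in a loop copies the data every time and was removed in pandas 2.0.
sample_frames = []
for i in range(len(records_df)):
    samples_aux = pd.DataFrame(records_df.loc[i, 'Samples'])
    samples_aux['Id'] = records_df.loc[i, 'Id']
    samples_aux['nsamples'] = len(samples_aux)
    sample_frames.append(samples_aux)
    if i % 1000 == 0:  # progress report every 1000 records
        print('Sample iteration:')
        print(i)

# sort=False keeps the column order and avoids the pandas sorting FutureWarning
samples_df = pd.concat(sample_frames, sort=False)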

samples_df = samples_df.set_index(['Id','nsamples'])
print('Saving samples to ', output_file_sam)
samples_df.to_pickle(output_file_sam)

# Take the per-record sample count from samples_df and attach it to records_df.
# One (Id, nsamples) pair per record is enough, so drop duplicate pairs
# instead of averaging every column with a dummy field.
n_samples_df = samples_df.reset_index()[['Id', 'nsamples']].drop_duplicates()
records_df = pd.merge(n_samples_df, records_df, on='Id', how='right')

print('')
print('Saving records to ', output_file_rec)
print('')
records_df.to_pickle(output_file_rec)

print('Done.')

First we loop through xml files and get a very long list of dictionaries.
We turn the list of dictionaries into a pandas data frame.

Parsing  ./data_claire/all_gse_series_homo_sapiens_part0.xml
Parsing  ./data_claire/all_gse_series_homo_sapiens_part1.xml
Parsing  ./data_claire/all_gse_series_homo_sapiens_part2.xml
Parsing  ./data_claire/all_gse_series_homo_sapiens_part3.xml
Parsing  ./data_claire/all_gse_series_homo_sapiens_part4.xml
Parsing  ./data_claire/all_gse_series_homo_sapiens_part5.xml
Parsing  ./data_claire/all_gse_series_homo_sapiens_part6.xml
Parsing  ./data_claire/all_gse_series_homo_sapiens_part7.xml
Parsing  ./data_claire/all_gse_series_homo_sapiens_part8.xml
Parsing  ./data_claire/all_gse_series_homo_sapiens_part9.xml

Now we go through each row of the larger dataframe and get the sample data from each row.

Sample iteration:
0


[FutureWarning, truncated in this export] ... of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.


Sample iteration:
1000
Sample iteration:
2000
Sample iteration:
3000
Sample iteration:
4000
Sample iteration:
5000
Sample iteration:
6000
Sample iteration:
7000
Sample iteration:
8000
Sample iteration:
9000
Sample iteration:
10000
Sample iteration:
11000
Sample iteration:
12000
Sample iteration:
13000
Sample iteration:
14000
Sample iteration:
15000
Sample iteration:
16000
Sample iteration:
17000
Sample iteration:
18000
Sample iteration:
19000
Sample iteration:
20000
Sample iteration:
21000
Sample iteration:
22000
Sample iteration:
23000
Sample iteration:
24000
Sample iteration:
25000
Sample iteration:
26000
Sample iteration:
27000
Sample iteration:
28000
Sample iteration:
29000
Sample iteration:
30000
Sample iteration:
31000
Sample iteration:
32000
Sample iteration:
33000
Sample iteration:
34000
Sample iteration:
35000
Sample iteration:
36000
Sample iteration:
37000
Sample iteration:
38000
Sample iteration:
39000
Sample iteration:
40000
Sample iteration:
41000
Sample iteration:
42000
S

KeyError: ('Id', 'nsamples')
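
The run stops with `KeyError: ('Id', 'nsamples')`, but the traceback is cut off, so the statement that raised it cannot be identified from this output. As a debugging aid, a check like the one below could be placed before the `set_index` and `groupby` calls to report which columns are actually present. `require_columns` is a hypothetical helper written for this sketch; it is not part of pandas or of the notebook.

```python
import pandas as pd

def require_columns(df, columns, name="DataFrame"):
    """Return df unchanged, or raise a descriptive KeyError if a column is missing."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(
            f"{name} is missing columns {missing}; available: {list(df.columns)}"
        )
    return df

# Toy frame that lacks the 'nsamples' column (made-up values).
toy = pd.DataFrame({"Id": ["100", "100"], "Accession": ["GSM1", "GSM2"]})

try:
    require_columns(toy, ["Id", "nsamples"], name="samples_df")
    check_message = ""
except KeyError as err:
    check_message = str(err)
```

Here `check_message` names the missing column and lists the available ones, which is more informative than the bare key shown in the truncated output.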