This python pipeline is meant to prepare a dataframe for re-naming genes to be used in the OctoSeqPipeline. 

Some sections of this pipeline came from this RBH pipeline found here: https://widdowquinn.github.io/2018-03-06-ibioic/02-sequence_databases/05-blast_for_rbh.html

In [42]:
#Imports
import os
import numpy as np
import pandas as pd

BlastN was run reciprocally between the two gene models. A unique output format was used, outputting each of the column names below. It's important to not to take only the top hit for each gene because in many cases, multiple hisat genes (OCTOGenes) match up to a single ocbimv. Blasting between the two gene models also helps to collapse some of the isoforms that are nearly identical. 

In [43]:
n_cov4_Oct46 = pd.read_csv("n_cov4_align_Oct46.txt", sep="\t",header=None)
n_Oct46_cov4 = pd.read_csv("n_Oct46_align_cov4.txt", sep="\t",header=None)
n_cov4_Oct46.columns = ['query', 'subject', 'pident', 'length', 'qlen', 'slen',
                        'qcovs', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 
                        'send', 'eval', 'bitscore']
n_Oct46_cov4.columns = ['query', 'subject', 'pident', 'length', 'qlen', 'slen',
                        'qcovs', 'mismatch', 'gapopen', 'qstart', 'qend', 'sstart', 
                        'send', 'eval', 'bitscore']

The example RBH pipline used the normalized bitscores for filtering, but I ended up not using this column (I just used the regular bitscore to find a threshold). If further filtering is needs, this column might be useful.

In [44]:
# Create a new column in both dataframes: normalized bitscore
n_cov4_Oct46['norm_bitscore'] = n_cov4_Oct46.bitscore/n_cov4_Oct46.qlen
n_Oct46_cov4['norm_bitscore'] = n_Oct46_cov4.bitscore/n_Oct46_cov4.qlen

Filtering of the blast hits. I determined the different thresholds by comparing the different hits to IGV. For example, I did not find any incorrect gene matchings where the bitscore was above 297. I decided on more conservative values to avoid having any correct gene model matches. This left out some of the very short genes in the very short contigs at the end of the genome file.

In [45]:
filtered = (n_cov4_Oct46['bitscore'] > 297) & (n_cov4_Oct46['pident'] == 100) & (n_cov4_Oct46['eval'] <= 1e-40) & (n_cov4_Oct46['qcovs'] > 90)
f_n_cov4_Oct46 = n_cov4_Oct46[filtered]

filtered2 = (n_Oct46_cov4['bitscore'] > 297) & (n_Oct46_cov4['pident'] == 100) & (n_Oct46_cov4['eval'] <= 1e-40) & (n_Oct46_cov4['qcovs'] > 90)
f_n_Oct46_cov4 = n_Oct46_cov4[filtered2]

In [46]:
#Examine the sizes of the newly filtered dataframes
print(f_n_cov4_Oct46.size)
print(f_n_Oct46_cov4.size)

1396000
1506800


In [47]:
#Take a look at the top of one of the dataframes
f_n_cov4_Oct46.head()

Unnamed: 0,query,subject,pident,length,qlen,slen,qcovs,mismatch,gapopen,qstart,qend,sstart,send,eval,bitscore,norm_bitscore
31,OCTOGene.16.3,Ocbimv22004487m,100.0,275,275,275,100,0,0,1,275,1,275,1.02e-143,508.0,1.847273
40,OCTOGene.20.1,Ocbimv22004486m,100.0,256,256,1945,100,0,0,1,256,1000,745,3.4299999999999997e-133,473.0,1.847656
245,OCTOGene.3.1,Ocbimv22004488m,100.0,661,661,661,100,0,0,1,661,1,661,0.0,1221.0,1.847201
249,OCTOGene.3.2,Ocbimv22004489m,100.0,1579,1579,1579,100,0,0,1,1579,1,1579,0.0,2916.0,1.846738
622,OCTOGene.11.1,Ocbimv22004490m,100.0,405,405,405,100,0,0,1,405,1,405,0.0,749.0,1.849383


Reciprocal blast hit filtering. This step finds where reciprocal blast hits are the same and then removes the rows where they are not the same. Groups by OCTOGene

In [48]:
# Merge forward and reverse results
rbh_all = pd.merge(f_n_cov4_Oct46, f_n_Oct46_cov4[['query', 'subject']],
                left_on='subject', right_on='query',
                how='outer')
# Discard rows that are not RBH
rbh_all = rbh_all.loc[rbh_all.query_x == rbh_all.subject_y]
# Group duplicate RBH rows, taking the maximum value in each column
rbh_all_group = rbh_all.groupby(['query_x', 'subject_x']).max() 

In [49]:
#Use this to view many rows. I used this to double check with IGV that hits make sense
pd.set_option('display.max_rows', None)
#rbh_all_group.tail(500) #look at the tail end of dataframe
#rbh_all_group.head(500) #look at the head of the dataframe
rbh_all_group[100:200] #look at particular rows of the dataframe

Unnamed: 0_level_0,Unnamed: 1_level_0,pident,length,qlen,slen,qcovs,mismatch,gapopen,qstart,qend,sstart,send,eval,bitscore,norm_bitscore,query_y,subject_y
query_x,subject_x,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
OCTOGene.10262.1,Ocbimv22038022m,100.0,4086.0,4086.0,4086.0,100.0,0.0,0.0,1.0,4086.0,1.0,4086.0,0.0,7546.0,1.846794,Ocbimv22038022m,OCTOGene.10262.1
OCTOGene.10262.1,Ocbimv22038023m,100.0,2819.0,4086.0,3770.0,92.0,0.0,0.0,1268.0,4086.0,952.0,3770.0,0.0,5206.0,1.274107,Ocbimv22038023m,OCTOGene.10262.1
OCTOGene.10262.2,Ocbimv22038022m,100.0,2819.0,3770.0,4086.0,100.0,0.0,0.0,952.0,3770.0,1268.0,4086.0,0.0,5206.0,1.380902,Ocbimv22038022m,OCTOGene.10262.2
OCTOGene.10262.2,Ocbimv22038023m,100.0,3770.0,3770.0,3770.0,100.0,0.0,0.0,1.0,3770.0,1.0,3770.0,0.0,6962.0,1.846684,Ocbimv22038023m,OCTOGene.10262.2
OCTOGene.10274.2,Ocbimv22038026m,100.0,4709.0,4709.0,4709.0,100.0,0.0,0.0,1.0,4709.0,1.0,4709.0,0.0,8696.0,1.846677,Ocbimv22038026m,OCTOGene.10274.2
OCTOGene.10275.1,Ocbimv22038024m,100.0,3276.0,3276.0,3276.0,100.0,0.0,0.0,1.0,3276.0,1.0,3276.0,0.0,6050.0,1.846764,Ocbimv22038024m,OCTOGene.10275.1
OCTOGene.10279.1,Ocbimv22016147m,100.0,302.0,302.0,302.0,100.0,0.0,0.0,1.0,302.0,1.0,302.0,1.1e-158,558.0,1.847682,Ocbimv22016147m,OCTOGene.10279.1
OCTOGene.10283.1,Ocbimv22016148m,100.0,8105.0,8105.0,8105.0,100.0,0.0,0.0,1.0,8105.0,1.0,8105.0,0.0,14968.0,1.846761,Ocbimv22016148m,OCTOGene.10283.1
OCTOGene.10288.1,Ocbimv22016149m,100.0,1939.0,1939.0,1939.0,100.0,0.0,0.0,1.0,1939.0,1.0,1939.0,0.0,3581.0,1.846828,Ocbimv22016149m,OCTOGene.10288.1
OCTOGene.10296.1,Ocbimv22016150m,100.0,3554.0,3554.0,3554.0,100.0,0.0,0.0,1.0,3554.0,1.0,3554.0,0.0,6564.0,1.846933,Ocbimv22016150m,OCTOGene.10296.1


In [50]:
#Use this to find particular genes in the dataframe
a = rbh_all_group["query_y"] == 'Ocbimv22038026m'
rbh_all_group[a]
#b = rbh_all_group["subject_y"] == 'OCTOGene.10262.2'
#rbh_all_group[b]

Unnamed: 0_level_0,Unnamed: 1_level_0,pident,length,qlen,slen,qcovs,mismatch,gapopen,qstart,qend,sstart,send,eval,bitscore,norm_bitscore,query_y,subject_y
query_x,subject_x,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
OCTOGene.10274.2,Ocbimv22038026m,100.0,4709.0,4709.0,4709.0,100.0,0.0,0.0,1.0,4709.0,1.0,4709.0,0.0,8696.0,1.846677,Ocbimv22038026m,OCTOGene.10274.2


Now, read in the file that matches ocbimv names with the human-readable names that have been identified by Judit through lots of blasting. This file is made into a dictionary and then the human-readable names are mapped onto the names file in a new column.

In [51]:
#Make new dataframe (names) that removes all the extra blast information
names = rbh_all_group.iloc[:,0:0]
names = names.reset_index()
#Read in file that matches Ocbimv names to human-readable gene IDs
genes = pd.read_csv("GeneIDs - AllGenesIDed.csv", sep=",")
gene_dict = genes.set_index('Gene')['ID'].to_dict()

In [52]:
names.head()

Unnamed: 0,query_x,subject_x
0,OCTOGene.100.1,Ocbimv22010629m
1,OCTOGene.1000.1,Ocbimv22028270m
2,OCTOGene.10002.1,Ocbimv22030350m
3,OCTOGene.1002.3,Ocbimv22028271m
4,OCTOGene.10028.1,Ocbimv22027785m


In [53]:
names['ID'] = names['subject_x'].map(gene_dict)
names = names.fillna(value='none')
names.head(500)

Unnamed: 0,query_x,subject_x,ID
0,OCTOGene.100.1,Ocbimv22010629m,none
1,OCTOGene.1000.1,Ocbimv22028270m,none
2,OCTOGene.10002.1,Ocbimv22030350m,none
3,OCTOGene.1002.3,Ocbimv22028271m,none
4,OCTOGene.10028.1,Ocbimv22027785m,none
5,OCTOGene.10032.1,Ocbimv22029021m,none
6,OCTOGene.10034.1,Ocbimv22027786m,none
7,OCTOGene.10040.1,Ocbimv22027788m,none
8,OCTOGene.10045.1,Ocbimv22027787m,none
9,OCTOGene.10047.1,Ocbimv22027794m,none


In [54]:
#Separate the dataframe, regroup and combine Ocbimv columns to remove repeats. 
#Check sizes to make sure they are the same
q_id = names[['query_x','ID']]
q_s = names[['query_x','subject_x']]
q_id_s = q_id.groupby('query_x')['ID'].apply(lambda x: '-'.join(x)).reset_index()
q_s_s = q_s.groupby('query_x')['subject_x'].apply(lambda x: '-'.join(x)).reset_index()
print(q_id_s.size)
print(q_s_s.size)

55934
55934


In [55]:
#Recombine and rename columns
names_sort = pd.merge(q_s_s, q_id_s, on='query_x')
names_sort.columns = ['OctoGene', 'Ocbimv', 'ID']

In [56]:
#Cellranger only uses the gene IDs rather than the trascript IDs (so OCTOGene.1 instead of OCTOGene.1.1)
#Have to remove the ends of the OCTOGene names and then regroup and remove repeats again.
#There is probably a better way of doing this, but still very fast so I went with it
short = names_sort
short['OctoGene'] = names_sort['OctoGene'].map(lambda x: str(x)[:-2])
names_sort['OctoGene'] = names_sort['OctoGene'].apply(lambda x: x[:-1] if x.endswith('.') else x)
short.head(50)

Unnamed: 0,OctoGene,Ocbimv,ID
0,OCTOGene.100,Ocbimv22010629m,none
1,OCTOGene.1000,Ocbimv22028270m,none
2,OCTOGene.10002,Ocbimv22030350m,none
3,OCTOGene.1002,Ocbimv22028271m,none
4,OCTOGene.10028,Ocbimv22027785m,none
5,OCTOGene.10032,Ocbimv22029021m,none
6,OCTOGene.10034,Ocbimv22027786m,none
7,OCTOGene.10040,Ocbimv22027788m,none
8,OCTOGene.10045,Ocbimv22027787m,none
9,OCTOGene.10047,Ocbimv22027794m-Ocbimv22027795m,none-none


In [57]:
q_id = short[['OctoGene','ID']]
q_s = short[['OctoGene', 'Ocbimv']]
q_id_s = q_id.groupby('OctoGene')['ID'].apply(lambda x: '_'.join(x)).reset_index()
q_s_s = q_s.groupby('OctoGene')['Ocbimv'].apply(lambda x: '_'.join(x)).reset_index()
print(q_id_s.size)
print(q_s_s.size)

40074
40074


In [58]:
q_s_s.head()

Unnamed: 0,OctoGene,Ocbimv
0,OCTOGene.100,Ocbimv22010629m
1,OCTOGene.1000,Ocbimv22028270m
2,OCTOGene.10002,Ocbimv22030350m
3,OCTOGene.1002,Ocbimv22028271m
4,OCTOGene.10028,Ocbimv22027785m


In [59]:
names_short = pd.merge(q_s_s, q_id_s, on='OctoGene')
names_short.columns = ['OctoGene', 'Ocbimv', 'ID']
names_short.head(500)

Unnamed: 0,OctoGene,Ocbimv,ID
0,OCTOGene.100,Ocbimv22010629m,none
1,OCTOGene.1000,Ocbimv22028270m,none
2,OCTOGene.10002,Ocbimv22030350m,none
3,OCTOGene.1002,Ocbimv22028271m,none
4,OCTOGene.10028,Ocbimv22027785m,none
5,OCTOGene.10032,Ocbimv22029021m,none
6,OCTOGene.10034,Ocbimv22027786m,none
7,OCTOGene.10040,Ocbimv22027788m,none
8,OCTOGene.10045,Ocbimv22027787m,none
9,OCTOGene.10047,Ocbimv22027794m-Ocbimv22027795m_Ocbimv22027795m,none-none_none


In [60]:
#Remove all the extra characters from the gene IDs. Seurat doesn't like these
names_short = names_short.replace(' ', '-', regex=True)
names_short = names_short.replace('/', '-', regex=True)
names_short = names_short.replace('_', '-', regex=True)
names_short = names_short.replace(',', '', regex=True)
names_short = names_short.replace(')', '-')
names_short = names_short.replace('(', '-')

#this code is ridiculous but I couldn't think of a better way to do it because there's some that are gene-none
names_short = names_short.replace('none', 'NA')
names_short = names_short.replace('none-none', 'NA')
names_short = names_short.replace('none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none-none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none-none-none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none-none-none-none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none-none-none-none-none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none-none-none-none-none-none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none', 'NA')
names_short = names_short.replace('none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none-none', 'NA')

In [61]:
names_short.head(500)

Unnamed: 0,OctoGene,Ocbimv,ID
0,OCTOGene.100,Ocbimv22010629m,
1,OCTOGene.1000,Ocbimv22028270m,
2,OCTOGene.10002,Ocbimv22030350m,
3,OCTOGene.1002,Ocbimv22028271m,
4,OCTOGene.10028,Ocbimv22027785m,
5,OCTOGene.10032,Ocbimv22029021m,
6,OCTOGene.10034,Ocbimv22027786m,
7,OCTOGene.10040,Ocbimv22027788m,
8,OCTOGene.10045,Ocbimv22027787m,
9,OCTOGene.10047,Ocbimv22027794m-Ocbimv22027795m-Ocbimv22027795m,


In [62]:
#Save this version out to be easily visualized
names_short.to_csv('new_namekey.csv') #be sure to change name to what you want

In [63]:
names_seurat = names_short
names_seurat['OctoGene_s'] = names_short['OctoGene'].map(lambda x: str(x)[:-2])
names_seurat['OctoGene_s'] = names_short['OctoGene_s'].apply(lambda x: x[:-1] if x.endswith('.') else x)
names_seurat['OctoGene_s'] = names_short['OctoGene_s'].replace('OCTOGene', 'OG', regex=True)
names_seurat['Ocbimv_s'] = names_short['Ocbimv'].replace('Ocbimv220', 'Oc', regex=True)
names_seurat['Ocbimv_s'] = names_short['Ocbimv_s'].replace('m', '', regex=True)

In [64]:
names_seurat.head()

Unnamed: 0,OctoGene,Ocbimv,ID,OctoGene_s,Ocbimv_s
0,OCTOGene.100,Ocbimv22010629m,,OG.1,Oc10629
1,OCTOGene.1000,Ocbimv22028270m,,OG.10,Oc28270
2,OCTOGene.10002,Ocbimv22030350m,,OG.100,Oc30350
3,OCTOGene.1002,Ocbimv22028271m,,OG.10,Oc28271
4,OCTOGene.10028,Ocbimv22027785m,,OG.100,Oc27785


In [65]:
names_seurat['name'] = names_seurat['ID'].str.cat(names_seurat['Ocbimv_s'],sep="-")
names_seurat['fullname'] = names_seurat['name'].str.cat(names_seurat['OctoGene_s'],sep="-")
names_seurat = names_seurat.drop(['name', 'OctoGene_s', 'Ocbimv_s'], axis=1)

In [66]:
names_seurat.head(500)

Unnamed: 0,OctoGene,Ocbimv,ID,fullname
0,OCTOGene.100,Ocbimv22010629m,,NA-Oc10629-OG.1
1,OCTOGene.1000,Ocbimv22028270m,,NA-Oc28270-OG.10
2,OCTOGene.10002,Ocbimv22030350m,,NA-Oc30350-OG.100
3,OCTOGene.1002,Ocbimv22028271m,,NA-Oc28271-OG.10
4,OCTOGene.10028,Ocbimv22027785m,,NA-Oc27785-OG.100
5,OCTOGene.10032,Ocbimv22029021m,,NA-Oc29021-OG.100
6,OCTOGene.10034,Ocbimv22027786m,,NA-Oc27786-OG.100
7,OCTOGene.10040,Ocbimv22027788m,,NA-Oc27788-OG.100
8,OCTOGene.10045,Ocbimv22027787m,,NA-Oc27787-OG.100
9,OCTOGene.10047,Ocbimv22027794m-Ocbimv22027795m-Ocbimv22027795m,,NA-Oc27794-Oc27795-Oc27795-OG.100


Now, add the OCTOGenes to the dataframe that don't have an Ocbimv associated with them. This reads in the Hisat gtf file, retrieves all the OCTOGene gene ids, and then appends them to the end of the names_seurat dataframe

In [67]:
octogenes = pd.read_csv('/Users/gcoffing/Documents/Documents/octo/cov4_len200_splice3_strandness.gtf', sep='\t', header=None)

In [68]:
octogenes.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,KQ417173,StringTie,transcript,26231,47536,1000,+,.,"gene_id ""OCTOGene.1""; transcript_id ""OCTOGene...."
1,KQ417173,StringTie,exon,26231,26487,1000,+,.,"gene_id ""OCTOGene.1""; transcript_id ""OCTOGene...."
2,KQ417173,StringTie,exon,47237,47536,1000,+,.,"gene_id ""OCTOGene.1""; transcript_id ""OCTOGene...."
3,KQ417173,StringTie,transcript,28714,31731,1000,+,.,"gene_id ""OCTOGene.2""; transcript_id ""OCTOGene...."
4,KQ417173,StringTie,exon,28714,31731,1000,+,.,"gene_id ""OCTOGene.2""; transcript_id ""OCTOGene...."


In [69]:
octogenes['n'] = octogenes[8].str.split(';').str[0]
octogenes['n'] = octogenes['n'].str[9:]
octogenes['n'] = octogenes['n'].str[:-1]
octogenes.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,n
0,KQ417173,StringTie,transcript,26231,47536,1000,+,.,"gene_id ""OCTOGene.1""; transcript_id ""OCTOGene....",OCTOGene.1
1,KQ417173,StringTie,exon,26231,26487,1000,+,.,"gene_id ""OCTOGene.1""; transcript_id ""OCTOGene....",OCTOGene.1
2,KQ417173,StringTie,exon,47237,47536,1000,+,.,"gene_id ""OCTOGene.1""; transcript_id ""OCTOGene....",OCTOGene.1
3,KQ417173,StringTie,transcript,28714,31731,1000,+,.,"gene_id ""OCTOGene.2""; transcript_id ""OCTOGene....",OCTOGene.2
4,KQ417173,StringTie,exon,28714,31731,1000,+,.,"gene_id ""OCTOGene.2""; transcript_id ""OCTOGene....",OCTOGene.2


In [70]:
new = octogenes['n']
new = new.drop_duplicates()

In [71]:
new.head()

0     OCTOGene.1
3     OCTOGene.2
5     OCTOGene.3
9     OCTOGene.4
11    OCTOGene.5
Name: n, dtype: object

In [72]:
new = list(new)

In [73]:
#this step takes a bit of time--about 5 minutes
for i in new:
    if i in names_seurat['OctoGene'].values:
        new.remove(i)

In [74]:
print(new)

['OCTOGene.1', 'OCTOGene.2', 'OCTOGene.4', 'OCTOGene.5', 'OCTOGene.6', 'OCTOGene.7', 'OCTOGene.8', 'OCTOGene.9', 'OCTOGene.10', 'OCTOGene.12', 'OCTOGene.13', 'OCTOGene.14', 'OCTOGene.15', 'OCTOGene.17', 'OCTOGene.18', 'OCTOGene.19', 'OCTOGene.20', 'OCTOGene.21', 'OCTOGene.22', 'OCTOGene.23', 'OCTOGene.24', 'OCTOGene.25', 'OCTOGene.26', 'OCTOGene.27', 'OCTOGene.28', 'OCTOGene.29', 'OCTOGene.30', 'OCTOGene.31', 'OCTOGene.32', 'OCTOGene.34', 'OCTOGene.35', 'OCTOGene.36', 'OCTOGene.38', 'OCTOGene.39', 'OCTOGene.40', 'OCTOGene.41', 'OCTOGene.42', 'OCTOGene.43', 'OCTOGene.44', 'OCTOGene.45', 'OCTOGene.46', 'OCTOGene.47', 'OCTOGene.48', 'OCTOGene.49', 'OCTOGene.51', 'OCTOGene.52', 'OCTOGene.54', 'OCTOGene.55', 'OCTOGene.57', 'OCTOGene.58', 'OCTOGene.60', 'OCTOGene.61', 'OCTOGene.62', 'OCTOGene.63', 'OCTOGene.64', 'OCTOGene.65', 'OCTOGene.66', 'OCTOGene.67', 'OCTOGene.68', 'OCTOGene.70', 'OCTOGene.71', 'OCTOGene.73', 'OCTOGene.74', 'OCTOGene.75', 'OCTOGene.77', 'OCTOGene.79', 'OCTOGene.81', 'O

In [75]:
l = pd.DataFrame(new, columns=['OctoGene'])

In [76]:
l['Ocbimv'] = 'NA'
l['ID'] = 'NA'
l['fullname'] = l['OctoGene'].replace('OCTOGene', 'OG', regex=True)

In [77]:
l.tail(500)

Unnamed: 0,OctoGene,Ocbimv,ID,fullname
75197,OCTOGene.91067,,,OG.91067
75198,OCTOGene.91068,,,OG.91068
75199,OCTOGene.91069,,,OG.91069
75200,OCTOGene.91070,,,OG.91070
75201,OCTOGene.91071,,,OG.91071
75202,OCTOGene.91072,,,OG.91072
75203,OCTOGene.91073,,,OG.91073
75204,OCTOGene.91074,,,OG.91074
75205,OCTOGene.91075,,,OG.91075
75206,OCTOGene.91076,,,OG.91076


In [78]:
full_df = names_seurat.append(l, ignore_index=True)

In [79]:
full_df.head()

Unnamed: 0,OctoGene,Ocbimv,ID,fullname
0,OCTOGene.100,Ocbimv22010629m,,NA-Oc10629-OG.1
1,OCTOGene.1000,Ocbimv22028270m,,NA-Oc28270-OG.10
2,OCTOGene.10002,Ocbimv22030350m,,NA-Oc30350-OG.100
3,OCTOGene.1002,Ocbimv22028271m,,NA-Oc28271-OG.10
4,OCTOGene.10028,Ocbimv22027785m,,NA-Oc27785-OG.100


In [80]:
full_df.tail()

Unnamed: 0,OctoGene,Ocbimv,ID,fullname
95729,OCTOGene.91600,,,OG.91600
95730,OCTOGene.91601,,,OG.91601
95731,OCTOGene.91602,,,OG.91602
95732,OCTOGene.91603,,,OG.91603
95733,OCTOGene.91604,,,OG.91604


In [81]:
full_df.to_csv('fulldf_namekey.csv')