## Challenge: Data cleaning & validation Unit 1.3.6

Data cleaning is definitely a "practice makes perfect" skill. Using this dataset of article open-access prices paid by the [WELLCOME Trust between 2012 and 2013](https://www.dropbox.com/s/19cjdi7wqhlfcpt/WELLCOME.zip?dl=0), determine the five most common journals and the total articles for each. Next, calculate the mean, median, and standard deviation of the open-access cost per article for each journal. You will need to do considerable data cleaning in order to extract accurate estimates, and may want to look into data [encoding methods](https://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16) if you get stuck. For a real bonus round, identify the open access prices paid by subject area.

As noted in the previous assignment, don't modify the data directly. Instead, write a cleaning script that will load the raw data and whip it into shape. Jupyter notebooks are a great format for this. Keep a record of your decisions: well-commented code is a must for recording your data cleaning decision-making process. Submit a link to your script and results below and discuss it with your mentor at your next session.
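On the encoding hint: the short sketch below, which is independent of the dataset, illustrates why the codec matters for a column that contains the pound sign. The variable names are illustrative only.

```python
# The pound sign is a single byte in Latin-1 but two bytes in UTF-8.
pound_latin1 = "£".encode("latin-1")   # b'\xa3'
pound_utf8 = "£".encode("utf-8")       # b'\xc2\xa3'

# Decoding Latin-1 bytes as UTF-8 fails: 0xA3 on its own is not valid UTF-8.
try:
    pound_latin1.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Decoding UTF-8 bytes as Latin-1 does not fail, but produces mojibake.
mojibake = pound_utf8.decode("latin-1")  # 'Â£'
```

If a file read with the default codec raises a `UnicodeDecodeError` or shows stray characters in front of `£`, trying another `encoding=` value in `pd.read_csv` is a reasonable first step.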

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
apc_spend = pd.read_csv('WELLCOME_APCspend2013_forThinkful.csv', 
                       index_col=None, encoding='latin-1')

apc_spend.head(5)

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88


In [3]:
apc_spend.shape

(2127, 5)

In [4]:
apc_spend.dtypes

PMID/PMCID                                             object
Publisher                                              object
Journal title                                          object
Article title                                          object
COST (£) charged to Wellcome (inc VAT when charged)    object
dtype: object

In [5]:
# Simplify column names
apc_spend = apc_spend.rename({'PMID/PMCID':'id', 'Publisher':'publisher',
                  'Journal title':'journal_title', 'Article title':'article_title', 
                  'COST (£) charged to Wellcome (inc VAT when charged)':'cost'}, axis='columns')

In [6]:
apc_spend.head(5)

Unnamed: 0,id,publisher,journal_title,article_title,cost
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88


In [7]:
# Check for unique items
apc_spend.nunique()

id               1880
publisher         299
journal_title     984
article_title    2126
cost             1402
dtype: int64

In [8]:
# Compare non-null counts per column against the 2127 rows:
# id and journal_title fall short, so they contain missing values.
apc_spend.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2127 entries, 0 to 2126
Data columns (total 5 columns):
id               1928 non-null object
publisher        2127 non-null object
journal_title    2126 non-null object
article_title    2127 non-null object
cost             2127 non-null object
dtypes: object(5)
memory usage: 83.2+ KB


In [9]:
# Count the missing values in each column
apc_spend.isnull().sum()

id               199
publisher          0
journal_title      1
article_title      0
cost               0
dtype: int64

In [10]:
# Since I am not going to focus on the ids and the readme file
# mentioned that the id data is not always accurate, I am removing
# the id column from the data
apc_spend = apc_spend.drop(columns=['id'])

In [11]:
# Now there is only one row with a null value, so I will remove that row
apc_spend = apc_spend.dropna()

apc_spend.isnull().sum()

publisher        0
journal_title    0
article_title    0
cost             0
dtype: int64

In [12]:
# Remove £ sterling character and $ from cost in row 179
#apc_spend['cost'] = apc_spend['cost'].apply(lambda x: x.strip('£'))
#apc_spend['cost'] = apc_spend['cost'].apply(lambda x: x.strip('$'))
apc_spend['cost'] = [x.strip('$') for x in apc_spend.cost] # quicker than lambda
apc_spend['cost'] = [x.strip('£') for x in apc_spend.cost]

apc_spend.head(5)

Unnamed: 0,publisher,journal_title,article_title,cost
0,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,2381.04
2,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",642.56
3,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,669.64
4,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,685.88


In [13]:
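# Change cost column from object (string) to numeric type.
# The values include pence, so the resulting dtype is float64
# (downcast='integer' only takes effect when every value is a whole number).
apc_spend['cost'] = pd.to_numeric(apc_spend['cost'], downcast='integer')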

# Round cost column to whole number
apc_spend['cost'] = apc_spend['cost'].round()

apc_spend.head(5)

Unnamed: 0,publisher,journal_title,article_title,cost
0,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,2381.0
2,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",643.0
3,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,670.0
4,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,686.0
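As an alternative sketch of the cost conversion, the symbol stripping and the numeric conversion can be done with vectorised string methods. The sample values below are made up for illustration, and `errors="coerce"` is an added assumption that turns anything unparseable into `NaN` instead of raising.

```python
import pandas as pd

# Made-up sample, not rows from the dataset
raw_cost = pd.Series(["£0.00", "£2381.04", "$642.56", "n/a"])

# Remove one leading currency symbol, then convert; bad entries become NaN
cleaned = raw_cost.str.replace(r"^[£$]", "", regex=True)
cost = pd.to_numeric(cleaned, errors="coerce")
```

Stripping a `$` only removes the symbol; it does not convert the amount into pounds, so such rows may still deserve a separate look.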


In [14]:
# In case of extra whitespace, I am going to remove any that may be there
apc_spend['publisher'] = apc_spend['publisher'].str.strip()
apc_spend['journal_title'] = apc_spend['journal_title'].str.strip()
apc_spend['article_title'] = apc_spend['article_title'].str.strip()

apc_spend.tail(5)

Unnamed: 0,publisher,journal_title,article_title,cost
2122,Wolters Kluwer Health,Circulation Research,Mechanistic Links Between Na+ Channel (SCN5A) ...,1334.0
2123,Wolters Kluwer Health,AIDS,Evaluation of an empiric risk screening score ...,1835.0
2124,Wolters Kluwer Health,Pediatr Infect Dis J,Topical umbilical cord care for prevention of ...,1835.0
2125,Wolters Kluwer N.V./Lippinott,AIDS,Grassroots Community Organisations' Contributi...,2375.0
2126,Wolters Kluwers,Journal of Acquired Immune Deficiency Syndromes,A novel community health worker tool outperfor...,2035.0


Now I am going to clean the journal titles, first by changing the text to lower case and then by looking for typos and incomplete titles. I correct the variants that occur often enough to affect whether a journal would be in the top 5 most common journals. I went through the alphabet and scanned for obvious errors that should be corrected.
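The same kind of correction can also be expressed as a lookup table from variant spellings to a canonical title, which keeps every decision in one place. This is only a sketch: the sample titles and the `canonical` dictionary are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample of messy titles
titles = pd.Series(["J Biol Chem.", "j biol chemistry", "PLoS ONE",
                    "plosone", "Neuroimage "])

# Hypothetical lookup: variant spelling -> canonical title
canonical = {
    "j biol chem.": "journal of biological chemistry",
    "j biol chemistry": "journal of biological chemistry",
    "plosone": "plos one",
}

# Normalise whitespace and case first, then map known variants;
# titles without an entry in the lookup are left unchanged
normalized = titles.str.strip().str.lower()
cleaned = normalized.replace(canonical)
```

Unlike assignments by row label, a lookup like this does not depend on the index, so it still applies if rows are dropped or reordered earlier in the script.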

In [15]:
# Making journal titles lower case
journal = apc_spend['journal_title'].str.lower().sort_values()
journal

439                    academy of nutrition and dietetics
8                                    acs chemical biology
9                                    acs chemical biology
21                                   acs chemical biology
20                                   acs chemical biology
19                                   acs chemical biology
22                              acs chemical neuroscience
34                                               acs nano
23                                               acs nano
927     acta crystallographica section d,  biological ...
920     acta crystallographica section d: biological c...
928     acta crystallographica section f: structural b...
929     acta crystallographica section f: structural b...
921                     acta crystallographica, section d
922                                acta crystallography d
923                                                acta d
1711                           acta dermato venereologica
1716          

In [16]:
journal.value_counts().sort_values()

consciousness & cognition                                   1
neuropsychobiology                                          1
journal of acquired immune deficiency syndroms (jaids)      1
international psychogeriatrics                              1
international journal of parasitology                       1
molecular & cellular proteomics                             1
european child and adolescent psychiatry                    1
child psychology psychiatry                                 1
european journal of health law                              1
computational biology                                       1
american journal of public health                           1
health & place                                              1
journal of neuroendocrinology                               1
mol bio                                                     1
calcified tissue international                              1
american chemical society                                   1
metabolo

In [17]:
# Show full column contents (pandas >= 1.0 takes None; older releases used -1)
pd.set_option('display.max_colwidth', None)
journal[journal.str.startswith('a', na=False)]

439     academy of nutrition and dietetics                                                     
8       acs chemical biology                                                                   
9       acs chemical biology                                                                   
21      acs chemical biology                                                                   
20      acs chemical biology                                                                   
19      acs chemical biology                                                                   
22      acs chemical neuroscience                                                              
34      acs nano                                                                               
23      acs nano                                                                               
927     acta crystallographica section d,  biological crystallography                          
920     acta crystallographica section d

In [18]:
journal.loc[927] = 'acta crystallographica section d: biological crystallography'
journal.loc[920] = 'acta crystallographica section d: biological crystallography'
journal.loc[921] = 'acta crystallographica section d: biological crystallography'
journal.loc[922] = 'acta crystallographica section d: biological crystallography'
journal.loc[923] = 'acta crystallographica section d: biological crystallography'

journal.loc[1717] = 'acta neuropathologica'

journal.loc[924] = 'acta crystallographica section f: structural biology and crystallization communications'
journal.loc[85] = 'antimicrobial agents and chemotherapy'
journal.loc[987] = 'antioxidants and redox signaling'

journal.loc[1889] = 'arthritis and rheumatism'
journal.loc[1888] = 'arthritis and rheumatism'
journal.loc[1890] = 'arthritis and rheumatism'

journal.loc[175] = 'arthritis research and therapy'

In [19]:
apc_spend['journal_title'] = journal

In [20]:
journal[journal.str.startswith('b', na=False)]

455     bba - molecular basis of disease                   
456     behavior research and therapy                      
457     behavior research and therapy                      
47      behavioral neuroscience                            
46      behavioral neuroscience                            
45      behavioral neuroscience                            
459     behaviour research and therapy                     
458     behaviour research and therapy                     
460     behavioural brain research                         
1188    biinformatics                                      
173     biochem journal                                    
1328    biochem soc trans                                  
461     biochemical and biophysical research communications
462     biochemical and biophysical research communications
1319    biochemical journal                                
1320    biochemical journal                                
1321    biochemical journal             

In [21]:
journal.loc[459] = 'behavior research and therapy'
journal.loc[458] = 'behavior research and therapy'
journal.loc[460] = 'behavior brain research'

journal.loc[173] = 'biochemical journal'
journal.loc[1322] = 'biochemical journal'
journal.loc[1328] = 'biochemical society transactions'
# 468 biochimie
journal.loc[1856] = 'bjophthalmol'

journal.loc[181] = 'bmc genomics'
journal.loc[2009] = 'british journal of pharmacology'

journal.loc[287] = 'british journal of ophthalmology'

In [22]:
apc_spend['journal_title'] = journal

In [23]:
journal[journal.str.startswith('c', na=False)]

1730    calcified tissue international            
1815    canadian journal of african studies       
486     cancer letters                            
1865    cancer research                           
1268    cardiovascular research                   
1272    cardiovascular research                   
1166    cardiovascular research                   
487     cell                                      
491     cell                                      
490     cell                                      
489     cell                                      
488     cell                                      
970     cell adhesion and migration               
1731    cell and tissue research                  
492     cell calcium                              
974     cell cycle                                
983     cell cycle                                
971     cell cycle                                
973     cell cycle                                
984     cell cycle             

In [24]:
journal.loc[1031] = 'cell death and disease'
journal.loc[1029] = 'cell death and disease'
journal.loc[1030] = 'cell death and disease'

journal.loc[844] = 'cell host and microbe'
journal.loc[1056] = 'cell death and differentiation'
journal.loc[2010] = 'child: care, health & development'

journal.loc[1130] = 'clinical infectious diseases'

journal.loc[508] = 'consciousness and cognition'
journal.loc[841] = 'current biology'
journal.loc[527] = 'current opinions in neurobiology'

In [25]:
apc_spend['journal_title'] = journal

In [26]:
journal[journal.str.startswith('d', na=False)]

1652    dalton transactions                           
1213    database                                      
2065    depression and anxiety                        
2066    dermatologic surgery                          
2034    dev world bioeth.                             
2035    dev world bioeth.                             
2036    dev world bioeth.                             
2037    dev. world bioeth                             
2038    dev. world bioeth                             
1869    developing world bioethics                    
1913    developing world bioethics                    
2011    developing world bioethics                    
396     development                                   
395     development                                   
394     development                                   
393     development                                   
392     development                                   
391     development                                   
390     de

In [27]:
journal.loc[2034] = 'developing world bioethics'
journal.loc[2035] = 'developing world bioethics'
journal.loc[2036] = 'developing world bioethics'
journal.loc[2037] = 'developing world bioethics'
journal.loc[2038] = 'developing world bioethics'

journal.loc[530] = 'developmental cell'

In [28]:
apc_spend['journal_title'] = journal

In [29]:
journal[journal.str.startswith('j', na=False)]

119     j biol chem.                                                           
62      j biol chem.                                                           
118     j biol chem.                                                           
117     j biol chem.                                                           
63      j biol chem.                                                           
120     j biol chemistry                                                       
1623    j cardiovasc magn reson                                                
399     j cell sci.                                                            
344     j clin microbiol                                                       
433     j immunol                                                              
427     j immunol                                                              
1266    j infect dis                                                           
2       j med chem                      

In [30]:
journal.loc[119] = 'journal of biological chemistry'
journal.loc[62] = 'journal of biological chemistry'
journal.loc[118] = 'journal of biological chemistry'
journal.loc[117] = 'journal of biological chemistry'
journal.loc[63] = 'journal of biological chemistry'
journal.loc[120] = 'journal of biological chemistry'
journal.loc[338] = 'journal of biological chemistry'
journal.loc[161] = 'journal of biological chemistry'
journal.loc[1835] = 'journal of biological chemistry'
journal.loc[164] = 'journal of biological chemistry'
journal.loc[162] = 'journal of biological chemistry'

journal.loc[2] = 'journal of medicinal chemistry'
journal.loc[3] = 'journal of medicinal chemistry'
journal.loc[1832] = 'journal of medicinal chemistry'
journal.loc[1831] = 'journal of medicinal chemistry'
journal.loc[1829] = 'journal of medicinal chemistry'
journal.loc[1830] = 'journal of medicinal chemistry'

journal.loc[1638] = 'journal of the royal society interface'

In [31]:
apc_spend['journal_title'] = journal

In [32]:
journal[journal.str.startswith('n', na=False)]

676     n biotechnol.                          
914     nanotechnology                         
915     nanotechnology                         
1307    national academy of sciences           
960     national academy of sciences           
1339    national academy of sciences           
1047    nature communications                  
1046    nature communications                  
1045    nature communications                  
1044    nature communications                  
1043    nature communications                  
1085    nature communications                  
1076    nature communications                  
1077    nature communications                  
1078    nature communications                  
1080    nature communications                  
1081    nature communications                  
1082    nature communications                  
1083    nature communications                  
1084    nature communications                  
1105    nature communications           

In [33]:
journal.loc[1238] = 'nucleic acids research'
journal.loc[1239] = 'nucleic acids research'
journal.loc[1240] = 'nucleic acids research'

In [34]:
apc_spend['journal_title'] = journal

In [35]:
journal[journal.str.startswith('p', na=False)]

755     pain                                                                                    
754     pain                                                                                    
753     pain                                                                                    
752     pain                                                                                    
233     parasit vectors.                                                                        
1993    parasite immunology                                                                     
1994    parasite immunology                                                                     
273     parasites and vectors                                                                   
221     parasites and vectors                                                                   
364     parasitology                                                                            
349     parasitology          

In [36]:
journal.loc[1350] = 'plos one'
journal.loc[1351] = 'plos one'
journal.loc[1353] = 'plos one'
journal.loc[1354] = 'plos one'
journal.loc[1355] = 'plos one'
journal.loc[1356] = 'plos one'
journal.loc[1357] = 'plos one'
journal.loc[1352] = 'plos one'

journal.loc[1393] = 'plos neglected tropical diseases'
journal.loc[1281] = 'plos neglected tropical diseases'

journal.loc[1612] = 'plos one'
journal.loc[1613] = 'plos one'
journal.loc[1611] = 'plos one'
journal.loc[1608] = 'plos one'
journal.loc[1609] = 'plos one'
journal.loc[1607] = 'plos one'
journal.loc[1606] = 'plos one'
journal.loc[1605] = 'plos one'
journal.loc[1610] = 'plos one'

journal.loc[1314] = 'proceedings of the national academy of sciences of the united states of america'
journal.loc[1004] = 'proceedings of the national academy of sciences of the united states of america'
journal.loc[1313] = 'proceedings of the national academy of sciences of the united states of america'

In [37]:
apc_spend['journal_title'] = journal

In [38]:
apc_spend.head()

Unnamed: 0,publisher,journal_title,article_title,cost
0,CUP,psychological medicine,Reduced parahippocampal cortical thickness in subjects at ultra-high risk for psychosis,0.0
1,ACS,biomacromolecules,Structural characterization of a Model Gram-negative bacterial surface using lipopolysaccharides from rough strains of escherichia coli,2381.0
2,ACS,journal of medicinal chemistry,"Fumaroylamino-4,5-epoxymorphinans and related opioids with irreversible ? opioid receptor antagonist effects.",643.0
3,ACS,journal of medicinal chemistry,Orvinols with mixed kappa/mu opioid receptor agonist activity.,670.0
4,ACS,j org chem,Regioselective opening of myo-inositol orthoesters: mechanism and synthetic utility.,686.0


## Task #1: TOP 5 Journals

In [39]:
# Top Five Journals
journal.value_counts()[:5]

plos one                           207
journal of biological chemistry    64 
nucleic acids research             29 
neuroimage                         29 
plos pathogens                     24 
Name: journal_title, dtype: int64

In [40]:
# Note: the value_counts() output above ranks 'plos pathogens' fifth (24 articles);
# this filter selects 'plos genetics' in its place.
top_journals = apc_spend[apc_spend['journal_title'].isin([
    'plos one',
    'journal of biological chemistry',
    'nucleic acids research',
    'neuroimage',
    'plos genetics'])]

top_journals.sort_values(by=['cost','journal_title'])

Unnamed: 0,publisher,journal_title,article_title,cost
1469,Public Library of Science,plos one,How well are Malaria Maps used to design and finance Malaria control in Africa,122.0
1471,Public Library of Science,plos one,"Neighbourhood, route and workplace-related environmental characteristics predict adults' mode of travel to work",215.0
16,AMBSB,journal of biological chemistry,Annexin-1 interaction with FPR2/ALX,266.0
1472,Public Library of Science,plos one,Socioeconomic inequalities in non-communicable diseases prevalence in India: disparities between . . .,330.0
161,ASBMB Cadmus,journal of biological chemistry,Biochemical and immunological characterisation of Toxoplasma gondii macrophage migration inhibitory factor,381.0
1304,PLoS (Public Library of Science),plos one,Prolonged internal displacement and common mental disorders in Sri Lanka_ the COMRAID study,390.0
1473,Public Library of Science,plos one,Superantigenic activity of emm3 streptococcus pyogenes is abrogated by conserved naturally occuring smeZ mutation,425.0
1305,PLoS (Public Library of Science),plos one,Genetics of callous-unemotional behavior in children,443.0
1474,Public Library of Science,plos one,Towards Clinical Molecular Diagnosis of Inherited Cardiac Conditions: A Comparison of Bench-Top Genome DNA Sequencers,534.0
1240,Oxford University Press,nucleic acids research,Space exploration by the promoter of a long human gene during one transcription cycle,710.0


A few rows have an implausible cost of 999,999. Unfortunately, there is no way to tell what the real cost might be, as all the article titles are unique. I can either remove those rows or calculate each journal's mean or median cost and use that instead. These are only 12 rows out of a total of 353 (12 / 353 ≈ 0.034), which is about 3.4 percent of those rows. Interestingly, 10 of them are from the Public Library of Science publisher.
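The two options can be sketched side by side on a toy frame. The data and the `threshold` value here are illustrative assumptions, not figures from the dataset.

```python
import pandas as pd

# Toy data with one placeholder-style outlier
df = pd.DataFrame({
    "journal_title": ["plos one", "plos one", "plos one",
                      "neuroimage", "neuroimage"],
    "cost": [900.0, 1000.0, 999999.0, 2000.0, 2200.0],
})
threshold = 150000  # illustrative cut-off

# Option A: drop the rows above the cut-off
dropped = df[df["cost"] <= threshold]

# Option B: blank out the outliers, then fill each gap with the median
# of the remaining costs for the same journal
masked = df["cost"].where(df["cost"] <= threshold)
journal_median = masked.groupby(df["journal_title"]).transform("median")
imputed = masked.fillna(journal_median)
```

Dropping changes the article count per journal, while imputing keeps the count but pulls the standard deviation down, so the choice affects the summary statistics either way.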

In [41]:
# Remove rows where cost is > 150000
too_much = apc_spend[apc_spend.cost > 150000].index
apc_spend.drop(too_much, inplace=True)

# Per my mentor, it is better to remove extreme outliers than to replace them with the mean or median

In [42]:
apc_spend.sort_values(by=['cost','journal_title'])

Unnamed: 0,publisher,journal_title,article_title,cost
0,CUP,psychological medicine,Reduced parahippocampal cortical thickness in subjects at ultra-high risk for psychosis,0.0
243,BioMed Central Ltd,veterinary research,Understanding foot-and-mouth disease virus transmission biology: identification of the indicators of infectiousness,10.0
100,American Society for Nutrition,american society for nutrition,The association between breastfeeding and HIV on postpartum maternal weight changes over 24 months in rural South Africa,46.0
1469,Public Library of Science,plos one,How well are Malaria Maps used to design and finance Malaria control in Africa,122.0
1677,Sciedu Press,journal of biomedical graphics and computing,Functional MRI demonstrates pain perception in hand osteoarthritis has features of central pain processing,135.0
975,Landes Bioscience,channels,State-independent intracellular access of quatemary ammonium blockers to the pore of TREK-1,160.0
963,JSciMed Central,journal of neurology & translational neuroscience,Parkinson's Disease: The Catabolic Theory,160.0
1676,Sciedu Press,international journal of financial research,"Determinants of Enrolment in Voluntary Health Insurance: Evidences from a Mixed Method Study, Kerala, India",187.0
1311,PNAS,proceedings of the national academy of sciences,Multistep molecular mechanism for Bone morphogenetic protein extracellular transport in the Drosophila embryo,206.0
1471,Public Library of Science,plos one,"Neighbourhood, route and workplace-related environmental characteristics predict adults' mode of travel to work",215.0


In [43]:
apc_spend.dtypes

publisher        object 
journal_title    object 
article_title    object 
cost             float64
dtype: object

In [44]:
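# A GroupBy object has no .sort() method: the original call,
# apc_spend.groupby(by='journal_title')['cost'].sort(), raised
# "TypeError: 'bool' object is not callable".
# Sorting the frame itself gives the intended journal/cost ordering.
apc_spend.sort_values(by=['journal_title', 'cost']).head()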

In [50]:
# Rebuild top_journals now that the outlier rows have been dropped
top_journals = apc_spend[(apc_spend['journal_title'] == 'plos one')
                |(apc_spend['journal_title'] == 'journal of biological chemistry')
                |(apc_spend['journal_title'] == 'nucleic acids research')
                |(apc_spend['journal_title'] == 'neuroimage')
                |(apc_spend['journal_title'] == 'plos genetics')]
                
top_journals.sort_values(by=['cost','journal_title'])

Unnamed: 0,publisher,journal_title,article_title,cost
1469,Public Library of Science,plos one,How well are Malaria Maps used to design and finance Malaria control in Africa,122.0
1471,Public Library of Science,plos one,"Neighbourhood, route and workplace-related environmental characteristics predict adults' mode of travel to work",215.0
16,AMBSB,journal of biological chemistry,Annexin-1 interaction with FPR2/ALX,266.0
1472,Public Library of Science,plos one,Socioeconomic inequalities in non-communicable diseases prevalence in India: disparities between . . .,330.0
161,ASBMB Cadmus,journal of biological chemistry,Biochemical and immunological characterisation of Toxoplasma gondii macrophage migration inhibitory factor,381.0
1304,PLoS (Public Library of Science),plos one,Prolonged internal displacement and common mental disorders in Sri Lanka_ the COMRAID study,390.0
1473,Public Library of Science,plos one,Superantigenic activity of emm3 streptococcus pyogenes is abrogated by conserved naturally occuring smeZ mutation,425.0
1305,PLoS (Public Library of Science),plos one,Genetics of callous-unemotional behavior in children,443.0
1474,Public Library of Science,plos one,Towards Clinical Molecular Diagnosis of Inherited Cardiac Conditions: A Comparison of Bench-Top Genome DNA Sequencers,534.0
1240,Oxford University Press,nucleic acids research,Space exploration by the promoter of a long human gene during one transcription cycle,710.0


## Task #2: Calculate mean, median, and standard deviation of cost per article for each journal.

In [51]:
# groupby() returns its results sorted alphabetically by journal_title,
# so the labels are listed in that same order to stay aligned with
# the .values assigned below
top5_journals = pd.Series(['journal of biological chemistry',
                          'neuroimage',
                          'nucleic acids research',
                          'plos genetics',
                          'plos one'])

article_count = top_journals.groupby('journal_title')['article_title'].count()
article_mean = top_journals.groupby('journal_title')['cost'].mean()
article_median = top_journals.groupby('journal_title')['cost'].median()
article_std = top_journals.groupby('journal_title')['cost'].std()

In [54]:
article_stats = pd.DataFrame()
article_stats = article_stats.assign(journals=top5_journals.values)
article_stats = article_stats.assign(num_articles=article_count.values)
article_stats = article_stats.assign(mean_cost=article_mean.values)
article_stats = article_stats.assign(median_cost=article_median.values)
article_stats = article_stats.assign(std_cost=article_std.values)
article_stats

Unnamed: 0,journals,num_articles,mean_cost,median_cost,std_cost
0,journal of biological chemistry,62,1388.870968,1302.5,409.846652
1,neuroimage,29,2215.137931,2326.0,266.540231
2,nucleic acids research,29,1162.344828,852.0,442.150934
3,plos genetics,22,1643.045455,1712.5,153.490337
4,plos one,198,934.994949,897.0,194.968083


In [55]:
# mean_cost, median_cost and std_cost are already per-article figures;
# the columns below divide them by the article count once more, so they
# should not be read as the cost per article
article_stats['mean_article_cost'] = article_stats['mean_cost']/article_stats['num_articles']
article_stats['median_article_cost'] = article_stats['median_cost']/article_stats['num_articles']
article_stats['std_article_cost'] = article_stats['std_cost']/article_stats['num_articles']


In [56]:
article_stats

Unnamed: 0,journals,num_articles,mean_cost,median_cost,std_cost,mean_article_cost,median_article_cost,std_article_cost
0,journal of biological chemistry,62,1388.870968,1302.5,409.846652,22.401145,21.008065,6.61043
1,neuroimage,29,2215.137931,2326.0,266.540231,76.384067,80.206897,9.191042
2,nucleic acids research,29,1162.344828,852.0,442.150934,40.080856,29.37931,15.246584
3,plos genetics,22,1643.045455,1712.5,153.490337,74.683884,77.840909,6.976833
4,plos one,198,934.994949,897.0,194.968083,4.722197,4.530303,0.984687
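A summary table like this can also be built in a single step with `groupby(...).agg(...)`, which keeps each journal label attached to its own statistics instead of relying on the order of separate arrays. The sketch below uses made-up data; the journal names and costs are placeholders.

```python
import pandas as pd

# Made-up data: two placeholder journals
df = pd.DataFrame({
    "journal_title": ["b journal", "a journal", "b journal",
                      "a journal", "a journal"],
    "cost": [100.0, 10.0, 300.0, 20.0, 30.0],
})

# One row per journal; the index carries the journal title,
# so counts and statistics cannot be mislabelled
stats = (
    df.groupby("journal_title")["cost"]
      .agg(["count", "mean", "median", "std"])
      .sort_values("count", ascending=False)
)
```

Here `mean`, `median` and `std` are computed over the individual article costs within each journal, which is the per-article summary the task asks for.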
