### You will need to do considerable data cleaning in order to extract accurate estimates, and may want to look into data encoding methods if you get stuck.

In [517]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

%matplotlib inline

In [518]:
df = pd.read_csv('apcs/welcome2013.csv', encoding="ISO-8859-1")
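
The `encoding` argument matters because the cost column contains `£`. Here is a minimal sketch of why, using a single cost string rather than the real file: ISO-8859-1 stores `£` as the lone byte 0xA3, which is not valid UTF-8 on its own. The variable names are illustrative only.

```python
# One cost string stands in for the file contents; the real CSV is not read here.
raw_bytes = '£2381.04'.encode('iso-8859-1')   # b'\xa32381.04'

try:
    raw_bytes.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    # 0xA3 is a UTF-8 continuation byte, so it cannot start a character.
    utf8_ok = False

decoded = raw_bytes.decode('iso-8859-1')
print(utf8_ok, decoded)
```

If a file of unknown origin fails to load with the default encoding, trying ISO-8859-1 is a common next step, though it will decode any byte sequence and so cannot confirm that the guess is right.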

### Checking out what we have

We can make sure we read something that looks sensible, and check the basic stats of the data we read.

In [519]:
df.head()

| | PMID/PMCID | Publisher | Journal title | Article title | COST (£) charged to Wellcome (inc VAT when charged) |
|---|---|---|---|---|---|
| 0 | | CUP | Psychological Medicine | Reduced parahippocampal cortical thickness in ... | £0.00 |
| 1 | PMC3679557 | ACS | Biomacromolecules | Structural characterization of a Model Gram-ne... | £2381.04 |
| 2 | 23043264 PMC3506128 | ACS | J Med Chem | Fumaroylamino-4,5-epoxymorphinans and related ... | £642.56 |
| 3 | 23438330 PMC3646402 | ACS | J Med Chem | Orvinols with mixed kappa/mu opioid receptor a... | £669.64 |
| 4 | 23438216 PMC3601604 | ACS | J Org Chem | Regioselective opening of myo-inositol orthoes... | £685.88 |


In [520]:
df.describe(include='all')

| | PMID/PMCID | Publisher | Journal title | Article title | COST (£) charged to Wellcome (inc VAT when charged) |
|---|---|---|---|---|---|
| count | 1928 | 2127 | 2126 | 2127 | 2127 |
| unique | 1880 | 299 | 984 | 2126 | 1402 |
| top | - | Elsevier | PLoS One | Exclusive breastfeeding, diarrhoel morbidity a... | £2040.00 |
| freq | 7 | 387 | 92 | 2 | 94 |


### Basic cleanup - renaming columns

We can rename the columns to make them easier to handle. The first word of each column header is descriptive enough and doesn't contain hard-to-type characters like '£'.

In [521]:
# Rename columns to make our lives easier: keep only the first word of each header.
df.rename(columns=lambda name: name.split(' ')[0], inplace=True)
print(df.columns)

Index(['PMID/PMCID', 'Publisher', 'Journal', 'Article', 'COST'], dtype='object')
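
To make the renaming rule concrete, here is the same first-word split applied to the original headers as plain strings, without pandas. The list below is copied from the `df.head()` output; the variable names are illustrative.

```python
# Original column headers as they appear in the CSV.
headers = [
    'PMID/PMCID',
    'Publisher',
    'Journal title',
    'Article title',
    'COST (£) charged to Wellcome (inc VAT when charged)',
]

# Keep only the text before the first space in each header.
short_names = [header.split(' ')[0] for header in headers]
print(short_names)
```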


### PMID/PMCID

PMID is a unique identifier for articles indexed in PubMed, [an index of the biomedical literature](https://nexus.od.nih.gov/all/2015/08/31/pmid-vs-pmcid-whats-the-difference/) according to NIH. PMCID is the identifier assigned to articles archived in PubMed Central. This column is messy: around 200 rows don't have a valid value. Fortunately, the analysis we're asked to do doesn't depend on this column.
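
One way to put a number on "valid" is a pattern check. The sketch below assumes that a valid entry is one or more whitespace-separated tokens, each either all digits (a PMID) or `PMC` followed by digits (a PMCID). That rule and the helper name `looks_valid` are my assumptions, not something the data set defines.

```python
import re

# Assumed shape of a valid token: digits only, or "PMC" followed by digits.
ID_PATTERN = re.compile(r'^(?:\d+|PMC\d+)$')

def looks_valid(value):
    """Return True if every whitespace-separated token looks like a PMID or PMCID."""
    tokens = str(value).split()
    return bool(tokens) and all(ID_PATTERN.match(token) for token in tokens)

# The first two values come from df.head(); the rest are placeholders seen in the column.
samples = ['PMC3679557', '23043264 PMC3506128', '-', 'Pending', '']
flags = {sample: looks_valid(sample) for sample in samples}
print(flags)
```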

### Duplicate values

There are some duplicated values in the PMID/PMCID column, but they are mostly invalid or placeholder values. I didn't find any valid value in this column that was duplicated.

There is one duplicate article title. This one DOES represent a duplicate row, so I'm deleting one of the copies.

In [522]:
# duplicated() is already a boolean mask, so it can index rows directly.
duplicate_titles = df.loc[df['Article'].duplicated(), 'Article']
for title in duplicate_titles:
    print(df.loc[df['Article'] == title])

     PMID/PMCID                  Publisher   Journal  \
1490    Pending  Public Library of Science  PLoS One   
1496        NaN  Public Library of Science  PLoS One   

                                                Article     COST  
1490  Exclusive breastfeeding, diarrhoel morbidity a...  £825.68  
1496  Exclusive breastfeeding, diarrhoel morbidity a...  £825.68  


In [523]:
df.drop(1496, inplace=True)
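
As an alternative to dropping by a hard-coded index label, the same check and removal can be sketched on a toy frame. The rows below are made up; `duplicated(keep=False)` marks every copy of a repeated title, and `drop_duplicates` keeps the first.

```python
import pandas as pd

# Made-up rows standing in for the real data.
toy = pd.DataFrame({
    'Article': ['Paper A', 'Paper B', 'Paper A'],
    'Cost': [825.68, 1200.00, 825.68],
})

# keep=False flags all copies, which is handy for inspecting them side by side.
dupes = toy[toy['Article'].duplicated(keep=False)]

# Keep the first occurrence of each title and drop the rest.
deduped = toy.drop_duplicates(subset='Article', keep='first')
print(dupes)
print(deduped)
```

Dropping on the title alone assumes that two rows with the same title really are the same article, which is what the inspection above confirmed for this data.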

### Cost column cleanup

Values in the cost column have a lot of leading £ signs, and a handful of trailing $ signs.

We can strip the leading pound signs.

This column is supposed to be in £, so I think where we see dollar signs, we should convert those values from dollars to pounds. There are no dates in this data, but the introduction to this challenge said "between 2012 and 2013", so I grabbed the average annual exchange rates for 2012-2013 from [this site](https://www.ofx.com/en-us/forex-news/historical-exchange-rates/yearly-average-rates/):

| Year ending | Average USD to GBP rate |
|---|---|
| December 31, 2012 | 0.631109 |
| December 31, 2013 | 0.63955 |

The mean of those two values is 0.6353295--let's use that for our conversion.
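
Here is a short worked version of that arithmetic, with one hypothetical dollar amount to show the conversion. The two rates are the ones quoted above; the 1000.00 figure is invented for illustration.

```python
rate_2012 = 0.631109
rate_2013 = 0.63955

# Simple average of the two yearly rates.
mean_rate = (rate_2012 + rate_2013) / 2

# Hypothetical charge in dollars, converted to pounds and rounded to pence.
example_usd = 1000.00
example_gbp = round(mean_rate * example_usd, 2)
print(round(mean_rate, 7), example_gbp)
```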

The Cost column contains some instances of 999999.00, which are obviously bogus. There are also a few values that are what I would consider unreasonably large compared to the rest of the data--arguably the MacMillan article that costs 13200.00 is one of these, but certainly nothing larger than that can be real.

Because there aren't a large number of these bogus values, I feel like it's safe to set them to 'NaN'. We can still compute summary stats on the values that are left.

In [524]:
# Mean of 2012-2013 dollars to pounds conversion rates
np.mean((0.631109,0.63955))

0.6353295

In [525]:
# Convert COST to floats, representing values in pounds.
USD_TO_GBP = 0.6353295  # mean of the 2012 and 2013 average rates

def strip_or_convert(cost):
    # 999999.00 is an obviously bogus value. Immediately replace with NaN.
    if '999999.00' in cost:
        return float('NaN')
    # Strip pound signs
    if cost.startswith('£'):
        final_value = round(float(cost.strip('£')), 2)
    # Convert dollars to pounds
    elif cost.endswith('$'):
        final_value = round(USD_TO_GBP * float(cost.strip('$')), 2)
    else:
        # Not expected for this data. Flag it and treat it as missing; returning
        # the raw string would break the numeric comparison below.
        print("This cost is weird: {}".format(cost))
        return float('NaN')
    # I saw a few huge values that weren't 999999.00 but which still didn't make sense.
    return final_value if final_value < 15000.00 else float('NaN')
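
The same rules can also be written without a row-by-row `apply`. This is a sketch under the same assumptions (leading `£`, trailing `$`, anything at or above 15000.00 treated as missing), run on a small invented series rather than the real column. The `'1000.00$'` and `'£201024.00'` entries are made up.

```python
import pandas as pd

USD_TO_GBP = 0.6353295  # same mean rate as above

# Invented sample values covering each case the function handles.
raw = pd.Series(['£2381.04', '£0.00', '1000.00$', '£999999.00', '£201024.00'])

is_usd = raw.str.endswith('$')
# Remove currency symbols from both ends, then parse; unparseable text becomes NaN.
amount = pd.to_numeric(raw.str.strip('£$'), errors='coerce')

# Keep pound amounts as they are and convert only the dollar amounts.
pounds = amount.where(~is_usd, amount * USD_TO_GBP).round(2)

# Values at or above the cutoff (including the 999999.00 sentinel) become NaN.
cleaned = pounds.where(pounds < 15000.00)
print(cleaned)
```

`Series.where` keeps a value where the condition is true and substitutes the second argument (NaN by default) elsewhere, which is why the cutoff needs no explicit loop.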

In [526]:
# Let's also rename this column to something slightly friendlier while we're here.
df['Cost'] = df['COST'].apply(strip_or_convert)
df.drop(['COST'],axis=1,inplace=True)

In [527]:
df['Cost'].describe()

count     2077.000000
mean      1824.710419
std        809.401251
min          0.000000
25%       1260.000000
50%       1851.290000
75%       2302.930000
max      13200.000000
Name: Cost, dtype: float64

### Publisher cleanup

Publisher names are inconsistent: there are near-duplicate spellings, abbreviations, and names both with and without company suffixes (Ltd., Inc., LLC). I want to clean those up so we can get an accurate count of articles published for each publisher.

In [528]:
# Strip spaces from around the publishers' names
df['Publisher'] = df['Publisher'].str.strip()

In [529]:
# I wrote a function to help standardize some of the publishers' names.
# It adds pairs of raw/standardized names to a dict, which we use later
#  in a replace method.
publisher_dict = {}

def publisher_cleanup(df_loc,repl_str):
    for pub in df_loc:
        publisher_dict[pub] = repl_str
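
To show how the dictionary is meant to be used, here is the whole round trip on a toy series: collect the raw spellings that match a pattern, map each one to a standard name, then call `replace`. The sample names and the `mapping` variable are made up for illustration.

```python
import pandas as pd

# Invented publisher spellings standing in for the real column.
publishers = pd.Series(['ACS', 'ACS Publications', 'American Chemical Society', 'Elsevier'])

mapping = {}
mask = publishers.str.contains('ACS') | publishers.str.contains('American Chemical Society', case=False)
for raw_name in publishers[mask].unique():
    mapping[raw_name] = 'American Chemical Society'

# replace() swaps only values that exactly equal a dictionary key.
cleaned = publishers.replace(mapping)
print(cleaned.value_counts())
```

Because `replace` matches whole values, publishers that no pattern touched (here `Elsevier`) pass through unchanged.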

Here's a lengthy section containing all of my attempts to consolidate similarly named publishers.

I realized when I was well into this section that I might have been able to avoid some of this work by dropping the leading word "The" from all publisher names, and maybe by dropping ", Inc/Ltd/LLC" from the end.
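
That shortcut might look something like the sketch below. The helper name `normalize_publisher`, the exact suffix list, and the sample names are my assumptions, and I have not run this against the full publisher column.

```python
import re

def normalize_publisher(name):
    """Trim whitespace, drop a leading 'The', and drop a trailing Inc/Ltd/LLC."""
    name = name.strip()
    name = re.sub(r'^the\s+', '', name, flags=re.IGNORECASE)
    name = re.sub(r',?\s+(?:inc|ltd|llc)\.?$', '', name, flags=re.IGNORECASE)
    return name

# Hypothetical spellings, chosen to exercise each rule.
samples = ['The Endocrine Society', 'Mary Ann Liebert, Inc.', 'Hindawi Publishing Ltd', 'Elsevier']
print([normalize_publisher(sample) for sample in samples])
```

A pass like this would only shrink the list of variants; misspellings and abbreviations would still need the pattern rules that follow.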

In [530]:
# ACS = American Chemical Society and various permutations
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('ACS') \
                                      | df['Publisher'].str.contains('American Chemical Society', case=False)].unique(),
                  'American Chemical Society'
                 )

# ASBMB = American Society for Biochemistry and Molecular Biology.
# Cenveo is a publisher service, and "trusted service provider" to ASBMB,
#  so I think it's fair to roll "ASBMB/Cenveo" entries into this one.
# I didn't find anything for 'AMBSB' when I Googled, so I think that is a misspelling of 'ASBMB'.
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Biochemistry and Molecular Biol',case=False) \
                                     | df['Publisher'].str.contains('ASBMB') \
                                     | (df['Publisher'] == 'AMBSB')].unique(),
                 'American Society for Biochemistry and Molecular Biology'
                 )

# American Society of Hematology
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('H.*matology', case=False)].unique(),
                 'American Society of Hematology'
                 )

# American Society for Biochemistry and Molecular Biology
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('American Society for Biochemistry and Molecular Biology')].unique(),
                 'American Society for Biochemistry and Molecular Biology'
                 )

# ASM - American Society for Microbiology
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('American Society (?:of|for) Microbiology',case=False) \
                                     | df['Publisher'].str.startswith('ASM')].unique(),
                 'American Society for Microbiology'
                 )

# Bentham Science Publishers
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Benthan', case=False)].unique(),
                 'Bentham Science Publishers'
                 )

# BioMed Central
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('BioMed', case=False)].unique(),
                 'BioMed Central'
                 )

# Bioscientifica does not have a capitalized S in the middle
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.startswith('BioScient')].unique(),
                 'Bioscientifica'
                 )

# BMJ = British Medical Journal
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.startswith('BMJ')].unique(),
                 'British Medical Journal'
                 )

# Cadmus
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Cadmus',case=False)].unique(),
                  'Cadmus'
                 )

# Cambridge University Press
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Cambridge',case=False)].unique(),
                  'Cambridge University Press'
                 )

# Cenveo
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.startswith('Cenveo')].unique(),
                 'Cenveo Publisher Services',
                 )

# Cold Spring Harbor
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Cold Spring Ha.?bo.*r',case=False)].unique(),
                 'Cold Spring Harbor'
                 )

# Company of Biologists
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Company of Biol.*gist', case=False)].unique(),
                 'Company of Biologists'
                 )

# Dartmouth Journal Services
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Dar.?mouth',case=False)].unique(),
                 'Dartmouth Journal Services'
                 )

# Elsevier
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('^Elsev.*r', case=False)].unique(),
                 'Elsevier'
                 )

# The Endocrine Society
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Endocrine', case=False)].unique(),
                 'The Endocrine Society'
                 )

# Federation of American Societies for Experimental Biology
publisher_cleanup(df['Publisher'].loc[(df['Publisher'].str.contains('Experimental Biology', case=False)) \
                                     | (df['Publisher'] == 'FASEB')].unique(),
                 'Federation of American Societies for Experimental Biology'
                 )

# Frontiers Media SA
publisher_cleanup(df['Publisher'].loc[(df['Publisher'].str.startswith('Frontiers Media')) \
                   | (df['Publisher'] == 'Frontiers')].unique(),
                 'Frontiers Media SA'
                 )

# Future Medicine
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Future Medicine')].unique(),
                 'Future Medicine'
                 )

# Hindawi
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.startswith('Hindawi')].unique(),
                 'Hindawi'
                 )

# Impact Journals
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.startswith('Impact')].unique(),
                 'Impact Journals'
                 )

# International Union Against Tuberculosis and Lung Disease
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Tuberculosis and Lung Disease', case=False)].unique(),
                 'International Union Against Tuberculosis and Lung Disease'
                 )

# Informa Healthcare
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.startswith('Informa Healthcare')].unique(),
                  'Informa Healthcare'
                 )

# International Union of Crystallography
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.startswith('International Union of Crystallography')].unique(),
                 'International Union of Crystallography'
                 )

# Karger
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Karger', case=False)].unique(),
                 'Karger'
                 )

# Landes Bioscience
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.startswith('Landes Biosciences')].unique(),
                 'Landes Bioscience'
                 )

# Mary Ann Liebert
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Mary Ann Liebert', case=False)].unique(),
                 'Mary Ann Liebert'
                 )

# MIT Press
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('MIT Press', case=False)].unique(),
                  'MIT Press'
                 )

# MYJoVE
publisher_cleanup(df['Publisher'].loc[(df['Publisher'].str.contains('JoVE', case=False)) \
                                     | df['Publisher'].str.contains('Journal of Visualized Experiments')].unique(),
                 'MYJoVE')

# PNAS = Proceedings of the National Academy of Sciences
publisher_cleanup(df['Publisher'].loc[(df['Publisher'].str.contains('National Academy of Sciences', case=False)) \
                   | df['Publisher'].str.contains('PNAS')].unique(),
                  'PNAS'
                 )

# Nature Publishing Group
publisher_cleanup(df['Publisher'].loc[(df['Publisher'].str.contains('Nature P', case=False)) \
                                      | df['Publisher'].str.contains('NPG') \
                                      | (df['Publisher'] == 'Nature')].unique(),
                 'Nature Publishing Group'
                 )

# OUP = Oxford University Press
publisher_cleanup(df['Publisher'].loc[(df['Publisher'].str.contains('Oxford',case=False)) \
                                      | (df['Publisher'] == 'OUP')].unique(),
                 'Oxford University Press'
                 )

# PLOS = Public Library of Science
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('PLOS',case=False) \
                   | df['Publisher'].str.contains('Public Library of Science')].unique(),
                 'PLOS'
                 )

# Portland Press
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Portland Press', case=False)].unique(),
                 'Portland Press'
                 )

# PubMed
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('PubMed', case=False)].unique(),
                 'PubMed'
                 )

# The Royal Society (of/for...? Since I have no idea, I'm leaving this as its own entry)
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('The Royal Society', case=False)].unique(),
                 'The Royal Society'
                 )

# Royal Society of Chemistry
publisher_cleanup(df['Publisher'].loc[(df['Publisher'].str.contains(('Royal Society for Chemistry'), case=False)) \
                   | (df['Publisher'].str.startswith('RSC'))].unique(),
                 'Royal Society of Chemistry'
                 )

# SAGE Publishing
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Sage', case=False)].unique(),
                 'SAGE Publishing'
                 )

# The Sheridan Press
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Sheridan Press', case=False)].unique(),
                 'The Sheridan Press'
                 )

# Society for General Microbiology
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Society for Genermal Microbiology', case=False)].unique(),
                 'Society for General Microbiology'
                 )

# Society for Neuroscience
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Society (?:of|for) Neuro', case=False)].unique(),
                 'Society for Neuroscience'
                 )

# Springer Science + Business Media
# This pulls in a record for Humana Press--I confirmed this is Humana's parent company.
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.startswith(('Springer','SPRINGER'))].unique(),
                 'Springer Science + Business Media'
                 )

# Taylor and Francis
publisher_cleanup(df['Publisher'].loc[(df['Publisher'].str.contains(('Taylor.*?Francis'), case=False)) \
                   | (df['Publisher'].str.contains('T.F'))].unique(),
                 'Taylor and Francis'
                 )

# Wiley, Wiley-Blackwell, Wiley-VCH, John Wiley & Sons, etc. are all basically the same
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('W..ey', case=False)].unique(),
                  'Wiley-Blackwell'
                 )

# Wolters Kluwer
publisher_cleanup(df['Publisher'].loc[df['Publisher'].str.contains('Wolters Kluwer', case=False)].unique(),
                 'Wolters Kluwer'
                 )
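
The repeated calls above all follow one shape, so the rules could also be kept as data. This is a sketch of that idea with only three of the patterns copied over. The names `RULES` and `canonical_publisher` and the sample strings are hypothetical.

```python
import re

# (compiled pattern, standard name) pairs; only three of the rules above are copied here.
RULES = [
    # 'ACS' stays case-sensitive; the full society name is matched case-insensitively.
    (re.compile(r'ACS|(?i:American Chemical Society)'), 'American Chemical Society'),
    (re.compile(r'^Elsev.*r', re.IGNORECASE), 'Elsevier'),
    (re.compile(r'W..ey', re.IGNORECASE), 'Wiley-Blackwell'),
]

def canonical_publisher(name):
    """Return the standard name for the first rule that matches, else the trimmed input."""
    name = name.strip()
    for pattern, target in RULES:
        if pattern.search(name):
            return target
    return name

samples = ['ACS', 'Elsevier Ltd', 'John Wiley & Sons', 'Hindawi']
print([canonical_publisher(sample) for sample in samples])
```

One behavioral difference is worth knowing: here the first matching rule wins, whereas with the dictionary approach a later call overwrites an earlier one for the same raw name.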



In [531]:
df['Publisher'] = df['Publisher'].replace(publisher_dict)

# Determine the five most common journals and the total articles for each. 

In [532]:
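# value_counts() sorts in descending order, so the first five entries are the most frequent.
# Note that this counts the cleaned Publisher column rather than journal titles.
top_5_journals = df['Publisher'].value_counts()[:5]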
print(top_5_journals)

Elsevier                             408
PLOS                                 306
Wiley-Blackwell                      270
Oxford University Press              167
Springer Science + Business Media     94
Name: Publisher, dtype: int64
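
The prompt asks about journals, while the cell above counts the Publisher column. Counting journal titles would need its own cleanup first. Here is a minimal sketch on invented titles, assuming that trimming whitespace and upper-casing is enough to merge variants such as `PLoS One` and `PLOS ONE`; real titles would likely need more than that.

```python
from collections import Counter

# Invented journal titles with the kind of case and spacing variation seen in the data.
titles = ['PLoS One', 'PLOS ONE', 'PLoS ONE ', 'J Med Chem', 'J Med Chem', 'Biomacromolecules']

# Normalize case and whitespace before counting.
counts = Counter(title.strip().upper() for title in titles)
top_two = counts.most_common(2)
print(top_two)
```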


# Next, calculate the mean, median, and standard deviation of the open-access cost per article for each journal. 

In [533]:
def journal_stats(df, publisher):
    # Despite the function name, this filters on the cleaned Publisher column.
    stats_for = df.loc[df['Publisher'] == publisher, 'Cost']
    these_stats = stats_for.describe()
    print("Summary statistics for {}:".format(publisher))
    print("Mean cost: {:.2f}".format(these_stats['mean']))
    print("Median cost: {:.2f}".format(these_stats['50%']))
    print("Standard deviation: {:.4f}\n".format(these_stats['std']))

In [534]:
for journal in top_5_journals.index:
    journal_stats(df,journal)

Summary statistics for Elsevier:
Mean cost: 2435.65
Median cost: 2344.32
Standard deviation: 794.3210

Summary statistics for PLOS:
Mean cost: 1124.04
Median cost: 1014.38
Standard deviation: 403.1324

Summary statistics for Wiley-Blackwell:
Mean cost: 2010.79
Median cost: 2006.64
Standard deviation: 372.6783

Summary statistics for Oxford University Press:
Mean cost: 1844.43
Median cost: 2040.00
Standard deviation: 512.1148

Summary statistics for Springer Science + Business Media:
Mean cost: 2023.87
Median cost: 1968.11
Standard deviation: 271.5649
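
The same three statistics can be produced for every group in one call with `groupby` and `agg`. This sketch uses a toy frame with made-up publishers and costs rather than the real data.

```python
import pandas as pd

# Made-up costs for two hypothetical publishers.
toy = pd.DataFrame({
    'Publisher': ['A', 'A', 'A', 'B', 'B'],
    'Cost': [100.0, 200.0, 300.0, 10.0, 30.0],
})

# One row per publisher, one column per statistic; std is the sample standard deviation.
summary = toy.groupby('Publisher')['Cost'].agg(['mean', 'median', 'std'])
print(summary)
```

On the real frame, filtering to the top five names first (for example with `isin`) would give a table equivalent to the printed summaries above.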



These numbers look reasonable. Before I noticed and removed the values between 15000.00 and 999999.00, the standard deviations for a couple of the publishers were noticeably larger than the others.
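
To see how much a single sentinel value can distort the spread, here is a small illustration with invented costs. Only the 999999.0 figure mirrors the bogus value in the data.

```python
import statistics

# Three invented, plausible costs.
clean = [1000.0, 2000.0, 3000.0]
# The same costs plus one sentinel like the ones removed earlier.
with_sentinel = clean + [999999.0]

sd_clean = statistics.stdev(clean)
sd_dirty = statistics.stdev(with_sentinel)
print(sd_clean, sd_dirty)
```

One bad row is enough to push the sample standard deviation up by orders of magnitude, which matches what I saw before adding the cutoff.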

## For a real bonus round, identify the open access prices paid by subject area.

We can talk about this. I thought maybe there would be an easy way to divide these articles based on publication or 