### Checkpoint 12_6: challenge
Determine the five most common journals and the total articles for each. 
Next, calculate the mean, median, and standard deviation of the open-access cost per article for each journal. 
Bonus round: identify the open-access prices paid by subject area.
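As a rough sketch of the end goal, assuming the cleaned frame ends up with the `Journal_title` and `Cost` columns used later in this notebook, the counts and per-journal statistics could be computed along these lines. The toy rows are made up for illustration and are not taken from the Wellcome file.

```python
import pandas as pd

# Made-up rows standing in for the cleaned data (not from the Wellcome file)
toy = pd.DataFrame({
    'Journal_title': ['PLOS ONE', 'PLOS ONE', 'PLOS ONE', 'TRIALS', 'TRIALS', 'BMJ'],
    'Cost': [1000.0, 1200.0, 1400.0, 1500.0, 1700.0, 2000.0],
})

# Five most common journals with their article counts
top5 = toy['Journal_title'].value_counts().head(5)

# Mean, median and standard deviation of cost, restricted to those journals
stats = (toy[toy['Journal_title'].isin(top5.index)]
         .groupby('Journal_title')['Cost']
         .agg(['mean', 'median', 'std']))

print(top5)
print(stats)
```

A journal with a single article gets a standard deviation of NaN, since pandas computes the sample standard deviation by default.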

data source: https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/WELLCOME/WELLCOME_APCspend2013_forThinkful.csv

Note: the data file was downloaded to my PC and then saved as 'CSV UTF-8 (Comma delimited) (*.csv)'. This copy of the download was then read into the Jupyter notebook as below; thus, the original file remains unmodified. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv(r'C:\Users\katec\Thinkful\data_collections\WELLCOME_APCspend2013_forThinkful.csv')

In [None]:
df.head()

Noted multiple issues with the 'PMID/PMCID' column: some cells hold both a PMID and a PMCID; some values mix letters and digits, some contain only letters (no digits); some carry extra spaces. These would need to be addressed if this column were used for the challenge. 
Also noted inconsistencies in the 'Publisher' column. 
The columns needed for this challenge are 'Journal title', 'Article title', and 'COST (£) charged to Wellcome (inc VAT when charged)'.
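Should the 'PMID/PMCID' column ever be needed, one possible way to split a cell into its two identifiers is sketched below with the standard `re` module. The helper name `split_ids` is hypothetical, and two assumptions are built in: a PMCID is the text `PMC` followed by digits, and any other standalone run of 6 to 8 digits is a PMID. A bare number with no prefix is therefore reported as a PMID even if it was really meant as a PMCID.

```python
import re

def split_ids(raw):
    """Return (pmid, pmcid) parsed from one messy cell; None where absent."""
    if not isinstance(raw, str):
        return None, None  # NaN / missing cell
    pmcid = re.search(r'PMC\s*(\d+)', raw, flags=re.IGNORECASE)
    # Remove the PMCID first so its digits are not mistaken for a PMID
    rest = re.sub(r'PMC\s*\d+', ' ', raw, flags=re.IGNORECASE)
    pmid = re.search(r'\b(\d{6,8})\b', rest)  # assumed PMID length: 6-8 digits
    return (pmid.group(1) if pmid else None,
            'PMC' + pmcid.group(1) if pmcid else None)

# Cell values as they appear in this notebook's output
for cell in ['23043264 PMC3506128', 'PMCID: PMC3464430',
             'PMID:23291342 PMC3581773', float('nan')]:
    print(repr(cell), '->', split_ids(cell))
```

With pandas, the helper could be applied through `df['PMID/PMCID'].apply(split_ids)`.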

Issues for data cleaning from initial review:
1. Find NaN values; evaluate whether they can be filled in using existing data or should be dropped.

2. Change name of 'COST...' column to a shorter, more manageable name.

3. Remove £ (Alt156) from 'Cost' column. 

4. Noted journal names may appear several different ways: 'PLOS ONE', 'Plos One', 'PLoS ONE', 'PLos One', etc.; 'J...' vs. 'Journal...', etc. Will need to make names uniform. 

5. Check min/max of cost, noted examples of unlikely values. 

6. Convert 'Cost' column from object (string) dtype to numeric. 
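For item 4, a minimal sketch of one way to make journal names uniform is shown below, using the spellings noted above plus 'BMC Genomics.' from the data. It only handles casing, stray spaces and a trailing full stop; abbreviations such as 'J...' versus 'Journal...' would still need a hand-built mapping, which is not attempted here.

```python
import pandas as pd

titles = pd.Series(['PLOS ONE', 'Plos One', 'PLoS ONE ', 'PLos One',
                    'BMC Genomics.', None])

cleaned = (titles
           .str.strip()                            # drop leading/trailing spaces
           .str.upper()                            # one casing for every variant
           .str.replace(r'\s+', ' ', regex=True)   # collapse repeated spaces
           .str.rstrip('.'))                       # drop a trailing full stop

print(cleaned.value_counts())
```

Missing titles stay missing, so they can still be counted or dropped separately.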


In [None]:
df.info()

In [3]:
list(df.columns)

['PMID/PMCID',
 'Publisher',
 'Journal title',
 'Article title',
 'COST (£) charged to Wellcome (inc VAT when charged)']

##### change column name to a shorter, more manageable name

In [4]:
#must remember 'inplace'; the key must match the column name exactly, including the closing ')'
df.rename({'COST (£) charged to Wellcome (inc VAT when charged)': 'Cost'}, axis=1, inplace=True)

In [None]:
#also works
#df.rename(columns={'COST (£) charged to Wellcome (inc VAT when charged)':'Cost'}, inplace=True)

In [None]:
df.head()

In [5]:
#use this to change column names; fix all names at one time
df.columns = ['PMID/PMCID',
 'Publisher',
 'Journal_title',
 'Article_title',
 'Cost']

In [None]:
df.info()

##### remove pound sterling symbol and convert 'Cost' to numeric

In [6]:
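#need to use .str to replace!
#e.g. emails.str.replace('.com', '')
#regex=False treats '£' and '$' as literal characters ('$' is an end-of-string anchor in a regex)
df['Cost'] = df['Cost'].str.replace('£', '', regex=False).str.replace('$', '', regex=False)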

In [None]:
df.head()

In [None]:
#the original attempt returned an EOL error because of a stray quote after '£'; corrected form:
#df['Cost'] = df['Cost'].replace({'£': ''}, regex=True)

In [None]:
#this method changes to float, but removes all the values:
#strings that still contain '£' cannot be parsed, so errors='coerce' turns them into NaN
#for column in ['Cost']:
#    df[column] = pd.to_numeric(df[column], errors = 'coerce', downcast = 'float')

In [None]:
#df['Cost'] = df['Cost'].apply(lambda x: x.replace('£','')).apply(lambda x: x.replace('$','')).astype(float)

In [7]:
#use this method to change 'Cost' from object (string) dtype to float
df['Cost'] = df['Cost'].astype(float)
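The attempts above can be condensed into one pattern: strip the currency symbols as literal text, then let `pd.to_numeric` do the conversion and flag whatever still fails. The sketch below uses made-up strings, not values from the file.

```python
import pandas as pd

raw = pd.Series(['£2381.04', '£642.56', '$669.64', 'n/a'])  # made-up strings

# regex=False so that '$' is a literal dollar sign, not a regex anchor
stripped = (raw.str.replace('£', '', regex=False)
               .str.replace('$', '', regex=False))

# errors='coerce' turns anything unparseable into NaN instead of raising
cost = pd.to_numeric(stripped, errors='coerce')

print(cost)
print(raw[cost.isna()])  # original strings that still failed to parse
```

Stripping '$' this way treats a dollar amount as if it were pounds; whether that is acceptable is a judgment call for the analysis.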

In [8]:
df.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal_title,Article_title,Cost
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,685.88


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2127 entries, 0 to 2126
Data columns (total 5 columns):
PMID/PMCID       1928 non-null object
Publisher        2127 non-null object
Journal_title    2126 non-null object
Article_title    2127 non-null object
Cost             2127 non-null float64
dtypes: float64(1), object(4)
memory usage: 83.2+ KB


In [10]:
df.describe()

Unnamed: 0,Cost
count,2127.0
mean,24067.339972
std,146860.665559
min,0.0
25%,1280.0
50%,1884.01
75%,2321.305
max,999999.0


In [None]:
#boxplot


A mean of £24,067 is implausible for "The cost...which the institution is claiming from the Wellcome Trust grant (to cover the OA publishing fee)." 
The max cost of £999,999 is clearly invalid (likely a placeholder value). Find these values and impute them where possible: values above £9,999.99 will be replaced with the mean of the remaining 'Cost' values.
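A minimal sketch of that imputation on made-up numbers, assuming the same £9,999.99 cut-off: build a boolean mask for the implausible rows, take the mean of the rest, and assign it back through `.loc`.

```python
import pandas as pd

toy = pd.DataFrame({'Cost': [1000.0, 2000.0, 3000.0, 999999.0]})  # made-up values

invalid = toy['Cost'] > 9999.99                 # boolean mask of implausible costs
valid_mean = toy.loc[~invalid, 'Cost'].mean()   # mean of the remaining rows
toy.loc[invalid, 'Cost'] = valid_mean           # overwrite only the flagged rows

print(toy)
```

Assigning through `.loc` with the mask changes only the flagged rows and leaves every other value untouched.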

In [11]:
df_cost = df.loc[df['Cost'] > 9999.99]
print(df_cost)

                                            PMID/PMCID  \
149                                         PMC3234811   
227                                            3708772   
277                                        PMC3668259    
358                                         PMC3219211   
404                                         PMC3533396   
410                                                NaN   
491                                  PMCID: PMC3464430   
560                                         PMC3632754   
630    Epub ahead of print April 2013 - print in press   
660                           PMID:23291342 PMC3581773   
669                                         PMC3594749   
670                                  PMCID: PMC3679449   
811                                                NaN   
815                                                NaN   
825                                                NaN   
829                              23200744  PMC3552157    
873           

In [12]:
df_cost.shape

(50, 5)

In [13]:
df_cost.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 149 to 1987
Data columns (total 5 columns):
PMID/PMCID       42 non-null object
Publisher        50 non-null object
Journal_title    49 non-null object
Article_title    50 non-null object
Cost             50 non-null float64
dtypes: float64(1), object(4)
memory usage: 2.3+ KB


In [14]:
#find mean of 'Cost' without invalid values above
#select only the 'Cost' column so the text columns are not passed to mean()
df_truecost = df.loc[df['Cost'] < 9999.99, ['Cost']].mean()
print(df_truecost)

Cost    1822.055908
dtype: float64


In [15]:
#find & replace invalid values (boolean mask, same cut-off as above)
invalid_cost = df['Cost'] > 9999.99
type(invalid_cost)

pandas.core.series.Series

In [18]:
print(invalid_cost)

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2097    False
2098    False
2099    False
2100    False
2101    False
2102    False
2103    False
2104    False
2105    False
2106    False
2107    False
2108    False
2109    False
2110    False
2111    False
2112    False
2113    False
2114    False
2115    False
2116    False
2117    False
2118    False
2119    False
2120    False
2121    False
2122    False
2123    False
2124    False
2125    False
2126    False
Name: Cost, Length: 2127, dtype: bool


In [16]:
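#mistake: replace(..., inplace=True) returns None, so assigning the result back
#overwrote every 'Cost' value with None; the outputs recorded in the next two
#cells come from that failed attempt. replace() also matches values, not a boolean mask.
#df['Cost'] = df['Cost'].replace([invalid_cost], 1822.06, inplace=True)
#use the boolean mask with .loc instead
df.loc[invalid_cost, 'Cost'] = 1822.06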

In [17]:
print(df.loc[[149, 227, 277]])

      PMID/PMCID        Publisher                    Journal_title  \
149   PMC3234811            ASBMB  Journal of Biological Chemistry   
227      3708772  BioMed Central                     BMC Genomics.   
277  PMC3668259               BMC                           Trials   

                                         Article_title  Cost  
149  Picomolar nitric oxide signals from central ne...  None  
227  Phenotypic, genomic, and transcriptional chara...  None  
277  Community resource centres to improve the heal...  None  


In [20]:
df.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal_title,Article_title,Cost
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,


In [None]:
df['Cost'].value_counts()

In [None]:
df.info()

In [None]:
#column was renamed to 'Journal_title' above
df['Journal_title'].unique()

In [None]:
df['Journal_title'].value_counts()

In [None]:
#count missing values per column
df.isnull().sum()

In [None]:
re.findall

In [None]:
def measure_id (row):
   if row['Measure Name'] == 'READM_30_COPD_HRRP' :
      return 'COPD'
   if row['Measure Name'] == 'READM_30_HF_HRRP' :
      return 'HF'
   if row['Measure Name'] == 'READM_30_PN_HRRP' :
      return 'PN'
   if row['Measure Name'] == 'READM_30_AMI_HRRP' :
      return 'AMI'
   if row['Measure Name'] == 'READM_30_HIP_KNEE_HRRP' :
      return 'HIP_KNEE'
   if row['Measure Name'] == 'READM_30_CABG_HRRP' :
      return 'CABG'

In [None]:
df.apply (lambda row: measure_id(row), axis=1)

In [None]:
df['measure'] = df.apply (lambda row: measure_id(row), axis=1)

In [None]:
df_with_nans = df.applymap(lambda elem: float('NaN') if elem == "Not Available" else elem)

In [None]:
df_readmin = df_with_nans.dropna(subset=df_with_nans.columns.drop('Footnote')).copy()

In [None]:
for column in ['Number of Discharges', 'Excess Readmission Ratio', 'Predicted Readmission Rate', 'Expected Readmission Rate', 'Number of Readmissions']:
    df_readmin[column] = pd.to_numeric(df_readmin[column], errors = 'coerce', downcast = 'float')

In [None]:
df_readmin.groupby('Measure Name')['Excess Readmission Ratio'].describe()