### Checkpoint 12_6: challenge
Determine the five most common journals and the total articles for each. 
Next, calculate the mean, median, and standard deviation of the open-access cost per article for each journal. 
bonus round, identify the open access prices paid by subject area

data source: https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/WELLCOME/WELLCOME_APCspend2013_forThinkful.csv

Note: datafile was downloaded to my PC and then 'saved as: CSV/UTF-8(Comma deliminated)(* .csv)'. This 'copy' of the download was then read into Jupyter notebook as below; thus, original file remains unmodified. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv(r'C:\Users\katec\Thinkful\data_collections\WELLCOME_APCspend2013_forThinkful.csv')

In [None]:
df.head()

Noted multiple issues with 'PMID/PMCID' column, including: value for both 'PMID' and 'PMCID' listed in same cell; some values contain both alpha and numeric, some contain only alpha (no numeric), additional spaces. This would need to be addressed if this column was to be used for challenge. 
Also noted issues with the 'Publisher' column. 
The columns needed for this challenge are 'Journal title', 'Article title', and 'COST (£) charged to Wellcome (inc VAT when charged)'.

Issues for data cleaning from initial review:
1. Find Nan, evaluate whether or not they can be filled in using existing data or should be dropped.

2. Change name of 'Cost...' to shorter, more managable name.

3. Remove £ (Alt156) from 'Cost' column. 

4. Noted journal name may appear several differen ways: 'PLOS ONE', 'Plos One', 'PLoS ONE', 'PLos One', etc. 'J...', 
'Journal...', etc. Will need to make names uniform. 

5. Check min/max of cost, noted examples of unlikely values. 

6. Convert 'Cost' column to numeric from null object. 


In [None]:
df.info()

In [None]:
list(df.columns)

##### change column name to a shorter, more managable name
Also change names of other columns to remove spaces between words

In [None]:
#must remember 'inplace'
#df.rename({'COST (£) charged to Wellcome (inc VAT when charged': 'Cost'}, inplace=True, axis = 1)

In [None]:
#also works
#df.rename(columns={'COST (£) charged to Wellcome (inc VAT when charged)':'Cost'}, inplace=True)

In [None]:
df.head()

In [3]:
#use this to change column names; fix all names at one time
df.columns = ['PMID/PMCID',
 'Publisher',
 'Journal_title',
 'Article_title',
 'Cost']

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2127 entries, 0 to 2126
Data columns (total 5 columns):
PMID/PMCID       1928 non-null object
Publisher        2127 non-null object
Journal_title    2126 non-null object
Article_title    2127 non-null object
Cost             2127 non-null object
dtypes: object(5)
memory usage: 83.2+ KB


##### remove pound sterling symbol 

In [5]:
#need to use .str to replace!
#emails.str.replace('.com', '')
                   
df['Cost'] = df['Cost'].str.replace('£', '').str.replace('$','')

In [None]:
df.head()

In [None]:
#returns EOL error
#df['Cost'] = df['Cost'].replace({'£'': ''}, regex=True)

In [None]:
#this method changes to float, but removes all the values
#for column in ['Cost']:
#    df[column] = pd.to_numeric(df[column], errors = 'coerce', downcast = 'float')

In [None]:
#df['Cost'] = df['Cost'].apply(lambda x: x.replace('£','')).apply(lambda x: x.replace('$','')).astype(float)

##### convert 'Cost' column to numeric

In [7]:
#use this method to change null object to float
df['Cost'] = df['Cost'].apply(lambda x: float(x))

In [8]:
df.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal_title,Article_title,Cost
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,685.88


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2127 entries, 0 to 2126
Data columns (total 5 columns):
PMID/PMCID       1928 non-null object
Publisher        2127 non-null object
Journal_title    2126 non-null object
Article_title    2127 non-null object
Cost             2127 non-null float64
dtypes: float64(1), object(4)
memory usage: 83.2+ KB


In [10]:
df.describe()

Unnamed: 0,Cost
count,2127.0
mean,24067.339972
std,146860.665559
min,0.0
25%,1280.0
50%,1884.01
75%,2321.305
max,999999.0


In [None]:
#boxplot


##### change invalid  values in 'Cost' column to mean of 'Cost'

A mean of £24,067 seems unlikely (and unreasonable) for "The cost...which the institution is claiming from the Wellcome Trust grant (to cover the OA publishing fee)." 
Noted a max cost of £999,999, which is clearly invalid. Search out & impute these values as possible. Will replace with the mean of 'Cost' values without values > £9,999.99.

In [11]:
df_cost = df.loc[df['Cost'] > 9999.99]
print(df_cost)

                                            PMID/PMCID  \
149                                         PMC3234811   
227                                            3708772   
277                                        PMC3668259    
358                                         PMC3219211   
404                                         PMC3533396   
410                                                NaN   
491                                  PMCID: PMC3464430   
560                                         PMC3632754   
630    Epub ahead of print April 2013 - print in press   
660                           PMID:23291342 PMC3581773   
669                                         PMC3594749   
670                                  PMCID: PMC3679449   
811                                                NaN   
815                                                NaN   
825                                                NaN   
829                              23200744  PMC3552157    
873           

In [12]:
df_cost.shape

(50, 5)

In [13]:
df_cost.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 149 to 1987
Data columns (total 5 columns):
PMID/PMCID       42 non-null object
Publisher        50 non-null object
Journal_title    49 non-null object
Article_title    50 non-null object
Cost             50 non-null float64
dtypes: float64(1), object(4)
memory usage: 2.3+ KB


In [None]:
#find mean of 'Cost' without invalid values above
df_truecost = df.loc[df['Cost'] < 9999.99].mean()
print(df_truecost)

In [None]:
#find & replace invalid values
#invalid_cost = df['Cost'] > 9999
#type(invalid_cost)

In [None]:
#df['Cost'] = df['Cost'].replace([invalid_cost], 1822.06, inplace=True)

In [14]:
df['Cost'].values[df['Cost'].values > 9999] = 1822.06

In [15]:
print(df.loc[[149, 227, 277]])

      PMID/PMCID        Publisher                    Journal_title  \
149   PMC3234811            ASBMB  Journal of Biological Chemistry   
227      3708772  BioMed Central                     BMC Genomics.   
277  PMC3668259               BMC                           Trials   

                                         Article_title     Cost  
149  Picomolar nitric oxide signals from central ne...  1822.06  
227  Phenotypic, genomic, and transcriptional chara...  1822.06  
277  Community resource centres to improve the heal...  1822.06  


In [None]:
df.head()

In [16]:
df.describe()

Unnamed: 0,Cost
count,2127.0
mean,1822.056004
std,758.617343
min,0.0
25%,1280.0
50%,1834.77
75%,2293.465
max,6000.0


##### make 'Journal_title's uniform
using pattern searches

In [17]:
df['Journal_title'].value_counts()

PLoS One                                                            92
PLoS ONE                                                            62
Journal of Biological Chemistry                                     48
Nucleic Acids Research                                              21
Proceedings of the National Academy of Sciences                     19
PLoS Neglected Tropical Diseases                                    18
Human Molecular Genetics                                            18
Nature Communications                                               17
Neuroimage                                                          15
PLoS Pathogens                                                      15
PLoS Genetics                                                       15
PLOS ONE                                                            14
NeuroImage                                                          14
Brain                                                               14
BMC Pu

In [18]:
 df.apply(lambda x: sum(x.isnull()),axis=0) 

PMID/PMCID       199
Publisher          0
Journal_title      1
Article_title      0
Cost               0
dtype: int64

In [19]:
print(df.dropna(subset=['Journal_title'])) 

                            PMID/PMCID  \
0                                  NaN   
1                           PMC3679557   
2                23043264  PMC3506128    
3                  23438330 PMC3646402   
4                 23438216 PMC3601604    
5                           PMC3579457   
6                           PMC3709265   
7                 23057412 PMC3495574    
8                    PMCID: PMC3780468   
9                    PMCID: PMC3621575   
10                   PMCID: PMC3739413   
11                   PMCID: PMC3530961   
12                   PMCID: PMC3624797   
13                          PMC3413243   
14                          PMC3694353   
15                          PMC3572711   
16                            22610094   
17                   PMCID: PMC3586974   
18        23455506  PMCID: PMC3607399    
19          PMID: 24015914 PMC3833349    
20                       : PMC3805332    
21                                 NaN   
22            PMCID:\n    PMC36567

In [20]:
df.loc[df['Journal_title'].str.contains('Plos one', case=False, na=False), 'Journal_title'] = 'Plos One'

### everything above this point works

In [22]:
df['Journal_title'].value_counts()

Plos One                                                            190
Journal of Biological Chemistry                                      48
Nucleic Acids Research                                               21
Proceedings of the National Academy of Sciences                      19
Human Molecular Genetics                                             18
PLoS Neglected Tropical Diseases                                     18
Nature Communications                                                17
PLoS Pathogens                                                       15
Neuroimage                                                           15
PLoS Genetics                                                        15
Brain                                                                14
NeuroImage                                                           14
BMC Public Health                                                    14
Movement Disorders                                              

#### Determine the five most common journals and the total articles for each. 

In [28]:
df['Journal_counts']= df['Journal_title'].value_counts()

In [29]:
df.head(20)

Unnamed: 0,PMID/PMCID,Publisher,Journal_title,Article_title,Cost,Journal_counts
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,0.0,
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,2381.04,
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",642.56,
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,669.64,
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,685.88,
5,PMC3579457,ACS,Journal of Medicinal Chemistry,Comparative Structural and Functional Studies ...,2392.2,
6,PMC3709265,ACS,Journal of Proteome Research,Mapping Proteolytic Processing in the Secretom...,2367.95,
7,23057412 PMC3495574,ACS,Mol Pharm,Quantitative silencing of EGFP reporter gene b...,649.33,
8,PMCID: PMC3780468,ACS (Amercian Chemical Society) Publications,ACS Chemical Biology,A Novel Allosteric Inhibitor of the Uridine Di...,1294.59,
9,PMCID: PMC3621575,ACS (Amercian Chemical Society) Publications,ACS Chemical Biology,Chemical proteomic analysis reveals the drugab...,1294.78,


In [34]:
df.groupby('Journal_title').count()


Unnamed: 0_level_0,PMID/PMCID,Publisher,Article_title,Cost,Journal_counts
Journal_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
ACS Chemical Biology,4,5,5,5,0
ACS Chemical Neuroscience,1,1,1,1,0
ACS NANO,1,1,1,1,0
ACS Nano,1,1,1,1,0
ACTA F,1,1,1,1,0
AGE,1,1,1,1,0
AIDS,3,3,3,3,0
AIDS Behav,1,1,1,1,0
AIDS Care,2,2,2,2,0
AIDS Journal,1,1,1,1,0


In [27]:
print(df.sort_values(['Journal_counts'], ascending=False)
             .reset_index)

<bound method DataFrame.reset_index of                              PMID/PMCID  \
1440                         PMC3646743   
1512                         PMC3572168   
1506                      PMID:23300533   
1507                         PMC3519837   
1508           PMCID:\n    PMC3465279\n   
1509                         PMC3564847   
1510                           20975956   
1511                                NaN   
1513                         PMC3519537   
1504                         PMC3769242   
1514                         PMC3548842   
1515                24124519/PMC3790821   
1516                         PMC3485137   
1517                         PMC3501466   
1518                         PMC3520920   
1519                     PMC3691227\n\n   
1505                         PMC3823974   
1503                       PMC3547960\n   
1486                            3577721   
1494                        PMC3749184    
1488                            3646760   
1489           

In [None]:
df.head()

In [25]:
df.groupby(df['Journal_title', 'Journal_count'] > 17)['Article_title'].nunique()

In [26]:
print(df_articlecounts)

Journal_counts
False    1813
True      313
Name: Article_title, dtype: int64


In [37]:
df.groupby('Journal_title')['Cost'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
Journal_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
ACS Chemical Biology,5.0,1418.186000,507.309560,947.07,1267.7600,1294.590,1294.7800,2286.73
ACS Chemical Neuroscience,1.0,1186.800000,,1186.80,1186.8000,1186.800,1186.8000,1186.80
ACS NANO,1.0,642.890000,,642.89,642.8900,642.890,642.8900,642.89
ACS Nano,1.0,693.390000,,693.39,693.3900,693.390,693.3900,693.39
ACTA F,1.0,754.900000,,754.90,754.9000,754.900,754.9000,754.90
AGE,1.0,2002.000000,,2002.00,2002.0000,2002.000,2002.0000,2002.00
AIDS,3.0,2059.306667,281.067979,1834.77,1901.7000,1968.630,2171.5750,2374.52
AIDS Behav,1.0,1834.770000,,1834.77,1834.7700,1834.770,1834.7700,1834.77
AIDS Care,2.0,2189.170000,61.617285,2145.60,2167.3850,2189.170,2210.9550,2232.74
AIDS Journal,1.0,2015.720000,,2015.72,2015.7200,2015.720,2015.7200,2015.72


In [36]:
type(df.groupby('Journal_title')['Cost'].mean())

pandas.core.series.Series

In [None]:
df.info()

In [None]:
df['Journal title'].unique()

In [None]:
 df.apply(lambda x: sum(x.isnull()),axis=0) 

In [None]:
re.findall

In [None]:
def measure_id (row):
   if row['Measure Name'] == 'READM_30_COPD_HRRP' :
      return 'COPD'
   if row['Measure Name'] == 'READM_30_HF_HRRP' :
      return 'HF'
   if row['Measure Name'] == 'READM_30_PN_HRRP' :
      return 'PN'
   if row['Measure Name'] == 'READM_30_AMI_HRRP' :
      return 'AMI'
   if row['Measure Name'] == 'READM_30_HIP_KNEE_HRRP' :
      return 'HIP_KNEE'
   if row['Measure Name'] == 'READM_30_CABG_HRRP' :
      return 'CABG'

In [None]:
df.apply (lambda row: measure_id(row), axis=1)

In [None]:
df['measure'] = df.apply (lambda row: measure_id(row), axis=1)

In [None]:
df_with_nans = df.applymap(lambda elem: float('NaN') if elem == "Not Available" else elem)

In [None]:
df_readmin = df_with_nans.dropna(subset=df_with_nans.columns.drop('Footnote')).copy()

In [None]:
for column in ['Number of Discharges', 'Excess Readmission Ratio', 'Predicted Readmission Rate', 'Expected Readmission Rate', 'Number of Readmissions']:
    df_readmin[column] = pd.to_numeric(df_readmin[column], errors = 'coerce', downcast = 'float')

In [None]:
df_readmin.groupby('Measure Name')['Excess Readmission Ratio'].describe()