# Purpose
The goal of this notebook is to make a master CSV that joins the staffing, wait-time, hiring, leadership, and turnover data on a common station ID.
# Unique ID
Unfortunately, many stations share the same name, which has forced me to ignore some in my analysis. This is meant to fix that. 

In [467]:
import pandas as pd
import re
import numpy as np

I'm going to use the onboard-by-station CSV, since it seems to be a reliable source for stations (other CSVs, like the number of Choice hires, seem to leave some stations out).

In [468]:
staff = pd.read_csv("Staff_Size/Onboard_By_statyion_By_FY.csv")
staff = staff.drop(0).drop(1) # dropping first and second lines because these don't refer to actual stations.
staff.sample(10)

Unnamed: 0,ORGANIZATION,MAY-FY11,MAY-FY12,MAY-FY13,MAY-FY14,MAY-FY15,MAY-FY16
88,(V12) (676) MC TOMAH WI,1016,1026.0,1066,1095,1085,1085
134,(V21) (459) HCS HONOLULU HI,749,757.0,846,871,936,1028
44,(V07) (508) MC ATLANTA GA,2983,3121.0,3382,3712,4028,4487
53,(V08) (516) MC BAY PINES FL,3360,3375.0,3561,3782,4032,4110
25,(V04) (562) MC ERIE PA,678,669.0,660,689,713,693
84,(V12) (556) FHCC NORTH CHICAGO IL,1956,2049.0,2120,2125,2166,2115
69,(V10) (539) MC CINCINNATI OH,2043,2097.0,2147,2172,2168,2225
67,(V10) (487) V10HSCINCINNATI OH,52,46.0,54,53,51,47
95,(V16) (520) MC BILOXI MS,1817,1924.0,2125,2172,2289,2235
36,(V06) (558) MC DURHAM NC,2497,2574.0,2911,3240,3457,3458


In [469]:
# Taken from my analysis notebook, with slight modifications: pulls the station code out of a label.
def get_station(new_string):
    try:
        query = r"^\((V\d\d)\) \((.*)\)"
        m = re.search(query, new_string)
        return m.group(2).upper()
    except (AttributeError, TypeError):
        # no match, or the value is not a string (e.g. NaN)
        return None
        
staff["Station"] = staff["ORGANIZATION"].apply(get_station)
#staff

In [470]:
staff["Station"].value_counts()



516    1
658    1
358    1
679    1
757    1
632    1
631    1
630    1
659    1
503    1
517    1
502    1
549    1
548    1
506    1
509    1
508    1
402    1
637    1
674    1
561    1
676    1
531    1
578    1
534    1
605    1
537    1
544    1
463    1
540    1
      ..
598    1
640    1
596    1
603    1
558    1
595    1
593    1
607    1
590    1
740    1
664    1
438    1
585    1
644    1
618    1
586    1
529    1
562    1
612    1
523    1
436    1
521    1
520    1
528    1
526    1
504    1
575    1
799    1
496    1
623    1
Name: Station, dtype: int64

Okay, that's strange. There are no duplicates in this list. Could these all be unique IDs? If so, I may not need to generate any new IDs and can use these instead.
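
A quicker way to confirm this than scanning the `value_counts()` output is to ask pandas directly. The sketch below runs on a small made-up frame (the `toy` rows are illustrative, not read from the CSV) and shows two checks: `is_unique` for a yes/no answer, and `duplicated(keep=False)` to list any rows that share a code.

```python
import pandas as pd

# Toy stand-in for the staff table; these rows are illustrative only.
toy = pd.DataFrame({
    "ORGANIZATION": ["(V01) (402) HCS TOGUS ME",
                     "(V01) (405) MROC WHT RIVER JCT VT",
                     "(V01) (518) MC BEDFORD MA"],
    "Station": ["402", "405", "518"],
})

# True when no station code appears more than once.
all_unique = toy["Station"].is_unique

# Rows whose station code is shared with at least one other row.
shared = toy[toy["Station"].duplicated(keep=False)]

print(all_unique)
print(len(shared))
```

If `shared` comes back empty, the station codes can serve as keys for that table as they are.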

### Bringing in Pending 2014 Wait Times

In [471]:
# A function to load a wait-time CSV and label its columns by type and period
def make_df(Type,Time,File):
    Title = "{}_{}".format(Type,Time)
    csv = pd.read_csv(File,
                               usecols=[0,1,5,24,25,26],
                               skiprows=[0], #skipping the original header
                               names=["Location","Appts_{}".format(Title),
                                      "%_Appts_Over_30_{}".format(Title),
                                      "PC_Wait_{}".format(Title),
                                      "SC_Wait_{}".format(Title),"MH_{}".format(Title)])
    csv["Station"] = csv["Location"].apply(get_station)
    return csv

In [472]:
Pending_1412 = make_df("Pending","1412","Wait_Time/Pending_Clean/14_12_Wait.csv")
Pending_1412.head()

Unnamed: 0,Location,Appts_Pending_1412,%_Appts_Over_30_Pending_1412,PC_Wait_Pending_1412,SC_Wait_Pending_1412,MH_Pending_1412,Station
0,"(V01) (402) Togus, ME",42899,2.70%,2.51,4.07,2.77,402
1,"(V01) (405) White River Junction, VT",23709,2.16%,3.92,3.95,0.79,405
2,"(V01) (518) Bedford, MA",8105,5.17%,1.12,10.53,4.41,518
3,"(V01) (523) VA Boston HCS, MA",85625,1.71%,1.66,3.53,3.92,523
4,"(V01) (608) Manchester, NH",19511,2.64%,2.66,4.21,5.13,608
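
Note that the `%_Appts_Over_30_...` column in this file is read as text (for example `2.70%`), so it cannot be used in arithmetic as it stands. A minimal sketch of one way to convert such values to fractions, assuming every entry ends in a percent sign (the `pct` series below is made up for illustration):

```python
import pandas as pd

# Illustrative values in the same "2.70%" text format.
pct = pd.Series(["2.70%", "2.16%", "5.17%"])

# Strip the trailing percent sign, parse as float, and scale to a fraction.
as_fraction = pct.str.rstrip("%").astype(float) / 100

print(as_fraction.tolist())
```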


### Trying To Consolidate Staff And Pending

In [473]:
#Grabbing all stations in pending that have duplicates
multiple_pending_series = Pending_1412["Station"].value_counts()
multiple_pending_dataframe = multiple_pending_series.to_frame()
multiple_pending_list = multiple_pending_dataframe[multiple_pending_dataframe["Station"] > 1].index.tolist()

pending_unique = pd.DataFrame()
for item in multiple_pending_list:
    new = Pending_1412[Pending_1412["Station"]==item]
    pending_unique = pending_unique.append(new)
    
    
#grabbing all in staff that aren't in pending...  
in_staff_not_pending = [x for x in staff["Station"].tolist() if x not in Pending_1412["Station"].tolist()]
staff_unique = pd.DataFrame()
for item in in_staff_not_pending:
    new = staff[staff["Station"]==item]
    staff_unique = staff_unique.append(new)
    
# ...and adding the staff rows for stations that appear more than once in pending
for item in multiple_pending_list:
    new = staff[staff["Station"]==item]
    staff_unique = staff_unique.append(new)



#### Cleaning
I've identified the areas that need to be cleaned and the stations that match. At this point I'm going to generate unique IDs, then bring in other data to join.

Many of the station locations in the hiring CSV don't match those in the wait-time CSV.
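
One simple way to see which stations fail to line up between two sources is to compare their station codes as sets. The codes below are hypothetical and only illustrate the idea:

```python
# Hypothetical station codes, for illustration only.
staff_stations = {"402", "405", "478", "518"}
pending_stations = {"402", "405", "518", "523"}

# Codes present in one source but missing from the other.
only_in_staff = sorted(staff_stations - pending_stations)
only_in_pending = sorted(pending_stations - staff_stations)

# Codes that both sources agree on.
in_both = sorted(staff_stations & pending_stations)

print(only_in_staff)
print(only_in_pending)
print(in_both)
```

With real data, the two sets could be built from each frame's `Station` column, for example with `set(df["Station"].dropna())`.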


In [474]:
# identified these as not stations, but facilities. Dropping from dataframe. 
to_drop = [18,29,50,59,66,77,87,96,105,114,123,144,152]
for index in to_drop:
    Pending_1412 = Pending_1412.drop(index)



### Generating Unique IDs

In [475]:
IDs = []
query = r"^\((V\d\d)\) \((\d\d\d)\) (.{2})"
#m = re.search(query, test)
#m.group(3)



for index, row in Pending_1412.iterrows():
    if row["Station"] not in IDs:
        print row["Location"]
        m = re.search(query, row["Location"])
        loc = m.group(3).lower()
        IDs.append(row["Station"])

    else:
        m = re.search(query, row["Location"])
        loc = m.group(3).lower()
        IDs.append(row["Station"]+loc)
Pending_1412["ID"] = IDs
Pending_1412

(V01) (402) Togus, ME
(V01) (405) White River Junction, VT
(V01) (518) Bedford, MA
(V01) (523) VA Boston HCS, MA
(V01) (608) Manchester, NH
(V01) (631) VA Central Western Massachusetts HCS
(V01) (650) Providence, RI
(V01) (689) VA Connecticut HCS, CT
(V02) (528) Albany, NY
(V03) (526) Bronx, NY
(V03) (561) New Jersey HCS, NJ
(V03) (620) VA Hudson Valley HCS, NY
(V03) (630) New York Harbor HCS, NY
(V03) (632) Northport, NY
(V04) (460) Wilmington, DE
(V04) (503) Altoona, PA
(V04) (529) Butler, PA
(V04) (540) Clarksburg, WV
(V04) (542) Coatesville, PA
(V04) (562) Erie, PA
(V04) (595) Lebanon, PA
(V04) (642) Philadelphia, PA
(V04) (646) Pittsburgh, PA
(V04) (693) Wilkes-Barre, PA
(V05) (512) Baltimore HCS, MD
(V05) (613) Martinsburg, WV
(V05) (688) Washington, DC
(V06) (517) Beckley, WV
(V06) (558) Durham, NC
(V06) (565) Fayetteville, NC
(V06) (590) Hampton, VA
(V06) (637) Asheville, NC
(V06) (652) Richmond, VA
(V06) (658) Salem, VA
(V06) (659) Salisbury, NC
(V07) (508) Atlanta, GA
(V07) 

Unnamed: 0,Location,Appts_Pending_1412,%_Appts_Over_30_Pending_1412,PC_Wait_Pending_1412,SC_Wait_Pending_1412,MH_Pending_1412,Station,ID
0,"(V01) (402) Togus, ME",42899,2.70%,2.51,4.07,2.77,402,402
1,"(V01) (405) White River Junction, VT",23709,2.16%,3.92,3.95,0.79,405,405
2,"(V01) (518) Bedford, MA",8105,5.17%,1.12,10.53,4.41,518,518
3,"(V01) (523) VA Boston HCS, MA",85625,1.71%,1.66,3.53,3.92,523,523
4,"(V01) (608) Manchester, NH",19511,2.64%,2.66,4.21,5.13,608,608
5,(V01) (631) VA Central Western Massachusetts HCS,22602,9.26%,10.54,12.50,4.18,631,631
6,"(V01) (650) Providence, RI",38893,4.12%,7.40,4.59,9.33,650,650
7,"(V01) (689) VA Connecticut HCS, CT",49615,2.18%,2.35,4.15,3.92,689,689
8,"(V02) (528) Albany, NY",31376,3.16%,2.33,3.64,4.29,528,528
9,"(V02) (528) Bath, NY",14816,4.20%,3.35,7.70,4.97,528,528ba


In [476]:
#pd.read_csv("Wait_Time/Completed_Cleaned/Wait_Times_201409.csv")

In [477]:
#manually changing some IDs in staff based on shared location
staff["ID"] = staff["Station"]
# .loc assigns on the frame itself and avoids pandas' chained-assignment warning
staff.loc[11, "ID"] = "528"
staff.loc[92, "ID"] = "657st"
staff.loc[12, "ID"] = "528bu"
staff.head()


Unnamed: 0,ORGANIZATION,MAY-FY11,MAY-FY12,MAY-FY13,MAY-FY14,MAY-FY15,MAY-FY16,Station,ID
2,(V01) (402) HCS TOGUS ME,1293,1272.0,1248,1283,1335,1446,402,402
3,(V01) (405) MROC WHT RIVER JCT VT,852,853.0,881,921,1022,1177,405,405
4,(V01) (478) V1HCSBEDFORD MA,30,36.0,34,53,48,42,478,478
5,(V01) (518) MC BEDFORD MA,1313,1265.0,1245,1251,1297,1352,518,518
6,(V01) (523) HCS BOSTON MA,4022,4011.0,4041,4032,4123,4303,523,523


In [478]:
master = Pending_1412.merge(staff, on="ID", how="left")
master["Station"] = master["Station_x"]
master = master.drop(["Station_x", "Station_y"], axis=1)
print master.columns

Index([u'Location', u'Appts_Pending_1412', u'%_Appts_Over_30_Pending_1412',
       u'PC_Wait_Pending_1412', u'SC_Wait_Pending_1412', u'MH_Pending_1412',
       u'ID', u'ORGANIZATION', u'MAY-FY11', u'MAY-FY12', u'MAY-FY13',
       u'MAY-FY14', u'MAY-FY15', u'MAY-FY16', u'Station'],
      dtype='object')


#### So what was the result of our merge?

In [479]:
def check_len(df):
    # master is expected to keep 141 rows after every merge
    check = len(df) - 141
    if check == 0:
        print "Hurray! We didn't lose any rows."
    elif check < 0:
        print "Uh oh. We lost {} rows!".format(abs(check))
    else:
        print "We somehow...gained {} rows? Well, that's not right.".format(check)
check_len(master)

Hurray! We didn't lose any rows.


Great! Now I'm going to redefine ID so that it better matches with future dataframes.
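
The new ID is the three-digit station code followed by the first two characters of the location name, lowercased. Here is a minimal sketch of that rule as a standalone function; `make_id` is a hypothetical helper used only for illustration, not something defined elsewhere in this notebook.

```python
import re

# Same pattern as the notebook: VISN, three-digit station, then the
# first two characters of the location name.
ID_PATTERN = re.compile(r"^\((V\d\d)\) \((\d\d\d)\) (.{2})")

def make_id(location):
    """Return station + first two characters (lowercased), or None if no match."""
    m = ID_PATTERN.search(location)
    if m is None:
        return None
    return m.group(2) + m.group(3).lower()

print(make_id("(V02) (528) Albany, NY"))
print(make_id("(V02) (528) Bath, NY"))
print(make_id("(V01) (402GA) Aroostook County"))
```

Locations whose code is not exactly three digits do not match the pattern, so the function returns `None` for them rather than raising.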

In [480]:
ID = []
query = r"^\((V\d\d)\) \((\d\d\d)\) (.{2})"

for index, row in master.iterrows():
    m = re.search(query, row["Location"])
    loc = m.group(3).lower()
    ID.append(row["Station"]+loc)

master["ID"] = ID
master.loc[72, "ID"] = "537je" # edge case
master.head()


Unnamed: 0,Location,Appts_Pending_1412,%_Appts_Over_30_Pending_1412,PC_Wait_Pending_1412,SC_Wait_Pending_1412,MH_Pending_1412,ID,ORGANIZATION,MAY-FY11,MAY-FY12,MAY-FY13,MAY-FY14,MAY-FY15,MAY-FY16,Station
0,"(V01) (402) Togus, ME",42899,2.70%,2.51,4.07,2.77,402to,(V01) (402) HCS TOGUS ME,1293.0,1272.0,1248.0,1283.0,1335.0,1446.0,402
1,"(V01) (405) White River Junction, VT",23709,2.16%,3.92,3.95,0.79,405wh,(V01) (405) MROC WHT RIVER JCT VT,852.0,853.0,881.0,921.0,1022.0,1177.0,405
2,"(V01) (518) Bedford, MA",8105,5.17%,1.12,10.53,4.41,518be,(V01) (518) MC BEDFORD MA,1313.0,1265.0,1245.0,1251.0,1297.0,1352.0,518
3,"(V01) (523) VA Boston HCS, MA",85625,1.71%,1.66,3.53,3.92,523va,(V01) (523) HCS BOSTON MA,4022.0,4011.0,4041.0,4032.0,4123.0,4303.0,523
4,"(V01) (608) Manchester, NH",19511,2.64%,2.66,4.21,5.13,608ma,(V01) (608) MC MANCHESTER NH,663.0,683.0,691.0,686.0,759.0,803.0,608


## Bringing in Pending 2016

In [481]:
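#functions to format 2016; station codes can now carry a suffix (e.g. 402GA)
def get_station(new_string):
    try:
        query = r"^\((V\d\d)\) \(([0-9A-Z]{3,5})"
        m = re.search(query, new_string)
        return m.group(2).upper()
    except (AttributeError, TypeError):
        # no match, or the value is not a string (e.g. NaN)
        return None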
        
staff["Station"] = staff["ORGANIZATION"].apply(get_station)




def make_df_2016(Type,Time,File):
    Title = "{}_{}".format(Type,Time)
    csv = pd.read_csv(File,
                               usecols=[0,1,5,20,21,22],
                               skiprows=[0], #skipping the original header
                               names=["Location","Appts_{}".format(Title),
                                      "%_Appts_Over_30_{}".format(Title),
                                      "PC_Wait_{}".format(Title),
                                      "SC_Wait_{}".format(Title),"MH_{}".format(Title)])
    csv["Station"] = csv["Location"].apply(get_station)
    return csv
Pending_1610 = make_df_2016("Pending","1610","Wait_Time/Pending_Clean/16_10_Wait.csv")
#Pending_1610 = Pending_1610.drop(297).drop(369) # edge case causing issues, not a station anyway

In [482]:
def generate_IDs(df):
    # Build station + first two letters of the location name; None marks rows to drop later.
    query = r"^\((V\d\d)\) \((\d\d\d)\) (.{2})"
    ID = []
    for index, row in df.iterrows():
        if len(row["Station"]) > 3:
            # codes longer than three characters (e.g. 402GA) get no ID
            ID.append(None)
        else:
            try:
                m = re.search(query, row["Location"].upper())
                new_ID = row["Station"] + m.group(3).lower()
                # keep only the first row that produces a given ID
                if new_ID not in ID:
                    ID.append(new_ID)
                else:
                    ID.append(None)
            except AttributeError:
                # the location did not match the pattern
                ID.append(None)
    return ID
ID = generate_IDs(Pending_1610)
Pending_1610["ID"] = ID

In [483]:
Pending_1610 = Pending_1610.dropna(subset = ["ID"])
master = master.merge(Pending_1610,on="ID")
print master.columns
master["Station"] = master["Station_x"]
master["Location"] = master["Location_x"]
master = master.drop(["Station_x", "Station_y", "Location_y", "Location_x"], axis=1)
master.head()

Index([u'Location_x', u'Appts_Pending_1412', u'%_Appts_Over_30_Pending_1412',
       u'PC_Wait_Pending_1412', u'SC_Wait_Pending_1412', u'MH_Pending_1412',
       u'ID', u'ORGANIZATION', u'MAY-FY11', u'MAY-FY12', u'MAY-FY13',
       u'MAY-FY14', u'MAY-FY15', u'MAY-FY16', u'Station_x', u'Location_y',
       u'Appts_Pending_1610', u'%_Appts_Over_30_Pending_1610',
       u'PC_Wait_Pending_1610', u'SC_Wait_Pending_1610', u'MH_Pending_1610',
       u'Station_y'],
      dtype='object')


Unnamed: 0,Appts_Pending_1412,%_Appts_Over_30_Pending_1412,PC_Wait_Pending_1412,SC_Wait_Pending_1412,MH_Pending_1412,ID,ORGANIZATION,MAY-FY11,MAY-FY12,MAY-FY13,MAY-FY14,MAY-FY15,MAY-FY16,Appts_Pending_1610,%_Appts_Over_30_Pending_1610,PC_Wait_Pending_1610,SC_Wait_Pending_1610,MH_Pending_1610,Station,Location
0,42899,2.70%,2.51,4.07,2.77,402to,(V01) (402) HCS TOGUS ME,1293.0,1272.0,1248.0,1283.0,1335.0,1446.0,43609,4.18%,7.82,5.67,3.62,402,"(V01) (402) Togus, ME"
1,23709,2.16%,3.92,3.95,0.79,405wh,(V01) (405) MROC WHT RIVER JCT VT,852.0,853.0,881.0,921.0,1022.0,1177.0,21091,3.83%,6.09,7.09,1.86,405,"(V01) (405) White River Junction, VT"
2,8105,5.17%,1.12,10.53,4.41,518be,(V01) (518) MC BEDFORD MA,1313.0,1265.0,1245.0,1251.0,1297.0,1352.0,14456,7.54%,0.6,11.21,3.07,518,"(V01) (518) Bedford, MA"
3,85625,1.71%,1.66,3.53,3.92,523va,(V01) (523) HCS BOSTON MA,4022.0,4011.0,4041.0,4032.0,4123.0,4303.0,84373,4.43%,14.46,6.76,3.08,523,"(V01) (523) VA Boston HCS, MA"
4,19511,2.64%,2.66,4.21,5.13,608ma,(V01) (608) MC MANCHESTER NH,663.0,683.0,691.0,686.0,759.0,803.0,25424,4.09%,5.75,4.58,5.21,608,"(V01) (608) Manchester, NH"


In [484]:
check_len(master)

Hurray! We didn't lose any rows.
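
Comparing the row count against a fixed number catches lost rows, but it does not say which keys failed to match. As a complementary sketch (toy frames, illustrative values only), pandas' `merge` accepts `validate` to guard against repeated keys and `indicator` to flag unmatched rows:

```python
import pandas as pd

# Toy frames; the values are illustrative only.
left = pd.DataFrame({"ID": ["402to", "405wh", "518be"], "Appts": [10, 20, 30]})
right = pd.DataFrame({"ID": ["402to", "405wh"], "Staff": [1, 2]})

merged = left.merge(
    right,
    on="ID",
    how="left",
    validate="one_to_one",  # raises MergeError if an ID repeats on either side
    indicator=True,         # adds a _merge column: both / left_only / right_only
)

# IDs from the left frame that found no partner on the right.
unmatched = merged.loc[merged["_merge"] == "left_only", "ID"].tolist()

print(len(merged) == len(left))
print(unmatched)
```

A left join keeps every row of the left frame, so the unmatched IDs can be inspected instead of silently disappearing as they would in an inner join.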


## Bringing in Complete 2016

In [485]:
def get_station(new_string):
    try:
        query = r"^\((V\d\d)\) \(([0-9A-Z]{3,5})"
        m = re.search(query, new_string)
        return m.group(2).upper()
    except (AttributeError, TypeError):
        # no match, or the value is not a string (e.g. NaN)
        return None
        
staff["Station"] = staff["ORGANIZATION"].apply(get_station)




def make_df_2016(Type,Time,File):
    Title = "{}_{}".format(Type,Time)
    csv = pd.read_csv(File,
                               usecols=[0,1,5,13,14,15],
                               skiprows=[0], #skipping the original header
                               names=["Location","Appts_{}".format(Title),
                                      "%_Appts_Over_30_{}".format(Title),
                                      "PC_Wait_{}".format(Title),
                                      "SC_Wait_{}".format(Title),"MH_{}".format(Title)])
    csv["Station"] = csv["Location"].apply(get_station)
    return csv
Complete_1608 = make_df_2016("Complete","1608","Wait_Time/Completed_Cleaned/16_08_Wait.csv")
Complete_1608.head()

Unnamed: 0,Location,Appts_Complete_1608,%_Appts_Over_30_Complete_1608,PC_Wait_Complete_1608,SC_Wait_Complete_1608,MH_Complete_1608,Station
0,"(V01) (402) Togus, ME",28683,0.0134,4.96,2.73,1.87,402
1,(V01) (402) Togus VAMC,17464,0.0163,5.96,2.66,2.59,402
2,(V01) (402GA) Aroostook County\r(Caribou),756,0.0423,8.84,0.0,3.41,402GA
3,(V01) (402GB) Calais,379,0.0,1.43,0.0,0.36,402GB
4,(V01) (402GC) Rumford,428,0.0,1.31,0.0,1.11,402GC


In [486]:
ID = generate_IDs(Complete_1608)
Complete_1608["ID"] = ID
Complete_1608 = Complete_1608.dropna(subset = ["ID"])
master = master.merge(Complete_1608,on="ID")
master["Station"] = master["Station_x"]
master["Location"] = master["Location_x"]
master = master.drop(["Station_x", "Station_y", "Location_y", "Location_x"], axis=1)

In [487]:
master.columns

Index([u'Appts_Pending_1412', u'%_Appts_Over_30_Pending_1412',
       u'PC_Wait_Pending_1412', u'SC_Wait_Pending_1412', u'MH_Pending_1412',
       u'ID', u'ORGANIZATION', u'MAY-FY11', u'MAY-FY12', u'MAY-FY13',
       u'MAY-FY14', u'MAY-FY15', u'MAY-FY16', u'Appts_Pending_1610',
       u'%_Appts_Over_30_Pending_1610', u'PC_Wait_Pending_1610',
       u'SC_Wait_Pending_1610', u'MH_Pending_1610', u'Appts_Complete_1608',
       u'%_Appts_Over_30_Complete_1608', u'PC_Wait_Complete_1608',
       u'SC_Wait_Complete_1608', u'MH_Complete_1608', u'Station', u'Location'],
      dtype='object')

In [488]:
check_len(master)

Hurray! We didn't lose any rows.


In [489]:
# Reference for duplicates I need to check out. 
master[master.duplicated(subset="Station", keep=False)][["Location","Station","ID"]]

Unnamed: 0,Location,Station,ID
8,"(V02) (528) Albany, NY",528,528al
9,"(V02) (528) Bath, NY",528,528ba
10,"(V02) (528) Western New York, NY",528,528we
11,"(V02) (528) Canandaigua, NY",528,528ca
12,"(V02) (528) Syracuse, NY",528,528sy
79,"(V15) (589) Columbia, MO",589,589co
80,"(V15) (589) Kansas City, MO",589,589ka
81,"(V15) (589) Eastern KS HCS, KS",589,589ea
82,"(V15) (589) Wichita, KS",589,589wi
83,"(V15) (657) Marion, IL",657,657ma


## Bring in Hiring

I'm going to have an issue dealing with the stations with the same ID. 

I made a new CSV called Hiring_Clean, where I put the appropriate ID in a new ID column. I put "Ignore" for rows that were impossible to join accurately.
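
The plan is to drop the "Ignore" rows and then total the hires per station. A minimal sketch of that aggregation on made-up rows, using `groupby` as one alternative to a pivot table (the column names mirror the hiring CSV, the values do not):

```python
import pandas as pd

# Toy hiring rows; "Ignore" marks rows that could not be matched.
hiring = pd.DataFrame({
    "Station": ["402", "402", "405", "528"],
    "ID": ["402to", "402to", "405wh", "Ignore"],
    "NbrEmps": [3, 4, 5, 9],
})

# Drop the unmatched rows, then sum hires per station.
kept = hiring[hiring["ID"] != "Ignore"]
choice_hires = (
    kept.groupby("Station", as_index=False)["NbrEmps"]
        .sum()
        .rename(columns={"NbrEmps": "Choice_Hires"})
)

print(choice_hires)
```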

In [490]:
master.columns

Index([u'Appts_Pending_1412', u'%_Appts_Over_30_Pending_1412',
       u'PC_Wait_Pending_1412', u'SC_Wait_Pending_1412', u'MH_Pending_1412',
       u'ID', u'ORGANIZATION', u'MAY-FY11', u'MAY-FY12', u'MAY-FY13',
       u'MAY-FY14', u'MAY-FY15', u'MAY-FY16', u'Appts_Pending_1610',
       u'%_Appts_Over_30_Pending_1610', u'PC_Wait_Pending_1610',
       u'SC_Wait_Pending_1610', u'MH_Pending_1610', u'Appts_Complete_1608',
       u'%_Appts_Over_30_Complete_1608', u'PC_Wait_Complete_1608',
       u'SC_Wait_Complete_1608', u'MH_Complete_1608', u'Station', u'Location'],
      dtype='object')

In [491]:
Hiring = pd.read_csv("Hiring/Hiring_Clean.csv")
Hiring = Hiring[Hiring["ID"]!= "Ignore"] 
Hiring_summed = pd.pivot_table(Hiring,index=["Station"],values=["NbrEmps"],aggfunc=np.sum).reset_index()
Hiring_summed = Hiring_summed.rename(columns = {"NbrEmps":"Choice_Hires"})

In [493]:
master = master.merge(Hiring_summed, on="Station", how="left")
master.head()

Unnamed: 0,Appts_Pending_1412,%_Appts_Over_30_Pending_1412,PC_Wait_Pending_1412,SC_Wait_Pending_1412,MH_Pending_1412,ID,ORGANIZATION,MAY-FY11,MAY-FY12,MAY-FY13,...,SC_Wait_Pending_1610,MH_Pending_1610,Appts_Complete_1608,%_Appts_Over_30_Complete_1608,PC_Wait_Complete_1608,SC_Wait_Complete_1608,MH_Complete_1608,Station,Location,Choice_Hires
0,42899,2.70%,2.51,4.07,2.77,402to,(V01) (402) HCS TOGUS ME,1293.0,1272.0,1248.0,...,5.67,3.62,28683,0.0134,4.96,2.73,1.87,402,"(V01) (402) Togus, ME",23.0
1,23709,2.16%,3.92,3.95,0.79,405wh,(V01) (405) MROC WHT RIVER JCT VT,852.0,853.0,881.0,...,7.09,1.86,21581,0.0177,3.62,4.08,1.32,405,"(V01) (405) White River Junction, VT",24.0
2,8105,5.17%,1.12,10.53,4.41,518be,(V01) (518) MC BEDFORD MA,1313.0,1265.0,1245.0,...,11.21,3.07,13170,0.0115,0.43,6.01,0.76,518,"(V01) (518) Bedford, MA",5.0
3,85625,1.71%,1.66,3.53,3.92,523va,(V01) (523) HCS BOSTON MA,4022.0,4011.0,4041.0,...,6.76,3.08,46306,0.0271,7.47,5.75,2.17,523,"(V01) (523) VA Boston HCS, MA",28.0
4,19511,2.64%,2.66,4.21,5.13,608ma,(V01) (608) MC MANCHESTER NH,663.0,683.0,691.0,...,4.58,5.21,18849,0.016,4.99,2.93,2.94,608,"(V01) (608) Manchester, NH",19.0


In [494]:
check_len(master)

Hurray! We didn't lose any rows.


## Bring in Leadership

Also creating a version of the CSV to deal with duplicate stations.
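
The idea is a merge key that uses the hand-assigned ID where one exists and falls back to the station code otherwise. A minimal sketch of that fallback, assuming a missing ID is stored as an empty (null) cell; the rows are made up for illustration:

```python
import pandas as pd

# Toy leadership rows; ID is filled in only for stations that share a code.
leadership = pd.DataFrame({
    "Station": ["402", "528", "528"],
    "ID": [None, "528al", "528ba"],
})

# Use ID where present, otherwise fall back to the station code.
leadership["for_dups"] = leadership["ID"].fillna(leadership["Station"])

print(leadership["for_dups"].tolist())
```

This produces the same key as looping over the rows, provided every non-string ID is a missing value.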

In [495]:
Leadership = pd.read_csv("Leadership/Leadership_cleaned.csv")
Leadership = Leadership[Leadership["ID"] != "Ignore"]
Leadership["Vacancy"] = Leadership["Acting/Detailed"] == "VACANT"
Leadership["Interim"] = Leadership["Acting/Detailed"] != "VACANT"


# For easy merging


for_dups = []
for index, row in Leadership.iterrows():
    if type(row["ID"]) == str:
        for_dups.append(row["ID"])
    else:
        for_dups.append(row["Station"])
Leadership["for_dups"] = for_dups

Leadership_summed = pd.pivot_table(Leadership, index=["for_dups"],values=["Vacancy","Interim"],aggfunc=np.sum).reset_index()
#Leadership_summed = Leadership_summed.merge(Leadership[["Station","ID"]], on="Station")





dup_list = master[master.duplicated(subset="Station",keep=False)]["ID"].tolist()
dup_list.append("612n.")


for_dups = []
for index, row in master.iterrows():
    if row["ID"] in dup_list:
        for_dups.append(row["ID"])
    else:
        for_dups.append(row["Station"])
master["for_dups"] = for_dups






In [496]:
master = master.merge(Leadership_summed, on="for_dups", how="left")
master["Interim"] = master["Interim"].fillna(0)
master["Vacancy"] = master["Vacancy"].fillna(0)

In [497]:
check_len(master)

Hurray! We didn't lose any rows.


In [498]:
master.head()

Unnamed: 0,Appts_Pending_1412,%_Appts_Over_30_Pending_1412,PC_Wait_Pending_1412,SC_Wait_Pending_1412,MH_Pending_1412,ID,ORGANIZATION,MAY-FY11,MAY-FY12,MAY-FY13,...,%_Appts_Over_30_Complete_1608,PC_Wait_Complete_1608,SC_Wait_Complete_1608,MH_Complete_1608,Station,Location,Choice_Hires,for_dups,Interim,Vacancy
0,42899,2.70%,2.51,4.07,2.77,402to,(V01) (402) HCS TOGUS ME,1293.0,1272.0,1248.0,...,0.0134,4.96,2.73,1.87,402,"(V01) (402) Togus, ME",23.0,402,0.0,1.0
1,23709,2.16%,3.92,3.95,0.79,405wh,(V01) (405) MROC WHT RIVER JCT VT,852.0,853.0,881.0,...,0.0177,3.62,4.08,1.32,405,"(V01) (405) White River Junction, VT",24.0,405,0.0,0.0
2,8105,5.17%,1.12,10.53,4.41,518be,(V01) (518) MC BEDFORD MA,1313.0,1265.0,1245.0,...,0.0115,0.43,6.01,0.76,518,"(V01) (518) Bedford, MA",5.0,518,0.0,1.0
3,85625,1.71%,1.66,3.53,3.92,523va,(V01) (523) HCS BOSTON MA,4022.0,4011.0,4041.0,...,0.0271,7.47,5.75,2.17,523,"(V01) (523) VA Boston HCS, MA",28.0,523,0.0,0.0
4,19511,2.64%,2.66,4.21,5.13,608ma,(V01) (608) MC MANCHESTER NH,663.0,683.0,691.0,...,0.016,4.99,2.93,2.94,608,"(V01) (608) Manchester, NH",19.0,608,0.0,0.0


In [499]:
master.columns

Index([u'Appts_Pending_1412', u'%_Appts_Over_30_Pending_1412',
       u'PC_Wait_Pending_1412', u'SC_Wait_Pending_1412', u'MH_Pending_1412',
       u'ID', u'ORGANIZATION', u'MAY-FY11', u'MAY-FY12', u'MAY-FY13',
       u'MAY-FY14', u'MAY-FY15', u'MAY-FY16', u'Appts_Pending_1610',
       u'%_Appts_Over_30_Pending_1610', u'PC_Wait_Pending_1610',
       u'SC_Wait_Pending_1610', u'MH_Pending_1610', u'Appts_Complete_1608',
       u'%_Appts_Over_30_Complete_1608', u'PC_Wait_Complete_1608',
       u'SC_Wait_Complete_1608', u'MH_Complete_1608', u'Station', u'Location',
       u'Choice_Hires', u'for_dups', u'Interim', u'Vacancy'],
      dtype='object')

## Bringing in Completed 2014

In [500]:
pd.read_csv("Wait_Time/Completed_Cleaned/Wait_Times_201409.csv").head()

Unnamed: 0,Location_2014,Appts_2014,Appts_Complete_Over_30_Days_%_2014,PC_Wait_2014,SC_Wait_2014,MH_Wait_2014
0,"(V01) (402) Togus, ME",24458,0.0167,2.72,3.34,1.43
1,"(V01) (405) White River Junction, VT",19232,0.0158,2.24,3.22,0.84
2,"(V01) (518) Bedford, MA",11976,0.0159,0.88,5.3,3.67
3,"(V01) (523) VA Boston HCS, MA",45669,0.0105,1.16,3.17,2.63
4,"(V01) (608) Manchester, NH",17616,0.0163,2.49,3.65,3.01


In [501]:
def make_df_complete_2014(Type,Time,File):
    Title = "{}_{}".format(Type,Time)
    csv = pd.read_csv(File,
                               usecols=[0,1,5,12,13,14],
                               skiprows=[0], #skipping the original header
                               names=["Location","Appts_{}".format(Title),
                                      "%_Appts_Over_30_{}".format(Title),
                                      "PC_Wait_{}".format(Title),
                                      "SC_Wait_{}".format(Title),"MH_{}".format(Title)])
    csv["Station"] = csv["Location"].apply(get_station)
    return csv
Complete_1409 = make_df_complete_2014("Complete","1409","Wait_Time/Completed_Cleaned/14_09_Wait.csv")
Complete_1409["ID"] = generate_IDs(Complete_1409)
Complete_1409

Unnamed: 0,Location,Appts_Complete_1409,%_Appts_Over_30_Complete_1409,PC_Wait_Complete_1409,SC_Wait_Complete_1409,MH_Complete_1409,Station,ID
0,"(V01) (402) Togus, ME",24458,0.0167,2.72,3.34,1.43,402,402to
1,"(V01) (405) White River Junction, VT",19232,0.0158,2.24,3.22,0.84,405,405wh
2,"(V01) (518) Bedford, MA",11976,0.0159,0.88,5.30,3.67,518,518be
3,"(V01) (523) VA Boston HCS, MA",45669,0.0105,1.16,3.17,2.63,523,523va
4,"(V01) (608) Manchester, NH",17616,0.0163,2.49,3.65,3.01,608,608ma
5,(V01) (631) VA Central Western Massachusetts HCS,19536,0.0500,6.31,7.80,2.91,631,631va
6,"(V01) (650) Providence, RI",28521,0.0142,2.35,3.65,4.05,650,650pr
7,"(V01) (689) VA Connecticut HCS, CT",48622,0.0150,1.34,3.42,2.99,689,689va
8,"(V02) (528) Albany, NY",24865,0.0088,1.28,1.86,3.02,528,528al
9,"(V02) (528) Bath, NY",13553,0.0185,3.05,5.33,0.13,528,528ba


In [502]:
for_dups = []
for index, row in Complete_1409.iterrows():
    if row["ID"] in dup_list:
        for_dups.append(row["ID"])
    else:
        for_dups.append(row["Station"])
Complete_1409["for_dups"] = for_dups

In [503]:
columns_to_use = Complete_1409.columns.difference(master.columns).tolist()
columns_to_use.append("for_dups")

master = master.merge(Complete_1409[columns_to_use], on="for_dups", how="left")
master.columns






Index([u'Appts_Pending_1412', u'%_Appts_Over_30_Pending_1412',
       u'PC_Wait_Pending_1412', u'SC_Wait_Pending_1412', u'MH_Pending_1412',
       u'ID', u'ORGANIZATION', u'MAY-FY11', u'MAY-FY12', u'MAY-FY13',
       u'MAY-FY14', u'MAY-FY15', u'MAY-FY16', u'Appts_Pending_1610',
       u'%_Appts_Over_30_Pending_1610', u'PC_Wait_Pending_1610',
       u'SC_Wait_Pending_1610', u'MH_Pending_1610', u'Appts_Complete_1608',
       u'%_Appts_Over_30_Complete_1608', u'PC_Wait_Complete_1608',
       u'SC_Wait_Complete_1608', u'MH_Complete_1608', u'Station', u'Location',
       u'Choice_Hires', u'for_dups', u'Interim', u'Vacancy',
       u'%_Appts_Over_30_Complete_1409', u'Appts_Complete_1409',
       u'MH_Complete_1409', u'PC_Wait_Complete_1409',
       u'SC_Wait_Complete_1409'],
      dtype='object')

In [504]:
master.head()

Unnamed: 0,Appts_Pending_1412,%_Appts_Over_30_Pending_1412,PC_Wait_Pending_1412,SC_Wait_Pending_1412,MH_Pending_1412,ID,ORGANIZATION,MAY-FY11,MAY-FY12,MAY-FY13,...,Location,Choice_Hires,for_dups,Interim,Vacancy,%_Appts_Over_30_Complete_1409,Appts_Complete_1409,MH_Complete_1409,PC_Wait_Complete_1409,SC_Wait_Complete_1409
0,42899,2.70%,2.51,4.07,2.77,402to,(V01) (402) HCS TOGUS ME,1293.0,1272.0,1248.0,...,"(V01) (402) Togus, ME",23.0,402,0.0,1.0,0.0167,24458.0,1.43,2.72,3.34
1,23709,2.16%,3.92,3.95,0.79,405wh,(V01) (405) MROC WHT RIVER JCT VT,852.0,853.0,881.0,...,"(V01) (405) White River Junction, VT",24.0,405,0.0,0.0,0.0158,19232.0,0.84,2.24,3.22
2,8105,5.17%,1.12,10.53,4.41,518be,(V01) (518) MC BEDFORD MA,1313.0,1265.0,1245.0,...,"(V01) (518) Bedford, MA",5.0,518,0.0,1.0,0.0159,11976.0,3.67,0.88,5.3
3,85625,1.71%,1.66,3.53,3.92,523va,(V01) (523) HCS BOSTON MA,4022.0,4011.0,4041.0,...,"(V01) (523) VA Boston HCS, MA",28.0,523,0.0,0.0,0.0105,45669.0,2.63,1.16,3.17
4,19511,2.64%,2.66,4.21,5.13,608ma,(V01) (608) MC MANCHESTER NH,663.0,683.0,691.0,...,"(V01) (608) Manchester, NH",19.0,608,0.0,0.0,0.0163,17616.0,3.01,2.49,3.65


In [505]:
check_len(master)    

Hurray! We didn't lose any rows.


# Self Generated Columns

In [506]:
# Appts
master["Pending_Increase"] = master["Appts_Pending_1610"] - master["Appts_Pending_1412"]
master["Pending_Increase_%"] = master["Pending_Increase"]/master["Appts_Pending_1412"]
master["Complete_Increase"] = master["Appts_Complete_1608"] - master["Appts_Complete_1409"]
master["Complete_Increase_%"] = master["Complete_Increase"]/master["Appts_Complete_1409"]

# Leadership
master["Missing_Leadership"] = master["Vacancy"] + master["Interim"]

# Hires
master["Choice_Increase_%"] = master["Choice_Hires"]/master["MAY-FY14"]


# Wait Times
master["PC_Wait_Pending_Increase"] = master["PC_Wait_Pending_1610"] - master["PC_Wait_Pending_1412"]
master["PC_Wait_Pending_Increase_%"] = master["PC_Wait_Pending_Increase"]/master["PC_Wait_Pending_1412"]

master["PC_Wait_Complete_Increase"] = master["PC_Wait_Complete_1608"] - master["PC_Wait_Complete_1409"]
master["PC_Wait_Complete_Increase_%"] = master["PC_Wait_Complete_Increase"]/master["PC_Wait_Complete_1409"]
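
One caveat with the percentage columns: if a baseline value is zero, the division yields infinity rather than a usable number. A hedged sketch of a guard for that case follows; `pct_change` here is a hypothetical helper and the `toy` values are illustrative.

```python
import numpy as np
import pandas as pd

def pct_change(new, old):
    """(new - old) / old, with NaN wherever old is zero."""
    old = old.replace(0, np.nan)
    return (new - old) / old

# Second row has a zero baseline, which would otherwise divide by zero.
toy = pd.DataFrame({"wait_1409": [2.0, 0.0], "wait_1608": [3.0, 1.5]})
toy["increase_%"] = pct_change(toy["wait_1608"], toy["wait_1409"])

print(toy)
```

Marking those rows as missing keeps them out of later averages instead of letting an infinite value dominate.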

# Turnover

In [507]:
turnover = pd.read_csv("Turnover/Turnover_Physician.csv",na_values="-")
# to drop national numbers and visn
turnover = turnover[turnover["Organization"].str.len() > 5]

turnover["Station"]= turnover["Organization"].apply(get_station)

for_dups = []
for index, row in turnover.iterrows():
    if type(row["ID"]) == str:
        for_dups.append(row["ID"])
    else:
        for_dups.append(row["Station"])
turnover["for_dups"] = for_dups


turnover.head()

Unnamed: 0,Organization,Specialty,FY11,FY12,FY13,FY14,FY15,ID,Station,for_dups
174,(V01) (402) HCS TOGUS ME,0602 Physician (All Specialties),0.0997,0.0955,0.1133,0.1474,0.0781,,402,402
175,(V01) (402) HCS TOGUS ME,01 ANESTHESIOLOGY,0.375,0.7742,,,,,402,402
176,(V01) (402) HCS TOGUS ME,02 SURGERY,,,,0.6667,,,402,402
177,(V01) (402) HCS TOGUS ME,07 ORTHOPEDIC SURGERY,,0.2927,,,,,402,402
178,(V01) (402) HCS TOGUS ME,08 OTOLARYNGOLOGY,,,,1.0,,,402,402


### Not merging in all of turnover

To do this would create a massive number of columns. Instead, it makes more sense to merge filtered versions in on a case by case basis.

But it does make sense to merge in the overall "0602 Physician (All Specialties)" turnover rates.
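
Merging a filtered version case by case amounts to picking one specialty, dropping the `Specialty` column, and renaming the year columns so they stay recognizable in the master table. A minimal sketch on made-up rows (the turnover values are illustrative only):

```python
import pandas as pd

# Toy turnover rows; values are illustrative only.
turnover_toy = pd.DataFrame({
    "Specialty": ["0602 Physician (All Specialties)", "02 SURGERY",
                  "0602 Physician (All Specialties)"],
    "FY15": [0.08, 0.67, 0.10],
    "for_dups": ["402", "402", "405"],
})

# Keep one specialty, then drop the now-constant column and rename the rate.
overall = (
    turnover_toy[turnover_toy["Specialty"] == "0602 Physician (All Specialties)"]
    .drop(columns="Specialty")
    .rename(columns={"FY15": "Physician_Turnover_FY15"})
)

print(overall)
```

The result has one row per merge key, so a left merge on `for_dups` adds columns without adding rows.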

In [508]:
turnover.to_csv("Turnover/Turnover_Physician_For_Merging.csv")

In [509]:

physician_turnover = pd.read_csv("Turnover/Turnover_Physician_For_Merging.csv",na_values="-",
                           usecols=[2,3,4,5,6,7,10],
                           skiprows=[0], #skipping the original header
                           names=["Specialty","Physician_Turnover_FY11","Physician_Turnover_FY12",
                                  "Physician_Turnover_FY13","Physician_Turnover_FY14","Physician_Turnover_FY15",
                                  "for_dups"])
physician_turnover = physician_turnover[physician_turnover["Specialty"]=="0602 Physician (All Specialties)"].drop("Specialty", axis=1)
master = master.merge(physician_turnover,on="for_dups",how="left")

# Export to CSV

In [510]:
master.to_csv("Master/Master.csv", index=False)

In [511]:
master.columns

Index([u'Appts_Pending_1412', u'%_Appts_Over_30_Pending_1412',
       u'PC_Wait_Pending_1412', u'SC_Wait_Pending_1412', u'MH_Pending_1412',
       u'ID', u'ORGANIZATION', u'MAY-FY11', u'MAY-FY12', u'MAY-FY13',
       u'MAY-FY14', u'MAY-FY15', u'MAY-FY16', u'Appts_Pending_1610',
       u'%_Appts_Over_30_Pending_1610', u'PC_Wait_Pending_1610',
       u'SC_Wait_Pending_1610', u'MH_Pending_1610', u'Appts_Complete_1608',
       u'%_Appts_Over_30_Complete_1608', u'PC_Wait_Complete_1608',
       u'SC_Wait_Complete_1608', u'MH_Complete_1608', u'Station', u'Location',
       u'Choice_Hires', u'for_dups', u'Interim', u'Vacancy',
       u'%_Appts_Over_30_Complete_1409', u'Appts_Complete_1409',
       u'MH_Complete_1409', u'PC_Wait_Complete_1409', u'SC_Wait_Complete_1409',
       u'Pending_Increase', u'Pending_Increase_%', u'Complete_Increase',
       u'Complete_Increase_%', u'Missing_Leadership', u'Choice_Increase_%',
       u'PC_Wait_Pending_Increase', u'PC_Wait_Pending_Increase_%',
       u'PC_

In [512]:
for column in master.columns:
    print column

Appts_Pending_1412
%_Appts_Over_30_Pending_1412
PC_Wait_Pending_1412
SC_Wait_Pending_1412
MH_Pending_1412
ID
ORGANIZATION
MAY-FY11
MAY-FY12
MAY-FY13
MAY-FY14
MAY-FY15
MAY-FY16
Appts_Pending_1610
%_Appts_Over_30_Pending_1610
PC_Wait_Pending_1610
SC_Wait_Pending_1610
MH_Pending_1610
Appts_Complete_1608
%_Appts_Over_30_Complete_1608
PC_Wait_Complete_1608
SC_Wait_Complete_1608
MH_Complete_1608
Station
Location
Choice_Hires
for_dups
Interim
Vacancy
%_Appts_Over_30_Complete_1409
Appts_Complete_1409
MH_Complete_1409
PC_Wait_Complete_1409
SC_Wait_Complete_1409
Pending_Increase
Pending_Increase_%
Complete_Increase
Complete_Increase_%
Missing_Leadership
Choice_Increase_%
PC_Wait_Pending_Increase
PC_Wait_Pending_Increase_%
PC_Wait_Complete_Increase
PC_Wait_Complete_Increase_%
Physician_Turnover_FY11
Physician_Turnover_FY12
Physician_Turnover_FY13
Physician_Turnover_FY14
Physician_Turnover_FY15
