## Data Collection
This project aims to predict total colony-forming units (CFU) from particle counts, time of day, weather, percentage of outdoor particulates, and location.

Goal: Organize your data to streamline the next steps of your capstone.

■ Data loading
■ Data joining

In [1]:
#importing the tools
import pandas as pd
import datetime as dt
import numpy as np
import os
import glob
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import matplotlib.pyplot as plt

In [2]:
#Data loading
data = pd.read_excel(r'data\filt_cfus.xlsx', sheet_name = [0,3,4])

dfIC = pd.read_excel(r'data\VADIC.xlsx', sheet_name = [1], index_col=0, parse_dates=True)
dfMC = pd.read_excel(r'data\VADMC.xlsx', sheet_name = [0], index_col=0, parse_dates=True)


# dfMC is both hospitals' particle air sampling data.
# dfIC is Infection Control's particle air sampling data.
# dfAHA is both hospitals' ad hoc particle air sampling data.

# Cleaning and unifying CFU data

In [3]:
for key in data:
    print(key, data[key].head())

0                   Date           location  cfu
0  2018-11-30 00:00:00   Inside Room L108    1
1  2018-11-30 00:00:00     ED Parking lot   34
2  2010-11-30 00:00:00  unused agar strip    0
3  2018-11-29 00:00:00  unused agar strip    0
4  2018-11-29 00:00:00   Inside Room L108    0
3         Date           location  cfu
0 2018-11-27  UNUSED AGAR STRIP    0
1 2018-11-27             11L NS    2
2 2018-11-27            11L SEC    5
3 2018-11-27             10M NS    1
4 2018-11-27           9MICU NS    2
4         Date                           location  cfu
0 2018-09-14  B1-C6 Balcony   BMT/Hem Onc Spine    1
1 2018-09-14             B1-A6 Balcony by A6577    1
2 2018-09-14             B1-A5 Balcony by A5577    2
3 2018-09-14             B1-A4 Balcony by A4577    1
4 2018-09-14             B1-C4 Balcony by C4877    2


In [4]:
adhoc = pd.DataFrame(data[0])
ml = pd.DataFrame(data[3])
mb = pd.DataFrame(data[4])

print("\n ml: \n", ml.dtypes, ml.shape, "\n ad hoc: \n",adhoc.dtypes,adhoc.shape,"\n mb: \n", mb.dtypes, mb.shape)


 ml: 
 Date        datetime64[ns]
location            object
cfu                  int64
dtype: object (1403, 3) 
 ad hoc: 
 Date        object
location    object
cfu          int64
dtype: object (2532, 3) 
 mb: 
 Date        datetime64[ns]
location            object
cfu                  int64
dtype: object (1219, 3)


In [5]:
adhoc['Date'] = pd.to_datetime(adhoc['Date'])
adhoc['Date']

0      2018-11-30
1      2018-11-30
2      2010-11-30
3      2018-11-29
4      2018-11-29
          ...    
2527   2004-02-03
2528   2004-02-03
2529   2004-02-03
2530   2004-02-03
2531   2004-02-03
Name: Date, Length: 2532, dtype: datetime64[ns]

In [6]:
adhoc['Source'] = 'adhoc'
ml['Source'] = 'ML'
mb['Source'] = 'MB'

In [7]:
print("\n ml", ml.dtypes, "\n ad hoc",adhoc.dtypes,"\n mb", mb.dtypes)
adhoc['cfu'].unique()


 ml Date        datetime64[ns]
location            object
cfu                  int64
Source              object
dtype: object 
 ad hoc Date        datetime64[ns]
location            object
cfu                  int64
Source              object
dtype: object 
 mb Date        datetime64[ns]
location            object
cfu                  int64
Source              object
dtype: object


array([ 1, 34,  0,  9,  3,  2,  5, 27, 14,  4, 11, 24, 22,  7, 23, 19, 20,
        8, 13, 12, 28, 10, 17, 18, 30,  6, 16, 15, 31, 26, 25, 21, 86, 46,
       32, 29, 33, 39], dtype=int64)

In [8]:
cfu_df = pd.DataFrame(pd.concat([adhoc, ml, mb]))
cfu_df.head(), cfu_df.shape

(        Date           location  cfu Source
 0 2018-11-30   Inside Room L108    1  adhoc
 1 2018-11-30     ED Parking lot   34  adhoc
 2 2010-11-30  unused agar strip    0  adhoc
 3 2018-11-29  unused agar strip    0  adhoc
 4 2018-11-29   Inside Room L108    0  adhoc,
 (5154, 4))
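
Note that `pd.concat` keeps each frame's original row labels unless told otherwise, so the combined CFU table can contain the same label several times. A minimal sketch with invented values, showing the default behavior next to `ignore_index=True`; whether the notebook should renumber the rows is a judgment call, not something the source states.

```python
import pandas as pd

# Two tiny stand-ins for the per-source CFU tables (values are invented).
a = pd.DataFrame({"cfu": [1, 34]})
b = pd.DataFrame({"cfu": [0, 2]})

# Default: row labels are carried over, so 0 and 1 each appear twice.
kept = pd.concat([a, b])

# ignore_index=True assigns a fresh 0..n-1 index to the result.
fresh = pd.concat([a, b], ignore_index=True)

print(kept.index.is_unique, fresh.index.is_unique)
```

With repeated labels, `kept.loc[0]` returns two rows rather than one, which is easy to overlook in later lookups.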

In [9]:
cfu_df.describe()
cfu_df.info()
len(cfu_df.location.unique())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5154 entries, 0 to 1218
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      5154 non-null   datetime64[ns]
 1   location  5153 non-null   object        
 2   cfu       5154 non-null   int64         
 3   Source    5154 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 201.3+ KB


1754
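
The 1,754 unique location values probably overstate the number of distinct sampling sites, because the same site appears with different capitalization and spacing (for example 'unused agar strip' and 'UNUSED AGAR STRIP', or 'ED Parking lot' and 'ED Parking Lot'). A minimal sketch of one way to normalize the strings before counting; the helper `normalize_location` and the `location_clean` column are hypothetical names, not part of the notebook.

```python
import pandas as pd

def normalize_location(s: pd.Series) -> pd.Series:
    """Trim, collapse internal whitespace, and lowercase location labels."""
    return (
        s.str.strip()
         .str.replace(r"\s+", " ", regex=True)
         .str.lower()
    )

# Toy rows that mimic the capitalization and spacing variants seen above.
sample = pd.DataFrame({
    "location": ["unused agar strip", "UNUSED AGAR STRIP", "ED Parking lot",
                 "ED Parking Lot", "B1-C6 Balcony   BMT/Hem Onc Spine", None],
})
sample["location_clean"] = normalize_location(sample["location"])

# nunique() ignores missing values by default.
n_raw = sample["location"].nunique()
n_clean = sample["location_clean"].nunique()
print(n_raw, n_clean)
```

Keeping the original `location` column alongside the cleaned one makes it possible to audit which raw labels were collapsed together.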

# Time for the Air particulate data

In [12]:
ic = pd.DataFrame(dfIC[1].reset_index())
mc = pd.DataFrame(dfMC[0].reset_index())

In [13]:
print("\n ic: \n", ic.dtypes, ic.shape, "\n mc: \n",mc.dtypes,mc.shape)


 ic: 
 Sampling Date            datetime64[ns]
Sample Location                  object
Temp                             object
RH                              float64
Particle (total >.3)            float64
Particle (>.5 per m3)           float64
Time of Sampling                float64
Traffic                          object
SF Gate Weather                  object
Notes                            object
dtype: object (334, 10) 
 mc: 
 Sampling Date            datetime64[ns]
Sample Location                  object
Temp                            float64
RH                              float64
Particle (total >.3)            float64
Particle (>.5 per m3)           float64
Time of Sampling                 object
Traffic                         float64
SF Gate Weather                  object
Notes                            object
dtype: object (2406, 10)
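
The two tables disagree on dtypes: `Temp` is object in `ic` but float64 in `mc`, `Time of Sampling` is float64 in `ic` but object in `mc`, and `Traffic` is object in `ic` but float64 in `mc`. One possible way to align numeric columns before concatenating is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN. The toy values below are invented, and coercing is an assumption about what the project wants: it silently discards text entries, so it may be worth inspecting them first.

```python
import pandas as pd

# Invented rows: numeric readings stored as text, plus one non-numeric entry.
toy = pd.DataFrame({
    "Temp": ["72.9", 68, "n/a"],
    "Traffic": ["1", "3", 0],
})

numeric_cols = ["Temp", "Traffic"]
before = toy[numeric_cols].isna().sum()

# Convert each column; anything that cannot be parsed becomes NaN.
converted = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Count how many values were lost to coercion, per column.
lost = converted.isna().sum() - before
print(converted.dtypes, lost, sep="\n")
```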


In [14]:
#Renaming and unifying column names (one shared mapping for both tables)
column_names = {"Sampling Date":"Date",
                "Sample Location":"location",
                "Particle (total >.3)":"Total_Particles>0.3",
                "Particle (>.5 per m3)":"Total_Particles>0.5/m3",
                "Time of Sampling":"Sample_Time",
                "SF Gate Weather":"Weather",
                "Percentage .3 over .5":"Ratio_0.3/0.5"}  # ignored when the column is absent
ic = ic.rename(columns=column_names)
mc = mc.rename(columns=column_names)

print(mc.keys(),ic.keys())

Index(['Date', 'location', 'Temp', 'RH', 'Total_Particles>0.3',
       'Total_Particles>0.5/m3', 'Sample_Time', 'Traffic', 'Weather', 'Notes'],
      dtype='object') Index(['Date', 'location', 'Temp', 'RH', 'Total_Particles>0.3',
       'Total_Particles>0.5/m3', 'Sample_Time', 'Traffic', 'Weather', 'Notes'],
      dtype='object')


In [15]:
print(len(ic.location.unique()), len(mc.location.unique()),ic.location.unique(),mc.location.unique())

ic.shape, mc.shape

160 123 ['Hallway by Nursing Station outside plastic Barrier 14M'
 'Outside of double-doors in long corridor labeled with sign "Staff Only Do Not Enter." These doors are immediately adjacent to 15L north hallways and 15L waiting room. '
 'Immediately outside of construction barricade, inside area labedled "Staff Only Do Not Enter".'
 'Outside 15M west stairwell (barricaded, emergency exit only from construction area).'
 'Outside 15M east stairwell (barricaded, emergency exit only from construction area).'
 'ED Parking Lot' 'Inside 6 Moffitt South (M618) ICC nursing station'
 'Inside 6 Moffitt South ICC Bed 1' 'Inside 6 Moffitt South ICC Bed 10'
 'Inside 6 Moffitt South ICC Bed 5' 'Inside 6 Moffitt Sout ICC Bed 9'
 'Outside Construction barrier' 'Outside double doors to 12N (cath lab)'
 'Inside unit 12N' 'Hallway 12S Corridor'
 'Room J146 Mammography Waiting Room' 'Mt. Zion Shuttle Stop'
 'Right side hall next to reception desk 8NICU'
 'Left side hall next to reception desk 8NICU'
 'Ini

((334, 10), (2406, 10))

In [17]:
#merging the particulate data now that they are in the same format
part_df = pd.DataFrame(pd.concat([ic, mc]))

part_df.index = part_df['Date']
part_df.head(), part_df.shape

(                 Date                                           location  \
 Date                                                                       
 2018-01-03 2018-01-03  Hallway by Nursing Station outside plastic Bar...   
 2019-04-03 2019-04-03  Outside of double-doors in long corridor label...   
 2019-04-03 2019-04-03  Immediately outside of construction barricade,...   
 2019-04-03 2019-04-03  Outside 15M west stairwell (barricaded, emerge...   
 2019-04-03 2019-04-03  Outside 15M east stairwell (barricaded, emerge...   
 
             Temp     RH  Total_Particles>0.3  Total_Particles>0.5/m3  \
 Date                                                                   
 2018-01-03  72.9   38.0             108659.0               2167314.0   
 2019-04-03    68  155.0              20945.0               1778622.0   
 2019-04-03    71  150.0              22199.0               1682156.0   
 2019-04-03  72.8  148.0              41271.0               1401414.0   
 2019-04-03  68.7  14

# Time for merging

In [20]:
#fixing the index
cfu_df = cfu_df.set_index('Date')
cfu_df.index, part_df.index

(DatetimeIndex(['2018-11-30', '2018-11-30', '2010-11-30', '2018-11-29',
                '2018-11-29', '2018-11-29', '2018-11-20', '2018-11-20',
                '2018-11-20', '2018-11-20',
                ...
                '2015-01-23', '2015-01-23', '2015-01-23', '2015-01-23',
                '2015-01-23', '2015-01-23', '2015-01-23', '2015-01-23',
                '2015-01-23', '2015-04-07'],
               dtype='datetime64[ns]', name='Date', length=5154, freq=None),
 DatetimeIndex(['2018-01-03', '2019-04-03', '2019-04-03', '2019-04-03',
                '2019-04-03', '2019-04-03', '2019-03-26', '2019-03-26',
                '2019-03-26', '2019-03-26',
                ...
                '2007-07-31', '2007-07-31', '2007-07-31', '2007-07-31',
                '2007-07-31', '2007-07-31', '2007-07-31', '2007-07-31',
                '2007-07-31', '2007-07-31'],
               dtype='datetime64[ns]', name='Date', length=2740, freq=None))

In [21]:
df = part_df.merge(cfu_df,  how='left', on=['location'])
df

Unnamed: 0,Date,location,Temp,RH,Total_Particles>0.3,Total_Particles>0.5/m3,Sample_Time,Traffic,Weather,Notes,cfu,Source
0,2018-01-03,Hallway by Nursing Station outside plastic Bar...,72.9,38.0,108659.0,2167314.0,1121,1,Mostly Cloudy,,,
1,2019-04-03,Outside of double-doors in long corridor label...,68,155.0,20945.0,1778622.0,1138,3,Cloudy,,,
2,2019-04-03,"Immediately outside of construction barricade,...",71,150.0,22199.0,1682156.0,1143,1,Cloudy,,,
3,2019-04-03,"Outside 15M west stairwell (barricaded, emerge...",72.8,148.0,41271.0,1401414.0,1150,0,Cloudy,,,
4,2019-04-03,"Outside 15M east stairwell (barricaded, emerge...",68.7,146.0,41727.0,1603004.0,1155,0,Cloudy,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
25982,2007-07-31,ED Parking Lot,65.4,61.3,361025.0,,,1,Foggy and Windy,,18.0,adhoc
25983,2007-07-31,ED Parking Lot,65.4,61.3,361025.0,,,1,Foggy and Windy,,3.0,adhoc
25984,2007-07-31,ED Parking Lot,65.4,61.3,361025.0,,,1,Foggy and Windy,,4.0,adhoc
25985,2007-07-31,9 Long ICU Nurses Station,,,26603.0,,,2,Foggy and Windy,,,


In [22]:
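df.describe  # missing parentheses: this displays the bound method, not summary statistics; call df.describe() for those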

<bound method NDFrame.describe of             Date                                           location  Temp  \
0     2018-01-03  Hallway by Nursing Station outside plastic Bar...  72.9   
1     2019-04-03  Outside of double-doors in long corridor label...    68   
2     2019-04-03  Immediately outside of construction barricade,...    71   
3     2019-04-03  Outside 15M west stairwell (barricaded, emerge...  72.8   
4     2019-04-03  Outside 15M east stairwell (barricaded, emerge...  68.7   
...          ...                                                ...   ...   
25982 2007-07-31                                     ED Parking Lot  65.4   
25983 2007-07-31                                     ED Parking Lot  65.4   
25984 2007-07-31                                     ED Parking Lot  65.4   
25985 2007-07-31                          9 Long ICU Nurses Station   NaN   
25986 2007-07-31                                         10 IC/ICC    NaN   

          RH  Total_Particles>0.3  Total_

In [23]:
df.isnull().sum()

Date                          0
location                      0
Temp                       1687
RH                         1687
Total_Particles>0.3         365
Total_Particles>0.5/m3     7805
Sample_Time                6957
Traffic                     145
Weather                    2914
Notes                     25573
cfu                        1235
Source                     1235
dtype: int64
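
Joining on `location` alone pairs every particle sample at a site with every CFU sample ever taken at that site, which would explain why 2,740 particle rows grow to roughly 26,000 merged rows. A small sketch with invented data shows the effect and one possible alternative, joining on both date and location. Whether same-day matching is the right rule for this project is an assumption.

```python
import pandas as pd

# Invented particle and CFU samples at a single site.
particles = pd.DataFrame({
    "Date": pd.to_datetime(["2018-11-29", "2018-11-30"]),
    "location": ["ED Parking Lot", "ED Parking Lot"],
    "Total_Particles>0.3": [361025.0, 298000.0],
})
cfu = pd.DataFrame({
    "Date": pd.to_datetime(["2018-11-29", "2018-11-30", "2018-11-30"]),
    "location": ["ED Parking Lot"] * 3,
    "cfu": [18, 3, 4],
})

# Location only: each of the 2 particle rows matches all 3 CFU rows.
by_location = particles.merge(cfu, how="left", on="location")

# Date and location: rows pair up only when both keys agree.
by_both = particles.merge(cfu, how="left", on=["Date", "location"])

print(len(by_location), len(by_both))
```

Comparing the row count before and after a merge is a cheap check for this kind of unintended multiplication.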

In [None]:
#TODO: ask DJ how to stop merges from adding suffixed duplicate columns (dates, dates_x, dates_y)
print(df_particulate.keys())
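
On the question in the comment above: pandas appends `_x` and `_y` when both frames contain a column with the same name that is not listed in `on`. A sketch of three possible ways to handle it, using invented frames `left` and `right`; which option suits this dataset is left open.

```python
import pandas as pd

left = pd.DataFrame({"Date": ["2018-11-30"], "location": ["Inside Room L108"], "RH": [38.0]})
right = pd.DataFrame({"Date": ["2018-11-30"], "location": ["Inside Room L108"], "cfu": [1]})

# 'Date' exists on both sides but is not a key, so pandas keeps both copies.
suffixed = left.merge(right, on="location")

# Option 1: make the shared column part of the join key.
keyed = left.merge(right, on=["Date", "location"])

# Option 2: drop the duplicate column from one side before merging.
dropped = left.merge(right.drop(columns="Date"), on="location")

# Option 3: keep both copies but give them descriptive suffixes.
named = left.merge(right, on="location", suffixes=("_particle", "_cfu"))

print(list(suffixed.columns))
print(list(keyed.columns))
```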

In [None]:
#df_melt = pd.melt(dfW, id_vars=['Date'], var_name='Location',value_name='CFU')
#df_cfu= pd.DataFrame()
#df_to_merge = [dfW,dfE,dfAH]

#for df in df_to_merge:
#    df=pd.melt(df, id_vars=['Date'], var_name='Location',value_name='CFU')
print(df_to_merge)

In [None]:
#merging the CFU data now that they are in the same format
df_all = df_particulate.merge(df_CFU, how='outer', on=['Date','Location'])
df_all

In [None]:
dftest = df_all.dropna(how='all')
dftest

In [None]:
#dfE.reset_index()
dftest = df_all.set_index(['Date','Location'])
print(dftest.head())

In [None]:
df_all.isnull().sum()

Many locations are named inconsistently, and some entries omit the room number because it appears only in the preceding sample(s).
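
Because the same site can be written several ways, exact string joins on `location` will miss matches. The notebook imports fuzzywuzzy for this purpose; the sketch below uses `difflib` from the Python standard library as a stand-in so it runs without extra packages. The canonical names, the 0.8 cutoff, and the helper `match_location` are illustrative assumptions.

```python
import difflib

# Hypothetical list of canonical site names to match against.
canonical = ["ED Parking Lot", "Inside Room L108", "Unused Agar Strip"]

def match_location(raw, choices, cutoff=0.8):
    """Return the closest canonical name, or None if nothing clears the cutoff."""
    # Compare case-insensitively, then map the hit back to its canonical spelling.
    lookup = {c.casefold(): c for c in choices}
    hits = difflib.get_close_matches(
        raw.strip().casefold(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None

for raw in ["ED Parking lot", "UNUSED AGAR STRIP", "inside room L-108", "Mt. Zion Shuttle Stop"]:
    print(repr(raw), "->", match_location(raw, canonical))
```

Returning `None` for weak matches keeps unfamiliar sites visible for manual review instead of forcing them onto the nearest name.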

## Data Organization
Create a file structure and add your work to the GitHub repository you’ve created for this project.

In [None]:
# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# concatenate the whole list in one call instead of appending in a loop.
the_list = pd.concat(df_list)

print(the_list)

In [None]:
result = pd.concat(df_list)

## Data Definition
Goal: Gain an understanding of your data features to inform the next steps of your project.

## Data Cleaning
Goal: Clean the data to prepare it for the next steps of your project.