
# __Data Audit Code__

### i) Data Audit Table
- This workbook contains code to audit Excel data files
- After opening a 'csv' file and running the code below, each function returns a dataframe that can be pasted into a document.
- 1) __data_audit__ Creates a dataframe that reports the missing values for each column
- 2) __value_count__ Returns a dataframe containing the counts of unique values for each column
- 3) __join_tables__ Returns an inner join of the two dataframes produced above
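
As a rough illustration of the kind of summary these three functions produce, the sketch below builds a similar table on a small made-up dataframe. The `toy` data and the names `audit`, `counts` and `summary` are assumptions for illustration only; this is a simplified variant, not the notebook's own implementation.

```python
import pandas as pd

# Hypothetical toy data, not taken from the hotel file
toy = pd.DataFrame({
    "Brand": ["HI", "HI", "CP", None],
    "Country": ["UK", "UK", "UK", "UK"],
})

# Missing values per column: count and percentage
audit = pd.DataFrame({
    "Missing Values": toy.isnull().sum(),
    "% of Total Missing Values": (100 * toy.isnull().mean()).round(2),
})

# Counts of unique values per column, flattened to one string per column
counts = pd.Series(
    {
        col: " , ".join(f"{val} : {n}" for val, n in toy[col].value_counts().items())
        for col in toy.columns
    },
    name="Value_counts",
)

# Inner join of the two summaries on the column name
summary = audit.join(counts, how="inner")
print(summary)
```

Here the join is done on the index (the column names of `toy`), which avoids the `reset_index` and rename steps needed when merging on an explicit key.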

### ii) Deleting Missing Values 
- This section can be used to delete incomplete data from dataframes
- 1) __missing_values_table__ Creates a dataframe listing only the columns that have missing data
- 2) __remove_incomplete_columns__ Runs the missing values table and removes the columns in which more than x% of the data is missing


In [1]:
import pandas as pd 
import numpy as np
df = pd.read_csv('2019 Hotel Master List.csv')

In [2]:
def data_audit(df):
    
    '''Detects missing values for each column and returns a dataframe'''
    
    # Total missing values
    mis_val = df.isnull().sum()
        
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
        
    # Create a table
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Missing Values'})
        
    # Round values in table
    mis_val_table_ren_columns = mis_val_table_ren_columns.round(2)
        
    # Columns that have at least one missing value
    missing_values_only = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Missing Values', ascending=False)
        
    # Columns with no data at all (100% missing)
    NO_DATA = mis_val_table_ren_columns[
        mis_val_table_ren_columns['% of Total Missing Values'] == 100]
        
    # Print the summary information
    print(f"Your selected dataframe has {df.shape[1]} columns.\n"
          f"There are {missing_values_only.shape[0]} columns that have missing values.\n"
          f"There are {NO_DATA.shape[0]} columns that have no data.")
        
    # Copy the data audit to the clipboard (paste into Excel).
    # to_clipboard() returns None, so this print also outputs "None".
    print(mis_val_table_ren_columns.to_clipboard())
        
    # Return the dataframe
    return mis_val_table_ren_columns

In [3]:
data_audit(df)

Your selected dataframe has 50 columns.
There are 35 columns that have missing values.
There are 0 columns that have no data.
None


Unnamed: 0,Missing Values,% of Total Missing Values
Fac ID,0,0.0
Hotel Code,1,0.09
Hotel Name,0,0.0
Open/Closed,0,0.0
Brand,0,0.0
Business Unit,0,0.0
Geography Division,0,0.0
Ops VP,1,0.09
Ops Director/RGM,34,3.14
AGM,697,64.42


In [4]:
#create a function to concat columns. Used in pivot table agg function below
def string_concat(x):
    return  ' , '.join(x.values.astype(str))

In [5]:
def value_count(df):
    
    '''Return a dataframe with the counts of unique values for each column (one row per column)'''
    
    #create a dataframe with each column, its variable, and value count
    df = pd.concat([df.apply(lambda x: x.value_counts()).T.stack()], axis = 1)
    
    #reset index
    df = df.reset_index()
    
    #concat column variable and count
    df['Value_counts'] = df[['level_1', 0]].apply(lambda x: ' : '.join(x.astype(str)), axis=1)
    
    #create a pivot table with each column as the index and an agg function to concat all value counts
    table = pd.pivot_table(df, index = ['level_0'], 
                       values = ['Value_counts'], 
                        aggfunc ={'Value_counts':string_concat})
    
    return table

    

In [6]:
value_count(df)



level_0,Value_counts
AGM,"(Katara hotel) : 1.0 , (Poland hotels) : 2.0 ,..."
Additional CC E-Mail Address,"gm@hiruncornhotel.co.uk : 1.0 , gm@hilancaster..."
Brand,"CP : 191.0 , EX : 312.0 , HI : 397.0 , IC : 10..."
Brand EX Guestrooms Launch Month,"01 December 2019 : 15.0 , 01 March 2019 : 11.0..."
Brand EX Guestrooms Status,"Refurb : 2.0 , NHOP : 32.0 , No : 431.0 , Refu..."
Brand EX Public Areas Launch Month,"01 December 2019 : 4.0 , 01 March 2020 : 11.0 ..."
Brand EX Public Areas Status,"Excluded : 1.0 , NHOP : 31.0 , No : 492.0 , Re..."
Brand HI Guestroom Soft/Hard,"Hard : 60.0 , Soft : 2.0"
Brand HI Guestroom Type,"Generic scheme : 94.0 , Interpretation new sch..."
Brand HI Guestrooms Launch Month,"01 December 2019 : 8.0 , 01 March 2020 : 5.0 ,..."


In [7]:
def join_tables(df):
    
    '''Returns the two dataframes above joined together (value_count + data_audit)'''
    
    #create dataframes using data_audit function (see above) and value_counts function (see above)
    data_audited = data_audit(df)
    value_counts = value_count(df)
    
    #reset_index
    data_audited = data_audited.reset_index()
    value_counts = value_counts.reset_index()
    
    #rename value count table to index (for merging of dataframes)
    value_counts.rename(columns={'level_0': 'index'}, inplace=True)
    
    #inner join the two dataframes
    df = pd.merge(data_audited, value_counts, on=['index'], how='inner')
    
    #copy dataframe to clipboard (so it can be pasted into Excel or another application)
    print(df.to_clipboard())
    
    return df

In [8]:
join_tables(df)

Your selected dataframe has 50 columns.
There are 35 columns that have missing values.
There are 0 columns that have no data.
None
None


Unnamed: 0,index,Missing Values,% of Total Missing Values,Value_counts
0,Fac ID,0,0.0,"F10016 : 1.0 , F10032 : 1.0 , F10056 : 1.0 , F..."
1,Hotel Code,1,0.09,"ABZAA : 1.0 , ABZCC : 1.0 , ABZEC : 1.0 , ABZE..."
2,Hotel Name,0,0.0,"ANA Crowne Plaza Chitose : 1.0 , ANA Crowne Pl..."
3,Open/Closed,0,0.0,"Closed (2018) : 17.0 , Closed (2019) : 8.0 , O..."
4,Brand,0,0.0,"CP : 191.0 , EX : 312.0 , HI : 397.0 , IC : 10..."
5,Business Unit,0,0.0,"AUAJ : 80.0 , EUR : 747.0 , IMEA : 160.0 , SEA..."
6,Geography Division,0,0.0,"Africa : 1.0 , Africa Franchise : 20.0 , Austr..."
7,Ops VP,1,0.09,"Amith Khanna : 41.0 , Aron Libinson : 38.0 , B..."
8,Ops Director/RGM,34,3.14,"Adam McDonald : 3.0 , Alexander Preusser : 73...."
9,AGM,697,64.42,"(Katara hotel) : 1.0 , (Poland hotels) : 2.0 ,..."


## Missing Values Table
- The code below returns a dataframe listing only the columns that have missing data

In [9]:
def missing_values_table(df):
    '''Creates a dataframe listing only the columns that have missing data'''
    
    # Total missing values
    mis_val = df.isnull().sum()
        
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
        
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
    columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
    # Keep only columns with missing values, sorted by percentage missing (descending)
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
    '% of Total Values', ascending=False).round(1)
        
    # Print some summary information
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
    # Return the dataframe with missing information
    return mis_val_table_ren_columns
    


In [10]:
missing_values_table(df)

Your selected dataframe has 50 columns.
There are 35 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
Hotel Closed Date,1042,96.3
Brand HI Open Lobby Refurb,1023,94.5
Brand HI Guestroom Soft/Hard,1020,94.3
Brand HI MGS Launch Month,1010,93.3
Brand EX Public Areas Launch Month,996,92.1
Brand HI Guestroom Type,962,88.9
Brand EX Guestrooms Launch Month,933,86.2
Brand HI Guestrooms Launch Month,924,85.4
ERC Grouping,903,83.5
Brand HI Open Lobby Launch Month,852,78.7


## Removing Missing Data
- The code below runs the missing values table
- It then removes the columns in which more than x% of the data is missing
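
To make the thresholding step concrete, here is a minimal sketch on a made-up dataframe. The `toy` data and the names `pct_missing` and `kept` are hypothetical; it uses a boolean column mask rather than the `missing_values_table` helper, so treat it as an alternative illustration rather than the notebook's method.

```python
import pandas as pd

# Hypothetical toy data: column "b" is 75% missing, "c" is 25% missing
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [None, None, None, 1.0],
    "c": [1.0, None, 3.0, 4.0],
})

x = 50  # threshold: drop columns with more than x% missing values

# Percentage of missing values per column
pct_missing = 100 * toy.isnull().mean()

# Keep only the columns at or below the threshold
kept = toy.loc[:, pct_missing <= x]
print(list(kept.columns))
```

Note that this sketch compares unrounded percentages, whereas the function in this notebook compares values rounded to one decimal place, so the two can differ for columns that sit very close to the threshold.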


In [11]:
#df = dataframe
#x = the % of missing data
def remove_incomplete_columns(df, x):
    
    '''Removes the columns in which more than x% of the values are missing'''
    
    #use the missing_values_table function (see above)
    missing_df = missing_values_table(df)
    
    #find the columns which are more than x% missing
    missing_columns = list(missing_df[missing_df['% of Total Values'] > x].index)
    print('We will remove %d columns.' % len(missing_columns))
    
    #drop those columns and reassign the dataframe
    df = df.drop(columns = missing_columns)
    
    return df

In [14]:
df = remove_incomplete_columns(df,50)

Your selected dataframe has 50 columns.
There are 35 columns that have missing values.
We will remove 19 columns.


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1082 entries, 0 to 1081
Data columns (total 31 columns):
Fac ID                          1082 non-null object
Hotel Code                      1081 non-null object
Hotel Name                      1082 non-null object
Open/Closed                     1082 non-null object
Brand                           1082 non-null object
Business Unit                   1082 non-null object
Geography Division              1082 non-null object
Ops VP                          1081 non-null object
Ops Director/RGM                1048 non-null object
Hotel Type                      1082 non-null object
Country Brand Grouping          1082 non-null object
RGI Comparable View             1082 non-null object
Owner Account                   709 non-null object
Management Company              714 non-null object
Country                         1082 non-null object
Prioritisation Score            729 non-null object
Lat                             1082 non-null fl