## Wave 1 Data Cleaning

This notebook cleans the Wave 1 survey data. For any wave 1 model to be applied correctly to the wave 2 data, each variable's name and values must be consistent with the corresponding variable in the wave 2 data.

The cleaning process includes:
- Deleting unnecessary variables
- Recoding non-response as null, and deleting records with null values
- Recoding values where necessary so that they are consistent across the wave 1 and wave 2 data files

**Variables of interest**
As noted above, only certain variables will be retained for analysis. These are a set of demographic and attitudinal variables which are present in both the wave 1 and 2 survey data (so that a wave 1 model can be tested on the wave 2 dataset). 

The variables of interest are (see separate data dictionary for further info):
- DEMAGE: Respondent Age
- DEMREG: UK Region of residence
- DEMSEX: Respondent Gender
- DEMEDU: Respondent Education level
- DEMWRK: Respondent Working status
- DEMREL: Respondent Religion
- DEMINC: Respondent Income
- COV_SHIELD: Have you been shielding because you are in a vulnerable group for coronavirus (COVID-19)?
- COV_TRUST_1: What sources of information about coronavirus (COVID-19) do you trust? - National Television
- COV_TRUST_2: What sources of information about coronavirus (COVID-19) do you trust? - Satellite/International television
- COV_TRUST_3: What sources of information about coronavirus (COVID-19) do you trust? - Radio
- COV_TRUST_4: What sources of information about coronavirus (COVID-19) do you trust? - Newspapers
- COV_TRUST_5: What sources of information about coronavirus (COVID-19) do you trust? - Social media (Facebook, Twitter, etc)
- COV_TRUST_6: What sources of information about coronavirus (COVID-19) do you trust? - National Public Health Authorities
- COV_TRUST_7: What sources of information about coronavirus (COVID-19) do you trust? - Healthcare Workers
- COV_TRUST_8: What sources of information about coronavirus (COVID-19) do you trust? - International health authorities 
- COV_TRUST_9: What sources of information about coronavirus (COVID-19) do you trust? - Government websites 
- COV_TRUST_10: What sources of information about coronavirus (COVID-19) do you trust? - The internet or search engines  
- COV_TRUST_11: What sources of information about coronavirus (COVID-19) do you trust? - Family and friends
- COV_TRUST_12: What sources of information about coronavirus (COVID-19) do you trust? - Work, school, or college
- COV_TRUST_13: What sources of information about coronavirus (COVID-19) do you trust? - Other
- VAC_DEC: Are you currently responsible for decisions relating to the vaccination of children?
- COV_VAC_SELF: If a new coronavirus (COVID-19) vaccine became available, would you accept the vaccine for yourself?
- COV_KNOWL_1: Agree/Disagree - Washing hands with soap or sanitiser can help prevent the spread of coronavirus (COVID-19)
- COV_KNOWL_2: Agree/Disagree - Staying in your own home and reducing contact with others can help protect you against catching coronavirus (COVID-19)
- COV_KNOWL_3: Agree/Disagree - Staying indoors and reducing contact with others can help protect others from catching coronavirus (COVID-19) 
- COV_KNOWL_4: Agree/Disagree - If you catch coronavirus (COVID-19), you can infect somebody else before you have developed symptoms
- COV_KNOWL_5: Agree/Disagree - On average, before lockdown, someone with coronavirus (COVID-19) would have infected 2-3 other people
- COV_KNOWL_6: Agree/Disagree - Treatments exist to prevent you catching coronavirus (COVID-19)
- COV_KNOWL_7: Agree/Disagree - Coronavirus (COVID-19) may be caused by mobile network signals or mobile network towers 
- DREAD: Agree/Disagree - When I think about coronavirus (COVID-19), I find it difficult to think calmly about it and feel great dread or fear
- ANX_1: Over the past three months I have been feeling - Calm
- ANX_2: Over the past three months I have been feeling - Tense
- ANX_3: Over the past three months I have been feeling - Upset
- ANX_4: Over the past three months I have been feeling - Relaxed
- ANX_5: Over the past three months I have been feeling - Content
- ANX_6: Over the past three months I have been feeling - Worried
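
Since the variables listed above need matching names and value labels in both waves, a quick consistency check can help before any recoding. The following is a minimal sketch only: `wave1_toy`, `wave2_toy` and `compare_waves` are hypothetical names, and the toy data stands in for the real survey files.

```python
import pandas as pd

# Toy stand-ins for the two survey waves (hypothetical data, not the real files).
wave1_toy = pd.DataFrame({"DEMSEX": ["Male", "Female"], "DEMREG": ["Wales", "Scotland"]})
wave2_toy = pd.DataFrame({"DEMSEX": ["Female", "Male"], "DEMREG": ["Wales", "Northern Ireland"]})


def compare_waves(df_a, df_b, variables):
    """Report variables missing from either frame, and value labels seen in only one frame."""
    report = {}
    for var in variables:
        if var not in df_a.columns or var not in df_b.columns:
            report[var] = "missing from at least one wave"
            continue
        labels_a = set(df_a[var].dropna().unique())
        labels_b = set(df_b[var].dropna().unique())
        only_a = labels_a - labels_b
        only_b = labels_b - labels_a
        if only_a or only_b:
            report[var] = {"only_in_first": sorted(only_a), "only_in_second": sorted(only_b)}
    return report


# "DEMAGE" is deliberately absent from the toy frames to show the missing-variable case.
report = compare_waves(wave1_toy, wave2_toy, ["DEMSEX", "DEMREG", "DEMAGE"])
print(report)
```

Variables that agree across both frames are left out of the report, so an empty dictionary would indicate that no inconsistencies were found for the variables checked.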

In [1]:
# Importing Necessary Packages
import numpy as np
import pandas as pd
import scipy.stats as stats

In [2]:
# Importing data as pandas dataframe
df_wave1 = pd.read_csv("C:/Users/laure/Documents/lshtm_raw_data/COVID-19_wave_1.csv", encoding='latin')

In [3]:
# Printing value counts for each variable to get an understanding of the data.
for column in df_wave1.columns:
    print(column)
    print()
    print(df_wave1[column].value_counts())
    print()
    print()
    print("-------------------")
    print()

StartDate

27/09/2020 03:21    73
27/09/2020 03:37    70
27/09/2020 03:49    69
27/09/2020 03:47    67
27/09/2020 03:43    64
                    ..
29/09/2020 00:06     1
29/09/2020 00:03     1
29/09/2020 00:13     1
29/09/2020 00:04     1
14/10/2020 04:40     1
Name: StartDate, Length: 4568, dtype: int64


-------------------

EndDate

27/09/2020 03:39    68
27/09/2020 03:52    67
27/09/2020 03:40    60
27/09/2020 03:53    59
27/09/2020 03:49    59
                    ..
28/09/2020 22:03     1
28/09/2020 22:42     1
28/09/2020 22:51     1
28/09/2020 23:09     1
14/10/2020 04:44     1
Name: EndDate, Length: 4569, dtype: int64


-------------------

Status

IP Address    17000
Spam              2
Name: Status, dtype: int64


-------------------

IPAddress

78.129.209.36     6
78.129.209.34     5
37.220.16.34      4
185.125.226.42    4
78.129.209.35     4
                 ..
86.6.202.181      1
146.199.246.58    1
109.159.33.49     1
80.0.170.247      1
79.68.51.160      1
Name: IPAddre

In [4]:
# Deleting unnecessary variables
deletion_list = ["StartDate",
"EndDate",
"Status",
"IPAddress",
"Progress",
"Duration__in_seconds_",
"Finished",
"RecordedDate",
"ResponseId",
"RecipientLastName",
"RecipientFirstName",
"RecipientEmail",
"LocationLatitude",
"LocationLongitude",
"DistributionChannel",
"UserLanguage",
"Postcode",
"COV_BEHAV_1",
"COV_BEHAV_2",
"COV_BEHAV_3",
"COV_BEHAV_4",
"COV_BEHAV_5",
"COV_VAC_WHEN",
"COV_VAC_WHY_1",
"COV_VAC_WHY_2",
"COV_VAC_WHY_3",
"COV_VAC_WHY_4",
"COV_VAC_WHY_5",
"COV_VAC_WHY_6",
"COV_VAC_WHY_13",
"COV_VAC_WHY_7",
"COV_VAC_WHY_8",
"COV_VAC_WHY_9",
"COV_VAC_WHY_12",
"COV_VAC_WHY_9_TEXT",
"COV_VAC_OTHERS",
"COV_POS_1",
"COV_POS_2",
"COV_POS_3",
"COV_POS_4",
"COV_POS_5",
"VAC_CHI_REFUSE",
"VAC_CHI_WHICH_1",
"VAC_CHI_WHICH_2",
"VAC_CHI_WHICH_4",
"VAC_CHI_WHICH_5",
"VAC_CHI_WHICH_5_TEXT",
"COV_VAC_CHI",
"COV_VAC_CHI_WHY_1",
"COV_VAC_CHI_WHY_2",
"COV_VAC_CHI_WHY_3",
"COV_VAC_CHI_WHY_4",
"COV_VAC_CHI_WHY_5",
"COV_VAC_CHI_WHY_6",
"COV_VAC_CHI_WHY_11",
"COV_VAC_CHI_WHY_7",
"COV_VAC_CHI_WHY_8",
"COV_VAC_CHI_WHY_9",
"COV_VAC_CHI_WHY_10",
"COV_VAC_CHI_WHY_9_TEXT",
"COV_TRUST_13_TEXT",
"FLU_VAC_OFF",
"FLU_VAC_COND",
"FLU_VAC_OTH",
"FLU_VAC_12",
"FLU_VAC_HES",
"FLU_VAC_HES_WHY_1",
"FLU_VAC_HES_WHY_2",
"FLU_VAC_HES_WHY_3",
"FLU_VAC_HES_WHY_8",
"FLU_VAC_HES_WHY_10",
"FLU_VAC_HES_WHY_8_TEXT",
"MMR_VAC_CHI",
"Age",
"state",
"Time_check",
"straight_COV_KNOWL_1",
"straight_ANX",
"straight_VAC_HES",
"straight_no",
"straight_rate", 
"VAC_HES_1",
"VAC_HES_2",
"VAC_HES_3",
"VAC_HES_4",
"VAC_HES_5",
"DEMETH"
]


df_wave1.drop(deletion_list, axis=1, inplace=True)
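
With its default settings, `DataFrame.drop` raises a `KeyError` if any listed label is not present, so a misspelt entry in a long deletion list stops the cell. The sketch below shows one possible pre-check on a toy frame; `toy` and `toy_deletion_list` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy frame and deletion list (hypothetical); "Progress" is absent on purpose.
toy = pd.DataFrame({"StartDate": ["27/09/2020"], "DEMAGE": [40], "Postcode": ["AB1"]})
toy_deletion_list = ["StartDate", "Postcode", "Progress"]

# Listing names that are not in the data makes typos visible before dropping.
missing = [name for name in toy_deletion_list if name not in toy.columns]
print("Not found in the data:", missing)

# Dropping only the names that actually exist.
present = [name for name in toy_deletion_list if name in toy.columns]
toy_clean = toy.drop(columns=present)
print(list(toy_clean.columns))
```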

In [5]:
# Checking final variable list
df_wave1.columns

Index(['DEMAGE', 'DEMREG', 'DEMSEX', 'DEMEDU', 'DEMWRK', 'DEMREL', 'DEMLAN',
       'DEMINC', 'COV_VAC_SELF', 'COV_SHIELD', 'COV_TRUST_1', 'COV_TRUST_2',
       'COV_TRUST_3', 'COV_TRUST_4', 'COV_TRUST_5', 'COV_TRUST_6',
       'COV_TRUST_7', 'COV_TRUST_8', 'COV_TRUST_9', 'COV_TRUST_10',
       'COV_TRUST_11', 'COV_TRUST_12', 'COV_TRUST_13', 'COV_KNOWL_1',
       'COV_KNOWL_2', 'COV_KNOWL_3', 'COV_KNOWL_4', 'COV_KNOWL_5',
       'COV_KNOWL_6', 'COV_KNOWL_7', 'VAC_DEC', 'ANX_1', 'ANX_2', 'ANX_3',
       'ANX_4', 'ANX_5', 'ANX_6', 'DREAD'],
      dtype='object')

In [6]:
# Recoding values where necessary, so that values are consistent across both wave 1 and wave 2 data.
# Setting dictionaries to assign new labels to old labels.

region = {'East Midlands': "East Midlands",
'North East': "North East",
'North West': "North West",
'West Midlands': "West Midlands",
'South East': "South East",
'Northern Irelend': "Northern Ireland",
'Scotland': "Scotland",
'Wales': "Wales",
'Yorkshire and The Humber': "Yorkshire and The Humber",
'East of England': "East of England",
'Greater London': "Greater London",
'South West': "South West"
}

education = {'No academic qualifications': "No academic qualifications",
'0-4 GCSE, O-levels, or equivalents': "0-4 GCSE, O-levels, or equivalents",
'5+ GCSE, O-levels, 1 A level, or equivalents': "5+ GCSE, O-levels, 1 A level, or equivalents",
'Apprenticeship': "Apprenticeship",
'2+ A levels or equivalents': "2+ A levels or equivalents",
'Undergraduate or postgraduate degree, or other professional qualification': "Undergraduate or postgraduate degree",
'Other (e.g. vocational, foreign qualifications)': "Other",
'Do not know': "Do not know",
'Do not wish to answer': "Do not wish to answer",
}

work = {'Working full-time (including self-employed)': "Working full-time",
'Working part-time (including self-employed)': "Working part-time",
'Unemployed': "Unemployed",
'Student': "Student",
'Looking after the home': "Looking after the home",
'Retired': "Retired",
'Do not wish to answer': "Do not wish to answer",
'Unable to work (e.g. short- or long-term disability)': "Unable to work",
}

In [7]:
# Creating a list of the variables whose values will be relabelled, and a matching list of value dictionaries.
relabel_list = ["DEMREG",
"DEMEDU",
"DEMWRK"]

values_dict_list = [region, education, work]

In [8]:
for item1, item2 in zip(relabel_list, values_dict_list): # Pairing each variable with its corresponding value dictionary
    new_name = item1 + '_new' # Creating a new variable name per variable for value relabelling.
    df_wave1[new_name] = df_wave1[item1].map(item2) # Mapping the new values onto the new variable
    df_wave1.drop(item1, axis=1, inplace=True) # Dropping the original variable
    df_wave1.rename(columns={new_name: item1}, inplace=True) # Relabelling the new variable to the original variable name for consistency
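
One property of `Series.map` with a dictionary is worth keeping in mind here: any value without a dictionary entry becomes null, and would then be removed by a later `dropna()`. The sketch below, on made-up data, lists unmapped labels before recoding; `toy` and `toy_work` are hypothetical names.

```python
import pandas as pd

# Toy data (hypothetical); "Furloughed" is a made-up label with no dictionary entry.
toy = pd.DataFrame({"DEMWRK": ["Retired", "Student", "Furloughed"]})
toy_work = {"Retired": "Retired", "Student": "Student"}

# Labels present in the data but absent from the dictionary keys.
unmapped = sorted(set(toy["DEMWRK"].dropna().unique()) - set(toy_work))
print("Labels with no dictionary entry:", unmapped)

# map() turns the unmapped label into NaN rather than raising an error.
recoded = toy["DEMWRK"].map(toy_work)
print(recoded.isna().sum())
```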

In [9]:
# Creating function to recode non-response values as null

def col_recode(x):
    """Recode survey non-response values as null, and '#NULL!' as 0.
    
    :param x: Original cell value for recoding.
    :return: Result of any necessary recoding.
    """
    if x == 'Do not wish to answer':
        result = np.nan
    elif x == 'Do not know':
        result = np.nan
    elif x == '#NULL!':
        # Note: '#NULL!' becomes 0 rather than null, so it is not removed by the later dropna().
        result = 0
    else:
        result = x
    return result

In [10]:
# Applying the non-response to null recode across the whole data file.
df1 = df_wave1.applymap(col_recode)
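
`DataFrame.applymap` is deprecated from pandas 2.1 in favour of `DataFrame.map`. As a hedged alternative that should behave the same on older and newer versions, the recode can be applied column by column with `Series.map`. The sketch below restates the recode rule in compact form and runs it on made-up values ("Band A" and "Selected" are placeholders, not real survey labels).

```python
import numpy as np
import pandas as pd


def col_recode(x):
    """Compact restatement of the notebook's recode rule."""
    if x in ("Do not wish to answer", "Do not know"):
        return np.nan
    if x == "#NULL!":
        return 0
    return x


# Toy frame with placeholder labels (hypothetical data).
toy = pd.DataFrame(
    {
        "DEMINC": ["Do not wish to answer", "Band A"],
        "COV_TRUST_1": ["Selected", "#NULL!"],
    },
    dtype=object,
)

# Applying the element-wise recode to every column in turn.
toy_recoded = toy.apply(lambda column: column.map(col_recode))
print(toy_recoded)
```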

In [11]:
# Dropping those records coded as 'Other' in the gender column, as there are so few cases (<1%) and any analysis will not be meaningful.
gender_filter = df1[df1['DEMSEX'] == 'Other']
df1.drop(gender_filter.index, axis=0, inplace=True)
df1['DEMSEX'].value_counts()

Female    8754
Male      8207
Name: DEMSEX, dtype: int64

In [12]:
# Dropping the language variable entirely: its distribution is so heavily skewed towards English/Welsh that no meaningful analysis is expected.
df1.drop(['DEMLAN'], axis=1, inplace=True)

In [13]:
# Preparing Categorical variables for dummification, by assigning to a dummy list.

vars_for_dummy = [
    "DEMREG",
    "DEMSEX",
    "DEMEDU",
    "DEMWRK",
    "DEMREL",
    "DEMINC",
    "COV_SHIELD",
    "COV_TRUST_1",
    "COV_TRUST_2",
    "COV_TRUST_3",
    "COV_TRUST_4",
    "COV_TRUST_5",
    "COV_TRUST_6",
    "COV_TRUST_7",
    "COV_TRUST_8",
    "COV_TRUST_9",
    "COV_TRUST_10",
    "COV_TRUST_11",
    "COV_TRUST_12",
    "COV_TRUST_13"]

In [14]:
# Recoding the vaccine decision responsibility variable from string to integer (i.e. creating a dummy variable)
decision_values = {'Yes': 1, 'No': 0}
df1['VAC_DEC_new'] = df1['VAC_DEC'].map(decision_values)
df1.drop(['VAC_DEC'], axis=1, inplace=True)
df1.rename(columns={'VAC_DEC_new': 'VAC_DEC'}, inplace=True)

In [15]:
# Preparing scale variables for recoding to integer values. Grouping each scale type into lists and creating value recode dictionaries.

knowledge_scale = ['COV_KNOWL_1', 'COV_KNOWL_2', 'COV_KNOWL_3', 'COV_KNOWL_4', 'COV_KNOWL_5', 
                   'COV_KNOWL_6', 'COV_KNOWL_7', 'DREAD']
knowledge_values = {'Strongly disagree': 1, 'Tend to disagree': 2, 'Tend to agree': 3, 'Strongly agree': 4}

emotion_scale = ['ANX_1', 'ANX_2', 'ANX_3', 'ANX_4', 'ANX_5', 'ANX_6']
emotion_values = {'Not at all': 1, 'Somewhat': 2, 'Very much': 3}

willingness_values = {'Yes, definitely': 1, 'Unsure, but leaning towards yes': 2, 
                      'Unsure, but leaning towards no': 3, 'No, definitely not': 4}

In [16]:
# Recoding the COV_VAC_SELF variable.
df1['COV_VAC_SELF_new'] = df1['COV_VAC_SELF'].map(willingness_values)
df1.drop(['COV_VAC_SELF'], axis=1, inplace=True)
df1.rename(columns={'COV_VAC_SELF_new': 'COV_VAC_SELF'}, inplace=True)

In [17]:
# Recoding the knowledge scale variables.
for column in knowledge_scale:
    new_name = column + '_new'
    df1[new_name] = df1[column].map(knowledge_values)
    df1.drop(column, axis=1, inplace=True)
    df1.rename(columns={new_name: column}, inplace=True)

In [18]:
# Recoding the emotional scale variables.
for column in emotion_scale:
    new_name = column + '_new'
    df1[new_name] = df1[column].map(emotion_values)
    df1.drop(column, axis=1, inplace=True)
    df1.rename(columns={new_name: column}, inplace=True)

In [19]:
# Dropping all null values across the dataset.
df1.dropna(inplace=True)
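
Dropping every record with any null value can remove a large share of the sample: the notebook's outputs show 16,961 records after the gender filter and 9,587 after this step. A per-column null count, taken before dropping, shows which variables account for the loss. The sketch below uses a small made-up frame; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with some nulls (hypothetical values).
toy = pd.DataFrame({
    "DEMAGE": [34, 51, 67, 45],
    "DEMINC": ["Low", np.nan, "High", np.nan],
    "DREAD": [2, 3, np.nan, 4],
})

# Number of nulls per column, largest first.
nulls_per_column = toy.isna().sum().sort_values(ascending=False)
print(nulls_per_column)

# Row counts before and after listwise deletion.
rows_before = len(toy)
rows_after = len(toy.dropna())
print(rows_before, rows_after)
```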

In [20]:
# Checking cleaned value counts, including as percentages, to identify if any further value recategorisation is required.
for column in df1.columns:
    print()
    print(column)
    print(df1[column].value_counts())
    print()
    print()
    print(column)
    print(df1[column].value_counts(normalize=True))
    print()
    print()
    print("---------")


DEMAGE
65    268
49    220
50    204
66    197
40    194
     ... 
87      4
86      3
89      2
88      1
90      1
Name: DEMAGE, Length: 73, dtype: int64


DEMAGE
65    0.027955
49    0.022948
50    0.021279
66    0.020549
40    0.020236
        ...   
87    0.000417
86    0.000313
89    0.000209
88    0.000104
90    0.000104
Name: DEMAGE, Length: 73, dtype: float64


---------

DEMSEX
Male      4919
Female    4668
Name: DEMSEX, dtype: int64


DEMSEX
Male      0.513091
Female    0.486909
Name: DEMSEX, dtype: float64


---------

DEMREL
Christian              5114
Atheist or agnostic    3221
Other                   726
Muslim                  278
Hindu                   108
Jewish                   77
Buddhist                 63
Name: DEMREL, dtype: int64


DEMREL
Christian              0.533431
Atheist or agnostic    0.335976
Other                  0.075728
Muslim                 0.028998
Hindu                  0.011265
Jewish                 0.008032
Buddhist               0.006571

In [21]:
# Recategorising the religion variable so that the smallest groups (Hindu, Jewish and Buddhist, each about 1% of the sample or less) are merged into the 'Other' category. Muslim (2.9%) is kept as its own category.
df1['DEMREL'] = df1['DEMREL'].apply(lambda x: 'Other' if x in ('Hindu', 'Jewish', 'Buddhist') else x)

In [22]:
# Creating two extra binary target variables (1 = yes, 0 = no) for use in modelling. target_1 only codes those answering 'Yes, definitely' or 'No, definitely not'; unsure responses are left null. target_2 groups the unsure responses into the category towards which they are leaning.

df1['target_1'] = df1['COV_VAC_SELF'].apply(lambda x: 1 if x == 1 else 0 if x == 4 else np.nan)
df1['target_2'] = df1['COV_VAC_SELF'].apply(lambda x: 1 if (x == 1 or x == 2) else 0 if (x == 3 or x == 4) else np.nan)
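
To make the two target definitions concrete, the sketch below derives both on a toy column holding each of the four `COV_VAC_SELF` codes. It uses `Series.map` with dictionaries as an equivalent formulation, not the notebook's own code; the frame `toy` is hypothetical.

```python
import pandas as pd

# One row per COV_VAC_SELF code: 1 = yes definitely, 2 = leaning yes, 3 = leaning no, 4 = definitely not.
toy = pd.DataFrame({"COV_VAC_SELF": [1, 2, 3, 4]})

# target_1: only the two definite answers are coded; codes 2 and 3 have no entry and become NaN.
toy["target_1"] = toy["COV_VAC_SELF"].map({1: 1, 4: 0})

# target_2: unsure answers are folded into the side they lean towards.
toy["target_2"] = toy["COV_VAC_SELF"].map({1: 1, 2: 1, 3: 0, 4: 0})

print(toy)
```

Because `target_1` is created after the null-dropping step, the unsure records remain in the data with a null `target_1`, so any model using it would need to filter those rows first.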

In [23]:
# Dummifying the dataset
wave1 = pd.get_dummies(df1, columns=vars_for_dummy)
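
As a small illustration of what `pd.get_dummies` does to a listed column, the sketch below dummifies a made-up two-row frame. Each category becomes its own column named `<variable>_<value>`, and columns not listed are left unchanged. The dtype of the indicator columns depends on the pandas version, so only the column names are shown.

```python
import pandas as pd

# Toy frame (hypothetical values).
toy = pd.DataFrame({"DEMAGE": [34, 51], "DEMSEX": ["Male", "Female"]})

# Only DEMSEX is expanded; DEMAGE is carried through as-is.
toy_dummies = pd.get_dummies(toy, columns=["DEMSEX"])
print(list(toy_dummies.columns))
```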

In [24]:
# Getting summary of dummified variables
wave1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9587 entries, 1 to 17000
Data columns (total 87 columns):
 #   Column                                                                                             Non-Null Count  Dtype  
---  ------                                                                                             --------------  -----  
 0   DEMAGE                                                                                             9587 non-null   int64  
 1   VAC_DEC                                                                                            9587 non-null   int64  
 2   COV_VAC_SELF                                                                                       9587 non-null   int64  
 3   COV_KNOWL_1                                                                                        9587 non-null   float64
 4   COV_KNOWL_2                                                                                        9587 non-null   floa

In [25]:
# SAVING DUMMY AND NON DUMMY DATAFILES

In [26]:
# Saving non dummy dataframe to csv for use in subsequent analysis
df1.to_csv("C:/Users/laure/Documents/vaccine_hesitancy_and_uptake/1_data_cleaning/wave_1_vaccine_intention_data_nondum.csv")

In [27]:
# Saving dummified dataframe to csv for use in subsequent analysis
wave1.to_csv("C:/Users/laure/Documents/vaccine_hesitancy_and_uptake/1_data_cleaning/wave_1_vaccine_intention_data.csv")
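
Both files are written with the default `index=True`, so the original row labels are saved as an unnamed first column. The sketch below, using an in-memory buffer in place of a file, shows how that column appears when the CSV is read back, and how `index_col=0` restores it as the index. The frame `toy` is illustrative only.

```python
import io

import pandas as pd

# Toy frame with a non-contiguous index, as left behind by dropping records.
toy = pd.DataFrame({"DEMAGE": [34, 51]}, index=[1, 17000])

# Writing to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading back without index_col: the saved index becomes an extra data column.
buffer.seek(0)
naive = pd.read_csv(buffer)
print(list(naive.columns))

# Reading back with index_col=0: the saved index is restored.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns), list(restored.index))
```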