## Wave 2 Data Cleaning

This notebook cleans the Wave 2 survey data. For any Wave 1 modelling to be applied correctly to the Wave 2 data, every variable common to both data files must have the same name and the same value coding as the corresponding Wave 1 variable.

The cleaning process includes:
- Deleting unnecessary variables
- Recoding non-response as null, and deleting records with null values
- Recoding values where necessary such that they will be consistent across the wave 1 and wave 2 data files, and converting to string/integer as appropriate

**Variables of interest**

As noted above, only certain variables are retained for analysis: a set of demographic and attitudinal variables present in both the Wave 1 and Wave 2 survey data (so that a Wave 1 model can be tested on the Wave 2 dataset), plus extra Wave 2-only variables judged likely to be informative for model refinement.

The variables of interest are (see separate data dictionary for further info):
- DEMAGE: Respondent Age
- DEMREG: UK Region of residence
- DEMSEX: Respondent Gender
- DEMEDU: Respondent Education level
- DEMWRK: Respondent Working status
- DEMREL: Respondent religion
- DEMINC: Respondent Income
- COV_SHIELD: Have you been shielding because you are in a vulnerable group for coronavirus (COVID-19)?
- COV_TRUST_1: What sources of information about coronavirus (COVID-19) do you trust? - National Television
- COV_TRUST_2: What sources of information about coronavirus (COVID-19) do you trust? - Satellite/International television
- COV_TRUST_3: What sources of information about coronavirus (COVID-19) do you trust? - Radio
- COV_TRUST_4: What sources of information about coronavirus (COVID-19) do you trust? - Newspapers
- COV_TRUST_5: What sources of information about coronavirus (COVID-19) do you trust? - Social media (Facebook, Twitter, etc)
- COV_TRUST_6: What sources of information about coronavirus (COVID-19) do you trust? - National Public Health Authorities
- COV_TRUST_7: What sources of information about coronavirus (COVID-19) do you trust? - Healthcare Workers
- COV_TRUST_8: What sources of information about coronavirus (COVID-19) do you trust? - International health authorities 
- COV_TRUST_9: What sources of information about coronavirus (COVID-19) do you trust? - Government websites 
- COV_TRUST_10: What sources of information about coronavirus (COVID-19) do you trust? - The internet or search engines  
- COV_TRUST_11: What sources of information about coronavirus (COVID-19) do you trust? - Family and friends
- COV_TRUST_12: What sources of information about coronavirus (COVID-19) do you trust? - Work, school, or college
- COV_TRUST_13: What sources of information about coronavirus (COVID-19) do you trust? - Other
- VAC_DEC: Are you currently responsible for decisions relating to the vaccination of children?
- COV_VAC_SELF: If a new coronavirus (COVID-19) vaccine became available, would you accept the vaccine for yourself?
- COV_KNOWL_1: Agree/Disagree - Washing hands with soap or sanitiser can help prevent the spread of coronavirus (COVID-19)
- COV_KNOWL_2: Agree/Disagree - Staying in your own home and reducing contact with others can help protect you against catching coronavirus (COVID-19)
- COV_KNOWL_3: Agree/Disagree - Staying indoors and reducing contact with others can help protect others from catching coronavirus (COVID-19) 
- COV_KNOWL_4: Agree/Disagree - If you catch coronavirus (COVID-19), you can infect somebody else before you have developed symptoms
- COV_KNOWL_5: Agree/Disagree - On average, before lockdown, someone with coronavirus (COVID-19) would have infected 2-3 other people
- COV_KNOWL_6: Agree/Disagree - Treatments exist to prevent you catching coronavirus (COVID-19)
- COV_KNOWL_7: Agree/Disagree - Coronavirus (COVID-19) may be caused by mobile network signals or mobile network towers 
- DREAD: Agree/Disagree - When I think about coronavirus (COVID-19), I find it difficult to think calmly about it and feel great dread or fear
- ANX_1: Over the past three months I have been feeling - Calm
- ANX_2: Over the past three months I have been feeling - Tense
- ANX_3: Over the past three months I have been feeling - Upset
- ANX_4: Over the past three months I have been feeling - Relaxed
- ANX_5: Over the past three months I have been feeling - Content
- ANX_6: Over the past three months I have been feeling - Worried
- VAC_PASS_1: Agree/Disagree - Proof of vaccination via a vaccine certificate or passport for social events infringes on personal liberties 
- VAC_PASS_2: Agree/Disagree - I wish to be free to reject a vaccine without consequences on my ability to attend public or social events 
- VAC_PASS_3: Agree/Disagree - Individuals who reject a vaccine should not be allowed to attend social events
- VAC_PASS_4: Agree/Disagree - Private companies should have the right to reject individuals if they have not received a vaccine 
- VAC_PASS_5: Agree/Disagree - Biological or immunological status should not be used as a condition for entry to social events
- VAC_PASS_6: Agree/Disagree - I am more likely to accept a vaccine if it is required for my job 
- VAC_PASS_7: Agree/Disagree - Vaccine certificates or passports do not compel people to get vaccinated 
- VAC_PASS_UK: If a coronavirus (COVID-19) certificate or passport was required to attend social events, would you be more or less inclined to accept a vaccine
- VAC_PASS_INT: If a coronavirus (COVID-19) certificate or passport was required for international travel, would you be more or less inclined to accept a vaccine
- COV_INV: Have you received an invitation to receive a coronavirus (COVID-19) vaccine?
- DEMNHS: In the last 12 months, have you worked for the NHS as a healthcare professional
- Q71: Which party would you vote for if there was a general election tomorrow?
- MIST_1: Fake/Real - Article 1
- MIST_2: Fake/Real - Article 2
- MIST_3: Fake/Real - Article 3
- MIST_4: Fake/Real - Article 4
- MIST_5: Fake/Real - Article 5
- MIST_6: Fake/Real - Article 6
- MIST_7: Fake/Real - Article 7
- MIST_8: Fake/Real - Article 8
- MIST_9: Fake/Real - Article 9
- MIST_10: Fake/Real - Article 10
- MIST_11: Fake/Real - Article 11
- MIST_12: Fake/Real - Article 12
- MIST_13: Fake/Real - Article 13
- MIST_14: Fake/Real - Article 14
- MIST_15: Fake/Real - Article 15
- MIST_16: Fake/Real - Article 16
- MIST_17: Fake/Real - Article 17
- MIST_18: Fake/Real - Article 18
- MIST_19: Fake/Real - Article 19
- MIST_20: Fake/Real - Article 20


In [1]:
# Importing Necessary Packages
import numpy as np
import pandas as pd
import scipy.stats as stats

In [2]:
# Importing data as pandas dataframe
df_wave2 = pd.read_csv("C:/Users/laure/Documents/lshtm_raw_data/COVID-19_wave_2_april_2021.csv")

In [3]:
# The age variable (DEMAGENUM) appears to be stored as an integer code equal to the actual age minus 17.
# Recoding to true age by adding 17 to the whole column at once.
df_wave2['DEMAGE'] = df_wave2['DEMAGENUM'] + 17

In [4]:
# Deleting unnecessary variables
deletion_vars = ["StartDate",
"EndDate",
"Status",
"Progress",
"Duration__in_seconds_",
"Finished",
"RecordedDate",
"ResponseId",
"DistributionChannel",
"UserLanguage",
"UNSURE_SECOND_DOSE_16",
"UNSURE_SECOND_DOSE_1",
"UNSURE_SECOND_DOSE_14",
"UNSURE_SECOND_DOSE_13",
"UNSURE_SECOND_DOSE_3",
"UNSURE_SECOND_DOSE_4",
"UNSURE_SECOND_DOSE_5",
"UNSURE_SECOND_DOSE_15",
"UNSURE_SECOND_DOSE_6",
"UNSURE_SECOND_DOSE_7",
"UNSURE_SECOND_DOSE_8",
"UNSURE_SECOND_DOSE_9",
"UNSURE_SECOND_DOSE_12",
"UNSURE_SECOND_DOSE_9_TEXT",
"COV_LOC",
"COV_LOC_6_TEXT",
"COV_VAC_Y_16",
"COV_VAC_Y_1",
"COV_VAC_Y_17",
"COV_VAC_Y_14",
"COV_VAC_Y_13",
"COV_VAC_Y_3",
"COV_VAC_Y_4",
"COV_VAC_Y_5",
"COV_VAC_Y_15",
"COV_VAC_Y_6",
"COV_VAC_Y_7",
"COV_VAC_Y_8",
"COV_VAC_Y_9",
"COV_VAC_Y_12",
"COV_VAC_Y_9_TEXT",
"COV_APT",
"COV_APT_Y_3",
"COV_APT_Y_4",
"COV_APT_Y_10",
"COV_APT_Y_5",
"COV_APT_Y_17",
"COV_APT_Y_1",
"COV_APT_Y_6",
"COV_APT_Y_7",
"COV_APT_Y_6_TEXT",
"COV_LEG",
"COV_LEG_4_TEXT",
"COV_CON_16",
"COV_CON_3",
"COV_CON_4",
"COV_CON_5",
"COV_CON_6",
"COV_CON_7",
"COV_CON_12",
"COV_CON_8",
"COV_CON_9",
"COV_CON_10",
"COV_CON_11",
"COV_CON_13",
"COV_CON_15",
"COV_CON_14",
"COV_CON_13_TEXT",
"COV_INFO_13_TEXT",
"VAC_CHI_REFUSE",
"VAC_CHI_WHICH_1",
"VAC_CHI_WHICH_2",
"VAC_CHI_WHICH_3",
"VAC_CHI_WHICH_4",
"VAC_CHI_WHICH_5",
"VAC_CHI_WHICH_5_TEXT",
"COV_VAC_CHI",
"Q61_1",
"Q61_2",
"Q62_1",
"state",
"Age",
"RecipientFirstName",
"Group",
"Age_Grouped",
"DEMLAN",
"DEMAGENUM",
"COV_LOC_6_TEXT___Parent_Topics",
"COV_LOC_6_TEXT___Topics",
"Unique_ID",
"WEIGHT",
"VAC_ACC_1",
"VAC_ACC_2",
"VAC_ACC_3",
"VAC_ACC_4",
"VAC_ACC_5",
"VAC_ACC_6",
"VAC_ACC_7",
"VAC_ACC_8",
"VAC_ACC_9",                 
"VAC_HES_1",
"VAC_HES_2",
"VAC_HES_3",
"VAC_HES_4",
"VAC_HES_5",
"VAC_HES_6",
"VAC_HES_7",
"VAC_HES_8",
"VAC_HES_9",
"VAC_HES_10",
"VAC_HES_11",
"VAC_HES_12",
"VAC_HES_13",
"DEMETH",
"Q60"]

w2 = df_wave2.drop(deletion_vars, axis=1)

In [5]:
# Flipping dread scale so that values match wave 1 values (i.e. disagree = low score, agree = high score).
# Code 5 is left unchanged here; it is treated as non-response (Don't know) and dropped in a later cell.
dread_vals = {1: 4, 2: 3, 3: 2, 4: 1, 5: 5}

w2['dread_new'] = w2['DREAD'].map(dread_vals)
w2.drop(['DREAD'], axis=1, inplace=True)
w2.rename(columns={'dread_new': 'DREAD'}, inplace=True)
w2['DREAD'].value_counts()

2    5663
3    4960
1    3657
4    1505
5     825
Name: DREAD, dtype: int64
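
A reversal mapping of this kind is easy to mistype, so a small self-check can be worthwhile. The sketch below uses a hypothetical toy series rather than the survey data: because the mapping swaps 1 with 4 and 2 with 3 while leaving 5 alone, applying it twice should return the original values.

```python
import pandas as pd

# Hypothetical toy responses on the raw wave 2 coding (5 = "Don't know").
toy = pd.Series([1, 2, 3, 4, 5], name="DREAD")

# Same mapping as in the cell above: reverse codes 1-4, leave 5 untouched.
flip = {1: 4, 2: 3, 3: 2, 4: 1, 5: 5}
flipped = toy.map(flip)

# A correct reversal is its own inverse.
round_trip = flipped.map(flip)
print(flipped.tolist())
```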

In [6]:
# Recoding values where necessary, so that values are consistent across both wave 1 and wave 2 data.
# Many categorical variables are in integer format, therefore recoding as categorical variables for dummification.
# Setting dictionaries to assign new labels to old labels.

region = {1: "East Midlands",
2: "North East",
3: "North West",
4: "West Midlands",
5: "South East",
6: "Northern Ireland",
7: "Scotland",
8: "Wales",
9: "Yorkshire and The Humber",
10: "East of England",
11: "Greater London",
12: "South West",
13: "Other"
}

sex = {1: 'Male', 2: 'Female', 3: 'Other'}

education = {1: "No academic qualifications",
2: "0-4 GCSE, O-levels, or equivalents",
3: "5+ GCSE, O-levels, 1 A level, or equivalents",
4: "Apprenticeship",
5: "2+ A levels or equivalents",
6: "Undergraduate or postgraduate degree",
7: "Other",
8: "Do not know",
9: "Do not wish to answer",
}

work = {1: "Working full-time",
2: "Working part-time",
3: "Unemployed",
4: "Student",
5: "Looking after the home",
6: "Retired",
7: "Do not wish to answer",
8: "Unable to work",
}

religion = {1: "Christian",
2: "Hindu",
3: "Muslim",
4: "Jewish",
5: "Buddhist",
6: "Atheist or agnostic",
7: "Other",
8: "Do not wish to answer",
}

income = {1: "Under £15,000",
2: "£15,000 to £24,999",
3: "£25,000 to £34,999",
4: "£35,000 to £44,999",
5: "£45,000 to £54,999",
6: "£55,000 to £64,999",
7: "£65,000 to £99,999",
8: "Over £100,000",
9: "Do not wish to answer",
}

nhs = {1: "Yes",
2: "No",
3: "Do not know",
}

invitation = {1: "Yes",
             2: "No",
             3: "Do not know"}

shield = {1: "Yes",
2: "No",
}

decision_maker = {1: 1,
2: 0,
}

political_party = {'1': "Labour",
'2': "Conservative",
'3': "Liberal Democrat",
'4': "Reform Party ",
'5': "Green Party",
'6': "UKIP",
'7': "BNP",
'8': "Local independent",
'9': "DUP",
'10': "SNP",
'12': "UUP",
'13': "Sinn Féin",
'14': "Other",
'16': "Do not know",
'17': "Not eligible",
'19': "SDLP",
'20': "Alba Party",
' ': 'NULL'                   
}

In [7]:
# Creating lists of variables for value relabelling, and the corresponding value dictionaries (in the same order).
relabel_list = ["DEMREG",
"DEMSEX",
"DEMEDU",
"DEMWRK",
"DEMREL",
"DEMINC",
"COV_INV",
"DEMNHS",
"COV_SHIELD",
"VAC_DEC", 
'Q71']

values_dict_list = [region, sex, education, work, religion, 
                    income, invitation, nhs, shield, decision_maker, political_party]

In [8]:
for column, mapping in zip(relabel_list, values_dict_list): # Pairing each variable with its value dictionary
    new_name = column + '_new' # Creating a new variable name per variable for value relabelling.
    w2[new_name] = w2[column].map(mapping) # Mapping the new values onto the new variable
    w2.drop(column, axis=1, inplace=True) # Dropping the original variable
    w2.rename(columns={new_name: column}, inplace=True) # Relabelling the new variable to the original variable name for consistency
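
One thing to keep in mind with this relabelling step: `Series.map` with a dictionary returns NaN for any value that has no key, so a code missing from a dictionary silently becomes null (the `political_party` dictionary, for example, has no entries for '11', '15' or '18'). The sketch below is an illustrative pre-check on toy data; `unmapped_codes` is a hypothetical helper, not something defined in the notebook.

```python
import pandas as pd


def unmapped_codes(series: pd.Series, mapping: dict) -> list:
    """List the distinct non-null values in `series` that have no key in `mapping`."""
    observed = series.dropna().unique()
    return sorted(value for value in observed if value not in mapping)


# Hypothetical toy column: code 4 is absent from the dictionary.
toy_codes = pd.Series([1, 2, 2, 4, 3])
toy_mapping = {1: "Yes", 2: "No", 3: "Do not know"}

missing = unmapped_codes(toy_codes, toy_mapping)
relabelled = toy_codes.map(toy_mapping)  # the row holding code 4 becomes NaN

print(missing)
print(relabelled.isna().sum())
```

Running such a check for each variable before the loop would show whether any records are about to be nulled by an incomplete dictionary rather than by a genuine non-response.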

In [9]:
# Value relabelling process for wave 2 only mist variables.
mist_val_dict = {1: 'Fake', 2: 'Real'}

mist_list = ["MIST_1",
"MIST_2",
"MIST_3",
"MIST_4",
"MIST_5",
"MIST_6",
"MIST_7",
"MIST_8",
"MIST_9",
"MIST_10",
"MIST_11",
"MIST_12",
"MIST_13",
"MIST_14",
"MIST_15",
"MIST_16",
"MIST_17",
"MIST_18",
"MIST_19",
"MIST_20"]

for item in mist_list:
    new_name = item + '_new'
    w2[new_name] = w2[item].map(mist_val_dict)
    w2.drop(item, axis=1, inplace=True)
    w2.rename(columns={new_name: item}, inplace=True)

In [10]:
# Value relabelling process for media information variables. Using list rather than dictionary as all values are already in dummy format. i.e. zero or one.

info_list = ["COV_INFO_1",
"COV_INFO_2",
"COV_INFO_3",
"COV_INFO_4",
"COV_INFO_5",
"COV_INFO_6",
"COV_INFO_7",
"COV_INFO_8",
"COV_INFO_9",
"COV_INFO_10",
"COV_INFO_11",
"COV_INFO_12",
"COV_INFO_13"]

new_value = ["National television",
"Satellite / international television channels",
"Radio",
"Newspapers",
"Social media (Facebook, Twitter, etc)",
"National public health authorities (such as the NHS or Public Health England / Wales)",
"Healthcare workers",
"International health authorities (such as The World Health Organization)",
"Government websites",
"The internet or search engines",
"Family and friends",
"Work, school, or college",
"Other (please specify)"]


In [11]:
for column, label in zip(info_list, new_value): # Pairing each variable with its text label
    new_name = column + '_new' # Creating a new variable name per variable for value relabelling.
    w2[new_name] = w2[column].apply(lambda x: 0 if x == " " else label) # Blank (single-space) responses become 0; any other value becomes the source label
    w2.drop(column, axis=1, inplace=True) # Dropping the original variable
    w2.rename(columns={new_name: column}, inplace=True) # Relabelling the new variable to the original variable name for consistency

In [12]:
# Creating function to recode non-response values as null

def col_recode(x):
    """Function to recode survey non-response values as null
    
    :param x: Original cell value for recoding.
    :return: np.nan for a non-response value, otherwise x unchanged.
    """
    
    non_response = {'Do not wish to answer', 'Do not know', 'NULL', ' '}
    return np.nan if x in non_response else x

In [13]:
# Applying the non-response to null recode across the whole data file.
wave2 = w2.applymap(col_recode)
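
For reference, the same recode can also be expressed without a cell-by-cell function. This is a sketch of an alternative on hypothetical toy data, not the notebook's method: `DataFrame.replace` accepts a list of values to replace and a single replacement value.

```python
import numpy as np
import pandas as pd

# The four non-response markers handled by col_recode.
NON_RESPONSE = ["Do not wish to answer", "Do not know", "NULL", " "]

# Hypothetical toy rows; the column names are borrowed from the survey for readability.
toy = pd.DataFrame({
    "DEMEDU": ["Apprenticeship", "Do not know", "Other"],
    "Q71": ["Labour", "NULL", " "],
    "DEMAGE": [34, 51, 67],
})

# Every listed marker becomes NaN; all other values are left as they are.
cleaned = toy.replace(NON_RESPONSE, np.nan)
print(cleaned)
```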

In [14]:
# Dropping those records coded as 'Other' in the gender column, as there are so few cases (<1%) and any analysis will not be meaningful.
gender_filter = wave2[wave2['DEMSEX'] == 'Other']
wave2.drop(gender_filter.index, axis=0, inplace=True)
wave2['DEMSEX'].value_counts()

Female    8448
Male      8124
Name: DEMSEX, dtype: int64

In [15]:
# Dropping those who answered Don't know at COV_INFO (i.e. COV_INFO_15 ==1)
wave2['COV_INFO_15'].value_counts()
info_filter = wave2[wave2['COV_INFO_15'] == '1']
wave2.drop(info_filter.index, axis=0, inplace=True)
wave2.drop('COV_INFO_15', axis=1, inplace=True)

In [16]:
# Creating a merged 'intention' variable for those who have not received the vaccine (i.e. same willingness target as wave 1).
# Creating value merge function

def create_intention(df):
    """Function to merge both 'intention' variables into one measure.
    
    :param df: Dataframe on which to apply the transformation.
    :return: Value code for final variable.
    """
    
    if (df['COV_VAC_1'] == '1') or (df['COV_ACC_2'] == '1'):
        value = 1
    elif(df['COV_VAC_1'] == '3') or (df['COV_ACC_2'] == '2'):
        value = 2
    elif(df['COV_VAC_1'] == '4') or (df['COV_ACC_2'] == '3'):
        value = 3
    elif(df['COV_VAC_1'] == '2') or (df['COV_ACC_2'] == '4'):
        value = 4
    else:
        value = np.nan
    return value

In [17]:
# Applying create_intention function to wave 2 data and creating merged variable
wave2['intention'] = wave2.apply(create_intention, axis=1)
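
The row-wise `apply` above can be slow on a large file. A vectorised version might look like the sketch below, which assumes the code translations implied by the branches of `create_intention`; `merge_intention` and the toy frame are hypothetical. Because the if/elif chain tests merged code 1 first, then 2, 3 and 4, a row with answers in both columns receives the lower merged code, which is a row-wise minimum.

```python
import numpy as np
import pandas as pd

# Code translations read off the branches of create_intention.
VAC_TO_INTENTION = {"1": 1, "3": 2, "4": 3, "2": 4}
ACC_TO_INTENTION = {"1": 1, "2": 2, "3": 3, "4": 4}


def merge_intention(df: pd.DataFrame) -> pd.Series:
    """Vectorised sketch of the merged intention code."""
    candidates = pd.concat(
        [df["COV_VAC_1"].map(VAC_TO_INTENTION), df["COV_ACC_2"].map(ACC_TO_INTENTION)],
        axis=1,
    )
    # NaN is skipped by min, so rows with no answer in either column stay NaN.
    return candidates.min(axis=1)


# Hypothetical toy rows covering each situation, including a conflicting pair.
toy = pd.DataFrame({
    "COV_VAC_1": ["1", "3", np.nan, np.nan, "2"],
    "COV_ACC_2": [np.nan, np.nan, "3", np.nan, "1"],
})
merged = merge_intention(toy)
print(merged.tolist())
```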

In [18]:
# Checking function has been applied correctly
wave2[['intention']].value_counts()

intention
1.0          3673
2.0          1249
3.0           652
4.0           555
dtype: int64

In [19]:
wave2[['COV_VAC_1']].value_counts()

COV_VAC_1
1            288
2            241
4            198
3            181
dtype: int64

In [20]:
wave2[['COV_ACC_2']].value_counts()

COV_ACC_2
1            3385
2            1068
3             454
4             314
dtype: int64

In [21]:
# Dropping original intention variables
wave2.drop(['COV_VAC_1', 'COV_ACC_2'], axis=1, inplace=True)

In [22]:
# Dropping records with null values at demographic variables
na_list = ['COV_INV', 'DEMEDU', 'DEMWRK', 'DEMREL', 'DEMINC', 'DEMNHS', 'Q71']

wave2.dropna(axis=0, subset=na_list, inplace=True)

In [23]:
# Dropping those who coded 'other' at region variable. No such code in wave 1 data and number of records is negligible (<1%)
dropreg = wave2[wave2['DEMREG'] == 'Other']
wave2.drop(dropreg.index, axis=0, inplace=True)

In [24]:
wave2.columns

Index(['COV_DOSE', 'COV_DOSE_2', 'COV_KNOWL_1', 'COV_KNOWL_2', 'COV_KNOWL_3',
       'COV_KNOWL_4', 'COV_KNOWL_5', 'COV_KNOWL_6', 'COV_KNOWL_7',
       'VAC_PASS_1', 'VAC_PASS_2', 'VAC_PASS_3', 'VAC_PASS_4', 'VAC_PASS_5',
       'VAC_PASS_6', 'VAC_PASS_7', 'VAC_PASS_UK', 'VAC_PASS_INT', 'ANX_1',
       'ANX_2', 'ANX_3', 'ANX_4', 'ANX_5', 'ANX_6', 'DEMAGE', 'DREAD',
       'DEMREG', 'DEMSEX', 'DEMEDU', 'DEMWRK', 'DEMREL', 'DEMINC', 'COV_INV',
       'DEMNHS', 'COV_SHIELD', 'VAC_DEC', 'Q71', 'MIST_1', 'MIST_2', 'MIST_3',
       'MIST_4', 'MIST_5', 'MIST_6', 'MIST_7', 'MIST_8', 'MIST_9', 'MIST_10',
       'MIST_11', 'MIST_12', 'MIST_13', 'MIST_14', 'MIST_15', 'MIST_16',
       'MIST_17', 'MIST_18', 'MIST_19', 'MIST_20', 'COV_INFO_1', 'COV_INFO_2',
       'COV_INFO_3', 'COV_INFO_4', 'COV_INFO_5', 'COV_INFO_6', 'COV_INFO_7',
       'COV_INFO_8', 'COV_INFO_9', 'COV_INFO_10', 'COV_INFO_11', 'COV_INFO_12',
       'COV_INFO_13', 'intention'],
      dtype='object')

In [25]:
# Renaming media info variables to match wave 1 variable naming.
wave2.rename({"COV_INFO_1": "COV_TRUST_1",
"COV_INFO_2": "COV_TRUST_2",
"COV_INFO_3": "COV_TRUST_3",
"COV_INFO_4": "COV_TRUST_4",
"COV_INFO_5": "COV_TRUST_5",
"COV_INFO_6": "COV_TRUST_6",
"COV_INFO_7": "COV_TRUST_7",
"COV_INFO_8": "COV_TRUST_8",
"COV_INFO_9": "COV_TRUST_9",
"COV_INFO_10": "COV_TRUST_10",
"COV_INFO_11": "COV_TRUST_11",
"COV_INFO_12": "COV_TRUST_12",
"COV_INFO_13": "COV_TRUST_13"}, axis=1, inplace=True)

In [26]:
# No response at Vac_pass variables is coded as 5. Recoding as null and dropping records.

# Creating recoding function.
def vac_pass_recode(x):
    """Function to recode vac_pass question non-response values as null
    
    :param x: Original row value for recoding.
    :return: Result of any necessary recoding.
    """
    
    if x == 5:
        result = np.nan
    else:
        result = x
    return result

In [27]:
# Applying recoding function to vac_pass variables and recoding other values to a consistent 1-5 scale (as for some reason, the scale midpoint was coded as 8 rather than 3).

pass_list = ["VAC_PASS_1",
"VAC_PASS_2",
"VAC_PASS_3",
"VAC_PASS_4",
"VAC_PASS_5",
"VAC_PASS_6",
"VAC_PASS_7"]



value_dictionary = {1: 1, 2: 2, 4: 5, 3: 4, 8: 3}

for column in pass_list:
    wave2[column] = wave2[column].apply(vac_pass_recode) # recoding '5' value as null (reading from wave2, the frame being cleaned, rather than w2)
    wave2.dropna(axis=0, subset=column, inplace=True) # Dropping subsequent nulls
    new_name = column + '_new' # Creating new variable name
    wave2[new_name] = wave2[column].map(value_dictionary) # remapping the value list into the new variable per the variable dictionary.
    wave2.drop(column, axis=1, inplace=True) # Dropping the original variable
    wave2.rename(columns={new_name: column}, inplace=True) # renaming the new variable back to the old name for consistency.
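
The null-drop-remap sequence used here recurs for several groups of variables, so it could be wrapped in one helper. The following is only a sketch on hypothetical toy data; `recode_scale` is not part of the notebook, and the value map is the one from the cell above.

```python
import numpy as np
import pandas as pd


def recode_scale(df, columns, missing_code, value_map):
    """Null out `missing_code`, drop those rows, then remap the remaining codes."""
    out = df.copy()
    out[columns] = out[columns].replace(missing_code, np.nan)
    out = out.dropna(subset=columns)
    for column in columns:
        # astype(int) raises if a code is absent from value_map, which
        # surfaces an incomplete dictionary instead of hiding it as NaN.
        out[column] = out[column].map(value_map).astype(int)
    return out


# Hypothetical toy rows: 5 is non-response, 8 is the scale midpoint.
toy = pd.DataFrame({
    "VAC_PASS_1": [1, 8, 5, 4],
    "VAC_PASS_2": [3, 2, 1, 5],
})
recoded = recode_scale(
    toy,
    ["VAC_PASS_1", "VAC_PASS_2"],
    missing_code=5,
    value_map={1: 1, 2: 2, 8: 3, 3: 4, 4: 5},
)
print(recoded)
```

Unlike the cell above, this helper returns a new frame and keeps the original column order, so it would not reproduce the notebook's column ordering exactly.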


In [28]:
# Following the same procedure for other vaccine pass variables, where '6' refers to no response.

# Creating recoding function.
def vac_pass_recode_part2(x):
    """Function to recode vac_pass question non-response values as null
    
    :param x: Original row value for recoding.
    :return: Result of any necessary recoding.
    """
    
    if x == 6:
        result = np.nan
    else:
        result = x
    return result

In [29]:
# Applying recoding function to remaining vac_pass variables. VAC_PASS_UK is then recoded to a consistent 1-5 scale (its raw codes 7 and 8 stand for scale points 3 and 4); VAC_PASS_INT is not remapped in this cell.


pass_list2 = ["VAC_PASS_UK",
"VAC_PASS_INT"]

value_dictionary = {1: 1, 2: 2, 7: 3, 8: 4, 5: 5}

for column in pass_list2:
    wave2[column] = wave2[column].apply(vac_pass_recode_part2) # recoding '6' value as null
    wave2.dropna(axis=0, subset=column, inplace=True) # Dropping subsequent nulls

# Only VAC_PASS_UK is remapped with value_dictionary.
wave2["VAC_PASS_UK_new"] = wave2["VAC_PASS_UK"].map(value_dictionary)
wave2.drop("VAC_PASS_UK", axis=1, inplace=True)
wave2.rename(columns={"VAC_PASS_UK_new": "VAC_PASS_UK"}, inplace=True)

for column in pass_list2:
    print(column)
    print(wave2[column].value_counts())
    print()
    print("----------")
    print()

VAC_PASS_UK
3    4608
5    3457
4    1600
1     512
2     389
Name: VAC_PASS_UK, dtype: int64

----------

VAC_PASS_INT
5.0    4116
3.0    4053
4.0    1587
1.0     446
2.0     364
Name: VAC_PASS_INT, dtype: int64

----------



In [30]:
# Dropping records which coded No response (Don't know) at COV_KNOWL variables. Don't know coded as 5.
knowledge_vars = ['COV_KNOWL_1', 'COV_KNOWL_2', 'COV_KNOWL_3', 'COV_KNOWL_4', 'COV_KNOWL_5', 'COV_KNOWL_6', 'COV_KNOWL_7']

for column in knowledge_vars:
    wave2.drop(wave2[wave2[column] == 5].index, inplace=True)


In [31]:
# Dropping records which coded No response (Don't know) at ANX variables. Don't know coded as 6.
anx_vars = ['ANX_1', 'ANX_2', 'ANX_3', 'ANX_4', 'ANX_5', 'ANX_6']

for column in anx_vars:
    wave2.drop(wave2[wave2[column] == 6].index, inplace=True)

In [32]:
# Dropping records which coded No response (Don't know) at DREAD 
wave2.drop(wave2[wave2['DREAD'] == 5].index, inplace=True)
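
The three preceding cells all drop rows in which some column holds a given non-response code. A single boolean mask can do this for a whole group of columns at once, as in the sketch below; `drop_rows_with_code` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def drop_rows_with_code(df, columns, code):
    """Return df without the rows where any of `columns` equals `code`."""
    has_code = df[columns].eq(code).any(axis=1)
    return df.loc[~has_code]


# Hypothetical toy rows: 6 is "Don't know" for ANX, 5 for DREAD.
toy = pd.DataFrame({
    "ANX_1": [1, 6, 3, 2],
    "ANX_2": [4, 2, 6, 1],
    "DREAD": [2, 3, 1, 5],
})

kept = drop_rows_with_code(toy, ["ANX_1", "ANX_2"], code=6)
kept = drop_rows_with_code(kept, ["DREAD"], code=5)
print(kept)
```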

In [33]:
# Checking cleaned value counts, including as percentages, to identify if any further value recategorisation is required.
for column in wave2.columns:
    print()
    print(column)
    print(wave2[column].value_counts())
    print()
    print()
    print(column)
    print(wave2[column].value_counts(normalize=True))
    print()
    print()
    print("---------")


COV_DOSE
1    4372
2    1109
3     329
Name: COV_DOSE, dtype: int64


COV_DOSE
1    0.752496
2    0.190878
3    0.056627
Name: COV_DOSE, dtype: float64


---------

COV_DOSE_2
1    4176
2     172
3      20
4       4
Name: COV_DOSE_2, dtype: int64


COV_DOSE_2
1    0.955169
2    0.039341
3    0.004575
4    0.000915
Name: COV_DOSE_2, dtype: float64


---------

COV_KNOWL_1
4    6229
3    1440
2     176
1     136
Name: COV_KNOWL_1, dtype: int64


COV_KNOWL_1
4    0.780479
3    0.180429
2    0.022052
1    0.017040
Name: COV_KNOWL_1, dtype: float64


---------

COV_KNOWL_2
4    6404
3    1251
2     178
1     148
Name: COV_KNOWL_2, dtype: int64


COV_KNOWL_2
4    0.802406
3    0.156747
2    0.022303
1    0.018544
Name: COV_KNOWL_2, dtype: float64


---------

COV_KNOWL_3
4    6184
3    1410
2     214
1     173
Name: COV_KNOWL_3, dtype: int64


COV_KNOWL_3
4    0.774840
3    0.176670
2    0.026814
1    0.021676
Name: COV_KNOWL_3, dtype: float64


---------

COV_KNOWL_4
4    6212
3    1452
2 

In [34]:
# Dropping categories with low value counts.
# DEMWRK was relabelled earlier via the work dictionary, where this category is labelled 'Unable to work'.
wave2.drop(wave2[wave2['DEMWRK'] == 'Unable to work'].index, inplace=True)

In [35]:
# Recategorising other categories with low counts as 'other' grouping
minor_parties = {'Other', 'Not eligible', 'UKIP', 'DUP', 'Sinn Féin', 'SDLP', 'UUP', 'Alba Party', 'BNP'}
minor_religions = {'Hindu', 'Jewish', 'Buddhist'}
wave2['Q71'] = wave2['Q71'].apply(lambda x: 'Other/Not eligible' if x in minor_parties else x)
wave2['DEMREL'] = wave2['DEMREL'].apply(lambda x: 'Other' if x in minor_religions else x)

In [36]:
# Creating two extra binary intention target variables ('yes', or 'no') for use in modelling. Target 1 only includes those 'definitely yes' or 'definitely no'. Target 2 groups those 'unsure' into those categories into which they are leaning.
wave2['intenttarget_1'] = wave2['intention'].apply(lambda x: 1 if x == 1 else 0 if x == 4 else np.nan)
wave2['intenttarget_2'] = wave2['intention'].apply(lambda x: 1 if (x == 1 or x == 2) else 0 if (x == 3 or x == 4) else np.nan)
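
The two target definitions can be sanity-checked on a handful of values. This sketch uses a hypothetical toy series and expresses each target as a dictionary lookup, relying on the fact that `Series.map` leaves unmapped and missing values as NaN; it is meant as a check of the logic, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# One toy respondent per intention code, plus one with no intention recorded.
toy_intention = pd.Series([1, 2, 3, 4, np.nan])

# Target 1: only the two end points of the scale are kept.
strict = toy_intention.map({1: 1, 4: 0})

# Target 2: the two middle codes are grouped with the end point they lean towards.
leaning = toy_intention.map({1: 1, 2: 1, 3: 0, 4: 0})

print(pd.DataFrame({"intention": toy_intention, "target_1": strict, "target_2": leaning}))
```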

In [37]:
# Creating binary uptake target variable ('yes', or 'no') for use in modelling. Merging one and two dose answers into one 'yes' category
wave2['uptaketarget'] = wave2['COV_DOSE'].apply(lambda x: 1 if (x == '1' or x == '2') else 0 if (x == '3') else np.nan)

In [38]:
# Saving total dataframe to CSV
wave2.to_csv("C:/Users/laure/Documents/vaccine_hesitancy_and_uptake/1_data_cleaning/wave_2_data_nondum.csv")

In [39]:
# Splitting total file into subsamples for 'intention' and 'uptake' for use in modelling.
w2_uptake = wave2.dropna(axis=0, subset='COV_DOSE')
w2_intention = wave2.dropna(axis=0, subset='intention')

In [40]:
# Dropping unnecessary intention variable from uptake sample
w2uptake = w2_uptake.copy()
w2uptake.drop('intention', axis=1, inplace=True)

In [41]:
# Dropping unnecessary uptake variables from intention sample
w2intention = w2_intention.copy()
w2intention.drop(['COV_DOSE', 'COV_DOSE_2'], axis=1, inplace=True)

In [42]:
# Creating list of variables for dummification
dummy_list = ["MIST_1",
"MIST_2",
"MIST_3",
"MIST_4",
"MIST_5",
"MIST_6",
"MIST_7",
"MIST_8",
"MIST_9",
"MIST_10",
"MIST_11",
"MIST_12",
"MIST_13",
"MIST_14",
"MIST_15",
"MIST_16",
"MIST_17",
"MIST_18",
"MIST_19",
"MIST_20",
"DEMREG",
"DEMSEX",
"DEMEDU",
"DEMWRK",
"DEMREL",
"DEMINC",
"COV_INV",
"DEMNHS",
"COV_SHIELD",
"Q71",
"COV_TRUST_1",
"COV_TRUST_2",
"COV_TRUST_3",
"COV_TRUST_4",
"COV_TRUST_5",
"COV_TRUST_6",
"COV_TRUST_7",
"COV_TRUST_8",
"COV_TRUST_9",
"COV_TRUST_10",
"COV_TRUST_11",
"COV_TRUST_12",
"COV_TRUST_13"]

In [43]:
# Dummifying categorical variables, using the copies from which the unneeded columns were dropped above
w2upt_dum = pd.get_dummies(w2uptake, columns=dummy_list)
w2int_dum = pd.get_dummies(w2intention, columns=dummy_list)
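
When a model fitted on Wave 1 dummies is later tested on Wave 2, the two dummified files need the same columns in the same order, and `get_dummies` only creates columns for categories that occur in the data it is given. One possible way to align them is sketched below on hypothetical toy frames; the real Wave 1 column list would come from the Wave 1 cleaning output, which is not shown in this notebook.

```python
import pandas as pd

# Hypothetical toy frames standing in for the cleaned wave 1 and wave 2 files.
toy_w1 = pd.DataFrame({"DEMSEX": ["Male", "Female"], "DEMREG": ["Wales", "Scotland"]})
toy_w2 = pd.DataFrame({"DEMSEX": ["Female", "Female"], "DEMREG": ["Wales", "North East"]})

w1_dum = pd.get_dummies(toy_w1, columns=["DEMSEX", "DEMREG"], dtype=int)
w2_dum = pd.get_dummies(toy_w2, columns=["DEMSEX", "DEMREG"], dtype=int)

# Keep wave 1's columns in wave 1's order: categories absent from wave 2 are
# filled with 0, and categories seen only in wave 2 are discarded.
w2_aligned = w2_dum.reindex(columns=w1_dum.columns, fill_value=0)
print(w2_aligned)
```

Discarding Wave 2-only columns is appropriate only for the test of a Wave 1 model; the extra Wave 2 variables would be kept when refining the model on Wave 2 data.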

In [44]:
# Saving dummified files
w2upt_dum.to_csv("C:/Users/laure/Documents/vaccine_hesitancy_and_uptake/1_data_cleaning/wave_2_vaccine_uptake_data.csv")
w2int_dum.to_csv("C:/Users/laure/Documents/vaccine_hesitancy_and_uptake/1_data_cleaning/wave_2_vaccine_intention_data.csv")