The data sources are:

### 1. Liam's SPSS coded data
**File:** The Loop 2017 Final Interventions.xlsx

Exported as Excel from SPSS, keeping the variable names.

This file contains 1325 entries.

27 have a null festival or sample number so can't be used, leaving 1298.

One has sample number 12151 and two have sample number 0; these three cannot be merged.

This leaves 1295, all of which can be merged.
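
The filtering steps above can be sketched on toy data. This is only an illustration: the five rows below are invented, and the rule that a mergeable sample number lies between 1 and 9999 is an assumption drawn from the 4-digit `F` codes used later, not something the source file states.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the SPSS export (hypothetical rows; the real file has 1325)
toy = pd.DataFrame({
    'Festival': ['BoomTown', 'KC', None, 'SGP', 'KC'],
    'SampleNumber': [12, 0, 7, np.nan, 12151],
})

# Step 1: rows with a null Festival or SampleNumber can't be used
usable = toy.dropna(subset=['Festival', 'SampleNumber'])

# Step 2 (assumption): only numbers 1..9999 can map onto a 4-digit lab code,
# so 0 and 12151 are set aside as unmergeable
mergeable = usable[usable['SampleNumber'].between(1, 9999)]

print(len(toy), len(usable), len(mergeable))
```

Applied to the real file, the same two steps would be expected to give the 1325, 1298 and 1295 counts quoted above.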

### 2. Guy's cleaned up lab data
**File:** Loop 2017 Lab fixed data.csv

Saved from the ‘Raw Lab Data’ sheet of Dropbox/Testing/2017 results processing/Loop 2017 Lab fixed data.xlsm

This file contains 2544 entries.

1900 entries begin with F.

621 entries begin with A (amnesty) so can't be merged.

23 entries begin with W (meaning unclear) so can't be merged.

Entry SGP2017 F0465 needs editing, as its 'Client gender' is recorded as 'FemaleaMalee'.
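
A minimal sketch of how the prefix counts and the odd gender value could be checked, using invented rows rather than the real lab file. The set of expected gender values (`FEMALE`, `MALE`, `MISSING`) is an assumption based on the values handled later in this notebook.

```python
import pandas as pd

# Toy stand-in for the lab data (hypothetical rows)
toy_lab = pd.DataFrame({
    'SampleNumber': ['F0001', 'f0002', 'A0003', 'W0004', 'F0465'],
    'Client gender': ['Female', 'Male', 'Missing', 'Male', 'FemaleaMalee'],
})

# Count entries by the first letter of the (uppercased) sample number
prefix_counts = toy_lab['SampleNumber'].str.upper().str[0].value_counts()

# Flag any gender value outside the expected set (assumed, see above)
expected_genders = {'FEMALE', 'MALE', 'MISSING'}
odd_gender = toy_lab[~toy_lab['Client gender'].str.upper().isin(expected_genders)]

print(prefix_counts.to_dict())
print(odd_gender['SampleNumber'].tolist())
```

Only the `F` entries are candidates for merging with the intervention data; the flagged rows still need correcting by hand, since the right value can't be inferred from the data.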

### 3. Boomtown Intervention Questionnaire
**File:** BTReport 2017 - Form responses 3.csv

Exported from: https://docs.google.com/spreadsheets/d/15pdETY0HK-VbBcV-N0swt6ZrRBbeDnZR5RGDzfq95dg

This file contains 194 entries

### Merging the data

Merging the SPSS and lab data on Festival and SampleNumber (inner join) resulted in 1295 entries.
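
To see which rows fall out of such a merge, an outer join with `indicator=True` labels each row as `both`, `left_only` or `right_only`. The sketch below uses two small invented frames; the column values are hypothetical and only the key columns mirror the real data.

```python
import pandas as pd

# Hypothetical intervention rows (left) and lab rows (right)
interventions = pd.DataFrame({
    'Festival': ['BT2017', 'KC2017', 'KC2017'],
    'SampleNumber': ['F0012', 'F0000', 'F0007'],
    'Age': [24, 31, 19],
})
lab = pd.DataFrame({
    'Festival': ['BT2017', 'KC2017', 'SGP2017'],
    'SampleNumber': ['F0012', 'F0007', 'A0003'],
    'Bought as': ['MDMA', 'Ketamine', 'Cocaine'],
})

keys = ['Festival', 'SampleNumber']

# The inner join keeps only rows whose key pair appears in both frames
merged = pd.merge(interventions, lab, how='inner', on=keys)

# The outer join keeps everything and records where each row came from
audit = pd.merge(interventions, lab, how='outer', on=keys, indicator=True)
status = audit['_merge'].value_counts()

print(len(merged))
print(audit[audit['_merge'] != 'both'][keys + ['_merge']])
```

In this toy case the `F0000` intervention has no lab match and the amnesty sample `A0003` has no intervention, which is the same pattern as the unmergeable entries described in the sections above.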



In [1]:
# Module imports
import os
import numpy as np
import pandas as pd

In [2]:
bt_interventions = '/opt/random/BTReport 2017 - Form responses 3.csv'

In [3]:
spssdata = '/opt/random/The Loop 2017 Final Interventions.xlsx'

spss_df = pd.read_excel(spssdata)

# Change festival names to the codes used in the lab data
spss_df['Festival'] = spss_df['Festival'].replace(['BoomTown', 'KC', 'SGP'], ['BT2017', 'KC2017', 'SGP2017'])

# Ensure all Sample numbers are consistent
# 1. Delete any rows where SampleNumber or Festival is NA as we can't do anything with it
spss_df.dropna(subset=['SampleNumber', 'Festival'], inplace=True)

# 2. Make all sample numbers a 4-digit code starting with F
spss_df['SampleNumber'] = spss_df['SampleNumber'].apply(lambda x: 'F{:04d}'.format(int(x)))

# Combine date and time columns into new single column
spss_df['Date'] = pd.to_datetime(spss_df['Date']) # Convert Date to datetime object
# Timestamp.combine joins a date and a time into one timestamp
spss_df['Date & Time of intervention'] = spss_df.apply(lambda r: pd.Timestamp.combine(r['Date'], r['Time']), axis=1)

# Remove Day, Date, Time and SurveyID columns
spss_df.drop(['Day', 'Date', 'Time', 'SurveyID'], axis=1, inplace=True)

# Below shows we are left with 1298 rows
print(len(spss_df))

1298


In [4]:
labdata = '/opt/random/Loop 2017 Lab fixed data.csv'
date_cols = ['Sample submission time', 'Date & Time of return']
lab_df = pd.read_csv(labdata, encoding="ISO-8859-1", engine="python", parse_dates=date_cols)

# Rename 'Event  Name' and 'Sample Number' columns so they match the SPSS data
lab_df.rename(columns={'Event  Name': 'Festival', 'Sample Number': 'SampleNumber'}, inplace=True)

# Delete any rows where SampleNumber or Festival is NA as we can't do anything with it
lab_df.dropna(subset=['SampleNumber', 'Festival'], inplace=True) # This just drops one case

# Uppercase all sample numbers
lab_df['SampleNumber'] = lab_df['SampleNumber'].str.upper()

# Some sample numbers begin with W rather than F or A; uncomment to count them
#print(len(lab_df[ ~ (lab_df['SampleNumber'].str.startswith('F') | lab_df['SampleNumber'].str.startswith('A')) ]))

In [5]:
dft = pd.merge(spss_df, lab_df, how='inner', on=['Festival','SampleNumber'])
print("%d entries were merged" % len(dft))

# For checking which entries can't be merged - check for right_only
#pd.merge(lab_df, spss_df, how='outer', indicator=True)

# Sort first by Festival, then SampleNumber
dft.sort_values(['Festival', 'SampleNumber'], ascending=True, inplace=True)

# Here we reorder columns that should be identical to:
# 1. spot data errors
# 2. remove duplicate columns once we're happy data is consistent
prefix_cols = ['Festival', 'SampleNumber',
             'Sample submission time', 'Date & Time of return', 'Date & Time of intervention', 
             'Client age', 'Age', 'Client gender', 'Gender', 'Bought as', 'SubmittedSubstanceAs']

# Get the list of columns excluding the ones in prefix_cols
cols = [c for c in dft.columns.tolist() if c not in prefix_cols]
# Prepend prefix_cols to create the new list
cols = prefix_cols + cols
# Reorder columns
dft = dft[cols]

# Uppercase all genders for consistency
labels = ['Client gender', 'Gender']
dft.loc[:, labels] = dft[labels].apply(lambda x: x.str.upper())
# Set any MISSING to be nan
dft.loc[:, labels] = dft.loc[:, labels].replace({'MISSING':np.nan})

# Dump to Excel; the context manager saves and closes the file on exit
with pd.ExcelWriter('merged.xlsx') as writer:
    dft.to_excel(writer, sheet_name='MergedData', index=False)

1295 entries were merged


In [8]:
# # See which non-na ages don't match
# # 540 entries have valid ages
# print(len(dft))
# df = dft[pd.notnull(dft['Client age']) & pd.notnull(dft['Age'])]
# print(len(df))
# df = df[df['Client age'] != df['Age']]
# # 127 don't match
# print(len(df))
# # df.to_csv('foo.csv')


# Cross tab 'Client gender' and 'Gender'
#print(set(dft['Client gender'].values))
#print(set(dft['Gender'].values))

# Look where they don't match
df = dft[pd.notnull(dft['Client gender']) & pd.notnull(dft['Gender'])]
df = df[df['Client gender'] != df['Gender']]
# 62 don't match
print(len(df))
df.to_csv('foo.csv')

62
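
The cross tab mentioned in the comments of the cell above could look something like the sketch below. It runs on invented rows, not the merged data, and assumes both columns have already been uppercased with `MISSING` converted to NaN as in the merge cell.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two gender columns of the merged frame (hypothetical rows)
toy = pd.DataFrame({
    'Client gender': ['FEMALE', 'MALE', 'MALE', np.nan, 'FEMALE'],
    'Gender': ['FEMALE', 'FEMALE', 'MALE', 'MALE', np.nan],
})

# Keep only rows where both sources recorded a gender
both_known = toy.dropna(subset=['Client gender', 'Gender'])

# Rows: lab value; columns: intervention value. Off-diagonal cells are disagreements
table = pd.crosstab(both_known['Client gender'], both_known['Gender'])

# The same disagreements as a list of rows, for manual checking
mismatches = both_known[both_known['Client gender'] != both_known['Gender']]

print(table)
print(len(mismatches))
```

The table shows at a glance which direction the disagreements run, which the bare mismatch count does not.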
