The data sources are:

### 1. Liam's SPSS coded data
**File:** The Loop 2017 Final Interventions.xlsx

Exported as Excel from SPSS, keeping the variable names.

This file contains 1325 entries.

27 have a null festival or sample number, so they can't be used, leaving 1298.

One has sample number 12151 and two have sample number 0; these cannot be merged.

This leaves 1295, all of which can be merged.
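
As a quick sanity check, the arithmetic above can be reproduced in a few lines. The sketch also shows what the three odd sample numbers look like once the `F{:04d}` formatting used later in this notebook is applied. The variable names are illustrative only and do not appear elsewhere in the notebook.

```python
# Row counts quoted above for the SPSS export
total_rows = 1325
null_key_rows = 27      # null Festival or SampleNumber
unmergeable_rows = 3    # sample numbers 12151, 0 and 0

usable_rows = total_rows - null_key_rows
mergeable_rows = usable_rows - unmergeable_rows
print(usable_rows, mergeable_rows)  # 1298 1295

# The same formatting the notebook applies to SampleNumber further down
odd_codes = ['F{:04d}'.format(n) for n in (12151, 0, 0)]
print(odd_codes)  # ['F12151', 'F0000', 'F0000']
```

`F12151` has five digits rather than four, and `F0000` looks like a placeholder rather than a real sample number, so neither is likely to find a partner in the lab data.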

### 2. Guy's cleaned up lab data
**File:** Loop 2017 Lab fixed data.csv

Saved from: Dropbox/Testing/2017 results processing/Loop 2017 Lab fixed data.xlsm in the ‘Raw Lab Data’ sheet

This file contains 2544 entries.

1900 entries begin with F.

621 entries begin with A (amnesty), so they can't be merged.

23 begin with W (?), so they can't be merged.
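
The breakdown by prefix can be reproduced with a simple count of first letters. This is a minimal sketch on made-up sample numbers, not the real lab file; `sample_numbers`, `prefix_counts` and `mergeable` are hypothetical names.

```python
from collections import Counter

# The three groups quoted above account for every entry in the file
assert 1900 + 621 + 23 == 2544

# Hypothetical sample numbers, not taken from the real file
sample_numbers = ['F0015', 'f0022', 'A0101', 'W0003', 'A0102', 'F0074']

# Count by upper-cased first letter
prefix_counts = Counter(s.strip().upper()[0] for s in sample_numbers)
print(prefix_counts)

# Only F-prefixed samples can be matched to intervention data
mergeable = [s.upper() for s in sample_numbers if s.upper().startswith('F')]
print(mergeable)
```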

Festival Names

### 3. Boomtown Intervention Questionnaire
**File:** BTReport 2017 - Form responses 3.csv

Exported from: https://docs.google.com/spreadsheets/d/15pdETY0HK-VbBcV-N0swt6ZrRBbeDnZR5RGDzfq95dg

This file contains 194 entries

### Merging the data

Merging the SPSS and lab data on Festival and SampleNumber resulted in 1295 entries.



In [14]:
# Module imports
import os
import pandas as pd

We need to consolidate the 'Festival' and 'Sample Number' columns across all datasets:
* Rename the columns so they have the same name in every dataset
* Ensure sample numbers are all in the same format: FXXXX for FOH samples and AXXXX for Amnesty samples (which have no intervention or provenance data associated with them).
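
One way to express the target format is a small helper that accepts either a bare number or an already-prefixed code. This is a sketch only: `normalise_sample_number` is a hypothetical function, not something the notebook defines, and it assumes that un-prefixed numbers are FOH samples.

```python
def normalise_sample_number(value, prefix='F'):
    """Return a code such as 'F0015' from 15, 15.0, '15' or 'f0015'.

    Hypothetical helper: the default prefix and the four-digit padding
    follow the FXXXX / AXXXX convention described above.
    """
    text = str(value).strip().upper()
    if text[:1].isalpha():
        # Already prefixed: keep the letter, re-pad the digits
        prefix, text = text[0], text[1:]
    # float() first so that values read as 15.0 are accepted too
    return '{}{:04d}'.format(prefix, int(float(text)))


print(normalise_sample_number(15))       # F0015
print(normalise_sample_number('a12'))    # A0012
```

Empty or non-numeric values raise a `ValueError`, so rows with missing sample numbers need to be dropped first, as the cells below do.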

In [2]:
bt_interventions = '/opt/random/BTReport 2017 - Form responses 3.csv'

In [15]:
spssdata = '/opt/random/The Loop 2017 Final Interventions.xlsx'
spss_df = pd.read_excel(spssdata)

# Remove SurveyID column
spss_df.drop('SurveyID', axis=1, inplace=True)

# Change festival names (assign back; inplace on a column selection is unreliable in recent pandas)
spss_df['Festival'] = spss_df['Festival'].replace(['BoomTown', 'KC', 'SGP'], ['BT2017', 'KC2017', 'SGP2017'])

# Ensure all sample numbers are consistent
# 1. Delete any rows where SampleNumber or Festival is NA as we can't do anything with them
spss_df.dropna(subset=['SampleNumber', 'Festival'], inplace=True)
# 2. Make all sample numbers a 4-digit code starting with F
spss_df['SampleNumber'] = spss_df['SampleNumber'].apply(lambda x: 'F{:04d}'.format(int(x)))

# Below shows we are left with 1298 rows
print(len(spss_df))

1298


In [16]:
labdata = '/opt/random/Loop 2017 Lab fixed data.csv'
lab_df = pd.read_csv(labdata, encoding="ISO-8859-1")

# Rename 'Event Name' and 'Sample Number' columns so they match the SPSS data
lab_df.rename(columns={'Event  Name': 'Festival', 'Sample Number': 'SampleNumber'}, inplace=True)

# Delete any rows where SampleNumber or Festival is NA as we can't do anything with it
lab_df.dropna(subset=['SampleNumber', 'Festival'], inplace=True)
# This just drops one case

# Uppercase all sample numbers
lab_df['SampleNumber'] = lab_df['SampleNumber'].str.upper()

# Some sample numbers begin with neither F nor A (e.g. W); uncomment to count them
#print(len(lab_df[ ~ (lab_df['SampleNumber'].str.startswith('F') | lab_df['SampleNumber'].str.startswith('A')) ]))



In [None]:
dft = pd.merge(spss_df, lab_df, how='inner', on=['Festival','SampleNumber'])
print("%d entries were merged" % len(dft))
# Sort by festival, then sample number
dft.sort_values(['Festival', 'SampleNumber'], ascending=True, inplace=True)
print(dft[['Festival', 'SampleNumber']][0:10])
# Write straight to Excel; ExcelWriter.save() was removed in pandas 2.0
dft.to_excel('merged.xlsx', sheet_name='MergedData', index=False)

# Find out which entries can't be merged - with lab_df on the left, unmatched SPSS rows show as right_only
#pd.merge(lab_df, spss_df, how='outer', on=['Festival', 'SampleNumber'], indicator=True)

1295 entries were merged
    Festival SampleNumber
641   BT2017        F0015
700   BT2017        F0022
701   BT2017        F0023
702   BT2017        F0024
916   BT2017        F0036
650   BT2017        F0074
655   BT2017        F0078
941   BT2017        F0079
631   BT2017        F0081
640   BT2017        F0084
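
To see which rows fail to merge, an outer merge with `indicator=True` labels every row by its origin. The sketch below uses two tiny invented frames in place of `spss_df` and `lab_df`; the frame names `left` and `right` and all values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for spss_df and lab_df; the values are invented
left = pd.DataFrame({'Festival': ['BT2017', 'BT2017', 'KC2017'],
                     'SampleNumber': ['F0015', 'F0000', 'F0022'],
                     'Age': [24, 31, 19]})
right = pd.DataFrame({'Festival': ['BT2017', 'KC2017', 'KC2017'],
                      'SampleNumber': ['F0015', 'F0022', 'A0101'],
                      'Result': ['r1', 'r2', 'r3']})

keys = ['Festival', 'SampleNumber']
# An outer merge keeps every row; the _merge column records where it came from
both = pd.merge(left, right, how='outer', on=keys, indicator=True)

left_only = both.loc[both['_merge'] == 'left_only', keys]
right_only = both.loc[both['_merge'] == 'right_only', keys]
print(left_only)
print(right_only)
```

Here `left_only` holds the intervention rows with no lab result and `right_only` holds the lab rows with no intervention record, which in the real data would include the amnesty samples.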


In [None]:
# Here we move columns that should be identical next to each other, in order to:
# 1. spot data errors
# 2. remove duplicate columns
prefix_cols = ['Festival', 'SampleNumber',
             'Sample submission time', 'Date & Time of return', 'Day', 'Date', 'Time', 
             'Client age', 'Age', 'Client gender', 'Gender', 'Bought as', 'SubmittedSubstanceAs']

# Get the list of columns excluding the ones in prefix_cols
cols = [c for c in dft.columns.tolist() if c not in prefix_cols]
# Prepend prefix_cols to create the new list
cols = prefix_cols + cols

# Reorder columns
dft = dft[cols]

# Write straight to Excel; ExcelWriter.save() was removed in pandas 2.0
dft.to_excel('merged2.xlsx', sheet_name='MergedData', index=False)