### Merging the original data with the coded short free-text answers

#### NB:

If the coding scheme or the free-text answer classes are adjusted, the content of the isolated free-text files changes accordingly.

In that case, the code in this notebook and any analyses that depend on the coded/merged columns produced here must be re-executed.
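One way to notice that a re-run is due is to compare file modification times. The following is only a minimal sketch: `is_stale` is a hypothetical helper that is not part of this notebook, and it assumes that modification times on disk reflect when the coded files were last regenerated.

```python
import os

def is_stale(output_path, input_paths):
    """Return True if the output is missing or older than any input file.

    Hypothetical helper: relies on file modification times only.
    """
    if not os.path.exists(output_path):
        return True
    out_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > out_mtime for p in input_paths)

# Possible usage (paths are placeholders, adjust to the project layout):
# coded = [os.path.join(basedir, f) for f in os.listdir(basedir)]
# if is_stale(merged_csv_path, coded):
#     print('Coded files changed: re-run this notebook.')
```

A timestamp check is cheap but coarse: it also fires if a file was merely touched, so it errs on the side of re-running.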

In [1]:
import os, pandas as pd, re

In [2]:
exportdate = 20180327
projectname = 'repract'

In [3]:
df = pd.read_csv(f'../../data/{exportdate}{projectname}.csv')
df.head(2)

Unnamed: 0,lfdn,external_lfdn,tester,dispcode,lastpage,quality,duration,v_7039,v_7040,v_7041,...,output_mode,javascript,flash,session_id,language,cleaned,ats,datetime,date_of_last_access,date_of_first_mail
0,106,0,no tester,Completed after break (32),2138658,NotShown,-1,NotShown,NotShown,0,...,HTML,NotShown,NotShown,3bb21c1b318e2f6b87557566bdd6b4d9,English,Not cleaned,1515411510,2018-01-08 11:38:30,2018-01-08 13:07:14,0000-00-00 00:00:00
1,131,0,no tester,Completed (31),2138658,NotShown,3805,NotShown,NotShown,NotShown,...,HTML,NotShown,NotShown,fc38f6556787a459c2cc604abf799448,English,Not cleaned,1515667019,2018-01-11 10:36:59,2018-01-11 11:40:24,0000-00-00 00:00:00


In [26]:
basedir = '../../analysis/freetext'
freetextfiles = os.listdir(basedir)
dfs = {file[:-4]:pd.read_csv(f'{basedir}/{file}') for file in freetextfiles}
short_answer_dfs_keys = ['v_11_coded', 'v_19_coded', 'v_6_coded', 'v_16_coded']
dfs = {k:v for k,v in dfs.items() if k in short_answer_dfs_keys}
dfs.keys()

dict_keys(['v_11_coded', 'v_16_coded', 'v_19_coded', 'v_6_coded'])

There are four questions with short free text answers:
- v_6 and v_16 are list-supplementing (for v_5, primary respondent role, and v_15, primary systems class, respectively)
- v_11 and v_19 are list-supplanting (years of respondent experience and respondent client sector)
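For the list-supplementing pairs, the merge code in this notebook keeps the list choice unless it starts with 'Other', in which case the coded free-text answer takes its place. A minimal sketch of that rule follows; the toy values are made up for illustration and are not survey data.

```python
import pandas as pd

# Toy data; answers and codes are invented for illustration only.
toy = pd.DataFrame({
    'v_5': ['Developer', 'Other (please specify)', 'Manager'],
    'v_6_coded': ['NotAnswered', 'Researcher', 'NotAnswered'],
})

# Keep the list choice unless it is the 'Other' option; in that case
# take the coded free-text answer instead.
is_other = toy['v_5'].str.startswith('Other')
toy['v_5_6_merged'] = toy['v_6_coded'].where(is_other, toy['v_5'])

print(toy['v_5_6_merged'].tolist())
```

The list-supplanting variables (v_11, v_19) need no such rule, because there is no list answer to fall back on: the coded column is used as-is.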

In other notebooks, respondents' answers to these questions were coded (fully algorithmically) and the resulting CSV files were stored.
Once these files have been loaded as pandas DataFrames, we can use a tiny pipeline to produce a new DataFrame that contains only relevant columns, plus the coded free text columns, plus the merged columns (for the list-supplementing variables).
This is done below (the function interfaces are ugly, I know, but the code gets the job done).

In [79]:
def prepare_merged_df(df, dfs):
    newdf = df.copy()
    # Order matters: the first two keys are list-supplanting (no merged column).
    short_answer_dfs_keys = ['v_11_coded', 'v_19_coded', 'v_6_coded', 'v_16_coded']
    for dfkey in short_answer_dfs_keys:
        # No `on=` is given, so pandas joins on all columns the two frames share.
        newdf = newdf.merge(dfs[dfkey], how='outer').fillna('NotAnswered')
        if dfkey not in short_answer_dfs_keys[:2]:
            # List-supplementing: v_6 refines v_5, v_16 refines v_15.
            varno = int(dfkey.split("_")[1])
            newdf[f'v_{varno-1}_{varno}_merged'] = [
                coded if choice.startswith('Other') else choice
                for choice, coded in zip(newdf[f'v_{varno-1}'], newdf[dfkey])]
    return newdf

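def drop_admin_columns(df, olddf):
    # The positional slices are tied to the column layout of this particular
    # export; re-check 1:7 and 1329 if the export changes.
    cols = list(olddf.columns.values)
    columns_to_drop = cols[1:7] + cols[1329:] + ['v_122', 'v_123']
    return df.drop(columns_to_drop, axis=1)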

def fix_not_answered(df):
    return df.replace('0', 'NotAnswered')

def pipeline_df(df, dfs):
    newdf = prepare_merged_df(df, dfs)
    newdf = drop_admin_columns(newdf, df)
    newdf = fix_not_answered(newdf)
    return newdf
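An outer merge without an explicit key silently multiplies rows if a respondent appears twice in a coded file. As a hedged alternative to the merge above (not what the pipeline does), the sketch below uses a left merge on an explicit key with pandas' `validate` option. It assumes that `lfdn` identifies a respondent uniquely; `merge_coded` and the toy frames are hypothetical.

```python
import pandas as pd

def merge_coded(df, coded, key='lfdn'):
    """Left-merge one coded free-text frame, failing loudly on duplicate keys."""
    # validate='one_to_one' makes pandas raise MergeError if `key` is
    # duplicated on either side, which would otherwise multiply rows.
    merged = df.merge(coded, on=key, how='left', validate='one_to_one')
    assert len(merged) == len(df)
    return merged

# Toy frames standing in for the survey export and one coded file.
survey = pd.DataFrame({'lfdn': [106, 131, 139], 'v_5': ['a', 'b', 'c']})
coded = pd.DataFrame({'lfdn': [106, 139], 'v_6_coded': ['Researcher', 'Manager']})

result = merge_coded(survey, coded).fillna('NotAnswered')
print(result)
```

Respondents without a coded answer end up as 'NotAnswered', as in the pipeline; the difference is that a malformed coded file raises an error instead of quietly changing the row count.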

In [81]:
newdf = pipeline_df(df, dfs)
newdf.head(3)

Unnamed: 0,lfdn,v_7039,v_7040,v_7041,v_7042,v_7043,v_7044,v_7045,v_7046,v_7047,...,v_16,v_19,v_124,v_1373,v_11_coded,v_19_coded,v_6_coded,v_5_6_merged,v_16_coded,v_15_16_merged
0,106,NotShown,NotShown,NotAnswered,NotShown,NotShown,NotShown,NotShown,NotShown,NotShown,...,NotAnswered,Automotive,Italy,NotAnswered,4.0,Automotive,Researcher,Researcher,NotAnswered,Hybrid / mix of embedded systems and informati...
1,131,NotShown,NotShown,NotShown,NotShown,NotShown,NotShown,NotShown,NotShown,NotShown,...,NotAnswered,education,Belgium,NotAnswered,1.0,Education,NotAnswered,Developer,NotAnswered,(Business) information systems
2,139,NotShown,NotShown,NotShown,NotShown,NotShown,NotShown,NotShown,NotShown,NotShown,...,Customer facing software products,Wide range (from automotive supplier to insura...,Germany,It would have been easier to rate the research...,10.0,Automotive,Manager,Manager,(Business) information systems,(Business) information systems


In [83]:
#newdf.to_csv(f'../../analysis/{exportdate}{projectname}_with_shorttext_integration.csv', index=False)

Now we are in a position to analyze all data that's not long free-text answers, and to re-generate all our support files and graphics without the ugly 'Other (please specify)'. Yippie.

The End.