Examining the device logger data set and comparing it to the repaired preferences data set. The 'target' artwork column from the device logger data set can be added to the repaired preferences data set by matching on shared user_id/item_id combinations. Each combination should appear only once: a piece of artwork should receive only one favorable or unfavorable rating per user.

In [94]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

In [2]:
# loading in the data sets from the current directory
# (named data_dir to avoid shadowing the built-in dir())
data_dir = Path.cwd()
device_logger = pd.read_csv(data_dir/"device_preference_log.csv", parse_dates=['timestamp'])
prefs = pd.read_csv(data_dir/"prefs2080423REPAIRED.csv", parse_dates=['synced_timestamp', 'timestamp'])

To merge, these data sets need to have the same column names. Since the preferences set is the one that will ultimately be kept, I will update the device logger columns to match it. Also converting the device logger's pref column to show True or False instead of 'y' or 'n', and rounding the preferences synced_timestamp to the nearest second.

In [3]:
# convert 'y'/'n' responses to booleans
device_logger['pref'] = device_logger['pref'].replace({'n': False, 'y': True})
# rename the device logger columns to match the preferences data set
device_logger = device_logger.rename(columns={'artid': 'item_id',
                                              'session_id': 'user_id',
                                              'timestamp': 'synced_timestamp'})
# round sync times to the nearest second ('s' replaces the deprecated 'S' alias)
prefs['synced_timestamp'] = prefs['synced_timestamp'].dt.round('1s')

Checking whether the number of non-survey rows in the preferences data set matches the number of rows in the device logger data set. The preferences set has about 500 more records (2,714 versus 2,201), which should be alright.

In [4]:
print(device_logger.shape)
prefs[~prefs.user_id.str.startswith('admin')].shape

(2201, 5)


(2714, 6)
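
Row counts alone do not show which user/artwork pairs are missing from one side. A minimal sketch of a pair-level comparison, using made-up toy frames (`prefs_toy`, `logger_toy`) rather than the real files, and assuming survey rows are the ones whose user_id starts with 'admin', as in the filter above:

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are made up for illustration.
prefs_toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "admin-1"],
    "item_id": [10, 11, 10, 12],
})
logger_toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "item_id": [10, 10],
})

# Exclude survey rows, mirroring the 'admin' prefix filter used above.
non_survey = prefs_toy[~prefs_toy["user_id"].str.startswith("admin")]

# Build sets of (user_id, item_id) pairs and compare them.
prefs_pairs = set(zip(non_survey["user_id"], non_survey["item_id"]))
logger_pairs = set(zip(logger_toy["user_id"], logger_toy["item_id"]))

only_in_prefs = prefs_pairs - logger_pairs    # pairs the logger never saw
only_in_logger = logger_pairs - prefs_pairs   # pairs missing from prefs
print(len(only_in_prefs), len(only_in_logger))
```

Pairs that appear only in the preferences set would end up with a missing 'target' after a left merge, so this count gives a rough idea of how many rows cannot be filled in.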

Also want to make sure there aren't duplicate user/artwork pairs. If a pair is duplicated and the response or the artwork varies, there isn't enough data to tell what the true preference was. It is also possible that the divrods device misidentified the artwork a user intended to like. These duplicates will have to be removed to ensure the recommender model has accurate data.

In [5]:
dups_dl = device_logger.groupby(['user_id','item_id']).size().reset_index().rename(columns={0:'count'})
dups_dl[dups_dl['count'] > 1].shape

(166, 3)
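
The count above says how many pairs are duplicated, but not whether their responses actually disagree. A rough sketch of how conflicting pairs could be separated from harmless repeats, on a made-up toy frame (the column names follow the renamed device logger; the values are hypothetical):

```python
import pandas as pd

# Toy data: (u1, 10) has contradictory responses, (u2, 20) is a plain repeat.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3"],
    "item_id": [10, 10, 20, 20, 30],
    "pref":    [True, False, True, True, False],
})

# Number of distinct responses recorded for each user/artwork pair.
n_responses = toy.groupby(["user_id", "item_id"])["pref"].nunique()

# Pairs with more than one distinct response contradict themselves.
conflicting = n_responses[n_responses > 1].index.tolist()
print(conflicting)
```

This notebook drops every duplicated pair regardless, which is the more conservative choice; the sketch is only meant to show how much of the loss comes from genuine conflicts.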

In [6]:
# lost several hundred rows due to duplicates
dl2 = device_logger.drop_duplicates(keep=False, subset=['user_id', 'item_id'])
print(dl2.shape)
device_logger.shape

(1815, 5)


(2201, 5)
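
Worth noting why the row count falls by 386 (2201 to 1815) when only 166 pairs are duplicated: with `keep=False`, every copy of a duplicated pair is dropped, not just the extras. A small illustration on made-up data:

```python
import pandas as pd

# Toy data: the (u1, 10) pair appears twice, (u2, 20) once.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "item_id": [10, 10, 20],
    "pref":    [True, False, True],
})
keys = ["user_id", "item_id"]

# keep=False removes all rows belonging to a duplicated pair.
kept_none = toy.drop_duplicates(subset=keys, keep=False)
# keep="first" would instead retain one row per pair.
kept_first = toy.drop_duplicates(subset=keys, keep="first")

print(len(kept_none))   # only the (u2, 20) row survives
print(len(kept_first))  # one (u1, 10) row plus the (u2, 20) row
```

Since there is no way to tell which of the conflicting responses is the true one, `keep=False` fits the reasoning above better than keeping an arbitrary first row.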

In [7]:
# same with the preference data - many duplicates have to be removed
dups_prefs = prefs.groupby(['user_id','item_id']).size().reset_index().rename(columns={0:'count'})
dups_prefs[dups_prefs['count'] > 1].shape

(222, 3)

In [164]:
prefs_deduped = prefs.drop_duplicates(keep=False, subset=['user_id', 'item_id'])
print(prefs_deduped.shape)
prefs.shape

(3155, 6)


(3679, 6)

Merging the data sets with a pandas merge on user_id and item_id pairs, which transfers the 'target' value from the device logger data set to the preferences data set. Because this is a left merge, preference rows with no match in the device logger keep their place and get a missing 'target'. Also dropping the timestamp and resource_id columns, as the former is mostly NA values and the latter does not provide any benefit.

In [9]:
merged_df = pd.merge(prefs_deduped, dl2[['user_id', 'item_id', 'target']], how='left',
                     on=['user_id', 'item_id'])
merged_df = merged_df.drop(['resource_id', 'timestamp'], axis='columns')
# saving the changes to a new .csv
merged_df.to_csv("merged_preferences.csv", index=False)
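
As a possible sanity check on the merge, pandas can verify that the keys are unique on both sides and report how many rows found a match. A sketch on made-up toy frames (the frame names and values are hypothetical; `validate` and `indicator` are standard `pd.merge` parameters):

```python
import pandas as pd

# Toy stand-ins for the deduplicated tables; values are made up.
prefs_toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "item_id": [10, 20, 30],
    "pref":    [True, False, True],
})
logger_toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "item_id": [10, 20],
    "target":  [10, 21],
})

checked = pd.merge(
    prefs_toy, logger_toy,
    how="left", on=["user_id", "item_id"],
    validate="one_to_one",  # raises MergeError if a key pair repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only' here
)

# Count rows that received a target versus rows left without one.
match_counts = checked["_merge"].value_counts()
print(match_counts["both"], match_counts["left_only"])
```

With both inputs deduplicated using `keep=False`, the `one_to_one` check should pass, and the 'left_only' count shows how many preference rows were left without a 'target'.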