## Monthly Stats and Metrics for Scott Media Creation Lab's Spaces and Equipment

_by Kris Joseph, kjo@yorku.ca_

This script is intended to be run monthly to grab key data points for metrics and stats. It should be run AFTER the end of the month (e.g. run the script for July AFTER August 1 to ensure there's a full month of data in the system).

Multiple outputs will be generated:
1. "Raw" equipment data, in a file called (yyyy-mm-dd)_equip.csv
    
2. "Raw" spaces data, in a file called (yyyy-mm-dd)_space.csv
    
3. Final metrics data (for input into DSI's tracking spreadsheet) for both modules:
    - (yyyy-mm-dd)_equipment_finalStats.csv: equipment data only
    - (yyyy-mm-dd)_space_finalStats.csv: spaces data only
    - (yyyy-mm-dd)overall_finalStats.csv: ALL data, ready to copy into spreadsheet
    
*IMPORTANT*
This script is tightly coupled to the format of the data tracking spreadsheet: the number and order of the columns generated here are intended to match the spreadsheet's layout, so that a new data row can simply be pasted in and every column lines up. As a result, any change to the column order must be made in both places (this script and the DSI tracking sheet).

#### GENERAL NOTES ON THIS IMPLEMENTATION

1. Want to revise this so that a month and year can be input on the command line to grab data easily
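
One way the command-line revision could look is sketched below. It is only a sketch: the `--year` and `--month` flag names and the `parse_month` helper are assumptions, not part of the current script. It builds the same `datadate` string and `daysInMonth` value that the cells below set by hand.

```python
import argparse
import calendar


def parse_month(argv=None):
    """Return (datadate, days_in_month) for a year and month given as arguments."""
    parser = argparse.ArgumentParser(description="Monthly LibCal stats")
    parser.add_argument("--year", type=int, required=True,
                        help="four-digit year, e.g. 2024")
    parser.add_argument("--month", type=int, required=True, choices=range(1, 13),
                        help="month number, 1-12")
    args = parser.parse_args(argv)

    # LibCal expects yyyy-mm-dd; the day is always 01 for a monthly pull
    datadate = f"{args.year:04d}-{args.month:02d}-01"
    days_in_month = calendar.monthrange(args.year, args.month)[1]
    return datadate, days_in_month


# Passing a list stands in for sys.argv[1:], which is handy inside a notebook
datadate, daysInMonth = parse_month(["--year", "2024", "--month", "5"])
print(datadate, daysInMonth)  # 2024-05-01 31
```

When run as a plain script, calling `parse_month()` with no argument reads the real command line instead.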


## Set the value for the month that stats are needed for

To run this script, the only thing you should have to change is the "datadate" field below. Enter in the form yyyy-mm-01

In [5]:
import requests, json, csv, calendar, datetime, math, collections, hashlib, os, shutil
from os.path import exists
from datetime import *
import pandas as pd

## INPUT variable
# Set this to the year and month for which data will be gathered, in yyyy-mm-dd format. dd should always be 01
# This is also the date format LibCal's API expects for the API calls
datadate = "2024-05-01"


## Variable initializations and constants

In [6]:
# Our outputData dictionary. This dictionary is gonna have a TON of data points in it eventually....
equipmentOutputData = {}
spacesOutputData = {}

# We'll need an accurate number of days in the month to make a proper call to the API (get 30 days of data for some months, 31 for others, and 28 for one special month)
[year, month, day] = datadate.split("-")
daysInMonth = calendar.monthrange(int(year), int(month))[1]

# Make backups of old "overall users" data, since calculating first-time users
# means comparing the previous month to the current one. If we don't make a backup, and
# these scripts fail for some reason, future runs will always show "first time users" as 0 because
# the "existingUsers" data files will already have been updated
fileExtension = ".txt"
fileDirectory = "data/"
for file in ["existingSpaceUsers", "existingEquipmentUsers", "existingOverallUsers"]:
    if not exists(fileDirectory + file + "-" + str(year) + str(month) + fileExtension):
        print("No backup found for " + file + " (" + str(year) + str(month) + "); creating one")
        shutil.copyfile(fileDirectory + file + fileExtension, fileDirectory +file + "-" + str(year) + str(month) + fileExtension)

# This 'config file' contains the variable definitions for all of the equipment names,
# space names, column names for data, etc. The most common reason for these scripts
# to break is that something in LibCal has changed (the name of an item or space, question
# responses in the booking forms such as faculty name or relationship to institution, etc.).
# This import statement lets me keep all that stuff in a separate file.
from libCalNames import *
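
The backup step above can also be written as a small function, which makes it easy to try out in a throwaway directory before trusting it with the real `data/` files. This is a minimal sketch; the `backup_user_file` name and its return value are assumptions added for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def backup_user_file(directory, name, year, month, extension=".txt"):
    """Copy <name>.txt to <name>-<year><month>.txt unless that backup exists.

    Returns True if a new backup was written, False if one was already there.
    """
    source = Path(directory) / f"{name}{extension}"
    backup = Path(directory) / f"{name}-{year}{month}{extension}"
    if backup.exists():
        return False
    shutil.copyfile(source, backup)
    return True


# Demonstration in a temporary directory, so no real data files are touched
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "existingSpaceUsers.txt").write_text("abc123\n")
    first = backup_user_file(tmp, "existingSpaceUsers", "2024", "05")
    second = backup_user_file(tmp, "existingSpaceUsers", "2024", "05")
    backup_exists = (Path(tmp) / "existingSpaceUsers-202405.txt").exists()

print(first, second, backup_exists)  # True False True
```

The second call returns `False` because the backup for that month already exists, which is what protects the previous month's user list when the notebook is re-run.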

## Get a token for API access

By default, tokens are valid for one hour

In [7]:
# URLs and data structures for API calls are all listed in the admin pages for the Libapps API module
url = 'https://yorku.libcal.com/1.1/oauth/token'
myRequestData = {'client_id': clientID,
        'client_secret': clientSecret,
        'grant_type': 'client_credentials'}

# send the request
call = requests.post(url, data = myRequestData)

# API authorization is returned in a JSON object, and we need to grab/store our access token, which
# is used to validate API calls for getting/setting data
authorizationData = call.json()
accessToken = authorizationData['access_token']
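
Since tokens last about an hour, a notebook session left open between cells could in principle outlive its token. The sketch below shows one way to refetch only when needed. Everything in it is an assumption for illustration: the `TokenCache` class is not part of this script, and the fake fetch function and fake clock stand in for the real `requests.post` call so the example runs without network access.

```python
import time


class TokenCache:
    """Hold an access token and fetch a new one shortly before it expires."""

    def __init__(self, fetch, lifetime_seconds=3600, margin_seconds=60,
                 clock=time.monotonic):
        self._fetch = fetch              # callable returning a fresh token string
        self._lifetime = lifetime_seconds
        self._margin = margin_seconds    # refresh a little early to be safe
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = now + self._lifetime
        return self._token


# Fake fetch and clock so the behaviour can be checked offline
calls = []
fake_now = [0.0]


def fake_fetch():
    calls.append(1)
    return f"token-{len(calls)}"


cache = TokenCache(fake_fetch, clock=lambda: fake_now[0])
first = cache.get()      # nothing cached yet, so this fetches
second = cache.get()     # still fresh, so the cached token is reused
fake_now[0] = 3600.0     # pretend an hour has passed
third = cache.get()      # expired, so this fetches again
print(first, second, third)  # token-1 token-1 token-2
```

In real use, `fetch` would wrap the token request shown in the cell above and return `authorizationData['access_token']`.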

## Send a request

LibCal uses two different modules for Spaces and Equipment, and each has different API URLs. In this section we'll build API calls to pull one month's worth of data for each.

In [8]:
equipURL = 'https://yorku.libcal.com/1.1/equipment/bookings'
spaceURL = 'https://yorku.libcal.com/1.1/space/bookings'

# NOTE for the following: the MAXIMUM record limit for the LibCal API is 500, meaning 500 rows of data. No issues currently, but in the future
# this may become a problem that needs to be dealt with. A month that contains more than 500 records would have data truncated.
# As of March 2023 (our busiest month so far, judged in summer of 2023) the equipment booking list
# had 415 entries... so the time of This Breaking is nigh at hand
equipData = {'date': datadate,
        'days': daysInMonth,
        'limit': 500,
        'lid': locationID,
        'include_cancel': 1,
        'formAnswers': 1}
spaceData = {'date': datadate,
        'days': daysInMonth,
        'limit': 500,
        'lid': locationID,
        'include_cancel': 1,
        'formAnswers': 1}

headers = {'Authorization':'Bearer '+accessToken}

# get Equipment module data for the month
response = requests.get(equipURL, headers=headers, params=equipData)
equipAPIData = response.json()

# get Spaces module data for the month
response = requests.get(spaceURL, headers=headers, params=spaceData)
spacesAPIData = response.json()
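
If a month ever does exceed the 500-record limit, one possible workaround is to split the month into shorter windows, request each window separately with the same `date` and `days` parameters used above, and merge the results. The sketch below covers only the window arithmetic and the merge. The helper names are assumptions, and so is the idea that a booking might show up in two adjacent windows, which is why the merge de-duplicates on `bookId`.

```python
from datetime import date, timedelta


def date_windows(start, total_days, window_days=7):
    """Split a run of days into (start_date_string, days) pairs."""
    first = date.fromisoformat(start)
    windows = []
    offset = 0
    while offset < total_days:
        span = min(window_days, total_days - offset)
        windows.append(((first + timedelta(days=offset)).isoformat(), span))
        offset += span
    return windows


def merge_bookings(batches):
    """Combine batches of booking records, keeping one record per bookId."""
    merged = {}
    for batch in batches:
        for record in batch:
            merged[record["bookId"]] = record
    return list(merged.values())


windows = date_windows("2024-05-01", 31)
print(windows)
# [('2024-05-01', 7), ('2024-05-08', 7), ('2024-05-15', 7),
#  ('2024-05-22', 7), ('2024-05-29', 3)]

# Made-up batches: the second repeats a booking from the first
batches = [[{"bookId": "a"}, {"bookId": "b"}], [{"bookId": "b"}, {"bookId": "c"}]]
print(len(merge_bookings(batches)))  # 3
```

Each `(date, days)` pair would replace the `date` and `days` values in `equipData` or `spaceData` for one request. A cheap safety check in the meantime is to warn whenever a response contains exactly 500 records, since that suggests truncation.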


## First-pass data cleaning (for raw CSV output)

Some JSON values are EMPTY, and this results in misaligned columns in the CSV output file near the end of the rows (where custom form question data is output). We need to iterate through the JSON data and replace any empty values in fields such as "cancelled", "q2489", "q2490" and "q2491". I'm using two loops, one for each data set, because their structures are different. I'm sure there's a more elegant way to generalize this, but why get fancy if we don't need to?
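
For reference, here is a rough sketch of how the two loops could be folded into one function if that ever becomes worthwhile. The `clean_entry` name and its parameters are assumptions, and the demo uses a made-up record with only a subset of the real question IDs.

```python
import hashlib


def clean_entry(entry, required_fields, rename_map, drop_fields):
    """Fill blanks with "null", rename question fields, drop extras, hash the email."""
    for field in required_fields:
        if field not in entry or str(entry[field]) == "":
            entry[field] = "null"
    for old_key, new_key in rename_map.items():
        entry[new_key] = entry.pop(old_key)
    for field in drop_fields:
        entry.pop(field, None)   # no error if the field is already absent
    entry["email"] = hashlib.md5(entry["email"].encode()).hexdigest()
    return entry


# Made-up record for illustration only
record = {"firstName": "A", "lastName": "B", "account": "x",
          "email": "someone@example.com", "q2489": "Undergraduate Student"}

cleaned = clean_entry(
    record,
    required_fields=["cancelled", "q2489", "q2734"],
    rename_map={"q2489": "relpToYork", "q2734": "faculty"},
    drop_fields=["firstName", "lastName", "account"],
)
print(cleaned)
```

Each data set would then only need its own lists of fields, with the loop body shared.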



In [9]:
# List of field names for which data may be missing (due to booking form variations etc.);
# for example, if no "cancellation" of a booking occurred, there is no 'cancelled' data in the record.
# This is mostly for space data. Equipment data only has "cancelled" as a possible missing field
possiblyMissingFields = ['cancelled', 'q2579', 'q2669', 'seat_id', 'seat_name', 'check_in_code']

# Basic cleaning for Equipment API data
for entry in equipAPIData:
    if "cancelled" in entry:
        if entry['cancelled'] == "":
            entry['cancelled'] = "null"
    else:
        entry["cancelled"] = "null"
     
    # These are the custom questions, which are occasionally not filled out; 
    # empty questions result in mal-formatted CSV output so I set to null if nonexistent in a record
    for customQKey in ['q2489','q2491', 'q2704', 'q2705', 'q2669', 'q2861', 'q2862', 'q2734', 'q2579', 'q3039', 'q3047','q3048']:
        if (customQKey not in entry.keys()) or (entry[customQKey] == ""):
            entry[customQKey] = "null"
        
    #Change key names for custom question fields
    entry["relpToYork"] = entry.pop("q2489")
    entry["project"] = entry.pop("q2491")
    entry["VRexperience"] = entry.pop("q2579")
    entry["groupBooking"] = entry.pop("q2704")
    entry["groupSize"] = entry.pop("q2705")
    entry["faculty"] = entry.pop("q2734")
    entry["flexStudioUse"] = entry.pop("q2669")
    entry["flexStudioPhotoCameraChoice"] = entry.pop("q2862")
    entry["flexStudioVidCameraChoice"] = entry.pop("q2861")
    entry["flexStudioBackgroundChoice"] = entry.pop("q3039")
    entry.pop("q3047")
    entry.pop("q3048")
    
    # Remove identifying patron information
    entry.pop("firstName") 
    entry.pop("lastName")
    entry.pop("account")
    entry["email"] = hashlib.md5(entry["email"].encode()).hexdigest()
    

    # Handle cases where the 'Other' field value might cause problems with stats (since it's a possible answer for
    # two different questions on booking forms)
    if entry["relpToYork"] == "Other": entry["relpToYork"] = "Other Relationship" 
    if entry["faculty"] == "Other": entry["faculty"] = "Other Faculty or No Faculty"

# Basic cleaning for Spaces API data
for entry in spacesAPIData:
    
    # Go through list of 'possibly missing' fields to see if they're in the data; if so and empty,
    # set to null. Without this, the output CSV file will be missing some fields and data won't line up with headers
    for possiblyMissing in possiblyMissingFields:
        if possiblyMissing in entry:
            if str(entry[possiblyMissing]) == "":
                entry[possiblyMissing] = "null"
        else:
            entry[possiblyMissing] = "null"

    # These are the custom questions, which are occasionally not filled out; 
    # empty questions result in mal-formatted CSV output so I set to null if nonexistent in a record
    for customQKey in ['q2489','q2491','q2669', 'q2704', 'q2705', 'q2861', 'q2862', 'q2734', 'q2579', 'q3039', 'q3047', 'q3048']:
        if (customQKey not in entry.keys()) or (entry[customQKey] == ""):
            entry[customQKey] = "null"
    
    #Change key names for custom question fields
    entry["relpToYork"] = entry.pop("q2489")
    entry["project"] = entry.pop("q2491")
    entry["VRexperience"] = entry.pop("q2579")
    entry["flexStudioUse"] = entry.pop("q2669")
    entry["flexStudioPhotoCameraChoice"] = entry.pop("q2862")
    entry["flexStudioVidCameraChoice"] = entry.pop("q2861")
    entry["flexStudioBackgroundChoice"] = entry.pop("q3039")
    entry["faculty"] = entry.pop("q2734")
    entry["groupBooking"] = entry.pop("q2704")
    entry["groupSize"] = entry.pop("q2705")
    entry.pop("q3047")
    entry.pop("q3048")
    
    # Remove identifying patron information
    entry.pop("firstName") 
    entry.pop("lastName")
    entry.pop("account")
    entry["email"] = hashlib.md5(entry["email"].encode()).hexdigest()
    
    # Handle cases where the 'Other' field value might cause problems with stats (since it's a possible answer for
    # two different questions on booking forms)
    if entry["relpToYork"] == "Other": entry["relpToYork"] = "Other Relationship" 
    if entry["faculty"] == "Other": entry["faculty"] = "Other Faculty or No Faculty"

In [10]:
print(entry)

{'bookId': 'cs_Xg41Zmu3', 'id': 8996655, 'eid': 19904, 'cid': 6842, 'lid': 2632, 'fromDate': '2024-05-30T14:00:00-04:00', 'toDate': '2024-05-30T16:10:00-04:00', 'created': '2024-05-21T18:57:11-04:00', 'email': '471f5364145667650dad6fed5483a6f7', 'status': 'Mediated Approved', 'location_name': 'Scott Media Creation Lab', 'category_name': 'Flex Studio Spaces', 'item_name': '204 Flex Studio', 'event': None, 'check_in_code': 'E34', 'cancelled': 'null', 'seat_id': 'null', 'seat_name': 'null', 'relpToYork': 'Undergraduate Student', 'project': 'Women in Law Podcast ', 'VRexperience': 'null', 'flexStudioUse': 'Podcasting (Audio and Video)', 'flexStudioPhotoCameraChoice': 'null', 'flexStudioVidCameraChoice': 'null', 'flexStudioBackgroundChoice': 'Black', 'faculty': 'Health (HH)', 'groupBooking': 'null', 'groupSize': 'null'}


## Writing Data to a CSV file


In [11]:

## OUTPUT: Equipment data

# newline='' is what the csv module expects (it avoids blank rows on Windows);
# the with-block closes the file even if a write fails
with open("data/"+datadate+"_equip.csv", 'w', newline='') as csvOut:

    # create the csv writer object
    csv_writer = csv.DictWriter(csvOut, fieldnames=equipInputDataFieldnames)

    # Output the header first
    csv_writer.writeheader()

    for record in equipAPIData:
        csv_writer.writerow(record)




## OUTPUT: Space data

# newline='' is what the csv module expects (it avoids blank rows on Windows);
# the with-block closes the file even if a write fails
with open("data/"+datadate+"_space.csv", 'w', newline='') as csvOut:

    # create the csv writer object
    csv_writer = csv.DictWriter(csvOut, fieldnames=spacesInputDataFieldnames)

    # Output the header first
    csv_writer.writeheader()

    for record in spacesAPIData:
        csv_writer.writerow(record)



## Pull in CSV data for final processing 

_(AND/OR convert the data to DataFrames, the code for which was added in June 2023 but has not yet been tested in ANY way)_

In [12]:
# Hey, so those files we literally just created? Let's read 'em into Pandas Dataframes! 
# Why is this so obtuse, you ask?
# Because this was originally a separate script and I should consider converting the dict from earlier 
# in THIS script into a DataFrame but TBH I think the raw CSV files are still valuable and so this is ok by me

# Anyway, since we built the CSV files in the previous step, the format of them should be reliable 
# and we can simply read them into Pandas

spacesData = pd.read_csv("data/"+datadate+"_space.csv", index_col='id')
equipData = pd.read_csv("data/"+datadate+"_equip.csv", index_col='id')


###
### FOR FUTURE WORK: to convert a list to DataFrame without doing the CSV interim step, use
### this pattern:
###
###     import pandas as pd
###     list_name = ['item_1', 'item_2', 'item_3', ...]
###     df = pd.DataFrame(list_name, columns=['column_name'])
###
### CONVERT Spaces data to DataFrame (ADDED JUNE 2023 but NOT TESTED...)
### (columns takes the list of field names itself, not a quoted string)
# spacesData = pd.DataFrame(spacesAPIData, columns=spacesInputDataFieldnames)
###
### CONVERT Equipment data to DataFrame (ADDED JUNE 2023 but NOT TESTED...)
# equipData = pd.DataFrame(equipAPIData, columns=equipInputDataFieldnames)
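
A small, self-contained sketch of the direct list-to-DataFrame conversion mentioned under FUTURE WORK follows. The records and the `fieldnames` list are made up for illustration; in the notebook they would be `spacesAPIData` and `spacesInputDataFieldnames`. One difference from the CSV route is worth flagging: `pd.read_csv` treats the literal string "null" as a missing value by default, while a direct conversion keeps it as text, so the sketch blanks those cells explicitly.

```python
import pandas as pd

# Two made-up booking records standing in for spacesAPIData after cleaning
records = [
    {"id": 1, "status": "Mediated Approved", "email": "aaa",
     "cancelled": "null", "extra": "x"},
    {"id": 2, "status": "Cancelled by User", "email": "bbb"},
]
# Hypothetical stand-in for spacesInputDataFieldnames
fieldnames = ["id", "status", "email", "cancelled"]

# columns= selects and orders the fields: keys not listed (like "extra") are
# dropped, and fields missing from a record become NaN
df = pd.DataFrame(records, columns=fieldnames).set_index("id")

# Replace the "null" placeholder strings with NaN, matching what read_csv does
df = df.mask(df == "null")

print(df)
```

Without the `mask` step, later calls such as `value_counts(dropna=True)` would count "null" as a real answer.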

# Metrics and Stats Processing

## Cancelled vs Actual Bookings


In [13]:
# Drop staff-affiliated bookings right off the top, so numbers all match; 
# otherwise the counts of cancellations, etc. get thrown off
for address in adminEmails:
    spacesData.drop(spacesData.index[(spacesData["email"] == address)],axis=0,inplace=True)
    equipData.drop(equipData.index[(equipData["email"] == address)],axis=0,inplace=True)

# Grab a Series of just Status Column
spacesBookingStatus = spacesData["status"]
equipmentBookingStatus = equipData["status"]

# How many do we have? Do counts of all of the relevant data fields
spacesOutputData["totalBookings"] = len(spacesBookingStatus)
equipmentOutputData["totalBookings"] = len(equipmentBookingStatus)

spacesOutputData["cancelledByUsers"] = len(spacesBookingStatus[spacesBookingStatus.str.startswith('Cancelled by User')])
equipmentOutputData["cancelledByUsers"] = len(equipmentBookingStatus[equipmentBookingStatus.str.startswith('Cancelled by User')])

spacesOutputData["cancelledBySystem"] = len(spacesBookingStatus[spacesBookingStatus.str.startswith('Cancelled by System')])
equipmentOutputData["cancelledBySystem"] = len(equipmentBookingStatus[equipmentBookingStatus.str.startswith('Cancelled by System')])

spacesOutputData["cancelledByAdmin"] = len(spacesBookingStatus[spacesBookingStatus.str.startswith('Cancelled by Admin')])
equipmentOutputData["cancelledByAdmin"] = len(equipmentBookingStatus[equipmentBookingStatus.str.startswith('Cancelled by Admin')])

spacesOutputData["totalActualBookings"] = spacesOutputData["totalBookings"]-spacesOutputData["cancelledByUsers"]-spacesOutputData["cancelledBySystem"]-spacesOutputData["cancelledByAdmin"]
equipmentOutputData["totalActualBookings"] = equipmentOutputData["totalBookings"]-equipmentOutputData["cancelledByUsers"]-equipmentOutputData["cancelledBySystem"]-equipmentOutputData["cancelledByAdmin"]


print("SPACES DATA")
print("Total bookings made:", spacesOutputData["totalBookings"])
print("Cancelled by users:", spacesOutputData["cancelledByUsers"]) 
print("Cancelled for late checkin:", spacesOutputData["cancelledBySystem"])
print("Cancelled by staff:", spacesOutputData["cancelledByAdmin"])
print("Total actual bookings:", spacesOutputData["totalActualBookings"])
print()
print("EQUIPMENT DATA")
print("Total bookings made:", equipmentOutputData["totalBookings"])
print("Cancelled by users:", equipmentOutputData["cancelledByUsers"]) 
print("Cancelled for late checkin:", equipmentOutputData["cancelledBySystem"])
print("Cancelled by staff:", equipmentOutputData["cancelledByAdmin"])
print("Total actual bookings:", equipmentOutputData["totalActualBookings"])

SPACES DATA
Total bookings made: 86
Cancelled by users: 20
Cancelled for late checkin: 10
Cancelled by staff: 6
Total actual bookings: 50

EQUIPMENT DATA
Total bookings made: 348
Cancelled by users: 26
Cancelled for late checkin: 52
Cancelled by staff: 28
Total actual bookings: 242
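
The arithmetic behind these figures is simply total bookings minus the three kinds of cancellation. A plain-Python sketch of the same tally is below. The `booking_summary` function is an illustration, not part of the script, and the status list is synthetic, built to match the spaces counts printed above.

```python
def booking_summary(statuses):
    """Count cancellations by type and derive the number of actual bookings."""
    prefixes = {
        "cancelledByUsers": "Cancelled by User",
        "cancelledBySystem": "Cancelled by System",
        "cancelledByAdmin": "Cancelled by Admin",
    }
    summary = {"totalBookings": len(statuses)}
    for key, prefix in prefixes.items():
        summary[key] = sum(1 for status in statuses if status.startswith(prefix))
    cancelled = sum(summary[key] for key in prefixes)
    summary["totalActualBookings"] = summary["totalBookings"] - cancelled
    return summary


# Synthetic statuses mirroring the spaces counts: 20 + 10 + 6 cancelled, 50 kept
statuses = (["Cancelled by User"] * 20 + ["Cancelled by System"] * 10
            + ["Cancelled by Admin"] * 6 + ["Mediated Approved"] * 50)
summary = booking_summary(statuses)
print(summary["totalBookings"], summary["totalActualBookings"])  # 86 50
```

The equipment figures check out the same way: 348 - 26 - 52 - 28 = 242.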


## DROPPING UNWANTED DATA

Drop rows where bookings have been cancelled (by the user; by a staff member; by the system due to late checkin)

In [14]:
# Drop bookings cancelled by User, by System and by Admin
spacesData.drop(spacesData.index[(spacesData["status"] == 'Cancelled by User')],axis=0,inplace=True)
equipData.drop(equipData.index[(equipData["status"] == 'Cancelled by User')],axis=0,inplace=True)

spacesData.drop(spacesData.index[(spacesData["status"] == 'Cancelled by System')],axis=0,inplace=True)
equipData.drop(equipData.index[(equipData["status"] == 'Cancelled by System')],axis=0,inplace=True)

spacesData.drop(spacesData.index[(spacesData["status"].str.startswith('Cancelled by Admin'))],axis=0,inplace=True)
equipData.drop(equipData.index[(equipData["status"].str.startswith('Cancelled by Admin'))],axis=0,inplace=True)

#spacesData.drop(spacesData.index[(spacesData["cid"].str.startswith('6212'))],axis=0,inplace=True)
#equipData.drop(equipData.index[(equipData["cid"].str.startswith('6212'))],axis=0,inplace=True)
spacesData.drop(spacesData.loc[spacesData['cid']==6212].index, axis=0, inplace=True)
equipData.drop(equipData.loc[equipData['cid']==6212].index, axis=0, inplace=True)



## Data: Unique projects, VR content choices, and Flex Studio uses

In [15]:
# Set up a string translation table to remove newline characters from "project" entry fields
# Since the string is built by casting a LIST as a STRING, we can also remove
# the [ and ] characters that Python would use to show the data is in a list
strTranslation = str.maketrans('', '' ,'\r\n[]')

# First we can grab project field data and calculate the number of "unique" projects found within it
spacesOutputData["uniqueProjects"] = len(spacesData['project'].unique())-1
spacesOutputData["projectList"] = str(spacesData['project'].unique()).replace("nan ","") #remove Pandas NaN values from the string
spacesOutputData["projectList"] = spacesOutputData["projectList"].translate(strTranslation)

equipmentOutputData["uniqueProjects"] = len(equipData['project'].unique())-1
equipmentOutputData["projectList"] = str(equipData['project'].unique()).replace("nan ","") #remove Pandas NaN values from the string
equipmentOutputData["projectList"] = equipmentOutputData["projectList"].translate(strTranslation)

# Users also provide info on how they'll use the Flex Studio or VR Rooms as part of those forms, so grab that...
spacesOutputData["VRContentList"] = str(spacesData['VRexperience'].unique()).replace("nan","") #remove Pandas NaN values from the string
spacesOutputData["VRContentList"] = spacesOutputData["VRContentList"].translate(strTranslation)
spacesOutputData["flexStudioUseList"] = str(spacesData['flexStudioUse'].unique()).replace("nan","") #remove Pandas NaN values from the string
spacesOutputData["flexStudioUseList"] = spacesOutputData["flexStudioUseList"].translate(strTranslation)

print("Equipment projects:", equipmentOutputData["projectList"])
print("Number of unique equipment projects:", equipmentOutputData["uniqueProjects"])
print()
print("Spaces projects:", spacesOutputData["projectList"])
print("Number of unique spaces projects:", spacesOutputData["uniqueProjects"])
print()
print("Flex Studio projects:", spacesOutputData["flexStudioUseList"])
print("VR Room experiences:", spacesOutputData["VRContentList"])


Equipment projects: 'an video for business project ' 'YouTube Podcast ' 'social media ' 'Audio testing' 'Testing Audio' 'Student Training Video' 'Photography Project' 'Personal Project' 'Private event' 'Content Creating' 'Recording event' 'Virtual Reality and Metaverse' 'Video creation' 'Personal Dance Project to gain self-confidence' 'Headshots' 'sports media' 'personal' 'Own project.' 'Own project' 'CMDS1000 Screenfast Remediated' 'Soundings ' 'FA VISA 2065' 'Training' 'training' 'Personal Project (training)' 'Faculy of Health Videos' 'Library Project' 'Personal' 'Meetings' 'Test shooting for a photography project' 'Song recording' 'Project' 'Social Media Content' 'Social Media content' 'Journalism interviews' 'Recruitment promotion video' 'Recruitment promotion' 'Sound Art Project' 'documentary' 'Research video' 'Photography ' 'Shoot' 'Shooy' 'DJ practicing ' 'teaching' 'Casual photography' 'personal documentary' 'Student Film' 'Experimental Feature Film' 'Psychology' 'Creative cont
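
One thing to watch in the unique-project count: `len(...unique()) - 1` assumes the column contains at least one missing value, since the `- 1` is there to discount NaN. If every booking in a month has a project filled in, that count comes out one too low. The sketch below, using made-up project names, compares it with `Series.nunique()`, which skips missing values on its own.

```python
import pandas as pd

# One column with a missing project, one without (made-up values)
with_missing = pd.Series(["Podcast", "Podcast", "Headshots", None])
without_missing = pd.Series(["Podcast", "Headshots"])

# The pattern used above: correct only when a missing value is present
count_a = len(with_missing.unique()) - 1       # 2 (Podcast, Headshots)
count_b = len(without_missing.unique()) - 1    # 1, which undercounts

# nunique() ignores missing values, so it is right in both cases
safe_a = with_missing.nunique()                # 2
safe_b = without_missing.nunique()             # 2

print(count_a, count_b, safe_a, safe_b)  # 2 1 2 2
```

Entries that differ only in case or punctuation (such as 'Training' and 'training' in the output above) still count as separate projects either way.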

## Data: Frequencies / Counts for faculty, RTI, space categories and equipment categories


In [16]:
# Both the equipment and spaces data have info for faculty, RTI, space/equipment category and item, so we'll
# look through those to build tallies and store them in the output data
for category in ["faculty", "relpToYork", "category_name", 'item_name']:
    spacesCategoryTallies = spacesData[category].value_counts(dropna=True).to_dict()
    equipmentCategoryTallies = equipData[category].value_counts(dropna=True).to_dict()
    for key in spacesCategoryTallies:
        spacesOutputData[key] = spacesCategoryTallies[key]
    for key in equipmentCategoryTallies:
        equipmentOutputData[key] = equipmentCategoryTallies[key]    

# The Spaces data has a unique "seat_name" category so we'll do this one separately...
workstationTallies = spacesData["seat_name"].value_counts(dropna=True).to_dict()
for key in workstationTallies:
    spacesOutputData[key] = workstationTallies[key]     


## Unique Users

In [17]:
# We can count unique users by looking at unique email addresses. As a note, I have seen that some students will
# use more than one address when booking, which makes one person appear as two (or three).... not sure there's
# much to be done about this
spacesOutputData["uniqueUsers"] = len(spacesData['email'].unique())
equipmentOutputData["uniqueUsers"] = len(equipData['email'].unique())

print("Number of unique space users:", spacesOutputData["uniqueUsers"])
print("Number of unique equipment users:", equipmentOutputData["uniqueUsers"])

Number of unique space users: 27
Number of unique equipment users: 74


## User access times

In [18]:
# The fromDate field is used in both modules to log the time a user checks out a piece of equipment OR checks in to a space.
# In this section we'll run through that data and pull out frequencies by day of the week and hour of the day
spacesCheckInTimes = spacesData['fromDate'].unique()
spacesCheckInTimes = spacesCheckInTimes.tolist()

equipCheckInTimes = equipData['fromDate'].unique()
equipCheckInTimes = equipCheckInTimes.tolist()

# Run through each of the check-in times lists to strip out day and hour info, and populate
# the outputData structures for each of the equipment and spaces categories
for reportType in [spacesCheckInTimes, equipCheckInTimes]:
    
    #initialize array for hours-only data
    checkInHours = []
    checkInDays = []

    # Convert HH:MM info strings into hours
    for entry in reportType:
        if isinstance(entry, str):
            accessTime = datetime.strptime(entry, "%Y-%m-%dT%H:%M:%S%z")
            # Need to handle the case where bookings are added manually, in which
            # case the time is set to midnight. For now, alter this to read 10AM (lab opening time)
            if accessTime.strftime("%I%p") == "12AM":
                checkInHours.append("10AM")
            else:
                checkInHours.append(accessTime.strftime("%I%p"))
            checkInDays.append(accessTime.strftime("%A"))

    # set up a Counter object to do frequency counts for checking times (total for whole month)
    checkInHoursCount = collections.Counter(checkInHours)
    checkInDaysCount = collections.Counter(checkInDays)

    for item in [checkInHoursCount, checkInDaysCount]:

        # Store totals in our outputData
        for key, value in item.items():
            # Identity check ("is"): two lists with equal contents must not be mistaken for each other
            if reportType is spacesCheckInTimes:
                spacesOutputData[key] = value
                print(key, spacesOutputData[key])
            else:
                equipmentOutputData[key] = value
                print(key, equipmentOutputData[key])
            #print(f"{key}: {value}")



11AM 14
02PM 6
03PM 4
05PM 1
12PM 6
01PM 6
04PM 2
Wednesday 12
Thursday 8
Sunday 4
Monday 7
Tuesday 8
11AM 27
01PM 11
02PM 25
04PM 14
09AM 3
03PM 23
12PM 21
05PM 1
Monday 16
Wednesday 33
Thursday 34
Tuesday 28
Sunday 7
Friday 7
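
The hour and day tallies above can be reproduced on a handful of timestamps with a stand-alone function. This is a sketch: the `tally_check_ins` name is an assumption and the timestamps are made up, but the format string and the midnight-to-10AM rule are the ones used in the cell above. Day names and AM/PM labels assume an English locale.

```python
from collections import Counter
from datetime import datetime


def tally_check_ins(timestamps, opening_label="10AM"):
    """Count bookings per hour-of-day label and per weekday name."""
    hours, days = Counter(), Counter()
    for stamp in timestamps:
        if not isinstance(stamp, str):
            continue  # skip NaN or other non-string values
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S%z")
        label = when.strftime("%I%p")
        # Manually added bookings are stamped midnight; count them at opening time
        hours[opening_label if label == "12AM" else label] += 1
        days[when.strftime("%A")] += 1
    return hours, days


sample = [
    "2024-05-30T14:00:00-04:00",   # Thursday, 2 PM
    "2024-05-30T00:00:00-04:00",   # Thursday, midnight (manual booking)
    "2024-05-27T14:30:00-04:00",   # Monday, 2:30 PM
]
hours, days = tally_check_ins(sample)
print(dict(hours))  # {'02PM': 2, '10AM': 1}
print(dict(days))   # {'Thursday': 2, 'Monday': 1}
```

Returning two separate counters keeps hour labels and day names apart until they are copied into the output dictionaries.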


## Data: First Time Users

In [19]:
spacesOutputData["firstTimeUsers"] = 0
equipmentOutputData["firstTimeUsers"] = 0
overallFirstTimeUsers = 0

# Read existing data files and build user lists
# (with-blocks close each file before it is reopened for writing further down)
with open("data/existingSpaceUsers.txt", "r") as existingSpaceUserFile:
    existingSpaceUserDataSet = existingSpaceUserFile.read()
existingSpaceUsers = existingSpaceUserDataSet.split("\n")

with open("data/existingEquipmentUsers.txt", "r") as existingEquipmentUserFile:
    existingEquipmentUserDataSet = existingEquipmentUserFile.read()
existingEquipmentUsers = existingEquipmentUserDataSet.split("\n")

with open("data/existingOverallUsers.txt", "r") as existingOverallUserFile:
    existingOverallUserDataSet = existingOverallUserFile.read()
existingOverallUsers = existingOverallUserDataSet.split("\n")

# Get a list of unique users for this month
spaceUniqueUsers = spacesData['email'].unique()
spaceUniqueUsers = spaceUniqueUsers.tolist()

equipUniqueUsers = equipData['email'].unique()
equipUniqueUsers = equipUniqueUsers.tolist()

overallUniqueUsers = list(set(spaceUniqueUsers + equipUniqueUsers))




for usergroup in [spaceUniqueUsers, equipUniqueUsers, overallUniqueUsers]:

    for userThisMonth in usergroup:

        # make a hash of the email address so we don't have text files full of plaintext addresses
        # NOTE: no longer needed if all email addresses are hashed in "raw" CSV output
        #userThisMonth = hashlib.md5(userThisMonth.encode()).hexdigest()

        # Compare by identity ("is"), not equality: two groups holding the same users would otherwise be confused
        if usergroup is spaceUniqueUsers:
            if userThisMonth not in existingSpaceUsers:
                spacesOutputData["firstTimeUsers"] += 1
                existingSpaceUsers.append(userThisMonth)
        elif usergroup is equipUniqueUsers:
            if userThisMonth not in existingEquipmentUsers:
                equipmentOutputData["firstTimeUsers"] += 1
                existingEquipmentUsers.append(userThisMonth)
        else:
            if userThisMonth not in existingOverallUsers:
                overallFirstTimeUsers += 1
                existingOverallUsers.append(userThisMonth)
        
existingOverallUsers = list(set(existingSpaceUsers + existingEquipmentUsers))  

# Dump the new existing user lists back to a file
with open("data/existingSpaceUsers.txt", "w") as existingSpaceUserFile:
    existingSpaceUserDataSet = "\n".join(existingSpaceUsers)
    existingSpaceUserFile.write(existingSpaceUserDataSet)

with open("data/existingEquipmentUsers.txt", "w") as existingEquipmentUserFile:
    existingEquipmentUserDataSet = "\n".join(existingEquipmentUsers)
    existingEquipmentUserFile.write(existingEquipmentUserDataSet)
    
with open("data/existingOverallUsers.txt", "w") as existingOverallUserFile:
    existingOverallUserDataSet = "\n".join(existingOverallUsers)
    existingOverallUserFile.write(existingOverallUserDataSet)

print("This month's number of new space users:", spacesOutputData["firstTimeUsers"])
print("This month's number of new equipment users:", equipmentOutputData["firstTimeUsers"])
print("This month's number of new overall users:", overallFirstTimeUsers)

This month's number of new space users: 16
This month's number of new equipment users: 30
This month's number of new overall users: 42
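
The first-time-user logic boils down to "who is in this month's list but not in the stored list". A compact sketch of that comparison is below. The `find_first_time_users` helper and the `hash-*` values are made up for illustration; real entries would be the MD5-hashed addresses.

```python
def find_first_time_users(this_month, existing):
    """Return (new_users, updated_existing) without modifying either input."""
    known = set(existing)
    # dict.fromkeys de-duplicates while preserving the original order
    new_users = [user for user in dict.fromkeys(this_month) if user not in known]
    updated = list(existing) + new_users
    return new_users, updated


existing = ["hash-a", "hash-b"]
this_month = ["hash-b", "hash-c", "hash-d"]

new_users, updated = find_first_time_users(this_month, existing)
print(len(new_users), updated)
# 2 ['hash-a', 'hash-b', 'hash-c', 'hash-d']
```

Because the function returns a new list rather than appending in place, the stored files only change when the caller writes `updated` back out, which makes a failed run easier to repeat.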


## Final Data Output


In [20]:
# Open both output files for writing; newline='' is what the csv module expects,
# and the with-block closes the files even if a write fails
with open("data/"+datadate+"_space_finalStats.csv", 'w', newline='') as spaceCsvOut, \
     open("data/"+datadate+"_equipment_finalStats.csv", 'w', newline='') as equipmentCsvOut:

    # create the csv writer objects
    spaceCsvWriter = csv.DictWriter(spaceCsvOut, fieldnames=spacesFinalFieldnames)
    equipmentCsvWriter = csv.DictWriter(equipmentCsvOut, fieldnames=equipmentFinalFieldnames)

    # Output the header first
    spaceCsvWriter.writeheader()
    equipmentCsvWriter.writeheader()

    spaceCsvWriter.writerow(spacesOutputData)
    equipmentCsvWriter.writerow(equipmentOutputData)
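
Because the final row has to line up with the tracking spreadsheet, it helps to know how `csv.DictWriter` treats keys that are missing from, or extra to, the fieldnames list. By default a missing key is written as an empty cell and an unexpected key raises `ValueError`. The sketch below shows the two options that change this. The field names and numbers are made up, and whether a blank should become `0` is a judgment call for the spreadsheet, not something this script currently does.

```python
import csv
import io

# Hypothetical stand-ins for spacesFinalFieldnames and spacesOutputData
fieldnames = ["totalBookings", "uniqueUsers", "Monday", "Tuesday"]
output_data = {"totalBookings": 86, "uniqueUsers": 27, "Monday": 7, "Saturday": 1}

buffer = io.StringIO()   # in-memory file, so nothing is written to disk
writer = csv.DictWriter(
    buffer,
    fieldnames=fieldnames,
    restval=0,               # value written for fields missing from the row
    extrasaction="ignore",   # skip keys that are not in fieldnames
)
writer.writeheader()
writer.writerow(output_data)

# Report skipped keys so a renamed item in LibCal does not vanish silently
unexpected = sorted(set(output_data) - set(fieldnames))

print(buffer.getvalue())
print("Not written:", unexpected)
```

Printing the skipped keys is a cheap way to notice when LibCal has gained a new item, space or form answer that the spreadsheet has no column for yet.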