# Event Categorization

Below are the results of analyzing a list of arbitrary third-party events and categorizing them into specific activities. Although the instructions focused on only the Eventbrite JSON file, I also looked at the Ticketmaster JSON file for comparison.

## Eventbrite Classification

For this exercise, it seemed more effective to explore the event data first and then decide on the list of human activities.

In [1]:
# import numpy as np
import pandas as pd
import json

# open, load, and inspect the dict from the json file
# (the with-block closes the file handle once the data is loaded)
with open('datasets/eb-response-page1.json') as eb_json:
    eb_dict = json.load(eb_json)
eb_dict.keys()

# eb_dict['pagination']
# eb_dict['location']

dict_keys(['pagination', 'events', 'location'])

After navigating the dictionaries inside `eb_dict`, I discovered that all relevant data can be found in `events`.
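
One way to do this kind of navigation is to summarize each top-level key by its type and size before drilling in. The sketch below is illustrative only: `summarize_keys` is a hypothetical helper, and `sample_response` is a made-up stand-in whose inner fields are assumptions, not the actual Eventbrite payload.

```python
def summarize_keys(payload):
    """Return {key: (type name, length or None)} for a dict's top-level keys."""
    summary = {}
    for key, value in payload.items():
        # Only containers have a meaningful size here
        size = len(value) if isinstance(value, (list, dict)) else None
        summary[key] = (type(value).__name__, size)
    return summary


# Made-up stand-in for the real response (inner field names are assumptions)
sample_response = {
    'pagination': {'page_number': 1, 'page_count': 4},
    'events': [{'id': '1'}, {'id': '2'}],
    'location': {'latitude': '45.5', 'longitude': '-122.6'},
}

structure = summarize_keys(sample_response)
print(structure)
```

A list of records, as `events` is here, is usually the part worth loading into a dataframe.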

In [2]:
# create df for data cleaning
eb_events_raw = pd.DataFrame(eb_dict['events'])
eb_events_raw.head(2).T

Unnamed: 0,0,1
name,"{'text': 'The 4 Shifts to Lifelong Health, End...",{'text': 'Easy Access to Business Lines of Cre...
description,{'text': '4 Powerful Shifts to Lifelong Health...,{'text': 'Raising Capital for My Business Fast...
id,81156743003,70394157821
url,https://www.eventbrite.com/e/the-4-shifts-to-l...,https://www.eventbrite.com/e/easy-access-to-bu...
start,"{'timezone': 'America/Los_Angeles', 'local': '...","{'timezone': 'America/Los_Angeles', 'local': '..."
end,"{'timezone': 'America/Los_Angeles', 'local': '...","{'timezone': 'America/Los_Angeles', 'local': '..."
organization_id,304101319856,308256707370
created,2019-11-10T05:59:40Z,2019-08-24T19:57:54Z
changed,2019-11-10T06:00:24Z,2019-08-24T19:57:59Z
published,2019-11-10T06:00:24Z,2019-08-24T19:57:58Z


I explored `name`, `description`, `start`, and `end`, in case they contained information that could help with the categorization. I ended up extracting just the `text` values for `name` and `description`, and the `local` values from `start` and `end`.

In [3]:
# eb_dict_cols = ['name', 'description', 'start', 'end']

# for e in eb_dict_cols:
#     print('Sample entries in "{}":'.format(e))
#     for event in eb_events_raw[e][:2]:
#         print(event)
#     print('\n')

# eb_events_raw['pacific_time_start'] = eb_events_raw['start'].map(lambda x: x['local'])
# eb_events_raw['pacific_time_end'] = eb_events_raw['end'].map(lambda x: x['local'])

In [4]:
eb_events_raw['name_text'] = eb_events_raw['name'].map(lambda x: x['text'])
eb_events_raw['description_text'] = eb_events_raw['description'].map(lambda x: x['text'])
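
As an alternative to mapping each nested column by hand, `pd.json_normalize` can flatten nested dictionaries in one pass. This is a minimal sketch, assuming pandas 1.0 or later and a made-up `sample_events` list that mimics only the nesting seen above; it is not the notebook's actual method.

```python
import pandas as pd

# Made-up records that mirror the nesting of the real events
sample_events = [
    {'id': '1',
     'name': {'text': 'Event A'},
     'description': {'text': 'About A'},
     'start': {'timezone': 'America/Los_Angeles', 'local': '2019-12-01T10:00:00'}},
    {'id': '2',
     'name': {'text': 'Event B'},
     'description': {'text': 'About B'},
     'start': {'timezone': 'America/Los_Angeles', 'local': '2019-12-02T18:30:00'}},
]

# Nested keys become dotted column names such as 'name.text'
flat = pd.json_normalize(sample_events)
print(flat[['id', 'name.text', 'description.text', 'start.local']])
```

The trade-off is that every nested key becomes a column, so a column selection step is still needed afterwards.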

In [5]:
eb_events_raw['name_text'].head()

0    The 4 Shifts to Lifelong Health, Endless Energ...
1    Easy Access to Business Lines of Credit - Port...
2          Where Can I Get Business Funding - Portland
3          Where Can I Get Business Funding - Portland
4    A Guide to Korean BBQ - Team Building by Cozym...
Name: name_text, dtype: object

In [6]:
eb_events_raw['description_text'].head()

0    4 Powerful Shifts to Lifelong Health, Endless ...
1    Raising Capital for My Business\nFast Unsecure...
2    Raising Capital for My Business\nUnderstanding...
3    Raising Capital for My Business\nUnderstanding...
4    Book a unique team building event centered aro...
Name: description_text, dtype: object

I created a smaller dataframe `eb_events` that only contained the columns that were essential to complete the task at hand.

In [7]:
# identify relevant columns to retain
eb_cols = ['id',
           'name_text',
           'description_text',
           'summary',
           'url',
           'category_id',
           'subcategory_id'
          ]

# create new df with streamlined data
# (copy, so columns added later do not refer back to eb_events_raw)
eb_events = eb_events_raw[eb_cols].copy()
eb_events.head()

Unnamed: 0,id,name_text,description_text,summary,url,category_id,subcategory_id
0,81156743003,"The 4 Shifts to Lifelong Health, Endless Energ...","4 Powerful Shifts to Lifelong Health, Endless ...","4 Powerful Shifts to Lifelong Health, Endless ...",https://www.eventbrite.com/e/the-4-shifts-to-l...,107,
1,70394157821,Easy Access to Business Lines of Credit - Port...,Raising Capital for My Business\nFast Unsecure...,Raising Capital for My Business\nFast Unsecure...,https://www.eventbrite.com/e/easy-access-to-bu...,101,1001.0
2,70311727269,Where Can I Get Business Funding - Portland,Raising Capital for My Business\nUnderstanding...,Raising Capital for My Business\nUnderstanding...,https://www.eventbrite.com/e/where-can-i-get-b...,101,1001.0
3,70309033211,Where Can I Get Business Funding - Portland,Raising Capital for My Business\nUnderstanding...,Raising Capital for My Business\nUnderstanding...,https://www.eventbrite.com/e/where-can-i-get-b...,101,1001.0
4,81376997791,A Guide to Korean BBQ - Team Building by Cozym...,Book a unique team building event centered aro...,Book a unique team building event centered aro...,https://www.eventbrite.com/e/a-guide-to-korean...,110,10003.0


In [8]:
# eb_events[eb_events['category_id'] == '101']

eb_events['category_id'].value_counts()

110    40
101     7
107     3
Name: category_id, dtype: int64

It seemed like I could just leverage Eventbrite's own classification to complete the task. Here are the activities:

- 101: Business Classes
- 107: Health Classes
- 110: Cooking Classes

I created a dictionary for these key-value pairs, and created a new `eb_events` column called `category`. Since some of these events are actually the same classes in different time slots, I grabbed only the unique event names for the final output.

In [9]:
eb_cats = {'101': 'Business Classes',
           '107': 'Health Classes',
           '110': 'Cooking Classes'}

# assign() returns a new dataframe, which avoids the SettingWithCopyWarning
eb_events = eb_events.assign(category=eb_events['category_id'].map(eb_cats))
eb_events[['name_text', 'category', 'category_id']].drop_duplicates().sort_values(by='category_id')

Unnamed: 0,name_text,category,category_id
1,Easy Access to Business Lines of Credit - Port...,Business Classes,101
2,Where Can I Get Business Funding - Portland,Business Classes,101
41,CBD Health & Wellness Business Opportunity (Jo...,Business Classes,101
28,Real Estate Investing for Entrepreneurs - Port...,Business Classes,101
27,Learn To Do Real Estate in less time & make mo...,Business Classes,101
26,MINDSHOP™ | The Art of Lean Innovation,Business Classes,101
0,"The 4 Shifts to Lifelong Health, Endless Energ...",Health Classes,107
42,Train Your Brain To Make More Money - 6 Critic...,Health Classes,107
25,Heal Your Binge Eating and Lifelong Dieting [F...,Health Classes,107
43,Vegan Comfort Food - Cooking Class by Cozymeal™,Cooking Classes,110
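
The mapping and de-duplication steps can be made tolerant of category IDs that are not in the dictionary. A plain dictionary lookup raises a `KeyError` for an unseen ID, whereas `Series.map` yields `NaN`, which can then be filled. The sketch below uses a made-up `events` frame; the `'999'` ID and the `'Uncategorized'` label are assumptions for illustration.

```python
import pandas as pd

category_labels = {'101': 'Business Classes',
                   '107': 'Health Classes',
                   '110': 'Cooking Classes'}

# Made-up events: one repeated class and one ID missing from the mapping
events = pd.DataFrame({
    'name_text': ['Funding 101', 'Funding 101', 'Vegan Cooking', 'Mystery Event'],
    'category_id': ['101', '101', '110', '999'],
})

# Unmapped IDs become NaN, then receive an explicit fallback label
events['category'] = events['category_id'].map(category_labels).fillna('Uncategorized')

# Keep one row per event name, as in the final output above
unique_events = events.drop_duplicates(subset='name_text').reset_index(drop=True)
print(unique_events)
```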


To extend the scope of this classification, it would make sense to keep building on Eventbrite's own categorization and add a layer of specificity where needed.
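
One possible shape for that extra layer is a two-level lookup that combines `category_id` with `subcategory_id`. This is a rough sketch under stated assumptions: the subcategory labels are invented for illustration (they are not Eventbrite's official names), and `normalize_id` and `classify` are hypothetical helpers. The ID normalization accounts for `subcategory_id` appearing as a float (`1001.0`) or as a missing value in the dataframe above.

```python
import math

CATEGORY_LABELS = {'101': 'Business Classes',
                   '107': 'Health Classes',
                   '110': 'Cooking Classes'}

# Invented labels for illustration only
SUBCATEGORY_LABELS = {'1001': 'Funding Workshops',
                      '10003': 'Team Cooking'}


def normalize_id(raw_id):
    """Turn IDs such as 1001.0, '1001', NaN, or None into a clean string or None."""
    if raw_id is None:
        return None
    if isinstance(raw_id, float):
        if math.isnan(raw_id):
            return None
        return str(int(raw_id))
    return str(raw_id)


def classify(category_id, subcategory_id=None):
    """Return a (category, subcategory) pair with fallbacks for unknown IDs."""
    top = CATEGORY_LABELS.get(normalize_id(category_id), 'Uncategorized')
    detail = SUBCATEGORY_LABELS.get(normalize_id(subcategory_id), 'General')
    return (top, detail)


print(classify('101', 1001.0))
print(classify('107', float('nan')))
```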

## Ticketmaster Classification

To think through how best to leverage existing data for categorizing events, I also looked at Ticketmaster's JSON file. My goal was to get a sense of its taxonomy and pick up some best practices. Below are some sample rows from the dataframe.

In [10]:
# open, load, and inspect the dict from the json file
with open('datasets/tm-response-page1.json') as tm_json:
    tm_dict = json.load(tm_json)

tm_events_raw = pd.DataFrame(tm_dict['_embedded']['events'])
tm_events_raw.head(4).T

Unnamed: 0,0,1,2,3
name,Portland Trail Blazers vs. Los Angeles Lakers,Portland Trail Blazers vs. Los Angeles Lakers,Portland Trail Blazers vs. Washington Wizards,Oregon Ducks Men's Basketball vs. UCLA Bruins ...
type,event,event,event,event
id,vvG1HZ4YSAOKK5,vvG1HZ4YSA-KK2,vvG1HZ4YSk8dpE,vvG1HZ4SlVpy61
test,False,False,False,False
url,https://www.ticketmaster.com/portland-trail-bl...,https://www.ticketmaster.com/portland-trail-bl...,https://www.ticketmaster.com/portland-trail-bl...,https://www.ticketmaster.com/oregon-ducks-mens...
locale,en-us,en-us,en-us,en-us
images,"[{'ratio': '3_2', 'url': 'https://s1.ticketm.n...","[{'ratio': '3_2', 'url': 'https://s1.ticketm.n...","[{'ratio': '3_2', 'url': 'https://s1.ticketm.n...","[{'ratio': '16_9', 'url': 'https://s1.ticketm...."
sales,{'public': {'startDateTime': '2019-09-09T17:00...,{'public': {'startDateTime': '2019-09-09T17:00...,{'public': {'startDateTime': '2019-09-09T17:00...,{'public': {'startDateTime': '2019-10-14T15:30...
dates,"{'start': {'localDate': '2019-12-06', 'localTi...","{'start': {'localDate': '2019-12-28', 'localTi...","{'start': {'localDate': '2020-03-04', 'localTi...","{'start': {'localDate': '2020-01-26', 'localTi..."
classifications,"[{'primary': True, 'segment': {'id': 'KZFzniwn...","[{'primary': True, 'segment': {'id': 'KZFzniwn...","[{'primary': True, 'segment': {'id': 'KZFzniwn...","[{'primary': True, 'segment': {'id': 'KZFzniwn..."


The sample data suggests that several columns contain nested dictionaries whose fields still need to be extracted. The `classifications` column holds the categorization itself, with keys such as `segment`, `genre`, `subGenre`, `type`, `subType`, and `family`. Some of these keys are missing for certain events (and some values are `Undefined`), so I used `Unknown` to fill in the missing values.

In [11]:
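classifications = tm_events_raw['classifications']
# 'name' is already a column; the remaining entries are classification levels
cls_cols = ['name', 'segment', 'genre', 'subGenre', 'type', 'subType', 'family']

for col in cls_cols[1:]:
    col_values = []
    for entry in classifications:
        try:
            # use the first classification listed for each event
            col_values.append(entry[0][col]['name'])
        except (KeyError, IndexError, TypeError):
            # this level is missing for the event
            col_values.append('Unknown')

    tm_events_raw[col] = col_values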

In [12]:
tm_events_raw[cls_cols].head()

Unnamed: 0,name,segment,genre,subGenre,type,subType,family
0,Portland Trail Blazers vs. Los Angeles Lakers,Sports,Basketball,NBA,Group,Team,Unknown
1,Portland Trail Blazers vs. Los Angeles Lakers,Sports,Basketball,NBA,Group,Team,Unknown
2,Portland Trail Blazers vs. Washington Wizards,Sports,Basketball,NBA,Group,Team,Unknown
3,Oregon Ducks Men's Basketball vs. UCLA Bruins ...,Sports,Basketball,College,Group,Team,Unknown
4,Portland Trail Blazers vs. Chicago Bulls,Sports,Basketball,NBA,Group,Team,Unknown
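
The per-level extraction can also be packaged as a small, testable helper. The sketch below is not the notebook's code: `classification_name` is a hypothetical function, the sample entry is made up, and treating a literal `'Undefined'` name as `'Unknown'` is an assumption about the intended behavior.

```python
def classification_name(classifications, level, default='Unknown'):
    """Return the name at `level` from the first classification entry, or `default`."""
    try:
        name = classifications[0][level]['name']
    except (IndexError, KeyError, TypeError):
        # Empty list, missing level, or a non-list value such as None
        return default
    # Assumption: 'Undefined' carries no information, so fold it into the default
    return default if name == 'Undefined' else name


# Made-up entry shaped like one element of the `classifications` column
sample = [{'primary': True,
           'segment': {'id': 'hypothetical-id', 'name': 'Sports'},
           'genre': {'name': 'Basketball'},
           'subType': {'name': 'Undefined'}}]

levels = ['segment', 'genre', 'subGenre', 'type', 'subType', 'family']
extracted = {level: classification_name(sample, level) for level in levels}
print(extracted)
```

With a dataframe, such a helper could be applied per level with `Series.map`, for example `tm_events_raw['classifications'].map(lambda c: classification_name(c, 'genre'))`.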


Adopting a multi-level approach to categorization seems more beneficial in the long term, so following a strategy similar to Ticketmaster's might be a good idea.
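
To make the multi-level idea concrete, the sketch below folds `(segment, genre, subGenre)` rows into a nested dictionary with event counts at the leaves. The rows are the five sample rows shown above; `build_taxonomy` is a hypothetical helper, not part of either API.

```python
def build_taxonomy(rows):
    """Nest each row's levels into a tree and count events at the deepest level."""
    tree = {}
    for *parents, leaf in rows:
        node = tree
        for level in parents:
            # Create the intermediate level the first time it is seen
            node = node.setdefault(level, {})
        node[leaf] = node.get(leaf, 0) + 1
    return tree


# (segment, genre, subGenre) for the five sample rows above
rows = [
    ('Sports', 'Basketball', 'NBA'),
    ('Sports', 'Basketball', 'NBA'),
    ('Sports', 'Basketball', 'NBA'),
    ('Sports', 'Basketball', 'College'),
    ('Sports', 'Basketball', 'NBA'),
]

taxonomy = build_taxonomy(rows)
print(taxonomy)
```

A tree like this makes it easy to see which levels are populated before deciding how many levels a shared taxonomy needs.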