In [1]:
private_key = "insert key here"

In [2]:
import requests
import json
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

The idea behind my classification system is to build a labeled training set and use it to train a probabilistic classifier that can eventually take event titles and/or descriptions WITHOUT LABELS and correctly predict their category. I think this solution is quite viable because of the nature of event titles and descriptions. For instance, the word "Vs" will almost always be attached to a sporting event. Additionally, the beauty of a probabilistic classifier is that we can assign events to an "Other" category when ambiguity is high. Having humans review these events, assign them to the appropriate category, and feed them back into model training should improve model performance with successive iterations.

I think a great place to start in building a labeled training set, particularly where Eventbrite events are concerned, is their own classification system. A call to their API returns the categories and subcategories they associate with each id number. Personally, I think this list is a little long. I could see using an abbreviated version, an example of which I have given below. This is a personal preference, as I don't generally like having more than ~12-15 choices in a dropdown menu. Using a system of dictionaries will let us easily translate labels across APIs.

In [3]:
categories = requests.get("https://www.eventbriteapi.com/v3/categories/?token=" + private_key).json()
sub_categories = requests.get("https://www.eventbriteapi.com/v3/subcategories/?token=" + private_key).json()

activities_list = ['Music','Visual Arts','Performing Arts','Film','Lectures and Books','Fashion','Food and Drink',
'Festivals and Fairs','Charities','Sports', 'Active Life','Nightlife','Kids and Family','Other']
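
One way the dictionary approach could look is sketched here. The mapping is hypothetical: the Eventbrite names on the left are assumptions for illustration and should be checked against what the categories endpoint actually returns, and `to_activity` is a helper name invented for this sketch.

```python
# Hypothetical mapping from a source API's category names to the shorter list.
# The left-hand names are assumptions; verify them against the API response.
EVENTBRITE_TO_ACTIVITY = {
    "Music": "Music",
    "Film, Media & Entertainment": "Film",
    "Food & Drink": "Food and Drink",
    "Sports & Fitness": "Sports",
    "Charity & Causes": "Charities",
}

def to_activity(source_name, mapping, default="Other"):
    """Return the abbreviated label for a source category name, or the default."""
    return mapping.get(source_name, default)

# Unmapped or missing names fall back to "Other".
example_labels = [to_activity(name, EVENTBRITE_TO_ACTIVITY)
                  for name in ["Music", "Sports & Fitness", None]]
```

A second dictionary of the same shape per API (for example one keyed by Ticketmaster segment names) would let every source share the same label set.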

In this cell I'll take the JSON of categories and subcategories from above and make dictionaries of id:name. 

In [4]:
id_name = {}
for cat in categories['categories']:
    id_name[cat['id']] = cat['name']
    
id_subname = {}
for sub_cat in sub_categories['subcategories']:
    id_subname[sub_cat['id']] = sub_cat['name']

Now, I'll open the JSON file pulled from your repo and make a list of events as requested. Flattening the nested structure, pulling the event name, event description, and categories into a dataframe, and mapping category ids to names will leave us with a structure ideal for training a classifier.

In [5]:
# Load the saved Eventbrite response; the with block closes the file for us.
with open("eb-response-page1.json", 'r') as activities:
    data = json.load(activities)
events = list(data['events'])
    
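# pd.json_normalize is the public entry point in pandas >= 1.0.
df = pd.json_normalize(events)
# Copy the column subset so later column assignments act on an independent frame.
df = df[['name.text','description.text','id','category_id','subcategory_id']].copy()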
df['category_classified'] = df.category_id.map(id_name)
df['sub_category_classified'] = df.subcategory_id.map(id_subname)

# Collapse newlines and lowercase the descriptions before vectorizing.
df['description.text'] = df['description.text'].str.replace('\n', ' ').str.lower()

With the simple steps above, we're ready to vectorize our text. Here I've used TfidfVectorizer with sublinear (logarithmic) term-frequency scaling, a minimum document frequency of 2 (which would be adjusted based on the size of the training set), both unigrams and bigrams, and English stop words removed. Fitting this to the descriptions gives us 435 features for our 50 documents.

In [6]:
labels = df['category_classified']
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=2, norm='l2',
                        ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df['description.text']).toarray()
features.shape

(50, 435)
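
To make the effect of `min_df=2` and `ngram_range=(1, 2)` concrete, here is a small sketch on three invented documents (not taken from the Eventbrite data). Only terms that appear in at least two documents survive, and bigrams are counted alongside single words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented purely for illustration.
docs = ["live music downtown", "live music tonight", "yoga class tonight"]

vec = TfidfVectorizer(sublinear_tf=True, min_df=2, norm='l2',
                      ngram_range=(1, 2), stop_words='english')
X = vec.fit_transform(docs)

# vocabulary_ maps each kept unigram or bigram to its column index.
kept_terms = sorted(vec.vocabulary_)
```

Terms such as "yoga" or "music downtown" occur in only one document, so they are dropped. On a larger training set, raising `min_df` would prune rare terms more aggressively.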

I chose Multinomial Naive Bayes for this simple example. I haven't tested other probabilistic classifiers, but I think it's a reasonable choice. As mentioned above, we can use a threshold probability to assign events to an "Other" category until we can improve certainty with future training.

In [7]:
mnb = MultinomialNB().fit(features,labels)
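
The threshold idea could be sketched roughly as follows. The helper name `predict_with_other`, the cutoff of 0.6, and the toy training data are all assumptions made for illustration; a real cutoff would need to be tuned on held-out events.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def predict_with_other(model, X, threshold=0.6, other_label="Other"):
    """Predict labels, using other_label when the top class probability is below threshold."""
    proba = model.predict_proba(X)
    best = proba.argmax(axis=1)
    # Cast to object so the fallback label can be any length.
    labels = model.classes_[best].astype(object)
    labels[proba.max(axis=1) < threshold] = other_label
    return labels

# Toy training data, invented for illustration only.
toy_docs = ["lakers vs blazers basketball game", "warriors vs suns basketball",
            "live jazz concert tonight", "rock band concert live music"]
toy_labels = ["Sports", "Sports", "Music", "Music"]

toy_vec = TfidfVectorizer()
toy_clf = MultinomialNB().fit(toy_vec.fit_transform(toy_docs), toy_labels)

new_docs = ["basketball game vs rivals", "community meeting"]
predicted = predict_with_other(toy_clf, toy_vec.transform(new_docs))
```

The second document shares no vocabulary with the training set, so the model falls back to the class priors and the event lands in "Other", which is where human review would pick it up.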

I also wanted to look at the Ticketmaster API response to confirm that this method will still work. The output of the TM API is a deeply nested structure, but the same general principle applies. Developing pre-processing pipelines for each API will be the most important part of this method.

In [8]:
tm_events = []
with open('tm-response-page1-Copy1.json', 'r') as ticketmaster:
    data = json.load(ticketmaster)
    for each in data['_embedded']['events']:
        info = [each['name'],each['classifications'][0]['segment']]
        tm_events.append(info)

In [9]:
tm_events[0]

['Portland Trail Blazers vs. Los Angeles Lakers',
 {'id': 'KZFzniwnSyZfZ7v7nE', 'name': 'Sports'}]
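
A per-API pre-processing step might look something like the sketch below, which reduces each source to one common record shape. The function names and the output keys (`source`, `title`, `description`, `label`) are hypothetical; the input fields are the ones used in the cells above.

```python
def normalize_eventbrite(event, id_to_name):
    """Flatten one Eventbrite event dict into a common record."""
    return {
        "source": "eventbrite",
        "title": (event.get("name") or {}).get("text"),
        "description": (event.get("description") or {}).get("text"),
        "label": id_to_name.get(event.get("category_id"), "Other"),
    }

def normalize_ticketmaster(event):
    """Flatten one Ticketmaster event dict into the same record shape."""
    classifications = event.get("classifications") or [{}]
    segment = classifications[0].get("segment") or {}
    return {
        "source": "ticketmaster",
        "title": event.get("name"),
        "description": None,  # this sketch only reads the name and segment
        "label": segment.get("name", "Other"),
    }

# Example input shaped like the Ticketmaster event shown above.
tm_record = normalize_ticketmaster({
    "name": "Portland Trail Blazers vs. Los Angeles Lakers",
    "classifications": [{"segment": {"id": "KZFzniwnSyZfZ7v7nE", "name": "Sports"}}],
})
```

With every source reduced to the same keys, the vectorizer and classifier only ever see one schema, and adding a new API means writing one more small function.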

Lastly, I'll output a CSV of event name, id, and category id.

In [10]:
output = df[['name.text','id','category_id']]
output.to_csv('eventbrite_events.csv', encoding='utf-8', index=False)