# Script to extract Eventbrite API data
This script populates the localities/suburbs under each LGA and extracts both future events and past events (from the previous 28 days) from the Eventbrite API.<br><br>
Section 1 populates the localities/suburbs under each of the 31 LGAs using two datasets, <a href="http://www.corra.com.au/australian-postcode-location-data/">Latitude and Longitude of localities in Australia</a> and <a href="https://discover.data.vic.gov.au/dataset/victorian-electors-by-locality-postcode-and-electorates">List of localities/suburbs of each LGA in Victoria</a>, for use in Section 2 and Section 3.<br><br>

Section 2 extracts events from the past 28 days in each LGA from the Eventbrite API, between the defined `start_date` and `end_date`, and categorises them by culture type through text classification of each event's name and description. The final dataframe is stored in a .csv file for further analysis with the Twitter API and visualisation in Microsoft Power BI.<br><br>

Section 3 extracts all future events starting from the defined `end_date` and categorises them by culture type through text classification of each event's name and description. The final dataframe is stored in a .csv file for further analysis with the Twitter API and visualisation in Microsoft Power BI.<br><br>


### Output Documents
1. <b>eventbritedatapastevents.csv</b> - contains the five most recent events from the past 28 days for each LGA, with their name, description, id, start_date and culture category
2. <b>eventbritedatafutureevents.csv</b> - contains the future events for each LGA, with their name, id and culture category

### Note
- Dates must be entered manually in the <b>Set Date</b> section in the format `YYYY-MM-DDTHH:MM:SS`
- Manual input is required in <b>Section 1</b> if extending the locations beyond the current 31 LGAs
- Manual input is required in <b>Section 2</b> and <b>Section 3</b> to key in the Eventbrite API access code in the format `Bearer xxxxxxxxxxxxxxxxxxxx`, where `xxxxxxxxxxxxxxxxxxxx` should be replaced with your own code
- Ensure that `Australian_Post_Codes_Lat_Lon.csv` and `LocalityFinder.xls` are in the same directory as this script file, otherwise the script cannot be run
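
As an alternative to typing the access code into the cells by hand, the header could be built from an environment variable. The following is only a minimal sketch: the helper `build_auth_headers` and the variable name `EVENTBRITE_TOKEN` are assumptions made for illustration and are not part of the original script.

```python
import os


def build_auth_headers(env_var="EVENTBRITE_TOKEN"):
    """Build the Authorization header from an environment variable.

    Both this helper and the default variable name are hypothetical;
    the original script hard-codes the header in Section 2 and Section 3.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    # same 'Bearer <code>' format that the note above describes
    return {"Authorization": f"Bearer {token}"}
```

The returned dictionary has the same shape as the `headers` dictionary used in Section 2 and Section 3, so it could be assigned to `headers` once instead of being redefined inside each loop.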

### Set Date

In [1]:
# set start_date and end_date
start_date = '2019-09-10T00:00:01'
end_date = '2019-10-10T23:59:59'
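
If typing both dates by hand is error-prone, the window could be derived from a single end date. This is a sketch under the assumption that a fixed 28-day window is wanted; the helper `date_window` and the `example_*` names are hypothetical and the script itself uses the hard-coded values above.

```python
from datetime import datetime, timedelta

# matches the YYYY-MM-DDTHH:MM:SS format required by the Set Date section
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"


def date_window(end, days=28):
    """Return (start, end) as formatted strings, with start `days` days before end."""
    start = end - timedelta(days=days)
    return start.strftime(DATE_FORMAT), end.strftime(DATE_FORMAT)


example_start, example_end = date_window(datetime(2019, 10, 10, 23, 59, 59))
print(example_start, example_end)  # 2019-09-12T23:59:59 2019-10-10T23:59:59
```

Passing `datetime.now()` as `end` would give a window ending at the time the script is run.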

## Section 1 - Populating localities/suburbs under each LGA

In [2]:
# import data manipulation pandas library
import pandas as pd

# read in lat_lon data
lat_lon = pd.read_csv("./Australian_Post_Codes_Lat_Lon.csv")
# filter lat_lon to just Victoria state
lat_lon = lat_lon.loc[lat_lon['state']=='VIC', ['postcode','suburb','lat','lon']]
# read in locality data
locality = pd.read_excel("./LocalityFinder.xls")
# set headers as 2nd row as file is not cleaned initially
headers = locality.iloc[1]
# remove 1st and 2nd row
locality = locality[2:]
# set the right column headers
locality.columns = headers
# remove unnecessary columns
locality = locality.iloc[:,0:3]
# store the 31 LGAs according to the format in locality data (e.g. Yarra City Council, Yarra Ranges Shire Council)
LGA = ['Hobsons Bay City Council', 'Maribyrnong City Council', 'Melbourne City Council', 'Yarra City Council',\
       'Port Phillip City Council','Boroondara City Council', 'Stonnington City Council', 'Glen Eira City Council',\
       'Bayside City Council', 'Banyule City Council', 'Brimbank City Council', 'Casey City Council',\
       'Greater Dandenong City Council', 'Darebin City Council', 'Frankston City Council', 'Hume City Council',\
       'Kingston City Council', 'Knox City Council', 'Manningham City Council', 'Maroondah City Council',\
       'Melton City Council', 'Monash City Council', 'Moonee Valley City Council', 'Moreland City Council',\
       'Whitehorse City Council', 'Whittlesea City Council', 'Wyndham City Council', 'Cardinia Shire Council',\
       'Mornington Peninsula Shire Council', 'Nillumbik Shire Council', 'Yarra Ranges Shire Council']
# filter the locality data to just the 31 LGA
filtered_locality = locality[locality['Municipality\r\nName'].isin(LGA)]
# remove duplicates
filtered_locality = filtered_locality.drop_duplicates(keep='first').sort_values(['Post\r\nCode'], ascending=True)
# filter the lat_lon to just the locality data
postcodes = filtered_locality['Post\r\nCode']
filtered_lat_lon = lat_lon[lat_lon['postcode'].isin(postcodes)]

# create lat and long columns in filtered_locality
filtered_locality['latitude'] = None
filtered_locality['longitude'] = None
# compare filtered_locality with filtered_lat_lon to get the lat and lon (first matching suburb name wins)
for i in range(0,len(filtered_locality)):
    for j in range(0,len(filtered_lat_lon)):
        if (filtered_lat_lon.iloc[j,1] == filtered_locality.iloc[i,0].upper()):
            filtered_locality.iloc[i,3] = filtered_lat_lon.iloc[j,2]
            filtered_locality.iloc[i,4] = filtered_lat_lon.iloc[j,3]
            break
# remove empty rows without latitude and longitude
filtered_locality = filtered_locality.dropna(how='any')

# output to csv for further use
filtered_locality.to_csv('./filtered_locality.csv', index=False)
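
The nested loop above compares every locality with every row of the latitude/longitude data. The same "first matching suburb name wins" rule can be sketched with a dictionary lookup, shown here on plain Python records so that it runs without the data files. The helper names and the sample rows are made up for illustration, and unlike the loop above this sketch upper-cases both sides of the comparison.

```python
def build_suburb_lookup(lat_lon_rows):
    """Map upper-case suburb name -> (lat, lon), keeping the first row seen."""
    lookup = {}
    for row in lat_lon_rows:
        # setdefault keeps the earliest entry, like the `break` in the loop above
        lookup.setdefault(row["suburb"].upper(), (row["lat"], row["lon"]))
    return lookup


def attach_coordinates(locality_names, lookup):
    """Return (name, lat, lon) for matched localities; unmatched ones are dropped."""
    matched = []
    for name in locality_names:
        coords = lookup.get(name.upper())
        if coords is not None:
            matched.append((name, coords[0], coords[1]))
    return matched


# illustrative rows only, not taken from the real dataset
sample_rows = [
    {"suburb": "EXAMPLEVILLE", "lat": -37.0, "lon": 145.0},
    {"suburb": "EXAMPLEVILLE", "lat": -38.0, "lon": 146.0},  # duplicate name, ignored
]
print(attach_coordinates(["Exampleville", "Nowhere"], build_suburb_lookup(sample_rows)))
# [('Exampleville', -37.0, 145.0)]
```

Building the dictionary once means each locality needs a single lookup rather than a scan of every latitude/longitude row.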

## Section 2 - Extracting past 28 days events data

In [3]:
# for text analysis to categorise the events based on different races
# import tokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams
from nltk.stem import PorterStemmer

# regex matching alphabetic words of at least 3 letters, since the shortest word in race_list has 3 letters
tokenizer = RegexpTokenizer(r"\b[a-zA-Z]{3,}\b")

# set the list of 29 races based on the census data
race_list = ['australian','chinese','croatian','dutch','english','filipino','french','german','greek','hungarian'\
             ,'indian','irish', 'italian', 'korean', 'lebanese', 'macedonian', 'maltese', 'maori', 'new zealander'\
             ,'polish','russian','scottish','serbian','african','spanish','sri lankan','turkish','vietnamese'\
             ,'welsh']

# stem the race_list
ps = PorterStemmer()
for i in range(0,len(race_list)):
    race_list[i] = ps.stem(race_list[i])

# function to get the culture type of the events
def get_culture(event):
    # initialise temp output
    temp_output = None
    # extract name and description for each event and replace the newline and tabs
    name = event['name']['text']
    desc = event['description']['text']
    if desc is not None:
        name = name + ' ' + desc
    name = name.replace("\n"," ").replace("\t"," ").strip().lower()
    # tokenize each description text into individual word (unigram)
    unigrams = tokenizer.tokenize(name)
    # stem the unigrams
    for x in range(0,len(unigrams)):
        unigrams[x] = ps.stem(unigrams[x])
    # tokenize each description text into bigrams
    bigrams = list(ngrams(unigrams, 2))
    # combine both unigrams and bigrams into single list, tokens
    tokens = [' '.join(bigram) for bigram in bigrams] + unigrams
    # check whether any race in race_list appears among the tokens
    # (if several match, the last matching entry in race_list is kept)
    for race in race_list:
        if race in tokens:
            temp_output = race
    # set race as non-classifiable if no race type matches
    if temp_output is None:
        temp_output = 'non-classifiable'
    return temp_output
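
To make the matching step easier to follow, here is a simplified, standard-library-only sketch of the unigram and bigram lookup. It deliberately leaves out the Porter stemming that `get_culture` applies, and the name `match_terms` and the sample sentences are assumptions made for illustration.

```python
import re

# same pattern as the RegexpTokenizer above: alphabetic words of 3+ letters
WORD_RE = re.compile(r"\b[a-zA-Z]{3,}\b")


def match_terms(text, terms):
    """Return the last entry of `terms` found among the text's unigrams and bigrams."""
    unigrams = WORD_RE.findall(text.lower())
    # adjacent word pairs, so two-word terms such as 'sri lankan' can match
    bigrams = [" ".join(pair) for pair in zip(unigrams, unigrams[1:])]
    tokens = set(unigrams) | set(bigrams)
    result = None
    for term in terms:
        if term in tokens:
            result = term  # later terms overwrite earlier ones, as in get_culture
    return result if result is not None else "non-classifiable"


terms = ["greek", "sri lankan", "turkish"]
print(match_terms("Sri Lankan food festival", terms))       # sri lankan
print(match_terms("Greek and Turkish street food", terms))  # turkish
print(match_terms("Weekly yoga class", terms))              # non-classifiable
```

The second call shows a consequence of the loop design: when an event mentions more than one entry, only the one that appears last in the list is reported.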

In [4]:
# import eventbrite and requests library for api requests
from eventbrite import Eventbrite
import requests
# import json for data storage and printing
import json

# empty dataframe to store final result
past_events = pd.DataFrame(columns=['LGA','name','desc','id','start_date','category'])
# loop through the 31 LGA
for region in LGA:
    # filter locality data to each individual LGA
    temp_locality=filtered_locality[filtered_locality.iloc[:,2]==region]
    # create a temp dataframe for this region
    temp = pd.DataFrame(columns=['LGA','name','desc','id','start_date','category'])
    # loop through each suburb in this region
    for i in range(0, len(temp_locality)):
        # eventbrite api access code
        headers = {
        'Authorization': 'Bearer xxxxxxxxxxxxxxxxxxxx',  ############# INPUT EVENTBRITE API ACCESS CODE HERE ############
        }
        # set params with latitude, longitude, radius of 2km, start date and end date defined at the start
        params = (
            ('location.longitude', temp_locality.iloc[i,4]),
            ('location.latitude', temp_locality.iloc[i,3]),
            ('location.within', '2km'),
            ('start_date.range_start', start_date),
            ('start_date.range_end', end_date)
        )
        # get response from eventbrite api
        response = requests.get('https://www.eventbriteapi.com/v3/events/search', headers=headers, params=params)
        # store data in json format
        json_data = json.loads(response.text)
        # extract just the 'events'
        overall_data = json_data['events']
        # each response holds at most 50 events per page, so fetch the remaining pages when there is more than one
        if json_data['pagination']['page_count'] > 1:
            for j in range (2, json_data['pagination']['page_count']+1):
                params = (
                    ('location.longitude', temp_locality.iloc[i,4]),
                    ('location.latitude', temp_locality.iloc[i,3]),
                    ('location.within', '2km'),
                    ('start_date.range_start', start_date),
                    ('start_date.range_end', end_date),
                    ('page', j)
                )
                # send request to eventbrite api again to get the next 50 events and so on
                response = requests.get('https://www.eventbriteapi.com/v3/events/search', headers=headers, params=params)
                json_data = json.loads(response.text)
                # append all events into overall_data list
                for each in json_data['events']:
                    overall_data.append(each)
        # loop through eventbrite data and extract 'name', 'desc', 'id' and 'start_date', adding the LGA and category
        for k in range(0, len(overall_data)):
            # run get_culture function to get the cultural category for this event
            category = get_culture(overall_data[k])
            temp.loc[len(temp)] = [region,overall_data[k]['name']['text'],overall_data[k]['description']['text'],\
                                   overall_data[k]['id'],overall_data[k]['start']['local'],category]
        # remove duplicated rows based on 'id'
        temp = temp.drop_duplicates(subset='id', keep="first")
        # keep only the 5 most recent events collected so far for this region
        temp = temp.sort_values('start_date',ascending=False)
        temp = temp.iloc[0:5,:]
    # append/concat this temp dataframe for this region to final dataframe
    past_events = pd.concat([past_events,temp])
    
# output to csv file for further use
past_events.to_csv('./eventbritedatapastevents.csv', index=False)
print("Done!")

Done!
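
The pagination logic in the cell above is repeated again in Section 3. One way to sketch it as a reusable helper is shown below. The function `fetch_all_events` and its `fetch_page` argument are hypothetical; the only assumption about the response is the shape already relied on above, namely an `events` list and a `pagination.page_count` value.

```python
def fetch_all_events(fetch_page):
    """Collect the events from every page of a paginated response.

    `fetch_page(page)` is a hypothetical callable returning the decoded JSON
    for one page: {'events': [...], 'pagination': {'page_count': N}}.
    """
    first = fetch_page(1)
    events = list(first["events"])
    # pages are numbered from 1, so continue from page 2 up to page_count
    for page in range(2, first["pagination"]["page_count"] + 1):
        events.extend(fetch_page(page)["events"])
    return events


def fake_fetch_page(page):
    """Stand-in for a real request, used only to demonstrate the helper."""
    pages = {1: ["a", "b"], 2: ["c"]}
    return {"events": pages[page], "pagination": {"page_count": 2}}


print(fetch_all_events(fake_fetch_page))  # ['a', 'b', 'c']
```

In the notebook, `fetch_page` could wrap the existing `requests.get` call and add the `page` parameter, which would leave the request parameters defined in one place per suburb.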


## Section 3 - Extracting future events data

In [34]:
# empty dataframe to store final result
future_events = pd.DataFrame(columns=['LGA','name','id','category'])
# loop through the 31 LGA
for region in LGA:
    # filter locality data to each individual LGA
    temp_locality=filtered_locality[filtered_locality.iloc[:,2]==region]
    # create a temp dataframe for this region
    temp = pd.DataFrame(columns=['LGA','name','id','category'])
    # loop through each suburb in this region
    for i in range(0, len(temp_locality)):
        # eventbrite api access code
        headers = {
        'Authorization': 'Bearer xxxxxxxxxxxxxxxxxxxx',  ############# INPUT EVENTBRITE API ACCESS CODE HERE ############
        }
        # set params with latitude, longitude, radius of 2km and a start date range beginning at end_date (no upper bound)
        params = (
            ('location.longitude', temp_locality.iloc[i,4]),
            ('location.latitude', temp_locality.iloc[i,3]),
            ('location.within', '2km'),
            ('start_date.range_start', end_date)
        )
        # get response from eventbrite api
        response = requests.get('https://www.eventbriteapi.com/v3/events/search', headers=headers, params=params)
        # store data in json format
        json_data = json.loads(response.text)
        # extract just the 'events'
        overall_data = json_data['events']
        # each response holds at most 50 events per page, so fetch the remaining pages when there is more than one
        if json_data['pagination']['page_count'] > 1:
            for j in range (2, json_data['pagination']['page_count']+1):
                params = (
                    ('location.longitude', temp_locality.iloc[i,4]),
                    ('location.latitude', temp_locality.iloc[i,3]),
                    ('location.within', '2km'),
                    ('start_date.range_start', end_date),
                    ('page', j)
                )
                # send request to eventbrite api again to get the next 50 events and so on
                response = requests.get('https://www.eventbriteapi.com/v3/events/search', headers=headers, params=params)
                json_data = json.loads(response.text)
                # append all events into overall_data list
                for each in json_data['events']:
                    overall_data.append(each)
        # loop through eventbrite data and extract only 'name' and 'id', adding the LGA and category
        for k in range(0, len(overall_data)):
            # run get_culture function to get the cultural category for this event
            category = get_culture(overall_data[k])
            temp.loc[len(temp)] = [region,overall_data[k]['name']['text'],overall_data[k]['id'],category]
        # remove duplicated rows based on 'id'
        temp = temp.drop_duplicates(subset='id', keep="first")
    # append/concat this temp dataframe for this region to final dataframe
    future_events = pd.concat([future_events,temp])
    
# output to csv file for further use
future_events.to_csv('./eventbritedatafutureevents.csv', index=False)
print("Done!")

Done!


# ------------------------------------- END OF SCRIPT --------------------------------------#