## Generating the LIST Dataframes FOR THE KG


In this notebook we generate one dataframe per CE ontology node and save each of them to disk.
The saved dataframes are listed below. They contain **only the attributes needed later to generate the graph and to link information between nodes**:

- df_events	
- df_event_category 
- df_event_tags
- df_event_description
- df_event_properties
- df_event_phonenumber	

- df_schedules
- df_schedules_phone_numbers

- df_performances			
- df_performances_properties
- df_performances_descriptions	
- df_performances_links		
			
- df_tickets
- df_ticketDescription

- df_places
- df_places_tags
- df_places_properties
- df_places_description
- df_places_loc	
- df_places_pn



We also generate some intermediate dataframes (those with 'total' in their names), but these are not saved to disk.

The saved dataframes will later be read back from disk to create the knowledge graph.
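
As a rough sketch of that later step, a dataframe written with `to_pickle` can be loaded back with `pd.read_pickle`. The temporary directory and the toy values below are placeholders for illustration, not the notebook's real paths or data.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for one of the saved dataframes (made-up values).
df_toy = pd.DataFrame({"event_id": [1, 2], "category": ["Music", "Comedy"]})

# A temporary directory stands in for the dataframes_final directory.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "df_event_category")

df_toy.to_pickle(path)            # same call the notebook uses to save
df_loaded = pd.read_pickle(path)  # how a graph-building step could load it
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.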

To create them, we use two previously computed dataframes:
 - df_new_events
 - df_places
 
 

In [None]:
import yaml
import string
import copy
from datetime import datetime
import pandas as pd
from yaml import safe_load
from pandas import json_normalize  # public location since pandas 1.0
from difflib import SequenceMatcher
import pickle
import numpy as np
import collections

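We are going to save the dataframes in a directory called dataframes_final. Google Drive must already be mounted (see the `drive.mount` cell in section 1) before running the next cell; otherwise `mkdir` fails with the error shown below.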

In [None]:
!mkdir /content/drive/MyDrive/dataframes_final


mkdir: cannot create directory ‘/content/drive/MyDrive/dataframes_final’: No such file or directory


In [None]:
dataframes_final="/content/drive/MyDrive/dataframes_final"
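
The `mkdir` error above suggests the parent directory did not exist when the cell ran. A minimal sketch of a more defensive alternative follows; `ensure_output_dir` is a hypothetical helper, not part of the original notebook, and a temporary directory stands in for `/content/drive/MyDrive`.

```python
import os
import tempfile


def ensure_output_dir(path):
    """Create `path` if its parent exists; return True if the directory is usable."""
    parent = os.path.dirname(path.rstrip("/"))
    if not os.path.isdir(parent):
        # Parent missing (for example, Drive not mounted yet): create nothing.
        return False
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return True


# Demonstration with a temporary directory standing in for the Drive folder.
base = tempfile.mkdtemp()
created = ensure_output_dir(os.path.join(base, "dataframes_final"))
missing = ensure_output_dir(os.path.join(base, "not_mounted", "dataframes_final"))
```

Checking the parent first avoids silently creating a local `/content/drive/...` tree that is not actually backed by Drive.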


## 1. EVENTS DATAFRAME

We now read the events dataframe and work with it.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
os.listdir("/content/drive/MyDrive")

In [None]:
# use a separate name for the file handle so it does not shadow the dataframe
with open("/content/drive/MyDrive/df_new_events","rb") as f:
    df_new_events=pickle.load(f)

In [None]:

df_new_events.iloc[0]


In [None]:
df_events=df_new_events[['event_id','id', 'name', 'created_ts', 'modified_ts','website', 'ranking_in_level', 'ranking_level', 'sort_name', 'status']]

**Important** Use this dataframe, not df_new_events, when reading the properties of events for your knowledge graph.

In [None]:
df_events.to_pickle(dataframes_final+"/df_events")

## 2. EVENTS_DESCRIPTION DATAFRAME

In [None]:
df_new_events['descriptions']

0       [{'type': 'description.list.default', 'descrip...
1       [{'type': 'description.list.default', 'descrip...
2       [{'type': 'description.list.default', 'descrip...
3       [{'type': 'description.list.default', 'descrip...
4       [{'type': 'description.list.default', 'descrip...
                              ...                        
2272    [{'type': 'description.list.default', 'descrip...
2276    [{'type': 'description.list.default', 'descrip...
2278    [{'type': 'description.official', 'description...
2279    [{'type': 'description.list.default', 'descrip...
2280    [{'type': 'description.list.default', 'descrip...
Name: descriptions, Length: 38700, dtype: object

In [None]:
df_e_desc=df_new_events[['event_id','descriptions']].explode('descriptions')
df_e_desc=df_e_desc[df_e_desc['descriptions'].notna()]
df_e_desc=pd.concat([df_e_desc.drop(['descriptions'], axis=1), df_e_desc['descriptions'].apply(pd.Series)], axis=1)
#df_e_desc=df_e_desc.drop(0, axis=1)
df_e_desc

Unnamed: 0,event_id,type,description
0,157884,description.list.default,Swedish trio combining acoustic instrumentatio...
1,194419,description.list.default,Brilliant mix of English tradition and America...
1,194419,description.official,Nominated for Musician of the Year and for Bes...
2,240818,description.list.default,"Robbie Burns was funny, right? So toast the ba..."
3,345866,description.list.default,"The Stand's spankingly good new talent night, ..."
...,...,...,...
2276,1586592,description.list.default,Tour starting at Edinburgh's The Elephant Hous...
2278,1595055,description.official,"The tour will be led by Lisa Williams, directo..."
2278,1595055,description.list.default,"The tour will be led by Lisa Williams, directo..."
2279,1599103,description.list.default,"Pull on your wellies, wrap up warm and come pi..."
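
The cell above follows a three-step pattern that recurs throughout this notebook: explode a list column, drop missing values, then expand each dict into columns. A small sketch on made-up data (the values are illustrative only) shows what each step does:

```python
import pandas as pd

# Toy data shaped like df_new_events[['event_id', 'descriptions']] (made-up values).
df_toy = pd.DataFrame({
    "event_id": [10, 20, 30],
    "descriptions": [
        [{"type": "description.list.default", "description": "A"},
         {"type": "description.official", "description": "B"}],
        [{"type": "description.list.default", "description": "C"}],
        None,  # event with no descriptions
    ],
})

# 1) One row per list element; the event without descriptions keeps a missing value.
exploded = df_toy.explode("descriptions")
# 2) Drop the rows that hold no description.
exploded = exploded[exploded["descriptions"].notna()]
# 3) Turn each dict into columns and attach them to the key column.
expanded = pd.concat(
    [exploded.drop(columns=["descriptions"]),
     exploded["descriptions"].apply(pd.Series)],
    axis=1,
)
```

Note that `explode` repeats the original index for every element, which is why the output above shows the same index value on several rows.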


In [None]:
#comment out this line if you do not want to save the dataframe to file
df_e_desc.to_pickle(dataframes_final+"/df_event_description")

## 3. EVENTS_CATEGORY DATAFRAME

In [None]:
df_new_events['category']

0          Music
1          Music
2         Comedy
3         Comedy
4         Comedy
          ...   
2272       Sport
2276    Days out
2278    Days out
2279    Days out
2280    Days out
Name: category, Length: 38700, dtype: object

In [None]:
df_e_category=df_new_events[['event_id','category']]
df_e_category

Unnamed: 0,event_id,category
0,157884,Music
1,194419,Music
2,240818,Comedy
3,345866,Comedy
4,347164,Comedy
...,...,...
2272,1584208,Sport
2276,1586592,Days out
2278,1595055,Days out
2279,1599103,Days out


In [None]:
df_e_category.to_pickle(dataframes_final+"/df_event_category")

## 4. EVENTS_PROPERTIES DATAFRAME

In [None]:
df_e_prop=df_new_events[['event_id','properties']]
df_e_prop=pd.concat([df_e_prop.drop(['properties'], axis=1), df_e_prop['properties'].apply(pd.Series)], axis=1)
df_e_prop

Unnamed: 0,event_id,actor,actor:sample,affiliate:getmein,affiliate:seatwave,author,awards:fringe-sustainable-practice:2015,awards:fringe-sustainable-practice:2017,booking_essential,cast,...,list:website:comments-enabled,list:website:comments-end-date,list:website:company,list:website:hitlisted,list:website:list-of-sites,organisation,pa:rating,place:capacity:max,simpleview:original:categories,writer
0,157884,,,,,,,,False,,...,,2013-01-31 00:00:00,,,,,,,,
1,194419,,,,,,,,False,,...,,2020-01-28 05:01:07,,,,,,,,
2,240818,,,,,,,,False,,...,,,,,,,,,,
3,345866,,,,,,,,False,,...,,,,,,,,,,
4,347164,,,,,,,,False,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2272,1584208,,,,,,,,False,,...,,,,,,,,,,
2276,1586592,,,,,,,,False,,...,,,,,,,,,,
2278,1595055,,,,,,,,False,,...,,,,,,,,,,
2279,1599103,,,,,,,,False,,...,,,,,,,,,,


In [None]:
df_e_prop.to_pickle(dataframes_final+"/df_event_properties")

## 5. EVENTS_TAGS Dataframe

In [None]:
df_e_tags=df_new_events[['event_id','tags']].explode('tags')
df_e_tags=df_e_tags[df_e_tags['tags'].notna()]

In [None]:
df_e_tags.to_pickle(dataframes_final+"/df_event_tags")

## 6. EVENTS_PHONE_NUMBER DATAFRAME

In [None]:
df_events_pn=df_new_events[['event_id', 'phone_numbers']]
df_events_pn=pd.concat([df_events_pn.drop(['phone_numbers'], axis=1), df_events_pn['phone_numbers'].apply(pd.Series)], axis=1)
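# the extra column 0 appears to come from rows whose phone_numbers is missing; drop it if present
df_events_pn=df_events_pn.drop(0, axis=1, errors='ignore')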
df_events_pn

Unnamed: 0,event_id,box_office,info
0,157884,,
1,194419,,
2,240818,,
3,345866,,
4,347164,,
...,...,...,...
2272,1584208,,
2276,1586592,,0131 555 5558
2278,1595055,,
2279,1599103,,07793 600 289
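
The column `0` dropped in the cell above presumably arises because many events have no `phone_numbers` dict. One way to avoid producing it in the first place, sketched here on made-up rows, is to expand only the rows that hold a dict and then re-align with all events:

```python
import numpy as np
import pandas as pd

# Toy data shaped like df_new_events[['event_id', 'phone_numbers']] (made-up rows).
df_toy = pd.DataFrame({
    "event_id": [1, 2, 3],
    "phone_numbers": [
        np.nan,
        {"info": "0131 555 5558"},
        {"box_office": "0131 558 7272"},
    ],
})

# Expand only the rows that actually hold a dict.
present = df_toy["phone_numbers"].dropna()
expanded = pd.DataFrame(present.tolist(), index=present.index)

# Re-align with every event; events without numbers get NaN in all phone columns.
df_pn = df_toy[["event_id"]].join(expanded)
```

This sketch assumes a unique index; the real `df_new_events` index may contain repeated labels, in which case a `reset_index` beforehand would be needed.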


In [None]:
df_events_pn.to_pickle(dataframes_final+"/df_event_phonenumber")

## 7. Schedules Dataframe 

In [None]:
df_schedules=df_new_events[['event_id', 'schedules']].explode('schedules')
df_schedules_total=pd.concat([df_schedules.drop(['schedules'], axis=1), df_schedules['schedules'].apply(pd.Series)], axis=1)
df_schedules_total

Unnamed: 0,event_id,start_ts,end_ts,place_id,performances,performance_space,phone_numbers
0,157884,2018-04-26T20:00:00+01:00,2018-04-26T20:00:00+01:00,383,"[{'ts': '2018-04-26T20:00:00+01:00', 'duration...",,
1,194419,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,11092,"[{'ts': '2018-03-10T19:30:00+00:00', 'duration...",,
1,194419,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,11200,"[{'ts': '2018-03-08T20:00:00+00:00', 'links': ...",,
1,194419,2018-05-07T20:00:00+01:00,2018-05-07T20:00:00+01:00,386,"[{'ts': '2018-05-07T20:00:00+01:00', 'links': ...",,
2,240818,2018-01-24T20:30:00+00:00,2018-01-28T20:30:00+00:00,1,"[{'ts': '2018-01-24T20:30:00+00:00', 'links': ...",,
...,...,...,...,...,...,...,...
2272,1584208,2020-10-10T16:00:00+01:00,2020-10-10T16:00:00+01:00,127508,"[{'ts': '2020-10-10T16:00:00+01:00', 'duration...",,
2276,1586592,2020-08-12T08:00:00+01:00,2020-10-30T08:00:00+00:00,127571,"[{'ts': '2020-08-12T08:00:00+01:00', 'duration...",,
2278,1595055,2020-09-12T10:30:00+01:00,2020-10-24T10:30:00+01:00,127985,"[{'ts': '2020-09-12T10:30:00+01:00', 'duration...",,
2279,1599103,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,128231,"[{'ts': '2020-10-16T09:00:00+01:00', 'duration...",,


In [None]:
df_schedules=df_schedules_total[['event_id', 'start_ts', 'end_ts', 'place_id', 'performance_space']]
df_schedules

Unnamed: 0,event_id,start_ts,end_ts,place_id,performance_space
0,157884,2018-04-26T20:00:00+01:00,2018-04-26T20:00:00+01:00,383,
1,194419,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,11092,
1,194419,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,11200,
1,194419,2018-05-07T20:00:00+01:00,2018-05-07T20:00:00+01:00,386,
2,240818,2018-01-24T20:30:00+00:00,2018-01-28T20:30:00+00:00,1,
...,...,...,...,...,...
2272,1584208,2020-10-10T16:00:00+01:00,2020-10-10T16:00:00+01:00,127508,
2276,1586592,2020-08-12T08:00:00+01:00,2020-10-30T08:00:00+00:00,127571,
2278,1595055,2020-09-12T10:30:00+01:00,2020-10-24T10:30:00+01:00,127985,
2279,1599103,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,128231,


In [None]:
df_schedules.to_pickle(dataframes_final+"/df_schedules")
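
The `start_ts` and `end_ts` values are kept as ISO 8601 strings whose UTC offset varies (`+00:00` and `+01:00` both appear above). If a later step needs to compare or sort them, one option, shown here as a sketch and not something the notebook itself does, is to parse them into a single timezone-aware column:

```python
import pandas as pd

# Two timestamps copied from the output above, with different UTC offsets.
df_toy = pd.DataFrame({
    "event_id": [194419, 157884],
    "start_ts": ["2018-03-10T19:30:00+00:00", "2018-04-26T20:00:00+01:00"],
})

# utc=True converts every offset to UTC, so the column gets one timezone-aware dtype.
df_toy["start_utc"] = pd.to_datetime(df_toy["start_ts"], utc=True)
```

The original string columns are left untouched, so they can still serve as join keys exactly as saved.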

**NOTE** Schedules do not have tags, so no Schedules_Tags dataframe is generated. That part of the ontology is wrong and should be corrected.

## 8. Schedules_PhoneNumber Dataframe

In [None]:
df_schedules_pn=df_schedules_total[['event_id', 'start_ts', 'place_id','end_ts', 'phone_numbers']]
df_schedules_pn=df_schedules_pn[df_schedules_pn['phone_numbers'].notna()]
df_schedules_pn

Unnamed: 0,event_id,start_ts,place_id,end_ts,phone_numbers
65,58304,2018-04-29T19:30:00+01:00,610,2018-04-29T19:30:00+01:00,{'info': '01506 777666'}
100,834033,2020-01-24T21:00:00+00:00,377,2020-01-24T21:00:00+00:00,{'info': '0844 573 8455'}
145,943017,2018-04-08T11:30:00+01:00,379,2018-04-08T11:30:00+01:00,{'info': '0844 557 2686'}
182,345799,2019-08-17T20:00:00+01:00,381,2019-08-17T20:00:00+01:00,{'info': '0131 473 2000'}
339,557533,2018-09-29T19:30:00+01:00,383,2018-09-29T19:30:00+01:00,{'info': '01316682019'}
...,...,...,...,...,...
2898,1680444,2021-08-15T20:30:00+01:00,129818,2021-08-15T20:30:00+01:00,{'info': '0131 473 2000'}
2899,1680445,2021-08-14T20:30:00+01:00,129818,2021-08-14T20:30:00+01:00,{'info': '0131 473 2000'}
2900,1680449,2021-08-26T20:30:00+01:00,129818,2021-08-26T20:30:00+01:00,{'info': '0131 473 2000'}
2901,1680450,2021-08-17T20:30:00+01:00,129818,2021-08-17T20:30:00+01:00,{'info': '0131 473 2000'}


In [None]:
df_schedules_pn=pd.concat([df_schedules_pn.drop(['phone_numbers'], axis=1), df_schedules_pn['phone_numbers'].apply(pd.Series)], axis=1)

df_schedules_pn

Unnamed: 0,event_id,start_ts,place_id,end_ts,info
65,58304,2018-04-29T19:30:00+01:00,610,2018-04-29T19:30:00+01:00,01506 777666
100,834033,2020-01-24T21:00:00+00:00,377,2020-01-24T21:00:00+00:00,0844 573 8455
145,943017,2018-04-08T11:30:00+01:00,379,2018-04-08T11:30:00+01:00,0844 557 2686
182,345799,2019-08-17T20:00:00+01:00,381,2019-08-17T20:00:00+01:00,0131 473 2000
339,557533,2018-09-29T19:30:00+01:00,383,2018-09-29T19:30:00+01:00,01316682019
...,...,...,...,...,...
2898,1680444,2021-08-15T20:30:00+01:00,129818,2021-08-15T20:30:00+01:00,0131 473 2000
2899,1680445,2021-08-14T20:30:00+01:00,129818,2021-08-14T20:30:00+01:00,0131 473 2000
2900,1680449,2021-08-26T20:30:00+01:00,129818,2021-08-26T20:30:00+01:00,0131 473 2000
2901,1680450,2021-08-17T20:30:00+01:00,129818,2021-08-17T20:30:00+01:00,0131 473 2000


In [None]:
df_schedules_pn.to_pickle(dataframes_final+"/df_schedules_phone_numbers")

## 9. Performances Dataframe

In [None]:
df_total_performances=df_schedules_total[['event_id','place_id', 'start_ts', 'end_ts', 'performances']].explode('performances')
df_total_performances=pd.concat([df_total_performances.drop(['performances'], axis=1), df_total_performances['performances'].apply(pd.Series)], axis=1)
#### NEW FOR DROPPING REPEATED PERFORMANCES
df_total_performances=df_total_performances.drop_duplicates(subset=['ts', 'place_id','start_ts', 'end_ts', 'event_id'], keep="first")
df_total_performances

Unnamed: 0,event_id,place_id,start_ts,end_ts,ts,duration,links,tickets,properties,descriptions,time_unknown
0,157884,383,2018-04-26T20:00:00+01:00,2018-04-26T20:00:00+01:00,2018-04-26T20:00:00+01:00,150.0,"[{'type': 'booking', 'url': 'http://www.theque...","[{'type': 'Standard', 'currency': 'GBP', 'min_...",,,
1,194419,11092,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,120.0,,"[{'type': 'Standard', 'currency': 'GBP', 'min_...",,,
1,194419,11200,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,,"[{'type': 'booking', 'url': 'https://www.ticke...","[{'type': 'Standard', 'currency': 'GBP', 'desc...",{'performance.sold-out': True},,
1,194419,386,2018-05-07T20:00:00+01:00,2018-05-07T20:00:00+01:00,2018-05-07T20:00:00+01:00,,"[{'type': 'booking', 'url': 'https://www.trave...","[{'type': 'Standard', 'currency': 'GBP', 'min_...",,,
2,240818,1,2018-01-24T20:30:00+00:00,2018-01-28T20:30:00+00:00,2018-01-24T20:30:00+00:00,,"[{'type': 'booking', 'url': 'http://www.thesta...","[{'type': 'Standard', 'currency': 'GBP', 'min_...",,"[{'type': 'list.description.default', 'descrip...",
...,...,...,...,...,...,...,...,...,...,...,...
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-16T09:00:00+01:00,480.0,"[{'type': 'booking', 'url': 'https://www.kildu...","[{'type': 'Standard', 'currency': 'GBP', 'min_...",,,
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-17T09:00:00+01:00,480.0,"[{'type': 'booking', 'url': 'https://www.kildu...","[{'type': 'Standard', 'currency': 'GBP', 'min_...",,,
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-18T09:00:00+01:00,480.0,"[{'type': 'booking', 'url': 'https://www.kildu...","[{'type': 'Standard', 'currency': 'GBP', 'min_...",,,
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-19T09:00:00+01:00,480.0,"[{'type': 'booking', 'url': 'https://www.kildu...","[{'type': 'Standard', 'currency': 'GBP', 'min_...",,,


In [None]:
df_performances=df_total_performances[['event_id','place_id', 'start_ts', 'end_ts', 'ts', 'duration', 'time_unknown' ]]

In [None]:
#comment out this line if you do not want to save the dataframe to file
df_performances.to_pickle(dataframes_final+"/df_performances")
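
Each performance row keeps `event_id`, `place_id`, `start_ts` and `end_ts`, which appear to act as a composite key back to its schedule. A sketch of how that link could be followed when building the graph, using two made-up frames with the same column names:

```python
import pandas as pd

# Columns assumed to identify a schedule (an assumption based on the saved columns).
schedule_key = ["event_id", "place_id", "start_ts", "end_ts"]

df_sched = pd.DataFrame({
    "event_id": [240818],
    "place_id": [1],
    "start_ts": ["2018-01-24T20:30:00+00:00"],
    "end_ts": ["2018-01-28T20:30:00+00:00"],
    "performance_space": [None],
})
df_perf = pd.DataFrame({
    "event_id": [240818, 240818],
    "place_id": [1, 1],
    "start_ts": ["2018-01-24T20:30:00+00:00"] * 2,
    "end_ts": ["2018-01-28T20:30:00+00:00"] * 2,
    "ts": ["2018-01-24T20:30:00+00:00", "2018-01-28T20:30:00+00:00"],
})

# Each performance finds its schedule through the shared key columns.
# validate raises an error if a key matches more than one schedule.
linked = df_perf.merge(df_sched, on=schedule_key, how="left", validate="many_to_one")
```

If `validate` raises on the real data, the key is not unique per schedule and would need an extra column or a deduplication step.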

## 10. PERFORMANCES_PROPERTIES DATAFRAME


**IMPORTANT** We have realised that we no longer need these three nodes: PROPERTYEVENTS, THEATRE and FILM.
All the properties of these three nodes are now PERFORMANCE_PROPERTY.

The ontology should be updated to reflect this.

Note: The following cell takes 2 or 3 minutes to run.

In [None]:
df_p_prop_total=df_total_performances[['event_id', 'place_id','start_ts', 'end_ts', 'ts', 'properties']]
df_p_prop_total= df_p_prop_total[df_p_prop_total['properties'].notna()]
df_p_prop=pd.concat([df_p_prop_total.drop(['properties'], axis=1), df_p_prop_total['properties'].apply(pd.Series)], axis=1)
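
Much of this run time probably comes from `apply(pd.Series)`, which builds one Series object per row. When every remaining value is a dict (as after the `notna()` filter), building the columns in a single `pd.DataFrame` call is usually much faster. A sketch on made-up rows, with the original pattern kept for comparison:

```python
import pandas as pd

# Toy data shaped like df_p_prop_total after the notna() filter (made-up rows).
df_toy = pd.DataFrame(
    {
        "event_id": [194419, 345866],
        "properties": [
            {"performance.sold-out": True},
            {"event.festival": "Glen's Guide"},
        ],
    },
    index=[1, 3],
)

# Build all property columns in one call instead of one pd.Series per row.
props = pd.DataFrame(df_toy["properties"].tolist(), index=df_toy.index)
fast = pd.concat([df_toy.drop(columns=["properties"]), props], axis=1)

# Reference result using the notebook's original pattern.
slow = pd.concat(
    [df_toy.drop(columns=["properties"]), df_toy["properties"].apply(pd.Series)],
    axis=1,
)
```

Both variants should yield the same set of columns; column dtypes may differ slightly, so it is worth comparing the two on a sample before switching.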


In [None]:
# assign the result; drop_duplicates does not modify the dataframe in place
df_p_prop=df_p_prop.drop_duplicates(keep='first')
df_p_prop

Unnamed: 0,event_id,place_id,start_ts,end_ts,ts,performance.sold-out,event.festival,performance.cancelled,event.support,list.hitlisted,...,event.film.premium-screening,event.film.over-18s,event.film.parent-and-baby,event.film.senior,event.film.subtitled,event.film.3d,event.film.imax,event.theatre.bsl-interpreted,event.film.autism-friendly,event.theatre.captioned
1,194419,11200,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,True,,,,,...,,,,,,,,,,
3,345866,1,2018-11-05T20:30:00+00:00,2019-04-29T20:30:00+01:00,2019-02-04T20:30:00+00:00,,Glen's Guide,,,,...,,,,,,,,,,
3,345866,1,2018-11-05T20:30:00+00:00,2019-04-29T20:30:00+01:00,2019-02-11T20:30:00+00:00,,Glen's Guide,,,,...,,,,,,,,,,
3,345866,1,2018-11-05T20:30:00+00:00,2019-04-29T20:30:00+01:00,2019-02-18T20:30:00+00:00,,Glen's Guide,,,,...,,,,,,,,,,
3,345866,1,2018-11-05T20:30:00+00:00,2019-04-29T20:30:00+01:00,2019-02-25T20:30:00+00:00,,Glen's Guide,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2278,1595055,127985,2020-09-12T10:30:00+01:00,2020-10-24T10:30:00+01:00,2020-10-09T14:30:00+01:00,,,,,True,...,,,,,,,,,,
2278,1595055,127985,2020-09-12T10:30:00+01:00,2020-10-24T10:30:00+01:00,2020-10-11T10:30:00+01:00,,,,,True,...,,,,,,,,,,
2278,1595055,127985,2020-09-12T10:30:00+01:00,2020-10-24T10:30:00+01:00,2020-10-16T14:30:00+01:00,,,,,True,...,,,,,,,,,,
2278,1595055,127985,2020-09-12T10:30:00+01:00,2020-10-24T10:30:00+01:00,2020-10-23T14:00:00+01:00,,,,,True,...,,,,,,,,,,


In [None]:
df_p_prop.to_pickle(dataframes_final+"/df_performances_properties")

## 11. PERFORMANCES DESCRIPTION

In [None]:
df_p_desc=df_total_performances[['event_id', 'start_ts','place_id', 'end_ts', 'ts','descriptions']].explode('descriptions')
df_p_desc=df_p_desc[df_p_desc['descriptions'].notna()]
df_p_desc=pd.concat([df_p_desc.drop(['descriptions'], axis=1), df_p_desc['descriptions'].apply(pd.Series)], axis=1)
df_p_desc

Unnamed: 0,event_id,start_ts,place_id,end_ts,ts,type,description
2,240818,2018-01-24T20:30:00+00:00,1,2018-01-28T20:30:00+00:00,2018-01-24T20:30:00+00:00,list.description.default,"With Vladimir McTavish, Jim Smith, Wisarut Jan..."
2,240818,2018-01-24T20:30:00+00:00,1,2018-01-28T20:30:00+00:00,2018-01-28T20:30:00+00:00,list.description.default,"With Stephen Carlin, Gus Lymburn, Donald Alexa..."
3,345866,2017-11-06T20:30:00+00:00,1,2018-04-30T20:30:00+01:00,2017-11-13T20:30:00+00:00,list.description.default,With host Gareth Mutch and headliner Jim Smith.
3,345866,2017-11-06T20:30:00+00:00,1,2018-04-30T20:30:00+01:00,2017-11-20T20:30:00+00:00,list.description.default,With headliner Stuart Mitchell.
3,345866,2017-11-06T20:30:00+00:00,1,2018-04-30T20:30:00+01:00,2017-11-27T20:30:00+00:00,list.description.default,With headliner Peter Brush.
...,...,...,...,...,...,...,...
2253,1582837,2020-08-27T18:00:00+01:00,127344,2020-10-31T00:00:00+00:00,2020-08-29T22:00:00+01:00,list.description.default,Alien (1979)
2253,1582837,2020-08-27T18:00:00+01:00,127344,2020-10-31T00:00:00+00:00,2020-08-30T10:30:00+01:00,list.description.default,Up (2009)
2253,1582837,2020-08-27T18:00:00+01:00,127344,2020-10-31T00:00:00+00:00,2020-08-30T14:00:00+01:00,list.description.default,The Princess Bride (1987)
2253,1582837,2020-08-27T18:00:00+01:00,127344,2020-10-31T00:00:00+00:00,2020-08-30T17:45:00+01:00,list.description.default,La La Land (2017)


In [None]:
# assign the result; drop_duplicates does not modify the dataframe in place
df_p_desc=df_p_desc.drop_duplicates(keep='first')
df_p_desc

Unnamed: 0,event_id,start_ts,place_id,end_ts,ts,type,description
2,240818,2018-01-24T20:30:00+00:00,1,2018-01-28T20:30:00+00:00,2018-01-24T20:30:00+00:00,list.description.default,"With Vladimir McTavish, Jim Smith, Wisarut Jan..."
2,240818,2018-01-24T20:30:00+00:00,1,2018-01-28T20:30:00+00:00,2018-01-28T20:30:00+00:00,list.description.default,"With Stephen Carlin, Gus Lymburn, Donald Alexa..."
3,345866,2017-11-06T20:30:00+00:00,1,2018-04-30T20:30:00+01:00,2017-11-13T20:30:00+00:00,list.description.default,With host Gareth Mutch and headliner Jim Smith.
3,345866,2017-11-06T20:30:00+00:00,1,2018-04-30T20:30:00+01:00,2017-11-20T20:30:00+00:00,list.description.default,With headliner Stuart Mitchell.
3,345866,2017-11-06T20:30:00+00:00,1,2018-04-30T20:30:00+01:00,2017-11-27T20:30:00+00:00,list.description.default,With headliner Peter Brush.
...,...,...,...,...,...,...,...
2253,1582837,2020-08-27T18:00:00+01:00,127344,2020-10-31T00:00:00+00:00,2020-08-29T22:00:00+01:00,list.description.default,Alien (1979)
2253,1582837,2020-08-27T18:00:00+01:00,127344,2020-10-31T00:00:00+00:00,2020-08-30T10:30:00+01:00,list.description.default,Up (2009)
2253,1582837,2020-08-27T18:00:00+01:00,127344,2020-10-31T00:00:00+00:00,2020-08-30T14:00:00+01:00,list.description.default,The Princess Bride (1987)
2253,1582837,2020-08-27T18:00:00+01:00,127344,2020-10-31T00:00:00+00:00,2020-08-30T17:45:00+01:00,list.description.default,La La Land (2017)


In [None]:
df_p_desc.to_pickle(dataframes_final+"/df_performances_descriptions")

## 12. PERFORMANCE LINKS

This cell takes 5 minutes to run

In [None]:
df_p_links=df_total_performances[['event_id', 'start_ts', 'place_id','end_ts', 'ts','links']].explode('links')
df_p_links=df_p_links[df_p_links['links'].notna()]
df_p_links=pd.concat([df_p_links.drop(['links'], axis=1), df_p_links['links'].apply(pd.Series)], axis=1)
df_p_links

Unnamed: 0,event_id,start_ts,place_id,end_ts,ts,type,url
0,157884,2018-04-26T20:00:00+01:00,383,2018-04-26T20:00:00+01:00,2018-04-26T20:00:00+01:00,booking,http://www.thequeenshall.net/whats-on/shows/va...
1,194419,2018-03-08T20:00:00+00:00,11200,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,booking,https://www.ticketsource.co.uk/booking/date/44...
1,194419,2018-05-07T20:00:00+01:00,386,2018-05-07T20:00:00+01:00,2018-05-07T20:00:00+01:00,booking,https://www.traverse.co.uk/whats-on/event-deta...
2,240818,2018-01-24T20:30:00+00:00,1,2018-01-28T20:30:00+00:00,2018-01-24T20:30:00+00:00,booking,http://www.thestand.co.uk/show/29395/burns_nig...
2,240818,2018-01-24T20:30:00+00:00,1,2018-01-28T20:30:00+00:00,2018-01-28T20:30:00+00:00,booking,http://www.thestand.co.uk/show/29397/burns_nig...
...,...,...,...,...,...,...,...
2278,1595055,2020-09-12T10:30:00+01:00,127985,2020-10-24T10:30:00+01:00,2020-10-24T10:30:00+01:00,booking,https://www.eventbrite.com/e/black-history-wal...
2279,1599103,2020-10-16T09:00:00+01:00,128231,2020-10-19T09:00:00+01:00,2020-10-16T09:00:00+01:00,booking,https://www.kilduff.co.uk/patch/
2279,1599103,2020-10-16T09:00:00+01:00,128231,2020-10-19T09:00:00+01:00,2020-10-17T09:00:00+01:00,booking,https://www.kilduff.co.uk/patch/
2279,1599103,2020-10-16T09:00:00+01:00,128231,2020-10-19T09:00:00+01:00,2020-10-18T09:00:00+01:00,booking,https://www.kilduff.co.uk/patch/


In [None]:
df_p_links.to_pickle(dataframes_final+"/df_performances_links")

## 13. TICKETS

This cell takes 5 minutes to run


In [None]:
df_tickets=df_total_performances[['event_id','place_id','start_ts', 'end_ts', 'ts','tickets']].explode('tickets')
df_tickets=df_tickets[df_tickets['tickets'].notna()]
df_tickets=pd.concat([df_tickets.drop(['tickets'], axis=1), df_tickets['tickets'].apply(pd.Series)], axis=1)


In [None]:
df_tickets

Unnamed: 0,event_id,place_id,start_ts,end_ts,ts,type,currency,min_price,description,max_price
0,157884,383,2018-04-26T20:00:00+01:00,2018-04-26T20:00:00+01:00,2018-04-26T20:00:00+01:00,Standard,GBP,14.0,,
1,194419,11092,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,Standard,GBP,15.0,,
1,194419,11092,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,Concession,GBP,13.0,,
1,194419,11092,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,Children,GBP,6.0,,
1,194419,11200,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,Standard,GBP,,tbc,
...,...,...,...,...,...,...,...,...,...,...
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-16T09:00:00+01:00,Standard,GBP,1.0,,
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-17T09:00:00+01:00,Standard,GBP,1.0,,
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-18T09:00:00+01:00,Standard,GBP,1.0,,
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-19T09:00:00+01:00,Standard,GBP,1.0,,


In [None]:
df_ticketDescription=df_tickets[['event_id','place_id','start_ts', 'type','end_ts', 'ts','description']]

df_ticketsDescription= df_ticketDescription[df_ticketDescription['description'].notna()]
df_ticketsDescription

Unnamed: 0,event_id,place_id,start_ts,type,end_ts,ts,description
1,194419,11200,2018-03-08T20:00:00+00:00,Standard,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,tbc
4,347164,1,2019-05-04T20:30:00+01:00,Standard,2019-10-26T20:30:00+01:00,2019-08-31T20:30:00+01:00,£tbc
4,347164,1,2019-05-04T20:30:00+01:00,Standard,2019-10-26T20:30:00+01:00,2019-09-07T20:30:00+01:00,£tbc
4,347164,1,2019-05-04T20:30:00+01:00,Standard,2019-10-26T20:30:00+01:00,2019-09-14T20:30:00+01:00,£tbc
4,347164,1,2019-05-04T20:30:00+01:00,Standard,2019-10-26T20:30:00+01:00,2019-09-21T20:30:00+01:00,£tbc
...,...,...,...,...,...,...,...
2112,1542024,103236,2020-05-22T19:30:00+01:00,Standard,2020-05-22T19:30:00+01:00,2020-05-22T19:30:00+01:00,Adults £15\n16 years and under £10\nIncludes b...
2127,1443727,104128,2020-07-11T20:00:00+01:00,Standard,2020-07-11T20:00:00+01:00,2020-07-11T20:00:00+01:00,£tbc
2142,1237645,106273,2020-07-23T11:00:00+01:00,Standard,2020-07-23T11:00:00+01:00,2020-07-23T11:00:00+01:00,tbc
2157,1547779,114954,2020-06-14T19:30:00+01:00,Standard,2020-06-14T19:30:00+01:00,2020-06-14T19:30:00+01:00,£tbc


In [None]:
df_ticketsDescription=df_ticketsDescription.drop_duplicates(keep='first')
df_ticketsDescription

Unnamed: 0,event_id,place_id,start_ts,type,end_ts,ts,description
1,194419,11200,2018-03-08T20:00:00+00:00,Standard,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,tbc
4,347164,1,2019-05-04T20:30:00+01:00,Standard,2019-10-26T20:30:00+01:00,2019-08-31T20:30:00+01:00,£tbc
4,347164,1,2019-05-04T20:30:00+01:00,Standard,2019-10-26T20:30:00+01:00,2019-09-07T20:30:00+01:00,£tbc
4,347164,1,2019-05-04T20:30:00+01:00,Standard,2019-10-26T20:30:00+01:00,2019-09-14T20:30:00+01:00,£tbc
4,347164,1,2019-05-04T20:30:00+01:00,Standard,2019-10-26T20:30:00+01:00,2019-09-21T20:30:00+01:00,£tbc
...,...,...,...,...,...,...,...
2112,1542024,103236,2020-05-22T19:30:00+01:00,Standard,2020-05-22T19:30:00+01:00,2020-05-22T19:30:00+01:00,Adults £15\n16 years and under £10\nIncludes b...
2127,1443727,104128,2020-07-11T20:00:00+01:00,Standard,2020-07-11T20:00:00+01:00,2020-07-11T20:00:00+01:00,£tbc
2142,1237645,106273,2020-07-23T11:00:00+01:00,Standard,2020-07-23T11:00:00+01:00,2020-07-23T11:00:00+01:00,tbc
2157,1547779,114954,2020-06-14T19:30:00+01:00,Standard,2020-06-14T19:30:00+01:00,2020-06-14T19:30:00+01:00,£tbc


In [None]:
df_ticketsDescription.to_pickle(dataframes_final+"/df_ticketDescription")

In [None]:
df_tickets=df_tickets.drop(['description'],axis=1)
df_tickets

Unnamed: 0,event_id,place_id,start_ts,end_ts,ts,type,currency,min_price,max_price
0,157884,383,2018-04-26T20:00:00+01:00,2018-04-26T20:00:00+01:00,2018-04-26T20:00:00+01:00,Standard,GBP,14.0,
1,194419,11092,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,Standard,GBP,15.0,
1,194419,11092,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,Concession,GBP,13.0,
1,194419,11092,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,Children,GBP,6.0,
1,194419,11200,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,Standard,GBP,,
...,...,...,...,...,...,...,...,...,...
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-16T09:00:00+01:00,Standard,GBP,1.0,
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-17T09:00:00+01:00,Standard,GBP,1.0,
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-18T09:00:00+01:00,Standard,GBP,1.0,
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-19T09:00:00+01:00,Standard,GBP,1.0,


In [None]:
df_tickets=df_tickets.drop_duplicates(keep='first')
df_tickets

Unnamed: 0,event_id,place_id,start_ts,end_ts,ts,type,currency,min_price,max_price
0,157884,383,2018-04-26T20:00:00+01:00,2018-04-26T20:00:00+01:00,2018-04-26T20:00:00+01:00,Standard,GBP,14.0,
1,194419,11092,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,Standard,GBP,15.0,
1,194419,11092,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,Concession,GBP,13.0,
1,194419,11092,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,2018-03-10T19:30:00+00:00,Children,GBP,6.0,
1,194419,11200,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,2018-03-08T20:00:00+00:00,Standard,GBP,,
...,...,...,...,...,...,...,...,...,...
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-16T09:00:00+01:00,Standard,GBP,1.0,
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-17T09:00:00+01:00,Standard,GBP,1.0,
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-18T09:00:00+01:00,Standard,GBP,1.0,
2279,1599103,128231,2020-10-16T09:00:00+01:00,2020-10-19T09:00:00+01:00,2020-10-19T09:00:00+01:00,Standard,GBP,1.0,


In [None]:
# comment out this line if you do not want to save df_tickets to a file
df_tickets.to_pickle(dataframes_final+"/df_tickets")

## 14. PLACES

In [None]:
# use a separate name for the file handle; df_places is reused below for the reduced dataframe
with open("/content/drive/MyDrive/df_places","rb") as f:
    df_places_total=pickle.load(f)

In [None]:
df_places_total.iloc[0]

address                                               5 York Place
email                                         admin@thestand.co.uk
postal_code                                                EH1 3EB
properties       {'place.child-restrictions': True, 'place.faci...
sort_name                                                    Stand
town                                                     Edinburgh
website                                  http://www.thestand.co.uk
place_id                                                         1
modified_ts                                   2021-11-24T12:18:33Z
created_ts                                    2021-11-24T12:18:33Z
name                                                     The Stand
loc              {'latitude': '55.955806109395006', 'longitude'...
country_code                                                    GB
tags                 [Bar & pub food, Comedy, Restaurants, Venues]
descriptions     [{'type': 'description.list.default', 'descri

In [None]:
df_places=df_places_total[['place_id', 'created_ts', 'modified_ts', 'name', 'sort_name', 'address', 'town', 'postal_code', 'country_code', 'website', 'email', 'status']]
df_places

Unnamed: 0,place_id,created_ts,modified_ts,name,sort_name,address,town,postal_code,country_code,website,email,status
0,1,2021-11-24T12:18:33Z,2021-11-24T12:18:33Z,The Stand,Stand,5 York Place,Edinburgh,EH1 3EB,GB,http://www.thestand.co.uk,admin@thestand.co.uk,live
1,371,2019-12-04T13:27:26Z,2019-12-04T13:27:26Z,St Bride's Centre,St Bride's Centre,10 Orwell Terrace,Edinburgh,EH11 2DY,GB,http://stbrides.wordpress.com,,live
2,372,2021-02-23T16:57:44Z,2021-02-23T16:57:44Z,Institut Français d'Ecosse,Institut Français d'Ecosse,West Parliament Square,Edinburgh,EH1 1RN,GB,http://www.ifecosse.org.uk,ifecosse.edimbourg-cslt@diplomatie.gouv.fr,live
3,375,2015-02-18T15:59:38Z,2015-02-18T15:59:38Z,Meadowbank Sports Centre,Meadowbank Sports Centre,139 London Road,Edinburgh,EH7 6AE,GB,http://www.edinburghleisure.co.uk,,live
4,376,2020-01-27T10:18:15Z,2020-01-27T10:18:15Z,Royal Highland Centre,Royal Highland Centre,Ingliston,Edinburgh,EH28 8NB,GB,http://www.royalhighlandcentre.co.uk,,live
...,...,...,...,...,...,...,...,...,...,...,...,...
512,127508,2020-07-22T16:37:19Z,2020-07-22T16:37:19Z,Lochgelly Raceway,Lochgelly Raceway,A92,Lochgelly,KY5 9HG,GB,https://www.hardieracepromotions.co.uk/,,live
514,127985,2020-09-06T21:18:42Z,2020-09-06T21:18:42Z,Melville Monument,Melville Monument,42 St Andrew Square,Edinburgh,EH2 2AD,GB,,,live
515,128007,2020-09-08T17:49:40Z,2020-09-08T17:49:40Z,Edinburgh Technopole,Edinburgh Technopole,Milton Bridge,Edinburgh,EH26 0BB,GB,https://edinburghtechnopole.co.uk/,,live
516,128231,2020-09-22T17:58:11Z,2020-09-22T17:58:11Z,Kilduff Farm,Kilduff Farm,Kilduff Farm Drem,North Berwick,EH39 5BD,GB,,,live


In [None]:
df_places.to_pickle(dataframes_final+"/df_places")

## 15. PLACES PROPERTIES DATAFRAME

In [None]:
df_place_prop=df_places_total[['place_id','properties']]
df_place_prop=df_place_prop[df_place_prop['properties'].notna()]
df_place_prop=pd.concat([df_place_prop.drop(['properties'], axis=1), df_place_prop['properties'].apply(pd.Series)], axis=1)
df_place_prop

Unnamed: 0,place_id,place.child-restrictions,place.facilities.free-wifi,place.facilities.dogs-allowed,place.facilities.parking,place.facilities.toilets,place.facilities.toilets_disabled,place.facilities.wheelchair-access,place.capacity.max,place.facilities.guide-dogs,place.facilities.hearing-loop,place.child-friendly,place.facilities.toilets.baby-changing
0,1,True,True,False,True,True,False,False,160,,,,
2,372,,False,,False,False,False,True,,,,,
3,375,,,,,,,,16500.0,,,,
4,376,,,,,,,,35000,,,,
5,377,,,,,,,,788,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
477,124806,,True,,False,True,True,True,100,,,,
485,127866,,True,,False,True,False,True,40,,,,
501,129920,,False,,False,True,True,True,,,,,
374,62062,,True,,True,True,True,True,,,,,


In [None]:

df_place_prop.to_pickle(dataframes_final+"/df_places_properties")
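
The same dict-to-columns expansion can also be written without `apply(pd.Series)`, which tends to be slow on large frames. The following is a minimal sketch on an invented toy frame (`toy_places` and its values are assumptions for illustration, not data from `df_places_total`); it builds the wide frame directly from the list of dicts and reuses the original index so the rows stay aligned with `place_id`.

```python
import pandas as pd

# Toy stand-in for df_places_total; the values are invented for illustration.
toy_places = pd.DataFrame({
    "place_id": [1, 371, 372],
    "properties": [
        {"place.capacity.max": 160, "place.facilities.parking": True},
        None,
        {"place.facilities.parking": False},
    ],
})

# Keep only the rows that actually carry a properties dict.
with_props = toy_places[toy_places["properties"].notna()]

# Build one column per dict key, reusing the original index so the
# expanded frame lines up row by row with place_id.
expanded = pd.DataFrame(with_props["properties"].tolist(), index=with_props.index)
toy_place_prop = pd.concat([with_props[["place_id"]], expanded], axis=1)
print(toy_place_prop)
```

Keys missing from a given dict simply become `NaN` in that row, which matches the sparse layout of the properties table shown earlier.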

## 16. PLACE PHONE NUMBER

In [None]:
df_places_pn=df_places_total[['place_id','phone_numbers']]
df_places_pn=df_places_pn[df_places_pn['phone_numbers'].notna()]
df_places_pn=pd.concat([df_places_pn.drop(['phone_numbers'], axis=1), df_places_pn['phone_numbers'].apply(pd.Series)], axis=1)

df_places_pn

Unnamed: 0,place_id,info,box_office
0,1,0131 558 7272,0131 558 7272
1,371,0131 346 1405,
2,372,0131 285 6030,
3,375,0131 661 5351,
4,376,,0131 335 6200
...,...,...,...
439,101921,0131 313 4404,
440,103236,01450 360400,
505,126723,01835 830271,
512,127508,07584 837 445,


In [None]:
df_places_pn.to_pickle(dataframes_final+"/df_places_pn")

## 17. PLACE LOCATION DATAFRAME

In [None]:
df_places_loc=df_places_total[['place_id','loc']]
df_places_loc=df_places_loc[df_places_loc['loc'].notna()]
df_places_loc=pd.concat([df_places_loc.drop(['loc'], axis=1), df_places_loc['loc'].apply(pd.Series)], axis=1)

df_places_loc

Unnamed: 0,place_id,latitude,longitude
0,1,55.955806109395006,-3.1923184844646357
1,371,55.94255035,-3.22056693
2,372,55.94930633508542,-3.192111771011355
3,375,55.95640000,-3.15627000
4,376,55.94067800,-3.36880500
...,...,...,...
512,127508,56.12694697199462,-3.281544714233391
514,127985,55.95418700,-3.19310200
515,128007,55.85879500,-3.20775100
516,128231,55.98810000,-2.77096100


In [None]:

df_places_loc.to_pickle(dataframes_final+"/df_places_loc")

## 18. PLACES DESCRIPTION

In [None]:
df_places_desc=df_places_total[['place_id','descriptions']].explode('descriptions')
df_places_desc=df_places_desc[df_places_desc['descriptions'].notna()]
df_places_desc=pd.concat([df_places_desc.drop(['descriptions'], axis=1), df_places_desc['descriptions'].apply(pd.Series)], axis=1)

df_places_desc

Unnamed: 0,place_id,type,description
0,1,description.list.default,Cheerful cavern with all the ingredients requi...
1,371,description.list.default,The St Brides Community Centre is a former chu...
2,372,description.list.default,The Institut Francais d'Ecosse in Edinburgh's ...
4,376,description.list.default,"A popular large-scale events venue, the Royal ..."
5,377,description.list.default,"One of Edinburgh's largest, multi-use venues, ..."
...,...,...,...
501,129920,description.official,The purpose built cabaret venue for the Ladybo...
529,130974,description.official,For more info about the exhibition head to @bl...
439,101921,description.list.default,This Dalry stalwart continues to deliver creat...
439,101921,description.official,First Coast is quality neighbourhood bistro na...


In [None]:
# Save the places descriptions dataframe to disk
df_places_desc.to_pickle(dataframes_final+"/df_places_description")
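
The descriptions column holds a list of dicts per place, so it needs two steps: one row per list element, then one column per dict key. Here is a minimal sketch of that pattern on an invented toy frame (`toy_places` and its contents are assumptions for illustration). Resetting the index after `explode` is an optional choice made here so that the duplicated index labels do not have to be reasoned about.

```python
import pandas as pd

# Toy stand-in for df_places_total; the descriptions are invented.
toy_places = pd.DataFrame({
    "place_id": [1, 439, 500],
    "descriptions": [
        [{"type": "description.list.default", "description": "A"}],
        [{"type": "description.list.default", "description": "B"},
         {"type": "description.official", "description": "C"}],
        None,
    ],
})

# Step 1: one row per description; places without descriptions are dropped.
long_desc = toy_places.explode("descriptions")
long_desc = long_desc[long_desc["descriptions"].notna()].reset_index(drop=True)

# Step 2: one column per dict key ('type', 'description').
expanded = pd.DataFrame(long_desc["descriptions"].tolist())
toy_places_desc = pd.concat([long_desc[["place_id"]], expanded], axis=1)
print(toy_places_desc)
```

A place with two descriptions, such as `place_id` 439 here, ends up with two rows, one per description type.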

## 19. PLACES TAGS

In [None]:
df_places_tags=df_places_total[['place_id','tags']].explode('tags')
df_places_tags

Unnamed: 0,place_id,tags
0,1,Bar & pub food
0,1,Comedy
0,1,Restaurants
0,1,Venues
1,371,Cinemas
...,...,...
515,128007,Business centre
516,128231,Farm
516,128231,Outdoors
517,128392,Pubs & bars


In [None]:
df_places_tags=df_places_tags[df_places_tags['tags'].notna()]

In [None]:
df_places_tags.to_pickle(dataframes_final+"/df_places_tags")

In [None]:
df_places_tags

Unnamed: 0,place_id,tags
0,1,Bar & pub food
0,1,Comedy
0,1,Restaurants
0,1,Venues
1,371,Cinemas
...,...,...
515,128007,Business centre
516,128231,Farm
516,128231,Outdoors
517,128392,Pubs & bars


# SPARQL Verification Data Set

In [None]:
CheckData=df_schedules.merge(df_places,on='place_id')

In [None]:
CheckData=CheckData.merge(df_e_tags,on=['event_id'])

In [None]:
CheckData=CheckData.merge(df_e_category,on=['event_id'])

In [None]:
Edinburgh=CheckData[CheckData['town']=='Edinburgh']

In [None]:
tem=Edinburgh.loc[Edinburgh['tags']=='Music']

In [None]:
tem[tem['category']=='Music']

Unnamed: 0,event_id,start_ts,end_ts,place_id,performance_space,created_ts,modified_ts,name,sort_name,address,town,postal_code,country_code,website,email,status,tags,category
1,157884,2018-04-26T20:00:00+01:00,2018-04-26T20:00:00+01:00,383,,2020-01-24T12:59:18Z,2020-01-24T12:59:18Z,The Queen's Hall,Queen's Hall,85--89 Clerk Street,Edinburgh,EH8 9JG,GB,http://www.thequeenshall.net,boxoffice@queenshalledinburgh.org,live,Music,Music
10,160228,2018-11-23T20:00:00+00:00,2018-11-23T20:00:00+00:00,383,,2020-01-24T12:59:18Z,2020-01-24T12:59:18Z,The Queen's Hall,Queen's Hall,85--89 Clerk Street,Edinburgh,EH8 9JG,GB,http://www.thequeenshall.net,boxoffice@queenshalledinburgh.org,live,Music,Music
26,178122,2018-03-23T19:00:00+00:00,2018-03-23T19:00:00+00:00,383,,2020-01-24T12:59:18Z,2020-01-24T12:59:18Z,The Queen's Hall,Queen's Hall,85--89 Clerk Street,Edinburgh,EH8 9JG,GB,http://www.thequeenshall.net,boxoffice@queenshalledinburgh.org,live,Music,Music
28,178122,2020-04-10T19:00:00+01:00,2020-04-10T19:00:00+01:00,383,,2020-01-24T12:59:18Z,2020-01-24T12:59:18Z,The Queen's Hall,Queen's Hall,85--89 Clerk Street,Edinburgh,EH8 9JG,GB,http://www.thequeenshall.net,boxoffice@queenshalledinburgh.org,live,Music,Music
30,178122,2019-04-19T19:00:00+01:00,2019-04-19T19:00:00+01:00,383,,2020-01-24T12:59:18Z,2020-01-24T12:59:18Z,The Queen's Hall,Queen's Hall,85--89 Clerk Street,Edinburgh,EH8 9JG,GB,http://www.thequeenshall.net,boxoffice@queenshalledinburgh.org,live,Music,Music
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135238,1680449,2021-08-26T20:30:00+01:00,2021-08-26T20:30:00+01:00,129818,,2021-06-28T10:55:23Z,2021-06-28T10:55:23Z,Edinburgh Park Festival Venue,Edinburgh Park Festival Venue,Lochside Way,Edinburgh,EH12 9GG,GB,https://www.eif.co.uk/venues/edinburgh-park,,live,Music,Music
135239,1680450,2021-08-17T20:30:00+01:00,2021-08-17T20:30:00+01:00,129818,,2021-06-28T10:55:23Z,2021-06-28T10:55:23Z,Edinburgh Park Festival Venue,Edinburgh Park Festival Venue,Lochside Way,Edinburgh,EH12 9GG,GB,https://www.eif.co.uk/venues/edinburgh-park,,live,Music,Music
135240,1680455,2021-08-09T20:30:00+01:00,2021-08-09T20:30:00+01:00,129818,,2021-06-28T10:55:23Z,2021-06-28T10:55:23Z,Edinburgh Park Festival Venue,Edinburgh Park Festival Venue,Lochside Way,Edinburgh,EH12 9GG,GB,https://www.eif.co.uk/venues/edinburgh-park,,live,Music,Music
135290,1691538,2021-08-25T21:00:00+01:00,2021-08-26T21:00:00+01:00,130151,Main Stage,2021-07-21T15:46:15Z,2021-07-21T15:46:15Z,NCP Edinburgh Castle Terrace,NCP Edinburgh Castle Terrace,Castle Terrace,Edinburgh,EH1 2EW,GB,,,live,Music,Music
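
When reading a merged check table like this one, it helps to remember that each merge can multiply rows: a schedule is repeated once per tag of its event. The sketch below illustrates this on invented toy frames (all names prefixed `toy_` and their values are assumptions, not the notebook's data). It also uses the `validate` argument of `merge`, which raises an error if the schedules-to-places join is not many-to-one.

```python
import pandas as pd

# Invented toy data: two schedules for event 10, one for event 11.
toy_schedules = pd.DataFrame({"event_id": [10, 10, 11], "place_id": [1, 1, 2]})
toy_places = pd.DataFrame({"place_id": [1, 2], "town": ["Edinburgh", "Leith"]})
toy_tags = pd.DataFrame({"event_id": [10, 10, 11],
                         "tags": ["Music", "Jazz", "Music"]})

# Many schedules point to one place; validate raises MergeError otherwise.
check = toy_schedules.merge(toy_places, on="place_id", validate="many_to_one")

# Each schedule row is repeated once per tag of its event.
check = check.merge(toy_tags, on="event_id")
print(len(toy_schedules), "schedule rows ->", len(check), "rows after merging tags")

music_rows = check[check["tags"] == "Music"]
print(music_rows)
```

Counts taken from such a table are therefore counts of schedule and tag combinations, not counts of distinct events.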


In [None]:
import plotly.express as px

def dataframe_groupby_size(df, column_list, rename, level=None, city=None, period="full", month_string=None):
    """Count rows per group; for a single column, also return scatter and bar figures."""
    # Count the rows in each group and store the count in a column called `rename`.
    df_v1=df.groupby(column_list).size().reset_index(name=rename)
    if len(column_list)!=1:
        return df_v1

    column=column_list[0]
    df_v1=df_v1.sort_values(by=[rename], ascending=False)

    # Build the plot title; `level` and `city` are optional.
    title=(level+" " if level else "") + rename + " per " + column
    if city:
        title=title + " at " + city
    if period!="full":
        # month_string names the month and must be supplied when period is not "full".
        title=title + " for the month of " + month_string + " over the years"

    fig_scatter=px.scatter(df_v1, x=column, y=rename, color=rename, size=rename, size_max=50, title=title)
    fig_bar=px.bar(df_v1, x=column, y=rename, color=column, barmode='group', title=title)
    return df_v1, fig_scatter, fig_bar

In [None]:
g_tags=dataframe_groupby_size(tem, ['tags','category'], 'frequency', 'Events')

In [None]:
g_tags

Unnamed: 0,tags,category,frequency
0,Music,Books,92
1,Music,Clubs,257
2,Music,Comedy,81
3,Music,Dance,233
4,Music,Days out,337
5,Music,Film,358
6,Music,Kids,75
7,Music,LGBT,21
8,Music,Music,7698
9,Music,Talks & Lectures,3
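
The frequency table above is a plain group-and-count. As a minimal sketch of what the multi-column branch computes, the toy example below (invented data, hypothetical names `toy` and `freq`) counts rows per tag and category pair with `groupby(...).size()` and names the count column in one step.

```python
import pandas as pd

# Invented toy rows standing in for the filtered check table.
toy = pd.DataFrame({
    "tags": ["Music", "Music", "Music", "Music"],
    "category": ["Music", "Music", "Film", "Music"],
})

# One row per (tags, category) pair with the number of matching rows.
freq = toy.groupby(["tags", "category"]).size().reset_index(name="frequency")
print(freq)
```

The result is sorted by the grouping keys, so `Film` comes before `Music` regardless of which pair is more frequent.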
