# Co-Occurring Tag Analysis

This notebook analyses how tags co-occur across various Parliamentary publications. The aim is to see whether topic tags fall into naturally occurring groupings by virtue of their co-occurrence when used to tag different classes of Parliamentary publication.

In [6]:
#Data files
!ls data/dataexport

edms        proceedings terms


## Utils

Import a library that lets us work with the data files:

In [11]:
#Data is provided as Turtle/ttl files - rdflib handles those

#!pip3 install rdflib
from rdflib import Graph

Simple utility to load all the `.ttl` files in a particular directory into a graph:

In [110]:
import os
def ttl_graphbuilder(path,g=None,debug=False):
    if g is None:
        g=Graph()
    for ttl in [f for f in os.listdir(path) if f.endswith('.ttl')]:
        if debug: print(ttl)
        g.parse('{}/{}'.format(path,ttl), format='turtle')
    return g

Tools for running queries over a graph and either printing the result or putting it into a `pandas` dataframe:

In [61]:
def rdfQuery(graph,q):
    ans=graph.query(q)
    for row in ans:
        for el in row:
            print(el,end=" ")
        print()

#ish via https://github.com/schemaorg/schemaorg/blob/sdo-callisto/scripts/dashboard.ipynb
import pandas as pd
def sparql2df(graph,q, cast_to_numeric=True):
    a=graph.query(q)
    c = []
    for b in a.bindings:
        rowvals=[]
        for k in a.vars:
            rowvals.append(b[k])
        c.append(rowvals)

    df = pd.DataFrame(c)
    df.columns = [str(v) for v in a.vars]
    if cast_to_numeric:
        df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))

    return df

Tools to support the export and display of graphs. The `networkx` package is handy here, e.g. for exporting to GEXF format for use with Gephi.

In [None]:
import networkx as nx

## Exploring the Data - Terms

In [42]:
path='data/dataexport/terms'
termgraph=ttl_graphbuilder(path)

In [91]:
#What's in the graph generally?
q='''
SELECT DISTINCT ?x ?y ?z {
    ?x ?y ?z.
} LIMIT 10
'''
rdfQuery(termgraph,q)

http://data.parliament.uk/terms/93225 http://www.w3.org/2004/02/skos/core#narrower http://data.parliament.uk/terms/90383 
http://data.parliament.uk/terms/414110 http://www.w3.org/2004/02/skos/core#notation  
http://data.parliament.uk/terms/408942 http://www.w3.org/2004/02/skos/core#related http://data.parliament.uk/terms/60315 
http://data.parliament.uk/terms/357565 http://www.w3.org/2004/02/skos/core#narrower http://data.parliament.uk/terms/349840 
http://data.parliament.uk/terms/60278 http://www.w3.org/2004/02/skos/core#exactMatch http://data.parliament.uk/terms/97318 
http://data.parliament.uk/terms/37000 http://www.w3.org/2004/02/skos/core#notation ORG 
http://data.parliament.uk/terms/1326 http://www.w3.org/2004/02/skos/core#broader http://data.parliament.uk/terms/1 
http://data.parliament.uk/terms/83361 http://www.w3.org/2004/02/skos/core#prefLabel Standing Veterinary Committee 
http://data.parliament.uk/terms/300910 http://www.w3.org/2004/02/skos/core#exactMatch http://data.parli

In [92]:
#What does a term have associated with it more specifically?
q='''
SELECT DISTINCT ?y ?z {
    <http://data.parliament.uk/terms/95551> ?y ?z.
} LIMIT 10
'''
rdfQuery(termgraph,q)

http://www.w3.org/2004/02/skos/core#narrower http://data.parliament.uk/terms/95502 
http://www.w3.org/2004/02/skos/core#narrower http://data.parliament.uk/terms/95494 
http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.w3.org/2004/02/skos/core#Concept 
http://www.w3.org/2004/02/skos/core#notation TPG 
http://www.w3.org/2004/02/skos/core#narrower http://data.parliament.uk/terms/95550 
http://www.w3.org/2004/02/skos/core#broader http://data.parliament.uk/terms/95548 
http://www.w3.org/2004/02/skos/core#prefLabel Defence policy 
http://www.w3.org/2004/02/skos/core#narrower http://data.parliament.uk/terms/95586 


Looks like the label is what we want:

In [97]:
q='''
SELECT DISTINCT ?z ?topic {
    ?z <http://www.w3.org/2004/02/skos/core#prefLabel> ?topic.
} LIMIT 10
'''
sparql2df(termgraph,q)

| | z | topic |
|---|---|---|
| 0 | http://data.parliament.uk/terms/90385 | Broadcasting programmes |
| 1 | http://data.parliament.uk/terms/4491 | British Airports Group |
| 2 | http://data.parliament.uk/terms/28888 | Derbyshire Royal Infirmary NHS Trust |
| 3 | http://data.parliament.uk/terms/24830 | Citizens Charter Complaints Task Force |
| 4 | http://data.parliament.uk/terms/84469 | UK Energy Research Centre |
| 5 | http://data.parliament.uk/terms/36872 | Graduate School of Business Administration |
| 6 | http://data.parliament.uk/terms/65884 | Railway Group |
| 7 | http://data.parliament.uk/terms/67412 | Safety-Net Foundation |
| 8 | http://data.parliament.uk/terms/83361 | Standing Veterinary Committee |
| 9 | http://data.parliament.uk/terms/50788 | Probation and aftercare |


## Exploring the Data - EDMS

In [87]:
path='data/dataexport/edms'
g=ttl_graphbuilder(path)

In [88]:
#See what's there generally...
q='''
SELECT DISTINCT ?x ?y ?z {
    ?x ?y ?z.
} LIMIT 10
'''
rdfQuery(g,q)

http://data.parliament.uk/edms/50212 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://data.parliament.uk/schema/parl#EarlyDayMotion 
http://data.parliament.uk/edms/50457 http://data.parliament.uk/schema/parl#motionText That this House notes the announcement of 300 redundancies at the Nestlé manufacturing factories in York, Fawdon, Halifax and Girvan and that production of the Blue Riband bar will be transferred to Poland; acknowledges in the first three months of 2017 Nestlé achieved £21 billion in sales, a 0.4 per cent increase over the same period in 2016; further notes 156 of these job losses will be in York, a city that in the last six months has seen 2,000 job losses announced and has become the most inequitable city outside of the South East, and a further 110 jobs from Fawdon, Newcastle; recognises the losses come within a month of triggering Article 50, and as negotiations with the EU on the UK leaving the EU and the UK's future with the EU are commencing; further recogni

In [90]:
#Explore a specific EDM
q='''
SELECT DISTINCT ?y ?z {
    <http://data.parliament.uk/edms/50457> ?y ?z.
}
'''
rdfQuery(g,q)

http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://data.parliament.uk/schema/parl#EarlyDayMotion 
http://data.parliament.uk/schema/parl#motionText That this House notes the announcement of 300 redundancies at the Nestlé manufacturing factories in York, Fawdon, Halifax and Girvan and that production of the Blue Riband bar will be transferred to Poland; acknowledges in the first three months of 2017 Nestlé achieved £21 billion in sales, a 0.4 per cent increase over the same period in 2016; further notes 156 of these job losses will be in York, a city that in the last six months has seen 2,000 job losses announced and has become the most inequitable city outside of the South East, and a further 110 jobs from Fawdon, Newcastle; recognises the losses come within a month of triggering Article 50, and as negotiations with the EU on the UK leaving the EU and the UK's future with the EU are commencing; further recognises the cost of importing products, including sugar, cocoa and production 

Let's merge the EDM graph data with the terms data.

In [98]:
path='data/dataexport/edms'
g=ttl_graphbuilder(path,termgraph)

Now we can look at the term labels associated with a particular EDM.

In [99]:
q='''
SELECT DISTINCT ?t ?z {
    <http://data.parliament.uk/edms/50114> <http://data.parliament.uk/schema/parl#topic> ?z.
    ?z <http://www.w3.org/2004/02/skos/core#prefLabel> ?t.
} LIMIT 10
'''
rdfQuery(g,q)

Arms control http://data.parliament.uk/terms/95494 
Defence policy http://data.parliament.uk/terms/95551 
International politics and government http://data.parliament.uk/terms/95650 
North America http://data.parliament.uk/terms/95690 


We can also create a table that links topic labels with EDMs. 

In [79]:
q='''
SELECT DISTINCT ?edms ?topic {
    ?edms <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.parliament.uk/schema/parl#EarlyDayMotion>.
    ?edms <http://data.parliament.uk/schema/parl#topic> ?z.
    ?z <http://www.w3.org/2004/02/skos/core#prefLabel> ?topic.
}
'''
g_df=sparql2df(g,q)
g_df.head()

| | edms | topic |
|---|---|---|
| 0 | http://data.parliament.uk/edms/50424 | Sports and Olympic Games |
| 1 | http://data.parliament.uk/edms/50424 | Children and families |
| 2 | http://data.parliament.uk/edms/50129 | Adult education |
| 3 | http://data.parliament.uk/edms/49551 | Sports and Olympic Games |
| 4 | http://data.parliament.uk/edms/50276 | Sports and Olympic Games |


From this table, we can generate a bipartite `networkx` graph that links topic labels with EDMs.

In [80]:
#from_pandas_dataframe() was renamed from_pandas_edgelist() in networkx 2.x
nxg=nx.from_pandas_edgelist(g_df, 'edms', 'topic')
#nx.write_gexf(nxg,'edms.gexf')

We can then project this bipartite graph onto just the topic label nodes - edges will now connect nodes that are linked through one or more common EDMs.

In [81]:
from networkx.algorithms import bipartite
#Take the two node sets from the dataframe columns; bipartite.sets() cannot
#separate them unambiguously when the graph is disconnected
edms,topic=set(g_df['edms']),set(g_df['topic'])

#Collapse the bipartite graph to a graph of topic labels connected via a common EDM
topicgraph= bipartite.projected_graph(nxg, topic)
nx.write_gexf(topicgraph,'edms_topics.gexf')
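
To see what the projection does on a small scale, here is a sketch using a toy bipartite graph. The EDM identifiers and the `demo`, `topics` and `topic_demo` names are made up for illustration; the topic labels are simply borrowed from the outputs above. Node sets are marked with a `bipartite` node attribute, which is a common `networkx` convention rather than something the original cells rely on.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical (EDM, topic) pairs; edm3 forms its own disconnected component
pairs = [('edm1', 'Arms control'), ('edm1', 'Defence policy'),
         ('edm2', 'Defence policy'), ('edm2', 'North America'),
         ('edm3', 'Adult education')]

demo = nx.Graph()
demo.add_nodes_from({e for e, _ in pairs}, bipartite=0)  # EDM side
demo.add_nodes_from({t for _, t in pairs}, bipartite=1)  # topic side
demo.add_edges_from(pairs)

# Select the topic nodes explicitly via the node attribute
topics = {n for n, d in demo.nodes(data=True) if d['bipartite'] == 1}

# Two topics are joined if they share at least one EDM
topic_demo = bipartite.projected_graph(demo, topics)
print(sorted(topic_demo.edges()))
```

In this toy graph, "Arms control" and "North America" are not joined directly because no single EDM carries both tags, and "Adult education" remains as an isolated node.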

We can also generate a weighted graph, where edges are weighted relative to how many times topics are linked through different EDMs.

In [101]:
topicgraph_weighted= bipartite.weighted_projected_graph(nxg, topic)
nx.write_gexf(topicgraph_weighted,'edms_topics_weighted.gexf')
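
As a rough cross-check on what the edge weights mean, the co-occurrence counts can also be computed directly from (EDM, topic) rows without `networkx`. This is a sketch on invented rows shaped like `g_df`; the EDM identifiers and variable names are hypothetical. It assumes the default behaviour of `weighted_projected_graph`, where the weight is the number of shared neighbours.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (EDM, topic) rows, shaped like the rows of g_df
rows = [('edm1', 'Arms control'), ('edm1', 'Defence policy'),
        ('edm2', 'Arms control'), ('edm2', 'Defence policy'),
        ('edm2', 'North America'), ('edm3', 'Adult education')]

# Group the topics attached to each EDM
topics_by_edm = defaultdict(set)
for edm, topic in rows:
    topics_by_edm[edm].add(topic)

# Count each unordered topic pair once per EDM in which both topics appear
pair_counts = Counter()
for topics in topics_by_edm.values():
    pair_counts.update(combinations(sorted(topics), 2))

for pair, weight in pair_counts.most_common():
    print(pair, weight)
```

Sorting the topics before pairing them keeps each pair in a single canonical order, so `('Arms control', 'Defence policy')` and its reverse are not counted separately.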

## Exploring the Data - Proceedings

In [111]:
path='data/dataexport/proceedings'
p=ttl_graphbuilder(path,debug=True)

0006D323-D0B5-4E22-A26E-75ABB621F58E.ttl


BadSyntax: at line 12 of <>:
Bad syntax (newline found in string literal) at ^ in:
"...b'ject <http://data.parliament.uk/terms/91873> ;\r\n\tparl:text "'^b'\r\nMr Philip Hollobone (in the Chair) \r\n\r\n\r\n \r\n Share this co'..."

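The error reports a raw newline inside a string literal. Turtle only permits line breaks inside triple-quoted long literals (`"""..."""`), so one possible workaround is to rewrap the offending literals before parsing. The following is a heuristic sketch only: `fix_multiline_literals` is a hypothetical helper, the sample string is invented, and the regular expression does not attempt to handle quote characters that appear inside comments.

```python
import re

# Matches either a triple-quoted long literal or a double-quoted short literal.
# DOTALL lets the short form span line breaks so that broken literals are caught.
_LITERAL = re.compile(r'"""(?:[^"\\]|\\.|"(?!""))*"""|"(?:[^"\\]|\\.)*"', re.DOTALL)

def fix_multiline_literals(ttl_text):
    """Rewrap double-quoted literals that contain raw newlines as long literals."""
    def _rewrap(match):
        token = match.group(0)
        if token.startswith('"""'):
            return token              # already a long literal
        body = token[1:-1]
        if '\n' in body or '\r' in body:
            return '"""' + body + '"""'
        return token                  # ordinary single-line literal
    return _LITERAL.sub(_rewrap, ttl_text)

# Invented sample with the same kind of defect as the failing file
broken = ('<http://example.org/s> <http://example.org/text> "\r\nMr Chair\r\n" ;\n'
          '\t<http://example.org/date> "2017-03-07" .')
fixed = fix_multiline_literals(broken)
print(fixed)
```

The repaired text could then be passed to `Graph.parse(data=..., format='turtle')` instead of parsing the file path directly, although whether every file in the export parses cleanly after this step has not been checked here.
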
In [113]:
!ls {path}

0006D323-D0B5-4E22-A26E-75ABB621F58E.ttl
00133F82-F04E-4F1E-AE6C-CC5B6720E072.ttl
00329467-EFFF-4C18-B4C7-D6511C43C146.ttl
00DA951D-1DF2-4DAF-ABDE-CBFFD2958838.ttl
01A4E161-A86D-4A22-BAB2-AEC9E07A19C6.ttl
02239CBB-E462-439D-853A-98D0D8156C60.ttl
024D61E2-09CB-4055-8E88-3F842A27FBF1.ttl
02C8BCC1-6BF4-4B49-8D8E-D3B6589DB26B.ttl
02DB9D81-010A-436E-8E00-7E3F5CAC95E1.ttl
02ED21B0-A2A6-4161-919C-3B143EE887D0.ttl
02F5D2BC-B031-40E2-B918-56C126FDCED1.ttl
03556065-507A-4B4E-A824-66DD1E96E105.ttl
0372A8D2-E0DB-4FF8-889D-4981676A943C.ttl
0382E657-48C5-457A-81E3-2CF5C522FCE9.ttl
03BFC240-9C9F-43E1-A04F-DA604D52A79E.ttl
03E1F7D7-CD11-4E54-829B-4FA67DE210F3.ttl
0401820A-2896-4F56-BB87-759B20CF834D.ttl
0439FDFB-2EB6-4014-9789-7826216C2410.ttl
049A4CE6-CC57-411C-9507-1F86D6EB1024.ttl

In [118]:
!cat {path}/0006D323-D0B5-4E22-A26E-75ABB621F58E.ttl

@prefix parl: <http://data.parliament.uk/schema/parl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sesame: <http://www.openrdf.org/schema/sesame#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix fn: <http://www.w3.org/2005/xpath-functions#> .

<http://hansard.intranet.data.parliament.uk/Commons/2017-03-07/17030780000001> a parl:Proceeding ;
	dcterms:subject <http://data.parliament.uk/terms/91873> ;
	parl:text "
Mr Philip Hollobone (in the Chair) 


 
 Share this contribution   


Would those who are not staying for the next debate please be kind enough to leave quickly and quietly? I see we have some of Liverpool’s finest in the Chamber.


 
 Share this contribution  " ;
	dcterms:date "2017-03-07T00:00:00.000"^^xsd:dateTime ;
	dcterms:subject <http: