# Wikidata Project

This project explores data from Wikidata's open knowledge graph using SPARQL, then analyzes and displays it with Python and two graphing libraries, matplotlib and plotly.

There are two sections in this notebook.

The first section looks at the tonalities of musical works in Wikidata and the movements their composers are associated with. I was curious about the relationship between musical movements and tonalities and wanted to answer questions such as "What were the most popular tonalities during the Romantic music period?", "What is the tonality landscape like for works of the First Viennese School?", and "Which movements had the most pieces in D minor?". The queries and resulting graphs attempt to address these questions.

The second section displays a map of indigenous languages in what is now the United States. Please see the sections below for more details on how this is done.

## Getting Started
The requirements.txt file has been updated to include `plotly` and `ipywidgets` packages. Please note that `ipywidgets` also needs to be enabled as a Jupyter extension. This should be handled by Binder through the `postBuild` file but for reference, the following command can be run to enable the extension:

```
jupyter nbextension enable --py widgetsnbextension
```

Import the required Python packages.

In [173]:
import requests
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import random
from ipywidgets import interact

## Wikidata Query Setup

The code below is unchanged from the example.

Define the Wikidata SPARQL endpoint. Numerous query examples using this endpoint can be found on the [Wikidata examples page](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples).

In [174]:
wikidata_endpoint = 'https://query.wikidata.org/sparql'

`do_query` is a simple helper function that sets the HTTP headers and submits the request using the excellent Python [requests](https://requests.readthedocs.io/en/master/) package. It raises an exception if the endpoint responds with anything other than status code 200.

In [175]:
def do_query(query):
    rsp = requests.post(
        wikidata_endpoint,
        data=query,
        headers={
            'Content-type': 'application/sparql-query',
            'Accept': 'application/sparql-results+json',
            'User-Agent': 'https://github.com/JSTOR-Labs/sw-dev-project'
        }
    )
    if rsp.status_code != 200:
        raise Exception(f"Query failed with status code {rsp.status_code}.")
    return rsp.json()
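
The functions later in this notebook all read `['results']['bindings']` from the value returned by `do_query`. As a minimal sketch of that layout, the block below parses a hand-written response in the SPARQL JSON results format without contacting the endpoint. The names `sample_response`, `qid_from_uri`, and `labels_to_qids` are hypothetical helpers introduced only for illustration.

```python
# Hand-written sample in the SPARQL JSON results layout (not fetched from Wikidata).
sample_response = {
    "head": {"vars": ["movement", "movementLabel"]},
    "results": {
        "bindings": [
            {
                "movement": {"type": "uri", "value": "http://www.wikidata.org/entity/Q8361"},
                "movementLabel": {"type": "literal", "xml:lang": "en", "value": "baroque music"},
            },
            {
                "movement": {"type": "uri", "value": "http://www.wikidata.org/entity/Q207591"},
                "movementLabel": {"type": "literal", "xml:lang": "en", "value": "Romantic music"},
            },
        ]
    },
}


def qid_from_uri(uri):
    """Return the last path segment of an entity URI, e.g. 'Q8361'."""
    return uri.rsplit("/", 1)[-1]


def labels_to_qids(response, item_var, label_var):
    """Map each label to its QID, one entry per result row."""
    return {
        row[label_var]["value"]: qid_from_uri(row[item_var]["value"])
        for row in response["results"]["bindings"]
    }


label_to_qid = labels_to_qids(sample_response, "movement", "movementLabel")
print(label_to_qid)
```

Each row in `bindings` maps a selected variable name to a small object whose `value` key holds the URI or literal, which is why the notebook indexes with `['value']` throughout.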


# Tonality in Musical Movements



In [176]:
movements_dict = {}
movements_tonalities = {}

In [177]:
#get all movements
def movements():
    print('\nMovements with Tonalities and corresponding QID:')
    
    query = '''
        SELECT ?movement ?movementLabel
        WHERE
        {
          ?work wdt:P826 ?tonality . #find all works that have tonality
          ?work wdt:P86 ?composer . #find composer of work
          ?composer wdt:P135 ?movement . #find musical movement of composer

          SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
        }
        GROUP BY ?movement ?movementLabel
    '''
        
    results = do_query(query)['results']['bindings']
    
    for m in results:
        qid = m['movement']['value'].split('/')[-1] #split URI string to get QID
        movements_dict[m['movementLabel']['value']] = qid
        
    print(movements_dict)

In [178]:
movements()


Movements with Tonalities and corresponding QID:
{'Classical period': 'Q17723', 'Western classical music': 'Q9730', 'Romantic music': 'Q207591', 'baroque music': 'Q8361', 'First Viennese School': 'Q702207', 'Impressionism in music': 'Q837182', 'musical modernism': 'Q2426218', 'Harlem Renaissance': 'Q829895', 'Russian symbolism': 'Q1879488', 'Romanticism': 'Q37068', 'avant-garde': 'Q102932', '20th-century classical music': 'Q1338153', 'glam rock': 'Q76092', 'new wave': 'Q187760', 'art rock': 'Q217467', 'pop rock': 'Q484641', 'blue-eyed soul': 'Q885561', 'art pop': 'Q25094849', 'social contract': 'Q1326430', 'jazz': 'Q8341', 'vocal': 'Q2529757', 'classicism': 'Q170292'}


In [179]:
#get counts of tonalities for each movement
def movement_count():
    print('\nQuerying count of tonalities by movement...')

    for qid in movements_dict.values():
        query = '''
            SELECT ?tonalityLabel (COUNT(?tonalityLabel) as ?count)
                WHERE
                {
                  ?composer wdt:P135 wd:%s . #find all composers of movement
                  ?work wdt:P86 ?composer . #find works of composers
                  ?work wdt:P826 ?tonality . #find tonalities of works

                  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
                }
                GROUP BY ?tonalityLabel
                ORDER BY DESC(?count)
        ''' % qid
        
        results = do_query(query)['results']['bindings']
        
        movements_tonalities[qid] = {}
        for t in results:
            movements_tonalities[qid][t['tonalityLabel']['value']] = int(t['count']['value'])
            
    print('...Done!')
    #print(movements_tonalities)

In [180]:
movement_count()


Querying count of tonalities by movement...
...Done!
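
The nested dictionary filled by `movement_count()` can also be queried directly to answer the questions posed in the introduction, such as which movements have the most pieces in D minor. The block below is a minimal sketch that uses made-up counts in the same shape as `movements_tonalities` and `movements_dict`. The helper names and all numbers are assumptions for illustration, not results from Wikidata.

```python
# Made-up data shaped like the notebook's dictionaries:
# {movement QID: {tonality label: count}} and {movement label: QID}.
sample_tonalities = {
    "Q8361": {"D minor": 12, "C major": 9},
    "Q207591": {"D minor": 7, "E-flat major": 15},
    "Q17723": {"C major": 20},
}
sample_movements = {
    "baroque music": "Q8361",
    "Romantic music": "Q207591",
    "Classical period": "Q17723",
}


def movements_by_tonality(tonality, movements, tonalities):
    """Return (movement label, count) pairs for one tonality, highest count first."""
    ranked = [
        (label, tonalities.get(qid, {}).get(tonality, 0))
        for label, qid in movements.items()
    ]
    # Drop movements with no works in this tonality.
    ranked = [pair for pair in ranked if pair[1] > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def top_tonalities(qid, tonalities, n=3):
    """Return the n most frequent (tonality, count) pairs for one movement."""
    counts = tonalities.get(qid, {})
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:n]


print(movements_by_tonality("D minor", sample_movements, sample_tonalities))
print(top_tonalities("Q207591", sample_tonalities, n=1))
```

With the real data, the same calls would take `movements_dict` and `movements_tonalities` in place of the sample dictionaries.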


After the data is queried and parsed into dictionaries, it is plotted as a pie chart, with a drop-down menu for selecting the movement.

In [181]:
def pie(id):
    #print the raw counts for the selected movement
    print('Tonality Count')
    for tonality, count in movements_tonalities[id].items():
        print('{} {}'.format(tonality, count))

    #convert the dict views to lists before handing them to matplotlib
    plt.title('Tonalities')
    plt.pie(list(movements_tonalities[id].values()), labels=list(movements_tonalities[id].keys()), shadow=True)
    plt.axis('equal')
    plt.subplots_adjust(0,0,1,1)
    plt.show()

In [182]:
interact(pie, id=movements_dict)

interactive(children=(Dropdown(description='id', options={'Classical period': 'Q17723', 'Western classical mus…

<function __main__.pie(id)>

The stacked bar chart below displays five randomly selected movements and their respective tonality counts.

In [183]:
def plotly_bar():
    #randomly pick 5 movements
    sample = random.sample(list(movements_dict.items()), 5)

    #one bar trace per movement: tonalities on x, counts on y
    fig = go.Figure(data=[
        go.Bar(
            name=label,
            x=list(movements_tonalities[qid].keys()),
            y=list(movements_tonalities[qid].values()),
        )
        for label, qid in sample
    ])
    fig.update_layout(barmode='stack')
    fig.show()

In [185]:
plotly_bar()
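
In `plotly_bar`, each movement contributes its own list of tonalities, so the category order on the x axis depends on the order in which the query returned them. One optional refinement, sketched below under that assumption, is to build a single sorted tonality axis and fill in explicit zeros for tonalities a movement lacks. The helper `align_counts` and the sample data are hypothetical and not part of the original notebook.

```python
def align_counts(per_movement):
    """Return (shared tonality axis, {movement: counts listed in axis order})."""
    # Union of every tonality seen in any movement, in a deterministic order.
    axis = sorted({t for counts in per_movement.values() for t in counts})
    aligned = {
        name: [counts.get(t, 0) for t in axis]
        for name, counts in per_movement.items()
    }
    return axis, aligned


# Made-up counts for two placeholder movements.
sample = {
    "Movement A": {"D minor": 3, "C major": 5},
    "Movement B": {"D minor": 2, "G major": 4},
}
axis, aligned = align_counts(sample)
print(axis)
print(aligned)
```

The resulting `axis` could be passed as `x` and each aligned list as `y` when building the `go.Bar` traces.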



# Map of Languages in the Contiguous United States

In the functions below, two queries are used to gather information on indigenous languages of the United States. The first finds indigenous languages and counts the number of states each language is associated with (from 1 to 6 states). The second retrieves the coordinate locations for each language (from 1 to 4 locations).

A language is considered "indigenous to the contiguous United States" if it is indigenous to a state that is part of the contiguous United States. The state and coordinate location properties of a language are unrelated and have different counts and values. Two queries are used because I could not both group the results to count states and retrieve the distinct coordinate locations of each language in a single flat query.

My original solution was to use the nested query below to reduce redundancy, but because the request would occasionally time out, I separated the queries.

```
        SELECT ?language ?languageLabel ?coordinates ?statecount

        WHERE {
          ?language wdt:P625 ?coordinates . #get coordinates of language

          #subquery
          { SELECT ?language (COUNT(?state) as ?statecount)
                   WHERE {
                      ?language wdt:P31 wd:Q34770 . #find languages
                      ?language wdt:P2341 ?state . #that are indigenous to a state
                      ?state wdt:P361 wd:Q578170 . #a state that is a part of the contiguous united states
                     }

            GROUP BY ?language
          }
          #end of subquery

          SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }

          }
```

In [186]:
mapdata = {
    'lat': [],
    'lon':[],
    'text':[],
    'count_size':[],
    'count_color':[],
}
lang_state_dict = {}

In [187]:
def languages_state_count_query():
    print('\nQuerying Indigenous Languages and their states...')
    query = '''
        SELECT ?language (COUNT(?state) as ?statecount)
        WHERE {
            ?language wdt:P31 wd:Q34770 . #find languages
            ?language wdt:P2341 ?state . #that are indigenous to a state
            ?state wdt:P361 wd:Q578170 . #a state that is a part of the contiguous united states
        }

        GROUP BY ?language
    '''
    results = do_query(query)['results']['bindings']
    
    for x in results:
        qid = x['language']['value'].split('/')[-1]
        count = x['statecount']['value']
        lang_state_dict[qid] = count

    print('...Done!')

In [188]:
languages_state_count_query()


Querying Indigenous Languages and their states...
...Done!


In [189]:
def languages_coordinates_query():    
    print('\nQuerying Indigenous Languages and their coordinates...')
    query = '''
        SELECT DISTINCT ?language ?languageLabel ?coordinates
        WHERE {
            ?language wdt:P31 wd:Q34770 . #find languages
            ?language wdt:P2341 ?state . #that are indigenous to a state
            ?state wdt:P361 wd:Q578170 . #a state that is a part of the contiguous united states
            ?language wdt:P625 ?coordinates. # get coordinates
        
            SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        }
    '''
    results = do_query(query)['results']['bindings']
    
    for lang in results:
        #append coordinates: strip the 'Point(' prefix and ')' suffix, longitude comes first
        coord = lang['coordinates']['value'][6:-1]
        lon_coord, lat_coord = coord.split()
        
        mapdata['lat'].append(float(lat_coord))
        mapdata['lon'].append(float(lon_coord))
        
        #append text (language label)
        text = lang['languageLabel']['value']
        mapdata['text'].append(text)
        
        #get count from lang_state_dict and append
        qid = lang['language']['value'].split('/')[-1]
        count = int(lang_state_dict[qid])
        mapdata['count_size'].append(count+6)
        mapdata['count_color'].append(count*200)
        
    print('...Done!')

In [190]:
languages_coordinates_query()


Querying Indigenous Languages and their coordinates...
...Done!
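
The coordinate handling in `languages_coordinates_query` relies on the query service returning each coordinate as a literal of the form `Point(longitude latitude)` and slices off the first six and last characters. A slightly more defensive version is sketched below. The function name `parse_wkt_point` and the sample coordinates are assumptions for illustration only.

```python
def parse_wkt_point(wkt):
    """Parse a 'Point(lon lat)' literal into a (lat, lon) tuple of floats."""
    prefix, suffix = "Point(", ")"
    if not (wkt.startswith(prefix) and wkt.endswith(suffix)):
        raise ValueError(f"Not a point literal: {wkt!r}")
    # The literal lists longitude first, then latitude.
    lon, lat = wkt[len(prefix):-len(suffix)].split()
    return float(lat), float(lon)


# Illustrative value, not taken from a query result.
lat, lon = parse_wkt_point("Point(-110.5 35.25)")
print(lat, lon)
```

Checking the prefix before slicing means an unexpected literal fails with a clear error instead of silently producing a wrong coordinate.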


The number of states a language is indigenous to gives us an idea of how widely it was used, or at least how large the territory of the peoples speaking it was. The coordinate values of a language give an approximate geographical point for where it was used.

In the map, both the color and the size of each marker are derived from the number of states (in today's US geography) the language is indigenous to: `count_color` is the state count multiplied by 200, and `count_size` is the state count plus 6. The bigger the marker, the more states the language is known to be indigenous to.

In [191]:
#display map
def display_map(data):
    fig = go.Figure(data=go.Scattergeo(
        lon = data['lon'],
        lat = data['lat'],
        text = data['text'],
        mode = 'markers',
        
        marker = dict(
            size=data['count_size'],
            color=data['count_color'],
            )
        ))

    fig.update_layout(
            title = 'Languages in Contiguous United States',
            geo_scope = 'north america'
        )
    fig.show()

In [192]:
display_map(mapdata)