# Transform Special Attributes Dataset

This notebook transforms the original speaker attributes dataset: in every attribute column that stores Wikidata QIDs, each QID is replaced by its human-readable label.

In [2]:
import pandas as pd

Load the Parquet file containing the attributes of each speaker.

In [3]:
speaker_attributes = pd.read_parquet("../data/speaker_attributes.parquet")

QID_columns = ["nationality", "gender", "ethnic_group", "occupation", "party", "candidacy", "religion", "academic_degree"]

speaker_attributes

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Q161885, Q30]",[Q6581097],1395141751,,W000178,"[Q82955, Q189290, Q131512, Q1734662, Q294126, ...",[Q327591],,Q23,George Washington,"[Q698073, Q697949]",item,[Q682443]
1,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[Q145],[Q6581097],1395737157,[Q7994501],,"[Q214917, Q28389, Q6625963, Q4853732, Q1884422...",,,Q42,Douglas Adams,,item,
2,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Q31],[Q6581097],1380367296,,,"[Q36180, Q40348, Q182436, Q1265807, Q205375, Q...",,,Q1868,Paul Otlet,,item,
3,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[Q30],[Q6581097],1395142029,,,"[Q82955, Q15982858, Q18814623, Q1028181, Q1408...",[Q29468],,Q207,George W. Bush,"[Q327959, Q464075, Q3586276, Q4450587]",item,"[Q329646, Q682443, Q33203]"
4,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Q29],[Q6581097],1391704596,,,[Q1028181],,,Q297,Diego Velázquez,,item,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9055976,[Barker Howard],,[Q30],[Q6581097],1397399351,,,[Q82955],,,Q106406560,Barker B. Howard,,item,
9055977,[Charles Macomber],,[Q30],[Q6581097],1397399471,,,[Q82955],,,Q106406571,Charles H. Macomber,,item,
9055978,,[+1848-04-01T00:00:00Z],,[Q6581072],1397399751,,,,,,Q106406588,Dina David,,item,
9055979,,[+1899-03-18T00:00:00Z],,[Q6581072],1397399799,,,,,,Q106406593,Irma Dexinger,,item,


Load the Wikidata conversion file, which maps each QID to its label and description.

In [4]:
conversion_QID = pd.read_csv('../data/wikidata_labels_descriptions_quotebank.csv.bz2', compression='bz2', index_col='QID')

conversion_QID

QID,Label,Description
Q31,Belgium,country in western Europe
Q45,Portugal,country in southwestern Europe
Q75,Internet,global system of connected computer networks
Q148,People's Republic of China,sovereign state in East Asia
Q155,Brazil,country in South America
...,...,...
Q106302506,didgeridooist,musician who plays the didgeridoo
Q106341153,biochemistry teacher,teacher of biochemistry at any level
Q106368830,2018 Wigan Metropolitan Borough Council electi...,
Q106369692,2018 Wigan Metropolitan Borough Council electi...,


Functions to transform the speaker attributes

In [5]:
# Look up the label of a QID in the local conversion table.
# Returns None when the QID is not present in the table.
def get_QID_value(QID):
    try:
        val = conversion_QID.loc[QID]["Label"]
    except KeyError:
        val = None
    return val

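# Look up the English label of a QID remotely instead of in the local table.
# Note: `client` is not defined in this notebook; it is presumably a
# `wikidata.client.Client()` instance and must be created before calling this.
def get_QID_value_remote(QID):
    entity = client.get(QID, load = True)
    entity_DataFrame = pd.DataFrame.from_dict(entity.data)
    return entity_DataFrame["labels"]["en"]["value"]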

In [6]:
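def replace_QID_by_value_for_rows(df, QID_columns):
    # In the QID columns, turn each list of QIDs into a list of labels; None cells and other columns are kept as-is
    to_labels = lambda qids: [get_QID_value(qid) for qid in qids] if (qids is not None) else qids
    return df.apply(lambda col: col.map(to_labels) if (col.name in QID_columns) else col, axis=0)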
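To make the behaviour of `replace_QID_by_value_for_rows` easier to follow, here is a minimal sketch on a two-row toy table. The names `toy_labels`, `toy_speakers`, `toy_lookup` and `toy_replace` are hypothetical stand-ins introduced only for this illustration; they mirror the logic of the functions above but are not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature stand-in for the conversion table (index: QID).
toy_labels = pd.DataFrame(
    {"Label": ["United States of America", "male"]},
    index=pd.Index(["Q30", "Q6581097"], name="QID"),
)

# Hypothetical miniature stand-in for the speaker attributes.
toy_speakers = pd.DataFrame({
    "label": ["Speaker A", "Speaker B"],
    "nationality": [["Q30"], None],
    "gender": [["Q6581097"], ["Q0"]],  # Q0 is deliberately absent from toy_labels
})

def toy_lookup(qid):
    # Same idea as get_QID_value: unknown QIDs become None.
    try:
        return toy_labels.loc[qid, "Label"]
    except KeyError:
        return None

def toy_replace(df, qid_columns):
    # Convert lists of QIDs to lists of labels; leave None cells and other columns alone.
    to_labels = lambda qids: [toy_lookup(q) for q in qids] if qids is not None else qids
    return df.apply(lambda col: col.map(to_labels) if col.name in qid_columns else col, axis=0)

toy_result = toy_replace(toy_speakers, ["nationality", "gender"])
print(toy_result)
```

Three cases are visible in the result: a known QID is replaced by its label, an unknown QID becomes `None` inside its list, and a cell that was missing to begin with stays missing.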


Run the replacement on the first 100 rows as an example.

In [7]:
result = replace_QID_by_value_for_rows(speaker_attributes[:100], QID_columns)

result

Unnamed: 0,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion
0,"[Washington, President Washington, G. Washingt...",[+1732-02-22T00:00:00Z],"[Great Britain, United States of America]",[male],1395141751,,W000178,"[politician, military officer, farmer, cartogr...",[independent politician],,Q23,George Washington,"[1792 United States presidential election, 178...",item,[Episcopal Church]
1,"[Douglas Noel Adams, Douglas Noël Adams, Dougl...",[+1952-03-11T00:00:00Z],[United Kingdom],[male],1395737157,[White British],,"[playwright, screenwriter, novelist, children'...",,,Q42,Douglas Adams,,item,
2,"[Paul Marie Ghislain Otlet, Paul Marie Otlet]",[+1868-08-23T00:00:00Z],[Belgium],[male],1380367296,,,"[writer, lawyer, librarian, information scient...",,,Q1868,Paul Otlet,,item,
3,"[George Walker Bush, Bush Jr., Dubya, GWB, Bus...",[+1946-07-06T00:00:00Z],[United States of America],[male],1395142029,,,"[politician, motivational speaker, autobiograp...",[Republican Party],,Q207,George W. Bush,"[2000 United States presidential election, 200...",item,"[United Methodist Church, Episcopal Church, Me..."
4,"[Velázquez, Diego Rodríguez de Silva y Velázqu...",[+1599-06-06T00:00:00Z],[Spain],[male],1391704596,,,[painter],,,Q297,Diego Velázquez,,item,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,,[+1155-01-01T00:00:00Z],[France],[male],1336514229,,,"[singer, poet, composer, trouvère]",,,Q5170,Blondel de Nesle,,item,
96,"[Gualterus de Castellione, Gualterus de Insula...",[+1135-01-01T00:00:00Z],[France],[male],1390122732,,,"[writer, theologian, poet, Goliard]",,,Q5198,Walter of Châtillon,,item,[Catholic Church]
97,"[François-Edouard Picot, Francois Eduard Picot...","[+1786-10-10T00:00:00Z, +1786-10-17T00:00:00Z]",[France],[male],1382414258,,,[painter],,,Q5233,François-Édouard Picot,,item,
98,,[+1769-03-03T00:00:00Z],[France],[male],1340507797,,,[puppeteer],,,Q5280,Laurent Mourguet,,item,


We then apply it to the whole speaker attributes file.

In [None]:
speaker_attributes_updated = replace_QID_by_value_for_rows(speaker_attributes, QID_columns)

We cannot show the output of this cell because it takes a long time to execute.
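One plausible contributor to the long run time is that every single QID triggers its own `.loc` lookup on the conversion DataFrame. A possible speed-up, sketched here on toy data and not benchmarked against the real file, is to turn the `Label` column into a plain Python dict once and use `dict.get` for each QID. The names `toy_conversion`, `label_by_qid` and `qids_to_labels` are hypothetical, and the sketch assumes QIDs are unique in the index.

```python
import pandas as pd

# Hypothetical stand-in for conversion_QID (index: QID, column: Label).
toy_conversion = pd.DataFrame(
    {"Label": ["Belgium", "Portugal"]},
    index=pd.Index(["Q31", "Q45"], name="QID"),
)

# Build the lookup table once. dict.get returns None for unknown QIDs,
# which is the same fallback that get_QID_value uses.
label_by_qid = toy_conversion["Label"].to_dict()

def qids_to_labels(qids):
    # Missing cells (None) are passed through unchanged.
    if qids is None:
        return None
    return [label_by_qid.get(qid) for qid in qids]

# A toy column with a known pair, a missing cell, and an unknown QID.
toy_column = pd.Series([["Q31", "Q45"], None, ["Q999"]])
converted = toy_column.map(qids_to_labels)
print(converted.tolist())
```

In the notebook itself, the same idea would amount to building the dict from `conversion_QID["Label"]` and mapping `qids_to_labels` over each column in `QID_columns`.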

Save file in data folder

In [None]:
speaker_attributes_updated.to_parquet('../data/speaker_attributes_updated.parquet')