# Using Pandas dataframes for relational database functionality


Some basic terminology:
- Python is the programming *language* used in Jupyter notebooks.
- Jupyter Notebook is the cell-based *app* in which we write and run Python code.
- Pandas is a *library* of functions designed to work especially with tabular data.

In [1]:
# first import pandas and numpy to make sure the relevant functionality is available to us

import numpy as np
import pandas as pd

from pandas import DataFrame, Series

Now we need to *read in* some data to work with. The base data is usually in a CSV file on your hard drive (though you can read in other kinds of files, and it doesn't necessarily need to be stored on your personal hard drive). As of this cell, there is no data stored in local memory. We need to tell Python where to find our data and to save it as a local variable in this notebook. To do that, we need to know its *path*, i.e. its hierarchical location in the computer file system.

Remember, two dots (`..`) in a path mean "jump up one directory".
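To see how `..` behaves without touching any real files, here is a small sketch using Python's standard `posixpath` module. The folder names are invented for illustration and do not refer to anything on my machine.

```python
import posixpath

# Hypothetical locations, chosen only for illustration
notebook_dir = "/home/user/notebooks/week3"
relative_path = "../../data/glossary.csv"

# Join the two, then collapse each ".." together with the folder before it
resolved = posixpath.normpath(posixpath.join(notebook_dir, relative_path))
print(resolved)  # /home/user/data/glossary.csv
```

Each `..` cancels one folder: the first steps out of `week3`, the second out of `notebooks`.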

To start, I am going to read in data for my own personalized dictionary of terminology useful and specific to my research.

In [20]:
# read in glossary data

glossary = pd.read_csv(
    '../../DropBox/Active_Directories/Digital_Humanities/Datasets/exported_database_data/glossary.csv',
    names=['UID', 'Term', 'Emic_Term', 'Translit', 'Scope', 'Tags'],
)

We have now saved our CSV data as a *dataframe* variable called `glossary`. We can look at the entire thing simply by calling `glossary`. However, quite often our tables will be quite large, which means it might be easier to get some descriptive information about what we are dealing with.

In [29]:
glossary.sample(5)

#glossary.describe(include='all')

Unnamed: 0,UID,Term,Emic_Term,Translit,Scope,Tags
216,219,suragh,سراغ,surāgh,,police
54,56,tarikana,تریکانه,,transoxania,taxesinheritance
83,85,turra,طرة,,arab_world,arab_world
29,31,ramz,رمز,ramz,indic,royal
205,208,tugh,,,transoxania,ritual
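Besides `sample`, a few other calls give a quick overview of a dataframe. The sketch below uses a made-up miniature table rather than the real glossary, so the numbers are only illustrative.

```python
import pandas as pd

# A made-up miniature glossary, not the real data
toy = pd.DataFrame({
    "Term": ["tamgha", "mahzar", "kharaj"],
    "Scope": ["transoxania", "islam", None],
})

n_rows, n_cols = toy.shape          # (number of rows, number of columns)
column_names = list(toy.columns)    # the column headers
missing = toy.isna().sum()          # number of empty cells per column

print(n_rows, n_cols)
print(column_names)
print(missing)
```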


This is all well and good, but so far not really much better than just opening the CSV file in Excel. And, wait a second: where is the column for the most important piece of information in a dictionary, i.e. the definition?

Many words have different definitions depending on the time period and context. This tells you that, in relational database terms, we are dealing with a *one-to-many* relationship (one term, many definitions), which cannot be captured cleanly by a single table.

What we need is a separate table of definitions with a *join key* that corresponds to the unique id (UID) in the glossary table. So let's read in another table.
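Before loading the real file, a toy version may help show the shape of this arrangement. The rows below are invented; only the column names `UID`, `JoinKey`, and `Def` follow the tables used in this notebook.

```python
import pandas as pd

# Made-up rows for illustration only
terms = pd.DataFrame({"UID": [1, 2], "Term": ["tamgha", "mahzar"]})
defs = pd.DataFrame({
    "UID": [10, 11, 12],
    "JoinKey": [1, 1, 2],   # points back to terms["UID"]
    "Def": ["a tax", "a seal", "a legal document"],
})

# Count how many definitions point at each term
defs_per_term = defs["JoinKey"].value_counts()
print(defs_per_term)
```

Each `UID` appears once in `terms`, while the same `JoinKey` value can appear many times in `defs`: that is the "one" side and the "many" side.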

In [24]:
# read in definition data

definitions = pd.read_csv(
    '../../DropBox/Active_Directories/Digital_Humanities/Datasets/exported_database_data/definitions.csv',
    names=['UID', 'JoinKey', 'Def', 'Spec', 'Source'],
)



In [26]:
definitions.sample(5)

Unnamed: 0,UID,JoinKey,Def,Spec,Source
117,167,108,"""A daily allowance to pensioners of any kind.""",,508.0
476,614,195,"""A word of foreign importation, it came to mea...",mughals,983.0
500,640,260,"""A petition for a grant of a mansab submitted ...",mughals,983.0
37,54,49,اصطبل، بارها در سال 1842 به معناي جاي اعدام به...,ferghana,283.0
643,815,450,"суст, беҳол, бемадор, заиф",,1065.0


So now we have all of the raw data we need for a nice dictionary that offers multiple definitions for a single term, along with some additional data about the context and source of each definition. But, currently, the computer has no idea that these two dataframes are related to one another. We need to change that.


In [39]:
merge = pd.merge(glossary, definitions, left_on="UID", right_on="JoinKey")


In [34]:
merge.sample(5)

Unnamed: 0,UID_x,Term,Emic_Term,Translit,Scope,Tags,UID_y,JoinKey,Def,Spec,Source
471,310,rikab-i zafarrikab-i humayun,رکاب ظفررکاب همایون,rikāb-i ẓafar,,,508,310,,,237.0
176,98,actum,,,,terminologydiplomatics,159,98,"""The date given on a document might be either ...",,
597,408,taqrir,تقریر,,,,708,408,"""Statement or narrative of an individual case,...",,992.0
302,181,sayl,سیل,,islam,ritual,282,181,"""Sayil (lit. ‘flowing’) is a term used to deno...",transoxania,
34,14,kharaj,خراج,kharāj,islam,agriculturetaxes,86,14,"""харадж 10-ю часть с жатвы""",tashkent,286.0


Already this is pretty useful: a single table that combines information from both of our tables, but much more efficiently and flexibly than if we had inputted the data in this format to begin with.
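One detail worth knowing is that `pd.merge` performs an *inner* join by default, so a term with no definition drops out of the result. The sketch below, on invented rows, compares that default with a left join; `how`, `validate`, and `indicator` are standard `pd.merge` parameters, but whether you want them for the real glossary is a judgment call.

```python
import pandas as pd

# Made-up rows for illustration only
terms = pd.DataFrame({"UID": [1, 2, 3], "Term": ["tamgha", "mahzar", "tugh"]})
defs = pd.DataFrame({
    "UID": [10, 11, 12],
    "JoinKey": [1, 1, 2],
    "Def": ["a tax", "a seal", "a legal document"],
})

# Default (inner) join: only terms with at least one definition survive
inner = pd.merge(terms, defs, left_on="UID", right_on="JoinKey")

# Left join: every term is kept. validate raises an error if a UID in
# `terms` is duplicated; indicator adds a "_merge" column saying where
# each row came from.
left = pd.merge(terms, defs, left_on="UID", right_on="JoinKey",
                how="left", validate="one_to_many", indicator=True)

print(inner[["Term", "Def"]])
print(left[["Term", "Def", "_merge"]])
```

Because both tables have a column called `UID`, pandas renames them `UID_x` (left table) and `UID_y` (right table) in the result.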

Now let's clean up the table a bit by eliminating some of the duplicate columns.

In [40]:
# drop duplicate columns
merge = merge.drop('UID_x', axis=1)

# rename columns
merge.rename(columns = {'UID_y':'UID'}, inplace = True)
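As an alternative to `inplace=True`, the two steps can be chained, since `drop` and `rename` each return a new dataframe. This is a sketch on a made-up stand-in for the merged table, not a required change.

```python
import pandas as pd

# A made-up stand-in for the merged table
merged = pd.DataFrame({
    "UID_x": [1, 1],
    "Term": ["tamgha", "tamgha"],
    "UID_y": [10, 11],
    "JoinKey": [1, 1],
})

# Each method returns a new dataframe, so the calls can be chained
tidy = (
    merged
    .drop(columns="UID_x")
    .rename(columns={"UID_y": "UID"})
)

print(list(tidy.columns))    # ['Term', 'UID', 'JoinKey']
print(list(merged.columns))  # the original table is unchanged
```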


In [42]:
#merge.sample(5)

There is a separate table with information about the Source a given definition came from: we can add that later using the same `pd.merge` function we used to join the terms to their definitions.
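As a preview, that later step might look roughly like the sketch below. The sources table here is entirely hypothetical: the column names `Source_ID` and `Title` and all of the rows are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows; the real sources table and its column names may differ
defs = pd.DataFrame({
    "Def": ["a tax", "a seal", "a legal document"],
    "Source": [983.0, 508.0, None],   # None becomes NaN: no source recorded
})
sources = pd.DataFrame({
    "Source_ID": [508.0, 983.0],
    "Title": ["Hypothetical Dictionary", "Hypothetical Monograph"],
})

# how="left" keeps definitions that have no recorded source
with_sources = pd.merge(defs, sources, left_on="Source", right_on="Source_ID",
                        how="left")
print(with_sources[["Def", "Title"]])
```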

But for now let's implement some basic sorting and searching functionality. Say we want to search for all of the different definitions of the word 'mahzar':

In [43]:
query_mask = merge["Term"].str.contains("mahzar", na=False)

In [45]:
results = merge[query_mask]


In [46]:
results

Unnamed: 0,Term,Emic_Term,Translit,Scope,Tags,UID,JoinKey,Def,Spec,Source
285,mahzar,محضر,,islam,legaldocument_type,255,163,"""The mahdar refers to any of two different typ...",,
286,mahzar,محضر,,islam,legaldocument_type,349,163,"""Most of them relate to grievances of an area,...",mughals,
287,mahzar,محضر,,islam,legaldocument_type,632,163,"""Before the matter was finally settled in favo...",mughals,983.0
288,mahzar,محضر,,islam,legaldocument_type,707,163,"""Mahzar Nama (Collective Testimony): A Mahzar ...",india,992.0
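What happened in those two steps? `str.contains` produced a *mask*, a column of True/False values with one entry per row, and indexing the dataframe with that mask kept only the True rows. A small sketch on made-up terms makes the mask visible:

```python
import pandas as pd

# Made-up terms, including one empty cell
terms = pd.Series(["mahzar", "tamgha", None, "mahzar-nama"])

# str.contains tests every row; na=False treats empty cells as "no match"
mask = terms.str.contains("mahzar", na=False)
print(mask.tolist())         # [True, False, False, True]

# Indexing with the mask keeps only the rows marked True
print(terms[mask].tolist())  # ['mahzar', 'mahzar-nama']
```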


In [51]:
# change display options so we can see the full definition
pd.set_option('display.max_colwidth', None)

In [50]:
results

Unnamed: 0,Term,Emic_Term,Translit,Scope,Tags,UID,JoinKey,Def,Spec,Source
285,mahzar,محضر,,islam,legaldocument_type,255,163,"""The mahdar refers to any of two different types of document: (1) a statement made by witnesses to the effect that someone has, for instance, sold, bought, pledged or acknowledged something. 'It consists of that upon which the judge's decision is based'; (2) a record of the two parties' actions and claims taking place in the presence of the qadd, who must sign it before witnesses in order for it to be complete.""",,
286,mahzar,محضر,,islam,legaldocument_type,349,163,"""Most of them relate to grievances of an area, especially, restoratin of madad-i-maash grants or a request from the grantees for non-interference of revenue officials in their grants.""",mughals,
287,mahzar,محضر,,islam,legaldocument_type,632,163,"""Before the matter was finally settled in favour of the heir, in view of some dispute that might crop up in regard to his title, the jagirdar or the nuvvab of the pargana executed a document attested by the qadi and other witnesses to bear testimony to the confirmed possession of the old grant in favour of the recipient of the relevant documents. It was just like a public recognition of the possession by the heir thereof... The details were recorded in the form of a mahdar-nama which was attested by the seal of the qadi and signed by the witnesses.""",mughals,983.0
288,mahzar,محضر,,islam,legaldocument_type,707,163,"""Mahzar Nama (Collective Testimony): A Mahzar Nama is similar to an affidavit. It consists of a statement of facts regarding a property, a person or certain events, solicited by specific persons, and attested to by their professional and social associates. Such a document was usually prepared in the context of a legal dispute, sometimes to replace missing or damaged legal deeds. It was nearly always sealed and verified by an Islamic judge, a qazi. Mughal Mahzar Namas often began with a quote from the Quran exhorting witnesses not to conceal testimony. They were different from mahzars in the Marathi-writing parts of the country, which were records of decisions made by judicial assemblies. In Sunni Hanafi legal texts such as the imperially sponsored Fatawa-yi Alamgiri, mahzars are similarly described as documents that recorded the proceedings of a legal dispute.""",india,992.0
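`pd.set_option` changes the display for the rest of the session. If you only want full-width columns for a single output, one option is `pd.option_context`, which restores the previous setting afterwards. A minimal sketch:

```python
import pandas as pd

before = pd.get_option("display.max_colwidth")

# Inside the block, cell contents are shown in full
with pd.option_context("display.max_colwidth", None):
    inside = pd.get_option("display.max_colwidth")

# Outside the block, the earlier setting is back
after = pd.get_option("display.max_colwidth")
print(before, inside, after)
```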


Note that `str.contains` treats its search string as a *regular expression* by default, a concept we have already encountered. So say you remember that there was a word that started with a 't' and ended with a 'gha', but you weren't sure what came in between. The regular expression for that would be `t.*gha`.
- `.` means 'any single character will do'
- `*` means 'zero or more repetitions of whatever came just before it'; so `.*` stands for any run of characters, up to the next part of the pattern (which is `gha` in this case)
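Before running the search on the real table, the two symbols can be tried out with Python's built-in `re` module, which uses the same pattern syntax. The four terms below all appear in the glossary samples shown earlier.

```python
import re

terms = ["tamgha", "istighasa", "tugh", "turra"]

# "t.gha": exactly one character between "t" and "gha"
one_char = [t for t in terms if re.search("t.gha", t)]

# "t.*gha": any number of characters (including none) in between
any_chars = [t for t in terms if re.search("t.*gha", t)]

print(one_char)   # ['istighasa']
print(any_chars)  # ['tamgha', 'istighasa']
```

Note that `re.search`, like `str.contains`, looks for the pattern anywhere inside the word, which is why `istighasa` matches even though it does not begin with 't'.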

In [58]:
query_mask = merge["Term"].str.contains("t.*gha", na=False)
results = merge[query_mask]
results

Unnamed: 0,Term,Emic_Term,Translit,Scope,Tags,UID,JoinKey,Def,Spec,Source
225,istighasa,,,indic,legal,192,128,"""Demanding justice, preferring a complaint.""",,508.0
226,istighasa,,,indic,legal,644,128,"Arabic term means ""appeal for aid.""",,
315,tamgha,تمغا,,transoxania,taxessignatureseal,300,195,﻿Customs duties and commercial taxes; frequently used to denote all taxes contrary to the shari'a and therefore was often the target of pious Muslims seeking to abrogate such non-canonical levies.,,
316,tamgha,تمغا,,transoxania,taxessignatureseal,301,195,"﻿An abstract seal or stamp used by Eurasian nomadic peoples and by cultures influenced by them. The tamga was normally the emblem of a particular tribe, clan or family. They were common among the Eurasian nomads throughout Classical Antiquity and the Middle Ages (including Alans, Mongols, Sarmatians, Scythians and Turkic peoples).",,
317,tamgha,تمغا,,transoxania,taxessignatureseal,614,195,"""A word of foreign importation, it came to mean in India as a grant under the red seal of the Emperor, or to which red ink was applied... Such land assignments were reserved for an officer who applied for a grant as a state pension in his own home village (ba-jihat-i-vaṭan) in which he was born or desired to settle down.""",mughals,983.0
318,tamgha,تمغا,,transoxania,taxessignatureseal,862,195,"example of word being used in Russian to indicate a mark (in this case an 'X') in place of a signature at the end of a document : ""К сему прошению простел Умир Матьханов приложил тамгу X"". i.e. the document was written by a scribe fluent in Russian, but the petitioner was illiterate.",bukhara,1263.0
319,tamgha,تمغا,,transoxania,taxessignatureseal,878,195,Turki formula for using a symbol in place of a signature: شول سوزیمنی راست لیغینه قولوم قیوب تمخم باسدوم,russian_turkestan,1135.0


Ah, but we didn't tell it that the term we are looking for needs to *start* with `t`, so `istighasa` was matched too. We can fix that using the regex symbol `^`, which anchors the match to the beginning of the string.

In [59]:
query_mask = merge["Term"].str.contains("^t.*gha", na=False)
results = merge[query_mask]
results

Unnamed: 0,Term,Emic_Term,Translit,Scope,Tags,UID,JoinKey,Def,Spec,Source
315,tamgha,تمغا,,transoxania,taxessignatureseal,300,195,﻿Customs duties and commercial taxes; frequently used to denote all taxes contrary to the shari'a and therefore was often the target of pious Muslims seeking to abrogate such non-canonical levies.,,
316,tamgha,تمغا,,transoxania,taxessignatureseal,301,195,"﻿An abstract seal or stamp used by Eurasian nomadic peoples and by cultures influenced by them. The tamga was normally the emblem of a particular tribe, clan or family. They were common among the Eurasian nomads throughout Classical Antiquity and the Middle Ages (including Alans, Mongols, Sarmatians, Scythians and Turkic peoples).",,
317,tamgha,تمغا,,transoxania,taxessignatureseal,614,195,"""A word of foreign importation, it came to mean in India as a grant under the red seal of the Emperor, or to which red ink was applied... Such land assignments were reserved for an officer who applied for a grant as a state pension in his own home village (ba-jihat-i-vaṭan) in which he was born or desired to settle down.""",mughals,983.0
318,tamgha,تمغا,,transoxania,taxessignatureseal,862,195,"example of word being used in Russian to indicate a mark (in this case an 'X') in place of a signature at the end of a document : ""К сему прошению простел Умир Матьханов приложил тамгу X"". i.e. the document was written by a scribe fluent in Russian, but the petitioner was illiterate.",bukhara,1263.0
319,tamgha,تمغا,,transoxania,taxessignatureseal,878,195,Turki formula for using a symbol in place of a signature: شول سوزیمنی راست لیغینه قولوم قیوب تمخم باسدوم,russian_turkestan,1135.0


### Where to go from here?

One of the wonderful things about open data formats is that you are not limited to any single application. A minimalist use case of Pandas would be to use its joining functionality to produce different merges as your needs dictate and then export those merged tables back to CSV to use in other programs (e.g., Excel). Or you can run your searches within the Jupyter notebook, as we did above. Or you could write functions to create interactive searches on your command line, which is easier than it sounds.
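As a taste of that last option, the searches above can be wrapped in a small function. The name `search_terms` and the toy table are my own inventions for this sketch; `case=False` is a standard `str.contains` option that makes the search ignore capitalization.

```python
import pandas as pd

def search_terms(table, pattern, column="Term"):
    """Return the rows of `table` whose `column` matches the regex `pattern`."""
    mask = table[column].str.contains(pattern, case=False, na=False)
    return table[mask]

# Made-up stand-in for the merged table
toy = pd.DataFrame({
    "Term": ["tamgha", "istighasa", "mahzar"],
    "Def": ["a seal", "an appeal", "a legal document"],
})

hits = search_terms(toy, "^T.*gha")
print(hits)

# For an interactive version, one option is to ask for the pattern:
# hits = search_terms(toy, input("Search for: "))
```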

As we will see in future weeks (if we have time), Python has other libraries that can accomplish other tasks. For instance, if you were tracking social network relationships in your database, you could use the above methods to get your data in the right format and subset that you need, and then visualize those relationships graphically right within your Jupyter notebook. Ditto for geospatial analysis.

But, for now, let's get this merged table back into a CSV file just so that we know how to do so (it is currently saved in a local dataframe variable within this instance of Jupyter notebook). *Caution*: you would not want to modify data after the export of the merged table; it is only really for reference and viewing. Your 'base' data should be those original CSV files. Remember, you can rerun this entire routine in an instant since it is code, so there is no real drawback to having separate tables for analysis and data entry.



In [60]:
merge.to_csv("exported_data/sample.csv")
# note that tags on separate lines get messed up in export
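To check what an export does to a cell containing a line break, a round trip through an in-memory file is a cheap test. The sketch below uses invented rows and `io.StringIO` in place of a real file; it only shows how pandas itself handles such cells, so if the tags look wrong in another program, the cause may lie in how that program reads the file (an assumption on my part, not something verified here).

```python
import io
import pandas as pd

# Made-up table with a line break inside one cell
toy = pd.DataFrame({
    "Term": ["mahzar", "tamgha"],
    "Tags": ["legal\ndocument_type", "taxes"],
})

buffer = io.StringIO()           # an in-memory stand-in for a file
toy.to_csv(buffer, index=False)  # index=False leaves out the row numbers
buffer.seek(0)

# Read the exported text back in and compare with the original
round_trip = pd.read_csv(buffer)
print(round_trip["Tags"].tolist())
```

Passing `index=False` is optional; without it, the row numbers are written as an extra unnamed first column.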