# Using Zotero2Graph

This notebook explains the basics of the zotero2graph script. It was developed using the stylometry bibliography curated by Christof Schöch, which is published here:

https://www.zotero.org/groups/643516/stylometry_bibliography


First we have to add the folder that contains the script zotero2graph.py to the Python path. Since this notebook and the script are in the same folder on my computer, the path is just "./". In your case it could be something like "/home/juancarlos/myprogramms/zotero2graph".

In [1]:
import sys
import os
sys.path.append(os.path.abspath("./"))
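If you run the cell above several times, the same folder is appended to the path each time. A minimal sketch of a more defensive variant follows; the helper name `add_to_path` is hypothetical and not part of zotero2graph:

```python
import os
import sys


def add_to_path(folder):
    """Append the absolute form of `folder` to sys.path unless it is already there."""
    abs_folder = os.path.abspath(folder)
    if abs_folder not in sys.path:
        sys.path.append(abs_folder)
    return abs_folder


# "./" is the folder used in this notebook; replace it with your own folder.
script_dir = add_to_path("./")
print(script_dir)
```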


Let's now import the script:

In [2]:
import zotero2graph

import pandas as pd
import glob
from collections import Counter
import re
import os


# Functions
Now let's try the different functions one by one (if you are not interested in that and only want to use the script, you can jump to the header *Use* at the bottom of this notebook). First we need a folder on our computer that contains two subfolders: one for the input data and one for the output.
For the input, we need the CSV file that Zotero creates when we click on the "Export library" option and choose the CSV format. It should create a file like "Stylometry Bibliography.csv". We place this file in the input subfolder:

In [3]:
wdir = "./"
data = "data/"
results = "results/"

doc = wdir+data+"Stylometry Bibliography.csv"
file_name = os.path.splitext(os.path.split(doc)[1])[0]
print(doc)

./data/Stylometry Bibliography.csv
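If the two subfolders do not exist yet, they can be created from Python. This is only a sketch: the helper `ensure_folders` is hypothetical, and I am assuming here that the script does not create the folders on its own.

```python
import os


def ensure_folders(wdir, data, results):
    """Create the input and output subfolders under `wdir` if they are missing."""
    paths = [os.path.join(wdir, data), os.path.join(wdir, results)]
    for path in paths:
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return paths


# Usage with the variables defined above:
# ensure_folders(wdir, data, results)
```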


Now we load this file and print some of the results:

In [4]:
authors_articles = zotero2graph.load_bibliography(doc)
print(authors_articles[1:5])

[['Arasu, Arvind', ' Cho, Junghoo', ' Garcia-Molina, Hector', ' Paepcke, Andreas', ' Raghavan, Sriram'], ['Green, T. R. G.'], ['Lafon, Pierre'], ['Kim, Yunhyong', ' Ross, Seamus']]


As we see, the bibliography has already become a list of lists. Each sub-list is an article, and every article contains one string per author. In this case, we are seeing four different articles: the first with 5 authors (from "Arasu, Arvind" to "Raghavan, Sriram"), two articles each written by a single author (Green and Lafon) and the last one by two authors (Kim and Ross). Note that some names still carry a leading space. Next, we want to clean up the names of the authors and keep only the surname and the initial letter of the first name:

In [5]:
authors_articles = zotero2graph.clean_authors_articles(authors_articles, file_name, wdir, results)
authors_articles[1:5]

[['Arasu_A', 'Cho_J', 'Garcia-Molina_H', 'Paepcke_A', 'Raghavan_S'],
 ['Green_T'],
 ['Lafon_P'],
 ['Kim_Y', 'Ross_S']]
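To make the two steps above more concrete, here is a minimal sketch of how an author field could be split and shortened. It is not the code of zotero2graph: the functions `split_authors` and `clean_name` are hypothetical, and I am assuming that the author field separates names with ";", which matches the leading spaces in the output of `load_bibliography`.

```python
def split_authors(author_field, separator=";"):
    """Split the author field of one article into individual names."""
    return [name for name in author_field.split(separator) if name.strip()]


def clean_name(name):
    """Turn 'Surname, First names' into 'Surname_F' (F = first initial)."""
    surname, _, first_names = name.strip().partition(",")
    surname = surname.strip()
    initial = first_names.strip()[:1]
    # Names without a first name are returned as the bare surname.
    return f"{surname}_{initial}" if initial else surname


article = split_authors("Kim, Yunhyong; Ross, Seamus")
print(article)                                # ['Kim, Yunhyong', ' Ross, Seamus']
print([clean_name(name) for name in article]) # ['Kim_Y', 'Ross_S']
```

The sketch leaves surnames that contain spaces untouched; how the real script treats them is not covered here.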

Once the names are in a consistent format, we create the list of nodes:


In [6]:
authors = zotero2graph.create_authors(authors_articles, file_name, wdir, results)
authors.iloc[-5:,:]

Unnamed: 0,Id,Weight,Label
961,Smith_M,35,Smith_M
1122,Holmes_D,35,Holmes_D
2083,Craig_H,59,Craig_H
957,Burrows_J,60,Burrows_J
540,Juola_P,97,Juola_P


These 5 authors have the most papers in this bibliography. The column names (Id, Weight, Label) are chosen so that Gephi recognizes them.
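Conceptually, the node list is just a count of how often each author appears across all articles. A rough sketch of that idea with the standard library follows; `count_authors` is a hypothetical helper and the toy articles are made up for illustration, so this may differ from what `create_authors` really does:

```python
from collections import Counter


def count_authors(authors_articles):
    """Return one row per author with the number of articles they appear in."""
    counts = Counter(
        author for article in authors_articles for author in article
    )
    rows = [
        {"Id": author, "Weight": n, "Label": author}
        for author, n in counts.items()
    ]
    # Sort ascending, so the most productive authors come last.
    return sorted(rows, key=lambda row: row["Weight"])


# Toy input, not taken from the real bibliography.
toy_articles = [["Kim_Y", "Ross_S"], ["Kim_Y"], ["Lafon_P"]]
for row in count_authors(toy_articles):
    print(row)
```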

After the nodes, we need something else: edges!

In [7]:
edges = zotero2graph.create_edges(authors_articles, file_name, wdir,results)
edges.iloc[-10:,:]

Unnamed: 0,Source,Target,Weight,Type
3533,Milic_L,Milic_L,15,Undirected
3225,Foster_D,Foster_D,16,Undirected
3278,Koppel_M,Schler_J,18,Undirected
2993,Merriam_T,Merriam_T,20,Undirected
3651,Hoover_D,Hoover_D,20,Undirected
442,Rudman_J,Rudman_J,20,Undirected
3358,Smith_M,Smith_M,30,Undirected
2422,Craig_H,Craig_H,31,Undirected
73,Burrows_J,Burrows_J,36,Undirected
4076,Juola_P,Juola_P,52,Undirected


What does it mean that Juola_P is both the source and the target of an edge with weight 52? Well, I decided to encode articles with a single author as a self-loop. That means that Juola_P wrote 52 articles as a single author in this bibliography. Next comes Burrows_J with 36 single-author articles. So the most frequent real co-authorship between two different people is the one between Koppel_M and Schler_J (18 articles as co-authors).
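The self-loop convention can be sketched in a few lines: a single-author article adds weight to an edge from the author to themselves, and a multi-author article adds weight to every pair of its authors. The function `count_edges` and the toy articles are hypothetical and only illustrate the idea, not the actual implementation of `create_edges`:

```python
from collections import Counter
from itertools import combinations


def count_edges(authors_articles):
    """Count undirected co-authorship edges; single authors become self-loops."""
    counts = Counter()
    for article in authors_articles:
        # Sorting makes (A, B) and (B, A) the same undirected edge.
        authors = sorted(set(article))
        if len(authors) == 1:
            counts[(authors[0], authors[0])] += 1
        else:
            counts.update(combinations(authors, 2))
    return counts


# Toy input, not taken from the real bibliography.
toy_articles = [["Koppel_M", "Schler_J"], ["Schler_J", "Koppel_M"], ["Juola_P"]]
for (source, target), weight in count_edges(toy_articles).items():
    print(source, target, weight, "Undirected")
```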

Why encode a single-author article as a self-loop if the nodes already hold the number of articles published? Because Gephi doesn't allow node attributes to be used for some functions, for example changing the size of the nodes (it does for others, such as the color of the node). The script has an option to change that, and you can also filter out the self-loops in Gephi.
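If you would rather drop the self-loops yourself before loading the edges into Gephi, the filter is a simple comparison of source and target. The sketch below uses plain dictionaries with three rows copied from the output above; `drop_self_loops` is a hypothetical helper, not a function of the script:

```python
def drop_self_loops(edge_rows):
    """Keep only the edges that connect two different authors."""
    return [row for row in edge_rows if row["Source"] != row["Target"]]


edge_rows = [
    {"Source": "Koppel_M", "Target": "Schler_J", "Weight": 18, "Type": "Undirected"},
    {"Source": "Burrows_J", "Target": "Burrows_J", "Weight": 36, "Type": "Undirected"},
    {"Source": "Juola_P", "Target": "Juola_P", "Weight": 52, "Type": "Undirected"},
]
print(drop_self_loops(edge_rows))
```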

And after that, you are ready to open Gephi and load your two files as nodes and edges.

Now let's see how to use the script in a single call.

# Use
We can directly use the function *main* and let the program create the two files (nodes and edges). We just have to pass where the data is and where we want the results:

In [8]:
zotero2graph.main(
    wdir="./",
    data="data/",
    results="results/",
)

./data/Stylometry Bibliography.csv
done


After that, the "results" subfolder will contain two new files:
* the one with the nodes (authors): Stylometry Bibliography_authors.csv
* the one with the edges (co-authorship): Stylometry Bibliography_edges.csv
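Before switching to Gephi, you may want to check that both files were really written. A small sketch, assuming the naming pattern shown in the list above (input file name plus "_authors.csv" and "_edges.csv"); the helpers `expected_outputs` and `missing_outputs` are hypothetical:

```python
import os


def expected_outputs(wdir, results, file_name):
    """Build the paths of the two result files for a given input file name."""
    folder = os.path.join(wdir, results)
    return [
        os.path.join(folder, f"{file_name}_{suffix}.csv")
        for suffix in ("authors", "edges")
    ]


def missing_outputs(wdir, results, file_name):
    """Return the expected result files that do not exist on disk."""
    return [
        path
        for path in expected_outputs(wdir, results, file_name)
        if not os.path.isfile(path)
    ]


# An empty list means both files are in place.
print(missing_outputs("./", "results/", "Stylometry Bibliography"))
```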

Ready to load in Gephi. Have fun!