# Using Stylo in Python


## Why would you do that?

For a couple of years I have been using stylometric methods to analyse texts, mainly literary ones. I learned about the great stylometric tool Stylo (written in R) at the European Summer School of Digital Humanities in Leipzig from two of the developers: Maciej Eder and Jan Rybicki.

Some months after that I started my PhD at the University of Würzburg at the Computerphilologie Professorship (held by Prof. Jannidis). I was told that I had to learn Python because that was the *programming mother tongue* of the department. So I did. Since then many of my projects have been a mix of very basic R scripts that call Stylo and other, more sophisticated Python scripts that do the preprocessing and the evaluation.

I am not the only person in this R-Python situation, and in recent years at least two tools for stylometry have actually been written in Python: Pystyl and PyDelta. Why do I keep working with Stylo if I know more Python? For several reasons:

 * Stylo is very well documented
 * It has a mailing group where you get answers and help
 * It has been tested by hundreds of researchers
 * The developers teach about the tool
 * And they use the feedback from these workshops (we are talking about hundreds of students!) to improve Stylo (I have myself seen Maciej speed-coding some changes to Stylo during a class, uploading them to CRAN, and asking people to update Stylo)
 * My PhD tutors recommend that I do so

My stylometric tests are becoming more and more complex, so it is starting to be a pain to jump all the time between two groups of scripts. I knew that you can use other programming languages from inside Python, so I thought it was worth a try to see if that was possible with R too.

This notebook and the sibling blog post at http://cligs.hypotheses.org/blog are the first findings. I would be really happy to receive opinions and feedback.

## rpy2
The module that we are going to use is *rpy2* (https://rpy2.readthedocs.io/en/version_2.8.x/), which allows you to work with R from Python. Installing rpy2 was not the difficult part; the difficult part was making it work. After some time I realised that the problem was the version of R on my computer. Although the rpy2 documentation says that R 3.0 should be OK, it was not. Updating R in Ubuntu was trickier than expected, so I uninstalled and reinstalled R and Stylo, making sure that the version was higher than 3.0. I am currently working with 3.3.

So, enough talking, let's do some code:

In [53]:
import rpy2.robjects as ro
R = ro.r
print("Version of R that is being used: " + str(R.version[6][0]))

Version of R that is being used: 3.3
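
A side note on the cell above: R keeps the parts of its version number as separate strings in `R.version` (`major` and `minor`, where `minor` itself looks like `3.1` for R 3.3.1), so a single indexed element shows only part of the version. The following is a minimal sketch of how the two parts could be combined and compared against a minimum. The helper names and the `(3, 3, 0)` threshold are assumptions made for illustration (3.3 is simply the version that worked here); neither rpy2 nor Stylo prescribes them.

```python
def parse_r_version(major, minor):
    """Combine R's separate 'major' and 'minor' strings into a tuple of ints.

    Example: major "3" and minor "3.1" describe R 3.3.1.
    """
    return tuple(int(part) for part in f"{major}.{minor}".split("."))


def r_is_recent_enough(major, minor, minimum=(3, 3, 0)):
    """Return True if the version is at least `minimum` (a hypothetical threshold)."""
    return parse_r_version(major, minor) >= minimum


# Plain strings are used here so that the sketch runs without R installed.
print(parse_r_version("3", "3.1"))
print(r_is_recent_enough("3", "0.2"))
```

With rpy2, the two strings could presumably be read with `R.version.rx2("major")[0]` and `R.version.rx2("minor")[0]` and passed to the helper.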


I am not going to explain exactly how rpy2 works (because it is not the point of this notebook and because I couldn't). Let's just say that whenever we see something starting with `R.`, it is an R object that we are calling from Python. Example:

In [73]:
R.pi
print(type(R.pi))
print(R.pi)

<class 'rpy2.robjects.vectors.FloatVector'>
[1] 3.141593



We can convert these objects to Python objects:

In [76]:
pi = R.pi[0]
print(type(pi))
print(pi)

<class 'float'>
3.141592653589793


## Stylo in Python
In the same way we can call Stylo in Python:

In [31]:
R.library("stylo")

print(R.help("stylo"))

R Help on ‘stylo’

stylo                  package:stylo                   R Documentation

Stylometric multidimensional analyses

Description:

     It is quite a long story what this function does. Basically, it is
     an all-in-one tool for a variety of experiments in computational
     stylistics. For a more detailed description, refer to HOWTO
     available at: <URL:
     https://sites.google.com/site/computationalstylistics/>

Usage:

     stylo(gui = TRUE, frequencies = NULL, parsed.corpus = NULL,
           features = NULL, path = NULL, corpus.dir = "corpus", ...)

Arguments:

     gui: an optional argument; if switched on, a simple yet effective
          graphical interface (GUI) will appear. Default value is
          ‘TRUE’.

frequencies: using this optional argument, one can load a custom table
          containing frequencies/counts for several varia

In the repository of this notebook you will find, in a subfolder, one Spanish collection from the CLiGS Textbox (https://github.com/cligs/textbox) that is already prepared for stylometric tests. So I will set the path to the current directory and call Stylo without the graphical user interface (if I needed the GUI, I would just work in R!).

In [77]:
R.setwd(".")
all_data = R.stylo(
            gui = False,
            )

using current directory...

Performing no sampling (using entire text as sample)

loading Bazan_Pazos-ne0077.txt	...

loading Bazan_Piedra-ne0082.txt	...

loading Bazan_Sirena-ne0085.txt	...

loading BlascoIbanez_Arroz-ne0163.txt	...

loading BlascoIbanez_Barraca-ne0164.txt	...

loading BlascoIbanez_Bodega-ne0019.txt	...

loading Clarin_Cuesta-ne0170.txt	...

loading Clarin_Hijo-ne0135.txt	...

loading Clarin_Regenta-ne0325.txt	...

loading Galdos_Bringas-ne0027.txt	...

loading Galdos_Misericordia-ne0002.txt	...

loading Galdos_Tristana-ne0005.txt	...

loading Miro_Amigo-ne0044.txt	...

loading Miro_Hilvan-ne0041.txt	...

loading Miro_Vivir-ne0042.txt	...

loading Pereda_Pedro-ne0144.txt	...

loading Pereda_Penas-ne0145.txt	...

loading Pereda_Sotileza-ne0146.txt	...

loading Picon_Dulce-ne0155.txt	...

loading Picon_JuanV-ne0162.txt	...

loading Picon_Lazaro-ne0161.txt	...

loading Valera_Genio-ne0151.txt	...

loading Valera_Juanita-ne0152.txt	...

loading Valera_Morsamor-ne0153.txt	






It is cool to see Stylo's messages in a Jupyter Notebook running Python, right? It also sends us a couple of warning messages. I think the cause is the kind of progress output that Stylo prints on the R command line while running, which cannot be shown in the same way in Python.

When it is finished, a pop-up window from R will appear with the classic dendrogram that we all know:

![title](dendrogram.png)

## Passing arguments
Now, what happens when I want to set the arguments of Stylo? As explained in the documentation, the arguments that define the minimum and maximum number of most frequent words (MFW) are called *mfw.min* and *mfw.max*. If we try this:

In [56]:
R.setwd(".")
all_data = R.stylo(
            gui = False,
            mfw.min = 5000,
            mfw.max = 5000
            )

SyntaxError: keyword can't be an expression (<ipython-input-56-5f8daf7e2087>, line 4)

Python complains: a dot is not allowed in the name of a keyword argument. For these cases the rpy2 documentation (http://rpy.sourceforge.net/rpy2/doc-2.2/html/robjects_functions.html) recommends passing the arguments as a Python dictionary in which the keys are strings with the names of the Stylo arguments, unpacked with `**`:

In [78]:
I_love_this_stuff = R.stylo(
    **{
        "gui" : False,
        "analyzed.features" : "w",
        "ngram.size" : 1,
        "preserve.case" : False,
        "mfw.min" : 5000,
        "mfw.max" : 5000,
        "mfw.list.cutoff" : 5000,
        "analysis.type" : "CA",
        "distance.measure" : "dist.eder",
        "sampling" : "no.sampling",
        "display.on.screen" : False,
        "write.png.file" : True,
        "save.distance.tables" : True,
        "save.analyzed.features" : True,
        "save.analyzed.freqs" : True,
    }
)

using current directory...

Performing no sampling (using entire text as sample)

loading Bazan_Pazos-ne0077.txt	...

loading Bazan_Piedra-ne0082.txt	...

loading Bazan_Sirena-ne0085.txt	...

loading BlascoIbanez_Arroz-ne0163.txt	...

loading BlascoIbanez_Barraca-ne0164.txt	...

loading BlascoIbanez_Bodega-ne0019.txt	...

loading Clarin_Cuesta-ne0170.txt	...

loading Clarin_Hijo-ne0135.txt	...

loading Clarin_Regenta-ne0325.txt	...

loading Galdos_Bringas-ne0027.txt	...

loading Galdos_Misericordia-ne0002.txt	...

loading Galdos_Tristana-ne0005.txt	...

loading Miro_Amigo-ne0044.txt	...

loading Miro_Hilvan-ne0041.txt	...

loading Miro_Vivir-ne0042.txt	...

loading Pereda_Pedro-ne0144.txt	...

loading Pereda_Penas-ne0145.txt	...

loading Pereda_Sotileza-ne0146.txt	...

loading Picon_Dulce-ne0155.txt	...

loading Picon_JuanV-ne0162.txt	...

loading Picon_Lazaro-ne0161.txt	...

loading Valera_Genio-ne0151.txt	...

loading Valera_Juanita-ne0152.txt	...

loading Valera_Morsamor-ne0153.txt	
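
The dictionary-unpacking trick used in the call above is plain Python and does not depend on rpy2. The sketch below shows why the dotted keyword fails and how a small helper could build the dictionary from Python-friendly names. `dotted_kwargs` and `fake_stylo` are hypothetical names introduced only for this illustration (`fake_stylo` stands in for `R.stylo` so that the code runs without R), and the helper assumes that no real argument name contains an underscore.

```python
def dotted_kwargs(**options):
    """Translate Python-friendly names (mfw_min) into R-style names (mfw.min)."""
    return {name.replace("_", "."): value for name, value in options.items()}


def fake_stylo(**kwargs):
    """Stand-in for R.stylo: it only returns the arguments it received."""
    return kwargs


# A dotted keyword is rejected by the Python parser before any function runs.
try:
    compile("fake_stylo(mfw.min=5000)", "<example>", "eval")
    dotted_keyword_is_valid = True
except SyntaxError:
    dotted_keyword_is_valid = False

# Unpacking a dictionary with ** accepts keys that are not valid identifiers.
received = fake_stylo(**dotted_kwargs(gui=False, mfw_min=5000, mfw_max=5000))

print(dotted_keyword_is_valid)
print(received)
```

With the real function, the equivalent call would be `R.stylo(**dotted_kwargs(gui=False, mfw_min=5000, mfw_max=5000))`.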






So now we have in our folder all the files that we asked for: the PNG, the distance table, the features used... But what if I want to work further with this data in Python?

## Using the data from Stylo in Python

In the cell above I have called stylo() and saved its output in a variable called I_love_this_stuff (following the documentation of stylo ;) ):

In [62]:
print(type(I_love_this_stuff))
print(len(I_love_this_stuff))
print(I_love_this_stuff)

<class 'rpy2.robjects.vectors.ListVector'>
9

Function call:

Depending on your chosen options, some results should have been written
into a few files; you should be able to find them in your current
(working) directory. Usually, these include a list of words/features
used to build a table of frequencies, the table itself, a file containing
recent configuration, etc.

Advanced users: you can pipe the results to a variable, e.g.:
	 hip.hip.hurrah = stylo() 
this will create a class "hip.hip.hurrah" containing some presumably
interesting stuff. The class created, you can type, e.g.:
	 summary(hip.hip.hurrah)
to see which variables are stored there and how to use them.


for suggestions how to cite this software, type: citation("stylo")





As we see, my data variable is a ListVector of length 9. Each of its items contains different information from the analysis I have run. For example, the first item is the distance matrix:

In [63]:
print(type(I_love_this_stuff[0]))
print(I_love_this_stuff[0])

<class 'rpy2.robjects.vectors.Matrix'>

--------------------------------------------
final distances between each pair of samples 
--------------------------------------------

                            Bazan_Pazos-ne0077 Bazan_Piedra-ne0082
Bazan_Pazos-ne0077          0                  1888.5245394       
Bazan_Piedra-ne0082         1888.5245394       0                  
Bazan_Sirena-ne0085         2133.6560362       2101.6861626       
BlascoIbanez_Arroz-ne0163   2261.4495052       2413.2292456       
BlascoIbanez_Barraca-ne0164 2485.5174716       2618.0304242       
BlascoIbanez_Bodega-ne0019  2292.5467693       2411.1421787       
Clarin_Cuesta-ne0170        2833.7510012       2788.7184217       
Clarin_Hijo-ne0135          2255.7227789       2371.0966337       
Clarin_Regenta-ne0325       2107.9031058       2301.6654136       
Galdos_Bringas-ne0027       2165.8468701       2303.483447        
                            ...                ...                
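
To get an overview of all nine items without printing each one, their names and types can be listed. The following is a minimal sketch: `summarise_components` is a hypothetical helper, and it assumes only that the object is iterable and exposes a `.names` attribute, as rpy2's ListVector does. The stand-in class and the item names `item.one` and `item.two` are placeholders so that the code runs without R; they are not the real names in Stylo's result.

```python
def summarise_components(result):
    """Return (position, name, type name) for every item of a named list-like object."""
    return [
        (position, str(name), type(item).__name__)
        for position, (name, item) in enumerate(zip(result.names, result))
    ]


class StandInList(list):
    """Hypothetical stand-in for a ListVector: a list with a .names attribute."""

    def __init__(self, items, names):
        super().__init__(items)
        self.names = names


demo = StandInList([[0.0, 1.5], ["de", "la"]], names=["item.one", "item.two"])
summary = summarise_components(demo)
for row in summary:
    print(row)
```

In the notebook, the call would be `summarise_components(I_love_this_stuff)`.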
                   

As we see, this object is a matrix in R. Working in Python we would be happier with a pandas DataFrame. To get one, we first convert the matrix into a NumPy array, then build the DataFrame from this array, passing the names of the rows and the columns.

In [65]:
import pandas as pd
import numpy as np

# The first item of the stylo() result is the distance matrix (an R matrix)
distances = I_love_this_stuff[0]
I_love_this_deltamatrix = pd.DataFrame(
    np.array(distances),
    index=list(R.rownames(distances)),
    columns=list(R.colnames(distances)),
)

I_love_this_deltamatrix

| | Bazan_Pazos-ne0077 | Bazan_Piedra-ne0082 | Bazan_Sirena-ne0085 | BlascoIbanez_Arroz-ne0163 | BlascoIbanez_Barraca-ne0164 | BlascoIbanez_Bodega-ne0019 | Clarin_Cuesta-ne0170 | Clarin_Hijo-ne0135 | Clarin_Regenta-ne0325 | Galdos_Bringas-ne0027 | ... | Miro_Vivir-ne0042 | Pereda_Pedro-ne0144 | Pereda_Penas-ne0145 | Pereda_Sotileza-ne0146 | Picon_Dulce-ne0155 | Picon_JuanV-ne0162 | Picon_Lazaro-ne0161 | Valera_Genio-ne0151 | Valera_Juanita-ne0152 | Valera_Morsamor-ne0153 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bazan_Pazos-ne0077 | 0.0 | 1888.524539 | 2133.656036 | 2261.449505 | 2485.517472 | 2292.546769 | 2833.751001 | 2255.722779 | 2107.903106 | 2165.84687 | ... | 2692.634044 | 2129.738339 | 2052.474196 | 2097.021096 | 2181.173081 | 2537.517957 | 2568.952628 | 2467.28403 | 2152.521632 | 2373.088271 |
| Bazan_Piedra-ne0082 | 1888.524539 | 0.0 | 2101.686163 | 2413.229246 | 2618.030424 | 2411.142179 | 2788.718422 | 2371.096634 | 2301.665414 | 2303.483447 | ... | 2714.389804 | 2208.201252 | 2213.70759 | 2159.672964 | 2274.322568 | 2560.370894 | 2618.89868 | 2504.64914 | 2284.564593 | 2391.257775 |
| Bazan_Sirena-ne0085 | 2133.656036 | 2101.686163 | 0.0 | 2574.080062 | 2869.269357 | 2603.611752 | 2694.86825 | 2494.749683 | 2469.059592 | 2422.727418 | ... | 2842.195431 | 2324.406443 | 2349.662223 | 2439.833464 | 2333.246418 | 2699.575492 | 2661.347957 | 2502.916654 | 2430.757873 | 2510.821798 |
| BlascoIbanez_Arroz-ne0163 | 2261.449505 | 2413.229246 | 2574.080062 | 0.0 | 2140.168343 | 1923.197453 | 2955.406759 | 2380.717076 | 2307.240663 | 2367.146174 | ... | 2848.758451 | 2364.020159 | 2376.773223 | 2445.424526 | 2445.485729 | 2597.534945 | 2617.38028 | 2644.515895 | 2452.854914 | 2589.977293 |
| BlascoIbanez_Barraca-ne0164 | 2485.517472 | 2618.030424 | 2869.269357 | 2140.168343 | 0.0 | 2087.081492 | 3183.523131 | 2753.838045 | 2679.923329 | 2780.074767 | ... | 2778.458402 | 2728.936723 | 2569.968181 | 2670.966947 | 2870.449013 | 2991.170549 | 2845.063189 | 2963.664838 | 2751.053181 | 2785.147995 |
| BlascoIbanez_Bodega-ne0019 | 2292.546769 | 2411.142179 | 2603.611752 | 1923.197453 | 2087.081492 | 0.0 | 2951.362769 | 2460.758914 | 2381.056994 | 2545.838272 | ... | 2761.898703 | 2399.087185 | 2362.920219 | 2485.379047 | 2590.448137 | 2714.085476 | 2660.198587 | 2677.056816 | 2477.59782 | 2500.963617 |
| Clarin_Cuesta-ne0170 | 2833.751001 | 2788.718422 | 2694.86825 | 2955.406759 | 3183.523131 | 2951.362769 | 0.0 | 2599.740875 | 2704.231502 | 2874.876338 | ... | 3116.691359 | 2709.556446 | 2700.670607 | 2898.486426 | 2911.209593 | 3001.134998 | 2979.386879 | 2779.595212 | 2799.188649 | 2840.853888 |
| Clarin_Hijo-ne0135 | 2255.722779 | 2371.096634 | 2494.749683 | 2380.717076 | 2753.838045 | 2460.758914 | 2599.740875 | 0.0 | 1752.054597 | 2291.415609 | ... | 3046.43438 | 2247.97423 | 2264.813618 | 2397.839703 | 2282.923914 | 2548.352462 | 2607.618169 | 2384.867753 | 2252.733794 | 2451.050301 |
| Clarin_Regenta-ne0325 | 2107.903106 | 2301.665414 | 2469.059592 | 2307.240663 | 2679.923329 | 2381.056994 | 2704.231502 | 1752.054597 | 0.0 | 2287.29481 | ... | 2944.264941 | 2241.958666 | 2283.56718 | 2366.994209 | 2309.467804 | 2633.959138 | 2607.336197 | 2513.086162 | 2264.576739 | 2522.379293 |
| Galdos_Bringas-ne0027 | 2165.84687 | 2303.483447 | 2422.727418 | 2367.146174 | 2780.074767 | 2545.838272 | 2874.876338 | 2291.415609 | 2287.29481 | 0.0 | ... | 2979.822925 | 2150.504746 | 2142.769529 | 2224.926537 | 2263.362507 | 2505.7182 | 2689.517108 | 2457.253401 | 2186.79902 | 2516.545984 |


There you have the beautiful Delta matrix of your corpus as a pandas DataFrame, using Stylo but working only with Python scripts. Yay!
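
As one example of what becomes easy once the distances are in a DataFrame, the sketch below finds the closest other text for each text. It uses only a four-text excerpt of the values shown in the table above, so the neighbours it reports are not necessarily the ones you would get from the full corpus. `nearest_neighbours` is a hypothetical helper name, not part of pandas or Stylo.

```python
import numpy as np
import pandas as pd

# Four texts and their pairwise distances, copied from the table above.
labels = [
    "Bazan_Pazos-ne0077",
    "Bazan_Piedra-ne0082",
    "Bazan_Sirena-ne0085",
    "BlascoIbanez_Arroz-ne0163",
]
values = np.array([
    [0.0, 1888.524539, 2133.656036, 2261.449505],
    [1888.524539, 0.0, 2101.686163, 2413.229246],
    [2133.656036, 2101.686163, 0.0, 2574.080062],
    [2261.449505, 2413.229246, 2574.080062, 0.0],
])
delta = pd.DataFrame(values, index=labels, columns=labels)


def nearest_neighbours(distances):
    """Return, for each row label, the label of the closest other text."""
    # Hide the diagonal (distance of a text to itself) so that it cannot win.
    off_diagonal = distances.mask(np.eye(len(distances), dtype=bool))
    return off_diagonal.idxmin(axis=1)


neighbours = nearest_neighbours(delta)
print(neighbours)
```

On the full matrix the same call would be `nearest_neighbours(I_love_this_deltamatrix)`.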

## Feedback, please!

This is just a first attempt. Many things could be done differently, I have probably overlooked things, and maybe there are better ways to deal with this Python-R problem. So, please, let me know your thoughts. Together with this notebook I have also written a blog post, where you can leave comments.