# Interoperability with R

There are many popular R packages for text mining, topic modeling and NLP, such as [tm](https://cran.r-project.org/web/packages/tm/index.html) and [topicmodels](https://cran.r-project.org/web/packages/topicmodels/index.html). If you need to implement parts of your work in Python with tmtoolkit and other parts in R, you can do so quite easily.

First of all, you can import and export all tabular data to and from Python using tabular data formats like CSV or Excel. See for example the sections on [tabular tokens output](preprocessing.ipynb#Accessing-tokens-and-token-attributes) or [exporting topic modeling results](topic_modeling.ipynb#Displaying-and-exporting-topic-modeling-results) and check out the [load_corpus_from_tokens_table](api.rst#tmtoolkit.corpus.load_corpus_from_tokens_table) function.
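
As a minimal sketch of the CSV route, the following snippet writes term counts as a "long" table with one row per document/token pair, a layout that R can read with `read.csv()`. The toy counts, the column names `doc`, `token` and `count`, and both helper functions are assumptions made for this illustration; they are not part of tmtoolkit.

```python
import csv
import io


def write_counts_csv(fileobj, doc_labels, vocab, rows):
    """Write non-zero counts as (doc, token, count) rows in CSV format."""
    writer = csv.writer(fileobj, lineterminator='\n')
    writer.writerow(['doc', 'token', 'count'])
    for label, row in zip(doc_labels, rows):
        for token, n in zip(vocab, row):
            if n > 0:  # skip zeros to keep the file small
                writer.writerow([label, token, n])


def read_counts_csv(fileobj):
    """Read the long table back into a nested dict {doc: {token: count}}."""
    counts = {}
    for rec in csv.DictReader(fileobj):
        counts.setdefault(rec['doc'], {})[rec['token']] = int(rec['count'])
    return counts


# toy data (hypothetical): 2 documents, 3 vocabulary tokens
doc_labels = ['doc_a', 'doc_b']
vocab = ['oil', 'price', 'said']
rows = [[5, 0, 3],
        [0, 2, 1]]

# an in-memory buffer stands in for a file opened with open(..., 'w', newline='')
buf = io.StringIO()
write_counts_csv(buf, doc_labels, vocab, rows)
buf.seek(0)
roundtrip = read_counts_csv(buf)
print(roundtrip)
```

The long layout only stores non-zero counts, so the file stays reasonably small even for a sparse DTM.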

However, if you only want to load a document-term matrix (DTM) that you generated with tmtoolkit into R or vice versa, the most efficient way is to store this matrix along with all necessary metadata in an [RDS file](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS), as explained in the following sections.

<div class="alert alert-info">
    
**Note**

You will need to install tmtoolkit with the "rinterop" option in order to use the functions explained in this chapter: `pip install tmtoolkit[rinterop]`. This option is available since version 0.12.0.

</div>
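
If you are unsure which version is installed, a quick check along the following lines may help. This is only a sketch: the helpers `version_tuple` and `installed_version` are hypothetical, and the check only looks at the version number, not at whether the extra dependencies of the "rinterop" option were actually installed.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '0.12.0' into (0, 12, 0), keeping the leading digits of each part."""
    parts = []
    for piece in version.split('.'):
        m = re.match(r'\d+', piece)
        if m is None:
            break
        parts.append(int(m.group()))
    while len(parts) < 3:  # pad short versions like '0.12' with zeros
        parts.append(0)
    return tuple(parts)


def installed_version(package):
    """Return the installed version string or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version('tmtoolkit')
if found is None:
    print('tmtoolkit is not installed')
elif version_tuple(found) < (0, 12, 0):
    print(f'tmtoolkit {found} is too old for the "rinterop" option')
else:
    print(f'tmtoolkit {found} is recent enough for the "rinterop" option')
```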


## Saving a (sparse) document-term matrix to an RDS file

A common scenario is that you used tmtoolkit for preprocessing your text corpus and generated a DTM along with document labels and the corpus vocabulary. For further processing you want to use R, e.g. for topic modeling with the *topicmodels* package. You can do so by using the [save_dtm_to_rds](api.rst#tmtoolkit.bow.dtm.save_dtm_to_rds) function.

First, we generate a DTM from some sample data:

In [1]:
import tmtoolkit.corpus as c

corp = c.Corpus.from_builtin_corpus('en-News100', sample=10)
c.print_summary(corp)

Corpus with 10 documents in English
> News100-1771 (522 tokens): Trump invited Navy SEAL widow Carryn Owens to join...
> News100-3539 (1624 tokens): Ana 's Fate Rested With An Asylum Officer Who Had ...
> News100-3031 (1440 tokens): Battle over Ireland 's last Magdalene laundry    I...
> News100-368 (578 tokens): Air pollution concerns potential overseas talent  ...
> News100-2515 (167 tokens): Four killed in Austrian avalanche    Four Swiss me...
> News100-2483 (840 tokens): UN report : Israel has established an ' apartheid ...
> News100-895 (1378 tokens): German retailer KiK compensates Pakistan 's ' indu...
> News100-3228 (599 tokens): Neil Gorsuch facing ' rigorous ' confirmation hear...
> News100-1813 (1743 tokens): Press review : Russia changes anti - doping tune a...
> News100-2787 (1184 tokens): French architect Le Corbusier 's foray into the Fa...
total number of tokens: 10075 / vocabulary size: 2759


In [2]:
c.lemmatize(corp)
c.to_lowercase(corp)
c.filter_clean_tokens(corp, remove_numbers=True)
c.remove_common_tokens(corp, df_threshold=0.9)
c.remove_uncommon_tokens(corp, df_threshold=0.1)

c.print_summary(corp)

Corpus with 10 documents in English
> News100-1771 (122 tokens): trump invite white house wednesday share detail su...
> News100-3539 (355 tokens): rest officer tell doubt word bear head trump admin...
> News100-3031 (331 tokens): ireland ireland sale want memorial woman abuse han...
> News100-368 (138 tokens): air potential percent foreign worker problem air u...
> News100-2515 (41 tokens): kill swiss man kill group away western austria pol...
> News100-2483 (188 tokens): report establish white report break new ground sit...
> News100-895 (330 tokens): industrial family company release $ compensation k...
> News100-3228 (138 tokens): face hearing week president donald trump justice c...
> News100-1813 (376 tokens): press change language story press thursday march f...
> News100-2787 (218 tokens): east national western world summer good man job bu...
total number of tokens: 2237 / vocabulary size: 494


In [3]:
dtm, doc_labels, vocab = c.dtm(corp, return_doc_labels=True, return_vocab=True)

In [4]:
print('first 10 document labels:')
print(doc_labels[:10])

print('first 10 vocabulary tokens:')
print(vocab[:10])

print('DTM shape:')
print(dtm.shape)

first 10 document labels:
['News100-1771', 'News100-1813', 'News100-2483', 'News100-2515', 'News100-2787', 'News100-3031', 'News100-3228', 'News100-3539', 'News100-368', 'News100-895']
first 10 vocabulary tokens:
['$', 'able', 'abuse', 'accept', 'access', 'accord', 'account', 'accuse', 'acknowledge', 'act']
DTM shape:
(10, 494)


The DTM is stored as a sparse matrix. **It's highly recommended to use a sparse matrix representation, especially when you're working with large text corpora.**

In [5]:
dtm

<10x494 sparse matrix of type '<class 'numpy.int32'>'
	with 1348 stored elements in Compressed Sparse Row format>
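
To make the benefit concrete: the output above reports 1348 stored elements for a 10 by 494 matrix, i.e. only about 27% of the 4940 cells are non-zero, and a sparse format stores just those. The following plain-Python sketch illustrates the idea behind the Compressed Sparse Row layout on a toy matrix. It is meant for illustration only (the function `to_csr` is hypothetical); in practice SciPy handles this for you.

```python
def to_csr(dense):
    """Convert a list of rows into the three CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(j)     # its column index
        indptr.append(len(data))      # position in `data` where the next row starts
    return data, indices, indptr


# toy DTM (hypothetical): 3 documents x 5 vocabulary tokens
dense = [[2, 0, 0, 1, 0],
         [0, 0, 3, 0, 0],
         [0, 1, 0, 0, 4]]

data, indices, indptr = to_csr(dense)
n_cells = len(dense) * len(dense[0])

print('data:   ', data)
print('indices:', indices)
print('indptr: ', indptr)
print(f'{len(data)} of {n_cells} cells stored')
```

The entries of row `i` are found at `data[indptr[i]:indptr[i + 1]]`, so zeros never take up space.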

Now we save the DTM, along with the document labels and the vocabulary, as a sparse matrix to an RDS file that we can load into R:

In [6]:
import os
from tmtoolkit.bow.dtm import save_dtm_to_rds

rds_file = os.path.join('data', 'dtm.RDS')
print(f'saving DTM, document labels and vocabulary to file "{rds_file}"')
save_dtm_to_rds(rds_file, dtm, doc_labels, vocab)


saving DTM, document labels and vocabulary to file "data/dtm.RDS"
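
Before saving, it can help to verify that the labels match the matrix dimensions (rows correspond to documents, columns to vocabulary tokens) and that the target folder exists. The following is a minimal sketch; the helpers `check_dtm_inputs` and `ensure_parent_dir` are hypothetical and not part of tmtoolkit, and no assumption is made here about which checks `save_dtm_to_rds` performs itself.

```python
import os
import tempfile


def check_dtm_inputs(shape, doc_labels, vocab):
    """Raise ValueError if labels or vocabulary don't match the DTM shape."""
    n_docs, n_vocab = shape
    if len(doc_labels) != n_docs:
        raise ValueError(f'{len(doc_labels)} document labels for {n_docs} rows')
    if len(vocab) != n_vocab:
        raise ValueError(f'{len(vocab)} vocabulary tokens for {n_vocab} columns')


def ensure_parent_dir(path):
    """Create the folder that will contain `path` if it doesn't exist yet."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return parent


# stand-in values; with real data pass dtm.shape, doc_labels and vocab
check_dtm_inputs((2, 3), ['doc_a', 'doc_b'], ['oil', 'price', 'said'])

# a temporary folder is used here so that the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'data', 'dtm.RDS')
    created = ensure_parent_dir(target)
    dir_exists = os.path.isdir(created)

print(dir_exists)
```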


The following R code would load this DTM from the RDS file and fit a topic model via LDA with 20 topics:

```R
library(Matrix)       # for sparseMatrix in RDS file
library(topicmodels)  # for LDA()
library(slam)         # for as.simple_triplet_matrix()

# load data 
dtm <- readRDS('data/dtm.RDS')
class(dtm)
dtm  # sparse matrix with document labels as row names, vocabulary as column names

# convert sparse matrix to triplet format required for LDA
dtm <- as.simple_triplet_matrix(dtm)

# fit a topic model
topicmodel <- LDA(dtm, k = 20, method = 'Gibbs')

# investigate the topics
terms(topicmodel, 5)
```

## Loading a (sparse) document-term matrix from an RDS file

The opposite direction is also possible. For example, you may have preprocessed a text corpus in R and generated a (sparse) DTM along with its document labels and vocabulary. You can write this data to an RDS file and load it into Python/tmtoolkit. The following R code shows how to generate a sparse DTM and store it in `data/dtm2.RDS`:

```R
library(Matrix)       # for sparseMatrix
library(tm)           # for DocumentTermMatrix

data("crude")

dtm <- DocumentTermMatrix(crude, control = list(removePunctuation = TRUE, stopwords = TRUE))

dtm_out <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v, dims = dim(dtm),
                        dimnames = dimnames(dtm))

saveRDS(dtm_out, 'data/dtm2.RDS')
```

We can now load the DTM along with its document labels and vocabulary from this RDS file:

In [7]:
import os.path
from tmtoolkit.bow.dtm import read_dtm_from_rds


rds_file = os.path.join('data', 'dtm2.RDS')
print(f'loading DTM, document labels and vocabulary from file "{rds_file}"')
dtm, doc_labels, vocab = read_dtm_from_rds(rds_file)

print('first 10 document labels:')
print(doc_labels[:10])

print('first 10 vocabulary tokens:')
print(vocab[:10])

print('DTM shape:')
print(dtm.shape)

loading DTM, document labels and vocabulary from file "data/dtm2.RDS"
first 10 document labels:
['127', '144', '191', '194', '211', '236', '237', '242', '246', '248']
first 10 vocabulary tokens:
['100000', '108', '111', '115', '12217', '1232', '1381', '13member', '13nation', '150']
DTM shape:
(20, 1000)


In [8]:
dtm

<20x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 1738 stored elements in Compressed Sparse Column format>

Note that the DTM was loaded as a floating-point matrix, but it makes more sense to represent the term frequencies as integers, since they are counts:

In [9]:
dtm = dtm.astype('int')
dtm

<20x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 1738 stored elements in Compressed Sparse Column format>
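
A cast to integers truncates fractional values without warning, so it can be worth confirming first that all stored values are whole numbers. The following is a minimal sketch with a hypothetical helper, `all_whole_numbers`. It accepts any iterable of numbers, so with a SciPy sparse matrix you could pass its `data` attribute, which holds the stored values.

```python
def all_whole_numbers(values):
    """Return True if every value is a whole number such as 3.0."""
    return all(float(v).is_integer() for v in values)


counts = [5.0, 3.0, 13.0]    # term counts that happen to be stored as floats
weights = [0.5, 1.25, 2.0]   # e.g. a weighted matrix; casting would lose information

print(all_whole_numbers(counts))   # True
print(all_whole_numbers(weights))  # False
```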

We could now further process and analyze this DTM with tmtoolkit. For example, we can display the three most frequent tokens per document:

In [10]:
from tmtoolkit.bow.bow_stats import sorted_terms_table

# selecting only the first 5 documents
sorted_terms_table(dtm[:5, :], vocab=vocab, doc_labels=doc_labels[:5], top_n=3)

| doc | rank | token | value |
|-----|------|-------|-------|
| 127 | 1 | oil | 5 |
| 127 | 2 | prices | 3 |
| 127 | 3 | said | 3 |
| 144 | 1 | opec | 13 |
| 144 | 2 | oil | 12 |
| 144 | 3 | said | 11 |
| 191 | 1 | canadian | 2 |
| 191 | 2 | texaco | 2 |
| 191 | 3 | crude | 2 |
| 194 | 1 | crude | 3 |
