In [None]:
# hide
import warnings
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer
warnings.filterwarnings("ignore")

# doc_collection

> A __universal__ tool for collecting the documentation of Python libraries.

## How to install


`git clone https://github.com/omemaxim/doc_collection`

## Purpose

Every well-maintained PyPI package ships documentation in its source. This tool lets you collect, store and search that documentation. You can collect the documentation of one specific library, or of all the packages you have installed with pip at once.

# How to use

## 1. Collection

You can collect the documentation of a single library or of every package available in your Python installation (including the standard ones):

### One library

There are two main functions: ``extract_one`` and ``extract``. The first one creates a dataframe with the documentation of a specific library:

In [None]:
from doc_collection.core import extract_one

In [None]:
extract_one('pandas')

Unnamed: 0,text,name
0,pandas.DataFrame.columns. AxisProperty\n\n ...,pandas.DataFrame.columns
1,pandas.Series.index. AxisProperty\n\n The i...,pandas.Series.index
2,pandas.DataFrame.index. AxisProperty\n\n Th...,pandas.DataFrame.index
3,pandas.IntervalIndex.is_non_overlapping_monoto...,pandas.IntervalIndex.is_non_overlapping_monotonic
4,pandas.IntervalIndex.is_unique. CachedProperty...,pandas.IntervalIndex.is_unique
...,...,...
2631,pandas.core.internals.SingleArrayManager.is_vi...,pandas.core.internals.SingleArrayManager.is_view
2632,pandas.core.internals.SingleBlockManager.is_vi...,pandas.core.internals.SingleBlockManager.is_view
2633,pandas.HDFStore.is_open. property\n\n retur...,pandas.HDFStore.is_open
2634,pandas.RangeIndex.is_unique. property\n\n r...,pandas.RangeIndex.is_unique


The function above returns a dataframe with two columns: __text__ contains the documentation and __name__ contains the name of the corresponding method, class, etc.
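For intuition, rows of this shape could be assembled with the standard library alone. The sketch below is an assumption about the general approach, not the actual implementation of `extract_one`; `collect_docs` is a hypothetical helper, and the standard `json` module stands in for the target library.

```python
import inspect
import json
import pydoc


def collect_docs(module):
    """Return a list of {'name', 'text'} rows for a module's public functions and classes."""
    rows = []
    for attr, obj in inspect.getmembers(module):
        if attr.startswith("_"):
            continue  # skip private and dunder members
        if not (inspect.isfunction(obj) or inspect.isclass(obj)):
            continue  # keep the sketch small: functions and classes only
        # pydoc.plaintext avoids the overstrike "bold" that the default renderer emits
        text = pydoc.render_doc(obj, title="%s", renderer=pydoc.plaintext)
        rows.append({"name": f"{module.__name__}.{attr}", "text": text})
    return rows


rows = collect_docs(json)
print(len(rows), "objects documented")
print(rows[0]["name"])
```

Each row mirrors the __name__ and __text__ columns described above; a list like this can be passed to `pandas.DataFrame(rows)` to get a dataframe.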

### All libraries at once

The function below iterates over all modules in your local Python installation and calls ``extract_one`` on each of them, one by one:

In [None]:
from doc_collection.core import extract

In [None]:
extract()

--------------- exception during theano documentation extracting
--------------- exception during tensorflow-io-gcs-filesystem documentation extracting


Unnamed: 0,text,name,library
0,pandas.DataFrame.columns. AxisProperty\n\n ...,pandas.DataFrame.columns,pandas
1,pandas.Series.index. AxisProperty\n\n The i...,pandas.Series.index,pandas
2,pandas.DataFrame.index. AxisProperty\n\n Th...,pandas.DataFrame.index,pandas
3,ipykernel.comm.Comm.topic. Bytes in module tra...,ipykernel.comm.Comm.topic,ipykernel
4,ipywidgets.Audio.value. Bytes in module traitl...,ipywidgets.Audio.value,ipykernel
...,...,...,...
57211,aiohttp.ClientResponse.url. reify\n\n,aiohttp.ClientResponse.url,aiohttp
57212,aiohttp.ClientResponse.url_obj. reify\n\n,aiohttp.ClientResponse.url_obj,aiohttp
57213,aiohttp.ClientResponse.history. reify\n\n A...,aiohttp.ClientResponse.history,aiohttp
57214,aiohttp.BodyPartReader.filename. reify\n\n ...,aiohttp.BodyPartReader.filename,aiohttp


The command above returns a DataFrame with three columns: __text__ contains the documentation of an object, __name__ contains its name and __library__ contains the library the object belongs to.
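The exception lines in the output above suggest that a failure in one package does not stop the whole run. A minimal sketch of that pattern follows, assuming packages are discovered with `importlib.metadata` and imported one at a time; `installed_names` and `import_all` are hypothetical helpers, not part of `doc_collection`.

```python
import importlib
from importlib import metadata


def installed_names():
    """Return sorted distribution names visible to the current interpreter."""
    names = set()
    for dist in metadata.distributions():
        try:
            name = dist.metadata["Name"]
        except Exception:
            continue  # skip distributions with unreadable metadata
        if name:
            names.add(name)
    return sorted(names)


def import_all(names):
    """Try to import each name; return (loaded modules, names that failed)."""
    loaded, failed = {}, []
    for name in names:
        try:
            loaded[name] = importlib.import_module(name)
        except Exception:
            # One broken package should not abort the whole run.
            print(f"--------------- exception during {name} import")
            failed.append(name)
    return loaded, failed


# Demo with explicit names; import_all(installed_names()) would cover everything.
loaded, failed = import_all(["json", "no_such_module_xyz"])
```

Note that a distribution name is not always the import name (for example, `scikit-learn` is imported as `sklearn`), so a real implementation needs some mapping between the two.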

## 2. Storing

The tool uses __ElasticSearch__ to store and search data, and the __sentence_transformers__ library to compute embeddings for better search quality.
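To illustrate why embeddings help, the sketch below ranks documents by cosine similarity to a query vector. The three-dimensional vectors are made up for the example (real sentence-embedding models output hundreds of dimensions), and how the tool actually scores matches inside ElasticSearch is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings" keyed by object name (illustrative values only).
docs = {
    "pandas.DataFrame.drop": [0.9, 0.1, 0.0],
    "pandas.DataFrame.merge": [0.1, 0.9, 0.1],
    "aiohttp.ClientResponse.url": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the user's question

# Highest similarity first.
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(query_vec, docs[name]),
    reverse=True,
)
print(ranked[0])
```

Documents whose vectors point in a similar direction to the query rank first, even when they share no exact keywords with it.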

### Prelims

Make sure you have initialised __elasticsearch__ and a sentence-embedding model as shown below:

In [None]:
es = Elasticsearch('https://localhost:9200')

model_name = 'sentence-transformers/all-mpnet-base-v2'
model = SentenceTransformer(model_name)

### Creating index & indexing data

There is a function that creates an index and indexes the data from the dataframe mentioned above:

In [None]:
from doc_collection.search import index_data

In [None]:
# index_data(d, es, INDEX_NAME='my_index_name', BATCH_SIZE=5000)
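A parameter such as `BATCH_SIZE` typically means that rows are sent to the index in chunks rather than all at once. The sketch below shows that idea with plain Python; it is an assumption about what batching looks like in general, not the body of `index_data`, and `batched` and `to_actions` are hypothetical helpers.

```python
def batched(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items each."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def to_actions(rows, index_name):
    """Wrap each row in the action format accepted by Elasticsearch bulk helpers."""
    return [{"_index": index_name, "_source": row} for row in rows]


# Made-up rows with the same columns as the dataframe described above.
rows = [{"name": f"lib.obj{i}", "text": f"doc {i}"} for i in range(12)]

batches = [to_actions(chunk, "my_index_name") for chunk in batched(rows, 5)]
sizes = [len(batch) for batch in batches]
print(sizes)
```

With a live client, each batch could then be passed to `elasticsearch.helpers.bulk(es, batch)`.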

### Search

In [None]:
from doc_collection.search import query

There are several ways to present the data that was found.
A query with the following signature returns the raw elasticsearch response:

In [None]:
# query(text, size, es, model, INDEX_NAME)

By setting the ``field`` parameter to __name__, you get only the names of the matching methods.

In [None]:
# query(text='How to drop a column of dataframe?', size=10, es=es, model=model, INDEX_NAME='bert', field='name')

['pandas.DataFrame.drop',
 'pandas.Series.drop',
 'pandas.DataFrame.dropna',
 'pyarrow.Table.drop',
 'datasets.Dataset.drop_index',
 'pandas.DataFrame.droplevel',
 'datasets.IterableDatasetDict.remove_columns',
 'datasets.IterableDataset.remove_columns',
 'pandas.Series.droplevel',
 'datasets.DatasetDict.remove_columns']
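For reference, a raw Elasticsearch search response keeps the matches under `hits.hits`, with each stored document in `_source`. The sketch below shows how one field could be pulled out of such a response; `extract_field` is a hypothetical helper, the sample response is hand-written, and whether `query` does exactly this internally is an assumption.

```python
def extract_field(response, field):
    """Return `field` from every hit of a raw Elasticsearch search response."""
    return [hit["_source"][field] for hit in response["hits"]["hits"]]


# Hand-written sample shaped like a search response (scores and texts are made up).
sample_response = {
    "hits": {
        "hits": [
            {
                "_score": 1.9,
                "_source": {
                    "name": "pandas.DataFrame.drop",
                    "text": "Drop specified labels from rows or columns.",
                },
            },
            {
                "_score": 1.7,
                "_source": {
                    "name": "pandas.Series.drop",
                    "text": "(documentation text)",
                },
            },
        ]
    }
}

names = extract_field(sample_response, "name")
print(names)
```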

By setting the ``field`` parameter to __text__, you get the documentation itself:

In [None]:
# print(query(text='How to drop a column of dataframe?', size=1, es=es, model=model, INDEX_NAME='bert', field='text')[0])

pandas.DataFrame.drop. function drop in module pandas.core.frame

ddrroopp(self, labels=None, axis: 'Axis' = 0, index=None, columns=None, level: 'Level | None' = None, inplace: 'bool' = False, errors: 'str' = 'raise')
    Drop specified labels from rows or columns.
    
    Remove rows or columns by specifying label names and corresponding
    axis, or by specifying directly index or column names. When using a
    multi-index, labels on different levels can be removed by specifying
    the level. See the `user guide <advanced.shown_levels>`
    for more information about the now unused levels.
    
    Parameters
    ----------
    labels : single label or list-like
        Index or column labels to drop. A tuple will be used as a single
        label and not treated as a list-like.
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Whether to drop labels from the index (0 or 'index') or
        columns (1 or 'columns').
    index : single label or list-like
        Alter

# Afterword

This tool depends heavily on the Python installation of whoever runs the code, so unexpected behavior is likely. Feel free to open an issue if you run into one!