# fitgrid.lm() API Tutorial

What exactly is in my_grid and how do I get it out for further analysis?

1. first show what a statsmodels OLS regression fit object looks like in the first place ... the big bundle of attributes and methods. Display some examples ... params, pvalues, influence to illustrate. I think statsmodels will be unfamiliar to many people and it is important to make the point that each "fit" is a fat object and not just a table of scalars.

2. show how after fitgrid.lm() is run, the grid is a 2-D array of these fit objects ... show the shape, type, and head of the *grid*.

3. show how the fit attributes are gridded, again show the shape, type, and, head of a few attributes.

4. show how to use the row x column slicing syntax to access regions of the array in both dimensions

5. show a few different kinds of examples to illustrate what data frames look like for grid queries that come back with a scalar (like p values) vs. those that don't, like params, influence stats and such.

To show how `fitgrid.lm()` works and how to use it in analysis, this tutorial first describes a `fit` object as produced by ``statsmodels``' ``ols``, then shows how `fitgrid.lm()` arranges such objects in a grid. Examples of usage and of data frame queries for statistical values of interest follow.

## 1. Fit Object

Following the initial fitgrid tutorial, we use the same data set here to look at a `fit` object.

A `fit` object in this case comes from the `statsmodels` ordinary least squares regression function.

In [None]:
import fitgrid

In [None]:
# example_filename = 'example.h5'
# epochs = fitgrid.epochs_from_hdf(
epochs_df = fitgrid.generate().table.set_index('Time', append=True)
epochs = fitgrid.epochs_from_dataframe(
    epochs_df,
    time='Time',
    epoch_id='Epoch_idx',
    channels=['channel0', 'channel1']
)

In [None]:
import statsmodels.formula.api as smf

In [None]:
# consider at time 0 for channel0
statsmodel_fit = smf.ols(formula= "channel0 ~ continuous + categorical", 
                         data=epochs.table[epochs.table['Time']==0]).fit()

`smf.ols()` takes an $R$-style formula and a data frame and sets up the ordinary least squares model. Calling `.fit()` on that model returns a `fit` object.

A useful method of the `fit` object is `summary()`, which reports the statistical properties of the ordinary least squares regression. It conveniently outputs a summary table for the linear model you have constructed, including information such as the coefficients and p-values.

In [None]:
statsmodel_fit.summary()

Notice that we can also query the desired information directly, since each piece is an attribute of the `fit` object.

In [None]:
# query the coefficients
statsmodel_fit.params

In [None]:
# query the p-values
statsmodel_fit.pvalues

In [None]:
# query Cook's distance through the influence object
statsmodel_influence = statsmodel_fit.get_influence()
# cooks_distance is a (distances, p-values) tuple;
# look at the Cook's distance of the first 5 points
statsmodel_influence.cooks_distance[0][0:5]

The first two examples show that the coefficients and p-values are attributes of the `fit` object. In the third example, `get_influence()` is a method of the `fit` object; it returns an object holding detailed information about the influence of each data point. Cook's distance, queried above, is one such influence measure.
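To make the attribute versus method distinction concrete without the tutorial data, here is a minimal sketch on synthetic data. The data frame `toy_df`, its values, and the names `toy_fit` and `toy_infl` are made up for illustration; only the `statsmodels` calls mirror the cells above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for one time point of one channel (values are made up).
rng = np.random.default_rng(0)
n = 40
toy_df = pd.DataFrame({
    "continuous": rng.normal(size=n),
    "categorical": np.tile(["cat0", "cat1"], n // 2),
})
toy_df["channel0"] = 2.0 + 0.5 * toy_df["continuous"] + rng.normal(size=n)

toy_fit = smf.ols("channel0 ~ continuous + categorical", data=toy_df).fit()

# Attributes: already computed, returned as pandas Series.
print(type(toy_fit.params).__name__, len(toy_fit.params))
print(type(toy_fit.pvalues).__name__, len(toy_fit.pvalues))

# Method: must be called, and returns another rich object.
toy_infl = toy_fit.get_influence()
print(type(toy_infl).__name__)
print(len(toy_infl.cooks_distance))  # tuple: (distances, p-values)
```

The point of the sketch is that a single fit already bundles several kinds of results, some of which are themselves objects with their own attributes.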

The `fit` object contains useful information as attributes and methods that you may want to examine. <br>
However, notice from the formula that only **channel0** is modeled. <br>
To use the same data and predictors for a different channel, say **channel1**, using ``statsmodels``, you would have to run the model set-up again in order to produce the relevant `fit` object.

This is a time- and space-consuming problem that `fitgrid.lm()` solves.

## 2. Fit Objects into a Grid

`fitgrid.lm()` solves the problem by breaking the formula into a left hand side (LHS) and a right hand side (RHS).

* LHS: **response variables** in a list. These are the data stream(s) you are trying to model. In this example, they are the different channels.
* RHS: **explanatory variables** that model every stream in the LHS. This one model is used to analyze every response variable in the LHS list. In this example, the explanatory variables are `continuous` and `categorical`.
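The LHS/RHS split can be pictured as a loop that reuses one RHS for every response variable at every time point. The sketch below is a conceptual illustration in plain `statsmodels`, not fitgrid's actual implementation; the synthetic table `toy`, the dictionary `toy_grid`, and all values are assumptions made up for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic epochs: 20 epochs x 5 time points, two channels (all made up).
rng = np.random.default_rng(1)
n_epochs, n_times = 20, 5
n_rows = n_epochs * n_times
toy = pd.DataFrame({
    "Epoch_idx": np.repeat(np.arange(n_epochs), n_times),
    "Time": np.tile(np.arange(n_times), n_epochs),
    "continuous": rng.normal(size=n_rows),
    "categorical": np.tile(["cat0", "cat1"], n_rows // 2),
    "channel0": rng.normal(size=n_rows),
    "channel1": rng.normal(size=n_rows),
})

LHS = ["channel0", "channel1"]
RHS = "continuous + categorical"

# One fit per (time, channel) cell: the same RHS is reused for every LHS entry.
toy_grid = {}
for t, snapshot in toy.groupby("Time"):
    for channel in LHS:
        formula = f"{channel} ~ {RHS}"
        toy_grid[(int(t), channel)] = smf.ols(formula, data=snapshot).fit()

print(len(toy_grid))                      # n_times * len(LHS) fits
print(toy_grid[(0, "channel0")].params)   # coefficients of one cell
```

Writing this loop by hand for every analysis is the bookkeeping that a gridded interface takes over.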

In [None]:
# set up the fitgrid linear model
lm_grid = fitgrid.lm(epochs, LHS=['channel0', 'channel1'], RHS='continuous + categorical')

Now that we know what a `fit` object is: `fitgrid.lm()` stores exactly these `fit` objects in a **grid** laid out as time by channel. In the example data there are **100 time points**, with results for **channel0** and **channel1**.

`fitgrid` is set up as time x channels (the response variables), which in the example is 100 x 2. <br>
This is a two-dimensional array of `fit` objects.

To see this more clearly, you can simply print the object to look at its shape.

In [None]:
lm_grid

Notice that `TAB` completion on `lm_grid` and on `statsmodel_fit` above offers the same attributes and methods.

The type of the `fitgrid.lm()` output is `LMFitGrid`: a grid object in which each cell wraps the statsmodels linear regression results for one time point and one channel.

In [None]:
type(lm_grid)

To look at the **exact same** object as the `statsmodel_fit` above, we can extract it as follows:

In [None]:
# consider at time 0 for channel0
lm_grid[0,'channel0']

The desired values, for example the coefficients, p-values, and Cook's distance, can be queried as follows.

We note that the values are exactly the same as those of the `fit` object we queried above.

Also note that when we query the summary, only the wrapper pointer is displayed unless the result is passed to `print()`.
Printing the summary may be useful for one model, but for hundreds of models it is unhelpful to print them all.
To evaluate the results of hundreds of models, it is better to query the desired outputs individually.

In [None]:
# query the summary
print(lm_grid[0,'channel0'].summary())

In [None]:
# query the coefficients
lm_grid[0,'channel0'].params

In [None]:
# query the p-values
lm_grid[0,'channel0'].pvalues

In [None]:
# query the cook's distance
time0_ch0_infl = lm_grid[0,'channel0'].get_influence()
# look at the first 5 points' cook's distance
time0_ch0_infl.cooks_distance[0:5]

## 3. Fitgrid Attributes

The benefits of using the `fitgrid` (essentially a grid of `fit` objects) are:

1. obtaining results for multiple time points
2. obtaining results for multiple response variables
3. slicing (expanded on in section 4)

Instead of only considering the data at time point 0 for the single **channel0**, the `fitgrid.lm()` output can be queried across multiple time points and multiple response variables. For example, we can query the coefficients and other attributes.

In [None]:
# head of the coefficients across time points and channels
lm_grid.params.head(12)

In [None]:
# head of the p-values across time points and channels
lm_grid.pvalues.head(12)

The shape shows that, with 100 time points and 3 coefficients each (Intercept, categorical, continuous), the model is fit for 2 channels: channel0 and channel1. <br>
The resulting dataframe, which contains the entirety of the results, therefore has shape 300 by 2: 100 x 3 rows and one column per channel.
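The 300 by 2 layout can be reproduced with plain pandas to see where the numbers come from. This is a stand-in built from zeros, not fitgrid output; the index level names `Time` and `beta` and the three coefficient labels are placeholders chosen for illustration, and the labels in a real grid depend on how the categorical predictor is coded.

```python
import numpy as np
import pandas as pd

times = range(100)
betas = ["Intercept", "categorical", "continuous"]  # placeholder labels
channels = ["channel0", "channel1"]

# Rows: every (time, coefficient) pair. Columns: one per channel.
index = pd.MultiIndex.from_product([times, betas], names=["Time", "beta"])
toy_params = pd.DataFrame(
    np.zeros((len(index), len(channels))), index=index, columns=channels
)

print(toy_params.shape)         # (300, 2)
print(toy_params.loc[0].shape)  # (3, 2): the coefficients at time 0
```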

In [None]:
# looking at the shape of the output
lm_grid.params.shape

In [None]:
# looking at the type of the output
type(lm_grid.params)

To query more attributes, consider using `TAB` completion to list the `fit` attributes and methods.

## 4. Slicing

Since the `fitgrid` output is organized as time by channel (or response variable), we can slice out the desired models and outputs.

We are showcasing:
1. single `fit` object slicing
2. one dimensional time interval with one channel, or response variable
3. two dimensional time interval by multiple channels, or response variables
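Whether the end point of a slice such as `0:76` is included depends on whether slicing is done by label or by position. In pandas, label-based `.loc` slices include the end label, while position-based `.iloc` slices exclude it. The sketch below shows the difference on a stand-in data frame; it does not call fitgrid, so it is worth confirming how your own grid behaves, for instance by printing the sliced grid and checking its shape.

```python
import pandas as pd

# Stand-in for a time x channel table (values are arbitrary).
stand_in = pd.DataFrame(
    {"channel0": range(100), "channel1": range(100)},
    index=pd.RangeIndex(100, name="Time"),
)

by_label = stand_in.loc[0:76, "channel0"]   # label slice: includes Time == 76
by_position = stand_in.iloc[0:76, 0]        # positional slice: stops before row 76

print(len(by_label), len(by_position))  # 77 76
```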

#### 4.1 single `fit` object slicing

In [None]:
# consider at time 2 for channel1
lm_grid[2,'channel1']

#### 4.2 1-D slicing

In [None]:
# slicing for all the time and one channel
lm_grid[:,'channel0']

In [None]:
# slicing for time interval and one channel
lm_grid[0:76,'channel0']

In [None]:
# plotting the coefficients for a time interval for a single channel
lm_grid[0:48,'channel0'].plot_betas()

#### 4.3 2-D slicing

In [None]:
# slicing for time interval and all channels
lm_grid[0:76,:]

In [None]:
# plotting the adjusted R-squared for a time interval for both channels
lm_grid[0:54,:].plot_adj_rsquared()

## 5. Grid Query Examples

Now that you have seen how to query the `fitgrid` and obtain `fit` objects, we move on to query examples for outputs of interest. <br>
Generally we can think of the query outputs as falling into two main categories: scalars and objects.

First we examine the scalar outputs, for example coefficients and p-values. These results are simply stored in a pandas DataFrame for easy slicing and querying.

In [None]:
# considering the type of objects
type(lm_grid.params), type(lm_grid.pvalues)

In [None]:
# multi-index slicing grabbing the intercept coefficients for time 0
lm_grid.params.loc[(0, 'Intercept')]
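The multi-index lookup above is ordinary pandas, so the usual selection tools apply. The sketch below uses a random stand-in frame with the same row and column layout; the level names `Time` and `beta` and the coefficient labels are assumptions for illustration and may differ from those in a real grid.

```python
import numpy as np
import pandas as pd

betas = ["Intercept", "categorical", "continuous"]  # placeholder labels
index = pd.MultiIndex.from_product([range(100), betas], names=["Time", "beta"])
rng = np.random.default_rng(2)
toy_params = pd.DataFrame(
    rng.normal(size=(len(index), 2)), index=index, columns=["channel0", "channel1"]
)

# One (time, coefficient) row: a Series with one scalar per channel.
one_row = toy_params.loc[(0, "Intercept")]

# One coefficient across all time points: a 100 x 2 frame indexed by Time.
intercepts = toy_params.xs("Intercept", level="beta")

print(one_row.shape, intercepts.shape)
```

A cross-section like `intercepts` is the natural input for plotting one coefficient's time course per channel.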

We note that the data type of the coefficients is float64, so each entry is a **scalar** value. <br>
The `fit` object holds more than scalar values: the `influence objects` shown below are also reached through the `fit` object.
To obtain the `influence object` from each `fit` object, you have to call the method that **gets** the influence, `get_influence()`; influence measures can then be queried from the result.

In [None]:
# query the influence object
infl_example = lm_grid[0:50,:].get_influence()

In [None]:
# the influence objects are stored in an influence grid
type(infl_example)

In [None]:
# query the cook's distance 
infl_example.cooks_distance[15:25]

Notice that the influence grid is not the same as the `fit` grid. The attributes describing influence are computed for each individual data point, as in the example above.
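The difference in granularity shows up in the shapes for a single model: coefficients come one per predictor term, while influence measures come one per observation. The following is a minimal sketch in plain `statsmodels` on synthetic data; the frame `snapshot` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for one time point: 30 epochs, one channel.
rng = np.random.default_rng(3)
n_epochs = 30
snapshot = pd.DataFrame({
    "continuous": rng.normal(size=n_epochs),
    "categorical": np.tile(["cat0", "cat1"], n_epochs // 2),
    "channel0": rng.normal(size=n_epochs),
})

fit = smf.ols("channel0 ~ continuous + categorical", data=snapshot).fit()
cooks_d, cooks_p = fit.get_influence().cooks_distance

print(fit.params.shape)  # one entry per coefficient
print(cooks_d.shape)     # one entry per observation (epoch)
```

Repeated over every cell of a grid, per-observation measures therefore need an extra index level for the observation, which is why influence queries return much longer tables than coefficient queries.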