# Data preprocessing

Two primary types of data are collected to estimate species richness (Colwell et al., 2012). The first is individual-based abundance data, in which individuals are the sampling units and the number of individuals observed for each species is recorded. The second is sample-based incidence data, in which the sampling units are areas or intervals, such as quadrats, plots, transects, or time intervals, and only the presence or absence of each species in each unit is recorded.
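To make the distinction concrete, the sketch below derives both views from the same set of field records using only the Python standard library. The records, species names, and plot labels are invented for illustration, and the code does not call Copia.

```python
from collections import Counter

# Hypothetical field records: one (species, plot) tuple per individual seen.
records = [
    ("duck", "plot1"), ("duck", "plot1"), ("duck", "plot2"),
    ("dove", "plot1"), ("hawk", "plot3"),
]

# Individual-based abundance: count every individual of each species.
abundance = Counter(species for species, _ in records)

# Sample-based incidence: count the distinct plots in which each species
# occurs, so repeated sightings within one plot collapse to a single presence.
incidence = Counter(species for species, _ in set(records))

print(sorted(abundance.items()))  # [('dove', 1), ('duck', 3), ('hawk', 1)]
print(sorted(incidence.items()))  # [('dove', 1), ('duck', 2), ('hawk', 1)]
```

The duck is seen three times but in only two plots, so its abundance count is 3 while its incidence count is 2.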

In Copia, these data types are represented by the `AbundanceData` and `IncidenceData` objects, respectively.

All analysis functions in Copia require their input to be an instance of either `AbundanceData` or `IncidenceData`. The `copia.data.to_copia_dataset()` function converts a collection of observations into such a dataset, and it works for both abundance and incidence data. The examples below show how to create datasets with this function.

## Abundance Data

Suppose you have gathered a list of species observations. To use these observations in Copia, convert them into a Copia dataset as follows:

In [1]:
from copia.data import to_copia_dataset

observations = 'duck', 'duck', 'eagle', 'dove', 'dove', 'dove', 'hawk'
ds = to_copia_dataset(observations, data_type="abundance", input_type="observations")
ds

AbundanceData(S_obs=4, f1=2, f2=1, n=7, counts=array([3, 2, 1, 1]))

The `AbundanceData` object stores your observations as an array of counts, one per unique species in the dataset. It also reports basic summary statistics: the number of observed species (S_obs), the total number of observations (n), the number of singletons (f1, species observed exactly once), and the number of doubletons (f2, species observed exactly twice). Many of Copia's estimation functions rely on f1 and f2.

If you already have counts rather than raw observations, you can construct the same data object directly:
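The summary statistics shown in the output above can be reproduced by hand. The following is a minimal sketch using only the standard library; it illustrates the definitions and is not Copia's own implementation. The descending sort is there only to mirror the printed array.

```python
from collections import Counter

observations = ["duck", "duck", "eagle", "dove", "dove", "dove", "hawk"]

# Tally the individuals per species, largest count first.
counts = sorted(Counter(observations).values(), reverse=True)

# Frequency of frequencies: how many species were seen exactly k times.
frequency_of_counts = Counter(counts)

summary = {
    "S_obs": len(counts),          # number of observed species
    "f1": frequency_of_counts[1],  # singletons
    "f2": frequency_of_counts[2],  # doubletons
    "n": sum(counts),              # total number of observations
}
print(counts)   # [3, 2, 1, 1]
print(summary)  # {'S_obs': 4, 'f1': 2, 'f2': 1, 'n': 7}
```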

In [2]:
import numpy as np

counts = np.array([3, 2, 1, 1])
ds = to_copia_dataset(counts, data_type="abundance", input_type="counts")
ds

AbundanceData(S_obs=4, f1=2, f2=1, n=7, counts=array([3, 2, 1, 1]))

Similarly, a pandas `Series` or `DataFrame` can be used to construct the dataset:

In [3]:
import pandas as pd

counts = pd.Series([3, 2, 1, 1], index=["dove", "duck", "eagle", "hawk"])
counts

dove     3
duck     2
eagle    1
hawk     1
dtype: int64

In [4]:
ds = to_copia_dataset(counts, data_type="abundance", input_type="counts")
ds

AbundanceData(S_obs=4, f1=2, f2=1, n=7, counts=array([3, 2, 1, 1]))

## Incidence Data

Incidence data is handled in much the same way. As with abundance data, you can supply either raw observations or counts. Raw incidence observations can be supplied in two ways:

#### 1. Observation Matrix
This method suits data in matrix form, given either as a NumPy array or as a pandas `DataFrame`. One axis of the matrix corresponds to unique items and the other to unique locations, and a non-zero entry indicates that an item occurs at a location. In the example below, each of the three rows is an item and each of the two columns is a location.

In [5]:
observation_matrix = np.array([[1, 0], [0, 1], [1, 1]])
ds = to_copia_dataset(observation_matrix, data_type="incidence", 
                      input_type='observation_matrix', n_sampling_units=3)
ds

IncidenceData(S_obs=3, f1=2, f2=1, n=4, counts=array([1, 1, 2]), T=3)
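The counts in this output follow directly from the matrix. As a minimal sketch, assuming rows are items and columns are locations as in the example above, the incidence count of each item is the number of non-zero entries in its row. The code below uses plain Python lists and does not call Copia.

```python
# Rows are items, columns are locations (same values as the example above).
observation_matrix = [[1, 0], [0, 1], [1, 1]]

# An item's incidence count is the number of locations with a non-zero entry.
counts = [sum(1 for cell in row if cell != 0) for row in observation_matrix]

f1 = sum(1 for c in counts if c == 1)  # items found at exactly one location
f2 = sum(1 for c in counts if c == 2)  # items found at exactly two locations
n = sum(counts)                        # total number of incidences

print(counts, f1, f2, n)  # [1, 1, 2] 2 1 4
```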

#### 2. Observation List
This method suits data given as a list, tuple, dict, or NumPy array of (item, location) pairs. It also accepts a pandas `DataFrame` with dedicated columns for items and locations.

In [11]:
observation_list = [
    ('item1', 'loc1'), ('item2', 'loc2'), 
    ('item3', 'loc1'), ('item3', 'loc2')
]

ds = to_copia_dataset(observation_list, data_type="incidence", 
                      input_type='observation_list', n_sampling_units=3)
ds

IncidenceData(S_obs=3, f1=2, f2=1, n=4, counts=array([1, 1, 2]), T=3)
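The observation list and the observation matrix carry the same information, which is why both inputs yield the same dataset. The sketch below shows the correspondence with the standard library only. Sorting the labels alphabetically is an assumption made here for a deterministic layout and is not necessarily what Copia does internally.

```python
observation_list = [
    ("item1", "loc1"), ("item2", "loc2"),
    ("item3", "loc1"), ("item3", "loc2"),
]

# Collect the unique labels for both axes.
items = sorted({item for item, _ in observation_list})
locations = sorted({location for _, location in observation_list})
present = set(observation_list)

# Build the presence/absence matrix: one row per item, one column per location.
matrix = [
    [1 if (item, location) in present else 0 for location in locations]
    for item in items
]
counts = [sum(row) for row in matrix]

print(matrix)  # [[1, 0], [0, 1], [1, 1]]
print(counts)  # [1, 1, 2]
```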

You can also pass a pandas `DataFrame` and specify which columns hold the items and the locations:

In [12]:
observation_df = pd.DataFrame(observation_list, columns=['item', 'location'])
observation_df

    item location
0  item1     loc1
1  item2     loc2
2  item3     loc1
3  item3     loc2


In [8]:
# Pass the DataFrame (not the list) so that the column names can be resolved.
ds = to_copia_dataset(
    observation_df, data_type="incidence", 
    input_type='observation_list', 
    location_column='location',
    index_column='item',
    n_sampling_units=3)
ds

IncidenceData(S_obs=3, f1=2, f2=1, n=4, counts=array([1, 1, 2]), T=3)

Besides raw observations, Copia also accepts count data for incidence analysis. The counts can be provided in either of two formats:

1. List or array: the counts are given directly, one value per item.
2. Pandas `Series` or `DataFrame`: the counts are stored together with their item labels, and for a `DataFrame` you name the columns that hold the items and the counts.

The following examples show both formats:

In [9]:
counts = [2, 1, 1]

ds = to_copia_dataset(
    counts, data_type='incidence', input_type='counts', n_sampling_units=3)
ds

IncidenceData(S_obs=3, f1=2, f2=1, n=4, counts=array([2, 1, 1]), T=3)

In [10]:
df = pd.DataFrame([
    {'item': 'item1', 'count': 1},
    {'item': 'item2', 'count': 1},
    {'item': 'item3', 'count': 2}])

# Pass the DataFrame defined above rather than the earlier `counts` list.
ds = to_copia_dataset(
    df, data_type='incidence', 
    input_type='counts',
    index_column='item',
    count_column='count',
    n_sampling_units=3)
ds

IncidenceData(S_obs=3, f1=2, f2=1, n=4, counts=array([1, 1, 2]), T=3)
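Every incidence example above passes `n_sampling_units` explicitly, because the number of sampling units (T) cannot be recovered from the counts alone. By definition, an incidence count is the number of sampling units in which an item occurs, so it can never exceed T. The helper below is a hypothetical sanity check that you might run before building a dataset. Its name and behaviour are assumptions made for illustration; it is not part of Copia.

```python
def check_incidence_counts(counts, n_sampling_units):
    """Hypothetical helper: raise ValueError if counts are inconsistent with T."""
    for count in counts:
        if count < 0:
            raise ValueError(f"negative incidence count: {count}")
        if count > n_sampling_units:
            raise ValueError(
                f"count {count} exceeds the {n_sampling_units} sampling units"
            )
    return sum(counts)  # total number of incidences (n)


n = check_incidence_counts([2, 1, 1], n_sampling_units=3)
print(n)  # 4
```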

For more details on the functions in Copia, including further explanations and examples, please refer to our documentation.

In summary, Copia handles both abundance and incidence data. Every analysis and estimation function expects its input to be an instance of either `AbundanceData` or `IncidenceData`, so convert your data with `to_copia_dataset()` before running any analysis.