# Arrays and dataframes: An introduction to tabular data

by Koenraad De Smedt at UiB

---
Many datasets have the shape of a table, i.e. an n-dimensional data structure of fixed size. Two-dimensional data have the following axes:

0.  *rows*, also called *records* (often given numerical indexes)
1.  *columns*, also called *fields*

Row index   | Column 0     | Column 1
------------|--------------|--------------
0 →         | Row 0, Col 0 | Row 0, Col 1
1 →         | Row 1, Col 0 | Row 1, Col 1
2 →         | Row 2, Col 0 | Row 2, Col 1

It is of course possible to have data structures with [only one dimension or more than two dimensions](https://www.awesomegrasp.com/wp-content/uploads/2019/10/python-array-and-axis-e1570454038223-768x405.png), but we will not give examples involving more than two dimensions.

This notebook provides an introduction to working with tabular data using these Python libraries:

1.   [*numpy*](https://numpy.org/doc/stable/user/absolute_beginners.html) provides operations on n-dimensional *arrays* of items of the same type and size.
2.   [*pandas*](https://pandas.pydata.org/pandas-docs/version/2.1/getting_started/intro_tutorials/index.html) provides operations on *dataframes*, which are two-dimensional labeled data structures, and *series*, which have only one data column.

---

In [None]:
import numpy as np
import pandas as pd
pd.__version__

The first example is based on an actual study of variations in pronunciation. In particular, pronunciations with the diphthong *øy* or with the monophthong (simple vowel) *ø* in certain words were counted. We will keep the example small, using only a few studied words in one recording.

Suppose we have the results as a dict where the keys are words and the values are pairs of diphthong and monophthong counts, respectively. Converting the dict's values to a list gives four pairs, so the data has the shape 4 by 2, but it is still a list, not an array.

In [None]:
di = {'køyre': (35, 12), 'høyre': (3, 17),
      'køyrde/køyrt': (4,  1), 'høyrde/høyrt': (0,  8)}

li = list(di.values())
li

NB: Remember that in Python 3.7 and later, dicts preserve insertion order. This means that the values in the list appear in the same order as in the dict.

# Arrays with *numpy*


Make an array from the list of values.

In [None]:
ar = np.array(li)
ar

Display information about this array: shape and total size.

In [None]:
print(ar.shape, ar.size)

With an array, we can do things which are not so easy with a list. For instance, we can easily compute the sum of all numbers in the entire array.

In [None]:
ar.sum()
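To illustrate why this is less convenient with a list, here is a small sketch that computes the same total both ways. The counts are copied from the dict above; the variable names `list_total` and `array_total` are only chosen for this illustration.

```python
import numpy as np

# Same counts as in the dict above: (diphthong, monophthong) per word.
li = [(35, 12), (3, 17), (4, 1), (0, 8)]

# With a plain list of tuples, we have to add up the pairs ourselves.
list_total = sum(sum(pair) for pair in li)

# With an array, a single method call covers all elements.
array_total = int(np.array(li).sum())

print(list_total, array_total)  # 80 80
```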

We can also compute the sum of the values in each column, i.e. summing down the rows (axis 0). This gives us a new array. In this case, we see that there are more diphthongs than monophthongs.

NB: Using 1 as the argument gives the opposite: the sum of the values in each row, i.e. summing across the columns.

In [None]:
ar.sum(0)
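As a side-by-side sketch of the two axis arguments, the following rebuilds the array from the counts above and sums along each axis. The names `col_sums` and `row_sums` are introduced here only for illustration.

```python
import numpy as np

ar = np.array([(35, 12), (3, 17), (4, 1), (0, 8)])

col_sums = ar.sum(axis=0)  # one total per column: diphthongs, monophthongs
row_sums = ar.sum(axis=1)  # one total per row, i.e. per word

print(col_sums.tolist())  # [42, 38]
print(row_sums.tolist())  # [47, 20, 5, 8]
```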

Check if something is anywhere in the array.

In [None]:
4 in ar, 99 in ar

## Selecting rows and columns

We can take some rows. This gives us a new array. The `:` notation is similar to that for sequences.

In [None]:
ar[1:3]

After the comma, columns can be specified. Take, for instance, all rows and only column 0.

In [None]:
ar[:, 0]
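Row and column selections can be combined in one expression. The sketch below shows a few variations on the same array; the variable names are only illustrative.

```python
import numpy as np

ar = np.array([(35, 12), (3, 17), (4, 1), (0, 8)])

single = ar[1, 0]      # row 1, column 0: a single number
block = ar[1:3, 0]     # rows 1 and 2, column 0: a one-dimensional array
last_col = ar[:, -1]   # all rows, last column (negative indexes count from the end)

print(single, block.tolist(), last_col.tolist())  # 3 [3, 4] [12, 17, 1, 8]
```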

We can also transpose the whole array, so that columns become rows and vice versa.

In [None]:
ar.transpose()

There is more one can do with arrays. But as long as we have no labels on the columns and indexes, they may be a bit hard to interpret when we are manipulating them.

# Dataframes with *pandas*

The *pandas* module provides more practical ways of working with tabular data than *numpy*.

An object of type *DataFrame* is a two-dimensional *labeled* data structure. You can think of it as a two-dimensional array with labels on the rows and columns, much like a spreadsheet or a dataframe in [R](https://www.r-project.org/). Unlike an array, a dataframe may contain data of different types.
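The difference in type handling can be seen in a minimal sketch. The small table `mixed` below is a hypothetical example made up for this purpose; it is not part of the study data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table with one text column and one numeric column.
mixed = pd.DataFrame({'word': ['køyre', 'høyre'], 'diphthong': [35, 3]})
print(mixed.dtypes)  # each column keeps its own type

# numpy converts all items to one common type, which here is text.
as_array = np.array([('køyre', 35), ('høyre', 3)])
print(as_array.dtype.kind)  # 'U' stands for Unicode string
```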

A dataframe can be created in many ways. The following creates a dataframe from the numpy array above and results in a plain dataframe with numerical indexes for the rows and numerical column labels.

In [None]:
df = pd.DataFrame(ar)
df

We can put names (labels) on the columns and rows. Labeling often makes dataframes more practical to view and work with than arrays.

In [None]:
df.columns = ('diphthong', 'monophthong')
df.index = di.keys()
df

Alternatively, we can make a dataframe directly from the dict. The `orient` argument indicates whether the dict keys become indexes (as in this case) or column labels.

In [None]:
df = pd.DataFrame.from_dict(di, orient='index', columns=('diphthong', 'monophthong'))
df
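For comparison, here is a sketch of the other orientation, in which the dict keys become column labels. The name `wide` is only illustrative, and the row labels are assigned by hand afterwards.

```python
import pandas as pd

di = {'køyre': (35, 12), 'høyre': (3, 17),
      'køyrde/køyrt': (4, 1), 'høyrde/høyrt': (0, 8)}

# orient='columns' (the default): each dict key becomes a column label.
wide = pd.DataFrame.from_dict(di, orient='columns')
wide.index = ['diphthong', 'monophthong']  # label the two rows by hand

print(wide.shape)  # (2, 4)
```

The result holds the same counts as the dataframe above, but transposed.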

Show shape and size.

In [None]:
print(df.shape, df.size)

Make a bar plot of the values side by side.

In [None]:
df.plot.bar()

## Selecting rows and columns

Rows and columns can be selected in much the same way as with arrays, but now we can use labels instead of numerical indexes. If the result is two-dimensional, it will be a new dataframe.

In [None]:
df.loc['køyre':'høyre', :]

Columns can be omitted if all are to be selected.

In [None]:
df.loc['køyre':'høyre']
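One difference from slicing arrays is worth noting: a label slice with `.loc` includes its end label, whereas a positional slice excludes its end position. The sketch below contrasts `.loc` with `.iloc`, the positional counterpart in pandas; the variable names are only illustrative.

```python
import pandas as pd

di = {'køyre': (35, 12), 'høyre': (3, 17),
      'køyrde/køyrt': (4, 1), 'høyrde/høyrt': (0, 8)}
df = pd.DataFrame.from_dict(di, orient='index',
                            columns=['diphthong', 'monophthong'])

by_label = df.loc['køyre':'høyre']  # the end label 'høyre' is included
by_position = df.iloc[0:1]          # the end position 1 is excluded

print(len(by_label), len(by_position))  # 2 1
```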

If the result is one-dimensional, its type is Series. Here we select a single column.

In [None]:
df.loc[:, 'diphthong']

For selecting one column, you can also use the following notation, where `.loc` is omitted.

In [None]:
df['diphthong']

Sometimes, you also see the following abbreviation for selecting one column, but this works only if the column name is a valid Python identifier and does not clash with an existing attribute or method of the dataframe.

In [None]:
df.diphthong
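The following sketch shows two cases where the attribute notation does not give the column. The dataframe `df2` and its column names are hypothetical, chosen only to trigger these cases.

```python
import pandas as pd

# Hypothetical column names: one clashes with a DataFrame attribute,
# the other contains a space and is therefore not a valid identifier.
df2 = pd.DataFrame({'size': [1, 2], 'total count': [3, 4]})

print(df2.size)                   # the number of cells (4), not the column
print(df2['size'].tolist())       # [1, 2]: bracket notation gives the column
print(df2['total count'].tolist())  # [3, 4]: no attribute notation possible
```

The bracket notation works in all cases, which makes it the safer choice.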

There are many operations on pandas dataframes. Let's make a new dataframe with percentages for the distribution in each row. First we divide each value by the total of its row, and then we multiply by 100.

In [None]:
pct = df.div(df.sum(axis='columns'), axis='index').mul(100)
pct
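The same computation can be spelled out in two steps to check the result. This is only a sketch; `row_totals` is a name introduced here for illustration.

```python
import pandas as pd

di = {'køyre': (35, 12), 'høyre': (3, 17),
      'køyrde/køyrt': (4, 1), 'høyrde/høyrt': (0, 8)}
df = pd.DataFrame.from_dict(di, orient='index',
                            columns=['diphthong', 'monophthong'])

row_totals = df.sum(axis='columns')              # one total per word
pct = df.div(row_totals, axis='index').mul(100)  # align the totals with the rows

# For 'køyre': 35 / 47 and 12 / 47, expressed as percentages.
print(pct.loc['køyre'].round(1).tolist())  # [74.5, 25.5]
```

Each row of `pct` adds up to 100, which is a quick way to verify that the division was done along the intended axis.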

## Styling dataframes with colors etc.

Dataframes can be *styled* in different ways, affecting their look.

For instance, dataframes containing numbers can be styled with a background gradient dependent on the numeric values. This view, often called a *heatmap*, may help in getting a quick overview of contrasts in the values.

In [None]:
dfsty = df.style.background_gradient(axis=None, cmap='Greens')
dfsty

In the following example, the color depends on the distribution within each column only. Also, floating point numbers are formatted so that they have only one decimal and are followed by a percent sign.

In [None]:
pctsty = pct.style.background_gradient(axis='index', cmap='Reds').format('{:.1f} %')
pctsty

Alternatively, one can make a downloadable heatmap picture with a legend and other options. The `annot` parameter controls whether the values are displayed in the cells.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.2)

# heatmap() returns the matplotlib Axes object it draws on
s = sns.heatmap(df, cmap='Purples', annot=True)
s.set(ylabel='Word', xlabel='Pronunciation count')
plt.show()

For larger examples of heatmaps, see for instance [Measuring Harmful Representations in Scandinavian Language Models](https://arxiv.org/pdf/2211.11678.pdf).

Large dataframes are most often made by collecting tabular data from external sources, rather than typing in values. Tables and plots can also be written to file in several formats. These features will be demonstrated in later notebooks.

## Exercise

Heatmaps are often used for [a confusion matrix with the results of a classifier](https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea).
Assume the following results of a model that classifies newspaper articles:

*  10 Sports articles are classified correctly and 5 are classified as Politics;
*  2 Politics articles are classified as Sports and 13 are classified correctly.

Make a 2 x 2 dataframe in which the rows are labeled with the true classes and the columns are the classes assigned by the model. Visualize it as a heatmap, which should look roughly like [this](https://drive.google.com/file/d/1U-42mCVfrl6g-DlkilRW1o1dKfREdOoZ/view?usp=sharing) or like [this](https://drive.google.com/file/d/1_ArAW1LWINuuLFHW7wgDOqQF-TfoLqQD/view?usp=sharing).
