This document is a Python exploration of this R-based document: https://m-clark.github.io/data-processing-and-visualization/data_structures.html. Code is not optimized for anything but learning. In addition, all the content is located with the main document, not here, so some sections may not be included. I only focus on reproducing the code chunks.

# Data Structures

## Vectors

In [1]:
import numpy as np
import pandas as pd

In [2]:
# a basic numeric vector
x = np.array([1, 3, 2, 5, 4])

x.dtype

dtype('int64')

### Character strings

Character strings are the basic text object in Python. The class of the object holding them may vary, e.g. a plain list or a NumPy array.

In [3]:
# character vector as list
x = ['... Of Your Fake Dimension', 'Ephemeron', 'Dryswch', 'Isotasy', 'Memory']

type(x)

list

In [4]:
# as np object
x = np.asarray(['... Of Your Fake Dimension', 'Ephemeron', 'Dryswch', 'Isotasy', 'Memory'])

x

array(['... Of Your Fake Dimension', 'Ephemeron', 'Dryswch', 'Isotasy',
       'Memory'], dtype='<U26')

I've always liked Python's basic 'paste' ability, which does not require an explicit function as with R.

In [5]:
"a" + "b" + "c"

'abc'
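As a small sketch of the other common ways to paste strings together in Python, the following uses `str.join`, an f-string, and a list comprehension. The variable names are only illustrative, and the R functions named in the comments are rough analogs rather than exact equivalents.

```python
# '+' works for a few pieces; str.join handles a whole list at once
words = ['a', 'b', 'c']

pasted = ''.join(words)     # roughly paste0(..., collapse = '') in R
spaced = ' '.join(words)    # roughly paste(..., collapse = ' ') in R

# f-strings interpolate values directly into text
n = 3
label = f'item_{n}'

# element-wise pasting over a list
prefixed = ['x_' + w for w in words]

print(pasted, spaced, label, prefixed)
```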

### Factors

Factors are used to represent categorical data structures. Although not exactly precise, one can think of factors as integers with labels. For example, the underlying representation of a variable for sex is a pair of integer codes with labels ‘Male’ and ‘Female’ (1 and 2 in R; pandas codes start at 0). Note that Python has no built-in factor type; pandas provides one as `Categorical`. The main thing to note is that factors are generally a statistical phenomenon, and are required to do statistical things with data that would otherwise be a simple character string. For other things, such as text analysis, you’ll almost certainly want character strings instead, and in many cases it will be required.

In [6]:
x_string = np.repeat(['a', 'b', 'c'], 10)

x_factor = pd.Categorical(x_string)

x_factor

x_factor.dtype

CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)

In [7]:
# np.sum(x_factor)           # not what you want

x_num = x_factor.codes

np.sum(x_num)                # can use as numeric 

30
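To make the "integers with labels" idea concrete, here is a minimal sketch using a made-up sex variable (the data are hypothetical). It shows the labels, the integer codes, and how the codes map back to labels.

```python
import pandas as pd

# hypothetical example data
sex = pd.Categorical(['Male', 'Female', 'Female', 'Male'],
                     categories=['Male', 'Female'])

print(sex.categories)   # the labels
print(sex.codes)        # the integer codes; pandas counts from 0

# map the codes back to their labels
labels = sex.categories[sex.codes]
print(list(labels))
```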

### Logical

Logical scalars and vectors are those that take on one of two values: `True` or `False`. They are especially useful in flagging whether to run certain parts of code, and indexing certain parts of data structures (e.g. taking rows that correspond to `True`). We’ll talk about the latter usage later.

Here is a logical vector.

In [8]:
my_logic = np.array([True, False, True, False, True, True], dtype=bool)

type(my_logic)

numpy.ndarray

In [9]:
np.logical_not(my_logic)

array([False,  True, False,  True, False, False])

Note that logicals are also treated as binary 0/1, and so, for example, taking the mean will provide the proportion of `True` values.

In [10]:
my_logic.astype(int)

array([1, 0, 1, 0, 1, 1])

In [11]:
np.mean(my_logic)

0.6666666666666666
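The two uses of logicals mentioned at the start of this section, flagging and indexing, can be sketched as follows. The `values` array is made up for illustration.

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50, 60])   # illustrative data
my_logic = np.array([True, False, True, False, True, True])

# indexing: keep only the elements where the mask is True
kept = values[my_logic]

# a mask usually comes from a comparison
big = values[values > 25]

# flagging: a logical scalar decides whether a piece of code runs
n_true = 0
if my_logic.any():
    n_true = int(my_logic.sum())

print(kept, big, n_true)
```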

### Numeric

The most common data structures you’ll deal with are integer and numeric vectors.

In [12]:
x = np.arange(3)
#type([np.arange(3)])

In [13]:
np.random.randn(5)

array([ 1.23545678, -1.54670501, -1.2642365 , -0.80285373,  0.23192578])
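The draws above will differ on every run. If reproducible numbers are wanted, one option is a seeded generator; this is a sketch using NumPy's `default_rng`, and the seed value 123 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(123)     # arbitrary seed, chosen for illustration
draws = rng.standard_normal(5)

# a second generator with the same seed reproduces the same draws
rng2 = np.random.default_rng(123)
same = rng2.standard_normal(5)

print(np.array_equal(draws, same))
```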

## Matrices

With multiple dimensions, we are dealing with arrays. Matrices are two dimensional (2-d) arrays, and extremely commonly used for scientific computing. The vectors making up a matrix must all be of the same type. For example, all values in a matrix might be numeric, or all character strings.
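As a quick illustration of the same-type requirement, the sketch below shows what NumPy does when values of different types are mixed in one array: it converts everything to a common type.

```python
import numpy as np

# integers mixed with strings: everything becomes a string
mixed = np.array([[1, 'a'], [2, 'b']])
print(mixed.dtype.kind)   # 'U' marks a unicode string type

# integers mixed with floats: everything becomes a float
nums = np.array([[1, 2.5], [3, 4]])
print(nums.dtype)
```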

Creating a matrix can be done in a variety of ways.

In [14]:
# vectors; note that the sequence is open on the right, i.e. will not include the stop number
x = np.arange(1, 5)

y = np.arange(5, 9)

z = np.arange(9, 13)

np.vstack((x, y, z))   

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [15]:
np.hstack((x, y, z))  # not really what I would expect, but it is consistent with documentation

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
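If the expectation was for each vector to become a column, as with `cbind` in R, `np.column_stack` is one way to get that. The sketch below also shows that it matches the transpose of the `vstack` result.

```python
import numpy as np

x = np.arange(1, 5)
y = np.arange(5, 9)
z = np.arange(9, 13)

# each 1-d vector becomes a column of a 4 x 3 matrix
cols = np.column_stack((x, y, z))

# equivalent here: stack as rows, then transpose
same = np.vstack((x, y, z)).T

print(cols)
```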

In [16]:
# or reshape to 2d matrix and use concatenate
x = np.arange(1,5).reshape(1,4)

y = np.arange(5,9).reshape(1,4)

z = np.arange(9,13).reshape(1,4)

np.concatenate((x,y,z), axis=0)

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

## Lists

Lists are as fundamental to Python generally as they are to R, so you should be comfortable using them, even if not for data science purposes.

In [17]:
x = [1, 'apple', [3, 'cat']]
x

[1, 'apple', [3, 'cat']]

In [18]:
for elem in x:
    print(type(elem))

<class 'int'>
<class 'str'>
<class 'list'>


A named list. The closest Python analog is a dictionary.

In [19]:
x = {'a':25, 'b':-1, 'c':0}
x['b']

-1
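A few more things one can do with a dictionary used this way, as a minimal sketch: list the names, loop over name/value pairs, add an element, and look up a name that may be missing. The added key `'d'` and the lookup of `'z'` are only for illustration.

```python
x = {'a': 25, 'b': -1, 'c': 0}

# the names, in insertion order (guaranteed since Python 3.7)
names = list(x)

# loop over name/value pairs
for name, value in x.items():
    print(name, value)

# add a new named element
x['d'] = 7

# .get returns a default instead of raising KeyError for a missing name
missing = x.get('z', None)
```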

## Data Frames


Data frames are a very commonly used data structure, and are essentially a representation of data in a table format with rows and columns. Elements of a data frame can be different types. As such, everything about lists applies to them. But they can also be indexed by row or column, just like matrices. In Python, data frames are provided by pandas rather than the base language, and while there are some differences, the inspiration from R's data.frame class allows them to be used straightforwardly in Python as well.

Usually your data frame will come directly from import or manipulation of other objects (e.g. matrices). However, you should know how to create one from scratch.

In [20]:
mydf = pd.DataFrame({'a': [1,5,2],
                     'b': [3,8,1]})
mydf

   a  b
0  1  3
1  5  8
2  2  1


In [21]:
# use a list, not a set: a set is unordered, so the labels could be assigned in any order
mydf.index = ['row' + str(x) for x in np.arange(1, 4)]
mydf

      a  b
row1  1  3
row2  5  8
row3  2  1


In [22]:
mydf = pd.DataFrame({'A':[1,2,3], 'B':['a','b','c']})
mydf

   A  B
0  1  a
1  2  b
2  3  c
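Since data frames can be indexed by row or column, here is a short sketch of the usual pandas accessors on the data frame above: `[]` for a column, `.iloc` for position, `.loc` for labels, and a logical mask for filtering rows.

```python
import pandas as pd

mydf = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

print(mydf.dtypes)              # each column has its own type

col = mydf['A']                 # a single column, by name
first_row = mydf.iloc[0]        # a single row, by position
cell = mydf.loc[1, 'B']         # one value, by row label and column name
subset = mydf[mydf['A'] > 1]    # rows where the logical mask is True

print(subset)
```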


In [23]:
mylist = [['a','b'], [1,2,3], mydf]
mylist

[['a', 'b'],
 [1, 2, 3],
    A  B
 0  1  a
 1  2  b
 2  3  c]

In [24]:
type(mylist)

list

## Exercises

### Exercise 1

Create an object that is a matrix and/or a data.frame, and inspect its class or structure (use the `type` function, or the `dtype` attribute of an array or `dtypes` attribute of a data frame, on the object you just created).


### Exercise 2

Create a list of 3 elements, the first of which contains character strings, the second numbers, and the third, the data.frame or matrix you just created in Exercise 1.

### Thinking Exercises

- How is a factor different from a character vector?

- How is a data.frame the same as and different from a matrix?

- How is a data.frame the same as and different from a list?
