# Intro to NumPy and Pandas

## Crash Course in NumPy

This section covers just the essentials. Most day-to-day data handling will be done with Pandas.

The examples here use trivial "datasets" for clarity and simplicity. In real-world use, NumPy can easily handle arrays with 1M elements (e.g. a 1000 x 1000 matrix). With careful memory management it can handle 100 to 1000 times that.
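As a rough check on those sizes, the `nbytes` attribute reports how much memory an array's data occupies. This is a minimal sketch, assuming the default `float64` type (8 bytes per element); the variable names `big` and `small` are illustrative only.

```python
import numpy as np

# 1000 x 1000 matrix of 64-bit floats: 1,000,000 elements
big = np.zeros((1000, 1000))

print("Elements:", big.size)      # 1000000
print("Bytes:", big.nbytes)       # 8000000 (about 8 MB)

# A smaller dtype reduces memory use: float32 takes 4 bytes per element
small = big.astype(np.float32)
print("Bytes as float32:", small.nbytes)  # 4000000
```

Choosing a smaller type like this is one simple form of the memory management mentioned above, at the cost of reduced precision.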

### Import

By convention, NumPy is imported under the shorthand alias `np`.

In [None]:
import numpy as np

### Creating Arrays

Use NumPy's array function (`np.array`) to convert an "array-like" object, such as a list, into a NumPy array.

In [None]:
# Only the basics needed for ML work
row = [1, 2, 3, 4, 5]

# 1D array from a list (table row)
arr_1d = np.array(row)
print(arr_1d)
type(arr_1d)


[1 2 3 4 5]


numpy.ndarray

In [None]:
table = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
]

# 2D array from a nested list (list of lists, or table)
arr_2d = np.array(table)
print(arr_2d)
type(arr_2d)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


numpy.ndarray

### Data Types

NumPy arrays are homogeneous: all elements have the same type. These are NumPy's own fixed-size types (e.g. `int64`, `float64`), not the `int` and `float` types of base Python.

The `dtype` attribute identifies the type of the array and thus all values in it.

In [None]:
# create arrays of all integer, all float, and all text values
int_arr = np.array([1, 2, 3])
float_arr = np.array([1., 2., 3.])
text_arr = np.array(['a', 'b', 'c'])

print("Integer array:", int_arr.dtype)   # int64, a 64-bit integer
print("Float array:", float_arr.dtype)   # float64, a 64-bit floating point
print("Text array:", text_arr.dtype)     # <U1, Unicode string of length 1

Integer array: int64
Float array: float64
Text array: <U1


NumPy is optimized for numerical computations on homogeneous arrays. Data with missing values (`None`), other object types (e.g. dates), or mixed types that NumPy cannot reconcile will be lumped into a generic "object" type.

In these situations NumPy typically loses its speed advantage, and numerical operations might fail or not perform as expected.

In [None]:
from datetime import datetime

with_missing = np.array([1, None, 3])
with_dates = np.array([datetime.now(), datetime.now()])

print("With missing:", with_missing.dtype)
print("With dates:", with_dates.dtype)

With missing: object
With dates: object
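To make "not perform as expected" concrete, the sketch below compares an object array containing `None` with a float array that uses `np.nan` as the missing-value marker. Using `np.nan` here is an assumption about how you might choose to represent missing data, not something the examples above require.

```python
import numpy as np

# None forces the generic object dtype
with_none = np.array([1, None, 3])
print(with_none.dtype)            # object

try:
    with_none.sum()               # 1 + None is undefined in Python
except TypeError as err:
    print("Cannot sum:", err)

# np.nan is a float, so the array stays numeric
with_nan = np.array([1, np.nan, 3])
print(with_nan.dtype)             # float64
print(with_nan.sum())             # nan (NaN propagates through the sum)
print(np.nansum(with_nan))        # 4.0 (NaN values are skipped)
```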


Take care when creating NumPy arrays from mixed data. NumPy can be aggressive about implicit type conversion, which can lead to unexpected results.

In [None]:
mixed_arr = np.array([1, 2, 3.5])
mixed_text = np.array([1, 'text', '3.14'])
mixed_text_2 = np.array([1, 2, 'text'])

print("Mixed array:", mixed_arr.dtype)      # float64 (integers upcast to preserve decimals)
print("Mixed text:", mixed_text.dtype)      # Unicode - numbers converted to text!
print("Mixed text 2:", mixed_text_2.dtype)  # Unicode - one string converts every number

print(mixed_arr)
print(mixed_text)
print(mixed_text_2)

Mixed array: float64
Mixed text: <U21
Mixed text 2: <U21
[1.  2.  3.5]
['1' 'text' '3.14']
['1' '2' 'text']
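One way to avoid surprises from implicit conversion is to state the type you want. The sketch below uses the `dtype` argument of `np.array` and the `astype` method; the variable names are illustrative.

```python
import numpy as np

# Request the type explicitly instead of relying on inference
as_float = np.array([1, 2, 3], dtype=np.float64)
print(as_float.dtype)             # float64

# Convert numeric strings back to numbers with astype
text_numbers = np.array(['1', '2', '3.14'])
converted = text_numbers.astype(np.float64)
print(converted.dtype)            # float64
print(converted.sum())

# Text that is not numeric cannot be converted and raises an error
try:
    np.array(['1', 'text']).astype(np.float64)
except ValueError as err:
    print("Conversion failed:", err)
```

Failing loudly on `'text'` is usually preferable to silently ending up with an array of strings.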


### Shape and Dimensions

- Dimension: the number of axes in an array
  - accessible via `ndim` attribute
- Shape: the length of each axis, expressed in row, column order for 2D arrays
  - accessible via `shape` attribute

For the row array:

In [None]:
print("Dimension of row:", arr_1d.ndim)
print("Shape of row:", arr_1d.shape)

Dimension of row: 1
Shape of row: (5,)


Shape is always expressed as a tuple, which Python displays as comma-separated values inside parentheses. A single-value tuple is displayed with a trailing comma, which is why the shape of the row is `(5,)`.

Don't confuse this with other container types in Python, like list and set, that also use comma-separated values. Those are identified by their surrounding delimiters, e.g. square brackets `[]` for lists and curly braces `{}` for sets. Though tuples are shown with parentheses, it is the commas that make a tuple.


In [None]:
# parentheses alone do not make a tuple; this is just the integer 5
single_number = (5)

# empty tuple is a special case; no commas
empty_tuple = ()

# common syntax for tuples
single_tuple = (5,)
two_tuple = (5, 2)

# but parentheses are optional
also_tuple = 5, 2



For the table array:


In [None]:
print("Dimension of table:", arr_2d.ndim)
print("Shape of table:", arr_2d.shape)

Dimension of table: 2
Shape of table: (3, 3)
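Because the table above is square, its shape does not reveal which axis comes first. A small sketch with a non-square array (the name `rect` is illustrative) shows that rows are listed before columns:

```python
import numpy as np

# 2 rows, 3 columns
rect = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

print(rect.shape)                 # (2, 3)

# The shape tuple can be unpacked into separate variables
n_rows, n_cols = rect.shape
print(n_rows, n_cols)             # 2 3

# size is the total number of elements (rows * columns)
print(rect.size)                  # 6
```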


### Indexing

Indexing is similar to base Python but more flexible, and it supports traditional row, column notation.

Each provided index selects a single position along the corresponding axis.

In [None]:
# for row data, syntax is the same as base python
print("First element of row:", arr_1d[0])

# for multi-dimensional data, using tuple-style indexing (preferred)
print("Middle element of table:", arr_2d[1,1])

# sequence of index operations also allowed, like base Python (clunky)
print("First element of table:", arr_2d[0][0])


First element of row: 1
Middle element of table: 5
First element of table: 1


### Slicing

A slice is a subset of data specified as a range of indexes.

Similar to base Python, slices use the colon notation `start:stop[:step]`, where `step` is optional and `stop` is exclusive (i.e., not included in the range).

In [None]:
# for row data, syntax is the same as base python
print("First three elements of row:", arr_1d[0:3])

# for multi-dimensional data, slice and index can be combined
print("First two elements of second row:", arr_2d[1, 0:2])

# empty start/stop implies full range, like base Python
print("All elements after first:", arr_1d[1:])
print("First two rows of table:\n", arr_2d[:2])

First three elements of row: [1 2 3]
First two elements of second row: [4 5]
All elements after first: [2 3 4 5]
First two rows of table:
 [[1 2 3]
 [4 5 6]]
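The optional `step` part of the slice notation is not used in the examples above. A brief sketch of how it behaves, using a standalone copy of the row data:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# every second element, starting from the first
print(arr[::2])                   # [1 3 5]

# every second element between index 1 (inclusive) and 5 (exclusive)
print(arr[1:5:2])                 # [2 4]

# a negative step walks the array backwards
print(arr[::-1])                  # [5 4 3 2 1]
```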


Any trailing index that is omitted is treated as a full slice.

In [None]:
# equivalent to arr_2d[0,:]
print("First row of table:", arr_2d[0])

# a column needs an explicit full slice (:) for the row axis
print("First column of table:", arr_2d[:,0])

First row of table: [1 2 3]
First column of table: [1 4 7]


The ability to pull a column of data in this way is powerful. In base Python you would need a loop that extracts the desired column from each row.

In [None]:
# extract the second column (index 1) from each row
col = []
for table_row in table:
    col.append(table_row[1])

print(col)

# alternatively, in a list comprehension
[row[1] for row in table]

When dealing with tabular data, where rows are observations and columns are attributes, we will often want to work with columns. NumPy makes this easy. Pandas leans into the distinction, taking a column-based approach to working with data.

### Reshaping

We will sometimes need to convert data between row and column form or reshape it in different ways without changing the total number of elements.

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
print(arr.shape)

(12,)


In [None]:
print("As row:", arr.reshape(1, 12))      # one row
print("As column:\n", arr.reshape(12, 1))   # one column
print("As table:\n", arr.reshape(3, 4))     # 3 rows, 4 cols
print("As cube:\n", arr.reshape(2, 2, 3))   # 2x2x3 3D array

As row: [[ 1  2  3  4  5  6  7  8  9 10 11 12]]
As column:
 [[ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]
 [11]
 [12]]
As table:
 [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
As cube:
 [[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]
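Two details of `reshape` are worth a quick sketch: passing `-1` for one axis asks NumPy to infer its length, and a shape that does not account for every element is rejected. The array is rebuilt here with `np.arange` so the example stands alone.

```python
import numpy as np

arr = np.arange(1, 13)            # 12 elements: 1 through 12

# -1 tells NumPy to work out that axis from the total size
as_column = arr.reshape(-1, 1)
print(as_column.shape)            # (12, 1)

as_table = arr.reshape(3, -1)
print(as_table.shape)             # (3, 4)

# 5 x 3 would need 15 elements, so this fails
try:
    arr.reshape(5, 3)
except ValueError as err:
    print("Reshape failed:", err)
```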


### Array Creation

NumPy also includes a variety of functions for constructing arrays from scratch.

In [None]:
# Fixed values
print("Zeros array:")
print(np.zeros((2, 3)))  # shape specified as tuple

print("\nOnes array:")
print(np.ones((2, 3)))

print("\nIdentity matrix:")
print(np.eye(3))  # square matrix, size specified as single number

# Sequences
print("\nRange-like sequence:")
print(np.arange(5))  # like Python's range()

print("\nEvenly spaced values:")
print(np.linspace(0, 1, 5))  # 5 values from 0 to 1 inclusive

# Random values
print("\nRandom uniform [0,1):")
print(np.random.random((2, 3)))

print("\nRandom normal (mean=0, std=1):")
print(np.random.normal(size=(2, 3)))

Zeros array:
[[0. 0. 0.]
 [0. 0. 0.]]

Ones array:
[[1. 1. 1.]
 [1. 1. 1.]]

Identity matrix:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

Range-like sequence:
[0 1 2 3 4]

Evenly spaced values:
[0.   0.25 0.5  0.75 1.  ]

Random uniform [0,1):
[[0.17549925 0.08853666 0.98857472]
 [0.24138093 0.43329179 0.92876078]]

Random normal (mean=0, std=1):
[[-0.49262098 -1.36156708 -0.75672532]
 [ 0.62580998  0.29302839 -0.01190148]]
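The random values above will differ on every run. If you need repeatable results, one option is NumPy's `np.random.default_rng` generator with a fixed seed. This is a sketch of an alternative to the `np.random.random` and `np.random.normal` calls used above, and the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same values
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

sample_a = rng_a.random((2, 3))   # uniform values in [0, 1)
sample_b = rng_b.random((2, 3))
print(np.array_equal(sample_a, sample_b))  # True

# Normal distribution with mean 0 and standard deviation 1
normal_sample = rng_a.normal(loc=0.0, scale=1.0, size=(2, 3))
print(normal_sample.shape)        # (2, 3)
```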


## Introducing Pandas

Pandas is designed primarily for manipulating tabular (i.e., 2D) data.

The *DataFrame* is Pandas' primary data structure - a 2D table where columns are *Series* (1D arrays) that can have different types. This matches how we typically think about data: observations (rows) described by named attributes (columns).

As a first example, convert the table array (`arr_2d`) into a DataFrame.


In [None]:
# by convention, Pandas is loaded as pd
import pandas as pd

# Let's convert our table array into a pandas DataFrame
# Same data, but now with column names
df = pd.DataFrame(arr_2d, columns=['A', 'B', 'C'])
print(df)

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9


Pandas simplifies column access by supporting named references.

In [None]:
# NumPy uses somewhat cryptic indexing syntax
print("NumPy - first column:", arr_2d[:,0])

# Pandas allows named columns
print("\nPandas - column 'A':")
print(df['A'])  # direct access by name

NumPy - first column: [1 4 7]

Pandas - column 'A':
0    1
1    4
2    7
Name: A, dtype: int64
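Named access extends naturally to several columns at once, and a column can be converted back to a NumPy array when needed. A short sketch that rebuilds the same table so it runs on its own:

```python
import numpy as np
import pandas as pd

arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(arr_2d, columns=['A', 'B', 'C'])

# A list of names selects several columns and returns a DataFrame
subset = df[['A', 'C']]
print(subset)

# to_numpy() converts a column back to a NumPy array
col_b = df['B'].to_numpy()
print(col_b)                      # [2 5 8]
```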


Conceptually,

- A Pandas Series is like a 1D NumPy array with (optional) labels for indices.
- A Pandas DataFrame is like a group of named Series that represent the columnar data.
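A minimal sketch of both ideas follows. The index labels (`'ana'`, `'raj'`, `'lee'`) and values are made up for illustration.

```python
import pandas as pd

# A Series: 1D values plus a label for each position
scores = pd.Series([82.5, 91.0, 88.5], index=['ana', 'raj', 'lee'], name='score')

print(scores['raj'])              # 91.0 (lookup by label)
print(scores.iloc[0])             # 82.5 (lookup by position)

# Each column of a DataFrame is itself a Series
df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8]})
print(isinstance(df['A'], pd.Series))  # True
```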

Where NumPy arrays are homogeneous, every column (Series) in a Pandas DataFrame has its own type.

This is much more in line with real world use cases, where each attribute represents different observed values. For example, a table of collected data might include the age, sex / gender, country of origin, date of birth, and score for each participant in a study.

In [None]:
# Create a DataFrame with example participant data
df = pd.DataFrame({
   'age': [25, 34, 28, 22],
   'gender': ['F', 'M', 'F', 'NB'],
   'country': ['USA', 'India', 'Canada', 'Mexico'],
   'birth_date': pd.to_datetime(['1999-03-15', '1990-08-22', '1996-11-03', '2002-05-30']),
   'score': [82.5, 91.0, 88.5, 79.0]
})

# Show the data types of each column
print("Column types:")
print(df.dtypes)

# Show the first few rows of data
print("\nFirst few rows:")
print(df)

Column types:
age                    int64
gender                object
country               object
birth_date    datetime64[ns]
score                float64
dtype: object

First few rows:
   age gender country birth_date  score
0   25      F     USA 1999-03-15   82.5
1   34      M   India 1990-08-22   91.0
2   28      F  Canada 1996-11-03   88.5
3   22     NB  Mexico 2002-05-30   79.0
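Because each column keeps its own type, each supports operations suited to that type. The sketch below rebuilds a reduced version of the participant table and applies one numeric, one text, and one date operation. The `.str` and `.dt` accessors are standard Pandas features, though which operations you need will depend on your data.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 34, 28, 22],
    'country': ['USA', 'India', 'Canada', 'Mexico'],
    'birth_date': pd.to_datetime(['1999-03-15', '1990-08-22', '1996-11-03', '2002-05-30']),
    'score': [82.5, 91.0, 88.5, 79.0],
})

# Numeric column: arithmetic and statistics
mean_score = df['score'].mean()
print(mean_score)                 # 85.25

# Text column: string methods via the .str accessor
upper_countries = df['country'].str.upper().tolist()
print(upper_countries)            # ['USA', 'INDIA', 'CANADA', 'MEXICO']

# Date column: date parts via the .dt accessor
birth_years = df['birth_date'].dt.year.tolist()
print(birth_years)                # [1999, 1990, 1996, 2002]
```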
