<a href="https://colab.research.google.com/github/krauseannelize/nb-py-ms-exercises/blob/sprint03/notebooks/s03_pandas_foundation/30_intro_to_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to `Pandas`

## What is `Pandas`?

- `Pandas` is an open-source Python library for data manipulation and analysis
- It provides two core data structures for handling structured data efficiently:
  - **Series** (1D)
  - **DataFrame** (2D)
- It is conventionally imported under the alias **`pd`**

## Why `Pandas`?

- Simplifies data cleaning and preparation
- Enables efficient data exploration and analysis
- Integrates well with other Python libraries (e.g. `NumPy`, `Matplotlib`, `Scikit-learn`)
- `Pandas` is widely used in various fields including data science, machine learning, finance, research, and more

## Installing `Pandas` library

```python
!pip install pandas
```

## Importing the `Pandas` & `NumPy` libraries

In [64]:
import pandas as pd
import numpy as np

## `Pandas` Series

`Pandas` **Series**:

- is a one-dimensional labeled array
- can hold any data type (int, str, float, etc.)
- labels its values with a zero-based integer index by default
- accepts custom labels through the `index` parameter

It is very similar to a `NumPy` **array** (in fact, it is built on top of the `NumPy` array object). Two things differentiate a `Pandas` **Series** from a `NumPy` **array**:

- a **Series** can have _axis labels_, meaning it can be indexed by a label instead of only by a numeric position
- a **Series** doesn't need to hold numeric data; it can hold any arbitrary Python object
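
To make the two differences concrete, here is a minimal sketch. The variable names (`arr`, `ser`, `mixed`) are only illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed by integer position only
arr = np.array([10, 20, 30])
print(arr[1])        # 20

# A Series carries labels, so the same value can be looked up by name
ser = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(ser["b"])      # 20

# A Series can also hold arbitrary Python objects; its dtype is then `object`
mixed = pd.Series([1, "two", 3.0, None])
print(mixed.dtype)   # object
```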

## Creating a Series

The `pd.Series()` constructor is used to create a **Series** by passing any of the following to it:

- a list
- a `NumPy` **array**
- a dictionary
- a scalar value and an index
- a collection of arbitrary Python objects, such as functions

Key attributes of a `Pandas` **Series** include:

| Attribute | Description |
| --- | --- |
| **values** | Returns the data as a `NumPy` array |
| **index** | Returns the index labels |
| **dtype** | The data type of the values in the **Series** |
| **shape** | A tuple with the dimensions of the **Series** (always one entry, e.g. `(5,)`) |
| **size** | The number of elements in the **Series** |
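
The attributes in the table can be inspected directly. The short sketch below uses a small illustrative Series named `labeled`, which is not part of the original notebook.

```python
import pandas as pd

labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(labeled.values)  # the underlying data as a NumPy array
print(labeled.index)   # the index labels
print(labeled.dtype)   # int64
print(labeled.shape)   # (3,)
print(labeled.size)    # 3
```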

In [65]:
# from a list
my_list = [10, 20, 30, 40, 50]
my_series = pd.Series(my_list)
print(my_series)

0    10
1    20
2    30
3    40
4    50
dtype: int64


In [66]:
# from a list with custom index labels
my_list = [10, 20, 30, 40, 50]
my_series = pd.Series(data=my_list, index=['label1', 'label2', 'label3', 'label4', 'label5'])
print(my_series)

label1    10
label2    20
label3    30
label4    40
label5    50
dtype: int64


In [67]:
# from a NumPy array
my_array = np.array([10, 20, 30, 40, 50])
my_series = pd.Series(my_array)
print(my_series)

0    10
1    20
2    30
3    40
4    50
dtype: int64


In [68]:
# from a NumPy array using custom labels
my_array = np.array([10, 20, 30, 40, 50])
my_labels = ['label1', 'label2', 'label3', 'label4', 'label5']
my_series = pd.Series(data=my_array, index=my_labels)
print(my_series)

label1    10
label2    20
label3    30
label4    40
label5    50
dtype: int64


In [69]:
# from a dictionary - dictionary key becomes series labels
my_dict = {'key1': 10, 'key2': 20, 'key3': 30, 'key4': 40, 'key5': 50}
my_series = pd.Series(my_dict)
print(my_series)

key1    10
key2    20
key3    30
key4    40
key5    50
dtype: int64


In [70]:
# from a dictionary using a specific index to select only certain values
my_series_subset = pd.Series(my_dict, index=['key3', 'key1'])
print(my_series_subset)

key3    30
key1    10
dtype: int64
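
One case worth knowing about when passing an `index` together with a dictionary: a label that is not a key in the dictionary does not raise an error, it produces a missing value. A small sketch, using a hypothetical dictionary `prices` and a made-up label `key9`:

```python
import pandas as pd

prices = {"key1": 10, "key2": 20, "key3": 30}

# "key9" is not in the dictionary, so its value is missing (NaN)
subset = pd.Series(prices, index=["key3", "key9"])
print(subset)

# NaN is a float, so the whole Series is stored as float64
print(subset.dtype)
print(subset.isna())
```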


In [71]:
# from a scalar value and an index
my_series = pd.Series(10, index=[0, 1, 2, 3, 4])
print(my_series)

0    10
1    10
2    10
3    10
4    10
dtype: int64


In [72]:
# from functions (although unlikely that you will use this)

# function counts the number of vowels in a string
def find_vowels(text):
  vowels = 0
  for letter in text:
    if letter.lower() in 'aeiou':
      vowels += 1
  return vowels

# function counts the number of consonants in a string
def find_consonants(text):
  consonants = 0
  for letter in text:
    # .isalpha() checks if character is an alphabetical letter
    if letter.isalpha() and letter.lower() not in 'aeiou':
      consonants += 1
  return consonants

# create Pandas Series to store the functions
my_functions = pd.Series([len, find_vowels, find_consonants])

def analyze_string():
  try:
    # prompt user for string
    text = input("Enter a string: ")

    print(f"Your string has {my_functions[0](text)} characters.")
    print(f"It has {my_functions[1](text)} vowels.")
    print(f"It has {my_functions[2](text)} consonants.")

  except ValueError:
    print("Invalid input")
    return None

analyze_string()

Enter a string: Bob the Builder! Can we fix it?
Your string has 31 characters.
It has 9 vowels.
It has 14 consonants.


## Using a Series Index

The key to using a **Series** is to understand its index. It's a key-value system that allows for fast lookups, much like a Python dictionary.

You can use the index label directly inside square brackets `[]` to get a single value.

In [73]:
num_series = pd.Series(data=[10, 20, 30, 40, 50], index = ['a', 'b', 'c', 'd', 'e'])
print(num_series) # print the entire series

a    10
b    20
c    30
d    40
e    50
dtype: int64


In [74]:
print(num_series['d'])  # get the value of the 'd' label

40
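
Besides plain square brackets, a Series offers explicit accessors that state whether a lookup is by label or by position. The sketch below repeats `num_series` so that it runs on its own; the label `"z"` is just an example of a label that does not exist.

```python
import pandas as pd

num_series = pd.Series(data=[10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

print(num_series.loc["d"])   # label-based lookup -> 40
print(num_series.iloc[3])    # position-based lookup -> 40

# membership tests check the index labels, like keys in a dictionary
print("d" in num_series)     # True

# .get() returns a default instead of raising KeyError for a missing label
print(num_series.get("z", default=-1))  # -1
```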


## Basic Operations on Series

- perform element-wise arithmetic, with values matched by index label
- access subsets of the **Series** using index labels
- use Boolean conditions to filter data
- perform aggregations on the **Series** using methods like `.sum()`, `.mean()`, `.max()` and `.min()`

In [75]:
# arithmetic operations

ser1 = pd.Series(data=[10, 20, 30, 40], index = ['C', 'A','B','E'])
ser2 = pd.Series(data=[5, 12, 23, 34], index = ['C', 'A','D', 'E'])
print(ser1 + ser2)

# when a label exists in only one of the two Series, the result for that label
# is NaN because there is no matching value to add

A    32.0
B     NaN
C    15.0
D     NaN
E    74.0
dtype: float64
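
If the `NaN` results are not wanted, one possible approach is the `.add()` method with `fill_value`, which substitutes a value for labels missing from one side. This is a sketch of one option, not the only way to handle misaligned labels.

```python
import pandas as pd

ser1 = pd.Series(data=[10, 20, 30, 40], index=["C", "A", "B", "E"])
ser2 = pd.Series(data=[5, 12, 23, 34], index=["C", "A", "D", "E"])

# treat a label missing from one Series as 0 instead of producing NaN
total = ser1.add(ser2, fill_value=0)
print(total)
```

With `fill_value=0`, label `B` keeps its value from `ser1` (30) and label `D` keeps its value from `ser2` (23).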


In [76]:
print(ser1 * ser2)

A     240.0
B       NaN
C      50.0
D       NaN
E    1360.0
dtype: float64


In [77]:
# filtering

print(ser1[ser1 > 25])

B    30
E    40
dtype: int64


In [78]:
# aggregation

print("Sum:", ser1.sum())
print("Mean:", ser1.mean())

Sum: 100
Mean: 25.0
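
The list at the start of this section also mentions accessing subsets and the `.max()` and `.min()` methods, which the cells above do not exercise. A brief sketch, repeating `ser1` so that it runs on its own:

```python
import pandas as pd

ser1 = pd.Series(data=[10, 20, 30, 40], index=["C", "A", "B", "E"])

# select several labels at once by passing a list
picked = ser1[["A", "E"]]
print(picked)

# label slices with .loc include both endpoints
label_range = ser1.loc["A":"E"]
print(label_range)

print("Max:", ser1.max())               # 40
print("Min:", ser1.min())               # 10
print("Label of max:", ser1.idxmax())   # E
```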


---

# Intro to `NumPy`

## What is `NumPy`?

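- `NumPy` is a numerical computing library for Python, built around a fast N-dimensional array and including linear algebra routines
- `NumPy` stands for **Numerical Python**
- It is conventionally imported under the alias **`np`**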

## Why `NumPy`?

- almost all of the libraries in the **PyData Ecosystem** rely on `NumPy` as one of their main building blocks
- `NumPy` arrays are used internally in `Pandas` for efficient data storage and computation
- it is fast because its core array operations run in compiled C code rather than in Python loops
- can handle large datasets with efficient array operations

## Installing `NumPy` library

```python
!pip install numpy
```

## Importing the `NumPy` and `Pandas` libraries

In [79]:
import numpy as np
import pandas as pd

## Creating `NumPy` Arrays

Arrays can be created in several ways:

- converting lists or tuples, including nested ones
- filling an array with zeros, ones, or another constant
- generating a range of values
- generating random values

In [80]:
# converting a list into an array
my_list = [10, 20, 30, 40, 50]
np.array(my_list)

array([10, 20, 30, 40, 50])

In [81]:
# converting a nested list into an array
my_matrix = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
np.array(my_matrix)

array([[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])

In [82]:
# creating a 3x5 array of zeros
# default datatype is float
np.zeros((3,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [83]:
# creating a 4x8 array of ones
# adding int datatype
np.ones(shape=(4,8), dtype='int')

array([[1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1]])

In [84]:
# creating a 2x3 array with a constant
np.full(shape=(2,3), fill_value=8, dtype='int')

array([[8, 8, 8],
       [8, 8, 8]])

In [85]:
# creating an array from 3 to 30 with a step of 3
np.arange(3, 31, 3)

array([ 3,  6,  9, 12, 15, 18, 21, 24, 27, 30])

In [86]:
# creating an array with 5 evenly spaced values between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
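
The two range functions above differ in how they treat the end value: `np.arange` takes a step and excludes `stop`, while `np.linspace` takes a number of points and includes `stop` by default. A short sketch of the difference (the names `by_step` and `by_count` are illustrative):

```python
import numpy as np

# arange: fixed step, stop value excluded
by_step = np.arange(0, 1, 0.25)
print(by_step)    # [0.   0.25 0.5  0.75]

# linspace: fixed number of points, stop value included by default
by_count = np.linspace(0, 1, 5)
print(by_count)   # [0.   0.25 0.5  0.75 1.  ]

# endpoint=False makes linspace exclude the stop value as well
open_ended = np.linspace(0, 1, 4, endpoint=False)
print(open_ended)
```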

## Creating Random `NumPy` Arrays

### `rand`

Create an array of the given shape and populate it with random samples from a uniform distribution over `[0, 1)`.

In [87]:
# creating a random array
np.random.rand(3, 4)

array([[0.48866607, 0.30236561, 0.16817913, 0.79237033],
       [0.12040477, 0.82127751, 0.71987327, 0.68194714],
       [0.06129358, 0.68803277, 0.056451  , 0.25705492]])

### `randn`

Return one or more samples from the standard normal distribution (mean 0, standard deviation 1), unlike `rand`, which samples from a uniform distribution:

In [88]:
np.random.randn(4, 3)

array([[ 1.0120467 , -0.13811967,  1.27927516],
       [ 1.30646441, -0.67820475,  1.27321861],
       [ 0.19191152, -0.2783504 ,  1.8141007 ],
       [ 0.59781609, -0.22063936, -0.1436565 ]])

### `randint`

Return random integers from `low` (inclusive) to `high` (exclusive).

In [89]:
# only return 1 random number
np.random.randint(1, 100)

76

In [90]:
# return a set number of random numbers
np.random.randint(1, 100, 10)

array([86, 33, 32, 36,  8, 25, 21, 91, 20,  2])
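
The random outputs above change on every run. If reproducible results are needed, one option is NumPy's `Generator` interface created with `np.random.default_rng` and a seed. The sketch below is an alternative to the `np.random.rand`, `randn` and `randint` calls used in this notebook, and the seed value 42 is arbitrary.

```python
import numpy as np

# a seeded generator produces the same sequence on every run
rng = np.random.default_rng(seed=42)

uniform = rng.random((3, 4))                   # uniform over [0, 1), like rand
normal = rng.standard_normal((4, 3))           # standard normal, like randn
whole_numbers = rng.integers(1, 100, size=10)  # low inclusive, high exclusive

# a second generator with the same seed repeats the same values
rng_again = np.random.default_rng(seed=42)
same_uniform = rng_again.random((3, 4))
print(np.array_equal(uniform, same_uniform))   # True
```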

## Array Attributes & Methods

- arrays expose both attributes and methods
- attributes are properties of the array that provide useful information about it
- methods are functions associated with the array object

| Attribute/Method | Example Code | Description |
| --- | --- | --- |
| `.ndim` | `array.ndim` | Returns number of dimensions |
| `.size` | `array.size` | Returns total number of elements |
| `.dtype` | `array.dtype` | Returns data type of array elements |
| `.reshape()` | `array.reshape()` | Changes shape without altering array data |
| `.flatten()` | `array.flatten()` | Returns a 1D copy of the array |
| `.transpose()` | `array.transpose()` | Swaps array axes (rows/columns) |
| `.astype()` | `array.astype()` | Convert data type of array elements |
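
The entries in the table can be tried on a small array. This is a minimal sketch; the names `grid`, `table` and `flat` are illustrative.

```python
import numpy as np

grid = np.arange(1, 7)          # [1 2 3 4 5 6]
print(grid.ndim)                # 1
print(grid.size)                # 6
print(grid.dtype)               # an integer dtype (platform dependent)

table = grid.reshape(2, 3)      # same data arranged as 2 rows x 3 columns
print(table.shape)              # (2, 3)
print(table.transpose().shape)  # (3, 2)

# flatten() returns a copy, so changing it leaves the original untouched
flat = table.flatten()
flat[0] = 99
print(table[0, 0])              # 1

print(table.astype(float).dtype)  # float64
```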


## Arithmetic Operations & Statistical Methods

Arithmetic operations are **element-wise** by default.

In [111]:
matrix1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix2 = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
print(f"The first array:\n{matrix1}")
print(f"The second array:\n{matrix2}")
print(f"Addition was performed element-wise:\n{matrix1 + matrix2}")

The first array:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
The second array:
[[10 20 30]
 [40 50 60]
 [70 80 90]]
Addition was performed element-wise:
[[11 22 33]
 [44 55 66]
 [77 88 99]]
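
Because arithmetic is element-wise, `*` does not perform matrix multiplication. For the matrix product, the `@` operator can be used. A small sketch with two illustrative 2x2 arrays `a` and `b`:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

# element-wise: multiplies values at matching positions
elementwise = a * b
print(elementwise)      # [[ 10  40] [ 90 160]]

# matrix product: rows of a combined with columns of b
matrix_product = a @ b
print(matrix_product)   # [[ 70 100] [150 220]]
```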


`NumPy` provides **multiple aggregation** functions like:

- `sum`
- `mean`
- `min`
- `max`
- `std`
- `var`

They will be applied to the entire array by default, but using `axis` can change the direction of the aggregation:

- **axis=0**: columns (vertical/downward)
- **axis=1**: rows (horizontal/across)


In [112]:
matrix = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
print(f"The array is:\n{matrix}")
print(f"The sum of each column is:\n{matrix.sum(axis=0)}")
print(f"The mean of each row is:\n{matrix.mean(axis=1)}")
print(f"The smallest number in the array is: {matrix.min()}")
print(f"The largest number in the array is: {matrix.max()}")

The array is:
[[10 20 30]
 [40 50 60]
 [70 80 90]]
The sum of each column is:
[120 150 180]
The mean of each row is:
[20. 50. 80.]
The smallest number in the array is: 10
The largest number in the array is: 90
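
With a square matrix it is hard to see which axis was collapsed, since both results have the same length. A non-square array makes the effect of `axis` easier to see. The sketch below also touches `std` and `var` from the list above; the array `scores` is illustrative.

```python
import numpy as np

scores = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)

# axis=0 collapses the rows: one result per column
col_sums = scores.sum(axis=0)
print(col_sums, col_sums.shape)   # [5 7 9] (3,)

# axis=1 collapses the columns: one result per row
row_sums = scores.sum(axis=1)
print(row_sums, row_sums.shape)   # [ 6 15] (2,)

# std and var use the population formulas (ddof=0) by default
print(scores.var())
print(scores.std())
```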


## `NumPy` Indexing & Slicing

### 1D Array Slicing

You can access subarrays using the syntax: `array[start:stop:step]`

- `start`: index to begin slicing (inclusive)
- `stop`: index to end slicing (exclusive)
- `step`: interval between elements (optional)

### 2D Array Slicing

Slicing is done by specifying slices for rows and columns: `array[row_start:row_stop, col_start:col_stop]`

In [113]:
array_1d = np.array([10, 20, 30, 40, 50])
print(f"The array is:\n{array_1d}")
print(f"The third element is:\n{array_1d[2]}")
print(f"The first 3 elements are:\n{array_1d[:3]}")

The array is:
[10 20 30 40 50]
The third element is:
30
The first 3 elements are:
[10 20 30]
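
The `step` part of `array[start:stop:step]` and negative indices are not used in the cell above. A short sketch on the same values:

```python
import numpy as np

array_1d = np.array([10, 20, 30, 40, 50])

every_second = array_1d[::2]   # step of 2 -> [10 30 50]
middle = array_1d[1:4]         # start inclusive, stop exclusive -> [20 30 40]
reversed_arr = array_1d[::-1]  # negative step walks backwards -> [50 40 30 20 10]
last_two = array_1d[-2:]       # negative index counts from the end -> [40 50]

print(every_second, middle, reversed_arr, last_two)
```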


In [114]:
array_2d = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
print(f"The array is:\n{array_2d}")
print(f"The first row is:\n{array_2d[0]}")
print(f"The 1x1 slice holding the second element of the second row is:\n{array_2d[1:2, 1:2]}")

The array is:
[[10 20 30]
 [40 50 60]
 [70 80 90]]
The first row is:
[10 20 30]
The 1x1 slice holding the second element of the second row is:
[[50]]
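
Note that `array_2d[1:2, 1:2]` returns a 1x1 array, not a single value, because slices always keep their dimension. Indexing with plain integers returns the element itself. A short sketch of the difference, plus two other common 2D selections:

```python
import numpy as np

array_2d = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

element = array_2d[1, 1]     # integer indices -> the scalar 50
sub = array_2d[1:2, 1:2]     # slices -> a 2D array of shape (1, 1)
column = array_2d[:, 0]      # every row, first column -> [10 40 70]
block = array_2d[:2, 1:]     # first two rows, last two columns

print(element)
print(sub.shape)
print(column)
print(block)
```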


### Boolean Indexing

A Boolean mask identifies which array elements meet a condition; indexing the array with that mask returns only those elements.

In [115]:
# create a Boolean mask: which elements are greater than 4?
matrix = np.arange(1, 11)
print(f"The array is:\n{matrix}")
print(f"Check which numbers are greater than 4:\n{matrix > 4}")

The array is:
[ 1  2  3  4  5  6  7  8  9 10]
Check which numbers are greater than 4:
[False False False False  True  True  True  True  True  True]


In [117]:
# store the Boolean mask in a variable and use it to filter the array
bool_matrix = matrix > 4
print(f"Boolean mask is:\n{bool_matrix}")
print(f"The filtered array is:\n{matrix[bool_matrix]}")

Boolean mask is:
[False False False False  True  True  True  True  True  True]
The filtered array is:
[ 5  6  7  8  9 10]


In [118]:
# filter condition applied directly in indexing brackets
print(f"The filtered array is:\n{matrix[matrix > 6]}")

The filtered array is:
[ 7  8  9 10]


In [119]:
# conditions can be dynamic, e.g. based on variables
x = 3
print(f"Elements greater than x (where x = {x}):\n{matrix[matrix > x]}")

Elements greater than x (where x = 3):
[ 4  5  6  7  8  9 10]
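
Several conditions can be combined in one mask. The sketch below assumes the element-wise operators `&` (and), `|` (or) and `~` (not), with each condition wrapped in parentheses; Python's `and`/`or` keywords do not work on arrays. It also shows `np.where` as one way to replace values instead of filtering them. The name `values` is illustrative.

```python
import numpy as np

values = np.arange(1, 11)

between = values[(values > 3) & (values < 8)]   # [4 5 6 7]
outside = values[(values < 3) | (values > 8)]   # [ 1  2  9 10]
odd = values[~(values % 2 == 0)]                # [1 3 5 7 9]

# np.where(condition, a, b) picks a where the condition holds, b elsewhere
capped = np.where(values > 6, 6, values)        # [1 2 3 4 5 6 6 6 6 6]

print(between, outside, odd, capped)
```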
