## **#01: Python Basics**
- Instructor: [Jaeung Sim](https://jaeungs.github.io/) (University of Connecticut)
- Course: OPIM 5671: Data Mining and Time Series Forecasting
- Last updated: January 29, 2025

**Objectives**
1. Set your Colab environment using Google Drive.
2. Understand basic commands and functions in Python.
3. Understand data structures in Python.
4. Process a dataset using NumPy and Pandas.

**Contents**
* Part 1: Colab Environment
* Part 2: Basic Commands in Python
* Part 3: `NumPy`
* Part 4: `Pandas`

**References**
* [Welcome to Colab!](https://colab.research.google.com/)
* [NumPy User Guide](https://numpy.org/doc/stable/user/index.html)
* [Pandas Tutorial - W3Schools](https://www.w3schools.com/python/pandas/)
* [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/)
* [Introduction to Data Science with Python](https://nustat.github.io/DataScience_Intro_python/)

### **Part 1: Colab Environment**
* Connect your Google Drive with Colab.
* Set your path and load files.
* Import and export Python code.

Step 1. Go to https://colab.research.google.com/

Step 2. Start a notebook (File > "New notebook" or "Open notebook" or "Upload notebook")

Step 3. Open your Google Drive folder (Colab Notebooks) and check your notebook.

Step 4. Set a specific folder in Google Drive as your working directory.

In [None]:
# Import Google Drive to Colab
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Check your current directory
import os
os.getcwd()

In [None]:
# Set your working directory
os.chdir('/content/drive/My Drive/Colab Notebooks/OPIM 5671 (Spring 2025)') # Change the directory to your own
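
A mistyped folder name is a common reason `os.chdir` fails. The sketch below checks that a folder exists before switching to it. The helper name `change_dir_safely` is hypothetical (it is not part of Colab or the `os` module), and a temporary folder stands in for a real Google Drive path so the cell runs anywhere.

```python
import os
import tempfile

def change_dir_safely(path):
    """Change the working directory to `path` if it exists (hypothetical helper)."""
    if os.path.isdir(path):
        os.chdir(path)
        return True
    print(f"Directory not found: {path}")
    return False

# Demonstrate with a temporary folder instead of a real Google Drive path
original_dir = os.getcwd()
demo_dir = tempfile.mkdtemp()

changed = change_dir_safely(demo_dir)                                    # Folder exists
missing = change_dir_safely(os.path.join(demo_dir, "no_such_folder"))   # Folder does not exist

os.chdir(original_dir)  # Go back to where we started
print(changed, missing)
```

In Colab, you would pass your own folder under `/content/drive/My Drive/` after mounting the drive.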

### **Part 2: Basic Commands in Python**
* Execute and revise code cells.
* Import and use libraries.
* Define a function and calculate numbers.


#### **2.1. Understanding code cells**

In [None]:
# Press Ctrl + Enter to run this cell
print("Hello, Huskies!")

In [None]:
# Revise the code to print the sentence "Goodbye, Wolves!"
print("Hello, Huskies!")

In [None]:
# Define a variable and print its value
x = 5
y = "John"
print(x)
print(y)
print(type(x))
print(type(y))

In [None]:
# A variable keeps only its most recent assignment
x = 4       # x is of type int
x = "Sally" # x is now of type str
print(x)
x

In [None]:
# Define a variable with calculation
seconds_in_a_day = 24 * 60 * 60
print(seconds_in_a_day)

seconds_in_a_week = 7 * seconds_in_a_day
seconds_in_a_week

In [None]:
# The indented line belongs to the `if` block; without the indentation Python raises an error
if 5 > 2:
  print("Five is greater than two!") # Space or tab here

In [None]:
# Try the if...elif...else statement with different numbers
a = 400
b = 200
if b > a:
  print("b is greater than a")
elif a == b:
  print("a and b are equal")
else:
  print("a is greater than b")

#### **2.2. Defining a function**

In Python, a function is a block of organized, reusable code that is used to perform a single, related action. Functions provide better modularity for your application and a high degree of code reusability.

Defining a function in Python involves a few key components and a basic structure that can be extended based on the complexity of the task the function is designed to perform.

**1. Function Definition:** To define a function, you use the `def` keyword, followed by the function name and parentheses. The general syntax looks like this:

```python
def function_name(parameters):
    # function body
```

**2. Parameters (Optional):** Inside the parentheses, you can optionally list parameters separated by commas. Parameters are the names a function uses for its inputs; the values you pass when calling the function are called arguments, although the two terms are often used interchangeably. Parameters let you pass different values to the function each time you call it.

**3. Function Body:** After the colon, the next line starts the block of code known as the function body. This is where you write the code that defines what the function should do. The function body is indented, usually by four spaces. This indentation is essential, as Python uses it to mark where a block of code begins and ends.

**4. Return Statement (Optional):** Within the function body, you can optionally include a `return` statement. This statement specifies what value the function should return after it finishes executing. If there is no `return` statement, the function will return `None` by default.

**5. Calling a Function:** Once a function is defined, you can call it from other parts of your code using its name followed by parentheses. If the function expects parameters, you provide values within these parentheses.
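
The five components above can be seen together in a short sketch. The function names `rectangle_area` and `announce` are made up for illustration. The second function has no `return` statement, so calling it yields `None`.

```python
def rectangle_area(width, height):   # Definition with two parameters
    area = width * height            # Function body (indented)
    return area                      # Return statement

def announce(message):               # One parameter, no return statement
    print(f"[notice] {message}")

area = rectangle_area(3, 4)          # Call with the arguments 3 and 4
result = announce("Area computed")   # Prints the message, then returns None

print(area)
print(result)
```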

In [None]:
# Define a function
def greet(name): # Function definition
    return f"Hello, {name}!" # Function body (`f` formatted string)

# Call the function
print(greet("Huskies"))

In [None]:
def add_numbers(a, b):
    result = a + b
    return result

# Calling the function
sum_value = add_numbers(5, 3)
print(sum_value)

In [None]:
def square(n):
    return n * n

for i in range(1, 6):
    print(f"Square of {i} is {square(i)}")

### **Part 3: `NumPy`**

`NumPy` (**Num**erical **Py**thon) supports large, multi-dimensional arrays and matrices and provides a collection of mathematical functions that operate on them efficiently.

In [None]:
# Install NumPy library
!pip install numpy # Already installed

In [None]:
# Import NumPy
import numpy as np # Shorten the imported name to np for better readability of code using NumPy

#### **3.1. Array**

An **array** is the central data structure of the `NumPy` library. It is a grid of values that can be indexed in various ways, and it stores information about the raw data, how to locate an element, and how to interpret an element.

**Dimension of arrays**
* One-dimensional: like **a list**
* Two-dimensional: like **a table**
* Three-dimensional: like **a set of tables**
* An arbitrary number of dimensions? Generalized as `ndarray`!

<img src="https://nustat.github.io/DataScience_Intro_python/Datasets/numpy_image.png" width="1000" height="350">

Most `NumPy` arrays have some restrictions. For instance:
* All elements of the array must be of the **same type** of data.
* Once created, the **total size** of the array **can't change**.
* The shape must be **rectangular**, **not jagged**.
  * e.g., each row of a two-dimensional array must have the same number of columns.
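
The first two restrictions can be observed directly. The sketch below is illustrative only; the variable names are arbitrary, and `np.append` is used here as one way to get a longer array (it builds a new array rather than growing the old one).

```python
import numpy as np

# Same type: mixing ints and a float stores every element as a float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# Same type: mixing a number and a string stores every element as a string
text_mixed = np.array([1, "a"])
print(text_mixed.dtype.kind)   # 'U' stands for Unicode string

# Fixed size: np.append returns a new array and leaves the original unchanged
base = np.array([1, 2, 3])
longer = np.append(base, 4)
print(base.size, longer.size)
```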

**3.1.1. Creating arrays from scratch**

In [None]:
# Create and print example arrays
a = np.array([1, 2, 3, 4, 5, 6])
A = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(a[0])
print(a[1])
print(A[0])
print(A[1])

In [None]:
# Create various arrays
np.zeros(3) # An array filled with 0's

In [None]:
np.ones(3) # An array filled with 1's

In [None]:
np.arange(5) # An array with a range of elements

**3.1.2. Basic attributes of `numpy` array**

In [None]:
# Let's play with arrays `a` and `A`
a

In [None]:
A

`ndim`: shows the number of dimensions (or axes) of the array

In [None]:
print("a.ndim is", a.ndim)
print("A.ndim is", A.ndim)

`shape`: the size of the array in each dimension. $(m, n)$ for $m$ rows and $n$ columns. The length of the shape tuple is the rank or the number of dimensions `ndim`.

In [None]:
print("a.shape is", a.shape)
print("A.shape is", A.shape)

`size`: the total number of elements of the array, equivalent to the product of the elements of shape

In [None]:
print("a.size is", a.size)
print("A.size is", A.size)
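
The relationships among `ndim`, `shape`, and `size` can be checked with a few lines. This is a small sketch that recreates the array `A` from above. `reshape` is not covered in this notebook and appears only to show that a rearranged array keeps the same `size`.

```python
import numpy as np

A = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

rows, cols = A.shape
print(len(A.shape) == A.ndim)   # Length of the shape tuple equals ndim
print(rows * cols == A.size)    # Product of the shape entries equals size

# Rearranging 3 x 4 into 2 x 6 changes the shape but not the total size
B = A.reshape(2, 6)
print(B.shape, B.size)
```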

`dtype`: the type of elements in the array

* Data type examples are available here: <https://www.geeksforgeeks.org/python-data-types/>



In [None]:
print("a.dtype is", a.dtype)
print("A.dtype is", A.dtype)

`T`: returns the transpose of the `NumPy` array (a one-dimensional array is returned unchanged)

In [None]:
a.T

In [None]:
A.T

**3.1.3. Array operations**

You can add, subtract, multiply, and divide arrays. Also, you can use various statistical functions, such as maximum, minimum, sum, mean, product, and standard deviation.

In [None]:
# Define two arrays
data = np.array([1, 2])
ones = np.ones(2, dtype=int) # An array with 1's
print(data)
print(ones)

In [None]:
# Add, subtract, multiply, and divide arrays
print(data + ones)
print(data - ones)
print(data * data)
print(data / data)

In [None]:
# Find the maximum, minimum, and sum of the elements in 'data' array
print(data.max())
print(data.min())
print(data.sum())

In [None]:
# Find the mean, product, and standard deviation of the elements in 'data' array
print(data.mean())
print(data.prod())
print(data.std())
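
To see what `std()` computes, the sketch below rebuilds it by hand for the same `data` array. By default NumPy divides by the number of elements (the population formula); passing `ddof=1` divides by one less (the sample formula), which is the default that `pandas` uses for its own `std()`.

```python
import numpy as np

data = np.array([1, 2])

# Population standard deviation: square root of the mean squared deviation
manual_std = np.sqrt(((data - data.mean()) ** 2).mean())
print(manual_std, data.std())   # Both are 0.5

# Sample standard deviation: divide by (n - 1) instead of n
print(data.std(ddof=1))
```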

### **Part 4: `Pandas`**

`Pandas` provides two primary data structures: `Series` (one-dimensional) and `DataFrame` (two-dimensional, like a table). With Pandas, you can easily clean, filter, transform, and visualize data, making it an essential tool for data wrangling in data science and analytics. It is built on top of `NumPy` and integrates seamlessly with other Python libraries.

In [None]:
# Run this code to import the NumPy and Pandas modules
import numpy as np
import pandas as pd

#### **4.1. Series**

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

> **Series (`Pandas`) vs. Array (`NumPy`)**
>
> While Series (`pandas`) and array (`numpy`) may seem similar as one-dimensional arrays, their differences in terms of data handling capabilities, efficiency, and use cases make them suitable for different scenarios in data analysis and scientific computing.
>
> **1. Use Case:** If you need labeled data, are working with heterogeneous data, or need to align data by labels, `Series` is more suitable. It's also preferable when working with tabular data and when integrating with `pandas` DataFrames.
>
> **2. Performance:** For numerical operations, especially on large datasets where performance is a concern and where data homogeneity is maintained, `array` (`numpy`) is generally faster and more memory-efficient.
>
> **3. Functionality:** `Series` provides more functionalities (like handling missing data seamlessly) that are very useful in data analysis and manipulation, especially in data science workflows.
>
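
One concrete difference is how the two treat missing values. In this small sketch (the values and labels are arbitrary), the `NumPy` mean becomes `nan` as soon as one element is missing, while the `Series` mean skips the missing element by default.

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]
arr = np.array(values)
ser = pd.Series(values, index=["a", "b", "c"])

print(arr.mean())   # nan: the missing value propagates
print(ser.mean())   # 2.0: the missing value is skipped
print(ser["c"])     # Labeled access is available on the Series
```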

**4.1.1. Creating a Series**

In [None]:
# Create a series from a list
s1 = [1, 7, 2]
myvar = pd.Series(s1)
print(myvar)

In [None]:
print(myvar[0]) # Call the first item

In [None]:
# Create a series with labels
s1 = [1, 7, 2]
myvar = pd.Series(s1, index=["x", "y", "z"])
print(myvar)

In [None]:
print(myvar.iloc[0]) # Call an item by position (myvar[0] is deprecated when the index holds labels)
print(myvar["x"]) # Call an item by its label

In [None]:
# Create a series from dictionary 'd1'
d1 = {"a": 0.0, "b": 1.0, "c": 2.0}
pd.Series(d1)

In [None]:
pd.Series(d1, index=["d", "c", "b", "a"]) # Change the index order

In [None]:
# Create a series from a dictionary 'calories'
calories = {"day1": 1420, "day2": 1380, "day3": 1390}
myvar = pd.Series(calories, index = ["day1", "day2"]) # Keep only the listed keys
print(myvar) # Only the first two days appear

**4.1.2. Exploring Series and Operations**

In [None]:
# Define a Series
s2 = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
print(s2)

In [None]:
# Series is ndarray-like
print(s2.iloc[0]) # 1st item (positional access)
print("----------")
print(s2.iloc[:3]) # First three items
print("----------")
print(s2[s2 > s2.median()]) # Items above median
print("----------")
print(s2.iloc[[1, 2, 3]]) # 2nd - 4th items

In [None]:
# Series is dict-like
print(s2["a"]) # Item with index "a"
print(s2["e"])
print("e" in s2) # If "e" is in series "s2"
print("f" in s2)
print(s2.get("e")) # Item with index "e"
print(s2.get("f")) # Return 'None'

In [None]:
# Vectorized operations
print(s2 + s2)
print("----------")
print(s2 - s2)
print("----------")
print(s2 * s2)
print("----------")
print(s2 / s2)
print("----------")
print(np.exp(s2)) # Exponential of each item
print("----------")
print(s2[1:] + s2[:-1]) # Values are aligned by label; labels missing from either side give NaN
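
The last line of the cell above adds two sub-series whose labels only partly overlap. The sketch below repeats the idea with fixed numbers so the result is easy to follow; the Series `s` is made up for illustration. `Series.add` with `fill_value=0` is one way to treat a missing label as zero instead of producing `NaN`.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Labels "b" and "c" on the left, "a" and "b" on the right: only "b" is in both
total = s.iloc[1:] + s.iloc[:-1]
print(total)

# Treat a label missing from one side as 0 rather than NaN
filled = s.iloc[1:].add(s.iloc[:-1], fill_value=0)
print(filled)
```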

#### **4.2. DataFrame**

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used `pandas` object.
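
The "dict of Series" view can be made concrete with a short sketch. The two Series below are made up for illustration and deliberately have different lengths: `pandas` aligns them on their labels and fills the gap with `NaN`.

```python
import pandas as pd

calories = pd.Series([420, 380], index=["day1", "day2"])
duration = pd.Series([50, 40, 45], index=["day1", "day2", "day3"])

# Each dictionary key becomes a column; rows are the union of the labels
df = pd.DataFrame({"calories": calories, "duration": duration})
print(df)
print(df.dtypes)   # 'calories' becomes float because it now holds a NaN
```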



**4.2.1. Creating and Exploring a DataFrame**

In [None]:
# Create a dictionary with two lists
d1 = {
  "calories": [420, 380, 390, 522],
  "duration": [50, 40, 45, 36]
}

# Load data into a DataFrame object
df1 = pd.DataFrame(d1)

print(df1)

In [None]:
# Explore rows
print(df1.loc[0]) # 1st row
print('------------')
print(df1.loc[[0, 2]]) # 1st and 3rd rows

In [None]:
# Explore rows with named indexes
df1 = pd.DataFrame(d1, index = ["day1", "day2", "day3", "day4"])

df1

In [None]:
print(df1.loc["day1"]) # 1st row
print('------------')
print(df1.loc[0]) # Raises a KeyError because 0 is no longer a row label

In [None]:
# Explore columns
print(df1["calories"]) # 'calories' column
print('------------')
print(df1["duration"]) # 'duration' column
print('------------')
print(df1[["calories", "duration"]]) # 'calories' and 'duration' column

In [None]:
# Add columns
df1["joules"] = df1["calories"] * 4.184
df1["hours"] = df1["duration"] / 60
df1["minutes"] = "minutes" # See what happens
df1["seconds"] = df1["duration"][:2] * 60 # Only the first two rows get values; the rest become NaN

print(df1)
print('------------')

df1.insert(1, "order", np.arange(1, 5)) # DataFrame.insert(loc, column, value): add 'order' at column position 1
print(df1)

In [None]:
# Delete columns
del df1["minutes"]
df1.pop("seconds")

df1

In [None]:
# Transposing
print(df1.transpose())
print('------------')
print(df1[:2].transpose()) # Keep the first two rows and then transpose

**4.2.2. Merging and Joining**

`pandas` provides several ways to combine `Series` and `DataFrame` objects, using set logic on the indexes and relational-algebra operations for joins and merges. Its in-memory joins are full-featured and fast, and they work much like joins in a relational database queried with SQL.

This library offers a single function, `merge()`, as the entry point for all standard database join operations between `DataFrame` or named `Series` objects:

```python
pd.merge(
    left,
    right,
    how="inner",
    on=None,
    left_on=None,
    right_on=None,
    left_index=False,
    right_index=False,
    sort=False,
    suffixes=("_x", "_y"),
    copy=True,
    indicator=False,
    validate=None,
)
```

* `left`: A DataFrame or named Series object.
* `right`: Another DataFrame or named Series object.
* `on`: Column or index level names to join on. Must be found in both the left and right DataFrame and/or Series objects. If not passed and `left_index` and `right_index` are `False`, the intersection of the columns in the DataFrames and/or Series will be inferred to be the join keys.
* `left_on`: Columns or index levels from the left DataFrame or Series to use as keys. Can either be column names, index level names, or arrays with length equal to the length of the DataFrame or Series.
* `right_on`: Columns or index levels from the right DataFrame or Series to use as keys. Can either be column names, index level names, or arrays with length equal to the length of the DataFrame or Series.
* `left_index`: If `True`, use the index (row labels) from the left DataFrame or Series as its join key(s). In the case of a DataFrame or Series with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame or Series.
* `right_index`: Same usage as `left_index` for the right DataFrame or Series
* `how`: One of `'left'`, `'right'`, `'outer'`, `'inner'`, `'cross'`. Defaults to inner.
* `sort`: Sort the result DataFrame by the join keys in lexicographical order. Defaults to `False`; leaving it off avoids the cost of sorting, and the order of the join keys then depends on the join type.
* `suffixes`: A tuple of string suffixes to apply to overlapping columns. Defaults to `('_x', '_y')`.
* `copy`: Always copy data (default `True`) from the passed DataFrame or named Series objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.
* `indicator`: Add a column to the output DataFrame called `_merge` with information on the source of each row. `_merge` is Categorical-type and takes on a value of `left_only` for observations whose merge key only appears in `'left'` DataFrame or Series, `right_only` for observations whose merge key only appears in `'right'` DataFrame or Series, and `both` if the observation’s merge key is found in both.
* `validate`: string, default `None`. If specified, checks if merge is of specified type.
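
Several of these parameters can be combined in one call. The sketch below uses two small made-up tables (`students` and `retakes`) whose key columns have different names, so it needs `left_on` and `right_on`; both tables also have a `score` column, so `suffixes` tells the two apart. The suffix strings and `validate="one_to_one"` are choices made for this illustration.

```python
import pandas as pd

students = pd.DataFrame({"student_id": [1, 2, 3], "score": [90, 85, 70]})
retakes = pd.DataFrame({"id": [2, 3, 4], "score": [88, 75, 60]})

merged = pd.merge(
    students,
    retakes,
    how="left",                      # Keep every row of 'students'
    left_on="student_id",            # Key column in the left table
    right_on="id",                   # Key column in the right table
    suffixes=("_first", "_retake"),  # Rename the overlapping 'score' columns
    validate="one_to_one",           # Raise an error if a key repeats on either side
    indicator=True,                  # Add a '_merge' column showing each row's source
)
print(merged)
```

Student 1 has no retake, so its `id` and `score_retake` are `NaN` and its `_merge` value is `left_only`.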

For more information, please refer to the following references:
* [**Pandas Merge, Join, Concatenate, and Compare**](https://pandas.pydata.org/docs/user_guide/merging.html)
* [**Pandas Cookbook**](https://pandas.pydata.org/docs/user_guide/cookbook.html)
* [**Pandas Comparison with SQL**](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html)

![image](https://media.geeksforgeeks.org/wp-content/uploads/joinimages.png)

In [None]:
# Left join with a single key
left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

result = pd.merge(left, right, how="left", on="key")
result

In [None]:
# Left join with two keys
left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K1"],
        "key2": ["L0", "L1", "L0", "L1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["L0", "L0", "L0", "L0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

result = pd.merge(left, right, how="left", on=["key1", "key2"])
result

In [None]:
# Right join with two keys (using 'left' and 'right' defined above)
result = pd.merge(left, right, how="right", on=["key1", "key2"]) # Set 'right' join
result

In [None]:
# Outer join with two keys (using 'left' and 'right' defined above)
result = pd.merge(left, right, how="outer", on=["key1", "key2"]) # Set 'outer' join
result

In [None]:
# Inner join with two keys (using 'left' and 'right' defined above)
result = pd.merge(left, right, how="inner", on=["key1", "key2"]) # Set 'inner' join
result

In [None]:
# Outer join with two keys + Indicator (using 'left' and 'right' defined above)
result = pd.merge(left, right, how="outer", on=["key1", "key2"], indicator="matched") # Set 'outer' join + Indicator
result

`DataFrame.join()` is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

```python
DataFrame.join(
  other,
  on=None,
  how='left',
  lsuffix='',
  rsuffix='',
  sort=False,
  validate=None
  )
```

Please run the following code to see how simple it is to join DataFrames with an index.

In [None]:
# Define two dataframes sharing an index
left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
)


right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
)

In [None]:
# Both statements produce the same result
print(left.join(right))
print('------------')
print(pd.merge(left, right, how="left", left_index=True, right_index=True))

In [None]:
# Both statements produce the same result
print(left.join(right, how="outer"))
print('------------')
print(pd.merge(left, right, how="outer", left_index=True, right_index=True))
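
The `lsuffix` and `rsuffix` parameters matter when both DataFrames share a column name. In the sketch below, the tables `q1` and `q2` are made up for illustration: a plain `join` raises a `ValueError` because the `sales` columns collide, and supplying suffixes resolves it.

```python
import pandas as pd

q1 = pd.DataFrame({"sales": [100, 200]}, index=["K0", "K1"])
q2 = pd.DataFrame({"sales": [110, 190]}, index=["K0", "K1"])

# Without suffixes, the overlapping column name causes an error
try:
    q1.join(q2)
except ValueError as err:
    print("join failed:", err)

# With suffixes, each 'sales' column gets a distinct name
both = q1.join(q2, lsuffix="_q1", rsuffix="_q2")
print(both)
```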

**4.2.3. GroupBy**

A `groupby()` operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

```python
DataFrame.groupby(
  by=None,
  axis=0,
  level=None,
  as_index=True,
  sort=True,
  group_keys=True,
  observed=False,
  dropna=True
  )
```

Please refer to [**pandas.DataFrame.groupby**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) for details of syntax.
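
The split-apply-combine idea can be traced by hand on a tiny table. In this sketch the `sales` DataFrame is made up for illustration. The loop performs the three steps explicitly, and the one-line `groupby` call gives the same numbers.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west", "west"],
        "units": [10, 20, 5, 15, 10],
    }
)

# Split into groups, apply the mean to each, and combine into a dictionary
manual = {}
for region, group in sales.groupby("region"):
    manual[region] = group["units"].mean()
print(manual)

# The same computation in a single call
by_region = sales.groupby("region")["units"].mean()
print(by_region)
```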

In [None]:
# Run the code to create a dataframe (24 rows x 7 columns)
import datetime

df3 = pd.DataFrame(
    {
        "Z": [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8],
        "A": ["one", "one", "two", "three"] * 6,
        "B": ["a", "b", "c"] * 8,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 4,
        "D": np.random.randn(24),
        "E": np.random.randn(24),
        "F": [datetime.datetime(2024, i, 1) for i in range(1, 13)]
        + [datetime.datetime(2024, i, 15) for i in range(1, 13)],
    }
)
print(df3)

In [None]:
# Briefly explore 'df3'
df3.head(5)

In [None]:
# Apply 'groupby' (worked in 2024 but raises an error now, because the text columns cannot be averaged)
print(df3.groupby(["A"]).mean()) # Mean by A
print('------------')
print(df3.groupby(["A", "B"]).mean()) # Mean by (A x B) combination
print('------------')
print(df3.groupby(["A", "B"]).sum()) # Sum by (A x B) combination

In [None]:
# Apply 'groupby' (revised on Jan 24, 2025)
# numeric_only=True keeps the numeric columns (Z, D, E) and skips the text and date columns
print(df3.groupby(["A"]).mean(numeric_only=True)) # Mean by A
print('------------')
print(df3.groupby(["A", "B"]).mean(numeric_only=True)) # Mean by (A x B) combination
print('------------')
print(df3.groupby(["A", "B"]).sum(numeric_only=True)) # Sum by (A x B) combination