
# Python Tools for Data Science Fundamentals - Activities

**Goal:** Introduce the foundational Python libraries, **NumPy** and **Pandas**, that serve as the building blocks for all data manipulation and analysis in Python.

This notebook focuses on core concepts. Each code cell opens with a commented **docstring** that states the **To Do** task and the **Expected Outcome** for the learning objective, followed by a worked solution.

---

## Section 1: Introduction to NumPy: The Array Foundation

**Goal:** Explore the NumPy library, focusing on its core object: the `ndarray` (N-dimensional array).

### Step 1: Import and Create a 1D Array (LO: Install and import the NumPy library. Create 1-dimensional and 2-dimensional NumPy arrays)


In [None]:
# """
# TO DO:
# 1. Import NumPy using the standard alias (`import numpy as np`).
# 2. Create a 1D array (`array_1d`) from a Python list using `np.array()`.

# EXPECTED OUTCOME:
# A NumPy `ndarray` object containing a sequence of values.
# """

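# Install NumPy if it is not already available in the environment.
!pip install numpy
import numpy as np

sequence_list = [1, 2, 3, 4, 5, 6]

# Convert the Python list into a 1D ndarray.
array_1d = np.array(sequence_list)
array_1d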

array([1, 2, 3, 4, 5, 6])

### Step 2: Create a 2D Array

In [None]:
# """
# TO DO:
# 1. Create a 2D array (`array_2d`) from a list of lists, representing a matrix (e.g., 3 rows, 2 columns).

# EXPECTED OUTCOME:
# A matrix-like `ndarray` suitable for linear algebra and tabular representation.
# """

array_2d = np.array([[1, 2], [3, 4], [5, 6]])

array_2d

array([[1, 2],
       [3, 4],
       [5, 6]])

### Step 3: Check Array Attributes (LO: Check the shape, size, and data type (.dtype) of an array)

In [None]:
# """
# TO DO:
# 1. Access the `.shape` attribute of `array_2d`.
# 2. Access the `.dtype` attribute of `array_2d`.

# EXPECTED OUTCOME:
# The tuple representing dimensions (e.g., (3, 2)) and the data type (e.g., `int64` or `float64`).
# """



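array_2d = np.array([[1, 2], [3, 4], [5, 6]])

# Dimensions as a (rows, columns) tuple: (3, 2) for this array.
array_shape = array_2d.shape
array_shape

# Element data type. Only the last expression in a cell is displayed,
# so the output below shows the dtype and not the shape.
array_dtype = array_2d.dtype
array_dtype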




dtype('int64')
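
The learning objective for this step also mentions the size of an array, which the cell above does not show. The following is a minimal sketch that prints all of the attributes together; the variable name `example` is hypothetical and not part of the activity.

```python
import numpy as np

# Hypothetical example array with 3 rows and 2 columns.
example = np.array([[1, 2], [3, 4], [5, 6]])

print(example.shape)  # dimensions as a (rows, columns) tuple
print(example.size)   # total number of elements (rows * columns)
print(example.ndim)   # number of dimensions
print(example.dtype)  # element type; the default integer type can vary by platform
```

Using `print()` for each attribute avoids the notebook behavior where only the last expression in a cell is displayed.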

### Step 4: Creating Arrays with `np.zeros` and `np.arange`

In [None]:
# """
# TO DO:
# 1. Use `np.zeros()` to create a 3x3 array initialized entirely with the value 0.
# 2. Use `np.arange()` to create an array with a sequential range of numbers.

# EXPECTED OUTCOME:
# Arrays created quickly for initialization, without needing explicit list definition.
# """

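import numpy as np

# 3x3 array of zeros; the shape is passed as a tuple and the default dtype is float64.
zeros_array = np.zeros((3, 3))
zeros_array

# Integers from 0 up to, but not including, 10.
range_array = np.arange(0, 10)
range_array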

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
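
Both functions accept optional arguments that are worth knowing early. As a short sketch, assuming an integer array of zeros and a stepped range are wanted (the names `zeros_int` and `evens` are illustrative only):

```python
import numpy as np

# np.zeros returns float64 values unless a dtype is requested.
zeros_int = np.zeros((2, 3), dtype=int)

# np.arange takes start, stop (exclusive), and an optional step.
evens = np.arange(0, 10, 2)

print(zeros_int)
print(evens)
```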

### Step 5: Vectorized Arithmetic

In [None]:
# """
# TO DO:
# 1. Create a 1D array, `array_A`.
# 2. Perform a simple element-wise operation (e.g., `array_A + 10` or `array_A ** 2`).

# EXPECTED OUTCOME:
# A new array where the operation has been applied individually to every element, demonstrating vectorization.
# """

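import numpy as np

array_A = np.array([2, 4, 6, 8, 10])
array_A

# Add 10 to every element without an explicit loop.
added_array = array_A + 10
added_array

# Square every element; this last expression produces the output below.
squared_array = array_A ** 2
squared_array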

array([  4,  16,  36,  64, 100])
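
To make the idea of vectorization concrete, the sketch below compares the same squaring operation written as a plain Python loop and as a single array expression. The names `values`, `squared_loop`, and `squared_vec` are assumptions made for this illustration.

```python
import numpy as np

values = [2, 4, 6, 8, 10]

# Plain Python: visit each element explicitly.
squared_loop = [v ** 2 for v in values]

# NumPy: one expression applied to every element of the array.
squared_vec = np.array(values) ** 2

print(squared_loop)
print(squared_vec)
```

Both approaches give the same numbers; the array version expresses the operation once for the whole array.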

---

## Section 2: Introduction to Pandas: The DataFrame

**Goal:** Introduce the Pandas library and its primary structure: the DataFrame.

### Step 1: Import Pandas and Create a Series (LO: Install and import the Pandas library. Understand the relationship between Series and DataFrames)

In [None]:
# """
# TO DO:
# 1. Import Pandas using the standard alias (`import pandas as pd`).
# 2. Create a Pandas Series (`series_A`) from a Python list.

# EXPECTED OUTCOME:
# A labeled, one-dimensional Pandas Series object (conceptualizing a single DataFrame column).
# """


import pandas as pd

series_A = pd.Series([1, 2, 3, 4, 5, 6])
series_A

| | 0 |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
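
The learning objective refers to the relationship between Series and DataFrames. One way to see it, sketched here with hypothetical names (`scores`, `frame`, `column`), is to turn a named Series into a one-column DataFrame and then select that column back out.

```python
import pandas as pd

# A Series with custom index labels and a name.
scores = pd.Series([88, 92, 79], index=["a", "b", "c"], name="score")

# A named Series becomes a one-column DataFrame.
frame = scores.to_frame()

# Selecting that column returns a Series again.
column = frame["score"]

print(type(frame).__name__)
print(type(column).__name__)
```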


### Step 2: Create a DataFrame from a Dictionary (LO: Create a DataFrame from various sources)

In [None]:
# """
# TO DO:
# 1. Use `pd.DataFrame()` to create a DataFrame from a Python dictionary (where keys are column names).

# EXPECTED OUTCOME:
# A standard, labeled, two-dimensional Pandas DataFrame.
# """



data = {
    "Name": ["Thabang", "Sasekani", "Mike", "Anna"],
    "Age": [25, 30, 22, 28],
    "City": ["Zonkizizwe", "Tembisa", "Durban", "Pretoria"]
}

display_frame = pd.DataFrame(data)

display_frame

| | Name | Age | City |
|---|---|---|---|
| 0 | Thabang | 25 | Zonkizizwe |
| 1 | Sasekani | 30 | Tembisa |
| 2 | Mike | 22 | Durban |
| 3 | Anna | 28 | Pretoria |
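
Because the learning objective mentions various sources, here is a small sketch of another common input: a list of dictionaries, where each dictionary is one row. The name `records` and the reduced set of columns are chosen for illustration only.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names.
records = [
    {"Name": "Thabang", "Age": 25},
    {"Name": "Anna", "Age": 28},
]

df_records = pd.DataFrame(records)
print(df_records)
```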


### Step 3: DataFrame from NumPy Array (LO: Create a DataFrame from various sources)

In [None]:
# """
# TO DO:
# 1. Create a 3x3 NumPy array.
# 2. Use `pd.DataFrame()` to convert the NumPy array into a DataFrame, explicitly passing a list of column names.

# EXPECTED OUTCOME:
# A DataFrame with numerical data where the columns are labeled according to the provided list.
# """





array_3x3 = np.array([[10, 20, 30],[40, 50, 60],[70, 80, 90]])

display_frame_from_array = pd.DataFrame(array_3x3, columns=["Column_A", "Column_B", "Column_C"])

display_frame_from_array



| | Column_A | Column_B | Column_C |
|---|---|---|---|
| 0 | 10 | 20 | 30 |
| 1 | 40 | 50 | 60 |
| 2 | 70 | 80 | 90 |


### Step 4: Initial Inspection with `.head()` (LO: Use .head(), .info(), and .shape to inspect a new DataFrame)

In [None]:
# """
# TO DO:
# 1. Use the `.head(3)` method on the DataFrame created in Step 2.

# EXPECTED OUTCOME:
# A preview of the first 3 rows, confirming the data loaded correctly.
# """




# Preview the first 3 rows of the 4-row DataFrame created in Step 2.
preview = display_frame.head(3)
preview

| | Name | Age | City |
|---|---|---|---|
| 0 | Thabang | 25 | Zonkizizwe |
| 1 | Sasekani | 30 | Tembisa |
| 2 | Mike | 22 | Durban |


### Step 5: Full Structural Summary with `.info()`

In [None]:
# """
# TO DO:
# 1. Use the `.info()` method on the DataFrame.
# 2. Write a docstring explaining which critical pieces of information are shown (Non-Null Counts, Data Types).

# EXPECTED OUTCOME:
# A summary revealing missing data points and the current data type of each column.
# """




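array_3x3 = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

df_from_array = pd.DataFrame(array_3x3, columns=["Column_A", "Column_B", "Column_C"])

# .info() reports the index range, each column's non-null count and dtype,
# and the approximate memory usage. A non-null count lower than the number
# of entries signals missing values in that column.
df_from_array.info()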

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Column_A  3 non-null      int64
 1   Column_B  3 non-null      int64
 2   Column_C  3 non-null      int64
dtypes: int64(3)
memory usage: 204.0 bytes
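
The summary above has no missing values, so every non-null count equals the number of entries. As a sketch of what changes when data is missing, the hypothetical DataFrame `df_missing` below contains one `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in Column_A.
df_missing = pd.DataFrame({
    "Column_A": [10, np.nan, 70],
    "Column_B": [20, 50, 80],
})

# The Non-Null Count for Column_A is lower than the number of entries.
df_missing.info()

# The same counts as a Series, which is convenient for programmatic checks.
non_null_counts = df_missing.count()
print(non_null_counts)
```

Note that the missing value also changes the column's dtype: `Column_A` is stored as `float64` because `NaN` is a floating-point value.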


### Step 6: Checking Dimensions with `.shape`

In [None]:
# """
# TO DO:
# 1. Access the `.shape` attribute of the DataFrame.
# 2. Print the result.

# EXPECTED OUTCOME:
# A tuple (Rows, Columns) confirming the size of the dataset.
# """





array_3x3 = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

df_from_array = pd.DataFrame(array_3x3, columns=["Column_A", "Column_B", "Column_C"])

# (rows, columns) of the DataFrame.
data_shape = df_from_array.shape

print(data_shape)

(3, 3)

---

## Section 3: Indexing, Slicing, and Selecting Data

**Goal:** Learn essential methods for selecting and retrieving data subsets from a Pandas DataFrame.

### Step 1: Selecting a Single Column (LO: Select columns using bracket notation)

In [None]:
# """
# TO DO:
# 1. Re-initialize a sample DataFrame with labeled columns.
# 2. Select a single column using the standard bracket notation (e.g., `df['Column_A']`).

# EXPECTED OUTCOME:
# A Pandas Series object containing the values from that single column.
# """




read_data = {"Column_A": [10, 20, 30],"Column_B": [40, 50, 60],"Column_C": [70, 80, 90]}

show_data = pd.DataFrame(read_data)

selected_column = show_data["Column_A"]

selected_column

| | Column_A |
|---|---|
| 0 | 10 |
| 1 | 20 |
| 2 | 30 |


### Step 2: Selecting Multiple Columns

In [None]:
# """
# TO DO:
# 1. Select two or more columns using the double bracket notation, passing a list of column names (e.g., `df[['Col_A', 'Col_B']]`).

# EXPECTED OUTCOME:
# A new DataFrame object containing only the specified columns, maintaining a tabular structure.
# """





data = {"Column_A": [10, 20, 30],"Column_B": [40, 50, 60],"Column_C": [70, 80, 90]}

read_data = pd.DataFrame(data)

selected_columns = read_data[["Column_A", "Column_B"]]


selected_columns


| | Column_A | Column_B |
|---|---|---|
| 0 | 10 | 40 |
| 1 | 20 | 50 |
| 2 | 30 | 60 |
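
The difference between single and double brackets is easiest to see by checking the type of the result. This is a small sketch with a hypothetical DataFrame `df`; it selects the same column both ways.

```python
import pandas as pd

df = pd.DataFrame({"Column_A": [10, 20, 30], "Column_B": [40, 50, 60]})

# Single brackets with a column name return a Series.
as_series = df["Column_A"]

# Double brackets (a list of names) return a DataFrame, even for one column.
as_frame = df[["Column_A"]]

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```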


### Step 3: Label-Based Indexing with `.loc[]` (LO: Use label-based indexing with .loc[])

In [None]:
# """
# TO DO:
# 1. Use `.loc[]` to select data, specifying both the **row index label** (e.g., `2`) and the **column name** (e.g., `'Value'`).
# 2. Example: `df.loc[2, 'Value']`.

# EXPECTED OUTCOME:
# The specific cell value, retrieved by using the defined row/column labels.
# """


data = {"Value": [100, 200, 300], "Score": [10, 20, 30]}

read_data = pd.DataFrame(data)

# Row label 2 (from the default integer index) and column label "Value".
cell_value = read_data.loc[2, "Value"]
cell_value


np.int64(300)
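
With the default index, row labels look like positions, which can hide the fact that `.loc[]` works on labels. The sketch below assumes a hypothetical DataFrame with string row labels to make that visible.

```python
import pandas as pd

# Hypothetical DataFrame with string row labels instead of the default 0, 1, 2.
df = pd.DataFrame(
    {"Value": [100, 200, 300], "Score": [10, 20, 30]},
    index=["x", "y", "z"],
)

# A single cell, addressed by row label and column label.
cell = df.loc["z", "Value"]

# A label slice includes both endpoints ("x" and "y").
rows = df.loc["x":"y", ["Value"]]

print(cell)
print(rows)
```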

### Step 4: Position-Based Indexing with `.iloc[]` (LO: Use integer-location-based indexing with .iloc[])

In [None]:
# """
# TO DO:
# 1. Use `.iloc[]` to select data, specifying the **row position** (e.g., `[1:4]`) and the **column position** (e.g., `[0]`).
# 2. Example: `df.iloc[1:4, 0]`.

# EXPECTED OUTCOME:
# A slice of the DataFrame (rows 1, 2, 3 and the first column) retrieved purely by integer position.
# """






data = {"Value": [100, 200, 300, 400, 500],"Score": [10, 20, 30, 40, 50]}

read_data = pd.DataFrame(data)


selected_slice = read_data.iloc[1:4, 0]


selected_slice


| | Value |
|---|---|
| 1 | 200 |
| 2 | 300 |
| 3 | 400 |
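
A common source of confusion is that `.iloc[]` slices exclude the end position while `.loc[]` slices include the end label. The following sketch applies the same `1:4` slice both ways to a hypothetical DataFrame `df` with the default index.

```python
import pandas as pd

df = pd.DataFrame({
    "Value": [100, 200, 300, 400, 500],
    "Score": [10, 20, 30, 40, 50],
})

# Position-based: rows at positions 1, 2, 3 (the end position 4 is excluded).
by_position = df.iloc[1:4, 0]

# Label-based: rows labeled 1 through 4 (the end label 4 is included).
by_label = df.loc[1:4, "Value"]

print(by_position.tolist())
print(by_label.tolist())
```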


### Step 5: Simple Boolean Filtering

In [None]:
# """
# TO DO:
# 1. Create a boolean mask (e.g., `df['Value'] > 50`).
# 2. Apply the mask directly to the DataFrame (e.g., `df[mask]`).

# EXPECTED OUTCOME:
# A subset DataFrame containing only the rows where the condition (e.g., Value > 50) is True.
# """



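data = {"Value": [30, 60, 90, 20, 150], "Score": [5, 10, 15, 3, 25]}

read_data = pd.DataFrame(data)

# Boolean Series: True where Value is greater than 50.
mask = read_data["Value"] > 50
mask

# Keep only the rows where the mask is True.
filtered_data = read_data[mask]
filtered_data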

| | Value | Score |
|---|---|---|
| 1 | 60 | 10 |
| 2 | 90 | 15 |
| 4 | 150 | 25 |


### Step 6: Advanced Boolean Filtering (LO: Apply boolean conditions for complex row filtering)

In [None]:
# """
# TO DO:
# 1. Apply a complex filter using two conditions linked by the logical AND operator (`&`) and parentheses.
# 2. Example: `df[(df['Value'] > 50) & (df['Region'] == 'East')]`.

# EXPECTED OUTCOME:
# A subset DataFrame containing only rows that meet *all* specified criteria.
# """




data = {
    "Value": [30, 60, 90, 20, 150],
    "Region": ["East", "West", "East", "South", "East"],
    "Score": [5, 10, 15, 3, 25]
}

read_data = pd.DataFrame(data)

filtered_data = read_data[(read_data["Value"] > 50) & (read_data["Region"] == "East")]


filtered_data


| | Value | Region | Score |
|---|---|---|---|
| 2 | 90 | East | 15 |
| 4 | 150 | East | 25 |
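
The same pattern extends to the other element-wise logical operators. As a sketch on similar sample data (the variable names `either`, `not_east`, and `selected` are illustrative), `|` means OR, `~` means NOT, and `.isin()` tests membership in a list.

```python
import pandas as pd

df = pd.DataFrame({
    "Value": [30, 60, 90, 20, 150],
    "Region": ["East", "West", "East", "South", "East"],
})

# OR: rows where either condition holds. Parentheses are required
# because | and & bind more tightly than comparison operators.
either = df[(df["Value"] > 100) | (df["Region"] == "West")]

# NOT: invert a condition with ~.
not_east = df[~(df["Region"] == "East")]

# Membership: match any value in a list.
selected = df[df["Region"].isin(["West", "South"])]

print(either)
print(not_east)
print(selected)
```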
