# Workshop 1: NumPy and Pandas Basics

Welcome to our first workshop on Data Science fundamentals! Today we'll cover NumPy and Pandas,
two essential libraries for data manipulation in Python.

NumPy (Numerical Python) is fundamental for:
- Scientific computing
- Array operations
- Mathematical functions
- Working with matrices

Pandas builds on NumPy and is essential for:
- Data analysis
- Data cleaning
- Working with tabular data
- Reading/writing various file formats

In [4]:
import numpy as np
import pandas as pd

# 1. NumPy Fundamentals

## 1.1 Creating Arrays

Key points to discuss:
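- NumPy arrays store their elements in a contiguous, typed block of memory, which makes them faster and more compact than Python lists for numerical work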
- Arrays must contain elements of the same type
- Arrays can be created in multiple ways:
  * From Python lists
  * Using NumPy helper functions (zeros, ones) and the np.random module
  * Using range functions (arange, linspace)

Let's demonstrate different ways to create arrays:

In [2]:
# Creating arrays from lists
basic_array = np.array([1, 2, 3, 4, 5])
print("Basic array:", basic_array)

Basic array: [1 2 3 4 5]
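
The range functions and the same-type rule listed above can be sketched as follows. The variable names (`evens`, `points`, `mixed`) are illustrative and not part of the workshop material.

```python
import numpy as np

# arange: evenly spaced values with a fixed step; the stop value is excluded
evens = np.arange(0, 10, 2)
print("arange:", evens)

# linspace: a fixed number of points; both endpoints are included by default
points = np.linspace(0.0, 1.0, 5)
print("linspace:", points)

# Mixing ints and floats upcasts every element to one common dtype
mixed = np.array([1, 2.5, 3])
print("mixed dtype:", mixed.dtype)  # float64
```

As a rule of thumb, `arange` fits when the step size is known and `linspace` fits when the number of points is known.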


Now let's look at special arrays. These are frequently used in:
- Initializing weights in machine learning
- Creating placeholder arrays
- Generating random data for testing

In [3]:
# Creating special arrays
zeros = np.zeros(5)
ones = np.ones((2, 3))
random_array = np.random.rand(3, 3)

print("\nZeros array:", zeros)
print("\nOnes array:\n", ones)
print("\nRandom array:\n", random_array)

Zeros array: [0. 0. 0. 0. 0.]

Ones array:
[[1. 1. 1.]
 [1. 1. 1.]]

Random array:
[[0.12345678 0.23456789 0.34567890]
 [0.45678901 0.56789012 0.67890123]
 [0.78901234 0.89012345 0.90123456]]
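
When random data is used for testing, it often helps to make it reproducible. The sketch below uses NumPy's `default_rng` generator instead of `np.random.rand`. That is a substitution for illustration, and the seed value 42 is an arbitrary assumption.

```python
import numpy as np

# A seeded generator produces the same numbers on every run
rng = np.random.default_rng(seed=42)
sample = rng.random((3, 3))          # uniform floats in [0, 1)
dice = rng.integers(1, 7, size=5)    # integers from 1 to 6; the upper bound is excluded

# Re-creating the generator with the same seed repeats the sequence
rng_again = np.random.default_rng(seed=42)
print("Same values:", np.array_equal(sample, rng_again.random((3, 3))))
```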


## 1.2 Array Operations

Key points to discuss:
- Arithmetic operators work element-wise on arrays of the same shape; broadcasting extends this to compatible shapes (for example, an array and a scalar)
- Operations are much faster than Python loops
- No need to write explicit loops for basic operations
- Perfect for mathematical computations

Common operations include:
- Basic arithmetic (+, -, *, /)
- Power operations (**)
- Matrix multiplication (@)

In [4]:
# Mathematical operations
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

print("Array addition:", array1 + array2)
print("Array multiplication:", array1 * array2)
print("Array square:", array1 ** 2)

Array addition: [5 7 9]
Array multiplication: [ 4 10 18]
Array square: [1 4 9]
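
The cell above covers same-shape arithmetic. A minimal sketch of broadcasting and of the `@` operator might look like this; the 2x3 `matrix` here is an illustrative assumption, not data from the workshop.

```python
import numpy as np

array1 = np.array([1, 2, 3])
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Broadcasting a scalar: 10 is applied to every element
scaled = array1 * 10
print("Scaled:", scaled)            # [10 20 30]

# Broadcasting a row: shape (3,) is stretched across both rows of shape (2, 3)
shifted = matrix + array1
print("Shifted:\n", shifted)

# Matrix multiplication: (2, 3) @ (3,) gives shape (2,)
product = matrix @ array1
print("Matrix product:", product)   # [14 32]
```

Note that `*` stays element-wise even for 2D arrays; `@` is the operator that performs matrix multiplication.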


## 1.3 Array Indexing and Slicing

Key points to discuss:
- Similar to Python list indexing but more powerful
- Can handle multiple dimensions easily
- Negative indexing works like in Python lists
- Slicing follows start:stop:step pattern
- 2D arrays use comma-separated indexing [row, column]

Important concepts:
- Basic indexing
- Slicing
- Boolean indexing (covered later)

In [5]:
# Create a 2D array
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

print("2D array:\n", matrix)
print("First row:", matrix[0])
print("Element at position (1,2):", matrix[1,2])
print("Sliced array:\n", matrix[0:2, 1:3])

2D array:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
First row: [1 2 3]
Element at position (1,2): 6
Sliced array:
[[2 3]
 [5 6]]
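
Negative indexing, stepped slices, and boolean indexing from the key points can be sketched on the same 3x3 matrix. The variable names are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Negative indices count from the end
last_row = matrix[-1]             # [7 8 9]
last_column = matrix[:, -1]       # [3 6 9]

# start:stop:step on both axes: every second row and column
corners = matrix[::2, ::2]        # [[1 3], [7 9]]

# Boolean indexing: a mask of True/False selects matching elements
mask = matrix > 5
large_values = matrix[mask]       # [6 7 8 9]

print(last_row, last_column, large_values)
print(corners)
```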


# 2. Pandas Fundamentals

## 2.1 Creating DataFrames

Key points to discuss:
- DataFrame is a 2D labeled data structure
- Like a spreadsheet or SQL table
- Can be created from various data sources:
  * Dictionaries
  * Lists
  * NumPy arrays
  * CSV files
  * Excel files

Common use cases:
- Loading data from files
- Converting other data structures to DataFrame
- Creating sample data for testing

In [6]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['John', 'Emma', 'Alex', 'Sarah'],
    'Age': [28, 24, 32, 27],
    'City': ['New York', 'Paris', 'London', 'Tokyo']
}

df = pd.DataFrame(data)
print("Basic DataFrame:\n", df)

Basic DataFrame:
    Name  Age      City
0   John   28  New York
1   Emma   24     Paris
2   Alex   32    London
3  Sarah   27     Tokyo
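
The other data sources listed above follow the same pattern. In this sketch the CSV content is an in-memory string read through `io.StringIO` so that the example runs without a file on disk; in practice `pd.read_csv` would be given a file path. All sample values are assumptions for illustration.

```python
import io

import numpy as np
import pandas as pd

# From a list of dictionaries: one dictionary per row
records = [{'Name': 'John', 'Age': 28}, {'Name': 'Emma', 'Age': 24}]
df_from_list = pd.DataFrame(records)

# From a NumPy array: column names are supplied separately
values = np.arange(6).reshape(3, 2)
df_from_array = pd.DataFrame(values, columns=['x', 'y'])

# From CSV text (stands in for a CSV file here)
csv_text = "Name,Age,City\nJohn,28,New York\nEmma,24,Paris\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_list)
print(df_from_array)
print(df_from_csv)
```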


## 2.2 Basic DataFrame Operations

Key points to discuss:
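- Columns can be accessed using [] or dot notation (dot notation only works when the column name is a valid Python identifier and does not clash with a DataFrame attribute or method)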
- New columns can be added easily
- Built-in statistical functions
- Column operations are vectorized

Common operations:
- Accessing data
- Adding/removing columns
- Basic statistics
- Data type conversion

In [None]:
# Accessing columns and basic operations
print("Accessing 'Name' column:\n", df['Name'])

# Adding a new column
df['Country'] = ['USA', 'France', 'UK', 'Japan']
print("\nDataFrame with new column:\n", df)

# Basic statistics
print("\nBasic statistics for Age:\n", df['Age'].describe())

Accessing 'Name' column:
0     John
1     Emma
2     Alex
3    Sarah
Name: Name, dtype: object

DataFrame with new column:
    Name  Age      City Country
0   John   28  New York     USA
1   Emma   24     Paris  France
2   Alex   32    London      UK
3  Sarah   27     Tokyo   Japan

Basic statistics for Age:
count     4.000000
mean     27.750000
std       3.304038
min      24.000000
25%      26.250000
50%      27.500000
75%      29.000000
max      32.000000
Name: Age, dtype: float64
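
The remaining operations from the list (dot notation, removing a column, and data type conversion) can be sketched as follows. The `Age in months` column is a hypothetical example chosen to show a name that only works with brackets.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Alex', 'Sarah'],
    'Age': [28, 24, 32, 27],
    'City': ['New York', 'Paris', 'London', 'Tokyo']
})

# Dot notation is equivalent to df['Age'] for simple column names
ages = df.Age

# A column name with spaces can only be reached with brackets
df['Age in months'] = df['Age'] * 12

# Removing a column: drop returns a new DataFrame
df = df.drop(columns=['Age in months'])

# Data type conversion: integers to floats
df['Age'] = df['Age'].astype(float)
print(df.dtypes)
```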


## 2.3 Data Selection and Filtering

Key points to discuss:
- Boolean indexing
- Multiple conditions using & (and) and | (or), with each condition wrapped in parentheses
- Using .loc[] for label-based indexing
- Using .iloc[] for integer-based indexing

Common use cases:
- Filtering data based on conditions
- Selecting specific rows/columns
- Combining multiple conditions

In [8]:
# Filtering data
young_people = df[df['Age'] < 30]
print("People under 30:\n", young_people)

# Multiple conditions
young_in_asia = df[(df['Age'] < 30) & (df['Country'].isin(['Japan']))]
print("\nPeople under 30 in Asia:\n", young_in_asia)

People under 30:
    Name  Age      City Country
0   John   28  New York     USA
1   Emma   24     Paris  France
3  Sarah   27     Tokyo   Japan

People under 30 in Asia:
    Name  Age   City Country
3  Sarah   27  Tokyo   Japan
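
The key points also mention `.loc[]`, `.iloc[]`, and the `|` operator, which the cell above does not use. A minimal sketch on the same data, with illustrative variable names, could look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Alex', 'Sarah'],
    'Age': [28, 24, 32, 27],
    'City': ['New York', 'Paris', 'London', 'Tokyo'],
    'Country': ['USA', 'France', 'UK', 'Japan']
})

# .loc selects by label; the end of a label slice is included (rows 0, 1, 2)
by_label = df.loc[0:2, ['Name', 'Age']]

# .iloc selects by integer position; the end of the slice is excluded (rows 0, 1)
by_position = df.iloc[0:2, 0:2]

# | combines conditions with "or"; each condition needs its own parentheses
young_or_london = df[(df['Age'] < 25) | (df['City'] == 'London')]

# .loc also accepts a boolean mask together with a column label
names_under_30 = df.loc[df['Age'] < 30, 'Name']

print(by_label)
print(by_position)
print(young_or_london)
print(names_under_30)
```

The difference in slice endpoints between `.loc` and `.iloc` is a common source of off-by-one mistakes, so it is worth checking the row counts when switching between them.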


# 3. Practical Exercises

Key points to discuss before exercises:
- Importance of practice
- Real-world applications
- Common pitfalls
- Best practices

Exercise structure:
1. Start with simple tasks
2. Build up to more complex operations
3. Combine multiple concepts
4. Relate to real-world scenarios

## Exercise 1: NumPy Arrays
1. Create a 3x3 matrix with random integers between 1 and 100
2. Calculate the mean of each row
3. Find the maximum value in the entire matrix

## Exercise 2: Pandas DataFrames
1. Create a DataFrame with student information (Name, Grade, Subject)
2. Calculate the average grade per subject
3. Filter students with grades above 85

In [6]:
# Example solution for Exercise 1
# randint excludes the upper bound, so 101 is needed to include 100
exercise_matrix = np.random.randint(1, 101, size=(3, 3))
print("\nExercise Matrix:\n", exercise_matrix)
print("Row means:", exercise_matrix.mean(axis=1))
print("Maximum value:", exercise_matrix.max())


Exercise Matrix:
 [[32 16 88]
 [32 42 23]
 [84 93 86]]
Row means: [45.33333333 32.33333333 87.66666667]
Maximum value: 93


In [None]:
# Example solution for Exercise 2
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Grade': [88, 92, 85, 78],
    'Subject': ['Math', 'Math', 'Physics', 'Math']
}

student_df = pd.DataFrame(student_data)
print("\nAverage grade per subject:\n", student_df.groupby('Subject')['Grade'].mean())
print("\nStudents with grades above 85:\n", student_df[student_df['Grade'] > 85])

## Wrap-up Discussion Points:
1. Review key concepts covered
2. Address common questions
3. Preview next workshop on data preprocessing
4. Suggest additional resources for practice
5. Assign optional homework if applicable