Introduction to Pandas, NumPy, Scikit-Learn, and Other Essential Modules in Python


---

### 2.2.1. NumPy: Numerical Computing


NumPy provides support for arrays, matrices, and fast mathematical operations.


Creating a NumPy array

In [1]:
import numpy as np

# From a Python list
arr1 = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr1)

# 2D Array (Matrix)
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", arr2)

1D Array: [1 2 3 4 5]
2D Array:
 [[1 2 3]
 [4 5 6]]


In [None]:
# Arrays filled with zeros or ones
zeros = np.zeros((2, 3))
ones = np.ones((3, 3))
print("Zeros:\n", zeros)
print("Ones:\n", ones)

# Array with a range
arr_range = np.arange(0, 10, 2)   # from 0 up to (but not including) 10, step 2
print("Range array:", arr_range)

# Evenly spaced values
arr_linspace = np.linspace(0, 1, 5)   # 5 evenly spaced numbers from 0 to 1, both ends included
print("Linspace array:", arr_linspace)

# Random numbers
rand_arr = np.random.rand(3, 3)   # Uniform [0,1)
rand_ints = np.random.randint(1, 10, size=(2, 4))   # Integers from 1 to 9 (upper bound excluded)
print("Random array:\n", rand_arr)
print("Random integers:\n", rand_ints)


In [None]:
# TODO: Create your own numpy array and perform some operations on it

Array properties

In [None]:
print("Shape:", arr2.shape)
print("Size:", arr2.size)
print("Data type:", arr2.dtype)
print("Dimension:", arr2.ndim)


Slicing and indexing

In [None]:
arr = np.array([10, 20, 30, 40, 50])

print(arr[0])     # First element
print(arr[-1])    # Last element
print(arr[1:4])   # Elements at indices 1, 2, 3 (end index excluded)

# 2D indexing
mat = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

print(mat[0, 2])   # First row, third column
print(mat[1:, :2]) # Rows 2-3, first 2 columns


Reshaping

In [None]:
arr = np.arange(12)   # [0,1,2,...,11]
reshaped = arr.reshape(3, 4)
print("Reshaped (3x4):\n", reshaped)

flattened = reshaped.flatten()
print("Flattened:", flattened)
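
When reshaping, one dimension can be left for NumPy to infer by passing `-1`. The sketch below reuses the 12-element array from the cell above; the names `auto_cols` and `auto_rows` are illustrative only.

```python
import numpy as np

arr = np.arange(12)   # 12 elements: 0..11

# Fix the number of rows and let NumPy work out the columns
auto_cols = arr.reshape(3, -1)
print("Shape with 3 rows:", auto_cols.shape)

# Fix the number of columns and let NumPy work out the rows
auto_rows = arr.reshape(-1, 6)
print("Shape with 6 columns:", auto_rows.shape)
```

The total number of elements must divide evenly into the requested shape; otherwise `reshape` raises a `ValueError`.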


Some mathematical operations

In [None]:
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print("Addition:", a + b)
print("Multiplication:", a * b)
print("Square roots:", np.sqrt(a))
print("Dot product:", np.dot(a, b))
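
The operations above are vectorized: one expression acts on every element without an explicit Python loop. As a rough illustration, the sketch below computes the same element-wise product both ways and checks that the dot product equals the sum of those products. The names `loop_product` and `vector_product` are made up for this example.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Element-wise product written as an explicit Python loop
loop_product = [int(x) * int(y) for x, y in zip(a, b)]

# The same computation as a single vectorized expression
vector_product = a * b

print("Loop:      ", loop_product)
print("Vectorized:", vector_product)

# The dot product is the sum of the element-wise products
print("Dot product:", np.dot(a, b), "| sum of products:", vector_product.sum())
```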


Random Numbers

In [None]:
# Single integer from 0 to 9 (the upper bound 10 is excluded)
num = np.random.randint(0, 10)
print("Random integer:", num)

# Five random integers from 1 to 9
print("Random integers:", np.random.randint(1, 10, size=5))

# 2D array (3x4) of integers from 10 to 49 (the upper bound 50 is excluded)
arr2d = np.random.randint(10, 50, size=(3, 4))
print("2D array of random integers:\n", arr2d)

# Random floats (between 0 and 1)
print("Random floats:", np.random.rand(3))

# Standard normal distribution (mean 0, standard deviation 1)
print("Normal distribution:", np.random.randn(3))

# Single random float
print("Random float:", np.random.rand())

# 1D array of 5 floats
print("Random floats:", np.random.rand(5))

# 2D array (2x3) of floats
print("Random float matrix:\n", np.random.rand(2, 3))

Random numbers from the standard normal distribution

In [None]:

# Single value
print("Normal random value:", np.random.randn())

# 1D array of 5 values
print("Normal random array:", np.random.randn(5))

# 2D array (3x3)
print("Normal distribution matrix:\n", np.random.randn(3, 3))


Choosing from a list

In [None]:
# Pick one element from a list
print("Choice:", np.random.choice([10, 20, 30, 40]))

# Pick multiple elements with replacement
print("Choices with replacement:", np.random.choice([1, 2, 3, 4, 5], size=3))

# Pick without replacement
print("Choices without replacement:", np.random.choice([1, 2, 3, 4, 5], size=3, replace=False))


Linear Algebra

In [None]:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiplication (for 2D arrays, np.dot(A, B) is the same as A @ B)
print("Matrix multiplication:\n", np.dot(A, B))

# Transpose
print("Transpose:\n", A.T)

# Determinant
print("Determinant:", np.linalg.det(A))

# Inverse
print("Inverse:\n", np.linalg.inv(A))


Statistics

In [None]:
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Standard Deviation:", np.std(data))
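
Note that `np.std` computes the population standard deviation by default (it divides by `N`). The sample version (dividing by `N - 1`) is available through the `ddof` parameter. A short sketch on the same `data` array; the variable names are illustrative:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Default: population standard deviation (ddof=0, divide by N)
population_std = np.std(data)

# Sample standard deviation (ddof=1, divide by N - 1)
sample_std = np.std(data, ddof=1)

print("Population std:", population_std)
print("Sample std:", sample_std)
```

This matters when comparing results across libraries: the pandas `.std()` method uses `ddof=1` by default, so it will not match `np.std` unless `ddof` is set explicitly.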


Broadcasting

In [None]:
mat = np.array([[1, 2, 3],
                [4, 5, 6]])

# Add scalar to all elements
print(mat + 10)

# Add row vector to matrix
row_vec = np.array([1, 0, -1])
print(mat + row_vec)
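
Broadcasting also works down the rows when the vector has shape `(rows, 1)`. The sketch below adds a hypothetical column vector `col_vec` to the same matrix, then shows that incompatible shapes raise an error.

```python
import numpy as np

mat = np.array([[1, 2, 3],
                [4, 5, 6]])

# Column vector of shape (2, 1): one value per row of mat
col_vec = np.array([[10],
                    [20]])
result = mat + col_vec
print(result)

# Shapes (2, 3) and (2,) do not line up on the last axis, so this fails
try:
    mat + np.array([1, 2])
except ValueError as err:
    print("Cannot broadcast:", err)
```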


Saving a NumPy array to a CSV file and loading it back

In [None]:
np.savetxt("data_array.csv", data, delimiter=",")

In [None]:
data_loaded = np.loadtxt('data_array.csv', delimiter=",")
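
A quick check of the round trip can be useful, because `np.loadtxt` returns floats by default even if the saved array held integers. The sketch below writes to a temporary directory (an assumption made here so that no file is left behind) and compares the two arrays.

```python
import os
import tempfile

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Save and reload inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_array.csv")
    np.savetxt(path, data, delimiter=",")
    data_loaded = np.loadtxt(path, delimiter=",")

print("Original dtype:", data.dtype)
print("Loaded dtype:", data_loaded.dtype)
print("Same values:", np.array_equal(data, data_loaded))
```

If integer text is preferred in the file, passing `fmt="%d"` to `np.savetxt` is one option.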

####  **Exercise** 
1. Create two arrays, `array1` and `array2`, each representing a vector: \
`array1 <- [2, 1, 10, 6]` \
`array2 <- [2.5, 0.5, 9, 6]`
2. Print both arrays.
3. Perform the element-wise operations `+, -, /, *`.
4. Print the results.
5. Calculate the `dot product` of the two arrays.
6. Print the result.
7. Calculate the `mean and standard deviation` of each array.
8. Print the results.

In [None]:
# Solve the Exercise below to see the output

Repeat the exercise, but this time create `array1` and `array2` with random values.

In [None]:
# Solve the Exercise below to see the output

***

### 2.2.2. Pandas and DataFrame

A DataFrame is a data structure well suited to working with tabular data.\
It is like an **Excel spreadsheet** or a **database table**.\
Rows represent observations, and columns represent variables.\
Pandas is the module that provides the DataFrame.

In [3]:
import pandas as pd

# Create a DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [24, 27, 22],
    "Score": [85, 90, 88]
}
df = pd.DataFrame(data)

print("DataFrame:\n", df)

DataFrame:
       Name  Age  Score
0    Alice   24     85
1      Bob   27     90
2  Charlie   22     88


In [None]:
# Access columns
print("Ages:", df["Age"])

# Basic statistics
print("Average Score:", df["Score"].mean())

In [None]:
# TODO: Create a dataframe with two names and their ages and scores call it df2 and print it out



Inspecting a DataFrame

In [None]:
print('Head of DataFrame')
print(df.head())       # First 5 rows
print('----')
print('Tail of DataFrame')
print(df.tail(2))      # Last 2 rows
print('----')
print('Size of DataFrame')
print(df.shape)        # (rows, columns)
print('----')
print('Column names')
print(df.columns)      # Column names
print('----')
print('Summary info')
df.info()              # Summary info (prints directly and returns None, so no print() is needed)
print('----')
print('Statistics')
print(df.describe())   # Statistics (for numeric columns)
print('----')

In [None]:
# TODO: Try with your own data frame

Selecting Data

In [None]:
# Rows
print(df.iloc[0])      # First row by position
print(df.loc[1])       # Row with index = 1

# Columns
print(df["Name"])            # Single column (Series)
print(df[["Name", "Score"]]) # Multiple columns (DataFrame)

# Filtering with condition
print(df[df["Score"] > 88])  # Rows where Score > 88
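
Conditions can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The sketch below rebuilds the original three-row DataFrame so that it runs on its own; the names `both`, `either`, and `picked` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [24, 27, 22],
    "Score": [85, 90, 88],
})

# Rows where both conditions hold
both = df[(df["Score"] > 85) & (df["Age"] < 25)]
print(both)

# Rows where at least one condition holds
either = df[(df["Score"] >= 90) | (df["Age"] < 23)]
print(either)

# Filter rows and select columns in one step with .loc
picked = df.loc[df["Score"] > 85, ["Name", "Score"]]
print(picked)
```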


Adding & Modifying Columns

In [None]:
df["Passed"] = df["Score"] >= 90   # New column
df["Age"] = df["Age"] + 1          # Modify column
print(df)


Deleting Columns or Rows

In [None]:
df = df.drop(columns=["Passed"])   # Drop a column
df = df.drop(index=2)              # Drop a row by index
print(df)


Sorting Data

In [None]:
sorted_df = df.sort_values(by="Score", ascending=False)
print(sorted_df)


Saving and loading

In [None]:
# Save to CSV
df.to_csv("students.csv", index=False)

# Load from CSV
new_df = pd.read_csv("students.csv")
print(new_df)


In [None]:
# TODO: Save your own data to df2.csv and load it back, view it using notepad or excel


In [None]:
# TODO: Add your data frame to itself

In [None]:
# TODO: Add your data frame to the one given at the beginning of the section

***

### 2.2.3. Converting between DataFrame and NumPy

In [None]:
import numpy as np
import pandas as pd

# Create a NumPy array
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Convert to DataFrame
df = pd.DataFrame(arr, columns=["Col1", "Col2", "Col3"])
print("DataFrame from NumPy:\n", df)

In [None]:
# Convert DataFrame to NumPy array
back_to_array = df.to_numpy()
print("NumPy array from DataFrame:\n", back_to_array)
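
The conversion above works cleanly because every column is numeric. When a DataFrame mixes text and numbers, `to_numpy()` falls back to a generic `object` array, so it is usually better to select the numeric columns first. A small sketch with a hypothetical `mixed` DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with one text column and two numeric columns
mixed = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [24, 27],
    "Score": [85, 90],
})

# Mixed columns produce an array of generic Python objects
print("Mixed dtype:", mixed.to_numpy().dtype)

# Selecting only the numeric columns keeps a numeric dtype
numeric = mixed[["Age", "Score"]].to_numpy()
print("Numeric dtype:", numeric.dtype)
print(numeric)
```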


In [None]:
df2 = pd.DataFrame(arr, 
                   columns=["Feature1", "Feature2", "Feature3"], 
                   index=["Row1", "Row2", "Row3"])
print("Labeled DataFrame:\n", df2)

In [None]:
# Example: Features and Labels
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 3.4, 5.4]])

y = np.array([0, 0, 1])  # Labels

# Convert to DataFrame for inspection
df_ml = pd.DataFrame(X, columns=["SepalLength", "SepalWidth", "PetalLength"])
df_ml["Target"] = y

print("ML-style DataFrame:\n", df_ml)

### Exercise

1. **Load the Dataset**
   - Use `pandas` to read the CSV file named **`tips.csv`** into a DataFrame.
   - Hint: Use `pd.read_csv()`.

2. **Display Basic Data Overview**
   - Print the **first 5 rows** of the DataFrame using `.head()`.

3. **Check Data Structure**
   - Print the **column names** and **data types** using `.dtypes` or `.info()`.

4. **Statistical Summary**
   - Use `.describe()` to display the **mean**, **min**, **max**, and other statistics for numerical columns.
   - Also calculate the **median** for each numerical column using `.median()`.

5. **Inspect Categorical Columns**
   - Print all **unique values** in the `'day'` and `'time'` columns.
   - Count how many times smokers and non-smokers appear using `.value_counts()`.

6. **Filter Data**
   - Create a new DataFrame containing only rows with smokers.
   - Display the **first 5 rows** of the filtered DataFrame.

7. **Group and Analyze**
   - Group the data by the `'sex'` column using `.groupby()`.
   - Calculate the **mean** of each numeric column for each sex.
   - Print the resulting grouped DataFrame.
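
Several methods named in the steps above (`.unique()`, `.value_counts()`, `.groupby()`, `.median()`) were not demonstrated earlier. The sketch below shows them on a small made-up DataFrame called `orders`. Its columns and values are hypothetical and are not taken from `tips.csv`, so the exercise still needs to be solved on the real file.

```python
import pandas as pd

# Hypothetical toy data, unrelated to tips.csv
orders = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Oslo", "Rome", "Rome"],
    "paid_by_card": ["Yes", "No", "Yes", "Yes", "No"],
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
    "items": [1, 2, 3, 4, 5],
})

# Distinct values in a column
print(orders["city"].unique())

# How often each value occurs
print(orders["paid_by_card"].value_counts())

# Keep only the rows that match a condition
card_orders = orders[orders["paid_by_card"] == "Yes"]
print(card_orders.head())

# Median of each numeric column
print(orders.median(numeric_only=True))

# Mean of each numeric column per group
city_means = orders.groupby("city").mean(numeric_only=True)
print(city_means)
```

Passing `numeric_only=True` keeps text columns out of the calculation, which avoids errors when a DataFrame mixes strings and numbers.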
