Introduction to Pandas, NumPy, Scikit-Learn, and Other Essential Modules in Python


---

### 2.2.1. Pandas and DataFrame

A DataFrame is a data structure well suited to tabular data.\
It is like an **Excel spreadsheet** or a **database table**.\
Rows represent observations, and columns represent variables.\
Pandas is the module that provides the DataFrame.

In [23]:
import pandas as pd

# Create a DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [24, 27, 22],
    "Score": [85, 90, 88]
}
df = pd.DataFrame(data)

print("DataFrame:\n", df)

# Access columns
print("Ages:", df["Age"])

# Basic statistics
print("Average Score:", df["Score"].mean())

DataFrame:
       Name  Age  Score
0    Alice   24     85
1      Bob   27     90
2  Charlie   22     88
Ages: 0    24
1    27
2    22
Name: Age, dtype: int64
Average Score: 87.66666666666667


Inspecting a DataFrame

In [24]:
print('Head of DataFrame')
print(df.head())       # First 5 rows
print('----')
print('Tail of DataFrame')
print(df.tail(2))      # Last 2 rows
print('----')
print('Size of DataFrame')
print(df.shape)        # (rows, columns)
print('----')
print('Column names')
print(df.columns)      # Column names
print('----')
print('Summary info')
print(df.info())       # Summary info; info() prints directly and returns None, hence the trailing "None"
print('----')
print('Statistics')
print(df.describe())   # Statistics (for numeric columns)
print('----')

Head of DataFrame
      Name  Age  Score
0    Alice   24     85
1      Bob   27     90
2  Charlie   22     88
----
Tail of DataFrame
      Name  Age  Score
1      Bob   27     90
2  Charlie   22     88
----
Size of DataFrame
(3, 3)
----
Column names
Index(['Name', 'Age', 'Score'], dtype='object')
----
Summary info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   Score   3 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes
None
----
Statistics
             Age      Score
count   3.000000   3.000000
mean   24.333333  87.666667
std     2.516611   2.516611
min    22.000000  85.000000
25%    23.000000  86.500000
50%    24.000000  88.000000
75%    25.500000  89.000000
max    27.000000  90.000000
----


Selecting Data

In [25]:
# Rows
print(df.iloc[0])      # First row by position
print(df.loc[1])       # Row with index = 1

# Columns
print(df["Name"])            # Single column (Series)
print(df[["Name", "Score"]]) # Multiple columns (DataFrame)

# Filtering with condition
print(df[df["Score"] > 88])  # Rows where Score > 88


Name     Alice
Age         24
Score       85
Name: 0, dtype: object
Name     Bob
Age       27
Score     90
Name: 1, dtype: object
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
      Name  Score
0    Alice     85
1      Bob     90
2  Charlie     88
  Name  Age  Score
1  Bob   27     90
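
Conditions can also be combined, and `.loc` can select rows and columns in one step. The following is a minimal sketch on the same toy data; the thresholds are arbitrary values chosen for illustration.

```python
import pandas as pd

# Same toy data as in the cells above
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [24, 27, 22],
    "Score": [85, 90, 88],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
young_high = df[(df["Age"] < 25) & (df["Score"] > 86)]
print(young_high)

# .loc takes a row selector and a column selector in a single call
names = df.loc[df["Score"] >= 88, "Name"]
print(names.tolist())
```

Python's plain `and` / `or` do not work element-wise on a Series, which is why `&` and `|` are used here.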


Adding & Modifying Columns

In [26]:
df["Passed"] = df["Score"] >= 90   # New column
df["Age"] = df["Age"] + 1          # Modify column
print(df)


      Name  Age  Score  Passed
0    Alice   25     85   False
1      Bob   28     90    True
2  Charlie   23     88   False
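
Assigning to `df["..."]` changes the DataFrame in place. When the original should stay untouched, `assign()` is one alternative. This sketch assumes a made-up grading rule (90 or more is an "A") purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85, 90, 88],
})

# assign() returns a new DataFrame; df itself is left unchanged.
# The 90-point cutoff for an "A" is a hypothetical rule for this example.
graded = df.assign(Grade=df["Score"].apply(lambda s: "A" if s >= 90 else "B"))
print(graded)
print("Original columns:", df.columns.tolist())
```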


Deleting Columns or Rows

In [27]:
df = df.drop(columns=["Passed"])   # Drop a column
df = df.drop(index=2)              # Drop a row by index
print(df)


    Name  Age  Score
0  Alice   25     85
1    Bob   28     90
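
Note that `drop()` returns a new DataFrame and keeps the remaining row labels as they were. A small sketch of how the labels might be renumbered afterwards with `reset_index`:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85, 90, 88],
})

# Dropping the first row leaves the labels 1 and 2 behind
trimmed = df.drop(index=0)
print(trimmed.index.tolist())

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
trimmed = trimmed.reset_index(drop=True)
print(trimmed)
```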


Sorting Data

In [28]:
sorted_df = df.sort_values(by="Score", ascending=False)
print(sorted_df)


    Name  Age  Score
1    Bob   28     90
0  Alice   25     85
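
`sort_values` also accepts a list of columns, which helps when the first column has ties. The sketch below adds a hypothetical fourth student ("Dana") so that two rows share the same score.

```python
import pandas as pd

# "Dana" is an invented row, added only to create a tie on Score
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [24, 27, 22, 22],
    "Score": [85, 90, 88, 90],
})

# Sort by Score (highest first), then by Age (youngest first) to break ties
ranked = df.sort_values(by=["Score", "Age"], ascending=[False, True])
print(ranked)
```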


Saving and Loading

In [29]:
# Save to CSV
df.to_csv("students.csv", index=False)

# Load from CSV
new_df = pd.read_csv("students.csv")
print(new_df)


    Name  Age  Score
0  Alice   25     85
1    Bob   28     90
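
The `index=False` argument matters: without it, the row labels are written as an extra unnamed column. The sketch below shows the difference using an in-memory buffer (`io.StringIO`) instead of a file, which is an assumption made here only to avoid touching the disk.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 28],
    "Score": [85, 90],
})

# Write with the default index=True into an in-memory text buffer
buffer = io.StringIO()
df.to_csv(buffer)

# Reading it back naively turns the old index into a column called "Unnamed: 0"
buffer.seek(0)
with_index = pd.read_csv(buffer)
print(with_index.columns.tolist())

# index_col=0 tells read_csv to use the first column as the index again
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```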


### 2.2.2. NumPy: Numerical Computing


NumPy provides support for arrays, matrices, and fast mathematical operations.


Creating a NumPy Array

In [30]:
import numpy as np

# From a Python list
arr1 = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr1)

# 2D Array (Matrix)
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", arr2)

# Arrays filled with zeros or ones
zeros = np.zeros((2, 3))
ones = np.ones((3, 3))
print("Zeros:\n", zeros)
print("Ones:\n", ones)

# Array with a range
arr_range = np.arange(0, 10, 2)   # from 0 up to (not including) 10, step 2
print("Range array:", arr_range)

# Evenly spaced values
arr_linspace = np.linspace(0, 1, 5)   # 5 evenly spaced numbers from 0 to 1, endpoints included
print("Linspace array:", arr_linspace)

# Random numbers
rand_arr = np.random.rand(3, 3)   # Uniform [0,1)
rand_ints = np.random.randint(1, 10, size=(2, 4))   # Integers from 1 to 9 (10 is excluded)
print("Random array:\n", rand_arr)
print("Random integers:\n", rand_ints)


1D Array: [1 2 3 4 5]
2D Array:
 [[1 2 3]
 [4 5 6]]
Zeros:
 [[0. 0. 0.]
 [0. 0. 0.]]
Ones:
 [[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
Range array: [0 2 4 6 8]
Linspace array: [0.   0.25 0.5  0.75 1.  ]
Random array:
 [[0.01715991 0.23188792 0.014384  ]
 [0.72679633 0.07505027 0.86674514]
 [0.94317095 0.24298863 0.12875717]]
Random integers:
 [[6 6 5 9]
 [4 3 3 5]]


Array Properties

In [31]:
print("Shape:", arr2.shape)
print("Size:", arr2.size)
print("Data type:", arr2.dtype)
print("Dimension:", arr2.ndim)


Shape: (2, 3)
Size: 6
Data type: int32
Dimension: 2


Slicing and Indexing

In [32]:
arr = np.array([10, 20, 30, 40, 50])

print(arr[0])     # First element
print(arr[-1])    # Last element
print(arr[1:4])   # Elements at indices 1 to 3 (the stop index 4 is excluded)

# 2D indexing
mat = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

print(mat[0, 2])   # First row, third column
print(mat[1:, :2]) # Rows 2-3, first 2 columns


10
50
[20 30 40]
3
[[4 5]
 [7 8]]
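
One detail worth knowing about basic slicing: it returns a view onto the same memory, not a copy. A minimal sketch of the consequence, and of `.copy()` as the way to get independent data:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# A slice is a view: writing through it changes the original array
view = arr[1:4]
view[0] = 99
print(arr)

# .copy() creates independent data, so the original is not affected
safe = arr[1:4].copy()
safe[0] = -1
print(arr)
print(np.shares_memory(arr, view), np.shares_memory(arr, safe))
```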


Reshaping

In [33]:
arr = np.arange(12)   # [0,1,2,...,11]
reshaped = arr.reshape(3, 4)
print("Reshaped (3x4):\n", reshaped)

flattened = reshaped.flatten()
print("Flattened:", flattened)


Reshaped (3x4):
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Flattened: [ 0  1  2  3  4  5  6  7  8  9 10 11]
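
Two related details, sketched here under the assumption of a contiguous array like the one above: `-1` lets NumPy infer one dimension, and `ravel()` returns a view when it can, whereas `flatten()` always copies.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to work out that dimension from the total size (12 / 6 = 2)
auto = arr.reshape(-1, 6)
print(auto.shape)

flat_copy = auto.flatten()   # Always an independent copy
flat_view = auto.ravel()     # A view here, because the data is contiguous

# Writing through the view changes the reshaped array; the copy is unaffected
flat_view[0] = 100
print(auto[0, 0], flat_copy[0])
```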


Some Mathematical Operations

In [34]:
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print("Addition:", a + b)
print("Multiplication:", a * b)
print("Square roots:", np.sqrt(a))
print("Dot product:", np.dot(a, b))


Addition: [11 22 33 44]
Multiplication: [ 10  40  90 160]
Square roots: [1.         1.41421356 1.73205081 2.        ]
Dot product: 300


Random Numbers

In [None]:
# Single integer from 0 to 9 (the upper bound 10 is excluded)
num = np.random.randint(0, 10)
print("Random integer:", num)

# Five random integers from 1 to 9
print("Random integers:", np.random.randint(1, 10, size=5))

# 2D array (3x4) of integers from 10 to 49 (50 is excluded)
arr2d = np.random.randint(10, 50, size=(3, 4))
print("2D array of random integers:\n", arr2d)

# Random floats (between 0 and 1)
print("Random floats:", np.random.rand(3))

# Standard normal distribution (mean 0, standard deviation 1)
print("Normal distribution:", np.random.randn(3))

# Single random float
print("Random float:", np.random.rand())

# 1D array of 5 floats
print("Random floats:", np.random.rand(5))

# 2D array (2x3) of floats
print("Random float matrix:\n", np.random.rand(2, 3))

Random integer: 5
Random integers: [7 1 9 7 7]
2D array of random integers:
 [[21 10 17 18]
 [48 14 39 14]
 [38 39 37 31]]
Random floats: [0.44771352 0.03457791 0.91482054]
Normal distribution: [-0.98059166 -1.24697571 -1.42810696]
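
The numbers above change on every run. When reproducible results are needed, a seeded generator is one option. This sketch uses `np.random.default_rng`, NumPy's newer generator interface, rather than the `np.random.*` functions used in the cell above; the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

first = rng_a.integers(0, 10, size=5)    # Integers from 0 to 9
second = rng_b.integers(0, 10, size=5)
print(first, second)

# The generator also offers uniform floats and standard normal draws
print(rng_a.random(3))
print(rng_a.standard_normal(3))
```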


Random Numbers from the Normal Distribution

In [None]:

# Single value
print("Normal random value:", np.random.randn())

# 1D array of 5 values
print("Normal random array:", np.random.randn(5))

# 2D array (3x3)
print("Normal distribution matrix:\n", np.random.randn(3, 3))
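
`randn` draws from the standard normal distribution (mean 0, standard deviation 1). A normal distribution with another mean and spread can be obtained by scaling and shifting. The mean of 170 and standard deviation of 10 below are made-up values for illustration.

```python
import numpy as np

np.random.seed(0)  # Fixed seed so the run is repeatable

mu, sigma = 170, 10  # Hypothetical mean and standard deviation

# Scale by sigma, then shift by mu
samples = mu + sigma * np.random.randn(10_000)

print("Sample mean:", samples.mean())
print("Sample std:", samples.std())
```

With this many samples, the printed mean and standard deviation should land close to the chosen `mu` and `sigma`.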


Choosing from a list

In [None]:
# Pick one element from a list
print("Choice:", np.random.choice([10, 20, 30, 40]))

# Pick multiple elements with replacement
print("Choices with replacement:", np.random.choice([1, 2, 3, 4, 5], size=3))

# Pick without replacement
print("Choices without replacement:", np.random.choice([1, 2, 3, 4, 5], size=3, replace=False))
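
By default every element is equally likely. `np.random.choice` also accepts a `p` argument with one probability per element. A small sketch, with invented colours and weights:

```python
import numpy as np

np.random.seed(1)  # Fixed seed so the run is repeatable

# Hypothetical options and weights; the probabilities must sum to 1
colours = ["red", "green", "blue"]
draws = np.random.choice(colours, size=1000, p=[0.7, 0.2, 0.1])

# Count how often each colour was drawn
values, counts = np.unique(draws, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))

red_share = (draws == "red").mean()
print("Share of red:", red_share)
```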


Linear Algebra

In [36]:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

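# Matrix product (for 2D arrays, np.dot performs matrix multiplication)
print("Matrix multiplication:\n", np.dot(A, B))

# Cross product, computed row by row; for 2-element vectors NumPy returns
# only the z-component (this 2D form is deprecated as of NumPy 2.0)
print(f"Cross product {np.cross(A, B)}")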

# Transpose
print("Transpose:\n", A.T)

# Determinant
print("Determinant:", np.linalg.det(A))

# Inverse
print("Inverse:\n", np.linalg.inv(A))


Matrix multiplication:
 [[19 22]
 [43 50]]
Cross product [-4 -4]
Transpose:
 [[1 3]
 [2 4]]
Determinant: -2.0000000000000004
Inverse:
 [[-2.   1. ]
 [ 1.5 -0.5]]
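
A common use of these tools is solving a linear system `A x = b`. The sketch below reuses the matrix `A` from above with a made-up right-hand side `b`, and uses `np.linalg.solve` rather than multiplying by the inverse, which is generally the preferred approach numerically.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([5.0, 6.0])  # Hypothetical right-hand side

# Solve A x = b directly
x = np.linalg.solve(A, b)
print("Solution:", x)

# Check the result; @ is the matrix multiplication operator
print("A @ x:", A @ x)
```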


Statistics

In [37]:
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Standard Deviation:", np.std(data))   # Population std (ddof=0); pandas uses ddof=1 by default


Mean: 5.0
Median: 5.0
Standard Deviation: 2.581988897471611
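
`np.std` divides by `n` by default (population standard deviation). Passing `ddof=1` divides by `n - 1` instead, which matches the `std` row reported by `describe()` in pandas. Statistics can also be taken along one axis of a 2D array, as this short sketch shows.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

pop_std = np.std(data)             # Divides by n
sample_std = np.std(data, ddof=1)  # Divides by n - 1
print("Population std:", pop_std)
print("Sample std:", sample_std)

# Statistics along an axis of a 3x3 matrix
mat = data.reshape(3, 3)
col_means = mat.mean(axis=0)  # One mean per column
row_means = mat.mean(axis=1)  # One mean per row
print("Column means:", col_means)
print("Row means:", row_means)
```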


Broadcasting

In [38]:
mat = np.array([[1, 2, 3],
                [4, 5, 6]])

# Add scalar to all elements
print(mat + 10)

# Add row vector to matrix
row_vec = np.array([1, 0, -1])
print(mat + row_vec)


[[11 12 13]
 [14 15 16]]
[[2 2 2]
 [5 5 5]]
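
Broadcasting also works down the rows when the second operand is a column vector of shape `(2, 1)`. The sketch below shows that case, plus a typical use: subtracting the column means from every row.

```python
import numpy as np

mat = np.array([[1, 2, 3],
                [4, 5, 6]])

# A (2, 1) column vector is stretched across the 3 columns
col_vec = np.array([[10],
                    [20]])
shifted = mat + col_vec
print(shifted)

# Centre each column: the (3,) vector of means is stretched across the 2 rows
centered = mat - mat.mean(axis=0)
print(centered)
```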


### 2.2.3. Conversion Between DataFrame and NumPy

In [39]:
import numpy as np
import pandas as pd

# Create a NumPy array
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Convert to DataFrame
df = pd.DataFrame(arr, columns=["Col1", "Col2", "Col3"])
print("DataFrame from NumPy:\n", df)

DataFrame from NumPy:
    Col1  Col2  Col3
0     1     2     3
1     4     5     6
2     7     8     9


In [40]:
# Convert DataFrame to NumPy array
back_to_array = df.to_numpy()
print("NumPy array from DataFrame:\n", back_to_array)


NumPy array from DataFrame:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
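
The conversion above is clean because every column holds integers. When a DataFrame mixes text and numbers, `to_numpy()` falls back to a generic `object` array. One possible way to handle this, sketched on a small made-up table, is to select the numeric columns first.

```python
import pandas as pd

# Small invented table with one text column and one numeric column
df_mixed = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [24, 27],
})

# Mixed column types produce an object array
mixed = df_mixed.to_numpy()
print(mixed.dtype)

# Keeping only numeric columns gives a proper numeric array
numeric = df_mixed.select_dtypes(include="number").to_numpy()
print(numeric.dtype, numeric.shape)
```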


In [41]:
df2 = pd.DataFrame(arr, 
                   columns=["Feature1", "Feature2", "Feature3"], 
                   index=["Row1", "Row2", "Row3"])
print("Labeled DataFrame:\n", df2)

Labeled DataFrame:
       Feature1  Feature2  Feature3
Row1         1         2         3
Row2         4         5         6
Row3         7         8         9


In [42]:
# Example: Features and Labels
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 3.4, 5.4]])

y = np.array([0, 0, 1])  # Labels

# Convert to DataFrame for inspection
df_ml = pd.DataFrame(X, columns=["SepalLength", "SepalWidth", "PetalLength"])
df_ml["Target"] = y

print("ML-style DataFrame:\n", df_ml)

ML-style DataFrame:
    SepalLength  SepalWidth  PetalLength  Target
0          5.1         3.5          1.4       0
1          4.9         3.0          1.4       0
2          6.2         3.4          5.4       1
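
Going the other way, the feature matrix and the label vector can be recovered from the DataFrame. A minimal sketch using the same values as the cell above:

```python
import numpy as np
import pandas as pd

X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 3.4, 5.4]])
y = np.array([0, 0, 1])

df_ml = pd.DataFrame(X, columns=["SepalLength", "SepalWidth", "PetalLength"])
df_ml["Target"] = y

# Features: every column except the target; labels: the target column alone
X_back = df_ml.drop(columns=["Target"]).to_numpy()
y_back = df_ml["Target"].to_numpy()

print("X shape:", X_back.shape)
print("y shape:", y_back.shape)
```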
