Introduction to Pandas, NumPy, Scikit-Learn, and Other Essential Modules in Python


---

### 2.2.1. NumPy: Numerical Computing


NumPy provides support for arrays, matrices, and fast mathematical operations.


Creating a NumPy array

In [3]:
%pip install numpy


In [4]:
import numpy as np


In [8]:

# From a Python list
arr1 = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr1)

# 2D Array (Matrix)
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", arr2)

1D Array: [1 2 3 4 5]
2D Array:
 [[1 2 3]
 [4 5 6]]


In [9]:
# Arrays filled with zeros or ones
zeros = np.zeros((2, 3))
ones = np.ones((3, 3))
print("Zeros:\n", zeros)
print("Ones:\n", ones)

# Array with a range
arr_range = np.arange(0, 10, 2)   # from 0 up to (but not including) 10, step 2
print("Range array:", arr_range)

# Evenly spaced values
arr_linspace = np.linspace(0, 1, 5)   # 5 evenly spaced numbers from 0 to 1, both endpoints included
print("Linspace array:", arr_linspace)

# Random numbers
rand_arr = np.random.rand(3, 3)   # Uniform [0,1)
rand_ints = np.random.randint(1, 10, size=(2, 4))   # Integers from 1 to 9 (the upper bound 10 is excluded)
print("Random array:\n", rand_arr)
print("Random integers:\n", rand_ints)


Zeros:
 [[0. 0. 0.]
 [0. 0. 0.]]
Ones:
 [[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
Range array: [0 2 4 6 8]
Linspace array: [0.   0.25 0.5  0.75 1.  ]
Random array:
 [[0.13804653 0.94057777 0.09699248]
 [0.26022935 0.29818756 0.12492159]
 [0.17958438 0.71202732 0.56947471]]
Random integers:
 [[7 6 1 4]
 [1 3 1 3]]


In [7]:
# TODO: Create your own numpy array and perform some operations on it
np.zeros((1,3, 3))

array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])

Array properties

In [10]:
print("Shape:", arr2.shape)
print("Size:", arr2.size)
print("Data type:", arr2.dtype)
print("Dimension:", arr2.ndim)


Shape: (2, 3)
Size: 6
Data type: int64
Dimension: 2


Slicing and indexing

In [None]:
arr = np.array([10, 20, 30, 40, 50])

print(arr[0])     # First element
print(arr[-1])    # Last element
print(arr[1:4])   # Slice elements

# 2D indexing
mat = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

print(mat[0, 2])   # First row, third column
print(mat[1:, :2]) # Rows 2-3, first 2 columns
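
Beyond integer indices and slices, two behaviours are worth knowing early: boolean masks select elements by condition, and a basic slice is a view onto the original data rather than a copy. The sketch below illustrates both; the names `mask`, `view`, and `independent` are chosen here for illustration only.

```python
import numpy as np

mat = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Boolean mask: True wherever the condition holds
mask = mat > 5
print(mat[mask])            # 1D array with the selected elements

# A basic slice is a view: writing to it changes the original array
view = mat[0, :]
view[0] = 100
print(mat[0, 0])

# Use .copy() when an independent array is needed
independent = mat[1, :].copy()
independent[0] = -1
print(mat[1, 0])            # the original row is unchanged
```

If a later cell modifies a slice, remember that the source array changes too unless `.copy()` was used.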


In [15]:
arr = np.ones((1,3)) + 0.5
arr.shape, arr

((1, 3), array([[1.5, 1.5, 1.5]]))

In [27]:
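# Two equivalent ways to add 1 to a single element, so each run of this cell adds 2.
# The output below (25.5) reflects the cell having been re-run several times in one session.
arr[0, 1] = arr[0, 1] + 1
arr[0, 1] += 1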
arr

array([[ 1.5, 25.5,  1.5]])

Reshaping

In [29]:
arr = np.arange(12)   # [0, 1, 2, ..., 11], similar to Python's built-in range()
reshaped = arr.reshape(3, 4)
print("Reshaped (3x4):\n", reshaped)

flattened = reshaped.flatten()
print("Flattened:", flattened)


Reshaped (3x4):
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
Flattened: [ 0  1  2  3  4  5  6  7  8  9 10 11]
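
Two details of reshaping often cause confusion: `-1` lets NumPy infer one dimension, and `flatten()` always copies while `ravel()` returns a view when it can. The following is a minimal sketch of both points; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size (12 / 2 = 6)
inferred = arr.reshape(2, -1)
print(inferred.shape)

# flatten() always returns a copy, so the original is untouched
flat_copy = inferred.flatten()
flat_copy[0] = 99
print(arr[0])

# ravel() returns a view when possible (it is possible for this contiguous array)
flat_view = inferred.ravel()
flat_view[0] = 99
print(arr[0])               # the original array now shows the change
```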


In [28]:
for i in range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9


Some mathematical operations

In [30]:
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print("Addition:", a + b)
print("Multiplication:", a * b)
print("Square roots:", np.sqrt(a))
print("Dot product:", np.dot(a, b))


Addition: [11 22 33 44]
Multiplication: [ 10  40  90 160]
Square roots: [1.         1.41421356 1.73205081 2.        ]
Dot product: 300


Random Numbers

In [31]:
# Single integer from 0 to 9 (the upper bound 10 is excluded)
num = np.random.randint(0, 10)
print("Random integer:", num)

# Random integers
print("Random integers:", np.random.randint(1, 10, size=5))

# 2D array (3x4) of integers from 10 to 49 (the upper bound 50 is excluded)
arr2d = np.random.randint(10, 50, size=(3, 4))
print("2D array of random integers:\n", arr2d)

# Random floats (between 0 and 1)
print("Random floats:", np.random.rand(3))

Random integer: 5
Random integers: [7 8 5 6 7]
2D array of random integers:
 [[39 37 35 44]
 [44 29 33 16]
 [22 28 11 31]]
Random floats: [0.78129968 0.96364101 0.74185995]
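
The random values above change on every run. When reproducible results are needed, one option is NumPy's `Generator` interface with a fixed seed. The sketch below uses `np.random.default_rng`; the seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# A seeded generator produces the same sequence on every run
rng = np.random.default_rng(seed=42)

ints = rng.integers(1, 10, size=5)                      # low inclusive, high exclusive
floats = rng.random(3)                                  # uniform on [0, 1)
normals = rng.normal(loc=0.0, scale=1.0, size=(2, 2))   # standard normal

# A second generator with the same seed repeats the first draw exactly
rng_again = np.random.default_rng(seed=42)
same_ints = rng_again.integers(1, 10, size=5)
print(np.array_equal(ints, same_ints))
```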


Random numbers from the standard normal distribution

In [32]:

# Single value
print("Normal random value:", np.random.randn())

# 1D array of 5 values
print("Normal random array:", np.random.randn(5))

# 2D array (3x3)
print("Normal distribution matrix:\n", np.random.randn(3, 3))


Normal random value: -0.4566414738108407
Normal random array: [-0.96439379 -0.36011131  1.57526935  0.68641321 -1.08225126]
Normal distribution matrix:
 [[-0.20194773  0.66044019 -1.32926321]
 [ 0.2408747   0.55702286 -0.83424126]
 [-0.91691598 -0.89651974  0.35133144]]


Choosing from a list

In [None]:
# Pick one element from a list
print("Choice:", np.random.choice([10, 20, 30, 40]))

# Pick multiple elements with replacement
print("Choices with replacement:", np.random.choice([1, 2, 3, 4, 5], size=3))

# Pick without replacement
print("Choices without replacement:", np.random.choice([1, 2, 3, 4, 5], size=3, replace=False))
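
Choices do not have to be equally likely. As a rough sketch, the `p` argument assigns a probability to each option; the labels, weights, and seed below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

labels = ["red", "green", "blue"]
weights = [0.7, 0.2, 0.1]        # must sum to 1

# Draw 1000 samples, with "red" the most likely outcome
draws = rng.choice(labels, size=1000, p=weights)

# Count how often each label was drawn
values, counts = np.unique(draws, return_counts=True)
print(dict(zip(values, counts)))
```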


Linear Algebra

In [33]:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiplication (np.dot on two 2D arrays)
print("Matrix multiplication:\n", np.dot(A, B))

# Transpose
print("Transpose:\n", A.T)

# Determinant
print("Determinant:", np.linalg.det(A))

# Inverse
print("Inverse:\n", np.linalg.inv(A))


Matrix multiplication:
 [[19 22]
 [43 50]]
Transpose:
 [[1 3]
 [2 4]]
Determinant: -2.0000000000000004
Inverse:
 [[-2.   1. ]
 [ 1.5 -0.5]]
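
Two related tools are worth a quick look: the `@` operator, which is shorthand for matrix multiplication, and `np.linalg.solve`, which solves a linear system directly without forming the inverse. The right-hand side `b` below is an assumed example value.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# The @ operator gives the same result as np.dot for 2D arrays
product = A @ B
print(product)

# Solve the linear system A x = b
b = np.array([5, 11])
x = np.linalg.solve(A, b)
print(x)

# Check the solution by multiplying back
print(np.allclose(A @ x, b))
```

For solving systems, `np.linalg.solve` is generally preferable to computing `np.linalg.inv(A) @ b`, since it avoids building the inverse explicitly.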


Statistics

In [34]:
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Standard Deviation:", np.std(data))   # Population std (ddof=0); pass ddof=1 for the sample std


Mean: 5.0
Median: 5.0
Standard Deviation: 2.581988897471611
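
The standard deviation printed above is the population value (dividing by n). The sketch below shows how `ddof=1` gives the sample value (dividing by n - 1), and how the `axis` argument computes statistics per column or per row; the 2D array `mat` is an illustrative example.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

pop_std = np.std(data)             # divides by n
sample_std = np.std(data, ddof=1)  # divides by n - 1
print(pop_std, sample_std)

mat = np.array([[1, 2, 3],
                [4, 5, 6]])

col_means = np.mean(mat, axis=0)   # one mean per column
row_means = np.mean(mat, axis=1)   # one mean per row
print(col_means, row_means)
```

Note that pandas' `.std()` uses `ddof=1` by default, so NumPy and pandas report different values for the same data unless `ddof` is set explicitly.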


Broadcasting

In [36]:
mat = np.array([[1, 2, 3],
                [4, 5, 6]])
print(mat)
# Add scalar to all elements
print(mat + 10)

# Add row vector to matrix
row_vec = np.array([1, 0, -1])
print(mat + row_vec)


[[1 2 3]
 [4 5 6]]
[[11 12 13]
 [14 15 16]]
[[2 2 2]
 [5 5 5]]
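
Broadcasting compares shapes from the last dimension backwards; two dimensions are compatible when they are equal or one of them is 1. As a sketch of that rule, the example below adds a column vector (shape `(2, 1)`) to the matrix and then shows a shape that cannot be broadcast. The values in `col_vec` are arbitrary.

```python
import numpy as np

mat = np.array([[1, 2, 3],
                [4, 5, 6]])          # shape (2, 3)

# Column vector of shape (2, 1): each row gets its own offset
col_vec = np.array([[10],
                    [20]])
result = mat + col_vec
print(result)

# Shapes (2, 3) and (2,) do not match on the last dimension (3 vs 2)
try:
    mat + np.array([1, 2])
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Cannot broadcast:", err)
```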


Saving a NumPy array to a CSV file and loading it back

In [None]:
np.savetxt("data_array.csv", data, delimiter=",")

In [None]:
data_loaded = np.loadtxt('data_array.csv', delimiter=",")
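
By default `np.savetxt` writes floats in scientific notation and `np.loadtxt` reads everything back as floats. The sketch below shows one way to round-trip an integer table with a header row; the file name, column names, and the temporary directory are assumptions made for this example.

```python
import os
import tempfile

import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "table.csv")

    # fmt="%d" writes plain integers; comments="" keeps the header free of a leading '#'
    np.savetxt(path, table, delimiter=",", fmt="%d", header="a,b,c", comments="")

    # skiprows=1 skips the header line; dtype=int restores integer values
    loaded = np.loadtxt(path, delimiter=",", skiprows=1, dtype=int)

print(loaded)
print(np.array_equal(table, loaded))
```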

####  **Exercise** 
1. Create two arrays, `array1` and `array2`; each represents a vector \
`array1 <- [2, 1, 10, 6]`
`array2 <- [2.5, 0.5, 9, 6]`
2. print both arrays
3. perform element-wise operations `+, -, /, *`
4. print the results
5. calculate the `dot product` of both arrays
6. print the results
7. calculate the `mean and standard deviation` of both arrays
8. print the results

In [None]:
# Write your solution to the exercise above here and run it to see the output

Do the same, but create array1 and array2 with random values

In [None]:
# Write your solution to the exercise above here and run it to see the output

***

### 2.2.2. Pandas and DataFrame

A DataFrame is a data structure well suited to working with tabular data.\
It is like an **Excel spreadsheet** or a **database table**.\
Rows represent observations, and columns represent variables.\
Pandas is the module that provides the DataFrame.

In [45]:
import pandas as pd
import numpy as np
# Create a DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [24, 27, 22],
    "Score": [85, 90, 88]
}
df = pd.DataFrame(data)

print("DataFrame:\n", df)

DataFrame:
       Name  Age  Score
0    Alice   24     85
1      Bob   27     90
2  Charlie   22     88


In [47]:
# Access columns
print("Ages:", df["Age"])

# Basic statistics
print("Average Score:", df["Score"].mean())

Ages: 0    24
1    27
2    22
Name: Age, dtype: int64
Average Score: 87.66666666666667


In [None]:
# TODO: Create a DataFrame with two names, their ages, and their scores; call it df2 and print it



Inspecting a DataFrame

In [48]:
print('Head of DataFrame')
print(df.head())       # First 5 rows
print('----')
print('Tail of DataFrame')
print(df.tail(2))      # Last 2 rows
print('----')
print('Size of DataFrame')
print(df.shape)        # (rows, columns)
print('----')
print('Column names')
print(df.columns)      # Column names
print('----')
print('Summary info')
print(df.info())       # Summary info; info() prints directly and returns None, hence the trailing "None"
print('----')
print('Statistics')
print(df.describe())   # Statistics (for numeric columns)
print('----')

Head of DataFrame
      Name  Age  Score
0    Alice   24     85
1      Bob   27     90
2  Charlie   22     88
----
Tail of DataFrame
      Name  Age  Score
1      Bob   27     90
2  Charlie   22     88
----
Size of DataFrame
(3, 3)
----
Column names
Index(['Name', 'Age', 'Score'], dtype='object')
----
Summary info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   Score   3 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 204.0+ bytes
None
----
Statistics
             Age      Score
count   3.000000   3.000000
mean   24.333333  87.666667
std     2.516611   2.516611
min    22.000000  85.000000
25%    23.000000  86.500000
50%    24.000000  88.000000
75%    25.500000  89.000000
max    27.000000  90.000000
----


In [None]:
# TODO: Try with your own data frame

Selecting Data

In [51]:
# Rows
print(df.iloc[0, 0], df['Name'][0])      # First row, first column: by position, then by column name and index label
print(df.loc[1])       # Row with index = 1

# Columns
print(df["Name"])            # Single column (Series)
print(df[["Name", "Score"]]) # Multiple columns (DataFrame)

# Filtering with condition
print(df[df["Score"] > 88])  # Rows where Score > 88


Alice Alice
Name     Bob
Age       27
Score     90
Name: 1, dtype: object
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
      Name  Score
0    Alice     85
1      Bob     90
2  Charlie     88
  Name  Age  Score
1  Bob   27     90
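
Conditions can be combined, and `.loc` can filter rows and pick columns in one step. The sketch below rebuilds the DataFrame from the start of this section so it runs on its own; the result names `young_high` and `names` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [24, 27, 22],
    "Score": [85, 90, 88],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
young_high = df[(df["Age"] < 25) & (df["Score"] >= 85)]
print(young_high)

# .loc takes a row condition and a column label together
names = df.loc[df["Score"] > 85, "Name"]
print(names)
```

Python's `and` / `or` keywords do not work element-wise on Series, which is why `&` and `|` are used here.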


Adding & Modifying Columns

In [52]:
df["Passed"] = df["Score"] >= 90   # New column
df["Age"] = df["Age"] + 1          # Modify column
print(df)


      Name  Age  Score  Passed
0    Alice   25     85   False
1      Bob   28     90    True
2  Charlie   23     88   False


Deleting Columns or Rows

In [53]:
df = df.drop(columns=["Passed"])   # Drop a column
df = df.drop(index=2)              # Drop a row by index
print(df.describe())


            Age      Score
count   2.00000   2.000000
mean   26.50000  87.500000
std     2.12132   3.535534
min    25.00000  85.000000
25%    25.75000  86.250000
50%    26.50000  87.500000
75%    27.25000  88.750000
max    28.00000  90.000000
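
Note that `drop` returns a new DataFrame, which is why the cells above assign the result back to `df`. As a small sketch of that behaviour, the example below uses `assign` and `drop` without reassigning, so the original stays unchanged; the names `with_passed` and `without_age` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [24, 27, 22],
    "Score": [85, 90, 88],
})

# assign() returns a new DataFrame with the extra column
with_passed = df.assign(Passed=df["Score"] >= 90)

# drop() also returns a new DataFrame
without_age = df.drop(columns=["Age"])

print(list(df.columns))            # original columns are untouched
print(list(with_passed.columns))
print(list(without_age.columns))
```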


Sorting Data

In [56]:
sorted_df = df.sort_values(by="Name", ascending=True)
print(sorted_df)


    Name  Age  Score
0  Alice   25     85
1    Bob   28     90
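
Sorting also works on several columns at once, with a separate direction for each. The sketch below uses a slightly larger table; the extra row "Dana" is hypothetical and added only so that the tie on `Age` is visible.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],   # "Dana" is a made-up extra row
    "Age": [24, 27, 22, 27],
    "Score": [85, 90, 88, 95],
})

# Sort by Age (descending), then break ties by Score (descending)
ranked = df.sort_values(by=["Age", "Score"], ascending=[False, False])

# reset_index(drop=True) renumbers the rows 0..n-1 after sorting
ranked = ranked.reset_index(drop=True)
print(ranked)
```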


Saving and Loading

In [57]:
# Save to CSV
df.to_csv("students.csv", index=False)

# Load from CSV
new_df = pd.read_csv("students.csv")
print(new_df)


    Name  Age  Score
0  Alice   25     85
1    Bob   28     90
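
The `index=False` argument matters: without it, the row index is written as an extra unnamed column and comes back as data when the file is read. The sketch below shows this using an in-memory buffer instead of a file, which is a choice made here only to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 28], "Score": [85, 90]})

# Without index=False, the index is written as the first (unnamed) column
csv_text = df.to_csv()
with_index = pd.read_csv(io.StringIO(csv_text))
print(list(with_index.columns))

# Option 1: tell read_csv that the first column is the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(restored.columns))

# Option 2: do not write the index in the first place
clean = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(list(clean.columns))
```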


In [None]:
# TODO: Save your own data to df2.csv and load it back; view the file using Notepad or Excel


In [None]:
# TODO: Add your data frame to itself

In [None]:
# TODO: Add your data frame to the one given at the beginning of the section

***

### 2.2.3. Conversion Between DataFrame and NumPy

In [None]:
import numpy as np
import pandas as pd

# Create a NumPy array
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Convert to DataFrame
df = pd.DataFrame(arr, columns=["Col1", "Col2", "Col3"])
print("DataFrame from NumPy:\n", df)

In [None]:
# Convert DataFrame to NumPy array
back_to_array = df.to_numpy()
print("NumPy array from DataFrame:\n", back_to_array)


In [None]:
df2 = pd.DataFrame(arr, 
                   columns=["Feature1", "Feature2", "Feature3"], 
                   index=["Row1", "Row2", "Row3"])
print("Labeled DataFrame:\n", df2)

In [None]:
# Example: Features and Labels
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 3.4, 5.4]])

y = np.array([0, 0, 1])  # Labels

# Convert to DataFrame for inspection
df_ml = pd.DataFrame(X, columns=["SepalLength", "SepalWidth", "PetalLength"])
df_ml["Target"] = y

print("ML-style DataFrame:\n", df_ml)
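
Going the other way, a DataFrame is commonly split back into a feature matrix and a label vector before modelling. The sketch below does that for the same values, and also shows that a DataFrame with mixed column types converts to an `object` array; the `mixed` DataFrame is a made-up example.

```python
import pandas as pd

df_ml = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 6.2],
    "SepalWidth": [3.5, 3.0, 3.4],
    "PetalLength": [1.4, 1.4, 5.4],
    "Target": [0, 0, 1],
})

# Features: every column except the target
X = df_ml.drop(columns=["Target"]).to_numpy()

# Labels: the target column only
y = df_ml["Target"].to_numpy()

print(X.shape, X.dtype)
print(y.shape, y.dtype)

# Mixed text and numeric columns fall back to a generic object array
mixed = pd.DataFrame({"Name": ["Alice"], "Age": [24]})
print(mixed.to_numpy().dtype)
```

Keeping only numeric columns before calling `to_numpy()` avoids the slower `object` dtype.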

### Exercise

1. **Load the Dataset**
   - Use `pandas` to read the CSV file named **`tips.csv`** into a DataFrame.
   - Hint: Use `pd.read_csv()`.

2. **Display Basic Data Overview**
   - Print the **first 5 rows** of the DataFrame using `.head()`.

3. **Check Data Structure**
   - Print the **column names** and **data types** using `.dtypes` or `.info()`.

4. **Statistical Summary**
   - Use `.describe()` to display the **mean**, **min**, **max**, and other statistics for numerical columns.
   - Also calculate the **median** for each numerical column using `.median()`.

5. **Inspect the 'day' and 'time' Columns**
   - Print all **unique values** in the `'day'` and `'time'` columns.
   - Count how many times smokers and non-smokers appear using `.value_counts()`.

6. **Filter Data**
   - Create a new DataFrame containing only rows with smokers.
   - Display the **first 5 rows** of the filtered DataFrame.

7. **Group and Analyze**
   - Group the data by the `'sex'` column using `.groupby()`.
   - Calculate the **mean** of each numeric column for each sex.
   - Print the resulting grouped DataFrame.
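
Since `.unique()`, `.value_counts()`, and `.groupby()` have not appeared earlier in this section, here is a minimal sketch of how they behave. It deliberately uses a small made-up table rather than `tips.csv`; the column names `team`, `points`, and `assists` are hypothetical.

```python
import pandas as pd

# Made-up data, unrelated to tips.csv
games = pd.DataFrame({
    "team": ["A", "B", "A", "B"],
    "points": [10, 20, 30, 40],
    "assists": [1, 2, 3, 4],
})

# Distinct values in a column
teams = games["team"].unique()
print(teams)

# How often each value occurs
counts = games["team"].value_counts()
print(counts)

# Mean of each numeric column within each group
means = games.groupby("team").mean(numeric_only=True)
print(means)
```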
