## Objectives
* To perform fundamental data analysis on real data using NumPy.

1. Load the data using genfromtxt, specifying ';' as the delimiter and excluding the header row, and reduce the memory footprint of the NumPy array by choosing a smaller data type. Use np.float32 and verify that the resulting array occupies 76800 bytes.

In [20]:
import numpy as np

# Load the data using genfromtxt
# - delimiter=';' for semicolon-separated values
# - skip_header=1 to exclude the header row
# - dtype=np.float32 to optimize memory usage
wine_data = np.genfromtxt(
    '../data/winequality-red.csv',
    delimiter=';',
    skip_header=1,
    dtype=np.float32
)

print("Wine data shape:", wine_data.shape)
print("Wine data dtype:", wine_data.dtype)
print("Wine data size in bytes:", wine_data.nbytes)
print()

# Verify the size is 76800 bytes
expected_size = 76800
actual_size = wine_data.nbytes

print(f"Expected size: {expected_size} bytes")
print(f"Actual size: {actual_size} bytes")
print(f"Size verification: {'✓ PASSED' if actual_size == expected_size else '✗ FAILED'}")

if actual_size != expected_size:
    print(f"\nNote: To get exactly {expected_size} bytes:")
    print(f"  - With float32 (4 bytes each): need {expected_size // 4} elements")
    print(f"  - Current shape {wine_data.shape} = {wine_data.size} elements")
    print(f"  - For 12 columns: need {expected_size // 4 // 12} rows")
    print(f"  - For 13 columns: need {expected_size // 4 // 13} rows")
print()

# Display first few rows
print("First 5 rows of the data:")
print(wine_data[:5])
print()

# Calculate how the size is determined
num_elements = wine_data.size
bytes_per_element = wine_data.itemsize
print(f"Number of elements: {num_elements}")
print(f"Bytes per element (float32): {bytes_per_element}")
print(f"Total: {num_elements} × {bytes_per_element} = {num_elements * bytes_per_element} bytes")

Wine data shape: (1599, 12)
Wine data dtype: float32
Wine data size in bytes: 76752

Expected size: 76800 bytes
Actual size: 76752 bytes
Size verification: ✗ FAILED

Note: To get exactly 76800 bytes:
  - With float32 (4 bytes each): need 19200 elements
  - Current shape (1599, 12) = 19188 elements
  - For 12 columns: need 1600 rows
  - For 13 columns: need 1476 rows

First 5 rows of the data:
[[7.400e+00 7.000e-01 0.000e+00 1.900e+00 7.600e-02 1.100e+01 3.400e+01
  9.978e-01 3.510e+00 5.600e-01 9.400e+00 5.000e+00]
 [7.800e+00 8.800e-01 0.000e+00 2.600e+00 9.800e-02 2.500e+01 6.700e+01
  9.968e-01 3.200e+00 6.800e-01 9.800e+00 5.000e+00]
 [7.800e+00 7.600e-01 4.000e-02 2.300e+00 9.200e-02 1.500e+01 5.400e+01
  9.970e-01 3.260e+00 6.500e-01 9.800e+00 5.000e+00]
 [1.120e+01 2.800e-01 5.600e-01 1.900e+00 7.500e-02 1.700e+01 6.000e+01
  9.980e-01 3.160e+00 5.800e-01 9.800e+00 6.000e+00]
 [7.400e+00 7.000e-01 0.000e+00 1.900e+00 7.600e-02 1.100e+01 3.400e+01
  9.978e-01 3.510e+00 5.600e-01 
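
The 48-byte gap between the expected and actual sizes equals exactly one row of 12 float32 values (12 × 4 bytes). One possible explanation, offered here as an assumption rather than something the exercise states, is that the 76800-byte figure counts the header line as a row: when the header is not skipped, genfromtxt parses it as data and fills that row with NaN. The sketch below reproduces the effect on a small hypothetical sample (the `sample` string and its column names are invented for illustration and are not the wine dataset) and also compares float32 against float64 storage.

```python
import io
import numpy as np

# Hypothetical sample in the same layout as the wine file:
# semicolon-delimited with a quoted header row. Not the real dataset.
sample = '"a";"b";"c"\n1.0;2.0;3.0\n4.0;5.0;6.0\n7.0;8.0;9.0\n'

# Header skipped: only the three data rows are loaded.
without_header = np.genfromtxt(
    io.StringIO(sample), delimiter=';', skip_header=1, dtype=np.float32
)

# Header kept: the header line is parsed as a row and becomes NaN.
with_header = np.genfromtxt(
    io.StringIO(sample), delimiter=';', dtype=np.float32
)

print(without_header.shape, without_header.nbytes)  # (3, 3) 36
print(with_header.shape, with_header.nbytes)        # (4, 3) 48
print(np.isnan(with_header[0]).all())               # True

# float64 (NumPy's default float type) needs twice the storage of float32.
as_float64 = without_header.astype(np.float64)
print(as_float64.nbytes)                            # 72

# The same arithmetic at the scale of the wine data (12 columns, 4 bytes each).
print(1599 * 12 * 4)  # 76752
print(1600 * 12 * 4)  # 76800
```

If that assumption holds, loading the file without skip_header=1 would give a (1600, 12) array of 76800 bytes whose first row is all NaN, which would also explain why the next task asks to exclude np.nan values.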

2. Display the 2nd, 7th, and 12th rows as a two-dimensional array. Exclude np.nan values if present.

In [32]:
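# Rows 2, 7 and 12 correspond to zero-based indices 1, 6 and 11;
# indexing with a list of rows keeps the result two-dimensional.
selected_rows = wine_data[[1, 6, 11], :]

# Drop NaN entries from each row. This only yields a 2-D array when every
# row loses the same number of values (here none of them contain NaN).
clean_rows = np.array([row[~np.isnan(row)] for row in selected_rows])

print(clean_rows)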

[[7.800e+00 8.800e-01 0.000e+00 2.600e+00 9.800e-02 2.500e+01 6.700e+01
  9.968e-01 3.200e+00 6.800e-01 9.800e+00 5.000e+00]
 [7.900e+00 6.000e-01 6.000e-02 1.600e+00 6.900e-02 1.500e+01 5.900e+01
  9.964e-01 3.300e+00 4.600e-01 9.400e+00 5.000e+00]
 [7.500e+00 5.000e-01 3.600e-01 6.100e+00 7.100e-02 1.700e+01 1.020e+02
  9.978e-01 3.350e+00 8.000e-01 1.050e+01 5.000e+00]]
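
Filtering NaN values row by row can leave rows of different lengths, which no longer form a rectangular array. A minimal sketch of two alternatives that always stay two-dimensional is shown below: dropping whole rows or whole columns that contain NaN. The `rows` array is a hypothetical stand-in for a selection from wine_data, and which of the two variants the exercise intends is an assumption.

```python
import numpy as np

# Hypothetical 3x4 selection with a single NaN, standing in for rows
# picked from wine_data with a list of indices.
rows = np.array(
    [[1.0, 2.0, np.nan, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0]],
    dtype=np.float32,
)

nan_mask = np.isnan(rows)

# Variant 1: keep only the rows that contain no NaN.
complete_rows = rows[~nan_mask.any(axis=1)]

# Variant 2: keep only the columns that contain no NaN.
complete_cols = rows[:, ~nan_mask.any(axis=0)]

print(complete_rows.shape)  # (2, 4)
print(complete_cols.shape)  # (3, 3)
```

Both results keep ndim == 2 regardless of where the NaN values fall, and both return the input unchanged when no NaN is present.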
