In [23]:
import numpy as np

Load the data using genfromtxt, specifying ';' as the delimiter and skipping the header row. Reduce the array's memory footprint by choosing a smaller data type: use np.float32 and verify that the resulting NumPy array occupies 76800 bytes.

In [29]:
data = np.genfromtxt('winequality-red.csv', delimiter=';', skip_header=1, dtype=np.float32)
 
print("Array size in bytes:", data.nbytes)

Array size in bytes: 76752
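
As a sanity check on the byte count, nbytes is the number of elements times the size of one element. With 12 columns of 4-byte floats, 76752 bytes corresponds to 1599 rows (1599 × 12 × 4), whereas 76800 bytes would correspond to 1600 rows, which is presumably what one would get if the header line were not skipped. The sketch below illustrates the arithmetic on a small in-memory CSV; the sample values and the names csv_text, data64 and data32 are made up for illustration and are not part of the wine dataset.

```python
import io
import numpy as np

# Hypothetical 3x4 sample standing in for the real CSV file.
csv_text = "a;b;c;d\n1.0;2.0;3.0;4.0\n5.0;6.0;7.0;8.0\n9.0;10.0;11.0;12.0\n"

# Default dtype is float64 (8 bytes per element).
data64 = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1)
# float32 stores each element in 4 bytes.
data32 = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1,
                       dtype=np.float32)

# nbytes == number of elements * bytes per element
print(data64.nbytes, data64.size * data64.itemsize)  # 96 96
print(data32.nbytes, data32.size * data32.itemsize)  # 48 48
```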


Display the 2nd, 7th, and 12th rows as a two-dimensional array. Exclude np.nan values if present.

In [None]:
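# Rows are 0-indexed, so the 2nd, 7th and 12th rows are indices 1, 6 and 11.
# Indexing with a list returns a two-dimensional array.
selected_rows = data[[1, 6, 11]]

# Keep only the rows that contain no NaN in any column.
clean_rows = selected_rows[~np.isnan(selected_rows).any(axis=1)]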

print("\nSelected Rows (2nd, 7th, 12th) without NaNs:")
print(clean_rows)


Selected Rows (2nd, 7th, 12th) without NaNs:
[[7.800e+00 8.800e-01 0.000e+00 2.600e+00 9.800e-02 2.500e+01 6.700e+01
  9.968e-01 3.200e+00 6.800e-01 9.800e+00 5.000e+00]
 [7.900e+00 6.000e-01 6.000e-02 1.600e+00 6.900e-02 1.500e+01 5.900e+01
  9.964e-01 3.300e+00 4.600e-01 9.400e+00 5.000e+00]
 [7.500e+00 5.000e-01 3.600e-01 6.100e+00 7.100e-02 1.700e+01 1.020e+02
  9.978e-01 3.350e+00 8.000e-01 1.050e+01 5.000e+00]]
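
The selected rows above happen to contain no NaNs, so the filter leaves them unchanged. To see the filter actually drop a row, here is a minimal sketch on a made-up array (demo, picked and kept are illustrative names, not part of the notebook's data):

```python
import numpy as np

# Hypothetical 4x3 array; the row at index 2 contains a NaN.
demo = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, np.nan, 9.0],
                 [10.0, 11.0, 12.0]], dtype=np.float32)

picked = demo[[0, 2, 3]]                     # list indexing keeps the result 2-D
row_has_nan = np.isnan(picked).any(axis=1)   # one boolean per selected row
kept = picked[~row_has_nan]                  # drop rows with at least one NaN

print(picked.shape)  # (3, 3)
print(kept.shape)    # (2, 3)
```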


Determine if there is any wine in the dataset with an alcohol percentage greater than 20%. Return True or False.

In [36]:
high_alcohol = np.any(data[:, 10] > 20)
print(f"Is there any wine with > 20% alcohol? {high_alcohol}")

Is there any wine with > 20% alcohol? False
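
No NaN handling is needed for this check, because any comparison against NaN evaluates to False. A small sketch with made-up values (the alcohol array here is hypothetical):

```python
import numpy as np

# Hypothetical alcohol values, one of them missing.
alcohol = np.array([9.4, np.nan, 12.8], dtype=np.float32)

# Some NumPy versions may warn on NaN comparisons; silence that locally.
with np.errstate(invalid="ignore"):
    over_20 = alcohol > 20

print(over_20)        # the NaN entry compares as False
print(over_20.any())  # False
```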


Calculate the average alcohol percentage across all wines in the dataset. Exclude np.nan values if present.

In [35]:
avg_alcohol = np.nanmean(data[:, 10])
print(f"Average alcohol percentage: {avg_alcohol:.2f}%")

Average alcohol percentage: 10.42%
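
The reason for using np.nanmean rather than np.mean is that a single NaN makes the plain mean NaN. A minimal sketch with made-up values; passing dtype=np.float64 is an optional extra, shown here on the assumption that a higher-precision accumulator is wanted for float32 input:

```python
import numpy as np

# Hypothetical values with one missing entry.
values = np.array([9.0, 10.0, np.nan, 11.0], dtype=np.float32)

plain = np.mean(values)                          # NaN propagates into the result
safe = np.nanmean(values)                        # mean of 9, 10 and 11
safe64 = np.nanmean(values, dtype=np.float64)    # same, accumulated in float64

print(plain, safe, safe64)
```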


Compute various statistical measures for the pH values: minimum, maximum, 25th, 50th, and 75th percentiles, and the mean.

In [38]:
ph_col = data[:, 8]

print("pH Statistics:")
print(f"   Minimum:         {np.nanmin(ph_col):.2f}")
print(f"   Maximum:         {np.nanmax(ph_col):.2f}")
print(f"   25th Percentile: {np.nanpercentile(ph_col, 25):.2f}")
print(f"   50th Percentile: {np.nanpercentile(ph_col, 50):.2f}") # This is the median
print(f"   75th Percentile: {np.nanpercentile(ph_col, 75):.2f}")
print(f"   Mean:            {np.nanmean(ph_col):.2f}")

pH Statistics:
   Minimum:         2.74
   Maximum:         4.01
   25th Percentile: 3.21
   50th Percentile: 3.31
   75th Percentile: 3.40
   Mean:            3.31
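
np.nanpercentile also accepts a sequence of percentiles, so the three quartiles can be computed in one call, and the 50th percentile should agree with np.nanmedian. A short sketch on made-up pH values (the ph array is hypothetical):

```python
import numpy as np

# Hypothetical pH values with one missing entry.
ph = np.array([3.0, 3.2, np.nan, 3.4, 3.6, 3.8], dtype=np.float32)

# One call returns all three quartiles, ignoring the NaN.
q25, q50, q75 = np.nanpercentile(ph, [25, 50, 75])

print(f"{q25:.2f} {q50:.2f} {q75:.2f}")
print(np.isclose(q50, np.nanmedian(ph)))  # the 50th percentile is the median
```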


Find the average quality score of the 20% of wines with the lowest sulphate content.

In [None]:
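# 20th percentile of the sulphates column (index 9), ignoring NaNs.
p20_sulphates = np.nanpercentile(data[:, 9], 20)

# Wines strictly below the cutoff; NaN entries compare as False and are excluded.
low_sulphate_mask = data[:, 9] < p20_sulphates

# Mean quality score (index 11) of the selected wines.
avg_quality_low_sulphates = np.nanmean(data[low_sulphate_mask, 11])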
print(f"Average quality of wines with lowest 20% sulphates: {avg_quality_low_sulphates:.2f}")

Average quality of wines with lowest 20% sulphates: 5.19
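
The cell above uses a strict comparison (<), so wines whose sulphate value equals the 20th percentile are left out. Whether the task intends < or <= is not stated; when many wines share the cutoff value, the two readings can select noticeably different groups. A sketch on made-up data (the sulphates and quality arrays below are hypothetical):

```python
import numpy as np

# Hypothetical sulphate and quality values; two wines tie at 0.5.
sulphates = np.array([0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2])
quality = np.array([5, 5, 6, 6, 6, 7, 5, 6, 7, 8], dtype=float)

cutoff = np.nanpercentile(sulphates, 20)   # 0.5 for this sample

strict = sulphates < cutoff       # excludes wines equal to the cutoff
inclusive = sulphates <= cutoff   # includes them

print(cutoff, strict.sum(), inclusive.sum())
print(quality[strict].mean(), quality[inclusive].mean())
```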


Compute the mean of every variable for the wines with the best quality score, and do the same for the wines with the worst quality score.

In [None]:
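# Highest and lowest quality scores (column index 11), ignoring NaNs.
best_score = np.nanmax(data[:, 11])
worst_score = np.nanmin(data[:, 11])

# Column-wise means over the rows whose quality equals each extreme score.
best_wines_mean = np.nanmean(data[data[:, 11] == best_score], axis=0)
worst_wines_mean = np.nanmean(data[data[:, 11] == worst_score], axis=0)

# Print with two decimals and without scientific notation.
np.set_printoptions(precision=2, suppress=True)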

print(f"Mean of all variables for BEST quality wines (Score: {best_score}):")
print(best_wines_mean)

print(f"\nMean of all variables for WORST quality wines (Score: {worst_score}):")
print(worst_wines_mean)

Mean of all variables for BEST quality wines (Score: 8.0):
[ 8.57  0.42  0.39  2.58  0.07 13.28 33.44  1.    3.27  0.77 12.09  8.  ]

Mean of all variables for WORST quality wines (Score: 3.0):
[ 8.36  0.88  0.17  2.64  0.12 11.   24.9   1.    3.4   0.57  9.95  3.  ]
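
Because the best and worst groups can be small, it may be worth checking how many wines each mean is based on before comparing them. One possible way is np.unique with return_counts=True, sketched here on made-up data (the quality and alcohol arrays are hypothetical):

```python
import numpy as np

# Hypothetical quality scores and alcohol values for eight wines.
quality = np.array([3, 5, 5, 6, 8, 8, 3, 5], dtype=np.float32)
alcohol = np.array([9.0, 9.5, 10.0, 10.5, 12.0, 13.0, 10.0, 9.8],
                   dtype=np.float32)

# Distinct scores (sorted ascending) and the number of wines with each score.
scores, counts = np.unique(quality, return_counts=True)

for s, c in zip(scores, counts):
    group_mean = alcohol[quality == s].mean()
    print(f"score {s:.0f}: {c} wines, mean alcohol {group_mean:.2f}")
```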
