In [4]:
import numpy as np

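Load the data with np.genfromtxt, using ';' as the delimiter and skipping the header row, and reduce the array's memory footprint by storing the values as np.float32. Verify the size of the resulting array in bytes. The stated target of 76800 bytes corresponds to 1600 rows of 12 float32 values; with the header row skipped the array has 1599 rows, so nbytes is 1599 × 12 × 4 = 76752.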

In [5]:
data = np.genfromtxt('winequality-red.csv', delimiter=';', skip_header=1, dtype=np.float32)
 
print("Array size in bytes:", data.nbytes)

Array size in bytes: 76752
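
To see what the float32 conversion saves, the sketch below compares the same shape stored as float64 and as float32. The array is a hypothetical stand-in filled with zeros (the CSV itself is not read here); only the shape of 1599 rows by 12 columns is taken from the output above.

```python
import numpy as np

# Hypothetical stand-in for the loaded data: 1599 rows x 12 columns of zeros.
rows, cols = 1599, 12
as_float64 = np.zeros((rows, cols), dtype=np.float64)
as_float32 = as_float64.astype(np.float32)

# nbytes equals the number of elements times the size of one element.
print("float64 bytes:", as_float64.nbytes)
print("float32 bytes:", as_float32.nbytes)
print("size * itemsize matches nbytes:",
      as_float32.size * as_float32.itemsize == as_float32.nbytes)
```

Halving the item size from 8 to 4 bytes halves the footprint, at the cost of roughly 7 significant decimal digits of precision instead of about 16.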


Display the 2nd, 7th, and 12th rows as a two-dimensional array. Exclude np.nan values if present.

In [6]:
selected_rows = data[[1, 6, 11]]

# Boolean mask: keep only the rows that contain no NaN values
clean_rows = selected_rows[~np.isnan(selected_rows).any(axis=1)]

# Print plain decimals (no scientific notation), up to 5 decimal places
np.set_printoptions(suppress=True, precision=5)
print(clean_rows)

[[  7.8      0.88     0.       2.6      0.098   25.      67.       0.9968
    3.2      0.68     9.8      5.    ]
 [  7.9      0.6      0.06     1.6      0.069   15.      59.       0.9964
    3.3      0.46     9.4      5.    ]
 [  7.5      0.5      0.36     6.1      0.071   17.     102.       0.9978
    3.35     0.8     10.5      5.    ]]
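
None of the three selected rows contains a NaN, so the mask above removes nothing. As a minimal sketch of what it does when a value is missing, the example below applies the same expression to a small hypothetical array (the values are made up for illustration).

```python
import numpy as np

# Hypothetical 3x3 array; the middle row contains a missing value.
sample = np.array([[1.0, 2.0, 3.0],
                   [4.0, np.nan, 6.0],
                   [7.0, 8.0, 9.0]], dtype=np.float32)

has_nan = np.isnan(sample).any(axis=1)  # one boolean per row
clean = sample[~has_nan]                # keep rows without any NaN

print(has_nan)
print(clean.shape)
```

The result stays two-dimensional because the mask selects whole rows rather than individual elements.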


Determine if there is any wine in the dataset with an alcohol percentage greater than 20%. Return True or False.

In [7]:
high_alcohol = np.any(data[:, 10] > 20)
print(f"Is there any wine with > 20% alcohol? {high_alcohol}")

Is there any wine with > 20% alcohol? False


Calculate the average alcohol percentage across all wines in the dataset. Exclude np.nan values if present.

In [8]:
avg_alcohol = np.nanmean(data[:, 10])
print(f"Average alcohol percentage: {avg_alcohol:.2f}%")

Average alcohol percentage: 10.42%


Compute the following statistics for the pH values: minimum, maximum, 25th percentile, 50th percentile, 75th percentile, and mean.

In [9]:
ph_col = data[:, 8]

print("pH Statistics:")
print(f"   Minimum:         {np.nanmin(ph_col):.2f}")
print(f"   Maximum:         {np.nanmax(ph_col):.2f}")
print(f"   25th Percentile: {np.nanpercentile(ph_col, 25):.2f}")
print(f"   50th Percentile: {np.nanpercentile(ph_col, 50):.2f}") # This is the median
print(f"   75th Percentile: {np.nanpercentile(ph_col, 75):.2f}")
print(f"   Mean:            {np.nanmean(ph_col):.2f}")

pH Statistics:
   Minimum:         2.74
   Maximum:         4.01
   25th Percentile: 3.21
   50th Percentile: 3.31
   75th Percentile: 3.40
   Mean:            3.31
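
The three percentiles can also be requested in a single call, since np.nanpercentile accepts a sequence of percentile values. The sketch below assumes a small hypothetical sample in place of the pH column.

```python
import numpy as np

# Hypothetical sample with one missing value, standing in for the pH column.
values = np.array([1.0, 2.0, 3.0, 4.0, np.nan, 5.0])

# One call returns the 25th, 50th and 75th percentiles, ignoring NaN.
q25, q50, q75 = np.nanpercentile(values, [25, 50, 75])

print(f"25th: {q25:.2f}, 50th: {q50:.2f}, 75th: {q75:.2f}")
```

This avoids scanning the column three times and keeps the percentile list in one place.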


Find the average quality score of the 20% of wines with the lowest sulphate content.

In [10]:
p20_sulphates = np.nanpercentile(data[:, 9], 20)

low_sulphate_mask = data[:, 9] < p20_sulphates

avg_quality_low_sulphates = np.nanmean(data[low_sulphate_mask, 11])
print(f"Average quality of wines with lowest 20% sulphates: {avg_quality_low_sulphates:.2f}")

Average quality of wines with lowest 20% sulphates: 5.19
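
The cell above uses a strict comparison against the 20th percentile, so wines whose sulphate value equals the threshold are left out. When values are tied at the threshold, the selected share can differ from 20%. The sketch below illustrates this on hypothetical values and shows one alternative, sorting and taking a fixed count, which is an assumption about what "lowest 20%" should mean rather than the method used above.

```python
import numpy as np

# Hypothetical sulphate values with a tie at the 20th percentile (0.5).
sulphates = np.array([0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2])

threshold = np.nanpercentile(sulphates, 20)

strict_count = int(np.sum(sulphates < threshold))      # excludes ties
inclusive_count = int(np.sum(sulphates <= threshold))  # includes ties

# Alternative: sort and take exactly 20% of the rows by position.
k = int(0.2 * sulphates.size)
lowest_idx = np.argsort(sulphates)[:k]

print("threshold:", threshold)
print("strict:", strict_count, "inclusive:", inclusive_count, "by rank:", len(lowest_idx))
```

Which variant is appropriate depends on whether the task means "below the 20th percentile" or "the lowest fifth of the rows"; the rank-based version breaks ties arbitrarily by sort order.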


Compute the mean of every variable for the wines with the best quality score, and do the same for the wines with the worst quality score.

In [11]:
best_score = np.nanmax(data[:, 11])
worst_score = np.nanmin(data[:, 11])

best_wines_mean = np.nanmean(data[data[:, 11] == best_score], axis=0)
worst_wines_mean = np.nanmean(data[data[:, 11] == worst_score], axis=0)


print(f"Mean of all variables for BEST quality wines (Score: {best_score}):")
print(best_wines_mean)

print(f"\nMean of all variables for WORST quality wines (Score: {worst_score}):")
print(worst_wines_mean)

Mean of all variables for BEST quality wines (Score: 8.0):
[ 8.56667  0.42333  0.39111  2.57778  0.06844 13.27778 33.44444  0.99521
  3.26722  0.76778 12.09444  8.     ]

Mean of all variables for WORST quality wines (Score: 3.0):
[ 8.36     0.8845   0.171    2.635    0.1225  11.      24.9      0.99746
  3.398    0.57     9.955    3.     ]
