# Assignment 3: Classification of Wine Dataset


## Task 1: Load the Dataset
1. Use sklearn.datasets.load_wine() to load the dataset.
2. Convert it to a Pandas DataFrame for easier processing.
3. Display basic info: number of rows, columns, feature names, and class distribution.

In [101]:
from sklearn.datasets import load_wine
import pandas as pd
from sklearn.preprocessing import StandardScaler



In [3]:
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target
print(df.head())

   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  target  
0          

In [12]:
print('This is the basic info for wine dataset: \n')
df.info()

num_rows, num_columns = df.shape
print(f'Dimensions of the dataframe: {num_rows} rows and {num_columns} columns')

This is the basic info for wine dataset: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline              

In [99]:
print('The features are: \n' ,wine.feature_names)


The features are: 
 ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']


In [100]:
print("Class distribution is given by: \n")
classes = df['target'].value_counts().sort_index()
classes


Class distribution is given by: 



target
0    59
1    71
2    48
Name: count, dtype: int64

In [95]:
df_description = df.describe()
df_description

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258,0.938202
std,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474,0.775035
min,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0,0.0
25%,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5,0.0
50%,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5,1.0
75%,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0,2.0
max,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0,2.0


## Task 2: Data Preprocessing
1. Check for missing values in the dataset.
2. Perform outlier detection.
3. Apply normalization or standardization if necessary (state your reason).

In [None]:
# Boolean mask: True wherever a value is missing
wine_isnull = df.isnull()
breaker = '-' * 50

# Check every column, including 'target'
for column in wine_isnull.columns:
    # Rows of the mask where this column has a missing value
    null_value_index = wine_isnull.loc[wine_isnull[column]]
    print(f"For the {column} column: \n")
    print(null_value_index)
    print(f'{breaker} \n')

In [103]:
outliers = {}

breaker = '-' * 50
for column in df.columns:
    q1 = df_description.loc['25%', column]
    q3 = df_description.loc['75%', column]
    IQR = q3 - q1

    lower_bound = q1 - 1.5 * IQR
    upper_bound = q3 + 1.5 * IQR
    
    outliers_boolean_table = ((df[column] < (lower_bound)) | (df[column] > (upper_bound)))
    
    outliers_df = df[outliers_boolean_table]

    outliers[f'{column}_outlier'] = outliers_df[column]

    outlier_count = outliers[f'{column}_outlier'].count()

    print(f"The {column} column has {outlier_count} outliers. They are: \n")
    print(outliers[f'{column}_outlier'])
    print(f'{breaker} \n')

The alcohol column has 0 outliers. They are: 

Series([], Name: alcohol, dtype: float64)
-------------------------------------------------- 

The malic_acid column has 3 outliers. They are: 

123    5.80
137    5.51
173    5.65
Name: malic_acid, dtype: float64
-------------------------------------------------- 

The ash column has 3 outliers. They are: 

25     3.22
59     1.36
121    3.23
Name: ash, dtype: float64
-------------------------------------------------- 

The alcalinity_of_ash column has 4 outliers. They are: 

59     10.6
73     30.0
121    28.5
127    28.5
Name: alcalinity_of_ash, dtype: float64
-------------------------------------------------- 

The magnesium column has 4 outliers. They are: 

69    151.0
73    139.0
78    136.0
95    162.0
Name: magnesium, dtype: float64
-------------------------------------------------- 

The total_phenols column has 0 outliers. They are: 

Series([], Name: total_phenols, dtype: float64)
-----------------------------------------------
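
The IQR rule used above can be checked by hand for one column. The sketch below assumes the `malic_acid` quartiles from the `describe()` output (1.6025 and 3.0825) and the three values reported as outliers. It is only a sanity check of the arithmetic, not a replacement for the loop.

```python
# Quartiles copied from the describe() output for malic_acid
q1, q3 = 1.6025, 3.0825
iqr = q3 - q1

# Standard 1.5 * IQR fences
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# The three values the loop reported as malic_acid outliers
reported = [5.80, 5.51, 5.65]
flagged = [v for v in reported if v < lower_bound or v > upper_bound]

print(f"IQR = {iqr:.4f}, bounds = [{lower_bound:.4f}, {upper_bound:.4f}]")
print("Flagged:", flagged)
```

The lower fence is negative here, so for `malic_acid` (minimum 0.74) only high values can be flagged.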

In [108]:
#dropping the target column as we do not scale that one
new_features = df.drop('target', axis = 1)

features = new_features.columns

scaler = StandardScaler()
scaler_features_array = scaler.fit_transform(new_features)

df_standard = pd.DataFrame(data= scaler_features_array, columns = features)

#Adding the target column back
df_standard['target']= df['target']

df_standard.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,1.518613,-0.56225,0.232053,-1.169593,1.913905,0.808997,1.034819,-0.659563,1.224884,0.251717,0.362177,1.84792,1.013009,0
1,0.24629,-0.499413,-0.827996,-2.490847,0.018145,0.568648,0.733629,-0.820719,-0.544721,-0.293321,0.406051,1.113449,0.965242,0
2,0.196879,0.021231,1.109334,-0.268738,0.088358,0.808997,1.215533,-0.498407,2.135968,0.26902,0.318304,0.788587,1.395148,0
3,1.69155,-0.346811,0.487926,-0.809251,0.930918,2.491446,1.466525,-0.981875,1.032155,1.186068,-0.427544,1.184071,2.334574,0
4,0.2957,0.227694,1.840403,0.451946,1.281985,0.808997,0.663351,0.226796,0.401404,-0.319276,0.362177,0.449601,-0.037874,0


In [109]:
#Comparing outlier values to see effect of standardization

original_value = df['malic_acid'].loc[123]
print(original_value)

standardized_value = df_standard['malic_acid'].loc[123]

print(f'The original value for malic_acid column outlier at row 123 was {original_value} which has now become standardized as {standardized_value}')


5.8
The original value for malic_acid column outlier at row 123 was 5.8 which has now become standardized as 3.1091924671589037
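
One likely reason to standardize here is that the features sit on very different scales (in the `describe()` output, `proline` has a mean near 747 while `hue` has a mean near 0.96), which can matter for scale-sensitive models such as logistic regression. The sketch below reproduces the standardized value by hand. It assumes the mean and standard deviation printed by `describe()`, and relies on the fact that `describe()` reports the sample standard deviation (ddof=1) while `StandardScaler` divides by the population standard deviation (ddof=0).

```python
import math

# Values copied from the describe() output for malic_acid
n = 178
mean = 2.336348
sample_std = 1.117146  # pandas describe() uses ddof=1

# StandardScaler uses the population standard deviation (ddof=0)
population_std = sample_std * math.sqrt((n - 1) / n)

x = 5.8  # original malic_acid value at row 123
z = (x - mean) / population_std

print(f"z-score with population std: {z:.4f}")
print(f"z-score with sample std:     {(x - mean) / sample_std:.4f}")
```

The two z-scores differ slightly, which explains why dividing by the `describe()` standard deviation directly does not match the scaler output exactly.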


## Task 3: Data Imbalance Check and Handling

1. Check class distribution in the target variable.
2. If the dataset is imbalanced, use SMOTE (Synthetic Minority Over-sampling Technique) to balance it.
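
A minimal sketch of the imbalance check, using the class counts printed in Task 1 (59, 71, 48). The 1.5 threshold is an assumption chosen for illustration, not a standard cut-off, and `balance_with_smote` is a hypothetical helper that is only defined here, not called. It relies on `SMOTE` from the separate `imbalanced-learn` package, which must be installed before the helper is used.

```python
# Class counts copied from the value_counts() output in Task 1
class_counts = {0: 59, 1: 71, 2: 48}


def imbalance_ratio(counts):
    """Ratio of the largest class size to the smallest class size."""
    return max(counts.values()) / min(counts.values())


ratio = imbalance_ratio(class_counts)

# Assumed threshold (illustrative only): treat the data as imbalanced above 1.5
IMBALANCE_THRESHOLD = 1.5
needs_resampling = ratio > IMBALANCE_THRESHOLD

print(f"Majority/minority ratio: {ratio:.3f}")
print(f"Resampling suggested under the assumed threshold: {needs_resampling}")


def balance_with_smote(X_train, y_train, random_state=42):
    """Hypothetical helper: oversample minority classes with SMOTE.

    Requires the imbalanced-learn package. Apply it to the training
    split only, so synthetic samples never leak into the test set.
    """
    from imblearn.over_sampling import SMOTE

    return SMOTE(random_state=random_state).fit_resample(X_train, y_train)
```

Under this assumed threshold the ratio (71/48) falls just below the cut-off, so whether to resample remains a judgment call.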


## Task 4: Model Building and Evaluation
1. Train and evaluate two classifiers: Naive Bayes, Logistic Regression
2. Split the dataset into train/test sets (e.g., 80/20 split). Evaluate both models using: Accuracy, Precision, Recall, F1-score
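
One way the two steps above could be sketched with scikit-learn. The choices of `GaussianNB` as the Naive Bayes variant, a stratified split, `random_state=42`, macro averaging, and scaling inside a pipeline for logistic regression are all assumptions made for this example, not requirements of the assignment.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# 80/20 split; stratify keeps class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Naive Bayes": GaussianNB(),
    # The scaler is fitted on the training data only, inside the pipeline
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Macro average: unweighted mean of the per-class scores
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

Fitting the scaler inside the pipeline avoids using test-set statistics during training, which would happen if the whole DataFrame were standardized before the split.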