# Wine

Conduct a feature engineering analysis on the wine quality dataset to create new features that better explain wine quality variations. How do these engineered features impact model performance?

In [1]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [7]:
df = pd.read_csv('Wine.csv')
df.head()

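|   | Wine | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |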


In [8]:
df.describe()

|   | Wine | Alcohol | Malic.acid | Ash | Acl | Mg | Phenols | Flavanoids | Nonflavanoid.phenols | Proanth | Color.int | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| mean | 1.938202 | 13.000618 | 2.336348 | 2.366517 | 19.494944 | 99.741573 | 2.295112 | 2.02927 | 0.361854 | 1.590899 | 5.05809 | 0.957449 | 2.611685 | 746.893258 |
| std | 0.775035 | 0.811827 | 1.117146 | 0.274344 | 3.339564 | 14.282484 | 0.625851 | 0.998859 | 0.124453 | 0.572359 | 2.318286 | 0.228572 | 0.70999 | 314.907474 |
| min | 1.0 | 11.03 | 0.74 | 1.36 | 10.6 | 70.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 278.0 |
| 25% | 1.0 | 12.3625 | 1.6025 | 2.21 | 17.2 | 88.0 | 1.7425 | 1.205 | 0.27 | 1.25 | 3.22 | 0.7825 | 1.9375 | 500.5 |
| 50% | 2.0 | 13.05 | 1.865 | 2.36 | 19.5 | 98.0 | 2.355 | 2.135 | 0.34 | 1.555 | 4.69 | 0.965 | 2.78 | 673.5 |
| 75% | 3.0 | 13.6775 | 3.0825 | 2.5575 | 21.5 | 107.0 | 2.8 | 2.875 | 0.4375 | 1.95 | 6.2 | 1.12 | 3.17 | 985.0 |
| max | 3.0 | 14.83 | 5.8 | 3.23 | 30.0 | 162.0 | 3.88 | 5.08 | 0.66 | 3.58 | 13.0 | 1.71 | 4.0 | 1680.0 |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Wine                  178 non-null    int64  
 1   Alcohol               178 non-null    float64
 2   Malic.acid            178 non-null    float64
 3   Ash                   178 non-null    float64
 4   Acl                   178 non-null    float64
 5   Mg                    178 non-null    int64  
 6   Phenols               178 non-null    float64
 7   Flavanoids            178 non-null    float64
 8   Nonflavanoid.phenols  178 non-null    float64
 9   Proanth               178 non-null    float64
 10  Color.int             178 non-null    float64
 11  Hue                   178 non-null    float64
 12  OD                    178 non-null    float64
 13  Proline               178 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 19.6 KB


In [21]:
# Check for null values
df.isnull().sum()

Wine                    0
Alcohol                 0
Malic.acid              0
Ash                     0
Acl                     0
Mg                      0
Phenols                 0
Flavanoids              0
Nonflavanoid.phenols    0
Proanth                 0
Color.int               0
Hue                     0
OD                      0
Proline                 0
dtype: int64

In [24]:
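# Create a new feature for the interaction between alcohol and malic acid.
# The dataset has no pH column; malic acid is the acidity-related measurement available.
df['interaction_between_alcohol_and_malic_acid'] = df['Alcohol'] * df['Malic.acid']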
print(df)

     Wine  Alcohol  Malic.acid   Ash   Acl   Mg  Phenols  Flavanoids  \
0       1    14.23        1.71  2.43  15.6  127     2.80        3.06   
1       1    13.20        1.78  2.14  11.2  100     2.65        2.76   
2       1    13.16        2.36  2.67  18.6  101     2.80        3.24   
3       1    14.37        1.95  2.50  16.8  113     3.85        3.49   
4       1    13.24        2.59  2.87  21.0  118     2.80        2.69   
..    ...      ...         ...   ...   ...  ...      ...         ...   
173     3    13.71        5.65  2.45  20.5   95     1.68        0.61   
174     3    13.40        3.91  2.48  23.0  102     1.80        0.75   
175     3    13.27        4.28  2.26  20.0  120     1.59        0.69   
176     3    13.17        2.59  2.37  20.0  120     1.65        0.68   
177     3    14.13        4.10  2.74  24.5   96     2.05        0.76   

     Nonflavanoid.phenols  Proanth  Color.int   Hue    OD  Proline  \
0                    0.28     2.29       5.64  1.04  3.92     106
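
One quick way to judge whether a product feature like this one carries class information is to compare its mean per `Wine` class. The sketch below is only illustrative: it rebuilds the ten rows visible in the printed output above (five from class 1, five from class 3) rather than loading `Wine.csv`, and the column name `alcohol_x_malic_acid` is a hypothetical shorthand, not a name used in the notebook.

```python
import pandas as pd

# Only the ten rows shown in the printed output (classes 1 and 3);
# class 2 rows are not visible there, so they are not included.
sample = pd.DataFrame({
    "Wine": [1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "Alcohol": [14.23, 13.20, 13.16, 14.37, 13.24,
                13.71, 13.40, 13.27, 13.17, 14.13],
    "Malic.acid": [1.71, 1.78, 2.36, 1.95, 2.59,
                   5.65, 3.91, 4.28, 2.59, 4.10],
})

# Hypothetical column name for the alcohol * malic acid product.
sample["alcohol_x_malic_acid"] = sample["Alcohol"] * sample["Malic.acid"]

# Mean of the engineered feature within each class.
class_means = sample.groupby("Wine")["alcohol_x_malic_acid"].mean()
print(class_means)
```

On the full dataset the same `groupby` can be run on `df` directly; ten rows are too few to draw conclusions from.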

In [27]:
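# Ratio features: ash relative to malic acid, and flavanoids relative to phenols.
df['ratio_of_ash_to_malic_acid'] = df['Ash'] / df['Malic.acid']
df['ratio_of_flavonoids_to_phenols'] = df['Flavanoids'] / df['Phenols']
# Interaction feature: product of alcohol and magnesium.
df['interaction_between_alcohol_and_MG'] = df['Alcohol'] * df['Mg']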
print(df)

     Wine  Alcohol  Malic.acid   Ash   Acl   Mg  Phenols  Flavanoids  \
0       1    14.23        1.71  2.43  15.6  127     2.80        3.06   
1       1    13.20        1.78  2.14  11.2  100     2.65        2.76   
2       1    13.16        2.36  2.67  18.6  101     2.80        3.24   
3       1    14.37        1.95  2.50  16.8  113     3.85        3.49   
4       1    13.24        2.59  2.87  21.0  118     2.80        2.69   
..    ...      ...         ...   ...   ...  ...      ...         ...   
173     3    13.71        5.65  2.45  20.5   95     1.68        0.61   
174     3    13.40        3.91  2.48  23.0  102     1.80        0.75   
175     3    13.27        4.28  2.26  20.0  120     1.59        0.69   
176     3    13.17        2.59  2.37  20.0  120     1.65        0.68   
177     3    14.13        4.10  2.74  24.5   96     2.05        0.76   

     Nonflavanoid.phenols  Proanth  Color.int   Hue    OD  Proline  \
0                    0.28     2.29       5.64  1.04  3.92     106
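
Ratio features become `inf` wherever the denominator is zero. The `df.describe()` output above shows minimums of 0.74 for `Malic.acid` and 0.98 for `Phenols`, so plain division is safe on this file, but a guard is cheap if the data ever changes. The following is a minimal sketch under stated assumptions: `safe_ratio` is a hypothetical helper, and the third row of the sample frame is made up to contain a zero denominator.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, yielding NaN instead of inf where the denominator is 0."""
    return numerator / denominator.replace(0, np.nan)

# Rows 1 and 2 come from the printed output; row 3 is a made-up
# example with a zero denominator to exercise the guard.
sample = pd.DataFrame({
    "Flavanoids": [3.06, 0.61, 0.75],
    "Phenols": [2.80, 1.68, 0.0],
})

sample["flavanoids_to_phenols"] = safe_ratio(sample["Flavanoids"], sample["Phenols"])
print(sample)
```

Leaving the result as NaN keeps the problem visible; whether to impute or drop such rows afterwards is a separate modelling decision.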

# How feature engineering affects model performance

Feature engineering strongly influences the performance of machine learning models: informative, relevant features make patterns easier to learn, while redundant or noisy ones can hurt accuracy. The main ways it affects performance are:

1. Improved predictive power: Feature engineering allows you to extract meaningful information from the raw data and create new features that capture important patterns and relationships. By incorporating domain knowledge and transforming the data appropriately, you can enhance the predictive power of the model. Well-engineered features can highlight relevant patterns, making it easier for the model to learn and make accurate predictions.

2. Noise reduction: Raw data often contains noise, outliers, and irrelevant information. Steps such as outlier handling and missing-value imputation reduce that noise, while scaling or normalization puts features on comparable ranges, so that a large-valued column (here Proline, which runs from 278 to 1680) does not dominate a small-valued one (Hue, 0.48 to 1.71) in distance-based or regularized models. A better signal-to-noise ratio lets the model focus on the relevant patterns.

3. Dimensionality reduction: In many real-world problems, the number of features can be large, leading to the curse of dimensionality. Feature engineering techniques like PCA (Principal Component Analysis), feature selection, and feature extraction can reduce the dimensionality of the data while preserving important information. Dimensionality reduction can help overcome issues such as overfitting, improve computational efficiency, and enhance model generalization.

4. Handling non-linearity: Linear models assume a linear relationship between the features and the target (for logistic regression, between the features and the log-odds of the class). Transformations such as the logarithm or square root, and interaction terms such as the Alcohol × Malic.acid product created above, let these models capture non-linear relationships they would otherwise miss.

5. Addressing categorical variables: Categorical variables, such as gender or product category, must be encoded as numbers for most machine learning algorithms. One-hot encoding gives each category its own binary column and so imposes no order. Label encoding assigns integer codes, which suits ordinal categories but can suggest a spurious order for nominal ones. Target encoding replaces each category with a statistic of the target and needs care to avoid leakage. Choosing an encoding that matches the variable keeps the model from learning an ordering that does not exist.

6. Time-series feature engineering: In time-series analysis, feature engineering becomes crucial in capturing temporal patterns and dependencies. Creating lagged features, rolling statistics, or time-based aggregations can help the model capture trends, seasonality, and other time-related patterns. Time-series feature engineering can significantly enhance the model's ability to make accurate predictions in forecasting or anomaly detection tasks.
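
None of the cells above fits a model, so the effect of the engineered features on this dataset is still unmeasured. One way to measure it is to compare cross-validated accuracy with and without the new columns. The sketch below rests on several assumptions: scikit-learn is installed; its bundled wine data (`load_wine`), which appears to be the same 178-row dataset, stands in for `Wine.csv` and uses different column names (`alcohol`, `malic_acid`, `total_phenols`, `magnesium`, and so on); and a scaled logistic regression with 5-fold stratified cross-validation is an arbitrary but reasonable choice of evaluator. The helper `cv_accuracy` and the new column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for Wine.csv.
wine = load_wine(as_frame=True)
X_base, y = wine.data, wine.target

# Recreate the four engineered features using scikit-learn's column names.
X_eng = X_base.copy()
X_eng["alcohol_x_malic_acid"] = X_eng["alcohol"] * X_eng["malic_acid"]
X_eng["ash_to_malic_acid"] = X_eng["ash"] / X_eng["malic_acid"]
X_eng["flavanoids_to_phenols"] = X_eng["flavanoids"] / X_eng["total_phenols"]
X_eng["alcohol_x_magnesium"] = X_eng["alcohol"] * X_eng["magnesium"]

def cv_accuracy(X: pd.DataFrame, y: pd.Series) -> float:
    """Mean 5-fold cross-validated accuracy of a scaled logistic regression."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    return float(scores.mean())

baseline_acc = cv_accuracy(X_base, y)
engineered_acc = cv_accuracy(X_eng, y)
print(f"baseline features:   {baseline_acc:.3f}")
print(f"engineered features: {engineered_acc:.3f}")
```

With only 178 samples, a difference of one or two correctly classified wines is within the noise of the split, so a small gap in either direction should not be read as evidence that the features help or hurt. Repeating the comparison over several random seeds gives a steadier picture.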

