Q1. Key Features of the Wine Quality Dataset

The Wine Quality dataset includes the following features:

1. Fixed acidity – Non-volatile acids, mainly tartaric acid.
2. Volatile acidity – Acetic acid content; high levels give a vinegar taste.
3. Citric acid – Adds freshness and flavor.
4. Residual sugar – Sugar left after fermentation.
5. Chlorides – Salt content (sodium chloride).
6. Free sulfur dioxide – Unbound SO2; protects wine from oxidation and microbial growth.
7. Total sulfur dioxide – Sum of free and bound SO2.
8. Density – Depends on sugar and alcohol content.
9. pH – Acidity level (lower pH means more acidic).
10. Sulphates – Additive that contributes to SO2 levels and helps preserve the wine.
11. Alcohol – Alcohol content (% by volume).

Target variable: quality (score between 0 and 10)

Importance:
- Alcohol (positive) and volatile acidity (negative) typically show the strongest correlation with quality.
- Sulphates and citric acid also affect the taste and perception of the wine.
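
One way to check the importance claims above is to rank features by their correlation with quality. The sketch below is illustrative only: the function name `rank_features_by_correlation` is hypothetical, and the toy rows are invented so the snippet runs without the CSV file. On the real data, pass the loaded DataFrame instead.

```python
import pandas as pd


def rank_features_by_correlation(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Return each feature's Pearson correlation with `target`, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    # Order by absolute strength but keep the sign of each correlation.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy rows invented for illustration; NOT taken from the real dataset.
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "volatile acidity": [0.80, 0.70, 0.60, 0.50, 0.40, 0.30],
    "pH": [3.3, 3.1, 3.4, 3.2, 3.3, 3.2],
    "quality": [4, 5, 5, 6, 7, 7],
})

ranking = rank_features_by_correlation(toy)
print(ranking)
```

Correlation only captures linear, pairwise relationships, so a model-based importance measure may rank features differently.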

---

Q2. Handling Missing Data in the Wine Quality Dataset

Methods:

1. Mean/Median Imputation:
   - Easy to implement.
   - Best for numerical features; the median is more robust to outliers than the mean.
   - Reduces the variance of the imputed feature.

2. Mode Imputation:
   - Best for categorical features.

3. KNN Imputation:
   - Considers neighboring values.
   - More accurate but computationally expensive.

4. Drop missing rows:
   - Simple, but reduces dataset size and can bias results if values are not missing at random.

Example using mean imputation:

```python
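import pandas as pd

# Note: the original UCI file is semicolon-delimited; if so, pass sep=";".
df = pd.read_csv("winequality-red.csv")

# Fill missing values in each numeric column with that column's mean.
df = df.fillna(df.mean(numeric_only=True))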
```
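
The other methods listed above can be sketched in the same way. This is a minimal illustration, not a recommended pipeline: the toy values are invented, the categorical `type` column is hypothetical (the red-wine file contains only numeric columns), and `n_neighbors=2` is an arbitrary choice for such a small frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame invented for illustration; "type" is a hypothetical categorical column.
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.5, 11.2, 9.8],
    "pH": [3.51, 3.20, np.nan, 3.16, 3.30],
    "type": ["red", "red", None, "white", "red"],
})
numeric_cols = ["alcohol", "pH"]

# 1. Median imputation for numeric columns.
median_filled = toy.copy()
median_filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# 2. Mode imputation for the categorical column.
mode_filled = toy.copy()
mode_filled["type"] = toy["type"].fillna(toy["type"].mode().iloc[0])

# 3. KNN imputation: each gap is filled from the nearest rows' values.
knn_filled = toy.copy()
knn_filled[numeric_cols] = KNNImputer(n_neighbors=2).fit_transform(toy[numeric_cols])

# 4. Drop every row that has at least one missing value.
dropped = toy.dropna()

print(len(toy), "rows before dropping,", len(dropped), "rows after")
```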

---

Q3. Key Factors Affecting Student Exam Performance

- Study time
- Absences
- Family background
- Parental education
- Health status
- School support

Analysis techniques:
- Correlation matrix
- Regression analysis
- Chi-square test (for categorical variables)
- Box plots to compare categories
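
A short sketch of these techniques on a toy frame follows. The data are invented, the column names `studytime`, `absences`, `schoolsup`, and `G3` are assumed to match the student dataset, and `passed` is a hypothetical label derived for illustration. With counts this small the chi-square result is not reliable; the snippet only shows the mechanics.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data invented for illustration; column names are assumptions.
toy = pd.DataFrame({
    "studytime": [1, 2, 2, 3, 4, 1, 3, 4],
    "absences": [10, 6, 8, 2, 0, 12, 4, 1],
    "schoolsup": ["no", "yes", "no", "yes", "yes", "no", "no", "yes"],
    "passed": ["no", "no", "no", "yes", "yes", "no", "yes", "yes"],
    "G3": [8, 9, 9, 14, 17, 6, 12, 16],
})

# Correlation matrix for the numeric variables.
corr = toy[["studytime", "absences", "G3"]].corr()
print(corr)

# Chi-square test of independence between two categorical variables.
table = pd.crosstab(toy["schoolsup"], toy["passed"])
chi2, p_value, dof, expected = chi2_contingency(table)
print("chi2:", chi2, "p-value:", p_value, "dof:", dof)

# Numeric counterpart of a box plot: grade summary per category.
summary = toy.groupby("schoolsup")["G3"].describe()
print(summary)
```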

---

Q4. Feature Engineering on Student Performance Dataset

Steps:

1. Load dataset
2. Handle missing values
3. Encode categorical features
4. Create new features (e.g., average grade)
5. Normalize numerical data

Example:

```python
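import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Note: the original UCI file is semicolon-delimited; if so, pass sep=";".
df = pd.read_csv("student-mat.csv")
# New feature: mean of the three period grades.
# If G3 is the prediction target, leave it out of this average to avoid leakage.
df['avg_grade'] = (df['G1'] + df['G2'] + df['G3']) / 3
# Encode categorical columns as integer codes.
label_cols = ['school', 'sex', 'address']
for col in label_cols:
    df[col] = LabelEncoder().fit_transform(df[col])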
```
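
Step 5 (normalizing numerical data) can be sketched as follows. The toy rows are invented, and the column names are assumptions chosen to resemble the student dataset. The sketch also uses one-hot encoding via `pd.get_dummies` as an alternative to integer codes, since integer codes impose an artificial order on nominal categories.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows invented for illustration; column names are assumptions.
toy = pd.DataFrame({
    "G1": [10.0, 12.0, 8.0, 15.0],
    "G2": [11.0, 13.0, 7.0, 16.0],
    "absences": [4.0, 0.0, 10.0, 2.0],
    "sex": ["F", "M", "F", "M"],
})
numeric_cols = ["G1", "G2", "absences"]

# One-hot encode the categorical column (drop_first avoids a redundant column).
encoded = pd.get_dummies(toy, columns=["sex"], drop_first=True)

# Standardize numeric columns to zero mean and unit variance.
scaler = StandardScaler()
encoded[numeric_cols] = scaler.fit_transform(encoded[numeric_cols])

print(encoded)
```

When a model is evaluated on held-out data, the scaler should be fitted on the training split only and then applied to the test split.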

---

Q5. EDA on Wine Quality Dataset to Check Distribution

```python
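import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("winequality-red.csv")

# Histogram of every numeric column
df.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()

# Check skewness (values far from 0 indicate an asymmetric distribution)
print(df.skew(numeric_only=True))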
```

Features with high positive skew (clearly non-normal) typically include residual sugar, chlorides, and total sulfur dioxide; confirm this against the skewness values printed above.

Possible transformations:
- Log transformation
- Square root transformation
- Box-Cox transformation (requires strictly positive values)

Example:

```python
import numpy as np
df['residual sugar'] = np.log1p(df['residual sugar'])
```
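
To compare the three transformations, the skewness can be measured before and after each one. The sketch below uses synthetic right-skewed values drawn from a log-normal distribution as a stand-in for a skewed feature; the seed, the distribution parameters, and the sample size are arbitrary assumptions, and results on the real column will differ.

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox

# Synthetic right-skewed data standing in for a skewed feature (assumption).
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=500))

log_vals = np.log1p(values)   # log(1 + x), safe when x can be 0
sqrt_vals = np.sqrt(values)   # milder than the log transform

# Box-Cox fits its own power parameter; input must be strictly positive.
bc_array, fitted_lambda = boxcox(values.to_numpy())
bc_vals = pd.Series(bc_array)

skews = pd.Series({
    "raw": values.skew(),
    "log1p": log_vals.skew(),
    "sqrt": sqrt_vals.skew(),
    "box-cox": bc_vals.skew(),
})
print(skews.round(2))
print("Fitted Box-Cox lambda:", round(fitted_lambda, 3))
```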

---

Q6. PCA on Wine Quality Dataset

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("winequality-red.csv")
X = df.drop('quality', axis=1)

# Standardize so every feature contributes on the same scale
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
pca.fit(X_scaled)
explained_var = pca.explained_variance_ratio_

# Cumulative variance
cumulative_variance = np.cumsum(explained_var)

# Number of components to explain 90% variance
# (argmax returns the first index where the condition is True)
components_required = np.argmax(cumulative_variance >= 0.90) + 1
print("Minimum principal components to explain 90% variance:", components_required)
