## **Classifying Wine Origins from Physicochemical Properties**

### **Previous Resources**

Links to the lab class resources are given below:

1. [Simple Linear Regression](https://github.com/mirsazzathossain/CSE421-Machine-Learning/blob/main/linear-regression.ipynb)
2. [Naive Bayes Classifier](https://github.com/mirsazzathossain/CSE421-Machine-Learning/blob/main/naive-bayes.ipynb)
3. [Video Recording](https://drive.google.com/file/d/1z1dD2z8MX1dgEZbQhZ6419BymmRVM87m/view?usp=sharing)

### **Problem Statement**

In this assignment, you will use the [Wine Dataset](https://github.com/mirsazzathossain/CSE421-Machine-Learning/blob/main/datasets/wine.csv) to classify the origin of wines based on their physicochemical properties. The dataset contains 178 instances with 13 numeric attributes. The attributes are:

1. Alcohol: Alcohol content of the wine (in % vol)
2. Malic acid: Malic acid content of the wine (in g/l)
3. Ash: Ash content of the wine (in g/l)
4. Alcalinity of ash: Alcalinity of ash of the wine (in mEq/l)
5. Magnesium: Magnesium content in the wine (in mg/l)
6. Total phenols: Total phenols content of the wine (in g/l)
7. Flavanoids: Flavanoids content of the wine (in g/l)
8. Nonflavanoid phenols: Nonflavanoid phenols content of the wine (in g/l)
9. Proanthocyanins: Proanthocyanins content of the wine (in g/l)
10. Color intensity: Color intensity of the wine (in OD absorbance units)
11. Hue: Hue of the wine (in 1-10 scale)
12. OD280/OD315 of diluted wines: OD280/OD315 of diluted wines (in OD absorbance units)
13. Proline: Proline content in the wine (in mg/l)

The dataset is divided into three classes with 59, 71, and 48 instances respectively, corresponding to wines from three different origins: Barolo, Grignolino, and Barbera. In the target column, 0 denotes Barolo, 1 denotes Grignolino, and 2 denotes Barbera.


### **Import necessary libraries**


In [1]:
# Write your code here

### **Load the dataset**

Download the CSV file from [here](https://github.com/mirsazzathossain/CSE421-Machine-Learning/blob/main/datasets/wine.csv); don't use a dataset from any other source. Load the dataset using pandas.


In [2]:
# Write your code here

### **Data Preprocessing**

#### **Check the information and statistical summary of the dataset**

Check whether the dataset has any missing values. You can use `df.info()` to get information about the dataset and `df.describe()` to get its statistical summary. Observe that not every column has 178 non-null entries, which means some values are missing.
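
As a rough illustration of this check, the sketch below runs `df.info()`, `df.describe()`, and a per-column missing-value count on a small toy dataframe. The column names and values are made up for the example and are not the real wine data; in the assignment, `df` would be the dataframe you loaded.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine dataframe (hypothetical columns and values).
df = pd.DataFrame({
    "alcohol": [14.2, 13.2, np.nan, 12.4, 13.7],
    "proline": [1065.0, 1050.0, 1185.0, 520.0, 735.0],
    "target": [0, 0, 0, 1, 1],
})

df.info()              # prints the non-null count and dtype of every column
print(df.describe())   # count, mean, std, min, quartiles, max

# Count the missing values per column and show only the affected columns.
missing_counts = df.isnull().sum()
print(missing_counts[missing_counts > 0])
```

A column whose non-null count in `df.info()` is lower than the number of rows is the one that `missing_counts` flags.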


In [3]:
# Write your code here

#### **Handle the missing values**

There are several ways to handle missing values, as we discussed in class:

1. Delete the rows with missing values
2. Fill the missing values with mean, median, mode
3. Fill the missing values with a constant value (maximum or minimum)

Another way to fill the missing values is to use machine learning algorithms. In this assignment, you have to fill the missing values with **linear regression**. You can follow the steps below to fill the missing values:

1. Plot a pair plot of the dataset using `sns.pairplot()`. You can use the `hue` parameter to differentiate between the classes. From the plot, identify the column that contains the missing values and the column most strongly correlated with it. To justify your answer, you can use `df.corr()` to compute the correlations between the columns and `sns.heatmap()` to plot the correlation matrix.


In [4]:
# Write your code here

2. Let's say column `A` has missing values and column `B` is the column most correlated with `A`. You now have to fit a linear regression line that predicts column `A` from column `B`. To do that, create a new dataframe `new_df` containing only columns `A` and `B`, then split it into two parts: the rows without missing values (call it `df_train`) and the rows where `A` is missing (call it `df_test`). You will fit the regression line on `df_train` and then predict the missing values of column `A` for the rows in `df_test`. You can use `df_test = new_df[new_df['A'].isnull()]` to get `df_test` and `df_train = new_df.dropna()` to get `df_train`. Create `df_train` and `df_test` and print them.


In [5]:
# Write your code here

3. Train a linear regression model using `df_train` and predict the missing values of column `A` using `df_test`. You can use `from sklearn.linear_model import LinearRegression` to import the linear regression model. Create the model, train the model, and predict the missing values using the model. Store the predicted values.


In [6]:
# Write your code here

4. Now, fill the missing values of column `A` in the original dataframe `df` with the predicted values and check whether any missing values are left.


In [7]:
# Write your code here
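
The four steps above can be sketched end to end as follows. This is a minimal sketch on synthetic data, not the expected solution: the columns `A`, `B`, and `C` and their values are invented, and it assumes that only one column has missing values and that the partner column is complete. The pair plot and heatmap are omitted here so the example only needs pandas and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: "A" depends linearly on "B"; "C" is unrelated noise.
rng = np.random.default_rng(0)
b = rng.uniform(0, 10, size=40)
df = pd.DataFrame({
    "A": 3.0 * b + 1.0 + rng.normal(scale=0.1, size=40),
    "B": b,
    "C": rng.normal(size=40),
})
df.loc[[5, 20, 33], "A"] = np.nan  # knock out a few values of "A"

# Step 1: find the column with missing values and its most correlated partner.
missing_col = df.columns[df.isnull().any()][0]
partner = df.corr()[missing_col].drop(missing_col).abs().idxmax()

# Step 2: split the two-column dataframe into complete and incomplete rows.
new_df = df[[missing_col, partner]]
df_test = new_df[new_df[missing_col].isnull()]
df_train = new_df.dropna()

# Step 3: fit the regression line on the complete rows and predict the rest.
model = LinearRegression()
model.fit(df_train[[partner]], df_train[missing_col])
predicted = model.predict(df_test[[partner]])

# Step 4: write the predictions back using the original row indices.
df.loc[df_test.index, missing_col] = predicted
print(df.isnull().sum())
```

Indexing with `df_test.index` keeps each prediction aligned with the row it was computed for, which matters because the missing rows are usually scattered through the dataframe.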

#### **Check for outliers**

Now that you have handled the missing values, check for outliers in the dataset. You can use a box plot to spot them. If you find any outliers, replace them with the median of their column. If you wonder how to get the indices of the outliers without plotting a box plot, you can use the following code snippet:

```python
def get_outliers(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return df[(df[col] < lower_bound) | (df[col] > upper_bound)].index
```

Call the function `get_outliers(df, col)` with the dataframe and the column name as parameters to get the indices of the outliers, and then replace the outliers with the median of the column. You can use `df[col].median()` to get the median of the column. Loop through all the columns and replace the outliers with the median value. To get the list of all the columns, you can use `df.columns`.
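
The loop described above might look like the sketch below. The dataframe is a toy example with invented values (one obvious outlier in `ash`), and `get_outliers` is repeated only so the block runs on its own.

```python
import pandas as pd

def get_outliers(df, col):
    """Return the index labels of values outside 1.5 * IQR of the quartiles."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return df[(df[col] < lower_bound) | (df[col] > upper_bound)].index

# Toy data (hypothetical values): 9.9 is far outside the rest of "ash".
df = pd.DataFrame({
    "ash": [2.1, 2.3, 2.4, 2.5, 2.6, 2.7, 9.9],
    "hue": [0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.2],
})

for col in df.columns:
    outlier_idx = get_outliers(df, col)
    # Replace only the flagged rows with the column median.
    df.loc[outlier_idx, col] = df[col].median()

print(df)
```

The median is used because, unlike the mean, it is barely affected by the outliers it is replacing.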


In [8]:
# Write your code here

### **Model Building**

After preprocessing the dataset, we will now build the model. In this assignment, you will use the **Decision Tree** algorithm to classify the origin of the wines.

A decision tree is a flowchart-like structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents the outcome.

The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, and it does so recursively, a process called recursive partitioning. This flowchart-like structure helps you in decision making. Its visualization resembles a flowchart diagram, which closely mimics human-level thinking. That is why decision trees are easy to understand and interpret.

![Decision Tree](https://images.datacamp.com/image/upload/v1677504957/decision_tree_for_heart_attack_prevention_2140bd762d.png)

For a better understanding of decision trees, you can watch [this video](https://youtu.be/_L39rN6gz7Y?si=jj5_TyoloYdCVytJ) from StatQuest and/or read [this article](https://www.datacamp.com/tutorial/decision-tree-classification-python) from DataCamp.


#### **Feature Selection**

Before building the model, we have to select the features. You can use all of them or only a subset. To select the features, you can refer to the correlation matrix you plotted earlier and pick the features that are most correlated with the target variable. You can also use `df.corr()['target']` to get the correlation of each feature with the target variable and select the features whose absolute correlation is greater than a threshold such as 0.5 or 0.6. The selection of features is up to you: **test the model with different features, and select the features which give the best result**. Then drop the other features from the dataset using `df.drop(['col1', 'col2', ...], axis=1)`.
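
One way to apply the threshold idea is sketched below on synthetic data. The feature names, the 0.5 threshold, and the data itself are assumptions made for the example; the absolute value is taken so that strongly negative correlations are kept as well.

```python
import numpy as np
import pandas as pd

# Synthetic data: two features track the class label, one is pure noise.
rng = np.random.default_rng(1)
target = np.repeat([0, 1, 2], 20)
df = pd.DataFrame({
    "strong_pos": target + rng.normal(scale=0.3, size=60),
    "strong_neg": -target + rng.normal(scale=0.3, size=60),
    "noise": rng.normal(size=60),
    "target": target,
})

# Correlation of every feature with the target (the target itself is excluded).
corr_with_target = df.corr()["target"].drop("target")
print(corr_with_target)

threshold = 0.5  # assumed cut-off; try other values as well
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()

# Drop the features that did not pass the threshold.
to_drop = [col for col in corr_with_target.index if col not in selected]
df_selected = df.drop(to_drop, axis=1)
print(df_selected.columns.tolist())
```

Correlation with an integer class label is only a rough guide, so the cross-validation scores should have the final say on which feature set to keep.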


In [9]:
# Write your code here

#### **Split the dataset into features and target variable**

Split the dataset into features and target variable. You can use `df.drop(['target'], axis=1)` to get the features and `df['target']` to get the target variable. Store the features in variable `X` and target variable in variable `y`.


In [10]:
# Write your code here

#### **Building Decision Tree Model and Evaluation**

Now, you have to build the decision tree model. You can use `from sklearn.tree import DecisionTreeClassifier` to import the decision tree classifier. Create the model, train it, and use it to predict the target variable. When initializing the classifier, use `gini` or `entropy` as the `criterion` and try different values of `max_depth`.

After building the model, you have to evaluate it. You can use cross-validation for this. You can use `from sklearn.model_selection import cross_val_score` to import it, then call `cross_val_score()` with `cv=5` for 5-fold cross-validation. Print the accuracy of the model for different values of `max_depth` and `criterion` (i.e. `gini` or `entropy`), and also for different sets of features. Select the best model based on the accuracy. To call `cross_val_score()`, you can use the following code snippet:

```python
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
```

The above code snippet returns an array of accuracy scores, one for each fold of the 5-fold cross-validation. You can print the accuracy scores using `print(scores)` and the mean accuracy score using `print(scores.mean())`.
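
A comparison across `criterion` and `max_depth` could be organised as in the sketch below. It uses a synthetic three-class dataset from `make_classification` as a stand-in for your preprocessed `X` and `y`; the depth values and `random_state=0` are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features X and target y.
X, y = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

results = {}
for criterion in ["gini", "entropy"]:
    for max_depth in [2, 3, 4, 5]:
        model = DecisionTreeClassifier(
            criterion=criterion, max_depth=max_depth, random_state=0
        )
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        results[(criterion, max_depth)] = scores.mean()
        print(criterion, max_depth, scores.round(3), round(scores.mean(), 3))

# Pick the setting with the highest mean accuracy.
best_params = max(results, key=results.get)
print("best:", best_params, round(results[best_params], 3))
```

Fixing `random_state` makes the tree construction repeatable, so differences between runs come from the settings being compared rather than from chance.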

For more information on cross validation, you can refer to [this video](https://youtu.be/fSytzGwwBVw?si=RIazXC9inPcQuGrf) from StatQuest.

**Note:** Bonus marks may be awarded to the students with the best score in the class. 😉


In [11]:
# Write your code here

#### **Divide the dataset into train and test set**

After selecting the best model, you have to divide the dataset into a train set and a test set. You can use `from sklearn.model_selection import train_test_split` to import the function, then call `train_test_split()` with `test_size=0.2` to use 20% of the dataset as the test set. Print the shapes of the train and test sets to check that the dataset is divided correctly.
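
A minimal sketch of the split, again on synthetic data with 178 rows to mirror the size of the wine dataset. The `random_state` and `stratify` arguments are additions for this example (reproducibility and balanced class proportions) and are not required by the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the wine dataset.
X, y = make_classification(
    n_samples=178, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # feature matrices
print(y_train.shape, y_test.shape)  # matching target vectors
```

With 178 rows and `test_size=0.2`, scikit-learn rounds the test set up to 36 rows and leaves 142 for training.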


In [12]:
# Write your code here

#### **Train the model and make predictions**

Now, you have to train the model using the train set. You can use `clf.fit(X_train, y_train)` to train the model. After training the model, you have to make predictions on the test set. You can use `clf.predict(X_test)` to do so. Store the predicted values in the variable `y_pred`. Import the metric with `from sklearn.metrics import accuracy_score` and print the accuracy of the model using `accuracy_score(y_test, y_pred)`.
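
Put together, training and scoring might look like the sketch below. The data is synthetic, and `criterion="gini"` with `max_depth=3` is only a placeholder for whichever setting your cross-validation favoured.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed wine data.
X, y = make_classification(
    n_samples=178, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Placeholder hyperparameters; substitute the best ones you found.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```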


In [13]:
# Write your code here

#### **Classification Report**

You can use `from sklearn.metrics import classification_report` to import classification report. Use `classification_report(y_test, y_pred)` to print the classification report.

Also, find out the confusion matrix using `from sklearn.metrics import confusion_matrix` and plot the confusion matrix using `sns.heatmap()`.
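
The sketch below shows both calls on synthetic data. The heatmap line is left as a comment so the block runs without seaborn installed; the `labels` argument is passed only to fix the row and column order of the matrix.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed wine data.
X, y = make_classification(
    n_samples=178, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Precision, recall, and F1-score for each class.
print(classification_report(y_test, y_pred))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# With seaborn imported as sns, the matrix can be plotted as:
# sns.heatmap(cm, annot=True, fmt="d")
```

The diagonal of `cm` counts the correctly classified wines of each class, and every off-diagonal cell shows which pair of classes the model confuses.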


In [14]:
# Write your code here

#### **Visualize the Decision Tree**

Use the following code snippet to visualize the decision tree:

```python
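from io import StringIO  # `six` is unnecessary on Python 3
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# `clf` is the fitted classifier; `X` is the feature dataframe used to train it.
feature_cols = list(X.columns)

# Export the tree in Graphviz DOT format, then render it to a PNG image.
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols,
                class_names=['0', '1', '2'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('wine.png')
Image(graph.create_png())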
```
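
If Graphviz or `pydotplus` is not available on your machine, a plain-text view of the same tree can be a useful fallback. The sketch below uses scikit-learn's `export_text` on a small tree fitted to synthetic data; the feature names are placeholders, and this is an alternative to the snippet above rather than a replacement for it.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the preprocessed wine data.
X, y = make_classification(
    n_samples=178, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each indented line is a split rule; leaf lines show the predicted class.
rules = export_text(clf, feature_names=feature_names)
print(rules)
```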


In [15]:
# Write your code here