# Assignment 2

## 1. Wine Quality Dataset [60 Points]

The Wine Quality dataset is available from the UCI Machine Learning Repository. Download it [HERE](https://archive.ics.uci.edu/dataset/186/wine+quality).

The dataset consists of two separate datasets, one containing red wine samples and the other containing white wine samples, which we will combine into a single dataset. Each sample consists of several physicochemical features (e.g., pH, alcohol content) and a quality rating between 0 and 10.

1. Use pandas to load both datasets into separate DataFrames. 
2. Add a new column to each DataFrame to identify the wine type: red or white. 
    - For example, add a column named wineType with a value of "red" to the red wine DataFrame, 
    - and add a column named wineType with a value of "white" to the white wine DataFrame.
3. Concatenate the two DataFrames into a single DataFrame using pandas.
    - Make sure to set the ignore_index parameter to True to reset the index of the combined DataFrame.
4. You now have a single DataFrame containing both the red and white wine datasets combined.

In [5]:
import pandas as pd
import numpy as np

In [6]:
# STEP 1: Load the dataset
df_red = pd.read_csv('/workspaces/NU-CS-6220/Module02/data/winequality-red.csv', sep=';')
df_white = pd.read_csv('/workspaces/NU-CS-6220/Module02/data/winequality-white.csv', sep=';')

# output the df shape and column names
print(f"Red wine shape: {df_red.shape}")
print(f"Red wine columns: {df_red.columns}")

print(f"White wine shape: {df_white.shape}")
print(f"White wine columns: {df_white.columns}")


Red wine shape: (1599, 12)
Red wine columns: Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
White wine shape: (4898, 12)
White wine columns: Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')


In [7]:
# STEP 2: add a column to indicate the color of the wine
df_red['color'] = 'red'
df_white['color'] = 'white'

# output the df shape and column names, again
print(f"Red wine shape: {df_red.shape}")
print(f"Red wine columns: {df_red.columns}")

print(f"White wine shape: {df_white.shape}")
print(f"White wine columns: {df_white.columns}")

Red wine shape: (1599, 13)
Red wine columns: Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'color'],
      dtype='object')
White wine shape: (4898, 13)
White wine columns: Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'color'],
      dtype='object')


In [8]:
# STEP 3: Concatenate the two dataframes
df = pd.concat([df_red, df_white], ignore_index=True)

# output the df shape and column names, again
print(f"Combined wine shape: {df.shape}")
print(f"Combined wine columns: {df.columns}")

# STEP 4: Save the combined dataframe to a new CSV file
df.to_csv('/workspaces/NU-CS-6220/Module02/data/winequality.csv', index=False)

Combined wine shape: (6497, 13)
Combined wine columns: Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'color'],
      dtype='object')


### 1.1 Summary Statistics [10 Points]

Compute and display summary statistics for each feature available in the dataset. These must include:

1. minimum value
2. maximum value
3. mean
4. range
5. standard deviation
6. variance
7. count
8. 25th, 50th, and 75th percentiles
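
The eight statistics listed above can be gathered into one table. As a minimal sketch: `describe()` reports count, mean, standard deviation, minimum, the three percentiles, and maximum, so range and variance have to be added separately. `df_demo` is a small stand-in with illustrative values (not real data), and the row labels `range` and `var` are names chosen for this example.

```python
import pandas as pd

# Tiny stand-in for the combined wine DataFrame (illustrative values, not real data).
df_demo = pd.DataFrame({
    "pH": [3.51, 3.20, 3.26, 3.16],
    "alcohol": [9.4, 9.8, 9.8, 10.1],
    "color": ["red", "red", "white", "white"],
})

# Keep only numeric columns; the string "color" column has no mean or variance.
numeric = df_demo.select_dtypes(include="number")

# describe() gives count, mean, std, min, 25%, 50%, 75%, and max.
summary = numeric.describe()

# Add the two required statistics that describe() omits.
summary.loc["range"] = numeric.max() - numeric.min()
summary.loc["var"] = numeric.var()  # sample variance (ddof=1), consistent with std

print(summary)
```

On the real data, the same steps would be applied to the combined DataFrame in place of `df_demo`.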



In [9]:
# Summary statistics for each feature available in the dataset
print("\nSummary statistics for each feature available in the dataset:")
print(df.describe())



Summary statistics for each feature available in the dataset:
       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    6497.000000       6497.000000  6497.000000     6497.000000   
mean        7.215307          0.339666     0.318633        5.443235   
std         1.296434          0.164636     0.145318        4.757804   
min         3.800000          0.080000     0.000000        0.600000   
25%         6.400000          0.230000     0.250000        1.800000   
50%         7.000000          0.290000     0.310000        3.000000   
75%         7.700000          0.400000     0.390000        8.100000   
max        15.900000          1.580000     1.660000       65.800000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  6497.000000          6497.000000           6497.000000  6497.000000   
mean      0.056034            30.525319            115.744574     0.994697   
std       0.035034            17.749400             56.521855  

### 1.2 Data Visualization [25 Points]

**Histograms**: To illustrate the feature distributions, create a histogram for each feature in the dataset. You may plot each histogram individually or combine them all into a single plot. When generating histograms for this assignment, use the default number of bins. Recall that a histogram provides a graphical representation of the distribution of the data.
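
A minimal sketch of the histogram step, assuming pandas' `DataFrame.hist`, which draws one histogram per numeric column and uses `bins=10` by default. `df_demo` is a synthetic stand-in for the combined wine DataFrame, and its distributions are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (random values, not real measurements).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "pH": rng.normal(3.2, 0.15, 200),
    "alcohol": rng.normal(10.5, 1.2, 200),
    "residual sugar": rng.exponential(5.0, 200),
})

# One histogram per numeric column, combined in a single figure.
# bins is left unset so the default number of bins is used.
axes = df_demo.hist(figsize=(9, 6))
plt.tight_layout()
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```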

**Box Plots**: To further assess the data, create a boxplot for each feature in the dataset. All of the boxplots will be combined into a single plot. Recall that a boxplot provides a graphical representation of the location and variation of the data through their quartiles; they are especially useful for comparing distributions and identifying outliers.
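
One possible way to combine the boxplots into a single plot is sketched below. It gives each feature its own panel with an independent y-axis, because features on very different scales would otherwise flatten each other. This layout is a design choice for the example, not a requirement of the assignment, and `df_demo` is synthetic stand-in data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (random values, not real measurements).
rng = np.random.default_rng(1)
df_demo = pd.DataFrame({
    "pH": rng.normal(3.2, 0.15, 200),
    "alcohol": rng.normal(10.5, 1.2, 200),
    "total sulfur dioxide": rng.normal(115.0, 55.0, 200),
})

# One figure, one panel per feature; y-axes are not shared.
fig, axes = plt.subplots(1, len(df_demo.columns), figsize=(10, 4))
for ax, col in zip(axes, df_demo.columns):
    ax.boxplot(df_demo[col].to_numpy())
    ax.set_title(col)
    ax.set_xticks([])  # a single box per panel needs no x tick
fig.tight_layout()
```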

**Pairwise Plot**: To understand the relationship between the features, create a scatter plot for each pair of features. If there are n features in the dataset, there should be nC2 = n(n-1)/2 plots.
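
As a sketch of the pairwise step, `itertools.combinations` enumerates each unordered pair of features exactly once, which yields the nC2 plots described above. The four-column `df_demo` is synthetic, and the three-column grid is an arbitrary layout choice for the example.

```python
from itertools import combinations
from math import ceil, comb

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in with four features (random values, not real measurements).
rng = np.random.default_rng(2)
df_demo = pd.DataFrame(
    rng.normal(size=(100, 4)),
    columns=["pH", "alcohol", "density", "sulphates"],
)

# Each unordered pair appears once: 4 features give comb(4, 2) pairs.
pairs = list(combinations(df_demo.columns, 2))
assert len(pairs) == comb(len(df_demo.columns), 2)

ncols = 3
nrows = ceil(len(pairs) / ncols)
fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
for ax, (x, y) in zip(axes.ravel(), pairs):
    ax.scatter(df_demo[x], df_demo[y], s=8)
    ax.set_xlabel(x)
    ax.set_ylabel(y)

# Hide any panels left over when the grid is larger than the number of pairs.
for ax in axes.ravel()[len(pairs):]:
    ax.set_visible(False)
fig.tight_layout()
```

`pandas.plotting.scatter_matrix` is a shorter alternative, but it draws the full n by n grid, so every pair appears twice.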

**Class-wise Visualization**: Create pairwise plots for each pair of features in a similar way for each of the different classes present in the data.
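
A hedged sketch of the class-wise variant follows. It assumes the class is the wine-type column (named `color` in the notebook cells above) and overlays the classes in each panel so they can be compared directly. Treating quality as the class, or drawing one figure per class, would be equally valid readings of the task. `df_demo` is synthetic.

```python
from itertools import combinations

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in: three features plus a class column (not real data).
rng = np.random.default_rng(3)
df_demo = pd.DataFrame({
    "pH": rng.normal(3.2, 0.15, 120),
    "alcohol": rng.normal(10.5, 1.2, 120),
    "density": rng.normal(0.995, 0.003, 120),
    "color": np.repeat(["red", "white"], 60),
})

features = [c for c in df_demo.columns if c != "color"]
pairs = list(combinations(features, 2))

fig, axes = plt.subplots(1, len(pairs), figsize=(4 * len(pairs), 3.5))
for ax, (x, y) in zip(axes, pairs):
    # One scatter call per class so each class gets its own color and legend entry.
    for label, group in df_demo.groupby("color"):
        ax.scatter(group[x], group[y], s=8, alpha=0.6, label=label)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
axes[0].legend(title="color")
fig.tight_layout()
```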



### 1.3 Conceptual Questions [25 Points]

Answer the following questions about the analysis you just performed. Include the answers to these questions as text content (using Markdown or text cells in the Jupyter notebook) in the same notebook file used for visualization.

1. How many features are there? What are the types of the features (e.g., numeric, nominal, discrete, continuous)?


2. What can you conclude from the histograms about the distribution of the features in the dataset? Are there any features that are approximately normally distributed? Are there any features that are highly skewed?


3. Based on the box plots, are there any features that appear to have many outliers? Are there any features that appear to have a similar spread of values across different quality ratings? Are there any features that appear to have different spreads of values across different quality ratings?


4. Based on the pairwise plots, which features appear to be highly correlated? Are there any features that do not appear to be correlated with any other features?


5. Based on the class-wise visualizations, are there any pairs of features that appear to be more correlated for certain wine types than for others?

## 2. Forest Fires Dataset [40 Points]

The Forest Fires dataset is a dataset of meteorological and other data from Portugal that is used to predict the size of forest fires. You can download the dataset from the UCI Machine Learning Repository here (http://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/forestfires.csv).

The dataset includes information about the location and date of the fire, as well as variables such as temperature, humidity, wind speed, and rain, among others.
A description of this dataset can be found here (http://archive.ics.uci.edu/ml/datasets/Forest+Fires).

### 2.1 Summary Statistics [5 Points]

As in Section 1, compute and display summary statistics for each feature available
in the dataset. These must include 1) minimum value, 2) maximum value, 3) mean, 4) range,
5) standard deviation, 6) variance, 7) count, and 8) the 25th, 50th, and 75th percentiles.

### 2.2 Data Visualization [15 Points]

As done in Section 1, create histograms and boxplots for the dataset. Then create another
boxplot without the outliers. You can use showfliers=False to remove outliers from the
boxplots. You are expected to present two boxplots in total.
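
The two required boxplots can be sketched side by side as below, using matplotlib's `showfliers` argument on `Axes.boxplot`. `temp`, `wind`, and `area` are column names from the Forest Fires file, but the values in `df_demo` are synthetic and only the plotting calls are the point.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in using three Forest Fires column names (random values, not real data).
rng = np.random.default_rng(4)
df_demo = pd.DataFrame({
    "temp": rng.normal(19.0, 6.0, 200),
    "wind": rng.normal(4.0, 1.8, 200),
    "area": rng.exponential(10.0, 200),  # heavy right tail, so outliers appear
})

cols = list(df_demo.columns)
data = [df_demo[c].to_numpy() for c in cols]

fig, (ax_with, ax_without) = plt.subplots(1, 2, figsize=(10, 4))
bp_with = ax_with.boxplot(data, showfliers=True)        # outliers drawn as points
bp_without = ax_without.boxplot(data, showfliers=False)  # outliers hidden

for ax, title in ((ax_with, "With outliers"), (ax_without, "Without outliers")):
    ax.set_xticks(range(1, len(cols) + 1))
    ax.set_xticklabels(cols)
    ax.set_title(title)
fig.tight_layout()
```

Note that `showfliers=False` only hides the outlier markers in the plot. The underlying data is unchanged.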

### 2.3 Conceptual Questions [20 Points]

Answer the following questions about the analysis you just performed. Include the answers
to these questions as text content (using markdown or text cells on Jupyter notebook) in the
same notebook file used for visualization.

1. From the boxplot without outliers, which features have a significantly different distribution from the others? Why?
2. What do the outliers in the features mean? If you remove the outliers from the dataset, what problems might arise?
3. Create a histogram for only FFMC after removing all the values outside of the range [88, 96].
4. What distribution does the new histogram follow?
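
The FFMC filtering step can be sketched as follows, assuming the interval [88, 96] is closed at both ends (the default for `Series.between`). The FFMC values in `df_demo` are illustrative only. On the real data, the same mask would be applied to the loaded Forest Fires DataFrame.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative FFMC values (not the real dataset).
df_demo = pd.DataFrame({
    "FFMC": [18.7, 84.9, 88.0, 90.6, 91.7, 92.5, 94.8, 96.0, 96.2],
})

# Keep rows with 88 <= FFMC <= 96; between() is inclusive on both ends by default.
in_range = df_demo[df_demo["FFMC"].between(88, 96)]

# Histogram of the remaining values with the default number of bins.
ax = in_range["FFMC"].plot(kind="hist")
ax.set_xlabel("FFMC")
```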