# Extra2 Data Analysis

In [None]:
# install libraries for extra contents
!pip install matplotlib
!pip install pandas
!pip install scikit-learn
!pip install seaborn

import pprint
from matplotlib import pyplot as plt
%matplotlib inline
import pandas as pd
import seaborn as sns
from sklearn import datasets

## Problem 1: Import dataset

For this exercise, we will use the Wine dataset provided by the UCI Machine Learning Repository. [Ref](https://archive.ics.uci.edu/dataset/109/wine)
This dataset contains the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars.

The data definitions are as follows:

### target

| variable name | definition   |
|---------------|--------------|
| target        | Type of Wine |

### feature_names

| variable name                | definition                                                             |
|------------------------------|------------------------------------------------------------------------|
| alcohol                      | alcohol concentration                                                  |
| malic_acid                   | malic acid concentration                                               |
| ash                          | ash concentration                                                      |
| alcalinity_of_ash            | alkalinity of ash                                                      |
| magnesium                    | amount of magnesium                                                    |
| total_phenols                | total amount of phenols                                                |
| flavanoids                   | amount of flavanoids                                                   |
| nonflavanoid_phenols         | amount of non-flavanoid phenols                                        |
| proanthocyanins              | amount of proanthocyanin                                               |
| color_intensity              | color intensity                                                        |
| hue                          | hue                                                                    |
| od280/od315_of_diluted_wines | OD280/OD315 absorbance ratio of diluted wine (at 280 nm and 315 nm)    |
| proline                      | amount of proline                                                      |

[Reference material](https://www.jstage.jst.go.jp/article/jbrewsocjapan1915/75/8/75_8_631/_pdf/-char/ja) on the taste and composition of wine

### Import wine dataset

The Wine dataset is bundled with the `sklearn.datasets` module imported above.
To load a dataset from `sklearn.datasets`, run the following code. [Ref](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html)

In [None]:
wine_data = datasets.load_wine()

Let's look at the actual contents of the loaded data.
The `pprint` module, which is part of the Python standard library, prints lists, dictionaries, and similar objects in a readable, nicely formatted layout. [Ref](https://docs.python.org/3/library/pprint.html#module-pprint)

In [None]:
pprint.pprint(wine_data)

You can see that the dataset description, the data, the feature names, and so on are stored in a dictionary-like object.
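
As a minimal sketch of dictionary-style access, the hypothetical `toy_dataset` below imitates the layout of a loaded dataset with made-up values. The names and numbers are assumptions for illustration and are not the Wine data.

```python
import pprint

# Hypothetical stand-in for a loaded dataset: made-up values, not the Wine data.
toy_dataset = {
    "data": [[1.0, 2.0], [3.0, 4.0]],
    "target": [0, 1],
    "feature_names": ["feature_a", "feature_b"],
}

pprint.pprint(toy_dataset)

# List the available keys, then pull out a single entry by its key.
toy_keys = list(toy_dataset.keys())
toy_features = toy_dataset["feature_names"]
print(toy_keys)
print(toy_features)
```

The same square-bracket lookup works on any dictionary-like object, so it should carry over to the loaded dataset.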

### Problem 1-1

Extract and output only `data` from the `wine_data` dataset loaded above. (*Hints are [here](https://docs.python.org/3/tutorial/datastructures.html#dictionaries))

In [None]:
# ===== Please enter your answer code below. =====


### Problem 1-2

Convert the data extracted in Problem 1-1 into a pandas DataFrame. Use `wine_df` as the variable name and `wine_data.feature_names` as the column names. (*Hints are [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html))
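
For orientation, here is a small sketch of building a DataFrame from row-wise values plus a list of column names. `toy_values` and `toy_columns` are hypothetical, made-up inputs rather than anything from the Wine dataset.

```python
import pandas as pd

# Made-up values and hypothetical column names, used only for illustration.
toy_values = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
toy_columns = ["feature_a", "feature_b"]

# Each inner list becomes one row; `columns` labels the columns in order.
toy_df = pd.DataFrame(toy_values, columns=toy_columns)
print(toy_df)
```

The number of column names has to match the number of values in each row.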

In [None]:
# ===== Please enter your answer code below. =====
wine_df = None

### Problem 1-3

Display the first 10 rows of the `wine_df` created in Problem 1-2 in the Notebook. (*Hints are [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html))

In [None]:
# ===== Please enter your answer code below. =====


### Problem 1-4

Check the number of rows and columns in `wine_df`. (*Hints are [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html))

In [None]:
# ===== Please enter your answer code below. =====


### Problem 1-5

Check the list of column names in `wine_df`. (*Hints are [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html))
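
The inspection tools linked in Problems 1-3 to 1-5 can be sketched together on a toy DataFrame. The frame `toy_df` and its column names are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical 20-row frame with two made-up columns.
toy_df = pd.DataFrame({"feature_a": range(20), "feature_b": range(20, 40)})

first_rows = toy_df.head(3)           # first 3 rows (head() defaults to 5)
n_rows, n_cols = toy_df.shape         # shape is an attribute: (rows, columns)
column_names = list(toy_df.columns)   # column labels as a plain list

print(first_rows)
print(n_rows, n_cols)
print(column_names)
```

Note that `shape` and `columns` are attributes, so they are read without parentheses, while `head` is a method.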

In [None]:
# ===== Please enter your answer code below. =====


### Problem 1-6

Extract the column `alcohol` from the `wine_df`.

In [None]:
# ===== Please enter your answer code below. =====


### Problem 1-7

Extract multiple columns `color_intensity` and `hue` from `wine_df`.
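
As a sketch of the difference between selecting one column and selecting several, consider the made-up frame below. The column names `a`, `b`, and `c` are hypothetical.

```python
import pandas as pd

# Hypothetical frame with three made-up columns.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# A single label in brackets returns a Series.
single = toy_df["a"]

# A list of labels (note the double brackets) returns a DataFrame.
multiple = toy_df[["a", "c"]]

print(type(single).__name__)
print(type(multiple).__name__)
```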

In [None]:
# ===== Please enter your answer code below. =====


## Problem 2: Data handling

Let's practice some common DataFrame operations.

### Problem 2-1

Extract the rows of `wine_df` whose `magnesium` value is greater than `100` and display them in the Notebook. (*Hints are [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html))

Also, how many rows have `magnesium` values greater than or equal to `100`?
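
Row filtering with a boolean condition can be sketched as follows. The frame and its `value` column are made up, and the threshold simply mirrors the one in the problem text; the point is that `>` and `>=` can select different numbers of rows.

```python
import pandas as pd

# Hypothetical frame; one row sits exactly on the threshold.
toy_df = pd.DataFrame({"value": [90, 100, 110, 120], "label": list("abcd")})

# The comparison yields a boolean Series; .loc keeps the rows where it is True.
strictly_greater = toy_df.loc[toy_df["value"] > 100]
at_least = toy_df.loc[toy_df["value"] >= 100]

# len() of the filtered frame gives the number of matching rows.
n_strict = len(strictly_greater)
n_at_least = len(at_least)
print(n_strict, n_at_least)
```

Here the strict comparison keeps two rows and the inclusive one keeps three, because one value equals the threshold.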

In [None]:
# ===== Please enter your answer code below. =====


In [None]:
# ===== Please enter your answer code below. =====
print()

### Problem 2-2

Sort `wine_df` by `ash` in descending order and display the top 10 rows in the Notebook. (*Hints are [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html))
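
A minimal sketch of a descending sort followed by taking the top rows, using a made-up frame with hypothetical `name` and `score` columns:

```python
import pandas as pd

# Hypothetical frame with made-up scores.
toy_df = pd.DataFrame({"name": ["w", "x", "y", "z"], "score": [2.5, 9.0, 4.0, 7.5]})

# ascending=False puts the largest values first; head(2) keeps the top two rows.
top_two = toy_df.sort_values("score", ascending=False).head(2)
print(top_two)
```

`sort_values` returns a new DataFrame and leaves the original row order untouched.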

In [None]:
# ===== Please enter your answer code below. =====


### Problem 2-3

Group `wine_df` by `target` (i.e. wine type) and calculate the average value of `flavanoids` for each group. (*Hints are [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html))

Which target has the smallest average?
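
Group-wise averaging can be sketched on a toy frame like this. The `group` and `value` columns are hypothetical, and `idxmin` is just one possible way to read off the group with the smallest mean.

```python
import pandas as pd

# Hypothetical frame: three groups with made-up values.
toy_df = pd.DataFrame({
    "group": [0, 0, 1, 1, 2, 2],
    "value": [1.0, 3.0, 10.0, 20.0, 0.5, 1.5],
})

# Split by group label, select one column, and average within each group.
group_means = toy_df.groupby("group")["value"].mean()

# idxmin returns the index label (here, the group) with the smallest mean.
smallest_group = group_means.idxmin()

print(group_means)
print(smallest_group)
```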

In [None]:
# Add a new column, target, to wine_df.
wine_df['target'] = wine_data.target

# ===== Please enter your answer code below. =====


In [None]:
# ===== Please enter your answer code below. =====
print()

### Problem 2-4

Using the `alcohol` column of `wine_df`, create a new column `is_alcohol_high` (computed with the function of the same name below) that contains:

`1` for `alcohol` >= 13%,
`0` for `alcohol` < 13%.

(*Hints are [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html))

Once created, check the results by outputting the bottom 10 rows.
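
As a sketch of deriving a 0/1 flag column from a threshold, the example below applies a function to one column of a made-up frame. The function `is_large`, the `value` column, and the threshold of 10 are all hypothetical; it uses `Series.apply` on a single column, which is one option alongside the `DataFrame.apply` method linked above.

```python
import pandas as pd


def is_large(value):
    """Return 1 if value is at least 10, otherwise 0 (hypothetical threshold)."""
    return 1 if value >= 10 else 0


# Hypothetical frame; the middle row sits exactly on the threshold.
toy_df = pd.DataFrame({"value": [5, 10, 15]})

# Call is_large once per element and store the results in a new column.
toy_df["is_large"] = toy_df["value"].apply(is_large)

# tail(n) shows the last n rows.
print(toy_df.tail(2))
```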

In [None]:
# ===== Please enter your answer code below. =====
def is_alcohol_high(alcohol):
    pass

In [None]:
# ===== Please enter your answer code below. =====


## Problem 3: Statistical analysis

Perform a statistical analysis on the data you have created.

### Problem 3-1

Use the `describe` method of the dataframe to check various statistics for numerical data. (*Hints are [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html))

Which of the features has the largest standard deviation?
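
The summary table returned by `describe` is itself a DataFrame, so a single statistic can be read off by row label. The sketch below uses a made-up frame with hypothetical `narrow` and `wide` columns; using `idxmax` to find the largest entry is one option among several.

```python
import pandas as pd

# Hypothetical frame: one tightly clustered column and one widely spread column.
toy_df = pd.DataFrame({"narrow": [1.0, 2.0, 3.0], "wide": [10.0, 200.0, 3000.0]})

summary = toy_df.describe()      # count, mean, std, min, quartiles, max
std_row = summary.loc["std"]     # standard deviation of each column
most_spread = std_row.idxmax()   # column label with the largest std

print(summary)
print(most_spread)
```

Keep in mind that standard deviations depend on the scale of each feature, so a large value may simply reflect large units.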

In [None]:
# ===== Please enter your answer code below. =====


In [None]:
# ===== Please enter your answer code below. =====
print()

### Problem 3-2

Use the `plot.hist` method of the dataframe to visualize the distribution of the `proline`. (*Hints are [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html))
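
A minimal histogram sketch on made-up data follows. The `value` column and the choice of `bins=5` are assumptions for illustration; the plot is drawn from a single column via `Series.plot.hist`.

```python
import pandas as pd

# Hypothetical column of made-up values.
toy_df = pd.DataFrame({"value": [1, 2, 2, 3, 3, 3, 4, 4, 5]})

# bins controls how many intervals the value range is divided into.
ax = toy_df["value"].plot.hist(bins=5)
ax.set_xlabel("value")
```

The call returns a matplotlib `Axes` object, which can be used to adjust labels and titles afterwards.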

In [None]:
# ===== Please enter your answer code below. =====


### Problem 3-3

Use the `plot.scatter` method of the dataframe to visualize the relationship between `total_phenols` and `flavanoids`. (*Hints are [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.scatter.html))

What can you conclude from the visualization?
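
As a sketch of the scatter-plot call, the example below plots two made-up columns against each other. The column names `x` and `y` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical frame in which y tends to grow with x.
toy_df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2.1, 3.9, 6.2, 8.1, 9.8]})

# x and y name the columns placed on the horizontal and vertical axes.
ax = toy_df.plot.scatter(x="x", y="y")
```

Points that rise together from lower left to upper right suggest a positive relationship between the two columns.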

In [None]:
# ===== Please enter your answer code below. =====


In [None]:
# ===== Please enter your answer code below. =====
print()

### Problem 3-4

Use the `corr` method of the dataframe to compute the correlation coefficient between `total_phenols` and `flavanoids`. (*Hints are [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html))
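
The sketch below shows two ways to obtain a correlation coefficient on made-up data: a correlation matrix for a subset of columns, and a single coefficient between two Series. The columns `x`, `y`, and `z` are hypothetical and chosen so that the relationships are exactly linear.

```python
import pandas as pd

# Hypothetical frame: y rises linearly with x, z falls linearly with x.
toy_df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})

# DataFrame.corr returns a matrix of pairwise (Pearson, by default) coefficients.
corr_matrix = toy_df[["x", "y"]].corr()

# Series.corr returns a single coefficient between two columns.
r_xy = toy_df["x"].corr(toy_df["y"])
r_xz = toy_df["x"].corr(toy_df["z"])

print(corr_matrix)
print(r_xy, r_xz)
```

Coefficients near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.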

In [None]:
# ===== Please enter your answer code below. =====


## Problem 4: Exploratory data analysis (EDA)

Perform EDA to identify features and trends in the data.

### Problem 4-1

Calculate the average of each feature for each wine type `target`.
Then select two features whose averages differ clearly between wine types.

In [None]:
# ===== Please enter your answer code below. =====


In [None]:
# ===== Please enter your answer code below. =====
print()
print()

### Problem 4-2

Use a scatter plot to check whether the two features selected in Problem 4-1 separate the wine types well.
Here we will draw it with `seaborn`'s `scatterplot`. (*Hints are [here](https://seaborn.pydata.org/generated/seaborn.scatterplot.html))
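
A minimal sketch of a scatter plot colored by class, using made-up data. The columns `feature_a`, `feature_b`, and `label` are hypothetical stand-ins for two chosen features and a class column.

```python
import pandas as pd
import seaborn as sns

# Hypothetical frame: three made-up classes that form separate clusters.
toy_df = pd.DataFrame({
    "feature_a": [1.0, 1.2, 3.0, 3.1, 5.0, 5.2],
    "feature_b": [2.0, 2.1, 0.5, 0.7, 4.0, 4.2],
    "label": [0, 0, 1, 1, 2, 2],
})

# hue colors each point by its class, so clusters per class become visible.
ax = sns.scatterplot(data=toy_df, x="feature_a", y="feature_b", hue="label")
```

If points of the same color form clusters with little overlap, the two features are likely useful for telling the classes apart.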

In [None]:
# ===== Please enter your answer code below. =====
