# <span style = "color:rebeccapurple"> Datasets</span>

<span style="text-transform: uppercase;
        font-size: 14px;
        letter-spacing: 1px;
        font-family: 'Segoe UI', sans-serif;">
    Author
</span><br>
efrén cruz cortés
<hr style="border: none; height: 1px; background: linear-gradient(to right, transparent 0%, #ccc 10%, transparent 100%); margin-top: 10px;">

**Main points**

- `sklearn.datasets` provides functions for loading the datasets that come with scikit-learn
- Datasets come with a variety of metadata
- Datasets can be returned as numpy arrays or pandas dataframes

## <span style = "color:darkorchid"> Imports</span>

First, as in any script, let's import the modules we will use in this workshop. In code, `scikit-learn` is imported as `sklearn`, pronounced "S - K - Learn".

In [1]:
# Scikit-learn specifics:
from sklearn import datasets

# Helper modules
import pandas as pd

<span style="color:red">**Warning**</span>

`datasets` here is a module of the `sklearn` library. There is also a separate, standalone library called `datasets`, maintained by Hugging Face. The two are unrelated, so keep track of whether you are using `sklearn.datasets` or Hugging Face's `datasets`.
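
One way to keep the two apart is to import the scikit-learn module under an explicit alias and check its full name. This is only a minimal sketch: the alias `sk_datasets` is an illustrative choice and is not used elsewhere in this workshop.

```python
# Import scikit-learn's datasets module under an explicit alias.
# The alias name "sk_datasets" is a hypothetical choice for illustration.
from sklearn import datasets as sk_datasets

# The module's full name tells us which library it belongs to
module_name = sk_datasets.__name__
print(module_name)
```

If the printed name starts with `sklearn`, you are working with the scikit-learn module.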

We will obtain data in two ways: through `scikit-learn`, which ships with a few small datasets, and through `pandas`, which we use to load our own datasets.

## <span style = "color:darkorchid"> Loading Data with scikit-learn</span>

`scikit-learn` includes some toy datasets that you can load directly from the library. We will use a couple of them in this tutorial. Loading them requires the `datasets` module, which we already imported above.

**Alice Collects Diabetes Data**

![alice-test-tube](images/alice_test_tube_v2_cropped.png){width=35%}

In [2]:
# Loading diabetes dataset
diabetes = datasets.load_diabetes()

A dataset is more than just the raw data: it also includes some metadata. Let's do a quick exploration. First, it turns out that sklearn datasets are "Bunch" objects.

In [3]:
type(diabetes)

sklearn.utils._bunch.Bunch

These are dictionary-like objects. We can see what they contain by looking at the keys:

In [4]:
# What does diabetes contain?
diabetes.keys()

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename', 'data_module'])
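
To make "dictionary-like" concrete, the sketch below checks that the object is an instance of Python's built-in `dict` and loops over its items as one would with any dictionary. The variable name `is_dict` is introduced here for illustration only.

```python
from sklearn import datasets

diabetes = datasets.load_diabetes()

# A Bunch behaves like a dictionary: check it against the built-in dict type
is_dict = isinstance(diabetes, dict)
print(is_dict)

# Loop over the entries and report the type of each value
for key, value in diabetes.items():
    print(f"{key}: {type(value).__name__}")
```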

The data itself will be in `data` and in `target`. But let's take a quick look at other aspects of this dataset.

The `DESCR` key contains a description of the dataset. It is a long string, so pass it to `print()` to display it with its line breaks:

In [5]:
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one year after baseline

:Attribute Information:
    - age     age in years
    - sex
    - bmi     body mass index
    - bp      average blood pressure
    - s1      tc, total serum cholesterol
    - s2      ldl, low-density lipoproteins
    - s3      hdl, high-density lipoproteins
    - s4      tch, total cholesterol / HDL
    - s5      ltg, possibly log of serum triglycerides level
    - s6      glu, blood sugar level

Note: Each of these 10 feature variables have bee

**Note**

We can obtain the elements of a dataset either by key name, as in dictionaries, or as if they were class attributes. For example, for `feature_names`:

In [6]:
# Dictionary syntax:
diabetes['feature_names']

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [7]:
# Attribute syntax:
diabetes.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
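
The two syntaxes should give access to the very same object rather than to two copies. A small sketch to check this, where `by_key`, `by_attribute` and `same_object` are names chosen for this example:

```python
from sklearn import datasets

diabetes = datasets.load_diabetes()

# Access the same entry with both syntaxes
by_key = diabetes["feature_names"]
by_attribute = diabetes.feature_names

# "is" checks object identity, not just equal contents
same_object = by_key is by_attribute
print(same_object)
```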

Now for what we came for: the `data` object.

By default, it is returned as a numpy array:

In [8]:
# Default datasets: arrays
diabetes.data

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286131, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04688253,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452873, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00422151,  0.00306441]], shape=(442, 10))
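
A plain array carries no column names, so to pick out one feature we have to look up its position in `feature_names`. The sketch below does this for `bmi`; the names `X`, `bmi_index` and `bmi` are introduced here for illustration.

```python
from sklearn import datasets

diabetes = datasets.load_diabetes()

# The feature matrix: one row per patient, one column per feature
X = diabetes.data
print(type(X).__name__, X.shape)

# Find the column position of "bmi" and slice that column out
bmi_index = diabetes.feature_names.index("bmi")
bmi = X[:, bmi_index]
print(bmi_index, bmi.shape)
```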

But you can also ask for it to be a pandas dataframe:

In [9]:
# Ask for pandas objects with as_frame=True
diabetes_df = datasets.load_diabetes(as_frame=True)
diabetes_df.data.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |


In [10]:
# The prediction target comes as a separate element:
diabetes_df.target.head()

0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: target, dtype: float64
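
When loading with `as_frame=True`, the `frame` entry holds the features and the target together in a single dataframe. The loader also accepts `return_X_y=True`, which returns the features and the target directly as a pair. A short sketch of both options, with `full`, `X` and `y` as illustrative names:

```python
from sklearn import datasets

# Option 1: one dataframe with the 10 features plus the target column
diabetes_df = datasets.load_diabetes(as_frame=True)
full = diabetes_df.frame
print(full.shape)

# Option 2: get features and target as a (dataframe, series) pair
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
print(X.shape, y.shape)
```

The first option is convenient for exploring the data, while the second gives the two pieces in the form most modeling code expects.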

`scikit-learn` also provides larger, real-world ("not toy") datasets, which we won't go into. You can find more about them <a href = "https://scikit-learn.org/stable/datasets.html" target = "_blank">here</a>.
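
In scikit-learn, these larger datasets are obtained through functions whose names start with `fetch_`, which download the data when called. The sketch below only lists those function names, so it needs no internet connection. The variable name `fetchers` is an illustrative choice, and the exact list depends on your installed scikit-learn version.

```python
from sklearn import datasets

# Collect the names of all downloader functions in the module
fetchers = sorted(name for name in dir(datasets) if name.startswith("fetch_"))
print(fetchers)
```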

#### <span style = "color:red"> EXERCISE</span>

**Bo's Field Data Collection**

![bo-iris](images/bo_iris_cropped.png){width=35%}

`scikit-learn` has an "iris" dataset. Based on how we loaded "diabetes", can you guess how to load "iris"?

1. Load iris using the `as_frame` argument.
2. Show the keys so you know how it is structured.
3. Print the description of the dataset.
4. Show the target names.
5. Obtain the feature data and the target data.
6. Show the first few rows of the feature data.
7. Show the full target vector.

In [None]:
# Load iris


In [None]:
# Show keys


In [None]:
# Print description


In [None]:
# Show labels (target names)


In [None]:
# Get feature data


In [None]:
# Get target data


In [None]:
# Show first few rows of features (predictive data)


In [None]:
# Show target


## <span style = "color:darkorchid">Loading our own data</span>

<b>NOTE</b> If you are using Google Colab, instead of working on your own computer, you'll need to load the datasets onto Colab as follows:

In [11]:
# For COLAB USERS only: change the variable below to 1
get_file_yn = 0
if get_file_yn:
    !wget https://raw.githubusercontent.com/efren-cc/scikit-learn-workshop/main/data/penguins.csv
    !wget https://raw.githubusercontent.com/efren-cc/scikit-learn-workshop/main/data/fish.csv

Let's load the "penguins" dataset from our "data" folder using pandas. Remember we imported pandas as pd above.

In [12]:
penguins_df = pd.read_csv("data/penguins.csv")

In [13]:
penguins_df.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |
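
`pd.read_csv()` accepts file-like objects as well as file paths, so you can try it out without the file on disk. The sketch below reads a few rows copied from the output above out of an in-memory string. It assumes the CSV has exactly these eight columns and no separate index column, which is inferred from the displayed output rather than from the file itself. The names `csv_text` and `sample_df` are illustrative.

```python
import io

import pandas as pd

# A few penguin rows as CSV text (assumed layout: eight columns, no index column)
csv_text = """species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
"""

# Wrap the string in a file-like object and read it like a file
sample_df = pd.read_csv(io.StringIO(csv_text))

# Check the size and the column types pandas inferred
print(sample_df.shape)
print(sample_df.dtypes)
```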


<b>Summary:</b> If we have our own .csv data, we can load it with the pandas `read_csv()` function. scikit-learn also has toy datasets we can play with. In our case we used `load_diabetes()` to obtain the diabetes dataset, but others are available. See the full list <a href = "https://scikit-learn.org/stable/datasets/toy_dataset.html" target = "_blank">here</a>.

#### <span style = "color:red"> EXERCISE</span>

1. Load the fish dataset from our data folder as a pandas dataframe.
2. Visualize the first few rows.
3. Make a new dataframe with only the "weight" column.
4. Drop the weight column from the original dataframe, and make sure the change is permanent.

In [None]:
# Read fish dataset


In [None]:
# Show first few rows


In [None]:
# Make new dataframe with only weight column


In [None]:
# Drop weight column from the original dataframe. Hint: .drop() method.
