### **Tools for Working with Data in Scikit-learn**

Scikit-learn offers several built-in datasets that you can use to practice and experiment with machine learning algorithms. These datasets are small and simple, making them ideal for learning and testing your models. Let's explore how to load and work with some of these datasets.

----------

### **Scikit-learn Built-in Datasets**

The following datasets are available directly from Scikit-learn:

-   **Iris Dataset**: A famous dataset containing 150 samples of iris flowers, with 4 features (sepal length, sepal width, petal length, petal width) and 3 target classes (species).
-   **Digits Dataset**: Contains 1797 images of handwritten digits (0-9), with 64 features (8x8 pixel images).
-   **Wine Dataset**: Contains 178 samples of wine, with 13 features, and target values representing 3 types of wine.
-   **Breast Cancer Dataset**: Contains 569 samples for binary classification, with 30 features derived from cell measurements, and a target indicating whether the tumor is malignant or benign.
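
The sizes of these datasets can be checked directly. The following is a minimal sketch that calls each loader with return_X_y=True, which returns the feature matrix and target array instead of the full dataset object. The `loaders` and `shapes` dictionaries are illustrative names, not part of Scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer, load_digits, load_iris, load_wine

# Map a readable name to each loader function (names chosen for illustration)
loaders = {
    "iris": load_iris,
    "digits": load_digits,
    "wine": load_wine,
    "breast_cancer": load_breast_cancer,
}

shapes = {}
for name, loader in loaders.items():
    # return_X_y=True returns the (features, target) arrays directly
    X, y = loader(return_X_y=True)
    shapes[name] = X.shape
    n_classes = len(np.unique(y))
    print(f"{name}: {X.shape[0]} samples, {X.shape[1]} features, {n_classes} classes")
```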

These datasets are easy to load and work with. Let’s see how to import and use them in a machine learning workflow.

### **Loading Datasets from Scikit-learn**

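You can load these datasets with the loader functions in the sklearn.datasets module. Each loader returns an object whose data and target attributes hold the features and labels as arrays. The examples below wrap those arrays in a pandas DataFrame for easier inspection.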

Loading the Iris Dataset

In [None]:
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

print(iris_df.head())


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
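
The target column above holds integer class codes rather than species names. As a small sketch of how the codes could be made readable, the loader's target_names array can be indexed with the codes. The `species` column name is an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target

# target_names is an array of species names indexed by class code
iris_df["species"] = iris.target_names[iris.target]

# Count how many samples belong to each species
species_counts = iris_df["species"].value_counts()
print(species_counts)
```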


Loading the Digits Dataset

In [3]:
import pandas as pd
from sklearn.datasets import load_digits

digits = load_digits()

# Each row holds the 64 pixel intensities of one flattened 8x8 image
digits_df = pd.DataFrame(data=digits.data)

print(digits_df.head())


    0    1    2     3     4     5    6    7    8    9   ...   54   55   56  \
0  0.0  0.0  5.0  13.0   9.0   1.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
1  0.0  0.0  0.0  12.0  13.0   5.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
2  0.0  0.0  0.0   4.0  15.0  12.0  0.0  0.0  0.0  0.0  ...  5.0  0.0  0.0   
3  0.0  0.0  7.0  15.0  13.0   1.0  0.0  0.0  0.0  8.0  ...  9.0  0.0  0.0   
4  0.0  0.0  0.0   1.0  11.0   0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   

    57   58    59    60    61   62   63  
0  0.0  6.0  13.0  10.0   0.0  0.0  0.0  
1  0.0  0.0  11.0  16.0  10.0  0.0  0.0  
2  0.0  0.0   3.0  11.0  16.0  9.0  0.0  
3  0.0  7.0  13.0  13.0   9.0  0.0  0.0  
4  0.0  0.0   2.0  16.0   4.0  0.0  0.0  

[5 rows x 64 columns]
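
Each of the 64 columns is one pixel of a flattened 8x8 image. The sketch below shows one way to recover the 2-D layout by reshaping a row, and compares it with the images attribute, where the loader stores the same pixels unflattened. The variable name `first_image` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Reshape the first flattened row back into an 8x8 grid of pixel intensities
first_image = digits.data[0].reshape(8, 8)

# digits.images holds the same pixels in their original 2-D layout
print(digits.images.shape)
print(np.array_equal(first_image, digits.images[0]))

# The label for this image is stored at the same index in digits.target
print(digits.target[0])
```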


Loading the Wine Dataset

In [4]:
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target

print(wine_df.head())


   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  target  
0          
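
As an alternative to building the DataFrame by hand, the loaders accept as_frame=True. The sketch below assumes Scikit-learn 0.23 or later, where this parameter is available, and reads the combined features-plus-target table from the frame attribute.

```python
from sklearn.datasets import load_wine

# as_frame=True asks the loader to return pandas objects instead of NumPy arrays
wine = load_wine(as_frame=True)

# .frame contains the 13 feature columns plus a "target" column
wine_df = wine.frame
print(wine_df.shape)

# Number of samples in each of the 3 wine classes
class_counts = wine_df["target"].value_counts().sort_index()
print(class_counts)
```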

Loading the Breast Cancer Dataset

In [6]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
cancer_df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
cancer_df['target'] = cancer.target

print(cancer_df.head())


   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0             
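
It is worth checking which class each target code stands for before interpreting results. The following short sketch reads the mapping from target_names and counts the samples per class. The `counts` dictionary is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# The position of each name is its class code: index 0 is the first name, index 1 the second
print(cancer.target_names)

# Pair each class name with the number of samples carrying that code
counts = {
    str(name): int(n)
    for name, n in zip(cancer.target_names, np.bincount(cancer.target))
}
print(counts)
```

In this dataset, code 0 corresponds to malignant and code 1 to benign, so a positive prediction from a model means benign unless the labels are remapped.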

### **Other Useful Dataset Tools**

In addition to the built-in datasets, Scikit-learn also supports loading datasets from other sources and creating custom datasets for experimentation. You can use the following methods:

Loading Data from CSV Files: You can load a CSV file into a pandas DataFrame, separate the feature columns from the target column, and pass the result directly to Scikit-learn, which accepts DataFrames as input.

In [None]:
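import pandas as pd
from sklearn.model_selection import train_test_split

# Replace the placeholder path with your own CSV file
data = pd.read_csv("your_dataset.csv")

# Assumes the label column is named "target"; every other column is a feature
X = data.drop("target", axis=1)
y = data["target"]

# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)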
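
The cell above needs a real file on disk. As a self-contained sketch of the same steps, the example below reads a small in-memory CSV instead. The column names `feature_a` and `feature_b` and all values are made up for illustration. It also passes stratify=y, an optional train_test_split argument that keeps the class proportions similar in both splits.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical CSV content standing in for "your_dataset.csv"
csv_text = """feature_a,feature_b,target
1.0,10.0,0
1.2,11.5,0
0.9,9.8,0
1.1,10.4,0
1.3,10.9,0
3.1,20.2,1
2.9,19.7,1
3.3,21.0,1
3.0,20.5,1
2.8,19.9,1
"""

data = pd.read_csv(io.StringIO(csv_text))
X = data.drop("target", axis=1)
y = data["target"]

# stratify=y keeps both classes represented in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```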


Making Custom Datasets: Scikit-learn provides the make_classification and make_regression functions, which generate synthetic classification and regression datasets for experimentation.

In [8]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a synthetic classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
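
make_regression can be used in the same way when a continuous target is needed. The sketch below mirrors the classification example. The n_informative and noise values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Create a synthetic regression dataset: 5 of the 20 features drive the target,
# and Gaussian noise is added to the output
X, y = make_regression(
    n_samples=1000, n_features=20, n_informative=5, noise=10.0, random_state=42
)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```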


### **Summary**

Scikit-learn provides easy access to a variety of built-in datasets for learning and testing machine learning algorithms. These datasets can be directly imported and used with minimal setup, making them ideal for practice. Here’s a quick recap:

-   **Iris Dataset**: For classification with 3 target classes of flowers.
-   **Digits Dataset**: For classification of digits (0-9) from images.
-   **Wine Dataset**: For classification based on chemical properties of wine.
-   **Breast Cancer Dataset**: For binary classification of tumor malignancy.