In [6]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [13]:
df = pd.read_csv('medicine_dataset.csv')

In [14]:
df.describe()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0,...,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0,455.0,0.0
mean,31272730.0,14.101624,19.330044,91.761407,653.092967,0.096399,0.102785,0.087554,0.048528,0.180801,...,25.715758,106.818396,875.890989,0.132168,0.248725,0.266856,0.113331,0.288862,0.083481,
std,126883500.0,3.537382,4.351432,24.418623,355.771676,0.01387,0.052524,0.079708,0.0387,0.027667,...,6.152919,33.496205,569.464574,0.02186,0.15274,0.206016,0.064933,0.060625,0.01787,
min,8670.0,7.691,9.71,47.98,170.4,0.06251,0.01938,0.0,0.0,0.106,...,12.02,54.49,223.6,0.07117,0.02729,0.0,0.0,0.1565,0.05504,
25%,871421.0,11.65,16.195,74.72,414.45,0.086185,0.062975,0.02962,0.02029,0.1614,...,20.97,83.68,509.25,0.11705,0.1436,0.1104,0.06306,0.25055,0.071365,
50%,907914.0,13.28,18.89,85.98,545.2,0.09646,0.09218,0.05929,0.03279,0.1793,...,25.47,97.19,677.9,0.1316,0.2097,0.2249,0.09851,0.2823,0.07944,
75%,8885476.0,15.75,21.805,103.7,777.25,0.1051,0.12975,0.12195,0.07214,0.1958,...,30.08,125.05,1077.0,0.146,0.3304,0.37515,0.1611,0.31715,0.091825,
max,911320500.0,28.11,39.28,188.5,2501.0,0.1447,0.3454,0.4268,0.2012,0.304,...,47.16,251.2,4254.0,0.2098,1.058,1.252,0.291,0.6638,0.2075,
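
The `Unnamed: 32` column in the summary above has a count of 0, which suggests it contains no values at all. A minimal sketch of dropping such columns before further analysis follows. The helper name `drop_empty_columns` and the toy frame are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_empty_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without columns whose values are all missing."""
    return frame.dropna(axis="columns", how="all")


# Toy stand-in for the real data (hypothetical values, for illustration only).
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "radius_mean": [14.1, 13.3, 15.8],
        "Unnamed: 32": [np.nan, np.nan, np.nan],
    }
)
cleaned = drop_empty_columns(toy)
print(list(cleaned.columns))
```

On the real data the same call would be `df = drop_empty_columns(df)`, assuming `df` was loaded as in the cell above.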


## About: 

The dataset is the “Breast Cancer Wisconsin (Diagnostic) Dataset,” which consists of features derived from diagnostic images used to detect breast cancer. It is available on Kaggle, published by UCI Machine Learning (University of California, Irvine). The features are computed from digitized images of a fine needle aspirate (FNA) of a breast mass. FNA is a minimally invasive procedure in which a thin, hollow needle is inserted into suspect tissue to collect tissue or fluid samples, which are placed on a glass slide for analysis. The slides are digitized, and a graphical user interface is used to outline each cell nucleus with a contour model known as a “snake.” Ten features are computed for each nucleus. For each image, the mean, the standard error, and the largest or “worst” value of each feature are then recorded.

This dataset, from the UCI Machine Learning Repository, describes the cell characteristics of patients examined for breast cancer, drawn from the Diagnostic Wisconsin Breast Cancer Database.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.  They describe characteristics of the cell nuclei present in the image.

The dataset contains a unique ID for each patient, the diagnosis (malignant or benign), and summary values of the visual characteristics of the cell nuclei.

## Variables:

Radius: The mean of distances from the centroid to points along the snake perimeter of the cell.

Texture: The standard deviation of the grayscale intensities. It is calculated from the pixel values to capture visual patterns that indicate roughness, smoothness, or other texture properties. A high standard deviation in the grayscale values indicates a rough or complicated texture. A low standard deviation, where the pixel values are bunched together, indicates a smooth or constant texture.
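
As a rough illustration of the texture definition above, the sketch below takes the standard deviation of a patch of grayscale values. The function name `texture` and the two toy patches are assumptions made for this example. How the original pipeline selects the pixels belonging to a nucleus is not reproduced here.

```python
import numpy as np


def texture(gray_pixels: np.ndarray) -> float:
    """Standard deviation of the grayscale intensities in a pixel patch."""
    return float(np.std(gray_pixels))


# Hypothetical patches: one uniform, one alternating between extremes.
flat = np.full((4, 4), 128.0)
rough = np.array([[0.0, 255.0], [255.0, 0.0]])

print(texture(flat), texture(rough))
```

The uniform patch gives a texture of zero, while the alternating patch gives a large value, matching the smooth versus rough reading described above.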

Perimeter: The measured perimeter of the cell nucleus.

Area: The calculated area of the cell nucleus.
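
The radius, perimeter, and area definitions above can be sketched for a contour given as a list of boundary points. This is a simplified illustration: it assumes the snake is available as an `(N, 2)` array of ordered points, takes the centroid as the mean of those points, and uses a polygon approximation for perimeter and area. The function name `nucleus_geometry` is hypothetical.

```python
import numpy as np


def nucleus_geometry(points: np.ndarray) -> dict:
    """Radius, perimeter, and area of a closed contour given as an (N, 2) array."""
    centroid = points.mean(axis=0)
    # Radius: mean distance from the centroid to the contour points.
    radius = np.linalg.norm(points - centroid, axis=1).mean()
    # Perimeter: sum of distances between consecutive points (closed loop).
    nxt = np.roll(points, -1, axis=0)
    perimeter = np.linalg.norm(nxt - points, axis=1).sum()
    # Area: shoelace formula for a simple polygon.
    x, y = points[:, 0], points[:, 1]
    area = 0.5 * abs(np.dot(x, nxt[:, 1]) - np.dot(y, nxt[:, 0]))
    return {"radius": float(radius), "perimeter": float(perimeter), "area": float(area)}


# Hypothetical contour: 360 evenly spaced points on a circle of radius 5.
angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
circle = np.column_stack([5.0 * np.cos(angles), 5.0 * np.sin(angles)])
geometry = nucleus_geometry(circle)
print(geometry)
```

For the circular test contour, the results should be close to the analytic values: radius 5, perimeter 2π·5, and area π·5².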

Compactness: The perimeter and area are combined to give the compactness through the formula perimeter² / area − 1, that is, the perimeter squared divided by the area, minus one. This measurement can be biased toward higher values for smaller cells, because they are measured less accurately.
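
The compactness formula can be written out directly. The sketch below only illustrates the formula as stated; the `compactness_*` columns in the CSV may be on a different scale, so this is not a claim about how those stored values were produced. The function name and the two test shapes are assumptions.

```python
import math


def compactness(perimeter: float, area: float) -> float:
    """Perimeter squared divided by area, minus one."""
    return perimeter ** 2 / area - 1.0


# A circle is the most compact shape; a square of any size scores higher.
r = 3.0
circle_value = compactness(2 * math.pi * r, math.pi * r ** 2)
square_value = compactness(4 * 2.0, 2.0 ** 2)
print(circle_value, square_value)
```

For a circle the value is 4π − 1 regardless of radius, which shows the measure depends on shape rather than size.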

Smoothness: A measure of the difference between the length of a radial line and the mean lengths of lines surrounding it.

Concavity: A measure of the number and severity of concavities, or indentations, in the cell nucleus. This is measured by drawing chords between non-adjacent snake points on the perimeter, then measuring how far the boundary of the nucleus lies inside each chord.

Concave Points: This measurement is similar to concavity, but it counts the number of concavities in the contour rather than measuring their extent.

Symmetry: To measure the symmetry of a cell nucleus, the longest chord through the center, called the major axis, is found first. Chords perpendicular to the major axis are then drawn in each direction, splitting the nucleus into a grid. Lastly, the difference in length between the perpendicular lines on either side of the major axis is measured. If the major axis cuts through the nucleus perimeter because of a concavity, further measurements are used.

Fractal Dimension: This measurement is obtained with a method known as “coastline approximation.” The perimeter is measured with increasingly large rulers; as the ruler size increases, precision decreases and the measured perimeter shrinks. The measured perimeters are plotted against ruler size on a log scale, and the downward slope of that plot gives an approximate fractal dimension.
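
The coastline idea can be sketched by re-measuring a contour's perimeter at coarser and coarser resolutions and fitting a line to the log-log plot. This is a simplification under stated assumptions: the "ruler size" is approximated by the subsampling step over the contour points, and the function returns only the fitted slope. How the dataset converts that slope into the stored `fractal_dimension_*` values is not reproduced. All names are hypothetical.

```python
import numpy as np


def closed_perimeter(points: np.ndarray) -> float:
    """Perimeter of a closed polygon given as an (N, 2) array of ordered points."""
    return float(np.linalg.norm(np.roll(points, -1, axis=0) - points, axis=1).sum())


def coastline_slope(points: np.ndarray, steps=(1, 2, 4, 8, 16)) -> float:
    """Slope of log(perimeter) against log(step), using every `step`-th point."""
    perimeters = [closed_perimeter(points[::s]) for s in steps]
    slope, _intercept = np.polyfit(np.log(steps), np.log(perimeters), 1)
    return float(slope)


# Hypothetical contours: a smooth circle and a zigzag version of it.
angles = np.linspace(0.0, 2.0 * np.pi, 512, endpoint=False)
smooth = np.column_stack([np.cos(angles), np.sin(angles)])
zigzag_radius = 1.0 + 0.1 * np.where(np.arange(512) % 2 == 0, 1.0, -1.0)
jagged = np.column_stack([zigzag_radius * np.cos(angles), zigzag_radius * np.sin(angles)])

smooth_slope = coastline_slope(smooth)
jagged_slope = coastline_slope(jagged)
print(smooth_slope, jagged_slope)
```

The smooth contour gives a slope near zero because its perimeter barely changes with resolution, while the jagged contour gives a clearly negative slope, which is the behavior the coastline method relies on.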

### What to predict: 

Diagnosis: The target variable for this dataset is a column called “diagnosis.” This column contains two classes, “M” for malignant or “B” for benign.

The dataset can be viewed as a classification task. In particular, which features of the cell nuclei have the greatest correlation with a malignant diagnosis for breast cancer?
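
One simple way to approach the correlation question is to encode the diagnosis as 0 or 1 and rank the numeric features by their Pearson correlation with it. The sketch below is one possible approach, not the notebook's own method. The function name and the toy values are assumptions; on the real data, the `id` column and any empty columns should be dropped first.

```python
import pandas as pd


def correlation_with_malignancy(frame: pd.DataFrame) -> pd.Series:
    """Rank numeric features by Pearson correlation with a 0/1 malignancy flag."""
    target = frame["diagnosis"].map({"B": 0, "M": 1})
    features = frame.drop(columns=["diagnosis"]).select_dtypes("number")
    return features.corrwith(target).sort_values(ascending=False)


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame(
    {
        "diagnosis": ["M", "B", "M", "B", "M", "B"],
        "radius_mean": [20.1, 11.2, 18.3, 12.0, 19.5, 10.8],
        "smoothness_mean": [0.10, 0.09, 0.09, 0.10, 0.10, 0.10],
    }
)
ranking = correlation_with_malignancy(toy)
print(ranking)
```

Correlation with a binary label captures only linear, one-feature-at-a-time relationships, so it is a starting point rather than a full feature-importance analysis.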


The main features of the dataset are as follows:
- id: Represents a unique ID of each patient.
- diagnosis: Indicates the diagnosis. This property can take the values "M" (malignant) or "B" (benign).
- radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean: Represent the mean values of the visual characteristics of the cell nuclei.

Each sample contains the patient's unique ID, the cancer diagnosis and the average values of the cancer's visual characteristics.

Such a dataset can be used to train and test models and algorithms for cancer diagnosis. Analyzing it can also improve understanding of which visual features are most informative for diagnosis.