# **Breast cancer preprocessing notebook**

## Problem definition
The data set contains the characteristics of a microscope image of a well-defined cellular area taken from a biopsy of a breast mass.
The features are computed from a digitized image of a fine needle aspirate (FNA) of the mass.
The data describe 10 characteristics of each sample that capture the size, shape and texture of the cell nuclei present in the image.
The mean, standard error and extreme ("worst") values are computed for each of the 10 characteristics of every image, giving a total of 30 features per sample.
The data set, obtained by Dr. Wolberg, is known as the Wisconsin Breast Cancer Data and has been used to study and correctly classify cases of malignant tumors.
The goal is to use machine learning to determine whether the analyzed mass is benign or malignant. To do so, we will experiment with different classification models to reach results that can help detect cases of cancer.
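
As a rough illustration of how the 30 feature columns arise, the sketch below combines the 10 base characteristics with the 3 statistics. The base names follow the column names that appear later in this notebook; the variable names (`base_features`, `statistics`, `feature_names`) are assumptions made only for this example.

```python
# Ten base characteristics measured on the cell nuclei
# (spelled as in the column names used later in this notebook).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three statistics computed for each characteristic.
statistics = ["mean", "se", "worst"]

# One column per (characteristic, statistic) pair.
feature_names = [f"{feature}_{stat}" for stat in statistics for feature in base_features]
print(len(feature_names))
```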


For more information about the procedure, see: https://pages.cs.wisc.edu/~olvi/uwmp/mpml.html



## Description of input and output data

| Feature name | Type | Missing % | Description and values |
|---|---|---|---|
| diagnosis (Target) | Object | 4.61% | Diagnosis: B (Benign) or M (Malignant) for each case |
| erty | int64 | 0% | ... |
| iuytr | Object | 4.18% | ... |
| id | int64 | 3.09% | ID number, unique identifier of the observation |
| index | int64 | 0% | ... |
| radius_mean | Object | 3.60% | mean of distances from center to points on the perimeter |
| texture_mean | Object | 3.83% | standard deviation of gray-scale values |
| perimeter_mean | Object | 3.85% | mean size of the core tumor |
| area_mean | Object | 4.46% | ... |
| smoothness_mean | Object | 4.49% | mean of local variation in radius lengths |
| compactness_mean | Object | 3.93% | mean of perimeter^2 / area - 1.0 |
| concavity_mean | Object | 4.36% | mean of severity of concave portions of the contour |
| concave points_mean | Object | 3.70% | mean for number of concave portions of the contour |
| symmetry_mean | Object | 4.18% | ... |
| fractal_dimension_mean | Object | 3.37% | mean for “coastline approximation” - 1 |
| radius_se | Object | 3.30% | standard error for the mean of distances from center to points on the perimeter |
| texture_se | Object | 3.37% | standard error for standard deviation of gray-scale values |
| perimeter_se | Object | 3.52% | ... |
| area_se | Object | 4.13% | ... |
| smoothness_se | Object | 3.04% | standard error for local variation in radius lengths |
| compactness_se | Object | 3.57% | standard error for perimeter^2 / area - 1.0 |
| concavity_se | Object | 4.08% | standard error for severity of concave portions of the contour |
| concave points_se | Object | 3.6% | standard error for number of concave portions of the contour |
| symmetry_se | Object | 3.22% | ... |
| fractal_dimension_se | Object | 3.42% | standard error for “coastline approximation” - 1 |
| radius_worst | Object | 3.17% | “worst” or largest mean value for mean of distances from center to points on the perimeter |
| texture_worst | Object | 3.29% | “worst” or largest mean value for standard deviation of gray-scale values |
| perimeter_worst | Object | 3.23% | ... |
| area_worst | Object | 3.29% | ... |
| smoothness_worst | Object | 3.57% | “worst” or largest mean value for local variation in radius lengths |
| compactness_worst | Object | 3.09% | “worst” or largest mean value for perimeter^2 / area - 1.0 |
| concavity_worst | Object | 3.32% | “worst” or largest mean value for severity of concave portions of the contour |
| concave points_worst | Object | 2.96% | ... |
| symmetry_worst | Object | 3.17% | “worst” or largest mean value for symmetry |
| fractal_dimension_worst | Object | 3.12% | “worst” or largest mean value for “coastline approximation” - 1 |



**Checking the dtypes and values, we can see that the columns are not stored with the dtypes they are supposed to have: the `Object` columns must be converted to a numeric dtype before we can perform the analysis. We also see that a few values are missing in each column. We'll fix both problems.**
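
To make the dtype problem concrete, here is a minimal sketch on a made-up toy column. It uses `pd.to_numeric` with `errors="coerce"`, which is a different route from the `replace` plus `astype` approach this notebook takes further down; the toy values and the names `toy` and `cleaned` are assumptions for illustration only.

```python
import pandas as pd

# Toy column (made-up values) stored as text, as happens in the raw file.
toy = pd.DataFrame({"radius_mean": ["17.99", "?", " 20.57 ", None]})

# Strip surrounding whitespace, then turn anything non-numeric into NaN.
cleaned = pd.to_numeric(toy["radius_mean"].str.strip(), errors="coerce")

print(cleaned.dtype)         # numeric dtype after coercion
print(cleaned.isna().sum())  # number of values that could not be parsed
```

Coercion is convenient but silent: every unparseable value becomes NaN, so it is worth counting the NaNs before and after to see how much was lost.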

# Import libraries

In [33]:
import pandas as pd
import numpy as np
import fastparquet
import pyarrow

# Load datasets

In [34]:
df=pd.read_csv('../data/raw/BreastCancerDS.csv',index_col=0)


# General description of the dataset
The raw dataset contains 19710 rows and 35 columns. The dtypes are not correct, since columns that are supposed to be numeric are currently stored as `object`; we will fix this problem later.

In [None]:
df.info()

## General data cleaning and quality

In [None]:
df.isnull().sum() ##checking amount of null data per column


In [None]:
df[['index','id']] #checking differences between index and id


Checking null values in index and id columns

In [None]:
print("index null count : "+str(df["index"].isnull().sum()), ", id null count: "+ str(df["id"].isnull().sum()))

Checking for duplicated rows, we find 15773 of them. They need to be removed: each row describes the cell characteristics of one breast mass case, so there is no reason for a case to appear twice in the same dataset.

In [None]:
df.duplicated().value_counts() ##checking duplicates
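
The duplicate check can be illustrated on a small made-up frame. This is only a sketch: the toy values and the names `n_repeats` and `n_in_groups` are assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy frame with one pair and one triple of identical rows.
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 3],
    "radius_mean": [10.0, 10.0, 12.5, 9.1, 9.1, 9.1],
})

# duplicated() flags every repeat after the first occurrence of a row.
n_repeats = int(toy.duplicated().sum())

# keep=False flags every row that belongs to a duplicated group.
n_in_groups = int(toy.duplicated(keep=False).sum())

print(n_repeats, n_in_groups)
```

The first count is the number of rows that `drop_duplicates` would remove; the second also includes the first occurrence of each repeated row.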


In [41]:
df=df[(df['diagnosis']=='B')|(df['diagnosis']=='M')] ## only correctly labeled examples are kept for the analysis.


In [42]:
df['diagnosis'].value_counts() #now we only have B or M diagnosis in our dataset

B    10245
M     5935
Name: diagnosis, dtype: int64
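
The same label filter can also be expressed with `Series.isin`, which may scale more comfortably if more valid labels are ever added. The sketch below uses a made-up toy column; `valid_labels` and `filtered` are names assumed for the example.

```python
import pandas as pd

# Toy target column with valid labels, a corrupt marker, a missing
# value and a wrongly cased label.
toy = pd.DataFrame({"diagnosis": ["B", "M", "?", None, "B", "m"]})

valid_labels = ["B", "M"]
# isin() is False for NaN/None and for anything not in the list.
filtered = toy[toy["diagnosis"].isin(valid_labels)]

print(filtered["diagnosis"].value_counts())
```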

## Deleting the iuytr and erty columns
* The 'erty' column contains the same value in every row, so we delete it since it provides no information.
* The 'iuytr' column contains the same information as 'symmetry_mean', so we keep only one of them; here 'iuytr' is deleted.
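
Both observations can be checked programmatically rather than by eye. The following is a minimal sketch on a made-up toy frame; `constant_cols` and `same_content` are hypothetical names, and the values are only loosely modelled on the outputs shown in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "erty": [908765434567] * 4,
    "iuytr": ["211.0", "?", "0.1671", "0.1667"],
    "symmetry_mean": ["211.0", "?", "0.1671", "0.1667"],
})

# A column with a single distinct value carries no information.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# Series.equals compares values element by element (NaNs in the same
# position count as equal) and ignores the column name.
same_content = toy["iuytr"].equals(toy["symmetry_mean"])

print(constant_cols, same_content)
```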

In [43]:
df['erty'].value_counts() ## all values are identical, so erty is dropped


908765434567    16180
Name: erty, dtype: int64

In [44]:
df.drop('erty',axis=1,inplace=True)

In [45]:
df[['iuytr','symmetry_mean']] ## these two columns hold the same information, drop one


Unnamed: 0,iuytr,symmetry_mean
1,211.0,211.0
2,-88888765432345.0,-88888765432345.0
4,?,?
5,0.1671,0.1671
7,0.1667,0.1667
...,...,...
19703,216.0,216.0
19705,0.1671,0.1671
19706,0.1405,0.1405
19708,0.2275,0.2275


In [46]:
df.drop('iuytr',axis=1,inplace=True)

In [47]:
df

Unnamed: 0,index,perimeter_se,radius_worst,concave points_mean,smoothness_mean,area_mean,concavity_se,texture_mean,concavity_worst,smoothness_se,...,id,symmetry_mean,symmetry_worst,diagnosis,fractal_dimension_se,perimeter_mean,compactness_worst,symmetry_se,compactness_se,radius_se
1,279,1.83,999765432456788.0,0.03711,0.09516,587.4,0.01457,15.18,0.1456,0.004235,...,8911834,211.0,0.2955,B,0.001593,999765432456788.0,0.1724,0.01528,0.01541,999765432456788.0
2,307,1144.0,9699.0,0.003472,-88888765432345.0,246.3,0.003681,14.4,0.01472,0.007389,...,89346,-88888765432345.0,0.2991,B,0.002153,56.36,0.05232,0.02701,0.004883,0.1746
4,576,2183.0,?,0.02278,0.09524,409.0,0.01349,18.75,?,0.008328,...,85759902,?,0.3306,B,0.002386,73.34,?,0.03218,0.008722,0.3249
5,350,2225.0,13.28,0.01162,0.07561,421.0,0.005949,17.07,0.03046,0.006583,...,899187,0.1671,0.2731,B,0.002668,73.7,0.06476,0.02216,0.006991,0.3534
7,120,1103.0,12.82,0.02623,0.09373,403.3,0.01514,10.82,0.2102,?,...,865137,0.1667,0.3016,B,0.002206,73.34,239.0,0.01344,?,?
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19703,602,2844.0,?,0.05613,0.1008,809.8,0.02219,21.54,0.2992,0.004877,...,86730502,216.0,?,M,?,106.2,0.3055,0.01535,0.01952,0.4332
19705,350,2225.0,rxctf378968 7656463sdfg,0.01162,0.07561,421.0,0.005949,17.07,0.03046,rxctf378968 7656463sdfg,...,899187,0.1671,0.2731,B,rxctf378968 7656463sdfg,73.7,0.06476,0.02216,0.006991,0.3534
19706,442,2235.0,15.27,0.009937,-88888765432345.0,585.9,0.007741,15.79,0.03517,-88888765432345.0,...,90944601,0.1405,0.1859,B,0.002564,-88888765432345.0,0.1071,-88888765432345.0,0.01156,0.3563
19708,501,2974.0,16.01,0.06759,0.1162,595.9,0.03476,24.49,0.3381,0.00968,...,91504,0.2275,0.3651,M,0.006995,92.33,0.3966,0.02434,0.03856,0.4751


## Identifying missing values
We found around 3-4% of NaN values in each feature. We will decide what to do with these features after a more in-depth analysis; see below.

In [48]:
missing = df.isnull().sum()
missing[missing>0]*100/len(df)

perimeter_se               3.275649
radius_worst               3.182942
concave points_mean        3.244747
smoothness_mean            4.295426
area_mean                  4.140915
concavity_se               3.831891
texture_mean               3.522868
concavity_worst            3.213844
smoothness_se              2.626700
concave points_se          3.430161
area_worst                 3.337454
compactness_mean           3.770087
radius_mean                3.491965
area_se                    3.831891
concave points_worst       2.997528
fractal_dimension_worst    2.904821
perimeter_worst            3.152040
texture_se                 3.182942
fractal_dimension_mean     3.090235
texture_worst              3.244747
smoothness_worst           3.337454
concavity_mean             4.295426
id                         2.657602
symmetry_mean              3.831891
symmetry_worst             2.843016
fractal_dimension_se       3.059333
perimeter_mean             3.770087
compactness_worst          2
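
An equivalent way to obtain the percentage of missing values per column is to take the mean of the boolean mask, as sketched here on a made-up toy frame (`missing_pct` is a name assumed for the example).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "radius_mean": [1.0, np.nan, 3.0, 4.0],
    "area_mean": [np.nan, np.nan, 5.0, 6.0],
    "diagnosis": ["B", "M", "B", "M"],
})

# The mean of a boolean mask is the fraction of True values per column.
missing_pct = toy.isna().mean() * 100

# Keep only columns with missing data, worst first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```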

## Stripping leading and trailing whitespace
* At this point we need to check whether the data contain leading or trailing spaces that are not visible. To do this, we cast the `index` column to string and use a for loop to apply `str.strip()` to every column, so that spaces before and after each string are removed. The columns will be converted back to numeric types later, as they are supposed to be, but working with strings first lets us catch possible mistakes during cleaning.

In [49]:
df['index']=df['index'].astype('string')
## named columns_to_strip so the built-in list is not shadowed
columns_to_strip= ['index', 'perimeter_se', 'radius_worst', 'concave points_mean',
       'smoothness_mean', 'area_mean', 'concavity_se', 'texture_mean',
       'concavity_worst', 'smoothness_se', 'concave points_se', 'area_worst',
       'compactness_mean', 'radius_mean', 'area_se', 'concave points_worst',
       'fractal_dimension_worst', 'perimeter_worst', 'texture_se',
       'fractal_dimension_mean', 'texture_worst', 'smoothness_worst',
       'concavity_mean', 'id', 'symmetry_mean', 'symmetry_worst', 'diagnosis',
       'fractal_dimension_se', 'perimeter_mean', 'compactness_worst',
       'symmetry_se', 'compactness_se', 'radius_se']
for col in columns_to_strip:
    df[col]= df[col].str.strip() ## remove leading and trailing whitespace
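
Why stripping matters can be seen on a small made-up example: a value such as `' ? '` does not compare equal to `'?'` until the surrounding spaces are removed. The toy values and the names `before` and `after` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "radius_mean": [" 17.99", "20.57 ", " ? "],
    "diagnosis": ["M ", " B", "B"],
})

# Before stripping, the padded marker is not recognised as '?'.
before = int((toy["radius_mean"] == "?").sum())

for col in toy.columns:
    toy[col] = toy[col].str.strip()

# After stripping, the same comparison finds it.
after = int((toy["radius_mean"] == "?").sum())
print(before, after)
```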

## Handling corrupt values
* Inspecting the whole dataset, we find 4 types of corrupt values present in almost all columns. We collect them in a list and replace each of them with `np.nan` in every column of the dataset (`df`), so that they can then be handled with the standard missing-value methods.


Errors found:
- 'rxctf378968 7656463sdfg'
- '-88888765432345.0'
- '999765432456788.0'
- '?'

**We check the columns one by one to find the error types present in each (only the 6 most frequent values of each column are shown, for readability).** This confirms that the 4 error types listed above are present in almost all columns.

In [50]:
for i in df.columns:
    print(df[i].value_counts().head(6))
    print('\n'+'------------------------')

122    30
238    30
149    30
618    30
628    30
153    30
Name: index, dtype: Int64

------------------------
rxctf378968 7656463sdfg    545
-88888765432345.0          525
?                          475
999765432456788.0          465
1778.0                      85
2765.0                      80
Name: perimeter_se, dtype: int64

------------------------
-88888765432345.0          520
999765432456788.0          515
rxctf378968 7656463sdfg    510
?                          490
13.34                      115
14.34                      115
Name: radius_worst, dtype: int64

------------------------
-88888765432345.0          535
rxctf378968 7656463sdfg    535
999765432456788.0          500
?                          485
0.0                        300
0.05564                     80
Name: concave points_mean, dtype: int64

------------------------
rxctf378968 7656463sdfg    555
999765432456788.0          505
?                          500
-88888765432345.0          475
0.1007                

We create a list with the 4 error values found and replace them with NaN across the whole dataset.

In [51]:
error_values=['rxctf378968 7656463sdfg','-88888765432345.0','999765432456788.0','?']
df.replace(error_values,np.nan,inplace=True) ## replace every corrupt value with NaN in one pass
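
Before replacing the corrupt values it can be useful to count how many of them each column holds. The sketch below does this on a made-up toy frame with `DataFrame.isin`; the toy data and the names `errors_per_column` and `cleaned` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# The four corrupt markers identified in this notebook.
error_values = ["rxctf378968 7656463sdfg", "-88888765432345.0",
                "999765432456788.0", "?"]

toy = pd.DataFrame({
    "perimeter_se": ["1.83", "?", "999765432456788.0", "2.2"],
    "radius_worst": ["rxctf378968 7656463sdfg", "13.28",
                     "-88888765432345.0", "?"],
})

# Count corrupt entries per column.
errors_per_column = toy.isin(error_values).sum()

# Replace all markers with NaN in a single call.
cleaned = toy.replace(error_values, np.nan)

print(errors_per_column)
print(cleaned.isna().sum())
```

Note that the match is on exact strings, so a marker stored with a different spelling (for example without the trailing `.0`) would not be caught.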


In [52]:
df

Unnamed: 0,index,perimeter_se,radius_worst,concave points_mean,smoothness_mean,area_mean,concavity_se,texture_mean,concavity_worst,smoothness_se,...,id,symmetry_mean,symmetry_worst,diagnosis,fractal_dimension_se,perimeter_mean,compactness_worst,symmetry_se,compactness_se,radius_se
1,279,1.83,,0.03711,0.09516,587.4,0.01457,15.18,0.1456,0.004235,...,8911834,211.0,0.2955,B,0.001593,,0.1724,0.01528,0.01541,
2,307,1144.0,9699.0,0.003472,,246.3,0.003681,14.4,0.01472,0.007389,...,89346,,0.2991,B,0.002153,56.36,0.05232,0.02701,0.004883,0.1746
4,576,2183.0,,0.02278,0.09524,409.0,0.01349,18.75,,0.008328,...,85759902,,0.3306,B,0.002386,73.34,,0.03218,0.008722,0.3249
5,350,2225.0,13.28,0.01162,0.07561,421.0,0.005949,17.07,0.03046,0.006583,...,899187,0.1671,0.2731,B,0.002668,73.7,0.06476,0.02216,0.006991,0.3534
7,120,1103.0,12.82,0.02623,0.09373,403.3,0.01514,10.82,0.2102,,...,865137,0.1667,0.3016,B,0.002206,73.34,239.0,0.01344,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19703,602,2844.0,,0.05613,0.1008,809.8,0.02219,21.54,0.2992,0.004877,...,86730502,216.0,,M,,106.2,0.3055,0.01535,0.01952,0.4332
19705,350,2225.0,,0.01162,0.07561,421.0,0.005949,17.07,0.03046,,...,899187,0.1671,0.2731,B,,73.7,0.06476,0.02216,0.006991,0.3534
19706,442,2235.0,15.27,0.009937,,585.9,0.007741,15.79,0.03517,,...,90944601,0.1405,0.1859,B,0.002564,,0.1071,,0.01156,0.3563
19708,501,2974.0,16.01,0.06759,0.1162,595.9,0.03476,24.49,0.3381,0.00968,...,91504,0.2275,0.3651,M,0.006995,92.33,0.3966,0.02434,0.03856,0.4751


## Encoding the target
We use scikit-learn's LabelEncoder to encode diagnosis as follows:
0 if the diagnosis is B (benign) and 1 if it is M (malignant).

In [53]:
from sklearn import preprocessing
label_encoding = preprocessing.LabelEncoder()
df['diagnosis'] = label_encoding.fit_transform(df['diagnosis'])

## 0 is benign and 1 is malignant
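
`LabelEncoder` assigns integer codes following the sorted order of the classes, which is why B becomes 0 and M becomes 1. As an alternative that makes the assignment explicit, the sketch below uses a plain dictionary with `Series.map`; this is a swapped-in technique for illustration, not the encoder used above, and `mapping` and `encoded` are assumed names.

```python
import pandas as pd

labels = pd.Series(["B", "M", "M", "B"], name="diagnosis")

# Explicit mapping: the 0/1 meaning is visible in the code itself.
mapping = {"B": 0, "M": 1}
encoded = labels.map(mapping)

print(encoded.tolist())
```

With an explicit mapping, any label missing from the dictionary becomes NaN instead of silently receiving its own code.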


## Setting the dataset columns' dtype to float and diagnosis to category

In [54]:
df=df.astype('float')

In [55]:
df['diagnosis']=df['diagnosis'].astype('int').astype('category')
df['index']=df['index'].astype('int')




In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16180 entries, 1 to 19709
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   index                    16180 non-null  int32   
 1   perimeter_se             13640 non-null  float64 
 2   radius_worst             13630 non-null  float64 
 3   concave points_mean      13600 non-null  float64 
 4   smoothness_mean          13450 non-null  float64 
 5   area_mean                13425 non-null  float64 
 6   concavity_se             13475 non-null  float64 
 7   texture_mean             13565 non-null  float64 
 8   concavity_worst          13540 non-null  float64 
 9   smoothness_se            13610 non-null  float64 
 10  concave points_se        13600 non-null  float64 
 11  area_worst               13530 non-null  float64 
 12  compactness_mean         13480 non-null  float64 
 13  radius_mean              13495 non-null  float64 
 14  area_s

## First describe
We can now produce a first statistical description of the data at this stage.

In [57]:
df.describe()

Unnamed: 0,index,perimeter_se,radius_worst,concave points_mean,smoothness_mean,area_mean,concavity_se,texture_mean,concavity_worst,smoothness_se,...,concavity_mean,id,symmetry_mean,symmetry_worst,fractal_dimension_se,perimeter_mean,compactness_worst,symmetry_se,compactness_se,radius_se
count,16180.0,13640.0,13630.0,13600.0,13450.0,13425.0,13475.0,13565.0,13540.0,13610.0,...,13415.0,14720.0,13485.0,13600.0,13555.0,13450.0,13725.0,13580.0,13570.0,13595.0
mean,328.05068,2536.409364,323.328936,2.424616,3.835104,656.435978,1.147903,19.416823,24.441271,0.007034,...,7.370354,30758980000000.0,16.305977,31.056838,0.010388,92.300212,25.303201,0.219976,0.145002,81.108863
std,189.47624,1738.237381,1680.786074,16.207594,19.928257,349.18159,17.733254,4.385227,106.06891,0.00313,...,34.793982,182461200000000.0,52.51652,92.196651,0.199383,24.473417,95.437704,2.066789,1.474177,292.899986
min,0.0,0.7714,7.93,0.0,0.05263,143.5,0.0,9.71,0.0,0.001713,...,0.0,-88888770000000.0,0.1167,0.1565,0.000895,43.79,0.02729,0.007882,0.002252,0.1115
25%,165.75,1482.0,13.29,0.02027,0.08641,420.5,0.01514,16.33,0.1211,0.005033,...,0.02995,865468.0,0.1634,0.2523,0.002217,75.49,0.1508,0.01502,0.0134,0.2351
50%,326.0,2143.0,15.3,0.0337,0.09592,556.7,0.02626,18.9,0.2571,0.006307,...,0.06387,907915.0,0.1813,0.2884,0.003053,86.91,0.2297,0.018525,0.02048,0.3438
75%,494.0,3168.0,19.92,0.07752,0.1061,782.7,0.04256,21.87,0.426725,0.008109,...,0.1457,8912944.0,0.2027,0.3313,0.004463,104.3,0.3856,0.02324,0.03247,0.5907
max,656.0,9807.0,9981.0,162.0,123.0,2501.0,396.0,39.28,1252.0,0.03113,...,313.0,999765400000000.0,304.0,544.0,6.0,188.5,1058.0,31.0,27.0,2873.0


In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16180 entries, 1 to 19709
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   index                    16180 non-null  int32   
 1   perimeter_se             13640 non-null  float64 
 2   radius_worst             13630 non-null  float64 
 3   concave points_mean      13600 non-null  float64 
 4   smoothness_mean          13450 non-null  float64 
 5   area_mean                13425 non-null  float64 
 6   concavity_se             13475 non-null  float64 
 7   texture_mean             13565 non-null  float64 
 8   concavity_worst          13540 non-null  float64 
 9   smoothness_se            13610 non-null  float64 
 10  concave points_se        13600 non-null  float64 
 11  area_worst               13530 non-null  float64 
 12  compactness_mean         13480 non-null  float64 
 13  radius_mean              13495 non-null  float64 
 14  area_s

## Saving current data into a parquet file
We save a first dataset containing 16180 entries, in which corrupt values have been replaced with NaN and the target column is clean, containing only correctly labeled rows.

In [59]:
df.to_parquet("../data/interim/BreastCancer.parquet", index = False)


In [60]:

df=pd.read_parquet("../data/interim/BreastCancer.parquet", engine='pyarrow')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16180 entries, 0 to 16179
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   index                    16180 non-null  int32  
 1   perimeter_se             13640 non-null  float64
 2   radius_worst             13630 non-null  float64
 3   concave points_mean      13600 non-null  float64
 4   smoothness_mean          13450 non-null  float64
 5   area_mean                13425 non-null  float64
 6   concavity_se             13475 non-null  float64
 7   texture_mean             13565 non-null  float64
 8   concavity_worst          13540 non-null  float64
 9   smoothness_se            13610 non-null  float64
 10  concave points_se        13600 non-null  float64
 11  area_worst               13530 non-null  float64
 12  compactness_mean         13480 non-null  float64
 13  radius_mean              13495 non-null  float64
 14  area_se               

## Dropping unnecessary columns
The next step is to drop unnecessary columns such as index and id, because we don't need them for our analysis.

In [61]:
df=df.drop('id',axis=1)
df=df.drop('index',axis=1)


## Deleting duplicate rows
We first delete all duplicated rows, i.e. rows that match on every feature; that is why `subset` is not specified in the `drop_duplicates` call. After doing so, we get a dataset with 3159 rows and 31 columns.

In [62]:
df=df.drop_duplicates(keep='first', ignore_index=False)
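
Two details of this call are easy to miss: rows whose NaNs sit in the same positions are treated as duplicates of each other, and with `ignore_index=False` the surviving rows keep their original index labels, so the index is no longer contiguous. A minimal sketch on made-up data (`toy` and `deduped` are assumed names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "radius_mean": [10.0, 10.0, np.nan, np.nan, 12.5],
    "diagnosis": [0, 0, 1, 1, 1],
})

# Full-row comparison; the first occurrence of each row is kept.
deduped = toy.drop_duplicates(keep="first", ignore_index=False)

# Original labels survive, leaving gaps in the index.
print(deduped.index.tolist())
```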


In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3159 entries, 0 to 3235
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   perimeter_se             2651 non-null   float64
 1   radius_worst             2649 non-null   float64
 2   concave points_mean      2643 non-null   float64
 3   smoothness_mean          2613 non-null   float64
 4   area_mean                2608 non-null   float64
 5   concavity_se             2618 non-null   float64
 6   texture_mean             2636 non-null   float64
 7   concavity_worst          2631 non-null   float64
 8   smoothness_se            2645 non-null   float64
 9   concave points_se        2643 non-null   float64
 10  area_worst               2629 non-null   float64
 11  compactness_mean         2619 non-null   float64
 12  radius_mean              2622 non-null   float64
 13  area_se                  2618 non-null   float64
 14  concave points_worst    

**We save a dataset without duplicates but still including NaNs, so that it can be used in other possible experiments (BreastCancerdeduplicated.parquet).**

In [64]:
df.to_parquet("../data/interim/BreastCancerdeduplicated.parquet", index = False)


## Dataset for the first experiment
For our first experiment we'll use a dataset with no NaNs. We delete every row that has a missing feature, which means we only use examples whose features are all valid and filled in.

In [65]:
df.dropna(inplace=True)
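
Dropping every row with at least one NaN can discard a large share of the data, so it helps to measure how much survives. The following is a sketch on a made-up toy frame; `complete` and `retained` are names assumed for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "radius_mean": [10.0, np.nan, 12.5, 14.0],
    "area_mean": [300.0, 410.0, np.nan, 600.0],
    "diagnosis": [0, 1, 1, 0],
})

# Keep only rows where every column is filled in.
complete = toy.dropna()

# Fraction of rows that survive the filter.
retained = len(complete) / len(toy)
print(len(complete), retained)
```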

In [66]:
df['diagnosis']=df['diagnosis'].astype('category') ## setting diagnosis column as category

In [67]:
df.info() ##Datatypes are now in their correct dtypes (float64 for numeric values and category for target)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 569 entries, 3 to 3234
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   perimeter_se             569 non-null    float64 
 1   radius_worst             569 non-null    float64 
 2   concave points_mean      569 non-null    float64 
 3   smoothness_mean          569 non-null    float64 
 4   area_mean                569 non-null    float64 
 5   concavity_se             569 non-null    float64 
 6   texture_mean             569 non-null    float64 
 7   concavity_worst          569 non-null    float64 
 8   smoothness_se            569 non-null    float64 
 9   concave points_se        569 non-null    float64 
 10  area_worst               569 non-null    float64 
 11  compactness_mean         569 non-null    float64 
 12  radius_mean              569 non-null    float64 
 13  area_se                  569 non-null    float64 
 14  concave p


**After cleaning the data we get a dataset with no NaNs and no corrupt values. The resulting dataset contains 569 examples (rows), 30 features and 1 target (diagnosis). We'll use this dataset (Breastclean1) for the first experiment.**

In [68]:
df.to_parquet("../data/interim/Breastclean1.parquet", index = False) ##saving dataset (569 examples 30 features and 1 target)


## Partial Results
**Initial raw data (dataset name: BreastCancerDS.csv)**
- 19710 rows, 35 columns, memory usage: 5.4+ MB


**After the cleaning process, without duplicates but including NaNs (dataset name: BreastCancerdeduplicated.parquet):**
- 3159 rows, 31 columns, memory usage: 789.8 KB



**Clean dataset after dropping NaNs and duplicates (dataset name: Breastclean1.parquet):**
- 569 rows, 31 columns, memory usage: 138.5 KB


