# Cleaning Practice
Let's first practice handling missing values and duplicate data using the `cancer_data_means.csv` file.

In [6]:
import pandas as pd
cancer_data = pd.read_csv("cancer_data_means.csv")
print(cancer_data)

           id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0      842302         M        17.99           NaN          122.80     1001.0   
1      842517         M        20.57         17.77          132.90     1326.0   
2    84300903         M        19.69         21.25          130.00     1203.0   
3    84348301         M        11.42         20.38           77.58      386.1   
4    84358402         M        20.29         14.34          135.10     1297.0   
..        ...       ...          ...           ...             ...        ...   
564    926424         M        21.56         22.39          142.00     1479.0   
565    926682         M        20.13         28.25          131.20     1261.0   
566    926954         M        16.60         28.08          108.30      858.1   
567    927241         M        20.60         29.33          140.10     1265.0   
568     92751         B         7.76         24.54           47.92      181.0   

     smoothness_mean  compa

In [8]:
missing_values = cancer_data.isnull().sum()

print("Columns with missing values:")
print(missing_values[missing_values > 0])




Columns with missing values:
texture_mean       21
smoothness_mean    48
symmetry_mean      65
dtype: int64


In [9]:
cancer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      569 non-null    int64  
 1   diagnosis               569 non-null    object 
 2   radius_mean             569 non-null    float64
 3   texture_mean            548 non-null    float64
 4   perimeter_mean          569 non-null    float64
 5   area_mean               569 non-null    float64
 6   smoothness_mean         521 non-null    float64
 7   compactness_mean        569 non-null    float64
 8   concavity_mean          569 non-null    float64
 9   concave_points_mean     569 non-null    float64
 10  symmetry_mean           504 non-null    float64
 11  fractal_dimension_mean  569 non-null    float64
dtypes: float64(10), int64(1), object(1)
memory usage: 53.5+ KB
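The cells above only count the missing values; they do not fill or drop them. One common way to handle gaps in numeric columns is to fill each gap with that column's mean. The block below is a minimal sketch of that idea on a small made-up DataFrame (`toy` and its values are hypothetical stand-ins, not rows from `cancer_data_means.csv`), and mean imputation is an assumption here rather than the method this notebook prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for cancer_data; the values are made up.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "texture_mean": [10.0, np.nan, 14.0, 12.0],
    "smoothness_mean": [0.10, 0.12, np.nan, 0.14],
})

# Columns that contain gaps in this toy example.
numeric_cols = ["texture_mean", "smoothness_mean"]

# Work on a copy so the original frame is left untouched.
toy_filled = toy.copy()

# fillna() accepts a Series indexed by column name, so each column
# is filled with its own mean (NaN values are skipped when averaging).
toy_filled[numeric_cols] = toy_filled[numeric_cols].fillna(toy_filled[numeric_cols].mean())

# Every count should now be zero.
print(toy_filled.isnull().sum())
```

Whether to fill with the mean, fill with another statistic, or drop the affected rows depends on the analysis. The same pattern would apply to `texture_mean`, `smoothness_mean`, and `symmetry_mean` in the real data.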


In [10]:
num_of_duplicates = cancer_data.duplicated().sum()
print("Number of duplicates:", num_of_duplicates)



Number of duplicates: 5
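By default, `duplicated()` flags only the second and later copies of a row, so the count above is the number of rows that `drop_duplicates()` would remove, not the number of rows involved in duplication. The sketch below illustrates the difference on a small made-up DataFrame (`toy` is hypothetical), using `keep=False` to view every copy.

```python
import pandas as pd

# Hypothetical toy data: id 2 appears twice, id 3 appears three times.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3, 3, 3],
    "radius_mean": [1.0, 2.0, 2.0, 3.0, 3.0, 3.0],
})

# Default keep="first": only the repeats after the first occurrence are flagged.
repeats = toy.duplicated().sum()

# keep=False flags every row that has at least one identical twin.
all_copies = toy[toy.duplicated(keep=False)]

print("Rows that would be dropped:", repeats)
print("Rows involved in duplication:", len(all_copies))
```

Viewing the rows with `keep=False` before dropping them can help confirm that they really are exact repeats.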


In [11]:
cancer_data_cleaned = cancer_data.drop_duplicates()
print("Shape of DataFrame after dropping duplicates:", cancer_data_cleaned.shape)



Shape of DataFrame after dropping duplicates: (564, 12)


In [None]:
# confirm correction by rechecking for duplicates in the data
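
One way to fill in this cell is to run the same `duplicated().sum()` check on the cleaned DataFrame and confirm that it returns zero. The block below is a self-contained sketch on made-up data (`toy` is hypothetical). In the notebook itself, the equivalent check would be run on `cancer_data_cleaned`.

```python
import pandas as pd

# Hypothetical toy data with one exact repeated row.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "radius_mean": [17.99, 20.57, 20.57, 19.69],
})

toy_cleaned = toy.drop_duplicates()

# After drop_duplicates(), no fully duplicated rows should remain.
remaining = toy_cleaned.duplicated().sum()
print("Number of duplicates after cleaning:", remaining)
```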


## Renaming Columns
Because the dataset now includes only the means of the tumor features, the "_mean" suffix on each feature name is redundant and only adds typing in later analysis. Rename the columns of the DataFrame to remove "_mean".

In [12]:
# Strip "_mean" from every column name (str.replace removes it wherever it occurs in the name)
cancer_data_cleaned.columns = [col.replace('_mean', '') for col in cancer_data_cleaned.columns]
print(cancer_data_cleaned.head())



         id diagnosis  radius  texture  perimeter    area  smoothness  \
0    842302         M   17.99      NaN     122.80  1001.0     0.11840   
1    842517         M   20.57    17.77     132.90  1326.0     0.08474   
2  84300903         M   19.69    21.25     130.00  1203.0     0.10960   
3  84348301         M   11.42    20.38      77.58   386.1         NaN   
4  84358402         M   20.29    14.34     135.10  1297.0     0.10030   

   compactness  concavity  concave_points  symmetry  fractal_dimension  
0      0.27760     0.3001         0.14710    0.2419            0.07871  
1      0.07864     0.0869         0.07017    0.1812            0.05667  
2      0.15990     0.1974         0.12790    0.2069            0.05999  
3      0.28390     0.2414         0.10520    0.2597            0.09744  
4      0.13280     0.1980         0.10430    0.1809            0.05883  
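
The list comprehension above removes "_mean" wherever it appears in a column name, which is fine here because it only ever appears at the end. As a hedged alternative, the sketch below removes the text only when it is a true suffix, using a regular expression anchored with `$`. The column names in `toy`, including `radius_mean_error`, are hypothetical and chosen only to show the difference.

```python
import pandas as pd

# Hypothetical column names; "radius_mean_error" contains "_mean" mid-name.
toy = pd.DataFrame(columns=["id", "radius_mean", "radius_mean_error"])

renamed = toy.copy()

# The "$" anchor restricts the match to the end of each column name.
renamed.columns = renamed.columns.str.replace(r"_mean$", "", regex=True)

print(list(renamed.columns))
```

With plain `col.replace('_mean', '')`, the third column would become `radius_error`. The anchored pattern leaves it unchanged.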


In [14]:
cancer_data_cleaned.to_csv("cancer_data_edited.csv", index=False)
