# Cleaning Practice
Let's practice handling missing values and duplicate rows using `cancer_data_means.csv`, the file you created and saved in the "Assessing and Building Intuition" notebook a few pages back. If you created it in that notebook, it should still be available in this workspace to load here.

In [1]:
# import pandas and load cancer data
import pandas as pd
df = pd.read_csv("cancer_data_means.csv")

# check which columns have missing values with info()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 12 columns):
id                        569 non-null int64
diagnosis                 569 non-null object
radius_mean               569 non-null float64
texture_mean              548 non-null float64
perimeter_mean            569 non-null float64
area_mean                 569 non-null float64
smoothness_mean           521 non-null float64
compactness_mean          569 non-null float64
concavity_mean            569 non-null float64
concave_points_mean       569 non-null float64
symmetry_mean             504 non-null float64
fractal_dimension_mean    569 non-null float64
dtypes: float64(10), int64(1), object(1)
memory usage: 53.4+ KB


In [4]:
df.isnull().sum()

id                         0
diagnosis                  0
radius_mean                0
texture_mean              21
perimeter_mean             0
area_mean                  0
smoothness_mean           48
compactness_mean           0
concavity_mean             0
concave_points_mean        0
symmetry_mean             65
fractal_dimension_mean     0
dtype: int64

In [12]:
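# use each column's mean to fill in its missing values
texture_mean = df["texture_mean"].mean()
smoothness_mean = df["smoothness_mean"].mean()
symmetry_mean = df["symmetry_mean"].mean()

# assign the filled column back instead of calling fillna(inplace=True)
# on a column selection, which newer pandas versions warn about
df["texture_mean"] = df["texture_mean"].fillna(texture_mean)
df["smoothness_mean"] = df["smoothness_mean"].fillna(smoothness_mean)
df["symmetry_mean"] = df["symmetry_mean"].fillna(symmetry_mean)

# confirm the correction by displaying the first rows
df.head(10)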

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave_points_mean | symmetry_mean | fractal_dimension_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.990 | 19.293431 | 122.80 | 1001.0 | 0.118400 | 0.27760 | 0.300100 | 0.147100 | 0.241900 | 0.07871 |
| 1 | 842517 | M | 20.570 | 17.770000 | 132.90 | 1326.0 | 0.084740 | 0.07864 | 0.086900 | 0.070170 | 0.181200 | 0.05667 |
| 2 | 84300903 | M | 19.690 | 21.250000 | 130.00 | 1203.0 | 0.109600 | 0.15990 | 0.197400 | 0.127900 | 0.206900 | 0.05999 |
| 3 | 84348301 | M | 11.420 | 20.380000 | 77.58 | 386.1 | 0.096087 | 0.28390 | 0.241400 | 0.105200 | 0.259700 | 0.09744 |
| 4 | 84358402 | M | 20.290 | 14.340000 | 135.10 | 1297.0 | 0.100300 | 0.13280 | 0.198000 | 0.104300 | 0.180900 | 0.05883 |
| 5 | 843786 | M | 12.450 | 15.700000 | 82.57 | 477.1 | 0.127800 | 0.17000 | 0.157800 | 0.080890 | 0.208700 | 0.07613 |
| 6 | 844359 | M | 18.250 | 19.980000 | 119.60 | 1040.0 | 0.094630 | 0.10900 | 0.112700 | 0.074000 | 0.181091 | 0.05742 |
| 7 | 84458202 | M | 13.710 | 20.830000 | 90.20 | 577.9 | 0.118900 | 0.16450 | 0.093660 | 0.059850 | 0.219600 | 0.07451 |
| 8 | 844981 | M | 13.000 | 21.820000 | 87.50 | 519.8 | 0.127300 | 0.19320 | 0.185900 | 0.093530 | 0.235000 | 0.07389 |
| 9 | 84501001 | M | 12.460 | 24.040000 | 83.97 | 475.9 | 0.118600 | 0.23960 | 0.227300 | 0.085430 | 0.203000 | 0.08243 |


In [13]:
df.isnull().sum()

id                        0
diagnosis                 0
radius_mean               0
texture_mean              0
perimeter_mean            0
area_mean                 0
smoothness_mean           0
compactness_mean          0
concavity_mean            0
concave_points_mean       0
symmetry_mean             0
fractal_dimension_mean    0
dtype: int64
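
The column-by-column fills above can also be written as a single step. The following is a minimal sketch on made-up toy data; the `toy` frame, its values, and the variable names are illustrative assumptions, not taken from the cancer dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cancer data; the values are invented for illustration.
toy = pd.DataFrame({
    "texture_mean": [10.0, np.nan, 14.0],
    "smoothness_mean": [0.10, 0.20, np.nan],
    "area_mean": [500.0, 600.0, 700.0],
})

# Columns that contain at least one missing value
cols_with_nulls = toy.columns[toy.isnull().any()]

# Fill each of those columns with its own mean.
# mean() skips NaN by default, and fillna() with a Series matches on column label.
toy[cols_with_nulls] = toy[cols_with_nulls].fillna(toy[cols_with_nulls].mean())

# Total number of missing values left in the frame
remaining_nulls = int(toy.isnull().sum().sum())
```

This approach avoids naming each column by hand, which may be convenient when many columns have gaps. Whether the mean is an appropriate fill value still depends on the data.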

In [15]:
# check for duplicates in the data
df.duplicated().sum()

5

In [18]:
# drop duplicates
df.drop_duplicates(inplace=True)

In [19]:
# confirm correction by rechecking for duplicates in the data
df.duplicated().sum()

0
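
To see what `duplicated()` and `drop_duplicates()` count and remove, here is a small sketch on a made-up frame (the `toy` data is an assumption for illustration only). By default both methods compare entire rows; the `subset` parameter restricts the comparison to chosen columns.

```python
import pandas as pd

# Toy frame: row 2 repeats row 1 exactly; row 4 repeats only the id of row 3.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3, 3],
    "diagnosis": ["M", "B", "B", "M", "B"],
})

# Rows that are identical to an earlier row in every column
n_full_dupes = int(toy.duplicated().sum())

# Rows whose id alone matches an earlier row
n_id_dupes = int(toy.duplicated(subset=["id"]).sum())

# Drop exact duplicates, keeping the first occurrence, and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
```

Here the full-row check flags one row while the `id`-only check flags two, so the choice of `subset` changes what counts as a duplicate.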

## Renaming Columns
Because we previously reduced our dataset to include only the means of tumor features, the "_mean" suffix on each feature name is redundant and just takes extra time to type during analysis. Let's build a list of new labels to assign to our columns.

In [20]:
# remove "_mean" from column names
new_labels = []
for col in df.columns:
    if col.endswith('_mean'):
        new_labels.append(col[:-5])  # drop the 5-character "_mean" suffix
    else:
        new_labels.append(col)

# new labels for our columns
new_labels

['id',
 'diagnosis',
 'radius',
 'texture',
 'perimeter',
 'area',
 'smoothness',
 'compactness',
 'concavity',
 'concave_points',
 'symmetry',
 'fractal_dimension']

In [21]:
# assign new labels to columns in dataframe
df.columns = new_labels

# display first few rows of dataframe to confirm changes
df.head()

| | id | diagnosis | radius | texture | perimeter | area | smoothness | compactness | concavity | concave_points | symmetry | fractal_dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 19.293431 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.096087 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 |
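
The same renaming can be done without an explicit loop. This is a sketch of one alternative, using the string methods on the column index; the `toy` frame is a hypothetical stand-in with a few of the real column names.

```python
import pandas as pd

# Empty toy frame with a few of the column names used in this notebook
toy = pd.DataFrame(columns=["id", "diagnosis", "radius_mean", "concave_points_mean"])

# Remove "_mean" only where it appears at the very end of a name ($ anchors the match)
toy.columns = toy.columns.str.replace(r"_mean$", "", regex=True)

new_cols = list(toy.columns)
```

Anchoring the pattern to the end of the name leaves columns such as `id` and `diagnosis` unchanged, and would not touch a "_mean" that happened to appear in the middle of a name.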


In [23]:
# save this for later
df.to_csv('cancer_data_edited.csv', index=False)
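
The `index=False` argument matters when the file is loaded again later. The sketch below, on a made-up two-row frame written to a temporary directory (file names are illustrative), compares the columns read back with and without it.

```python
import os
import tempfile

import pandas as pd

# Toy frame; values are for illustration only
toy = pd.DataFrame({"id": [1, 2], "radius": [17.99, 20.57]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    toy.to_csv(with_index)                  # also writes the row index as a column
    toy.to_csv(without_index, index=False)  # writes only the data columns

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)
```

When the index is written, `read_csv` brings it back as an extra column named `Unnamed: 0`; saving with `index=False` keeps the reloaded columns identical to the original ones.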