#**Basic Data Cleaning**

In this tutorial, you will learn:

* How to identify and remove column variables that only have a single value.
* How to identify and consider column variables with very few unique values.
* How to identify and remove rows that contain duplicate observations.

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

#Messy Dataset


The Breast Cancer dataset classifies each breast cancer patient
as having either a recurrence or no recurrence of cancer.

```
Number of Instances: 286
Number of Attributes: 9 + the class attribute
Attribute Information:
   1. Class: no-recurrence-events, recurrence-events
   2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
   3. menopause: lt40, ge40, premeno.
   4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
   5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
   6. node-caps: yes, no.
   7. deg-malig: 1, 2, 3.
   8. breast: left, right.
   9. breast-quad: left-up, left-low, right-up, right-low, central.
  10. irradiat: yes, no.
Missing Attribute Values: (denoted by "?")
   Attribute #:  Number of instances with missing values:
   6.             8
   9.             1.
Class Distribution:
    1. no-recurrence-events: 201 instances
    2. recurrence-events: 85 instances 
```
You can learn more about the dataset here:
* Breast Cancer Dataset ([breast-cancer.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv))
* Breast Cancer Dataset Description ([breast-cancer.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.names))


The messy dataset used in this tutorial is a modified version of the Breast Cancer Dataset.


###Download messy data file

In [None]:
!wget "https://raw.githubusercontent.com/udel-cbcb/al_ml_workshop/main/data/messy_data.csv" -O messy_data.csv
!head messy_data.csv

--2022-06-06 21:16:26--  https://raw.githubusercontent.com/udel-cbcb/al_ml_workshop/main/data/messy_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25496 (25K) [text/plain]
Saving to: ‘messy_data.csv’


2022-06-06 21:16:26 (9.80 MB/s) - ‘messy_data.csv’ saved [25496/25496]

'40-49','premeno','15-19','0-2','yes',4','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no',4','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no',4','2','left','left_low','no','recurrence-events'
'40-49','premeno','35-39','0-2','yes',4','3','right','left_low','yes','no-recurrence-events'
'40-49','premeno','30-34','3-5','yes',4','2','left','right_up','no','recurrence-events'
'50-59','premeno','25-29','3-5','no',4','2',

#Identify Columns That Contain a Single Value


In [None]:
# summarize the number of unique values for each column using pandas
from pandas import read_csv
# load the dataset
df = read_csv('messy_data.csv', header=None)
# summarize the number of unique values in each column using nunique()
print("Shape of messy data: ", df.shape)
print("Column\t#Unique values ")
print(df.nunique())

Shape of messy data:  (289, 11)
Column	#Unique values 
0      6
1      3
2     11
3      7
4      2
5      1
6      3
7      2
8      5
9      2
10     2
dtype: int64


The output shows that column index 5 has only one unique value. A column that never varies carries no information for a model, so it should be removed.
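
One detail worth knowing about `nunique()` is that it ignores missing values (`NaN`) by default. The sketch below uses a small made-up frame (not `messy_data.csv`) to show the difference between the default and `dropna=False`. In the messy dataset missing values are written as "?", so pandas treats them as an ordinary value rather than as `NaN`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, used only for illustration
toy = pd.DataFrame({
    0: ["a", "b", "a", "c"],
    1: ["x", "x", "x", "x"],      # constant column
    2: ["y", "y", np.nan, "y"],   # constant apart from one missing value
})

# Default behaviour: NaN is not counted as a value
unique_default = toy.nunique()
# With dropna=False, NaN is counted as its own value
unique_with_nan = toy.nunique(dropna=False)

print(unique_default.tolist())
print(unique_with_nan.tolist())
```

With the default setting, column 2 would be reported as a single-value column even though one entry is missing, so it can be worth checking both counts before deleting anything.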

#Delete columns that contain a single value

In [None]:
# delete columns with a single unique value
from pandas import read_csv
# load the dataset
df = read_csv('messy_data.csv', header=None)
print(df.shape)
# get number of unique values for each column
counts = df.nunique()
# record columns to delete
to_del = [i for i,v in enumerate(counts) if v == 1]
print(to_del)
# drop useless columns
df.drop(to_del, axis=1, inplace=True)
print(df.shape)

(289, 11)
[5]
(289, 10)
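
The list comprehension above uses `enumerate`, which yields column positions. That works here because the file is loaded with `header=None`, so the column labels are the integers 0 to 10. A possible way to make the same step independent of how the columns are named is to select the labels from the `nunique()` result directly. The helper name `drop_constant_columns` and the toy data below are assumptions made for illustration.

```python
import pandas as pd

def drop_constant_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame without columns that hold a single unique value."""
    counts = frame.nunique()
    # index of the boolean-filtered Series holds the column labels, not positions
    constant = counts[counts == 1].index
    return frame.drop(columns=constant)

# Toy frame with named columns (hypothetical values)
toy = pd.DataFrame({
    "age": ["40-49", "50-59", "50-59"],
    "flag": ["4'", "4'", "4'"],          # same value in every row
    "side": ["left", "right", "left"],
})

cleaned = drop_constant_columns(toy)
print(cleaned.columns.tolist())
```

Returning a new frame instead of using `inplace=True` leaves the original data available for comparison.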


#Identify columns that have very few values

In [None]:
from pandas import read_csv
# load the dataset
df = read_csv('messy_data.csv', header=None)
# summarize the number of unique values in each column
print("Column, Count, <1%")
for i, v in enumerate(df.nunique()):
  # number of unique values as a percentage of the number of rows
  percentage = float(v) / df.shape[0] * 100
  if percentage < 1:
    print('%d, %d, %.1f%%' % (i, v, percentage))

Column, Count, <1%
4, 2, 0.7%
5, 1, 0.3%
7, 2, 0.7%
9, 2, 0.7%
10, 2, 0.7%
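
A low percentage of unique values is only a hint. In a categorical dataset such as this one, a yes/no column will always have two unique values, so it helps to look at the actual values before deciding. The sketch below lists the flagged columns of a small made-up frame and prints their value counts. The column names and the 50% cutoff are assumptions chosen to suit the tiny example.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "node_caps": ["yes", "no", "no", "no", "yes", "no"],
    "label": ["rec", "no-rec", "no-rec", "rec", "no-rec", "no-rec"],
    "size": ["0-4", "5-9", "10-14", "15-19", "20-24", "25-29"],
})

threshold_pct = 50.0  # hypothetical cutoff; the tutorial uses 1% on 289 rows

# unique values per column as a percentage of the number of rows
pct_unique = toy.nunique() / len(toy) * 100
flagged = pct_unique[pct_unique < threshold_pct].index.tolist()

# inspect what each flagged column actually contains
for col in flagged:
    print(toy[col].value_counts())
```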


#Drop columns with unique values less than 1 percent of rows

In [None]:
# delete columns where number of unique values is less than 1% of the rows
from pandas import read_csv
# load the dataset
df = read_csv('messy_data.csv', header=None)
print(df.shape)
# get number of unique values for each column
counts = df.nunique()
# record columns to delete
to_del = [i for i,v in enumerate(counts) if (float(v)/df.shape[0] * 100) < 1]
print("Columns to delete: ", to_del)
# drop useless columns
df.drop(to_del, axis=1, inplace=True)
print(df.shape)

(289, 11)
Columns to delete:  [4, 5, 7, 9, 10]
(289, 6)
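
Note that the deleted columns include index 10, which in the `head` output above holds the class label (recurrence-events / no-recurrence-events). Dropping by the 1% rule alone would therefore remove the target. One possible approach, sketched below with a hypothetical helper `drop_low_cardinality` and a `protect` argument, is to exclude chosen columns from deletion. The toy data is made up.

```python
import pandas as pd

def drop_low_cardinality(frame, threshold_pct=1.0, protect=()):
    """Drop columns whose unique-value percentage is below threshold_pct,
    except for any column labels listed in protect."""
    pct_unique = frame.nunique() / len(frame) * 100
    low = pct_unique[pct_unique < threshold_pct].index
    to_drop = [c for c in low if c not in protect]
    return frame.drop(columns=to_drop)

# Toy frame with 300 rows so that two unique values fall below 1%
n = 300
toy = pd.DataFrame({
    0: [str(i) for i in range(n)],                      # all values distinct
    1: ["yes" if i % 2 else "no" for i in range(n)],    # binary feature
    2: ["recurrence" if i % 3 == 0 else "no-recurrence" for i in range(n)],  # label
})

without_protection = drop_low_cardinality(toy)
with_protection = drop_low_cardinality(toy, protect=(2,))
print(without_protection.columns.tolist())
print(with_protection.columns.tolist())
```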


#Identify rows that contain duplicate data

In [None]:
# locate rows of duplicate data
from pandas import read_csv
# load the dataset
df = read_csv('messy_data.csv', header=None)
# calculate duplicates
dups = df.duplicated()
# report if there are any duplicates
print("Any duplicates? ", dups.any())
# list all duplicate rows
print("Duplicated rows:")
print(df[dups])

Any duplicates?  True
Duplicated rows:
          0          1        2      3      4   5    6        7           8   \
17   '60-69'     'ge40'  '15-19'  '0-2'   'no'  4'  '2'  'right'   'left_up'   
27   '40-49'  'premeno'  '10-14'  '0-2'   'no'  4'  '1'  'right'   'left_up'   
44   '30-39'  'premeno'  '15-19'  '0-2'   'no'  4'  '1'   'left'  'left_low'   
65   '50-59'     'ge40'  '15-19'  '0-2'   'no'  4'  '1'  'right'   'central'   
117  '60-69'     'ge40'  '10-14'  '0-2'   'no'  4'  '1'   'left'   'left_up'   
178  '40-49'  'premeno'  '25-29'  '0-2'   'no'  4'  '2'  'right'  'left_low'   
190  '50-59'  'premeno'  '25-29'  '0-2'   'no'  4'  '2'   'left'  'right_up'   
214  '40-49'  'premeno'  '20-24'  '0-2'   'no'  4'  '2'  'right'   'left_up'   
217  '50-59'  'premeno'  '25-29'  '0-2'   'no'  4'  '2'   'left'  'left_low'   
221  '50-59'     'ge40'  '20-24'  '0-2'   'no'  4'  '3'   'left'   'left_up'   
237  '30-39'     'lt40'  '15-19'  '0-2'   'no'  4'  '3'  'right'   'left_up'   
2
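
By default `duplicated()` marks only the second and later copies of a row, so the first occurrence is not listed. The sketch below, on a small made-up frame, counts the marked rows and uses `keep=False` to mark every copy, which can make it easier to see which rows match each other.

```python
import pandas as pd

# Toy frame with hypothetical values; rows 0, 2 and 4 are identical
toy = pd.DataFrame([
    ["40-49", "premeno", "left"],
    ["50-59", "ge40", "right"],
    ["40-49", "premeno", "left"],
    ["60-69", "ge40", "left"],
    ["40-49", "premeno", "left"],
])

later_copies = toy.duplicated()           # default keep='first'
all_copies = toy.duplicated(keep=False)   # mark every member of a duplicate group

n_later = int(later_copies.sum())
n_all = int(all_copies.sum())
print("Rows that would be dropped:", n_later)
print("Rows involved in duplication:", n_all)
print(toy[all_copies])
```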

#Delete rows that contain duplicate data

In [None]:
# delete rows of duplicate data from the dataset
from pandas import read_csv
# load the dataset
df = read_csv('messy_data.csv', header=None)
print(df.shape)
# delete duplicate rows
df.drop_duplicates(inplace=True)
print(df.shape)

(289, 11)
(272, 11)
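
After `drop_duplicates()` the remaining rows keep their original index labels, so the index has gaps. If later steps rely on a contiguous index, it can be reset, as in this sketch on a small made-up frame.

```python
import pandas as pd

# Toy frame with hypothetical values; rows 0, 2 and 4 are identical
toy = pd.DataFrame([
    ["40-49", "premeno", "left"],
    ["50-59", "ge40", "right"],
    ["40-49", "premeno", "left"],
    ["60-69", "ge40", "left"],
    ["40-49", "premeno", "left"],
])

# returns a new frame; the first occurrence of each row is kept
deduped = toy.drop_duplicates()
# renumber rows from 0 and discard the old index
tidy = deduped.reset_index(drop=True)

print(deduped.index.tolist())
print(tidy.index.tolist())
```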
