## Data Cleaning

A common task in data science is cleaning raw data to make it usable for further analysis.
The following are the kinds of problems that cleaning will correct or remove:

- incorrectly formatted data
- duplicate fields or lines
- inconsistent, non-standard, or mislabeled categories or classes
- corrupted, invalid, inaccurate, irrelevant, outlying, or incomplete values
- typos, misspellings, or syntax errors
- missing codes, data, or fields


### Removing duplicate lines in a file

A common task in data cleaning is removing duplicate lines from a file.
In this example we have a sample data file with some duplicate lines, and we want to keep just one copy of each line.
The method used here only compares neighboring lines, so if identical lines are not already next to each other, the file needs to be [*sorted*](#algorithms_sorting) first.
The idea of the program is to look at the *previous line* to decide whether the current line is a duplicate:
- case 1: the previous line is the same, then the current line is a duplicate
- case 2: the previous line is not the same, then the current line is new and should be kept (printed back out)

We then see there are two more cases:
- case 3: the first line in the file. There is no previous line, so the current line is new and should be kept.
- case 4: we run out of lines in the file. There is no current line, so there is nothing we have to do.

We can see in this case that:
- the first line, line 0 (we always count from 0), is A, this is case 3 so we print it
- the second line, line 1, is B, this is case 2 so we print it
- the third line, line 2, is B, this is case 1 and we ignore it
- the fourth line, line 3, is C, this is case 2 so we print it
- the fifth line, line 4, is D, this is case 2 so we print it
- the sixth line, line 5, is D, this is case 1 so we ignore it

![duplicate1.jpg](duplicate1.jpg)

We can now write the program. `last` is the last line we saw. `first` tells us whether we are at the first line of the file, which is case 3. We read all the lines of the file. When we run out of lines (case 4), there is nothing left to do except close the file.

Let's say the file `duplicates.txt` is this:
```
A
B
B
C
D
D
```
This is the program. `strip` is used because each line read from the file has a newline, `"\n"`, at the end, which we can ignore.
The line beginning with `#` is a *comment*.
It is for notes we make about what the program is doing, and it is ignored when the program runs.
The first time we read a line, `first` is `True`, so we have case 3 and we print the line.
For every line after the first, `first` is `False`, so the `elif` test compares the current line with the last one.
If the test is true, the line is different from the last and we have case 2, and we print the line.
If the test is false, the line is the same as the last and we have case 1, and we ignore the line.

In [None]:
first = True
f = open("duplicates.txt", "r")
for line in f:
    current = line.strip()
    if first:
        first = False
        print(current)
    elif current != last:
        print(current)
    # otherwise, if current == last it is a duplicate and we ignore it
    last = current
f.close()
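
The program above only catches duplicates that sit next to each other, which is why sorting can be needed first. As a minimal sketch of that idea, the version below works on a list of lines already in memory instead of a file. The names `unique_sorted_lines` and `sample` are made up for this illustration, and it uses `None` in place of the `first` flag, which is a small variation on the program above and assumes no line is itself `None`.

```python
def unique_sorted_lines(lines):
    """Return one copy of each line, in sorted order."""
    result = []
    last = None  # None means "no previous line yet" (case 3)
    for current in sorted(lines):  # sorting puts equal lines next to each other
        if last is None or current != last:
            result.append(current)  # case 2 or case 3: keep the line
        # otherwise current == last, a duplicate (case 1), so it is skipped
        last = current
    return result


sample = ["D", "B", "A", "D", "C", "B"]  # hypothetical unsorted data
print(unique_sorted_lines(sample))  # ['A', 'B', 'C', 'D']
```

Without the call to `sorted`, the two `D` lines and the two `B` lines in `sample` would never be neighbors, and both copies of each would be kept.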

### Another way to remove duplicate lines

If the goal is only to remove duplicates and the order of the lines does not matter, there is another way to do it.
A dictionary has only unique keys.
The first time a value is assigned to a key, a new entry is created.
The second time a value is assigned to the same key, no new entry is created, but the previous value is replaced.
By using each line that is read as a key, the dictionary keeps only one copy of it.
We can read the file, assign a dictionary entry with the line as the key and any value, and print the keys at the end.
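
The behavior of repeated keys can be checked with a few assignments. This is a small sketch with a made-up dictionary name, `seen`, and arbitrary values.

```python
seen = {}
seen["B"] = 1  # first assignment to "B": a new entry is created
seen["B"] = 2  # same key again: the value is replaced, no new entry
seen["C"] = 1  # a different key: a second entry is created

print(len(seen))   # 2
print(list(seen))  # ['B', 'C']
print(seen["B"])   # 2
```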

Let's say the file `unordered-duplicates.txt` is this:
```
D
D
A
C
B
B
```
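In this case the unique lines come out in the order in which they first appear in the file. Python 3.7 and later guarantee that a dictionary keeps its keys in insertion order, but earlier versions do not. The figure below shows this.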

<img src="duplicate2.jpg" width="600">

In [None]:
f = open("unordered-duplicates.txt", "r")
unique_lines = {}
for line in f:
    current = line.strip()
    unique_lines[current] = 1
f.close()
print('unique lines:')
for line in unique_lines.keys():
    print(line)
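
The same idea can be written more compactly with built-in tools. The sketch below is one possible shortcut, not a replacement for the program above: `dict.fromkeys` builds a dictionary whose keys are the lines, and a `set` keeps unique items with no promise about order. The function names `unique_in_order` and `unique_any_order` are made up for this illustration, and the input is a list of lines with the newlines already stripped.

```python
def unique_in_order(lines):
    """Keep the first copy of each line, in order of first appearance."""
    # dict.fromkeys uses each line as a key, so repeats collapse to one entry
    return list(dict.fromkeys(lines))


def unique_any_order(lines):
    """Keep one copy of each line, with no guarantee about the order."""
    return set(lines)


lines = ["D", "D", "A", "C", "B", "B"]
print(unique_in_order(lines))  # ['D', 'A', 'C', 'B']
print(unique_any_order(lines) == {"A", "B", "C", "D"})  # True
```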

### Fixing inconsistent codes

Another data cleaning task is replacing inconsistent spellings with one recognizable value.
"Not applicable" may also be written as "NA" or "N/A".
"Drive" may also be written as "Dr." or "Dr".
A dictionary can hold each inconsistent spelling as a key and the standard spelling as its value.

Let's say the file `misspellings.txt` is this:
```
Smith,John,N/A,230 Overland Dr.
Jones,Michael,NA,34 Blue Ridge Drive
Lund,Mary,Not applicable,Main St
```

In [None]:
misspellings = {"N/A": "NA", "Not applicable": "NA", "Drive": "Dr", "Dr.": "Dr"}
f = open("misspellings.txt", "r")
for line in f:
    current = line.strip()
    # replace each inconsistent spelling with its standard form
    for term, replacement in misspellings.items():
        current = current.replace(term, replacement)
    print(current)
f.close()
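
One thing to watch for is that `replace` also changes a term when it appears inside a longer word, so "Driveway" would become "Drway". The sketch below is one possible way to guard against that, assuming a term should only be replaced when it is not touching another letter or digit. The function name `normalize_terms` and the second sample line are made up for this illustration.

```python
import re


def normalize_terms(text, replacements):
    """Replace each key of `replacements` only where it stands alone."""
    # Try longer terms first so a long spelling is not cut short by a shorter one.
    terms = sorted(replacements, key=len, reverse=True)
    # (?<!\w) and (?!\w) require that no letter or digit touches the term.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in terms) + r")(?!\w)"
    )
    # Each match is looked up in the dictionary and swapped in a single pass.
    return pattern.sub(lambda m: replacements[m.group(0)], text)


misspellings = {"N/A": "NA", "Not applicable": "NA", "Drive": "Dr", "Dr.": "Dr"}
print(normalize_terms("Smith,John,N/A,230 Overland Dr.", misspellings))
# Smith,John,NA,230 Overland Dr
print(normalize_terms("Lee,Ann,NA,12 Driveway Court", misspellings))
# Lee,Ann,NA,12 Driveway Court
```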