## Approaches

If the unique IDs fit in memory, the problem is easy to solve: read the input line by line, store the IDs in an in-memory `set`, and write the set to a file after all the input has been processed.
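
As a point of reference, the in-memory case could look like the minimal sketch below. The function name `unique_ids_in_memory` and its parameters are illustrative and not part of the original programs; it assumes one ID per line and that all unique IDs fit in RAM.

```python
def unique_ids_in_memory(input_path, output_path):
    """Collect the unique non-blank lines of input_path and write them to output_path."""
    seen = set()
    with open(input_path) as src:
        for line in src:  # iterate lazily, one line at a time
            line = line.strip()
            if line:
                seen.add(line)  # a set silently ignores duplicates

    with open(output_path, 'w') as dst:
        for item in seen:  # note: set iteration order is arbitrary
            dst.write(item + '\n')
    return len(seen)
```

Only the set of unique IDs is held in memory, not the whole input, so this works whenever the output (rather than the input) fits in RAM.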

However, if memory is limited, the problem can still be solved with one of the following approaches:  
- Linux Tools
- Writing a Program
- Database

### Approach I: Linux's **sort** and **uniq**

`sort` performs an [external sort](https://en.wikipedia.org/wiki/External_sorting) when the file is too large to fit in memory, and `uniq` then drops the adjacent duplicate lines, leaving only the unique values.

Here is the bash command.
```bash
$ sort input_id.txt | uniq > uniq_id.txt
```
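
To illustrate the idea behind an external sort followed by `uniq`, here is a minimal Python sketch. It is not how GNU `sort` is implemented; the function name `external_sort_unique` and the `chunk_lines` parameter are hypothetical. It assumes one ID per line and that the number of chunk files stays below the operating system's open-file limit.

```python
import heapq
import itertools
import os
import tempfile

def external_sort_unique(input_path, output_path, chunk_lines=1_000_000):
    """Sort fixed-size chunks into temp files, then merge them while dropping duplicates."""
    chunk_paths = []
    with open(input_path) as src:
        while True:
            lines = list(itertools.islice(src, chunk_lines))
            if not lines:  # end of input
                break
            # Deduplicate and sort one chunk in memory; every stored line ends with '\n'.
            chunk = sorted({l.strip() + '\n' for l in lines if l.strip()})
            fd, path = tempfile.mkstemp(suffix='.chunk')
            with os.fdopen(fd, 'w') as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)

    files = [open(p) for p in chunk_paths]
    count = 0
    try:
        with open(output_path, 'w') as dst:
            previous = None
            # heapq.merge streams the sorted chunks, so equal lines arrive adjacently.
            for line in heapq.merge(*files):
                if line != previous:
                    dst.write(line)
                    previous = line
                    count += 1
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
    return count
```

Memory use is bounded by one chunk during the first phase and by one buffered line per chunk file during the merge.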

### Approach II: Divide and Conquer (Python Program)

**Algorithm**  
1. Scan the whole input file and collect the unique IDs that start with a specific character (such as 'a').  
2. Append these unique IDs to the output file, or write them to a separate file.
3. Repeat from step 1 with another leading character (such as 'b').
4. Stop when all the leading characters a-zA-Z0-9 have been processed.


**Example**  

```
xNDGN3R
8guP0Af
VHgwqgA
Wy0AXoo
wy0AXoo
xNDGN3R
toS3f6T
8bIgIXy
xNDGN3R
```

Iteration 1: Read the file line by line and collect all the unique IDs starting with 'a': none.
...
Iteration 20: Read the file line by line again and collect all the unique IDs starting with 't': ('toS3f6T'). Append it to the output.
...
Iteration 24: ... collect the unique IDs starting with 'x': ('xNDGN3R'). Append it to the output file.
Iteration 61: ... collect all the unique IDs starting with '8': ('8bIgIXy', '8guP0Af'), and append them to the output.
...

After 62 iterations, the output would be:
```
toS3f6T
wy0AXoo
xNDGN3R
VHgwqgA
Wy0AXoo
8bIgIXy
8guP0Af
```

**Implementation**

Here is the Python program.

In [None]:
# python 3.6+

import fileinput

from datetime import datetime

CHAR_SET = 'abcdefghijklmnopqrstuvwxyz' + 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' + '0123456789'

INPUT_FILE = 'data/sample_id.txt' # default input file
OUTPUT_FILE = 'uniq_id.txt' # default output file

# For each leading character in CHAR_SET, scan the input once and append
# the unique IDs starting with that character to the output file.
# Returns the total number of unique IDs written.
def extract_uniq_ids(input_file, output_file):
    uniq_ids = set()
    total = 0

    for char in CHAR_SET:
        uniq_ids.clear()

        start = datetime.now()
        with fileinput.input(files = input_file) as f:
            for line in f: # cannot use f.readlines() since the file is too large
                line = line.strip()
                if line.startswith(char):
                    uniq_ids.add(line) # a set ignores duplicates

        if uniq_ids:
            # append mode: remove a stale output file before rerunning
            with open(output_file, 'a') as f: # write to separate files if > 100MB
                f.write('\n'.join(uniq_ids))
                f.write('\n')

        total += len(uniq_ids)
        end = datetime.now()
        print(f'Extracted {len(uniq_ids)} unique IDs starting with {char} using {end - start}.')
    return total

if __name__ == '__main__':
    extract_uniq_ids(INPUT_FILE, OUTPUT_FILE)

**Limitation**

This algorithm relies on the unique IDs for each leading character fitting in memory. If the data is highly skewed, so that some leading character has more unique IDs than memory can hold, the program will fail with an out-of-memory error. Partitioning the data by a different rule fixes this issue, for example by the first and last characters together, or by the ID's hash value mod N, where N is roughly the total size of the unique IDs divided by the available memory.
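
The hash-based partitioning rule could be sketched as follows. This is an illustration under assumptions, not the program measured later: the names `partition_of` and `extract_uniq_ids_by_hash` are hypothetical, and CRC32 from `zlib` is used as the hash simply because it is stable across runs (Python's built-in `hash()` for strings is randomized per process).

```python
import zlib

def partition_of(id_str, num_partitions):
    """Map an ID to a partition index in [0, num_partitions) with a stable hash."""
    return zlib.crc32(id_str.encode('utf-8')) % num_partitions

def extract_uniq_ids_by_hash(input_file, output_file, num_partitions):
    """Scan the input once per partition, keeping only that partition's IDs in memory."""
    total = 0
    for target in range(num_partitions):
        uniq_ids = set()
        with open(input_file) as f:
            for line in f:
                line = line.strip()
                if line and partition_of(line, num_partitions) == target:
                    uniq_ids.add(line)
        if uniq_ids:
            with open(output_file, 'a') as out:  # append this partition's IDs
                out.write('\n'.join(uniq_ids) + '\n')
        total += len(uniq_ids)
    return total
```

Because equal IDs always hash to the same partition, no ID can appear in two passes, and a reasonably uniform hash spreads skewed prefixes evenly across the N passes.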

Note:
Sampling the data gives a rough picture of the skew; on Linux, `shuf -n x file` draws a random sample of x lines.

**Complexity and Optimization**

The time complexity is linear, $O(N)$, where 'N' is the total number of IDs in the input file: the big file is scanned 62 times, and 62 is a constant factor. The complexity cannot be reduced dramatically, because all N IDs have to be read to find all the unique IDs.

However, we can process the huge file in batches and merge each batch with the IDs already stored in files. The algorithm is described below:

1. Read a batch of several million lines, and collect the unique IDs grouped by the initial character.
2. Save each group to a file, such as id-a.txt, id-b.txt, etc.
3. Process the next batch of lines, and merge each group into the corresponding file written by the previous batches.
4. Proceed until the end of the input file is reached.
5. Then concatenate these small files into a new file to get all the unique IDs of the input file.

In this way, the complexity becomes $O(M*S)$, where 'M' is the number of unique IDs and 'S' is the number of splits. (With 300 million lines processed 10 million at a time, S is 30.) The huge file is read only once; the drawback is that the intermediate small files are read multiple times.
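
The arithmetic behind 'S' and the $O(M*S)$ bound can be checked with a small sketch. The helper names are hypothetical, and the bound assumes the worst case in which every batch after the first re-reads all M unique IDs from the intermediate files.

```python
def num_splits(total_lines, batch_size):
    """Number of batches needed to cover total_lines (ceiling division)."""
    return (total_lines + batch_size - 1) // batch_size

def worst_case_reread_lines(unique_ids, splits):
    """Upper bound on lines re-read from intermediate files.

    The first batch finds no existing files; each later batch reads back
    at most all unique IDs collected so far, which is at most unique_ids.
    """
    return unique_ids * (splits - 1)

S = num_splits(300_000_000, 10_000_000)
print(S)  # 30, matching the example in the text
```

With fewer, larger batches S shrinks and so does the re-reading cost, at the price of more memory per batch.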

Here is the optimized program.

In [6]:
PREFIX = 'id-'
BATCH_NUM = 2000000
INPUT_FILE = 'data/sample_id.txt'

import os.path
import fileinput
from datetime import datetime

from collections import defaultdict

def merge_ids(file, other_ids):
    """Merge other_ids into the IDs already stored in file, then rewrite the file."""
    ids = set(other_ids)

    if os.path.isfile(file):
        with open(file, 'r') as f: # the with block closes the file
            for line in f:
                line = line.strip()
                if line:
                    ids.add(line)

    with open(file, 'w') as f:
        f.write('\n'.join(ids))
        f.write('\n')

    return True

def extract_ids_batch(input_file, prefix = PREFIX, batch = BATCH_NUM):
    m = defaultdict(set) # key: leading char [a-zA-Z0-9], value: set of IDs
    with fileinput.input(files = input_file) as f:
        eof = False
        while not eof:
            m.clear()
            for _ in range(batch): # honor the `batch` argument
                line = f.readline()
                if line == '': # EOF
                    eof = True
                    break
                line = line.strip()
                if line: # skip the empty lines
                    m[line[:1]].add(line)

            if not m: continue # nothing collected in this batch

            start = datetime.now()
            for start_char, ids in m.items():
                if start_char.isupper(): start_char += '_' # bugfix for case insensitive OS like mac

                file_to_merge = prefix + start_char + '.txt'
                merge_ids(file_to_merge, ids)
            
            end = datetime.now()
            num = sum(len(m[ch]) for ch in m)
            print(f'Merged {num} unique IDs using {end - start}.')

if __name__ == '__main__':
    extract_ids_batch(INPUT_FILE)

Note:
The output consists of small files such as id-0.txt, id-a.txt, and id-A_.txt. To merge all these files into a single file, use:  
`cat id-*.txt > uniq_id_linux.txt`
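
Where `cat` is not available, the same concatenation can be sketched in Python. The function name `concat_files` is hypothetical; the sketch assumes the output path does not itself match the glob pattern.

```python
import glob
import shutil

def concat_files(pattern, output_path):
    """Concatenate all files matching pattern into output_path; return the file count."""
    paths = sorted(glob.glob(pattern))  # resolve the pattern before creating the output
    with open(output_path, 'wb') as dst:
        for path in paths:
            with open(path, 'rb') as src:
                shutil.copyfileobj(src, dst)  # streams in chunks, constant memory
    return len(paths)
```

For example, `concat_files('id-*.txt', 'uniq_id_linux.txt')` mirrors the `cat` command above. No deduplication is needed at this step, because each ID lives in exactly one group file.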

### Approach III: Use database column's 'unique' feature

**Intuition**
1. Split the input into small files as before (using `split` in Linux).
2. In MySQL, create a table `unique_id` with a single column `id` that has a unique constraint.
3. Load one file at a time into the table, using the `ignore` keyword.
4. Repeat until all the split files have been loaded.  

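This works because rows that duplicate an existing value are ignored. Note that uniqueness is checked under the column's collation, and MySQL's default collations are case-insensitive, so IDs that differ only in case are treated as duplicates unless the column uses a binary collation.   
Here is the SQL code for reference.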

```sql
mysql> create table if not EXISTS unique_id (id char(7) UNIQUE not NULL);
mysql> load data infile '/var/lib/mysql/input_id.txt' ignore into table unique_id;
```
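
The same idea can be tried without a MySQL server. The sketch below swaps in SQLite (Python's standard `sqlite3` module) and its `INSERT OR IGNORE` statement as a stand-in for MySQL's `load data ... ignore`; it is not the setup measured in this document, and the function name `load_unique_ids` is hypothetical.

```python
import sqlite3

def load_unique_ids(db_path, input_files):
    """Insert every non-blank line of each file; the UNIQUE constraint drops duplicates."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute('CREATE TABLE IF NOT EXISTS unique_id (id TEXT NOT NULL UNIQUE)')
        for path in input_files:
            with open(path) as f:
                rows = ((line.strip(),) for line in f if line.strip())
                # OR IGNORE skips rows that would violate the UNIQUE constraint
                conn.executemany('INSERT OR IGNORE INTO unique_id (id) VALUES (?)', rows)
            conn.commit()  # one transaction per input file
        (count,) = conn.execute('SELECT COUNT(*) FROM unique_id').fetchone()
        return count
    finally:
        conn.close()
```

SQLite compares `TEXT` values case-sensitively by default, so IDs that differ only in case are kept as distinct rows.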

### Efficiency Comparison

**Hardware**: AWS Lightsail (1 GB RAM, 1 vCPU, 40 GB SSD)  
**Software**: Ubuntu 16.04.5 LTS (GNU/Linux 4.4.0-1074-aws x86_64)  
**Free Memory**: 733MB

#### Profile 1

**Data**

Input File:  data/sample_id.txt  
Size:        7.7MB  
Lines:       1000000 (1 million)  
Output Size: 7.6MB  
Unique ID#:  990964  


**Approaches**  

| Approach                | Running Time  | Note                                   |
| ----------------------- | ------------- | ---------------------------------------|
| Linux `sort` and `uniq` | 1.9s          |                                        |
| Programming             | 36s           |                                        |
| Programming (Enhanced)  | 1.9s          | batch number: 200000 (20% of the input)|
| *Database (MySQL)*      | 38s           |                                        |

**Observation**  
- The Linux command is very efficient, about 19x faster than the plain program (1.9s vs. 36s).  
- The enhanced Python program reduces the running time significantly, from 36s to 1.9s.  
- The plain program and the database approach perform the worst.  
- A database seems heavy for this problem, so it is not measured in the later profiles.  

#### Profile 2

**Data**  
Input File:  data/input_id.txt  
Input Size:  100MB  
Input Line#: 13107200  
Output Size: 93MB  
Unique ID#:  12073380  

**Approaches**  

| Approach                | Running Time  | Note                                  |
| ----------------------- | ------------- | ------------------------------------- |
| Linux `sort` and `uniq` | 36s           |                                       |
| Programming             | 488s          |                                       |
| Programming (Enhanced)  | 40s           | batch size: 2000000 (~15%)            |
| Programming (Enhanced)  | 500s          | batch size: 200000 (~1.5%)            |

**Observation**   
- The fastest approach is still Linux's `sort` and `uniq` (and its output is sorted as well).
- The running time of the plain program increases linearly with the file size. The throughput (file size / running time) is around 0.21MB/sec.
- The enhanced program keeps pace with the Linux approach when the batch size is large (~15%).  
- The performance of the enhanced program degrades as the batch size shrinks, which is consistent with the $O(M*S)$ time complexity mentioned before.
- With a small batch (~1.5%), the plain program and the enhanced program perform similarly.  

#### Profile 3

**Data**  
Input File:  data/bigdata_id.txt  
Size:        1.1GB  
Line#:       134261128 (134 million)  
Output Size: 519MB  
Output ID#:  67998882  


**Approaches**  

| Approach                | Running Time  | Note                                                        |
| ----------------------- | ------------- | ----------------------------------------------------------- |
| Linux `sort` and `uniq` | 532s          |                                                             |
| Programming             | 4836s         | Estimated: 1'18'' per pass * 62 passes                      |
| Programming (Enhanced)  | n/a           | batch size: 10 million (~7%). Killed due to system OoM!     |
| Programming (Enhanced)  | 2951s         | batch size: 5 million                                       |


**Observation**  
- The plain program processes data linearly at ~0.23MB/s, about the same rate as before.  
- With too large a batch size, the enhanced program is killed by the operating system for running out of memory.
- The enhanced program took about 60% of the plain program's running time.
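
The throughput figures quoted in the observations can be reproduced from the tables with a few lines of arithmetic. This is only a check of the reported numbers; it assumes 1GB = 1024MB for Profile 3, and the variable names are illustrative.

```python
MB_PER_GB = 1024  # assumption: binary gigabytes

# (input size in MB, running time in seconds) of the per-character program
profiles = {
    'Profile 1': (7.7, 36),
    'Profile 2': (100, 488),
    'Profile 3': (1.1 * MB_PER_GB, 4836),
}

def throughput_mb_per_s(size_mb, seconds):
    """Average processing rate in MB per second."""
    return size_mb / seconds

for name, (size_mb, seconds) in profiles.items():
    print(f'{name}: {throughput_mb_per_s(size_mb, seconds):.2f} MB/s')

# Enhanced program (5 million batch) relative to the estimated per-character time
enhanced_ratio = 2951 / 4836
print(f'Enhanced / plain running time: {enhanced_ratio:.0%}')
```

The rates stay in a narrow band of roughly 0.20 to 0.23 MB/s across three file sizes, which is what linear scaling predicts.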

## Conclusion

* Using the Linux commands (`sort` and `uniq`) is an efficient way to solve this unique ID problem.
* The program that scans the file once per leading character processes about 0.23MB/sec.
* The enhanced program, which scans the input file only once, boosts the performance significantly.

## Appendix

### Data Generation

The goal is to generate a file of about 1GB containing IDs (a-z|A-Z|0-9), whose unique IDs total more than 500MB.

Quick estimation:
- The length of a single ID is 7B.
- The total number of IDs (lines) in the generated file should be more than 130 million.
- The number of unique IDs (lines) should be more than 65 million.
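
The estimate above follows from simple arithmetic, sketched below. It assumes each line occupies 8 bytes (7 single-byte ASCII characters plus one newline) and binary units (1GB = 1024³ bytes); the helper name is illustrative.

```python
ID_LENGTH = 7
BYTES_PER_LINE = ID_LENGTH + 1  # 7 ASCII characters plus a newline byte

GIB = 1024 ** 3
MIB = 1024 ** 2

def lines_for_size(size_bytes):
    """Number of whole ID lines that fit in size_bytes."""
    return size_bytes // BYTES_PER_LINE

total_lines = lines_for_size(1 * GIB)     # 134217728, i.e. more than 130 million
unique_lines = lines_for_size(500 * MIB)  # 65536000, i.e. more than 65 million
print(total_lines, unique_lines)
```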

In [1]:
### Prepare an input ID file for demonstration purposes: input_id.txt

import random

def gen_ids(valid_chars, length, N):
    """Generate N IDs:
    - each char of an ID is in valid_chars
    - the length of each ID is length
    """
    ids = []
    for _ in range(N):
        chars = random.choices(valid_chars, k=length) # sample with replacement
        ids.append(''.join(chars))
    return ids


CHAR_SET = 'abcdefghijklmnopqrstuvwxyz' + 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' + '0123456789'
INPUT_FILE = 'input_id.txt' # file for generated IDs
ID_LENGTH = 7
N = 10000000 # 10 million; run 7 times, then generate the rest of the IDs by shuffling this file with the `shuf` command

with open(INPUT_FILE, 'a') as f:
    f.write('\n'.join(gen_ids(CHAR_SET, ID_LENGTH, N)))
    f.write('\n') # end with a newline so that the next appended run starts on a new line

### Data Validation

Use this program to validate the format of IDs.

In [11]:
# File name: validate_id.py
# Usage: python3 validate_id.py input_file.dat

import os
import sys
import fileinput

ID_LENGTH = 7
CHARS_SET = set('abcdefghijklmnopqrstuvwxyz' + 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' + '0123456789')

def validate_id(path): # path is either a file or a directory
    """Return True if every non-blank line under path is a valid ID."""
    if os.path.isfile(path):
        with fileinput.input(files=path) as f:
            for line in f:
                line = line.strip()
                if not line: continue # skip the blank lines
                if not (len(line) == ID_LENGTH and set(line) <= CHARS_SET):
                    print(f'Failed! Violation word:{line}')
                    return False
        print(f'Validated {path}')
        return True
    elif os.path.isdir(path):
        ok = True
        for root, dirs, files in os.walk(path):
            for file in files:
                # keep validating the remaining files after a failure
                ok = validate_id(os.path.join(root, file)) and ok
        return ok
    else:
        print(f'File does not exist: {path}')
        return False

if __name__ == '__main__':
    if len(sys.argv) > 1:
        for argv in sys.argv[1:]:
            validate_id(argv)

Validated data/snippet_id.txt


True

### Data Distribution

In [None]:
# check the distribution of the first char (a-zA-Z0-9) in the input file

INPUT_FILE = 'data/sample_id.txt'

import fileinput
from collections import Counter

with fileinput.input(files=INPUT_FILE) as f:
    # count the leading character of every non-blank line
    counter = Counter(line.strip()[:1] for line in f if line.strip())
    for ch in sorted(counter):
        print(ch, counter[ch])