In [1]:
import pandas as pd
import numpy as np
import csv

# Some useful stuff for copies and pastes

## Reading and Writing to csv

Other encodings include `utf-8`, `utf-16`, `cp1252` and many others. `latin_1` seems to work well on Windows, especially for an Excel sheet saved as csv.

In [2]:
df = pd.read_csv('data/file1.csv', encoding='latin_1')

df.head(3)

|   | a | b | c |
|---|---|---|---|
| 0 | -0.403026 | 0.559875 | 0.755273 |
| 1 | -1.795062 | -0.195291 | -0.550756 |
| 2 | -2.151132 | -0.915356 | 0.848399 |
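
When the encoding of a file is not known in advance, one option is to try a few candidates in order. The sketch below is only an illustration: the helper name `read_csv_with_fallback`, the candidate list, and the temporary demo file are assumptions, not part of the original notebook. Note that `latin_1` can decode any byte sequence, so it never fails but may silently produce the wrong characters; that is why it goes last.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "latin_1")):
    """Hypothetical helper: try each encoding in turn.

    Returns a (DataFrame, encoding_used) tuple.
    """
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            # This encoding cannot decode the file; try the next one.
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo: write a small file in latin_1 so that reading it as utf-8 fails.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="latin_1") as f:
        f.write("name\ncafé\n")
    demo_df, used = read_csv_with_fallback(demo_path)

print(used)
print(demo_df)
```

The order of the candidates matters: strict encodings such as `utf-8` should come first, because permissive ones will accept almost anything.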


In [3]:
df.to_csv('data/file2.csv', index=False, quoting=csv.QUOTE_ALL)

In [10]:
!head data/file2.csv

"a","b","c"
"-0.4030264624911308","0.5598745496930929","0.7552732765837198"
"-1.7950623985555558","-0.19529096820893035","-0.5507564378674887"
"-2.151131926398676","-0.9153555453197798","0.8483986080862593"
"-2.213299712589293","2.3633395845233323","0.1299953998671199"
"0.6944889727161473","0.3403618588316733","1.0778166623472378"
"-1.1615298464767665","-0.37769120518367455","-0.2900550146167669"
"1.2391075524770283","0.6355713890761568","0.5060922705495678"
"-0.02788863354969921","0.5251839979595172","-0.37175171628525305"
"1.1313591996036818","0.2564746443673867","0.05222260163368722"
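
As a quick check of what `csv.QUOTE_ALL` does, here is a small sketch that writes to an in-memory buffer instead of a file and reads the result back. The tiny DataFrame is made up for illustration. With the default `read_csv` settings, the quoted numbers should still come back as floats.

```python
import csv
import io

import pandas as pd

# Made-up data, just to show the effect of quoting.
small = pd.DataFrame({"a": [1.5, -2.0], "b": [0.25, 3.0]})

# Write with every field quoted, including the numeric ones.
buf = io.StringIO()
small.to_csv(buf, index=False, quoting=csv.QUOTE_ALL)
text = buf.getvalue()
print(text)

# Read it back; the quotes are stripped and dtypes are inferred again.
round_trip = pd.read_csv(io.StringIO(text))
print(round_trip.dtypes)
```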


## Exclusive deduplication

Sometimes I need to compare two lists of IDs and keep only the IDs that appear in exactly one of them. Normal deduplication functions keep one copy of each duplicated value. I want no copies of the duplicated values at all, just the values that appear a single time when both lists are combined.

Behold. In Python, the answer is always a list comprehension.

In [4]:
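def dedupe_exclusive(x, y):
    # Keep only the values that appear in exactly one of the two inputs.
    xs = [i for i in x if i not in y]  # values only in x
    ys = [i for i in y if i not in x]  # values only in y
    return xs + ys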

In [5]:
x = np.arange(1, 9)
x

array([1, 2, 3, 4, 5, 6, 7, 8])

In [6]:
y = np.arange(3, 11)
y

array([ 3,  4,  5,  6,  7,  8,  9, 10])

In [7]:
dedupe_exclusive(x, y)

[1, 2, 9, 10]
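
For longer lists, the same idea can be written as a set symmetric difference, which avoids scanning one list for every element of the other. This is a sketch of an alternative, not a drop-in replacement: the name `dedupe_exclusive_sets` is made up here, and unlike the list comprehension version it sorts the result and collapses values repeated within a single input. NumPy's `np.setxor1d` does the same thing for arrays.

```python
import numpy as np


def dedupe_exclusive_sets(x, y):
    """Hypothetical variant: values found in exactly one of x or y, sorted."""
    # ^ on two sets is the symmetric difference.
    return sorted(set(x) ^ set(y))


x = np.arange(1, 9)
y = np.arange(3, 11)

# Convert to plain Python ints so the result prints the same on any NumPy version.
via_sets = dedupe_exclusive_sets(x.tolist(), y.tolist())

# NumPy equivalent: sorted, unique values present in only one of the arrays.
via_numpy = np.setxor1d(x, y).tolist()

print(via_sets)
print(via_numpy)
```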