# Sets

A set is an unordered collection with no duplicate elements. Basic uses include membership testing and eliminating duplicate entries. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference. Sets can also have important efficiency benefits.

## One Motivation -- Lists can be slooooooow....
One motivation for using sets is that several important operations (adding an element, determining whether an element is in the set) take *constant time* on average, regardless of the size of the set. In contrast, determining whether an element is in a list takes time linear in the size of the list.

In [1]:
tiny_num = 10 # ten
tiny_num_list = list(range(tiny_num)) 
print(tiny_num_list)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [2]:
big_num = 10000000 # ten million
big_num_list = list(range(big_num)) 
print(len(big_num_list))

10000000


In [3]:
small_num = 100
small_num_list = list(range(big_num - small_num, big_num))
print (small_num_list)

[9999900, 9999901, 9999902, 9999903, 9999904, 9999905, 9999906, 9999907, 9999908, 9999909, 9999910, 9999911, 9999912, 9999913, 9999914, 9999915, 9999916, 9999917, 9999918, 9999919, 9999920, 9999921, 9999922, 9999923, 9999924, 9999925, 9999926, 9999927, 9999928, 9999929, 9999930, 9999931, 9999932, 9999933, 9999934, 9999935, 9999936, 9999937, 9999938, 9999939, 9999940, 9999941, 9999942, 9999943, 9999944, 9999945, 9999946, 9999947, 9999948, 9999949, 9999950, 9999951, 9999952, 9999953, 9999954, 9999955, 9999956, 9999957, 9999958, 9999959, 9999960, 9999961, 9999962, 9999963, 9999964, 9999965, 9999966, 9999967, 9999968, 9999969, 9999970, 9999971, 9999972, 9999973, 9999974, 9999975, 9999976, 9999977, 9999978, 9999979, 9999980, 9999981, 9999982, 9999983, 9999984, 9999985, 9999986, 9999987, 9999988, 9999989, 9999990, 9999991, 9999992, 9999993, 9999994, 9999995, 9999996, 9999997, 9999998, 9999999]


**How long do you think the following will take?**
1. less than 1 second
2. longer than 1 second but less than 10 seconds
3. longer than 10 seconds but less than 1 minute
4. longer than 1 minute

In [4]:
# how many of small_num_list elements are in big_num_list?
import time
start = time.time()

count = 0
print("counting...")
for i in small_num_list:
    if i in big_num_list:
        count += 1
        
end = time.time()
print("count using list:", count, "; time:", end-start, "sec")

counting...
count using list: 100 ; time: 3.9594051837921143 sec


**How long for the following different version?**

In [5]:
# how many of small_num_list elements are in big_num_set?
start = time.time()

big_num_set = set(big_num_list) #include the time to build this

end = time.time()
print("time building set: ", end-start, "sec")

time building set:  0.21669769287109375 sec


In [6]:
#small_num_list = big_num_list
start = time.time()

count = 0
for i in small_num_list:
    if i in big_num_set:
        count += 1
        
end = time.time()
print("count using set:", count, "; time:", end-start, "sec")

count using set: 100 ; time: 5.269050598144531e-05 sec


In [7]:
start = time.time()

small_num_set = set(small_num_list)
count_intersection = len(big_num_set.intersection(small_num_set))

end = time.time()
print("count using set intersection:", count_intersection, "; time:", end-start, "sec")

count using set intersection: 100 ; time: 3.910064697265625e-05 sec


## Another Motivation -- Conceptual clarity with set operations

In [None]:
# Lists can have duplicate elements, and lists are ordered
basket = ['apple', 'orange', 'apple', 'pear', 'orange']

# Creating a set from a list results in a set without duplicate elements
fruit1 = set(basket)
print(fruit1)

In [None]:
# Adding a new element will change (mutate) the set...
fruit1.add('banana')
print(fruit1)

In [None]:
# Adding the same element again to a set doesn't change the set
fruit1.add('apple')
print(fruit1)

### There are multiple ways to remove items from your set  

In [None]:
fruit1.remove('apple') # no exception if element in set
print(fruit1)

In [None]:
fruit1.remove('grape')  # exception if element not in set
print(fruit1)

In [None]:
fruit1.discard('grape')  #no exception if element not in set
print(fruit1)

In [None]:
my_fruit = fruit1.pop()  # removes and returns an arbitrary element; KeyError if the set is empty
print("popped:", my_fruit)
print (fruit1)

In [None]:
# Sets are unordered: cannot index or slice into a set
fruit1[0:]

In [None]:
# Can iterate over the elements in a set, in loops or comprehensions
for elt in fruit1:
    if 'n' in elt: 
        print(elt)
        
print([elt for elt in fruit1 if 'n' in elt])  #list comprehension
print({elt for elt in fruit1 if 'n' in elt})  #set comprehension

### Basic set operations

In [None]:
fruit1 = {'orange', 'apple', 'pear'}
fruit2 = {'orange', 'apple', 'berry', 'grape'}
print("fruit1 =", fruit1)
print("fruit2 =", fruit2)

In [None]:
# Function Notation

print("Intersection:", fruit1.intersection(fruit2))

print("Union:", fruit1.union(fruit2))

print("Difference, fruit1 - fruit2:", fruit1.difference(fruit2))
print("Difference, fruit2 - fruit1:", fruit2.difference(fruit1))

print("Symmetric Difference:", fruit1.symmetric_difference(fruit2)) #elements in union but NOT in intersection

In [None]:
# Operator Notation

#Intersection
print("Intersection:", fruit1 & fruit2)

#Union
print("Union:", fruit1 | fruit2)

#Difference
print("Difference, fruit1 - fruit2:", fruit1 - fruit2)
print("Difference, fruit2 - fruit1:", fruit2 - fruit1)

#Symmetric Difference
print("Symmetric Difference:", fruit1 ^ fruit2)

### Word of Caution! These operations all create new sets

In [None]:
old_fruits = {'apple', 'pear', 'melon'}
new_fruits = {'melon', 'guava'}
print ("old_fruits = ", old_fruits)
print ("new_fruits = ", new_fruits)

In [None]:
my_fruits = old_fruits.union(new_fruits)
print ("my_fruits = ", my_fruits)
print ("old_fruits = ", old_fruits)
print ("new_fruits = ", new_fruits)

In [None]:
my_fruits = old_fruits.update(new_fruits) # update (like the |= operator) mutates old_fruits in place and returns None
print ("my_fruits = ", my_fruits)
print ("old_fruits = ", old_fruits)
print ("new_fruits = ", new_fruits)
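One way to check whether an operation built a new set or mutated an existing one is to compare object identity with `is`. The following is a minimal sketch; the names `a`, `b`, `alias`, and `c` are made up for illustration and do not appear elsewhere in this notebook.

```python
a = {'apple', 'pear'}
b = {'pear', 'guava'}
alias = a                 # second name for the same set object

c = a | b                 # union: builds a new set, a is untouched
print(c is a)             # False
print(a)                  # still only apple and pear

a |= b                    # in-place union: mutates the existing set
print(alias is a)         # True, still the same object
print(alias)              # alias sees the added element

update_result = a.update({'melon'})   # in-place; the method returns None
print(update_result)
```

Because `alias` and `a` name the same object, the in-place forms are visible through both names, while `a | b` leaves `a` alone.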

In [None]:
start = time.time()

total_set = set()
for i in range(30000):
    current_set = set(range(i, i+4))
    total_set = total_set.union(current_set)
    
end = time.time()
print("time using union: ", end-start, "sec")

In [None]:
start = time.time()

total_set = set()
for i in range(30000):
    current_set = set(range(i, i+4))
    total_set.update(current_set)
    
end = time.time()
print("time using update: ", end-start, "sec")

## What kind of objects can be in a set?
The elements of a set must be hashable objects. Python's primitive immutable data types are all hashable -- e.g., strings, numbers, booleans, `None`. The "value" of an instance of one of these types can never change, so its hash never changes either, and it can safely serve as a member of a set. In contrast, Python's built-in list type is mutable: the "value" of a list object (e.g., `[1, 2]`) can change, so a hash computed from that value could become stale if the list mutates. Lists are therefore not hashable and *cannot* be members of sets.
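A quick way to see the difference is to try building a set from each kind of object. This is a small sketch; the variable names `hashable_items` and `list_error` are chosen only for this illustration.

```python
# Immutable built-in values are hashable and can be set members.
hashable_items = {"text", 42, 3.14, True, None}
print(len(hashable_items))

# A list is mutable, so Python refuses to hash it.
try:
    bad = {[1, 2]}
    list_error = None
except TypeError as err:
    list_error = str(err)

print("error:", list_error)
```

The `TypeError` is raised when the set is constructed, not later when the list is mutated.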

Tuples are an interesting case -- they can be members of sets *if* the elements of the tuple are themselves (recursively) hashable. Thus `(1, "foo")` can be in a set (its "value" will never change). But the tuple `(1, [2])` has a second element that could be mutated, and thus this tuple is not hashable and cannot be in a set.
By the same reasoning, sets are mutable and so cannot themselves be members of sets! (See `frozenset` if you're interested in an immutable, hashable variant of a set that *can* be an element of a set.)
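The tuple and frozenset cases can be tried directly. In this sketch the names `pairs`, `tuple_error`, and `nested` are hypothetical and used only here.

```python
pairs = set()
pairs.add((1, "foo"))          # every element is hashable: fine

try:
    pairs.add((1, [2]))        # contains a list: the tuple is unhashable
    tuple_error = None
except TypeError as err:
    tuple_error = str(err)
print("error:", tuple_error)

# frozenset is an immutable set, so it can be a member of another set.
nested = {frozenset({1, 2}), frozenset({2, 3})}
nested.add(frozenset({2, 1}))  # equal to an existing member: no change
print(len(nested))
```

Note that the failed `add` leaves `pairs` unchanged, and that two frozensets with the same elements count as the same member.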

The hashability restriction is what makes it possible to determine whether an element is in a set in constant time (on average) with respect to the size of the set; i.e., one does not need to iterate over all elements of a set to determine whether that element is in the set. (See 6.006 for more details on how this hashing works.)

For those interested: instances of user-defined classes are hashable by default. But the author of a class can control or change this behavior by defining `__eq__` and `__hash__` methods in the class definition. Read more about that in the Python documentation if you're interested in that advanced concept.
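As a rough illustration of that last point, the sketch below defines three hypothetical classes (`PlainPoint`, `Point`, and `EqOnly` are invented for this example). It assumes Python 3 behavior, where defining `__eq__` without `__hash__` makes instances unhashable.

```python
class PlainPoint:
    """Default behavior: hashed and compared by object identity."""
    def __init__(self, x, y):
        self.x, self.y = x, y

class Point:
    """Value-based equality with a matching hash."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)
    def __hash__(self):
        return hash((self.x, self.y))

class EqOnly:
    """Defines __eq__ but not __hash__: instances become unhashable."""
    def __eq__(self, other):
        return isinstance(other, EqOnly)

plain_count = len({PlainPoint(1, 2), PlainPoint(1, 2)})  # distinct objects
value_count = len({Point(1, 2), Point(1, 2)})            # equal values collapse
print(plain_count, value_count)

try:
    {EqOnly()}
    eq_only_error = None
except TypeError as err:
    eq_only_error = str(err)
print("error:", eq_only_error)
```

A `Point` like this should be treated as immutable once it is in a set; changing `x` or `y` afterwards would change its hash and make it hard to find again.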

# Example: Is the number met before
This section takes advantage of sets to solve a simple problem. Given a list of integers as input, your job is to return a list of Booleans in which each entry is True if the corresponding number appeared earlier in the input list and False if not:

In [None]:
data = [7, 4, 7, 3]
#expected output: result = [False, False, True, False]

**Approach:** use a set to store the numbers that we have met so far, and determine whether the next integer has been met before via set membership testing, which takes constant time.

In [None]:
met = set() #Create an empty set
result = [False]*len(data) #Initialize a list with the same length as data
for index, number in enumerate(data):
    if number in met:
        result[index] = True #Indicate that the number has been met before
    else:
        met.add(number) #If not, add it to the set of previously met elements
print("data:", data)
print("result:", result)
print("met:", met)

**Side note and Caution!** In the above, something like `[False]*3` would create the list `[False, False, False]`. Since `False` is immutable, everything is good and no confusion arises when we later change individual elements of the `result` list. But let's say we wanted a list of length 3, with each element being its own empty list to which we'll add elements. Using `x = [[]]*3` creates a list that prints as `[[], [], []]`, so it might look good. But all three elements are the **same** empty list! Thus `x[0].append(1)` results in `x` printing as `[[1], [1], [1]]`, which is a **very common aliasing bug**. You are much better off with a different approach to create a deeper data structure, e.g., using a list comprehension that ensures that each element of the list is its own instance, such as `x = [[] for _ in range(3)]`.

In [8]:
x = [[]]*3
print("x:", x)
x[0].append(1)
print("x now:", x, "-- aliased!\n")

y = [[] for _ in range(3)]
print("y:", y)
y[0].append(1)
print("y now:", y, "-- not aliased")

x: [[], [], []]
x now: [[1], [1], [1]] -- aliased!

y: [[], [], []]
y now: [[1], [], []] -- not aliased


**Alternative "is number met before":** different implementations of a solution to the problem above are possible, and might have different efficiencies. For example, we could build the result list one item at a time, using repeated appends:

In [None]:
met = set()
result = []
for number in data:
    if number in met:
        result.append(True)
    else:
        met.add(number) #If not, add it to the set of previously met elements
        result.append(False)
print("data:", data)
print("result:", result)
print("met:", met)

In [None]:
#Even more (probably TOO) pythonic; many frown upon mutating inside a comprehension
met = set()
result = [True if val in met else (False, met.add(val))[0] for val in data]
print("data:", data)
print("result:", result)
print("met:", met)