## Data Structures


We have covered much of the basics of Python's primitive data types in detail. It's now useful to consider how these basic types can be collected in ways that are meaningful and useful for a variety of tasks. Data structures are a fundamental component of programming: a data structure is a collection of data elements that adheres to certain properties, depending on its type. In these notes, we'll present three basic data structures: the list, the set, and the dictionary. Python's data structures are very rich, and a full treatment is beyond the scope of this simple primer. Please see [the documentation](http://docs.python.org/2/tutorial/datastructures.html) for a more complete view.


### Set:

A set is a data structure where all elements are unique. Sets are unordered. In fact, the order of the elements observed when printing a set might change at different points during a program's execution, depending on the state of Python's internal representation of the set. Sets are ideal for membership queries. For instance: is a user among those users who have received a promotion?

Sets are specified by curly braces, `{ }`, containing one or more comma-separated values. To specify an empty set, use the alternative construct `set()`, because empty braces `{}` create an empty dictionary rather than an empty set.
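
The difference between `{}` and `set()` is easy to check directly. The following is a minimal sketch; the variable names `maybe_set`, `real_empty_set`, and `digits` are only illustrative and are not used elsewhere in these notes.

```python
# Empty curly braces create a dictionary, not a set
maybe_set = {}
real_empty_set = set()

print(type(maybe_set))       # <class 'dict'>
print(type(real_empty_set))  # <class 'set'>

# Duplicate values in a set literal are silently dropped
digits = {1, 2, 2, 3, 3, 3}
print(len(digits))           # 3
```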

In [None]:
# creating sets
some_set = {1, 2, 3, 4, 4, 4, 4}
another_set = {4, 5, 6}

In [None]:
print(some_set)

In [None]:
# creating an empty set; notice that we do *not* write "empty_set = {}",
# as one might expect from the way we create an empty list with [],
# because {} creates an empty dictionary
empty_set = set()

We can also create a set from a list:

In [None]:
my_list = [1, 2, 3, 0, 5, 10, 11, 1, 5]
my_set = set(my_list)
print(my_set)
print(len(my_set))
print(len(my_list))

#### Exercise 

* What is the number of distinct words in the `washington_post` variable (defined above)?

#### Checking for membership in a set

The easiest way to check for membership in a set is to use the `in` keyword, checking if a needle is "`in`" the set.
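
Returning to the promotion example from earlier, a membership query might look like the sketch below. The user names and the helper function `received_promotion` are hypothetical and exist only for illustration.

```python
# Hypothetical user names, purely for illustration
promoted_users = {"alice", "bob", "carol"}

def received_promotion(user):
    """Return True if the user is in the set of promoted users."""
    return user in promoted_users

print(received_promotion("alice"))  # True
print(received_promotion("dave"))   # False
```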

In [None]:
my_set = {1, 2, 3, 4}

In [None]:
val = 1
print("The value", val, "appears in the variable my_set:", val in my_set)

In [None]:
val = 0
print("The value", val, "appears in the variable my_set:", val in my_set)

We also have the "`not in`" operator, which returns `True` when the value is absent from the set:

In [None]:
# the message refers to my_set, so we test membership in my_set
val = 5
print("Value {d} does not appear in my_set: {tf}".format(d=val, tf=(val not in my_set)))
val = 1
print("Value {d} does not appear in my_set: {tf}".format(d=val, tf=(val not in my_set)))


#### Set operators: Add, remove elements; Union, intersection, subset

Some other common set functionality:

+ `set_a.add(x)`: add an element to a set
+ `set_a.remove(x)`: remove an element from a set
+ `set_a - set_b`: elements in a but not in b. Equivalent to `set_a.difference(set_b)`
+ `set_a | set_b`: elements in a or b. Equivalent to `set_a.union(set_b)`
+ `set_a & set_b`: elements in both a and b. Equivalent to `set_a.intersection(set_b)`
+ `set_a ^ set_b`: elements in a or b but not both. Equivalent to `set_a.symmetric_difference(set_b)` 
+ `set_a <= set_b`: tests whether every element in set_a is in set_b. Equivalent to `set_a.issubset(set_b)`
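
The operations listed above can be sketched on two small sets. The sets `evens` and `small` are made up for this illustration, and the results are passed through `sorted()` only so that the printed order is predictable, since sets themselves are unordered.

```python
# Illustrative sets; the names and values are made up for this sketch
evens = {2, 4, 6, 8}
small = {1, 2, 3, 4}

evens.add(10)      # evens now also contains 10
evens.remove(10)   # back to the original; remove raises KeyError if the element is absent

print(sorted(evens - small))   # [6, 8]
print(sorted(evens | small))   # [1, 2, 3, 4, 6, 8]
print(sorted(evens & small))   # [2, 4]
print(sorted(evens ^ small))   # [1, 3, 6, 8]
print({2, 4} <= evens)         # True
```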


#### Exercise

Try the above yourself using the `set_A` and `set_B` variables defined in the cell below: compute the difference, union, intersection, and symmetric difference between the two sets.

In [None]:
# Your code here
set_A = {1, 2, 3, 4, 5}
set_B = {4, 5, 6, 7}
print("Set A", set_A)
print("Set B", set_B)
print("Difference A-B", {} )
print("Union", {})
print("Intersection", {})
print("Symmetric Difference", {})

Now, let's use the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) to compute the similarity of the two sets. The Jaccard coefficient is defined as the size of the intersection of the two sets divided by the size of their union. It ranges from 0 (no elements in common) to 1 (identical sets).
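
As a rough sketch of this definition, the function below computes the coefficient for any two sets. The function name `jaccard_similarity`, the toy sentences, and the choice to return 1.0 for two empty sets are assumptions made for this illustration, not part of the original notes.

```python
def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets are treated as identical here
        return 1.0
    return len(set_a & set_b) / len(union)

# Toy example: 2 shared words ("the", "brown") out of 6 distinct words
words_a = set("the quick brown fox".split())
words_b = set("the lazy brown dog".split())
print(jaccard_similarity(words_a, words_b))
```

Splitting on whitespace is the simplest way to turn text into a set of words; punctuation and capitalization are left untouched here, so "Yahoo" and "Yahoo," would count as different words.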

#### Exercise

Now, let's pick a few news articles from the web and paste them in the notebook (as in the case of the Washington Post above). Then compute the similarity of these articles using the Jaccard similarity.

In [None]:
wsj = """
Yahoo Inc. disclosed a massive security breach by a “state-sponsored actor” affecting at least 500 million users, potentially the largest such data breach on record and the latest hurdle for the beaten-down internet company as it works through the sale of its core business.
Yahoo said certain user account information—including names, email addresses, telephone numbers, dates of birth, hashed passwords and, in some cases, encrypted or unencrypted security questions and answers—was stolen from the company’s network in late 2014 by what it believes is a state-sponsored actor.
Yahoo said it is notifying potentially affected users and has taken steps to secure their accounts by invalidating unencrypted security questions and answers so they can’t be used to access an account and asking potentially affected users to change their passwords.
Yahoo recommended users who haven’t changed their passwords since 2014 do so. It also encouraged users change their passwords as well as security questions and answers for any other accounts on which they use the same or similar information used for their Yahoo account.
The company, which is working with law enforcement, said the continuing investigation indicates that stolen information didn't include unprotected passwords, payment-card data or bank account information.
With 500 million user accounts affected, this is the largest-ever publicly disclosed data breach, according to Paul Stephens, director of policy and advocacy with Privacy Rights Clearing House, a not-for-profit group that compiles information on data breaches.
No evidence has been found to suggest the state-sponsored actor is currently in Yahoo’s network, and Yahoo didn’t name the country it suspected was involved. In August, a hacker called “Peace” appeared in online forums, offering to sell 200 million of the company’s usernames and passwords for about $1,900 in total. Peace had previously sold data taken from breaches at Myspace and LinkedIn Corp.
"""

ust = """
SAN FRANCISCO — Information from at least 500 million Yahoo accounts was stolen from the company in 2014, and the  company said Thursday it believes that a state-sponsored actor was behind the hack.
The information may have included names, email addresses, telephone numbers, dates of birth, and, in some cases, encrypted or unencrypted security questions and answers, Yahoo said.
Claims surfaced in early August that a hacker using the name "Peace" was trying to sell the usernames, passwords and dates of birth of Yahoo account users on the dark web — a black market of thousands of secret websites.
The FBI said it was aware of the matter. The compromise of public and private sector systems is something the agency takes very seriously and it said it will continue to investigate and hold accountable all who pose a threat in cyberspace, the agency said in an emailed statement.
Yahoo recommends that users who haven’t changed their passwords since 2014 do so. The company said it was notifying potentially affected users and taking steps to secure their accounts. That included invalidating unencrypted security questions and answers and asking users to change their passwords.
The announcement comes as Yahoo looks to complete its $4.8. billion sale of its core Internet business to media giant Verizon Communications, which said it was notified of the Yahoo breach "within the last two days."
"We understand that Yahoo is conducting an active investigation of this matter, but we otherwise have limited information and understanding of the impact," Verizon said.
Given the unsettled nature of Yahoo's ownership just now, “regulators should be concerned with who will take responsibility for the response to this compromise. It can be easy for the ‘right thing to do’ to slip through the cracks in a multi-billion dollar transition," said Tim Erlin, senior director of IT security and risk strategy at Tripwire, a computer security firm.
Yahoo Chief Executive Officer Marissa Mayer has pledged to stay on with the company through the close of the merger, which is being overseen by Verizon's Marni Walden and AOL CEO Tim Armstrong. Yahoo shares (YHOO) were flat Thursday. Verizon (VZ) shares were up 1% at $52.39.
"""