# [Python Data Structures](https://towardsdatascience.com/python-lists-are-overrated-776e87cda3e5)

### Lists
Lists are arguably the most popular data structure in Python, and it’s easy to see why. They have tremendous utility and versatility.

Unfortunately, a heavy reliance on lists often means overlooking other capable data structures that can perform better than lists in certain situations.

You might be wondering: why does this even matter? If lists get the job done, why bother with other tools?

This line of reasoning is sound if you are content with manipulating a small amount of data. However, if you aim to use your skills to serve other companies, you should understand that you can’t be so lax about how you store and process your data.

Companies, which are often burdened with petabytes of data, stress the importance of developing efficient programs that are optimized for both time and memory consumption.

Efficiency is just as important as functionality!

You need to understand that even if multiple tools can get the job done, some are simply better suited than others. Learning which data structure is ideal for a specific scenario is a big step toward improving the quality of your code.

While lists certainly are a reliable data structure, they are not the most fitting for every case. Here are a few alternatives that are worthy of consideration.

### Tuples
Tuples share many key similarities with lists. Both data structures store heterogeneous data types and assign a specific order to their elements.

The main difference between the two is that tuples, unlike lists, are immutable. Once a tuple is created, its elements cannot be reassigned, added, or removed.
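Here is a minimal sketch of what immutability means in practice. The variable names are purely illustrative.

```python
point = (3, 4)

# Reading by index works exactly as it does for a list
x = point[0]

# Assigning to an index raises TypeError because tuples are immutable
try:
    point[0] = 10
    modified = True
except TypeError:
    modified = False

print(x, modified)  # 3 False

# "Changing" a tuple really means building a new one
moved = (10,) + point[1:]
print(moved)  # (10, 4)
```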

This makes lists seem more appealing, given how often programmers need to modify the data they store. However, omitting tuples altogether for this reason is a mistake.

Although tuples don’t allow you to change their values, they still have an advantage over lists: memory usage. A tuple requires less space than a list to store the same data.

Let’s demonstrate this with a quick example.

Here, we create a list and tuple storing the same values and use the sys module to determine the sizes of both objects.

In [2]:
import sys
import numpy as np

In [3]:
# create a list and tuple with 10 elements
list1 = [1,2,3,4,5,6,7,8,9,10]
tuple1 = (1,2,3,4,5,6,7,8,9,10)

# get size of list and tuple
list_size = sys.getsizeof(list1)
tuple_size = sys.getsizeof(tuple1)
print('Size of list: {} bytes'.format(list_size))
print('Size of tuple: {} bytes'.format(tuple_size))

Size of list: 136 bytes
Size of tuple: 120 bytes


As shown above, the tuple needs less memory than the list to store the same data (120 bytes versus 136 bytes in this run).
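To check whether the gap holds at other sizes, here is a small sketch that repeats the measurement for a few lengths. The lengths are arbitrary choices, and the exact byte counts depend on your Python version and platform.

```python
import sys

# Compare the container size of a list and a tuple holding the same items.
# Exact byte counts vary by Python version and platform.
sizes = {}
for n in (0, 10, 1_000, 100_000):
    as_list = list(range(n))
    as_tuple = tuple(range(n))
    sizes[n] = (sys.getsizeof(as_list), sys.getsizeof(as_tuple))
    print(f"n={n}: list={sizes[n][0]} bytes, tuple={sizes[n][1]} bytes")
```

On CPython the saving tends to be a small overhead per object rather than a saving per element, so it is likely to matter most when a program holds a large number of small sequences.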

Creating memory-efficient programs matters at scale, which is why it is worth keeping this data structure in your arsenal.

Since tuples are immutable, they can only be used to store values and look them up. Thus, the ideal scenario for using tuples is when you know the values in the tuple and won’t need to modify them.

### Numpy Arrays
Unlike lists, numpy arrays are not a built-in data structure. They come from the numpy module, which specializes in conducting mathematical operations.

They also differ from lists in that they only store homogeneous values (i.e. the elements of the array must be of the same type).
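A quick sketch of what a single element type means in practice: when the inputs have mixed numeric types, numpy converts them to one common type, which is exposed through the array's `dtype` attribute. The values here are made up for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])
print(ints.dtype)   # an integer dtype; the exact width depends on the platform

# Mixing ints and floats: numpy upcasts every element to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# A list keeps each element's original type
plain = [1, 2.5, 3]
print([type(v).__name__ for v in plain])  # ['int', 'float', 'int']
```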

Fortunately, they make up for this shortcoming by allowing you to carry out a variety of calculations with less memory usage and a quicker run time.

Here is an example that showcases the quick run time of numpy arrays.

Suppose we wish to create two lists of 1,000,000 values ranging from 0 to 999,999 and then add them element by element to form a third collection of values. Let’s find out the time it takes to achieve this with lists and numpy arrays.

Note: Run times are measured with IPython’s “%timeit” and “%%timeit” magic commands, so the cells below need to run in IPython or Jupyter.
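Outside IPython or Jupyter, the timeit magics are unavailable. A comparable measurement can be sketched with the standard-library `timeit` module. The helper name `add_lists` and the smaller input size are choices made for this sketch, not part of the original benchmark.

```python
import timeit

n = 100_000  # smaller than the article's 1,000,000 so the sketch runs quickly
list1 = list(range(n))
list2 = list(range(n))

def add_lists():
    # Element-wise addition of two lists
    return [a + b for a, b in zip(list1, list2)]

# Five repetitions of ten calls each; report the best time per call
runs = timeit.repeat(add_lists, number=10, repeat=5)
best_per_call = min(runs) / 10
print(f"best of 5: {best_per_call * 1e3:.2f} ms per call")
```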

In [4]:
# performing the computation with lists
list1 = [*range(1000000)]
list2 = [*range(1000000)]
%timeit list3 = [num1+num2 for num1, num2 in zip(list1,list2)]

290 ms ± 23.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [5]:
# performing the computation with numpy arrays
np_array1 = np.arange(1000000)
np_array2 = np.arange(1000000)
%timeit np_array3 = np_array1+np_array2

3.37 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


The difference in run times speaks for itself: in this run, the numpy version is more than 80 times faster than the list comprehension.

In addition to better run times, numpy arrays are also more memory efficient. Let’s compare the size of a list with 1,000,000 values to that of a numpy array with 1,000,000 values, using sys.getsizeof for the list and the number of elements multiplied by the item size for the array.

In [6]:
# create list and numpy array
list1 = [*range(1000000)]
np_array = np.arange(1000000)

# obtain sizes of list and numpy array
size_list1 = sys.getsizeof(list1)
size_np_array = np_array.size * np_array.itemsize

print('Size of list: {} bytes'.format(size_list1))
print('Size of numpy array: {} bytes'.format(size_np_array))

Size of list: 9000104 bytes
Size of numpy array: 4000000 bytes


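When storing 1,000,000 values, the numpy array uses less than half the memory of the list: 4 bytes per element here, against roughly 9 bytes per element for the list. Keep in mind that sys.getsizeof counts only the list’s own storage of references, not the integer objects those references point to, so the real gap is even wider.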
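One caveat about this measurement is worth sketching: `sys.getsizeof` reports the size of the list object alone and leaves out the integer objects it refers to. The snippet below adds those in for a rough estimate of the full footprint. It uses a smaller list than the example above, and the summed figure is only an approximation.

```python
import sys

n = 1_000
values = list(range(n))

# Size of the list object alone: header plus one reference per element
shallow = sys.getsizeof(values)

# Add the size of every integer object the list refers to.
# This is a rough estimate: CPython caches small ints, so some objects are shared.
deep = shallow + sum(sys.getsizeof(v) for v in values)

print(f"list object only: {shallow} bytes")
print(f"list plus its int objects: {deep} bytes")
```

A numpy array, by contrast, stores the raw values in a single buffer, which is why multiplying `size` by `itemsize` (the same number reported by the array's `nbytes` attribute) covers its data.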

Overall, numpy arrays surpass lists in both run time and memory usage. Although it is completely fine to use lists for simple calculations, when it comes to computationally intensive calculations, numpy arrays are your best bet.

### Sets
Sets are, in my opinion, the most overlooked data structure in Python.

By definition, sets store a mutable collection of distinct values. Given that sets don’t allow duplicates, people may opt to stick with lists as they are already comfortable with using the latter.

However, neglecting sets means casting aside all of the functions that come with them.

Sets are remarkable when it comes to performing searches.

As an example, let’s take these text corpuses (accessible with this [link](https://www.shawlocal.com/2017/04/06/what-is-your-opinion-on-social-media/ajmj3gi/)) and split them into a collection of words. These words will be stored in lists and sets.

In [7]:
# text from first corpus
text1 = "I think social media is a great way to stay in touch with family and friends. \
         There is no doubt it is entertaining. I also believe that social media is intrusive \
         to relationships and may cause many to lose focus on projects that need to be accomplished."

# text from second corpus
text2 = "Social media platforms allow us to share information and education to individuals in \
        a great capacity and on a grand scale. However, when used for negative, social media can be \
        extremely detrimental to our mental health and has been the trigger for increased anxiety \
        and social problems in our world."

# storing words from each corpus into lists
list1 = text1.split(' ')
list2 = text2.split(' ')

# storing words from each corpus into sets
set1 = set(text1.split(' '))
set2 = set(text2.split(' '))

Before comparing the time it takes to search for the word “believe” in the list and the set of the first corpus, I will remove duplicate values from the list for a fair comparison.

In [8]:
list1_no_dup = []
for word in list1:
    if word not in list1_no_dup:
        list1_no_dup.append(word)
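As an aside, the same order-preserving de-duplication can be sketched with `dict.fromkeys`, which avoids rescanning the result list on every iteration. The short sentence below is a stand-in for the corpus text, chosen only to keep the example self-contained.

```python
# Illustrative input, not the article's corpus
words = "social media is a great way to stay in touch social media is intrusive".split()

# Loop-based approach: every `not in` check rescans the result list
no_dup_loop = []
for word in words:
    if word not in no_dup_loop:
        no_dup_loop.append(word)

# dict keys are unique and keep insertion order (guaranteed since Python 3.7)
no_dup_dict = list(dict.fromkeys(words))

print(no_dup_dict)
```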

Here is how the two searches compare:

In [9]:
# search for 'believe' in the list
%timeit believe_list = 'believe' in list1_no_dup

1.14 µs ± 24.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [10]:
# search for the word 'believe' in set1
%timeit believe_set = 'believe' in set1

160 ns ± 1.85 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


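The set lookup is about seven times faster here, and because a list must be scanned element by element, the gap grows with the size of the collection.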
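To get a feel for how membership tests behave on larger collections, here is a rough sketch using the standard-library `timeit` module instead of the IPython magic. The collection size and the target value are arbitrary choices for this illustration, and the timings will vary from machine to machine.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1  # last element: the worst case for a list scan

# The list is scanned element by element; the set does a hash lookup
list_time = timeit.timeit(lambda: target in as_list, number=100)
set_time = timeit.timeit(lambda: target in as_set, number=100)

print(f"list: {list_time / 100 * 1e6:.1f} µs per lookup")
print(f"set:  {set_time / 100 * 1e6:.3f} µs per lookup")
```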

Additionally, sets provide numerous functions that make searches much easier.

For instance, how would you find all the words that are present in both text corpuses? Here’s how that would be achieved using lists.

In [11]:
%%timeit
# store words in both corpuses in list
common_words = []
for word in list1:
    if word in list2 and word not in common_words:
        common_words.append(word)

146 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


Here’s what that looks like using sets.

In [13]:
# store words in both corpuses in set
%timeit common_words = set1.intersection(set2)

1.6 µs ± 10.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


The set not only accomplishes the task in less time (roughly 90 times faster in this run), it also manages this with fewer lines of code. Sets offer many useful methods like ‘intersection’ that enable programmers to extract needed data from different collections.
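A short sketch of a few other built-in set operations alongside `intersection`. The two small word sets are made up for illustration, and each operator is shown next to its method equivalent.

```python
a = {"social", "media", "is", "great"}
b = {"social", "media", "can", "be", "detrimental"}

common = a & b      # same as a.intersection(b)
either = a | b      # same as a.union(b)
only_a = a - b      # same as a.difference(b)
not_shared = a ^ b  # same as a.symmetric_difference(b)

# Sets are unordered, so sort before printing for a stable result
print(sorted(common))      # ['media', 'social']
print(sorted(only_a))      # ['great', 'is']
print(sorted(not_shared))  # ['be', 'can', 'detrimental', 'great', 'is']
```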

Thus, sets easily trump lists when it comes to performing searches. However, there is a speed-memory tradeoff to consider when using sets: although they can yield shorter run times, they also require more memory.

In [14]:
# obtain size of list and set
size_list1 = sys.getsizeof(list1)
size_set1 = sys.getsizeof(set1)

print('Size of list for corpus 1: {} bytes'.format(size_list1))
print('Size of set for corpus 1: {} bytes'.format(size_set1))

Size of list for corpus 1: 688 bytes
Size of set for corpus 1: 2264 bytes
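The comparison above measures the full word list against the set. For a like-for-like check, the sketch below stores the same unique elements in both containers. The generated strings are hypothetical data, and the byte counts depend on the Python version and platform.

```python
import sys

# Hypothetical data with no duplicates
unique_words = [f"word{i}" for i in range(1_000)]

as_list = list(unique_words)
as_set = set(unique_words)

# Both containers reference the same 1,000 strings; only the container size differs
list_bytes = sys.getsizeof(as_list)
set_bytes = sys.getsizeof(as_set)
print(f"list: {list_bytes} bytes, set: {set_bytes} bytes")
```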


Whether you should use sets depends on the goals and the limitations of your project. If your priority is to minimize the run time of searches and comparisons, and you can afford the extra memory, sets are the ideal data structure.