## Loading passwords with a function
In the last notebook you learned how to open a text file containing passwords and read it line by line. You stripped the newline character from each line and stored every non-empty line in the list of passwords. Since you are going to reuse this code many times, it is a perfect candidate for a function. The function needs one (string) parameter, the name of the file, and it should return the preprocessed and filtered list of passwords. You can use the following template for the function:
```python
def load_passwords(path):
    # load the passwords here
    return passwords
```

- Write the function that returns a list of passwords from a text file. You can use the template above or write the function from scratch.
- What happens if you call the function without any argument, i.e. `load_passwords()`? The error message will tell you exactly what is wrong with the function call.
- Load the passwords from the file `passwords.txt` into a list. How many passwords in the list only contain numeric characters?
- EXTRA: Add a boolean parameter called `load_empty` that allows you to optionally load empty passwords. If you call the function with the argument `load_empty=True`, the empty lines should be included in the list of passwords.
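
If required and optional parameters are new to you, the following minimal sketch may help. The function `shout()` and its parameter `exclaim` are hypothetical and only serve as an illustration; they are not part of the exercise. The sketch shows a boolean parameter with a default value and what Python does when a required argument is missing:

```python
def shout(text, exclaim=False):
    """Return the text in upper case; optionally append an exclamation mark."""
    result = text.upper()
    if exclaim:
        result += "!"
    return result


print(shout("hello"))                # the default exclaim=False is used
print(shout("hello", exclaim=True))  # the optional behaviour is switched on

try:
    shout()  # the required parameter `text` is missing
except TypeError as error:
    print(error)  # Python reports which argument is missing
```

Because `exclaim` has a default value, existing calls with a single argument keep working after the parameter has been added.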

## Normalized length distribution
Instead of the function `np.bincount()`, you can also use the function `np.unique()` to get the length distribution. As the name suggests, the primary use of `np.unique()` is to get the unique values from an array. If you have an array called `lengths`, the call `np.unique(lengths)` will return the unique values in the array `lengths` in ascending order.
```python
lengths = np.array([2, 1, 4, 1, 1])
unique_lengths = np.unique(lengths)
```
However, if you call the function with the additional argument `return_counts=True`, you will receive two arrays. The first array contains the unique values and the second array contains the corresponding counts:
```python
lengths = np.array([2, 1, 4, 1, 1])
unique_lengths, length_counts = np.unique(lengths, return_counts=True)
```
Compared to using `np.bincount()`, the function `np.unique()` can directly give you the x-values and the y-data for the length distribution plot, and you do not have to remove the leading zeros manually.
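
To make the difference concrete, here is a small side-by-side sketch using the example array from above. The variable names `bin_counts` and `present_lengths` are chosen here for illustration only:

```python
import numpy as np

lengths = np.array([2, 1, 4, 1, 1])

# np.bincount() counts every integer from 0 up to the maximum value,
# so lengths that never occur show up with a count of zero.
bin_counts = np.bincount(lengths)
print(bin_counts)  # [0 3 1 0 1]

# np.unique() only reports the values that actually occur.
unique_lengths, length_counts = np.unique(lengths, return_counts=True)
print(unique_lengths)  # [1 2 4]
print(length_counts)   # [3 1 1]

# With np.bincount() you have to filter out the zero counts yourself
# to arrive at the same x-values.
present_lengths = np.nonzero(bin_counts)[0]
print(present_lengths)  # [1 2 4]
```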

- Reproduce the two examples in the exercise with the predefined array `lengths`.
- Create a new array with at least ten integer values in the interval [0, 5]. Write down the unique values and counts that you expect on paper, and then use the function `np.unique()` to see if you were right.
- Compute the array of lengths for the list of passwords that you loaded in the previous exercise. Use the function `np.unique()` to get the unique lengths and their counts.
- Normalize the array of the length counts and store the result in a new array. If the new array is normalized correctly, the sum of the array will be `1.0`.
- Create a bar plot with the normalized length distribution. Label the axes, add a title (that includes the number of passwords) and change any other property to improve the plot.
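
The normalization step can be sketched on the small example array from this section, assuming that "normalized" means dividing each count by the total number of values. The name `normalized_counts` is an illustrative choice:

```python
import numpy as np

lengths = np.array([2, 1, 4, 1, 1])
unique_lengths, length_counts = np.unique(lengths, return_counts=True)

# Divide each count by the total so that the values become fractions.
normalized_counts = length_counts / length_counts.sum()
print(normalized_counts)  # [0.6 0.2 0.2]

# Floating-point sums can be off by a tiny amount, so np.isclose()
# is a safer check than comparing with == 1.0.
print(np.isclose(normalized_counts.sum(), 1.0))  # True
```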

## Counting duplicate passwords
With an increasing number of passwords, there will also be an increasing number of duplicate passwords. (Actually, there were already a few duplicates in the first 100 passwords. Did you notice any of them?) Getting the unique passwords and their counts should therefore be the first step of the data analysis. This will allow you to get a much better overview of your data without removing any information. Even though the list of passwords is not numerical data, you can still use the function `np.unique()` that was introduced in the previous exercise.
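
As a quick check that `np.unique()` also accepts strings, the following sketch uses a hypothetical toy array `words` (not taken from the password file). Run it to see the unique values and their counts:

```python
import numpy as np

# Hypothetical toy data, used only to show that strings work as well.
words = np.array(["apple", "pear", "apple", "fig", "pear", "apple"])

unique_words, word_counts = np.unique(words, return_counts=True)

for word, count in zip(unique_words, word_counts):
    print(word, count)

# Every input value is counted exactly once, so the counts add up
# to the length of the original array.
print(word_counts.sum() == len(words))
```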

- Get the unique passwords and their counts with the function `np.unique()`. How are the unique passwords ordered? If the start of the array is too confusing, look at the last 50 **unique** passwords instead.
- Compare the number of **unique** passwords to the number of **all** passwords. How many passwords in the initial dataset were duplicates? What is the maximum number of duplicates of a password?
- Use NumPy array filtering to get the passwords with a count greater than 10 (greater than 2 if you are using the smaller dataset). If you are using one of these passwords yourself, you should probably consider changing it. :)
- Compute the length of each **unique** password and store the result in a new array. How can you calculate the average length of **all** passwords from the unique arrays?
- Plot the counts as a function of the lengths, label the axes and add a title. You can either use the function `plt.scatter()` or you can use `plt.plot()` with a format that only displays markers.
- Create a second plot and highlight the data points where the count is greater than 10 (greater than 2 if you are using the smaller dataset). You can either use a different color or increase the size of the points. Hint: Call the plot function twice and use NumPy array filtering to select the data. Add labels to the axes, assign labels to the data and add a legend.

## Sorting the passwords
The current order of the unique passwords is not very helpful for further analysis, since the array of **unique** passwords starts with some odd passwords that begin with special characters. Sorting the arrays by the counts in descending order would make a lot more sense, because the most frequent passwords would then come first. If you just want to sort the array of counts, you can do that directly with the function `np.sort()`. By default the counts will be sorted in ascending order, but you can reverse the array with the indexing `[::-1]`. See the following code snippet to sort an array in descending order:
```python
some_numbers = np.array([3, 1, 4, 1, 5, 9, 2, 6])
np.sort(some_numbers)[::-1]
```
The problem here is that you will only sort the counts but not the unique passwords and their lengths. Instead of directly sorting the counts, you should therefore use the function `np.argsort()`, which returns the indices that would sort the counts. The following code snippet produces the same result as the example above:
```python
some_numbers = np.array([3, 1, 4, 1, 5, 9, 2, 6])
sorted_indices = np.argsort(some_numbers)[::-1]
some_numbers[sorted_indices]
```
You can then use the array `sorted_indices` to sort the **unique** passwords, the counts and the lengths.
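
The idea of reusing one index array for several arrays can be sketched with two small, hypothetical arrays `fruits` and `fruit_counts` (illustrative names and values only):

```python
import numpy as np

# Two aligned toy arrays: fruit_counts[i] belongs to fruits[i].
fruits = np.array(["fig", "apple", "pear"])
fruit_counts = np.array([1, 3, 2])

# Indices that sort the counts in descending order.
sorted_indices = np.argsort(fruit_counts)[::-1]
print(sorted_indices)  # [1 2 0]

# Applying the same indices to both arrays keeps them aligned.
print(fruit_counts[sorted_indices])  # [3 2 1]
print(fruits[sorted_indices])        # ['apple' 'pear' 'fig']
```

Note that reversing with `[::-1]` also reverses the order of entries with equal counts, which usually does not matter for this kind of analysis.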

- Reproduce the two examples in the exercise with the predefined array `some_numbers`.
- Create a new array with at least seven integer values. Write down the results you expect from the functions `np.sort()` and `np.argsort()` on paper, and then execute the functions to see if you were right. You can choose whether you want to sort the array in ascending order or in descending order.
- Use the function `np.argsort()` and the reverse index to get the indices that will sort the counts in descending order.
- Create three new arrays for the **unique** passwords, the counts and the lengths by applying the sorted indices to the respective arrays.
- Look at the first ten passwords and their counts in the sorted arrays. Which passwords did you expect to be in the "top ten"?

## Keeping arrays in a data frame
The sorting process of the three arrays in the previous section showed you that it is not very convenient to manage multiple NumPy arrays manually. In principle, you could stack the three arrays from the previous exercise to create a two-dimensional NumPy array. The data would then remain aligned if you change the order in any way. However, since one of the arrays stores string data and the other two arrays store integer data, this data is not suitable to be combined in a single array.  

The package [pandas](https://pandas.pydata.org/) (imported as `pd` by convention) resolves this issue with a so-called data frame that works much like a spreadsheet (from Excel or LibreOffice Calc). A data frame has an index (as an identifier of the rows) and columns to store the data. Each column in the data frame works just like a NumPy array, but the columns are not required to all have the same type. A data frame is therefore suited to store mixed data, such as your passwords, the counts and the lengths. See the following code snippet to create a data frame from the (unsorted) arrays:
```python
import pandas as pd
df = pd.DataFrame(dict(password=unique_passwords, count=password_counts, length=lengths))
```
The variable name `df` is often used for data frames, but you can use any other variable name here. The keys in the dictionary will be the names of the columns and the values are the column data. By default, the index goes from `0` to `len(unique_passwords) - 1`, just like `range(len(unique_passwords))` or `np.arange(len(unique_passwords))`.  
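
If you want to see these properties without loading the password file, the following self-contained sketch builds a small data frame from hypothetical toy arrays (the names `toy_passwords`, `toy_counts`, `toy_lengths` and `toy_df` are made up for this illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy arrays standing in for the real password data.
toy_passwords = np.array(["fig", "apple", "pear"])
toy_counts = np.array([1, 3, 2])
toy_lengths = np.array([3, 5, 4])

toy_df = pd.DataFrame(dict(password=toy_passwords, count=toy_counts, length=toy_lengths))

print(toy_df)                # three rows with three named columns
print(list(toy_df.columns))  # the column names come from the dictionary keys
print(list(toy_df.index))    # the default index: 0, 1, 2
print(toy_df.dtypes)         # each column keeps its own data type
```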

You can directly create a plot from the data frame using the method `df.plot(x, y)`. The method only needs the names of the columns and the data is then taken from the data frame. See the following code snippet to plot the counts as a function of the length:
```python
df.plot("length", "count")
```

- Import the package `pandas` and rename it to `pd`. Create the data frame with the code snippet above and look at the output of the data frame.
- How can you change the column names when creating the data frame? You do not have to find better names for the columns, just try any other names to understand how the renaming works.
- Use the data frame method `df.plot()` to display the counts as a function of the length. Look at the docstring of the method to find out which parameter you have to change to get a scatter plot. How can you change the marker and the color?