## 3.1 Loading passwords with a function
In the last notebook you learned how to read passwords from a text file. Since you are going to reuse this code many times, it is a perfect candidate to be turned into a function. The function needs one (string) parameter, which is the name of the file, and it should return the preprocessed (all `"\n"` removed) and filtered (no empty passwords) list of passwords. You can use the following template for the function:
```python
def load_passwords(path):
    # load the passwords here
    return passwords
```

1. Write the function that returns a list of passwords from a text file. Copy the template above or write the function from scratch.
2. What happens if you call the function without any argument, i.e. `load_passwords()`? The error message will tell you exactly what is wrong with the function call.
3. Load the passwords from the file `passwords.txt` into a list. How many passwords are left after the preprocessing and filtering?
4. Count the number of passwords in the list that _only_ contain alphabetic characters, numeric characters or special characters respectively. Try to use list comprehensions wherever possible. An example for a password with only special characters is `"!?:"`. Examples for a purely alphabetic password or a purely numeric password are `"abc"` and `"123"` respectively.
5. EXTRA: Add a boolean parameter called `load_empty` that allows you to optionally load empty passwords. If you call the function with the argument `load_empty=True`, the empty lines will be included in the list of passwords.

## 3.2 Normalized length distribution
Instead of the function `np.bincount()`, you can also use the function `np.unique()` to get the length distribution. As the name suggests, the primary use of the function `np.unique()` is to get the unique values from an array. If you have an array called `lengths`, the call `np.unique(lengths)` will return the unique values in the array `lengths` in ascending order.
```python
import numpy as np

lengths = np.array([2, 1, 4, 1, 1])
np.unique(lengths)  # -> array([1, 2, 4])
```
However, if you call the function with the additional argument `return_counts=True`, you will receive two arrays. The first array will contain the unique values and the second array will contain the corresponding counts:
```python
lengths = np.array([2, 1, 4, 1, 1])
np.unique(lengths, return_counts=True)  # -> (array([1, 2, 4]), array([3, 1, 1]))
```
Compared to using `np.bincount()`, the function `np.unique()` can directly give you the x-values and the y-data for a (length) distribution plot.
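
To make the difference concrete, the following sketch runs both functions on the example array. The variable names `bin_counts`, `x_bincount`, `values` and `counts` are chosen for illustration only.

```python
import numpy as np

lengths = np.array([2, 1, 4, 1, 1])

# np.bincount() counts every integer from 0 up to the maximum value,
# so values that never occur (here 0 and 3) appear with a count of 0.
bin_counts = np.bincount(lengths)
# The matching x-values have to be built separately.
x_bincount = np.arange(len(bin_counts))

# np.unique() only returns the values that actually occur,
# together with their counts.
values, counts = np.unique(lengths, return_counts=True)

print(x_bincount, bin_counts)
print(values, counts)
```

Both pairs of arrays describe the same distribution. The result of `np.unique()` simply skips the values with a count of zero.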

1. Reproduce the two examples in the exercise with the predefined array `lengths`. Make sure that you understand the relation between the `lengths` and the arrays returned from the function `np.unique()`.
2. Create a new array with at least ten integer values in the interval $[0, 5]$. Write down the unique values and counts that you expect in a markdown cell or on paper, and then use the function `np.unique()` to see if you were right.
3. Compute the array of lengths for the list of passwords that you loaded in the previous exercise. Use the function `np.unique()` to get the unique lengths and their counts.
4. Normalize the array of the length counts and store the result in a new array. If the new array is normalized correctly, the sum of the values in the array will be `1.0`.
5. Create a bar plot with the normalized length distribution. Add axis labels, add a title (that includes the number of passwords) and change one other property to improve the plot.

## 3.3 Counting duplicate passwords
With an increasing number of passwords, there will also be an increasing number of duplicate passwords. (Actually, there were already a few duplicates in the first 100 passwords. Did you notice any of them?) Getting the unique passwords and their counts should therefore be the first step of the data analysis. This will allow you to get a much better overview of your data without removing any information. Even though the list of passwords is not numerical data, you can still use the function `np.unique()` that was introduced in the [previous exercise](#3.2-Normalized-length-distribution).

1. Get the unique passwords and their counts with the function `np.unique()`. How are the unique passwords ordered? If the start of the array is too confusing, look at the last 50 unique passwords instead.
2. Compare the number of _unique_ passwords to the number of _all_ passwords. How many passwords in the initial dataset were duplicates? What is the maximum number of duplicates of a password?
3. Use NumPy array filtering to get the passwords with a count greater than 10 (greater than 2 if you are using the smaller dataset). If you are using one of these passwords yourself, you should probably consider changing it. :)
4. Compute the length of each unique password and store the result in a new array. How can you calculate the average length of **all** passwords from the unique arrays?
5. Plot the counts as a function of the lengths, add axis labels and add a title. You can either use the function `plt.scatter()` or you can use `plt.plot()` with a format that only displays markers.
6. Create a second plot and highlight the data points where the count is greater than 10 (greater than 2 if you are using the smaller dataset). You can either use a different color or increase the size of the points. Hint: Call the plot function twice and use NumPy array filtering to select the data. Add axis labels, assign labels to the data and add a legend.
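
The NumPy array filtering mentioned in tasks 3 and 6 can be sketched on a small toy array of strings. The array `words` and all other names below are made up for this illustration and are not part of the password dataset.

```python
import numpy as np

# Toy data instead of the real passwords.
words = np.array(["tea", "milk", "tea", "juice", "tea", "milk"])
unique_words, word_counts = np.unique(words, return_counts=True)

# A boolean mask has one True/False entry per unique word.
mask = word_counts > 1

# The same mask can be applied to both arrays, so they stay aligned.
frequent_words = unique_words[mask]
frequent_counts = word_counts[mask]

print(frequent_words, frequent_counts)
```

Because the mask is computed once and applied to both arrays, each remaining word still sits at the same position as its count.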

## 3.4 Sorting the passwords
The current order of the unique passwords is not well suited for further analysis, since the array of _unique_ passwords starts with some weird passwords that begin with special characters. Sorting the arrays by the counts in descending order would make a lot more sense, because the most frequent passwords would then come first. If you just want to sort the array of counts, you can do that directly with the function `np.sort()`. By default, the counts will be sorted in ascending order, but you can reverse the array with the indexing `[::-1]`. See the following code snippet to sort an array in descending order:
```python
some_numbers = np.array([3, 1, 4, 1, 5, 9, 2, 6])
np.sort(some_numbers)[::-1]  # -> array([9, 6, 5, 4, 3, 2, 1, 1])
```
The problem here is that this only sorts the counts, but not the unique passwords and their lengths. Instead of directly sorting the counts, you should therefore use the function `np.argsort()`, which returns the indices that would sort the counts. See the following code snippet that produces the same result as the example above:
```python
some_numbers = np.array([3, 1, 4, 1, 5, 9, 2, 6])
sorted_indices = np.argsort(some_numbers)[::-1]  # indices for descending order
some_numbers[sorted_indices]  # -> array([9, 6, 5, 4, 3, 2, 1, 1])
```
You can then use the array `sorted_indices` to sort the _unique_ passwords, their counts and their lengths.
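
As a minimal sketch of this idea, the following example sorts two aligned toy arrays with the same index array. The arrays `names` and `scores` are invented for this illustration.

```python
import numpy as np

# Two aligned toy arrays: names[i] belongs to scores[i].
names = np.array(["a", "b", "c", "d"])
scores = np.array([3, 9, 1, 5])

# Indices that sort the scores in descending order.
sorted_indices = np.argsort(scores)[::-1]

# Applying the same indices to both arrays keeps them aligned.
sorted_scores = scores[sorted_indices]
sorted_names = names[sorted_indices]

print(sorted_indices)
print(sorted_names, sorted_scores)
```

The index array is computed from `scores` only, but it can reorder any array of the same length.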

1. Reproduce the two examples in the exercise with the predefined array `some_numbers`.
2. Create a new array with at least seven integer values. Write down the results you expect from the functions `np.sort()` and `np.argsort()` in a Markdown cell or on paper, and then execute the functions to see if you were right. You can choose whether you want to sort the array in ascending order or in descending order.
3. Use the function `np.argsort()` and the reverse index to get the indices that will sort the counts in descending order.
4. Create three new arrays for the _unique_ passwords, their counts and their lengths by applying the sorted indices to the respective arrays.
5. Look at the first ten passwords and their counts in the sorted arrays. Which passwords did you expect to be in the "top 10"?

## 3.5 Storing arrays in a data frame
The sorting process of the three arrays in the previous section showed you that it is not very convenient to manage multiple NumPy arrays manually. In principle, you could stack the three arrays from the previous exercise to create a two-dimensional NumPy array. The data would then remain aligned if you change the order in any way. However, since one of the arrays stores string data and the other two arrays store integer data, this data is not suitable to be combined in a single array.  

The package [pandas](https://pandas.pydata.org/) (imported as `pd` by convention) resolves this issue with a so-called data frame that works much like a spreadsheet (from Excel or LibreOffice Calc). A data frame has an index (as an identifier of the rows) and columns to store the data. Each column in the data frame works just like a NumPy array, but the columns are not required to have the same data type. A data frame is therefore suited to store mixed data, such as your passwords, their counts and their lengths. See the following code snippet to create a data frame from the (unsorted) arrays:
```python
import pandas as pd
# unique_passwords, counts and lengths are the aligned arrays from the previous exercises
df = pd.DataFrame(dict(password=unique_passwords, count=counts, length=lengths))
```
The variable name `df` is often used for data frames, but you can also use a different variable name here. The keys in the dictionary will be the names of the columns and the values are the column data. By default, the index counts from `0` to `len(unique_passwords) - 1`, just like `range(len(unique_passwords))` or `np.arange(len(unique_passwords))`.
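
The following sketch shows the same construction with two small toy arrays. The names `fruits`, `amounts` and `df_toy` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# One array with strings and one array with integers.
fruits = np.array(["apple", "banana", "cherry"])
amounts = np.array([5, 2, 7])

# The dictionary keys become the column names.
df_toy = pd.DataFrame(dict(fruit=fruits, amount=amounts))

print(df_toy)
print(df_toy.dtypes)        # each column keeps its own data type
print(list(df_toy.index))   # default index counting from 0
```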

You can directly create a plot from the data frame using the method `df.plot(x, y)`, where `x` and `y` are the names of the columns. The data is then taken directly from the data frame. See the following code snippet to plot the counts as a function of the length:
```python
df.plot("length", "count")
```

1. Import the package `pandas` and rename it to `pd`.
2. Create the data frame containing the password data from the [previous exercise](#3.4-Sorting-the-passwords) with the code snippet above and output the data frame.
3. How could you change the column names when creating the data frame? You do not have to find better names for the columns, just try any other names to understand how the renaming works.
4. Use the data frame method `df.plot()` to display the counts as a function of the length. Look at the docstring of the method to find out which parameter you have to change to get a scatter plot. How can you change the marker and the color?

## 3.6 Accessing rows in a data frame
The data frame you created in the [previous exercise](#3.5-Storing-arrays-in-a-data-frame) should have three columns storing the (unique) passwords, their counts and their lengths. Since the passwords are the "identifier" of each row in the data frame, it makes sense to turn this column into the index column (which currently just holds the numbers $0$ to $n - 1$ where $n$ is the number of unique passwords). If you want to turn the existing column `"password"` in the data frame `df` into the index column, you can use the method
```python
df.set_index("password")
```
This will return a modified copy of the data frame. If you want to persist this change, you can either reassign the return value to the variable `df`, or you can call the method with the argument `inplace=True`. Be careful: in the latter case, the method modifies `df` directly and returns `None`. If you assign this return value to the variable `df`, the variable will just hold `None` afterwards and the data frame is lost.
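
The difference between the two variants can be sketched with a small toy data frame. The names `df_toy`, `indexed` and `result` are assumptions made for this example.

```python
import pandas as pd

df_toy = pd.DataFrame(dict(fruit=["apple", "banana"], amount=[5, 2]))

# Variant 1: set_index() returns a new data frame, df_toy stays unchanged.
indexed = df_toy.set_index("fruit")
columns_before = list(df_toy.columns)  # still contains "fruit"

# Variant 2: inplace=True changes df_toy itself and returns None.
result = df_toy.set_index("fruit", inplace=True)

print(indexed)
print(columns_before)
print(result)
```

After the second call, `df_toy` has the new index, while `result` only holds `None`.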

With the new index column, you can directly use a password to read the corresponding row from the data frame. The following code snippet will return the row of the password `"123456"` (if it exists):
```python
df.loc["123456"]
```

Even though the passwords are now the index of the data frame, you can still use a numerical index to read rows from the data frame. The corresponding property of the data frame is called `.iloc`. See the following code snippet that will select the 10th row from the data frame (regardless of the actual value of the index column):
```python
df.iloc[9]
```
Instead of a single index, you can also use this with an index range to get multiple rows of the data frame. The indexing with `df.iloc[]` works just like the indexing of a NumPy array.
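
As a rough sketch, the following example compares `.loc` and `.iloc` on a toy data frame with a string index. The data and the variable names are invented for this illustration.

```python
import pandas as pd

df_toy = pd.DataFrame(
    dict(amount=[5, 2, 7, 1]),
    index=["apple", "banana", "cherry", "date"],
)

# .loc selects a row by its index label.
row_by_label = df_toy.loc["banana"]

# .iloc selects a row by its position, counting from 0.
row_by_position = df_toy.iloc[1]

# Index ranges work like NumPy slicing and return a data frame.
first_two = df_toy.iloc[:2]
last_two = df_toy.iloc[-2:]

print(row_by_label)
print(first_two)
print(last_two)
```

Here `row_by_label` and `row_by_position` refer to the same row, because `"banana"` is the second entry of the index.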

1. Turn your password column into the index column and persist the change.
2. Look at the documentation of the method `df.set_index()` to find out how to add a column to the index instead of completely replacing the index. Try this with one of the other columns in the data frame, but don't persist the change!
3. Read a few rows from the data frame by using the passwords as the index. You can try the passwords `"123456"` and `"ronaldo"`, or any other (memorable) password from the previous exercises. What happens if you try to read a password that does not exist?
4. Use the property `df.iloc` to read the first/last 20 passwords from the data frame. How can you reverse the data frame using the property `.iloc`?

## 3.7 Accessing columns in a data frame
Regarding the columns, a data frame works just like a dictionary. The key is the column name and the value is the data saved in the column. You can use this to read existing columns from the data frame, and you can also use this to assign new columns to the data frame. See the following code snippet to get the column `"length"` from the data frame `df`:
```python
length_series = df["length"]
```
When you read a single column from a data frame, the column will be returned as a series. Instead of a single column name, you can also index the data frame with a list of column names. In that case, the returned variable is a data frame.
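
The following sketch illustrates reading and assigning columns on a toy data frame. The column name `"double_amount"` and the other names are hypothetical and only serve as an example.

```python
import pandas as pd

df_toy = pd.DataFrame(dict(fruit=["apple", "banana", "cherry"], amount=[5, 2, 7]))

# A single column name returns a series.
amount_series = df_toy["amount"]

# A list of column names returns a data frame.
sub_frame = df_toy[["fruit", "amount"]]

# Assigning to a new key adds a new column, just like in a dictionary.
df_toy["double_amount"] = df_toy["amount"] * 2

print(type(amount_series))
print(type(sub_frame))
print(df_toy)
```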

1. Compute the array of digit sums by iterating over the passwords in `df.index`. Assign the array to the data frame with the column name `"digit_sum"`.
2. Plot the digit sums in a histogram with a logarithmic y-axis. How can you adjust the number of bins?

## 3.8 Computing more password metrics
Besides the count and the length of each password, it would be nice to have some more data for each password. In [exercise 3.7](#3.7-Accessing-columns-in-a-data-frame) you have already added the digit sum, which is the sum of all digits in a password. Two more interesting values of passwords are the count of alphabetic characters and the count of numeric characters. To compute these values, you just have to count the characters of the respective categories. See the following example password with an alphabetic character count of `4` and a numeric character count of `2`:
```python
password = "stop12!?"
```
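
One possible way to classify the individual characters is sketched below. It assumes that the built-in string methods `str.isalpha()` and `str.isdigit()` are an acceptable definition of "alphabetic" and "numeric" for this exercise. The list names are chosen for illustration only.

```python
password = "stop12!?"

# One boolean per character: is it alphabetic / numeric?
alphabetic_flags = [character.isalpha() for character in password]
numeric_flags = [character.isdigit() for character in password]

for character, is_alpha, is_digit in zip(password, alphabetic_flags, numeric_flags):
    print(character, is_alpha, is_digit)
```

The characters `"!"` and `"?"` are neither alphabetic nor numeric, so both flags are `False` for them.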

1. Write two functions to count the alphabetic characters and numeric characters of a single password. Use these functions to calculate the two values for all passwords, and assign them to the data frame as new columns.
2. Display the digit sum as a function of the numeric character count and as a function of the alphabetic character count. Can you already spot some correlation in the data?
3. EXTRA: Implement the function to count the special characters and also add the number of special characters as a new column to the data frame. Create a plot that shows the digit sum as a function of the special character count.

## 3.9 Statistics on data frame columns
If your data frame only contains numerical data, you can directly use the data frame methods `df.sum()`, `df.min()`, `df.max()`, `df.mean()` etc. to run these computations. The result will be a series, a dictionary-like object where the keys are the names of the columns and the values are the results of the computations. You can also run the same methods on a single column (also called a series) to immediately get a single return value.

Besides computations on single columns, you can also compute correlations across the data frame columns. The corresponding method `df.corr()` will return a new data frame with the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) for every pair of columns. If the coefficient is > 0, the two columns are positively correlated, and if the coefficient is < 0, the two columns are negatively correlated. The strongest correlations are reached at +1 and -1 respectively, while a coefficient close to 0 means that there is no linear correlation. Note that the Pearson correlation coefficient is only the default option of the method `df.corr()`. There are also other correlation coefficients available, and you can even compute your own correlation coefficient.
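
The following sketch applies these methods to a toy data frame with deliberately simple columns, so that the expected correlations are easy to check by hand. The column names are made up for this illustration.

```python
import pandas as pd

df_toy = pd.DataFrame(
    dict(
        x=[1, 2, 3, 4],
        x_doubled=[2, 4, 6, 8],   # grows with x
        x_reversed=[4, 3, 2, 1],  # shrinks when x grows
    )
)

# Method on the entire data frame: one value per column.
column_means = df_toy.mean()

# Method on a single column: one value.
single_mean = df_toy["x"].mean()

# Pearson correlation coefficient for every pair of columns.
correlations = df_toy.corr()

print(column_means)
print(single_mean)
print(correlations)
```

The column `x_doubled` is perfectly positively correlated with `x`, and the column `x_reversed` is perfectly negatively correlated with `x`.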

1. Try the computational methods on the entire data frame. Do the unique passwords contain more alphabetic characters or numeric characters on average?
2. Use the methods `df.idxmax()` and `df.idxmin()` to get the passwords (which are the indices of the data frame) for the maximum/minimum value of each column.
3. Compute the Pearson correlation coefficient across all the columns in the data frame. Look at the individual coefficients. Are any of them surprising to you?
4. Plot the alphabetic character count as a function of the numeric character count. Can you see the strongly negative correlation in the data?

## 3.10 Sorting and querying a data frame
Since the index and the columns in the data frame are aligned, you can directly sort the entire data frame based on the values of one (or multiple) column(s). The corresponding method `df.sort_values()` returns a sorted copy of the data frame `df`, and you can decide the order with the boolean argument `ascending`. If you want to keep the sorted data frame, you have to overwrite the existing one, create a new variable or use the keyword argument `inplace`. See the following code snippet that returns the data frame sorted by the `"length"` in descending order:
```python
df.sort_values("length", ascending=False)
```

In a data frame you can use a so-called query to select specific rows from the data frame. The data frame method `df.query()` takes a string argument with one (or multiple) filter condition(s) and returns the reduced data frame. As an example, see the following code snippet that will return all rows where the length of the password is greater than 10:
```python
df.query("length > 10")
```
You can combine multiple conditions with `and`/`or` or the characters `&`/`|`. If you want to create more complex conditions, you can also use parentheses to group the conditions.
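
As a minimal sketch, the following example sorts a toy data frame and runs two combined queries on it. The columns `"population"` and `"area"` and the index labels are invented for this illustration and have nothing to do with the password data.

```python
import pandas as pd

df_toy = pd.DataFrame(
    dict(population=[500, 3000, 1200, 8000], area=[10, 40, 90, 30]),
    index=["A", "B", "C", "D"],
)

# Sorted copy in descending order, df_toy itself is not changed.
sorted_toy = df_toy.sort_values("population", ascending=False)

# Two conditions combined with "and".
selected = df_toy.query("population > 1000 and area < 50")

# Parentheses group the "or" condition before the "and" is applied.
selected_grouped = df_toy.query(
    "(population < 1000 or population > 5000) and area < 50"
)

print(sorted_toy)
print(selected)
print(selected_grouped)
```

Both methods return a new data frame, so the result of one method can be passed on to the next one.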

1. Try a few different queries with the comparison operators `==`, `<=`, `>` and `!=` to get used to the method `df.query()`. 
2. Use the query method to get all passwords with a length equal to 6 that occur between 5 and 9 times (including 5 and 9) in the data frame.
3. Combine the query and the sorting to find the 10 longest passwords that occur at least twice in the dataset. You can "stack" the methods and write everything in a single line of code.