## 2.1 Reading passwords from a file
You are now going to start working with a tiny part of the actual password dataset. Whenever data is saved in files, you need a way to load it into your programming environment. In this case you have a plain text file with the first 100 passwords from the large dataset. You can open the file in a new tab to check (and edit) its content. If you change something and want to keep the changes, make sure to save the file.

With the following code snippet you can open a file called `some_file.txt` and print the first line from the file. The file is closed automatically at the end of the indented code block in the `with` statement.
```python
with open("some_file.txt") as f:
    print(f.readline())
```
It is a convention to name the file variable `f`. If you want to change the name, you also have to use the new name in the indented block of the `with` statement.

The method `f.readline()` returns the current line as a string and moves the file variable `f` to the next line. If you call the method a second time, you get the second line from the file, and so on. You can call `f.readline()` as often as you want, but it returns an empty string once you have reached the end of the file.
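
To try this behaviour without touching the password file, the following sketch writes a small temporary file and calls `f.readline()` more often than there are lines. The file name `example.txt` and its content are made up for illustration.

```python
import os
import tempfile

# Create a throwaway file with two lines (hypothetical example content)
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("first\nsecond\n")

with open(path) as f:
    line_1 = f.readline()  # first line of the file
    line_2 = f.readline()  # second line of the file
    line_3 = f.readline()  # end of file reached: empty string

print(line_3 == "")  # True
```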

If you already know that you want to iterate over all lines in the file, you can directly use the file variable `f` in a `for` loop. In each step of the loop, a new line will be assigned to the loop variable `line` as a string. The loop will stop automatically once it reaches the end of the file.
```python
with open("some_file.txt") as f:
    for line in f:
        print(line)
```

1. Use a `for` loop to print the first 10 passwords from the file `100.txt`. Open the file in a new tab and compare it to the printed output.
2. Print the _last_ 10 passwords from the file `100.txt`. You can assume that you already know the length of the file (100 lines).
3. Use a `for` loop of your choice to iterate over the entire file `100.txt` and count the total number of characters in the file.
4. Assign the first password in the file to a variable. Output the password (without using `print()`) and calculate the length of the password. Can you see what is "wrong" with the password?
<!-- - Calculate the length of the password  -->


<!-- - You can read the entire file at once with the method `f.readlines()`. The lines in the file will be returned as a list of strings. Assign the list to a new variable and check that its length matches the length of the file. -->
<!-- - Take a closer look at a few passwords in the list and calculate their lengths. Can you see what is "wrong" with the password data in the list? -->

## 2.2 Preprocessing the file data
The reason why the length of the password data was "wrong" in the [previous exercise](#2.1-Reading-passwords-from-a-file) was the newline character `"\n"` at the end of each line. Since this character should not be part of the passwords, you need to remove it from every password before you can start the data analysis.

If you want to remove a trailing (or leading) newline character from a string `s`, you can use the method `s.strip("\n")`. Strings are immutable in Python (they cannot be modified), so the method returns a new string without the newline character, which you have to assign to the variable `s` or to a new variable.
```python
"hello\n".strip("\n")  # returns "hello"
```
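
As a small sketch of the immutability point (the variable names are chosen for illustration only), the original string is unchanged after calling the method, and only the returned string lacks the newline character:

```python
s = "hello\n"
stripped = s.strip("\n")  # returns a new string, s itself is not modified

print(repr(s))         # 'hello\n'
print(repr(stripped))  # 'hello'

# Reassign the result if you want to keep using the name s
s = s.strip("\n")
print(len(s))  # 5
```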
<!-- (By default the method `.strip()` also removes leading and trailing whitespace characters. In the context of this password data this is okay, but you have to keep this in mind if you are working with data where the whitespaces matter! If you want the method to only target newline characters, you can also call the method with the newline character as the argument `s.strip("\n")`.) -->

Another approach to preprocess the data is to read the entire file as a single string using the method `f.read()`. You can then split the string at the newline characters with the method `s.split("\n")`. The resulting substrings will be the individual passwords without newline characters.
```python
"hello\nworld".split("\n")  # returns ["hello", "world"]
```
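
The following sketch combines both steps. It uses `io.StringIO` as a stand-in for an opened file (an assumption made here so that the example runs without a file on disk). With a real file you would call `f.read()` inside the `with` block in the same way.

```python
import io

# io.StringIO behaves like an opened text file (stand-in for open("some_file.txt"))
f = io.StringIO("hello\nworld\npython")

content = f.read()           # the entire content as a single string
words = content.split("\n")  # split at every newline character

print(words)  # ['hello', 'world', 'python']
```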

<!-- Instead of looping over the file variable `f` or repeatedly calling `f.readline()`, you can directly use the method `f.readlines()` to get a list of all the lines in the file `f`. You can then loop over the list again to strip the newline characters from the passwords.
```python
with open("some_file.txt") as f:
    lines = f.readlines()
```
This character should not be part of the passwords 
Since there is not a good reason for a password to ever contain a newline character, the best approach is to remove it from the passwords before storing them in a list. If you want to remove the newline character `"\n"` from each line of the file, you can use the method `line.strip()` (use the name of your string variable in place of `line`). -->
 
1. Read the first 10 passwords from the file again and remove the newline characters before printing the passwords. The empty lines should be gone now.
2. Create an empty list, iterate over the file `100.txt` and strip the newline character from each line before appending it to the empty list.
3. EXTRA: Use a list comprehension to iterate over the file and to strip the newline characters in a single line of code.
4. Compute the total number of characters in the list of passwords. Why is the difference to the result from the [previous exercise](#2.1-Reading-passwords-from-a-file) not 100?
5. Try the second approach with the methods `f.read()` and `s.split("\n")` to get the passwords from the file. Confirm that the total number of characters is equal to the number from the previous task.

## 2.3 Filtering the file data
When you open the text file in a new tab, you can see that a few lines are empty (or look empty at least). If there are no characters left in a password after you have removed the newline character, you do not want to keep the password since an empty string cannot be a valid password. If you have a string variable named `s`, there are two ways to check if it is empty:
```python
if s == "":
    print("The string is empty")

if not s:
    print("The string is still empty")
```
The first option is quite intuitive. The string variable `s` is compared to an empty string with the operator `==` and the result is `True` if the string is actually empty. The second option looks a bit weird at first sight, but it is the "pythonic" way of checking whether a string is empty: an empty string is considered to be `False` in a boolean context, and the keyword `not` negates this to `True`. Instead of using an `if` statement, you can also pass any variable to the built-in function `bool()` to check its boolean value.
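
A minimal sketch of both checks, using two made-up strings:

```python
for s in ["", "abc"]:
    if not s:
        print(repr(s), "is empty")
    else:
        print(repr(s), "is not empty")

# bool() shows the boolean value directly
empty_value = bool("")         # False
non_empty_value = bool("abc")  # True
print(empty_value, non_empty_value)  # False True
```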

1. What is the boolean value of the list `[False, False]`? When would you expect a list to evaluate to `False`? Does the same approach work for dictionaries?
2. Numbers are also considered to be either `True` or `False`. There is only a single number that evaluates to `False`. Can you find out which one it is?
3. Load all the passwords from the file `100.txt` again, but exclude the empty passwords from the list. Use your preferred approach from [exercise 2.2](#2.2-Preprocessing-the-file-data).
4. Check the length of the new list of passwords to see how many passwords were filtered out. Take a look at the file again to see if this number is correct.
5. EXTRA: There is an almost empty line in the file which was not filtered out yet. Find a way to also exclude this line from the passwords.

## 2.4 First password statistics
With the preprocessed and filtered list of passwords you can now start to evaluate the data. As for the single password, the first value you have to compute is the length of each password. The length data will be very important to filter and group the password data during the further analysis.

1. Store the lengths of the passwords in a new list. You can use a regular `for` loop or a list comprehension to iterate over the passwords.
2. Compute the average length of the passwords and assign the result to a variable. Do not use a `for` loop for this task!
3. What is the type of the average length of the passwords? You can pass any variable to the built-in function `type()` to check its type.
4. Compute the [(population) variance](https://en.wikipedia.org/wiki/Variance#Population_variance) of the password lengths. Use a list comprehension if possible. You can square a number `x` using the expression `x**2`.
5. Take the square root of the population variance to compute the standard deviation of the password lengths.

## 2.5 Using arrays for numerical data
In principle you could continue to use lists to store any kind of data related to the passwords. There are no restrictions on the types that you can store in a list and you can modify/slice the lists if you only want to work with a subset of the passwords. The downside is that lists are very limited when it comes to mathematical operations. As you saw in the [last exercise](#2.4-First-password-statistics), something as simple as computing the mean value is not available as a built-in function or list method. And the computation of the population variance even requires an iteration over the entire list.

To overcome this limitation of native Python lists, the package [NumPy](https://numpy.org/) provides so-called arrays that are designed to store homogeneous (and multi-dimensional) numerical data and to run (mathematical) operations on the data. To use the package you have to import it with the `import` statement:
```python
import numpy as np
```
Instead of just using `import numpy` it is a convention to rename the import to `np`. You now have access to the entire NumPy package with the prefix `np.`

1. Import the NumPy package to your notebook. (Note: If a `ModuleNotFoundError` is raised, you do not have NumPy installed on your system yet.)
2. Convert the list of password lengths to a NumPy array with the function `np.array()` and assign it to a new variable. Print or output it to look at the data.
3. An array can be indexed and sliced just like a list. Look at a few single positions and a few index ranges. Can you raise an `IndexError` with the array?
4. You can get the length of an array `a` using the function `len(a)` or the attribute `a.shape`. Can you think of a reason why the shape is a `tuple`?
5. EXTRA: The type of the data in the array is saved in the attribute `a.dtype`. Check the data type of the array of lengths and find out what it means.
6. EXTRA: You can pass the data type of the lengths array to `np.iinfo()` to get some more information. What happens if you try to assign a value outside of the allowed range to the array?

## EXTRA: Finding the best data type
While NumPy arrays are already optimized to store numerical data, you can still do some additional optimization manually. When you pass the list of password lengths to `np.array()`, an array with the dtype `np.int64` is returned (on most platforms). NumPy is able to correctly pick an integer data type since there are only integers in the list of lengths. However, NumPy picks a general default data type that correctly represents the data, not the smallest one. Since the array will only contain lengths of passwords, you can restrict the data type further when creating the array. Alternatively, you can get a copy of an existing array `a` with a different data type using the method `a.astype()`.

Disclaimer: This optimization might not be appropriate yet if you are analyzing fewer than 100 passwords. However, if you are working on millions of passwords, using another data type can save you a lot of memory.
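
As a sketch of the two ways to set a data type, the following example deliberately uses made-up floating point values instead of the password lengths. Finding the best integer type is left to the tasks below.

```python
import numpy as np

a = np.array([1.5, 2.5, 3.5])  # made-up values, NumPy picks float64 here
b = a.astype(np.float32)       # copy of the data with a smaller float type

print(a.dtype, a.itemsize)  # float64 8 (bytes per element)
print(b.dtype, b.itemsize)  # float32 4

# The data type can also be set directly when the array is created
c = np.array([1.5, 2.5, 3.5], dtype=np.float32)
print(c.dtype)  # float32
```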

1. Specify the requirements for the optimal data type of an array to store password lengths.
2. Check the available integer data types provided by NumPy. Which one is the best match for the specified requirements?
3. Create a new array from the list or create a copy of the existing array using the new data type you picked.
4. Use the array attributes `size` and `itemsize` to compute the memory size of both arrays. How much memory did you save with the new data type?
5. Store the maximum allowed integer for the type `np.int64` in the variable `n_max`. Create an array using `np.array([n_max])` and check the data type. How does the data type of the array change if you use `n_max + 1` or `n_max * 10` instead?

## 2.6 Basic array computations
Since NumPy arrays are optimized for numerical data, they already offer a lot of computational capabilities. You can directly use the arithmetic operators `+`, `-`, `*`, `/` and `**` with NumPy arrays, and there are quite a few methods/functions implemented to run more complex computations on single arrays. The most common ones are `sum()`, `mean()`, `std()`, `max()` and `min()`. You can either use them as methods of an array or as functions from the NumPy package. Which option you use is mainly a matter of preference, since most computations are implemented both as functions and as array methods. For example, to compute the sum of all values in an array `a` you can either use the method `a.sum()` or the function `np.sum(a)`. Less common computations such as `np.median()` might only be implemented as a function.

In any case, you should avoid mixing the two options all the time since this can make your code more difficult to read/understand, especially if you use different options for the same computation.
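
The equivalence of the two options can be sketched with a small array of made-up values (not the password lengths):

```python
import numpy as np

a = np.array([2, 4, 4, 10])  # made-up values for illustration

print(a.sum(), np.sum(a))    # 20 20
print(a.mean(), np.mean(a))  # 5.0 5.0
print(a.max(), np.max(a))    # 10 10

# The median is used as a function here
median = np.median(a)
print(median)  # 4.0
```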

1. Try the operators `+`, `-`, `*`, `/` and `**` with the array of lengths and the number 3. How is each operator applied to the array and the scalar value?
2. What happens when you try to multiply the list of passwords with the number 3 (use `print()` to get a compact output)? Why is it not possible to divide the list by 3?
3. Use the computational methods/functions introduced in this exercise on the array of lengths. Check that the mean value and the standard deviation match the results you manually calculated with the list in [exercise 2.4](#2.4-First-password-statistics).
4. EXTRA: Try one more computational method/function with the array of lengths. If you don't know where to start, you can type `lengths.` (use your array name here) in a cell and use tab completion to show you the available methods, or you can check the [NumPy user guide](https://numpy.org/doc/stable/user/quickstart.html#functions-and-methods-overview) for some inspiration. If you already have something in mind, you can also look online to find out if there is an implementation in the NumPy package.

## 2.7 Filtering data in NumPy arrays

In addition to the arithmetic operators you can also use the comparison operators `==`, `>`, `<=` etc. directly with NumPy arrays. If you want to compare an array to another array, they must have the same shape (or at least compatible shapes), otherwise the comparison is not well defined. If you compare an array to a scalar value, every element of the array is compared to the scalar and the result is an array with the same shape. You can directly use the result of the comparison to index an array with the same shape. This allows you to quickly filter data in NumPy arrays. See the following example that will return all values smaller than 5 from the array called `lengths`:
```python
lengths[lengths < 5]
```
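
To make the two steps visible, the following sketch stores the result of the comparison in a variable before using it as an index. The values in `lengths` are made up here and the names `mask` and `selected` are chosen for illustration.

```python
import numpy as np

lengths = np.array([6, 3, 8, 4, 10])  # made-up lengths for illustration

mask = lengths < 5          # element-wise comparison with a scalar
print(mask.tolist())        # [False, True, False, True, False]
print(lengths[mask])        # [3 4]

# Two conditions can be combined element-wise with & (and) or | (or),
# the parentheses around each condition are required
selected = lengths[(lengths > 3) & (lengths < 10)]
print(selected)             # [6 8 4]
```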

1. Use the equality operator with the array of lengths and the number 7. What is the type of the data in the resulting array?
2. Get all lengths greater than 6 from the array and compute their average value. You should not need more than one line of code for this task.
3. Count the number of passwords that have the median length. Can you find more than one way to do this? Hint: In Python `True + True` equals `2`.

## 2.8 A simple plot
The most popular Python package for plotting is [matplotlib](https://matplotlib.org/). You can do everything from a simple line plot to advanced group plots, and the package works great together with NumPy. Instead of importing the entire package `matplotlib` it is usually sufficient if you directly import `matplotlib.pyplot` since the most common functions are all implemented there. By convention you should rename the import to `plt`. You can then create a simple plot of a one-dimensional array `y` with the following function:
```python
plt.plot(y)
```
If you call the function with two or three arguments, you can specify `x` values and you can change the style of the data points with the format string `fmt`:
```python
plt.plot(x, y)
plt.plot(y, fmt)
plt.plot(x, y, fmt)
```
If you want to learn more about the (possible) arguments of the function `plt.plot()`, take a look at the docstring or the documentation at [https://matplotlib.org](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html).
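
As a minimal sketch with made-up data, the format string `"ro--"` below combines a color (`r` for red), a marker (`o` for circles) and a line style (`--` for dashed):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([0, 1, 2, 3])  # made-up x values
y = x**2                    # made-up y data

# plt.plot() returns a list with one line object per plotted dataset
lines = plt.plot(x, y, "ro--")
print(len(lines))  # 1
```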

<!-- where the square brackets indicate that both the `x` (values) and the `fmt` (format) are optional. If there is only one argument, it will be interpreted as the `y` (data). The format is a string that defines the shape of the markers, the line between the points and their color. By default, the plot will be a blue line without any markers.  -->

1. Import the module `matplotlib.pyplot` and rename it to `plt`. What happens when you call the function `plt.plot()` without any data/arguments?
2. Compute the password length distribution with the function `np.bincount()`. If you do not understand the length distribution array, look at the examples in the docstring of the function.
3. Pass the length distribution data to the function `plt.plot()`. If you don't like the solid line, look for other formats in the docstring/documentation.
4. If you only passed the y-data to the plot, why is the x-axis correct anyway? Hint: What are the x-values when you directly plot the array of lengths?

## 2.9 Improving the first plot
If you want to display the length distribution of the passwords, a histogram would be much better suited than the simple plot from the [previous exercise](#2.8-A-simple-plot). In matplotlib you can do this with the function `plt.bar()`. In contrast to `plt.plot()`, this function requires both the x-values and the y-data (height) to display anything.

You can use the function `np.arange()` to generate the x-values for the histogram. This function has essentially the same signature as the built-in function `range()`. You will get an array with integer values from the `start` (defaults to 0) to `stop - 1`.
```python
np.arange(4)  # returns array([0, 1, 2, 3])
```
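
The following sketch uses `np.arange()` with an explicit `start` value and passes the result to `plt.bar()`. The bar heights are made up for illustration and are not taken from the password data.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(2, 6)               # start=2, stop=6 gives the values 2, 3, 4, 5
heights = np.array([1, 4, 3, 2])  # made-up bar heights

print(x.shape == heights.shape)   # True, one x-value per bar
bars = plt.bar(x, heights)        # one bar of the given height at each x-value
```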

1. Look at the docstring or the [documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html) of the function `plt.bar()`. Can you see why both the x-values and the y-data (called height) are required?
2. Get the length distribution data without leading zeros. Try to find a general solution that works for any array of password lengths.
3. Generate an array with the x-values to accompany the histogram data. Make sure that the array has the same shape as the y-data.
4. Create a bar plot using the x-values and the filtered histogram data from task 2.
5. Add axis labels with the functions `plt.xlabel("x")` and `plt.ylabel("y")`, and add a title with the function `plt.title("title")`.
6. Change one property of the bars to further improve the plot (based on your personal preference). You can pick something from the docstring of `plt.bar()` or you can find something else online.

## 2.10 Readable and reusable code with functions
So far you have directly written and executed your code in the cells, which is one of the big advantages of the Jupyter environment. This was a perfectly fine approach since most of the tasks required just a few lines of code and you never had to use more than one `for` loop. Wrapping a task in a function makes sense as soon as you have to repeat the task many times.

If you want to compute the digit sum of a password, there is no function available in native Python or NumPy. You have to iterate over the characters in the password, filter out the numeric characters and convert them to integers before finally summing them up. Implementing the computation of the digit sum in a function will therefore make your code a lot more readable and reusable since you just have to call `compute_digit_sum(password)` for every password in the dataset. You can use the following template to write your own function for the computation of the sum of digits:
```python
def compute_digit_sum(password):
    digit_sum = 
    
    return digit_sum
```
The parameter is called `password` and it is only defined in the scope of the function (indicated by the indentation). The keyword `return` is followed by the computed digit sum that you want to return from the function.
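
To show the same structure with a complete body, here is a sketch of an unrelated function. The name `count_uppercase` and its task are hypothetical and not part of this exercise.

```python
def count_uppercase(password):
    # n_upper and password only exist inside the function
    n_upper = 0
    for char in password:
        if char.isupper():
            n_upper += 1

    return n_upper


# The returned value can be assigned to a variable outside the function
result = count_uppercase("PassWord1")
print(result)  # 2
```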

1. Recall how the computation of the digit sum worked. You can either copy the cell from the previous notebook or you can rewrite the computation.
2. Implement the computation as a function. You can use the template above or you can write the function from scratch.
3. Compute the digit sums for all the passwords and store the result in a NumPy array. You can store it in a list first, but the final result must be a NumPy array.
4. Display the digit sums in a new plot and decide what data to use for the x-values. Label the axes, give the plot a title and modify a few plot properties to your liking.
5. EXTRA: Try to write the computation of the digit sums for all passwords without a function. Do you agree that using the function increases the readability?

## 2.11 Labeling data in plots
When you have multiple lines and/or markers in a single plot, you want to label the data and display the labels in a legend. In matplotlib all the plot functions such as `plt.plot()`, `plt.bar()` etc. have the optional string parameter `label` that allows you to label the data. You can then add a legend to the plot where the markers/lines will be displayed next to their labels. See the following code snippet that creates a simple line plot with a label and a legend:
```python
plt.plot(-np.arange(10) / 2, "r-", label="something linear")
plt.legend()
```
All labels that are assigned in the same cell before calling `plt.legend()` will be included in the legend. You should therefore call `plt.legend()` at the end of a cell. If you want to exclude something from the legend, just omit the parameter `label` for that function.
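
A sketch with three made-up lines, two of them labeled, illustrates which entries end up in the legend:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
plt.plot(x, x / 2, "b-", label="linear")
plt.plot(x, x**2 / 10, "g--", label="quadratic")
plt.plot(x, np.zeros(10), "k:")  # no label, so it is not listed in the legend

# Called last so that all labels assigned above are included
legend = plt.legend()
print([text.get_text() for text in legend.get_texts()])  # ['linear', 'quadratic']
```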

To include some "dynamic" information about the dataset in a label or a title, you can use string formatting to include variables in a string. The recommended way is to use so-called f-strings where the variables are just written directly in the string. As an example, consider a plot where the length distribution of the passwords with a digit sum lower than the value of the variable `max_digit_sum` is shown. You can then use the variable `max_digit_sum` to create the following title:
```python
plt.title(f"Length distribution of passwords with a digit sum lower than {max_digit_sum}")
```
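
Outside of a plot, the same mechanism can be sketched with a few made-up variables. An optional format specifier after a colon controls how a number is displayed.

```python
n_passwords = 100     # made-up example values
filename = "100.txt"
average = 7.256

text = f"Loaded {n_passwords} passwords from {filename}"
print(text)  # Loaded 100 passwords from 100.txt

# Expressions are evaluated inside the curly braces
print(f"Twice as many would be {n_passwords * 2}")  # Twice as many would be 200

# ".2f" rounds the displayed number to two decimal places
rounded = f"Average length: {average:.2f}"
print(rounded)  # Average length: 7.26
```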

1. Create an f-string that presents a password from the dataset including all the properties of the password that you have computed in this notebook.
2. Plot the digit sums as a function of the password length and label the axes accordingly. Use a reasonable format to display the data.
3. Compute the maximum possible digit sum as a function of the password length and include this in the plot by calling `plt.plot()` for a second time.
4. Assign labels to all the data with the `label` parameter and add a legend to the plot. How can you change the position of the legend in the plot?
5. Add a title to the plot that includes the number of passwords in the dataset using an f-string.

## 2.12 Counting characters
In this exercise you are going to count the characters of all passwords you have read from the file in a single dictionary. As a reminder, the count of a character tells you how often this character appeared in the passwords of your dataset.
```python
["abc", "dac", "bba"] -> {"a": 3, "b": 3, "c": 2, "d": 1}
```

1. Write a function that accepts a password as an argument and returns a dictionary with the character counts.
2. Copy the function from task 1 and change it to accept a dictionary as a second argument. Modify the existing dictionary in the function and do not return it.
3. Use the function from task 2 to get the character counts of the entire dataset.
4. EXTRA: Use the function from task 1 and "merge" the individual dictionaries from the dataset. Check that the resulting dictionary is equal to the one from task 3.
5. Display the character counts in a bar plot. Label the axes and use an f-string that includes the length of the dataset in the title.
6. Create three new dictionaries to separate the counts of the alphabetic characters, the numeric characters and the special characters. Create a bar plot for each one and label it accordingly.

## EXTRA: Sorting the character counts
The plots in the previous exercise look a bit messy since the character count data is not ordered in any reasonable way. Due to the iteration over the passwords and the characters, the characters are currently just ordered by their first occurrence in the password dataset. For all three categories it would be nice to sort the values in descending order such that the characters are ordered by their frequencies from left to right in the plot. Alternatively, you could also sort the alphabetic data by the keys in alphabetic order (meaning "A" to "Z"), and you could sort the numeric data by the keys in numeric order (meaning "0" to "9").  

There are two options to approach the sorting. You can either use the native Python function `sorted()` or you can make use of the NumPy package and the functions `np.sort()` and `np.argsort()`. If you want to try the former option, read the section on using the `key` parameter in the `sorted()` function [here](https://realpython.com/sort-python-dictionary/#using-the-key-parameter-and-lambda-functions). If you want to use the NumPy approach, just read the remaining instructions in this exercise.  

Even though NumPy is optimized for numerical data, the sorting functions also work for string data. If you sort an array of strings, the strings are compared character by character, starting with the first character of each string, which gives an alphabetic order from "A" to "Z" (note that uppercase letters are sorted before lowercase letters). Even though this might sound a bit complicated, this approach is used whenever strings have to be sorted, for example books in a library. The catch here is that you have to sort the keys together with the values, but to use the NumPy functions you need to separate the keys and values into two arrays. Note that you cannot directly create arrays from the dictionary keys and dictionary values. You have to pass them to the built-in function `list()` before turning them into an array.
```python
keys = np.array(list(alphabetic_counts.keys()))
values = np.array(list(alphabetic_counts.values()))
```
This will convert the keys and values to NumPy data types. While this might look a bit odd when printing the arrays/dictionaries, this is not an issue for any further analysis or plotting. You can call `np.array()` with the argument `dtype=object` to preserve the native data types.

The function `np.sort()` can sort the arrays individually but if you want to keep the arrays aligned during the sorting, you have to use the function `np.argsort()`. Instead of the sorted array, this function will return indices that you can use to sort both arrays (the keys and the values).
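
The role of the indices can be sketched with a small made-up dictionary (the name `counts` and its content are for illustration only). The same index array `order` is applied to both arrays, so each key stays aligned with its value.

```python
import numpy as np

counts = {"c": 2, "a": 3, "d": 1, "b": 5}  # made-up character counts

keys = np.array(list(counts.keys()))
values = np.array(list(counts.values()))

order = np.argsort(values)     # indices that sort the values in ascending order
print(order.tolist())          # [2, 0, 1, 3]
print(keys[order].tolist())    # ['d', 'c', 'a', 'b']
print(values[order].tolist())  # [1, 2, 3, 5]
```

Reversing the index array with `order[::-1]` gives the descending order.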

1. Write a function that accepts a dictionary and returns a new dictionary that is sorted by the keys. Use your preferred approach to do the sorting.
2. Continue with the function from task 1. Add a string parameter `by` that decides whether the dictionary is sorted by the `"keys"` or by the `"values"`. Also add a boolean parameter `ascending` that decides the order of the sorting. 
3. Start with the alphabetic character counts (you can use the dictionaries from the [previous exercise](#2.12-Counting-characters)). Sort the dictionary by the keys/values in descending/ascending order and display it in a bar plot. Don't forget to add axis labels and a title! Which sorting option (`by` and `ascending`) is your favorite one?
4. Also display the numeric character counts. Reconsider which (sorting) key and order are most suitable to display the data.
5. Display all three categories (alphabetic, numeric and special) in one plot. Call the function `plt.bar()` individually for each category to automatically get the bars in different colors. Label the data and add a legend to the plot.