### Introduction to Python Programming

[**CC-BY-NC-SA**](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en)<br/>
Prof. Dr. Annemarie Friedrich<br/>
with some minor edits by Dr. Jakob Prange<br/>
Faculty of Applied Computer Science, University of Augsburg<br/>

## 4c Python for Data Science

Data Science is, roughly speaking, the field of applying tools from mathematics, statistics, artificial intelligence, and computer engineering to extract knowledge from data. Python is one of the most widely used programming and scripting languages among data scientists. IPython and Jupyter notebooks are a common way to carry out a data analysis, mixing code, textual information, and visualization.

The goal of this lecture is to become familiar with the basics of common data science toolkits (NumPy, Pandas, Matplotlib, and (to a limited extent) Scikit-Learn).


### 13.1 NumPy

NumPy (Numerical Python) is a library for the efficient storage and manipulation of numerical arrays. It also includes a suite of mathematical computing tools. You can find the documentation of NumPy [here](https://numpy.org/doc/stable/reference/).

At the core of NumPy is the `array` class, which can be used to store vectors, matrices, or even tensors (higher-dimensional data).
We can create `array` objects in a number of ways, the simplest of which is to instantiate one from a list.
In the example below, `a.shape` returns the tuple `(6,)`, which tells us that `a` is a vector with six entries.


In [None]:
# It is common to import NumPy like this:
import numpy as np

import pprint as pp

In [None]:
a = np.array([1, 34, 4, 3, 2, 5.6])
print(a)
print(a.shape)

In the next example, we instantiate a NumPy `array` from a two-dimensional list: `m` is a matrix of shape `(2, 3)`, i.e., it has two rows and three columns.

In [None]:
m = np.array([[1, 2, 3], [3, 4, 5]])
pp.pprint(m)
m.shape

Even more complicated, in the next example, we create a tensor of shape `(4, 2, 3)`, which means that it consists of four matrices of shape `(2, 3)` each. We can visualize such cases in three-dimensional space (see slides).

Exercise: Try out what `t.ndim` prints and look up in the documentation what it means.

In [None]:
t = np.array([[[1, 1, 1], [1, 1, 1]], [[2, 2, 2], [2, 2, 2]], [[3, 3, 3], [3, 3, 3]], [[4, 4, 4], [4, 4, 4]]])
pp.pprint(t)
print("t.shape =", t.shape)
# t.ndim

In contrast to Python lists, the entries of a NumPy `array` must all have the same datatype. By default, NumPy chooses a datatype that can represent all entries, e.g., for `[1.2, 4, 2]`, it creates an array of type `float64`. We can also specify the datatype for an array explicitly:

In [None]:
a = np.array([1, 2, 3], dtype="float32")
a
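
The following sketch illustrates how the datatype is chosen. It assumes a standard NumPy installation; the variable names are only for illustration. Note that the width of the default integer type depends on the platform, so the code does not rely on it.

```python
import numpy as np

# Mixed integers and floats: NumPy picks a float type that can hold every entry.
mixed = np.array([1.2, 4, 2])
print(mixed.dtype)  # float64

# Integers only: an integer type is chosen (its exact width depends on the platform).
ints = np.array([1, 2, 3])
print(ints.dtype)

# An explicit dtype overrides the inference.
small = np.array([1, 2, 3], dtype="float32")
print(small.dtype)

# astype() converts an existing array; converting floats to integers
# drops the fractional part.
as_int = mixed.astype("int64")
print(as_int)
```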

__Broadcasting__

An important concept to understand when performing calculations with NumPy is that of __broadcasting__.
The following code accompanies the examples and rules illustrated in the slides.

In [None]:
# Broadcasting
M = np.ones((2, 3))
a = np.arange(3)
pp.pprint(M)
pp.pprint(a)
pp.pprint(a+M)
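
As a small addition to the cell above, the sketch below tries out three shape combinations, including one that cannot be broadcast. The names `col`, `grid`, and `compatible` are made up for this illustration.

```python
import numpy as np

M = np.ones((2, 3))
a = np.arange(3)                  # shape (3,)
col = np.arange(2).reshape(2, 1)  # shape (2, 1)

# (2, 3) + (3,): a is treated as shape (1, 3) and reused for each row.
print((M + a).shape)

# (2, 1) + (3,): both operands are stretched, giving shape (2, 3).
grid = col + a
print(grid)

# (2, 3) + (2,): the trailing dimensions 3 and 2 differ and neither is 1,
# so the shapes are incompatible and NumPy raises a ValueError.
try:
    M + np.arange(2)
    compatible = True
except ValueError as err:
    compatible = False
    print("Not broadcastable:", err)
```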

__Masks__

Masking is a useful tool when it comes to extracting, modifying, or counting particular values in a NumPy `array`.


In [None]:
x = np.array([[0, 8, 6, 2], [0, -1, 3, 0]])
print("x =")
pp.pprint(x)

# Mask
print("\nx != 0 at these positions:")
pp.pprint(x != 0)

The mask `x != 0` returns a matrix of the exact same shape as `x`, whose entries are `True` if the respective value in `x` is not equal to `0`, and `False` otherwise. Next, we use this mask (a) to count how many non-zero items `x` contains, and (b) to retrieve particular values from `x`:

In [None]:
# (a) counting non-zero items
mask = x != 0
pp.pprint(mask)
print(np.sum(mask)) # counts the True values in mask


In [None]:
# (b) retrieving items
y = x[x != 0]
print("y =")
pp.pprint(y) # Note that this returns a vector of those values!
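
Masks can also be used to modify values, which the cells above do not show. The following is a minimal sketch on the same array `x`; the names `clipped` and `in_range` are chosen for illustration only.

```python
import numpy as np

x = np.array([[0, 8, 6, 2], [0, -1, 3, 0]])

# Work on a copy so that the original array stays unchanged.
clipped = x.copy()

# Assigning through a mask only touches the positions where the mask is True.
clipped[clipped < 0] = 0
print(clipped)

# Masks can be combined element-wise with & (and), | (or), and ~ (not).
# The parentheses are required because & binds more tightly than < and >.
in_range = (x > 0) & (x < 8)
print(x[in_range])
```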

We can also perform operations such as `sum` row-wise or column-wise. It may at first be a bit confusing _which_ axis the summation is performed over in the following case, so let's examine this step-by-step.

In [None]:
print("x =")
pp.pprint(x)
print(x.shape)
# summing row-wise: we "generalize" over columns

`(2, 4)` means there are two rows and four columns, i.e., dimension `0` corresponds to the rows and dimension `1` to the columns.
Let's sum over rows, and over columns:

In [None]:
pp.pprint(x)
row_sums = np.sum(x, axis=1) # axis=1: "generalize" over columns / "removing" this dimension by the operation
print("row_sums:", row_sums)

col_sums = np.sum(x, axis=0) # axis=0: "generalize" over rows / "removing" this dimension by the operation
print("col_sums:", col_sums)

Check carefully that you understand this principle. (If you move on to deep learning, for example, the same principle is followed by PyTorch.) In the general case, the `axis` that you specify is that over which the operation generalizes and which, as a result, gets removed by the operation. This also holds if we have more than two dimensions.
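
To see that the rule also holds for more than two dimensions, here is a small sketch that reuses the values of the `(4, 2, 3)` tensor from above and sums over each axis in turn. The `keepdims` argument at the end is an optional extra that keeps the summed axis with length 1 instead of removing it.

```python
import numpy as np

t = np.array([[[1, 1, 1], [1, 1, 1]], [[2, 2, 2], [2, 2, 2]],
              [[3, 3, 3], [3, 3, 3]], [[4, 4, 4], [4, 4, 4]]])

# Summing over an axis removes exactly that entry from the shape.
for axis in range(3):
    summed = np.sum(t, axis=axis)
    print(f"axis={axis}: {t.shape} -> {summed.shape}")

# With keepdims=True, the summed axis is kept with length 1.
kept = np.sum(t, axis=1, keepdims=True)
print(kept.shape)
```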

Next, let's combine the summing operation with a mask:

In [None]:
# how many values less than 6 in each row?
print(np.sum(x < 6, axis=1)) # axis=1: "generalize" over columns
print(np.sum(x < 6, axis=0)) # axis=0: "generalize" over rows

__NumPy Exercises__:

1. Find out how to create a matrix of `(n,m)` zeros or ones easily.

2. Find out (and note for yourself) the differences between `ndim`, `shape`, and `size`.

3. Use the `np.reshape()` function to create matrices and tensors of various sizes from the vector `x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 0])`. Note down the general functionality of `np.reshape()`.

4. Find out how to `concatenate` two vectors. Next, find out how to stack them vertically into a matrix.

In [None]:
# Your code here

### 13.2 Pandas

Pandas is a library offering convenient storage for labeled and heterogeneous data. It is built on top of NumPy and also provides powerful data operations similar to database or spreadsheet operations. You can find the API references of Pandas [here](https://pandas.pydata.org/docs/reference/index.html).

One basic type in Pandas is `Series`, which consists of entries that are each associated with an item in an index (similar to a dictionary).

In [None]:
# It is common to import Pandas like this:
import pandas as pd

In [None]:
grade_list = [4.5, 1.3, 2.7, 2.3]
grade_series = pd.Series(grade_list)
print(grade_series)

grade_dict = {"Susie":4.5, "Tim":1.3, "Lucie":2.7, "Frank":2.3}
grade_series = pd.Series(grade_dict)
grade_series

We can look at the `Index` object like this:

In [None]:
grade_series.index
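
The comparison with a dictionary can be made concrete with a short sketch. It rebuilds the same `grade_series` so that it runs on its own, and it also shows where a `Series` goes beyond a dictionary.

```python
import pandas as pd

grade_series = pd.Series({"Susie": 4.5, "Tim": 1.3, "Lucie": 2.7, "Frank": 2.3})

# Look up a value by its index label, just like a dictionary key.
print(grade_series["Tim"])

# Membership tests check the index labels, not the values.
print("Lucie" in grade_series)

# Unlike a dictionary, a Series supports masks and vectorized operations.
good_grades = grade_series[grade_series < 2.5]
print(good_grades)
print(grade_series.mean())
```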

The second important datatype is `DataFrame`, which is similar to a spreadsheet or database table. A `DataFrame` is a two-dimensional array, or a sequence of __named__ (see column name!) and __aligned__ (as indicated by their index values) `Series` objects. The indices of the `Series` objects must match.

We can simply create a `DataFrame` object from dictionaries as follows:


In [None]:
math_grades = [4.0, 2.5, 1.3, 2.7]
python_grades = [3.0, 1.3, 2.0, 3.0]
names = ["Susie", "Tim", "Tom", "Lisa"]

df = pd.DataFrame({"Name" : names, "Math": math_grades, "Python": python_grades})
display(df)

When constructing a `DataFrame` object from a dictionary as above, Pandas assumes that the dictionary keys are the column names and that each list given as a dictionary value holds that column's entries, one per row. In this case, a numeric index `[0, 1, 2, 3]` is constructed automatically (you can see it in the left-most column, which is not a named `Series` but just the `Index`).
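
The sketch below checks these statements on the same data: each column is a named `Series` that shares the index of the `DataFrame`. The second part, which passes the names as the index, is an optional variation and not part of the example above; `by_name` is a made-up variable name.

```python
import pandas as pd

math_grades = [4.0, 2.5, 1.3, 2.7]
python_grades = [3.0, 1.3, 2.0, 3.0]
names = ["Susie", "Tim", "Tom", "Lisa"]

df = pd.DataFrame({"Name": names, "Math": math_grades, "Python": python_grades})

# Selecting one column returns a Series whose name is the column name
# and whose index is the index of the DataFrame.
math = df["Math"]
print(type(math).__name__, math.name)
print(df.index.tolist())
print(math.index.equals(df.index))

# Variation: use the names as index labels instead of the automatic 0..3.
by_name = pd.DataFrame({"Math": math_grades, "Python": python_grades}, index=names)
print(by_name.index.tolist())
```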

Often, the data is already available in spreadsheet format, or, as in the following case, as a CSV file. Luckily, Pandas provides a number of `read` methods that can read in tabular data from various formats.

In [None]:
# Read in grades and additional information from CSV file
df = pd.read_csv("data/grades.csv", delimiter=" ")
df
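
If the file `data/grades.csv` is not at hand, the behavior of `read_csv` can still be tried out on an in-memory string. The sketch below is only an assumption about what such a file might look like: the column names are the ones used for the grades table later in this notebook, but the two rows are made up and are not the contents of the actual file.

```python
import io

import pandas as pd

# Hypothetical file contents: the rows are made up for illustration only.
csv_text = """Firstname Lastname ClassAttendance HomeworkCompleted Grade
Anna Example 0.9 1.0 1.3
Ben Sample 0.5 0.6 3.7
"""

# io.StringIO lets read_csv treat the string like an open file.
# delimiter=" " tells Pandas that the fields are separated by single spaces.
demo = pd.read_csv(io.StringIO(csv_text), delimiter=" ")
print(demo)
print(demo.dtypes)  # numeric columns are detected automatically
```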

__Pandas Exercises__

_Check out the Pandas API reference to solve the following exercises._

1. Which of [these formats](https://pandas.pydata.org/docs/reference/io.html) do you know already?
2. Make sure you understand the output of the method `describe()` (API see [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)).
3. Sort the table by "Grade", and assign the sorted `DataFrame` object to the variable `df`. What happened to the index?
4. Use `loc` to retrieve and print the entry for index `2`.
5. Assume we want to create an anonymous statistic of class attendance and grade. Use `loc` to retrieve the two relevant columns. (Hint: see [here](https://pandas.pydata.org/docs/user_guide/10min.html#selection),  "Selection by label.")
6. Using `loc`, retrieve the Grade for the entry with index `1`. Do the same using `at`. Find out what the difference between the two functions is.
7. What is the difference between `df.loc[1]` and `df.iloc[1]`? (Hint: try this on our sorted `DataFrame` object from above to see the effect.) Read up on these two methods for accessing data in the API reference.


In [None]:
# Your code here

Similar to NumPy, we can also use __Masking__ with Pandas `DataFrame`s.
The following example illustrates how to find a row with a particular cell value.

In [None]:
df = pd.read_csv("data/grades.csv", delimiter=" ")
display(df)

First, we obtain a mask that indicates, for each index value, whether the cell in the column "Firstname" equals "Lucie". The mask is a Boolean `Series` that marks the matching rows.

In [None]:
display(df["Firstname"] == "Lucie") # Mask

We can use this mask to obtain the relevant row from the dataframe:

In [None]:
display(df[df["Firstname"] == "Lucie"]) # obtain relevant row

In [None]:
df2 = df[df["Firstname"] == "Lucie"] # assign the selected rows (a new DataFrame) to another variable
print(df2["Grade"])

In [None]:
# This is equivalent to:
print("Grade of Lucie", df[df["Firstname"] == "Lucie"]["Grade"])
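
The two-step selection can also be written as a single `loc` call that takes the mask and the column name together. The sketch below uses a tiny made-up table (`df_demo`, not the grades file) so that it runs on its own.

```python
import pandas as pd

# Made-up rows, for illustration only.
df_demo = pd.DataFrame({"Firstname": ["Tim", "Lucie", "Susie"],
                        "Grade": [2.3, 1.7, 3.0]})
mask = df_demo["Firstname"] == "Lucie"

chained = df_demo[mask]["Grade"]     # two steps: select rows, then the column
direct = df_demo.loc[mask, "Grade"]  # one step: rows and column together

print(chained.equals(direct))

# Both results are a Series with one entry that keeps its original index label.
print(direct.index.tolist())
print(direct.iloc[0])  # the bare value
```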

To make clear how the mask works, consider the following extension of the above example:

In [None]:
# Appending another row to the dataframe
new_row = {"Firstname":"Lucie", "Lastname":"Lucky", "ClassAttendance":1.0, "HomeworkCompleted":0.9, "Grade":1.0}
df.loc[len(df)] = new_row  # len(df) is the new index: here it is simple as we have a numeric integer-range based index
display(df)

In [None]:
lucie_subset = df[df["Firstname"] == "Lucie"]
display(lucie_subset)

This time, we have selected all rows for which the condition that the Firstname is Lucie holds. If we select the grade now, we get the grades of both Lucies:

In [None]:
display(lucie_subset["Grade"])

Pandas also provides database-like operations such as `groupby`. The following code first groups the entries of the dataframe by the value of a particular column, and then applies an operation group-wise.

In the following example, Tim and Susie are playing a game. Each has three trials, in which a certain number of blue and a certain number of red points are achieved. The objective of the game is to obtain as many blue points as possible, and as few red points as possible. For each of these objectives, the best result out of the three trials counts.

In [None]:
df3 = pd.DataFrame({"Name" : ["Tim", "Tim", "Tim", "Susie", "Susie", "Susie"],
      "BluePoints" : [1, 5, 3, 8, 2, 1],
      "RedPoints" : [4, 5, 2, 1, 2, 3]})

display(df3)

print("Game Results (best of trials):")
print(type(df3.groupby("Name"))) # This is a DataFrameGroupBy object. We can perform, for example, aggregation operations on it as follows:
results = df3.groupby("Name").agg({"BluePoints": "max", "RedPoints" : "min"})
display(results)
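
To make clear what `groupby` followed by `agg` computes, the following sketch reproduces the same table by hand with the masks introduced earlier. It is meant as an illustration of the idea, not of how Pandas implements grouping internally; `manual` and `manual_results` are made-up names.

```python
import pandas as pd

df3 = pd.DataFrame({"Name": ["Tim", "Tim", "Tim", "Susie", "Susie", "Susie"],
                    "BluePoints": [1, 5, 3, 8, 2, 1],
                    "RedPoints": [4, 5, 2, 1, 2, 3]})

# For each name: select that player's rows with a mask, then aggregate.
manual = {}
for name in df3["Name"].unique():
    rows = df3[df3["Name"] == name]
    manual[name] = {"BluePoints": rows["BluePoints"].max(),
                    "RedPoints": rows["RedPoints"].min()}
manual_results = pd.DataFrame.from_dict(manual, orient="index")

# groupby + agg does the same in one line (and sorts the group names).
grouped = df3.groupby("Name").agg({"BluePoints": "max", "RedPoints": "min"})

print(manual_results.sort_index())
print(grouped)
```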

The final number of points is computed by taking the number of blue points minus that of the red points. Who won the game?

In [None]:
# Add column with final scores
results["FinalScore"] = results["BluePoints"] - results["RedPoints"]
display(results)

Congratulations, you have worked through the basics of NumPy and Pandas. Let's now visualize the data.<br/><br/>

### 13.3 Matplotlib

Matplotlib is a well-tested, cross-platform visualization and plotting library for Python.

In [None]:
# We import Matplotlib like this:
import matplotlib as mpl
import matplotlib.pyplot as plt

First, let's draw a __scatterplot__ to see whether there is any correlation between class attendance and grade, and between homework completion and grade, in our tiny class example above.
Plotting a scatterplot is very easy in Matplotlib: we just need to tell it what the `x` and `y` values are.
There is also a function called `scatter`, but `plot` is faster when used with many datapoints.
The first argument of `plot` below contains the `x`-values, the second the `y`-values, and the third indicates the plotting style: here, we give just the strings `o` and `s`, which indicate the type of marker to be drawn (circle and square).

In [None]:
x1 = df["ClassAttendance"]*100 # converting to percent (happens entry-wise!)
y1 = df["Grade"]
plt.plot(x1, y1, 'o', color='blue', markersize=10, label="ClassAttendance")

x2 = df["HomeworkCompleted"]*100 # converting to percent (happens entry-wise!)
y2 = df["Grade"]
plt.plot(x2, y2, 's', color='red', markersize=10, label="HomeworkCompleted") # s=square

# Let's add a legend
plt.legend()

# Let's add axis labels
plt.xlabel("Percent")
plt.ylabel("Grade")
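
The style string passed as the third argument decides whether `plot` draws markers, a line, or both. The sketch below compares three variants on made-up values (not the grades data) and prints the marker and line style that Matplotlib stored for each resulting line.

```python
import matplotlib.pyplot as plt

# Made-up values, for illustration only.
attendance = [50, 70, 90, 100]
grades = [3.7, 3.0, 2.0, 1.3]

# plot returns a list with one line object per plotted dataset.
(markers_only,) = plt.plot(attendance, grades, "o")  # circles, no connecting line
(with_line,) = plt.plot(attendance, grades, "s--")   # squares joined by a dashed line
(plain,) = plt.plot(attendance, grades)              # no style string: solid line

for line in (markers_only, with_line, plain):
    print(repr(line.get_marker()), repr(line.get_linestyle()))
```

Without a marker in the style string, `plot` connects the points with a line, which is why the scatterplot above passes `'o'` and `'s'`.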


Voilà! What can we read from this scatterplot?

Often, it is useful to check the data using __histogram__ plots. A histogram tells us how many instances fall into a particular bin. Let's look at the following example to illustrate this.

Assume we walk around for a month and measure everyone whom we meet. We record their height and gender, and have the following dataset in the end. (The data is completely made up!)

In [None]:
gender = ["m"]*100 + ["f"]*100 + ["d"]*100
m_heights = np.minimum(np.random.normal(1.9, 0.1, 100), [2.3]*100)
f_heights = np.minimum(np.random.normal(1.55, 0.1, 100), [2.0]*100)
d_heights = np.minimum(np.random.normal(1.7, 0.1, 100), [2.1]*100)
heights = np.concatenate((m_heights, f_heights, d_heights))

df = pd.DataFrame({"Gender" : gender, "BodyHeight" : heights})
display(df)

In [None]:
plt.hist(df["BodyHeight"], bins=30) # Try out varying number of bins
plt.ylabel("Number of measured people with this height")
plt.xlabel("Height in m")
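
The counts behind a histogram can also be computed without plotting, using `np.histogram`. The sketch below uses a handful of made-up heights, given in centimeters so that the bin edges are whole numbers, and fixed bin edges instead of a number of bins.

```python
import numpy as np

# Made-up heights in cm, for illustration only.
heights_cm = np.array([150, 155, 160, 162, 170, 171, 172, 190])
bin_edges = [150, 160, 170, 180, 190, 200]

# counts[i] is the number of values in the bin from edges[i] to edges[i + 1].
# Each bin includes its left edge but not its right edge; only the last bin
# also includes its right edge.
counts, edges = np.histogram(heights_cm, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right} cm: {count}")
```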

As a full example, let us create one histogram per gender:

In [None]:
kwargs = dict(alpha=0.5, bins=20) # Keyword arguments that we will reuse for each histogram
# **kwargs "unfolds" the dictionary into its key-value pairs as they are required as arguments.
# alpha=0.5 indicates the transparency (try out other values)

plt.hist(df[df["Gender"] == "m"]["BodyHeight"], **kwargs, label="m")
plt.hist(df[df["Gender"] == "f"]["BodyHeight"], **kwargs, label="f")
plt.hist(df[df["Gender"] == "d"]["BodyHeight"], **kwargs, label="d")
plt.ylabel("Number of measured people with this height")
plt.xlabel("Height in m")
plt.legend()
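
The `**kwargs` unpacking used above is a general Python feature and not specific to Matplotlib. The following sketch demonstrates it with a hypothetical helper function, `describe_histogram`, which exists only for this illustration.

```python
def describe_histogram(label, alpha=1.0, bins=10):
    """Hypothetical helper that just reports the arguments it received."""
    return f"{label}: alpha={alpha}, bins={bins}"


kwargs = dict(alpha=0.5, bins=20)

# ** unpacks the dictionary into keyword arguments ...
unpacked = describe_histogram("m", **kwargs)
# ... so the call above is the same as writing them out explicitly.
explicit = describe_histogram("m", alpha=0.5, bins=20)

print(unpacked)
print(unpacked == explicit)
```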

__Matplotlib (& Pandas) Exercise__

In this exercise, your task is to visualize the label distribution of the paragraphs in the [Brown Corpus](https://en.wikipedia.org/wiki/Brown_Corpus). The dataset consists of a set of texts taken from 15 categories.

You may have to refer to the API references of Matplotlib to solve this exercise. Make sure you understand the code.

1. From the provided dataframe, get the list of categories.
2. For each category, count how many paragraphs (i.e., rows) are in the dataframe.
3. Create a bar plot showing how many instances there are for each category.
4. Rotate the `xticks` such that they are vertical.
5. Add labels to the `x` and `y` axes stating "category" and "number of instances."
6. Add labels to each bar stating the number of instances of the respective category.
7. If you followed the steps above, the labels of the bars probably overlap. Find out how to make the plot wider such that it looks nice.
8. Finally, increase the font size of the `xlabel` and the `ylabel` and add a title to your plot (e.g., "Brown Corpus Category Distribution for Paragraphs").

In [None]:
df = pd.read_json("data/brown.json")