# `DSML_WS_03` - Introduction to Pandas

Please work on the following tasks **before** the third workshop session.

## 1. Working with student grades in NumPy

Last week, you familiarized yourself with NumPy. Let's check your NumPy knowledge with a small case study.

Imagine you are the teacher of a class of 15 students. During the year, the class has taken 3 tests, each with a maximum of 100 points. You want to summarize the students' performance using NumPy.

1. Simulate this case by creating a two-dimensional NumPy array in which each row represents a student and each column represents a test. Generate a random score between 0 and 100 for each student and test, and assign the array to a variable called `student_scores`.
2. Oops! You completely forgot Thomas, who joined the class during the school year, after the first test. Thomas scored 87 on the second test and 93 on the third.
    - Since Thomas has no score for the first test, you decide to use the average first-test score of all other students instead. Calculate this average and assign it to a variable called `avg_first_test`. (Hint: array slicing and the function `np.mean()` might be helpful here.)
    - Add Thomas to `student_scores`, using `avg_first_test` as his first test score together with his actual second and third test scores.
3. You want to compute the sum of the scores from all three tests for each student. Do this using matrix multiplication and save the resulting array to a variable called `student_totals`.
4. Finally, you want to express the total scores in `student_totals` as a percentage of the maximum available points. Assign this array to a variable called `student_pct`.

In [3]:
help(np.random.randint)

Help on method randint in module numpy.random:

randint(low, high=None, size=None, dtype=<class 'int'>) method of numpy.random.mtrand.RandomState instance
    randint(low, high=None, size=None, dtype=int)

    Return random integers from `low` (inclusive) to `high` (exclusive).

    Return random integers from the "discrete uniform" distribution of
    the specified dtype in the "half-open" interval [`low`, `high`). If
    `high` is None (the default), then results are from [0, `low`).

    .. note::
        New code should use the `~numpy.random.Generator.integers`
        method of a `~numpy.random.Generator` instance instead;
        please see the :ref:`random-quick-start`.

    Parameters
    ----------
    low : int or array-like of ints
        Lowest (signed) integers to be drawn from the distribution (unless
        ``high=None``, in which case this parameter is one above the
        *highest* such integer).
    high : int or array-like of ints, optional
        If provided, o

In [2]:
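# your code here
import numpy as np

# create a 15x3 array of random integer scores;
# the upper bound 101 is exclusive, so 100 is the highest possible score
student_scores = np.random.randint(0, 101, size=(15, 3))
print(student_scores)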

[[39  6 18]
 [83 27 59]
 [ 6 84 64]
 [93 48 81]
 [12 44 81]
 [82 72 58]
 [ 6 17 56]
 [92 80 60]
 [47 50 76]
 [70 69  5]
 [25 88 36]
 [20 64 35]
 [82 89 36]
 [51 57 76]
 [52 30 17]]
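
The help text above notes that new code should use a `Generator` instance instead of `np.random.randint`. A minimal sketch of that alternative follows; the seed value and the variable name `example_scores` are assumptions made for illustration and are not part of the exercise.

```python
import numpy as np

# Seeded generator so the example is reproducible (the seed 42 is an arbitrary choice)
rng = np.random.default_rng(42)

# endpoint=True makes the upper bound inclusive, so scores range from 0 to 100
example_scores = rng.integers(0, 100, size=(15, 3), endpoint=True)
print(example_scores.shape)  # (15, 3)
```

Without `endpoint=True`, the upper bound would have to be 101, just as with `np.random.randint`.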


In [4]:
# select all rows (:) and the 1st column (0)
first_test_scores = student_scores[:,0]
print(first_test_scores)

[39 83  6 93 12 82  6 92 47 70 25 20 82 51 52]


In [107]:
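# calculate the average of the first test;
# dtype=int truncates the result to a whole number, so the score array stays integer-typed
avg_first_test = np.mean(first_test_scores, dtype=int)
print(avg_first_test)

45


In [108]:
# create Thomas' scores, using the class average for the first test he missed
thomas_scores = np.array([avg_first_test, 87, 93])
print(thomas_scores)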

[45 87 93]


In [109]:
# update students array
student_scores = np.vstack([student_scores, thomas_scores])
print(student_scores)

[[97 79 35]
 [ 1 85 72]
 [20 95 12]
 [45 48 81]
 [ 3 82 62]
 [31 92 57]
 [14 11 48]
 [25 30 34]
 [68 19 62]
 [50 85 71]
 [84 20 77]
 [69 50 77]
 [55 69 96]
 [55  2 60]
 [72 26 81]
 [45 87 93]]
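
Tasks 3 and 4 are not worked out in the cells above. The following is a minimal sketch of one possible approach, not a reference solution. It assumes a small stand-in array `demo_scores` (the first two rows and Thomas' row from the output above) so that the result is easy to check by hand; the names `demo_scores`, `demo_totals` and `demo_pct` are hypothetical. The same two steps could be applied to `student_scores` to obtain `student_totals` and `student_pct`.

```python
import numpy as np

# Stand-in for student_scores: three students, three tests
demo_scores = np.array([[97, 79, 35],
                        [1, 85, 72],
                        [45, 87, 93]])

# Multiplying by a vector of ones adds up each row: (n, 3) @ (3,) -> (n,)
ones = np.ones(demo_scores.shape[1], dtype=int)
demo_totals = demo_scores @ ones
print(demo_totals)  # [211 158 225]

# Each test is worth 100 points, so the maximum is 100 * number of tests
max_points = 100 * demo_scores.shape[1]
demo_pct = demo_totals / max_points * 100
print(demo_pct.round(1))
```

The matrix product with a ones vector gives the same result as `demo_scores.sum(axis=1)`; it is used here only because the task asks for a matrix multiplication.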


## 2. Getting started with Pandas

This week, we will be exploring Pandas, a core package for working with data in Python. You can think of Pandas objects as enhanced versions of NumPy arrays. Let's see why.

As always, we first have to import Pandas to use its functionality within this Jupyter notebook. Pandas is commonly abbreviated as `pd`.

In [110]:
import pandas as pd

The Pandas equivalent of a one-dimensional array is a Series object. You create one just like an array, but with `pd.Series` instead of `np.array`. Let's stick with the student grade example from Task 1, but focus on only five students: Helena, Tom, Nina, Sam and Kim, who are 15, 15, 16, 17 and 16 years old, and scored 75, 69, 87, 88 and 54 points on the first test. Create three Pandas Series objects called `names`, `ages` and `scores` to store the respective data about our five students. How do Pandas Series objects differ from NumPy arrays?

In [121]:
# your code here

names = pd.Series(["Helena", "Tom", "Nina", "Sam", "Kim"])
ages = pd.Series([15, 15, 16, 17, 16])
scores = pd.Series([75, 69, 87, 88, 54])

# sep="\n" puts each Series on its own line without a stray leading space
print("names:", names, sep="\n")
print("\nages:", ages, sep="\n")
print("\nscores:", scores, sep="\n")


# help(pd.Series)
# d = {'a': 1, 'b': 2, 'c': 3}
# ser = pd.Series(data=d, index=['a', 'b', 'c'])

# differences: a Series has an explicit, labeled index alongside its values, an optional name, …

names:
0    Helena
1       Tom
2      Nina
3       Sam
4       Kim
dtype: object

ages:
0    15
1    15
2    16
3    17
4    16
dtype: int64

scores:
0    75
1    69
2    87
3    88
4    54
dtype: int64
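
One difference from NumPy arrays is the index. The sketch below assumes we label the scores by student name instead of using the default integer index; the variable name `scores_by_name` and the Series name `"score"` are hypothetical and not part of the exercise.

```python
import pandas as pd

# Same scores as above, but labeled by student name instead of the default 0..4 index
scores_by_name = pd.Series(
    [75, 69, 87, 88, 54],
    index=["Helena", "Tom", "Nina", "Sam", "Kim"],
    name="score",
)

print(scores_by_name["Nina"])     # label-based lookup -> 87
print(scores_by_name.to_numpy())  # the underlying values as a plain NumPy array
```

A plain NumPy array only supports positional access, so the equivalent lookup there would require knowing that Nina sits at position 2.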


At the heart of Pandas are dataframes, the equivalent of two-dimensional arrays. Let's combine our three Series objects into one dataframe using `pd.DataFrame({'name_1': series_1, 'name_2': series_2, ...})` and assign it to a variable called `students`. How does the dataframe differ from a two-dimensional array?

In [136]:
# your code here
students = pd.DataFrame({
    'Name': names,
    'Age': ages,
    'Score': scores
})

students

# differences: each column can have its own data type, rows and columns are labeled, many operations are easier, …


|   | Name   | Age | Score |
|---|--------|-----|-------|
| 0 | Helena | 15  | 75    |
| 1 | Tom    | 15  | 69    |
| 2 | Nina   | 16  | 87    |
| 3 | Sam    | 17  | 88    |
| 4 | Kim    | 16  | 54    |
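
To make the contrast with a two-dimensional array concrete, here is a small sketch. It rebuilds the same table under the hypothetical name `demo_students` so that it runs on its own, then compares the per-column data types with the single data type of the NumPy conversion.

```python
import pandas as pd

# Rebuild the table from plain lists so the example is self-contained
demo_students = pd.DataFrame({
    "Name": ["Helena", "Tom", "Nina", "Sam", "Kim"],
    "Age": [15, 15, 16, 17, 16],
    "Score": [75, 69, 87, 88, 54],
})

# Each column keeps its own data type
print(demo_students.dtypes)

# Converting to one NumPy array forces a single common data type for all values
as_array = demo_students.to_numpy()
print(as_array.dtype)  # object, because strings and integers are mixed
```

The conversion also drops the row and column labels, which is why positional indexing is the only option on `as_array`.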


You can select specific information from your dataframe using the `.loc[row_label, column_label]` indexer. Return all rows but only the `Age` column using `.loc`.

In [157]:
# your code here
students.loc[:,'Age']


0    15
1    15
2    16
3    17
4    16
Name: Age, dtype: int64
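
Beyond selecting a whole column, `.loc` also accepts single labels, label slices and lists of column names. The sketch below illustrates this on a self-contained copy of the table; the names `demo_students`, `nina_name` and `subset` are hypothetical.

```python
import pandas as pd

demo_students = pd.DataFrame({
    "Name": ["Helena", "Tom", "Nina", "Sam", "Kim"],
    "Age": [15, 15, 16, 17, 16],
    "Score": [75, 69, 87, 88, 54],
})

# One row label and one column label return a single value
nina_name = demo_students.loc[2, "Name"]
print(nina_name)  # Nina

# Label slices with .loc include BOTH endpoints, so 1:3 returns rows 1, 2 and 3
subset = demo_students.loc[1:3, ["Name", "Score"]]
print(subset)
```

Note the contrast with NumPy and plain Python slicing, where the end of a slice is exclusive.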

We can also use `.loc` to filter rows based on conditions. For example, to return only Helena's test score, you could write `students.loc[students['Name'] == 'Helena', 'Score']`. Return all information on students with a score higher than 80.

In [164]:
# your code here
students.loc[students['Score'] > 80]


|   | Name | Age | Score |
|---|------|-----|-------|
| 2 | Nina | 16  | 87    |
| 3 | Sam  | 17  | 88    |
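
A condition can also be combined with a column selection, or with a second condition. The following sketch shows both on a self-contained copy of the table; the names `demo_students`, `high_scorers` and `older_high_scorers`, and the age threshold of 17, are assumptions chosen for illustration.

```python
import pandas as pd

demo_students = pd.DataFrame({
    "Name": ["Helena", "Tom", "Nina", "Sam", "Kim"],
    "Age": [15, 15, 16, 17, 16],
    "Score": [75, 69, 87, 88, 54],
})

# Boolean mask for the rows, list of labels for the columns
high_scorers = demo_students.loc[demo_students["Score"] > 80, ["Name", "Score"]]
print(high_scorers)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
older_high_scorers = demo_students.loc[
    (demo_students["Score"] > 80) & (demo_students["Age"] >= 17)
]
print(older_high_scorers)
```

The parentheses matter because `&` binds more tightly than the comparison operators in Python.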
