# `DSML_WS_03` - Introduction to Pandas

Please work on the following tasks **before** the third workshop session.

## 1. Working with student grades in NumPy

Last week, you familiarized yourself with NumPy. Let's check your NumPy knowledge with a small case study.

Imagine you are the teacher of a class of 15 students. During the year, the class has written 3 tests, each with a maximum of 100 points. You want to summarize the students' performances using NumPy.

1. Simulate the described case by creating a two-dimensional NumPy array with each row representing a student and each column representing a test. Generate random scores for each student and test between 0 and 100, and assign the array to a variable called `student_scores`.
2. Oops! You completely forgot Thomas, who joined the class during the school year after the first test. Thomas' score for the second test was 87, and 93 for the third test.
    - Since Thomas does not have a score for the first test, you want to simply use the average score of all other students. Calculate this and assign it to a variable called `avg_first_test`. (Hint: array slicing and the function `np.mean()` might be helpful here.)
    - Add Thomas using `avg_first_test` as his first test score and his actual second and third test scores to `student_scores`.
3. You want to generate the sum of the scores from all three tests for each student. Do this using a matrix multiplication and save the resulting array to a variable called `student_totals`.
4. Finally, you want to transform the total scores in `student_totals` to a percentage of maximum available points. Assign this array to a variable called `student_pct`.

In [None]:
# your code here
import numpy as np

# 15 students (rows) x 3 tests (columns), integer scores from 0 to 100 inclusive
student_scores = np.random.randint(0, 101, size=(15, 3))
print(student_scores)

# Average first-test score across the 15 original students
avg_first_test = np.mean(student_scores[:, 0])
print("Average score for test 1:", avg_first_test)

# Add Thomas as a new row to student_scores
thomas_scores = np.array([[avg_first_test, 87, 93]])
student_scores = np.vstack((student_scores, thomas_scores))
print(student_scores)

# Sum of all scores for each student, via matrix multiplication with a vector of ones
student_totals = student_scores @ np.ones(3)
print("Sum of all scores for each student:", student_totals)

# Total scores as a percentage of the maximum possible score (3 tests x 100 points)
student_pct = student_totals / 300 * 100
print("Total scores as percentage of maximum possible score:", student_pct)
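
Task 3 asks for the totals via a matrix multiplication. As a minimal sketch on a small hand-written array (the name `demo_scores` and its values are made up for illustration), multiplying an (n, 3) score matrix by a vector of three ones should give the same totals as summing along each row:

```python
import numpy as np

# Hypothetical scores for two students on three tests (illustrative values only)
demo_scores = np.array([[80, 70, 90],
                        [55, 65, 75]])

# A vector of ones: each row of demo_scores is multiplied element-wise and summed
ones = np.ones(3, dtype=int)
totals_matmul = demo_scores @ ones

# The same result using a row-wise sum, for comparison
totals_sum = demo_scores.sum(axis=1)

print(totals_matmul)  # [240 195]
print(totals_sum)     # [240 195]

# Percentage of the 300 points available across the three tests
demo_pct = totals_matmul / 300 * 100
print(demo_pct)
```

The row-wise sum is usually the more readable choice; the matrix product is shown here because the task asks for it.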

## 2. Getting started with Pandas

This week, we will be exploring Pandas - a core package for working with data in Python. You can think of Pandas objects as enhanced versions of NumPy arrays. Let's see why.

As always, we first have to import Pandas to use its functionality within this Jupyter notebook. Pandas is commonly abbreviated as `pd`.

In [10]:
import pandas as pd

The Pandas equivalent of a one-dimensional array is a Series object, which you create just like an array, but with `pd.Series` instead of `np.array`. Let's stick with the student grade example from Task 1, but focus on only five students: Helena, Tom, Nina, Sam and Kim, who are 15, 15, 16, 17 and 16 years old, and scored 75, 69, 87, 88 and 54 points on the first test. Create three Pandas Series objects called `names`, `ages` and `scores` to store the respective data about our five students. How do Pandas Series objects differ from NumPy arrays?

In [12]:
# your code here
names = pd.Series(['Helena', 'Tom', 'Nina', 'Sam', 'Kim'])
print(names)

ages = pd.Series([15, 15, 16, 17, 16])
print(ages)

scores = pd.Series([75, 69, 87, 88, 54])
print(scores)

0    Helena
1       Tom
2      Nina
3       Sam
4       Kim
dtype: object
0    15
1    15
2    16
3    17
4    16
dtype: int64
0    75
1    69
2    87
3    88
4    54
dtype: int64
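
One visible difference is the index printed in the left-hand column of each Series. The sketch below illustrates this with the scores from above; using the student names as index labels is an assumption made for illustration and is not part of the task.

```python
import numpy as np
import pandas as pd

# A plain NumPy array can only be addressed by integer position
scores_array = np.array([75, 69, 87, 88, 54])
third_by_position = scores_array[2]

# A Series carries an explicit index; here the labels are the student names
# (an illustrative choice, the task itself keeps the default 0..4 index)
scores_series = pd.Series(scores_array,
                          index=['Helena', 'Tom', 'Nina', 'Sam', 'Kim'])
nina_score = scores_series['Nina']

# The underlying data is still available as a NumPy array
values = scores_series.to_numpy()

print(third_by_position, nina_score)
print(scores_series.index)
```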


At the heart of Pandas are dataframes, the equivalent of two-dimensional arrays. Let's combine our three Series objects into one dataframe using `pd.DataFrame({'name_1': series_1, 'name_2': series_2, ...})` and assign it to a variable called `students`. How does the dataframe differ from a two-dimensional array?

In [13]:
# your code here
students = pd.DataFrame({'name_1': names, 'age_1': ages, 'score_1': scores})
print(students)

   name_1  age_1  score_1
0  Helena     15       75
1     Tom     15       69
2    Nina     16       87
3     Sam     17       88
4     Kim     16       54
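
One difference worth noting: each dataframe column has a label and keeps its own data type, whereas a NumPy array holds a single type for all elements. A small sketch, using a made-up two-row dataframe called `demo_students` (not the `students` dataframe from the task):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row dataframe with one text column and two integer columns
demo_students = pd.DataFrame({
    'name': ['Helena', 'Tom'],
    'age': [15, 15],
    'score': [75, 69],
})

# Each column reports its own dtype
print(demo_students.dtypes)

# Putting the same mixed data into a NumPy array forces one common type,
# so the numbers end up stored as strings
mixed_array = np.array([['Helena', 15, 75],
                        ['Tom', 15, 69]])
print(mixed_array.dtype)
print(mixed_array[0, 1])
```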


You can select specific information from your dataframe using the `.loc[row_label, column_label]` indexer. Return all rows but only the age column using `.loc`.

In [15]:
# your code here
print(students.loc[:, 'age_1'])

0    15
1    15
2    16
3    17
4    16
Name: age_1, dtype: int64
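
`.loc` selects by label, and the shape of the result depends on what you pass in. The following sketch uses a made-up dataframe `demo` with student names as row labels (an assumption for illustration; the `students` dataframe above uses the default integer index):

```python
import pandas as pd

# Hypothetical dataframe with names as row labels
demo = pd.DataFrame(
    {'age': [15, 16, 17], 'score': [75, 87, 88]},
    index=['Helena', 'Nina', 'Sam'],
)

# All rows, one column label: returns a Series
all_ages = demo.loc[:, 'age']

# One row label, one column label: returns a single value
nina_score = demo.loc['Nina', 'score']

# Lists of labels: returns a (smaller) dataframe
subset = demo.loc[['Helena', 'Sam'], ['score']]

print(all_ages)
print(nina_score)
print(subset)
```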


We can also use `.loc` to filter based on certain conditions. For example, to return only Helena's test score, you could write `students.loc[students.name_1 == 'Helena', 'score_1']`. Return all information on students with a score higher than 80.

In [19]:
# your code here
print(students.loc[students.score_1 > 80, :])

  name_1  age_1  score_1
2   Nina     16       87
3    Sam     17       88
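
The condition inside `.loc` is itself a Series of `True`/`False` values, often called a boolean mask. As a rough sketch on a made-up dataframe `demo` (names and values chosen for illustration), the mask can be inspected on its own, and several conditions can be combined with `&` as long as each one is wrapped in parentheses:

```python
import pandas as pd

# Hypothetical three-row dataframe
demo = pd.DataFrame({'name': ['Helena', 'Tom', 'Nina'],
                     'score': [75, 69, 87]})

# The comparison yields one boolean per row
mask = demo['score'] > 70
print(mask)

# Rows where the mask is True, all columns
high = demo.loc[mask, :]
print(high)

# Two conditions combined with & (each in its own parentheses),
# returning only the name column
middle = demo.loc[(demo['score'] > 70) & (demo['score'] < 80), 'name']
print(middle)
```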
