# Chapter 1 - Introduction to Data Analysis

## Exercise 1

**Create a Series of 10 random integers from 70 to 100 (inclusive), representing a student's scores on a monthly exam. Set the index to the month names, starting with February and ending with November.
Then answer the following questions:**

- a. What is the student's average score?
- b. What is the student's average score during the first semester (i.e., the first five months)?
- c. What is the student's average score during the second semester?
- d. Did the student improve their performance in the second half? If so, by how much?

In [20]:
from pandas import Series, DataFrame
from numpy.random import default_rng
import calendar

### Working it out

- How do we define a Series?
- How can we create 10 random integers between 70 and 100?
- How can we set the index to be the month names, starting in February and ending in November?
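
The three questions above map onto three small building blocks. The sketch below is one possible way to assemble them, assuming NumPy's `Generator.integers` (whose upper bound is exclusive unless `endpoint=True` is passed) and the default English locale for `calendar`. The names `rng`, `values`, `months`, and `demo` are illustrative only.

```python
import calendar

from numpy.random import default_rng
from pandas import Series

# A Series is a one-dimensional array of values paired with an index of labels.
rng = default_rng(seed=0)

# integers() excludes the upper bound by default; endpoint=True makes 100 a possible value.
values = rng.integers(70, 100, size=10, endpoint=True)

# month_abbr[0] is an empty string, so index 2 is February and index 11 is November.
months = calendar.month_abbr[2:12]

demo = Series(values, index=months)
print(demo)
```

Passing `endpoint=True` is an alternative to writing `101` as the upper bound; either spelling keeps 100 within reach.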





In [27]:
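# Seed the generator so the scores are reproducible.
gen = default_rng(seed=0)
# The upper bound of integers() is exclusive, so 101 allows scores up to 100.
# month_abbr[2:-1] yields the abbreviations for February through November.
scores = Series(gen.integers(70, 101, 10), index=calendar.month_abbr[2:-1])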

In [28]:
# a. What is the student's average score?
print(f"Yearly average: {scores.mean()}")

# b. What is the student's average score during the first semester (i.e., the first five months)?
# .loc slices by label and includes both endpoints, so 'Feb':'Jun' covers five months.
print(f"First semester average: {scores.loc['Feb':'Jun'].mean()}")
# or
print(f"First semester average: {scores[:5].mean()}")
# or
print(f"First semester average: {scores.iloc[:5].mean()}")
# or (head() returns the first five rows by default)
print(f"First semester average: {scores.head().mean()}")

# c. What is the student's average score during the second semester?
print(f"Second semester average: {scores.loc['Jul':'Nov'].mean()}")
# or
print(f"Second semester average: {scores[5:].mean()}")
# or
print(f"Second semester average: {scores.iloc[5:].mean()}")
# or (tail() returns the last five rows by default)
print(f"Second semester average: {scores.tail().mean()}")

# d. Did the student improve their performance in the second half? If so, by how much?
def calc_performance_diff(scores: Series) -> str:
    # Positive when the second-half mean exceeds the first-half mean.
    performance_diff = scores.loc['Jul':'Nov'].mean() - scores.loc['Feb':'Jun'].mean()

    if performance_diff > 0:
        return f"The student improved their performance in the second half by {performance_diff:.2f}"
    elif performance_diff < 0:
        return f"The student's performance decreased in the second half by {abs(performance_diff):.2f}"
    else:
        return "The student's performance remained the same in both halves"

print(calc_performance_diff(scores))

Yearly average: 81.0
First semester average: 85.4
First semester average: 85.4
First semester average: 85.4
First semester average: 85.4
Second semester average: 76.6
Second semester average: 76.6
Second semester average: 76.6
Second semester average: 76.6
The student's performance decreased in the second half by 8.80
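
Slicing each half by hand works for two semesters. An alternative, shown here only as a sketch, is to label every month with its semester number and let `groupby` compute both means at once. The `semester` array, the `semester_means` and `change` names, and the rebuilt `scores` Series are assumptions made so the block runs on its own.

```python
import calendar

import numpy as np
from numpy.random import default_rng
from pandas import Series

# Rebuild the same kind of Series so this block is self-contained.
gen = default_rng(seed=0)
scores = Series(gen.integers(70, 101, 10), index=calendar.month_abbr[2:-1])

# Positions 0-4 get label 1 (first semester), positions 5-9 get label 2.
semester = np.arange(len(scores)) // 5 + 1

# Grouping by an array of the same length assigns each row to a group by position.
semester_means = scores.groupby(semester).mean()
print(semester_means)

# Positive means improvement, negative means decline.
change = semester_means.loc[2] - semester_means.loc[1]
print(f"Change from first to second semester: {change:+.2f}")
```

The same pattern extends to quarters or any other fixed-size grouping by changing the divisor.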


- e. In which month did the student score the highest?
- f. In which month did the student score the lowest?
- g. In which month did the student score the highest in the second half?
- h. In which month did the student score the lowest in the second half?
- i. In which month did the student score the highest in the first half?
- j. In which month did the student score the lowest in the first half?

In [32]:
# Map each abbreviation in the index (e.g. 'Feb') to its full month name (e.g. 'February').
# A plain dict raises KeyError on an unknown abbreviation instead of hiding the mistake.
month_names = dict(zip(calendar.month_abbr[2:-1], calendar.month_name[2:-1]))

# Label-based slices include both endpoints.
first_half = scores.loc['Feb':'Jun']
second_half = scores.loc['Jul':'Nov']

# e. In which month did the student score the highest?
print(f"Highest score: {month_names[scores.idxmax()]}, score: {scores.max()}")
# f. In which month did the student score the lowest?
print(f"Lowest score: {month_names[scores.idxmin()]}, score: {scores.min()}")
# g. In which month did the student score the highest in the second half?
print(f"Highest score in the second half: {month_names[second_half.idxmax()]}, score: {second_half.max()}")
# h. In which month did the student score the lowest in the second half?
print(f"Lowest score in the second half: {month_names[second_half.idxmin()]}, score: {second_half.min()}")
# i. In which month did the student score the highest in the first half?
print(f"Highest score in the first half: {month_names[first_half.idxmax()]}, score: {first_half.max()}")
# j. In which month did the student score the lowest in the first half?
print(f"Lowest score in the first half: {month_names[first_half.idxmin()]}, score: {first_half.min()}")


Highest score: February, score: 96
Lowest score: September, score: 70
Highest score in the second half: November, score: 95
Lowest score in the second half: September, score: 70
Highest score in the first half: February, score: 96
Lowest score in the first half: May, score: 78
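
Each of the six questions pairs `idxmax`/`idxmin` with `max`/`min` on some slice of the Series. One way to avoid repeating that pattern is a small helper, sketched below. `extremes` and `halves` are hypothetical names, not part of pandas, and the Series is rebuilt here only so the block runs on its own.

```python
import calendar

from numpy.random import default_rng
from pandas import Series

gen = default_rng(seed=0)
scores = Series(gen.integers(70, 101, 10), index=calendar.month_abbr[2:-1])


def extremes(s: Series) -> dict:
    """Return the labels and values of the highest and lowest entries in s."""
    # idxmax/idxmin return the index label of the first occurrence on ties.
    return {
        "highest": (s.idxmax(), s.max()),
        "lowest": (s.idxmin(), s.min()),
    }


halves = {
    "whole period": scores,
    "first half": scores.iloc[:5],
    "second half": scores.iloc[5:],
}
for label, part in halves.items():
    result = extremes(part)
    print(f"{label}: highest {result['highest']}, lowest {result['lowest']}")
```

Because ties resolve to the first matching label, a repeated top score would report only the earliest month.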


In [19]:
print({calendar.month_name[1]: calendar.month_abbr[1]})

{'January': 'Jan'}
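
The lookup above uses index 1 for January because both `calendar.month_name` and `calendar.month_abbr` hold 13 entries, with an empty string at index 0. A short check of that layout follows, using only the standard library. The names `feb_to_nov` and `abbr_to_name` are illustrative, and the actual month strings depend on the active locale.

```python
import calendar

# Both sequences have 13 entries so that month numbers 1-12 can be used directly as indexes.
print(len(calendar.month_abbr))      # 13
print(repr(calendar.month_abbr[0]))  # ''

# Slicing from 2 up to (but not including) the last entry yields months 2 through 11.
feb_to_nov = calendar.month_abbr[2:-1]
print(len(feb_to_nov))               # 10

# Pair each abbreviation with its full name, skipping the empty entry at index 0.
abbr_to_name = dict(zip(calendar.month_abbr[1:], calendar.month_name[1:]))
print(abbr_to_name)
```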
