# **Mid-Term Summative Assessment: Data Science Fundamentals**

- **Duration:** approx. 1-3 hours per exercise (5-15 hours of work in total)
- **Format:** Individual coursework, submitted via Cortex and GitHub
- **Weight:** 30% of module grade
- **Dataset:** `attendance_anonymised.csv`

---

### 🎯 **Learning Outcomes**

By completing this assessment, you will demonstrate your ability to:

* Manage a Python data-science project using Git and GitHub.
* Import, clean, and transform a real-world dataset using `pandas`.
* Visualise data using `matplotlib`/`seaborn`.
* Build and run a simple interactive app with `pyShiny`.
* Perform and interpret a basic statistical analysis (correlation or regression).

### **💡 Tips**

* Always work from an **active virtual environment**.
* Keep your notebook clean and readable.
* Commit after each week's exercise (recommended but not marked).
* Use `requirements.txt` to ensure reproducibility.

## 📘 Instructions

### **Exercise 1: Project Setup (Version Control & Environment)**


1. Go to github.com and create a **new empty GitHub repository** on your GitHub account (no README, .gitignore, or license).
   > *Ensure the repo is public.*
2. Copy the repository URL and paste it below:

   https://github.com/liamthobrien/attendance-analysis.git 



3. In your terminal, navigate to a local folder and run `git clone`:

   ```bash
   cd ./datascience2025/ 
   git clone <your_repo_url>
   ```
   > Tips:
   > 1. Replace `./datascience2025/` with a local path of your choice.
   > 2. Replace `<your_repo_url>` with the URL you copied from GitHub, e.g. https://github.com/user/repo-url.

   This command downloads the remote GitHub repo to the local path. It creates a folder on your machine (e.g. `./datascience2025/<assessment_project_repo>/`) linked to the remote repository on GitHub.
4. Inside this folder, create:

   * `requirements.txt`: include all packages needed for the project (you may copy the one used in class).
   * `README.md`: write exactly:

     ```
     Hello world! I love summative assessments.
     ```
5. Stage and commit both files:

   ```bash
   git add .
   git commit -m "Initial commit with requirements and readme. I really love summatives."
   git push
   ```
   > If you prefer to push your commit through the VS Code user interface, feel free to do so.

✅ **Checkpoint:** Refresh your repository on GitHub. It should show both files.



### **Exercise 2: Data Cleaning & Exploration**



1. Load `attendance_anonymised.csv` using `pandas`.



In [1]:
# your code here

import pandas as pd
attendance_anon = pd.read_csv('attendance_anonymised.csv')

2. Describe the dataset using inbuilt functions such as `.head()`, `.info()`, `.describe()`.


In [2]:
# your code here
attendance_anon.head(50)

Unnamed: 0,Person Code,Unit Instance Code,Calocc Code,Surname,Forename,Long Description,Register Event ID,Object ID,Register Event Slot ID,Planned Start Date,Planned End Date,is Positive,Postive Marks,Negative Marks,Usage Code
0,129,278,2025,Lewis,Ursula,Nursing,37,37,574,2025-04-03,2023-11-01,Y,1,1,O
1,129,492,2023,Lewis,Ursula,Italian,726,37,1040,2023-11-03,2023-05-01,N,0,0,A
2,280,1266,2024,Lim,Michael,History,846,726,1123,2024-07-03,2023-05-07,N,0,0,A
3,280,1266,2024,Lim,Michael,History,846,726,653,2024-10-09,2025-04-29,N,0,0,A
4,280,1266,2023,Lim,Michael,History,846,726,776,2023-12-27,2025-03-22,N,0,0,A
5,280,1266,2025,Lim,Michael,History,295,726,735,2025-08-02,2024-10-09,N,0,0,A
6,280,1266,2023,Lim,Michael,History,295,726,271,2023-07-02,2025-06-26,N,0,0,A
7,280,557,2023,Lim,Michael,Arabic,924,726,1124,2023-09-18,2025-05-25,Y,1,1,Y
8,440,871,2023,Jain,Yang,Database Design,658,846,592,2023-09-19,2023-06-28,N,0,0,A
9,440,871,2023,Jain,Yang,Database Design,682,846,68,2023-07-07,2024-12-20,N,0,0,A
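Step 2 also names `.info()` and `.describe()`. The sketch below shows what those calls might look like. To stay self-contained it rebuilds four columns from the ten rows printed above instead of reading the CSV, so the names `sample`, `summary` and `missing` are illustrative assumptions, not part of the assessment template.

```python
import pandas as pd

# Four columns copied from the ten rows shown by .head() above.
# In the notebook you would call these methods on the full DataFrame instead.
sample = pd.DataFrame({
    "Person Code": [129, 129, 280, 280, 280, 280, 280, 280, 440, 440],
    "Long Description": ["Nursing", "Italian", "History", "History", "History",
                         "History", "History", "Arabic",
                         "Database Design", "Database Design"],
    "Planned Start Date": ["2025-04-03", "2023-11-03", "2024-07-03", "2024-10-09",
                           "2023-12-27", "2025-08-02", "2023-07-02", "2023-09-18",
                           "2023-09-19", "2023-07-07"],
    "Postive Marks": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],  # column name as spelled in the file
})

sample.info()                 # column names, dtypes and non-null counts
summary = sample.describe()   # count, mean, std, min, quartiles, max (numeric columns)
print(summary)

missing = sample.isna().sum() # number of missing values per column
print(missing)
```

Note that `Planned Start Date` is still stored as text at this stage, which `.info()` reports as the `object` dtype on pandas versions before 3.0.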


3. Drop the `Planned End Date` column.


In [8]:
# your code here
attendance_anon_slim = attendance_anon.drop(columns=['Planned End Date'])
attendance_anon_slim.head(10)

Unnamed: 0,Person Code,Unit Instance Code,Calocc Code,Surname,Forename,Long Description,Register Event ID,Object ID,Register Event Slot ID,Planned Start Date,is Positive,Postive Marks,Negative Marks,Usage Code
0,129,278,2025,Lewis,Ursula,Nursing,37,37,574,2025-04-03,Y,1,1,O
1,129,492,2023,Lewis,Ursula,Italian,726,37,1040,2023-11-03,N,0,0,A
2,280,1266,2024,Lim,Michael,History,846,726,1123,2024-07-03,N,0,0,A
3,280,1266,2024,Lim,Michael,History,846,726,653,2024-10-09,N,0,0,A
4,280,1266,2023,Lim,Michael,History,846,726,776,2023-12-27,N,0,0,A
5,280,1266,2025,Lim,Michael,History,295,726,735,2025-08-02,N,0,0,A
6,280,1266,2023,Lim,Michael,History,295,726,271,2023-07-02,N,0,0,A
7,280,557,2023,Lim,Michael,Arabic,924,726,1124,2023-09-18,Y,1,1,Y
8,440,871,2023,Jain,Yang,Database Design,658,846,592,2023-09-19,N,0,0,A
9,440,871,2023,Jain,Yang,Database Design,682,846,68,2023-07-07,N,0,0,A


4. Rename the columns exactly as follows:

   | Old                    | New             |
   | ---------------------- | --------------- |
   | Person Code            | Person Code     |
   | Unit Instance Code     | Module Code     |
   | Calocc Code            | Year            |
   | Surname                | Surname         |
   | Forename               | Forename        |
   | Long Description       | Module Name     |
   | Register Event ID      | Event ID        |
   | Object ID              | Object ID       |
   | Register Event Slot ID | Event Slot ID   |
   | Planned Start Date     | Date            |
   | is Positive            | Has Attended    |
   | Postive Marks          | Attended        |
   | Negative Marks         | NotAttended     |
   | Usage Code             | Attendance Code |



In [10]:
# your code here
# The source column is spelled 'Postive Marks' (sic), so the rename key must use that spelling.
attendance_anon_slim_rename = attendance_anon_slim.rename(columns={'Unit Instance Code': 'Module Code', 'Calocc Code': 'Year', 'Long Description': 'Module Name', 'Register Event ID': 'Event ID', 'Register Event Slot ID': 'Event Slot ID', 'Planned Start Date': 'Date', 'is Positive': 'Has Attended', 'Postive Marks': 'Attended', 'Negative Marks': 'NotAttended', 'Usage Code': 'Attendance Code'})
attendance_anon_slim_rename.head(10)

Unnamed: 0,Person Code,Module Code,Year,Surname,Forename,Module Name,Event ID,Object ID,Event Slot ID,Date,Has Attended,Attended,NotAttended,Attendance Code
0,129,278,2025,Lewis,Ursula,Nursing,37,37,574,2025-04-03,Y,1,1,O
1,129,492,2023,Lewis,Ursula,Italian,726,37,1040,2023-11-03,N,0,0,A
2,280,1266,2024,Lim,Michael,History,846,726,1123,2024-07-03,N,0,0,A
3,280,1266,2024,Lim,Michael,History,846,726,653,2024-10-09,N,0,0,A
4,280,1266,2023,Lim,Michael,History,846,726,776,2023-12-27,N,0,0,A
5,280,1266,2025,Lim,Michael,History,295,726,735,2025-08-02,N,0,0,A
6,280,1266,2023,Lim,Michael,History,295,726,271,2023-07-02,N,0,0,A
7,280,557,2023,Lim,Michael,Arabic,924,726,1124,2023-09-18,Y,1,1,Y
8,440,871,2023,Jain,Yang,Database Design,658,846,592,2023-09-19,N,0,0,A
9,440,871,2023,Jain,Yang,Database Design,682,846,68,2023-07-07,N,0,0,A


5. Convert `Date` to a pandas timestamp.
   > Tip: use `pd.to_datetime()`

In [13]:
# your code here
attendance_anon_slim_rename_timed = attendance_anon_slim_rename.copy()
attendance_anon_slim_rename_timed['Date'] = pd.to_datetime(attendance_anon_slim_rename_timed['Date'])
attendance_anon_slim_rename_timed.head(10)

Unnamed: 0,Person Code,Module Code,Year,Surname,Forename,Module Name,Event ID,Object ID,Event Slot ID,Date,Has Attended,Attended,NotAttended,Attendance Code
0,129,278,2025,Lewis,Ursula,Nursing,37,37,574,2025-04-03,Y,1,1,O
1,129,492,2023,Lewis,Ursula,Italian,726,37,1040,2023-11-03,N,0,0,A
2,280,1266,2024,Lim,Michael,History,846,726,1123,2024-07-03,N,0,0,A
3,280,1266,2024,Lim,Michael,History,846,726,653,2024-10-09,N,0,0,A
4,280,1266,2023,Lim,Michael,History,846,726,776,2023-12-27,N,0,0,A
5,280,1266,2025,Lim,Michael,History,295,726,735,2025-08-02,N,0,0,A
6,280,1266,2023,Lim,Michael,History,295,726,271,2023-07-02,N,0,0,A
7,280,557,2023,Lim,Michael,Arabic,924,726,1124,2023-09-18,Y,1,1,Y
8,440,871,2023,Jain,Yang,Database Design,658,846,592,2023-09-19,N,0,0,A
9,440,871,2023,Jain,Yang,Database Design,682,846,68,2023-07-07,N,0,0,A


6. Filter the DataFrame on **one specific module** (your choice) and plot its **attendance rate over time**
   (x = Date, y = the module's average attendance).



In [14]:
# your code here
attendance_anon_slim_rename_timed['Module Name'].value_counts()

Module Name
Algorithms                1140
Cloud Computing            940
Project Management         870
Quality Assurance          825
Linguistics                780
Arabic                     662
Journalism                 617
History                    600
Pharmacy                   600
System Administration      600
Public Speaking            580
International Business     551
Graphic Design             488
Geography                  448
Music Theory               442
Database Design            420
Theater                    420
French                     390
Chemistry                  360
Ethics                     354
Cybersecurity              326
Biology                    320
Medicine                   316
Leadership                 280
Statistics                 280
Arts & Crafts              266
User Experience Design     240
Digital Systems            228
Japanese                   226
Demographics               210
Nutrition                  200
Data Science               
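The cell above lists the available modules but stops short of the filter and plot that step 6 asks for. One possible way to do both is sketched here. It assumes the `Attended` column from the rename table holds 1 for an attended session and 0 otherwise, and it uses only the rows visible in the earlier previews, so the resulting line is flat and purely illustrative. The names `df`, `module`, `history` and `rate` are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Rows copied from the previews above, using the renamed columns from step 4.
df = pd.DataFrame({
    "Module Name": ["Nursing", "Italian", "History", "History", "History",
                    "History", "History", "Arabic",
                    "Database Design", "Database Design"],
    "Date": pd.to_datetime(["2025-04-03", "2023-11-03", "2024-07-03", "2024-10-09",
                            "2023-12-27", "2025-08-02", "2023-07-02", "2023-09-18",
                            "2023-09-19", "2023-07-07"]),
    "Attended": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
})

module = "History"  # any name from the value_counts() output
history = df[df["Module Name"] == module]

# Attendance rate per date = mean of the 0/1 Attended flag; groupby sorts the dates.
rate = history.groupby("Date")["Attended"].mean()

fig, ax = plt.subplots()
ax.plot(rate.index, rate.values, marker="o")
ax.set_xlabel("Date")
ax.set_ylabel("Attendance rate")
ax.set_title(f"{module}: attendance rate over time")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

On the full dataset each date is likely to cover many register entries, so grouping by week (for example with `.resample("W")` on a `Date` index) may give a smoother line than grouping by day.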

7. **[For L6 students only]** Add a column, titled "Student Overall Attendance", with the average attendance for the student. Add a column, titled "Standardised Student Overall Attendance", with the z-scores of "Student Overall Attendance".
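The two columns in step 7 can be derived with a `groupby` followed by a mapping back onto the rows. The sketch below is one reading of the task, not the required solution: it assumes the z-scores are computed across students (one value per student) using the population standard deviation (`ddof=0`), and it reuses the preview rows with the renamed `Attended` column. `per_student` and `z` are illustrative names.

```python
import pandas as pd

# Rows copied from the previews above, using the renamed columns from step 4.
df = pd.DataFrame({
    "Person Code": [129, 129, 280, 280, 280, 280, 280, 280, 440, 440],
    "Attended": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
})

# Mean of the 0/1 flag per student = that student's overall attendance rate.
per_student = df.groupby("Person Code")["Attended"].mean()

# z-score: distance from the mean student, in standard deviations.
# ddof=0 is an assumption; pandas defaults to ddof=1 (sample standard deviation).
z = (per_student - per_student.mean()) / per_student.std(ddof=0)

# Broadcast the per-student values back onto every row of that student.
df["Student Overall Attendance"] = df["Person Code"].map(per_student)
df["Standardised Student Overall Attendance"] = df["Person Code"].map(z)

print(df.drop_duplicates("Person Code"))
```

Computing the z-scores over rows instead of over students would weight students by how many register entries they have, so it is worth stating which variant you chose.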

✅ **Checkpoint:** Cleaned DataFrame and a line or bar plot of attendance over time for one module.

---



### **Exercise 3: Make it Shiny**



1. Create a file called `app.py`.
2. Build a minimal Shiny app that only displays the plot you created in the previous exercise.
3. Run the app locally using this terminal command:

   ```bash
   shiny run --reload --launch-browser app.py
   ```
   > If your terminal is not in the folder that contains `app.py`, specify the correct path, e.g. `shiny run --reload --launch-browser /path/to/my/app.py`

4. **[For L6 students only]** Enhance the app with a title and at least one more UI element. For example, you could add interactivity, such as letting the user select the module to plot.

✅ **Checkpoint:** The app runs without errors and displays the plot from Exercise 2 in a browser.

---

### **Exercise 4: Comparing Modules**



1. Compute the **mean** and **standard deviation** of attendance per module using pandas `groupby`. Consult the documentation if needed.


In [9]:
# your code here
...
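The cell above is left unfilled. As a hedged illustration of the `groupby` pattern that step 1 describes, the sketch below computes both statistics in one call with `.agg()`. It uses the preview rows from Exercise 2 with the renamed `Attended` column; `df` and `module_stats` are assumed names.

```python
import pandas as pd

# Rows copied from the Exercise 2 previews, using the renamed columns.
df = pd.DataFrame({
    "Module Name": ["Nursing", "Italian", "History", "History", "History",
                    "History", "History", "Arabic",
                    "Database Design", "Database Design"],
    "Attended": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
})

# One row per module, one column per statistic.
# pandas' std uses ddof=1 (sample standard deviation), so a module with a
# single record gets NaN rather than 0.
module_stats = df.groupby("Module Name")["Attended"].agg(["mean", "std"])

print(module_stats.sort_values("mean", ascending=False))
```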

2. Create a `seaborn.barplot` of mean attendance per module with 95% confidence intervals.


In [10]:
# your code here
...
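This cell is also unfilled. A possible shape for the plot in step 2 is sketched below, under the assumption that the row-level data (one row per register entry) is passed to seaborn so that it can compute the means and intervals itself. It relies on `seaborn.barplot`'s default error bars, which are 95% bootstrap confidence intervals. The data are the Exercise 2 preview rows, so the bars are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Rows copied from the Exercise 2 previews, using the renamed columns.
df = pd.DataFrame({
    "Module Name": ["Nursing", "Italian", "History", "History", "History",
                    "History", "History", "Arabic",
                    "Database Design", "Database Design"],
    "Attended": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
})

fig, ax = plt.subplots(figsize=(8, 4))

# Bar height = mean of Attended per module; error bars = seaborn's default
# 95% bootstrap confidence interval around that mean.
sns.barplot(data=df, x="Module Name", y="Attended", ax=ax)

ax.set_xlabel("Module")
ax.set_ylabel("Mean attendance rate")
ax.tick_params(axis="x", rotation=45)  # module names are long
fig.tight_layout()
```

Passing pre-aggregated means instead of row-level data would leave seaborn with one value per bar and therefore no interval to draw.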

3. Write a short interpretation (2–3 sentences) explaining what the plot and error bars show.

    [Double click on this cell to edit and type your answer]

✅ **Checkpoint:** My notebook and/or my Shiny app show a barplot with error bars, one bar per module (x-axis), where the height of the bar is the module's attendance rate (y-axis).

---



### **Exercise 5: Attendance Relationships**



1. Answer the question: **Do students with low overall attendance tend to attend modules that have low attendance?**

2. **[For L5 students only]** Choose between a correlation analysis or a linear regression analysis to answer the question.

3. **[For L6 students only]** Run both correlation analysis and linear regression analysis to answer the same question. Do they tell the same story?

> Tips:
> 1. Compute the mean attendance rate of each module. Can you re-use some of the `groupby` code you wrote above?
> 2. For each module, compute the average overall attendance of the students who attended that module.
> 3. Correlation and linear regression can both help you answer the question. 
> 4. Report the main statistic and p-value(s).


In [11]:
# your code here
...
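The tips above can be sketched as follows. This is one possible reading, assuming `scipy` is available, and it is not a model answer: it builds one row per module with the module's mean attendance and the average overall attendance of its students, then runs both a Pearson correlation and a simple linear regression on those rows. The data are the ten preview rows from Exercise 2, so the printed numbers carry no meaning for the real dataset. `summary`, `pairs` and the column names `module_mean` and `student_avg` are assumptions.

```python
import pandas as pd
from scipy import stats

# Rows copied from the Exercise 2 previews, using the renamed columns.
df = pd.DataFrame({
    "Person Code": [129, 129, 280, 280, 280, 280, 280, 280, 440, 440],
    "Module Name": ["Nursing", "Italian", "History", "History", "History",
                    "History", "History", "Arabic",
                    "Database Design", "Database Design"],
    "Attended": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
})

# Tip 1: mean attendance rate of each module.
module_mean = df.groupby("Module Name")["Attended"].mean()

# Overall attendance of each student, broadcast back onto the rows.
student_overall = df.groupby("Person Code")["Attended"].mean()
df["Student Overall Attendance"] = df["Person Code"].map(student_overall)

# Tip 2: per module, average the overall attendance of its students.
# Keep one row per (student, module) so students with many sessions
# are not counted more than once.
pairs = df.drop_duplicates(["Person Code", "Module Name"])
student_avg = pairs.groupby("Module Name")["Student Overall Attendance"].mean()

summary = pd.DataFrame({"module_mean": module_mean, "student_avg": student_avg})

# Tips 3 and 4: correlation and regression, each with its statistic and p-value.
r, p = stats.pearsonr(summary["student_avg"], summary["module_mean"])
reg = stats.linregress(summary["student_avg"], summary["module_mean"])

print(summary)
print(f"Pearson r = {r:.3f}, p = {p:.3f}")
print(f"Regression slope = {reg.slope:.3f}, p = {reg.pvalue:.3f}, R^2 = {reg.rvalue ** 2:.3f}")
```

With a single predictor, the regression's `rvalue` matches Pearson's r and the two p-values test the same hypothesis, which is a useful starting point for the L6 comparison.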

✅ **Checkpoint:** Table or printout showing correlation or regression results.

# THE END!