# DS 3000 - Assignment 3

**Student Name**: [Julia Ouritskaya]

**Date**: [9/23/2023]


### Submission Instructions
Submit this `ipynb` file to canvas.

The `ipynb` format stores outputs from the last time you ran the notebook.  (When you open a notebook it has the figures and outputs of the last time you ran it too).  To ensure that your submitted `ipynb` file represents your latest code, make sure to give a fresh run `Kernel > Restart & Run All` just before uploading the `ipynb` file to Canvas.

### Academic Integrity

**Writing your homework is an individual effort.**  You may discuss general python problems with other students but under no circumstances should you observe another student's code which was written for this assignment, from this year or past years.  Pop into office hours or DM us in MS Teams if you have a specific question about your work or if you would like another pair of eyes or talk through your code.

Don't forget to cite websites which helped you solve a problem in a unique way.  You can do this in markdown near the code or with a simple one-line comment. You do not need to cite the official python documentation.

**Documentation / style counts for credit**  Please refer to the PEP 8 style guide to improve the readability and consistency of your Python code. For more information, read the following article [How to Write Beautiful Python Code With PEP 8](https://realpython.com/python-pep8/) or ask your TAs for tips.

**NOTE:<span style='color:red'> Write python expressions to answer ALL questions below and ensure that you use the `print()` function to display the output.</span>** Each question should be answered in a new code cell. For example, your solution for question 1.1 should be in a different code cell from your solution for question 1.2.

## Question 1: Loading Data (40 pts)

The data that you are working with contains the exam grades and the overall grade for students who took a statistics course at a university. Write python code to answer the questions below and ensure that you round all numeric calculations to 2 decimal places. 

1. (2pts) Load the attached data into a numpy array: exam_grades.csv
2. (1pt) How many observations and columns are in the data set?
3. (2pts) What are the minimum and maximum overall course grades for the statistics course (i.e. for the entire data set)?
4. (5pts) What time period (e.g. years) does the data set cover? Note: only display unique years (do not display duplicates).
5. (10 pts) For each year that you identified above, count the number of students who took the course each year.
6. (20 pts) For each year that you identified above, calculate the minimum, maximum, mean and standard deviation for each of the three exam scores. Summarize your findings.


In [1]:
import numpy as np 
import pandas as pd

# loads the attached data into a numpy array.
data = pd.read_csv("exam_grades.csv")
data_array = data.to_numpy()

# number of observations and columns in the data set.
observations = data_array.shape[0]
columns = data_array.shape[1]
print(f"Number of observations: {observations}")
print(f"Number of columns: {columns}")

# minimum and maximum overall course grade for the statistics course.
course_grade = data_array[:, -1]
minimum_course_grade = np.min(course_grade)
maximum_course_grade = np.max(course_grade)
print(f"Minimum course grade: {minimum_course_grade:.2f}")
print(f"Maximum course grade: {maximum_course_grade:.2f}")

# time period (e.g. years) the data set covers (does not display duplicates).
years = data_array[:, 0]
unique_years = np.unique(years)
print(f"Time period the data set covers: {unique_years}")

# counts the number of students who took the course in each of the years
# identified above.
for year in unique_years:
    student_count = np.sum(years == year)
    print(f"The number of students who took the course in the year {int(year)}: {student_count}")
    
# calculates the minimum, maximum, mean, and standard deviation for each of
# the three exam scores in the years identified above.
exam_scores = [1, 2, 3]
for year in unique_years:
    year_data = data_array[years == year]
    print(f"\nFindings for the year {int(year)}:")
    for index in exam_scores:
        exam_data = year_data[:, index]
        minimum_exam_grade = np.min(exam_data)
        maximum_exam_grade = np.max(exam_data)
        mean_exam_grade = np.mean(exam_data)
        standard_deviation_exam_grade = np.std(exam_data)
        print(
            f"\nOn exam {index}: \nMinimum: {minimum_exam_grade:.2f} "
            f"\nMaximum: {maximum_exam_grade:.2f} \nMean: {mean_exam_grade:.2f} "
            f"\nStandard deviation: {standard_deviation_exam_grade:.2f}"
        )

Number of observations: 233
Number of columns: 5
Minimum course grade: 43.27
Maximum course grade: 97.57
Time period the data set covers: [2000. 2001. 2002. 2003.]
The number of students who took the course in the year 2000: 86
The number of students who took the course in the year 2001: 75
The number of students who took the course in the year 2002: 36
The number of students who took the course in the year 2003: 36

Findings for the year 2000:

On exam 1: 
Minimum: 46.50 
Maximum: 96.00 
Mean: 74.49 
Standard deviation: 11.39

On exam 2: 
Minimum: 41.00 
Maximum: 99.50 
Mean: 74.19 
Standard deviation: 13.02

On exam 3: 
Minimum: 28.00 
Maximum: 97.00 
Mean: 73.23 
Standard deviation: 15.70

Findings for the year 2001:

On exam 1: 
Minimum: 58.00 
Maximum: 98.00 
Mean: 85.01 
Standard deviation: 8.37

On exam 2: 
Minimum: 41.50 
Maximum: 96.50 
Mean: 71.36 
Standard deviation: 13.27

On exam 3: 
Minimum: 36.40 
Maximum: 98.50 
Mean: 72.74 
Standard deviation: 13.19

Findings for the y
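
As a side note on the per-year loop above, the counts can also be obtained in a single call with `np.unique(..., return_counts=True)`. The sketch below is only an illustration under stated assumptions: the `toy` array is made-up data (not the real `exam_grades.csv`), with the year in column 0 and one exam score in column 1. It also shows that `np.std` defaults to the population standard deviation (`ddof=0`), and that `ddof=1` gives the sample version, in case that is what is wanted.

```python
import numpy as np

# Hypothetical toy data (NOT the real exam_grades.csv):
# column 0 = year, column 1 = a single exam score.
toy = np.array([
    [2000, 70.0],
    [2000, 80.0],
    [2001, 90.0],
    [2001, 60.0],
    [2001, 75.0],
])

years = toy[:, 0]

# return_counts=True returns the unique values and how often each occurs.
unique_years, counts = np.unique(years, return_counts=True)

for year, count in zip(unique_years, counts):
    # Boolean mask selects the rows for this year; column 1 holds the score.
    scores = toy[years == year, 1]
    population_std = np.std(scores)       # ddof=0 (numpy's default)
    sample_std = np.std(scores, ddof=1)   # ddof=1 (sample standard deviation)
    print(
        f"{int(year)}: n={count}, mean={np.mean(scores):.2f}, "
        f"population std={population_std:.2f}, sample std={sample_std:.2f}"
    )
```

Which standard deviation is appropriate depends on whether each year's students are treated as the whole population or as a sample, so the choice of `ddof` is a judgment call rather than a correction.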

## Question 2: Numpy Arrays & Lists
Based on the required reading (and/or any other resources that you prefer), compare and contrast numpy arrays with python lists.
Write python code that contains a simple example which demonstrates one of the differences that you found. Note: do not use the same example as the tutorial. Ensure that you cite any resources in a markdown cell.

In [2]:
# Source: Jake VanderPlas. 2016. Python Data Science Handbook: Essential Tools for Working with Data (1st. ed.). O'Reilly Media, Inc.

# Similarities: numpy arrays and Python lists are both index-based and iterable.
# One difference is that a numpy array holds elements of a single type, while a Python list can hold elements of different types. Another difference is that numpy arrays provide built-in, element-wise numerical operations, while Python lists do not.

# Simple example demonstrating that numpy arrays hold elements of one type while Python lists can mix types.
# The numpy array converts every element to a string data type, while the Python list keeps both integer and string elements.
import numpy as np

# numpy array
numpy_array = np.array([1, 2, "three"])
print("Numpy Array:", numpy_array)
print("Data type:", numpy_array.dtype)

# python list
python_list = [1, 2, "three"]
print("Python List:", python_list)
print("Data type:", [type(element) for element in python_list])

Numpy Array: ['1' '2' 'three']
Data type: <U21
Python List: [1, 2, 'three']
Data type: [<class 'int'>, <class 'int'>, <class 'str'>]
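
The second difference mentioned in the comments above (element-wise computation) can be sketched in the same spirit. This is a minimal illustration with made-up values; the variable names are hypothetical and not part of the assignment data. The `*` operator repeats a Python list but multiplies each element of a numpy array.

```python
import numpy as np

# Made-up scores, used only for illustration.
scores_list = [70, 80, 90]
scores_array = np.array(scores_list)

# On a list, * 2 repeats the list.
doubled_list = scores_list * 2

# On a numpy array, * 2 multiplies each element.
doubled_array = scores_array * 2

print("List * 2:", doubled_list)
print("Array * 2:", doubled_array)
```

To double each element of a plain list, a loop or a list comprehension such as `[score * 2 for score in scores_list]` would be needed instead.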


## Helpful resources 
Don't forget to cite websites which helped you solve a problem in a unique way.  You can do this in markdown near the code or with a simple one-line comment inside the code cell, or you can list them below. 

You do not need to cite the official python documentation.
