## Importing the libraries:

In [1]:
# 1. Importing all the libraries necessary for this assignment
import numpy as np
import pandas as pd

This is data from the last class. For each student it includes their quiz total (in points), their final exam score (in points), their final grade in the class (as a percentage), whether they attended class, and their name (note: names were randomly generated). We are interested in the effect that attending class has on one’s grade.

### The dataset of the last class:

In [2]:
# 2a. Load the data using the following code
print('-------------------------------- 2a --------------------------------')
data = pd.read_csv('problem_set_1_data.csv')

# Take a look at the class dataset
data

-------------------------------- 2a --------------------------------


,quizzes,attendedClass,finalgrade,final_EXAM,student
0,78.50,1,99.38,46,Linnie Lietz
1,64.25,0,95.78,45,Mitch Mustain
2,76.75,0,97.42,41,Salina Chavera
3,69.00,1,92.50,38,Kimberely Conwell
4,64.00,0,92.50,39,Zack Burk
...,...,...,...,...,...
166,73.50,0,96.25,45,Dana Malta
167,75.50,0,98.13,44,Sixta Heyden
168,41.00,0,69.38,32,Brianne Broome
169,77.00,0,101.25,47,Elouise Weatherholt


Now that we've loaded in the dataset, let's go ahead and rename the columns to make the data a bit cleaner.

In [3]:
# 2b. Rename the columns so that they are easier to understand
print('----------------------------- 2b & 2c -----------------------------')
data.rename(columns={
    'quizzes': 'quiz_total',
    'attendedClass': 'attended',
    'finalgrade': 'final_grade',
    'final_EXAM': 'final_exam',
    'student': 'name',
}, inplace=True)

# 2c. Take a look at the first rows of the dataset with the new column names
data.head()

----------------------------- 2b & 2c -----------------------------


,quiz_total,attended,final_grade,final_exam,name
0,78.5,1,99.38,46,Linnie Lietz
1,64.25,0,95.78,45,Mitch Mustain
2,76.75,0,97.42,41,Salina Chavera
3,69.0,1,92.5,38,Kimberely Conwell
4,64.0,0,92.5,39,Zack Burk


In [4]:
# 2c. Take a look at the last rows of the dataset with the new column names
print('-------------------------------- 2c --------------------------------')
data.tail()

-------------------------------- 2c --------------------------------


,quiz_total,attended,final_grade,final_exam,name
166,73.5,0,96.25,45,Dana Malta
167,75.5,0,98.13,44,Sixta Heyden
168,41.0,0,69.38,32,Brianne Broome
169,77.0,0,101.25,47,Elouise Weatherholt
170,72.0,0,92.81,38,Addie Maharaj


Great! We have simplified the names of our columns. Let's summarize the basic information of the data in the class dataset.

In [5]:
# 2d. Now here is the summary of the basic information of the class data
print('-------------------------------- 2d --------------------------------')
data.info()

-------------------------------- 2d --------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171 entries, 0 to 170
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   quiz_total   171 non-null    float64
 1   attended     171 non-null    int64  
 2   final_grade  171 non-null    float64
 3   final_exam   171 non-null    int64  
 4   name         171 non-null    object 
dtypes: float64(2), int64(2), object(1)
memory usage: 6.8+ KB


Here is a more in-depth summary of the numeric columns in the class data.

In [6]:
# 2d. Here is the descriptive summary of the class data of each column
data.describe()

,quiz_total,attended,final_grade,final_exam
count,171.0,171.0,171.0,171.0
mean,69.131579,0.192982,91.85538,39.672515
std,9.488473,0.395798,9.487795,5.511304
min,0.0,0.0,2.5,0.0
25%,65.5,0.0,89.38,37.0
50%,72.0,0.0,93.13,40.0
75%,75.5,0.0,96.36,44.0
max,79.5,1.0,104.38,49.0


Based on our summary above for the last class, we can tell the following:
*   **Average of the Class' Total Quiz Scores**: 69.1 points (Highest being 79.5, Lowest being 0)
*   **Did Students Attend Class?**: Most did not. The mean of the 0/1 "attended" column is 0.19, so only about 19% of students attended.
*   **Average of the Class' Final Exam Score**: 39.7 points (out of 50)
*   **Average Final Grade**: 91.9% (A-)
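
The attendance bullet relies on the fact that the mean of a 0/1 column equals the share of ones. Here is a minimal sketch of that idea. It uses a stand-in Series built from the group sizes reported later in this notebook (33 attended, 138 did not) rather than the real CSV, and the names `attended`, `share_attended`, and `shares` are illustrative:

```python
import pandas as pd

# Stand-in for data['attended']: 33 ones (attended) and 138 zeros (did not attend).
# These counts come from the group sizes reported later in this notebook.
attended = pd.Series([1] * 33 + [0] * 138, name='attended')

# The mean of a 0/1 column is the proportion of ones
share_attended = attended.mean()

# value_counts(normalize=True) reports the same information as explicit proportions
shares = attended.value_counts(normalize=True)

print(round(share_attended, 6))
print(shares)
```

Reading the mean this way only works while the column is still coded as 1 and 0.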



## Subset the data by whether someone attended class:

In [7]:
# 3a. Select the 'attended' column
attendance_subset = data['attended']

# Display summary statistics for the 'attended' column
print('-------------------------------- 3a --------------------------------')
attendance_subset.describe()

-------------------------------- 3a --------------------------------


count    171.000000
mean       0.192982
std        0.395798
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        1.000000
Name: attended, dtype: float64
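
Selecting the 'attended' column summarizes attendance, but it does not split the students into groups. One way to get an actual subset per group is a boolean mask. The sketch below uses a small stand-in DataFrame whose values are copied from the first four rows shown earlier, and it assumes 'attended' still holds 1 and 0. The names `toy`, `attended_rows`, and `absent_rows` are illustrative and not part of the assignment code:

```python
import pandas as pd

# Stand-in for the class data: first four rows shown earlier, two columns only
toy = pd.DataFrame({
    'quiz_total': [78.5, 64.25, 76.75, 69.0],
    'attended': [1, 0, 0, 1],
})

# A boolean mask keeps only the rows where the condition is True
attended_rows = toy[toy['attended'] == 1]
absent_rows = toy[toy['attended'] == 0]

print(attended_rows)
print(absent_rows)
```

Each result is a full DataFrame, so any later summary can be run on one group at a time.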

We're going to replace the values in "attended" so that they are easier to read: 1 means the student attended class ('Yes') and 0 means they did not ('No').

In [8]:
# 3b. Replace current '1' and '0' values with 'Yes' and 'No' to clarify the data better
data['attended'] = data['attended'].replace({1: 'Yes', 0: 'No'})

# Check the new data values
print('-------------------------------- 3b --------------------------------')
data[['attended', 'name']]

-------------------------------- 3b --------------------------------


,attended,name
0,Yes,Linnie Lietz
1,No,Mitch Mustain
2,No,Salina Chavera
3,Yes,Kimberely Conwell
4,No,Zack Burk
...,...,...
166,No,Dana Malta
167,No,Sixta Heyden
168,No,Brianne Broome
169,No,Elouise Weatherholt


Now our class data displays whether students did or didn't attend class. Let's see if class attendance affected one's grades.

In [9]:
# 3c. Calculate average “quiz_total” for those who attended class versus did not attend
print('-------------------------------- 3c --------------------------------')
data[['attended', 'quiz_total']].groupby(['attended']).agg(['mean', 'min', 'max', 'count'])

-------------------------------- 3c --------------------------------


attended,quiz_total mean,quiz_total min,quiz_total max,quiz_total count
No,68.326087,0.0,79.5,138
Yes,72.5,54.0,78.5,33


In [10]:
# 3c. Calculate average “final_grade” for those who attended class versus did not attend
data[['attended', 'final_grade']].groupby(['attended']).agg(['mean', 'min', 'max', 'count'])

attended,final_grade mean,final_grade min,final_grade max,final_grade count
No,91.086159,2.5,104.38,138
Yes,95.072121,85.47,101.25,33


In [11]:
# 3c. Calculate average “final_exam” for those who attended class versus did not attend
data[['attended', 'final_exam']].groupby(['attended']).agg(['mean', 'min', 'max', 'count'])

attended,final_exam mean,final_exam min,final_exam max,final_exam count
No,39.442029,0,49,138
Yes,40.636364,33,47,33


Looking at these results, we can see that a majority of students did not attend class, with 138 not attending and only 33 actually attending the class.

Summing up the results, it appears that on average:
*   Students who attended class had higher total quiz scores (72.5 points) than those who did not attend class (68.3 points)
*   Students who attended class had higher final grades (95.1%, A) than those who did not (91.1%, A-)
*   Students who attended class did better on the final exam (40.6 points) than those who did not (39.4 points)

Thus, the data suggest that class attendance has some impact on one's grades. Although the differences are not large, students who attended class received better grades on average.
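
The three group comparisons above can also be produced in one call by selecting several columns after `groupby`. Here is a sketch on a stand-in DataFrame whose five rows are copied from the `head()` output earlier, with "attended" already recoded. The real analysis would use the full `data`, and the names `toy`, `score_columns`, `summary`, and `mean_gap` are illustrative. It also computes the gap between the group means directly:

```python
import pandas as pd

# Stand-in data: the first five rows shown in the head() output
toy = pd.DataFrame({
    'quiz_total': [78.5, 64.25, 76.75, 69.0, 64.0],
    'attended': ['Yes', 'No', 'No', 'Yes', 'No'],
    'final_grade': [99.38, 95.78, 97.42, 92.5, 92.5],
    'final_exam': [46, 45, 41, 38, 39],
})

score_columns = ['quiz_total', 'final_grade', 'final_exam']

# One groupby covers all three score columns at once
summary = toy.groupby('attended')[score_columns].agg(['mean', 'min', 'max', 'count'])

# Difference in group means (attended minus did not attend) for each score
group_means = toy.groupby('attended')[score_columns].mean()
mean_gap = group_means.loc['Yes'] - group_means.loc['No']

print(summary)
print(mean_gap)
```

With only five rows the gaps here carry no weight; the point is the shape of the call, which returns the same statistics as the three separate cells when run on the full dataset.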

## Based on the data you have, was anyone working together?

In [12]:
# EC. Find the duplicate occurrences of scores in the class
print('--------------------------- Extra Credit ---------------------------')
same_scores = data[data.duplicated(['quiz_total', 'final_grade', 'final_exam'])]

same_scores

--------------------------- Extra Credit ---------------------------


,quiz_total,attended,final_grade,final_exam,name
170,72.0,No,92.81,38,Addie Maharaj


It appears there is one occurrence in the data that suggests two students may have worked together. Addie Maharaj and Jaraham Eidda have exactly the same total quiz score, final exam score, and final grade, so it is possible that these two students were working together during quizzes and exams.
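
By default, `duplicated` flags only the second and later occurrences of a repeated combination, which is why the output above lists a single row. Passing `keep=False` flags every row in a matching group, so both students appear side by side. Here is a sketch on made-up rows; the student names, the scores for "Student B", and the variable names are hypothetical:

```python
import pandas as pd

# Hypothetical rows: Student A and Student C share identical scores
toy = pd.DataFrame({
    'quiz_total': [72.0, 65.5, 72.0],
    'final_grade': [92.81, 88.0, 92.81],
    'final_exam': [38, 35, 38],
    'name': ['Student A', 'Student B', 'Student C'],
})

score_columns = ['quiz_total', 'final_grade', 'final_exam']

# Default keep='first': only the later occurrence is flagged
later_only = toy[toy.duplicated(score_columns)]

# keep=False: every row that has a match is flagged
all_matches = toy[toy.duplicated(score_columns, keep=False)]

print(later_only['name'].tolist())
print(all_matches['name'].tolist())
```

Sorting `all_matches` by the score columns would place matching rows next to each other if there were several groups.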