# COGS 108 - Data Checkpoint

# Names

- Sharon Chen
- Pamela Ghag
- Yuzi Chu
- Cheng Chang
- Stanley Hahm

<a id='research_question'></a>
# Research Question

Are there significant differences among courses under the Humanities, Social Sciences, and STEM departments at UCSD in terms of the correlation between course difficulty (as indicated through average GPA) and instructors’ ratings on the student-feedback platform (CAPE)?

# Dataset(s)

#### Dataset Name: CAPE

Link to the dataset: https://raw.githubusercontent.com/dcao/seascape/master/data/data.csv

Number of observations: 51281 rows (before cleanup)

Description: The CAPE dataset records, for each course offering, the instructor teaching the course, the course, the term, the average GPA received, the average GPA expected, the percentage of students who recommended the instructor, and the field of study the course is in. The raw dataset includes observations from Fall Quarter 2007 up to and including Spring Quarter 2020.
 
Each observation has the following columns:

- instr: instructor name

- course: course name

- term: school term

- enrolled: number of students enrolled

- evals: number of students who submitted an evaluation

- recClass: percentage of students who recommend the class

- recInstr: percentage of students who recommend the instructor

- hours: estimated hours spent per week to study for the course

- gpaExp: average expected GPA

- gpaAvg: actual average GPA of course

(column names comparison: https://cape.ucsd.edu/responses)
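
The column descriptions above imply a few range constraints that can be checked before any cleaning. The following is a minimal sketch, not part of the original notebook: the two rows are copied from the first rows of the dataset shown later in this document, and the constraints themselves (evaluations never exceed enrollment, percentages lie in 0 to 100, GPAs lie on a 0 to 4 scale) are our assumptions about how CAPE reports its values.

```python
import numpy as np
import pandas as pd

# Two rows copied from the raw dataset (AAS 10 and AAS 190)
toy = pd.DataFrame({
    "enrolled": [65, 19],
    "evals":    [29, 5],
    "recClass": [89.0, 100.0],
    "recInstr": [96.0, 100.0],
    "hours":    [4.5, 2.1],
    "gpaExp":   [3.77, 4.0],
    "gpaAvg":   [3.33, np.nan],
})

# Assumed constraints; missing GPAs are skipped rather than counted as failures
checks = {
    "evals_le_enrolled": (toy["evals"] <= toy["enrolled"]).all(),
    "rec_class_in_range": toy["recClass"].between(0, 100).all(),
    "rec_instr_in_range": toy["recInstr"].between(0, 100).all(),
    "gpa_exp_in_range": toy["gpaExp"].dropna().between(0, 4).all(),
    "gpa_avg_in_range": toy["gpaAvg"].dropna().between(0, 4).all(),
}
print(checks)
```

Running the same checks on the full DataFrame would flag any rows that break these assumptions before they reach the analysis.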
 
We will compare course difficulty against instructor ratings to find out whether the correlation between the two depends on the field of study a course falls under. Because we measure course difficulty by average GPA (i.e., a higher average GPA means lower difficulty and a lower average GPA means higher difficulty), we decided to remove any observations with a null value in the average GPA column, as they would not help us reach a conclusion. We will also drop observations from Winter Quarter 2020 and Spring Quarter 2020, as classes in these quarters were taken during the pandemic and their evaluations would not be an objective indication of the professors' capabilities as teachers.
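
To make the planned comparison concrete, here is a rough sketch of computing the correlation between average GPA and instructor rating within each field of study. It runs on made-up toy values, and the `discipline` column, its labels, and the helper `corr_by_group` are hypothetical; the real dataset has no such column until courses are classified.

```python
import pandas as pd

# Toy values for illustration only; `discipline` is a hypothetical column
toy = pd.DataFrame({
    "discipline": ["STEM"] * 4 + ["Humanities"] * 4,
    "gpaAvg":   [2.8, 3.0, 3.2, 3.4, 3.3, 3.5, 3.7, 3.9],
    "recInstr": [70.0, 80.0, 85.0, 95.0, 96.0, 90.0, 93.0, 88.0],
})

def corr_by_group(df, group_col="discipline", x="gpaAvg", y="recInstr"):
    """Pearson correlation between x and y, computed separately per group."""
    return pd.Series({
        name: group[x].corr(group[y])  # Series.corr defaults to Pearson
        for name, group in df.groupby(group_col)
    })

corr = corr_by_group(toy)
print(corr)
```

Whether the per-group correlations differ significantly would still need a formal test; this sketch only produces the quantities to be compared.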

# Setup

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# The seaborn library makes plots look nicer
# sns.set(context = 'talk', style='white')

# Round decimals when displaying DataFrames
pd.set_option('display.precision', 2)

# Make plots just slightly bigger for displaying well in notebook
# set plotting size parameter
#plt.rcParams['figure.figsize'] = (10, 5)

In [2]:
# Read in data
cape = pd.read_csv("https://raw.githubusercontent.com/dcao/seascape/master/data/data.csv")

# Data Cleaning

Since we do not want any data that may be influenced by the COVID-19 pandemic, we want to __drop evaluations made for school terms WI20 and SP20__, the two terms identified above as taught during the pandemic.

In [3]:
# Drop rows from Winter 2020 (WI20) and Spring 2020 (SP20)
cape = cape.drop(cape[(cape.term == 'WI20') | (cape.term == 'SP20')].index)
cape.head()

Unnamed: 0,instr,course,term,enrolled,evals,recClass,recInstr,hours,gpaExp,gpaAvg
0,"Butler, Elizabeth Annette",AAS 10,FA20,65,29,89.0,96.0,4.5,3.77,3.33
1,"Puritty, Chandler Elizabeth",AAS 190,FA20,19,5,100.0,100.0,2.1,4.0,
2,"Andrews, Abigail Leslie",AIP 197T,FA20,34,11,100.0,100.0,4.06,3.67,
3,"Jones, Ian William Nasser",ANAR 120,FA20,15,4,100.0,100.0,2.5,3.5,
4,"Smith, Neil Gordon",ANAR 121,FA20,17,6,100.0,100.0,6.5,4.0,
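
The same term filter can be written with `isin`, which scales more easily if further terms need to be excluded, and followed by a quick check that none of the excluded terms remain. This is a sketch on a toy frame; the name `excluded_terms` and the toy GPA values are ours.

```python
import pandas as pd

# Toy frame with the term codes used in this notebook
toy = pd.DataFrame({
    "term":   ["FA19", "WI20", "SP20", "FA18"],
    "gpaAvg": [3.1, 3.4, 3.5, 3.2],
})

excluded_terms = ["WI20", "SP20"]  # hypothetical name for the terms to drop

# Keep only rows whose term is not in the excluded list
filtered = toy[~toy["term"].isin(excluded_terms)]

# Sanity check: no excluded term survives the filter
assert not filtered["term"].isin(excluded_terms).any()
print(len(toy) - len(filtered), "rows dropped")
```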


In [4]:
cape.describe()

Unnamed: 0,enrolled,evals,recClass,recInstr,hours,gpaExp,gpaAvg
count,48645.0,48645.0,48645.0,48645.0,48645.0,47295.0,34282.0
mean,76.62,39.25,88.73,88.99,5.61,3.52,3.23
std,86.22,47.6,12.63,14.63,2.58,0.32,0.4
min,1.0,3.0,0.0,0.0,0.0,1.33,1.21
25%,20.0,10.0,83.0,84.0,4.0,3.3,2.93
50%,40.0,20.0,92.0,94.0,5.3,3.5,3.22
75%,103.0,50.0,100.0,100.0,6.93,3.75,3.52
max,1064.0,509.0,100.0,100.0,20.5,4.0,4.0


Our main estimate of course difficulty is the average GPA received by students; rows with no "gpaAvg" value are therefore not useful for our purposes. We want to __drop all rows with NaN in the average GPA column__.

In [5]:
cape = cape.dropna(subset=['gpaAvg'])
cape.describe()

Unnamed: 0,enrolled,evals,recClass,recInstr,hours,gpaExp,gpaAvg
count,34282.0,34282.0,34282.0,34282.0,34282.0,33940.0,34282.0
mean,100.02,51.1,87.32,87.2,5.8,3.46,3.23
std,91.57,51.61,12.22,14.85,2.35,0.29,0.4
min,20.0,3.0,0.0,0.0,0.5,1.6,1.21
25%,34.0,16.0,81.0,82.0,4.22,3.27,2.93
50%,63.0,32.0,90.0,92.0,5.39,3.46,3.22
75%,138.0,68.0,97.0,99.0,6.94,3.67,3.52
max,1064.0,509.0,100.0,100.0,20.33,4.0,4.0


However, comparing the two summaries above shows that the enrollment statistics change noticeably once every row with NaN in the "gpaAvg" column is deleted: the mean enrollment rises from 76.62 to 100.02 and the minimum from 1 to 20. This suggests that a missing average GPA may be tied to low enrollment in the course. __More needs to be done in EDA to understand how this affects our analysis.__
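
One way to start that EDA is to compare enrollment between rows with and without a recorded average GPA. The sketch below uses made-up toy values, and the `gpa_status` label column is our own addition; on the real data it would be applied before the NaN rows are dropped.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two raw columns involved
toy = pd.DataFrame({
    "enrolled": [5, 12, 19, 20, 45, 150, 300],
    "gpaAvg":   [np.nan, np.nan, np.nan, 3.4, 3.1, np.nan, 2.9],
})

# Label each row by whether an average GPA was recorded
toy["gpa_status"] = np.where(toy["gpaAvg"].isna(), "missing", "recorded")

# Compare the enrollment distribution of the two groups
summary = toy.groupby("gpa_status")["enrolled"].agg(["count", "min", "median", "max"])
print(summary)
```

If the missing group sits almost entirely below some enrollment threshold, dropping those rows removes small classes in particular, which would be worth stating as a limitation.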

Next, we __change column names__ so they are more consistent and "pythonic".

In [6]:
col_name_map = {
    "evals": "eval",
    "recClass": "rec_class",
    "recInstr": "rec_instr",
    "gpaExp": "gpa_exp",
    "gpaAvg": "gpa_rec"
}
cape = cape.rename(columns=col_name_map)
print(list(cape.columns))

['instr', 'course', 'term', 'enrolled', 'eval', 'rec_class', 'rec_instr', 'hours', 'gpa_exp', 'gpa_rec']


Due to privacy concerns, we would like to hide the names of the instructors and the courses. Here we will first __map instructor names to IDs and delete the names from the DataFrame__.

In [7]:
# Change all instructor names to lower case
cape["instr"] = cape["instr"].str.lower()

# Make a sorted list of unique instructor names and matching IDs
instr_names = sorted(cape["instr"].unique())
instr_ids = ["I_" + str(x) for x in range(len(instr_names))]

# Make a map from names to IDs
instr_id_map = dict(zip(instr_names, instr_ids))

# Switch instructor names to IDs
cape["instr"] = cape["instr"].map(instr_id_map)

cape.head()

Unnamed: 0,instr,course,term,enrolled,eval,rec_class,rec_instr,hours,gpa_exp,gpa_rec
0,I_457,AAS 10,FA20,65,29,89.0,96.0,4.5,3.77,3.33
5,I_758,ANAR 146,FA20,41,16,100.0,100.0,4.0,3.81,3.79
6,I_1099,ANBI 118,FA20,20,15,93.0,100.0,2.77,3.67,3.77
8,I_3175,ANBI 136,FA20,22,15,66.0,73.0,5.17,3.27,2.99
9,I_1137,ANBI 141,FA20,117,53,100.0,100.0,3.75,3.7,3.87
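
The name-to-ID step can also be sketched with `pd.factorize`, which assigns integer codes to unique values in one call. This is an alternative illustration on a toy frame, not the code used above; the two instructor names are taken from the raw rows shown earlier.

```python
import pandas as pd

# Toy frame with a repeated instructor
toy = pd.DataFrame({
    "instr": ["Smith, Neil Gordon", "Butler, Elizabeth Annette", "Smith, Neil Gordon"]
})

# Lower-case the names, then assign codes in sorted order of the unique names
names = toy["instr"].str.lower()
codes, uniques = pd.factorize(names, sort=True)

# Replace names with IDs of the form I_<code>
toy["instr"] = ["I_" + str(code) for code in codes]
print(toy)
```

Because the IDs follow the alphabetical order of names in a public dataset, anyone with the raw file could likely rebuild the mapping; the IDs keep names out of the displayed tables rather than making the data anonymous.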


We need to change course names to IDs later, as we first __need to classify the discipline (Humanities, Social Sciences, STEM) each course belongs to__. This will be done in EDA.
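
As a rough sketch of how that classification might work, the department code can be split off the course name and looked up in a mapping. The dictionary `DEPT_TO_DISCIPLINE` below is hypothetical and deliberately partial; the actual assignment of departments to disciplines is a decision still to be made in EDA.

```python
import pandas as pd

# Hypothetical, partial mapping from department code to discipline
DEPT_TO_DISCIPLINE = {
    "MATH": "STEM",
    "CSE": "STEM",
    "HIST": "Humanities",
    "PHIL": "Humanities",
    "ECON": "Social Sciences",
    "PSYC": "Social Sciences",
}

toy = pd.DataFrame({"course": ["MATH 20A", "HIST 2A", "ECON 1", "ANAR 120"]})

# The department code is the part of the course name before the first space
toy["dept"] = toy["course"].str.split().str[0]

# Departments missing from the mapping become NaN so they can be reviewed
toy["discipline"] = toy["dept"].map(DEPT_TO_DISCIPLINE)
print(toy)
```

Leaving unmapped departments as NaN rather than guessing a discipline makes it easy to list which codes still need a decision.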

# Project Plan (updated)

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/19  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  1 PM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/2  | 1 PM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/16  | 1 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 1 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/9  | 1 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/16  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |