**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update the relevant sections where you lost those points, and make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Pallavi Prabhu
- Hana Ton-Nu
- Justin Gamm
- Raquel Sanchez
- Ria Singh


# Research Question

-  Include a specific, clear data science question.
-  Make sure what you're measuring (variables) to answer the question is clear

What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)



## Background and Prior Work


- Include a general introduction to your topic
- Include explanation of what work has been done previously
- Include citations or links to previous work

This section will present the background and context of your topic and question in a few paragraphs. Include a general introduction to your topic and then describe what information you currently know about the topic after doing your initial research. Include references to other projects that have asked similar questions or approached similar problems. Explain what others have learned in their projects.

Find some relevant prior work, and reference those sources, summarizing what each did and what they learned. Even if you think you have a totally novel question, find the most similar prior work that you can and discuss how it relates to your project.

References can be research publications, but they need not be. Blogs, GitHub repositories, company websites, etc., are all viable references if they are relevant to your project. It must be clear which information comes from which references. (2-3 paragraphs, including at least 2 references)

 **Use inline citation through HTML footnotes to specify which references support which statements** 

For example: After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Use a minimum of 2 or 3 citations, but we prefer more.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) You need enough to fully explain and back up important facts. 

Note that if you click a footnote number in the paragraph above it will transport you to the proper entry in the footnotes list below.  And if you click the ^ in the footnote entry, it will return you to the place in the main text where the footnote is made.

To understand the HTML here, `<a name="..."></a>` is a tag that lets you produce a named reference for a given location. Markdown has the construction `[text with hyperlink](#named reference)`, which produces a clickable link that transports you to the named reference.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.


# Hypothesis



- Include your team's hypothesis
- Ensure that this hypothesis is clear to readers
- Explain why you think this will be the outcome (what was your thinking?)

What is your main hypothesis/predictions about what the answer to your question is? Briefly explain your thinking. (2-3 sentences)

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: CAPES data from UCSD
  - Link to the dataset: https://www.kaggle.com/datasets/sanbornpnguyen/ucsdcapes/
  - Number of observations: 63363
  - Number of variables: 11
- Dataset #2 (if you have more than one!)
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- etc

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset: what the metrics and datatypes are, and what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

## Dataset #1 (use name instead of number here)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
capes = pd.read_csv("capes_data.csv")

In [3]:
#determine number of variables and observations
capes.shape

(63363, 11)

In [4]:
#TO REMOVE
    #drop evaluation url column
    #check for NA
    #change grade expected and received using standardisation function
    #change percentage to floats using function
    #add column for year
    #difference between expected and received grade for 

In [5]:
capes.head()

| | Instructor | Course | Quarter | Total Enrolled in Course | Total CAPEs Given | Percentage Recommended Class | Percentage Recommended Professor | Study Hours per Week | Average Grade Expected | Average Grade Received | Evalulation URL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Butler Elizabeth Annette | AAS 10 - Intro/African-American Studies (A) | SP23 | 66 | 48 | 93.5% | 100.0% | 2.8 | A- (3.84) | B+ (3.67) | https://cape.ucsd.edu/CAPEReport.aspx?sectioni... |
| 1 | Butler Elizabeth Annette | AAS 170 - Legacies of Research (A) | SP23 | 20 | 7 | 100.0% | 100.0% | 2.5 | A- (3.86) | A- (3.92) | https://cape.ucsd.edu/CAPEReport.aspx?sectioni... |
| 2 | Jones Ian William Nasser | ANAR 111 - Foundations of Archaeology (A) | SP23 | 16 | 3 | 100.0% | 100.0% | 3.83 | B+ (3.67) | | https://cape.ucsd.edu/CAPEReport.aspx?sectioni... |
| 3 | Shtienberg Gilad | ANAR 115 - Coastal Geomorphology/Environ (A) | SP23 | 26 | 6 | 100.0% | 83.3% | 3.83 | B+ (3.50) | B (3.07) | https://cape.ucsd.edu/CAPEReport.aspx?sectioni... |
| 4 | Braswell Geoffrey E. | ANAR 155 - Stdy Abrd: Ancient Mesoamerica (A) | SP23 | 22 | 9 | 100.0% | 100.0% | 5.17 | A (4.00) | A (4.00) | https://cape.ucsd.edu/CAPEReport.aspx?sectioni... |


In [6]:
capes.dtypes

Instructor                           object
Course                               object
Quarter                              object
Total Enrolled in Course              int64
Total CAPEs Given                     int64
Percentage Recommended Class         object
Percentage Recommended Professor     object
Study Hours per Week                float64
Average Grade Expected               object
Average Grade Received               object
Evalulation URL                      object
dtype: object

In [7]:
#Dropping the evaluation URL and instructor columns as we cannot use them
capes = capes.drop(columns = ["Evalulation URL", "Instructor"])

In [8]:
#checking for NA 
capes.isna().sum()

Course                                  0
Quarter                                 0
Total Enrolled in Course                0
Total CAPEs Given                       0
Percentage Recommended Class            0
Percentage Recommended Professor        0
Study Hours per Week                    1
Average Grade Expected               1486
Average Grade Received              17628
dtype: int64
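
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch of that conversion, run on a small made-up frame (`sample` and its values are hypothetical, not the CAPEs data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `capes`; the values are invented.
sample = pd.DataFrame({
    "Study Hours per Week": [2.8, 2.5, np.nan, 3.83],
    "Average Grade Received": ["B+ (3.67)", np.nan, np.nan, "B (3.07)"],
})

# isna() marks missing cells; the column mean of those booleans is the
# fraction of missing rows, which we scale to a percentage.
missing_pct = sample.isna().mean().mul(100).round(1)
print(missing_pct)
```

Applied to the counts above, 17628 of 63363 rows (roughly 28%) lack an Average Grade Received, which is why the source of those gaps is worth checking before dropping them.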

In [9]:
#Investigating whether NAs in grades expected and received come from particular quarters (COVID?)
capes_na = capes[capes.isnull().any(axis=1)]
capes_na.nunique()

Course                              6028
Quarter                               87
Total Enrolled in Course             278
Total CAPEs Given                    166
Percentage Recommended Class         332
Percentage Recommended Professor     322
Study Hours per Week                 963
Average Grade Expected               155
Average Grade Received               135
dtype: int64

In [10]:
capes_na["Quarter"].unique()

array(['SP23', 'WI23', 'FA22', 'S322', 'S222', 'S122', 'SP22', 'WI22',
       'FA21', 'S321', 'S221', 'S121', 'SP21', 'WI21', 'FA20', 'S320',
       'S220', 'S120', 'SP20', 'WI20', 'FA19', 'S319', 'S219', 'S119',
       'SP19', 'WI19', 'FA18', 'S318', 'S218', 'S118', 'SP18', 'WI18',
       'FA17', 'S317', 'S217', 'S117', 'SP17', 'WI17', 'FA16', 'S316',
       'S216', 'S116', 'SP16', 'WI16', 'FA15', 'S215', 'S115', 'SP15',
       'WI15', 'FA14', 'S314', 'S214', 'S114', 'SP14', 'WI14', 'FA13',
       'S313', 'S213', 'S113', 'SP13', 'WI13', 'FA12', 'S312', 'S212',
       'S112', 'SP12', 'WI12', 'FA11', 'S311', 'S211', 'S111', 'SP11',
       'WI11', 'FA10', 'SU10', 'SP10', 'WI10', 'FA09', 'SU09', 'SP09',
       'WI09', 'FA08', 'SU08', 'SP08', 'WI08', 'FA07', 'SU07'],
      dtype=object)
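
Listing the quarters that contain at least one NA row shows where missing values occur, but not how concentrated they are. One possible follow-up, sketched here on a made-up frame (`sample` and its values are hypothetical), is to compute the share of rows with a missing grade in each quarter:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; quarters and grades are invented for illustration.
sample = pd.DataFrame({
    "Quarter": ["SP23", "SP23", "WI20", "WI20", "WI20", "FA19"],
    "Average Grade Received": [
        "B+ (3.67)", np.nan, np.nan, np.nan, "A (4.00)", "B (3.07)",
    ],
})

# Flag missing grades, group the flags by quarter, and average them to get
# the fraction of rows in each quarter that lack a received grade.
na_rate_by_quarter = (
    sample["Average Grade Received"]
    .isna()
    .groupby(sample["Quarter"])
    .mean()
    .sort_values(ascending=False)
)
print(na_rate_by_quarter)
```

If a few quarters showed much higher rates than the rest, the missing rows would not be evenly spread, and dropping them could bias comparisons over time.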

In [11]:
#dropping all rows with NA, as the remaining dataset is still large enough and the NA rows are spread across many quarters rather than concentrated in one period
capes = capes.dropna()
capes.shape

(45393, 9)

In [12]:
#Function to convert Average Grade Expected and Average Grade Received into floats holding just the grade point number
def grade_standardize(grade):
    
    #return the value unchanged if it has already been standardized
    if isinstance(grade, float):
        return grade
    
    #retain only the part after the open parenthesis, e.g. "A- (3.84)" -> "3.84)"
    grade = grade.split("(")[-1]
    
    #remove close parenthesis
    grade = grade.replace(")", "")
    
    #change to float
    grade = float(grade)
    
    return grade
    

In [13]:
capes["Average Grade Expected"] = capes["Average Grade Expected"].apply(grade_standardize)
capes["Average Grade Received"] = capes["Average Grade Received"].apply(grade_standardize)
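
The same conversion can also be done without a row-by-row `apply`. This is a minimal sketch, assuming every value follows the `letter (number)` pattern seen in the preview above. The `grades` Series is a hypothetical sample, not the real column:

```python
import pandas as pd

# Hypothetical sample in the same "A- (3.84)" format as the raw columns.
grades = pd.Series(["A- (3.84)", "B+ (3.67)", "A (4.00)"])

# Capture the number inside the parentheses, then convert it to float.
# expand=False returns a Series because there is a single capture group.
grade_points = grades.str.extract(r"\(([\d.]+)\)", expand=False).astype(float)
print(grade_points)
```

Missing values pass through `str.extract` as NaN, so this variant would also work before the NA rows are dropped.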

In [14]:
#function to remove % symbol from Percentage Recommended Class and Percentage Recommended Professor and make into float

def percent_standardize(percent):
    #return the value unchanged if it has already been standardized
    if isinstance(percent, float):
        return percent
    
    #remove % sign
    percent = percent.replace("%", "")
    
    #convert to float
    percent = float(percent)
    
    return percent
        

In [15]:
capes["Percentage Recommended Class"] = capes["Percentage Recommended Class"].apply(percent_standardize)
capes["Percentage Recommended Professor"] = capes["Percentage Recommended Professor"].apply(percent_standardize)
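
After a string-to-number conversion like this, a quick range check can catch malformed values early. The sketch below assumes percentages should fall between 0 and 100. The `percents` Series is a hypothetical sample, and the vectorized parsing is an alternative to `percent_standardize`, not the method used above:

```python
import pandas as pd

# Hypothetical sample in the same "93.5%" string format as the raw columns.
percents = pd.Series(["93.5%", "100.0%", "83.3%"])

# Strip the trailing percent sign, then parse the remainder as numbers.
percent_values = pd.to_numeric(percents.str.rstrip("%"))

# Sanity check: every recommendation percentage should lie in [0, 100].
in_range = percent_values.between(0, 100).all()
print(percent_values.tolist(), in_range)
```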

In [16]:
capes.dtypes

Course                               object
Quarter                              object
Total Enrolled in Course              int64
Total CAPEs Given                     int64
Percentage Recommended Class        float64
Percentage Recommended Professor    float64
Study Hours per Week                float64
Average Grade Expected              float64
Average Grade Received              float64
dtype: object

In [17]:
#create new column called Difference, which is the average grade received minus the average grade expected
#negative numbers signify students overestimated their grade, positive numbers signify students underestimated their grade
capes = capes.assign(Difference = capes["Average Grade Received"] - capes["Average Grade Expected"])
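
To make the sign convention concrete, here is a small sketch using three expected/received pairs taken from the preview rows above. The `toy` frame and the `Estimate` label column are hypothetical additions for illustration only:

```python
import numpy as np
import pandas as pd

# Three expected/received pairs copied from the preview rows above.
toy = pd.DataFrame({
    "Average Grade Expected": [3.84, 3.86, 4.00],
    "Average Grade Received": [3.67, 3.92, 4.00],
})

# Same definition as in the notebook: received minus expected.
toy["Difference"] = toy["Average Grade Received"] - toy["Average Grade Expected"]

# Label each row by the sign of the difference (hypothetical helper column).
toy["Estimate"] = np.select(
    [toy["Difference"] < 0, toy["Difference"] > 0],
    ["overestimated", "underestimated"],
    default="matched",
)
print(toy)
```

The first row (expected 3.84, received 3.67) gives a negative difference, so students overestimated their grade. The second row gives a positive one.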

In [18]:
#create columns for year and term, where each year is a four-digit int (e.g. 2007, 2008, 2023) and terms are denoted by numbers (see code)
#this code is taken from https://github.com/COGS108/FinalProjects-Wi21/blob/main/FinalProject_group084.ipynb

In [19]:
def extract_term(term):
    TERMS = { "WI": 1, "SP": 2, "S1": 3, "S2": 4, "S3": 5, "FA": 6, "SU" : 7}
    
    term = term[:2]
    term = TERMS[term]
    
    return term

In [20]:
def extract_year(qtr):
    
    year = qtr[2:]
    year = int("20" + year)
    
    return year


In [21]:
capes["Term"] = capes["Quarter"].apply(extract_term)
capes["Year"] = capes["Quarter"].apply(extract_year)
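
For reference, the same split can be sketched with vectorized string operations. The `quarters` Series below is a hypothetical sample, and the mapping is copied from `extract_term` above:

```python
import pandas as pd

# Term codes copied from extract_term above.
TERMS = {"WI": 1, "SP": 2, "S1": 3, "S2": 4, "S3": 5, "FA": 6, "SU": 7}

# Hypothetical sample of quarter codes in the dataset's format.
quarters = pd.Series(["SP23", "S322", "FA07", "SU10"])

# First two characters identify the term; the rest is a two-digit year.
term = quarters.str[:2].map(TERMS)
year = 2000 + quarters.str[2:].astype(int)

print(pd.DataFrame({"Quarter": quarters, "Term": term, "Year": year}))
```

Note that with this mapping `SU` (code 7) sorts after `FA` (code 6) within a year, so the term number should not be read as strictly chronological for the `SU` quarters.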

In [22]:
capes.head()

| | Course | Quarter | Total Enrolled in Course | Total CAPEs Given | Percentage Recommended Class | Percentage Recommended Professor | Study Hours per Week | Average Grade Expected | Average Grade Received | Difference | Term | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAS 10 - Intro/African-American Studies (A) | SP23 | 66 | 48 | 93.5 | 100.0 | 2.8 | 3.84 | 3.67 | -0.17 | 2 | 2023 |
| 1 | AAS 170 - Legacies of Research (A) | SP23 | 20 | 7 | 100.0 | 100.0 | 2.5 | 3.86 | 3.92 | 0.06 | 2 | 2023 |
| 3 | ANAR 115 - Coastal Geomorphology/Environ (A) | SP23 | 26 | 6 | 100.0 | 83.3 | 3.83 | 3.5 | 3.07 | -0.43 | 2 | 2023 |
| 4 | ANAR 155 - Stdy Abrd: Ancient Mesoamerica (A) | SP23 | 22 | 9 | 100.0 | 100.0 | 5.17 | 4.0 | 4.0 | 0.0 | 2 | 2023 |
| 5 | ANBI 111 - Human Evolution (A) | SP23 | 22 | 4 | 100.0 | 100.0 | 2.5 | 4.0 | 2.95 | -1.05 | 2 | 2023 |


In [23]:
capes.dtypes

Course                               object
Quarter                              object
Total Enrolled in Course              int64
Total CAPEs Given                     int64
Percentage Recommended Class        float64
Percentage Recommended Professor    float64
Study Hours per Week                float64
Average Grade Expected              float64
Average Grade Received              float64
Difference                          float64
Term                                  int64
Year                                  int64
dtype: object

In [24]:
capes.describe()

| | Total Enrolled in Course | Total CAPEs Given | Percentage Recommended Class | Percentage Recommended Professor | Study Hours per Week | Average Grade Expected | Average Grade Received | Difference | Term | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 45393.0 | 45393.0 | 45393.0 | 45393.0 | 45393.0 | 45393.0 | 45393.0 | 45393.0 | 45393.0 | 45393.0 |
| mean | 99.476637 | 50.702047 | 88.01085 | 88.267233 | 5.827672 | 3.479624 | 3.271943 | -0.207682 | 3.002401 | 2015.932258 |
| std | 93.293638 | 52.987864 | 12.260053 | 14.430897 | 2.346697 | 0.297638 | 0.404402 | 0.278231 | 2.049306 | 4.595074 |
| min | 20.0 | 3.0 | 0.0 | 0.0 | 0.5 | 1.6 | 0.55 | -2.24 | 1.0 | 2007.0 |
| 25% | 33.0 | 16.0 | 82.4 | 83.3 | 4.23 | 3.28 | 2.97 | -0.38 | 1.0 | 2012.0 |
| 50% | 63.0 | 31.0 | 91.1 | 93.1 | 5.43 | 3.49 | 3.28 | -0.18 | 2.0 | 2016.0 |
| 75% | 136.0 | 67.0 | 97.6 | 100.0 | 7.0 | 3.7 | 3.58 | -0.02 | 6.0 | 2020.0 |
| max | 1101.0 | 588.0 | 100.0 | 100.0 | 20.33 | 4.0 | 4.0 | 1.6 | 6.0 | 2023.0 |


In [28]:
capes.isna().any()

Course                              False
Quarter                             False
Total Enrolled in Course            False
Total CAPEs Given                   False
Percentage Recommended Class        False
Percentage Recommended Professor    False
Study Hours per Week                False
Average Grade Expected              False
Average Grade Received              False
Difference                          False
Term                                False
Year                                False
dtype: bool

## Dataset #2 (if you have more than one, use name instead of number here)

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of who is included and how the data was collected, that may be problematic for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1*
* *Team Expectation 2*
* *Team Expectation 3*
* ...

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |