### Making Sense of UCSD CAPE Evaluation data

Import the packages and silence pandas' chained-assignment warning.

In [18]:
import pandas as pd
import os
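# Silence SettingWithCopyWarning, which pandas raises when a column is assigned
# on a DataFrame that was sliced from another DataFrame
pd.options.mode.chained_assignment = None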

Convert each HTML file in the 'html-data' directory to a DataFrame using read_html, then concatenate the results. This produces one large DataFrame with evaluation data for all of the UCSD courses in those files.

In [15]:
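# Read the first table from each HTML file, then concatenate once at the end
# (cheaper than growing all_data inside the loop)
file_names = os.listdir('html-data')
frames = []
for file_name in file_names:
    file_path = os.path.join('html-data', file_name)
    frames.append(pd.read_html(file_path)[0])
all_data = pd.concat(frames)
print(all_data)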

                     Instructor  \
0              Nomura, Keiko K.   
1              Miller, David R.   
2    Llewellyn Smith, Stefan G.   
3     Bahadori, Mohammad Yousef   
4            Lindsey, Stephanie   
..                          ...   
434        Mah, Silvia Armitano   
435        Mah, Silvia Armitano   
436      Palmer, Douglas Arthur   
437  Bartsch, Dirk-Uwe Guenther   
438         Williams, Ebonee P.   

                                            Course  Term  Enroll  Evals Made  \
0      MAE 101A - Introductory Fluid Mechanics (A)  WI22     116         106   
1      MAE 101A - Introductory Fluid Mechanics (B)  WI22      42          20   
2          MAE 101B - Advanced Fluid Mechanics (A)  WI22      75          43   
3      MAE 105 - Intro to Mathematical Physics (A)  WI22      41          28   
4     MAE 107 - Computational Methods/Engineer (A)  WI22      58          48   
..                                             ...   ...     ...         ...   
434   ENG 100 - Pri
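
The printed index above restarts for each file because concatenation keeps every table's own row labels. A minimal sketch of that behavior, using two tiny stand-in frames built by hand instead of real `read_html` results (the frames and the names `kept` and `renumbered` are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Hand-built stand-ins for two per-file tables (values taken from the output above)
frames = [
    pd.DataFrame({"Course": ["MAE 101A", "MAE 105"], "Evals Made": [106, 28]}),
    pd.DataFrame({"Course": ["ENG 100L"], "Evals Made": [4]}),
]

# Default behavior: each frame keeps its own labels, so the index is 0, 1, 0
kept = pd.concat(frames)

# ignore_index=True renumbers the rows, so the index is 0, 1, 2
renumbered = pd.concat(frames, ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are harmless for the grouping done later, but `ignore_index=True` may be worth considering if rows ever need to be looked up by label.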

We only care about the course name, the number of evaluations made, and the study hours per week, so let's make a smaller DataFrame with just those fields. The course names are also quite long, so let's shorten them to the course code by removing the dash and everything after it.

In [19]:
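# Copy the selection so the assignment below edits an independent DataFrame
my_data = all_data[['Course', 'Evals Made', 'Study Hrs/wk']].copy()
# Remove the character before the first dash (the space) and everything after it
my_data['Course'] = my_data['Course'].str.replace(r'.-.*', '', regex=True)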
print(my_data)

       Course  Evals Made  Study Hrs/wk
0    MAE 101A         106          8.14
1    MAE 101A          20         10.28
2    MAE 101B          43          7.24
3     MAE 105          28          8.35
4     MAE 107          48          9.00
..        ...         ...           ...
434   ENG 100          34          2.92
435  ENG 100A          34          2.92
436  ENG 100L           4          5.50
437  ENG 100L           3          5.17
438  ENG 100L           3          3.50

[54934 rows x 3 columns]
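
To see what the column selection and the regular expression do in isolation, here is a small sketch on two course names copied from the output above. The frame `courses` and the name `subset` are assumptions made for illustration. Taking an explicit `.copy()` is one common way to make the column assignment unambiguous, since the subset is then its own DataFrame and not a slice of the original.

```python
import pandas as pd

# Two rows copied from the printed data above
courses = pd.DataFrame({
    "Course": [
        "MAE 101A - Introductory Fluid Mechanics (A)",
        "MAE 105 - Intro to Mathematical Physics (A)",
    ],
    "Evals Made": [106, 28],
    "Enroll": [116, 41],
})

# Select the columns of interest and copy them into an independent DataFrame
subset = courses[["Course", "Evals Made"]].copy()

# '.' matches the space before the dash, '-' the dash itself, '.*' the rest of the string
subset["Course"] = subset["Course"].str.replace(r".-.*", "", regex=True)

print(subset["Course"].tolist())
```

The original `courses` frame keeps its long names, because only the copy was modified.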


Our DataFrame contains many rows for the same course, taught by different instructors. To make the data more readable, we combine the rows for each course into a single value. Group the data by 'Course', then iterate through each group and compute the average Study Hrs/wk weighted by Evals Made, so that sections with more evaluations count for more.

In [62]:
group_by_object = my_data.groupby('Course')
averaged_data = pd.DataFrame(columns=['Course', 'Study Hrs/wk'])
for name, group in group_by_object:
    # 'group' already holds the rows for this course, so no get_group lookup is needed
    total_evals = group['Evals Made'].sum()
    # Weight each row's study hours by its number of evaluations
    total_study_hrs = (group['Evals Made'] * group['Study Hrs/wk']).sum()
    avg_study_hrs = round((total_study_hrs / total_evals), 2)
    # Append one row per course
    averaged_data.loc[len(averaged_data.index)] = [name, avg_study_hrs]
print(averaged_data)

        Course  Study Hrs/wk
0     ANAR 100          6.33
1     ANAR 103          9.50
2     ANAR 104          6.26
3     ANAR 105          3.70
4     ANAR 111          4.47
...        ...           ...
3943   WARR 87          1.34
3944  WCWP 100          7.33
3945  WCWP 10A          6.28
3946  WCWP 10B          6.14
3947  WCWP 160          5.00

[3948 rows x 2 columns]
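
The per-course figure computed above is a weighted average: each row's Study Hrs/wk counts in proportion to its Evals Made. The same calculation can probably be written without an explicit loop. The sketch below is one possible vectorized version on made-up sample rows (the course codes, numbers, and the names `sample`, `sums`, and `averaged` are all hypothetical), not a replacement the notebook itself uses.

```python
import pandas as pd

# Hypothetical sample: course "A 1" has two sections, course "B 2" has one
sample = pd.DataFrame({
    "Course": ["A 1", "A 1", "B 2"],
    "Evals Made": [10, 30, 5],
    "Study Hrs/wk": [8.0, 4.0, 2.0],
})

# Hours multiplied by evaluations gives each row's contribution to the weighted sum
weighted = sample.assign(weighted_hrs=sample["Evals Made"] * sample["Study Hrs/wk"])

# Sum contributions and evaluation counts per course, then divide
sums = weighted.groupby("Course")[["weighted_hrs", "Evals Made"]].sum()
averaged = (
    (sums["weighted_hrs"] / sums["Evals Made"])
    .round(2)
    .rename("Study Hrs/wk")
    .reset_index()
)

print(averaged)
```

For course "A 1" this gives (10 × 8.0 + 30 × 4.0) / 40 = 5.0, whereas a plain mean of the two sections would be 6.0. The larger section pulls the result toward its own value.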


Now we need to compare