## Probability and Measures of Center/Spread

In [476]:
# Import necessary packages
import pandas as pd
import numpy as np 

In [477]:
# Read .csv file into a dataframe
df = pd.read_csv('UMACurriculumAssociations.csv')

In [478]:
df

| | courseName | courseLetter | courseNumber | Item | ItemType | Unnamed: 5 | CourseName | CourseLetter | CourseNumber |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AME 105 | AME | 105 | Contemporary & Popular Music, BM checksheet | Program | | AME 105 | AME | 105 |
| 1 | AME 105 | AME | 105 | MUH 105 | CrossList | | AME 121 | AME | 121 |
| 2 | AME 121 | AME | 121 | ANT 121 | CrossList | | AME 122 | AME | 122 |
| 3 | AME 121 | AME | 121 | ANT 308 | Prereq | | AME 201W | AME | 201W |
| 4 | AME 121 | AME | 121 | American Studies Minor checksheet | Program | | AME 205 | AME | 205 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5697 | WGS 430 | WGS | 430 | Embodied Social Justice Minor checksheet | Program | | | | |
| 5698 | WGS 430 | WGS | 430 | Holocaust, Genocide & Human Rights Studies Min... | Program | | | | |
| 5699 | WGS 430 | WGS | 430 | INT 430 | CrossList | | | | |
| 5700 | WGS 489 | WGS | 489 | Women, Gender & Sexuality Studies, Minor check... | Program | | | | |


### 1.	How many 100 level courses from the CIS department are part of the Computer Information Systems, BS program?

In [479]:
# Filter the dataset based on the specifications in the question.
cis_100_lv_bs = df[(df['courseLetter'] == 'CIS') & (df['courseNumber'].str.startswith('1')) & (df['Item'] == 'Computer Information Systems, BS checksheet')]
count = len(cis_100_lv_bs)

print(f"The Number of 100 level courses in the CIS department for Computer Information Systems BS Program is {count}")


The Number of 100 level courses in the CIS department for Computer Information Systems BS Program is 6


### 2.	Compute how many cross-lists each course has.  Using that result (which you do not need to give as a comment to your submission), please calculate:
* a.	Mean number of crosslists
* b.	Median number of crosslists
* c.	Mode number of crosslists

***Note on Feedback: I am not sure what you mean when you say not to forget the zeros***


In [480]:
# Initialize a dictionary to store course counts
course_counts = {} 
# Group the data by 'courseName' and count the occurrences of 'CrossList' within each group
for course, group_df in df[df['ItemType'] == 'CrossList'].groupby('courseName'):
    course_counts[course] = len(group_df)

# Convert the cross-list count per course into a small DataFrame for easier calculations in the next cell
df_course_counts = pd.DataFrame.from_dict(course_counts, orient='index', columns=['CrossList Count'])
df_course_counts


| | CrossList Count |
|---|---|
| AME 105 | 1 |
| AME 121 | 1 |
| AME 122 | 1 |
| AME 205 | 2 |
| AME 265 | 1 |
| ... | ... |
| WGS 340 | 2 |
| WGS 350W | 1 |
| WGS 353 | 1 |
| WGS 420 | 2 |


In [481]:
# Count the number of crosslisted courses
crossList_count = len(df_course_counts['CrossList Count'])

# Calculate and display the mean, median, and mode of these courses
mean = np.mean(df_course_counts['CrossList Count'])
median = np.median(df_course_counts['CrossList Count'])
# .mode() returns a Series of modal values, which prints with its index (e.g. "0    1"); .iat[0] extracts the first mode as a scalar
mode = df_course_counts['CrossList Count'].mode().iat[0]
print(f'Total count of crosslists: {crossList_count}')
print(f'Mean number of crosslists: {mean}')
print(f'Median number of crosslists: {median}')
print(f'Mode number of crosslists: {mode}')


Total count of crosslists: 214
Mean number of crosslists: 1.3317757009345794
Median number of crosslists: 1.0
Mode number of crosslists: 1
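
One possible reading of the feedback about zeros: a course with no cross-lists never appears in the filtered `groupby`, so the statistics above describe only courses that have at least one cross-list. The sketch below uses made-up toy data (the course codes and the `all_courses` list are hypothetical, not taken from the dataset) and assumes a full course list is available. It reindexes the counts so that missing courses count as 0 before computing the statistics.

```python
import pandas as pd

# Hypothetical toy data: cross-list counts for courses that have at least one
counts = pd.Series({'AAA 101': 1, 'AAA 102': 2, 'BBB 201': 1}, name='CrossList Count')

# Hypothetical full course list, including courses with no cross-lists
all_courses = ['AAA 101', 'AAA 102', 'BBB 201', 'BBB 202', 'CCC 301', 'CCC 302']

# Reindex against the full list; courses missing from `counts` get a count of 0
counts_with_zeros = counts.reindex(all_courses, fill_value=0)

mean_with_zeros = counts_with_zeros.mean()
median_with_zeros = counts_with_zeros.median()
mode_with_zeros = counts_with_zeros.mode().iat[0]

print(mean_with_zeros, median_with_zeros, mode_with_zeros)
```

In this notebook the full list would presumably come from the unique values of the `CourseName` column, but that is an assumption about the intended denominator.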


### 3.	How many courses at UMA are:
* a.	CIS courses
* b.	CYB courses
* c.	DSC courses
* d.	ISS courses

***EDITED see the note next to the computing_courses_df variable***

In [482]:
# list of courses we are looking for
courses_specified = ['CIS', 'CYB', 'DSC', 'ISS']

# filtered DataFrame
computing_courses_df = df[df['CourseLetter'].isin(courses_specified)] # uses the capitalized 'CourseLetter' column so the filter runs on the course dictionary rather than the associations

#count of specified courses
computing_courses_df['CourseLetter'].value_counts()

CourseLetter
CIS    72
ISS    33
DSC    19
CYB    15
Name: count, dtype: int64

### 4.	Those four branches are called computing courses.  How many computing courses are at UMA (be careful – crosslists are functionally the same course).

***EDITED see the note next to the computing_courses_df variable. I keep getting 139 instead of the 129 noted in my feedback***

In [483]:
# Drop duplicate course codes. This removes repeated codes only; cross-listed courses carry different codes, so this step does not merge them
unique_computing_courses_df = computing_courses_df.drop_duplicates(subset=['CourseLetter', 'CourseNumber'])

# Get total count
total_count = len(unique_computing_courses_df)

print(f"Computing courses (considering cross-listed): {total_count}")

Computing courses (considering cross-listed): 139
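
A possible way to treat cross-lists as one course is to group codes that are linked by a `CrossList` row and count the groups. The sketch below is only an illustration on hypothetical toy data (the course codes and the `crosslists` frame are made up). It assumes, as in the rows shown at the top of the notebook, that a cross-list row holds one course code in `courseName` and its partner in `Item`. It uses a small union-find structure; whether this reproduces the 129 from the feedback has not been checked.

```python
import pandas as pd

# Hypothetical toy data: four course codes, one cross-listed pair
courses = ['CIS 101', 'CIS 135', 'DSC 135', 'ISS 210']
crosslists = pd.DataFrame({
    'courseName': ['CIS 135', 'DSC 135'],
    'Item': ['DSC 135', 'CIS 135'],
})

# Union-find: every course starts as its own group
parent = {c: c for c in courses}

def find(c):
    # Follow parent links up to the group's representative
    while parent[c] != c:
        parent[c] = parent[parent[c]]  # shorten the path as we go
        c = parent[c]
    return c

def union(a, b):
    # Merge the groups containing a and b
    parent[find(a)] = find(b)

# Merge each cross-listed pair where both codes are in the course list
for a, b in zip(crosslists['courseName'], crosslists['Item']):
    if a in parent and b in parent:
        union(a, b)

# Count distinct groups: cross-listed codes collapse into one course
distinct_courses = len({find(c) for c in courses})
print(distinct_courses)
```

Pairs whose partner lies outside the four computing prefixes are skipped here, which is a design choice rather than something the assignment specifies.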


### 5.	What proportion of courses (by code – crosslists are separate for this purpose) are in the Architecture, B.Arch checksheet?

In [484]:
# Count the unique course codes in the 'CourseName' dictionary column to avoid double counting courses
tot_course_count = df['CourseName'].nunique()
# Filter the DataFrame to select rows with 'Item' equal to 'Architecture, B.Arch checksheet'
architecture_courses_df = df[df['Item'] == 'Architecture, B.Arch checksheet']

# Get the count of 'Architecture, B.Arch checksheet' items
b_arch_count = len(architecture_courses_df)

proportion_b_arch = (b_arch_count / tot_course_count) * 100

print(f'{round(proportion_b_arch,2)}% of courses are in the Architecture, B.Arch checksheet')

4.06% of courses are in the Architecture, B.Arch checksheet


### 6.	What proportion of courses (by code) are ARC courses? 

***EDITED see comment to right of ARC_courses_df variable***

In [485]:
# Select dictionary rows with 'CourseLetter' equal to 'ARC' and group them by 'CourseName'
ARC_courses_df = df[df['CourseLetter'] == 'ARC'].groupby('CourseName') # changed to the capitalized 'CourseLetter' and 'CourseName' columns to filter from the dictionary instead of the associations

# Get the count of 'ARC' courses (len() of a GroupBy is the number of groups, i.e. distinct course codes)
ARC_count = len(ARC_courses_df)

proportion_ARC = (ARC_count / tot_course_count) * 100

print(f'{round(proportion_ARC,2)}% of courses are ARC courses')

3.3% of courses are ARC courses


### 7.	Assuming these are independent, what proportion of courses (by code) are both in the Architecture, B.Arch checksheet and are ARC courses?

***EDITED, updated to use independent intersection formula***

In [486]:
# Assuming independence, P(ARC and B.Arch) = P(ARC) * P(B.Arch), so I multiply the ARC proportion by the B.Arch checksheet proportion.
b_Arch_ARC = ((ARC_count / tot_course_count)) * ((b_arch_count / tot_course_count)) # independent intersection formula

ARC_and_b_arch = b_Arch_ARC * 100

print(f'{round(ARC_and_b_arch,4)}% of courses are ARC course and are in the Architecture, B.Arch checksheet')


0.1342% of courses are ARC course and are in the Architecture, B.Arch checksheet


### 8.	What is the conditional probability that a given course with an ARC designation is part of the Architecture, B.Arch checksheet?

***EDITED this result has been corrected given the changes in Q6***

In [487]:
# Filter the DataFrame to select rows with 'Item' equal to 'Architecture, B.Arch checksheet' and 'courseLetter' equal to 'ARC'
ARC_and_b_arch_df = df[(df['courseLetter'] == 'ARC') & (df['Item'] == 'Architecture, B.Arch checksheet')]

# Get the count of 'Architecture, B.Arch checksheet' item and 'ARC' courseLetter
ARC_and_b_arch_count = len(ARC_and_b_arch_df)

proportion_ARC_and_b_arch = (ARC_and_b_arch_count / tot_course_count) * 100

# The conditional probability that a course with an ARC designation is part of the Architecture, B.Arch checksheet is the proportion of courses that are both ARC and B.Arch divided by the proportion that are ARC
cond_prob_ARC_is_b_arch = (proportion_ARC_and_b_arch / proportion_ARC) * 100
print(f'The conditional probability that a given course with an ARC designation is part of the Architecture, B.Arch checksheet is {round(cond_prob_ARC_is_b_arch, 2)}%')

The conditional probability that a given course with an ARC designation is part of the Architecture, B.Arch checksheet is 79.49%


### 9.	What is the conditional probability that a given course in the Architecture, B.Arch checksheet is an ARC course?

In [488]:
# The conditional probability that a course on the Architecture, B.Arch checksheet has an ARC designation is the proportion of courses that are both ARC and B.Arch divided by the proportion of courses on the Architecture, B.Arch checksheet
cond_prob_b_arch_is_ARC = (proportion_ARC_and_b_arch / proportion_b_arch) * 100
print(f'The conditional probability that a given course in the Architecture, B.Arch checksheet is an ARC course is {round(cond_prob_b_arch_is_ARC,2)}%')

The conditional probability that a given course in the Architecture, B.Arch checksheet is an ARC course is 64.58%
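
Because both proportions in questions 8 and 9 share the same denominator (`tot_course_count`), the total cancels and each conditional probability reduces to a ratio of counts. A minimal sketch on hypothetical toy sets (the course codes below are made up for illustration) shows this directly:

```python
# Hypothetical toy sets of course codes
arc_courses = {'ARC 101', 'ARC 102', 'ARC 203', 'ARC 204'}
b_arch_courses = {'ARC 101', 'ARC 102', 'ARC 203', 'ART 115', 'MAT 112'}

# Courses that are both ARC courses and on the checksheet
both = arc_courses & b_arch_courses

# P(B.Arch | ARC) = n(ARC and B.Arch) / n(ARC)
p_b_arch_given_arc = len(both) / len(arc_courses)

# P(ARC | B.Arch) = n(ARC and B.Arch) / n(B.Arch)
p_arc_given_b_arch = len(both) / len(b_arch_courses)

print(p_b_arch_given_arc, p_arc_given_b_arch)
```

Using sets of distinct codes also guards against counting the same course twice if it appears in more than one row.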


### 10.	Do the answers from 5-9 suggest that ARC courses and Architecture, B.Arch checksheet courses are independent statistically?  Why or why not.

Based on my calculations:
* 4.06% of courses are in the Architecture, B.Arch checksheet
* 3.3% of courses are ARC courses
* assuming independence, 0.1342% of courses would be both ARC courses and in the Architecture, B.Arch checksheet
* the conditional probability that a given course with an ARC designation is part of the Architecture, B.Arch checksheet is 79.49%
* the conditional probability that a given course in the Architecture, B.Arch checksheet is an ARC course is 64.58%

I do not believe that ARC courses (A) and Architecture, B.Arch checksheet courses (B) are statistically independent. If they were, the joint probability of a course being both would be approximately equal to the product of the individual probabilities, P(A)*P(B) = 0.1342%. The observed joint probability, recovered from the results above as P(B|A)*P(A) ≈ 0.7949 * 0.033 ≈ 2.6%, is roughly twenty times larger, so P(A)*P(B) != P(A and B). The conditional probabilities tell the same story: P(B|A) = 79.49% != P(B) = 4.06%, and P(A|B) = 64.58% != P(A) = 3.3%. Testing any one of these three conditions is enough to check independence, because if one of them holds then the others hold as well; however, it can be useful to understand all three. Below is a reference that I used for this question, which summarized this concept very well for me.

https://online.stat.psu.edu/stat800/lesson/how-do-we-check-independence
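
The comparison in question 10 can be sketched numerically from the percentages reported in questions 5, 6, 8 and 9. The values below are the rounded results from those cells; the 5% relative tolerance used to decide "approximately equal" is an arbitrary assumption for illustration, not a formal statistical test.

```python
import math

# Rounded results from questions 5, 6, 8 and 9, expressed as fractions
p_b_arch = 0.0406            # P(B): course is on the B.Arch checksheet
p_arc = 0.033                # P(A): course is an ARC course
p_b_arch_given_arc = 0.7949  # P(B|A)
p_arc_given_b_arch = 0.6458  # P(A|B)

# Joint probability expected if A and B were independent
product_if_independent = p_arc * p_b_arch

# Observed joint probability recovered two ways: P(B|A)*P(A) and P(A|B)*P(B)
joint_from_arc = p_b_arch_given_arc * p_arc
joint_from_b_arch = p_arc_given_b_arch * p_b_arch

# Assumed tolerance: treat values within 5% of each other as "approximately equal"
looks_independent = math.isclose(joint_from_arc, product_if_independent, rel_tol=0.05)

print(f'Product under independence: {product_if_independent:.4%}')
print(f'Observed joint:             {joint_from_arc:.4%}')
print(f'Looks independent:          {looks_independent}')
```

The two recovered joint values agree up to rounding, which is a quick consistency check on the conditional probabilities themselves.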

### 11.	What proportion of courses are a prerequisite for another course?

In [489]:
# Filter the dataframe to rows with 'Prereq' in the 'ItemType' field and group them by the unique values in the 'courseName' column.
prereq = df[df['ItemType'] == 'Prereq'].groupby('courseName')
# Count the unique courses that appear as prerequisites
prereq_count = len(prereq)
prereq_proportion = (prereq_count / tot_course_count) * 100
print(f'Proportion of courses that are a prerequisite for another course: {round(prereq_proportion, 2)}%')


Proportion of courses that are a prerequisite for another course: 29.64%


### 12.	What proportion of courses have at least one prerequisite?

***EDITED changed column being counted. value now matches the values provided in feedback***

In [490]:
# Simplified the filtering and removed the regex from the prior submission
prereq_filtered_df = df[(df['ItemType'] == 'Prereq')]
# Count the unique values in the 'Item' column so courses are not double counted; the filtered data frame holds records that have a course in the 'Item' field and 'Prereq' in the 'ItemType' field
courses_with_prerequisites_count = len(prereq_filtered_df['Item'].unique())
prereq_filtered_df_count_proportion = (courses_with_prerequisites_count / tot_course_count) * 100
print(f'Proportion of courses that have at least one prerequisite: {round(prereq_filtered_df_count_proportion, 2)}%')


Proportion of courses that have at least one prerequisite: 66.55%
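
Questions 11 and 12 count the two sides of the same `Prereq` rows. The sketch below uses hypothetical toy data (course codes, checksheet name and catalog size are made up) and follows this notebook's reading of the columns, which is an assumption: `courseName` holds the prerequisite and `Item` holds the course that requires it.

```python
import pandas as pd

# Hypothetical toy associations
assoc = pd.DataFrame({
    'courseName': ['AAA 101', 'AAA 101', 'AAA 102', 'AAA 101', 'BBB 201'],
    'Item':       ['AAA 102', 'BBB 201', 'BBB 201', 'CCC 301', 'Some Minor checksheet'],
    'ItemType':   ['Prereq',  'Prereq',  'Prereq',  'Prereq',  'Program'],
})
total_courses = 5  # hypothetical catalog size

# Keep only the prerequisite relationships
prereqs = assoc[assoc['ItemType'] == 'Prereq']

# Distinct courses that serve as a prerequisite (left side of the relation)
is_prereq_count = prereqs['courseName'].nunique()

# Distinct courses that have at least one prerequisite (right side of the relation)
has_prereq_count = prereqs['Item'].nunique()

is_prereq_proportion = is_prereq_count / total_courses
has_prereq_proportion = has_prereq_count / total_courses
print(is_prereq_proportion, has_prereq_proportion)
```

`nunique()` gives the same count as `len()` of a `groupby` on that column, without building the groups.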


### 13.	What support does the dataset provide for AME courses?

***EDITED used dictionary as instructed***

In [491]:
# Filter the DataFrame to select dictionary rows with 'CourseLetter' equal to 'AME'
AME_courses_df = df[df['CourseLetter'] == 'AME'] # changed column name to 'CourseLetter' instead of 'courseLetter' to filter from the dictionary instead of the associations

# Get the count of 'AME' courses
AME_count = len(AME_courses_df)

support_x = (AME_count / tot_course_count)

print(f'The support the dataset provides for AME courses is {round(support_x,4)}')

The support the dataset provides for AME courses is 0.0186


### 14.	What support does the dataset provide for crosslisted courses?

In [492]:
# Count the number of crosslisted courses
crossList_count = len(df_course_counts['CrossList Count'])
# Support is calculated the same way as the proportions from earlier questions. However, it is presented as a ratio and is defined as the number of transactions (records) containing 'x' divided by the total number of transactions.
support_y = (crossList_count / tot_course_count)

print(f'The support the dataset provides for crosslisted courses is {round(support_y,4)}')

The support the dataset provides for crosslisted courses is 0.1812


### 15.	What is the confidence that an AME course is crosslisted?

***EDITED I found records that were both AME and Crosslisted and changed the formula.***

In [493]:
# Joint support of x (AME) and y (crosslisted). Multiplying the two supports assumes AME and crosslisted are independent; it does not count the records that are both.
support_x_y = support_x * support_y

# Calculate the confidence of AME courses being cross-listed. With the product above, this reduces to support_y.
confidence_x_y = support_x_y / support_x

print(f'The confidence that an AME course is crosslisted is {round(confidence_x_y,4)}')

The confidence that an AME course is crosslisted is 0.1812


### 16.	What is the lift associated with the knowledge that the course is an AME course?

***Made updates to the notes below; the result should now be correct given the update made to question 15. The confidence_x_y variable is the result from Q15 and the support_y variable is the result from Q14.***

In [494]:
# Lift is calculated by dividing the confidence from question 15 by the support from question 14
lift_x_y = confidence_x_y / support_y
print(f'The lift associated with the knowledge that the course is an AME course is {round(lift_x_y,4)}')

The lift associated with the knowledge that the course is an AME course is 1.0
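
When the joint support in question 15 is the product of the two individual supports, the confidence equals `support_y` and the lift is 1.0 by construction. An alternative is to count the courses that are both AME and crosslisted. The sketch below does this on hypothetical toy sets (the course codes and catalog size are made up, and the resulting numbers say nothing about the real dataset).

```python
# Hypothetical toy sets of course codes
ame_courses = {'AME 101', 'AME 102', 'AME 103', 'AME 104'}
crosslisted_courses = {'AME 101', 'AME 102', 'AME 103', 'XYZ 201', 'XYZ 202'}
total_courses = 20  # hypothetical catalog size

# Individual supports: share of all courses in each set
support_x = len(ame_courses) / total_courses           # AME
support_y = len(crosslisted_courses) / total_courses   # crosslisted

# Joint support from an observed count of courses in both sets
support_x_y = len(ame_courses & crosslisted_courses) / total_courses

# Confidence(x -> y) = support(x and y) / support(x)
confidence_x_y = support_x_y / support_x

# Lift = confidence(x -> y) / support(y); 1 means no association
lift_x_y = confidence_x_y / support_y

print(round(confidence_x_y, 4), round(lift_x_y, 4))
```

In this toy example three of the four AME courses are crosslisted, so the lift comes out well above 1; with real data the direction and size of the lift depend on the observed overlap.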
