# Probability and Bayes’ Theorem: Kaggle ML & DS Survey [60 pts]

 **Dataset:** The 2017 Kaggle Machine Learning (ML) & Data Science (DS) Survey comprises over 16,000
 responses to Kaggle’s industry-wide survey on the state of data science and machine learning.


 The dataset, provided to you as cs412-hw2-data.zip, has 5 files:

• ***schema.csv:*** A CSV file with the survey schema. This schema includes the questions that correspond to
 each column name in both *multipleChoiceResponses.csv* and *freeformResponses.csv*.

• ***multipleChoiceResponses.csv:*** Respondents’ answers to multiple choice and ranking questions. These are non-randomized, so a single row corresponds to all of a single respondent’s answers.

• ***freeformResponses.csv:*** Respondents’ freeform answers to Kaggle’s survey questions. These responses are randomized within a column, so that reading across a single row does not give a single user’s answers.

• ***conversionRates.csv:*** Currency conversion rates (to USD).

• ***RespondentTypeREADME.txt:*** This is a schema for decoding the responses in the “Asked” column of the schema.csv file.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd

**Constructing Data Frames**

In [2]:
# The path to your CSV file
file_path = '/content/conversionRates.csv'

# Read the CSV file
conversion_rates_df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print(conversion_rates_df.head())

   Unnamed: 0 originCountry  exchangeRate
0           1           USD      1.000000
1           2           EUR      1.195826
2           3           INR      0.015620
3           4           GBP      1.324188
4           5           BRL      0.321350


In [3]:
# The path to your CSV file
file_path = '/content/freeformResponses.csv'

# Read the CSV file
free_form_responses_df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print(free_form_responses_df.head(20))

   GenderFreeForm KaggleMotivationFreeForm CurrentJobTitleFreeForm  \
0             NaN                      NaN                     NaN   
1             NaN                      NaN                     NaN   
2             NaN                      NaN                 teacher   
3             NaN                      NaN                     NaN   
4             NaN                      NaN                     NaN   
5             NaN                      NaN                     NaN   
6             NaN                      NaN                     NaN   
7             NaN                      NaN                     NaN   
8             NaN                      NaN                     NaN   
9             NaN                      NaN                     NaN   
10            NaN                      NaN                     NaN   
11            NaN                      NaN                     NaN   
12            NaN                      NaN                     NaN   
13            NaN   



In [4]:
# The path to your CSV file
file_path = '/content/schema.csv'

# Read the CSV file
schema_df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print(schema_df.head(10))

                     Column  \
0              GenderSelect   
1            GenderFreeForm   
2                   Country   
3                       Age   
4          EmploymentStatus   
5             StudentStatus   
6       LearningDataScience   
7  KaggleMotivationFreeForm   
8                CodeWriter   
9            CareerSwitcher   

                                            Question         Asked  
0     Select your gender identity. - Selected Choice           All  
1  Select your gender identity. - A different ide...           All  
2          Select the country you currently live in.           All  
3                                   What's your age?           All  
4             What's your current employment status?           All  
5  Are you currently enrolled as a student at a d...    Non-worker  
6  Are you currently focused on learning data sci...    Non-worker  
7    What's your motivation for being a Kaggle user?  Non-switcher  
8  Do you write code to analyze data 

In [5]:
# Use a try-except block to attempt to read the file with different encodings
try:
    # Try reading with the default utf-8 encoding first
    multiple_choice_responses_df = pd.read_csv('/content/multipleChoiceResponses.csv')
except UnicodeDecodeError:
    # If a UnicodeDecodeError occurs, try a different encoding
    multiple_choice_responses_df = pd.read_csv('/content/multipleChoiceResponses.csv', encoding='ISO-8859-1')

# Display the first few rows of the DataFrame
print(multiple_choice_responses_df.head(10))

                                        GenderSelect        Country   Age  \
0  Non-binary, genderqueer, or gender non-conforming            NaN   NaN   
1                                             Female  United States  30.0   
2                                               Male         Canada  28.0   
3                                               Male  United States  56.0   
4                                               Male         Taiwan  38.0   
5                                               Male         Brazil  46.0   
6                                               Male  United States  35.0   
7                                             Female          India  22.0   
8                                             Female      Australia  43.0   
9                                               Male         Russia  33.0   

                                    EmploymentStatus StudentStatus  \
0                                 Employed full-time           NaN   
1           



 2.a [5 pts] What is the probability that a respondent is currently employed as a Programmer given they
 use C/C++ at work?


In [6]:
# Create a boolean mask for C/C++ users (raw string so that \+ is a regex escape, not a string escape)
c_cpp_mask = multiple_choice_responses_df['WorkToolsSelect'].str.contains(r'C/C\+\+', na=False)

# Apply this mask to the DataFrame to get only the rows for C/C++ users
c_cpp_users_df = multiple_choice_responses_df[c_cpp_mask]

# Count the total number of C/C++ users
num_c_cpp_users = c_cpp_users_df.shape[0]

# Create another boolean mask for Programmers within the C/C++ users
programmer_mask = c_cpp_users_df['CurrentJobTitleSelect'].str.contains('Programmer', na=False)

# Apply this mask to the c_cpp_users_df to get programmers who use C/C++
programmers_using_c_cpp_df = c_cpp_users_df[programmer_mask]

# Count the number of programmers who use C/C++
num_programmers_using_c_cpp = programmers_using_c_cpp_df.shape[0]

# Calculate the total number of survey respondents
total_respondents = multiple_choice_responses_df.shape[0]

# Calculate P(Programmer and C/C++) which is num_programmers_using_c_cpp divided by the total number of respondents
p_programmer_and_c_cpp = num_programmers_using_c_cpp / total_respondents

# Calculate P(C/C++) which is num_c_cpp_users divided by the total number of respondents
p_c_cpp = num_c_cpp_users / total_respondents

# Calculate the conditional probability P(Programmer | C/C++)
p_programmer_given_c_cpp = p_programmer_and_c_cpp / p_c_cpp

# Output the conditional probability
print(f'The conditional probability that a respondent is a Programmer given they use C/C++ is: {p_programmer_given_c_cpp}')


The conditional probability that a respondent is a Programmer given they use C/C++ is: 0.022251308900523563
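
The same conditional probability can be read directly off two boolean masks, without building intermediate DataFrames. The sketch below is a minimal illustration on hypothetical toy data; `toy_df` and the helper `conditional_probability` are names introduced here, not part of the assignment. It also uses `regex=False` as an alternative to escaping the `+` characters.

```python
import pandas as pd

# Hypothetical toy data that mimics the two survey columns used above
toy_df = pd.DataFrame({
    "WorkToolsSelect": ["C/C++,Python", "R", "C/C++", None, "Python,C/C++", "C/C++,Java"],
    "CurrentJobTitleSelect": ["Programmer", "Data Scientist", "Engineer", "Programmer", None, "Programmer"],
})


def conditional_probability(event_mask, given_mask):
    """Return P(event | given): the share of `given` rows that also satisfy `event`."""
    n_given = int(given_mask.sum())
    if n_given == 0:
        return float("nan")  # the condition never occurs, so the probability is undefined
    return int((event_mask & given_mask).sum()) / n_given


# regex=False treats "C/C++" as a literal substring, so no escaping is needed
uses_c_cpp = toy_df["WorkToolsSelect"].str.contains("C/C++", regex=False, na=False)
is_programmer = toy_df["CurrentJobTitleSelect"].eq("Programmer")

p_programmer_given_c_cpp_toy = conditional_probability(is_programmer, uses_c_cpp)
print(p_programmer_given_c_cpp_toy)
```

In the toy data, four rows mention C/C++ and two of those are Programmers, so the helper returns 0.5.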


 2.b [5 pts] What is the probability that a respondent is a Data Scientist given they have majored in
 computer science, mathematics or statistics?

In [7]:
# Assuming 'MajorSelect' is the column that indicates the respondent's major.
# Create a boolean mask for respondents who have majored in Computer Science, Mathematics, or Statistics
major_mask = multiple_choice_responses_df['MajorSelect'].isin(['Computer Science', 'Mathematics or statistics'])

# Apply this mask to the DataFrame to get only the rows for respondents with the specified majors
majored_in_cs_math_stats_df = multiple_choice_responses_df[major_mask]

# Create another boolean mask for Data Scientists within the filtered DataFrame
data_scientist_mask = majored_in_cs_math_stats_df['CurrentJobTitleSelect'] == 'Data Scientist'

# Apply this mask to the majored_in_cs_math_stats_df to get Data Scientists with the specified majors
data_scientists_with_major_df = majored_in_cs_math_stats_df[data_scientist_mask]

# Count the number of Data Scientists who have majored in Computer Science, Mathematics, or Statistics
num_data_scientists_with_major = data_scientists_with_major_df.shape[0]

# Calculate the total number of respondents with the specified majors
total_respondents_with_major = majored_in_cs_math_stats_df.shape[0]

# Calculate the conditional probability P(Data Scientist | CS/Math/Stats major)
p_data_scientist_given_major = num_data_scientists_with_major / total_respondents_with_major

# Output the conditional probability
print(f'The conditional probability that a respondent is a Data Scientist given they majored in CS, Math, or Stats is: {p_data_scientist_given_major}')


The conditional probability that a respondent is a Data Scientist given they majored in CS, Math, or Stats is: 0.1597400634728729


 2.c [10 pts] What is the probability that a respondent works in the Technology industry given that
 they earn more than 40,000 USD annually?

In [8]:
# Merge the exchange rates into the main DataFrame
multiple_choice_responses_df = multiple_choice_responses_df.merge(
    conversion_rates_df,
    how='left',
    left_on='CompensationCurrency',
    right_on='originCountry'
)

# Create the 'CompensationAmountUSD' column
multiple_choice_responses_df['CompensationAmountUSD'] = pd.to_numeric(
    multiple_choice_responses_df['CompensationAmount'].str.replace(',', '').str.replace('-', ''),
    errors='coerce'
) * multiple_choice_responses_df['exchangeRate']

# After creating the 'CompensationAmountUSD' column, ensure that it exists before proceeding
if 'CompensationAmountUSD' in multiple_choice_responses_df.columns:
    # Filter for respondents who earn more than $40,000 USD annually
    high_earners_df = multiple_choice_responses_df[multiple_choice_responses_df['CompensationAmountUSD'] > 40000]

    # Calculate the number of high earners in the Technology industry
    tech_industry_high_earners_count = high_earners_df[high_earners_df['EmployerIndustry'] == 'Technology'].shape[0]

    # Calculate the total number of high earners
    total_high_earners_count = high_earners_df.shape[0]

    # Calculate the conditional probability
    probability_tech_given_high_earner = tech_industry_high_earners_count / total_high_earners_count if total_high_earners_count else 0

    # Print the conditional probability
    print(f'The conditional probability of working in the Technology industry given earning more than $40,000 USD annually is: {probability_tech_given_high_earner}')
else:
    print("The 'CompensationAmountUSD' column was not created successfully.")

The conditional probability of working in the Technology industry given earning more than $40,000 USD annually is: 0.1935483870967742
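
The conversion step is easy to check in isolation. The following is a minimal sketch on hypothetical toy rows (`responses` and `rates` are made up here, with exchange rates copied from the `conversionRates.csv` preview above). The `validate="many_to_one"` argument is an addition that is not in the original cell: it makes `merge` raise an error if a currency code appears more than once in the rate table, which would otherwise silently duplicate respondents.

```python
import pandas as pd

# Hypothetical toy responses; amounts are strings with thousands separators, as in the survey
responses = pd.DataFrame({
    "CompensationAmount": ["50,000", "3,000,000", "-", None],
    "CompensationCurrency": ["USD", "INR", "EUR", "USD"],
})

# Exchange rates to USD, taken from the conversionRates.csv preview
rates = pd.DataFrame({
    "originCountry": ["USD", "EUR", "INR"],
    "exchangeRate": [1.0, 1.195826, 0.015620],
})

# Left merge keeps every respondent; validate guards against duplicate currency codes
merged = responses.merge(
    rates,
    how="left",
    left_on="CompensationCurrency",
    right_on="originCountry",
    validate="many_to_one",
)

# Strip thousands separators, coerce anything non-numeric (such as "-") to NaN, convert to USD
amount = pd.to_numeric(
    merged["CompensationAmount"].str.replace(",", "", regex=False),
    errors="coerce",
)
merged["CompensationAmountUSD"] = amount * merged["exchangeRate"]

print(merged[["CompensationAmount", "CompensationCurrency", "CompensationAmountUSD"]])
```

Rows whose amount cannot be parsed end up as `NaN` and therefore drop out of any later comparison such as `> 40000`.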


 2.d [5 pts] What is the joint probability of a respondent being over 30 years old and having at
 least a Bachelor’s degree?

In [9]:
# 2.d
# Convert the 'Age' column to numeric and handle any non-numeric entries by coercing them to NaN.
multiple_choice_responses_df['Age'] = pd.to_numeric(multiple_choice_responses_df['Age'], errors='coerce')

# Assuming that 'FormalEducation' contains strings that include "Bachelor’s", "Master’s", or "Doctoral" for those levels of education.
# You may need to adjust the contains method based on the exact wording in your DataFrame.

# Filter for respondents over 30
over_30_df = multiple_choice_responses_df[multiple_choice_responses_df['Age'] > 30]

# Filter for respondents with at least a Bachelor's degree from the already filtered 'over_30_df'
degree_levels = ['Bachelor’s degree', 'Master’s degree', 'Doctoral degree']  # Adjust the list based on your data
over_30_with_degree_df = over_30_df[over_30_df['FormalEducation'].isin(degree_levels)]

# Calculate the joint probability
joint_probability = over_30_with_degree_df.shape[0] / multiple_choice_responses_df.shape[0]
print(f"The joint probability of being over 30 years old and having at least a Bachelor's degree is: {joint_probability}")



The joint probability of being over 30 years old and having at least a Bachelor's degree is: 0.11019382627422829


 2.e [5 pts] What is the probability that a respondent is a Data Scientist who majored in Computer
 Science, Mathematics or statistics?

In [14]:
# First, we identify respondents who are Data Scientists
data_scientist_mask = multiple_choice_responses_df['CurrentJobTitleSelect'] == 'Data Scientist'

# Then, we identify respondents who have majored in Computer Science, Mathematics, or Statistics
major_mask = multiple_choice_responses_df['MajorSelect'].isin(['Computer Science', 'Mathematics or statistics'])

# The free-form responses are randomized within each column, so a row of freeformResponses.csv
# does not describe a single respondent. The joint event is therefore counted in the
# multiple-choice responses only, and the free-form rows are not added to the denominator.

# Next, we create a combined mask to find respondents who satisfy both conditions
data_scientist_and_major_mask = data_scientist_mask & major_mask

# Apply this combined mask to filter the DataFrame
data_scientists_with_cs_math_stats_major_df = multiple_choice_responses_df[data_scientist_and_major_mask]

# Count the number of such Data Scientists
num_data_scientists_with_cs_math_stats_major = data_scientists_with_cs_math_stats_major_df.shape[0]

# Calculate the total number of survey respondents (one row per respondent)
total_respondents = multiple_choice_responses_df.shape[0]

# Calculate the joint probability
p_data_scientist_and_cs_math_stats_major = num_data_scientists_with_cs_math_stats_major / total_respondents

# Output the joint probability
print(f"The probability of a respondent being a Data Scientist who majored in CS, Math, or Stats is: {p_data_scientist_and_cs_math_stats_major}")


The probability of a respondent being a Data Scientist who majored in CS, Math, or Stats is: 0.06323283082077052


2.f [5 pts] What is the joint probability that a respondent is from France, earns less than 100,000 USD annually, and uses Cross-Validation Often or Most of the time?

In [15]:
# Filter for respondents from France
from_france_df = multiple_choice_responses_df[multiple_choice_responses_df['Country'] == 'France']

# Filter for respondents earning less than $100,000 annually from those who are from France
earns_less_than_100k_df = from_france_df[from_france_df['CompensationAmountUSD'] < 100000]

# Filter for respondents who use Cross-Validation 'Often' or 'Most of the time' from those earning less than $100,000
uses_cross_validation_often_df = earns_less_than_100k_df[
    earns_less_than_100k_df['WorkMethodsFrequencyCross-Validation'].isin(['Often', 'Most of the time'])
]

# Calculate the joint probability as the count of such respondents divided by the total number of respondents
joint_probability_france_cv_less_100k = len(uses_cross_validation_often_df) / len(multiple_choice_responses_df)

# Print the joint probability
print(f"The joint probability of being from France, earning less than $100,000, and using Cross-Validation often is: {joint_probability_france_cv_less_100k}")


The joint probability of being from France, earning less than $100,000, and using Cross-Validation often is: 0.0046063651591289785


2.g [10 pts] What is the probability that a respondent uses C/C++ at work given that they are employed
 as a Programmer? (Hint: Use your findings from Question 2a).

In [16]:
# Calculate the number of respondents who are Programmers
num_programmers = multiple_choice_responses_df['CurrentJobTitleSelect'].str.contains('Programmer', na=False).sum()

# num_programmers_using_c_cpp (Programmers who use C/C++) was already counted in Question 2a.
# P(C/C++ | Programmer) = P(Programmer and C/C++) / P(Programmer). Both probabilities share the
# same denominator (the number of rows in multiple_choice_responses_df), so it cancels and only
# the ratio of counts remains. This avoids relying on total_respondents, which an earlier cell
# may have reassigned.
p_c_cpp_given_programmer = num_programmers_using_c_cpp / num_programmers

# Output the conditional probability
print(f"The probability that a respondent uses C/C++ at work given that they are employed as a Programmer is: {p_c_cpp_given_programmer}")


The probability that a respondent uses C/C++ at work given that they are employed as a Programmer is: 0.0735930735930736
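
The link between this question and Question 2a can be written out explicitly. The derivation below is a sketch in which Prog and Cpp are shorthand introduced here (not survey column names) for the events "employed as a Programmer" and "uses C/C++ at work", n(·) is a row count in the multiple-choice responses, and N is the total number of rows.

```latex
\begin{align*}
P(\text{Cpp} \mid \text{Prog})
  &= \frac{P(\text{Prog} \cap \text{Cpp})}{P(\text{Prog})}
   = \frac{P(\text{Prog} \mid \text{Cpp})\, P(\text{Cpp})}{P(\text{Prog})} \\
  &= \frac{n(\text{Prog} \cap \text{Cpp}) / N}{n(\text{Prog}) / N}
   = \frac{n(\text{Prog} \cap \text{Cpp})}{n(\text{Prog})}
\end{align*}
```

The total N cancels only when every probability in the ratio is computed over the same set of respondents.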


2.h [15 pts] Given the probability of a respondent wearing glasses is 0.15, and the probability of a
 respondent wearing glasses given they have a PhD is 0.25, find the probability of a respondent having a PhD
 given that they wear glasses.

In [17]:
# Filter for respondents with a Doctoral degree (PhD)
phd_mask = multiple_choice_responses_df['FormalEducation'].str.contains('Doctoral degree', na=False)

# Apply this mask to filter the DataFrame
phd_df = multiple_choice_responses_df[phd_mask]

# Calculate the number of respondents with a PhD
num_phd = phd_df.shape[0]

# Calculate the total number of respondents
total_respondents = multiple_choice_responses_df.shape[0]

# Calculate the probability of having a PhD
P_PhD = num_phd / total_respondents if total_respondents else 0

# Given probabilities from the problem statement
P_Glasses = 0.15
P_Glasses_given_PhD = 0.25

# Applying Bayes' theorem to find P(PhD | Glasses)
P_PhD_given_Glasses = (P_Glasses_given_PhD * P_PhD) / P_Glasses if P_Glasses else 0

# Print the probabilities
print(f"Probability of a respondent having a PhD: {P_PhD}")
print(f"Probability of a respondent having a PhD given they wear glasses: {P_PhD_given_Glasses}")


Probability of a respondent having a PhD: 0.14040440296721704
Probability of a respondent having a PhD given they wear glasses: 0.23400733827869508
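
The Bayes' theorem step can be isolated in a small helper, which also makes the consistency of the inputs easy to check. This is a minimal sketch: `bayes_posterior` is a hypothetical helper name, and the prior is the value of P(PhD) printed above.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(H | E) = P(E | H) * P(H) / P(E)."""
    if not 0 < evidence <= 1:
        raise ValueError("evidence must be a probability in (0, 1]")
    posterior = likelihood * prior / evidence
    # P(E | H) * P(H) = P(E and H) can never exceed P(E); if it does, the inputs disagree
    if posterior > 1:
        raise ValueError("inconsistent inputs: posterior exceeds 1")
    return posterior


p_phd = 0.14040440296721704       # P(PhD), estimated from the survey above
p_glasses = 0.15                  # P(Glasses), given in the question
p_glasses_given_phd = 0.25        # P(Glasses | PhD), given in the question

p_phd_given_glasses = bayes_posterior(
    likelihood=p_glasses_given_phd,
    prior=p_phd,
    evidence=p_glasses,
)
print(p_phd_given_glasses)
```

Because P(Glasses | PhD) is larger than P(Glasses), learning that a respondent wears glasses raises the probability of a PhD from about 0.14 to about 0.23.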
