# Analysis of the Data Science Survey

This notebook analyzes the data collected from the study on the state of the Data Science field, saved in the file `data/surveyDataScience.csv`.

The analysis proceeds as follows:
- Descriptive information is calculated: number of respondents, number and type of attributes, and completeness.
- The duration of higher education is estimated under the assumptions bachelor's degree = 3 years, master's degree = 2 years, doctorate = 3 years.
- Subgroups are filtered and compared (e.g., respondents from Romania, women from Romania who program in Python or C++).
- The range of values for each attribute is summarized.
- The programming-experience answers are converted to numbers so that summary statistics (minimum, maximum, mean, standard deviation, median) can be computed.

Additionally, visualizations will be produced to highlight distributions across age categories and identify outliers for programming experience.

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

# Ensure displaying plots within the notebook
%matplotlib inline

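# Load the CSV file.
# low_memory=False makes pandas infer each column's type from the whole file,
# which avoids a mixed-type DtypeWarning on wide files such as this one.
file_path = 'data/surveyDataScience.csv'
df = pd.read_csv(file_path, low_memory=False)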

print('First 5 records:')
display(df.head())

First 5 records:




| | Time from Start to Finish (seconds) | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7_Part_1 | Q7_Part_2 | Q7_Part_3 | ... | Q38_B_Part_3 | Q38_B_Part_4 | Q38_B_Part_5 | Q38_B_Part_6 | Q38_B_Part_7 | Q38_B_Part_8 | Q38_B_Part_9 | Q38_B_Part_10 | Q38_B_Part_11 | Q38_B_OTHER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Duration (in seconds) | What is your age (# years)? | What is your gender? - Selected Choice | In which country do you currently reside? | What is the highest level of formal education ... | Select the title most similar to your current ... | For how many years have you been writing code ... | What programming languages do you use on a reg... | What programming languages do you use on a reg... | What programming languages do you use on a reg... | ... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... | In the next 2 years, do you hope to become mor... |
| 1 | 910 | 50-54 | Man | India | Bachelor’s degree | Other | 5-10 years | Python | R | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 784 | 50-54 | Man | Indonesia | Master’s degree | Program/Project Manager | 20+ years | NaN | NaN | SQL | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 924 | 22-24 | Man | Pakistan | Master’s degree | Software Engineer | 1-3 years | Python | NaN | NaN | ... | NaN | NaN | TensorBoard | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 575 | 45-49 | Man | Mexico | Doctoral degree | Research Scientist | 20+ years | Python | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
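
The preview shows that row 0 repeats the full question text for each coded column (`Q1`, `Q2`, ...), so it acts as a second header rather than as a respondent. A minimal sketch of separating the two is given here, using a small made-up CSV in place of the real file. The names `df_demo`, `questions`, and `respondents` are illustrative and do not appear in the notebook.

```python
import io
import pandas as pd

# Tiny stand-in for the survey file: a code header, a question-text row, then answers.
raw = io.StringIO(
    "Q1,Q2,Q3\n"
    "What is your age (# years)?,What is your gender? - Selected Choice,"
    "In which country do you currently reside?\n"
    "50-54,Man,India\n"
    "22-24,Man,Pakistan\n"
)
df_demo = pd.read_csv(raw)

# Row 0 maps each column code to its question text.
questions = df_demo.iloc[0]

# Everything after row 0 is an actual respondent.
respondents = df_demo.iloc[1:].reset_index(drop=True)

print(questions["Q3"])
print(len(respondents))
```

Keeping the question text in a separate lookup makes it possible to address columns by their short codes while still being able to print the wording of each question.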


## 1.a. Descriptive Analyses

In [6]:
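# 1. Total number of respondents
# Row 0 holds the question text rather than an answer (see the preview above), so it is not counted.
num_respondents = df.shape[0] - 1
print('Total number of respondents:', num_respondents)

# 2. Number and type of attributes (properties) for a respondent
num_attributes = df.shape[1]
print('Number of attributes:', num_attributes)

# Every column is reported as object because the question-text row puts a string in each column.
print('\nType of each attribute:')
print(df.dtypes)

Total number of respondents: 25973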
Number of attributes: 369

Type of each attribute:
Time from Start to Finish (seconds)    object
Q1                                     object
Q2                                     object
Q3                                     object
Q4                                     object
                                        ...  
Q38_B_Part_8                           object
Q38_B_Part_9                           object
Q38_B_Part_10                          object
Q38_B_Part_11                          object
Q38_B_OTHER                            object
Length: 369, dtype: object


In [7]:
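# 3. Number of respondents with complete data
# dropna() keeps only rows with no missing value in any column. Multi-select questions
# store an unselected option as NaN, so this criterion is very strict; the row that
# survives may be the question-text row (row 0), which has text in every column.
df_complete = df.dropna()
num_complete = df_complete.shape[0]
print('Number of respondents with complete data:', num_complete)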

Number of respondents with complete data: 1
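
A very small count is expected with a strict `dropna()`, because a multi-select option that was not ticked is stored as a missing value. One possible alternative, sketched below on made-up data, is to require answers only for a chosen set of single-choice columns. The choice of `required_cols` is an assumption made for illustration; the notebook itself does not define which questions count as mandatory.

```python
import numpy as np
import pandas as pd

# Made-up respondents: Q7_Part_* are multi-select options (NaN = not selected).
demo = pd.DataFrame({
    "Q1": ["50-54", "22-24", np.nan],
    "Q3": ["India", "Pakistan", "Mexico"],
    "Q7_Part_1": ["Python", "Python", np.nan],
    "Q7_Part_2": ["R", np.nan, np.nan],
})

# Strict completeness: every column must be filled.
strict = demo.dropna()

# Relaxed completeness: only the (hypothetical) required columns must be filled.
required_cols = ["Q1", "Q3"]
relaxed = demo.dropna(subset=required_cols)

print(len(strict), len(relaxed))
```

Under the relaxed rule, a respondent who uses Python but not R still counts as complete, which is usually closer to what "complete data" is meant to capture in a survey with multi-select questions.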


### Calculating the Average Duration of Higher Education

It is assumed that:
- Bachelor's degree takes **3 years**
- Master's degree takes **2 years**
- Doctorate takes **3 years**

The average duration is calculated for:
- Respondents with complete data
- Respondents from Romania
- Respondents from Romania who are women

In [8]:
# Column codes from the CSV header (row 0 holds the full question text, not the column name)
edu_col = 'Q4'       # highest level of formal education
country_col = 'Q3'   # country of residence
gender_col = 'Q2'    # gender

# Cumulative duration of studies: a master's follows a bachelor's (3 + 2) and a
# doctorate follows a master's (3 + 2 + 3). Keys match the labels shown in the data preview.
edu_mapping = {
    "Bachelor’s degree": 3,
    "Master’s degree": 3 + 2,
    "Doctoral degree": 3 + 2 + 3
}

# Working on the dataset with completeness
df_complete = df_complete.copy()
df_complete['education_years'] = df_complete[edu_col].map(edu_mapping)

# Mean for respondents with complete data (mean() skips levels that are not in the mapping;
# the result is NaN if no row has a mapped level)
mean_all = df_complete['education_years'].mean()
print('Average duration of higher education for respondents with complete data:', mean_all)

# Respondents from Romania
df_romania = df_complete[df_complete[country_col] == 'Romania']
mean_romania = df_romania['education_years'].mean()
print('Average duration of studies for respondents from Romania:', mean_romania)

# Respondents from Romania who are women. The preview shows the label 'Man',
# so 'Woman' is assumed to be the matching label. The mask is built on df_romania
# itself so that its index lines up with the frame being filtered.
df_romania_female = df_romania[df_romania[gender_col] == 'Woman']
mean_romania_female = df_romania_female['education_years'].mean()
print('Average duration of studies for respondents from Romania who are female:', mean_romania_female)

In [None]:
# 5. Number of female respondents from Romania with complete data
num_romania_female_complete = df_romania_female.shape[0]
print('Number of female respondents from Romania with complete data:', num_romania_female_complete)

### Analysis of Programming Languages for Women in Romania

The following is determined:
- Number of women in Romania who program in **Python**
- Age range (category) with the most women who program in **Python**
- The same information for **C++**

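It is assumed that each language is stored in its own multi-select column, which holds the language name when it was selected and is empty otherwise. The question text of these columns ends in:
  - `... - Python`
  - `... - C++`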
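
With this layout, `notna()` on a language column acts as a yes/no indicator for that language. The sketch below illustrates the counting and the "most common age range" step on made-up rows. The column names `age`, `lang_python`, and `lang_cpp` are placeholders for illustration, not the names used in the survey file.

```python
import numpy as np
import pandas as pd

# Made-up respondents: a language column holds the name if selected, NaN otherwise.
demo = pd.DataFrame({
    "age": ["22-24", "22-24", "25-29", "25-29"],
    "lang_python": ["Python", "Python", np.nan, "Python"],
    "lang_cpp": [np.nan, "C++", "C++", np.nan],
})

# notna() turns the multi-select encoding into a boolean "uses this language" flag.
uses_python = demo["lang_python"].notna()

# Count the users and find the age range with the most of them.
n_python = int(uses_python.sum())
age_counts = demo.loc[uses_python, "age"].value_counts()
top_age = age_counts.idxmax()

print(n_python, top_age, age_counts.max())
```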

In [None]:
# Q7 is the multi-select "programming languages" question: one Q7_Part_* column per
# language, holding the language name when selected and NaN otherwise.
def language_column(language):
    """Return the Q7_Part_* column whose answers contain the given language name."""
    q7_cols = [c for c in df.columns if c.startswith('Q7_Part_')]
    # Row 0 (question text) is skipped; only actual answers are compared.
    return next(c for c in q7_cols if (df[c].iloc[1:] == language).any())

python_col = language_column('Python')
cpp_col = language_column('C++')
age_col = 'Q1'  # "What is your age (# years)?"

# Filtering for women in Romania who code in Python
df_romania_female_python = df_romania_female[df_romania_female[python_col].notna()]
num_romania_female_python = df_romania_female_python.shape[0]
print('Number of women in Romania who code in Python:', num_romania_female_python)

# Determine the age range with the most women coding in Python
age_counts_python = df_romania_female_python[age_col].value_counts()
if not age_counts_python.empty:
    print(f'Age range with the most women coding in Python: '
          f'{age_counts_python.idxmax()} ({age_counts_python.max()} respondents)')
else:
    print('No data for women coding in Python.')

# Filtering for women in Romania who code in C++
df_romania_female_cpp = df_romania_female[df_romania_female[cpp_col].notna()]
num_romania_female_cpp = df_romania_female_cpp.shape[0]
print('Number of women in Romania who code in C++:', num_romania_female_cpp)

# Determine the age range with the most women coding in C++
age_counts_cpp = df_romania_female_cpp[age_col].value_counts()
if not age_counts_cpp.empty:
    print(f'Age range with the most women coding in C++: '
          f'{age_counts_cpp.idxmax()} ({age_counts_cpp.max()} respondents)')
else:
    print('No data for women coding in C++.')

### Range of possible values and extreme values for each attribute

For each column, a summary is generated: if the attribute is numeric, the minimum and maximum values are displayed, while for categorical attributes, the number of unique values (and a few examples) is calculated.

In [None]:
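# Row 0 holds the question text, so only the rows after it are summarized.
respondents = df.iloc[1:]

feature_summary = []
for col in respondents.columns:
    values = respondents[col].dropna()
    # Columns have dtype object, so test whether every non-missing value parses as a number.
    as_numbers = pd.to_numeric(values, errors='coerce')
    if not values.empty and as_numbers.notna().all():
        summary = {
            'Feature': col,
            'Type': 'Numeric',
            'Min': as_numbers.min(),
            'Max': as_numbers.max(),
            'Unique': as_numbers.nunique()
        }
    else:
        unique_vals = values.unique()
        summary = {
            'Feature': col,
            'Type': 'Categorical',
            'Unique Values Count': len(unique_vals),
            'Example Values': list(unique_vals[:5])
        }
    feature_summary.append(summary)

summary_df = pd.DataFrame(feature_summary)
print('First 10 attribute summaries:')
display(summary_df.head(10))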

### Transforming information about programming experience

The answers to "For how many years have you been writing code and/or programming?" (column `Q6`) are converted into a numerical value using the midpoint of the interval (e.g., "5-10 years" → 7.5). Summary statistics are then calculated: minimum, maximum, mean, standard deviation, and median.

In [None]:
exp_col = 'Q6'  # "For how many years have you been writing code and/or programming?"

def experience_to_years(x):
    """Convert an experience label to years, using the midpoint of a range."""
    if pd.isnull(x):
        return np.nan
    text = str(x)
    # Find all numbers in the string
    nums = [float(n) for n in re.findall(r'\d+', text)]
    if len(nums) == 2:
        # Closed range such as "5-10 years": midpoint of the two bounds
        return (nums[0] + nums[1]) / 2.0
    if len(nums) == 1:
        # Assumption: a "< N" label is read as the midpoint of 0..N, and an
        # open-ended label such as "20+ years" is read as its lower bound.
        return nums[0] / 2.0 if '<' in text else nums[0]
    # No number in the text (e.g. the question-text row): treated as missing
    return np.nan

# Apply the function and create a new column
df['exp_years'] = df[exp_col].apply(experience_to_years)

min_exp = df['exp_years'].min()
max_exp = df['exp_years'].max()
mean_exp = df['exp_years'].mean()
std_exp = df['exp_years'].std()
median_exp = df['exp_years'].median()

print('Programming experience (years):')
print('Minimum:', min_exp)
print('Maximum:', max_exp)
print('Average:', mean_exp)
print('Standard deviation:', std_exp)
print('Median:', median_exp)

print('\nNote: A standard deviation that is large relative to the mean indicates high variability; extreme values may be outliers.')

## 1.b. Visualizations

In [None]:
# Visualization of the distribution of respondents who program in Python by age group
# Row 0 (question text) is skipped so that it does not show up as an extra age group.
df_python = df.iloc[1:]
df_python = df_python[df_python[python_col].notna()]
age_counts_python_all = df_python[age_col].value_counts().sort_index()

plt.figure(figsize=(8, 5))
age_counts_python_all.plot(kind='bar')
plt.title('Distribution of respondents who program in Python by age group')
plt.xlabel('Age group')
plt.ylabel('Number of respondents')
plt.show()

In [None]:
# Visualization of the distribution of respondents from Romania who program in Python by age group
# 'Q3' is the country column; the full question text is stored in row 0, not in the column name.
df_romania_python = df[df['Q3'] == 'Romania']
df_romania_python = df_romania_python[df_romania_python[python_col].notna()]
age_counts_romania_python = df_romania_python[age_col].value_counts().sort_index()

plt.figure(figsize=(8, 5))
age_counts_romania_python.plot(kind='bar')
plt.title('Distribution of respondents from Romania who program in Python by age group')
plt.xlabel('Age group')
plt.ylabel('Number of respondents')
plt.show()

In [None]:
# Visualization of the age group distribution for female respondents from Romania who program in Python
# 'Q3' is the country column and 'Q2' the gender column. The preview shows the label 'Man',
# so 'Woman' is assumed to be the matching label for women.
df_romania_female_python_vis = df[(df['Q3'] == 'Romania') &
                                  (df['Q2'] == 'Woman') &
                                  (df[python_col].notna())]
age_counts_romania_female_python = df_romania_female_python_vis[age_col].value_counts().sort_index()

plt.figure(figsize=(8, 5))
age_counts_romania_female_python.plot(kind='bar')
plt.title('Age group distribution of female respondents from Romania who program in Python')
plt.xlabel('Age group')
plt.ylabel('Number of respondents')
plt.show()

In [None]:
# Boxplot for identifying outliers in programming experience
plt.figure(figsize=(8,5))
plt.boxplot(df['exp_years'].dropna(), vert=False)
plt.title('Boxplot for programming experience')
plt.xlabel('Years of experience')
plt.show()
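
The points that a boxplot draws beyond its whiskers can also be listed numerically. The sketch below applies the 1.5 × IQR rule, which corresponds to Matplotlib's default whisker setting, to a made-up series of experience values. The helper name `iqr_outliers` and the sample values are illustrative only and are not taken from the survey.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = pd.Series(values).dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up midpoint-encoded experience values (years).
exp_demo = pd.Series([0.5, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 7.5, 20.0])
print(iqr_outliers(exp_demo).tolist())
```

Because the experience variable is built from interval midpoints, it takes only a handful of distinct values, so the rule flags whole categories rather than individual respondents.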