In [165]:
import pandas as pd
import numpy as np

df = pd.read_csv("assignment_3_dataset.csv", index_col=0)

## Cleaning and normalization 
---
- The caps column contained ambiguous entries (e.g., "??" and empty fields), which were normalized to the value "not answered" for consistency.
- Inconsistencies in the lang column such as variations in capitalization were standardized.
- Empty cells across the dataset were replaced with "not answered" where appropriate to maintain data completeness without introducing incorrect numeric assumptions.
- All relevant numerical variables were converted to floating-point format to ensure consistent data types and enable accurate statistical computations.

In [None]:
# Treat empty strings and "??" as missing, then label every missing cell "not answered"
for c in df.columns:
    df[c] = df[c].replace(['', '??'], np.nan).fillna("not answered")

# The lang column has inconsistent capitalization; normalize it

df['lang'] = df['lang'].str.capitalize()

In [168]:
# Convert all ints to floats so numeric values share a single type
for c in df.columns:
    df[c] = df[c].apply(lambda x: float(x) if isinstance(x, int) else x)

In [169]:


# Rows with excessive whours report the same value in stmtL
df['stmtL'] = pd.to_numeric(df['stmtL'], errors='coerce')
df['whours'] = pd.to_numeric(df['whours'], errors='coerce')
# Pick these and list them

outliers = df[
    (df["stmtL"] == df["whours"]) &
    (df["stmtL"] >= 34632)
]
print("Years coding:", (34632 / 40) / 52)


Years coding: 16.65


**Motivation**
Upon examining the data, several entries contain extremely high values for both stmtL (lines of code) and whours (work hours). For example, participant s149401 reports 34,632 work hours and the same number of lines of code. Assuming a typical full-time workload of approximately 2,000 hours per year (40 hours/week × 52 weeks), this would correspond to more than 16 years of continuous full-time development on the same task. This scenario is incompatible with the experimental context of the dataset, where productivity was measured in a controlled study rather than across decades of professional work.

These observations are therefore considered implausible measurement errors. To prevent distortion of statistical summaries and visualizations, entries with values at or above 34,632 (the smallest value judged implausible) were excluded for both variables (whours and stmtL) using a single consistent threshold.
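
How strongly a single extreme entry can distort a summary statistic is easy to illustrate. The following is a minimal sketch on made-up values (the `toy` frame and its numbers are hypothetical, not rows from the dataset); only the threshold of 34,632 is taken from the text above.

```python
import pandas as pd

# Hypothetical work-hour values: four ordinary entries and one extreme entry.
toy = pd.DataFrame({"whours": [3.0, 5.5, 9.0, 12.0, 34632.0]})

threshold = 34632  # same cut-off as in the motivation above

# Mean over all rows, including the extreme entry.
mean_with = toy["whours"].mean()

# Keep only rows strictly below the threshold, then recompute the mean.
kept = toy[toy["whours"] < threshold]
mean_without = kept["whours"].mean()

print(f"mean with extreme entry:    {mean_with:.1f}")
print(f"mean without extreme entry: {mean_without:.3f}")
```

In this toy case one row moves the mean by several orders of magnitude, which is the kind of distortion the exclusion is meant to avoid.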

In [170]:
df = df.drop(df[df['stmtL'] >= 34632].index)

## Exploratory Data Analysis

---

In [171]:
df['stmtL_temp'] = pd.to_numeric(df['stmtL'], errors='coerce')

# Group by language and calculate the mean
lang_summary = df.groupby('lang')[['stmtL_temp', 'whours']].mean()

lang_summary

| lang   | stmtL_temp | whours    |
|--------|------------|-----------|
| C      | 9.3        | 9.3       |
| C++    | 11.42      | 11.42     |
| Java   | 15.572727  | 15.572727 |
| Perl   | 3.378462   | 3.378462  |
| Python | 3.205      | 3.205     |
| Rexx   | 5.4825     | 5.4825    |
| Tcl    | 4.71625    | 4.71625   |


In [172]:
df = df.drop(columns=['stmtL_temp'])

This table presents the mean number of lines of code and the mean working hours for each programming language. Java and C++ show the highest values in both metrics, suggesting that programs written in these languages required more development effort and produced larger code bases. The trend across languages is also consistent with the overall correlation between lines of code and working hours, where larger code size is associated with greater time investment.

A potential bias in the data is that all subjects appear to produce code at an identical rate of one line of code per hour, regardless of programming language or individual differences. This uniform productivity pattern is highly improbable in practice and likely indicates a data collection or recording error. As a result, conclusions involving productivity should be interpreted with caution.
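
The one-line-per-hour pattern described above can be checked directly. This is a small sketch on a made-up frame (the `toy` values are illustrative, not taken from the dataset), assuming both columns have already been converted to numeric.

```python
import pandas as pd

# Hypothetical stand-in for df[['stmtL', 'whours']]; values are made up.
toy = pd.DataFrame({
    "stmtL":  [9.3, 11.4, 3.2, 15.6],
    "whours": [9.3, 11.4, 3.2, 15.6],
})

# Share of rows in which both columns hold exactly the same value.
identical_share = (toy["stmtL"] == toy["whours"]).mean()

# Lines of code per hour; a single distinct ratio signals duplicated columns.
ratio = toy["stmtL"] / toy["whours"]

print(f"identical rows: {identical_share:.0%}")
print(f"distinct ratios: {ratio.nunique()}")
```

A share of 100% together with a single distinct ratio would suggest that one column was copied into the other rather than measured independently.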

In [173]:
df['z1000t_temp'] = pd.to_numeric(df['z1000t'], errors='coerce')
# Group by language and calculate the mean
speed_sum = df.groupby('lang')[['z1000t_temp']].mean()

df = df.drop(columns=['z1000t_temp'])

speed_sum

| lang   | z1000t_temp |
|--------|-------------|
| C      | 5.42925     |
| C++    | 2.974182    |
| Java   | 4.937182    |
| Perl   | 7.625       |
| Python | 6.434667    |
| Rexx   | 15.739      |
| Tcl    | 26.326      |


The table displays the mean runtime for each programming language on the z1000 input. C++ has the shortest mean runtime, followed by Java and C, while the scripting languages Rexx and Tcl show substantially longer runtimes. This suggests that, in this dataset, compiled languages generally run faster than interpreted ones.

In [None]:
df[['stmtL', 'whours']].describe()

In [None]:
# Convert to numeric, invalid entries become NaN
df['z1000mem'] = pd.to_numeric(df['z1000mem'], errors='coerce')

#Mean, median, max memory usage per language
memory_summary = df.groupby('lang')['z1000mem'].agg(['mean', 'median', 'max','min', 'count']).sort_values('mean', ascending=False)
memory_summary

Based on these variables, C++ memory usage in both mean and median, however, it has the relatively.

In [None]:
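# Total reliability as a simple sum of z1000rel + m1000rel.
# Coerce to numeric first so that "not answered" placeholders become NaN.
z_rel = pd.to_numeric(df['z1000rel'], errors='coerce')
m_rel = pd.to_numeric(df['m1000rel'], errors='coerce')
df['total_rel'] = z_rel + m_rel

# Group by programming language; dividing the summed score by 2
# gives the average reliability over the two inputs.
lang_performance = (df.groupby('lang')['total_rel'].mean() / 2).sort_values(ascending=False)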

print("Average total reliability by language: (out of 100%)")
lang_performance

## Correlation Analysis

---

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


cols = ['stmtL', 'whours', 'z1000t', 'z1000mem']

# Convert all to numeric; non-numeric entries become NaN
numeric_df = df[cols].apply(pd.to_numeric, errors='coerce')

# Compute correlation
corr = numeric_df.corr(method='pearson')

# Plot heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(corr, annot=True, cmap='YlGnBu', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()

| Pearson correlation coefficient (r) value | Strength  | Direction |
|--------------------------------------------|------------|------------|
| Greater than .5                            | Strong     | Positive   |
| Between .3 and .5                          | Moderate   | Positive   |
| Between 0 and .3                           | Weak       | Positive   |
| 0                                          | None       | None       |
| Between 0 and –.3                          | Weak       | Negative   |
| Between –.3 and –.5                        | Moderate   | Negative   |
| Less than –.5                              | Strong     | Negative   |

- Based on the table above, both heat maps (with and without the outliers) show a strong correlation between whours and stmtL.
- All other correlations are weak in both cases. The only difference is that, once the outliers are filtered out, the weak positive correlations become weak negative ones.
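
The interpretation table above can be expressed as a small helper. This is a sketch only: the function name `describe_r` is hypothetical, and assigning the exact boundary values 0.3 and 0.5 to "Moderate" is an assumption, since the table does not say which side they belong to.

```python
def describe_r(r: float) -> tuple:
    """Return (strength, direction) for a Pearson r, following the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    if r == 0:
        return ("None", "None")

    direction = "Positive" if r > 0 else "Negative"
    size = abs(r)

    # Assumption: the boundary values 0.3 and 0.5 are counted as "Moderate".
    if size > 0.5:
        strength = "Strong"
    elif size >= 0.3:
        strength = "Moderate"
    else:
        strength = "Weak"
    return (strength, direction)


for value in (1.0, 0.073, -0.4):
    print(value, describe_r(value))
```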

In [None]:
# Correlation between reliability and self-rated capability (caps)

from scipy.stats import spearmanr

valid_caps = ['0-10%', '10-25%', '25-40%', '40-60%', '60-75%']
df_caps = df[df['caps'].isin(valid_caps)].copy()

order = {'0-10%': 1, '10-25%': 2, '25-40%': 3, '40-60%': 4, '60-75%': 5}
df_caps['caps_encoded'] = df_caps['caps'].map(order)

# Convert reliability to numeric
df_caps['m1000rel'] = pd.to_numeric(df_caps['m1000rel'], errors='coerce')

# Drop missing reliability values
df_caps = df_caps.dropna(subset=['m1000rel'])

# Spearman correlation
r, p = spearmanr(df_caps['caps_encoded'], df_caps['m1000rel'])

print(f"Spearman correlation: r = {r:.3f}, p = {p:.3f}")
print(f"Sample size: {len(df_caps)}")


sns.regplot(
    data=df_caps,
    x='caps_encoded',
    y='m1000rel',
    logistic=False,
    scatter_kws={'alpha':0.7}
)
plt.xticks(ticks=[1,2,3,4,5], labels=valid_caps)
plt.xlabel("Self-rated programmer capability")
plt.ylabel("Reliability score (m1000rel)")
plt.title("Spearman correlation: Capability vs Reliability")
plt.show()


## Hypothesis Testing

---
#### Hypotheses
1. The more lines of code the more working hours are spent.
2. The longer the runtime, the more memory consumption the program requires.

In [None]:
# Hypothesis 1

from scipy.stats import pearsonr
cols = ['stmtL', 'whours']

# Make selected columns numeric; non-numeric -> NaN
numeric_df = df[cols].apply(pd.to_numeric, errors='coerce')

# Drop rows with NaN in either column (keeps indices aligned)
clean_df = numeric_df.dropna()

pearson_corr, p_val = pearsonr(clean_df['stmtL'], clean_df['whours'])
print(f"Pearson r = {pearson_corr:.3f}, p-value = {p_val:.3f}")


In [None]:
# Hypothesis 2
# Convert runtime and memory to numeric; non-numeric -> NaN
df['z1000t'] = pd.to_numeric(df['z1000t'], errors='coerce')
df['z1000mem'] = pd.to_numeric(df['z1000mem'], errors='coerce')

df_clean = df.dropna(subset=['z1000t', 'z1000mem'])

r, p = pearsonr(df_clean['z1000t'], df_clean['z1000mem'])

print(f"Pearson correlation: r = {r:.3f}, p = {p:.3f}")



**Claim: The more lines of code, the more working hours are spent.**

- H0: There is no correlation between lines of code (stmtL) and working hours (whours).
- H1: There is a positive correlation (more lines of code → more hours worked).

Analysis output: Pearson r = 1.000, p-value = 0.000

r = 1.000 indicates a perfect *positive* correlation: as the number of lines of code increases, working hours increase in a perfectly linear way.
Since p < 0.05 (the printed 0.000 is a value rounded to three decimals), the result is statistically significant.

**Conclusion**: Reject H0.
There is strong evidence of a perfect positive relationship between lines of code and working hours. In other words, programmers who write more lines of code also spend more time coding. 

---

**Claim: The longer the runtime, the more memory consumption the program requires.**

- H0: There is no relationship between runtime (z1000t) and memory consumption (z1000mem).
- H1: Runtime (z1000t) is positively correlated with memory consumption (z1000mem).

Analysis output: Pearson correlation: r = 0.073, p = 0.540

- **r-value:** r = 0.073 indicates a very weak correlation.
- **p-value:** p = 0.540 is above 0.05, so the result is not statistically significant.

**Conclusion**: **Fail** to reject H0.
The correlation between runtime and memory is very weak and not statistically significant, so there is no evidence that programs with longer runtimes consume more memory.
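
Both alternative hypotheses are directional (a positive correlation), whereas `pearsonr` reports a two-sided p-value by default. The sketch below shows how a one-sided test could be run; it assumes SciPy 1.9 or later, where `pearsonr` accepts the `alternative` argument, and it uses synthetic data rather than the study's measurements.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data with a built-in positive relationship (not the study data).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# Default: two-sided test of H0 "no correlation".
r_two, p_two = pearsonr(x, y)

# One-sided test matching a directional H1 "correlation is positive".
r_one, p_one = pearsonr(x, y, alternative="greater")

print(f"two-sided: r = {r_two:.3f}, p = {p_two:.3g}")
print(f"one-sided: r = {r_one:.3f}, p = {p_one:.3g}")
```

When the observed r is positive, the one-sided p-value is half the two-sided one, so the conclusions reported above would not change under a one-sided test.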




## Visualization & Reporting

---

In [None]:
# Use the frame prepared for Hypothesis 1 (rows with valid stmtL and whours)
sns.regplot(data=clean_df, x='whours', y='stmtL')
plt.xlabel("Work hours")
plt.ylabel("Lines of code")
plt.title("Relationship between work hours and code written")
plt.show()

### Discussing Hypothesis One

**The more lines of code, the more working hours are spent.**

The data analysis indicates a perfect positive correlation between the number of lines of code and the total working hours. However, this relationship is unlikely to reflect real-world productivity. The dataset shows that, for every participant, the reported number of lines of code is exactly equal to the number of hours worked. This implies that every subject wrote code at a constant pace of one line per hour, which is highly unrealistic across multiple individuals and programming tasks.

Therefore, although the statistical results technically support a strong relationship, it is important to acknowledge that the underlying data is likely flawed or recorded using a simplified measure. As a result, the conclusion is influenced more by the structure of the dataset than by genuine developer performance.

In [None]:
sns.regplot(data=df_clean, x='z1000t', y='z1000mem')
plt.xlabel("Runtime (min)")
plt.ylabel("Memory usage (KB)")
plt.title("Runtime vs Memory Usage (z1000 input)")
plt.show()


### Discussing Hypothesis Two

**The longer the runtime, the more memory consumption the program requires.**

The statistical results do not provide evidence to support this hypothesis. Although a few outliers remain, the overall visualization and correlation analysis indicate that runtime and memory usage are not meaningfully related in this dataset. One plausible explanation is that memory usage on the systems used in this study does not act as a performance bottleneck. Modern programs often operate within efficient memory management environments, where additional memory demands do not necessarily translate into increased runtime. It is possible that under different conditions memory consumption could have a stronger effect on runtime. However, for the machines and programs examined here, this relationship does not appear to exist.

### Discussing Statistical Methods

**Correlation Computations**
Pearson’s correlation coefficient was used to evaluate relationships between continuous variables. For the correlation between lines of code and working hours, a linear association was expected. The resulting perfect correlation, also visible in the Pearson correlation heatmap, was initially surprising but became clear after further examination of the data, where each subject showed a fixed ratio of one line of code per hour.

For the relationship between self-rated programming capability (caps) and reliability (m1000rel), Spearman’s rank correlation was applied instead. Since caps is an ordinal variable with ranked percentage categories, Spearman’s method was more appropriate for assessing a monotonic relationship without assuming linearity.
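
One practical consequence of using ranks is that the result does not depend on how the ordered categories are numbered. The sketch below uses made-up values (not the study data) and compares two encodings of the same five brackets: the codes 1 to 5 and the bracket midpoints in percent.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Made-up example: capability bracket codes and a reliability score.
caps_rank = pd.Series([1, 2, 2, 3, 4, 5, 5, 3])
reliability = pd.Series([70.0, 85.0, 60.0, 90.0, 95.0, 80.0, 99.0, 75.0])

# Alternative encoding: midpoints of 0-10%, 10-25%, 25-40%, 40-60%, 60-75%.
midpoints = caps_rank.map({1: 5.0, 2: 17.5, 3: 32.5, 4: 50.0, 5: 67.5})

# Spearman only uses the ordering, so both encodings give the same rho.
rho_rank, _ = spearmanr(caps_rank, reliability)
rho_mid, _ = spearmanr(midpoints, reliability)

# Pearson uses the numeric spacing, so its value can depend on the encoding.
r_rank, _ = pearsonr(caps_rank, reliability)
r_mid, _ = pearsonr(midpoints, reliability)

print(f"Spearman: codes = {rho_rank:.3f}, midpoints = {rho_mid:.3f}")
print(f"Pearson:  codes = {r_rank:.3f}, midpoints = {r_mid:.3f}")
```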

**Hypothesis Testing Methods**
Pearson’s method was also applied for testing both hypotheses to maintain consistency in the analysis. The second hypothesis included some outliers, which could support the use of Spearman’s rank correlation because of its robustness against extreme values. Spearman’s method was nevertheless deemed less suitable because the variables involved (runtime and memory usage) are ratio-scaled rather than ordinal. Pearson’s correlation therefore remains a justified choice for this type of data.
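
The sensitivity to extreme values mentioned above can be illustrated with synthetic data. In this sketch the variable names `runtime` and `memory` are only labels for randomly generated, unrelated values (not the study's measurements), and the added extreme point is hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Two unrelated synthetic variables.
rng = np.random.default_rng(1)
runtime = rng.uniform(1, 10, size=50)
memory = rng.uniform(1000, 5000, size=50)

# Append a single extreme point that is far above both ranges.
runtime_out = np.append(runtime, 500.0)
memory_out = np.append(memory, 90000.0)

r_before = pearsonr(runtime, memory)[0]
r_after = pearsonr(runtime_out, memory_out)[0]
rho_before = spearmanr(runtime, memory)[0]
rho_after = spearmanr(runtime_out, memory_out)[0]

print(f"Pearson:  before = {r_before:.3f}, after = {r_after:.3f}")
print(f"Spearman: before = {rho_before:.3f}, after = {rho_after:.3f}")
```

One extreme point is enough to push Pearson's r close to 1 here, while Spearman's rho shifts only slightly. Reporting both coefficients side by side is one way to check whether a Pearson result is driven by a few observations.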

**Visualization Tools**
Scatter plots were used to visualize the relationships tested in the hypotheses. This approach was particularly effective in Hypothesis 1, clearly illustrating the unrealistic nature of the perfect linear trend. Additionally, the correlation heatmap provided an intuitive overview of how multiple continuous variables relate to each other, further supporting the use of Pearson’s method throughout the study.

**Other insights**
Reflecting on the analysis, a hypothesis test on the correlation between capability and output reliability might have yielded a more interesting result, since the correlation analysis already suggested that subjects may have overestimated themselves in terms of reliability.