#### Importing our libraries and cleaned data

In [86]:
import pandas as pd
from cleaning import read_data, clean_linting_results, clean_review_test_results, clean_data_reviewss, clean_data_exercises, clean_data_code_blast_tests

read_linting_results, read_review_test_results, read_reviews, data_tests, read_code_blast_tests, read_exercises, data_implementation_exercise = read_data()

data_linting_result = clean_linting_results(read_linting_results)
data_review_test_results = clean_review_test_results(read_review_test_results)
data_reviews = clean_data_reviewss(read_reviews)
data_exercises = clean_data_exercises(read_exercises)
data_code_blast_tests = clean_data_code_blast_tests(read_code_blast_tests)


#### What are the differences in accepted and declined exercises between Dutch and English Jarvis users?

In the code below we combine review data from English and Dutch Jarvis users to compare how many exercises are accepted and declined in each group.

In [87]:
# We merge data_review_test_results and data_reviews based on 'blast_review_id' and 'id'
merged_data = pd.merge(data_review_test_results, data_reviews, left_on='blast_review_id', right_on='id')

# We filter the languages
filtered_data = merged_data[merged_data['test_language'].isin(['nl', 'en'])]

# Because the amount of data differs between the two languages, we keep only the first 500 rows of each
english_data = filtered_data[filtered_data['test_language'] == 'en'].head(500)
dutch_data = filtered_data[filtered_data['test_language'] == 'nl'].head(500)

# We count the accepted and declined values for English users
english_counts = english_data['state'].value_counts()

# We count the accepted and declined values for Dutch users
dutch_counts = dutch_data['state'].value_counts()

# We print the results
print("Amount of ACCEPTED and DECLINED for English user data:")
print(english_counts)

print("\nAmount of ACCEPTED and DECLINED for Dutch user data:")
print(dutch_counts)


Amount of ACCEPTED and DECLINED for English user data:
state
DECLINED    313
ACCEPTED    187
Name: count, dtype: int64

Amount of ACCEPTED and DECLINED for Dutch user data:
state
DECLINED    273
ACCEPTED    227
Name: count, dtype: int64


Based on the first 500 rows for each language, English Jarvis users are declined more often than Dutch users: 313 of 500 (62.6%) versus 273 of 500 (54.6%).
One possible reason is that the English version of Jarvis is less clear to students than the Dutch version.
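
To put a number on this gap, the counts printed above can be converted to decline rates and checked with a two-proportion z-test. This is a minimal sketch using only the standard library. The test is not part of the original analysis, and it assumes the 500 rows per language can be treated as independent samples, which may not hold if a single student contributes many rows.

```python
import math

# Counts copied from the output above (first 500 rows per language).
declined = {"en": 313, "nl": 273}
total = {"en": 500, "nl": 500}

# Decline rate per language.
rates = {lang: declined[lang] / total[lang] for lang in declined}

# Pooled two-proportion z-test (assumption: rows are independent samples).
pooled = sum(declined.values()) / sum(total.values())
std_err = math.sqrt(pooled * (1 - pooled) * (1 / total["en"] + 1 / total["nl"]))
z_score = (rates["en"] - rates["nl"]) / std_err

# Two-sided p-value under the standard normal distribution.
p_value = math.erfc(abs(z_score) / math.sqrt(2))

print(f"English decline rate: {rates['en']:.1%}")
print(f"Dutch decline rate:   {rates['nl']:.1%}")
print(f"z = {z_score:.2f}, p = {p_value:.4f}")
```

A small p-value would suggest the difference is unlikely to be chance alone, but it says nothing about why English users are declined more often.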

#### Does this differ per code language?

With these percentages we can compare, for each code language, the share of exercises that are accepted and declined for English and Dutch Jarvis users.


In [88]:
# We reset the index of data_linting_result
data_linting_result_reset = data_linting_result.reset_index()

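# We extract the code language (the file extension) from file_name using regex
# Note: the assignment below aligns rows by position, so it assumes that
# data_linting_result and filtered_data are in the same row order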
filtered_data = filtered_data.reset_index(drop=True)
filtered_data['code_language'] = data_linting_result_reset['file_name'].str.extract(r'\.(\w+)$')

# We count the number of ACCEPTED and DECLINED for data per code language
counts_by_language = filtered_data.groupby(['test_language', 'code_language', 'state']).size().unstack(fill_value=0)

# We calculate the percentage of ACCEPTED and DECLINED per code language for Dutch users
dutch_percentage_by_language = (
    counts_by_language.loc['nl'].div(counts_by_language.loc['nl'].sum(axis=1), axis=0) * 100
)

# We calculate the percentage of ACCEPTED and DECLINED per code language for English users
english_percentage_by_language = (
    counts_by_language.loc['en'].div(counts_by_language.loc['en'].sum(axis=1), axis=0) * 100
)

# Set the number of decimals
decimals = 1

# Format for printing the results
format_str = "{:." + str(decimals) + "f}%"

# Print the results
print("\nPercentage of ACCEPTED and DECLINED for Dutch users per code language:")
print(dutch_percentage_by_language.map(lambda x: format_str.format(x)).rename_axis(columns={'state': ''}))

print("\nPercentage of ACCEPTED and DECLINED for English users per code language:")
print(english_percentage_by_language.map(lambda x: format_str.format(x)).rename_axis(columns={'state': ''}))



Percentage of ACCEPTED and DECLINED for Dutch users per code language:
              ACCEPTED DECLINED
code_language                  
css              45.4%    54.6%
htm              46.7%    53.3%
html             44.9%    55.1%
inc              40.0%    60.0%
js               45.0%    55.0%
php              45.3%    54.7%
py               46.4%    53.6%
sql              45.4%    54.6%

Percentage of ACCEPTED and DECLINED for English users per code language:
              ACCEPTED DECLINED
code_language                  
css              22.2%    77.8%
html             39.6%    60.4%
js               37.6%    62.4%
php              40.2%    59.8%
sql              30.6%    69.4%


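We see that the acceptance percentages vary more across code languages for the English users (from 22.2% for css to 40.2% for php) than for the Dutch users (from 40.0% for inc to 46.7% for htm). One possible reason is that there are far fewer English users than Dutch users, so each English percentage is based on fewer rows and is more sensitive to chance.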
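
One way to check the sample-size explanation is to show the number of rows behind each percentage. The sketch below uses a small made-up DataFrame and `pd.crosstab` to put the group size next to the acceptance rate. The rows are hypothetical; only the column names `test_language`, `code_language`, and `state` are taken from the code above.

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the notebook.
toy = pd.DataFrame({
    "test_language": ["nl", "nl", "nl", "nl", "en", "en", "en"],
    "code_language": ["py", "py", "py", "py", "css", "css", "css"],
    "state": ["ACCEPTED", "DECLINED", "ACCEPTED", "DECLINED",
              "DECLINED", "DECLINED", "ACCEPTED"],
})

# Row counts per (test_language, code_language) group and state.
counts = pd.crosstab([toy["test_language"], toy["code_language"]], toy["state"])

# Add the group size and the share of accepted rows in percent.
summary = counts.assign(
    n=counts.sum(axis=1),
    accepted_pct=(counts["ACCEPTED"] / counts.sum(axis=1) * 100).round(1),
)
print(summary)
```

Applied to the real `filtered_data`, a table like this would make it visible when a percentage rests on only a handful of rows.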