In [None]:
import pandas as pd

# Description of the Data

In [None]:
survey_data = pd.read_excel('student_data.xlsx', index_col=None)

This data contains students' survey responses to the following items, each rated on a scale from **1 (strongly disagree) to 4 (strongly agree)**.

TASK VALUE [6 Items]:
* I think I will be able to use what I learn in this course in other courses.
* It is important for me to learn the course material in this class.
* I am very interested in the content area of this course.
* I think the course material in this class is useful for me to learn.
* I like the subject matter of this course.
* Understanding the subject matter of this course is very important to me.

METACOGNITIVE SELF REGULATION [5 Items]
* When I become confused about some concepts, I search for some strategies (re-reading, checking other resources, asking for help) to be able to understand them.
* I tried to change the way I study in order to fit the course requirements and the instructor's teaching style.
* When studying for this course I try to determine which concepts I don't understand well.
* When I study for this class, I set goals for myself in order to direct my activities in each study period.
* I give up studying when I have problem understanding a concept in this course.*

TIME ENVIRONMENT MANAGEMENT [6 Items]
* I usually study in a place where I can concentrate on my course work.
* I make good use of my study time for this course.
* I find it hard to stick to a study schedule.
* I have a regular place set aside for studying.
* I make sure that I keep up with the weekly assignments for this course.
* I watch the lecture videos regularly (especially for the classes that I miss).

The data also contains students' access logs to various components in OdtuClass for Modules 2 to 5.

The following list explains the activity columns, taking Module 2 as an example:
* Module2PPT: Number of times a student viewed the slides.
* Module2Recording: Number of times a student viewed the Zoom lecture recording.
* Module2Inclass: Number of times a student viewed the in-class activity in OdtuClass.
* Module2Lab: Number of times a student had any Lab-related activity in OdtuClass.
* Module2Quiz: Number of times a student had any Quiz-related activity in OdtuClass.

Display the first 5 rows of the data to get a better sense of it:

In [None]:
df = pd.read_excel("student_data.xlsx")
df.head()

Unnamed: 0,fullname,gender,year,dept,TV1,TV2,TV3,TV4,TV5,TV6,...,Module4Lab,Module5Lab,Module2PPT,Module3PPT,Module4Presentation,Module5PPT,Module2Recording,Module3Recording,Module4Recording,Module5Recording
0,in ltieca,Male,2nd year,CEIT,4,4,4,4,4,4,...,0,13,0,0,3,0,0,0,0,0
1,lagbsbi ye auc,Female,2nd year,BÖTE-CEIT,4,4,4,4,4,4,...,12,47,0,6,2,0,1,2,0,0
2,ugiyyelazm asl,Female,2nd year,CEIT,3,4,4,4,4,4,...,0,41,4,1,0,0,15,7,2,2
3,einluotiubgnncf ser,Female,2nd year,CEIT,4,4,4,4,4,4,...,2,48,0,3,0,1,0,0,0,0
4,redmetb are,Female,2nd year,Computer Education and Instructional Technology,4,4,4,4,3,4,...,0,26,0,0,0,0,0,0,0,2


The dataframe has too many columns to display at once, so let's print just the column names instead of the dataframe:

In [None]:
print(df.columns)

Index(['fullname', 'gender', 'year', 'dept', 'TV1', 'TV2', 'TV3', 'TV4', 'TV5',
       'TV6', 'METACOG1', 'METACOG2', 'METACOG3', 'METACOG4', 'METACOG5',
       'TIMEMANG1', 'TIMEMANG2', 'TIMEMANG3', 'TIMEMANG4', 'TIMEMANG5',
       'TIMEMANG6', 'Module2Inclass', 'Module3Inclass', 'Module4Inclass',
       'Module5Inclass', 'Module6Inclass', 'Module2Lab', 'Module4Lab',
       'Module5Lab', 'Module2PPT', 'Module3PPT', 'Module4Presentation',
       'Module5PPT', 'Module2Recording', 'Module3Recording',
       'Module4Recording', 'Module5Recording'],
      dtype='object')


# Data Preparation

Before we proceed, we need to process the students' activity logs further. If a student interacted with a component fewer than 3 times (e.g., viewed a PowerPoint fewer than 3 times), we will consider that they did not interact with that component at all. So, we will replace any value less than 3 with 0.

Perform this operation to all columns except the survey columns.

To select the correct columns, you can convert the `survey_data.columns` of the dataframe to a series by `survey_data.columns.to_series()` on which you can apply `.loc` to select all columns between 'Module2Inclass' and 'Module5Recording' using proper slicing.
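The label-based slicing described above can be tried on a small example first. This is a minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions, not the real survey data); the point it shows is that `.loc` slicing by label includes both endpoints.

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "gender": ["Male", "Female"],
    "TV1": [4, 3],
    "Module2Inclass": [12, 10],
    "Module2Lab": [4, 5],
    "Module2PPT": [0, 6],
})

# Convert the column index to a Series so that .loc can slice it by label.
# Unlike positional slicing, label slicing includes BOTH endpoints.
selected = toy.columns.to_series().loc["Module2Inclass":"Module2PPT"]
print(list(selected))
```

The survey columns (`gender`, `TV1`) fall outside the slice, so only the three activity columns are selected.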

Then you should use a `for` loop to iterate through the selected columns. At each iteration, apply `.loc` on the dataframe to select the rows where the value of the column is less than 3, and set the value of the column to 0 for those rows.

In [None]:
for col in survey_data.columns.to_series().loc['Module2Inclass': 'Module5Recording']:
    survey_data.loc[survey_data[col] < 3, col] = 0
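As a point of comparison, the same thresholding could probably be done without an explicit loop by using `DataFrame.mask`, which replaces values wherever a condition is true. The sketch below runs on a made-up frame (`toy`, `log_cols`, and the numbers are assumptions for illustration); it is an alternative, not the approach the exercise asks for.

```python
import pandas as pd

# Toy activity log; names and values are invented for illustration.
toy = pd.DataFrame({
    "fullname": ["a", "b", "c"],
    "Module2PPT": [0, 2, 5],
    "Module2Recording": [1, 3, 7],
})

log_cols = ["Module2PPT", "Module2Recording"]

# mask() replaces every cell where the condition is True with the given value.
toy[log_cols] = toy[log_cols].mask(toy[log_cols] < 3, 0)
print(toy)
```

Values of exactly 3 are kept, because the condition is strictly "less than 3".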

## TASKS

**CHLG1.** Find out if students check powerpoints more or zoom recordings more across all modules. **REPORT** the exact values.

In [None]:
# Total number of slide views across all modules
pptSum = survey_data[['Module2PPT', 'Module3PPT',
                      'Module4Presentation',
                      'Module5PPT']].sum().sum()
# Total number of recording views across all modules
recSum = survey_data[['Module2Recording',
                      'Module3Recording',
                      'Module4Recording',
                      'Module5Recording']].values.sum()
pptSum, recSum

(np.int64(128), np.int64(52))

**CHLG2.** Find the average lab activities of female students vs male students across all modules. Identify which group had a higher average. **REPORT** the exact values.

Avoid writing column names one by one. Instead, use `myData.columns.str.contains()` to get the columns that contain the word 'Lab'.    
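To see what `columns.str.contains()` returns, here is a minimal sketch on an empty toy frame (the frame `toy` is an assumption made up for this illustration). The call yields a boolean array, one entry per column, which can then be used to index the column labels.

```python
import pandas as pd

# Empty toy frame: only the column names matter here.
toy = pd.DataFrame(columns=["gender", "Module2Lab", "Module4Lab", "Module2PPT"])

# Boolean array: True where the column name contains the substring 'Lab'.
is_lab = toy.columns.str.contains("Lab")

# Use the boolean array to keep only the matching column labels.
lab_cols = toy.columns[is_lab]
print(list(lab_cols))
```

Note that `str.contains` treats its pattern as a regular expression by default, which makes no difference for a plain word like 'Lab'.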


In [None]:
colLabs = survey_data.columns[survey_data.columns.str.contains('Lab')]

maleAvg = survey_data[survey_data.gender == 'Male'][colLabs].values.mean()
femaleAvg = survey_data[survey_data.gender == 'Female'][colLabs].values.mean()

# Display the results
print(f"Male Average: {maleAvg}")
print(f"Female Average: {femaleAvg}")

Male Average: 13.904761904761905
Female Average: 11.75


**CHLG3.** Which lab assignment was more challenging for males vs. females? A lab assignment with more activities is considered more challenging. **REPORT** the exact values about the activity level.

In [None]:
survey_data[survey_data.gender == 'Male'][colLabs].mean()

Unnamed: 0,0
Module2Lab,6.285714
Module4Lab,3.5
Module5Lab,31.928571


In [None]:
survey_data[survey_data.gender == 'Female'][colLabs].mean()

Unnamed: 0,0
Module2Lab,5.1875
Module4Lab,1.375
Module5Lab,28.6875


Please make `fullname` the index of the dataframe.

In [None]:
survey_data.set_index('fullname', inplace=True)

### Reversing some items

When you collect research data through a survey, you generally introduce some reverse items to help ensure the reliability and validity of your measurements. Reverse items, also known as negatively worded or reverse-coded questions, are survey questions that are worded in the opposite direction of the construct you are attempting to measure. For example, if you are trying to measure a student's level of satisfaction with a course, you might ask the student to rate the following statements on a scale of 1 to 4:

1. I am satisfied with this course.
// some other survey items....
8. I am not satisfied with this course. [Reversed]

The student's response to the reversed question (item 8) would be reverse coded, meaning that a response of 1 would indicate a high level of satisfaction, while a response of 4 would indicate a low level of satisfaction.

In our dataset, the TIMEMANG3, TV3, and METACOG5 items are reversed. So, 1 should become 4, 2 should become 3, 3 should become 2, and 4 should become 1. We can do this easily by subtracting each value from 5.

Reverse the values of the TIMEMANG3, TV3, and METACOG5 items.

In [None]:
cols_to_reverse = ['TIMEMANG3', 'TV3', 'METACOG5']

survey_data[cols_to_reverse] = 5 - survey_data[cols_to_reverse]
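The "subtract from 5" rule is a special case of `(scale minimum + scale maximum) - response`. The sketch below checks it on a toy series; the names `SCALE_MIN`, `SCALE_MAX`, and `responses` are assumptions introduced only for this illustration.

```python
import pandas as pd

# The survey scale runs from 1 to 4, so min + max = 5.
SCALE_MIN, SCALE_MAX = 1, 4

# One response at each point of the scale (toy data).
responses = pd.Series([1, 2, 3, 4], name="item")

# Reverse coding: 1 <-> 4 and 2 <-> 3.
reversed_responses = (SCALE_MIN + SCALE_MAX) - responses
print(reversed_responses.tolist())

# Reversing twice gives back the original responses.
restored = (SCALE_MIN + SCALE_MAX) - reversed_responses
```

Because reversing twice restores the original values, the reversal cell should be run only once per session.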

In a survey we use multiple statements to measure an abstract construct. In our dataset, 3 constructs were measured using multiple statements. We need to combine the responses to these statements to create a single score for each construct. We can do this by taking the average of the responses to the statements that measure each construct.

Find the average scores for time management, task value, and metacognition. At the end, the dataframe should have a new column for each construct, and the values in these columns should be the average of the responses to the statements that measure each construct.

After you add these columns, drop the columns that measure the constructs.

At the end of this process, the dataframe should have the following columns:

['gender', 'year', 'dept', 'Module2Inclass', 'Module3Inclass',
 'Module4Inclass', 'Module5Inclass', 'Module6Inclass', 'Module2Lab',
 'Module4Lab', 'Module5Lab', 'Module2PPT', 'Module3PPT',
 'Module4Presentation', 'Module5PPT', 'Module2Recording',
 'Module3Recording', 'Module4Recording', 'Module5Recording',
 **'TimeMangScore'**, **'TaskValueScore'**, **'MetaCogScore'**]
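The "average, then drop" step can be sketched on a toy frame with a single construct. The frame `toy` and its values are made up for this illustration; `drop(columns=...)` is shown here as one possible way to remove the item columns, which is an assumption about how the step might be done.

```python
import pandas as pd

# Toy frame with two task-value items; values are invented.
toy = pd.DataFrame({
    "gender": ["Male", "Female"],
    "TV1": [4, 3],
    "TV2": [2, 3],
    "Module2Lab": [4, 5],
})

# Collect the item columns that belong to the construct.
tv_cols = [c for c in toy.columns if c.startswith("TV")]

# Row-wise mean (axis=1) gives one construct score per student.
toy["TaskValueScore"] = toy[tv_cols].mean(axis=1)

# Remove the individual items once the score has been computed.
toy = toy.drop(columns=tv_cols)
print(toy)
```

Dropping by name leaves the remaining columns in their original order, with the new score column at the end.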

In [None]:
tm_cols   = survey_data.columns[survey_data.columns.str.startswith('TIMEMANG')]
tv_cols   = survey_data.columns[survey_data.columns.str.startswith('TV')]
# The metacognition items are named METACOG1..METACOG5 in the column list
meta_cols = survey_data.columns[survey_data.columns.str.startswith('METACOG')]

survey_data['TimeMangScore']  = survey_data[tm_cols].mean(axis=1)
survey_data['TaskValueScore'] = survey_data[tv_cols].mean(axis=1)
survey_data['MetaCogScore']   = survey_data[meta_cols].mean(axis=1)

survey_data = survey_data[[
    'gender', 'year', 'dept',
    'Module2Inclass', 'Module3Inclass', 'Module4Inclass',
    'Module5Inclass', 'Module6Inclass',
    'Module2Lab', 'Module4Lab', 'Module5Lab',
    'Module2PPT', 'Module3PPT', 'Module4Presentation', 'Module5PPT',
    'Module2Recording', 'Module3Recording', 'Module4Recording', 'Module5Recording',
    'TimeMangScore', 'TaskValueScore', 'MetaCogScore'
]]

survey_data.head()

fullname,gender,year,dept,Module2Inclass,Module3Inclass,Module4Inclass,Module5Inclass,Module6Inclass,Module2Lab,Module4Lab,...,Module3PPT,Module4Presentation,Module5PPT,Module2Recording,Module3Recording,Module4Recording,Module5Recording,TimeMangScore,TaskValueScore,MetaCogScore
in ltieca,Male,2nd year,CEIT,12,10,10,10,10,4,0,...,0,3,0,0,0,0,0,3.666667,3.5,
lagbsbi ye auc,Female,2nd year,BÖTE-CEIT,10,10,13,12,12,5,12,...,6,2,0,1,2,0,0,2.833333,3.5,
ugiyyelazm asl,Female,2nd year,CEIT,38,20,10,8,8,6,0,...,1,0,0,15,7,2,2,3.333333,3.333333,
einluotiubgnncf ser,Female,2nd year,CEIT,10,10,12,10,10,9,2,...,3,0,1,0,0,0,0,4.0,3.5,
redmetb are,Female,2nd year,Computer Education and Instructional Technology,14,10,12,12,12,0,0,...,0,0,0,0,0,0,2,2.5,3.333333,


**CHLG4.** Use `describe()` to find the median (the `50%` row) of the Time Management average score, i.e. the value that splits the students into two halves.
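For reference, the `50%` entry that `describe()` reports is the median of the column. A minimal sketch on toy scores (the series `scores` is invented for illustration):

```python
import pandas as pd

# Toy scores; not taken from the survey data.
scores = pd.Series([1.5, 2.0, 3.0, 3.5, 4.0])

# describe() returns a Series indexed by statistic name.
summary = scores.describe()

# The '50%' row is the 50th percentile, i.e. the median.
cutoff = summary["50%"]
print(cutoff, scores.median())
```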

In [None]:
survey_data[['TimeMangScore', 'MetaCogScore', 'TaskValueScore']].describe()

Unnamed: 0,TimeMangScore,MetaCogScore,TaskValueScore
count,30.0,0.0,30.0
mean,2.983333,,3.166667
std,0.599089,,0.358263
min,1.833333,,2.5
25%,2.541667,,2.875
50%,3.083333,,3.333333
75%,3.458333,,3.5
max,4.0,,3.5


Create a new column called TimeManLevel. If a student's Time Management average score is greater than or equal to the median found above, then the value of TimeManLevel should be 'HIGH'. Otherwise, the value of TimeManLevel should be 'LOW'.

Here, you should use `loc` to update the values in the column. For example, `df.loc[df['price'] == 100, 'price_level'] = 'high'` will set `price_level` to 'high' for the rows where the value of `price` is 100. If there is no such column, it will create a new column called `price_level`.
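The behaviour of `loc` with a condition can be checked on a toy frame. In this sketch, `toy`, `price`, and the label 'other' are assumptions made up for the illustration; it shows that rows not matched by the condition stay missing until they are assigned as well.

```python
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({"price": [100, 250, 100]})

# Creates 'price_level' and fills it only where the condition holds.
toy.loc[toy["price"] == 100, "price_level"] = "high"

# Rows that did not match the condition are still missing (NaN).
missing_before_fill = toy["price_level"].isna().sum()

# A second assignment with the complementary condition fills the rest.
toy.loc[toy["price"] != 100, "price_level"] = "other"
print(toy)
```

This is why two `loc` assignments with complementary conditions are needed to label every row.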

In [None]:
tm_desc   = survey_data['TimeMangScore'].describe()
tm_cutoff = tm_desc['50%']

survey_data.loc[survey_data['TimeMangScore'] >= tm_cutoff, 'TimeManLevel'] = 'HIGH'
survey_data.loc[survey_data['TimeMangScore'] <  tm_cutoff, 'TimeManLevel'] = 'LOW'

survey_data[['TimeMangScore', 'TimeManLevel']].head()

fullname,TimeMangScore,TimeManLevel
in ltieca,3.666667,HIGH
lagbsbi ye auc,2.833333,LOW
ugiyyelazm asl,3.333333,HIGH
einluotiubgnncf ser,4.0,HIGH
redmetb are,2.5,LOW


Compute the average of the columns for the students in the LOW group and the students in the HIGH group separately. Then, concatenate the results into a single dataframe. The resulting dataframe should have 2 columns: LOW and HIGH. The index of the dataframe should be the column names of the original dataframe. The final dataframe should look like the following:
<br>
<img src="img1.png">
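One way to build the LOW/HIGH table described above is to compute the two mean Series separately and join them with `pd.concat`. The sketch below uses a made-up frame (`toy`, `numeric_cols`, and all values are assumptions for illustration), so its numbers are not the ones in the image.

```python
import pandas as pd

# Toy frame with two activity columns and a group label; values are invented.
toy = pd.DataFrame({
    "Module2Lab": [4, 0, 6, 10],
    "Module2PPT": [0, 3, 5, 4],
    "TimeManLevel": ["LOW", "LOW", "HIGH", "HIGH"],
})
numeric_cols = ["Module2Lab", "Module2PPT"]

# Column means for each group; each result is a Series indexed by column name.
low_mean = toy.loc[toy["TimeManLevel"] == "LOW", numeric_cols].mean()
high_mean = toy.loc[toy["TimeManLevel"] == "HIGH", numeric_cols].mean()

# axis=1 places the two Series side by side; keys become the column names.
comparison = pd.concat([low_mean, high_mean], axis=1, keys=["LOW", "HIGH"])
print(comparison)
```

The index of `comparison` holds the original column names, and the two columns are LOW and HIGH, which matches the layout requested in the task.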

**CHLG5.** Based on the final dataframe, identify the items (such as Module2Lab) for which group LOW has a greater score than group HIGH. **REPORT** the exact values.

In [None]:
cols_to_compare = [c for c in survey_data.columns
                   if c not in ['gender', 'year', 'dept', 'TimeManLevel']]

group_means = survey_data.groupby('TimeManLevel')[cols_to_compare].mean()

mask_low_higher = group_means.loc['LOW'] > group_means.loc['HIGH']

low_vs_high = pd.DataFrame({
    'LOW_mean': group_means.loc['LOW', mask_low_higher],
    'HIGH_mean': group_means.loc['HIGH', mask_low_higher]
})

low_vs_high

Unnamed: 0,LOW_mean,HIGH_mean
Module5PPT,1.066667,0.666667
Module3Recording,0.8,0.666667


**CHLG6.** In which module is the difference between the total number of activities of group LOW and group HIGH the greatest? **REPORT** the exact values.

In [None]:
modules = {
    'Module2': ['Module2Inclass','Module2Lab','Module2PPT','Module2Recording'],
    'Module3': ['Module3Inclass','Module3PPT','Module3Recording'],
    'Module4': ['Module4Inclass','Module4Lab','Module4Presentation','Module4Recording'],
    'Module5': ['Module5Inclass','Module5Lab','Module5PPT','Module5Recording'],
    'Module6': ['Module6Inclass']
}

rows = []
for mod, cols in modules.items():
    low_total  = survey_data.loc[survey_data['TimeManLevel']=='LOW', cols].sum().sum()
    high_total = survey_data.loc[survey_data['TimeManLevel']=='HIGH', cols].sum().sum()
    rows.append({
        'Module': mod,
        'LOW_total': low_total,
        'HIGH_total': high_total,
        'LOW_minus_HIGH': low_total - high_total,
        'abs_difference': abs(low_total - high_total)
    })

module_diffs = pd.DataFrame(rows)
module_diffs

Unnamed: 0,Module,LOW_total,HIGH_total,LOW_minus_HIGH,abs_difference
0,Module2,319,448,-129,129
1,Module3,198,242,-44,44
2,Module4,223,265,-42,42
3,Module5,503,823,-320,320
4,Module6,160,191,-31,31


### WRITE TWO MORE CHALLENGES YOURSELF AND SOLVE THEM

**CHLG7.** Compare the average Task Value and Metacognition scores of the LOW vs. HIGH time-management groups. **REPORT** the exact average TaskValueScore and MetaCogScore for each group.


In [None]:
tv_meta_means = survey_data.groupby('TimeManLevel')[['TaskValueScore', 'MetaCogScore']].mean()

tv_meta_means

TimeManLevel,TaskValueScore,MetaCogScore
HIGH,3.333333,
LOW,3.0,


**CHLG8.** Create a new column called TotalActivities that is the sum of all module-related activity columns (Inclass, Lab, PPT/Presentation, Recording) for each student. Compare the average TotalActivities of the LOW vs. HIGH groups. **REPORT** which group is more active overall and the exact average values.

In [None]:
activity_cols = [c for c in survey_data.columns if c.startswith('Module')]
survey_data['TotalActivities'] = survey_data[activity_cols].sum(axis=1)

total_means = survey_data.groupby('TimeManLevel')['TotalActivities'].mean()

total_means

TimeManLevel,TotalActivities
HIGH,131.266667
LOW,93.533333
