# Module 6 Lab Assignment

In this activity, you will work with an educational dataset that contains the following columns:

- `StudentId`: Unique identifier for each student.
- `CourseId`: Identifier for the course.
- `InteractionDate`: The date of the interaction.
- `InteractionType`: Type of interaction (e.g., 'Video', 'Quiz', 'Discussion').
- `DurationSeconds`: The duration of the interaction in seconds.
- `Completion`: Whether the interaction was completed (1 for completed, 0 for not completed).
- `Result`: The result of the interaction if applicable (e.g., quiz score).
- `ContentId`: The unique identifier for the content interacted with.
- `InteractionId`: A unique identifier for each interaction.


### **Do not use `groupby` or any other method we have not covered in the lectures yet.**

Please load the dataset into a dataframe and print the first 5 rows:

In [21]:
# First, import pandas
import pandas as pd

In [29]:
df = pd.read_excel("educational_dataset.xlsx")
df.head()

Unnamed: 0,StudentId,CourseId,InteractionDate,InteractionType,DurationSeconds,Completion,Result,ContentId,InteractionId
0,49.0,C101,2023-11-10 04:11:03.727,Video,522.0,0,,D005,I263
1,45.0,C101,2023-11-02 14:22:50.741,Discussion,584.0,0,81.0,Q008,I434
2,15.0,C105,2023-11-01 08:04:48.577,Quiz,562.0,0,,D005,I165
3,14.0,C109,2023-11-11 15:11:54.228,Quiz,490.0,0,,D009,I158
4,33.0,C106,2023-11-04 22:16:06.733,Discussion,67.0,1,,D008,I619


**TASK 1 [15 PTS]** Check for and report the number of missing values in each column of the dataset. You should use a loop to iterate through the columns. Your output MUST match the given output:

In [30]:
for col in df.columns:
    missing_count = df[col].isnull().sum()
    if missing_count > 0:
        print(f"{col} :  {missing_count}")

StudentId :  1
CourseId :  1
InteractionDate :  1
InteractionType :  1
DurationSeconds :  4
Result :  266
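
The loop pattern above can be tried on a tiny hand-made frame first. This is only a sketch: the frame `toy` and its values are made up for illustration and are not part of the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the lab dataset.
toy = pd.DataFrame({
    "StudentId": [1.0, np.nan, 3.0],
    "CourseId": ["C101", "C102", "C101"],
    "Result": [np.nan, np.nan, 90.0],
})

# Same loop pattern: count the missing values column by column and
# report only the columns that actually have gaps.
missing = {}
for col in toy.columns:
    count = int(toy[col].isnull().sum())
    if count > 0:
        missing[col] = count
        print(f"{col} :  {count}")
```

Columns with no missing values, such as `CourseId` here, are skipped by the `if` check, which is why `Completion`, `ContentId`, and `InteractionId` do not appear in the expected output.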


**TASK 2 [5 PTS]** Remove any rows with missing `StudentId` or `CourseId` since these are critical pieces of information. Use a **single** `dropna` statement. Print the shape of the dataframe before and after the removal. Your output should look like this:

In [31]:
print(f"before {df.shape}")
df.dropna(subset=['StudentId', 'CourseId'], inplace=True)
print(f"after {df.shape}")

before (500, 9)
after (498, 9)
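
To see what the `subset` argument does, here is a minimal sketch on made-up data (the frame `toy` is hypothetical). It uses the non-in-place form of `dropna`, which returns a new frame; for this task the effect is the same as `inplace=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the lab dataset.
toy = pd.DataFrame({
    "StudentId": [1.0, np.nan, 3.0, 4.0],
    "CourseId": ["C101", "C102", None, "C103"],
    "Result": [np.nan, 70.0, 80.0, np.nan],
})

# Only StudentId and CourseId are checked. The missing Result values in
# the first and last rows do not cause those rows to be dropped.
cleaned = toy.dropna(subset=["StudentId", "CourseId"])
print(f"before {toy.shape}")
print(f"after {cleaned.shape}")
```

Rows are removed only when one of the listed columns is missing, so gaps in other columns (such as `Result`) survive this step and can be handled later.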


**TASK 3 [25 PTS]** Fill in missing values for the `Result` column with the average `Result` of the corresponding course.

For this task, you should first create the following dictionary using a `dict` comprehension.

In [32]:
course_avg_results = {
    course: float(df.loc[df['CourseId'] == course, 'Result'].mean().round(2))
    for course in df['CourseId'].unique()
}
course_avg_results

{'C101': 83.84,
 'C105': 88.72,
 'C109': 80.71,
 'C106': 72.52,
 'C107': 82.08,
 'C104': 73.96,
 'C103': 78.62,
 'C102': 67.64,
 'C108': 63.4,
 'C110': 87.68}

Then, pass the dictionary you obtained to the `map` function to fill in the missing `Result` values.

[Hint: You can use `.loc` to find the rows with missing `Result` value.]

In [33]:
df.loc[df['Result'].isnull(), 'Result'] = df.loc[df['Result'].isnull(), 'CourseId'].map(course_avg_results)
display(df)

Unnamed: 0,StudentId,CourseId,InteractionDate,InteractionType,DurationSeconds,Completion,Result,ContentId,InteractionId
0,49.0,C101,2023-11-10 04:11:03.727,Video,522.0,0,83.84,D005,I263
1,45.0,C101,2023-11-02 14:22:50.741,Discussion,584.0,0,81.00,Q008,I434
2,15.0,C105,2023-11-01 08:04:48.577,Quiz,562.0,0,88.72,D005,I165
3,14.0,C109,2023-11-11 15:11:54.228,Quiz,490.0,0,80.71,D009,I158
4,33.0,C106,2023-11-04 22:16:06.733,Discussion,67.0,1,72.52,D008,I619
...,...,...,...,...,...,...,...,...,...
495,7.0,C104,2023-11-02 11:41:14.549,Video,405.0,1,73.96,Q005,I264
496,12.0,C106,2023-11-05 09:02:31.503,Discussion,79.0,0,72.52,D003,I773
497,46.0,C101,2023-11-02 06:18:02.164,Video,352.0,1,81.00,Q005,I989
498,29.0,C101,2023-11-11 17:53:30.421,Video,585.0,0,83.84,D008,I220
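
The mask-then-`map` pattern used in this task can be checked on a small made-up frame. This is a sketch under stated assumptions: `toy`, `toy_avg`, and `mask` are hypothetical names with invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the lab dataset.
toy = pd.DataFrame({
    "CourseId": ["C101", "C101", "C102", "C102"],
    "Result": [80.0, np.nan, 60.0, np.nan],
})

# Per-course averages; mean() skips NaN values by default.
toy_avg = {
    course: float(toy.loc[toy["CourseId"] == course, "Result"].mean())
    for course in toy["CourseId"].unique()
}

# Select only the rows with a missing Result, look up each row's course
# in the dictionary, and write the looked-up average back into the gap.
mask = toy["Result"].isnull()
toy.loc[mask, "Result"] = toy.loc[mask, "CourseId"].map(toy_avg)
print(toy["Result"].tolist())
```

Because the assignment targets only the masked rows, existing `Result` values are left untouched.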


**TASK 4 [15 PTS]** Sort the DataFrame by `InteractionDate` and fill in missing `DurationSeconds` values using the forward-fill (`ffill`) technique.

In [34]:
# First, identify the number of missing values in the DurationSeconds column
print(df['DurationSeconds'].isnull().sum())

4


In [36]:
# Sort by InteractionDate, then forward-fill the missing values.
# Printing the mean of the DurationSeconds column should give the value below.
df["InteractionDate"] = pd.to_datetime(df["InteractionDate"])
df = df.sort_values("InteractionDate")
df["DurationSeconds"] = df["DurationSeconds"].ffill()
print(df["DurationSeconds"].mean())

336.69477911646584
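
The sort matters because `ffill` copies the value from the row directly above in the current row order. The following sketch, on made-up data (the frame `toy` is hypothetical), shows the same gap being filled differently with and without sorting.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, stored out of chronological order.
toy = pd.DataFrame({
    "InteractionDate": pd.to_datetime(["2023-11-01", "2023-11-03", "2023-11-02"]),
    "DurationSeconds": [100.0, 300.0, np.nan],
})

# Without sorting, the gap copies the row stored just above it (300.0).
unsorted_fill = toy["DurationSeconds"].ffill()

# After sorting by date, the gap copies the chronologically previous
# interaction (100.0 from 2023-11-01).
sorted_fill = toy.sort_values("InteractionDate")["DurationSeconds"].ffill()

print(unsorted_fill.tolist())
print(sorted_fill.tolist())
```

Note that `ffill` cannot fill a gap in the very first row, since there is no earlier value to carry forward.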


**TASK 5 [15 PTS]** Compute the average duration spent in each course.

For this task, I recommend first creating the following dictionary using a dict comprehension:

In [37]:
course_duration_avg = {
    course: float(df.loc[df['CourseId'] == course, 'DurationSeconds'].mean())
    for course in df['CourseId'].unique()
}
display(course_duration_avg)

{'C108': 314.7692307692308,
 'C104': 337.84313725490193,
 'C101': 348.95,
 'C110': 337.02222222222224,
 'C107': 332.62,
 'C105': 368.36170212765956,
 'C103': 334.0,
 'C102': 310.27272727272725,
 'C106': 350.41818181818184,
 'C109': 324.1454545454545}

Then, you should create a Series object from the dictionary.

In [41]:
course_duration_series = pd.Series(course_duration_avg).sort_values()
course_duration_series

Unnamed: 0,0
C102,310.272727
C108,314.769231
C109,324.145455
C107,332.62
C103,334.0
C110,337.022222
C104,337.843137
C101,348.95
C106,350.418182
C105,368.361702


**TASK 6 [15 PTS]** Calculate the average `Result` for each course. In this task, you should follow the same approach used in Task 5 to obtain the following Series:

In [43]:
course_result_avg = {
    course: float(df.loc[df["CourseId"] == course, "Result"].mean())
    for course in df["CourseId"].unique()
}
course_result_series = pd.Series(course_result_avg)
course_result_series

Unnamed: 0,0
C108,63.4
C104,73.958431
C101,83.840667
C110,87.680889
C107,82.08
C105,88.72
C103,78.622308
C102,67.638182
C106,72.519273
C109,80.709818


**TASK 7 [10 PTS]** Check whether there is any correlation between the average duration spent in a course and its average `Result`.

First, I recommend creating a new dataframe by concatenating the two Series objects you created above.

In [44]:
df_course = pd.concat(
    [course_duration_series.rename("Duration"),
     course_result_series.rename("Result")],
    axis=1
)
df_course

Unnamed: 0,Duration,Result
C102,310.272727,67.638182
C108,314.769231,63.4
C109,324.145455,80.709818
C107,332.62,82.08
C103,334.0,78.622308
C110,337.022222,87.680889
C104,337.843137,73.958431
C101,348.95,83.840667
C106,350.418182,72.519273
C105,368.361702,88.72
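
Note that the two Series are in different orders (one was sorted by value, the other was not), yet the rows above still line up by course. That is because `pd.concat` with `axis=1` aligns on index labels rather than on position. A minimal sketch with made-up numbers (the Series `a` and `b` are hypothetical):

```python
import pandas as pd

# Hypothetical values; the two Series list the courses in different orders.
a = pd.Series({"C102": 310.0, "C108": 315.0, "C109": 324.0})
b = pd.Series({"C108": 63.0, "C109": 81.0, "C102": 68.0})

# Each value lands in the row that carries its own course label.
combined = pd.concat([a.rename("Duration"), b.rename("Result")], axis=1)
print(combined)
```

Looking up a single cell, for example `combined.loc["C102", "Result"]`, returns the value that `b` stored under `"C102"`, regardless of where it sat in `b`.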


Then, you can apply the `corr()` function to compute the correlation.

Based on the computed correlation value, please make a brief statement about the relationship between Duration and Result.

In [47]:
df_course.corr()

Unnamed: 0,Duration,Result
Duration,1.0,0.669521
Result,0.669521,1.0
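
When only the single coefficient is needed, `Series.corr` returns it as a scalar (Pearson by default). The sketch below rebuilds the course-level table from the rounded values printed above, so the result matches the matrix only up to rounding.

```python
import pandas as pd

# Values copied from the df_course output above (rounded to six decimals).
df_course = pd.DataFrame(
    {
        "Duration": [310.272727, 314.769231, 324.145455, 332.62, 334.0,
                     337.022222, 337.843137, 348.95, 350.418182, 368.361702],
        "Result": [67.638182, 63.4, 80.709818, 82.08, 78.622308,
                   87.680889, 73.958431, 83.840667, 72.519273, 88.72],
    },
    index=["C102", "C108", "C109", "C107", "C103",
           "C110", "C104", "C101", "C106", "C105"],
)

# Pearson correlation between the two columns as a single number.
r = df_course["Duration"].corr(df_course["Result"])
print(round(r, 2))
```

A coefficient of about 0.67 is usually read as a moderate-to-strong positive linear association: courses with longer average interactions tend to have higher average results. With only ten courses, and with missing results filled by course averages, this should be interpreted cautiously, and it does not imply causation.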


**[OPTIONAL] TASK 8.** Add a new column `InteractionCategory` that maps each `InteractionType` to a category: `Active` for 'Quiz' and 'Discussion', and `Passive` for 'Video'. Use a predefined dictionary, which you should define yourself.

After adding the column, please display the number of Active and Passive values, as below:

In [48]:
category_mapping = {
    'Quiz': 'Active',
    'Discussion': 'Active',
    'Video': 'Passive'
}
df['InteractionCategory'] = df['InteractionType'].map(category_mapping)
display(df['InteractionCategory'].value_counts())

InteractionCategory,count
Active,346
Passive,151
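
The two counts sum to 497, one fewer than the 498 rows left after Task 2. A likely explanation (an assumption, since the row itself is not shown) is the one missing `InteractionType` reported in Task 1: `map` leaves a missing key as `NaN`, and `value_counts` skips `NaN` by default. A sketch on made-up data (`toy_types` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing interaction type.
toy_types = pd.Series(["Quiz", "Video", np.nan, "Discussion"])
category_mapping = {"Quiz": "Active", "Discussion": "Active", "Video": "Passive"}

# The missing type has no entry in the dictionary, so it stays NaN.
toy_categories = toy_types.map(category_mapping)

counts = toy_categories.value_counts()                      # NaN excluded
counts_with_nan = toy_categories.value_counts(dropna=False)  # NaN included
print(counts)
print(counts_with_nan)
```

Passing `dropna=False` makes the uncategorized rows visible, which is a quick way to confirm that every row is accounted for.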
