## Initial Run Through of the Dataset

In [2]:
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from ydata_profiling import ProfileReport

In [2]:
nltk.download("wordnet")
nltk.download("stopwords")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\limzh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\limzh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
df_ = pd.read_csv('../Data/responses_withVAKdom.csv', encoding="ansi")
df_.head()

Unnamed: 0,Timestamp,Email Address,"I understand that my participation is voluntary and that I am free to withdraw at any time without giving any reason. I, hereby agree to take part in the above study.",Gender,Level of Study,Field of study,Institutions,Country,Household Income,Preferred learning mode,...,25. When I first contact a new person,26. I first notice how people,27. If I am very angry,28. I find it easiest to remember,29. I think I can tell someone is lying because,30. When I'm meeting with an old friend,Please share any comments or suggestions related to this issue. Thank You,Faculty,Department,Dominant_VAK
0,1/13/2021 14:25,test@gmail.com,Agree,Female,Undergraduate,Veterinary,test,test,RM 3001 - 10 000,Face to Face,...,I arrange a face to face meeting,Look and dress,I keep replaying in my mind what it is that ha...,Things I have done,Their voice changes,"I say ""it's great to hear your voice!""",,,,Visual
1,1/13/2021 14:37,liyanashuib@gmail.com,Agree,Female,Postgraduate,Computing,UM,Malaysia,RM 3001 - 10 000,"Face to Face, Synchronous Online Learning (Rea...",...,I talk to them on the telephone,Stand and move,"I stomp about, slam doors and throw things",Faces,They avoid looking at you,"I say ""it's great to hear your voice!""",no,,,Visual
2,1/17/2021 19:04,azirasuhot@gmail.com,Agree,Female,Postgraduate,Computing,UM,Malaysia,RM 3001 - 10 000,Face to Face,...,I arrange a face to face meeting,Sound and speak,I keep replaying in my mind what it is that ha...,Things I have done,The vibes I get from them,"I say ""it's great to see you!""",,,,Visual
3,1/17/2021 19:09,haslina_m@um.edu.my,Agree,Female,Postgraduate,3:00,University Malaya,Malaysia,RM 10 001 - 25 000,"Face to Face, Asynchronous Online Learning (On...",...,I arrange a face to face meeting,Look and dress,I keep replaying in my mind what it is that ha...,Things I have done,The vibes I get from them,I give them a hug or a handshake,,,,Visual
4,1/18/2021 12:08,noorain277@um.edu.my,Agree,Female,Postgraduate,Humanities,UM,Malaysia,RM 10 001 - 25 000,Face to Face,...,I talk to them on the telephone,Sound and speak,I keep replaying in my mind what it is that ha...,Things I have done,The vibes I get from them,"I say ""it's great to see you!""",,,,Auditory


In [5]:
df_["For technical or hands-on subject, which Online Instructional Strategies/Assessment do you preferÂ\xa0(list can be referred in previous question  eg. YouTube Video, Debugging or Live Lecture)?"].value_counts()[0:49]

Live lecture                      157
Live Lecture                       74
YouTube Video                      52
Youtube video                      51
Youtube                            45
Live lecture                       37
Youtube Video                      35
live lecture                       33
youtube video                      21
YouTube video                      19
youtube                            11
YouTube                             9
Pre-recorded lecture                6
Live Lecture                        6
live lecturer                       6
-                                   6
Youtube                             6
live lecture                        6
Live                                6
Live lecturer                       5
LIVE LECTURE                        5
Youtube video                       5
Youtube Video and Live Lecture      5
live lecture and youtube video      4
Games                               4
Youtube video and live lecture      4
YouTube vide

In [4]:
report = ProfileReport(df_, title="Quick EDA", minimal=True, html={"style": {"full_width": True}})
report.to_file("report.html")

### Details about Dataset
*To be read together with the report.html generated by ProfileReport.*

- Dataset has 104 variables
- Dataset has 1052 observations (rows)
- Dataset has 836 missing values (0.8%)
- All variables in the dataset are stored as text, so type conversion is needed during the cleaning process before further analysis
- Certain columns, such as "Timestamp" and "Email Address", will not be useful for our analysis; they will be removed before model training


In [5]:
df_.dtypes.value_counts()

object    104
dtype: int64

Some columns are self-explanatory. Because of the large number of variables, only some are described here for clarification:
- **Level of Study** has only 5 distinct values, so it is a categorical variable.
- **Field of study** has 168 distinct values. Cleaning is needed to extract only the relevant fields of study for our analysis.
- **Institutions** was entered manually, so the same institution appears in several different formats.
- **Household Income** is a categorical variable with 6 distinct values.
- **Preferred learning mode** was a multiple-choice question during data collection. The comma-separated values need to be split and then one-hot encoded. The same goes for **Preferred Social Media Platform**, **Preferred Communication Platform** and **Difficulties in Online Learning**.
- Variables that begin with the words **Learning Objects** take values on the scale "Not at All", "Not Really", "Undecided", "Somewhat", and "Very Much". They represent how much respondents prefer each learning object for their studies.
- Variables that begin with **"6. Online Instructional Strategies/Assessment"** use the same scale. They represent students' preferences for online instructional methods and assessments.
- The variable **"For technical or hands-on subject, ..."** is a short-answer question. Extra processing is needed to clean this variable.
- Variables that **begin with numbers** are questions that attempt to infer the respondent's learning style based on the VAK (Visual, Auditory, Kinesthetic) model.
- **Dominant_VAK** gives the most likely learning style of each respondent.

### Finding Missing Values

In [6]:
null_columns = df_.columns[df_.isnull().any()]

In [7]:
null_columns = null_columns.values

In [8]:
null_columns = [(null_column, df_[null_column].isnull().sum()) for null_column in null_columns]
null_columns

[('I understand that my participation is voluntary and that I am free to withdraw at any time without giving any reason. I, hereby agree to take part in the above study.',
  16),
 ('Institutions', 2),
 ('Please share any comments or suggestions related to this issue. Thank You',
  760),
 ('Faculty', 29),
 ('Department', 29)]

- The first entry in the null_columns list, "I understand that my ...", is a variable that will be removed from our analysis.
- Little can be done about missing data in the comments variable. Keywords will be extracted from the existing entries, while the missing values will be replaced with "none".
- Missing values in the other variables will be filled with the value "Unknown".
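
The fill strategy described above could be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's own cleaning code: the frame `toy` and its values are invented, and only the column names are taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the survey data (values are invented for illustration)
comments_col = "Please share any comments or suggestions related to this issue. Thank You"
toy = pd.DataFrame({
    "Institutions": ["UM", None, "University of Malaya"],
    "Faculty": [None, "FSKTM", None],
    "Department": [None, "IS", None],
    comments_col: ["no", None, None],
})

# Count missing values per column, keeping only columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

# Comments get "none"; every other remaining gap gets "Unknown"
filled = toy.fillna({comments_col: "none"}).fillna("Unknown")
```

Filling the comments column first matters here, because the second `fillna` would otherwise write "Unknown" into it as well.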

Before proceeding to data cleaning, I have decided to use some of the variables as labels for multi-label classification. The initial labels are the following variables:
- **6. Online Instructional Strategies/Assessment [Written assignment]**
- **6. Online Instructional Strategies/Assessment [Case Study]**
- **6. Online Instructional Strategies/Assessment [Real Time Online Exam]**
- **6. Online Instructional Strategies/Assessment [Individual Project/Assignment]**
- **6. Online Instructional Strategies/Assessment [Group Project/Assignment]**
- **6. Online Instructional Strategies/Assessment [Online Quiz/Test - MCQ]**
- **6. Online Instructional Strategies/Assessment [Online Quiz/Test - Essay]**
- **6. Online Instructional Strategies/Assessment [Online Quiz/Test - Open Book]**
- **6. Online Instructional Strategies/Assessment [Peer Review Assessment Live Presentation]**
- **6. Online Instructional Strategies/Assessment [Recorded Presentation]**
- **6. Online Instructional Strategies/Assessment [Portfolio]**  

These variables will be converted into 1s and 0s: 1 indicates that the respondent is more suited to the particular online technical assessment, and 0 indicates otherwise.

As I progress into the model training step, some variables may be dropped to improve model performance.
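
As a rough sketch of the planned 1/0 conversion, the five-point scale can be mapped to numbers and then thresholded. The responses below are invented, and the cut-off ("Somewhat" and "Very Much" count as 1) follows the rule stated in the Target Variables section of this notebook.

```python
import pandas as pd

scale = {"Not at All": 0, "Not Really": 1, "Undecided": 2, "Somewhat": 3, "Very Much": 4}

# Invented responses for a single assessment column
responses = pd.Series(["Very Much", "Undecided", "Somewhat", "Not at All", "Not Really"])

# Map the text scale to 0-4, then mark 3 and 4 as the positive class
ordinal = responses.map(scale)
label = (ordinal >= 3).astype(int)
```

Applied to each of the eleven label columns, this would give one binary target per assessment type, which is the shape a multi-label classifier expects.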

## Data Cleaning

In [9]:
df = df_.copy()

In [10]:
le = LabelEncoder()

In [11]:
def LabelEncoding(column):
  # Print the value counts before and after label-encoding the column in place
  print(df[column].value_counts())
  print()
  df[column] = le.fit_transform(df[column])
  print(df[column].value_counts())

### Dropping Columns and "test" Row

In [12]:
df = df.iloc[1:, 3:]

In [13]:
df.head()

Unnamed: 0,Gender,Level of Study,Field of study,Institutions,Country,Household Income,Preferred learning mode,Preferred Social Media Platform,Preferred Communication Platform,Difficulties in Online Learning,...,25. When I first contact a new person,26. I first notice how people,27. If I am very angry,28. I find it easiest to remember,29. I think I can tell someone is lying because,30. When I'm meeting with an old friend,Please share any comments or suggestions related to this issue. Thank You,Faculty,Department,Dominant_VAK
1,Female,Postgraduate,Computing,UM,Malaysia,RM 3001 - 10 000,"Face to Face, Synchronous Online Learning (Rea...",Instagram,Whatsapp,"Technical Issues, Quality of Material",...,I talk to them on the telephone,Stand and move,"I stomp about, slam doors and throw things",Faces,They avoid looking at you,"I say ""it's great to hear your voice!""",no,,,Visual
2,Female,Postgraduate,Computing,UM,Malaysia,RM 3001 - 10 000,Face to Face,"Facebook, Twitter, Instagram","Email, University eLearning Chat Room, Whatsap...","Technical Issues, Engagement",...,I arrange a face to face meeting,Sound and speak,I keep replaying in my mind what it is that ha...,Things I have done,The vibes I get from them,"I say ""it's great to see you!""",,,,Visual
3,Female,Postgraduate,3:00,University Malaya,Malaysia,RM 10 001 - 25 000,"Face to Face, Asynchronous Online Learning (On...","Facebook, Youtube","Email, Whatsapp","Adaptability, Technical Issues, Time Managemen...",...,I arrange a face to face meeting,Look and dress,I keep replaying in my mind what it is that ha...,Things I have done,The vibes I get from them,I give them a hug or a handshake,,,,Visual
4,Female,Postgraduate,Humanities,UM,Malaysia,RM 10 001 - 25 000,Face to Face,Facebook,Whatsapp,Technical Issues,...,I talk to them on the telephone,Sound and speak,I keep replaying in my mind what it is that ha...,Things I have done,The vibes I get from them,"I say ""it's great to see you!""",,,,Auditory
5,Female,Undergraduate,Sports,University of Malaya,Malaysia,RM 3001 - 10 000,Face to Face,"Facebook, Instagram, Youtube","Email, University eLearning Chat Room, Whatsapp","Adaptability, Technical Issues, Self-Motivation",...,I try to get together to share an activity,Look and dress,I shout lots and tell people how I feel,Things I have done,The vibes I get from them,I give them a hug or a handshake,,,,Kinesthetic


### Change Gender to 1s and 0s

In [14]:
LabelEncoding("Gender")

Female    699
Male      352
Name: Gender, dtype: int64

0    699
1    352
Name: Gender, dtype: int64


### Change Level of Study to Discrete Numbers

In [15]:
LabelEncoding("Level of Study")

Undergraduate          900
Certificate/Diploma    134
Master                   9
Postgraduate             5
PhD                      3
Name: Level of Study, dtype: int64

4    900
0    134
1      9
3      5
2      3
Name: Level of Study, dtype: int64


In [16]:
df["Level of Study"].replace([4, 1, 3, 2], [1, 2, 2, 2], inplace=True)

In [17]:
df["Level of Study"].value_counts()

1    900
0    134
2     17
Name: Level of Study, dtype: int64
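
The two steps above (label encoding, then merging codes) could also be written as one explicit mapping from the original text labels to the final groups. This is an alternative sketch rather than the notebook's method; the name `level_groups` is hypothetical, and the grouping mirrors the result shown above (Certificate/Diploma = 0, Undergraduate = 1, all postgraduate levels = 2).

```python
import pandas as pd

# One example row per original label
levels = pd.Series(["Undergraduate", "Certificate/Diploma", "Master", "Postgraduate", "PhD"])

# Hypothetical explicit mapping; postgraduate-level labels share one code
level_groups = {
    "Certificate/Diploma": 0,
    "Undergraduate": 1,
    "Master": 2,
    "Postgraduate": 2,
    "PhD": 2,
}

encoded = levels.map(level_groups)
```

An explicit dictionary keeps the meaning of each code visible in the code itself, and any label missing from the dictionary shows up as NaN instead of silently receiving a code.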

### Extract Relevant Study Fields
For the context of this project, which is Online Technical Assessment Classification based on Personalization, I am looking for students who are exposed to technical tasks such as programming. Hence, for simplicity, I keep only students in engineering, computer science, and other relevant fields.

In [18]:
relevant_fields = ["computer", "engineering", "information", "tech", "technology", "jurutera", "komputer", "maklumat", "teknologi"]
mask = df["Field of study"].str.contains('|'.join(relevant_fields), case=False)
df = df[mask]
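
To illustrate how the keyword mask behaves, here is a small sketch on invented field names with a shortened keyword list. The `na=False` argument is an addition that is not in the cell above; it is a defensive choice that treats a missing field as "not relevant" instead of producing a missing value in the mask.

```python
import pandas as pd

# Invented examples, including one missing entry
fields = pd.Series([
    "Computer Science/Information Technology",
    "Veterinary",
    "sijil sistem komputer",
    None,
    "Chemical engineering",
])

# Shortened keyword list for the example
relevant_fields = ["computer", "engineering", "komputer"]

# Join the keywords into one regular expression: "computer|engineering|komputer"
pattern = "|".join(relevant_fields)
mask = fields.str.contains(pattern, case=False, na=False)
kept = fields[mask]
```

Because this is a substring match, any field that merely contains one of the keywords is kept, so a broad keyword such as "tech" can let in fields that are only loosely related.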

In [19]:
LabelEncoding("Field of study")

Computer Science/Information Technology    214
Engineering                                 92
Medical Laboratory Technology                2
sijil sistem komputer                        2
Computer and Information Technology          1
Science and Technology Studies               1
medical lab technology                       1
Chemical engineering                         1
Name: Field of study, dtype: int64

1    214
3     92
4      2
7      2
2      1
5      1
6      1
0      1
Name: Field of study, dtype: int64


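- 7 (sijil sistem komputer) and 2 (Computer and Information Technology) appear to be similar to 1 (Computer Science/Information Technology)
- 4 (Medical Laboratory Technology), 5 (Science and Technology Studies), and 6 (medical lab technology) look similar
- 0 (Chemical engineering) belongs to 3 (Engineering)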

In [20]:
df["Field of study"].replace([7, 2, 5, 6, 0], [1, 1, 4, 4, 3], inplace=True)

In [21]:
df["Field of study"].value_counts()

1    217
3     93
4      4
Name: Field of study, dtype: int64

In [22]:
df["Field of study"].replace([1, 3, 4], [0, 1, 2], inplace=True)
df["Field of study"].value_counts()

0    217
1     93
2      4
Name: Field of study, dtype: int64

- 0 represents Computer Science/ Information Technology
- 1 represents Engineering
- 2 represents Medical Laboratory Technology

### Convert Household Income Category

In [23]:
LabelEncoding("Household Income")

Less than RM 4,849       158
RM 4,850 â€“ RM10,959    111
More than RM10,960        45
Name: Household Income, dtype: int64

0    158
2    111
1     45
Name: Household Income, dtype: int64
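
Note that LabelEncoder assigns codes in sorted label order, so in the output above the highest bracket receives 1 and the middle bracket receives 2. If the income order matters to the model, an ordered categorical is one way to keep it. The sketch below uses invented rows, and the middle label is written with a plain hyphen as an assumption; the strings in `income_order` would have to match the loaded data exactly.

```python
import pandas as pd

# Invented rows; the labels must match the strings in the real column
income = pd.Series([
    "More than RM10,960",
    "Less than RM 4,849",
    "RM 4,850 - RM10,959",
    "Less than RM 4,849",
])

# Brackets listed from lowest to highest
income_order = ["Less than RM 4,849", "RM 4,850 - RM10,959", "More than RM10,960"]

ordered = pd.Categorical(income, categories=income_order, ordered=True)
income_codes = pd.Series(ordered.codes, index=income.index)
```

Any value that is not listed in `income_order` receives the code -1, which makes mismatched labels easy to spot.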


### Learning Objects & Online Instructional Strategies/Assessment

In [24]:
range_replacements = {"Not at All": 0, "Not Really": 1, "Undecided": 2, "Somewhat": 3, "Somewhat\t": 3, "Very Much": 4}

In [25]:
df.replace(range_replacements, inplace=True)

In [26]:
df["6. Online Instructional Strategies/Assessment [Blog Writing]"].value_counts()

2    104
3     70
1     65
0     46
4     29
Name: 6. Online Instructional Strategies/Assessment [Blog Writing], dtype: int64
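
The cell above applies the replacement to the whole DataFrame. A more targeted variant, sketched here on an invented frame, restricts the mapping to the scale columns and strips stray whitespace first, so that variants such as "Somewhat\t" need no dictionary entry of their own. The names `toy`, `scale` and `likert_cols` are hypothetical.

```python
import pandas as pd

scale = {"Not at All": 0, "Not Really": 1, "Undecided": 2, "Somewhat": 3, "Very Much": 4}

# Invented rows with a trailing tab and a leading space
toy = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "Learning Objects [Book]": ["Somewhat\t", "Very Much"],
    "6. Online Instructional Strategies/Assessment [Blog Writing]": ["Undecided", " Not Really"],
})

# Select only the columns that use the five-point scale
prefixes = ("Learning Objects", "6. Online Instructional Strategies/Assessment")
likert_cols = [c for c in toy.columns if c.startswith(prefixes)]

# Strip whitespace, then map each answer to its number
toy[likert_cols] = toy[likert_cols].apply(lambda col: col.str.strip().map(scale))
```

Limiting the mapping to known columns avoids accidentally rewriting a free-text answer that happens to equal one of the scale labels.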

### Target Variables

For these target variables, we label "Somewhat" and "Very Much" as 1, while the others are labelled as 0.

In [27]:
targets = [0, 1, 2, 3, 4]
replacements = [0, 0, 0, 1, 1]

# Build the full names of the 11 target columns from their shared prefix
prefix = "6. Online Instructional Strategies/Assessment"
target_names = [
    "Written assignment",
    "Case Study",
    "Real Time Online Exam",
    "Individual Project/Assignment",
    "Group Project/Assignment",
    "Online Quiz/Test - MCQ",
    "Online Quiz/Test - Essay",
    "Online Quiz/Test - Open Book",
    "Peer Review Assessment Live Presentation",
    "Recorded Presentation",
    "Portfolio",
]
target_columns = [f"{prefix} [{name}]" for name in target_names]

# Collapse the 0-4 scale into binary labels for all target columns at once
df[target_columns] = df[target_columns].replace(targets, replacements)

In [28]:
df["6. Online Instructional Strategies/Assessment [Portfolio]"].value_counts()

1    166
0    148
Name: 6. Online Instructional Strategies/Assessment [Portfolio], dtype: int64

### Dominant_VAK


In [29]:
LabelEncoding("Dominant_VAK")

Visual         177
Kinesthetic     79
Auditory        58
Name: Dominant_VAK, dtype: int64

2    177
1     79
0     58
Name: Dominant_VAK, dtype: int64


### Learning Style Questions

The function iterLabelEncoding runs LabelEncoding on all 30 learning style questions, from the first question column to the last.

In [30]:
def iterLabelEncoding(col1, col2):
  # Label-encode every column from col1 to col2 (inclusive), in column order
  start = df.columns.get_loc(col1)
  stop = df.columns.get_loc(col2)

  for column in df.columns[start:stop + 1]:
    LabelEncoding(column)
    print()

In [31]:
iterLabelEncoding("1. When operating new equipment for the first time I prefer to", "30. When I'm meeting with an old friend")

Read the instructions                       177
Listen to or ask for an explaination         88
Have a go and learn by "trial and error"     49
Name: 1. When operating new equipment for the first time I prefer to, dtype: int64

2    177
1     88
0     49
Name: 1. When operating new equipment for the first time I prefer to, dtype: int64

Look at a map                            229
Ask for spoken directions                 81
Follow my nose or maybe use a compass      4
Name: 2. When seeking travel directions I, dtype: int64

2    229
0     81
1      4
Name: 2. When seeking travel directions I, dtype: int64

Follow a recipe                          242
Follow my instinct, tasting as I cook     47
Call a friend for explaination            25
Name: 3. When cooking a new dish I, dtype: int64

1    242
2     47
0     25
Name: 3. When cooking a new dish I, dtype: int64

Demonstrate and let them have a go    146
Explain verbally                      127
Write Instructions                     

To see what each number represents, compare the two value counts printed for each question in the output above.
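
Instead of reading the mapping off the printed counts, it can be recovered from the encoder itself. The sketch below uses invented answers and a fresh encoder named `encoder`; `classes_` is the fitted attribute of scikit-learn's LabelEncoder that holds the original labels in code order.

```python
from sklearn.preprocessing import LabelEncoder

# Invented answers for one column
answers = ["Visual", "Kinesthetic", "Auditory", "Visual"]

encoder = LabelEncoder()
codes = encoder.fit_transform(answers)

# classes_[k] is the original label that was encoded as k
code_to_label = dict(enumerate(encoder.classes_))
```

Since this notebook reuses a single encoder for every column, each call to `fit_transform` overwrites `classes_`, so a lookup like this would need to be saved per column right after encoding it.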

## Multiple Choice Data Handling
The function one_hot_encode_multiple_choice handles the columns that come from multiple-choice questions. The columns are:
- **Preferred learning mode**
- **Preferred Social Media Platform**
- **Preferred Communication Platform**
- **Difficulties in Online Learning**

In [32]:
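def one_hot_encode_multiple_choice(col, suffix):
  # Collect every distinct option that appears in the comma-separated answers
  a = set(', '.join(df[col]).split(', '))
  b = list(a)
  
  for i in b:
    new_col = str(i) + "_" + suffix
    # 1 if the option text occurs anywhere in the answer (substring test), else 0
    df[new_col] = df[col].apply(lambda x: 1 if i in x else 0)
  
  # Drop the original multi-valued column once it has been expanded
  df.drop(col, axis=1, inplace=True)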
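
One caveat of the function above is that `i in x` is a substring test, so an option also matches longer answers that contain it. The sketch below, on invented answers, contrasts this with an exact-token test; `Series.str.get_dummies` is a pandas method that performs the same exact-token encoding for all options at once. This is a possible refinement, not what the notebook runs.

```python
import pandas as pd

# Invented answers; the second one is a free-text entry
choices = pd.Series(["Telegram", "Telegram and Google Classroom", "Email, Telegram"])

# Substring test, as in `i in x`: the free-text answer also counts as "Telegram"
substring_hit = choices.apply(lambda x: 1 if "Telegram" in x else 0)

# Exact-token test: split each answer on ", " before checking membership
exact_hit = choices.apply(lambda x: 1 if "Telegram" in x.split(", ") else 0)

# Same exact-token encoding for every option in one call
dummies = choices.str.get_dummies(sep=", ")
```

Whether the substring behaviour is wanted depends on the analysis. If each free-text answer should count only as itself, the exact-token variant is the safer choice.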

### Preferred learning mode

In [33]:
df["Preferred learning mode"].value_counts()

Face to Face                                                                                              97
Face to Face, Synchronous Online Learning (Real Time)                                                     49
Face to Face, Asynchronous Online Learning (On your own time)                                             40
Asynchronous Online Learning (On your own time)                                                           36
Synchronous Online Learning (Real Time), Asynchronous Online Learning (On your own time)                  33
Synchronous Online Learning (Real Time)                                                                   30
Face to Face, Synchronous Online Learning (Real Time), Asynchronous Online Learning (On your own time)    29
Name: Preferred learning mode, dtype: int64

In [34]:
one_hot_encode_multiple_choice("Preferred learning mode", "learnmode")
df

Unnamed: 0,Gender,Level of Study,Field of study,Institutions,Country,Household Income,Preferred Social Media Platform,Preferred Communication Platform,Difficulties in Online Learning,Learning Objects [Slide presentation],...,28. I find it easiest to remember,29. I think I can tell someone is lying because,30. When I'm meeting with an old friend,Please share any comments or suggestions related to this issue. Thank You,Faculty,Department,Dominant_VAK,Synchronous Online Learning (Real Time)_learnmode,Face to Face_learnmode,Asynchronous Online Learning (On your own time)_learnmode
27,0,2,0,Um,Malaysia,2,"Facebook, Instagram","Whatsapp, Telegram","Computer Literacy, Quality of Material",3,...,0,1,0,First round,Fsktm,Is,1,1,1,0
28,0,2,0,UM,Malaysia,2,"Facebook, Twitter, Instagram","Email, Whatsapp, Call","Technical Issues, Quality of Material, Engagement",4,...,0,0,2,,FSKTM,Information System,2,0,1,1
31,1,1,0,UM,Malaysia,0,"Twitter, Instagram, Youtube","Email, University eLearning Chat Room, Telegram","Technical Issues, Self-Motivation, Engagement,...",3,...,2,0,2,,FCSIT,IS,1,1,1,1
33,1,1,0,Education,Malaysia,0,"Youtube, Google classroom","Email, University eLearning Chat Room, Whatsap...","Technical Issues, Self-Motivation, Accessibili...",3,...,0,2,2,,FSKTM,IS,2,0,1,1
37,0,1,0,University of Malaya,Shah Alam,1,"Twitter, Instagram","Whatsapp, Telegram","Adaptability, Time Management, Self-Motivation...",4,...,0,0,2,,Faculty of Comp Science,Information System,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1047,0,1,0,Universiti Malaya,Malaysia,2,"Twitter, Instagram","Email, Whatsapp, Telegram","Self-Motivation, Focus/Commitment",3,...,2,0,2,,Faculty of Computer Science and Information Te...,Information System,2,0,1,1
1048,0,1,0,National University of Singapore,Malaysia,1,"Facebook, Instagram, Youtube","Email, Whatsapp, Telegram","Technical Issues, Time Management, Self-Motiva...",3,...,2,2,1,,School of Computing,Computer Science,2,1,1,0
1049,1,1,0,University of Malaya,MALAYSIA,2,"Twitter, Youtube","University eLearning Chat Room, Whatsapp","Time Management, Self-Motivation, Focus/Commit...",2,...,2,1,2,,Faculty of Computer Science and Information Te...,Artificial Intelligence,2,0,0,1
1050,0,1,0,Universiti Malaya,Malaysia,1,"Twitter, Instagram, Youtube","Email, Whatsapp","Adaptability, Time Management, Self-Motivation...",3,...,0,0,2,,Computer Science & Information Technology,Data Science,0,0,1,0


### Preferred Social Media Platform

In [35]:
df["Preferred Social Media Platform "].value_counts()

Youtube                                                     59
Facebook, Instagram, Youtube                                41
Instagram, Youtube                                          35
Twitter, Instagram, Youtube                                 34
Facebook, Youtube                                           27
Instagram                                                   16
Twitter, Youtube                                            15
Twitter                                                     10
Twitter, Instagram                                          10
Facebook                                                     7
Blogger/Wordpress, Youtube                                   5
Facebook, Instagram                                          5
Facebook, Twitter, Instagram, Youtube                        5
Facebook, Twitter, Instagram, Blogger/Wordpress, Youtube     4
Instagram, zoom                                              3
Facebook, Twitter, Instagram                           

In [36]:
one_hot_encode_multiple_choice("Preferred Social Media Platform ", "prefsocmed")
df

Unnamed: 0,Gender,Level of Study,Field of study,Institutions,Country,Household Income,Preferred Communication Platform,Difficulties in Online Learning,Learning Objects [Slide presentation],Learning Objects [Book],...,Telegram_prefsocmed,Reddit_prefsocmed,Google Classroom _prefsocmed,Tiktok_prefsocmed,google meet_prefsocmed,Twitter_prefsocmed,Google classroom_prefsocmed,Telegram and Google Classroom_prefsocmed,Blogger/Wordpress_prefsocmed,TIKTOK_prefsocmed
27,0,2,0,Um,Malaysia,2,"Whatsapp, Telegram","Computer Literacy, Quality of Material",3,3,...,0,0,0,0,0,0,0,0,0,0
28,0,2,0,UM,Malaysia,2,"Email, Whatsapp, Call","Technical Issues, Quality of Material, Engagement",4,4,...,0,0,0,0,0,1,0,0,0,0
31,1,1,0,UM,Malaysia,0,"Email, University eLearning Chat Room, Telegram","Technical Issues, Self-Motivation, Engagement,...",3,3,...,0,0,0,0,0,1,0,0,0,0
33,1,1,0,Education,Malaysia,0,"Email, University eLearning Chat Room, Whatsap...","Technical Issues, Self-Motivation, Accessibili...",3,1,...,0,0,0,0,0,0,1,0,0,0
37,0,1,0,University of Malaya,Shah Alam,1,"Whatsapp, Telegram","Adaptability, Time Management, Self-Motivation...",4,1,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1047,0,1,0,Universiti Malaya,Malaysia,2,"Email, Whatsapp, Telegram","Self-Motivation, Focus/Commitment",3,1,...,0,0,0,0,0,1,0,0,0,0
1048,0,1,0,National University of Singapore,Malaysia,1,"Email, Whatsapp, Telegram","Technical Issues, Time Management, Self-Motiva...",3,3,...,0,0,0,0,0,0,0,0,0,0
1049,1,1,0,University of Malaya,MALAYSIA,2,"University eLearning Chat Room, Whatsapp","Time Management, Self-Motivation, Focus/Commit...",2,3,...,0,0,0,0,0,1,0,0,0,0
1050,0,1,0,Universiti Malaya,Malaysia,1,"Email, Whatsapp","Adaptability, Time Management, Self-Motivation...",3,3,...,0,0,0,0,0,1,0,0,0,0
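
The column headers above include near-duplicates such as Tiktok_prefsocmed and TIKTOK_prefsocmed, or "Google Classroom _prefsocmed" and Google classroom_prefsocmed. One way to merge them, sketched here on invented answers, is to normalize case and whitespace before encoding. The helper `normalize_choices` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Invented answers showing case and whitespace variants
platforms = pd.Series(["Tiktok", "TIKTOK, Youtube", "Google Classroom , Youtube", "Google classroom"])

def normalize_choices(text):
    # Trim and lower-case each option so that variants collapse into one token
    return ", ".join(part.strip().lower() for part in text.split(","))

normalized = platforms.apply(normalize_choices)

# Distinct options left after normalization
options = sorted(set(", ".join(normalized).split(", ")))
```

Running this kind of normalization before one_hot_encode_multiple_choice would likely reduce the number of sparse, almost empty indicator columns.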


### Preferred Communication Platform

In [37]:
df["Preferred Communication Platform"].value_counts()

Email, Whatsapp, Telegram                                          58
Whatsapp                                                           45
Whatsapp, Telegram                                                 42
Email, Whatsapp                                                    38
University eLearning Chat Room, Whatsapp                           16
Email, University eLearning Chat Room, Whatsapp, Telegram          13
Email, University eLearning Chat Room, Whatsapp                    13
Email, Whatsapp, Call                                              11
University eLearning Chat Room, Whatsapp, Telegram                 10
Email, University eLearning Chat Room, Whatsapp, Call, Telegram     7
Email, Telegram                                                     7
Telegram                                                            6
Whatsapp, Call, Telegram                                            6
Whatsapp, Call                                                      6
Email, Whatsapp, Cal

In [38]:
one_hot_encode_multiple_choice("Preferred Communication Platform", "prefcomm")
df

Unnamed: 0,Gender,Level of Study,Field of study,Institutions,Country,Household Income,Difficulties in Online Learning,Learning Objects [Slide presentation],Learning Objects [Book],Learning Objects [Lecture Note],...,University eLearning Chat Room_prefcomm,Google Classroom_prefcomm,Kaizala_prefcomm,Zoom_prefcomm,Cisco Webex_prefcomm,Google Meet _prefcomm,Microsoft Teams_prefcomm,Telegram_prefcomm,Email_prefcomm,telegram_prefcomm
27,0,2,0,Um,Malaysia,2,"Computer Literacy, Quality of Material",3,3,3,...,0,0,0,0,0,0,0,1,0,0
28,0,2,0,UM,Malaysia,2,"Technical Issues, Quality of Material, Engagement",4,4,4,...,0,0,0,0,0,0,0,0,1,0
31,1,1,0,UM,Malaysia,0,"Technical Issues, Self-Motivation, Engagement,...",3,3,4,...,1,0,0,0,0,0,0,1,1,0
33,1,1,0,Education,Malaysia,0,"Technical Issues, Self-Motivation, Accessibili...",3,1,4,...,1,0,0,0,0,0,0,1,1,0
37,0,1,0,University of Malaya,Shah Alam,1,"Adaptability, Time Management, Self-Motivation...",4,1,4,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1047,0,1,0,Universiti Malaya,Malaysia,2,"Self-Motivation, Focus/Commitment",3,1,3,...,0,0,0,0,0,0,0,1,1,0
1048,0,1,0,National University of Singapore,Malaysia,1,"Technical Issues, Time Management, Self-Motiva...",3,3,4,...,0,0,0,0,0,0,0,1,1,0
1049,1,1,0,University of Malaya,MALAYSIA,2,"Time Management, Self-Motivation, Focus/Commit...",2,3,2,...,1,0,0,0,0,0,0,0,0,0
1050,0,1,0,Universiti Malaya,Malaysia,1,"Adaptability, Time Management, Self-Motivation...",3,3,4,...,0,0,0,0,0,0,0,0,1,0


### Difficulties in Online Learning

In [39]:
df["Difficulties in Online Learning"].value_counts()

Time Management, Self-Motivation                                                                                                7
Time Management, Self-Motivation, Focus/Commitment                                                                              6
Adaptability, Technical Issues, Time Management, Self-Motivation, Engagement, Focus/Commitment                                  6
Self-Motivation, Focus/Commitment                                                                                               6
Technical Issues, Self-Motivation, Focus/Commitment                                                                             6
                                                                                                                               ..
Computer Literacy, Cost, Health Issues                                                                                          1
Time Management, Self-Motivation, Quality of Material, Engagement, Focus/Commitment       

In [40]:
one_hot_encode_multiple_choice("Difficulties in Online Learning", "onlinelearningdifficulties")
df

Unnamed: 0,Gender,Level of Study,Field of study,Institutions,Country,Household Income,Learning Objects [Slide presentation],Learning Objects [Book],Learning Objects [Lecture Note],Learning Objects [Educational game],...,Focus/Commitment_onlinelearningdifficulties,Self-Motivation_onlinelearningdifficulties,Cost_onlinelearningdifficulties,Health Issues_onlinelearningdifficulties,University lack of student well-being management_onlinelearningdifficulties,Line connection_onlinelearningdifficulties,Not sure_onlinelearningdifficulties,Cost/Focus/Commitment_onlinelearningdifficulties,None_onlinelearningdifficulties,Engagement_onlinelearningdifficulties
27,0,2,0,Um,Malaysia,2,3,3,3,3,...,0,0,0,0,0,0,0,0,0,0
28,0,2,0,UM,Malaysia,2,4,4,4,4,...,0,0,0,0,0,0,0,0,0,1
31,1,1,0,UM,Malaysia,0,3,3,4,4,...,1,1,1,0,0,0,0,1,0,1
33,1,1,0,Education,Malaysia,0,3,1,4,2,...,1,1,1,0,0,0,0,1,0,0
37,0,1,0,University of Malaya,Shah Alam,1,4,1,4,3,...,1,1,1,0,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1047,0,1,0,Universiti Malaya,Malaysia,2,3,1,3,4,...,1,1,0,0,0,0,0,0,0,0
1048,0,1,0,National University of Singapore,Malaysia,1,3,3,4,3,...,1,1,0,0,0,0,0,0,0,1
1049,1,1,0,University of Malaya,MALAYSIA,2,2,3,2,3,...,1,1,0,0,0,0,0,0,0,0
1050,0,1,0,Universiti Malaya,Malaysia,1,3,3,4,3,...,1,1,0,0,0,0,0,0,0,0


### Online Instructional Strategies/Assessment Preference
The column **"For technical or hands-on subject, which Online Instructional Strategies/Assessment do you preferÂ\xa0(list can be referred in previous question  eg. YouTube Video, Debugging or Live Lecture)?"** holds free-text answers typed in by respondents, so its format is highly inconsistent, with stray commas and unrelated words. To extract useful data from it, the text is first lowercased, then NLTK is used to remove common words (stopwords) and to lemmatize the remaining words so that the data becomes more manageable. Finally, the lemmatized text is checked for the occurrence of common instructional strategies, such as "YouTube" and "Live Lecture".
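As a rough illustration of the pipeline described above, the sketch below runs the same steps on a plain string. To stay self-contained it uses a tiny hand-written stopword set and a naive plural-stripping function as stand-ins for NLTK's stopword list and `WordNetLemmatizer`. The names `STOP_WORDS`, `KEYWORDS`, `naive_lemma`, `clean`, and `flag_keywords` are hypothetical and not part of this notebook.

```python
# Stand-ins for the NLTK resources (assumptions for illustration only).
STOP_WORDS = {"i", "the", "a", "and", "or", "of", "to", "for"}
KEYWORDS = ["youtube", "live", "debugging"]


def naive_lemma(word):
    # Crude placeholder for lemmatization: strip a trailing plural "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def clean(text):
    # Lowercase, drop stopwords, then "lemmatize" each remaining token.
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(naive_lemma(w) for w in tokens)


def flag_keywords(text):
    # 1 if the keyword occurs as a substring of the cleaned text, else 0.
    cleaned = clean(text)
    return {kw: int(kw in cleaned) for kw in KEYWORDS}


flags = flag_keywords("I prefer YouTube videos and the Live Lecture")
print(flags)
```

Because the check is a plain substring test, a keyword such as "live" also matches inside a longer word like "deliver", which is worth keeping in mind when reading the resulting counts.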

In [41]:
df["For technical or hands-on subject, which Online Instructional Strategies/Assessment do you preferÂ\xa0(list can be referred in previous question  eg. YouTube Video, Debugging or Live Lecture)?"] \
= df["For technical or hands-on subject, which Online Instructional Strategies/Assessment do you preferÂ\xa0(list can be referred in previous question  eg. YouTube Video, Debugging or Live Lecture)?"].str.lower()

In [42]:
stop_words = set(stopwords.words("english"))

In [43]:
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])

In [44]:
df["For technical or hands-on subject, which Online Instructional Strategies/Assessment do you preferÂ\xa0(list can be referred in previous question  eg. YouTube Video, Debugging or Live Lecture)?"] \
= df["For technical or hands-on subject, which Online Instructional Strategies/Assessment do you preferÂ\xa0(list can be referred in previous question  eg. YouTube Video, Debugging or Live Lecture)?"].apply(remove_stopwords)

In [45]:
lemmatizer = WordNetLemmatizer()

In [46]:
def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

In [47]:
df["For technical or hands-on subject, which Online Instructional Strategies/Assessment do you preferÂ\xa0(list can be referred in previous question  eg. YouTube Video, Debugging or Live Lecture)?"] \
= df["For technical or hands-on subject, which Online Instructional Strategies/Assessment do you preferÂ\xa0(list can be referred in previous question  eg. YouTube Video, Debugging or Live Lecture)?"].apply(lemmatize_text)

In [48]:
df["For technical or hands-on subject, which Online Instructional Strategies/Assessment do you preferÂ\xa0(list can be referred in previous question  eg. YouTube Video, Debugging or Live Lecture)?"].value_counts()[0:49]

live lecture                                                                                                                                                                                                                       71
youtube video                                                                                                                                                                                                                      71
youtube                                                                                                                                                                                                                            21
youtube video live lecture                                                                                                                                                                                                          5
live                                                                            

In [49]:
tech_pref = ["youtube", "live", "debugging", "problem based learning", "recorded", "case study", "demo", "simulation", "guided", "discussion", "forum", "game", "journaling", "writing"]
for pref in tech_pref:
  df[pref] = df["For technical or hands-on subject, which Online Instructional Strategies/Assessment do you preferÂ\xa0(list can be referred in previous question  eg. YouTube Video, Debugging or Live Lecture)?"]\
    .apply(lambda x: 1 if pref in x else 0)

  df[pref] = df["For technical or hands-on subject, which Online Instructional Strategies/Assessment do you preferÂ\xa0(list can be referred in previous question  eg. YouTube Video, Debugging or Live Lecture)?"]\

In [50]:
df.iloc[:, -14:].value_counts()

youtube  live  debugging  problem based learning  recorded  case study  demo  simulation  guided  discussion  forum  game  journaling  writing
1        0     0          0                       0         0           0     0           0       0           0      0     0           0          104
0        1     0          0                       0         0           0     0           0       0           0      0     0           0           88
         0     0          0                       0         0           0     0           0       0           0      0     0           0           24
1        1     0          0                       0         0           0     0           0       0           0      0     0           0           21
0        0     0          0                       1         0           0     0           0       0           0      0     0           0           10
                                                  0         0           1     0           0       0        

### Converting Yes and No to Binary
These variables are affected:
- Are you familiar with the term learning style before this survey?
- Are you aware what your learning style is?
- Do you think knowing your own learning style is important in improving your learning ability?

In [51]:
replacements2 = {"Yes": 1, "No": 0}

In [52]:
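# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which may act on a temporary copy and leave df unchanged.
df["Are you familiar with the term learning style before this survey?"] = df["Are you familiar with the term learning style before this survey?"].replace(replacements2)
df["Are you aware what your learning style is?"] = df["Are you aware what your learning style is?"].replace(replacements2)
df["Do you think knowing your own learning style is important in improving your learning ability?"] = df["Do you think knowing your own learning style is important in improving your learning ability?"].replace(replacements2)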
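
A minimal sketch of the same Yes/No conversion on a toy frame, using `Series.map` rather than `replace`. Unlike `replace`, `map` turns any value that is not a key of the dictionary into `NaN`, which makes unexpected answers easy to spot. The frame `toy` and its column `aware` are made up for illustration and do not come from the survey data.

```python
import pandas as pd

# Invented answers, including one that is neither "Yes" nor "No".
toy = pd.DataFrame({"aware": ["Yes", "No", "Yes", "Maybe"]})
mapping = {"Yes": 1, "No": 0}

# map() returns NaN for values that are not keys of the mapping.
toy["aware_bin"] = toy["aware"].map(mapping)

# Rows whose answer could not be converted.
unmapped = toy[toy["aware_bin"].isna()]
print(unmapped)
```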

## Final Check Before EDA

In [53]:
report2 = ProfileReport(df, title="Final Check", minimal=True, html={"style": {"full_width": True}})
report2.to_file("report2.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

The variable "Please share any comments or suggestions related to this issue. Thank You" has 225 missing values, so it is a candidate for dropping before the model training stage.

The "discussion" column contains only one distinct value and therefore carries no useful information, so it is dropped.

Many of the one-hot encoded columns are empty. They may be dropped during feature selection, depending on the situation.

In [54]:
df = df.drop("discussion", axis=1)
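
Single-valued columns such as "discussion" can also be located programmatically rather than read off the report. The sketch below does this with `DataFrame.nunique` on a toy frame. The names `toy`, `constant_cols`, and `reduced` are illustrative, and the data is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "youtube": [1, 0, 1, 0],
    "discussion": [0, 0, 0, 0],  # only one distinct value
    "live": [0, 1, 1, 0],
})

# Count distinct values per column; a column with at most one carries no signal.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Drop every constant column in one call.
reduced = toy.drop(columns=constant_cols)
print(constant_cols, list(reduced.columns))
```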

## Export Data for EDA and Model Training

In [55]:
df.to_csv("cleaned_data.csv")