##### Implementation of the paper "Language Independent Analysis and Classification of Discussion Threads in Coursera MOOC Forums" by Lorenzo A. Rossi and Omprakash Gnawali.

This notebook engineers most of the prominent features described in the paper and uses them to train a linear-kernel SVM that classifies MOOC forum discussion threads into six classes, using only language-independent data.

Notebook by L N Saaswath.

@inproceedings{coursera-iri2014,
   author = {Lorenzo A. Rossi and Omprakash Gnawali},
   title = {{Language Independent Analysis and Classification of Discussion Threads in Coursera MOOC Forums}},
   booktitle = {Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI 2014)},
   month = aug,
   year = {2014}
}





In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np
from sklearn import svm, model_selection, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

In [None]:
#Load Dataset cloned to local repository

#course_info: contains data about 60 individual courses
basic_info_df = pd.read_csv("/content/drive/MyDrive/SRFP'21 IAS IIT RPR/courseraforums/data/course_information.csv")

#course_thread: contains quantitative data of threads/subforums in the forums of the courses
threads_df = pd.read_csv("/content/drive/MyDrive/SRFP'21 IAS IIT RPR/courseraforums/data/course_threads.csv")

#course_post: contains quantitative data about all anonymised posts in a thread
posts_df = pd.read_csv("/content/drive/MyDrive/SRFP'21 IAS IIT RPR/courseraforums/data/course_posts.csv")

In [None]:
basic_info_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             60 non-null     object
 1   course_id        60 non-null     object
 2   weeks            60 non-null     int64 
 3   hours            60 non-null     object
 4   start_date       60 non-null     object
 5   end_date         3 non-null      object
 6   type             60 non-null     object
 7   language         60 non-null     object
 8   num_threads      60 non-null     int64 
 9   mandatory_posts  4 non-null      object
 10  num_users        60 non-null     int64 
dtypes: int64(3), object(8)
memory usage: 5.3+ KB


In [None]:
threads_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99629 entries, 0 to 99628
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   thread_id        99629 non-null  int64  
 1   course_id        99629 non-null  object 
 2   og_forum         99628 non-null  object 
 3   og_forum_id      99629 non-null  int64  
 4   parent_forum     99618 non-null  object 
 5   parent_forum_id  99618 non-null  float64
 6   forum_chain      99629 non-null  object 
 7   depth            99629 non-null  int64  
 8   num_views        99629 non-null  int64  
 9   num_tags         99629 non-null  int64  
 10  forum_id         99629 non-null  int64  
dtypes: float64(1), int64(6), object(4)
memory usage: 8.4+ MB


In [None]:
posts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 739074 entries, 0 to 739073
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   post_id     739074 non-null  int64  
 1   thread_id   739074 non-null  int64  
 2   course_id   739074 non-null  object 
 3   parent_id   739074 non-null  int64  
 4   order       739074 non-null  int64  
 5   user_id     739074 non-null  int64  
 6   user_type   739074 non-null  object 
 7   post_time   739074 non-null  int64  
 8   relative_t  739074 non-null  float64
 9   votes       739074 non-null  int64  
 10  num_words   739074 non-null  int64  
 11  forum_id    739074 non-null  int64  
dtypes: float64(1), int64(9), object(2)
memory usage: 67.7+ MB


### [Target variable for our problem] forum_id: possibly re-mapped thread/subforum identifier

2: General (Miscellaneous) Discussion  
3: Assignments  
4: Study Groups (Meetups)  
7: Course Feedback / Suggestions  
8: Lectures  
9: Platform Issues  
100: Signature Track  
otherwise: not remapped
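
The mapping above can be kept in a small lookup table for readable labels. This is only an illustrative sketch: the names `FORUM_LABELS`, `TARGET_CLASSES` and `forum_label` are hypothetical helpers, not part of the dataset or the paper. The six identifiers in `TARGET_CLASSES` are the ones this notebook keeps when it filters `forum_id` before training.

```python
# Hypothetical lookup table transcribed from the list above.
FORUM_LABELS = {
    2: "General (Miscellaneous) Discussion",
    3: "Assignments",
    4: "Study Groups (Meetups)",
    7: "Course Feedback / Suggestions",
    8: "Lectures",
    9: "Platform Issues",
    100: "Signature Track",
}

# The six classes kept for classification (Signature Track is excluded).
TARGET_CLASSES = [2, 3, 4, 7, 8, 9]


def forum_label(forum_id):
    """Return a readable name, or 'not remapped' for any other identifier."""
    return FORUM_LABELS.get(forum_id, "not remapped")


print([forum_label(i) for i in (3, 100, 12345)])
```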


### Filtering and selecting courses for training and testing

In [None]:
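# Preview the course table without row 6; drop() returns a copy, so basic_info_df itself is unchanged.
basic_info_df.drop(6)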

In [None]:
# English-language courses shorter than 7 weeks ('weeks' is already an integer column).
filt = (basic_info_df['language'] == 'E') & (basic_info_df['weeks'] < 7)

In [None]:
basic_info_df.loc[filt]

In [None]:
course = basic_info_df.course_id.unique()

course_train = basic_info_df.loc[filt].course_id.unique()
course_train = np.append(course_train,['intropsych-001'])
course_train_df = basic_info_df[basic_info_df.course_id.isin(course_train)]


course_train_df

In [None]:
course_test_df = basic_info_df.loc[[13, 18, 24, 29, 39, 52, 54, 55, 56, 59]]
course_test = course_test_df.course_id.unique()
course_test_df

In [None]:
course_list = np.append(course_train,course_test)
course_tt_df = basic_info_df[basic_info_df.course_id.isin(course_list)]
course_tt_df

In [None]:
posts = posts_df.copy(deep=True)
threads = threads_df.copy(deep=True)

posts = posts[posts.course_id.isin(course_list)]
threads = threads[threads.course_id.isin(course_list)]

## Feature Engineering
### Implementing the features described in the paper: aggregating the posts dataframe per thread and adding the results to the threads dataframe.

#### Vote measure
Sum of the squared votes of the posts in a thread (votes can be positive or negative).
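
A toy illustration of why the votes are squared before summing: up-votes and down-votes would otherwise cancel out. The data below is made up and the `toy_` names are hypothetical; only the column names follow the posts table.

```python
import pandas as pd

# Toy data (made up): two threads in one course.
toy_posts = pd.DataFrame({
    "course_id": ["c1", "c1", "c1", "c1"],
    "thread_id": [10, 10, 11, 11],
    "votes": [2, -2, 1, 0],
})

# Squaring keeps the magnitude of each vote regardless of its sign.
toy_posts["votes_sq"] = toy_posts["votes"] ** 2
vote_measure = toy_posts.groupby(["course_id", "thread_id"])["votes_sq"].sum()

# For comparison: a plain sum lets +2 and -2 cancel in thread 10.
plain_sum = toy_posts.groupby(["course_id", "thread_id"])["votes"].sum()

print(vote_measure)
print(plain_sum)
```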

In [None]:
posts["votes_sq"] = posts["votes"]**2
posts_vote = posts.groupby(['course_id','thread_id'])['votes_sq'].sum().reset_index()
posts_vote

#### Average word count (avg_words)
The average of the word count of all messages in a thread.
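
When a per-thread aggregate such as this one is attached to the threads table, the rows need to be paired on `(course_id, thread_id)`. The sketch below uses made-up data (all `toy_` names are hypothetical) to show how plain column assignment, which aligns on the row index, can differ from a key-based merge when the two tables are ordered differently.

```python
import pandas as pd

# Toy data (made up). The threads table is deliberately not sorted by thread_id.
toy_threads = pd.DataFrame({"course_id": ["c1", "c1"], "thread_id": [20, 10]})
toy_posts = pd.DataFrame({
    "course_id": ["c1", "c1", "c1"],
    "thread_id": [10, 10, 20],
    "num_words": [10, 30, 100],
})

# groupby sorts by key, so the result is ordered thread 10, then thread 20.
avg_words = (
    toy_posts.groupby(["course_id", "thread_id"])["num_words"]
    .mean()
    .reset_index()
    .rename(columns={"num_words": "avg_words"})
)

# Column assignment aligns on the row index (0, 1, ...), not on thread_id.
by_index = toy_threads.copy()
by_index["avg_words"] = avg_words["avg_words"]

# A merge pairs rows by (course_id, thread_id).
by_key = toy_threads.merge(avg_words, on=["course_id", "thread_id"], how="left")

print(by_index)
print(by_key)
```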

In [None]:
thread_featured = pd.merge(threads,posts_vote, on = ["thread_id", "course_id"])

In [None]:
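avg_words = posts.groupby(['course_id', 'thread_id'])['num_words'].mean().astype(int).reset_index()
avg_words.rename(columns = {'num_words' : 'avg_words'}, inplace = True)

In [None]:
# Merge on the keys; direct column assignment would align on the row index rather than on thread_id.
thread_featured = pd.merge(thread_featured, avg_words, on = ['course_id', 'thread_id'])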

#### Number of Posts (num_posts)

In [None]:
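# Top-level posts have parent_id == 0; comments have a non-zero parent_id.
filt_post = posts.loc[posts['parent_id'] == 0]

num_posts = filt_post.groupby(['course_id', 'thread_id'])['parent_id'].count().reset_index()
num_posts.rename(columns = {'parent_id' : 'num_posts'}, inplace = True)
# Merge on the keys so that a count cannot be attached to the wrong thread.
thread_featured = pd.merge(thread_featured, num_posts, how='left', on=['course_id', 'thread_id'])
thread_featured['num_posts'] = thread_featured['num_posts'].fillna(0).astype(int)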

#### Number of Comments (num_comments)

In [None]:
filt_comment = posts['parent_id'] != 0
filt_comment = posts.loc[filt_comment]

num_comments = filt_comment.groupby(['course_id', 'thread_id'])['parent_id'].count().reset_index()
num_comments.rename(columns = {'parent_id' : 'num_comments'},inplace = True)

In [None]:
thread_featured = pd.merge(left=thread_featured, right=num_comments, how='left',
                           left_on=['thread_id','course_id'], right_on=['thread_id','course_id'])

thread_featured['num_comments'] = thread_featured['num_comments'].fillna(0).astype(int)

#### Number of Messages
Total number of messages in a thread [num_posts + num_comments]
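
A compact sketch of the post/comment split on made-up data, assuming (as in this notebook) that `parent_id == 0` marks a top-level post and any other value marks a comment. The `toy_posts` and `counts` names are hypothetical.

```python
import pandas as pd

# Toy data (made up): thread 10 has one post with two comments, thread 11 has two posts.
toy_posts = pd.DataFrame({
    "course_id": ["c1"] * 5,
    "thread_id": [10, 10, 10, 11, 11],
    "parent_id": [0, 101, 101, 0, 0],
})

is_post = toy_posts["parent_id"] == 0

# Turn the boolean mask into two 0/1 indicator columns and sum them per thread.
counts = (
    toy_posts.assign(
        num_posts=is_post.astype(int),
        num_comments=(~is_post).astype(int),
    )
    .groupby(["course_id", "thread_id"])[["num_posts", "num_comments"]]
    .sum()
    .reset_index()
)
counts["num_messages"] = counts["num_posts"] + counts["num_comments"]

print(counts)
```

Summing indicator columns gives a zero count for threads without comments directly, so no `fillna` step is needed in this variant.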

In [None]:
thread_featured['num_messages'] = thread_featured['num_comments'] +  thread_featured['num_posts']  

#### Relative Time 
Mean of the relative time of the posts in a thread, normalised.

In [None]:
t_rel = posts.groupby(['course_id', 'thread_id'])['relative_t'].mean().reset_index()
thread_featured = pd.merge(thread_featured, t_rel, on = ['course_id', 'thread_id'])

#### Staff Replied (staff_replied)
Whether any staff member, instructor, or Community TA has replied in the thread [Boolean].
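
The flag can also be derived with a per-thread `any()`. This is a minimal sketch on made-up data: the `user_type` value `"Instructor"` is an assumed example of a staff role, and `toy_posts` is a hypothetical name. As in the notebook, every user type other than `Student` and `Anonymous` is treated as staff.

```python
import pandas as pd

# Toy data (made up); "Instructor" is an assumed staff-type value.
toy_posts = pd.DataFrame({
    "course_id": ["c1"] * 4,
    "thread_id": [10, 10, 11, 11],
    "user_type": ["Student", "Instructor", "Anonymous", "Student"],
})

not_staff = ["Student", "Anonymous"]
is_staff = ~toy_posts["user_type"].isin(not_staff)

# A thread is flagged 1 if at least one of its messages comes from staff.
staff_replied = (
    is_staff.groupby([toy_posts["course_id"], toy_posts["thread_id"]])
    .any()
    .astype(int)
    .rename("staff_replied")
    .reset_index()
)

print(staff_replied)
```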

In [None]:
not_staff = ['Student', 'Anonymous']
posts['staff_replied'] = np.where(posts['user_type'].isin(not_staff), 0, 1)
st_replied = posts.groupby(['course_id','thread_id'])['staff_replied'].sum().reset_index()
st_replied['staff_replied'] = np.where(st_replied['staff_replied'] == 0, 0, 1)

In [None]:
st_replied

In [None]:
thread_featured = pd.merge(thread_featured, st_replied, on = ['course_id','thread_id'])

#### Number of Unique users (num_users)
Count of unique user IDs in a thread.

In [None]:
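num_users = posts.groupby(['course_id','thread_id']).agg({"user_id": "nunique"}).reset_index()
num_users.rename(columns = {'user_id' : 'num_users'}, inplace = True)
# Merge on the keys instead of assigning by row index.
thread_featured = pd.merge(thread_featured, num_users, on = ['course_id','thread_id'])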

#### Number of Anonymous Messages (anon_messages)
Total number of anonymous messages in a thread.
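
Threads without any anonymous message do not appear in a count computed on the filtered rows, so the count has to be merged back with a left join and the gaps filled with zero. A small sketch on made-up data (the `toy_` names are hypothetical):

```python
import pandas as pd

# Toy data (made up): thread 11 has no anonymous message.
toy_threads = pd.DataFrame({"course_id": ["c1", "c1"], "thread_id": [10, 11]})
toy_posts = pd.DataFrame({
    "course_id": ["c1"] * 4,
    "thread_id": [10, 10, 10, 11],
    "user_type": ["Anonymous", "Student", "Anonymous", "Student"],
})

# Count only the anonymous rows; thread 11 is absent from this result.
anon = (
    toy_posts[toy_posts["user_type"] == "Anonymous"]
    .groupby(["course_id", "thread_id"])
    .size()
    .reset_index(name="anon_messages")
)

# The left join keeps every thread; missing counts become 0.
merged = toy_threads.merge(anon, on=["course_id", "thread_id"], how="left")
merged["anon_messages"] = merged["anon_messages"].fillna(0).astype(int)

print(merged)
```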

In [None]:
anon_users = posts[posts['user_type'] == 'Anonymous'].groupby(['course_id', 'thread_id'])['user_type'].count().reset_index()
thread_featured = pd.merge(left = thread_featured, right = anon_users, how = 'left', left_on=['course_id', 'thread_id'], 
                            right_on=['course_id', 'thread_id'])

In [None]:
thread_featured.rename(columns = {'user_type' : 'anon_messages'}, inplace=True)
thread_featured['anon_messages'] = thread_featured['anon_messages'].fillna(0).astype(int)

In [None]:
posts.loc[posts['parent_id'] != 0]

#### Maximum Words (max_words)
Maximum number of words in a post in a thread.
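
Several of the per-thread statistics in this section (mean and maximum word count, mean relative time, number of unique users) can also be computed in a single pass with pandas named aggregation. This is an alternative sketch on made-up data, not the approach used in the cells of this notebook; `toy_posts` and `per_thread` are hypothetical names.

```python
import pandas as pd

# Toy data (made up), using the column names of the posts table.
toy_posts = pd.DataFrame({
    "course_id": ["c1"] * 4,
    "thread_id": [10, 10, 11, 11],
    "user_id": [1, 1, 2, 3],
    "num_words": [5, 15, 40, 20],
    "relative_t": [0.1, 0.3, 0.5, 0.7],
})

# One groupby, one output column per (source column, aggregation) pair.
per_thread = (
    toy_posts.groupby(["course_id", "thread_id"])
    .agg(
        avg_words=("num_words", "mean"),
        max_words=("num_words", "max"),
        relative_t=("relative_t", "mean"),
        num_users=("user_id", "nunique"),
    )
    .reset_index()
)

print(per_thread)
```

The resulting frame has one row per thread and can be joined to the threads table with a single merge on `course_id` and `thread_id`.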

In [None]:
max_words = posts.groupby(['course_id', 'thread_id'])['num_words'].max().reset_index()
max_words.rename(columns = {'num_words' : 'max_words'}, inplace = True)
thread_featured = pd.merge(thread_featured, max_words, on = ['course_id', 'thread_id'])

In [None]:
thread_featured

In [None]:
thread_copy = thread_featured.copy(deep= True)

In [None]:
thread_copy.drop(['og_forum', 'og_forum_id', 'forum_chain'], axis='columns', inplace=True)

In [None]:
#thread_copy['forum_id'] = thread_copy['forum_id'].replace([2, 3, 4, 7, 8, 9],['General', 'Assignments', 'Meetups', 'Feedback', 'Lectures', 'Logistics'])

In [None]:
thread_copy.dropna(inplace=True)
thread_copy = thread_copy[thread_copy['forum_id'].isin([2, 3, 4, 7, 8, 9])]

In [None]:
thread_copy.forum_id.unique()

In [None]:
#feature list
features = ['parent_forum_id', 'depth', 'num_views', 'num_tags', 
            'votes_sq', 'avg_words', 'num_posts', 'num_comments',
            'num_messages', 'relative_t', 'staff_replied', 'num_users',
            'anon_messages', 'max_words']
# Extracting features
X = thread_copy.loc[:, features].values

# Extracting target column
y = thread_copy.loc[:, ['forum_id']].values
y = y.ravel()

In [None]:
thread_copy.columns
thread_copy.forum_id.unique()

In [None]:
le = LabelEncoder()
le.fit([2, 3, 4, 7, 8, 9])

In [None]:
le.transform(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 55)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
clf_scaled = svm.SVC(kernel='linear', C = 1, probability=True).fit(X_train, y_train)

In [None]:
y_pred = clf_scaled.predict(X_test)

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.5803348325837081


In [None]:
y_prob = clf_scaled.predict_proba(X_test)

In [None]:
roc_auc = metrics.roc_auc_score(y_test, y_prob, multi_class="ovr",
                                  average="macro")

In [None]:
roc_auc

0.8018793626045625

In [None]:
X1 = scaler.fit_transform(X)

In [None]:
scores = model_selection.cross_val_score(clf_scaled, X1, y, cv=5, scoring='f1_macro')

In [None]:
scores
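
Scaling the full feature matrix before cross-validation means each fold's scaling statistics include its held-out rows. One common alternative, sketched here on synthetic data, is to wrap the scaler and the classifier in a scikit-learn pipeline so the scaler is refit on the training part of every fold. `X_toy`, `y_toy`, `pipe` and `toy_scores` are hypothetical names, the data is random, and `probability=True` is omitted because F1 scoring only needs class predictions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X and y (made up): 120 samples, 4 features, 3 classes.
rng = np.random.default_rng(0)
y_toy = np.repeat([2, 3, 4], 40)
X_toy = rng.normal(size=(120, 4)) + y_toy[:, None]

# The scaler is fit inside each fold, on that fold's training rows only.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))
toy_scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="f1_macro")

print(toy_scores.mean())
```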