# Week 5 - Project 1
## Josh Iden  
### 2/21/23

![](PROJECT_1.png)

## Introduction  

This project examines the **Social Network: MOOC User Action Dataset**, compiled by the Stanford Network Analysis Project (SNAP).

Source: [https://snap.stanford.edu/data/act-mooc.html](https://snap.stanford.edu/data/act-mooc.html)

From dataset documentation:

*The MOOC user action dataset represents the actions taken by users on a popular MOOC platform. The actions are represented as a directed, temporal network. The nodes represent users and course activities (targets), and edges represent the actions by users on the targets. The actions have attributes and timestamps. To protect user privacy, we anonimize the users and timestamps are standardized to start from timestamp 0. The dataset is directed, temporal, and attributed.*

*Additionally, each action has a binary label, representing whether the user dropped-out of the course after this action, i.e., whether this is last action of the user.*

This analysis focuses on one question: can a user's degree centrality be used to predict the total number of actions that user takes before dropping out of the course?
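As a rough illustration of the measure in question, the sketch below computes degree centrality on a toy user-target graph with `networkx`. The node names (`u0`, `t0`, ...) and edges are made up for illustration, and the graph is treated as undirected with repeated actions collapsed into a single edge, which is a simplifying assumption rather than the modeling choice of this project.

```python
import networkx as nx

# Toy edge list standing in for (USERID, TARGETID) pairs; values are illustrative only.
toy_edges = [("u0", "t0"), ("u0", "t1"), ("u1", "t0")]

# Assumption: an undirected simple graph, so repeated user-target actions count once.
G = nx.Graph()
G.add_edges_from(toy_edges)

# degree_centrality divides each node's degree by (n - 1), where n is the node count.
centrality = nx.degree_centrality(G)

# Keep only the user nodes (hypothetical "u" prefix used for this toy example).
user_centrality = {node: c for node, c in centrality.items() if node.startswith("u")}
print(user_centrality)
```

With four nodes in the toy graph, a user connected to two targets gets a centrality of 2/3 and a user connected to one target gets 1/3.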

## The Data

The dataset contains three files:  

**mooc_actions.tsv**: *Time-ordered sequence of user actions.*  
**mooc_action_features.tsv**: *Features associated with each action.*  
**mooc_action_labels.tsv**: *Binary label associated with each action, indicating whether the student drops-out after the action.*  

Loading the data into pandas:

In [7]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

actions_fp = "act-mooc/mooc_actions.tsv" 
features_fp = "act-mooc/mooc_action_features.tsv" 
labels_fp = "act-mooc/mooc_action_labels.tsv"

# the files are tab-separated
actions = pd.read_csv(actions_fp, sep="\t")
features = pd.read_csv(features_fp, sep="\t")
labels = pd.read_csv(labels_fp, sep="\t")

In [8]:
# preview actions file 
actions.head()

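|   | ACTIONID | USERID | TARGETID | TIMESTAMP |
|---|----------|--------|----------|-----------|
| 0 | 0 | 0 | 0 | 0.0 |
| 1 | 1 | 0 | 1 | 6.0 |
| 2 | 2 | 0 | 2 | 41.0 |
| 3 | 3 | 0 | 1 | 49.0 |
| 4 | 4 | 0 | 2 | 51.0 |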


In [9]:
# preview features file 
features.head()

|   | ACTIONID | FEATURE0 | FEATURE1 | FEATURE2 | FEATURE3 |
|---|----------|----------|----------|----------|----------|
| 0 | 0 | -0.319991 | -0.435701 | 0.106784 | -0.067309 |
| 1 | 1 | -0.319991 | -0.435701 | 0.106784 | -0.067309 |
| 2 | 2 | -0.319991 | -0.435701 | 0.106784 | -0.067309 |
| 3 | 3 | -0.319991 | -0.435701 | 0.106784 | -0.067309 |
| 4 | 4 | -0.319991 | -0.435701 | 0.106784 | -0.067309 |


In [10]:
# preview labels file
labels.head()

|   | ACTIONID | LABEL |
|---|----------|-------|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 3 | 0 |
| 4 | 4 | 0 |


This project uses only the `actions` and `labels` data, which I'd like to combine on the `ACTIONID` column. First, let's make sure the two dataframes have the same number of rows.

In [18]:
actions.shape[0] == labels.shape[0]

True
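Equal row counts alone do not guarantee that the join keys line up one to one. A minimal sketch of some extra checks follows, using small made-up frames (`toy_actions`, `toy_labels`) in place of the real data; the variable names are hypothetical and the toy values are chosen to show a case where the lengths match but the keys do not.

```python
import pandas as pd

# Toy frames standing in for `actions` and `labels`; values are illustrative only.
toy_actions = pd.DataFrame({"ACTIONID": [0, 1, 2], "USERID": [0, 0, 1]})
toy_labels = pd.DataFrame({"ACTIONID": [0, 1, 1], "LABEL": [0, 0, 1]})

# Same number of rows in both frames...
same_length = len(toy_actions) == len(toy_labels)

# ...but the key column on the right side contains a duplicate,
keys_unique = toy_labels["ACTIONID"].is_unique

# and the two sets of keys are not identical.
keys_match = set(toy_actions["ACTIONID"]) == set(toy_labels["ACTIONID"])

print(same_length, keys_unique, keys_match)
```

Running the same three checks on the real `actions` and `labels` frames before merging would show whether a left join can change the row count.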

In [43]:
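# subset the first two columns of actions data (ACTIONID, USERID)
# use a new variable: assigning to `actions.copy` would shadow the DataFrame.copy method
actions_subset = actions.iloc[:, :2].copy()

# left-join the labels onto the actions by ACTIONID
df = pd.merge(actions_subset, labels, how="left", on="ACTIONID")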
df["LABEL"] = df["LABEL"].astype('Int64')
df.head()

|   | ACTIONID | USERID | LABEL |
|---|----------|--------|-------|
| 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 2 | 0 | 0 |
| 3 | 3 | 0 | 0 |
| 4 | 4 | 0 | 0 |
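With one row per action, a per-user summary is the natural next step toward the question posed in the introduction. The sketch below shows one possible aggregation on a made-up frame with the same column names; `toy_df`, `n_actions`, and `dropped_out` are hypothetical names introduced here, not part of the dataset.

```python
import pandas as pd

# Toy frame with the same columns as the merged data; values are illustrative only.
toy_df = pd.DataFrame({
    "ACTIONID": [0, 1, 2, 3, 4],
    "USERID": [0, 0, 0, 1, 1],
    "LABEL": [0, 0, 1, 0, 0],
})

# One row per user: how many actions they took, and whether any action
# carried the drop-out label (max of a 0/1 column is 1 if any value is 1).
per_user = toy_df.groupby("USERID").agg(
    n_actions=("ACTIONID", "count"),
    dropped_out=("LABEL", "max"),
)
print(per_user)
```

In this toy example, user 0 has three actions and a drop-out flag of 1, while user 1 has two actions and a flag of 0.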


In [44]:
df.describe()

|       | ACTIONID | USERID | LABEL |
|-------|----------|--------|-------|
| count | 426865.0 | 426865.0 | 411749.0 |
| mean  | 205264.204072 | 3044.392241 | 0.009875 |
| std   | 118861.618527 | 1978.684215 | 0.098881 |
| min   | 0.0 | 0.0 | 0.0 |
| 25%   | 102390.0 | 1277.0 | 0.0 |
| 50%   | 204927.0 | 2846.0 | 0.0 |
| 75%   | 308092.0 | 4715.0 | 0.0 |
| max   | 411748.0 | 7046.0 | 1.0 |
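The `count` row above reports 426865 values for `ACTIONID` and `USERID` but only 411749 for `LABEL`, so the merged frame has more rows than non-null labels. One possible explanation, not verified here, is that some `ACTIONID` keys are duplicated or unlabeled on the labels side. The sketch below shows how `pd.merge` can surface both situations through its `indicator` and `validate` parameters, using made-up toy frames rather than the real data.

```python
import pandas as pd

# Toy frames; values are illustrative only. ACTIONID 1 is duplicated in the
# labels and ACTIONID 2 has no label at all.
toy_actions = pd.DataFrame({"ACTIONID": [0, 1, 2], "USERID": [0, 0, 1]})
toy_labels = pd.DataFrame({"ACTIONID": [0, 1, 1], "LABEL": [0, 0, 1]})

# A plain left merge duplicates the row for ACTIONID 1 and leaves ACTIONID 2
# with a missing LABEL; indicator=True adds a `_merge` column marking each row.
merged = pd.merge(toy_actions, toy_labels, how="left", on="ACTIONID", indicator=True)
n_rows = len(merged)
n_unlabeled = int((merged["_merge"] == "left_only").sum())

# validate="one_to_one" raises MergeError when keys are not unique on both sides.
try:
    pd.merge(toy_actions, toy_labels, how="left", on="ACTIONID", validate="one_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

print(n_rows, n_unlabeled, duplicate_keys_detected)
```

Here three input actions become four merged rows, one of which has no label, and the validated merge refuses to run.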


In [25]:
actions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 411749 entries, 0 to 411748
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   ACTIONID   411749 non-null  int64  
 1   USERID     411749 non-null  int64  
 2   TARGETID   411749 non-null  int64  
 3   TIMESTAMP  411749 non-null  float64
dtypes: float64(1), int64(3)
memory usage: 12.6 MB
