# Week 5 - Project 1
## Josh Iden  
### 2/21/23

![](PROJECT_1.png)

## Introduction  

This project will look at the **Social Network: MOOC User Action Dataset** data compiled by the Stanford Network Analysis Project (SNAP). 

Source: [https://snap.stanford.edu/data/act-mooc.html](https://snap.stanford.edu/data/act-mooc.html)

From dataset documentation:

*The MOOC user action dataset represents the actions taken by users on a popular MOOC platform. The actions are represented as a directed, temporal network. The nodes represent users and course activities (targets), and edges represent the actions by users on the targets. The actions have attributes and timestamps. To protect user privacy, we anonimize the users and timestamps are standardized to start from timestamp 0. The dataset is directed, temporal, and attributed.*

*Additionally, each action has a binary label, representing whether the user dropped-out of the course after this action, i.e., whether this is last action of the user.*

This analysis asks a hypothetical question: can degree centrality be used to predict the total number of actions a user takes before dropping out of the course?
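For reference, the degree centrality of a node is its degree divided by the maximum possible degree, n - 1, where n is the number of nodes. A minimal sketch on a made-up user-to-target graph (the node names `u1`, `u2`, `t1`, `t2` are hypothetical and not taken from the dataset):

```python
import networkx as nx

# Hypothetical toy graph: users point at the course activities they touched.
toy = nx.DiGraph()
toy.add_edges_from([("u1", "t1"), ("u1", "t2"), ("u2", "t1")])

# For a directed graph, degree_centrality counts in- and out-edges together
# and divides by (number of nodes - 1).
centrality = nx.degree_centrality(toy)
for node, value in sorted(centrality.items()):
    print(node, round(value, 3))
```

With four nodes, a node touching two edges gets 2/3 and a node touching one edge gets 1/3.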

## The Data

The dataset contains three files:  

**mooc_actions.tsv**, 	*Time-ordered sequence of user actions.*  
**mooc_action_features.tsv**,  	*Features associated with each action.*  
**mooc_action_labels.tsv**, 	*Binary label associated with each action, indicating whether the student drops-out after the action.*  

Loading the data into pandas:

In [119]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline

actions_fp = "act-mooc/mooc_actions.tsv" 
features_fp = "act-mooc/mooc_action_features.tsv" 
labels_fp = "act-mooc/mooc_action_labels.tsv"

actions = pd.read_csv(actions_fp, sep ="\t")
features = pd.read_csv(features_fp, sep = "\t")
labels = pd.read_csv(labels_fp, sep = "\t")

In [4]:
# preview actions file 
actions.head()

Unnamed: 0,ACTIONID,USERID,TARGETID,TIMESTAMP
0,0,0,0,0.0
1,1,0,1,6.0
2,2,0,2,41.0
3,3,0,1,49.0
4,4,0,2,51.0


In [5]:
# preview features file 
features.head()

Unnamed: 0,ACTIONID,FEATURE0,FEATURE1,FEATURE2,FEATURE3
0,0,-0.319991,-0.435701,0.106784,-0.067309
1,1,-0.319991,-0.435701,0.106784,-0.067309
2,2,-0.319991,-0.435701,0.106784,-0.067309
3,3,-0.319991,-0.435701,0.106784,-0.067309
4,4,-0.319991,-0.435701,0.106784,-0.067309


In [6]:
# preview labels file
labels.head()

Unnamed: 0,ACTIONID,LABEL
0,0,0
1,1,0
2,2,0
3,3,0
4,4,0


For this project, we are only focusing on the `actions` and `labels` data. I'd like to combine these two datasets using the `ACTIONID` column. Let's make sure we have the same number of rows in each dataframe. 

In [7]:
# check the datasets have equal number of observations
actions.shape[0] == labels.shape[0]

True
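Matching row counts do not by themselves guarantee that every `ACTIONID` in one table has a partner in the other. A small sketch of a stricter check, using made-up frames (`left` and `right` are hypothetical stand-ins for `actions` and `labels`): the `indicator=True` option of `pd.merge` records where each row came from.

```python
import pandas as pd

# Hypothetical stand-ins for the actions and labels tables.
left = pd.DataFrame({"ACTIONID": [0, 1, 2, 3], "USERID": [0, 0, 1, 1]})
right = pd.DataFrame({"ACTIONID": [0, 1, 2, 9], "LABEL": [0, 0, 1, 0]})

# Same number of rows, but not the same set of keys.
print(len(left) == len(right))

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join.
merged = pd.merge(left, right, how="left", on="ACTIONID", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["ACTIONID", "USERID"]])
```

Rows flagged `left_only` are the ones that would end up with a missing `LABEL` after a left join.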

In [9]:
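# subset the first two columns of actions data (ACTIONID, USERID)
# note: assigning to actions.copy would overwrite the DataFrame.copy method
actions_subset = actions.iloc[:, :2].copy()

# join the datasets
df = pd.merge(actions_subset, labels, how="left", on="ACTIONID")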
df["LABEL"] = df["LABEL"].astype('Int64')
df.head()

Unnamed: 0,ACTIONID,USERID,LABEL
0,0,0,0
1,1,0,0
2,2,0,0
3,3,0,0
4,4,0,0


In [10]:
df.describe()

Unnamed: 0,ACTIONID,USERID,LABEL
count,426865.0,426865.0,411749.0
mean,205264.204072,3044.392241,0.009875
std,118861.618527,1978.684215,0.098881
min,0.0,0.0,0.0
25%,102390.0,1277.0,0.0
50%,204927.0,2846.0,0.0
75%,308092.0,4715.0,0.0
max,411748.0,7046.0,1.0


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 426865 entries, 0 to 426864
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   ACTIONID  426865 non-null  int64
 1   USERID    426865 non-null  int64
 2   LABEL     411749 non-null  Int64
dtypes: Int64(1), int64(2)
memory usage: 13.4 MB


The `LABEL` column has 15,116 missing values (426,865 rows, of which 411,749 are non-null). Removing these rows might affect our analysis. For the purposes of this analysis, although it is not scientifically rigorous, I am going to assume the NAs do not indicate a drop-out action and impute them with zeros.

In [53]:
df['LABEL'] = df['LABEL'].fillna(0)

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 426865 entries, 0 to 426864
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   ACTIONID  426865 non-null  int64
 1   USERID    426865 non-null  int64
 2   LABEL     426865 non-null  Int64
dtypes: Int64(1), int64(2)
memory usage: 29.6 MB
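Imputing zeros leaves the number of drop-out actions unchanged but spreads them over more rows, so the drop-out rate falls slightly. A rough back-of-the-envelope check using the rounded values printed by `df.describe()` and `df.info()` earlier (the variable names are just for illustration):

```python
# Values copied from the df.describe() and df.info() output above.
rows_total = 426865        # rows after the merge
rows_labeled = 411749      # rows with a non-null LABEL
mean_labeled = 0.009875    # mean of LABEL over the labeled rows (rounded)

# LABEL is 0/1, so mean * count approximates the number of drop-out actions.
dropouts = round(mean_labeled * rows_labeled)

# After filling NAs with 0, the same drop-outs are divided by all rows.
rate_after_impute = dropouts / rows_total

print(dropouts)
print(round(rate_after_impute, 6))
```

Because the printed mean is rounded, the drop-out count is only an estimate.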


Next, let's look at this data using the `USERID` and `TARGETID` columns as the nodes. 

In [96]:
df1 = pd.merge(actions, labels, how='left', on='ACTIONID')
df1 = df1[['USERID','TARGETID','LABEL']]
df1['LABEL'] = df1['LABEL'].astype('Int64').fillna(0)
df1.head()

Unnamed: 0,USERID,TARGETID,LABEL
0,0,0,0
1,0,1,0
2,0,2,0
3,0,1,0
4,0,2,0


In [112]:
df1.describe()

Unnamed: 0,USERID,TARGETID,LABEL
count,426865.0,426865.0,426865.0
mean,3044.392241,26.73437,0.009525
std,1978.684215,21.096347,0.097132
min,0.0,0.0,0.0
25%,1277.0,9.0,0.0
50%,2846.0,22.0,0.0
75%,4715.0,39.0,0.0
max,7046.0,96.0,1.0


There are 7,047 user IDs (0 to 7046) and target IDs ranging from 0 to 96. Connecting this data in a single graph is going to be difficult. 

## Analysis  

First we create a graph object from the pandas dataframe. Then we calculate the degree centrality based on the graph. Next we'll perform a t-test comparing the mean degree centrality for actions that result in a student dropping out against the mean for actions that do not. 
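As a sketch of the planned comparison, Welch's t-test from SciPy compares the means of two groups without assuming equal variances. SciPy is not used elsewhere in this notebook, and the centrality values here are made up purely for illustration:

```python
from scipy import stats

# Hypothetical degree-centrality values for two groups.
centrality_dropped = [0.002, 0.003, 0.001, 0.004, 0.002]
centrality_stayed = [0.006, 0.005, 0.007, 0.004, 0.008]

# equal_var=False requests Welch's t-test (unequal variances allowed).
t_stat, p_value = stats.ttest_ind(centrality_dropped, centrality_stayed, equal_var=False)
print("t statistic:", round(t_stat, 3))
print("p-value:", round(p_value, 4))
```

A negative t statistic here only reflects that the first made-up group has the lower mean; it says nothing about the real data.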

In [107]:
# using networkx 2.3 (from_pandas_dataframe was renamed to from_pandas_edgelist in networkx 2.1)
G = nx.from_pandas_edgelist(df1, 'USERID', 'TARGETID', edge_attr='LABEL', create_using=nx.DiGraph())
print(nx.info(G))

Name: 
Type: DiGraph
Number of nodes: 7047
Number of edges: 178443
Average in degree:  25.3218
Average out degree:  25.3218
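
One thing to note about this summary: the graph has 7,047 nodes, the same as the number of users, even though targets were added as nodes too. `USERID` and `TARGETID` both start at 0, so user 0 and target 0 end up as the same node. A `DiGraph` also keeps only one edge per (user, target) pair, which is why 426,865 rows produce 178,443 edges. A minimal sketch of one way to keep the two node types apart, assuming string prefixes `u` and `t` (my own convention, not part of the dataset), a toy frame, and networkx 2.1 or later:

```python
import networkx as nx
import pandas as pd

# Toy frame with overlapping integer IDs and one repeated (user, target) pair.
toy = pd.DataFrame({"USERID": [0, 0, 1, 1], "TARGETID": [0, 1, 0, 0], "LABEL": [0, 0, 0, 1]})

# Raw IDs: user 0 and target 0 collapse into a single node.
raw = nx.from_pandas_edgelist(toy, "USERID", "TARGETID", create_using=nx.DiGraph())

# Prefixed IDs (hypothetical naming): users and targets stay distinct.
labeled = toy.assign(
    USER="u" + toy["USERID"].astype(str),
    TARGET="t" + toy["TARGETID"].astype(str),
)
split = nx.from_pandas_edgelist(labeled, "USER", "TARGET", create_using=nx.DiGraph())

print(raw.number_of_nodes(), raw.number_of_edges())
print(split.number_of_nodes(), split.number_of_edges())
```

In the toy case the raw graph has two nodes while the prefixed graph has four, and the four rows collapse to three edges in both.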


In [108]:
G.edges(data=True)

[(0, 0, {'LABEL': 0}),
 (0, 1, {'LABEL': 0}),
 (0, 2, {'LABEL': 0}),
 (0, 3, {'LABEL': 0}),
 (0, 4, {'LABEL': 0}),
 (0, 5, {'LABEL': 0}),
 (0, 6, {'LABEL': 0}),
 (0, 7, {'LABEL': 0}),
 (0, 8, {'LABEL': 0}),
 (0, 9, {'LABEL': 0}),
 (0, 10, {'LABEL': 0}),
 (0, 13, {'LABEL': 0}),
 (0, 15, {'LABEL': 0}),
 (0, 20, {'LABEL': 0}),
 (0, 17, {'LABEL': 0}),
 (0, 19, {'LABEL': 0}),
 (0, 18, {'LABEL': 0}),
 (0, 16, {'LABEL': 0}),
 (0, 21, {'LABEL': 0}),
 (0, 23, {'LABEL': 0}),
 (0, 11, {'LABEL': 0}),
 (0, 12, {'LABEL': 0}),
 (0, 67, {'LABEL': 0}),
 (0, 69, {'LABEL': 0}),
 (1, 10, {'LABEL': 0}),
 (1, 1, {'LABEL': 0}),
 (1, 2, {'LABEL': 0}),
 (1, 7, {'LABEL': 0}),
 (1, 0, {'LABEL': 0}),
 (1, 11, {'LABEL': 0}),
 (1, 12, {'LABEL': 0}),
 (1, 5, {'LABEL': 0}),
 (1, 6, {'LABEL': 0}),
 (1, 8, {'LABEL': 0}),
 (1, 16, {'LABEL': 0}),
 (1, 17, {'LABEL': 0}),
 (1, 18, {'LABEL': 0}),
 (1, 19, {'LABEL': 0}),
 (2, 1, {'LABEL': 0}),
 (2, 10, {'LABEL': 0}),
 (2, 3, {'LABEL': 0}),
 (2, 13, {'LABEL': 0}),
 ...

In [111]:
# dict() materializes the result, which is a generator in networkx 2.x
dict(nx.shortest_path_length(G))

{0: {0: 0,
  1: 1,
  2: 1,
  3: 1,
  4: 1,
  5: 1,
  6: 1,
  7: 1,
  8: 1,
  9: 1,
  10: 1,
  13: 1,
  15: 1,
  20: 1,
  17: 1,
  19: 1,
  18: 1,
  16: 1,
  21: 1,
  23: 1,
  11: 1,
  12: 1,
  67: 1,
  69: 1,
  14: 2,
  25: 2,
  26: 2,
  28: 2,
  27: 2,
  29: 2,
  30: 2,
  31: 2,
  33: 2,
  34: 2,
  22: 2,
  35: 2,
  32: 2,
  24: 2,
  36: 2,
  37: 2,
  41: 2,
  38: 2,
  45: 2,
  46: 2,
  44: 2,
  50: 2,
  39: 2,
  40: 2,
  47: 2,
  55: 2,
  48: 2,
  52: 2,
  49: 2,
  51: 2,
  53: 2,
  54: 2,
  42: 2,
  57: 2,
  59: 2,
  56: 2,
  78: 2,
  58: 2,
  80: 2,
  60: 2,
  61: 2,
  63: 2,
  62: 2,
  86: 2,
  65: 2,
  64: 2,
  79: 2,
  66: 2,
  84: 2,
  85: 2,
  43: 2,
  83: 2,
  81: 2,
  68: 2,
  70: 2,
  72: 2,
  76: 2,
  88: 2,
  87: 2,
  82: 2,
  71: 2,
  91: 3,
  96: 3,
  75: 3,
  74: 3,
  90: 3,
  73: 3,
  92: 3,
  93: 3,
  94: 3,
  95: 3,
  77: 3,
  89: 3},
 1: {1: 0,
  10: 1,
  2: 1,
  7: 1,
  0: 1,
  11: 1,
  12: 1,
  5: 1,
  6: 1,
  8: 1,
  16: 1,
  17: 1,
  18: 1,
  19: 1,
  3: 2,
  ...

In [None]:
nx.draw(G)

Here's a problem I didn't anticipate. Every edge runs from a user to a target, so there are no direct connections between User IDs. However, the TARGETIDs are also nodes, representing the activities each user performs. If I use these as nodes, I could adjust my analysis to measure the degree centrality of particular activities with respect to the number of actions and whether a user dropped out. Let's give it a shot:

In [65]:
# collect the degree of every node; dict() handles both the dict returned by
# older networkx versions and the DegreeView returned by networkx 2.x
degrees = list(dict(G.degree()).values())

print("Minimum degree: ", min(degrees))
print("Maximum degree: ", max(degrees))
print("Average degree: ", round(sum(degrees)/len(degrees)))
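
To illustrate the idea of measuring how central each activity is, here is a small sketch on a toy graph (the node names are hypothetical). Wrapping the result of `in_degree()` in `dict()` gives a plain mapping from node to degree and sidesteps differences in what networkx versions return:

```python
import networkx as nx

# Hypothetical toy graph: edges run from users (u*) to activities (t*).
toy = nx.DiGraph()
toy.add_edges_from([("u1", "t1"), ("u2", "t1"), ("u3", "t1"), ("u1", "t2")])

# In-degree of a target = number of distinct users who acted on it.
in_deg = dict(toy.in_degree())
targets = {node: deg for node, deg in in_deg.items() if node.startswith("t")}

print("Minimum degree: ", min(targets.values()))
print("Maximum degree: ", max(targets.values()))
print("Average degree: ", sum(targets.values()) / len(targets))
```

Restricting the summary to target nodes keeps user nodes, whose in-degree is zero in this layout, from dragging the statistics down.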

Since the users are not directly connected to one another, let's instead look at the number of actions (ACTIONIDs) each user took:

In [122]:
num_actions = df.groupby('USERID')['ACTIONID'].count().to_frame("ACTIONS").reset_index()
num_actions

Unnamed: 0,USERID,ACTIONS
0,0,78
1,1,27
2,2,198
3,3,6
4,4,10
...,...,...
7042,7042,5
7043,7043,19
7044,7044,7
7045,7045,11


In [123]:
num_actions.describe()

Unnamed: 0,USERID,ACTIONS
count,7047.0,7047.0
mean,3523.0,60.574003
std,2034.438006,60.265508
min,0.0,5.0
25%,1761.5,14.0
50%,3523.0,39.0
75%,5284.5,90.0
max,7046.0,517.0
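
To tie the action counts back to the drop-out label, one option is to aggregate both per user in a single `groupby`. A minimal sketch on made-up rows with the same column names as `df` (the output column names `ACTIONS` and `DROPPED` are my own choice, and named aggregation assumes pandas 0.25 or later):

```python
import pandas as pd

# Toy rows shaped like df: one row per action, LABEL = 1 on a drop-out action.
toy = pd.DataFrame({
    "ACTIONID": [0, 1, 2, 3, 4, 5],
    "USERID":   [0, 0, 0, 1, 1, 2],
    "LABEL":    [0, 0, 1, 0, 0, 0],
})

# Count actions per user and flag users with at least one drop-out label.
per_user = (
    toy.groupby("USERID")
       .agg(ACTIONS=("ACTIONID", "count"), DROPPED=("LABEL", "max"))
       .reset_index()
)
print(per_user)

# Mean number of actions for each group.
print(per_user.groupby("DROPPED")["ACTIONS"].mean())
```

The resulting table has one row per user, which is the shape needed to compare action counts between users who dropped out and users who did not.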
