Paper: https://www.andrew.cmu.edu/user/lakoglu/pubs/StackOverflow-churn.pdf

Description of datasets: https://ia800107.us.archive.org/27/items/stackexchange/readme.txt

Site for download of datasets: https://archive.org/details/stackexchange

This code has 6 steps

    1. Load StackOverflow datasets as dataframe
    2. Extract and label the datasets for each tasks
    3. Extract & analyze features for each tasks
    4. Analyze features
    5. Train models for each tasks with the features
    6. Draw the graphs in the paper

1. Load StackOverflow datasets as dataframe

In [1]:
import sys
!{sys.executable} -m pip install xmltodict

Collecting xmltodict
  Downloading https://files.pythonhosted.org/packages/28/fd/30d5c1d3ac29ce229f6bdc40bbc20b28f716e8b363140c26eff19122d8a5/xmltodict-0.12.0-py2.py3-none-any.whl
Installing collected packages: xmltodict
Successfully installed xmltodict-0.12.0


In [0]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

import xmltodict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [6]:
def load_from_google_drive(dir_id):
    files = []
    file_list = drive.ListFile({'q': "'{}' in parents".format(dir_id)}).GetList()
    for f in file_list:
        if f['title'] in ['Users.xml', 'Posts.xml',
                          'users_reduce.pkl', 'posts_reduce.pkl']\
                or 'pkl' in f['title'] or 'csv' in f['title']:
            print('  Load file: {}'.format(f['title']))
            f_ = drive.CreateFile({'id': f['id']})
            f_.GetContentFile(f['title'])
            files.append(f['title'])
    return files
  
  
# load_from_google_drive('1Fp_7GDH_t7xfnU8aXeKrcBC54_nECOcu')  ###| Full dataset
# load_from_google_drive('1haYAgCV-TqTMYIk8N4eGE9H4hY2np5xr')   ### Small dataset
load_from_google_drive('1CRE27AaxJuX-9Kxtgk2GnmxQt6ECHeJS')   ### Tiny dataset


  Load file: Users.xml
  Load file: Posts.xml


['Users.xml', 'Posts.xml']

In [0]:
# Read xml file and transform to pandas dataframe

def xml2df(xml_path):
    with open(xml_path) as f:
        dict_xml = xmltodict.parse(f.read())
        key = xml_path.split('.')[0].lower()
        xml_df = pd.DataFrame(dict_xml[key]['row'])
        xml_df.columns = xml_df.columns.str.lstrip('@')

        return xml_df

In [7]:
# 1. Read Users.xml

xml_path = 'Users.xml'
users_df = xml2df(xml_path)

# 2. Change data type of columns
users_df.head()
users_df['CreationDate'] = pd.to_datetime(users_df['CreationDate'])
users_df.dropna(subset=['Id'], inplace=True)
users_df['Id'] = users_df['Id'].astype(np.int64)
users_df.head()

Unnamed: 0,Id,Reputation,CreationDate,DisplayName,LastAccessDate,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,WebsiteUrl,ProfileImageUrl
0,-1,1,2012-02-14 18:31:47.350,Community,2012-02-14T18:31:47.350,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",73,51,17,-1,,
1,2,101,2012-02-14 20:17:36.000,Geoff Dalgas,2014-08-29T13:45:59.997,"Corvallis, OR",<p>Developer on the Stack Overflow team. Find...,0,1,0,2,http://stackoverflow.com,https://i.stack.imgur.com/nDllk.png
2,3,3114,2012-02-14 20:20:16.000,Andy W,2017-10-28T00:57:06.677,"Dallas, TX, United States",<p>Assistant professor of criminology at the U...,8,291,15,298433,http://andrewpwheeler.wordpress.com/,
3,5,1969,2012-02-14 20:22:08.000,Stefano Borini,2014-05-24T05:16:16.957,,,52,2,0,29108,http://ForTheScience.org,
4,6,445,2012-02-14 20:22:35.000,Brian Ballsun-Stanton,2012-02-20T19:25:09.473,"Sydney, Australia","<p>Dr. Brian Ballsun-Stanton is a Philosopher,...",0,0,0,97049,http://denubis.wordpress.com,https://i.stack.imgur.com/jUaZ0.jpg


In [301]:
# 1. Read Posts.xml

xml_path = 'Posts.xml'
posts_df = xml2df(xml_path)

# 2. Change data type of columns
posts_df.head()
posts_df['CreationDate'] = pd.to_datetime(posts_df['CreationDate'])
posts_df.dropna(subset=['OwnerUserId'], inplace=True)
posts_df['OwnerUserId'] = posts_df['OwnerUserId'].astype(np.int64)
posts_df['PostTypeId'] = posts_df['PostTypeId'].astype(np.int64)
posts_df.head()

Unnamed: 0,Id,PostTypeId,CreationDate,Score,ViewCount,Body,OwnerUserId,LastEditorUserId,LastEditDate,LastActivityDate,Title,Tags,AnswerCount,CommentCount,ParentId,OwnerDisplayName,AcceptedAnswerId,FavoriteCount,ClosedDate,LastEditorDisplayName,CommunityOwnedDate
0,1,1,2012-02-14 20:39:10.140,4,114.0,<p>I would like to open the meta discussion on...,5,40592.0,2016-08-15T14:58:04.950,2016-08-15T14:58:04.950,Softness of the closing criteria,<discussion>,2.0,0,,,,,,,
1,2,1,2012-02-14 20:41:04.273,3,108.0,"<p>Suppose user X comes in and ask ""How is the...",5,,,2012-02-14T21:06:26.337,"How should we behave for the ""reference"" quest...",<discussion>,1.0,0,,,,,,,
3,4,1,2012-02-14 21:37:15.053,3,67.0,<p>I think several questions are doing to be f...,30,,,2012-02-14T21:46:33.213,Is community wiki not available during beta?,<discussion><community-wiki>,1.0,0,,,5.0,1.0,,,
4,5,2,2012-02-14 21:46:33.213,5,,<p>Community wiki <em>questions</em> are a mod...,23,,,2012-02-14T21:46:33.213,,,,1,4.0,,,,,,
5,6,2,2012-02-14 22:07:56.987,4,,<p>I think it is a bit early to formulate poli...,31,,,2012-02-14T22:07:56.987,,,,0,1.0,,,,,,


In [0]:
posts_df['AcceptedAnswerId'].unique()

array([nan, '5', '26', '9', '14', '28', '38', '47', '52', '61', '67',
       '71', '75', '98', '105', '117', '154', '160', '162', '167', '173',
       '185', '183', '189', '193', '202', '209', '289', '304', '308',
       '318', '324', '328', '331', '356', '379', '406', '422', '427',
       '460', '479', '489', '492', '498', '567', '536', '538', '547',
       '552', '560', '579', '611', '621', '636', '637', '672', '684',
       '703', '698', '701', '708', '707', '716', '753', '847', '828',
       '836', '848', '871', '870', '873', '877', '883', '890', '916',
       '924', '940', '957', '971', '973', '988', '1006', '1030', '1060',
       '1079', '1085', '1088', '1113', '1096', '1108', '1115', '1127',
       '1132', '1134', '1146', '1150', '1152', '1156', '1167', '1171',
       '1177', '1209', '1223', '1244', '1249', '1255', '1271', '1279',
       '1287', '1311', '1320', '1323', '1336', '1334', '1345', '1348',
       '1366', '1436', '1439', '1441', '1451', '1463', '1477', '1493',
       '

In [9]:
# Save and Load dataframe
from google.colab import drive
drive.mount('/content/gdrive')

def save_df(df, filename):
    df.to_pickle("{}.pkl".format(filename))

    
def load_df(filename):
    return pd.read_pickle("{}.pkl".format(filename))

  


Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/gdrive


2. Extract and label the datasets for each tasks

You should extract the dataset for the period of the dataset: July 31, 2008 ~  July 31, 2012 

There are 2 tasks:

    A. After a user's K-th post, predict how likely it is that the user will churn
    B. After the T-th day from the account creation of a user, predict how likely it is that the user will churn

In [247]:
# Users와 posts를 CreationDate 기준으로 filtering (July 31, 2008 ~  July 31, 2012)

start_time = pd.to_datetime('2008-07-31')
end_time = pd.to_datetime('2012-07-31')

posts_df = posts_df[(posts_df['CreationDate'] >= start_time) & (posts_df['CreationDate'] <= end_time)]
print(posts_df.columns)
users_df = users_df[(users_df['CreationDate'] >= start_time) & (users_df['CreationDate'] <= end_time)]
print(users_df.columns)

Index(['Id', 'PostTypeId', 'CreationDate', 'Score', 'ViewCount', 'Body',
       'OwnerUserId', 'LastEditorUserId', 'LastEditDate', 'LastActivityDate',
       'Title', 'Tags', 'AnswerCount', 'CommentCount', 'ParentId',
       'OwnerDisplayName', 'AcceptedAnswerId', 'FavoriteCount', 'ClosedDate',
       'LastEditorDisplayName', 'CommunityOwnedDate'],
      dtype='object')
Index(['Id', 'Reputation', 'CreationDate', 'DisplayName', 'LastAccessDate',
       'Location', 'AboutMe', 'Views', 'UpVotes', 'DownVotes', 'AccountId',
       'WebsiteUrl', 'ProfileImageUrl'],
      dtype='object')


In [0]:
# Dataset in Task 1
#   Users: Post가 K개 이상인 user만 추출
#   Posts: User의 CreationDate 이후 K개의 posts만 추출

def getTask1Posts(posts, K=20):
    return posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId').filter(lambda post : len(post) >= K).groupby('OwnerUserId', as_index=False).apply(lambda owner : owner.nsmallest(K, 'CreationDate')).reset_index(level = 1, drop=True)

def getTask1Users(users, posts, K=20):
     return pd.DataFrame(posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId').filter(lambda post : len(post) >= K)['OwnerUserId'].drop_duplicates())
#     return posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId').filter(lambda post : len(post) >= K).groupby('OwnerUserId')['CreationDate'].nsmallest(K).index.unique(level=0)

In [0]:
getTask1Users(users_df.set_index('Id'),posts_df)

# getTask1Users(users_df, posts_df)
# post_temp = getTask1Users(users_df[users_df['Id'].isin(posts['OwnerUserId'])], posts)

Unnamed: 0_level_0,Reputation,CreationDate,DisplayName,LastAccessDate,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,WebsiteUrl,ProfileImageUrl,posts
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
73,42744,2012-02-15 00:04:47,eykanal,2019-08-28T17:51:31.323,,<p>I'm currently working in CMU's Software Eng...,292,356,47,56350,http://blog.erikdev.com/,,Id PostTypeId ... LastEditorDisplayName...


In [0]:
# Dataset in Task 2
#   Users: Post가 1개 이상인 user만 추출
#   Posts: User의 CreationDate로부터 T days 이내에 생성된 posts만 추출

def getTask2Posts(users, posts, T=30):
#     return posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId').filter(lambda post : len(post) >= K).groupby('OwnerUserId')['CreationDate'].nsmallest(K).index.unique(level=0)
#       return users.apply(lambda user : posts[(posts['OwnerUserId'] == user['Id']) & (posts['CreationDate'] >= user['CreationDate']) & (posts['CreationDate'] <= user['CreationDate'] +  pd.to_timedelta(T, 'days'))].sort_values('CreationDate'), axis=1)
#     return posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId').apply(lambda owner : owner[(owner['CreationDate'] >= users[users['Id'] == owner.index.unique()[0]]['CreationDate']) & (owner['CreationDate'] <= users[users['Id'] == owner['OwnerUserId'].unique()[0]][CreationDate] + timedelta(days=T))]).reset_index(level = 1, drop=True)
    return posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId').apply(lambda owner : owner[(owner['CreationDate'] <= users[users['Id'] == owner['OwnerUserId'].unique()[0]]['CreationDate'].values[0] + pd.to_timedelta(T, 'days'))]).reset_index(level = 1, drop=True)
#     return users

In [0]:
getTask2Posts(users_df, posts_df)

In [0]:
# Churn in Task 1
#   Churners: Users who did not post for at least 6 months from their K-th post 
#   Stayers:  Users who created at least one post within the 6 months from their K-th post

def getTask1Labels(users, posts, K=20):
    label_df = users.drop(users.columns, axis=1)
    label_df = getTask1Users(label_df, posts, K=K)
    label_df = getTimeGap1OfUser(label_df, posts)
    label_df = pd.merge(label_df, getTimeGapsOfPosts(posts, K), on='OwnerUserId')
    print(label_df)
    label_df = pd.merge(label_df, getTimeLastGapOfPosts(posts), on='OwnerUserId')
    label_df = getTimeSinceLastPost(label_df, posts)
    label_df = pd.merge(label_df, getTimeMeanGap(posts), on='OwnerUserId')
    label_df = pd.merge(label_df, getNumAnswers(posts), on='OwnerUserId')
    label_df = pd.merge(label_df, getNumQuestions(posts), on='OwnerUserId')
    label_df = pd.merge(label_df, getNumQuestions(label_df['num_answers'], label_df['num_questions']), on='OwnerUserId')
    
    label_df['is_churn'] = 0.0
    return label_df


In [319]:
getTask1Labels(users_df, posts_df).head()

OwnerUserId    
49           0     1.388314e+05
             0     2.711193e+06
             0     1.540083e+07
             0     4.268178e+06
             0     8.113198e+04
             0     6.037341e+05
             0     3.180121e+05
             0     1.315864e+06
             0     3.865373e+06
             0     1.040040e+06
             0     3.536625e+06
             0     3.291128e+06
             0     2.044127e+06
             0     9.814771e+06
             0     1.199587e+07
             0     3.545225e+06
             0     5.122160e+02
             0     8.409170e+02
             0     6.814835e+04
53           1     5.085749e+05
             1     5.018173e+04
             1     7.628693e+04
             1     3.029222e+04
             1     1.143713e+06
             1     1.462355e+06
             1     2.130517e+05
             1     2.932523e+06
             1     9.050564e+04
             1     4.640255e+06
             1     3.286944e+05
                       .

AttributeError: ignored

In [0]:
# Churn in Task2
#   Churners: Users who did not post for at least 6 months from T days after account creation
#   Stayers:  Users who created at least one post within the 6 months from T days after account creation

def getTask2Labels(users, posts, T=30):
    label_df = users.drop(users.columns, axis=1)
    label_df = getTask1Users(label_df, posts, K=1)

    label_df['is_churn'] = 0.0
    return label_df


In [143]:
getTask2Labels(users_df, posts_df).head()

Unnamed: 0,OwnerUserId,is_churn
0,5,0.0
3,30,0.0
4,23,0.0
5,31,0.0
7,73,0.0


3. Extract features for each tasks

3-1. Temporal features

In [0]:
# Temporal features 1: gap1
def getTimeGap1OfUser(users, posts):
    first_post = getTask1Posts(posts, K=1)
    
    users['gap1'] = users.apply(lambda user : (first_post[first_post['OwnerUserId'] == user['OwnerUserId']]['CreationDate'].reset_index(drop=True) - users_df[users_df['Id'] == user['OwnerUserId']]['CreationDate'].reset_index(drop=True)).dt.total_seconds(), axis = 1)
    return users


In [0]:
post_temp['gap'] = post_temp.apply(lambda post : post['CreationDate'].diff().tolist())
print(post_temp['gap'].iloc[-1])

[NaT, Timedelta('0 days 14:21:25.293000'), Timedelta('0 days 00:14:48.797000'), Timedelta('0 days 08:47:33.007000'), Timedelta('0 days 02:01:49.156000'), Timedelta('5 days 17:47:39.614000'), Timedelta('0 days 20:06:31.533000'), Timedelta('0 days 20:48:10.697000'), Timedelta('3 days 07:59:44.033000'), Timedelta('0 days 00:06:10.617000'), Timedelta('4 days 19:02:11.143000'), Timedelta('4 days 08:25:12.413000'), Timedelta('1 days 01:24:51.917000'), Timedelta('3 days 02:54:37.030000'), Timedelta('0 days 13:13:00.820000'), Timedelta('7 days 21:11:32.533000'), Timedelta('3 days 22:02:03.164000'), Timedelta('4 days 01:00:48.976000'), Timedelta('8 days 00:29:10.014000'), Timedelta('25 days 12:14:17.293000')]


In [0]:
# Temporal features 2: gapK
def getTimeGapsOfPosts(posts, K):
    k_posts = getTask1Posts(posts, K).groupby('OwnerUserId').apply(lambda post : pd.to_timedelta(post['CreationDate'].diff()[1:]).dt.total_seconds())
    print(k_posts)
    k_posts.index = ['gap' + str(i+2) for i in range(len(k_posts.columns))]
#     k_posts.columns = ['gap' + str(i+2) for i in range(len(k_posts.columns))]
    return k_posts.reset_index()


In [0]:
# Temporal features 3: last_gap
def getTimeLastGapOfPosts(posts):
    last_posts = posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId').filter(lambda post : len(post) >= 2).groupby('OwnerUserId').apply(lambda owner : pd.to_timedelta(abs(owner.nlargest(2, 'CreationDate')['CreationDate'].diff()[1:])).dt.total_seconds()).reset_index(level=1, drop=True)
    last_posts = last_posts.rename('last_gap')
    return last_posts


In [0]:
# Temporal features 4: time_since_last_post
def getTimeSinceLastPost(users, posts):
    last_posts = posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId').filter(lambda post : len(post) >= 1).groupby('OwnerUserId', as_index=False).apply(lambda owner : owner.nlargest(1, 'CreationDate'))[['OwnerUserId', 'CreationDate']].reset_index(drop=True)
    users['time_since_last_post'] = users.apply(lambda user : (end_time - last_posts[last_posts['OwnerUserId'] == user['OwnerUserId']]['CreationDate']).dt.total_seconds(), axis = 1)
    
    return users


In [0]:
# Temporal features 5: mean_gap
def getTimeMeanGap(posts):
    mean_gap_posts = posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId').filter(lambda post : len(post) >= 2).groupby('OwnerUserId').apply(lambda post : pd.to_timedelta(post['CreationDate'].diff()[1:]).dt.total_seconds().mean())
    mean_gap_posts = mean_gap_posts.rename('mean_gap')
    return mean_gap_posts


3-2. Frequency features

In [0]:
# Frequency features 1: num_answers
# Frequency features 2: num_questions
def getNumAnswers(posts):
    print(posts.columns)
    num_answers = posts[posts['OwnerUserId'] > 0 & posts['PostTypeId'] == 2].groupby('OwnerUserId').count()
    print(num_answers.count())
    num_answers = num_answers.rename('num_answers')
    return num_answers

def getNumQuestions(posts):
    num_questions = posts[posts['OwnerUserId'] > 0 & posts['PostTypeId'] == 1].groupby('OwnerUserId').count()
    num_questions = num_questions.rename('num_questions')
    return num_questions


In [0]:
# Frequency features 3: ans_ques_ratio
def getAnsQuesRatio(num_answers, num_questions):
    return num_answers / num_questions


In [0]:
# Frequency features 4: num_posts
def getNumPosts(posts):
    posts['num_posts'] = posts['num_answers'] + posts['num_questions']
    return posts


3-3. Knowledge features

In [0]:
# Knowledge features 1: accepted_answerer_rep
def getRepOfAcceptedAnswerer(users, posts):
    posts['accepted_answerer_rep'] = posts[(posts['PostTypeId'] == 1) & posts['AcceptedAnswerId'].notna()].apply(lambda post : users[posts_df[posts_df['Id'] == post['AcceptedAnswerId']]['OwnerUserId']]['Reputation']).mean()
    return posts


In [0]:
# Knowledge features 2: max_rep_answerer 
def getMaxRepAmongAnswerer(users, posts):
  ques = posts[posts['PostTypeId'] == 1]['Id'].tolist()
  posts['accepted_answerer_rep'] = posts_df[posts_df['ParentId'].isin(ques)].groupby('ParentId').apply(lambda post : users[users['Id'] == post['OwnerUserId']]['Reputation']).max().mean()
  return posts


In [0]:
# Knowledge features 3: num_que_answered
def getNumQueAnswered(posts):
    posts['num_que_answered'] = len(posts[posts['AnswerCount'] > 0 & posts['PostTypeId'] == 1])
    return posts


In [0]:
# Knowledge features 4: time_for_first_ans
def getTimeForFirstAns(posts):
    return


In [0]:
# Knowledge features 5: rep_questioner
def getAvgRepOfQuestioner(users, posts):
    posts['rep_questioner'] = posts[(posts['PostTypeId'] == 2)].apply(lambda post : users[posts_df[posts_df['Id'] == post['ParentId']]['OwnerUserId']]['Reputation']).mean()
    return posts


In [0]:
# Knowledge features 6: rep_answerers
def getAvgRepOfAnswerer(users, posts):
    ques = posts[posts['PostTypeId'] == 1]['Id'].tolist()
    posts['rep_answerers'] = posts_df[posts_df['ParentId'].isin(ques)].groupby('ParentId').apply(lambda post : users[users['Id'] == post['OwnerUserId']]['Reputation']).mean()
    return posts


In [0]:
# Knowledge features 7: rep_co_answerers
def getAvgRepOfCoAnswerer(users, posts):
    posts['rep_co_answerers'] = posts[posts['PostTypeId'] == 2].groupby('ParentID').apply(lambda ans : users_df[users_df['Id'] == ans['OwnerUserId']]['Reputation'].mean())
    return


In [0]:
# Knowledge features 8: num_answers_recvd
def getAvgNumAnsReceived(posts):
    posts['num_answers_recvd'] = posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId')[posts['PostTypeId'] == 1]['AnswerCount'].mean()
    return posts


3-4. Speed features

In [0]:
# Speed features 1: answering_speed
def getAnsweringSpeed(posts):
    posts['answering_speed'] = posts[posts['PostTypeId'] == 2].apply(lambda post : post['CreationDate'] - posts_df[posts_df['Id'] == post['ParentId']]['CreationDate'])
    return posts


3-5. Quality features

In [0]:
# Quality features 1: ans_score
# Quality features 2: que_score
def getScoreOfAnswers(posts):
    posts['ans_score'] = posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId')[posts['PostTypeId'] == 2]['Score'].mean()
    return posts

def getScoreOfQuestions(posts):
    posts['que_score'] = posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId')[posts['PostTypeId'] == 1]['Score'].mean()
    return posts


3-6. Consistency features

In [0]:
# Consistency features 1: ans_stddev
# Consistency features 2: que_stddev
def getStdevOfScoresOfAnswers(posts):
    posts['ans_stddev'] = posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId')[posts['PostTypeId'] == 2]['Score'].std()
    return posts

def getStdevOfScoresOfQuestions(posts):
    posts['que_stddev'] = posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId')[posts['PostTypeId'] == 1]['Score'].std()
    return posts


3-7. Gratitude features

In [0]:
# Gratitude features 1: ans_comments
# Gratitude features 2: que_comments
def getAvgNumOfAnswers(posts):
    posts['ans_comments'] = posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId')[posts['PostTypeId'] == 2]['CommentCount'].mean()
    return posts

def getAvgNumOfQuestions(posts):
    posts['que_comments'] = posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId')[posts['PostTypeId'] == 1]['CommentCount'].mean()
    return posts


3-8. Competitiveness features

In [0]:
# Competitiveness features 1: relative_rank_pos
def getRelRankPos(posts):
    return


3-9. Content features

In [0]:
# Content features 1: ans_length
# Content features 2: que_length
def getLengthOfAnswers(posts):
    posts['ans_length'] = posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId')[posts['PostTypeId'] == 2]['Body'].apply(lambda content : len(content)).mean()
    return posts

def getLengthOfQuestions(posts):
    posts['que_length'] = posts[posts['OwnerUserId'] > 0].groupby('OwnerUserId')[posts['PostTypeId'] == 1]['Body'].apply(lambda content : len(content)).mean()
    return posts

3-10. Extract all features for each tasks

In [0]:
def getFeatures(features, users, posts, task, K=None, T=None):
    assert(task in [1,2])
    
    if -1 in features.index:
        features = features.drop([-1])
    
    return features

In [0]:
common_features = ['getTimeGap1OfUser',
                   'getNumAnswers', 'getNumQuestions','getAnsQuesRatio',
                   'getAnsweringSpeed',
                   'getScoreOfAnswers', 'getScoreOfQuestions',
                   'getStdevOfScoresOfAnswers', 'getStdevOfScoresOfQuestions',
                   'getAvgNumOfAnswers', 'getAvgNumOfQuestions',
                   'getRelRankPos',
                   'getLengthOfAnswers', 'getLengthOfQuestions',
                   'getRepOfAcceptedAnswerer', 'getMaxRepAmongAnswerer', 'getNumQueAnswered', 'getTimeForFirstAns', 'getAvgRepOfQuestioner', 'getAvgRepOfAnswerer', 'getAvgRepOfCoAnswerer', 'getAvgNumAnsReceived']
task1_features = common_features + ['getTimeGapsOfPosts']
for K in range(1, 20+1):
    task1_features.append()
    
task2_features = common_features + ['getTimeLastGapOfPosts', 'getTimeSinceLastPost', 'getTimeMeanGap', 'getNumPosts']
for T in [7, 15, 30]:
    task2_features.append()

TypeError: ignored

4. Analyze features


In [0]:
# Figure 2: Gap between posts
#    For a user who churns, gap between consecutive posts keeps increasing. 
#    Gaps for those who stay are much lower, and stabilize around 20,000 minutes,
#      indicating routine posting activity in every ≈2 weeks.

for K in range(2, 21):
    pass

In [0]:
# Figure 3: # Answers vs Churn probability
#    The probability of churning for a user decreases the more answers s/he provides.
#    It is even lower if s/he asks more questions alongside.

for features in task2_features:
    pass

In [0]:
# Figure 4: K vs Time taken for the first answer to arrive
#    The more the time taken for a user to receive an answer, 
#      the lesser the satisfaction level and the more the chances of churning.


5. Train models for each tasks with the features

    1. Decision Tree
    2. SVM (Linear)
    3. SVM (RBF)
    4. Logistic Regression
    

In [0]:
# Table 2: Performance on Task 1

from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

seed = 1234

for i, features in enumerate(task1_features):
    pass

In [0]:
# Table 3: Performance on Task 2

for i, features in enumerate(task2_features):
    pass

6. Draw the graphs in the paper


In [0]:
# Table 4: Temporal Features Analysis

for i, features in enumerate(task1_features):
    pass

In [0]:
# Figure 5: Churn prediction accuracy when features from each category are used in isolation
