In [1]:
import pandas as pd

# Import Data

In [9]:
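# NOTE: the variable names and file names are crossed here. `addiction_pre` is loaded
# from the "post" file (the rows shown below are dated 2020/01/01 to 2020/04/20) and
# `addiction_post` is loaded from the "pre" file. Keep this in mind when reading the
# outputs that follow.
addiction_pre = pd.read_csv('../../data/addiction_post_features_tfidf_256.csv')
addiction_post = pd.read_csv('../../data/addiction_pre_features_tfidf_256.csv')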

In [12]:
addiction_pre.head()

| | subreddit | author | date | post | automated_readability_index | coleman_liau_index | flesch_kincaid_grade_level | flesch_reading_ease | gulpease_index | gunning_fog_index | ... | tfidf_wish | tfidf_without | tfidf_wonder | tfidf_work | tfidf_worri | tfidf_wors | tfidf_would | tfidf_wrong | tfidf_x200b | tfidf_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | addiction | MushroomEagle | 2020/01/01 | Hadn’t even made it a day Just relapsed for th... | -1.57539 | 0.344164 | 0.73342 | 103.405519 | 94.238095 | 3.530736 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.13558 |
| 1 | addiction | karynlackey | 2020/01/01 | I read my way out of addiction and into a mast... | 5.55 | 8.679773 | 6.01 | 69.785 | 72.333333 | 8.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | addiction | throwaway2463684 | 2020/01/01 | I think my friend relapsed I haven't known her... | 8.384206 | 5.061883 | 7.942345 | 81.311765 | 63.814433 | 11.019588 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.039891 | 0.04003 | 0.111614 | 0.0 | 0.0 | 0.075885 |
| 3 | addiction | Christiannan | 2020/01/01 | It’s now been 8 years since I quit opiates, co... | 0.600811 | 3.084659 | 1.647351 | 96.432108 | 90.621622 | 4.041081 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.604834 |
| 4 | addiction | k_thrace_ | 2020/01/01 | First sober NYE in (at least) a decade! Last n... | 6.973831 | 6.105237 | 7.165552 | 77.975471 | 64.714286 | 10.816883 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## Features

The published paper that accompanies this dataset gives an in-depth breakdown of the features: [https://www.jmir.org/2020/10/e22635/](https://www.jmir.org/2020/10/e22635/).  
The extracted feature groups are:
- LIWC (n=62);
- sentiment analysis (n=4); 
- basic word and syllable counts (n=8); 
- punctuation (n=1); 
- readability metrics (n=9); 
- term frequency–inverse document frequency (TF-IDF) ngrams (256-1024) to capture words and phrases that characterize specific posts; 
- manually built lexicons about suicidality (n=1), economic stress (n=1), isolation (n=1), substance use (n=1), domestic stress (n=1), and guns (n=1). 
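
To make the lexicon-based features in the list above more concrete, here is a minimal sketch of how a count such as `substance_use_total` could be computed. The word list, the tokenizer, and the example post are all made up for illustration; they are not the lexicon or the procedure used in the paper.

```python
import re

# Hypothetical mini-lexicon for illustration only; NOT the lexicon from the paper.
TOY_SUBSTANCE_LEXICON = {"cocaine", "opiates", "heroin", "alcohol"}


def lexicon_total(post, lexicon):
    """Count how many tokens of `post` appear in `lexicon` (case-insensitive)."""
    # Assumed tokenizer: lowercase the text and keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", post.lower())
    return sum(1 for token in tokens if token in lexicon)


# Made-up example post, not taken from the dataset.
example_post = "I quit cocaine last year but still drink alcohol. Cocaine cravings come back."
total = lexicon_total(example_post, TOY_SUBSTANCE_LEXICON)
print(total)  # 3
```

Repeated mentions are counted each time in this sketch, which is one plausible reading of a "total" column; the paper may count matches differently.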

*A side note on TF-IDF:*  
The raw documents (posts) have been converted to a matrix of TF-IDF features.
In scikit-learn, `TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`. Here tf means term frequency, and tf-idf means term frequency times inverse document frequency.  
More details are on the [sklearn `TfidfTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) page.
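
As a rough illustration of what "term frequency times inverse document frequency" means, the sketch below computes TF-IDF weights by hand in plain Python. It follows scikit-learn's documented default weighting (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The whitespace tokenizer and the toy posts are assumptions made for this example; this is not the pipeline that produced the `tfidf_*` columns in the dataset.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, matrix) with one L2-normalized TF-IDF row per document."""
    # Assumed tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed idf, as in scikit-learn's default (smooth_idf=True).
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

    matrix = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts for this document
        raw = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(value * value for value in raw)) or 1.0
        matrix.append([value / norm for value in raw])
    return vocab, matrix


# Made-up toy posts, not taken from the dataset.
toy_posts = ["one year sober", "one day sober", "relapsed after one year"]
vocab, matrix = tfidf(toy_posts)

for term, weight in zip(vocab, matrix[0]):
    print(f"{term:>10}: {weight:.3f}")
```

In the first toy post, "one" occurs in every document and so receives a lower weight than "year" or "sober", which is the behaviour the idf factor is meant to produce.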

## High-level analysis

In [17]:
columns_of_interest = ['subreddit', 'author', 'date', 'post', 'substance_use_total']

addiction_pre = addiction_pre.loc[:, columns_of_interest]
addiction_post = addiction_post.loc[:, columns_of_interest]

In [18]:
addiction_pre.head(5)

| | subreddit | author | date | post | substance_use_total |
|---|---|---|---|---|---|
| 0 | addiction | MushroomEagle | 2020/01/01 | Hadn’t even made it a day Just relapsed for th... | 0 |
| 1 | addiction | karynlackey | 2020/01/01 | I read my way out of addiction and into a mast... | 0 |
| 2 | addiction | throwaway2463684 | 2020/01/01 | I think my friend relapsed I haven't known her... | 2 |
| 3 | addiction | Christiannan | 2020/01/01 | It’s now been 8 years since I quit opiates, co... | 0 |
| 4 | addiction | k_thrace_ | 2020/01/01 | First sober NYE in (at least) a decade! Last n... | 0 |


In [19]:
addiction_pre.tail(5)

| | subreddit | author | date | post | substance_use_total |
|---|---|---|---|---|---|
| 1778 | addiction | miniaturepeach | 2020/04/20 | Anything that eases cocaine withdrawal? I’m a ... | 2 |
| 1779 | addiction | QDP-20 | 2020/04/20 | Any statistics about quitting success rates fo... | 5 |
| 1780 | addiction | sunnydaygreyskies | 2020/04/20 | how am i still alive ok so over the course of ... | 0 |
| 1781 | addiction | cerebralanalyst | 2020/04/20 | I need help. Someone in my family fell into a ... | 0 |
| 1782 | addiction | losmo121 | 2020/04/20 | How can I help/encourage without enabling? Hi ... | 1 |


In [20]:
print(f'Total number of records in this dataset: {len(addiction_pre)}')
addiction_pre.describe()

Total number of records in this dataset: 1783


| | substance_use_total |
|---|---|
| count | 1783.0 |
| mean | 2.148626 |
| std | 3.162421 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 1.0 |
| 75% | 3.0 |
| max | 40.0 |
