In this IPython notebook, we connect to the Reddit API and develop a set of methods that determine the rate of delta awarding and the use of evidence in comments for each discussion. This notebook collected and analyzed the data for the paper: Priniski, H. J., & Horne, Z. (Under review). Attitude Change on Reddit's Change My View.

In [8]:
import os
import praw
import pandas as pd
import datetime as dt
import warnings
warnings.filterwarnings('ignore')

In [10]:
reddit = praw.Reddit(client_id='',
                     client_secret='',
                     password='',
                     user_agent='',
                     username='')


subreddit = reddit.subreddit('changemyview')

Version 5.1.0 of praw is outdated. Version 5.3.0 was released Sunday December 17, 2017.


### getting deltas and delta awarded comments
The block below defines functions that extract delta-awarded comments and the discussion threads that lead to them. This is done with a breadth-first search through the discussion tree returned by the Reddit API. To return a delta thread, the tree is traversed upward from the delta comment until the root of the thread is found.

In [21]:
def has_delta(body):
    if 'confirmed: 1 delta awarded to' in body.lower():
        return True
    return False

def set_value(value, value_list):
    value_list.append(value)
    return

def get_thread(comment, root, thread):

    if comment.parent() is root:
        return set_value(comment.body, thread)
    get_thread(comment.parent(), root, thread)
    set_value(comment.body, thread)

def get_delta_thread(post):
    post.comments.replace_more(limit = 0)
    queue = post.comments[:]
    threads = []
    root = post
    while queue:
        comment = queue.pop(0)
        if has_delta(comment.body):
            thread = []
            get_thread(comment, root, thread)
            threads.append(thread)
        queue.extend(comment.replies)
    
    return threads

def get_delta_count(comments):
    count = 0
    for comment in comments:
        if has_delta(comment):
            count += 1
    return count
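
The traversal described above can be tried without a Reddit connection. The following is a minimal sketch on a toy tree, not the notebook's own code: `Node` is a hypothetical stand-in for a PRAW comment, top-level nodes use `None` as their parent (instead of the submission object), and `collections.deque` replaces the list-based queue so that removing the first element stays cheap.

```python
from collections import deque


class Node:
    """Hypothetical stand-in for a PRAW comment: a body, a parent, and replies."""

    def __init__(self, body, parent=None):
        self.body = body
        self.parent = parent
        self.replies = []
        if parent is not None:
            parent.replies.append(self)


def has_delta(body):
    # Same marker text that the notebook looks for.
    return 'confirmed: 1 delta awarded to' in body.lower()


def path_from_root(node):
    """Walk parent links upward, then reverse so the top-level comment comes first."""
    path = []
    while node is not None:
        path.append(node.body)
        node = node.parent
    return path[::-1]


def delta_threads(top_level):
    """Breadth-first search over the forest of top-level comments."""
    queue = deque(top_level)
    threads = []
    while queue:
        node = queue.popleft()
        if has_delta(node.body):
            threads.append(path_from_root(node))
        queue.extend(node.replies)
    return threads


# Toy discussion: one top-level argument that earns a delta, one that does not.
a = Node("Here is a counterargument.")
b = Node("Good point, delta!", parent=a)
c = Node("Confirmed: 1 delta awarded to /u/example", parent=b)
d = Node("I disagree with the post.")

threads = delta_threads([a, d])
```

Each thread lists the comment bodies from the top-level comment down to the confirmation message, which is why the first element of a thread can be treated as the start of the delta-awarded exchange.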

### analyze_data executes the majority of this notebook's functionality
Not all of the variables we wish to explore are returned by the Reddit API. Therefore, we have to implement our own functions to find the number of deltas awarded in each discussion, extract the threads of discussion that lead to a delta, count the links used in a discussion, etc. The block of code below executes this functionality. Since this function isn't very modular or easy to follow, I describe its steps here.

We will pass this function a post object.  It is a rich data structure returned by the Reddit API. 
```python
def analyze_data(post):
```

The following variables will hold the information:
- `comments`: a flat list of the lowercased comment bodies in the post's discussion (the tree structure is not preserved)
- `threads`: a list of the threads of conversation that lead to a delta being awarded
- `DACs`: a dictionary mapping each delta-awarded comment (DAC) to 1 if it contains a link and 0 otherwise
- `links`: a list with one entry per comment, each entry being the list of links found in that comment (`cnt_links(links)` computes the totals)
- `stat_lang_general`: the number of comments in the discussion that use statistically oriented language
- `stat_lang_deltas`: a dictionary keyed by the delta-awarded comments that use statistically oriented language (its length gives their count)


```python
    comments = []
    threads = []
    DACs = dict()
    links = []
    #number of posts in the discussion using statistical language
    stat_lang_general = 0
    #number of DACs in the discussion using statistical language
    stat_lang_deltas = dict()
    
```

We must treat the comments object returned by the Reddit API as a tree, so a conventional tree traversal is used to visit every comment. To do this, we keep the comments still to be visited in a queue.

```python
    com = post.comments.replace_more(limit = 0)
    com_tree = post.comments[:]
    root = post
    while com_tree:
        comment = com_tree.pop(0)
```
We check for links, use of statistical language, and whether the comment contains a delta as soon as we first encounter a new comment.
```python
        links.append([w for w in comment.body.lower().split() if has_link(w)])
        
        if has_stat_language(comment.body.lower()):
            stat_lang_general += 1
        comments.append(comment.body.lower())
        
        if has_delta(comment.body):
                
            thread = []
            get_thread(comment, root, thread)
            threads.append(thread)
            
            #get the root comment, the DAC by indexing the first element in the list
            if has_stat_language(thread[0]):
                stat_lang_deltas[thread[0]] =1

            if has_link(thread[0]):
                DACs[thread[0]] = 1
            else:
                DACs[thread[0]] = 0
                  
        com_tree.extend(comment.replies)
```
We then return the variables listed above.
```python
    return (comments, threads, DACs, links, stat_lang_general, stat_lang_deltas)

```




In [48]:
def analyze_data(post):
    #post = reddit.submission(url = url)

    comments = []
    threads = []
    DACs = dict()
    links = []
    
    #number of posts in the discussion using statistical language
    stat_lang_general = 0
    
    #number of DACs in the discussion using statistical language
    stat_lang_deltas = dict()
    
    com = post.comments.replace_more(limit = 0)
    com_tree = post.comments[:]
    root = post
    while com_tree:
        comment = com_tree.pop(0)

        #check for links here
        links.append([w for w in comment.body.lower().split() if has_link(w)])
        
        if has_stat_language(comment.body.lower()):
            stat_lang_general += 1
        comments.append(comment.body.lower())
        
        if has_delta(comment.body):
                
            thread = []
            get_thread(comment, root, thread)
            threads.append(thread)
            
            #get the root comment, the DAC by indexing the first element in the list
            if has_stat_language(thread[0]):
                stat_lang_deltas[thread[0]] =1
            
            if has_link(thread[0]):
                DACs[thread[0]] = 1
            else:
                DACs[thread[0]] = 0
    
                    
        com_tree.extend(comment.replies)
    return (comments, threads, DACs, links, stat_lang_general, stat_lang_deltas)

We are interested in the statistically oriented language in each post. These functions check whether a comment contains digits or commonly used statistical terms.

In [50]:
def has_dig(w):
    return any(ch.isdigit() for ch in w)

def has_stat_language(w):
    terms = ['data', 'stats', 'statistics', 'figures', '%', 'percent', 'average']
    # substring match on any term, or any digit; returns a bool in every case
    return any(t in w for t in terms) or has_dig(w)
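
Because the check above matches substrings, it also flags words that merely contain a term. The sketch below contrasts that rule with a whole-word variant. `has_stat_language_tokens` is a hypothetical alternative for illustration, not the method used in the analysis, and both functions assume lowercased input, as the caller in `analyze_data` provides.

```python
import re

STAT_TERMS = {'data', 'stats', 'statistics', 'figures', 'percent', 'average'}


def has_digit(text):
    return any(ch.isdigit() for ch in text)


def has_stat_language_substring(text):
    """Substring rule as in the notebook: any term, a '%' sign, or any digit."""
    terms = list(STAT_TERMS) + ['%']
    return any(t in text for t in terms) or has_digit(text)


def has_stat_language_tokens(text):
    """Hypothetical stricter variant: terms must appear as whole words."""
    words = set(re.findall(r"[a-z]+", text))
    return bool(words & STAT_TERMS) or '%' in text or has_digit(text)


examples = [
    "the data show a 20% drop",
    "she figures it will rain",   # 'figures' used as a verb
    "we need a better database",  # 'data' only as part of another word
    "i just disagree",
]

results = [
    (has_stat_language_substring(e), has_stat_language_tokens(e))
    for e in examples
]
```

The two rules disagree only on the third example. The second example is flagged by both, which shows that neither rule separates statistical from everyday uses of a word.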

We are also interested in how frequently people cite external data. The functions below do two things: calculate the total number of links used in a discussion, and calculate the number of comments that use a link. Additionally, we save which links are used so we can analyze the types of data repliers like to incorporate into their replies.

In [52]:
def has_link(w):
    extensions = ['http://', '.com', '.org', '.gov', 'pdf', '.net', 'www.']
    # substring match on the markers above, excluding links back to reddit.com;
    # returns a bool in every case
    return any(e in w for e in extensions) and 'reddit.com' not in w

def cnt_links(links):
    com_links = 0
    tot_links = 0
    links_list = []
    for l in links:
        if not (l == []):
            post_links = []
            for _ in l:
                post_links.append(_)
                tot_links +=1
            links_list.append(post_links)
            com_links += 1
    return com_links, tot_links, links_list
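
To make the shape of the input and the three return values concrete, here is a small worked example. `count_links` is a hypothetical compact rewrite intended to be equivalent to `cnt_links`, and the URLs are made-up placeholders.

```python
def count_links(links_per_comment):
    """Return (comments with at least one link, total links, non-empty link lists)."""
    non_empty = [list(links) for links in links_per_comment if links]
    comments_with_links = len(non_empty)
    total_links = sum(len(links) for links in non_empty)
    return comments_with_links, total_links, non_empty


# One inner list per comment, in traversal order; empty lists are comments without links.
links = [
    [],
    ['http://example.org/report', 'www.example.com'],
    [],
    ['example.gov/stats.pdf'],
]

com_links, tot_links, links_list = count_links(links)
```

Here two of the four comments contain links and three links are used in total, so the first two return values differ whenever a comment cites more than one source.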

When we collect posts and run the analyze_data function on each submission, a directory will be created to cleanly save all the relevant data in its proper subdirectories.

To signify when the data was collected, the `date` variable is set to today's month and day.

We are also collecting top Reddit posts. If you are interested in collecting a different type of post, e.g. 'hot' posts, change the variable `posts` to `'hot'` and call `subreddit.hot` instead of `subreddit.top` in the collection loop.

In [None]:
now = dt.datetime.now()

date = str(now.month) + '-' + str(now.day)
posts = 'top'
directory = 'data/'+date+'/delta_threads/'

if not os.path.exists(directory):
    os.makedirs(directory)
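
A variant of the directory setup above is sketched here, under two assumptions that differ from the notebook: the date label includes the year in ISO format (the notebook uses month and day only), and the helper name `make_run_directory` is hypothetical. `os.makedirs(..., exist_ok=True)` removes the need for the separate existence check.

```python
import datetime as dt
import os
import tempfile


def make_run_directory(base_dir, when):
    """Create <base_dir>/<YYYY-MM-DD>/delta_threads/ and return (date_label, path)."""
    date_label = when.strftime('%Y-%m-%d')
    path = os.path.join(base_dir, date_label, 'delta_threads')
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return date_label, path


# A temporary directory keeps the example from touching real data folders.
base = tempfile.mkdtemp()
label, path = make_run_directory(base, dt.datetime(2018, 1, 5))

# Calling it again for the same date is safe and returns the same path.
label_again, path_again = make_run_directory(base, dt.datetime(2018, 1, 5))
```

Including the year keeps runs from different years from writing into the same folder, and zero-padded dates sort correctly in a file browser.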

In this block, you specify how many posts you want to collect. Then a directory will be created with the following information:

The `delta_threads` directory holds the delta threads found in each discussion, one file per post. You can find the post text and title of each discussion (along with the number of deltas awarded, number of comments, number of links used, etc.) in the synopsis spreadsheet, which the code at the end of this notebook saves as `posts_date_CMV.xlsx` (for example `top_date_CMV.xlsx`). Each delta thread file is named in the following order: `post-number_date-of-collection_delta_thread.xlsx`



`date
|----delta_threads
|    |
|    |-----0_date_delta_thread.xlsx, 1_date_delta_thread.xlsx, ...
|
|----top_date_CMV.xlsx`



In [9]:
df = []

#post count will identify the post/submission number.  it is useful for comparing data in the synopsis sheet 
# to the data in the delta thread directories
post_cnt = 0

NO_OF_POSTS = 525

for submission in subreddit.top(limit=NO_OF_POSTS):
    
    post_time = submission.created
    title = submission.title
    selftext = submission.selftext.replace(',',' ')
    comments, threads, DACs, links, stat_lang_general, stat_lang_deltas = analyze_data(submission)
    #you can save comments to a csv if you desire.  they are returned as a compressed list and don't have their 
    #tree encoding
    delta_count = get_delta_count(comments)
    x = sum(DACs.values())
    df.append([title,selftext, len(comments), len(DACs.keys()), delta_count, cnt_links(links)[0], cnt_links(links)[1],cnt_links(links)[2],x,stat_lang_general, len(stat_lang_deltas.keys()), post_time])
    
    threads_DF = pd.DataFrame(threads) 
    threads_DF.to_excel(directory+str(post_cnt)+'_'+date+'_delta_thread.xlsx', header = False)
    post_cnt += 1 


### A more detailed description of the steps from the convoluted block of code above follows here:

For convenience, we will save our data to a list-of-lists data structure and later convert this to a pandas dataframe. Here we also determine how many submissions we wish to collect. In our case it's `525`.
```python
df = []
post_cnt = 0
NO_OF_POSTS = 525
```

To query the Reddit API, we use the `subreddit` object created above and pass its `top` method the number of posts we wish to collect.

```python
for submission in subreddit.top(limit=NO_OF_POSTS):
```

A `submission` variable is bound to each post in turn, and we iterate through at most `NO_OF_POSTS` posts.

The information we can collect directly from the Reddit API is

```python
    post_time = submission.created
    title = submission.title
    selftext = submission.selftext.replace(',',' ')
```
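
The `created` value is a Unix timestamp in seconds. If a readable time is wanted in the synopsis sheet, it can be converted as sketched below. The helper name `to_utc_datetime` and the sample value are illustrative assumptions. PRAW submissions also expose `created_utc`, which may be the safer attribute to pass here because it is explicitly in UTC.

```python
import datetime as dt


def to_utc_datetime(epoch_seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(epoch_seconds, tz=dt.timezone.utc)


post_time = 1513468800.0  # illustrative value, not taken from the collected data
readable = to_utc_datetime(post_time).isoformat()
```

Storing the ISO string (or the datetime itself) alongside the raw timestamp keeps the spreadsheet readable without losing the original value.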

The rest of the variables we are interested in analyzing, such as the number of deltas, the delta threads, and the use of links and statistically oriented language, need to be calculated locally. We do this by calling the `analyze_data` function implemented above.
```python
    comments, threads, DACs, links, stat_lang_general, stat_lang_deltas = analyze_data(submission)
    # Comments can be saved to a CSV here if desired. They are returned as a flat list without their tree encoding.
    delta_count = get_delta_count(comments)
    x = sum(DACs.values())
    
```
We will now append all of the relevant information to our list of lists `df`. This data will be saved to the synopsis file.

```python
    df.append([title,selftext, len(comments), len(DACs.keys()), delta_count, cnt_links(links)[0], cnt_links(links)[1],cnt_links(links)[2],x,stat_lang_general, len(stat_lang_deltas.keys()), post_time])
```

Each post's delta threads are saved to their own file in the `delta_threads` directory. We specify the file path here.

```python
    threads_DF = pd.DataFrame(threads) 
    threads_DF.to_excel(directory+str(post_cnt)+'_'+date+'_delta_thread.xlsx', header = False)
```


And of course, we increment our post count.
```python
    post_cnt += 1
```
We will now convert the data to a pandas dataframe and save it to an Excel file.
```python
DF = pd.DataFrame(df,columns = ['title','selftext','No. of Comments', 'No. of DACs', 'No. of Deltas', 'No. of Com. with Links','Total Links','Links', 'No. of DAC with Links','No. of Com. using stat. lang.','No. of DACs using stat. lang.','Post Time'])
DF.to_excel('data/'+date+'/'+posts+'_'+date+'_CMV.xlsx')
```