## This notebook goes over how I derived cosine similarities between user input and reddit posts.

**Notably, I used the reddit API wrapper `praw` to scrape posts from reddit, and did a little NLP on those posts.**

In [1]:
import os
import requests
import operator
import string
import re
import nltk
import numpy as np
import praw
import pandas as pd

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
#-*- coding: utf-8 -*-

**Create a function to use the reddit api, `praw`.**

In [2]:
def reddit_api(id, secret_id, user_agent, username, password):
    reddit = praw.Reddit(client_id=id,
                         client_secret=secret_id,
                         user_agent=user_agent,
                         username=username,
                         password=password)
    return reddit

reddit = reddit_api('id', 
                    'secret_id', 
                    'Mental Health Script by /u/tatinthehat',
                    'tatinthehat',
                    'password')

**Create a set of empty lists to populate later.**

In [3]:
self_text = []
url = []
title = []
comment_num = []

**Crawl a subreddit (here, /r/anxiety) for each submission's title, URL, selftext, and number of comments, and append them to the appropriate list. Submissions with no selftext are skipped.**

In [4]:
def crawl_subreddit(subreddit, limit):
    # Walk the subreddit's hot listing and keep only posts that have selftext
    for submission in reddit.subreddit(subreddit).hot(limit=limit):
        if submission.selftext != '':
            title.append(submission.title)
            url.append(submission.url)
            self_text.append(submission.selftext)
            comment_num.append(submission.num_comments)

crawl_subreddit('anxiety', 50)

**Create a function to generate a set of custom stop words.**

In [5]:
def custom_stop():
    custom_stop = stopwords.words('english')
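    # Drop three stop words by position so they stay in the text.
    # This is fragile: the positions depend on the installed NLTK stop word list.
    del custom_stop[109:112]
    custom_stop = set(custom_stop)
    # Extra tokens to discard: punctuation, contraction fragments, and reddit placeholders
    etc_stop = set(('\'ve', '[', ']', '\\[\\]', '\'s', '\'m', 'n\'t', '``', '\\n', '.', '\\.', '...', '-', '\'\'', '(', ')', 'm', 's', 've', ',', ':', '*', '@', '!', '$', '%', '&', '?', '\'', '\"', '\"m', '\"n\'t\"', ' ','removed', 'deleted', '[]','0', 'te'))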
    stop_words = custom_stop.union(etc_stop)
    return stop_words

stop_words = custom_stop()
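
To make the effect of the stop-word set concrete, here is a minimal sketch of the filtering step that is applied to the posts later. It uses a small hand-written set (`demo_stop_words`, a hypothetical stand-in for the NLTK list plus the custom additions) and a hypothetical helper `remove_stop_words`, so it runs without downloading any NLTK data.

```python
# Hypothetical miniature stop set; the real one comes from custom_stop().
demo_stop_words = {'i', 'of', 'you', "'m", ',', '.', 'the', 'to'}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop set (case-sensitive)."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ['As', 'I', "'m", 'sure', 'many', 'of', 'you', 'know', '.']
filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
```

Because the membership test is case-sensitive and the stop words are lowercase, capitalized tokens such as `As` and `I` survive the filter, which matches what the `tokenized_selftext` column shows further down. Lowercasing the tokens first would remove them as well.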

**Use the lists we populated to create a data frame, then tokenize the text and remove stop words.**

In [6]:
posts = pd.DataFrame({'title': title, 'url': url, 'selftext': self_text, 'number': comment_num })
def df_processing(df):
    df['tokenized_selftext'] = df.apply(lambda row: nltk.word_tokenize(row['selftext']), axis=1)
    df['tokenized_selftext'] = df['tokenized_selftext'].apply(lambda x: [item for item in x if item not in stop_words])
    # Despite its name, this column holds the unfiltered tokens; no stemming is applied here
    df['stemmed_selftext'] = df.apply(lambda row: nltk.word_tokenize(row['selftext']), axis=1)
    return df

posts = df_processing(posts)
posts = posts.reset_index(drop = True)

In [7]:
posts.head(5)

Unnamed: 0,number,selftext,title,url,tokenized_selftext,stemmed_selftext
0,43,As I'm sure many of you already know first-han...,Weekly Success Thread: Share your victories la...,https://www.reddit.com/r/Anxiety/comments/5s0g...,"[As, I, sure, many, already, know, first-hand,...","[As, I, 'm, sure, many, of, you, already, know..."
1,33,Greetings & Salutations!\n\nUse this post to i...,Welcoming Newcomers & Free Talk Thread - Febru...,https://www.reddit.com/r/Anxiety/comments/5sd5...,"[Greetings, Salutations, Use, post, introduce,...","[Greetings, &, Salutations, !, Use, this, post..."
2,29,This happened to me last night. I ended up get...,Does anybody get anxious and cannot pinpoint a...,https://www.reddit.com/r/Anxiety/comments/5syv...,"[This, happened, last, night, I, ended, gettin...","[This, happened, to, me, last, night, ., I, en..."
3,2,"Today, I got a call to an interview for a job,...",I turned down a job-trial.,https://www.reddit.com/r/Anxiety/comments/5szn...,"[Today, I, got, call, interview, job, I, got, ...","[Today, ,, I, got, a, call, to, an, interview,..."
4,6,My boyfriend and I have been together for two ...,Anxiety ended my relationship,https://www.reddit.com/r/Anxiety/comments/5t0j...,"[My, boyfriend, I, together, two, years, part,...","[My, boyfriend, and, I, have, been, together, ..."


**Mimic user input similar to how it is done in the web app.**

In [8]:
def input_text(text):
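    # Drop punctuation characters; this form works on both Python 2 and Python 3 strings
    text = ''.join(ch for ch in text if ch not in string.punctuation)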
    text= nltk.word_tokenize(text)
    return text

In [9]:
text = 'Here\'s an example of user input that the app would take in. The app strips out all punctuation, tokenizes it, and evaluates it for length.'
input = input_text(text)
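
As a rough illustration of what this cleaning step does to a sentence, the sketch below strips punctuation and then splits on whitespace. The helper name `clean_input` is hypothetical, and plain `str.split` stands in for `nltk.word_tokenize` so the example needs only the standard library.

```python
import string

def clean_input(text):
    """Strip ASCII punctuation, then split on whitespace."""
    no_punct = ''.join(ch for ch in text if ch not in string.punctuation)
    return no_punct.split()

sample = "Here's an example, with punctuation!"
cleaned = clean_input(sample)
print(cleaned)
```

Note that removing the apostrophe merges `Here's` into the single token `Heres` rather than splitting it into `Here` and `'s`.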

**Create a function that computes the cosine similarity between two texts.**

In [10]:
def cosine_sim(text1, text2):
    vectorizer = TfidfVectorizer(analyzer = 'word', max_features = 75)
    tfidf = vectorizer.fit_transform([text1, text2])
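    # TfidfVectorizer L2-normalizes each row by default, so the dot product
    # of the two rows is their cosine similarity.
    # .toarray() is the explicit form of the .A shortcut on sparse matrices.
    return tfidf.dot(tfidf.T).toarray()[0, 1]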
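
For intuition about the number this function returns, the sketch below computes cosine similarity by hand: the dot product of two term vectors divided by the product of their lengths. It is a simplification, not the function above: it uses raw term counts instead of TF-IDF weights, and the helper name `cosine_from_counts` is hypothetical.

```python
import math
from collections import Counter

def cosine_from_counts(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

same = cosine_from_counts(['panic', 'attack'], ['panic', 'attack'])
half = cosine_from_counts(['panic', 'attack'], ['panic', 'work'])
disjoint = cosine_from_counts(['panic'], ['work'])
print(same, half, disjoint)
```

Identical texts score approximately 1, texts sharing one of two words score approximately 0.5, and texts with no words in common score 0. TF-IDF weighting changes the individual term weights but not this overall behavior.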

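**Create an empty list to store similarity values, then loop over the reddit posts, compare the user input to each one, and append the score to the list.** Similarities are cosine similarities multiplied by 100 to express them as percentages. A score of 100% means the two TF-IDF vectors point in the same direction (for example, when both texts use the same words in the same proportions), and 0% means they share none of the terms the vectorizer kept.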

In [11]:
similarity = []

for i in range(len(posts)):
    x = str(input)
    y = str(posts['tokenized_selftext'][i])
    z = cosine_sim(x,y)
    z = z * 100
    similarity.append(z)
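
The scoring loop and the descending sort that follows it can be sketched without pandas. The raw scores below are made-up values for illustration only, paired with post titles, and the variable names `scores` and `ranked` are hypothetical.

```python
# Made-up (title, raw cosine score) pairs standing in for the posts data frame.
scores = [
    ('Help!', 0.0994),
    ('Struggling...', 0.0702),
    ('Your struggles are legitimate', 0.1),
]

# Convert each score to a percentage rounded to one decimal, then sort highest first.
ranked = sorted(
    ((title, round(score * 100, 1)) for title, score in scores),
    key=lambda pair: pair[1],
    reverse=True,
)

for title, percent in ranked:
    print(title, percent)
```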


**The next part attaches the similarities to the posts data frame and sorts it in descending order of similarity.**

In [12]:
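def sort_posts(df, column, similarities):
    df = df.reset_index(drop = True)
    df[column] = similarities
    df[column] = df[column].round(decimals = 1)
    # Sort the frame that was passed in (not the global) and return the result
    return df.sort_values([column], ascending = False)

posts = sort_posts(posts, 'similarity', similarity)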

In [13]:
posts.head(10)

Unnamed: 0,number,selftext,title,url,tokenized_selftext,stemmed_selftext,similarity
22,9,I wanted to talk a bit about my experience wit...,My Experience with Lexapro/Escitalopram,https://www.reddit.com/r/Anxiety/comments/5sy4...,"[I, wanted, talk, bit, experience, lexapro, I,...","[I, wanted, to, talk, a, bit, about, my, exper...",12.3
45,4,"In my own personal experience, it is highly im...",Your struggles are legitimate,https://www.reddit.com/r/Anxiety/comments/5sut...,"[In, personal, experience, highly, important, ...","[In, my, own, personal, experience, ,, it, is,...",10.0
25,0,So the story starts out in middle school 6th g...,Help!,https://www.reddit.com/r/Anxiety/comments/5szv...,"[So, story, starts, middle, school, 6th, grade...","[So, the, story, starts, out, in, middle, scho...",9.9
46,0,I don't know if this belongs here but I'm just...,Want to be happy but struggling,https://www.reddit.com/r/Anxiety/comments/5sxt...,"[I, know, belongs, I, lost, right, Im, 18, yea...","[I, do, n't, know, if, this, belongs, here, bu...",8.2
32,1,"I know therapy has its ups and downs, but I fe...",The never ending battle,https://www.reddit.com/r/Anxiety/comments/5sy6...,"[I, know, therapy, ups, downs, I, feel, like, ...","[I, know, therapy, has, its, ups, and, downs, ...",8.1
20,0,"I can't function as a human being, so I don't ...",Struggling...,https://www.reddit.com/r/Anxiety/comments/5t0d...,"[I, ca, function, human, I, know, talk, Please...","[I, ca, n't, function, as, a, human, being, ,,...",7.0
31,1,(I'm not sure if this post belongs here as it'...,relationships + Anxiety,https://www.reddit.com/r/Anxiety/comments/5szh...,"[I, not, sure, post, belongs, relationship, st...","[(, I, 'm, not, sure, if, this, post, belongs,...",6.7
16,13,The company I work for is... kinda bad on the ...,I wish my mental problems were a legitimate re...,https://www.reddit.com/r/Anxiety/comments/5svh...,"[The, company, I, work, kinda, bad, human, res...","[The, company, I, work, for, is, ..., kinda, b...",6.6
26,2,"Had a strong panic attack last week, the stron...",Just meditated successfully for the first time...,https://www.reddit.com/r/Anxiety/comments/5sv3...,"[Had, strong, panic, attack, last, week, stron...","[Had, a, strong, panic, attack, last, week, ,,...",5.8
1,33,Greetings & Salutations!\n\nUse this post to i...,Welcoming Newcomers & Free Talk Thread - Febru...,https://www.reddit.com/r/Anxiety/comments/5sd5...,"[Greetings, Salutations, Use, post, introduce,...","[Greetings, &, Salutations, !, Use, this, post...",5.5


**The next few lines of code create a list of dictionaries, one per post, from the sorted posts above.** This list is used in conjunction with `Flask` and is rendered with `jinja` template code on the appropriate webpage.

In [14]:
posts_list = []
for i in range(len(posts)):
    posts_list.append(dict(title = posts.iloc[i]['title'],
                           url = posts.iloc[i]['url'],
                           number = posts.iloc[i]['number'],
                           similarity = posts.iloc[i]['similarity']))

In [15]:
posts_list[0:5]

[{'number': 9,
  'similarity': 12.300000000000001,
  'title': u'My Experience with Lexapro/Escitalopram',
  'url': u'https://www.reddit.com/r/Anxiety/comments/5sy4ll/my_experience_with_lexaproescitalopram/'},
 {'number': 4,
  'similarity': 10.0,
  'title': u'Your struggles are legitimate',
  'url': u'https://www.reddit.com/r/Anxiety/comments/5sut6k/your_struggles_are_legitimate/'},
 {'number': 0,
  'similarity': 9.9000000000000004,
  'title': u'Help!',
  'url': u'https://www.reddit.com/r/Anxiety/comments/5szvt6/help/'},
 {'number': 0,
  'similarity': 8.1999999999999993,
  'title': u'Want to be happy but struggling',
  'url': u'https://www.reddit.com/r/Anxiety/comments/5sxt55/want_to_be_happy_but_struggling/'},
 {'number': 1,
  'similarity': 8.0999999999999996,
  'title': u'The never ending battle',
  'url': u'https://www.reddit.com/r/Anxiety/comments/5sy6g2/the_never_ending_battle/'}]