# NLP Project
## Step 1: Defining the Problem
Many people who are diagnosed with one neurological condition or mental illness discover at some point that they also have one or more other conditions. For example, many people have both anxiety and depression at the same time. Some of the symptoms of autism and ADHD are so similar that people may not be able to tell which one they have, or if they have both. 

This overlap makes reaching a correct diagnosis confusing and messy, which has implications for prognosis, treatment, medication, and accommodations at school or work. Can analyzing other people's posts about their own experiences, thoughts, and feelings in relation to their condition help individuals understand their own minds better?

### Problem Statement

The following models aim to classify posts from subreddits for several overlapping mental health conditions, with the goal of helping people understand their brains (or their loved ones’ brains) better. These models could point patients in the right direction when seeking professional help, or could even be used by clinicians to supplement evaluation. 

## Step 2: Data Collection

I chose four subreddits, each associated with a well-known psychological condition:
- [ADHD](https://www.reddit.com/r/ADHD/)
- [Anxiety](https://www.reddit.com/r/Anxiety/)
- [Depression](https://www.reddit.com/r/depression/)
- [Aspergers](https://www.reddit.com/r/aspergers/)

Using the Pushshift API, a third-party archive of Reddit data, I wrote a function that scrapes a specified number of batches of a specified size. To avoid returning the same posts repeatedly, I used each post's creation time (`created_utc`, a Unix timestamp) to restrict each request to posts written before the earliest post in the previous batch.

Then I saved the posts from each subreddit as a separate dataframe, to be cleaned in the next step. 
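
The batching idea described above can be tried without any network access. The following is a minimal sketch, not the scraper itself: `fetch_before` is a hypothetical stand-in for the API call, `collect` is a hypothetical helper, and the ten fake posts exist only for illustration.

```python
def fetch_before(posts, before, size):
    """Hypothetical stand-in for the API: newest-first posts strictly older than `before`."""
    older = [p for p in posts if p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[:size]


def collect(posts, size, call_count, before_utc):
    """Page backwards in time, moving the cursor to the earliest timestamp seen so far."""
    collected = []
    cursor = before_utc
    for _ in range(call_count):
        batch = fetch_before(posts, cursor, size)
        if not batch:
            # Nothing older is left, so stop early
            break
        collected.extend(batch)
        # The next request only asks for posts older than this batch's earliest post
        cursor = min(p["created_utc"] for p in batch)
    return collected


# Ten fake posts with timestamps 1 through 10 (illustrative data only)
fake_posts = [{"title": f"post {t}", "created_utc": t} for t in range(1, 11)]
result = collect(fake_posts, size=3, call_count=3, before_utc=11)
timestamps = [p["created_utc"] for p in result]
print(timestamps)  # [10, 9, 8, 7, 6, 5, 4, 3, 2]
```

Because the cursor always moves to the earliest timestamp in the batch, no post is returned twice. In this sketch the comparison is strict, so posts that share the exact timestamp of a batch boundary would be skipped.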

#### Notes and disclaimers:
*Aspergers is not a recognized diagnosis anymore, but many people still identify with the term. The Aspergers subreddit is about the same size as the Autism subreddit but it has more text-based posts, so it was a more appropriate choice for this project.*

*This is a project for a class, not a replacement for an actual, certified mental health professional! If you've stumbled across this tool and have concerns about your mental well-being, please talk to your doctor.*

In [1]:
import pandas as pd
import requests
import numpy as np
import time

In [2]:
#https://www.shanelynn.ie/pandas-iloc-loc-select-rows-and-columns-dataframe/
#https://stackoverflow.com/questions/9539921/how-do-i-create-a-python-function-with-optional-arguments

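def get_reddit(subreddit, size, call_count, before_utc):
    """Collect up to `call_count` batches of `size` posts older than `before_utc`."""
    url = 'https://api.pushshift.io/reddit/search/submission'
    columns = ['subreddit', 'title', 'selftext', 'created_utc']
    batches = []
    utc = before_utc
    for _ in range(call_count):
        params = {'subreddit': subreddit, 'size': size, 'before': utc}
        res = requests.get(url, params=params)
        if res.status_code != 200:
            # Raise instead of returning a string, so callers never mistake an error for a DataFrame
            raise RuntimeError(f'Unexpected status code: {res.status_code}')
        posts = res.json()['data']
        if not posts:
            # No older posts are available, so stop early
            break
        minidf = pd.DataFrame(posts)[columns]
        # Move the cursor to the earliest post in this batch to avoid repeats
        utc = int(minidf['created_utc'].min())
        batches.append(minidf)
        # Pause between requests to be polite to the API
        time.sleep(1)
    # DataFrame.append was removed in pandas 2.0; concatenate the batches once instead
    return pd.concat(batches) if batches else pd.DataFrame(columns=columns)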

In [3]:
anxiety_df = get_reddit('Anxiety',100, 50, 1635429621)

In [4]:
adhd_df = get_reddit('ADHD',100, 50, 1635429621)

In [5]:
depression_df = get_reddit('Depression',100, 50, 1635429621)

In [6]:
autism_df = get_reddit('Aspergers',100, 50, 1635429621)

In [7]:
anxiety_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 99
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   subreddit    2000 non-null   object
 1   title        2000 non-null   object
 2   selftext     2000 non-null   object
 3   created_utc  2000 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 78.1+ KB


In [8]:
adhd_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1999 entries, 0 to 99
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   subreddit    1999 non-null   object
 1   title        1999 non-null   object
 2   selftext     1999 non-null   object
 3   created_utc  1999 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 78.1+ KB


In [9]:
depression_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 99
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   subreddit    2000 non-null   object
 1   title        2000 non-null   object
 2   selftext     1999 non-null   object
 3   created_utc  2000 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 78.1+ KB


In [10]:
autism_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 99
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   subreddit    2000 non-null   object
 1   title        2000 non-null   object
 2   selftext     1993 non-null   object
 3   created_utc  2000 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 78.1+ KB


In [11]:
anxiety_df.to_csv('anxiety.csv', index=False)
adhd_df.to_csv('adhd.csv', index=False)
depression_df.to_csv('depression.csv', index=False)
autism_df.to_csv('autism.csv', index=False)
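
The `info()` output above shows a few missing `selftext` values, so a quick summary of each dataframe may be useful before cleaning. The following is a minimal sketch under stated assumptions: `summarize_posts` is a hypothetical helper, the toy dataframe is made-up data, and treating rows with the same title, text, and timestamp as duplicates is one possible definition among several.

```python
import pandas as pd


def summarize_posts(df):
    """Return the row count, the number of missing selftext values, and the number of repeated posts."""
    return {
        "rows": len(df),
        "missing_selftext": int(df["selftext"].isna().sum()),
        # Assumption: a repeat is a row matching an earlier row on title, text, and timestamp
        "duplicates": int(df.duplicated(subset=["title", "selftext", "created_utc"]).sum()),
    }


# Toy data for illustration only: one repeated post and one missing selftext
toy = pd.DataFrame({
    "subreddit": ["Anxiety"] * 4,
    "title": ["a", "b", "b", "c"],
    "selftext": ["text", "more", "more", None],
    "created_utc": [4, 3, 3, 2],
})
summary = summarize_posts(toy)
print(summary)  # {'rows': 4, 'missing_selftext': 1, 'duplicates': 1}
```

The same helper could be called on `anxiety_df`, `adhd_df`, `depression_df`, and `autism_df` to compare the four datasets side by side.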