# Problem Statement:

### Can a classification model trained on text data correctly predict whether a post came from a 'good advice' or a 'bad advice' subreddit when fed its `author`, `title`, and `selftext`?


### Notes on the Data and Subreddits:

**LifeProTips: (LPT)**
- "Tips that improve your life in one way or another"
- A subreddit dedicated to sharing 'helpful' user-provided advice for navigating a plethora of situations.
- ***The number `1` will always represent the LPT subreddit when attached to a variable.***
    - For example, `posts_1` or `data_1` refer to the LPT subreddit.

**UnethicalLifeProTips: (ULPT)**
- "An Unethical Life Pro Tip (or ULPT) is a tip that improves your life in a meaningful way, perhaps at the    
expense of others and/or with questionable legality. Due to their nature, do not actually follow any of these 
tips–they're just for fun. Share your best tips you've picked up throughout your life, and learn from others!"
- A subreddit dedicated to sharing mocking, 'joke' user-provided 'advice' on a number of subjects and situations.
- ***The number `2` will always represent the ULPT subreddit when attached to a variable.***
    - For example, `posts_2` or `data_2` refer to the ULPT subreddit.
    
### Predictor and Target Variable:

**The predictor variables are `author`, `title`, and `selftext`.**

**The target variable is `subreddit`.**

# Import Libraries

In [358]:
import pandas as pd
import requests
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import regex as re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split, cross_val_score

# Function to call in Reddit info via API

**The function below:**

1) Stores the base URL of the Pushshift API in a variable.

2) Builds an editable dictionary of query parameters that are appended to the API URL.
    - Current parameters are `subreddit` and `size`.

3) Sends the `GET` request and stores the `status code` of the HTTP response.

4) Uses an `if` statement to check that the `status_code` falls within the bounds of `Success 2XX`.

5) If successful:
    - The `JSON` is read in from the response.
    - The list of posts is extracted from the JSON and returned.

6) Otherwise, a string containing the HTTP status code is returned.

In [313]:
def generate_json_posts(subreddit_str, size):
    
    # Setup URL of API
    base_url = "https://api.pushshift.io/reddit/search/submission"    
    
    # Create the params of the API URL
    params = {
        "subreddit": subreddit_str,
        "size": size
    }

    # Response
    res = requests.get(base_url, params=params)
    res_check = res.status_code
    
    # Check response is good (status code in the 2XX range)
    if 200 <= res_check < 300:
        
        # Create JSON:
        data = res.json()
        posts = data["data"]
        
        return posts
    else:
        return f"Check HTTP Error: {res_check}"

In [314]:
lpt_posts = generate_json_posts("LifeProTips", 500)
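
Because `generate_json_posts` returns a plain string when the request fails, a later expression such as `lpt_posts[0]` would quietly yield a single character rather than a post. One possible alternative, sketched below, raises an exception for non-2XX responses. The helper name `extract_posts` and the `FakeResponse` stand-in are hypothetical and exist only for illustration; the sketch assumes the response object exposes `status_code` and `json()` the way `requests.Response` does.

```python
class FakeResponse:
    """Hypothetical stand-in mimicking the two members of requests.Response used here."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


def extract_posts(res):
    """Return the list stored under "data", or raise if the status is not 2XX."""
    if not 200 <= res.status_code < 300:
        raise RuntimeError(f"Check HTTP Error: {res.status_code}")
    return res.json()["data"]


# Usage with the stand-in response (no network call is made)
ok = FakeResponse(200, {"data": [{"title": "LPT: example"}]})
posts = extract_posts(ok)
print(type(posts), len(posts))
```

With this shape, a failed request stops the notebook at the call site instead of surfacing later as a confusing type error.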

### Check Posts dtypes:

In [315]:
print(f"Type: {type(lpt_posts)}\n")
print(f"Each post type: {type(lpt_posts[0])}")

Type: <class 'list'>

Each post type: <class 'dict'>


### Create Lists of Text to Manipulate:

In [316]:
# Isolate the keys we will want to manipulate:

def get_list(posts_list, feature):
    # Collect the value stored under `feature` for every post that has that key
    return [post[feature] for post in posts_list if feature in post]
        

In [369]:
lpt_authors = get_list(lpt_posts, "author")
# lpt_authors

lpt_titles = get_list(lpt_posts, "title")
# lpt_titles

lpt_selftext = get_list(lpt_posts, "selftext")
# lpt_selftext

# Preprocessing

### Cleaning:

- **HTML Artifacts**
- **Non-Letters**
- **Stopwords**

**Remove Non-Letters**

In [346]:
def remove_non_letters(lst):
    new_lst = []
    for i in lst:
        # Strip any HTML markup, then replace every non-letter character with a space
        soup = BeautifulSoup(i, "html.parser")
        new_lst.append(re.sub("[^a-zA-Z]", " ", soup.get_text()))
    return new_lst

In [351]:
# lpt_authors = remove_non_letters(lpt_authors)
lpt_titles = remove_non_letters(lpt_titles)
lpt_selftext = remove_non_letters(lpt_selftext)
# lpt_selftext
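
To make the effect of the substitution concrete, here is a small sketch that uses only the standard-library `re` module on a made-up title (the sample string is an assumption, not taken from the data). Every character outside `a-z` and `A-Z` becomes a space, so contractions are split apart.

```python
import re

sample = "LPT: don't panic!"  # hypothetical title, for illustration only

# Replace every non-letter character with a single space
cleaned = re.sub("[^a-zA-Z]", " ", sample)
print(repr(cleaned))

# Splitting on whitespace shows that "don't" has become "don" and "t"
print(cleaned.split())
```

Digits are replaced as well, so any numbers in a title or `selftext` are lost at this step.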

**Stopwords**

- In the first iteration of this model, the `LPT` or `lpt` word will be removed as a stopword.

In [368]:
stopset = set(nltk.corpus.stopwords.words("english"))
stopset.add("LPT")  # Uppercase form; it sorts before the lowercase words in the displayed set
stopset.add("lpt")

stopset
# https://stackoverflow.com/questions/5511708/adding-words-to-nltk-stoplist

{'LPT',
 'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'lpt',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'ov

In [None]:
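def remove_stopwords(token_list):  # Assumed intent: drop stopset words from each token list
    return [[tok for tok in tokens if tok not in stopset] for tokens in token_list]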

In [359]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
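
Set membership is case-sensitive, which is why both `LPT` and `lpt` are added to `stopset`. The sketch below illustrates that point with a small hand-written stand-in for the NLTK list, so it runs without downloading any corpus; the names `demo_stopset` and `filter_stopwords` and the sample tokens are hypothetical.

```python
# Small stand-in for the NLTK stopword set (assumption: the real set is much larger)
demo_stopset = {"a", "the", "to", "your", "lpt"}


def filter_stopwords(tokens, stopset):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopset]


tokens = ["LPT", "label", "your", "cables", "to", "save", "time"]

# Without lowercasing, "LPT" survives because only "lpt" is in the set
kept_raw = filter_stopwords(tokens, demo_stopset)

# Lowercasing first lets a single lowercase entry cover both spellings
kept_lower = filter_stopwords([tok.lower() for tok in tokens], demo_stopset)

print(kept_raw)
print(kept_lower)
```

If tokens are always lowercased before filtering, the uppercase entry is harmless but never matched.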

### Tokenize:

In [352]:
tokenizer = RegexpTokenizer(r'\w+')  # Remove punctuation, whitespace

In [353]:
def tokenize_list(lst):
    token_lst = []
    for i in lst:
        token_lst.append(tokenizer.tokenize(i.lower()))
    return token_lst

In [354]:
lpt_author_tokens = tokenize_list(lpt_authors)
lpt_titles_tokens = tokenize_list(lpt_titles)
lpt_selftext_tokens = tokenize_list(lpt_selftext)
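
For a quick look at what the `\w+` pattern keeps and drops, the sketch below uses the standard-library `re.findall` as a stand-in for `RegexpTokenizer(r'\w+')`, on the assumption that the two behave the same for this pattern. The function name `simple_tokenize` and the sample strings are made up. Underscores and digits count as word characters, while hyphens split a token, which matters for `author` because usernames are tokenized without the non-letter cleaning step.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and return runs of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", text.lower())


author_tokens = simple_tokenize("Some-User_99")
title_tokens = simple_tokenize("Label your cables, save time!")

print(author_tokens)
print(title_tokens)
```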

In [277]:
# List of lists:

# lpt_titles_tokens
# lpt_author_tokens
# lpt_selftext_tokens

### Lemmatize:

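- Lemmatization reduces inflected word forms to a common base form, so variants of the same word are counted together.
    - For example, `WordNetLemmatizer` maps the plural `tips` to `tip`. It does not correct misspellings such as `untill`; those pass through unchanged and would need a separate adjustment before model input.
- Lemmatization will not be applied to `author`, as these are the usernames attached to each submission to the subreddit.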

In [355]:
lemma = WordNetLemmatizer()

In [356]:
def get_lemma(token_list):
    
    # Instantiate list
    tokens_lem = []
    
    # Iterate through the list of tokenized posts
    for lst in token_list:
        
        # Add each lemmatized word to new list
        tokens_lem.append([lemma.lemmatize(i) for i in lst])
    
    return tokens_lem

In [357]:
lpt_title_lemma = get_lemma(lpt_titles_tokens)
# lpt_title_lemma
lpt_selftext_lemma = get_lemma(lpt_selftext_tokens)
# lpt_selftext_lemma

In [253]:
# list(zip(lpt_titles_tokens, lpt_title_lemma))  # not sure how helpful this is

In [256]:
# for i, j in list(zip(lpt_titles_tokens, lpt_title_lemma)):
#     if i != j:
#         print(i, j)

# Create Dataframe

### Model Features Set:

- `author`
    - The author of the post
- `title`
    - The title of the post
- `selftext`
    - Included in the post, this is the 'content' of the post and appears under the title.
    - Not every post in LPT has `selftext` - Many appear with only a title


In [33]:
def create_df(post_list):
    df = pd.DataFrame(post_list)
    return df

In [51]:
lpt_df = create_df(lpt_posts)

# Exploration

In [71]:
def print_col_info(df, col1, col2, col3):
    # Print the number of missing (NaN) values in each of the three columns
    print("NaN")
    for col in (col1, col2, col3):
        print(f"{col}:", df[col].isna().sum())

In [72]:
print_col_info(lpt_df, "author", "title", "selftext")

NaN
author: 0
title: 0
selftext: 67
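
The 67 missing `selftext` values have a consequence for the list-based extraction earlier in the notebook: `get_list` skips any post that lacks the requested key. If those NaN values come from absent keys (an assumption; they could also be explicit nulls), the `selftext` list would be shorter than the `title` list and the two would no longer line up row by row. A minimal sketch of an alignment-preserving variant follows, with the hypothetical name `get_list_aligned` and made-up sample posts.

```python
def get_list_aligned(posts_list, feature, default=""):
    """Return one entry per post, substituting `default` when the key is missing or None."""
    values = []
    for post in posts_list:
        value = post.get(feature)
        values.append(default if value is None else value)
    return values


# Made-up posts: the second one has no "selftext" key
sample_posts = [
    {"author": "user_a", "title": "LPT: first tip", "selftext": "Some detail"},
    {"author": "user_b", "title": "LPT: second tip"},
]

titles = get_list_aligned(sample_posts, "title")
selftexts = get_list_aligned(sample_posts, "selftext")

print(len(titles), len(selftexts))
print(selftexts)
```

An empty-string default also keeps downstream string functions from receiving `None`.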


In [20]:
lpt_df[["subreddit", "title", "selftext"]].head()  # Titles contain 'LPT', which could leak the label to the model

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | LifeProTips | LPT: need an expensive prescription drug but h... | |
| 1 | LifeProTips | LPT: After a shower, dry yourself in there to ... | So, I personally don't know if doing that is j... |
| 2 | LifeProTips | LPT: After a shower, dry up in there to keep y... | [deleted] |
| 3 | LifeProTips | If shady people come up and ask you if your na... | |
| 4 | LifeProTips | LPT: If you’re going to bring a birthday cake ... | |


# Pre-Preprocessing:

Common column conversions:

In [None]:
# def binarize_bool(posts):
#     for post in posts:
#         for key in 

### X:

In [101]:
def get_titles(posts_list):
    # Collect the title of every post that has one
    return [post["title"] for post in posts_list if "title" in post]

def get_selftext(posts_list):
    # Collect the selftext of every post that has one
    return [post["selftext"] for post in posts_list if "selftext" in post]

**Remove HTML Artifacts**

In [334]:
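def remove_html(lst):
    # Strip HTML markup from every string and return the cleaned list
    cleaned = []
    for i in lst:
        soup = BeautifulSoup(i, "html.parser")
        cleaned.append(soup.get_text())
    return cleaned


remove_html(lpt_titles)[-1]  # Show the last cleaned title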

'If you’re living alone, while in quarantine, disable your face id/touch id, it will save you a lot of time'

In [332]:
lpt_titles[50]  # Check whether this title has HTML artifacts

'LPT: If you feel pressured into complementing somebody\'s baby even though you think it\'s ugly, say "Now THAT\'S a baby!"'

In [333]:
ex1 = BeautifulSoup(lpt_titles[50])
ex1.get_text()

'LPT: If you feel pressured into complementing somebody\'s baby even though you think it\'s ugly, say "Now THAT\'S a baby!"'