# Problem Statement:

### Can a Classification model, trained on NLP data, correctly predict subreddit of origin between a 'good advice' and a 'bad advice' subreddit when fed `author`, `title`, and `selftext`?


### Notes on the Data and Subreddits:

**LifeProTips: (LPT)**
- "Tips that improve your life in one way or another"
- A subreddit dedicated to sharing 'helpful' user-provided advice for navigating a plethora of situations.

**UnethicalLifeProTips: (ULPT)**
- "An Unethical Life Pro Tip (or ULPT) is a tip that improves your life in a meaningful way, perhaps at the    
expense of others and/or with questionable legality. Due to their nature, do not actually follow any of these 
tips–they're just for fun. Share your best tips you've picked up throughout your life, and learn from others!"
- A subreddit dedicated to sharing mocking, 'joke' user-provided 'advice' on a number of subjects and situations.
    
### Predictors and Target Variable:

**Model 1.0:**
- The predictor variable is `title`.
- The target variable is `subreddit`.

# Import Libraries

In [360]:
import pandas as pd
import numpy as np
import requests
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import regex as re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Function to call in Reddit info via API

**The function below:**

1) Stores the base URL of the Pushshift API in a variable.

2) Builds an editable dictionary of query parameters to append to the API URL.
    - Current parameters are `subreddit` and `size`.

3) Sends the request, stores the `HTTP Response`, and saves its `status_code`.

4) Uses an `if` statement to check that the `status_code` falls within the `Success 2XX` range.

5) If successful:
    - The `JSON` body of the response is read in.
    - The list of posts is extracted from the JSON and returned.

In [361]:
def generate_json_posts(subreddit_str, size):
    
    # Setup URL of API
    base_url = "https://api.pushshift.io/reddit/search/submission"    
    
    # Create the params of the API URL
    params = {
        "subreddit": subreddit_str,
        "size": size
    }

    # Response
    res = requests.get(base_url, params)
    res_check = res.status_code
    
    # Check response is good
    if (res_check >= 200 and res_check < 300):
        
        # Create JSON:
        data = res.json()
        posts = data["data"]
        
        return posts
    else:
        return f"Check HTTP Error: {res_check}"
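To make the request concrete, the query string can be built by hand. This is a minimal sketch using only the standard library; `build_url` is a hypothetical helper that is not part of this notebook, and it assumes `requests` encodes these simple parameters the same way `urllib.parse.urlencode` does.

```python
from urllib.parse import urlencode

# Same endpoint as base_url in generate_json_posts
BASE_URL = "https://api.pushshift.io/reddit/search/submission"


def build_url(subreddit_str, size):
    """Hypothetical helper: return the URL the GET request would target."""
    params = {"subreddit": subreddit_str, "size": size}
    return f"{BASE_URL}?{urlencode(params)}"


print(build_url("LifeProTips", 500))
```

Adding another key to `params` (for example a date filter, if the API supports one) would append one more `key=value` pair to the query string.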

In [429]:
lpt_posts = generate_json_posts("LifeProTips", 500)
ulpt_posts = generate_json_posts("UnethicalLifeProTips", 500)

### Check Posts dtypes:

In [363]:
print(f"Type: {type(lpt_posts)}\n")
print(f"Each post type: {type(lpt_posts[0])}")

Type: <class 'list'>

Each post type: <class 'dict'>


# Preprocessing

### Cleaning:

- **HTML Artifacts**
- **Non-Letters**
- **Stopwords**

**Remove Non-Letters**

The function below takes in the list of post dictionaries and the specific key to be cleaned.

In [364]:
def remove_non_letter(json, key):
    for i in range(len(json)):
        soup = BeautifulSoup(json[i][key], "html.parser")  # Create the soup object with an explicit parser
        json[i][key] = re.sub("[^a-zA-Z]", " ", soup.get_text())  # Clean out the non-alphabetical characters
    return json
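To illustrate what the substitution does to a single title, here is a minimal sketch using only the standard library `re` module (the notebook imports `regex as re`, which is assumed to behave the same for this pattern). The sample title is made up.

```python
import re

raw_title = "LPT: Don't pay $5 for that!"

# Every character that is not a-z or A-Z becomes a single space,
# so the cleaned string keeps the same length as the original.
cleaned = re.sub("[^a-zA-Z]", " ", raw_title)

print(repr(cleaned))
print(cleaned.split())
```

Note that the apostrophe is replaced too, so a contraction such as `Don't` is split into `Don` and `t`. This is also why cleaned titles contain runs of consecutive spaces.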

**Make lowercase**

In [365]:
def to_lower(json, key):
    for i in range(len(json)):
        json[i][key] = json[i][key].lower()
    return json

### Tokenize:

In [366]:
tokenizer = RegexpTokenizer(r'\w+')  # Remove punctuation, whitespace

In [367]:
def get_tokens(json, key):
    for i in range(len(json)):
        json[i][key] = tokenizer.tokenize(json[i][key])
    return json
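With its default settings, `RegexpTokenizer(r'\w+')` returns the non-overlapping matches of the pattern, so its effect on a cleaned title can be approximated with the standard library. A minimal sketch; `simple_tokens` is a hypothetical stand-in used only for illustration, not a replacement for the tokenizer above.

```python
import re


def simple_tokens(text):
    """Hypothetical stand-in: return every run of word characters."""
    return re.findall(r"\w+", text)


print(simple_tokens("lpt  use a fake number app"))
```

Because whitespace never matches `\w+`, the repeated spaces left behind by the cleaning step do not produce empty tokens.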

In [368]:
# Check
# get_tokens(lpt_posts, "title")

### Lemmatize:

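- Lemmatization reduces inflected word forms to a common base form (for example, `tips` becomes `tip`), so variants of the same word are counted as one feature.
    - Note that `WordNetLemmatizer` does not correct misspellings: a typo such as `untill` passes through unchanged and would need separate handling before model input.
- Lemmatization will not be applied to `author`, as these are the usernames attached to each post submitted to the subreddit.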

In [369]:
lemma = WordNetLemmatizer()

In [370]:
def to_lemma(json, key):
    for i in range(len(json)):
        json[i][key] = [lemma.lemmatize(j) for j in json[i][key]]
    return json

In [371]:
# Check
# to_lemma(lpt_posts[:2], "title")

### Remove Stopwords

**Stopwords**

- The subreddit tags (`LPT`/`lpt` and `ULPT`/`ulpt`) are added to the stopword set so that they can be removed from the `title` and `selftext`.

In [372]:
stopset = set(nltk.corpus.stopwords.words("english"))
stopset.add("LPT")  # Set membership is case-sensitive, so both cases are added
stopset.add("lpt")
stopset.add("ULPT")
stopset.add("ulpt")

# stopset
# https://stackoverflow.com/questions/5511708/adding-words-to-nltk-stoplist

In [373]:
def remove_stopword(json, key):
    for i in range(len(json)):
        # Build a new list; removing items while iterating skips tokens
        json[i][key] = [j for j in json[i][key] if j not in stopset]

    return json
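A note on filtering tokens: removing items from a list while looping over that same list skips the element that follows each removal, so consecutive stopwords can survive. A minimal sketch of the pitfall and a filter-based alternative; `stopset_demo` and the token list are made-up illustration values.

```python
stopset_demo = {"the", "a", "lpt"}
tokens = ["lpt", "the", "a", "tip"]

# Pitfall: each removal shifts the remaining items left,
# so the loop steps over the token right after the removed one.
buggy = list(tokens)
for tok in buggy:
    if tok in stopset_demo:
        buggy.remove(tok)

# Alternative: build a new list that keeps only non-stopwords.
filtered = [tok for tok in tokens if tok not in stopset_demo]

print(buggy)     # "the" survives because it was skipped
print(filtered)  # only "tip" remains
```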

## Function to perform each task:

- This function applies all of the preprocessing tasks listed above to a given feature in a single call.
    - Otherwise, each of the above functions can be called on a feature as needed.

In [376]:
def posts_to_words(json, key):

    # Remove non-letters:
    remove_non_letter(json, key)
    
    # Make lowercase:
    to_lower(json, key)
    
    # Tokenize:
#     get_tokens(json, key)
    
    # Lemmatize:
#     to_lemma(json, key)
    
    # Remove Stop words:
#     remove_stopword(json, key)
    
    return json

In [377]:
# Author
# lpt_clean_posts = posts_to_words(lpt_posts, "author")
# ulpt_clean_posts = posts_to_words(ulpt_posts, "author")

# Title
lpt_clean_posts = posts_to_words(lpt_posts, "title")
ulpt_clean_posts = posts_to_words(ulpt_posts, "title")

# Selftext
# lpt_clean_posts = posts_to_words(lpt_posts, "selftext")
# ulpt_clean_posts = posts_to_words(ulpt_posts, "selftext")

# Check
# ulpt_clean_posts[0]

# Create Dataframe

### Model Features Set:

- ~~`author`~~
    - ~~The author of the post~~
- `title`
    - The title of the post
- ~~`selftext`~~
    - ~~Included in the post, this is the 'content' of the post and appears under the title.~~
    - ~~Not every post in LPT has `selftext` - Many appear with only a title~~


In [402]:
df1 = pd.DataFrame(lpt_clean_posts)
df2 = pd.DataFrame(ulpt_clean_posts)

In [403]:
df = pd.concat([df1, df2], ignore_index=True)

In [404]:
df.shape

(1000, 70)

In [405]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 70 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  1000 non-null   object 
 1   allow_live_comments            1000 non-null   bool   
 2   author                         1000 non-null   object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          884 non-null    object 
 5   author_flair_text              0 non-null      object 
 6   author_flair_type              884 non-null    object 
 7   author_fullname                884 non-null    object 
 8   author_patreon_flair           884 non-null    object 
 9   author_premium                 884 non-null    object 
 10  awarders                       1000 non-null   object 
 11  can_mod_post                   1000 non-null   bool   
 12  contest_mode                   1000 non-null   bo

In [406]:
df["title"].head()

0    lpt  whenever you say  sorry   you should inst...
1    lpt   use a fake number app when looking onlin...
2    lpt  at the end of the quarantine and after at...
3    what to do if your wallet is lost or stolen   lpt
4    lpt  if you decide to start walking due to cor...
Name: title, dtype: object

### Feature Engineering:

**Binarize target `y` variable**

In [407]:
df["subreddit"].value_counts()

UnethicalLifeProTips    500
LifeProTips             500
Name: subreddit, dtype: int64

In [432]:
# Create numeric values for y var to be passed into model

df["subreddit"] = df["subreddit"].map({"LifeProTips": 1,
                                       "UnethicalLifeProTips": 0
                                      })

In [409]:
df["subreddit"].value_counts()

1    500
0    500
Name: subreddit, dtype: int64

In [410]:
X = df["title"]
y = df["subreddit"]

### Save Dataframe:

In [433]:
df.to_pickle("./datasets/df_model_1.0")

### Train/Test Split

In [411]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=0.33,
                                                    random_state=42)

# First Model

### Create Count Vectorizer

**Instantiate**

In [413]:
# Hyperparameters set to lesson defaults
cvec = CountVectorizer()

**Fit**

In [416]:
X_train_sc = cvec.fit_transform(X_train)
X_test_sc = cvec.transform(X_test)

In [420]:
print(f"X_train shape: {X_train_sc.shape}")
print(f"X_test_sc shape: {X_test_sc.shape}\n")
print(f"X_train_sc feature names: {cvec.get_feature_names_out()[0:1000:250].tolist()}")

X_train shape: (670, 2754)
X_test_sc shape: (330, 2754)

X_train_sc feature names: ['ability', 'booze', 'contact', 'earth']


In [421]:
# Baseline score: the test classes are balanced 50/50,
# so always predicting one class gives 0.50 accuracy
y_test.value_counts()

1    165
0    165
Name: subreddit, dtype: int64
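The baseline accuracy follows directly from these counts: it is the share of the most frequent class. A minimal sketch of the arithmetic using the counts printed above; `y_test_counts` is a hypothetical stand-in for the `value_counts()` output.

```python
from collections import Counter

# Class counts copied from the y_test.value_counts() output
y_test_counts = Counter({1: 165, 0: 165})

# Always predicting the majority class is correct this often
baseline_accuracy = max(y_test_counts.values()) / sum(y_test_counts.values())

print(baseline_accuracy)  # 0.5
```

Any model worth keeping should score clearly above this value on the test set.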

### Create Estimator

In [426]:
logreg = LogisticRegression(solver="lbfgs")

In [427]:
logreg.fit(X_train_sc, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [428]:
print(f"Train score: {logreg.score(X_train_sc, y_train)}") 
print(f"Test score: {logreg.score(X_test_sc, y_test)}")

Train score: 1.0
Test score: 0.906060606060606


### First Model Score Notes:

- There is evidence of overfitting. On the train dataset, the model scored a perfect **1.0** in accuracy. On the test dataset, the score was only **0.906**.
    - One possible reason is that no stopwords were removed from the raw data.
    - Many `title`s contain **lpt** or **ulpt** in the text. These tags correlate STRONGLY with the subreddit the text belongs to, so they leak the target into the features and are most likely distorting the model's scores.
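One way to remove this signal before vectorizing is to strip the tags from the cleaned titles. A minimal sketch, assuming titles are already lowercased and stripped of non-letters as in the preprocessing above; `strip_subreddit_tags` and `TAG_PATTERN` are hypothetical names, not part of this notebook.

```python
import re

# Match the tags only as whole words, so words that merely
# contain these letters are left untouched.
TAG_PATTERN = re.compile(r"\b(?:ulpt|lpt)\b")


def strip_subreddit_tags(title):
    """Hypothetical helper: drop lpt/ulpt tags and collapse extra spaces."""
    no_tags = TAG_PATTERN.sub(" ", title)
    return " ".join(no_tags.split())


print(strip_subreddit_tags("lpt  use a fake number app when looking online"))
print(strip_subreddit_tags("what to do if your wallet is lost or stolen   lpt"))
```

The same effect could likely be achieved by passing the tags as stopwords to the vectorizer. Either way, the model would then be retrained and its train and test scores compared against the ones above.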