# Problem Statement:

### Can the text of a subreddit post's `title` and `selftext` reliably predict if a post is 'good advice' or 'bad advice'?

    
    
### Predictors and Target Variable:

**Model 1.2:**
- The predictor variable is `title`.
- The target variable is `subreddit`.

### Pipeline & GridSearch:
- 

In [1]:
import pandas as pd
import numpy as np
import requests
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import regex as re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Create JSON Files via API:

In [2]:
def generate_json_posts(subreddit_str, size):
    
    # Setup URL of API
    base_url = "https://api.pushshift.io/reddit/search/submission"    
    
    # Create the params of the API URL
    params = {
        "subreddit": subreddit_str,
        "size": size
    }

    # Response
    res = requests.get(base_url, params)
    res_check = res.status_code
    
    # Check response is good
    if (res_check >= 200 and res_check < 300):
        
        # Create JSON:
        data = res.json()
        posts = data["data"]
        
        return posts
    else:
        return f"Check HTTP Error: {res_check}"

In [3]:
lpt_posts = generate_json_posts("LifeProTips", 500)
ulpt_posts = generate_json_posts("UnethicalLifeProTips", 500)

### Save JSON Files:

In [4]:
pd.to_pickle(lpt_posts, "../datasets/lpt_posts_json")
pd.to_pickle(ulpt_posts, "../datasets/ulpt_posts_json")

# Create Dataframe

### Model Features Set:

- ~~`author`~~
    - ~~The author of the post~~
- `title`
    - The title of the post
- `selftext`
    - Included in the post, this is the 'content' of the post and appears under the title.
    - Not every post in LPT has `selftext` - Many appear with only a title


In [130]:
lpt_df = pd.DataFrame(lpt_posts)
ulpt_df = pd.DataFrame(ulpt_posts)

In [131]:
df = lpt_df.append(ulpt_df, ignore_index=True)

In [132]:
df.shape

(1000, 70)

In [133]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 70 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  1000 non-null   object 
 1   allow_live_comments            1000 non-null   bool   
 2   author                         1000 non-null   object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          896 non-null    object 
 5   author_flair_text              0 non-null      object 
 6   author_flair_type              896 non-null    object 
 7   author_fullname                896 non-null    object 
 8   author_patreon_flair           896 non-null    object 
 9   author_premium                 896 non-null    object 
 10  awarders                       1000 non-null   object 
 11  can_mod_post                   1000 non-null   bool   
 12  contest_mode                   1000 non-null   bo

In [134]:
df["title"].head()

0     How can we help our communities during COVID-19.
1    LPT Request: When you feel physically tired af...
2    When you feel physically tired after a long da...
3    If your bank has been charging you fees for a ...
4    LPT: Have a professional voice actor voice you...
Name: title, dtype: object

# Feature Engineering & Preprocessing

### Cleaning:

- **HTML Artifacts:**
- **Non-Letters**
- **Stopwords**

**Lemmatize:**

- This can help with some typos in our word analysis.
    - For example, we can use lemmatization to identify `untill`, and make a necessary adjustment to model input
- Lemmatization will not be applied to `author`, as these are the usernames attached to the post submission to the subreddit.

In [22]:
lemma = WordNetLemmatizer()

In [143]:
def to_lemma(data, col):
#     data[col] = [i.split() for i in data[col]]  # Creates a list of split words
    
    new_lst = []
    
    data[col].apply(lambda i: lemma.lemmatize(i))
    return data[col]
#     for i in range(len(data[col])):
# #         print(data[col][i])  # Each list of words in a post
        
#         for j in data[col][i]:
# #             print(j)  # Each word
#             new_lst.append(lemma.lemmatize(j))
# #             print(j)
#             data[col][i] = new_lst
#     print(data[col])
            
    
#     return  " ".join(i for i in data[col])  # This takes the list and joins it


In [149]:
# to_lemma(df, "title")
df["title"].apply(lambda i, j: lemma.lemmatize(j))

TypeError: <lambda>() missing 1 required positional argument: 'j'

In [142]:
df["title"]

0       How can we help our communities during COVID-19.
1      LPT Request: When you feel physically tired af...
2      When you feel physically tired after a long da...
3      If your bank has been charging you fees for a ...
4      LPT: Have a professional voice actor voice you...
                             ...                        
995                                         UTLP Request
996             ULPT Cheating in my online mockup exams?
997    ULPT: Every retailer, essentially, is extendin...
998    Every store, essentially, is extending their r...
999    ULPT REQUEST: Found credit card after purchasi...
Name: title, Length: 1000, dtype: object

### Remove Stopwords 

In [16]:
df["title"] = [" ".join([i for i in x.split()
                         if i not in stopset])
                         for x in df["title"]]

In [17]:
df["title"].head()

0     20 páginas para encontrar ofertas de teletrabajo
1    LPT: Get bidet. They drastically reduce amount...
2    LPT: If want stay close friends family, respon...
3    LPT: When halfway sleeve cookies, pull sleeve ...
4                                 Happy 4/20 day enjoy
Name: title, dtype: object

### Define Stopwords
- In this iteration of the model, the `LPT` or `lpt` word will be removed from the `title` and `selftext` as a stopword.

In [135]:
stopset = set(nltk.corpus.stopwords.words("english"))
stopset.add("lpt")
stopset.add("lptrequest")

stopset.add("ulpt")
stopset.add("ulptrequest")

# stopset
# https://stackoverflow.com/questions/5511708/adding-words-to-nltk-stoplist

## Function to perform each Preprocessing task:

- The idea behind this function is to have a function to call on a given feeature that should have all of the preprocessing tasks performed, as listed above.
    - Otherwise, each of the above functions can be called on a feature as needed.

In [11]:
def clean_df(data, col):

    # Remove non-letters:
    new_lst = []
    for i in data[col]:
        soup = BeautifulSoup(i, "lxml")
        new_lst.append(re.sub("[^a-zA-Z]", " ", soup.get_text()))
    data[col] = new_lst
    # Some reference to: https://www.reddit.com/r/learnpython/comments/an62wx/how_to_remove_html_from_pandas_dataframe_without/
    
    
    # Make lowercase:
    data[col] = data[col].str.lower()
    

    # Lemmatize:
    lemma = WordNetLemmatizer()
    data[col].apply(lambda i: lemma.lemmatizer(i))
    
#     for row in data[col]:
#         new_lst = [lemma.lemmatize(row)]
#     print(new_lst)
    # Remove Stopwords:
    
    
    return data

In [12]:
clean_df(df, "title");

  ' that document to Beautiful Soup.' % decoded_markup


In [13]:
df["title"].head()

0     how can we help our communities during covid    
1    lpt request  when you feel physically tired af...
2    when you feel physically tired after a long da...
3    if your bank has been charging you fees for a ...
4    lpt  have a professional voice actor voice you...
Name: title, dtype: object

**Binarize target `y` variable**

In [None]:
df["subreddit"].value_counts()

In [None]:
# Create numeric values for y var to be passed into model

df["subreddit"] = df["subreddit"].map({"LifeProTips": 1,
                                       "UnethicalLifeProTips": 0
                                      })

In [None]:
df["subreddit"].value_counts()

### Save Dataframe:

In [None]:
df.to_pickle("./datasets/df_model_1.0")

# First Model

In [None]:
X = df["title"]
y = df["subreddit"]

### Train/Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=0.33,
                                                    random_state=42)

### Create Count Vectorizer

**Instantiate**

In [None]:
# Hyparams set to lesson defaults
cvec = CountVectorizer()

**Fit**

In [None]:
X_train_sc = cvec.fit_transform(X_train)
X_test_sc = cvec.transform(X_test)

In [None]:
print(f"X_train shape: {X_train_sc.shape}")
print(f"X_test_sc shape: {X_test_sc.shape}\n")
print(f"X_train_sc feature names: {cvec.get_feature_names()[0:1000:250]}")

In [None]:
# Baseline score:
y_test.value_counts()  # even 50/50 split - may need to tweak this?

### Create Estimator

In [None]:
# Instantiate


In [None]:
# Fit


In [None]:
# print(f"Train score: {}")
# print(f"Test score: {}")

### Second Model Score Notes:

- 