# Problem Statement:

### Can the text of a subreddit post's `title`, `author`, and `selftext` reliably predict if a post is 'good advice' or 'bad advice'?

    
    
### Predictors and Target Variable:

**Model 1.2:**
- The predictor variables are `title`, `selftext`, `author`.
- The target variable is `subreddit`.

### Pipeline & GridSearch:
- 

In [1]:
import pandas as pd
import numpy as np
import requests
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import regex as re
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Create JSON Files via API:

In [2]:
def generate_json_posts(subreddit_str, size):
    
    # Setup URL of API
    base_url = "https://api.pushshift.io/reddit/search/submission"    
    
    # Create the params of the API URL
    params = {
        "subreddit": subreddit_str,
        "size": size
    }

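    # Response (pass the query parameters by keyword for clarity)
    res = requests.get(base_url, params=params)
    res_check = res.status_code
    
    # Check response is good (any 2xx status code)
    if 200 <= res_check < 300:
        
        # Create JSON:
        data = res.json()
        posts = data["data"]
        
        return posts
    else:
        # Note: on failure this returns a message string, not a list of posts
        return f"Check HTTP Error: {res_check}"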

In [3]:
lpt_posts = generate_json_posts("LifeProTips", 500)
ulpt_posts = generate_json_posts("UnethicalLifeProTips", 500)

### Save JSON Files:

In [4]:
# pd.to_pickle(lpt_posts, "../datasets/lpt_posts_json")
# pd.to_pickle(ulpt_posts, "../datasets/ulpt_posts_json")

### Read JSON Files:

In [4]:
lpt_posts = pd.read_pickle("../datasets/lpt_posts_json")
ulpt_posts = pd.read_pickle("../datasets/ulpt_posts_json")

# Create Dataframe

### Model Features Set:

- ~~`author`~~
    - ~~The author of the post~~
- `title`
    - The title of the post
- `selftext`
    - Included in the post, this is the 'content' of the post and appears under the title.
    - Not every post in LPT has `selftext` - Many appear with only a title


In [87]:
lpt_df = pd.DataFrame(lpt_posts)
ulpt_df = pd.DataFrame(ulpt_posts)

In [88]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([lpt_df, ulpt_df], ignore_index=True)

In [89]:
df.shape

(1000, 70)

In [90]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 70 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  1000 non-null   object 
 1   allow_live_comments            1000 non-null   bool   
 2   author                         1000 non-null   object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          896 non-null    object 
 5   author_flair_text              0 non-null      object 
 6   author_flair_type              896 non-null    object 
 7   author_fullname                896 non-null    object 
 8   author_patreon_flair           896 non-null    object 
 9   author_premium                 896 non-null    object 
 10  awarders                       1000 non-null   object 
 11  can_mod_post                   1000 non-null   bool   
 12  contest_mode                   1000 non-null   bo

In [91]:
df["title"].head()

0     How can we help our communities during COVID-19.
1    LPT Request: When you feel physically tired af...
2    When you feel physically tired after a long da...
3    If your bank has been charging you fees for a ...
4    LPT: Have a professional voice actor voice you...
Name: title, dtype: object

## Exploration:

- Investigate additional features: `score`, `author`, `over_18`

    - `author` 
         - Set the `CountVectorizer` hyperparameter `lowercase=False` (usernames are case-sensitive)
         - Has a number of `[deleted]` author names

- Notes for next model:
    - `created_utc` & `retrieved_on` as features

In [192]:
# Column names
df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'removed_by_category', 'retrieved_on', 'score', 'selftext',
       'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'suggested_sort',
       'thumbnail', 'title', 'total_awards_rec

**`over_18`**

- No NaN values, but **this is a heavily imbalanced variable** (996 `False` vs. 4 `True`), so it may not be ideal to include in the model.

In [220]:
df["over_18"].isna().sum() # NaN

0

In [222]:
df["over_18"].value_counts()

False    996
True       4
Name: over_18, dtype: int64

**`author`**

In [250]:
# Counts across the whole df (both subreddits combined).
# To break this down by subreddit, use a custom function or .groupby().
df["author"].value_counts()

[deleted]              104
warren_street            5
Maximum-Cash             4
workhard200723           4
vsop666                  3
                      ... 
Hudgpop                  1
throwaway4206991142      1
fwiurak2                 1
Sockemslol2              1
LionBastard1             1
Name: author, Length: 698, dtype: int64
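
The counts above mix both subreddits. A minimal sketch of the `.groupby()` breakdown mentioned in the cell comment, using a made-up toy frame (the rows and the usernames `user_a` and `user_b` are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for df with only the two columns needed here (hypothetical rows)
toy = pd.DataFrame({
    "subreddit": ["LifeProTips", "LifeProTips",
                  "UnethicalLifeProTips", "UnethicalLifeProTips", "UnethicalLifeProTips"],
    "author": ["[deleted]", "user_a", "[deleted]", "[deleted]", "user_b"],
})

# Author frequencies within each subreddit
author_counts = toy.groupby("subreddit")["author"].value_counts()

# Number of [deleted] authors per subreddit
deleted_per_subreddit = (toy["author"] == "[deleted]").groupby(toy["subreddit"]).sum()

print(author_counts)
print(deleted_per_subreddit)
```

If `[deleted]` turns out to be much more common in one subreddit than the other, it would act as a strong signal on its own, which is worth knowing before using `author` as a predictor.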

**`score`**

In [252]:
df["score"].isna().sum()

0

In [255]:
df["score"].value_counts()

1       807
0        65
2        29
3        15
4        12
6         6
5         5
7         5
8         5
10        4
13        4
17        3
9         3
21        2
12        2
15        1
14        1
18        1
19        1
20        1
1837      1
22        1
26        1
31        1
1019      1
37        1
225       1
800       1
649       1
1477      1
439       1
9627      1
2437      1
3428      1
2387      1
276       1
187       1
38        1
137       1
134       1
1147      1
122       1
107       1
104       1
80        1
69        1
4149      1
32        1
Name: score, dtype: int64

# Feature Engineering & Preprocessing

### Cleaning:

- **HTML artifacts**
- **Non-letters**
- **Stopwords**
- **Lemmatization**

**Lemmatize:**

- Lemmatization reduces inflected word forms to a common base form, so variants of the same word are counted together.
    - For example, `fees` becomes `fee`. It does not correct misspellings such as `untill`; typos would need a separate step before modeling.
- Lemmatization will not be applied to `author`, as these are the usernames attached to each post submission.

In [92]:
def to_lemma(data, col):
    lemma = WordNetLemmatizer()
    tokenizer = RegexpTokenizer(r'\w+')
    # Tokenize each string, lemmatize every token, and rejoin into one string.
    # Using .apply avoids chained assignment (data[col][i] = ...).
    data[col] = data[col].apply(
        lambda text: " ".join(lemma.lemmatize(tok) for tok in tokenizer.tokenize(text))
    )
    return data

### Define Stopwords
- In this iteration of the model, the tokens `lpt`, `lptrequest`, `ulpt`, and `ulptrequest` are added to the stopword set so that they are removed from `title` and `selftext`.

In [190]:
stopset = set(nltk.corpus.stopwords.words("english"))
stopset.add("lpt")
stopset.add("lptrequest")

stopset.add("ulpt")
stopset.add("ulptrequest")
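
One thing to check with these additions: the cleaning step replaces non-letters with spaces before stopwords are removed, so a title beginning `LPT Request:` is split into the separate tokens `lpt` and `request`. The sketch below illustrates this with the standard-library `re` module and a deliberately abbreviated, hypothetical stopword set (the notebook itself uses NLTK's full English list); `strip_and_filter` is an illustrative helper, not part of the notebook.

```python
import re

# Abbreviated, hypothetical stopword set for illustration only
toy_stopset = {"when", "you", "after", "a", "lpt", "lptrequest", "ulpt", "ulptrequest"}

def strip_and_filter(text, stopwords):
    """Replace non-letters with spaces, lowercase, and drop stopwords."""
    letters_only = re.sub("[^a-zA-Z]", " ", text).lower()
    return " ".join(tok for tok in letters_only.split() if tok not in stopwords)

cleaned = strip_and_filter("LPT Request: When you feel physically tired", toy_stopset)
print(cleaned)
```

Here `lpt` is removed but `request` survives, because the combined token `lptrequest` never occurs after the split. If the intent is to drop the request marker as well, adding `request` to the stopword set would be one option.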

## Function to perform each Preprocessing task:

- This function applies all of the preprocessing tasks listed above to a given feature in a single call.
    - Otherwise, each of the above functions can be called on a feature as needed.

In [94]:
def clean_df(data, col):
    # Lemmatizer and tokenizer are created here so the function is self-contained
    lemma = WordNetLemmatizer()
    tokenizer = RegexpTokenizer(r'\w+')

    # Strip HTML artifacts, then replace non-letters with spaces:
    new_lst = []
    for i in data[col]:
        soup = BeautifulSoup(i, "lxml")
        new_lst.append(re.sub("[^a-zA-Z]", " ", soup.get_text()))
    data[col] = new_lst
    # Some reference to: https://www.reddit.com/r/learnpython/comments/an62wx/how_to_remove_html_from_pandas_dataframe_without/

    # Make lowercase:
    data[col] = data[col].str.lower()

    # Lemmatize each token; .apply avoids the chained assignment
    # (data[col][i] = ...) that raises SettingWithCopyWarning:
    data[col] = data[col].apply(
        lambda text: " ".join(lemma.lemmatize(tok) for tok in tokenizer.tokenize(text))
    )

    # Remove Stopwords:
    data[col] = [" ".join([i for i in x.split()
                           if i not in stopset])
                 for x in data[col]]

    return data

In [95]:
clean_df(df, "title");

In [270]:
df["selftext"].dropna(inplace=True)

In [271]:
clean_df(df, "selftext")

  ' that document to Beautiful Soup.' % decoded_markup


ValueError: Length of values does not match length of index

In [262]:
clean_df(df, "selftext")

TypeError: object of type 'float' has no len()

In [267]:
df["selftext"].dtypes

dtype('O')
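
Both failures above trace back to missing values in `selftext`. `NaN` is stored as a float, so it cannot be parsed as text, and calling `dropna(inplace=True)` on a single column does not remove rows from `df`, which is the likely source of the length mismatch. A minimal sketch of one way around this, assuming it is acceptable to treat a missing body as an empty string (the toy frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df["selftext"] with one missing value (hypothetical rows)
toy = pd.DataFrame({"selftext": ["Some body text", np.nan, "[removed]"]})

# Record which rows had a body BEFORE filling, otherwise the information is lost
toy["has_selftext"] = toy["selftext"].notnull().astype(int)

# Replace NaN with an empty string so every value is a str and the row count is unchanged
toy["selftext"] = toy["selftext"].fillna("")

print(toy)
```

With every value a string, a cleaning function that iterates over the column no longer hits a float. The order matters: the `has_selftext` flag built later in this notebook relies on `notnull()`, so it has to be computed before any fill.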

In [268]:
df["title"].head()

0                                 help community covid
1    request feel physically tired long day work ta...
2    feel physically tired long day work taking sho...
3    bank ha charging fee bank account good time as...
4    professional voice actor voice voicemail messa...
Name: title, dtype: object

In [97]:
df.shape

(1000, 70)

### Binarize target `y` variable

In [98]:
df["subreddit"].value_counts()

LifeProTips             500
UnethicalLifeProTips    500
Name: subreddit, dtype: int64

In [99]:
# Create numeric values for y var to be passed into model

df["subreddit"] = df["subreddit"].map({"LifeProTips": 1,
                                       "UnethicalLifeProTips": 0
                                      })

In [100]:
df["subreddit"].value_counts()

1    500
0    500
Name: subreddit, dtype: int64

### Add a `has_selftext` column flagging whether a post contains `selftext`

In [113]:
df["selftext"].isna().sum()

df["has_selftext"] = df["selftext"].notnull().astype(int)

print((df["has_selftext"] == 1).sum())

print((df["has_selftext"] == 0).sum())

924
76


### Clean the selftext:

- Not every subreddit post has `selftext`, which is the body of a post.
    - It seems clear that most of the text content of the posts is contained in the `title` field, making it an important predictor variable.
- Issues with `selftext`:
    - `[removed]`
    - `[deleted]`
    - Contains emojis

In [120]:
df["selftext"].unique()

array(['[removed]',
       'It\'s like painting, we all think we can do it until you see a professional painter do it. A professional voice actor can make your voicemail sound sharp and polished. It\'s very affordable to hire voice actors online to do a specific word count voicemail. It\'s the added touch to give a "wow" to your clients.',
       '',
       'This is really more for me than for you but I know there’s a million others like me feeling really powerless right now that this will mean a lot to.',
       'There are whole websites dedicated to gathering manuals, hope this helps.',
       'If you ever wanted to have a conversation with someone but the cooldown was too long (say 5 minutes), one solution I found is that you can edit your messages and keep the conversation going. It can be efficient, and is also good for hiding your conversation to other people. Use this as you wish',
       "Use the zoom on your phone to zoom in. I just did it to read the name of the type of canne

In [261]:
print((df["selftext"] == "[removed]").sum())
removed_selftext = (df["selftext"] == "[removed]").sum()

print(f"Percent [removed]: {(removed_selftext / len(df)) * 100}%")

print((df["selftext"] == "[deleted]").sum())
deleted_selftext = (df["selftext"] == "[deleted]").sum()

print(f"Percent [deleted]: {(deleted_selftext / len(df)) * 100}%")

print((df["selftext"].isna().sum()))
nan_selftext = df["selftext"].isna().sum()

print(f"Percent NaN: {(nan_selftext / len(df)) * 100}%")

total_selftext_probs = removed_selftext + deleted_selftext + nan_selftext
print(total_selftext_probs)

print(f"Percent Total Probs: {(total_selftext_probs / len(df)) * 100}%")

519
Percent [removed]: 51.9%
28
Percent [deleted]: 2.8000000000000003%
76
Percent NaN: 7.6%
623
Percent Total Probs: 62.3%
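
Since `[removed]` and `[deleted]` are strings, they count as non-null, so a flag based on `notnull()` alone marks those posts as having a body. A minimal sketch of a stricter flag, assuming placeholders and empty strings should count as "no body" (the column name `has_body_text` and the toy rows are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in covering each problem case seen above (hypothetical rows)
toy = pd.DataFrame({
    "selftext": ["[removed]", "Real body text", "[deleted]", np.nan, ""]
})

# Values treated as "no usable body" in this sketch (an assumption, adjust as needed)
placeholders = ["[removed]", "[deleted]", ""]

# 1 only when the value is present and is not a placeholder
toy["has_body_text"] = (
    toy["selftext"].notna() & ~toy["selftext"].isin(placeholders)
).astype(int)

print(toy)
```

Whether removal itself is informative (for example, if one subreddit has far more `[removed]` posts) is a separate question; a second flag for placeholders could capture that.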


### Save Dataframe:

In [141]:
pd.to_pickle(df, "../datasets/df_model_1.2")

# First Model

In [114]:
X = df[["title", "has_selftext"]]
y = df["subreddit"]

### Train/Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=0.33,
                                                    random_state=42)

### Create Count Vectorizer

**Instantiate**

In [None]:
# Hyperparameters left at the CountVectorizer defaults
cvec = CountVectorizer()

**Fit**

In [None]:
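# CountVectorizer expects an iterable of strings, so pass only the `title` column
# (passing the whole DataFrame would vectorize the column names instead)
X_train_sc = cvec.fit_transform(X_train["title"])
X_test_sc = cvec.transform(X_test["title"])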

In [None]:
print(f"X_train_sc shape: {X_train_sc.shape}")
print(f"X_test_sc shape: {X_test_sc.shape}\n")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(f"X_train_sc feature names: {cvec.get_feature_names_out()[0:1000:250]}")
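
`CountVectorizer` works on a single text column, so the numeric `has_selftext` flag has to be combined with the word counts separately. One possible approach, sketched here with scikit-learn's `ColumnTransformer` on a hypothetical toy frame (this is an assumption about how the two features could be joined, not the notebook's method):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for X with the same two columns (hypothetical rows)
toy_X = pd.DataFrame({
    "title": ["help community covid", "bank charging fee", "feel tired long day"],
    "has_selftext": [1, 0, 1],
})

# Vectorize `title` (a scalar column name passes a 1-D Series to the vectorizer)
# and pass `has_selftext` through unchanged as one extra column
features = ColumnTransformer(
    transformers=[("title_counts", CountVectorizer(), "title")],
    remainder="passthrough",
)

toy_matrix = features.fit_transform(toy_X)
print(toy_matrix.shape)
```

The result has one column per vocabulary word plus one for the flag. The fitted transformer can then call `transform` on the test split so both splits share the same vocabulary.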

In [None]:
# Baseline score:
y_test.value_counts()  # even 50/50 split - may need to tweak this?
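
With a 500/500 class split overall, always predicting one class would score roughly 50% accuracy, which is the number a model needs to beat. The test split is a random 33% sample, so its proportions may differ slightly. A small sketch of reading the baseline as a proportion, using a hypothetical toy series `y_toy`:

```python
import pandas as pd

# Toy stand-in for y_test (hypothetical labels)
y_toy = pd.Series([1, 0, 1, 1, 0, 0, 1, 0])

# normalize=True returns class proportions instead of raw counts
class_shares = y_toy.value_counts(normalize=True)

# Accuracy of always predicting the most common class
baseline_accuracy = class_shares.max()

print(class_shares)
print(f"Baseline accuracy: {baseline_accuracy}")
```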

### Create Estimator

In [None]:
# Instantiate


In [None]:
# Fit


In [None]:
# print(f"Train score: {}")
# print(f"Test score: {}")
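
The estimator cells above are still empty. As an illustration of the instantiate, fit, and score steps, here is a self-contained sketch that assumes `LogisticRegression` (one of the estimators imported at the top of the notebook) and uses made-up titles and labels; the choice of estimator and all `*_toy` names are assumptions, not results from this project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up titles and labels (1 = LifeProTips, 0 = UnethicalLifeProTips)
titles_toy = [
    "return shopping cart help staff",
    "thank bus driver every ride",
    "keep spare key with trusted friend",
    "drink water before morning coffee",
    "fake receipt get free refund",
    "lie about age get discount",
    "use fake name skip paywall",
    "pretend birthday get free dessert",
]
labels_toy = [1, 1, 1, 1, 0, 0, 0, 0]

# Stratify so both classes appear in the train and test splits
titles_train, titles_test, y_train_toy, y_test_toy = train_test_split(
    titles_toy, labels_toy, test_size=0.25, random_state=42, stratify=labels_toy
)

# Fit the vocabulary on the training titles only, then reuse it for the test titles
cvec_toy = CountVectorizer()
X_train_counts = cvec_toy.fit_transform(titles_train)
X_test_counts = cvec_toy.transform(titles_test)

# Instantiate
logreg = LogisticRegression()

# Fit
logreg.fit(X_train_counts, y_train_toy)

# Score (mean accuracy on each split)
train_score = logreg.score(X_train_counts, y_train_toy)
test_score = logreg.score(X_test_counts, y_test_toy)
print(f"Train score: {train_score}")
print(f"Test score: {test_score}")
```

On eight toy rows the scores mean nothing; the point is only the shape of the workflow. A large gap between train and test scores on the real data would suggest overfitting.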

### Second Model Score Notes:

- 