# Project Three: Web APIs & NLP

______

---
### Table of Contents
---
1. [**Problem Statement**](#problemstatement)
2. [**Data Collection**](#dc)
3. [**Data Cleaning & EDA**](#datacleanEDA)
4. [**Preprocessing and Modeling**](#ppmodeling)
5. [**Evaluation**](#evaluation)
6. [**Conclusion**](#conclusion)

---

### Notebooks
- **[Exploratory Analysis Notebook](3-2_ExploratoryDataAnalysis.ipynb)**
- **[Modeling, Evaluation, and Conclusion](3-3_ModelingEvaluationConclusion.ipynb)**

<a id='problemstatement'></a>

## Problem Statement

For project 3, the assignment is two-fold:

- Using Pushshift's API, you'll collect posts from two subreddits of your choosing.
- You'll then use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.


Today in the English-speaking world, the term **cult** has become widely used. It is typically used with a negative connotation and, in most scenarios, is seen as an *ad hominem* attack on a group with different practices and doctrines. The term **cult** defined on X is described as: 
“definition” 

This could just as easily describe a religion, which causes confusion about how we define these two words. Even within sociology there is prominent debate on how to define the word, because of the range of cultural and historical contexts and the vastly differing situations in which it is used. 

The initiative to develop better definitions for these words stems from a desire to be able to identify groups that may be dangerous versus those that are more benign. Defining these two terms comes with its own controversy: limiting the definition of religion could interfere with freedom of religion, while a definition that is too broad could allow more dangerous and abusive groups to arise and excuse them from legal obligations. 

For this project, the goal is to accurately predict whether someone is speaking in reference to a cult or a religion, both to further develop how sociologists define these two terms and as an initial step toward ML that could identify a potentially dangerous cult rather than mistaking it for a religion. Using natural language processing techniques on 4,400 posts from two subreddits, **r/religion** and **r/cults**, our goal is to identify keywords, part-of-speech patterns, and sentiment analysis trends to ultimately and accurately predict the correct subreddit for each post. 


<a id='dc'></a>

## Data Collection: API 
---


### 1) First, import the necessary packages:

In [14]:
#Import Packages 
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 

### 2) Use the API to request post data from Reddit:

### Below I am requesting post data from two subreddits by using the [Pushshift](https://github.com/pushshift/api) API. 
- The Pushshift API returns a maximum of 100 posts per request. Because I am aiming for around 2,000 rows of data, I need to make multiple requests. Pushshift has a parameter called **'before'** that limits results to posts created before a given timestamp. I update it after each request; without it, every request would return the same posts. 


In [162]:
#defining the function name as load_data
#giving it two arguments:
#1) url: the pushshift api url for reddit submissions (aka posts)
#2) params: specifies the subreddit we are requesting (updated in place with 'before')
def load_data(url, params):
    #range(22) makes 22 requests of up to 100 posts each (the max allowed by the pushshift api),
    #which is about 2,200 posts per subreddit
    for i in range(22):

        #requesting with my url and params from the API and saving the response as res
        res = requests.get(url, params=params)

        #if the status code is not 200 the request failed, so report it and stop
        if res.status_code != 200:
            print("Failed to load data from reddit")
            return

        #pulling the list of submissions out of the returned json
        posts = res.json()['data']
        #stop early if there are no older posts left to request
        if not posts:
            return

        #saving the posts list to a pandas dataframe
        df = pd.DataFrame(posts)
        #saving the dataframe as a csv named after the subreddit and the request number
        df.to_csv(f"./{params['subreddit']}_{i}.csv")
        #the smallest created_utc is the oldest post in this batch;
        #using it as 'before' makes the next request return older posts
        params['before'] = int(df['created_utc'].min())
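
The paging idea behind `load_data` can also be tried without touching the network. The sketch below is only an illustration: `collect_pages` and `fake_fetch` are hypothetical names that are not part of the notebook, and `fake_fetch` stands in for the Pushshift request by serving made-up timestamps.

```python
def collect_pages(fetch_page, params, n_pages=22):
    """Collect up to n_pages batches, moving the 'before' cursor back each time.

    fetch_page is any callable that takes a params dict and returns a list of
    post dicts (hypothetical; in the notebook this role is played by
    requests.get(...).json()['data']).
    """
    params = dict(params)  # copy so the caller's dict is not mutated
    pages = []
    for _ in range(n_pages):
        posts = fetch_page(params)
        if not posts:  # stop early when there is nothing older to fetch
            break
        pages.append(posts)
        # the oldest timestamp in this batch becomes the next 'before' cutoff
        params['before'] = min(p['created_utc'] for p in posts)
    return pages


def fake_fetch(params):
    """Stand-in for the API: serves posts with timestamps 1..10, newest first."""
    before = params.get('before', 11)
    size = params.get('size', 100)
    stamps = [t for t in range(10, 0, -1) if t < before]
    return [{'created_utc': t} for t in stamps[:size]]


pages = collect_pages(fake_fetch, {'subreddit': 'religion', 'size': 4})
print([[p['created_utc'] for p in page] for page in pages])
```

Because each batch only contains posts older than the previous cutoff, no post appears twice, which is the reason for updating `'before'` after every request.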

###  Requesting my Data through my function load_data():

1.  r/religion:

In [163]:
#Url and param variables for the first of my subreddits: religion
url = 'https://api.pushshift.io/reddit/search/submission'
params = {'subreddit': 'religion', 'size': 100}

In [164]:
#running the function
load_data(url,params)

After running this function I have 22 .csv files saved to my computer (numbered 0 through 21). Each file includes data on 100 posts from the subreddit: religion. 

---

2. r/cults

In [165]:
#Url and param variables for the 2nd of my subreddits: cult
url_2 = 'https://api.pushshift.io/reddit/search/submission'
params_2 = {'subreddit': 'cults', 'size':100}

In [166]:
#running the function
load_data(url_2,params_2)

After running this function I have 22 .csv files saved to my computer (numbered 0 through 21). Each file includes data on 100 posts from the subreddit: cults.

---

###  3) Reading in the csv files:

In [24]:
#reading in all 22 cult csv files as pandas dataframes
cult_0 = pd.read_csv('./api_output/cults_0.csv')
cult_1 = pd.read_csv('./api_output/cults_1.csv')
cult_2 = pd.read_csv('./api_output/cults_2.csv')
cult_3 = pd.read_csv('./api_output/cults_3.csv')
cult_4 = pd.read_csv('./api_output/cults_4.csv')
cult_5 = pd.read_csv('./api_output/cults_5.csv')
cult_6 = pd.read_csv('./api_output/cults_6.csv')
cult_7 = pd.read_csv('./api_output/cults_7.csv')
cult_8 = pd.read_csv('./api_output/cults_8.csv')
cult_9 = pd.read_csv('./api_output/cults_9.csv')
cult_10 = pd.read_csv('./api_output/cults_10.csv')
cult_11 = pd.read_csv('./api_output/cults_11.csv')
cult_12 = pd.read_csv('./api_output/cults_12.csv')
cult_13 = pd.read_csv('./api_output/cults_13.csv')
cult_14 = pd.read_csv('./api_output/cults_14.csv')
cult_15 = pd.read_csv('./api_output/cults_15.csv')
cult_16 = pd.read_csv('./api_output/cults_16.csv')
cult_17 = pd.read_csv('./api_output/cults_17.csv')
cult_18 = pd.read_csv('./api_output/cults_18.csv')
cult_19 = pd.read_csv('./api_output/cults_19.csv')
cult_20 = pd.read_csv('./api_output/cults_20.csv')
cult_21 = pd.read_csv('./api_output/cults_21.csv')


In [25]:
#Reading in all 22 religion csv files as pandas dataframes
rel_0 = pd.read_csv('./api_output/religion_0.csv')
rel_1 = pd.read_csv('./api_output/religion_1.csv')
rel_2 = pd.read_csv('./api_output/religion_2.csv')
rel_3 = pd.read_csv('./api_output/religion_3.csv')
rel_4 = pd.read_csv('./api_output/religion_4.csv')
rel_5 = pd.read_csv('./api_output/religion_5.csv')
rel_6 = pd.read_csv('./api_output/religion_6.csv')
rel_7 = pd.read_csv('./api_output/religion_7.csv')
rel_8 = pd.read_csv('./api_output/religion_8.csv')
rel_9 = pd.read_csv('./api_output/religion_9.csv')
rel_10 = pd.read_csv('./api_output/religion_10.csv')
rel_11 = pd.read_csv('./api_output/religion_11.csv')
rel_12 = pd.read_csv('./api_output/religion_12.csv')
rel_13 = pd.read_csv('./api_output/religion_13.csv')
rel_14 = pd.read_csv('./api_output/religion_14.csv')
rel_15 = pd.read_csv('./api_output/religion_15.csv')
rel_16 = pd.read_csv('./api_output/religion_16.csv')
rel_17 = pd.read_csv('./api_output/religion_17.csv')
rel_18 = pd.read_csv('./api_output/religion_18.csv')
rel_19 = pd.read_csv('./api_output/religion_19.csv')
rel_20 = pd.read_csv('./api_output/religion_20.csv')
rel_21 = pd.read_csv('./api_output/religion_21.csv')
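
The 44 `read_csv` calls above could also be written as a loop over the files in a folder. This is a minimal sketch, assuming the `<subreddit>_<n>.csv` naming used by `load_data`; the helper name `load_subreddit_csvs` and the temporary demo files are made up for illustration. It also passes `index_col=0`, which is a choice of this sketch (the notebook keeps the saved index as an `Unnamed: 0` column).

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_subreddit_csvs(folder, subreddit):
    """Read every '<subreddit>_<n>.csv' file in folder, ordered by n."""
    paths = sorted(
        Path(folder).glob(f"{subreddit}_*.csv"),
        key=lambda p: int(p.stem.rsplit("_", 1)[1]),  # numeric, so 10 sorts after 9
    )
    return [pd.read_csv(p, index_col=0) for p in paths]


# tiny demo with made-up files written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        pd.DataFrame({"id": [f"post{i}"], "subreddit": ["cults"]}).to_csv(
            Path(tmp) / f"cults_{i}.csv"
        )
    cult_frames = load_subreddit_csvs(tmp, "cults")

print(len(cult_frames), list(cult_frames[0].columns))
```

Returning a list keeps the frames in request order and avoids one variable name per file.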

---
### Pandas DataFrame rel_0:

My goal is to merge all these pandas dataframes together. Before I do this, I am going to explore my first dataframe, rel_0, to get a sense of its shape, the types of columns, and which columns I will want in my final dataframe. 

---

In [26]:
rel_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 72 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Unnamed: 0                     100 non-null    int64  
 1   all_awardings                  100 non-null    object 
 2   allow_live_comments            100 non-null    bool   
 3   author                         100 non-null    object 
 4   author_flair_css_class         1 non-null      object 
 5   author_flair_richtext          93 non-null     object 
 6   author_flair_text              8 non-null      object 
 7   author_flair_type              93 non-null     object 
 8   author_fullname                93 non-null     object 
 9   author_patreon_flair           93 non-null     object 
 10  author_premium                 93 non-null     object 
 11  awarders                       100 non-null    object 
 12  can_mod_post                   100 non-null    bool

In [27]:
rel_0.shape

(100, 72)

In [28]:
rel_0[['subreddit','id','title','selftext','created_utc']].head(5)

Unnamed: 0,subreddit,id,title,selftext,created_utc
0,religion,lwi96e,my boyfriends parents don't want him with me b...,[removed],1614736208
1,religion,lwhxen,Demasiado Natural.,,1614735154
2,religion,lwgtbw,signs of the end times,,1614731605
3,religion,lwgq4r,What is the text of Isaiah 14:12-17 in the Torah?,I'm having a discussion elsewhere and the post...,1614731322
4,religion,lwgefv,Is Joe Biden allowed to do this?,I am not Catholic but is Joe Biden allowed to ...,1614730304


My pandas dataframe has 100 rows (as expected) and 72 columns. Based on the .info() output I can tell there are a lot of columns in the dataframe that I will not use for my model. Since the goal of this model is to use text data to determine the subreddit, I am only going to keep the columns I will need after concatenating the dataframes. 

---
**The columns I am keeping are:**
- **subreddit:** the name of the subreddit the post came from
- **id:** post id
- **title:** title of the post
- **selftext:** the description text, also referred to as the body of the post
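
As a small illustration of stacking batches on just these four columns, here is a sketch on two made-up one-row frames (`batch_a`, `batch_b`, and the extra `score` column are invented for the example). It selects the columns before concatenating and passes `ignore_index=True`, which is an optional variation that gives the combined frame a fresh 0..n-1 index.

```python
import pandas as pd

keep_cols = ["subreddit", "id", "title", "selftext"]

# two made-up batches standing in for the per-request DataFrames
batch_a = pd.DataFrame(
    {"subreddit": ["religion"], "id": ["a1"], "title": ["t1"], "selftext": ["s1"], "score": [1]}
)
batch_b = pd.DataFrame(
    {"subreddit": ["cults"], "id": ["b1"], "title": ["t2"], "selftext": [None], "score": [5]}
)

# select the shared columns first, then stack the rows
combined = pd.concat([b[keep_cols] for b in (batch_a, batch_b)], ignore_index=True)
print(combined.shape)  # (2, 4)
```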

In [29]:
#using pd.concat to concatenate all 44 pandas dataframes (22 per subreddit)
df = pd.concat([rel_0,
                rel_1, 
                rel_2, 
                rel_3, 
                rel_4, 
                rel_5, 
                rel_6, 
                rel_7, 
                rel_8, 
                rel_9,
                rel_10,
                rel_11,
                rel_12,
                rel_13,
                rel_14,
                rel_15,
                rel_16,
                rel_17,
                rel_18,
                rel_19,
                rel_20,
                rel_21,
                cult_0,
                cult_1,
                cult_2,
                cult_3,
                cult_4,
                cult_5,
                cult_6,
                cult_7,
                cult_8,
                cult_9,
                cult_10,
                cult_11,
                cult_12,
                cult_13,
                cult_14,
                cult_15,
                cult_16,
                cult_17,
                cult_18,
                cult_19,
                cult_20,
                cult_21], axis=0)

In [30]:
#assigning the dataframe to only keep the four columns I need
df = df[['subreddit','id','title','selftext']]

In [31]:
#Using df.shape to know how many rows and columns my new dataframes has:
df.shape

(4400, 4)

In [32]:
#saving my combined data set to a csv file just in case.
df.to_csv('religion_cults_reddit.csv',index=False)

<a id='datacleanEDA'></a>

# Data Cleaning

---

### 1) Data Cleaning : Missing values

In [565]:
df = pd.read_csv('./religion_cults_reddit.csv')

In [566]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4400 entries, 0 to 4399
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  4400 non-null   object
 1   id         4400 non-null   object
 2   title      4400 non-null   object
 3   selftext   2616 non-null   object
dtypes: object(4)
memory usage: 137.6+ KB


In [567]:
#finding the total number of missing values in the selftext column
df['selftext'].isnull().sum()

1784

In [568]:
#using groupby and a lambda function to see the percentage of missing values by each subreddit
df.groupby("subreddit").apply(lambda x: x.isnull().mean())

subreddit (index),subreddit,id,title,selftext
cults,0.0,0.0,0.0,0.458182
religion,0.0,0.0,0.0,0.352727


---

**Missing Values:**

Based on the .info() output, each post has a value under title, but there are 1,784 posts without selftext data. This means those posts might not have any text in the body and instead contain a link, video, image, etc. The table above shows that about **46%** of the cults subreddit posts and **35%** of the religion subreddit posts have no selftext. 
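
For the selftext column alone, the same per-subreddit share of missing values can be computed a little more directly. The sketch below uses a made-up four-row frame (`toy`) rather than the real data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "subreddit": ["cults", "cults", "religion", "religion"],
        "selftext": ["some text", np.nan, "more text", "even more"],
    }
)

# isna() gives True/False per row; the mean of booleans is the share of True values
missing_share = toy["selftext"].isna().groupby(toy["subreddit"]).mean()
print(missing_share)
```

In this toy frame one of the two cults rows is missing, so the share is 0.5 for cults and 0.0 for religion.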

**What are my options:**


Assuming I want to use both the title and the selftext in my model:

**1. I could drop the rows with missing values, which would leave me with less data.** 
- This does not seem like a great option because it leaves me with a lot less data for my train/test split and could lead to poor model performance.

Assuming I only need one of these columns for my model:

**2. I could drop the entire 'selftext' column** 
- and only use the 'title' column to predict the subreddit. Since I have yet to even explore the data at this point, I do not feel great about this option. The title might not be enough to determine the topic, and the selftext could deliver some very interesting insights on the topics at hand. 

---
**Decision:** 
I originally requested the data with range(11). After determining that a good percentage of my rows would have NaN for **selftext**, I went back and re-requested my data from the API with double the range, range(22). I decided to do this so that even if I dropped those rows I would still have 2,616 rows of data, a healthy dataset size for modeling.  

---

In [569]:
df.head()

Unnamed: 0,subreddit,id,title,selftext
0,religion,lwi96e,my boyfriends parents don't want him with me b...,[removed]
1,religion,lwhxen,Demasiado Natural.,
2,religion,lwgtbw,signs of the end times,
3,religion,lwgq4r,What is the text of Isaiah 14:12-17 in the Torah?,I'm having a discussion elsewhere and the post...
4,religion,lwgefv,Is Joe Biden allowed to do this?,I am not Catholic but is Joe Biden allowed to ...


It looks like selftext can also hold the value [removed], which likely means the content of the post was removed. I will treat that value, and the similar [deleted] value, the same way as missing selftext.

In [571]:
df[df['selftext']== '[removed]']

Unnamed: 0,subreddit,id,title,selftext
0,religion,lwi96e,my boyfriends parents don't want him with me b...,[removed]
5,religion,lwgazb,I (24M) have tried to talk to my friend (21M) ...,[removed]
13,religion,lwao0g,What was your experience like with Eastern Ort...,[removed]
14,religion,lwad22,Video on the origin of creation of man by Jeff...,[removed]
19,religion,lw8qd3,Come And See,[removed]
...,...,...,...,...
4105,cults,hz7u8x,23-year-old woman had to flee danger from her ...,[removed]
4210,cults,hsskab,I want to join Occult for money ritual +234706...,[removed]
4304,cults,ho33ir,Become a Genius,[removed]
4358,cults,hivj1b,https://bsusa01.wixsite.com/ringizmorocket,[removed]


In [572]:
df[df['selftext'] == '[deleted]']

Unnamed: 0,subreddit,id,title,selftext
37,religion,ltvmmk,Why do people think America was built on Chris...,[deleted]
53,religion,ltr054,I'm Looking for outcasted religious people who...,[deleted]
63,religion,ltnu25,"Muslims, I am sure you get this all the time b...",[deleted]
66,religion,ltlqjn,"I want to convert to Islam, but I need your help",[deleted]
74,religion,ltekkv,"How, precisely, does confession work?",[deleted]
...,...,...,...,...
4142,cults,hwq518,Looking for articles/links that explore the na...,[deleted]
4156,cults,hvm7ov,4 psychological techniques cults use to recrui...,[deleted]
4184,cults,huawsr,"Mike Rinder goes ""further, harder, deeper, and...",[deleted]
4254,cults,hr549r,Mgtow cult investigation,[deleted]


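In the code above I filtered for selftext values equal to the string [removed], which matches 489 rows, and for the string [deleted]. Rather than dropping these rows, I am going to replace these placeholder values, along with the NaN values, with an empty string. That way every post keeps its title and all 4,400 rows stay in the dataset.  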

In [573]:
#defining a function to replace placeholder selftext values with an empty string
#this lets rows with removed or deleted content keep their title text
def clean_selftext(text_val):
    #[removed], [deleted], and a stray '[' carry no content, so return ''
    if text_val in ('[removed]', '[deleted]', '['):
        return ''
    #anything else (including NaN) is left as is
    return text_val

In [574]:
#using .map to call my clean_selftext function on the 'selftext' column. Then reassigning it to the column
#to make it stick
df['selftext'] = df['selftext'].map(clean_selftext)

In [575]:
#the NaN count is unchanged because the placeholders became '' rather than NaN
df['selftext'].isnull().sum()

1784

In [576]:
#filling the NaN values with an empty string (no rows are dropped) and getting the dataframe shape
df = df.fillna('')
df.shape

(4400, 4)
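
The `map` and `fillna` steps above can also be expressed with vectorized pandas calls. This is a sketch on a made-up three-row frame (`toy`), not the notebook's exact code; the final `str.strip()` is an addition of this sketch that removes the trailing space left when selftext is empty.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["post one", "post two", "post three"],
        "selftext": ["[removed]", np.nan, "actual body text"],
    }
)

placeholders = ["[removed]", "[deleted]"]
# replace the placeholder strings, then fill true missing values, both with ''
toy["selftext"] = toy["selftext"].replace(placeholders, "").fillna("")
# combine title and body into one text field
toy["text"] = (toy["title"] + " " + toy["selftext"]).str.strip()
print(toy["text"].tolist())
```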

In [577]:
df['text'] = df['title']  + ' ' + df['selftext']
df['subreddit_class'] = [1 if i == 'religion' else 0 for i in df['subreddit']]

In [578]:
df.head()

Unnamed: 0,subreddit,id,title,selftext,text,subreddit_class
0,religion,lwi96e,my boyfriends parents don't want him with me b...,,my boyfriends parents don't want him with me b...,1
1,religion,lwhxen,Demasiado Natural.,,Demasiado Natural.,1
2,religion,lwgtbw,signs of the end times,,signs of the end times,1
3,religion,lwgq4r,What is the text of Isaiah 14:12-17 in the Torah?,I'm having a discussion elsewhere and the post...,What is the text of Isaiah 14:12-17 in the Tor...,1
4,religion,lwgefv,Is Joe Biden allowed to do this?,I am not Catholic but is Joe Biden allowed to ...,Is Joe Biden allowed to do this? I am not Cath...,1


In [579]:
df['subreddit_class'].value_counts()

0    2200
1    2200
Name: subreddit_class, dtype: int64

In [580]:
#saving my combined data set to a csv file just in case.
df.to_csv('religion_cults_reddit_preprocessing.csv',index=False)

# Pre-Processing 

When dealing with text data, I want to apply some pre-processing steps in order to have clean text data:

- lowercasing
- removing any http/s url strings
- removing punctuation
- removing numbers
- removing words of two letters or fewer and all English stopwords
- removing any unwanted characters or symbols
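
To make these steps concrete, here is a minimal sketch that uses only Python's `re` module. The function name `simple_preprocess` and the tiny `STOP_WORDS` set are assumptions made for illustration; the notebook itself relies on NLTK's tokenizer and its full English stop word list.

```python
import re

# tiny illustrative stop word list (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"the", "and", "with", "this", "for", "not", "but", "don"}


def simple_preprocess(text):
    text = str(text).lower()                 # lowercase
    text = re.sub(r"http\S+", "", text)      # drop http/https urls
    text = re.sub(r"[0-9]+", "", text)       # drop digits
    tokens = re.findall(r"\w+", text)        # word tokens; punctuation falls away
    # keep words longer than two letters that are not stop words
    kept = [t for t in tokens if len(t) > 2 and t not in STOP_WORDS]
    return " ".join(kept)


print(simple_preprocess("Signs of the END times: see https://example.com 2021!"))
```

With this short stop word list the call prints `signs end times see`; a fuller list would likely remove more words.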

In [369]:
#(source: https://stackoverflow.com/questions/54396405/how-can-i-preprocess-nlp-text-lowercase-remove-special-characters-remove-numb)
import re
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer() 

#building these once instead of on every call (and every token)
stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')

def preprocess(sentence):
    #lowercase the text
    sentence = str(sentence).lower()
    sentence = sentence.replace('{html}', "")
    #remove html tags
    cleantext = re.sub(r'<.*?>', '', sentence)
    #remove urls
    rem_url = re.sub(r'http\S+', '', cleantext)
    #remove numbers
    rem_num = re.sub(r'[0-9]+', '', rem_url)
    #tokenize on word characters, which also drops punctuation
    tokens = tokenizer.tokenize(rem_num)
    #keep words longer than 2 letters that are not english stopwords
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stop_words]
    #note: the original cell also stemmed and lemmatized the tokens but never returned them,
    #so the output is the filtered (unstemmed) words, matching the results shown below
    return " ".join(filtered_words)

In [370]:
#going to clean the combined title and selftext column
df['cleanText']=df['text'].map(lambda s:preprocess(s))

#creating title_clean column of just the title cleaned
df['cleanTitle']=df['title'].map(lambda s:preprocess(s))


In [372]:
df.head()

Unnamed: 0,subreddit,id,title,selftext,text,subreddit_class,cleanText,cleanTitle
0,religion,lwi96e,my boyfriends parents don't want him with me b...,,my boyfriends parents don't want him with me b...,1,boyfriends parents want christian,boyfriends parents want christian
1,religion,lwhxen,Demasiado Natural.,,Demasiado Natural.,1,demasiado natural,demasiado natural
2,religion,lwgtbw,signs of the end times,,signs of the end times,1,signs end times,signs end times
3,religion,lwgq4r,What is the text of Isaiah 14:12-17 in the Torah?,I'm having a discussion elsewhere and the post...,What is the text of Isaiah 14:12-17 in the Tor...,1,text isaiah torah discussion elsewhere poster ...,text isaiah torah
4,religion,lwgefv,Is Joe Biden allowed to do this?,I am not Catholic but is Joe Biden allowed to ...,Is Joe Biden allowed to do this? I am not Cath...,1,joe biden allowed catholic joe biden allowed p...,joe biden allowed


In [515]:
df.to_csv('religion_cults_reddit_preprocessed.csv',index=False)

In the next notebook I will do exploratory analysis across these two subreddit datasets with natural language processing techniques such as count vectorization (CountVectorizer) and TF-IDF.

**[Exploratory Analysis Notebook](3-2_ExploratoryDataAnalysis.ipynb)** 
 
---