# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 3: Web APIs & Classification

### Description

In week four we learned about a few different classifiers. In week five we'll learn about web scraping, APIs, and Natural Language Processing (NLP). Now we're going to put those skills to the test.

For project 3, your goal is two-fold:
1. Using Reddit's API, you'll collect posts from two subreddits of your choosing.
2. You'll then use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.


#### About the API

Reddit's API is fairly straightforward. For example, if I want the posts from [`/r/boardgames`](https://www.reddit.com/r/boardgames), all I have to do is add `.json` to the end of the URL: https://www.reddit.com/r/boardgames.json

To help you get started, we have a primer video on how to use Reddit's API: https://www.youtube.com/watch?v=5Y3ZE26Ciuk
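As a minimal sketch of the `.json` trick above, the snippet below requests one page of a subreddit with `requests` and pulls a few fields out of each post. The helper names (`fetch_page`, `parse_listing`), the `User-Agent` string, and the fields kept are illustrative assumptions, not part of the assignment. The `data` > `children` > `data` layout reflects how Reddit's listing JSON is commonly structured, so check it against a live response before relying on it.

```python
import requests

# Hypothetical User-Agent string; Reddit often throttles the default one sent by requests.
HEADERS = {"User-Agent": "project-3-scraper/0.1"}


def fetch_page(subreddit, after=None):
    """Request one page of posts from a subreddit and return the decoded JSON."""
    url = f"https://www.reddit.com/r/{subreddit}.json"
    params = {"after": after} if after else {}
    response = requests.get(url, headers=HEADERS, params=params, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()


def parse_listing(payload):
    """Return (posts, after): a list of post dicts and the token for the next page."""
    listing = payload["data"]
    posts = [
        {
            "name": child["data"]["name"],
            "title": child["data"]["title"],
            "subreddit": child["data"]["subreddit"],
        }
        for child in listing["children"]
    ]
    return posts, listing["after"]
```

A typical call would be `posts, after = parse_listing(fetch_page("boardgames"))`, passing `after` back into `fetch_page` to request the following page.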

---

### Requirements

- Gather and prepare your data using the `requests` library.
- **Create and compare two models**. One of these must be a Bayes classifier; the other can be a classifier of your choosing: logistic regression, KNN, SVM, etc.
- A Jupyter Notebook with your analysis for a peer audience of data scientists.
- An executive summary of the results you found.
- A short presentation outlining your process and findings for a semi-technical audience.
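One way the "create and compare two models" requirement could look in practice is sketched below: a Naive Bayes classifier and a logistic regression, each fed by a `CountVectorizer`, scored on the same held-out split. The toy titles and labels are invented for illustration, and the specific estimators and settings are assumptions rather than a prescribed solution.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented stand-in data; real titles would come from the two scraped subreddits.
texts = [
    "best board games for two players", "new expansion for my favorite card game",
    "how to teach rules to new players", "dice tower review of a worker placement game",
    "storage ideas for game boxes", "cooperative games for family night",
    "looking for a strategy game like chess", "painting miniatures for a dungeon crawler",
    "easy weeknight pasta recipe", "how long to roast a chicken",
    "my sourdough starter will not rise", "best knife for chopping vegetables",
    "slow cooker chili with beans", "what to do with leftover rice",
    "baking bread in a dutch oven", "homemade tomato sauce from scratch",
]
labels = ["boardgames"] * 8 + ["cooking"] * 8

# Hold out a stratified test set so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

models = {
    "naive_bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out posts

print(scores)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary from being fit on test data, which makes the two scores a fair comparison.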

**Pro Tip 1:** You can find a good example executive summary [here](https://www.proposify.biz/blog/executive-summary).

**Pro Tip 2:** Reddit will give you 25 posts **per request**. To get enough data, you'll need to hit Reddit's API **repeatedly** (most likely in a `for` loop). _Be sure to use the `time.sleep()` function at the end of your loop to allow for a break in between requests. **THIS IS CRUCIAL**_

**Pro Tip 3:** The API will cap you at 1,000 posts for each subreddit (assuming the subreddit has that many posts).

**Pro Tip 4:** At the end of each loop, be sure to save the results from your scrape as a `csv`: JSON from Reddit > Pandas DataFrame > CSV. That way, if something goes wrong in your loop, you won't lose all your data.
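The tips above (repeat the request, pause between requests, save after every pass) can be sketched as a single loop. This is an illustration under stated assumptions: `scrape` and its parameters are hypothetical names, and `fetch_posts` stands in for whatever function you write to request one page and return its posts together with the token for the next page.

```python
import time

import pandas as pd


def scrape(fetch_posts, n_requests, csv_path, pause=1.0):
    """Collect posts page by page, checkpointing to CSV after every request.

    fetch_posts(after) is a hypothetical callable that returns (posts, after):
    a list of dicts for one page and the token that identifies the next page.
    """
    collected = []
    after = None
    for _ in range(n_requests):
        posts, after = fetch_posts(after)
        collected.extend(posts)

        # JSON from Reddit > pandas DataFrame > CSV, rewritten on every pass
        # so a crash mid-loop costs at most one page of data.
        pd.DataFrame(collected).to_csv(csv_path, index=False)

        if after is None:  # no further pages to request
            break
        time.sleep(pause)  # pause between requests to go easy on the server
    return pd.DataFrame(collected)
```

With 25 posts per request, roughly 40 requests would be needed to reach the 1,000-post cap, so `n_requests=40` is a reasonable starting point.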

---

### Necessary Deliverables / Submission

- Code and executive summary must be in a clearly commented Jupyter Notebook.
- You must submit your slide deck.
- Materials must be submitted by **10:00 AM on Monday, April 8th**.

---

## Rubric
Your local instructor will evaluate your project, for the most part, using the following criteria. Make sure you consider and follow most, if not all, of the recommendations outlined below **while** working through your project.

For Project 3 the evaluation categories are as follows:<br>
**The Data Science Process**
- Problem Statement
- Data Collection
- Data Cleaning & EDA
- Preprocessing & Modeling
- Evaluation and Conceptual Understanding
- Conclusion and Recommendations

**Organization and Professionalism**
- Organization
- Visualizations
- Python Syntax and Control Flow
- Presentation

**Scores will be out of 30 points based on the 10 categories in the rubric.** <br>
*3 points per section*<br>

| Score | Interpretation |
| --- | --- |
| **0** | *Project fails to meet the outlined expectations; many major issues exist.* |
| **1** | *Project close to meeting expectations; many minor issues or a few major issues.* |
| **2** | *Project meets expectations; few (and relatively minor) mistakes.* |
| **3** | *Project demonstrates a thorough understanding of all of the considerations outlined.* |


### The Data Science Process

**Problem Statement** 
- Is it clear what the goal of the project is?
- What type of model will be developed?
- How will success be evaluated?
- Is the scope of the project appropriate?
- Is it clear who cares about this or why this is important to investigate?
- Does the student consider the audience and the primary and secondary stakeholders?

**Data Collection** 
- Was enough data gathered to generate a significant result?
- Was data collected that was useful and relevant to the project?
- Was data collection and storage optimized through custom functions, pipelines, and/or automation?
- Was thought given to the server receiving the requests such as considering number of requests per second?

**Data Cleaning and EDA** 
- Are missing values imputed/handled appropriately?
- Are distributions examined and described?
- Are outliers identified and addressed?
- Are appropriate summary statistics provided?
- Are steps taken during data cleaning and EDA framed appropriately?
- Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?

**Preprocessing and Modeling** 
- Is text data successfully converted to a matrix representation?
- Are methods such as stop words, stemming, and lemmatization explored?
- Does the student properly split and/or sample the data for validation/training purposes?
- Does the student test and evaluate a variety of models to identify a production algorithm (**AT MINIMUM:** Bayes and one other model)?
- Does the student defend their choice of production model relevant to the data at hand and the problem?
- Does the student explain how the model works and evaluate its performance successes/downfalls?
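The stop-word and stemming steps mentioned above could be explored with a small helper like the one below. It is a sketch under assumptions: `preprocess` is a hypothetical name, the stop-word list is scikit-learn's built-in English list, and Porter stemming is only one of several options (a lemmatizer could be swapped in).

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

tokenizer = RegexpTokenizer(r"\w+")  # keep runs of word characters, drop punctuation
stemmer = PorterStemmer()


def preprocess(text):
    """Lowercase, tokenize, drop English stop words, and stem what remains."""
    tokens = tokenizer.tokenize(text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in ENGLISH_STOP_WORDS]


print(preprocess("The players were running games!"))
```

Comparing model scores with and without this step is one way to show that the methods were explored rather than applied by default.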

**Evaluation and Conceptual Understanding** 
- Does the student accurately identify and explain the baseline score?
- Does the student select and use metrics relevant to the problem objective?
- Does the student interpret the results of their model for purposes of inference?
- Is domain knowledge demonstrated when interpreting results?
- Does the student provide appropriate interpretation with regards to descriptive and inferential statistics?
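For a binary classifier, the baseline score is usually taken to be the accuracy of always predicting the majority class. The short sketch below computes it; the function name and the example labels are illustrative assumptions.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Invented example: 6 posts from one subreddit, 4 from the other.
labels = ["boardgames"] * 6 + ["cooking"] * 4
print(baseline_accuracy(labels))  # 0.6
```

A model is only useful if it beats this number, which is why the rubric asks for the baseline to be stated and explained.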

**Conclusion and Recommendations** 
- Does the student provide appropriate context to connect individual steps back to the overall project?
- Is it clear how the final recommendations were reached?
- Are the conclusions/recommendations clearly stated?
- Does the conclusion answer the original problem statement?
- Does the student address how findings of this research can be applied for the benefit of stakeholders?
- Are future steps to move the project forward identified?


### Organization and Professionalism

**Project Organization**
- Are modules imported correctly (using appropriate aliases)?
- Are data imported/saved using relative paths?
- Does the README provide a good executive summary of the project?
- Is markdown formatting used appropriately to structure notebooks?
- Is there an appropriate number of comments to support the code?
- Are files & directories organized correctly?
- Are there unnecessary files included?
- Do files and directories have well-structured, appropriate, consistent names?

**Visualizations**
- Are sufficient visualizations provided?
- Do plots accurately demonstrate valid relationships?
- Are plots labeled properly?
- Are plots interpreted appropriately?
- Are plots formatted and scaled appropriately for inclusion in a notebook-based technical report?

**Python Syntax and Control Flow**
- Is care taken to write human readable code?
- Is the code syntactically correct (no runtime errors)?
- Does the code generate desired results (logically correct)?
- Does the code follow general best practices and style guidelines?
- Are Pandas functions used appropriately?
- Are `sklearn` and `NLTK` methods used appropriately?

**Presentation**
- Is the problem statement clearly presented?
- Does a strong narrative run through the presentation building toward a final conclusion?
- Are the conclusions/recommendations clearly stated?
- Is the level of technicality appropriate for the intended audience?
- Is the student substantially over or under time?
- Does the student appropriately pace their presentation?
- Does the student deliver their message with clarity and volume?
- Are appropriate visualizations generated for the intended audience?
- Are visualizations necessary and useful for supporting conclusions/explaining findings?


---

### Why did we choose this project for you?
This project covers three of the biggest concepts we cover in the class: Classification Modeling, Natural Language Processing and Data Wrangling/Acquisition.

Part 1 of the project focuses on **Data wrangling/gathering/acquisition**. This is an important skill because not all the data you need will arrive in clean CSVs or a single SQL table. Wherever you land, there is a good chance you will have to gather data from unstructured or semi-structured sources: requesting it from an API when possible, but often scraping it because the source has no API (or a terribly documented one).

Part 2 of the project focuses on **Natural Language Processing** and converting standard text data (like Titles and Comments) into a format that allows us to analyze it and use it in modeling.

Part 3 of the project focuses on **Classification Modeling**. Given that project 2 was a regression-focused problem, we needed to give you a classification-focused problem to practice the various models, means of assessment, and preprocessing associated with classification.


# Code Starts Here

In [1]:
import praw
import pandas as pd
from datetime import datetime as dt

In [2]:
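import os

# Read credentials from environment variables instead of hard-coding them in
# the notebook. The variable names below are placeholders; set your own.
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='praw',
                     username=os.environ['REDDIT_USERNAME'],
                     password=os.environ['REDDIT_PASSWORD'])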

In [3]:
right = reddit.subreddit('republican')

In [4]:
data_dict = {}

In [7]:
right_top = right.top(limit=1000)

for i, post in enumerate(right_top):
    print(i)  # A progress tracker

    # Re-fetch the submission and expand every "load more comments" stub
    # so that the flattened list holds the full comment tree.
    submission = reddit.submission(id=post.id)
    submission.comments.replace_more(limit=None)
    comments = [comment.body for comment in submission.comments.list()]

    key = f"{post.id}_{post.subreddit}"
    data_dict[key] = {
        "title": post.title,
        "id": post.id,
        "subreddit": str(post.subreddit),  # keep the name, not the PRAW object
        "time": dt.fromtimestamp(post.created).strftime('%c'),
        "num_comments": post.num_comments,
        "comments": comments,
    }

In [8]:
left = reddit.subreddit('democrats')

left_top = left.top(limit=1000)

for i, post in enumerate(left_top):
    print(i)  # A progress tracker

    # Re-fetch the submission and expand every "load more comments" stub
    # so that the flattened list holds the full comment tree.
    submission = reddit.submission(id=post.id)
    submission.comments.replace_more(limit=None)
    comments = [comment.body for comment in submission.comments.list()]

    key = f"{post.id}_{post.subreddit}"
    data_dict[key] = {
        "title": post.title,
        "id": post.id,
        "subreddit": str(post.subreddit),  # keep the name, not the PRAW object
        "time": dt.fromtimestamp(post.created).strftime('%c'),
        "num_comments": post.num_comments,
        "comments": comments,
    }

In [12]:
import pickle

with open("rep_dem.pkl","wb") as f:
    pickle.dump(data_dict,f)

In [13]:
df = pd.DataFrame(data_dict).T

In [14]:
#Head and Tail
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
pd.concat([df.tail(2), df.head(2)])

| | comments | id | num_comments | subreddit | time | title |
| --- | --- | --- | --- | --- | --- | --- |
| 8nbfwl_democrats | [Why bother lying about such things? Why can’t... | 8nbfwl | 50 | democrats | Wed May 30 15:27:11 2018 | Trump places Nashville rally crowd size above ... |
| 7rft0w_democrats | [That whole family values thing was a lie just... | 7rft0w | 21 | democrats | Thu Jan 18 23:21:04 2018 | Keep a record of their treachery and hypocrisy. |
| atkbwd_Republican | [Also r/politicalhumor, When I first joined re... | atkbwd | 129 | Republican | Fri Feb 22 12:30:39 2019 | Thought you guys might like this |
| axz3oi_Republican | [18 U.S. Code § 2381. Treason\nWhoever, owing ... | axz3oi | 158 | Republican | Wed Mar 6 08:57:51 2019 | I’m sure everyone agrees to not let her back i... |


In [15]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from itertools import chain

word_bucket = list(chain.from_iterable(df["comments"].values))

In [17]:
word_bucket[0]

'Also r/politicalhumor'

In [None]:
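cvec = CountVectorizer(ngram_range=(2, 4),
                       stop_words='english',
                       max_features=None,
                       max_df=800,
                       min_df=50)

# Learn the vocabulary and build the document-term matrix in one step.
B = cvec.fit_transform(word_bucket)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
B = pd.DataFrame(B.toarray(), columns=cvec.get_feature_names_out())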

In [25]:
A = pd.DataFrame(B.sum(), columns=["count"]).sort_values("count")

In [29]:
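# Check whether the single token "robot" survived as a feature.
# With ngram_range=(2, 4) only 2- to 4-word phrases are kept, so nothing prints.
for i in A.index:
    if i == "robot":
        print(i)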

In [32]:
A.shape

(1325, 1)

In [None]:
A = df["comments"]["a039wy_democrats"][0:10]

In [None]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
# A is a list of comment strings, so tokenize each comment separately.
A_tokens = [tokenizer.tokenize(comment.lower()) for comment in A]

In [None]:
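from bs4 import BeautifulSoup
# A is a list of comment strings, so join it into one document before parsing,
# and name the parser explicitly to avoid a bs4 warning.
example1 = BeautifulSoup(" ".join(A), "html.parser")
print(example1.get_text())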

In [None]:
import re

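def ngrams(text, n):
    """Return every run of n consecutive space-separated tokens in text."""
    tokens = text.lower().split(' ')
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

# Splitting on spaces alone leaves punctuation attached to its token.
ngrams('a bm,C d e', 2)  # [['a', 'bm,c'], ['bm,c', 'd'], ['d', 'e']]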

In [None]:
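# Split on runs of non-word characters to get a list of word tokens.
A = re.split(r"\W+", r'a bm,C \d e')  # ['a', 'bm', 'C', 'd', 'e']
# re.search needs a string, so join the tokens back together first.
re.search(r"\w{1,}", " ".join(A))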

In [None]:
r"\w{1,}"  # one or more word characters; equivalent to r"\w+"

In [None]:
" ".join(A)

In [None]:
import string

# Lowercase each comment, strip punctuation, and split it into words.
punct_table = str.maketrans('', '', string.punctuation)
listed = [comment.lower().translate(punct_table).split(" ")
          for comment in df["comments"]["a039wy_democrats"]]

In [None]:
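# The original cell stopped mid-loop; this is one possible completion that
# pairs each token with its successor to form the comment's bigrams.
two_gram = []
for tokens in listed:
    gram = [tokens[j:j + 2] for j in range(len(tokens) - 1)]
    two_gram.append(gram)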

In [None]:
# Import lemmatizer.
from nltk.stem import WordNetLemmatizer

# Instantiate lemmatizer.
lemmatizer = WordNetLemmatizer()

In [None]:
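# lemmatize() expects a single word, so split each comment into words first.
tokens_lem = [[lemmatizer.lemmatize(word) for word in comment.lower().split()]
              for comment in df["comments"]["a039wy_democrats"]]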