## Presentation: 

https://www.youtube.com/watch?v=0J5WwS2b3CE&list=UUHibHZFb2jsm25vQnkQ4WTw&index=2

# Using Reddit's API for Predicting Comments

In [2]:
import pandas as pd
pd.core.common.is_list_like = pd.api.types.is_list_like
import pandas_datareader.data as web
from datetime import datetime


start = datetime(2015, 9, 1)
end = datetime(2018, 9, 7)
web.DataReader('AAPL', 'robinhood', start, end)

Unnamed: 0_level_0,Unnamed: 1_level_0,close_price,high_price,interpolated,low_price,open_price,session,volume
symbol,begins_at,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AAPL,2017-09-12,158.464000,161.517800,False,156.405100,160.187900,reg,71714046
AAPL,2017-09-13,157.272000,157.577400,False,155.558000,157.488800,reg,44907361
AAPL,2017-09-14,155.922400,157.025800,False,155.735300,156.621900,reg,23760749
AAPL,2017-09-15,157.498600,158.572400,False,155.646600,156.109600,reg,49114602
AAPL,2017-09-18,156.306600,158.109400,False,155.641700,157.725200,reg,28269435
AAPL,2017-09-19,156.365700,157.390200,False,156.080100,157.134100,reg,20810632
AAPL,2017-09-20,153.745400,155.902700,False,151.538700,155.548100,reg,52951364
AAPL,2017-09-21,151.105300,153.479400,False,150.474800,153.479400,reg,37511661
AAPL,2017-09-22,149.627600,150.002000,False,148.317400,149.280400,reg,46645443
AAPL,2017-09-25,148.307600,149.568500,False,146.938300,147.755900,reg,44387336


In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit throttles the default user agent sent by Python's `requests` library. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [3]:
#import necessary stuff
import requests
import json
import pandas as pd
import time
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder


In [4]:
#plug in reddit nfl
URL = "http://www.reddit.com/r/nfl.json"

In [5]:
#plug in reddit politics
URL2 = "http://www.reddit.com/r/politics.json"

In [6]:
#get reddit nfl in requests.get
res = requests.get(URL, headers={'User-agent': 'Harry 1.0'})

In [7]:
#get reddit politics in requests.get
res2 = requests.get(URL2, headers={'User-agent': 'Harry 1.0'})

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```
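
The listing that comes back nests each post two levels down: the top-level `data` key holds `children`, and each child carries its own `data` dictionary. Here is a minimal sketch of walking that shape, using a hand-written dictionary rather than real Reddit output (the sample values and the `extract_posts` helper are illustrative assumptions):

```python
# Hypothetical, hand-written stand-in for the dictionary returned by res.json().
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_example",  # made-up token for illustration
        "children": [
            {"kind": "t3", "data": {"subreddit": "nfl", "title": "First post"}},
            {"kind": "t3", "data": {"subreddit": "nfl", "title": "Second post"}},
        ],
    },
}


def extract_posts(listing):
    """Return the inner 'data' dict of every child in a listing."""
    return [child["data"] for child in listing["data"]["children"]]


posts = extract_posts(sample_listing)
titles = [post["title"] for post in posts]
next_page_token = sample_listing["data"]["after"]
```

The same three lookups (`['data']`, `['children']`, and each child's `['data']`) are what the cells below step through one at a time.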

In [8]:
# check the status code of the response
res.status_code

200

In [9]:
res2.status_code

200

In [10]:
reddit_dict = res.json()

In [11]:
reddit_dict.keys()

dict_keys(['kind', 'data'])

In [12]:
reddit_dict2 = res2.json()

In [13]:
reddit_dict2.keys()

dict_keys(['kind', 'data'])

In [14]:
reddit_dict2['kind']

'Listing'

In [15]:
reddit_dict2['data']

{'modhash': '',
 'dist': 27,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'politics',
    'selftext': 'William Browder, founder and CEO of Hermitage Capital Management, was the largest foreign investor in Russia until 2005, when he was denied entry to the country for exposing corruption in Russian state-owned companies.\n\t\t\t\t\t\nIn 2009 his Russian lawyer, Sergei Magnitsky, was killed in a Moscow prison after uncovering and exposing a US $230 million fraud committed by Russian government officials. Because of their impunity in Russia, Browder has spent the last eight years conducting a global campaign to impose visa bans and asset freezes on individual human rights abusers, particularly those who played a role in Magnitsky’s false arrest, torture and death.\n\t\t\t\t\t\nThe USA was the first to impose these sanctions with the passage of the 2012 “Magnitsky Act.” A Global Magnitsky Bill, which broadens the scope of the US Magnitsky Act to human 

In [16]:
reddit_dict['kind']

'Listing'

In [17]:
reddit_dict['data']

{'modhash': '',
 'dist': 27,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'nfl',
    'selftext': "Welcome to today's open thread, where /r/nfl users can discuss anything they wish not related directly to the NFL.\n\nWant to talk about personal life? Cool things about your fandom? Whatever happens to be dominating today's news cycle? Do you have something to talk about that didn't warrant it's own thread? This is the place for it!\n\n---\n\nRemember, that there are other subreddits that may be a good fit for what you want to post - every day all day!\n\n* /r/NFLFandom for showing off your fandom\n* /r/NFL_Draft for talking in depth about the draft\n* /r/NFLNoobs for noob questions, no judgement\n* /r/nflblogs for posting blog posts - including your own\n* /r/nflofftopic for talking about anything with NFL fans\n* /r/nfffffffluuuuuuuuuuuu for all kinds of humor posts\n* /r/nflcirclejerk for when /r/NFL just becomes too much\n* ... and more - see the 

In [18]:
reddit_dict2['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [19]:
reddit_dict['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [20]:
reddit_dict['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'nfl',
   'selftext': "Welcome to today's open thread, where /r/nfl users can discuss anything they wish not related directly to the NFL.\n\nWant to talk about personal life? Cool things about your fandom? Whatever happens to be dominating today's news cycle? Do you have something to talk about that didn't warrant it's own thread? This is the place for it!\n\n---\n\nRemember, that there are other subreddits that may be a good fit for what you want to post - every day all day!\n\n* /r/NFLFandom for showing off your fandom\n* /r/NFL_Draft for talking in depth about the draft\n* /r/NFLNoobs for noob questions, no judgement\n* /r/nflblogs for posting blog posts - including your own\n* /r/nflofftopic for talking about anything with NFL fans\n* /r/nfffffffluuuuuuuuuuuu for all kinds of humor posts\n* /r/nflcirclejerk for when /r/NFL just becomes too much\n* ... and more - see the sidebar!",
   'author_fullname': 't2_4i7ue',


In [21]:
reddit_dict2['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'politics',
   'selftext': 'William Browder, founder and CEO of Hermitage Capital Management, was the largest foreign investor in Russia until 2005, when he was denied entry to the country for exposing corruption in Russian state-owned companies.\n\t\t\t\t\t\nIn 2009 his Russian lawyer, Sergei Magnitsky, was killed in a Moscow prison after uncovering and exposing a US $230 million fraud committed by Russian government officials. Because of their impunity in Russia, Browder has spent the last eight years conducting a global campaign to impose visa bans and asset freezes on individual human rights abusers, particularly those who played a role in Magnitsky’s false arrest, torture and death.\n\t\t\t\t\t\nThe USA was the first to impose these sanctions with the passage of the 2012 “Magnitsky Act.” A Global Magnitsky Bill, which broadens the scope of the US Magnitsky Act to human rights abusers around the world,was passed at

In [22]:
# get the 'after' token for r/politics, used to request the next page via URL2
reddit_dict2['data']['after']

't3_9f72yv'

In [23]:
# get the 'after' token for r/nfl, used to request the next page via URL
reddit_dict['data']['after']

't3_9f9k0d'

In [24]:
reddit_dict['data']['children'][0].keys()

dict_keys(['kind', 'data'])

In [25]:
reddit_dict['data']['children'][0]['data']['selftext']

"Welcome to today's open thread, where /r/nfl users can discuss anything they wish not related directly to the NFL.\n\nWant to talk about personal life? Cool things about your fandom? Whatever happens to be dominating today's news cycle? Do you have something to talk about that didn't warrant it's own thread? This is the place for it!\n\n---\n\nRemember, that there are other subreddits that may be a good fit for what you want to post - every day all day!\n\n* /r/NFLFandom for showing off your fandom\n* /r/NFL_Draft for talking in depth about the draft\n* /r/NFLNoobs for noob questions, no judgement\n* /r/nflblogs for posting blog posts - including your own\n* /r/nflofftopic for talking about anything with NFL fans\n* /r/nfffffffluuuuuuuuuuuu for all kinds of humor posts\n* /r/nflcirclejerk for when /r/NFL just becomes too much\n* ... and more - see the sidebar!"

In [26]:
posts = [p['data'] for p in reddit_dict['data']['children']]

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
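
The three steps plus the pause can be sketched as one function. This is only an outline under stated assumptions: `collect_posts` and its `fetch_page` argument are hypothetical names (in practice `fetch_page` could wrap `requests.get(...).json()`), and the sketch assumes `after` comes back as `None` once there are no further pages. The demonstration uses two fake pages so it runs without any network access.

```python
import time


def collect_posts(fetch_page, base_url, max_pages=35, pause_seconds=3):
    """Follow 'after' tokens from page to page and gather post dictionaries.

    fetch_page is any callable that takes a URL and returns the parsed
    listing dictionary (hypothetical hook, e.g. a wrapper around requests.get).
    """
    posts = []
    after = None
    for _ in range(max_pages):
        url = base_url if after is None else f"{base_url}?after={after}"
        listing = fetch_page(url)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:  # no further pages, so stop instead of restarting
            break
        time.sleep(pause_seconds)  # pause between requests
    return posts


# Offline demonstration with two fake pages instead of real HTTP calls.
demo_url = "http://example.invalid/r/demo.json"
fake_pages = {
    demo_url: {
        "data": {"children": [{"data": {"title": "a"}}], "after": "t3_page2"}
    },
    demo_url + "?after=t3_page2": {
        "data": {"children": [{"data": {"title": "b"}}], "after": None}
    },
}
demo_posts = collect_posts(fake_pages.__getitem__, demo_url, pause_seconds=0)
```

Stopping when `after` is `None` matters: a loop that treats `None` as "first request" would otherwise start over from the first page and collect duplicates.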

In [27]:
pd.DataFrame(posts).to_csv('posts.csv')

In [28]:
reddit_dict['data']['after']

't3_9f9k0d'

In [29]:
#create custom url to work with and go through reddit
URL + '?after=' + reddit_dict['data']['after']

'http://www.reddit.com/r/nfl.json?after=t3_9f9k0d'

In [30]:
# need to bust out a for loop to gather an appropriate amount of data for nfl
posts = []  # empty lists of posts
after = None

for i in range(35):
    if after is None:    # no "after" token yet, so this is the first request
        current_url = URL
    else:
        current_url = URL + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers = {'User-agent': 'Foo Bar 1.0'})
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    pd.DataFrame(posts).to_csv('boardgames.csv', index = False)
    time.sleep(1)
df2 =pd.DataFrame(posts)

http://www.reddit.com/r/nfl.json
http://www.reddit.com/r/nfl.json?after=t3_9f9k0d
http://www.reddit.com/r/nfl.json?after=t3_9fa8lo
http://www.reddit.com/r/nfl.json?after=t3_9f3vq5
http://www.reddit.com/r/nfl.json?after=t3_9f0qk0
http://www.reddit.com/r/nfl.json?after=t3_9exp3h
http://www.reddit.com/r/nfl.json?after=t3_9ese96
http://www.reddit.com/r/nfl.json?after=t3_9f13qx
http://www.reddit.com/r/nfl.json?after=t3_9etjc4
http://www.reddit.com/r/nfl.json?after=t3_9exfoj
http://www.reddit.com/r/nfl.json?after=t3_9et7mj
http://www.reddit.com/r/nfl.json?after=t3_9eyasr
http://www.reddit.com/r/nfl.json?after=t3_9eyxtb
http://www.reddit.com/r/nfl.json?after=t3_9esv5m
http://www.reddit.com/r/nfl.json?after=t3_9eihu7
http://www.reddit.com/r/nfl.json?after=t3_9et4i1
http://www.reddit.com/r/nfl.json?after=t3_9eohv0
http://www.reddit.com/r/nfl.json?after=t3_9eo5s4
http://www.reddit.com/r/nfl.json?after=t3_9egn6q
http://www.reddit.com/r/nfl.json?after=t3_9eghso
http://www.reddit.com/r/nfl.json?aft

In [31]:
# need to bust out a for loop to gather an appropriate amount of data for politics
posts = []  # empty lists of posts
after = None

for i in range(35):
    if after is None:    # no "after" token yet, so this is the first request
        current_url = URL2
    else:
        current_url = URL2 + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers = {'User-agent': 'Foo Bar 1.0'})
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    pd.DataFrame(posts).to_csv('boardgames2.csv', index = False)
    time.sleep(1)
df1 = pd.DataFrame(posts)

http://www.reddit.com/r/politics.json
http://www.reddit.com/r/politics.json?after=t3_9f72yv
http://www.reddit.com/r/politics.json?after=t3_9f804k
http://www.reddit.com/r/politics.json?after=t3_9f6nql
http://www.reddit.com/r/politics.json?after=t3_9f9iuq
http://www.reddit.com/r/politics.json?after=t3_9f3weq
http://www.reddit.com/r/politics.json?after=t3_9farid
http://www.reddit.com/r/politics.json?after=t3_9f41kc
http://www.reddit.com/r/politics.json?after=t3_9f8v15
http://www.reddit.com/r/politics.json?after=t3_9ey1ua
http://www.reddit.com/r/politics.json?after=t3_9fash4
http://www.reddit.com/r/politics.json?after=t3_9f4455
http://www.reddit.com/r/politics.json?after=t3_9ewowd
http://www.reddit.com/r/politics.json?after=t3_9favfz
http://www.reddit.com/r/politics.json?after=t3_9f8zvy
http://www.reddit.com/r/politics.json?after=t3_9f8yk4
http://www.reddit.com/r/politics.json?after=t3_9f56km
http://www.reddit.com/r/politics.json?after=t3_9f7l89
http://www.reddit.com/r/politics.json?after=

In [32]:
#check shapes
df1.shape

(877, 96)

In [33]:
#check shapes
df2.shape

(874, 93)

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.

In [34]:
# concatenate the dataframes on top of each other
df3 = pd.concat([df1,df2], axis=0)

In [35]:
#see what columns are missing what
set(df1.columns) - set(df2.columns)

{'post_hint', 'preview', 'thumbnail_height', 'thumbnail_width'}

In [36]:
#get the shape of the concated dataframe to check it all aligns properly
df3.shape

(1751, 97)

In [37]:
#see what columns are missing what other way around
set(df2.columns) - set(df1.columns)

{'media_metadata'}

In [38]:
# save the combined dataframe to disk
df3.to_csv('patriotsworldnews.csv', index = False)

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
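
To make the count-versus-binary distinction concrete, here is a small sketch on two made-up titles (the titles and variable names are illustrative assumptions, not data from the scrape). `CountVectorizer` stores how often each term occurs, while `binary=True` records only whether it occurs at all.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles, chosen so that some words repeat within a title.
titles = ["go pats go", "vote vote vote today"]

count_vec = CountVectorizer()              # raw term counts
binary_vec = CountVectorizer(binary=True)  # 1 if the term appears at all

count_features = count_vec.fit_transform(titles).toarray()
binary_features = binary_vec.fit_transform(titles).toarray()

# Columns follow alphabetical vocabulary order: go, pats, today, vote.
vocabulary = sorted(count_vec.vocabulary_)
```

With these titles the count matrix is `[[2, 1, 0, 0], [0, 0, 1, 3]]` and the binary matrix is `[[1, 1, 0, 0], [0, 0, 1, 1]]`. On short post titles the two often differ very little, since words rarely repeat within one title.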

In [39]:
## YOUR CODE HERE

In [40]:
#import stuff for nlp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LogisticRegression, LinearRegression
logreg = LogisticRegression()
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [41]:
df3.head()

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail_height,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,Bill_Browder,,,verified,"[{'e': 'text', 't': 'Bill Browder'}]",,Bill Browder,...,,,"My name is Bill Browder, I’m the founder and C...",5142,https://www.reddit.com/r/politics/comments/9f8...,[],,False,all_ads,6
1,,,False,therealdanhill,,,,[],,,...,,,Rhode Island Primary Election,424,https://www.reddit.com/r/politics/comments/9f8...,[],,False,all_ads,6
2,,,False,FetcherLeVache,,,,[],,,...,93.0,140.0,Susan Collins Complains of “Bribery” After Non...,6375,https://slate.com/news-and-politics/2018/09/su...,[],,False,all_ads,6
3,,,False,ege3,,,,[],,,...,73.0,140.0,Evidence of Kavanaugh Perjury Mounts After Dur...,8278,https://www.commondreams.org/news/2018/09/12/e...,[],,False,all_ads,6
4,,,False,AFWxGuy,,,us-flag,"[{'a': ':flag-us:', 'e': 'emoji', 'u': 'https:...",7be44c6e-be39-11e6-b398-0eae18c336b8,:flag-us: America,...,73.0,140.0,A Series Of Suspicious Money Transfers Followe...,36510,https://www.buzzfeednews.com/article/anthonyco...,[],,False,all_ads,6


In [42]:
#create a y variable that is zeros and ones
y = df3.subreddit.eq('nfl').mul(1)

In [43]:
#create an x variable out of the titles
X = df3['title']
#do not have descriptions :/

In [44]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42, test_size = .18)

In [45]:
#import countvectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [46]:
#get cvec going
cvec = CountVectorizer(analyzer = 'word', stop_words = 'english', ngram_range=(0, 3), lowercase=True)

In [47]:
#get logisticregression
logreg =  LogisticRegression()

In [48]:
#fit the information
logreg.fit(cvec.fit_transform(X_train), y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [49]:
#score training data
logreg.score(cvec.transform(X_train), y_train)

1.0

In [50]:
#score testing data
logreg.score(cvec.transform(X_test), y_test)

0.9778481012658228

In [51]:
#definitely a bit overfit but still pretty good
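
One way to approach the question of which text features are most valuable is to rank the terms by their logistic regression coefficients. The sketch below does this on six made-up titles (the titles, labels, and variable names are assumptions for illustration); applying the same idea to the fitted `cvec` and `logreg` above would need the real training data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up titles: label 1 stands in for r/nfl, label 0 for r/politics.
titles = [
    "touchdown pass", "touchdown run", "touchdown catch",
    "senate vote", "senate bill", "senate hearing",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(titles)
model = LogisticRegression().fit(features, labels)

# Pair every term with its coefficient; positive values push toward label 1.
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}
weights = [(index_to_term[i], coef) for i, coef in enumerate(model.coef_[0])]
ranked = sorted(weights, key=lambda pair: pair[1])

strongest_politics_term = ranked[0][0]   # most negative coefficient
strongest_nfl_term = ranked[-1][0]       # most positive coefficient
```

Here "touchdown" and "senate" come out on top because each appears in every title of its class, whereas the other words each appear only once.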

## Predicting subreddit using Random Forests + Another Classifier

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

#### Thought experiment: What is the baseline accuracy for this model?
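
A baseline here is the accuracy of always guessing the more common class. A minimal sketch using the row counts printed earlier (877 rows scraped from r/politics and 874 from r/nfl); it is computed on the full data set, so the share within the train or test split may differ slightly:

```python
from collections import Counter

# Class sizes taken from the shapes printed earlier:
# 877 r/politics rows (label 0) and 874 r/nfl rows (label 1).
labels = [0] * 877 + [1] * 874

class_counts = Counter(labels)
majority_label, majority_count = class_counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_label, round(baseline_accuracy, 3))  # prints: 0 0.501
```

Because the two classes are almost perfectly balanced, any model has to beat roughly 50% accuracy to be useful.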

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [52]:
cvec = CountVectorizer(analyzer = 'word', stop_words = 'english', ngram_range=(0, 3), lowercase=True)

In [53]:
#import tvec
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words = 'english', ngram_range=(1,1))

In [54]:
## YOUR CODE HERE
#import pandas and other critical stuff with decision trees
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier


tree = DecisionTreeClassifier()

tree.fit(tvec.fit_transform(X_train), y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [55]:
#predict the x tests
df10 = pd.DataFrame(tree.predict_proba(tvec.transform(X_test)))

In [56]:
df10 = df10[1]

In [57]:
df10

0      0.0
1      0.0
2      0.0
3      0.0
4      1.0
5      0.0
6      1.0
7      0.0
8      1.0
9      1.0
10     0.0
11     0.0
12     0.0
13     0.0
14     1.0
15     1.0
16     0.0
17     0.0
18     1.0
19     0.0
20     0.0
21     0.0
22     1.0
23     0.0
24     1.0
25     1.0
26     1.0
27     1.0
28     0.0
29     0.0
      ... 
286    1.0
287    0.0
288    1.0
289    0.0
290    0.0
291    0.0
292    0.0
293    0.0
294    0.0
295    1.0
296    1.0
297    0.0
298    1.0
299    1.0
300    1.0
301    1.0
302    0.0
303    1.0
304    1.0
305    0.0
306    1.0
307    0.0
308    0.0
309    0.0
310    1.0
311    1.0
312    0.0
313    1.0
314    0.0
315    0.0
Name: 1, Length: 316, dtype: float64

In [58]:
# try the model on a made-up title; the left column is the probability of politics (0), the right column of nfl (1)
tree.predict_proba(tvec.transform(['nfl patriots nike football politics belichick']))

array([[0., 1.]])

In [59]:
tree.predict_proba(tvec.transform(['nfl patriots nike football trump belichick']))

array([[1., 0.]])

In [60]:
tree.predict_proba(tvec.transform(['Kamala Harris']))

array([[1., 0.]])

In [61]:
tree.predict_proba(tvec.transform(['nike']))

array([[1., 0.]])

In [62]:
tree.predict_proba(tvec.transform(['trump']))

array([[1., 0.]])

In [63]:
from sklearn.metrics import confusion_matrix
# predictions are passed first here, so rows are predicted labels and columns are true labels
confusion_matrix(df10, y_test)

array([[162,  17],
       [  2, 135]], dtype=int64)
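
The matrix can be turned into summary metrics by hand. In the cell above the predictions are the first argument, so rows index predicted labels and columns index true labels. A short sketch using the four counts from that output (variable names are illustrative):

```python
# Counts copied from the matrix above. Because the predictions were passed
# first, rows are predicted labels and columns are true labels.
pred0_true0, pred0_true1 = 162, 17
pred1_true0, pred1_true1 = 2, 135

total = pred0_true0 + pred0_true1 + pred1_true0 + pred1_true1

accuracy = (pred0_true0 + pred1_true1) / total
precision_nfl = pred1_true1 / (pred1_true1 + pred1_true0)  # of posts predicted nfl
recall_nfl = pred1_true1 / (pred1_true1 + pred0_true1)     # of posts truly nfl

print(f"accuracy={accuracy:.3f} precision={precision_nfl:.3f} recall={recall_nfl:.3f}")
# prints: accuracy=0.940 precision=0.985 recall=0.888
```

The decision tree is rarely wrong when it predicts nfl (2 of 137 predictions), but it misses 17 of the 152 true nfl posts, so recall is the weaker of the two figures.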

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.
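
A minimal sketch of cross-validation and the bonus grid search, run on a dozen made-up titles so that it is self-contained (the titles, the step names `tvec` and `clf`, and the parameter grid are arbitrary assumptions, not tuned choices). Putting the vectorizer inside the `Pipeline` means its vocabulary is refit on each training fold rather than on the whole data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Made-up titles: label 1 stands in for r/nfl, label 0 for r/politics.
titles = [
    "quarterback throws late touchdown", "coach benches star quarterback",
    "team wins playoff game", "rookie scores touchdown in debut",
    "injury report before the game", "team signs veteran quarterback",
    "senate passes budget bill", "governor signs new bill",
    "court hearing on election law", "senate hearing runs late",
    "election results in the state", "president vetoes budget bill",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Plain 3-fold cross-validated accuracy for the default settings.
cv_scores = cross_val_score(pipeline, titles, labels, cv=3)

# Small, arbitrary grid over vectorizer and classifier settings.
param_grid = {
    "tvec__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(titles, labels)

print(cv_scores.mean(), search.best_params_, search.best_score_)
```

With the real data, `titles` and `labels` would be replaced by `X_train` and `y_train`, and `search.best_estimator_` could then be scored once on the held-out test set.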

In [64]:
#perform decision tree classifier for training
tree.score(tvec.transform(X_train), y_train)

1.0

In [65]:
#perform decision tree classifier for test
tree.score(tvec.transform(X_test), y_test)

0.939873417721519

In [66]:
#random forest classifier import
rf = RandomForestClassifier()

In [67]:
#fit and score both training and test set
rf.fit(tvec.fit_transform(X_train), y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [68]:
rf.score(tvec.transform(X_train), y_train)

0.997212543554007

In [69]:
rf.score(tvec.transform(X_test), y_test)

0.939873417721519

In [70]:
## YOUR CODE HERE
# the random forest classifies about 94% of the test posts correctly (test accuracy 0.94)

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

In [71]:
# import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words = 'english',
                      analyzer = 'word', 
                       ngram_range=(1,5))

In [72]:
#logreg fit and score test and training set
logreg.fit(tvec.fit_transform(X_train), y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [73]:
logreg.score(tvec.transform(X_train), y_train)

1.0

In [74]:
logreg.score(tvec.transform(X_test), y_test)

0.9778481012658228

In [75]:
df11 = pd.DataFrame(logreg.predict(tvec.transform(X_test)))

In [76]:
# predictions are passed first here, so rows are predicted labels and columns are true labels
confusion_matrix(df11, y_test)

array([[163,   6],
       [  1, 146]], dtype=int64)

In [97]:
logreg.predict_proba(tvec.transform(['tom brady']))

array([[0.33219739, 0.66780261]])

In [99]:
logreg.predict_proba(tvec.transform(['le veon bell']))

array([[0.2558686, 0.7441314]])

In [95]:
logreg.predict_proba(tvec.transform(['kaepernick']))

array([[0.53244665, 0.46755335]])

In [96]:
logreg.predict_proba(tvec.transform(['calls for kavanaugh impeachment grow']))

array([[0.6918661, 0.3081339]])

In [91]:
logreg.predict_proba(tvec.transform(['trump']))

array([[0.9964507, 0.0035493]])

# Executive Summary
---
Put your executive summary in a Markdown cell below.

We started by comparing football and politics posts through their subreddits, r/nfl and r/politics. The goal was to predict, from the title of a Reddit post alone, which of the two subreddits the post belongs to.
 
The value of this is that we can determine whether a given piece of information relates to sports or to politics. As the line between the two becomes more and more blurred, being aware of the differences is critical. Whether you are a news outlet that wants to keep blurring that line or one that wants to keep the two entirely separate, you have the information to make that decision.
 
On the test set, the models reached an accuracy of about 94% (decision tree and random forest) to 98% (logistic regression), meaning that they assigned that share of held-out posts to the correct subreddit. This is quite effective and helps us determine which news stories are best suited to which news outlets.

With the models we have built, we can quickly and automatically determine where a post belongs. We can also check whether we are posting in the appropriate places and track the overall trends in what we post.