# Using Reddit's API for Predicting Subreddits

In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled Python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [1]:
#import necessary stuff
import requests
import json
import pandas as pd
import time
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder


In [2]:
#plug in reddit nfl
URL = "http://www.reddit.com/r/nfl.json"

In [3]:
#plug in reddit politics
URL2 = "http://www.reddit.com/r/politics.json"

In [4]:
#get reddit nfl in requests.get
res = requests.get(URL, headers={'User-agent': 'Harry 1.0'})

In [5]:
#get reddit politics in requests.get
res2 = requests.get(URL2, headers={'User-agent': 'Harry 1.0'})

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [6]:
#see what we have to work with
res.status_code

200

In [7]:
res2.status_code

200

In [8]:
reddit_dict = res.json()

In [9]:
reddit_dict.keys()

dict_keys(['kind', 'data'])

In [10]:
reddit_dict2 = res2.json()

In [11]:
reddit_dict2.keys()

dict_keys(['kind', 'data'])

In [12]:
reddit_dict2['kind']

'Listing'

In [13]:
reddit_dict2['data']

{'modhash': '',
 'dist': 26,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'politics',
    'selftext': 'Welcome to the \'What happened in your state last week\' thread, where you can post any local political news stories that you find important in the comments. This is a weekly thread posted every Monday, in order to facilitate more discussion on local issues on /r/politics. Since this is intended to be a thread about local politics, top-level comments that are exclusively about national issues will not be allowed. When commenting, please include the state you\'re living in, and don\'t forget to link sources. **Also, please actually describe what happened. "I live in X, you know what happened" isn\'t helpful to users and will be removed.**\n\nIf someone from your state made a news round-up that you think is insufficient, feel free to comment to that round-up with further news stories. Enjoy discussion, and review [our civility guidelines](https://ww

In [14]:
reddit_dict['kind']

'Listing'

In [15]:
reddit_dict['data']

{'modhash': '',
 'dist': 27,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'nfl',
    'selftext': '[New York Jets](/r/nyjets#away) [at](#at) [Detroit Lions](/r/detroitlions#home)\n\n\n[](/# "GT-PRIMETIME")\n\n\n----\n\n\n* Ford Field\n* Detroit, Michigan\n\n----\n\n######[](#start-box-score)\n\n\n\n\n\n| | | | | | |\n| :-- | :-- | :-- | :-- | :-- |  :-- |\n|      |**First**|**Second**|**Third**|**Fourth**| **Halftime** |\n|**Jets**| 7 | 10 | 0 | 0 | 17 |\n|**Lions**| 7 | 3 | 0 | 0 | 10 |\n\n\n######[](#end-box-score)\n\n----\n\n* General information\n* \n\n----\n\n| | |\n| :-- | --: |\n| **Coverage** | **Odds** |\n| ESPN | Detroit -7 O/U 44 |\n\n \n| |\n|:---|\n| **Weather** |\n| [64°F/Wind 9mph/Partly cloudy/No precipitation expected](https://www.yr.no/place/United_States/Michigan/Detroit#weather-03n "Weather forecast from yr.no, delivered by the Norwegian Meteorological Institute and the NRK") |\n \n\n\n\n----\n\n\n\n* Game Stats\n* \n\n----\n\n| 

In [16]:
reddit_dict2['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [17]:
reddit_dict['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [18]:
reddit_dict['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'nfl',
   'selftext': '[New York Jets](/r/nyjets#away) [at](#at) [Detroit Lions](/r/detroitlions#home)\n\n\n[](/# "GT-PRIMETIME")\n\n\n----\n\n\n* Ford Field\n* Detroit, Michigan\n\n----\n\n######[](#start-box-score)\n\n\n\n\n\n| | | | | | |\n| :-- | :-- | :-- | :-- | :-- |  :-- |\n|      |**First**|**Second**|**Third**|**Fourth**| **Halftime** |\n|**Jets**| 7 | 10 | 0 | 0 | 17 |\n|**Lions**| 7 | 3 | 0 | 0 | 10 |\n\n\n######[](#end-box-score)\n\n----\n\n* General information\n* \n\n----\n\n| | |\n| :-- | --: |\n| **Coverage** | **Odds** |\n| ESPN | Detroit -7 O/U 44 |\n\n \n| |\n|:---|\n| **Weather** |\n| [64°F/Wind 9mph/Partly cloudy/No precipitation expected](https://www.yr.no/place/United_States/Michigan/Detroit#weather-03n "Weather forecast from yr.no, delivered by the Norwegian Meteorological Institute and the NRK") |\n \n\n\n\n----\n\n\n\n* Game Stats\n* \n\n----\n\n| | | | | | |\n| :-- | :-- | :-- | :-- | :-- | 

In [19]:
reddit_dict2['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'politics',
   'selftext': 'Welcome to the \'What happened in your state last week\' thread, where you can post any local political news stories that you find important in the comments. This is a weekly thread posted every Monday, in order to facilitate more discussion on local issues on /r/politics. Since this is intended to be a thread about local politics, top-level comments that are exclusively about national issues will not be allowed. When commenting, please include the state you\'re living in, and don\'t forget to link sources. **Also, please actually describe what happened. "I live in X, you know what happened" isn\'t helpful to users and will be removed.**\n\nIf someone from your state made a news round-up that you think is insufficient, feel free to comment to that round-up with further news stories. Enjoy discussion, and review [our civility guidelines](https://www.reddit.com/r/politics/wiki/rulesandregs#wik

In [20]:
#get the 'after' token for politics, to append to URL2
reddit_dict2['data']['after']

't3_9eq3b3'

In [21]:
#get the 'after' token for nfl, to append to URL
reddit_dict['data']['after']

't3_9epwx8'

In [22]:
reddit_dict['data']['children'][0].keys()

dict_keys(['kind', 'data'])

In [23]:
reddit_dict['data']['children'][0]['data']['selftext']

'[New York Jets](/r/nyjets#away) [at](#at) [Detroit Lions](/r/detroitlions#home)\n\n\n[](/# "GT-PRIMETIME")\n\n\n----\n\n\n* Ford Field\n* Detroit, Michigan\n\n----\n\n######[](#start-box-score)\n\n\n\n\n\n| | | | | | |\n| :-- | :-- | :-- | :-- | :-- |  :-- |\n|      |**First**|**Second**|**Third**|**Fourth**| **Halftime** |\n|**Jets**| 7 | 10 | 0 | 0 | 17 |\n|**Lions**| 7 | 3 | 0 | 0 | 10 |\n\n\n######[](#end-box-score)\n\n----\n\n* General information\n* \n\n----\n\n| | |\n| :-- | --: |\n| **Coverage** | **Odds** |\n| ESPN | Detroit -7 O/U 44 |\n\n \n| |\n|:---|\n| **Weather** |\n| [64°F/Wind 9mph/Partly cloudy/No precipitation expected](https://www.yr.no/place/United_States/Michigan/Detroit#weather-03n "Weather forecast from yr.no, delivered by the Norwegian Meteorological Institute and the NRK") |\n \n\n\n\n----\n\n\n\n* Game Stats\n* \n\n----\n\n| | | | | | |\n| :-- | :-- | :-- | :-- | :-- | :-- |\n| **Passing** |  | **Cmp/Att** | **Yds** | **Tds** | **Ints** |\n| S.Darnold | [](/

In [24]:
posts = [p['data'] for p in reddit_dict['data']['children']]
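
Each child of a Listing wraps the post's fields in its own `data` key, which is what the list comprehension above unpacks. As a rough sketch of the same idea, the block below runs on a hand-made dictionary that only mimics the shape of `res.json()`. The titles, the `after` token, and the helper name `extract_posts` are invented for illustration.

```python
# Hand-made stand-in for res.json(); the shape follows the output above,
# but the titles and the 'after' token are made up for illustration.
sample = {
    "kind": "Listing",
    "data": {
        "after": "t3_example",
        "children": [
            {"kind": "t3", "data": {"subreddit": "nfl", "title": "Game thread", "selftext": ""}},
            {"kind": "t3", "data": {"subreddit": "nfl", "title": "Injury report", "selftext": "Details inside"}},
        ],
    },
}

def extract_posts(listing, fields=("subreddit", "title", "selftext")):
    """Return one small dict per post, keeping only the requested fields."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        # .get() avoids a KeyError when a post lacks an optional field
        rows.append({name: post.get(name) for name in fields})
    return rows

rows = extract_posts(sample)
print(len(rows), rows[0]["title"])
```

Keeping only a few fields is optional; the notebook keeps every field, which is why the resulting dataframes have more than 90 columns.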

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!

In [25]:
pd.DataFrame(posts).to_csv('posts.csv')

In [26]:
reddit_dict['data']['after']

't3_9epwx8'

In [27]:
#create custom url to work with and go through reddit
URL + '?after=' + reddit_dict['data']['after']

'http://www.reddit.com/r/nfl.json?after=t3_9epwx8'

In [28]:
# loop over pages to gather enough posts from r/nfl
posts = []  # accumulates posts across pages
after = None

for i in range(35):
    if after is None:    # no "after" token yet, so this is the first request
        current_url = URL
    else:
        current_url = URL + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers = {'User-agent': 'Foo Bar 1.0'})
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    pd.DataFrame(posts).to_csv('boardgames.csv', index = False)
    time.sleep(1)
df2 = pd.DataFrame(posts)  # nfl posts

http://www.reddit.com/r/nfl.json
http://www.reddit.com/r/nfl.json?after=t3_9epwx8
http://www.reddit.com/r/nfl.json?after=t3_9egfjc
http://www.reddit.com/r/nfl.json?after=t3_9epqyw
http://www.reddit.com/r/nfl.json?after=t3_9eqz1q
http://www.reddit.com/r/nfl.json?after=t3_9eiiks
http://www.reddit.com/r/nfl.json?after=t3_9efu7k
http://www.reddit.com/r/nfl.json?after=t3_9ef8lw
http://www.reddit.com/r/nfl.json?after=t3_9eq0c6
http://www.reddit.com/r/nfl.json?after=t3_9ehm7t
http://www.reddit.com/r/nfl.json?after=t3_9ehspc
http://www.reddit.com/r/nfl.json?after=t3_9efjz5
http://www.reddit.com/r/nfl.json?after=t3_9es5yn
http://www.reddit.com/r/nfl.json?after=t3_9eqsqs
http://www.reddit.com/r/nfl.json?after=t3_9eox7h
http://www.reddit.com/r/nfl.json?after=t3_9edviz
http://www.reddit.com/r/nfl.json?after=t3_9e6gp4
http://www.reddit.com/r/nfl.json?after=t3_9e8iu5
http://www.reddit.com/r/nfl.json?after=t3_9dv134
http://www.reddit.com/r/nfl.json?after=t3_9e2q8n
http://www.reddit.com/r/nfl.json?aft

In [29]:
# loop over pages to gather enough posts from r/politics
posts = []  # accumulates posts across pages
after = None

for i in range(35):
    if after is None:    # no "after" token yet, so this is the first request
        current_url = URL2
    else:
        current_url = URL2 + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers = {'User-agent': 'Foo Bar 1.0'})
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    pd.DataFrame(posts).to_csv('boardgames2.csv', index = False)
    time.sleep(1)
df1 = pd.DataFrame(posts)

http://www.reddit.com/r/politics.json
http://www.reddit.com/r/politics.json?after=t3_9eq3b3
http://www.reddit.com/r/politics.json?after=t3_9emon1
http://www.reddit.com/r/politics.json?after=t3_9en8yz
http://www.reddit.com/r/politics.json?after=t3_9eshj4
http://www.reddit.com/r/politics.json?after=t3_9eqllx
http://www.reddit.com/r/politics.json?after=t3_9eowyz
http://www.reddit.com/r/politics.json?after=t3_9erf0g
http://www.reddit.com/r/politics.json?after=t3_9empj5
http://www.reddit.com/r/politics.json?after=t3_9er02b
http://www.reddit.com/r/politics.json?after=t3_9eqw9k
http://www.reddit.com/r/politics.json?after=t3_9espb5
http://www.reddit.com/r/politics.json?after=t3_9empi3
http://www.reddit.com/r/politics.json?after=t3_9epf1h
http://www.reddit.com/r/politics.json?after=t3_9emj9m
http://www.reddit.com/r/politics.json?after=t3_9eftiv
http://www.reddit.com/r/politics.json?after=t3_9ephvs
http://www.reddit.com/r/politics.json?after=t3_9eet8n
http://www.reddit.com/r/politics.json?after=

In [30]:
#check shapes
df1.shape

(876, 96)

In [31]:
#check shapes
df2.shape

(868, 93)
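
The two scraping loops above differ only in the URL, so they could be folded into one helper. The following is a sketch under a few assumptions: it uses the standard library's `urllib` rather than `requests` so that it has no third-party dependency, the function names and the user agent string are placeholders, and it assumes a listing's `after` value is null on the last page.

```python
import json
import time
import urllib.request

def fetch_json(url, user_agent="Example Bot 0.1"):
    """Fetch one listing page; the user agent string is a placeholder."""
    req = urllib.request.Request(url, headers={"User-agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def collect_posts(base_url, max_pages=35, pause=3, fetch=fetch_json):
    """Follow the 'after' token for up to max_pages pages and return all posts."""
    posts, after = [], None
    for _ in range(max_pages):
        url = base_url if after is None else base_url + "?after=" + after
        listing = fetch(url)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:      # no token, so assume there are no further pages
            break
        time.sleep(pause)      # wait between requests to respect the rate limit
    return posts
```

A call such as `collect_posts("http://www.reddit.com/r/nfl.json")` would then replace one whole loop. Stopping when `after` is missing also avoids requesting the first page a second time, which the `if after is None` branch in the loops above would otherwise do.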

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.

In [32]:
#concat the dataframes on top of each other
df3 = pd.concat([df1,df2], axis=0)

In [33]:
#columns present in df1 (politics) but not in df2 (nfl)
set(df1.columns) - set(df2.columns)

{'post_hint', 'preview', 'thumbnail_height', 'thumbnail_width'}

In [34]:
#get the shape of the concatenated dataframe to check that it all aligns properly
df3.shape

(1744, 97)

In [35]:
#columns present in df2 (nfl) but not in df1 (politics)
set(df2.columns) - set(df1.columns)

{'media_metadata'}

In [36]:
#save the combined dataframe to disk
df3.to_csv('patriotsworldnews.csv', index = False)

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
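
To make the difference between count and binary features concrete, here is a toy sketch in plain Python. It is not scikit-learn's implementation, the function name `bag_of_words` is hypothetical, and the titles are made up. In scikit-learn the equivalent switch is `CountVectorizer(binary=True)`.

```python
from collections import Counter

def bag_of_words(titles, binary=False):
    """Turn titles into rows of word counts (or 0/1 flags) over a shared vocabulary."""
    tokenized = [title.lower().split() for title in titles]
    vocab = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)
        if binary:
            # record only whether the word occurs at all
            rows.append([1 if counts[w] else 0 for w in vocab])
        else:
            # record how many times the word occurs
            rows.append([counts[w] for w in vocab])
    return vocab, rows

titles = ["Trump Trump Trump again", "Jets at Lions"]   # made-up titles
vocab, counts = bag_of_words(titles)
_, flags = bag_of_words(titles, binary=True)
print(vocab)
print(counts[0], flags[0])
```

For short texts such as post titles, most words appear once, so the two encodings often produce nearly the same matrix.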

In [37]:
## YOUR CODE HERE

In [38]:
#import stuff for nlp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LogisticRegression, LinearRegression
logreg = LogisticRegression()
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [39]:
df3.head()

| | approved_at_utc | approved_by | archived | author | author_cakeday | author_flair_background_color | author_flair_css_class | author_flair_richtext | author_flair_template_id | author_flair_text | ... | thumbnail_height | thumbnail_width | title | ups | url | user_reports | view_count | visited | whitelist_status | wls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | | | False | optimalg | | | | [{'a': ':flag-nl:', 'e': 'emoji', 'u': 'https:... | dd16848a-34e3-11e8-be76-0ea72814b416 | :flag-nl: The Netherlands | ... | | | The "What happened in your state last week?" M... | 148 | https://www.reddit.com/r/politics/comments/9eo... | [] | | False | all_ads | 6 |
| 1 | | | False | Mysteriagant | | | | [{'a': ':flag-tx:', 'e': 'emoji', 'u': 'https:... | fd871cfc-8e72-11e6-84ba-0ee844677561 | :flag-tx: Texas | ... | 70.0 | 140.0 | Ted Cruz Takes Brave Anti-Tofu Stance, Despite... | 19074 | https://www.esquire.com/food-drink/amp23061435... | [] | | False | all_ads | 6 |
| 2 | | | False | roku44 | | | | [] | | | ... | 70.0 | 140.0 | Trump reportedly exploded at his ex-lawyer aft... | 3369 | https://www.businessinsider.com/trump-exploded... | [] | | False | all_ads | 6 |
| 3 | | | False | wonderingsocrates | | | | [] | | | ... | 73.0 | 140.0 | What boycott? Nike sales are up 31 percent sin... | 6254 | https://www.nbcnews.com/business/business-news... | [] | | False | all_ads | 6 |
| 4 | | | False | _basquiat | | | districtofcolumbia-flag | [{'a': ':flag-dc:', 'e': 'emoji', 'u': 'https:... | 57776a2a-8e71-11e6-bfb6-0e7000497d17 | :flag-dc: District Of Columbia | ... | 78.0 | 140.0 | Bombs Away: Trump Has the I.Q. of an Inbred Ta... | 2410 | https://www.vanityfair.com/news/2018/09/trump-... | [] | | False | all_ads | 6 |


In [40]:
#create a y variable that is zeros and ones
y = df3.subreddit.eq('nfl').mul(1)

In [41]:
#create an x variable out of the titles
X = df3['title']
#do not have descriptions :/

In [42]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42, test_size = .18)

In [43]:
#import countvectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [44]:
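#set up the count vectorizer
#note: ngram_range normally starts at 1; the lower bound of 0 is left as run so the scores below still apply
cvec = CountVectorizer(analyzer = 'word', stop_words = 'english', ngram_range=(0, 3), lowercase=True)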

In [45]:
#get logisticregression
logreg =  LogisticRegression()

In [46]:
#fit the information
logreg.fit(cvec.fit_transform(X_train), y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [47]:
#score training data
logreg.score(cvec.transform(X_train), y_train)

0.9993006993006993

In [48]:
#score testing data
logreg.score(cvec.transform(X_test), y_test)

0.9808917197452229

In [49]:
#definitely a bit overfit but still pretty good
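
One way to approach the earlier question of which text features are most valuable is to read the fitted logistic regression coefficients: large positive weights push a title toward class 1 (nfl) and large negative weights toward class 0 (politics). The sketch below uses a handful of made-up titles instead of `X_train`, so the words it prints are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up titles standing in for X_train; 1 = nfl, 0 = politics as in the notebook
titles = [
    "quarterback throws touchdown", "touchdown in overtime", "quarterback injured",
    "senate passes bill", "senate vote delayed", "bill heads to president",
]
labels = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer()
model = LogisticRegression().fit(vec.fit_transform(titles), labels)

# get_feature_names_out is the newer name; older releases use get_feature_names
get_names = getattr(vec, "get_feature_names_out", None) or vec.get_feature_names

# Pair each word with its coefficient and sort from most negative to most positive
ranked = sorted(zip(model.coef_[0], get_names()))

print("most politics-like:", [word for _, word in ranked[:3]])
print("most nfl-like:", [word for _, word in ranked[-3:]])
```

The same pattern should apply to the notebook's `logreg` and `cvec` objects, assuming they are still fitted on the same training data.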

## Predicting subreddit using Random Forests + Another Classifier

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

#### Thought experiment: What is the baseline accuracy for this model?
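
A common reading of baseline accuracy is the score of a model that always predicts the larger class. A minimal sketch, using the row counts reported by `df1.shape` and `df2.shape` above; the share in the test split alone may differ slightly.

```python
# Row counts taken from df1.shape (politics) and df2.shape (nfl) earlier in the notebook
n_politics, n_nfl = 876, 868

# A model that always guesses the larger class is right this often
baseline = max(n_politics, n_nfl) / (n_politics + n_nfl)
print(round(baseline, 3))   # 0.502
```

Because the two subreddits contribute almost the same number of posts, any useful classifier needs to do clearly better than roughly 50%.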

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [50]:
cvec = CountVectorizer(analyzer = 'word', stop_words = 'english', ngram_range=(0, 3), lowercase=True)

In [51]:
#import tvec
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words = 'english', ngram_range=(1,1))

In [52]:
## YOUR CODE HERE
#import pandas and other critical stuff with decision trees
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier


tree = DecisionTreeClassifier()

tree.fit(tvec.fit_transform(X_train), y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [53]:
#predict the x tests
df10 = pd.DataFrame(tree.predict_proba(tvec.transform(X_test)))

In [54]:
df10 = df10[1]

In [55]:
df10

0      0.0
1      0.0
2      1.0
3      0.0
4      1.0
5      0.0
6      0.0
7      0.0
8      0.0
9      0.0
10     1.0
11     0.0
12     1.0
13     1.0
14     1.0
15     0.0
16     1.0
17     1.0
18     0.0
19     1.0
20     1.0
21     0.0
22     0.0
23     1.0
24     1.0
25     1.0
26     0.0
27     1.0
28     1.0
29     0.0
      ... 
284    1.0
285    0.0
286    1.0
287    1.0
288    0.0
289    1.0
290    1.0
291    1.0
292    1.0
293    1.0
294    1.0
295    1.0
296    0.0
297    1.0
298    1.0
299    1.0
300    0.0
301    0.0
302    1.0
303    0.0
304    0.0
305    1.0
306    0.0
307    0.0
308    1.0
309    0.0
310    0.0
311    1.0
312    0.0
313    1.0
Name: 1, Length: 314, dtype: float64

In [56]:
#try the tree on hand-written text; the left column is P(politics), the right is P(nfl)
tree.predict_proba(tvec.transform(['nfl patriots nike football politics belichick']))

array([[0., 1.]])

In [57]:
tree.predict_proba(tvec.transform(['nfl patriots nike football trump belichick']))

array([[1., 0.]])

In [58]:
tree.predict_proba(tvec.transform(['Kamala Harris']))

array([[0., 1.]])

In [59]:
tree.predict_proba(tvec.transform(['nike']))

array([[1., 0.]])

In [60]:
tree.predict_proba(tvec.transform(['trump']))

array([[1., 0.]])

In [61]:
from sklearn.metrics import confusion_matrix
#predictions are passed first here, so rows are predicted labels and columns are actual labels
confusion_matrix(df10, y_test)

array([[131,   2],
       [ 28, 153]], dtype=int64)
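
Accuracy, precision, and recall can be read directly off the matrix above. The sketch below copies the four counts by hand and assumes the orientation implied by the call `confusion_matrix(df10, y_test)`: since scikit-learn's signature is `confusion_matrix(y_true, y_pred)` and the predictions were passed first, rows here correspond to predicted labels and columns to actual labels.

```python
# Confusion matrix copied from the output above.
# Because predictions were passed first, rows are predicted labels
# and columns are actual labels (0 = politics, 1 = nfl).
matrix = [[131, 2],
          [28, 153]]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

pred_nfl = matrix[1][0] + matrix[1][1]      # everything predicted as nfl
actual_nfl = matrix[0][1] + matrix[1][1]    # everything that really is nfl
precision_nfl = matrix[1][1] / pred_nfl
recall_nfl = matrix[1][1] / actual_nfl

print(round(accuracy, 4), round(precision_nfl, 4), round(recall_nfl, 4))
```

The accuracy works out to 284/314, about 0.9045. Under this reading, the tree labels 28 politics titles as nfl but only 2 nfl titles as politics.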

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.
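
As a hedged sketch of the cross-validation and bonus steps, the block below wraps a vectorizer and a random forest in a `Pipeline`, scores it with `cross_val_score`, and then searches a very small grid with `GridSearchCV`. The titles are made up to stand in for `X_train`, and the grid values and `cv=2` are illustrative choices for such a tiny sample, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Made-up titles standing in for X_train and y_train (1 = nfl, 0 = politics)
titles = [
    "quarterback throws touchdown", "touchdown wins the game", "coach benches quarterback",
    "kicker misses field goal", "senate passes budget bill", "president signs bill",
    "governor vetoes budget", "senate vote delayed again",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tvec", TfidfVectorizer(stop_words="english")),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Cross-validated accuracy of the untuned pipeline; the vectorizer is refit
# inside each fold, so no vocabulary leaks from the held-out titles
scores = cross_val_score(pipe, titles, labels, cv=2)

# Small, illustrative grid over one vectorizer setting and one model setting
grid = GridSearchCV(
    pipe,
    param_grid={"tvec__ngram_range": [(1, 1), (1, 2)], "rf__max_depth": [None, 5]},
    cv=2,
)
grid.fit(titles, labels)
print(scores.mean(), grid.best_params_)
```

On the real data, passing `X_train` and `y_train` in place of the toy lists and raising `cv` to 5 would be a reasonable starting point.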

In [62]:
#score the decision tree on the training set
tree.score(tvec.transform(X_train), y_train)

0.9993006993006993

In [63]:
#score the decision tree on the test set
tree.score(tvec.transform(X_test), y_test)

0.9044585987261147

In [64]:
#instantiate the random forest classifier
rf = RandomForestClassifier()

In [65]:
#fit on the training set; the training and test scores follow
rf.fit(tvec.fit_transform(X_train), y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [66]:
rf.score(tvec.transform(X_train), y_train)

0.9965034965034965

In [67]:
rf.score(tvec.transform(X_test), y_test)

0.9522292993630573

In [68]:
## YOUR CODE HERE
#the random forest classifies about 95% of the test posts correctly

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

In [69]:
#import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words = 'english',
                      analyzer = 'word', 
                       ngram_range=(1,5))

In [70]:
#logreg fit and score test and training set
logreg.fit(tvec.fit_transform(X_train), y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [71]:
logreg.score(tvec.transform(X_train), y_train)

0.9993006993006993

In [72]:
logreg.score(tvec.transform(X_test), y_test)

0.9808917197452229

In [73]:
df11 = pd.DataFrame(logreg.predict(tvec.transform(X_test)))

In [74]:
#predictions are passed first, so rows are predicted labels and columns are actual labels
confusion_matrix(df11, y_test)

array([[154,   1],
       [  5, 154]], dtype=int64)

In [75]:
logreg.predict_proba(tvec.transform(['canada']))

array([[0.56506284, 0.43493716]])

In [76]:
logreg.predict_proba(tvec.transform(['mack is playing so well gruden is trending']))

array([[0.36003472, 0.63996528]])

In [77]:
logreg.predict_proba(tvec.transform(['trump']))

array([[0.99612695, 0.00387305]])

# Executive Summary
---
Put your executive summary in a Markdown cell below.

 We compared the r/nfl and r/politics subreddits by building a model that predicts, from the title of a Reddit post alone, which of the two subreddits the post belongs to.
 
 The value of this is that we can determine whether a given piece of information relates to sports or to politics. As the line between the two gets more and more blurred, being aware of the differences is critical. Whether you are a news outlet that wants to keep blurring that line or one that wants to keep the two entirely separate, you have the information to make that decision.
 
On the testing set, the random forest reached an accuracy of about 95% and the logistic regression about 98%, meaning the models assign the correct subreddit to that share of held-out posts. This is quite effective and helps us determine which news stories suit which news outlets.

With the model we have built, we can quickly and automatically determine where a post belongs. We can also check whether we are posting in the appropriate places and track overall trends in what we post.