# The "Most Epic" Data-Science Project
# Notebook-1 (Data Collection)
### Perry Shyr

# Problem Statement:

### The data-science problem investigated here is determining the subreddit from which a given post originates.  The main subreddits of two popular science-fiction franchises, 'Star Wars' and 'Star Trek', were chosen for this problem.  Two of the most effective binary classifiers are examined in detail to answer this sourcing problem and, in the process, to explain the most significant language features of the solution.

# Executive Summary

###  The Social-Justice Warriors are taking a bashing among fans of the post-Lucas Star-Wars-Extended-Universe.  On the other hand, all is relatively quiet on the Star-Trek-Extended-Universe front.  The average Sci-Fi enthusiast would do well to avoid the toxic tension of the former fandom in the media.

### There’s a huge bummer associated with encountering a spoiler.  Whether it be in sports or movies, wouldn’t it be great to avoid that algorithmically?  We need a reliable way to identify whether a post or media article is talking about Star Wars, and to know which words are most important for making that identification.

### My model can predict the source of new posts and thereby the expected level of tameness of their contents.  Trained on the titles of about 900 posts, it lets one be almost 90% sure which flavor of the galactic genres one is about to read.

### There are distinct collections of words that make the separation between the two subjects possible, such that only 10-15% of posts are truly ambiguous.  This can be done using one of two classifier models.

### Don’t wait.  Subscribe to our filtering guide today and save yourself the aggravation of reading another SJW-battering comment.  Get out of the Dark Side, once and for all.


# The Data Gathering.

### First, import the necessary tool libraries to start collecting the data.

In [8]:
import requests
import json
import time
import pandas as pd

import pickle

## Use API requests:

In [9]:
URL_w = "http://www.reddit.com/r/StarWars.json"
URL_t = "http://www.reddit.com/r/startrek.json"

In [10]:
headers_w = { 'User-agent' : 'Bleep-bot 0.1' }
headers_t = { 'User-agent' : 'Bleep-bot 0.2' }

In [11]:
res_w = requests.get(URL_w, headers=headers_w)
res_t = requests.get(URL_t, headers=headers_t)

In [12]:
json_sw = res_w.json()     
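
Before drilling into the response, it can help to confirm the shape of the listing. The following is a minimal sketch, assuming the nesting seen in this notebook (`listing['data']['children']`, with each child holding its own `data` dictionary). The helper name `titles_from_listing` and the sample dictionary are hypothetical and are not part of the original notebook.

```python
def titles_from_listing(listing):
    """Return the post titles found in a Reddit-style listing dictionary."""
    # Fall back to empty containers so a malformed response yields [] instead of a KeyError.
    children = listing.get('data', {}).get('children', [])
    return [child.get('data', {}).get('title', '') for child in children]


# Hypothetical sample that mimics the structure of res_w.json().
sample_listing = {
    'kind': 'Listing',
    'data': {
        'after': 't3_abc123',
        'children': [
            {'kind': 't3', 'data': {'title': 'On opinions.', 'selftext': '...'}},
            {'kind': 't3', 'data': {'title': 'Fetch!', 'selftext': ''}},
        ],
    },
}

print(titles_from_listing(sample_listing))
```

With the real response, the same helper could be called as `titles_from_listing(json_sw)`.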

### Look at the data.  Are there signs of promotional ads?

In [13]:
json_sw['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'StarWars',
   'selftext': "Things are getting out of hand when it comes to people, toxicity and opinions, and this sub's reputation is suffering because of it. Loving a movie is fine, disliking a movie is also fine. As long as you voice your opinion in a civilized manner then all will be cool. What's not cool is being a dick to someone that doesn't share your opinion. Billy Joe hates TLJ, he has a right to hate it if he wants, that doesn't give you a pass to be a dick to Billy Joe just because you think TLJ should be a multi Oscar winner. But that door swings both ways, Billy Joe has no right to be a dick to others for disagreeing with him, as long as the disagreeing is done in a civilized way.\n\nThe toxicity ends now. If you can't converse in a civilized manner, then we don't want you here.\n\n\nSo in short, keep criticism constructive and keep responses to criticism constructive. \n\n\n\nOn a more positive note, we

In [14]:
x = json_sw['data']['children'][4]
list(x['data'].keys())                       #  Look at the headers of the metadata for clues.

['approved_at_utc',
 'subreddit',
 'selftext',
 'author_fullname',
 'saved',
 'mod_reason_title',
 'gilded',
 'clicked',
 'title',
 'link_flair_richtext',
 'subreddit_name_prefixed',
 'hidden',
 'pwls',
 'link_flair_css_class',
 'downs',
 'thumbnail_height',
 'parent_whitelist_status',
 'hide_score',
 'name',
 'quarantine',
 'link_flair_text_color',
 'author_flair_background_color',
 'subreddit_type',
 'ups',
 'domain',
 'media_embed',
 'thumbnail_width',
 'author_flair_template_id',
 'is_original_content',
 'user_reports',
 'secure_media',
 'is_reddit_media_domain',
 'is_meta',
 'category',
 'secure_media_embed',
 'link_flair_text',
 'can_mod_post',
 'score',
 'approved_by',
 'thumbnail',
 'edited',
 'author_flair_css_class',
 'author_flair_richtext',
 'content_categories',
 'is_self',
 'mod_note',
 'created',
 'link_flair_type',
 'wls',
 'banned_by',
 'author_flair_type',
 'contest_mode',
 'selftext_html',
 'likes',
 'suggested_sort',
 'banned_at_utc',
 'view_count',
 'archived',
 'n

## One of the frequent sponsors on this Star-Wars subreddit is Grammarly.com, which asks, "Want to write better?"

In [15]:
[post['data']['title'] for post in json_sw['data']['children'] if 'Want to write better?' in post['data']['title']]

[]

### Although Grammarly was prominently featured among the first 25 posts on a given day, the words in their tagline were absent from the titles pulled.  This suggests that ads are not being pulled using the Reddit-API.
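
The check above matches the tagline exactly and only in titles. A slightly broader search, sketched below, ignores case and also looks at the post body. The function name `posts_mentioning` and the sample posts are hypothetical; only the `title` and `selftext` keys come from the data shown above.

```python
def posts_mentioning(posts, phrase):
    """Return titles of posts whose title or selftext contains phrase (case-insensitive)."""
    needle = phrase.lower()
    matches = []
    for post in posts:
        data = post['data']
        # Search the title and the body together, ignoring case.
        haystack = (data.get('title', '') + ' ' + data.get('selftext', '')).lower()
        if needle in haystack:
            matches.append(data['title'])
    return matches


# Hypothetical posts in the same shape as json_sw['data']['children'].
sample_posts = [
    {'data': {'title': 'Want to WRITE better? Try this', 'selftext': ''}},
    {'data': {'title': 'Fetch!', 'selftext': ''}},
]
print(posts_mentioning(sample_posts, 'want to write better?'))
```

An empty result from a single page is suggestive rather than conclusive, so the same search could be repeated over every page collected later.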

### Proceed to collect the allotted maximum of 1,000 daily posts:

#### A loop with a throttling timer is used to collect posts, 25 at a time, initially in JSON format.

### Collecting Star-Wars posts:

In [7]:
posts_w02 = []
after_w = None                                   # Pagination token; None requests the first page.
URL_w = "http://www.reddit.com/r/StarWars.json"
for i in range(42):                              # 42 pages of 25 posts each, slightly over 1,000.
    print(i)
    if after_w is None:
        params_w = {}
    else:
        params_w = {'after': after_w}
    res_w = requests.get(URL_w, params=params_w, headers=headers_w)
    if res_w.status_code == 200:
        json_sw = res_w.json()
        posts_w02.extend(json_sw['data']['children'])
        after_w = json_sw['data']['after']       # Token for the next page.
    else:
        print(res_w.status_code)                 # Report the failing status and stop.
        break
    time.sleep(1)                                # Throttle requests to one per second.

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
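
The collection loop above can also be sketched as a reusable function. This is a minimal sketch rather than the notebook's own method: `collect_posts` and its `fetch_page` argument are hypothetical names, and the fake pages stand in for real requests so that the example runs offline. One difference from the loop above is that this version stops when no `after` token is returned. In the original loop an empty token would send the next request back to the first page, which is one possible source of duplicate posts.

```python
import time


def collect_posts(fetch_page, max_pages=42, pause_seconds=1.0):
    """Collect posts page by page, following the 'after' token.

    fetch_page is a hypothetical callable: it takes the current 'after'
    token (or None for the first page) and returns the parsed JSON listing.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            # No further pages are available, so stop early.
            break
        time.sleep(pause_seconds)  # Throttle between requests.
    return posts


# Stand-in for the real request so the sketch runs offline: two fake pages.
fake_pages = {
    None: {'data': {'children': ['post-1', 'post-2'], 'after': 't3_x'}},
    't3_x': {'data': {'children': ['post-3'], 'after': None}},
}
collected = collect_posts(lambda after: fake_pages[after], pause_seconds=0)
print(len(collected))
```

With live data, `fetch_page` could wrap the `requests.get(URL_w, params=..., headers=headers_w)` call used in the loop above and return `res_w.json()`.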


### Save the original data in JSON-files:

In [13]:
# with open('../data/wars_0906.json', 'w+') as f:   
#     json.dump(posts_w02, f)                   # Avoid writing over existing files
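
Commenting out the write is one way to protect an existing file. Another option, sketched below, is to open the file in exclusive-creation mode (`'x'`), which raises `FileExistsError` instead of overwriting. The helper name `save_json_once` and the file name are hypothetical, and a temporary directory is used so the example leaves nothing behind.

```python
import json
import os
import tempfile


def save_json_once(obj, path):
    """Write obj to path as JSON, refusing to overwrite an existing file."""
    # Mode 'x' creates the file and raises FileExistsError if it already exists.
    with open(path, 'x') as f:
        json.dump(obj, f)


# Demonstrate in a temporary directory with a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, 'wars_example.json')
    save_json_once([{'title': 'Fetch!'}], target)
    try:
        save_json_once([], target)       # Second write to the same path.
        overwritten = True
    except FileExistsError:
        overwritten = False
    with open(target) as f:
        saved = json.load(f)

print(overwritten, saved)
```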

In [None]:
len(posts_w02)                                # Check how many total posts were collected.

### Organize the data collected into data-frames:

In [22]:
posts_extracted_w = []
for post in posts_w02:
    posts_extracted_w.append({'title': post['data']['title'], 'text': post['data']['selftext']})
df_sw=pd.DataFrame(posts_extracted_w)
df_sw.head(10)

Unnamed: 0,text,title
0,Things are getting out of hand when it comes t...,On opinions.
1,,I couldn’t resist...
2,,Fetch!
3,,"Got my first tattoo done yesterday, think this..."
4,,The Battle of Crait must have been a bit weird...
5,,"New apartment, new Star Wars setup!"
6,,A hand-embroidered travel poster for the fores...
7,,Scoundrels
8,,You have to believe...
9,,LEGO Star Wars is my addiction!


### Accumulate posts saved from a few days earlier:

In [3]:
with open('../data/postsL_wars0830_2051.pkl', 'rb') as f:
    posts_test_w1u = pickle.load(f)

In [4]:
posts_saved_w = []
for post in posts_test_w1u:
    posts_saved_w.append({'title': post['data']['title'], 'text': post['data']['selftext']})
pd.DataFrame(posts_saved_w).head(10)

Unnamed: 0,text,title
0,Things are getting out of hand when it comes t...,On opinions.
1,Its been three weeks since the release of Thra...,Thrawn: Alliances by Timothy Zahn - Discussion...
2,,Sith Acolyte I 3D printed and finished.
3,,As promised the finished Executor (sorry about...
4,,My son loves star wars but the real x-wing is ...
5,,Action Figure Movie Poster
6,,My company's camper van has been converted to ...
7,,LEGO collection at nearby shop
8,,Chillin’ in Bespin.
9,,Draw me like one of your Scarif Girls


In [5]:
len(posts_saved_w)                                  # 714 unique Star-Wars posts were already collected.

714

In [6]:
posts_saved_w[1]                                    # Examine one of the posts saved.

{'title': 'Thrawn: Alliances by Timothy Zahn - Discussion Thread',
 'text': "Its been three weeks since the release of Thrawn: Alliances, so we figure its time to have a discussion.\n\nWhat did you like, what didn't you like.  Just how awesome is Timothy Zahn?  :)\n\nLets break down the book here and post your thoughts.  Thanks!"}

In [11]:
len(posts_extracted_w)                              # Check how many posts are collected from the loop above.

1038

In [12]:
posts_extracted_w[1037]                             # Examine one of the posts just collected.

{'title': "Dave Bautista Reveals He's Auditioned for a 'Couple' of 'Star Wars' Movies, Also reveals he loved 'Rogue One'",
 'text': ''}

In [68]:
for post in posts_saved_w:
    posts_extracted_w.append(post)                  # Combine the saved posts with the new posts.

In [69]:
len(posts_extracted_w)

1758

In [70]:
df_sw = pd.DataFrame(posts_extracted_w)             # Convert the result into a data-frame.

In [71]:
df_sw.head()                                        # Check the 'head' of the data-frame.

Unnamed: 0,text,title
0,Things are getting out of hand when it comes t...,On opinions.
1,Its been three weeks since the release of Thra...,Thrawn: Alliances by Timothy Zahn - Discussion...
2,,"Wow, okay then."
3,,Hot Take: R2-D2 is the most consistently best ...
4,,Anakin vs Obiwan. Was the most anticipated lig...


In [72]:
print('Duplicated values:', df_sw.duplicated().sum())  # Count the number of overlapping duplicates.

Duplicated values: 897


In [75]:
df_sw.drop_duplicates(inplace=True)                    # Remove the duplicate posts.

In [77]:
print(df_sw.shape)
df_sw.head(10)

(861, 2)


Unnamed: 0,text,title
0,Things are getting out of hand when it comes t...,On opinions.
1,Its been three weeks since the release of Thra...,Thrawn: Alliances by Timothy Zahn - Discussion...
2,,"Wow, okay then."
3,,Hot Take: R2-D2 is the most consistently best ...
4,,Anakin vs Obiwan. Was the most anticipated lig...
5,,"Well it wasn’t in Maz Kanata’s basement, but i..."
6,,"Something I photoshopped together for fun, fig..."
7,,Finished cardboard Executor class super star d...
8,One of the best things I feel the Clone Wars d...,Anyone else love the scenes where Anakin and O...
9,,Samurai Stormtrooper


In [104]:
df_sw.reset_index(inplace=True)

### After removing the duplicate posts, reset the index to facilitate subsequent processing (without gaps).
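
The de-duplication and index reset can be illustrated on a tiny example. This is a sketch with hypothetical posts, not the notebook's data. It also uses `drop=True`, which the cell above does not: without it, `reset_index` keeps the old index labels as an extra `index` column.

```python
import pandas as pd

# Hypothetical posts: the same entry appears in both the old and the new pull.
old_posts = [{'title': 'On opinions.', 'text': 'Things are getting out of hand...'}]
new_posts = [
    {'title': 'On opinions.', 'text': 'Things are getting out of hand...'},
    {'title': 'Fetch!', 'text': ''},
]

df = pd.DataFrame(old_posts + new_posts)
n_duplicates = int(df.duplicated().sum())        # Rows identical to an earlier row.

df_deduped = df.drop_duplicates()
index_before = list(df_deduped.index)            # The dropped row leaves a gap in the labels.

# drop=True discards the old labels instead of storing them in an 'index' column.
df_unique = df_deduped.reset_index(drop=True)

print(n_duplicates, index_before, list(df_unique.index))
```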

### Save the unique set of Star-Wars posts to a CSV-file.

In [109]:
# df_sw.to_csv('../data/posts_wars.csv', index=False)   # Commented-out here to avoid over-writing the posts.

#### (All done with the Star-Wars data collection of subreddit posts.  Time to move on to the next subreddit...)

## Collecting Star-Trek posts (in a similar fashion):

In [14]:
posts_t02 = []
after_t = None                                   # Pagination token; None requests the first page.
URL_t = "http://www.reddit.com/r/startrek.json"
for i in range(42):                              # 42 pages of 25 posts each, slightly over 1,000.
    print(i)
    if after_t is None:
        params_t = {}
    else:
        params_t = {'after': after_t}
    res_t = requests.get(URL_t, params=params_t, headers=headers_t)
    if res_t.status_code == 200:
        json_st = res_t.json()
        posts_t02.extend(json_st['data']['children'])
        after_t = json_st['data']['after']       # Token for the next page.
    else:
        print(res_t.status_code)                 # Report the failing status and stop.
        break
    time.sleep(1)                                # Throttle requests to one per second.

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41


In [15]:
len(posts_t02)

1049

### Save the raw data as a JSON-file as soon as possible:

In [16]:
with open('../data/trek_0906.json', 'w+') as f:             # Sep-06 is the latest date that posts were collected.
    json.dump(posts_t02, f)

### Iterate through the collected posts, putting each title and text into a list, then into a data-frame.

In [23]:
posts_extracted_t = []
for post in posts_t02:
    posts_extracted_t.append({'title': post['data']['title'], 'text': post['data']['selftext']})
df_st = pd.DataFrame(posts_extracted_t)
df_st.head(10)

Unnamed: 0,text,title
0,"Well met, fellow Trekkies!\n\nIt's been a whil...","State of the Subreddit: Flairs, Spoilers and C..."
1,"I watched ENT, TNG, currently on DS9 season 2\...",Thanks to Star Trek for making me feel like hu...
2,,Chris Pine says he would love to do Star Trek ...
3,What impact do you think this will have on the...,CBS reportedly negotiating exit for CEO Les Mo...
4,I finally got around to watching Voyager and I...,Is it just me or does the Federation really ne...
5,,Which captain would you feel the most comforta...
6,Let's keep it in-universe only,"Besides ""JANEWAY MURDERED TUVIX"" what in-unive..."
7,Using the Mobile Emitter tech that The Doctor ...,Do you think in the 29th Century that Professo...
8,So its been a while since I bought a new star ...,Recommendations on NX-01 model? What company i...
9,So I just rewatched Best of Both Worlds. And i...,Wolf 359: Voyager plothole


In [130]:
print('Duplicated values:', df_st.duplicated().sum())
df_st.shape

Duplicated values: 76


(1044, 2)

In [131]:
df_st.drop_duplicates(inplace=True)

In [132]:
print('Duplicated values:', df_st.duplicated().sum())
df_st.shape

Duplicated values: 0


(968, 2)

In [136]:
df_st.reset_index(inplace=True)

### From one pull, we already have more Star-Trek posts than Star-Wars posts.

### Save the Star-Trek posts data-frame to a CSV-file.

In [141]:
# df_st.to_csv('../data/posts_trek.csv', index=False)    # Commented out to avoid over-writing the data files.

# Finding fresh posts:

### After collecting the data used for modeling, I decided to gather additional unseen data to test further how the models perform.  I collected posts from both subreddits on Sep-06 for this purpose, using the loop established above.  I start by loading the data set aside for model training, 'combined.csv', and use it to remove any posts that the two sets share.  This leaves only posts from Sep-06 that were not collected earlier.

In [24]:
df_combo = pd.read_csv('../data/combined.csv')

In [57]:
df_combo.head()

Unnamed: 0,text,title,is_trek
0,Things are getting out of hand when it comes t...,On opinions.,0
1,Its been three weeks since the release of Thra...,Thrawn: Alliances by Timothy Zahn - Discussion...,0
2,,"Wow, okay then.",0
3,,Hot Take: R2-D2 is the most consistently best ...,0
4,,Anakin vs Obiwan. Was the most anticipated lig...,0


In [42]:
df_sw.shape                                         # This object contained the Star-Wars posts collected on Sep-06.

(1038, 2)

In [27]:
df_st.head()                                        # This object contained the Star-Trek posts collected on Sep-06.

Unnamed: 0,text,title
0,"Well met, fellow Trekkies!\n\nIt's been a whil...","State of the Subreddit: Flairs, Spoilers and C..."
1,"I watched ENT, TNG, currently on DS9 season 2\...",Thanks to Star Trek for making me feel like hu...
2,,Chris Pine says he would love to do Star Trek ...
3,What impact do you think this will have on the...,CBS reportedly negotiating exit for CEO Les Mo...
4,I finally got around to watching Voyager and I...,Is it just me or does the Federation really ne...


In [None]:
# posts_sw_0906 = [post for post in df_sw['title'] if post not in df_combo['title'].values]   # 'in' on a Series tests the index, so compare against .values.
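
The filtering step described above, keeping only titles that are absent from the training data, can be sketched with `Series.isin`. This is an illustration under assumptions, not the code that produced the counts reported here: the frames `df_train` and `df_fresh` and their contents are hypothetical stand-ins for `df_combo` and the Sep-06 pull.

```python
import pandas as pd

# Hypothetical stand-ins for the training set and the fresh Sep-06 pull.
df_train = pd.DataFrame({'title': ['On opinions.', 'Fetch!']})
df_fresh = pd.DataFrame({'title': ['Fetch!', 'Need a pepakura file!']})

# Note: `value in series` tests the index labels, not the values,
# so membership is checked with isin() instead.
is_seen = df_fresh['title'].isin(df_train['title'])
df_unseen = df_fresh[~is_seen].reset_index(drop=True)

print(list(df_unseen['title']))
```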

## The result is 289 Star-Wars posts ready for final testing.

In [65]:
# new_sw = pd.DataFrame(list(df_sw_new), columns=['test_titles'])
# new_sw['target'] = 0
# new_sw.tail()

Unnamed: 0,test_titles,target
284,Legend of Jedi Knight Shiin. Thank you for the...,0
285,Doing my daughters laundry tonight (11 years o...,0
286,Need a pepakura file!,0
287,Just trying to help...,0
288,A retrospective review celebrating the 10th an...,0


In [None]:
# new_sw.to_csv('../data/new_sw_0906.csv', index=False)       # Commented out to avoid overwriting the saved data.

## I collect the same kind of extra testing set from 'r/startrek' on Sep-06 and add a 'target' column to record the source of each post.  This sets aside 152 Star-Trek posts for final testing.

In [64]:
# new_st = pd.DataFrame(list(df_st_new), columns=['test_titles'])
# new_st['target'] = 1
# new_st.tail()

Unnamed: 0,test_titles,target
147,"""Star Trek V: The Final Frontier"" (1989): God ...",1
148,Random DS9 questions,1
149,"[video] ""Matters of internal security. The age...",1
150,Starting TOS for the very first time,1
151,Jammer:Pondering Patrick Stewart’s return to t...,1


In [63]:
# new_st.to_csv('../data/new_st_0906.csv', index=False)       # Commented out to avoid overwriting the saved data.

## Continue to Notebook-2.