# Predict the Epic Sci-Fi Universe
# Notebook-2 (Preprocessing)
### Perry Shyr

### This notebook covers the concatenation and data-processing steps that prepare the posts for modeling and analysis.  In the data collected in the first notebook, the majority class (the more active source) was the 'r/startrek' subreddit.  For model training, the posts from the two sources need to be combined.  The "Star/star" shared by the two franchise titles would not help either of the two main models tell the classes apart, so it was removed from those bi-grams.  The removal is done with a somewhat inelegant pair of replace calls, pending a cleaner function.

## Load libraries and data:

In [1]:
import requests
import json
import time
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import regex as re

np.random.seed(42)

%matplotlib inline

In [21]:
posts_w1u = pd.read_csv('../data/posts_wars.csv')   # Unique Star-Wars posts loaded.
posts_t1u = pd.read_csv('../data/posts_trek.csv')   # Unique Star-Trek posts loaded.

In [22]:
posts_w1u.head()                                    # Check the Star-Wars posts.

Unnamed: 0,index,text,title
0,0,Things are getting out of hand when it comes t...,On opinions.
1,1,Its been three weeks since the release of Thra...,Thrawn: Alliances by Timothy Zahn - Discussion...
2,2,,"Wow, okay then."
3,3,,Hot Take: R2-D2 is the most consistently best ...
4,4,,Anakin vs Obiwan. Was the most anticipated lig...


In [23]:
print(posts_w1u['text'].isnull().sum()*100./len(posts_w1u), '% of Star-Wars texts are blank.')

62.48548199767712 % of Star-Wars texts are blank.


In [24]:
posts_t1u.head()                                    # Check the Star-Trek posts.

Unnamed: 0,index,text,title
0,0,,‘Star Trek: Discovery’ Cast Tease Juicy Storyl...
1,1,,The Shuttlecraft Galileo NCC-1701/7 Blueprints...
2,2,,"If it's ever necessary to show young Picard, h..."
3,3,,Throwback to everythingisterrible discovering ...
4,4,,"Composers Ron Jones, Dennis McCarthy and Jay C..."


In [25]:
print(posts_t1u['text'].isnull().sum()*100./len(posts_t1u), '% of Star-Trek texts are blank.')

28.40909090909091 % of Star-Trek texts are blank.


In [26]:
print('The number of unique Star-Wars posts is ',len(posts_w1u), '.')
print('The number of unique Star-Trek posts is ',len(posts_t1u), '.')

The number of unique Star-Wars posts is  861 .
The number of unique Star-Trek posts is  968 .


### Identify the source of the posts here with "1" for 'r/startrek' and "0" for 'r/StarWars'.

In [27]:
posts_w1u['is_trek'] = 0

In [28]:
posts_t1u['is_trek'] = 1

### Combine the posts from both subreddits here.

In [29]:
posts_df = pd.concat([posts_w1u,posts_t1u], axis=0, ignore_index=True)

In [30]:
posts_df.tail()                                  # Note the extra 'index' column carried over from the CSV files.

Unnamed: 0,index,text,title,is_trek
1824,964,,Kate Mulgrew panel at the 2018 Star Trek Conve...,1
1825,965,How about show that the reason he never marrie...,Brilliant Idea for Star Trek: Picard,1
1826,966,,The Trek Family Photo,1
1827,967,,My inner thoughts when I heard the news,1
1828,968,http://i.imgur.com/Pf3HWEp.gifv\n\n/r/startrek...,When Patrick Stewart thought he was done with ...,1


### Drop the 'index' column that was loaded from the CSV files; it no longer matches the row positions after the concatenation of the two groups of posts.

In [31]:
posts_df.drop('index', axis=1, inplace=True)

In [32]:
posts_df.head(15)                               # Verify that only three columns remain.

Unnamed: 0,text,title,is_trek
0,Things are getting out of hand when it comes t...,On opinions.,0
1,Its been three weeks since the release of Thra...,Thrawn: Alliances by Timothy Zahn - Discussion...,0
2,,"Wow, okay then.",0
3,,Hot Take: R2-D2 is the most consistently best ...,0
4,,Anakin vs Obiwan. Was the most anticipated lig...,0
5,,"Well it wasn’t in Maz Kanata’s basement, but i...",0
6,,"Something I photoshopped together for fun, fig...",0
7,,Finished cardboard Executor class super star d...,0
8,One of the best things I feel the Clone Wars d...,Anyone else love the scenes where Anakin and O...,0
9,,Samurai Stormtrooper,0


In [33]:
posts_df.shape

(1829, 3)

####  Save the resulting object (combined titles and selftexts of the posts, as well as their source) to a CSV-file here:

In [90]:
# posts_df.to_csv('../data/combined.csv', index=False)

### Note: The bi-grams for the franchise titles should be reduced to a meaningful uni-gram here.  I use Regex in the replace-method to do this.
#### (Using this series of replace calls preserves all other references to "Star"/"star", which treating "star" as a stop-word would not.  For example, "Death Star" is not reduced to simply "Death.")

In [8]:
test_string = 'Wish upon a star.  A star is born.  Star that paper.  Did you see the new StarWars film.  What episode of Star Trek did you watch on TV.  My favorite Star Wars movie is Empire Strikes Back.  How big is the Death Star?  I navigated by starlight.  That is a bright star.  A Vulcan is an advanced race on StarTrek.  That star wars character is scary.  Is starwars a great saga?  My startrek toy is really cool.  That startrek weapon is so futuristic.'

In [9]:
test_string                         # This is the string used to test the Regex used.

'Wish upon a star.  A star is born.  Star that paper.  Did you see the new StarWars film.  What episode of Star Trek did you watch on TV.  My favorite Star Wars movie is Empire Strikes Back.  How big is the Death Star?  I navigated by starlight.  That is a bright star.  A Vulcan is an advanced race on StarTrek.  That star wars character is scary.  Is starwars a great saga?  My startrek toy is really cool.  That startrek weapon is so futuristic.'
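
The two patterns can be tried on a few of the test sentences before they are applied to the posts. The sketch below uses the standard-library `re` module. The helper name `strip_franchise_star` and the shortened `sample` string are illustrative and not part of the original notebook.

```python
import re

# Shortened version of the test string above (illustrative subset).
sample = ('My favorite Star Wars movie is Empire Strikes Back.  '
          'How big is the Death Star?  I navigated by starlight.  '
          'A Vulcan is an advanced race on StarTrek.  '
          'Is starwars a great saga?')

def strip_franchise_star(text):
    """Reduce the franchise bi-grams to 'Trek'/'Wars'; leave other uses of 'star' alone."""
    text = re.sub(r'[Ss]tar\s?[Tt]rek', 'Trek', text)
    text = re.sub(r'[Ss]tar\s?[Ww]ars', 'Wars', text)
    return text

cleaned = strip_franchise_star(sample)
print(cleaned)
```

Wrapping the two substitutions in one function is one possible form of the "better code" mentioned in the introduction. "Death Star" and "starlight" should pass through unchanged, while every spelling of the franchise names loses its "Star".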

In [34]:
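# regex=True is stated explicitly: in pandas 2.0+ str.replace treats the pattern as a literal string by default.
posts_df['title'] = posts_df['title'].str.replace(r'(S|s)tar\s?(T|t)rek', 'Trek', regex=True)

In [35]:
posts_df['title'] = posts_df['title'].str.replace(r'(S|s)tar\s?(W|w)ars', 'Wars', regex=True)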

In [36]:
posts_df['title']

0                                            On opinions.
1       Thrawn: Alliances by Timothy Zahn - Discussion...
2                                         Wow, okay then.
3       Hot Take: R2-D2 is the most consistently best ...
4       Anakin vs Obiwan. Was the most anticipated lig...
5       Well it wasn’t in Maz Kanata’s basement, but i...
6       Something I photoshopped together for fun, fig...
7       Finished cardboard Executor class super star d...
8       Anyone else love the scenes where Anakin and O...
9                                    Samurai Stormtrooper
10      what's your favorite thing from disney canon s...
11                        My sons Lego Wars collection...
12                  Just watched ANH with a Live Concert!
13                      Upcoming Comic Book Release Dates
14      Wife bought us tickets for the Empire Strikes ...
15                     What I want from a Boba Fett movie
16                                     A little practice.
17      Found 

#### The combined titles and selftexts of the posts, with the "Star"/"star" removed from the franchise bi-grams in the titles, are saved here along with their source in a separate CSV file named "combined_no_star.csv":

In [174]:
posts_df.to_csv('../data/combined_no_star.csv', index=False)

### Retrieve the fresh posts collected on Sep-06 and perform the same replacements.  This way the modeling will provide consistent results for direct comparison.

In [43]:
new_test_sw = pd.read_csv('../data/new_sw_0906.csv')
new_test_st = pd.read_csv('../data/new_st_0906.csv')

In [44]:
new_test_sw.head()                                  # Note that just the titles were saved during the later collection.

Unnamed: 0,test_titles,target
0,I drew darth vader!and i need opinions.,0
1,Should TIE/D automated starfighters become can...,0
2,Fulfill your destiny,0
3,Sweet cufflinks I found at Fan Expo,0
4,🤦🏼‍♂️,0


In [45]:
print('The number of fresh unique Star-Wars posts set aside for testing is ',len(new_test_sw), '.')
print('The number of fresh unique Star-Trek posts set aside for testing is ',len(new_test_st), '.')

The number of fresh unique Star-Wars posts set aside for testing is  289 .
The number of fresh unique Star-Trek posts set aside for testing is  152 .


#### Concatenate the fresher sets of posts together for a total of 441 posts that can be used for testing.  In this later collection the 'r/StarWars' subreddit was more active than 'r/startrek'.  As a result, about 65% of the 441 posts (289) come from Star-Wars fans, so Star Wars becomes the majority class here, the reverse of the training data.

In [46]:
new_test_df = pd.concat([new_test_sw,new_test_st], axis=0, ignore_index=True)
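
The class balance quoted above can be checked directly from the post counts printed in this notebook. This is a small sketch in plain Python. The dictionary names and the `majority_share` helper are made up for illustration.

```python
# Post counts reported earlier in this notebook.
train_counts = {'wars': 861, 'trek': 968}
test_counts = {'wars': 289, 'trek': 152}

def majority_share(counts):
    """Return the majority label and its share of all posts."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total

for name, counts in [('training', train_counts), ('fresh test', test_counts)]:
    label, share = majority_share(counts)
    print(f'{name}: majority class = {label}, share = {share:.1%}')
```

A classifier that always guessed the majority class would score roughly these shares on each set, which gives a baseline for judging the accuracy scores.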

In [47]:
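# Use .str.replace with regex=True so the pattern is substituted inside each title;
# plain Series.replace without regex=True only matches cells equal to the pattern text.
new_test_df['test_titles'] = new_test_df['test_titles'].str.replace(r'(S|s)tar\s?(T|t)rek', 'Trek', regex=True)

In [None]:
new_test_df['test_titles'] = new_test_df['test_titles'].str.replace(r'(S|s)tar\s?(W|w)ars', 'Wars', regex=True)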
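
One detail worth checking at this step is that `Series.replace` and `Series.str.replace` behave differently in pandas. The sketch below uses a three-title toy Series (made up for illustration, not taken from the collected posts) to show the difference.

```python
import pandas as pd

titles = pd.Series(['Star Trek rocks', 'starwars toys', 'Death Star'])
trek_pattern = r'(S|s)tar\s?(T|t)rek'
wars_pattern = r'(S|s)tar\s?(W|w)ars'

# Series.replace without regex=True looks for cells equal to the pattern text,
# so nothing in these titles is changed.
whole_value = titles.replace(trek_pattern, 'Trek')

# Series.str.replace with regex=True substitutes inside each string.
in_string = (titles.str.replace(trek_pattern, 'Trek', regex=True)
                   .str.replace(wars_pattern, 'Wars', regex=True))

print(whole_value.tolist())
print(in_string.tolist())
```

Comparing a few rows before and after the replacement, as done here, is a quick way to confirm that the test titles receive the same treatment as the training titles.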

### The same removal of "Star"/"star" from bi-gram references to the franchise titles is applied to the fresh posts, and the modified posts are saved to a separate CSV file named "combined_new_test_no_star.csv".

In [21]:
new_test_df.to_csv('../data/combined_new_test_no_star.csv', index=False)

## Continue to Notebook-3, for the modeling steps. > > > > > > >

### The next notebook fits five separate classification models to our data: logistic regression, multinomial Naive Bayes, k-Nearest Neighbors, random forests (an ensemble method), and support-vector machines.  I feed a TF-IDF-vectorized dataset into each model and search for the best hyperparameters to use in the modeling.  The accuracy scores and the degree of overfitting are compared before narrowing the list to the two best models for more detailed evaluation.