# Final Project: Phase 2

---

In [12]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn
import json
import warnings
warnings.filterwarnings('ignore')

## Reimporting

First, let's reimport all of the data we finalized during our EDA session.

In [13]:
df = pd.read_csv("./project_data/USVideos.csv")

with open("./project_data/US_category_id.json") as f:
    categories = json.load(f)

# Map each numeric category id to its human-readable title
cat_map = {int(cat["id"]): cat["snippet"]["title"] for cat in categories["items"]}

df["category"] = df["category_id"].map(cat_map)

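# Take an explicit copy so that adding a column doesn't write into a slice of df
data = df[["title","channel_title","category","description","tags","views"]].copy()

# log(views + 1), so a video with zero views doesn't produce -inf
data["log_views"] = np.log1p(data["views"])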

Let's take stock of what we have:

In [14]:
data.head()

Unnamed: 0,title,channel_title,category,description,tags,views,log_views
0,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,People & Blogs,SHANTELL'S CHANNEL - https://www.youtube.com/s...,SHANtell martin,748374,13.525659
1,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,Entertainment,"One year after the presidential election, John...","last week tonight trump presidency|""last week ...",2418783,14.698775
2,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,Comedy,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,14.975981
3,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,Entertainment,Today we find out if Link is a Nickelback amat...,"rhett and link|""gmm""|""good mythical morning""|""...",343168,12.745978
4,I Dare You: GOING BALD!?,nigahiga,Entertainment,I know it's been a while since we did this sho...,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,14.555413


## Splitting into Train/Test

First, we split our dataset into features and the target variable. Since this is a regression problem, the target is the log-transformed view count.

In [15]:
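# Drop both the raw and the log-transformed view counts so the target can't leak into the features
X = data.drop(["views", "log_views"], axis=1)
Y = data["log_views"]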

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, Y, random_state=42)
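
Since we don't pass `test_size`, `train_test_split` falls back to its default of holding out 25% of the rows, and `random_state=42` makes the shuffle reproducible. Here is a minimal sketch on made-up toy arrays (`X_toy` and `y_toy` are illustrative names, not part of our dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows with 2 features each (illustrative only)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# No test_size given, so 25% of the rows are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

print(len(X_tr), len(X_te))
```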

In [18]:
# Checking how the most frequent channels are distributed across the full set and each split
X["channel_title"].value_counts().head()

ESPN                                      203
The Tonight Show Starring Jimmy Fallon    197
Vox                                       193
TheEllenShow                              193
Netflix                                   193
Name: channel_title, dtype: int64

In [19]:
X_train_raw["channel_title"].value_counts().head()

The Late Show with Stephen Colbert        155
ESPN                                      149
Late Night with Seth Meyers               143
The Tonight Show Starring Jimmy Fallon    142
TheEllenShow                              142
Name: channel_title, dtype: int64

In [20]:
X_test_raw["channel_title"].value_counts().head()

BuzzFeedVideo                             59
The Tonight Show Starring Jimmy Fallon    55
WIRED                                     54
Netflix                                   54
ESPN                                      54
Name: channel_title, dtype: int64
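
The counts above show that the same channels (ESPN, for example) land in both the training and the test split. We keep the random split in this notebook, but if we ever wanted every channel to sit entirely on one side, one option would be scikit-learn's `GroupShuffleSplit`. This is only a sketch on a made-up list of channels, not what the rest of the notebook uses:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: two videos per channel (illustrative only)
channels = np.array(["ESPN", "ESPN", "Vox", "Vox",
                     "Netflix", "Netflix", "WIRED", "WIRED"])
X_toy = np.arange(len(channels)).reshape(-1, 1)

# Split by group, so all videos from one channel stay together
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X_toy, groups=channels))

# No channel should appear on both sides of the split
overlap = set(channels[train_idx]) & set(channels[test_idx])
print(overlap)
```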

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

## Feature Transformation

Use your training data to fit any transformers or encoders you need, then apply the fitted transformers to your test data. This applies to:
* Normalizing/standardizing your features
* Using Bag of Words or TF-IDF to encode strings
* PCA or dimensionality reduction

**Rationale**: In practice, we won't be able to see the test data we'll be making predictions for, so we shouldn't use that data as the basis for any transformation or feature extraction.

Here, we **only take the top N** words when building our count vectors. Using the full vocabulary would be infeasible because there are so many distinct words. Also, due to the nature of language, there are probably only a few words that are used frequently, but many words that are used only once.
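
To make the "fit on train, apply to test" idea concrete, here is a small sketch on made-up tag strings (`train_tags` and `test_tags` are illustrative, not rows from our dataset). The vocabulary is learned from the training strings only, capped at the most frequent tokens, and any token outside that vocabulary is simply ignored when transforming the test strings:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy pipe-separated tags (illustrative only)
train_tags = ["music|pop|live", "music|rock", "comedy|live|music"]
test_tags = ["music|jazz", "comedy|sketch"]

# Each token is everything between two "|" characters; keep only the top 2 tokens
vectorizer = CountVectorizer(token_pattern=r"([^|]+)", max_features=2)

# Fit on the training data only, then reuse the fitted vocabulary on the test data
train_vectors = vectorizer.fit_transform(train_tags).toarray()
test_vectors = vectorizer.transform(test_tags).toarray()

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(test_vectors)
```

Here "jazz", "comedy" and "sketch" fall outside the learned vocabulary, so they contribute nothing to the test vectors.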

In [22]:
# This function fits our feature transformation on the training data and applies it to both sets
# top_n: the number of most frequent tokens to keep for each text column (tags, description, title)

def apply_feature_transformation(X_train, X_test, top_n=100):
    # First, create and fit our vectorizers on the training data
    # We only take the top_n words into consideration, and simply ignore anything outside of that
    # Tags are pipe-separated, so each token is everything between two "|" characters
    tag_vectorizer = CountVectorizer(token_pattern=r"([^|]+)", max_features=top_n)
    tag_vectors = tag_vectorizer.fit_transform(X_train["tags"].str.replace("\"", "", regex=False)).toarray()

    # Descriptions can be missing, so cast them to unicode strings first
    description_vectorizer = CountVectorizer(max_features=top_n)
    description_vectors = description_vectorizer.fit_transform(X_train["description"].values.astype('U')).toarray()

    title_vectorizer = CountVectorizer(max_features=top_n)
    title_vectors = title_vectorizer.fit_transform(X_train["title"]).toarray()

    # Next, we concatenate the tag, description, and title vectors into one vector
    # This is so we can have one "object" that we can fit into our models
    # Therefore, this will create a matrix of size [num_train_instances * (3 * top_n)]
    X_train_transformed = np.concatenate((tag_vectors, description_vectors, title_vectors), axis=1)

    # Now we transform our test data with the already-fitted vectorizers (no refitting)
    # This will create a matrix of size [num_test_instances * (3 * top_n)]
    X_test_transformed = np.concatenate(
        (
            tag_vectorizer.transform(X_test["tags"].str.replace("\"", "", regex=False)).toarray(),
            description_vectorizer.transform(X_test["description"].values.astype('U')).toarray(),
            title_vectorizer.transform(X_test["title"]).toarray()
        ),
        axis=1
    )

    return X_train_transformed, X_test_transformed

In [23]:
X_train, X_test = apply_feature_transformation(X_train_raw,X_test_raw,100)

In [24]:
# Checking to make sure we shaped our features right
# There should be 3*top_n features
X_train.shape

(30711, 300)

In [25]:
X_test.shape

(10238, 300)

## Trying Different Models

Now that we have our data fully preprocessed, let's try it with some different models to see what we can get.

### Linear Regression

It's always good to work your way up from the simplest models you know before trying the more complicated ones. So let's see if a simple linear regression will solve our problem.

In [28]:
from sklearn.linear_model import LinearRegression

In [29]:
reg = LinearRegression().fit(X_train, y_train)

In [30]:
print("Training score", reg.score(X_train,y_train))
print("Testing score:", reg.score(X_test,y_test))

Training score 0.2489374603044553
Testing score: 0.2334513260988368


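Eugh. Looks like linear regression isn't powerful enough for this problem: for regressors, `score` reports R², so the model explains only about 23% of the variance in log views on the test set. There's no use trying to tune this model if even the baseline score doesn't look promising.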
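
For scikit-learn regressors, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean. Below is a minimal sketch of that formula on made-up numbers (`r_squared` is a helper written for illustration, not a scikit-learn function):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_toy = [10.0, 12.0, 14.0]                        # illustrative targets
perfect = r_squared(y_toy, y_toy)                 # predictions match exactly
mean_only = r_squared(y_toy, [12.0, 12.0, 12.0])  # always predict the mean
print(perfect, mean_only)
```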

### KNN Regression

In [31]:
from sklearn.neighbors import KNeighborsRegressor

In [32]:
neigh = KNeighborsRegressor(n_neighbors=2)
neigh.fit(X_train,y_train)

KNeighborsRegressor(n_neighbors=2)

Scoring a KNN regressor is slow, because every prediction has to search the training set for its nearest neighbors. So we'll only score the testing set.

In [33]:
print("Testing score:", neigh.score(X_test,y_test))

Testing score: 0.8336338911371903
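
For intuition, a KNN regressor with uniform weights predicts the average target of the `k` closest training points. The sketch below is a brute-force toy version on made-up points (`knn_predict` is an illustrative helper, not how scikit-learn implements the search internally):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=2):
    """Average the targets of the k training points closest to x_query."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance (p=2)
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    return y_train[nearest].mean()

# Toy data: two clusters of points with different targets (illustrative only)
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array([10.0, 12.0, 15.0, 17.0])

near_first_cluster = knn_predict(X_toy, y_toy, np.array([0.4, 0.0]))
near_second_cluster = knn_predict(X_toy, y_toy, np.array([5.5, 5.0]))
print(near_first_cluster, near_second_cluster)
```

Because every query compares against all training rows, the cost grows with the size of the training set, which is why scoring takes a while here.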


### Multi Layer Perceptron Regression

Let's see if we can squeeze anything out of a neural network. (What you don't see here is the fiddling with the parameters and layer topology that led to this configuration.)

In [34]:
from sklearn.neural_network import MLPRegressor

In [35]:
nn = MLPRegressor(max_iter=45, hidden_layer_sizes = (300,100,50,10))
nn.fit(X_train, y_train)

MLPRegressor(hidden_layer_sizes=(300, 100, 50, 10), max_iter=45)

In [36]:
nn.score(X_train,y_train)

0.9236261434884704

In [37]:
nn.score(X_test,y_test)

0.8703367358279579

**To continue, let's choose the KNN model**. The MLP scored slightly higher on the test set (0.870 vs. 0.834), but KNN is semi-interpretable, so it's preferred over the ANN.

## Hyperparameter Tuning
For the sake of time, we'll only be finding hyperparams for our KNN model.

In [38]:
from sklearn.model_selection import GridSearchCV

In [39]:
def find_best_hyperparameters(X_train, y_train, X_test):
    """
    Input: The training X features and Y values, plus the test X features
    Output: The fitted grid search (refit on the best hyperparams), and its predictions on X_test
    """
    neigh = KNeighborsRegressor()
    param_grid = {"n_neighbors": [1,2],
                  "p": [1,2]}
    
    # Warning, takes a while! Every combination in the grid is cross-validated
    search = GridSearchCV(neigh, param_grid)
    search.fit(X_train,y_train)
    # X_test is passed in explicitly instead of being read from the notebook's global scope
    return search, search.predict(X_test)

In [40]:
best_model, predictions = find_best_hyperparameters(X_train, y_train, X_test)

In [41]:
best_model

GridSearchCV(estimator=KNeighborsRegressor(),
             param_grid={'n_neighbors': [1, 2], 'p': [1, 2]})
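
The printed object only echoes the grid we searched over. The winning combination lives in attributes such as `best_params_`, `best_score_` and `best_estimator_`. Here is a sketch on small random data (the `_toy` names and the data are made up, so the chosen values say nothing about our real result):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Toy regression data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy[:, 0] + 0.1 * rng.normal(size=60)

# Same grid as above: 2 values of n_neighbors x 2 values of p = 4 combinations
search_toy = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [1, 2], "p": [1, 2]})
search_toy.fit(X_toy, y_toy)

print(search_toy.best_params_)     # the combination with the best mean CV score
print(search_toy.best_score_)      # that mean cross-validated R² score
print(search_toy.best_estimator_)  # a regressor refit on all of the toy data
```

On our fitted search, the same attributes are available as `best_model.best_params_` and so on.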

In [42]:
predictions

array([13.26273374, 14.49446214, 14.19448261, ..., 10.08341666,
       16.12511238, 12.28874602])
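
Keep in mind that these predictions are in log space, since our target was `log(views + 1)`. To read them as view counts, we can undo the transform with `np.expm1`. The sketch below uses the first three values from the output above:

```python
import numpy as np

# First three predicted log_views values, copied from the output above
log_predictions = np.array([13.26273374, 14.49446214, 14.19448261])

# Invert log(views + 1): views = exp(log_views) - 1
view_predictions = np.expm1(log_predictions)
print(view_predictions.round())
```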