Using YouTube "Trending" page data to predict the 'Category' of a video from its 'Title'
======
***

# Table of Contents

## Part 1: Reading and Merging Data Sources
## Part 2: Train (using Naive Bayes)
## Part 3: Test

# Part 1: Reading and Merging Data Sources
***


### Data source:
+ https://www.kaggle.com/datasnaek/youtube-new
+ Please ensure USvideos.csv and US_category_id.json are in the working directory.

### Import Modules

In [1]:
import numpy as np
import pandas as pd
import collections
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Import the CSV and take an initial look:

In [2]:
USvids = pd.read_csv("USvideos.csv", header=0)
USvids.head(3)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...


### Delete unused columns and rename the remaining columns:

In [3]:
keep_columns = ['title', 'category_id']
# Keep only the title and category ID columns, then rename them in place.
new_USvids = USvids[keep_columns].rename(columns={'title': 'Title', 'category_id': 'Category_ID'})

### The data source provides descriptions of each Category_ID in a separate JSON file.
### Let's look at the JSON file:

In [4]:
Categories_JSON = pd.read_json("US_category_id.json")
Categories_JSON.head(3)

Unnamed: 0,kind,etag,items
0,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."
1,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."
2,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."


### Create a list of dictionaries with ID and Category label mapping:

In [5]:
CategoryDict = [{'id': item['id'], 'title': item['snippet']['title']} for item in Categories_JSON['items']]

### Create a data frame of the above information

In [6]:
CategoriesDF = pd.DataFrame(CategoryDict)
Categories = CategoriesDF.rename(index=str, columns={"id": "Category_ID", "title": "Category"})
Categories.head(3)

Unnamed: 0,Category_ID,Category
0,1,Film & Animation
1,2,Autos & Vehicles
2,10,Music
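
The two frames built in this part are not yet joined. The following is a minimal sketch of one way to attach the category label to each video, using toy stand-ins for `new_USvids` and `Categories`. The names `videos`, `categories` and `merged`, and the sample rows, are hypothetical. It assumes the IDs parsed from the JSON file are strings while `category_id` in the CSV is an integer, so the key is cast before merging.

```python
import pandas as pd

# Toy stand-ins for new_USvids and Categories (values are illustrative only).
videos = pd.DataFrame({
    "Title": ["Some music video", "Some car review", "Unlabelled clip"],
    "Category_ID": [10, 2, 999],          # integers, as read from the CSV
})
categories = pd.DataFrame({
    "Category_ID": ["1", "2", "10"],      # strings, as parsed from the JSON
    "Category": ["Film & Animation", "Autos & Vehicles", "Music"],
})

# Align the key dtypes before merging; int and str keys would not match.
categories["Category_ID"] = categories["Category_ID"].astype(int)

# A left join keeps every video and leaves Category as NaN where no label exists.
merged = videos.merge(categories, on="Category_ID", how="left")
print(merged)
```

A left join is used here so that videos with an unknown ID stay visible instead of silently dropping out.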


# Part 2: Train (using Naive Bayes)
***

### Convert each 'Title' into a vector of word counts using CountVectorizer:

In [7]:
vector = CountVectorizer()
counts = vector.fit_transform(new_USvids['Title'].values)
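
To make the vectorizing step concrete, here is a small sketch on three made-up titles. The names `toy_titles`, `toy_vector`, `toy_counts` and `vocab` are hypothetical and independent of the notebook's variables. It shows that the result is a sparse matrix with one row per title and one column per vocabulary token.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles, for illustration only.
toy_titles = ["Cat plays with toy", "Cat video", "Official music video"]

# Defaults: text is lowercased and split into tokens of two or more word characters.
toy_vector = CountVectorizer()
toy_counts = toy_vector.fit_transform(toy_titles)

# vocabulary_ maps each token to its column index; sorting the keys gives column order.
vocab = sorted(toy_vector.vocabulary_)
print(vocab)

# Each row is one title; each column holds the count of one vocabulary token.
print(toy_counts.toarray())
```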

### Fit the naive Bayes model with 'Category_ID' as the target:

In [8]:
NB_Model = MultinomialNB()
targets = new_USvids['Category_ID'].values
NB_Model.fit(counts,targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Check Accuracy using a 90/10 train/test split

In [9]:
X = counts
y = targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

NBtest = MultinomialNB().fit(X_train, y_train)
nb_predictions = NBtest.predict(X_test)
acc_nb = NBtest.score(X_test, y_test)
print('The Naive Bayes Algorithm scored an accuracy of', acc_nb)

The Naive Bayes Algorithm scored an accuracy of 0.90386803185438
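
If a title appears on more than one trending date, a random split can place copies of it in both the training and the test set, which would make the score optimistic. The sketch below shows one way to guard against that on a toy frame: drop duplicate titles first, fix the split with `random_state`, and wrap the vectorizer and classifier in a pipeline so the vocabulary is learned from the training titles only. The frame `toy` and the names `unique_titles`, `model` and `accuracy` are hypothetical, and the 50/50 split is only there because the toy data is tiny.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for new_USvids; repeated rows mimic a video trending on several days.
toy = pd.DataFrame({
    "Title": ["cat video", "cat video", "cat toy", "funny cat",
              "music video official", "music video official",
              "official song", "new music song"],
    "Category_ID": [15, 15, 15, 15, 10, 10, 10, 10],
})

# Keep one row per title so no title can land in both train and test.
unique_titles = toy.drop_duplicates(subset="Title")

X_train, X_test, y_train, y_test = train_test_split(
    unique_titles["Title"], unique_titles["Category_ID"],
    test_size=0.5, random_state=0, stratify=unique_titles["Category_ID"],
)

# The pipeline fits CountVectorizer on the training titles only.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("Accuracy on held-out unique titles:", accuracy)
```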


## The accuracy (about 90%) is satisfactory, so training on the historical data is complete.
***

# Part 3: Test

### Enter hypothetical titles whose category we want to predict:

In [10]:
Titles = ["Hilarious cat plays with toy",
        "Best fashion looks for Spring 2018",
        "Olympics opening ceremony highlights",
        "Warriors basketball game versus the cavs",
        "CNN world news on donald trump",
        "Police Chase in Hollywood",
        "Ed Sheeran - Perfect (Official Music Video)",
        "how to do eyeshadow"
         ]

### Pass these titles to the naive Bayes model:

In [11]:
Titles_counts = vector.transform(Titles)
Predict = NB_Model.predict(Titles_counts)
Predict

array([24, 22, 17, 17, 25, 25, 10, 26], dtype=int64)
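
`predict` returns only the most likely ID for each title. As a rough sketch of how to see how confident the model is, `predict_proba` returns one probability per class, with columns ordered as in `classes_`. The training titles, IDs and variable names below (`toy_vector`, `toy_model`, `proba`, `best`) are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: two titles per category ID.
train_titles = ["official music video", "new song music",
                "basketball game highlights", "football game highlights"]
train_ids = [10, 10, 17, 17]

toy_vector = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vector.fit_transform(train_titles), train_ids)

new_titles = ["music video", "game highlights"]
proba = toy_model.predict_proba(toy_vector.transform(new_titles))

# Columns of proba follow toy_model.classes_, so argmax indexes into that array.
best = toy_model.classes_[np.argmax(proba, axis=1)]
for title, label, p in zip(new_titles, best, proba.max(axis=1)):
    print(f"{title!r}: category {label} (p={p:.2f})")
```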

### The output is an array of category IDs. Look each ID up in the Category Dictionary (from the JSON file) to find its "title":

In [12]:
# Map each ID (stored as a string in the JSON) to its category title.
IdToTitle = {x["id"]: x["title"] for x in CategoryDict}
# Use a placeholder for unknown IDs so the list stays aligned with Titles.
CategoryNamesList = [IdToTitle.get(str(Category_ID), "Unknown") for Category_ID in Predict]

### Map these values to the Titles we want to Predict:

In [13]:
TitleDataFrame = []
for i in range(0, len(Titles)):
    TitleToCategories = {'Title': Titles[i],  'Category': CategoryNamesList[i]}
    TitleDataFrame.append(TitleToCategories)

### Convert the resulting list of dicts to a Data Frame:

In [14]:
# Build the frame from the list of dicts and rename the columns by name,
# so the labels do not depend on the column order pandas infers.
FinalDF = pd.DataFrame(TitleDataFrame).rename(
    columns={'Title': 'Hypothetical Video Title', 'Category': 'Predicted Category'})
# Put the title first and the predicted category second.
FinalDF = FinalDF[['Hypothetical Video Title', 'Predicted Category']]

### View final prediction results:

In [15]:
FinalDF

Unnamed: 0,Hypothetical Video Title,Predicted Category
0,Hilarious cat plays with toy,Entertainment
1,Best fashion looks for Spring 2018,People & Blogs
2,Olympics opening ceremony highlights,Sports
3,Warriors basketball game versus the cavs,Sports
4,CNN world news on donald trump,News & Politics
5,Police Chase in Hollywood,News & Politics
6,Ed Sheeran - Perfect (Official Music Video),Music
7,how to do eyeshadow,Howto & Style
