### Assignment Statement:
Scrape YouTube (or similar websites like Dailymotion) to collect data for classifying videos into the
following categories:

### Categories Required:
* Travel Blogs
* Science and Technology
* Food
* Manufacturing
* History
* Art and Music

### Parameters Stated:
* Video id 
* Title 
* Description 
* Category

Also, please note that collecting at least 1700 samples per category is mandatory.

### Some tips:
1. Utilize packages like nltk to sanitize descriptions by removing credits, contact information,
subscription requests, etc., as these will not contribute to better-trained
models and may even reduce accuracy.
2. Try to minimize the effort you put into transforming between formats by using only the
above CSV.
3. This exercise is for you to demonstrate your ability to handle large volumes of data. You
are free to explore techniques online, but the code you submit should be completely
written by you.
4. At the end of this exercise, please submit the data as an Excel sheet.
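
Tip 1 can be tried out without any third-party dependency. The block below is a minimal sketch that uses Python's standard `re` module in place of nltk (a swapped-in technique, not the one the tip names). The function name `sanitize_description`, the regular expressions, and the keyword list are all assumptions chosen for illustration and would need tuning against real descriptions.

```python
import re

# Hypothetical patterns; adjust them after inspecting real descriptions.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
BOILERPLATE_RE = re.compile(
    r"\b(subscribe|follow (?:me|us)|credits?|contact|business inquiries)\b",
    re.IGNORECASE,
)

def sanitize_description(text):
    """Drop boilerplate lines, then strip URLs and e-mail addresses."""
    kept = []
    for line in text.splitlines():
        if BOILERPLATE_RE.search(line):
            continue  # skip lines that look like credits or subscription requests
        line = URL_RE.sub(" ", line)
        line = EMAIL_RE.sub(" ", line)
        kept.append(line)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", " ".join(kept)).strip()

raw = (
    "Exploring the old town of Vienna on foot.\n"
    "SUBSCRIBE for more videos: https://example.com/channel\n"
    "Contact: me@example.com\n"
    "Full route map at www.example.com/map today"
)
print(sanitize_description(raw))
```

Dropping a whole line on a keyword match is deliberately blunt: a genuine sentence that happens to contain "contact" would be lost too, so the keyword list is worth reviewing on a sample before running it over the full dataset.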

### Text classification: 

You have to choose and use one model/technique from each of the following categories:

* Category: Model types
    1. Linear classifiers, Naive Bayes classifiers, or SVMs
    2. Bagging models, boosting models, or shallow NNs
    3. CNN, LSTM, GRU, bidirectional RNNs, or RCNNs
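
To make the first category concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier written with only the standard library. In practice a library such as scikit-learn would normally be used instead. The class name `TinyNaiveBayes` and the toy training titles are made up for this example and are not part of the collected data.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy data for illustration only.
titles = [
    "backpacking through europe travel vlog",
    "budget travel tips for asia",
    "street food tour in bangkok",
    "easy pasta recipe food at home",
]
labels = ["Travel Blogs", "Travel Blogs", "Food", "Food"]

model = TinyNaiveBayes().fit(titles, labels)
print(model.predict("cheap travel destinations"))
print(model.predict("best street food recipe"))
```

Whitespace tokenization keeps the sketch short; a real pipeline would more likely feed TF-IDF or count features from the cleaned titles and descriptions into the chosen model.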


## Libraries & Tools

In [0]:
#libraries - Basic
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import os
import sys
import json 
import requests

## Data Extraction - YouTube

In [0]:
# google-api-python-client: `googleapiclient` is the current package name (`apiclient` is a legacy alias)
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

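# Read the API key from the environment rather than hard-coding it in the notebook.
# The variable name YOUTUBE_API_KEY is an assumed convention; set it before running this cell.
DEVELOPER_KEY = os.environ["YOUTUBE_API_KEY"]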
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

def youtube_search(q, max_results=50, order="relevance", token=None):
    """Search YouTube for videos matching `q`; return ids, category ids, titles, and descriptions."""
    youtube = build(YOUTUBE_API_SERVICE_NAME,
                    YOUTUBE_API_VERSION,
                    developerKey=DEVELOPER_KEY)

    search_response = youtube.search().list(
        q=q,
        type="video",
        pageToken=token,
        order=order,
        part="id,snippet",
        maxResults=max_results,
    ).execute()

    videoId = []
    title = []
    categoryId = []
    description = []

    for search_result in search_response.get("items", []):
        if search_result["id"]["kind"] == "youtube#video":
            # The search snippet has no category id, so look up each video's own snippet.
            response = youtube.videos().list(
                part="snippet",
                id=search_result["id"]["videoId"],
            ).execute()
            if not response.get("items"):
                continue  # skip videos whose details could not be fetched
            snippet = response["items"][0]["snippet"]
            # Append to all four lists together so they stay aligned.
            title.append(search_result["snippet"]["title"])
            videoId.append(search_result["id"]["videoId"])
            categoryId.append(snippet["categoryId"])
            description.append(snippet["description"])

    youtube_dict = {'videoId': videoId, 'categoryId': categoryId, 'title': title, 'description': description}
    return youtube_dict
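
The function above issues one `videos().list` request per search result. A possible refinement, assuming the `id` parameter of that endpoint accepts a comma-separated list of up to 50 video ids, is to group the lookups into batches. The sketch below shows only the batching step; the helper names `chunked` and `id_batches` and the placeholder ids are made up for illustration.

```python
def chunked(items, size=50):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def id_batches(video_ids, size=50):
    """Return one comma-separated id string per batch."""
    return [",".join(batch) for batch in chunked(video_ids, size)]

# Placeholder ids for illustration only.
ids = ["vid{:03d}".format(i) for i in range(120)]
batches = id_batches(ids)
print(len(batches))                # number of lookup requests needed
print(len(batches[-1].split(",")))  # ids carried by the last request
```

Each string could then be passed as `id=` to a single `youtube.videos().list(part="snippet", id=batch)` call, and the returned items matched back to the search results by their `id` field.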

In [7]:
# start pulling the data
# max_results = 1700  # as specified in the assignment statement (a single search call returns at most 50)
TEXT = youtube_search("travel")

df = pd.DataFrame(data=TEXT)
df.head()

| | categoryId | description | title | videoId |
|---|---|---|---|---|
| 0 | 19 | This has got to be one of the craziest travel ... | My Craziest Travel Story!! (STORY TIME) | ctzu7grNhuo |
| 1 | 22 | Tala is back in the Philippines! My Canadian ... | CANADIAN GIRL Comes Back To Travel The Philipp... | 8MVWUcghpqQ |
| 2 | 22 | มาค่ะ หรูหรา แบบเจ้าหญิง เมืองเทพนิยาย เวียนนา... | เที่ยว VIENNA นี่มันเมืองเจ้าหญิงจากเทพนิยายชั... | KDIEMIrWxvY |
| 3 | 22 | This extensive list shows the 31 Cheapest Budg... | 31 INSANELY AFFORDABLE Budget Travel Destinati... | sRyslbdtT90 |
| 4 | 26 | The 10 BIGGEST Travel MISTAKES I've Made, so y... | The 10 BIGGEST Travel Mistakes TO NOT MAKE | wwNqEzyBy6E |
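
A single search call returns at most 50 results, so reaching 1700 samples per category means following the page tokens across calls. The sketch below shows that loop in isolation. `fetch_page` is a hypothetical callable standing in for a wrapper around `youtube.search().list(...).execute()`; it is assumed to return a dict with `items` and an optional `nextPageToken`, as the search response does. The fake pages exist only so the loop can run offline.

```python
def collect_items(fetch_page, target):
    """Follow nextPageToken until `target` items are gathered or pages run out."""
    items, token = [], None
    while len(items) < target:
        response = fetch_page(token)
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if token is None:
            break  # no further pages for this query
    return items[:target]

# Stand-in for the real API: three pages of two fake items each.
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c", "d"], "nextPageToken": "p3"},
    "p3": {"items": ["e", "f"]},
}

def fake_fetch(token):
    return FAKE_PAGES[token]

print(collect_items(fake_fetch, target=5))
print(collect_items(fake_fetch, target=10))
```

A query may stop returning tokens before the target is reached, so several query strings per category may be needed, with duplicates removed by video id afterwards.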


In [0]:
# Map each video's category id to its category name.
def get_data(key, region, *ids):
    """Yield (title, category name) for each video id, using the category list of `region`."""
    url = "https://www.googleapis.com/youtube/v3/videos?part=snippet&id={ids}&key={api_key}"
    r = requests.get(url.format(ids=",".join(ids), api_key=key))
    js = r.json()
    items = js["items"]
    # Fetch the id-to-name table of video categories for the given region.
    cat_url = "https://www.googleapis.com/youtube/v3/videoCategories?part=snippet&regionCode={}&key={}"
    cat_js = requests.get(cat_url.format(region, key)).json()
    categories = {d['id']: d["snippet"]['title'] for d in cat_js["items"]}
    for item in items:
        yield item["snippet"]["title"], categories[item["snippet"]["categoryId"]]

The response contains a lot of information about each video;
we will extract only the parameters mentioned in the assignment.
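
That extraction step can be sketched as follows. The `sample_item` dict is a hand-written stand-in with the same nesting as the `snippet` part used in this notebook (its values are made up), and the helper name `to_record` and the `FIELDS` column names are assumptions. The output goes to an in-memory buffer so the example runs without touching the disk.

```python
import csv
import io

FIELDS = ["videoId", "title", "description", "category"]

def to_record(item, categories):
    """Flatten one videos.list item into the four assignment parameters."""
    snippet = item["snippet"]
    return {
        "videoId": item["id"],
        "title": snippet["title"],
        "description": snippet["description"],
        # Fall back to the raw id when the region's category list lacks it.
        "category": categories.get(snippet["categoryId"], snippet["categoryId"]),
    }

# Made-up data mirroring the response layout.
sample_item = {
    "id": "abc123",
    "snippet": {"title": "Sample title", "description": "Sample text", "categoryId": "26"},
}
categories = {"26": "Howto & Style"}

buffer = io.StringIO()  # swap for open("videos.csv", "w", newline="") to write a file
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(to_record(sample_item, categories))
print(buffer.getvalue())
```

Using `categories.get` with a fallback avoids a `KeyError` when a video carries a category id that the chosen region does not list.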

In [21]:
for title, cat in get_data(DEVELOPER_KEY, 'US', df.videoId[4]):
  print(title, cat)

The 10 BIGGEST Travel Mistakes TO NOT MAKE Howto & Style
