## Merge/Pull Request Checklist

* For the csv files in the **Resources** folder, don't commit the zip files or the extracted csv files.
* Make sure to restart the kernel and clear all output in the Jupyter notebook before committing.
* Code is documented and includes comments.

## Tasks and Timeline

* Update the status of the tasks on the team [Trello Board](https://trello.com/b/qjMY63WI/whos-doing-what) so we know who is working on what.

## Getting Started and Setup

* **Team Lead** for this section: Phil
* Before running these cells, you will need an API key for using YouTube API v3.
    * If you already have a Google API key, you can use that one or create a new one from the Google Cloud Console.
    * Instructions for creating an API key and enabling YouTube API v3 are in the [README file](./README.md).
    * After you have your API key, create a file called **config.py** in the project root directory (**team_hopper**) where you will add the key.
* After you have your API key set up, run the cells in this section to set up the project locally on your computer.
* Running these cells will:
    * Import the necessary dependencies, including reading the YouTube API key from the config.py file.
    * Extract the data zip files in the **Resources** folder, which contains all of the csv files needed for this project.
    * Import the csv files into this notebook.
    * Read the csv files into pandas dataframes.
* If you have questions or issues with the setup instructions, ask Phil.
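
As a rough sketch of what **config.py** might contain: the notebook imports the key with `from config import youtube_api_key`, so the file only needs to define that one variable. The value shown here is a placeholder, not a real key, and you will probably want to keep this file out of version control so the key is not shared by accident.

```python
# config.py
# Placeholder value (hypothetical): replace it with your own YouTube API v3 key.
# The variable name matches the notebook's "from config import youtube_api_key".
youtube_api_key = "YOUR_API_KEY_HERE"
```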


In [None]:
# Import the dependencies used for this project.
import pandas as pd
import requests
from config import youtube_api_key
from pprint import pprint
import time
from pathlib import Path
import os
import zipfile
import shutil
import glob

In [None]:
# We need the country codes to instruct the YouTube API
# to return the list of video categories available in the specified country.
# These values are ISO 3166-1 alpha-2 country codes.
country_codes = ["US", "GB", "CA", "DE", "FR", "AU", "IE", "IN", "JP", "KR", "MX", "RU", "ES"]

In [None]:
# Running this cell will unzip the data files in the Resources folder for you.
extension = ".zip"
extracted_dir_name = "youtube_trending"

# Get the current working directory.
# You need to be in the root directory of this project (same directory as this notebook) for this to work properly.
cwd_dir_name = os.getcwd()
print(f"The current working directory is {cwd_dir_name}.")

os.chdir("Resources") # change directory from the working dir to the dir with the zip file(s).
# This should be your "Resources" folder.
dir_name = os.getcwd()
print(f"You are now in the following directory: {dir_name}.")

for item in os.listdir(dir_name): # loop through the items in the directory.
    if item.endswith(extension): # check for ".zip" extension
        if item == "youtube_trending.zip":
            extracted_dir_name = "youtube_trending"
        if item == "trending_videos_2020.zip":
            extracted_dir_name = "trending_videos_2020"
        try:
            file_name = os.path.abspath(item) # get full path of files
            # Make the directory where the zip file will be extracted, if it does not already exist.
            os.makedirs(extracted_dir_name, exist_ok=True)
            # The with statement closes the zip file even if extraction fails.
            with zipfile.ZipFile(file_name) as zip_ref:
                zip_ref.extractall(extracted_dir_name) # extract files to dir
            print(f"Successfully unzipped {item} into the following folder: {extracted_dir_name} inside of {dir_name}.")
        except (zipfile.BadZipFile, OSError) as error:
            print(f"Error trying to unzip {item}: {error}")

# Go up one directory into the project root directory.
os.chdir(os.path.normpath(os.getcwd() + os.sep + os.pardir))
print(f"You are now back in the following directory: {os.getcwd()}.")

In [None]:
# Path to the csv files from the trending youtube video statistics kaggle dataset.
path_to_youtube_trending_csvs = os.path.join(".", "Resources", "youtube_trending")
all_files = glob.glob(os.path.join(path_to_youtube_trending_csvs, "*.csv"))

df_from_each_file = []

for f in all_files:
    filename = os.path.basename(f)
    df_country = pd.read_csv(f, encoding ="ISO-8859-1")
    df_country["Country"] = f"{filename[0]}{filename[1]}"
    df_from_each_file.append(df_country)

# Concatenated dataframe that contains all countries.
# Can filter list by country using the "Country" column
trending_videos_concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
trending_videos_concatenated_df


In [None]:
# Run this cell to get the list of video categories from the YouTube API,
# which can be associated with youtube videos by category id.
def getVideoCategories(country_code):    
    base_url_categories = "https://www.googleapis.com/youtube/v3/videoCategories"
    part_categories = "snippet"
    query_url_categories = f"{base_url_categories}?part={part_categories}&regionCode={country_code}&key={youtube_api_key}"
    categories_response = requests.get(query_url_categories).json()
    category_items = categories_response["items"]
    categories = []
    
    for category in category_items:
        categories_dict = {}
        categories_dict["category_id"] = category["id"]
        categories_dict["channel_id"] = category["snippet"]["channelId"]
        categories_dict["title"] = category["snippet"]["title"]
        categories.append(categories_dict)
    return categories

# Create the output directory once, then write one csv file of categories per country.
output_dir = Path("./Resources/categories")
output_dir.mkdir(parents=True, exist_ok=True)

for country in country_codes:
    categories = getVideoCategories(country)
    categories_df = pd.DataFrame(categories)
    output_file = f"{country}_categories.csv"
    categories_df.to_csv(output_dir / output_file, index=False)

In [None]:
# Path to the csv files that list the different categories.
path_to_categories_csvs = os.path.join(".", "Resources", "categories")
all_category_files = glob.glob(os.path.join(path_to_categories_csvs, "*.csv"))

df_from_each_categories_file = []

for f in all_category_files:
    filename = os.path.basename(f)
    df_categories = pd.read_csv(f, encoding ="ISO-8859-1")
    df_categories["Country"] = f"{filename[0]}{filename[1]}"
    df_from_each_categories_file.append(df_categories)

# Concatenated dataframe that contains all categories.
# Can filter list by country using the "Country" column
categories_concatenated_df = pd.concat(df_from_each_categories_file, ignore_index=True)

# Merge the dataframe of trending videos with the dataframe of categories on category_id and on country.
merged_trending_df = pd.merge(trending_videos_concatenated_df, categories_concatenated_df,
                              how="left", on=["category_id", "Country"],
                              suffixes=("_video", "_category"))

merged_trending_df

In [None]:
# These csv files are YouTube's most popular videos for 2020.

# Path to the csv files.
path_to_trending_2020_csvs = os.path.join(".", "Resources", "trending_videos_2020")
all_files_2020 = glob.glob(os.path.join(path_to_trending_2020_csvs, "*.csv"))

df_from_each_file_2020 = []

for f in all_files_2020:
    filename = os.path.basename(f)
    df_country = pd.read_csv(f, encoding ="ISO-8859-1")
    df_country["Country"] = f"{filename[0]}{filename[1]}"
    df_from_each_file_2020.append(df_country)

# Concatenated dataframe that contains all countries.
# Can filter list by country using the "Country" column
trending_2020_concatenated_df = pd.concat(df_from_each_file_2020, ignore_index=True)
trending_2020_concatenated_df

In [None]:
# Path to the csv files that list the different categories.
path_to_categories_csvs = os.path.join(".", "Resources", "categories")
all_category_files = glob.glob(os.path.join(path_to_categories_csvs, "*.csv"))

df_from_each_categories_file = []

for f in all_category_files:
    filename = os.path.basename(f)
    df_categories = pd.read_csv(f, encoding ="ISO-8859-1")
    df_categories["Country"] = f"{filename[0]}{filename[1]}"
    df_from_each_categories_file.append(df_categories)

# Concatenated dataframe that contains all categories.
# Can filter list by country using the "Country" column
categories_concatenated_df = pd.concat(df_from_each_categories_file, ignore_index=True)

# Merge the dataframe of trending videos with the dataframe of categories on category_id and on country.
merged_trending_2020_df = pd.merge(trending_2020_concatenated_df, categories_concatenated_df,
                                   how="left", on=["category_id", "Country"],
                                   suffixes=("_video", "_category"))

merged_trending_2020_df

## Retrieving a list of YouTube's most popular videos

* You do **NOT** need to run this to set up the repository.
* The following function retrieves a list of YouTube's most popular videos using version 3 of the YouTube API.
* The API is updated daily to return the list of trending videos, which can be found on YouTube's site [here](https://www.youtube.com/feed/trending).
* The function takes a country code as input, which identifies the country for which you are retrieving videos.
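
To make the request parameters easier to see, here is a minimal sketch of how the URL for one page of results could be assembled with the standard library. The helper name `build_trending_url`, the key `"FAKE_KEY"`, and the page token `"CDIQAA"` are made up for illustration; the endpoint and parameter names are the ones used by the function in the next cell.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/youtube/v3/videos"


def build_trending_url(country_code, api_key, page_token=None, max_results=50):
    """Build the request URL for one page of the mostPopular chart (hypothetical helper)."""
    params = {
        "part": "snippet,contentDetails,statistics",
        "chart": "mostPopular",
        "regionCode": country_code,
        "maxResults": max_results,
        "key": api_key,
    }
    # Only send pageToken when asking for a page after the first one.
    if page_token is not None:
        params["pageToken"] = page_token
    # safe="," keeps the commas in the "part" value unescaped and readable.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


# "FAKE_KEY" and "CDIQAA" are placeholders, not real credentials or tokens.
first_page_url = build_trending_url("US", "FAKE_KEY")
next_page_url = build_trending_url("US", "FAKE_KEY", page_token="CDIQAA")
print(first_page_url)
print(next_page_url)
```

Building the query from a dictionary avoids hand-placing `&` separators, which is what the `next_page_token` string in the function below has to do.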

In [None]:
def getTrendingVideos(country_code):
    # The base api url for the youtube data api.
    base_url = "https://www.googleapis.com/youtube/v3/videos"

    # The page token identifies a specific page the API should return.
    # It starts as a bare "&" separator so the first request is sent without a pageToken parameter.
    next_page_token = "&"

    # Comma separated list of one or more video resource properties that the API response will include.
    part = "snippet,contentDetails,statistics"

    # The chart that you want to retrieve.
    # mostPopular - returns the most popular (trending) videos.
    chart = "mostPopular"

    # The max results that should be returned in the list. Can return up to 50 results per page.
    max_results = 50

    # Create variable to store list of trending videos.
    videos = []

    while next_page_token is not None:
        print(f"One sec... getting trending videos for {country_code}....")
        query_url = f"{base_url}?part={part}{next_page_token}chart={chart}&key={youtube_api_key}&maxResults={max_results}&regionCode={country_code}"
        trending_videos_response = requests.get(query_url).json()
        trending_videos = trending_videos_response["items"]
        for video in trending_videos:
            snippet = video["snippet"]
            contentDetails = video["contentDetails"]
            statistics = video["statistics"]

            video_dict = {}

            # Fetch the id of the video.
            video_dict["video_id"] = video["id"]

            # The date the video was on youtube's trending list.
            video_dict["trending_date"] = time.strftime("%y.%d.%m")

            # Fetch video content details
            # duration - the property value is an ISO 8601 duration. 
            video_dict["duration"] = contentDetails["duration"]
            video_dict["captions_available"] = contentDetails["caption"]

            # Fetch basic details about the video (snippet).
            video_dict["title"] = snippet["title"]
            video_dict["description"] = snippet["description"]
            video_dict["publish_time"] = snippet["publishedAt"]
            video_dict["category_id"] = snippet["categoryId"]
            video_dict["channel_id"] = snippet["channelId"]
            video_dict["channel_title"] = snippet["channelTitle"]
            video_dict["localized_description"] = snippet["localized"]["description"]
            video_dict["localized_title"] = snippet["localized"]["title"]
            video_dict["live_broadcast_content"] = snippet["liveBroadcastContent"]
            try:
                video_dict["tags"] = snippet["tags"]
            except KeyError:
                video_dict["tags"] = []
            video_dict["thumbnail_link"] = snippet["thumbnails"]["default"]["url"]

            # Fetch video statistics.
            video_dict["comments_disabled"] = False
            try:
                video_dict["comment_count"] = statistics["commentCount"]
            except KeyError:
                video_dict["comment_count"] = 0
                video_dict["comments_disabled"] = True
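            # Record whether ratings are disabled, the same way comments_disabled is recorded above.
            video_dict["ratings_disabled"] = False
            try:
                video_dict["dislikes"] = statistics["dislikeCount"]
                video_dict["likes"] = statistics["likeCount"]
            except KeyError:
                video_dict["dislikes"] = 0
                video_dict["likes"] = 0
                video_dict["ratings_disabled"] = True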
            video_dict["favorites"] = statistics["favoriteCount"]    
            video_dict["views"] = statistics["viewCount"]

            videos.append(video_dict)

        # Check the nextPageToken on the API response to see if there is another page to fetch data from.
        try:
            next_page_token = trending_videos_response["nextPageToken"]
            next_page_token = f"&pageToken={next_page_token}&"
        except KeyError:
            next_page_token = None
            
    return videos

In [None]:
# Running this cell retrieves the most popular videos for each country using the YouTube API and stores the output
# in the Resources folder.
# You do NOT need to run this cell.
for country in country_codes:
    country_videos = getTrendingVideos(country)
    country_videos_df = pd.DataFrame(country_videos)
#   Leave the below lines commented out.
#   output_file = f"{country}_videos.csv"
#   output_dir = Path("./Resources/trending_videos_2020")
#   output_dir.mkdir(parents=True, exist_ok=True)
#   country_videos_df.to_csv(output_dir / output_file, index=False, header=False, mode="a")

country_videos_df.head()

In [None]:
# This cell zips up the csv data files in "Resources/trending_videos_2020"
# You do NOT need to run this.
# dir_name = Path("./Resources/trending_videos_2020")
# shutil.make_archive("./Resources/trending_videos_2020", 'zip', dir_name)

## Clean the Data

* **Team Lead** for this section: Jenna
* Check for null/na values. Remove (if necessary).
* Rename columns to be something more meaningful (remove underscores from column names).
    * For example, change "category_id" to "Category ID".
* Remove unnecessary columns.
* As an additional resource, check out [this file](https://umn.bootcampcontent.com/University-of-Minnesota-Boot-Camp/UofM-STP-DATA-PT-11-2019-U-C/blob/master/04-Pandas/Activities/2019-12-16_and_17_Pandas_lesson_2/03-Ins_CleaningData/Solved/CleaningData.ipynb).
* Do anything else that you think will make the data easy to work with.
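
The cleaning steps above can be sketched on a small toy dataframe. The data and the helper `prettify_column_name` are made up for illustration; the real notebook works on `merged_trending_df`, and the rename rule (title case, with "id" written as "ID") is only one possible reading of the "Category ID" example.

```python
import pandas as pd

# Toy data standing in for the merged trending dataframe (values are made up).
sample_df = pd.DataFrame({
    "video_id": ["a1", "b2", "c3"],
    "category_id": [10, 24, 10],
    "title_category": ["Music", None, "Music"],
})

# Count the missing values in each column.
null_counts = sample_df.isnull().sum()
print(null_counts)

# Drop rows with any missing value. dropna returns a new dataframe,
# so the result has to be assigned to keep it.
cleaned_df = sample_df.dropna(how="any")


def prettify_column_name(column_name):
    """Turn a name like "category_id" into "Category ID" (hypothetical helper)."""
    words = column_name.split("_")
    return " ".join("ID" if word == "id" else word.capitalize() for word in words)


# rename accepts a function that is applied to every column name.
cleaned_df = cleaned_df.rename(columns=prettify_column_name)
print(cleaned_df)
```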

In [None]:
clean_trending_df = merged_trending_df[['video_id', 'trending_date', 'title_video', 'channel_title', 
                                        'publish_time', 'tags', 'views', 'likes', 'dislikes', 
                                        'Country', 'title_category']]
clean_trending_df.head()

In [None]:
clean_trending_df = clean_trending_df.rename(columns = 
                                             {"video_id": "Video ID", 
                                              "trending_date": "Trending Date",
                                             "title_video": "Video Title", 
                                             "channel_title":"Channel Title", 
                                             "publish_time": "Publish Time",
                                             "tags": "Tags",
                                             "views": "Views", 
                                              "likes": "Likes",
                                             "dislikes": "Dislikes",
                                             "description": "Description", 
                                             "title_category": "Category"})
clean_trending_df.head()

In [None]:
# Look for N/A values: a column with a lower count than the others contains missing values.
clean_trending_df.count()

In [None]:
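#clean_trending_df['Category'].unique()
# TODO: if there is time, figure out which rows have NaN for the category.
# Note that dropna returns a new dataframe; clean_trending_df itself is not modified here.
clean_trending_df.dropna(how='any')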


In [None]:
clean_trending_df['Category'].unique()

In [None]:
categories_concatenated_df['title'].unique()

## Summary Statistics

* **Team Lead** for this section: Katrina
* Call .describe() on the dataframe to get some quick summary statistics.
* Calculate the mean, median, standard deviation, variance, and standard error of the mean (sem) for trending videos.
  * Do this for the numeric columns: views, likes, and comments.
  * It might also be a good idea to group the dataframe by category id and calculate those same statistics.
* As an additional resource, check out [this file](https://umn.bootcampcontent.com/University-of-Minnesota-Boot-Camp/UofM-STP-DATA-PT-11-2019-U-C/blob/master/05-Matplotlib/Activities/2020-01-06_and_07_lesson_3/01-Ins_Summary_Statistics/Solved/samples.ipynb).
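
As a starting point, the statistics listed above could be computed along these lines. This is a sketch on made-up numbers, and the column names (`Category`, `Views`, `Likes`) are assumptions that mirror the renamed columns from the cleaning section rather than the final dataframe.

```python
import pandas as pd

# Toy data standing in for the cleaned trending dataframe (values are made up).
sample_df = pd.DataFrame({
    "Category": ["Music", "Music", "Comedy", "Comedy"],
    "Views": [100, 300, 50, 150],
    "Likes": [10, 30, 5, 15],
})

numeric_columns = ["Views", "Likes"]
statistics_to_compute = ["mean", "median", "std", "var", "sem"]

# Quick overview of the numeric columns.
print(sample_df[numeric_columns].describe())

# One row per statistic, one column per numeric column.
overall_stats = sample_df[numeric_columns].agg(statistics_to_compute)
print(overall_stats)

# The same statistics, calculated separately for each category.
stats_by_category = sample_df.groupby("Category")[numeric_columns].agg(statistics_to_compute)
print(stats_by_category)
```

pandas uses the sample standard deviation and variance (`ddof=1`) by default, which is worth keeping in mind when comparing against results from other tools.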

## What's next?