# Analysis of YouTube Watch & Search History

The internet has changed the way we consume media, and platforms like YouTube have emerged as significant sources of information and entertainment. As a frequent user of YouTube, I've often wondered about my viewing habits. What patterns might emerge from a careful examination of the videos I watch, the channels I frequent, and the times I choose to engage with this platform?

To answer these questions, I collected my YouTube watch history and search history in JSON format. Together they form a rich dataset that includes video titles, viewing times, and associated channels, and they give me a chance to examine my personal YouTube use for trends and patterns.

## Dataset Overview

1. **YouTube Watch History ('watch-history.json')**: This dataset contains my personal YouTube watch history. The data is stored in JSON format with the following attributes:
    * **'header'**: This field indicates the platform, which in this case is always "YouTube".
    * **'title'**: This field contains the action followed by the title of the video watched. The action is always "Watched".
    * **'titleUrl'**: This field contains the URL of the watched video.
    * **'subtitles'**: This field includes the channel name and the URL of the channel.
    * **'time'**: This field records the timestamp of the watch action in the format "YYYY-MM-DDTHH:MM:SS.SSSZ".
    * **'products'**: This field indicates the platform's product used, which in this case is always "YouTube".
    * **'activityControls'**: This field indicates the type of activity, which in this case is always "YouTube watch history".
2. **YouTube Search History ('search-history.json')**: This dataset contains my personal YouTube search history. The data is stored in JSON format with the following attributes:
    * **'header'**: As in the watch history, this field indicates the platform, which is always "YouTube".
    * **'title'**: This field contains the action followed by the search term used. The action is always "Searched for".
    * **'titleUrl'**: This field contains the URL of the search results.
    * **'time'**: This field records the timestamp of the search action in the format "YYYY-MM-DDTHH:MM:SS.SSSZ".
    * **'products'**: This field indicates the platform's product used, which in this case is always "YouTube".
    * **'activityControls'**: This field indicates the type of activity, which in this case is always "YouTube search history".
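To make the attribute list concrete, here is a minimal sketch of how a single watch-history entry could be unpacked. The record is entirely made up for illustration, and the nested shape of 'subtitles', 'products', and 'activityControls' (lists rather than plain strings) is an assumption about the export, not something the attribute list confirms.

```python
from datetime import datetime, timezone

# Hypothetical record shaped like the attributes described above.
# The title, URLs, channel name, and timestamp are placeholders, and the
# list-of-dicts layout of "subtitles" is an assumption.
sample_record = {
    "header": "YouTube",
    "title": "Watched Example Video Title",
    "titleUrl": "https://www.youtube.com/watch?v=XXXXXXXXXXX",
    "subtitles": [{"name": "Example Channel", "url": "https://www.youtube.com/channel/XXXXXXXX"}],
    "time": "2023-01-15T18:42:07.123Z",
    "products": ["YouTube"],
    "activityControls": ["YouTube watch history"],
}

# Separate the action ("Watched") from the actual video title
prefix = "Watched "
title = sample_record["title"]
video_title = title[len(prefix):] if title.startswith(prefix) else title

# Parse the "YYYY-MM-DDTHH:MM:SS.SSSZ" timestamp; the trailing Z means UTC
watched_at = datetime.strptime(
    sample_record["time"], "%Y-%m-%dT%H:%M:%S.%fZ"
).replace(tzinfo=timezone.utc)

# Pull the channel name out of the (assumed) nested subtitles entry
channel_name = sample_record["subtitles"][0]["name"]

print(video_title, "|", channel_name, "|", watched_at.isoformat())
```

A search-history entry could be handled the same way, with "Searched for " as the prefix and no channel lookup.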

The goal of this project is to analyze these datasets to uncover insights and patterns in my YouTube viewing and searching habits. I'll investigate the distribution of watch times, the most commonly watched channels, the most commonly used search terms, and how each of these has changed over time.

## Project Goals
In this project, I will use Python and SQL, along with libraries such as pandas for data manipulation, NLTK for natural language processing, and matplotlib for visualization, to explore my YouTube watch and search history. The steps I plan to follow are:

1. **Understanding the data**: I'll start by examining the structure of the JSON files and identifying the range of dates covered in the watch history.

2. **Data Cleaning**: I will transform the JSON data into a pandas DataFrame for easier analysis, handle any missing or inconsistent fields, and convert data types where necessary.

3. **Exploratory Data Analysis (EDA)**: Here, I'll look at the distribution of my watch times, how my YouTube usage has evolved over time, and identify the channels and video categories I watch the most.

4. **Natural Language Processing (NLP) with NLTK**: I'll use NLP techniques to identify common words in video titles, classify titles into different topics, and analyze the sentiment of the video titles.

5. **Further Analysis**: I'll look for trends or patterns in the types of videos I watch, explore correlations between video lengths and my watch times, and even attempt to predict future watching habits based on my historical data.

6. **Data Visualization**: Finally, I'll create various visualizations to better understand my data and share my findings.
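As a rough preview of steps 2 and 3, the sketch below converts timestamp strings to datetimes and counts views per hour of day. The frame `demo_watch` and its rows are hypothetical stand-ins, not my actual data, and the hours are in UTC because that is how the timestamps are recorded.

```python
import pandas as pd

# Hypothetical stand-in for the watch history; titles and timestamps are made up
demo_watch = pd.DataFrame({
    "title": ["Watched Video A", "Watched Video B", "Watched Video C"],
    "time": [
        "2023-01-15T18:42:07.123Z",
        "2023-01-15T18:55:00.000Z",
        "2023-01-16T09:05:30.500Z",
    ],
})

# Step 2 (cleaning): turn the string timestamps into timezone-aware datetimes
demo_watch["time"] = pd.to_datetime(demo_watch["time"], utc=True)

# Step 3 (EDA): count how many videos were watched in each hour of the day (UTC)
demo_watch["hour"] = demo_watch["time"].dt.hour
views_per_hour = demo_watch["hour"].value_counts().sort_index()

print(views_per_hour)
```

To look at local viewing hours instead, the `time` column could be converted with `.dt.tz_convert(...)` before extracting the hour.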

Through this project, I hope to gain insights into my personal YouTube usage and demonstrate how data science techniques can be applied to analyze and understand digital media consumption. Let's dive in!





## Import Libraries

In [1]:
import sqlite3
import pandas as pd
import json


## Load the Data

In [2]:
# Load the watch history JSON file (explicit UTF-8 so non-ASCII titles decode
# the same way on every platform)
with open('watch-history.json', encoding='utf-8') as f:
    watch_data = json.load(f)

# Convert the list of watch records to a pandas DataFrame (one row per record)
watch_df = pd.json_normalize(watch_data)

# Load the search history JSON file
with open('search-history.json', encoding='utf-8') as f:
    search_data = json.load(f)

# Convert the list of search records to a pandas DataFrame (one row per record)
search_df = pd.json_normalize(search_data)
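
A small sketch of what `pd.json_normalize` does with records like these may help when reading the column summaries. The two records are invented, and the assumption is that fields such as 'subtitles' and 'details' are lists that only some entries carry.

```python
import pandas as pd

# Two made-up records: the first has "subtitles" but no "details",
# the second has "details" but no "subtitles"
records = [
    {
        "header": "YouTube",
        "title": "Watched Video A",
        "subtitles": [{"name": "Channel X", "url": "https://example.com/x"}],
    },
    {
        "header": "YouTube",
        "title": "Watched Video B",
        "details": [{"name": "example detail"}],
    },
]

demo_df = pd.json_normalize(records)

# Keys missing from a record become NaN, and list values are kept as Python
# lists inside an object column rather than being expanded into new columns
print(demo_df.columns.tolist())
print(demo_df["subtitles"].isna().tolist())
```

This is one plausible reason a loaded frame can show columns that are populated for only a fraction of the rows.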


In [4]:
watch_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33000 entries, 0 to 32999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   header            33000 non-null  object
 1   title             33000 non-null  object
 2   titleUrl          32704 non-null  object
 3   subtitles         27046 non-null  object
 4   time              33000 non-null  object
 5   products          33000 non-null  object
 6   activityControls  33000 non-null  object
 7   details           3586 non-null   object
 8   description       3481 non-null   object
dtypes: object(9)
memory usage: 2.3+ MB
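
The summary above shows that `titleUrl`, `subtitles`, `details`, and `description` contain missing values. A minimal sketch for quantifying that per column is shown below; `missing_summary` is a hypothetical helper and `demo_df` is a tiny made-up frame standing in for `watch_df`.

```python
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, most missing first."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Tiny made-up frame standing in for watch_df
demo_df = pd.DataFrame({
    "title": ["Watched A", "Watched B", "Watched C", "Watched D"],
    "titleUrl": ["u1", "u2", None, "u4"],
    "subtitles": [None, [{"name": "X"}], None, None],
})

summary = missing_summary(demo_df)
print(summary)
```

Calling `missing_summary(watch_df)` on the real data should reproduce the gaps implied by the non-null counts, for example 33000 - 27046 = 5954 rows without `subtitles`.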


In [5]:
search_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11623 entries, 0 to 11622
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   header            11623 non-null  object
 1   title             11623 non-null  object
 2   titleUrl          11621 non-null  object
 3   time              11623 non-null  object
 4   products          11623 non-null  object
 5   activityControls  11623 non-null  object
 6   description       3673 non-null   object
 7   details           3767 non-null   object
 8   subtitles         2 non-null      object
dtypes: object(9)
memory usage: 817.4+ KB
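
Since every search `title` starts with the "Searched for" action, the search term itself has to be separated out before terms can be counted. The sketch below shows one way to do that on a made-up frame; `demo_search` and the `query` column name are assumptions for illustration.

```python
import pandas as pd

# Made-up search entries standing in for search_df
demo_search = pd.DataFrame({
    "title": [
        "Searched for python tutorial",
        "Searched for lofi music",
        "Searched for Python Tutorial",
        "Searched for pandas groupby",
    ]
})

# Strip the leading action, then normalize whitespace and case so that
# near-duplicate searches are counted together
demo_search["query"] = (
    demo_search["title"]
    .str.replace(r"^Searched for ", "", regex=True)
    .str.strip()
    .str.lower()
)

# Most frequent search terms first
top_queries = demo_search["query"].value_counts()
print(top_queries.head())
```

Titles that do not start with the prefix are left unchanged by the replacement, so unusual entries would still show up in the counts and could be inspected separately.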
