##  Overview
######  In this notebook we cleaned, reformatted, and filtered the data using the numpy and pandas libraries 
At first glance 
 - Variable names that contain spaces need the spaces replaced with an _ . This makes the variables easier to filter on and to reference in a function 
 - The Start Time variable holds both a date and a time stamp, which must be separated
 - From the separated start date, month and year will be extracted into separate columns 
 - Three columns (Attributes, Title, and Supplemental Video Type) will be converted to lower case for convenience when the data is visualized 
 - The time stamp only needs its own column. The sample rows show it in a 24-hour format (for example 19:35:10) 
 - Lastly, this dataset will be filtered to only Kareena's watch history and exported to a csv file 
 - This csv file will then be used to build a dashboard 
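
The steps listed above can be sketched on a small stand-in frame. This is only an illustration: the `demo` frame and its two rows are made up, and only the column names come from the export.

```python
import pandas as pd

# Hypothetical two-row stand-in for the ViewingActivity export
demo = pd.DataFrame({
    "Profile Name": ["Anil", "Kareena"],
    "Start Time": ["2022-06-09 04:33:53", "2022-06-15 19:35:10"],
    "Supplemental Video Type": ["TEASER_TRAILER", None],
})

# Replace every space in the column names with an underscore in one pass
demo.columns = demo.columns.str.replace(" ", "_", regex=False)

# Parse the timestamp, then split it into date, time, month and year
demo["Start_Time"] = pd.to_datetime(demo["Start_Time"], utc=True)
demo["Date"] = demo["Start_Time"].dt.normalize()
demo["Time"] = demo["Start_Time"].dt.time
demo["Month"] = demo["Start_Time"].dt.month
demo["Year"] = demo["Start_Time"].dt.year

# Lowercase a text column; missing values stay missing
demo["Supplemental_Video_Type"] = demo["Supplemental_Video_Type"].str.lower()

print(demo.dtypes)
```

Renaming through `demo.columns.str.replace` handles every column at once, so it keeps working if the export gains another column with a space in its name.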

In [2]:
# Importing Libraries 
import os 
import sys
import pandas as pd
import numpy as np


In [3]:
#Reading in viewing activity 
netflix = pd.read_csv("C:/Users/Kareena/Documents/Personal_Projects/Netflix_project/ViewingActivity.csv")
# There are 10 columns and 29927 rows
netflix.shape 
print(netflix.head())

  Profile Name           Start Time  Duration  \
0         Anil  2022-06-09 04:33:53  00:00:09   
1         Anil  2022-05-27 06:20:49  00:01:50   
2         Anil  2022-05-27 06:18:42  00:02:03   
3         Anil  2022-05-27 06:17:54  00:00:45   
4         Anil  2022-05-27 02:44:23  02:29:17   

                                    Attributes  \
0              Autoplayed: user action: None;    
1                                          NaN   
2                                          NaN   
3              Autoplayed: user action: None;    
4  Autoplayed: user action: User_Interaction;    

                                  Title Supplemental Video Type  \
0             Teaser: A Perfect Pairing          TEASER_TRAILER   
1  Season 1 Trailer: The Lincoln Lawyer                 TRAILER   
2            Trailer: A Perfect Pairing                 TRAILER   
3                 RRR_hook_primary_16x9                    HOOK   
4                   Gangubai Kathiawadi                     NaN   

 

In [4]:
# Understanding the dataset
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29927 entries, 0 to 29926
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Profile Name             29927 non-null  object
 1   Start Time               29927 non-null  object
 2   Duration                 29927 non-null  object
 3   Attributes               5898 non-null   object
 4   Title                    29927 non-null  object
 5   Supplemental Video Type  1245 non-null   object
 6   Device Type              29927 non-null  object
 7   Bookmark                 29927 non-null  object
 8   Latest Bookmark          29927 non-null  object
 9   Country                  29927 non-null  object
dtypes: object(10)
memory usage: 2.3+ MB


# Data types of the data set 
- All 10 fields are stored as the object dtype; the kind of value each one holds is: 
    - Profile Name (string)
    - Start Time (date and time)
    - Duration (time)
    - Attributes (string)
    - Title (string)
    - Supplemental Video Type (string)
    - Device Type (string)
    - Bookmark (time)
    - Latest Bookmark (time)
    - Country (string) 
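
Because Duration, Bookmark and Latest Bookmark hold `HH:MM:SS` text, one possible way to make them usable in arithmetic is `pd.to_timedelta`. The notebook does not perform this step; the sketch below uses three sample values in the same shape as the export, and the variable names are hypothetical.

```python
import pandas as pd

# Sample durations in the same HH:MM:SS shape as the export
durations = pd.Series(["00:00:09", "00:01:50", "02:29:17"])

# Convert the strings to timedeltas so they can be summed or compared
as_timedelta = pd.to_timedelta(durations)

# Length of each viewing session in seconds
seconds = as_timedelta.dt.total_seconds()
print(seconds.tolist())
```

If a column might contain values that are not valid durations, passing `errors="coerce"` to `pd.to_timedelta` turns those entries into `NaT` instead of raising.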

In [5]:
# Convert Start Time to a timezone-aware (UTC) datetime
netflix['Start Time'] = pd.to_datetime(netflix['Start Time'],utc = True)

In [6]:
# Renaming variables, replacing spaces with _ 
netflix.rename(columns = {
    'Profile Name': 'Profile_Name',
    'Start Time': 'Start_Time',
    'Supplemental Video Type': 'Supplemental_Video_Type',
    'Device Type': 'Device_Type',
    'Latest Bookmark': 'Latest_Bookmark',
}, inplace = True)

print(netflix.dtypes)


Profile_Name                            object
Start_Time                 datetime64[ns, UTC]
Duration                                object
Attributes                              object
Title                                   object
Supplemental_Video_Type                 object
Device_Type                             object
Bookmark                                object
Latest_Bookmark                         object
Country                                 object
dtype: object


In [7]:
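# Start_Time was already converted above, so this call leaves it unchanged
netflix['Start_Time'] = pd.to_datetime(netflix['Start_Time'])
# Date keeps the day (midnight UTC); Time keeps the clock time
netflix['Date'] = netflix['Start_Time'].dt.normalize()
netflix['Time'] = netflix['Start_Time'].dt.time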
print(netflix.head())
print(netflix.dtypes)

  Profile_Name                Start_Time  Duration  \
0         Anil 2022-06-09 04:33:53+00:00  00:00:09   
1         Anil 2022-05-27 06:20:49+00:00  00:01:50   
2         Anil 2022-05-27 06:18:42+00:00  00:02:03   
3         Anil 2022-05-27 06:17:54+00:00  00:00:45   
4         Anil 2022-05-27 02:44:23+00:00  02:29:17   

                                    Attributes  \
0              Autoplayed: user action: None;    
1                                          NaN   
2                                          NaN   
3              Autoplayed: user action: None;    
4  Autoplayed: user action: User_Interaction;    

                                  Title Supplemental_Video_Type  \
0             Teaser: A Perfect Pairing          TEASER_TRAILER   
1  Season 1 Trailer: The Lincoln Lawyer                 TRAILER   
2            Trailer: A Perfect Pairing                 TRAILER   
3                 RRR_hook_primary_16x9                    HOOK   
4                   Gangubai Kathiawadi

In [13]:
# Separating Month and Year into their own columns 
netflix['Month'] = netflix['Start_Time'].dt.month
netflix['Year'] = netflix['Start_Time'].dt.year

In [14]:
print(netflix.dtypes)
print(netflix.head())

Profile_Name                            object
Start_Time                 datetime64[ns, UTC]
Duration                                object
Attributes                              object
Title                                   object
Supplemental_Video_Type                 object
Device_Type                             object
Bookmark                                object
Latest_Bookmark                         object
Country                                 object
Date                       datetime64[ns, UTC]
Time                                    object
Month                                    int64
Year                                     int64
dtype: object
  Profile_Name                Start_Time  Duration  \
0         Anil 2022-06-09 04:33:53+00:00  00:00:09   
1         Anil 2022-05-27 06:20:49+00:00  00:01:50   
2         Anil 2022-05-27 06:18:42+00:00  00:02:03   
3         Anil 2022-05-27 06:17:54+00:00  00:00:45   
4         Anil 2022-05-27 02:44:23+00:00  02:29:17   

   

In [19]:
# Making Title, Supplemental_Video_Type and Attributes all lowercase (easier to filter)
netflix['Title'] = netflix['Title'].str.lower()
netflix['Supplemental_Video_Type'] = netflix['Supplemental_Video_Type'].str.lower()
netflix['Attributes'] = netflix['Attributes'].str.lower()

In [26]:
#print(netflix.head())
netflix.isnull().sum() ## -- Checking NA values in each column (Attributes: 24029, Supplemental_Video_Type: 28682)

# Fill those NaN values with the lowercase placeholder 'not_applicable'
netflix.fillna('not_applicable',inplace=True)

# Filter after filling so the placeholder is carried into Kareena's rows
kareenaNetflixHistory = netflix[netflix.Profile_Name =='Kareena']
kareenaNetflixHistory.shape 
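
The order of filtering and filling matters, because boolean indexing returns a new frame that later changes to `netflix` do not reach. A minimal sketch of that behaviour, using a made-up two-row frame (`df`, `before_fill` and `after_fill` are hypothetical names):

```python
import pandas as pd

# Hypothetical stand-in with one missing Attributes value
df = pd.DataFrame({
    "Profile_Name": ["Anil", "Kareena"],
    "Attributes": ["autoplayed: user action: none; ", None],
})

# Filtering first takes a copy that still holds the missing value
before_fill = df[df.Profile_Name == "Kareena"]

# Fill only the text column that actually contains missing values
df["Attributes"] = df["Attributes"].fillna("not_applicable")

# Filtering after the fill picks up the placeholder
after_fill = df[df.Profile_Name == "Kareena"]

print(before_fill["Attributes"].isna().sum(), after_fill["Attributes"].isna().sum())
```

Filling a named column rather than the whole frame also avoids writing a string placeholder into datetime or numeric columns by accident.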

In [28]:
print(kareenaNetflixHistory.head())


     Profile_Name                Start_Time  Duration      Attributes  \
2337      Kareena 2022-06-15 19:35:10+00:00  00:02:27  not_applicable   
2338      Kareena 2022-06-15 19:34:11+00:00  00:00:41  not_applicable   
2339      Kareena 2022-06-15 18:34:18+00:00  00:01:03  not_applicable   
2340      Kareena 2022-06-15 18:13:11+00:00  00:05:03  not_applicable   
2341      Kareena 2022-06-15 18:08:30+00:00  00:04:03  not_applicable   

                                                  Title  \
2337  grey's anatomy: season 17: someone saved my li...   
2338  grey's anatomy: season 17: i'm still standing ...   
2339  grey's anatomy: season 17: i'm still standing ...   
2340  grey's anatomy: season 17: tradition (episode 15)   
2341  grey's anatomy: season 17: look up child (epis...   

     Supplemental_Video_Type    Device_Type  Bookmark  Latest_Bookmark  \
2337          not_applicable  iPhone 12 Pro  00:20:18         00:20:18   
2338          not_applicable  iPhone 12 Pro  00:41:49     

Profile_Name               0
Start_Time                 0
Duration                   0
Attributes                 0
Title                      0
Supplemental_Video_Type    0
Device_Type                0
Bookmark                   0
Latest_Bookmark            0
Country                    0
Date                       0
Time                       0
Month                      0
Year                       0
dtype: int64

In [29]:
# Filtering only for Kareena's Viewing Activity 
kareenaNetflixHistory = netflix[netflix.Profile_Name =='Kareena']

kareenaNetflixHistory.to_csv('kareenaNeflixHistory.csv')

# Originally there were 10 columns and 29927 rows; after adding columns and filtering there are 14 columns and 22067 rows 
kareenaNetflixHistory.shape 

(22067, 14)
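
For the dashboard step, it may help to check how the exported file reads back. The sketch below is an assumption about how the csv might be consumed, not part of the notebook: it writes a made-up one-row frame to an in-memory buffer instead of a file, drops the row index with `index=False`, and restores the timestamp with `parse_dates`.

```python
import io
import pandas as pd

# Hypothetical one-row stand-in for the filtered history
history = pd.DataFrame({
    "Profile_Name": ["Kareena"],
    "Start_Time": pd.to_datetime(["2022-06-15 19:35:10"], utc=True),
    "Title": ["grey's anatomy: season 17: tradition (episode 15)"],
})

# index=False keeps the old row labels out of the output
buffer = io.StringIO()
history.to_csv(buffer, index=False)

# parse_dates restores the timestamp column when the data is read back
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Start_Time"])
print(restored.dtypes)
```

Without `index=False`, the original row labels (2337, 2338, ...) are written as an extra unnamed first column, which the dashboard would then need to ignore.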