# Assignment 4.1

## Reading S3 bucket into SageMaker Studio

In [40]:
import boto3
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from collections import Counter

s3_client = boto3.client("s3")

BUCKET='ads508projectbucket'

# For Country_and_Subscriber_df
Country_and_Subscriber_Key='Aggregated_Metrics_By_Country_And_Subscriber_Status.csv'
Country_and_Subscriber_response = s3_client.get_object(Bucket=BUCKET, Key=Country_and_Subscriber_Key)
Country_and_Subscriber_df = pd.read_csv(Country_and_Subscriber_response.get("Body"))

# For Video_df
Video_Key='Aggregated_Metrics_By_Video.csv'
Video_response = s3_client.get_object(Bucket=BUCKET, Key=Video_Key)
Video_df = pd.read_csv(Video_response.get("Body"))

# For Comments_df
Comments_Key='All_Comments_Final.csv'
Comments_response = s3_client.get_object(Bucket=BUCKET, Key=Comments_Key)
Comments_df = pd.read_csv(Comments_response.get("Body"))

# For Performance_df
Performance_Key='Video_Performance_Over_Time.csv'
Performance_response = s3_client.get_object(Bucket=BUCKET, Key=Performance_Key)
Performance_df = pd.read_csv(Performance_response.get("Body"))
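
The four reads above repeat the same `get_object` / `read_csv` pattern. A minimal sketch of a helper that factors it out is shown here. The names `load_csv_from_s3`, `load_all`, and the `keys` mapping are hypothetical and not part of the original notebook. The client is passed in as an argument, on the assumption that it behaves like the `boto3.client("s3")` created above.

```python
import pandas as pd


def load_csv_from_s3(client, bucket, key):
    """Fetch one CSV object and parse it into a DataFrame.

    `client` is assumed to expose get_object(Bucket=..., Key=...) the way
    boto3's S3 client does, returning a dict whose "Body" is a readable stream.
    """
    response = client.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(response["Body"])


def load_all(client, bucket, keys):
    """Load several CSVs; `keys` maps a short name to an S3 object key."""
    return {name: load_csv_from_s3(client, bucket, key) for name, key in keys.items()}
```

In the notebook this could be called as `frames = load_all(s3_client, BUCKET, {"video": Video_Key, "comments": Comments_Key})`, after which `frames["video"]` would play the role of `Video_df`.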

## Transforming Data for Training

### Comments_df

For this data set, we want the maximum reply count and the maximum like count among the comments on each video. To do this, we group the comments by video ID ('VidId') and take the maximum of each column.

In [161]:
Comments_altered = Comments_df[['VidId', 'Reply_Count', 'Like_Count']]
grouped_comments = Comments_altered.groupby(by = 'VidId', as_index = False).max()
# rename the aggregated columns; rename() returns a new frame, so no in-place edit is needed
transformed_comments = grouped_comments.rename(columns = {'Reply_Count': 'Reply_Comment_Count', 'Like_Count': 'Like_Comment_Count'})
transformed_comments.head()

Unnamed: 0,VidId,Reply_Comment_Count,Like_Comment_Count
0,-3d1NctSv0c,3,5
1,-ONQ628CXKQ,5,31
2,-kX2b6TF_9k,2,2
3,-pdXWmj9xxU,3,9
4,-zbLpoJVBMI,4,17


This is the only information we need from this data set, so we will keep it as is and move on to the next transformation.
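
The per-video maximum can also be written with pandas named aggregation, which performs the grouping and the renaming in one step. This is a sketch on a toy frame with made-up values rather than the real `Comments_df`; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for Comments_df; the values are made up for illustration.
comments = pd.DataFrame({
    "VidId": ["a", "a", "b"],
    "Reply_Count": [1, 3, 2],
    "Like_Count": [5, 0, 7],
})

# Named aggregation: output column = (input column, aggregation function).
per_video = comments.groupby("VidId", as_index=False).agg(
    Reply_Comment_Count=("Reply_Count", "max"),
    Like_Comment_Count=("Like_Count", "max"),
)
print(per_video)
```

Each maximum is taken independently per column, so the two values in a row may come from different comments on the same video.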

### Country_and_Subscriber_df

### Video_df

For this data frame, we first remove the first row, because it holds column-level values rather than a per-video record. We then convert the video publish time to the day of the week and encode it as dummy features. We also create a boolean feature that states whether 'Data Science' appears in the title of the video, and we convert the average view duration from hour:minute:second format to seconds. Lastly, we replace the total likes and dislikes columns with a single 'Like_Dislike_Ratio' column.

From this data frame, we want to predict the 'Your estimated revenue (USD)' feature. This feature gives the estimated revenue of the video, so a model trained on historical data can tell the video creator whether they are being paid less or more than predicted. Because 'RPM' and 'CPM' are themselves monetary measures and would leak revenue information into the model, we remove them as well.
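
The day-of-week, title-flag, and duration steps described above can each be expressed as a vectorized pandas operation. The sketch below runs on a toy frame. The column names (`title`, `publish_time`, `avg_view_duration`) and the values are assumptions made for illustration and do not match the headers of the real export.

```python
import pandas as pd

# Toy stand-in for Video_df with simplified, hypothetical column names.
videos = pd.DataFrame({
    "title": ["Data Science Basics", "Cooking Pasta"],
    "publish_time": ["2020-05-08", "2020-11-12"],
    "avg_view_duration": ["0:03:09", "0:05:14"],
})

# Day of the week from the publish timestamp.
videos["day_published"] = pd.to_datetime(videos["publish_time"]).dt.day_name()

# n-1 dummy columns; drop_first removes the alphabetically first day.
day_dummies = pd.get_dummies(videos["day_published"], drop_first=True)

# 1 if the title contains both 'Data' and 'Science', else 0.
title = videos["title"]
videos["data_science"] = (
    title.str.contains("Data") & title.str.contains("Science")
).astype(int)

# 'H:MM:SS' strings to whole seconds.
videos["avg_view_duration_s"] = (
    pd.to_timedelta(videos["avg_view_duration"]).dt.total_seconds().astype(int)
)
```

Because `drop_first = True` drops the alphabetically first category, a frame that contains all seven weekdays ends up without a 'Friday' column, and a row with all day dummies equal to 0 represents a Friday.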

In [152]:
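# removing first row; .copy() gives an independent frame so the column
# assignments below do not trigger SettingWithCopyWarning
transform_video = Video_df.iloc[1:,:].copy()
transform_video.reset_index(inplace = True, drop = True)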

# changing 'Video publish time' to day of the week
transform_video['Video pub­lish time'] = pd.to_datetime(transform_video['Video pub­lish time'])
transform_video['Video_Day_Published'] = transform_video['Video pub­lish time'].dt.day_name()

# we now want to create n-1 dummy features to replace 'Video_Day_Published'
dummy_day = pd.get_dummies(transform_video['Video_Day_Published'], drop_first = True)
transform_video = pd.concat([transform_video, dummy_day], axis = 1)

# next, we will create a boolean feature which is 1 when the title contains both 'Data' and 'Science', and 0 otherwise
test = range(len(transform_video['Video title']))
test = pd.DataFrame(test)
row = 0
for i in transform_video['Video title']:
    find_data = i.count('Data')
    find_science = i.count('Science')
    if find_data != 0 and find_science != 0:
        test.iloc[row, 0] = 1
    else:
        test.iloc[row, 0] = 0
    row = row + 1

test.rename(columns = {0: 'DateScience?'}, inplace = True)
transform_video = pd.concat([transform_video, test], axis = 1)

# changing average view duration to seconds format
test = range(len(transform_video['Av­er­age view dur­a­tion']))
test = pd.DataFrame(test)
row = 0
for i in transform_video['Av­er­age view dur­a­tion']:
    # split 'H:MM:SS' into integer parts (names chosen to avoid shadowing the built-in min)
    hours, minutes, seconds = (int(part) for part in i.split(':'))
    total_time = hours * 3600 + minutes * 60 + seconds
    test.loc[row, 0] = total_time
    row = row + 1

test.rename(columns = {0: 'Average_view_duration_(s)'}, inplace = True)
transform_video = pd.concat([transform_video, test], axis = 1)

# finding the like to dislike ratio
transform_video['Like_Dislike_Ratio'] = round(transform_video['Likes'] / transform_video['Dis­likes'], 2)
# ratios above 10000 (for example inf, when 'Dislikes' is 0) are replaced with the like count
row = 0
for i in transform_video['Like_Dislike_Ratio']:
    if i > 10000:
        transform_video.loc[row,'Like_Dislike_Ratio'] = transform_video.loc[row,'Likes']
    row = row + 1

# Now, we will remove features which are not needed
transform_video.drop(['Video title', 'Video pub­lish time', 'RPM (USD)', 'CPM (USD)', 'Video_Day_Published', 'Av­er­age view dur­a­tion', 'Likes', 'Dis­likes'], axis = 1, inplace = True)


In [153]:
transform_video.head()

Unnamed: 0,Video,Com­ments ad­ded,Shares,Sub­scribers lost,Sub­scribers gained,Av­er­age per­cent­age viewed (%),Views,Watch time (hours),Sub­scribers,Your es­tim­ated rev­en­ue (USD),...,Im­pres­sions click-through rate (%),Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday,DateScience?,Average_view_duration_(s),Like_Dislike_Ratio
0,4OZip0cgOho,907,9583,451,46904,36.65,1253559,65850.7042,46453,7959.533,...,3.14,0,0,0,0,0,0,1,189,49.79
1,78LjdAAw0wA,412,4,15,12,6.26,2291,200.2966,-3,6.113,...,0.72,0,0,0,1,0,0,0,314,32.5
2,hO_YKK_0Qck,402,152,9,198,15.12,21350,3687.3387,189,202.963,...,2.53,0,0,0,1,0,0,0,621,58.73
3,uXLnbdHMf8w,375,367,40,1957,33.41,49564,2148.311,1917,155.779,...,4.01,0,1,0,0,0,0,1,156,119.18
4,Xgg7dIKys9E,329,118,11,161,9.55,13429,1034.3945,150,39.92,...,3.38,0,0,0,0,0,1,0,277,39.33


In [154]:
transform_video.columns

Index(['Video', 'Com­ments ad­ded', 'Shares', 'Sub­scribers lost',
       'Sub­scribers gained', 'Av­er­age per­cent­age viewed (%)', 'Views',
       'Watch time (hours)', 'Sub­scribers',
       'Your es­tim­ated rev­en­ue (USD)', 'Im­pres­sions',
       'Im­pres­sions click-through rate (%)', 'Monday', 'Saturday', 'Sunday',
       'Thursday', 'Tuesday', 'Wednesday', 'DateScience?',
       'Average_view_duration_(s)', 'Like_Dislike_Ratio'],
      dtype='object')
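
The 'Like_Dislike_Ratio' column listed above is computed by dividing likes by dislikes, which yields an infinite value when a video has no dislikes. The notebook handles this by replacing any ratio above 10000 with the like count. A hedged alternative, sketched on made-up values, is to test for non-finite results directly. Note that this is a different rule from the notebook's threshold: a genuine finite ratio above 10000 would be kept here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the like and dislike columns (values made up).
stats = pd.DataFrame({"Likes": [100, 40, 7], "Dislikes": [4, 0, 2]})

ratio = (stats["Likes"] / stats["Dislikes"]).round(2)

# Division by zero yields inf (or NaN for 0/0); fall back to the like
# count in those rows and keep the computed ratio everywhere else.
stats["Like_Dislike_Ratio"] = ratio.where(np.isfinite(ratio), stats["Likes"])
```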