# RQ6 Posting Effectiveness

For this question, we need to consider two datasets:
* instagram_profiles
* instagram_posts

## Finding the average time a user lets pass
To find the average time a user lets pass before posting another post, one can do the following:
1. Get all profiles from the profiles dataset.
2. For each profile, find the maximum and minimum post time in the posts dataset.
3. Compute the profile's average gap by subtracting the minimum post time from the maximum post time and dividing by the number of gaps between posts, i.e. the number of posts minus one (this is what the code below does).
4. Average these per-profile gaps over all profiles.
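
The following sketch illustrates step 3 for a single profile. It uses made-up timestamps (not taken from the dataset) and only the standard library, and it checks that dividing the span between the first and last post by the number of gaps gives the same value as averaging the individual gaps.

```python
from datetime import datetime, timedelta

# Hypothetical post times for one profile (not taken from the dataset).
post_times = [
    datetime(2019, 1, 1, 12, 0),
    datetime(2019, 1, 3, 12, 0),
    datetime(2019, 1, 4, 12, 0),
    datetime(2019, 1, 10, 12, 0),
]

# Shortcut: span between first and last post divided by the number of gaps.
span = max(post_times) - min(post_times)
avg_gap_shortcut = span / (len(post_times) - 1)

# Long way: average the individual gaps between consecutive posts.
ordered = sorted(post_times)
gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
avg_gap_explicit = sum(gaps, timedelta()) / len(gaps)

print(avg_gap_shortcut)  # 3 days, 0:00:00
```

The shortcut only needs the minimum and maximum post time per profile, which is why the notebook never has to sort posts by time.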

Caveats:
* Not every profile in the profiles dataset also appears in the posts dataset, even when the profile reports having posts. An average posting interval can only be computed for users that have rows in the posts dataset.
* The post count recorded for each profile in the profiles dataset (`n_posts`) is unreliable: it does not guarantee that all of those posts are present in the posts dataset.

Choices:
* Only consider users that appear in both the posts and profiles datasets.
* Remove posts that don't have a post time.
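
As a rough illustration of these two choices, the sketch below applies them to two tiny, made-up DataFrames. The names `toy_profiles` and `toy_posts` and all values are assumptions for the example, not part of the real data.

```python
import pandas as pd

# Toy stand-ins for the two datasets; all values are made up.
toy_profiles = pd.DataFrame({"profile_id": [1, 2, 3]})
toy_posts = pd.DataFrame({
    "profile_id": [1, 1, 2, 4],
    "cts": pd.to_datetime(["2019-01-01", "2019-01-05", None, "2019-02-01"]),
})

# Remove posts that don't have a post time.
toy_posts = toy_posts.dropna(subset=["cts"])

# Only keep users that appear in both datasets.
toy_posts = toy_posts[toy_posts.profile_id.isin(toy_profiles.profile_id)]
toy_profiles = toy_profiles[toy_profiles.profile_id.isin(toy_posts.profile_id)]

print(toy_profiles.profile_id.tolist())
```

In this toy case only profile 1 survives: profile 2's single post has no timestamp, profile 3 has no posts, and profile 4 is missing from the profiles table.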

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [94]:
profiles = pd.read_csv('data/instagram_profiles.csv', sep='\t', usecols=['profile_id', 'profile_name', 'n_posts', 'following', 'followers'])
profiles = profiles[profiles.profile_id.notnull()]
profiles.profile_id = profiles.profile_id.astype('Int64')
profiles = profiles.drop_duplicates(subset='profile_id', keep="last")
profiles = profiles.sort_values(by='profile_id')

In [3]:
posts = pd.read_csv('data/instagram_posts.csv', sep='\t', usecols=['profile_id', 'cts'], parse_dates=['cts'], index_col=False)

In [99]:
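# Drop posts that lack a profile_id or a post time (cts)
filtered_posts = posts.dropna()
# Keep only posts whose author appears in the profiles dataset
filtered_posts = filtered_posts[filtered_posts.profile_id.isin(profiles.profile_id)]
# Sort by profile_id so that each profile's posts form one contiguous block
sorted_posts = filtered_posts.sort_values(by='profile_id')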

In [95]:
profiles_with_posts = profiles[profiles.profile_id.isin(sorted_posts.profile_id.unique())]

In [104]:
# count the actual number of posts
profiles_with_posts['true_n_posts'] = sorted_posts.groupby('profile_id')['profile_id'].count().values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  profiles_with_posts['true_n_posts'] = sorted_posts.groupby('profile_id')['profile_id'].count().values
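
The message above is pandas' `SettingWithCopyWarning`: `profiles_with_posts` was created by boolean indexing, so pandas cannot tell whether the new column is meant for that subset or for `profiles`. One common way to avoid it, sketched here on made-up data, is to take an explicit `.copy()` when creating the subset.

```python
import pandas as pd

# Made-up stand-in for the profiles table.
toy = pd.DataFrame({"profile_id": [1, 2, 3], "followers": [10, 20, 30]})

# Boolean indexing followed by .copy() yields an independent DataFrame,
# so adding a column to it is unambiguous.
subset = toy[toy.profile_id.isin([1, 3])].copy()
subset["true_n_posts"] = [5, 7]

print(subset)
```

Applied to this notebook, that would mean appending `.copy()` to the line that defines `profiles_with_posts`; the results should not change, only the warnings should disappear.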


In [107]:
sorted_post_times = sorted_posts.cts.values

In [110]:
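# Find the index at which each profile's posts start in sorted_post_times:
# the cumulative sum of the post counts of all preceding profiles
profiles_with_posts['start_index'] = profiles_with_posts.true_n_posts.shift(1)
profiles_with_posts.start_index = profiles_with_posts.start_index.cumsum().astype('Int64')
# The shift leaves the first profile without a value; its posts start at index 0
profiles_with_posts.iloc[0, -1] = 0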
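
The idea behind `start_index` can be shown without pandas. In this sketch, `counts` and `flat` are hypothetical: `counts` holds the number of posts per profile in `profile_id` order, and `flat` stands in for the post times sorted by `profile_id`.

```python
from itertools import accumulate

counts = [3, 1, 2]  # hypothetical posts per profile, in profile_id order
flat = ["a1", "a2", "a3", "b1", "c1", "c2"]  # stand-in for sorted_post_times

# Each group's start offset is the total size of all earlier groups.
start_offsets = [0] + list(accumulate(counts))[:-1]

# Slicing with (start, count) recovers each profile's block of posts.
slices = [flat[start:start + n] for start, n in zip(start_offsets, counts)]

print(start_offsets)  # [0, 3, 4]
```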


In [117]:
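def find_time_stamps(profile_id, n_posts, start_index):
    """Return (max, min) post time for one profile.

    sorted_post_times is ordered by profile_id, so the profile's posts occupy
    the contiguous slice [start_index, start_index + n_posts).
    profile_id is unused; it is kept so the call in the next cell is unchanged.
    """
    start_index = int(start_index)
    n_posts = int(n_posts)
    profile_post_times = sorted_post_times[start_index:start_index + n_posts]

    min_post_time = profile_post_times.min()
    max_post_time = profile_post_times.max()

    return max_post_time, min_post_time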

In [119]:
max_post, min_post = np.vectorize(find_time_stamps)(profiles_with_posts.profile_id, profiles_with_posts.true_n_posts, profiles_with_posts.start_index)
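
For comparison, the same per-profile minimum, maximum and post count can be obtained with a single `groupby` aggregation. This is a sketch on made-up data, not the approach used in this notebook, and it has not been benchmarked against the slicing approach on the full dataset.

```python
import pandas as pd

# Made-up posts table with the same two columns the notebook loads.
toy_posts = pd.DataFrame({
    "profile_id": [1, 1, 1, 2],
    "cts": pd.to_datetime(
        ["2019-01-01", "2019-01-10", "2019-01-04", "2019-03-01"]
    ),
})

# One row per profile with its earliest post, latest post and post count.
per_profile = toy_posts.groupby("profile_id")["cts"].agg(
    min_post="min", max_post="max", true_n_posts="count"
)

print(per_profile)
```

Because the result is indexed by `profile_id`, it can be joined back onto a profiles table by key instead of relying on both tables being in the same order.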

In [120]:
profiles_with_posts['max_post'] = max_post
profiles_with_posts['min_post'] = min_post


In [124]:
profiles_with_posts['post_time_delta'] = profiles_with_posts.max_post - profiles_with_posts.min_post
profiles_with_posts['avg_post_time'] = profiles_with_posts.post_time_delta / (profiles_with_posts.true_n_posts - 1)


### What is the average time (days and minutes) a user lets pass before publishing another post?

#### How to answer this question
1. Get all profiles from the `instagram_profiles.csv` dataset.
2. For each profile, find the maximum and minimum post time in the `instagram_posts.csv` dataset.
3. Compute the profile's average gap by subtracting the minimum post time from the maximum post time and dividing by the number of posts minus one (the number of gaps between posts).
4. Average these per-profile gaps over all profiles.

Profiles with 0 or 1 posts have no gap between posts, so they cannot contribute a meaningful value. Profiles with 0 posts in the posts dataset are already excluded by the filtering above. For a profile with exactly 1 post, the formula divides a zero time span by zero gaps; pandas should represent the result as `NaT`, which `mean()` skips. This would explain why the two results below are identical.
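
The sketch below, on made-up numbers, shows how pandas is expected to treat a profile with a single post when the time span is divided by `true_n_posts - 1`. The variable names mirror the notebook's columns, but the values are assumptions.

```python
import pandas as pd

# Made-up spans (max_post - min_post) and post counts for three profiles.
delta = pd.Series(pd.to_timedelta([0, 10, 4], unit="D"))
true_n_posts = pd.Series([1, 3, 2])

# The single-post profile divides a zero span by zero gaps.
avg_gap = delta / (true_n_posts - 1)

print(avg_gap)
print(avg_gap.mean())
```

If the first entry comes out as `NaT`, the mean is taken over the remaining two profiles only, so filtering on `true_n_posts > 1` beforehand would not change it.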

In [127]:
def timedelta_to_days_minutes(timedelta):
    """Converts a timedelta to days and minutes.

    :args
    timedelta - a pandas Timedelta

    :returns
    (days, minutes) - a tuple containing the days and the remaining whole minutes (seconds are dropped)
    """

    # components is (days, hours, minutes, seconds, ...)
    minutes = timedelta.components[1] * 60 + timedelta.components[2]

    return timedelta.days, minutes
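
An equivalent conversion can be written against the standard library's `datetime.timedelta`, which `pandas.Timedelta` subclasses. The helper name `split_days_minutes` is hypothetical and only serves as an illustration.

```python
from datetime import timedelta


def split_days_minutes(delta):
    """Hypothetical helper: split a timedelta into (days, whole minutes)."""
    # delta.seconds holds the part of the duration below one day.
    minutes, _seconds = divmod(delta.seconds, 60)
    return delta.days, minutes


print(split_days_minutes(timedelta(days=30, hours=2, minutes=7, seconds=45)))  # (30, 127)
```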

In [128]:
days, minutes = timedelta_to_days_minutes(profiles_with_posts.avg_post_time.mean())
print(f"Taking all users into account, the average time a user lets pass before publishing another post is {days} days and {minutes} minutes.")

Taking all users into account, the average time a user lets pass before publishing another post is 30 days and 127 minutes.


In [136]:
days, minutes = timedelta_to_days_minutes(profiles_with_posts[profiles_with_posts.true_n_posts > 1].avg_post_time.mean())
print(f"Ignoring users who posted only 0 or 1 times, the average time a user lets pass before publishing another post is {days} days and {minutes} minutes.")

Ignoring users who posted only 0 or 1 times, the average time a user lets pass before publishing another post is 30 days and 127 minutes.


### Plot the top 3 users that publish posts more frequently (calculate the average time that passes between posts), including their amount of followers and following. Provide insights from that chart.

We ignore users with 0 or 1 posts here, since no gap between posts can be computed for them.

In [137]:
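# Keep only profiles with at least two posts actually present in the posts dataset
filtered_profiles = profiles_with_posts[profiles_with_posts.true_n_posts > 1]
# The smallest average gap corresponds to the most frequent posters
top_3_users = filtered_profiles.sort_values(by='avg_post_time').head(3)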
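
The selection logic can be checked on a small, made-up table. The column names follow the notebook, while the profile names and values are assumptions for the example.

```python
import pandas as pd

# Made-up per-profile results; profile "a" has a single post and no gap.
toy = pd.DataFrame({
    "profile_name": ["a", "b", "c", "d", "e"],
    "true_n_posts": [1, 5, 3, 2, 4],
    "avg_post_time": pd.to_timedelta(
        [pd.NaT, "2 days", "6 hours", "1 days", "12 hours"]
    ),
})

# Keep profiles with at least two posts, then take the three smallest gaps.
multi_posters = toy[toy.true_n_posts > 1]
top_3 = multi_posters.sort_values(by="avg_post_time").head(3)

print(top_3.profile_name.tolist())  # ['c', 'e', 'd']
```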

In [133]:
top_3_users

| | profile_id | profile_name | following | followers | n_posts | true_n_posts | start_index | max_post | min_post | post_time_delta | avg_post_time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3793471 | 7489389995 | bassamaludovic5522 | 139.0 | 59.0 | 2.0 | 2 | 21718985 | 2018-04-12 17:18:11 | 2018-04-12 17:18:11 | 0 days 00:00:00 | 0 days 00:00:00 |
| 2445653 | 8801686 | alex_dinsdale | 1572.0 | 705.0 | 1590.0 | 2 | 784912 | 2018-09-09 02:19:42 | 2018-09-09 02:19:42 | 0 days 00:00:00 | 0 days 00:00:00 |
| 1083546 | 1823731731 | twine_9 | 1009.0 | 30.0 | 2.0 | 2 | 13327349 | 2017-09-24 15:06:46 | 2017-09-24 15:06:46 | 0 days 00:00:00 | 0 days 00:00:00 |
| 860099 | 6824887607 | whoareyou10111 | 556.0 | 27.0 | 2.0 | 2 | 20822221 | 2017-12-28 23:24:55 | 2017-12-28 23:24:55 | 0 days 00:00:00 | 0 days 00:00:00 |
| 1284883 | 5556785151 | garrison_marilyn | 17.0 | 26.0 | 2.0 | 2 | 19120751 | 2017-06-05 01:40:35 | 2017-06-05 01:40:35 | 0 days 00:00:00 | 0 days 00:00:00 |
| 3054841 | 9197904884 | al.ya5116 | 30.0 | 37.0 | 2.0 | 2 | 23672661 | 2018-11-08 08:50:47 | 2018-11-08 08:50:47 | 0 days 00:00:00 | 0 days 00:00:00 |
| 3227001 | 6324740239 | roseanne4247 | 1377.0 | 288.0 | 2.0 | 2 | 20396391 | 2017-11-03 13:07:59 | 2017-11-03 13:07:59 | 0 days 00:00:00 | 0 days 00:00:00 |
| 1263527 | 10360928171 | worldrecordeggsouthafrica | 5115.0 | 149.0 | 2.0 | 2 | 24222557 | 2019-03-04 10:48:30 | 2019-03-04 10:48:30 | 0 days 00:00:00 | 0 days 00:00:00 |
| 2006244 | 9187046644 | nurhakocak.65857 | 5087.0 | 139.0 | 2.0 | 2 | 23659115 | 2018-11-24 16:43:21 | 2018-11-24 16:43:21 | 0 days 00:00:00 | 0 days 00:00:00 |
| 2908530 | 6109197943 | no.name7291 | 2435.0 | 76.0 | 13.0 | 12 | 20098423 | 2018-12-16 06:49:14 | 2018-12-16 06:49:13 | 0 days 00:00:01 | 0 days 00:00:00.090909090 |


In [39]:
top_3_users.profile_id.astype('Int64')

2235637    1524289417
3380299    3308136716
948783     3308141582
Name: profile_id, dtype: Int64

In [57]:
posts[posts.profile_id == 1446650654].cts.max()

NaT

In [62]:
profiles[profiles.profile_id == 1524289417]

| | profile_id | profile_name | following | followers | n_posts | start_index | max_post | min_post |
|---|---|---|---|---|---|---|---|---|
| 2235637 | 1524289000.0 | _.xoxomaddie._ | 4228.0 | 2213.0 | 31.0 | 1064700774 | 1970-01-01 | 1970-01-01 |
