In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import datatable as dt

## [RQ7] Of course, calculating probabilities is a job that any Data Scientist must know. So let's compute some engaging figures.

- ## What's the probability that a post receives more than 20% "likes" of the number of followers a user has?

#### To compute this probability we have to compare the number of likes on each post with the number of followers of the user who published it, so we build a new DataFrame containing just the columns we are interested in.

#### The first DataFrame comes from "instagram_profiles.csv". We remove the NaN values because they carry no information, and convert each user's follower count to an integer.

In [2]:
from_prof = dt.fread('instagram_profiles.csv', sep='\t', columns={"sid", "followers"}).to_pandas()

In [3]:
from_prof = from_prof.dropna()

In [4]:
from_prof['followers'] = from_prof['followers'].astype(int)
from_prof

Unnamed: 0,sid,followers
0,4184446,146
1,4184457,1145
2,4184460,324
5,4184465,192
6,4184471,4137
...,...,...
4509578,4184455,809
4509579,4184458,599
4509580,4184463,261
4509581,4184467,481
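
The order of the two cleaning steps matters: a float column that still contains NaN cannot be cast to a plain integer dtype. The following is a minimal sketch on a made-up three-row table (the values and the names `toy_prof` and `clean_prof` are illustrative only, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the profiles table (values are invented for illustration).
toy_prof = pd.DataFrame({"sid": [1, 2, 3], "followers": [146.0, np.nan, 324.0]})

# Casting a float column that contains NaN to int raises an error.
try:
    toy_prof["followers"].astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# Dropping the incomplete rows first makes the cast safe.
clean_prof = toy_prof.dropna().copy()
clean_prof["followers"] = clean_prof["followers"].astype(int)
print(cast_failed, clean_prof["followers"].tolist())
```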


#### Now we load the second DataFrame from "instagram_posts.csv". We include "sid_profile" (the sequence ID of the profile in the profiles dataset) so that we can merge it with the first one and compare likes and followers post by post.

In [5]:
from_post = dt.fread('instagram_posts.csv', sep='\t', columns={"sid_profile", "numbr_likes"}).to_pandas()

#### First we clean the dataset by dropping the values we are not interested in: rows with NaN values, and rows where "sid_profile" is equal to -1, which is not a meaningful profile ID.

In [7]:
from_post = from_post[['sid_profile', 'numbr_likes']][from_post['sid_profile'] != -1].dropna()
from_post

Unnamed: 0,sid_profile,numbr_likes
0,3496776,80.0
10,3303402,114.0
26,3406435,46.0
29,3529017,66.0
52,3206132,1983.0
...,...,...
42710181,3496776,107.0
42710182,3496776,69.0
42710183,3496776,133.0
42710188,1421602,322.0


#### Converting into int values

In [8]:
from_post['numbr_likes'] = from_post['numbr_likes'].astype(int)
from_post

Unnamed: 0,sid_profile,numbr_likes
0,3496776,80
10,3303402,114
26,3406435,46
29,3529017,66
52,3206132,1983
...,...,...
42710181,3496776,107
42710182,3496776,69
42710183,3496776,133
42710188,1421602,322


#### Now we can merge the DataFrames to calculate the probability. Note that the result has fewer rows than the posts DataFrame, because not every 'sid_profile' in the posts dataset has a matching 'sid' in the profiles dataset.

In [9]:
df_probs = pd.merge(from_post, from_prof, left_on = 'sid_profile', right_on = 'sid').drop('sid', axis = 1)
df_probs

Unnamed: 0,sid_profile,numbr_likes,followers
0,3496776,80,1204
1,3496776,86,1204
2,3496776,168,1204
3,3496776,102,1204
4,3496776,145,1204
...,...,...,...
27134177,1355462,125,1066
27134178,1451349,117,377
27134179,3500677,147,866
27134180,1788260,58,475
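
One way to measure how many posts are lost to this mismatch is a left join with pandas' `indicator=True` option, which adds a `_merge` column recording where each row came from. This is only a sketch on made-up data; the toy tables and the names `audit` and `unmatched_posts` are assumptions for illustration, and it assumes each `sid` appears once in the profiles table.

```python
import pandas as pd

# Toy posts and profiles (invented values); profile 30 has no matching 'sid'.
toy_post = pd.DataFrame({"sid_profile": [10, 10, 20, 30], "numbr_likes": [5, 7, 1, 9]})
toy_prof = pd.DataFrame({"sid": [10, 20, 40], "followers": [100, 50, 10]})

# A left join keeps every post; '_merge' is 'left_only' when no profile matched.
audit = toy_post.merge(toy_prof, how="left", left_on="sid_profile",
                       right_on="sid", indicator=True)
unmatched_posts = int((audit["_merge"] == "left_only").sum())

# The inner join (the default of pd.merge) keeps only the matched posts.
inner = pd.merge(toy_post, toy_prof, left_on="sid_profile", right_on="sid").drop("sid", axis=1)
print(unmatched_posts, len(inner))
```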


#### We have to do two checks.

#### The first one counts the posts that have at least one like although their author has zero followers.

In [10]:
check = ((df_probs['followers'] == 0) & (df_probs['numbr_likes'] > 0)).sum()
check

2276

#### Now we delete the rows where the number of followers is equal to zero, so that we can do the second check without dividing by zero.

In [11]:
df_probs = df_probs[df_probs['followers'] != 0]
df_probs

Unnamed: 0,sid_profile,numbr_likes,followers
0,3496776,80,1204
1,3496776,86,1204
2,3496776,168,1204
3,3496776,102,1204
4,3496776,145,1204
...,...,...,...
27134177,1355462,125,1066
27134178,1451349,117,377
27134179,3500677,147,866
27134180,1788260,58,475


#### The second check counts the posts whose number of likes is more than 20% of the number of followers, which are now all different from zero. We pick the two columns of interest and do three steps:

1. Create a new column **likes_followers_ratio** holding the ratio between **numbr_likes** and **followers**
1. Keep only the values that are > 0.2
1. Count how many of these values there are

#### Together with the first check, these are the favorable cases for the probability we have to calculate

In [27]:
check2 = df_probs[['numbr_likes', 'followers']].copy()  # explicit copy, so adding a column does not trigger SettingWithCopyWarning
check2["likes_followers_ratio"] = check2.numbr_likes / check2.followers
check2 = check2[check2.likes_followers_ratio > 0.2].likes_followers_ratio.count()
check2

4131855
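
The three steps can also be written as a boolean mask, which avoids building an intermediate DataFrame. Below is a minimal sketch on four made-up posts (the table `toy` and the names `ratio`, `above`, `n_above` and `share` are illustrative assumptions).

```python
import pandas as pd

# Four invented posts; followers are all non-zero, as after the first check.
toy = pd.DataFrame({"numbr_likes": [30, 10, 50, 0],
                    "followers": [100, 100, 200, 10]})

# Step 1: likes-to-followers ratio per post (0.3, 0.1, 0.25, 0.0).
ratio = toy["numbr_likes"] / toy["followers"]

# Step 2: flag the posts above the 20% threshold.
above = ratio > 0.2

# Step 3: count the flagged posts; the mean of the mask is their share.
n_above = int(above.sum())
share = above.mean()
print(n_above, share)
```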

#### At this point we have all the information needed to compute the requested probability, so we show it.

In [28]:
prob_likes = (check + check2) / df_probs.shape[0]
print(f" The probability that a post receives more than 20% likes of the number of followers a user has is: {round(prob_likes, 5)}, so the percentage is {round(prob_likes*100, 3)}%")

 The probability that a post receives more than 20% likes of the number of followers a user has is: 0.15238, so the percentage is 15.238%
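
In the cell above, the posts counted in `check` (zero followers, at least one like) enter the numerator, while the denominator is the row count of `df_probs` after those zero-follower rows were removed. A possible way to keep numerator and denominator on the same set of posts is sketched below on made-up data; both variants are assumptions about how one might define the probability, not the notebook's result, and the names `prob_all` and `prob_nonzero` are hypothetical.

```python
import pandas as pd

# Five invented posts, two of them from profiles with zero followers.
toy = pd.DataFrame({"numbr_likes": [30, 10, 4, 0, 50],
                    "followers": [100, 100, 0, 0, 200]})

# Variant 1: keep every post. Comparing likes with 0.2 * followers avoids the
# division, so a zero-follower post counts as soon as it has at least one like.
exceeds = toy["numbr_likes"] > 0.2 * toy["followers"]
prob_all = exceeds.mean()

# Variant 2: exclude zero-follower posts from numerator and denominator alike.
nonzero = toy[toy["followers"] != 0]
prob_nonzero = (nonzero["numbr_likes"] > 0.2 * nonzero["followers"]).mean()
print(round(prob_all, 3), round(prob_nonzero, 3))
```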


- ## Do users usually return to locations? Extract the probability that a user returns to a site after having posted it in the past. Does that probability make sense to you? Explain why or why not.

#### Take the columns we are interested in and do a bit of preprocessing: remove the NaN values and convert the columns to integers.

In [29]:
prof_loc = dt.fread('instagram_posts.csv', sep='\t', columns={"profile_id", "location_id"}).to_pandas()
prof_loc

Unnamed: 0,profile_id,location_id
0,2.237948e+09,1.022366e+15
1,5.579335e+09,4.574268e+14
2,3.134296e+08,4.574268e+14
3,1.837593e+09,4.574268e+14
4,1.131527e+09,4.574268e+14
...,...,...
42710192,5.556457e+09,4.574268e+14
42710193,3.371865e+08,4.574268e+14
42710194,3.289886e+09,4.574268e+14
42710195,8.536366e+09,4.267235e+06


In [30]:
prof_loc = prof_loc.dropna().astype('int64')
prof_loc 

Unnamed: 0,profile_id,location_id
0,2237947779,1022366247837915
1,5579335020,457426771112991
2,313429634,457426771112991
3,1837592700,457426771112991
4,1131527143,457426771112991
...,...,...
42710192,5556457201,457426771112991
42710193,337186454,457426771112991
42710194,3289886053,457426771112991
42710195,8536366360,4267235


#### Remove the rows whose (profile_id, location_id) pair appears only once, because we want to analyze only the users who post from the same location again.

In [31]:
prof_loc_dup = prof_loc[prof_loc.duplicated(subset=["profile_id", "location_id"], keep=False)]
prof_loc_dup

Unnamed: 0,profile_id,location_id
0,2237947779,1022366247837915
15,176274494,282618748
17,8492416500,130379727582083
21,8492416500,130379727582083
25,8492416500,130379727582083
...,...,...
42710179,2237947779,1022366247837915
42710180,2237947779,1022366247837915
42710181,2237947779,1022366247837915
42710182,2237947779,1022366247837915


#### We drop the duplicates from the dataset above, so that each repeated (profile_id, location_id) pair appears only once, and count the remaining rows: these are the favorable cases for the probability.

In [32]:
numbr_dup = prof_loc_dup.drop_duplicates(subset=["profile_id", "location_id"]).shape[0]
numbr_dup

2962104

#### Calculate the requested probability by dividing by the total number of cases, that is, the number of distinct (profile_id, location_id) pairs in the data (the groups), each counted once.

In [33]:
prob = numbr_dup / prof_loc.groupby(["profile_id", "location_id"]).ngroups

In [34]:
print(f" The probability that a user returns to a site after having posted it in the past is: {round(prob*100, 2)}%")

 The probability that a user returns to a site after having posted it in the past is: 14.04%
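
The figure above is a share of (profile_id, location_id) pairs, not of users. The sketch below, on six made-up posts, reproduces the pair-level count in two equivalent ways and adds a user-level variant for comparison. The user-level definition (a user "returns" if at least one of their locations appears more than once) is an assumption introduced here, as are all the variable names.

```python
import pandas as pd

# Six invented posts: only the pair (1, 7) appears more than once.
toy = pd.DataFrame({"profile_id": [1, 1, 1, 2, 2, 3],
                    "location_id": [7, 7, 8, 7, 9, 9]})

# Pair level: number of posts for each (profile, location) pair.
pair_sizes = toy.groupby(["profile_id", "location_id"]).size()
repeat_pairs = int((pair_sizes > 1).sum())
prob_pair = repeat_pairs / len(pair_sizes)

# Same count via duplicated(keep=False) followed by drop_duplicates.
keys = ["profile_id", "location_id"]
dup_rows = toy[toy.duplicated(subset=keys, keep=False)]
repeat_pairs_alt = dup_rows.drop_duplicates(subset=keys).shape[0]

# User level (assumed definition): users with at least one repeated location.
returning = pair_sizes[pair_sizes > 1].index.get_level_values("profile_id").nunique()
prob_user = returning / toy["profile_id"].nunique()
print(prob_pair, round(prob_user, 3))
```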


### Comments about this probability

###### This probability does not make much sense, because a post tagged with a location does not prove that the user has returned to that location. For example, a user might have visited a place without posting anything at the time, and then published several posts with that location tag after returning home. Or a user might publish multiple posts on different days, even a long time apart, while having been to that location only once.

###### Nevertheless, this probability gives us a very rough idea of how often people return to the same places or, perhaps more accurately, how much they like to show that they visit the same place many times, or that they have visited various parts of it, even though in reality they may have been there only once in their lives and maybe only for a short time.

###### So the result is interesting to find out, but for the purpose of getting really relevant information it is of little use on its own, and we should go into much more detail.
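
One possible step toward that extra detail is to count a pair as a return only when its posts fall on different calendar days. The sketch below assumes a post timestamp is available; the column name `posted_at` is hypothetical and the rows are made up. As noted above, posts on different days still do not prove separate visits, so this only filters out same-day bursts.

```python
import pandas as pd

# Invented posts with a hypothetical timestamp column 'posted_at'.
toy = pd.DataFrame({
    "profile_id": [1, 1, 1, 2, 2],
    "location_id": [7, 7, 7, 9, 9],
    "posted_at": pd.to_datetime([
        "2019-01-01 10:00", "2019-01-01 18:00", "2019-03-05 09:00",
        "2019-02-02 12:00", "2019-02-02 12:30",
    ]),
})

# Truncate each timestamp to its calendar day.
toy["day"] = toy["posted_at"].dt.normalize()

# A pair counts as a return only if its posts span more than one distinct day.
days_per_pair = toy.groupby(["profile_id", "location_id"])["day"].nunique()
returned = days_per_pair > 1
prob_return = returned.mean()
print(prob_return)
```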