### Analyzing Roger Federer Instagram Posts

Do comment sentiment, caption sentiment, and like counts depend on whether Federer's posts revolve around the sport he is famous for (tennis)?

This notebook applies Natural Language Processing to real-world data and follows up the results with a series of A/B-style two-sample t-tests to determine whether there is a statistically significant difference between tennis and non-tennis posts.

H0: Whether or not the content is tennis-related has no significant impact on post performance metrics.

H1: Whether or not the content is tennis-related has a significant impact on post performance metrics.
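
For a single metric such as likes, the hypotheses can be written in terms of group means. The block below is a sketch: the symbols ($\mu_T$ and $\mu_N$ for the population means of tennis and non-tennis posts, and $\bar{x}$, $s^2$, $n$ for the sample means, variances, and sizes) are notation assumed here, not taken from the notebook. The statistic shown is the pooled-variance form that `scipy.stats.ttest_ind` computes with its default `equal_var=True`.

```latex
H_0:\ \mu_T = \mu_N \qquad H_1:\ \mu_T \neq \mu_N

t = \frac{\bar{x}_T - \bar{x}_N}{s_p \sqrt{\frac{1}{n_T} + \frac{1}{n_N}}},
\qquad
s_p^2 = \frac{(n_T - 1)\, s_T^2 + (n_N - 1)\, s_N^2}{n_T + n_N - 2}
```

Read this way, the t-statistic expresses the difference in sample means in units of its estimated standard error.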


In [22]:
import instaloader

# Initialize Instaloader and log in
L = instaloader.Instaloader()
# 'account' and 'password' are placeholders; substitute real credentials
L.login('account', 'password')
# Get Roger Federer's profile
profile = instaloader.Profile.from_username(L.context, 'rogerfederer')

# Collect 10 posts
posts = []
for post in profile.get_posts():
    if len(posts) >= 10:
        break
    posts.append(post)

In [23]:
# Define keywords related to tennis
tennis_keywords = ['tennis', 'match', 'serve', 'racket', 'court', 'game','ball',
                  'backhand','forehand','volley','slice','dropshot','overhead']

def is_tennis_related(caption):
    # Check if any tennis keyword occurs in the caption (case-insensitive substring match);
    # a post without a caption (None) counts as not tennis-related
    text = (caption or '').lower()
    return any(keyword in text for keyword in tennis_keywords)

# Analyze each post
post_data = []
for post in posts:
    comments = []
    # Collect up to 20 comments from each post
    for comment in post.get_comments():
        comments.append(comment.text)
        if len(comments) >= 20:
            break

    post_info = {
        'caption': post.caption,
        'likes': post.likes,
        'comments': comments, 
        'is_tennis_related': is_tennis_related(post.caption)
    }
    post_data.append(post_info)
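
The keyword check above is a substring match, so a keyword such as 'ball' also fires on 'ballet' and 'serve' on 'deserve'. A minimal sketch of a stricter whole-word variant follows; the function name `is_tennis_related_whole_word` and the constant `TENNIS_PATTERN` are hypothetical and not part of the original notebook.

```python
import re

# Same keyword list as in the cell above
tennis_keywords = ['tennis', 'match', 'serve', 'racket', 'court', 'game', 'ball',
                   'backhand', 'forehand', 'volley', 'slice', 'dropshot', 'overhead']

# \b anchors each keyword to word boundaries, so 'ball' no longer matches 'ballet'
TENNIS_PATTERN = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, tennis_keywords)) + r')\b',
    re.IGNORECASE,
)

def is_tennis_related_whole_word(caption):
    """Hypothetical variant of is_tennis_related that matches whole words only."""
    # A missing caption (None) is treated as an empty string
    return bool(TENNIS_PATTERN.search(caption or ''))

print(is_tennis_related_whole_word('What a match today!'))    # True
print(is_tennis_related_whole_word('A night at the ballet'))  # False
print(is_tennis_related_whole_word('You deserve a break'))    # False
```

The trade-off is that inflected forms such as 'matches' no longer match unless they are added to the keyword list, so which variant classifies the posts better depends on the captions at hand.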


In [38]:
from textblob import TextBlob

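def analyze_sentiment(text):
    # Polarity lies in [-1.0, 1.0]; TextBlob's default analyzer targets English text,
    # and a missing caption (None) is scored as neutral
    return TextBlob(text or '').sentiment.polarity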

# Analyze sentiment for captions and comments
for post in post_data:
    post['caption_sentiment'] = analyze_sentiment(post['caption'])
    post['comments_sentiment'] = [analyze_sentiment(comment) for comment in post['comments']]
    post['average_comments_sentiment'] = sum(post['comments_sentiment']) / len(post['comments_sentiment']) if post['comments_sentiment'] else 0

# Split likes and sentiment scores into tennis-related and non-tennis-related groups
tennis_related_likes = [post['likes'] for post in post_data if post['is_tennis_related']]
non_tennis_related_likes = [post['likes'] for post in post_data if not post['is_tennis_related']]

tennis_related_sentiment = [post['average_comments_sentiment'] for post in post_data if post['is_tennis_related']]
non_tennis_related_sentiment = [post['average_comments_sentiment'] for post in post_data if not post['is_tennis_related']]

tennis_related_sentiment_caption = [post['caption_sentiment'] for post in post_data if post['is_tennis_related']]
non_tennis_related_sentiment_caption = [post['caption_sentiment'] for post in post_data if not post['is_tennis_related']]

# Compute group averages (0 if a group is empty)
avg_likes_tennis = sum(tennis_related_likes) / len(tennis_related_likes) if tennis_related_likes else 0
avg_likes_non_tennis = sum(non_tennis_related_likes) / len(non_tennis_related_likes) if non_tennis_related_likes else 0
avg_comments_tennis = sum(tennis_related_sentiment) / len(tennis_related_sentiment) if tennis_related_sentiment else 0
avg_comments_non_tennis = sum(non_tennis_related_sentiment) / len(non_tennis_related_sentiment) if non_tennis_related_sentiment else 0
avg_caption_tennis = sum(tennis_related_sentiment_caption) / len(tennis_related_sentiment_caption) if tennis_related_sentiment_caption else 0
avg_caption_non_tennis = sum(non_tennis_related_sentiment_caption) / len(non_tennis_related_sentiment_caption) if non_tennis_related_sentiment_caption else 0

# Print results
print("Average Likes (Tennis Related):", f'{avg_likes_tennis:.2f}')
print("Average Likes (Not Tennis Related):", f'{avg_likes_non_tennis:.2f}')
print("Average Comment Sentiment (Tennis Related):", f'{avg_comments_tennis:.2f}')
print("Average Comment Sentiment (Not Tennis Related):", f'{avg_comments_non_tennis:.2f}')
print("Average Caption Sentiment (Tennis Related):", f'{avg_caption_tennis:.2f}')
print("Average Caption Sentiment (Not Tennis Related):", f'{avg_caption_non_tennis:.2f}')

Average Likes (Tennis Related): 67333.67
Average Likes (Not Tennis Related): 264584.50
Average Comment Sentiment (Tennis Related): 0.21
Average Comment Sentiment (Not Tennis Related): 0.18
Average Caption Sentiment (Tennis Related): 0.08
Average Caption Sentiment (Not Tennis Related): 0.34
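
The six averages above repeat the same "mean, or 0 for an empty group" pattern. As a sketch, that pattern can be factored into one helper using only the standard library; `mean_or_zero` is a hypothetical name, and the likes and flags are copied from the ten posts in this notebook.

```python
from statistics import fmean

# Likes and tennis flags for Post 1 to Post 10, copied from this notebook's data
likes = [34120, 35813, 132450, 558204, 73503, 333564, 141101, 47749, 44132, 61704]
is_tennis = [False, True, False, False, True, False, True, True, True, True]

def mean_or_zero(values):
    """Hypothetical helper: arithmetic mean, or 0 for an empty group."""
    values = list(values)
    return fmean(values) if values else 0

avg_likes_tennis = mean_or_zero(n for n, t in zip(likes, is_tennis) if t)
avg_likes_non_tennis = mean_or_zero(n for n, t in zip(likes, is_tennis) if not t)

print(f'{avg_likes_tennis:.2f}')      # 67333.67
print(f'{avg_likes_non_tennis:.2f}')  # 264584.50
```

The same helper would apply unchanged to the comment and caption sentiment lists.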


In [34]:
import pandas as pd
post_df = pd.DataFrame(post_data)
# Label rows 'Post 1', 'Post 2', ... for however many posts were collected
post_df.index = [f'Post {i}' for i in range(1, len(post_df) + 1)]
post_df

| | caption | likes | comments | is_tennis_related | caption_sentiment | comments_sentiment | average_comments_sentiment |
|---|---|---|---|---|---|---|---|
| Post 1 | Wann habt ihr euch das letzte mal ⏰ für echte ... | 34120 | [its the aggressive selfie stick pull out for ... | False | 0.1 | [0.0, -0.25, 0.0, 0.325, 0.5, 0.5, 0.0, 0.2, 0... | 0.207083 |
| Post 2 | This is the next chapter of excellence. \n\nUn... | 35813 | [Epic day out👏🏼, Amazing event to attend, than... | True | 0.083333 | [0.1, 0.42500000000000004, 0.625, 0.0, 0.0, 0.... | 0.17271 |
| Post 3 | Ready for a new autumn adventure? Be careful t... | 132450 | [The soundtrack "Falling for Autumn" is now li... | False | 0.269949 | [0.3352272727272727, 0.0, 0.0, 0.0, 1.0, -0.06... | 0.329309 |
| Post 4 | Here for a NY Minute 🥨🗽🎾🍎🚕 | 558204 | [Out here living your best life, and we love t... | False | 0.0 | [0.75, 0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0... | 0.1125 |
| Post 5 | A story of unequalled dedication and deep pass... | 73503 | [Omg! I need this ASAP 😭😭😭😭, Yes Please !, I d... | True | 0.055195 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ... | 0.223047 |
| Post 6 | He’s here, and he’s perfect 🤩 | 333564 | [No matter what you do, you will never achieve... | False | 1.0 | [0.0, 0.5666666666666667, 0.0, 0.0, 0.0, 0.0, ... | 0.084167 |
| Post 7 | Ladies and gentleman, the RF 01. \n\nWith grou... | 141101 | [@wilsontennis @rogerfederer Can we buy these ... | True | 0.05 | [0.0, 0.008333333333333331, -0.0625, 0.0, 0.42... | 0.035734 |
| Post 8 | Introducing the RF Collection by Wilson + Roge... | 47749 | [I want one!, This collection is peRFect 🔥, OM... | True | 0.125 | [0.0, 1.0, 0.5, 0.7833333333333333, 0.15340909... | 0.464962 |
| Post 9 | ‘Federer’ tells the story of a life like no ot... | 44132 | [He didn't even congratulate in Novak Djokovic... | True | 0.11125 | [0.0, 0.0, 1.0, 0.48828125, 0.5, 0.0, 0.0, 0.0... | 0.119414 |
| Post 10 | New episodes of 24 Hours with Roger are here.\... | 61704 | [Roger was already a ballet expert on the cour... | True | 0.045455 | [0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.35, 0.4, 0.0,... | 0.249653 |


In [48]:
from scipy import stats

# Test if there is a significant difference between the likes on tennis and non-tennis posts
# (two-sample t-test; ttest_ind assumes equal group variances by default)

t_stat, p_value = stats.ttest_ind(tennis_related_likes, non_tennis_related_likes)


# Draw conclusions based on the results
# (the t-statistic is the difference in group means in units of its standard error)

print(f"T-Statistic: {t_stat:.2f}. \n\nThe mean likes on tennis and non-tennis posts differ by about {abs(t_stat):.2f} standard errors.")


print(f"\nP-Value: {p_value:.2f}.")
if p_value > 0.05:
    print('\nThere is no significant difference between the like figures for the posts.')
else:
    print('\nThere is a significant difference in the like figures for the posts.')

T-Statistic: -2.10. 

The mean likes on tennis and non-tennis posts differ by about 2.10 standard errors.

P-Value: 0.07.

There is no significant difference between the like figures for the posts.
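
To make the t-statistic concrete, the sketch below recomputes it by hand from the like counts in the post table, using only the standard library. The function names are hypothetical. The pooled form corresponds to the default behaviour of `stats.ttest_ind`; the Welch form drops the equal-variance assumption (in SciPy, `equal_var=False`) and is shown only as a possible alternative, since the two groups here have very different spreads.

```python
from math import sqrt
from statistics import fmean, variance

# Likes per group, copied from the post table in this notebook
tennis_likes = [35813, 73503, 141101, 47749, 44132, 61704]
non_tennis_likes = [34120, 132450, 558204, 333564]

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    n_a, n_b = len(a), len(b)
    # statistics.variance is the sample variance (divides by n - 1)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (fmean(a) - fmean(b)) / standard_error

def welch_t_statistic(a, b):
    """Welch's t statistic, which does not assume equal group variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (fmean(a) - fmean(b)) / standard_error

print(f'{pooled_t_statistic(tennis_likes, non_tennis_likes):.2f}')  # -2.10
print(f'{welch_t_statistic(tennis_likes, non_tennis_likes):.2f}')
```

The pooled value reproduces the T-Statistic reported above, which confirms that it measures the gap between the group means in standard errors of that gap.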


In [47]:
from scipy import stats

# Test if there is a significant difference between comment sentiments on tennis and non-tennis posts (A/B Test)
# (two-sample t-test; ttest_ind assumes equal group variances by default)

t_stat, p_value = stats.ttest_ind(tennis_related_sentiment, non_tennis_related_sentiment)


# Draw conclusions based on the results
# (the t-statistic is the difference in group means in units of its standard error)

print(f"T-Statistic: {t_stat:.2f}. \n\nThe mean comment sentiments on tennis and non-tennis posts differ by about {abs(t_stat):.2f} standard errors.")


print(f"\nP-Value: {p_value:.2f}.")
if p_value > 0.05:
    print('\nThere is no significant difference between the comment sentiments for the posts.')
else:
    print('\nThere is a significant difference in the comment sentiments for the posts.')

T-Statistic: 0.32. 

The mean comment sentiments on tennis and non-tennis posts differ by about 0.32 standard errors.

P-Value: 0.76.

There is no significant difference between the comment sentiments for the posts.


In [46]:
from scipy import stats

# Test if there is a significant difference between caption sentiments on tennis and non-tennis posts (A/B Test)
# (two-sample t-test; ttest_ind assumes equal group variances by default)

t_stat, p_value = stats.ttest_ind(tennis_related_sentiment_caption, non_tennis_related_sentiment_caption)


# Draw conclusions based on the results
# (the t-statistic is the difference in group means in units of its standard error)

print(f"T-Statistic: {t_stat:.2f}. \n\nThe mean caption sentiments on tennis and non-tennis posts differ by about {abs(t_stat):.2f} standard errors.")


print(f"\nP-Value: {p_value:.2f}.")
if p_value > 0.05:
    print('\nThere is no significant difference between the caption sentiments for the posts.')
else:
    print('\nThere is a significant difference in the caption sentiments for the posts.')

T-Statistic: -1.47. 

The mean caption sentiments on tennis and non-tennis posts differ by about 1.47 standard errors.

P-Value: 0.18.

There is no significant difference between the caption sentiments for the posts.


#### Conclusion

There was not enough statistical evidence to reject the null hypothesis for any of the three metrics (p = 0.07 for likes, 0.76 for comment sentiment, and 0.18 for caption sentiment).

Therefore, it could not be concluded that tennis content in posts impacts performance metrics.

The sample covers only 10 posts (6 tennis-related, 4 not), so with more posts it might be possible to detect a statistically significant difference in post performance.

The difference in likes was substantial, with a p-value of 0.07: if tennis and non-tennis posts had the same mean likes, a difference at least this large would be expected in about 7% of samples. With more data, it is possible that the p-value would fall below the 0.05 significance threshold.
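
Because the sample is only ten posts and the like counts are strongly skewed, a distribution-free check can complement the t-test on likes. The sketch below is an exact permutation test on the difference in mean likes, written with the standard library only. It is an illustrative alternative, not part of the original analysis, and `exact_permutation_p_value` is a hypothetical helper name.

```python
from itertools import combinations
from statistics import fmean

# Likes and tennis flags for Post 1 to Post 10, copied from the post table
likes = [34120, 35813, 132450, 558204, 73503, 333564, 141101, 47749, 44132, 61704]
is_tennis = [False, True, False, False, True, False, True, True, True, True]

def exact_permutation_p_value(values, group_a_indices):
    """Two-sided exact permutation p-value for a difference in group means.

    Enumerates every way of choosing len(group_a_indices) items as group A
    and counts how often the absolute mean difference is at least as large
    as the one observed for the real grouping.
    """
    n = len(values)
    k = len(group_a_indices)

    def abs_mean_diff(indices):
        chosen = set(indices)
        group_a = [values[i] for i in range(n) if i in chosen]
        group_b = [values[i] for i in range(n) if i not in chosen]
        return abs(fmean(group_a) - fmean(group_b))

    observed = abs_mean_diff(group_a_indices)
    diffs = [abs_mean_diff(c) for c in combinations(range(n), k)]
    # Small tolerance guards against floating-point ties with the observed split
    extreme = sum(d >= observed - 1e-9 for d in diffs)
    return extreme / len(diffs)

tennis_indices = [i for i, flag in enumerate(is_tennis) if flag]
p_value = exact_permutation_p_value(likes, tennis_indices)
print(f'Permutation p-value: {p_value:.3f}')
```

With 6 tennis posts out of 10 there are only 210 possible splits, so full enumeration is cheap; for larger samples a random subset of permutations would be the usual substitute.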