**Social media data analysis - Date of Data: (insert accordingly)**

This notebook is a simplified version of the one used for the UCL Social Consultancy Challenge 2024.

To run a cell, select it and press Ctrl + Enter. Alternatively, go to the menu at the top and select Run > Run Selected Cell or Run > Run All.

Steps to obtain data
1. Go to instagram.com and sign in. Click the “More” button at the bottom left, then “Settings”.
2. Click “Account Center”.
3. Click “Your information and permissions”, then “Download your information”.
4. Click “Download or transfer information”.
5. Select the account you want to download data from, then click “All available information” > “Download to device”.
6. Select “All time” for “Date Range” and “JSON” for the format, then click “Create files”.
7. Click “Download your information” > “Download”.
8. Once the file is ready, download and unzip it.
9. Find “audience_insights.json”, “content_interactions.json”, “followers_1.json”, “following.json”, “liked_posts.json”, “post_comments_1.json”, “posts.json” and “reels.json”, and put them in your working folder, alongside this notebook, to run the analysis. Note: if any file is missing or renamed, the notebook will raise errors.
10. Create an “Images” folder in the working folder.
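Before running the analysis, it can help to confirm that the working folder is set up as described in the steps above. The following is a minimal sketch, not part of the original notebook: the helper names `find_missing_files` and `prepare_working_folder` are hypothetical, and it assumes the notebook is run from the working folder.

```python
from pathlib import Path

# File names listed in the steps above
REQUIRED_FILES = [
    "audience_insights.json", "content_interactions.json", "followers_1.json",
    "following.json", "liked_posts.json", "post_comments_1.json",
    "posts.json", "reels.json",
]

def find_missing_files(folder=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]

def prepare_working_folder(folder="."):
    """Create the Images folder if it does not exist and return any missing files."""
    (Path(folder) / "Images").mkdir(exist_ok=True)
    return find_missing_files(folder)

missing = prepare_working_folder(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files found.")
```

If any names are printed as missing, check that the files were not renamed when unzipping.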


**0. Install Relevant Packages**

If this is your first time using this notebook, remove the # at the start of each line in the next box only, then run it.

In [None]:
#!pip install pandas
#!pip install numpy
#!pip install geopandas
#!pip install matplotlib
#!pip install wordcloud

**1. Import Relevant Packages**

This section imports the packages and defines the helper function used in the analysis. No action required.

In [None]:
import json
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import re
import datetime
from wordcloud import WordCloud

#wrap text in BOLD and END inside a print statement to make it bold (e.g. print(f"{BOLD}Mar 13 - Jun 10 vs. Dec 14 - Mar 12{END}"))
BOLD = '\033[1m'
END = '\033[0m'

In [None]:
def plot_average_interaction_day_time(df, column1, column2, column3, savepath):
    # Calculate the average interaction value for each unique combination of day and time
    avg_interaction = df.groupby([column1, column2])[column3].mean().reset_index()

    # Define the correct order of days
    day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

    # Map day names to their corresponding order
    avg_interaction['DayOrder'] = avg_interaction[column1].map({day: i for i, day in enumerate(day_order)})

    # Sort the DataFrame by day order
    avg_interaction.sort_values(by='DayOrder', inplace=True)

    # Scale the interaction values for marker size
    size_scale_factor = 10
    scaled_interaction = avg_interaction[column3] * size_scale_factor

    # Create a scatter plot
    plt.figure(figsize=(12, 6))  # Slightly wider plot for better readability
    scatter = plt.scatter(avg_interaction['DayOrder'], avg_interaction[column2], s=scaled_interaction,
                          c=avg_interaction[column3], cmap='viridis', alpha=0.7)  # Use colormap

    plt.xlabel('Day of the Week', fontsize=12)
    plt.ylabel('Time of Day (24-hour format)', fontsize=12)
    plt.title('Average Interaction Value by Day and Time', fontsize=14, fontweight = 'bold')
    plt.xticks(range(len(day_order)), day_order, rotation=45, ha='right', fontsize=10)
    plt.yticks(range(24), [f'{hour:02}:00' for hour in range(24)], fontsize=10) 
    
    # Colorbar that matches the actual interaction values
    cbar = plt.colorbar(scatter)
    cbar.set_label('Average Interaction Value', fontsize=12)

    #save the plot
    plt.savefig(savepath, dpi=300, bbox_inches='tight', transparent = True) 

**2a. Load data for followers/following**

This section loads the Followers and Following data. No action required.

In [None]:
#load following data
with open('following.json', 'r', encoding='utf-8') as file_object:
    following = json.load(file_object)

In [None]:
#extract the 'value' fields so we know who we are following
following_list = [item['value'] for entry in following['relationships_following'] for item in entry['string_list_data']]

In [None]:
#load followers data
with open('followers_1.json', 'r', encoding='utf-8') as file_object:
    followers = json.load(file_object)

In [None]:
#extract the 'value' fields so we know who are our followers
followers_list = [entry['string_list_data'][0]['value'] for entry in followers if entry['string_list_data']]
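The two extraction cells above imply slightly different file layouts: following.json is read as a dictionary with a 'relationships_following' list, while followers_1.json is read as a plain list. The sketch below illustrates this with made-up miniature data; the sample usernames and the helper `extract_usernames` are hypothetical, and the layout is inferred only from the code in this notebook, so a real export may contain more fields.

```python
import json

# Hypothetical miniature exports, shaped the way the extraction code above expects
following_sample = json.loads("""
{"relationships_following": [
    {"string_list_data": [{"value": "example_community_group"}]},
    {"string_list_data": [{"value": "example_friend"}]}
]}
""")
followers_sample = json.loads("""
[
    {"string_list_data": [{"value": "example_friend"}]},
    {"string_list_data": []}
]
""")

def extract_usernames(entries):
    """Collect every 'value' found in each entry's 'string_list_data' list."""
    return [item["value"] for entry in entries for item in entry.get("string_list_data", [])]

following_usernames = extract_usernames(following_sample["relationships_following"])
followers_usernames = extract_usernames(followers_sample)
print(following_usernames)
print(followers_usernames)
```

Entries with an empty 'string_list_data' list are skipped, which mirrors the `if entry['string_list_data']` guard used for followers above.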

**2b. Filter out potential collaborators (e.g. Community groups not following back)**

This section picks out potential collaborators. In the next box, add or change the words in filter_words to keep only usernames that contain them. The last box prints users who are not following back.

In [None]:
#skip this step if you do not want to filter out accounts with any key words
#update accordingly
filter_words = ['community', 'london']

#filter the following list
following_filtered_list = [username for username in following_list if any(filter_word in username.lower() for filter_word in filter_words)]
print(following_filtered_list)

#filter the followers list
followers_filtered_list = [username for username in followers_list if any(filter_word in username.lower() for filter_word in filter_words)]
print(followers_filtered_list)

In [None]:
#convert list to sets
following_set = set(following_filtered_list) #if not filtering for filter words, you can replace "following_filtered_list" with "following_list"
followers_set = set(followers_filtered_list) #if not filtering for filter words, you can replace "followers_filtered_list" with "followers_list"

#find the difference (accounts we follow that do not follow us back)
not_following_back = following_set - followers_set

#convert the result back to a list
not_following_back_list = list(not_following_back)

#print the usernames or IDs that are not following back
print("Users not following back:", not_following_back_list)
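Set difference is directional, so it is worth being explicit about which side is subtracted. The sketch below uses made-up usernames (all hypothetical) to show the three results that can be derived from a following set and a followers set.

```python
# Hypothetical usernames for illustration only
following_example = {"community_kitchen", "london_gardeners", "local_cafe"}
followers_example = {"london_gardeners", "new_neighbour"}

# Accounts you follow that do not follow you back
not_following_you_back = sorted(following_example - followers_example)

# Accounts that follow you but that you do not follow
you_do_not_follow_back = sorted(followers_example - following_example)

# Accounts where the follow is mutual
mutual = sorted(following_example & followers_example)

print("Not following you back:", not_following_you_back)
print("You do not follow back:", you_do_not_follow_back)
print("Mutual:", mutual)
```

Sorting the result gives a stable, readable order, since sets themselves are unordered.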

**3. Audience Insights**

This section presents the demographic breakdown of the Instagram followers. No action required.

In [None]:
#load audience insights
with open('audience_insights.json', 'r', encoding='utf-8') as file_object:
    audienceinsights = json.load(file_object)

In [None]:
audienceinsights['organic_insights_audience'][0]['string_map_data']

In [None]:
follower_count = audienceinsights['organic_insights_audience'][0]['string_map_data']['Followers']['value']
print(follower_count)

In [None]:
city = audienceinsights['organic_insights_audience'][0]['string_map_data']['Follower Percentage by City']['value']
print(city)

In [None]:
country = audienceinsights['organic_insights_audience'][0]['string_map_data']['Follower Percentage by Country']['value']
print(country)

In [None]:
age_gender = audienceinsights['organic_insights_audience'][0]['string_map_data']['Follower Percentage by Age for All Genders']['value']
print(age_gender)

In [None]:
#split by commas
split_items = age_gender.split(", ")

#initialize empty lists for age groups and percentages
age_groups = []
percentages = []

#extract age groups and percentages
for item in split_items:
    age, percent_str = item.split(": ")
    age_groups.append(age)
    percentages.append(float(percent_str.replace("%", "")))

#create a pie chart
plt.figure(figsize=(8, 8))
wedges, _ = plt.pie(percentages, labels=[f"{age} ({percent:.1f}%)" for age, percent in zip(age_groups, percentages)], startangle=45, wedgeprops={'edgecolor': 'black'})

#add a legend with labels only
#plt.legend(wedges, age_groups, title="Age Groups", loc="upper left")

plt.title('Age Group Distribution', fontweight="bold")
plt.axis('equal')  # equal aspect ratio ensures the pie chart is drawn as a circle

#save the plot
plt.savefig('Images/agegroupdistributionpie.png', dpi=300, bbox_inches='tight', transparent = True) 

In [None]:
age_male = audienceinsights['organic_insights_audience'][0]['string_map_data']['Follower Percentage by Age for Men']['value']
print(age_male)

In [None]:
#split by commas
split_items = age_male.split(", ")

#initialize empty lists for age groups and percentages
age_groups = []
percentages = []

#extract age groups and percentages
for item in split_items:
    age, percent_str = item.split(": ")
    age_groups.append(age)
    percentages.append(float(percent_str.replace("%", "")))

#create a pie chart
plt.figure(figsize=(8, 8))
wedges, _ = plt.pie(percentages, labels=[f"{age} ({percent:.1f}%)" for age, percent in zip(age_groups, percentages)], startangle=45, wedgeprops={'edgecolor': 'black'})

#add a legend with labels only
#plt.legend(wedges, age_groups, title="Age Groups", loc="upper left")

plt.title('Age Group Distribution (Male)', fontweight="bold")
plt.axis('equal')  # equal aspect ratio ensures the pie chart is drawn as a circle

#show the plot
plt.show()

In [None]:
age_female = audienceinsights['organic_insights_audience'][0]['string_map_data']['Follower Percentage by Age for Women']['value']
print(age_female)

In [None]:
#split by commas
split_items = age_female.split(", ")

#initialize empty lists for age groups and percentages
age_groups = []
percentages = []

#extract age groups and percentages
for item in split_items:
    age, percent_str = item.split(": ")
    age_groups.append(age)
    percentages.append(float(percent_str.replace("%", "")))

#create a pie chart
plt.figure(figsize=(8, 8))
wedges, _ = plt.pie(percentages, labels=[f"{age} ({percent:.1f}%)" for age, percent in zip(age_groups, percentages)], startangle=45, wedgeprops={'edgecolor': 'black'})

#add a legend with labels only
#plt.legend(wedges, age_groups, title="Age Groups", loc="upper left")

plt.title('Age Group Distribution (Female)', fontweight="bold")
plt.axis('equal')  # equal aspect ratio ensures the pie chart is drawn as a circle

#show the plot
plt.show()

In [None]:
# Parse the data
def parse_data(data_str):
    data = {}
    for item in data_str.split(", "):
        age, percent_str = item.split(": ")
        data[age] = float(percent_str.replace("%", ""))
    return data

age_gender_data = parse_data(age_gender)
age_male_data = parse_data(age_male)
age_female_data = parse_data(age_female)

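# Create a line graph, plotting each series against its own age groups
plt.figure(figsize=(10, 6))
plt.plot(list(age_gender_data.keys()), list(age_gender_data.values()), label='Overall')
plt.plot(list(age_male_data.keys()), list(age_male_data.values()), label='Male')
plt.plot(list(age_female_data.keys()), list(age_female_data.values()), label='Female')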
plt.xlabel('Age Groups')
plt.ylabel('Percentage')
plt.title('Age Group Distribution', fontweight = "bold")
plt.legend()
plt.grid(True)

#save the plot
plt.savefig('Images/agegroupdistributiongraph.png', dpi=300, bbox_inches='tight', transparent = True) 

In [None]:
female = audienceinsights['organic_insights_audience'][0]['string_map_data']['Total Follower Percentage for Women']['value']
print(female)
male = audienceinsights['organic_insights_audience'][0]['string_map_data']['Total Follower Percentage for Men']['value']
print(male)
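The age breakdowns in this section are all parsed from strings of the form "age group: percentage%, age group: percentage%". As a minimal sketch, the parsing could be made tolerant of malformed items as shown below. The helper name `parse_percentages` and the sample string are hypothetical, and the sketch assumes labels contain no commas, so it is not suitable as-is for values such as city names that may include commas.

```python
def parse_percentages(data_str):
    """Turn 'label: 12.3%, label: 4.5%' into {'label': 12.3, ...}, skipping malformed items."""
    result = {}
    for item in data_str.split(","):
        label, sep, percent_str = item.strip().rpartition(":")
        if not sep:
            continue  # no colon, so there is nothing to parse
        try:
            result[label.strip()] = float(percent_str.strip().rstrip("%"))
        except ValueError:
            continue  # the percentage was not a number
    return result

# Hypothetical value; the real string comes from audience_insights.json
sample = "13-17: 1.5%, 18-24: 30.2%, 25-34: 41.0%, oops, 35-44: n/a"
print(parse_percentages(sample))
```

Items that cannot be parsed are dropped silently here; printing a warning instead may be preferable when checking a new export.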

**4. Content Interactions**

This section mainly shows the change in each type of interaction between two time periods defined by Instagram (for example, Mar 13 - Jun 10 2024 vs. Dec 14 2023 - Mar 12 2024).

Definitions
- Content Interactions are when users interact directly with your posts and stories, i.e. how many likes, comments, shares and saves you received.
- Post Interactions are when users interact directly with your posts, i.e. how many likes, comments, shares and saves you received.

In [None]:
#load content interactions
with open('content_interactions.json', 'r', encoding='utf-8') as file_object:
    content_interactions = json.load(file_object)

In [None]:
print(content_interactions)

In [None]:
content_interactions['organic_insights_interactions'][0]['string_map_data']

In [None]:
content_interactions_delta = content_interactions['organic_insights_interactions'][0]['string_map_data']
content_interactions_values = {
    'Content Interactions': content_interactions_delta['Content Interactions Delta']['value'],
    'Post Interactions': content_interactions_delta['Post Interactions Delta']['value'],
    'Story Interactions': content_interactions_delta['Story Interactions Delta']['value'],
    'Video Interactions': content_interactions_delta['Video Interactions Delta']['value'],
    'Reels Interactions': content_interactions_delta['Reels Interactions Delta']['value'],
    'Live Video Interactions': content_interactions_delta['Live Video Interactions Delta']['value'],
    'Accounts Engaged': content_interactions_delta['Accounts Engaged Delta']['value']
}

#edit this line to change comparison date as shown in data above
print(f"{BOLD}Mar 13 - Jun 10 vs. Dec 14 - Mar 12{END}")
for metric, value in content_interactions_values.items():
    #keep only the text before the first space
    value_without_percent = value.split(" ")[0] 
    print(f"{metric}: {value_without_percent}") 

**5. Posts**

This section lets you see which posts perform best on a chosen metric, and plot that metric against day of the week, time of day, or both, to understand when to post.

Available metrics for posts: Profile visits, Impressions, Follows, Accounts reached, Saves, Likes, Comments, Shares

Available functions (in this simplified notebook only plot_average_interaction_day_time is defined, in Section 1; the other two plotting functions must be defined before use)
- post_data.sort_values(by='insert one of above metrics here', ascending = False)
     - This sorts all available posts by the selected metric, highest first
     - Replace 'insert one of above metrics here'
- plot_average_interaction_day(post_data, 'Day of Creation', 'insert one of above metrics here', 'Images/yourfilename.png')
     - This plots a bar graph of the selected metric for each day of the week
     - Replace 'insert one of above metrics here' and 'yourfilename' with your chosen filename
- plot_average_interaction_time(post_data, 'Hour of Creation', 'insert one of above metrics here', 'Images/yourfilename.png')
     - This plots a line graph of the selected metric for each of the 24 hours in a day
     - Replace 'insert one of above metrics here' and 'yourfilename' with your chosen filename
- plot_average_interaction_day_time(post_data, 'Day of Creation', 'Hour of Creation', 'insert one of above metrics here', 'Images/yourfilename.png')
     - This plots a dot plot of the average of a selected metric for each time of the day and day of the week
     - Replace 'insert one of above metrics here' and 'yourfilename' with your chosen filename
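Of the plotting functions listed above, only plot_average_interaction_day_time is defined in this simplified notebook. The following is a minimal sketch of how the other two could be written so that the calls in the list work. It is an assumption based on the descriptions above (a bar graph per day, a line graph per hour), not the code from the full notebook. Days or hours with no posts are shown as 0, which is a simplification.

```python
import matplotlib.pyplot as plt
import pandas as pd

DAY_ORDER = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

def plot_average_interaction_day(df, day_column, metric, savepath):
    """Bar chart of the mean of `metric` for each day of the week."""
    # Reindex so all seven days appear in calendar order (days without posts become 0)
    avg = df.groupby(day_column)[metric].mean().reindex(DAY_ORDER, fill_value=0)
    plt.figure(figsize=(10, 6))
    plt.bar(avg.index, avg.values)
    plt.xlabel('Day of the Week')
    plt.ylabel(f'Average {metric}')
    plt.title(f'Average {metric} by Day', fontweight='bold')
    plt.savefig(savepath, dpi=300, bbox_inches='tight', transparent=True)
    return avg

def plot_average_interaction_time(df, hour_column, metric, savepath):
    """Line chart of the mean of `metric` for each hour of the day (0-23)."""
    # Reindex so all 24 hours appear (hours without posts become 0)
    avg = df.groupby(hour_column)[metric].mean().reindex(range(24), fill_value=0)
    plt.figure(figsize=(10, 6))
    plt.plot(avg.index, avg.values, marker='o')
    plt.xticks(range(24))
    plt.xlabel('Hour of Day (24-hour format)')
    plt.ylabel(f'Average {metric}')
    plt.title(f'Average {metric} by Hour', fontweight='bold')
    plt.savefig(savepath, dpi=300, bbox_inches='tight', transparent=True)
    return avg
```

Both functions return the averaged values as well as saving the figure, so the numbers behind each chart can be inspected directly.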

In [None]:
#load posts
with open('posts.json', 'r', encoding='utf-8') as file_object:
    posts = json.load(file_object)

In [None]:
#posts['organic_insights_posts'][0]['media_map_data']['Media Thumbnail']['title']
#posts['organic_insights_posts'][0]['string_map_data']

In [None]:
#creates a dataframe of the metrics for each post
#initialize lists to store data
titles = []
timestamps = []
days_of_week = []
hours_of_creation = []

#initialize a dictionary for fields to be converted to integers
post_data = {
    'Profile visits': [],
    'Impressions': [],
    'Follows': [],
    'Accounts reached': [],
    'Saves': [],
    'Likes': [],
    'Comments': [],
    'Shares': []
}

#extract relevant data for all items
for post in posts['organic_insights_posts']:
    titles.append(post['media_map_data'].get('Media Thumbnail', {}).get('title', 'N/A'))
    timestamp = post['string_map_data']['Creation Timestamp']['timestamp']
    dt = datetime.datetime.fromtimestamp(timestamp)
    timestamps.append(dt.strftime('%Y-%m-%d %H:%M:%S'))
    days_of_week.append(dt.strftime('%A'))
    hours_of_creation.append(dt.hour)

    #convert specific fields to integers, handling errors
    for key in post_data.keys():
        try:
            value = post['string_map_data'].get(key, {}).get('value', 'N/A')
            #if the value is already an integer, no conversion is needed
            if isinstance(value, int):
                post_data[key].append(value)
            #if the value is a string containing a number, convert to an integer
            elif isinstance(value, str) and value.isdigit():
                post_data[key].append(int(value))
            else:
                post_data[key].append(0)  #default value for non-numeric data
        except KeyError:
            post_data[key].append(0)  #default value for missing keys

#create a dataFrame
post_data = pd.DataFrame({
    'Title': titles,
    'Creation Timestamp': timestamps,
    'Day of Creation': days_of_week,
    'Hour of Creation': hours_of_creation,
    **post_data  # Unpack the post_data dictionary
})

#print the dataFrame
post_data.head()
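The conversion loop above stores 0 for any value that is not an int or a string of digits. If an export ever contains values such as '1,234' or '12.0', they would be recorded as 0. The sketch below shows one possible, more tolerant conversion; the helper name `to_int` is hypothetical, and whether Instagram exports actually use such formats is an assumption that should be checked against your own data.

```python
def to_int(value, default=0):
    """Convert an exported metric to int, accepting ints, floats and strings such as '1,234'."""
    if isinstance(value, bool):
        return default  # bools are ints in Python, but they are not metrics
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        value = value.replace(",", "").strip()  # drop thousands separators
    try:
        return int(float(value))  # truncates any decimal part
    except (TypeError, ValueError, OverflowError):
        return default  # missing or non-numeric value

print(to_int("1,234"), to_int("56"), to_int("N/A"), to_int(None))
```

Returning a default of 0 keeps the behaviour of the loop above for missing values, at the cost of making a missing metric indistinguishable from a true zero.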

In [None]:
##optional step to filter out posts for specific dates. Remove only one # from the start of each line if running this code
#post_data['Creation Timestamp'] = pd.to_datetime(post_data['Creation Timestamp'])

#start_date = pd.Timestamp('2023-03-31') #change your start date accordingly
#end_date = pd.Timestamp('2024-04-01') #change your end date
#post_data = post_data[(post_data['Creation Timestamp'] >= start_date) & (post_data['Creation Timestamp'] < end_date)]

In [None]:
#available metrics for posts: Profile visits, Impressions, Follows, Accounts reached, Saves, Likes, Comments, Shares
#post_data.sort_values(by='insert one of above metrics here', ascending = False)
#This sorts all available posts according to the top metric selected
#Replace 'insert one of above metrics here'
post_data.sort_values(by='Likes', ascending = False)

In [None]:
#available metrics for posts: Profile visits, Impressions, Follows, Accounts reached, Saves, Likes, Comments, Shares
#plot_average_interaction_day_time(post_data, 'Day of Creation', 'Hour of Creation', 'insert one of above metrics here', 'Images/yourfilename.png')
#This plots a dot plot of the average of a selected metric for each time of the day and day of the week
#Replace 'insert one of above metrics here' and 'yourfilename' with your ideal filename

plot_average_interaction_day_time(post_data, 'Day of Creation', 'Hour of Creation', 'Likes', 'Images/posts_averagelikes_weektime.png')
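The day and hour used in this section come from datetime.datetime.fromtimestamp, which, when called without a timezone, uses the local timezone of the machine running the notebook. If the notebook is run somewhere other than where the audience is, the "best time to post" can shift by several hours. The sketch below shows how a timezone could be passed explicitly; the helper name `day_and_hour` is hypothetical and is not used elsewhere in this notebook.

```python
from datetime import datetime, timezone

def day_and_hour(timestamp, tz=timezone.utc):
    """Return (weekday name, hour) for a Unix timestamp in the given timezone."""
    dt = datetime.fromtimestamp(timestamp, tz=tz)
    return dt.strftime('%A'), dt.hour

# 1718000000 is 2024-06-10 06:13:20 UTC
print(day_and_hour(1718000000))  # ('Monday', 6)
```

On Python 3.9 or later, a named zone such as zoneinfo.ZoneInfo("Europe/London") can be passed as tz to account for daylight saving time.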

**6. Post Comments**

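This section prints all the comments on posts that are not from the poster's account, with usernames and non-ASCII characters (such as emojis) removed. A word cloud is then generated. Replace 'lewisham_communityspace' in the third box with your own account username; no other action is required.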

In [None]:
#post comments
with open('post_comments_1.json', 'r', encoding='utf-8') as file_object:
    post_comments = json.load(file_object)

In [None]:
#remove usernames (e.g. @username); Instagram usernames can contain letters, digits, underscores and periods
def remove_usernames(comment):
    return re.sub(r'@[A-Za-z0-9_.]+', '', comment)

In [None]:
#filter out each comment that is not from lewisham_communityspace
filtered_comments = [comment['string_map_data']['Comment']['value'] for comment in post_comments 
                     if 'Media Owner' not in comment['string_map_data'] or 
                     comment['string_map_data']['Media Owner']['value'] != 'lewisham_communityspace']

#remove non-ASCII characters (emojis, accented letters, etc.)
cleaned_comments = [comment.encode('ascii', 'ignore').decode() for comment in filtered_comments]

#apply function to remove usernames
cleaned_comments_without_usernames = [remove_usernames(comment) for comment in cleaned_comments]

#print the cleaned comments
for comment in cleaned_comments_without_usernames:
    print(comment)
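The cleaning steps above (dropping non-ASCII characters, then removing @mentions) can be combined into a single helper, which also makes them easy to try on a sample comment. This is a minimal sketch: the name `clean_comment` and the sample text are hypothetical, and the pattern assumes mentions may contain letters, digits, underscores and periods.

```python
import re

USERNAME_PATTERN = re.compile(r'@[A-Za-z0-9_.]+')

def clean_comment(comment):
    """Drop non-ASCII characters and @mentions, then collapse leftover whitespace."""
    ascii_only = comment.encode('ascii', 'ignore').decode()
    without_mentions = USERNAME_PATTERN.sub('', ascii_only)
    return ' '.join(without_mentions.split())

print(clean_comment("Love this 🎉 @some.user_1 see you Saturday!"))  # Love this see you Saturday!
```

Note that dropping non-ASCII characters also removes accented letters, so comments in languages other than English may lose characters.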

In [None]:
#create a WordCloud object
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(" ".join(cleaned_comments_without_usernames))

#display the word cloud
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
#save the plot
plt.savefig('Images/post_word_clouds.png', dpi=300, bbox_inches='tight', transparent = True) 

**7. Reels**

This section lets you see which reels perform best on a chosen metric, and plot that metric against day of the week, time of day, or both, to understand when to post.

Available metrics for reels: Accounts reached, Instagram Plays, Instagram Likes, Instagram Comments, Instagram Shares, Instagram Saves

Available functions (in this simplified notebook only plot_average_interaction_day_time is defined, in Section 1; the other two plotting functions must be defined before use)
- reel_data.sort_values(by='insert one of above metrics here', ascending = False)
     - This sorts all available reels by the selected metric, highest first
     - Replace 'insert one of above metrics here'
- plot_average_interaction_day(reel_data, 'Day of Creation', 'insert one of above metrics here', 'Images/yourfilename.png')
     - This plots a bar graph of the selected metric for each day of the week
     - Replace 'insert one of above metrics here' and 'yourfilename' with your chosen filename
- plot_average_interaction_time(reel_data, 'Hour of Creation', 'insert one of above metrics here', 'Images/yourfilename.png')
     - This plots a line graph of the selected metric for each of the 24 hours in a day
     - Replace 'insert one of above metrics here' and 'yourfilename' with your chosen filename
- plot_average_interaction_day_time(reel_data, 'Day of Creation', 'Hour of Creation', 'insert one of above metrics here', 'Images/yourfilename.png')
     - This plots a dot plot of the average of a selected metric for each time of the day and day of the week
     - Replace 'insert one of above metrics here' and 'yourfilename' with your chosen filename

In [None]:
#load reels
with open('reels.json', 'r', encoding='utf-8') as file_object:
    reels = json.load(file_object)

In [None]:
#initialize lists to store data
titles = []
timestamps = []
days_of_week = []
hours_of_creation = []

#initialize a dictionary for fields to be converted to integers
reel_data = { 
    'Duration': [], 
    'Accounts reached': [], 
    'Instagram Plays': [],
    'Instagram Likes': [],
    'Instagram Comments': [],
    'Instagram Shares': [],
    'Instagram Saves': []
}

#extract relevant data for reels 
for reel in reels['organic_insights_reels']: 
    titles.append(reel['media_map_data'].get('Media Thumbnail', {}).get('title', 'N/A'))
    timestamp = reel['string_map_data']['Upload Timestamp']['timestamp'] 
    dt = datetime.datetime.fromtimestamp(timestamp)
    timestamps.append(dt.strftime('%Y-%m-%d %H:%M:%S'))
    days_of_week.append(dt.strftime('%A'))
    hours_of_creation.append(dt.hour)

    #convert specific fields to integers, handling errors
    for key in reel_data.keys():
        try:
            value = reel['string_map_data'].get(key, {}).get('value', 'N/A') 
            # Check for duration separately as it might be a float
            if key == 'Duration' and isinstance(value, (float, int)):
                reel_data[key].append(int(value))  # Convert to integer seconds if float
            elif isinstance(value, int):
                reel_data[key].append(value)  # No change if already int
            elif isinstance(value, str) and value.isdigit():
                reel_data[key].append(int(value))
            else:
                reel_data[key].append(0)  # Default for non-numeric or missing
        except KeyError:
            reel_data[key].append(0)  # Default for missing keys

#create a DataFrame
reel_data = pd.DataFrame({
    'Title': titles,
    'Upload Timestamp': timestamps, 
    'Day of Creation': days_of_week,
    'Hour of Creation': hours_of_creation,
    **reel_data  # Unpack the reel_data dictionary 
})

#print the DataFrame
reel_data.head()

In [None]:
##optional step to filter out reels for specific dates. Remove only one # from the start of each line if running this code
#reel_data['Upload Timestamp'] = pd.to_datetime(reel_data['Upload Timestamp'])

#start_date = pd.Timestamp('2023-03-31') #change your start date accordingly
#end_date = pd.Timestamp('2024-04-01') #change your end date
#reel_data = reel_data[(reel_data['Upload Timestamp'] >= start_date) & (reel_data['Upload Timestamp'] < end_date)]

In [None]:
#available metrics for reels: Accounts reached, Instagram Plays, Instagram Likes, Instagram Comments, Instagram Shares, Instagram Saves
#reel_data.sort_values(by='insert one of above metrics here', ascending = False)
#This sorts all available reels according to the top metric selected
#Replace 'insert one of above metrics here'
reel_data.sort_values(by='Instagram Likes', ascending = False)

In [None]:
#available metrics for reels: Accounts reached, Instagram Plays, Instagram Likes, Instagram Comments, Instagram Shares, Instagram Saves
#plot_average_interaction_day_time(reel_data, 'Day of Creation', 'Hour of Creation', 'insert one of above metrics here', 'Images/yourfilename.png')
#This plots a dot plot of the average of a selected metric for each time of the day and day of the week
#Replace 'insert one of above metrics here' and 'yourfilename' with your ideal filename
plot_average_interaction_day_time(reel_data, 'Day of Creation', 'Hour of Creation', 'Instagram Likes', 'Images/reels_averagelikes_weektime.png')