# Summary

Recommended column dtypes when loading:
- #Username string
- #Category category
- #Followers uint32
- #Followees uint32
- #Posts uint32
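
The dtype recommendation above could be applied directly in `pd.read_csv`. The following is a minimal sketch, assuming a tab-separated file with the same five columns; the inline sample (`sample_tsv`) and the names `dtypes` and `df_typed` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Two rows copied from the df.head() output, wrapped in an in-memory TSV
# so that the sketch runs without the real data file.
sample_tsv = (
    "Username\tCategory\t#Followers\t#Followees\t#Posts\n"
    "adila_makeup\tbeauty\t13424\t6690\t545\n"
    "hairbycourtneyd\tbeauty\t1200\t5372\t266\n"
)

# Column -> dtype mapping taken from the recommendation list.
dtypes = {
    "Username": "string",
    "Category": "category",
    "#Followers": "uint32",
    "#Followees": "uint32",
    "#Posts": "uint32",
}

df_typed = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t", dtype=dtypes)
print(df_typed.dtypes)
```

`uint32` holds values up to about 4.29 billion, which comfortably covers the largest follower count in this dataset (7,052,689).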

Weak positive correlation between #Posts and #Followers (about 0.06) and between #Posts and #Followees (about 0.14) \
Weak negative correlation between #Followers and #Followees (about -0.06) \
The category may influence the correlations above

A `categories` variable listing the valid categories should be added to the configuration (this notebook reads it as `cfg['categories']`)

Some profiles have typos in the 'Category' column. Only a few samples are affected, so they can either be dropped or be assigned the correct category manually
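
As an alternative to fixing the labels by hand, misspelled categories could be mapped to the closest valid one. This is only a sketch using the standard-library `difflib`; the helper `fix_category`, the 0.8 cutoff, and the misspelled inputs are hypothetical, while the list of valid categories is the one that appears in the per-category output of this notebook.

```python
import difflib

# Valid categories as they appear in the per-category results of this notebook.
valid_categories = [
    "beauty", "family", "fashion", "fitness", "food",
    "interior", "other", "pet", "travel",
]


def fix_category(label, cutoff=0.8):
    """Return the closest valid category, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(label, valid_categories, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical misspellings, not taken from the real data.
for raw in ["fashon", "beuty", "unknown_label"]:
    print(raw, "->", fix_category(raw))
```

Labels for which the helper returns `None` would still need a manual decision (drop or reassign).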

Large outliers \
A small share of Instagram accounts have #Posts, #Followers, or #Followees values that are substantially (up to roughly two orders of magnitude) larger than typical
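
One way to quantify the outlier remark is to express the distance between the maximum and the median of each column in orders of magnitude. The sketch below uses the median (50%) and max values from the `df.describe()` output further down in this notebook; the names `summary` and `orders_above_median` are illustrative.

```python
import math

# (median, max) pairs copied from the df.describe() output of this notebook.
summary = {
    "#Followers": (17319.5, 7052689.0),
    "#Followees": (1007.5, 7500.0),
    "#Posts": (979.0, 26088.0),
}

# log10(max / median) = number of orders of magnitude between the two values.
orders_above_median = {
    column: math.log10(maximum / median)
    for column, (median, maximum) in summary.items()
}

for column, orders in orders_above_median.items():
    print(f"{column}: max is about {orders:.1f} orders of magnitude above the median")
```

With these figures the gap is largest for #Followers and smallest for #Followees, so the outlier problem is not equally severe in every column.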

Influencers are unevenly distributed across categories; fashion dominates



# Imports and utils

In [1]:
import os
import pandas as pd
import plotly.express as px
import yaml

In [None]:
def find_and_set_config_path():
    """Walk up from the current directory until a 'config' folder is found, then chdir there."""
    current_path = os.getcwd()

    while not os.path.exists(os.path.join(current_path, 'config')):
        parent_path = os.path.dirname(current_path)

        # Stop once the filesystem root has been reached
        if current_path == parent_path:
            print("No 'config' folder found in any parent directory.")
            return

        current_path = parent_path

    os.chdir(current_path)
    print(f'Found "config" folder in: {current_path}')

find_and_set_config_path()

In [16]:
config_path = os.path.join('config', '1_per.yaml')

with open(config_path) as f:
    cfg = yaml.safe_load(f)
    
data_path = cfg['path_influencers']
categories = cfg['categories']
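
For reference, the config file is expected to provide at least the two keys read above. The sketch below parses a hypothetical config from an inline string: the key names come from this notebook and the category list matches the per-category results, but the path value is a made-up placeholder, and the names `example_config`, `cfg_example`, and `missing` are illustrative.

```python
import yaml

# Hypothetical config contents. Only the key names are taken from the notebook;
# the path below is a placeholder, not the real location of the data.
example_config = """
path_influencers: data/influencers.txt
categories:
  - beauty
  - family
  - fashion
  - fitness
  - food
  - interior
  - other
  - pet
  - travel
"""

cfg_example = yaml.safe_load(example_config)

# Fail early with a clear message if a required key is absent.
required_keys = ("path_influencers", "categories")
missing = [key for key in required_keys if key not in cfg_example]
if missing:
    raise KeyError(f"Missing config keys: {missing}")

print(cfg_example["path_influencers"], len(cfg_example["categories"]))
```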

# Read data

In [24]:
# skiprows=[1] drops the first line after the header row
df = pd.read_csv(data_path, delimiter='\t', header=0, skiprows=[1])

# Basic info

In [25]:
df.head()

| | Username | Category | #Followers | #Followees | #Posts |
|---|---|---|---|---|---|
| 0 | adila_makeup | beauty | 13424 | 6690 | 545 |
| 1 | hairbycourtneyd | beauty | 1200 | 5372 | 266 |
| 2 | mandiglitter | beauty | 7557 | 6195 | 1649 |
| 3 | ldelabios | beauty | 1781 | 1205 | 273 |
| 4 | legendvrry | beauty | 22018 | 2218 | 2229 |


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 338 entries, 0 to 337
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Username    338 non-null    object
 1   Category    338 non-null    object
 2   #Followers  338 non-null    int64 
 3   #Followees  338 non-null    int64 
 4   #Posts      338 non-null    int64 
dtypes: int64(3), object(2)
memory usage: 13.3+ KB


In [27]:
df.describe()

| | #Followers | #Followees | #Posts |
|---|---|---|---|
| count | 338.0 | 338.0 | 338.0 |
| mean | 114572.0 | 1676.789941 | 1433.949704 |
| std | 572294.7 | 1710.597251 | 1923.308926 |
| min | 1011.0 | 0.0 | 242.0 |
| 25% | 5126.0 | 584.0 | 533.5 |
| 50% | 17319.5 | 1007.5 | 979.0 |
| 75% | 52927.5 | 2053.5 | 1680.75 |
| max | 7052689.0 | 7500.0 | 26088.0 |


In [28]:
fig1 = px.histogram(df, x='#Posts', title='Distribution of #Posts')
fig2 = px.histogram(df, x='#Followers', title='Distribution of #Followers')
fig3 = px.histogram(df, x='#Followees', title='Distribution of #Followees')

# Display the charts
fig1.show()
fig2.show()
fig3.show()

In [29]:
# numeric_only=True skips the string columns (required by pandas >= 2.0)
df.corr(numeric_only=True)





| | #Followers | #Followees | #Posts |
|---|---|---|---|
| #Followers | 1.0 | -0.057062 | 0.060104 |
| #Followees | -0.057062 | 1.0 | 0.135277 |
| #Posts | 0.060104 | 0.135277 | 1.0 |
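
Because Pearson correlation is sensitive to extreme values, and this dataset contains accounts far above the median, a rank-based (Spearman) correlation may be a useful cross-check. The sketch below computes it as the Pearson correlation of the ranks; the `toy` frame contains made-up numbers chosen only to show the effect of a single large value, and it is not a subset of the real data.

```python
import pandas as pd

# Made-up values: both columns grow together, but one follower count is extreme.
toy = pd.DataFrame({
    "#Followers": [1_000, 5_000, 17_000, 53_000, 7_000_000],
    "#Posts": [250, 500, 1_000, 1_700, 2_000],
})

# Pearson works on the raw values and is pulled around by the outlier.
pearson = toy.corr(method="pearson")

# Spearman is the Pearson correlation of the ranks, so only the ordering matters.
spearman = toy.rank().corr(method="pearson")

print(pearson.loc["#Followers", "#Posts"])
print(spearman.loc["#Followers", "#Posts"])
```

On the real data the same two lines could be applied to `df[['#Posts', '#Followers', '#Followees']]`.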


In [30]:
counts = df['Category'].value_counts().reset_index()
counts.columns = ['Category', 'count']

# Build a bar chart in Plotly
fig = px.bar(counts, x='Category', y='count', title="Number of occurrences per category")
fig.show()

In [31]:
# manual inspection showed that these are real profiles with a misspelled category name
df[~df['Category'].isin(categories)]


| | Username | Category | #Followers | #Followees | #Posts |
|---|---|---|---|---|---|


In [32]:
#df = df[df['Category'].isin(categories)]

In [33]:
# Top 5 accounts with the most #Posts
top_posts = df.nlargest(5, '#Posts')
print("Top 5 accounts with the most #Posts:")
print(top_posts)

Top 5 accounts with the most #Posts:
           Username Category  #Followers  #Followees  #Posts
325   marieclairebr    other      528289        1990   26088
147     glencyfeliz  fashion      499793        1028    9914
291  unibadan_efiwe    other       50595        7049    9837
323   bryanrindfuss    other        1685        2902    8106
313      sallynatun    other       64980         276    7987


In [34]:
# Top 5 accounts with the most #Followers
top_followers = df.nlargest(5, '#Followers')
print("\nTop 5 accounts with the most #Followers:")
print(top_followers)


Top 5 accounts with the most #Followers:
         Username Category  #Followers  #Followees  #Posts
115   malutrevejo  fashion     7052689         213    1833
153          twan  fashion     6290311         246    1237
131   lira_galore  fashion     4007201        3559    1129
126  shawnjohnson  fashion     1561053        2733    2632
248       kpunkka   travel     1206881         321    1362


In [35]:
# Top 5 accounts with the most #Followees
top_followees = df.nlargest(5, '#Followees')
print("\nTop 5 accounts with the most #Followees:")
print(top_followees)


Top 5 accounts with the most #Followees:
             Username Category  #Followers  #Followees  #Posts
296        bottshoppe    other        1291        7500    1079
124  therealupdate100  fashion       27207        7398    2808
287      ibadanmarket    other        8127        7360    3220
75              jpsdn  fashion      111283        7321     550
63         bricolling  fashion       23420        7314     710


In [36]:
fig = px.box(df, x='Category', y='#Posts', title='Distribution of post counts for influencers by category')
fig.show()

In [37]:
fig = px.box(df, x='Category', y='#Followers', title='Distribution of follower counts for influencers by category')
fig.show()

In [38]:
fig = px.box(df, x='Category', y='#Followees', title='Distribution of followee counts for influencers by category')
fig.show()

In [39]:
correlations_posts_followers = {}
correlations_posts_followees = {}
correlations_followers_followees = {}

for category in categories:
    subset = df[df['Category'] == category]
    corr_matrix = subset[['#Posts', '#Followers', '#Followees']].corr()
    correlations_posts_followers[category] = corr_matrix.loc['#Posts', '#Followers']
    correlations_posts_followees[category] = corr_matrix.loc['#Posts', '#Followees']
    correlations_followers_followees[category] = corr_matrix.loc['#Followers', '#Followees']

correlations_posts_followers = {k: v for k, v in sorted(correlations_posts_followers.items(), 
                                                        key=lambda item: item[1],
                                                        reverse=True)}
correlations_posts_followees = {k: v for k, v in sorted(correlations_posts_followees.items(), 
                                                        key=lambda item: item[1],
                                                        reverse=True)}
correlations_followers_followees = {k: v for k, v in sorted(correlations_followers_followees.items(), 
                                                            key=lambda item: item[1],
                                                            reverse=True)}
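
The same per-category correlations could probably be obtained more compactly with `groupby(...).corr()`. This is a sketch on a made-up `toy` frame (two hypothetical categories `a` and `b`), not the notebook's data; the names `per_category` and `posts_vs_followers` are illustrative.

```python
import pandas as pd

# Made-up data: in category "a" posts and followers rise together,
# in category "b" they move in opposite directions.
toy = pd.DataFrame({
    "Category": ["a", "a", "a", "b", "b", "b"],
    "#Posts": [1, 2, 3, 1, 2, 3],
    "#Followers": [10, 20, 30, 30, 20, 10],
    "#Followees": [5, 3, 4, 7, 8, 9],
})

# One correlation matrix per category, stacked under a (Category, column) index.
per_category = toy.groupby("Category")[["#Posts", "#Followers", "#Followees"]].corr()

# Select the "#Posts" row of every matrix, then the "#Followers" column,
# and sort from the strongest positive correlation downwards.
posts_vs_followers = (
    per_category.xs("#Posts", level=1)["#Followers"].sort_values(ascending=False)
)
print(posts_vs_followers)
```

Replacing `"#Posts"` and `"#Followers"` in the last expression selects the other two pairs in the same way.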

In [40]:
print("Correlation between #Posts and #Followers")
for category, correlation in correlations_posts_followers.items():
    print(f"Category {category}: {correlation:.2f}")

Correlation between #Posts and #Followers
Category food: 0.55
Category interior: 0.44
Category other: 0.38
Category pet: 0.27
Category family: 0.17
Category travel: 0.16
Category beauty: 0.07
Category fashion: 0.06
Category fitness: -0.13


In [41]:
print("Correlation between #Posts and #Followees")
for category, correlation in correlations_posts_followees.items():
    print(f"Category {category}: {correlation:.2f}")

Correlation between #Posts and #Followees
Category interior: 0.76
Category food: 0.59
Category pet: 0.42
Category other: 0.20
Category family: 0.15
Category travel: 0.04
Category fashion: 0.01
Category beauty: -0.05
Category fitness: -0.25


In [42]:
print("Correlation between #Followers and #Followees")
for category, correlation in correlations_followers_followees.items():
    print(f"Category {category}: {correlation:.2f}")

Correlation between #Followers and #Followees
Category interior: 0.46
Category fitness: 0.31
Category food: 0.16
Category beauty: 0.16
Category fashion: -0.07
Category other: -0.13
Category travel: -0.14
Category family: -0.18
Category pet: -0.48
