## About Dataset

This dataset contains profile data collected from VK.com (VKontakte), Russia's largest social network, for distinguishing between genuine users and automated bots. The data includes both numerical and categorical features extracted from user profiles.

##### Data Collection:

- Collected from public VK.com profiles.
- Includes both verified human users and verified bot accounts.
- Represents realistic social network conditions with incomplete profiles.

##### Feature Types:

- Numerical features (NaN values preserved):
  - Activity metrics (average posts per week, hashtag usage, etc.)
  - Friend and follower counts
- Categorical features (missing values marked as 'Unknown'):
  - Profile attributes (has_photo, has_mobile, etc.)
  - Privacy settings (is_closed_profile, etc.)
  - Binary flags (can_post_on_wall, can_send_message, etc.)
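The two feature groups can be separated by dtype after loading. The snippet below is a minimal sketch on a hypothetical toy frame: the column names appear in this dataset, but the values (including the city name) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the layout described above;
# the values are invented, only the column names come from the dataset.
toy = pd.DataFrame({
    "avg_views": [120.0, np.nan, 15.5],
    "ads_ratio": [0.1, 0.0, np.nan],
    "has_photo": [1.0, 0.0, 1.0],
    "city": ["Unknown", "Moscow", "Unknown"],
})

# Numeric columns keep NaN; string columns carry the missing-value marker.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

On the real file, a flag column that contains the string marker will land in the non-numeric group even though its other values look like 0/1.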

##### Data Processing:

- Missing values are handled differently by feature type:
  - Categorical: filled with the string 'Unknown'
  - Numerical: preserved as NaN
- Boolean values are converted to binary (0/1) where applicable
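The processing rules above could look roughly like this in pandas. This is a sketch under assumptions, not the dataset author's actual pipeline: the helper names `encode_flag` and `preprocess` are hypothetical, and the raw input is assumed to hold Python booleans with `None` for gaps.

```python
import numpy as np
import pandas as pd

def encode_flag(value):
    """Return 1/0 for True/False and 'Unknown' for a missing value."""
    if pd.isna(value):
        return "Unknown"
    return int(value)

def preprocess(frame, flag_cols):
    """Encode the listed flag columns; numeric columns are left as-is."""
    out = frame.copy()
    for col in flag_cols:
        encoded = [encode_flag(v) for v in out[col]]
        out[col] = pd.Series(encoded, index=out.index, dtype=object)
    return out

raw = pd.DataFrame({
    "has_photo": [True, False, None],   # boolean flag with a gap
    "avg_views": [10.0, np.nan, 3.0],   # numeric metric with a gap
})
clean = preprocess(raw, flag_cols=["has_photo"])
print(clean)
```

Leaving NaN in the numeric columns keeps the choice of imputation (or a NaN-aware model) open for whoever uses the data.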

##### Potential Use Cases:

- Binary classification (user vs bot detection)
- Social media behavior analysis
- Anomaly detection in social networks
- Feature engineering exercises

##### Dataset Size:

5874 rows × 60 columns (balanced 50/50 between users and bots)
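A 50/50 split of 5874 rows corresponds to 2937 rows per class. The balance claim can be checked with a normalized `value_counts`; the sketch below uses a tiny stand-in series, and both the column name `target` and the 1 = bot coding are assumptions, since the label column is not shown here.

```python
import pandas as pd

# Hypothetical label column: the name "target" and the 1 = bot coding
# are assumptions made for this illustration only.
labels = pd.Series([1, 0, 1, 0, 1, 0], name="target")

# Share of each class; a balanced dataset gives 0.5 for both.
class_share = labels.value_counts(normalize=True).sort_index()
is_balanced = bool((class_share - 0.5).abs().max() < 0.01)

print(class_share)
print(is_balanced)
```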


In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("./archive/bots_vs_users.csv")
df.shape

(5874, 60)

In [4]:
df.head()

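| Unnamed: 0 | has_domain | has_birth_date | has_photo | can_post_on_wall | can_send_message | has_website | gender | has_short_name | has_first_name | has_last_name | ... | ads_ratio | avg_views | posting_frequency_days | phone_numbers_ratio | avg_text_uniqueness | city | has_occupation | occupation_type_university | occupation_type_work | has_personal_data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | Unknown | Unknown | Unknown | Unknown | Unknown |
| 1 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | Unknown | Unknown | Unknown | Unknown | Unknown |
| 2 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | Unknown | Unknown | Unknown | Unknown | Unknown |
| 3 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | Unknown | Unknown | Unknown | Unknown | Unknown |
| 4 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | Unknown | Unknown | Unknown | Unknown | Unknown |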
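The `Unnamed: 0` column is the name pandas gives to a first column whose header field is empty, which usually means a row index was written to the file. Assuming that is the case here (the values 0 to 4 suggest it, but the output does not confirm it), passing `index_col=0` avoids carrying it as a feature. The sketch below uses a small in-memory stand-in for the CSV rather than the real file.

```python
import io
import pandas as pd

# Tiny stand-in for the CSV file: the first header field is empty,
# which is what DataFrame.to_csv writes for an unnamed index.
csv_text = ",has_domain,has_photo\n0,1.0,0.0\n1,1.0,0.0\n"

# Default read: the empty header becomes a column called "Unnamed: 0".
plain = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the row index instead of a feature.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(indexed.columns.tolist())
```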


In [9]:
df["access_to_closed_profile"].value_counts()

access_to_closed_profile
1.0        5091
0.0         759
Unknown      24
Name: count, dtype: int64
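Because the 0/1 values share a column with the string 'Unknown', the column is not numeric as loaded and most models will not accept it directly. One possible treatment, shown as a sketch on a toy series with the same three kinds of values, is to record the missingness in a separate indicator and coerce the rest to float. The variable names `is_unknown` and `as_number` are illustrative.

```python
import pandas as pd

# Toy column with the same three kinds of values as the output above.
col = pd.Series(["1.0", "0.0", "Unknown", "1.0"],
                name="access_to_closed_profile")

# Keep the missingness as its own 0/1 indicator.
is_unknown = col.eq("Unknown").astype(int)

# Coerce the remaining values to float; 'Unknown' becomes NaN.
as_number = pd.to_numeric(col, errors="coerce")

print(is_unknown.tolist())
print(as_number.tolist())
```

Whether to keep the indicator, impute the NaN, or one-hot encode the three states depends on the model; the 24 'Unknown' rows here are few enough that any of these is workable.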