## Data Collection & Organization

The data was downloaded manually from [this GitHub repo](https://github.com/fcakyon/instafake-dataset) and organized in the data folder, so this project does not need to build or run a data pipeline. Data and files will be added to this [Fake Instagram Account Detection repo](https://github.com/midori256/Fake-Instagram-Account-Detection).

## Data Definition

Goal: Gain an understanding of our data features to inform the next steps of our project.

In [2]:
# import libraries
import pandas as pd
import numpy as np

In [3]:
# load dataframes from json files
fake_df = pd.read_json("data/fakeAccountData.json")
real_df = pd.read_json("data/realAccountData.json")

In [4]:
real_df.head()

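| | userFollowerCount | userFollowingCount | userBiographyLength | userMediaCount | userHasProfilPic | userIsPrivate | usernameDigitCount | usernameLength | isFake |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 258 | 238 | 0 | 0 | 1 | 0 | 0 | 10 | 0 |
| 1 | 263 | 482 | 30 | 29 | 1 | 1 | 0 | 8 | 0 |
| 2 | 51 | 78 | 9 | 0 | 1 | 1 | 0 | 10 | 0 |
| 3 | 297 | 480 | 22 | 25 | 1 | 1 | 2 | 9 | 0 |
| 4 | 113 | 242 | 0 | 95 | 1 | 1 | 0 | 10 | 0 |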


In [5]:
fake_df.head()

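| | userFollowerCount | userFollowingCount | userBiographyLength | userMediaCount | userHasProfilPic | userIsPrivate | usernameDigitCount | usernameLength | isFake |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1937 | 0 | 0 | 1 | 1 | 0 | 10 | 1 |
| 1 | 324 | 4122 | 0 | 0 | 1 | 0 | 4 | 15 | 1 |
| 2 | 15 | 399 | 0 | 0 | 0 | 0 | 3 | 12 | 1 |
| 3 | 14 | 107 | 0 | 1 | 1 | 0 | 1 | 10 | 1 |
| 4 | 264 | 4651 | 0 | 0 | 1 | 0 | 0 | 14 | 1 |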


In [6]:
# check columns
# both dataframes have the same columns
real_df.columns

Index(['userFollowerCount', 'userFollowingCount', 'userBiographyLength',
       'userMediaCount', 'userHasProfilPic', 'userIsPrivate',
       'usernameDigitCount', 'usernameLength', 'isFake'],
      dtype='object')
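The claim that both dataframes share the same columns can also be checked in code instead of by eye. The following is a minimal sketch, assuming two small stand-in frames (`real_sample` and `fake_sample`, hypothetical names) built from the first rows shown above; in the notebook the same two comparisons would be run on `real_df` and `fake_df`.

```python
import pandas as pd

# Column names copied from the real_df.columns output above.
columns = [
    "userFollowerCount", "userFollowingCount", "userBiographyLength",
    "userMediaCount", "userHasProfilPic", "userIsPrivate",
    "usernameDigitCount", "usernameLength", "isFake",
]

# Hypothetical stand-ins for real_df and fake_df:
# the first two rows of each head() output.
real_sample = pd.DataFrame(
    [[258, 238, 0, 0, 1, 0, 0, 10, 0], [263, 482, 30, 29, 1, 1, 0, 8, 0]],
    columns=columns,
)
fake_sample = pd.DataFrame(
    [[25, 1937, 0, 0, 1, 1, 0, 10, 1], [324, 4122, 0, 0, 1, 0, 4, 15, 1]],
    columns=columns,
)

# Index.equals is True only when the names and their order both match.
same_columns = real_sample.columns.equals(fake_sample.columns)
# Series.equals compares the per-column dtypes along with their labels.
same_dtypes = real_sample.dtypes.equals(fake_sample.dtypes)

print(same_columns, same_dtypes)
```

Checking dtypes as well as names is a cheap safeguard before the two frames are concatenated later on.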

### Column Descriptions

There are 9 columns, defined as follows:

1. userMediaCount - Total number of posts the account has.
2. userFollowerCount - Total number of followers the account has.
3. userFollowingCount - Total number of accounts the account follows.
4. userHasProfilPic - 1 if the account has a profile picture, 0 otherwise.
5. userIsPrivate - 1 if the account is a private profile, 0 otherwise.
6. userBiographyLength - Number of characters in the account biography.
7. usernameLength - Number of characters in the account username.
8. usernameDigitCount - Number of digits in the account username.
9. isFake - 1 if the account is a spam/fake account, 0 otherwise.
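Since three of these columns are meant to be yes/no flags, it can be worth confirming that they only ever contain 0 or 1. Below is a rough sketch, assuming a small hypothetical `sample` frame built from the flag values in the first five rows of `fake_df.head()`; on the real data the same check would be applied to `real_df` and `fake_df`.

```python
import pandas as pd

# Hypothetical sample: the three flag columns from the first five
# rows of fake_df.head().
sample = pd.DataFrame({
    "userHasProfilPic": [1, 1, 0, 1, 1],
    "userIsPrivate": [1, 0, 0, 0, 0],
    "isFake": [1, 1, 1, 1, 1],
})

binary_columns = ["userHasProfilPic", "userIsPrivate", "isFake"]

# isin marks each cell that is 0 or 1; all() then reduces per column,
# giving one boolean per flag column.
is_binary = sample[binary_columns].isin([0, 1]).all()

print(is_binary)
```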


In [9]:
# Check types of columns in real_df
real_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 994 entries, 0 to 993
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   userFollowerCount    994 non-null    int64
 1   userFollowingCount   994 non-null    int64
 2   userBiographyLength  994 non-null    int64
 3   userMediaCount       994 non-null    int64
 4   userHasProfilPic     994 non-null    int64
 5   userIsPrivate        994 non-null    int64
 6   usernameDigitCount   994 non-null    int64
 7   usernameLength       994 non-null    int64
 8   isFake               994 non-null    int64
dtypes: int64(9)
memory usage: 70.0 KB


In [10]:
# Check types of columns in fake_df
fake_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   userFollowerCount    200 non-null    int64
 1   userFollowingCount   200 non-null    int64
 2   userBiographyLength  200 non-null    int64
 3   userMediaCount       200 non-null    int64
 4   userHasProfilPic     200 non-null    int64
 5   userIsPrivate        200 non-null    int64
 6   usernameDigitCount   200 non-null    int64
 7   usernameLength       200 non-null    int64
 8   isFake               200 non-null    int64
dtypes: int64(9)
memory usage: 14.2 KB


There are 994 records for real accounts and only 200 for fake accounts, so the dataset is imbalanced: fake accounts make up about 17% of the 1,194 records.
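The class balance can be computed directly once the two frames are stacked. This is a minimal sketch, assuming hypothetical stand-in frames (`real_labels`, `fake_labels`) that carry only the `isFake` column and are sized to match the `info()` output above; with the real data, `real_df` and `fake_df` would be passed to `pd.concat` instead.

```python
import pandas as pd

# Hypothetical stand-ins holding only the label column, sized to match
# the row counts reported by info(): 994 real and 200 fake accounts.
real_labels = pd.DataFrame({"isFake": [0] * 994})
fake_labels = pd.DataFrame({"isFake": [1] * 200})

# Stack the two frames and rebuild a clean 0..n-1 index.
combined = pd.concat([real_labels, fake_labels], ignore_index=True)

# Absolute counts and relative shares per class.
class_counts = combined["isFake"].value_counts()
class_share = combined["isFake"].value_counts(normalize=True)

print(class_counts)
print(class_share.round(3))
```

Knowing the share of each class early helps when choosing evaluation metrics later, since plain accuracy can look good on imbalanced data even for a model that rarely predicts the minority class.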

In [11]:
# check statistics for each dataframe
real_df.describe()

| | userFollowerCount | userFollowingCount | userBiographyLength | userMediaCount | userHasProfilPic | userIsPrivate | usernameDigitCount | usernameLength | isFake |
|---|---|---|---|---|---|---|---|---|---|
| count | 994.0 | 994.0 | 994.0 | 994.0 | 994.0 | 994.0 | 994.0 | 994.0 | 994.0 |
| mean | 419.891348 | 516.138833 | 25.034205 | 68.473843 | 0.986922 | 0.724346 | 0.2666 | 11.070423 | 0.0 |
| std | 366.998029 | 517.709885 | 34.128111 | 113.963572 | 0.113668 | 0.447068 | 0.851721 | 2.877679 | 0.0 |
| min | 1.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 |
| 25% | 218.0 | 267.0 | 0.0 | 8.0 | 1.0 | 0.0 | 0.0 | 9.0 | 0.0 |
| 50% | 345.0 | 419.5 | 12.0 | 30.0 | 1.0 | 1.0 | 0.0 | 11.0 | 0.0 |
| 75% | 515.75 | 614.0 | 36.0 | 78.75 | 1.0 | 1.0 | 0.0 | 13.0 | 0.0 |
| max | 4492.0 | 6640.0 | 150.0 | 1058.0 | 1.0 | 1.0 | 7.0 | 22.0 | 0.0 |


In [12]:
fake_df.describe()

| | userFollowerCount | userFollowingCount | userBiographyLength | userMediaCount | userHasProfilPic | userIsPrivate | usernameDigitCount | usernameLength | isFake |
|---|---|---|---|---|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 116.64 | 1878.03 | 11.98 | 3.535 | 0.605 | 0.325 | 1.635 | 11.39 | 1.0 |
| std | 289.906744 | 1871.377801 | 27.757558 | 28.585036 | 0.490077 | 0.46955 | 1.902597 | 3.532747 | 0.0 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 1.0 |
| 25% | 10.75 | 278.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 1.0 |
| 50% | 29.5 | 1446.5 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 11.0 | 1.0 |
| 75% | 102.5 | 2505.5 | 4.5 | 1.0 | 1.0 | 1.0 | 3.0 | 13.0 | 1.0 |
| max | 3208.0 | 7497.0 | 138.0 | 396.0 | 1.0 | 1.0 | 10.0 | 30.0 | 1.0 |
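Reading two separate `describe()` outputs makes it hard to compare the classes feature by feature. One possible alternative, sketched below, stacks the frames and groups by `isFake` so the per-class means sit side by side. The `real_sample` and `fake_sample` frames are hypothetical stand-ins built from the five `head()` rows of each dataframe, so the numbers printed here will differ from the full-data statistics above.

```python
import pandas as pd

columns = [
    "userFollowerCount", "userFollowingCount", "userBiographyLength",
    "userMediaCount", "userHasProfilPic", "userIsPrivate",
    "usernameDigitCount", "usernameLength", "isFake",
]

# Hypothetical stand-ins: the five rows shown by real_df.head()
# and fake_df.head().
real_sample = pd.DataFrame([
    [258, 238, 0, 0, 1, 0, 0, 10, 0],
    [263, 482, 30, 29, 1, 1, 0, 8, 0],
    [51, 78, 9, 0, 1, 1, 0, 10, 0],
    [297, 480, 22, 25, 1, 1, 2, 9, 0],
    [113, 242, 0, 95, 1, 1, 0, 10, 0],
], columns=columns)
fake_sample = pd.DataFrame([
    [25, 1937, 0, 0, 1, 1, 0, 10, 1],
    [324, 4122, 0, 0, 1, 0, 4, 15, 1],
    [15, 399, 0, 0, 0, 0, 3, 12, 1],
    [14, 107, 0, 1, 1, 0, 1, 10, 1],
    [264, 4651, 0, 0, 1, 0, 0, 14, 1],
], columns=columns)

combined = pd.concat([real_sample, fake_sample], ignore_index=True)

# One row per class (0 = real, 1 = fake), one column per feature.
class_means = combined.groupby("isFake").mean()

# Transpose so features are rows and the two classes are columns.
print(class_means.T)
```

Replacing `.mean()` with `.median()` gives a view that is less sensitive to the few very large follower and following counts visible in the `max` rows.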


## Data Cleaning