## Detecting Fake Instagram Accounts

Imagine you are a data analyst at Meta, working on the Instagram Safety Team. Your manager assigns you a critical task: detecting fake Instagram profiles. The project responds to a noticeable surge in spam on Instagram, which is degrading the experience for the platform's users.

Contemplate the strategies and methodologies you would employ to tackle this challenge effectively. 

In 60 minutes you have a meeting with Mr. X, the head of data analytics at Instagram. The purpose of this notebook is to prepare for that discussion of the project's direction and expected outcomes.

As you prepare, consider how you would identify and mitigate fake accounts on Instagram, and so improve the user experience by curbing the spread of spam.

Ensure you are prepared to articulate any uncertainties you may have and seek the necessary information.

**Primary Objective:** Identify counterfeit Instagram profiles.

**Key Considerations:**

**Hypotheses for Identifying Potential Fake Accounts:**

- Profiles with a minimal follower count are often fraudulent.
- Usernames featuring a higher ratio of numerical digits to alphabetic characters are likely indicative of inauthentic accounts.
- An absence of biographical information may suggest a profile is not genuine.
- Biographies containing offensive language could be a marker of fake accounts.
- The lack of a profile picture might signal an account's lack of authenticity.
- Accounts that follow a large number of users, yet have few followers themselves, can be suspected of being fake.
- A disproportionately high number of posts relative to the account's follower count could indicate inauthenticity.
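
Several of these hypotheses translate directly into numeric features. The following is a minimal sketch rather than the notebook's own code: the function names and the sample usernames are hypothetical, and the dataset loaded below may not contain a username column at all. It measures the share of digits among all characters in a username, a close stand-in for the digit-to-letter ratio.

```python
def digit_ratio(username: str) -> float:
    """Share of characters in a username that are digits (0.0 for an empty string)."""
    if not username:
        return 0.0
    return sum(ch.isdigit() for ch in username) / len(username)


def follower_following_ratio(followers: int, following: int) -> float:
    """Followers per followed account; the +1 avoids division by zero."""
    return followers / (following + 1)


# Hypothetical examples, not taken from the dataset
print(digit_ratio("jane_doe"))             # username without digits
print(digit_ratio("user84729103"))         # username that is mostly digits
print(follower_following_ratio(12, 2999))  # few followers, many followed accounts
```

A low follower-to-following ratio combined with a high digit share would flag an account for closer inspection, not prove that it is fake.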

**The Significance of Data Analytics:**

Data analytics is central to this work: by analyzing the data, we can test these hypotheses and refine our strategy for protecting the user experience on the platform.

**Actionable Step:**

Continue with the Jupyter notebook cells below for hands-on data analysis.

The goal of the analysis is to find insights into the characteristics of fake Instagram accounts, such as common traits or patterns that differentiate them from real accounts. These insights will serve as the foundation for further analysis and modeling aimed at developing a predictive model that detects fake Instagram accounts based on their characteristics.

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Defining the file path of the CSV file containing the data
csvFile = "./instagram_project/insta_data.csv"

# Loading the data from the CSV file into a pandas DataFrame
df = pd.read_csv(csvFile)

# Displaying the contents of the DataFrame
df

In [None]:
# Getting the shape of the DataFrame
df.shape

In [None]:
# Getting the first 5 rows of the DataFrame
df.head()

In [None]:
# Getting information about the DataFrame
df.info()

In [None]:
# Checking for missing values in the 'following' column of the DataFrame
df["following"].isnull()

In [None]:
# Creating a histogram of the 'following' column of the DataFrame using seaborn
sns.histplot(df["following"])

In [None]:
# Calculating the median of the 'following' column of the DataFrame
following_median = df["following"].median()

# Filling missing values in the 'following' column with the median value
df["following"] = df["following"].fillna(following_median)

# Casting the 'following' column as integers
df["following"] = df["following"].astype(int)

# Displaying the median value of the 'following' column
following_median
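
The median is presumably preferred over the mean here because follow counts tend to be right-skewed, so a few very large values pull the mean upward. A small sketch on made-up numbers (not the notebook's data) illustrates the difference and repeats the fill-and-cast steps from the cell above:

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed 'following' counts with two missing values
following = pd.Series([10, 25, 40, 55, np.nan, 7500, np.nan])

mean_value = following.mean()      # pulled up by the single large value
median_value = following.median()  # unaffected by the size of the outlier

# Fill the gaps with the median, then cast to integers
filled = following.fillna(median_value).astype(int)

print("mean:", mean_value)
print("median:", median_value)
print(filled.tolist())
```

Both `mean()` and `median()` skip missing values by default, so the statistics are computed from the five observed counts only.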

In [None]:
# Getting information about the DataFrame
df.info()

In [None]:
# Getting the unique values in the 'Private' column of the DataFrame
df["Private"].unique()

In [None]:
# Getting the unique values in the 'Fake' column of the DataFrame
df['Fake'].unique()

**Hypothesis:** Profiles with a minimal follower count are often fraudulent.

In [None]:
sns.scatterplot(data = df, x = 'followers', y = 'posts')

**Hypothesis:** Profiles with a minimal follower count and few posts are often fraudulent.

In [None]:
sns.scatterplot(data=df, x="followers", y="posts", hue="Fake")

**Hypothesis:** Accounts with few followers that follow a large number of users are fake accounts.

In [None]:
sns.scatterplot(data=df, x="followers", y="following", hue="Fake")

**Hypothesis:** A profile photo is more common on real Instagram accounts than on fake ones.

In [None]:
sns.countplot(data=df, x="profile_picture", hue="Fake")

**Hypothesis:** Private accounts are more likely to be real Instagram accounts than fake ones.

In [None]:
sns.countplot(data=df, x="Private", hue="Fake")
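
Count plots show absolute counts, which can mislead when one class is much larger than the other. One way to check a hypothesis like this numerically is a row-normalized cross-tabulation. The sketch below uses a tiny invented DataFrame; the column names follow the notebook, while the rows are made up and the label strings are assumed from the replacement mapping applied later in the notebook.

```python
import pandas as pd

# Invented sample; rows are made up and label spellings are assumptions
sample = pd.DataFrame({
    "Private": ["private", "private", "public", "public", "public", "private"],
    "Fake":    ["real",    "real",    "fake",   "fake",   "real",   "fake"],
})

# Share of fake vs. real accounts within each privacy setting (each row sums to 1)
shares = pd.crosstab(sample["Private"], sample["Fake"], normalize="index")
print(shares)
```

On real data, a clearly higher share of real accounts in the private row than in the public row would support the hypothesis.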

### Data for ML Model

In [None]:
# Replacing values in the 'profile_picture' column of the DataFrame
df["profile_picture"] = df["profile_picture"].replace({"yes": 1, "no": 0})

# Replacing values in the 'Private' column of the DataFrame
df["Private"] = df["Private"].replace({"private": 1, "public": 0})

# Replacing values in the 'Fake' column of the DataFrame
df["Fake"] = df["Fake"].replace({"fake": 1, "real": 0})

# Displaying the first 10 rows of the DataFrame
df.head(10)

In [None]:
df.info()

**Goal: determine whether an account is authentic or fraudulent based on its feature columns**

In [None]:
# Dropping the 'Fake' column from the DataFrame to create the feature matrix X
X = df.drop("Fake", axis=1)

# Selecting the 'Fake' column from the DataFrame to create the target vector y
y = df["Fake"]

In [None]:
# Importing the train_test_split function from the sklearn.model_selection module
from sklearn.model_selection import train_test_split

# Splitting the feature matrix X and target vector y into training and testing sets using the train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
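
If fake and real accounts are unevenly represented, a purely random split can leave the test set with a different class ratio than the training set. `train_test_split` accepts a `stratify` argument that preserves the ratio. The sketch below uses invented arrays (`X_demo`, `y_demo`), not the notebook's data, and is offered as an option rather than a required change.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 20 samples, 5 labelled fake (1) and 15 labelled real (0)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([1] * 5 + [0] * 15)

# stratify=y_demo keeps the 25% share of fakes in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("fakes in train:", y_tr.sum(), "of", len(y_tr))
print("fakes in test:", y_te.sum(), "of", len(y_te))
```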

In [None]:
X_train.head()

In [None]:
# Importing the LogisticRegression class from the sklearn.linear_model module
from sklearn.linear_model import LogisticRegression

# Creating an instance of the LogisticRegression class
clf = LogisticRegression()

# Fitting the LogisticRegression model to the training data using the fit method
clf.fit(X_train, y_train)

# Making predictions on the testing data using the predict method
y_pred = clf.predict(X_test)
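
Logistic regression with the default solver can converge slowly when features sit on very different scales, for example follower counts next to 0/1 flags. A common remedy is to standardize the features inside a pipeline. The following is a sketch on invented data, not a claim about how the notebook's model behaves; the feature values and the labelling rule are made up purely so there is something to fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Invented features on very different scales
followers = rng.integers(0, 100_000, size=200)
has_picture = rng.integers(0, 2, size=200)
X_demo = np.column_stack([followers, has_picture])

# Invented labelling rule, used only for illustration
y_demo = (followers < 20_000).astype(int)

# Scale the features, then fit the classifier; max_iter is raised as a precaution
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_demo)

demo_score = model.score(X_demo, y_demo)
print("training accuracy on invented data:", demo_score)
```

Because the scaler is part of the pipeline, it is fitted on the training data only and applied unchanged whenever `predict` is called.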

In [None]:
y_pred

In [None]:
y_test

In [None]:
# Importing the accuracy_score function from the sklearn.metrics module
from sklearn.metrics import accuracy_score

# Calculating the accuracy of the model using the accuracy_score function
accuracy = accuracy_score(y_test, y_pred)

# Printing the accuracy of the model
print("Accuracy:", accuracy)
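
Accuracy alone can be misleading when fake accounts are a small share of the data, because a model that misses many fakes can still score well. A confusion matrix together with precision and recall gives a fuller picture. The sketch below uses invented labels (1 = fake, 0 = real), not the notebook's predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Invented labels: 8 real accounts, 2 fake ones; one fake account is missed
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
demo_matrix = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true, columns: predicted
demo_precision = precision_score(y_true_demo, y_pred_demo)
demo_recall = recall_score(y_true_demo, y_pred_demo)

print("accuracy:", demo_accuracy)
print(demo_matrix)
print("precision:", demo_precision)
print("recall:", demo_recall)
```

In this invented case accuracy is high even though only half of the fake accounts are caught, which is the kind of gap recall exposes.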