# Social Network Ads Dataset: High-Level Plan for Data Loading and Analysis
### Team: The p < 0.05 Team

## Objective
Plan how to load and analyze a social network ads dataset, with the goal of predicting user purchasing behavior from centrality measures and comparing those measures across categorical groups.

## Data Source
The dataset is a CSV file from Kaggle containing social network ads data, which includes user details and whether they purchased a product.

## Dataset Overview
- **URL**: [Social Network Ads Dataset](https://www.kaggle.com/datasets/rakeshrau/social-network-ads/data)
- **Columns**: 
  - User ID
  - Gender
  - Age
  - EstimatedSalary
  - Purchased

## Steps

### 1. Data Loading
- **Install Necessary Libraries**: We will use `pandas` and `networkx`.
- **Load the Dataset**: We will use `pandas` to read the CSV file into a DataFrame.

### 2. Data Exploration and Preprocessing
- **Exploratory Data Analysis (EDA)**:
  - Exploring the data structure and types.
  - Handling missing values.
  - Visualizing the distribution of key variables.
- **Handling Categorical Data**: We will need to encode categorical variables such as `Gender` as numbers before analysis.
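
A minimal sketch of the missing-value check and the `Gender` encoding described above. The rows are copied from the `data.head()` output in this notebook so the example runs without a download; the `GenderCode` column name and the direction of the 0/1 mapping are assumptions chosen for illustration.

```python
import pandas as pd

# Five rows copied from the data.head() output shown in this notebook.
df = pd.DataFrame({
    "User ID": [15624510, 15810944, 15668575, 15603246, 15804002],
    "Gender": ["Male", "Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 27, 19],
    "EstimatedSalary": [19000, 20000, 43000, 57000, 76000],
    "Purchased": [0, 0, 0, 0, 0],
})

# Count missing values per column before encoding anything.
missing_per_column = df.isnull().sum()

# Map the two Gender labels to integers.
# Assumption: the 0/1 assignment is arbitrary and only needs to be consistent.
gender_codes = {"Male": 0, "Female": 1}
df["GenderCode"] = df["Gender"].map(gender_codes)

print(missing_per_column)
print(df[["Gender", "GenderCode"]])
```

Any label missing from the mapping becomes `NaN` under `Series.map`, so it may be worth re-running the missing-value check after encoding.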

### 3. Network Construction
- **Create a Social Network Graph**:
  - We will use `networkx` to create a graph where nodes represent users and edges represent hypothetical interactions (for example, links between users with similar attributes).
  - Add nodes with attributes from the dataset.
  - Define criteria for creating edges (for example, linking users of the same age and gender).
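
The construction steps above could be sketched as follows, using the "same age and gender" rule as the edge criterion. This is one possible reading of the plan, not a fixed design; the sample rows are again copied from the `data.head()` output in this notebook.

```python
from itertools import combinations

import networkx as nx
import pandas as pd

# Five rows copied from the data.head() output shown in this notebook.
df = pd.DataFrame({
    "User ID": [15624510, 15810944, 15668575, 15603246, 15804002],
    "Gender": ["Male", "Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 27, 19],
    "EstimatedSalary": [19000, 20000, 43000, 57000, 76000],
    "Purchased": [0, 0, 0, 0, 0],
})

G = nx.Graph()

# One node per user; the remaining columns become node attributes.
for row in df.to_dict("records"):
    user_id = row.pop("User ID")
    G.add_node(user_id, **row)

# Assumed edge rule: connect two users when they share both Age and Gender.
for (u, attrs_u), (v, attrs_v) in combinations(list(G.nodes(data=True)), 2):
    if attrs_u["Age"] == attrs_v["Age"] and attrs_u["Gender"] == attrs_v["Gender"]:
        G.add_edge(u, v)

print(G.number_of_nodes(), G.number_of_edges())
print(list(G.edges()))
```

The pairwise loop checks every pair of users, which is about 79,800 comparisons for the 400 rows in this dataset. Because the edges come from shared attributes rather than observed interactions, any centrality computed on this graph describes how many similar users a person has, not how socially connected they are.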

### 4. Centrality Measures
- **Calculate Degree Centrality**:
  - We will use `networkx` to calculate the degree centrality of each node (user).
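
As a small illustration, `networkx.degree_centrality` returns, for each node, its degree divided by the number of other nodes (n - 1). The toy graph below is an assumption made for the example: it uses the five user IDs from the `data.head()` output and a single edge between the two users who share age and gender. The `DegreeCentrality` column name is likewise hypothetical.

```python
import networkx as nx
import pandas as pd

# Toy graph: five users, one edge (assumed for illustration).
G = nx.Graph()
G.add_nodes_from([15624510, 15810944, 15668575, 15603246, 15804002])
G.add_edge(15624510, 15804002)

# Degree centrality = degree / (n - 1), returned as {node: value}.
centrality = nx.degree_centrality(G)

# Turn the dict into a two-column table that can be merged back on "User ID".
centrality_df = (
    pd.Series(centrality, name="DegreeCentrality")
    .rename_axis("User ID")
    .reset_index()
)
print(centrality_df)
```

With five nodes, each endpoint of the single edge has centrality 1 / 4, and the three isolated users have 0.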

### 5. Analysis and Prediction
- **Compare Centrality Across Categorical Groups**:
  - We can visualize centrality measures by groups like gender or purchasing behavior.
- **Predictive Modeling**:
  - We can combine centrality measures with other features (such as age and estimated salary) as predictors of purchasing behavior.
  - We can then train a machine learning classifier to label users as purchasers or non-purchasers.
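
A rough sketch of both analysis steps, under several assumptions: the rows below are synthetic values made up for illustration (they are not taken from the dataset), the `DegreeCentrality` column is a hypothetical name for the measure from step 4, and scikit-learn's logistic regression stands in for the unspecified machine learning model (the plan itself names only `pandas` and `networkx`).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic rows, invented for illustration only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male", "Female", "Male"],
    "Age": [19, 26, 47, 52, 33, 45, 58, 24],
    "EstimatedSalary": [19000, 43000, 96000, 110000, 57000, 88000, 125000, 30000],
    "DegreeCentrality": [0.10, 0.20, 0.40, 0.30, 0.10, 0.50, 0.30, 0.20],
    "Purchased": [0, 0, 1, 1, 0, 1, 1, 0],
})

# Compare mean centrality between purchasers and non-purchasers.
mean_by_purchase = df.groupby("Purchased")["DegreeCentrality"].mean()
print(mean_by_purchase)

# The same comparison, split further by gender.
print(df.groupby(["Gender", "Purchased"])["DegreeCentrality"].mean())

# Fit a simple classifier on centrality plus the other numeric features.
# Scaling first keeps salary from dominating the smaller-valued features.
features = df[["Age", "EstimatedSalary", "DegreeCentrality"]]
target = df["Purchased"]
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(features, target)

predictions = model.predict(features)
train_accuracy = model.score(features, target)
print(train_accuracy)
```

Accuracy measured on the training rows says little about how the model generalizes; with the real 400-row dataset, a held-out test split would be the more honest check.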

## Hypothetical Outcome
By comparing degree centrality across categorical groups, we might discover that users with higher centrality (more connections) are more likely to purchase products. This insight can help target marketing efforts towards highly connected individuals who may influence their network. For example, if females aged 25-35 with high degree centrality show higher purchase rates, marketing strategies could focus on engaging these influential users to drive product adoption within their social circles.

```python
# Social Network Ads: https://www.kaggle.com/datasets/rakeshrau/social-network-ads/data

# Load the data
import pandas as pd

url = "https://raw.githubusercontent.com/hbedros/centrality-measures/main/data/Social_Network_Ads.csv"
data = pd.read_csv(url)

# Display the first few rows of the dataset
print(data.head())

# Get basic information about the dataset
# (info() prints its own summary and returns None, so print() also emits "None")
print(data.info())

# Check for any missing values
print(data.isnull().sum())

# Summary statistics of the dataset
print(data.describe())
```


    User ID  Gender  Age  EstimatedSalary  Purchased
0  15624510    Male   19            19000          0
1  15810944    Male   35            20000          0
2  15668575  Female   26            43000          0
3  15603246  Female   27            57000          0
4  15804002    Male   19            76000          0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User ID          400 non-null    int64 
 1   Gender           400 non-null    object
 2   Age              400 non-null    int64 
 3   EstimatedSalary  400 non-null    int64 
 4   Purchased        400 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 15.8+ KB
None
User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64
            User ID         Age  EstimatedSalary   Purchased
count  4.000000e+02  400.000