# Social Network Ads Dataset: High-Level Plan for Data Loading and Analysis
### Team: The p < 0.05 Team - Haig Bedros, Noori Selina, Julia Ferris, Matthew Roland

## Objective
Plan how to load and analyze a social network ads dataset, then use centrality measures to predict user buying behavior and compare that behavior across categorical groups.

## Data Source
The dataset is a CSV file from Kaggle containing social network ads data, which includes user details and whether they purchased a product.

## Dataset Overview
- **URL**: [Social Network Ads Dataset](https://www.kaggle.com/datasets/rakeshrau/social-network-ads/data)
- **Columns**: 
  - User ID
  - Gender
  - Age
  - EstimatedSalary
  - Purchased

## Steps

### 1. Data Loading
- **Install Necessary Libraries**: We will be using `pandas`, `networkx`, `matplotlib`, and `seaborn`.
- **Load the Dataset**: Use `pandas` to load the dataset.

### 2. Data Exploration and Preprocessing
- **Exploratory Data Analysis (EDA)**:
  - Explore the data structure and types.
  - Handle any missing values.
  - Visualize the distribution of key variables.
- **Handling Categorical Data**: Encode categorical variables such as Gender as numeric values for analysis.
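
The encoding step could look like the minimal sketch below. It uses only the first five rows shown in the notebook output so that it runs without a download. The `GenderCode` column name and the Female/Male to 0/1 mapping are assumptions made for illustration, not choices fixed by the plan.

```python
import pandas as pd

# First five rows of the dataset, as shown by data.head() in the notebook output.
data = pd.DataFrame({
    "User ID": [15624510, 15810944, 15668575, 15603246, 15804002],
    "Gender": ["Male", "Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 27, 19],
    "EstimatedSalary": [19000, 20000, 43000, 57000, 76000],
    "Purchased": [0, 0, 0, 0, 0],
})

# Count missing values per column before deciding how to handle them.
missing_counts = data.isnull().sum()

# Assumed encoding: Female -> 0, Male -> 1. Any other value maps to NaN,
# which makes unexpected categories easy to spot.
gender_codes = {"Female": 0, "Male": 1}
data["GenderCode"] = data["Gender"].map(gender_codes)

print(missing_counts)
print(data[["Gender", "GenderCode"]])
```

An explicit mapping keeps the meaning of each code visible; `pd.get_dummies` would be an alternative if more categorical columns are added later.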

### 3. Network Construction
- **Create a Social Network Graph**:
  - Use `networkx` to create a graph where nodes represent users and edges represent hypothetical interactions (based on similar age and estimated salary).
  - Add nodes with attributes from the dataset.
  - Define criteria for creating edges.
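
One possible edge rule is sketched below: two users are linked when both their age gap and their salary gap fall under a threshold. The thresholds `MAX_AGE_GAP` and `MAX_SALARY_GAP` are hypothetical values chosen for the example, since the plan does not fix them, and the data is again the five rows from the notebook output.

```python
from itertools import combinations

import networkx as nx
import pandas as pd

# First five rows of the dataset, as shown by data.head() in the notebook output.
data = pd.DataFrame({
    "User ID": [15624510, 15810944, 15668575, 15603246, 15804002],
    "Gender": ["Male", "Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 27, 19],
    "EstimatedSalary": [19000, 20000, 43000, 57000, 76000],
    "Purchased": [0, 0, 0, 0, 0],
})

# Hypothetical similarity thresholds; tune these against the full dataset.
MAX_AGE_GAP = 10          # years
MAX_SALARY_GAP = 25_000   # same units as EstimatedSalary

G = nx.Graph()

# One node per user, carrying the remaining columns as node attributes.
for row in data.to_dict("records"):
    user_id = row.pop("User ID")
    G.add_node(user_id, **row)

# Connect every pair of users whose age and salary are both close.
for u, v in combinations(G.nodes, 2):
    age_gap = abs(G.nodes[u]["Age"] - G.nodes[v]["Age"])
    salary_gap = abs(G.nodes[u]["EstimatedSalary"] - G.nodes[v]["EstimatedSalary"])
    if age_gap <= MAX_AGE_GAP and salary_gap <= MAX_SALARY_GAP:
        G.add_edge(u, v)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```

The pairwise loop is quadratic in the number of users, which is acceptable for the 400 rows in this dataset (79,800 pairs) but would need a smarter approach for much larger data.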

### 4. Centrality Measures
- **Calculate Degree and Eigenvector Centrality**:
  - Use `networkx` to calculate degree centrality of each node (user).
  - Use Google's PageRank algorithm, a damped variant of eigenvector centrality, in place of plain eigenvector centrality to avoid potential convergence issues.
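
The sketch below shows both measures on a small toy graph. The node names and edges are hypothetical, and `alpha=0.85` is simply the networkx default damping factor rather than a value chosen by the plan. The `try`/`except` illustrates the convergence failure that plain eigenvector centrality can raise.

```python
import networkx as nx

# Toy graph standing in for the user network (hypothetical edges).
G = nx.Graph([("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")])

# Degree centrality: each node's degree divided by (n - 1).
degree_c = nx.degree_centrality(G)

# PageRank with the networkx default damping factor.
pagerank_c = nx.pagerank(G, alpha=0.85)

# Plain eigenvector centrality, with a fallback in case power iteration
# does not converge within max_iter steps.
try:
    eigen_c = nx.eigenvector_centrality(G, max_iter=1000)
except nx.PowerIterationFailedConvergence:
    eigen_c = None

for node in sorted(G.nodes):
    print(node, round(degree_c[node], 2), round(pagerank_c[node], 3))
```

PageRank scores sum to 1 across the graph, so they can be read as each user's share of overall importance.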

### 5. Visualization and Analysis
- **Visualization**:
  - Visualize degree distribution of the nodes.
  - Identify top users based on degree centrality.
  - Calculate and visualize the minimum, average, and range of degrees.
- **Snowball Sampling**:
  - Apply snowball sampling to create a smaller subset of the network.
- **Trimming the Network**:
  - Iteratively trim the network until it is a manageable size for analysis.
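
The three analysis steps above can be sketched on a toy graph as follows. The helpers `snowball_sample` and `trim_by_degree` are hypothetical names written for this example, not networkx functions, and trimming by a minimum degree is only one assumed interpretation of how the network would be reduced.

```python
from collections import Counter

import networkx as nx


def snowball_sample(G, seeds, waves):
    """Return the subgraph reached from `seeds` in at most `waves` neighbor expansions."""
    sampled = set(seeds)
    frontier = set(seeds)
    for _ in range(waves):
        # Neighbors of the current wave that have not been sampled yet.
        frontier = {nbr for node in frontier for nbr in G.neighbors(node)} - sampled
        if not frontier:
            break
        sampled |= frontier
    return G.subgraph(sampled).copy()


def trim_by_degree(G, min_degree):
    """Repeatedly drop nodes with degree below `min_degree` until none remain."""
    trimmed = G.copy()
    while True:
        low = [n for n, d in trimmed.degree() if d < min_degree]
        if not low:
            return trimmed
        trimmed.remove_nodes_from(low)


# Toy graph (hypothetical edges): a triangle a-b-c with a tail c-d-e.
G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")])

# Degree summary: distribution, minimum, average, range, and top users.
degrees = dict(G.degree())
degree_counts = Counter(degrees.values())  # degree -> number of nodes
min_deg, max_deg = min(degrees.values()), max(degrees.values())
avg_deg = sum(degrees.values()) / len(degrees)
top_users = sorted(degrees, key=degrees.get, reverse=True)[:3]

sample = snowball_sample(G, seeds=["e"], waves=2)
core = trim_by_degree(G, min_degree=2)

print(f"min={min_deg}, mean={avg_deg:.2f}, range={max_deg - min_deg}")
print("top users:", top_users)
print("snowball sample:", sorted(sample.nodes))
print("trimmed network:", sorted(core.nodes))
```

`degree_counts` can be passed to a bar chart to visualize the degree distribution. Trimming by minimum degree in this way yields the k-core of the graph, which networkx also exposes as `nx.k_core`.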

### Hypothetical Outcome
By comparing degree centrality across categorical groups, we might discover that users with higher centrality (more connections) are more likely to purchase products. This insight can help target marketing efforts towards highly connected individuals who may influence their network. For example, if females aged 25-35 with high degree centrality show higher purchase rates, marketing strategies could focus on engaging these influential users to drive product adoption within their social circles.
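
A group comparison of this kind might be computed as in the sketch below. It reuses the five rows from the notebook output and a hypothetical edge list, so the numbers only illustrate the mechanics. In particular, all five sample rows have `Purchased` equal to 0, so nothing about purchase rates can be concluded from them.

```python
import networkx as nx
import pandas as pd

# First five rows of the dataset, as shown by data.head() in the notebook output.
data = pd.DataFrame({
    "User ID": [15624510, 15810944, 15668575, 15603246, 15804002],
    "Gender": ["Male", "Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 27, 19],
    "EstimatedSalary": [19000, 20000, 43000, 57000, 76000],
    "Purchased": [0, 0, 0, 0, 0],
})

# Hypothetical edges between the five users (by User ID), standing in for
# whatever similarity rule the network construction step settles on.
edges = [
    (15624510, 15668575),
    (15810944, 15668575),
    (15668575, 15603246),
    (15603246, 15804002),
]
G = nx.Graph()
G.add_nodes_from(data["User ID"].tolist())
G.add_edges_from(edges)

# Attach each user's degree centrality to the table.
data["DegreeCentrality"] = data["User ID"].map(nx.degree_centrality(G))

# Mean centrality and purchase rate per categorical group.
summary = data.groupby("Gender")[["DegreeCentrality", "Purchased"]].mean()
print(summary)
```

On the full dataset, the same `groupby` could be extended with an age band (for example via `pd.cut`) to examine segments such as the 25-35 group mentioned above.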

```python
# Social Network Ads: https://www.kaggle.com/datasets/rakeshrau/social-network-ads/data

# Loading the data
import pandas as pd

url = "https://raw.githubusercontent.com/hbedros/centrality-measures/main/data/Social_Network_Ads.csv"
data = pd.read_csv(url)

# Display the first few rows of the dataset
print(data.head())

# Get basic information about the dataset
print(data.info())

# Check for any missing values
print(data.isnull().sum())

# Summary statistics of the dataset
print(data.describe())
```


    User ID  Gender  Age  EstimatedSalary  Purchased
0  15624510    Male   19            19000          0
1  15810944    Male   35            20000          0
2  15668575  Female   26            43000          0
3  15603246  Female   27            57000          0
4  15804002    Male   19            76000          0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User ID          400 non-null    int64 
 1   Gender           400 non-null    object
 2   Age              400 non-null    int64 
 3   EstimatedSalary  400 non-null    int64 
 4   Purchased        400 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 15.8+ KB
None
User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64
            User ID         Age  EstimatedSalary   Purchased
count  4.000000e+02  400.000