# Synthetic Social Science Dataset Creation

This notebook demonstrates the creation of a **synthetic dataset** for educational purposes, focusing on democracy-related attitudes and demographic variables. The dataset contains 1,000 respondents and 10 variables, including:
- **Demographic characteristics**: Age, gender, education level, and income level.
- **Attitudes**: Democracy ratings and trust in government.
- **Behavioral and social variables**: Voting behavior, political knowledge, media consumption, and support for social equality.

### Purpose
The dataset is designed to be used in teaching the following methods in computational social science and quantitative research:
1. **Exploratory Data Analysis (EDA)**
2. **Descriptive statistics**
3. **Inferential statistics**
4. **Selected machine learning algorithms for research**

### Key Features
- **Simulated Relationships**: Selected variables are adjusted after sampling so that they correlate in ways that mimic plausible real-world patterns. For example:
  - Older individuals rate democracy more highly.
  - Higher education correlates with greater political knowledge.
  - Trust in government is associated with higher support for social equality.
- **Reproducibility**: A random seed ensures consistency when regenerating the dataset.
- **Educational Focus**: The dataset is ideal for demonstrating foundational statistical and data analysis techniques.

Let's begin by generating the dataset and inspecting its structure.


```python
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Number of respondents
n = 1000

# Generate synthetic data
data = pd.DataFrame({
    # Demographics
    "Age": np.random.randint(18, 80, size=n),  # Age between 18 and 79
    "Gender": np.random.choice(["Male", "Female"], size=n, p=[0.50, 0.50]),
    "Education_Level": np.random.choice(["High School", "Bachelor's", "Master's", "PhD"], size=n,
                                        p=[0.4, 0.35, 0.2, 0.05]),
    "Income_Level": np.random.choice(["Low", "Middle", "High"], size=n, p=[0.4, 0.4, 0.2]),

    # Democracy-related attitudes
    "Democracy_Rating": np.random.randint(1, 11, size=n),  # Scale 1-10
    "Trust_in_Government": np.random.randint(1, 6, size=n),  # Scale 1-5

    # Binary variable for participation in elections
    "Voted_Last_Election": np.random.choice(["Yes", "No"], size=n, p=[0.7, 0.3]),

    # Political awareness and engagement
    "Political_Knowledge": np.random.randint(0, 11, size=n),  # Score 0-10
    "Media_Consumption": np.random.choice(["Daily", "Weekly", "Rarely", "Never"], size=n,
                                          p=[0.5, 0.3, 0.15, 0.05]),

    # Social and cultural attitudes
    "Social_Equality_Support": np.random.randint(1, 6, size=n)  # Scale 1-5
})

# Introduce some relationships and variability
# Older individuals might rate democracy higher
data["Democracy_Rating"] = (data["Democracy_Rating"] + (data["Age"] / 20)).clip(1, 10).astype(int)

# Higher education correlates with better political knowledge
education_mapping = {"High School": 5, "Bachelor's": 7, "Master's": 8, "PhD": 9}
data["Political_Knowledge"] = data["Political_Knowledge"] + data["Education_Level"].map(education_mapping) // 2
data["Political_Knowledge"] = data["Political_Knowledge"].clip(0, 10)

# Higher trust in government correlates with higher social equality support
data["Social_Equality_Support"] = (data["Social_Equality_Support"] + data["Trust_in_Government"] / 2).clip(1, 5).astype(int)

# Display the first few rows of the dataset
data.head()
```


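|   | Age | Gender | Education_Level | Income_Level | Democracy_Rating | Trust_in_Government | Voted_Last_Election | Political_Knowledge | Media_Consumption | Social_Equality_Support |
|---|-----|--------|-----------------|--------------|------------------|---------------------|---------------------|---------------------|-------------------|-------------------------|
| 0 | 56 | Male | High School | High | 10 | 5 | Yes | 8 | Daily | 5 |
| 1 | 69 | Female | High School | Middle | 5 | 2 | Yes | 10 | Weekly | 2 |
| 2 | 46 | Male | Bachelor's | Middle | 6 | 4 | No | 7 | Daily | 5 |
| 3 | 32 | Female | High School | Middle | 3 | 2 | Yes | 10 | Daily | 2 |
| 4 | 60 | Female | High School | High | 9 | 1 | Yes | 10 | Daily | 2 |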


# Code Walkthrough and Explanations

## 1. Importing Libraries

```python
import pandas as pd
import numpy as np
```

- **pandas**: Used for creating and managing the tabular data structure (DataFrame).
- **numpy**: Provides tools for efficient numerical computations, including random number generation.

## 2. Setting the Random Seed

```python
np.random.seed(42)
```

- **Purpose**: Ensures reproducibility of the random numbers generated.
- **How it works**: The seed sets the initial state of the random number generator. If you re-run the code with the same seed, you'll get the same random outputs.
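
To make the effect of the seed concrete, here is a minimal sketch that draws the same five ages twice, resetting the seed in between. The variable names `first_draw` and `second_draw` are illustrative and do not appear in the notebook.

```python
import numpy as np

# First draw after seeding the legacy global generator
np.random.seed(42)
first_draw = np.random.randint(18, 80, size=5)

# Resetting the seed rewinds the generator to the same starting state
np.random.seed(42)
second_draw = np.random.randint(18, 80, size=5)

print(first_draw)
print(second_draw)
print(np.array_equal(first_draw, second_draw))  # True
```

Note that the seed only fixes the sequence of draws. Adding, removing, or reordering random calls in the notebook changes which numbers each variable receives.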

## 3. Number of Respondents

```python
n = 1000
```

- **Purpose**: Defines the size of the dataset (number of rows).
- **Value**: Here, we set 1,000 respondents to simulate a medium-sized survey dataset.

## 4. Generating Demographic Variables

### Age

```python
"Age": np.random.randint(18, 80, size=n)
```

- **np.random.randint(low, high, size)**:
  - Generates random integers within the specified range (`low` inclusive, `high` exclusive).
  - **size=n**: Creates an array with n values.
- **Purpose**: Simulates respondent ages between 18 and 79.
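
The half-open range is easy to get wrong, so a quick check can help. The sketch below uses an arbitrary seed and a larger sample than the notebook (both are assumptions made only for this illustration) and inspects the smallest and largest values drawn.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, chosen only for this illustration

# Large sample so that both ends of the range are very likely to appear
ages = np.random.randint(18, 80, size=10_000)

# The upper bound 80 is exclusive, so no value can exceed 79
print("youngest:", ages.min())
print("oldest:", ages.max())
```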

### Gender

```python
"Gender": np.random.choice(["Male", "Female"], size=n, p=[0.50, 0.50])
```

- **np.random.choice(a, size, p)**:
  - Randomly selects values from the array `a`.
  - **size=n**: Generates n random selections.
  - **p=[0.50, 0.50]**: Specifies the probability of each category, in the same order as the array; the probabilities must sum to 1.
  - Here, there's a 50% chance for "Male" and a 50% chance for "Female".
- **Purpose**: Simulates a binary gender variable with an equal distribution.

### Education Level

```python
"Education_Level": np.random.choice(["High School", "Bachelor's", "Master's", "PhD"], size=n, p=[0.4, 0.35, 0.2, 0.05])
```

- **Purpose**: Simulates respondents' highest education levels with predefined probabilities:
  - 40% "High School", 35% "Bachelor's", 20% "Master's", and 5% "PhD."
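
The probabilities passed through `p` are targets, not guarantees: with 1,000 respondents the observed shares only approximate them. The following sketch compares target and observed shares for the education variable. It draws education on its own right after seeding, so the counts will not match the notebook's column exactly, and the names `levels`, `target`, and `comparison` are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000

levels = ["High School", "Bachelor's", "Master's", "PhD"]
target = [0.4, 0.35, 0.2, 0.05]

education = pd.Series(np.random.choice(levels, size=n, p=target))

# Observed share of each category, in the same order as `levels`
observed = education.value_counts(normalize=True).reindex(levels, fill_value=0.0)

comparison = pd.DataFrame({
    "target": pd.Series(target, index=levels),
    "observed": observed,
})
print(comparison)
```

The same comparison works for the other categorical variables by swapping in their category lists and probabilities.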

### Income Level

```python
"Income_Level": np.random.choice(["Low", "Middle", "High"], size=n, p=[0.4, 0.4, 0.2])
```

- **Purpose**: Creates income categories with probabilities:
  - 40% "Low", 40% "Middle", 20% "High."

## 5. Generating Democracy-Related Attitudes

### Democracy Rating

```python
"Democracy_Rating": np.random.randint(1, 11, size=n)
```

- **Generates random ratings** on a scale from 1 (not important) to 10 (very important).
- **Purpose**: Simulates a subjective measure of democracy importance.

### Trust in Government

```python
"Trust_in_Government": np.random.randint(1, 6, size=n)
```

- **Generates random values** on a scale from 1 (no trust) to 5 (full trust).
- **Purpose**: Captures attitudes toward governmental trust.

## 6. Binary Behavioral Variable

### Voted Last Election

```python
"Voted_Last_Election": np.random.choice(["Yes", "No"], size=n, p=[0.7, 0.3])
```

- **70%** of respondents are simulated to have voted, while **30%** did not.

## 7. Generating Political Awareness and Media Consumption

### Political Knowledge

```python
"Political_Knowledge": np.random.randint(0, 11, size=n)
```

- **Generates a quiz score** ranging from 0 (no knowledge) to 10 (high knowledge).

### Media Consumption

```python
"Media_Consumption": np.random.choice(["Daily", "Weekly", "Rarely", "Never"], size=n, p=[0.5, 0.3, 0.15, 0.05])
```

- **Simulates respondents' frequency** of consuming political news:
  - 50% "Daily", 30% "Weekly", 15% "Rarely", 5% "Never."

## 8. Social and Cultural Attitudes

### Social Equality Support

```python
"Social_Equality_Support": np.random.randint(1, 6, size=n)
```

- **Generates random values** on a scale from 1 (low support) to 5 (high support).

## 9. Introducing Relationships and Variability

### Adjusting Democracy Rating with Age

```python
data["Democracy_Rating"] = (data["Democracy_Rating"] + (data["Age"] / 20)).clip(1, 10).astype(int)
```

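- **Purpose**: Introduces a relationship where older respondents tend to rate democracy higher:
  - ✨ **Adjusted Rating** = Original Rating + (Age / 20).
- **.clip(1, 10)** ensures values remain within the range of 1–10.
- **.astype(int)** truncates the fractional part, so the effective boost is a whole number: +0 for ages 18–19, +1 for 20–39, +2 for 40–59, and +3 for 60–79. Because of the upper clip, ratings pile up at 10 for older respondents.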
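
As a worked example of the adjustment, the sketch below applies the same expression to a fixed starting rating of 5 at the edges of each 20-year age band. The fixed rating and the names `base_rating` and `boost` are assumptions made only for illustration.

```python
import pandas as pd

# Ages at the boundaries of each 20-year band
ages = pd.Series([18, 19, 20, 39, 40, 59, 60, 79])
base_rating = 5  # hypothetical starting rating, held constant

# Same expression as in the notebook, applied to the fixed rating
adjusted = (base_rating + ages / 20).clip(1, 10).astype(int)
boost = adjusted - base_rating

print(pd.DataFrame({"Age": ages, "Adjusted": adjusted, "Boost": boost}))
```

Because `.astype(int)` drops the fractional part, the boost moves in whole-number steps at ages 20, 40, and 60 rather than rising smoothly with age.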

### Correlating Education Level with Political Knowledge

```python
education_mapping = {"High School": 5, "Bachelor's": 7, "Master's": 8, "PhD": 9}
data["Political_Knowledge"] = data["Political_Knowledge"] + data["Education_Level"].map(education_mapping) // 2
data["Political_Knowledge"] = data["Political_Knowledge"].clip(0, 10)
```

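- **Maps education levels** to base scores (5, 7, 8, 9), so higher education results in higher political knowledge.
- **Adjusts scores**: because `//` binds more tightly than `+`, only the mapped value is floor-divided by 2, which adds 2, 3, 4, or 4 points to the random score. "Master's" and "PhD" therefore receive the same boost.
- **Clips scores** to stay within the 0–10 range.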
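
To see what the mapping actually adds, this short sketch computes the per-level boost and shows the clip acting on a high starting score. The starting score of 8 and the names `boost` and `example_scores` are illustrative assumptions.

```python
import pandas as pd

education_mapping = {"High School": 5, "Bachelor's": 7, "Master's": 8, "PhD": 9}
levels = pd.Series(list(education_mapping))

# Floor division is applied to the mapped value only
boost = levels.map(education_mapping) // 2
print(dict(zip(levels, boost.tolist())))
# {'High School': 2, "Bachelor's": 3, "Master's": 4, 'PhD': 4}

# A hypothetical starting score of 8 hits the ceiling for every level
example_scores = (8 + boost).clip(0, 10)
print(example_scores.tolist())  # [10, 10, 10, 10]
```

Since the random score is uniform on 0–10 and every level adds at least 2 points, a sizeable share of respondents ends up at the maximum of 10, which is worth keeping in mind when interpreting this variable.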

### Correlating Trust in Government with Social Equality Support

```python
data["Social_Equality_Support"] = (data["Social_Equality_Support"] + data["Trust_in_Government"] / 2).clip(1, 5).astype(int)
```

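- **Higher trust in government** increases support for social equality:
  - ✨ **Adjusted Support** = Original Support + (Trust / 2).
- **.clip(1, 5)** keeps values on the 1–5 scale, and **.astype(int)** truncates the fractional part, so the effective boost is +0 for trust 1, +1 for trust 2–3, and +2 for trust 4–5.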
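
One way to confirm that the adjustment produces the intended association is to compare mean support across trust levels. The sketch below rebuilds only the two relevant columns with an arbitrary seed, so its numbers will differ from the notebook's `data`; the names `demo`, `group_means`, and `correlation` are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed for this standalone sketch
n = 1000

demo = pd.DataFrame({
    "Trust_in_Government": np.random.randint(1, 6, size=n),
    "Social_Equality_Support": np.random.randint(1, 6, size=n),
})

# Same adjustment as in the notebook
demo["Social_Equality_Support"] = (
    demo["Social_Equality_Support"] + demo["Trust_in_Government"] / 2
).clip(1, 5).astype(int)

# Mean support per trust level, plus the overall Pearson correlation
group_means = demo.groupby("Trust_in_Government")["Social_Equality_Support"].mean()
correlation = demo["Trust_in_Government"].corr(demo["Social_Equality_Support"])

print(group_means.round(2))
print(f"Pearson r = {correlation:.2f}")
```

The same pattern (group by the driver, summarize the outcome) can be used to check the age and education adjustments as a first EDA exercise.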

## Summary

- **numpy for Random Sampling**: Generates random values for numeric and categorical variables.
- **Controlled Relationships**: Adjustments mimic real-world patterns.
- **Reproducibility**: Using `np.random.seed()` ensures consistent results.