## Introduction to Data Wrangling in Python Using Pandas

Data wrangling is a critical step in the data analysis process, especially in psychological research. Before drawing insights from data, researchers must clean, structure, and transform raw data into a usable format. In this tutorial, we will use Python's Pandas library to perform common data wrangling tasks on a simulated psychology dataset.

We will cover:
1. Creating and inspecting a dataset
2. Handling missing values
3. Transforming and categorizing data
4. Filtering and merging datasets

Let's get started!


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

In [None]:
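# Simulating a psychology survey dataset
# Note: the values below are hard-coded, so the seed only matters if you add randomly generated columns
np.random.seed(42)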
data = {
    "Participant_ID": range(1, 11),
    "Age": [23, 45, 34, np.nan, 29, 40, 36, 28, 33, np.nan],
    "Gender": ["F", "M", "M", "F", "F", "M", "F", "M", "M", "F"],
    "Stress_Level": [3, 7, 5, 6, 2, 8, 4, 5, np.nan, 6],
    "Anxiety_Score": [np.nan, 12, 8, 15, 9, 14, 6, 7, 10, 11]
}
df = pd.DataFrame(data)

In [None]:
# Display the full dataset (it has only 10 rows)
print("Original Dataset:")
print(df)
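Printing the whole DataFrame works for ten rows, but larger datasets are usually inspected with summary methods instead. The sketch below is a minimal illustration on a small stand-in frame; the name `survey` and its values are made up for this example and are not part of the tutorial dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame with the same kind of columns as the tutorial dataset
survey = pd.DataFrame({
    "Participant_ID": range(1, 6),
    "Age": [23, 45, 34, np.nan, 29],
    "Gender": ["F", "M", "M", "F", "F"],
    "Stress_Level": [3, 7, 5, 6, 2],
})

print(survey.head(3))      # first three rows only
print(survey.shape)        # (number of rows, number of columns)
print(survey.dtypes)       # data type of each column
print(survey.describe())   # summary statistics for the numeric columns
```

The same calls can be run on `df` from the cell above.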

### Example 1: Handling Missing Values

In [None]:
# Checking for missing values
def check_missing_values(df):
    print("\nMissing Values:")
    print(df.isnull().sum())

check_missing_values(df)

In [None]:
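# Filling missing values in Age with the median
# Filling missing values in Stress_Level and Anxiety_Score with the mean
# Assign the result back to the column: calling fillna(inplace=True) on a
# selected column is deprecated and may not update df in recent pandas versions
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Stress_Level"] = df["Stress_Level"].fillna(df["Stress_Level"].mean())
df["Anxiety_Score"] = df["Anxiety_Score"].fillna(df["Anxiety_Score"].mean())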

print("\nDataset after handling missing values:")
print(df)
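Imputation is only one way to handle missing values; another common choice is to keep complete cases only. The following sketch compares the two on a small made-up frame (the names `scores`, `complete_cases`, and `imputed` are illustrative assumptions), and also shows that `fillna` accepts a dictionary of per-column fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
scores = pd.DataFrame({
    "Age": [23, 45, np.nan, 29],
    "Stress_Level": [3, 7, 5, np.nan],
})

# Option A: drop every row that has at least one missing value
complete_cases = scores.dropna()

# Option B: fill each column with its own statistic in a single call
fill_values = {
    "Age": scores["Age"].median(),
    "Stress_Level": scores["Stress_Level"].mean(),
}
imputed = scores.fillna(fill_values)

print(len(complete_cases), "complete rows out of", len(scores))
print(imputed)
```

Dropping rows shrinks the sample, while mean or median imputation keeps every participant but reduces the variance of the filled columns, so the choice depends on the analysis.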

### Example 2: Categorizing Psychological Data

In [None]:
# Creating a categorical variable based on Anxiety Score
def categorize_anxiety(score):
    if score < 8:
        return "Low"
    elif score < 12:
        return "Moderate"
    else:
        return "High"

df["Anxiety_Category"] = df["Anxiety_Score"].apply(categorize_anxiety)

print("\nDataset with Anxiety Categories:")
print(df)
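The same binning can also be expressed without a custom function by using `pd.cut`. This is a sketch of an alternative, not a replacement for the cell above; the series `anxiety` contains made-up scores, and the cut points 8 and 12 are taken from `categorize_anxiety`.

```python
import pandas as pd

# Hypothetical scores chosen to sit on and around the cut points
anxiety = pd.Series([6, 7.9, 8, 11, 12, 15], name="Anxiety_Score")

# right=False makes each bin closed on the left: [-inf, 8), [8, 12), [12, inf)
categories = pd.cut(
    anxiety,
    bins=[-float("inf"), 8, 12, float("inf")],
    labels=["Low", "Moderate", "High"],
    right=False,
)

print(categories.tolist())
```

One difference worth knowing: `pd.cut` returns an ordered categorical column, and it leaves missing scores as missing, whereas `categorize_anxiety` would label a missing score "High" because every comparison with NaN is false.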

### Example 3: Filtering Data

In [None]:
# Selecting participants with high stress levels (>=6)
high_stress_df = df[df["Stress_Level"] >= 6]

print("\nParticipants with High Stress Levels:")
print(high_stress_df)
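Filters often need more than one condition. The sketch below, on a small made-up frame named `people`, combines two boolean masks with `&` and shows the equivalent `query` call.

```python
import pandas as pd

# Hypothetical frame for illustration
people = pd.DataFrame({
    "Participant_ID": [1, 2, 3, 4],
    "Gender": ["F", "M", "F", "M"],
    "Stress_Level": [3, 7, 6, 8],
})

# Each condition needs its own parentheses; combine with & (and), | (or), ~ (not)
high_stress_women = people[(people["Stress_Level"] >= 6) & (people["Gender"] == "F")]

# The same filter written as a query string
same_rows = people.query("Stress_Level >= 6 and Gender == 'F'")

print(high_stress_women)
```

Python's `and` and `or` keywords do not work on boolean Series, which is why the element-wise operators are used in the mask version.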

### Example 4: Merging Data

In [None]:
# Creating a new dataset with additional participant information
demographics_data = {
    "Participant_ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Education_Level": ["Bachelor", "Master", "PhD", "Bachelor", "High School", "Master", "PhD", "Bachelor", "Master", "High School"]
}
demographics_df = pd.DataFrame(demographics_data)

# Merging datasets on Participant_ID
df_merged = pd.merge(df, demographics_df, on="Participant_ID")

print("\nMerged Dataset:")
print(df_merged)
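By default `pd.merge` performs an inner join, so participants missing from either table are dropped without warning. In this tutorial both tables contain the same ten IDs, so nothing is lost. The sketch below uses two small made-up frames (`survey_part` and `demo_part`) whose IDs do not fully match, to show how `how`, `indicator`, and `validate` can make a merge easier to check.

```python
import pandas as pd

# Hypothetical frames: participant 3 has no demographics, participant 4 has no survey row
survey_part = pd.DataFrame({"Participant_ID": [1, 2, 3], "Stress_Level": [3, 7, 5]})
demo_part = pd.DataFrame({
    "Participant_ID": [1, 2, 4],
    "Education_Level": ["Bachelor", "Master", "PhD"],
})

checked = pd.merge(
    survey_part,
    demo_part,
    on="Participant_ID",
    how="left",             # keep every survey row, even without a match
    validate="one_to_one",  # raise an error if an ID is duplicated in either table
    indicator=True,         # add a _merge column showing where each row came from
)

print(checked)
```

Rows marked `left_only` in the `_merge` column are participants with survey data but no demographic record.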

## Exercise

Now it's your turn! Complete the following exercises to practice data wrangling on the psychology dataset.

1. **Filtering**: Create a new DataFrame that includes only participants with a "High" Anxiety Category. Display the resulting DataFrame.
2. **Data Transformation**: Add a new column called "Mental_Wellbeing_Score", calculated as Stress_Level minus Anxiety_Score. For the purposes of this exercise, treat higher scores as indicating better mental well-being. Display the updated dataset.

Write your code below each question and run the cells to test your solution!
