
# Project 2 – Part 1  
**Course:** MATH 014 (04) — Introduction to Data Science Honors  
**Student Name:** Manjil Rawal  
**Student ID:** @03086947  
**Date:** 2025-10-31 19:45

---



## 1. Dataset Information

**Title / Topic:** Students' Social Media Addiction vs. Relationships  
**Source (link):** https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships  

**Brief real‑world context / purpose:**  
This dataset is a cross‑country survey of students’ social‑media behaviors and related outcomes, including usage hours, most‑used platforms, sleep, mental‑health score, relationship status, and conflicts over social media. It is suitable for basic data exploration, wrangling, and simple visual analysis.



## Setup and Load

```python
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display  # explicit import for the display() calls used in later cells

# Make tables easier to read
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

# === Load your CSV ===
# Put the CSV in the same folder as this notebook, then set the exact filename here:
csv_path = "Project_data.csv" 

df = pd.read_csv(csv_path)
print("Loaded:", csv_path)
print("Shape:", df.shape)
df.head()
```








## 2. Basic Data Exploration

**a) First and last 5 rows; rows and columns; column names**

```python
# Shape (rows, cols)
df.shape
```

```python
# First 5 rows
df.head()
```

```python
# Last 5 rows
df.tail()
```

```python
# Column names
list(df.columns)
```




## 3. Data Types and Structure

**a) `df.info()`**  
```python
df.info()
```

**b) Identify numeric vs categorical columns**  
```python
numeric_cols = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
categorical_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)
```

**c) Any conversions needed? (brief note)**  
_Example: If a numeric-looking column was read as object/text, we would convert it with `pd.to_numeric(..., errors='coerce')`. For this dataset, the fields expected to be numeric (e.g., `Age`, `Avg_Daily_Usage_Hours`, `Sleep_Hours_Per_Night`, `Mental_Health_Score`, `Conflicts_Over_Social_Media`, `Addicted_Score`) should already have numeric dtypes; confirm this against the `df.info()` output._
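
A minimal sketch of the conversion mentioned above, in case it is ever needed. The toy values are made up for illustration and are not taken from the survey; the helper column name `Age_num` is also hypothetical.

```python
import pandas as pd

# Toy data (made-up values, not from the survey): numbers stored as text.
toy = pd.DataFrame({"Age": ["19", "21", "n/a", "22"]})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
toy["Age_num"] = pd.to_numeric(toy["Age"], errors="coerce")

print(toy.dtypes)
print("Values that failed to convert:", toy["Age_num"].isna().sum())
```

Counting the NaN values created by the conversion is a quick way to see how many entries were not valid numbers.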



## 4. Descriptive Summary

```python
# Summary statistics for numeric columns
df.describe()
```

_Interpretation prompt: Note the minimum and maximum usage hours, the average sleep hours, and the ranges of the mental health and addiction scores. Mention any surprising values (e.g., very high or very low hours)._
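
One possible way to pull out the numbers this prompt asks for is to transpose the `describe()` table and add a range column. This is only a sketch on made-up toy values (not the survey data); the `range` column is our own addition.

```python
import pandas as pd

# Toy data (made-up values) using two column names from this dataset.
toy = pd.DataFrame({
    "Avg_Daily_Usage_Hours": [2.0, 4.5, 6.0, 3.5],
    "Sleep_Hours_Per_Night": [8.0, 6.5, 5.0, 7.0],
})

# Transpose so each variable is one row, then add max - min as the range.
summary = toy.describe().T
summary["range"] = summary["max"] - summary["min"]

print(summary[["min", "max", "mean", "range"]])
```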



## 5. Missing or Duplicate Data

```python
# Missing values per column
missing = df.isna().sum().to_frame('missing_count')
missing['missing_%'] = (missing['missing_count'] / len(df) * 100).round(2)
missing
```

```python
# Duplicate row count
dup_count = df.duplicated().sum()
dup_count
```

_Observations: Briefly note which columns (if any) have missing values and how that could affect analysis. If duplicates exist, state whether you would drop them in future steps._
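
If duplicates or missing values do turn up, a later step might handle them roughly as sketched below. The toy rows are made up for illustration, and dropping rows is just one option (filling missing values is another), so treat this as an assumption rather than the required approach.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): row 2 repeats row 1, and the last row has a missing value.
toy = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Sleep_Hours_Per_Night": [7.0, 6.0, 6.0, np.nan],
})

# Drop exact duplicate rows, keeping the first occurrence.
deduped = toy.drop_duplicates()

# Drop rows where the column we plan to analyze is missing.
cleaned = deduped.dropna(subset=["Sleep_Hours_Per_Night"])

print("Rows before:", len(toy), "| after dedup:", len(deduped), "| after dropna:", len(cleaned))
```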


## Data Wrangling


### 6. Add a New Column Using Existing Columns

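We create a simple **normalized conflict score**: each student's conflict count expressed as a percentage of the maximum observed value, so the highest count maps to 100.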

```python
df_new = df.copy()
max_conflict = df_new['Conflicts_Over_Social_Media'].max()
df_new['Conflict_Normalization'] = (df_new['Conflicts_Over_Social_Media'] / max_conflict) * 100
df_new[['Conflicts_Over_Social_Media', 'Conflict_Normalization']].head()
```

_This new column represents relative conflict level compared to the maximum observed in the data, making values easier to interpret._
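
To make the arithmetic concrete, here is a small worked sketch on made-up conflict counts (not survey values). It also shows min-max scaling for comparison, which is a different technique from the one used above: dividing by the maximum does not send the smallest value to 0, while min-max scaling does.

```python
import pandas as pd

# Made-up conflict counts for illustration only.
conflicts = pd.Series([1, 2, 4, 5], name="Conflicts_Over_Social_Media")

# Percent of maximum (the approach used in this notebook).
pct_of_max = conflicts / conflicts.max() * 100

# Min-max scaling (an alternative, shown only for comparison).
min_max = (conflicts - conflicts.min()) / (conflicts.max() - conflicts.min()) * 100

print(pd.DataFrame({"count": conflicts, "pct_of_max": pct_of_max, "min_max": min_max}))
```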



### 7. Filter the Data (Two Different Filters)

**Filter A: by a categorical variable (e.g., Gender)**  
```python
# If 'Gender' exists, filter to the most common category
if 'Gender' in df_new.columns:
    top_gender = df_new['Gender'].value_counts().idxmax()
    filtered_a = df_new[df_new['Gender'] == top_gender]
    print(f"Filter A: Gender == {top_gender} -> shape: {filtered_a.shape}")
    display(filtered_a.head())
else:
    print("No 'Gender' column found; skipping Filter A.")
```

**Why this subset?** It isolates the largest gender group so it can be compared with the remaining students in later analysis.

**Filter B: using the derived feature (Conflict_Normalization > 50)**  
```python
filtered_b = df_new[df_new['Conflict_Normalization'] > 50]
print("Filter B: Conflict_Normalization > 50 -> shape:", filtered_b.shape)
filtered_b.head()
```

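**Why this subset?** It highlights students whose conflict level is more than half of the maximum observed in the data, a higher-conflict group for further study. Note that this cutoff is relative to the maximum, not the median.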
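
A cutoff of 50 on this scale means half of the observed maximum, which need not coincide with the median. The sketch below, using made-up normalized scores rather than survey values, shows how the two cutoffs can select different numbers of rows.

```python
import pandas as pd

# Made-up normalized scores for illustration, not values from the survey.
scores = pd.Series([20.0, 40.0, 80.0, 100.0, 100.0], name="Conflict_Normalization")

median_score = scores.median()
above_half_max = (scores > 50).sum()          # the cutoff used in Filter B
above_median = (scores > median_score).sum()  # a median-based cutoff, for comparison

print("Median score:", median_score)
print("Rows above 50:", above_half_max)
print("Rows above the median:", above_median)
```

Running the same comparison on `df_new['Conflict_Normalization']` would show how far apart the two cutoffs are in the real data.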



### 8. Unique Values and Categories

_List the unique values of each categorical column along with their counts (missing values included)._

```python
if 'categorical_cols' not in globals():
    categorical_cols = [c for c in df_new.columns if not pd.api.types.is_numeric_dtype(df_new[c])]

for c in categorical_cols:
    print(f"\nColumn: {c}")
    display(df_new[c].value_counts(dropna=False))
```



## (Optional) Two Quick Visuals

```python
# Histogram of Avg_Daily_Usage_Hours
plt.figure()
df_new['Avg_Daily_Usage_Hours'].dropna().hist(bins=20)
plt.title("Histogram: Avg_Daily_Usage_Hours")
plt.xlabel("Hours per day")
plt.ylabel("Count")
plt.show()
```

```python
# Bar chart: Top 5 Most Used Platforms
if 'Most_Used_Platform' in df_new.columns:
    top5 = df_new['Most_Used_Platform'].value_counts().head(5)
    plt.figure()
    plt.bar(top5.index.astype(str), top5.values)
    plt.title("Top 5 Most Used Platforms")
    plt.xlabel("Platform")
    plt.ylabel("Count")
    plt.xticks(rotation=30)
    plt.tight_layout()
    plt.show()
```



## 9. Summary and Conclusion — Initial Observations

_Write a short paragraph in your own words answering:_  
- What was interesting about usage hours, sleep patterns, and conflict/mental‑health scores?  
- Any patterns across platforms or gender/academic level?  
- Any data quality issues (missing values, duplicates) to watch for in Part 2?  

> _Example starter:_ The dataset contains 705 rows and 13 columns covering demographics, usage behavior, and outcomes. Most numeric variables were already typed correctly. Summary statistics suggest moderate average daily usage with a long right tail. Conflict normalization helps identify a high‑conflict subgroup (>50), which could be compared by platform or country in future analysis.
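
As a rough sketch of the follow-up comparison mentioned in the starter, the share of high-conflict students per platform could be computed with a `groupby`. The rows below are made up for illustration, and the `High_Conflict` flag is a hypothetical helper column, not part of the dataset.

```python
import pandas as pd

# Toy data (made-up rows, not survey results).
toy = pd.DataFrame({
    "Most_Used_Platform": ["Instagram", "Instagram", "TikTok", "TikTok", "YouTube"],
    "Conflict_Normalization": [80.0, 40.0, 100.0, 60.0, 20.0],
})

# Flag the high-conflict subgroup with the same > 50 cutoff as Filter B.
toy["High_Conflict"] = toy["Conflict_Normalization"] > 50

# The mean of a boolean column is the fraction of True values; convert to percent.
share = toy.groupby("Most_Used_Platform")["High_Conflict"].mean().mul(100).round(1)

print(share)
```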



---

### Export to HTML (for submission)
In Jupyter/Colab: **File → Download as → HTML** and save as `Manjil_Project2_Part1.html`.

### Reminder
Also submit the `.csv` file you used (e.g., `Project_data.csv`).
