# Cleaning and Structuring the Data

## **Introduction**
The raw data is messy. Before I can analyze it effectively, I need to **clean and structure** it properly.

---

## **Task 1: Identify Issues in the Data**
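Some records in the sample below are incomplete or incorrect: user 3 has an empty name, user 4 lists friend 2 twice, user 5 has no friends and no liked pages, and page ID 104 appears twice with different names.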

```json
{
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []}
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"}
    ]
}
```
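The issues in this sample can also be found programmatically rather than by eye. The following is a minimal sketch, not part of the original notebook: the helper name `find_issues` and the shape of its report are assumptions chosen for illustration, and the `sample` dictionary is simply the JSON above written as a Python literal.

```python
from collections import Counter


def find_issues(data):
    """Report the IDs of records that break each rule (hypothetical helper)."""
    page_id_counts = Counter(page["id"] for page in data["pages"])
    return {
        # Users whose name is empty or only whitespace
        "missing_name": [u["id"] for u in data["users"] if not u["name"].strip()],
        # Users whose friends list contains a repeated ID
        "duplicate_friends": [
            u["id"]
            for u in data["users"]
            if len(u["friends"]) != len(set(u["friends"]))
        ],
        # Users with no friends and no liked pages
        "inactive": [
            u["id"]
            for u in data["users"]
            if not u["friends"] and not u["liked_pages"]
        ],
        # Page IDs that occur more than once
        "duplicate_page_ids": [pid for pid, n in page_id_counts.items() if n > 1],
    }


# The sample from above as a Python literal
sample = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"},
    ],
}

print(find_issues(sample))
# → {'missing_name': [3], 'duplicate_friends': [4], 'inactive': [5], 'duplicate_page_ids': [104]}
```

Each key in the report corresponds to one of the cleaning rules applied in Task 2.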

---

## **Task 2: Clean the Data**

1. Remove users with missing names.
2. Remove duplicate friend entries.
3. Remove inactive users (users with no friends and no liked pages).
4. Deduplicate pages based on IDs.



```python
import json


def clean_data(data):
    # Remove users with missing or whitespace-only names
    data["users"] = [user for user in data["users"] if user["name"].strip()]

    # Remove duplicate friend IDs, keeping the original order
    for user in data["users"]:
        user["friends"] = list(dict.fromkeys(user["friends"]))

    # Remove inactive users (no friends and no liked pages)
    data["users"] = [
        user for user in data["users"] if user["friends"] or user["liked_pages"]
    ]

    # Deduplicate pages by ID; when an ID repeats, the last entry wins
    unique_pages = {}
    for page in data["pages"]:
        unique_pages[page["id"]] = page
    data["pages"] = list(unique_pages.values())
    return data


# Load the raw data, clean it, and write the result to a new file
with open("data2.json") as f:
    data = json.load(f)

data = clean_data(data)

with open("cleaned_data2.json", "w") as f:
    json.dump(data, f, indent=4)

print("Data has been cleaned successfully")
```

Output:

```
Data has been cleaned successfully
```
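A success message alone does not show that the four rules actually hold, so a quick check on the result can help. The sketch below is an illustration rather than part of the original notebook: `check_cleaned` is a hypothetical helper, and the `cleaned` dictionary is the result I would expect from cleaning the Task 1 sample, assuming `data2.json` contains exactly that sample.

```python
def check_cleaned(data):
    """Return a list of rule violations; an empty list means all rules hold."""
    problems = []
    for user in data["users"]:
        if not user["name"].strip():
            problems.append(f"user {user['id']}: missing name")
        if len(user["friends"]) != len(set(user["friends"])):
            problems.append(f"user {user['id']}: duplicate friends")
        if not user["friends"] and not user["liked_pages"]:
            problems.append(f"user {user['id']}: inactive")

    page_ids = [page["id"] for page in data["pages"]]
    if len(page_ids) != len(set(page_ids)):
        problems.append("pages: duplicate IDs")
    return problems


# Assumed result of cleaning the Task 1 sample (users 3 and 5 removed,
# user 4's repeated friend collapsed, last entry kept for page 104)
cleaned = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Development"},
    ],
}

print(check_cleaned(cleaned))  # → []
```

In practice the same function could be run on the contents of `cleaned_data2.json`. Note that user 1 still lists friend 3, who was removed for having no name; the four rules above do not cover such leftover references.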
