# Welcome to CodeBook – Your Data Science Internship Begins!

# **Part-1 Data Loading and Exploration**
Congratulations! You have just been hired as a **Data Scientist Intern** at **CodeBook – The Social Media for Coders**. The company is offering you a **job** if you successfully complete this **1-month internship**. But before you get there, you must prove your skills using **only Python**: no pandas, NumPy, or other third-party libraries!

Your manager has assigned you your **first task**: analyzing a data dump of CodeBook users using pure Python. Your job is to **load and explore the data** to understand its structure.

---

## **Task 1: Load the User Data**
Your manager has given you a dataset containing information about CodeBook users, their connections (friends), and the pages they have liked.

This is what the data looks like (in JSON format):
```json
{
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "Rahul", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]}
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"}
    ]
}
```
Read this data and understand its structure. The data contains three main components:
1. **Users**: Each user has an ID, name, a list of friends (by their IDs), and a list of liked pages (by their IDs).
2. **Pages**: Each page has an ID and a name.
3. **Connections**: Users can have multiple friends and can like multiple pages.
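
As a rough illustration of this structure, the sketch below holds the sample data as a plain dictionary and builds ID-to-name lookups, so that friend IDs and page IDs can be resolved to names. The variable names (`sample`, `user_names`, `page_names`) are assumptions made for this example and are not part of the assignment.

```python
# Illustrative sketch: the sample data from above as a plain Python dict.
sample = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "Rahul", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
    ],
}

# Build ID -> name lookups for users and pages.
user_names = {user["id"]: user["name"] for user in sample["users"]}
page_names = {page["id"]: page["name"] for page in sample["pages"]}

# Resolve the first user's friend IDs and liked page IDs to names.
amit = sample["users"][0]
amit_friends = [user_names[friend_id] for friend_id in amit["friends"]]
amit_pages = [page_names[page_id] for page_id in amit["liked_pages"]]

print(amit_friends)  # ['Priya', 'Rahul']
print(amit_pages)    # ['Python Developers']
```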
---

## **Task 2: Read and Display the Data using Python**
Your goal is to **load** this data and **print** it in a structured way. Use Python's built-in modules to accomplish this.

We will:
1. Save the JSON data in a file (`data/original_data.json`).
2. Read the JSON file using Python.
3. Print user details and their connections.
4. Print available pages.
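
The first two steps can be sketched as a save-and-reload round trip using only the built-in `json` module. This is a minimal sketch, not the notebook's own code: the file name `sample_data.json` and the temporary directory are assumptions chosen so the example does not overwrite any real file.

```python
import json
import os
import tempfile

# A small slice of the sample data, enough to show the round trip.
sample = {
    "users": [{"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]}],
    "pages": [{"id": 101, "name": "Python Developers"}],
}

# Assumption: a temporary directory stands in for the real data folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_data.json")  # hypothetical file name

    # Step 1: save the data as JSON.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=4)

    # Step 2: read it back into Python objects.
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == sample)  # True
print(sorted(loaded))    # ['pages', 'users']
```

Using `with` for both the write and the read ensures the file is closed (and flushed) before it is read again.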

---

In [1]:
import json

In [2]:
# Let's write a function to load the data

def load_data(filename):
    with open(filename,"r") as f:
        data = json.load(f)
    return data

In [3]:
data = load_data("data/original_data.json")

In [4]:
data 

{'users': [{'id': 1, 'name': 'Amit', 'friends': [2, 3], 'liked_pages': [101]},
  {'id': 2, 'name': 'Priya', 'friends': [1, 4], 'liked_pages': [102]},
  {'id': 3, 'name': 'Rahul', 'friends': [1], 'liked_pages': [101, 103]},
  {'id': 4, 'name': 'Sara', 'friends': [2], 'liked_pages': [104]}],
 'pages': [{'id': 101, 'name': 'Python Developers'},
  {'id': 102, 'name': 'Data Science Enthusiasts'},
  {'id': 103, 'name': 'AI & ML Community'},
  {'id': 104, 'name': 'Web Dev Hub'}]}

In [5]:
# Write a function to display users, their connections, and the available pages

def display_users(data):
    print("Users and their Connections:-")
    for user in data['users']:
        print(f"ID:{user['id']} - {user['name']}'s friends are {user['friends']} and their liked pages are {user['liked_pages']}.")
    print("\nPages information:-")
    for page in data['pages']:
        print(f"{page['id']}: {page['name']}")

display_users(data)

Users and their Connections:-
ID:1 - Amit's friends are [2, 3] and their liked pages are [101].
ID:2 - Priya's friends are [1, 4] and their liked pages are [102].
ID:3 - Rahul's friends are [1] and their liked pages are [101, 103].
ID:4 - Sara's friends are [2] and their liked pages are [104].

Pages information:-
101: Python Developers
102: Data Science Enthusiasts
103: AI & ML Community
104: Web Dev Hub


# **Part-2 Cleaning and Structuring the Data**
Your manager is impressed with your progress but points out that the data is messy. Before we can analyze it effectively, we need to **clean and structure the data** properly.

Your task is to:
- Handle missing values
- Remove duplicate or inconsistent data
- Standardize the data format

Let's get started!

---

## **Task 1: Identify Issues in the Data**
Your manager provides you with a sample dataset in which some records are incomplete or incorrect:

```json
{
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []}
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"}
    ]
}
```

**Problems:**
1. User **ID 3** has an empty name.
2. User **ID 4** has a duplicate friend entry.
3. User **ID 5** has no connections or liked pages (inactive user).
4. The **pages list** contains duplicate page IDs.
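
Before cleaning anything, the four problems listed above can be detected programmatically. The sketch below is one possible way to do that; the function name `find_issues` and the wording of its messages are assumptions for illustration only.

```python
def find_issues(data):
    """Return human-readable descriptions of problems found in the data."""
    issues = []
    for user in data["users"]:
        # Empty or whitespace-only name
        if not user["name"].strip():
            issues.append(f"User {user['id']}: empty name")
        # The same friend ID listed more than once
        if len(user["friends"]) != len(set(user["friends"])):
            issues.append(f"User {user['id']}: duplicate friend entries")
        # No friends and no liked pages
        if not user["friends"] and not user["liked_pages"]:
            issues.append(f"User {user['id']}: inactive (no friends, no liked pages)")

    seen_page_ids = set()
    for page in data["pages"]:
        # A page ID that has already appeared earlier in the list
        if page["id"] in seen_page_ids:
            issues.append(f"Page {page['id']}: duplicate page ID")
        seen_page_ids.add(page["id"])
    return issues


# The messy example dataset from above.
messy = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"},
    ],
}

for issue in find_issues(messy):
    print(issue)
```

Run on the example dataset, this reports one line for each of the four problems: user 3, user 4, user 5, and page 104.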

---

## **Task 2: Clean the Data**
We will:
1. Remove users with missing names.
2. Remove duplicate friend entries.
3. Remove inactive users (users with no friends and no liked pages).
4. Deduplicate pages based on IDs.


In [6]:
def clean_data(data):
    # Remove users with a missing (empty or whitespace-only) name
    data['users'] = [user for user in data['users'] if user['name'].strip()]

    # Remove duplicate friends; dict.fromkeys keeps the first occurrence and preserves order
    for user in data['users']:
        user['friends'] = list(dict.fromkeys(user['friends']))

    # Remove inactive users (users with no friends and no liked pages)
    data['users'] = [user for user in data['users'] if (user['friends'] or user['liked_pages'])]

    # Remove duplicate pages.
    # A dictionary keyed by the page 'id' holds one entry per ID. When a duplicate 'id'
    # is encountered, the new page overwrites the old one, so only the last
    # occurrence of each page ID is kept.
    unique_pages = {}
    for page in data['pages']:
        unique_pages[page['id']] = page
    data['pages'] = list(unique_pages.values())
    return data

# Load the data
with open("data/messy_data.json", "r") as f:
    data = json.load(f)
data = clean_data(data)
# Write the cleaned data; the with-block closes the file so everything is flushed to disk
with open("cleaned_messy_data.json", "w") as f:
    json.dump(data, f, indent=4)
print("Data has been cleaned successfully")

Data has been cleaned successfully


# **Part-3 Feature - "People You May Know"**
Now that our data is cleaned and structured, your manager assigns you a new task: **Build a 'People You May Know' feature!**

In social networks, this feature helps users connect with others by suggesting friends based on mutual connections. Your job is to **analyze mutual friends and recommend potential connections**.

---

## **Task 1: Understand the Logic**
### **How 'People You May Know' Works:**

- If **User A** and **User B** are not friends but have **mutual friends**, we suggest User B to User A and vice versa.
- More mutual friends = higher priority recommendation.

Example:
- **Amit (ID: 1)** is friends with **Priya (ID: 2)** and **Rahul (ID: 3)**.
- **Priya (ID: 2)** is friends with **Sara (ID: 4)**.
- Amit is not directly friends with Sara, but they share **Priya as a mutual friend**.
- Suggest **Sara to Amit** as "People You May Know".
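
The example above boils down to a set intersection. The sketch below reproduces it with the sample friend lists; the names `friends` and `mutual_friends` are assumptions for illustration and are separate from the function built in Task 2.

```python
# Friend sets from the sample data, keyed by user ID:
# 1 = Amit, 2 = Priya, 3 = Rahul, 4 = Sara
friends = {
    1: {2, 3},
    2: {1, 4},
    3: {1},
    4: {2},
}


def mutual_friends(user_a, user_b, friends):
    """Return the set of user IDs that are friends with both users."""
    return friends[user_a] & friends[user_b]


# Amit (1) and Sara (4) are not direct friends...
already_friends = 4 in friends[1]
# ...but they share Priya (2) as a mutual friend.
mutual = mutual_friends(1, 4, friends)

print(already_friends)  # False
print(mutual)           # {2}
```

Because Amit and Sara are not already friends and the intersection is non-empty, Sara is a candidate suggestion for Amit with a score of one mutual friend.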

---

## **Task 2: Implement the Algorithm**
We'll create a function that:
1. Finds all friends of a given user.
2. Identifies mutual friends between non-friends.
3. Ranks recommendations by the number of mutual friends.

In [7]:
def load_data(filename):
    with open(filename,"r") as file:
        return json.load(file)

def find_people_you_may_know(user_id,data):
    # Create a dictionary to map each user ID to a set of their friends for efficient lookup.
    user_friends = {}
    for user in data['users']:
        user_friends[user['id']] = set(user['friends'])

    # If the provided user_id does not exist, return an empty list.
    if user_id not in user_friends:
        return []

    direct_friends = user_friends[user_id]
    suggestions = {}

    # Iterate through all of the user's direct friends.
    for friend in direct_friends:     # e.g. user 1's friends = {2, 3}
        # Look at each friend of that friend; .get() skips friend IDs that have no user record
        for mutual in user_friends.get(friend, set()):    # e.g. user 2's friends = {1, 4}, user 3's friends = {1}
            # Skip the user themselves and anyone who is already a direct friend
            if mutual != user_id and mutual not in direct_friends:
                # Each shared friend adds one to the candidate's mutual-friend count
                suggestions[mutual] = suggestions.get(mutual, 0) + 1
                
    # Sort the suggestions in descending order based on the number of mutual friends.
    sorted_suggestions = sorted(suggestions.items(), key=lambda x:x[1], reverse = True)
    return [suggestion_id for suggestion_id, mutual_count in sorted_suggestions]
    
# Load the data
data = load_data("data/original_data.json")
user_id = 3
recommendations = find_people_you_may_know(user_id,data)
print(f"People You May Know for User {user_id}: {recommendations}")

People You May Know for User 3: [2]


# **Part-4 Feature - "Pages You Might Like"**
We’ve officially reached the final milestone of our first data science project at
CodeBook – The Social Media for Coders. After cleaning messy data and building
features like People You May Know, it’s time to launch our last feature: **"Pages You
Might Like"**.

---
## **Task 1: Understand the Logic**
### **How ‘Pages You Might Like’ Works:**

- Users engage with pages (like, comment, share, etc.).
- If two users have interacted with similar pages, they are likely to have common
interests.
- For the sake of this implementation, we consider liking a page as an interaction.
- Pages liked by similar users should be recommended.
  
Example (illustrative page names and likes, not taken from the dataset above):
- **Amit (ID: 1)** likes Python Hub **(Page ID: 101)** and AI World **(Page ID: 102)**.
- **Priya (ID: 2)** likes AI World **(Page ID: 102)** and Data Science Daily **(Page ID:
103)**.
- Since Amit and Priya both like AI World **(102)**, we suggest Data Science Daily
**(103)** to **Amit** and Python Hub **(101)** to **Priya**.
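
The example above can be checked with a few lines of Python. This is a minimal sketch under the example's own assumptions: the `likes` mapping uses the illustrative page IDs from the bullets, and the helper name `suggest_pages` is hypothetical. Each candidate page is scored by the number of pages the two users have in common.

```python
# Illustrative likes from the example above (not the real dataset).
likes = {
    "Amit": {101, 102},   # Python Hub, AI World
    "Priya": {102, 103},  # AI World, Data Science Daily
}


def suggest_pages(user, likes):
    """Suggest pages liked by users who share at least one liked page with `user`."""
    my_pages = likes[user]
    scores = {}
    for other, pages in likes.items():
        if other == user:
            continue
        shared = my_pages & pages
        if not shared:
            continue  # no common interest, so ignore this user's pages
        # Score every page the other user likes that `user` has not liked yet
        for page in pages - my_pages:
            scores[page] = scores.get(page, 0) + len(shared)
    # Highest score first
    return sorted(scores, key=lambda page: scores[page], reverse=True)


print(suggest_pages("Amit", likes))   # [103]
print(suggest_pages("Priya", likes))  # [101]
```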

---

## **Task 2: Implement the Algorithm**

We’ll create a function that:
1. Maps users to pages they have interacted with.
2. Identifies pages liked by users with similar interests.
3. Ranks recommendations based on common interactions.


In [8]:
# Function to load the json data
def load_data(filename):
    with open(filename, "r") as f:
        return json.load(f)

# Function to find pages a user might like based on common interests
def find_pages_you_might_like(user_id, data):
    # Dictionary to store user interactions with pages
    user_pages = {}
    
    # Populate the dictionary by iterating through the 'users' in the data
    for user in data['users']:
        user_pages[user['id']] = set(user['liked_pages'])

    # If the user is not found, return an empty list
    if user_id not in user_pages:
        return []
        
    user_liked_pages = user_pages[user_id]
    page_suggestion = {}

    # Iterate through all other users to find shared pages.
    for other_user, pages in user_pages.items():
        if other_user == user_id:
            continue
        # Find pages liked by both the target user and the other user
        shared_pages = user_liked_pages.intersection(pages)
        if not shared_pages:
            continue  # No common interests, so this user's pages are not considered

        # Score the other user's remaining pages by the number of shared pages
        for page in pages:
            if page not in user_liked_pages:
                page_suggestion[page] = page_suggestion.get(page, 0) + len(shared_pages)

    # Sort the suggestions in descending order based on their recommendation score
    sorted_pages = sorted(page_suggestion.items(), key=lambda x: x[1], reverse=True)
    return [page_id for page_id, score in sorted_pages]

# Load the data
data = load_data("data/original_data.json")
# User 1 shares page 101 with user 3, who also likes page 103.
# (User 4 shares no liked pages with anyone, so they would get an empty list.)
user_id = 1
page_recommendations = find_pages_you_might_like(user_id, data)
print(f"Pages You Might Like for User {user_id}: {page_recommendations}")

Pages You Might Like for User 1: [103]
