# LinkedIn Data Cleaning

## Objective:
This notebook aims to process and analyze LinkedIn data to extract a clean network representation of student connections. Below are the key steps:

<h3>1. Data Ingestion & Conversion</h3>
Some files were submitted in .xlsx format, so they were converted to .csv.

<h3>2. LinkedIn Data Extraction</h3>
Read and parse the LinkedIn connection CSV files.
Extract user connection pairs (edges) from the dataset.
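
A minimal sketch of the extraction step, assuming each file belongs to one student (the owner) and has `First Name` and `Last Name` columns. The helper name `extract_edges` and the sample rows are hypothetical and only illustrate the idea.

```python
import csv
import io

def extract_edges(owner, csv_file):
    """Yield (owner, connection_name) pairs from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        first = (row.get("First Name") or "").strip()
        last = (row.get("Last Name") or "").strip()
        name = f"{first} {last}".strip()
        if name:  # skip rows that carry no usable name
            yield owner, name

# Made-up in-memory data standing in for a real connections file.
sample = io.StringIO(
    "First Name,Last Name,Company\n"
    "Asha,Rao,Example Corp\n"
    ",,\n"
    "Ravi,Mehta,\n"
)
edges = list(extract_edges("Student One", sample))
print(edges)
```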

<h3>3. Data Cleaning & Graph Construction</h3>
Build an adjacency list to represent the connections as a graph.
Remove invalid or duplicate entries.
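
One way this step could look, assuming connections are treated as undirected (an assumption, not something stated above). Using a set per node drops duplicate edges automatically. `build_adjacency` and the sample names are hypothetical.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Build an undirected adjacency list, skipping blank names and self-loops."""
    graph = defaultdict(set)  # a set per node discards duplicate edges
    for a, b in edges:
        a, b = a.strip(), b.strip()
        if not a or not b or a == b:
            continue  # invalid entry
        graph[a].add(b)
        graph[b].add(a)
    # Sorted lists are easier to inspect and to serialize later.
    return {node: sorted(neighbours) for node, neighbours in graph.items()}

# Made-up edges: one duplicate, one self-loop, one blank name.
edges = [
    ("Asha Rao", "Ravi Mehta"),
    ("Ravi Mehta", "Asha Rao"),
    ("Asha Rao", "Asha Rao"),
    ("Asha Rao", ""),
]
adjacency = build_adjacency(edges)
print(adjacency)
```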

<h3>4. Graph Summarization</h3>
Use the First Year Batch List (All.csv) to provide summaries:
Number of valid students.
Connection patterns within the batch.
Disconnected or isolated nodes.
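
The summaries listed above might be computed roughly as follows. This sketch assumes the batch list has already been loaded into a plain list of names and that the adjacency list is symmetric. The function name `summarize` and the single-letter names are placeholders.

```python
def summarize(adjacency, batch_names):
    """Summarize connections restricted to students on the batch list."""
    batch = set(batch_names)
    # Keep only neighbours who are themselves on the batch list.
    within = {
        name: [n for n in adjacency.get(name, []) if n in batch]
        for name in batch
    }
    isolated = sorted(name for name, neighbours in within.items() if not neighbours)
    return {
        "valid_students": len(batch),
        # Each undirected edge is stored twice, once per endpoint.
        "within_batch_edges": sum(len(n) for n in within.values()) // 2,
        "isolated": isolated,
    }

# "X" is connected to "A" but is not on the batch list.
adjacency = {"A": ["B", "X"], "B": ["A"], "X": ["A"]}
summary = summarize(adjacency, ["A", "B", "C"])
print(summary)
```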

<h3>5. Output & Export</h3>
Save the cleaned graph data as a .json file for downstream usage.
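
A small sketch of the export step. The file name `cleaned_graph.json` is a hypothetical choice, and a temporary directory is used here only so the example leaves nothing behind.

```python
import json
import os
import tempfile

# Made-up adjacency list standing in for the cleaned graph.
adjacency = {"Asha Rao": ["Ravi Mehta"], "Ravi Mehta": ["Asha Rao"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "cleaned_graph.json")  # hypothetical file name
    with open(out_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps any non-ASCII names readable in the file.
        json.dump(adjacency, f, indent=2, ensure_ascii=False)
    # Read the file back to confirm the round trip preserves the graph.
    with open(out_path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == adjacency)
```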

# Manual Work
- Some students submitted CSV files whose names did not match their LinkedIn profile names, so the file names were corrected manually to ensure accurate mapping to LinkedIn profiles.
- Extracted one zip file.
- Some students submitted multiple files, so only the main copy of each was kept.
- Made a few other necessary deletions and corrections manually.
- "Aman Adarsh", "Samina Sultana" and "Sneha Shaw" each submitted two files, so the extra copies were deleted.
- Some student names did not come out as expected. For example, "Anand Kumar Pandey" appeared as "Anand Pandey", which does not match the name used in the connection data, so such names were changed manually.
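
Manual name corrections like the one above could also be recorded in a small lookup table so that they are repeatable. This is only a sketch: `NAME_FIXES` and `canonical_name` are hypothetical names, and the single entry is taken from the example in the list.

```python
# Map a name as it appears in a file name to the name used in the connection data.
NAME_FIXES = {
    "Anand Pandey": "Anand Kumar Pandey",
}

def canonical_name(name):
    """Return the corrected name if one is recorded, else the stripped input."""
    name = name.strip()
    return NAME_FIXES.get(name, name)

print(canonical_name(" Anand Pandey "))
print(canonical_name("Alok Raj"))
```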

In [18]:
# Import required packages and dependencies.
import os
import csv
import json
from collections import defaultdict
import pandas as pd


In [19]:
import os
import pandas as pd
import re

folder_path = r'C:\Desktop\MFC ass\LinkedIn Data Public'

for filename in os.listdir(folder_path):
    if filename.endswith(('.csv', '.xlsx')):
        old_file_path = os.path.join(folder_path, filename)

        # Drop the extension, keep the part after the first '-' (the profile name),
        # and strip any leading non-letter characters.
        name_part = filename.rsplit('.', 1)[0]
        name_part = name_part.split('-', 1)[-1] if '-' in name_part else name_part
        name_part = re.sub(r'^[^a-zA-Z]+', '', name_part).strip()

        # Normalize separators and capitalize each word.
        name_clean = name_part.replace('_', ' ').replace('-', ' ')
        name_clean = ' '.join(word.capitalize() for word in name_clean.split())

        new_filename = name_clean + '.csv'
        new_file_path = os.path.join(folder_path, new_filename)

        try:
            if filename.endswith('.xlsx'):
                # Convert Excel files to CSV, then delete the original.
                df = pd.read_excel(old_file_path)
                df.to_csv(new_file_path, index=False)
                os.remove(old_file_path)
            else:
                os.rename(old_file_path, new_file_path)

            print(f"Processed: {filename} -> {new_filename}")
        except Exception as e:
            print(f"Error processing {filename}: {e}")

# Errors are reported per file above, so this only signals that the loop ended.
print("Finished renaming and converting files.")


Processed: Aaditya_Raj - Aaditya Raj.csv -> Aaditya Raj.csv
Processed: Abhishek_Singh - Abhishek Singh.csv -> Abhishek Singh.csv
Processed: Aditya_Singh - Aditya NO-LASTNAME.csv -> Aditya No Lastname.csv
Processed: Afzal_Raza - Afzl Raza.csv -> Afzl Raza.csv
Processed: Ajay Jatav Connections-1 - Ajay Jatav.csv -> Ajay Jatav.csv
Processed: Ajit_Yadav - Ajit Yadav.csv -> Ajit Yadav.csv
Processed: Akanksha_Kushwaha - Akanksha.csv -> Akanksha.csv
Processed: Alok_raj - Alok Raj.csv -> Alok Raj.csv
Processed: Aman_ Adarsh.csv -> Aman Adarsh.csv
Processed: Aman_Singh - Aman Singh.csv -> Aman Singh.csv
Processed: amit_kumar - Amit Kumar.csv -> Amit Kumar.csv
Processed: Anamika_Kumari - Anamika Kumari.csv -> Anamika Kumari.csv
Processed: Anand_Pandey - Anand Pandey.csv -> Anand Pandey.csv
Processed: Anoop_Kumar - ANOOP KUMAR.csv -> Anoop Kumar.csv
Processed: Anshu_Kumar - Anshu Kumar.csv -> Anshu Kumar.csv
Processed: Anuradha_Tiwari - Anuradha Tiwari.csv -> Anuradha Tiwari.csv
Processed: Anushr
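
The renaming logic from the cell above can be pulled into a pure function, which makes it easy to check against the output shown and to spot files that would be renamed to the same target before anything on disk changes. The helper names `clean_student_name` and `find_collisions` are hypothetical, and the second file name in the collision example is invented for illustration.

```python
import re
from collections import defaultdict

def clean_student_name(filename):
    """Apply the same name-cleaning steps as the renaming cell, without touching disk."""
    stem = filename.rsplit(".", 1)[0]
    if "-" in stem:
        stem = stem.split("-", 1)[-1]
    stem = re.sub(r"^[^a-zA-Z]+", "", stem).strip()
    words = stem.replace("_", " ").replace("-", " ").split()
    return " ".join(word.capitalize() for word in words)

def find_collisions(filenames):
    """Group file names that would be renamed to the same target name."""
    targets = defaultdict(list)
    for filename in filenames:
        targets[clean_student_name(filename)].append(filename)
    return {name: files for name, files in targets.items() if len(files) > 1}

print(clean_student_name("Aditya_Singh - Aditya NO-LASTNAME.csv"))

# The second file name below is hypothetical, added only to show a collision.
collisions = find_collisions([
    "Aman_ Adarsh.csv",
    "Aman_Adarsh - Aman Adarsh.csv",
    "Alok_raj - Alok Raj.csv",
])
print(collisions)
```

Running a check like `find_collisions` on the folder listing first would flag students who submitted more than one file, which is the case handled by hand in the Manual Work notes.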

### ✅ Cleaned Pooran Singh CSV File Separately

I have cleaned the **Pooran Singh** data file which had the following issues:
- Merged first and last names in a single column
- Missing or misaligned columns
- Extra/missing name parts

Now the file has:
- Properly separated **First Name**, **Last Name**, and **Company** columns
- Handled edge cases like middle names or missing values
- Saved as **Pooran_Singh.csv** in a consistent format

#### Python Code Used:

```python
import pandas as pd

# This file is read as tab-separated.
df = pd.read_csv(r"LinkedIn Data Public\Pooran Singh.csv", sep="\t", engine='python')
df.columns = df.columns.str.strip()

def clean_cell(value):
    # Treat missing values (NaN) as empty strings before stripping whitespace.
    return '' if pd.isna(value) else str(value).strip()

cleaned_data = []

for _, row in df.iterrows():
    first = clean_cell(row['First Name'])
    last = clean_cell(row['Last Name'])
    company = clean_cell(row['Company']) if 'Company' in row else ''
    cleaned_data.append([first, last, company])

cleaned_df = pd.DataFrame(cleaned_data, columns=['First Name', 'Last Name', 'Company'])
cleaned_df.to_csv("Pooran_Singh.csv", index=False)

print("Data cleaning complete! The cleaned data is saved as 'Pooran_Singh.csv'.")
```


In [24]:
import os
import pandas as pd

folder_path = r"C:\Desktop\MFC ass\LinkedIn Data Public"

def clean_text(text, allow_digits=False):
    # Keep only letters and whitespace (plus digits when allow_digits is True).
    return "".join(
        char for char in text
        if char.isalpha() or char.isspace() or (allow_digits and char.isdigit())
    )

def clean_cell(value, allow_digits=False):
    # Missing values (NaN) become empty strings instead of the text "nan".
    return "" if pd.isna(value) else clean_text(str(value), allow_digits)

for file_name in os.listdir(folder_path):
    if file_name.endswith(".csv"):
        file_path = os.path.join(folder_path, file_name)
        try:
            df = pd.read_csv(file_path, encoding='ISO-8859-1')
        except Exception as e:
            # Report unreadable files instead of skipping them silently.
            print(f"Skipping {file_name}: {e}")
            continue

        df.columns = [col.strip() for col in df.columns]

        if 'First Name' in df.columns and 'Last Name' in df.columns and 'Company' in df.columns:
            for column in ('First Name', 'Last Name'):
                df[column] = df[column].map(clean_cell)
            # Company names may legitimately contain digits.
            df['Company'] = df['Company'].map(lambda value: clean_cell(value, allow_digits=True))

            df.to_csv(file_path, index=False)
print("All files cleaned successfully.")

All files cleaned successfully.


In [23]:

import os

folder_path = r"C:\Desktop\MFC ass\LinkedIn Data Public"
count = 0

for file_name in os.listdir(folder_path):
    if file_name.endswith(".csv"):
        count += 1

print("Total CSV files:", count)


Total CSV files: 126
