## Data loading

### Subtask:
Load the three CSV files into pandas DataFrames.


**Reasoning**:
Load the three CSV files into pandas DataFrames, ensuring the `account_id` column is treated as a string.



In [1]:
import pandas as pd

# Read every column as text so leading zeros in account IDs are preserved,
# even where the ID header is not spelled exactly 'account_id' (as in Sheet3)
df1 = pd.read_csv('Aws_account_details - Sheet1.csv', dtype=str)
df2 = pd.read_csv('Aws_account_details - Sheet2.csv', dtype=str)
df3 = pd.read_csv('Aws_account_details - Sheet3.csv', dtype=str)
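
Reading `account_id` as a string matters because an ID can begin with a zero, and pandas would otherwise infer an integer column and drop that zero. A minimal sketch of the difference, using an in-memory CSV with made-up values (`sample_csv` and its rows are assumptions for illustration, not data from the real sheets):

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the first ID starts with a zero
sample_csv = "account_id,account_name\n012345678901,example-a\n123456789012,example-b\n"

# Default type inference parses the IDs as integers and loses the leading zero
inferred = pd.read_csv(io.StringIO(sample_csv))

# Forcing a string dtype keeps every ID exactly as written in the file
as_text = pd.read_csv(io.StringIO(sample_csv), dtype={'account_id': str})

print(inferred['account_id'].tolist())
print(as_text['account_id'].tolist())
```

The same check can be run against a real sheet by comparing the length of each loaded ID with the length expected for that ID format.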

## Data preparation

### Subtask:
Prepare the dataframes for merging.


**Reasoning**:
Standardize column names to lowercase, rename the misspelled `account _id` column in df3 to `account_id`, and ensure `account_id` has a consistent string data type across all three dataframes so they can be merged on it.



In [2]:
import pandas as pd

# Standardize column names to lowercase
df1.columns = df1.columns.str.lower()
df2.columns = df2.columns.str.lower()
df3.columns = df3.columns.str.lower()

# Rename the inconsistent column in df3
df3 = df3.rename(columns={'account _id': 'account_id'})

# Ensure 'account_id' is string type in all dataframes
for df in [df1, df2, df3]:
    if df['account_id'].dtype != 'object':
        df['account_id'] = df['account_id'].astype(str)

# Display the column names of each dataframe to confirm changes
display(df1.columns)
display(df2.columns)
display(df3.columns)

Index(['account_id', 'account_name', 'owner', 'entity'], dtype='object')

Index(['account_id', 'account_name'], dtype='object')

Index(['account_id', 'account_name', 'owner'], dtype='object')
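
The lowercase step and the one-off rename of `account _id` could also be folded into a single normalization helper. The sketch below is one possible approach, not the method used in this notebook; `normalize_column` and the `demo` headers are hypothetical names introduced for illustration.

```python
import re
import pandas as pd

def normalize_column(name: str) -> str:
    """Lowercase a header, trim it, and tidy whitespace around underscores."""
    name = name.strip().lower()
    name = re.sub(r'\s*_\s*', '_', name)   # 'account _id' -> 'account_id'
    return re.sub(r'\s+', '_', name)       # 'account name' -> 'account_name'

# Hypothetical headers, for illustration only
demo = pd.DataFrame(columns=['Account _ID', ' Account Name', 'Owner'])
demo.columns = [normalize_column(c) for c in demo.columns]
print(list(demo.columns))
```

Applying one helper to every frame avoids hard-coding a rename for each misspelled header, at the cost of hiding which sheets needed fixing.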

## Data wrangling

### Subtask:
Merge the three dataframes (df1, df2, df3) into a single dataframe.


**Reasoning**:
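Left-merge df2 into df1 on `account_id`, then left-merge df3 into the result. After each merge, fill gaps in `account_name` (and, for df3, `owner`) from the newly merged sheet, then keep the four required columns. Because both merges are left joins, only accounts present in df1 are retained.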



In [3]:
# Merge df2 into df1
merged_df = pd.merge(df1, df2, on='account_id', how='left', suffixes=('_df1', '_df2'))
merged_df['account_name'] = merged_df['account_name_df1'].combine_first(merged_df['account_name_df2'])
# df2 has no 'owner' or 'entity' column, so both carry over from df1 unchanged
merged_df = merged_df[['account_id', 'account_name', 'owner', 'entity']]

# Merge df3 into merged_df
merged_df = pd.merge(merged_df, df3, on='account_id', how='left', suffixes=('_merged', '_df3'))
merged_df['account_name'] = merged_df['account_name_merged'].combine_first(merged_df['account_name_df3'])
merged_df['owner'] = merged_df['owner_merged'].combine_first(merged_df['owner_df3'])
# df3 has no 'entity' column, so 'entity' carries over unchanged
merged_df = merged_df[['account_id', 'account_name', 'owner', 'entity']]

# Display the merged dataframe
display(merged_df)

| | account_id | account_name | owner | entity |
|---|---|---|---|---|
| 0 | 072401294672 | DMS AWS | Abhinav Gupta | OCL |
| 1 | 901553615594 | merchantdashboards | Abhinav Gupta | OCL |
| 2 | 480732456233 | dms-prod_ocl | Abhinav Gupta | OCL |
| 3 | 182401677120 | onboardingengine | Abhinav Gupta | OCL |
| 4 | 389881277731 | merchant-business | Abhinav Gupta | OCL |
| ... | ... | ... | ... | ... |
| 223 | 654654138046 | it-infra | Saurabh Gupta | PAI |
| 224 | 318633126153 | AWS SMS | Saurabh Gupta | OCL |
| 225 | 474140895141 | Enterprise SMS | Saurabh Gupta | OCL |
| 226 | 367464822874 | One97 IT | Saurabh Gupta | OCL |
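
Both merges in this step are left joins, so an account that appears only in `df2` or `df3` does not reach the result. If such accounts should be kept, one option is an outer join with the same `combine_first` coalescing. A minimal sketch with made-up frames (`left`, `right`, and all their values are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({
    'account_id': ['000000000001', '000000000002'],
    'account_name': ['alpha', None],
    'owner': ['owner-a', 'owner-b'],
})
right = pd.DataFrame({
    'account_id': ['000000000002', '000000000003'],
    'account_name': ['beta', 'gamma'],
})

# how='outer' keeps IDs found in either frame
combined = pd.merge(left, right, on='account_id', how='outer',
                    suffixes=('_left', '_right'))

# Prefer the left-hand name and fall back to the right-hand one where it is missing
combined['account_name'] = combined['account_name_left'].combine_first(
    combined['account_name_right'])
combined = combined[['account_id', 'account_name', 'owner']]
print(combined)
```

The account that exists only in `right` is kept, with a missing `owner`, which a later fill step can handle.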


## Data cleaning

### Subtask:
Clean the merged dataframe by handling missing values and removing duplicate rows.


**Reasoning**:
Fill missing values in specified columns with "Unknown" and remove duplicate rows based on "account_id", keeping the first occurrence.



In [4]:
# Fill missing values
for col in ['account_name', 'owner', 'entity']:
    merged_df[col] = merged_df[col].fillna('Unknown')

# Remove duplicate rows based on 'account_id'
merged_df = merged_df.drop_duplicates(subset='account_id', keep='first')

# Display the cleaned dataframe
display(merged_df)

| | account_id | account_name | owner | entity |
|---|---|---|---|---|
| 0 | 072401294672 | DMS AWS | Abhinav Gupta | OCL |
| 1 | 901553615594 | merchantdashboards | Abhinav Gupta | OCL |
| 2 | 480732456233 | dms-prod_ocl | Abhinav Gupta | OCL |
| 3 | 182401677120 | onboardingengine | Abhinav Gupta | OCL |
| 4 | 389881277731 | merchant-business | Abhinav Gupta | OCL |
| ... | ... | ... | ... | ... |
| 211 | 907058949943 | ppbloe-prod ($) | Vishal Jangda | Nearbuy |
| 212 | | International Expansion | Shrey Agarwal | OCL |
| 221 | 905418158003 | FInmate | Saurabh Gupta | OCIL |
| 222 | 767397877262 | it-infra | Saurabh Gupta | PIBPL |
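
`drop_duplicates` treats missing values as equal to each other, so if several rows lack an `account_id`, only the first of them survives this step. The sketch below, on made-up rows, shows one way to inspect repeated and missing IDs before dropping anything; the `records` frame and the variable names are hypothetical.

```python
import pandas as pd

records = pd.DataFrame({
    'account_id': ['000000000001', '000000000001', None, None],
    'account_name': ['alpha', 'alpha-copy', 'no-id-a', 'no-id-b'],
})

# Rows whose ID is missing; these are a data-quality question, not true duplicates
missing_id = records[records['account_id'].isna()]

# Every row that shares a non-missing ID with another row (keep=False marks all copies)
has_id = records.dropna(subset=['account_id'])
repeated = has_id[has_id.duplicated(subset='account_id', keep=False)]

# Deduplicate only the rows that have an ID, then add the ID-less rows back
deduped = pd.concat([has_id.drop_duplicates(subset='account_id', keep='first'),
                     missing_id])
print(len(missing_id), len(repeated), len(deduped))
```

Reviewing `repeated` before deduplicating shows whether the copies actually agree, since `keep='first'` discards the later rows without comparing their other columns.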


## Data preparation

### Subtask:
Reorder the columns in the merged dataframe to match the specified order.


**Reasoning**:
Reorder the columns of the merged dataframe to match the specified order and display the columns to verify the change.



In [5]:
# Reorder the columns
merged_df = merged_df.reindex(columns=['account_id', 'account_name', 'owner', 'entity'])

# Display the reordered columns
display(merged_df.columns)

Index(['account_id', 'account_name', 'owner', 'entity'], dtype='object')
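
One property of `reindex(columns=...)` is worth noting: it does not fail when a requested column is absent, but adds that column filled with NaN. Plain list selection raises `KeyError` instead, which may be preferable when a missing column should count as an error. A small sketch with a hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame({'account_id': ['000000000001'], 'owner': ['owner-a']})
wanted = ['account_id', 'account_name', 'owner']

# reindex silently creates the absent 'account_name' column, filled with NaN
reindexed = frame.reindex(columns=wanted)

# List selection raises KeyError for the same request
try:
    frame[wanted]
    selection_failed = False
except KeyError:
    selection_failed = True

print(list(reindexed.columns), selection_failed)
```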

## Data saving

### Subtask:
Save the merged and cleaned dataframe `merged_df` to a new CSV file named "combined_account_details.csv".


**Reasoning**:
Save the merged dataframe `merged_df` to a CSV file named "combined_account_details.csv" without the index.



In [6]:
merged_df.to_csv('combined_account_details.csv', index=False)
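
The effect of `index=False` can be seen by writing a frame both ways and reading it back: with the index included, the row labels return as an extra column that pandas names `Unnamed: 0`. A sketch with a hypothetical one-row frame:

```python
import io
import pandas as pd

frame = pd.DataFrame({'account_id': ['000000000001'], 'owner': ['owner-a']})

# to_csv() with no path returns the CSV text as a string
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)

# Reading the indexed version back yields an extra, unnamed leading column
reread = pd.read_csv(io.StringIO(with_index), dtype=str)
print(list(reread.columns))

# Without the index, only the real columns come back
reread_clean = pd.read_csv(io.StringIO(without_index), dtype=str)
print(list(reread_clean.columns))
```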

In [None]:
import pandas as pd

# Load the combined CSV file, keeping 'account_id' as text so leading zeros survive
df = pd.read_csv('combined_account_details.csv', dtype={'account_id': str})

# Parse 'account_id' as a number for validation only, coercing errors to NaN
account_num = pd.to_numeric(df['account_id'], errors='coerce')

# Remove rows where 'account_id' is NaN (invalid text or blank)
valid = account_num.notna()
df = df[valid].copy()

# Use an explicit 64-bit integer type: 12-digit IDs do not fit in 32 bits
account_num = account_num[valid].astype('int64')

# Split the DataFrame into two based on a condition (example: account_id > 1000)
df_part1 = df[account_num > 1000]
df_part2 = df[account_num <= 1000]

# Save the DataFrames to separate CSV files
df_part1.to_csv('accounts_final_part1.csv', index=False)
df_part2.to_csv('accounts_final_part2.csv', index=False)
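
AWS account IDs are 12-digit strings, so an alternative to numeric coercion is a pattern check that leaves the values as text. The following is a sketch under that assumption, with made-up rows; the `accounts` frame and the `is_valid` mask are hypothetical names.

```python
import pandas as pd

accounts = pd.DataFrame({
    'account_id': ['012345678901', '12345', 'not-an-id', None],
    'account_name': ['ok', 'too-short', 'text', 'blank'],
})

# True only for values made of exactly 12 digits; missing values count as invalid
is_valid = accounts['account_id'].str.fullmatch(r'\d{12}', na=False)

valid_accounts = accounts[is_valid]
rejected = accounts[~is_valid]
print(valid_accounts['account_id'].tolist(), len(rejected))
```

Keeping `rejected` as a separate frame makes it possible to review the excluded rows rather than discarding them silently.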
