# Notebook 01: Data Cleaning & EDA

## Introduction
- In this notebook, we perform data cleaning and exploratory data analysis (EDA) on our dataset. The goal is to understand the structure of the data, identify missing values and outliers, and gain insights that will inform the subsequent analysis.

### 1- Importing libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("default")


### 2- Loading the dataset
- We will load the dataset into a pandas DataFrame and take a look at the first few rows to understand its structure.

In [None]:
df = pd.read_csv(r"d:\my_projcts\job-salary-prediction\data\raw\salaries.csv")

print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # column names

### 3- Dataset sampling
- To keep the analysis manageable, we take a random sample of 50,000 rows, using a fixed seed (`random_state=42`) so the sample is reproducible. This lets us run EDA without processing the entire dataset, which can be time-consuming.

In [None]:
# Draw 50,000 rows reproducibly and reset the index to 0..n-1
df_sample = df.sample(n=50000, random_state=42).reset_index(drop=True)
df_sample.shape  # (50000, 11)
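
The sampling step can be sketched on a toy frame as follows. This is a minimal illustration, not the notebook's own code: `sample_rows` is a hypothetical helper, and the toy columns and values are made up. It assumes we want to guard against asking for more rows than the dataset contains, since `DataFrame.sample` raises a `ValueError` in that case when sampling without replacement.

```python
import pandas as pd


def sample_rows(frame: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Return up to n random rows with a fresh 0..k-1 index (hypothetical helper)."""
    # Cap n at the number of available rows so sample() cannot raise
    # when the frame is smaller than the requested sample size.
    k = min(n, len(frame))
    return frame.sample(n=k, random_state=seed).reset_index(drop=True)


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({"salary": range(10), "level": list("ABABABABAB")})

small = sample_rows(toy, n=4)     # ordinary case: 4 of 10 rows
capped = sample_rows(toy, n=100)  # request exceeds the data: all 10 rows

print(small.shape, capped.shape)
```

Fixing the seed keeps the sample identical across runs, which makes the counts reported in later cells reproducible.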


### 4- Data Cleaning
- In this step, we will check for missing values, handle duplicates, and perform any necessary data transformations to prepare the dataset for analysis.

In [None]:
display(df_sample.head())
display(df_sample.tail())

In [None]:
# info() prints its summary directly and returns None, so wrapping it in display() would also print "None"
df_sample.info()

In [None]:
display(df_sample.describe(include='all').T)

In [None]:
print(df_sample.isnull().sum()) # no missing values
print(df_sample.duplicated().sum()) # 18058 duplicate rows

In [None]:
# Drop exact duplicate rows and rebuild a contiguous index
df_sample = df_sample.drop_duplicates().reset_index(drop=True)

print(df_sample.duplicated().sum()) # 0 duplicate rows remain
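
Before dropping duplicates, it can help to look at which rows are repeated. The sketch below uses a small made-up frame (the column names `job_title` and `salary` are assumptions for illustration, not taken from the dataset) to show the difference between `duplicated()` and `duplicated(keep=False)`.

```python
import pandas as pd

# Toy data with one repeated row (columns and values are made up).
toy = pd.DataFrame({
    "job_title": ["Analyst", "Engineer", "Analyst", "Analyst"],
    "salary": [50000, 80000, 50000, 60000],
})

# keep=False flags every member of a duplicated group,
# which is useful for inspecting the repeated rows side by side.
dup_mask = toy.duplicated(keep=False)

# The default (keep="first") flags only the later repeats,
# i.e. the rows that drop_duplicates() would remove.
n_repeats = int(toy.duplicated().sum())

# Drop the repeats and rebuild a contiguous index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(toy[dup_mask])
print(n_repeats, len(deduped))
```

Note that a large duplicate count in a dataset with few, low-cardinality columns does not necessarily mean the rows are data-entry errors; whether to drop them is a modeling decision.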


### 5- Type conversion
- We will check the data types of each column and convert them to appropriate types if necessary.

In [None]:
df_sample.dtypes
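
As a rough sketch of what a conversion could look like once the dtypes are known, the example below works on a made-up frame. The column names `experience_level` and `salary`, and the idea that a numeric column arrives as text, are assumptions for illustration only; the actual columns may need no conversion at all.

```python
import pandas as pd

# Toy data: a low-cardinality text column and a numeric column stored as strings
# (columns and values are made up).
toy = pd.DataFrame({
    "experience_level": ["SE", "MI", "SE", "EN"],
    "salary": ["50000", "80000", "n/a", "60000"],
})

# Text columns with few distinct values can be stored as category,
# which saves memory and documents that the values form a fixed set.
toy["experience_level"] = toy["experience_level"].astype("category")

# errors="coerce" turns unparseable strings into NaN instead of raising,
# so bad entries surface later as missing values.
toy["salary"] = pd.to_numeric(toy["salary"], errors="coerce")

print(toy.dtypes)
print(toy["salary"].isna().sum())
```

After a coercing conversion it is worth re-running the missing-value check, since entries that failed to parse now appear as `NaN`.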