In [1]:
import pandas as pd

df = pd.read_csv('../data/raw/job_title_des.csv')

print("Dataset Information:")
df.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2277 entries, 0 to 2276
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       2277 non-null   int64 
 1   Job Title        2277 non-null   object
 2   Job Description  2277 non-null   object
dtypes: int64(1), object(2)
memory usage: 53.5+ KB


In [8]:
print("First 5 rows of the dataset:")
df.head()

First 5 rows of the dataset:


   Unnamed: 0             Job Title                                    Job Description
0           0     Flutter Developer  We are looking for hire experts flutter develo...
1           1      Django Developer  PYTHON/DJANGO (Developer/Lead) - Job Code(PDJ ...
2           2      Machine Learning  Data Scientist (Contractor)\n\nBangalore, IN\n...
3           3         iOS Developer  JOB DESCRIPTION:\n\nStrong framework outside o...
4           4  Full Stack Developer  job responsibility full stack engineer – react...


In [9]:
print("Column Names:")
df.columns.tolist()

Column Names:


['Unnamed: 0', 'Job Title', 'Job Description']

In [10]:
print("Total Jobs:", len(df))

Total Jobs: 2277


## Step 2: Check Shape of Dataset

The `shape` attribute returns a tuple `(rows, columns)`.
It tells us the size of the dataset, which matters when judging how much data we have to work with.

In [11]:
print(f"Dataset Shape: {df.shape}")
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")

Dataset Shape: (2277, 3)
Rows: 2277
Columns: 3


## Step 3: Check for Missing Values

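`isnull()` returns a Boolean DataFrame in which each cell is `True` where the value is missing and `False` otherwise.
Chaining `sum()` counts the `True` values, giving the number of missing values in each column.
This check is critical for data cleaning, because missing data must be handled before building models.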

In [12]:
print("Missing Values per Column:")
print(df.isnull().sum())

print("\nPercentage of Missing Values:")
print((df.isnull().sum() / len(df)) * 100)

Missing Values per Column:
Unnamed: 0         0
Job Title          0
Job Description    0
dtype: int64

Percentage of Missing Values:
Unnamed: 0         0.0
Job Title          0.0
Job Description    0.0
dtype: float64
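
This dataset has no missing values, so no cleaning is needed here. For reference, the following is a minimal sketch of how missing values could be handled if they did appear. The `toy` frame and the `"Unknown"` placeholder are invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data with gaps; the real dataset has none.
toy = pd.DataFrame({
    "Job Title": ["Java Developer", None, "PHP Developer"],
    "Job Description": ["Build backend services.", "Maintain web apps.", None],
})

# Count missing values per column, as in the cell above.
missing_counts = toy.isnull().sum()
print(missing_counts)

# A row without a description has no text to analyse, so drop it.
cleaned = toy.dropna(subset=["Job Description"]).copy()

# A missing title can be kept under a placeholder label (an assumption, not a rule).
cleaned["Job Title"] = cleaned["Job Title"].fillna("Unknown")
print(cleaned)
```

Whether to drop or fill depends on the column: for a text-based recommender, a missing description is usually unrecoverable, while a missing title may still be usable.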


## Step 4: Examine a Sample Job Description

Let's look at one complete job description to understand the data quality.
This helps us see what kind of text we're working with for recommendations.
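
One way to inspect a single description in full is sketched below. The `sample_df` frame and the `show_description` helper are hypothetical stand-ins so the snippet runs on its own; in the notebook, the same helper could be called on `df`.

```python
import pandas as pd

# Hypothetical stand-in data; in the notebook, `df` is already loaded.
sample_df = pd.DataFrame({
    "Job Title": ["Example Developer", "Example Analyst"],
    "Job Description": [
        "We need an example developer.\n\nResponsibilities:\n- Build example apps",
        "We need an example analyst to write reports.",
    ],
})


def show_description(frame: pd.DataFrame, position: int = 0) -> str:
    """Return the title, length, and untruncated description of one row."""
    row = frame.iloc[position]  # positional lookup, independent of index labels
    text = str(row["Job Description"])
    return (
        f"Job Title: {row['Job Title']}\n"
        f"Length: {len(text)} characters, {len(text.split())} words\n\n"
        f"{text}"
    )


summary = show_description(sample_df, 0)
print(summary)
```

Printing the string, rather than displaying the DataFrame cell, avoids the 50-character truncation seen in `df.head()` and renders embedded `\n` characters as real line breaks.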

## Step 5: Job Description Statistics

Let's check the distribution of description lengths across all jobs.

In [14]:
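# Calculate word and character counts for all descriptions
# (str(x) keeps both counts safe if a non-string value ever appears)
df['word_count'] = df['Job Description'].apply(lambda x: len(str(x).split()))
df['char_count'] = df['Job Description'].apply(lambda x: len(str(x)))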

print("Description Word Count Statistics:")
print(df['word_count'].describe())

print("\nDescription Character Count Statistics:")
print(df['char_count'].describe())

# Find shortest and longest descriptions
# idxmin()/idxmax() return index labels, so look them up with .loc rather than .iloc
shortest_idx = df['word_count'].idxmin()
longest_idx = df['word_count'].idxmax()

print(f"\nShortest description ({df['word_count'].min()} words): {df['Job Title'].loc[shortest_idx]}")
print(f"Longest description ({df['word_count'].max()} words): {df['Job Title'].loc[longest_idx]}")

Description Word Count Statistics:
count    2277.000000
mean      276.201581
std       201.828979
min        12.000000
25%       130.000000
50%       223.000000
75%       372.000000
max      1575.000000
Name: word_count, dtype: float64

Description Character Count Statistics:
count     2277.000000
mean      1986.595520
std       1442.771512
min        116.000000
25%        921.000000
50%       1604.000000
75%       2691.000000
max      10802.000000
Name: char_count, dtype: float64

Shortest description (12 words): Node js developer
Longest description (1575 words): iOS Developer
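
The summary statistics show a wide spread, from 12 to 1575 words. As a rough sketch, word counts can also be grouped into length buckets with `pd.cut` to see how many descriptions fall into each range. The bin edges and labels below are arbitrary assumptions, and the five sample values are simply the minimum, quartiles, and maximum reported above, not the full column.

```python
import pandas as pd

# Sample values taken from the describe() output above (min, 25%, 50%, 75%, max).
word_counts = pd.Series([12, 130, 223, 372, 1575], name="word_count")

# Hypothetical bucket boundaries; adjust to suit the analysis.
bins = [0, 100, 300, 600, float("inf")]
labels = ["very short", "short", "medium", "long"]

# Each value is assigned to the interval (left, right] that contains it.
length_bucket = pd.cut(word_counts, bins=bins, labels=labels)

# Count descriptions per bucket, ordered from shortest to longest bucket.
bucket_sizes = length_bucket.value_counts().sort_index()
print(bucket_sizes)
```

In the notebook, the same call could be applied to `df['word_count']` to decide, for example, whether very short descriptions carry enough text to be useful.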


## Step 6: Unique Job Titles

Let's see how many unique job titles we have and the most common ones.

In [15]:
print(f"Total unique job titles: {df['Job Title'].nunique()}")

print("\nTop 10 Most Common Job Titles:")
print(df['Job Title'].value_counts().head(10))

print("\nJob Titles with only one occurrence:")
title_counts = df['Job Title'].value_counts()  # compute the counts once and reuse them
single_occurrence = title_counts[title_counts == 1]
print(f"Count: {len(single_occurrence)}")
print(single_occurrence.head(10).index.tolist())

Total unique job titles: 15

Top 10 Most Common Job Titles:
Job Title
JavaScript Developer    166
Java Developer          161
Software Engineer       160
Node js developer       160
iOS Developer           159
PHP Developer           156
Flutter Developer       155
DevOps Engineer         155
Django Developer        152
Machine Learning        152
Name: count, dtype: int64

Job Titles with only one occurrence:
Count: 0
[]
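
The ten counts shown above are close together (152 to 166). A simple way to quantify balance across titles is to look at relative shares and the ratio between the largest and smallest group. The sketch below uses an invented `titles` series so it runs on its own; `imbalance_ratio` is a name chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy titles; in the notebook this would be df['Job Title'].
titles = pd.Series(
    ["Java Developer"] * 3 + ["PHP Developer"] * 1,
    name="Job Title",
)

counts = titles.value_counts()                # absolute count per title
shares = titles.value_counts(normalize=True)  # fraction of rows per title

# Ratio of the most common to the least common title; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio near 1 suggests no single title dominates the dataset.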
