# Data Science Job Postings Analysis on Glassdoor Dataset

## Introduction
This project prepares a dataset of Data Science job postings obtained from Glassdoor for analysis: it loads the raw data, inspects it, cleans and transforms it, and saves the result.

## Setup
This notebook uses Python with pandas, NumPy, seaborn, and Matplotlib. Make sure they are installed before running the cells.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

## Data Loading
Load the Data Science Job Postings dataset into a Pandas DataFrame.

In [2]:
df = pd.read_csv('DS_jobs.csv')

## Explore the Data
Examine the structure of the dataset to identify missing values, outliers, and potential issues.
#### Display the Head of the DataFrame

In [3]:
df.head()

Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
0,0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna"
1,1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1
2,2,Data Scientist,$137K-$171K (Glassdoor est.),Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1
3,3,Data Scientist,$137K-$171K (Glassdoor est.),JOB DESCRIPTION:\n\nDo you have a passion for ...,3.5,INFICON\n3.5,"Newton, MA","Bad Ragaz, Switzerland",501 to 1000 employees,2000,Company - Public,Electrical & Electronic Manufacturing,Manufacturing,$100 to $500 million (USD),"MKS Instruments, Pfeiffer Vacuum, Agilent Tech..."
4,4,Data Scientist,$137K-$171K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee"


#### Display the Shape of the Dataset

In [4]:
df.shape

(672, 15)

#### Display Column Names

In [5]:
df.columns

Index(['index', 'Job Title', 'Salary Estimate', 'Job Description', 'Rating',
       'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors'],
      dtype='object')

#### Display Basic Information About the Dataset


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 672 entries, 0 to 671
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              672 non-null    int64  
 1   Job Title          672 non-null    object 
 2   Salary Estimate    672 non-null    object 
 3   Job Description    672 non-null    object 
 4   Rating             672 non-null    float64
 5   Company Name       672 non-null    object 
 6   Location           672 non-null    object 
 7   Headquarters       672 non-null    object 
 8   Size               672 non-null    object 
 9   Founded            672 non-null    int64  
 10  Type of ownership  672 non-null    object 
 11  Industry           672 non-null    object 
 12  Sector             672 non-null    object 
 13  Revenue            672 non-null    object 
 14  Competitors        672 non-null    object 
dtypes: float64(1), int64(2), object(12)
memory usage: 78.9+ KB


## Comprehensive Data Cleaning and Transformation Pipeline

### Data Cleaning and Transformation


In [7]:
# Strip leading/trailing whitespace and replace embedded newlines with spaces in every string cell
# (on pandas >= 2.1, DataFrame.map is the non-deprecated equivalent of applymap)
df = df.applymap(lambda x: x.strip().replace('\n', ' ') if isinstance(x, str) else x)

In [8]:
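# Remove the rating appended to the end of 'Company Name' (e.g. 'Healthfirst 3.1' -> 'Healthfirst').
# The pattern is anchored to the end of the string so digits or dots inside a name are kept.
df['Company Name'] = df['Company Name'].str.replace(r'\s+\d+(?:\.\d+)?$', '', regex=True).str.strip()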
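A pattern that strips every digit and dot would also alter names that legitimately contain them. The sketch below compares such a character-class pattern with one anchored to the end of the string. The sample values are hypothetical and only mimic the raw "name, newline, rating" format shown in the `df.head()` output; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values in the raw "name\nrating" format
names = pd.Series(["Healthfirst\n3.1", "ManTech\n4.2", "3M\n4.0", "Hatch Data Inc"])

# Character-class pattern: removes every digit, '+', and '.' anywhere in the string
stripped_everywhere = names.str.replace(r"[\d+.]", "", regex=True).str.strip()

# Anchored pattern: removes only a trailing "<whitespace><number>" suffix
stripped_suffix = names.str.replace(r"\s+\d+(?:\.\d+)?$", "", regex=True).str.strip()

print(stripped_everywhere.tolist())
print(stripped_suffix.tolist())
```

The anchored version still assumes that a number at the very end of a name is a rating, which may not hold for every company.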

In [9]:
# Extract minimum and maximum salary values from the 'Salary Estimate' column
salary_range = df['Salary Estimate'].str.extract(r'(\d+)K.*?(\d+)K')

# Convert extracted values to numeric and handle errors by coercing them to NaN
df['Min Salary'] = pd.to_numeric(salary_range[0], errors='coerce')
df['Max Salary'] = pd.to_numeric(salary_range[1], errors='coerce')

# Calculate average salary based on minimum and maximum values
df['Average Salary'] = (df['Min Salary'] + df['Max Salary']) / 2

# Rewrite 'Salary Estimate' in a compact 'minK - maxK' form (values are in thousands)
df['Salary Estimate'] = df['Min Salary'].astype(str) + 'K - ' + df['Max Salary'].astype(str) + 'K'
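To see what the extraction pattern does in isolation, here is a minimal sketch on three strings. The first matches the format shown in the `df.head()` output; the other two are hypothetical inputs added to show a second range and a value that does not match.

```python
import pandas as pd

raw = pd.Series([
    "$137K-$171K (Glassdoor est.)",  # format seen in the dataset preview
    "$90K-$109K (Glassdoor est.)",   # hypothetical second range
    "not listed",                    # hypothetical non-matching value
])

# Two capture groups: the number before the first 'K' and the number before the next 'K'
bounds = raw.str.extract(r"(\d+)K.*?(\d+)K")

# Non-matching rows yield NaN in both groups and stay NaN after conversion
min_salary = pd.to_numeric(bounds[0], errors="coerce")
max_salary = pd.to_numeric(bounds[1], errors="coerce")
average = (min_salary + max_salary) / 2

print(pd.DataFrame({"min": min_salary, "max": max_salary, "avg": average}))
```

Because unmatched strings become NaN rather than raising an error, it is worth counting NaN values in the new columns after running this step on real data.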

In [10]:
# Flag each skill: 1 if its keyword appears anywhere in 'Job Description'
# (case-insensitive substring match), otherwise 0
skills = {'Python': 'python', 'SQL': 'sql', 'Machine learning': 'machine learning',
          'BIG DATA': 'big data', 'Hadoop': 'hadoop', 'Spark': 'spark', 'AWS': 'aws',
          'Tableau': 'tableau', 'Power BI': 'power bi', 'Excel': 'excel'}
description = df['Job Description'].str.lower()
for column, keyword in skills.items():
    df[column] = description.str.contains(keyword, regex=False).astype(int)
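Substring matching can produce false positives: "excel" also occurs in "excellent", and "sql" in "NoSQL". The sketch below contrasts it with whole-word matching. It is an alternative under stated assumptions, not the method used in this notebook; the helper `contains_word` and the sample descriptions are hypothetical.

```python
import re
import pandas as pd

# Hypothetical job descriptions
descriptions = pd.Series([
    "Excellent communication skills; experience with NoSQL stores.",
    "Strong SQL and Excel skills required.",
])

def contains_word(text: pd.Series, keyword: str) -> pd.Series:
    """Return 1 where keyword occurs as a whole word (case-insensitive), else 0."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return text.str.contains(pattern, case=False, regex=True).astype(int)

# Substring match, as in the cell above
substring_excel = descriptions.str.lower().str.contains("excel", regex=False).astype(int)

# Whole-word match
word_excel = contains_word(descriptions, "excel")
word_sql = contains_word(descriptions, "sql")

print(substring_excel.tolist(), word_excel.tolist(), word_sql.tolist())
```

Whole-word matching is stricter, so it would change the flag counts; whether "NoSQL" should count as an SQL skill is a judgment call for the analysis.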

In [11]:
# Replace the placeholder rating -1 with 0
df['Rating'] = df['Rating'].replace(-1, 0)

In [12]:
# Add a 'Company Age' column; an unknown founding year (-1) yields an age of -1
current_year = datetime.now().year
df['Company Age'] = current_year - df['Founded'].replace(-1, current_year)
# Ages below 1 (unknown year, or founded in the current year) are stored as -1
df['Company Age'] = df['Company Age'].apply(lambda x: -1 if x < 1 else x)
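The age calculation can be traced on a few values. This sketch repeats the same two steps with a fixed `reference_year`, which is an assumption made here so the result is reproducible; the notebook itself uses `datetime.now().year`, so its ages depend on when it is run.

```python
import pandas as pd

# Two founding years from the preview, one unknown (-1), one equal to the reference year
founded = pd.Series([1993, 1968, -1, 2023])
reference_year = 2023  # assumed fixed value, for illustration only

# Unknown years are replaced by the reference year, giving an age of 0
age = reference_year - founded.replace(-1, reference_year)

# Any age below 1 is then stored as -1
age = age.apply(lambda x: -1 if x < 1 else x)

print(age.tolist())
```

Note that a company founded in the reference year ends up with the same -1 marker as one whose founding year is unknown.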

In [13]:
# Add a 'State' column from the last comma-separated part of 'Location'
df['State'] = df['Location'].str.split(',').str[-1].str.strip()
# Map full names to their abbreviations
df['State'] = df['State'].replace({'United States': 'US', 'New Jersey': 'NJ', 'California': 'CA',
                                   'Texas': 'TX', 'Utah': 'UT'})
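Splitting on a comma leaves a leading space on the last piece, so stripping matters before the values are compared or grouped. A minimal sketch, using locations in the formats that appear in the outputs of this notebook plus a bare state name:

```python
import pandas as pd

locations = pd.Series(["New York, NY", "Chantilly, VA", "Remote", "New Jersey"])

# The last comma-separated piece keeps the space that followed the comma
raw_state = locations.str.split(",").str[-1]
print(raw_state.tolist())

# Strip the whitespace, then map full names to abbreviations
state = raw_state.str.strip().replace({"New Jersey": "NJ"})
print(state.tolist())
```

Values without a comma, such as "Remote", pass through unchanged and end up in the `State` column as they are.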

In [14]:
# Drop Unnecessary Columns
df = df.drop(['index', 'Founded'], axis=1)

### Handling Missing Values



In [15]:
# Check the number of missing values in each column
df.isna().sum()

Job Title            0
Salary Estimate      0
Job Description      0
Rating               0
Company Name         0
Location             0
Headquarters         0
Size                 0
Type of ownership    0
Industry             0
Sector               0
Revenue              0
Competitors          0
Min Salary           0
Max Salary           0
Average Salary       0
Python               0
SQL                  0
Machine learning     0
BIG DATA             0
Hadoop               0
Spark                0
AWS                  0
Tableau              0
Power BI             0
Excel                0
Company Age          0
State                0
dtype: int64

In [16]:
# Check the total number of missing values in the entire DataFrame
df.isnull().sum().sum()

0
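A count of zero here does not mean the data is complete: several columns use `-1` as a placeholder (for example `Competitors` and `Headquarters` in the outputs of this notebook), and `isna()` treats it as an ordinary value. The sketch below counts the placeholder on a small hypothetical frame, assuming it is stored as the string `'-1'` in text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with '-1' placeholders in text columns
sample = pd.DataFrame({
    "Headquarters": ["New York, NY", "-1", "Boston, MA"],
    "Competitors": ["-1", "-1", "Commerce Signals, Cardlytics, Yodlee"],
})

# isna() finds nothing because '-1' is a regular string
print(sample.isna().sum().sum())

# Count the placeholder per column
sentinel_counts = (sample == "-1").sum()
print(sentinel_counts)

# Optionally convert the placeholder to NaN so pandas treats it as missing
as_missing = sample.replace("-1", np.nan)
missing_counts = as_missing.isna().sum()
print(missing_counts)
```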

### Duplicate Removal


In [17]:
# Check and print the number of duplicated rows in the DataFrame
df.duplicated().sum()

13

In [18]:
# Display rows where duplicate records are found
df[df.duplicated(keep=False)]

Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Type of ownership,Industry,...,Machine learning,BIG DATA,Hadoop,Spark,AWS,Tableau,Power BI,Excel,Company Age,State
131,Senior Data Engineer,90K - 109K,Lendio is looking to fill a position for a Sen...,4.9,Lendio,"Lehi, UT","Lehi, UT",201 to 500 employees,Company - Private,Lending,...,0,1,1,1,1,0,0,0,12,UT
134,Machine Learning Engineer,90K - 109K,Role Description Triplebyte screens and evalua...,3.2,Triplebyte,Remote,"San Francisco, CA",51 to 200 employees,Company - Private,Computer Hardware & Software,...,1,0,0,0,0,0,0,0,8,Remote
135,Machine Learning Engineer,90K - 109K,Role Description Triplebyte screens and evalua...,3.2,Triplebyte,Remote,"San Francisco, CA",51 to 200 employees,Company - Private,Computer Hardware & Software,...,1,0,0,0,0,0,0,0,8,Remote
136,Senior Data Engineer,90K - 109K,Lendio is looking to fill a position for a Sen...,4.9,Lendio,"Lehi, UT","Lehi, UT",201 to 500 employees,Company - Private,Lending,...,0,1,1,1,1,0,0,0,12,UT
357,Data Scientist,122K - 146K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",-1,-1,-1,-1,...,1,0,0,1,1,0,0,0,-1,CA
358,Data Scientist,122K - 146K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",-1,-1,-1,-1,...,1,0,0,1,1,0,0,0,-1,CA
359,Data Scientist,122K - 146K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",-1,-1,-1,-1,...,1,0,0,1,1,0,0,0,-1,CA
360,Data Scientist,122K - 146K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",-1,-1,-1,-1,...,1,0,0,1,1,0,0,0,-1,CA
361,Data Scientist,122K - 146K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",-1,-1,-1,-1,...,1,0,0,1,1,0,0,0,-1,CA
362,Data Scientist,122K - 146K,Job Overview: The Data Scientist is a key memb...,0.0,Hatch Data Inc,"San Francisco, CA",-1,-1,-1,-1,...,1,0,0,1,1,0,0,0,-1,CA


In [19]:
# Drop duplicate rows from the DataFrame
df = df.drop_duplicates()

In [20]:
# Re-check the number of duplicated rows after dropping
df.duplicated().sum()

0
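The two `duplicated` calls above count different things: the default marks only the extra copies of a row, while `keep=False` marks every member of a duplicate group. A small sketch on a hypothetical frame makes the difference concrete.

```python
import pandas as pd

# Hypothetical frame: one row repeated three times, one unique row
rows = pd.DataFrame({
    "Job Title": ["Data Scientist"] * 3 + ["Senior Data Engineer"],
    "Company Name": ["Hatch Data Inc"] * 3 + ["Lendio"],
})

# Default keep='first': the first occurrence is not flagged, only the later copies
extra_copies = int(rows.duplicated().sum())

# keep=False: every row that has a duplicate is flagged
all_members = int(rows.duplicated(keep=False).sum())

# drop_duplicates keeps the first occurrence of each group
deduped = rows.drop_duplicates().reset_index(drop=True)

print(extra_copies, all_members, len(deduped))
```

This is why the number of rows displayed with `keep=False` can exceed the count returned by `df.duplicated().sum()`.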

### Column Reordering

In [21]:
# Reorder the columns so related fields sit together: salary, company details, then skill flags
df = df[['Job Title', 'Salary Estimate', 'Min Salary', 'Max Salary', 'Average Salary', 'Job Description', 'Rating',
       'Company Name', 'Location', 'State', 'Headquarters', 'Size', 'Company Age',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors', 'Python', 'SQL',
       'Machine learning', 'BIG DATA', 'Hadoop', 'Spark', 'AWS', 'Tableau',
       'Power BI', 'Excel']]

In [22]:
# Display the cleaned DataFrame
df.head()

Unnamed: 0,Job Title,Salary Estimate,Min Salary,Max Salary,Average Salary,Job Description,Rating,Company Name,Location,State,...,Python,SQL,Machine learning,BIG DATA,Hadoop,Spark,AWS,Tableau,Power BI,Excel
0,Sr Data Scientist,137K - 171K,137,171,154.0,Description The Senior Data Scientist is resp...,3.1,Healthfirst,"New York, NY",NY,...,0,0,1,0,0,0,1,0,0,0
1,Data Scientist,137K - 171K,137,171,154.0,"Secure our Nation, Ignite your Future Join th...",4.2,ManTech,"Chantilly, VA",VA,...,0,1,1,1,1,0,0,0,0,0
2,Data Scientist,137K - 171K,137,171,154.0,Overview Analysis Group is one of the larges...,3.8,Analysis Group,"Boston, MA",MA,...,1,0,1,0,0,0,1,0,0,1
3,Data Scientist,137K - 171K,137,171,154.0,JOB DESCRIPTION: Do you have a passion for Da...,3.5,INFICON,"Newton, MA",MA,...,1,1,1,0,0,0,1,0,0,1
4,Data Scientist,137K - 171K,137,171,154.0,Data Scientist Affinity Solutions / Marketing ...,2.9,Affinity Solutions,"New York, NY",NY,...,1,1,1,0,0,0,0,0,0,1


## Save Cleaned Data


In [23]:
df.to_csv('Cleaned_DS_jobs.csv', index=False)


## Conclusion
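The steps above clean and transform the Data Science job postings dataset: string fields are tidied, salary bounds are extracted, skill flags, company age, and state are added, and duplicate rows are removed. The result is saved to `Cleaned_DS_jobs.csv`, ready for further analysis.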