# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised



# **Project Summary -**

The Glassdoor Project aims to analyze employee reviews, salary data, and company ratings collected from the Glassdoor platform to uncover insights about workplace culture, job satisfaction, salary trends, and key drivers of employee retention. The primary objective of this project is to assist job seekers, HR professionals, and companies in making informed decisions using data-driven analysis.

The project involves gathering structured and unstructured data from Glassdoor, including job titles, salary estimates, employee reviews, company ratings, and pros/cons mentioned in reviews. Using web scraping tools, a comprehensive dataset was created for further analysis.

For the quantitative part, data cleaning and preprocessing were performed, followed by statistical analysis and data visualization. Machine learning techniques were applied to predict overall company ratings based on review content and salary satisfaction. Feature analysis helped interpret which keywords or sentiments most strongly impacted review scores.

Key findings included correlations between job roles and satisfaction levels, insights into salary fairness by role and region, and identification of frequently mentioned positive and negative themes such as “good work-life balance” or “poor management.”

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**
1. How does salary vary by job position?
2. What is the impact of company size on salary levels?
3. How do salaries differ by location?
4. Build a predictive model to predict salary based on job attributes.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

drive.mount('/content/drive')


### Dataset Loading

In [None]:
# Load Dataset
file_path = '/content/drive/MyDrive/glassdoor_jobs.csv'

df = pd.read_csv(file_path)


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
row,col = df.shape
print(f"Number of rows: {row}")
print(f"Number of columns: {col}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
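# Dataset Duplicate Value Count
# Fully duplicated rows, then repeated values per column (rows minus distinct values)
duplicate_rows = df.duplicated().sum()
repeated_per_column = df.shape[0] - df.nunique()
print(f"Duplicate rows count: {duplicate_rows}")
print(f"Repeated values per column:\n{repeated_per_column}")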

### What did you know about your dataset?


The dataset has 13 attributes in total, including job title, company name, location, and revenue. The target feature is the salary estimate, which lets us observe how different factors affect an employee's salary level.
None of the attributes have null or missing values.
On inspection, almost every attribute contains repeated values.
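
Repeated values within a column are not the same thing as duplicate rows. The following is a minimal sketch of the difference on a toy frame; the values are made up for illustration and are not taken from the Glassdoor file.

```python
import pandas as pd

# Toy data (hypothetical values, not from glassdoor_jobs.csv)
toy = pd.DataFrame({
    "Job Title": ["Data Scientist", "Data Scientist", "Data Engineer", "Data Scientist"],
    "Location": ["New York, NY", "New York, NY", "Austin, TX", "Austin, TX"],
})

# Fully duplicated rows: every column matches an earlier row
duplicate_rows = int(toy.duplicated().sum())

# Repeated values per column: row count minus number of distinct values
repeated_per_column = (len(toy) - toy.nunique()).to_dict()

print(duplicate_rows)
print(repeated_per_column)
```

Here only one row is a full duplicate, even though each column on its own repeats values twice.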

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns.tolist()

In [None]:
# Dataset Describe
df.describe()

### Variables Description

The target variable in this dataset is 'Salary Estimate'. Features such as job description and company rating let us study how salary levels vary across different job attributes.
Most of the attributes in the dataset are categorical.
'Rating' and 'Founded' are the two numerical attributes.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
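# Write your code to make your dataset analysis ready.
# Keep rows that have a salary estimate; copy to avoid SettingWithCopyWarning
df = df[df['Salary Estimate'].notnull()].copy()
# Lowercase first so the case-sensitive patterns below match, then strip symbols and labels
df['Salary Estimate Cleaned'] = (
    df['Salary Estimate']
    .str.lower()
    .str.replace(r'\$|k|per hour|employer provided salary:|\\n', '', regex=True)
    .str.strip()
)
# Extract the lower and upper bounds of a range such as "53-91" or "53 to 91"
df[['min_salary', 'max_salary']] = df['Salary Estimate Cleaned'].str.extract(r'(\d+)\s*(?:-|to)\s*(\d+)')
# Estimates are quoted in thousands of dollars
df['min_salary'] = df['min_salary'].astype(float) * 1000
df['max_salary'] = df['max_salary'].astype(float) * 1000
# Use the midpoint of the range as the numeric salary
df['salary'] = (df['min_salary'] + df['max_salary']) / 2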

# Group by Job Title and calculate average salary
salary_by_title = df.groupby('Job Title')['salary'].mean().sort_values(ascending=False).head(10)

# encoding categorical variables for plotting correlation matrix
categorical_cols = df.select_dtypes(include='object').columns.tolist()
df_encoded = df[categorical_cols].copy()
# Apply LabelEncoder to each column
le = LabelEncoder()
for col in categorical_cols:
    df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))


### What all manipulations have you done and insights you found?

I manipulated several features in the dataset, most notably the salary attribute. I converted it from categorical to numerical by taking the mean of the lower and upper bounds of each salary range.
I also label-encoded the categorical variables so that machine learning models can work with them effectively.
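
The range-to-midpoint conversion can be sketched on a single string as follows. This is an illustration only: `salary_midpoint` is a hypothetical helper, and the sample strings assume estimates are written like `$53K-$91K (Glassdoor est.)`, which may not match every row in the file.

```python
import re
from typing import Optional

# Hypothetical helper, shown only to illustrate the range-to-midpoint idea
RANGE_PATTERN = re.compile(r"(\d+)\s*k?\s*(?:-|to)\s*\$?\s*(\d+)")

def salary_midpoint(estimate: str) -> Optional[float]:
    """Return the midpoint in dollars of a range quoted in thousands, or None."""
    match = RANGE_PATTERN.search(estimate.lower())
    if match is None:
        return None  # no recognizable "low-high" range in the text
    low, high = (float(group) * 1000 for group in match.groups())
    return (low + high) / 2

print(salary_midpoint("$53K-$91K (Glassdoor est.)"))  # assumed format
print(salary_midpoint("-1"))  # a value with no range yields None
```

Returning `None` for strings without a range makes unparseable rows easy to spot and drop before modelling.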

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(8,8))
plt.scatter(df['Rating'],df['salary'],alpha = 0.6)
plt.xlabel('Rating')
plt.ylabel('Salary Estimate')
plt.grid(True)
plt.show()



##### 1. Why did you pick the specific chart?

A scatter plot shows where the data points cluster across company rating and salary.

##### 2. What is/are the insight(s) found from the chart?

From the chart we can infer that most data points belong to companies with higher ratings, and that companies rated between 2 and 5 offer salaries ranging from about $10K to $150K.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(12,6))
salary_by_title.plot(kind='bar', color='skyblue')
plt.title('Top 10 Job Titles by Average Salary')
plt.xlabel('Job Title')
plt.ylabel('Average Salary')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.grid(axis='y')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart makes it easy to compare average salaries across job titles.

##### 2. What is/are the insight(s) found from the chart?

We can see that the Data Scientist job title offers a higher salary than any other job title.

#### Chart - 3 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
corr_matrix = df_encoded.corr()
plt.figure(figsize=(12,10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', square=True)
plt.title('Correlation Heatmap of Categorical Attributes (Label Encoded)')
plt.tight_layout()
plt.show()


## ***5. Feature Engineering & Data Pre-processing***

### 1. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
sns.boxplot(x=df['salary'])
plt.title("Boxplot of Average Salary")
plt.show()

In [None]:
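from scipy.stats import zscore

# nan_policy='omit' keeps rows with a missing salary from turning every z-score into NaN
df['z_score_salary'] = zscore(df['salary'], nan_policy='omit')
# Flag salaries more than 3 standard deviations from the mean
outliers_z = df[(df['z_score_salary'].abs() > 3)]
print(f"Z-score outliers: {outliers_z.shape[0]}")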


##### What all outlier treatment techniques have you used and why did you use those techniques?

I used univariate analysis (a boxplot) and z-score analysis to detect outliers in the dataset. The outliers were only identified; no treatment was applied to them.
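
As a rough illustration of the z-score rule, and of one possible treatment (capping), here is a minimal sketch in plain Python. The salaries are made-up values, and capping at three standard deviations is an assumption for illustration, not something applied in this notebook.

```python
from statistics import mean, pstdev

# Toy salaries in dollars (made-up values for illustration)
salaries = [60_000, 65_000, 70_000, 75_000, 80_000] * 3 + [400_000]

mu = mean(salaries)
sigma = pstdev(salaries)  # population std, matching scipy.stats.zscore's default ddof=0

# Flag values whose z-score exceeds 3 in absolute value
z_scores = [(s - mu) / sigma for s in salaries]
outliers = [s for s, z in zip(salaries, z_scores) if abs(z) > 3]

# One possible treatment: cap values at mean +/- 3 standard deviations
lower, upper = mu - 3 * sigma, mu + 3 * sigma
capped = [min(max(s, lower), upper) for s in salaries]

print(outliers)
print(max(capped) < 400_000)
```

Capping keeps every row while limiting the pull of extreme values; dropping the flagged rows is the other common option.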

### 2. Categorical Encoding

In [None]:
# Encode your categorical columns

categorical_cols = df.select_dtypes(include='object').columns.tolist()

# Copy DataFrame to avoid changing original
df_encoded = df[categorical_cols].copy()

# Apply LabelEncoder to each column
le = LabelEncoder()
for col in categorical_cols:
    df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))


#### What all categorical encoding techniques have you used?

I have used label encoding technique.
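
For comparison, the sketch below contrasts label encoding with one-hot encoding on a toy column. The category values are hypothetical, and pandas category codes stand in for `LabelEncoder` here; both assign integers in sorted category order.

```python
import pandas as pd

# Toy column (hypothetical values for illustration)
sizes = pd.Series(["Small", "Large", "Medium", "Small"], name="Size")

# Label encoding: one integer per category, assigned in sorted order
codes = sizes.astype("category").cat.codes.tolist()

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(sizes, prefix="Size", dtype=int)

print(codes)
print(one_hot.columns.tolist())
```

Label encoding is compact but implies an ordering (here Large < Medium < Small) that a linear model may treat as meaningful, while one-hot encoding avoids that at the cost of more columns.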

### 3. Data Scaling

In [None]:
# Scaling your data
scaler = MinMaxScaler()
df['salary_scaled'] = scaler.fit_transform(df[['salary']])


##### Which method have you used to scale you data and why?

I used min-max scaling because it rescales the values to the range 0 to 1, which can make computation faster.
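
The min-max transformation is `(x - min) / (max - min)`. A minimal hand-rolled sketch with made-up salaries (`min_max_scale` is a hypothetical helper, not a scikit-learn function):

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Toy salaries in dollars (made-up values)
scaled = min_max_scale([50_000, 75_000, 100_000, 150_000])
print(scaled)
```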

### 4. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

No, I do not believe dimensionality reduction is needed here: the dataset has only 13 features, so it is not computationally expensive.

### 5. Data Splitting

In [None]:
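# Split your data to train and test. Choose Splitting ratio wisely.
# Drop the target and every column derived from it to avoid target leakage
target_derived = ['salary', 'Salary Estimate', 'Salary Estimate Cleaned', 'min_salary',
                  'max_salary', 'z_score_salary', 'salary_scaled']
X = df.drop(columns=target_derived, errors='ignore')
y = df['salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)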

##### What data splitting ratio have you used and why?

The train-to-test ratio is 7:3: 70% of the data is used to train the model and the remaining 30% is used to test it.
This ratio leaves most of the data for training while keeping a test set large enough to evaluate the model.
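
To illustrate what a 70/30 split does, here is a simplified stand-in written in plain Python. `split_rows` is a hypothetical helper; it is not how `train_test_split` is implemented, and it will not reproduce the same shuffle.

```python
import random

def split_rows(rows, test_size=0.3, seed=42):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_rows(range(10))
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, which is what keeps the test score an honest estimate on unseen data.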

### 6. Handling Imbalanced Dataset

In [None]:
df['Job Title'].value_counts(normalize=True)

##### Do you think the dataset is imbalanced? Explain Why.

The dataset is not imbalanced. The analysis shows that no class strongly dominates the others.
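
The same kind of check can be applied to the binary target built later in this notebook (salary above the median). A minimal sketch with made-up salaries, suggesting why a median split tends to produce roughly balanced classes:

```python
from collections import Counter
from statistics import median

# Toy salaries in dollars (made-up values)
salaries = [40_000, 55_000, 60_000, 72_000, 90_000, 120_000]

# Label each salary 1 if it is above the median, else 0
threshold = median(salaries)
labels = [int(s > threshold) for s in salaries]

# Share of each class, comparable to value_counts(normalize=True)
shares = {label: count / len(labels) for label, count in Counter(labels).items()}
print(shares)
```

With many salaries tied at the median the split would be less even, so it is still worth checking the shares on the real data.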

## ***7. ML Model Implementation***




### Binary Classification

In [None]:
# ML Model - 1 Implementation
median_salary = df['salary'].median()
df['high_salary'] = (df['salary'] > median_salary).astype(int)

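# Drop the target and every column derived from salary to avoid target leakage
leak_cols = ['salary', 'high_salary', 'Salary Estimate', 'Salary Estimate Cleaned',
             'min_salary', 'max_salary', 'z_score_salary', 'salary_scaled']
X = df.drop(columns=leak_cols, errors='ignore')
X = pd.get_dummies(X, drop_first=True)
y = df['high_salary']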


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the Algorithm
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict on the model
y_pred = model.predict(X_test)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


### Linear Regression

In [None]:
features = [
    'Rating', 'Size', 'Founded', 'Type of ownership',
    'Industry', 'Sector', 'Job Title', 'Location', 'Company Name'
]
target = 'salary'

# Drop rows with missing values
df_model = df[features + [target]].dropna()

X = df_model[features]
y = df_model[target]

# Identify categorical columns
categorical_cols = ['Size', 'Type of ownership', 'Industry', 'Sector', 'Job Title', 'Location', 'Company Name']

# Define preprocessor and pipeline
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore', max_categories=20), categorical_cols)
], remainder='passthrough')

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

I used a linear regression model to predict salary for new combinations of job attributes, with one-hot encoding applied to the categorical attributes.
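
The two metrics used to evaluate this model can be computed by hand as a sanity check. The sketch below uses made-up actual and predicted values; `mse_by_hand` and `r2_by_hand` are hypothetical helpers written for illustration.

```python
def mse_by_hand(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_by_hand(y_true, y_pred):
    """R^2: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values in thousands of dollars (made up)
actual = [50.0, 70.0, 90.0, 110.0]
predicted = [55.0, 65.0, 95.0, 105.0]

print(mse_by_hand(actual, predicted))
print(round(r2_by_hand(actual, predicted), 2))
```

MSE is expressed in squared salary units, so its square root is easier to interpret as a typical error, while R² is the share of salary variance the model explains.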

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

plt.figure(figsize=(10, 5))
sns.scatterplot(x=y_test, y=y_pred)
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("Actual vs Predicted Salary")
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')  # Ideal line
plt.grid(True)
plt.show()

## ***8.*** ***Future Work (Optional)***




### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib

# Save the model
joblib.dump(model, 'linear_regression_model.joblib')

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.







In [None]:
# Load the File and predict unseen data.

# Load the trained model
model = joblib.load('linear_regression_model.joblib')

new_job = {
    'Rating': 4.2,
    'Size': '501 to 1000 employees',
    'Founded': 2015,
    'Type of ownership': 'Company - Private',
    'Industry': 'Internet',
    'Sector': 'Information Technology',
    'Job Title': 'Data Scientist',
    'Location': 'San Francisco, CA',
    'Company Name': 'TechNova'
}
new_input_df = pd.DataFrame([new_job])
predicted_salary = model.predict(new_input_df)
print("Predicted Average Salary (in $):", predicted_salary[0])


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

We can conclude that private companies have higher revenues than publicly owned companies.
Companies with high ratings usually have a moderate number of employees and offer higher salaries.
Salary varies considerably by job title. Among all job titles, 'Data Scientist' has the highest salary.




### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***