# Project overview
Use the tools and techniques taught throughout the course to find insights in a dataset of your choice and to assess its quality.

Prepare a report for a senior audience, for example the Chief Data Officer or the Head of Analytics at your company.

**Requirements for the report**:
* A summary of the data, clearly showing the size of the dataset, its variables, and possible target variables.
* A well-structured data exploration plan that is logical, meaningful, and outlines the vision for analysis.
* A detailed discussion of Exploratory Data Analysis (EDA) results that are informative, actionable, and insightful.
* A clear explanation of the data cleaning and feature engineering steps, including handling missing values, encoding, and visualizations. The report should also include the output of each of these steps, with relevant screenshots.
* A dedicated section summarizing key findings and insights, effectively synthesizing EDA results in a meaningful and actionable way.
* A section that presents at least three hypotheses relevant to the dataset. 
* A thorough discussion of a significance test for at least one strong hypothesis. The results or their presentation should be truly insightful and exceed expectations, even if there are slight misinterpretations or room for feedback.
* A concluding section that includes key takeaways and next steps.

In [4]:
# Import libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import shutil
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from scipy.stats import norm
from scipy import stats
import kagglehub

In [6]:

# Download the latest version of the dataset from Kaggle
path = kagglehub.dataset_download("ayeshasiddiqa123/salary-data")
print("Path to dataset files:", path)

Path to dataset files: C:\Users\gunay\.cache\kagglehub\datasets\ayeshasiddiqa123\salary-data\versions\1


In [7]:
# Copy files to current working directory
current_dir = os.getcwd()
for file_name in os.listdir(path):
    full_file_name = os.path.join(path, file_name)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, current_dir)

# Load the copied CSV file into a DataFrame
data = pd.read_csv('Salary_Data.csv')


In [8]:
# Get some information about the dataset
print("Size of the dataset: {}".format(data.shape))
print("Number of entries:{}".format(data.shape[0])) # or len(data)
print("Column names:{}".format(data.columns.to_list()))
print("Data types:\n{} \n".format(data.dtypes))


Size of the dataset: (6704, 6)
Number of entries:6704
Column names:['Age', 'Gender', 'Education Level', 'Job Title', 'Years of Experience', 'Salary']
Data types:
Age                    float64
Gender                  object
Education Level         object
Job Title               object
Years of Experience    float64
Salary                 float64
dtype: object 



In [9]:
# data.info() prints its summary directly and returns None, so call it without print()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6704 entries, 0 to 6703
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  6702 non-null   float64
 1   Gender               6702 non-null   object 
 2   Education Level      6701 non-null   object 
 3   Job Title            6702 non-null   object 
 4   Years of Experience  6701 non-null   float64
 5   Salary               6699 non-null   float64
dtypes: float64(3), object(3)
memory usage: 314.4+ KB


In [10]:
# Some statistical information
print(data.describe())

               Age  Years of Experience         Salary
count  6702.000000          6701.000000    6699.000000
mean     33.620859             8.094687  115326.964771
std       7.614633             6.059003   52786.183911
min      21.000000             0.000000     350.000000
25%      28.000000             3.000000   70000.000000
50%      32.000000             7.000000  115000.000000
75%      38.000000            12.000000  160000.000000
max      62.000000            34.000000  250000.000000


In [11]:
# check missing values
data.isnull().sum()

Age                    2
Gender                 2
Education Level        3
Job Title              2
Years of Experience    3
Salary                 5
dtype: int64

In [12]:
# Show the rows in which Salary is missing
print(data[data['Salary'].isnull()])

       Age  Gender    Education Level            Job Title  \
172    NaN     NaN                NaN                  NaN   
260    NaN     NaN                NaN                  NaN   
3136  31.0    Male    Master's Degree  Full Stack Engineer   
5247  26.0  Female  Bachelor's Degree             Social M   
6455  36.0    Male  Bachelor's Degree       Sales Director   

      Years of Experience  Salary  
172                   NaN     NaN  
260                   NaN     NaN  
3136                  8.0     NaN  
5247                  NaN     NaN  
6455                  6.0     NaN  
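
The output above shows two rows (172 and 260) that are empty in every column and three more that lack a salary. One possible way to handle this pattern is sketched below on a small made-up frame. The values are illustrative and not taken from the dataset, and treating `Salary` as the target variable is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the missing-value pattern; all values are made up.
toy = pd.DataFrame({
    "Age": [31.0, np.nan, 26.0, 40.0],
    "Years of Experience": [8.0, np.nan, np.nan, 15.0],
    "Salary": [np.nan, np.nan, 55000.0, 120000.0],
})

# 1) Drop rows in which every column is missing.
toy = toy.dropna(how="all")

# 2) Drop rows without a salary (assumption: Salary is the target variable).
toy = toy.dropna(subset=["Salary"])

# 3) Fill the remaining numeric gaps with the column median (one option among several).
median_experience = toy["Years of Experience"].median()
toy["Years of Experience"] = toy["Years of Experience"].fillna(median_experience)

print(toy)
```

With only a handful of affected rows out of 6704, dropping them is unlikely to change the results much, but the choice between dropping and imputing should still be stated in the report.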


In [13]:
# Reveal some information about categorical columns
print(data['Gender'].value_counts())
print(data['Education Level'].value_counts())
print(data['Job Title'].value_counts())

Gender
Male      3674
Female    3014
Other       14
Name: count, dtype: int64
Education Level
Bachelor's Degree    2267
Master's Degree      1573
PhD                  1368
Bachelor's            756
High School           448
Master's              288
phD                     1
Name: count, dtype: int64
Job Title
Software Engineer                 518
Data Scientist                    453
Software Engineer Manager         376
Data Analyst                      363
Senior Project Engineer           318
                                 ... 
Junior Social Media Specialist      1
Senior Software Architect           1
Developer                           1
Social M                            1
Social Media Man                    1
Name: count, Length: 193, dtype: int64


In [53]:
pd.set_option("display.max_rows", None)  # show all rows
print(data['Job Title'].value_counts())

Job Title
Software Engineer                        518
Data Scientist                           453
Software Engineer Manager                376
Data Analyst                             363
Senior Project Engineer                  318
Product Manager                          313
Full Stack Engineer                      309
Marketing Manager                        255
Senior Software Engineer                 244
Back end Developer                       244
Front end Developer                      241
Marketing Coordinator                    158
Junior Sales Associate                   142
Financial Manager                        134
Marketing Analyst                        132
Software Developer                       125
Operations Manager                       114
Human Resources Manager                  104
Director of Marketing                     88
Web Developer                             87
Product Designer                          75
Research Director                         75


### Initial steps:
* Fix inconsistent values in the categorical variables.
* Convert the education level from a categorical to a numerical variable.
* Do some data wrangling on the job titles (for example, the "Senior" and "Junior" prefixes).
* Check for and remove duplicates.
* Handle missing values.
* Find and handle outliers.
* Filter the data where it helps the analysis.
* Aggregate with the `groupby()` method.
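
The duplicate and outlier steps from the plan could look roughly like the sketch below. It runs on a small made-up frame, and the 1.5 * IQR rule is only one common convention for flagging outliers. Whether repeated rows in the real dataset are errors or legitimately identical records is not clear from the data alone, so removing them is a judgment call.

```python
import pandas as pd

# Hypothetical toy data; titles and salaries are made up for illustration.
toy = pd.DataFrame({
    "Job Title": ["A", "A", "B", "C", "D", "E"],
    "Salary": [70000.0, 70000.0, 90000.0, 110000.0, 130000.0, 900000.0],
})

# Count and drop exact duplicate rows.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

# Flag salaries outside the 1.5 * IQR fences.
q1 = deduped["Salary"].quantile(0.25)
q3 = deduped["Salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = ~deduped["Salary"].between(lower, upper)

print("Duplicate rows:", n_duplicates)
print("Outlier rows:")
print(deduped[is_outlier])
```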

In [54]:
# Both "phD" and "PhD" occur in Education Level; normalize "phD" to "PhD"
data['Education Level'] = data['Education Level'].replace("phD", "PhD")
print(data['Education Level'].value_counts())


Education Level
Bachelor's Degree    2267
Master's Degree      1573
PhD                  1369
Bachelor's            756
High School           448
Master's              288
Name: count, dtype: int64


In [58]:
# Replace Bachelor's with Bachelor's Degree and Master's with Master's Degree
data['Education Level'] = data['Education Level'].replace("Bachelor's", "Bachelor's Degree")
data['Education Level'] = data['Education Level'].replace("Master's", "Master's Degree")
print(data['Education Level'].value_counts())

Education Level
Bachelor's Degree    3023
Master's Degree      1861
PhD                  1369
High School           448
Name: count, dtype: int64


In [55]:
# Replace "Human Resources" with "HR" in the job titles
data['Job Title'] = data['Job Title'].str.replace("Human Resources", "HR", regex=False)
pd.set_option("display.max_rows", None)  # show all rows
print(data['Job Title'].value_counts())


Job Title
Software Engineer                        518
Data Scientist                           453
Software Engineer Manager                376
Data Analyst                             363
Senior Project Engineer                  318
Product Manager                          313
Full Stack Engineer                      309
Marketing Manager                        255
Senior Software Engineer                 244
Back end Developer                       244
Front end Developer                      241
Marketing Coordinator                    158
Junior Sales Associate                   142
Financial Manager                        134
Marketing Analyst                        132
Software Developer                       125
Operations Manager                       114
HR Manager                               106
Director of Marketing                     88
Web Developer                             87
Research Director                         75
Product Designer                          75


In [56]:
# Convert all job titles to upper case so that they share the same style
data['Job Title'] = data['Job Title'].str.upper()
pd.set_option("display.max_rows", None)  # show all rows
print(data['Job Title'].value_counts())

Job Title
SOFTWARE ENGINEER                        518
DATA SCIENTIST                           453
SOFTWARE ENGINEER MANAGER                376
DATA ANALYST                             363
SENIOR PROJECT ENGINEER                  318
PRODUCT MANAGER                          313
FULL STACK ENGINEER                      309
FRONT END DEVELOPER                      272
MARKETING MANAGER                        255
SENIOR SOFTWARE ENGINEER                 244
BACK END DEVELOPER                       244
MARKETING COORDINATOR                    158
JUNIOR SALES ASSOCIATE                   142
FINANCIAL MANAGER                        134
MARKETING ANALYST                        132
SOFTWARE DEVELOPER                       125
OPERATIONS MANAGER                       114
HR MANAGER                               106
DIRECTOR OF MARKETING                     88
WEB DEVELOPER                             87
PRODUCT DESIGNER                          75
RESEARCH DIRECTOR                         75


In [57]:
# Change JUNIOUR to JUNIOR, REPRESENTATIVE to REP, and the truncated SOCIAL MEDIA MAN to SOCIAL MEDIA MANAGER
data['Job Title'] = data['Job Title'].str.replace("JUNIOUR", "JUNIOR", regex=False)
data['Job Title'] = data['Job Title'].str.replace("REPRESENTATIVE", "REP", regex=False)
# Anchor the pattern at the end of the string so that titles already ending in MANAGER
# are not turned into MANAGERAGER
data['Job Title'] = data['Job Title'].str.replace(r"SOCIAL MEDIA MAN$", "SOCIAL MEDIA MANAGER", regex=True)
pd.set_option("display.max_rows", None)  # show all rows
print(data['Job Title'].value_counts())

Job Title
SOFTWARE ENGINEER                        518
DATA SCIENTIST                           453
SOFTWARE ENGINEER MANAGER                376
DATA ANALYST                             363
SENIOR PROJECT ENGINEER                  318
PRODUCT MANAGER                          313
FULL STACK ENGINEER                      309
FRONT END DEVELOPER                      272
MARKETING MANAGER                        255
BACK END DEVELOPER                       244
SENIOR SOFTWARE ENGINEER                 244
MARKETING COORDINATOR                    158
JUNIOR SALES ASSOCIATE                   142
FINANCIAL MANAGER                        134
MARKETING ANALYST                        132
SOFTWARE DEVELOPER                       125
OPERATIONS MANAGER                       114
HR MANAGER                               106
DIRECTOR OF MARKETING                     88
WEB DEVELOPER                             87
PRODUCT DESIGNER                          75
RESEARCH DIRECTOR                         75


In [None]:
# Create a "Years of Education" column with the cumulative years of schooling.
# Assumption: high school takes 12 years, a Bachelor's degree adds 3, a Master's adds 1, and a PhD adds 4.
edu_map = {
    "High School": 12,
    "Bachelor's Degree": 15,  # 12 + 3
    "Master's Degree": 16,    # 12 + 3 + 1
    "PhD": 20,                # 12 + 3 + 1 + 4
}
data["Years of Education"] = data["Education Level"].map(edu_map)
data.head()

    Age  Gender    Education Level          Job Title  Years of Experience    Salary  Years of Education
0  32.0    Male  Bachelor's Degree  SOFTWARE ENGINEER                  5.0   90000.0                15.0
1  28.0  Female    Master's Degree       DATA ANALYST                  3.0   65000.0                16.0
2  45.0    Male                PhD     SENIOR MANAGER                 15.0  150000.0                20.0
3  36.0  Female  Bachelor's Degree    SALES ASSOCIATE                  7.0   60000.0                15.0
4  52.0    Male    Master's Degree           DIRECTOR                 20.0  200000.0                16.0


In [74]:
# Average salary by group, computed with groupby()
group_experience = data.groupby('Years of Experience')['Salary'].mean()
print("Average salary according to the years of experience:\n{}\n".format(group_experience))

group_gender = data.groupby('Gender')['Salary'].mean()
print('Average salary of people according to their gender:\n{}\n'.format(group_gender))

group_education_level = data.groupby('Education Level')['Salary'].mean()
print('Average salary of people according to their education level:\n{}\n'.format(group_education_level))

Average salary according to the years of experience:
Years of Experience
0.0      29680.233333
0.5      35000.000000
1.0      46992.846296
1.5      36279.166667
2.0      58699.457377
3.0      72944.406977
4.0      83332.090038
5.0     103111.092732
6.0     111891.146119
7.0     122108.232295
8.0     126438.138824
9.0     138021.460526
10.0    131690.322917
11.0    153060.318750
12.0    153398.064626
13.0    153002.181818
14.0    168632.363636
15.0    160664.759690
16.0    183285.438017
17.0    184053.830189
18.0    184340.412698
19.0    182430.330579
20.0    182288.721311
21.0    176734.190476
22.0    188644.531915
23.0    189573.702703
24.0    211225.421053
25.0    177803.458333
26.0    187717.285714
27.0    187922.636364
28.0    189774.812500
29.0    181437.000000
30.0    163339.833333
31.0    183027.200000
32.0    192540.800000
33.0    186400.666667
34.0    188651.000000
Name: Salary, dtype: float64

Average salary of people according to their gender:
Gender
Female    107888.998672


EDA: Look for correlations and check whether the numeric variables are normally distributed using `sns.displot()`. If a variable is clearly skewed, apply a log transformation (for example `np.log1p`).
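
As a numeric complement to the visual `displot` check, the sketch below computes a correlation, compares skewness before and after a log transformation, and runs `scipy.stats.normaltest`. The data is randomly generated for illustration and only assumed to resemble a right-skewed salary column. The column names mirror the dataset, but the numbers do not come from it.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed sample, generated for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Years of Experience": rng.uniform(0, 30, size=500)})
noise = rng.normal(0, 0.3, size=500)
toy["Salary"] = 30000 * np.exp(0.05 * toy["Years of Experience"] + noise)

# Pearson correlation between experience and salary.
corr = toy["Years of Experience"].corr(toy["Salary"])

# Skewness before and after the log transformation.
skew_raw = toy["Salary"].skew()
toy["Log Salary"] = np.log1p(toy["Salary"])
skew_log = toy["Log Salary"].skew()

# D'Agostino-Pearson normality test on the transformed column.
statistic, p_value = stats.normaltest(toy["Log Salary"])

print("Correlation:", round(corr, 3))
print("Skewness raw / log:", round(skew_raw, 3), "/", round(skew_log, 3))
print("Normality test p-value:", p_value)
```

A small p-value from `normaltest` speaks against normality. With thousands of rows even minor deviations become significant, so the histogram remains worth looking at.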

Visualization: use `plotly.express`, `sns.pairplot()`, and `sns.boxplot()`.
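
A minimal sketch of one such plot, a seaborn box plot of salary per education level, is shown below. The frame is made up for illustration, and the explicit `Agg` backend is only there so the snippet also runs without a display. Inside a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data; the salaries are made up.
toy = pd.DataFrame({
    "Education Level": ["High School", "High School", "PhD", "PhD", "PhD"],
    "Salary": [30000.0, 40000.0, 150000.0, 160000.0, 170000.0],
})

# One box per education level, showing the spread of salaries.
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(data=toy, x="Education Level", y="Salary", ax=ax)
ax.set_title("Salary by education level (toy data)")
fig.tight_layout()
```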


Feature engineering: polynomial features and feature interactions. For example, compare the correlation between salary and the other features for job titles with and without "Senior", or between years of experience and salary.
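
The ideas above could be sketched as follows. The column names `Is Senior`, `Is Junior`, `Experience Squared`, and `Experience x Education` are hypothetical names introduced for this illustration, and the rows are made up.

```python
import pandas as pd

# Hypothetical toy data in the cleaned, upper-case job title style; values are made up.
toy = pd.DataFrame({
    "Job Title": [
        "SENIOR SOFTWARE ENGINEER",
        "JUNIOR SALES ASSOCIATE",
        "DATA ANALYST",
        "SENIOR PROJECT ENGINEER",
    ],
    "Years of Experience": [10.0, 1.0, 3.0, 12.0],
    "Years of Education": [16.0, 15.0, 15.0, 20.0],
    "Salary": [150000.0, 40000.0, 65000.0, 170000.0],
})

# Seniority flags derived from the job title; na=False treats missing titles as "no match".
toy["Is Senior"] = toy["Job Title"].str.contains("SENIOR", regex=False, na=False).astype(int)
toy["Is Junior"] = toy["Job Title"].str.contains("JUNIOR", regex=False, na=False).astype(int)

# A polynomial feature and an interaction feature.
toy["Experience Squared"] = toy["Years of Experience"] ** 2
toy["Experience x Education"] = toy["Years of Experience"] * toy["Years of Education"]

# Compare mean salary with and without SENIOR in the title.
print(toy.groupby("Is Senior")["Salary"].mean())
```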

Hypothesis testing: state the null and alternative hypotheses.
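
One possible shape for such a test is sketched below: a two-sided Welch t-test on the mean salary of two groups. The null hypothesis is that both groups have the same mean salary, and the alternative is that the means differ. The samples are randomly generated placeholders rather than dataset values, and both the 0.05 significance level and the choice of Welch's test are assumptions made for this example.

```python
import numpy as np
from scipy import stats

# Hypothetical salary samples for two groups, generated for illustration only.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=120000, scale=50000, size=300)
group_b = rng.normal(loc=108000, scale=50000, size=300)

# H0: the two groups have the same mean salary.
# H1: the mean salaries differ (two-sided).
alpha = 0.05  # assumed significance level

# Welch's t-test does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
reject_null = bool(p_value < alpha)

print("t statistic:", t_stat)
print("p-value:", p_value)
print("Reject H0 at alpha = 0.05:", reject_null)
```

On the real data, the two samples would be the `Salary` values of the groups being compared, with missing values removed first.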